Sun StorageTek 2540 and ESX troubleshooting

We recently experienced a few issues with the StorageTek 2540 array that forms the core of our SAN. The symptom was that the array flagged itself as being in a degraded state and that one or more volumes were not assigned to their preferred controller.

The first step was to upgrade the SAN firmware and Common Array Manager (CAM) software to the latest release. Despite this, we observed the problem again. Further digging into the problem revealed that the failover was happening when we performed a LUN rescan under VMware ESX.
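
For reference, the rescan itself can be triggered from the ESX service console as well as from the VI Client. A minimal example, assuming ESX 3.x and using vmhba1 to stand in for whichever FC adapter is relevant on a given host:

# Rescan the adapter for new LUNs - this is the operation that was provoking the failover
esxcfg-rescan vmhba1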

My previous understanding was that there were essentially two types of arrays: active/active and active/passive. In the active/active configuration, both controllers in an array can service I/O requests to a specific volume concurrently. In an active/passive configuration, one [active] controller handles the I/O with the second [passive] controller sitting idle, only servicing I/O if the active controller fails.

I understood the StorageTek 2540 to be an active/passive array; it is only possible to assign a volume to one controller at any time. However, in order to improve the throughput of the array, different volumes can be assigned to different controllers. For example, a volume “VOL1” might be assigned to controller A as its active controller and controller B as its passive controller, while volume “VOL2” might be assigned to controller B as its active controller and controller A as its passive controller.

It turns out that things are more subtle than this; there is a third type of array configuration: asymmetric.

The asymmetric configuration follows the active/passive model in that only one controller services I/O for a specific volume at any time, but extends it by allowing I/O operations to be received by the second controller. If this happens, the array will automatically fail the volume over to the second controller to service the request. This process is called Automatic Volume Transfer (AVT). If the first controller then receives I/O operations, AVT moves the volume back.

Yes, this could cause some flapping between controllers. It can also cause I/O stalls as the controllers fail across.

Some of the array initiator types (such as Solaris with Traffic Manager, aka MPxIO) disable AVT; others, including the Linux initiator that we’ve used on our VMware hosts, have AVT enabled.
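
As a quick sanity check, the same CAM service command used in the fix below can dump the NVSRAM host-type region, so the AVT-related bytes for a given host type can be inspected before anything is changed. This is only a sketch: it assumes the array has already been registered in CAM, that the command is run from the CAM bin directory, and "array-name" is a placeholder for the registered array name (host index 0x06 is the one the fix below modifies).

rem Dump NVSRAM region 0xf2 for host type 0x06 to see its current settings
service -d array-name -c read -q nvsram region=0xf2 host=0x06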

So the problem we’re having appears to be caused by the array failing a volume over to its second controller. But why is it doing this? The only configuration I had performed on the ESX side was to ensure the multi-pathing option was set to Most Recently Used (MRU), the correct setting for active/passive arrays. What appears to have happened is that, when booting, the ESX servers are not mapping to a consistent path. Out of our five ESX servers, three were setting one controller as active, while the other two were setting the second controller as active. Presumably, when one of the hosts with the wrong active path performs a scan, the request is sent to the failover controller, which invokes AVT and fails the volume over.
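
To see which path each host has actually marked as active, and to confirm that the policy really is MRU, the path state can be listed from each host's service console. Again only a sketch, assuming ESX 3.x; the output layout varies slightly between builds:

# List all LUN paths, the path selection policy and which path is currently active on this host
esxcfg-mpath -l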

How to fix?

 There are two methods to change the disk array NVSRAM settings. One is to use the SANtricity Storage Manager, the other is to use the Common Array Manager (CAM). Both require the array controllers to be rebooted to make the new settings active.

The following is the script for CAM (Windows version only).  
  1. On the CAM server, change directories to the CAM directory. The default location is Program Files\SUN\Common Array Manager\Component\fms\bin.
  2. Ensure the disk array for which the NVSRAM settings need to be updated is already added to the CAM configuration.
  3. Create a .bat file. The script takes the CAM array name as its only argument and is run as <filename>.bat <CAM arrayname> (see the example after the script below).
  4. Copy the following lines into the newly created .bat file:
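rem Dump the current contents of NVSRAM region 0xf2 for host type 0x06 before changing anything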
call service -d %1 -c read -q nvsram region=0xf2 host=0x06
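rem Set the bytes at offsets 0x12, 0x13, 0x24 and 0x25 - the AVT-related settings for this host type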
call service -d %1 -c set -q nvsram region=0xf2 offset=0x12 value=0x01 host=0x06
call service -d %1 -c set -q nvsram region=0xf2 offset=0x13 value=0x00 host=0x06
call service -d %1 -c set -q nvsram region=0xf2 offset=0x24 value=0x00 host=0x06
call service -d %1 -c set -q nvsram region=0xf2 offset=0x25 value=0x00 host=0x06
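rem Read the region again so the before and after hex dumps can be compared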
call service -d %1 -c read -q nvsram region=0xf2 host=0x06
cls
echo "The %1 controllers will need to be rebooted for changes to take effect!"

Check the output between the two hex dumps at offsets 0x12, 0x13, 0x24 and 0x25 for any changes. Not all bytes change for each array type.
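
Assuming the script was saved as disable_avt.bat and the array was registered in CAM under the name san01 (both names are just examples), it would be run from the CAM bin directory as follows; %1 in the script is replaced by the array name:

disable_avt.bat san01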

The following is the procedure from within a Solaris OS
