Thursday, April 1, 2010

ESX Whitebox & RAID controller failures - an epic struggle

The past few days have been a bit tense. Not only was there a deadline at work (an interesting study at one of our customers that had to be finished before the end of March 2010), but yesterday my ESX whitebox also decided to die on me. Of course, I took my screwdriver and box of recovery CDs and went to work... A reconstruction of the epic struggle to get everything back to work:
  • March 31, 8:00 AM. The (old and faithful 100 Mbps) 3Com switch that my PCs are currently connected to -- after moving house I have been too lazy to install CAT6 cabling, so I don't live between UTP cables; the wife loves it -- had crashed and was showing a blinking "Alert!" light. After disconnecting the power, the switch came back up again.

  • March 31, 8:05 AM. No internet connectivity; road works again, like the day before? Nope, it turns out my ESX box, which runs a virtual m0n0wall router, has completely frozen and can only be brought back with a hard reset.

  • March 31, 8:10 AM. Thirdly, I discover that my Dell Perc 5i controller now freezes the computer after the power has been cycled. Interesting. Trying to enter the Perc 5i BIOS for configuration also freezes the computer. Fear kicks in.

    About a year ago I already burned out a Perc 5i controller (complete with sizzling, smoke and fireworks), and I decided to buy a second-hand controller from eBay again. That replacement never fully worked the way I wanted (for example, after resetting the computer, the controller is no longer recognized -- in fact, it is only recognized after a power cycle; strange!). A bit pissed off, I blame myself for trusting a half-working controller with all my data (family pictures, personal documents, ...). I'm already fearing that I will have to buy a replacement controller and restore all my data from Amazon S3 & JungleDisk (which I subscribed to after the previous controller went up in smoke)... weeks of downtime.

  • March 31, 8:30 AM. I remember that shortly after I got the Perc 5i controller, I got a few warnings about ECC errors being detected in the DIMM that provides the read/write cache. I decide to replace the DIMM, as a BIOS that suddenly starts crashing seems a bit unreal. Unfortunately, to no avail.

  • March 31, 8:45 AM. After some fiddling around with the controller, I notice the Perc 5i BIOS is accessible without any drives connected. Puzzling, but after a factory reset of the card (erasing the FlashROM) and a "foreign array import" of my two RAID arrays, the disks are discovered again and the computer tries to boot up. All this is followed by a little dance of happiness around the computer, thanking the computer gods for resurrecting the RAID arrays.

  • March 31, 8:55 AM. Immediately after the import, all volumes report suspicious RAID consistency, and an automated consistency check and background initialization are started. My just-recovered peace of mind is disturbed, and fear of data corruption kicks in. Anyway, the only thing to do is wait several hours for the consistency checks to complete, so I just boot into ESX.

  • March 31, 8:57 AM. ESX now freezes somewhere halfway through the boot. It turns out I am running an unpatched vSphere 4.0, which still ships an older megaraid_sas driver. I remember issues were reported with this driver, and this is confirmed when inspecting the vmkernel logs: they reveal that the megasas driver is receiving tons of AEN events (Asynchronous Event Notifications):

    esx01 vmkernel: 0:03:28:31.377 cpu3:4193)<6>megasas_hotplug_work[6]: event code 0x006e
    esx01 vmkernel: 0:03:28:31.387 cpu3:4193)<6>megasas_hotplug_work[6]: aen registered
    esx01 vmkernel: 0:03:28:31.518 cpu1:4485)<6>megasas_service_aen[6]: aen received
    esx01 vmkernel: 0:03:28:31.518 cpu0:4196)<6>megasas_hotplug_work[6]: event code 0x006e
    esx01 vmkernel: 0:03:28:31.528 cpu0:4196)<6>megasas_hotplug_work[6]: aen registered
    esx01 vmkernel: 0:03:29:51.334 cpu3:4251)<6>megasas_service_aen[6]: aen received
    esx01 vmkernel: 0:03:29:51.334 cpu2:4205)<6>megasas_hotplug_work[6]: event code 0x0071
    esx01 vmkernel: 0:03:29:51.349 cpu2:4205)<6>megasas_hotplug_work[6]: aen registered
    esx01 vmkernel: 0:03:29:54.318 cpu3:4246)<6>megasas_service_aen[6]: aen received
    esx01 vmkernel: 0:03:29:54.318 cpu0:4207)<6>megasas_hotplug_work[6]: event code 0x0071
    esx01 vmkernel: 0:03:29:54.334 cpu0:4207)<6>megasas_hotplug_work[6]: aen registered
    esx01 vmkernel: 0:03:29:57.405 cpu3:4246)<6>megasas_service_aen[6]: aen received
    esx01 vmkernel: 0:03:29:57.405 cpu2:4193)<6>megasas_hotplug_work[6]: event code 0x0071
    esx01 vmkernel: 0:03:29:57.421 cpu2:4193)<6>megasas_hotplug_work[6]: aen registered
    For an unknown reason, the ESX server is unable to cope with the massive number of events received and slows down dreadfully (in retrospect, I noticed it did not actually crash).
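
    A quick grep from the Service Console makes the flood obvious. A minimal sketch, assuming the classic ESX 4.0 log location /var/log/vmkernel:

      # count megasas AEN-related lines in the vmkernel log
      grep -c 'megasas_service_aen' /var/log/vmkernel

      # or watch the events arrive in real time
      tail -f /var/log/vmkernel | grep -i megasas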

    I decide to boot back into the Perc 5i BIOS and let the consistency check finish. Again everything freezes before I can enter the BIOS, so I need to disconnect all drives once more, perform a factory reset and re-import my RAID arrays. I let the consistency checks start and hurry off to work.
  • March 31, 9:00 PM. The consistency checks have finished, but now ESX refuses to boot up, no longer finds the Service Console VMDK, and reports:

    VSD mount/bin/sh: can't access tty; job control turned off

    Interesting. I discover a VMware KB article that describes this behavior: it explains that LUNs can sometimes be detected as snapshots when changes are made at the storage array. I conclude that my consistency checks and foreign array import might have messed up the identifiers, so that ESX can no longer find the Service Console VMDK and goes berserk. After following the steps in the KB (basically resignaturing all VMFS volumes), everything works again. Afterwards, I discover that I had switched the two cables connecting my two RAID arrays (cable 1 got attached to port 2 and vice versa). Doh!!!
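
    For reference, a rough sketch of what the resignaturing boils down to on ESX 4.0 from the Service Console (the datastore label below is just an example):

      # list VMFS volumes that are detected as snapshots/replicas
      esxcfg-volume -l

      # resignature a volume that shows up in that list
      esxcfg-volume -r datastore1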

  • March 31, 9:30 PM. Time to install ESX 4.0 Update 1a; yet again another issue: not enough disk space to install the patches! After cleaning up /var/cache/esxupdate, sufficient disk space is available.
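
    Roughly what that cleanup and the subsequent patching looked like (the bundle filename is just a placeholder):

      # see what is eating the space and clear the esxupdate cache
      du -sh /var/cache/esxupdate
      rm -rf /var/cache/esxupdate/*

      # then apply the offline patch bundle
      esxupdate --bundle=ESX400-Update01a.zip update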

  • March 31, 10:00 PM. After booting everything back up, I again notice very bad ESX performance, and my suspicion is confirmed when I see the same megaraid_sas AEN events in the vmkernel logs. Strangely enough, the problem only occurs when I access my fileserver virtual machine, which is the only virtual machine that runs on the second of the two RAID arrays... hmmm.

  • April 1, 1:00 PM. Some time for further analysis. I start a virtual machine running on my first RAID array and see that no AEN events are logged in the vmkernel log. Then I decide to attach the VMDKs of my fileserver, all hosted on the second RAID array, one by one. The first VMDK is hot-added to a Windows 2008 virtual machine just fine, and I can see the data is still intact. Big relief! But indeed, when adding the second and third VMDKs, the AEN events flood the vmkernel logs again.

    By now I am becoming more and more convinced that the Perc 5i controller is not the cause of the issues, but rather one or more disks in the second RAID array.
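
    As an extra sanity check on the virtual disks themselves, vmkfstools can verify a VMDK's consistency from the Service Console (the path below is just an example, not my actual datastore layout):

      # check a virtual disk for internal consistency
      vmkfstools -x check /vmfs/volumes/datastore2/fileserver/fileserver_1.vmdk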

  • April 1, 2:00 PM. I decide I want to have a look at the Perc 5i controller logs, to see if any errors are logged at the HBA level. Since the Perc 5i uses an LSI Logic chip, I use the procedure I blogged about a while back to install the MegaCLI tool again.

    At this point, I discover that it is no longer possible to use the LSI MegaCLI tools under vSphere. I guess VMware finally decided that the Service Console has to run as a virtual machine, and the Perc 5i card is no longer exposed inside the Service Console. LSI MegaCLI therefore reports that no compatible controllers are present. Bummer! Apparently some people report on the VMware Community forums that LSI MSM (the remote management server?) works with limited functionality, but I decide not to try to install it. For reference, the queries I would have run are sketched below.
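
    A sketch of the kind of MegaCLI queries this would have been (the exact binary name and its install path may differ per setup):

      # dump the controller's event log to a file
      MegaCli -AdpEventLog -GetEvents -f perc_events.log -aALL

      # list physical drives with their error counters and firmware state
      MegaCli -PDList -aALL | grep -E 'Slot|Error Count|Firmware state'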

  • April 1, 5:00 PM. Time to think of an alternative way to discover what is wrong with the second RAID array. It is a RAID5 array of four Seagate 1 TB disks (yes, the ST31000340AS series that had the firmware issues), and my suspicion is now that a single disk has failed, but that the failure is either not picked up by the Perc 5i controller or not reported by the disk firmware. That is particularly bad, because I don't want to pull the wrong disk out of a RAID5 array that already has a failed member -- that would obviously cause total data loss, which would be very, very, very, VERY depressing after all the happiness about still having my data ;).

    Time to pull out the Seagate self-tests, and indeed, testing each drive individually revealed that one of the drives had failed.
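
    For drives attached directly to a machine (i.e. not behind the Perc), the same kind of per-drive self-test can also be run with smartctl; /dev/sdb below is just an example device:

      # start the drive's extended (long) self-test
      smartctl -t long /dev/sdb

      # once it finishes, check the self-test result and the SMART error log
      smartctl -l selftest /dev/sdb
      smartctl -H -l error /dev/sdb
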
So the conclusion is: time for another RMA! I have now had each of my four Seagate 1 TB disks fail on me. In fact, out of the eight Seagate drives I own, I have already requested seven RMAs. At times like these I remember why I coughed up a massive amount of money to get my hands on the Western Digital Caviar Black (which, AFAIK, is the last consumer disk to come with a five-year warranty).

1 comment:

Jhonny Nemonic said...

now I feel the same pain....