Tim's Technical Thoughts: 2010

Wednesday, December 22, 2010

A note on Western Digital 2001FASS drives

About a year ago, I decided to open my wallet and cough up some serious money for a good NAS solution for my home usage. With an ESX whitebox, a growing number of pictures and other digital paraphernalia that I like to (permanently) store, I decided that a standalone NAS solution would be more reliable than relying on a single (now aging) RAID controller in my ESX whitebox. After all, a NAS is "system independent" so it can be accessed from any device, as long as there is a network. A few weeks later, I ordered the Thecus N7700 NAS from eBay, together with three Western Digital Caviar Black 2 TB disks (type: WD2001FASS). In the meantime, I upgraded to 5 disks.

Neglecting some configurational complexities between ESX and the Thecus (see my previous blogpost on ESX's iSCSI implementation changes in 4.1), everything has been running very fine... until yesterday.

At 17:04 yesterday evening, I received a Gmail notification from the Thecus NAS (yes, it sends mails through Gmail) indicating one of the Caviar Black disks had failed and that my RAID5 array was now degraded. I was a bit surprised and already fearing another "Sea-gate" incident with another series of continuously failing disks (the "gate" suffix being so popular with "cablegate", I decided to introduce another one :) ).

I decided to remove the affected disk and run the Western Digital drive diagnostic tools on it (which took a dreadfully long 4 and a half hours). Sure enough, a Full Drive test revealed that there were some bad sectors on the drive but that they were successfully remapped to the spare capacity that drives reserve precisely to compensate for a few bad blocks. Still, the RAID array was degraded and the drive was reported as failed (even though the problem seems to be very easily fixable), so I decided to dive a little deeper into what happened in an attempt to discover why this is not automatically fixed by the drive when such a bad block is discovered.

What I found out seriously pissed me off. Western Digital does support a mechanism to automatically remap bad blocks to the spare capacity on the drive. However, this can take a few moments, so the question arises how the drive should communicate with the RAID controller to report that it is currently busy doing some block remapping. Western Digital has a technology which they refer to as TLER (Time Limited Error Recovery), which limits how long the drive keeps trying to recover on its own, so that it answers the RAID controller before the controller loses patience and marks the drive as failed.

Fantastic! The only problem is that this software feature is disabled in the 2001FASS drives, simply because it is considered a "consumer" drive. The even more expensive (and trust me, I had to use all my tactics to convince my wife to cough up the money for what I consider a really expensive drive) RE or "RAID edition" drives are in fact almost identical to the 2001FASS drives, with the exception that they have the TLER feature enabled.

Basically, this means that the 2001FASS drive is not suitable for RAID arrays. When a drive encounters a bad block, it will immediately be marked as failed even though this is not the case. Talk about a serious bummer! Some report that TLER is not needed for Linux (which is basically what the Thecus NAS is, a Linux box) but my experience seems to contradict this slightly.
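
To make the timing argument a bit more tangible, here is a minimal sketch of how I picture the interaction between a drive and a RAID controller. The `Drive` class, the `controller_verdict` function, the 7 second recovery limit and the 8 second controller timeout are all assumptions of mine for illustration; they are not taken from Western Digital or Thecus documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Drive:
    name: str
    # Seconds the drive may spend on internal recovery before giving up.
    # None models a drive without any recovery time limit (no TLER).
    recovery_limit_s: Optional[float]


def controller_verdict(drive: Drive, recovery_needed_s: float,
                       controller_timeout_s: float = 8.0) -> str:
    """Return what a hypothetical RAID controller concludes about the drive."""
    if drive.recovery_limit_s is None:
        busy_s = recovery_needed_s
    else:
        busy_s = min(recovery_needed_s, drive.recovery_limit_s)

    if busy_s >= controller_timeout_s:
        # The drive stayed silent for too long, so the controller gives up on it.
        return "drive dropped from array"
    if busy_s < recovery_needed_s:
        # The drive stopped early and reported the error in time.
        return "read error reported, sector rebuilt from parity"
    return "sector recovered by drive"


desktop = Drive("desktop drive", recovery_limit_s=None)   # assumed: no limit
raid = Drive("RAID edition drive", recovery_limit_s=7.0)  # assumed: 7 s limit

for drive in (desktop, raid):
    print(drive.name, "->", controller_verdict(drive, recovery_needed_s=30.0))
```

With these assumed numbers, the drive without a limit gets kicked out of the array during a long recovery, while the limited drive reports the error in time and stays in.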

For me, this is an important reason not to buy Western Digital anymore -- you need to cough up an additional bucket of money for a feature that should be enabled in any drive -- after all, all motherboards today support a basic RAID functionality! Or, if you want to upgrade at a given time from one drive to multiple drives...

Wednesday, November 24, 2010

A note on ESX 4.x and my iSCSI devices

A few weeks ago, I decided to extend my iSCSI NAS (Thecus N7700) from 3x 2TB Western Digital Caviar Black disks to 5x 2TB Western Digital Caviar Black disks.

Trouble has been my companion ever since. I have been experiencing some serious performance issues since the RAID extension, and was fearing that the different firmware versions of the new Caviar Blacks were confusing my NAS system; mixing firmwares in RAID systems does not seem to be a best practice. The symptoms were very simple: from the moment a lot of I/O was generated (think: 160 MB/s write speeds to the NAS), ESX would lose the iSCSI link to the NAS, which was choking on all that traffic with a 100% CPU usage. As you very well know, storage is ESX's Achilles heel, and very shortly after that, the vmkernel logs would be flooded with messages indicating a path failure to the NAS:

```
0:00:41:06.581 cpu1:4261)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba36:C0:T0:L3, reservation state on device t10.E4143500000000000000000040000000AE70000000000100 is unknown.
0:00:41:06.581 cpu1:4261)ScsiDeviceIO: 1672: Command 0x16 to device "t10.E4143500000000000000000040000000AE70000000000100" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
```

After a multitude of firmware up- and downgrades on the Thecus N7700 and a lot of conversation with Thecus Support (whom, by the way, I want to thank for their patience with a guy like me working in an unsupported scenario!), I stumbled across a strange error message that I had not seen before on an ESX host:

```
0:00:41:06.733 cpu0:4113)FS3: 8496: Long VMFS3 rsv time on 'NASStorage04' (held for 3604 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
```
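
Messages like this are easy to miss in a busy vmkernel log. A small script along the following lines could pull them out; it is only a sketch, the regular expression is modeled on the single line quoted above, and the `long_reservations` helper and its 1000 msecs threshold are my own assumptions.

```python
import re

# Pattern modeled on the single "Long VMFS3 rsv time" line quoted above.
RSV_RE = re.compile(
    r"Long VMFS3 rsv time on '(?P<volume>[^']+)' "
    r"\(held for (?P<msecs>\d+) msecs\)"
)


def long_reservations(lines, threshold_msecs=1000):
    """Yield (volume, msecs) for reservations held at least threshold_msecs."""
    for line in lines:
        match = RSV_RE.search(line)
        if match and int(match.group("msecs")) >= threshold_msecs:
            yield match.group("volume"), int(match.group("msecs"))


sample = [
    "0:00:41:06.733 cpu0:4113)FS3: 8496: Long VMFS3 rsv time on "
    "'NASStorage04' (held for 3604 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors",
]

for volume, msecs in long_reservations(sample):
    print(f"{volume}: reservation held for {msecs} msecs")
```

To run it against a real log, pass an open file object instead of the `sample` list.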

Some googling quickly pointed me to a few interesting threads, which talked about VMware KB 1002598 discussing performance issues on EMC Clariion systems with iSCSI. It seems that the iSCSI initiator in ESX allows for delayed ACKs, which apparently matters in situations of network congestion. Knowing that the N7700's CPU usage can sometimes peak at 100% and that this can very briefly lock up the network link on the N7700, I decided to disable the delayed ACKs, following the procedure in the VMware KB...

Great success! Performance was rock solid again, and I have no longer experienced ESX hangs ever since!

This made me think a bit, and I remember that I first noticed the performance issues a few weeks after upgrading to ESX 4.0 Update 2 -- I suppose some default setting has changed from a vanilla ESX 4.0 (which I was running earlier) to ESX 4.0 Update 2 that seems to disturb the good karma that I had going between my ESX host and N7700 NAS earlier. Let it be known to the world that the N7700 with firmwares 2.01.09, 3.00.06 and 3.05.02.2 (the ones I tried) is also subject to the iSCSI symptoms described in VMware KB 1002598.

Friday, November 5, 2010

The joy of WSUS

After a rather unpleasant electrical power spike earlier this week had made some of my hard disks go weird (crashing my ESX server with an equally unpleasant PSOD), a quick inspection revealed that no real harm was done -- except for one of the dozen RAID arrays that I have, which decided to do an automatic rebuild (no real issue). That finished after a few hours so I was able to go back to my comfortable sofa and enjoy some more quality prime time TV (lol). At least, so I thought...

A few hours later I discovered that my domain controller had not survived the ESX crash and was very unpleasantly complaining about a corrupted registry. Deciding that a bare metal (or virtual metal) Active Directory disaster recovery was not really necessary on my home network (recreating the three user accounts was less effort ;) ), I decided to reinstall my entire domain controller. About 30 minutes after that decision, I was again running a new AD domain with the users recreated and the most important servers already rejoined to the domain.

So what did I forget to configure in my enthusiasm to just reinstall the entire bunch? Certificate services, DFS namespace, DHCP server, re-ACL of file server, recreation of user profiles and also my own WSUS server (which were all happily running on my domain controller as well -- beat that SBS!).

My own WSUS server I hear you say? Well yes, with the very unpleasant (which you will have noticed already is the word of today) bandwidth limitations we have in Belgium, my ISP decides to punish me with some low-bandwidth connection after transferring more than 80 GB of data. That is quite sufficient but I prefer not spending it on downloading all my Windows updates 14 times (which is about the total number of virtual machines, physical laptops and desktops I have running on a frequent basis).

Given that my WSUS partition was about 120 GB and 98% filled, the doom scenario of seeing my entire data transfer that my ISP allows me for this month being entirely consumed by frikkin' Windows updates after reinstalling WSUS & synchronizing for the first time, slowly started to set in. An entire month of "small band" in this digital age? The horror... the horror...
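
For the record, the back-of-the-envelope arithmetic behind that fear looks roughly like this. The three input numbers come from this post; the assumption that a fresh synchronization would re-download more or less everything on the partition is mine.

```python
partition_gb = 120      # size of the WSUS partition (from the post)
fill_ratio = 0.98       # the partition was 98% full (from the post)
monthly_cap_gb = 80     # ISP transfer allowance (from the post)

# Assumption: a fresh synchronization re-downloads roughly all stored content.
content_gb = partition_gb * fill_ratio
overshoot_gb = content_gb - monthly_cap_gb

print(f"WSUS content: about {content_gb:.1f} GB")
print(f"Over the monthly cap by about {overshoot_gb:.1f} GB")
```

In other words, under that assumption a single full re-download would blow through the whole monthly allowance with room to spare.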

So I decided to spend a few megabytes of data transfer on very actively googling whether it is possible to prevent WSUS from downloading all the updates from the internet. After all, the registry corruption of the domain controller had completely borked its functionality, yet the separate partition (and separate VMDK) which was holding the WSUSContent directory was undamaged.

Most fora and blogs I found on recycling WSUSContent when performing a new installation refer to a TechNet page called "Set Up a Disconnected Network (Import and Export Updates)", which explains how the WSUSContent can be copied from one server to another -- however, they always export & import the WSUS database as well; unfortunately this database got lost when I -- again -- enthusiastically wiped the entire corrupted OS VMDK.

So I just decided to have a go and installed WSUS from scratch, and I pointed the WSUSContent directory to the partition which already contained the updates from the old server. Then I did the following:

Configured the WSUS server exactly as before (with the same products to update)
Performed the first initial synchronization (this took a long time but using the network bandwidth monitoring in the vSphere client I could clearly see that only minimal amounts of data were transferred during this synchronization -- no actual content was downloaded!)
Approved all the updates that were previously also approved.

This turns out to work quite nicely; apparently when WSUS detects that the updates are already downloaded to disk, it will recycle the existing content! Hurray for WSUS and for not torturing me with small band for an entire month!!

Thursday, April 1, 2010

ESX Whitebox & RAID controller failures - an epic struggle

The past few days have been a bit tense. Not only was there a deadline at work (an interesting study at one of our customers that had to be finished before end of March 2010), but also yesterday, my ESX whitebox decided to die on me. Of course, I took my screwdriver and box of recovery CD's and went to work. A reconstruction of the epic struggle to get everything back to work:

March 31, 8:00 AM. The (old & faithful 100 Mbps) 3Com switch that my PC's are currently connected to -- after having moved and being too lazy to install CAT6 cabling in my new house so I don't live between UTP cables, the wife loves it-- has crashed and had a blinking "Alert!" light; after disconnecting the power, the switch got back up again.
March 31, 8:05 AM. No internet connectivity; road works again, like the day before? Nope, turns out my ESX box, which runs a virtual m0no0wall router, has completely frozen and can only be brought back by a hard reset.
March 31, 8:10 AM. Thirdly, I discovered my Dell Perc 5i controller now freezes the computer after the power has been cycled. Interesting. Trying to enter the Perc 5i BIOS for configuration also freezes the computer. Fear kicks in.

About a year ago, I already burned a Perc 5i controller (including the sizzling, smoke and fireworks) and I decided to buy a second hand controller from eBay again. That replacement never fully worked as I liked it (for example, after resetting the computer, the controller is no longer recognized -- in fact it is only recognized after a power cycle; strange!). A bit pissed off, I blame myself for accepting a half-and-half working controller for hosting all my data (family pictures, personal documents, ...). I'm already fearing that I will have to buy a replacement controller & restore all my data from Amazon S3 & JungleDisk (which I subscribed to after the previous controller went up in smoke)... weeks of downtime.

March 31, 8:30 AM. I remember that shortly after I got the Perc 5i controller, I got a few warnings about ECC errors being discovered in the DIMM that provides the read/write cache. I decide to replace the DIMM, as a BIOS crashing all of a sudden seems a bit unreal. Unfortunately, to no avail.
March 31, 8:45 AM. After some fiddling around with the controller, I notice the Perc 5i BIOS is accessible without any drives connected. Puzzling, but after performing a factory reset of the card (erasing the FlashROM) and performing a "foreign array import" of my two RAID arrays, the disks are discovered again & the computer tries to boot up. All this is followed by a little dance of happiness around the computer, thanking the computer gods for resurrecting the RAID array.
March 31, 8:55 AM. Immediately after the import, all volumes seem to report suspicious RAID consistency and an automated consistency check & background initialization is started. The just recovered peace of mind is disturbed and fear of data corruption kicks in. Anyway, the only thing to do is wait several hours for the data consistency checks to complete, so I just boot into ESX.

March 31, 8:57 AM. ESX now freezes somewhere halfway through the boot. Turns out I am running an unpatched vSphere 4.0 which still has an older megaraid_sas driver. I remember issues were reported with this driver and this is confirmed when inspecting the vmkernel logs. They reveal that the megasas driver is receiving tons of AEN events (Asynchronous Event Notifications):

esx01 vmkernel: 0:03:28:31.377 cpu3:4193)<6>megasas_hotplug_work[6]: event code 0x006e
esx01 vmkernel: 0:03:28:31.387 cpu3:4193)<6>megasas_hotplug_work[6]: aen registered
esx01 vmkernel: 0:03:28:31.518 cpu1:4485)<6>megasas_service_aen[6]: aen received
esx01 vmkernel: 0:03:28:31.518 cpu0:4196)<6>megasas_hotplug_work[6]: event code 0x006e
esx01 vmkernel: 0:03:28:31.528 cpu0:4196)<6>megasas_hotplug_work[6]: aen registered
esx01 vmkernel: 0:03:29:51.334 cpu3:4251)<6>megasas_service_aen[6]: aen received
esx01 vmkernel: 0:03:29:51.334 cpu2:4205)<6>megasas_hotplug_work[6]: event code 0x0071
esx01 vmkernel: 0:03:29:51.349 cpu2:4205)<6>megasas_hotplug_work[6]: aen registered
esx01 vmkernel: 0:03:29:54.318 cpu3:4246)<6>megasas_service_aen[6]: aen received
esx01 vmkernel: 0:03:29:54.318 cpu0:4207)<6>megasas_hotplug_work[6]: event code 0x0071
esx01 vmkernel: 0:03:29:54.334 cpu0:4207)<6>megasas_hotplug_work[6]: aen registered
esx01 vmkernel: 0:03:29:57.405 cpu3:4246)<6>megasas_service_aen[6]: aen received
esx01 vmkernel: 0:03:29:57.405 cpu2:4193)<6>megasas_hotplug_work[6]: event code 0x0071
esx01 vmkernel: 0:03:29:57.421 cpu2:4193)<6>megasas_hotplug_work[6]: aen registered

For an unknown reason, the ESX server is unable to cope with the massive amount of events received and slows down dreadfully (In retrospect I noticed it did not actually crash).
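
To get a feeling for how bad such a flood is, the event codes can be tallied with a few lines of Python. This is just a sketch: the regular expression is modeled on the log lines quoted above, `count_event_codes` is a name I made up, and I make no claim here about what the individual event codes mean.

```python
import re
from collections import Counter

# Pattern modeled on the megasas_hotplug_work lines quoted above.
EVENT_RE = re.compile(r"megasas_hotplug_work\[\d+\]: event code (0x[0-9a-fA-F]+)")


def count_event_codes(lines):
    """Count how often each AEN event code appears in vmkernel log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


sample = [
    "esx01 vmkernel: 0:03:28:31.377 cpu3:4193)<6>megasas_hotplug_work[6]: event code 0x006e",
    "esx01 vmkernel: 0:03:28:31.387 cpu3:4193)<6>megasas_hotplug_work[6]: aen registered",
    "esx01 vmkernel: 0:03:29:51.334 cpu2:4205)<6>megasas_hotplug_work[6]: event code 0x0071",
    "esx01 vmkernel: 0:03:29:54.318 cpu0:4207)<6>megasas_hotplug_work[6]: event code 0x0071",
]

for code, occurrences in count_event_codes(sample).most_common():
    print(code, occurrences)
```

Feeding it the full vmkernel log instead of the `sample` list shows at a glance which event code dominates.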

I decide to boot back into the Perc 5i BIOS and let the consistency check finish. Turns out again everything freezes before I can enter the BIOS so I need to disconnect all drives again, perform a factory reset & re-import my RAID arrays. I let the consistency checks start & hurry to work.

March 31, 9:00 PM. Consistency checks have finished but now ESX refuses to boot up, no longer finding the service console VMDK & reports:
```
VSD mount/Bin/SH:cant access TTY job control turned off.
```
Interesting. I discover a VMware KB that describes this behavior, which explains that sometimes LUN's can be discovered as snapshots when changes are made at the storage array. I conclude that my consistency checks & foreign array importing might have messed up the identifiers such that now ESX can no longer find the Service Console VMDK and goes berserk. After following the steps in the KB (basically resignaturing all VMFS volumes), everything works again. Afterwards, I discover that I had switched the two cables connecting both of my RAID arrays (cable 1 got attached to port 2 and vice versa). Doh!!!
March 31, 9:30 PM. Time to install ESX 4.0 Update 1a; yet again, another issue: not enough disk space to install the patches! After cleaning up the /var/cache/esxupdate, sufficient disk space is available.
March 31, 10:00 PM. After having booted up everything, I again notice a very bad performance of ESX, and my suspicion is confirmed when I notice again the same megaraid_sas AEN events in the vmkernel logs. Strangely enough the error only occurs when I access my fileserver virtual machine, which is the only virtual machine that runs on the second of two RAID arrays... hmmm.
April 1, 1:00 PM. Some time for further analysis. I start a virtual machine running on my first RAID array and see that no AEN events are logged in the vmkernel log. Then I decide to add the VMDK's of my fileserver, all hosted on my second RAID array, one by one. The first VMDK is hotadded to a Windows 2008 virtual machine fine and I can see the data is still intact. Big relief! But indeed, when adding the second and third VMDK, the AEN events are flooding the vmkernel logs again.

At this time, I am becoming more and more convinced that it is not the Perc 5i controller that is responsible for the issues, but one or more disks in the second RAID array.
April 1, 2:00 PM. I decide I want to have a look at the Perc 5i controller logs to see if errors are logged at the HBA level. Since the Perc 5i uses an LSI Logic chip, I use the procedure I blogged about a while back to install the MegaCLI tool again.

At this point, I discover that it is no longer possible to use the LSI MegaCLI tools under vSphere. I guess VMware finally decided that the Service Console has to run as a virtual machine and the Perc 5i card is no longer exposed inside the Service Console. LSI MegaCLI therefore reports that no compatible controllers are present. Bummer! Apparently some people report in the VMware Community forums that LSI MSM (remote management server?) seems to work with limited functionality but I decide not to try to install this.
April 1, 5:00 PM. Time to think of an alternative way of discovering what is wrong in the second RAID array. It is a RAID5 array of 4 Seagate 1 TB disks (yes, the ST31000340AS series that had the firmware issues), and my suspicion is now that a single disk has failed, but the failure is not picked up by the Perc 5i controller, or not reported by the disk firmware. That is particularly bad because I don't want to pull the wrong disk out of a RAID5 array with a failed disk -- obviously causing a total data loss, which would be very, very, very, VERY depressing after all the happiness that I still had my data ;).

Time to pull out the Seagate selftests and indeed, testing each drive individually revealed that one of the drives had failed.
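
Why pulling the wrong disk is so fatal becomes clear from how RAID5 parity works: one missing block per stripe can be rebuilt with XOR, two cannot. Below is a toy illustration with one stripe of a 4-disk array; the block contents and the `xor_blocks` helper are made up for the example, and a real controller also rotates parity across the disks, which I ignore here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks plus one parity block: one stripe of a 4-disk RAID5.
data = [b"fam-", b"ily ", b"pics"]
parity = xor_blocks(data)
stripe = data + [parity]

# One disk missing: the lost block is the XOR of the three survivors.
survivors = stripe[:1] + stripe[2:]      # disk 1 has failed
rebuilt = xor_blocks(survivors)
print("rebuilt block matches:", rebuilt == data[1])

# Two disks missing: the XOR of the rest only yields the two lost blocks
# mixed together, so neither of them can be recovered on its own.
remaining = stripe[2:]                   # disks 0 and 1 are gone
mixed = xor_blocks(remaining)
print("recovered a lost block:", mixed in (data[0], data[1]))
```

So with one bad disk already in the array, removing a healthy one by mistake leaves nothing to rebuild from.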

So the conclusion is: time for another RMA! I now have had each of my four Seagate 1 TB disks fail on me. In fact, out of the 8 Seagate drives I own, I have already requested 7 RMA's. At times like these I remember why I coughed up a massive amount of money to get my hands on the Western Digital Caviar Black edition (which AFAIK is the last consumer disk to provide a 5 year warranty).