Wednesday, December 22, 2010

A note on Western Digital 2001FASS drives

About a year ago, I decided to open my wallet and cough up some serious money for a good NAS solution for home use. With an ESX whitebox and a growing number of pictures and other digital paraphernalia that I like to store (permanently), I decided that a standalone NAS would be more reliable than relying on a single (now aging) RAID controller in my ESX whitebox. After all, a NAS is "system independent", so it can be accessed from any device as long as there is a network. A few weeks later, I ordered the Thecus N7700 NAS from eBay, together with three Western Digital Caviar Black 2 TB disks (type: WD2001FASS). Since then, I have upgraded to five disks.

Apart from some configuration complexities between ESX and the Thecus (see my previous blogpost on ESX's iSCSI implementation changes in 4.1), everything has been running fine... until yesterday.

At 17:04 yesterday evening, I received a Gmail notification from the Thecus NAS (yes, it sends mail through Gmail) indicating that one of the Caviar Black disks had failed and that my RAID5 array was now degraded. I was a bit surprised and was already fearing another "Sea-gate" incident, with another series of continuously failing disks (with the "gate" suffix being so popular since "cablegate", I decided to introduce another one :) ).

I decided to remove the affected disk and run the Western Digital drive diagnostic tools on it (which took a dreadfully long four and a half hours). Sure enough, a Full Drive test revealed some bad sectors on the drive, but they were successfully remapped to the spare capacity that drives reserve precisely to compensate for a few bad blocks. Still, the RAID array was degraded and the drive was reported as failed (even though the problem seems to be very easily fixable), so I decided to dive a little deeper into what happened, in an attempt to discover why the drive does not fix this automatically when it discovers such a bad block.

What I found out seriously pissed me off. Western Digital does support a mechanism to automatically remap bad blocks to the spare capacity on the drive. However, this can take a few moments, so the question arises of how the drive should tell the RAID controller that it is currently busy remapping blocks. Western Digital has a technology for this, which they refer to as TLER - Time Limited Error Recovery. It limits how long the drive spends on error recovery, so that the RAID controller does not lose patience and mark the drive as failed.

Fantastic! The only problem is that this software feature is disabled in the 2001FASS drives, simply because they are considered "consumer" drives. The even more expensive RE or "RAID edition" drives (and trust me, I had to use all my tactics to convince my wife to cough up the money for what I already consider a really expensive drive) are in fact almost identical to the 2001FASS drives, except that they have the TLER feature enabled.

Basically, this means that the 2001FASS drive is not suitable for RAID arrays. When a drive encounters a bad block, it will immediately be marked as failed, even though this is not the case. Talk about a serious bummer! Some report that TLER is not needed for Linux (which is basically what the Thecus NAS is, a Linux box), but my experience seems to contradict this slightly.
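
To make the timing problem a bit more concrete, here is a minimal Python sketch of the race between a drive's internal error recovery and the controller's patience. It is a toy model, not how the Thecus or any real controller is implemented: the function names and the numbers (an 8-second controller timeout, a 7-second TLER limit, a 30-second recovery attempt) are assumptions I picked purely for illustration.

```python
# Toy model of the race between a drive's internal error recovery and a
# RAID controller's timeout. All names and numbers are illustrative
# assumptions, not values read from a real drive or controller.

def drive_response_time(recovery_needed_s, tler_limit_s=None):
    """Seconds until the drive answers a read that hits a bad sector.

    Without a TLER-style limit the drive keeps retrying for as long as
    recovery takes. With a limit, it gives up after tler_limit_s and
    reports a read error instead.
    """
    if tler_limit_s is None:
        return recovery_needed_s
    return min(recovery_needed_s, tler_limit_s)


def controller_verdict(recovery_needed_s, controller_timeout_s, tler_limit_s=None):
    """What a (hypothetical) RAID controller concludes about the drive."""
    response = drive_response_time(recovery_needed_s, tler_limit_s)
    if response > controller_timeout_s:
        # The drive stayed silent for too long: it looks dead.
        return "drive dropped from array"
    if tler_limit_s is not None and recovery_needed_s > tler_limit_s:
        # The drive answered in time, but with an error; the array can
        # rebuild that one sector from the other disks.
        return "read error reported, sector rebuilt from redundancy"
    return "read completed"


if __name__ == "__main__":
    print(controller_verdict(30, 8))      # no TLER, long recovery
    print(controller_verdict(30, 8, 7))   # TLER-style limit of 7 s
    print(controller_verdict(2, 8))       # quick recovery, no TLER
```

In this model, the drive without a recovery limit is dropped even though it would eventually have recovered on its own, which is the behaviour I seem to be running into.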

For me, this is an important reason not to buy Western Digital anymore -- you need to cough up an additional bucket of money for a feature that should be enabled in any drive -- after all, all motherboards today support basic RAID functionality! Or, if you want to upgrade at a given time from one drive to multiple drives...