Wednesday, December 22, 2010

A note on Western Digital 2001FASS drives

About a year ago, I decided to open my wallet and cough up some serious money for a good NAS solution for home use. With an ESX whitebox and a growing number of pictures and other digital paraphernalia that I like to store permanently, I decided that a standalone NAS would be more reliable than relying on the single (now aging) RAID controller in my ESX whitebox. After all, a NAS is "system independent", so it can be accessed from any device as long as there is a network. A few weeks later, I ordered a Thecus N7700 NAS from eBay, together with three Western Digital Caviar Black 2 TB disks (type: WD2001FASS). In the meantime, I have upgraded to five disks.

Setting aside some configuration complexities between ESX and the Thecus (see my previous blog post on the iSCSI implementation changes in ESX 4.1), everything had been running just fine... until yesterday.

At 17:04 yesterday evening, I received a Gmail notification from the Thecus NAS (yes, it sends mail through Gmail) indicating that one of the Caviar Black disks had failed and that my RAID5 array was now degraded. I was a bit surprised and already feared another "Sea-gate" incident with a series of continuously failing disks (the "gate" suffix being so popular with "cablegate", I decided to introduce another one :) ).

I decided to remove the affected disk and run the Western Digital drive diagnostic tools on it (which took a dreadfully long four and a half hours). Sure enough, a Full Drive test revealed that there were some bad sectors on the drive, but that they were successfully remapped to the spare capacity that drives include precisely to compensate for a few bad blocks. Still, the RAID array was degraded and the drive was reported as failed (even though the problem seems very easily fixable), so I decided to dive a little deeper into what happened, in an attempt to discover why this is not automatically fixed by the drive when such a bad block is discovered.

What I found out seriously pissed me off. Western Digital does support a mechanism to automatically remap bad blocks to the spare capacity on the drive. However, this can take a few moments, so the question arises how the drive should tell the RAID controller that it is currently busy doing some block remapping. Western Digital has a technology they refer to as TLER (Time Limited Error Recovery), which limits how long the drive spends on error recovery so that the RAID array does not mark it as failed in the meantime.
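As an aside: the vendor-neutral name for TLER is SCT Error Recovery Control, and smartmontools can query it (and, on firmware that allows it, set the timeouts). Below is only a rough sketch of what that looks like -- /dev/sda is just an example device path, and drives with TLER disabled in firmware will typically just reject the "set" command:

```python
#!/usr/bin/env python3
"""Rough sketch: read (and optionally try to set) the SCT Error Recovery
Control timeouts -- the generic SATA counterpart of WD's TLER -- through
smartmontools. Assumes smartctl is installed; /dev/sda is only an example,
and TLER-disabled consumer firmware will usually refuse the 'set' command."""

import subprocess
import sys


def read_erc(device):
    """Show the current SCT ERC read/write timeouts as reported by smartctl."""
    out = subprocess.run(["smartctl", "-l", "scterc", device],
                         capture_output=True, text=True, check=False)
    return out.stdout


def set_erc(device, deciseconds=70):
    """Try to set both recovery timeouts (70 deciseconds = 7 s, the value the
    RE drives use). On drives without TLER support this simply fails."""
    out = subprocess.run(
        ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", device],
        capture_output=True, text=True, check=False)
    return out.stdout


if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    print(read_erc(dev))
```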

Fantastic! The only problem is that this firmware feature is disabled in the 2001FASS drives, simply because they are considered "consumer" drives. The even more expensive (and trust me, I had to use all my tactics to convince my wife to cough up the money for what I already consider a really expensive drive) RE or "RAID edition" drives are in fact almost identical to the 2001FASS drives, with the exception that they have the TLER feature enabled.

Basically, this means that the 2001FASS drive is not suitable for RAID arrays. When the drive encounters a bad block, it will immediately be marked as failed by the array even though it has not actually failed. Talk about a serious bummer! Some report that TLER is not needed for Linux (which is basically what the Thecus NAS is, a Linux box), but my experience seems to contradict this slightly.

For me, this is an important reason not to buy Western Digital anymore -- you need to cough up an additional bucket of money for a feature that should be enabled in every drive. After all, virtually every motherboard today supports basic RAID functionality, and you may well want to upgrade from a single drive to multiple drives at some point...

5 comments:

Anonymous said...

What did you end up buying instead?
Were you able to return the drive?
If you return the drive with the explanation that your NAS keeps detecting it as faulty what can they say? (besides see your NAS supplier :) )

This is becoming a problem with probably most if not all HD suppliers. They consider the TLER function to be a "PRO" feature, while more and more consumers are purchasing RAID cards/NAS/etc. for home use.
Great story BTW!

Tim Jacobs said...

Up until now, I haven't bought any additional hard disks. Perhaps I will eventually cough up the extra bucks for the RE version of the Western Digital drives.

I didn't return the drive, since a full bad block scan "fixed" it (the bad sectors were reallocated). Of course, when the spare capacity for bad sectors runs out, I will return it and use my five-year warranty.

I'm not sure what Western Digital will say when the drive is RMA'd with the explanation that it doesn't work in a RAID array. In fact, the drive is advertised as not being able to do that :). I suppose that, assuming they actually read the explanation before swapping the drive, they might refuse an RMA. But then again, I will not send it back unless it has really failed.

In the meantime, I can say that my Thecus is happily up and running again without any issues ever since.

Anonymous said...

So you are now running with 4 disks in degraded mode..?
Brave man :)

I saw on the Thecus site that they already covered their ass by saying WD disks might have this issue
http://www.thecus.com/Downloads/HDD_List/N7700_N7700SAS_N8800_N8800SAS_SATA_HDD_list_2010-12-10.pdf

[Notes for WD 1.5/2.0 TB HDD]
1. WD2002FYPS, WD20EADS, and WD15EADS may have compatibility issue with Thecus
N7700/N7700SAS/N8800/N8800SAS
2. About 90% of WD2002FYPS/WD20EADS/WD15EADS configuration is stable, but remaining
10% might experience RAID degrade issue randomly.
So those 10% are getting no love from Thecus... good luck and keep us posted on what adventures you encounter in the future!

Tim Jacobs said...

No, I have the full five disks running nicely again in RAID5 -- the disk with errors got fixed after a bad block scan. With a bit error rate of 10^(-14), i.e. one error per 100 terabits or roughly every 12.5 terabytes read, you can really expect bit issues during the life of a 2 TB disk (a bit issue meaning the drive reads the opposite value of what was written, due to magnetisation errors). To counter these bit errors, manufacturers include some "spare capacity" that is used when such a bit error (= bad block) is encountered -- with TLER drives this remapping of a bad block is, I think, automatic, while with the non-TLER drives it seems a bad block scan is necessary.
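To put some rough numbers on that (a back-of-the-envelope sketch only; the 10^(-14) figure is the typical datasheet spec, and this uses a simple Poisson approximation):

```python
import math

# Back-of-the-envelope check of the error-rate claim above. The 1e-14 figure
# is the usual consumer-drive datasheet spec, not a measurement.
URE_PER_BIT = 1e-14          # unrecoverable read errors per bit read
DISK_BYTES = 2e12            # a 2 TB drive (decimal TB, as marketed)
bits_read = DISK_BYTES * 8   # 1.6e13 bits in one full read of the disk

# One error expected per 1e14 bits, i.e. per 12.5 TB read
print("TB read per expected error:", 1 / URE_PER_BIT / 8 / 1e12)   # 12.5

# Chance of hitting at least one unrecoverable error while reading the
# whole disk once (Poisson approximation): roughly 15%.
expected_errors = bits_read * URE_PER_BIT
print(f"P(>=1 error per full read) ~ {1 - math.exp(-expected_errors):.0%}")
```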

I did that, and I still have the full 2 TB capacity, with some sectors now served from the extra spare capacity. Of course, the amount of spare capacity on a drive is finite, so once it runs out, the drive can no longer sustain its full 2 TB capacity due to too many bad blocks. That should trigger a SMART alert and allow the drive to be RMA'd (vendors count on you not reaching that bad block threshold before the warranty expires). And of course, not having the full 2 TB capacity because there is no spare capacity left to absorb bad blocks is bad in a RAID array, where all drives are supposed to be identical.

Thus far, I have not depleted the spare capacity and no SMART errors have been tripped, so the drive is actually fully functional again according to Western Digital's specifications. What concerns me is that the remapping of bad blocks to the spare capacity seems to require manual intervention: removing the drive (marked as failed by the Thecus NAS), running a bad block scan on another PC so the bad blocks get remapped to spare capacity, and finally reinserting it into the NAS and rebuilding the RAID array. That should be automatic, and it apparently is with the WD RE series of drives.
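For what it's worth, the two SMART attributes to watch in this situation are Current_Pending_Sector (sectors the drive could not read and still wants to remap) and Reallocated_Sector_Ct (sectors already moved to spare capacity). A small sketch, assuming smartmontools is installed and using /dev/sda only as an example:

```python
#!/usr/bin/env python3
"""Sketch: print the SMART counters relevant to this story. After a successful
bad block scan, Current_Pending_Sector should drop back to zero while
Reallocated_Sector_Ct goes up. Assumes smartmontools is installed."""

import subprocess
import sys

WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def smart_counters(device):
    """Parse 'smartctl -A' output and return the raw values of the watched attributes."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    counters = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCHED:
            counters[fields[1]] = fields[9]
    return counters


if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    for name, raw in smart_counters(dev).items():
        print(f"{name}: {raw}")
```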

Also, I don't blame Thecus for not supporting the drive -- they are in fact correct to state that this is not a supported drive, simply because the drive firmware is CRIPPLED such that it does not seem to do bad block remapping automatically (or fast enough). What I don't understand is that drives such as the RE4, which do support TLER, are not on the Thecus compatibility list. So to conclude: the blame is with WD, not Thecus (at least IMHO).

That being said, my 2001FASS drives are working perfectly fine again now -- I'm just a bit pissed because of the entire TLER/crippled firmware/expensive drives situation, which still seems to require manual intervention on my part in case of RAID trouble (which is precisely why I coughed up €800 for a Thecus NAS -- to be 100% sure that when things go wrong, I still have my data!!).

Tim Jacobs said...

Update 2012: In the meantime, I've purchased a Synology DS1812+ NAS, which accepts and works with the 2001FASS and 2003FAEX drives without any issues; in fact, I expanded from the original five 2 TB disks to eight drives in a now fully packed Synology NAS. The Thecus has been downgraded to a backup NAS carrying some Seagate Barracuda 1 TB drives (supported by Thecus) that I still had lying around. Time solved the problem, as usual :).