Wednesday, November 24, 2010

A note on ESX 4.x and my iSCSI devices

A few weeks ago, I decided to extend my iSCSI NAS (Thecus N7700) from 3x 2TB Western Digital Caviar Black disks to 5x 2TB Western Digital Caviar Black disks.

Trouble has been my companion ever since. I have been experiencing some serious performance issues since the RAID extension, and was fearing that the different firmware versions of the new Caviar Blacks was confusing my NAS system; mixing firmwares in RAID systems does not seem to be a best practice. The symptoms were very simple: from the moment a lot of I/O was generated (think: 160 MB/s write speeds to the NAS), ESX would loose the iSCSI link to the NAS, which was choking on all that traffic with a 100% CPU usage. As you very well know, storage is ESX's Achilles heel, and very shortly after that, the vmkernel logs would be flooding with messages indicating a path failure to the NAS:

0:00:41:06.581 cpu1:4261)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba36:C0:T0:L3, reservation state on device t10.E4143500000000000000000040000000AE70000000000100 is unknown.
0:00:41:06.581 cpu1:4261)ScsiDeviceIO: 1672: Command 0x16 to device "t10.E4143500000000000000000040000000AE70000000000100" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.


After a multitude of firmware up- and downgrades on the Thecus N7700 and a lot of conversation with Thecus Support (which by the way I want to thank for their patience with a guy like me working in an unsupported scenario!), I stumbled across some a strange error message that I had not seen before on an ESX host:

0:00:41:06.733 cpu0:4113)FS3: 8496: Long VMFS3 rsv time on 'NASStorage04' (held for 3604 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors

Some googling quickly pointed me to a few interesting threads, which talked about a VMware KB 1002598 discussing performance issues on EMC Clariion systems with iSCSI. It seems that the iSCSI initiator in ESX allows for for delayed ACK's which apparently is important in situations of network congestion. Knowing that the N7700's CPU usage can sometimes peak to 100% and that this can very briefly can lock up the network link on the N7700, I decided to disable the Delayed ACK's, following the procedure in the VMware KB...

Great success! Performance was rock solid again, and I have no longer experienced ESX hangs ever since!

This made me think a bit, and I remember that I first noticed the performance issues a few weeks after upgrading to ESX 4.0 Update 2 -- I suppose some default setting has changed from a vanilla ESX 4.0 (which I was running earlier) to ESX 4.0 Update 2 that seems to disturb the good karma that I had going between my ESX host and N7700 NAS earlier. Let it be known to the world that also the N7700 with firmwares 2.01.09, 3.00.06 and 3.05.02.2 (the ones I tried) also is subject to the iSCSI symptoms described in VMware KB 1002598.

No comments: