Wednesday, November 24, 2010

A note on ESX 4.x and my iSCSI devices

A few weeks ago, I decided to extend my iSCSI NAS (Thecus N7700) from 3x 2TB Western Digital Caviar Black disks to 5x 2TB Western Digital Caviar Black disks.

Trouble has been my companion ever since. I have been experiencing some serious performance issues since the RAID extension, and was fearing that the different firmware versions of the new Caviar Blacks was confusing my NAS system; mixing firmwares in RAID systems does not seem to be a best practice. The symptoms were very simple: from the moment a lot of I/O was generated (think: 160 MB/s write speeds to the NAS), ESX would loose the iSCSI link to the NAS, which was choking on all that traffic with a 100% CPU usage. As you very well know, storage is ESX's Achilles heel, and very shortly after that, the vmkernel logs would be flooding with messages indicating a path failure to the NAS:

0:00:41:06.581 cpu1:4261)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba36:C0:T0:L3, reservation state on device t10.E4143500000000000000000040000000AE70000000000100 is unknown.
0:00:41:06.581 cpu1:4261)ScsiDeviceIO: 1672: Command 0x16 to device "t10.E4143500000000000000000040000000AE70000000000100" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

After a multitude of firmware up- and downgrades on the Thecus N7700 and a lot of conversation with Thecus Support (which by the way I want to thank for their patience with a guy like me working in an unsupported scenario!), I stumbled across some a strange error message that I had not seen before on an ESX host:

0:00:41:06.733 cpu0:4113)FS3: 8496: Long VMFS3 rsv time on 'NASStorage04' (held for 3604 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors

Some googling quickly pointed me to a few interesting threads, which talked about a VMware KB 1002598 discussing performance issues on EMC Clariion systems with iSCSI. It seems that the iSCSI initiator in ESX allows for for delayed ACK's which apparently is important in situations of network congestion. Knowing that the N7700's CPU usage can sometimes peak to 100% and that this can very briefly can lock up the network link on the N7700, I decided to disable the Delayed ACK's, following the procedure in the VMware KB...

Great success! Performance was rock solid again, and I have no longer experienced ESX hangs ever since!

This made me think a bit, and I remember that I first noticed the performance issues a few weeks after upgrading to ESX 4.0 Update 2 -- I suppose some default setting has changed from a vanilla ESX 4.0 (which I was running earlier) to ESX 4.0 Update 2 that seems to disturb the good karma that I had going between my ESX host and N7700 NAS earlier. Let it be known to the world that also the N7700 with firmwares 2.01.09, 3.00.06 and (the ones I tried) also is subject to the iSCSI symptoms described in VMware KB 1002598.

Friday, November 5, 2010

The joy of WSUS

After a rather unpleasant electrical powerspike earlier this week had made some of my harddisks go wierd (crashing my ESX server with an equally unpleasant PSOD), a quick inspection revealed that no real harm was done -- except for one of the dozen RAID arrays that I have decided to do an automatic rebuild (no real issue). That finished after a few hours so I was able to go back to my comfortable sofa and enjoy some more quality prime time TV (lol). At least, so I thought...

A few hours later I discovered that my domain controller had not survived the ESX crash and was very unpleasantly complaining about a corrupted registry. Deciding that a bare metal (or virtual metal) Active Directory disaster recovery was not really necessary on my home network (recreating the three user accounts was less effort ;) ), I decided to reinstall my entire domain controller. About 30 minutes after that decision, I was again running a new AD domain with the users recreated and the most important servers already rejoined to the domain.

So what did I forget to configure in my enthousiasm to just reinstall the entire bunch? Certificate services, DFS namespace, DHCP server, re-ACL of file server, recreation of user profiles and also my own WSUS server (which were all happily running on my domain controller as well -- beat that SBS!).

My own WSUS server I hear you say? Well yes, with the very unpleasant (which you will have noticed already is the word of today) bandwidth limitations we have in Belgium, my ISP decides to punish me with some low-bandwidth connection after transferring more than 80 GB of data. That is quite sufficient but I prefer not spending it on downloading all my Windows updates 14 times (which is about the total number of virtual machines, physical laptops and desktops I have running on a frequent basis).

Given that my WSUS partition was about 120 GB and 98% filled, the doom scenario of seeing my entire data transfer that my ISP allows me for this month being entirely consumed by frikkin' Windows updates after reinstalling WSUS & synchronizing for the first time, slowly started to set in. An entire month of "small band" in this digital age? The horror... the horror...

So I decided to spend a few megabytes of datatransfer of very actively googling whether it is possible to prevent WSUS from downloading all the updates from the internet. After all, the registry corruption of the domain controller had completely borked its functionality, yet the separate partition (and separate VMDK) which was holding the WSUSContent directory was undamaged.

Most fora and blogs I found on recycling WSUSContent when performing a new installation, refer to a TechNet page called "Set Up a Disconnected Network (Import and Export Updates)" , which explains how the WSUSContent can be copied from one server to other -- however, they are always exporting & importing the WSUS database as well; unfortunately this database got lost when I -- again -- enthousiastically wiped the entire corrupted OS VMDK.

So I just decided to have a go and installed WSUS from scratch, and I pointed the WSUSContent directory to the partition which already contained the updates from the old server. Then I did the following:
  • Configured the WSUS server exactly has before (with the same products to update)
  • Performed the first initial synchronization (this took a long time but using the network bandwidth monitoring in the vSphere client I could clearly see that only minimal amounts of data were transferred during this synchronization -- no actual content was downloaded!)
  • Approved all the updates that were previously also approved.
This turns out to work quite nicely; apparently when WSUS detects that the updates are already downloaded to disk, it will recycle the existing content! Hurray for WSUS and for not torturing me with small band for an entire month!!