
Wednesday, November 24, 2010

A note on ESX 4.x and my iSCSI devices

A few weeks ago, I decided to extend my iSCSI NAS (Thecus N7700) from 3x 2TB Western Digital Caviar Black disks to 5x 2TB Western Digital Caviar Black disks.

Trouble has been my companion ever since. I have been experiencing serious performance issues since the RAID extension, and was fearing that the different firmware versions of the new Caviar Blacks were confusing my NAS; mixing firmware versions in RAID systems does not seem to be a best practice. The symptoms were very simple: as soon as a lot of I/O was generated (think: 160 MB/s write speeds to the NAS), ESX would lose the iSCSI link to the NAS, which was choking on all that traffic at 100% CPU usage. As you very well know, storage is ESX's Achilles heel, and very shortly after that, the vmkernel logs would be flooded with messages indicating a path failure to the NAS:

0:00:41:06.581 cpu1:4261)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba36:C0:T0:L3, reservation state on device t10.E4143500000000000000000040000000AE70000000000100 is unknown.
0:00:41:06.581 cpu1:4261)ScsiDeviceIO: 1672: Command 0x16 to device "t10.E4143500000000000000000040000000AE70000000000100" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.


After a multitude of firmware up- and downgrades on the Thecus N7700 and a lot of conversation with Thecus Support (whom, by the way, I want to thank for their patience with a guy like me working in an unsupported scenario!), I stumbled across a strange error message that I had not seen before on an ESX host:

0:00:41:06.733 cpu0:4113)FS3: 8496: Long VMFS3 rsv time on 'NASStorage04' (held for 3604 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors

Some googling quickly pointed me to a few interesting threads, which referred to VMware KB 1002598 discussing performance issues on EMC Clariion systems with iSCSI. It seems that the iSCSI initiator in ESX allows for delayed ACKs, which apparently is important in situations of network congestion. Knowing that the N7700's CPU usage can sometimes peak at 100% and that this can briefly lock up the network link on the N7700, I decided to disable delayed ACKs, following the procedure in the VMware KB...
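For anyone chasing the same symptoms, a quick way to verify whether you are affected is to count the reservation failures and long-reservation warnings in the vmkernel log before and after the change. A minimal sketch, assuming the classic ESX service console with the vmkernel log at /var/log/vmkernel:

# Count SCSI reservation problems and long VMFS reservation warnings
# in the vmkernel log (run from the ESX service console).
grep -c "nmp_PathDetermineFailure" /var/log/vmkernel
grep -c "Long VMFS3 rsv time" /var/log/vmkernel

# Show the most recent long-reservation warnings with their timestamps.
grep "Long VMFS3 rsv time" /var/log/vmkernel | tail -5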

Great success! Performance was rock solid again, and I have not experienced any ESX hangs since!

This made me think a bit, and I remembered that I first noticed the performance issues a few weeks after upgrading to ESX 4.0 Update 2 -- I suppose some default setting changed between vanilla ESX 4.0 (which I was running earlier) and ESX 4.0 Update 2 that disturbs the good karma I had going between my ESX host and N7700 NAS. Let it be known to the world that the N7700 with firmwares 2.01.09, 3.00.06 and 3.05.02.2 (the ones I tried) is also subject to the iSCSI symptoms described in VMware KB 1002598.

Thursday, April 1, 2010

ESX Whitebox & RAID controller failures - an epic struggle

The past few days have been a bit tense. Not only was there a deadline at work (an interesting study at one of our customers that had to be finished before the end of March 2010), but yesterday my ESX whitebox also decided to die on me. Of course, I took my screwdriver and box of recovery CD's and went to work... A reconstruction of the epic struggle to get everything back up and running:
  • March 31, 8:00 AM. The old & faithful 100 Mbps 3Com switch that my PC's are currently connected to (after moving, I have been too lazy to install CAT6 cabling in the new house, so I don't live between UTP cables -- the wife loves it) has crashed with a blinking "Alert!" light; after disconnecting the power, the switch came back up again.

  • March 31, 8:05 AM. No internet connectivity; road works again, like the day before? Nope, turns out my ESX box, which runs a virtual m0n0wall router, has completely frozen and can only be brought back by a hard reset.

  • March 31, 8:10 AM. Thirdly, I discovered that my Dell Perc 5i controller now freezes the computer after the power has been cycled. Interesting. Trying to enter the Perc 5i BIOS for configuration also freezes the computer. Fear kicks in.
About a year ago, I already burned out a Perc 5i controller (including the sizzling, smoke and fireworks) and decided to buy a second-hand controller from eBay again. That replacement never fully worked the way I liked (for example, after resetting the computer, the controller is no longer recognized -- in fact it is only recognized after a power cycle; strange!). A bit pissed off, I blame myself for accepting a half-working controller for hosting all my data (family pictures, personal documents, ...). I'm already fearing that I will have to buy a replacement controller & restore all my data from Amazon S3 & JungleDisk (which I subscribed to after the previous controller went up in smoke)... weeks of downtime.
  • March 31, 8:30 AM. I remember that shortly after I got the Perc 5i controller, I got a few warnings about ECC errors being discovered in the DIMM that provides the read/write cache. I decide to replace the DIMM, as a BIOS crashing all of a sudden seems a bit unreal. Unfortunately, to no avail.

  • March 31, 8:45 AM. After some fiddling around with the controller, I notice the Perc 5i BIOS is accessible without any drives connected. Puzzling, but after performing a factory reset of the card (erasing the FlashROM) and performing a "foreign array import" of my two RAID arrays, the disks are discovered again & the computer tries to boot up. All this is followed by a little dance of happiness around the computer, thanking the computer gods for resurrecting the RAID array.

  • March 31, 8:55 AM. Immediately after the import, all volumes report suspicious RAID consistency and an automated consistency check & background initialization is started. The just-recovered peace of mind is disturbed and fear of data corruption kicks in. Anyway, the only thing to do is wait several hours for the data consistency checks to complete, so I just boot into ESX.

  • March 31, 8:57 AM. ESX now freezes somewhere halfway through the boot. Turns out I am running an unpatched vSphere 4.0 which still has an older megaraid_sas driver. I remember issues were reported with this driver, and this is confirmed when inspecting the vmkernel logs. They reveal that the megasas driver is receiving tons of AEN events (Automated Event Notifications):

    esx01 vmkernel: 0:03:28:31.377 cpu3:4193)<6>megasas_hotplug_work[6]: event code 0x006e
    esx01 vmkernel: 0:03:28:31.387 cpu3:4193)<6>megasas_hotplug_work[6]: aen registered
    esx01 vmkernel: 0:03:28:31.518 cpu1:4485)<6>megasas_service_aen[6]: aen received
    esx01 vmkernel: 0:03:28:31.518 cpu0:4196)<6>megasas_hotplug_work[6]: event code 0x006e
    esx01 vmkernel: 0:03:28:31.528 cpu0:4196)<6>megasas_hotplug_work[6]: aen registered
    esx01 vmkernel: 0:03:29:51.334 cpu3:4251)<6>megasas_service_aen[6]: aen received
    esx01 vmkernel: 0:03:29:51.334 cpu2:4205)<6>megasas_hotplug_work[6]: event code 0x0071
    esx01 vmkernel: 0:03:29:51.349 cpu2:4205)<6>megasas_hotplug_work[6]: aen registered
    esx01 vmkernel: 0:03:29:54.318 cpu3:4246)<6>megasas_service_aen[6]: aen received
    esx01 vmkernel: 0:03:29:54.318 cpu0:4207)<6>megasas_hotplug_work[6]: event code 0x0071
    esx01 vmkernel: 0:03:29:54.334 cpu0:4207)<6>megasas_hotplug_work[6]: aen registered
    esx01 vmkernel: 0:03:29:57.405 cpu3:4246)<6>megasas_service_aen[6]: aen received
    esx01 vmkernel: 0:03:29:57.405 cpu2:4193)<6>megasas_hotplug_work[6]: event code 0x0071
    esx01 vmkernel: 0:03:29:57.421 cpu2:4193)<6>megasas_hotplug_work[6]: aen registered
    For an unknown reason, the ESX server is unable to cope with the massive amount of events received and slows down dreadfully (in retrospect I noticed it did not actually crash). See the sketch at the end of this post for a quick way to count these events.

    I decide to boot back into the Perc 5i BIOS and let the consistency check finish. Turns out again everything freezes before I can enter the BIOS so I need to disconnect all drives again, perform a factory reset & re-import my RAID arrays. I let the consistency checks start & hurry to work.
  • March 31, 9:00 PM. Consistency checks have finished, but now ESX refuses to boot up, no longer finding the service console VMDK, and reports:

    VSD mount/Bin/SH:cant access TTY job control turned off.

    Interesting. I discover a VMware KB that describes this behavior, which explains that sometimes LUN's can be discovered as snapshots when changes are made at the storage array. I conclude that my consistency checks & foreign array importing might have messed up the identifiers such that now ESX can no longer find the Service Console VMDK and goes berserk. After following the steps in the KB (basically resignaturing all VMFS volumes), everything works again. Afterwards, I discover that I had switched the two cables connecting both of my RAID arrays (cable 1 got attached to port 2 and vice versa). Doh!!!

  • March 31, 9:30 PM. Time to install ESX 4.0 Update 1a; yet again, another issue: not enough disk space to install the patches! After cleaning up /var/cache/esxupdate, sufficient disk space is available.

  • March 31, 10:00 PM. After having booted everything up, I again notice very bad ESX performance, and my suspicion is confirmed when I see the same megaraid_sas AEN events in the vmkernel logs. Strangely enough, the error only occurs when I access my fileserver virtual machine, which is the only virtual machine that runs on the second of my two RAID arrays... hmmm.

  • April 1, 1:00 PM. Some time for further analysis. I start a virtual machine running on my first RAID array and see that no AEN events are logged in the vmkernel log. Then I decide to add the VMDK's of my fileserver, all hosted on my second RAID array, one by one. The first VMDK is hot-added to a Windows 2008 virtual machine just fine and I can see the data is still intact. Big relief! But indeed, when adding the second and third VMDK, the AEN events flood the vmkernel logs again.

    At this time, I am becoming more and more convinced that the Perc 5i controller is not to blame for the issues, but rather one or more disks in the second RAID array.

  • April 1, 2:00 PM. I decide I want to have a look at the Perc 5i controller logs to see if errors are logged at the HBA level. Since the Perc 5i uses an LSI Logic chip, I use the procedure I blogged about a while back to install the MegaCLI tool again.

    At this point, I discover that it is no longer possible to use the LSI MegaCLI tools under vSphere. I guess VMware finally decided that the Service Console has to run as a virtual machine, and the Perc 5i card is no longer exposed inside the Service Console. LSI MegaCLI therefore reports that no compatible controllers are present. Bummer! Apparently some people in the VMware Community forums report that LSI MSM (remote management server?) seems to work with limited functionality, but I decide not to install it.

  • April 1, 5:00 PM. Time to think of an alternative way of discovering what is wrong in the second RAID array. It is a RAID5 array of 4 Seagate 1 TB disks (yes, the ST31000340AS series that had the firmware issues), and my suspicion is now that a single disk has failed, but that the failure is either not picked up by the Perc 5i controller or not reported by the disk firmware. That is particularly bad, because pulling the wrong disk out of a RAID5 array with a failed disk obviously causes total data loss -- which would be very, very, very, VERY depressing after all the happiness that I still had my data ;).

    Time to pull out the Seagate self-tests, and indeed, testing each drive individually revealed that one of the drives had failed.
So the conclusion is: time for another RMA! I have now had each of my four Seagate 1 TB disks fail on me. In fact, out of the 8 Seagate drives I own, I have already requested 7 RMA's. At times like these I remember why I coughed up a massive amount of money to get my hands on the Western Digital Caviar Black edition (which AFAIK is the last consumer disk to come with a 5-year warranty).
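As promised above, here is a small service console sketch to quantify that kind of AEN flood; it assumes the classic ESX layout with the vmkernel log at /var/log/vmkernel and simply groups the megasas hotplug events by event code:

# Count megaraid_sas AEN events in the vmkernel log, grouped by event code
# (run from the ESX service console).
grep "megasas_hotplug_work" /var/log/vmkernel \
  | awk '/event code/ { count[$NF] += 1 } END { for (code in count) print code, count[code] }'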

Monday, August 31, 2009

Hosting your DNS on vSphere 4 - caveat

For a while now, I had been having an issue with my whitebox ESX 4.0 server: after rebooting the machine, I was unable to connect to it using the vSphere client. The error I was receiving was a simple "503: Service unavailable". The hostd.log on the host was filled with errors like:

--F637FB90 warning 'Proxysvc Req00002'-- Connection to localhost:8309 failed with error N7Vmacore15SystemExceptionE(Connection refused).

and I noticed that /var/log/messages contained a lot of vmware-authd start & stop messages. After some struggling, I managed to find a workaround, which consisted of:

  • Logging onto the service console as root

  • Editing the /etc/vmware/hostd/config.xml file and disabling the "proxysvc" component of hostd

  • Restarting the hostd process (service mgmt-vmware restart)

  • Waiting for all my autostart VM's to come online

  • Re-enabling the "proxysvc" and restarting hostd once again

Today, I discovered this thread on the VMware communities which contained the answer I was looking for: the DNS servers I had configured on my ESX box were virtual machines running on the box itself (in my case: a m0n0wall virtual appliance and a Windows 2008 domain controller with DNS). Apparently this disrupts the proxysvc component of hostd (since the virtual DNS servers are not reachable at the time hostd is first started -- autostart is yet to kick in), causing it to fail to start properly and preventing vSphere client connections. Furthermore, this prevented the autostart of the VM's altogether, so DNS never came up at all.

The solution was to clear my /etc/resolv.conf file, and now everything works fine immediately after a reboot (no more attempts to query a DNS server that is not yet running)! This completely removes DNS support from the host (in particular if you are using HA, you'll need to do good /etc/hosts maintenance). Since your typical production environment probably does not run its entire DNS infrastructure as one or several virtual machines on the same host, you are probably never exposed to this issue anyway.
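For reference, the change itself is trivial; a minimal sketch of the service console commands (the IP addresses below are placeholders for your own environment):

# Back up and empty /etc/resolv.conf so hostd no longer depends on
# DNS servers that are themselves VMs running on this host.
cp /etc/resolv.conf /etc/resolv.conf.bak
> /etc/resolv.conf

# Keep name resolution for the important names working via /etc/hosts
# (hypothetical example entries -- adjust names and addresses to your setup).
cat >> /etc/hosts <<EOF
192.168.1.10   esx01.pretnet.local esx01
192.168.1.11   virtualcenter.pretnet.local virtualcenter
EOF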

Thursday, April 9, 2009

Active Directory over SSL in VMware Lifecycle Manager

I have recently been playing around with VMware's Lifecycle Manager appliance, and one of the small gotchas I ran into was how to configure secure communications between the LCM appliance and the Active Directory backend I was authenticating against.


After configuring LCM to use Active Directory and SSL, I was getting the following error message:
Error: Unable to connect to LDAP Server / simple bind failed: dc.pretnet.local:636

In order to get the SSL authentication working for Active Directory (or LDAP in general), you need to be sure that the Certificate Authority that issues your domain controller certificates is trusted by the appliance (you don't need to actually import the domain controller certificate itself, just the issuing CA is sufficient). This is done by going through the following steps:
  1. First, obtain a copy of the issuing certification authority's certificate (without the private key, obviously). Ensure that it is in the X.509 format, Base64 encoded or DER encoded. The appliance doesn't seem to support certificate containers (P7B format), so when you export the certificate using the Certificates MMC, be sure to select one of the first two options as the export format!


  2. To add the X.509 certificate to the appliance, go to the "Network" tab and select the "SSL Certificate" configuration pane. Here, import the certificate file.


  3. Next, restart the "VMO Configuration Server", which you can find at the bottom of the "Server" tab in the GUI.


    Note: if you get an error message that first you need to fix your LDAP configuration (and "Plugins" section) before you can restart the VMO Configuration Service, go back to the LDAP configuration and disable SSL for a moment.
That's it! Secure Active Directory authentication (which is what we all want) is now working properly! It's a good idea to import the certificate right away, because your other configuration tasks are severely limited when authentication (either using the built-in OpenLDAP server on the appliance, or using Active Directory) is not working properly.
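If the bind keeps failing after the import, it can help to verify from a shell (on any machine with OpenSSL installed) that the domain controller actually presents a certificate chained to the CA you imported; a quick check, using the DC name from the error message above:

# Show the certificate chain presented by the domain controller on the LDAPS
# port; the issuer should match the CA certificate imported into the appliance.
openssl s_client -connect dc.pretnet.local:636 -showcerts < /dev/null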

As a side note, I would like to add that, despite VMware recommending running Lifecycle Manager on a dedicated Windows box (LCM Administration Guide v1.01, p21), the appliance is a really convenient way of running and upgrading this product without too much hassle. Of course, don't forget to offload the configuration database from the appliance (use a dedicated SQL or Oracle server)!

Monday, December 22, 2008

Counting ESX Server storage paths

At a customer, we have been hitting one of the built-in storage limits of ESX Server: you can only present up to 1024 storage paths to a single ESX host. Depending on your SAN topology, each LUN that you present over a fiber fabric uses 4, 8 or even 16 storage paths. You can check this using the esxcfg-mpath command:

Disk vmhba1:9:2 /dev/sdf (102400MB) has 8 paths and policy of Fixed
FC 13:0.0 10000000c96e8972<->50001fe15009264e vmhba1:9:2 On active preferred
FC 13:0.0 10000000c96e8972<->50001fe15009264a vmhba1:10:2 On
FC 13:0.0 10000000c96e8972<->50001fe15009264c vmhba1:11:2 On
FC 13:0.0 10000000c96e8972<->50001fe150092648 vmhba1:12:2 On
FC 16:0.0 10000000c96e8ccc<->50001fe15009264f vmhba2:12:2 On
FC 16:0.0 10000000c96e8ccc<->50001fe15009264b vmhba2:13:2 On
FC 16:0.0 10000000c96e8ccc<->50001fe15009264d vmhba2:14:2 On
FC 16:0.0 10000000c96e8ccc<->50001fe150092649 vmhba2:15:2 On


To count the total number of paths presented to a single ESX host, you can use the following service console command:

esxcfg-mpath -l | grep paths | awk '{ split($0, array, "has "); split(array[2], array2, " paths"); SUM +=array2[1] } END { print SUM}'

The awk syntax can probably be shortened considerably, but I am no awk/grep/sed expert :). Nevertheless, you can put this command in a cron job so that you receive reports on whether or not you are approaching the limit.
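For what it's worth, here is a shorter variant; it simply relies on the path count being the sixth whitespace-separated field of every "Disk ... has N paths ..." line, so treat it as a sketch against the exact esxcfg-mpath output format shown above:

# Sum the path counts by taking the 6th field of every "... has N paths ..." line
esxcfg-mpath -l | awk '/ paths / { sum += $6 } END { print sum }'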

Friday, November 21, 2008

VMware Tools without a reboot?

Every now and then, you see blog posts appearing on the "issue" that you need to reboot a guest operating system after you install or update the VMware Tools. Many people have pondered whether a reboot is in fact really necessary and whether it can be avoided altogether. Recent posts about this can be read here and here, referring to this VMware community thread -- the question is still alive in multi-year spanning threads like this one right here. I usually raise my eyebrows when reading these "no reboot" topics, yet I am interested in keeping up with the advancements on the subject for some of the large customers that I come in contact with professionally.

The scripts and methods outlined in these blog posts sound a bit tricky at first if you ask me, and I feared they might not have the outcome you expect. I would think the VMware Tools really require a reboot on some operating systems, because you update parts of the virtual device drivers and those need to be reloaded by a reboot of the operating system (note: strictly speaking you don't need a reboot for all types of device drivers, only under a specific set of circumstances documented by Microsoft; the VMware disk drivers host a boot device, so they fit the "requires a reboot" category from that document). This means that just running the installer with a "Suppress Reboot" parameter on all your machines will place the new VMware Tools files on your hard disk, but will not actively load all of them... I am not sure that is a state I would want my production virtual machines in!? And to be very clear: what these scripts do is request an automatic postponement of the reboot, not trigger some hidden VMware Tools functionality that makes a reboot unnecessary after all!

To remove all suspicion, I did a little test on a Windows 2003 virtual machine and upgraded the tools from ESX 3.0.2 to ESX 3.5U2 without rebooting (using the commandline setup.exe /S /v"REBOOT=R /qb" on the VMware Tools ISO). This effectively updates the following services and drivers without rebooting:
  • VMware services (bumped from build 63195 to build 110268)
  • VMware SVGA II driver, VMware Pointing Device driver
It left the following drivers untouched:
  • VMware Virtual disk SCSI Disk Device ("dummy" harddisk driver - Microsoft driver)
  • NECVMWar VMware IDE CDR10 (virtual CD-ROM driver)
  • Intel Pro/1000 MT Network Connection (vmnet driver - Microsoft driver)
  • LSI Logic PCI-X Ultra320 SCSI Host Adapter (storage adapter - Microsoft driver)
It turned out that these drivers didn't require updating for my specific virtual machine (even after a reboot). In fact, I wasn't immediately able to find a single machine in the test environment at work that required updating any boot disk device drivers (and some still had 3.0.2 VMware Tools running!).

To conclude, I would say that in some circumstances it is safe to postpone the reboot of your virtual machine, if at minimum the boot disk device drivers are not touched. Postponing the reboot is very convenient if you use it in the context of a patch weekend where you want to postpone the restart to one big, single reboot at the end of all your patches.

Update: as Duncan Epping points out in a recent blogpost, he also advises that updating the network driver effectively drops all network connections. For all practical purposes this is maybe just as bad as actually rebooting your server, so beware of the "fake level of safety and comfort" that you might get by postponing a VMware Tools reboot!

Thursday, August 14, 2008

Matching LUN's between ESX hosts and a VCB proxy

One of the problems that I encountered at a customer was discovering which VMFS partitions were presented to a VCB proxy. It turned out to be a bit more complex than I had first expected.

Introduction
VMware released the VCB framework (VMware Consolidated Backup) to make backups of entire virtual machines. The VCB framework is typically installed on a Windows host (the VCB proxy), and in order to make SAN backups, you need to present both the source LUN, which contains the virtual machines to back up, and the destination LUN, where the backup files are stored, to that VCB proxy.

This setup is relatively simple to maintain in smaller environments. However, once you get into a big environment where a dozen teams are involved (separate networking teams, separate SAN teams, separate Windows teams and separate VMware teams), it can become quite challenging to find out which of the 12 LUN's presented to a Windows host in fact belong to a specific ESX host.

Finding unique identifiers for a LUN
The mission is to find a unique identifier (UID) that can be used both on the ESX host and on the Windows box. The two obvious candidates to uniquely identify an ESX-managed LUN on a SAN network are:
  • The VMFS ID for the partition
    Upon the initialization of a VMFS partition, it is assigned a unique identifier that can be found by looking in the /vmfs/volumes directory on an ESX host, or by using the esxcfg-vmhbadevs -m command on the ESX host. The output looks like this:

    vmhba1:0:2:1 /dev/sdb1 48858dc4-f4e218d1-d3a8-001cc497e630
    vmhba1:4:1:1 /dev/sdc1 483cf914-29b60dc5-dbfd-001cc497e630
    vmhba1:4:2:1 /dev/sdd1 479da7c1-4494cd90-d327-001cc497e630


    The first disk is the remainder of the locally attached storage, and the two other disks are presented from the SAN. The first column indicates that HBA 1, SCSI target 4 and LUN's 1 and 2 are used (and partition 1 on each LUN); the second column lists the Linux device name under the Service Console and the third column lists the VMFS ID.

  • The WWPN (World Wide Port Name) of the disk on the SAN
    On a fiber-channel SAN network, each device is assigned a unique identifier called the WWPN. The WWPN performs much the same function as a MAC address does on an Ethernet network. The WWPN's of the disks that are presented to an ESX host can be obtained from the Service Console using the esxcfg-mpath -l command:

    Disk vmhba1:4:1 /dev/sdc (256000MB) has 16 paths and policy of Fixed
    FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:1 On
    FC 13:0.0 10000000c96e8972<->500507630313060b vmhba1:5:1 On
    FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:1 On active preferred
    FC 13:0.0 10000000c96e8972<->500507630303860b vmhba1:7:1 On
    FC 13:0.0 10000000c96e8972<->500507630308860b vmhba1:8:1 On
    FC 13:0.0 10000000c96e8972<->500507630313860b vmhba1:9:1 On
    FC 13:0.0 10000000c96e8972<->500507630318060b vmhba1:10:1 On
    FC 13:0.0 10000000c96e8972<->500507630318860b vmhba1:11:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630313460b vmhba2:6:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630303c60b vmhba2:7:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630308c60b vmhba2:8:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630313c60b vmhba2:9:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630318460b vmhba2:10:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630318c60b vmhba2:11:1 On

    Disk vmhba1:4:2 /dev/sdd (256000MB) has 16 paths and policy of Fixed
    FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:2 On
    FC 13:0.0 10000000c96e8972<->500507630313060b vmhba1:5:2 On
    FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:2 On
    FC 13:0.0 10000000c96e8972<->500507630303860b vmhba1:7:2 On
    FC 13:0.0 10000000c96e8972<->500507630308860b vmhba1:8:2 On
    FC 13:0.0 10000000c96e8972<->500507630313860b vmhba1:9:2 On
    FC 13:0.0 10000000c96e8972<->500507630318060b vmhba1:10:2 On
    FC 13:0.0 10000000c96e8972<->500507630318860b vmhba1:11:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:2 On active preferred
    FC 16:0.0 10000000c96e8ccc<->500507630313460b vmhba2:6:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630303c60b vmhba2:7:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630308c60b vmhba2:8:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630313c60b vmhba2:9:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630318460b vmhba2:10:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630318c60b vmhba2:11:2 On

    In this output, you can see two HBA's (that have WWPN's 10000000c96e8972 and 10000000c96e8ccc) that see two LUN's vmhba1:4:1 and vmhba1:4:2 that are presented over 16 paths.

    On the VCB proxy / Windows box, I used the Emulex HBAnywhere utility to retrieve the WWPN's of the LUN's that were presented. The output is shown in the following screenshot:


    It is also possible to use the HbaCmd.exe AllNodeInfo command to retrieve a list of all WWPN's that a certain HBA sees.
Looks nice, what's the problem?
Using the WWPN seemed to be the obvious answer to identifying the LUN's on both the ESX host and the VCB proxy. Until I discovered that two different LUN's were presented using the same WWPN (obviously they were on two different SAN's and presented to two different hosts). On one of our ESX hosts, a 256 GB LUN was presented using WWPN 50:05:07:63:03:08:06:0b, and on the VCB proxy, a 500 GB LUN was presented using that same WWPN -- apparently our SAN team recycles the WWPN's on the different fibre channel fabrics.

To make matters even worse, I noticed that the same LUN was presented using one WWPN to an ESX host, and with another WWPN to the VCB proxy (I am no SAN expert myself but I assume it is possible to present the same LUN in different SAN zones using different WWPN's). I was able to verify this since VCB was able to do a SAN backup of a virtual machine that resides on a LUN with a WWPN on the ESX side that is not presented to the VCB proxy.

The next step: VMFS ID's as a unique identifier
So, if you cannot rely on the WWPN's to uniquely identify a LUN on a host that is connected to multiple SAN's, then surely VCB must use the VMFS ID to know what LUN to read the virtual machine data from? Right?

On the VCB proxy & Windows machine, I tried to discover the VMFS ID's using the vcbSanDbg.exe tool (included in the VCB framework and available as a separate download from the VMware website -- careful, the separate download is an older version than the one included in the VCB 1.5 framework). An excerpt from its lengthy output:

C:\Program Files\VCB>vcbSanDbg | findstr "ID: NAA: volume"
[info] Found logical volume 48761b97-a4f562bd-6875-0017085d.
[info] Found logical volume 48761bc5-3f508baa-2f5d-0017085d.
[info] Found logical volume 483cf913-05b4f526-45b5-001cc497.
[info] Found logical volume 479da7ac-55fe7dfe-378c-001cc497.
[info] Found logical volume 477c2b4a-7db36616-30ea-001cc495.
[info] Found logical volume 48843bec-154cf784-871a-001cc495.
[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] ID: LVID:48761b97-dacedf9f-ebb9-0017085d0f91/48761b97-a4f562bd-6875-0017085d0f91/1
Name: 48761b97-a4f562bd-6875-0017085d
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] ID: LVID:48761bc6-7b4afa63-97d9-0017085d0f91/48761bc5-3f508baa-2f5d-0017085d0f91/1
Name: 48761bc5-3f508baa-2f5d-0017085d
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] ID: LVID:483cf913-458f9fa5-a749-001cc497e630/483cf913-05b4f526-45b5-001cc497e630/1
Name: 483cf913-05b4f526-45b5-001cc497
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] ID: LVID:479da7b6-877867e9-dd06-001cc497e630/479da7ac-55fe7dfe-378c-001cc497e630/1
Name: 479da7ac-55fe7dfe-378c-001cc497
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] ID: LVID:477c2b4a-969e01e0-8d49-001cc495fb46/477c2b4a-7db36616-30ea-001cc495fb46/1
Name: 477c2b4a-7db36616-30ea-001cc495
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130
[info] ID: LVID:48843bec-28cc17a4-ca9e-001cc495fb46/48843bec-154cf784-871a-001cc495fb46/1
Name: 48843bec-154cf784-871a-001cc495


Unfortunately, I was not able to discover the VMFS ID's I saw on the ESX host in this output, even though there are some resemblances:
  • ESX host VMFS ID 483cf914-29b60dc5-dbfd-001cc497e630 looks a lot like vcbSanDbg.exe output's logical volume 483cf913-05b4f526-45b5-001cc497.

  • ESX host VMFS ID 479da7c1-4494cd90-d327-001cc497e630 looks a lot like vcbSanDbg.exe output's logical volume 479da7ac-55fe7dfe-378c-001cc497.
Furthermore, I found out that current versions of VCB do not rely on the VMFS ID to discover virtual machines on a LUN. In Andy Tucker's talk "VMware Consolidated Backup: today and tomorrow" at VMworld 2007, it is clearly stated (slide 19) that there is:
No “VMFS Driver for Windows” on proxy

And furthermore that the usage of VMFS signatures is on the "todo" list for identifying LUNs on the SAN network (slide 34).

Other ideas?
So where does one turn when all possible solutions seem to lead to a dead end? Right: the VMware community forums. The answer came in this thread by snapper.

What I learned today is that besides the WWPN on a fiber channel network, there is another unique identifier called the NAA (Network Address Authority) to identify devices on the FC fabric. You can obtain the NAA for the LUN's on an ESX host using the esxcfg-mpath command in verbose mode using:

esxcfg-mpath -lv | grep ^Disk | grep -v vmhba0 | awk '{print $3,$5,$2}' | cut -b15-

The output on our ESX host looks much like this:

6005076303ffc60b0000000000001049323130373930 (256000MB) vmhba1:4:1
6005076303ffc60b000000000000104a323130373930 (256000MB) vmhba1:4:2

The NAA can be seen in the vcbSanDbg.exe output shown above, and can be filtered as follows:

vcbSanDbg.exe | findstr "NAA:"


The output should look like this:

C:\Program Files\VCB>vcbSanDbg | findstr "NAA:"

[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130


Et voila, now I can start running the esxcfg-mpath command on all our ESX hosts and start matching these NAA's with those in the output of vcbSanDbg to discover what our Windows VCB proxy has access to.
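To avoid doing that matching by hand, the two lists can be compared with a bit of scripting. A rough sketch (the proxy_naa.txt file is assumed to be the output of 'vcbSanDbg | findstr "NAA:"' copied over from the VCB proxy):

# On the ESX host: dump the NAA, size and runtime name of every SAN LUN into a file.
esxcfg-mpath -lv | grep ^Disk | grep -v vmhba0 | awk '{print $3,$5,$2}' | cut -b15- > esx_naa.txt

# Print every ESX LUN whose NAA also shows up in the VCB proxy output.
while read naa size path; do
  if grep -q "$naa" proxy_naa.txt; then
    echo "LUN $path $size ($naa) is visible to the VCB proxy"
  fi
done < esx_naa.txt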

Tuesday, August 12, 2008

VMWare D-Day: 12/08/2008

I reckon "12 August 2008" will be long remembered by all VMware enthusiasts out there.

That is the day that a major bug caused ESX 3.5 Update 2 to no longer recognise any license, even if the license file on your license server was perfectly valid. There is no need to sketch the horror that follows when your ESX clusters no longer detect a valid license: VMotion fails, DRS fails, HA fails, powering on virtual machines is no longer possible... Ironically, today is also Microsoft's Patch Tuesday of August, which probably means that quite a few system administrators were caught with their pants down (and their VM's powered off during a scheduled maintenance window) when this bug struck.

The symptoms and errors that we have been experiencing are the following:
  • Unable to VMotion a virtual machine from an ESX 3.0.2 host to an ESX 3.5 host. The VMotion progresses to 10% and then aborts with error messages such as "operation timed out" or "internal system error".

  • HA agent getting completely confused (unable to install, reconfigure for HA does not work).

  • Unable to power on new machines:

    [2008-08-12 14:11:16.022 'Vmsvc' 121330608 info] Failed to do Power Op: Error: Internal error
    [2008-08-12 14:11:16.065 'vm:/vmfs/volumes/48858dc4-f4e218d1-d3a8-001cc497e630/HOSTNAME/HOSTNAME.vmx' 121330608 warning] Failed operation
    [2008-08-12 14:11:16.066 'ha-eventmgr' 121330608 info] Event 15 : Failed to power on HOSTNAME on esx.test.local in ha-datacenter: A general system error occurred

VMware is promising a patch tomorrow, but several forum posts (here and here) are wondering how this patch will be distributed and -- given the deep integration of the licensing components within ESX -- whether it will require a reboot of the ESX host or not (which can be quite problematic if you cannot VMotion machines away). A possible workaround for this issue is to introduce a 3.0.2 host into the cluster, as I have seen in our environment that VMotioning from 3.5 to 3.0.2 still works.

Edit (21:20): hopes are up that VMware will be able to release a patch that doesn't require the ESX host to reboot. See what Toni Verbeiren has to say about it on his blog.

Edit (9:00 AM 13 AUG): a patch has been released by VMware. Regarding whether hosts need to be rebooted or not... there is good news and there is bad news: "to apply the patches, no reboot of ESX/ESXi hosts is required. One can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs need to be powered off before the patches are applied and powered back on afterwards."

You can follow the developing crisis at the usual community sources. Even our dear friends at Microsoft write about the problem; see the blogpost "It's rude to laugh at other people's misfortunes - even VMware's" here.

Tuesday, July 29, 2008

Full backups of virtual machines and Windows VSS

Introduction
One of the new features appearing in backup products that back up an entire virtual machine, as opposed to using an agent inside the guest operating system, is the ability to cooperate with Windows VSS (Volume Shadow Copy Service) inside the guest. For example, the recently released VMware Consolidated Backup 1.5 now supports VSS quiescing for Windows 2003, Windows Vista and Windows 2008; vizioncore's vRanger Pro backup utility has been supporting VSS for Windows 2003 for some versions already.

Several opinions exist on whether this is in fact a useful feature or not; for example, not so long ago the developers of esXpress talked about not including VSS quiescing into their product at that time because it adds additional complexity and does not offer any significant benefits in their opinion (see here). This discussion is still alive as you can see for example here, and the big question is indeed: can you rely on live backups of database virtual machines?

The early days of VSS
The root of the discussion lies in the intended use of VSS: on a physical machine that is running a database application such as SQL Server, Exchange, or even Active Directory or a DHCP server for that matter, you cannot directly read the database files since they are exclusively locked by the database application. This used to be particularly troublesome because the only way to get a backup of the data inside such a database was to use some sort of export function built into the database application (think of the BACKUP TSQL command or a brick-level backup of an Exchange server).

Microsoft tackled this problem by introducing VSS, which presents a fully readable point-in-time snapshot of a filesystem to the (backup) application that initiates the snapshot. That way, a backup application can read the database file contents and put it away safely in case it is ever needed.

However, there are two problems when reading files from a filesystem that is "frozen" in time:
  • a file can be in the process of being written (e.g. only 400 bytes of a 512-byte block are filled with actual data).
  • data can still be sitting in a filesystem cache or buffer in memory and not yet be written to disk (or to the filesystem journal).
On top of the filesystem issues, there are two problems when reading a database that is still in use but "frozen" purely at a filesystem level:
  • at the time of the snapshot, a transaction could still be in progress. This can be an issue when the transaction is not supposed to be committed to the database at the end: as you know, a database query can initiate thousands of changes and perform a ROLLBACK at the end to reset any changes made since the start of the transaction.

    A good (fictitious) example: you try to draw 1000 euros in cash from an ATM, but change your mind right before clicking the "confirm transaction" button on the ATM screen. You don't want your 1000 euros to be really gone because a database snapshot was taken at that same moment and your final "ROLLBACK" command is not included in the database!

  • some data could still be in memory and not written to a logfile or a database file (so-called "dirty pages").
Crash consistency versus transactional consistency
If you don't take these four problems into account, then restoring a snapshot of such a filesystem is in fact the same as bringing the server back up after suddenly pulling the power plug. Such a snapshot is said to be in a crash-consistent state, i.e. the same state as after a sudden power loss.

Modern filesystems have built-in mechanisms (so-called "journalling") to tackle these problems and to ensure that when such a "frozen" filesystem is restored from a backup, the open files are put back into as consistent a state as possible. Obviously, any data that only existed in memory and was never written to a filesystem journal/disk is lost. Databases rely on transaction logging to recover from a crash-consistent state back to a consistent database; this is typically done by simply rolling back all unfinished transactions, effectively ignoring all transactions that were not committed or rolled back.

Windows VSS wants to go beyond a crash-consistent snapshot and solves both the filesystem and the database problem by not only freezing all I/O to the filesystem, but also asking both the filesystem and all applications to flush their dirty data to disk. This allows the creation of a backup that is both filesystem-consistent and application-consistent. VSS has built-in support for several Windows-native technologies such as NTFS filesystems, Active Directory databases, DNS databases, ... to flush their data to disk before the snapshot is presented to the backup application requesting the snapshot. Other programs, such as SQL/Oracle databases or Exchange mail servers, use "VSS Writer" plugins to get notified when a VSS snapshot is taken and when they have to flush their dirty database pages to disk to bring the database into a transactionally consistent state.
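As a practical aside, you can check which applications on a given Windows guest actually register such a VSS writer using the built-in vssadmin tool (run from a command prompt inside the guest); anything without a writer will only ever be crash-consistent in the shadow copy:

C:\> vssadmin list writers
C:\> vssadmin list providers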

From Technet:

[...] If an application has no writer, the shadow copy will still occur and all of the data, in whatever form it is in at the time of the copy, will be included in the shadow copy. This means that there might be inconsistent data that is now contained in the shadow copy. This data inconsistency is caused by incomplete writes, data buffered in the application that is not written, or open files that are in the middle of a write operation. Even though the file system flushes all buffers prior to creating a shadow copy, the data on the disk can only be guaranteed to be crash-consistent if the application has completed all transactions and has written all of the data to the disk. (Data on disk is “crash-consistent” if it is the same as it would be after a system failure or power outage.). [...] All files that were open will still exist, but are not guaranteed to be free of incomplete I/O operations or data corruption.

Under this design, the responsibility for data consistency has been shifted from the requestor application to the production application. The advantage of this approach is that application developers — those most knowledgeable about their applications — can ensure, through development of their own writers, the maximum effectiveness of the shadow copy creation process.

Conclusions for the physical world: the above makes clear that there is a huge benefit in using VSS when working on physical machines: VSS is a requirement to be able to back up the entire database files and to ensure that the database is not in an inconsistent state when you restore the database and log files and attempt to mount them. The main advantage here is that a restored database does not have to go through a series of consistency checks that typically take up many, many hours.

Going to the virtual world
In the virtual world, there are several different types of backups that can be performed:
  • Performing the backup inside the guest OS.
  • Performing a backup of the harddisk files (VHD/VMDK) when using a virtualization product that is hosted on another operating system, such as Microsoft Virtual Server or VMWare Workstation/Server.
  • Performing a backup of the harddisk files (VHD/VMDK) when using a bare-metal hypervisor based product such as Microsoft Hyper-V or VMWare's ESX/ESXi Server.
Obviously, when you perform the backup inside the guest OS, you still encounter the same problems as when attempting to back up a physical host: open files and database files are locked and thus cannot be backed up directly, so you have to resort to using VSS for the reasons discussed above.

But what about the other two ways of performing a virtual machine backup, when attempting to back up the entire harddisk file? For starters, it is important to realize that "file locking" now occurs at two levels:
  1. The VHD/VMDK harddisk files themselves are opened and locked by the virtualization software (be it the hypervisor for bare-metal virtualization or the executable when using hosted virtualization);
  2. Files can be opened and locked inside the guest operating system.
The first issue, the open VHD/VMDK hard disk files, is solved differently depending on the virtualization product: if you are using hosted virtualization, you can obtain a readable VHD/VMDK file by using VSS on the host operating system and asking it to present an application-consistent variant of the VHD/VMDK files. If you are using a bare-metal hypervisor, a typical mechanism is taking a snapshot of the virtual machine (which, for example in VMware ESX, shifts the file lock from the VMDK file to the snapshot delta file, thus releasing the VMDK file for reading).

Open files inside the guest OS
Ironically, the solution to the first problem of open VHD/VMDK files introduces the second problem of open files inside the guest OS: once you have your snapshot of the VHD/VMDK files (be it through VSS for hosted virtualization or a VM snapshot for bare-metal hypervisors)... that snapshot is only in a crash-consistent state! After all, it is a point-in-time "freeze" of the entire hard disk, and restoring such an image file would be equivalent to restarting the server after a total power loss occurred.

VMware attempted to tackle this problem by introducing a "filesystem sync driver" in their VMware Tools (which you are supposed to install in every virtual machine running on a VMware product). This filesystem sync driver mimics VSS in the sense that it requests that the filesystem flush its buffers to disk, guaranteeing that the snapshot -- and thus the corresponding full virtual machine backup -- is in a filesystem-consistent state. Obviously, this does not solve the problem for databases, which tend to react quite violently to these kinds of non-VSS "freezes" of the filesystem. Prototype horror stories can be read here (AD) and here (Exchange).

So what are the real solutions for this problem? I can think of two at this moment:
  1. After taking a snapshot, do not only back up the disks but also the memory. Then, when restoring the backup, do not "power on" the virtual machine but instead "resume" it. At first, the machine will probably be "shocked" to see that time has leapt forward and that many TCP/IP connections are suddenly being dropped, but the database server you are running should be able to handle this and properly commit any unsaved data from memory to disk.

  2. Trigger a VSS operation inside the guest OS to commit all changes to disk and ensure filesystem- and applicationlevel consistency, and only then take the full virtual machine snapshot.
The VSS interaction with the guest operating system was first introduced by vizioncore in their vRanger Pro 3.2.0 -- which required the installation of an additional service inside the guest VM plus .NET 2.0, and was only officially supported for 32-bit Windows 2003 SP1+. With the release of VMware Consolidated Backup 1.5, VMware announced that the default quiescing of disks on ESX 3.5 Update 2 would now be done using the new VSS driver -- supported on Windows 2003/2008/Vista in both 32- and 64-bit variants. Hurray! Problem solved, right?

So VSS seems nice, but is it necessary?
Obviously, your gut feeling will tell you that it is "nicer" and "more gentle" to the guest virtual machine when using VSS when taking a snapshot and a backup. The arguments on the difference between crash-consistency, filesystem consistency and application-level consistency (which translates to transactional consistency for databases) give solid grounds to this gut feeling.

Personally, I cannot find an argument that says VSS is really necessary to create a full virtual machine backup. In the physical world, filesystems and databases have been hardened to recover from crashes, which is exactly the crash-consistent state you obtain when taking a snapshot of a running virtual machine to back up and restore. Hands-on experience with this robustness can be read through several informal channels, such as forum posts here.

However, if you want to be sure that your database is in a consistent state (for a faster recovery), and want certainty that those few seconds of data that were not yet committed from memory to disk are in fact included in your snapshot, then VSS is what you need. The next question to answer is: what is the risk of VSS messing up, and is that probability larger than the risk of not being able to restore a non-VSS-based snapshot?

Conclusion
Performing live backups of virtual machines seems like an interesting and simple feature of virtualisation at first. However, at second glance, there are some important decisions to be made regarding the use of VSS/snapshotting technology that can impact your restore strategy and success. Even without any quiescing mechanism, the operating system should be able to handle the crash-consistent backups produced by live machine backups, which should therefore be sufficiently reliable. With the ready availability of VSS in the new VMware Tools that come with ESX 3.5 Update 2, much more than crash-consistent backups can be guaranteed without the need to install additional agents. The increased reliability and faster restore time (no filesystem/database consistency checks) that come with VSS-quiesced snapshots make full virtual machine backups a fully mature solution, without the need to worry about possibly inconsistent backups.

Side remarks
Some additional remarks regarding full virtual machine backup:
  • Full VM backups can be an addition to guest-based file level backups, but they can never be a complete replacement:

    • you might take a full-VM-based snapshot of your Exchange or SQL database every day, but a file-based/brick-level backup (which is far more convenient for your typical single-file/single-mailbox restore operations) might be taken several times a day, depending on the SLA that your IT department has with the rest of the company.

    • a full VM backup is a good place to start a full server recovery. It is a bad place to start a single-file or single-mailbox restore.

    • full VM backups using VSS do not allow the backup of SQL transaction logs (see "what is not supported" in the SQL VSS Writer overview), nor do they commit transaction logs to the database in order to truncate them (an absolute necessity for Exchange databases or for several types of SQL databases).

  • Microsoft does not support any form of snapshotting technology on domain controllers. For more information, see MSKB 888794 on "Considerations when hosting Active Directory domain controller in virtual hosting environments".
Edit (12 Aug 2008): Veeam has released a very interesting whitepaper that discusses the necessity of VSS awareness not only during the backup process, but also during the restore process. They give the example of a domain controller that performs a USN rollback when it is backed up using VSS but not restored using VSS-aware software. Another nice example is Exchange 2003, which requires VSS-aware restore software in order to be supported by Microsoft.


Postscriptum: I started writing this article a few days before VCB 1.5 was released, and the original point I was trying to make at that time was that there were too many disadvantages to the available VSS implementations (yet another service to install, .NET 2.0, very limited OS support) to really profit from the benefits that VSS could offer. Of course, in the meantime, VMWare has taken away most of those objections by including VSS support in their VMTools for a wide range of server operating systems. This forced me to reconsider my view on whether VSS would be a good idea or not.

Sunday, May 25, 2008

Installing LSI Logic RAID monitoring tools under the ESX service console

As I discussed in a recent post, I used a Dell Perc 5i SAS controller in my ESX whitebox server. One of the nice features of this controller is that it is a rebranded LSI Logic controller (with a different board layout!), supported by LSI Logic firmwares and the excellent monitoring tools that LSI offers.

Of course, it is important to keep track of your RAID array status, so I decided to install the MegaCLI monitoring software under the ESX Server 3.5 Service Console. Here's how I did it and configured the monitoring on my system:
  • The MegaCLI software can be downloaded from the LSI Logic website. I used version 1.01.39 for Linux, which comes in an RPM file.

  • After uploading the RPM file to the service console, it was a matter of installing it using the "rpm" command:

    rpm -i -v MegaCli-1.01.39-0.i386.rpm

    This installs the "MegaCli" and "MegaCli64" commands in the /opt/MegaRAID/MegaCli/ directory of the service console.
That's it, MegaCLI is ready to be used now. Some useful commands are the following:
  • /opt/MegaRAID/MegaCli/MegaCli -AdpAllInfo -aALL
    This lists the adapter information for all LSI Logic adapters found in your system.

  • /opt/MegaRAID/MegaCli/MegaCli -LDInfo -LALL -aALL
    This lists the logical drives for all LSI Logic adapters found in your system. The "State" should be set to "optimal" in order to have a fully operational array.

  • /opt/MegaRAID/MegaCli/MegaCli -PDList -aALL
    This lists all the physical drives for the adapters in your system; the "Firmware state" indicates whether the drive is online or not.
The next step is to automate the analysis of the drive status and to alert when things go bad. To do this, I added an hourly cron job that lists the physical drives and then analyzes the output of the MegaCLI command.
  • I created a file called "analysis.awk" in the /opt/MegaRAID/MegaCli directory with the following contents:

    # This is a little AWK program that interprets MegaCLI output

    /Device Id/ { counter += 1; device[counter] = $3 }
    /Firmware state/ { state_drive[counter] = $3 }
    /Inquiry/ { name_drive[counter] = $3 " " $4 " " $5 " " $6 }
    END {
    for (i=1; i<=counter; i+=1) printf ( "Device %02d (%s) status is: %s <br/>\n", device[i], name_drive[i], state_drive[i]); }

    This awk program processes the output of MegaCli, as you can test by running the following command:

    ./MegaCli -PDList -aALL | awk -f analysis.awk

    from within the /opt/MegaRAID/MegaCli directory.

  • Then I created the cron job by placing a file called raidstatus in /etc/cron.hourly, with the following contents:

    #!/bin/sh

    /opt/MegaRAID/MegaCli/MegaCli -PdList -aALL| awk -f /opt/MegaRAID/MegaCli/analysis.awk >/tmp/megarc.raidstatus

    if grep -qv ": Online" /tmp/megarc.raidstatus
    then
    /usr/local/bin/smtp_send.pl -t tim@pretnet.local -s "Warning: RAID status no longer optimal" -f esx@pretnet.local -m "`cat /tmp/megarc.raidstatus`" -r exchange.pretnet.local
    fi

    rm -f /tmp/megarc.raidstatus
    exit 0

    Don't forget to run a chmod a+x /etc/cron.hourly/raidstatus in order to make the file executable by all users.
In order to send an e-mail when things go wrong, I used the SMTP_Send Perl script smtp_send.pl that was discussed by Duncan Epping on his blog.

Thursday, May 22, 2008

Renaming a VirtualCenter 2.5 server

After running my VirtualCenter server on a standalone host for quite some time, I decided to join it to the domain that I am running on my ESX box (in order to let it participate in the automated WSUS patching mechanism). This also seemed like a perfect opportunity to rename the server's hostname from W2K3-VC.pretnet.local to virtualcenter.pretnet.local. However, after the hostname change, the VMware VirtualCenter service would no longer start, logging Event ID 1000 in the event log.

Somehow, this didn't come as a surprise ;). This has been discussed before on the VMware forums (here and here), but I post it here because I did not immediately find a step-by-step walkthrough.

The problem was in fact twofold, the solution rather simple:
  • Renaming SQL servers is a bad idea in general (so it appears). For my small, non-production environment, I use the SQL Server 2005 Express edition that comes with the VirtualCenter installation. If you rename a SQL server, you need to update its internal system tables using a set of stored procedures in order to make everything consistent again. This is done by installing the "SQL Server Management Studio Express" and then executing the following TSQL statements:

    sp_dropserver 'W2K3-VC\SQLEXP_VIM'
    GO
    sp_addserver 'VIRTUALCENTER\SQLEXP_VIM', local
    GO
    sp_helpserver
    SELECT @@SERVERNAME, SERVERPROPERTY('ServerName')


    The first statement removes the old server instance (replace W2K3-VC with your old server name), the second statement adds the new server instance (replace VIRTUALCENTER with your new server name). The sp_helpserver and SELECT statements query the internal catalog and variables for the currently recognized SQL server instances. You need to restart the server (or at least the SQL Server service) before the last two statements report the new instance name.

  • Secondly, the System ODBC connection that is used by VMWare required an update to point to the new SQL Server instance. This was of course done using the familiar "Data Sources (ODBC)" management console.
Afterwards, the VMWare Virtual Center Server service started just fine again.

Sunday, March 9, 2008

ESX 3.5 on a whitebox

It has been very quiet from my end for the past weeks because I was very busy at a client & at the same time spending all my free time working on my ESX-on-whitebox hardware project. After being inspired by some colleagues, I decided to order the following hardware:
  • Asus P5BP-E/4L motherboard
    This motherboard supports an Intel S775 processor, has VGA and audio onboard and most importantly, the LAN controllers on this motherboard are ESX certified (Broadcom 57xx chipset).

  • Intel Q6600 Quad Core processor (2.4 GHz) and 8 GB ECC RAM (4x 2GB)
    Just to be sure I have enough CPU power and memory resource pools :)

  • Dell Perc 5i Integrated SAS Controller
    My colleagues advised me that storage was the biggest bottleneck in their ESX whiteboxes (based around the very nice Asus P5M2/SAS board), so I decided to go for a dedicated hardware RAID controller. I picked up a Dell Perc 5i controller -- more or less a rebranded LSI Logic 8408 SAS controller -- on eBay, with 256 MB of RAM and a battery backup unit, for about 175 EUR.

    The main advantage of SAS controllers is that they also support the (cheaper) SATA consumer drives. A quick test confirmed this; I had absolutely no problems at all with this controller & even flashed the latest LSI Logic firmware to it :).

    Maybe of interest for some: the later Dell firmwares and also the later LSI logic firmwares for this controller provide support for Write Back without a BBU present.

  • SATA to SAS cables
    The Dell Perc 5i has SFF-8484 SAS connectors on board, so I purchased two Adaptec SFF-8484 to 4xSATA cables from a nearby store to attach all the drives.

  • 8 Seagate SATA harddisks (4x 1TB and 4x 200GB)
    Space... loads of space.
The hardest thing was getting all these disks into my Silentmaxx ST11 case; it required some case modding and loads of patience to get everything to fit well. The 500W PSU necessary to provide enough juice was recycled from an Antec Sonata case. I also added a small 3Com 3C905 100 Mbps card for my ISP modem connection.

The installation of ESX 3.5 was a piece of cake, and I can confirm that the above hardware works like a charm. For those interested, I also noticed that ESX 3.5 supports the ICH7 SATA controllers (found on many consumer motherboards as well). I think -- but this has to be confirmed by someone else -- that you need to configure your ICH7 disks in a RAID set before the ESX kernel will accept them as a storage pool.