Monday, August 31, 2009

Hosting your DNS on vSphere 4 - caveat

For a while now, I had been having an issue with my whitebox ESX 4.0 server: after rebooting the machine, I was unable to connect to it using the vSphere client. The error I received was a simple "503: Service unavailable". The hostd.log on the host was filled with errors like:

--F637FB90 warning 'Proxysvc Req00002'-- Connection to localhost:8309 failed with error N7Vmacore15SystemExceptionE(Connection refused).

I also noticed that /var/log/messages contained a lot of vmware-authd start & stop messages. After some struggling, I found a workaround, which consisted of:

  • Log onto the service console as root

  • Edit the /etc/vmware/hostd/config.xml file and disable the "proxysvc" component of hostd

  • Restart the hostd process (service mgmt-vmware restart)

  • Wait for all autostart VMs to come online

  • Re-enable the "proxysvc" component and restart hostd once again

Today, I discovered this thread on the VMware communities which contained the answer I was looking for: the DNS servers I had configured on my ESX box were virtual machines running on the box itself (in my case: a m0n0wall virtual appliance and a Windows 2008 domain controller with DNS). Apparently this disrupts the proxysvc component of hostd: the virtual DNS servers are not reachable when hostd is first started (autostart has yet to kick in), so proxysvc fails to start properly and vSphere client connections are refused. Furthermore, this prevented the autostart of VMs altogether, so DNS never came up at all.

The solution was to clear my /etc/resolv.conf file, and now everything works fine immediately after a reboot (no more attempts to connect to a virtual machine that is not yet running)! This completely removes DNS resolution on the host (in particular, if you are using HA, you'll need to maintain /etc/hosts carefully). Since a typical production environment probably does not run its entire DNS infrastructure on one or more virtual machines, you will probably never be exposed to this issue anyway.
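To check whether a host is exposed to this problem, you can compare the nameservers in /etc/resolv.conf with the addresses of the VMs that run on that same host. The sketch below is only an illustration: the function names are my own, the IP addresses in the sample are placeholders, and the list of local VM addresses has to be supplied by hand.

```python
def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' or ';' at the start of a line).
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def self_hosted_nameservers(resolv_conf_text, local_vm_addresses):
    """Return the configured nameservers that are VMs on this same host."""
    local = set(local_vm_addresses)
    return [s for s in parse_nameservers(resolv_conf_text) if s in local]


if __name__ == "__main__":
    # Placeholder addresses, for illustration only.
    sample = "search example.local\nnameserver 192.0.2.10\nnameserver 192.0.2.1\n"
    print(self_hosted_nameservers(sample, ["192.0.2.10"]))
```

Any address this reports is a DNS server that will not be reachable while hostd starts, so it is a candidate for removal from resolv.conf (with matching /etc/hosts entries added instead).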

Thursday, April 9, 2009

Active Directory over SSL in VMware Lifecycle Manager

I have recently been playing around with VMware's Lifecycle Manager appliance, and one of the small "gotchas" I ran into was how to configure secure communications between the LCM appliance and the Active Directory backend I was authenticating against.


After configuring LCM to use Active Directory and SSL, I was getting the following error message:
Error: Unable to connect to LDAP Server / simple bind failed: dc.pretnet.local:636

In order to get the SSL authentication working for Active Directory (or LDAP in general), you need to be sure that the Certificate Authority that issues your domain controller certificates is trusted by the appliance (you don't need to actually import the domain controller certificate itself, just the issuing CA is sufficient). This is done by going through the following steps:
  1. First, obtain a copy of the issuing certification authority's certificate (without the private key, obviously). Ensure that it is in the X.509 format, either Base64 encoded or DER encoded. The appliance doesn't seem to support certificate containers (P7B format), so when you export the certificate using the Certificates MMC, make sure you select one of the first two options as the export format!


  2. To add the X.509 certificate to the appliance, go to the "Network" tab and select the "SSL Certificate" configuration pane. Here, import the certificate file.


  3. Next, restart the "VMO Configuration Server", which you can find at the bottom of the "Server" tab in the GUI.


    Note: if you get an error message that first you need to fix your LDAP configuration (and "Plugins" section) before you can restart the VMO Configuration Service, go back to the LDAP configuration and disable SSL for a moment.
That's it! Secure Active Directory authentication (which is what we all want) is now working properly! It's a good idea to import the certificate right away, because your other configuration tasks are severely limited when authentication (either using the built-in OpenLDAP server on the appliance, or using Active Directory) is not working properly.

As a sidenote, I would like to add that, despite VMware recommending to run Lifecycle Manager on a dedicated Windows box (LCM Administration Guide v1.01, p21), the appliance is a really convenient way of running and upgrading this product without too much hassle. Of course, don't forget to offload the configuration database from the appliance (use a dedicated SQL or Oracle server)!

Monday, December 22, 2008

Counting ESX Server storage paths

At a customer, we have been hitting one of the built-in storage limits of ESX Server: you can only present up to 1024 storage paths to a single ESX host. Depending on your SAN topology, each LUN that you present over a fiber fabric uses 4, 8 or even 16 storage paths. You can check this using the esxcfg-mpath command:

Disk vmhba1:9:2 /dev/sdf (102400MB) has 8 paths and policy of Fixed
FC 13:0.0 10000000c96e8972<->50001fe15009264e vmhba1:9:2 On active preferred
FC 13:0.0 10000000c96e8972<->50001fe15009264a vmhba1:10:2 On
FC 13:0.0 10000000c96e8972<->50001fe15009264c vmhba1:11:2 On
FC 13:0.0 10000000c96e8972<->50001fe150092648 vmhba1:12:2 On
FC 16:0.0 10000000c96e8ccc<->50001fe15009264f vmhba2:12:2 On
FC 16:0.0 10000000c96e8ccc<->50001fe15009264b vmhba2:13:2 On
FC 16:0.0 10000000c96e8ccc<->50001fe15009264d vmhba2:14:2 On
FC 16:0.0 10000000c96e8ccc<->50001fe150092649 vmhba2:15:2 On


To count the total number of paths presented to a single ESX host, you can use the following service console command:

esxcfg-mpath -l | grep paths | awk '{ split($0, array, "has "); split(array[2], array2, " paths"); SUM +=array2[1] } END { print SUM}'

Probably the awk syntax can be greatly shortened but I am no awk/grep/sed expert :). Nevertheless, you can script this command into a cron job such that you can receive reports on whether or not you are hitting this limit.
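For the cron-job idea, the same count can also be done in a few lines of Python. This is only a sketch under assumptions: the function names and the 90% warning threshold are my own choices, and the regular expression is based solely on the "Disk ... has N paths" line format shown above.

```python
import re

# Per-host storage path limit quoted in the post above.
PATH_LIMIT = 1024

# Matches the "Disk ... has N paths ..." summary lines of `esxcfg-mpath -l`.
_DISK_LINE = re.compile(r"^Disk\s+\S+.*\bhas\s+(\d+)\s+paths\b")


def count_paths(mpath_output):
    """Sum the path counts of all "Disk" summary lines."""
    total = 0
    for line in mpath_output.splitlines():
        match = _DISK_LINE.match(line.strip())
        if match:
            total += int(match.group(1))
    return total


def usage_report(total, limit=PATH_LIMIT, warn_ratio=0.9):
    """Return a one-line status; warn_ratio is an arbitrary threshold."""
    status = "WARNING" if total >= limit * warn_ratio else "OK"
    return "%s: %d of %d storage paths in use" % (status, total, limit)


if __name__ == "__main__":
    import sys
    # Example: esxcfg-mpath -l | python count_paths.py
    print(usage_report(count_paths(sys.stdin.read())))
```

Reading from standard input keeps the script independent of how the output is collected, so it can be fed from a saved file as easily as from the live command.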

Sunday, November 30, 2008

App-V 4.5 Certificate Galore

1) Setting
This weekend I finally found some time to delve a bit deeper into properly configuring an App-V 4.5 infrastructure for large scale deployments. One of the first things that I investigated was the usage of RTSPS for smoother firewall tunneling: as you know, when using RTSP a series of ports is dynamically chosen, which means that you need to open up entire port ranges in your firewall. This is not something your firewall guys will like if you work in a larger environment.

Going for RTSPS means you need to use a server public certificate and a corresponding private key in order to let the App-V server sign and encrypt its communications. I have blogged before about how to configure this in SoftGrid 4.1/4.2 -- luckily the procedure for configuring an SSL certificate got a lot simpler. At least, that is what I thought. Some issues I ran into that might save you some valuable troubleshooting time:
  • As always, when requesting a certificate from your Enterprise PKI, use the Virtual Application Server's FQDN as the subject. It is probably also a good idea to add the hostname as a subject alternative name for those people that still refer to servers by their short names.

  • After the App-V 4.5 Web Management Service has been installed, don't forget to configure the certificate for the IIS Default Website. In IIS7, that requires adding a binding & selecting the proper certificate. It is not clear to me why the App-V installer cannot handle this automatically!?

  • App-V 4.5 runs under the NETWORK SERVICE account by default and no longer under the SYSTEM account as SoftGrid 4.1/4.2 used to. This has some consequences when it comes to Windows PKI: you need to grant the NETWORK SERVICE account read permissions on the private key.
This latter action is a lot harder than it sounds ;). Read on for more information.

2) Configuring permissions on private keys
You have three options to get this working:
  • If you are using a Windows 2008 Enterprise CA and are using your own certificate templates, then you can modify the template to automatically grant the NETWORK SERVICE account read permissions on all certificates issued using that template.


    Since you will typically be creating a new certificate template for server deployment (to enable longer than 2 years validity & exporting of private keys), this is probably the easiest solution if you have a Windows Server 2008 Enterprise CA.

  • In a pre-Windows 2008 CA world, you will have to use the WinHTTPcertcfg.exe tool, the Windows HTTP Services Certificate Configuration tool. In our situation, we need to modify the ACL of the certificate to grant read access to the service account of the Management Service (which is the NETWORK SERVICE by default).

    winhttpcertcfg -g -c LOCAL_MACHINE\My -s (subjectname) -a NetworkService

    Verify that everything went ok by listing the permissions:

    winhttpcertcfg -l -c LOCAL_MACHINE\My -s (subjectname)

  • It is also possible to explicitly set the permissions on the private key file. This information is based on information obtained from the App-V blog, with some corrections below.

    • First, obtain the certificate thumbprint. You can find it on the Details tab of the certificate. Copy the thumbprint for use in the next command line.

    • Next, use the FindPrivateKey.exe utility to locate the private key file on disk (compiled version available here -- download & use untrusted executables from the internet at your own risk). Use the following syntax:

      FindPrivateKey.exe My LocalMachine -t "your thumbprint"

      This will give you the full path. Read the caveat message below if this path looks awkward.

    • Grant the NETWORK SERVICE account read & execute permissions on the private key file.

  • CAVEAT: the private key should be in a publicly accessible (machine-wide) location. For WinXP/Win2K3 the default is:

      C:\Documents and Settings\All Users\Application Data\Microsoft\Crypto\RSA\MachineKeys

    For W2K8/Vista, this changed to:

      C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys

    If you have a different location, take action to move the private key. I requested my certificate through the Web Enrollment pages of Active Directory Certificate Services on Windows 2008, which stores the public & private key in your user account's profile by default. I knew this and dragged & dropped the certificate from the "Certificates (My User)" MMC to the "Certificates (My Computer)" MMC; when the private key is marked as exportable, this is indeed possible. However, it does not actually move the private key, which stays in your user profile location (for example: C:\Users\Administrator\AppData\Roaming\Microsoft\Crypto\RSA). I fixed this by explicitly exporting the certificate & private key from my user account and then explicitly importing everything again. So a huge warning for all you regular crypto users: no more drag 'n dropping of public/private keypairs!
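The check described in the caveat (does the path reported by FindPrivateKey.exe sit in one of the two default machine key folders?) can be sketched as follows. The function name and the key file names in the example are made up for illustration; only the two folder paths come from the text above.

```python
import ntpath

# The two default machine key store folders mentioned in the caveat.
MACHINE_KEY_DIRS = [
    r"C:\Documents and Settings\All Users\Application Data\Microsoft\Crypto\RSA\MachineKeys",
    r"C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys",
]


def is_machine_key(path):
    """Return True if the key file sits directly in a machine key folder."""
    # ntpath is used so the check behaves the same on any platform.
    folder = ntpath.normcase(ntpath.dirname(ntpath.normpath(path)))
    return any(folder == ntpath.normcase(d) for d in MACHINE_KEY_DIRS)


if __name__ == "__main__":
    # Hypothetical key file names, for illustration only.
    print(is_machine_key(r"C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys\examplekey"))
    print(is_machine_key(r"C:\Users\Administrator\AppData\Roaming\Microsoft\Crypto\RSA\examplekey"))
```

A path that fails this check is most likely still in a user profile, which is the situation that calls for the export and re-import described above.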
3) Conclusion
A bit messy... yet secure! The move towards the NETWORK SERVICE account for the App-V Management service (... and other Microsoft products as well) is obviously a good choice, yet it brings along some difficulties that probably can be streamlined from within the App-V Management Server's installer.

PS: You didn't forget to also grant the NETWORK SERVICE account read permissions on your content directory, did you? Otherwise your streaming won't work.

Friday, November 21, 2008

VMware Tools without a reboot?

Every now and then, you see blogposts appearing on the "issue" that you need to reboot a guest operating system after you install or update the VMware Tools. Many people have wondered whether a reboot is in fact really necessary and whether it can be avoided altogether. Recent posts about this can be read here and here, referring to this VMware community thread -- the question is still alive in multiple-year spanning threads like this one right here. I usually raise my eyebrows when reading these "no reboot" topics, yet I am interested in keeping up with the advancements in that subject for some of the large customers that I come in contact with professionally.

The scripts and methods outlined in these blogposts sound a bit tricky at first if you ask me, and I feared they might not have the outcome you expect. I would think the VMware Tools really require a reboot on some operating systems because you update parts of the virtual device drivers, and those need to be reloaded by a reboot of the operating system (Note: strictly speaking you don't need a reboot for all types of device drivers, only under a specific set of circumstances documented by Microsoft. The VMware disk drivers host a boot device, so that would fit under the "requires a reboot" category from that document). This means that just running the installer with a "Suppress Reboot" parameter on all your machines will place the new VMware Tools files on your hard disk, but will not actively load all of them... I am not sure that is a state I would want my production virtual machines in!? And to be very clear: what these scripts do is request that the reboot be postponed automatically; they do not trigger some hidden functionality in VMware Tools that makes the reboot unnecessary after all!

To remove all suspicion, I did a little test on a Windows 2003 virtual machine and upgraded the tools from ESX 3.0.2 to ESX 3.5U2 without rebooting (using the commandline setup.exe /S /v"REBOOT=R /qb" on the VMware Tools ISO). This effectively updates the following services and drivers without rebooting:
  • VMware services (bumped from build 63195 to build 110268)
  • VMware SVGA II driver, VMware Pointing Device driver
It left the following drivers untouched:
  • VMware Virtual disk SCSI Disk Device ("dummy" harddisk driver - Microsoft driver)
  • NECVMWar VMware IDE CDR10 (virtual CD-ROM driver)
  • Intel Pro/1000 MT Network Connection (vmnet driver - Microsoft driver)
  • LSI Logic PCI-X Ultra320 SCSI Host Adapter (storage adapter - Microsoft driver)
It turned out that these drivers didn't require updating for my specific virtual machine (even after a reboot). In fact, I wasn't immediately able to find a single machine in the test environment at work that required updating any boot disk device drivers (and some still had 3.0.2 VMware Tools running!).

To conclude, I would say that in some circumstances it is safe to postpone the reboot of your virtual machine, if at minimum the boot disk device drivers are not touched. Postponing the reboot is very convenient if you use it in the context of a patch weekend where you want to postpone the restart to one big, single reboot at the end of all your patches.

Update: as Duncan Epping points out in a recent blogpost, be also advised that updating the network driver effectively drops all network connections. For all practical purposes, this may be just as bad as actually rebooting your server, so beware of the "fake level of safety and comfort" that you might have by postponing a VMware Tools reboot!

Thursday, August 14, 2008

Matching LUN's between ESX hosts and a VCB proxy

One of the problems that I encountered at a customer was discovering which VMFS partitions were presented to a VCB proxy. It turned out to be a bit more complex than I had first expected.

Introduction
VMware released the VCB framework (VMware Consolidated Backup) to make backups of virtual machines. The VCB framework is typically installed on a Windows host (the VCB proxy), and in order to make SAN backups, you need to present both the source LUN, which contains the virtual machines to back up, and the destination LUN, where the backup files are stored, to that VCB proxy.

This setup is relatively simple to maintain in smaller environments. However, once you get into a big environment where a dozen teams are involved (separate networking teams, separate SAN teams, separate Windows teams and separate VMware teams), it can become quite challenging to find out which of the 12 LUNs that are presented to a Windows host in fact belong to a specific ESX host.

Finding unique identifiers for a LUN
The mission is to find a unique identifier (UID) that can be used both on the ESX host and the Windows box. The first two obvious candidates to uniquely identify an ESX-managed LUN on a SAN network are:
  • The VMFS ID for the partition
    Upon the initialization of a VMFS partition, it is assigned a unique identifier that can be found by looking in the /vmfs/volumes directory on an ESX host, or by using the esxcfg-vmhbadevs -m command on the ESX host. The output looks like this:

    vmhba1:0:2:1 /dev/sdb1 48858dc4-f4e218d1-d3a8-001cc497e630
    vmhba1:4:1:1 /dev/sdc1 483cf914-29b60dc5-dbfd-001cc497e630
    vmhba1:4:2:1 /dev/sdd1 479da7c1-4494cd90-d327-001cc497e630


    The first disk is the remainder of the locally attached storage, and the two other disks are presented from the SAN. For the SAN disks, the first column indicates that HBA 1, SCSI target 4 and LUNs 1 and 2 are used (and partition 1 on each LUN); the second column lists the Linux device name under the Service Console and the third column lists the VMFS ID.

  • The WWPN (World Wide Port Name) of the disk on the SAN
    On a fiber-channel SAN network, each device is assigned a unique identifier called the WWPN. The WWPN plays much the same role as a MAC address on an Ethernet network. The WWPNs of the disks that are presented to an ESX host can be obtained from the Service Console using the esxcfg-mpath -l command:

    Disk vmhba1:4:1 /dev/sdc (256000MB) has 16 paths and policy of Fixed
    FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:1 On
    FC 13:0.0 10000000c96e8972<->500507630313060b vmhba1:5:1 On
    FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:1 On active preferred
    FC 13:0.0 10000000c96e8972<->500507630303860b vmhba1:7:1 On
    FC 13:0.0 10000000c96e8972<->500507630308860b vmhba1:8:1 On
    FC 13:0.0 10000000c96e8972<->500507630313860b vmhba1:9:1 On
    FC 13:0.0 10000000c96e8972<->500507630318060b vmhba1:10:1 On
    FC 13:0.0 10000000c96e8972<->500507630318860b vmhba1:11:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630313460b vmhba2:6:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630303c60b vmhba2:7:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630308c60b vmhba2:8:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630313c60b vmhba2:9:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630318460b vmhba2:10:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630318c60b vmhba2:11:1 On

    Disk vmhba1:4:2 /dev/sdd (256000MB) has 16 paths and policy of Fixed
    FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:2 On
    FC 13:0.0 10000000c96e8972<->500507630313060b vmhba1:5:2 On
    FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:2 On
    FC 13:0.0 10000000c96e8972<->500507630303860b vmhba1:7:2 On
    FC 13:0.0 10000000c96e8972<->500507630308860b vmhba1:8:2 On
    FC 13:0.0 10000000c96e8972<->500507630313860b vmhba1:9:2 On
    FC 13:0.0 10000000c96e8972<->500507630318060b vmhba1:10:2 On
    FC 13:0.0 10000000c96e8972<->500507630318860b vmhba1:11:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:2 On active preferred
    FC 16:0.0 10000000c96e8ccc<->500507630313460b vmhba2:6:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630303c60b vmhba2:7:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630308c60b vmhba2:8:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630313c60b vmhba2:9:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630318460b vmhba2:10:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630318c60b vmhba2:11:2 On

    In this output, you can see two HBAs (with WWPNs 10000000c96e8972 and 10000000c96e8ccc) that see two LUNs, vmhba1:4:1 and vmhba1:4:2, each presented over 16 paths.

    On the VCB proxy / Windows box, I used the Emulex HBAnywhere utility to retrieve the WWPN's of the LUN's that were presented. The output is shown in the following screenshot:


    It is also possible to use the HbaCmd.exe AllNodeInfo command to retrieve a list of all WWPN's that a certain HBA sees.
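The vmhba names in the esxcfg-vmhbadevs -m output above follow the adapter:target:LUN:partition layout described earlier, and the esxcfg-mpath names use the same layout without the partition. As a rough sketch (function and key names are my own), they can be split into fields like this:

```python
def parse_vmhba(name):
    """Split 'vmhba1:4:2:1' into adapter, target, LUN and optional partition."""
    parts = name.split(":")
    if len(parts) not in (3, 4) or not parts[0].startswith("vmhba"):
        raise ValueError("unexpected vmhba name: %r" % name)
    numbers = [int(p) for p in parts[1:]]
    return {
        "adapter": parts[0],
        "target": numbers[0],
        "lun": numbers[1],
        # esxcfg-mpath names have no partition component.
        "partition": numbers[2] if len(numbers) == 3 else None,
    }


def parse_vmhbadevs(output):
    """Parse `esxcfg-vmhbadevs -m` lines into (vmhba dict, device, VMFS ID)."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 3:
            rows.append((parse_vmhba(fields[0]), fields[1], fields[2]))
    return rows


if __name__ == "__main__":
    sample = "vmhba1:4:1:1 /dev/sdc1 483cf914-29b60dc5-dbfd-001cc497e630\n"
    for vmhba, device, vmfs_id in parse_vmhbadevs(sample):
        print(vmhba["adapter"], vmhba["target"], vmhba["lun"], device, vmfs_id)
```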
Looks nice, what's the problem?
Using the WWPN seemed to be the obvious answer to identifying the LUNs on both the ESX host and the VCB proxy. Until I discovered that two different LUNs were presented using the same WWPN (obviously they were on two different SANs and presented to two different hosts). On one of our ESX hosts, a 256 GB LUN was presented using WWPN 50:05:07:63:03:08:06:0b, and on the VCB proxy, a 500 GB LUN was presented using that same WWPN -- apparently our SAN team recycles the WWPNs on the different fibre channel fabrics.

To make matters even worse, I noticed that the same LUN was presented using one WWPN to an ESX host, and with another WWPN to the VCB proxy (I am no SAN expert myself but I assume it is possible to present the same LUN in different SAN zones using different WWPN's). I was able to verify this since VCB was able to do a SAN backup of a virtual machine that resides on a LUN with a WWPN on the ESX side that is not presented to the VCB proxy.
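When comparing WWPNs between tools, note that esxcfg-mpath prints them as 16 plain hex digits while other tools (and the paragraph above) use colon-separated bytes. A small helper, sketched here with function names of my own choosing, converts between the two notations so that values can be compared directly:

```python
def normalize_wwpn(wwpn):
    """Return a WWPN as 16 lowercase hex digits without separators."""
    digits = wwpn.strip().replace(":", "").replace("-", "").lower()
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("not a 64-bit WWPN: %r" % wwpn)
    return digits


def format_wwpn(wwpn):
    """Return a WWPN in colon-separated byte notation."""
    digits = normalize_wwpn(wwpn)
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


if __name__ == "__main__":
    print(format_wwpn("500507630308060b"))
```

Normalizing both sides first avoids false mismatches caused by upper-case digits or separators.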

The next step: VMFS ID's as a unique identifier
So, if you cannot rely on the WWPN's to uniquely identify a LUN on a host that is connected to multiple SAN's, then surely VCB must use the VMFS ID to know what LUN to read the virtual machine data from? Right?

On the VCB proxy & Windows machine, I tried to discover the VMFS ID's using the vcbSanDbg.exe tool (included in the VCB framework and available as a separate download from the VMware website -- careful, the separate download is an older version than the one included in the VCB 1.5 framework). An excerpt from its lengthy output:

C:\Program Files\VCB>vcbSanDbg | findstr "ID: NAA: volume"
[info] Found logical volume 48761b97-a4f562bd-6875-0017085d.
[info] Found logical volume 48761bc5-3f508baa-2f5d-0017085d.
[info] Found logical volume 483cf913-05b4f526-45b5-001cc497.
[info] Found logical volume 479da7ac-55fe7dfe-378c-001cc497.
[info] Found logical volume 477c2b4a-7db36616-30ea-001cc495.
[info] Found logical volume 48843bec-154cf784-871a-001cc495.
[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] ID: LVID:48761b97-dacedf9f-ebb9-0017085d0f91/48761b97-a4f562bd-6875-0017085d0f91/1
Name: 48761b97-a4f562bd-6875-0017085d
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] ID: LVID:48761bc6-7b4afa63-97d9-0017085d0f91/48761bc5-3f508baa-2f5d-0017085d0f91/1
Name: 48761bc5-3f508baa-2f5d-0017085d
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] ID: LVID:483cf913-458f9fa5-a749-001cc497e630/483cf913-05b4f526-45b5-001cc497e630/1
Name: 483cf913-05b4f526-45b5-001cc497
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] ID: LVID:479da7b6-877867e9-dd06-001cc497e630/479da7ac-55fe7dfe-378c-001cc497e630/1
Name: 479da7ac-55fe7dfe-378c-001cc497
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] ID: LVID:477c2b4a-969e01e0-8d49-001cc495fb46/477c2b4a-7db36616-30ea-001cc495fb46/1
Name: 477c2b4a-7db36616-30ea-001cc495
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130
[info] ID: LVID:48843bec-28cc17a4-ca9e-001cc495fb46/48843bec-154cf784-871a-001cc495fb46/1
Name: 48843bec-154cf784-871a-001cc495


Unfortunately, I was not able to discover the VMFS ID's I saw on the ESX host in this output, even though there are some resemblances:
  • ESX host VMFS ID 483cf914-29b60dc5-dbfd-001cc497e630 looks a lot like vcbSanDbg.exe output's logical volume 483cf913-05b4f526-45b5-001cc497.

  • ESX host VMFS ID 479da7c1-4494cd90-d327-001cc497e630 looks a lot like vcbSanDbg.exe output's logical volume 479da7ac-55fe7dfe-378c-001cc497.
Furthermore, I found out that current versions of VCB do not rely on the VMFS ID to discover virtual machines on a LUN. In Andy Tucker's talk "VMware Consolidated Backup: today and tomorrow" at VMworld 2007, it is clearly stated (slide 19) that there is:
No “VMFS Driver for Windows” on proxy

The same talk lists the usage of VMFS signatures for identifying LUNs on the SAN network as an item on the "todo" list (slide 34).

Other ideas?
So where does one turn when all possible solutions seem to lead to a dead end? Right: the VMware community forums. The answer came in this thread by snapper.

What I learned today is that besides the WWPN on a fiber channel network, there is another unique identifier called the NAA (Network Address Authority) to identify devices on the FC fabric. You can obtain the NAA for the LUN's on an ESX host using the esxcfg-mpath command in verbose mode using:

esxcfg-mpath -lv | grep ^Disk | grep -v vmhba0 | awk '{print $3,$5,$2}' | cut -b15-

The output on our ESX host looks much like this:

6005076303ffc60b0000000000001049323130373930 (256000MB) vmhba1:4:1
6005076303ffc60b000000000000104a323130373930 (256000MB) vmhba1:4:2

The NAA can be seen in the vcbSanDbg.exe output shown above, and can be filtered as follows:

vcbSanDbg.exe | findstr "NAA:"


The output should look like this:

C:\Program Files\VCB>vcbSanDbg | findstr "NAA:"

[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130


Et voila, now I can start running the esxcfg-mpath command on all our ESX hosts and start matching these NAA's with those in the output of vcbSanDbg to discover what our Windows VCB proxy has access to.
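The matching itself is easy to automate once both outputs are saved to text files. The following is a minimal sketch that assumes the ESX side has the three-column format produced by the esxcfg-mpath pipeline above and the VCB side has the "NAA:" lines from vcbSanDbg; the function names are my own.

```python
import re

# Matches the identifier after "NAA:" in vcbSanDbg output lines.
_VCB_NAA = re.compile(r"NAA:([0-9a-fA-F]+)")


def parse_esx_naas(esx_output):
    """Map NAA -> vmhba name from '<NAA> (<size>) <vmhba>' lines."""
    result = {}
    for line in esx_output.splitlines():
        fields = line.split()
        if len(fields) == 3:
            naa, _size, vmhba = fields
            result[naa.lower()] = vmhba
    return result


def parse_vcb_naas(vcb_output):
    """Return the set of NAA identifiers the VCB proxy reports."""
    return set(m.group(1).lower() for m in _VCB_NAA.finditer(vcb_output))


def shared_luns(esx_output, vcb_output):
    """Return {NAA: vmhba name} for ESX LUNs that the VCB proxy also sees."""
    vcb = parse_vcb_naas(vcb_output)
    esx = parse_esx_naas(esx_output)
    return dict((naa, lun) for naa, lun in esx.items() if naa in vcb)


if __name__ == "__main__":
    esx = "6005076303ffc60b0000000000001049323130373930 (256000MB) vmhba1:4:1\n"
    vcb = "[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930\n"
    print(shared_luns(esx, vcb))
```

Running this per ESX host against the single vcbSanDbg output gives, for each host, the list of its LUNs that the proxy can reach.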

Tuesday, August 12, 2008

VMware D-Day: 12/08/2008

I reckon "12 August 2008" will be long remembered by all VMware enthusiasts out there.

That is the day a major bug caused ESX 3.5 Update 2 to no longer recognise any license, even if the license file on your license server was perfectly valid. There is no need to sketch the horror that follows when your ESX clusters no longer detect a valid license: VMotion fails, DRS fails, HA fails, powering on virtual machines is no longer possible... Ironically, today is also Microsoft's Patch Tuesday for August, which probably means that quite a few system administrators were caught with their pants down (and their VMs powered off during a scheduled maintenance window) when this bug struck.

The symptoms and errors that we have been experiencing are the following:
  • Unable to VMotion a host from ESX 3.0.2 to ESX 3.5. The VMotion progresses until 10% and then aborts with error messages such as "operation timed out" or "internal system error".

  • HA agent getting completely confused (unable to install, reconfigure for HA does not work).

  • Unable to power on new machines:

    [2008-08-12 14:11:16.022 'Vmsvc' 121330608 info] Failed to do Power Op: Error: Internal error
    [2008-08-12 14:11:16.065 'vm:/vmfs/volumes/48858dc4-f4e218d1-d3a8-001cc497e630/HOSTNAME/HOSTNAME.vmx' 121330608 warning] Failed operation
    [2008-08-12 14:11:16.066 'ha-eventmgr' 121330608 info] Event 15 : Failed to power on HOSTNAME on esx.test.local in ha-datacenter: A general system error occurred

VMware is promising a patch tomorrow, but several forum posts (here and here) are wondering how this patch will be distributed and -- given the deep integration of the licensing components within ESX -- whether this will require a reboot of the ESX host or not (which can be quite problematic if you cannot VMotion machines away). A possible workaround for this issue is to introduce a 3.0.2 host in the cluster, as I have seen in our environment that VMotioning from 3.5 to 3.0.2 still works.

Edit (21:20): hopes are up that VMware will be able to release a patch that doesn't require the ESX host to reboot. See what Toni Verbeiren has to say about it on his blog.

Edit (9:00 AM 13 AUG): a patch has been released by VMware. Regarding whether hosts need to be rebooted or not... there is good news and there is bad news: "to apply the patches, no reboot of ESX/ESXi hosts is required. One can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs need to be powered off before the patches are applied and powered back on afterwards."

You can follow the developing crisis at the following sources:
Even our dear friends at Microsoft write about the problem, see the blogpost "It's rude to laugh at other people's misfortunes - even VMware's" here.