Tuesday, August 12, 2008

VMware D-Day: 12/08/2008

I reckon "12 August 2008" will be long remembered by all VMware enthusiasts out there.

That is the day a major bug caused ESX 3.5 Update 2 to stop recognising any license, even if the license file on your license server was perfectly valid. There is no need to sketch the horror that follows when your ESX clusters no longer detect a valid license: VMotion fails, DRS fails, HA fails, powering on virtual machines is no longer possible... Ironically, today is also Microsoft's Patch Tuesday for August, which probably means that quite a few system administrators were caught with their pants down (and their VMs powered off during a scheduled maintenance window) when this bug struck.

The symptoms and errors that we have been experiencing are the following:
  • Unable to VMotion a virtual machine from ESX 3.0.2 to ESX 3.5. The VMotion progresses to 10% and then aborts with error messages such as "operation timed out" or "internal system error".

  • HA agent getting completely confused (unable to install, reconfigure for HA does not work).

  • Unable to power on new machines:

    [2008-08-12 14:11:16.022 'Vmsvc' 121330608 info] Failed to do Power Op: Error: Internal error
    [2008-08-12 14:11:16.065 'vm:/vmfs/volumes/48858dc4-f4e218d1-d3a8-001cc497e630/HOSTNAME/HOSTNAME.vmx' 121330608 warning] Failed operation
    [2008-08-12 14:11:16.066 'ha-eventmgr' 121330608 info] Event 15 : Failed to power on HOSTNAME on esx.test.local in ha-datacenter: A general system error occurred
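If you want to check quickly whether a host is hitting the same power-on failures, a small script can scan a log for lines like the ones above. This is only a rough sketch: the regular expression is inferred from the three lines in the excerpt, the failure strings are copied from them, and the names `LOG_LINE`, `FAILURE_MARKERS` and `find_power_on_failures` are made up for illustration. Other ESX builds may word things differently.

```python
import re

# Layout inferred from the excerpt above (an assumption, not a documented format):
# [YYYY-MM-DD HH:MM:SS.mmm 'component' thread-id level] message
LOG_LINE = re.compile(
    r"^\s*\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"'(?P<component>[^']*)' (?P<thread>\d+) (?P<level>\w+)\] (?P<message>.*)$"
)

# Substrings copied from the excerpt; other failures may be worded differently.
FAILURE_MARKERS = ("Failed to do Power Op", "Failed to power on")


def find_power_on_failures(lines):
    """Return (timestamp, component, message) for lines that look like power-on failures."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and any(marker in match["message"] for marker in FAILURE_MARKERS):
            hits.append((match["timestamp"], match["component"], match["message"]))
    return hits


# The three lines from the excerpt, used here as sample input.
SAMPLE = [
    "[2008-08-12 14:11:16.022 'Vmsvc' 121330608 info] Failed to do Power Op: Error: Internal error",
    "[2008-08-12 14:11:16.065 'vm:/vmfs/volumes/48858dc4-f4e218d1-d3a8-001cc497e630/HOSTNAME/HOSTNAME.vmx' 121330608 warning] Failed operation",
    "[2008-08-12 14:11:16.066 'ha-eventmgr' 121330608 info] Event 15 : Failed to power on HOSTNAME on esx.test.local in ha-datacenter: A general system error occurred",
]

if __name__ == "__main__":
    for timestamp, component, message in find_power_on_failures(SAMPLE):
        print(timestamp, component, message)
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the function only needs an iterable of lines.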

VMware is promising a patch tomorrow, but several forum posts (here and here) are wondering how this patch will be distributed and, given the deep integration of the licensing components within ESX, whether it will require a reboot of the ESX host (which can be quite problematic if you cannot VMotion machines away). A possible workaround is to add a 3.0.2 host to the cluster: in our environment, VMotioning from 3.5 to 3.0.2 still works.

Edit (9:20 PM): hopes are up that VMware will be able to release a patch that doesn't require the ESX host to reboot. See what Toni Verbeiren has to say about it on his blog.

Edit (9:00 AM 13 AUG): a patch has been released by VMware. Regarding whether hosts need to be rebooted or not... there is good news and there is bad news: "to apply the patches, no reboot of ESX/ESXi hosts is required. One can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs need to be powered off before the patches are applied and powered back on afterwards."

You can follow the developing crisis at the following sources:
Even our dear friends at Microsoft are writing about the problem; see the blog post "It's rude to laugh at other people's misfortunes - even VMware's" here.
