Thursday, August 14, 2008

Matching LUNs between ESX hosts and a VCB proxy

One of the problems I encountered at a customer site was discovering which VMFS partitions were presented to a VCB proxy. It turned out to be a bit more complex than I had first expected.

Introduction
VMware released the VCB framework (VMware Consolidated Backup) to make backups of virtual machines. The VCB framework is typically installed on a Windows host (the VCB proxy), and in order to make SAN backups, you need to present both the source LUN, which contains the virtual machines to back up, and the destination LUN, where the backup files are stored, to that VCB proxy.

This setup is relatively simple to maintain in smaller environments. However, once you get into a big environment where a dozen teams are involved (separate networking teams, separate SAN teams, separate Windows teams and separate VMware teams), it can become quite challenging to find out which of the 12 LUNs that are presented to a Windows host in fact belong to a specific ESX host.

Finding unique identifiers for a LUN
The mission is to find a unique identifier (UID) that can be used both on the ESX host and on the Windows box. The first two obvious candidates to uniquely identify an ESX-managed LUN on a SAN are:
  • The VMFS ID for the partition
    Upon the initialization of a VMFS partition, it is assigned a unique identifier that can be found by looking in the /vmfs/volumes directory on an ESX host, or by using the esxcfg-vmhbadevs -m command on the ESX host. The output looks like this:

    vmhba1:0:2:1 /dev/sdb1 48858dc4-f4e218d1-d3a8-001cc497e630
    vmhba1:4:1:1 /dev/sdc1 483cf914-29b60dc5-dbfd-001cc497e630
    vmhba1:4:2:1 /dev/sdd1 479da7c1-4494cd90-d327-001cc497e630


    The first disk is the remainder of the locally attached storage; the other two disks are presented from the SAN. For the SAN disks, the first column indicates that HBA 1, SCSI target 4 and LUNs 1 and 2 are used (with partition 1 on each LUN). The second column lists the Linux device name under the Service Console, and the third column lists the VMFS ID.

  • The WWPN (World Wide Port Name) of the disk on the SAN
    On a Fibre Channel SAN, each device is assigned a unique identifier called the WWPN. The WWPN performs much the same function as a MAC address does on an Ethernet network. The WWPNs of the disks that are presented to an ESX host can be obtained from the Service Console using the esxcfg-mpath -l command:

    Disk vmhba1:4:1 /dev/sdc (256000MB) has 16 paths and policy of Fixed
    FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:1 On
    FC 13:0.0 10000000c96e8972<->500507630313060b vmhba1:5:1 On
    FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:1 On active preferred
    FC 13:0.0 10000000c96e8972<->500507630303860b vmhba1:7:1 On
    FC 13:0.0 10000000c96e8972<->500507630308860b vmhba1:8:1 On
    FC 13:0.0 10000000c96e8972<->500507630313860b vmhba1:9:1 On
    FC 13:0.0 10000000c96e8972<->500507630318060b vmhba1:10:1 On
    FC 13:0.0 10000000c96e8972<->500507630318860b vmhba1:11:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630313460b vmhba2:6:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630303c60b vmhba2:7:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630308c60b vmhba2:8:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630313c60b vmhba2:9:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630318460b vmhba2:10:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630318c60b vmhba2:11:1 On

    Disk vmhba1:4:2 /dev/sdd (256000MB) has 16 paths and policy of Fixed
    FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:2 On
    FC 13:0.0 10000000c96e8972<->500507630313060b vmhba1:5:2 On
    FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:2 On
    FC 13:0.0 10000000c96e8972<->500507630303860b vmhba1:7:2 On
    FC 13:0.0 10000000c96e8972<->500507630308860b vmhba1:8:2 On
    FC 13:0.0 10000000c96e8972<->500507630313860b vmhba1:9:2 On
    FC 13:0.0 10000000c96e8972<->500507630318060b vmhba1:10:2 On
    FC 13:0.0 10000000c96e8972<->500507630318860b vmhba1:11:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:2 On active preferred
    FC 16:0.0 10000000c96e8ccc<->500507630313460b vmhba2:6:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630303c60b vmhba2:7:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630308c60b vmhba2:8:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630313c60b vmhba2:9:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630318460b vmhba2:10:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630318c60b vmhba2:11:2 On

    In this output, you can see two HBAs (with WWPNs 10000000c96e8972 and 10000000c96e8ccc) that see two LUNs, vmhba1:4:1 and vmhba1:4:2, each presented over 16 paths.

    On the VCB proxy (the Windows box), I used the Emulex HBAnyware utility to retrieve the WWPNs of the LUNs that were presented.

    [Screenshot: Emulex HBAnyware output listing the WWPNs of the presented LUNs]

    It is also possible to use the HbaCmd.exe AllNodeInfo command to retrieve a list of all WWPN's that a certain HBA sees.
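
To compare the WWPNs from the ESX side with those shown on the Windows side, it helps to pull them out of the esxcfg-mpath -l output in the colon-separated notation. The following is only a minimal sketch in Python, assuming the output has been saved as text and that every path line has the same layout as in the sample above; the function names and dictionary keys are my own invention and are not part of any VMware tool.

```python
import re

# One path line of `esxcfg-mpath -l`, for example:
# "FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:1 On active preferred"
# Assumption: the left WWPN belongs to the local HBA, the right one to the remote port.
PATH_LINE = re.compile(
    r"^\s*FC\s+\S+\s+([0-9a-fA-F]{16})<->([0-9a-fA-F]{16})\s+"
    r"vmhba(\d+):(\d+):(\d+)\s+(.*)$"
)


def format_wwpn(raw):
    """Turn '500507630308060b' into '50:05:07:63:03:08:06:0b'."""
    raw = raw.lower()
    return ":".join(raw[i:i + 2] for i in range(0, len(raw), 2))


def parse_paths(text):
    """Return one dict per FC path line found in the given output."""
    paths = []
    for line in text.splitlines():
        match = PATH_LINE.match(line)
        if match is None:
            continue  # skips the 'Disk ...' header lines and blank lines
        hba_wwpn, target_wwpn, hba, target, lun, state = match.groups()
        paths.append({
            "hba_wwpn": format_wwpn(hba_wwpn),
            "target_wwpn": format_wwpn(target_wwpn),
            "hba": int(hba),
            "target": int(target),
            "lun": int(lun),
            "active": "active" in state.split(),
        })
    return paths


if __name__ == "__main__":
    sample = (
        "Disk vmhba1:4:1 /dev/sdc (256000MB) has 16 paths and policy of Fixed\n"
        " FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:1 On\n"
        " FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:1 On active preferred\n"
    )
    for path in parse_paths(sample):
        print(path["target_wwpn"], path["hba"], path["target"], path["lun"], path["active"])
```

The colon-separated form is the one used by most Windows-side tools, so the values can be compared directly.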
Looks nice, what's the problem?
Using the WWPN seemed to be the obvious way to identify the LUNs on both the ESX host and the VCB proxy, until I discovered that two different LUNs were presented using the same WWPN (they were on two different SANs and presented to two different hosts). On one of our ESX hosts, a 256 GB LUN was presented using WWPN 50:05:07:63:03:08:06:0b, and on the VCB proxy, a 500 GB LUN was presented using that same WWPN. Apparently our SAN team recycles the WWPNs on the different Fibre Channel fabrics.

To make matters even worse, I noticed that the same LUN was presented using one WWPN to an ESX host, and with another WWPN to the VCB proxy (I am no SAN expert myself but I assume it is possible to present the same LUN in different SAN zones using different WWPN's). I was able to verify this since VCB was able to do a SAN backup of a virtual machine that resides on a LUN with a WWPN on the ESX side that is not presented to the VCB proxy.

The next step: VMFS ID's as a unique identifier
So, if you cannot rely on the WWPN's to uniquely identify a LUN on a host that is connected to multiple SAN's, then surely VCB must use the VMFS ID to know what LUN to read the virtual machine data from? Right?

On the VCB proxy (the Windows machine), I tried to discover the VMFS IDs using the vcbSanDbg.exe tool (included in the VCB framework and also available as a separate download from the VMware website; careful, the separate download is an older version than the one included in the VCB 1.5 framework). An excerpt from its lengthy output:

C:\Program Files\VCB>vcbSanDbg | findstr "ID: NAA: volume"
[info] Found logical volume 48761b97-a4f562bd-6875-0017085d.
[info] Found logical volume 48761bc5-3f508baa-2f5d-0017085d.
[info] Found logical volume 483cf913-05b4f526-45b5-001cc497.
[info] Found logical volume 479da7ac-55fe7dfe-378c-001cc497.
[info] Found logical volume 477c2b4a-7db36616-30ea-001cc495.
[info] Found logical volume 48843bec-154cf784-871a-001cc495.
[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] ID: LVID:48761b97-dacedf9f-ebb9-0017085d0f91/48761b97-a4f562bd-6875-0017085d0f91/1
Name: 48761b97-a4f562bd-6875-0017085d
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] ID: LVID:48761bc6-7b4afa63-97d9-0017085d0f91/48761bc5-3f508baa-2f5d-0017085d0f91/1
Name: 48761bc5-3f508baa-2f5d-0017085d
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] ID: LVID:483cf913-458f9fa5-a749-001cc497e630/483cf913-05b4f526-45b5-001cc497e630/1
Name: 483cf913-05b4f526-45b5-001cc497
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] ID: LVID:479da7b6-877867e9-dd06-001cc497e630/479da7ac-55fe7dfe-378c-001cc497e630/1
Name: 479da7ac-55fe7dfe-378c-001cc497
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] ID: LVID:477c2b4a-969e01e0-8d49-001cc495fb46/477c2b4a-7db36616-30ea-001cc495fb46/1
Name: 477c2b4a-7db36616-30ea-001cc495
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130
[info] ID: LVID:48843bec-28cc17a4-ca9e-001cc495fb46/48843bec-154cf784-871a-001cc495fb46/1
Name: 48843bec-154cf784-871a-001cc495


Unfortunately, I was not able to discover the VMFS ID's I saw on the ESX host in this output, even though there are some resemblances:
  • ESX host VMFS ID 483cf914-29b60dc5-dbfd-001cc497e630 looks a lot like vcbSanDbg.exe output's logical volume 483cf913-05b4f526-45b5-001cc497.

  • ESX host VMFS ID 479da7c1-4494cd90-d327-001cc497e630 looks a lot like vcbSanDbg.exe output's logical volume 479da7ac-55fe7dfe-378c-001cc497.
Furthermore, I found out that current versions of VCB do not rely on the VMFS ID to discover virtual machines on a LUN. In Andy Tucker's talk "VMware Consolidated Backup: today and tomorrow" at VMworld 2007, slide 19 clearly states:
No “VMFS Driver for Windows” on proxy

Furthermore, slide 34 of the same talk puts the use of VMFS signatures for identifying LUNs on the SAN on the "todo" list.

Other ideas?
So where does one turn when all possible solutions seem to lead to a dead end? Right: the VMware community forums. The answer came in this thread by snapper.

What I learned today is that besides the WWPN, there is another unique identifier on a Fibre Channel network, the NAA (Network Address Authority) identifier, which identifies devices on the FC fabric. You can obtain the NAA for the LUNs on an ESX host by running the esxcfg-mpath command in verbose mode:

esxcfg-mpath -lv | grep ^Disk | grep -v vmhba0 | awk '{print $3,$5,$2}' | cut -b15-

The output on our ESX host looks much like this:

6005076303ffc60b0000000000001049323130373930 (256000MB) vmhba1:4:1
6005076303ffc60b000000000000104a323130373930 (256000MB) vmhba1:4:2

The NAA can be seen in the vcbSanDbg.exe output shown above, and can be filtered as follows:

vcbSanDbg.exe | findstr "NAA:"


The output should look like this:

C:\Program Files\VCB>vcbSanDbg | findstr "NAA:"

[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130


Et voila, now I can start running the esxcfg-mpath command on all our ESX hosts and start matching these NAA's with those in the output of vcbSanDbg to discover what our Windows VCB proxy has access to.
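
The matching itself is easy to automate once both outputs have been saved to text files. Below is a minimal sketch in Python, assuming the ESX side was captured with the esxcfg-mpath pipeline shown above (three columns: NAA, size, vmhba path) and the proxy side with vcbSanDbg piped through findstr. The function names are hypothetical helpers, not part of VCB or ESX.

```python
import re

# vcbSanDbg prints lines such as "[info] Found SCSI Device: NAA:6005076303ff..."
NAA_IN_VCB = re.compile(r"NAA:([0-9a-fA-F]+)")


def parse_esx_naas(text):
    """Map NAA -> vmhba path from lines like '<naa> (256000MB) vmhba1:4:1'."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2].startswith("vmhba"):
            result[fields[0].lower()] = fields[2]
    return result


def parse_vcb_naas(text):
    """Collect every NAA that appears in the vcbSanDbg output."""
    return {m.group(1).lower() for m in NAA_IN_VCB.finditer(text)}


def match_luns(esx_text, vcb_text):
    """Return the ESX LUNs (NAA -> vmhba path) that the VCB proxy also sees."""
    vcb = parse_vcb_naas(vcb_text)
    return {naa: path for naa, path in parse_esx_naas(esx_text).items() if naa in vcb}


if __name__ == "__main__":
    esx_output = (
        "6005076303ffc60b0000000000001049323130373930 (256000MB) vmhba1:4:1\n"
        "6005076303ffc60b000000000000104a323130373930 (256000MB) vmhba1:4:2\n"
    )
    vcb_output = (
        "[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130\n"
        "[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930\n"
        "[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930\n"
    )
    for naa, path in sorted(match_luns(esx_output, vcb_output).items()):
        print(path, naa)
```

Running parse_esx_naas once per ESX host and keeping the host name alongside each result would give a complete picture of which host each LUN on the proxy belongs to.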

Tuesday, August 12, 2008

VMware D-Day: 12/08/2008

I reckon "12 August 2008" will be long remembered by all VMware enthusiasts out there.

That is the day that a major bug caused ESX 3.5 Update 2 to stop recognising any license, even if the license file on your license server was perfectly valid. There is no need to sketch the horror that follows when your ESX clusters no longer detect a valid license: VMotion fails, DRS fails, HA fails, powering on virtual machines is no longer possible... Ironically, today is also Microsoft's Patch Tuesday for August, which probably means that quite a few system administrators were caught with their pants down (and their VMs powered off during a scheduled maintenance window) when this bug struck.

The symptoms and errors that we have been experiencing are the following:
  • Unable to VMotion a virtual machine from an ESX 3.0.2 host to an ESX 3.5 host. The VMotion progresses until 10% and then aborts with error messages such as "operation timed out" or "internal system error".

  • HA agent getting completely confused (unable to install, reconfigure for HA does not work).

  • Unable to power on new machines:

    [2008-08-12 14:11:16.022 'Vmsvc' 121330608 info] Failed to do Power Op: Error: Internal error
    [2008-08-12 14:11:16.065 'vm:/vmfs/volumes/48858dc4-f4e218d1-d3a8-001cc497e630/HOSTNAME/HOSTNAME.vmx' 121330608 warning] Failed operation
    [2008-08-12 14:11:16.066 'ha-eventmgr' 121330608 info] Event 15 : Failed to power on HOSTNAME on esx.test.local in ha-datacenter: A general system error occurred

VMware is promising a patch tomorrow, but several forum posts (here and here) are wondering how this patch will be distributed and -- given the deep integration of the licensing components within ESX -- whether this will require a reboot of the ESX host or not (which can be quite problematic if you cannot VMotion machines away). A possible workaround for this issue is to introduce a 3.0.2 host in the cluster, as I have seen in our environment that VMotioning from 3.5 to 3.0.2 still works.

Edit (21:20): hopes are up that VMware will be able to release a patch that doesn't require the ESX host to reboot. See what Toni Verbeiren has to say about it on his blog.

Edit (9:00 AM 13 AUG): a patch has been released by VMware. Regarding whether hosts need to be rebooted or not... there is good news and there is bad news: "to apply the patches, no reboot of ESX/ESXi hosts is required. One can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs need to be powered off before the patches are applied and powered back on afterwards."

You can follow the developing crisis at the following sources:
Even our dear friends at Microsoft write about the problem, see the blogpost "It's rude to laugh at other people's misfortunes - even VMware's" here.

Friday, August 8, 2008

WM6 and self-signed certificates

While playing around with a new (unofficial) WM6.1 ROM for my Mio A701, I bumped into a well-known problem with installing self-signed certificates on (homebrew?) WM6 ROMs: installing a new CA certificate fails with the error message "The certificate was not successfully added; please restart your device and try again". Unsurprisingly, restarting the device did not fix the problem.

I had already encountered this problem a few months ago, and I knew you could bypass it by importing the certificate directly into the mobile device's registry. However, the procedures that I read all involved:
  1. flashing Windows Mobile 5 (or a WM6 version that was patched to accept any certificate),
  2. importing the certificate in that temporary ROM,
  3. exporting the relevant registry data,
  4. reflashing back to the ROM that has the certificate problem,
  5. importing the certificate through the registry file you obtained earlier in step 3.
As you can imagine, this is quite some work, and since I am a lazy person by nature, I did not want to go back to WM5 after just having flashed my Mio to a brand-new and shiny WM6. Therefore, I decided to develop a shorter workaround that doesn't involve reflashing.

The tricky part is that you need to create the proper registry file to import. This file looks like:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\Comm\Security\SystemCertificates\Root\Certificates\824AF72AB87E17AC777098A4164D7A90C90C0D69]
"Blob"=hex:19,00,00,00,01,00,00,00,10,00,00,00,4f,e5,c4,01,4e,7d,89,4a,da,42,\
3f,f7,24,0f,7f,a2,19,00,00,00,01,00,00,00,10,00,00,00,cb,bc,40,37,8a,45,2c,\
...
(please disregard the unintentional wrapping of the registry location; everything between the square brackets should be on one line).

The difficult part is converting your self-signed certificate to the proper registry format. Here's how I did that:
  • On a regular PC, use Internet Explorer to go to a website with the certificate that you want to install on your mobile device (typically this will be Outlook Web Access or something). Open the certificate and install it on your local PC (let the certificate import wizard automatically place the certificate in whatever store it finds necessary).

  • View the certificate (in Internet Explorer or by using the Certificates MMC snap-in) and go to the "Details" tab. There you will find the "Thumbprint" of the certificate. You will need to look up this value in a few moments, so be sure to remember the first few digits. In the case of the company I work for, the thumbprint is "824af72ab8somethingsomething".

  • Open your registry editor and go to the following location:

    HKEY_CURRENT_USER\Software\Microsoft\SystemCertificates\Root\Certificates\

    There should be a registry key that has the thumbprint of your certificate as its name.

    Right-click that registry key and click "Export...". Choose a location for the exported registry data.

  • Next, open the registry export in Notepad. Replace the registry key location (between the square brackets) with HKEY_LOCAL_MACHINE\Comm\Security\SystemCertificates\Root\Certificates\ followed by the thumbprint. Then replace the first 12 bytes of the "Blob" registry value with: hex:19,00,00,00,01,00,00,00,10,00,00,00.

  • Your result should look like this:
    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\Comm\Security\SystemCertificates\Root\Certificates\824AF72AB87E17AC777098A4164D7A90C90C0D69]
    "Blob"=hex:19,00,00,00,01,00,00,00,10,00,00,00,4f,e5,c4,01,4e,7d,89,4a,da,42,\
    3f,f7,24,0f,7f,a2,19,00,00,00,01,00,00,00,10,00,00,00,cb,bc,40,37,8a,45,2c,\
    ...
    Compare this with your original registry export: the differences are the registry key location and the first 12 bytes of the "Blob" value.

  • Save the registry file, copy it to your mobile device and import it there. Voila! Finished!
You can use the "Certificates" control panel to verify that your certificate is properly recognized!
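
The manual Notepad edits described above can also be scripted. What follows is only a rough sketch in Python, assuming the export contains a single certificate key with one "Blob" value; the function name convert_reg_export is hypothetical. It rewrites the key location and swaps the first 12 bytes of the blob, exactly as in the manual steps. Note that it writes the blob on one long line instead of wrapped lines, and I have not verified that every import tool on the device accepts that.

```python
import re

# Target location and 12-byte prefix taken from the manual steps above.
WM_ROOT = r"HKEY_LOCAL_MACHINE\Comm\Security\SystemCertificates\Root\Certificates"
WM_PREFIX = "19,00,00,00,01,00,00,00,10,00,00,00".split(",")
BLOB_START = '"Blob"=hex:'


def convert_reg_export(text):
    """Convert a desktop registry export of one certificate to the WM6 layout."""
    # Join lines that end with a backslash (wrapped hex data).
    joined = re.sub(r"\\\r?\n[ \t]*", "", text)
    out = []
    for line in joined.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Keep only the thumbprint (last path component) of the key name.
            thumbprint = stripped[1:-1].rsplit("\\", 1)[-1]
            line = "[" + WM_ROOT + "\\" + thumbprint + "]"
        elif stripped.startswith(BLOB_START):
            data = stripped[len(BLOB_START):].split(",")
            # Replace the first 12 bytes, keep the remainder untouched.
            line = BLOB_START + ",".join(WM_PREFIX + data[12:])
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Shortened, made-up blob for illustration only (not a real certificate).
    sample = (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        "[HKEY_CURRENT_USER\\Software\\Microsoft\\SystemCertificates\\Root"
        "\\Certificates\\824AF72AB87E17AC777098A4164D7A90C90C0D69]\n"
        '"Blob"=hex:03,00,00,00,01,00,00,00,14,00,00,00,4f,e5,\\\n'
        "  c4,01\n"
    )
    print(convert_reg_export(sample))
```

Registry exports from the desktop registry editor are normally saved as UTF-16 text, so when reading the file from disk it is probably necessary to open it with encoding="utf-16".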

Note: ActiveSync will not immediately recognize the new certificate, so you must either kill and restart the ActiveSync process or restart your device (but first wait at least a few minutes so that Windows Mobile can commit your registry changes to memory!).

Obviously, none of this is supported or endorsed by anybody on this planet. Perform these actions at your own risk and be sure you know what to do in case you brick your device!