Thursday, August 14, 2008

Matching LUNs between ESX hosts and a VCB proxy

One of the problems I ran into at a customer site was discovering which VMFS partitions were presented to a VCB proxy. It turned out to be a bit more complex than I had first expected.

Introduction
VMware released the VCB framework (VMware Consolidated Backup) to make backups of virtual machines. The VCB framework is typically installed on a Windows host (the VCB proxy), and in order to make SAN backups, you need to present both the source LUN, which contains the virtual machines to back up, and the destination LUN, where the backup files are stored, to that VCB proxy.

This setup is relatively simple to maintain in smaller environments. However, in a big environment where a dozen teams are involved (separate networking, SAN, Windows and VMware teams), it can become quite challenging to find out which of the 12 LUNs presented to a Windows host in fact belong to a specific ESX host.

Finding unique identifiers for a LUN
The mission is to find a unique identifier (UID) that can be used both on the ESX host and on the Windows box. The two obvious candidates for uniquely identifying an ESX-managed LUN on a SAN are:
  • The VMFS ID for the partition
    When a VMFS partition is initialized, it is assigned a unique identifier, which can be found by looking in the /vmfs/volumes directory on an ESX host or by running the esxcfg-vmhbadevs -m command on the ESX host. The output looks like this:

    vmhba1:0:2:1 /dev/sdb1 48858dc4-f4e218d1-d3a8-001cc497e630
    vmhba1:4:1:1 /dev/sdc1 483cf914-29b60dc5-dbfd-001cc497e630
    vmhba1:4:2:1 /dev/sdd1 479da7c1-4494cd90-d327-001cc497e630


    The first disk is the remainder of the locally attached storage, and the other two disks are presented from the SAN. For the SAN disks, the first column indicates HBA 1, SCSI target 4, LUNs 1 and 2, and partition 1 on each LUN; the second column lists the Linux device name under the Service Console; and the third column lists the VMFS ID.

  • The WWPN (World Wide Port Name) of the disk on the SAN
    On a Fibre Channel SAN, each device port is assigned a unique identifier called the WWPN. A WWPN plays much the same role as a MAC address on an Ethernet network. The WWPNs of the disks that are presented to an ESX host can be obtained from the Service Console using the esxcfg-mpath -l command:

    Disk vmhba1:4:1 /dev/sdc (256000MB) has 16 paths and policy of Fixed
    FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:1 On
    FC 13:0.0 10000000c96e8972<->500507630313060b vmhba1:5:1 On
    FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:1 On active preferred
    FC 13:0.0 10000000c96e8972<->500507630303860b vmhba1:7:1 On
    FC 13:0.0 10000000c96e8972<->500507630308860b vmhba1:8:1 On
    FC 13:0.0 10000000c96e8972<->500507630313860b vmhba1:9:1 On
    FC 13:0.0 10000000c96e8972<->500507630318060b vmhba1:10:1 On
    FC 13:0.0 10000000c96e8972<->500507630318860b vmhba1:11:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630313460b vmhba2:6:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630303c60b vmhba2:7:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630308c60b vmhba2:8:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630313c60b vmhba2:9:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630318460b vmhba2:10:1 On
    FC 16:0.0 10000000c96e8ccc<->500507630318c60b vmhba2:11:1 On

    Disk vmhba1:4:2 /dev/sdd (256000MB) has 16 paths and policy of Fixed
    FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:2 On
    FC 13:0.0 10000000c96e8972<->500507630313060b vmhba1:5:2 On
    FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:2 On
    FC 13:0.0 10000000c96e8972<->500507630303860b vmhba1:7:2 On
    FC 13:0.0 10000000c96e8972<->500507630308860b vmhba1:8:2 On
    FC 13:0.0 10000000c96e8972<->500507630313860b vmhba1:9:2 On
    FC 13:0.0 10000000c96e8972<->500507630318060b vmhba1:10:2 On
    FC 13:0.0 10000000c96e8972<->500507630318860b vmhba1:11:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:2 On active preferred
    FC 16:0.0 10000000c96e8ccc<->500507630313460b vmhba2:6:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630303c60b vmhba2:7:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630308c60b vmhba2:8:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630313c60b vmhba2:9:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630318460b vmhba2:10:2 On
    FC 16:0.0 10000000c96e8ccc<->500507630318c60b vmhba2:11:2 On

    In this output, you can see two HBAs (with WWPNs 10000000c96e8972 and 10000000c96e8ccc) that see two LUNs, vmhba1:4:1 and vmhba1:4:2, each presented over 16 paths.

    On the VCB proxy / Windows box, I used the Emulex HBAnywhere utility to retrieve the WWPNs of the LUNs that were presented. The output is shown in the following screenshot:

    [Screenshot: Emulex HBAnywhere output listing the WWPNs seen by the VCB proxy]

    It is also possible to use the HbaCmd.exe AllNodeInfo command to retrieve a list of all WWPNs that a certain HBA sees.
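
To compare the path listing from esxcfg-mpath -l with WWPNs written in colon-separated notation (such as 50:05:07:63:03:08:06:0b), the output can be post-processed. The following is a minimal Python sketch, assuming the output format shown above and Fibre Channel paths only. The function names and the shortened SAMPLE text are illustrative and not part of any VMware or Emulex tool.

```python
import re
from collections import defaultdict

# A few lines copied from the esxcfg-mpath -l output above (shortened).
SAMPLE = """\
Disk vmhba1:4:1 /dev/sdc (256000MB) has 16 paths and policy of Fixed
FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:1 On
FC 13:0.0 10000000c96e8972<->500507630303060b vmhba1:6:1 On active preferred
FC 16:0.0 10000000c96e8ccc<->500507630303460b vmhba2:4:1 On

Disk vmhba1:4:2 /dev/sdd (256000MB) has 16 paths and policy of Fixed
FC 13:0.0 10000000c96e8972<->500507630308060b vmhba1:4:2 On
FC 16:0.0 10000000c96e8ccc<->500507630308460b vmhba2:5:2 On active preferred
"""

# "Disk <canonical name> <device> ..." starts a new LUN section.
DISK_RE = re.compile(r"^Disk\s+(vmhba\d+:\d+:\d+)\s+(\S+)")
# "FC <pci> <HBA WWPN><-><target WWPN> <path> ..." describes one path.
PATH_RE = re.compile(r"^FC\s+\S+\s+([0-9a-f]{16})<->([0-9a-f]{16})\s")


def colonize(wwpn):
    """Format 500507630308060b as 50:05:07:63:03:08:06:0b."""
    return ":".join(wwpn[i:i + 2] for i in range(0, len(wwpn), 2))


def target_wwpns_per_disk(text):
    """Map each canonical disk name to the sorted target WWPNs of its paths."""
    targets = defaultdict(set)
    current = None
    for line in text.splitlines():
        line = line.strip()
        disk = DISK_RE.match(line)
        if disk:
            current = disk.group(1)
            continue
        path = PATH_RE.match(line)
        if path and current is not None:
            targets[current].add(colonize(path.group(2)))
    return {disk: sorted(wwpns) for disk, wwpns in targets.items()}


if __name__ == "__main__":
    for disk, wwpns in target_wwpns_per_disk(SAMPLE).items():
        print(disk, ", ".join(wwpns))
```

In the full listing above, both disks are reached through the same set of target WWPNs, so this mapping groups paths per LUN but does not by itself tell two LUNs apart.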

Looks nice, what's the problem?
Using the WWPN seemed to be the obvious answer to identifying the LUNs on both the ESX host and the VCB proxy, until I discovered that two different LUNs were presented using the same WWPN (obviously they were on two different SANs and presented to two different hosts). On one of our ESX hosts, a 256 GB LUN was presented using WWPN 50:05:07:63:03:08:06:0b, and on the VCB proxy, a 500 GB LUN was presented using that same WWPN -- apparently our SAN team recycles the WWPNs on the different Fibre Channel fabrics.

To make matters even worse, I noticed that the same LUN was presented with one WWPN to an ESX host and with another WWPN to the VCB proxy (I am no SAN expert myself, but I assume it is possible to present the same LUN in different SAN zones using different WWPNs). I was able to verify this because VCB could do a SAN backup of a virtual machine residing on a LUN whose WWPN on the ESX side is not presented to the VCB proxy.

The next step: VMFS IDs as a unique identifier
So, if you cannot rely on the WWPNs to uniquely identify a LUN on a host that is connected to multiple SANs, then surely VCB must use the VMFS ID to know which LUN to read the virtual machine data from? Right?

On the VCB proxy (the Windows machine), I tried to discover the VMFS IDs using the vcbSanDbg.exe tool (included in the VCB framework and available as a separate download from the VMware website -- careful, the separate download is an older version than the one included in the VCB 1.5 framework). An excerpt from its lengthy output:

C:\Program Files\VCB>vcbSanDbg | findstr "ID: NAA: volume"
[info] Found logical volume 48761b97-a4f562bd-6875-0017085d.
[info] Found logical volume 48761bc5-3f508baa-2f5d-0017085d.
[info] Found logical volume 483cf913-05b4f526-45b5-001cc497.
[info] Found logical volume 479da7ac-55fe7dfe-378c-001cc497.
[info] Found logical volume 477c2b4a-7db36616-30ea-001cc495.
[info] Found logical volume 48843bec-154cf784-871a-001cc495.
[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] ID: LVID:48761b97-dacedf9f-ebb9-0017085d0f91/48761b97-a4f562bd-6875-0017085d0f91/1
Name: 48761b97-a4f562bd-6875-0017085d
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] ID: LVID:48761bc6-7b4afa63-97d9-0017085d0f91/48761bc5-3f508baa-2f5d-0017085d0f91/1
Name: 48761bc5-3f508baa-2f5d-0017085d
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] ID: LVID:483cf913-458f9fa5-a749-001cc497e630/483cf913-05b4f526-45b5-001cc497e630/1
Name: 483cf913-05b4f526-45b5-001cc497
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] ID: LVID:479da7b6-877867e9-dd06-001cc497e630/479da7ac-55fe7dfe-378c-001cc497e630/1
Name: 479da7ac-55fe7dfe-378c-001cc497
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] ID: LVID:477c2b4a-969e01e0-8d49-001cc495fb46/477c2b4a-7db36616-30ea-001cc495fb46/1
Name: 477c2b4a-7db36616-30ea-001cc495
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130
[info] ID: LVID:48843bec-28cc17a4-ca9e-001cc495fb46/48843bec-154cf784-871a-001cc495fb46/1
Name: 48843bec-154cf784-871a-001cc495


Unfortunately, I was not able to find the VMFS IDs I saw on the ESX host in this output, even though there are some resemblances:
  • ESX host VMFS ID 483cf914-29b60dc5-dbfd-001cc497e630 looks a lot like logical volume 483cf913-05b4f526-45b5-001cc497 in the vcbSanDbg.exe output.

  • ESX host VMFS ID 479da7c1-4494cd90-d327-001cc497e630 looks a lot like logical volume 479da7ac-55fe7dfe-378c-001cc497 in the vcbSanDbg.exe output.
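
The resemblance in the two bullets above can be made concrete with a small Python sketch. It is only a heuristic of my own for illustration: it treats an ESX VMFS ID and a vcbSanDbg logical volume name as look-alikes when the (truncated) last field of the volume name is a prefix of the last field of the VMFS ID and the first hexadecimal fields are numerically close. The threshold MAX_FIRST_FIELD_GAP is an arbitrary assumption, and nothing here implies a documented relationship between the two kinds of ID.

```python
# IDs copied from the esxcfg-vmhbadevs -m output on the ESX host.
ESX_VMFS_IDS = [
    "48858dc4-f4e218d1-d3a8-001cc497e630",
    "483cf914-29b60dc5-dbfd-001cc497e630",
    "479da7c1-4494cd90-d327-001cc497e630",
]

# Logical volume names copied from the vcbSanDbg.exe output on the proxy.
VCB_VOLUMES = [
    "48761b97-a4f562bd-6875-0017085d",
    "48761bc5-3f508baa-2f5d-0017085d",
    "483cf913-05b4f526-45b5-001cc497",
    "479da7ac-55fe7dfe-378c-001cc497",
    "477c2b4a-7db36616-30ea-001cc495",
    "48843bec-154cf784-871a-001cc495",
]

# Arbitrary closeness threshold for the first field (an assumption).
MAX_FIRST_FIELD_GAP = 0xFF


def resembles(vmfs_id, volume_name, max_gap=MAX_FIRST_FIELD_GAP):
    """Heuristic: matching tail prefix and numerically close first field."""
    v_first, _, _, v_last = vmfs_id.split("-")
    n_first, _, _, n_last = volume_name.split("-")
    same_tail = v_last.startswith(n_last)
    gap = abs(int(v_first, 16) - int(n_first, 16))
    return same_tail and gap <= max_gap


def look_alikes(vmfs_ids, volume_names):
    """Map each VMFS ID to the volume names that resemble it."""
    return {
        vmfs_id: [name for name in volume_names if resembles(vmfs_id, name)]
        for vmfs_id in vmfs_ids
    }


if __name__ == "__main__":
    for vmfs_id, names in look_alikes(ESX_VMFS_IDS, VCB_VOLUMES).items():
        print(vmfs_id, "->", names or "no look-alike")
```

With the IDs from this post, the two SAN volumes each get exactly one look-alike and the local disk gets none, but none of the names is an exact match, which is why this approach is not a reliable way to identify a LUN.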

Furthermore, I found out that current versions of VCB do not rely on the VMFS ID to discover virtual machines on a LUN. Andy Tucker's talk "VMware Consolidated Backup: today and tomorrow" at VMworld 2007 clearly states (slide 19) that there is:
No “VMFS Driver for Windows” on proxy

The same talk puts the use of VMFS signatures for identifying LUNs on the SAN on the "todo" list (slide 34).

Other ideas?
So where does one turn when all possible solutions seem to lead to a dead end? Right: the VMware community forums. The answer came in this thread by snapper.

What I learned today is that besides the WWPN on a Fibre Channel network, there is another unique identifier called the NAA (Network Address Authority) to identify devices on the FC fabric. You can obtain the NAA for the LUNs on an ESX host by running the esxcfg-mpath command in verbose mode:

esxcfg-mpath -lv | grep ^Disk | grep -v vmhba0 | awk '{print $3,$5,$2}' | cut -b15-

The output on our ESX host looks much like this:

6005076303ffc60b0000000000001049323130373930 (256000MB) vmhba1:4:1
6005076303ffc60b000000000000104a323130373930 (256000MB) vmhba1:4:2

The NAA can be seen in the vcbSanDbg.exe output shown above, and can be filtered as follows:

vcbSanDbg.exe | findstr "NAA:"


The output should look like this:

C:\Program Files\VCB>vcbSanDbg | findstr "NAA:"

[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130


Et voilà, now I can start running the esxcfg-mpath command on all our ESX hosts and matching these NAAs with those in the output of vcbSanDbg to discover what our Windows VCB proxy has access to.
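
The matching itself is easy to automate once both outputs are saved as text. Below is a minimal Python sketch, assuming the ESX side is the filtered esxcfg-mpath output shown above (one "NAA (size) path" line per LUN) and the proxy side is the vcbSanDbg output containing "NAA:" lines. The function names and the inline sample strings are illustrative only.

```python
import re

# Filtered esxcfg-mpath -lv output from the ESX host (copied from above).
ESX_OUTPUT = """\
6005076303ffc60b0000000000001049323130373930 (256000MB) vmhba1:4:1
6005076303ffc60b000000000000104a323130373930 (256000MB) vmhba1:4:2
"""

# vcbSanDbg | findstr "NAA:" output from the VCB proxy (copied from above).
VCB_OUTPUT = """\
[info] Found SCSI Device: NAA:600508b10010443953555534314200044c4f47494341
[info] Found SCSI Device: NAA:60060e801525180000012518000000374f50454e2d56
[info] Found SCSI Device: NAA:600508b4000901eb0001100003230000485356323130
[info] Found SCSI Device: NAA:600508b4000901eb0001100003260000485356323130
[info] Found SCSI Device: NAA:6005076303ffc60b0000000000001049323130373930
[info] Found SCSI Device: NAA:6005076303ffc60b000000000000104a323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128d323130373930
[info] Found SCSI Device: NAA:6005076303ffc403000000000000128e323130373930
[info] Found SCSI Device: NAA:600508b40006e8890000b000010a0000485356323130
[info] Found SCSI Device: NAA:600508b40006e8890000b00003770000485356323130
"""

NAA_RE = re.compile(r"NAA:([0-9a-fA-F]+)")


def parse_esx(text):
    """Return {naa: (size, path)} from 'NAA (size) vmhbaX:Y:Z' lines."""
    luns = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            naa, size, path = fields
            luns[naa.lower()] = (size.strip("()"), path)
    return luns


def parse_vcb(text):
    """Return the set of NAA identifiers reported by vcbSanDbg."""
    return {match.group(1).lower() for match in NAA_RE.finditer(text)}


def match_luns(esx_text, vcb_text):
    """Split the ESX LUNs into those the proxy sees and those it does not."""
    esx = parse_esx(esx_text)
    vcb = parse_vcb(vcb_text)
    visible = {naa: info for naa, info in esx.items() if naa in vcb}
    hidden = {naa: info for naa, info in esx.items() if naa not in vcb}
    return visible, hidden


if __name__ == "__main__":
    visible, hidden = match_luns(ESX_OUTPUT, VCB_OUTPUT)
    for naa, (size, path) in visible.items():
        print("proxy sees    ", path, size, naa)
    for naa, (size, path) in hidden.items():
        print("proxy misses  ", path, size, naa)
```

With the sample data from this post, both SAN LUNs of the ESX host show up as visible to the proxy. Running the same comparison against the saved output of each ESX host gives the full picture.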
