Live Migration considerations for vHBA-based VMs running on Windows Server 2012/2012 R2 with the default Microsoft DSM

Hello everyone! My name is Palash Acharyya, and today I am going to talk about a support case I worked on. In that case we determined that the storage vendor’s DSM was preferable to MSDSM when their storage was used by VMs with virtual Fibre Channel adapters (vHBAs) running on Windows Server 2012/2012 R2.

Summary:

When a VM is Live Migrated from one host to another, Hyper-V does not preserve the TPGID/RTPID of the VM’s MPIO paths. Unknown to the VM’s DSM (e.g. MSDSM), a given path may now be in a different target port group. This can cause the VM to incorrectly think a Standby path is actually Active/Optimized (and vice versa), so I/O can get routed down a Standby path. The result can be significantly delayed I/O within the VM. In a worst-case scenario, Live Migration failures may also occur.

This case involved one particular storage vendor. The issue is resolved in their updated DSM, which monitors for TPGID/RTPID changes in the VM. Customers can contact their DSM vendor for specifics.

At times, we might see significantly decreased I/O performance in the VM after a Live Migration. On rare occasions, the Live Migration itself might fail. The root cause is a TPGID/RTPID mismatch (detailed below), which we see frequently. Typically, there is no error message; the only symptom is degraded I/O performance to the SAN. After a live migration, mpclaim reports the following for MPIO Disk 0:

C:\Users\Administrator>mpclaim -s -d 0

MPIO Disk0: 08 Paths, Round Robin with Subset, Implicit Only
Controlling DSM: Microsoft DSM
SN: ED8C7618566F92206C9CE900A9C2D800
Supported Load Balance Policies: FOO RRWS LQD WP LB

Path ID State SCSI Address Weight
------------------------------------------------------------
0000000077060001 Standby 006|000|001|000 0
        TPG_State : Standby ,TPG_Id: 1, : 4

0000000077060000 Active/Optimized 006|000|000|000 0
        TPG_State : Active/Optimized ,TPG_Id: 2, : 8

0000000077050001 Standby 005|000|001|000 0
        TPG_State : Standby ,TPG_Id: 1, : 2

0000000077050000 Active/Optimized 005|000|000|000 0
        TPG_State : Active/Optimized ,TPG_Id: 2, : 6

0000000077040001 Active/Optimized 004|000|001|000 0
        TPG_State : Active/Optimized ,TPG_Id: 2, : 5

0000000077040000 Standby 004|000|000|000 0
        TPG_State : Standby ,TPG_Id: 1, : 1

0000000077030001 Active/Optimized 003|000|001|000 0
        TPG_State : Active/Optimized ,TPG_Id: 2, : 7

0000000077030000 Standby 003|000|000|000 0
        TPG_State : Standby ,TPG_Id: 1, : 3
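
If you want to work with this listing programmatically, here is a minimal Python sketch (not part of the original case work) that parses the path entries above into records and groups the path states by TPG_Id. The function name parse_mpclaim_paths and the regular expressions are hypothetical and written only against the layout shown above. We are also assuming that the unlabeled number after TPG_Id is the RTPID.

```python
import re
from collections import defaultdict

# Path entries copied from the mpclaim listing above (blank lines dropped).
LISTING = """\
0000000077060001 Standby 006|000|001|000 0
        TPG_State : Standby ,TPG_Id: 1, : 4
0000000077060000 Active/Optimized 006|000|000|000 0
        TPG_State : Active/Optimized ,TPG_Id: 2, : 8
0000000077050001 Standby 005|000|001|000 0
        TPG_State : Standby ,TPG_Id: 1, : 2
0000000077050000 Active/Optimized 005|000|000|000 0
        TPG_State : Active/Optimized ,TPG_Id: 2, : 6
0000000077040001 Active/Optimized 004|000|001|000 0
        TPG_State : Active/Optimized ,TPG_Id: 2, : 5
0000000077040000 Standby 004|000|000|000 0
        TPG_State : Standby ,TPG_Id: 1, : 1
0000000077030001 Active/Optimized 003|000|001|000 0
        TPG_State : Active/Optimized ,TPG_Id: 2, : 7
0000000077030000 Standby 003|000|000|000 0
        TPG_State : Standby ,TPG_Id: 1, : 3
"""

# Assumed layout of a path line: 16 hex digits, state, SCSI address, weight.
PATH_RE = re.compile(
    r"^([0-9A-Fa-f]{16})\s+(\S+)\s+(\d{3}\|\d{3}\|\d{3}\|\d{3})\s+(\d+)$"
)
# Assumption: the unlabeled number after TPG_Id is the RTPID.
TPG_RE = re.compile(r"TPG_Id:\s*(\d+)\s*,\s*:\s*(\d+)")


def parse_mpclaim_paths(text):
    """Return one dict per path found in mpclaim-style text."""
    paths = []
    for raw in text.splitlines():
        line = raw.strip()
        path = PATH_RE.match(line)
        if path:
            paths.append({
                "path_id": path.group(1),
                "state": path.group(2),
                "scsi": path.group(3),
            })
            continue
        tpg = TPG_RE.search(line)
        if tpg and paths:
            # The TPG line belongs to the most recent path line.
            paths[-1]["tpg_id"] = int(tpg.group(1))
            paths[-1]["rtpid"] = int(tpg.group(2))
    return paths


paths = parse_mpclaim_paths(LISTING)

# Collect the set of path states seen in each target port group.
states_by_tpg = defaultdict(set)
for p in paths:
    states_by_tpg[p["tpg_id"]].add(p["state"])

for tpg_id in sorted(states_by_tpg):
    print(tpg_id, sorted(states_by_tpg[tpg_id]))
```

For this listing, every path in TPG_Id 1 is Standby and every path in TPG_Id 2 is Active/Optimized. That is the mapping the DSM keeps relying on after the migration.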

Above you see eight paths to the volume: four Active/Optimized and four Standby. Note the TPG_Id values. These values never change during the lifetime of the volume (at least not with MSDSM). After a Hyper-V Live Migration, we noticed that some of those TPG_Id values were no longer correct. So, the vendor's developers wrote an in-house tool to send an INQUIRY VPD page 83h request down each path and display the “true” TPG_Id. When we tested with MSDSM, we found the following:

----------------------------------------------
System Disk, Path ID, SCSI Address, TPG_Id
----------------------------------------------
          2, 77060001, 006|000|001|000, 1:4
          2, 77060000, 006|000|000|000, 2:8
          2, 77050001, 005|000|001|000, 1:2
          2, 77050000, 005|000|000|000, 2:6
          2, 77040001, 004|000|001|000, 2:5
          2, 77040000, 004|000|000|000, 1:1
          2, 77030001, 003|000|001|000, 1:3
          2, 77030000, 003|000|000|000, 2:7

Above we see that Path IDs 77030001 and 77030000 are now out of sync. The DSM thinks Path ID 77030001 is at TPG_Id 2:7, but it is now at 1:3. Likewise, the DSM thinks Path ID 77030000 is at TPG_Id 1:3, but it is now at 2:7. This causes the DSM to incorrectly route I/O down a Standby path about 25% of the time (i.e. 1 of the 4 paths it treats as active is out of sync).

Both Hyper-V hosts were running Windows Server 2012 R2 Standard with all updates. The failures were seen with guest VMs running 2008 R2, 2012, or 2012 R2.
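
The comparison can be reproduced with a few lines of Python. This is only a sketch: the two dictionaries are typed in by hand from the listings above, and the names dsm_view, true_view, stale_paths and misrouted_share are hypothetical. We assume that TPG_Id 1 is still Standby and TPG_Id 2 is still Active/Optimized on the array. We also assume that I/O is spread evenly across the paths the DSM believes are Active/Optimized.

```python
# Path ID -> (TPG_Id, RTPID) as cached by MSDSM (from the mpclaim listing).
dsm_view = {
    "77060001": (1, 4), "77060000": (2, 8),
    "77050001": (1, 2), "77050000": (2, 6),
    "77040001": (2, 5), "77040000": (1, 1),
    "77030001": (2, 7), "77030000": (1, 3),
}

# Path ID -> (TPG_Id, RTPID) as reported by the vendor's VPD 83h tool.
true_view = {
    "77060001": (1, 4), "77060000": (2, 8),
    "77050001": (1, 2), "77050000": (2, 6),
    "77040001": (2, 5), "77040000": (1, 1),
    "77030001": (1, 3), "77030000": (2, 7),
}

# Assumption: access state of each target port group, taken from the listing.
TPG_STATE = {1: "Standby", 2: "Active/Optimized"}


def stale_paths(dsm, true):
    """Paths whose cached TPG_Id/RTPID no longer matches the real one."""
    return sorted(p for p in dsm if dsm[p] != true[p])


def misrouted_share(dsm, true):
    """Share of the paths the DSM treats as active that are really Standby."""
    believed_active = [
        p for p, (tpg, _) in dsm.items()
        if TPG_STATE[tpg] == "Active/Optimized"
    ]
    really_standby = [
        p for p in believed_active
        if TPG_STATE[true[p][0]] == "Standby"
    ]
    return len(really_standby) / len(believed_active)


print(stale_paths(dsm_view, true_view))
print(misrouted_share(dsm_view, true_view))
```

With the values above, the two stale paths are 77030000 and 77030001, and one of the four paths the DSM treats as active is really Standby. That is where the 25% figure comes from.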

Terms used above:

DSM: A Device Specific Module (DSM) incorporates knowledge of the manufacturer's hardware and interacts with the MPIO driver.

TPGID/RTPID: Target Port Group ID / Relative Target Port ID

More Information:

Page 18 of the publicly available "MPIO Users Guide for Windows Server 2012" states:

Determining whether to use the Microsoft DSM vs. a Vendor’s DSM

To determine which DSM to use with your storage, refer to information from your hardware storage array manufacturer. Multipath solutions are supported as long as a DSM is implemented in line with logo requirements for MPIO. Most multipath solutions for Windows today use the MPIO architecture and a DSM provided by the storage array manufacturer. You can use the Microsoft DSM provided by Microsoft in Windows Server 2012 if it is also supported by the storage array manufacturer. Refer to your storage array manufacturer for information about which DSM to use with a given storage array, as well as the optimal configuration of it.

NOTE : Multipath software suites available from storage array manufacturers may provide an additional value-add beyond the implementation of the Microsoft DSM because the software typically provides auto-configuration, heuristics for specific storage arrays, statistical analysis, and integrated management. We recommend using the DSM provided by the hardware storage array manufacturer to achieve optimal performance because the storage array manufacturer can make more advanced path decisions in their DSM that are specific to their array, which may result in quicker path failover times.

By default, path verification is disabled: PathVerifyEnabled is set to 0, and Get-MPIOSetting reports it accordingly. While we can change the value in the registry, some vendors change it as part of their DSM installation.

More about MPIO timers: https://technet.microsoft.com/en-us/library/ee619749(v=ws.10).aspx

MPIO User’s Guide for Windows Server 2012 : https://www.microsoft.com/en-us/download/details.aspx?id=30450

Palash Acharyya

Support Escalation Engineer | Microsoft Windows Core

Disclaimer : This information is provided ‘as-is’ with no warranties