Poor Cluster Shared Volume read/write performance when node is not CSV owner

CoreJN 1 Reputation point
2022-05-20T09:45:42.163+00:00

We have a 5 node Windows Server 2016 Failover Cluster setup using an HPE Nimble as shared storage. We're using the cluster for Hyper-V. All virtual machine VHDXs are stored on the cluster shared volume (CSV).

We're having problems with disk performance within VMs when the VM is running on a node that does not own the CSV storage.

When transferring files via SMB between two VMs that are both running on the node which owns the CSV, speeds are between 1.5GB/s and 2GB/s. If you move the storage ownership away from that node, speeds drop to ~100MB/s.

It looks as though the storage traffic is going over the 1 Gbps network, through the owner node, and then into the SAN. From what I understand, this shouldn't be the case unless the CSV has been set to redirected mode. (I've not confirmed this with Wireshark or anything yet; I'm working on that.)
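As a rough sanity check on that suspicion, the arithmetic below converts a link speed into its theoretical throughput ceiling. The function name `Convert-LinkSpeedToMBps` is made up for this sketch and is not a built-in cmdlet; it ignores protocol overhead, so real transfers will land somewhat below the result.

```powershell
# Hypothetical helper (not a built-in cmdlet): converts a link speed in
# gigabits per second to a theoretical ceiling in megabytes per second.
function Convert-LinkSpeedToMBps {
    param(
        [Parameter(Mandatory)]
        [double]$Gbps
    )
    # 1 Gbps = 1000 megabits per second, and there are 8 bits per byte.
    return ($Gbps * 1000) / 8
}

Convert-LinkSpeedToMBps -Gbps 1    # 125
Convert-LinkSpeedToMBps -Gbps 10   # 1250
```

A 1 Gbps link tops out at 125 MB/s before overhead, which is in line with the ~100MB/s seen when the node does not own the CSV.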

I've run the command Get-ClusterSharedVolumeState which returned the following:

BlockRedirectedIOReason : NotBlockRedirected
FileSystemRedirectedIOReason : NotFileSystemRedirected
Name : Cluster Disk 1
Node : HyperV03
StateInfo : Direct
VolumeFriendlyName : VM-CSV
VolumeName : \?\Volume{9323278e-8374-474c-b9e7-1097305c0d1f}\

BlockRedirectedIOReason : NotBlockRedirected
FileSystemRedirectedIOReason : NotFileSystemRedirected
Name : Cluster Disk 1
Node : Hyperv06
StateInfo : Direct
VolumeFriendlyName : VM-CSV
VolumeName : \?\Volume{9323278e-8374-474c-b9e7-1097305c0d1f}\

BlockRedirectedIOReason : NotBlockRedirected
FileSystemRedirectedIOReason : NotFileSystemRedirected
Name : Cluster Disk 1
Node : hyperv05
StateInfo : Direct
VolumeFriendlyName : VM-CSV
VolumeName : \?\Volume{9323278e-8374-474c-b9e7-1097305c0d1f}\

BlockRedirectedIOReason : NotBlockRedirected
FileSystemRedirectedIOReason : NotFileSystemRedirected
Name : Cluster Disk 1
Node : Hyperv04
StateInfo : Direct
VolumeFriendlyName : VM-CSV
VolumeName : \?\Volume{9323278e-8374-474c-b9e7-1097305c0d1f}\

BlockRedirectedIOReason : NotBlockRedirected
FileSystemRedirectedIOReason : NotFileSystemRedirected
Name : Cluster Disk 1
Node : Hyperv02
StateInfo : Direct
VolumeFriendlyName : VM-CSV
VolumeName : \?\Volume{9323278e-8374-474c-b9e7-1097305c0d1f}\

According to this output, redirection isn't the cause of the issue.
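With five nodes the full output is long, so a filter that shows only the nodes not in Direct mode can make the check quicker to repeat. The sketch below is one way to do that; `Select-NonDirectCsvState` is a hypothetical helper name, and it assumes the input objects carry the same properties as the output above.

```powershell
# Hypothetical helper: passes through only the per-node CSV states whose
# StateInfo is something other than Direct.
function Select-NonDirectCsvState {
    param(
        [Parameter(ValueFromPipeline)]
        [object]$State
    )
    process {
        # Compare the string form so enum values and plain strings both work.
        if ("$($State.StateInfo)" -ne 'Direct') {
            $State | Select-Object Node, VolumeFriendlyName, StateInfo,
                FileSystemRedirectedIOReason, BlockRedirectedIOReason
        }
    }
}

# On a cluster node (requires the FailoverClusters module):
# Get-ClusterSharedVolumeState | Select-NonDirectCsvState | Format-Table -AutoSize
```

Against the output shown above this would return nothing, since all five nodes report Direct.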

Can anyone think of any other reason why this might be happening?

Connections to the SAN have all been set up using the HPE Windows Toolkit, which configures the MPIO settings and various other bits for you. We've confirmed that all nodes can hit the expected transfer speeds of 1GB/s+, but only when that node takes ownership of the CSV.


2 answers

  1. CK LIM 6 Reputation points
    2022-08-03T02:20:26.733+00:00

    I have the same issue. We are running Windows Server 2019 Datacenter in a 5-node cluster.

    I asked the vendor to request a patch/fix from Microsoft, but they replied that this is by design and expected behaviour, which I do not agree with. I still want the vendor to pursue a fix/patch with Microsoft.

    1 person found this answer helpful.

  2. Josaphat-Baby Bisengo Kizimbukidi 0 Reputation points
    2024-11-21T12:57:09.74+00:00

    Key Observations:

    CSV Ownership: Performance degrades when the VM accesses the CSV through a non-owner node. This suggests that storage traffic is being routed through the owner node.

    Redirected Mode: Redirected mode is typically used when direct storage access fails, forcing traffic through SMB over the network. This can significantly reduce performance, especially with a 1 Gbps network.

    Network Bandwidth: The observed performance drop to ~100 MB/s is in line with the ceiling of a 1 Gbps network, which is consistent with I/O being redirected over that network.

    SMB Traffic: The significant drop in file transfer speeds between VMs suggests suboptimal network configuration or bandwidth issues.

    Recommended Solutions:

    **Verify Redirected Mode Status:**

    - Use PowerShell to check if the CSV is in redirected mode:
         Get-ClusterSharedVolumeState
    - If redirected mode is active, investigate the underlying reasons (e.g., direct storage access issues or network misconfigurations).

    **Configure a Dedicated Network for CSV/SMB Traffic:**

    - Ensure the CSV/SMB traffic is routed over a separate high-speed network, preferably 10 Gbps or higher, to handle the workload efficiently.

    **Check Network Topology and Configuration:**

    - Confirm that all cluster nodes can communicate directly with the shared storage (HPE Nimble) without relying on the CSV owner.
    - Verify that network adapters support SMB Multichannel and are configured correctly.

    **Manage CSV Ownership:**

    - Align the VM placement with the CSV ownership. Ensure VMs are running on the same node as their corresponding CSV to minimize redirected traffic.
    - Use PowerShell or cluster management tools to automate ownership realignment as needed.

    **Upgrade the SAN and Network Infrastructure:**

    - If the current SAN and network operate on a 1 Gbps connection, consider upgrading to 10 Gbps to avoid bandwidth bottlenecks.

    **Analyze Network Traffic with Wireshark:**

    - Use Wireshark to trace the traffic path and confirm whether storage traffic is being redirected via the CSV owner. This will help identify bottlenecks and misconfigurations.

    **Update Firmware and Drivers:**

    - Ensure that the network adapter drivers, cluster software, and HPE Nimble firmware are up to date to prevent compatibility or performance issues.

    **Review Workload Balancing:**

    - Distribute VM workloads across nodes while ensuring they remain aligned with the ownership of the CSV to reduce the dependency on redirected traffic.
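For the network checks above, one possible starting point is to see which cluster-enabled network has the lowest metric, since the Cluster Shared Volumes documentation describes that network as preferred for CSV traffic. The sketch below is an illustration only; `Get-ClusterNetworkByMetric` is a hypothetical helper, and it assumes input objects with the `Name`, `Role`, `Metric` and `Address` properties that `Get-ClusterNetwork` returns.

```powershell
# Hypothetical helper: orders cluster-enabled networks by metric, lowest first.
# Networks whose Role is None (shown as 0 on some versions) are skipped because
# the cluster does not use them for its own traffic.
function Get-ClusterNetworkByMetric {
    param(
        [Parameter(ValueFromPipeline)]
        [object]$Network
    )
    begin   { $all = @() }
    process { $all += $Network }
    end {
        $all |
            Where-Object { "$($_.Role)" -notin @('None', '0') } |
            Sort-Object Metric |
            Select-Object Name, Role, Metric, Address
    }
}

# On a cluster node (requires the FailoverClusters module):
# Get-ClusterNetwork | Get-ClusterNetworkByMetric | Format-Table -AutoSize
#
# Related checks, run on each node:
# Get-NetAdapter -Physical | Format-Table Name, InterfaceDescription, Status, LinkSpeed
# Get-SmbClientConfiguration | Select-Object EnableMultiChannel
# Get-SmbMultichannelConnection
```

If the first network listed is the 1 Gbps one, that would be consistent with the ~100 MB/s ceiling described in the question.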
                                    
    

    Immediate Steps to Take:

    1. Confirm if the CSV is operating in redirected mode and understand why.
    2. Temporarily assign VMs to the CSV owner node to restore performance.
    3. Plan for a high-speed network setup dedicated to cluster traffic.
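To see which VMs would be affected by the temporary workaround in step 2, something like the following could list clustered VMs that are not on the CSV owner node. This is a sketch under assumptions: `Get-VmOffCsvOwner` is a hypothetical helper, and it assumes a single CSV holds all VMs, as in the question. With several CSVs, the mapping from each VM to its volume would have to be added.

```powershell
# Hypothetical helper: lists clustered VM groups that are not running on the
# node that currently owns the CSV. Assumes one CSV holds every VM.
function Get-VmOffCsvOwner {
    param(
        [Parameter(Mandatory)]
        [string]$CsvOwner,
        [object[]]$VmGroup
    )
    foreach ($group in $VmGroup) {
        # OwnerNode is a node object on a real cluster; fall back to its
        # string form so plain strings also work.
        $node = if ($group.OwnerNode.Name) { $group.OwnerNode.Name } else { "$($group.OwnerNode)" }
        # -ne is case-insensitive, so HyperV03 and hyperv03 count as equal.
        if ($node -ne $CsvOwner) {
            [pscustomobject]@{ VM = $group.Name; RunningOn = $node; CsvOwner = $CsvOwner }
        }
    }
}

# On a cluster node (requires the FailoverClusters module):
# $csv = Get-ClusterSharedVolume -Name 'Cluster Disk 1'
# $vms = Get-ClusterGroup | Where-Object { $_.GroupType -eq 'VirtualMachine' }
# Get-VmOffCsvOwner -CsvOwner $csv.OwnerNode.Name -VmGroup $vms
#
# Moving CSV ownership to another node:
# Move-ClusterSharedVolume -Name 'Cluster Disk 1' -Node 'HyperV03'
```

This only identifies the affected VMs. Pinning every VM to the owner node is a stopgap, since all nodes in the question already report Direct mode.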