Private Cloud Simulator for Windows Server 2022
Introduction
The current industry trend is for private cloud solutions to comprise tightly integrated software and hardware components in order to deliver a resilient private cloud with high performance. Issues in any of the components (software, hardware, drivers, firmware, and so forth) can compromise the solution and undermine the promises made regarding a Service Level Agreement (SLA) for the private cloud.
Some of these issues are surfaced only under a high-stress, cloud-scale deployment, and are potentially hard to find using traditional standalone, component-focused tests. The Private Cloud Simulator is a cloud validation test suite that enables you to validate your components in a cloud scenario and identify these types of issues.
Target Audience
This document is intended for those validating their hardware for Windows Server Logo, Microsoft Azure Stack solutions, and Microsoft Azure Stack HCI solutions.
Test Overview
Private Cloud Simulator (PCS) simulates a live datacenter/private cloud by creating VM workloads, simulating datacenter operations (load balancing, software/hardware maintenance), and injecting compute/storage faults (unplanned hardware/software failure). PCS uses a Microsoft SQL Server database to record test and solution data during the run. It then presents a report that includes operation pass/fail rates and logs, which let you correlate data for pass/fail determination and failure diagnosis (as applicable).
Links to the required files
The table below contains links to the files that you need to download to run PCS tests.
Name | Location |
---|---|
HLK Kit | Windows Server 2022 HLK |
HLK Azure Stack Update Package | Download the latest version of the HLK update from Microsoft Collaborate site. FileName: Windows HLK WS 2022 Update Package-210527.zip |
HLK Playlist | Windows Server 2022 HLK CompatPlaylist |
PCSFiles.vhd | PCSFiles.vhd SHA256 hash value is 95B1BCC54E38B459943CDEE26F09B04DAE828CAA8A95151E46E769A9A1927F61 |
Windows Server 2022 Update | Install the latest version at Windows Update site |
You can use the Get-FileHash PowerShell cmdlet to compute the hash value for a file.
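As a minimal sketch of that check, the function below compares the SHA256 hash of a downloaded file with an expected value such as the one listed in the table. The function name Test-FileSha256 and the example path are illustrative assumptions and are not part of HLK or PCS.

```powershell
# Hypothetical helper (not part of HLK): returns $true when the SHA256 hash
# of the file matches the expected value. String comparison with -eq is
# case-insensitive in PowerShell, so upper- and lowercase hashes both work.
function Test-FileSha256 {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $ExpectedHash
    )
    $actual = (Get-FileHash -Path $Path -Algorithm SHA256).Hash
    return ($actual -eq $ExpectedHash.Trim())
}

# Example usage. The download path is an assumption; adjust it to your lab.
# Test-FileSha256 -Path 'C:\Downloads\PCSFiles.vhd' `
#     -ExpectedHash '95B1BCC54E38B459943CDEE26F09B04DAE828CAA8A95151E46E769A9A1927F61'
```

If the function returns $false, download the file again before copying it to the HLK controller.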
Common Lab Infrastructure Setup
Topology
The PCS lab environment contains the following elements:
- An Active Directory domain controller/DNS/DHCP server for the test domain.
- You can find information about Active Directory at https://msdn.microsoft.com/library/bb727067.aspx
- The Active Directory Domain Services functional level must be Windows Server 2012 or higher.
- A dedicated HLK controller machine. OS must be Windows Server 2019.
- A dedicated PCS controller machine. OS must be Windows Server 2022.
- A compute cluster, which hosts Hyper-V virtual machines. The minimum number of nodes depends on the type of PCS jobs.
Supporting Documents:
- Deploy a Hyper-Converged Cluster using Storage Spaces Direct
- Failover-Clustering
- Microsoft Azure Stack Logo Requirements released via Microsoft Collaborate to Microsoft partners
Notes:
- All the above machines must be joined to the same test domain.
- All PCS tests need to be run as the same user in the 'Domain Admins' group for the test domain.
- Use the same user with Domain Admin credentials to install the HLK controller.
HLK Controller System Requirements
Minimum system requirements are as shown in the table below.
Resource | Minimum requirement |
---|---|
CPU (or vCPU) | 4 cores |
Memory | 12 GB RAM |
Available disk space | 200 GB |
Operating system | Windows Server 2019 Datacenter |
Active Directory domain | Join it to the test domain |
HLK Controller Setup
Download Windows HLK Studio & follow the Windows HLK Getting Started guide to set up Windows HLK.
Download HLK Azure Stack Update Package-WSSD Premium (required for LAN.AzureStack profile) & follow the steps listed below for applying the updates in HLK Studio.
Download PCSFiles.vhd
Copy the PCSFiles.vhd file to the Tests\amd64 test folder on the HLK Controller. Below is the default path for an HLK installation:
C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64
Steps to install the HLK Azure Stack Update Package (for LAN.AzureStack): First, download the HLK PCS update for the LAN.AzureStack profile, Windows HLK WS 2022 Update Package-210527.zip. Extract the zip archive, then replace the files in the HLK folder with the ones from the update package as follows:
- Replace Microsoft.Networking.Test.Common.dll in *C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64\Pcs*
- Replace PrivateCloudSimulator-Manager.psm1 in *C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64\Pcs*
- Replace LaunchCreateExportVM.ps1 in *C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64\Pcs\Exports*
- Alternatively, run the following commands from an elevated prompt inside the extracted folder:
- xcopy .\Microsoft.Networking.Test.Common.dll "C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64\Pcs\Microsoft.Networking.Test.Common.dll" /O /F /R /Y
- xcopy .\PrivateCloudSimulator-Manager.psm1 "C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64\Pcs\PrivateCloudSimulator-Manager.psm1" /O /F /R /Y
- xcopy .\LaunchCreateExportVM.ps1 "C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64\Pcs\Exports\LaunchCreateExportVM.ps1" /O /F /R /Y
Get IOMeter files
IOMeter is a workload that must be installed on the HLK controller.
Download the i386 Windows version of IOMeter release dated 2006.07.27 from the IOMeter website.
Run the setup (or unzip the package) to unpack the files.
Copy IOMeter.exe and Dynamo.exe to the Tests\amd64\pcs\GuestScenarioManager\IOMeter folder on the HLK controller. Below is the default path for an HLK installation:
C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64\pcs\GuestScenarioManager\IOMeter
PCS Controller System Requirements
Minimum system requirements are as shown in the table below.
Resource | Minimum requirement |
---|---|
CPU (or vCPU) | 4 cores |
Memory | 12 GB RAM |
Free space on the boot drive | 200 GB |
Operating system | Windows Server 2022 Datacenter |
Active Directory domain | Join it to the test domain |
PCS Controller Setup
- The PCS controller MUST be a Generation 2 VM or a physical machine.
- Secure Boot and BitLocker MUST be disabled. This is required because PCS enables the TestSigning boot configuration. If you are using a Generation 2 Hyper-V VM as the PCS controller, stop the VM to disable Secure Boot in the VM's settings.
- Install the HLK Client using the Windows HLK Getting Started guide and open the requisite ports.
- Install .NET Framework 3.5 (This feature is not included by default in Windows Server 2022).
- Generic Installation Instructions can be found at the following locations:
- For builds released via Microsoft Connect, see details below:
Mount the ISO supplied with the build and find the file at MountedDriveLetter:\sources\sxs\microsoft-windows-netfx3-ondemand-package.cab
Copy the file to a local folder on the PCS controller
Install the package by executing this command line using admin privileges
Add-WindowsFeature Net-Framework-Features -source <Local Folder>
PCS Tests
This section discusses how to find an appropriate PCS test for your device/solution, configure the lab, and kick off PCS execution.
- You need to use the same domain admin user account to set up the lab and run tests.
- Secure Boot State must be OFF on all nodes and on the PCS controller.
PCS Test Selection
The PCS jobs are used to certify multiple categories of devices and solutions. The table below maps each category to the appropriate PCS job.
Target | Certification Program | Job Name in HLK |
---|---|---|
NIC | Windows Server Logo | PrivateCloudSimulator-Device.Network.LAN.10GbOrGreater |
NIC | SDDC Standard | PrivateCloudSimulator-Device.Network.LAN.10GbOrGreater |
NIC | SDDC Premium | PrivateCloudSimulator-Device.Network.LAN.AzureStack |
NIC | AZURESTACK | PrivateCloudSimulator-Device.Network.LAN.AzureStack |
Solution | SDDC Standard | PrivateCloudSimulator-System.Solutions.StorageSpacesDirect (MIN) & (MAX) |
Solution | SDDC Premium | PrivateCloudSimulator-System.Solutions.StorageSpacesDirect (MIN) & (MAX) |
Solution | AZURESTACK | PrivateCloudSimulator-System.Solutions.AzureStack (MIN) & (MAX) |
PCS jobs are summarized below:
- PrivateCloudSimulator - Device.Network.LAN.10GbOrGreater
This test contains a set of actions that specifically target the network adapter device, along with VM and compute cluster actions.
- PrivateCloudSimulator - Device.Network.LAN.AzureStack
This test contains an extended set of actions that verify network adapter support for the new 'Software Defined Networking' feature in Windows Server, along with VM and compute cluster actions.
- PrivateCloudSimulator - System.Solutions.StorageSpacesDirect (MIN)/(MAX)
This test contains an extended set of actions that target the entire solution built on a hyper-converged Storage Spaces Direct cluster. The (MIN) test should be run on a cluster with the minimum number of supported nodes for the solution. The (MAX) test should be run on a cluster with the maximum number of supported nodes for the solution.
- PrivateCloudSimulator - System.Solutions.AzureStack (MIN)/(MAX)
This test contains an extended set of actions that target the entire AzureStack solution. The (MIN) test should be run on a cluster with the minimum number of supported nodes for the solution. The (MAX) test should be run on a cluster with the maximum number of supported nodes for the solution.
PCS Job Execution Flow
Each PCS job contains the following tasks.
- Initialize PCS Controller
- In this stage, the PCS execution engine sets up SQL Server and IIS on the PCS controller machine
- It also copies content (e.g. evaluation OS VHD files) to enable VM creation in the next stage
- Create VMs
- This stage sees the PCS engine start creating VMs on each node of the cluster
- VM creation stops when the target number of VMs/node has been reached.
- This step is part of the PCS setup phase. The test run duration timer starts after this stage.
- Run PCS Actions
- Now, PCS initiates various types of actions (VM, Cluster, Storage, Network) on each node of the cluster.
- Actions run in parallel and co-ordinate among themselves to exercise the device (storage, network) and the solution through the private cloud/datacenter lifecycle
- Actions run periodically and stop once the target execution time (defined by the profile/job) of the test has been reached.
- The test execution time is defined per profile and varies with the profile you run. The test execution timer starts after all the VMs are created.
- The steps in each action and the corresponding result of each step are stored in SQL Server.
- Cleanup Run
- In this stage, the VMs created by PCS are cleaned up and the cluster is restored to a clean state (as far as possible).
- It generates a report file (PcsReport.htm) and a ZIP file that contains test logs.
- Report result in HLK Studio
- In this stage, HLK Studio reports the result of the PCS run.
- The result can be packaged into an HLKX file for submission to Microsoft.
Execute PCS Tests
PrivateCloudSimulator - Device.Network.LAN.10GbOrGreater
System Requirements
Requirement | Description |
---|---|
Component Being certified | NIC |
Setup Type | Hyper-converged setup with S2D storage. Note: An SDDC certified HBA is required. |
Minimum Number of Server Nodes | 3 identical machines |
Server Spec | CPU: 16 physical cores (e.g. 2 sockets with 8 cores), MEMORY: 128 GB, 64GB free space on boot drive |
Storage Overall | 4 TB free space per node on HDD, 800 GB free space per node on SSD |
Disk | If there are drives used as cache, there must be at least 2 per server. There must be at least 4 capacity (non-cache) drives per server. See S2D hardware requirements for more information. |
Network Card | NIC being certified |
Switch | Switch supporting all NIC features |
Setup
- Follow the Windows HLK Getting Started guide to install HLK client software on all cluster nodes.
- Follow the Windows Server Storage Spaces Direct cluster guide to deploy a cluster.
- All nodes must be connected to the same physical switches.
- A networking bitrate of 10 GbE or better must be used. Create a virtual switch with the same name on each node.
- Virtual machines created by PCS connect to the virtual switch to send network traffic between them. These VMs get their IP addresses via DHCP. Make sure your DHCP server assigns valid IP addresses to these VMs. If the DHCP server is not available or fails, the VMs use Automatic Private IP Addressing (APIPA) to self-configure an IP address and subnet. Each VM must have a valid IP address to send network traffic between VMs.
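To spot VMs that fell back to APIPA, you can check whether an address lies in the 169.254.0.0/16 range. The sketch below is a hypothetical helper; the function name Test-IsApipaAddress and the node name in the usage comment are assumptions, not part of PCS.

```powershell
# Hypothetical helper: returns $true if an IPv4 address is in the APIPA
# range 169.254.0.0/16, which suggests that DHCP assignment did not succeed.
function Test-IsApipaAddress {
    param([Parameter(Mandatory)] [string] $Address)
    $ip = [System.Net.IPAddress]::Parse($Address)
    $bytes = $ip.GetAddressBytes()
    return ($ip.AddressFamily -eq [System.Net.Sockets.AddressFamily]::InterNetwork -and
            $bytes[0] -eq 169 -and $bytes[1] -eq 254)
}

# Example usage on a cluster node ('Node1' is a placeholder name):
# Get-VM -ComputerName 'Node1' | Get-VMNetworkAdapter |
#     Select-Object VMName, IPAddresses
# Then pass each reported IPv4 address to Test-IsApipaAddress.
```

Any VM that reports an APIPA address points to a DHCP scope or connectivity problem that should be fixed before the run.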
Execute
Open HLK Studio
Follow the Windows HLK Getting Started guide to create a machine pool
Navigate to the Project tab and click Create Project
Enter a project name and press Enter
Navigate to the Selection tab
Select the machine pool containing the network adapter device
Select device manager
Select the device. You can select any relevant NIC device that is targeted for certification on any of the compute nodes; it does not matter which member of the virtual switch team you select.
Right-click on the selected device and select Add/Modify Features
In the features dialog, select Device.Network.LAN.10GbOrGreater and then click OK. For most NIC cards (with speeds 10GbE or higher) this feature should have been selected automatically.
Navigate to the Tests tab
Select PrivateCloudSimulator - Device.Network.LAN.10GbOrGreater
Click Run Selected
In the Schedule dialog,
- Enter values for the required test parameters
- DomainName: Test user's domain name
- UserName: Test user's user name
- Password: Test user's password
- ComputeCluster: Name of compute cluster
- StoragePath: Default value is "". It uses all the available CSVs from compute cluster. You can use different paths by entering comma separated paths. Example: "C:\ClusterStorage\Volume1,C:\ClusterStorage\Volume2"
- VmSwitchName: Name of virtual switch on all nodes
- FreeDriveLetter: Default value is R. During setup, PcsFiles.vhd file is mounted to this drive letter on PCS controller. Make sure this drive letter is available.
- IsCreateCluster: Use default value
- IsRemoveCluster: Use default value
- IsConfigureHyperV: Use default value
- Map machines to roles
- PrimaryNode: This is the node with the selected device
- Test Controller: Select PCS test controller machine
- OtherNodes: Select other cluster nodes
Click OK to schedule the test
Please refer to View PCS report in real-time through SQL Server Reporting Services to view the real-time results for the test run.
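Before scheduling the test, you may want to confirm that the drive letter you plan to pass as FreeDriveLetter is not already in use on the PCS controller. The following is a minimal sketch; Test-DriveLetterFree is a hypothetical helper name, and it only sees drives visible to the current session.

```powershell
# Hypothetical helper: returns $true when no drive visible to this session
# uses the given letter. Run it on the PCS controller.
function Test-DriveLetterFree {
    param(
        [Parameter(Mandatory)]
        [ValidatePattern('^[A-Za-z]$')]
        [string] $Letter
    )
    # Collect the letters of all drives that have a name such as "C:\".
    $used = [System.IO.DriveInfo]::GetDrives() |
        Where-Object { $_.Name -match '^[A-Za-z]:' } |
        ForEach-Object { $_.Name.Substring(0, 1) }
    # -notcontains compares case-insensitively.
    return ($used -notcontains $Letter)
}

# Example usage with the default value of the FreeDriveLetter parameter:
# Test-DriveLetterFree -Letter 'R'
```

If the function returns $false, either free the letter or enter a different value for FreeDriveLetter.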
Duration
- PCS actions (listed below) run for about 24 hours.
- The complete run may take an additional 24-36 hours (including time for setup and cleanup).
PCS Actions
The table below lists the actions that are included in this test.
Action Name | Description |
---|---|
VmCloneAction | Creates a new VM. |
VmLiveMigrationAction | Live-migrates the VM to another cluster node. |
VmSnapshotAction | Takes a snapshot of the VM. |
VmStateChangeAction | Changes the VM state (for example, to Paused). |
VmStorageMigrationAction | Migrates VM storage (the VHD(s)) between cluster nodes. |
VmGuestRestartAction | Restarts the VM. |
VmStartWorkloadAction | Starts a user-simulated workload. |
VmGuestFullPowerCycleAction | Power-cycles the VM. |
ComputeNodeEvacuationAction | Restart a cluster node. |
PrivateCloudSimulator - Device.Network.LAN.AzureStack
System Requirements
Requirement | Description |
---|---|
Component Being certified | NIC (with RDMA) |
Setup Type | Hyper-converged setup with S2D storage. Note: An SDDC certified HBA is required. |
Minimum Number of Server Nodes | 3 identical machines |
Server Spec | CPU: 16 physical cores (e.g. 2 sockets with 8 cores), MEMORY: 128 GB, 64GB free space on boot drive |
Storage Overall | 4 TB free space per node on HDD, 800 GB free space per node on SSD |
Disk | If there are drives used as cache, there must be at least 2 per server. There must be at least 4 capacity (non-cache) drives per server. See S2D hardware requirements for more information. |
Network Card | NIC being certified |
Switch | Switch supporting all NIC features |
Setup
The Hyper-V host that hosts the PCS Controller VM must run Windows Server 2022 or later.
Follow the Windows HLK Getting Started guide to install HLK client software on all cluster nodes
Follow the Windows Server Storage Spaces Direct cluster guide to deploy a cluster
For instructions to set up networking for Storage Spaces Direct, see Windows Server Converged NIC and Guest RDMA Deployment Guide.
PCS Controller VM should be built as a generation 2 VM and have 2 network interfaces, one for the management network and the other for SDN (PA address space) topology. The interface for SDN topology will be assigned an IP address from the IP address space passed in as the AddressPrefixes parameter.
All the nodes must be able to communicate with the PCS Controller VM at all times through a management interface. For this purpose, each server should have one additional NIC for management interface, which does not need to meet strict bitrate requirements.
All the nodes and PCS Controller must have the same most recent KB installed.
A networking bitrate of 10 GbE or better is required for the NICs under test. Each server should have two identical NICs of 10 GbE or greater.
If RDMA capable NICs are used, the physical switch must meet the associated RDMA requirements.
Set the NIC properties that are specific to AzureStack deployments to make sure that the NICs being certified support them. You can use the PowerShell Get-NetAdapterAdvancedProperty cmdlet to verify NIC properties.
- VXLAN Encapsulated Task Offload == Enabled
- Encapsulation Overhead == 160
- Jumbo Packet >= 1500
- MtuSize == 1660
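The checks listed above can be sketched as a small comparison function. This is only an illustration: Get-NicPropertyIssue is a hypothetical helper, the property names are taken from the list above, and the display names and value formats that Get-NetAdapterAdvancedProperty reports vary by NIC driver, so you may need to map them by hand.

```powershell
# Hypothetical helper: compares NIC property values against the requirements
# listed above and returns a list of problems (empty when all checks pass).
# $Values maps a property name to its value; numeric values must be plain
# numbers (for example 1514, not '1514 Bytes').
function Get-NicPropertyIssue {
    param([Parameter(Mandatory)] [hashtable] $Values)
    $issues = @()
    if ($Values['VXLAN Encapsulated Task Offload'] -ne 'Enabled') {
        $issues += 'VXLAN Encapsulated Task Offload must be Enabled'
    }
    if ([int]$Values['Encapsulation Overhead'] -ne 160) {
        $issues += 'Encapsulation Overhead must be 160'
    }
    if ([int]$Values['Jumbo Packet'] -lt 1500) {
        $issues += 'Jumbo Packet must be at least 1500'
    }
    if ([int]$Values['MtuSize'] -ne 1660) {
        $issues += 'MtuSize must be 1660'
    }
    # The leading comma keeps an empty result an array instead of $null.
    return ,$issues
}

# To see what the driver reports ('Name 1' is a placeholder adapter name):
# Get-NetAdapterAdvancedProperty -Name 'Name 1' |
#     Format-Table DisplayName, DisplayValue
```

A missing property is treated as a failure, which is usually what you want when checking whether a NIC exposes the setting at all.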
Make sure that every node contains a teaming-enabled virtual switch with the same name.
New-VMSwitch -Name SdnSwitch -NetAdapterName "Name 1","Name 2" -AllowManagementOS $true -EnableEmbeddedTeaming $true
Configure Nested Virtualization: Nested virtualization for the PCS Controller VM must be enabled. While the PCS VM is in the OFF state, run the following command on the Hyper-V host.
Set-VMProcessor -VMName <VMName> -ExposeVirtualizationExtensions $true
Make sure that RDMA is set up on all nodes and is reported when queried through Get-SMBClientNetworkInterface & Get-SMBServerNetworkInterface.
Live Migration settings (Failover Cluster Manager->Networks->Live Migration Settings) must be set appropriately to use storage network for live migrations.
This test creates virtual machines and sends traffic between them using the virtual switch that you created. The vNICs (virtual NICs) of the PCS virtual machines are assigned IP addresses from the IP address space passed in as the AddressPrefixes parameter.
Execute
Open HLK Studio
Navigate to the Project tab and click Create Project
Enter a project name and press Enter
Navigate to the Selection tab
Select the machine pool containing the network adapter device
Select device manager
Select the device. You can select any relevant NIC device that is targeted for certification on any of the compute nodes; it does not matter which member of the virtual switch team you select.
Right-click on the selected device and select Add/Modify Features
In the features dialog, select Device.Network.LAN.AzureStack and click OK.
Navigate to the Tests tab
Select PrivateCloudSimulator - Device.Network.LAN.AzureStack
Click Run Selected
In the Schedule dialog,
- Enter values for the required test parameters
- DomainName: Test user's fully qualified domain name (FQDN).
- UserName: Test user's user name
- Password: Test user's password
- ComputeCluster: compute cluster name
- StoragePath: Default value is ''. It uses all the available CSVs from compute cluster. You can use different paths by entering comma-separated paths. Volume names must not contain spaces. Example: 'C:\ClusterStorage\Volume1,C:\ClusterStorage\Volume2' (single quotes are needed)
- VmSwitchName: Name of virtual switch to be used for SDN. Example: SdnSwitch
- FreeDriveLetter: Default value is R. During setup, PcsFiles.vhd file is mounted to this drive letter on PCS controller. Make sure this drive letter is available.
- AdapterNames: Comma-separated list of adapter names that are part of the vmSwitch. Use the format "'Name 1', 'Name 2'" (double quotes and single quotes are needed) for multiple adapters. Names must be derived from the Get-NetAdapter cmdlet.
- VLan: Vlan ID set on vmSwitch. Only required if your physical switch is configured for Vlan. Enter '0' to indicate that there is no Vlan tagging.
- RDMAEnabled: Enter $True if NIC supports RDMA
- SetEnabled: Enter $True if NIC supports Switch Embedded Teaming
- HnvEnabled: Enter $True if NIC supports Hyper-V Network Virtualization
- TaskOffloadEnabled: Enter $True if NIC supports Encapsulate Task Offload
- TestControllerNetAdapterName: Adapter name on PCS Controller that can be assigned a static IP in the AddressPrefixes range to communicate with SDN Network Controller virtual machines. Example: 'Ethernet 2' (single quotes are needed if there are spaces in the name)
- VHDSourcePath: a VHDX file for Windows Server 2022 DataCenter. This VHD file will be used to create Network Controller VMs. Default value is c:\pcs\BaseVHDX\20348.1.amd64fre.fe_release.210507-1500_server_serverdatacentereval_en-us.vhd. NOTE: Make sure that for Windows Server 2022 the value is set to 20348.1.amd64fre.fe_release.210507-1500_server_serverdatacentereval_en-us.vhd.
- KBPackagePath: Comma-separated list of Windows Update packages to apply to the VHD file specified in the VHDSourcePath parameter. These update packages should match the ones installed on all cluster nodes and the PCS controller machine. Default value is '' (single quotes are needed), which means that no KB is injected into the VHD file.
- You should install the latest version or a recent version of the Windows Update packages. You can use the Get-Hotfix cmdlet to find out what is installed on your machines.
- Most Windows Update packages require you to install a 'servicing stack update (SSU)' first. In other words, you should enter at least two KBs in this parameter.
- Example:
- KB4501371 (June 18, 2019)
- In "How to get this update" section, it says 'servicing stack update (SSU)' KB4504369 is required.
- In this parameter, you should enter 'c:\KB\Windows-KB4504369-x64.msu,c:\KB\Windows-KB4501371-x64.msu'. (single quote is required, KB4504369 will be installed before installing KB4501371.)
- You need to download the MSU files from Windows Update site and copy them to c:\KB folder on the PCS controller machine.
- Important: The file name format MUST be "Windows-KBNumber-x64.msu". A dash (-) is required before and after KBNumber.
- AddressPrefixes: The IP address range to be used by Tenant VMs and Hosts. These addresses will be used for SDN datacenter management.
- VipPrefixes: Two IP address ranges that are used by SLB for VIP load balancing scenarios. Use the format "'192.160.2.0/23','192.160.3.0/23'" (double quotes and single quotes are needed)
- ClientAddressPrefix: The IP address range used by Client VMs.
- Map machines to roles
- PrimaryNode: This is the node with the selected device, automatically selected by HLK.
- Test Controller: Select PCS test controller machine
- OtherNodes: Select other cluster nodes
Click OK to schedule the test
Please refer to View PCS report in real-time through SQL Server Reporting Services to view the real-time results for the test run.
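Because the KBPackagePath parameter is sensitive to both the file name format and the quoting, a small helper can build the value for you. The sketch below is a hypothetical function (New-KbPackagePathValue is not part of PCS); it only checks the file names against the Windows-KBNumber-x64.msu format described above and does not verify that the files exist.

```powershell
# Hypothetical helper: validates MSU file names and builds the string to enter
# for KBPackagePath. Pass the packages in installation order, with the
# servicing stack update (SSU) first.
function New-KbPackagePathValue {
    param([Parameter(Mandatory)] [string[]] $MsuPaths)
    foreach ($path in $MsuPaths) {
        # Take the last path segment as the file name.
        $name = ($path -split '[\\/]')[-1]
        if ($name -notmatch '^Windows-KB\d+-x64\.msu$') {
            throw "File name '$name' does not match Windows-KBNumber-x64.msu"
        }
    }
    # Single quotes around the comma-separated list are required.
    return "'" + ($MsuPaths -join ',') + "'"
}

# Example usage with the KBs from the example above:
# New-KbPackagePathValue -MsuPaths 'c:\KB\Windows-KB4504369-x64.msu',
#                                  'c:\KB\Windows-KB4501371-x64.msu'
```

Renaming the downloaded MSU files to the expected format before calling the helper avoids a failed run caused by a rejected package name.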
Cleanup
If the test ends abruptly, use the C:\Pcs\ReRunPcsCleanup.cmd script on the PCS controller to clean up the state of the setup. It is very important that stale VMs and SDN infrastructure are cleaned up before you start a new run.
Please make sure the following items are cleaned up before starting a new run:
Clustered VM roles (FailoverClusterManager->Cluster->Roles)
Get-ClusterGroup -Cluster $clusterName
All the VMs created by PCS
Get-ClusterNode -Cluster $clusterName | % { Get-VM -ComputerName $_.Name }
vNics created by PCS/SDN
Get-ClusterNode -Cluster $clusterName | % { Get-VMNetworkAdapter -ComputerName $_.Name -ManagementOS | Select-Object ComputerName,Name,SwitchName }
Storage/CSV-volumes on the cluster do not have any entries pertaining to PCS (C:\ClusterStorage\Volume1\PCS)
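The last item can be checked with a short script. This is a sketch under the assumption, based on the example path above, that PCS leaves its data in a folder named PCS directly under each CSV mount point; Get-LeftoverPcsFolder is a hypothetical helper name.

```powershell
# Hypothetical helper: lists PCS folders that still exist directly under each
# volume folder below the given root (C:\ClusterStorage by default).
function Get-LeftoverPcsFolder {
    param([string] $Root = 'C:\ClusterStorage')
    Get-ChildItem -Path $Root -Directory |
        ForEach-Object { Join-Path -Path $_.FullName -ChildPath 'PCS' } |
        Where-Object { Test-Path -Path $_ -PathType Container }
}

# Example usage on a cluster node; no output means that nothing is left over.
# Get-LeftoverPcsFolder
```

Review any folder the function returns before deleting it, since the folder name is an assumption and other tools could use the same name.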
Duration
- PCS actions (listed below) run for about 24 hours.
- The complete run may take an additional 36-48 hours (including time for setup and cleanup).
PCS Actions
The table below lists the actions that are included in this test.
Action Name | Description |
---|---|
NetRunEastWestCrossSubnetTrafficAction | Run traffic between two Tenant Vms in same VNetwork, but different Vsubnets |
NetRunEastWestSameSubnetTrafficAction | Run traffic between two Tenant Vms in same Vsubnet |
NetLoadBalancerEastWestInterTenantTrafficAction | Run traffic between load balanced tenants and another Vm in a different App Tier. Simulates load balanced traffic amongst frontend application (website) Vms. |
NetLoadBalancerEastWestIntraTenantTrafficAction | Run traffic between load balanced tenants and a Vm in the same App Tier. Simulates load balanced traffic from backend application (DB) to frontend application (website). |
NetLoadBalancerInboundTrafficAction | Run traffic from outside the Tenant network to load balanced Vms (website). |
NetLoadBalancerNorthSouthTrafficAction | Run traffic from inside the Tenant network to load balanced Vms. |
NetLoadBalancerOutboundTrafficAction | Run traffic from load balanced Vms inside the Tenant network to a Vm outside. |
NetAddInboundVipToLoadBalancerAction | Creates Virtual Ips for Tenant VMs dynamically, mainly for other traffic actions to use. |
VmCloneAction | Creates a new VM. |
VmLiveMigrationAction | Live-migrates the VM to another cluster node. |
VmStateChangeAction | Changes the VM state (for example, to Paused). |
VmStorageMigrationAction | Migrates VM storage (the VHD(s)) between cluster nodes. |
VmGuestRestartAction | Restarts the VM. |
VmGuestFullPowerCycleAction | Power-cycles the VM. |
PrivateCloudSimulator - System.Solutions.StorageSpacesDirect
Setup
- Set up a hyper-converged solution. See here for an example.
- We recommend making the number of volumes a multiple of the number of servers in your cluster. For example, if you have 4 servers, you will experience more consistent performance with 4 total volumes than with 3 or 5. This allows the cluster to distribute volume "ownership" (one server handles metadata orchestration for each volume) evenly among servers.
- We recommend using Resilient File System (ReFS) for Storage Spaces Direct.
- By default, the test creates 20 VMs per cluster node. The estimated average size of a VM's VHD file is 40 GB. To run this test in a 4-node cluster environment, your virtual disk size should be at least 20 * 40 * 4 = 3200 GB.
- Minimum Configuration
- This config contains the minimum number of cluster nodes, the slowest supported processor, the least memory, and the lowest storage capacity supported by the solution family.
- Please use the PrivateCloudSimulator - System.Solutions.StorageSpacesDirect (MIN) job to validate this setup
- Maximum Configuration
- This config contains the maximum number of cluster nodes and the maximum storage supported by the solution family.
- Processor and memory should be equal to or higher than the lowest supported value for the solution, but need not be the maximum possible supported value. The processor and memory values should be representative of the most common SKUs for the solution.
- Please use the PrivateCloudSimulator - System.Solutions.StorageSpacesDirect (MAX) job to validate this setup
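The sizing rule in the setup notes above (VMs per node times average VHD size times node count) can be written as a small calculation. The function name and parameter names below are illustrative assumptions; the defaults of 20 VMs per node and 40 GB per VHD come from the text.

```powershell
# Hypothetical helper: estimates the minimum total virtual disk size in GB
# as NodeCount * VmsPerNode * AverageVhdSizeGB.
function Get-PcsRequiredCapacityGB {
    param(
        [Parameter(Mandatory)] [int] $NodeCount,
        [int] $VmsPerNode = 20,
        [int] $AverageVhdSizeGB = 40
    )
    return $NodeCount * $VmsPerNode * $AverageVhdSizeGB
}

# Example usage for the 4-node cluster from the text:
# Get-PcsRequiredCapacityGB -NodeCount 4   # 4 * 20 * 40 = 3200
```

Treat the result as a lower bound, since the 40 GB figure is an estimated average rather than a fixed size.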
Execute
Open HLK Studio
Follow the Windows HLK Getting Started guide to create a machine pool
Navigate to the Project tab and click Create Project
Enter a project name and press Enter
Navigate to the Selection tab
Select the machine pool containing the system under test and PCS controller machine.
Select systems on the left panel and then select the PCS test controller (NOTE: NOT the machine that needs to be certified).
Right-click on the selected PCS controller machine and select Add/Modify Features
In the features dialog, select System.Solution.StorageSpacesDirect and click OK
Navigate to the Tests tab
Select PrivateCloudSimulator - System.Solutions.StorageSpacesDirect (MAX) or PrivateCloudSimulator - System.Solutions.StorageSpacesDirect (MIN) (based on the solution size you are testing)
Click Run Selected
In the Schedule dialog,
- Enter values for the required test parameters
- DomainName: Test user's fully qualified domain name (FQDN).
- UserName: Test user's user name
- Password: Test user's password
- ComputeCluster: compute cluster name
- StoragePath: Default value is "". It uses all the available CSVs from compute cluster. You can use different paths by entering comma-separated paths. Example: "C:\ClusterStorage\Volume1,C:\ClusterStorage\Volume2" (double quotes are needed)
- VmSwitchName: Enter the name of the virtual switch. This name must be the same on all nodes
- FreeDriveLetter: Default value is R. During setup, PcsFiles.vhd file is mounted to this drive letter on PCS controller. Make sure this drive letter is available.
- Map machines to roles
- Test Controller: Select PCS test controller machine
Click OK to schedule the test.
Please refer to View PCS report in real-time through SQL Server Reporting Services to view the real-time results for the test run.
Duration
- PCS Actions (listed below) will run for 96 hours.
- The complete run may take an additional 24-36 hours (including time for setup and cleanup).
PCS Actions
The profile defines the actions to execute to validate the disk drives for Microsoft AzureStack. The table below lists the actions that are included in this profile.
Action Name | Description |
---|---|
VmCloneAction | Creates a new VM. |
VmLiveMigrationAction | Live-migrates the VM to another cluster node. |
VmSnapshotAction | Takes a snapshot of the VM. |
VmStateChangeAction | Changes the VM state (for example, to Paused). |
VmStorageMigrationAction | Migrates VM storage (the VHD(s)) between cluster nodes. |
VmGuestRestartAction | Restarts the VM. |
VmStartWorkloadAction | Starts a user-simulated workload. |
VmGuestFullPowerCycleAction | Power-cycles the VM. |
ComputeNodeEvacuation | Drains all resources from one cluster node. |
ClusterCSVMoveAction | Move the CSV disks to the best available node. |
StorageNodePoolMove | Moves a storage pool (created in Storage Spaces) to a different owner node in the storage cluster. |
StorageNodeRestart | Restarts a node in the storage cluster. |
StorageNodeBugcheck | Bug checks one node of the storage cluster. |
StorageNodeUpdateStorageProviderCacheAction | Calls update-storageprovidercache command in PowerShell. |
PrivateCloudSimulator - System.Solutions.AzureStack
Setup
- Set up a hyper-converged solution. See here for an example.
- We recommend making the number of volumes a multiple of the number of servers in your cluster. For example, if you have 4 servers, you will experience more consistent performance with 4 total volumes than with 3 or 5. This allows the cluster to distribute volume "ownership" (one server handles metadata orchestration for each volume) evenly among servers.
- You need to use Resilient File System (ReFS) for Storage Spaces Direct. Otherwise, the job will fail.
- By default, the test creates 20 VMs per cluster node. The estimated average size of a VM's VHD file is 40 GB. To run this test in a 4-node cluster environment, your total virtual disk size should be at least 20 * 40 * 4 = 3200 GB.
- Minimum Configuration
- This configuration has the minimum number of cluster nodes, the slowest processor, the least memory, and the lowest storage capacity supported by the solution family.
- Use the PrivateCloudSimulator - System.Solutions.AzureStack (MIN) job to validate this setup.
- Maximum Configuration
- This configuration has the maximum number of cluster nodes and the maximum storage supported by the solution family.
- Processor and memory should be equal to or higher than the lowest supported value for the solution, but need not be the maximum supported value. The processor and memory values should be representative of the most common SKUs for the solution.
- Use the PrivateCloudSimulator - System.Solutions.AzureStack (MAX) job to validate this setup.
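The sizing guidance above can be checked with a few lines of arithmetic before a run. The following is a minimal sketch, not part of PCS: the function names are hypothetical, and the defaults (20 VMs per node, 40 GB per VHD) are the estimates quoted in this section.

```python
def required_capacity_gb(node_count, vms_per_node=20, avg_vhd_gb=40):
    """Estimated total virtual disk size (GB) needed for the PCS VMs."""
    return node_count * vms_per_node * avg_vhd_gb


def volumes_evenly_distributed(volume_count, server_count):
    """True when the volume count is a positive multiple of the server count,
    so volume ownership can be spread evenly across the servers."""
    return volume_count > 0 and volume_count % server_count == 0


def has_enough_capacity(total_virtual_disk_gb, node_count):
    """Compare the available virtual disk size against the estimate."""
    return total_virtual_disk_gb >= required_capacity_gb(node_count)
```

For the 4-node example, `required_capacity_gb(4)` gives the same 3200 GB figure as the text, and `volumes_evenly_distributed(5, 4)` flags the uneven 5-volume layout.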
Execute
Open HLK Studio
Follow the Windows HLK Getting Started guide to create a machine pool
Navigate to the Project tab and click Create Project
Enter a project name and press Enter
Navigate to the Selection tab
Select the machine pool containing the system under test
Select systems on the left panel and then select the PCS test controller (NOTE: Not the machine that needs to be certified).
Right-click on the selected device and select Add/Modify Features
In the features dialog, select System.Solution.AzureStack and click OK
Navigate to the Tests tab
Select PrivateCloudSimulator - System.Solutions.AzureStack
Click Run Selected
In the Schedule dialog,
- Enter values for the required test parameters
- DomainName: Test user's fully qualified domain name (FQDN).
- UserName: Test user's user name
- Password: Test user's password
- ComputeCluster: compute cluster name
- StoragePath: Default value is "". With the default, the test uses all the available CSVs from the compute cluster. To use specific paths, enter them as a comma-separated list. Example: "C:\ClusterStorage\Volume1,C:\ClusterStorage\Volume2" (the double quotes are required)
- VmSwitchName: Enter the name of the virtual switch. This name must be the same on all nodes
- FreeDriveLetter: Default value is R. During setup, PcsFiles.vhd file is mounted to this drive letter on PCS controller. Make sure this drive letter is available.
- Map machines to roles
- Test Controller: Select PCS test controller machine
Click OK to schedule the test.
Please refer to View PCS report in real-time through SQL Server Reporting Services to view the real-time results for the test run.
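To illustrate the StoragePath format described above, the sketch below splits a parameter value into individual paths. How PCS itself parses the value is not documented here; `parse_storage_path` is a hypothetical helper that only mirrors the stated convention (quoted, comma-separated, empty meaning "all available CSVs").

```python
def parse_storage_path(value):
    """Split a StoragePath value into a list of paths.

    An empty value ("" with or without the quotes) is returned as an empty
    list, which stands for "use all available CSVs" in this sketch.
    """
    stripped = value.strip().strip('"')
    if not stripped:
        return []
    return [part.strip() for part in stripped.split(",") if part.strip()]


example = r'"C:\ClusterStorage\Volume1,C:\ClusterStorage\Volume2"'
paths = parse_storage_path(example)
```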
Duration
- PCS Actions (listed below) will run for 96 hours.
- The complete run may take an additional 24-36 hours (including time for setup and cleanup)
Actions
The profile defines the actions to execute to validate the storage enclosure for Microsoft Azure Stack. The table below lists the actions that are included in this profile.
Action Name | Description |
---|---|
VmCloneAction | Creates a new VM. |
VmLiveMigrationAction | Live-migrates the VM to another cluster node. |
VmSnapshotAction | Takes a snapshot of the VM. |
VmStateChangeAction | Changes the VM state (for example, to Paused). |
VmStorageMigrationAction | Migrates VM storage (the VHD(s)) between cluster nodes. |
VmGuestRestartAction | Restarts the VM. |
VmStartWorkloadAction | Starts a user-simulated workload. |
VmGuestFullPowerCycleAction | Power-cycles the VM. |
ClusterCSVMoveAction | Moves the CSV disks to the best available node. |
StorageNodePoolMove | Moves a storage pool (created in Storage Spaces) to a different owner node in the storage cluster. |
StorageNodeRestart | Restarts a node in the storage cluster. |
StorageNodeBugcheck | Bug checks one node of the storage cluster. |
StorageNodeUpdateStorageProviderCacheAction | Calls the Update-StorageProviderCache cmdlet in PowerShell. |
View PCS report in real-time through SQL Server Reporting Services
While PCS operations are running, reports are saved in a SQL database on the PCS Controller. Each report lists all operations that were performed, their pass percentages, and all resources that were acquired and released during the test. A new database is created for each test run to enable you to review data from previous test runs at any time.
To view the report, follow these steps:
By default, Internet Explorer Enhanced Security Configuration is enabled on Windows Server. You need to disable it to view the report.
Open Server Manager => Local Server => Click IE Enhanced Security Configuration to turn it off for administrators and users.
Open IE on the PCS controller and visit
http://<PcsControllerMachineName>/Reports
Click PCS Reports => PCSRuns.
Each PCS run is identified by a unique Pass Run ID.
Click a Pass Run ID (for example, click f44b3f88-3dbf-476e-9294-9d479ca0a369) to open a report from the PCS run. The data in these reports is live, so you can monitor the progress of a test run in real time while it runs. The report includes the following:
- An overview of all resources (nodes, cluster, and VMs) that participated in the test run.
- All actions that were performed on each resource. The Pass and Fail columns report the number of actions that passed and failed.
In the Overall Operation Information table, you can click links in the Action/Pass/Fail column to open detail pages, which give you more information about the action's results. For example, if you clicked the failure number 9 by the VMLiveMigrationAction entry, you would see the summary shown in the following illustration.
The first entry above provides the following information:
Failure ID: When PCS encounters a failure, it generalizes the failure message and generates a unique hash for it. In the above example, the Failure ID is 97c12afd-23a8-3982-e304-a5dc6793950d
Failure Hash: Generalized failure message. In the example above, the failure hash is
Virtual Machine <VIRTUAL MACHINE> live migration failed at progress <PERCENTAGE> (migration state: Migrating)
Error: Virtual machine migration operation for '<VIRTUAL MACHINE>' failed at migration destination '<COMPUTE NODE>'. (Virtual machine ID <GUID>)
Failed to receive data for a Virtual Machine migration: This operation returned because the timeout period expired. (0x800705B4).
Count Current Run: The count of actions of a particular type that failed with this particular error message during this run. In the above example, VMLiveMigrationAction was run 3 times.
Count All Runs: A count of actions that failed because of this particular failure across all PCS runs. For the VMLiveMigrationAction, this count was 3.
PCS Runs Affected: Tells how many runs have been affected by this failure. For VMLiveMigrationAction, only 1 PCS run was affected.
To look further into the error, you can click a failure ID on that screen to drill down to a global history of the failure type across all PCS runs. For example, click 97c12afd-23a8-3982-e304-a5dc6793950d to display the following. The page lists all failed operations, grouped by failure type, which highlights the key areas that you might need to investigate.
If you click the Action ID, you can drill down further to see an Action Log Report. Errors are shown in red; warnings are shown in yellow.
Troubleshoot a PCS run from the HLK Controller
There are multiple stages in PCS Execution Flow. Below is an example when viewing a result from HLK Manager => Explorers => Job Monitor => select Machine Pool => select the job in Job Execution Status.
If PCS failed at the Setup, Execute, or Cleanup stage, you can browse the job logs by right-clicking the job name (or a child task name) => click Browse Job Logs. The log file names are PCS-E2Elaunch_Setup.log, PCS-E2Elaunch_Execute.log, and PCS-E2Elaunch_Cleanup.log. The log files should contain information about failures. Search for an unexpected exception near the end of each log file.
Troubleshoot a PCS run from the PCS Controller
When a PCS job fails at the Setup, Execute, or Cleanup stage, you can rerun the stage directly from the PCS controller. This method is useful for troubleshooting problems in these stages.
- Open an elevated command prompt
- Run the ReRunPcsSetup.cmd, ReRunPcsExecute.cmd, or ReRunPcsCleanup.cmd script
Logs and Diagnose
PCS has three main stages: Setup, Execute, and Cleanup. A PCS job uses the PCS-E2Elaunch.ps1 script to launch these three stages. Their log files are named PCS-E2ELaunch_Setup.log, PCS-E2ELaunch_Execute.log, and PCS-E2ELaunch_Cleanup.log.
When a PCS run has completed, PCS analyzes the logs during the Cleanup stage and saves the analyzed report as PCSReport.htm. A run succeeds when the following criteria are met:
- Every PCS action has a pass rate of at least 90%
- No cluster node crashed unexpectedly (crashes initiated by PCS for testing purposes do not count)
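The two criteria above can be expressed as a small check. This is only an illustrative sketch: the function name and the input shape (a mapping of action name to pass/fail counts) are assumptions, not the format PCS uses internally, and treating an action that never ran as non-failing is also an assumption.

```python
def run_succeeded(action_results, unexpected_crashes, min_pass_percent=90):
    """Apply the documented pass criteria to per-action counts.

    action_results maps an action name to a (passed, failed) tuple.
    unexpected_crashes is the number of node crashes not initiated by PCS.
    """
    if unexpected_crashes > 0:
        return False
    for passed, failed in action_results.values():
        total = passed + failed
        # Integer comparison avoids floating-point edge cases at exactly 90%.
        if total > 0 and passed * 100 < min_pass_percent * total:
            return False
    return True


def failing_actions(action_results, min_pass_percent=90):
    """Names of the actions whose pass rate is below the threshold."""
    return sorted(
        name
        for name, (passed, failed) in action_results.items()
        if passed + failed > 0 and passed * 100 < min_pass_percent * (passed + failed)
    )
```

`failing_actions` is a convenience for deciding which action's detail page to open first in the SQL report.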
The following files are generated on PCS Controller during Cleanup stage.
- PcsReport.htm: summary about the run.
- ClusterName-PRE.mht.html: cluster validation test report that is run before Execute stage
- ClusterName-POST.mht.html: cluster validation test report that is run after Execute stage
- PcsLog-DateTime.zip: contains logs and is copied to the HLK Controller when the test finishes.
- MHTML folder: contains PCS SQL logs
- SDDCDiagnosticInfo folder: contains cluster logs and event logs
Issues seen during a PCS certification run are often unrelated to PCS itself. The following basic guide helps narrow down some of these issues.
- Run the cluster validation test and check the report for errors.
- In Failover Cluster Manager, check whether all the nodes, vDisks, and the pool are in a healthy condition. If they are not, invest time in checking the logs and debugging before contacting Microsoft.
- Open Hyper-V Manager and make sure the VMs and vSwitches are enumerated (you can also run Get-VM or Get-VMSwitch).
- Make sure you are able to create a vSwitch outside of PCS tests on one or all of the compute nodes.
- Make sure you can create a VM on one or all of the nodes and can attach a vmNetworkAdapter to a vSwitch.
- Look for dump files generated by bugchecks by running "dir /s *.dmp" from the %systemdrive% on the compute nodes.
- If you do not have a kernel debugger attached, consider using LiveKD to look at kernel modules/threads that are stuck.
- Check that the compute nodes' licenses are active, because evaluation-version licenses expire after 180 days.
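The dump-file search in the checklist above can also be scripted. The sketch below is a rough equivalent of "dir /s *.dmp": it walks a directory tree and lists files with a .dmp extension. The function name is hypothetical, and the root folder you pass in (for example, the system drive of a compute node) is up to you.

```python
import fnmatch
import os


def find_dump_files(root):
    """Return the sorted paths of all *.dmp files under root (case-insensitive)."""
    matches = []
    # os.walk silently skips folders it cannot read, which is a reasonable
    # default for a quick survey of a system drive.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name.lower(), "*.dmp"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

A call such as `find_dump_files("C:\\")` on a compute node returns every dump file on that drive, including dumps from bugchecks that PCS itself initiated, so compare timestamps against the PCS action log before treating one as unexpected.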
Generate a ZIP file that contains PCS logs
You can run the following script from the PCS controller to generate a ZIP file that contains the required logs. This method is useful when the job is canceled or while the test is still running.
C:\pcs\PCS-E2ELaunch.ps1 -DomainName <string> -UserName <string> -Password <string> -ComputeCluster <string> [-StorageCluster <string>] -CollectLog [-CollectLogLevel <int>]
Parameters
- DomainName: Test user's fully qualified domain name (FQDN).
- UserName: Test user's user name
- Password: Test user's password
- ComputeCluster: Name of the compute cluster
- StorageCluster: Optional. Name of the storage cluster. Don't specify this parameter if the compute and storage clusters are the same.
- CollectLog: Required
- CollectLogLevel: optional, default is 1. Enter 3 to collect verbose logs.
Generate PcsReport.htm file manually
While PCS is running, you can run the following cmdlets on the PCS controller to generate an HTML report that lists unexpected bugchecks from all nodes.
Import-Module C:\PCS\PrivateCloudSimulator-Manager.psm1
Get-PCSReport
Customize PCS actions
Each PCS job has its own xml files that define its actions.
Each job can contain up to three XML files: PrivateCloudSimulator.xml, PrivateCloudSimulator_Create.xml, and PrivateCloudSimulator_Storage.xml.
These XML files can be found on the HLK Controller. Below is an example for the PrivateCloudSimulator - System.Solutions.AzureStack job. The folder name in the path (System.Solutions.AzureStack) matches the name of the HLK job.
C:\Program Files (x86)\Windows Kits\10\Hardware Lab Kit\Tests\amd64\PCS\System.Solutions.AzureStack\PrivateCloudSimulator_Create.xml
Example 1: Enable/Disable an action
<ConfigurableType Type="Microsoft.PrivateCloudSimulator.VM.Actions.HyperV.VmCloneAction, Microsoft.PrivateCloudSimulator.VM.Actions.HyperV">
<ConfigurableTypeField FieldName="Interval" ValueType="System.TimeSpan" Value="00:01:00" />
<ConfigurableTypeField FieldName="StartupNumber" ValueType="System.Int32" Value="2" />
<ConfigurableTypeField FieldName="InjectVMRTInGuest" ValueType="System.Boolean" Value="true" />
<ConfigurableTypeField FieldName="BaseVHDPath" ValueType="System.String" Value="%BASEVHD%" />
</ConfigurableType>
- Test Action name is VmCloneAction.
- The Interval field sets the frequency with which the action runs. Use the format hh:mm:ss. For example, the value 02:00:00 repeats the action every 2 hours.
- The StartupNumber field defines the number of instances of that action to initiate on each node of the compute cluster. To disable an action, set this field to zero.
- Don't modify other fields.
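The edit described above can be scripted instead of done by hand. The following sketch uses Python's standard `xml.etree.ElementTree` to set one field of one action; the `<Configuration>` root element in the sample is a placeholder (the real file's root element is not shown in this document), and `set_field` and `parse_interval` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from datetime import timedelta

# Sample fragment wrapped in a placeholder root element (an assumption).
SAMPLE = """<Configuration>
  <ConfigurableType Type="Microsoft.PrivateCloudSimulator.VM.Actions.HyperV.VmCloneAction, Microsoft.PrivateCloudSimulator.VM.Actions.HyperV">
    <ConfigurableTypeField FieldName="Interval" ValueType="System.TimeSpan" Value="00:01:00" />
    <ConfigurableTypeField FieldName="StartupNumber" ValueType="System.Int32" Value="2" />
  </ConfigurableType>
</Configuration>"""


def set_field(root, action_name, field_name, value):
    """Set Value on the named field of the named action; True if found."""
    for ctype in root.iter("ConfigurableType"):
        # Type is "Namespace.ClassName, AssemblyName"; keep the class name only.
        class_name = ctype.get("Type", "").split(",")[0].strip().split(".")[-1]
        if class_name != action_name:
            continue
        for field in ctype.iter("ConfigurableTypeField"):
            if field.get("FieldName") == field_name:
                field.set("Value", value)
                return True
    return False


def parse_interval(text):
    """Convert an hh:mm:ss Interval value into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


root = ET.fromstring(SAMPLE)
set_field(root, "VmCloneAction", "StartupNumber", "0")  # disables the action
```

Note that ElementTree drops XML comments when it parses and rewrites a file by default, so keep a backup of the original XML if you apply this approach to a real PCS configuration file.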
Example 2: Change VMs to use differencing disks
<ConfigurableType Type="Microsoft.PrivateCloudSimulator.VM.Actions.HyperV.VmCloneBase, Microsoft.PrivateCloudSimulator.VM.Actions.HyperV">
<ConfigurableTypeField FieldName="VmClusteringPercentage" ValueType="System.Int32" Value="100" />
<ConfigurableTypeField FieldName="UseDiffDisks" ValueType="System.Boolean" Value="false" />
</ConfigurableType>
- By default, PCS makes a copy of the provided guest OS VHD to create VMs that have dynamic virtual disks. To create VMs that have differencing disks instead, set the UseDiffDisks value to true.
Example 3: Change the number of created VMs per node
<ConfigurableType Type="Microsoft.PrivateCloudSimulator.VM.Actions.HyperV.VmCreationBase, Microsoft.PrivateCloudSimulator.VM.Actions.HyperV">
<!-- MaxVmCount is Max Number of VMs on any one node -->
<ConfigurableTypeField FieldName="MaxVmCount" ValueType="System.Int32" Value="20" />
</ConfigurableType>
- By default, PCS creates 20 VMs per cluster node, and the average VM size can be 40 GB. In a 4-node cluster environment, the VMs can therefore take 20 * 4 * 40 = 3200 GB of disk space. If you are certifying your hardware, don't change the default value; consider adding more disks instead of reducing the number of VMs.
Customize Action Logs
A PCS run has a RunId. A PCS action has an action ID. When a PCS action fails, PCS removes the variable parts (for example, the VM name) from the failure message and generates a unique hash value for the generalized message. Similar failures therefore have the same hash value, and PCS groups them together on the SQL report site.
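The grouping idea can be sketched in a few lines. The exact rules and hash function that PCS uses are not documented here, so the example below is an assumption-laden illustration: it replaces GUIDs, percentages, and caller-supplied VM and node names with placeholders like those shown in the report, and uses SHA-256 as a stand-in hash.

```python
import hashlib
import re

_GUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
_PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def generalize(message, vm_names=(), node_names=()):
    """Replace run-specific values in a failure message with placeholders."""
    text = _GUID.sub("<GUID>", message)
    text = _PERCENT.sub("<PERCENTAGE>", text)
    for name in vm_names:
        text = text.replace(name, "<VIRTUAL MACHINE>")
    for name in node_names:
        text = text.replace(name, "<COMPUTE NODE>")
    return text


def failure_hash(message, vm_names=(), node_names=()):
    """Hash the generalized message so similar failures share one value."""
    generalized = generalize(message, vm_names, node_names)
    return hashlib.sha256(generalized.encode("utf-8")).hexdigest()
```

Two live-migration timeouts on different VMs then produce the same hash, which is what lets a report count a failure across runs.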
PCS uses .NET Trace Listeners to collect test results. These listeners are defined in Microsoft.PrivateCloudSimulator.exe.config.
- SQLOnline: This listener logs the results into SQL database.
- AnalyticalLogGather: This listener collects extra information when an action fails.
When a particular action fails or a particular hash value is seen, you can configure AnalyticalLogGather listener to collect event logs, cluster logs, or call a script. This is defined in ActionFailureReactionPolicy.xml.
In ActionFailureReactionPolicy.xml, PCS supports two types of triggers and three types of reactions. Using this XML, you can define rules like "when trigger X is seen, take reactions Y and Z". Most actions would have NodeScope set to ReservedOnly and MaxLevel set to 3 (Critical, Error, and Warning events).
Trigger:
Type | Data |
---|---|
ActionFail | ActionFullName |
KnownFailure | FailureHash |
Reaction:
Type | Data |
---|---|
ETWCollection | Channel, NodeScope, StorageLocation, MaxLevel |
ClusterLogCollection | UseLocalTime, NodeScope, StorageLocation, MaxTimeDuration (optional) |
CustomPS | ScriptFullPath, NodeScope, Argument |
Valid NodeScope values are the following:
- AllNodes
- ComputeOnly
- StorageOnly
- EdgeOnly
- NCOnly
- ReservedOnly
Valid MaxLevel values are the following:
- 0 (logs at all levels)
- 1 (Critical)
- 2 (Error)
- 3 (Warning)
- 4 (Information)
- 5 (Verbose)
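As a quick reference for the MaxLevel values, the sketch below lists which event levels a given setting would collect. It assumes that MaxLevel is cumulative (a value of N collects levels 1 through N), which matches the earlier statement that MaxLevel 3 covers Critical, Error, and Warning events; the function name is hypothetical.

```python
LEVEL_NAMES = {
    1: "Critical",
    2: "Error",
    3: "Warning",
    4: "Information",
    5: "Verbose",
}


def levels_collected(max_level):
    """Names of the event levels collected for a MaxLevel value (0 to 5)."""
    if max_level not in range(0, 6):
        raise ValueError("MaxLevel must be between 0 and 5")
    if max_level == 0:
        # 0 means logs at all levels.
        return [LEVEL_NAMES[level] for level in sorted(LEVEL_NAMES)]
    return [LEVEL_NAMES[level] for level in sorted(LEVEL_NAMES) if level <= max_level]
```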
Examples:
<Trigger>
<Type>ActionFail</Type>
<Data Name="ActionFullName" Value="Microsoft.HyperV.Test.Stress.PrivateCloud.ComputeNode.Action.StorageNodeRestartAction">
</Data>
<ReactionMatchList>
<!-- Details of Reaction are Defined Below and are referenced using the ID attribute-->
<MatchingReaction ID ="1"></MatchingReaction>
<MatchingReaction ID ="2"></MatchingReaction>
</ReactionMatchList>
</Trigger>
<Reaction ID="1">
<Type>ETWCollection</Type>
<Data Name="Channel" Value="Microsoft-Windows-Hyper-V-VMMS-Analytic"></Data>
<Data Name="NodeScope" Value="ReservedOnly"></Data>
<Data Name="StorageLocation" Value="C:\PCS\PCSEventData\%NODE%\%ActionId%\EventLogs"></Data>
<Data Name="MaxLevel" Value="3"></Data>
</Reaction>
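To check which reactions a trigger resolves to before starting a long run, the policy can be read with Python's standard XML parser. This is a sketch under stated assumptions: the `<Policy>` root element is a placeholder (the real root of ActionFailureReactionPolicy.xml is not shown in this document), `reactions_for` is a hypothetical helper, and the sample defines only reaction 1, as in the example above.

```python
import xml.etree.ElementTree as ET

# The example trigger and reaction, wrapped in a placeholder root element.
POLICY = r"""<Policy>
  <Trigger>
    <Type>ActionFail</Type>
    <Data Name="ActionFullName" Value="Microsoft.HyperV.Test.Stress.PrivateCloud.ComputeNode.Action.StorageNodeRestartAction"></Data>
    <ReactionMatchList>
      <MatchingReaction ID="1"></MatchingReaction>
      <MatchingReaction ID="2"></MatchingReaction>
    </ReactionMatchList>
  </Trigger>
  <Reaction ID="1">
    <Type>ETWCollection</Type>
    <Data Name="Channel" Value="Microsoft-Windows-Hyper-V-VMMS-Analytic"></Data>
    <Data Name="NodeScope" Value="ReservedOnly"></Data>
    <Data Name="StorageLocation" Value="C:\PCS\PCSEventData\%NODE%\%ActionId%\EventLogs"></Data>
    <Data Name="MaxLevel" Value="3"></Data>
  </Reaction>
</Policy>"""


def reactions_for(root, trigger_type, data_value):
    """Return (reaction type, data dict) pairs matched by a trigger."""
    reactions = {r.get("ID"): r for r in root.iter("Reaction")}
    result = []
    for trigger in root.iter("Trigger"):
        if trigger.findtext("Type") != trigger_type:
            continue
        if not any(d.get("Value") == data_value for d in trigger.findall("Data")):
            continue
        for match in trigger.iter("MatchingReaction"):
            reaction = reactions.get(match.get("ID"))
            if reaction is None:
                continue  # ID is referenced but not defined in this fragment
            data = {d.get("Name"): d.get("Value") for d in reaction.findall("Data")}
            result.append((reaction.findtext("Type"), data))
    return result


root = ET.fromstring(POLICY)
matched = reactions_for(
    root,
    "ActionFail",
    "Microsoft.HyperV.Test.Stress.PrivateCloud.ComputeNode.Action.StorageNodeRestartAction",
)
```

Skipping an undefined reaction ID silently is a choice made for this sketch; a stricter check could report it as a configuration error instead.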
Action log files are saved to the 'FORENSICLOGLOCATION' folder on the PCS controller. By default, this folder is C:\PCS\PCSEventData.
For each failed action, the following information is collected from the reserved node(s). This log location can be seen in the action's SQL report page.
- %MachineName%\%RunId%\ClusterLogs\%ActionId%
- %MachineName%\%RunId%\EventLogs\%ActionId%
- %MachineName%\%RunId%\CustomResponse\%ActionId%
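The folder templates above can be expanded into concrete paths when you need to locate the logs for one failed action. The sketch below is illustrative only: `expand_template` and `action_log_paths` are hypothetical helpers, the default root is the C:\PCS\PCSEventData location mentioned above, and `ntpath` is used so the Windows-style paths are built the same way on any platform.

```python
import ntpath
import re

TEMPLATES = {
    "ClusterLogs": r"%MachineName%\%RunId%\ClusterLogs\%ActionId%",
    "EventLogs": r"%MachineName%\%RunId%\EventLogs\%ActionId%",
    "CustomResponse": r"%MachineName%\%RunId%\CustomResponse\%ActionId%",
}


def expand_template(template, values):
    """Replace each %Name% token with values["Name"]; unknown tokens raise KeyError."""
    return re.sub(r"%(\w+)%", lambda m: values[m.group(1)], template)


def action_log_paths(machine, run_id, action_id, root=r"C:\PCS\PCSEventData"):
    """Map each log kind to the full folder path for one failed action."""
    values = {"MachineName": machine, "RunId": run_id, "ActionId": action_id}
    return {
        kind: ntpath.join(root, expand_template(template, values))
        for kind, template in TEMPLATES.items()
    }
```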