Failover cluster maintenance procedures
Applies to: Azure Stack HCI, versions 22H2 and 21H2; Windows Server 2022, Windows Server 2019, Windows Server 2016
Important
Azure Stack HCI is now part of Azure Local. Product documentation renaming is in progress. However, older versions of Azure Stack HCI, for example 22H2, will continue to reference Azure Stack HCI and won't reflect the name change.
This article assumes that you need to power down a physical server to perform maintenance, or restart it for some other reason. To install updates on an Azure Stack HCI cluster without taking servers offline, see Update Azure Stack HCI clusters.
Taking a server offline for maintenance requires taking portions of storage offline that are shared across all servers in a failover cluster. This requires pausing the server that you want to take offline, putting the server's disks in maintenance mode, moving clustered roles and virtual machines (VMs) to other servers in the cluster, and verifying that all data is available on the other servers in the cluster. This process ensures that the data remains safe and accessible throughout the maintenance period.
You can use either Windows Admin Center or PowerShell to take a server offline for maintenance. This article covers both methods.
Take a server offline using Windows Admin Center
The simplest way to prepare to take a server offline is by using Windows Admin Center.
Verify it's safe to take the server offline
Using Windows Admin Center, connect to the server you want to take offline. Select Storage > Disks from the Tools menu, and verify that the Status column for every virtual disk shows Online.
Then, select Storage > Volumes and verify that the Health column for every volume shows Healthy and that the Status column for every volume shows OK.
Pause and drain the server
Before either shutting down or restarting a server, you should pause the server and drain (move off) any clustered roles such as VMs running on it. Always pause and drain clustered servers before taking them offline for maintenance.
Using Windows Admin Center, connect to the cluster and then select Compute > Servers from the Tools menu in Cluster Manager.
Select Inventory. Click on the name of the server you wish to pause and drain, and select Pause. You should see the following prompt:
Pause server(s) for maintenance: Are you sure you want to pause server(s)? This moves workloads, such as virtual machines, to other servers in the cluster.
Select Yes to pause the server and initiate the drain process. The server status shows as In maintenance, Draining, and roles such as Hyper-V and VMs immediately begin live migrating to other servers in the cluster. This can take a few minutes. No roles can be added to the server until it's resumed. When the draining process is finished, the server status shows as In maintenance, Drain completed. The operating system performs an automatic safety check to ensure that it's safe to proceed. If there are unhealthy volumes, the process stops and alerts you that it's not safe to proceed.
Shut down the server
Once the server has completed draining, you can safely shut it down for maintenance or reboot it.
Warning
If the server is running Azure Stack HCI, version 20H2, Windows Server 2019, or Windows Server 2016, you must put the disks in maintenance mode before shutting down the server and take the disks out of maintenance mode before resuming the server into the cluster.
Resume the server
When you are ready for the server to begin hosting clustered roles and VMs again, simply turn the server on, wait for it to boot up, and resume the server using the following steps.
In Cluster Manager, select Compute > Servers from the Tools menu at the left.
Select Inventory. Click on the name of the server you wish to resume, and then click Resume.
Clustered roles and VMs will immediately begin live migrating back to the server. This can take a few minutes.
Wait for storage to resync
When the server resumes, any new writes that happened while it was unavailable need to resync. This happens automatically, using intelligent change tracking: only the changes need to be scanned and synchronized, not all data. This process is throttled to mitigate impact on production workloads. Depending on how long the server was paused and how much new data was written, it may take many minutes to complete.
Important
You must wait for re-syncing to complete before taking any other servers in the cluster offline.
To check if storage resync is complete:
- Connect to the cluster using Windows Admin Center and select Storage > Volumes.
- Select Inventory.
- Check the Status column for every volume. If it shows OK, storage resync is complete. It's now safe to take other servers in the cluster offline.
Take a server offline using PowerShell
Use the following procedures to properly pause, drain, and resume a server in a failover cluster using PowerShell.
Verify it's safe to take the server offline
To verify that all your volumes are healthy, run the following cmdlet as an administrator:
Get-VirtualDisk
Here's an example of what the output might look like:
FriendlyName              ResiliencySettingName FaultDomainRedundancy OperationalStatus HealthStatus Size    FootprintOnPool StorageEfficiency
------------              --------------------- --------------------- ----------------- ------------ ----    --------------- -----------------
Mirror II                 Mirror                1                     OK                Healthy      4 TB    8.01 TB         49.99%
Mirror-accelerated parity                                             OK                Healthy      1002 GB 1.96 TB         49.98%
Mirror                    Mirror                1                     OK                Healthy      1 TB    2 TB            49.98%
ClusterPerformanceHistory Mirror                1                     OK                Healthy      24 GB   49 GB           48.98%
Verify that the HealthStatus property for every virtual disk is Healthy and that the OperationalStatus property shows OK.
To do this using Failover Cluster Manager, go to Storage > Disks.
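The check above can be scripted so that it fails loudly instead of relying on reading the table by eye. The following is a minimal sketch, not an official tool: the function name Get-UnhealthyVirtualDisk is hypothetical, and it assumes that the objects passed in expose the FriendlyName, HealthStatus, and OperationalStatus properties shown in the example output.

```powershell
# Hypothetical helper (not part of any Microsoft module): returns the virtual
# disks that are NOT both Healthy and OK. An empty result means the check passed.
function Get-UnhealthyVirtualDisk {
    param(
        # Objects with HealthStatus and OperationalStatus properties,
        # such as the output of Get-VirtualDisk.
        [Parameter(Mandatory)]
        [AllowEmptyCollection()]
        [object[]] $VirtualDisk
    )

    $VirtualDisk | Where-Object {
        # Cast to string so the comparison works whether the property is a
        # string, an enum, or a single-element array.
        "$($_.HealthStatus)" -ne 'Healthy' -or "$($_.OperationalStatus)" -ne 'OK'
    }
}

# Example use on a cluster node (run as administrator):
#   $bad = Get-UnhealthyVirtualDisk -VirtualDisk @(Get-VirtualDisk)
#   if ($bad) { $bad | Format-Table FriendlyName, OperationalStatus, HealthStatus }
#   else      { 'All virtual disks are Healthy and OK.' }
```

If the helper returns anything, resolve those virtual disks before pausing the server.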
Pause and drain the server
Run the following cmdlet as an administrator to pause and drain the server:
Suspend-ClusterNode -Drain
To do this in Failover Cluster Manager, go to Nodes, right-click the node, and then select Pause > Drain Roles.
If the server is running Azure Stack HCI, version 21H2 or Windows Server 2022, pausing and draining the server will also put the server's disks into maintenance mode. If the server is running Azure Stack HCI, version 20H2, Windows Server 2019, or Windows Server 2016, you'll have to do this manually (see next step).
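When you pause a server from a script, it helps to wait until the drain has actually finished before moving on. The sketch below is one way to do that, under stated assumptions: Invoke-NodeDrain is a hypothetical name, "Server1" is a placeholder, and the loop assumes that the node object returned by Get-ClusterNode exposes a DrainStatus property whose values include Completed and Failed.

```powershell
# Hypothetical wrapper around the FailoverClusters cmdlets: pauses a node,
# drains it, and waits until the drain is reported as complete.
function Invoke-NodeDrain {
    param(
        [Parameter(Mandatory)] [string] $Name,   # node to pause, for example "Server1"
        [int] $TimeoutSeconds = 1800,            # give up after this long
        [int] $PollSeconds = 10                  # delay between status checks
    )

    # Pause the node and start moving clustered roles and VMs elsewhere.
    Suspend-ClusterNode -Name $Name -Drain | Out-Null

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        $node  = Get-ClusterNode -Name $Name
        $drain = "$($node.DrainStatus)"          # assumed property, see lead-in

        if ($drain -eq 'Completed') { return $node }
        if ($drain -eq 'Failed')    { throw "Drain failed on node '$Name'." }
        if ((Get-Date) -ge $deadline) {
            throw "Timed out waiting for node '$Name' to drain (last status: $drain)."
        }
        Start-Sleep -Seconds $PollSeconds
    }
}

# Example use (run as administrator on a cluster node):
#   Invoke-NodeDrain -Name "Server1"
```

The function throws rather than returning a status code so that a maintenance script stops before it reaches the shutdown step if the drain did not succeed.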
Put disks in maintenance mode
In Azure Stack HCI, version 20H2, Windows Server 2019, and Windows Server 2016, putting the server's disks in maintenance mode gives Storage Spaces Direct an opportunity to gracefully flush and commit data to ensure that the server shutdown does not affect application state. As soon as a disk goes into maintenance mode, it will no longer allow writes. To minimize storage resync times, we recommend putting the disks into maintenance mode right before the reboot and bringing them out of maintenance mode as soon as the system is back up.
Note
If the server is running Azure Stack HCI, version 21H2 or Windows Server 2022, you can skip this step because the disks are automatically put into maintenance mode when the server is paused and drained. These operating systems have a granular repair feature that makes resyncs faster and less impactful on system and network resources, making it feasible to have server and storage maintenance done together.
If the server is running Windows Server 2019 or Azure Stack HCI, version 20H2, run the following cmdlet as administrator:
Get-StorageScaleUnit -FriendlyName "Server1" | Enable-StorageMaintenanceMode
If the server is running Windows Server 2016, use the following syntax instead:
Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "Server1"} | Enable-StorageMaintenanceMode
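Because the lookup syntax differs between operating system versions, a script that runs against mixed clusters can pick the right form at run time. The following sketch shows one possible approach. The function name Set-NodeStorageMaintenanceMode is hypothetical, and the version check rests on an assumption: that Get-StorageScaleUnit is simply not available where the Windows Server 2016 syntax is required.

```powershell
# Hypothetical helper: enables or disables storage maintenance mode for one
# server, using whichever lookup syntax the local Storage module supports.
function Set-NodeStorageMaintenanceMode {
    param(
        [Parameter(Mandatory)] [string] $Name,    # server name, for example "Server1"
        [Parameter(Mandatory)] [ValidateSet('Enable', 'Disable')] [string] $Action
    )

    # Assumption: Get-StorageScaleUnit is absent on systems that need the
    # Get-StorageFaultDomain form, so its presence selects the syntax.
    if (Get-Command -Name Get-StorageScaleUnit -ErrorAction SilentlyContinue) {
        $unit = Get-StorageScaleUnit -FriendlyName $Name
    }
    else {
        $unit = Get-StorageFaultDomain -Type StorageScaleUnit |
            Where-Object { $_.FriendlyName -eq $Name }
    }

    if (-not $unit) { throw "No storage scale unit named '$Name' was found." }

    if ($Action -eq 'Enable') { $unit | Enable-StorageMaintenanceMode }
    else                      { $unit | Disable-StorageMaintenanceMode }
}

# Example use (run as administrator):
#   Set-NodeStorageMaintenanceMode -Name "Server1" -Action Enable
```

Calling the same helper with -Action Disable reverses the change once the server is back up.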
Shut down the server
Once the server has completed draining, it will show as Paused in PowerShell and Failover Cluster Manager.
You can now safely shut the server down or restart it by using the Stop-Computer or Restart-Computer PowerShell cmdlets, or by using Failover Cluster Manager.
Note
When running a Get-VirtualDisk command on servers that are shutting down or starting/stopping the cluster service, the OperationalStatus of the virtual disks may be reported as Incomplete or Degraded, and the HealthStatus column may list Warning. This is normal and should not cause concern. All your volumes remain online and accessible.
Take disks out of maintenance mode
If the server is running Azure Stack HCI, version 20H2, Windows Server 2019, or Windows Server 2016, you must disable storage maintenance mode on the disks before resuming the server into the cluster. To minimize storage resync times, we recommend bringing them out of maintenance mode as soon as the system is back up.
Note
If the server is running Azure Stack HCI, version 21H2 or Windows Server 2022, you can skip this step because the disks will automatically be taken out of maintenance mode when the server is resumed.
If the server is running Windows Server 2019 or Azure Stack HCI, version 20H2, run the following cmdlet as administrator to disable storage maintenance mode:
Get-StorageScaleUnit -FriendlyName "Server1" | Disable-StorageMaintenanceMode
If the server is running Windows Server 2016, use the following syntax instead:
Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "Server1"} | Disable-StorageMaintenanceMode
Resume the server
Resume the server into the cluster. To return the clustered roles and VMs that were previously running on the server, use the optional -Failback parameter:
Resume-ClusterNode -Failback Immediate
To do this in Failover Cluster Manager, go to Nodes, right-click the node, and then select Resume > Fail Roles Back.
Once the server has resumed, it will show as Up in PowerShell and Failover Cluster Manager.
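In a script, the resume step can be combined with a check that the node really reports Up before you continue. This is a minimal sketch under stated assumptions: Resume-NodeAndWait is a hypothetical name, "Server1" is a placeholder, and the timeout values are arbitrary.

```powershell
# Hypothetical wrapper: resumes a paused node, fails roles back immediately,
# and waits until the node state is reported as Up.
function Resume-NodeAndWait {
    param(
        [Parameter(Mandatory)] [string] $Name,   # node to resume, for example "Server1"
        [int] $TimeoutSeconds = 600,             # give up after this long
        [int] $PollSeconds = 5                   # delay between status checks
    )

    # Resume the node and move its previous roles and VMs back to it.
    Resume-ClusterNode -Name $Name -Failback Immediate | Out-Null

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        $node = Get-ClusterNode -Name $Name
        if ("$($node.State)" -eq 'Up') { return $node }
        if ((Get-Date) -ge $deadline) {
            throw "Node '$Name' did not report Up within $TimeoutSeconds seconds (state: $($node.State))."
        }
        Start-Sleep -Seconds $PollSeconds
    }
}

# Example use (run as administrator on a cluster node):
#   Resume-NodeAndWait -Name "Server1"
```

A node that reports Up is hosting roles again, but its storage may still be resyncing, so this check does not replace the storage job check that follows.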
Wait for storage to resync
When the server resumes, you must wait for re-syncing to complete before taking any other servers in the cluster offline.
Run the following cmdlet as administrator to monitor progress:
Get-StorageJob
If re-syncing has already completed, you won't get any output.
Here's some example output showing resync (repair) jobs still running:
Name   IsBackgroundTask ElapsedTime JobState PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- -------- --------------- -------------- ----------
Repair True             00:06:23    Running  65              11477975040    17448304640
Repair True             00:06:40    Running  66              15987900416    23890755584
Repair True             00:06:52    Running  68              20104802841    22104819713
The BytesTotal column shows how much storage needs to resync. The PercentComplete column displays progress.
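Rather than rerunning the cmdlet by hand, you can poll it until no jobs remain. The following is a rough sketch, not a supported tool: Wait-StorageResync is a hypothetical name, the polling interval and timeout are arbitrary, and it assumes that any job whose JobState is not Completed still counts as outstanding.

```powershell
# Hypothetical helper: blocks until Get-StorageJob reports no outstanding jobs.
function Wait-StorageResync {
    param(
        [int] $PollSeconds = 30,        # delay between checks
        [int] $TimeoutSeconds = 7200    # give up after this long
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        # Treat every job that is not Completed as still in progress (assumption).
        $active = @(Get-StorageJob | Where-Object { "$($_.JobState)" -ne 'Completed' })
        if ($active.Count -eq 0) { return }

        foreach ($job in $active) {
            # Remaining work derived from the BytesTotal and BytesProcessed columns.
            $remainingGB = ($job.BytesTotal - $job.BytesProcessed) / 1GB
            Write-Host ('{0}: {1}% complete, {2:N1} GB remaining' -f $job.Name, $job.PercentComplete, $remainingGB)
        }

        if ((Get-Date) -ge $deadline) {
            throw "Storage jobs were still running after $TimeoutSeconds seconds."
        }
        Start-Sleep -Seconds $PollSeconds
    }
}

# Example use (run as administrator):
#   Wait-StorageResync
```

When the function returns without throwing, confirm the result with Get-VirtualDisk before taking another server offline.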
Warning
It's not safe to take another server offline until these repair jobs finish.
During this time, under HealthStatus, your volumes will continue to show as Warning, which is normal.
For example, if you use the Get-VirtualDisk cmdlet while storage is re-syncing, you might see the following output:
FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
MyVolume1    Mirror                InService         Warning      True           1 TB
MyVolume2    Mirror                InService         Warning      True           1 TB
MyVolume3    Mirror                InService         Warning      True           1 TB
Once the jobs complete, verify that volumes show Healthy again by using the Get-VirtualDisk cmdlet. Here's some example output:
FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
MyVolume1    Mirror                OK                Healthy      True           1 TB
MyVolume2    Mirror                OK                Healthy      True           1 TB
MyVolume3    Mirror                OK                Healthy      True           1 TB
It's now safe to pause and restart other servers in the cluster.