Failover Cluster Node State is “Down” and Cluster Service Terminates or Adding a New Failover Cluster Node Fails with Time Out Error
Situation
Consider one of the following scenarios.
- You have a failover cluster. In the Nodes node of the Failover Cluster Manager MMC, the Status for one or more nodes is displayed as Down, although the servers are actually up and the Cluster service is running. Later, the Cluster service terminates with a timeout error and restarts. There are no other relevant messages in the event logs.
- You're trying to add a new node to a failover cluster. The Add Node Wizard takes you to the Configure the Cluster page, where a progress bar is displayed. The progress bar hangs for an extended period of time with the status “Waiting for notification that the node <Node Name> is a fully functional member of the cluster”. Later, the status changes to “Unable to successfully cleanup”. Finally, the Add Node Wizard fails with the following error message.
The server '<Node FQDN>' could not be added to the cluster.
An error occurred while adding node '<Node FQDN>' to cluster '<Cluster Name>'.
This operation returned because the timeout period expired
Symptoms
The following is the only relevant error message that appears in the node's System Event Log if the node experiencing this issue is already a cluster member (Scenario 1 listed above).
Log Name: System
Source: Service Control Manager
Date: 16.07.2011 14:06:26
Event ID: 7024
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: <Node FQDN>
Description:
The Cluster Service service terminated with service-specific error The wait operation timed out..
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}" EventSourceName="Service Control Manager" />
    <EventID Qualifiers="49152">7024</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8080000000000000</Keywords>
    <TimeCreated SystemTime="2011-07-16T10:06:26.443710600Z" />
    <EventRecordID>15132</EventRecordID>
    <Correlation />
    <Execution ProcessID="788" ThreadID="3484" />
    <Channel>System</Channel>
    <Computer>Node FQDN</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="param1">Cluster Service</Data>
    <Data Name="param2">%%258</Data>
  </EventData>
</Event>
If the node is not yet a cluster member and you try to add it (Scenario 2 listed above), the following event might be logged in addition to the one above.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 09.07.2011 5:21:30
Event ID: 1572
Task Category: Cluster Virtual Adapter
Level: Critical
Keywords:
User: SYSTEM
Computer: <Node FQDN>
Description:
Node '<Node Name>' failed to join the cluster because it could not send and receive failure detection network messages with other cluster nodes. Please run the Validate a Configuration wizard to ensure network settings. Also verify the Windows Firewall 'Failover Clusters' rules.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
    <EventID>1572</EventID>
    <Version>0</Version>
    <Level>1</Level>
    <Task>39</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2011-07-09T01:21:30.183978000Z" />
    <EventRecordID>8752339</EventRecordID>
    <Correlation />
    <Execution ProcessID="6388" ThreadID="6840" />
    <Channel>System</Channel>
    <Computer>Node FQDN</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="NodeName">Node Name</Data>
  </EventData>
</Event>
Note
The latter event is not logged every time the issue is reproduced, so treat it only as an additional symptom of the issue.
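To check a node for the two events described above without browsing Event Viewer, something like the following sketch could be used. The node name is a hypothetical placeholder, and the limit of 20 events is an arbitrary choice. Event ID 7024 is logged by Service Control Manager for any service that terminates with a service-specific error, so read the message text to confirm that an entry refers to the Cluster service.

```powershell
# Hypothetical node name; replace it with the node you are investigating.
$ComputerName = "Node1.Contoso.com"

# Filter the System log for the two event IDs discussed in this article.
$Filter = @{
    LogName = "System"
    Id      = 7024, 1572
}

# -ErrorAction SilentlyContinue suppresses the error that Get-WinEvent raises
# when no matching events exist; note that it also hides connection errors.
Get-WinEvent -ComputerName $ComputerName -FilterHashtable $Filter -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object -Property TimeCreated, Id, ProviderName, Message |
    Format-List
```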
More Information
If you run the Failover Cluster Validation Wizard, it finds no issues because all the necessary firewall rules are in place and enabled.
(Validation does help if the issue really is caused by firewall rules or network connectivity. See the See Also section at the end of this article for more details on such cases.)
Cause
If the Failover Cluster Validation Wizard doesn't detect the issue, the most likely cause is the state of Windows Firewall. The switch configuration can also be the cause. For example, the Auto DoS / Storm Protection feature on some HP switches blocks the UDP packet exchange during the initial handshake.
Resolution
Launch Server Manager MMC for the servers in question. Navigate to Configuration → Windows Firewall With Advanced Security. From the Actions pane, click Properties. Ensure that for all profiles (not only the Domain one) the Inbound connections setting is not set to Block all connections. Acceptable options are either Block (default) or Allow.
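The same setting can also be inspected from the command line. The following is a minimal sketch, assuming Windows PowerShell Remoting is enabled on the nodes; "Cluster.Contoso.com" is a hypothetical example cluster name. In the netsh output, a Firewall Policy value that starts with BlockInboundAlways corresponds to Block all connections, while BlockInbound and AllowInbound correspond to the acceptable Block (default) and Allow options.

```powershell
Import-Module -Name "FailoverClusters"

# Hypothetical example cluster name; replace it with your own.
$Node = Get-Cluster -Name "Cluster.Contoso.com" | Get-ClusterNode | Select-Object -ExpandProperty "Name"

$Command = {
    # Collect the firewall policy of every profile on this node.
    $Policy = & "$Env:SystemRoot\System32\NetSh.exe" AdvFirewall Show AllProfiles FirewallPolicy

    # Label the output so it is clear which node it came from.
    "--- $Env:ComputerName ---"
    $Policy
}

Invoke-Command -ScriptBlock $Command -ComputerName $Node
```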
If the cause is the switch configuration, check the switch's protection settings, such as the Auto DoS / Storm Protection feature on HP switches.
Additional Troubleshooting Steps
If you are unsure whether the cluster problems are caused by Windows Firewall, you can use the following script to temporarily disable the firewall on all cluster nodes at once.
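# Load the Failover Clustering cmdlets.
Import-Module -Name "FailoverClusters"

# Get the names of all nodes in the cluster.
$Node = Get-Cluster -Name "Cluster.Contoso.com" | Get-ClusterNode | Select-Object -ExpandProperty "Name"

# Turn Windows Firewall off for all profiles on the node where this runs.
$Command = {
    Set-Alias -Name "NetSh" -Value "$Env:SystemRoot\System32\NetSh.exe"
    NetSh AdvFirewall Set AllProfiles State "Off"
}

# Run the command on every cluster node through PowerShell Remoting.
Invoke-Command -ScriptBlock $Command -ComputerName $Node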
Note
This will not work if the state of Windows Firewall is enforced with Group Policy, or if Windows PowerShell Remoting is disabled.
Note
Under no circumstances should you leave the cluster in this state after your troubleshooting is complete (successfully or not). Windows Firewall is an important security measure that is highly recommended for all environments, even those well protected at the perimeter level.
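To turn the firewall back on after troubleshooting, the script above can be reused with the state set to On. This is a sketch under the same assumptions as the original script (the example cluster name "Cluster.Contoso.com", PowerShell Remoting enabled, firewall state not enforced with Group Policy).

```powershell
Import-Module -Name "FailoverClusters"

# Same example cluster name as in the script that disables the firewall.
$Node = Get-Cluster -Name "Cluster.Contoso.com" | Get-ClusterNode | Select-Object -ExpandProperty "Name"

# Turn Windows Firewall back on for all profiles.
$Command = {
    Set-Alias -Name "NetSh" -Value "$Env:SystemRoot\System32\NetSh.exe"
    NetSh AdvFirewall Set AllProfiles State "On"
}

Invoke-Command -ScriptBlock $Command -ComputerName $Node
```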
Below is the listing of the Windows Firewall exception properties. This exception is created by default when the Failover Clustering feature is installed, so it is in place even before the node is joined to the cluster.
netsh advfirewall firewall show rule name="Failover Clusters (UDP-In)" verbose

Rule Name: Failover Clusters (UDP-In)
Ok.
If, for whatever reason, the Windows Firewall settings in your environment block intra-cluster communications, make sure that your exceptions have the same or less restrictive settings.
Note
This exception is enabled and applies to all network profiles. It is also not the only exception created and required by the Failover Clustering feature. Missing exceptions of other kinds can cause similar problems in other areas of clustering functionality.
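To see which Failover Clustering exceptions exist on a node, the rule names can be filtered out of the full rule listing. This is a rough sketch that assumes an English-language system, where netsh labels each rule with "Rule Name:", and that the rule names begin with "Failover Cluster" as in the exception shown above. It prints only the names; use the verbose command shown earlier to inspect an individual rule.

```powershell
# List all firewall rules on the local node and keep only the
# "Rule Name:" lines that belong to Failover Clustering.
& "$Env:SystemRoot\System32\NetSh.exe" AdvFirewall Firewall Show Rule Name=All |
    Select-String -Pattern "^Rule Name:\s+Failover Cluster"
```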
See Also
The following articles describe similar but distinct scenarios.
- Event ID 1572 — Network Connectivity and Configuration
- You are unable to join a node into a Cluster if UDP port 3343 is blocked
- Failover Cluster Communication Failures