Part 10: Datacenter Activation Coordination – My nodes not invalid!

The Start-DatabaseAvailabilityGroup cmdlet is used to restore failed nodes in a Database Availability Group (DAG) by joining the evicted nodes back to the existing cluster. When a node joins a cluster it is assigned a unique decimal node ID. For example, the first node is assigned Node ID 1, the second node Node ID 2, and so on, up to the 16 nodes supported in a DAG. You can view the node ID assigned to a particular node in the registry. Navigate to HKLM\Cluster\Nodes:

 

image

 

The subkeys are the decimal values assigned to each node. Selecting a subkey exposes the NetBIOS name of the node assigned to that Node ID. In this example Node ID 1 is assigned to server MBX-1.

 

image
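The same mapping can be read from PowerShell instead of browsing the registry by hand. The following is a minimal sketch, assuming it is run locally on a DAG member (the Cluster hive is only loaded on cluster nodes) and assuming the node name is stored in a value called NodeName under each subkey; verify the value name on your own system before relying on it.

```powershell
# Sketch: list node ID -> node name pairs from the cluster registry hive.
# Assumption: each subkey of HKLM:\Cluster\Nodes holds a 'NodeName' value.
Get-ChildItem -Path 'HKLM:\Cluster\Nodes' |
    ForEach-Object {
        [pscustomobject]@{
            NodeId   = [int]$_.PSChildName      # subkey name is the decimal node ID
            NodeName = $_.GetValue('NodeName')  # NetBIOS name assigned to that ID
        }
    } |
    Sort-Object NodeId
```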

 

The servers list of the DAG is a multi-valued attribute that represents the members of the DAG. 

 

[PS] C:\>Get-DatabaseAvailabilityGroup DAG | fl name,servers

Name : DAG
Servers : {MBX-4, MBX-3, MBX-2, MBX-1}

 

When Stop-DatabaseAvailabilityGroup is run, the task parses the servers list and determines if an action should be taken on that server. For example, if the next server on the list falls within the AD site being stopped, the cmdlet will take action on that server. 

 

[PS] C:\>Get-DatabaseAvailabilityGroup DAG | fl name,servers,startedmailboxservers,stoppedmailboxservers

Name : DAG
Servers : {MBX-4, MBX-3, MBX-2, MBX-1}
StartedMailboxServers : {MBX-3.domain.com, MBX-4.domain.com}
StoppedMailboxServers : {MBX-2.domain.com, MBX-1.domain.com}

 

After running Restore-DatabaseAvailabilityGroup, the nodes on the stopped servers list are evicted from the cluster.  This results in the Node IDs within the cluster registry associated with those nodes being freed. 

 

image

 

At this stage, the secondary site is functional, independent of the original. Hopefully there will come a time when the primary site is accessible again and its nodes are ready to be added back to the cluster as functional members. Start-DatabaseAvailabilityGroup is used to bring the nodes back into the cluster.

 

Occasionally when running Start-DatabaseAvailabilityGroup the following error is thrown:

 

WARNING: Server 'DAG-4' failed to be started as a member of database availability group 'DAG'. Error: A server-side database availability group administrative operation failed. Error: The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"AddClusterNode() (MaxPercentage=37) failed with 0x13af. Error: The cluster node is not valid"' failed. [Server: DAG-1.domain.com]

 

D:\Utilities\ERR>err 0x13af
# for hex 0x13af / decimal 5039
ERROR_CLUSTER_INVALID_NODE winerror.h
# The cluster node is not valid.
SQL_5039_severity_16 sql_err
# MODIFY FILE failed. Specified size is less than current
# size.
# as an HRESULT: Severity: SUCCESS (0), FACILITY_NULL (0x0), Code 0x13af
# for hex 0x13af / decimal 5039
ERROR_CLUSTER_INVALID_NODE winerror.h
# The cluster node is not valid.
# 3 matches found for "0x13af"

 

Why does this error occur?

 

Most customers will never see this error, and those that do typically see it only in testing, where the commands associated with the Datacenter Switchover process are executed in quick succession. When Start-DatabaseAvailabilityGroup is run, the command steps through the stopped mailbox servers list. If you look at the list carefully you will notice that the nodes most likely appear in the reverse of the order in which they were originally added to the DAG.

 

[PS] C:\>Get-DatabaseAvailabilityGroup DAG | fl name,servers,startedmailboxservers,stoppedmailboxservers

Name : DAG
Servers : {MBX-4, MBX-3, MBX-2, MBX-1}
StartedMailboxServers : {MBX-3.domain.com, MBX-4.domain.com}
StoppedMailboxServers : {MBX-2.domain.com, MBX-1.domain.com}

In this example, MBX-1 was added before MBX-2, which indicates that MBX-1 was assigned Node ID 1 and MBX-2 was assigned Node ID 2. As nodes are added to a cluster, each one receives the first available node ID, so freed IDs are recycled. In this example, Start-DatabaseAvailabilityGroup first encounters MBX-2, so this is the first server it attempts to add to the existing cluster. The cluster service then attempts to assign Node ID 1 to MBX-2. For a short period of time, however, the cluster service caches the Node ID to server mapping that originally existed, so in cache Node ID 1 is still assigned to MBX-1. Adding MBX-2 therefore produces a collision between the Node ID the cluster wants to assign (1) and the mapping that exists in cache. This collision is what ultimately returns the invalid node error to the cmdlet.
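The collision can be illustrated with a toy model. This is only a sketch of the behavior described above, not the Cluster service's actual logic; the function name Add-ToyClusterNode and both hashtables are hypothetical and exist purely for illustration.

```powershell
# Toy model only: this does NOT reflect how the Cluster service is implemented.
function Add-ToyClusterNode {
    param(
        [hashtable]$Active,   # node ID -> server currently in the cluster
        [hashtable]$Cache,    # node ID -> server remembered from before eviction
        [string]$Server
    )
    # The first free node ID is the one that gets recycled.
    $id = 1
    while ($Active.Contains($id)) { $id++ }

    # A cached entry for that ID naming a different server blocks the join.
    if ($Cache.Contains($id) -and $Cache[$id] -ne $Server) {
        return "FAILED: node ID $id is still cached for $($Cache[$id])"
    }
    $Active[$id] = $Server
    return "OK: $Server joined with node ID $id"
}

$active = @{ 3 = 'MBX-3'; 4 = 'MBX-4' }   # surviving nodes
$cache  = @{ 1 = 'MBX-1'; 2 = 'MBX-2' }   # evicted nodes, cache not yet expired

# Same order as StoppedMailboxServers in the example (MBX-2, then MBX-1),
# followed by a second attempt at MBX-2.
$results = foreach ($server in 'MBX-2', 'MBX-1', 'MBX-2') {
    Add-ToyClusterNode -Active $active -Cache $cache -Server $server
}
$results
```

In this model the first attempt at MBX-2 fails because Node ID 1 is still cached for MBX-1, MBX-1 then joins with Node ID 1, and the second attempt at MBX-2 succeeds with Node ID 2.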

 

When Start-DatabaseAvailabilityGroup is run with the -ActiveDirectorySite parameter, adding MBX-2 fails while adding MBX-1 succeeds. Based on the previous explanation this is expected: the cmdlet fails to add MBX-2 and then moves on to MBX-1. During the join process MBX-1 is assigned the first available node ID, in this case Node ID 1, which matches the mapping still held in cache, so the join succeeds. Executing the command multiple times with the -ActiveDirectorySite parameter will eventually result in all nodes being successfully added to the DAG. If you run Start-DatabaseAvailabilityGroup with the -MailboxServer parameter, the error will continue to appear in the short term unless the server specified is the one the cache associates with the first available node ID.

 

Again, most administrators will not encounter this error during production site activations. The time between running Restore-DatabaseAvailabilityGroup and the primary site becoming available to be added back to the DAG is sufficient for the cache to expire. In some circumstances it may be necessary to restart the Cluster service on the surviving nodes to force the cache to expire and correct this error. You can also use Start-DatabaseAvailabilityGroup with the -MailboxServer parameter to add the nodes back in the order in which they were originally added.
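Adding the nodes back in their original order might look like the following in the Exchange Management Shell. This is a sketch under the assumptions of the running example: the DAG is named DAG, and MBX-1 was originally added before MBX-2. The variable name is illustrative only; substitute the order recorded for your own environment.

```powershell
# Assumption: MBX-1 joined the DAG before MBX-2, so it originally held Node ID 1.
$originalJoinOrder = 'MBX-1', 'MBX-2'

foreach ($server in $originalJoinOrder) {
    # Start one server at a time so each reclaims its original node ID.
    Start-DatabaseAvailabilityGroup -Identity 'DAG' -MailboxServer $server
}

# Check the result: the servers should have moved from the stopped list to the started list.
Get-DatabaseAvailabilityGroup 'DAG' |
    Format-List Name, Servers, StartedMailboxServers, StoppedMailboxServers
```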