Unable to correct invalid SQL Server Network Configuration on clustered SQL Server causes clustered SQL Server fail to start “permanently”
Recently, we encountered an issue reported by a customer on a clustered SQL Server. What happens is the following.
The Customer needs to restrict the number of IP addresses SQL 2005 Server is listening. He uses the following procedure for that: open “SQL Server Configuration Manager” (SSCM) -> “SQL Server 2005 Network Configuration” -> “Protocols for XX” , right clicked TCP/IP choosing Protocols and set “Listen All” to “NO”. After that he wants to set the proper IP-addresses. This is working fine on a non-clustered system. On a clustered system the following happens when he is restarting SQL Server: SQL Server does not come online anymore. And the SQL Server error log reports the following.
SQL Server 2005 errorlog
2005-11-28 10:53:29.02 Server Error: 17182, Severity: 16, State: 1.
2005-11-28 10:53:29.02 Server TDSSNIClient initialization failed with error 0x32, status code 0x1c.
2005-11-28 10:53:29.02 Server Error: 17182, Severity: 16, State: 1.
2005-11-28 10:53:29.02 Server TDSSNIClient initialization failed with error 0x32, status code 0x1.
2005-11-28 10:53:29.03 Server Error: 17826, Severity: 18, State: 3.
2005-11-28 10:53:29.03 Server Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.
2005-11-28 10:53:29.03 Server Error: 17120, Severity: 16, State: 1.
2005-11-28 10:53:29.03 Server SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.
Configuring the server to listening on individual IP-addresses is not supported on Clustered SQL Server, thus the failure indicated by the error log is accurate. However, the customer can NOT set “Listen ALL” to “Yes” any more to bring the server online. To be more accurate, even though the customer can set “Listen ALL” to “Yes” using SSCM on each physical node, each time he try to bring SQL Server online, the value will be overwritten to “NO” and the clustered SQL Server will end up failing. This is a serious issue that causes frustration of our customers.
The root cause of this issue is because of cluster checkpoint service behavior. If a setting is changed while the resource is online, that change will get check-pointed to the CPT hive file in the cluster quorum disk. If the resource is offline while you make the parameter change, then it will never be check-pointed. Each time you bring up the resources, the check-pointed value will overwrite the local value. SQL Server network configuration is one resource that is check-pointed. So, if you put the SQL Server resource offline while you change the “Listen All” from “NO” to “YES”, and then you try to put the resource back online, it will fail as the "local" change was overwritten (during resource startup) with what was persisted in the checkpoint file.
Because of this check-pointing behavior, any time that SQL Server network configurations are modified into invalid values while the server is online, restarting the clustered SQL Server will cause the server fail to start “permanently”.
To get out of such BAD state, one workaround is to disable the check-pointing for SQL Server network configuration, described as following.
1. While SQL Server instance is in offline/failed state, disable cluster checkpointing for network configuration by:
cluster res "SQL Server" /removecheck:”SoftwareMicrosoftMicrosoft SQL ServerMSSQL.XXXMSSQLSERVER”
2. Correct the configuration by using SSCM. Verified the key was corrected on both nodes.
3. Bring SQL cluster back online.
4. Re-enabled cluster checkpointing for network configuration by:
cluster res "SQL Server" /addcheck: ”SoftwareMicrosoftMicrosoft SQL ServerMSSQL.XXXMSSQLSERVER”
Note that, for named instance, the resource display name "SQL Server" should be replaced with "SQL Server (<instance name>)".
If the workaround does not resolve the issue described as above in your case, please let us know.
Do you know that you can post question w.r.t SQL Server data access, connectivty issues at https://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=87&SiteID=1?
Nan Tu, Software Design Engineer, SQL Protocols
Disclaimer: This posting is provided "AS IS" with no warranties, and confers no rights
Comments
Anonymous
December 14, 2005
there is a similar story in sql 2000
a resource kit tool clusterrecovery.exe can do the same thing (refresh the registry checkpoint) with a GUI, without having to know the registry keysAnonymous
February 05, 2006
I am trying to connect to a server which is installed with SQL server 2000, using SQL server 2005 client. I am getting the following error
When connecting to SQL server 2005, his failure may be caused by the fact that under the default settingsSQL server does not allow remote connections. (Provider: Named pipes Provider, error: 40 - could not open a connection to SQL server)(Microsoft SQL server, Error:53)Anonymous
March 02, 2006
Manik,
This is a connectivity issue not pertain only to Cluster Scenario. In your case, you need to check:
(1) Remote connectivity is enabled. If not, you can use SQL Server Surface Area Configuration to enable that.
(2) Make sure that the listening port of SQL Server is not blocked by Windows FirewallAnonymous
June 05, 2006
When connecting to SQL server 2005, his failure may be caused by the fact that under the default settingsSQL server does not allow remote connections. (Provider: Named pipes Provider, error: 40 - could not open a connection to SQL server)(Microsoft SQL server, Error:53)Anonymous
January 17, 2007
I get the error message above. I have made the exception in the firewall for both Database engine and the Browser processes. I configured Surface Area config to allow remote with TCP and Named Pipes. I turned on the service "Browser". I still get the error above.Anonymous
January 18, 2007
Please try to answer questions in http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=362498&SiteID=1, If it doesn't solve your problem, please post your question on the forum. It'll be answered in a timely manner. Thanks, NanAnonymous
June 06, 2007
Thanks for your blog Nan. I was expecting to have to make a call but was saved hours of heartache by being able to stop the cluster checkpoint. I found your explanation doing a search on the network error. May I suggest that KB docs 823938 and 265808 be updated to warn folks like me that do not realize static ports are not supported on a cluster? I struggled a bit with the syntax but eventually got it. Fortunately, it was not a production machine. RalphAnonymous
February 15, 2008
Hi all, I did use this but got error as below: C:>cluster res "SQL Server" /removecheck:"SoftwareMicrosoftMicrosoft SQL ServerMSSQL.1MSSQLSERVER" The 'removecheck' option does not require any parameters to be specified. See "CLUSTER RESOURCE /?" for correct syntax. So, I open a case with support team and the soluction was: 1.With SQL offline on cluster C:>cluster res "SQL Server" /checkpoints --- List all checkpoint to SQL Server
- Run /removecheckpoints C:>cluster res "SQL Server" /removecheckpoints:"SoftwareMicrosoftMicrosoft SQL ServerMSSQL.1MSSQLServer" 3.Correct the configuration by using SSCM. 4.run /addcheckpoints C:>cluster res "SQL Server" /addcheckpoints:"SoftwareMicrosoftMicrosoft SQL ServerMSSQL.1MSSQLServer"
- Bring SQL cluster back online. Yes....it´s ok now !! regards Nilton Pinheiro
Anonymous
March 26, 2008
Worked like a champ. Thanks for the post - we looked everywhere trying to fix this.Anonymous
May 09, 2008
Thanks heaps, worked perfectly - just the commands in the original post weren't right but Nilton Pinheiro's correctios were. It's /removecheckpoints not /removecheck and /addcheckpoints not /addcheckAnonymous
June 25, 2008
Nilton, you save me from a new support call... txs!Anonymous
February 11, 2009
This post just saved me from insanity, thank you so much!Anonymous
March 30, 2009
Thanks Nan and Nilton, worked perfectlyAnonymous
November 09, 2010
In my case (this took some time to figure out) SQL 2005 32 bit on Win 2008 64 bit, you need to remove/add the WOW6432 checkpoint. SoftwareWow6432NodeMicrosoftMicrosoft SQL ServerMSSQL.1MSSQLSERVER ...and as to the conflicting final 2 steps: Nan Tu 3) start up cluster 4) /addcheckpoint vs. Nilton 4) /addcheckpoint 5) bring cluster back online ...I went with Nan Tu's approach and that worked for me. Thanks!