SQL failing to come online with SQLClusterResourceWorkerWaitForCompletion() failing
A few days back, I was working on an issue where in my client applied the patch KB3045318 on his SQL server instance running on SQL server 2012 SP1 build.
As it was a 2 node - Failover Clustered Instance, he had applied the patch simultaneously on both the nodes and post which he has rebooted the active node followed by the passive node (during which the failover also did not succeed). Post reboot, the clustered SQL resources are failing to come online and thus he tried to failover the SQL resource from the active to the passive node which again failed.
Thus when he reached out to me, I analyzed the scenario and on examining the cluster logs found the below log snippet.
000039f4.00003bb8::INFO [RES] SQL Server <SQL Server (CXL_PROD)>: [sqsrvres] Dependency expression for resource 'SQL Network Name (GSIECWT5009)' is '([0ebf6db8-26b2-4d6f-91b7-5ec89c770143])'
000039f4.00003bb8::ERR [RES] SQL Server <SQL Server (CXL_PROD)>: [sqsrvres] SQLClusterResourceWorkerWaitForCompletion() failed with error 0. Please contact customer support
000039f4.00003bb8::ERR [RES] SQL Server <SQL Server (CXL_PROD)>: [sqsrvres] Failed to prepare environment for online. See previous message for detail. Please contact customer support
000039f4.00003bb8::INFO [RES] SQL Server <SQL Server (CXL_PROD)>: [sqsrvres] SQL Server resource state is changed from 'ClusterResourceOnlinePending' to 'ClusterResourceFailed'
000039f4.00003bb8::ERR [RHS] Online for resource SQL Server (CXL_PROD) failed.
Here we see it is the SQLClusterResourceWorkerWaitForCompletion() function which is failing to yield the outcome and thus SQL clustered resource is unable to come online.
On research figured out that the cause for this failure is seems to be that the SQL resource DLL detects the installation of a patch during the first failover to the newly patched node. Thus it spawns a worker thread to update the databases accordingly (e.g. create updated system stored procedures and views, etc.). As a first step this worker thread will re-create the NTFS file permissions on the data folders of this SQL instance. As the resource DLL runs in the process RHS.EXE (Failover Cluster Resource Host Subsystem), it runs with the credentials of NT AUTHORITY\SYSTEM.
The reason being the old log folder may have too many files. During a failover the worker thread while be re-creating the NTFS file permissions on MSSQL folder and will have to move through a lot of files and eventually may time out and cause failure.
Thus we need to rename the log folder in the root directory and create a dummy LOG folder which will help us to resolve the issue.
Hope this helps.. Happy troubleshooting!!