

Cluster service failure after AD lockdown...

Users were unable to connect to their shares.  John discovered that the Cluster service wasn't started, and that any attempts to start it resulted in an error 1068.  He attempted to ping the virtual server's IP address and it returned a "request timed out" message.  He got the same error when trying to ping the cluster node's public adapter. 

When he got to the node he found the Cluster service in a Starting state.  He soon discovered that he had no network connectivity to or from either Cluster node, and that their network cards were missing from "Network Connections".  The only changes made to the network were a few minor group policy settings to lock down permissions a bit.  Maybe that had something to do with this? It looked like it was going to be a long night...

This is another fairly common problem.  It is not really just a Cluster problem, but that is usually how it is presented to me.  Of course, if networking is not functional, then Cluster isn't going to work either. :) I have worked at least three of these issues in the last two months, and thought it warranted discussion since there isn't a public KB article on this particular scenario yet.  I hope to fully document every error encountered here, so that others may find this post when they run into this situation.  (KB articles sometimes take a while to get published.)

System event log:

SAM event ID: 12291 "SAM failed to start the TCP/IP or SPX/IPX listening thread"
IPSec event ID: 4292 "The IPSec driver has entered Block mode."
DfsSvc event ID: 14523 "DFS could not contact any DC for Domain DFS operations."

Application event log:

EventSystem event ID: 4609 "The COM+ Event System detected a bad return code during its internal processing. HRESULT was 80004015 from line 142 of d:\nt\com\complus\src\events\tier2\service.cpp."

Other problems discovered with this node:

The Com+ Event System, Network Connections and Shell Hardware Detection services were in a Starting state.

The following services failed to start:

Cluster Service: Error 1068: The dependency service or group failed to start.
File Replication: Error 1068: The dependency service or group failed to start.
--trying to view the dependencies opens a window titled "Service Dependencies" with the message: Win32: Access is denied.
IPSEC Services: Error 1899: The endpoint mapper database entry could not be created.
System Event Notification: Error 1068: The dependency service or group failed to start.
--trying to view the dependencies on the server returns the following message: Win32: Access is denied
Task Scheduler: "The endpoint mapper database could not be loaded"

We have three services failing with "the dependency service or group failed to start."
When we try to view the dependencies we get an access denied message.

Let's look in the registry to see what each of these services depend on:

Cluster service:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc
DependOnService:

ClusNet
RpcSs
W32Time
NetMan

File Replication:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs
DependOnService:

EventLog
RpcSs
EventSystem

System Event Notification:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SENS
DependOnService:

EventSystem

So the common dependencies are RpcSs and EventSystem (each appears in two of the three lists).
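The comparison above can be sketched in a few lines of Python. This is only an illustration: the dictionary is typed in by hand from the DependOnService values listed above, and the function name shared_dependencies is made up for this example. It reports any dependency that shows up under more than one of the failing services.

```python
from collections import Counter

# DependOnService values copied by hand from the registry listings above.
DEPEND_ON_SERVICE = {
    "ClusSvc": ["ClusNet", "RpcSs", "W32Time", "NetMan"],
    "NtFrs": ["EventLog", "RpcSs", "EventSystem"],
    "SENS": ["EventSystem"],
}


def shared_dependencies(depends):
    """Return dependencies listed by more than one service, sorted by name."""
    # set(deps) guards against a dependency being listed twice for one service.
    counts = Counter(dep for deps in depends.values() for dep in set(deps))
    return sorted(dep for dep, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(shared_dependencies(DEPEND_ON_SERVICE))
```

Note that no single dependency appears in all three lists, which is why the sketch counts "more than one" rather than taking a strict intersection.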

RpcSs is the Remote Procedure Call (RPC) service, and EventSystem is the Com+ Event System service.  We know from earlier that Com+ Event System is one of the services stuck in a Starting state, so that is why the File Replication and System Event Notification services haven't started.  One of the other dependencies for the Cluster service is NetMan, which is the Network Connections service.  Network Connections is also one of the services stuck in a Starting state.

So now the real question is: Why are the Com+ Event System and Network Connections services not starting?

If we view the dependencies for these two services, we just find RpcSs listed.  So it all boils down to RPC.  However, the Remote Procedure Call (RPC) service is actually started.
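To make the "it all boils down to RPC" reasoning concrete, here is a rough Python sketch that follows the dependency chain for each failing service and reports what they all have in common. The map is hand-built from the values shown in this post; the EventSystem and NetMan entries are taken from the statement above that both list only RpcSs, and the dependencies of the remaining services (ClusNet, W32Time, EventLog) were not examined here, so they are left out. The function names are hypothetical.

```python
# Direct dependencies as described in this post. Services whose own
# dependencies were not examined here simply have no entry.
DEPENDS = {
    "ClusSvc": ["ClusNet", "RpcSs", "W32Time", "NetMan"],
    "NtFrs": ["EventLog", "RpcSs", "EventSystem"],
    "SENS": ["EventSystem"],
    "EventSystem": ["RpcSs"],
    "NetMan": ["RpcSs"],
}


def all_dependencies(service, depends):
    """Return every direct and indirect dependency of a service."""
    seen = set()
    stack = list(depends.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends.get(dep, []))
    return seen


def common_root(services, depends):
    """Return dependencies shared, directly or indirectly, by all services."""
    sets = [all_dependencies(s, depends) for s in services]
    return sorted(set.intersection(*sets))


if __name__ == "__main__":
    print(common_root(["ClusSvc", "NtFrs", "SENS"], DEPENDS))
```

With the data from this post, the only dependency that all three failing services end up relying on is RpcSs.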

If you do a search in the knowledge base on these errors, you are likely to come across this article:

909444 Systems that have changed the default Access Control List permissions on the %windir%\registration directory may experience various problems after you install the Microsoft Security Bulletin MS05-051 for COM+ and MS DTC

This discusses changes made by a hotfix that would cause these problems.  The fix is to correct NTFS permissions on the %SystemRoot%\Registration directory.  However, in this case the permissions on that directory already match the ones listed in the article, so that is not the cause here.

You may also come across this one:

916254 COM+-related events may be logged in Event Viewer when you install Windows XP Service Pack 2 and join the computer to a domain

Most would come across this second article and instantly dismiss it since it says "Windows XP Service Pack 2." However, we have a lot of the same symptoms, and since XP SP2 and Server 2003 SP1 include a lot of the same security changes, it warrants further investigation.
One of the security changes in SP1 for Windows Server 2003 was to change the Logon Account used for RPC.
RPC used to log on as Local System and now uses an account with fewer privileges: Network Service.

The article states that this issue occurs if the SERVICE account is missing from the policy setting "Impersonate a client after authentication".

We can see if SERVICE is missing from this policy by performing the following steps:

1. Open up Local Security Policy in order to see what the effective settings are:

Start, Run, secpol.msc

2. Expand Local Policies, User Rights Assignment and then open up "Impersonate a client after authentication"  

At minimum the following should be listed: Administrators and SERVICE.
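If you have many servers to check, the same test can be roughly scripted. The sketch below is not part of the original troubleshooting; it assumes you have exported the user rights to a text file (for example with secedit /export /cfg rights.inf /areas USER_RIGHTS) and that the export lists well-known accounts as SIDs prefixed with an asterisk under SeImpersonatePrivilege. The function name and the sample text are made up for illustration; S-1-5-32-544 and S-1-5-6 are the well-known SIDs for Administrators and SERVICE.

```python
# Well-known SIDs for the two accounts that should hold the right at minimum.
REQUIRED = {
    "S-1-5-32-544": "Administrators",
    "S-1-5-6": "SERVICE",
}


def missing_impersonate_accounts(inf_text):
    """Return required account names absent from SeImpersonatePrivilege."""
    granted = set()
    for line in inf_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "SeImpersonatePrivilege":
            # Entries look like "*S-1-5-6"; plain account names are also accepted.
            granted = {
                item.strip().lstrip("*").upper()
                for item in value.split(",")
                if item.strip()
            }
            break
    return sorted(
        name
        for sid, name in REQUIRED.items()
        if sid not in granted and name.upper() not in granted
    )


if __name__ == "__main__":
    # Illustrative sample only, not output captured from a real server.
    sample = (
        "[Privilege Rights]\n"
        "SeImpersonatePrivilege = *S-1-5-32-544,*S-1-5-21-1111-2222-3333-1105\n"
    )
    print(missing_impersonate_accounts(sample))
```

The export file is usually written as Unicode, so opening it with encoding="utf-16" is a reasonable first guess when reading it from disk. An empty result means both accounts are present; treat the script as a convenience check and confirm anything surprising in secpol.msc.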

The problem that I have seen recently happens when someone decides to change the "Impersonate a client after authentication" user right in group policy.  Typically they decide to lock down their servers and give only specific accounts certain privileges.  However, after the SERVICE account is incorrectly removed from this privilege, the server loses all network connectivity.  Fortunately this problem doesn't show up until after a reboot.  (You have an opportunity to identify that the problem exists before causing a major outage of all servers in a large OU.)

The fix is simple for the servers that haven't been restarted:

1. Correct the policy and then force group policy to be reapplied. (gpupdate /force)

(To correct the policy: just add SERVICE and Administrators to this policy setting in addition to the other ones defined)

If you have already rebooted the servers after applying the incorrect policy settings, simply changing the policy back will not correct them, since they have already lost network access and cannot receive the updated policy (unless the policy change was made locally to begin with).  In that case:

1. Export the following registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\RpcSs

2. In the services snap-in: Change Remote Procedure Call (RPC) to start up with the Local System account instead of Network Service, and then reboot

3. At this point the majority of the services should be started and we should now have network access. Ensure that the offending group policy has been corrected with the proper accounts, force group policy to apply (gpupdate /force), and then reboot.

4. Change the logon account for the Remote Procedure Call (RPC) service back to Network Service by importing the reg file that you exported in step one, and then reboot. Alternatively, navigate to the following registry key, make the change below, and then reboot:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\RpcSs

Change the ObjectName value from LocalSystem to: NT Authority\NetworkService
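After the final reboot it is worth confirming that the RPC service really is back on Network Service. The following Python sketch is one possible way to check; it is not something used in the original case. It reads the ObjectName value from the RpcSs key with the standard winreg module (Windows only), and the helper names are hypothetical. The comparison ignores case because the account name may be written with different capitalization.

```python
RPCSS_KEY = r"SYSTEM\CurrentControlSet\Services\RpcSs"
EXPECTED_ACCOUNT = r"NT Authority\NetworkService"


def is_network_service(object_name):
    """Return True if the given ObjectName refers to Network Service."""
    return object_name.strip().lower() == EXPECTED_ACCOUNT.lower()


def read_rpcss_object_name():
    """Read the RPC service logon account from the registry (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, RPCSS_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, "ObjectName")
    return value


if __name__ == "__main__":
    account = read_rpcss_object_name()
    if is_network_service(account):
        print("RpcSs logs on as Network Service again.")
    else:
        print("RpcSs still logs on as: " + account)
```

If the check reports LocalSystem, step 4 has not taken effect yet and the temporary workaround from step 2 is still in place.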

For more information regarding this security setting, see this article on TechNet: SeImpersonatePrivilege
I have submitted a comment on KB 269229 so that it reflects the requirement for SERVICE to be included in this User Right.

Please let me know if you like the format of this post or if you have any questions.

Until next time.

Thanks, 

Justin Turner
This posting is provided "AS IS" with no warranties, and confers no rights.

Comments

  • Anonymous
    January 01, 2003
    Thank you for visiting.  Text size will be increased with the next post.  Thanks for the suggestion. Justin Turner

  • Anonymous
    January 01, 2003
    Nice to hear that it helped.

  • Anonymous
    January 01, 2003
    For one of the two machines on which I had already tried to install a MS Cluster, I had to find this fix, in addition to the above, to get RPC and Network Connections back online. On the second node RPC came online without these additional mods. http://www.eggheadcafe.com/software/aspnet/32648815/2003-server-r2--network.aspx

  1. ON DOMAIN CONTROLLER Group Policy for this SQL CLUSTER: go to Computer Configuration - Windows Settings - Local Policies - User Rights Assignment, look for the "Bypass traverse checking" policy and add NETWORK SERVICE.
  2. ON LOCAL SQL SERVER: open Windows Explorer and go to the Windows\Registration folder - Properties - Security tab - and add the following accounts with permissions: a. Administrator - Full rights b. System - Full rights c. Everyone - Read / Modify (WRITE) and List. Then click "APPLY", go to the "General" tab and click the "Advance" button. Here click the "Inheritance option" and finally click "OK".
  3. Open regedit. a. Go to "My Computer\HKEY_CLASSES_ROOT\CLSID". Right click on it, select "Permissions" and add "Authenticated Users" with "Full Permissions". b. Go to "My Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services". Right click, select "Permissions" and add "Network Service" and "Local Service" with "Full Permissions".
  4. Finally go to "My Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\RpcSs" and set the "ObjectName" to "NT Authority\NetworkService".
  5. Reboot the problematic server and check if the issue still exists.

  • Anonymous
    January 01, 2003
    Robert:  You're welcome.  I'm glad it helped.

  • Anonymous
    January 01, 2003
    Just a quick note to say that they did update KB 269229 with my comment about requiring the SERVICE account

  • Anonymous
    January 01, 2003
    The comment has been removed

  • Anonymous
    December 16, 2006
    Mr Turner. Great article tons of good info could you though increase some of the text size? I appreciate the link Sir I will be visiting regularly.

  • Anonymous
    January 17, 2007
    Yup, that fixed it. We have successfully recovered both our domain controllers using this fix. Apparently someone on the development staff had changed the Impersonate Privilege to work only for our service account, and not for the rest. Development for the lose!

  • Anonymous
    April 05, 2007
    Thanks a million, your article has allowed us to get back up and running after a few hours of downtime.  Basically all we did was change the logon for the RPC, back to Local System.  So we now have network connectivity, Exchange and most importantly, Remote Desktop Connection, so we don't have to be lying on the floor at the local system in the server room :)  Now we can look at sorting the policy settings you mentioned, from the comfort of our own desks. Thanks again.

  • Anonymous
    May 04, 2007
    Thank you for an informative article. Your article saved me from having to do a server rebuild, as I had no idea what had gone wrong until I came across this article on Google. This happened on SBS2003 in my case - and as I'm the only Administrator, I'm at a loss to understand how the users Administrator and SERVICE were ever removed from "Impersonate a client after authentication" as I don't remember doing it!!! Thanks again. :-)

  • Anonymous
    June 26, 2007
    Thank you for spelling out step by step how to fix my 'sick' Domain Controller.  I experienced EXACTLY what you outlined in this article and was able to fix it.  Thank you!!!!

  • Anonymous
    August 03, 2007
    Thank you Justin. This problem had plagued our network for a few months. I had only stumbled upon the temporary fix of setting each machine's RPC service to Local System Account, but it was just a bandaid on a gushing wound. Thank you, Thank you, Thank you. ~Dan

  • Anonymous
    August 07, 2007
    Just wanted to let you know you saved our bacon with this article. THANKS!

  • Anonymous
    August 30, 2007
    Justin you're the man, you saved my weekend (after foolishly applying a malformed security policy). Your article is really helpful and important. I think the title "Cluster service failure after AD lockdown" is a bit illusive, it doesn't reflect the real context of the problem. it can happen actually on any domain member (SQL server services also failed) Thanks again!

  • Anonymous
    September 16, 2007
    Thanks for the great tips. Problem solved for me during an Active Directory upgrade from win2k to win2k3. I remember that the installation of Norton Antivirus Client Server Suite asked me to change the impersonate setting in the domain group policy years ago. Thanks a lot, Michele Maran from Italy

  • Anonymous
    September 21, 2007
    We had some issues with 2003 SP1 and the Time Service - after a reinstall of SP1 to fix the issue, we had the COM+ and RPC issue also.  In our case, the "Impersonate..." policy was never defined in the DCP.  Just performing the final restart now, and i'd just like to take the chance to backup Meir's comment that indeed - Justin - YOU ARE THE MAN!!!  :0)

  • Anonymous
    October 16, 2007
    You just saved me hours of work.  Thank you for your breakout of this problem.  It affected us during a security patch and reboot session this morning, even though it only affected some of our machines, the advice and underlying reasons were dead on. Thank you very much for sharing this info.

  • Anonymous
    December 18, 2007
    Thanks. Thanks. Thanks. I've been fighting this problem for days and days at one of customers sites (away from home). Thanks for getting me home for the HOLIDAYS.

  • Anonymous
    March 11, 2008
    Justin,    Great information!  I almost passed by because of the title, but this was exactly what we needed. Thanks again for the investigative work, and making it available for us to find!

  • Anonymous
    April 07, 2008
    BRILLIANT!  Thanks for the leg work and making me look good to my director!