DC’s and VM’s – Avoiding the Do-Over

06/05/2009

Hello everyone, Mark from DS again. With more and more companies using virtualization in their environments these days, such as Microsoft Virtual Server, Windows Server 2008 Hyper-V, or VMware, you may end up in the following situation, which I recently worked on:

1) A customer wanted to roll back one of the DC’s in his test environment to basically “back out” of some changes that had been made recently. This was a single-domain forest that consisted of two Domain Controllers. Both of the DC’s were running Windows Server 2003 SP2.

2) Virtual Machine snapshots were being taken instead of normal system state backups.

3) They restored one of the DC’s from one of the snapshots.

4) Replication was broken.

Replication symptoms consisted of the following:

1) The Netlogon service is in a paused state.

2) In the Directory Service event log a replication error was logged, Source was NTDS Replication with the Event ID 2095.

3) Also in the Directory Service event logs were two warnings, Source was NTDS General with the event ID’s 1113 and 1115.

Here are samples of the Directory Service event log events with the description of the event.

Event Type: Error
Event Source: NTDS Replication
Event Category: Replication
Event ID: 2095
Date:
Time:
User:
Computer:
Description: During an Active Directory replication request, the local domain controller (DC) identified a remote DC which has received replication data from the local DC using already-acknowledged USN tracking numbers. Because the remote DC believes it is has a more up-to-date Active Directory database than the local DC, the remote DC will not apply future changes to its copy of the Active Directory database or replicate them to its direct and transitive replication partners that originate from this local DC. If not resolved immediately, this scenario will result in inconsistencies in the Active Directory databases of this source DC and one or more direct and transitive replication partners. Specifically the consistency of users, computers and trust relationships, their passwords, security groups, security group memberships and other Active Directory configuration data may vary, affecting the ability to log on, find objects of interest and perform other critical operations. To determine if this misconfiguration exists, query this event ID using https://support.microsoft.com or contact your Microsoft product support. The most probable cause of this situation is the improper restore of Active Directory on the local domain controller. User Actions: If this situation occurred because of an improper or unintended restore, forcibly demote the DC.

Event Type: Warning
Event Source: NTDS General
Event Category: Replication
Event ID: 1113
Date:
Time:
User:
Computer:
Description: Inbound replication has been disabled by the user.

Event Type: Warning
Event Source: NTDS General
Event Category: Replication
Event ID: 1115
Date:
Time:
User:
Computer:
Description: Outbound replication has been disabled by the user.

If you run the command repadmin /options <The DC Name> you can verify that inbound and outbound replication is disabled. You will see something similar to this:

Current DC Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL
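
If you want to check for this condition from a script, a minimal sketch in Python could look like the following. It only parses text in the format shown above; the function names are made up for illustration, and it assumes you have already captured the repadmin output as a string by whatever means suits your environment.

```python
import re


def parse_dc_options(output):
    """Return the set of flags found on the 'Current DC Options:' line."""
    for line in output.splitlines():
        match = re.match(r"\s*Current DC Options:\s*(.*)", line)
        if match:
            return set(match.group(1).split())
    return set()  # no options line found


def replication_disabled(output):
    """Report whether inbound and outbound replication are flagged as disabled."""
    flags = parse_dc_options(output)
    return {
        "inbound": "DISABLE_INBOUND_REPL" in flags,
        "outbound": "DISABLE_OUTBOUND_REPL" in flags,
    }


# Sample text copied from the output shown above.
sample = "Current DC Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL"
print(replication_disabled(sample))
```

If both values come back True on a DC that was recently restored from a snapshot or an image, that matches the symptoms described in this post.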

With more and more companies using virtualization to replace actual physical hardware, especially in test environments, I believe we are going to see more issues such as this one. This can also happen when you are converting physical hardware to virtual machines, which we refer to as “PtoV” (physical to virtual).

First we need to understand some basic background information regarding Active Directory (AD) replication. Domain Controllers (DC’s) use Update Sequence Numbers (USN’s) to track the updates that need to be replicated between replication partners. Every time a change is made to the data in the directory, the USN is incremented to indicate that a change was made. For each directory partition that the DC stores, USN’s are used to track the latest updates that the DC has received from each source replication partner. Each DC also keeps a table with the highest USN it knows about for every other DC that stores a replica of that directory partition. Each DC also has a value on its NTDS Settings object called an invocation ID. This value is used to identify its version of its local AD database.

There are two values that use USN’s during the replication process. One is the up-to-dateness vector, the other is the high water mark. The up-to-dateness vector is a value that the destination DC maintains for tracking the originating updates that are received from its source DC’s. When the destination DC requests its updates for a directory partition, it supplies its up-to-dateness vector to the source DC, which can use that value to reduce the set of attributes it needs to send to the destination DC. The source DC will send its up-to-dateness vector value to the destination DC once the replication cycle has completed. The high water mark is a value that the destination DC maintains to keep track of the latest change it has received from a specific source DC for an object in a specific directory partition. This value prevents the source DC from sending changes to the destination DC that the destination DC has already applied.

The invocation ID is a GUID value that identifies the directory database running on a DC and is maintained separately from the identity of the server object. The server object identity never changes, but the identity of the directory database (invocation ID) will change when a system state is restored by using the Microsoft API’s. All the domain controllers keep track of the directory database on their source replication partners. Both the up-to-dateness vector and the high water mark refer to the invocation ID so that other DC’s know which copy of the AD database the replication is coming from.
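
To make the filtering idea a bit more concrete, here is a toy sketch in Python of how an up-to-dateness vector keyed by invocation ID can be used to skip changes the destination already has. This is a simplified model for illustration only and not how AD is actually implemented; the Change class, the changes_to_send function, and the "db-A"/"db-B" identifiers are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    originating_invocation_id: str  # database on which the change was first made
    originating_usn: int            # USN that database assigned to the change
    description: str


def changes_to_send(source_changes, dest_utd_vector):
    """Keep only the changes the destination has not seen yet.

    dest_utd_vector maps an invocation ID to the highest originating USN
    the destination has already applied from that database.
    """
    return [
        change for change in source_changes
        if change.originating_usn
        > dest_utd_vector.get(change.originating_invocation_id, 0)
    ]


changes = [
    Change("db-A", 100, "change the destination already has"),
    Change("db-A", 101, "newer change from database A"),
    Change("db-B", 7, "change from a database the destination has never seen"),
]
dest_vector = {"db-A": 100}

for change in changes_to_send(changes, dest_vector):
    print(change.description)
```

Note that the lookup is by invocation ID, not by server name. That detail is what matters later in this post: a database that comes back with the same invocation ID but lower USN’s looks, to its partners, like a database whose changes they already have.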

I know this can be confusing so let’s add some graphics that may help to understand this better. Let’s say we have two DC’s, DC1 and DC2. Both of these DC’s are running as Virtual Machines on a host machine running your favorite Virtualization Software. For all intents and purposes we are assuming that replication is working fine and both of the DC’s are up to date on replication. Before we start, we take a “snapshot” of DC1. As we can see below we add a new user “Jeff Smith” on DC1. The USN is incremented from 4710 to 4711 on DC1.

Now we replicate the new user to DC2. DC1 will notify DC2 that it has changes that it needs to replicate. DC2 will then request the changes and send DC1 what it thinks DC1’s high water mark is. In this case DC2 thinks that value is 4710, so that is what it sends. When they are done replicating, DC1 will send DC2 its up-to-dateness vector so DC2 will have the new value.

Now let’s suppose that other changes in the environment are occurring and replicating as they should. “Jeff” logs on and changes his password. When he does this, DC2 is the DC where the change takes place. This increments the USN on DC2 from 2452 to 2453.

Next we replicate that password change over to DC1. DC2 tells DC1 that it has changes it needs to get. DC1 will send DC2 what it thinks DC2’s USN is; in this case DC1 thinks DC2 is at 2452.

Once they are done replicating, DC1’s USN will be 5040 and DC2 will know that DC1 is at 5040. DC2 will be at 2453 and DC1 will know that value as well.
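
The first part of that walkthrough can be replayed with a small Python toy. It is only a sketch under simplifying assumptions: the ToyDC class is made up, each DC tracks nothing but its own USN and a high water mark per partner, and applying a replicated change does not consume a local USN here. The “other changes” that take DC1 up to 5040 are left out.

```python
class ToyDC:
    """Toy domain controller: tracks only its own USN and partner high water marks."""

    def __init__(self, name, usn):
        self.name = name
        self.usn = usn              # highest USN committed on this DC
        self.originated = []        # (usn, description) for changes made on this DC
        self.high_water_mark = {}   # partner name -> highest partner USN received

    def originate(self, description):
        """Make a local change, which increments this DC's USN."""
        self.usn += 1
        self.originated.append((self.usn, description))

    def pull_from(self, source):
        """Request from `source` everything above the high water mark we hold for it."""
        mark = self.high_water_mark.get(source.name, 0)
        received = [desc for usn, desc in source.originated if usn > mark]
        self.high_water_mark[source.name] = source.usn
        return received


# Starting point: both DC's are fully in sync.
dc1 = ToyDC("DC1", 4710)
dc2 = ToyDC("DC2", 2452)
dc1.high_water_mark["DC2"] = 2452
dc2.high_water_mark["DC1"] = 4710

dc1.originate("create user Jeff Smith")       # DC1: 4710 -> 4711
to_dc2 = dc2.pull_from(dc1)                    # DC2 asks for changes above 4710

dc2.originate("Jeff changes his password")    # DC2: 2452 -> 2453
to_dc1 = dc1.pull_from(dc2)                    # DC1 asks for changes above 2452

print(to_dc2, dc2.high_water_mark)
print(to_dc1, dc1.high_water_mark)
```

After these steps each DC holds the other’s current USN as its high water mark, so a second pull in either direction returns nothing.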

Now you want to roll that one DC back. You apply the snapshot to the DC as a restore procedure. When this happens, the invocation ID remains the same and the USN’s are “rolled back” to the time the snapshot was taken. Now when the replication process starts and the “snapshot” DC requests changes from its source DC, it sends the old up-to-dateness vector to the source DC. The source DC compares this value with the value it has in its table for the destination DC, and they are different: the value sent is lower than the one the source DC has recorded. The response sent back by the source DC basically tells the destination DC that its database is out of date. When this happens we have built-in protection so that the destination DC will take measures to not replicate with other DC’s. This is referred to as a “USN rollback” situation.
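
The comparison at the heart of this can be sketched in a few lines of Python. This is a deliberately simplified model and not the actual check AD performs; the function name and the invocation ID strings are made up. The USN values reuse the numbers from the example: the partner last saw DC1 at 5040, and the snapshot was taken when DC1 was at 4710.

```python
def looks_like_usn_rollback(partner_record, reported_invocation_id, reported_usn):
    """Return True if a DC reports a lower USN than a partner already recorded
    for the SAME invocation ID.

    partner_record maps invocation ID -> highest USN the partner has recorded
    for that database.
    """
    known_usn = partner_record.get(reported_invocation_id)
    if known_usn is None:
        # Unknown invocation ID: the partner treats this as a different
        # database, so there is nothing to compare against.
        return False
    return reported_usn < known_usn


# What DC2 has recorded about DC1's database (hypothetical invocation ID).
dc2_record = {"invocation-id-dc1": 5040}

# Snapshot applied: same invocation ID, USN back at 4710.
print(looks_like_usn_rollback(dc2_record, "invocation-id-dc1", 4710))

# System state restore: the invocation ID changes, so old USN's are not compared.
print(looks_like_usn_rollback(dc2_record, "invocation-id-dc1-restored", 4710))

# Normal operation: same invocation ID, USN has moved forward.
print(looks_like_usn_rollback(dc2_record, "invocation-id-dc1", 5041))
```

The second case is the reason a proper system state restore does not end up in this situation: because the invocation ID changes, partners see a new database identity instead of an old one whose USN’s went backwards.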

The protective measures taken in a USN rollback situation are:

1) Pause the Netlogon service.

2) Disable the inbound and outbound replication.

To correct this situation we need to do the following on the DC that has the roll back issue.

1) Forcefully demote the DC by running dcpromo /forceremoval. This will remove AD from the server without attempting to replicate any changes off. Once it is done and you reboot the server, it will be a standalone server in a workgroup.

2) Run a metadata cleanup of the DC that was demoted per KB article 216498 on one of the replication partners.

3) If the demoted server held any of the FSMO (Flexible Single Master Operations) roles then use the KB article 255504 to seize the roles to another DC.

4) Once replication has occurred end to end in your environment, you can rejoin the demoted server to the domain and then promote it to a DC.

To prevent this from happening adhere to the following best practices:

1) Do not use imaging software to take an image of the DC.

2) Do not take or apply snapshots of the DC.

3) Do not shut the Virtual Machine down and simply copy the virtual disk as a backup.

4) If you have the ability to “discard changes” as you do if you are running “Virtual Server 2005 R2”, do not enable this type of setting on a DC Virtual Machine.

5) Use NTBACKUP.EXE, WBADMIN.EXE, or any third party software that is available as long as it is certified to be AD-compatible to take system state backups.

6) Only restore a system state to the DC or restore a full backup.

References:

875495 How to detect and recover from a USN rollback in Windows Server 2003

https://support.microsoft.com/default.aspx?scid=kb;EN-US;875495

Appendix A: Virtualized Domain Controllers and Replication Issues

https://technet.microsoft.com/en-us/library/dd348479.aspx

Backup and Restore Considerations for Virtualized Domain Controllers

https://technet.microsoft.com/en-us/library/dd363545.aspx

- Mark Ramey

Comments

Anonymous
June 06, 2009
Good article. Yeah, seeing this popping up in forums a lot recently. I'll just link to this article from now on in my reply to those posts :)
Anonymous
June 07, 2009
Important reference, bookmarked :-)
Anonymous
June 10, 2009
All - great info. However, in MSFT's VS 2005 whitepaper about running DCs on VMs, there is mention of a regchange (notice I didn't say 'reghack') that prevents USN rollback:

Using the previous .vhd, start the domain controller in Directory Services Restore mode.
In a registry editor, if the entry DSA Previous Restore Count under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters is visible, make a note of the value. If the entry is not visible, assume a value of 0. Do not add the entry.
Add the registry entry Database restored from backup under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters. Data type: REG_DWORD, Value=1.

Is that no longer valid?

Anonymous
June 10, 2009
That is not supported, because it is being used to circumvent a proper backup. It's hacking. Using snapshots with any virtualization technology and a DC is 100% unsupported, always. That article was clearly written back when people were cowboying virtualization because they had no idea what they were doing, and the author clearly did not either. 5 years ago, sounds about right. Don't use that article and expect to be supported if you have DC issues afterwards.
Anonymous
March 03, 2010
So this technet article is all wrong?! It applies to Server 2008 "To restore a previous version of a virtual domain controller VHD without system state data backup" http://technet.microsoft.com/en-us/library/dd363545(WS.10).aspx
Anonymous
March 03, 2010
How is that article wrong? It specifically says: "Do not use the Snapshot feature as a backup to restore a virtual machine that was configured as a domain controller. "
Anonymous
March 04, 2010
It also says: "If you do not have a system state data backup that predates the virtual machine failure, you can use a previous VHD file to restore a domain controller that is running on a virtual machine" If the VHD file has never been started in normal mode it looks to me like this article says it's ok to do it as long as you set the "DSA Previous Restore Count" to 0 in DSRM.
Anonymous
March 04, 2010
Uggghhh... Don't follow that direction until you hear back from me here. I am tracking that down to get some more info on what would happen here if there was more than one DC in the domain. And if you have that, why are you restoring the server? This looks like untested, unsupported, ancient and naive documentation from 6 years ago around Virtual Server. Use system state backups.
Anonymous
March 04, 2010
The comment has been removed
Anonymous
March 05, 2010
From chatting with one of the PQPM's here, we in Support fought to have that documentation removed, and lost. I cannot vouch for its supportability in any way I'm afraid, nor have I been able to get a developer to vouch for it. I am still asking around though.
Anonymous
March 05, 2010
I hope this didn't cause you any delay in the "Friday Mail Sack" ;)
Anonymous
March 05, 2010
Nah, but plenty of other stuff is... :-/ So, back to your question. I was able to dig up the right folks and get some clarification. I plan on having this article edited for clarity, but:

As long as you never boot the hyper-v snapshot until after you’ve set the ‘dsa restoring from backup’ key, then you’re good and supported (this was tested in Win2008 R2). However, if you ever accidentally boot the hyper-v snapshot before you’ve set the key, then you’re in a USN rollback scenario.
Step 12 is baloney. If the value is not present or correct, you cannot start over with this VHD. You must have another snapshot to restore or have made a copy of this image before you started all these steps.
And finally - the reason the article starts with "Do not use the Snapshot feature as a backup to restore a virtual machine that was configured as a domain controller" but then goes on to give steps is for the absolute last resort, last gasp, "OMG we're all gonna die man" scenarios where your system state backups are not working. The SS backups are still the mechanism you should be using, and the snapshots should never, ever be done in lieu of system state backups. That's why this article is hard to find, but USN rollback articles are easy to find - we want people using system state backups.

Anonymous
March 06, 2010
The comment has been removed
Anonymous
March 06, 2010
Looks great to me :). The only thing I might clarify in there is that "flush" and "commit" are somewhat interchangeable (or at least often related) terms that will be used through a lot of documentation. Your doc uses both also, just at different points. For example: http://msdn.microsoft.com/en-us/library/ms683106(EXCHG.10).aspx Nice work, glad to see that the Wiki is already got traction.
Anonymous
March 07, 2010
Thanks for your remark. I've added it to the article nearly “as is”. And I have one more ongoing question. “Backup and Restore Considerations for Virtualized Domain Controllers” guidance (http://technet.microsoft.com/library/dd363545.aspx) says: “There are two supported ways to perform backup and restore of a virtualized domain controller: <...> 2.Run Windows Server Backup on the host. This action calls the Volume Shadow Copy Service (VSS) writer of the guest to make sure that the backup is performed properly”. Is it correct to substitute “Windows Server Backup” in this statement with something like: “Any certified backup and restore application that is running in the “parent” (or “management”) partition of any certified virtualization platform assuming that this backup and restore application is aware of VSS in the Guest OS and calls it during backup and restore operations”? I.e. is there any magic in how Windows Server Backup specifically holds AD DS or it is just okay to call in-guest VSS and it would take care of the rest?
Anonymous
March 08, 2010
I'm not a backup/restore guru, so here's where the Wiki hopefully kicks in with community experience. :-D One specific aspect of making AD backups through VSS starting in Win2008 is that your backup/restore software is supposed to understand the NTDS writer. Lots more info on MSDN and TechNet about this. A starting point: http://msdn.microsoft.com/en-us/library/bb968827(VS.85).aspx
Anonymous
March 08, 2010
And who's actually in charge for changing Invocation IDs and other post-restore tasks? Is it NTDS Writer who needs to be made aware of restore or some special backup app plug-in?
Anonymous
March 12, 2010
Sorry for being pushy on this but I really want to figure out how it works and still cannot get it. As noted at “VSS Backup and Restore of the Active Directory” (http://msdn.microsoft.com/library/aa384675.aspx): “Following a crash requiring disaster recovery, the Active Directory can be restored as part of the restoration of the operating system state. This restore operation is essentially a writerless restore”. For me it sounds like nobody special gets involved into bare-metal recovery of AD Controller. No writer or whatever. So how does it happen that restored Controller detects it was restored from backup and correctly notifies its replication partners? The only idea I have is that is done using reading “LastRestoreId” registry key at DS startup. That would work in case of System State restore but would not help in case of full volume recovery (because LastRestoreId key is not set in this case). So I'm completely lost here.
Anonymous
April 10, 2010
this is interesting. i installed Forefront just to learn that it cannot be used on a secondary DC. Then used clonezilla to restore to a previous point. Got this problem exactly as described. There must be some kind of a work-around. reinstalling the entire AD seems a very cumbersome way to restore a DC. Did MS not realize that before?
Anonymous
April 12, 2010
Why does there need to be a workaround?

You don't have to reinstall the entire AD. You have to demote/cleanup/promote the failing DC. If you only have one DC, you can't get into this situation anyway. Let's not overstate the problem.
USN rollback is an error condition caused by not restoring a domain controller properly in the first place. And the article is quite clear on explaining how that error state occurs.
We strongly encourage admins in your situation to restore a DC by using a System State Backup. You did take such a backup before installing new software on your DC, did you not?

