“Lag site” or “hot site” (aka delayed replication) for Active Directory Disaster Recovery support

10/20/2008

Hi, Gary from Directory Services here. Today I’m going to talk about the concept of “lag sites” or “hot sites” as a recovery strategy. I recently had a case where the customer asked if the replication interval for a site link could be set higher than 10,080 minutes (7 days). The quick answer was that Active Directory only supports values from 15 up to 10,080 minutes, and the schedule is based on a week. If the replInterval attribute on the site link is manually set to something lower than 15, the minimum of 15 is used. If it is set to something higher than 10,080, the value is ignored and 10,080 is used.
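To make the clamping behavior concrete, here is a minimal Python sketch of the rule described above. The function name `effective_repl_interval` and the constants are hypothetical and exist only for illustration; this is not how Active Directory implements the check, just the arithmetic of the supported range.

```python
# Supported range for a site link's replication interval, as stated above.
MIN_INTERVAL_MINUTES = 15
MAX_INTERVAL_MINUTES = 7 * 24 * 60  # 10,080 minutes = 7 days


def effective_repl_interval(configured_minutes: int) -> int:
    """Return the interval that would be honored (hypothetical helper).

    Values below the minimum are raised to 15; values above the
    maximum are capped at 10,080.
    """
    return max(MIN_INTERVAL_MINUTES, min(configured_minutes, MAX_INTERVAL_MINUTES))


if __name__ == "__main__":
    for value in (5, 180, 20_160):
        print(value, "->", effective_repl_interval(value))
```

In other words, asking for a 14-day interval (20,160 minutes) still leaves you with a 7-day one, which is why the customer's request could not be met through the site link alone.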

But the underlying question kept coming back to the recommendation of a latent “lag site”.

First let me give a quick definition of a lag site or hot site and its general intended purpose. A lag site is just an Active Directory site that is configured with a replication schedule of one, two or maybe three days out of the week. That way it will have data that would be intentionally out-of-date as of the last successful inbound replication. It is sometimes used as a quick way to recover accidentally deleted objects without having to resort to finding the most recent successful backup within the tombstone lifetime of the domain that has the data.

This sounds like a decent idea, in theory. However, Microsoft Support does not recommend a lag site as a disaster recovery strategy. Servicing products such as hotfixes and service packs may not recognize the quasi-offline DC state, and monitoring software may also detect the state of a lag site DC as malfunctioning and attempt to re-enable it (or tell an unwitting administrator to do so). Microsoft makes no guarantees that the servicing and monitoring products would not re-enable Netlogon and KDC services in a lag site. In addition, other Microsoft products, such as Exchange Server, are not designed to operate in a lag site and they may not function properly with lag site DCs.

The following lists some reasons why lag sites should not be relied upon as a disaster recovery strategy, especially in lieu of proper Active Directory System State backups:

Lag sites are not guaranteed to be intact in a disaster:

If the disaster is not discovered before replication occurs, the problem is replicated to the lag site, and the lag site cannot be used to undo the disaster. A lag site typically needs to be three days latent in order to cover situations that occur during the weekend, when visibility is low. However, this means that you are actually forced to ‘lose’ more changes than you would with a reliable daily backup run on domain controllers.
Thus, the administrator must act immediately when a disaster occurs: inbound and outbound replication must be disabled and repadmin /force must be forbidden.
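The timing problem above can be sketched with a little date arithmetic. The Python below is only an illustration under simplifying assumptions: replication into the lag site happens exactly once per interval, nobody forces a sync, and the helper name `lag_site_status` and the sample timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def lag_site_status(last_sync: datetime, interval: timedelta,
                    disaster: datetime, now: datetime) -> dict:
    """Estimate whether a lag site still holds pre-disaster data.

    Assumes one inbound replication per interval and no forced syncs.
    """
    next_sync = last_sync + interval
    # The copy is usable only if the bad change happened after the last
    # sync and the next scheduled sync has not yet run.
    intact = last_sync <= disaster and now < next_sync
    return {
        "intact": intact,
        # How long the administrator has left to disable replication.
        "time_left": next_sync - now if intact else timedelta(0),
        # How old the lag site data was when the disaster struck.
        "stale_by": disaster - last_sync if intact else None,
    }


if __name__ == "__main__":
    status = lag_site_status(
        last_sync=datetime(2008, 10, 17, 18, 0),  # Friday evening sync
        interval=timedelta(days=3),               # three-day latency
        disaster=datetime(2008, 10, 18, 9, 0),    # Saturday mass deletion
        now=datetime(2008, 10, 20, 8, 0),         # noticed Monday morning
    )
    print(status)
```

With these sample values the deletion is noticed only hours before the next scheduled replication, which is exactly the kind of narrow margin that makes the lag site unreliable as the only safety net.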

Replicating from lag site might have unrecoverable consequences:

Since a lag site contains out-of-date data, using it as a replication source may result in data loss depending on the amount of latency between the disaster and the last replication to the lag site.
If something goes wrong during recovery from a lag site, a forest recovery might be required in order to roll back the changes.

Lag sites pose security threats to the corporate environment:

For example, when an employee is fired from the company, his/her account is immediately deleted (or disabled) from Active Directory, but the account might still be left behind in the lag site. If the lag site domain controllers allow logons, this could potentially lead to unauthorized users with access to corporate resources during the lag site replication delay “window”.

Careful consideration must be put in configuring and deploying lag sites:

An Administrator needs to decide the number of lag sites to deploy in a forest. The more domains that have lag sites, the more likely one can recover from a replicated disaster. However, this would also mean increased hardware and maintenance costs.
An Administrator needs to decide the amount of latency to introduce. The shorter the latency, the more up-to-date and useful the data would be in the lag site. However, this would also mean that administrators must act quickly to stop replication to the lag site when a disaster occurs.
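The latency trade-off can be expressed as simple arithmetic: within one replication cycle, the time left to react and the age of the lag site data always add up to the configured latency. The sketch below assumes a fixed cycle length and a single sync per cycle; `latency_tradeoff` is a hypothetical helper, not part of any Microsoft tool.

```python
from datetime import timedelta
from typing import Tuple


def latency_tradeoff(latency: timedelta,
                     disaster_offset: timedelta) -> Tuple[timedelta, timedelta]:
    """Return (reaction_window, staleness) for one replication cycle.

    disaster_offset is how long after the last sync the disaster occurs.
    """
    if not timedelta(0) <= disaster_offset < latency:
        raise ValueError("disaster_offset must fall within one replication cycle")
    reaction_window = latency - disaster_offset  # time until the next sync
    staleness = disaster_offset                  # age of the lag site data
    return reaction_window, staleness


if __name__ == "__main__":
    for days in (1, 2, 3):
        latency = timedelta(days=days)
        # Assume the disaster strikes halfway through the cycle.
        window, stale = latency_tradeoff(latency, latency / 2)
        print(f"{days}-day latency: {window} to react, data is {stale} old")
```

A longer latency buys more time to react on average, but the data you recover from is correspondingly older, so there is no setting that improves both at once.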

The above list is not exhaustive, and there could be other unseen problems with deploying lag sites as a disaster recovery strategy. It has always been strongly recommended that the best way to prepare for disasters such as mass deletions, mass password changes, etc. is to back up domain controllers daily and verify these backups regularly through test restorations.

Finally, keep in mind that testing your disaster recovery routine is vital, both before you begin to rely on it and regularly once it has become your recovery strategy. Surprises are never good when a disaster strikes.

Here are some links to Microsoft recommended recovery steps and practices:

840001 How to restore deleted user accounts and their group memberships in Active Directory - https://support.microsoft.com/kb/840001

Useful shelf life of a system-state backup of Active Directory - https://support.microsoft.com/kb/216993

Managing Active Directory Backup and Restore - https://technet2.microsoft.com/windowsserver/en/library/5d683eeb-e76c-46e9-92f4-fcb2a10f955f1033.mspx

Step-by-Step Guide for Windows Server 2008 AD DS Backup and Recovery - https://technet.microsoft.com/en-us/library/cc771290.aspx

Active Directory Backup and Restore in Windows Server 2008 - https://technet.microsoft.com/en-us/magazine/cc462796.aspx

- Gary Mudgett

Comments

Anonymous
October 22, 2008
Hello Gary - good post that should allow plenty of discussions on this topic. I would say the most important statement you are making indirectly with this blog is that the implementation of lag-sites is GENERALLY SUPPORTED. Not recommended, but supported. And I totally agree that they shouldn't be leveraged and implemented by people who don't know what they're doing. Lag-Sites by no means replace normal domain controller backups (and periodic recovery testing) - they are merely one of many options to increase the speed of object recovery in case that should be required. Regardless of the processes being used, the AD administrator must know what he or she is doing when backing up and restoring objects in AD. This is especially the case in multi-domain AD forests, where by design AD lacks the capability to completely recover objects including all potential cross-domain links, which can cause a lot of pain for AD administrators. Lag-sites can help reduce this pain, if the administrator knows how to leverage them correctly. Here are a few thoughts on your arguments against Lag-Sites:
1 - Lag sites are not guaranteed to be intact in a disaster
Yep, that's one of the reasons why you certainly still need normal domain controller backups. However, the majority of "accidental" deletions are typically detected very fast. And the whole point of implementing lag-sites is to be able to react quickly and to recover objects quickly without the need to first recover a DC from backup (or, to do it right in a multi-domain environment, recover a DC from every domain from backup). In a lag-site, all you have to do is boot the respective DC into Directory Services Restore Mode (DSRM) and run the authoritative object restores directly. There is additional work to do to fully recover cross-domain links such as memberships in local groups in another domain of the forest, but leveraging lag-site DCs from those domains (for example to check what the group memberships of a given user should be) gives admins a clear advantage over the need to first restore DCs from every domain to be able to recover those cross-domain links. Naturally, other methods can also be used to ensure backup and recoverability of those links, but only relying on normal domain controller backups is actually a bad thing (in multi-domain forests).
2 - Replicating from lag site might have unrecoverable consequences
I do not see any value at all in this argument. The whole process of object recovery in AD relies on the use of "out-of-date data". If I first have to reboot the DC into DSRM and then recover the database to a previous version from my DC backups, I'm doing nothing else: I'm putting "out-of-date data" on the DC so that I can increase the version number on the respective objects I want to recover using the "authoritative restore" method with NTDSUTIL. The same thing is done with DCs in lag-sites, with the big difference being the time that it takes for me to get to the point that allows me to perform the auth restore: it's much faster on a lag-site DC since I don't first need to recover the DC's system state from the backup (which, depending on the size of the AD database, can take a long time - and this time has even increased quite a bit in Win2008 due to the changes of the underlying backup mechanisms). As such, the risk you are stating to scare your readers, "If something goes wrong during recovery from a lag site, a forest recovery might be required in order to roll back the changes", should fully apply the same way when restoring objects on a DC that was first recovered from backup - clearly in both cases admins can do stupid things, but this risk is not higher for lag-sites.
3 - Lag sites pose security threats to the corporate environment
As you write "If the lag site domain controllers allow logons...", this would indeed be a risk and goes back to my initial statement that admins leveraging lag-sites need to know what they're doing. If they do and they monitor the lag-site DCs to ensure they stay configured appropriately, then I would say this risk is mitigated.
4 - Careful consideration must be put in configuring and deploying lag sites
Fully agree - and this is not necessarily a downside for lag-sites either. Careful consideration is required to plan any stable backup and recovery method for AD. Especially in multi-domain AD forests, administrators who don't sufficiently understand the AD replication and object linking model and its implications for object recovery will find themselves in a situation of potentially not being able to fully recover an object to its previous state when only relying on domain controller backups. Lag-Sites can certainly help to ease this pain. Clearly, for many admins who don't want to or don't care to dig down into the AD internals, Lag-Sites should not be recommended and instead the use of third party tools should be considered for AD backup and recovery support.
I think that this blog entry - especially since it references Win2008 AD backup/recovery links - should also highlight some important changes in Win2008 with respect to recoverability of objects in AD. And I don't mean the new VSS snapshot support for AD including the capability of mounting a "previous version" of the AD database, which can also be leveraged to support the recovery of objects. I actually mean the fact that the AD service on Win2008 DCs can be stopped and restarted. While it's not supported to restore the AD database in "NTDS stopped" state, I have confirmation that it IS SUPPORTED to perform an auth-restore of objects in this state.
The question that I don't have an answer to yet is: how long is it supported to have a DC run in "NTDS stopped" state? Clearly, this feature hasn't been designed to be used for lag-sites, but is there anything that speaks AGAINST using it in this fashion? Why shouldn't I be able to leave the service stopped for three days, and only start it up periodically to replicate with its partners - and if required, use it to first auth-restore objects in AD... i.e. the new "Win2008 Lag-Site feature" ;-) Would be great to get your feedback on my comment - especially on the last question. Cheers, Guido
Anonymous
October 22, 2008
Hi Guido, I'm noodling on your points, but for your direct question: "The question that I don't have an answer to yet is: how long is it supported to have a DC run in "NTDS stopped" state?" The answer is the tombstone lifetime. Exactly the same as how long it is supported to have a DC turned off, basically.
Anonymous
November 19, 2008
We’ve been at this for over a year (since August 2007), with more than 100 posts (127 to be exact), so
Anonymous
November 22, 2008
Yep - very much aware of the upcoming AD Recycle Bin feature and glad we can talk about it publicly now. But it'll still be a few years until companies reach the Windows 2008 R2 Forest Functional level, which I understand is a requirement to leverage this very cool new feature. For large companies this is easily 1-2+ years post release of R2. So 2-3+ years to go. And in this time they still need a recovery solution - potentially using Lag Sites ;-) /Guido
Anonymous
December 26, 2008
The recycle bin will be a great tool to recover from delete/mass delete scenarios with all objects and their attributes intact. On the other hand it won't help you in any way if someone mistakenly modifies attributes while still retaining the objects themselves. Leveraging AD snapshots to address the second scenario is one way to go. One example of this approach can be found here: http://lindstrom.nullsession.com/?page_id=11
