Disaster Recovery for AppFabric Caching configuration DB (SQL Mirroring)
In the middle tier AppFabric (AF) Caching can be setup in High Availability (HA) mode which is the feature that improves resiliency in the case a cache server goes down and for the backed, when using SQL for your configuration database, the simplest answer is to use mirroring. You may use SQL cluster but this blog will go over the easiest way to improve resilience in the backed for AppFabric configuration database. The SQL Server-Based Cluster Configuration Based article covers this subject but the following is a more holistic order step by step article including things to be mindful of and hints on how to go about testing, all base on a recent customer engagement.
Setup steps and basic concepts
Regardless of the AF Caching setup been a single or a cluster of servers, only one configuration database is required, and hence this is the one which needs SQL mirroring. For other more specific articles on SQL mirroring review this MSDN article, as well as these SQLCAT articles Implementing Application Failover with Database Mirroring and Database Mirroring Best Practices and Performance Considerations.
For a full SQL mirroring setup, a principal, a mirror and witness SQL servers are required. It is true that a witness may not be require to have a mirror setup but, in that case, the switch from a principal failing into a mirror will require an administrator to first become aware of the database failure, then manually switch to the mirror database and redirect all clients to it (DNS switching), and while all of these is taking place, caching requests will not be satisfied. To avoid this, the following only considers a setup including a witness. These are the general steps to take in order to setup a full SQL failover database for AF caching as the main purpose is to focus on the order of the steps and what is that AF caching requires.
1. Identify the 3 different SQL server instances for each required role; the obvious guidance is to have these in separate servers. For this example, they will be labeled SVR01 (principal), SVR02 (mirror) and SVR03 (Witness).
2. Create an empty DB in SVR01 (principal), this will be the principal DB, the one to be use by AF cache server(s), for this sample it will named AF4
3. Setup all the necessary logins as needed for AF cache and then configure the AF cache server(s) to point to your AF4
4. SVR01 will now have new logins (under security) created by the setup, one for each cache host machine, note that this will also be the case when running the AddCacheHost Script. As shown below, 4 cache hosts are setup - one for each account use by each of the AppFabric Caching Service running on each host (Redmond is the domain name). These logins will need to be replicated in SVR02 (mirror), this will allow the mirror DB to properly function as a principal DB for when a failure of the principal takes place.
Also note that these entries will be deleted when running the RemoveCacheHost Script
5. Stop all AppFabricCaching Service from all hosts and then generate a full backup (log and Data) of the AF4 DB (principal) and restore it to the SVR02 server as the SVR02.AF4 Database, this will be your mirror DB.
6. Take a Transaction log backup of Svr01.AF4, and then restore it on Svr02.AF4. This is important so that the log sequence chain is properly setup.
7. Right click on SVR01.AF4 and click on Tasks -> Mirror… to run the mirror wizard click on “configure Security”, follow the wizard to setup SVR01 as the principal, SVR02 as the mirror and SVR03 as the witness (no DB needs to be setup in this server) and choose to have a synchronous setup, once completed, start the mirror setup. The final setup should look similar to this, except that it should show three different names for each SQL server instance.
8. Go back to all the cache hosts and change their connection string to include the property “Failover Partner=SVR02”. Note that the DistributedCacheService.exe.config file will be found under the C:\Windows\system32\appfabric\ directory
However, a better alternative is to leverage the script mentioned on this MSDN page (under the Tip section of the “Availability Considerations” heading) which will automate this process.
9. Turn the cache host services back on and start testing
Testing
The original plan for testing what will happen when SVR01 went offline was to remove the network cable from its hub but since in a data center getting to the physical servers can be challenging, the resolution was to simply stopping the SQL Server (SVR01) service which will give the same effect.
Once the SVR01 (principal) goes offline, the witness will promote SVR02 to be the new principal.
Test showed that SVR02 took over rather fast and with no interruption to the service; however small interruptions could be seen, running in-house tests while the cache hosts run against load should be done to understand the behavior of your particular mirror setup.
Failback - in a scenario where an actual production failover takes place, the standard procedure should be to go back to a full SQL mirror setup as soon as possible (i.e. to recover the failed machine). Since right after the failure the full mirror setup no longer exists (a mirror SQL server is not present, only the principal SVR02 and the witness SVR03 remained). To mimic placing back the failed machine (SVR01) (understanding that it was repaired) simply restart the SVR01 SQL service. Once this takes place, the witness will turn the role of SVR01 into the mirror server (the role that SVR02 used to have). Hence, per each failback a switch of roles will take place.
Allow a few minutes between each of these operations to allow the witness to do all the appropriate changes. Again, as in the first failover, temporary interruption in service should not occur.
To take into account
A potential maintenance issue may be seen in step #4 above - the AppFabric Caching Service runs under the machine accounts and not a domain account; this therefore requires making account creation/deletion on the mirror server every time a new cache host machine is added/remove.
If this step is not executed when a cache host removal takes place then the missing machine account in the mirror server will go offline in the case of a failover, since the machine would not be allowed in the new principal SQL server (no user exists for it to access the server). If the AppFabric Caching Service runs under a domain group then this would not be an issue because the group account will only need to be added to the mirror server logins once and no creation or deletion of cache host will affect the access to the database.
Created by: jaimeab
Reviewed by: Jamespo