Windows Failover Clustering Support for FIM 2010 Synchronization Service

Forefront Identity Manager 2010 Synchronization Service is not cluster aware, just as its predecessors ILM and MIIS were not. The high-availability options are therefore limited to deploying it in a cold or warm standby configuration. Another option is to deploy the FIM Synchronization Service on a virtualized server and restore a previous snapshot when the need arises. All of these options require manual intervention and can therefore mean prolonged downtime. As customers rely more heavily on FIM, especially for its self-service password reset and password synchronization capabilities, automated failover of this service becomes very important.

Alex Tcherniakhovski from Microsoft Services, Canada pioneered clustering the MIIS Service using Windows 2003 Failover Clustering. A detailed description of his work can be found on his MSDN blog.

However, because of changes to the security model in Windows 2008 Failover Clustering, Alex's solution no longer works. On Windows 2008, the cluster service runs under Local System in a special context with limited privileges (security hardening). Among the many operations this context restricts, the one that matters for this discussion is that it does not have permission to run miisactive.exe.

An excellent description of how to develop a cluster resource DLL for Windows 2008 failover clustering can be found in the blog post series titled Creating a Cluster Resource DLL Part 1 – 5 on MSDN at https://blogs.msdn.com/b/clustering/. My C/C++ skills have become rusty, but I did try hard to write a custom resource DLL using the Windows Cluster APIs. However, I soon realized that, because of the security hardening of the cluster service, I could not correctly impersonate another user (e.g. FIM Sync Admin) to launch miisactive.exe, even though I was able to do the impersonation required for making FIM WMI calls. I checked with Win32 programming folks on internal forums, and although they said it should be possible, my attempts never succeeded. Since I was running behind the delivery schedule, I had to abandon the attempt midway and work on an alternate approach that would be quick to develop and easy to maintain.

So I ended up creating a Windows service in C#, FIMSyncMonitorService, which monitors the FIMSynchronizationService and starts and stops itself in sync with the status and health of FIMSynchronizationService. It is this FIMSyncMonitorService that is then made highly available as a Generic Service in the Windows Failover Cluster.

At a high level, the FIMSyncMonitorService instantiates two Timers in its OnStart() method for monitoring the status and health of the FIM Synchronization Service:

  • LooksAliveTimer: This timer fires frequently (say every 1 min) and does a status check on the FIM Synchronization Service by querying the Service Controller. If it finds the service stopped, it issues a stop command for the FIMSyncMonitorService, which then causes Windows Cluster to fail over to the next preferred node.
  • IsAliveTimer: This timer fires less frequently (say every 5 min) but does a detailed health check on the FIM Synchronization Service by querying the metaverse using the WMI APIs. If the call fails, it issues a stop command for the FIMSyncMonitorService, again causing a failover to the next preferred node.
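
The two-timer logic above can be sketched roughly as follows. This is only an illustration in Python, not the C# code of the actual FIMSyncMonitorService. The class names, the probe callables and the stop_self callback are all hypothetical. In the real service, the probes would be the Service Controller status query and the WMI metaverse query, and stop_self would stop the monitor service so that the cluster fails over.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PeriodicCheck:
    """One health probe that runs every `interval` seconds (hypothetical type)."""
    name: str
    interval: float
    probe: Callable[[], bool]   # returns True when the watched service is healthy
    next_due: float = 0.0       # time (in seconds) at which the probe should run next


@dataclass
class SyncMonitor:
    """Hypothetical stand-in for the monitor service's timer handling."""
    checks: List[PeriodicCheck]
    stop_self: Callable[[str], None]   # called once with the name of the failed check
    running: bool = True

    def tick(self, now: float) -> None:
        """Run every check that is due at time `now`; stop on the first failure."""
        for check in self.checks:
            if not self.running:
                return
            if now < check.next_due:
                continue
            check.next_due = now + check.interval
            try:
                healthy = check.probe()
            except Exception:
                healthy = False   # a probe that raises counts as a failed check
            if not healthy:
                self.running = False
                self.stop_self(check.name)


def build_monitor(service_is_running, metaverse_query_ok, stop_self):
    """Wire up the two checks with the example intervals (1 min and 5 min)."""
    return SyncMonitor(
        checks=[
            PeriodicCheck("LooksAlive", 60.0, service_is_running),
            PeriodicCheck("IsAlive", 300.0, metaverse_query_ok),
        ],
        stop_self=stop_self,
    )
```

The point of the design is that the monitor never tries to repair the FIM Synchronization Service itself. It only stops itself, and the cluster's Generic Service resource treats that stop as the failure that triggers the move to the next node.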

Note that the time to fail over depends essentially on how long it takes to run miisactive.exe, so failover cannot be instantaneous once a failure is detected. The service is unavailable for the entire duration of the activation. The end-user impact of this downtime is that FIM Self-service Password Reset (FIM SSPR) functionality is unavailable for those couple of minutes.
