Do you want SharePoint 2013 in-flight workflows to complete after a Full Farm Fail Over?
Intro
I’ve been spending a lot of time lately testing SharePoint + SQL Always On scenarios and found some really interesting and important things regarding SharePoint 2013 workflows. Before diving into the issue and solution, let me quickly recap 2013 workflows and some basics around SQL Always On.
We made a huge shift in how workflows are executed in SharePoint 2013. You stand up what’s now called a workflow farm, and each SharePoint 2013 farm creates a registration with the workflow farm via a PowerShell cmdlet. The key benefit is that when a SharePoint 2013 workflow runs, the SharePoint farm/WFE submits the workflow instance to the WF farm for processing, so the WF farm does the heavy lifting. SharePoint 2013 is still backwards compatible: you can create 2010-style workflows, which are processed locally the same way they are in SharePoint 2010.
With SQL Always On, it’s possible to replicate SharePoint databases to other SQL nodes in the same data center. This is synchronous replication, which gives you high availability for a single farm. It’s also possible to replicate those same databases asynchronously to a third SQL node. This third node resides in a separate data center, where a separate farm connects to those databases in a read-only state. This is mainly for disaster recovery: if Data Center 1 dies, you can perform a full farm failover and be up and running against those databases in read/write. This is referred to as the Active/Passive SharePoint 2013 dual-farm model.
Results of Failover Testing and Workflows
I wanted to see how several SharePoint features behave after simulating both a high availability failover via synchronous replication (same farm) and a DR failover via asynchronous replication (different farm).
For my test, I created a fairly simple 2013 list workflow in SharePoint Designer which pauses for 10 minutes, then resumes and updates the current item’s UpdatedByWF field in a list. It looks like:
Note: The workflow will automatically fire when creating an item in the list.
I performed the following test:
- Start the WF by creating a list item
- Fail over the availability group while the workflow is in progress
- Wait 10 minutes to see if workflow completes
Test Time
1. Synchronous Failover to SQL 2 (same farm)
Results – Success (Workflows Complete)
2. Asynchronous Failover to SQL 3 (different farm)
Results – Fail (Workflows stay in progress)
Note: I ran the same test with an equivalent SharePoint 2010 workflow, and it completes successfully in both scenarios!
Issue
This means that out of the box, SharePoint 2013 in-flight workflows will not complete after a full farm failover until you fail back to the farm where the workflow was initiated. In my opinion, this was an eye opener because many large enterprise environments rely heavily on workflows for critical business functions.
Cause
The issue is caused by the way registrations are added to the workflow farm. When you register more than one SharePoint farm to the same workflow farm, you must use unique scope names. If you don’t, you’ll see an error like the following:
So in this case, the workflow farm contains two scope registrations, one for Farm A and one for Farm B. When I kick off a workflow, the workflow instance is written to the instance table in the WFInstanceManagement database, stamped with the scope ID of Farm A (where the workflow was initiated). For example:
After failover, Farm B won’t understand how to interact with this instance because while it knows which instance to query, it passes its scope ID which is different than the scope ID (of Farm A) which is associated with the running workflow instance. The only supported workaround for this configuration is to fail back over to Farm A and let the workflow complete.
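To make the mismatch concrete, here is a toy model in plain PowerShell. It is only a sketch: the store, the scope IDs, the instance ID and the Find-Instance function are all hypothetical and do not reflect the real WFInstanceManagement schema. It just shows why a lookup that carries the wrong scope ID comes back empty.

```powershell
# Toy model only: NOT the real WFInstanceManagement schema.
# Every name below (scope IDs, instance ID, Find-Instance) is hypothetical.
$instances = @(
    [pscustomobject]@{ InstanceId = 'wf-001'; ScopeId = 'scope-farm-a'; Status = 'Started' }
)

function Find-Instance {
    param(
        [object[]]$Store,
        [string]$InstanceId,
        [string]$CallerScopeId
    )
    # The lookup succeeds only when the instance ID AND the caller's scope ID match.
    $Store | Where-Object { $_.InstanceId -eq $InstanceId -and $_.ScopeId -eq $CallerScopeId }
}

# Farm A started the workflow, so its scope ID matches the stamped value.
$fromFarmA = Find-Instance -Store $instances -InstanceId 'wf-001' -CallerScopeId 'scope-farm-a'

# Farm B knows the instance ID but presents its own scope ID.
$fromFarmB = Find-Instance -Store $instances -InstanceId 'wf-001' -CallerScopeId 'scope-farm-b'

"Farm A found instance: $([bool]$fromFarmA)"
"Farm B found instance: $([bool]$fromFarmB)"
```

In this model Farm A finds the instance and Farm B does not, which mirrors the in-progress workflows that never resume after the DR failover.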
Resolution
The resolution is to set both farms to use the same scope name. It’s not quite that simple, though, so keep reading.
Q&A
Question: Wait, I thought this wasn’t possible?
Answer: Yes, that’s what I thought too initially, but there is a Force parameter which you use in Farm B to register with the same scope name.
Question: Wouldn’t that overwrite existing settings in Farm A’s scope registration?
Answer: Yes, it does, which requires further explanation before going through the steps to resolve the issue.
Question: What happens when I register a 2013 SharePoint Farm to a Workflow Farm?
Answer: Let’s assume I run the following for test purposes:
Register-SPWorkflowService -SPSite https://intranet -ScopeName "contosoWF" -AllowOAuthHttp
This will create two scopes in the WF Farm.
Parent Scope with Path /ContosoWF
Child Scope with Path /ContosoWF/default
The scopes contain security configuration which allows the SharePoint server to access and call into the WF farm via server-to-server authentication. The parent scope contains a trusted issuer that has the STS Cert stamped from Farm A. To see this information on the WF farm, run the following PowerShell:
$parentScope = Get-WFScope -ScopeUri https://wf:12291/ContosoWF
$parentScope.SecurityConfigurations.TrustedIssuer
The child scope is stamped with the Realm of Farm A as well as the STS Config’s name identifier property.
$childScope = Get-WFScope -ScopeUri https://wf:12291/ContosoWF/default
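To look at what is stamped on the child scope, one option is to dump its security configurations in full. This is a sketch that assumes you are in the Workflow Manager PowerShell console on the WF farm and that the child scope exposes the same SecurityConfigurations property used on the parent scope above. The property names in the output will depend on your Workflow Manager version.

```powershell
# Assumes the Workflow Manager PowerShell console on the WF farm and the
# scope URI used above; adjust both for your environment.
$childScope = Get-WFScope -ScopeUri https://wf:12291/ContosoWF/default

# Show every property of each security configuration attached to the scope,
# so the realm and name identifier values can be compared against Farm A.
$childScope.SecurityConfigurations | Format-List *
```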
Steps to Resolve Issue
First, I had a couple of moments where I was close to pulling out the sledge hammer on my monitor. Thanks a ton to members of WF PG that helped me out of a few ditches. The steps to work around the issue must be followed in this order:
- Both Active/Passive Farms must use the same STS Certificate
- Both Active/Passive Farms must use the same Realm
- Both Active/Passive Farms must use the same STS Config Name Identifier
- Both Farms must be registered using the same Workflow Scope Name
Step 1: Use the same STS Cert in both Farms
Follow the instructions here to replace the existing STS Cert with the same new one in both farms.
A couple of important notes before applying this step.
Note 1: Don’t try to reuse Farm A’s existing STS Cert in Farm B. It doesn’t work; you must generate a new STS certificate and use that same certificate in both farms.
Note 2: Running the following will likely generate an error and you can ignore the error: certutil -addstore -enterprise -f -v root $stsCertificate
Step 2: Set Farm B with Farm A’s Authentication Realm
1. Farm A: run Get-SPAuthenticationRealm and copy the output
2. Farm B: Set the Authentication Realm to Farm A’s via Set-SPAuthenticationRealm -Realm <guid>
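Put together, Step 2 might look like the sketch below. The `<guid>` placeholder stands for whatever Farm A returns and is deliberately left unfilled. The Add-PSSnapin line is only needed outside the SharePoint Management Shell.

```powershell
# Load the SharePoint cmdlets if this is a plain PowerShell session.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# --- On any server in Farm A ---
# Prints Farm A's authentication realm; copy this value.
Get-SPAuthenticationRealm

# --- On any server in Farm B ---
# Placeholder: paste the realm copied from Farm A.
$farmARealm = '<guid>'
Set-SPAuthenticationRealm -Realm $farmARealm

# Confirm Farm B now reports the same realm as Farm A.
Get-SPAuthenticationRealm
```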
Step 3: Ensure both Farms use the same value for the following property: SPSecurityTokenServiceConfig.NameIdentifier
Important: In this case, you can simply copy the property value from Farm A and set Farm B’s property to the copied value.
Farm A: Launch PowerShell on any server and run the following:
$stc = Get-SPSecurityTokenServiceConfig
$stc.NameIdentifier
Copy the entire value. Note in my case it’s: 00000003-0000-0ff1-ce00-000000000000@fd0ed39d-de87-45d6-8c54-4ef4950ebbff
Farm B: Launch PowerShell on any server and run the following:
$stc = Get-SPSecurityTokenServiceConfig
$stc.NameIdentifier
Note: The value should be different from Farm A’s, so in my case I’m going to set this property to the value above.
$stc.NameIdentifier = "00000003-0000-0ff1-ce00-000000000000@fd0ed39d-de87-45d6-8c54-4ef4950ebbff"
$stc.Update()
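After the update, a quick check on Farm B can confirm the value really matches. This is a minimal sketch: `$expected` is a placeholder for the NameIdentifier you copied from Farm A, not a real value.

```powershell
# Load the SharePoint cmdlets if this is a plain PowerShell session.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Placeholder: paste the NameIdentifier copied from Farm A.
$expected = '<NameIdentifier from Farm A>'

# Re-read the STS config so we compare the persisted value, not a stale object.
$actual = (Get-SPSecurityTokenServiceConfig).NameIdentifier

if ($actual -eq $expected) {
    Write-Host "NameIdentifier matches Farm A."
} else {
    Write-Warning "NameIdentifier differs from Farm A: $actual"
}
```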
Step 4: Register both Farms to the Workflow Farm
This is the most important part. In this case, we will register both farms using the same scope name. I’ll use testscope as my scope name. Out of the box, this cmdlet won’t work in Farm B after it has been run in Farm A, so in Farm B we will run it with the Force parameter.
1. From any SharePoint Server in Farm A, run the following PowerShell:
Register-SPWorkflowService -SPSite "https://intranet" -WorkflowHostUri "https://workflowserver.contoso.com:12290" -ScopeName "testscope"
2. From any SharePoint Server in Farm B, run the following PowerShell:
Register-SPWorkflowService -SPSite "https://intranet" -WorkflowHostUri "https://workflowserver.contoso.com:12290" -ScopeName "testscope" -Force
That should be it! A couple of things to remember while testing out in-flight workflows.
- The user initiating the workflow must have an associated user profile in both farms
- App Management Service Application must be provisioned in both farms
- Do not test by running workflows as the System Account. It will error.
- When updating DNS and initiating failover, you must flush the DNS cache on the WF Server and the SharePoint WFEs
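For the DNS point above, something like the following could be run on the WF Server and on each SharePoint WFE after the DNS record is repointed. It is a sketch: the host name is an assumption taken from the https://intranet URL used in the registration commands, so substitute your own.

```powershell
# Clear the local DNS resolver cache (ipconfig ships with Windows).
ipconfig /flushdns

# Assumed host name, taken from the https://intranet site URL used earlier.
$hostName = 'intranet'

# Resolve the name again and print the addresses, to confirm the server
# now sees the failed-over farm rather than a cached entry.
[System.Net.Dns]::GetHostAddresses($hostName) | ForEach-Object { $_.IPAddressToString }
```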
Thanks,
Russ Maxwell, MSFT