Windows Fabric and Lync 2013/Skype for Business

What is Windows Fabric?

Windows Fabric is a distributed systems platform for building scalable applications. A distributed system is a software system in which all components work together. Windows Fabric is used in both on-premises and cloud scenarios, and it enables applications and services across all tiers to run at cloud scale while being patched and managed without downtime.

Supports stateless and stateful applications

A reverse proxy is one example of a stateless application

Lync is a stateful application

Windows Fabric Architecture

Skype for Business and Lync 2013 leverage the Reliability subsystem of Windows Fabric, which ensures the placement of routing groups and the replication between the primary and the secondary nodes.

Windows Fabric Availability

In this example we have a Windows Fabric cluster with 5 nodes. Windows Fabric divides the users on these nodes into routing groups, and each routing group has up to three replicas: one primary and two secondaries. What happens if the primary goes down? Fabric knows it has two secondary nodes and promotes one of them to primary. If the primary (Node 3) stays down for a period of time, Windows Fabric creates another secondary (Node 5, for example, as in the figure below). If Windows Fabric cannot maintain quorum or does not have enough nodes, it shuts down. If a node is added to the pool, the groups are rebalanced. The User Replicator is responsible for replicating the cluster assignment and user group assignment to all databases within the cluster.
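The promotion and rebuild steps described above can be sketched in a few lines of Python. This is only an illustrative model, not how Windows Fabric is implemented: the `RoutingGroup` class, the `handle_node_failure` function, and the rule of picking the first spare node are all assumptions made for the example, and the rebuild happens immediately here rather than after a waiting period.

```python
from dataclasses import dataclass
from typing import List

REPLICA_COUNT = 3  # one primary plus two secondaries, as described in the text


@dataclass
class RoutingGroup:
    """Hypothetical model of one routing group and its replica placement."""
    name: str
    primary: str
    secondaries: List[str]

    def replicas(self) -> List[str]:
        return [self.primary] + self.secondaries


def handle_node_failure(group: RoutingGroup, failed: str, nodes: List[str]) -> None:
    """Update the group in place after the node `failed` goes down."""
    if failed == group.primary:
        if not group.secondaries:
            raise RuntimeError(f"{group.name}: no secondary left to promote")
        # Promote a surviving secondary to primary.
        group.primary = group.secondaries.pop(0)
    elif failed in group.secondaries:
        group.secondaries.remove(failed)

    # Rebuild missing secondaries on live nodes that hold no replica yet.
    # Picking the first spare is an assumption; the real placement is
    # decided by Windows Fabric.
    spares = [n for n in nodes if n != failed and n not in group.replicas()]
    while len(group.replicas()) < REPLICA_COUNT and spares:
        group.secondaries.append(spares.pop(0))


nodes = ["Node1", "Node2", "Node3", "Node4", "Node5"]
group = RoutingGroup("RG-A", primary="Node3", secondaries=["Node1", "Node2"])
handle_node_failure(group, "Node3", nodes)
print(group.primary, group.secondaries)
```

After Node3 fails, the group still has three replicas: a former secondary is now the primary, and a node that previously held no replica has taken over the vacant secondary role.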

Which Skype for Business/Lync services use Windows Fabric?

Routing Services:

- Contacts, conferences, client certificates, etc.

Lync Storage Services

- Writes archiving/compliance data to SQL and Exchange

MCU Factory Services

- Stores the current MCU load and decides which MCU to allocate next

Conferencing Directory Services

- Maps dial-in numbers to the correct conference URI

What is Replicated across Nodes?

Persistent User Data

- Synchronous replication to two more FEs (Backup / Replicas)

- Presence, contacts/groups, user voice settings, conferences

- Lazy replication is used to commit data to the shared blob store (SQL back end)
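The difference between the synchronous step and the lazy step can be illustrated with a small Python model. It is a sketch under stated assumptions: `FrontEnd`, `ReplicatedStore`, `write`, and `flush_to_backend` are hypothetical names, and a plain dictionary stands in for the shared blob store.

```python
from collections import deque


class FrontEnd:
    """Hypothetical Front End server holding an in-memory copy of user data."""

    def __init__(self, name):
        self.name = name
        self.store = {}


class ReplicatedStore:
    """Toy model: synchronous copies on the replicas, lazy commit to the back end."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries
        self.pending = deque()  # changes not yet committed to the back end
        self.backend = {}       # stands in for the shared blob store

    def write(self, key, value):
        # Synchronous step: the write is applied to the primary and to
        # both secondaries before this call returns.
        for fe in [self.primary, *self.secondaries]:
            fe.store[key] = value
        # Lazy step: the back-end commit is only queued here.
        self.pending.append((key, value))

    def flush_to_backend(self):
        # Runs later, outside the user-facing write path.
        while self.pending:
            key, value = self.pending.popleft()
            self.backend[key] = value


fe1, fe2, fe3 = FrontEnd("FE1"), FrontEnd("FE2"), FrontEnd("FE3")
store = ReplicatedStore(primary=fe1, secondaries=[fe2, fe3])
store.write("user@contoso.com/contacts", ["alice", "bob"])
replicated = all("user@contoso.com/contacts" in fe.store for fe in (fe1, fe2, fe3))
committed_before_flush = "user@contoso.com/contacts" in store.backend
store.flush_to_backend()
committed_after_flush = "user@contoso.com/contacts" in store.backend
print(replicated, committed_before_flush, committed_after_flush)
```

The point of the model is that all three Front Ends hold the data as soon as `write` returns, while the back end only catches up once the queued changes are flushed.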

Transient User Data

- Not replicated across Front End servers

- Presence changes due to user activity, including: calendar, phone call, inactivity

Windows Fabric Tools

FabricLookup.exe is a tool that is part of the Lync 2013 Resource Kit

- Returns the primary and backup replica set for routing groups

Troubleshooting Scenario: Event Log

Name: Lync Server
Source: LS User Services
Event ID: 32174
Level: Warning

Description: Server startup is being delayed because fabric pool manager has not finished initial placement of users.
Currently waiting for routing group: {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}.
Number of groups potentially not yet placed:
Total number of groups:

Cause:
This is normal during cold-start of a Pool and during server startup.
If you continue to see this message many times, it indicates that insufficient number of Front-Ends are available in the Pool.

Possible cause for this event:

- Not all servers in the pool are up

- A problem placing the primary replica of the routing group

Solution:

- Try to start the RTCSRV service on the Front End servers

Start-CsWindowsService -Name "RTCSRV"

- Check the System, Application, and Lync Server event logs

- If Windows Fabric fails to create the replicas for a certain amount of time, try to perform quorum loss recovery (wait 35 minutes)

Reset-CsPoolRegistrarState -PoolFqdn (poolname) -ResetType QuorumLossRecovery

Troubleshooting Scenario: Users cannot sign in, or sign in in resiliency mode

- Primary server is not available

- Both secondary replicas are not available

This condition can occur if the connection to the back-end server is not available, or if Windows Fabric cannot write to the secondary replicas.

Resolution:

- Check for connectivity issues between the Front End and the back end

- Check the state of the replicas

Get-CsPoolFabricState -PoolFqdn (poolname) -Verbose -Type Routing

Get-CsPoolFabricState -RoutingGroup <ID> -Verbose

In the output, look for the fabric service write status. If it shows granted, the routing group is in a healthy state. If it shows that write is not allowed, there is an issue and quorum loss recovery can be run.