Windows Fabric and Lync 2013/Skype for Business
What is Windows Fabric?
Windows Fabric is a distributed systems platform for building scalable applications. A distributed system is a software system whose components run on multiple machines and work together. It is used in both on-premises and cloud scenarios. Windows Fabric enables applications and services across all tiers to run at cloud scale, with the ability to be patched and managed without downtime.
Supports stateless and stateful applications
A reverse proxy is one example of a stateless application
Lync is a stateful application
Windows Fabric Architecture
Skype for Business and Lync 2013 leverage the Reliability subsystem of Windows Fabric, which ensures the placement of routing groups and the replication between the primary and the secondary nodes.
Windows Fabric Availability
In this example we have a Windows Fabric cluster with five nodes. Windows Fabric divides the users on these nodes into routing groups, and each routing group has up to three replicas: one primary and two secondaries. So what happens if the primary goes down? Fabric knows it has two secondary nodes and will promote one of them to primary. If the primary (Node 3) fails for a period of time, Windows Fabric will create another secondary, on Node 5 for example, as in the figure below. If Windows Fabric cannot maintain quorum or does not have enough nodes, it will shut down. If a node is added to the pool, the groups will be rebalanced. The User Replicator is responsible for replicating the cluster assignment and user group assignment to all databases within the cluster.
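The failover behaviour described above can be modelled with a short Python sketch. It is an illustration only, not Windows Fabric code: the `RoutingGroup` class, the `fail_node` function, the node names, the majority rule and the "first free node" selection policy are all assumptions made for the example. The real placement logic is more involved and waits for a period of time before rebuilding a replica.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingGroup:
    """Hypothetical model of one routing group with up to three replicas."""
    primary: str
    secondaries: list = field(default_factory=list)

    def replicas(self):
        return [self.primary] + self.secondaries


def fail_node(group, failed, live_nodes, replica_count=3):
    """Simulate the loss of `failed`; return False if the group loses quorum."""
    survivors = [n for n in group.replicas() if n != failed]
    # Assumption: a majority of the replica set must survive (2 of 3).
    if len(survivors) < replica_count // 2 + 1:
        return False
    if group.primary == failed:
        # Promote one of the surviving secondaries to primary.
        group.primary = survivors[0]
    group.secondaries = [n for n in survivors if n != group.primary]
    # Rebuild a secondary on a live node that holds no replica yet.
    for node in live_nodes:
        if len(group.replicas()) >= replica_count:
            break
        if node != failed and node not in group.replicas():
            group.secondaries.append(node)
    return True


nodes = ["Node1", "Node2", "Node3", "Node4", "Node5"]
group = RoutingGroup(primary="Node3", secondaries=["Node1", "Node2"])
ok = fail_node(group, "Node3", [n for n in nodes if n != "Node3"])
print(ok, group.primary, group.secondaries)
```

In this sketch the routing group keeps three replicas after Node3 fails: a surviving secondary becomes primary and a previously unused node receives the new secondary. If only one replica survives, the function reports quorum loss instead of promoting.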
Which Skype for Business/Lync services use Windows Fabric?
Routing Services
- Contacts, conferences, client certificates, etc.
Lync Storage Services
- Writes archiving/compliance data to SQL and Exchange
MCU Factory Services
- Stores the current MCU load and decides the next MCU to allocate
Conferencing Directory Services
- Maps dial-in numbers to the correct conference URI
What is Replicated across Nodes?
Persistent User Data
- Synchronous replication to two more FEs (Backup / Replicas)
- Presence, Contacts/Groups, User Voice Setting, Conferences
- Lazy replication used to commit data to Shared Blob Store (SQL Backend)
Transient User Data
- Not replicated across Front End servers
- Presence changes due to user activity, including: calendar, phone call, inactivity
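The difference between the two kinds of data can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual Lync implementation: the `FrontEnd` and `ReplicatedUserStore` classes, their method names and the in-memory queue standing in for the lazy write to the SQL back end are all hypothetical.

```python
from collections import deque


class FrontEnd:
    """Hypothetical Front End server holding an in-memory copy of user data."""

    def __init__(self, name):
        self.name = name
        self.store = {}


class ReplicatedUserStore:
    """Hypothetical store: one primary, two secondaries, one back end."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries
        self.backend = {}        # stands in for the SQL back-end blob store
        self.pending = deque()   # writes not yet flushed to the back end

    def write_persistent(self, user, key, value):
        # Synchronous step: the write reaches every replica before returning.
        for fe in [self.primary] + self.secondaries:
            fe.store[(user, key)] = value
        # Lazy step: the back-end commit is only queued here.
        self.pending.append(((user, key), value))

    def write_transient(self, user, key, value):
        # Transient data stays on the primary and is never replicated.
        self.primary.store[(user, key)] = value

    def flush_to_backend(self):
        # Called later (for example by a background task) to drain the queue.
        while self.pending:
            k, v = self.pending.popleft()
            self.backend[k] = v


fe1, fe2, fe3 = FrontEnd("FE1"), FrontEnd("FE2"), FrontEnd("FE3")
users = ReplicatedUserStore(fe1, [fe2, fe3])
users.write_persistent("alice", "contacts", ["bob"])
users.write_transient("alice", "activity", "in a call")
replicated_before_flush = ("alice", "contacts") in fe3.store
backend_before_flush = ("alice", "contacts") in users.backend
users.flush_to_backend()
```

Here the contact list is on all three Front Ends as soon as the write returns but reaches the back end only after the flush, while the activity state exists on the primary alone.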
Windows Fabric Tools
FabricLookup.exe: this tool is part of the Lync 2013 Resource Kit
- Returns the primary and backup replica set for routing groups
Troubleshooting Scenario: Event Log
Name: Lync Server
Source: LS User Services
Event ID: 32174
Level: Warning
Description: Server startup is being delayed because fabric pool manager has not finished initial placement of users.
Currently waiting for routing group: {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}.
Number of groups potentially not yet placed:
Total number of groups:
Cause:
This is normal during cold-start of a Pool and during server startup.
If you continue to see this message many times, it indicates that insufficient number of Front-Ends are available in the Pool.
Possible causes for this event:
- Not all servers in the pool are up
- A problem placing the primary replica of the routing group
Solution:
- Try to start the RTC service on the Front End servers
Start-CsWindowsService -Name "RTCSRV"
- Check the System, Application and Lync Server event logs
- If Windows Fabric fails to build the replica for a certain amount of time, try to perform quorum loss recovery (wait 35 minutes)
Reset-CsPoolRegistrarState -PoolFqdn (poolname) -ResetType QuorumLossRecovery
Troubleshooting Scenario: Users cannot sign in, or sign in in resiliency mode
- Primary server is not available
- Both secondaries are not available
This condition can occur if the connection to the back-end server is not available, or if Windows Fabric cannot write to the secondary replica.
Resolution:
- Check for connectivity issues between the Front End and the back end
- Check the state of the replicas
Get-CsPoolFabricState -PoolFqdn (poolname) -Verbose -Type Routing
Get-CsPoolFabricState -RoutingGroup <ID> -Verbose
In the output, look for the fabric service write status. If it shows Granted, the routing group is in a healthy state. If it shows that write is not allowed, there is an issue and quorum loss recovery can be run.