Lync Server 2013 Control Panel - Issue where AdminUIHost only tries Primary SQL host (Cannot find appropriate URL) with Diagnosis
Background
I am deploying an enterprise Lync 2013 solution and have been doing some failover testing.
Problem is:
Lync Control Panel no longer shows the Management Connection Points correctly (the Web Service URLs, server name or pool name, depending on your environment's configuration). Instead it prompts the administrator with the text "Cannot find appropriate URL, please input a URL to connect to Lync Server Control Panel". And yes, Control Panel still works if you enter the front end pool internal web services FQDN, load balancer VIP or admin FQDN with /CsCp. However, one has to investigate why the behaviour changed.
Before you ask, I had already checked:
a. admin simple URL is valid, has cert, is in the cert SAN list, is accessible via web browser https://admin.pool/Cscp just fine.
b. no problems with authentication/permissions
c. no SCP, SPN or other IIS related problems
d. it was working just fine recently...read on if intrigued :)
Picture this scenario:
- Front end pool, 3 servers
- 2 x SQL hosts (one primary, one mirror) for this pool
- All Lync databases including CMS are mirrored from SQL16 (primary) to SQL17 (mirror)
- CU5 updates applied (September 2014)
Failover situation is this:
- Primary SQL is offline/failed, or temporarily down for maintenance
- DB failover has been successfully invoked for all DBs, which are now Principal on the mirror.
- Get-CSDatabaseMirrorState shows the primary SQL host is offline, and the status on the Mirror is 'Principal' for all DBs (as expected)
- All services working fine
- Event logs are clean
- Lync Control Panel can be accessed via the Admin URL, all topology is fine, with exception of Primary SQL host down.
As you can see below, all the SQL databases show the expected status of 'StatusUnavailable' on Primary, and 'Principal' on the Mirror. There are warnings before each group of DBs (App, CMS etc) highlighting that the Primary instance is unreachable, as shown in yellow below. We haven't removed the failed primary because it is only temporarily offline and will be brought back, as it would be if you were performing SQL or SAN disk maintenance.
(Note: mirroringstatusonMirror of 'disconnected' and statusonPrimary of 'StatusUnavailable' are both expected in this situation, where the Primary is offline/down.)
Diagnosis:
Firstly, I logged into the Front End server and did all the usual verification/troubleshooting. No problems.
Running get-CSManagementConnection shows the following perfectly valid results.
I couldn't find any reasonable information online about exactly what mechanism Control Panel (a.k.a. AdminUIHost.exe) uses to locate the values for the connection points.
Most articles were just discussing the admin URL, trusted zone settings in IE, or other such things like permissions, which most people have obviously already got covered.
I decided to dig deeper and do some packet captures.
I RDP'd onto an inactive DR server within the same Topology and fired up Lync Control Panel with Network monitor capturing its traffic.
Here are the redacted results for your enjoyment:
1. Firstly, here we see the Control Panel (adminUIHost.exe) binding to the local DC then requesting the SCP information to show us for each pool when opening Control Panel:
2. Then we see Control Panel (adminUIHost.exe) attempting to connect to the Primary SQL host to retrieve information from the CMS database. This is quite interesting: I wasn't aware it connected directly to SQL for this particular information. I knew it used SQL, but always suspected it obtained the web service and admin FQDNs for the Lync topology from AD, or looked directly at the Topology information for the Pool and Web Service FQDNs.
Following this trace to the end, it then gives up after failing to connect to the Primary SQL host and gives the user the prompt shown at the top of the article.
This shows us a few things:
a. AdminUIHost / Control Panel was not checking the Database Mirror status before trying to connect (e.g. Get-CSDatabaseMirrorState)
b. AdminUIHost / Control Panel was not even attempting to connect to the mirror SQL even after timing out attempting to connect to the primary!
c. Any machine running control panel must have open connectivity on port 1433 (or the port your instance is on for CMS) to the SQL server.
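Point (c) is easy to verify from the machine running Control Panel. The following is a minimal sketch, not part of the original diagnosis: it only checks that a TCP connection can be opened, and says nothing about SQL authentication or the state of the CMS database. The function name `can_reach` is my own; the default port 1433 is the one mentioned above and will differ if your instance listens elsewhere.

```python
import socket


def can_reach(host, port=1433, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only proves the port is reachable from this machine. It does not
    log in to SQL Server or check which host is currently Principal.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

In the scenario described here, you would expect `can_reach("SQL16")` to be False while the primary is down and `can_reach("SQL17")` to be True, assuming both names resolve from the host you run it on.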
3. So this is the core issue. Even in a valid configuration, with the databases correctly failed over to your mirror, AdminUIHost / Lync 2013 Control Panel (at CU5) will only ever attempt to connect to your CMS on the Primary SQL server, never the mirror. This is quite a surprising bug, since all Microsoft would have to do is have AdminUIHost / Control Panel query the mirror state (as Get-CSDatabaseMirrorState does), pick the SQL Server whose status is Principal for the CMS databases (in this case the mirror), and connect to that host. Because it only ever connects to the Primary CMS SQL host, even when that host is known to be offline, and never tries the mirror, you are guaranteed to get the dialog at the top of the article when running Control Panel. This only happens when you are in a failed-over state or have no connectivity to your Primary SQL CMS host. Even so, I don't see why it should be like this: it should use the currently active SQL server for the pool, whether or not the primary is offline.
4. To prove this is the case, I decided to edit the Windows HOSTS file and add an entry for the Primary SQL host (SQL16) that points it to the IP of the Mirror SQL host (SQL17). This proves a couple of things:
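To make the suggested behaviour concrete, here is a rough sketch of the selection logic I would expect. It is purely illustrative and is not how AdminUIHost is implemented. The function name and parameters are hypothetical, the state strings 'Principal' and 'StatusUnavailable' are the ones reported in this article, and 'Mirror' is an assumed value for the healthy, non-failed-over case.

```python
def pick_active_sql_host(primary, mirror, state_on_primary, state_on_mirror):
    """Return the host that currently holds the Principal role, or None.

    Hypothetical helper: the inputs stand in for what a mirror-state query
    would report for the CMS database on each SQL host.
    """
    if state_on_primary == "Principal":
        return primary
    if state_on_mirror == "Principal":
        return mirror
    # Neither side reports Principal, so there is nothing safe to connect to.
    return None


# Failed-over state from this article: primary offline, mirror is Principal.
active = pick_active_sql_host("SQL16", "SQL17", "StatusUnavailable", "Principal")
```

With the states from this article, `active` is "SQL17", which is the host Control Panel never tries.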
a. The problem is in fact the inability of AdminUIHost / Control Panel to connect to the CMS Mirror despite valid configuration and state
b. That we have valid connectivity to the Mirror SQL host and that it has valid Connection Information in its CMS database
Flush the DNS cache and then...
Sure enough...running Lync Control Panel was fixed immediately:
Now, please, do not implement the hosts file entry in your environment. I did this just to prove the problem so I can send it to Microsoft for review and working into a future update, one would hope. It's especially important you don't put this into a live environment because goodness only knows what would happen if you let a front end server believe it's communicating with the Primary SQL via the Mirror IP. :)
The 'fix' is: make sure your Primary SQL host for the CMS is back online and is the Principal for the DB. Then you'll get valid connection points when running Control Panel. That might sound obvious, but the whole point of the CMS mirror and redundancy is that everything keeps working when failed over, and it's kinda surprising that Control Panel of all things changes its behaviour in such an obvious and negative way.
That's all! :)
Also, since I know people will reach this article because of the myriad of other problems that can cause the same dialog, here's a (very) quick list of the probable causes, to try to help those people out too:
Check:
- Permissions in AD for your user (logged into domain, as CSAdministrator member and/or been delegated Lync Admin rights)
- IIS Internal Web Site is running/accessible
- Admin URL is published in the topology, and is valid
- Internal Web Services FQDN defined in Topology is running/accessible on HTTPS and authenticates user (https://url/cscp)
- IE Security settings treat the FQDNs for admin, pool or FE server name as trusted/intranet zone and bypass the proxy for them
- No antivirus or software firewall is blocking AdminUIHost.exe
- Host running AdminUIHost.exe can reach Domain Controllers on LDAP port via Primary DNS
- Host running AdminUIHost.exe can reach Primary SQL Server for CMS
- Front End Server certificate(s) for IIS Internal Web and OAuth have not expired (check in deployment wizard can be easiest)
- Lync Server application log in Event Viewer for DataMCU issues related to the web components (you may need to re-run local setup and/or diagnose individual IIS issues if this is the case)
Good luck and happy hunting! Remember, don't be afraid to whip out Fiddler or Network Monitor tools to verify the part of the connection path Control Panel is failing on in your environment! You can always test the URLs using IE.