Azure MFA Server - Accessing User Portal is extremely slow
One thing I have learned from working in support is to avoid assumptions. We usually assume things work one way, while they may work differently in certain situations. Data is the only thing you should believe in: it never lies, and it gives you the real picture of what is happening.
I worked on an interesting scenario where accessing the User Portal was extremely slow. I have the following components in my environment.
I have an Azure MFA Master Server and an Azure MFA Slave Server, with the SDK installed on both servers. My internet-facing User Portal server sits in the DMZ and is configured to connect to the SDK on the Slave MFA Server.
Here are the symptoms of the issue:
When I configure web.config on the User Portal server to point to my Slave Server, the User Portal website is extremely slow. It takes 2-4 minutes to load a page.
When I configure web.config on the User Portal server to point to my Master Server, the User Portal website is quite fast. It usually takes a few seconds to load a page.
On the surface, this sounds like a networking or server performance problem, so I started investigating along those lines. The initial network investigation showed no sign of slowness at the network layer. I also collected performance data from the servers, and there were no signs of performance degradation.
So, I decided to start the investigation from scratch. The best place to start is the MFA Server logs. I find these logs valuable when troubleshooting highly complex issues because they capture a large amount of information. Here is the basic flow:
Client --> User Portal --> Slave MFA server with SDK --> Master MFA server.
The delay was observed only when the Slave MFA Server was in the path, so I started my investigation there. There were many logs, but not the one I was interested in: the Web Service SDK log. I did some research and figured out how to enable logging for the Web Service SDK. Basically, we need to add a key to the web.config file of the Web Service SDK.
Now we have a MultiFactorAuthWebServiceSdk log file in the Logs directory. As I expected, this log immediately told me what was slow.
TestMasterConnection was taking 84+84 seconds, while the call that gets the user settings after it was quite fast. I did more research and found that TestMasterConnection checks the availability of the Master using RPC, much like the ping tool checks whether a host is reachable. So, the question was why this check was taking such a long time.
I collected network traces while reproducing the issue and tried to find out whether something was wrong at the RPC layer. All MSRPC traffic looked good and completed in a timely manner, so where was the problem? I broadened my filter to check all traffic between the Master MFA Server and the Slave MFA Server and found some TCP retransmits.
My Slave MFA Server was trying to send a TCP request to port 2000 on the Master MFA Server. Next question: what service is listening on that port on the Master MFA Server, and why is it not responding?
netstat -ano answered the first question, what service is listening on the Master MFA Server: MultifactorAuthSvc.exe is listening on port 2000. That is the service I am interested in, but why was it not responding?
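The netstat step can also be scripted when the output is long. The following is a minimal sketch, assuming the usual Windows `netstat -ano` column layout (Proto, Local Address, Foreign Address, State, PID). The sample text, IP addresses and PIDs are made up for illustration and are not taken from my servers.

```python
# Hypothetical netstat -ano output; addresses and PIDs are invented.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       900
  TCP    0.0.0.0:2000           0.0.0.0:0              LISTENING       1234
  TCP    10.0.0.10:2000         10.0.0.12:49832        ESTABLISHED     1234
  TCP    [::]:2000              [::]:0                 LISTENING       1234
  UDP    0.0.0.0:123            *:*                                    1100
"""


def listening_pids(netstat_text, port):
    """Return the sorted PIDs that have a TCP listener on the given port."""
    pids = set()
    for line in netstat_text.splitlines():
        parts = line.split()
        # TCP rows have five columns; UDP rows have no State column.
        if len(parts) != 5 or parts[0] != "TCP" or parts[3] != "LISTENING":
            continue
        # rsplit keeps IPv6 addresses such as [::]:2000 intact.
        local_port = parts[1].rsplit(":", 1)[-1]
        if local_port == str(port):
            pids.add(int(parts[4]))
    return sorted(pids)


print(listening_pids(SAMPLE, 2000))
```

netstat -ano reports only the PID. The process name behind that PID can then be looked up, for example in Task Manager or with tasklist.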
I collected simultaneous traces on both servers and saw that the traffic was reaching the Master MFA Server, but there was no response. The first thing that came to mind was Windows Firewall.
I collected the Windows Firewall logs and, indeed, Windows Firewall was dropping the packets.
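A quick way to see the same symptom without a full trace is to try the TCP connection by hand from the Slave. This is a minimal sketch using only the Python standard library; the host name in the usage comment is a placeholder, not a real server name. A result of "timeout" (no answer at all) is what a silently dropping firewall typically looks like, while "refused" would mean the host answered but nothing is listening on the port.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # the host replied with a reset: no listener
    except socket.timeout:
        return "timeout"   # no reply before the timeout expired
    except OSError:
        return "error"     # name resolution failure, unreachable network, etc.


# Example usage (placeholder host name, run from the Slave MFA Server):
# print(probe_tcp("mfa-master.example.local", 2000))
```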
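Finding the relevant entries in a busy firewall log is easier with a small filter. The sketch below assumes the standard Windows Firewall log layout (pfirewall.log), where a "#Fields:" header line names the space-separated columns. The sample log lines and IP addresses are invented for illustration.

```python
# Hypothetical pfirewall.log excerpt; timestamps and addresses are invented.
SAMPLE_LOG = """#Version: 1.5
#Software: Microsoft Windows Firewall
#Time Format: Local
#Fields: date time action protocol src-ip dst-ip src-port dst-port size tcpflags tcpsyn tcpack tcpwin icmptype icmpcode info path

2015-06-01 10:15:02 DROP TCP 10.0.0.12 10.0.0.10 49832 2000 52 S 123456789 0 8192 - - - RECEIVE
2015-06-01 10:15:05 DROP TCP 10.0.0.12 10.0.0.10 49832 2000 52 S 123456789 0 8192 - - - RECEIVE
2015-06-01 10:15:06 ALLOW TCP 10.0.0.12 10.0.0.10 49840 443 0 - 0 0 0 - - - RECEIVE
2015-06-01 10:15:07 DROP TCP 10.0.0.50 10.0.0.10 50211 445 52 S 987654321 0 8192 - - - RECEIVE
"""


def dropped_packets(log_text, dst_port):
    """Return the DROP entries for one destination port as dicts."""
    fields = []
    hits = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("#Fields:"):
            # Column names come from the log itself, not from a hard-coded list.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        if row.get("action") == "DROP" and row.get("dst-port") == str(dst_port):
            hits.append(row)
    return hits


for entry in dropped_packets(SAMPLE_LOG, 2000):
    print(entry["date"], entry["time"], entry["src-ip"], "->", entry["dst-ip"])
```

Repeated drops of the same SYN from the same source port match the TCP retransmits seen in the network trace.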
Why was Windows Firewall rejecting this traffic? Usually, when we install any software or service, it creates some firewall rules to allow its traffic. While checking, I found that the Azure MFA service had also created rules that allow any traffic from any port and any IP address.
So the puzzle was why Windows Firewall still rejected the connection. Looking closely at these rules, I found that they apply only to the Private profile, but my Slave MFA Server is domain joined, so this traffic falls under the Domain profile, which the rules do not cover. Now we know why Windows Firewall was rejecting these packets.
Since we know what the problem is, solving it is easy. Create a new rule for the Domain profile or edit the existing rule (I personally prefer not to change the default configuration if creating a new rule can work).
The new rule fixed the issue, and TestMasterConnection now completes within a second.
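For reference, a new Domain profile rule can be created from the command line with netsh instead of the Windows Firewall console. The following is a minimal sketch that only builds the command. The rule name is a placeholder, and limiting the rule to TCP port 2000 is my assumption based on the port seen in the trace; the rules created by the installer are broader.

```python
def build_allow_rule(name, port, profile="domain"):
    """Build a netsh command that adds an inbound TCP allow rule."""
    if profile not in {"domain", "private", "public", "any"}:
        raise ValueError("unknown firewall profile: " + profile)
    return [
        "netsh", "advfirewall", "firewall", "add", "rule",
        "name=" + name,           # placeholder rule name, no spaces
        "dir=in",
        "action=allow",
        "protocol=TCP",
        "localport=" + str(port),
        "profile=" + profile,     # the part the existing rules were missing
    ]


cmd = build_allow_rule("MFA-Master-TCP-2000-Domain", 2000)
print(" ".join(cmd))

# To apply it, run from an elevated prompt on the Master MFA Server:
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command separately from running it makes it easy to review the exact rule before anything is changed on the server.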