Windows Azure Traffic Manager Performance Impact

 

A fairly common question about Windows Azure Traffic Manager (WATM) concerns the performance problems it might cause. The questions are typically along the lines of “How much latency will WATM add to my website?”, “My monitoring site says that my website was slow for a couple of hours yesterday – were there any WATM issues at that time?”, and “Where are the WATM servers? I want to make sure they are in the same datacenter as my website so that performance isn’t impacted.”

Note that this post is only about the direct performance impact that WATM can have on a website. If you have a website in East US and one in Asia, and your East US website is failing the WATM probes, then all of your users will be directed to your Asia website and you will see a performance impact, but that impact has nothing to do with WATM itself.

 

Important notes about how WATM works

https://msdn.microsoft.com/en-us/library/windowsazure/hh744833.aspx is an excellent resource to learn how WATM works, but there is a lot of information on that page and picking out the key information relating to performance can be difficult. The important points to look at in the MSDN documentation are steps #5 and #6 from Image 3, which I will explain in more detail here:

  • WATM essentially only does one thing – DNS resolution. This means that the only performance impact that WATM can have on your website is the initial DNS lookup.
  • A point of clarification about the WATM DNS lookup: WATM populates, and regularly updates, the normal Microsoft DNS root servers based on your policy and the probe results. So even during the initial DNS lookup there is no involvement by WATM, since the DNS request is handled by the normal Microsoft DNS root servers. If WATM goes ‘down’ (i.e. a failure in the VMs doing the policy probing and DNS updating), there will be no impact to your WATM DNS name, since the entries in the Microsoft DNS servers will still be preserved. The only impact will be that probing and updating based on policy will not happen (i.e. if your primary site goes down, WATM will not be able to update DNS to point to your failover site).
  • Traffic does NOT flow through WATM. There are no WATM servers acting as a middle-man between your clients and your Azure hosted service. Once the DNS lookup is finished then WATM is completely removed from the communication between client and server.
  • DNS lookup is very fast, and is cached. The time for the initial DNS lookup depends on the client and its configured DNS servers, but typically a client can do a DNS lookup in ~50 ms (see https://www.solvedns.com/dns-comparison/). Once the first lookup is done, the results are cached for the DNS TTL, which for WATM defaults to 300 seconds.
  • The WATM policy you choose (performance, failover, round robin) has no influence on the DNS performance. Your policy can negatively impact your users’ experience, for example if you send US users to a service hosted in Asia, but this performance issue is not caused by WATM.
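
To make the caching point concrete, here is a minimal sketch of how a client-side DNS cache behaves over time. It is a toy model, not how any particular operating system or browser implements its cache. The class name TtlDnsCache, the host name myapp.trafficmanager.net, and the address 203.0.113.10 are all made up for illustration; the 300 second TTL is the WATM default mentioned above.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


class TtlDnsCache:
    """Toy client-side DNS cache: re-resolve a name only after its TTL expires."""

    def __init__(self, resolve: Callable[[str], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve    # function that performs the real lookup
        self._ttl = ttl_seconds    # 300 s is the WATM default TTL
        self._clock = clock        # injectable so time can be simulated
        self._entries: Dict[str, Tuple[str, float]] = {}
        self.lookups = 0           # number of real (uncached) lookups performed

    def get(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]       # cache hit: no DNS lookup is made
        self.lookups += 1          # cache miss or expired entry: pay for a lookup
        address = self._resolve(name)
        self._entries[name] = (address, now + self._ttl)
        return address


def simulate(request_times: Iterable[float], ttl_seconds: float = 300.0) -> int:
    """Return how many real lookups a client makes for requests at the given times."""
    current = [0.0]  # simulated clock, in seconds
    cache = TtlDnsCache(lambda name: "203.0.113.10",  # hypothetical address
                        ttl_seconds, clock=lambda: current[0])
    for t in request_times:
        current[0] = t
        cache.get("myapp.trafficmanager.net")  # hypothetical WATM DNS name
    return cache.lookups
```

In this model, a client that sends a request every 10 seconds for 10 minutes (simulate(range(0, 600, 10))) makes 60 requests but only 2 real DNS lookups, one at the start and one when the 300 second TTL expires.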

 

Testing WATM Performance

There are a few publicly available websites that you can use to determine your WATM performance and behavior. These sites are useful to determine the DNS latency and which of your hosted services your users around the world are being directed to. Keep in mind that most of these tools do not cache the DNS results so running the tests multiple times will show the full DNS lookup, whereas clients connecting to your WATM endpoint will only see the full DNS lookup performance impact once during the TTL duration.
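
If you would rather take a rough measurement from your own machine before trying the sites below, something like the following sketch can work. It times the standard library call socket.getaddrinfo, so the result includes whatever caching your operating system and configured DNS servers do; that means repeated runs will usually look faster than the first one. The function names are my own, and the resolver parameter exists only so that the timing logic can be exercised without a network.

```python
import socket
import time
from typing import Callable, List


def time_dns_lookup(hostname: str,
                    resolver: Callable[..., object] = socket.getaddrinfo) -> float:
    """Return the wall-clock time, in milliseconds, of one name resolution."""
    start = time.perf_counter()
    resolver(hostname, 80)  # port 80 is arbitrary; only the name lookup matters here
    return (time.perf_counter() - start) * 1000.0


def sample_lookups(hostname: str, count: int = 3,
                   resolver: Callable[..., object] = socket.getaddrinfo) -> List[float]:
    """Time several consecutive lookups of the same name."""
    return [time_dns_lookup(hostname, resolver) for _ in range(count)]
```

Calling sample_lookups with your own WATM DNS name returns a list of timings in milliseconds. A name that does not resolve raises socket.gaierror, so wrap the call in a try/except if you are testing names that may not exist.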

https://www.websitepulse.com/help/tools.php

One of the simplest tools is WebSitePulse. Enter the URL and you will see statistics such as DNS resolution time, First Byte, Last Byte, and other performance statistics. You can choose from three different locations to test your site from. In this example, the first execution shows a DNS lookup time of 0.204 sec. The second time we run this test on the same WATM endpoint, the DNS lookup takes 0.002 sec since the results are already cached.

image

image

 

https://www.watchmouse.com/en/checkit.php

Another really useful tool to get DNS resolution time from multiple geographic regions simultaneously is Watchmouse’s Check Website tool. Enter the URL and you will see DNS resolution time, connection time, and speed from several geo locations. This is also handy to test the WATM Performance policy to see which hosted service your different users around the world are being sent to.

image

 

https://tools.pingdom.com/ – This will test a website and provide performance statistics for each element on the page on a visual graph. If you switch to the Page Analysis tab you can see the percentage of time spent doing DNS lookup.

 

https://www.whatsmydns.net/ – This site will do a DNS lookup from 20 different geo locations and display the results on a map. This is a great visual representation to help determine which hosted service your clients will connect to.

 

https://www.digwebinterface.com – Similar to the watchmouse site, but this one shows more detailed DNS information including CNAMEs and A records. Make sure you check the ‘Colorize output’ and ‘Stats’ under options, and select ‘All’ under Nameservers.

 

Summary

Given the above information we know that the only performance impact that WATM will have on a website is the first DNS lookup (times vary, but average ~50 ms), followed by zero performance impact for the duration of the DNS TTL (300 seconds by default), and then another lookup to refresh the DNS cache after the TTL expires. So the answer to the question “How much latency will WATM add to my website?” is, essentially, zero.
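
As a back-of-the-envelope check on that answer, the sketch below spreads the cost of a single lookup across all the requests a client makes within one TTL window. The helper name is made up, and the ~50 ms figure and the request counts are illustrative numbers only, not measurements.

```python
def amortized_dns_overhead_ms(lookup_ms: float, requests_per_ttl: int) -> float:
    """Average DNS cost per request when one lookup serves every request in a TTL window."""
    if requests_per_ttl < 1:
        raise ValueError("requests_per_ttl must be at least 1")
    return lookup_ms / requests_per_ttl
```

With a 50 ms lookup, a client that makes a single request in the 300 second window pays the full 50 ms on that request, while a client that makes 20 requests in the same window averages 2.5 ms per request.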

Comments

  • Anonymous
    September 23, 2014
    I think there is more to the performance impact to consider. WATM, as part of its monitoring, sends HTTP requests to the endpoints on a 30 second interval, and then several additional requests at a lower interval if it believes the endpoint is unavailable. This adds extra request processing to a website or cloud service, and could add latency for end users of a site, since the site is servicing more requests with WATM in use than it was previously. Granted, a site that has been optimized should be able to handle an extra 2 requests per minute, but given the nature of the application there is a chance that the extra monitoring requests could affect the latency of requests by regular site visitors.