Windows Azure Load Balancer TimeOut for Cloud Service Roles (PAAS Web/Worker)

My name is Angshuman Nayak and I am part of the Cloud Integration Engineering team at Microsoft. While researching a connectivity issue between an Azure Cloud Service (PAAS) and an Azure Virtual Machine (IAAS), I found the specific timeout values and the associated symptoms (actually non-symptoms, since the packet drops are silent). Since many developers and IT professionals using Windows Azure for their services might be interested in the Azure load balancer timeout value for TCP connections, I thought of sharing my findings.

As with any TCP connectivity issue, this finding is based on a network trace. I used Netmon to collect the traces, but you are free to use any network tracing utility, such as Wireshark.

 

Worker Role to IAAS VM

The Worker Role connects to the IAAS VM by its FQDN, which resolves to the public IP (VIP), so all calls from the Worker Role to the IAAS VM have to go through the datacenter load balancer. In the trace, the Worker Role instance sends a query after around 5 minutes, but the response never arrives. Like any Windows OS, the Worker Role then re-sends the packet 5 more times, doubling the interval between attempts each time (this is the default and can be changed through registry settings). The response still never comes, so the Worker Role sends a Reset packet after 10 seconds.
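The doubling retransmission pattern described above can be sketched as a small calculation. This is only an illustration: the function name is hypothetical, and the initial interval of 300 ms is an assumed value, since the real one depends on the measured round-trip time and OS settings and is not stated in the trace description.

```python
def retransmit_schedule(initial_ms, retries):
    """Return the wait (in ms) before each retransmission, doubling each time."""
    waits = []
    wait = initial_ms
    for _ in range(retries):
        waits.append(wait)
        wait *= 2
    return waits


# initial_ms=300 is an assumption for illustration, not a value from the trace.
schedule = retransmit_schedule(initial_ms=300, retries=5)
print(schedule)       # [300, 600, 1200, 2400, 4800]
print(sum(schedule))  # 9300
```

With a different starting interval, the total time spent retransmitting scales proportionally.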

The reason for this is that the Windows Azure datacenter load balancer silently tears down any connection that has been idle for 4 minutes. All subsequent packets it receives on that connection are silently dropped. At the time of writing, this 4-minute idle timeout is not configurable. The timeout exists because every connection that stays open consumes memory in the load balancer device. A longer timeout means that more memory is consumed, which makes it a potential Denial-of-Service attack vector.

The connection can be kept alive by the application, using the techniques listed below. Alternatively, the PAAS and IAAS roles can be made part of the same virtual network, in which case calls from the PAAS role to the IAAS machine do not go through the load balancer and the timeout does not apply.

 

[1] - Make sure the TCP connection is not idle. To keep your TCP connection active, keep sending some data before 240 seconds have passed. This could be done via chunked transfer encoding, or you can simply send blank lines to keep the connection active.

[2] - If you are using a WCF-based application, please have a look at the link below:

             Reference: https://code.msdn.microsoft.com/WCF-Azure-NetTCP-Keep-Alive-09f50fd9

[3] - If you are working at the TCP level, you can enable TCP Keep-Alive. In .NET, ServicePointManager.SetTcpKeepAlive(true, 200000, 200000) does this for connections made through WebRequest objects. Both values are in milliseconds, so the first probe is sent after 200 seconds of inactivity, inside the 240-second window. TCP Keep-Alive packets will keep the connection from your client to the load balancer open during a long-running HTTP request.

Reference - https://msdn.microsoft.com/en-us/library/system.net.servicepointmanager.settcpkeepalive.aspx
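For code that works with raw sockets rather than .NET's ServicePointManager, the idea in [3] can be sketched in Python. This is a minimal sketch, not the article's method: the helper name `enable_keepalive` is hypothetical, the 200-second values simply mirror the 200000 ms arguments above, and per-socket tuning is only attempted on platforms where Python exposes it (the Windows ioctl and the Linux-style socket options).

```python
import socket
import sys

IDLE_SECONDS = 200      # first probe after 200 s idle, below the 240 s limit
INTERVAL_SECONDS = 200  # mirrors the second 200000 ms argument from the text


def enable_keepalive(sock, idle=IDLE_SECONDS, interval=INTERVAL_SECONDS):
    """Turn on TCP keep-alive and, where the platform allows, tune its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if sys.platform == "win32":
        # Windows takes (on/off, idle ms, interval ms) through an ioctl.
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle * 1000, interval * 1000))
    elif hasattr(socket, "TCP_KEEPIDLE"):
        # Linux exposes the timings as per-socket options, in seconds.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Other platforms keep their system-wide keep-alive timing.


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

Call `enable_keepalive` before or right after connecting. On platforms that fall through to the system defaults, check that the system-wide idle time is also below 240 seconds.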

This applies equally when an on-premises client connects over TCP to a service running on Windows Azure, since that connection also has to go through the network load balancer.
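Whether the client is a Worker Role or an on-premises machine, option [1] above comes down to sending something before the 240 seconds run out. The following is a rough sketch under stated assumptions: `HeartbeatTimer` and `send_with_heartbeat` are hypothetical names, the 200-second interval is an assumed safety margin, and the blank-line heartbeat only makes sense if the protocol on the other end tolerates blank lines.

```python
import time

IDLE_LIMIT_SECONDS = 240  # load balancer idle timeout from the article
HEARTBEAT_SECONDS = 200   # assumed safety margin below the limit


class HeartbeatTimer:
    """Tracks the last send time and reports when a heartbeat is due."""

    def __init__(self, interval=HEARTBEAT_SECONDS, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_sent = clock()

    def mark_sent(self):
        self.last_sent = self.clock()

    def due(self):
        return self.clock() - self.last_sent >= self.interval


def send_with_heartbeat(sock, timer, payload=None):
    """Send payload if given; otherwise send a blank line once the timer is due."""
    if payload is not None:
        sock.sendall(payload)
        timer.mark_sent()
    elif timer.due():
        # Assumption: the peer ignores blank lines between messages.
        sock.sendall(b"\r\n")
        timer.mark_sent()
```

An application would call `send_with_heartbeat(sock, timer)` periodically from its main loop, and pass `payload` whenever it has real data, so that real traffic also resets the timer.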

Comments

  • Anonymous
    April 22, 2014
    This behavior seem incredibly broken! We go through the code on the client and server, configuring all the myriad timeouts properly to get the behavior we need. Then this invisible layer in the middle "silently drops packets". This is the worst response. Our client code sits waiting forever on data that will never come. The server reports that it has 100% responded. The data just disappears in the middle. If you are going to enforce a timeout policy, why not send a reset packet or something that will affirmatively notify the client that there is a problem?
  • Anonymous
    October 02, 2014
    Amazing, a timeout for a HA infra !! Back to physical server, MS Azure was a real nightmare ;-( Thomas Decaux, eBuildy CTO
  • Anonymous
    October 12, 2016
    IdleTimeOut is configurable. Please see the below post dated Aug 2014. https://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/