Overview of Retry Policies in the Windows Azure Storage Client Library
The RetryPolicies in the Storage Client Library allow the user to customize retry behavior when an exception occurs. There are two key points to consider when using RetryPolicies: when they are evaluated, and what the ideal behavior for your scenario is.
When the Storage Client Library processes an operation which results in an exception, this exception is classified internally as either “retryable” or “non-retryable”.
- “Non-retryable” exceptions are all 400-class exceptions (>= 400 and < 500, e.g. Bad Request, Not Found) as well as 501 and 505.
- All other exceptions are “retryable”. This includes client-side timeouts.
Once an operation is deemed retryable, the Storage Client Library evaluates the RetryPolicy to see if the operation should be retried, and if so, how long it should back off (sleep) before executing the next attempt. One thing to note is that if an operation fails the first two times and succeeds on the third, the client will not see an exception, as all previous exceptions will have been caught. If the operation’s last attempt also results in an exception, then that last caught exception is rethrown to the client.
Also, please note that the specified timeout is applied to each attempt of a transaction; as such, an operation with a timeout of 90 seconds can actually take up to 90 * (N + 1) seconds (plus any backoff time between attempts), where N is the number of retry attempts following the initial attempt.
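To make this flow concrete, the following is a rough conceptual sketch of how a retry policy is consumed. This is not the library’s internal implementation; the ExecuteWithRetries helper is hypothetical, and the retryable/non-retryable classification described above is omitted for brevity.

public static TResult ExecuteWithRetries<TResult>(Func<TResult> attempt, RetryPolicy policy)
{
    // Evaluate the policy once per operation to obtain a ShouldRetry delegate
    ShouldRetry shouldRetry = policy();
    int currentRetryCount = 0;

    while (true)
    {
        try
        {
            // Each attempt is subject to its own timeout
            return attempt();
        }
        catch (Exception e)
        {
            TimeSpan backoff;

            // If the policy says stop, the last caught exception surfaces to the caller
            if (!shouldRetry(currentRetryCount++, e, out backoff))
            {
                throw;
            }

            // Sleep for the backoff interval before the next attempt
            Thread.Sleep(backoff);
        }
    }
}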
Standard Retry Policies
There are three standard RetryPolicies that ship with the Storage Client Library, listed below. See https://msdn.microsoft.com/en-us/library/microsoft.windowsazure.storageclient.retrypolicies_members.aspx for full documentation.
- RetryPolicies.NoRetry – No retry is used
- RetryPolicies.Retry – Retries N number of times with the same backoff between each attempt.
- RetryPolicies.RetryExponential (Default) – Retries N number of times with an exponentially increasing backoff between each attempt. Backoffs are randomized with a +/- 20% delta to avoid numerous clients all retrying simultaneously. Additionally, each individual backoff is clamped between 3 and 90 seconds per attempt (RetryPolicies.DefaultMinBackoff and RetryPolicies.DefaultMaxBackoff, respectively); as such, the total time an operation spends backing off across its retries can exceed RetryPolicies.DefaultMaxBackoff. For example, suppose you are on a slow edge connection and keep hitting a timeout error. The first retry will occur ~3 seconds after the first failed attempt, the second ~30 seconds after the first retry, and the third roughly 90 seconds after that.
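The following sketch shows how you might configure these standard policies on a client. The use of the development storage account is purely for illustration; any CloudStorageAccount will do.

CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
CloudBlobClient blobClient = account.CreateCloudBlobClient();

// Disable retries entirely
blobClient.RetryPolicy = RetryPolicies.NoRetry();

// Retry up to 4 times, waiting 2 seconds between each attempt
blobClient.RetryPolicy = RetryPolicies.Retry(4, TimeSpan.FromSeconds(2));

// Exponential backoff (the default) using the library's default settings
blobClient.RetryPolicy = RetryPolicies.RetryExponential(RetryPolicies.DefaultClientRetryCount, RetryPolicies.DefaultClientBackoff);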
Creating a custom retry policy
In addition to using the standard retry policies detailed above, you can construct a custom retry policy to fit your specific scenario. Good examples are retrying only on specific exceptions or results, or providing an alternate backoff algorithm.
The RetryPolicy is actually a delegate that, when evaluated, returns a Microsoft.WindowsAzure.StorageClient.ShouldRetry delegate. This syntax may be a bit unfamiliar to some users; however, it provides a lightweight mechanism to construct stateful retry instances in a controlled manner. When each operation begins, it evaluates the RetryPolicy, which causes the CLR to create a state object behind the scenes containing the parameters used to configure the policy.
Example 1: Simple linear retry policy
public static RetryPolicy LinearRetry(int retryCount, TimeSpan intervalBetweenRetries)
{
    return () =>
    {
        return (int currentRetryCount, Exception lastException, out TimeSpan retryInterval) =>
        {
            // Do custom work here

            // Set backoff
            retryInterval = intervalBetweenRetries;

            // Decide if we should retry, return bool
            return currentRetryCount < retryCount;
        };
    };
}
The outer lambda returned by the method conforms to the Microsoft.WindowsAzure.StorageClient.RetryPolicy delegate type; that is, a function that accepts no parameters and returns a Microsoft.WindowsAzure.StorageClient.ShouldRetry delegate.
The inner lambda conforms to the signature of the Microsoft.WindowsAzure.StorageClient.ShouldRetry delegate and will contain the specifics of your implementation.
Once you have constructed a retry policy as above, you can configure your client to use it via Cloud[Table|Blob|Queue]Client.RetryPolicy = LinearRetry(retryCount, intervalBetweenRetries).
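For instance, assuming a CloudBlobClient named blobClient has already been created, configuring it to retry 3 times with 5 seconds between attempts would look like:

blobClient.RetryPolicy = LinearRetry(3, TimeSpan.FromSeconds(5));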
Example 2: Complex retry policy which examines the last exception and does not retry on specified status codes (e.g. 502)
public static RetryPolicy CustomRetryPolicy(int retryCount, TimeSpan intervalBetweenRetries, List<HttpStatusCode> statusCodesToFail)
{
    return () =>
    {
        return (int currentRetryCount, Exception lastException, out TimeSpan retryInterval) =>
        {
            retryInterval = intervalBetweenRetries;

            if (currentRetryCount >= retryCount)
            {
                // Retries exhausted, return false
                return false;
            }

            WebException we = lastException as WebException;
            if (we != null)
            {
                HttpWebResponse response = we.Response as HttpWebResponse;
                if (response != null && statusCodesToFail.Contains(response.StatusCode))
                {
                    // Found a status code to fail on, return false
                    return false;
                }
            }

            return currentRetryCount < retryCount;
        };
    };
}
Note the additional argument statusCodesToFail, which illustrates that you can pass in whatever additional data the retry policy may require.
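For instance, again assuming an existing CloudBlobClient named blobClient, the following configures it to retry up to 3 times, 5 seconds apart, but to give up immediately on a 502 (Bad Gateway) response:

blobClient.RetryPolicy = CustomRetryPolicy(3, TimeSpan.FromSeconds(5), new List<HttpStatusCode> { HttpStatusCode.BadGateway });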
Example 3: A custom Exponential backoff retry policy
public static RetryPolicy RetryExponential(int retryCount, TimeSpan minBackoff, TimeSpan maxBackoff, TimeSpan deltaBackoff)
{
    // Do any argument pre-validation here, i.e. enforce max retry count etc.
    return () =>
    {
        return (int currentRetryCount, Exception lastException, out TimeSpan retryInterval) =>
        {
            if (currentRetryCount < retryCount)
            {
                Random r = new Random();

                // Calculate exponential backoff with +/- 20% tolerance
                int increment = (int)((Math.Pow(2, currentRetryCount) - 1) * r.Next((int)(deltaBackoff.TotalMilliseconds * 0.8), (int)(deltaBackoff.TotalMilliseconds * 1.2)));

                // Enforce backoff boundaries
                int timeToSleepMsec = (int)Math.Min(minBackoff.TotalMilliseconds + increment, maxBackoff.TotalMilliseconds);

                retryInterval = TimeSpan.FromMilliseconds(timeToSleepMsec);
                return true;
            }

            retryInterval = TimeSpan.Zero;
            return false;
        };
    };
}
In Example 3 above we see code similar to the exponential retry policy that is used by default by the Windows Azure Storage Client Library. Note the parameters minBackoff and maxBackoff: the policy calculates a desired backoff and then enforces the min/max boundaries on it. For example, the default minimum and maximum backoffs are 3 and 90 seconds respectively, which means that regardless of the deltaBackoff or the calculated increment, the policy will only yield a backoff time between 3 and 90 seconds.
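The following worked sketch (randomization omitted) shows the nominal backoff per retry produced by the calculation above, using a minBackoff of 3 seconds, a maxBackoff of 90 seconds, and, purely for illustration, a deltaBackoff of 30 seconds, which matches the timings in the standard policy example earlier.

// backoff(n) = min(minBackoff + (2^n - 1) * deltaBackoff, maxBackoff)
for (int n = 0; n < 4; n++)
{
    double backoffSec = Math.Min(3 + (Math.Pow(2, n) - 1) * 30, 90);

    // n = 0: 3s, n = 1: 33s, n = 2: 93s -> clamped to 90s, n = 3: 90s
    Console.WriteLine("Retry {0}: ~{1} seconds", n, backoffSec);
}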
Summary
We strongly recommend using the exponential backoff retry policy provided by default whenever possible in order to gracefully back off the load on your account, especially if throttling occurs due to exceeding the posted scalability targets. You can set this manually via [Client].RetryPolicy = RetryPolicies.RetryExponential(RetryPolicies.DefaultClientRetryCount, RetryPolicies.DefaultClientBackoff).
Generally speaking, a high-throughput application that makes simultaneous requests and can absorb infrequent delays without adversely impacting the user experience should use the exponential backoff strategy detailed above. However, for user-facing scenarios such as websites and UI, you may wish to use a linear backoff in order to maintain a responsive user experience.
Joe Giardino
Comments
Anonymous
October 10, 2014
For the benefit of anyone reading this today, in example 3 it's probably not a good idea to create the Random instance inside the delegate (see stackoverflow.com/.../67824). One possible solution is outlined here: stackoverflow.com/.../67824
Anonymous
October 12, 2014
Creating a Random instance inside the delegate affects requests that fail at the exact same time, because the seed for Random ends up being the same. So if N requests fail at the exact same time (within 10-16 ms precision), all N of them will retry after the same duration. N is a low number in common scenarios. Optimizing for such N requests can add extra cost and complexity (e.g. a global lock, or a Random instance per thread, etc.), and users need to evaluate whether it's worth that extra cost before adding additional logic.