Getting the Page Ranges of a Large Page Blob in Segments

One of the blob types supported by Windows Azure Storage is the Page Blob. Page Blobs provide efficient storage of sparse data by physically storing only pages that have been written and not cleared. Each page is 512 bytes in size. The Get Page Ranges REST service call returns a list of all contiguous page ranges that contain valid data. In the Windows Azure Storage Client Library, the method GetPageRanges exposes this functionality.

Get Page Ranges may fail in certain circumstances where the service takes too long to process the request. Like all Blob REST APIs, Get Page Ranges takes a timeout parameter that specifies the time a request is allowed, including the reading/writing over the network. However, the server is allowed a fixed amount of time to process the request and begin sending the response. If this server timeout expires then the request fails, even if the time specified by the API timeout parameter has not elapsed.

In a highly fragmented page blob with a large number of writes, populating the list returned by Get Page Ranges may take longer than the server timeout and hence the request will fail. Therefore, it is recommended that if your application usage pattern has page blobs with a large number of writes and you want to call GetPageRanges, then your application should retrieve a subset of the page ranges at a time.

For example, suppose a 500 GB page blob was populated with 500,000 writes throughout the blob. By default the storage client specifies a timeout of 90 seconds for the Get Page Ranges operation. If Get Page Ranges does not complete within the server timeout interval then the call will fail. This can be solved by fetching the ranges in groups of, say, 50 GB. This splits the work into ten requests. Each of these requests would then individually complete within the server timeout interval, allowing all ranges to be retrieved successfully.

To be certain that the requests complete within the server timeout interval, fetch ranges in segments spanning 150 MB each. This is safe even for maximally fragmented page blobs. If a page blob is less fragmented then larger segments can be used.

Client Library Extension

We present below a simple extension method for the storage client that addresses this issue by providing a rangeSize parameter and splitting the requests into ranges of the given size. The resulting IEnumerable object lazily iterates through page ranges, making service calls as needed.

As a consequence of splitting the request into ranges, any page ranges that span across the rangeSize boundary are split into multiple page ranges in the result. Thus for a range size of 10 GB, the following range spanning 40 GB

[0 – 42949672959]

would be split into four ranges spanning 10 GB each:

[0 – 10737418239]
[10737418240 – 21474836479]
[21474836480 – 32212254719]
[32212254720 – 42949672959].

With a range size of 20 GB the above range would be split into just two ranges.

Note that a custom timeout may be used by specifying a BlobRequestOptions object as a parameter, but the method below does not use any retry policy. The specified timeout is applied to each of the service calls individually. If a service call fails for any reason then GetPageRanges throws an exception.

 namespace Microsoft.WindowsAzure.StorageClient
{
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using Microsoft.WindowsAzure.StorageClient.Protocol;
 
    /// <summary>
    /// Class containing an extension method for the <see cref="CloudPageBlob"/> class.
    /// </summary>
    public static class CloudPageBlobExtensions
    {
        /// <summary>
        /// Enumerates the page ranges of a page blob, sending one service call as needed for each
        /// <paramref name="rangeSize"/> bytes.
        /// </summary>
        /// <param name="pageBlob">The page blob to read.</param>
        /// <param name="rangeSize">The range, in bytes, that each service call will cover. This must be a multiple of
        ///     512 bytes.</param>
        /// <param name="options">The request options, optionally specifying a timeout for the requests.</param>
        /// <returns>An <see cref="IEnumerable"/> object that enumerates the page ranges.</returns>
        public static IEnumerable<PageRange> GetPageRanges(
            this CloudPageBlob pageBlob,
            long rangeSize,
            BlobRequestOptions options)
        {
            int timeout;
 
            if (options == null || !options.Timeout.HasValue)
            {
                timeout = (int)pageBlob.ServiceClient.Timeout.TotalSeconds;
            }
            else
            {
                timeout = (int)options.Timeout.Value.TotalSeconds;
            }
 
            if ((rangeSize % 512) != 0)
            {
                throw new ArgumentOutOfRangeException("rangeSize", "The range size must be a multiple of 512 bytes.");
            }
 
            long startOffset = 0;
            long blobSize;
 
            do
            {
                // Generate a web request for getting page ranges
                HttpWebRequest webRequest = BlobRequest.GetPageRanges(
                    pageBlob.Uri,
                    timeout,
                    pageBlob.SnapshotTime,
                    null /* lease ID */);
 
                // Specify a range of bytes to search
                webRequest.Headers["x-ms-range"] = string.Format(
                    "bytes={0}-{1}",
                    startOffset,
                    startOffset + rangeSize - 1);
 
                // Sign the request
                pageBlob.ServiceClient.Credentials.SignRequest(webRequest);
 
                List<PageRange> pageRanges;
 
                using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                {
                    // Refresh the size of the blob
                    blobSize = long.Parse(webResponse.Headers["x-ms-blob-content-length"]);
 
                    GetPageRangesResponse getPageRangesResponse = BlobResponse.GetPageRanges(webResponse);
 
                    // Materialize response so we can close the webResponse
                    pageRanges = getPageRangesResponse.PageRanges.ToList();
                }
 
                // Lazily return each page range in this result segment.
                foreach (PageRange range in pageRanges)
                {
                    yield return range;
                }
 
                startOffset += rangeSize;
            }
            while (startOffset < blobSize);
        }
    }
}

Usage Examples:

 pageBlob.GetPageRanges(10 * 1024 * 1024 * 1024 /* 10 GB */, null);
pageBlob.GetPageRanges(150 * 1024 * 1024 /* 150 MB */, options /* custom timeout in options */);

Summary

For some fragmented page blobs, the GetPageRanges API call might not complete within the maximum server timeout interval. To solve this, the page ranges can be incrementally fetched for a fraction of the page blob at a time, thus decreasing the time any single service call takes. We present an extension method implementing this technique in the Windows Azure Storage Client Library.

Michael Roberson