Windows Azure Blob MD5 Overview

Overview

The Windows Azure Blob service provides mechanisms to ensure data integrity at both the application and transport layers. This post details these mechanisms from the service and client perspectives. MD5 checking is optional on both PUT and GET operations; it provides a convenient way to verify data integrity across the network when using HTTP. Because HTTPS already provides transport layer security, additional MD5 checking is redundant when connecting over HTTPS.

To ensure data integrity, the Windows Azure Blob service uses MD5 hashes of the data in several different ways. It is important to understand how these values are calculated, transmitted, stored, and eventually enforced so that you can design your application to use them appropriately.
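As a minimal sketch of how such a value is calculated: the MD5 values exchanged in the headers discussed in this post are the Base64 encoding of the 16-byte MD5 digest of the data. The class and method names below are hypothetical and are not part of the Storage Client Library; only standard .NET cryptography calls are used.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of the Storage Client Library.
public static class ContentMd5Example
{
    // Returns the Base64 encoding of the 16-byte MD5 digest of the payload,
    // which is the form carried in the Content-MD5 header.
    public static string ComputeContentMd5(byte[] payload)
    {
        using (MD5 md5 = MD5.Create())
        {
            return Convert.ToBase64String(md5.ComputeHash(payload));
        }
    }

    public static void Main()
    {
        byte[] payload = Encoding.UTF8.GetBytes("hello world");
        Console.WriteLine(ComputeContentMd5(payload));
    }
}
```

The same string can be supplied as the Content-MD5 or x-ms-blob-content-md5 header value on the operations listed in Table 1.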

Please note that the Windows Azure Blob service provides a durable storage medium and uses its own integrity checking for stored data. The MD5 values exchanged with an application are provided for checking the integrity of data as it is transferred between the application and the service over HTTP. For more information regarding the durability of the storage system, please refer to the Windows Azure Storage Architecture Overview.

The following table shows the Windows Azure Blob service REST APIs and the MD5 checks provided for them:

REST API | Header | Value | Validated By | Notes
--- | --- | --- | --- | ---
Put Blob | x-ms-blob-content-md5 | MD5 value of the blob's bits | Server | Full blob
Put Blob | Content-MD5 | MD5 value of the blob's bits | Server | Full blob. If x-ms-blob-content-md5 is present, Content-MD5 is ignored
Put Block | Content-MD5 | MD5 value of the block's bits | Server | Validated prior to storing the block
Put Page | Content-MD5 | MD5 value of the page's bits | Server | Validated prior to storing the page
Put Block List | x-ms-blob-content-md5 | MD5 value of the blob's bits | Client on subsequent download | Stored as the Content-MD5 blob property to be downloaded with the blob for client-side checks
Set Blob Properties | x-ms-blob-content-md5 | MD5 value of the blob's bits | Client on subsequent download | Sets the blob Content-MD5 property
Get Blob | Content-MD5 | MD5 value of the blob's bits | Client | Returns the Content-MD5 property if one was stored/set with the blob
Get Blob (range) | Content-MD5 | MD5 value of the bits in the blob's range | Client | If the client specifies x-ms-range-get-content-md5: true, the Content-MD5 header is dynamically calculated over the range of bytes requested. This is restricted to range requests <= 4 MB
Get Blob Properties | Content-MD5 | MD5 value of the blob's bits | Client | Returns the Content-MD5 property if one was stored/set with the blob

Table 1 : REST API MD5 Compatibility

Service Perspective

From the Windows Azure Blob service perspective, the only MD5 values that are explicitly calculated and validated on each transaction are the transport layer (HTTP) MD5 values. MD5 checking is optional on both PUT and GET operations, and because HTTPS already provides transport layer security, additional MD5 checking is redundant when using HTTPS. Two separate MD5 values are discussed here, which provide checks at different layers:

  • PUT with Content-MD5: When a Content-MD5 header is specified, the storage service calculates an MD5 of the data sent and compares it with the Content-MD5 value that was also sent. If the two hashes do not match, the operation fails with error code 400 (Bad Request). These values are transmitted via the Content-MD5 HTTP header. This validation is available for PutBlob, PutBlock, and PutPage. Note that when uploading a block, page, or blob, the service returns the Content-MD5 HTTP header in the response, populated with the MD5 it calculated for the data received.
  • PUT with x-ms-blob-content-md5: The application can also set the Content-MD5 property that is stored with a blob. The application passes this in with the x-ms-blob-content-md5 header, and the value is stored and returned as the Content-MD5 header on subsequent GETs for the blob. This can be set when using PutBlob, PutBlockList, or SetBlobProperties. If a user provides this value on upload, all subsequent GET operations return this header with the client-provided value. The x-ms-blob-content-md5 header was introduced for scenarios where the hash of the blob content needs to be specified but the HTTP request content is not fully indicative of the actual blob data, such as in PutBlockList. In a PutBlockList, the Content-MD5 header provides transactional integrity for the message contents (the block list in the request body), while the x-ms-blob-content-md5 header sets the service-side blob property. To reiterate, if an x-ms-blob-content-md5 header is provided, it supersedes the Content-MD5 header on a PutBlob operation; for a PutBlock or PutPage operation it is ignored.
  • GET: On a subsequent GET operation, the service populates the Content-MD5 HTTP header if a value was previously stored with the blob via PutBlob, PutBlockList, or SetBlobProperties. For range GETs, an optional x-ms-range-get-content-md5 header can be added to the request. When this header is set to true and specified together with the Range header, the service dynamically calculates an MD5 for the range and returns it in the Content-MD5 header, as long as the range is less than or equal to 4 MB in size. If this header is specified without the Range header, the service returns status code 400 (Bad Request). If this header is set to true and the range exceeds 4 MB in size, the service also returns status code 400 (Bad Request).
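The rules in the list above can be modeled locally as a short sketch. This is not the service's implementation; the class, enum, and method names are hypothetical, and the code only mirrors the checks described in the text: a PUT fails when the supplied Content-MD5 does not match the body, and a range GET with x-ms-range-get-content-md5 fails when no range is given or the range exceeds 4 MB.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical local model of the checks described above; not service code.
public static class Md5RuleSketch
{
    // Largest range for which the service calculates a range MD5.
    public const long MaxRangeForMd5 = 4L * 1024 * 1024;

    public enum Outcome { Accepted, BadRequest }

    public static string Md5Base64(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            return Convert.ToBase64String(md5.ComputeHash(data));
        }
    }

    // PUT with Content-MD5: the header is optional, but if present it must
    // match the MD5 calculated over the received body.
    public static Outcome CheckPut(byte[] body, string contentMd5Header)
    {
        if (contentMd5Header == null)
        {
            return Outcome.Accepted;
        }
        return Md5Base64(body) == contentMd5Header ? Outcome.Accepted : Outcome.BadRequest;
    }

    // Range GET: x-ms-range-get-content-md5 requires a Range header and a
    // range of at most 4 MB. A null rangeLength means no Range header was sent.
    public static Outcome CheckRangeGet(bool rangeGetContentMd5, long? rangeLength)
    {
        if (!rangeGetContentMd5)
        {
            return Outcome.Accepted;
        }
        if (rangeLength == null)
        {
            return Outcome.BadRequest;
        }
        return rangeLength.Value <= MaxRangeForMd5 ? Outcome.Accepted : Outcome.BadRequest;
    }

    public static void Main()
    {
        byte[] body = Encoding.UTF8.GetBytes("hello world");
        Console.WriteLine(CheckPut(body, Md5Base64(body)));           // Accepted
        Console.WriteLine(CheckPut(body, Md5Base64(new byte[0])));    // BadRequest
        Console.WriteLine(CheckRangeGet(true, null));                 // BadRequest
        Console.WriteLine(CheckRangeGet(true, MaxRangeForMd5 + 1));   // BadRequest
    }
}
```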

Client Perspective

The previous section discussed how the Windows Azure Blob service can protect data in transit via the Content-MD5 HTTP header or HTTPS. In addition to this, the client can store and manually validate MD5 hashes of the blob data at the application layer. The Windows Azure Storage Client library provides this calculation functionality via the exposed object model and relevant abstractions such as BlobWriteStream.

Storing Application layer MD5 when Uploading Blobs via the Storage Client Library

When utilizing the CloudBlob convenience layer methods, in most cases the library will automatically calculate and transmit the application layer MD5 value. However, there is an exception to this behavior when a call to an upload method results in either of the following:

  • A single PUT operation to the Blob service, which will occur when source data is less than CloudBlobClient.SingleBlobUploadThresholdInBytes.
  • A parallel upload (length > CloudBlobClient.SingleBlobUploadThresholdInBytes and CloudBlobClient.ParallelOperationThreadCount > 1).

In both of the above cases, an MD5 value is not passed in to be checked, so if the client requires data integrity checking in these scenarios it needs to use HTTPS. (HTTPS can be enabled when constructing a CloudStorageAccount via the constructor, or by specifying HTTPS as part of the baseAddress when manually constructing a CloudBlobClient.)

All other blob upload operations from the convenience layer in the SDK send MD5 values that are checked at the blob service.

In addition to the exposed object methods, you can also provide the x-ms-blob-content-md5 header via the Protocol layer on a PutBlob or PutBlockList request.

The table below lists the convenience layer methods used to upload blobs, which of them support sending MD5 checks, and when the MD5 values are sent.

Layer | Method | Notes
--- | --- | ---
Convenience | CloudBlob.OpenWrite | MD5 is sent. Note, this function is not currently supported for PageBlob
Convenience | CloudBlob.UploadByteArray | MD5 is sent if Length is >= CloudBlobClient.SingleBlobUploadThresholdInBytes AND CloudBlobClient.ParallelOperationThreadCount == 1
Convenience | CloudBlob.UploadFile | MD5 is sent if Length is >= CloudBlobClient.SingleBlobUploadThresholdInBytes AND CloudBlobClient.ParallelOperationThreadCount == 1
Convenience | CloudBlob.UploadText | MD5 is sent if Length is >= CloudBlobClient.SingleBlobUploadThresholdInBytes AND CloudBlobClient.ParallelOperationThreadCount == 1
Convenience | CloudBlob.UploadFromStream | MD5 is sent if Length is >= CloudBlobClient.SingleBlobUploadThresholdInBytes AND CloudBlobClient.ParallelOperationThreadCount == 1

Table 2 : Blob upload methods MD5 compatibility
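The condition that Table 2 gives for the Upload* methods can be written as a small predicate. This is a sketch of the documented rule only; the helper is hypothetical and is not an API of the Storage Client Library. An application could use a check like this to decide whether it needs to fall back to HTTPS for a given upload.

```csharp
using System;

// Hypothetical helper expressing the Table 2 condition; not a library API.
public static class UploadMd5Rule
{
    // True when, per Table 2, the Upload* convenience methods send an MD5:
    // the data is at least the single-upload threshold in length AND
    // uploads are not parallelized.
    public static bool IsMd5Sent(
        long length,
        long singleBlobUploadThresholdInBytes,
        int parallelOperationThreadCount)
    {
        return length >= singleBlobUploadThresholdInBytes
            && parallelOperationThreadCount == 1;
    }

    public static void Main()
    {
        long threshold = 4L * 1024 * 1024;

        Console.WriteLine(IsMd5Sent(8L * 1024 * 1024, threshold, 1)); // True
        Console.WriteLine(IsMd5Sent(8L * 1024 * 1024, threshold, 4)); // False (parallel upload)
        Console.WriteLine(IsMd5Sent(1L * 1024 * 1024, threshold, 1)); // False (single PUT)
    }
}
```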

Validating Application Layer MD5 when downloading Blobs via the Storage Client Library

The CloudBlob Download methods do not provide application layer MD5 validation; as such it is up to the application to verify the Content-MD5 returned against the data returned by the service. If the application layer MD5 value was specified on upload the Windows Azure Storage Client Library will populate it in CloudBlob.Properties.ContentMD5 on any download (i.e. DownloadText, DownloadByteArray, DownloadToFile, DownloadToStream, and OpenRead).

The example below shows how a client can validate the blob's MD5 hash once all the data is retrieved.

Example

// Required namespaces
using System;
using System.IO;
using System.Security.Cryptography;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Initialization
string blobName = "md5test" + Guid.NewGuid().ToString();
long blobSize = 8 * 1024 * 1024;

StorageCredentialsAccountAndKey creds = 
        new StorageCredentialsAccountAndKey(AccountName, AccountKey);
CloudStorageAccount account = new CloudStorageAccount(creds, false);
CloudBlobClient bClient = account.CreateCloudBlobClient();

// Set CloudBlobClient.SingleBlobUploadThresholdInBytes, all blobs above this 
// length will be uploaded using blocks
bClient.SingleBlobUploadThresholdInBytes = 4 * 1024 * 1024;

// Per Table 2, the MD5 is only sent when uploads are not parallelized
bClient.ParallelOperationThreadCount = 1;

// Create Blob Container 
CloudBlobContainer container = bClient.GetContainerReference("md5blobcontainer");
Console.WriteLine("Validating the Container");
container.CreateIfNotExist();

// Populate Blob Data
byte[] blobData = new byte[blobSize];
Random rand = new Random();
rand.NextBytes(blobData);

// Upload Blob
CloudBlob blobRef = container.GetBlobReference(blobName);

// Any upload method will work here: byte array, file, text, stream
blobRef.UploadByteArray(blobData);

// Download will re-populate the client MD5 value from the server
byte[] retrievedBuffer = blobRef.DownloadByteArray();

// Calculate the MD5 of the downloaded data and Base64 encode it
string hashVal;
using (MD5 md5Check = MD5.Create())
{
    hashVal = Convert.ToBase64String(md5Check.ComputeHash(retrievedBuffer));
}

// Validate against the Content-MD5 value stored with the blob
if (hashVal != blobRef.Properties.ContentMD5) 
{
     throw new InvalidDataException("MD5 Mismatch, Data is corrupted!");
}

Figure 1: Validating a Blob's MD5 value
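For blobs too large to hold in a single byte array, the same check can be done incrementally as the data is read from a stream. The following is a sketch under that assumption; the class and method names are hypothetical, and the MemoryStream in Main merely stands in for a stream of downloaded blob data and a stored Content-MD5 value.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of the Storage Client Library.
public static class StreamMd5Check
{
    // Reads the stream in chunks and returns the Base64 encoded MD5 of all bytes read.
    public static string ComputeMd5Base64(Stream source, int bufferSize)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed each chunk into the running hash.
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            // Finalize the hash with an empty last block.
            md5.TransformFinalBlock(buffer, 0, 0);
            return Convert.ToBase64String(md5.Hash);
        }
    }

    // Compares the calculated MD5 with the value stored with the blob.
    public static bool Matches(Stream source, string storedContentMd5)
    {
        if (string.IsNullOrEmpty(storedContentMd5))
        {
            throw new InvalidOperationException("No Content-MD5 was stored with the blob.");
        }
        return ComputeMd5Base64(source, 64 * 1024) == storedContentMd5;
    }

    public static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("hello world");
        string stored = ComputeMd5Base64(new MemoryStream(data), 4);
        Console.WriteLine(Matches(new MemoryStream(data), stored)); // True
    }
}
```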

A note about Page Blobs

Page blobs are designed to provide a durable storage medium that can sustain a high rate of I/O. Data can be accessed in 512-byte pages, allowing a high rate of non-contiguous transactions to complete efficiently. If HTTP needs to be used with MD5 checks, the application should pass in the Content-MD5 header on PutPage, and then set x-ms-range-get-content-md5 on each subsequent GetBlob, using ranges less than or equal to 4 MB.
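Because the range MD5 is only calculated for ranges of 4 MB or less, reading a larger region with MD5 checks means issuing several range GETs. The sketch below shows one way to plan those ranges; the class, struct, and method names are hypothetical and the code performs no network calls.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical planner for illustration; not part of the Storage Client Library.
public static class RangePlanner
{
    // Largest range for which x-ms-range-get-content-md5 is honored.
    public const long MaxMd5Range = 4L * 1024 * 1024;

    public struct ByteRange
    {
        public long Offset;
        public long Count;
    }

    // Splits [offset, offset + length) into consecutive ranges of at most 4 MB.
    public static List<ByteRange> Split(long offset, long length)
    {
        if (offset < 0 || length < 0)
        {
            throw new ArgumentOutOfRangeException();
        }

        List<ByteRange> ranges = new List<ByteRange>();
        long current = offset;
        long remaining = length;
        while (remaining > 0)
        {
            long count = Math.Min(remaining, MaxMd5Range);
            ranges.Add(new ByteRange { Offset = current, Count = count });
            current += count;
            remaining -= count;
        }
        return ranges;
    }

    public static void Main()
    {
        // A 10 MB region yields ranges of 4 MB, 4 MB, and 2 MB.
        foreach (ByteRange r in Split(0, 10L * 1024 * 1024))
        {
            Console.WriteLine("offset={0} count={1}", r.Offset, r.Count);
        }
    }
}
```

Each planned range can then be requested separately with x-ms-range-get-content-md5 set to true, and the returned Content-MD5 compared against an MD5 calculated over the bytes received for that range.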

Considerations

Currently the convenience layer of the Windows Azure Storage Client Library does not support passing in MD5 values for PageBlobs, nor returning Content-MD5 when getting PageBlob ranges. As such, if your scenario requires data integrity checking at the transport level, it is recommended that you use HTTPS, or use the Protocol layer and add the Content-MD5 header yourself.

The following example shows how to perform page blob range GETs with the optional x-ms-range-get-content-md5 header via the Protocol layer in order to provide transport layer integrity checking over HTTP.

Example

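// Required namespaces
using System;
using System.IO;
using System.Net;
using System.Security.Cryptography;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure.StorageClient.Protocol;

// Initialization
string blobName = "md5test" + Guid.NewGuid().ToString();
long blobSize = 8 * 1024 * 1024;

// Must be divisible by 512
int writeSize = 1 * 1024 * 1024;

StorageCredentialsAccountAndKey creds = 
    new StorageCredentialsAccountAndKey(AccountName, AccountKey);
CloudStorageAccount account = new CloudStorageAccount(creds, false);
CloudBlobClient bClient = account.CreateCloudBlobClient();
bClient.ParallelOperationThreadCount = 1;

// Create Blob Container 
CloudBlobContainer container = bClient.GetContainerReference("md5blobcontainer");
Console.WriteLine("Validating the Container");
container.CreateIfNotExist();

// Create the page blob
int uploadedBytes = 0;
CloudPageBlob blobRef = container.GetBlobReference(blobName).ToPageBlob;
blobRef.Create(blobSize);

// Populate Blob Data
byte[] blobData = new byte[writeSize];
Random rand = new Random();
rand.NextBytes(blobData);
MemoryStream retStream = new MemoryStream(blobData);

// Upload the same buffer repeatedly until the blob is full
while (uploadedBytes < blobSize)
{
    blobRef.WritePages(retStream, uploadedBytes);
    uploadedBytes += writeSize;
    retStream.Position = 0;
}

// Build a range GET for 3 MB starting at offset 1 MB (range must be <= 4 MB)
HttpWebRequest webRequest = BlobRequest.Get(
                                        blobRef.Uri,        // URI
                                        90,                 // Timeout
                                        null,               // Snapshot (optional)
                                        1024 * 1024,        // Start Offset
                                        3 * 1024 * 1024,    // Count 
                                        null);              // Lease ID (optional)

// Ask the service to calculate an MD5 over the requested range
webRequest.Headers.Add("x-ms-range-get-content-md5", "true");
bClient.Credentials.SignRequest(webRequest);

using (HttpWebResponse resp = (HttpWebResponse)webRequest.GetResponse())
using (Stream body = resp.GetResponseStream())
using (MemoryStream rangeData = new MemoryStream())
using (MD5 md5Check = MD5.Create())
{
    // Stream.CopyTo requires .NET 4 or later
    body.CopyTo(rangeData);

    // Compare the MD5 returned for the range with one calculated locally
    string serviceMd5 = resp.Headers[HttpResponseHeader.ContentMd5];
    string localMd5 = Convert.ToBase64String(md5Check.ComputeHash(rangeData.ToArray()));
    if (serviceMd5 != localMd5)
    {
        throw new InvalidDataException("MD5 Mismatch, Data is corrupted!");
    }
}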

Figure 2: Transport Layer security via optional x-ms-range-get-content-md5 header on a PageBlob

Summary

This article has detailed various strategies for using MD5 values to provide data integrity. As in many cases, the correct solution depends on your specific scenario.

We will be evaluating this topic in future releases of the Windows Azure Storage Client Library as we continue to improve the functionality offered. Please leave comments below if you have questions.

Joe Giardino