Windows Azure Storage Client for Java Blob Features

We have released the Storage Client for Java with support for Windows Azure Blobs, Queues, and Tables. Our goal is to continue to improve the development experience when writing cloud applications using Windows Azure Storage. As such, we have incorporated feedback from customers and forums for the current .NET libraries to help create a more seamless API that is both powerful and simple to use. This blog post serves as an overview of a few new features for Blobs that are currently unique to the Storage Client for Java, which are designed to address common scenarios when working with Cloud workloads.

MD5

One of the key pieces of feedback we get is to make working with MD5 easier and more seamless. For java we have simplified this scenario to provide consistent behavior and simple configuration.

There are two different ways to use Content-MD5s in the Blob service: a transactional MD5 is used to provide data integrity during transport of blocks or pages of a blob, which is not stored with the blob, and an MD5 of the entire blob, which is stored with the blob and returned on subsequent GET operations (see the blog post here for information on what the server provides).

To make this easy, we have designed high level controls for common cross cutting scenarios that will be respected by every API. For example, no matter which API a user chooses to upload a blob (page or block) the MD5 settings will be honored. Additionally we have decoupled transactional MD5 (which is useful to ensure transport integrity of individual blocks and pages) and Blob Level MD5 which sets the MD5 value on the entire blob, which is then returned on subsequent GETs. 

The following example illustrates how to use BlobRequestOptions to utilize transactional content md5 to ensure that uploads and downloads are validated correctly. Note: transactional MD5 is not needed when using HTTPS as HTTPS provides its own integrity mechanism. Both the transactional MD5 and the full blob level MD5 are set to false (turned off) by default.  The following shows how to turn both of them on.

 // Define BlobRequestOptions to use transactional MD5
BlobRequestOptions options = new BlobRequestOptions();
options.setUseTransactionalContentMD5(true);
options.setStoreBlobContentMD5 (true); // Set full blob level MD5


blob.upload(sourceStream blobLength,
            null /* AccessCondition */,
            options,
            null /* OperationContext */);

blobRef.download(outStream,
            null /* AccessCondition */,
            options,
            null /* OperationContext */);

 

Sparse Page Blob

The most common use for page blobs among cloud applications is to back a VHD (Virtual Hard Drive) image.  When a page blob is first created it exists as a range of zero filled bytes. The Windows Azure Blob service provides the ability to write in increments of 512 byte pages and keep track of which pages have been written to. As such it is possible for a client to know which pages still contain zero filled data and which ones contain valid data. 

We are introducing a new feature in this release of the Storage Client for Java which can omit 512 byte aligned ranges of zeros when uploading a page blob, and subsequently intelligently download only the non-zero data. During a download when the library detects that the current data being read exists in a zero’d region the client simply generates these zero’d bytes without making additional requests to the server. Once the read continues on into a valid range of bytes the library resumes making requests to the server for the non-zero’d data.

The following example illustrates how to use BlobRequestOptions to use the sparse page blob feature.

 // Define BlobRequestOptions to use sparse page blob
BlobRequestOptions options = new BlobRequestOptions();
options.setUseSparsePageBlob(true);
blob.create(length);

// Alternatively could use blob.openOutputStream
blob.upload(sourceStream blobLength,
            null /* AccessCondition */,
            options,
            null /* OperationContext */);

// Alternatively could use blob.openInputStream
blobRef.download(outStream,
            null /* AccessCondition */,
            options,
            null /* OperationContext */);

Please note this optimization works in chunked read and commit sizes (configurable via CloudBlobClient. setStreamMinimumReadSizeInBytes and CloudBlobClient.setPageBlobStreamWriteSizeInBytes  respectively).  If a given read or commit consists entirely of zeros then the operation is skipped altogether.  Alternatively if a given read or commit chunk consists of only a subset of non-zero data then it is possible that the library will “shrink” the read or commit chunk by ignoring any beginning or ending pages which consist entirely of zeros. This allows us to optimize both cost (fewer transactions) and speed (less data) in a predictable manner. 

Download Resume

Another new feature to this release of the Storage Client for Java is the ability for full downloads to resume themselves in the event of a disconnect or exception.  The most cost efficient way for a client to download a given blob is in a single REST GET call.  However if you are downloading a large blob, say of several GB, an issue arises on how to handle disconnects and errors without having to pre-buffer data or re-download the entire blob.  

To solve this issue, the blob download functionality will now check the retry policy specified by the user and determine if the user desires to retry the operation. If the operation should not be retried it will simply throw as expected, however if the retry policy indicates the operation should be retried then the download will revert to using a BlobInputStream positioned to the current location of the download with an ETag check. This allows the user to simply “resume” the download in a performant and fault-tolerant way. This feature is enabled for all downloads via the CloudBlob.download method.

Best Practices

We’d also like to share some best practices for using blobs with the Storage Client for Java:

  • Always provide the length of the data being uploaded if it is available; alternatively a user may specify -1 if the length is not known. This is needed for authentication. Uploads that specify -1 will cause the Storage Client to pre-read the data to determine its length, (and potentially to calculate md5 if enabled). If the InputStream provided is not markable BlobOutputStream is used.
  • Use markable streams (i.e. BufferedInputStream) when uploading Blobs. In order to support retries and to avoid having to prebuffer data in memory a stream must be markable so that it can be rewound in the case of an exception to retry the operation. When the stream provided does not support mark the Storage Client will use a BlobOutputStream which will internally buffer individual blocks until they are commited. Note: uploads that are over CloudBlobClient.getSingleBlobPutThresholdInBytes() (Default is 32 MB, but can be set up to 64MB) will also be uploaded using the BlobOutputStream.
  • If you already have the MD5 for a given blob you can set it directly via CloudBlob.getProperties().setContentMd5 and it will be sent on a subsequent Blob upload or by calling CloudBlob.uploadProperties(). This can potentially increase performance by avoiding a duplicate calculation of MD5.
  • Please note MD5 is disabled by default, see the MD5 section above regarding how to utilize MD5.
  • BlobOutputStreams commit size is configurable via CloudBlobClient.setWriteBlockSizeInBytes() for BlockBlobs and CloudBlobClient.setPageBlobStreamWriteSizeInBytes() for Page Blobs
  • BlobInputStreams minimum read size is configurable via CloudBlobClient.setStreamMinimumReadSizeInBytes()
  • For lower latency uploads BlobOutputStream can execute multiple parallel requests. The concurrent request count is default to 1 (no concurrency) and is configurable via CloudBlobClient.setConcurrentRequestCount(). BlobOutputStream is accessible via Cloud[Block|Page]Blob.openOutputStream or by uploading a stream that is greater than CloudBlobClient.getSingleBlobPutThresholdInBytes() for BlockBlob or 4 MB for PageBlob.

Summary

This post has covered a few interesting features in the recently released Windows Azure Storage Client for Java. We very much appreciate all the feedback we have gotten from customers and through the forums, please keep it coming. Feel free to leave comments below,

Joe Giardino

Developer

Windows Azure Storage

Resources

Get the Windows Azure SDK for Java

Learn more about the Windows Azure Storage Client for Java

Learn more about Windows Azure Storage