Windows Azure Storage Client Library: Rewinding stream position less than BlobStream.ReadAheadSize can result in lost bytes from BlobStream.Read()

Update 3/09/2011: The bug is fixed in the Windows Azure SDK March 2011 release.

In the current Windows Azure storage client library, BlobStream.Read() may read fewer bytes than requested if the user rewinds the stream position. This occurs when Seek moves the stream to a position that is BlobStream.ReadAheadSize bytes or fewer before the previous read’s start position. Furthermore, in this case, if BlobStream.Read() is called again to read the remaining bytes, data from an incorrect position will be read into the buffer.

What does the ReadAheadSize property do?

BlobStream.ReadAheadSize defines how many extra bytes to prefetch in a Get Blob request when BlobStream.Read() is called. This design is supposed to ensure that the storage client library does not need to send another request to the blob service if BlobStream.Read() is called again to read N bytes from the current position, where N < BlobStream.ReadAheadSize. It is an optimization for reading blobs in the forward direction, which reduces the number of Get Blob requests sent to the blob service when reading a blob.
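The prefetch arithmetic can be sketched as follows. This is a minimal illustration of the range calculation implied by the traces in this post, not the storage client library's actual code; the class and method names are hypothetical, and clamping at the end of the blob is ignored.

```csharp
using System;

// Hypothetical helper: models the range arithmetic implied by the traces in
// this post. It is NOT the storage client library's implementation.
public static class ReadAheadSketch
{
    // Returns the x-ms-range value for a read of `count` bytes at `offset`
    // when `readAheadSize` extra bytes are prefetched.
    // Assumes count + readAheadSize > 0 and ignores the end of the blob.
    public static string RequestedRange(long offset, int count, long readAheadSize)
    {
        long totalBytes = count + readAheadSize;    // requested bytes plus prefetch
        long lastByte = offset + totalBytes - 1;    // HTTP byte ranges are inclusive
        return "bytes=" + offset + "-" + lastByte;
    }
}
```

For example, ReadAheadSketch.RequestedRange(100, 16, 16) yields "bytes=100-131": a 16-byte read at offset 100 with a ReadAheadSize of 16 asks for 32 bytes.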

This bug impacts users only if their scenario involves rewinding the stream before reading, i.e. using Seek to move to a position that is at most BlobStream.ReadAheadSize bytes before the previous read’s start offset.

The root cause of this issue is that the storage client library incorrectly calculates the number of bytes to read when the stream position is rewound by N bytes using Seek, where N <= BlobStream.ReadAheadSize, measured from the previous read’s start offset. (Note: if the stream is rewound by more than BlobStream.ReadAheadSize bytes from the previous start offset, stream reads work as expected.)
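The affected window can be expressed as a small predicate. This is a sketch under the assumption that the boundary is inclusive, as the root-cause description states (N <= ReadAheadSize); the class and method names are hypothetical and are not part of the storage client library.

```csharp
using System;

// Hypothetical helper, not part of the storage client library.
public static class RewindCheck
{
    // True when seeking from previousStart back to newPosition falls in the
    // window described above: a rewind of 1..readAheadSize bytes (inclusive,
    // per the root-cause description). Forward seeks and longer rewinds
    // return false.
    public static bool IsAffectedRewind(long previousStart, long newPosition, long readAheadSize)
    {
        long rewound = previousStart - newPosition;
        return rewound > 0 && rewound <= readAheadSize;
    }
}
```

With a ReadAheadSize of 16 and a previous read starting at offset 100, a seek to offset 90 is affected (rewound by 10), while a seek to offset 80 is not (rewound by 20).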

To understand this issue better, let us walk through an example of user code that exhibits this bug.

We begin by getting a BlobStream that we can use to read the blob, which is 16 MB in size. We set ReadAheadSize to 16 bytes, then seek to offset 100 and read 16 bytes of data:

// Open a read stream on the blob and prefetch 16 extra bytes per request.
BlobStream stream = blob.OpenRead();
stream.ReadAheadSize = 16;
int bufferSize = 16;
int readCount;
byte[] buffer1 = new byte[bufferSize];
// Seek to offset 100 and read 16 bytes.
stream.Seek(100, System.IO.SeekOrigin.Begin);
readCount = stream.Read(buffer1, 0, bufferSize);

BlobStream.Read() works as expected here: buffer1 is filled with 16 bytes of blob data starting at offset 100. Because ReadAheadSize is set to 16, the storage client requests 32 bytes of data, as shown by the “x-ms-range” header set to 100-131 in the request trace. The response returns those 32 bytes, as the Content-Length header shows:

Request and response trace:

Request header:
GET https://foo.blob.core.windows.net/test/blob?timeout=90 HTTP/1.1
x-ms-version: 2009-09-19
User-Agent: WA-Storage/0.0.0.0
x-ms-range: bytes=100-131

Response header:
HTTP/1.1 206 Partial Content
Content-Length: 32
Content-Range: bytes 100-131/16777216
Content-Type: application/octet-stream

We now rewind the stream to 10 bytes before the previous read’s start offset (the previous start offset was 100, so the new offset is 90). Because 10 is less than ReadAheadSize (16), this seek triggers the problem (had we rewound by more than ReadAheadSize bytes from offset 100, everything would work as expected). We then issue a Read for 16 bytes starting at offset 90.

// Rewind to offset 90 (10 bytes before the previous start) and read 16 bytes.
byte[] buffer2 = new byte[bufferSize];
stream.Seek(90, System.IO.SeekOrigin.Begin);
readCount = stream.Read(buffer2, 0, bufferSize);

BlobStream.Read() does not work as expected here. It is called to read 16 bytes of blob data from offset 90 into buffer2, but only 9 bytes are downloaded because the storage client miscalculates the size it needs to read. In the trace below, x-ms-range covers only 9 bytes (90-98 rather than 90-105) and the Content-Length in the response is 9.

Request and response trace:

Request header:
GET https://foo.blob.core.windows.net/test/blob?timeout=90 HTTP/1.1
x-ms-version: 2009-09-19
User-Agent: WA-Storage/0.0.0.0
x-ms-range: bytes=90-98

Response header:
HTTP/1.1 206 Partial Content
Content-Length: 9
Content-Range: bytes 90-98/16777216
Content-Type: application/octet-stream

Since the previous request for 16 bytes returned only 9 bytes, the client issues another Read to fetch the remaining 7 bytes:

readCount = stream.Read(buffer2, readCount, bufferSize - readCount);

BlobStream.Read() still does not work as expected. It is called to read the remaining 7 bytes into buffer2, but the whole blob is downloaded instead. Due to the bug in the storage client, the request carries an invalid range, 99-98, which causes the Windows Azure Blob service to return the entire blob, as seen in the response trace below. Since the client does not check the returned range and expects the data to start at offset 99, it copies the first 7 bytes of the blob into the buffer, which is incorrect.

Request and response trace:

Request header:
GET https://foo.blob.core.windows.net/test/blob?timeout=90 HTTP/1.1
x-ms-version: 2009-09-19
User-Agent: WA-Storage/0.0.0.0
x-ms-range: bytes=99-98

Response header:
HTTP/1.1 200 OK
Content-Length: 16777216
Content-Type: application/octet-stream
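
The follow-up Read in the walkthrough above is the usual pattern for .NET streams, because Stream.Read may return fewer bytes than requested. A minimal sketch of that loop, written against System.IO.Stream so that it is self-contained, is shown here; the class and method names are hypothetical. Note that on affected library versions this loop does not avoid the bug: the second request still carries the invalid range, so the wrong bytes end up in the buffer.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the storage client library.
public static class StreamReadSketch
{
    // Keeps calling Read until `count` bytes have been placed in `buffer`
    // (starting at `offset`) or the stream ends.
    // Returns the number of bytes actually read.
    public static int ReadFully(Stream stream, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = stream.Read(buffer, offset + total, count - total);
            if (n == 0)
            {
                break; // end of stream reached before `count` bytes arrived
            }
            total += n;
        }
        return total;
    }
}
```

The loop accumulates the byte count across calls instead of overwriting it, so the caller can compare the return value with the requested count to detect a short read at the end of the stream.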

Mitigation

The workaround is to set the value of BlobStream.ReadAheadSize to 0 before BlobStream.Read() is called if a rewind operation is required:

// Disable read-ahead before any Read() when the stream will be rewound.
BlobStream stream = blob.OpenRead();
stream.ReadAheadSize = 0;

As explained above, BlobStream.ReadAheadSize is an optimization that reduces the number of requests sent when reading blobs in the forward direction; setting it to 0 removes that benefit.

Summary

To summarize, the bug in the client library can result in data being read from an incorrect offset. This happens only when user code seeks backward to a position that is no more than ReadAheadSize bytes before the previous read’s start offset. The bug will be fixed in a future release of the Windows Azure SDK, and we will post a link to the download here once it is released.

Justin Yu