Windows Azure Storage Overview
I am at the Azure Firestarter event in Redmond today and just heard Brad Calder give a quick overview of Azure data. Here are my notes; slides and sample code are to be posted later, and I will update this post when they are available.
- Blobs
- REST APIs
- Can have a lease on the blob - allows for limiting access to the blob (used by drives)
- To create a blob…
- Use StorageCredentialsAccountAndKey to create the authentication object
- Use CloudBlobClient to establish a connection using the authentication object and a URI to the blob store (from the portal)
- Use CloudBlobContainer to create/access a container
- Use CloudBlob to access/create a blob
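A minimal sketch of those steps against the v1 StorageClient library; the account name, key, and container/blob names are placeholders:

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Placeholder account name and key - use your own from the portal.
var credentials = new StorageCredentialsAccountAndKey("myaccount", "<key>");
var blobClient = new CloudBlobClient(
    new Uri("http://myaccount.blob.core.windows.net"), credentials);

// Create/access a container, then create a blob inside it.
CloudBlobContainer container = blobClient.GetContainerReference("photos");
container.CreateIfNotExist();

CloudBlob blob = container.GetBlobReference("vacation.jpg");
blob.UploadFile(@"C:\photos\vacation.jpg");
```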
- Two types of blobs
- Block blob - up to 200 GB
- Targeted at streaming workloads (e.g. photos, images)
- Blocks can be uploaded in any order (e.g. from multiple parallel streams)
- Page blob - up to 1 TB
- Targeted at random read/write workloads
- Used for drives
- Pages not stored are effectively initialized to all zeros.
- Only charged for pages you actually store.
- Can create a 100 GB blob, but write 1 MB to it - only charged for 1 MB of pages.
- Page size == 512 bytes
- Updates must be 512 byte aligned (up to 4 MB at a time)
- Can read from any offset
- ClearPages removes the content - not charged for cleared pages.
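A sketch of those page blob behaviors, continuing from the container in the previous sketch; the sizes are illustrative:

```csharp
using System.IO;

// Create a 100 GB page blob - nothing is billed until pages are written.
CloudPageBlob pageBlob = container.GetPageBlobReference("disk.vhd");
pageBlob.Create(100L * 1024 * 1024 * 1024);

// Writes must start on a 512-byte boundary and be a multiple of 512 bytes
// (at most 4 MB per call); only the pages actually written are charged.
byte[] onePage = new byte[512];
using (var stream = new MemoryStream(onePage))
{
    pageBlob.WritePages(stream, 0);
}

// Reads may start at any offset. ClearPages zeroes a range and stops
// billing for it.
pageBlob.ClearPages(0, 512);
```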
- CDN
- Storage account can be enabled for CDN.
- Will get back a domain name to access blobs - can register a custom domain name via CDN.
- Different from the base domain used to access blobs directly - requests to the main storage account URL are served straight from blob storage, bypassing the CDN.
- To use CDN
- Create a blob
- When creating a blob, specify "TTL" - time to live in the CDN in seconds.
- Reference the blob using the CDN URL and it will cache it in the nearest CDN store.
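The "TTL" here maps to the blob's Cache-Control header in the v1 library. A sketch, assuming an "assets" container made publicly readable (which the CDN requires); names are placeholders:

```csharp
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=<key>");
CloudBlobClient client = account.CreateCloudBlobClient();

// The CDN can only serve blobs that are publicly readable.
CloudBlobContainer container = client.GetContainerReference("assets");
container.CreateIfNotExist();
container.SetPermissions(new BlobContainerPermissions
{
    PublicAccess = BlobContainerPublicAccessType.Blob
});

CloudBlob blob = container.GetBlobReference("logo.png");
blob.UploadFile(@"C:\assets\logo.png");

// The CDN honors Cache-Control as the blob's TTL - one hour here.
blob.Properties.CacheControl = "max-age=3600";
blob.SetProperties();
```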
- Signed URLs (Shared Access Signatures) for Blobs
- Can give limited access to blobs without giving out your secret key.
- Create a Shared Access Signature (SAS) that gives time-based access to the blob.
- Specify start-time and end-time.
- Specify the resource granularity (all blobs in a container, or just one blob in the container).
- Read/write/delete access permissions.
- Give out URL with signature.
- The signature can be validated against a signed identifier stored on the container; removing that identifier instantly revokes any signatures issued against it.
- Can also store time range and permissions with the signed identifier rather than in the URL.
- These can be changed after the URL is issued, and the signature in the URL remains valid.
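A sketch of both flavors - an ad hoc signature carrying the time window and permissions in the URL itself, and a revocable one backed by a signed identifier on the container. The account string, container, blob, and policy names are placeholders:

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=<key>");
CloudBlobClient client = account.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference("shared");
CloudBlob blob = container.GetBlobReference("report.pdf");

// Ad hoc signature: start time, expiry, and permissions live in the URL.
var policy = new SharedAccessPolicy
{
    SharedAccessStartTime = DateTime.UtcNow,
    SharedAccessExpiryTime = DateTime.UtcNow.AddHours(1),
    Permissions = SharedAccessPermissions.Read
};
string sasUrl = blob.Uri.AbsoluteUri + blob.GetSharedAccessSignature(policy);

// Revocable signature: store the policy under a signed identifier on the
// container. Removing the identifier instantly invalidates issued URLs,
// and the time range/permissions can be changed after the URL is given out.
var containerPermissions = new BlobContainerPermissions();
containerPermissions.SharedAccessPolicies.Add("readers", policy);
container.SetPermissions(containerPermissions);

string revocableUrl = blob.Uri.AbsoluteUri +
    blob.GetSharedAccessSignature(new SharedAccessPolicy(), "readers");
```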
- Windows Azure Drive
- Provides a durable NTFS drive using page blob storage.
- Actually a formatted single-volume NTFS VHD up to 1 TB in size (same limit as page blob)
- Can only be mounted by one VM instance at a time.
- Note that each role instance runs on a VM, so only one role instance can mount a drive read/write
- Could not have both a worker role and a web role mounting the same drive read/write
- One VM instance can mount up to 16 drives.
- Because a drive is just a page blob, can upload your VHD from a client.
- An Azure instance mounts the blob
- Obtains a lease
- Specifies how much local disk storage to use for caching the page blob
- APIs
- CloudDrive.InitializeCache - initialize how much local cache to use for drives
- CloudStorageAccount - to access the blob
- Create a CloudDrive object using CreateCloudDrive specifying the URI to the page blob
- Against CloudDrive…
- Create to initialize it.
- Mount to mount it - returns path on local file system and then access using normal NTFS APIs
- Snapshot to create backups
- Can mount snapshots as read-only VHDs
- Unmount to unmount it.
- The driver that mounts page blobs runs only in the cloud - not in the development fabric.
- In the development fabric, local VHDs are used instead.
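A sketch of the drive lifecycle using the APIs above (requires a reference to the Microsoft.WindowsAzure.CloudDrive assembly; the paths, sizes, and blob URI are placeholders - in a real role the cache path would come from a LocalResource):

```csharp
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;   // CloudDrive lives here

var account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=<key>");

// Reserve local disk to cache pages of the blob (size in MB).
CloudDrive.InitializeCache(@"C:\Resources\drivecache", 64);

// Create a 1 GB NTFS drive backed by a page blob, then mount it.
CloudDrive drive = account.CreateCloudDrive(
    "http://myaccount.blob.core.windows.net/drives/data.vhd");
drive.Create(1024);                                     // size in MB
string path = drive.Mount(25, DriveMountOptions.None);  // e.g. "d:\"

// ... read and write under 'path' with normal NTFS file APIs ...
// drive.Snapshot() would capture a backup mountable as a read-only VHD.

drive.Unmount();
```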
- Tables
- Table can have billions of entities and terabytes of data.
- Highly scalable.
- WCF Data Services - LINQ or REST APIs
- Table row has a partition key and a row key
- Partition key:
- controls granularity of locality (all entities with same partition key will be stored and cached together)
- provides entity group transactions - as long as entities have same partition key, can do up to 100 insert/update/delete operations as a batch and will be atomic.
- enables scalability - Azure monitors usage patterns and scales out across different servers based on partition keys
- More granularity of partition key = better scalability options
- Less granularity of partition key = better ability to do atomic operations across multiple rows (because all must have same partition key)
- To create / use an entity
- Create a .NET class modeling an entity
- Specify the DataServiceKey attribute to tell WCF Data Services the primary keys (PartitionKey, RowKey)
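For example, a minimal entity class (the property names beyond the keys are illustrative); the StorageClient library also ships a TableServiceEntity base class that declares these keys for you:

```csharp
using System;
using System.Data.Services.Common;

// PartitionKey and RowKey together form the entity's primary key.
[DataServiceKey("PartitionKey", "RowKey")]
public class CustomerEntity
{
    public string PartitionKey { get; set; }   // e.g. a region
    public string RowKey { get; set; }         // e.g. a customer id
    public DateTime Timestamp { get; set; }    // maintained by the service
    public string Name { get; set; }
}
```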
- APIs
- CloudTableClient - establish URI and account to access table store
- TableServiceContext - get from CloudTableClient
- Add entities using the context AddObject method specifying the table name and the class with the data for the new entity
- SaveChangesWithRetries against context to save the object.
- To query, use LINQ with AsTableServiceQuery<T>, where T is the .NET class modeling the entity.
- Manages continuation tokens for you
- Then foreach over the results; you can use UpdateObject to update entities as you stream through them.
- Use SaveChangesWithRetries
- Pass SaveChangesOptions.Batch if there are <= 100 entities and all have the same partition key - they are saved as one atomic batch.
- Otherwise, a separate transaction is sent for each object.
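Putting those APIs together - a sketch that inserts two entities as one batch, then queries and updates them, using the CustomerEntity class from the previous sketch (table and account names are placeholders):

```csharp
using System.Data.Services.Client;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=<key>");
CloudTableClient tableClient = account.CreateCloudTableClient();
tableClient.CreateTableIfNotExist("Customers");

TableServiceContext context = tableClient.GetDataServiceContext();

// Same partition key, so both inserts can go as one atomic batch.
context.AddObject("Customers",
    new CustomerEntity { PartitionKey = "EU", RowKey = "42", Name = "Ada" });
context.AddObject("Customers",
    new CustomerEntity { PartitionKey = "EU", RowKey = "43", Name = "Grace" });
context.SaveChangesWithRetries(SaveChangesOptions.Batch);

// AsTableServiceQuery handles continuation tokens across large result sets.
var customers = (from c in context.CreateQuery<CustomerEntity>("Customers")
                 where c.PartitionKey == "EU"
                 select c).AsTableServiceQuery();

foreach (CustomerEntity customer in customers)
{
    customer.Name = customer.Name.ToUpper();
    context.UpdateObject(customer);
}
context.SaveChangesWithRetries();
```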
- Table tips
- Raise ServicePointManager.DefaultConnectionLimit (the .NET default is only 2 HTTP connections per host) - see the sketch after these tips.
- Use SaveChangesWithRetries and AsTableServiceQuery to get best performance
- Handle Conflict errors on inserts and NotFound errors on Delete
- Can happen because of retries
- Avoid append-only write patterns based on partition key values
- Can happen if the partition key is based on a timestamp.
- Continually appending to the tail of one key range defeats Azure's scale-out strategy.
- Choose partition keys that distribute writes rather than concentrating them in one range.
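A sketch of the first two tips; the connection-limit value and the helper method are hypothetical, and the Conflict check shows how a retried insert that actually succeeded on the first attempt can be treated as done:

```csharp
using System.Data.Services.Client;
using System.Linq;
using System.Net;
using Microsoft.WindowsAzure.StorageClient;

static class TableTips
{
    // Call once at startup: the .NET default of 2 HTTP connections per host
    // throttles parallel storage calls. 48 is just an illustrative value.
    public static void TuneConnections()
    {
        ServicePointManager.DefaultConnectionLimit = 48;
    }

    // Hypothetical helper: if a retry reports Conflict, the first attempt
    // already landed, so the insert can be treated as successful.
    public static void InsertIgnoringConflict(
        TableServiceContext context, string table, object entity)
    {
        try
        {
            context.AddObject(table, entity);
            context.SaveChangesWithRetries();
        }
        catch (DataServiceRequestException ex)
        {
            OperationResponse response = ex.Response.FirstOrDefault();
            bool duplicate = response != null &&
                response.StatusCode == (int)HttpStatusCode.Conflict;
            if (!duplicate) throw;   // only swallow the duplicate-insert case
        }
    }
}
```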
- Queues
- Provide reliable delivery of messages
- Allow loosely coupled workflow between roles
- Work gets loaded into a queue
- Multiple workers consume the queue
- When dequeuing a message, specify an "invisibility time" which leaves the message in the queue but makes it temporarily invisible to other workers
- Allows for reliability.
- APIs
- Create a CloudQueueClient using account and credentials
- Create a CloudQueue using the client and GetQueueReference - queue name
- CreateIfNotExist to create it if not there
- Create a CloudQueueMessage with content
- Use CloudQueue.AddMessage to add it to the queue
- Use CloudQueue.GetMessage to get it out (passing invisibility time)
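A sketch of those queue steps end to end (the queue name, message content, and 30-second invisibility window are placeholders):

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=<key>");
CloudQueueClient queueClient = account.CreateCloudQueueClient();

CloudQueue queue = queueClient.GetQueueReference("workitems");
queue.CreateIfNotExist();

// Producer side.
queue.AddMessage(new CloudQueueMessage("resize photo 1234"));

// Consumer side: the message stays in the queue but is invisible to
// other workers for 30 seconds.
CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    // ... process the work item ...
    queue.DeleteMessage(msg);   // delete before the invisibility window ends
}
```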
- Tips on Queues
- Messages up to 8 KB in size
- Put in a blob if more and send blob pointer as message
- Remember that a message can be processed more than once.
- Make processing idempotent so that duplicate delivery is harmless.
- Assume messages are not processed in any particular order
- Queues can handle up to about 500 messages/second.
- For higher throughput, batch work items into a blob and send one message referencing a blob that contains e.g. 10 work items.
- Worker does 10 items at a time.
- Increased throughput by 10x.
- Use DequeueCount to remove "poison messages" that seem to be repeatedly crashing workers (see the sketch after this list).
- Monitor message count to increase/decrease worker instances using service management APIs.
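A sketch of a dequeue loop that applies the DequeueCount tip, continuing from the queue in the previous sketch; the threshold of 3 and the handler are hypothetical:

```csharp
CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
if (msg != null)
{
    if (msg.DequeueCount > 3)   // arbitrary poison threshold
    {
        // This message has already crashed or outlived several workers -
        // drop it (optionally saving it to a "poison" queue or blob first).
        queue.DeleteMessage(msg);
    }
    else
    {
        ProcessWorkItem(msg.AsString);   // hypothetical handler
        queue.DeleteMessage(msg);
    }
}
```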
- Q: can you set priorities on queue messages?
- A: No - would have to create different queues
- Q: Are blobs stored within the EU complying with the EU privacy policies?
- A: Microsoft has a standard privacy policy which we adhere to.
Comments
- Anonymous
February 13, 2011
Hi Mike, Could you explain more on Create and CreateIfNotExits for Queue, I tried it manually but there doesn't seem much difference i.e. no change in the queue data members or the queue. Naveen