Storage performance and my take on virtual storage

My first blog post raised some good questions, which I'll answer here.

First, yes, Exchange 2010 should seem a little snappier than Exchange 2007 on similar hardware, and especially compared to 2003.  But if the hardware was properly sized for the original version, the snappiness is really just a small side effect of the overall IO efficiency improvements.  Overall we have seen a 2-4x reduction in the number of IOs needed to support a given user profile between 2007 and 2010.  To get the full benefit of the new version you do need to design your hardware deployment properly, following both our guidance and your storage vendor's.

OK, so how did we do it?  Getting such a large change in the performance characteristics of the system required a lot of changes.  However, the conceptual core of the changes, the 'theme' as it were, was finding ways to take small random IOs and make them bigger and more sequential.  Disk drives are so dense that the time the head spends reading 64kB off the platter rather than 4kB is small compared to the time it takes to move to the right track and then wait for the data to rotate under the head (this is especially true if you are using green drives, which use much less power because they rotate more slowly).  Since that is the case, if you can combine sixteen 4kB IOs into one big 64kB IO you get close to a 16x IO improvement.  The biggest changes we made to get this win were to:

  • increase the page size to 32kB (we started at 4kB in Exchange 2003)
  • change our physical schema (the layout of our key tables) so that messages and attachments are kept together and written out sequentially.  This is a particular benefit when Outlook is syncing and during other bulk operations like moves.  In those cases we get the benefit of reading many messages with one random seek of the heads.
  • delay updating the B+ trees used for keeping sorts and views efficient until the view is actually needed.  That way we batch up the updates so that, again, many updates to the B+ tree can be written with a single larger IO.
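
The seek-versus-transfer argument above can be checked with a back-of-the-envelope model. This is only a sketch: the drive figures (8.5 ms average seek, 7200 or 5400 rpm, 100 MB/s sustained transfer) are assumed round numbers rather than measurements, and the function names are made up for illustration.

```python
def service_time_ms(io_kb, seek_ms=8.5, rpm=7200, transfer_mb_s=100.0):
    """Approximate time for one random IO: seek + rotational wait + transfer.

    All drive parameters are assumed, illustrative values.
    """
    rotational_ms = 0.5 * 60_000.0 / rpm  # on average, wait half a revolution
    transfer_ms = io_kb / 1024.0 / transfer_mb_s * 1000.0
    return seek_ms + rotational_ms + transfer_ms


def batching_speedup(count=16, small_kb=4, **drive):
    """How much faster one combined IO is than `count` separate random IOs."""
    many_small = count * service_time_ms(small_kb, **drive)
    one_large = service_time_ms(count * small_kb, **drive)
    return many_small / one_large


if __name__ == "__main__":
    for rpm in (7200, 5400):
        print(f"{rpm} rpm: {batching_speedup(rpm=rpm):.1f}x")
```

With these assumed numbers, combining sixteen 4kB IOs into one 64kB IO comes out a bit over 15x faster, and the slower-spinning drive benefits slightly more because rotational wait is a larger share of each IO.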

We also got some of our wins by improving our cache efficiency.  There were a couple of changes there:

  • We did a better job of compressing large data types on disk, which also reduced their memory footprint in the cache
  • We also did work to cache multiple pages in the same buffer page when the pages on disk are partly empty
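
To make the second point concrete, here is a toy sketch of why sharing buffers between partly empty pages saves memory. It is not how the Exchange store actually does it: the first-fit packing, the 32kB buffer size, and the page fill levels are all assumptions chosen for illustration.

```python
BUFFER_BYTES = 32 * 1024  # assumed buffer size, matching the 32kB page size


def pack_pages(used_bytes_by_page):
    """First-fit packing of partly empty pages into shared cache buffers.

    `used_bytes_by_page` maps a hypothetical page id to the bytes in use on it.
    Returns a list of buffers, each with its remaining space and page ids.
    """
    buffers = []
    for page_id, used in used_bytes_by_page.items():
        for buf in buffers:
            if buf["free"] >= used:  # page fits alongside earlier pages
                buf["pages"].append(page_id)
                buf["free"] -= used
                break
        else:  # no existing buffer had room, so start a new one
            buffers.append({"free": BUFFER_BYTES - used, "pages": [page_id]})
    return buffers


# Hypothetical fill levels for five pages, in bytes.
pages = {1: 32768, 2: 8192, 3: 16384, 4: 4096, 5: 20000}
packed = pack_pages(pages)
```

In this made-up example the five pages fit in three buffers instead of five, so the same amount of cache memory holds more pages.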

This webcast goes into more detail about the Exchange 2010 storage changes -- https://msevents.microsoft.com/CUI/WebCastEventDetails.aspx?EventID=1032418921

Now, what about virtual storage? 

I have been accused of being a bit of an anti-virtualization bigot.  But the truth is I am a huge fan and I have seen the potential benefits first hand when I spent some time working in Microsoft's IT department.  There are many LOB applications in most companies that consume relatively small chunks of storage and CPU.  However, in a dedicated model there is a minimum practical amount of storage that can be deployed per application if it is on its own hardware.  So our own IT group used to have thousands of applications with storage utilization rates at less than 20%.  By creating a central storage utility that is shared across many applications (the disk drives that the servers connect to are 'virtual') it is possible to get much higher average utilization rates while still providing room for spikes in load and growth.  The cost savings can be dramatic. 
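
The utilization argument is simple arithmetic, sketched below. The per-application usage figures, the 500 GB minimum dedicated allocation, and the 50% headroom for the shared pool are hypothetical numbers picked only to show the shape of the effect.

```python
MIN_DEDICATED_GB = 500   # assumed smallest practical dedicated allocation
POOL_HEADROOM = 1.5      # assumed 50% headroom for spikes and growth

# Hypothetical storage actually used by eight small LOB applications, in GB.
used_gb = [40, 75, 120, 60, 90, 30, 100, 85]

# Dedicated model: every application gets at least the minimum allocation.
dedicated_gb = sum(max(MIN_DEDICATED_GB, u) for u in used_gb)

# Shared model: one pool sized for total usage plus headroom.
pooled_gb = sum(used_gb) * POOL_HEADROOM

dedicated_util = sum(used_gb) / dedicated_gb
pooled_util = sum(used_gb) / pooled_gb

if __name__ == "__main__":
    print(f"dedicated: {dedicated_util:.0%} of {dedicated_gb} GB")
    print(f"pooled:    {pooled_util:.0%} of {pooled_gb:.0f} GB")
```

With these assumptions the dedicated model sits at 15% utilization, while the shared pool reaches about 67% with less than a quarter of the raw capacity.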

However, with most Exchange deployments the amount of data involved is so large that getting very good utilization is usually not a problem, and the SANs are often dedicated to Exchange anyway.  Without big capacity utilization wins it is difficult to overcome the large per-spindle and per-bit cost overheads associated with these approaches, and the complexity of the systems can be significant.

Typically with a SAN deployment you would use a RAID-1 configuration for your primary system, with some sort of backup to disk using snaps followed by an offload to tape, and you would add a redundant site if you were concerned about geo-scaling.  In a JBOD approach, the model is to map a single drive to a single database.  Then you choose the number of replicas you want of each database and the number of physical locations you want those copies spread across.  When you lose a spindle, load is transferred to another spindle on another system.  In addition to the reduced complexity of not having a shared storage fabric, you get added availability benefits because the full hardware stack is protected by the application-level replication.
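
The JBOD model can be sketched as a small placement exercise. This is only an illustration and not how Exchange itself places or activates copies: the round-robin scheme, the server and site names, and the function names are all hypothetical.

```python
from itertools import zip_longest


def interleave(servers_by_site):
    """Order servers so that neighbours come from different sites where possible."""
    order = []
    for group in zip_longest(*servers_by_site.values()):
        order.extend(s for s in group if s is not None)
    return order


def place_copies(databases, servers_by_site, copies):
    """Give each database `copies` replicas on distinct servers (one drive each)."""
    order = interleave(servers_by_site)
    if copies > len(order):
        raise ValueError("more copies requested than servers available")
    return {
        db: [order[(i + j) % len(order)] for j in range(copies)]
        for i, db in enumerate(databases)
    }


def active_copy(placement, db, failed):
    """Pick the first copy whose server (and therefore its spindle) is still up."""
    for server in placement[db]:
        if server not in failed:
            return server
    return None  # every copy of this database is unavailable


sites = {"site-a": ["a1", "a2"], "site-b": ["b1", "b2"]}
placement = place_copies(["DB1", "DB2", "DB3", "DB4"], sites, copies=3)
```

Here DB1 ends up on a1, b1 and a2. If a1 or its drive is lost, `active_copy(placement, "DB1", {"a1"})` moves the load to b1, a different spindle on a different system at the other site.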

This webcast goes into more detail about the Exchange 2010 High Availability changes -- https://msevents.microsoft.com/CUI/WebCastEventDetails.aspx?EventID=1032416677

Perry
