

Virtualization, the SAN and why one big RAID 5 array is wrong.

Today I've had no fewer than 3 ad hoc conversations about disk sharing for VMs.  This isn't about RAID 5 versus RAID 10, specific performance requirements for SharePoint, or cache for simultaneous read/write requests.  This is simply aimed at giving SharePoint admins the knowledge to take to the SAN admins to ensure a successful SharePoint deployment via best practices.  Our focus is on SQL Server in this scenario, but the concepts are applicable elsewhere.  Ready... let's dive in.

 

First of all, some definitions:

| Term | Definition |
|---|---|
| SAN | Acronym for Storage Area Network.  It's a physical device used to extend storage space.  The SAN is the parent device for all disk I/O. |
| Enclosure | Each SAN is made up of enclosures.  An enclosure is a physical part of the SAN and is the device holding all the hard drives.  Normally, a set of Fibre Channel or iSCSI connections, HBAs and other miscellaneous hardware is attached to each enclosure. |
| Hard drive | The regular hard drive you're familiar with.  You know, that spinny thing (or, if you're really good, not-spinny thing). |
| RAID Array | A grouping of hard drives (all within the same enclosure) that provides fault tolerance; normally RAID 5 when we discuss SANs.  If a single hard drive goes down, no data is lost.  In RAID 5, usable capacity is: Size of Drives * (Number of Drives - 1).  The minimum number of disks for RAID 5 is 3: two drives' worth of data and one drive's worth of parity. |
| LUN | Acronym for Logical Unit Number.  A LUN is a logical section of a RAID array and is what gets exposed to Windows as a disk, typically with a drive letter. |
| IOPS | Acronym for Input/Output Operations Per Second.  IOPS is a measurement of the performance of a disk (or array).  To calculate IOPS, we can use http://blogs.technet.com/b/cotw/archive/2009/03/18/analyzing-storage-performance.aspx, or a much easier method: SQLIO. |
| SQLIO | A disk benchmarking utility which gives results in IOPS for different loads.  https://www.microsoft.com/download/en/details.aspx?displaylang=en&id=20163 |
| Bandwidth | The theoretical maximum of a given resource without any additional load.  Imagine bandwidth as a 4 lane highway without traffic with a 70mph speed limit.  As law abiding citizens, we can drive up to 70mph all the time.  To go 10 miles will take about 8.6 minutes every time because we're always traveling 70mph. |
| Throughput | The actual maximum of a given resource with additional load factored in.  Imagine throughput as a 4 lane highway WITH traffic with a 70mph speed limit.  Because of congestion, the actual speed we can travel varies from 45mph to 70mph.  It's never exactly the same. |

 

Now every SAN administrator has some brochure, or PDF, or something from their SAN vendor that says: For peak performance, create one RAID array of all the hard drives in the enclosure.   I'm sure there's some balloon that says "To minimize waste", or "Load balancing across multiple hard drives is a good thing!" But this is wrong. Before you go and blow away all your LUNs and RAID arrays, though, let's examine a scenario:

 

Let's assume that we have a SAN with 4 enclosures.  Each enclosure is capable of holding 10 hard drives, and we decide to fill it with 100GB drives, each rated at 100 IOPS.  Total possible space is 1TB and our bandwidth is 1,000 IOPS per enclosure. To maximize our investment (thereby minimizing waste), we follow our SAN vendor's recommendation and create one big RAID 5 array, losing 1 disk's worth of space to parity.  So our available space is 900GB and our bandwidth is 1,000 IOPS.
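The arithmetic for this enclosure can be sketched in a few lines of Python. This is only an illustration of the numbers above; the function names are made up for the example, and the IOPS figure is the theoretical ceiling (bandwidth), not what you'd measure under load.

```python
def raid5_usable_gb(drive_gb: int, drive_count: int) -> int:
    """Usable RAID 5 capacity: one drive's worth of space goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return drive_gb * (drive_count - 1)


def raw_iops(drive_iops: int, drive_count: int) -> int:
    """Theoretical IOPS ceiling: every spindle contributes its rated IOPS."""
    return drive_iops * drive_count


# The enclosure from the scenario: 10 drives, 100GB and 100 IOPS each.
usable = raid5_usable_gb(drive_gb=100, drive_count=10)
iops = raw_iops(drive_iops=100, drive_count=10)
print(f"{usable}GB usable, {iops} IOPS theoretical")
```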

 

Next, we decide to deploy SharePoint 2010 via Hyper-V with all disks on the SAN. Our server architecture is 3 servers: 1 SQL, 1 APP and 1 WFE.  We decide our drive needs are:

| Server | Disk Description | Requirements |
|---|---|---|
| SQL | OS Drive | 100GB and 50 IOPS |
| SQL | Data Files | 100GB and 250 IOPS |
| SQL | Transaction Logs | 100GB and 250 IOPS |
| WFE | OS Drive | 100GB and 50 IOPS |
| APP | OS Drive | 100GB and 50 IOPS |

 

We send this to the SAN admins, and the SAN admin says to themselves: "Self, enclosure #1 has a 900GB capacity and 1,000 IOPS.  SharePoint 2010 needs 500GB and 650 IOPS." They then promptly carve up the RAID 5 array of enclosure #1 into 5 different LUNs: 3 for SQL, 1 for WFE and 1 for APP.
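The SAN admin's back-of-the-envelope check looks something like the sketch below. It simply totals the requirements from the table and compares them to the enclosure's capacity and theoretical IOPS; the variable names are illustrative only. Note that this check passes, which is exactly why the single-array layout looks fine on paper.

```python
# (server, disk, GB required, IOPS required), taken from the requirements table.
requirements = [
    ("SQL", "OS Drive", 100, 50),
    ("SQL", "Data Files", 100, 250),
    ("SQL", "Transaction Logs", 100, 250),
    ("WFE", "OS Drive", 100, 50),
    ("APP", "OS Drive", 100, 50),
]

ENCLOSURE_GB = 900      # one RAID 5 array of ten 100GB drives
ENCLOSURE_IOPS = 1000   # ten drives at 100 IOPS each (theoretical)

total_gb = sum(gb for _, _, gb, _ in requirements)
total_iops = sum(iops for _, _, _, iops in requirements)
fits_on_paper = total_gb <= ENCLOSURE_GB and total_iops <= ENCLOSURE_IOPS

print(f"Need {total_gb}GB and {total_iops} IOPS; fits on paper: {fits_on_paper}")
```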

 

Here's where the problem arises.  A hard drive only has one armature and uses it to read and write, but it can't do both simultaneously.  If the hard drive is writing, and we request a read for some file, then the read gets queued until the write completes.  RAID is a double-edged sword and the root of our problem:  when using any kind of striped RAID, the data is dispersed amongst all the drives.  On one hand, RAID is a huge performance boost because if we have 20 bits to write and 10 disks to use, then each disk only has to write 2 bits.  On the other hand, if we have 20 bits to write and 10 bits to read, the reads will have to wait because all the drives are busy with the writes.  Now granted, this read and write happens EXTREMELY fast, but the pause is still present, and when we're talking about operations per second, these pauses add up quickly.
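Here is a toy model of that 20-write, 10-read example. It is deliberately simplistic: it assumes every unit of work costs one time step, that writes arrive first, and that each disk services its queue strictly in order. Real arrays have cache, parity overhead and smarter scheduling, so treat it as a picture of the queuing effect, not a performance model.

```python
def stripe(units: int, disk_count: int) -> list[int]:
    """Spread units of work round-robin across disks; return units per disk."""
    per_disk = [0] * disk_count
    for unit in range(units):
        per_disk[unit % disk_count] += 1
    return per_disk


DISKS = 10
writes_per_disk = stripe(20, DISKS)  # 20 units to write across 10 disks
reads_per_disk = stripe(10, DISKS)   # 10 units to read across the same disks

# One arm per disk: a read cannot start until the queued writes are done.
read_finish_idle = max(reads_per_disk)
read_finish_busy = max(w + r for w, r in zip(writes_per_disk, reads_per_disk))

print(f"Each disk writes {writes_per_disk[0]} units")
print(f"Reads finish at step {read_finish_idle} on idle disks, "
      f"step {read_finish_busy} when queued behind the writes")
```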

 

Now, according to SQL best practices, we split our data files and transaction logs onto different disks to alleviate this queuing.  The transaction logs are very write heavy (1000 writes: 1 read or more).  The data file is very read heavy (1 write: 500 reads or so).  As far as SQL is concerned, they're on separate disks because we put them on separate LUNs.  BUT WAIT: remember that all the LUNs are on one RAID array, AND that when you read or write from a RAID array you utilize all the disks.  Thus you haven't actually split your data files and transaction logs; they're on the same disks.  Now stack on top of that the write-heavy load of your APP server and the read-heavy load of your WFE, and our throughput plummets.  While we're still under our bandwidth of 1,000 IOPS for the enclosure, the SQL LUN can't write while the WFE LUN is reading and vice versa.
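One way to see this is to map each LUN back to the physical disks underneath it. The sketch below does that for the single-array layout; the array and LUN names are hypothetical labels invented for the example, not anything a SAN would report.

```python
# One RAID 5 array spanning all ten disks in the enclosure (hypothetical name).
ARRAY_DISKS = {"enclosure1-raid5": frozenset(range(1, 11))}

# Every LUN is carved out of that same array.
LUN_TO_ARRAY = {
    "SQL OS": "enclosure1-raid5",
    "SQL Data Files": "enclosure1-raid5",
    "SQL Transaction Logs": "enclosure1-raid5",
    "WFE OS": "enclosure1-raid5",
    "APP OS": "enclosure1-raid5",
}


def shared_spindles(lun_a: str, lun_b: str) -> frozenset[int]:
    """Return the physical disks that both LUNs end up reading and writing."""
    return ARRAY_DISKS[LUN_TO_ARRAY[lun_a]] & ARRAY_DISKS[LUN_TO_ARRAY[lun_b]]


overlap = shared_spindles("SQL Data Files", "SQL Transaction Logs")
print(f"Data files and transaction logs share {len(overlap)} of 10 disks")
```

Separate drive letters in Windows tell you nothing about this overlap; only the LUN-to-array mapping on the SAN side does.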

So we've identified the underlying problem.  What's the solution?  Let's revisit our server needs:

| Server | Disk Description | Requirements |
|---|---|---|
| SQL | OS Drive | 100GB and 50 IOPS |
| SQL | Data Files | 100GB and 250 IOPS |
| SQL | Transaction Logs | 100GB and 250 IOPS |
| WFE | OS Drive | 100GB and 50 IOPS |
| APP | OS Drive | 100GB and 50 IOPS |

 

RAID 5 needs a minimum of 3 drives: two drives' worth of data and one drive's worth of parity.  SQL best practice is to break up the OS, data files and transaction logs onto separate drives.   How do we follow best practices and leverage the SAN?

 

One possible solution would be to break the enclosure up into 3 RAID 5 volumes like so (note that the space lost to parity is now 300GB instead of 100GB):

| RAID 5 volume | Disks | Available space and bandwidth |
|---|---|---|
| #1 | 1-4 | 300GB and 400 IOPS |
| #2 | 5-7 | 200GB and 300 IOPS |
| #3 | 8-10 | 200GB and 300 IOPS |
| Total | | 700GB and 1,000 IOPS |

 

| Server | Disk Description | Requirements | Allocated from |
|---|---|---|---|
| SQL | OS Drive | 100GB and 50 IOPS | 100GB from RAID 5 volume #1 |
| SQL | Data Files | 100GB and 250 IOPS | 200GB from RAID 5 volume #2 |
| SQL | Transaction Logs | 100GB and 250 IOPS | 200GB from RAID 5 volume #3 |
| WFE | OS Drive | 100GB and 50 IOPS | 100GB from RAID 5 volume #1 |
| APP | OS Drive | 100GB and 50 IOPS | 100GB from RAID 5 volume #1 |

 

In the proposed solution, we are sacrificing an additional 200GB of disk space, but we're gaining the performance of splitting our data files and transaction logs, and SQL can leverage the full 300 IOPS (3 drives at 100 IOPS each) of a dedicated array for each of those LUNs.  The OS disks are still sharing a RAID array (just like in the rejected solution), so we're not making the problem any worse, but since we moved the data files and transaction logs to dedicated spindles, we gain performance by reducing the overall load.
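To double-check the three-volume layout, the sketch below recomputes each volume's usable space and theoretical IOPS from the scenario's 100GB, 100 IOPS drives and compares them with what is placed on it. It uses the required sizes (100GB per disk) rather than the full allocations, and the names are illustrative only.

```python
DRIVE_GB = 100
DRIVE_IOPS = 100

# RAID 5 volume number -> physical disks in that volume.
arrays = {1: range(1, 5), 2: range(5, 8), 3: range(8, 11)}

# (server, disk, GB required, IOPS required, RAID 5 volume it is placed on)
placements = [
    ("SQL", "OS Drive", 100, 50, 1),
    ("SQL", "Data Files", 100, 250, 2),
    ("SQL", "Transaction Logs", 100, 250, 3),
    ("WFE", "OS Drive", 100, 50, 1),
    ("APP", "OS Drive", 100, 50, 1),
]

report = {}
for volume, disks in arrays.items():
    drive_count = len(disks)
    usable_gb = DRIVE_GB * (drive_count - 1)   # one drive's worth lost to parity
    iops = DRIVE_IOPS * drive_count            # theoretical ceiling
    need_gb = sum(p[2] for p in placements if p[4] == volume)
    need_iops = sum(p[3] for p in placements if p[4] == volume)
    report[volume] = (usable_gb, iops, need_gb, need_iops)
    print(f"Volume #{volume}: {usable_gb}GB/{iops} IOPS available, "
          f"{need_gb}GB/{need_iops} IOPS required")

# Data files and transaction logs no longer share any physical disk.
data_log_overlap = set(arrays[2]) & set(arrays[3])
print(f"Disks shared by data and log volumes: {len(data_log_overlap)}")
```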

Lastly, I've fudged the numbers a little bit to make my point.  I'm not sure if 50 IOPS is too little or too much for an OS drive.  I don't know if your enclosure has 10 100GB drives with 100 IOPS in RAID 5 or 25 300GB drives with 10,000 IOPS in RAID 10.  And the read/write challenge is nothing new to hard drive manufacturers or SAN vendors: great strides have been made to reduce the performance impact of a read and a write request coming in simultaneously, namely cache.

 

But here are the facts:  if we plan ahead of time for these scenarios and work with the SAN admins, we can increase the SAN's overall efficiency.  By knowing the loads of other servers on the RAID array, we can intelligently place our loads and prevent any cross-server thrashing of the disks.  Sometimes the answer may be one big RAID 10 array, or three smaller RAID 5 arrays.   Proper benchmarking is the key and should be the first step in any new server installation to ensure we can identify any issues as soon as possible.

 

HTH

Comments

  • Anonymous
    January 01, 2003
    @Zack - excellent point and nearly identical to my statement in the second-to-last paragraph: "And the read/write challenge is nothing new to hard drive manufacturers or SAN vendors: great strides have been made to reduce the performance impact of a read and a write request coming in simultaneously, namely cache."

    However, cache does not solve the issue. Cache has an upper limit to the number of pending requests it can store. If the drives are unable to keep up with the IO demand, your cache will fill up and eventually be saturated.

    So your statement "Problem solved" is incorrect. A more accurate statement would be "Problem delayed".

  • Anonymous
    August 23, 2011
    I'm going to pass this on to my friend who configures and administers SharePoint 2010.  Thanks

  • Anonymous
    July 30, 2014
    "A hard drive only has one armature and uses it to read and write, but it can’t do both simultaneously." That's why drives and arrays have cache. Problem solved.

  • Anonymous
    August 01, 2014
    I'm going to have to agree with Mr. Campbell here. I'm a SQL person and I just sigh when storage people put a bunch of disks in one big RAID 5 and give me C: OS, D: DATA, E: TEMPDB, F: LOG... ummm... no. That does not make any sense; you defeated the purpose here. If it's in the same array anyway, you might as well keep everything under C: and keep it simple.

  • Anonymous
    August 01, 2014
    Ryan, you should also point out the recovery perspective in an argument against one giant RAID 5 while you are at it. When a disk dies in RAID 5, its effects compound because everything slows down: app, SQL data, SQL log, queue server and what have you all slow down, the cumulative effect of which may be no business for the day. On the other hand, if you had an independent RAID 10 for each of data, log and app server, (1) things won't slow down during rebuild and (2) if they ever did, it would be contained to just that instance.

  • Anonymous
    August 08, 2015
    That's why most SAN vendors create disk groups under the array. You have one big logical space, but it is made up of tiered storage and more than one disk group.

    For example, EMC does 4+1 for RAID 5 and 7+1 for RAID 6.

    So if you have 30 disks and are using RAID 5, it will end up being 6 sets of 5 disks ... to form one logical space. It sounds like they are using RAID 50/60, but I don't think they are, since you can add extra sets of disks to expand the array.

    So in the event of a disk failure the whole array is not slowed down .. just that subset of disks.

    Also cache helps with disk IO
