Katmai (Sql 2008) - Data Compression (including Backup Compression)
This post will cover an initial overview of the data compression feature that will be included in Sql Server 2008, and I'll also briefly cover some information on the Backup Compression feature (note that these are 2 totally separate and distinct features, as you'll hopefully see by the end of this post). This will be a fairly lengthy post, but I'd rather it be that way to ensure I cover as much as possible. NOTE that this feature is NOT in a CTP yet, so you won't be able to make use of it in the current CTP...
Reducing the size of a data footprint can usually be done in a few different ways: 1) limit the amount of data that is stored via archival processes, or 2) compress the actual data. Let's assume that you've either already done as much of #1 as possible, and/or can't do #1, and your data is still large - you're left with option #2. Compressing the data can also typically come in a few different flavors: 1) compress the data file on disk using traditional compression algorithms, or 2) store the data more efficiently within the row/page (in this case anyhow). Compressing the data via traditional methods typically requires that you compress "larger" chunks of data in order to perceive a benefit (larger in this case than the typical 8KB page / 64KB extent sizes related to Sql Server storage), and also doesn't provide any in-memory benefit (i.e. the compression is only beneficial on-disk; accessing that data and pulling it into cache requires decompression, and hence the memory footprint required is the same as for the data in an uncompressed format). Storing the data more efficiently in-row/page, on the other hand, provides additional in-memory benefits, since the data is stored in the efficient format in memory just as on disk. This is the way that Sql Server 2008 implements data compression (which, by the way, is similar to the way that certain competitor products that shall remain nameless also implement compression).
Let's start with a couple of simple points:
- Data compression in Sql 2008 is easily enabled/disabled for a table and/or index (via new options in the traditional create/alter table/index DDL - see the sketch after this list)
- Compression can be enabled/disabled for only certain partitions of a table/index
- There are 2 levels of compression that can be enabled, which include 3 key changes in Sql 2008
- ROW level compression
This relates to more efficient storage of data in-row for Sql Server. We'll talk about this more below.
- PAGE level compression
This is a superset of (and implies) row level compression. It also makes use of 2 other new changes in Sql 2008 - "Column prefix compression" and "Page dictionary compression". We'll talk about each of these more below.
- Enabling compression is performed via rebuilding an index/table - CAN be online
- Does NOT currently work with XML, BLOB, or MAX data types
- This will most likely be an Enterprise-only feature (i.e. in the Enterprise/Developer SKUs)
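To give a feel for what that DDL might look like, here's a rough sketch of enabling compression on a table and an index (the table/index names are made up, and since the feature isn't in a CTP yet, the final syntax may well differ from what's shown here):

    -- Enable ROW compression on an existing table (rebuilds the heap or clustered index)
    ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = ROW);

    -- Enable PAGE compression on a single index, keeping it online during the rebuild
    ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
        REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);

    -- Compression can also be specified at create time
    CREATE TABLE dbo.OrderHistory
    (
        OrderId   int      NOT NULL,
        OrderDate datetime NOT NULL
    )
    WITH (DATA_COMPRESSION = PAGE);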
Ok, so with some of those points out of the way, onward...
In reality, Sql Server 2005 SP2 already included a form of data compression - the vardecimal storage format. I'm not going to discuss the vardecimal storage format here, but you can think of it as doing for decimal data roughly what varchar does compared to char data. However, Sql Server 2008's data compression is vastly different from this (though it still supports/includes the vardecimal format as well) - so different, in fact, that if you enable data compression on a given table/index, the underlying row/page format is different - yes, that's right, you heard correctly - if you use compression (ROW or PAGE), the row/page format for Sql 2008 will be DIFFERENT from the existing row/page format (only for the table/index(es) you are using compression for). So, in Sql 2008, there are 2, yes 2, row/page formats for data. You may now be wondering "well, if the row/page format changed, how in the world did you have enough time to re-engineer every component in Sql Server that is aware of the format?" The answer is that we didn't - the Storage Engine is the ONLY component in Sql 2008 that is aware of the new row/page format. Once data is passed up the food-chain, so to speak (i.e. to the Relational Engine, for example), it is converted from the new format to the old format and vice versa (again, this is ONLY if you are using compression for a given table/index/partition). This has some repercussions to be aware of as well - since the data is "uncompressed" when it is passed out of the Storage Engine (i.e. out of cache to be consumed), if you bcp out the data, or push the data out to an external consumer in any way, the data will be translated back to the original, uncompressed row format. The Storage Engine does encompass many IO-bound features, however, so much is gained here (backup, dbcc, scans, etc.).
So you may be wondering why the row/page format had to change - good question. For row-level compression, the simplest answer is that the existing row format used for variable-length data didn't scale very well (if you understand how varchar data is stored compared to fixed-char data, you know that there is a 2-byte overhead associated with each variable-length column for determining the beginning offset of the data in the row (even if there is NO data for the given column), among other considerations for things like NULL values, etc.). Given that we're basically trying to be more efficient about storing data with the compression feature in 2008, using this existing type of row format simply wouldn't cut the mustard for getting the benefits required (though it is very similar to how the vardecimal format is stored in Sql 2005 and 2008 if you're using that and not using the new compression features). For page-level compression (which includes row-level compression), the format also changes because we are adding a new structure at the beginning of each page referred to as the 'CI' structure (CI stands for "compression information") - this structure is made up of 2 components: 1) the "column prefix" portion, and 2) the "page dictionary" portion - we'll discuss these a bit later.
Ok, now that all that is out of the way, let's discuss the internals of ROW compression a bit. The basic gist of row-level compression is more efficient storage for all kinds of data (int, float, money, datetime, character (including varchar), etc.) - it works primarily by optimizing the row-level metadata for variable-length storage (i.e. compared to the current model for storing variable-length data). In Sql 2005 and prior, if you store an INT value of 1, you wind up with 4 bytes of fixed-length storage for a value that takes less than a byte. With variable-length data, a varchar column with a value of 'hi' would take 2 bytes for storage + 2 bytes of metadata overhead (and vardecimal would be the same for decimal values). With Sql 2008 and ROW compression, this metadata storage is cut from 2 bytes to 4 BITS per value (bits, not bytes), in addition to efficient storage of the actual data on byte boundaries (for example, storing the value 1 in an int datatype would require a single byte, not 4 bytes - the value 10,000 in an int would require 2 bytes). So, the storage for this fixed data now becomes the number of bytes needed to store the value + 4 bits of metadata overhead (per value/column). I'm not going to discuss the actual row layout/format for the new structure at the moment, as it is quite complex, but I will possibly go into it in a future post - the key thing to understand for now is that only the actual amount of data needed is stored for ROW-compressed data, for both fixed and variable-length data types (but NOT LOB/XML/MAX data types currently). And yes, there are special considerations for NULLs, 0s, and zero-length strings to efficiently store this type of data as well.
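If you want to see the effect for yourself once the bits are available, a simple before/after comparison is probably the easiest way - something along these lines (the table here is made up purely for illustration, and again the syntax could change before release):

    -- A narrow table full of small int values that waste most of their 4-byte fixed-length slots
    CREATE TABLE dbo.CompressionDemo (Id int NOT NULL, Amount int NOT NULL);

    INSERT INTO dbo.CompressionDemo (Id, Amount)
    SELECT n, n % 100
    FROM (SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
          FROM sys.all_objects a CROSS JOIN sys.all_objects b) AS x;

    EXEC sp_spaceused 'dbo.CompressionDemo';   -- size with the classic row format

    ALTER TABLE dbo.CompressionDemo REBUILD WITH (DATA_COMPRESSION = ROW);

    EXEC sp_spaceused 'dbo.CompressionDemo';   -- size once values are stored on byte boundaries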
How about PAGE compression? As mentioned above, PAGE compression includes ROW compression (so, if you enable PAGE compression, you get ROW compression as well). In addition to the row-compression stuff from above, PAGE compression also uses "column prefix" matching, as well as the concept of a "page dictionary". So, what exactly is each of these concepts, you say?
Let's start with "column prefix", which, as you may have guessed, has much to do with repeating patterns at the beginning of the values for a given column (which, not coincidentally, is quite common for the contents of many pages, especially in the case of index pages, since a single page most likely contains similar data). Assume you have a column of data on a single page of rows that contains values like 'Chad', 'Chadwick', 'Chadly', 'Chad', 'Chadster', 'Chadwick', and 'Chadly' (values repeated purposely) - as you can tell, there's quite a bit of redundant data 'prefixing' each of the rows in this column on this page, yes? So, what you might end up with in a scenario like this would be a column prefix value of 'Chad' stored in the CI structure, and each column value ending up with a pointer to that prefix value - resulting in values like this on-disk: '<empty>', '1wick', '1ly', '<empty>', '1ster', '1wick', and '1ly' (NOTE: this isn't necessarily exactly what you'd end up with, it's an example - you may end up with 2 prefix values, 1 for 'Chad' and 1 for 'Chadwick' for example, etc., etc...it all depends on the data and what the algorithm decides is best for that page). In this scenario, we've now gone from using something along the lines of 44 bytes of storage to using more like ~25 bytes of storage (4 bytes in the CI structure for 'Chad', plus the sum of the remaining values) plus a little metadata overhead - approximately a 38% savings of storage space in this case.
Now let's move on to the "page dictionary" concept - this is basically what it sounds like: a 'dictionary' of repeating column values on the given page is created in the CI structure, and again, pointers to the value in the dictionary are left in the original column value locations. The page dictionary is generated on top of "column prefix" values, so you can most definitely end up with values in the dictionary that have pointers to column prefix values (this is by design). So, using the same column value samples as above with the Chads, after the "column prefix" values are calculated and stored as mentioned above, you'd potentially end up with a page dictionary that contained the values '1ly' and '1wick', and the in-line row values would ultimately look something like '<empty>', '2', '3', '<empty>', '1ster', '3', and '2'. In this case, we went from something along the lines of our original ~25 bytes of storage to around ~17 bytes of storage - another ~30+% savings.
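To put some very rough numbers on the two steps above (this is just my back-of-the-envelope accounting of the character data; the exact pointer and metadata sizes depend on the format details):
- Before compression: 4 + 8 + 6 + 4 + 8 + 8 + 6 = 44 bytes of character data
- After column prefix: 'Chad' in the CI structure (4 bytes) plus the leftover suffixes 'wick', 'ly', 'ster', 'wick', 'ly' (16 bytes) = ~20 bytes of character data; the prefix pointers and per-value metadata are what bring it up to the ~25 quoted above
- After the page dictionary: 'Chad' (4 bytes) + dictionary entries for 'ly' and 'wick' (~6 bytes) + the lone remaining in-row value 'ster' (~4 bytes) = ~14 bytes of character data; the dictionary references and metadata are what bring it to the ~17 quoted above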
First off, notice that in both cases I'm using approximate values - this is because the savings depends on lots of things (including a very complex format), and also on the amount of metadata that is being stored to track the information. The good news is that we are also providing a procedure with Sql 2008, "sp_estimate_data_compression_savings", that will allow you to estimate the savings you'll see for a given table/index - you'll be able to read about it extensively in BOL.
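As a rough sketch of how calling it might look (the exact parameter list is the kind of detail you'll want to confirm in BOL once the bits ship, and the object names here are invented):

    -- Estimate PAGE compression savings for a table, all of its indexes and partitions
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'Orders',
        @index_id         = NULL,      -- NULL = all indexes (including the heap/clustered index)
        @partition_number = NULL,      -- NULL = all partitions
        @data_compression = 'PAGE';    -- 'NONE', 'ROW', or 'PAGE'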
Now, a couple of things to ensure we're clear on with PAGE level compression:
- Each page is distinct from all other pages - the CI structure for each page is built on and made up of only values/rows/etc. for that page, independent of other pages
- Page dictionary data builds on column prefix data - so, you may have dictionary values that contain column prefix pointers - this is a good thing
- Page level CI structures aren't built until a page is full (or pretty near full) - this makes sense in multiple ways if you think about it: 1) since you're using compression to conserve space, and space is allocated to the engine on page level boundaries anyhow, you wouldn't be conserving any space by compressing a 1/2 full page to a 1/4 full page (the db would still have a full page allocated); 2) it would be kind of tough to build the column prefix and page dictionary data without any data to analyze, yes?
Additionally, ROW and PAGE level compression are 2 different forms of compression that can be enabled/disabled at the table/index/partition level independently of other tables/indexes/partitions - yes, this means that within a single index made up of multiple partitions, you can have some partitions that are ROW compressed, some that are PAGE compressed, and/or some that are not compressed at all, giving you all the power and flexibility to manage your data as you see fit.
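As a sketch of what that per-partition flexibility might look like in DDL (the partition scheme, table, and index names are invented - the scheme is assumed to have 5 partitions - and the exact syntax could shift before release):

    -- Different compression settings for different partitions of the same table
    CREATE TABLE dbo.Sales
    (
        SaleId   bigint   NOT NULL,
        SaleDate datetime NOT NULL,
        Amount   money    NOT NULL
    )
    ON ps_SalesByYear (SaleDate)
    WITH
    (
        DATA_COMPRESSION = PAGE ON PARTITIONS (1 TO 3),   -- older, colder partitions
        DATA_COMPRESSION = ROW  ON PARTITIONS (4),
        DATA_COMPRESSION = NONE ON PARTITIONS (5)         -- newest/hottest partition
    );

    -- Or change a single partition of an existing index after the fact
    ALTER INDEX IX_Sales_SaleDate ON dbo.Sales
        REBUILD PARTITION = 4 WITH (DATA_COMPRESSION = PAGE);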
Now for some of the other questions that I'm sure you're curious about:
1) How will this impact performance?
Well, my favorite answer, that depends :-)...it's true though. Obviously, compression of any form (including this implementation) incurs CPU time - that doesn't necessarily mean however that your CPU utilization is guaranteed to go up. Compression in this manner will decrease the following:
- I/O utilization
- Memory utilization (fewer pages in cache, better memory utilization)
- Potentially page latching (i.e. the fewer pages that exist in cache, the less cache that must be scanned, and the fewer latches that must be taken)
If the CPU savings from the above outweighs the added CPU cost of "compression", the net effect will be lower CPU utilization - though this depends entirely on your workload, system, etc. It's entirely possible that you currently have a workload that is CPU-bound for truly CPU-bound reasons, and you'll see increased CPU utilization with no offset (assuming you have no IO, for example, or plenty of cache). Tests with your environment and your workload would need to be performed to see what benefit you'd receive. If your workload is IO bound or memory bound, then sacrificing a bit of CPU time to increase your overall throughput will be an easy decision. Additionally, the whole idea is to increase throughput, so if your throughput goes up but your CPU also goes up a few percentage points, that's generally a win as well.
2) How much compression will I see / what will the compression ratio for my data be?
Again, it depends - which is hopefully somewhat obvious from the discussions above. This answer depends entirely on your data - for ROW level compression, it depends on the size, type, sparseness, etc. of the data. If you have lots of null values, or lots of values that are small compared to the storage size of the datatype, then you'll see very good compression ratios - if not, you won't see great compression ratios. For PAGE compression, in addition to the same considerations as ROW compression, it will additionally depend on the repetition of data, density/selectivity of data, etc. Again, you'll want to use the "sp_estimate_data_compression_savings" procedure to get an idea of what you might see.
Finally, on the manageability/supportability side, many tools include enhancements for insight into compression (DBCC PAGE, new perfmon counters "CI Computations Succeeded" and "CI Computations Attempted", page counts in DMVs, etc.).
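For example, DBCC PAGE (undocumented and unsupported, so the usual caveats apply) can be pointed at a compressed page to poke at the CI structure - something along these lines, with made-up database/file/page numbers:

    DBCC TRACEON (3604);                      -- route DBCC PAGE output to the client connection
    DBCC PAGE ('MyDatabase', 1, 12345, 3);    -- database, file id, page id, print option 3 = full detail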
Well, that covers the data compression component - how about backup compression? This is a much easier story to tell - backup compression takes advantage of your typical, standard on-disk compression methods. You most certainly can have a database that contains data compression in tables/indexes/partitions AND is also compressed via backup compression at backup time. Backup compression is implemented via a new option in the BACKUP syntax, and it is either ON or OFF (much was researched around implementing different levels of backup compression, but if you've ever worked with backup compression software that supports levels of compression, it's usually very easy to find the point at which the trade-off of compression achieved vs. CPU time spent is a no-brainer - this is why it's a simple ON/OFF option at this point). You can also set a new system-level sp_configure option to define the default compression value (on/off) that will apply any time a BACKUP is taken without specifying the compression option. Backup compression is supported for ALL backup types, and can be leveraged with log backups to improve things such as log shipping throughput across a WAN (see the sketch after the list below). A few points about backup compression:
- Backup compression will most likely be an Enterprise-only feature (i.e. in the Enterprise/Developer SKUs)
- Regardless of the SKU support, ALL editions will support RESTORING compressed backups
- Compression ratios will depend on your data as usual, but we've seen around 5:1 ratios on 'standard' data sets. Again, this will be entirely dependent on your data set - no guarantees with anything here; you'll need to test against your data set and see what you get.
- Database mirroring has been enhanced to leverage compression algorithms for streaming data to the mirror (big improvements here)
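To put the syntax side of that in one place, here's a rough sketch (database names and paths are made up, and details like the exact sp_configure option name could change before RTM):

    -- Compress a single backup explicitly
    BACKUP DATABASE MyDatabase
        TO DISK = 'X:\Backups\MyDatabase.bak'
        WITH COMPRESSION;

    -- Log backups can be compressed too (handy for log shipping across a WAN)
    BACKUP LOG MyDatabase
        TO DISK = 'X:\Backups\MyDatabase.trn'
        WITH COMPRESSION;

    -- Make compression the default for any backup that doesn't specify the option
    EXEC sp_configure 'backup compression default', 1;
    RECONFIGURE;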
Ok, so that hopefully covers data compression and backup compression a bit, by all means post with any questions/comments/etc. and I'll be sure to take note.
Finally, much of the information included herein was relayed to me by Sunil Agarwal (a PM for Sql Server here at Microsoft), so lots of thanks should go to him for the content - if there's anything misstated herein, I'm sure it's due to my interpretation and not his information.
Chad Boyd ~~~ This posting is provided "AS IS" with no warranties, and confers no rights. Use of any included script samples are subject to the terms specified at https://www.microsoft.com/info/cpyright.htm.
Comments
- Anonymous, July 29, 2007
Thanks! Really useful. One aspect is not covered here: the performance impact (for ROW and PAGE compression) when updating data. I would expect to use compression mostly for read-only data and never in an OLTP environment, however huge the tables are, but given the way it's done, I wonder a bit...
- Anonymous, July 29, 2007
Good question Ericga...it's not a simple answer however, so I'll plan to make that a future blog post with some details. The biggest impact would be on CPU utilization (as mentioned in the article), since all data is always managed in memory - and if the benefit of less IO/memory usage outweighs the cost of CPU, then you'd be doing well in many scenarios - it again would always depend on your data. For example, if you're inserting/updating data that is indexed via ever-increasing values, the cost would be very small - the largest cost would most likely be incurred during page splits, which will only occur on inserts/updates within an existing range (if that makes sense). Again, a tough one to answer in a few sentences, so I'll post an item about this by itself at some point in the near future. Thanks!
- Anonymous, August 02, 2007
The comment has been removed
- Anonymous, August 06, 2007
What is the compression algorithm used for backup compression? Or is that a private matter :). Is the framework open such that one could put their own in? I ask because there are new cell phone compression algorithms that would be very beneficial to database backups compared to traditional ones.
- Anonymous, August 08, 2007
Sorry, but I really can't comment on the algorithm used :-). I'll verify to be certain, but I'm also quite certain that it's not an open framework (i.e. use your own) - that would potentially lead to many scenarios that wouldn't be pleasant :-).