Data compression techniques and trade-offs

OK, now that we have sort of agreed (https://blogs.msdn.com/sqlserverstorageengine/archive/2007/09/30/data-compression-why-do-we-need-it.aspx) that data compression is a good thing, you may wonder how SQL Server compresses the data and what this compression means for your data and your workload.

If you are a theory nut or crave mathematics, you can browse the web or read the related books and find many fine techniques for compressing data, but note that not all compression techniques are applicable to databases. I finished my engineering degree almost 30 years ago and, thanks to the computer science profession and the hours spent coding/testing, developing products, and now working as a program manager, I have lost touch (my excuse) with physics and mathematics, the subjects I loved most in college.

In general, there are two kinds of compression:

· Lossy: This causes some loss of data when it is decompressed, which is accepted in exchange for faster/better compression. It is more suitable for compressing image, music, and video files, where some loss in fidelity is undetectable to the human eye/ear. One example is JPEG. Clearly, this is not suitable for documents or for databases.

· Lossless: This allows the original data to be reconstructed exactly from the compressed format. This is what we need for data in the database. There are two common techniques here. The first is based on encoding the input data such that more commonly occurring data is coded using fewer bytes; one example is Huffman coding. The second is dictionary based and is commonly used for text-like data; one example in this category is the LZ (Lempel-Ziv) family of algorithms. There are many variations, but the central idea is to build a dictionary of commonly occurring symbols in a block and then reference them using a pointer, so each commonly occurring symbol need only be stored once, thereby saving space (see the sketch below). SQL Server uses its own proprietary implementation of a dictionary-based algorithm.
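
To make the dictionary idea concrete, here is a minimal, hypothetical sketch that builds a per-block dictionary of repeated values, stores each of them once, and replaces their occurrences with small references. It illustrates the general technique only; SQL Server's proprietary page-dictionary format is different.

# A minimal, hypothetical sketch of dictionary-based compression for one "block"
# of values. Illustrative only -- not SQL Server's proprietary implementation.

def compress_block(values):
    """Store each repeated value once in a dictionary; replace its occurrences with indexes."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    dictionary = [v for v, c in counts.items() if c > 1]    # values worth de-duplicating
    index_of = {v: i for i, v in enumerate(dictionary)}
    # Each entry is either ("ref", dictionary index) or ("lit", original value).
    encoded = [("ref", index_of[v]) if v in index_of else ("lit", v) for v in values]
    return dictionary, encoded

def decompress_block(dictionary, encoded):
    """Reconstruct the original values exactly -- this is what makes it lossless."""
    return [dictionary[x] if kind == "ref" else x for kind, x in encoded]

block = ["Seattle", "Redmond", "Seattle", "Seattle", "Portland", "Redmond"]
dictionary, encoded = compress_block(block)
assert decompress_block(dictionary, encoded) == block       # lossless round trip
print(dictionary)   # ['Seattle', 'Redmond'] -- each repeated string stored once
print(encoded)      # short references in place of the repeated strings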

 

Let us first get two important points out of the way.

1. The goal of compression in a database is NOT to get the maximum compression possible, but to find the right balance between the compression achieved and the cost of compressing/decompressing the data. It will do you no good if SQL Server can compress the data 90% but the CPU cost of SELECT and DML operations becomes unacceptably high. You often hear that such-and-such database system compresses the data 60% while another database system compresses the same data 40%. The important question to ask is: what is the impact on the workload (i.e. CPU)? If the impact on the workload is unacceptable, customers will not use data compression in spite of the better compression ratio. Ideally, we need to strike the right balance between the compression achieved and its CPU impact (the first sketch after this list illustrates the trade-off). For this reason, the SQL Server team evaluated various compression techniques, estimated the data compression we could achieve for different DW databases, estimated the CPU impact of compression/decompression, and then zeroed in on the types of compression to deliver what we have in SQL Server 2008. That is not to say that we won't enhance compression in the future; we will always look for opportunities to enhance our data compression offering. So please do feel free to contact us with your ideas and the kinds of data challenges you are facing.

2. The compression achieved depends on the data distribution and the schema. For example, if all integer values that appear in a column fit in 1 byte and you have declared the column type to be INT (4 bytes), you can potentially save 3 bytes (i.e. 75%) per value, barring any per-column row overhead. On the other hand, if you had declared the column to be tinyint (i.e. 1 byte), then there are no savings from compression. So the same data with a different schema will give you a different perception of the compression achieved, and the same is true of the data distribution (the second sketch after this list works through the arithmetic).
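
To see the trade-off from point 1 in miniature, the hypothetical sketch below compresses the same repetitive data at different zlib levels and reports both the space saved and the CPU time spent. It uses Python's standard zlib module purely as a stand-in for a lossless compressor, not as a model of SQL Server's implementation.

import time
import zlib

# Sample data with plenty of repetition, as in many data warehouse columns.
data = b"2008-01-15,Redmond,Widget,00042;" * 50000

for level in (1, 6, 9):      # zlib levels: 1 = fastest, 9 = best compression
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    saved = 100 * (1 - len(compressed) / len(data))
    # Higher levels usually squeeze out a few more percent but cost more CPU;
    # a database pays that CPU on every scan or DML that touches compressed pages.
    print(f"level {level}: {saved:5.1f}% saved in {elapsed * 1000:6.1f} ms")
    assert zlib.decompress(compressed) == data   # still a lossless round trip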
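
And for point 2, a quick back-of-the-envelope check of why the declared type matters. This hypothetical sketch simply counts bytes for the same column values stored as a fixed 4-byte INT versus a fixed 1-byte tinyint; it does not model SQL Server's actual row format.

# The same column values, two declared types: which one leaves room for compression?
values = list(range(200))                  # every value fits in a single byte

as_int     = 4 * len(values)               # fixed 4-byte INT storage
as_tinyint = 1 * len(values)               # fixed 1-byte tinyint storage
compressed = sum(max(1, (v.bit_length() + 7) // 8) for v in values)   # minimal bytes per value

print(f"INT:     {as_int} bytes -> {compressed} bytes, "
      f"{100 * (1 - compressed / as_int):.0f}% saved")        # about 75% saved
print(f"tinyint: {as_tinyint} bytes -> {compressed} bytes, "
      f"{100 * (1 - compressed / as_tinyint):.0f}% saved")     # 0% saved, nothing left to remove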

In the next blog post, I will provide more details on the types of compression in SQL Server 2008.

Comments

  • Anonymous
    September 30, 2007
    From the SQL Server Storage Engine team comes some interesting news about the new

  • Anonymous
    September 30, 2007
    As announced in Tech-Ed 2007, data compression is a new and exciting feature targeted to be available

  • Anonymous
    February 15, 2008
    The comment has been removed

  • Anonymous
    February 18, 2008
    Thanks for your comment. Some answers here: (1) If data is encrypted before compression, you will not get much compression. TDE (a new feature in SQL 2008) encrypts after compression, so you can have both compression and encryption. (2) SQL native backup compression is a very competitive offering; I suggest that you compare it with LiteSpeed. The other point I want to make is that backup is typically IO bound, so if you use data compression and, say, you are able to compress 50%, then your backup will finish in half the time. (3) Data is kept compressed in the buffer pool and we 'decompress' only the requested columns, not the entire row or page. We don't cache decompressed data at this time. I did not understand your point on "Isn't it effective only in the case of asymmetric algorithms without". (4) The compression/decompression happens at the column/row/page level. The size of the file only matters because there are a lot more pages to compress, but the effort is linear in the number of pages. I recommend you try data compression and see the impact on your workload.

  • Anonymous
    February 25, 2008
    Nice to read about such a thing in easy-to-understand language. I am doing a short project on information and entropy, and I am interested in knowing the factors that limit the extent to which data can be compressed without any loss of information, and secondly how images are compressed under JPEG. Well, I would like to tell you that I am a physics student, so please don't use engineering jargon.

  • Anonymous
    February 26, 2008
    I would suggest looking on the web. I found the following links: http://www.prepressure.com/library/compression_algorithms/jpeg http://www.jpeg.org/jpeg/jpegls.html?langsel=en Thanks

  • Anonymous
    April 22, 2009
    Hi Sunil, I saw your articles; they're awesome. I am facing some issues with data compression. My client has around 20 TB of data in SQL Server. The client needs to compress this data and build an SSAS cube for SQL reporting. I just want to know where the compression of the data can happen: can it be at the SSAS level or the SQL Server database level? If it is at the SSAS level, how and what steps should be considered? If it is at the SQL Server level, what is the better solution for compressing 20 TB of data? NB: The entire data is stored in a single table in SQL 2005. Please help me achieve this. Thanks, Rajesh

  • Anonymous
    April 23, 2009
    I saw your articles; they're awesome. [sunila] Thanks. I am facing some issues with data compression. My client has around 20 TB of data in SQL Server. The client needs to compress this data and build an SSAS cube for SQL reporting. I just want to know where the compression of the data can happen: can it be at the SSAS level [sunila] the data compression I described is not in SSAS... I am not sure if SSAS has its own compression or the SQL Server database level? If it is at the SSAS level, how and what steps should be considered? If it is at the SQL Server level, what is the better solution for compressing 20 TB of data? [sunila] please look at the compression strategy part of the blog... if you have read-mostly data, we recommend PAGE compression NB: The entire data is stored in a single table in SQL 2005. [sunila] not sure what you mean... it looks like you have one dominant table... this is definitely a manageability challenge... Have you partitioned it?