Freigeben über


Details on PAGE compression (column-prefix)

In the previous blog, I had mentioned that the PAGE compression is used to “minimize the data redundancy in columns in one or more rows on a given page”. You may wonder what exactly SQL Server does under the PAGE compression cover? With PAGE compression, the SQL Server eliminates two types of data redundancy known as column-prefix compression and dictionary compression. I will describe column-prefix compression in this blog followed by dictionary compression in the next.

Under column-prefix compression, the SQL Server looks for common ‘byte’ pattern at the beginning of a column across all rows on the page. If it finds at least two instances of the same columns with some common byte pattern, it stores that ‘byte’ pattern once on the page and refers to this byte pattern from those respective columns. This process is repeated for all columns in the row. Now, there are two things that need clarification here.

· We are referring to the ‘byte’ pattern. This means that under column-prefix compression, the SQL Server is only looking at bytes that are common without regards of the data type (i.e. it is type agnostic). It does not matter if the common bytes represent a part of integer value or float or character string.

· Common ‘byte’ pattern is matched starting from the beginning of the columns, hence the name column-prefix compression. So for example, if you have two values represented in hexadecimal representation as 0x020406FF and 0x0204CCFF. In this case, the common ‘byte’ pattern is ‘0x0204’ only and we don’t consider the value FF which is the 4th byte because the 3rd byte was different. Similarly, if you have two values as 0xFFAABBCCDDEE and 0x33AABBCCDDEE, there is no common ‘byte’ pattern or column prefix because the first byte is different even though the next 5 bytes are exactly the same. One can argue that we could potentially compress this too and you are obviously right. We decided not to compress it because our design goal was to find the balance between compression savings and CPU overhead of compressing/de-compressing the data.

 

You may wonder where and how SQL Server stores this common ‘byte’ pattern? SQL Server stores the column-prefix data in a new record, called the anchor-record. Anchor record has the same schema as the regular row (data or index) and is stored on the same page immediately following the page header. If a column does not have a column-prefix value, it is treated as empty or null just like any other empty or null value stored in a regular data or index row. If none of the columns have column-prefix, the anchor record is not created. During column-prefix compression, the SQL Server stores meta-information in the data or index record so that the full column value can be constructed. Let us consider an example to illustrate this.

 

The following figure shows a data page without column-prefix compression. For simplicity, this page shows three data rows with three columns of type CHAR. Like I had mentioned before, column-prefix compression is 'data type' agnostic so a similar example will hold for other data types as well.
______________________________________
| HEADER |

|_____________________________________|
| |
|_____________________________________|
| AABBB | CCCBC | ABCD |

|_____________|_____________|_________|

 | AAABC | BBBB | ABCD |

|_____________|_____________|_________|

| AAACCC | CCCDD | BBBB |

|_____________|_____________|_________|

 

Now, let us look at each column and find the common ‘byte’ pattern. For the first column, the common byte pattern is ‘AAA’, for the second column it is ‘CCC’ and for the third column it is ‘ABCD’. Using this information, we can create an anchor record. So the modified page will look as follows

__________________________________
| HEADER |

|_____________________________________|
| AAACCC |CCCDD |ABCD | <<---- Anchor record

|_____________________________________
|_____________________________________|
| AABBB | CCCBC | ABCD |

|_____________|_____________|_________|

 | AAABC | BBBB | ABCD |

|_____________|_____________|_________|

| AAACCC | CCCDD | BBBB |

|_____________|_____________|_________|

I want to bring your attention to two points. First, the anchor record is just like any other data or index row but SQL Server knows to treat this record differently. So for example, you can never retrieve anchor record as part of SELECT query. Second, the anchor record not only stores the common byte pattern for each column but it also stores the value of the largest column containing the common byte pattern. So we store ‘AAACCC’ for the first column and similarly for the others. Now you may wonder where are we getting the space savings from? Well, we are not done with column-prefix yet. Once the anchor-record has been created, we need to refer to it from the respective columns.

So the ‘final’ modified page will look as follows. The first column in the first row has a value 2BBB which means: take the first 2 bytes from the corresponding column of the anchor record and then concatenate that value with ‘BBB’ so you will get the value ‘AABBB’. Another example is for the third column of row 2, the value <null> indicates to take the value of this column from the anchor record which evaluates to ‘ABCD’. Note, there is meta-information in the record that helps SQL Sever differentiate between this <null> and actual <null> value. You can see that the space savings will depend upon size of the column-prefix and the number of columns referencing it. Again, the SQL Server associates meta-information to differentiate between the ‘number’ of bytes to be fetched from the corresponding column in the anchor record and the actual data for the column (i.e. how do we know that the first byte is the length of prefix bytes or part of the column value). The value 0BBBB in the second column of the second row represents that pick 0 (zero) bytes from the corresponding column in anchor record. For this case, the real value of the column is ‘BBBB’. As you can see that constructing the real value of the column is more expensive than if the data was not compressed. Similarly, when a column value is modified or a new row is inserted, the SQL Server has to look at anchor-record and do the space optimization when applicable. So the cost of compression is paid both by SELECT and for DML operations. 

__________________________________
| HEADER |

|_____________________________________|
| AAACCC |CCCDD |ABCD | <<---- Anchor record

|_____________________________________
|_____________________________________|
| 2BBB | 3BC | <null> |

|_____________|_____________|_________|

 | 3BC | 0BBBB | <null> |

|_____________|_____________|_________|

| <null> | <null> | 0BBBB |

|_____________|_____________|_________|

 

In the next blog, I will describe the dictionary compression.

Comments