SQL Server 2008 R2 - UCS2 compression: what it is and its impact on SAP systems

We just released CTP2 of SQL Server 2008 R2. What is this CTP? Think of it as Beta 2 of SQL Server 2008 R2. There will be another CTP, which should be released in the last quarter of this calendar year. The final release time frame should be very early in the second half of next calendar year. As background, the R2 version of SQL Server 2008 contains only minimal changes in the SQL Server Relational Engine; most of the changes have to do with new components in the overall SQL Server package. However, there is one change in the Relational Engine which could have a significant impact on TCO.

As most of you know, SQL Server stores Unicode strings in columns of the datatype nvarchar. The encoding we use is UCS2, which usually stores every character in 2 bytes. This is different from encodings like UTF8, where characters from single-byte code pages are stored as one byte on disk (although, on the other hand, UTF8 can use up to 4 bytes per character for double-byte scripts like Kanji). The existing UCS2 storage therefore had room for improvement: save space for characters from single-byte code pages.
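As a small illustration of the starting point, the logical UCS2 size of nvarchar data can be seen with DATALENGTH: two bytes per character, even for a plain Latin string. Note that DATALENGTH should keep reporting this logical size regardless of compression, since the feature is transparent to everything above the storage layer.

```sql
-- Illustration only: nvarchar data is UCS2 encoded, so every character of a
-- plain Latin string still occupies two bytes logically.
SELECT DATALENGTH('Hello')  AS varchar_bytes,   -- 5  bytes (single-byte code page)
       DATALENGTH(N'Hello') AS nvarchar_bytes;  -- 10 bytes (UCS2: 2 bytes per character)
```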

The goals we set ourselves for SQL Server 2008 R2 were as follows:

· Spend only one byte of disk storage per character for single-byte collations like the typical Latin1 collation

· Improve storage efficiency even for double-byte character collations

· Make the new way of storing nvarchar content completely transparent: beyond enabling either ROW or PAGE compression, no further steps need to be applied (see the sketch after this list)

· For already existing ROW or PAGE compressed tables in databases which were attached or restored from a SQL Server 2008 source, newly inserted or modified rows will be stored in the new format, even in combination with rows in the old format on the same page

· The performance impact shouldn’t be measurable in productive scenarios
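To illustrate the transparency goal above, here is a minimal T-SQL sketch. The table name dbo.SomeTable is hypothetical; enabling ROW or PAGE compression is the only action required on SQL Server 2008 R2, and the Unicode compression of nvarchar data comes along with it.

```sql
-- Sketch: dbo.SomeTable is a hypothetical table containing nvarchar columns.
-- Enabling ROW or PAGE compression is the only step needed; on SQL Server 2008 R2
-- the UCS2 (Unicode) compression of nvarchar data is applied automatically with it.
ALTER TABLE dbo.SomeTable REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Nonclustered indexes are compressed separately if desired:
ALTER INDEX ALL ON dbo.SomeTable REBUILD WITH (DATA_COMPRESSION = PAGE);
```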

In order to achieve these goals, we implemented a new way of storing our UCS2 encoded nvarchar datatypes on disk. The algorithm applied is widely known as SCSU (Standard Compression Scheme for Unicode). It is applied at the moment a row is placed on a page, and it is absolutely transparent to the application and to other layers of SQL Server. We achieved the goal of transparency by coupling UCS2 compression with ROW and PAGE compression: if one chooses to compress a table in SQL Server 2008 R2 with ROW or PAGE compression, the new UCS2 compression comes with it automatically. Conversely, a table without any compression also gets no Unicode compression. So UCS2 compression is implemented as an extension of the database compression features we already introduced in the SQL Server 2008 release.
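Which tables or indexes are already ROW or PAGE compressed, for instance after attaching or restoring a SQL Server 2008 database as described in the goals above, can be checked through the catalog view sys.partitions, which exposes a data_compression_desc column in SQL Server 2008 and later. A small sketch:

```sql
-- Sketch: list all tables/partitions that currently use ROW or PAGE compression
-- (and therefore, on SQL Server 2008 R2, Unicode compression for nvarchar data).
SELECT OBJECT_NAME(p.object_id) AS table_name,
       p.index_id,
       p.partition_number,
       p.data_compression_desc
FROM   sys.partitions AS p
WHERE  p.data_compression_desc IN ('ROW', 'PAGE')
ORDER  BY table_name, p.index_id, p.partition_number;
```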

In terms of performance impact, I spent a lot of time over last Christmas testing, based on the SAP SD benchmark and some examples of typical BW databases. The impact of UCS2 compression on top of our existing compression methods is in the extremely low single-digit percentage range; in a productive system this is a change that isn’t really noticeable. These tests and measurements, together with measurements conducted with other workloads, including a standard TPC based benchmark workload, gave us enough confidence to simply couple the new UCS2 compression with our existing methods, because the increase in CPU consumption proved negligible.

So what is the benefit we measured in some of the comparisons? The following table shows relative storage size per locale (pure UCS2 in SQL Server 2008 = 1):

| Locale | Pure UCS2 in SQL Server 2008 | UTF-8 | UCS2 compression in SQL Server 2008 R2 |
| --- | --- | --- | --- |
| English | 1 | 0.5 | 0.5 |
| Kanji | 1 | 1+ | 0.85 |
| Korean | 1 | 1 | 1 |
| Turkish | 1 | 0.53 | 0.52 |
| German | 1 | 0.5 | 0.5 |
| Vietnamese | 1 | 0.68 | 0.61 |
| Hindi | 1 | 1 | 0.5 |
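For readers who want to gauge the effect on their own tables before rebuilding anything, the stored procedure sp_estimate_data_compression_savings (available since SQL Server 2008) samples the data and reports current versus estimated compressed size. A sketch follows, again with the hypothetical dbo.SomeTable; on SQL Server 2008 R2 the ROW/PAGE estimate should also reflect the coupled Unicode compression, though that is my assumption rather than something stated above.

```sql
-- Sketch: estimate space savings for the hypothetical table dbo.SomeTable
-- before actually rebuilding it with compression.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'SomeTable',
     @index_id         = NULL,        -- all indexes
     @partition_number = NULL,        -- all partitions
     @data_compression = 'PAGE';
```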

What does this really mean for an SAP ERP system? In order to answer this question, I took a copy of our own 5+ TB SAP ERP database, completely defragmented it, and ran a series of exercises which ended up with the results presented in the following table:

| Compression Type | Percentage of original |
| --- | --- |
| No compression |