Tertiary Collation and the performance impact on order clause
Today, One customer asked about collation SQL_Latin1_General_CP1_CI_AI with non unique nonclustered index we are getting a sort on the query plan. The sample script is:
CREATE TABLE TableWithASColumn(ID INT PRIMARY KEY, CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS)
CREATE NONCLUSTERED INDEX IX_CharData ON TableWithASColumn(CharData)
CREATE TABLE TableWithAIColumn(ID INT PRIMARY KEY,CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AI)
CREATE NONCLUSTERED INDEX IX_CharData ON TableWithAIColumn(CharData)
CREATE TABLE TableWithAIColumnUniqueIndex(ID INT PRIMARY KEY,CharData VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AI)
CREATE UNIQUE NONCLUSTERED INDEX IX_CharData ON TableWithAIColumnUniqueIndex(CharData)
SELECT ID, CharData FROM TableWithASColumn ORDER BY CharData
SELECT ID, CharData FROM TableWithAIColumn ORDER BY CharData
SELECT ID, CharData FROM TableWithAIColumnUniqueIndex ORDER BY CharData
If you run the script inside SQL Server Management Studio, and Show the Actual Execution Plan, you will see below picture:
At a glance, it seems very weird because building an index on a key column means that the index already sort the data according to the key column, we just need scan the index, and the output should already be sorted. Another wired issue is that this only happen on certain collation. Before we talk about the root cause, let us discuss a little about comparison and sort.
A typical comparison interface is int Compare(object Other) which compare this instance with another instance. It returns 0 if two objects are equal, –1 if this instance is smaller, and 1 if it is bigger. The sort algorithm is implemented by calling the Compare Method for two objects, and depends on the result of Compare() method, put two objects in the correct order. If two objects are equal, the order of these two objects on totally random since there is no other way to ordering them.
Now, let us look at the collation cases. Suppose I am using a case insensitive collation, such as latin1_general_ci_ai it will treat character ‘b’ and ‘B’ are equal. So when I order by the column, I know that character a will be before ‘b’ and ‘B’, but the order of ‘b’ or ‘B"’ will be totally random.
What about a user want to sort ‘b’ before ‘B’ and treat them as equal during comparison (I think it is a totally reasonable requirement). In this case, we need a comparison case insensitive and sorting case sensitive collation. SQL Server don’t have such collation flag to allow us to achieve this officially. However, there are a bunch of “old” collation (When I say old, I mean these collations are not changed/updated since 2000, and are not recommended to use in future development), which actually implement this feature. Such collations are called SQL TERTIARY Collation, sql_latin1_cp1_ci_ai is one of such kind of collation. In such collation, we internal have a tertiary weight table which defines a weight for each character. During sorting on columns, and if two characters are the same, we will lookup the table to find the tertiary weight of the two characters, and sorting them according to the tertiary weight.
As you see from the description, if I issue an order by clause on a column with TERTIARY collation, I can’t simply scan the index defined on column since the index is not sorting any more (index is built on top of comparison, but not sort order), and that is reason that we have to put a SORT on top of the index scan which explain the above weird result.
After search the Internet, I found one solution for this issue described in KB Article at https://support.microsoft.com/kb/951935. You can also look at TERTIARY_WEIGHTS (Transact-SQL) at https://msdn.microsoft.com/en-us/library/ms186881.aspx to get a list of SQL TERTIARY Collations
Now, the question is that should I use such collation in my application. My suggestion is that don’t use these collations since they are “old” collation, and they can not sort all characters correctly. Another thing you need to keep in mind is that the above behavior ONLY apply to char/varchar column, but not NVARCHAR column. So it is another reason is avoid using these collations (because you might want consistent behavior between char and nchar columns).