Sorting and comparing unicode and binary data - why do I get a weird sort order?
A customer recently questioned me regarding a sorting issue they were having in the following scenario:
My database is running under the "SQL_Latin1_General_CP1_CI_AS" Sql collation, but when I run
the script below I seem to get incorrect sort results. Why does the "abc-test" item come
immediately after the "abctest" item and not after "abctest1" as I would expect it to?
create table #temp1 (
ServerItem nvarchar(260) collate SQL_Latin1_General_CP1_CI_AS primary key clustered
)
insert #temp1
select 'abctest'
insert #temp1
select 'abc-test'
insert #temp1
select 'abctest1'
insert #temp1
select 'abc-test1'
select * from #temp1
order by ServerItem
drop table #temp1
RESULTS:
ServerItem
------------
abctest
abc-test
abctest1
abc-test1
Is this right? Is this a bug?
No, this is not a bug - and yes, this is correct and by design. The reasoning lies in how certain SQL collations particularly handle sorting rules for unicode data...this collation in particular uses a word sort algorithm to sort the unicode data, which ignores punctuation characters (and hyphen is one of those, amoung many other such as apostrophe's, dialect characters, etc.). The simple rule to follow when comparing or sorting multiple unicode strings is to not assume that it is done character-by-character, as it isn't.
A similar distinction occurs when using a Binary code page, like Latin1_General_BIN for example (popular amoung many cross-platform COTS applications). Did you know that if you are using a binary sort order that UPPERCASE letters will always sort before lowercase letters? This is because binary sorting sorts based on the actual bit representation of the characters, and UPPERCASE letters have a lower ascii-code than that of lowercase letters. Using a similar example as above, here is what you would see using a binary sort order:
create table #temp1 (ServerItem nvarchar(260) collate Latin1_General_BIN primary key clustered)
insert #temp1
select 'abctest'
insert #temp1
select 'ABCTEST'
insert #temp1
select 'BBCTEST'
insert #temp1
select 'bbctest'
select * from #temp1
order by ServerItem
drop table #temp1
RESULTS:
ServerItem
-------------
ABCTEST
BBCTEST
abctest
bbctest
See how BBCTEST comes before abctest? You'd see the same type of result if you changed "BBCTEST" to just "Bbctest".
Multiple things are taken into consideration when sorting/comparing data in Sql Server (and many other platforms) including your collation setting, code page, binary/non-binary, unicode vs. ascii data, etc. Be sure to understand differences between all characteristics of your system before deploying...
Chad Boyd ~~~ This posting is provided "AS IS" with no warranties, and confers no rights. Use of any included script samples are subject to the terms specified at https://www.microsoft.com/info/cpyright.htm.