Condividi tramite


Sorting undefined characters in Unicode and/or Windows collation

When comparing two Unicode strings in non-binary collations, SQL Server uses a library that is essentially same as the Windows API CompareStringW. It defines a weight for each recognized character, and then use the weights to compare characters. However, not all code points are defined in the sorting library. They may be undefined because:

1) The code point is not defined in Unicode standard.
2) The code point is defined in Unicode standard, but not defined by Windows yet. It takes time and effort to define linguistic sorting semantics for new characters. Windows team typically needs to work with local standard body and/or regional PMs to define sorting rules for new characters. They add new character support in every release, and try to catch up. Some characters may have font defined, therefore could be correctly displayed, but still not defined in terms of comparison. For example, NCHAR(13144) - NCHAR(13174).
3) The code point is defined in Windows, but not defined in SQL Server yet.

Windows NLS team has decided that undefined characters are ignored during string comparison, partly because there is no real good way to compare them against other defined characters. SQL Server inherited this semantics. This does cause some confusing behavior. See below examples.

declare @undefined_char1 nvarchar(10), @undefined_char2 nvarchar(10)
set @undefined_char1 = nchar(0x0000)
set @undefined_char2 = nchar(13144)
select 'Undefine characters compare equal to empty string'
where @undefined_char1 = ''
select 'All undefine characters compare equal'
where @undefined_char1 = @undefined_char2
go

create table t (c nvarchar(10))
go
create unique index it on t(c)
go
-- first insert succeeds, but second insert fails with duplicate key error.
insert t values (nchar(0x0000))
insert t values (nchar(13144))
go

As you can see, since all undefined characters compare equal, they could cause duplicate key errors. Similarly, if you create one table with name of an undefined character, and then try to create another table with another undefined character, the second table creation would fail due to duplicate names, even though the code points of the two undefined characters are different. This could also cause confusing results in string matching builtins such as CHARINDEX, PATINDEX, or LIKE.

While these results seem confusing, the basic rule is actually very simple, i.e., undefined characters are ignored during comparison and string matching. Once you understand and remember this rule, the behavior should be easy to understand.

There have been arguments whether undefined characters should be ignored. Since this is the behavior on Windows platform, and there is not a definitely better way to sort them, and for backwards compatibility, we are going to maintain this behavior.

If your app needs to work with these undefined characters and expect to treat them as regular characters, you can use binary collation. In binary collation, comparison is done purely based on code points, not linguistic rules, so there is no notion about defined vs. undefined.

Comments