SQL Server and UTF-8 Encoding (1) - True or False

12/29/2011

Because there is some confusion about SQL Server's support for UTF-8 and Unicode data, I would like to share the following post from Qingsong Yao, which provides very useful content on the subject:

“Today, I will start my series of articles about SQL Server and Unicode UTF-8 encoding. I have often found that when people ask me about UTF-8, they do not actually understand what UTF-8 is. So today's talk will be quite short: I will just clarify some misunderstandings.

1. SQL Server doesn't support Unicode, do you mean UTF-8?

Sometimes people simply say "SQL Server doesn't support Unicode". That is wrong: SQL Server has supported Unicode since SQL Server 7.0 through the nchar/nvarchar/ntext data types. What SQL Server doesn't support is the UTF-8 encoding for Unicode data; it supports the UTF-16 encoding. Here are several definitions of these concepts, copied from the Internet:

    "Unicode is the universal character encoding, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The unique number, which is called the code point of a character, is in the range 0 (hex) to 10FFFF (hex). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For example, the character A is named "LATIN CAPITAL LETTER A" and is assigned the code point U+0041.

    In addition to assigning each character a unique code point, the Unicode Consortium also defined several Unicode transformation formats (UTFs), which are algorithmic mappings from every Unicode code point to a unique byte sequence. Note that a Unicode code point by itself says nothing about how the character is stored or transmitted in a computer; a UTF does.

    The commonly used UTFs are UTF-16 and UTF-8. UTF-8 is the byte-oriented encoding form of Unicode, commonly used on Unix operating systems and in web HTML and XML files. UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode. UTF-16 is commonly used on Windows, .NET and Java. The transformations between different UTFs are lossless and round-trip safely."
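To make the distinction between a code point and its encoded bytes concrete, here is a minimal sketch in Python (chosen only for illustration; it is not part of SQL Server). The helper name `describe` and the sample characters are assumptions made for this example.

```python
# One code point, two different byte sequences depending on the UTF used.
samples = ["A", "\u00e9", "\u4e2d", "\U00020000"]  # U+0041, U+00E9, U+4E2D, U+20000

def describe(ch):
    """Return (code point, UTF-8 bytes as hex, UTF-16 little-endian bytes as hex)."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # little-endian, without a byte-order mark
    return (f"U+{ord(ch):04X}", utf8.hex(), utf16.hex())

for ch in samples:
    print(describe(ch))

# ('U+0041', '41', '4100')
# ('U+00E9', 'c3a9', 'e900')
# ('U+4E2D', 'e4b8ad', '2d4e')
# ('U+20000', 'f0a08080', '40d800dc')
```

The last sample lies outside the first 64K code points, so UTF-16 needs two 16-bit code units (a surrogate pair) for it, while UTF-8 needs four bytes.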

In summary, don't confuse Unicode with the UTF-8 encoding. They are entirely different concepts.

2. UTF-8 Encoding is much better than UTF-16 Encoding

There are tons of articles comparing the UTF-8 encoding with the UTF-16 encoding, and I will compare the two side by side in my next article. I could give 100 reasons why UTF-8 is better than UTF-16, and vice versa. The correct answer is that no encoding is absolutely better than the others. You should choose a suitable encoding according to the requirements of your application. The operating system, programming language and database platform all matter when choosing the encoding. UTF-8 is most common on the web. UTF-16 is used by Java and Windows. The conversions between all of them are algorithmic, fast and lossless. This makes it easy to support data input or output in multiple formats while using one particular UTF for internal storage or processing.
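As a rough illustration of why neither encoding wins everywhere, the sketch below compares encoded sizes for two sample strings and converts between the two UTFs. It uses Python for illustration only; the helper names and sample strings are assumptions for this example.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

def utf8_to_utf16(data):
    """Convert UTF-8 bytes to UTF-16 (little-endian) bytes."""
    return data.decode("utf-8").encode("utf-16-le")

def utf16_to_utf8(data):
    """Convert UTF-16 (little-endian) bytes back to UTF-8 bytes."""
    return data.decode("utf-16-le").encode("utf-8")

english = "SQL Server"      # 10 ASCII characters
chinese = "数据库服务器"      # 6 CJK characters, all below U+10000

print(encoded_sizes(english))  # (10, 20): UTF-8 is smaller
print(encoded_sizes(chinese))  # (18, 12): UTF-16 is smaller

# Converting to the other UTF and back returns the original bytes.
original = chinese.encode("utf-8")
print(utf16_to_utf8(utf8_to_utf16(original)) == original)  # True
```

ASCII-heavy text is smaller in UTF-8, while text made mostly of CJK characters is smaller in UTF-16, which is one reason the right choice depends on the application.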

So please don't dismiss SQL Server's Unicode support just because it supports only one of the UTFs.

3. SQL Server cannot store all Unicode Characters

You may have noticed that I said above that SQL Server supports the UTF-16 encoding, while I have also said elsewhere that the encoding of SQL Server's nvarchar type is UCS-2. I do this deliberately, because I want to discuss the difference between these two encodings here. Let us describe in detail the areas in which SQL Server supports the UTF-16 encoding:

SQL Server can store any Unicode character in UTF-16 encoding. The reason is that the storage format of UCS-2 and UTF-16 is the same: both are sequences of 16-bit code units, so a supplementary character is simply stored as two of them.
SQL Server can display any Unicode character in UTF-16 encoding. The reason is that we internally call Windows functions to display characters, and the Windows functions and fonts handle supplementary characters (characters that take 4 bytes in UTF-16) correctly.
SQL Server can accept any Unicode character in UTF-16 encoding as input. The reason is that we internally rely on Windows IMEs (Input Method Editors) for input, and the Windows IMEs handle supplementary characters (characters that take 4 bytes in UTF-16) correctly.
SQL Server can sort and compare any defined Unicode character in UTF-16 encoding. Note that not every code point maps to a defined Unicode character. For example, the Unicode Standard, Version 5.1, defines around 100,000 characters. All of these characters can be compared and sorted in the latest version of SQL Server, SQL Server 2008.
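The first point above depends on a supplementary character being nothing more than two ordinary 16-bit values once encoded. The sketch below shows the standard UTF-16 surrogate arithmetic in Python (used here only for illustration); `to_surrogate_pair` and `stored_bytes` are hypothetical helper names, not SQL Server functions.

```python
import struct

def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

def stored_bytes(code_point):
    """Pack the surrogate pair as two little-endian 16-bit values."""
    high, low = to_surrogate_pair(code_point)
    return struct.pack("<HH", high, low)

cp = 0x20000
print([hex(u) for u in to_surrogate_pair(cp)])          # ['0xd840', '0xdc00']
print(stored_bytes(cp) == chr(cp).encode("utf-16-le"))  # True
```

A store that only knows about 16-bit units can hold these two values unchanged, which is why the bytes survive even when the engine itself treats them as two separate characters.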

In contrast, here are the UTF-16 features that SQL Server doesn't support:

SQL Server cannot detect invalid UTF-16 sequences. An unpaired surrogate is not valid in the UTF-16 encoding, but SQL Server accepts it as valid. Note that in reality an end user is unlikely to input an invalid UTF-16 sequence, since such sequences are not used in any language or produced by any IME.
SQL Server treats a UTF-16 supplementary character as two characters. The Len function returns 2 instead of 1 for such input.
SQL Server risks breaking a UTF-16 supplementary character into unpaired surrogates, for example when calling the substring function. Note that in real scenarios the chance of this happening is low, because 1) supplementary characters are rare, and 2) a string function only breaks a character when the cut falls on its boundary. For example, substring(s,5,1) breaks a character if and only if the character at index 5 is a supplementary character.
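The three limitations above all follow from counting 16-bit code units instead of characters. The sketch below emulates that behaviour in Python; it does not run against SQL Server, and `code_unit_len`, `code_unit_substring` and `has_unpaired_surrogate` are hypothetical helpers written for this illustration.

```python
import struct

def units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

def code_unit_len(text):
    """Length counted in 16-bit code units, as described for Len above."""
    return len(units(text))

def code_unit_substring(text, start, length):
    """Slice by code units with a 1-based start, like substring(s, start, length)."""
    return units(text)[start - 1:start - 1 + length]

def has_unpaired_surrogate(unit_list):
    """Return True if any surrogate is missing its partner."""
    i = 0
    while i < len(unit_list):
        u = unit_list[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate: must be followed by a low one
            if i + 1 < len(unit_list) and 0xDC00 <= unit_list[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:  # low surrogate with no high surrogate before it
            return True
        i += 1
    return False

s = "abcd\U00020000e"  # the supplementary character sits at index 5
print(code_unit_len("\U00020000"))                            # 2
print(has_unpaired_surrogate(code_unit_substring(s, 5, 1)))   # True: pair was split
print(has_unpaired_surrogate(code_unit_substring(s, 5, 2)))   # False: pair kept whole
```

A validity check of the kind sketched in `has_unpaired_surrogate` is what the text says SQL Server does not perform on input.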

In summary, SQL Server DOES support storing all Unicode characters, although it has its own limitations. Please refer to my previous blog posts for details.”

Comments

Anonymous
December 30, 2011
Short but very useful post (even with my poor English). I think I have understood most of your explanations. It was a good idea to recall the limitations of SQL Server's UTF-16 support and the difference in storage format between UTF-16 and UCS-2 (none...). I hope you will go on producing such good articles. Thanks
