Which SQL Server Collation setting to use?

Question


techresearch7777777 1,961

Hello, a user has asked for a new, empty database on our SQL Server 2019 instance. The notable requirements are:

Encoding: UTF8

Case Insensitive

It was also mentioned that it should be similar to one of our existing older SQL Server databases, whose compatibility level is "SQL Server 2012 (110)" and whose collation is set to "Latin1_General_CI_AS".

Which collation (and any other settings) should be used for this new database, given the requirements above?

I was thinking it would need one of the collations ending in "_UTF8", since SQL Server 2019 now supports them, but I don't see an exact match named "Latin1_General_CI_AS_UTF8".

Thanks in advance.

3 answers


Answer 1

Hi @techresearch7777777

"I don't see corresponding exact name selection that would be 'Latin1_General_CI_AS_UTF8'."

Latin1_General_CI_AS_UTF8 is not a supported collation.

To find the supported collations, run the query below:

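-- List the case-insensitive, accent-sensitive Latin collations that use UTF-8.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE 'Latin%CI%AS%UTF8';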

Results:

[Image: query results, not reproduced here]

As for the compatibility level, you can set it on a database like this:

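-- Create the database with a UTF-8, case-insensitive, accent-sensitive collation.
CREATE DATABASE TestDB
COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- Match the compatibility level of the existing database (SQL Server 2012).
ALTER DATABASE TestDB
SET COMPATIBILITY_LEVEL = 110;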
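
To confirm that both settings took effect, a check along these lines should work. It is only a minimal sketch and assumes the TestDB name used in the example above; substitute your own database name.

```sql
-- Show the collation and compatibility level of the new database.
-- TestDB is the example name from above, not a required name.
SELECT name, collation_name, compatibility_level
FROM sys.databases
WHERE name = N'TestDB';
```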

Best regards,

Cosmog


Answer 2

Olaf Helper 46,286

Which SQL Server Collation setting to use?

Your post/question is a bit strange to me.

First, the server collation and the database collation are only defaults; the effective collation is defined at the table column level.
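
As a rough illustration of that point, the sketch below creates a table in which one column inherits the database default and another overrides it. The table and column names (dbo.CollationDemo, DefaultName, Utf8Name) are made up for the example, and the UTF-8 collation assumes SQL Server 2019 or later.

```sql
-- Hypothetical demo table: names are illustrative only.
CREATE TABLE dbo.CollationDemo
(
    -- No COLLATE clause: the column takes the database default collation.
    DefaultName varchar(50),
    -- Explicit COLLATE clause: overrides the database default for this column.
    Utf8Name    varchar(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

-- List the effective collation of each column.
SELECT c.name, c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.CollationDemo');
```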

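Then, the collation determines the code page only for non-Unicode text, meaning the char/varchar data types.

With a non-UTF8 collation such as Latin1_General_CI_AS, those types store 1-byte characters, while UTF-8 characters can take up to 4 bytes, so that doesn't work anyway.

In that case you have to use the Unicode types (nchar/nvarchar, stored as 2-byte UTF-16 code units); see "Collation and Unicode support".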

Answer 3

Latin1_General_100_CI_AS_SC_UTF8 is not a bad choice for a collation. However, I am not sure that the encoding mentioned in the requirement is relevant here. It may relate to what the application uses internally, but that does not have to match how data is stored in SQL Server.

In SQL Server you have nvarchar and varchar. With nvarchar, data is stored in the UTF-16 encoding, and you have access to the full range of Unicode. Or, well, exactly how big that range is depends on the version of the collation. Latin1_General_CI_AS is based on a quite old definition of Unicode and lacks support for some major languages like Hindi and Pinyin (i.e. Chinese written with Latin letters). It also has no understanding of fun things like emojis.

A collation with 100 in the name supports more characters, and if you add _SC to the name, you get full understanding of those emojis.
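
One way to see what _SC changes is to compare LEN on a character outside the Basic Multilingual Plane. This is just a sketch: the variable name is mine, and the emoji is built from its UTF-16 surrogate pair so the script does not depend on the file encoding. As far as I know, the first column should report 2 (two UTF-16 code units) and the second 1 (one character).

```sql
-- U+1F600 assembled from its surrogate pair (0xD83D, 0xDE00).
DECLARE @emoji nvarchar(10) = NCHAR(0xD83D) + NCHAR(0xDE00);

SELECT
    -- Without _SC, each surrogate code unit is counted separately.
    LEN(@emoji COLLATE Latin1_General_CI_AS)        AS LenWithoutSC,
    -- With _SC, the surrogate pair is treated as a single character.
    LEN(@emoji COLLATE Latin1_General_100_CI_AS_SC) AS LenWithSC;
```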

With varchar and a non-UTF8 collation such as Latin1_General_CI_AS, you only have support for 250 or so characters. With Latin1_General_100_CI_AS_SC_UTF8, you have support for the same set of characters with both varchar and nvarchar. The only difference is how the data is stored. For text that is mainly in English, UTF-8 gives you some space savings, but you can achieve that with row compression as well.
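
The storage difference can be checked with DATALENGTH. The sketch below uses a temp table with made-up column names and assumes SQL Server 2019 or later for the UTF-8 collation; for a plain ASCII word, UTF-8 takes one byte per character and UTF-16 takes two.

```sql
-- Hypothetical temp table: same text stored as UTF-8 varchar and UTF-16 nvarchar.
CREATE TABLE #EncodingDemo
(
    Utf8Text  varchar(40)  COLLATE Latin1_General_100_CI_AS_SC_UTF8,
    Utf16Text nvarchar(40) COLLATE Latin1_General_100_CI_AS_SC
);

INSERT INTO #EncodingDemo (Utf8Text, Utf16Text)
VALUES (N'collation', N'collation');

-- DATALENGTH returns bytes: 9 for the UTF-8 column, 18 for the UTF-16 column.
SELECT DATALENGTH(Utf8Text)  AS Utf8Bytes,
       DATALENGTH(Utf16Text) AS Utf16Bytes
FROM #EncodingDemo;

DROP TABLE #EncodingDemo;
```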

But how you store the data internally in SQL Server has little relevance to the application. The client APIs will take care of the conversions needed. But the application may need to state that data is UTF-8 so that the APIs understand it. And this matters, no matter if you have varchar or nvarchar in the database.
