International Considerations for Full-Text Search

Article
12/03/2008

Choosing a Language When Creating a Full-Text Index

There are a couple of things to consider when choosing the column language when creating a full-text index. These considerations relate to how your text is tokenized and then indexed by Microsoft full-text engine for SQL Server (MSFTESQL). A word breaker tokenizes the text being indexed on word boundaries. These word boundaries, in the English language, are typically white space or some form of punctuation. In other languages, such as German, words or characters may be combined together; therefore, your choice of a column-level language should represent the language you expect will be stored in rows of that column. If you are unsure, a general best bet is to use the neutral word breaker, which performs its tokenization purely on white space and punctuation. An additional benefit of your column-level language choice is "stemming". Stemming in full-text queries is defined as the process of searching for all stemmed (inflectional) forms of a word in a particular language.

Another consideration in language choice is related to the way in which the data is represented. For data not stored in varbinary(max) column, no special filtering is performed. Rather, the text is generally passed through the word breaking component as-is. Word breakers are designed mainly to process written text. So, if you have any type of markup (such as HTML) on your text, you may not get great linguistic accuracy during indexing and search. In that case, you have two choices—the preferred method is simply to store the text data in varbinary(max) column, and to indicate its document type so it may be filtered. If this is not an option, you may consider using the neutral wordbreaker and, if possible, adding markup data (such as 'br' in HTML) to your noise word lists.

Note

Language based stemming does not come into play when you specify the neutral language.

Language Support

In Microsoft SQL Server 2005, full-text queries can use languages other than the default language for the column to search full-text data. As long as the language is supported and its resources are installed, the language specified in the LANGUAGE language_term clause of the CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE query will be used for word breaking, stemming, and thesaurus and noise-word processing.

The following table shows the language in which the full-text index data is stored. The language is based on the Unicode collation locale identifier selected during Microsoft SQL Server Setup.

Unicode collation locale identifier	Language for full-text data storage
Chinese Bopomofo (Taiwan)	Traditional Chinese
Chinese Punctuation	Simplified Chinese
Chinese Stroke Count	Simplified Chinese
Chinese Stroke Count (Taiwan)	Traditional Chinese
Dutch	Dutch
English UK	English UK
French	French
General Unicode	English US
German	German
German Phonebook	German
Italian	Italian
Japanese	Japanese
Japanese Unicode	Japanese
Korean	Korean
Korean Unicode	Korean
Spanish (Spain)	Spanish
Swedish/Finnish	Swedish

All other Unicode collation locale identifier values that are not in this list get mapped to the neutral language word-breaker and -stemmer, which uses white spaces to delimit words.

Note

The Unicode collation locale identifier setting is used against all data types eligible for full-text indexing (such as char, nchar, and so on). If you have the sort order of a char, varchar, or text type column set to a language setting different from the Unicode collation locale identifier language, the Unicode collation locale identifier is still used during full-text indexing and querying of the char, varchar, and text type columns.

Share via

International Considerations for Full-Text Search

Choosing a Language When Creating a Full-Text Index

Language Support

See Also

Other Resources

Help and Information

Additional resources