International Considerations for Full-Text Search
Choosing a Language When Creating a Full-Text Index
There are a couple of things to consider when choosing the column language when creating a full-text index. These considerations relate to how your text is tokenized and then indexed by Microsoft full-text engine for SQL Server (MSFTESQL). A word breaker tokenizes the text being indexed on word boundaries. These word boundaries, in the English language, are typically white space or some form of punctuation. In other languages, such as German, words or characters may be combined together; therefore, your choice of a column-level language should represent the language you expect will be stored in rows of that column. If you are unsure, a general best bet is to use the neutral word breaker, which performs its tokenization purely on white space and punctuation. An additional benefit of your column-level language choice is "stemming". Stemming in full-text queries is defined as the process of searching for all stemmed (inflectional) forms of a word in a particular language.
Another consideration in language choice is related to the way in which the data is represented. For data not stored in varbinary(max) column, no special filtering is performed. Rather, the text is generally passed through the word breaking component as-is. Word breakers are designed mainly to process written text. So, if you have any type of markup (such as HTML) on your text, you may not get great linguistic accuracy during indexing and search. In that case, you have two choices—the preferred method is simply to store the text data in varbinary(max) column, and to indicate its document type so it may be filtered. If this is not an option, you may consider using the neutral wordbreaker and, if possible, adding markup data (such as 'br' in HTML) to your noise word lists.
Note
Language based stemming does not come into play when you specify the neutral language.
Language Support
In Microsoft SQL Server 2005, full-text queries can use languages other than the default language for the column to search full-text data. As long as the language is supported and its resources are installed, the language specified in the LANGUAGE language_term
clause of the CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE query will be used for word breaking, stemming, and thesaurus and noise-word processing.
The following table shows the language in which the full-text index data is stored. The language is based on the Unicode collation locale identifier selected during Microsoft SQL Server Setup.
Unicode collation locale identifier | Language for full-text data storage |
---|---|
Chinese Bopomofo (Taiwan) |
Traditional Chinese |
Chinese Punctuation |
Simplified Chinese |
Chinese Stroke Count |
Simplified Chinese |
Chinese Stroke Count (Taiwan) |
Traditional Chinese |
Dutch |
Dutch |
English UK |
English UK |
French |
French |
General Unicode |
English US |
German |
German |
German Phonebook |
German |
Italian |
Italian |
Japanese |
Japanese |
Japanese Unicode |
Japanese |
Korean |
Korean |
Korean Unicode |
Korean |
Spanish (Spain) |
Spanish |
Swedish/Finnish |
Swedish |
All other Unicode collation locale identifier values that are not in this list get mapped to the neutral language word-breaker and -stemmer, which uses white spaces to delimit words.
Note
The Unicode collation locale identifier setting is used against all data types eligible for full-text indexing (such as char, nchar, and so on). If you have the sort order of a char, varchar, or text type column set to a language setting different from the Unicode collation locale identifier language, the Unicode collation locale identifier is still used during full-text indexing and querying of the char, varchar, and text type columns.
See Also
Other Resources
Full-Text Search Concepts
Data Types (Transact-SQL)
CONTAINS (Transact-SQL)
CONTAINSTABLE (Transact-SQL)
FREETEXT (Transact-SQL)
FREETEXTTABLE (Transact-SQL)