Languages Supported by Windows Search
This topic describes how Windows Search supports multiple languages.
Tokenization, Wordbreakers, and Language Resources
Windows Search is language-independent, but the accuracy of search across languages may vary because of the way wordbreakers tokenize text. Wordbreakers implement various tokenization rules for languages and break text into individual tokens, or words, to be indexed or searched.
Both the indexed text and the query string are broken into tokens. Because tokenization rules vary by language, there are separate wordbreakers for each language or family of languages. If the query language does not match the language of the indexed text, the results can be unpredictable.
Windows Search ships with a well-defined set of wordbreakers. Classic wordbreaker and stemmer components are supported in Windows Vista and later. If the language of a document cannot be determined, Windows Search tries to identify the most appropriate wordbreaker. It first calls the GetSystemPreferredUILanguages function to get the first Multilingual User Interface (MUI) language, which is typically the system UI language unless MUI language packs are installed. If that call succeeds, the wordbreaker for the first MUI language is used. If the call fails, Windows Search retrieves the system locale by calling the GetSystemDefaultLCID function and uses the wordbreaker associated with that locale.
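The fallback order described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Windows Search code: `choose_locale` and its parameters are hypothetical names, the preferred UI languages are modeled as a list of LCIDs, and a failed call to GetSystemPreferredUILanguages is modeled as `None`.

```python
from typing import Optional, Sequence


def choose_locale(
    document_lcid: Optional[int],
    preferred_ui_lcids: Optional[Sequence[int]],
    system_default_lcid: int,
) -> int:
    """Pick the LCID whose wordbreaker should be used (hypothetical helper).

    document_lcid       -- locale reported for the document, or None if unknown
    preferred_ui_lcids  -- stand-in for the GetSystemPreferredUILanguages result;
                           None (or an empty list) models a failed call
    system_default_lcid -- stand-in for the GetSystemDefaultLCID result
    """
    if document_lcid is not None:
        return document_lcid          # language of the document is known
    if preferred_ui_lcids:
        return preferred_ui_lcids[0]  # first MUI language
    return system_default_lcid        # last resort: system locale


# Unknown document language, Japanese is the first MUI language.
print(hex(choose_locale(None, [0x0411, 0x0409], 0x0407)))
```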
If no wordbreaker is installed for a language, Windows Search breaks on white space by using the Neutral wordbreaker.
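To see why breaking on white space matters, the following Python sketch approximates it with `str.split`. The `neutral_break` function is a hypothetical stand-in, not the actual Neutral wordbreaker, which may apply additional rules.

```python
def neutral_break(text: str) -> list[str]:
    """Approximate white-space-only tokenization (illustrative assumption)."""
    return text.split()


# Space-delimited text splits into separate words.
print(neutral_break("search the index"))

# Text written without spaces stays a single token, so a query for one
# word inside it would not match on token boundaries.
print(neutral_break("検索インデックスを再構築する"))
```

For languages that do not separate words with spaces, this is the kind of result that makes a language-specific wordbreaker necessary for accurate search.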
You can remove a language through the registry, as illustrated in the following example.
HKEY_LOCAL_MACHINE
   SYSTEM
      CurrentControlSet
         Control
            ContentIndex
               Language
                  Dutch_Dutch
                     (Default)
                     Locale
                     NoiseFile
                     StemmerClass = CLSID
                     WBreakerClass = CLSID
Tip
If you make changes to the registry, restart Windows Search.
When Windows Search requires a new wordbreaker, the class identifier (CLSID) is read, and the instantiated wordbreaker is cached.
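The read-once, reuse-afterward pattern in the previous paragraph can be pictured with a small Python sketch. Everything in it is hypothetical: `WordBreakerCache`, the `create` factory, and the placeholder CLSID string are illustrative names and do not reflect the internal implementation of Windows Search.

```python
from typing import Callable, Dict


class WordBreakerCache:
    """Create a wordbreaker on first use, then reuse it (illustrative only)."""

    def __init__(
        self,
        clsid_for_language: Dict[str, str],
        create: Callable[[str], object],
    ) -> None:
        self._clsids = clsid_for_language  # models WBreakerClass registry values
        self._create = create              # models instantiating a class by CLSID
        self._instances: Dict[str, object] = {}

    def get(self, language: str) -> object:
        if language not in self._instances:
            clsid = self._clsids[language]  # read the CLSID once
            self._instances[language] = self._create(clsid)
        return self._instances[language]    # later calls hit the cache


created = []  # records every instantiation request


def fake_create(clsid: str) -> object:
    created.append(clsid)
    return object()


cache = WordBreakerCache({"Dutch_Dutch": "{placeholder-clsid}"}, fake_create)
first = cache.get("Dutch_Dutch")
second = cache.get("Dutch_Dutch")
print(first is second, len(created))
```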
You can create a custom wordbreaker for a language by implementing the IWordBreaker interface. Windows Search then calls the IWordBreaker methods when it builds content indexes and runs queries.
Locale information for indexed content is retrieved from the source of the content. If the source implementer does not know the locale of the indexed content, it should set the locale to LOCALE_NEUTRAL.
For example, if you implement a filter handler (an implementation of the IFilter interface), property handler, or protocol handler, you should set the locale for indexed content to LOCALE_NEUTRAL unless you have specific locale information and are confident of its accuracy.
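The decision rule above amounts to a single check. A minimal Python sketch, assuming a hypothetical helper `locale_for_chunk` and a caller-supplied reliability flag (how a handler judges reliability is outside what this topic specifies):

```python
from typing import Optional

LOCALE_NEUTRAL = 0x0000  # numeric value of the Windows LOCALE_NEUTRAL constant


def locale_for_chunk(known_lcid: Optional[int], is_reliable: bool) -> int:
    """Return the LCID to report for indexed content (hypothetical helper)."""
    if known_lcid is not None and is_reliable:
        return known_lcid
    # Unknown or doubtful locale: report neutral rather than guess.
    return LOCALE_NEUTRAL


print(hex(locale_for_chunk(0x0413, True)))
print(hex(locale_for_chunk(0x0413, False)))
```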
Tip
If an index query is based on user input, the locale should match the language in which the user is typing. You can determine this locale by calling the GetKeyboardLayout function.
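GetKeyboardLayout returns an input locale identifier (HKL) whose low word is the language identifier of the input language. The Python sketch below shows only that arithmetic on sample values. It does not call the Windows API, and `lcid_from_hkl` is a hypothetical helper name.

```python
SORT_DEFAULT = 0x0  # default sort order identifier


def lcid_from_hkl(hkl: int) -> int:
    """Derive an LCID from an HKL value (illustrative arithmetic only)."""
    langid = hkl & 0xFFFF                  # low word: language identifier
    return (SORT_DEFAULT << 16) | langid   # same layout as MAKELCID


sample_hkl = 0x04090409  # sample value: US English input language
print(hex(lcid_from_hkl(sample_hkl)))  # 0x409
```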
Languages Supported by Wordbreakers
Windows Search includes wordbreakers to support the following languages.
Registry key | Language (sublanguage) | LCID |
---|---|---|
Arabic_SaudiArabia | Arabic (Neutral) | 0x0001 |
Bengali_Default | Bangla (Neutral) | 0x0045 |
Bulgarian_Default | Bulgarian (Bulgaria) | 0x0402 |
Catalan_Default | Catalan (Catalan) | 0x0403 |
Chinese_HongKong | Chinese (Hong Kong SAR, PRC) | 0x0C04 |
Chinese_Simplified | Chinese (Simplified) | 0x0804 |
Chinese_Traditional | Chinese (Traditional) | 0x0404 |
Croatian_Default | Croatian (Croatia) | 0x041A |
Czech_Default | Czech (Czech Republic) | 0x0405 |
Danish_Default | Danish (Denmark) | 0x0406 |
Dutch_Dutch | Dutch (Netherlands) | 0x0413 |
English_UK | English (United Kingdom) | 0x0809 |
English_US | English (United States) | 0x0409 |
Finnish_Default | Finnish (Finland) | 0x040B |
French_French | French (France) | 0x040C |
German_German | German (Germany) | 0x0407 |
Greek_Default | Greek (Greece) | 0x0408 |
Gujarati_Default | Gujarati (India) | 0x0447 |
Hebrew_Default | Hebrew (Neutral) | 0x000D |
Hindi_Default | Hindi (India) | 0x0439 |
Hungarian_Default | Hungarian (Hungary) | 0x040E |
Icelandic_Default | Icelandic (Iceland) | 0x040F |
Indonesian_Default | Indonesian (Indonesia) | 0x0421 |
Italian_Italian | Italian (Italy) | 0x0410 |
Japanese_Default | Japanese (Japan) | 0x0411 |
Kannada_Default | Kannada (India) | 0x044B |
Korean_Default | Korean (Korea) | 0x0412 |
Latvian_Default | Latvian (Latvia) | 0x0426 |
Lithuanian_Default | Lithuanian (Lithuania) | 0x0427 |
Malay_Malaysia | Malay (Malaysia) | 0x043E |
Malayalam_Default | Malayalam (Neutral) | 0x004C |
Marathi_Default | Marathi (India) | 0x044E |
Norwegian_Bokmal | Norwegian (Bokmål, Norway) | 0x0414 |
Polish_Default | Polish (Poland) | 0x0415 |
Portuguese_Portugal | Portuguese (Portugal) | 0x0816 |
Portuguese_Brazil | Portuguese (Brazil) | 0x0416 |
Punjabi_Default | Punjabi (India) | 0x0446 |
Romanian_Default | Romanian (Romania) | 0x0418 |
Russian_Default | Russian (Neutral) | 0x0019 |
Serbian_Cyrillic | Serbian (Serbia and Montenegro, Former, Cyrillic) | 0x0C1A |
Serbian_Latin | Serbian (Serbia and Montenegro, Former, Latin) | 0x081A |
Slovak_Default | Slovak (Slovakia) | 0x041B |
Slovenian_Default | Slovenian (Slovenia) | 0x0424 |
Spanish_Modern | Spanish (Spain, Modern Sort) | 0x0C0A |
Swedish_Default | Swedish (Sweden) | 0x041D |
Tamil_Default | Tamil (India) | 0x0449 |
Telugu_Default | Telugu (India) | 0x044A |
Thai_Default | Thai (Thailand) | 0x041E |
Turkish_Default | Turkish (Türkiye) | 0x041F |
Ukrainian_Default | Ukrainian (Ukraine) | 0x0422 |
Urdu_Default | Urdu (Pakistan) | 0x0420 |
Vietnamese_Default | Vietnamese (Vietnam) | 0x042A |
Note
LCIDs for some languages in the table are generated using the language identifier, sublanguage identifier, and sort identifier.
For more information about languages and associated identifiers, see Language Identifier Constants and Strings.
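As a worked example of how those identifiers combine, the Python sketch below mirrors the bit layout used by the Windows MAKELANGID and MAKELCID macros: the primary language occupies the low 10 bits, the sublanguage the next 6 bits, and the sort identifier the 4 bits above the language identifier. The function names are illustrative, not part of any Windows API.

```python
def make_langid(primary: int, sublang: int) -> int:
    """Combine primary language and sublanguage, as MAKELANGID does."""
    return (sublang << 10) | primary


def make_lcid(langid: int, sort_id: int = 0) -> int:
    """Combine a language identifier and a sort identifier, as MAKELCID does."""
    return (sort_id << 16) | langid


def split_lcid(lcid: int) -> tuple[int, int, int]:
    """Return (primary language, sublanguage, sort identifier)."""
    langid = lcid & 0xFFFF
    return langid & 0x3FF, langid >> 10, (lcid >> 16) & 0xF


# Dutch (Netherlands): primary language 0x13, sublanguage 0x01.
print(hex(make_lcid(make_langid(0x13, 0x01))))  # 0x413

# Spanish (Spain, Modern Sort): 0x0C0A splits into primary 0x0A, sublanguage 0x03.
print(split_lcid(0x0C0A))  # (10, 3, 0)
```

Rows in the table whose sublanguage is Neutral, such as 0x0001 for Arabic, carry only the primary language identifier.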
Note
There is no guarantee that all of these language registry keys will be present on any given machine. The wordbreaker for a given language may or may not be installed on the machine, depending on user settings.
Beginning in Windows 8.1, the preferred way to use wordbreakers is through the WordsSegmenter class in the Windows Runtime (WinRT) API.
Additional Resources
- For information on how to implement and use custom word breakers and stemmers for additional languages and locales, see Extending Language Resources in Windows Search.
- If you need to identify the language of a piece of text, you can use Language Auto-Detection (LAD), which is available in Windows 7 and later. For more information, see Extended Linguistic Services (ELS).
- For information on managing, querying, and extending the index, see the Windows Search Developer's Guide.