Languages Supported by Windows Search

06/20/2022

This topic describes how Windows Search supports multiple languages.

Tokenization, Wordbreakers, and Language Resources

Windows Search is language-independent, but the accuracy of search across languages may vary because of the way wordbreakers tokenize text. Wordbreakers implement various tokenization rules for languages and break text into individual tokens, or words, to be indexed or searched.

Both the indexed text and the query string are broken into tokens. Because tokenization rules vary by language, there are separate wordbreakers for each language or family of languages. If the query language does not match the language of the indexed text, the two can be tokenized differently, and the results can be unpredictable.

Windows Search ships with a well-defined set of wordbreakers. Classic wordbreaker and stemmer components are supported in Windows Vista and later. If the language of a document cannot be determined, Windows Search attempts to identify the most appropriate wordbreaker from the system's language settings. It first calls the GetSystemPreferredUILanguages function to determine the first Multilingual User Interface (MUI) language, which is typically the system UI language unless MUI language packs are installed. If that call succeeds, the wordbreaker for the first MUI language is used. If the call fails, Windows Search retrieves the system locale by calling the GetSystemDefaultLCID function and uses the wordbreaker associated with that locale.
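
The fallback order described above can be sketched as follows. This is only an illustration of the decision logic, not the Windows Search implementation: pick_fallback_locale and its two callable parameters are hypothetical stand-ins for the GetSystemPreferredUILanguages and GetSystemDefaultLCID calls, injected so that the logic can be exercised without calling into Windows.

```python
from typing import Callable, Optional, Sequence


def pick_fallback_locale(
    get_preferred_ui_languages: Callable[[], Optional[Sequence[int]]],
    get_system_default_lcid: Callable[[], int],
) -> int:
    """Return the identifier whose wordbreaker would be used as a fallback.

    Both parameters are hypothetical stand-ins for the Win32 calls named in
    the text. The first returns an ordered list of language identifiers, or
    None when the call fails; the second returns the system locale.
    """
    languages = get_preferred_ui_languages()
    if languages:
        # The call succeeded: use the first MUI language.
        return languages[0]
    # The call failed: fall back to the system locale.
    return get_system_default_lcid()


# Example with made-up return values: Dutch is the first UI language.
chosen = pick_fallback_locale(lambda: [0x0413, 0x0409], lambda: 0x0409)
print(f"0x{chosen:04X}")
```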

If no wordbreaker is installed for a language, Windows Search breaks on white space by using the Neutral wordbreaker.
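
To see why breaking on white space alone can reduce accuracy, consider the following minimal sketch. The neutral_break function is a hypothetical approximation that only splits on white space; the actual Neutral wordbreaker may apply additional rules that are not described here.

```python
from typing import List


def neutral_break(text: str) -> List[str]:
    """Approximate whitespace-only tokenization (hypothetical helper)."""
    # str.split() with no arguments splits on any run of white space.
    return text.split()


# Space-delimited text yields one token per word.
english_tokens = neutral_break("Windows Search indexes files")

# Text written without spaces between words stays a single token, so a
# query for a word inside it does not match any indexed token exactly.
japanese_tokens = neutral_break("東京都に住んでいます")
query_tokens = neutral_break("東京")

print(english_tokens)
print(len(japanese_tokens))
print(any(token in japanese_tokens for token in query_tokens))
```

A language-specific wordbreaker would be expected to split the second string into several words, which is why installing the right wordbreaker matters for languages that do not separate words with spaces.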

You can remove a language through the registry, as illustrated in the following example.

HKEY_LOCAL_MACHINE
   SYSTEM
      CurrentControlSet
         Control
            ContentIndex
               Language
                  Dutch_Dutch
                     (Default)
                     Locale
                     NoiseFile
                     StemmerClass = CLSID
                     WBreakerClass = CLSID
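
Before changing a language entry, it can help to inspect what is currently registered. The following sketch reads the values under one language key by using Python's standard winreg module. The helper names language_key_path and read_language_entry are hypothetical, and the sketch assumes the key layout shown above; it only reads the registry and does not modify it.

```python
try:
    import winreg  # Available only on Windows.
except ImportError:
    winreg = None

# Path under HKEY_LOCAL_MACHINE, taken from the layout shown above.
LANGUAGE_ROOT = r"SYSTEM\CurrentControlSet\Control\ContentIndex\Language"


def language_key_path(name: str) -> str:
    """Build the registry path for a language key such as 'Dutch_Dutch'."""
    return LANGUAGE_ROOT + "\\" + name


def read_language_entry(name: str) -> dict:
    """Return every value stored under the given language key."""
    if winreg is None:
        raise OSError("The registry is only available on Windows.")
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, language_key_path(name)) as key:
        # QueryInfoKey returns (subkey count, value count, last modified).
        value_count = winreg.QueryInfoKey(key)[1]
        for index in range(value_count):
            value_name, data, _value_type = winreg.EnumValue(key, index)
            # The unnamed default value is reported with an empty name.
            values[value_name or "(Default)"] = data
    return values
```

On a Windows machine where the key exists, read_language_entry("Dutch_Dutch") would return a dictionary that includes entries such as WBreakerClass and StemmerClass. If the key is absent, winreg.OpenKey raises FileNotFoundError.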

Tip

If you make changes to the registry, restart Windows Search.

When Windows Search needs a wordbreaker that it has not yet loaded, it reads the wordbreaker's class identifier (CLSID), instantiates the wordbreaker, and caches the instance.

You can create a custom wordbreaker for a language by implementing the IWordBreaker interface. Windows Search then calls the IWordBreaker methods when it builds content indexes and runs queries.

Locale information for indexed content is retrieved from the source of the content. If the source implementer does not know the locale of the indexed content, it should set the locale to LOCALE_NEUTRAL.

For example, if you implement a filter handler (an implementation of the IFilter interface), property handler, or protocol handler, you should set the locale for indexed content to LOCALE_NEUTRAL unless you have specific locale information and are confident of its accuracy.

Tip

If an index query is based on user input, the locale should match the language in which the user is typing. You can determine this locale by calling the GetKeyboardLayout function.
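
As a rough illustration of the tip above, the following sketch derives a language identifier from the value returned by GetKeyboardLayout. The low word of the returned input locale identifier (HKL) holds the language identifier of the input language. The helper names langid_from_hkl and current_input_langid are hypothetical, and the ctypes call works only on Windows.

```python
import ctypes
import sys


def langid_from_hkl(hkl: int) -> int:
    """Extract the language identifier from an HKL (its low 16 bits)."""
    return hkl & 0xFFFF


def current_input_langid() -> int:
    """Return the language identifier of the calling thread's keyboard layout."""
    if sys.platform != "win32":
        raise OSError("GetKeyboardLayout is only available on Windows.")
    user32 = ctypes.WinDLL("user32")
    user32.GetKeyboardLayout.argtypes = [ctypes.c_ulong]  # DWORD idThread
    user32.GetKeyboardLayout.restype = ctypes.c_void_p    # HKL is a handle
    # A thread identifier of 0 means the current thread.
    hkl = user32.GetKeyboardLayout(0) or 0
    return langid_from_hkl(hkl)


# Example with a made-up HKL value whose low word is English (United States).
print(f"0x{langid_from_hkl(0x04090409):04X}")
```

The resulting language identifier can then be used as the locale for the query, so that the query string is tokenized with the wordbreaker that matches what the user typed.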

Languages Supported by Wordbreakers

Windows Search includes wordbreakers to support the following languages.

Registry key	Language (sublanguage)	LCID
Arabic_SaudiArabia	Arabic (Neutral)	0x0001
Bengali_Default	Bangla (Neutral)	0x0045
Bulgarian_Default	Bulgarian (Bulgaria)	0x0402
Catalan_Default	Catalan (Catalan)	0x0403
Chinese_HongKong	Chinese (Hong Kong SAR, PRC)	0x0C04
Chinese_Simplified	Chinese (Simplified)	0x0804
Chinese_Traditional	Chinese (Traditional)	0x0404
Croatian_Default	Croatian (Croatia)	0x041A
Czech_Default	Czech (Czech Republic)	0x0405
Danish_Default	Danish (Denmark)	0x0406
Dutch_Dutch	Dutch (Netherlands)	0x0413
English_UK	English (United Kingdom)	0x0809
English_US	English (United States)	0x0409
Finnish_Default	Finnish (Finland)	0x040B
French_French	French (France)	0x040C
German_German	German (Germany)	0x0407
Greek_Default	Greek (Greece)	0x0408
Gujarati_Default	Gujarati (India)	0x0447
Hebrew_Default	Hebrew (Neutral)	0x000D
Hindi_Default	Hindi (India)	0x0439
Hungarian_Default	Hungarian (Hungary)	0x040E
Icelandic_Default	Icelandic (Iceland)	0x040F
Indonesian_Default	Indonesian (Indonesia)	0x0421
Italian_Italian	Italian (Italy)	0x0410
Japanese_Default	Japanese (Japan)	0x0411
Kannada_Default	Kannada (India)	0x044B
Korean_Default	Korean (Korea)	0x0412
Latvian_Default	Latvian (Latvia)	0x0426
Lithuanian_Default	Lithuanian (Lithuania)	0x0427
Malay_Malaysia	Malay (Malaysia)	0x043E
Malayalam_Default	Malayalam (Neutral)	0x004C
Marathi_Default	Marathi (India)	0x044E
Norwegian_Bokmal	Norwegian (Bokmål, Norway)	0x0414
Polish_Default	Polish (Poland)	0x0415
Portuguese_Portugal	Portuguese (Portugal)	0x0816
Portuguese_Brazil	Portuguese (Brazil)	0x0416
Punjabi_Default	Punjabi (India)	0x0446
Romanian_Default	Romanian (Romania)	0x0418
Russian_Default	Russian (Neutral)	0x0019
Serbian_Cyrillic	Serbian (Serbia and Montenegro, Former, Cyrillic)	0x0C1A
Serbian_Latin	Serbian (Serbia and Montenegro, Former, Latin)	0x081A
Slovak_Default	Slovak (Slovakia)	0x041B
Slovenian_Default	Slovenian (Slovenia)	0x0424
Spanish_Modern	Spanish (Spain, Modern Sort)	0x0C0A
Swedish_Default	Swedish (Sweden)	0x041D
Tamil_Default	Tamil (India)	0x0449
Telugu_Default	Telugu (India)	0x044A
Thai_Default	Thai (Thailand)	0x041E
Turkish_Default	Turkish (Türkiye)	0x041F
Ukrainian_Default	Ukrainian (Ukraine)	0x0422
Urdu_Default	Urdu (Pakistan)	0x0420
Vietnamese_Default	Vietnamese (Vietnam)	0x042A

Note

LCIDs for some languages in the table are generated using the language identifier, sublanguage identifier, and sort identifier.

For more information about languages and associated identifiers, see Language Identifier Constants and Strings.
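
The way these identifiers combine can be shown with a short sketch that mirrors the arithmetic of the MAKELANGID and MAKELCID macros: the primary language occupies the low 10 bits, the sublanguage the next 6 bits, and the sort identifier the 4 bits above the language identifier. The Python function names are hypothetical; the constant values are those of LANG_DUTCH, SUBLANG_DUTCH, LANG_SPANISH, and SUBLANG_SPANISH_MODERN.

```python
from typing import Tuple

SORT_DEFAULT = 0x0


def make_langid(primary: int, sublanguage: int) -> int:
    """Combine primary language and sublanguage, as MAKELANGID does."""
    return (sublanguage << 10) | primary


def make_lcid(langid: int, sort_id: int = SORT_DEFAULT) -> int:
    """Combine a language identifier and sort identifier, as MAKELCID does."""
    return (sort_id << 16) | langid


def split_lcid(lcid: int) -> Tuple[int, int, int]:
    """Return (primary language, sublanguage, sort identifier)."""
    return lcid & 0x3FF, (lcid >> 10) & 0x3F, (lcid >> 16) & 0xF


# Dutch (Netherlands): LANG_DUTCH = 0x13, SUBLANG_DUTCH = 0x01
dutch = make_lcid(make_langid(0x13, 0x01))
# Spanish (Spain, Modern Sort): LANG_SPANISH = 0x0A, SUBLANG_SPANISH_MODERN = 0x03
spanish_modern = make_lcid(make_langid(0x0A, 0x03))

print(f"0x{dutch:04X}")           # 0x0413
print(f"0x{spanish_modern:04X}")  # 0x0C0A
```

Entries listed as Neutral in the table, such as 0x0001 for Arabic, have a sublanguage of zero, so the LCID equals the primary language identifier.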

Note

There is no guarantee that all of these language registry keys are present on a given machine. Whether the wordbreaker for a particular language is installed depends on user settings.

Beginning in Windows 8.1, the preferred way to use wordbreakers is through the WordsSegmenter class in the Windows Runtime (WinRT) API.

Additional Resources

For information on how to implement and use custom word breakers and stemmers for additional languages and locales, see Extending Language Resources in Windows Search.
If you need to identify the language of a piece of text, you can use Language Auto-Detection (LAD), which is available in Windows 7 and later. For more information, see Extended Linguistic Services (ELS).
For information on managing, querying, and extending the index, see the Windows Search Developer's Guide.

Related topics

Windows Search Overview
Windows Search as a Development Platform
Using Managed Code with Shell Data and Windows Search
