Compartilhar via


Creating Custom Dictionaries for special terms to be indexed 'as-is' in SQL Server 2008 Full-Text Indexes

 

Did You Know?:   That SQL Server 2008 provides the ability to create a Custom Dictionary of special terms to be indexed 'as-is' (i.e: 'AT&T') into the FTIndex?... 

 

Many times you find yourself in a situation where you would like to index a given word/term ‘as-is’ in your Full-Text Index. However, due the specific tokenization behavior from a given WB, it gets splitted in several individual terms. This might cause at query time recall issues.

 

For instance, your data contains the following term “ISBN-7”. Internally, most (if not all) WBs will tokenize this as ‘ISBN’ and ‘7’ separately. These two terms are going to be persisted in the FTIndex instead of the single and original “ISBN-7”. Note that because ‘7’ might be a default noise word for most languages (single digit numbers typically are considered as noise words), what happens is that the FTIndex will only index the term ‘ISBN’.

Later, at query time when you search for ‘ISBN-7’ using any of our available predicates (i.e: CONTAINS) or tvfs (i.e:CONTAINSTABLE), the same process will occur and the query will be internally translated to a query for ‘ISBN’ only. You will then receive all rows/documents containing ‘ISBN’, not only the one you cared about initially: ‘ISBN-7’.

 

By disabling STOPLIST for a given FTIndex, you are going to solve the missing results part of the problem, but you are going to have larger FTIndexes leading to performance implications and still not solving the problem of false positives, as you will still find documents containing 'ISBN' and '7' somewhere in the corpus of the document, but not necessarily together (next to each other) as you wanted originally.

This is a common problem when querying for email addresses for example, where the <'@'> character serves as separator . In this case, you are going to find document containing the exact email address as you desired, but as well documents containing both parts of the email separated for other special character different than <'@'> , leading to false positives again. There are other different variants of this behavior that can lead to false positives as well, etc.. While this behavior is expected, as you see, it can lead to undesired results.

 

In order to prevent this, SQL Server 2008 new WB family is introducing the ability to list a set of words/terms that should not be tokenized by a given WB, thus forcing them to be indexed as they are. These lists (called custom dictionaries) are simple text files and associated to each language supported by SQL Server Full-Text Search.

 

Here are the details in how to accomplish this:

 

(Reference: https://technet.microsoft.com/en-us/library/cc263242.aspx)

 

A custom dictionary is a Unicode-encoded file that can be used to specify words that you want the word breaker of the same language to consider as complete words. Custom dictionaries are not provided, by default. To modify the word breaker behavior for more than one language you must create a separate custom dictionary for each language for which you want to modify the word breaker’s behavior. You cannot create a custom dictionary for the language-neutral word breaker.

-Example created by Venkat, one of our team testers:

Consider the term “34590-97-8”. When this is passed to the German wordbreaker (for instance), the output terms are many:

select [display_term] from sys.dm_fts_parser('34590-97-8', 1031,0,0)

display_term

34590

nn34590

97

nn97

8

nn8

This output can be seen using:

select * from sys.dm_fts_parser('34590-97-8', 1031,0,0)

If you need to keep these numbers together for business reasons (I.e: this is a chemical code, account, phone number, etc) you then need to create a custom dictionary for German language in this case. Here are the steps:

1. Log on to the machine (on which sql server is installed) as a member of the Administrators group.

2. Start Notepad and type the words you want in your custom dictionary. Be sure to avoid invalid entries as described in the Reference link previously included. Remember that each word must be on a separate line and separated by a carriage return (CR) and line feed (LF).

3. On the File menu, click Save As.

4. In the Save as type list, select All Files.

5. In the Encoding list, select Unicode.

6. In the File name box, type the file name in the following format: CustomNNNN.lex, where NNNN is the language hex code of the language for which you are creating the custom dictionary (i.e: English= 1033). See Table 1, in the document referenced before in this post for a list of valid file names for supported languages and dialects.

7. In the Save in list, navigate to the folder that contains the word breakers. By default, this is the sql Binn directory. Example C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn.

8. Click Save.

9. Restart fdhost by executing “exec sp_fulltext_service 'restart_all_fdhosts' ” on the sql server instance.

Now the custom dictionary will treat the term differently as a single word:

select [display_term] from sys.dm_fts_parser('34590-97-8', 1031,0,0)

display_term

34590-97-8

Currently, these text files only support a list of words and not more complex expressions, as for instance regular ones, etc.. (i.e: ISBN-*).

Important Note: This will only work with newly shipped SQL Server 2008 Word Breakers and not with older word breakers unchanged in our 2008 release. These new wordbreakers are also shipped with the OS in Vista+ operating systems. These wordbreaker dlls have names as NaturalLanguage6, NLS*.dll etc. As a notorious exception, the english wordbreaker shipped with SQL Server 2008 is identical to our 2005 version, thus it does not belong to this new category of Word Breakers capable to support a custom dictionary. In order to use a custom dictionary for English in SQL Server 2008, you will need to change the registry registration and indicate that the English WB existing in you OS (only if it is Vista+) is the one that should be used for your SQL intance from that moment on. Please follow this article in how to change this registration for a given WB or iFilter:

https://msdn.microsoft.com/en-us/library/ms345188.aspx

The above article shows how to do it for Danish, Polish etc. To do this for the new English word breaker, you need the corresponding values for the new English Word Breaker and stemmer. The wordbreaker files to copy to your sql Binn directory are below. These should be in your windows\system32 folder on Vista+ operating systems.

1. NlsData0009.dll

2. NlsLexicons0009.dll

3. NlsGrammars0009.dlll

Other settings needed:

String value

TsaurusFile

tsEnu.xml

DWORD value

Locale

00000409

String value

WBreakerClass

{9DAA54E8-CD95-4107-8E7F-BA3F24732D95}

string value

StemmerClass

{61A48126-EF74-4d4a-9DDA-43FD542CAD1E}

 

 

We hope this information is useful to you. We are going to include this information in our official Books Online (BOL) release in upcoming SQL Server 2008 SPs as well as in our next major release.

 

Regards!

Comments

  • Anonymous
    March 04, 2010
    Thanks. It's really helps. The were a bit problem, that NaturalLanguage6 have to leave from SQL/binn directory, but other NlsData0009.dll and NlsLexicons0009.dll have been taken from sys32 folder. NlsGrammars0009.dll I was unable to find, but it works without it.

  • Anonymous
    August 03, 2010
    Hi, I am facing the issue in SQL Server German. Where I want to store the email addresses as-is. I have tried modifying the Custom Dictionary file. It works with the example you described above but not for the emails. Could you pleas let me know the way to specify an email address in the Custom Dictionary file for SQL German. Please reply to me at vijay.das@hotmail.com

  • Anonymous
    September 19, 2010
    The comment has been removed

  • Anonymous
    March 03, 2011
    I am facing the issue in SQL Server German. Where I want to store the AT&T as-is. I have tried modifying the Custom Dictionary file. It works with the example you described above but not for the AT&T and anything containing &. It works for AT-T. AT%T but not for AT&T Please let me know if I am missing something thanks Kanja ksaha@business-instruments.com

  • Anonymous
    March 09, 2011
    The comment has been removed

  • Anonymous
    July 11, 2011
    The comment has been removed

  • Anonymous
    September 10, 2013
    Any help?

  • Anonymous
    October 14, 2013
    Has anyone had any success with these instructions? I found them about 18 months ago and after much fiddling managed to get some simple phrases picked up by the search, from my custom dictionary. However I'm now trying to amend my already working dictionary to include some new phrases and cannot get it to work. The command exec sp_fulltext_service 'restart_all_fdhosts simply doesn't get the search to pick up the custom dictionary again. The old entries still work, the new ones are not picked up. I feel like there's something else I need to restart somewhere. We've stopped and started the SQL instance, I've even in desperation deleted my index completely and rebuilt it from scratch, but no effect. I have found several thread on the internet of people having similar problems but none have a resolution. Does anyone know the magic I'm missing?

  • Anonymous
    March 27, 2014
    Mark, did you rebuild your full-text index?

  • Anonymous
    March 27, 2014
    On SS 2012 the filename for an English dictionary looks like it needs to be Custom0009.lex. On SS 2008 it needs to be Custom1033.lex. Also the # symbol seems to work as a wildcard for numbers on 2008, but not on 2012. It would be REALLY nice if this was all documented somewhere.

  • Anonymous
    March 30, 2014
    Please disregard my comments above about the use of the # character and the filename of the custom dictionary on SQL Server 2008. Further investigation has determined that those findings are not correct.