Поделиться через


Part2 - SQL Full text Search and International Characters: Interesting Scenarios, solutions and deep-dive using DMV’s

Continuing on my last post which dealt with Chinese
characters, I will cover some user scenarios on Germanic characters like German/Swedish/Norwegian with SQL
FTS. Read on...

Scenario1 –
German language
 

Mr Customer: Hey, we have some
full text search catalogs for German
words, and we noticed that FREETEXTTABLE
fails to find text clearly defined in the catalog!

Me: Hmm, can you give me more details on this
behaviour.

Mr Customer:
Sure. Here is what I see happening,
E.g. in the German
catalog, define a column to be searchable for the table and set the word
breaker language to German and let’s assume that the column has a product name
= ‘d.3’

Now, when I do a search for ‘d.3’ I got 0 rows returned back. What’s up with
that?

Me: Ah, gotcha! Here is what is happening…

I setup a sample repro query and as expected this did not return any rows.

 SELECT FT_TBL.id,FT_TBL.data,KEY_TBL.RANK

FROM dbo.German AS FT_TBL 

INNER JOIN FREETEXTTABLE(dbo.German, data,'d.3') AS KEY_TBL

ON FT_TBL.id = KEY_TBL.[KEY]

The word ‘d.3’ is a noise word (and Noise words are not
stored in a full-text index). A perfect example would be words like “an”, “the”, “but” etc. To
prove this, I again go back to the trusted full-text DMV’s.

 SELECT * FROM sys.dm_fts_parser ('d.3', 1031, 0, 0) --1031 is the LCID for GERMAN

 

keyword         

occurrence 

special_term    

display_term 

expansion_type

source_term

0x0064          

1

Noise Word

d

0

d.3

0x0033          

2

Noise Word

3

0

d.3

0x006E006E0033

2

Noise Word

nn3

0

d.3

As you can see all the words when broken down are NOISE
WORDS. This means that they will not be stored in the full-text index, which is
why your queries do not return any results.

In fact, if you run this query from Management Studio and
in the results pane, switch to the Messages tab and you will see this
warning,

Informational: The
full-text search condition contained noise word(s).

Most people miss out on this indicator, since it’s a little tucked away.
The above message means that “some string” in the search query was a noise word
which was ignore by SQL Server when doing the search lookup.
Okay,
so what do you do if you really really want to search for some keywords like “d.3”
or “AT&T”?

Solution1
For the above scenario#2, FREETEXTTABLE is the incorrect function to
use. FREETEXT predicate is to search for values that match the meaning,
but not the exact wording. If you want to find the exact match, you should use
either CONTAINS or CONTAINSTABLE.

Solution2
With SQL Server 2008, by default the system Stoplist is used for
a catalog. So I created an empty stoplist and associated it with the above
full-text index. This time when I run the query it did return 1 row for the
string ‘d.3’

id data RANK
--- ----- --------

1 d.3 186

Note:
Stoplists/Stopwords are available starting from SQL Server 2008 onwards. Please
check this link for more information https://msdn.microsoft.com/en-us/library/ms142551.aspx.

If you are using SQL 2005, you can manually edit the noiseDEU.txt in the
FTDATA folder and remove the alphabets and numbers. This will require a restart
of the full-text search service + full population for the changes to show up. The
disadvantage of doing this is that you full-text index gets bloated and becomes
larger in size.

Scenario2 –
Swedish/Norwegian characters

Mr Customer: I have a Full-text
index created on a table that contains data in both Swedish and Norwegian.
Certain queries produce incorrect results and the behaviour is inconsistent.

Me: I am all ears!
Mr Customer:
My Swedish users raised a point that they aren’t able to search using
some special chars like Ø . Here’s my
complete situation.

I have lots of rows in my table which contains the word OPPGJØR. Now when I run the following
query using a prefix search, it gives me 0 rows as output. (FYI - OPPGJØR translates to SETTLEMENT in Norwegian ).

 select * from TBL1 where contains(*,'"OPPGJØ*"')

But if I remove full-text and do a LIKE search I see the rows containing
OPPGJØR being returned.

 select * from TBL1 where datacolumn like 'OPPGJØ*'

Even more strange, if I place the prefix search like this I see
the rows being returned.

 select * from TBL1 where contains(*,'"OPPGJØR*"')

What is the difference in search between OPPGJØ*
vs.  OPPGJØR?

Me: Hmm! Let us breakdown the problem
and see why this is happening.

First off I see that the word Ø is being used in the search condition. This is an
accented word. So words such as café
and cafe are treated differently. I
checked the ACCENT SENSITIVITY property of the catalog and it was ON.
I turned it off using the following command,

 alter fulltext catalog CATALOGNAME
REBUILD WITH ACCENT_SENSITIVITY=OFF;

After this, the above query started giving me the rows for 'OPPGJØ*’. The
disadvantage of doing this is that, it can start to give you more result rows.
In fact it gave me results for OPPGJO since this
the treated the same as OPPGJØ. In fact
this customer of mine had other special Norwegian words like Æ and Å which are
again accented.

Here is the breakdown using the same DMV as before.

 Select  document_id, occurrence_count, display_term from sys.dm_fts_index_keywords_by_document ( DB_ID('DBNAME'), OBJECT_ID('TBL1') ) Order by Keyword
 

This gave me the following output,

1

oppgj

1

oppgj

1

oppgj

1

oppgj

1

oppgjor

1

oppgjor

1

oppgjor

1

oppgjor

Next, I tried to pass in the LCID parameter to the search condition to
see if that helped.

 select * from Item it
where contains(*,'"OPPGJØ*"', LANGUAGE 1053)

select * from Item it
where contains(*,'"OPPGJØ*"', LANGUAGE 1044)

1053 --> Swedish

1044 --> Norwegian

You can look up this information in sys.fulltext_languages.
But this also gave me 0 rows when accent_sensitivity was turned ON. Like I
mentioned earlier these word-breakers are shipped out the box with SQL Server
and NaturalLanguage6.dll is the one responsible for the above. More information
on this here.

I tested this same on a SQL Server 2005 SP3 instance since the word-breakers
were different back then (nls400.dll).

 select * from dbo.swede where contains (text, '"OPPGJØ*"')

select * from dbo.swede where contains (text, '"åä*"')

And to my surprise, I got the expected output. So this looks like
an issue with the new SQL 2008 word-breaker for Swedish/Norwegian. To be more
specific this appears to be an issue in SQL 2008 specific to umlaut & other
Germanic alphabets. You can read about umlaut here.

I did prefix searches (word* ) for the
following characters and they all returned me incorrect rows with sensitivity
ON.

ä
ö
ü
Ä
Ö
Ü

Ø Æ

Now, we are really getting somewhere close to nailing this issue. Since we know
the problem has been isolated to prefix search on accented characters, I did
some searching and found this KB article that talks about the exact problem
with “extended characters or accents in a prefix search”!

FIX: Prefix search
with extended characters or accents returns incorrect result in SQL Server 2008

https://support.microsoft.com/default.aspx?scid=kb;EN-US;973090

 

Solution
The solution here was to apply SQL 2008 SP1 Cumulative Update 4 which
is described in the above article. After I tested this fix, the prefix searches
started returned me expected results similar to SQL 2005.

Also to get better localized results, I changed the word-breaking language for
the column datacolumn in TBL1 to Norwegian.

Other tips
& considerations for international languages
Windows LCID Information
https://msdn.microsoft.com/en-us/library/ms788149.aspx

International Considerations for
Full-Text Search

https://msdn.microsoft.com/en-us/library/ms142507(SQL.90).aspx

I hope this was useful in demonstrating some interesting case scenarios on
international characters and its usage with SQL Full-text search.

The ultimate search engine would basically understand everything in
the world and it would always give you the right thing.

Comments

  • Anonymous
    April 29, 2013
    Hi Sudarshan, Interesting article, I was pointed to it after posting a problem with full text indexation of accented words on a forum (see here : social.msdn.microsoft.com/.../25f2771c-2643-4b60-a06c-895a632b9477). I'm curious to know, do the 4 "oppgj" in the DMV output table in Scenario 2 show up coincidentally because they turn out to be proper Swedish words, or did the word breaker somehow chop off the "ØR" part in  OPPGJØR ? The latter would pretty much explain my problem, but I'm still unsure if I'm experiencing the same bug explained in KB 973090 ("the query may return incorrect data" is rather vague...)