INFORMATIONAL: FULL TEXT SEARCH ENGINE & FULL TEXT INDEX POPULATION INTERNALS

Well I’ve looked around in our blog and it pained me to see there were no posts on Full Text Search; an often misunderstood and under-utilized feature that SQL Server provides. Wait no more, here comes the first one (of many I hope) :) . In this post I’ve tried to make things simple & help understand how FTS works in SQL Server 2005.

For you Search enthusiasts and many bewildered DBA’s, you may have wondered as to what exactly is happening inside SQL Server when you try to populate a full text index. Where do you look to know if things are moving along, how to find what the stage of population and how to detect if things are not exactly hunky-dory.

These are some things that you need to know before we get into the finer details:-

1) A full-text index is not a normal index and it is not even remotely similar to a normal index on a table.

2) It is token based and is created & maintained by the MS SQL Full-text Engine (MSFTESQL.exe)

3) It stored in a compressed format and what’s more it’s not even stored inside your database files. The actual FT Index resides on your hard disk.

4) To build a full-text index there are lots of hands involved including the mother-ship (SQL Server), the FT-Engine and a 3rd process called the Filter Daemon (more on this later).

                                                                    clip_image002

Some of the things that are stored in a full text index are:-

a) Keyword – the token generated by the word-breaker.

b) ColID – represents the table + the column that is enabled for FTS.

c) DocID – Row equivalent in a FTIndex.

d) Occurrence – The offset of the keyword within the DocID.

The process of filling up these indexes is called as Population or Crawling. There are 2 ways for you to achieve this – Manual and Automatic (using change tracking mechanism). When population is initiated by the user, the data is pushed to MSFTESQL process by SQLSERVR. Based on the type of data it will break it down, translate, apply grammar rules and index the data. The grammar is decided by a component called the word-breaker which you choose when you setup the full-text index. An interesting thing happens here wherein 1 full text index can contain multiple fragments. At the end of this population, the FT Engine tries to make this into a single master index. This process is called master merge.

                                                          clip_image004

Think back to what I said sometime earlier about the index structure. The DocID and Occurrence are the values that are stored compressed. This is done to save space & IO cost. The master merge process is the one responsible for merging these fragments & Docid’s in them. This makes sense since it’s cheaper to access for a single processing rather than every time as to access DocID/Occ they have to be uncompressed, saved and compressed again.

So if you see something like this in your log, you now understand what has happened.

2008-11-04 12:38:44.47 spid23s Informational: Full-text Full population completed for table or indexed view '[DB].[dbo].[FTTable]' (table or indexed view ID '12345', database ID '5'). Number of documents processed: 7712. Number of documents failed: 0. Number of documents need retry: 0.

2007-10-09 20:12:51.73 spid21s Changing the status to MERGE for full-text catalog "SampleCataloge" (5) in database "SampleDB" (9). This is an informational message only. No user action is required.

When I said log, I meant the FTlog that is created for every catalog. By default this is located in the SQL Server Log Folder, the same place as the Errorlogs. Every population (manual & automatic), change tracking, master merge, failures are all logged in this FTLog.

The naming of these error logs is in the format : SQLFT <dbid> <catID>.LogN

E.g. If DBID is 6 and Catalog ID is 5, the log would be named SQLFT000060005.log

A small note on performance: Depending on the size of the catalog and the number of tables enabled for full-text indexing, the master-merge process can be I/O intensive. Now how do I find out when I am populating what is going on and how to figure out its not stuck anywhere.

1) Look at the Status column and these are some expected values:

select * from sys.dm_fts_index_population

Status 5 : Processing normally

Status 8 : Encountered row-level errors that will be retried

Status 7 : Has stopped processing

2) Look at the Status column and below are some expected values. Also the item count will indicate the number of items that are currently stored in the FTCatalog. Generally when a population is in progress this value is expected to increase.

select fulltextcatalogproperty(‘catalog_name’, ‘PopulateStatus’)

select fulltextcatalogproperty(‘catalog_name’, ‘ItemCount’)

Status 1 : Full population in progress

Status 6 : Incremental population in progress

Status 7 : Building index

Status 4 : Recovering

3) Another way is to track the size of the .CI/.PI files in the catalog folder. This is a manual process and makes sense when you are populating a large table with lot of documents. In this case, the longer the population goes on, these files are expected to grow in size & number.

Hope this post cleared things on the internals of the full text engine.

In the next post I will cover troubleshooting full text population & indexing failures (yes, IFilter issues too J) . Also touch upon full text performance best practices. If time permits I will try to talk about SQL 2000 FTS. Stay tuned for more …..

Sudarshan N
Technical Lead | Microsoft SQL Server

Comments

  • Anonymous
    March 24, 2009
    PingBack from http://blog.a-foton.ru/index.php/2009/03/24/informational-full-text-search-engine-full-text-index-population-internals/

  • Anonymous
    August 05, 2009
    Is there now a way to index password-protected files?   We have a Web application that allows people to upload MS Excel and Word files, but they are not being indexed because sections of the documents are password-protected.

  • Anonymous
    August 17, 2009
    Hi Andrew, There is no way currently with SQL 2005 or SQL 2008 to index password protected office files. There are no plans (as of today) to include this for the next release as well. You can post a wish at http://connect.microsoft.com/sqlserver --Sudarshan

  • Anonymous
    January 30, 2010
    The comment has been removed

  • Anonymous
    February 17, 2010
    Dani, <assuming that this is SQL 2005> Look in the log folder and you will find your SQLFT logs. It should be named SQLFT00<dbid>00<catalogid>.log. This might tell you what's happening with the full-text catalog and errors (if any). You can also query the DMV's sys.dm_fts_index_population & sys.dm_fts_active_catalogs, look at status column value, which would indicate any anomalies. Also, check what the SPID is doing in sysprocesses for the query to fulltextcatalogproperty and see its waittype. Regarding the status "Change Pending", after you restart full-text service it will go into this state because there might have been some population/change tracking index updates going on, which will resume once the service comes back online. If the FTCatalogs are not business critical and you feel this is hampering your main application, you can drop them using DROP FULLTEXT CATALOG, while we figure this one out. HTH.

  • Anonymous
    January 06, 2017
    I have one table with one full-text index. I see this in the log "Full-text Full population completed for table or indexed view" timestamped 20 minutes ago. However, OBJECTPROPERTY(t.object_id, 'TableFulltextPopulateStatus') returns 1 (Full Population In Progress), sys.dm_fts_index_population.completion_type_description is 'NONE', and sys.fulltext_indexes.crawl_end_date is null. Why do the DMVs think that it is still in progress long after the log says it is done?