Exchange 2007 Search - Part 2: Content Indexing

 

This blog is the second of three which focus on Exchange 2007 Search and Content Indexing. The titles of the three blogs are as follows:

Exchange 2007 Search - Part 1: Introduction to Exchange 2007 Search
Exchange 2007 Search - Part 2: Content Indexing (this Blog)
Exchange 2007 Search - Part 3: The Search Process

Content Indexing Overview

As soon as the Exchange Search services are started, the MonitorAndUpdateMDBList worker thread begins determining the status of each database on the server. That status is stored in memory, and the process runs about every 30 seconds to keep it up to date. If a database is online and enabled for indexing, a catalog object is created in memory that holds one of three values: New, Crawling, or Notification. A value of New initiates the creation of a property store and catalog for the database; once these are created, the value is changed to Crawling. The Crawling value identifies databases that are performing their initial crawl. A Notification value identifies a database that has finished its initial crawl and is ready to process events from the Event History table.
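The three catalog values behave like a small state machine. The following is a minimal sketch of those transitions, not the actual Exchange implementation; the names `CatalogState` and `next_state` and the two flag parameters are assumptions made for illustration.

```python
from enum import Enum


class CatalogState(Enum):
    """The three catalog values described above."""
    NEW = "New"
    CRAWLING = "Crawling"
    NOTIFICATION = "Notification"


def next_state(state, *, store_and_catalog_created=False, initial_crawl_done=False):
    """Return the next catalog value for one database (hypothetical helper)."""
    # New -> Crawling once the property store and catalog exist.
    if state is CatalogState.NEW and store_and_catalog_created:
        return CatalogState.CRAWLING
    # Crawling -> Notification once the initial crawl has finished.
    if state is CatalogState.CRAWLING and initial_crawl_done:
        return CatalogState.NOTIFICATION
    # Otherwise the value is unchanged.
    return state
```

In this sketch a monitor loop would call `next_state` for each online, index-enabled database on every pass (about every 30 seconds, per the description above).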

Crawling – The catalog for a database will hold the Crawling value until all mailboxes in that database have been initially indexed. By searching AD, the Content Indexer builds a Mailbox List in memory for all mailboxes in the database. Each mailbox is given one of three values in the Property Store: NotStarted, NormalCrawlInProgress, or Done. At startup, ten worker threads are created that are dedicated to crawling. Those threads process mailboxes from the Mailbox List and remove a mailbox from the list once the items in the mailbox have been indexed. Once the list is empty and all mailboxes have been indexed, the catalog value for the database changes to Notification. Note that in Exchange 2007, unlike older versions, crawling cannot be scheduled. Crawling is performed on demand. This means that some indexing processes do not run continuously, but rather intermittently.
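The crawl described above can be sketched as a fixed pool of worker threads draining a shared mailbox list. This is only an illustration under stated assumptions: `crawl_database` and the `index_mailbox` callback are hypothetical names, and a plain dictionary stands in for the per-mailbox values kept in the Property Store.

```python
import queue
import threading

CRAWL_THREADS = 10  # the post mentions ten dedicated crawl threads


def crawl_database(mailboxes, index_mailbox):
    """Index every mailbox, then report the catalog value for the database."""
    status = {name: "NotStarted" for name in mailboxes}
    pending = queue.Queue()  # stands in for the in-memory Mailbox List
    for name in mailboxes:
        pending.put(name)

    def worker():
        while True:
            try:
                name = pending.get_nowait()
            except queue.Empty:
                return  # the Mailbox List is empty, so this thread is done
            status[name] = "NormalCrawlInProgress"
            index_mailbox(name)  # hypothetical callback that indexes the items
            status[name] = "Done"

    threads = [threading.Thread(target=worker) for _ in range(CRAWL_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    all_done = all(value == "Done" for value in status.values())
    return ("Notification" if all_done else "Crawling"), status
```

For example, `crawl_database(["mbx1", "mbx2"], print)` would print each mailbox name as it is "indexed" and return `"Notification"` together with the final per-mailbox values.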

Notification Indexing - Once the initial crawl has occurred on a database, a process called the Notification Watcher constantly checks the Event History table for entries to be indexed. Checking one database at a time, the Notification Watcher can read up to 2000 events at a time. Watermarks are used so the Notification Watcher knows where it left off and where to begin next. Events that are found to be interesting are added to the notification queue to be indexed.
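The watermark idea can be illustrated with a short sketch. It assumes the Event History table can be modeled as a list of `(event_number, event)` tuples sorted by number; `read_batch` and `is_interesting` are hypothetical names, not Exchange APIs.

```python
BATCH_SIZE = 2000  # the post says up to 2000 events are read at a time


def read_batch(event_history, watermark, is_interesting):
    """Read the next batch of events numbered above `watermark`.

    Returns (interesting_events, new_watermark). The new watermark is the
    number of the last event read, so the next call resumes after it.
    """
    batch = [(n, e) for n, e in event_history if n > watermark][:BATCH_SIZE]
    if not batch:
        return [], watermark  # nothing new; the watermark does not move
    interesting = [e for _, e in batch if is_interesting(e)]
    return interesting, batch[-1][0]
```

Note that the watermark advances past every event that was read, including uninteresting ones, so those events are not examined again on the next pass.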

When a mailbox is moved to a new database, a process called One-Off Crawling is used to index the mailbox. Once the mailbox has been indexed, it returns to a Notification status and normal notification processing resumes. This notification indexing reduces the disk I/O overhead. See this link for additional information on the effects of Content Indexing on a server:

Configuring, validating and monitoring your Exchange 2007 storage - Content Indexing
https://msexchangeteam.com/archive/2007/01/15/432199.aspx

Indexing can be disabled on a specific mailbox database by using the Exchange Management Shell:

Set-MailboxDatabase <name> -IndexEnabled:$false

Disabling indexing at the database level can help isolate issues related to indexing during troubleshooting. You could eliminate one database from indexing, or isolate one from the rest. To disable indexing completely on a mailbox server, stop the “Microsoft Exchange Search Indexer” service.

Components of Content Indexing

The four main components of Content Indexing are Store.exe, MSExchangeSearch, MS Search, and MS Search Filter Daemon. The components work together using Named Pipes, shared memory, and COM/RPC to build the index catalogs and respond to client queries. Below, Figure 2.1 illustrates the components that make up Exchange 2007 Search. Following the diagram is a detailed explanation of each component. 


Figure 2.1 Exchange 2007 Search Components

Store.exe

(MsExchangeIS) Microsoft Information Store Service

The Exchange 2007 Information Store contains three important subcomponents used in Content Indexing and Search: the Event History Table, the Property Store, and the Query Processor.

Event History Table – this Jet database table in an information store mailbox database, as its name implies, contains numbered events that are written each time an important event occurs in the store. Many Exchange-related services read through this table looking for events that are “interesting” or important to their specific function and ignoring events that are of no importance. Mailbox items that are new, changed, deleted or moved trigger an event that is added to this table. Such an event lets Exchange Search know there is an item needing to be indexed. The MSExchangeSearch component is responsible for reading this table continuously (about once a millisecond). There is one Event History Table per mailbox database. This process is explained in more detail later in this post.

Property Store – previous versions of search also used a Property Store; however, the Property Store was kept as a separate file from the Information Store database. The Property Store is now a Jet database table in the information store database containing metadata for indexed items; there is one Property Store table per mailbox database. The Property Store contains properties about indexed items that help match entries in the index catalog to objects in the store. MSExchangeSearch uses the Document ID (assigned during indexing) to search the Property Store for the matching Entry ID (FID/MID or Folder ID/Message ID) of the document. Search also checks the Property Store for the current indexing status of the document.

Query Processor – The information store uses MSSearch 3.0 for queries in Exchange 2007. Previous versions of Exchange used MSSearch 2.0. When a client sends a search request to the store, the Query Processor is initiated. The Query Processor builds the search request and works with MSSearch to find the requested data and return it to the client. This process is explained more completely in the third post of this blog series.

MSExchangeSearch

(Microsoft.Exchange.Search.ExSearch.exe) Microsoft Exchange Search Indexer

One of the four main components, Microsoft Exchange Search Indexer (MSExchangeSearch), is responsible for all index-enabled mailbox stores on a server. Any time a message is modified, created, deleted, or moved, an event is created in the Event History Table. MSExchangeSearch reads the Event History Table. Events that MSExchangeSearch finds interesting are added to a queue to be processed by the Indexer. An event is not removed from this queue until MSSearch reports that it has been successfully indexed. This all happens extremely quickly, which is why the catalog is never more than a few minutes out of date. In addition, MSExchangeSearch is responsible for writing and maintaining the metadata in the Property Store for the indexed items: Document ID, Entry ID and the indexed state of the item. If a database catalog is deleted or deemed out of date, the MSExchangeSearch service is responsible for initializing the new crawl of the database.
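The "keep the event until indexing is confirmed" behavior can be sketched as follows. This is an illustrative model only: the class name `IndexQueue`, its methods, and the state string `"Indexed"` are assumptions, and a dictionary stands in for the Property Store table.

```python
from collections import OrderedDict


class IndexQueue:
    """Holds events until the indexer confirms they were indexed."""

    def __init__(self):
        self._pending = OrderedDict()  # event number -> Entry ID
        self.property_store = {}       # Document ID -> (Entry ID, indexed state)
        self._next_doc_id = 1

    def enqueue(self, event_number, entry_id):
        """Queue an interesting event for indexing."""
        self._pending[event_number] = entry_id

    def acknowledge(self, event_number):
        """Record a successful index; only now does the event leave the queue."""
        entry_id = self._pending.pop(event_number)
        doc_id = self._next_doc_id
        self._next_doc_id += 1
        # "Indexed" is a hypothetical state label used for this sketch.
        self.property_store[doc_id] = (entry_id, "Indexed")
        return doc_id

    def pending(self):
        """Event numbers still waiting for confirmation, oldest first."""
        return list(self._pending)
```

Because an event stays in the queue until `acknowledge` is called, an item whose indexing is interrupted is still pending and can be picked up again.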

msftesql-Exchange

Microsoft Search (Exchange) - MSSearch 3.0

Another main component of Content Indexing, msftesql-Exchange is responsible for reading and writing to the index catalog. The catalog files and directory are created during the initial crawl process, in the same location as the database files. The path of the catalog cannot be changed. However, moving a mailbox database will move the catalog. Restoring a database from backup does not restore the catalog; instead, a new index crawl is initiated on a new catalog. Other responsibilities of the msftesql-Exchange service are performing admin functions, executing full-text queries from the store’s query engine and managing the Filter Daemon.

The ResetSearchIndex.ps1 script can force a rebuild of the catalog for a specific database on your server:

ResetSearchIndex.ps1 [-force] <dbname> [<dbname2>]

This process will remove and recreate the index catalog. The index catalog files can be considered expendable: if the catalog is found to be more than seven days out of date, it will be discarded and a new crawl and catalog will be initialized.
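The seven-day rule amounts to a simple age check. The sketch below assumes the age of the catalog can be expressed as the time since it was last brought up to date; the function name `catalog_action` and its return values are hypothetical.

```python
from datetime import datetime, timedelta

MAX_CATALOG_AGE = timedelta(days=7)  # threshold stated in the post


def catalog_action(last_up_to_date, now):
    """Return 'rebuild' if the catalog is more than seven days out of date."""
    if now - last_up_to_date > MAX_CATALOG_AGE:
        return "rebuild"  # discard the catalog and start a new crawl
    return "keep"
```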

Corruption, accidental deletion, or simply troubleshooting search problems are some of the reasons to manually rebuild the Index Catalog.

How to Rebuild the Full-Text Index Catalog
https://technet.microsoft.com/en-us/library/aa995966.aspx

Rebuilding the catalog will resolve issues of corruption with Index Catalog files as noted in this KB article:

The Outlook Web Access search function does not work for some users in Exchange 2007
https://support.microsoft.com/kb/945077

Msftefd.exe

Microsoft Search Filter Daemon

The Filter Daemon is responsible for running through the words and character streams and applying filters and word breakers in the indexing process. The actual process is as follows: after all the data from the item is streamed from the store to the Filter Daemon, the content is passed through the filters and word breakers. The Filter Daemon breaks the textual stream into words, removes noise words (such as “the” and “and”), and passes the words to be indexed to MSSearch 3.0 to create the actual index entries in the catalog.

During server startup, the msftesql-Exchange service is set to manual and MSExchangeSearch is set to automatic. MSExchangeSearch cannot start until msftesql-Exchange starts. Msftesql-Exchange spawns msftefd.exe. The chart below shows the relationships of the three services and processes. Note that none of the three depends on the Microsoft Information Store Service, and that the Microsoft Information Store Service does not depend on any of the three.

Clarification of Exchange 2007 Search Services and Processes and Their Relationships

Service Name        Startup Type          Display Name                        Depends on

Msftesql-Exchange   Manual                Microsoft Search (Exchange)         Remote Procedure Call

MSExchangeSearch    Automatic             Microsoft Exchange Search Indexer   Msftesql-Exchange, MSExchangeADTopology

Msftefd.exe*        Spawned by MSSearch   Microsoft Search Filter Daemon      Msftesql-Exchange

*Note: Msftefd.exe is a process, not a service. It is spawned (instantiated and terminated) by MSFTESQL-Exchange on an as-needed basis. The process is instantiated for both crawling and index maintenance, and is terminated when it has been idle for a specific time.
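The dependencies in the chart imply a start order, which can be derived with a topological sort. This is only a sketch of the relationships listed above using Python's standard `graphlib` module (Python 3.9 or later); the dictionary `DEPENDS_ON` and the function `start_order` are illustrative names, and Msftefd.exe is left out because it is a spawned process rather than a service.

```python
from graphlib import TopologicalSorter

# Each service maps to the services it depends on, as listed in the chart.
DEPENDS_ON = {
    "Msftesql-Exchange": {"Remote Procedure Call"},
    "MSExchangeSearch": {"Msftesql-Exchange", "MSExchangeADTopology"},
}


def start_order(depends_on=DEPENDS_ON):
    """Return one valid start order in which dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())
```

Any order returned by `start_order()` places Msftesql-Exchange before MSExchangeSearch, which matches the statement that MSExchangeSearch cannot start until msftesql-Exchange starts.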
  

Filters

Filters are used to extract the text from specific types of documents, such as html, doc, xml, xls, and pdf. In the registry under HKLM\Software\Microsoft\Exchange\MSSearch\Filters there is a list of the filters that the server is able to use; see Figure 2.2.


Figure 2.2 Exchange 2007 MSSearch Filters

Office 2007 documents (docx, xlsx, etc.) are not indexed by default. If an extension is not listed in the registry, we simply skip the attachment and index the rest of the message. You can enable these additional file types to be indexed by registering the Filter Pack IFilters for Office 2007. For further information, see this article:

944516  How to register Filter Pack IFilters with Exchange Server 2007
https://support.microsoft.com/default.aspx?scid=kb;EN-US;944516

After installing the Office 2007 IFilter Pack above, you must also install this hotfix on the Exchange 2007 server if your users either use Office 2007 or receive any Office 2007 attachments:

960166    Error message when a search process crawls a .vsd file on a Windows 64-bit operating system that is running the 2007 Office Filter Pack: "The filtering process has been terminated"
https://support.microsoft.com/default.aspx?scid=kb;EN-US;960166

Word Breakers

Word Breakers understand the rules of language and are used to convert strings of characters to words, and words to word tokens that are then passed to the msftesql-Exchange service to be indexed and written to the catalog. MAPI.net is used in Exchange 2007 to expose data in the Information Store to the Filter Daemon for indexing.

During the indexing process, if any part of the message fails to be indexed, the entire message fails to be indexed. For example, if we fail to open and index a docx attachment while indexing the message that contains it, the entire message is skipped and we will not index the body of the message. This is different from Exchange 2003. However, if an IFilter for a specific attachment type is not listed in the registry, we will skip indexing of that attachment type and the message body will still be indexed.
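The difference between "no filter registered" and "a registered filter failed" can be sketched as follows. This is a simplified model, not the Filter Daemon's real interface: `FilterError`, `index_message`, and the `filters` mapping (extension to text-extraction function) are assumptions made for the example.

```python
class FilterError(Exception):
    """Raised by a (hypothetical) filter that cannot read an attachment."""


def index_message(body, attachments, filters):
    """Return the text to index for a message, or None if the message fails.

    `attachments` is a list of (filename, data) pairs; `filters` maps a
    lowercase extension such as ".doc" to a function that extracts text.
    """
    parts = [body]
    for filename, data in attachments:
        dot = filename.rfind(".")
        extension = filename[dot:].lower() if dot != -1 else ""
        extract = filters.get(extension)
        if extract is None:
            continue  # no registered filter: skip this attachment only
        try:
            parts.append(extract(data))
        except FilterError:
            return None  # a registered filter failed: skip the whole message
    return "\n".join(parts)
```

With only a `.doc` filter registered, a message carrying a `.docx` attachment still has its body indexed; once a `.docx` filter is registered and fails on that attachment, the function returns `None` for the entire message.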

 

Noise Words in Exchange 2007

The Filter Daemon also removes noise words. The query processor has a mechanism that discards from the query commonly occurring words that do not factor into the search. These words are called noise words. Noise words are listed in the locale-specific noise word files on the server. For example, in the English (US) locale, words such as "a," "and," "is," and "the" that appear in the English noise word file (if one exists) can be left out of the full-text index, since they are empirically known to be useless to a search. The query processor determines the noise word file to use based on the locale of the caller making the query. The query processor removes any of these words from the restriction prior to optimization, since they would not be found in the full-text index. Note, therefore, that noise words are removed both from the index by the Filter Daemon when the index is created and from the search by the Query Processor.
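The reason noise words must be removed on both sides can be shown with a toy inverted index. This is a minimal sketch, not the MSSearch word breaker or query processor: the regular-expression tokenizer and the functions `tokenize`, `build_index`, and `search` are assumptions for illustration.

```python
import re


def tokenize(text, noise_words):
    """Split text into lowercase words and drop noise words."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in noise_words]


def build_index(documents, noise_words):
    """Index time: map each remaining word to the documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in tokenize(text, noise_words):
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query, noise_words):
    """Query time: drop noise words, then require every remaining term."""
    terms = tokenize(query, noise_words)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

If the query kept "the" while the index had dropped it, a search for "the budget" would match nothing; removing the same words from the query keeps the two sides consistent.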

NOTE: By default, there are no noise word files in Exchange 2007. However, Exchange 2007 does have the capability to use noise word files that you create. The file is named noisexxx.txt, where xxx depends on the Language ID. For example, noiseenu.txt would be for English (US) and noisefra.txt would be for French.

The noise word file, when present, should be located in the Exchange Server\Bin\FTERef subdirectory of the Exchange install directory, with a name following the pattern "noisexxx.txt". For example, if you install Exchange 2007 on the C: drive and use the English (US) language for your noise word file, it would be located here with the following name:

C:\Program Files\Microsoft\Exchange Server\Bin\FTERef\noiseenu.txt

The complete list of language codes for noise word file names is shown below.

    { 0x0804, L"CHS" }, // Simplified Chinese (PRC)
    { 0x0404, L"CHT" }, // Traditional Chinese (Taiwan)
    { 0x0406, L"DAN" }, // Danish
    { 0x0407, L"DEU" }, // German
    { 0x0409, L"ENU" }, // English (US)
    { 0x0809, L"ENG" }, // English (UK) 
    { 0x0C0A, L"ESN" }, // Spanish
    { 0x040C, L"FRA" }, // French
    { 0x0410, L"ITA" }, // Italian
    { 0x0411, L"JPN" }, // Japanese
    { 0x0412, L"KOR" }, // Korean
    { 0x0413, L"NLD" }, // Dutch
    { 0x0415, L"PLK" }, // Polish  
    { 0x0416, L"PTB" }, // Portuguese
    { 0x0419, L"RUS" }, // Russian
    { 0x041D, L"SVE" }, // Swedish
    { 0x041E, L"THA" }, // Thai
    { 0x041F, L"TRK" }  // Turkish
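Putting the naming pattern and the table together, the expected file path for a locale can be computed as in the sketch below. The function `noise_file_path` and the dictionary `LANGUAGE_CODES` are hypothetical, the dictionary holds only a subset of the table above, and the default install directory is the example path used in this post.

```python
from pathlib import PureWindowsPath

# Subset of the Language ID table above (LCID -> three-letter code).
LANGUAGE_CODES = {
    0x0407: "DEU",  # German
    0x0409: "ENU",  # English (US)
    0x0809: "ENG",  # English (UK)
    0x040C: "FRA",  # French
}

DEFAULT_INSTALL_DIR = r"C:\Program Files\Microsoft\Exchange Server"


def noise_file_path(lcid, install_dir=DEFAULT_INSTALL_DIR):
    """Build the noise word file path following the noisexxx.txt pattern."""
    code = LANGUAGE_CODES[lcid].lower()
    return PureWindowsPath(install_dir) / "Bin" / "FTERef" / f"noise{code}.txt"
```

For example, `noise_file_path(0x040C).name` gives `noisefra.txt`, matching the French example mentioned earlier.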

 

Content Indexing and Exchange 2007 High Availability

Exchange 2007 offers three high availability options: Single Copy Cluster (SCC), Cluster Continuous Replication (CCR), and Local Continuous Replication (LCR).

SCC mailbox servers share a single instance of a mailbox database and index catalog. There is no change in the Content Index process for an SCC.

With CCR mailbox servers there are two instances of a mailbox database: one active and one passive. An instance of content indexer on each node creates a unique catalog for each database. Each catalog has a unique GUID held in the database that matches it to the content indexer on the node that created it. When failing over, the second catalog will always be used with the second database, and the first catalog with the first database. One current limitation is that there is no way to detect how up-to-date or how healthy the catalog on the passive node is. The MSExchangeSearch process on the passive node continuously updates the catalog, so that it can be used for failover at any time.

In an LCR implementation, there is only a single copy of the content index catalog. When the offline database on an LCR server is activated, the original catalog is not automatically moved over. It can be manually copied to the new active database location, and the index will then be up to date. If the catalog is not copied over, a new catalog will be created and a full crawl will begin.


Summary

In summary, Content Indexing in Exchange 2007 checks the indexing status of each database about every 30 seconds. Crawling in Exchange 2007 is performed on demand rather than on a schedule. After the initial crawl, interesting events from the Event History table are queued so that the indexing process knows what needs to be indexed next, and watermarks record where it left off.

The components of Exchange 2007 Content Indexing include the Microsoft Information Store, the Microsoft Exchange Search Indexer, Microsoft Search (Exchange) (MSSearch 3.0), and Microsoft Search Filter Daemon.  The Filter Daemon uses Filters and Word Breakers to create the Index.

The Filter Daemon also removes Noise words during crawling to create the Index. Exchange 2007 includes the capability to add Noise word files. By default, there are no Noise word files provided with Exchange 2007.

SCC mailbox servers share a single instance of a mailbox database and catalog, while CCR mailbox servers contain one mailbox database associated with its own catalog and a copy of the mailbox database associated with its own unique catalog. In an LCR implementation, there is only a single copy of the catalog.

Bob Want, Senior Support Escalation Engineer, Enterprise Communications Services, Microsoft
Jack French, Senior Support Escalation Engineer, Enterprise Communications Services, Microsoft