Search Architecture in Microsoft Office SharePoint Server 2007
This article describes or explains the following:
- The indexing and search components that comprise the overall search architecture in Office SharePoint Server 2007.
- The purpose and capabilities of the different search roles in an Office SharePoint Server 2007 farm.
- The indexing processes used by Office SharePoint Server 2007.
- How protocol handlers and iFilters fit into the Office SharePoint Server 2007 search architecture.
- The use of word breakers and stemmers.
- The possible dependencies on 32-bit architecture in specific scenarios.
- How Office SharePoint Server 2007 manages and propagates indexes from indexers to query servers.
- The query processes in an enterprise search solution based on Office SharePoint Server 2007.
Indexing and Search Architecture
A search solution based on Office SharePoint Server 2007 consists of two main components: the indexing engine and the query engine. In brief, the indexing engine is responsible for crawling and indexing content that may be stored in a variety of formats and locations in your corpus, while the query engine provides the capability to search the indexed information.
Index Engine
The indexing engine provided by Office SharePoint Server 2007 is capable of crawling a variety of content sources, such as Web sites, SharePoint sites, Exchange Server Public Folders, line-of-business data, and file shares. The indexing engine retrieves the definitions of the content sources from the search configuration data. You have control over defining the content sources to be crawled.
The indexing engine provides the base logic for the indexing process, and loads specific components called protocol handlers to connect to and crawl the different types of content sources. Protocol handlers, in turn, load additional components called iFilters to read the contents of specific file types.
The indexing engine maintains a file-based index, which contains the indexed content. The indexing engine also maintains managed properties (in what is called the property store or search schema) and scope definitions in the search and configuration databases managed by SQL Server.
Query Engine
Put simply, a Web server initiates each query by collecting terms from the user, and then contacts the query engine to search the full-text index for items that contain the searched-for terms. The results are supplemented with keywords, best-bets, and managed properties from the search configuration database, managed by SQL Server. If the query consists of only a property filter, the Web server needs only to contact the database server, and does not contact the query engine.
Queries are initiated through the Search object model or the Search Web service on Web servers.
Additional components, such as word breakers and stemmers, are used throughout the entire process. These components are discussed in detail later in this article.
You can physically separate the indexing engine from the query engine by implementing specific roles in your server farm. You can also physically separate the indexing and query engines from the Web servers that expose the search object model and the search Web service.
Server Roles
When you create solutions with SharePoint technologies, you must be aware of the roles that servers can take within your server farm. For search, the roles that you should be aware of are the indexer role, the query server role, the Web server role, and the database server role.
Indexer Role
The indexer role provides the indexing services, such as crawling content, managing crawl schedules, and defining crawl rules. One of the main tasks assigned to the indexer is to crawl content sources and index the information that is stored there. A content source is simply a specification of the type of system and location to be crawled, along with at least one start address. In Office SharePoint Server 2007, you can specify up to 500 start addresses per content source and, furthermore, a Shared Service Provider can define up to 500 content sources.
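The limits described above can be checked with a few lines of code. The following is a minimal sketch only: the ContentSource class and the find_problems function are hypothetical names introduced for illustration and are not part of the Office SharePoint Server 2007 object model. Only the two limits of 500 come from the text.

```python
from dataclasses import dataclass, field
from typing import List

# Limits quoted in the text for Office SharePoint Server 2007.
MAX_START_ADDRESSES = 500   # per content source
MAX_CONTENT_SOURCES = 500   # per Shared Service Provider


@dataclass
class ContentSource:
    """Hypothetical model of a content source (not a SharePoint type)."""
    name: str
    source_type: str                       # for example "sharepoint" or "file"
    start_addresses: List[str] = field(default_factory=list)


def find_problems(sources: List[ContentSource]) -> List[str]:
    """Return a list of limit violations; the list is empty if there are none."""
    problems = []
    if len(sources) > MAX_CONTENT_SOURCES:
        problems.append(
            f"{len(sources)} content sources exceed the limit of {MAX_CONTENT_SOURCES}"
        )
    for source in sources:
        count = len(source.start_addresses)
        if count == 0:
            problems.append(f"{source.name}: at least one start address is required")
        elif count > MAX_START_ADDRESSES:
            problems.append(
                f"{source.name}: {count} start addresses exceed the limit of "
                f"{MAX_START_ADDRESSES}"
            )
    return problems
```

A check of this kind could be run against a planned list of content sources before they are created in a Shared Service Provider.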
An indexer is characterized by the following requirements:
- Processor. An indexer typically requires a large amount of processor power. Processor utilization on an indexer will most likely be higher for the indexing process than for any other process that occurs in your farm. You should ensure that you have adequate processing power for your indexer; typically, you will require multiple multi-core processors.
- Disk access. An indexer has two typical disk access patterns. While content is being crawled and indexed, the Index Server will exhibit write-intensive characteristics, and if indexes are propagated to Query Servers, then the disk will be read in small fragments. The disk-write operations are the most intensive, so you should optimize your disk configuration for write-access. To achieve this, the recommended disk configuration is physical RAID 10 (disk striping, with mirroring for fault-tolerance).
- Memory. Indexing content is not typically a memory-intensive operation, although documents are read into memory for indexing. You can control how much memory is required for indexing purposes to some extent by controlling:
- The maximum size of the documents to be indexed.
- The degree to which documents are indexed in parallel.
- Network. An Index Server makes use of the network primarily at indexing time. It connects to content sources and reads document contents over the network, and it propagates small index fragments to query servers during the indexing process (if propagation occurs).
The processor is most commonly the first bottleneck for an indexer, especially if sufficient processor power has not been provided. You should monitor processor utilization at indexing time to determine whether you need to add more processing power to an indexer. However, if you have provided ample processor power, the next bottleneck is most likely to be network latency between the indexer and the content.
Query Server Role
The query server role runs queries over the full-text index. Query servers are managed at the farm level.
If you physically separate the query server role from the indexer onto one or more servers, then the full-text index is propagated from the indexer to all query servers in the farm. Propagation occurs continuously while content is being added to the index on the indexer and you are not required to configure or administer the propagation. Most queries will require results to be returned from the Query Server; the only exception is when the user has issued a query that only consists of a property filter. In that case, the query can be satisfied by the database server role alone.
A query server is characterized by the following requirements:
- Processor. Apart from normal processor instructions for reading from disks and managing memory, a query server’s processor requirements vary depending on the size of the index being searched.
- Disk access. A query server has two typical disk access patterns. While queries are being satisfied, a query server may exhibit read-intensive characteristics if the data it requires is not held in memory. If index propagation occurs from an Index Server, then the disk will be written to in small fragments and the query server must perform a master merge. Of these two patterns, the disk-read operations are the more intensive, so you should optimize your disk configuration for read-access. You can do this by implementing physical disk striping across multiple hard disks, each with its own physical controller. The recommended disk configuration is physical RAID 10 (disk striping, with mirroring for fault-tolerance).
- Memory. Memory is the most intensively used physical resource by a query server. A query server caches the results from recent queries, and removes cached data only when either of the following occurs:
- It has run out of physical memory and new results must be read into the cache to satisfy a query.
- Data in the cache have been invalidated because a crawl has discovered updated information.
- Network. A query server makes use of the network primarily at query time. It receives search requests from Web servers, and sends results back over the network. Network resources are also used at indexing time when an Index Server propagates small index fragments to query servers.
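The caching behavior described under Memory above can be illustrated with a small sketch. This is not how the query server is implemented. The class name QueryResultCache is hypothetical, a fixed entry count stands in for physical memory, and evicting the least recently used entry is an assumption, because the text does not say which cached data are removed first.

```python
from collections import OrderedDict


class QueryResultCache:
    """Illustrative cache: query text -> list of result URLs."""

    def __init__(self, capacity):
        self.capacity = capacity          # stand-in for available physical memory
        self._entries = OrderedDict()     # oldest entry first

    def get(self, query):
        if query not in self._entries:
            return None
        self._entries.move_to_end(query)  # mark as recently used
        return self._entries[query]

    def put(self, query, results):
        self._entries[query] = list(results)
        self._entries.move_to_end(query)
        # Case 1: memory is exhausted, so older results make way for new ones
        # (least-recently-used order is an assumption of this sketch).
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def invalidate(self, updated_url):
        # Case 2: a crawl found updated content, so drop every cached
        # result set that contains the updated item.
        stale = [q for q, urls in self._entries.items() if updated_url in urls]
        for query in stale:
            del self._entries[query]
```

For example, with a capacity of two entries, caching a third query removes the entry that was used least recently, and calling invalidate with a URL removes every cached result set that lists that URL.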
Web Server Role
The Web server role responds to search queries for users and applications. The Web server collects query terms from the user, either through built-in Web Parts, custom Web Parts, or from custom applications. Based on the information collected, the Web server is responsible for formulating the specific query. Depending on the contents of the query, the Web server contacts query servers and the database server to retrieve the required results and access control lists. For example, if the query consists of only a property filter, the Web server needs only to contact the database server, and does not contact the query engine, whereas if the query contains keywords to be searched for, then both the query server and database server will be contacted.
When all of the results and access control lists have been returned to the Web server, it security trims the results, based on the identity of the user who issued the query and the access control lists returned by the database server. After security trimming has taken place, the Web server presents the results, either by rendering them on Web pages or by returning the results to a calling application.
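The routing rule that the Web server applies can be written down as a small decision function. The following sketch is an illustration only: the function name and the representation of a query as a list of keywords plus a dictionary of property filters are assumptions, not part of the Search object model.

```python
def backends_for_query(keywords, property_filters):
    """Return the server roles that the Web server contacts for a query.

    keywords: list of free-text terms (may be empty).
    property_filters: dict of managed property name -> value (may be empty).
    """
    if not keywords and not property_filters:
        return set()                      # nothing to search for
    if keywords:
        # Keywords need the full-text index; managed properties and access
        # control lists for the hits still come from the database server.
        return {"query server", "database server"}
    # A query that consists of only a property filter is answered by the
    # database server alone.
    return {"database server"}
```

For example, a query for the keyword search contacts both roles, while a query that only filters on a hypothetical Author property contacts only the database server.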
Database Server Role
The database server role performs search-specific actions that apply at configuration, indexing, and query time. All of the administrative search configuration settings are stored by the database server; these settings include content source definitions, crawl rules, and scope definitions.
In addition to storing configuration data in the search configuration database, the database server also stores data that is retrieved from the crawl processes. Specifically, when managed property values and access control lists are retrieved from content sources, their values are stored in the search database. In addition, when a query is issued by a user, the Web server contacts the database server to retrieve managed property values and access control lists, based on the data returned from the query server.
Indexing Processes
The indexing process consists of the following general steps:
- The indexer retrieves the start addresses of content sources.
- The indexer invokes a protocol handler to connect to and traverse the content source.
- The protocol handler identifies content nodes, such as files and Web pages.
- The protocol handler retrieves system-level metadata and access control lists (if access control lists are available).
- The protocol handler invokes the iFilter associated with the content node type.
- The iFilter retrieves content and metadata from the content node.
- Content and metadata are parsed by the word breaker and are added to the full-text index.
- Metadata and access control lists are added to the search database.
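The steps above can be sketched as a small in-memory pipeline. Everything in this sketch is a simplified stand-in: the FAKE_SHARE dictionary plays the part of a file share, and the protocol_handler, ifilter, word_breaker, and crawl functions are hypothetical names, not the real components that the indexer loads.

```python
import re

# Hypothetical content source: node address -> (raw content, access control list).
FAKE_SHARE = {
    "file://server/docs/a.txt": ("Enterprise search overview", ["DOMAIN\\alice"]),
    "file://server/docs/b.txt": ("Search roles and indexing", ["DOMAIN\\bob"]),
}


def ifilter(address):
    """Read one content node and return (text content, metadata)."""
    text, _ = FAKE_SHARE[address]
    return text, {"Title": address.rsplit("/", 1)[-1]}


def protocol_handler(start_address):
    """Traverse the source, identify nodes, fetch ACLs, and invoke the filter."""
    for address, (_, acl) in FAKE_SHARE.items():
        if address.startswith(start_address):
            text, metadata = ifilter(address)
            yield address, acl, text, metadata


def word_breaker(text):
    """Break a stream of text into lower-case words."""
    return [word.lower() for word in re.findall(r"\w+", text)]


def crawl(start_addresses):
    full_text_index = {}   # word -> set of node addresses
    search_database = {}   # node address -> metadata and ACL
    for start in start_addresses:
        for address, acl, text, metadata in protocol_handler(start):
            # Content and metadata are both parsed by the word breaker.
            for value in [text, *metadata.values()]:
                for word in word_breaker(value):
                    full_text_index.setdefault(word, set()).add(address)
            search_database[address] = {"metadata": metadata, "acl": acl}
    return full_text_index, search_database
```

Calling crawl with the start address file://server/docs returns a full-text index in which the word search maps to both files, and a search database that holds the Title and access control list of each file.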
Protocol Handlers and iFilters
The crawl process requires protocol handlers to connect to content sources and iFilters to access the data stored within files that are located at the content source.
Protocol Handlers
In general, protocol handlers are responsible for:
- Connecting to source systems over a given protocol, such as HTTP://.
- Traversing the source system.
- Identifying content nodes, such as files or Web pages.
- Invoking iFilters to read those content nodes.
- Retrieving any system-level metadata, such as permissions, and default properties such as Title.
- Returning streams of content and metadata to the indexing engine.
Protocol Handler Characteristics
The various protocol handlers necessarily exhibit different characteristics and behaviors because the corresponding source systems are very different:
- Web protocol handler. The Web protocol handler makes HTTP requests of the start addresses in a content source. It then traverses Web sites by following hyperlinks on Web pages. The Web protocol handler does not retrieve access control lists, so any content that is indexed will not be security-trimmed at query time.
- SharePoint protocol handler. The SharePoint protocol handler enumerates the content to index by invoking the SiteData.asmx Web service. The Web service returns a list that contains content nodes. The SharePoint protocol handler then makes HTTP requests for each node. Access control lists are also returned, so full security trimming can occur for SharePoint-based content.
- File protocol handler. The File protocol handler connects to and traverses file shares over the FILE:// protocol. The handler can traverse subfolders, and enumerates files for indexing. It then indexes those files over the FILE:// protocol. For security-trimming purposes, the File protocol handler retrieves NTFS access control lists, not file-share permissions.
- Exchange Public Folder protocol handler. Exchange Public Folders are crawled over the HTTP protocol by using Microsoft® Outlook® Web Access. Therefore, you must ensure that the Exchange Server administrator has enabled Office Outlook Web Access.
- Business Data Catalog protocol handler. The Business Data Catalog (BDC) enables data from relational databases and Web services to be used in Microsoft Office SharePoint Server 2007 solutions. The BDC protocol handler simply indexes the BDC application that is managed by Microsoft Office SharePoint Server 2007 by using an internal communication mechanism. The BDC application fetches the actual content to be indexed, by using either ADO.NET or Web services. The Business Data Catalog protocol handler supports full security trimming through a custom security trimmer. The BDC protocol handler does not load iFilters to read and index binary data. Therefore, you cannot index the contents of binary files by using the BDC.
- Lotus Notes protocol handler. The Lotus Notes protocol handler supports full security trimming. However, access control lists are retrieved through a one-to-one mapping of Windows user accounts to Lotus Notes users, by using a Lotus Notes View. Therefore, there is an administrative overhead involved in maintaining this mapping. Note that for your deployment to crawl a Lotus Notes database you must install the Lotus Notes client, which is used with the Lotus C++ API Toolkit to enable the protocol handler to access the Lotus Notes database. For a complete view of how to configure the crawl of Lotus Notes, see Configure Office SharePoint Server Search to crawl Lotus Notes (http://go.microsoft.com/fwlink/?LinkId=109151&clcid=0x409).
You can install custom protocol handlers for your deployment that provide access to new content over different protocols.
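The security-trimming behavior of the protocol handlers described in this section can be summarized as a lookup table. The table and function names below are hypothetical and exist only to restate the list above. The value None marks the Exchange Public Folder handler, for which this article does not state the behavior.

```python
# True  = results from this handler can be security-trimmed at query time.
# False = no access control lists are retrieved, so results are not trimmed.
# None  = not stated in this article.
SUPPORTS_SECURITY_TRIMMING = {
    "web": False,
    "sharepoint": True,
    "file": True,                    # NTFS ACLs, not file-share permissions
    "exchange public folder": None,
    "business data catalog": True,   # through a custom security trimmer
    "lotus notes": True,             # through a Windows-to-Notes user mapping
}


def trimming_status(source_type):
    """Describe whether content from a source type is security-trimmed."""
    key = source_type.strip().lower()
    if key not in SUPPORTS_SECURITY_TRIMMING:
        return "unknown content source type"
    supported = SUPPORTS_SECURITY_TRIMMING[key]
    if supported is None:
        return "not stated"
    return "security-trimmed" if supported else "not security-trimmed"
```

A table like this can be useful when you plan content sources, because content that is crawled by the Web protocol handler is returned to every user who searches for it.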
iFilters
The crawl process uses iFilters to open documents and other content source items in their native formats. The role of the iFilter is to filter out embedded formatting and retrieve information and properties from the documents. When the iFilter has performed the filtering action the information and properties that it has gathered are then stored in the content index.
When you deploy your Office SharePoint Server 2007 environment, you are provided with a set of iFilters out-of-the-box. You can expand this set to include other file types such as Portable Document Format (PDF) files. Files such as these are common within an organization, and you can install custom iFilters for your deployment to provide access to these files through Search.
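To make the filtering idea concrete, the following sketch strips the markup from an HTML document and returns its text and its Title property. It is an analogy only. Real iFilters are COM components that implement the IFilter interface, and this sketch does not use that interface. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HtmlTextFilter(HTMLParser):
    """Collects plain text and the Title property from HTML markup."""

    def __init__(self):
        super().__init__()
        self.properties = {}     # document properties, such as Title
        self.chunks = []         # text content with formatting removed
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.properties["Title"] = text
        else:
            self.chunks.append(text)


def filter_html(markup):
    """Return (text content, properties) for one HTML document."""
    html_filter = HtmlTextFilter()
    html_filter.feed(markup)
    html_filter.close()
    return " ".join(html_filter.chunks), html_filter.properties
```

The function returns the body text without any tags, together with a dictionary of properties, which mirrors the split between content and properties that the text describes.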
Word Breakers and Stemmers
When content has been streamed from iFilters and protocol handlers back to the indexer, the Office SharePoint Server 2007 indexing engine loads additional components to help with the indexing process for that stream of data.
Word Breakers in the Indexing Process
Word breakers are used at indexing time to break the stream of data into words that can be indexed. Core word breakers identify characters such as spaces and punctuation marks to demarcate the words to be indexed. For example, the simple phrase enterprise search would be broken into two separate words, and the hyphenated word security-trim would be broken into security and trim.
Additional word breakers may also be used, based on the detected language of the content being indexed, to break compound words into discrete words. For example, if the language of the content is detected as English, and the content contains the compound word nonfunctional, then the English language word breaker will identify both non and functional as words to be indexed, and will store them along with the compound word.
Word breakers also identify noise words, and ensure that they are not indexed. You can add noise words that are specific to your organization to the noise word files. The files are stored in the \Program Files\Microsoft Office Servers\12.0\Data\Config folder.
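The core behavior described above, which is to split on spaces and punctuation and then discard noise words, can be sketched as follows. The function name and the short noise word list are illustrative assumptions. The sketch does not read the real noise word files and does not attempt language-specific compound word breaking.

```python
import re

# Illustrative noise words only; the real lists are stored in the noise word files.
NOISE_WORDS = frozenset({"a", "an", "and", "of", "the"})


def break_words(text, noise_words=NOISE_WORDS):
    """Split text on spaces and punctuation, and drop noise words."""
    # [^\W_]+ matches runs of letters and digits, so hyphens, spaces,
    # and other punctuation marks all act as word boundaries.
    words = re.findall(r"[^\W_]+", text.lower())
    return [word for word in words if word not in noise_words]
```

With this sketch, the phrase enterprise search yields the two words enterprise and search, and security-trim yields security and trim, which matches the examples in the text.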
Word Breakers at Query Time
Word breakers are also used at query time to break the search terms supplied by the user into searchable words. The language-specific word breakers to be loaded are identified by the language of the user’s Web browser.
Stemmers
Stemmers perform a different function from word breakers. In general, stemmers deal with the different inflectional forms that a word can take while retaining essentially the same meaning. Stemmers generally deal with verbs and nouns. For example, a stemmer can analyze a verb such as run, and can recognize that runs, running, and ran are all essentially the same word. Similarly, a stemmer can analyze the noun cat, and can determine that the singular, plural, singular possessive and plural possessive forms are equivalent, such as cat, cats, cat’s, and cats’.
Stemmers in the Indexing Process
Stemming does not typically take place during the indexing process for most languages. However, the word breakers for two languages, Arabic and Hebrew, do perform stemming at indexing time. Where stemming takes place at indexing time, multiple forms of the word being stemmed are added to the index. This uses more space in the index, but the stored forms can be matched efficiently at query time. Stemming occurs at indexing time for these two languages because both have a relatively small base of stem terms, each of which can have hundreds, or even thousands, of inflectional forms. Consequently, it makes good sense to apply stemming at indexing time for these languages, so that the query process can run efficiently.
Stemmers at Query Time
For all languages other than Arabic and Hebrew, if stemming is required then it is generally preferable to perform it at query time. Because most languages have a wide base of stem terms, with only a small number of inflectional forms for each stem, it is relatively efficient to perform searches for stemmed words.
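Query-time stemming can be pictured as expanding the search term into its inflectional forms and then looking up each form in the index. The following sketch uses a tiny hand-written table built from the run and cat examples given earlier. Real stemmers are language-specific components, and the names INFLECTIONS, expand_term, and search are assumptions made for this illustration.

```python
# Hand-written inflection table for illustration only.
INFLECTIONS = {
    "run": {"run", "runs", "running", "ran"},
    "cat": {"cat", "cats", "cat's", "cats'"},
}

# Reverse lookup: any inflected form -> its stem.
FORM_TO_STEM = {
    form: stem for stem, forms in INFLECTIONS.items() for form in forms
}


def expand_term(term):
    """Return every known inflectional form of a query term."""
    term = term.lower()
    stem = FORM_TO_STEM.get(term)
    return set(INFLECTIONS[stem]) if stem else {term}


def search(index, term):
    """Look up all forms of the term in a word -> documents index."""
    hits = set()
    for form in expand_term(term):
        hits |= index.get(form, set())
    return hits
```

Because each stem has only a few forms in this table, the expansion adds only a few extra lookups per query term, which is the efficiency argument made above for languages other than Arabic and Hebrew.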
Enabling Language-Specific Stemmers
For a small number of languages, such as Czech, Danish, Finnish, Greek, Hungarian, Polish, and Turkish, language stemmers are not actually enabled at the server level. To enable these stemmers, you must edit the registry. You should refer to the Microsoft Knowledge Base (http://support.microsoft.com/kb/929912/en-us) for instructions on how to enable stemmers for these specific languages.
32-Bit and 64-Bit Architectures
Office SharePoint Server 2007 is available in 32-bit and 64-bit versions. Typically, you will observe much better performance, stability, and scalability if you implement your entire solution on a 64-bit architecture, but some scenarios may prevent you from doing so.
Query Servers and Web Servers
In general, you can use a mix of architectures for different servers in your farm. For example, you can use 32-bit Web servers and 64-bit query servers. However, we recommend that you do not mix architectures for multiple servers that perform the same role. For example, you should not implement one 32-bit Web server and one 64-bit Web server.
Indexers
Perhaps the most important decision about architecture affects the indexer. In general, the indexer will benefit from 64-bit architecture, but you must be aware that protocol handlers and iFilters are architecture-dependent. For example, if you install the 64-bit version of Office SharePoint Server 2007 for your indexer, you must ensure that you can obtain 64-bit versions of third-party iFilters and protocol handlers.
Index Management and Propagation
Office SharePoint Server 2007 implements a completely new index management scheme, compared to previous versions of SharePoint Products and Technologies.
Master and Shadow Indexes
Although you should view the full-text catalog file as a single logical file, it is actually managed as a main master index file and a series of shadow index files. This scheme enables indexing operations to occur efficiently because small shadow files are created and written to, instead of having to update the potentially large master index file when new content is indexed. When about 10% of the content has been updated, the shadow indexes are all merged into the master, and then new shadow index files are created. This process is called a master merge and can occur more than once during the crawl process on the indexer.
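The master and shadow scheme can be sketched with dictionaries. This is a simplified model, not the on-disk catalog format. The class name FullTextCatalog is hypothetical, and measuring the 10% threshold as the share of documents that appear in shadow indexes is an assumption, because the text does not define how the percentage is calculated.

```python
class FullTextCatalog:
    """Illustrative model: one master index plus small shadow indexes."""

    def __init__(self, merge_threshold=0.10):
        self.master = {}            # word -> set of document ids
        self.shadows = []           # list of small word -> document ids dicts
        self.merge_threshold = merge_threshold
        self.master_merges = 0

    @staticmethod
    def _documents(indexes):
        return {doc for index in indexes for docs in index.values() for doc in docs}

    def _changed_fraction(self):
        changed = self._documents(self.shadows)
        total = changed | self._documents([self.master])
        return len(changed) / len(total) if total else 0.0

    def add_shadow(self, postings):
        # New content is written to a small shadow index, not to the master.
        self.shadows.append({word: set(docs) for word, docs in postings.items()})
        if self._changed_fraction() >= self.merge_threshold:
            self.master_merge()

    def master_merge(self):
        # Fold every shadow index into the master, then start with no shadows.
        for shadow in self.shadows:
            for word, docs in shadow.items():
                self.master.setdefault(word, set()).update(docs)
        self.shadows = []
        self.master_merges += 1

    def lookup(self, word):
        # A query consults the master and all shadow indexes.
        hits = set(self.master.get(word, set()))
        for shadow in self.shadows:
            hits |= shadow.get(word, set())
        return hits
```

In this model, a lookup returns the same documents before and after a master merge, because it reads both the master and the shadow indexes.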
Continuous Propagation from Indexer to Query Servers
If you have physically separated the query server role from the indexer, then the full-text index that is built and maintained by the indexer is propagated to the query server. You may implement multiple query servers in your farm, in which case the index is propagated to all query servers. The propagated index can then be used by query servers to satisfy searches performed by users and applications.
The full-text index is propagated to the query server(s) throughout the indexing process in small fragments (shadow index files). The propagation is efficient and continuous, so you are not required to administer the process. In most scenarios, a piece of data that has been crawled by the indexer will be searchable on a query server within a few seconds.
Query servers perform index management processes similar to those of indexers, such as merging master and shadow indexes, but you are not required to administer this process. Furthermore, it does not matter whether a piece of data that has been indexed resides in the master index or a shadow index; it is still searchable.
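Continuous propagation can be pictured as the indexer handing each new index fragment to every query server as soon as the fragment is written. The sketch below models only that fan-out. The Indexer and QueryServer classes are hypothetical, fragments are plain dictionaries, and network transfer and merging on the query server are left out.

```python
class QueryServer:
    """Illustrative query server that stores propagated index fragments."""

    def __init__(self, name):
        self.name = name
        self.fragments = []      # each fragment: word -> set of document ids

    def receive(self, fragment):
        # Keep an independent copy, as a real server would hold its own files.
        self.fragments.append({word: set(docs) for word, docs in fragment.items()})

    def search(self, word):
        hits = set()
        for fragment in self.fragments:
            hits |= fragment.get(word, set())
        return hits


class Indexer:
    """Illustrative indexer that propagates each fragment to all query servers."""

    def __init__(self, query_servers):
        self.query_servers = list(query_servers)

    def index_batch(self, fragment):
        # Propagate the new fragment to every query server in the farm.
        for server in self.query_servers:
            server.receive(fragment)
```

After the indexer processes a batch, every query server in the list can return the new documents, without any separate administrative step.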
Query Processes
The query process consists of the following general steps:
- Query terms are collected by a Web server.
- Query terms are supplemented with contextual information, such as the identity of the user and from which site collection they are performing the search.
- The Web server initiates the query by contacting a query server to run the query on the full-text index. Stemmers and thesaurus expansion are used (if activated). The Web server also contacts the search database for managed properties and access control lists.
- The Web server security-trims the results based on the identity of the user and the access control lists, and the Web server then returns the trimmed results to the caller.
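The query steps above can be sketched end to end with in-memory stand-ins for the full-text index and the search database. This sketch makes several simplifying assumptions: access control lists are flat sets of user names with no groups, stemming and thesaurus expansion are omitted, and the run_query function is a hypothetical name, not part of the Search object model.

```python
def run_query(term, user, full_text_index, search_database):
    """Return the security-trimmed results for one query term.

    full_text_index: word -> set of item URLs (the query server's role).
    search_database: URL -> {"title": ..., "acl": set of user names}
                     (the database server's role).
    """
    # The query server finds items that contain the term.
    hits = sorted(full_text_index.get(term.lower(), set()))

    # The search database supplies managed properties and access control
    # lists for those items.
    rows = [(url, search_database[url]) for url in hits]

    # The Web server security-trims the results for the calling user.
    return [
        {"url": url, "title": row["title"]}
        for url, row in rows
        if user in row["acl"]
    ]
```

A user who does not appear in the access control list of an item never sees that item in the results, even though the item matches the query term in the full-text index.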