SharePoint 2007: How to Plan for Crawling Content

The purpose of this article is to help search services administrators understand how Microsoft Office SharePoint Server 2007 crawls and indexes content and to help them plan to crawl content.

Before end users can use the enterprise search functionality in Office SharePoint Server 2007 to search for content, you must first crawl the content that you want to make available for users to query.

For the purpose of this article, content is any item that can be crawled, such as a Web page, a Microsoft Office Word document, business data, or an e-mail message file.

When planning to crawl content, you should consider the following questions:

  • Where is the content that you want to crawl physically located?

  • Is some of the content that you want to crawl stored in different types of sources, such as file shares, SharePoint sites, Web sites, or other places?

  • Do you want to crawl all the content at specific sources or just some of it?

  • What types of files make up the content that you want to crawl?

  • When and how often should you crawl content?

  • How is this content secured?

Use the information in this article to help you answer these questions and make the necessary planning decisions about the content you want to crawl and how and when you want to crawl that content.

Office SharePoint Server 2007 includes the Office SharePoint Server Search service, which is used to crawl and index content. This service is part of an SSP and all content crawled using a particular SSP is indexed in a single index. For information about choosing the number of SSPs to use to index content, see Plan Shared Services Providers.

About crawling and indexing content

Crawling and indexing content is the process by which the system accesses and parses content and its properties, sometimes called metadata, to build a content index from which search queries can be served.

The result of successfully crawling content is that the individual files or pieces of content that you want to make available to search queries are accessed and read by the crawler. The keywords and metadata for those files are stored in the content index, sometimes called the index. The index consists of the keywords that are stored in the file system of the index server and the metadata that is stored in the search database. The system maintains a mapping between the keywords, the metadata associated with the individual pieces of content from which the keywords were crawled, and the URL of the source from which the content was crawled.

Note: The crawler does not change the files on the host servers in any way. Instead, the files on the host servers are simply accessed, read, and the text and metadata for those files are sent to the index server to be indexed. However, because the crawler reads the content on the host server, some servers that host certain sources of content might update the last accessed date on files that have been crawled.

Identify the sources of content that you want to crawl

In many cases, the needs of your organization might only require that you crawl all the content contained by the SharePoint sites in your organization's server farm. In this case, you might not need to identify the sources of content you want to crawl because all site collections in a server farm can be crawled using the default content source. For more information about the default content source, see "Plan content sources" later in this article.

Many organizations also need to crawl content that is external to the server farm, such as file shares or Web sites on the Internet. Office SharePoint Server 2007 can crawl and index content that is hosted on other Windows SharePoint Services or Office SharePoint Server farms, Web sites, file shares, Microsoft Exchange public folders, IBM Lotus Notes servers, and business data that is stored in databases. This greatly increases the amount of content that can be made available to search queries.

In many cases, however, you might not want to crawl every site collection in your server farm, because content stored in some site collections might not be relevant in search results. In this case, you must do one or both of the following:

  • Note the site collections that you do not want to crawl. If you decide to use the default content source, you must ensure that the start addresses for the site collections you do not want to crawl are not listed in the default content source.

  • Note the individual start addresses of the site collections that you do want to crawl. If you decide to create additional content sources to use to crawl this content, you need to know these start addresses. For information about when to use one or more content sources, see "Plan content sources" later in this article.

With the Infrastructure Update for Microsoft Office Servers installed, there are two ways to process search queries in order to return search results to users. You can query the Search Server content index, or you can use federated search. Note that the Infrastructure Update for Microsoft Office Servers provides Office SharePoint Server 2007 with the federated search capability that first appeared in Search Server 2008.

There are advantages to each approach. For a comparison of these two approaches to processing search queries, see Federated Search Overview (http://go.microsoft.com/fwlink/?LinkID=122651). For a list and brief description of articles about understanding and using federation, see Working with Federation (Office SharePoint Server). For more information about the Infrastructure Update for Microsoft Office Servers, see Install the Infrastructure Update for Microsoft Office Servers (Office SharePoint Server 2007).

Plan content sources

Before you can crawl content, you must first determine where the content is and on what types of servers the content is hosted. After this information is gathered, a shared services administrator can create one or more content sources that are used to crawl that content. These content sources provide the following information to the crawler during a crawl:

  • Type of content you want to crawl — for example, a SharePoint site or a file share.

  • Start address from which to start crawling.

  • Behavior to use when crawling — for example, how deep to crawl from the start address, or how many server hops to allow.

  • Crawling schedule. 

This section helps you plan for the content sources needed by your organization.

The default content source is called Local Office SharePoint Server sites. Shared services administrators can use this content source to crawl and index all content in all Web applications associated with the SSP. By default, Office SharePoint Server 2007 adds the start address (in this case a URL) of the top-level site of each site collection created in the Web application that uses the same SSP to the default content source.

For some organizations, simply using the default content source to crawl all sites in their site collections satisfies their search requirements. However, many organizations need additional content sources.

Reasons for creating additional content sources include the need to:

  • Crawl different types of content.

  • Crawl some content on different schedules than other content.

  • Limit or increase the quantity of content that is crawled.

Shared services administrators can create up to 500 content sources in each SSP and each content source can contain up to 500 start addresses. To keep administration as simple as possible, you should create only as many content sources as you need.
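As a planning aid, the limits above can be checked against a draft list of content sources before anything is configured. The following is a minimal sketch, not part of the product: the function name, the plan dictionary, and the example addresses are hypothetical, and only the two limits of 500 come from this article.

```python
MAX_CONTENT_SOURCES = 500   # per SSP, as stated in this article
MAX_START_ADDRESSES = 500   # per content source, as stated in this article


def check_plan_limits(plan):
    """Return a list of problems for a planned set of content sources.

    `plan` maps a content source name to its list of start addresses.
    """
    problems = []
    if len(plan) > MAX_CONTENT_SOURCES:
        problems.append(
            f"{len(plan)} content sources planned; the limit is {MAX_CONTENT_SOURCES} per SSP"
        )
    for name, addresses in plan.items():
        if len(addresses) > MAX_START_ADDRESSES:
            problems.append(
                f"'{name}' has {len(addresses)} start addresses; "
                f"the limit is {MAX_START_ADDRESSES}"
            )
    return problems


# Hypothetical draft plan: the second content source has one address too many.
example_plan = {
    "Local Office SharePoint Server sites": ["http://contoso"],
    "Archive file shares": [f"file://fileserver/archive{i}" for i in range(501)],
}
limit_problems = check_plan_limits(example_plan)
for problem in limit_problems:
    print(problem)
```

A check like this is only useful on paper plans; the product enforces its own limits when content sources are created.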

Crawl different types of content

You can only crawl one type of content per content source. That is, you can create a content source that contains URLs for SharePoint sites and another that contains URLs for file shares, but you cannot create a single content source that contains URLs to both SharePoint sites and file shares. The following table lists the types of content sources that can be configured.

This type of content source | Includes this type of content
----------------------------|------------------------------
SharePoint sites | SharePoint sites from the same farm or different Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008 farms. SharePoint sites from Microsoft Office SharePoint Portal Server 2003 or Microsoft Windows SharePoint Services 2.0 farms (see the first note after this table).
Web sites | Other Web content in your organization not found on SharePoint sites. Content on Web sites on the Internet.
File shares | Content on file shares within your organization.
Exchange public folders | Microsoft Exchange Server content.
Lotus Notes | E-mail messages stored in Lotus Notes databases (see the second note after this table).
Business data | Business data stored in line-of-business applications.

Note: Unlike crawling SharePoint sites on Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008, the crawler cannot automatically crawl all subsites in a site collection from previous versions of SharePoint Products and Technologies. Therefore, when crawling SharePoint sites from previous versions, you must specify the URL of each top-level site and each subsite that you want to crawl. Sites listed in the Site Directory of Microsoft Office SharePoint Portal Server 2003 farms are crawled when the portal site is crawled. For more information about the Site Directory, see About the Site Directory (http://go.microsoft.com/fwlink/?LinkId=88227&clcid=0x409).

Note: Unlike all other types of content sources, the Lotus Notes content source option does not appear in the user interface until you have installed and configured the appropriate prerequisite software. For more information, see Configure Office SharePoint Server Search to crawl Lotus Notes (Office SharePoint Server 2007).
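Because each content source holds only one type of content, a content inventory can be grouped by type to get a first draft of the content sources you need. The sketch below assumes a hypothetical inventory in which a planner has already labeled each start address with one of the six types listed in this section; the addresses and the function name are illustrative only.

```python
from collections import defaultdict

# The six content source types described in this article.
CONTENT_SOURCE_TYPES = {
    "SharePoint sites",
    "Web sites",
    "File shares",
    "Exchange public folders",
    "Lotus Notes",
    "Business data",
}


def group_by_type(inventory):
    """Group (start_address, source_type) pairs into one draft list per type."""
    grouped = defaultdict(list)
    for address, source_type in inventory:
        if source_type not in CONTENT_SOURCE_TYPES:
            raise ValueError(f"Unknown content source type: {source_type!r}")
        grouped[source_type].append(address)
    return dict(grouped)


# Hypothetical inventory gathered during planning.
inventory = [
    ("http://contoso", "SharePoint sites"),
    ("http://contoso/sites/site1", "SharePoint sites"),
    ("file://fileserver/reports", "File shares"),
    ("http://www.adventure-works.com", "Web sites"),
]
draft_sources = group_by_type(inventory)
for source_type, addresses in draft_sources.items():
    print(source_type, addresses)
```

The result is a minimum: schedules and crawl settings, discussed later in this article, can still require splitting one type across several content sources.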

Plan content sources for business data 

Business data content sources require that the applications hosting the data are first registered in the Business Data Catalog. You must create one or more separate content sources of the Business Data content source types to crawl business data. You can create one content source to crawl all applications registered in the Business Data Catalog, or you can create separate content sources to crawl individual applications that are registered in the Business Data Catalog.

Often, the people who plan for integration of business data into your site collections are not the same people involved in the overall content planning process. Therefore, include business application administrators in your content planning teams so that they can advise you how to integrate their data into your other content and effectively present it in your site collections.

For more information about planning business data search, see Plan for business data search.

Crawl content on different schedules

Shared services administrators often must decide whether some content is crawled more frequently than other content. The larger the volume of content that you crawl, the more likely it is that you are crawling content from different sources. These different sources might or might not be of the same type and might be hosted on servers of varying speeds in relation to one another.

These factors make it more likely that you need additional content sources to crawl those different sources of content at different times.

Primary reasons that content is crawled on different schedules are:

  • To accommodate downtimes and periods of peak usage.

  • To more frequently crawl the content that is updated more frequently.

  • To crawl content hosted on slower host servers separately from content crawled on faster host servers.

In many cases, not all of this information can be known until after Office SharePoint Server 2007 is deployed and running for some time. Instead, some of these decisions are made during the operations phase. However, it is a good idea to consider these factors during planning so that you can plan your crawl schedules based on the information at hand.

The following two sections provide more information about crawling content on different schedules.

Downtimes and periods of peak usage

Consider the downtimes and peak usage times of the servers that host the content you want to crawl. For example, if you are crawling content hosted by many different servers outside your server farm, it is likely that these servers are backed up on different schedules and have different peak usage times. The administration of servers outside your server farm is typically out of your control. Therefore, we recommend that you coordinate your crawls with the administrators of the servers that host the content you want to crawl to ensure that you do not attempt to crawl content on their servers during a downtime or peak usage time.

A common scenario involves content outside the control of your organization that relates to the content on your SharePoint sites. You can add the start addresses for this content to an existing content source or create a new content source for external content. Because availability of external sites varies widely, it is helpful to add separate content sources for different external content. In this way, the content sources for external content can be crawled at different times than your other content sources. You can then update external content on a crawl schedule that accounts for the availability of each site.

Content that is updated frequently

When planning crawl schedules, consider that some sources of content are typically updated more frequently than others. For example, if you know that content on some site collections or external sources is updated only on Fridays, it would be a waste of resources to crawl that content more frequently than once each week. However, your server farm might contain other site collections that are continually updated Monday through Friday, but not typically updated on Saturdays and Sundays. In this case, you might want to crawl several times each weekday, but only once or twice on weekends.

The way in which content is stored across the site collections in your environment can guide you to create additional content sources for each of your site collections in each of your Web applications. For example, if a site collection stores only archived information, you may not need to crawl that content as frequently as you crawl a site collection that stores frequently updated content. In this case, you might want to crawl these two site collections using different content sources so that they can be crawled on different schedules without having to crawl the archive sites as frequently as the other content.

Full and incremental crawl schedules

Shared services administrators can configure the crawl schedules independently for each content source. For each content source, they can specify a time to do full crawls and a separate time to do incremental crawls. Note that you must run a full crawl for a particular content source before you can run an incremental crawl. If you choose an incremental crawl for content that has not yet been crawled, the system performs a full crawl.

Because a full crawl crawls all content that the crawler encounters and has at least read access to, regardless of whether that content has been previously crawled, full crawls can take significantly more time to complete than incremental crawls.

We recommend that you plan crawl schedules based on the availability, performance, and bandwidth considerations of the servers running the search service and the servers hosting the crawled content.

When you plan crawl schedules, consider the following best practices:

  • Group start addresses in content sources based on similar availability and with acceptable overall resource usage for the servers that host the content.

  • Schedule incremental crawls for each content source during times when the servers that host the content are available and when there is low demand on the resources of the server.

  • Stagger crawl schedules so that the load on the servers in your farm is distributed over time.

  • Schedule full crawls only when necessary for the reasons listed in the next section. We recommend that you do full crawls less frequently than incremental crawls.

  • Schedule administration changes that require a full crawl to occur shortly before the planned schedule for full crawls. For example, we recommend that you attempt to schedule the creation of the crawl rule before the next scheduled full crawl so that an additional full crawl is not necessary.

  • Base simultaneous crawls on the capacity of the index server to crawl them. For best performance, we recommend that you stagger your crawl schedules so that the index server does not crawl using multiple content sources at the same time. The performance of the index server and the servers hosting the content determines the extent to which crawls can overlap. You can develop a crawl scheduling strategy over time as you become familiar with the typical crawl durations for each content source.
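The staggering practice in the list above can be worked out on paper before schedules are entered. The following is a minimal sketch that assumes you already have rough duration estimates for each content source; the durations, the 15-minute gap, the start time, and the function name are all hypothetical planning inputs, not product settings.

```python
from datetime import datetime, timedelta


def stagger_crawls(window_start, estimated_minutes, gap_minutes=15):
    """Assign each content source a start time so that no two crawls overlap.

    `estimated_minutes` maps a content source name to its estimated crawl
    duration. Sources are scheduled back to back in the order given, with a
    safety gap between them.
    """
    schedule = {}
    next_start = window_start
    for name, minutes in estimated_minutes.items():
        schedule[name] = next_start
        next_start += timedelta(minutes=minutes + gap_minutes)
    return schedule


# Hypothetical duration estimates, in minutes.
durations = {
    "Local Office SharePoint Server sites": 90,
    "File shares": 45,
    "External Web sites": 30,
}
staggered = stagger_crawls(datetime(2009, 1, 5, 22, 0), durations)
for name, start in staggered.items():
    print(f"{start:%a %H:%M}  {name}")
```

Scheduling strictly back to back is the most conservative choice. If the index server and the host servers have spare capacity, some overlap may be acceptable, and the estimates should be revised as real crawl durations become known.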

Reasons for a search services administrator to do a full crawl include:

  • One or more hotfixes or service packs were installed on servers in the farm. See the instructions for the hotfix or service pack for more information.

  • An SSP administrator added a new managed property.

  • To re-index ASPX pages on Windows SharePoint Services 3.0 or Office SharePoint Server 2007 sites. The crawler cannot discover when ASPX pages on Windows SharePoint Services 3.0 or Office SharePoint Server 2007 sites have changed. Because of this, incremental crawls do not re-index views or home pages when individual list items are deleted. We recommend that you periodically do full crawls of sites that contain ASPX files to ensure that these pages are re-indexed.

  • To detect security changes that were made on a file share after the last full crawl of the file share.

  • To resolve consecutive incremental crawl failures. In rare cases, if an incremental crawl fails one hundred consecutive times at any level in a repository, the index server removes the affected content from the index.

  • Crawl rules have been added, deleted, or modified.

  • To repair a corrupted index.

  • The search services administrator has created one or more server name mappings.

  • The account assigned to the default content access account or crawl rule has changed.

The system does a full crawl even when an incremental crawl is requested under the following circumstances:

  • An SSP administrator stopped the previous crawl.

  • A content database was restored from backup. If you are running the Infrastructure Update for Microsoft Office Servers, you can use the restore operation of the stsadm command-line tool to change whether a content database restore causes a full crawl.

  • A farm administrator has detached and reattached a content database.

  • A full crawl of the site has never been done.

  • The change log does not contain entries for the addresses that are being crawled. Without entries in the change log for the items being crawled, incremental crawls cannot occur.

  • The account assigned to the default content access account or crawl rule has changed.

  • To repair a corrupted index. Depending upon the severity of the corruption, the system might attempt to perform a full crawl if corruption is detected in the index.
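The circumstances listed above can be summarized as a simple decision: a requested incremental crawl runs as a full crawl if any of the conditions holds. The sketch below only models that decision for planning purposes. The class and field names are hypothetical, it is not how the product is implemented, and it leaves out index corruption because the system only might start a full crawl in that case.

```python
from dataclasses import asdict, dataclass


@dataclass
class CrawlState:
    """Hypothetical flags, one per circumstance listed in this section."""
    previous_crawl_stopped: bool = False
    content_database_restored: bool = False
    content_database_reattached: bool = False
    never_fully_crawled: bool = False
    change_log_missing_entries: bool = False
    content_access_account_changed: bool = False


def effective_crawl_type(requested, state):
    """Return the crawl type to plan for: 'full' or 'incremental'."""
    if requested == "full":
        return "full"
    # Any single condition is enough to turn an incremental crawl into a full crawl.
    forced = any(asdict(state).values())
    return "full" if forced else "incremental"


print(effective_crawl_type("incremental", CrawlState()))
print(effective_crawl_type("incremental", CrawlState(content_database_restored=True)))
```

When one of these conditions applies, allow for the longer duration of a full crawl in that content source's schedule.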

You can adjust schedules after the initial deployment based on the performance and capacity of servers in the farm and the servers hosting content.  

Limit or increase the quantity of content that is crawled

For each content source, you can select how extensively to crawl the start addresses in that content source. You also specify the behavior of the crawl, sometimes called the crawl settings. The options you can choose for a particular content source vary somewhat based on the content source type that you select. However, most options determine how many levels deep in the hierarchy from each start address listed in the content source are crawled. Note that this behavior is applied to all start addresses in a particular content source. If you need to crawl some sites at deeper levels, you can create additional content sources that encompass those sites.

The options available in the properties for each content source vary depending upon the content source type that is selected. The following table describes the crawl settings options for each content source type. 

Content source type | Crawl settings options
--------------------|-----------------------
SharePoint sites | Everything under the host name for each start address. Only the SharePoint site of each start address.
Web sites | Only within the server of each start address. Only the first page of each start address. Custom: specify page depth and number of server hops. The default setting for this option is unlimited page depths and server hops.
File shares | The folder and all subfolders of each start address. Only the folder of each start address.
Exchange public folders | The folder and all subfolders of each start address. Only the folder of each start address.
Business data | Crawl entire Business Data Catalog. Crawl selected applications.

As the preceding table shows, shared services administrators can use crawl setting options to limit or increase the quantity of content that is crawled.

The following table describes best practices when configuring crawl setting options.

For this content source type | If this pertains | Use this crawl setting option
-----------------------------|------------------|------------------------------
SharePoint sites | You want to include the content on the site itself. -or- You do not want to include the content available on subsites, or you want to crawl them on a different schedule. | Crawl only the SharePoint site of each start address
SharePoint sites | You want to include the content on the site itself. -or- You want to crawl all content under the start address on the same schedule. | Crawl everything under the host name of each start address
Web sites | Content on the site itself is relevant. -or- Content available on linked sites is not likely to be relevant. | Crawl only within the server of each start address
Web sites | Relevant content is on only the first page. | Crawl only the first page of each start address
Web sites | You want to limit how deep to crawl the links on the start addresses. | Custom: specify the number of pages deep and number of server hops to crawl. We recommend you start with a small number on a highly connected site because specifying more than three pages deep or more than three server hops can crawl the entire Internet.
File shares, Exchange public folders | Content available in the subfolders is not likely to be relevant. | Crawl only the folder of each start address
File shares, Exchange public folders | Content in the subfolders is likely to be relevant. | Crawl the folder and all subfolders of each start address
Business data | All applications that are registered in the Business Data Catalog contain relevant content. | Crawl the entire Business Data Catalog
Business data | Not all applications that are registered in the Business Data Catalog contain relevant content. -or- You want to crawl some applications on a different schedule. | Crawl selected applications
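To get a feel for how the Custom setting for Web sites widens or narrows a crawl, you can model page depth and server hops on a small link map. The sketch below is an illustration only. It assumes that page depth counts links followed from the start address and that a server hop is counted each time a link leads to a different host name, which may not match the crawler's exact rules. The link map and function name are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def pages_in_scope(links, start, max_depth, max_hops):
    """Return the pages reachable within the given page depth and server hops.

    `links` maps a page URL to the list of URLs it links to.
    """
    reached = {start}
    seen_states = {(start, 0)}          # (url, hops used to get there)
    queue = deque([(start, 0, 0)])      # (url, depth, hops)
    while queue:
        url, depth, hops = queue.popleft()
        if depth == max_depth:
            continue                    # do not follow links any deeper
        for target in links.get(url, []):
            crossed = urlsplit(target).hostname != urlsplit(url).hostname
            new_hops = hops + (1 if crossed else 0)
            if new_hops > max_hops or (target, new_hops) in seen_states:
                continue
            seen_states.add((target, new_hops))
            reached.add(target)
            queue.append((target, depth + 1, new_hops))
    return reached


# Hypothetical link map for three hosts.
links = {
    "http://contoso/": ["http://contoso/a", "http://partner/"],
    "http://contoso/a": ["http://contoso/b"],
    "http://partner/": ["http://partner/x", "http://other/"],
}
narrow = pages_in_scope(links, "http://contoso/", max_depth=1, max_hops=0)
wide = pages_in_scope(links, "http://contoso/", max_depth=2, max_hops=1)
print(sorted(narrow))
print(sorted(wide))
```

Even in this tiny example, raising both limits by one more than doubles the number of pages in scope, which is why starting with small values is recommended for highly connected sites.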

 

Plan file-type inclusions and IFilters

Content is only crawled if the relevant file name extension is included in the file-type inclusions list and an IFilter is installed on the index server that supports those file types. Several file types are included automatically during initial installation. When you plan for content sources in your initial deployment, determine whether content you want to crawl uses file types that are not included. If file types are not included, you must add those file types on the Manage File Types page during deployment and ensure that an IFilter is installed and registered to support that file type.
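The two conditions above can be checked against a list of sample file names during planning. The following is a minimal sketch: the two extension sets are hypothetical placeholders, not the product defaults, and should be replaced with the values from your Manage File Types page and your installed IFilters.

```python
import os


def will_be_crawled(filename, included_extensions, ifilter_extensions):
    """Return True only if the extension is included AND an IFilter supports it."""
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    return extension in included_extensions and extension in ifilter_extensions


# Hypothetical planning inputs, not the actual default lists.
included = {"doc", "docx", "pdf"}
ifilters = {"doc", "docx", "one"}

for name in ["plan.DOCX", "report.pdf", "notes.one"]:
    print(name, will_be_crawled(name, included, ifilters))
```

In this example, report.pdf fails because no IFilter is registered for it, and notes.one fails because the extension is not in the inclusions list. Both gaps must be closed before such files are crawled.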

Office SharePoint Server 2007 provides several IFilters, and more are available from Microsoft and third-party vendors. For information about how to install and register additional IFilters that are available from Microsoft, see How to register Microsoft Filter Pack with SharePoint Server 2007 and with Search Server 2008 (http://go.microsoft.com/fwlink/?LinkId=110532). If necessary, software developers can create IFilters for new file types.

On the other hand, if you want to exclude certain file types from being crawled, you can delete the file name extension for that file type from the file type inclusions list. Doing so excludes file names that have that extension from being crawled.

For a table that lists which file types are supported by the IFilters that are installed by default and which file types are enabled on the Manage File Types page by default, see IFilters in Office SharePoint Server 2007.

IFilters and Microsoft Office OneNote

An IFilter is not provided for the .one file name extension used by Microsoft Office OneNote. If you want users to be able to search content in Office OneNote files, you must install an IFilter for OneNote. To do this, you must do one of the following.

Limit or exclude content by using crawl rules

When you add a start address to a content source and accept the default behavior, all subsites or folders below that start address are crawled unless you exclude them by using one or more crawl rules.

For more information about crawl rules, see Plan crawl rules later in this article.

Other considerations when planning content sources

You cannot crawl the same addresses using multiple content sources. For example, if you use a particular content source to crawl a site collection and all of its subsites, you cannot use a different content source to crawl one of those subsites separately on a different schedule. To accommodate this restriction, you might need to crawl some of these sites separately. Consider the following scenario:

The SSP administrator at Contoso wants to crawl http://contoso, which contains the http://contoso/sites/site1 and http://contoso/sites/site2 subsites. He wants to crawl http://contoso/sites/site2 on a different schedule than the other sites. To achieve this, he adds the addresses http://contoso and http://contoso/sites/site1 to one content source and selects the Crawl only the SharePoint site of each start address setting. He then adds http://contoso/sites/site2 to another content source and specifies a different schedule for that content source.
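The Contoso scenario can be checked on paper by listing which content source would claim each address. The sketch below is a simplified model with hypothetical names: the "site_only" setting is treated as an exact match on the start address, and the "everything_under_host" setting as a match on the host name. It is a planning aid under those assumptions, not a description of how the crawler resolves addresses.

```python
from urllib.parse import urlsplit


def covers(source, address):
    """Return True if the content source would crawl the given address."""
    for start in source["start_addresses"]:
        if source["setting"] == "site_only":
            if address.rstrip("/") == start.rstrip("/"):
                return True
        elif urlsplit(address).hostname == urlsplit(start).hostname:
            return True
    return False


def overlapping(sources, addresses):
    """Map each address claimed by more than one content source to its owners."""
    result = {}
    for address in addresses:
        owners = [name for name, source in sources.items() if covers(source, address)]
        if len(owners) > 1:
            result[address] = owners
    return result


sites = ["http://contoso", "http://contoso/sites/site1", "http://contoso/sites/site2"]

# The plan from the scenario: two content sources, both crawling only the listed sites.
scenario_plan = {
    "Main": {"setting": "site_only",
             "start_addresses": ["http://contoso", "http://contoso/sites/site1"]},
    "Site2": {"setting": "site_only",
              "start_addresses": ["http://contoso/sites/site2"]},
}
# A conflicting plan: the first source crawls everything under the host name.
conflicting_plan = {
    "Main": {"setting": "everything_under_host", "start_addresses": ["http://contoso"]},
    "Site2": {"setting": "site_only",
              "start_addresses": ["http://contoso/sites/site2"]},
}
print(overlapping(scenario_plan, sites))
print(overlapping(conflicting_plan, sites))
```

The scenario plan produces no overlaps, while the conflicting plan reports http://contoso/sites/site2 as claimed by both content sources.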

In addition to crawl schedules, there are other things to consider when planning content sources. For example, whether you group start addresses in a single content source or create additional content sources to crawl those start addresses depends largely upon administration considerations. Administrators often make changes that require a full update of a particular content source. Changes to a content source require a full crawl of that content source. To make administration easier, organize content sources in such a way that updating content sources, crawl rules, and crawling content is convenient for administrators.

Content sources summary

Consider the following when planning your content sources:

  • A particular content source can be used to crawl only one of the following content types: SharePoint sites, Web sites that are not SharePoint sites, file shares, Exchange public folders, Lotus Notes databases, and business data.

  • Shared services administrators can create up to 500 content sources in each SSP, and each content source can contain up to 500 start addresses. To keep administration as simple as possible, you should create only as many content sources as you absolutely need.

  • Each URL in a particular content source must be of the same content source type.

  • For a particular content source, you can choose how deep to crawl from the start addresses. These configuration settings apply to all start addresses in the content source. The available choices on how deep you can crawl the start addresses differ depending upon the content source type that is selected.

  • You can schedule when to perform either a full or incremental crawl for the entire content source. For more information about scheduling crawls, see "Full and incremental crawl schedules" earlier in this article.

  • Shared services administrators can modify the default content source, create additional content sources for crawling other content, or both. For example, they can configure the default content source to also crawl content on a different server farm or they can create a new content source to crawl other content.

  • To effectively crawl all the content needed by your organization, use as many content sources as make sense for the types of sources you want to crawl, and for the frequency at which you plan to crawl them.

Plan for authentication

When the crawler accesses the start addresses that are listed in content sources, the crawler must be authenticated by and granted access to the servers that host that content. This means that the domain account used by the crawler must have at least read permission to the content.

The default content access account is the account that is used by default when crawling content sources. This account is specified by the shared services administrator. Alternatively, you can use crawl rules to specify a different content access account to use when crawling particular content. Regardless of whether you use the default content access account or a different content access account specified by a crawl rule, the content access account that you use must have read access to all content that is crawled, or the content is not crawled and is not available to queries.

We recommend that you select a default content access account that has the broadest access to most of your crawled content, and only use other content access accounts when security considerations require separate content access accounts. For information about creating separate content access accounts to crawl content that cannot be read using the default content access account, see Plan crawl rules later in this article.

For each content source you plan, identify the start addresses that cannot be accessed by the default content access account and plan to add crawl rules for URL patterns that encompass those start addresses.  

Ensure that the domain account used for the default content access account or any other content access account is not the same domain account that is used by an application pool associated with any Web application you crawl. Doing so can cause unpublished content in SharePoint sites and minor versions of files (history) in SharePoint sites to be crawled and indexed.

Another important consideration is that the crawler must use the same authentication method as the host server. By default, the crawler attempts to authenticate using NTLM authentication. You can configure the crawler to use a different authentication method, if necessary. For more information, see "Authentication requirements for crawling content" in Plan authentication methods (Office SharePoint Server).

Plan protocol handlers

All content that is crawled requires the use of a protocol handler to gain access to that content. Office SharePoint Server 2007 provides protocol handlers for all common Internet protocols. However, if you want to crawl content that requires a protocol handler that is not installed with Office SharePoint Server 2007, you must install the third-party or custom protocol handler before you can crawl that content.

For a list that shows the protocol handlers that are installed by default, see Protocol handlers in Office SharePoint Server 2007.

Plan to manage the impact of crawling

Crawling content can significantly decrease the performance of the servers that host the content. The impact that this has on a particular server varies depending upon the load that the host server is experiencing and whether the server has sufficient resources (particularly CPU and RAM) to maintain service level agreements under normal or peak usage.

Crawler impact rules enable farm administrators to manage the impact your crawler has on the servers being crawled. For each crawler impact rule, you can specify a single URL or use wildcard characters in the URL path to include a block of URLs to which the rule applies. You can then specify how many simultaneous requests for pages are made to the specified URL or choose to request only one document at a time and wait a number of seconds that you choose between requests.

Crawler impact rules reduce or increase the rate at which the crawler requests content from a particular start address or range of start addresses (sometimes called a site name), regardless of the content source used to crawl those addresses. The following table shows the wildcard characters that you can use in the site name when adding a rule. 

| Wildcard to use | Result |
|---|---|
| * as the site name | Applies the rule to all sites. |
| *.* as the site name | Applies the rule to sites with dots in the name. |
| *.site_name.com as the site name | Applies the rule to all sites in the site_name.com domain (for example, *.adventure-works.com). |
| *.top-level_domain_name as the site name | Applies the rule to all sites that end with a specific top-level domain name, for example, *.com or *.net. |
| ? | Replaces a single character in a rule. For example, *.adventure-works?.com applies to all sites in the domains adventure-works1.com, adventure-works2.com, and so on. |

You can create a crawler impact rule that applies to all sites within a particular top-level domain. For example, *.com applies to all Internet sites with addresses that end in .com. If an administrator of a portal site adds a content source for samples.microsoft.com, the rule for *.com applies to this site unless you add a crawler impact rule specifically for samples.microsoft.com.
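When planning crawler impact rules, it can help to check which rule a given site name would fall under. The following Python sketch is a planning aid only and is not part of Office SharePoint Server 2007. It assumes that rules are evaluated in list order, so that a specific rule listed first takes precedence over a broad rule such as *.com, and it uses Python's fnmatch module to approximate the * and ? wildcards described above. The ImpactRule record, the find_rule function, and the request and delay values are hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class ImpactRule:
    """Hypothetical planning record; not a SharePoint object."""
    site_pattern: str           # site name; may contain * and ? wildcards
    simultaneous_requests: int  # documents requested at the same time
    delay_seconds: int = 0      # wait between requests when requesting one at a time


def find_rule(host: str, rules: list[ImpactRule]) -> Optional[ImpactRule]:
    """Return the first rule whose pattern matches the host name.

    Assumption: rules are checked in list order, so specific rules
    must be listed before broad ones such as *.com.
    """
    host = host.lower()
    for rule in rules:
        if fnmatchcase(host, rule.site_pattern.lower()):
            return rule
    return None


# Illustrative values only; choose real values from server capacity data.
rules = [
    ImpactRule("samples.microsoft.com", simultaneous_requests=4),
    ImpactRule("*.adventure-works?.com", simultaneous_requests=2),
    ImpactRule("*.com", simultaneous_requests=1, delay_seconds=5),
]

for host in ("samples.microsoft.com", "www.adventure-works1.com",
             "www.contoso.com", "intranet"):
    print(host, "->", find_rule(host, rules))
```

In this sketch, samples.microsoft.com is matched by its own rule rather than by *.com, and a site name with no matching rule returns None, which stands for the crawler's default request rate.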

For content within your organization that other administrators are crawling, you can coordinate with those administrators to set crawler impact rules based on the performance and capacity of the servers. For most external sites, this coordination is not possible. Requesting too much content from external servers or making requests too frequently can use enough resources or bandwidth that the administrators of those sites limit your future access. Therefore, the best practice is to crawl external sites slowly, which mitigates the risk of losing access to the relevant content.

During initial deployment, set the crawler impact rules to make as small an impact on other servers as possible while still crawling enough content frequently enough to ensure the freshness of the crawled content.

During the operations phase, you can adjust crawler impact rules based on your experiences and data from crawl logs.  

Plan crawl rules

Crawl rules apply to a particular URL or set of URLs represented by wildcards (also referred to as the path affected by the rule). You use crawl rules to do the following things:

  • Avoid crawling irrelevant content by excluding one or more URLs. This also helps to reduce the use of server resources and network traffic, and to increase the relevance of search results.

  • Crawl links on the URL without crawling the URL itself. This option is useful for sites with links to relevant content when the page containing the links does not contain relevant information.

  • Enable complex URLs to be crawled. This option crawls URLs that contain a query parameter specified with a question mark. Depending upon the site, these URLs might or might not include relevant content. Because complex URLs can often redirect to irrelevant sites, it is a good idea to enable this option only on sites where the content available from complex URLs is known to be relevant.

  • Enable content on SharePoint sites to be crawled as HTTP pages. This option enables the index server to crawl SharePoint sites that are behind a firewall or in scenarios in which the site being crawled restricts access to the Web service used by the crawler.

  • Specify whether to use the default content access account, a different content access account, or a client certificate for crawling the specified URL. 

Crawl rules apply simultaneously to all content sources in the SSP. 

Often, most of the content for a particular site address is relevant, but the content of a specific subsite or range of sites below that site address is not. By selecting a focused combination of URLs for which to create crawl rules that exclude unneeded items, shared services administrators can maximize the relevance of the content in the index while minimizing the impact on crawling performance and the size of search databases. Creating crawl rules to exclude URLs is particularly useful when planning start addresses for external content, because people in your organization do not control the impact of crawling on the resources of external servers.

When creating a crawl rule, you can use standard wildcard characters in the path. For example:

  • http://server1/folder* includes all Web resources with a URL that starts with http://server1/folder.

  • *://*.txt includes every document with the .txt file name extension.
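Before you add crawl rules, you can test the planned paths against a list of sample URLs. The following Python sketch approximates the wildcard behavior of the two example paths above by using the standard fnmatch module. It is an illustration only: the path_matches function and the sample URLs are hypothetical, and the case-insensitive comparison is an assumption rather than documented product behavior.

```python
from fnmatch import fnmatchcase


def path_matches(url: str, rule_path: str) -> bool:
    """Test a URL against a crawl rule path that contains * wildcards.

    Assumption: matching is case-insensitive, so both values are
    lowercased before they are compared.
    """
    return fnmatchcase(url.lower(), rule_path.lower())


# Hypothetical sample URLs for checking the planned rule paths.
samples = [
    "http://server1/folder/report.doc",
    "http://server1/other/report.doc",
    "file://fileserver/share/notes.txt",
]

for rule_path in ("http://server1/folder*", "*://*.txt"):
    matched = [url for url in samples if path_matches(url, rule_path)]
    print(rule_path, "matches", matched)
```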

Because crawling content consumes resources and bandwidth, it is better to include a smaller amount of content that you know is relevant than a larger amount of content that might be irrelevant. After the initial deployment, you can review the query and crawl logs and adjust content sources and crawl rules to be more relevant and include more content.  

Specify a different content access account

For crawl rules that include content, administrators have the option of changing the content access account for the rule. The default content access account is used unless another account is specified in a crawl rule. The main reason to use a different content access account for a crawl rule is that the default content access account does not have access to all start addresses. For those start addresses, you can create a crawl rule and specify an account that does have access.

Ensure that the domain account used for the default content access account or any other content access account is not the same domain account that is used by an application pool associated with any Web application you crawl. Using the same account can cause unpublished content and minor versions of files (history) in SharePoint sites to be crawled and indexed.
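The account planning described above can be captured in a small worksheet script. The following Python sketch is a planning aid under stated assumptions and does not call any SharePoint API. It picks the account for a start address from the first matching crawl rule and falls back to the default content access account, and it reports any content access account that is also used by an application pool. All account names, URL patterns, and function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical default content access account.
DEFAULT_ACCOUNT = r"CONTOSO\svc-crawl"

# Hypothetical crawl rules as (path pattern, account) pairs.
# None means the rule does not specify an account of its own.
CRAWL_RULE_ACCOUNTS: list[tuple[str, Optional[str]]] = [
    ("http://hr/*", r"CONTOSO\svc-crawl-hr"),
    ("http://intranet/*", None),
]


def account_for(start_address: str) -> str:
    """Return the account planned for crawling a start address.

    Assumption: the first matching crawl rule decides; if it names no
    account, or no rule matches, the default account is used.
    """
    for pattern, account in CRAWL_RULE_ACCOUNTS:
        if fnmatchcase(start_address.lower(), pattern.lower()):
            return account or DEFAULT_ACCOUNT
    return DEFAULT_ACCOUNT


def conflicting_accounts(content_access: set[str], app_pool: set[str]) -> set[str]:
    """Return accounts used for both crawling and an application pool.

    Windows account names are not case-sensitive, so names are
    lowercased before they are compared.
    """
    return {a.lower() for a in content_access} & {a.lower() for a in app_pool}


print(account_for("http://hr/sites/benefits"))
print(account_for("http://wiki/"))
print(conflicting_accounts({DEFAULT_ACCOUNT}, {r"CONTOSO\svc-pool"}))
```

An empty set from conflicting_accounts means that none of the planned content access accounts is shared with an application pool.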

Plan search settings that are managed at the farm level

In addition to the settings that are configured at the SSP level, several settings that are managed at the farm level affect how content is crawled. Consider the following farm-level search settings while planning for crawling:

  • Contact e-mail address: Crawling content affects the resources of the servers that are being crawled. Before you can crawl content, you must provide in the configuration settings the e-mail address of the person in your organization whom administrators can contact in the event that the crawl adversely affects their servers. This e-mail address appears in logs for administrators of the servers being crawled so that those administrators can contact someone if the impact of crawling on their performance and bandwidth is too great, or if other issues occur.

    The contact e-mail address should belong to a person who has the necessary expertise and availability to respond quickly to requests. Alternatively, you can use a closely monitored distribution-list alias as the contact e-mail address. Regardless of whether the content crawled is stored internally to the organization or not, quick response time is important.

  • Proxy server settings: You can choose whether to use a proxy server when crawling content. The proxy server to use depends upon the topology of your Office SharePoint Server 2007 deployment and the architecture of other servers in your organization.

  • Time-out settings: The time-out settings are used to limit the time that the search server waits while connecting to other services.

  • SSL setting: The Secure Sockets Layer (SSL) setting determines whether the SSL certificate must exactly match to crawl content. 

Indexing content in different languages

When crawling content, the crawler determines each individual word in the content it finds. Languages that have words separated by spaces make it relatively easy for the crawler to distinguish each word. In other languages, finding the boundary between words can be more complex.

Office SharePoint Server 2007 provides word breakers and stemmers by default to help crawl and index content in many languages. Word breakers find word boundaries in full-text indexed data, while stemmers conjugate verbs.

If you are crawling any of the languages in the article Word breakers and stemmers by language in Office SharePoint Server 2007, Office SharePoint Server 2007 automatically uses the appropriate word breaker and stemmer for that language. In that article, an asterisk (*) indicates that the stemming feature is on by default.

When the crawler indexes content for a language that is not supported, the neutral breaker is used. If the neutral breaker does not give you the results you expect, you can try third-party solutions that work with Office SharePoint Server 2007.

As a best practice, be sure that you install the appropriate word breaker and stemmer for each of the languages that you need to support. Word breakers and stemmers must be installed on all of the servers that are running the Office SharePoint Server Search service.

For more information about word breakers and stemmers, see Plan for multilingual sites.