Managing Scope Rules
The Crawl Scope Manager (CSM) enables you to define scope rules that include or exclude URLs from the Windows Search crawl scope. The Crawl Scope Manager lets you do the following:
- Add new scope rules to the working rule set
- Remove existing scope rules
- Enumerate default scope rules
- Discover whether a given URL is included in or excluded from the crawl scope, or whether it has a parent or child scope rule
This topic includes the following subjects:
- About Scope Rules
- Before You Begin
- Adding Scope Rules
- Removing Scope Rules
- Reverting to Default Rules
- Enumerating Scope Rules
- Exploring the Existence and Relationships of Scope Rules
- Related Topics
About Scope Rules
A scope rule is a rule that includes or excludes URLs within a search root from being crawled and indexed. Inclusion rules dictate that the Indexer include that URL in the crawl scope, and exclusion rules dictate that the Indexer exclude that URL (and its children) from the crawl scope.
For example, let's say you've installed a new application whose data files are located in the folder WorkteamA\ProjectFiles on a local machine. Suppose you want everything within the ProjectFiles folder indexed except for items in the subfolder Prototypes. In this situation, you would need an include rule for myPH:///C:\WorkteamA\ProjectFiles\ and an exclude rule for myPH:///C:\WorkteamA\ProjectFiles\Prototypes\.
There are three types of rules, taking the following order of precedence:
- Group Policy Rules are set by administrators and can override all other rules.
- User Rules are set by a user modifying the scope in the Search Options dialog or by another application managing the scope. Users or other applications can remove all user-set rules and revert to default rules.
- Default Rules are typically set by an application to define a default scope. For example, default rules might be set when a new protocol handler or container is added to the system.
Together, these types of rules make up the working rule set from which the Crawl Scope Manager (CSM) generates the full list of URLs to crawl. While default rules can be overridden by group policy and user rules in the working rule set, they are maintained in their own default rule set, which you can revert to at any time. The Indexer crawls the URLs from the working rule set and adds items, properties, and content to the catalog.
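To make the precedence order concrete, the following portable C++ sketch models how the effective rule for a single URL could be chosen when rules of several types apply to it. The names RuleType, Rule, and EffectiveRule are hypothetical and exist only for illustration; this topic does not describe how the CSM resolves rules internally, so treat the sketch as a mental model rather than the actual algorithm.

```cpp
#include <string>
#include <vector>

// Higher value means higher precedence, mirroring the order listed above.
enum class RuleType { Default = 0, User = 1, GroupPolicy = 2 };

// Hypothetical in-memory representation of one scope rule.
struct Rule {
    std::string url;   // URL the rule applies to
    bool include;      // true = inclusion rule, false = exclusion rule
    RuleType type;     // who set the rule
};

// Returns the highest-precedence rule whose URL equals `url`,
// or nullptr if no rule targets that URL.
const Rule* EffectiveRule(const std::vector<Rule>& rules, const std::string& url) {
    const Rule* best = nullptr;
    for (const Rule& r : rules) {
        if (r.url != url) continue;
        if (best == nullptr ||
            static_cast<int>(r.type) > static_cast<int>(best->type)) {
            best = &r;
        }
    }
    return best;
}
```

In this model, a user exclusion for a URL wins over a default inclusion for the same URL, and a group policy rule wins over both.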
Note Users with access to the Control Panel can modify the rules through that interface. Therefore, applications offering scope management should always get the rules directly from the CSM using the enumeration methods instead of relying on their own saved copy of user rules.
Exclusion rules can define pattern URLs with the wildcard character '*'. An example would be the following: file:///C:\ProjectA\*\. An exclusion rule using this pattern prevents the Indexer from crawling all folders beneath the ProjectA directory. A more complicated example is an inclusion rule for file:///C:\ProjectA\ combined with an exclusion pattern rule for file:///C:\ProjectA\*\data\*. In this case, the Indexer would crawl items in:
- C:\ProjectA\
- C:\ProjectA\version1\testfiles\
- C:\ProjectA\version1\temp\data\
- but not C:\ProjectA\version1\data\
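The outcome listed above can be reproduced with a small, portable C++ sketch. It assumes, based only on this example, that a '*' between backslashes stands for exactly one folder name and that a trailing '*' matches everything from that point down. The Indexer's real matching logic is not documented here, and the names SplitPath and MatchesExclusionPattern are hypothetical. For brevity, the sketch works on plain paths without the file:/// prefix and compares names case-sensitively.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a path such as "C:\\ProjectA\\v1\\" into its non-empty segments.
std::vector<std::string> SplitPath(const std::string& path) {
    std::vector<std::string> segments;
    std::string current;
    for (char c : path) {
        if (c == '\\') {
            if (!current.empty()) segments.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) segments.push_back(current);
    return segments;
}

// Illustrative model only (not the Indexer's real algorithm):
// returns true if `path` falls under the exclusion `pattern`.
bool MatchesExclusionPattern(const std::string& pattern, const std::string& path) {
    const std::vector<std::string> pat = SplitPath(pattern);
    const std::vector<std::string> seg = SplitPath(path);
    for (std::size_t i = 0; i < pat.size(); ++i) {
        const bool last = (i + 1 == pat.size());
        if (last && pat[i] == "*") return true;  // trailing '*': everything below matches
        if (i >= seg.size()) return false;       // path is shorter than the pattern
        if (pat[i] != "*" && pat[i] != seg[i]) return false;  // literal segment differs
    }
    return pat.size() == seg.size();
}
```

Under these assumptions, the pattern C:\ProjectA\*\data\* matches C:\ProjectA\version1\data\ but not C:\ProjectA\version1\temp\data\, because the single '*' can stand for version1 but not for version1\temp.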
Before You Begin
Before using any of the Crawl Scope Manager interfaces, complete the following prerequisite steps:
- Create the CSearchManager object and obtain its ISearchManager interface.
- Call ISearchManager::GetCatalog for "SystemIndex" to obtain an instance of an ISearchCatalogManager.
- Call ISearchCatalogManager::GetCrawlScopeManager() to obtain an instance of ISearchCrawlScopeManager.
After making any changes to the Crawl Scope Manager, you must call the SaveAll() method. This method takes no parameters and returns S_OK on success.
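The prerequisite steps might look like the following C++ sketch. It is a minimal outline rather than a definitive implementation: it assumes a Microsoft C++ compiler (for __uuidof), that COM has already been initialized on the calling thread, and that the program links against ole32.lib. The function name GetCrawlScopeManager and the constant kSystemIndexCatalog are illustrative, and the #ifdef _WIN32 guard is there only so that the file still compiles on other platforms.

```cpp
// Name of the catalog used by Windows Search, as given in the steps above.
static const wchar_t kSystemIndexCatalog[] = L"SystemIndex";

#ifdef _WIN32  // The Windows Search COM interfaces exist only on Windows.
#include <windows.h>
#include <searchapi.h>

// Follows the three prerequisite steps and returns the crawl scope manager.
// The caller owns the returned interface and must Release() it.
HRESULT GetCrawlScopeManager(ISearchCrawlScopeManager** ppCsm) {
    *ppCsm = nullptr;

    // Step 1: create the CSearchManager object and obtain ISearchManager.
    ISearchManager* pManager = nullptr;
    HRESULT hr = CoCreateInstance(__uuidof(CSearchManager), nullptr,
                                  CLSCTX_SERVER, IID_PPV_ARGS(&pManager));
    if (FAILED(hr)) return hr;

    // Step 2: get the catalog manager for "SystemIndex".
    ISearchCatalogManager* pCatalog = nullptr;
    hr = pManager->GetCatalog(kSystemIndexCatalog, &pCatalog);
    pManager->Release();
    if (FAILED(hr)) return hr;

    // Step 3: get the crawl scope manager from the catalog manager.
    hr = pCatalog->GetCrawlScopeManager(ppCsm);
    pCatalog->Release();
    return hr;
}
#endif  // _WIN32
```

After changing rules through the returned interface, call its SaveAll method and check the returned HRESULT before releasing the interface.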
Adding Scope Rules
The working rule set for the Crawl Scope Manager includes user and default rules, as well as any rules forced by group policy. User rules are set up by users in a user interface, and default rules can be set by:
- Group policies implemented by a system administrator (these do not use the ISearchCrawlScopeManager interface)
- Installations or updates of an application like Windows Search or a protocol handler
- Setup application for the addition of a new data store or container
The ISearchCrawlScopeManager provides two methods for adding new scope rules, as described in the following table. Paths for include rules for the file system must end with a backslash '\' (e.g., file:///C:\files\), and paths for exclude rules must end with an asterisk (e.g., file:///c:\files\*). Only exclusion rules can contain pattern URLs. Furthermore, we recommend including users' security identifiers (SIDs) in paths for better security. Per-user paths are more secure because queries then run in a per-user process, ensuring that one user cannot see items indexed from another user's inbox, for example.
Method | Description |
---|---|
AddUserScopeRule | Adds a rule for a given URL, as specified by the user. These rules override default rules. Use this method if you have implemented a user interface that lets users manage their own scope rules and URLs. |
AddDefaultScopeRule | Adds a rule for a given URL, as specified by another application like a protocol handler. Use this method when you have implemented a new protocol handler or added a new data store. These rules can be overridden by user rules. |
Each method takes in a URL to an indexable location and flags that determine whether the URL should be included or excluded. The fFollowFlags parameter is reserved for future use. When you add a new scope rule and the Crawl Scope Manager determines that rule already exists (based on the URL or the pattern provided), the working rule set is updated so that (1) the old rule is replaced by the new rule and (2) any user rules that contradict it are removed.
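As a hedged illustration of adding a user rule, the C++ sketch below first checks the path conventions described earlier in this section and then calls AddUserScopeRule followed by SaveAll. The helper names HasExpectedSuffix and AddUserRule are hypothetical. Passing FF_INDEXCOMPLEXURLS for fFollowFlags is an assumption made for this sketch, since the parameter is described above as reserved. The Windows-specific part is guarded by #ifdef _WIN32 so that the portable check can be compiled anywhere.

```cpp
#include <string>

// Portable check of the file-system path conventions described above:
// include rules end with '\', exclude rules end with '*'.
bool HasExpectedSuffix(const std::wstring& url, bool include) {
    if (url.empty()) return false;
    return url.back() == (include ? L'\\' : L'*');
}

#ifdef _WIN32  // The Windows Search COM interfaces exist only on Windows.
#include <windows.h>
#include <searchapi.h>

// Adds one user rule and saves the change.
// overrideChildren is passed through as the fOverrideChildren flag.
HRESULT AddUserRule(ISearchCrawlScopeManager* pCsm, const std::wstring& url,
                    bool include, bool overrideChildren) {
    if (!HasExpectedSuffix(url, include)) return E_INVALIDARG;

    HRESULT hr = pCsm->AddUserScopeRule(url.c_str(),
                                        include ? TRUE : FALSE,
                                        overrideChildren ? TRUE : FALSE,
                                        FF_INDEXCOMPLEXURLS);  // assumption, see lead-in
    if (SUCCEEDED(hr)) hr = pCsm->SaveAll();  // persist the change
    return hr;
}
#endif  // _WIN32
```

For the ProjectFiles scenario, one call would add file:///C:\WorkteamA\ProjectFiles\ as an inclusion and a second call would add file:///C:\WorkteamA\ProjectFiles\Prototypes\* as an exclusion.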
Tip While the file:// root is included by default in the crawl scope, Program Files is not indexed by default. Therefore, applications with data saved to their Program Files directory need to add their location as a default rule.
Notes on User Rules
If a new user rule is the same as an existing default rule, the new user rule overrides the default rule in the working rule set. If the new user rule is the same as an existing user rule, the old user rule is replaced.
Setting the flag "fOverrideChildren" has the following results in the working rule set:
- TRUE results in the removal of all child rules from the working rule set (both user rules and default rules)
- FALSE results in re-adding all default rules that are children of the new user rule to the working rule set. If a child default rule is an inclusion and the new user rule is an exclusion, then the default rule is changed to an inclusion user rule.
Removing Scope Rules
You can use ISearchCrawlScopeManager to remove a scope rule from the working rule set. The ISearchCrawlScopeManager provides two methods for removing scope rules.
Method | Description |
---|---|
RemoveScopeRule | Removes a user rule for a given URL from the working rule set. If the user rule is a duplicate of or overrides a default rule, the default rule remains in the working rule set. |
RemoveDefaultScopeRule | Removes a default rule for a given URL from both the working rule set and the default rule set. After calling this method, you cannot revert to this default rule using RevertToDefaultScopes(). |
Each method takes a URL and a flag indicating if the rule to be removed is an inclusion or exclusion rule. These methods return an error if a rule with that URL and inclusion/exclusion flag is not found.
Tip If you want to remove a scope from the crawl scope entirely, you should use the ISearchCrawlScopeManager::RemoveRoot method, which removes the search root and all associated scope rules. Doing this during an uninstall, for example, is considered best practice.
It is also possible to remove all user-set overrides of a search root and revert to the original search root and default scope rules. Refer to the next section for more information.
Note On Microsoft Windows Vista, if users are removed through the User Profiles in the Control Panel, the CSM removes all rules and roots with their SID and removes their indexed items from the catalog. On Windows XP, you need to remove the users' roots and rules manually.
Reverting to Default Rules
Reverting to default rules removes all user rules for a given URL or root and restores all default rules to the working rule set. It does not, however, remove rules set by group policy. The ISearchCrawlScopeManager::RevertToDefaultScopes method takes no parameters and returns an error code if it is unable to revert to default rules.
Enumerating Scope Rules
The CSM enumerates scope rules using a standard COM-style enumerator interface, IEnumSearchScopeRules. You can use this interface to enumerate scope rules for a number of purposes. For example, you might want to display the entire working rule set in a user interface, or discover if a given rule or the child of a rule is already in the crawl scope.
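A minimal C++ sketch of such an enumeration follows. It assumes you already hold an ISearchCrawlScopeManager pointer and uses its EnumerateScopeRules method to obtain the enumerator. The function names FormatRuleLine and PrintScopeRules are hypothetical, and the Windows-specific part is guarded by #ifdef _WIN32 so that the formatting helper compiles on any platform.

```cpp
#include <string>

// Portable helper: builds the display line for one rule.
std::wstring FormatRuleLine(const std::wstring& url, bool included, bool isDefault) {
    std::wstring line = included ? L"include " : L"exclude ";
    line += isDefault ? L"default " : L"non-default ";
    return line + url;
}

#ifdef _WIN32  // The Windows Search COM interfaces exist only on Windows.
#include <windows.h>
#include <searchapi.h>
#include <cstdio>

// Prints every rule returned by the enumerator, one per line.
HRESULT PrintScopeRules(ISearchCrawlScopeManager* pCsm) {
    IEnumSearchScopeRules* pEnum = nullptr;
    HRESULT hr = pCsm->EnumerateScopeRules(&pEnum);
    if (FAILED(hr)) return hr;

    ISearchScopeRule* pRule = nullptr;
    ULONG fetched = 0;
    // Next returns S_OK while it can still hand back the requested rule.
    while ((hr = pEnum->Next(1, &pRule, &fetched)) == S_OK) {
        LPWSTR pszUrl = nullptr;
        BOOL fIncluded = FALSE;
        BOOL fDefault = FALSE;
        if (SUCCEEDED(pRule->GetPatternOrURL(&pszUrl)) &&
            SUCCEEDED(pRule->IsIncluded(&fIncluded)) &&
            SUCCEEDED(pRule->IsDefault(&fDefault))) {
            wprintf(L"%ls\n",
                    FormatRuleLine(pszUrl, fIncluded != FALSE, fDefault != FALSE).c_str());
        }
        CoTaskMemFree(pszUrl);  // the string is allocated by the callee; nullptr is allowed
        pRule->Release();
        pRule = nullptr;
    }
    pEnum->Release();
    return SUCCEEDED(hr) ? S_OK : hr;  // S_FALSE only signals the end of the list
}
#endif  // _WIN32
```

Reading the rules this way each time, rather than caching them, keeps an application in step with changes that users make through the Control Panel.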
Exploring the Existence and Relationships of Scope Rules
The Crawl Scope Manager also enables you to query whether a given URL is included in the crawl scope and whether it has a parent or child scope rule. You can also find out why a URL is included in or excluded from the crawl scope. These methods are not intended to be used with pattern URLs.
Method | Description |
---|---|
GetParentScopeVersionId | Gets the version ID of the parent inclusion URL. You can use this method to see if the parent scope has changed since the last time you checked it. Example: If a mail application uses provider-managed notifications to the Indexer, it may get the parent scope version before it closes and check it again when it opens. Then it can determine if it needs to push a new set of notifications to the Indexer. |
HasChildScopeRule | Returns TRUE if the given URL has a child rule (a rule applying to a child at any level within its URL hierarchy). Example: If the given URL is "file:///C:\Folder\", this method returns TRUE if the CSM has a scope rule specifically for "file:///C:\Folder\Subfolder\". |
HasParentScopeRule | Returns TRUE if the given URL has a parent rule (a rule applying to a parent at any level in the URL hierarchy). Example: If the given URL is "file:///C:\Folder\Subfolder", this method returns TRUE if the CSM has a scope rule specifically for "file:///C:\Folder\". |
IncludedInCrawlScope | Returns TRUE if the given URL is included in the crawl scope. |
IncludedInCrawlScopeEx | Returns a CLUSION_REASON enumeration explaining why the given URL is included or excluded from the crawl scope as well as the boolean TRUE if the given URL is included in the crawl scope. This method can help you identify conflicts in your working rule set. |
Note The IncludedInCrawlScope and IncludedInCrawlScopeEx methods determine if the URL will be crawled based solely on the rules in the Crawl Scope Manager. A URL may not be crawled for other reasons, such as the FANCI bit being set (i.e., a user has disallowed fast indexing using the folder's Property dialog).
If you believe a file path should be excluded but it is listed as included, be sure that exclude rules end with "<path>\*". If you believe a file or file path should be included but it is not, be sure to check the FANCI bit setting for the file or path. You can do this by right-clicking it, selecting Properties, and making sure the "For fast searching, allow Indexing Service to index this folder" option is set. If the file or file path is not marked for indexing here, then it will not be indexed even if it is in an include rule.
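When diagnosing such a mismatch, a short C++ sketch like the following can report what the CSM itself says about a URL. It combines IncludedInCrawlScopeEx with HasParentScopeRule and HasChildScopeRule, and it rejects URLs containing '*' because these methods are not intended for pattern URLs. The names IsPatternUrl, ReasonName, and ReportCrawlStatus are hypothetical, and the Windows-specific part is guarded by #ifdef _WIN32. The sketch does not inspect the FANCI bit, so a URL reported as included here may still be skipped for that reason.

```cpp
#include <string>

// Portable check: this sketch treats any URL containing '*' as a pattern URL.
bool IsPatternUrl(const std::wstring& url) {
    return url.find(L'*') != std::wstring::npos;
}

#ifdef _WIN32  // The Windows Search COM interfaces exist only on Windows.
#include <windows.h>
#include <searchapi.h>
#include <cstdio>

// Maps a CLUSION_REASON value to a short description.
const wchar_t* ReasonName(CLUSION_REASON reason) {
    switch (reason) {
        case CLUSIONREASON_DEFAULT:     return L"default rule";
        case CLUSIONREASON_USER:        return L"user rule";
        case CLUSIONREASON_GROUPPOLICY: return L"group policy rule";
        default:                        return L"unknown scope";
    }
}

// Prints whether `url` is in the crawl scope, why, and whether it has
// parent or child scope rules.
HRESULT ReportCrawlStatus(ISearchCrawlScopeManager* pCsm, const std::wstring& url) {
    if (IsPatternUrl(url)) return E_INVALIDARG;

    BOOL fIncluded = FALSE;
    CLUSION_REASON reason = CLUSIONREASON_UNKNOWNSCOPE;
    HRESULT hr = pCsm->IncludedInCrawlScopeEx(url.c_str(), &fIncluded, &reason);

    BOOL fHasParent = FALSE;
    if (SUCCEEDED(hr)) hr = pCsm->HasParentScopeRule(url.c_str(), &fHasParent);

    BOOL fHasChild = FALSE;
    if (SUCCEEDED(hr)) hr = pCsm->HasChildScopeRule(url.c_str(), &fHasChild);
    if (FAILED(hr)) return hr;

    wprintf(L"%ls: %ls (%ls), parent rule: %ls, child rule: %ls\n", url.c_str(),
            fIncluded ? L"included" : L"excluded", ReasonName(reason),
            fHasParent ? L"yes" : L"no", fHasChild ? L"yes" : L"no");
    return S_OK;
}
#endif  // _WIN32
```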