crawlerglobaldefaults.xml reference
Applies to: FAST Search Server 2010
Use crawlerglobaldefaults.xml to specify FAST Search Web crawler configuration options that apply to all crawl collections. Configuration options include DNS, content submission, duplicate detection, and other global settings. This is an advanced feature. You will rarely have to use it.
Warning
Any changes that you make to this file will be overwritten and lost if you:
- Run the Set-FASTSearchConfiguration Windows PowerShell cmdlet.
- Install a FAST Search Server 2010 for SharePoint update or service pack.
Remember to reapply your changes after you run the Set-FASTSearchConfiguration Windows PowerShell cmdlet or install a FAST Search Server 2010 for SharePoint update or service pack.
On startup, the FAST Search Web crawler looks for the crawlerglobaldefaults.xml file in <FASTSearchFolder>\etc\ (where <FASTSearchFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch). You can override this location by passing the -F <path> argument to the crawler.exe executable in NodeConf.xml. After you edit NodeConf.xml, restart nctrl.exe or run nctrl.exe reloadcfg.
If a crawlerglobaldefaults.xml file cannot be found, the FAST Search Web crawler reverts to defaults for the settings that can be specified in this file. Some settings can be overridden on the crawler.exe command line. For more information, see crawler.exe reference.
Customizing crawlerglobaldefaults.xml
Note
To modify a configuration file, verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.
To edit this file:
1. Edit crawlerglobaldefaults.xml in a text editor, not a general-purpose XML editor. Use the existing file in <FASTSearchFolder>\etc\ as a starting point. Include the elements and settings that you need.
2. Run nctrl.exe restart crawler to restart the FAST Search Web crawler process with the options that you set in step 1.
If the FAST Search Web crawler is running as a multi-node crawler, this file must be edited on each server where a crawler is running. Each crawler must also be restarted, by running nctrl.exe restart multinodescheduler on the node running the multi-node scheduler and nctrl.exe restart nodescheduler on the servers that are running the node schedulers.
crawlerglobaldefaults.xml quick reference
This table lists the elements in crawlerglobaldefaults.xml. The elements can appear in any order, with two exceptions: GlobalConfig must contain all section and attrib elements, and member can only occur inside an attrib element.
Element | Description |
---|---|
CrawlerConfig | This root element identifies the file as a FAST Search Web crawler configuration file. |
GlobalConfig | This element identifies the file as a global configuration settings file for the FAST Search Web crawler. |
attrib | This child element specifies a configuration setting, specified either by its value or a set of member elements. Formatted as: <attrib name="value" type="value"> value </attrib> |
member | This child element can only occur in an attrib element. It specifies a configuration setting in a list, and is formatted as: <member> value </member> |
section | This child element contains multiple settings grouped by type. |
This table lists the options in crawlerglobaldefaults.xml.
Option | Description |
---|---|
GlobalConfig options | These options are valid inside the GlobalConfig element. |
feeding options | These options are valid inside a section element that has the name "feeding". They configure characteristics of submitting Web items to content indexing. |
dns options | These attributes specify settings related to the crawler's internal DNS resolver. |
near_duplicate_detection options | These options configure the near duplicate detection algorithm for collections that have it enabled. |
timeouts options | These options specify global crawler time-out settings. |
crawlerglobaldefaults.xml file format
Each XML element in crawlerglobaldefaults.xml consists of a start tag, such as <attrib ...>, the element content, and a matching end tag, such as </attrib>.
The basic element format is as follows:
<attrib name="value" type="value"> value </attrib>
For example:
<attrib name="sitemanager_numsites" type="integer" > 1024 </attrib>
Element names, section names, attribute names, and attribute values are case-sensitive. The values of the name and type attributes must be enclosed in quotation marks (" "). An element definition can span multiple lines. Spaces, carriage returns, line feeds, and tab characters are ignored in an element definition.
For example:
<attrib
name="sitemanager_numsites"
type="integer"
> 1024 </attrib
>
Tip
For long parameter definitions, position values on separate lines and use indentation to make the file easier to read.
The <GlobalConfig> element is a special case and is required. All other elements are contained within the <GlobalConfig> element, and the element is closed with </GlobalConfig>.
The basic structure of the crawlerglobaldefaults.xml file is as follows:
<?xml version="1.0"?>
<CrawlerConfig>
<GlobalConfig>
...
</GlobalConfig>
</CrawlerConfig>
You can add comments anywhere, delimited by <!-- and -->.
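To make the structure described above concrete, the following is a minimal sketch of a complete file. It uses only options documented in this topic (numprocs and the dns timeout option); the values shown are the documented defaults and are illustrative, not a recommendation.

```xml
<?xml version="1.0"?>
<CrawlerConfig>
  <GlobalConfig>
    <!-- Top-level option: placed directly inside GlobalConfig -->
    <attrib name="numprocs" type="integer"> 2 </attrib>

    <!-- Grouped options: placed inside a named section -->
    <section name="dns">
      <attrib name="timeout" type="integer"> 30 </attrib>
    </section>
  </GlobalConfig>
</CrawlerConfig>
```

Note that the comments in this sketch are placed after the XML declaration. In XML, the <?xml ... ?> declaration must be the first thing in the file, so a comment cannot precede it.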
CrawlerConfig
This is the top-level element. It has no attributes.
GlobalConfig
This element contains the global crawler configuration. It has no attributes.
attrib
This child element specifies a configuration option, either a single value or a list using the member element.
Attributes
Attribute | Value | Description |
---|---|---|
name | option name | Specifies the option to configure. See the valid options in the option sections later in this topic. |
type | string\|integer\|real\|boolean\|list-string | Specifies the type of the option value. A boolean value is yes or no. A list-string value is given as a list of member elements. |
The value of the type attribute must match the type associated with the option that is specified for the name attribute. For example, the numprocs option must always be used with the integer type.
Example
The following example specifies the value 2 for the numprocs option:
<attrib name="numprocs" type="integer"> 2 </attrib>
member
This specifies an element in a list of option values. It has no attributes.
The member element can only be used inside an attrib element.
Example
The following example specifies two browser engines for the browser_engines option:
<attrib name="browser_engines" type="list-string">
<member> hostname1:13045 </member>
<member> hostname2:13045 </member>
</attrib>
section
This child element groups a set of related options. A section element contains attrib elements.
Attributes
Attribute | Value | Description |
---|---|---|
name | name | Specifies the name of the section. Supported sections are listed in the options tables later in this topic. |
Example
The following example configures the DNS options, specifying only the timeout option:
<section name="dns">
<attrib name="timeout" type="integer"> 30 </attrib>
</section>
GlobalConfig options
These options are valid inside the GlobalConfig element.
Option | Type | Value | Description |
---|---|---|---|
browser_engines | list-string | hostname:port | List of browser engines. The crawler uses these to process Web pages that contain JavaScript. Default: Configured automatically by the installer |
datadir | string | directory | The location of the crawler content store. Overridden by the -d option to crawler.exe. |
dbtrace | boolean | yes\|no | Enable/disable database operation tracing. For debugging only. Default: no |
directio | boolean | yes\|no | Enable/disable direct I/O in postprocess and duplicate server. For debugging only. Default: no |
disk_resume_threshold | real | 1-2^63 | Free disk space threshold (in bytes) at which the crawler resumes crawling of all collections after they have been suspended by disk_suspend_threshold. Default: 629145600 |
disk_suspend_threshold | real | 1-2^63 | Free disk space threshold (in bytes) below which the crawler suspends crawling of all collections. Default: 524288000 |
dns_resolver_threads | integer | 1-64 | Maximum number of DNS threads. Increasing this value may improve DNS resolve performance if you are crawling a large number of hostnames. Default: 5 |
dns_use_platform_api | boolean | yes\|no | Specifies whether to use the OS gethostbyname API for resolving DNS names and NetBIOS names, or the internal resolver. The internal DNS resolver offers better performance and scalability, but does not support NetBIOS names. Default: yes |
duplicate_servers | list-string | hostname:port | List of duplicate servers. Default: Configured automatically by the installer |
logdir | string | directory | The location of the crawler log. Overridden by the -L option to crawler.exe. |
logfile_ttl | integer | 1-2^31 | How long (in days) to keep rotated log files before deleting them. Default: 365 |
numprocs | integer | 1-8 | Number of site manager processes to start. Default: 2 |
ppdup_dbformat | string | hashlog\|diskhashlog\|gigabase | Database format that is used by the duplicate server in a multi-node FAST Search Web crawler deployment. Default: hashlog |
rc_update_freq | integer | 1-3600 | Specifies how often (in seconds) crawl statistics are sent to the monitoring service. Default: 120 |
sitemanager_numsites | integer | 1-1024 | Maximum number of site workers per site manager. Default: 1024 |
store_cleanup | string | hh:mm | Time of the daily storage cleanup, in 24-hour clock time. Default: 04:00 |
xmlrpcport | integer | port number | The crawler base port. Overridden by the -p option to crawler.exe. |
Example
The following example specifies options of different types:
<attrib name="dns_use_platform_api" type="boolean"> yes </attrib>
<attrib name="numprocs" type="integer"> 2 </attrib>
<attrib name="disk_resume_threshold" type="real"> 629145600 </attrib>
<attrib name="browser_engines" type="list-string">
<member> localhost:13045 </member>
</attrib>
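As a further sketch, the two disk thresholds can be raised together. This assumes that the thresholds measure free disk space, which the defaults suggest because the resume value is higher than the suspend value. The byte values below (1 GiB and 2 GiB) and the 30-day log retention are illustrative assumptions, not recommended settings.

```xml
<GlobalConfig>
  <!-- Suspend all crawling when free disk space drops below 1 GiB -->
  <attrib name="disk_suspend_threshold" type="real"> 1073741824 </attrib>

  <!-- Resume crawling once free disk space is back above 2 GiB -->
  <attrib name="disk_resume_threshold" type="real"> 2147483648 </attrib>

  <!-- Keep rotated log files for 30 days instead of the default 365 -->
  <attrib name="logfile_ttl" type="integer"> 30 </attrib>
</GlobalConfig>
```

Keeping the resume threshold above the suspend threshold leaves a margin between the two, so the crawler should not switch back and forth when free space hovers near a single limit.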
feeding options
The following options are valid inside a section element that has the name feeding. These options configure characteristics of submitting Web items to content indexing.
Option | Type | Value | Description |
---|---|---|---|
feeder_threads | integer | 1-8 | Specifies the number of content feeder threads to start. For large-scale scenarios, increasing the number of threads can improve performance. Note: Change this value only when the <FASTSearchFolder>\data\crawler\store\dsqueues directory is empty. Default: 1 |
fs_threshold | integer | 0-2^31 | Specifies the maximum size of items sent in a batch to indexing. Any item larger than this value will be sent as a URL reference, which the item processor downloads individually from the crawler. Default: 128 |
max_batch_datasize | integer | 0-2^31 | Specifies the maximum number of bytes per batch. Reducing the maximum batch data size may reduce item processor memory usage. Default: 52428800 (50 MB) |
max_batch_size | integer | 1-1024 | The maximum number of items in each batch submission. Smaller batches may be sent if not enough items are available, or if the memory size of the batch grows too large. Reducing the maximum batch size may reduce item processor memory usage, but may also decrease performance. Default: 128 |
max_cb_timeout | integer | 1-3600 | The maximum number of seconds to wait for outstanding callbacks in content indexing during shutdown. Default: 1800 |
Example
The following example specifies a typical feeding section:
<section name="feeding">
<attrib name="feeder_threads" type="integer"> 1 </attrib>
<attrib name="max_cb_timeout" type="integer"> 1800 </attrib>
<attrib name="max_batch_size" type="integer"> 128 </attrib>
<attrib name="max_batch_datasize" type="integer"> 52428800 </attrib>
<attrib name="fs_threshold" type="integer"> 128 </attrib>
</section>
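The option descriptions above note that smaller batches may reduce item processor memory usage. A feeding section tuned in that direction might look like the following sketch. The halved values (64 items, 26214400 bytes, which is 25 MB) are illustrative assumptions and stay within the documented ranges.

```xml
<section name="feeding">
  <!-- At most 64 items per batch instead of the default 128 -->
  <attrib name="max_batch_size" type="integer"> 64 </attrib>

  <!-- At most 25 MB (26214400 bytes) per batch instead of the default 50 MB -->
  <attrib name="max_batch_datasize" type="integer"> 26214400 </attrib>
</section>
```

Options that are omitted from the section, such as feeder_threads here, are expected to keep their defaults. As the max_batch_size description warns, smaller batches may also decrease feeding performance.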
dns options
These attributes specify settings related to the crawler's internal DNS resolver. In single node installations, the node scheduler calls DNS to resolve host names. In a multiple node installation, this job is performed by the multi-node scheduler.
Option | Type | Value | Description |
---|---|---|---|
db_cachesize | integer | 1-2^31 | DNS database cache size in bytes. A multi-node scheduler will use 4 times this amount. Default: 10485760 |
ipv4 | boolean | yes\|no | Specifies whether the crawler should resolve host names into IPv4 addresses. Default: yes |
ipv6 | boolean | yes\|no | Specifies whether the crawler should resolve host names into IPv6 addresses. Default: yes |
max_rate | integer | 1-200 | Maximum number of DNS requests to issue per second. Default: 100 |
max_retries | integer | 1-10 | Maximum number of DNS retries to issue for a failed lookup before giving up. Default: 5 |
min_rate | integer | 1-10 | Minimum number of DNS requests to issue per second. Default: 5 |
min_ttl | integer | 1-2^31 | Minimum lifetime of resolved names (in seconds) before the crawler tries to re-resolve them. Default: 21600 |
timeout | integer | 1-300 | DNS request time-out (in seconds) before retrying. Default: 30 |
The min_rate, max_rate, max_retries, and timeout settings apply only when the internal DNS resolver is used instead of the OS DNS resolver. The dns_use_platform_api option controls which resolver is used. At least one of ipv4 and ipv6 must be set to yes.
Example
The following example specifies a typical DNS section:
<section name="dns">
<attrib name="min_rate" type="integer"> 5 </attrib>
<attrib name="max_rate" type="integer"> 100 </attrib>
<attrib name="max_retries" type="integer"> 5 </attrib>
<attrib name="timeout" type="integer"> 30 </attrib>
<attrib name="min_ttl" type="integer"> 21600 </attrib>
<attrib name="db_cachesize" type="integer"> 10485760 </attrib>
<attrib name="ipv4" type="boolean"> yes </attrib>
<attrib name="ipv6" type="boolean"> yes </attrib>
</section>
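Because the rate, retry, and time-out settings take effect only with the internal resolver, a configuration that relies on them would also set dns_use_platform_api to no at the GlobalConfig level. The following sketch shows the two pieces together. The specific values are illustrative assumptions within the documented ranges, and the internal resolver does not support NetBIOS names.

```xml
<GlobalConfig>
  <!-- Switch from the OS gethostbyname API to the internal DNS resolver -->
  <attrib name="dns_use_platform_api" type="boolean"> no </attrib>

  <section name="dns">
    <!-- These three settings apply only to the internal resolver -->
    <attrib name="max_rate" type="integer"> 50 </attrib>
    <attrib name="max_retries" type="integer"> 3 </attrib>
    <attrib name="timeout" type="integer"> 15 </attrib>

    <!-- At least one address family must be enabled -->
    <attrib name="ipv4" type="boolean"> yes </attrib>
    <attrib name="ipv6" type="boolean"> no </attrib>
  </section>
</GlobalConfig>
```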
near_duplicate_detection options
Near duplicate detection is enabled on a per-collection basis. Near duplicate detection only works for languages that use a white space word separator, e.g. western languages. These options configure the near duplicate detection algorithm for collections that have it enabled.
Option | Type | Value | Description |
---|---|---|---|
min_token_size | integer | 1-(max_token_size-1) | This option specifies the minimum number of characters a token must have to be included in the lexicon (the lexicon is a list of the words that occur in an item). Tokens that contain fewer characters are excluded from the lexicon. Default: 5 |
max_token_size | integer | 1-100 | Specifies the maximum character length for a token. Tokens that contain more characters are excluded from the lexicon (the lexicon is a list of the words that occur in an item). Default: 35 |
unique_tokens | integer | 1-10 | Specifies the minimum number of unique tokens a lexicon must contain to perform advanced duplicate detection. (A lexicon is the list of the words that occur in an item.) Below this level, the checksum is computed on the whole item. Default: 10 |
high_freq_cut | real | 0.0-1.0 | Specifies the percentage of tokens (as a decimal) with a high frequency to cut from the lexicon (a lexicon is the list of the words that occur in an item). Default: 0.1 |
low_freq_cut | real | 0.0-1.0 | Specifies the percentage of tokens (as a decimal) with a low frequency to cut from the lexicon (a lexicon is the list of the words that occur in an item). Default: 0.2 |
Example
The following example specifies a typical near_duplicate_detection section:
<section name="near_duplicate_detection">
<attrib name="min_token_size" type="integer"> 5 </attrib>
<attrib name="max_token_size" type="integer"> 35 </attrib>
<attrib name="unique_tokens" type="integer"> 10 </attrib>
<attrib name="high_freq_cut" type="real"> 0.1 </attrib>
<attrib name="low_freq_cut" type="real"> 0.2 </attrib>
</section>
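The value range for min_token_size depends on max_token_size: the minimum must be at least 1 below the maximum. The following sketch narrows the token length window while respecting that constraint. The values 4 and 20 are illustrative assumptions, not tuning advice.

```xml
<section name="near_duplicate_detection">
  <!-- Tokens shorter than 4 characters are excluded from the lexicon -->
  <attrib name="min_token_size" type="integer"> 4 </attrib>

  <!-- Tokens longer than 20 characters are excluded from the lexicon;
       this value must stay greater than min_token_size -->
  <attrib name="max_token_size" type="integer"> 20 </attrib>
</section>
```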
timeouts options
These options specify various global crawler time-out settings.
Option | Type | Value | Description |
---|---|---|---|
compaction_idle | integer | 1-3600 | Specifies the time-out period (in seconds) for all ongoing crawl activity to stop, in preparation for the nightly content store defragmentation. Site managers that are not idle at this point must be stopped before defragmentation can start. Default: 600 |
compaction_kill | integer | 1-3600 | Specifies the time-out period (in seconds) for site managers to shut down before defragmentation. Site manager processes that are not stopped during this time will be killed. Default: 120 |
shutdown_fileserver | integer | 1-3600 | Specifies the shut-down time-out period (in seconds) for the file server. Processes that do not shut down within the time-out period are killed. Default: 10 |
shutdown_postprocess | integer | 1-3600 | Specifies the shut-down time-out period (in seconds) for postprocess. Processes that do not shut down within the time-out period are killed. Default: 300 |
shutdown_sitemanager | integer | 1-3600 | Specifies the shut-down time-out period (in seconds) for the site manager. Processes that do not shut down within the time-out period are killed. Default: 300 |
Example
The following example specifies a typical time-out section:
<section name="timeouts">
<attrib name="compaction_idle" type="integer"> 600 </attrib>
<attrib name="compaction_kill" type="integer"> 120 </attrib>
<attrib name="shutdown_sitemanager" type="integer"> 300 </attrib>
<attrib name="shutdown_postprocess" type="integer"> 300 </attrib>
<attrib name="shutdown_fileserver" type="integer"> 10 </attrib>
</section>