Regular Expressions Support in SharePoint 2010 Crawling
Search admins often need to exclude from a crawl any files that match a certain pattern. For example:
· In a bank, file names starting with a Social Security Number (SSN)
· On a business site, file names containing a credit card number
· URLs in which a certain parameter of an aspx file has a specific value
The usual solution is to allow admins to create “crawl rules” that restrict crawlers from following specific links. The most basic crawl rule specifies a complete URL for the file to be crawled, which requires the admin to create as many rules as there are files in their repository. A more practical solution often implemented involves the use of the wildcard character: “*”. This character matches everything, so admins can create a rule using the wildcard to include (or omit) all files under a particular folder or path:
\\myfileserver\myclientsfolder\*
This works if all the files are located neatly in one folder, but what if they are spread across the repository (or Web site)? This is the problem that is solved by using regular expression (RegEx) syntax.
The SharePoint Solution
In SharePoint 2007, the wildcard operator “*” is the only operator supported in crawl rules for matching characters. As mentioned, it is a brute-force operator that matches everything. Wildcard-only rules do not give the admin the flexibility to, for example, recognize and omit URLs that contain Social Security Numbers or that have an aspx parameter with a specific value.
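The limitation can be illustrated outside SharePoint. The sketch below is only an approximation: it uses Python's fnmatch and re modules as stand-ins for wildcard rules and RegEx rules, and the host name and file names are made up for illustration. It is not SharePoint's own matching engine.

```python
import fnmatch
import re

# Hypothetical file URLs; the file names are invented for this sketch.
urls = [
    "file://clientsdata/123-45-6789-statement.pdf",
    "file://clientsdata/newsletter-2010.pdf",
]

# A wildcard-only rule can only say "everything under this path".
wildcard_rule = "file://clientsdata/*"
wildcard_hits = [u for u in urls if fnmatch.fnmatchcase(u, wildcard_rule)]

# A regular expression can single out names that start with an SSN-like pattern.
ssn_rule = re.compile(r"file://clientsdata/[0-9]{3}-[0-9]{2}-[0-9]{4}.*", re.IGNORECASE)
regex_hits = [u for u in urls if ssn_rule.fullmatch(u)]

print(wildcard_hits)  # both URLs are caught by the wildcard
print(regex_hits)     # only the SSN-style file name is caught by the regex
```

With the wildcard, the admin must include or exclude the whole folder; the regular expression narrows the rule to the sensitive file names only.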
SharePoint 2010 includes some new capabilities in this area. The default behavior of crawl rules in SharePoint 2010 is the same as it was in SharePoint 2007, but with SharePoint Search 2010, administrators can create crawl rules to include or exclude URLs that match regular expressions. To enable regular expressions, the admin need only select the check box on the Crawl Rules creation UI as shown in the image below.
Regular Expression Operators
The table below lists and describes the regular expression operators that are supported for crawl rules in SharePoint 2010:
Operator | Symbol | Description | Example rule | Will match | Won’t match
Group | () | Characters can be grouped using round brackets. Any operator applied to the group applies to the group as a whole. | | |
Match any character | . | This operator matches any character. It does not match NULL. | | |
Match zero or one | ? | The expression it is applied to may be absent from the target address or may appear exactly once. | | https://mysite/page.html AND https://mysite/page1.html |
Match zero or more | * | The expression it is applied to may be absent from the target address or may be repeated any number of times. | https://mysite/page(1)*.html | https://mysite/page.html AND https://mysite/page111.html |
Match at least one | + | The expression it is applied to must appear in the target address at least once. | https://mysite/page(1)+.html | https://mysite/page1.html AND https://mysite/page111.html | https://mysite/page.html
Exact count | {num} | This operator is denoted by a number inside “{}”, e.g. {5}. The expression it is applied to must be repeated exactly the specified number of times in the target address. | https://myfiles/(9){4}-(0){2}.html | https://myfiles/9999-00.html |
Minimum count | {num,} | This operator is denoted by a number inside “{}” followed by a ",", e.g. {5,}. The expression it is applied to must be repeated at least the specified number of times in the target address. | https://myfiles/(9){4,}-(0){2}.html | https://myfiles/9999-00.html AND https://myfiles/99999-00.html | https://myfiles/999-00.html
Range count | {num1,num2} | This operator is denoted by two numbers inside “{}” separated by a ",", e.g. {5,8}. The first number defines the lower limit and the second number defines the upper limit. The expression it is applied to must be repeated between num1 and num2 times in the target address. A valid rule will always have num1 < num2. | https://myfiles/(9){4}-(0){2,3}.html | https://myfiles/9999-00.html AND https://myfiles/9999-000.html |
Alternation | vertical bar (|) | This operator is applied to two expressions and matches only one of the two. | file://myshare/((folder1)|(folder2))/.* | \\myshare\folder1\<any files> OR \\myshare\folder2\<any files> | \\myshare\folder1folder2\<any files>
List | [ <list of chars> ] | | https://testhost/test[1-3].htm | https://testhost/test1.htm OR https://testhost/test2.htm OR https://testhost/test3.htm | https://testhost/test.htm
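The example rules in the table can be tried out with any regular expression engine that supports the same operators. The following is a minimal sketch using Python's re module, which is an assumption made for illustration and not the engine SharePoint uses. It also assumes that a rule must match the whole URL and that matching is case insensitive unless requested otherwise. The helper name rule_matches and the file names in the alternation case are hypothetical.

```python
import re

def rule_matches(rule: str, url: str, match_case: bool = False) -> bool:
    """Return True if the whole URL matches the rule (assumed semantics)."""
    flags = 0 if match_case else re.IGNORECASE
    return re.fullmatch(rule, url, flags) is not None

# (rule, URLs expected to match, URLs expected not to match), taken from the table.
cases = [
    ("https://mysite/page(1)*.html",
     ["https://mysite/page.html", "https://mysite/page111.html"], []),
    ("https://mysite/page(1)+.html",
     ["https://mysite/page1.html", "https://mysite/page111.html"],
     ["https://mysite/page.html"]),
    ("https://myfiles/(9){4}-(0){2}.html",
     ["https://myfiles/9999-00.html"], []),
    ("https://myfiles/(9){4,}-(0){2}.html",
     ["https://myfiles/9999-00.html", "https://myfiles/99999-00.html"],
     ["https://myfiles/999-00.html"]),
    ("https://myfiles/(9){4}-(0){2,3}.html",
     ["https://myfiles/9999-00.html", "https://myfiles/9999-000.html"], []),
    # The file names below are hypothetical stand-ins for "<any files>".
    ("file://myshare/((folder1)|(folder2))/.*",
     ["file://myshare/folder1/a.docx", "file://myshare/folder2/b.docx"],
     ["file://myshare/folder1folder2/c.docx"]),
    ("https://testhost/test[1-3].htm",
     ["https://testhost/test1.htm", "https://testhost/test2.htm",
      "https://testhost/test3.htm"],
     ["https://testhost/test.htm"]),
]

# Collect every case where Python's engine disagrees with the table.
mismatches = []
for rule, should_match, should_not_match in cases:
    for url in should_match:
        if not rule_matches(rule, url):
            mismatches.append((rule, url, "expected a match"))
    for url in should_not_match:
        if rule_matches(rule, url):
            mismatches.append((rule, url, "expected no match"))

print(mismatches)  # an empty list means Python agrees with every table entry
```

Note that the "." before "html" in these rules is the "match any character" operator, so the rules are slightly more permissive than they look.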
Using RegEx Operators in Crawl Rules
Once you understand the RegEx operators above and how to enable them in the crawler, there are only a couple of other things you need to keep in mind:
Protocol part
Regular expression operators cannot be used in the protocol part of the URL. This means, for example, the following RegEx rule cannot be created:
.*//www.microsoft.com/.*
If you try to create a rule like this, the system adds https:// at the beginning, so “.*” becomes the second part of the URL. The resulting rule in this case is:
https://.*//www.microsoft.com/.*
which may not be what you intended.
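Before creating a rule, it can help to check that its protocol part is plain text. The helper below is a hypothetical pre-check written in Python, not part of SharePoint. The list of accepted prefixes is an assumption and should be adjusted to the content sources you crawl.

```python
# Assumed set of literal prefixes a rule may start with (including UNC paths).
KNOWN_PREFIXES = ("http://", "https://", "file://", "\\\\")

def protocol_is_literal(rule: str) -> bool:
    """Return True if the rule starts with a plain, operator-free protocol prefix."""
    return rule.lower().startswith(KNOWN_PREFIXES)

# The rule from the text has a regex operator where the protocol should be.
print(protocol_is_literal(".*//www.microsoft.com/.*"))       # False
print(protocol_is_literal("https://www.microsoft.com/.*"))   # True
```

A rule that fails this check is one that the system would rewrite by adding a protocol in front of it, as described above.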
Case sensitive comparison
RegEx rules are case insensitive by default. To make a rule match URLs case-sensitively, the administrator should select the “Match case” check box in the rule creation UI as shown below:
If the “Match case” check box is selected, the crawler will preserve the case of matching URLs during the crawl. In the example above, the rule will match https://test/AbC123.html and will not match https://test/Abc123.html.
This feature comes in handy when SharePoint is used to crawl web sites hosted on Unix-based web servers, which are case sensitive.
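The effect of the “Match case” option can be approximated in Python with and without the re.IGNORECASE flag. This is only a sketch: the rule string below is hypothetical (it is not a rule quoted in this article), and Python's re module stands in for the crawler's matching.

```python
import re

# Hypothetical rule chosen so that it fits the two URLs mentioned in the text.
rule = r"https://test/AbC[0-9]{3}\.html"

default_rule = re.compile(rule, re.IGNORECASE)  # "Match case" cleared (default)
match_case_rule = re.compile(rule)              # "Match case" selected

results = {}
for url in ("https://test/AbC123.html", "https://test/Abc123.html"):
    results[url] = (
        default_rule.fullmatch(url) is not None,
        match_case_rule.fullmatch(url) is not None,
    )
    print(url, results[url])
```

With the default setting both URLs match; with case-sensitive matching only https://test/AbC123.html does.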
Examples
Here are some examples demonstrating the usefulness of regular expressions in crawl rules:
Rule | Description
\\myshare\.* | Match everything under the share “myshare”
file://.*/[0-9]{4}-[1,2]-[a-z]{4}.docx | Match all the links with file names having the following pattern: <4 digits>-<1 or 2>-<4 characters>.docx
\\myshare\((folder1)|(myfolder))\.* | Match all files in folder1 or myfolder in \\myshare
https://mysite/myasp.aspx[?]param1=value | Use a regex operator character ("?" in this case) as a literal character in a regex rule by enclosing it in a list.
https://mysite/myasp.aspx[?]param[12]=.* | Match all links pointing to myasp.aspx with either param1 or param2 specified.
https://site/.*.aspx[?]category=1&subcategory=.* | Match all aspx links that have a specific value for the first parameter, ignoring the value of the second parameter
file://clientsdata/[0-9]{3}-[0-9]{2}-[0-9]{4}.* | Match all files that start with a Social Security Number
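Some of the URL-style rules above can be tried against sample addresses. The sketch below again uses Python's re module as a stand-in and assumes whole-URL, case-insensitive matching. All sample URLs are invented for illustration, and the UNC-style rules are left out because their backslashes would be read as escape characters by Python.

```python
import re

def rule_matches(rule: str, url: str) -> bool:
    """Whole-URL, case-insensitive match (assumed semantics, not SharePoint's engine)."""
    return re.fullmatch(rule, url, re.IGNORECASE) is not None

ssn_rule = "file://clientsdata/[0-9]{3}-[0-9]{2}-[0-9]{4}.*"
param_rule = "https://mysite/myasp.aspx[?]param[12]=.*"
category_rule = "https://site/.*.aspx[?]category=1&subcategory=.*"
docx_rule = "file://.*/[0-9]{4}-[1,2]-[a-z]{4}.docx"

# (rule, hypothetical URL, expected result)
checks = [
    (ssn_rule, "file://clientsdata/123-45-6789_loan.pdf", True),
    (ssn_rule, "file://clientsdata/loan_123-45-6789.pdf", False),
    (param_rule, "https://mysite/myasp.aspx?param2=abc", True),
    (param_rule, "https://mysite/myasp.aspx?param3=abc", False),
    (category_rule, "https://site/shop/list.aspx?category=1&subcategory=7", True),
    (category_rule, "https://site/shop/list.aspx?category=2&subcategory=7", False),
    (docx_rule, "file://server/docs/2010-1-plan.docx", True),
    (docx_rule, "file://server/docs/2010-3-plan.docx", False),
]

outcomes = [rule_matches(rule, url) == expected for rule, url, expected in checks]
print(all(outcomes))

# In Python, the list [1,2] also accepts a literal comma; whether SharePoint
# treats the comma the same way is not confirmed here.
comma_accepted = rule_matches(docx_rule, "file://server/docs/2010-,-plan.docx")
print(comma_accepted)
```

If the comma behavior matters for your rule, writing the list as [12], as the param[12] example does, avoids the ambiguity in engines that behave like Python's.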
Syed Anas Hashmi | SDET | Microsoft Enterprise Search Group
Comments
Anonymous
August 22, 2011
Hi Syed, I've created crawl rules to only include page Dispform.aspx and exclude others pages (such as Allitems.aspx and Thumbnails.aspx ) , and I've tested it from the button 'Test' it works as expected, But When I've done Full Crawl to apply it, it makes my search result empty. Please Help, Thanks
Anonymous
June 19, 2013
Is this article also valid for SharePoint 2013?