Optimizing Regex Performance, Part 3 [Ron Petrusha]

[アーティクル]
03/28/2011

Regular expressions in the .NET Framework support a number of grouping constructs, which allow a regular expression pattern to be grouped into one or more subexpressions. Grouping constructs are essential for creating backreferences, as well as for defining a subexpression to which a quantifier is applied.

The Performance Impact of Capturing Groups

The most commonly used grouping constructs in the .NET Framework regular expression language are (subexpression) , which defines a numbered capturing group, and (?<name> subexpression) , which defines a named capturing group. The use of these language elements, however, has a definite cost; they cause the GroupCollection object returned by the Match.Groups property to be populated with the most recent unnamed or named captures, and if a single grouping construct has captured multiple substrings in the input string, they also populate the CaptureCollection object returned by the Group.Captures property of a particular capturing group with multiple Capture objects.

For example, the regular expression (\b(\w+)[;,]?\s?)+([.?!]) is designed to capture an entire sentence, and the subexpression (\w+) is designed to capture the individual words in that sentence. (See Regular Expression Language Elements for reference documentation for the regular expression language supported by the .NET Framework.) The regular expression defines three capturing groups, which are listed in the following table. In addition, the first group in the GroupCollection object (at index 0) contains the complete pattern match; the values of its Group.Index, Group.Length, Group.Success, and Group.Value properties are identical to the values of its parent Match object.

Index	Pattern	Description
1	`(\b(\w+)[;,]?\s?)`	Captures a word in the sentence, along with any following punctuation mark and trailing space. The + quantifier is applied to the group so that the regular expression engine will advance beyond the first word in the input string to capture all words in the sentence.
2	`(\w+)`	Captures a word in the sentence
3	`([.?!])`	Captures the punctuation mark used to end the sentence.

The following example uses this regular expression to extract a sentence, its individual words, and its ending punctuation mark from an input string.

 [Visual Basic]
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim pattern As String = "(\b(\w+)[;,]?\s?)+([.?!])"

      Dim input As String = "This is a short, silly sentence."

      Dim match As Match = Regex.Match(input, pattern)
      If match.Success Then

         Console.WriteLine("'{0}' found at position {1}",
                           match.Value, match.Index)

         If match.Groups.Count > 1 Then
            For ctr As Integer = 1 To match.Groups.Count - 1
               Dim grp As Group = match.Groups(ctr)
               Console.WriteLine("   Group {0}: '{1}'",
                                 ctr, grp) 

               For cctr As Integer = 0 To grp.Captures.Count - 1
                  Console.WriteLine("Capture {0}: '{1}'",
                                    cctr, grp.Captures(cctr).Value)
               Next
            Next
         End If
      End If
   End Sub
End Module

[C#]
using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string pattern = @"(\b(\w+)[;,]?\s?)+([.?!])";
      string input = "This is a short, silly sentence.";

      Match match = Regex.Match(input, pattern);
      if (match.Success) {
         Console.WriteLine("'{0}' found at position {1}",
                           match.Value, match.Index);
         if (match.Groups.Count > 1)
            for (int ctr = 1; ctr <= match.Groups.Count - 1; ctr++) {
               Group grp = match.Groups[ctr];
               Console.WriteLine("   Group {0}: '{1}'", ctr, grp);  
               for (int cctr = 0; cctr <= grp.Captures.Count - 1; cctr++)
                  Console.WriteLine("     Capture {0}: '{1}'", 
                                    cctr, grp.Captures[cctr].Value);
            }
      }
   }
}

The example displays the following output. Note that the CaptureCollection captured by the first group captures each word in the sentence along with a following punctuation character if it does not conclude the sentence, the CaptureCollection captured by the second group contains each word in the sentence, and the final group contains only the sentence’s final punctuation mark.

 'This is a short, silly sentence.' found at position 0
   Group 1: 'sentence'
      Capture 0: 'This '
      Capture 1: 'is '
      Capture 2: 'a '
      Capture 3: 'short, '
      Capture 4: 'silly '
      Capture 5: 'sentence'
   Group 2: 'sentence'
      Capture 0: 'This'
      Capture 1: 'is'
      Capture 2: 'a'
      Capture 3: 'short'
      Capture 4: 'silly'
      Capture 5: 'sentence'
   Group 3: '.'
      Capture 0: '.'

In this regular expression, our first capturing group is present only so that a quantifier can be applied to it. This allows the regular expression engine to advance from one word to the next, and so to capture an entire sentence. The second group is useful if we want to programmatically capture the individual words from the sentence, which are available from the CaptureCollection object. The third is useful if we want to determine which punctuation symbol ended the sentence; it is available from the GroupCollection object.

We can disable captures (so that the GroupCollection and CaptureCollection objects are not populated as matches are found) by using the (?:subexpression) language element. We can assess the impact that capturing groups have on performance by using a StopWatch object to measure the execution time of the following three similar regular expressions when we use each to extract all of the sentences in the text of Theodore Dreiser's The Financier:

The original regular expression, (\b(\w+)[;,]?\s?)+([.?!]).
A regular expression that disables captures by the first group, (?:\b(\w+)[;,]?\s?)+([.?!]). This group exists only so that a quantifier could be applied to it.
A regular expression that has no capturing groups, (?:\b(?:\w+)[;,]?\s?)+[.?!]. It captures the entire sentence only, and disables all capturing groups. (It actually returns a Match object whose Groups property returns a GroupCollection object with a single Group object. This group at index 0 represents the entire match; it is always present and cannot be suppressed.

The source code for the example is:

 [Visual Basic]
Imports System.Diagnostics
Imports System.IO

Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim patterns() As String = { "(\b(\w+)[;,]?\s?)+([.?!])", 
                                   "(?:\b(\w+)[;,]?\s?)+([.?!])",
                                   "(?:\b(?:\w+)[;,]?\s?)+[.?!]" }
      Dim input As String = (New StreamReader(".\Dreiser_TheFinancier.txt")).ReadToEnd()
      Dim sw As Stopwatch

      For Each pattern In patterns
         ' Instantiate the regular expression object before measuring performance.
         Dim regex As New Regex(pattern)

         ' Time matches.

         sw = Stopwatch.StartNew()

         Dim ctr As Integer = 0
         Dim match As Match = Regex.Match(input, pattern)
         Do While match.Success
            ctr += 1
            match = match.NextMatch()
         Loop

         sw.Stop()

         ' Display performance information.
         Console.WriteLine("'{0}' ({1} capturing groups): {2:N0} matches in {3:ss\.fffff}",
                           regex.ToString(), regex.GetGroupNumbers().Length,
                           ctr, sw.Elapsed)
         Console.WriteLine()
      Next
   End Sub
End Module

[C#]
using System;

using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string[] patterns = { @"(\b(\w+)[;,]?\s?)+([.?!])",
                            @"(?:\b(\w+)[;,]?\s?)+([.?!])",
                            @"(?:\b(?:\w+)[;,]?\s?)+[.?!]" };
      string input = (new StreamReader(@".\Dreiser_TheFinancier.txt")).ReadToEnd();
      Stopwatch sw;

      foreach (var pattern in patterns) {
         // Instantiate the regular expression object before measuring performance.

         Regex regex = new Regex(pattern);

         // Time matches.
         sw = Stopwatch.StartNew();

         int ctr = 0;
         Match match = Regex.Match(input, pattern);
         while (match.Success) {
            ctr++;
            match = match.NextMatch();
         }
         sw.Stop();

         // Display performance information.

         Console.WriteLine("'{0}' ({1} capturing groups): {2:N0} matches in {3:ss\\.fffff}\n",
                           regex.ToString(), regex.GetGroupNumbers().Length,
                           ctr, sw.Elapsed);
      }  
   }   
}

As the following output from the example shows, populating the GroupCollection and CaptureCollection objects has a significant impact on the regular expression engine’s performance. The regular expression with three capturing groups executed slightly more than 15faster than the one with four capturing groups, and the regular expression with just one capturing group executed more than 25% faster than the one with four. The poorer performance of the regular expression with four capturing groups is a legitimate cost if programmatic access to these captures is required. However, if it is not, populating the complete regular expression object model constitutes unnecessary overhead.

 '(\b(\w+)[;,]?\s?)+([.?!])' (4 capturing groups): 12,633 matches in 02.4953428
'(?:\b(\w+)[;,]?\s?)+([.?!])' (3 capturing groups): 12,633 matches in 02.0867863
'(?:\b(?:\w+)[;,]?\s?)+[.?!]' (1 capturing groups): 12,633 matches in 01.7976809

Capturing and Non-Capturing Grouping Constructs

The regular expression language in the .NET Framework supports a number of grouping constructs, which are documented in detail in the Grouping Constructs topic in the MSDN Library. The following table lists each grouping construct and indicates whether it captures the text matched by the group.

Language Element	Description	Capturing or Non-capturing
`(subexpression)`	Captures a group, which can be referenced by number or retrieved from the GroupCollection object by its ordinal position.	Capturing
`(?<name>subexpression)`	Captures a named group, which can be referenced by name or by number or retrieved from the GroupCollection object by its name or ordinal position.	Capturing
`(?<name1-name2>subexpression)`	Balancing group definition. Tracks nested constructs such as opening and closing parentheses or opening and closing brackets. Capturing groups is integral to the operation of this language element.	Capturing
`(?:subexpression)`	Non-capturing group. Typically, it is used to define a subexpression to which a quantifier can be applied.	Non-capturing
`(?imnsx-imnsx:subexpression)`	Defines options (such as case-sensitivity or the significance of white space) that apply to `subexpression`.	Non-capturing
`(?=subexpression)`	Positive lookahead assertion. Looks ahead of the current position in the input string to determine if `subexpression` is matched.	Non-capturing
`(?!subexpression)`	Negative lookahead assertion. Looks ahead of the current position in the input string to determine if `subexpression` is not matched.	Non-capturing
`(?<=subexpression)`	Positive lookbehind assertion. Looks behind the current position in the input string to determine if `subexpression is matched.`	Non-capturing
`(?<!subexpression)`	Negative lookbehind assertion. Looks behind the current position in the input string to determine if `subexpression` is not matched.	Non-capturing
`(?>subexpression)`	Non-backtracking subexpression. Does not surrender any characters from a matched subexpression to ensure that the regular expression pattern as a whole is matched.	Non-capturing

Disabling Captures

You can disable all captures in a group, or you can disable all but named captures. Unnamed captures can be disabled for a regular expression as a whole, for a part of a regular expression, or for a particular subexpression.

As the previous example showed, the (?:subexpression) language construct disables captures by a group. It disables captures only for the group to which it applies; it does not disable captures by nested or inner groups. In the previous example, for instance, the regular expression pattern (?:\b(\w+)[;,]?\s?)+([.?!]) disables captures by the outer group (group 1). The inner group (\w+) (group 2) is not affected and remains a capturing group.

You can also control implicit captures by using regular expression language options. These options are specified either as members of the RegexOptions enumeration or as inline characters in an options language construct. For more information, see Regular Expression Options.

The RegexOptions.ExplicitCapture option or its corresponding n language option can be used to disable all unnamed or implicit captures; only matches from named groups defined by the (?<name>subexpression) language element are captured.

The RegexOptions.ExplicitCapture option disables implicit captures by the regular expression engine for a particular regular expression or for a particular matching operation. It disables implicit captures for the regular expression if it is passed to the Regex class constructor, and it disables implicit captures for a particular matching operation if it is passed to a static Regex matching method, such as Regex.Match or Regex.Matches. The following example illustrates this. It changes the previous regular expression pattern to (\b(?<words>\w+)[;,]?\s?)+([.?!]) so that it has two implicit capturing groups and one named capturing group. When the Regex object is instantiated with its options parameter set to RegexOptions.ExplicitCapture, only matches made by the named group are captured.

 [Visual Basic]
Imports System.Text.RegularExpressions

Module Example

   Public Sub Main()
      Dim pattern As String = "(\b(?<words>\w+)[;,]?\s?)+([.?!])"
      Dim input As String = "This is a short, silly sentence."

      Dim rgx As New Regex(pattern, RegexOptions.ExplicitCapture)
      Dim match As Match = rgx.Match(input)

      If match.Success Then

         Console.WriteLine("'{0}' found at position {1}.", 
                           match.Value, match.Index)
         Console.WriteLine("There are {0} captured groups:", 
                           match.Groups.Count)
         Dim ctr As Integer = 0
         For Each grp As Group In match.Groups
            Console.WriteLine("   Group <{0}>: '{1}' at position {2}.", 
                              rgx.GroupNameFromNumber(ctr), grp.Value,
                              grp.Index)
            Dim capCtr As Integer = 1
            For Each capture In grp.Captures
               Console.WriteLine("      Capture {0}: '{1}' at position {2}.", 
                                 capCtr, capture.Value, capture.Index)
               capCtr += 1
            Next

            ctr += 1
         Next
      End If
   End Sub
End Module

[C#]
using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string pattern = @"(\b(?<words>\w+)[;,]?\s?)+([.?!])";
      string input = "This is a short, silly sentence.";
      Regex rgx = new Regex(pattern, RegexOptions.ExplicitCapture);
      Match match = rgx.Match(input);

      if (match.Success) {
         Console.WriteLine("'{0}' found at position {1}.",
                           match.Value, match.Index);
         Console.WriteLine("There are {0} captured groups:",
                           match.Groups.Count);
         int ctr = 0;
         foreach (Group grp in match.Groups) {
            Console.WriteLine("   Group <{0}>: '{1}' at position {2}.",
                              rgx.GroupNameFromNumber(ctr++), grp.Value, 
                              grp.Index);
            int capCtr = 1;
            foreach (Capture capture in grp.Captures)  
               Console.WriteLine("      Capture {0}: '{1}' at position {2}.",  
                                 capCtr++, capture.Value, capture.Index);
         }
      }
   }
}

The example displays the following output:

 'This is a short, silly sentence.' found at position 0.
There are 2 captured groups:
   Group <0>: 'This is a short, silly sentence.' at position 0.
      Capture 1: 'This is a short, silly sentence.' at position 0.
   Group <words>: 'sentence' at position 23.
      Capture 1: 'This' at position 0.
      Capture 2: 'is' at position 5.
      Capture 3: 'a' at position 8.
      Capture 4: 'short' at position 10.
      Capture 5: 'silly' at position 17.
      Capture 6: 'sentence' at position 23.

You can also disable implicit captures for part of a regular expression pattern by using the (?n:subexpression) grouping construct. It disables implicit captures for all implicit groups defined in subexpression. If the construct is applied to the entire regular expression pattern, it is equivalent to instantiating a Regex object with the RegexOptions.ExplicitCapture option. For example, the output would remain unchanged if the regular expression pattern in the previous example were replaced by (?n:(\b(?<words>\w+)[;,]?\s?)+([.?!])). On the other hand, the regular expression pattern (?n:((a+)(b+))+)(c.) disables captures in the subexpression ((a+)(b+))+), including in the two nested expressions in the subexpression.

The (?n) inline element can also be used to disable implicit captures from the point that it is encountered in a regular expression to the end of the expression, or until the implicit captures only option is disabled by the (?-n) inline option. (Note, though, that the (?n:subexpression) syntax can be used for the same result and is significantly clearer.) For example, the regular expression pattern (?n)(\b(?<words>\w+)[;,]?\s?)+(?-n)([.?!]) disables implicit captures in the subexpression (\b(?<words>\w+)[;,]?\s?)+ but enables them in the subexpression ([.?!]). As a result, a match has two capturing groups. The first captures the punctuation mark used to end a sentence. The second captures each word in the sentence.

Summary

The regular expression engine incurs a substantial performance hit when capturing groups appear in a regular expression. In cases where captures are not needed and grouping constructs are used primarily so that quantifiers can be applied to them, captures should be disabled. The regular expression language in the .NET Framework includes a non-capturing grouping construct, (?:subexpression). In addition, implicit or unnamed captures can be disabled by supplying the RegexOptions.ExplicitCapture option to a Regex class constructor or static matching method, or by using the by using the (?n:subexpression) grouping construct or the (?n) inline element.

Comments

Anonymous
March 29, 2011
If you want to optimize performance I recommend you use RegexOptions.Compiled and initialize the Regex object as a static read only field.
Anonymous
April 14, 2011
Compiled static regular expressions are not optimized for all scenarios. For a discussion of when to use each type of regular expression object (instance and static, compiled and interpreted, compiled to a separate assembly), see the "Optimizing Regular Expression Performance, Part I: Working with the Regex Class and Regex Objects" at blogs.msdn.com/.../optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx.

次の方法で共有