Condividi tramite


Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking [Ron Petrusha]

One of the most powerful features of regular expressions in the .NET Framework -- and of Nondeterministic Finite Automaton (NFA) regular expression engines generally -- is their ability to execute in a non-linear manner. That is, instead of advancing one character at a time, they can return to a previous saved state in order to match a regular expression pattern. For example, if you try to match the string “A match.” with the regular expression pattern .+\. the regular expression engine first attempts to match the entire string. It then retreats one character so that the .+ construct matches “A match” and \. matches the period at the end of the string. This ability to return to a previous saved state in order to continue the match is known as backtracking.

Note: This is the second part of a three-part series on optimizing regular expression performance. The first part, Optimizing Regular Expression Performance, Part I: Working with the Regex Class and Regex Objects, discusses the efficient use of Regex static and instance methods.

Although backtracking contributes to the power and flexibility of regular expressions, it can have an enormous negative impact on performance. In a worst-case scenario, when backtracking is used excessively, it might take one or more days for the regular expression engine to process even a relatively short input string that does not match the regular expression pattern.

The NFA Engine places control over backtracking (and therefore responsibility for the consequences of backtracking) in the hands of regular expression developers who may be unaware of how backtracking affects performance. This fact often goes overlooked both when creating regular expressions and when assessing regular expression performance. As a result, the inappropriate use of backtracking is one of the fundamental performance bottlenecks that applications relying on regular expressions encounter.

Backtracking occurs when a regular expression contains the following language constructs and under the following conditions:

  • An alternation construct, if the first alternative does not provide a successful pattern match.
  • An indeterminate quantifier (such as * or + ), when the greedy match resulting from the expression to which the quantifier is applied prevents a successful match.
  • A conditional expression, if the first condition does not provide a successful pattern match.

When the regular expression engine encounters one of these constructs or expressions, it saves its current state so that it can return to it if necessary.

We’ll first examine backtracking with alternation, and then examine backtracking with nested quantifiers. We’ll finish by looking at a worst case scenario in which a poor use of backtracking causes execution time to increase exponentially as the number of characters in the string to be matched increases. In each case, we’ll attempt to show that an alternative pattern that better controls backtracking offers superior performance.

Note: In the code examples in this article, we make extensive use of the System.Diagnostics.Stopwatch class to measure regular expression performance. These measurements are used only for comparative purposes, and reflect performance on a particular system. They are not intended to be construed as benchmarks.

Alternation and Backtracking

The alternation construct ( | ) allows a regular expression pattern to include the equivalent of an Or operator; the regular expression engine will match text that conforms to one of two or more patterns. The regular expression engine performs no optimization on alternation constructs.

Because alternation is a backtracking construct, it can entail a significant performance penalty if the alternation pattern is not well crafted. In this section, we’ll examine some of the factors to consider when constructing an alternation pattern that will perform efficiently.

Order Matters

Alternation constructs are evaluated sequentially from left to right. The second alternative is evaluated only if the first alternative fails. If there are three or more alternatives, the third alternative is evaluated only if the first and second alternatives fail, and so on. Because of this, the ordering of items in an alternation construct is significant. Subpatterns that are more likely to be encountered in an input string should precede subpatterns that are less likely to be encountered.

The following example uses an alternation construct to find words followed by one of a number of punctuation symbols in the text of Theodore Dreiser’s The Financier. In the first regular expression pattern, punctuation symbols that are less likely to be encountered precede symbols that are more frequently encountered. In the second, punctuation symbols that are more likely to be encountered precede symbols that are less frequently encountered.

 [Visual Basic]
Option Strict On

Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      ' Organize the punctuation marks in a poor order.
      Dim pattern1 As String = "\b\w+(;|:|!|\?|\.|,)\s"
      Dim rgx1 As New Regex(pattern1, RegexOptions.SingleLine)
      Dim sw As Stopwatch
      Dim sr As New StreamReader(".\Dreiser_TheFinancier.txt")
      Dim input As String = sr.ReadToEnd()
      sr.Close()

      sw = Stopwatch.StartNew()
      Dim match1 As Match = rgx1.Match(input)
      Dim ctr1 As Integer
      Do While match1.Success
         ctr1 += 1
         match1 = match1.NextMatch
      Loop
      sw.Stop()
      Console.WriteLine("Found {0} matches in {1}", ctr1, sw.Elapsed)

      ' Organize the punctuation marks in rough order of occurrence.
      Dim pattern2 As String = "\b\w+(\.|,|\?|;|:|!)\s"
      Dim rgx2 As New Regex(pattern2, RegexOptions.SingleLine)

      sw = Stopwatch.StartNew()
      Dim match2 As Match = rgx2.Match(input)
      Dim ctr2 As Integer
      Do While match2.Success
         ctr2 += 1
         match2 = match2.NextMatch
      Loop
      sw.Stop()
      Console.WriteLine("Found {0} matches in {1}", ctr2, sw.Elapsed)
   End Sub
End Module
 [C#]
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      // Organize the punctuation marks in a poor order.
      string pattern1 = @"\b\w+(;|:|!|\?|\.|,)\s";
      Regex rgx1 = new Regex(pattern1, RegexOptions.Singleline);
      Stopwatch sw;
      StreamReader sr = new StreamReader(@".\Dreiser_TheFinancier.txt");
      string input = sr.ReadToEnd();
      sr.Close();

      sw = Stopwatch.StartNew();
      Match match1 = rgx1.Match(input);
      int ctr1 = 0;
      while (match1.Success)
      {
         ctr1++;
         match1 = match1.NextMatch();
      }
      sw.Stop();
      Console.WriteLine("Found {0} matches in {1}", ctr1, sw.Elapsed);

      // Organize the punctuation marks in rough order of occurrence.
      string pattern2 = @"\b\w+(\.|,|\?|;|:|!)\s";
      Regex rgx2 = new Regex(pattern2, RegexOptions.Singleline);

      sw = Stopwatch.StartNew();
      Match match2 = rgx2.Match(input);
      int ctr2 = 0;
      while (match2.Success)
      {
         ctr2++;
         match2 = match2.NextMatch();
      }
      sw.Stop();
      Console.WriteLine("Found {0} matches in {1}", ctr2, sw.Elapsed);
   }
}

Output from the example shows that there is a small performance benefit from ordering the alternatives based on how likely they are to be encountered in the input string. The performance improvement is approximately 6% for this very simple regular expression, as the following graph shows. In a more complex regular expression pattern, particularly if he alternation pattern is followed by additional backtracking constructs, the performance benefit is likely to be much greater.

Found 26930 matches in 00:00:00.9649407
Found 26930 matches in 00:00:00.9044185
clip_image002[4]

 

Optimization Matters

Because alternation constructs involve backtracking, optimization is critical. Alternation patterns that are poorly written and that include duplicate elements perform much more poorly than those that do not. In the following example, the regular expression \b(the|this|that|then|there|these|those)\b matches a number of words starting with “th” (the, this, that, then, there, these, those) in a string. Because of backtracking, the regular expression engine performs the following steps to match the substring “these” in the input string:

1

It determines that the current position in the input string is on a word boundary. Because a grouping and alternation construct follow, it saves this position in the input string.

2

It begins comparing the “the” alternative with the input string. It compares “t” in the regular expression pattern with “t” in the input string and finds a match.

3

It compares “h” in the regular expression pattern with “h” in the input string and finds a match.

4

It compares “e” in the regular expression pattern with “e” in the input string and finds a match.

5

It determines whether the current position in the input string is on a word boundary. Because it is in the middle of the substring “these”, the match with the first alternation pattern fails, and the regular expression engine returns to its saved position (the beginning of the substring “these”) in the input string.

6-8

It compares the “this” alternative with the input string. It matches “t” and “h”, but fails to match “i" with “e” in the input string. The regular expression engine returns to its saved position in the input string.

9-11

It compares the “that” alternative with the input string. It matches “t” and “h”, but fails to match “a" with “e” in the input string. The regular expression engine returns to its saved position in the input string.

12-15

It compares the “then” alternative with the input string. It matches “t”, “h”, and “e”, but fails to match “n" with “s” in the input string. The regular expression engine returns to its saved position in the input string.

16-19

It compares the “there” alternative with the input string. It matches “t”, “h”, and “e”, but fails to match “r" with “s” in the input string. The regular expression engine returns to its saved position in the input string.

20-25

It compares the “these” alternative with the input string. It matches “t”, “h”, “e”, “s”, and “e”. It then determines that the current position in the input string is on a word boundary. The match succeeds. The last alternative in the regular expression (those) is not tested.

By using alternation only for unique portions of the regular expression pattern, we can improve the efficiency of string comparisons. In our example, because the substring “th” is common to all alternation patterns, we can rewrite the regular expression pattern as follows:

\bth(e|is|at|en|ere|ese|ose)\b

Since individual alternatives still share characters in common, we can further optimize this regular expression pattern through grouping:

\bth(e(n|re|se)|at|is|ose)\b

While 25 comparisons were required to match the input string “these” with the regular expression pattern \b(the|this|that|then|there|these|those)\b , only 9 comparisons are required to match it with \bth(e(n|re|se)|at|is|ose)\b. The following table lists the comparisons that the regular expression engine makes to match the input string “these” with this last regular expression pattern.

1

It determines that the current position in the input string is on a word boundary.

2

It matches “t” in the pattern with “t” in the input string.

3

It matches “h” in the pattern with “h” in the input string. Because a grouping and alternation construct follows, it saves this position in the input string.

4

It matches “e” in the pattern with “e” in the input string. Because a grouping and alternation construct follows, it saves this position (the second saved position) in the input string.

5

It attempts to match “n” in the pattern with “s” in the input string. The match fails, and the regular expression engine returns to its second saved position in the input string.

6

It attempts to match “r” in the pattern with “s” in the input string. The match fails, and the regular expression engine returns to its second saved position in the input string. Note that it does not test the “e” because it failed to match “r”.

7-8

It attempts to match “se” in the pattern with “se” in the input string. The match succeeds.

9

It determines that the current position in the input string is on a word boundary. The match succeeds.

The following is the complete example.

 [Visual Basic]
Option Strict On

Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim sw As Stopwatch
      Dim sr As New StreamReader(".\Dreiser_TheFinancier.txt")
      Dim input As String = sr.ReadToEnd()
      sr.Close()
      Dim ctr As Integer

      sw = Stopwatch.StartNew()
      ctr = 0
      Dim unoptPattern As String = "\b(the|this|that|then|there|these|those)\b"
      Dim rgxUnopt As New Regex(unoptPattern, RegexOptions.IgnoreCase)
      Dim match1 As Match = rgxUnopt.Match(input)
      Do While match1.Success
         ctr += 1
         match1 = match1.NextMatch()
      Loop
      sw.Stop()
      Console.WriteLine("{0}:{3}   Found {1:N0} matches in {2}", 
                        unoptPattern, ctr, sw.Elapsed, vbCrLf)
      Console.WriteLine()

      sw = Stopwatch.StartNew()
      ctr = 0
      Dim optPattern As String = "\bth(e(n|re|se)|at|is|ose)\b"
      Dim rgxOpt As New Regex(unoptPattern, RegexOptions.IgnoreCase)
      Dim match2 As Match = rgxOpt.Match(input)
      Do While match2.Success
         ctr += 1
         match2 = match2.NextMatch()
      Loop
      sw.Stop()
      Console.WriteLine("{0}:{3}   Found {1:N0} matches in {2}", 
                        optPattern, ctr, sw.Elapsed, vbCrLf)
   End Sub
End Module
 [C#]
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        Stopwatch sw;
        StreamReader sr = new StreamReader(@".\Dreiser_TheFinancier.txt");
        string input = sr.ReadToEnd();
        sr.Close();
        int ctr;

        sw = Stopwatch.StartNew();
        ctr = 0;
        string unoptPattern = @"\b(the|this|that|then|there|these|those)\b";
        Regex rgxUnopt = new Regex(unoptPattern, 
                                   RegexOptions.IgnoreCase);
        Match match1 = rgxUnopt.Match(input);
        while (match1.Success)
        {
            ctr++;
            match1 = match1.NextMatch();
        }
        sw.Stop();
        Console.WriteLine("{0}:\n   Found {1:N0} matches in {2}\n", 
                          unoptPattern, ctr, sw.Elapsed);

        sw = Stopwatch.StartNew();
        ctr = 0;
        string optPattern = @"\bth(e(n|re|se)|at|is|ose)\b";
        Regex rgxOpt = new Regex(unoptPattern,
                                 RegexOptions.IgnoreCase);
        Match match2 = rgxOpt.Match(input);
        while (match2.Success)
        {
            ctr++;
            match2 = match2.NextMatch();
        }
        sw.Stop();
        Console.WriteLine("{0}:\n   Found {1:N0} matches in {2}", 
                          optPattern, ctr, sw.Elapsed);
    }
}

The optimized regular expression offers approximately a 35% performance improvement over the unoptimized regular expression, as indicated by the following output and graph.

\b(the|this|that|then|there|these|those)\b:
Found 13,459 matches in 00:00:00.1833098

\bth(e(n|re|se)|at|is|ose)\b:
Found 13,459 matches in 00:00:00.1184890
clip_image002[6]

 

Optional Quantifiers and Backtracking

The regular expression engine also uses backtracking whenever a subexpression includes an optional quantifier (such as * or + ). For example, the regular expression pattern \b\p{Lu}\w*\b matches all words beginning with an uppercase character. To allow for the possibility that a word might consist of a single uppercase letter, the \w* subexpression matches any word character zero or more times. Because * is an optional quantifier, the regular expression engine saves its state with each word character matched, in case it has to abandon a portion of the match and return to a previous saved state in order to match the input string with a later portion of the regular expression pattern.

Is Backtracking Necessary?

An important question to ask about any regular expression that uses backtracking is whether, in view of its associated performance penalty, backtracking is necessary for a successful match. Backtracking often is necessary because regular expressions are greedy, and a greedy match of a subexpression is likely to consume some portion of what would otherwise be a successful match. But in many cases, backtracking is used unintentionally, because optional quantifiers that backtrack are more familiar and frequently used than their alternatives. However, backtracking comes at a price. Each time an optional quantifier is applied to an input string, the regular expression engine must save the state of the match at that point so that it can return to it if a later portion of the match fails. To put it another way, needless backtracking exacts a substantial performance penalty.

Let’s return to our example regular expression pattern, \b\p{Lu}\w*\b. Because a word boundary is not the same as, or a subset of, a word character, there is no possibility that the regular expression engine will cross a word boundary when matching word characters. This means that for this regular expression, backtracking can never contribute to the overall success of any match -- it can only extract a performance penalty, because the regular expression engine is forced to save its state for each successful preliminary match of a word character.

In cases where backtracking is unnecessary, it can be disabled by using the (?>subexpression) language construct, where subexpression is the regular expression subpattern that should not backtrack. To disable backtracking, we can change our regular expression from \b\p{Lu}\w*\b to \b\p{Lu}(?>\w*)\b. The following example uses these two regular expressions to extract words from the text of Theodore Dreisier’s The Financier.

 [Visual Basic]
Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim sw As Stopwatch
      Dim sr As New StreamReader(".\Dreiser_TheFinancier.txt")
      Dim input As String = sr.ReadToEnd()
      sr.Close()
      Dim ctr As Integer

      sw = Stopwatch.StartNew()
      ctr = 0
      Dim backtrackingPattern As String = "\b\p{Lu}\w*\b"
      Dim backtrackingRegex As New Regex(backtrackingPattern)
      Dim match1 As Match = backtrackingRegex.Match(input)
      Do While match1.Success
         ctr += 1
         match1 = match1.NextMatch()
      Loop
      sw.Stop()
      Console.WriteLine("{0}:\n   Found {1:N0} matches in {2}\n", 
                        backtrackingPattern, ctr, sw.Elapsed)

      sw = Stopwatch.StartNew()
      ctr = 0
      Dim noBacktrackPattern As String = "\b\p{Lu}(?>\w*)\b"
      Dim nonBacktrackingRegex As New Regex(noBacktrackPattern)
      Dim match2 As Match = nonBacktrackingRegex.Match(input)
      Do While match2.Success
         ctr += 1
         match2 = match2.NextMatch()
      Loop
      sw.Stop()
      Console.WriteLine("{0}:\n   Found {1:N0} matches in {2}", 
                        noBacktrackPattern, ctr, sw.Elapsed)
   End Sub
End Module
 [C#]
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        Stopwatch sw;
        StreamReader sr = new StreamReader(@".\Dreiser_TheFinancier.txt");
        string input = sr.ReadToEnd();
        sr.Close();
        int ctr;

        sw = Stopwatch.StartNew();
        ctr = 0;
        string backtrackingPattern = @"\b\p{Lu}\w*\b";
        Regex backtrackingRegex = new Regex(backtrackingPattern);
        Match match1 = backtrackingRegex.Match(input);
        while (match1.Success)
        {
            ctr++;
            match1 = match1.NextMatch();
        }
        sw.Stop();
        Console.WriteLine("{0}:\n   Found {1:N0} matches in {2}\n",
                          backtrackingPattern, ctr, sw.Elapsed);
        sw = Stopwatch.StartNew();
        ctr = 0;
        string noBacktrackPattern = @"\b\p{Lu}(?>\w*)\b";
        Regex nonBacktrackingRegex = new Regex(noBacktrackPattern);
        Match match2 = nonBacktrackingRegex.Match(input);
        while (match2.Success)
        {
            ctr++;
            match2 = match2.NextMatch();
        }
        sw.Stop();
        Console.WriteLine("{0}:\n   Found {1:N0} matches in {2}",
                          noBacktrackPattern, ctr, sw.Elapsed);
    }
}

Because it does not have to keep track of previous saved states, the regular expression that explicitly disables backtracking executes approximately 40% faster than the one that uses unnecessary backtracking. The following output from this example and graph illustrate the difference in performance.

 \b\p{Lu}\w*\b:
   Found 24,157 matches in 00:00:00.2465120

\b\p{Lu}(?>\w)*\b:
   Found 24,157 matches in 00:00:00.1483269

clip_image002[8]

Note that disabling backtracking is not always an option, since it changes the behavior of the regular expression engine. In the following example, the regular expression pattern (a+)ab matches one or more “a” characters, followed by the substring “ab”. This pattern matches the input string “aaaaaaaab”. However, if we change the regular expression pattern to (?>a+)ab to disable backtracking, the input string no longer matches. This is because when backtracking disabled, the regular expression engine is greedy and will not give up the last matched “a” in the string so that the match with “ab” succeeds.

 [Visual Basic]
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim input As String = "aaaaaaaab"
      
      Dim backtrackingPattern As String = "(a+)ab"
      Console.WriteLine(Regex.IsMatch(input, backtrackingPattern))
      ' Displays True.
      
      Dim nonBacktrackingPattern As String = "(?>a+)ab"
      Console.WriteLine(Regex.IsMatch(input, nonBacktrackingPattern))
      ' Displays False.          
   End Sub
End Module

 

 [C#]
using System;
using System.Text.RegularExpressions;
 
public class Example
{
   public static void Main()
   {
      string input = "aaaaaaaab";
      
      string backtrackingPattern = "(a+)ab";
      Console.WriteLine(Regex.IsMatch(input, backtrackingPattern));
      // Displays True.
      
      string nonBacktrackingPattern = "(?>a+)ab";
      Console.WriteLine(Regex.IsMatch(input, nonBacktrackingPattern));
      // Displays False.          
   }
}

Nested Optional Quantifiers and Backtracking

When developing regular expressions, developers typically focus on creating patterns that will match particular input. They rarely focus on creating patterns that will efficiently handle input that does not match the pattern. Frequently, this is justified. For example, if a regular expression is designed to extract substrings from text that follows a known format, there is no need to focus on non-matches. However, the regular expression engine can exhibit some of its poorest performance when handling text that does not conform to a specific regular expression pattern. In particular, this occurs when a regular expression that uses nested optional quantifiers (and therefore relies on extensive backtracking) is used to process text that does not conform to the pattern.

For example, in an order processing system, let’s say a regular expression is used to ensure that an item identifier conforms to a standard pattern. All item identifiers must contain at least one alphanumeric character; if they have multiple characters, the second to last character must also be alphanumeric; and the last character must be a dollar sign (“$”). Any intermediate characters must be word characters (that is, alphanumeric characters or underscores), periods, or hyphens. The regular expression pattern that was developed to reflect the format of the item identifier is ^[0-9A-Z]([-.\w]*[0-9A-Z])*\$$. A new employee who is being trained to do data entry enters one item identifier correctly, but then enters # instead of $ as the ending symbol for the next item. The following code illustrates this regular expression pattern and the data entry input. As the output shows, the regular expression performs efficiently when it is able to match the input identifier, but performance degrades drastically if it encounters an identifier that does not match the pattern.

 [Visual Basic]
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim pattern As String = "^[0-9A-Z]([-.\w]*[0-9A-Z])*\$$"
      Dim rgx As New Regex(pattern, RegexOptions.IgnoreCase)
      Dim sw As Stopwatch

      Dim items() As String = { "A163.1523C$", 
                                "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa#" }
      Dim toValidate, result As String
      
      For Each item In items
         For ctr As Integer = 1 To item.Length
            toValidate = item.Substring(0, ctr)     
            sw = Stopwatch.StartNew() 
   
            If rgx.IsMatch(toValidate) Then
               result = "Match"
            Else
               result = "No match"
            End If
            sw.Stop()
            Console.WriteLine("{0} with {1} characters in {2}", result, ctr, 
                              sw.Elapsed)   
         Next
         Console.WriteLine()               
      Next
   End Sub
End Module 

 

 [C#]
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string pattern = @"^[0-9A-Z]([-.\w]*[0-9A-Z])*\$$";
      Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
      Stopwatch sw;

      string[] items= { "A163.1523C$", 
                        "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa#" };
      string toValidate, result;
      
      foreach (var item in items) {
         for (int ctr = 1; ctr <= item.Length; ctr++) {
            toValidate = item.Substring(0, ctr);     
            sw = Stopwatch.StartNew(); 
   
            if (rgx.IsMatch(toValidate))
               result = "Match";
            else
               result = "No match";

            sw.Stop();
            Console.WriteLine("{0} with {1} characters in {2}", result, ctr, 
                              sw.Elapsed);   
         }
         Console.WriteLine();               
      }
   }
}

The example displays the following output:

 No match with 1 characters in 00:00:00.0000252
No match with 2 characters in 00:00:00.0000132
No match with 3 characters in 00:00:00.0000048
No match with 4 characters in 00:00:00.0000061
No match with 5 characters in 00:00:00.0000071
No match with 6 characters in 00:00:00.0000259
No match with 7 characters in 00:00:00.0000175
No match with 8 characters in 00:00:00.0000307
No match with 9 characters in 00:00:00.0000573
No match with 10 characters in 00:00:00.0001128
Match with 11 characters in 00:00:00.0000035

No match with 1 characters in 00:00:00.0000016
No match with 2 characters in 00:00:00.0000022
No match with 3 characters in 00:00:00.0000029
No match with 4 characters in 00:00:00.0000045
No match with 5 characters in 00:00:00.0000081
No match with 6 characters in 00:00:00.0000145
No match with 7 characters in 00:00:00.0000265
No match with 8 characters in 00:00:00.0000521
No match with 9 characters in 00:00:00.0001017
No match with 10 characters in 00:00:00.0002032
No match with 11 characters in 00:00:00.0004016
No match with 12 characters in 00:00:00.0007994
No match with 13 characters in 00:00:00.0015952
No match with 14 characters in 00:00:00.0032086
No match with 15 characters in 00:00:00.0065094
No match with 16 characters in 00:00:00.0129808
No match with 17 characters in 00:00:00.0270169
No match with 18 characters in 00:00:00.0553021
No match with 19 characters in 00:00:00.1070363
No match with 20 characters in 00:00:00.2208385
No match with 21 characters in 00:00:00.4289808
No match with 22 characters in 00:00:00.8553915
No match with 23 characters in 00:00:01.6993887
No match with 24 characters in 00:00:03.3938261
No match with 25 characters in 00:00:06.8359716
No match with 26 characters in 00:00:13.5125933
No match with 27 characters in 00:00:27.2257724
No match with 28 characters in 00:00:53.8205646
No match with 29 characters in 00:01:49.2090587
No match with 30 characters in 00:03:35.8028235
No match with 31 characters in 00:07:11.6145706
No match with 32 characters in 00:10:46.0044969

This example evaluates the performance of the regular expression engine based on the number of characters passed to it in the input string. As the following chart shows, for an input string that does not match the regular expression pattern, execution time approximately doubles each time another character is added to the string. It takes slightly over 10 minutes for the regular expression engine to recognize that an input string consisting of 32 characters does not match the regular expression pattern. Expanding the input to 35 characters would take approximately 1 hour and 20 minutes. Processing a string with 38 characters would take over 10 hours. Processing a string with 40 characters would take nearly two days. Visually scanning the string would be significantly faster.

clip_image002[10]

This atrociously poor performance is tied to the behavior of the NFA regular expression engine in the .NET Framework. When the regular expression engine cannot find a match and the pattern requires backtracking, the engine must exhaust all possible paths through the data before it can conclude that a match has failed; so an unsuccessful match always represents a worst-case scenario. (For more information on the NFA engine and backtracking, see Details of Regular Expression Behavior and Backtracking in the MSDN Library.) Because of the nested quantifiers in our example, the number of possible matches that the regular expression engine has to examine before concluding that a match has failed increases exponentially with each character after the first character in the input string.

 

Not all nested quantifiers impact performance in this way. In our example, the real problem in the subexpression ([-.\w]*[0-9A-Z])* is that the last character to be matched (any numeric character from 0 to 9, or any alphanumeric character from A to Z) is a subset of the previous subexpression (any word character – that is, any numeric or alphanumeric character or an underscore -- along with a hyphen and a period). A partially successful match makes it difficult for the regular expression engine to determine whether a particular character belongs to the first subexpression, the ending character of the expression, or to a repetition of the first subexpression or the ending character. The inability to successfully match the entire regular expression results in excessive backtracking, so that the number of possible paths through the input string increases exponentially with each character that can be matched. (For a description of the additional comparisons performed by the regular expression engine when backtracking, see the “Backtracking with Nested Optional Quantifiers” section in the Backtracking topic in the MSDN Library.)

In cases where excessive backtracking causes performance problems, other language constructs can often be used in place of the nested quantifiers. In many cases, nested quantifiers can be replaced by a single quantifier and zero-width lookbehind or lookahead assertion. Zero-width assertions do not advance the position counter in the input string. They either look ahead of or immediately behind the current position to determine whether the pattern defined by the assertion is true. If it is not, the match fails. (For information on positive lookahead and positive lookbehind, see Grouping Constructs in the MSDN documentation.) For example, we can replace our previous regular expression to validate an item identifier with ^[0-9A-Z][-.\w]*(?<=[0-9A-Z])\$, which uses positive lookbehind when the $ symbol is matched to ensure that the character before $ is either numeric or alphanumeric. This regular expression is interpreted as follows:

Pattern

Description

^

Begin the match at the beginning of the string.

[0-9A-Z]

Match any numeric (0-9) or alphanumeric (A-Z) character.The comparison is not case-sensitive.

[-.\w]*

Match zero or more word, hyphen, or period characters.

\$

Match the $ character.

(?<=[0-9A-Z])

Determine whether the character before the current character is numeric (0-9) or alphanumeric (A-Z).The comparison is not case sensitive. This is a zero-width assertion; it does not cause the position counter in the input string to advance.

This produces the following output:

 No match with 1 characters in 00:00:00.0498953
No match with 2 characters in 00:00:00.0002162
No match with 3 characters in 00:00:00.0000237
No match with 4 characters in 00:00:00.0000333
No match with 5 characters in 00:00:00.0000596
Match with 6 characters in 00:00:00.0000500
Match with 7 characters in 00:00:00.0000256
Match with 8 characters in 00:00:00.0000237
Match with 9 characters in 00:00:00.0000198
Match with 10 characters in 00:00:00.0000198
Match with 11 characters in 00:00:00.0000186
Match with 12 characters in 00:00:00.0000192
Match with 13 characters in 00:00:00.0000494
Match with 14 characters in 00:00:00.0000449

No match with 1 characters in 00:00:00.0000288
No match with 2 characters in 00:00:00.0000205
No match with 3 characters in 00:00:00.0000205
No match with 4 characters in 00:00:00.0000301
No match with 5 characters in 00:00:00.0000481
No match with 6 characters in 00:00:00.0000269
No match with 7 characters in 00:00:00.0000218
No match with 8 characters in 00:00:00.0000346
No match with 9 characters in 00:00:00.0000468
No match with 10 characters in 00:00:00.0000442
No match with 11 characters in 00:00:00.0000346
No match with 12 characters in 00:00:00.0000891
No match with 13 characters in 00:00:00.0000192
No match with 14 characters in 00:00:00.0000372
No match with 15 characters in 00:00:00.0000218
No match with 16 characters in 00:00:00.0000192
No match with 17 characters in 00:00:00.0000243
No match with 18 characters in 00:00:00.0000224
No match with 19 characters in 00:00:00.0000205
No match with 20 characters in 00:00:00.0000224
No match with 21 characters in 00:00:00.0000218
No match with 22 characters in 00:00:00.0000224
No match with 23 characters in 00:00:00.0000263
No match with 24 characters in 00:00:00.0000218
No match with 25 characters in 00:00:00.0000243
No match with 26 characters in 00:00:00.0000218
No match with 27 characters in 00:00:00.0000224
No match with 28 characters in 00:00:00.0000224
No match with 29 characters in 00:00:00.0000230
No match with 30 characters in 00:00:00.0000237
No match with 31 characters in 00:00:00.0000243
No match with 32 characters in 00:00:00.0000243

In other cases, some combination of alternation (the construct “|”) along with lookahead and lookbehind can be used to replace a regular expression that is subject to excessive backtracking. For example, regular expressions that are designed to match numbers are frequently troublesome if the numeric value can be either integral or floating point. A common pattern used to match such numeric values is ^\d+(|.\d+)*$, as shown in the following example.

 [Visual Basic]
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()   
      Dim number As String = "111111111111111111111111111111!"
      Dim pattern As String = "^\d+(|.\d+)*$"
      
      Dim rgx As New Regex(pattern, RegexOptions.IgnoreCase)
      Dim sw As Stopwatch
      Dim str As String 
  
      For ctr As Integer = 1 To number.Length
         str = number.Substring(0, ctr)
         sw = Stopwatch.StartNew()
         If rgx.IsMatch(str) Then 
            Console.Write("Match with {0} characters ", str.Length)
         Else
            Console.Write("No match with {0} characters ", str.Length)
         End If  
         sw.Stop()
         Console.WriteLine("in {0}.", sw.Elapsed)
      Next
   End Sub
End Module
 [C#] 
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string number = "111111111111111111111111111111!";
      string pattern = @"^\d+(|.\d+)*$";
      
      Regex rgx = new Regex(pattern, 
                  RegexOptions.IgnoreCase);
      Stopwatch sw;
      string str; 
  
      for (int ctr = 1; ctr <= number.Length; ctr++) {
         str = number.Substring(0, ctr);
         sw = Stopwatch.StartNew();
         if (rgx.IsMatch(str)) 
            Console.Write("Match with {0} characters ", str.Length);
         else
            Console.Write("No match with {0} characters ", str.Length);

         sw.Stop();
         Console.WriteLine("in {0}.", sw.Elapsed);
      }
   }
}

However, as the output from the example shows, this pattern is subject to excessive backtracking.

 Match with 1 characters in 00:00:00.0011311.
Match with 2 characters in 00:00:00.0010926.
Match with 3 characters in 00:00:00.0011003.
Match with 4 characters in 00:00:00.0014570.
Match with 5 characters in 00:00:00.0002797.
Match with 6 characters in 00:00:00.0001565.
Match with 7 characters in 00:00:00.0012619.
Match with 8 characters in 00:00:00.0001456.
Match with 9 characters in 00:00:00.0011439.
Match with 10 characters in 00:00:00.0010682.
Match with 11 characters in 00:00:00.0011125.
Match with 12 characters in 00:00:00.0002053.
Match with 13 characters in 00:00:00.0001854.
Match with 14 characters in 00:00:00.0012485.
Match with 15 characters in 00:00:00.0010675.
Match with 16 characters in 00:00:00.0011137.
Match with 17 characters in 00:00:00.0012299.
Match with 18 characters in 00:00:00.0001937.
Match with 19 characters in 00:00:00.0012709.
Match with 20 characters in 00:00:00.0011323.
Match with 21 characters in 00:00:00.0010996.
Match with 22 characters in 00:00:00.0010188.
Match with 23 characters in 00:00:00.0011125.
Match with 24 characters in 00:00:00.0001770.
Match with 25 characters in 00:00:00.0001469.
Match with 26 characters in 00:00:00.0001687.
Match with 27 characters in 00:00:00.0004356.
Match with 28 characters in 00:00:00.0013370.
Match with 29 characters in 00:00:00.0010958.
Match with 30 characters in 00:00:00.0010983.
No match with 31 characters in 00:00:03.0872118.

The root of the problem is that the subexpression (|.\d+)* is intended to match either a group separator or a decimal separator, but it also matches any other character. The subexpression also includes a nested quantifier. In the event of a near-match with numeric digits, the regular expression engine will have difficulty apportioning numeric digits between the first occurrence of the \d* pattern or its subsequent possible repetitions.

A regular expression that eliminates this unnecessary backtracking and also addresses a number of limitations of the original regular expression is ^\d*(,\d{3})*(?((?=\.))\.\d+|(?<=\d))$. This regular expression uses the conditional construct with an expression (for more information, see Alternation Constructs in the MSDN Library. Note that this regular expression works with numeric values formatted using the conventions of the English (United States) culture. A regular expression that uses the group and decimal separators of the current culture could be built dynamically by retrieving these values from the current culture’s DateTimeFormatInfo object. The regular expression is interpreted as shown in the following table.

Pattern

Description

^

Begin the match at the beginning of the string.

\d*

Match zero or more decimal digits. This matches the integral portion of the numeric value.

(,\d{3})*

Match zero or more occurrences of a group separator followed by three decimal digits.

(?((?=\.))

Use positive lookahead to determine whether the next character is a decimal separator. (This is the condition clause of an alternation pattern.)

\.\d+

If the next character is a decimal separator, match the decimal separator followed by one or more decimal digits.

|(?<=\d))

If the next character is not a decimal separator, look behind to determine whether the previous character is a decimal digit. This ensures that the number contains at least one integral digit.

$

Match the end of the input string.

If this regular expression pattern replaces the pattern in the previous example, the example produces the following output:

 Match with 1 characters in 00:00:00.0012331.
Match with 2 characters in 00:00:00.0010586.
Match with 3 characters in 00:00:00.0010509.
Match with 4 characters in 00:00:00.0011676.
Match with 5 characters in 00:00:00.0002303.
Match with 6 characters in 00:00:00.0001892.
Match with 7 characters in 00:00:00.0005203.
Match with 8 characters in 00:00:00.0014442.
Match with 9 characters in 00:00:00.0010740.
Match with 10 characters in 00:00:00.0010573.
Match with 11 characters in 00:00:00.0002046.
Match with 12 characters in 00:00:00.0002309.
Match with 13 characters in 00:00:00.0012318.
Match with 14 characters in 00:00:00.0001411.
Match with 15 characters in 00:00:00.0010900.
Match with 16 characters in 00:00:00.0010785.
Match with 17 characters in 00:00:00.0002008.
Match with 18 characters in 00:00:00.0002149.
Match with 19 characters in 00:00:00.0012850.
Match with 20 characters in 00:00:00.0011137.
No match with 21 characters in 00:00:00.0010887.

Best Practices for Controlling Backtracking

To maximize the performance of regular expressions in the .NET Framework, we recommend that you do the following to limit or eliminate unnecessary backtracking:

  • In alternation constructs, order alternatives based on their approximate frequency of occurrence in the input.
  • Take the time to optimize alternation patterns.
  • Determine whether backtracking is required for the regular expression to successfully match a subexpression with a portion of the input string. If it is not necessary, disable backtracking with the (?>subexpression) language element.
  • If subexpressions are intended to validate input, replace nested quantifiers such as (\w*[A-Z])* with other language elements. The most useful language elements for this purpose are the positive lookahead assertion and the positive lookbehind assertion, the alternation construct, and the conditional construct.

Comments

  • Anonymous
    August 09, 2010
    Uuuff, now what an article. Thanks for this clarifying information - I've tried to understand MSDN documentation on regular expressions but I wasn't able to figure out what assertions, alternation, and conditional constructs are really good for. Now I think I finally understood it :-) Thanks!