Επεξεργασία

Κοινή χρήση μέσω


Best practices for regular expressions in .NET

The regular expression engine in .NET is a powerful, full-featured tool that processes text based on pattern matches rather than on comparing and matching literal text. In most cases, it performs pattern matching rapidly and efficiently. However, in some cases, the regular expression engine can appear to be slow. In extreme cases, it can even appear to stop responding as it processes a relatively small input over the course of hours or even days.

This article outlines some of the best practices that developers can adopt to ensure that their regular expressions achieve optimal performance.

Warning

When using System.Text.RegularExpressions to process untrusted input, pass a timeout. A malicious user can provide input to RegularExpressions, causing a Denial-of-Service attack. ASP.NET Core framework APIs that use RegularExpressions pass a timeout.

Consider the input source

In general, regular expressions can accept two types of input: constrained or unconstrained. Constrained input is a text that originates from a known or reliable source and follows a predefined format. Unconstrained input is a text that originates from an unreliable source, such as a web user, and might not follow a predefined or expected format.

Regular expression patterns are often written to match valid input. That is, developers examine the text that they want to match and then write a regular expression pattern that matches it. Developers then determine whether this pattern requires correction or further elaboration by testing it with multiple valid input items. When the pattern matches all presumed valid inputs, it's declared to be production-ready, and can be included in a released application. This approach makes a regular expression pattern suitable for matching constrained input. However, it doesn't make it suitable for matching unconstrained input.

To match unconstrained input, a regular expression must handle three kinds of text efficiently:

  • Text that matches the regular expression pattern.
  • Text that doesn't match the regular expression pattern.
  • Text that nearly matches the regular expression pattern.

The last text type is especially problematic for a regular expression that has been written to handle constrained input. If that regular expression also relies on extensive backtracking, the regular expression engine can spend an inordinate amount of time (in some cases, many hours or days) processing seemingly innocuous text.

Warning

The following example uses a regular expression that's prone to excessive backtracking and that's likely to reject valid email addresses. You shouldn't use it in an email validation routine. If you would like a regular expression that validates email addresses, see How to: Verify that Strings Are in Valid Email Format.

For example, consider a commonly used but problematic regular expression for validating the alias of an email address. The regular expression ^[0-9A-Z]([-.\w]*[0-9A-Z])*$ is written to process what is considered to be a valid email address. A valid email address consists of an alphanumeric character, followed by zero or more characters that can be alphanumeric, periods, or hyphens. The regular expression must end with an alphanumeric character. However, as the following example shows, although this regular expression handles valid input easily, its performance is inefficient when it's processing nearly valid input:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public class DesignExample
{
    public static void Main()
    {
        Stopwatch sw;
        string[] addresses = { "AAAAAAAAAAA@contoso.com",
                             "AAAAAAAAAAaaaaaaaaaa!@contoso.com" };
        // The following regular expression should not actually be used to
        // validate an email address.
        string pattern = @"^[0-9A-Z]([-.\w]*[0-9A-Z])*$";
        string input;

        foreach (var address in addresses)
        {
            string mailBox = address.Substring(0, address.IndexOf("@"));
            int index = 0;
            for (int ctr = mailBox.Length - 1; ctr >= 0; ctr--)
            {
                index++;

                input = mailBox.Substring(ctr, index);
                sw = Stopwatch.StartNew();
                Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
                sw.Stop();
                if (m.Success)
                    Console.WriteLine("{0,2}. Matched '{1,25}' in {2}",
                                      index, m.Value, sw.Elapsed);
                else
                    Console.WriteLine("{0,2}. Failed  '{1,25}' in {2}",
                                      index, input, sw.Elapsed);
            }
            Console.WriteLine();
        }
    }
}

// The example displays output similar to the following:
//     1. Matched '                        A' in 00:00:00.0007122
//     2. Matched '                       AA' in 00:00:00.0000282
//     3. Matched '                      AAA' in 00:00:00.0000042
//     4. Matched '                     AAAA' in 00:00:00.0000038
//     5. Matched '                    AAAAA' in 00:00:00.0000042
//     6. Matched '                   AAAAAA' in 00:00:00.0000042
//     7. Matched '                  AAAAAAA' in 00:00:00.0000042
//     8. Matched '                 AAAAAAAA' in 00:00:00.0000087
//     9. Matched '                AAAAAAAAA' in 00:00:00.0000045
//    10. Matched '               AAAAAAAAAA' in 00:00:00.0000045
//    11. Matched '              AAAAAAAAAAA' in 00:00:00.0000045
//
//     1. Failed  '                        !' in 00:00:00.0000447
//     2. Failed  '                       a!' in 00:00:00.0000071
//     3. Failed  '                      aa!' in 00:00:00.0000071
//     4. Failed  '                     aaa!' in 00:00:00.0000061
//     5. Failed  '                    aaaa!' in 00:00:00.0000081
//     6. Failed  '                   aaaaa!' in 00:00:00.0000126
//     7. Failed  '                  aaaaaa!' in 00:00:00.0000359
//     8. Failed  '                 aaaaaaa!' in 00:00:00.0000414
//     9. Failed  '                aaaaaaaa!' in 00:00:00.0000758
//    10. Failed  '               aaaaaaaaa!' in 00:00:00.0001462
//    11. Failed  '              aaaaaaaaaa!' in 00:00:00.0002885
//    12. Failed  '             Aaaaaaaaaaa!' in 00:00:00.0005780
//    13. Failed  '            AAaaaaaaaaaa!' in 00:00:00.0011628
//    14. Failed  '           AAAaaaaaaaaaa!' in 00:00:00.0022851
//    15. Failed  '          AAAAaaaaaaaaaa!' in 00:00:00.0045864
//    16. Failed  '         AAAAAaaaaaaaaaa!' in 00:00:00.0093168
//    17. Failed  '        AAAAAAaaaaaaaaaa!' in 00:00:00.0185993
//    18. Failed  '       AAAAAAAaaaaaaaaaa!' in 00:00:00.0366723
//    19. Failed  '      AAAAAAAAaaaaaaaaaa!' in 00:00:00.1370108
//    20. Failed  '     AAAAAAAAAaaaaaaaaaa!' in 00:00:00.1553966
//    21. Failed  '    AAAAAAAAAAaaaaaaaaaa!' in 00:00:00.3223372
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module Example
    Public Sub Main()
        Dim sw As Stopwatch
        Dim addresses() As String = {"AAAAAAAAAAA@contoso.com",
                                   "AAAAAAAAAAaaaaaaaaaa!@contoso.com"}
        ' The following regular expression should not actually be used to 
        ' validate an email address.
        Dim pattern As String = "^[0-9A-Z]([-.\w]*[0-9A-Z])*$"
        Dim input As String

        For Each address In addresses
            Dim mailBox As String = address.Substring(0, address.IndexOf("@"))
            Dim index As Integer = 0
            For ctr As Integer = mailBox.Length - 1 To 0 Step -1
                index += 1
                input = mailBox.Substring(ctr, index)
                sw = Stopwatch.StartNew()
                Dim m As Match = Regex.Match(input, pattern, RegexOptions.IgnoreCase)
                sw.Stop()
                if m.Success Then
                    Console.WriteLine("{0,2}. Matched '{1,25}' in {2}",
                                      index, m.Value, sw.Elapsed)
                Else
                    Console.WriteLine("{0,2}. Failed  '{1,25}' in {2}",
                                      index, input, sw.Elapsed)
                End If
            Next
            Console.WriteLine()
        Next
    End Sub
End Module
' The example displays output similar to the following:
'     1. Matched '                        A' in 00:00:00.0007122
'     2. Matched '                       AA' in 00:00:00.0000282
'     3. Matched '                      AAA' in 00:00:00.0000042
'     4. Matched '                     AAAA' in 00:00:00.0000038
'     5. Matched '                    AAAAA' in 00:00:00.0000042
'     6. Matched '                   AAAAAA' in 00:00:00.0000042
'     7. Matched '                  AAAAAAA' in 00:00:00.0000042
'     8. Matched '                 AAAAAAAA' in 00:00:00.0000087
'     9. Matched '                AAAAAAAAA' in 00:00:00.0000045
'    10. Matched '               AAAAAAAAAA' in 00:00:00.0000045
'    11. Matched '              AAAAAAAAAAA' in 00:00:00.0000045
'    
'     1. Failed  '                        !' in 00:00:00.0000447
'     2. Failed  '                       a!' in 00:00:00.0000071
'     3. Failed  '                      aa!' in 00:00:00.0000071
'     4. Failed  '                     aaa!' in 00:00:00.0000061
'     5. Failed  '                    aaaa!' in 00:00:00.0000081
'     6. Failed  '                   aaaaa!' in 00:00:00.0000126
'     7. Failed  '                  aaaaaa!' in 00:00:00.0000359
'     8. Failed  '                 aaaaaaa!' in 00:00:00.0000414
'     9. Failed  '                aaaaaaaa!' in 00:00:00.0000758
'    10. Failed  '               aaaaaaaaa!' in 00:00:00.0001462
'    11. Failed  '              aaaaaaaaaa!' in 00:00:00.0002885
'    12. Failed  '             Aaaaaaaaaaa!' in 00:00:00.0005780
'    13. Failed  '            AAaaaaaaaaaa!' in 00:00:00.0011628
'    14. Failed  '           AAAaaaaaaaaaa!' in 00:00:00.0022851
'    15. Failed  '          AAAAaaaaaaaaaa!' in 00:00:00.0045864
'    16. Failed  '         AAAAAaaaaaaaaaa!' in 00:00:00.0093168
'    17. Failed  '        AAAAAAaaaaaaaaaa!' in 00:00:00.0185993
'    18. Failed  '       AAAAAAAaaaaaaaaaa!' in 00:00:00.0366723
'    19. Failed  '      AAAAAAAAaaaaaaaaaa!' in 00:00:00.1370108
'    20. Failed  '     AAAAAAAAAaaaaaaaaaa!' in 00:00:00.1553966
'    21. Failed  '    AAAAAAAAAAaaaaaaaaaa!' in 00:00:00.3223372

As the output from the preceding example shows, the regular expression engine processes the valid email alias in about the same time interval regardless of its length. On the other hand, when the nearly valid email address has more than five characters, processing time approximately doubles for each extra character in the string. Therefore, a nearly valid 28-character string would take over an hour to process, and a nearly valid 33-character string would take nearly a day to process.

Because this regular expression was developed solely by considering the format of input to be matched, it fails to take account of input that doesn't match the pattern. This oversight, in turn, can allow unconstrained input that nearly matches the regular expression pattern to significantly degrade performance.

To solve this problem, you can do the following:

  • When developing a pattern, you should consider how backtracking might affect the performance of the regular expression engine, particularly if your regular expression is designed to process unconstrained input. For more information, see the Take Charge of Backtracking section.

  • Thoroughly test your regular expression using invalid, near-valid, and valid input. You can use Rex to randomly generate input for a particular regular expression. Rex is a regular expression exploration tool from Microsoft Research.

Handle object instantiation appropriately

At the heart of .NET's regular expression object model is the System.Text.RegularExpressions.Regex class, which represents the regular expression engine. Often, the single greatest factor that affects regular expression performance is the way in which the Regex engine is used. Defining a regular expression involves tightly coupling the regular expression engine with a regular expression pattern. That coupling process is expensive, whether it involves instantiating a Regex object by passing its constructor a regular expression pattern or calling a static method by passing it the regular expression pattern and the string to be analyzed.

Note

For a detailed discussion of the performance implications of using interpreted and compiled regular expressions, see the blog post Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking.

You can couple the regular expression engine with a particular regular expression pattern and then use the engine to match the text in several ways:

  • You can call a static pattern-matching method, such as Regex.Match(String, String). This method doesn't require instantiation of a regular expression object.

  • You can instantiate a Regex object and call an instance pattern-matching method of an interpreted regular expression, which is the default method for binding the regular expression engine to a regular expression pattern. It results when a Regex object is instantiated without an options argument that includes the Compiled flag.

  • You can instantiate a Regex object and call an instance pattern-matching method of a source-generated regular expression. This technique is recommended in most cases. To do so, place the GeneratedRegexAttribute attribute on a partial method that returns Regex.

  • You can instantiate a Regex object and call an instance pattern-matching method of a compiled regular expression. Regular expression objects represent compiled patterns when a Regex object is instantiated with an options argument that includes the Compiled flag.

The particular way in which you call regular expression matching methods can affect your application's performance. The following sections discuss when to use static method calls, source-generated regular expressions, interpreted regular expressions, and compiled regular expressions to improve your application's performance.

Important

The form of the method call (static, interpreted, source-generated, compiled) affects performance if the same regular expression is used repeatedly in method calls, or if an application makes extensive use of regular expression objects.

Static regular expressions

Static regular expression methods are recommended as an alternative to repeatedly instantiating a regular expression object with the same regular expression. Unlike regular expression patterns used by regular expression objects, either the operation codes (opcodes) or the compiled common intermediate language (CIL) from patterns used in static method calls is cached internally by the regular expression engine.

For example, an event handler frequently calls another method to validate user input. This example is reflected in the following code, in which a Button control's Click event is used to call a method named IsValidCurrency, which checks whether the user has entered a currency symbol followed by at least one decimal digit.

public void OKButton_Click(object sender, EventArgs e)
{
   if (! String.IsNullOrEmpty(sourceCurrency.Text))
      if (RegexLib.IsValidCurrency(sourceCurrency.Text))
         PerformConversion();
      else
         status.Text = "The source currency value is invalid.";
}
Public Sub OKButton_Click(sender As Object, e As EventArgs) _
           Handles OKButton.Click

    If Not String.IsNullOrEmpty(sourceCurrency.Text) Then
        If RegexLib.IsValidCurrency(sourceCurrency.Text) Then
            PerformConversion()
        Else
            status.Text = "The source currency value is invalid."
        End If
    End If
End Sub

An inefficient implementation of the IsValidCurrency method is shown in the following example:

Note

Each method call reinstantiates a Regex object with the same pattern. This, in turn, means that the regular expression pattern must be recompiled each time the method is called.

using System;
using System.Text.RegularExpressions;

public class RegexLib
{
   public static bool IsValidCurrency(string currencyValue)
   {
      string pattern = @"\p{Sc}+\s*\d+";
      Regex currencyRegex = new Regex(pattern);
      return currencyRegex.IsMatch(currencyValue);
   }
}
Imports System.Text.RegularExpressions

Public Module RegexLib
    Public Function IsValidCurrency(currencyValue As String) As Boolean
        Dim pattern As String = "\p{Sc}+\s*\d+"
        Dim currencyRegex As New Regex(pattern)
        Return currencyRegex.IsMatch(currencyValue)
    End Function
End Module

You should replace the preceding inefficient code with a call to the static Regex.IsMatch(String, String) method. This approach eliminates the need to instantiate a Regex object each time you want to call a pattern-matching method, and enables the regular expression engine to retrieve a compiled version of the regular expression from its cache.

using System;
using System.Text.RegularExpressions;

public class RegexLib2
{
   public static bool IsValidCurrency(string currencyValue)
   {
      string pattern = @"\p{Sc}+\s*\d+";
      return Regex.IsMatch(currencyValue, pattern);
   }
}
Imports System.Text.RegularExpressions

Public Module RegexLib
    Public Function IsValidCurrency(currencyValue As String) As Boolean
        Dim pattern As String = "\p{Sc}+\s*\d+"
        Return Regex.IsMatch(currencyValue, pattern)
    End Function
End Module

By default, the last 15 most recently used static regular expression patterns are cached. For applications that require a larger number of cached static regular expressions, the size of the cache can be adjusted by setting the Regex.CacheSize property.

The regular expression \p{Sc}+\s*\d+ that's used in this example verifies that the input string has a currency symbol and at least one decimal digit. The pattern is defined as shown in the following table:

Pattern Description
\p{Sc}+ Matches one or more characters in the Unicode Symbol, Currency category.
\s* Matches zero or more white-space characters.
\d+ Matches one or more decimal digits.

Interpreted vs. source-generated vs. compiled regular expressions

Regular expression patterns that aren't bound to the regular expression engine through the specification of the Compiled option are interpreted. When a regular expression object is instantiated, the regular expression engine converts the regular expression to a set of operation codes. When an instance method is called, the operation codes are converted to CIL and executed by the JIT compiler. Similarly, when a static regular expression method is called and the regular expression can't be found in the cache, the regular expression engine converts the regular expression to a set of operation codes and stores them in the cache. It then converts these operation codes to CIL so that the JIT compiler can execute them. Interpreted regular expressions reduce startup time at the cost of slower execution time. Because of this process, they're best used when the regular expression is used in a small number of method calls, or if the exact number of calls to regular expression methods is unknown but is expected to be small. As the number of method calls increases, the performance gain from reduced startup time is outstripped by the slower execution speed.

Regular expression patterns that are bound to the regular expression engine through the specification of the Compiled option are compiled. Therefore, when a regular expression object is instantiated, or when a static regular expression method is called and the regular expression can't be found in the cache, the regular expression engine converts the regular expression to an intermediary set of operation codes. These codes are then converted to CIL. When a method is called, the JIT compiler executes the CIL. In contrast to interpreted regular expressions, compiled regular expressions increase startup time but execute individual pattern-matching methods faster. As a result, the performance benefit that results from compiling the regular expression increases in proportion to the number of regular expression methods called.

Regular expression patterns that are bound to the regular expression engine through the adornment of a Regex-returning method with the GeneratedRegexAttribute attribute are source generated. The source generator, which plugs into the compiler, emits as C# code a custom Regex-derived implementation with logic similar to what RegexOptions.Compiled emits in CIL. You get all the throughput performance benefits of RegexOptions.Compiled (more, in fact) and the start-up benefits of Regex.CompileToAssembly, but without the complexity of CompileToAssembly. The source that's emitted is part of your project, which means it's also easily viewable and debuggable.

To summarize, we recommend that you:

  • Use interpreted regular expressions when you call regular expression methods with a specific regular expression relatively infrequently.
  • Use source-generated regular expressions if you're using Regex in C# with arguments known at compile time, and you're using a specific regular expression relatively frequently.
  • Use compiled regular expressions when you call regular expression methods with a specific regular expression relatively frequently and you're using .NET 6 or an earlier version.

It's difficult to determine the exact threshold at which the slower execution speeds of interpreted regular expressions outweigh gains from their reduced startup time. It's also difficult to determine the threshold at which the slower startup times of source-generated or compiled regular expressions outweigh gains from their faster execution speeds. The thresholds depend on various factors, including the complexity of the regular expression and the specific data that it processes. To determine which regular expressions offer the best performance for your particular application scenario, you can use the Stopwatch class to compare their execution times.

The following example compares the performance of compiled, source-generated, and interpreted regular expressions when reading the first 10 sentences and when reading all the sentences in the text of William D. Guthrie's Magna Carta, and Other Addresses. As the output from the example shows, when only 10 calls are made to regular expression matching methods, an interpreted or source-generated regular expression offers better performance than a compiled regular expression. However, a compiled regular expression offers better performance when a large number of calls (in this case, over 13,000) are made.

const string Pattern = @"\b(\w+((\r?\n)|,?\s))*\w+[.?:;!]";

static readonly HttpClient s_client = new();

[GeneratedRegex(Pattern, RegexOptions.Singleline)]
private static partial Regex GeneratedRegex();

public async static Task RunIt()
{
    Stopwatch sw;
    Match match;
    int ctr;

    string text =
            await s_client.GetStringAsync("https://www.gutenberg.org/cache/epub/64197/pg64197.txt");

    // Read first ten sentences with interpreted regex.
    Console.WriteLine("10 Sentences with Interpreted Regex:");
    sw = Stopwatch.StartNew();
    Regex int10 = new(Pattern, RegexOptions.Singleline);
    match = int10.Match(text);
    for (ctr = 0; ctr <= 9; ctr++)
    {
        if (match.Success)
            // Do nothing with the match except get the next match.
            match = match.NextMatch();
        else
            break;
    }
    sw.Stop();
    Console.WriteLine("   {0} matches in {1}", ctr, sw.Elapsed);

    // Read first ten sentences with compiled regex.
    Console.WriteLine("10 Sentences with Compiled Regex:");
    sw = Stopwatch.StartNew();
    Regex comp10 = new Regex(Pattern,
                 RegexOptions.Singleline | RegexOptions.Compiled);
    match = comp10.Match(text);
    for (ctr = 0; ctr <= 9; ctr++)
    {
        if (match.Success)
            // Do nothing with the match except get the next match.
            match = match.NextMatch();
        else
            break;
    }
    sw.Stop();
    Console.WriteLine("   {0} matches in {1}", ctr, sw.Elapsed);

    // Read first ten sentences with source-generated regex.
    Console.WriteLine("10 Sentences with Source-generated Regex:");
    sw = Stopwatch.StartNew();

    match = GeneratedRegex().Match(text);
    for (ctr = 0; ctr <= 9; ctr++)
    {
        if (match.Success)
            // Do nothing with the match except get the next match.
            match = match.NextMatch();
        else
            break;
    }
    sw.Stop();
    Console.WriteLine("   {0} matches in {1}", ctr, sw.Elapsed);

    // Read all sentences with interpreted regex.
    Console.WriteLine("All Sentences with Interpreted Regex:");
    sw = Stopwatch.StartNew();
    Regex intAll = new(Pattern, RegexOptions.Singleline);
    match = intAll.Match(text);
    int matches = 0;
    while (match.Success)
    {
        matches++;
        // Do nothing with the match except get the next match.
        match = match.NextMatch();
    }
    sw.Stop();
    Console.WriteLine("   {0:N0} matches in {1}", matches, sw.Elapsed);

    // Read all sentences with compiled regex.
    Console.WriteLine("All Sentences with Compiled Regex:");
    sw = Stopwatch.StartNew();
    Regex compAll = new(Pattern,
                    RegexOptions.Singleline | RegexOptions.Compiled);
    match = compAll.Match(text);
    matches = 0;
    while (match.Success)
    {
        matches++;
        // Do nothing with the match except get the next match.
        match = match.NextMatch();
    }
    sw.Stop();
    Console.WriteLine("   {0:N0} matches in {1}", matches, sw.Elapsed);

    // Read all sentences with source-generated regex.
    Console.WriteLine("All Sentences with Source-generated Regex:");
    sw = Stopwatch.StartNew();
    match = GeneratedRegex().Match(text);
    matches = 0;
    while (match.Success)
    {
        matches++;
        // Do nothing with the match except get the next match.
        match = match.NextMatch();
    }
    sw.Stop();
    Console.WriteLine("   {0:N0} matches in {1}", matches, sw.Elapsed);

    return;
}
/* The example displays output similar to the following:

   10 Sentences with Interpreted Regex:
       10 matches in 00:00:00.0104920
   10 Sentences with Compiled Regex:
       10 matches in 00:00:00.0234604
   10 Sentences with Source-generated Regex:
       10 matches in 00:00:00.0060982
   All Sentences with Interpreted Regex:
       3,427 matches in 00:00:00.1745455
   All Sentences with Compiled Regex:
       3,427 matches in 00:00:00.0575488
   All Sentences with Source-generated Regex:
       3,427 matches in 00:00:00.2698670
*/

The regular expression pattern used in the example, \b(\w+((\r?\n)|,?\s))*\w+[.?:;!], is defined as shown in the following table:

Pattern Description
\b Begin the match at a word boundary.
\w+ Matches one or more word characters.
(\r?\n)|,?\s) Matches either zero or one carriage return followed by a newline character, or zero or one comma followed by a white-space character.
(\w+((\r?\n)|,?\s))* Matches zero or more occurrences of one or more word characters that are followed either by zero or one carriage return and a newline character, or by zero or one comma followed by a white-space character.
\w+ Matches one or more word characters.
[.?:;!] Matches a period, question mark, colon, semicolon, or exclamation point.

Take charge of backtracking

Ordinarily, the regular expression engine uses linear progression to move through an input string and compare it to a regular expression pattern. However, when indeterminate quantifiers such as *, +, and ? are used in a regular expression pattern, the regular expression engine might give up a portion of successful partial matches and return to a previously saved state in order to search for a successful match for the entire pattern. This process is known as backtracking.

Tip

For more information on backtracking, see Details of regular expression behavior and Backtracking. For detailed discussions of backtracking, see the Regular Expression Improvements in .NET 7 and Optimizing Regular Expression Performance blog posts.

Support for backtracking gives regular expressions power and flexibility. It also places the responsibility for controlling the operation of the regular expression engine in the hands of regular expression developers. Because developers are often not aware of this responsibility, their misuse of backtracking or reliance on excessive backtracking often plays the most significant role in degrading regular expression performance. In a worst-case scenario, execution time can double for each additional character in the input string. In fact, by using backtracking excessively, it's easy to create the programmatic equivalent of an endless loop if input nearly matches the regular expression pattern. The regular expression engine might take hours or even days to process a relatively short input string.

Often, applications pay a performance penalty for using backtracking even though backtracking isn't essential for a match. For example, the regular expression \b\p{Lu}\w*\b matches all words that begin with an uppercase character, as the following table shows:

Pattern Description
\b Begin the match at a word boundary.
\p{Lu} Matches an uppercase character.
\w* Matches zero or more word characters.
\b End the match at a word boundary.

Because a word boundary isn't the same as, or a subset of, a word character, there's no possibility that the regular expression engine will cross a word boundary when matching word characters. Therefore for this regular expression, backtracking can never contribute to the overall success of any match. It can only degrade performance because the regular expression engine is forced to save its state for each successful preliminary match of a word character.

If you determine that backtracking isn't necessary, you can disable it in a couple of ways:

  • By setting the RegexOptions.NonBacktracking option (introduced in .NET 7). For more information, see Nonbacktracking mode.

  • By using the (?>subexpression) language element, known as an atomic group. The following example parses an input string by using two regular expressions. The first, \b\p{Lu}\w*\b, relies on backtracking. The second, \b\p{Lu}(?>\w*)\b, disables backtracking. As the output from the example shows, they both produce the same result:

    using System;
    using System.Text.RegularExpressions;
    
    public class BackTrack2Example
    {
        public static void Main()
        {
            string input = "This this word Sentence name Capital";
            string pattern = @"\b\p{Lu}\w*\b";
            foreach (Match match in Regex.Matches(input, pattern))
                Console.WriteLine(match.Value);
    
            Console.WriteLine();
    
            pattern = @"\b\p{Lu}(?>\w*)\b";
            foreach (Match match in Regex.Matches(input, pattern))
                Console.WriteLine(match.Value);
        }
    }
    // The example displays the following output:
    //       This
    //       Sentence
    //       Capital
    //
    //       This
    //       Sentence
    //       Capital
    
    Imports System.Text.RegularExpressions
    
    Module Example
        Public Sub Main()
            Dim input As String = "This this word Sentence name Capital"
            Dim pattern As String = "\b\p{Lu}\w*\b"
            For Each match As Match In Regex.Matches(input, pattern)
                Console.WriteLine(match.Value)
            Next
            Console.WriteLine()
    
            pattern = "\b\p{Lu}(?>\w*)\b"
            For Each match As Match In Regex.Matches(input, pattern)
                Console.WriteLine(match.Value)
            Next
        End Sub
    End Module
    ' The example displays the following output:
    '       This
    '       Sentence
    '       Capital
    '       
    '       This
    '       Sentence
    '       Capital
    

In many cases, backtracking is essential for matching a regular expression pattern to input text. However, excessive backtracking can severely degrade performance and create the impression that an application has stopped responding. In particular, this problem arises when quantifiers are nested and the text that matches the outer subexpression is a subset of the text that matches the inner subexpression.

Warning

In addition to avoiding excessive backtracking, you should use the timeout feature to ensure that excessive backtracking doesn't severely degrade regular expression performance. For more information, see the Use time-out values section.

For example, the regular expression pattern ^[0-9A-Z]([-.\w]*[0-9A-Z])*\$$ is intended to match a part number that consists of at least one alphanumeric character. Any additional characters can consist of an alphanumeric character, a hyphen, an underscore, or a period, though the last character must be alphanumeric. A dollar sign terminates the part number. In some cases, this regular expression pattern can exhibit poor performance because quantifiers are nested, and because the subexpression [0-9A-Z] is a subset of the subexpression [-.\w]*.

In these cases, you can optimize regular expression performance by removing the nested quantifiers and replacing the outer subexpression with a zero-width lookahead or lookbehind assertion. Lookahead and lookbehind assertions are anchors. They don't move the pointer in the input string but instead look ahead or behind to check whether a specified condition is met. For example, the part number regular expression can be rewritten as ^[0-9A-Z][-.\w]*(?<=[0-9A-Z])\$$. This regular expression pattern is defined as shown in the following table:

Pattern Description
^ Begin the match at the beginning of the input string.
[0-9A-Z] Match an alphanumeric character. The part number must consist of at least this character.
[-.\w]* Match zero or more occurrences of any word character, hyphen, or period.
\$ Match a dollar sign.
(?<=[0-9A-Z]) Look behind the ending dollar sign to ensure that the previous character is alphanumeric.
$ End the match at the end of the input string.

The following example illustrates the use of this regular expression to match an array containing possible part numbers:

using System;
using System.Text.RegularExpressions;

public class BackTrack4Example
{
    public static void Main()
    {
        string pattern = @"^[0-9A-Z][-.\w]*(?<=[0-9A-Z])\$$";
        string[] partNos = { "A1C$", "A4", "A4$", "A1603D$", "A1603D#" };

        foreach (var input in partNos)
        {
            Match match = Regex.Match(input, pattern);
            if (match.Success)
                Console.WriteLine(match.Value);
            else
                Console.WriteLine("Match not found.");
        }
    }
}
// The example displays the following output:
//       A1C$
//       Match not found.
//       A4$
//       A1603D$
//       Match not found.
Imports System.Text.RegularExpressions

Module Example
    Public Sub Main()
        Dim pattern As String = "^[0-9A-Z][-.\w]*(?<=[0-9A-Z])\$$"
        Dim partNos() As String = {"A1C$", "A4", "A4$", "A1603D$",
                                    "A1603D#"}

        For Each input As String In partNos
            Dim match As Match = Regex.Match(input, pattern)
            If match.Success Then
                Console.WriteLine(match.Value)
            Else
                Console.WriteLine("Match not found.")
            End If
        Next
    End Sub
End Module
' The example displays the following output:
'       A1C$
'       Match not found.
'       A4$
'       A1603D$
'       Match not found.

The regular expression language in .NET includes the following language elements that you can use to eliminate nested quantifiers. For more information, see Grouping constructs.

Language element Description
(?= subexpression ) Zero-width positive lookahead. Looks ahead of the current position to determine whether subexpression matches the input string.
(?! subexpression ) Zero-width negative lookahead. Looks ahead of the current position to determine whether subexpression doesn't match the input string.
(?<= subexpression ) Zero-width positive lookbehind. Looks behind the current position to determine whether subexpression matches the input string.
(?<! subexpression ) Zero-width negative lookbehind. Looks behind the current position to determine whether subexpression doesn't match the input string.

Use time-out values

If your regular expressions processes input that nearly matches the regular expression pattern, it can often rely on excessive backtracking, which impacts its performance significantly. In addition to carefully considering your use of backtracking and testing the regular expression against near-matching input, you should always set a time-out value to minimize the effect of excessive backtracking, if it occurs.

The regular expression time-out interval defines the period of time that the regular expression engine will look for a single match before it times out. Depending on the regular expression pattern and the input text, the execution time might exceed the specified time-out interval, but it won't spend more time backtracking than the specified time-out interval. The default time-out interval is Regex.InfiniteMatchTimeout, which means that the regular expression won't time out. You can override this value and define a time-out interval as follows:

If you've defined a time-out interval and a match isn't found at the end of that interval, the regular expression method throws a RegexMatchTimeoutException exception. In your exception handler, you can choose to retry the match with a longer time-out interval, abandon the match attempt and assume that there's no match, or abandon the match attempt and log the exception information for future analysis.

The following example defines a GetWordData method that instantiates a regular expression with a time-out interval of 350 milliseconds to calculate the number of words and average number of characters in a word in a text document. If the matching operation times out, the time-out interval is increased by 350 milliseconds and the Regex object is reinstantiated. If the new time-out interval exceeds one second, the method rethrows the exception to the caller.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public class TimeoutExample
{
    public static void Main()
    {
        RegexUtilities util = new RegexUtilities();
        string title = "Doyle - The Hound of the Baskervilles.txt";
        try
        {
            var info = util.GetWordData(title);
            Console.WriteLine("Words:               {0:N0}", info.Item1);
            Console.WriteLine("Average Word Length: {0:N2} characters", info.Item2);
        }
        catch (IOException e)
        {
            Console.WriteLine("IOException reading file '{0}'", title);
            Console.WriteLine(e.Message);
        }
        catch (RegexMatchTimeoutException e)
        {
            Console.WriteLine("The operation timed out after {0:N0} milliseconds",
                              e.MatchTimeout.TotalMilliseconds);
        }
    }
}

public class RegexUtilities
{
    public Tuple<int, double> GetWordData(string filename)
    {
        const int MAX_TIMEOUT = 1000;   // Maximum timeout interval in milliseconds.
        const int INCREMENT = 350;      // Milliseconds increment of timeout.

        List<string> exclusions = new List<string>(new string[] { "a", "an", "the" });
        int[] wordLengths = new int[29];        // Allocate an array of more than ample size.
        string input = null;
        StreamReader sr = null;
        try
        {
            sr = new StreamReader(filename);
            input = sr.ReadToEnd();
        }
        catch (FileNotFoundException e)
        {
            string msg = String.Format("Unable to find the file '{0}'", filename);
            throw new IOException(msg, e);
        }
        catch (IOException e)
        {
            throw new IOException(e.Message, e);
        }
        finally
        {
            if (sr != null) sr.Close();
        }

        int timeoutInterval = INCREMENT;
        bool init = false;
        Regex rgx = null;
        Match m = null;
        int indexPos = 0;
        do
        {
            try
            {
                if (!init)
                {
                    rgx = new Regex(@"\b\w+\b", RegexOptions.None,
                                    TimeSpan.FromMilliseconds(timeoutInterval));
                    m = rgx.Match(input, indexPos);
                    init = true;
                }
                else
                {
                    m = m.NextMatch();
                }
                if (m.Success)
                {
                    if (!exclusions.Contains(m.Value.ToLower()))
                        wordLengths[m.Value.Length]++;

                    indexPos += m.Length + 1;
                }
            }
            catch (RegexMatchTimeoutException e)
            {
                if (e.MatchTimeout.TotalMilliseconds < MAX_TIMEOUT)
                {
                    timeoutInterval += INCREMENT;
                    init = false;
                }
                else
                {
                    // Rethrow the exception.
                    throw;
                }
            }
        } while (m.Success);

        // If regex completed successfully, calculate number of words and average length.
        int nWords = 0;
        long totalLength = 0;

        for (int ctr = wordLengths.GetLowerBound(0); ctr <= wordLengths.GetUpperBound(0); ctr++)
        {
            nWords += wordLengths[ctr];
            totalLength += ctr * wordLengths[ctr];
        }
        return new Tuple<int, double>(nWords, totalLength / nWords);
    }
}
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions

Module Example
    Public Sub Main()
        Dim util As New RegexUtilities()
        Dim title As String = "Doyle - The Hound of the Baskervilles.txt"
        Try
            Dim info = util.GetWordData(title)
            Console.WriteLine("Words:               {0:N0}", info.Item1)
            Console.WriteLine("Average Word Length: {0:N2} characters", info.Item2)
        Catch e As IOException
            Console.WriteLine("IOException reading file '{0}'", title)
            Console.WriteLine(e.Message)
        Catch e As RegexMatchTimeoutException
            Console.WriteLine("The operation timed out after {0:N0} milliseconds",
                              e.MatchTimeout.TotalMilliseconds)
        End Try
    End Sub
End Module

Public Class RegexUtilities
    Public Function GetWordData(filename As String) As Tuple(Of Integer, Double)
        Const MAX_TIMEOUT As Integer = 1000  ' Maximum timeout interval in milliseconds.
        Const INCREMENT As Integer = 350     ' Milliseconds increment of timeout.

        Dim exclusions As New List(Of String)({"a", "an", "the"})
        Dim wordLengths(30) As Integer        ' Allocate an array of more than ample size.
        Dim input As String = Nothing
        Dim sr As StreamReader = Nothing
        Try
            sr = New StreamReader(filename)
            input = sr.ReadToEnd()
        Catch e As FileNotFoundException
            Dim msg As String = String.Format("Unable to find the file '{0}'", filename)
            Throw New IOException(msg, e)
        Catch e As IOException
            Throw New IOException(e.Message, e)
        Finally
            If sr IsNot Nothing Then sr.Close()
        End Try

        Dim timeoutInterval As Integer = INCREMENT
        Dim init As Boolean = False
        Dim rgx As Regex = Nothing
        Dim m As Match = Nothing
        Dim indexPos As Integer = 0
        Do
            Try
                If Not init Then
                    rgx = New Regex("\b\w+\b", RegexOptions.None,
                                    TimeSpan.FromMilliseconds(timeoutInterval))
                    m = rgx.Match(input, indexPos)
                    init = True
                Else
                    m = m.NextMatch()
                End If
                If m.Success Then
                    If Not exclusions.Contains(m.Value.ToLower()) Then
                        wordLengths(m.Value.Length) += 1
                    End If
                    indexPos += m.Length + 1
                End If
            Catch e As RegexMatchTimeoutException
                If e.MatchTimeout.TotalMilliseconds < MAX_TIMEOUT Then
                    timeoutInterval += INCREMENT
                    init = False
                Else
                    ' Rethrow the exception.
                    Throw
                End If
            End Try
        Loop While m.Success

        ' If regex completed successfully, calculate number of words and average length.
        Dim nWords As Integer
        Dim totalLength As Long

        For ctr As Integer = wordLengths.GetLowerBound(0) To wordLengths.GetUpperBound(0)
            nWords += wordLengths(ctr)
            totalLength += ctr * wordLengths(ctr)
        Next
        Return New Tuple(Of Integer, Double)(nWords, totalLength / nWords)
    End Function
End Class

Capture only when necessary

Regular expressions in .NET support grouping constructs, which let you group a regular expression pattern into one or more subexpressions. The most commonly used grouping constructs in .NET regular expression language are (subexpression), which defines a numbered capturing group, and (?<name>subexpression), which defines a named capturing group. Grouping constructs are essential for creating backreferences and for defining a subexpression to which a quantifier is applied.

However, the use of these language elements has a cost. They cause the GroupCollection object returned by the Match.Groups property to be populated with the most recent unnamed or named captures. If a single grouping construct has captured multiple substrings in the input string, they also populate the CaptureCollection object returned by the Group.Captures property of a particular capturing group with multiple Capture objects.

Often, grouping constructs are used in a regular expression only so that quantifiers can be applied to them. The groups captured by these subexpressions aren't used later. For example, the regular expression \b(\w+[;,]?\s?)+[.?!] is designed to capture an entire sentence. The following table describes the language elements in this regular expression pattern and their effect on the Match object's Match.Groups and Group.Captures collections:

Pattern Description
\b Begin the match at a word boundary.
\w+ Matches one or more word characters.
[;,]? Matches zero or one comma or semicolon.
\s? Matches zero or one white-space character.
(\w+[;,]?\s?)+ Matches one or more occurrences of one or more word characters followed by an optional comma or semicolon followed by an optional white-space character. This pattern defines the first capturing group, which is necessary so that the combination of multiple word characters (that is, a word) followed by an optional punctuation symbol will be repeated until the regular expression engine reaches the end of a sentence.
[.?!] Matches a period, question mark, or exclamation point.

As the following example shows, when a match is found, both the GroupCollection and CaptureCollection objects are populated with captures from the match. In this case, the capturing group (\w+[;,]?\s?) exists so that the + quantifier can be applied to it, which enables the regular expression pattern to match each word in a sentence. Otherwise, it would match the last word in a sentence.

using System;
using System.Text.RegularExpressions;

public class Group1Example
{
    public static void Main()
    {
        string input = "This is one sentence. This is another.";
        string pattern = @"\b(\w+[;,]?\s?)+[.?!]";

        foreach (Match match in Regex.Matches(input, pattern))
        {
            Console.WriteLine("Match: '{0}' at index {1}.",
                              match.Value, match.Index);
            int grpCtr = 0;
            foreach (Group grp in match.Groups)
            {
                Console.WriteLine("   Group {0}: '{1}' at index {2}.",
                                  grpCtr, grp.Value, grp.Index);
                int capCtr = 0;
                foreach (Capture cap in grp.Captures)
                {
                    Console.WriteLine("      Capture {0}: '{1}' at {2}.",
                                      capCtr, cap.Value, cap.Index);
                    capCtr++;
                }
                grpCtr++;
            }
            Console.WriteLine();
        }
    }
}
// The example displays the following output:
//       Match: 'This is one sentence.' at index 0.
//          Group 0: 'This is one sentence.' at index 0.
//             Capture 0: 'This is one sentence.' at 0.
//          Group 1: 'sentence' at index 12.
//             Capture 0: 'This ' at 0.
//             Capture 1: 'is ' at 5.
//             Capture 2: 'one ' at 8.
//             Capture 3: 'sentence' at 12.
//
//       Match: 'This is another.' at index 22.
//          Group 0: 'This is another.' at index 22.
//             Capture 0: 'This is another.' at 22.
//          Group 1: 'another' at index 30.
//             Capture 0: 'This ' at 22.
//             Capture 1: 'is ' at 27.
//             Capture 2: 'another' at 30.
Imports System.Text.RegularExpressions

Module Example
    Public Sub Main()
        Dim input As String = "This is one sentence. This is another."
        Dim pattern As String = "\b(\w+[;,]?\s?)+[.?!]"

        For Each match As Match In Regex.Matches(input, pattern)
            Console.WriteLine("Match: '{0}' at index {1}.",
                              match.Value, match.Index)
            Dim grpCtr As Integer = 0
            For Each grp As Group In match.Groups
                Console.WriteLine("   Group {0}: '{1}' at index {2}.",
                                  grpCtr, grp.Value, grp.Index)
                Dim capCtr As Integer = 0
                For Each cap As Capture In grp.Captures
                    Console.WriteLine("      Capture {0}: '{1}' at {2}.",
                                      capCtr, cap.Value, cap.Index)
                    capCtr += 1
                Next
                grpCtr += 1
            Next
            Console.WriteLine()
        Next
    End Sub
End Module
' The example displays the following output:
'       Match: 'This is one sentence.' at index 0.
'          Group 0: 'This is one sentence.' at index 0.
'             Capture 0: 'This is one sentence.' at 0.
'          Group 1: 'sentence' at index 12.
'             Capture 0: 'This ' at 0.
'             Capture 1: 'is ' at 5.
'             Capture 2: 'one ' at 8.
'             Capture 3: 'sentence' at 12.
'       
'       Match: 'This is another.' at index 22.
'          Group 0: 'This is another.' at index 22.
'             Capture 0: 'This is another.' at 22.
'          Group 1: 'another' at index 30.
'             Capture 0: 'This ' at 22.
'             Capture 1: 'is ' at 27.
'             Capture 2: 'another' at 30.

When you use subexpressions only to apply quantifiers to them and you aren't interested in the captured text, you should disable group captures. For example, the (?:subexpression) language element prevents the group to which it applies from capturing matched substrings. In the following example, the regular expression pattern from the previous example is changed to \b(?:\w+[;,]?\s?)+[.?!]. As the output shows, it prevents the regular expression engine from populating the GroupCollection and CaptureCollection collections:

using System;
using System.Text.RegularExpressions;

public class Group2Example
{
    public static void Main()
    {
        string input = "This is one sentence. This is another.";
        string pattern = @"\b(?:\w+[;,]?\s?)+[.?!]";

        foreach (Match match in Regex.Matches(input, pattern))
        {
            Console.WriteLine("Match: '{0}' at index {1}.",
                              match.Value, match.Index);
            int grpCtr = 0;
            foreach (Group grp in match.Groups)
            {
                Console.WriteLine("   Group {0}: '{1}' at index {2}.",
                                  grpCtr, grp.Value, grp.Index);
                int capCtr = 0;
                foreach (Capture cap in grp.Captures)
                {
                    Console.WriteLine("      Capture {0}: '{1}' at {2}.",
                                      capCtr, cap.Value, cap.Index);
                    capCtr++;
                }
                grpCtr++;
            }
            Console.WriteLine();
        }
    }
}
// The example displays the following output:
//       Match: 'This is one sentence.' at index 0.
//          Group 0: 'This is one sentence.' at index 0.
//             Capture 0: 'This is one sentence.' at 0.
//
//       Match: 'This is another.' at index 22.
//          Group 0: 'This is another.' at index 22.
//             Capture 0: 'This is another.' at 22.
Imports System.Text.RegularExpressions

Module Example
    Public Sub Main()
        Dim input As String = "This is one sentence. This is another."
        Dim pattern As String = "\b(?:\w+[;,]?\s?)+[.?!]"

        For Each match As Match In Regex.Matches(input, pattern)
            Console.WriteLine("Match: '{0}' at index {1}.",
                              match.Value, match.Index)
            Dim grpCtr As Integer = 0
            For Each grp As Group In match.Groups
                Console.WriteLine("   Group {0}: '{1}' at index {2}.",
                                  grpCtr, grp.Value, grp.Index)
                Dim capCtr As Integer = 0
                For Each cap As Capture In grp.Captures
                    Console.WriteLine("      Capture {0}: '{1}' at {2}.",
                                      capCtr, cap.Value, cap.Index)
                    capCtr += 1
                Next
                grpCtr += 1
            Next
            Console.WriteLine()
        Next
    End Sub
End Module
' The example displays the following output:
'       Match: 'This is one sentence.' at index 0.
'          Group 0: 'This is one sentence.' at index 0.
'             Capture 0: 'This is one sentence.' at 0.
'       
'       Match: 'This is another.' at index 22.
'          Group 0: 'This is another.' at index 22.
'             Capture 0: 'This is another.' at 22.

You can disable captures in one of the following ways:

  • Use the (?:subexpression) language element. This element prevents the capture of matched substrings in the group to which it applies. It doesn't disable substring captures in any nested groups.

  • Use the ExplicitCapture option. It disables all unnamed or implicit captures in the regular expression pattern. When you use this option, only substrings that match named groups defined with the (?<name>subexpression) language element can be captured. The ExplicitCapture flag can be passed to the options parameter of a Regex class constructor or to the options parameter of a Regex static matching method.

  • Use the n option in the (?imnsx) language element. This option disables all unnamed or implicit captures from the point in the regular expression pattern at which the element appears. Captures are disabled either until the end of the pattern or until the (-n) option enables unnamed or implicit captures. For more information, see Miscellaneous Constructs.

  • Use the n option in the (?imnsx:subexpression) language element. This option disables all unnamed or implicit captures in subexpression. Captures by any unnamed or implicit nested capturing groups are disabled as well.

Thread safety

The Regex class itself is thread safe and immutable (read-only). That is, Regex objects can be created on any thread and shared between threads; matching methods can be called from any thread and never alter any global state.

However, result objects (Match and MatchCollection) returned by Regex should be used on a single thread. Although many of these objects are logically immutable, their implementations could delay computation of some results to improve performance, and as a result, callers must serialize access to them.

If you have a need to share Regex result objects on multiple threads, these objects can be converted to thread-safe instances by calling their synchronized methods. With the exception of enumerators, all regular expression classes are thread safe or can be converted into thread-safe objects by a synchronized method.

Enumerators are the only exception. You must serialize calls to collection enumerators. The rule is that if a collection can be enumerated on more than one thread simultaneously, you should synchronize enumerator methods on the root object of the collection traversed by the enumerator.

Title Description
Details of Regular Expression Behavior Examines the implementation of the regular expression engine in .NET. The article focuses on the flexibility of regular expressions and explains the developer's responsibility for ensuring the efficient and robust operation of the regular expression engine.
Backtracking Explains what backtracking is and how it affects regular expression performance, and examines language elements that provide alternatives to backtracking.
Regular Expression Language - Quick Reference Describes the elements of the regular expression language in .NET and provides links to detailed documentation for each language element.