Optimizing Regular Expression Performance, Part I: Working with the Regex Class and Regex Objects [Ron Petrusha]

The .NET Framework’s regular expression implementation is a traditional Nondeterministic Finite Automaton (NFA) engine. Perhaps the most significant feature of NFA engines is that they place the responsibility for crafting efficient, high-performance regular expressions on the developer. (For more information about the .NET Framework’s implementation of an NFA engine, see Details of Regular Expression Behavior in the MSDN Library.) With their support for such features as backreferences and backtracking, NFA engines often offer a trade-off between flexibility and power on the one hand, and speed and performance on the other. Many regular expression developers complain about the performance of the regular expression engine but are often unable to diagnose why an individual regular expression performs poorly. In this three-part article, we’ll discuss some of the techniques that regular expression developers can apply to maximize the performance of regular expressions while preserving their power.

At the heart of the .NET Framework’s regular expression object model is the Regex object, which represents the regular expression engine. Often, the single greatest factor that affects regular expression performance is the way in which the Regex engine is used. Defining a regular expression involves tightly coupling the regular expression engine with a regular expression pattern. That coupling process, whether it involves instantiating a Regex object by passing its constructor a regular expression pattern or calling a static method by passing it the regular expression pattern along with the string to be analyzed, is by necessity an expensive one. If the regular expression engine will be used in instance method calls, an instance of the regular expression engine must be created. All regular expressions, whether used by Regex instances or in static Regex method calls, must then be compiled (more on this later). This is analogous to recompiling an application each time that it is run.

Note: To measure regular expression performance, we make extensive use of the Stopwatch class. These measurements are used for illustrative purposes only; they reflect performance when a particular regular expression is used to parse a particular input string on a particular system. They are not intended to be construed as objective benchmarks.

Taking Advantage of the Regular Expression Cache

Since the process of coupling the regular expression engine with a particular pattern is expensive, you can improve performance by ensuring that you perform this coupling as few times as possible. The following example illustrates a fairly common scenario that offers poor performance. An IsValidEmail method is called whenever an application needs to validate an email address. The method instantiates a Regex object and calls its IsMatch method to process the email address. This means that a Regex object is instantiated with the same regular expression pattern with each method call.

Note: The practice of repeatedly instantiating a Regex object with the same regular expression, as the performance measurements presented later in this section show, represents a worst practice that significantly degrades performance. Do not replicate this practice in your own code.

[Visual Basic]
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module Example
  Public Sub Main()
    Dim inputs() As String = { "david.jones@proseware.com", _
                               "d.j@server1.proseware.com", _
                               "jones@ms1.proseware.com", _
                               "j.@server1.proseware.com", _
                               "j@proseware.com9", _
                               "js#internal@proseware.com", _
                               "j_9@[129.126.118.1]", _
                               "j..s@proseware.com", _
                               "js*@proseware.com", _
                               "js@proseware..com", _
                               "js@proseware.com9", _
                               "j.s@server1.proseware.com", _
                               """a*****""""@cohowinery.com", _
                               "webmaster@aaaabbbbbbbbccccccccddddd!" }
    Dim sw As Stopwatch = Stopwatch.StartNew()
    For Each input As String In inputs
      If IsValidEmail(input) Then
        ' Handle valid data.
        Console.WriteLine("{0} is a valid email address.", _
                          input)
      Else
        ' Handle invalid data.
        Console.WriteLine("{0} is not a valid email address.", _
                          input)
      End If
    Next 
    sw.Stop()
    Console.WriteLine("Elapsed time: {0}", sw.Elapsed)
  End Sub
   
  Private Function IsValidEmail(input As String) As Boolean
    Dim pattern As String = "^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))" + _
                             "(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$"
    Dim rgx As New Regex(pattern)    ' BAD: Never reinstantiate the same object
    Return rgx.IsMatch(input)
  End Function
End Module

[C#]
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      Example ex = new Example();
      string[] inputs = { "david.jones@proseware.com", 
                          "d.j@server1.proseware.com", 
                          "jones@ms1.proseware.com", 
                          "j.@server1.proseware.com", 
                          "j@proseware.com9", 
                          "js#internal@proseware.com", 
                          "j_9@[129.126.118.1]", 
                          "j..s@proseware.com", 
                          "js*@proseware.com", 
                          "js@proseware..com", 
                          "js@proseware.com9", 
                          "j.s@server1.proseware.com", 
                          @"""a*****""@cohowinery.com", 
                          "webmaster@aaaabbbbbbbbccccccccddddd!" };
      Stopwatch sw = Stopwatch.StartNew();
      foreach (string input in inputs) {
         if (ex.IsValidEmail(input))
            // Handle valid email data.
            Console.WriteLine("{0} is a valid email address.", input);
         else
            // Handle invalid data.
            Console.WriteLine("{0} is not a valid email address.",
                              input);
      } 
      sw.Stop();
      Console.WriteLine("Elapsed time: {0}", sw.Elapsed);
   }

   private bool IsValidEmail(string input)
   {
      string pattern = @"^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))" + 
                       @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$";
      Regex rgx = new Regex(pattern);   // BAD: Never reinstantiate the same object

      return rgx.IsMatch(input);   
   }
}

This example displays the following output:

david.jones@proseware.com is a valid email address.
d.j@server1.proseware.com is a valid email address.
jones@ms1.proseware.com is a valid email address.
j.@server1.proseware.com is not a valid email address.
j@proseware.com9 is not a valid email address.
js#internal@proseware.com is a valid email address.
j_9@[129.126.118.1] is a valid email address.
j..s@proseware.com is not a valid email address.
js*@proseware.com is not a valid email address.
js@proseware..com is not a valid email address.
js@proseware.com9 is not a valid email address.
j.s@server1.proseware.com is a valid email address.
"a*****""@cohowinery.com is a valid email address.
webmaster@aaaabbbbbbbbccccccccddddd! is not a valid email address.
Elapsed time: 00:00:00.4997749

Performance suffers here because each method call requires a Regex object to be instantiated with the same regular expression pattern. But instead of instantiating a Regex object in each method call, we can call the static Regex.IsMatch method, as the following example does.

[Visual Basic]
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module Example
  Public Sub Main()
    Dim inputs() As String = { "david.jones@proseware.com", _
                               "d.j@server1.proseware.com", _
                               "jones@ms1.proseware.com", _
                               "j.@server1.proseware.com", _
                               "j@proseware.com9", _
                               "js#internal@proseware.com", _
                               "j_9@[129.126.118.1]", _
                               "j..s@proseware.com", _
                               "js*@proseware.com", _
                               "js@proseware..com", _
                               "js@proseware.com9", _
                               "j.s@server1.proseware.com", _
                               """a*****""""@cohowinery.com", _
                               "webmaster@aaaabbbbbbbbccccccccddddd!" }
    Dim sw As Stopwatch = Stopwatch.StartNew()
    For Each input As String In inputs
      If IsValidEmail(input) Then
        ' Handle valid data.
        Console.WriteLine("{0} is a valid email address.", _
                          input)
      Else
        ' Handle invalid data.
        Console.WriteLine("{0} is not a valid email address.", _
                          input)
      End If
    Next 
    sw.Stop()
    Console.WriteLine("Elapsed time: {0}", sw.Elapsed)
  End Sub
   
  Private Function IsValidEmail(input As String) As Boolean
    Dim pattern As String = "^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))" + _
                             "(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$"
    Return Regex.IsMatch(input, pattern)  ' GOOD: Take advantage of regex cache
  End Function
End Module

[C#]
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      Example ex = new Example();
      string[] inputs = { "david.jones@proseware.com", 
                          "d.j@server1.proseware.com", 
                          "jones@ms1.proseware.com", 
                          "j.@server1.proseware.com", 
                          "j@proseware.com9", 
                          "js#internal@proseware.com", 
                          "j_9@[129.126.118.1]", 
                          "j..s@proseware.com", 
                          "js*@proseware.com", 
                          "js@proseware..com", 
                          "js@proseware.com9", 
                          "j.s@server1.proseware.com", 
                          @"""a*****""@cohowinery.com", 
                          "webmaster@aaaabbbbbbbbccccccccddddd!" };
      Stopwatch sw = Stopwatch.StartNew();
      foreach (string input in inputs) {
         if (ex.IsValidEmail(input))
            // Handle valid email data.
            Console.WriteLine("{0} is a valid email address.", input);
         else
            // Handle invalid data.
            Console.WriteLine("{0} is not a valid email address.",
                              input);
      } 
      sw.Stop();
      Console.WriteLine("Elapsed time: {0}", sw.Elapsed);
   }

   private bool IsValidEmail(string input)
   {
      string pattern = @"^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))" + 
                       @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$";
      return Regex.IsMatch(input, pattern);  // GOOD: Take advantage of regex cache
   }
}

The following output is displayed when the call to the instance Regex.IsMatch method is replaced by the equivalent call to the static Regex.IsMatch method:

david.jones@proseware.com is a valid email address.
d.j@server1.proseware.com is a valid email address.
jones@ms1.proseware.com is a valid email address.
j.@server1.proseware.com is not a valid email address.
j@proseware.com9 is not a valid email address.
js#internal@proseware.com is a valid email address.
j_9@[129.126.118.1] is a valid email address.
j..s@proseware.com is not a valid email address.
js*@proseware.com is not a valid email address.
js@proseware..com is not a valid email address.
js@proseware.com9 is not a valid email address.
j.s@server1.proseware.com is a valid email address.
"a*****""@cohowinery.com is a valid email address.
webmaster@aaaabbbbbbbbccccccccddddd! is not a valid email address.
Elapsed time: 00:00:00.0380847

The following chart illustrates the difference in performance between the first example, which relies on repeatedly instantiating a Regex object with the same regular expression pattern to call an instance matching method, and the second, which calls a static matching method.

[Chart: execution time of the instance-method example compared with the static-method example]

The execution time of the first example is about 13 times the execution time of the second example (roughly 0.50 seconds versus 0.04 seconds). The difference in performance is due to the caching of regular expressions used in static method calls. Whereas the first example instantiates a regular expression object and converts the regular expression into opcodes in each of fourteen method calls (one for each element in the string array), the second example performs this conversion only once, on the first method call. Subsequently, it retrieves the interpreted regular expression from the cache each time the expression is needed.

Only regular expressions used in static method calls are cached; regular expressions used in instance methods are not. The size of the cache is defined by the Regex.CacheSize property. By default, 15 regular expressions are cached, although this value can be modified if necessary. If the number of regular expressions exceeds the cache size, the regular expression engine discards the least recently used regular expression to cache the newest one.
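As a small illustration of the cache size setting, the following sketch reads the current value of Regex.CacheSize and raises it when an application expects to use more distinct patterns in static method calls than the cache currently holds. The class name, the EnsureCacheSize helper, and the value 25 are assumptions made for this illustration only; they do not come from the examples in this article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CacheSizeSketch
{
   // Hypothetical helper: grows the static-method regex cache so that it can
   // hold at least the requested number of patterns. It never shrinks the cache.
   public static int EnsureCacheSize(int distinctPatterns)
   {
      if (Regex.CacheSize < distinctPatterns)
         Regex.CacheSize = distinctPatterns;

      return Regex.CacheSize;
   }

   public static void Main()
   {
      Console.WriteLine("Cache size at startup: {0}", Regex.CacheSize);
      // 25 is an arbitrary value chosen for the illustration.
      Console.WriteLine("Cache size after adjustment: {0}", EnsureCacheSize(25));
   }
}
```

Raising the cache size only helps when the number of distinct patterns used in static calls actually exceeds the current size; otherwise the default is usually adequate.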

Note that there is a breaking change in regular expression caching between version 1.1 and subsequent versions of the .NET Framework. In version 1.1, both instance and static regular expressions are cached; in version 2.0 and all subsequent versions, only regular expressions used in static method calls are cached. For more information, see Regex Class Caching Changes between .NET Framework 1.1 and .NET Framework 2.0.
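Because regular expressions used in instance methods are not cached, code that prefers instance methods has to keep the Regex object alive itself. The following is a minimal sketch of one way to do that, not code from the original examples: it stores a single Regex instance in a static read-only field so that the pattern is coupled with the engine only once. The class name EmailValidator is hypothetical; the pattern is the one used in the IsValidEmail examples above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailValidator
{
   // The Regex object is created once, when the class is first used,
   // and is then shared by every call to IsValidEmail.
   private static readonly Regex EmailRegex = new Regex(
      @"^(?("")("".+?""@)|(([0-9a-zA-Z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-zA-Z])@))" +
      @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}))$");

   public static bool IsValidEmail(string input)
   {
      return EmailRegex.IsMatch(input);
   }

   public static void Main()
   {
      string[] inputs = { "david.jones@proseware.com", "j..s@proseware.com" };
      foreach (string input in inputs)
         Console.WriteLine("{0} is {1}a valid email address.",
                           input, IsValidEmail(input) ? "" : "not ");
   }
}
```

Whether this approach or the static Regex.IsMatch call is preferable depends on the application; the sketch only shows that repeated instantiation can be avoided in either style.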

Options for Building Regular Expressions

So far, we’ve focused on using static regular expression methods rather than repeatedly re-instantiating the same regular expression object. You can obtain additional performance benefits by using compiled regular expressions when appropriate. (For more information, see Compilation and Reuse.)

The regular expression engine builds regular expressions in the .NET Framework in three different ways:

  • It interprets them. That is, when a regular expression object is instantiated, or when a static regular expression method is called and the regular expression cannot be found in the cache, the regular expression engine converts the regular expressions to opcodes. When a method is called, the opcodes are converted to MSIL and executed by the JIT compiler. Interpreted regular expressions can be used in both static and instance method calls. Interpreted regular expressions reduce startup time at the cost of slower execution time.
  • It compiles them. That is, when a regular expression object is instantiated, or when a static regular expression method is called and the regular expression cannot be found in the cache, the regular expression engine converts the regular expressions to MSIL. When a method is called, the MSIL is executed by the JIT compiler. Compiled regular expressions can be used in both static and instance method calls. Compiled regular expressions increase execution speed at the cost of longer startup times.
  • It compiles them into a separate assembly. That is, the static Regex.CompileToAssembly method is used to compile one or more regular expression patterns as MSIL into a separate assembly, where the regular expression engine and each regular expression pattern are tightly coupled into a class that derives from Regex. When the regular expression object is instantiated, its MSIL is loaded from its assembly and executed by the JIT compiler. Regular expressions compiled to a special-purpose assembly can be used only in instance method calls. Regular expressions that are compiled to assemblies move much of a regular expression’s startup cost from run time to design time. In a run-time environment, they maximize performance both at startup and during execution.

These three ways of using a regular expression represent three different tradeoffs between performance at startup and runtime execution. Let’s look at each of the three.

Interpreted Regular Expressions

Interpreted regular expressions sacrifice execution speed for reduced startup time, so they are best used when the regular expression is limited to a small number of method calls. This is illustrated by the following example, which compares the performance of an interpreted and a compiled regular expression when extracting a sentence from a file that contains a one-sentence string.

[Visual Basic]
Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim pattern As String = "\b(\w+((\r?\n)|,?\s))*\w+[.?:;!]"
      Dim input As String = New StreamReader(".\OneSentence.txt").ReadToEnd()
      Dim sw As Stopwatch

      Console.WriteLine("***Interpreted regular expressions***")
      ' Use Regex.Matches method to find all matches.
      sw = Stopwatch.StartNew()
      Dim instRegex As New Regex(pattern, RegexOptions.Multiline)
      Dim matches As MatchCollection = instRegex.Matches(input)
      For Each match As Match In matches
         ' Do nothing. If necessary, the MatchCollection will be populated using
         ' lazy evaluation.
      Next
      sw.Stop()
      Console.WriteLine("Found {0} matches in {1} using regex.Matches()",
                        matches.Count, sw.Elapsed)

      ' Use Regex.Match method followed by Match.NextMatch to find all matches.
      sw = Stopwatch.StartNew()
      Dim ctr As Integer = 0
      Dim instRegex2 As New Regex(pattern, RegexOptions.Multiline)
      Dim match2 As Match = instRegex2.Match(input)
      If match2.Success Then
         Do
            ctr += 1
            match2 = match2.NextMatch()
         Loop While match2.Success
      End If
      sw.Stop()
      Console.WriteLine("Found {0} matches in {1} using regex.Match()/NextMatch()",
                        ctr, sw.Elapsed)

      ' Use Regex.Match(String, Int32) to find all matches.
      sw = Stopwatch.StartNew()
      Dim pos As Integer = 0
      Dim ctr2 As Integer = 0
      Dim instRegex3 As New Regex(pattern, RegexOptions.Multiline)
      Dim match3 As Match
      Do
         match3 = instRegex3.Match(input, pos)
         If match3.Success Then
            ctr2 += 1
            pos = match3.Index + match3.Length
         End If
      Loop While match3.Success AndAlso pos <= input.Length
      sw.Stop()
      Console.WriteLine("Found {0} matches in {1} using regex.Match(String, Int32)",
                        ctr2, sw.Elapsed)

      Console.WriteLine()
      Console.WriteLine("***Compiled regular expressions***")

      ' Use compiled Regex.Matches method to find all matches.
      sw = Stopwatch.StartNew()
      Dim instRegexC As New Regex(pattern, RegexOptions.Multiline Or RegexOptions.Compiled)
      Dim matchesC As MatchCollection = instRegexC.Matches(input)
      For Each matchC As Match In matchesC
         ' Do nothing. If necessary, the MatchCollection will be populated using
         ' lazy evaluation.
      Next
      sw.Stop()
      Console.WriteLine("Found {0} matches in {1} using compiled regex.Matches()",
                        matchesC.Count, sw.Elapsed)

      ' Use compiled Regex.Match method followed by Match.NextMatch to find all matches.
      sw = Stopwatch.StartNew()
      Dim ctrC As Integer = 0
      Dim instRegex2c As New Regex(pattern, RegexOptions.Multiline Or RegexOptions.Compiled)
      Dim match2c As Match = instRegex2c.Match(input)
      If match2c.Success Then
         Do
            ctrC += 1
            match2c = match2c.NextMatch()
         Loop While match2c.Success
      End If
      sw.Stop()
      Console.WriteLine("Found {0} matches in {1} using compiled regex.Match/NextMatch",
                        ctrC, sw.Elapsed)

      ' Use compiled Regex.Match(String, Int32) to find all matches.
      sw = Stopwatch.StartNew()
      Dim posC As Integer = 0
      Dim ctr2c As Integer = 0
      Dim instRegex3c As New Regex(pattern, RegexOptions.Multiline Or RegexOptions.Compiled)
      Dim match3c As Match
      Do
         match3c = instRegex3c.Match(input, posC)
         If match3c.Success Then
            ctr2c += 1
            posC = match3c.Index + match3c.Length
         End If
      Loop While match3c.Success AndAlso posC <= input.Length
      sw.Stop()
      Console.WriteLine("Found {0} matches in {1} using compiled regex.Match(String, Int32)",
                        ctr2c, sw.Elapsed)
   End Sub
End Module

[C#]
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string pattern = @"\b(\w+((\r?\n)|,?\s))*\w+[.?:;!]";
      string input = new StreamReader(@".\OneSentence.txt").ReadToEnd();
      Stopwatch sw;

      Console.WriteLine("***Interpreted regular expressions***");
      // Use Regex.Matches method to find all matches.
      sw = Stopwatch.StartNew();
      Regex instRegex = new Regex(pattern, RegexOptions.Multiline);
      MatchCollection matches = instRegex.Matches(input);
      foreach (Match match in matches)
      {
         // Do nothing. If necessary, the MatchCollection will be populated using
         // lazy evaluation.
      }
      sw.Stop();
      Console.WriteLine("Found {0} matches in {1} using regex.Matches()",
                        matches.Count, sw.Elapsed);

      // Use Regex.Match method followed by Match.NextMatch to find all matches.
      sw = Stopwatch.StartNew();
      int ctr = 0;
      Regex instRegex2 = new Regex(pattern, RegexOptions.Multiline);
      Match match2 = instRegex2.Match(input);
      if (match2.Success)
      {
         do
         {
            ctr++;
            match2 = match2.NextMatch();
         } while (match2.Success);
      }
      sw.Stop();
      Console.WriteLine("Found {0} matches in {1} using regex.Match()/NextMatch()",
                        ctr, sw.Elapsed);

      // Use Regex.Match(String, Int32) to find all matches.
      sw = Stopwatch.StartNew();
      int pos = 0;
      int ctr2 = 0;
      Regex instRegex3 = new Regex(pattern, RegexOptions.Multiline);
      Match match3;
      do
      {
         match3 = instRegex3.Match(input, pos);
         if (match3.Success)
         {
            ctr2++;
            pos = match3.Index + match3.Length;
         }
      } while (match3.Success && pos <= input.Length);
      sw.Stop();
      Console.WriteLine("Found {0} matches in {1} using regex.Match(String, Int32)",
                        ctr2, sw.Elapsed);

      Console.WriteLine();
      Console.WriteLine("***Compiled regular expressions***");

      // Use compiled Regex.Matches method to find all matches.
      sw = Stopwatch.StartNew();
      Regex instRegexC = new Regex(pattern, RegexOptions.Multiline | RegexOptions.Compiled);
      MatchCollection matchesC = instRegexC.Matches(input);
      foreach (Match matchC in matchesC)
      {
         // Do nothing. If necessary, the MatchCollection will be populated using
         // lazy evaluation.
      }
      sw.Stop();
      Console.WriteLine("Found {0} matches in {1} using compiled regex.Matches()",
                        matchesC.Count, sw.Elapsed);

      // Use compiled Regex.Match method followed by Match.NextMatch to find all matches.
      sw = Stopwatch.StartNew();
      int ctrC = 0;
      Regex instRegex2c = new Regex(pattern, RegexOptions.Multiline | RegexOptions.Compiled);
      Match match2c = instRegex2c.Match(input);
      if (match2c.Success)
      {
         do
         {
            ctrC++;
            match2c = match2c.NextMatch();
         } while (match2c.Success);
      }
      sw.Stop();
      Console.WriteLine("Found {0} matches in {1} using compiled regex.Match/NextMatch",
                        ctrC, sw.Elapsed);

      // Use compiled Regex.Match(String, Int32) to find all matches.
      sw = Stopwatch.StartNew();
      int posC = 0;
      int ctr2c = 0;
      Regex instRegex3c = new Regex(pattern, RegexOptions.Multiline | RegexOptions.Compiled);
      Match match3c;
      do
      {
         match3c = instRegex3c.Match(input, posC);
         if (match3c.Success)
         {
            ctr2c++;
            posC = match3c.Index + match3c.Length;
         }
      } while (match3c.Success && posC <= input.Length);
      sw.Stop();
      Console.WriteLine("Found {0} matches in {1} using compiled regex.Match(String, Int32)",
                        ctr2c, sw.Elapsed);
   }
}

As the following output and chart show, the interpreted regular expression is significantly faster than its compiled counterpart.

***Interpreted regular expressions***
Found 1 matches in 00:00:00.0014471 using regex.Matches()
Found 1 matches in 00:00:00.0000689 using regex.Match()/NextMatch()
Found 1 matches in 00:00:00.0000656 using regex.Match(String, Int32)

***Compiled regular expressions***
Found 1 matches in 00:00:00.0079332 using compiled regex.Matches()
Found 1 matches in 00:00:00.0053529 using compiled regex.Match/NextMatch
Found 1 matches in 00:00:00.0057524 using compiled regex.Match(String, Int32)

[Chart: interpreted versus compiled regular expressions, one-sentence input]

If you plan to call regular expression methods just a few times for a particular regular expression, or if the approximate number of calls to regular expression methods is unknown but is expected to be rather small, use interpreted regular expressions for best performance.

Compiled Regular Expressions

In contrast to interpreted regular expressions, compiled regular expressions increase startup time but execute individual pattern-matching methods faster. This means that the performance benefit that results from compiling the regular expression increases with the number of regular expression methods called.
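To make the two costs easier to see separately, the following sketch times the construction of a Regex object and the subsequent matching loop independently, once without and once with RegexOptions.Compiled. It is an illustration under stated assumptions rather than a benchmark: the class name, the Measure helper, the sample input, and the iteration count are all hypothetical, and on some runtimes part of the compilation cost may be deferred until the first match.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class StartupVersusExecutionSketch
{
   // The sentence pattern used elsewhere in this article.
   private const string Pattern = @"\b(\w+((\r?\n)|,?\s))*\w+[.?:;!]";

   // Hypothetical helper: reports construction time and matching time
   // separately and returns the total number of matches found.
   public static int Measure(string input, RegexOptions options, int iterations,
                             out TimeSpan startup, out TimeSpan execution)
   {
      Stopwatch sw = Stopwatch.StartNew();
      Regex rgx = new Regex(Pattern, options);
      sw.Stop();
      startup = sw.Elapsed;

      int count = 0;
      sw = Stopwatch.StartNew();
      for (int i = 0; i < iterations; i++)
         count += rgx.Matches(input).Count;   // Count forces all matches to be found.
      sw.Stop();
      execution = sw.Elapsed;
      return count;
   }

   public static void Main()
   {
      // Sample input invented for the illustration.
      string input = "First sentence. Second sentence, with a comma! Third?";
      TimeSpan startup, execution;
      foreach (RegexOptions options in new[] { RegexOptions.None, RegexOptions.Compiled })
      {
         int count = Measure(input, options, 1000, out startup, out execution);
         Console.WriteLine("{0}: {1} matches, startup {2}, execution {3}",
                           options, count, startup, execution);
      }
   }
}
```

As with the other measurements in this article, the absolute numbers depend on the pattern, the input, and the system; the point of the sketch is only that the two costs can be observed independently.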

If the previous example, which opens a file that contains a single sentence, is replaced with an example that extracts each sentence from the entire text of Theodore Dreiser’s The Financier, the performance benefit that results from compiling a regular expression becomes very clear. This variation of the example produces the following output, which shows that methods of compiled regular expressions take roughly 40% less time than methods of interpreted regular expressions (about 1.14 seconds versus 1.82 to 1.90 seconds).

***Interpreted regular expressions***
Found 13679 matches in 00:00:01.9024852 using regex.Matches()
Found 13679 matches in 00:00:01.8579707 using regex.Match()/NextMatch()
Found 13679 matches in 00:00:01.8241908 using regex.Match(String, Int32)

***Compiled regular expressions***
Found 13679 matches in 00:00:01.1399927 using compiled regex.Matches()
Found 13679 matches in 00:00:01.1378395 using compiled regex.Match/NextMatch
Found 13679 matches in 00:00:01.1460928 using compiled regex.Match(String, Int32)

[Chart: interpreted versus compiled regular expressions, full text of The Financier]

Static Interpreted and Compiled Regular Expressions

Just as instances of the Regex class can be either compiled or interpreted, so can regular expressions used in static method calls. In interpreted static regular expressions, the internal regular expression opcodes are cached. In compiled static regular expressions, the MSIL is cached. As a result, compiled static regular expressions are faster than interpreted static regular expressions in cases in which methods are called repeatedly with the same regular expression.

The previous two examples used a regular expression to extract sentences from a one-sentence string and from the complete text of Theodore Dreiser’s The Financier. If we combine the two examples but replace the calls to instance Regex methods with calls to the corresponding static methods, the following output results:

Reading file OneSentence.txt
***Interpreted regular expressions***
Found 1 matches in 00:00:00.0014642 using static Regex.Matches()
Found 1 matches in 00:00:00.0002807 using static Regex.Match()/NextMatch()

***Compiled regular expressions***
Found 1 matches in 00:00:00.0076935 using compiled static Regex.Matches()
Found 1 matches in 00:00:00.0089727 using compiled static Regex.Match/NextMatch

Reading file .\Dreiser_TheFinancier.txt
***Interpreted regular expressions***
Found 13679 matches in 00:00:02.1395809 using static Regex.Matches()
Found 13679 matches in 00:00:02.1683064 using static Regex.Match()/NextMatch()

***Compiled regular expressions***
Found 13679 matches in 00:00:01.2329685 using compiled static Regex.Matches()
Found 13679 matches in 00:00:01.1957317 using compiled static Regex.Match/NextMatch

This output shows the same relationship between interpreted and compiled static regular expressions that we observed for interpreted and compiled instance regular expressions. However, static regular expression methods generally perform worse than instance regular expression methods, as the following two charts show. The first compares static and instance methods when a single method call is made; the second compares them when multiple method calls are made to process a large block of text. The advantage of static methods lies in scenarios that use regular expression caching as an alternative to repeatedly re-instantiating the same regular expression object.

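[Chart: static versus instance regular expression methods, single method call]

[Chart: static versus instance regular expression methods, multiple method calls on a large block of text]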
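The code for the static variation is not listed in this section. A minimal sketch of what such calls might look like follows; it counts sentences with the static Regex.Match method and Match.NextMatch, first interpreted and then with RegexOptions.Compiled. The class name, the CountSentences helper, and the sample input are assumptions made for the illustration and are not the code that produced the output above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StaticRegexSketch
{
   // The sentence pattern used elsewhere in this article.
   private const string Pattern = @"\b(\w+((\r?\n)|,?\s))*\w+[.?:;!]";

   // Hypothetical helper: counts matches by using only static Regex methods,
   // so the pattern is eligible for the regular expression cache.
   public static int CountSentences(string input, RegexOptions options)
   {
      int count = 0;
      Match match = Regex.Match(input, Pattern, options);
      while (match.Success)
      {
         count++;
         match = match.NextMatch();
      }
      return count;
   }

   public static void Main()
   {
      // Sample input invented for the illustration.
      string input = "This is one sentence. Here is another, with a comma! Is this the third?";

      Console.WriteLine("Interpreted static calls found {0} matches.",
                        CountSentences(input, RegexOptions.Multiline));
      Console.WriteLine("Compiled static calls found {0} matches.",
                        CountSentences(input, RegexOptions.Multiline | RegexOptions.Compiled));
   }
}
```

Because the options are part of what the engine is coupled with, the interpreted and compiled calls are presumably cached as separate entries; the sketch does not attempt to verify that.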

Regular Expressions Compiled to an Assembly

The .NET Framework also allows you to create an assembly that contains compiled regular expressions. This moves the performance hit of regular expression compilation from runtime to design time. However, it also involves some additional work: You must define the regular expressions in advance and compile them to an assembly. The compiler can then reference this assembly when compiling source code that uses the assembly’s regular expressions.

To compile regular expressions to an assembly, you call the Regex.CompileToAssembly method and pass it an array of RegexCompilationInfo objects that represent the regular expressions to be compiled, and an AssemblyName object that contains information about the assembly to be created.

For example, we can compile and store the regular expression that we’ve used to extract sentences from a string in a separate assembly. The following example does this:

[Visual Basic]
Imports System.Reflection
Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim SentencePattern As New RegexCompilationInfo("\b(\w+((\r?\n)|,?\s))*\w+[.?:;!]",
                                                      RegexOptions.Multiline,
                                                      "SentencePattern",
                                                      "Utilities.RegularExpressions",
                                                      True)
      Dim regexes() As RegexCompilationInfo = {SentencePattern}
      Dim assemName As New AssemblyName("RegexLib, Version=1.0.0.1001, Culture=neutral, PublicKeyToken=null")
      Regex.CompileToAssembly(regexes, assemName)
   End Sub
End Module

[C#]
using System;
using System.Reflection;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      RegexCompilationInfo SentencePattern =
                           new RegexCompilationInfo(@"\b(\w+((\r?\n)|,?\s))*\w+[.?:;!]",
                                                    RegexOptions.Multiline,
                                                    "SentencePattern",
                                                    "Utilities.RegularExpressions",
                                                    true);
      RegexCompilationInfo[] regexes = { SentencePattern };
      AssemblyName assemName = new AssemblyName("RegexLib, Version=1.0.0.1001, Culture=neutral, PublicKeyToken=null");
      Regex.CompileToAssembly(regexes, assemName);
   }
}

When the example is compiled to an executable and run, it creates an assembly named RegexLib.dll. The regular expression is represented by a class named Utilities.RegularExpressions.SentencePattern that is derived from Regex, as the following MSIL Disassembler (IL DASM) window illustrates.

[Figure: IL DASM window showing the Utilities.RegularExpressions.SentencePattern class in RegexLib.dll]
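A client application can then use the compiled regular expression much like any other Regex-derived type. The following is only a minimal sketch: it assumes that RegexLib.dll was generated by the previous example and is referenced when this code is compiled (for instance, with csc /r:RegexLib.dll), and the input string is invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;
using Utilities.RegularExpressions;   // Namespace defined in RegexLib.dll.

public class ClientExample
{
   public static void Main()
   {
      // The pattern and options were fixed when the assembly was built,
      // so the class is instantiated without arguments.
      SentencePattern sentences = new SentencePattern();

      // Hypothetical input, used only for illustration.
      string input = "This is the first sentence. Is this the second? Yes, it is!";

      // SentencePattern inherits its pattern-matching methods from Regex.
      foreach (Match match in sentences.Matches(input))
         Console.WriteLine("'{0}' found at index {1}", match.Value, match.Index);
   }
}
```

Because the options are baked into the class, there is no constructor overload here for supplying a different pattern or a different RegexOptions value.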

When you use this compiled regular expression to parse the text from the single-sentence file and from Theodore Dreiser’s The Financier, the following output is displayed. Calls to the SentencePattern pattern-matching methods produce execution times that are comparable to those of instance regular expressions that are called relatively few times, and to those of compiled regular expressions that are called repeatedly.

Reading file .\OneSentence.txt
***Regular expression compiled to an assembly***
Found 1 matches in 00:00:00.0062998 using Matches()
Found 1 matches in 00:00:00.0000396 using Match()/NextMatch()
Found 1 matches in 00:00:00.0000181 using Match(String, Int32)

Reading file .\Dreiser_TheFinancier.txt
***Regular expression compiled to an assembly***
Found 13679 matches in 00:00:01.1983633 using Matches()
Found 13679 matches in 00:00:01.1931376 using Match()/NextMatch()
Found 13679 matches in 00:00:01.1720003 using Match(String, Int32)


Regular expressions that are compiled to assemblies are typically considered less flexible than interpreted or compiled regular expressions. It is usually argued that they cannot support regular expressions that are built dynamically. Moreover, they are tightly coupled with the regular expression options that were provided to the regular expression’s RegexCompilationInfo object. For example, the case sensitivity of a pattern match must be specified at the time the assembly is created; it cannot be defined by an argument supplied to a regular expression pattern-matching method.

In fact, however, regular expressions can be compiled to an assembly dynamically, and regular expression objects can be defined dynamically as well. Although assemblies that contain compiled regular expressions are generally built separately from the application that uses them, this need not be the case. An assembly can be built dynamically by its application, the application can load the assembly dynamically by using reflection, and the regular expression object’s methods can also be called by using reflection. Note, however, that these techniques rely on reflection, which has a significant performance impact.
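The dynamic approach described above might be sketched as follows. This is an illustration under stated assumptions rather than a recommended implementation: it assumes the .NET Framework (Regex.CompileToAssembly is not supported on later .NET platforms), it assumes the assembly is written to and loaded from the current directory, and the class name DynamicRegexExample and the input string are hypothetical.

```csharp
using System;
using System.IO;
using System.Reflection;
using System.Text.RegularExpressions;

public class DynamicRegexExample
{
   public static void Main()
   {
      // CompileToAssembly writes RegexLib.dll to the current directory.
      string assemblyPath = Path.GetFullPath("RegexLib.dll");

      // Build the assembly at run time if it does not exist yet.
      if (!File.Exists(assemblyPath))
      {
         RegexCompilationInfo info = new RegexCompilationInfo(
               @"\b(\w+((\r?\n)|,?\s))*\w+[.?:;!]",
               RegexOptions.Multiline,
               "SentencePattern",
               "Utilities.RegularExpressions",
               true);
         AssemblyName assemName = new AssemblyName(
               "RegexLib, Version=1.0.0.1001, Culture=neutral, PublicKeyToken=null");
         Regex.CompileToAssembly(new RegexCompilationInfo[] { info }, assemName);
      }

      // Load the assembly and create the regular expression object by reflection.
      Assembly regexAssembly = Assembly.LoadFrom(assemblyPath);
      Type patternType = regexAssembly.GetType(
            "Utilities.RegularExpressions.SentencePattern", true);
      Regex sentences = (Regex) Activator.CreateInstance(patternType);

      // Hypothetical input, used only for illustration.
      string input = "Here is one sentence. Here is another!";
      foreach (Match match in sentences.Matches(input))
         Console.WriteLine(match.Value);
   }
}
```

Because the generated class derives from Regex, the sketch casts the new instance to Regex and calls Matches directly. This confines the use of reflection to loading the assembly and creating the object, instead of invoking each pattern-matching method through reflection.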

Best Practices for Regular Expression Object Usage

To maximize the performance of regular expressions in the .NET Framework, we recommend that you do the following when working with regular expression objects:

  • Call static regular expression matching methods instead of repeatedly instantiating the same regular expression object.
  • Use interpreted regular expressions when you expect to call pattern-matching methods a limited number of times to parse text.
  • Use compiled regular expressions to optimize performance when the regular expression patterns are known in advance and the frequency of calls to their pattern-matching methods can vary extensively.
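As a rough illustration of the first and third recommendations, the following sketch contrasts a static method call with a compiled regular expression. The pattern, the class name BestPracticesExample, and the method names are hypothetical and were chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public class BestPracticesExample
{
   // Hypothetical pattern that matches individual words.
   private const string WordPattern = @"\b\w+\b";

   // The pattern is known in advance, so it is compiled once and then reused.
   private static readonly Regex CompiledWords =
         new Regex(WordPattern, RegexOptions.Compiled);

   // Calls the static Matches method instead of instantiating
   // a new Regex object for each input string.
   public static int CountWithStaticMethod(string[] inputs)
   {
      int total = 0;
      foreach (string input in inputs)
         total += Regex.Matches(input, WordPattern).Count;
      return total;
   }

   // Reuses the compiled regular expression for every input string.
   public static int CountWithCompiledRegex(string[] inputs)
   {
      int total = 0;
      foreach (string input in inputs)
         total += CompiledWords.Matches(input).Count;
      return total;
   }

   public static void Main()
   {
      string[] inputs = { "one two", "three", "" };
      Console.WriteLine(CountWithStaticMethod(inputs));
      Console.WriteLine(CountWithCompiledRegex(inputs));
   }
}
```

Both methods return the same count. Which one is faster depends on how often the pattern-matching methods are called, so it is worth timing each approach with representative input.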

Part II of this series will examine how you can improve regular expression performance by crafting regular expressions that limit excessive backtracking.

Comments

  • Anonymous
    June 24, 2010
    Great article. I trust you are going to migrate this into MSDN proper so that I don't have to try and remember that this info is buried in this blog.

  • Anonymous
    June 25, 2010
    The way I mostly use regular expressions wasn't really mentioned in the article, which kind of surprises me: instead of using the static methods or repeatedly instancing the regular expression, I tend to instantiate a Regex object once and reuse it (typically as a private static field in the class using it). Is there a particular reason not to mention that way of using Regexes? Or is there a particular reason not to use them that way?

  • Anonymous
    June 26, 2010
    Good article. But what I miss is how I can optimize an existing regular expression. Depending on the pattern there can be dramatic performance differences. For example here: geekswithblogs.net/.../Referrers.aspx Yours, Alois Kraus

  • Anonymous
    July 14, 2010
    Is there going to be a part II of this article anytime soon?

  • Anonymous
    July 21, 2010
    @Luc Cluitmans In your scenario, I think you could use a compiled static Regex instance.