

Regex Class Caching Changes between .NET Framework 1.1 and .NET Framework 2.0 [Josh Free]

The .NET Framework System.Text.RegularExpressions.Regex class maintains a cache of parsed regular expressions.  The cache improves the performance of methods that create regular expressions, as the Regex class is able to avoid the cost of re-parsing and re-compiling existing regular expressions.  The cache does not affect the performance of match operations on the same input string, as match results are not cached.

As part of the .NET Framework 2.0 Redistributable, the BCL team changed the Regex class to improve the correctness of its caching.  If you use the Regex class in your code, please read on, because these changes may affect the performance of your application on the 2.0 runtime.

.NET Framework 1.1 Regex cache behavior

Under 1.1, the Regex class has an unbounded cache size.  Every regular expression exists in the Regex cache.  Each new regular expression either creates a new entry in the cache or uses an existing cache entry.  Any time an existing cached entry is reused, Regex does not need to interpret or compile the regular expression string – which improves performance.

The 1.1 Regex cache entries maintain a reference count (in the style of COM AddRef and Release) to keep track of how many objects are using them.  The reference count of a cache entry is decremented when a Regex object that uses it is finalized.  When the reference count on any cached entry reaches zero (0), the cache entry is deleted.

Overall, this design allows fast creation of regular expressions when the same expression already exists.  However, the cache behavior in 1.1 is flawed.  The use of finalizers as part of the static cache design goes against the Reliability Best Practices in the .NET Framework Developer’s Guide.  To quote one part of the best practices guide, “Finalizers must be free of synchronization problems. Do not use a static mutable state in a finalizer.”  Additionally, the use of heavyweight finalizers hurts the performance of the garbage collector.

.NET Framework 2.0 Regex cache behavior

There are two important cache behavior changes in .NET Framework 2.0 from .NET Framework 1.1:

  1. The 2.0 Regex class no longer has an unbounded cache size.  The cache has a fixed size, with a default value of fifteen (15).  Programs can override the default cache size by setting the Regex.CacheSize property.

  2. The 2.0 Regex class no longer caches parsed regular expressions created through Regex instances (that is, through the Regex constructor).  It only caches regular expressions created by Regex static methods.

    Take for example the following two calls that use regular expressions.  The first example creates a regular expression instance, which is not cached, whereas the second example uses a static method, which does cache the parsed regular expression.

    Creates a Regex instance 'r' containing the regular expression "a*" and checks for a match on 'inputString'

        // "a*" is not added to the cache
        Regex r = new Regex("a*");
        r.Match(inputString);

    Calls the static method Match to check 'inputString' for a match on the regular expression "a*" (note that the static method takes the input string first and the pattern second)

        // "a*" is added to the cache
        Regex.Match(inputString, "a*");

    Regular expressions created by instance methods are not cached in 2.0 as it makes much more sense for the application developer to manage the lifetime of their Regex object on their own.

    Regular expressions created by static methods are cached in 2.0 as users of the static methods do not have any way of managing the lifetime of their regular expressions.  Developers that want the full control of managing the lifetime of their regular expressions should use Regex instances instead of Regex static methods.
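    The Regex.CacheSize property mentioned in point 1 is a static property, so it applies to every static Regex call in the process.  The following is a minimal sketch of reading and changing it; the class name RegexCacheSizeDemo and its helper methods are hypothetical names used only for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; only Regex.CacheSize and Regex.IsMatch are real APIs.
public static class RegexCacheSizeDemo
{
    // Sets the capacity of the static Regex cache and returns the previous capacity.
    public static int SetStaticCacheSize(int newSize)
    {
        int previous = Regex.CacheSize;
        Regex.CacheSize = newSize;
        return previous;
    }

    // Static call: the parsed pattern is eligible for the static cache.
    // Note the argument order: input first, pattern second.
    public static bool IsMatchViaStaticMethod(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

    An application that uses many distinct patterns through static methods might call SetStaticCacheSize once at startup; the right value depends on how many patterns are in regular use.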

What happens when the 2.0 cache is full

The 2.0 Regex uses the Least-Recently Used (LRU) cache replacement rule.  This means that when the cache is full, the cache items that are the least recently used are the ones discarded to make room for new items.
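To make the replacement rule concrete, here is a small, generic LRU cache sketch.  It is an illustration of the LRU idea only and is not the Regex class's actual internal implementation; the type LruCache and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Illustrative LRU cache (hypothetical type, not the real Regex cache).
public sealed class LruCache<TKey, TValue>
{
    private readonly int _capacity;

    // Most recently used entries sit at the front of the list.
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();

    public LruCache(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        _capacity = capacity;
    }

    public int Count { get { return _map.Count; } }

    public bool ContainsKey(TKey key) { return _map.ContainsKey(key); }

    // Returns the cached value, or creates, stores, and returns a new one.
    public TValue GetOrAdd(TKey key, Func<TKey, TValue> factory)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
        {
            // Cache hit: mark the entry as most recently used.
            _order.Remove(node);
            _order.AddFirst(node);
            return node.Value.Value;
        }

        if (_map.Count >= _capacity)
        {
            // Cache full: discard the least recently used entry (the tail).
            LinkedListNode<KeyValuePair<TKey, TValue>> oldest = _order.Last;
            _order.RemoveLast();
            _map.Remove(oldest.Value.Key);
        }

        TValue value = factory(key);
        node = new LinkedListNode<KeyValuePair<TKey, TValue>>(
            new KeyValuePair<TKey, TValue>(key, value));
        _order.AddFirst(node);
        _map[key] = node;
        return value;
    }
}
```

With a capacity of 2, adding "a" and "b", then reading "a" again, then adding "c" discards "b", because "b" is the entry that was used least recently.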

What the 2.0 cache changes mean for your application

  1. Review the use of existing Regex instances in your application.  Since regular expressions created with the Regex constructor are not cached, make sure that you are not unnecessarily creating the same Regex instance over and over again, for example by constructing it inside a tight loop:

    Bad code: creates 'r' one hundred (100) times

        for (int i = 0; i < 100; i++) {
            Regex r = new Regex("a*");
            if (r.IsMatch(myArray[i])) {
                …
                …
            }
        }

    Correct code: creates 'r' one (1) time

        Regex r = new Regex("a*");
        for (int i = 0; i < 100; i++) {
            if (r.IsMatch(myArray[i])) {
                …
                …
            }
        }

  2. Consider managing the lifetime of regular expressions in your application yourself instead of relying on the underlying library.  Do this by replacing Regex static method calls with Regex instance method calls.

  3. If you prefer to only use Regex static methods in your application, consider setting the Regex.CacheSize property to a value that makes better sense than the default of fifteen (15) for your application.
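One common way to follow recommendations 1 and 2 is to construct the Regex once and keep it in a static readonly field, so that its lifetime is managed by your own code rather than by the static cache.  The sketch below assumes a hypothetical class RunOfAsMatcher and a hypothetical pattern "^a+$" chosen only so that the result is easy to check.

```csharp
using System.Text.RegularExpressions;

// Hypothetical example class; the pattern is an assumption made for illustration.
public static class RunOfAsMatcher
{
    // Constructed once, when the class is first used, and reused by every call.
    private static readonly Regex Pattern = new Regex("^a+$");

    // Counts the inputs that consist entirely of one or more 'a' characters.
    public static int CountMatches(string[] inputs)
    {
        int count = 0;
        for (int i = 0; i < inputs.Length; i++)
        {
            if (Pattern.IsMatch(inputs[i]))
            {
                count++;
            }
        }
        return count;
    }
}
```

Because the Regex is created once and shared, calling CountMatches repeatedly never re-parses the pattern, regardless of the static cache size.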

Comments

  • Anonymous
    October 19, 2006
    Does this also apply when RegexOptions.Compiled is set?  Or are compiled Regex objects always cached?

  • Anonymous
    October 19, 2006
    Nick,

    The use of RegexOptions flags does not impact the behavior of the cache.  Developers can use the RegexOptions.Compiled flag on either a Regex constructor (to create a Regex instance) or on a static Match method.  The Regex object returned from the constructor will not be added to the cache.  Remember, the cache does not impact the performance of match operations; it only impacts the performance of regular expression creation.

    For instance, calling this constructor multiple times will cause the expression to be compiled each time:

        Regex r1 = new Regex("abc*", RegexOptions.Compiled);
        Regex r2 = new Regex("abc*", RegexOptions.Compiled);
        Regex r3 = new Regex("abc*", RegexOptions.Compiled);

    However, calling the static method multiple times will use the Regex cache, and subsequent calls may benefit from the cache:

        Regex.Match("1234bar", @"(\d*)bar", RegexOptions.Compiled);
        Regex.Match("1234bar", @"(\d*)bar", RegexOptions.Compiled);
        Regex.Match("1234bar", @"(\d*)bar", RegexOptions.Compiled);

    For more information on Regex compilation, please refer to the "RegularExpression Compilation" section in the January 2006 edition of MSDN Magazine: http://msdn.microsoft.com/msdnmag/issues/06/01/CLRInsideOut/

  • Anonymous
    October 19, 2006
    Hi Josh, thanks for the info.  From your examples, can it be concluded that the compiled option is per instance, since the same 'abc*' is compiled 3 times?

  • Anonymous
    October 19, 2006
    How about the assembly cache used by the XmlSerializer? Has the potential for massive leaks present in 1.1 been addressed in 2.0?

  • Anonymous
    October 20, 2006
    Can you please comment on the caching and performance of the following piece of Regex code (for both 1.1 and 2.0), where we are using a list of patterns in a loop?  I doubt any caching is going on here.

        string[] allowedFormatsWithTimePattern = new string[] { //list of patterns };

        //loop through all the patterns from the array
        for (int indx = 0; indx < allowedFormatsWithTimePattern.Length; indx++)
        {
            Regex r = new Regex(allowedFormatsWithTimePattern[indx], RegexOptions.Compiled);
            if (r.IsMatch(toParse))
            {
                // do something
            }
        }

    For better performance, should we avoid looping, given that the number of patterns is fairly constant?


  • Anonymous
    August 01, 2007
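    Yesterday and today I have been optimizing the performance of the UBB parsing code I wrote earlier.  The result: for a text with more than 600 UBB tags, including several levels of nested UBB, parsing took 2 minutes before the optimization and 1 second after it.  The core technique behind this optimization was a single change: where the Regex regular expression objects are constructed.  Below I will walk through this optimization step by step.  A brief introduction to the UBB parsing component.  Requirements: 1. Parse the 14 supported UBB tags into different HTML text.  These 14 tags include a code highlighting tag, a disable-UBB tag, and some general-purpose UBB tags.  2. Some UBB tags support nested parsing, for example parsing the following text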
