次の方法で共有


Regex perf bug

While trying to get a WordML -> HTML conversion that would work reasonably well with .TEXT, I came across an interesting performance bug in the Regex class of .NET.

Run the following code and see what happens. It doesn't produce any output; just step over each line in the debugger and note how long the "slow" search takes:

 

using System.Text.RegularExpressions;

 

class Class1

{

static void Main()

  {

    string s = "fred" + new string(' ', 5000);

 

    Regex slow = new Regex(".*fred", RegexOptions.None);

    Regex fast = new Regex("^.*fred", RegexOptions.None);

 

    fast.Replace(s, "");

    slow.Replace(s, "");

  }

}

In theory the two expressions should be the same -- "anything followed by fred" is semantically the same as "anything from the start of the string followed by fred" -- but for some reason they behave very differently. I'll see if it's a known bug tomorrow (having trouble accessing the database from home).

Comments

  • Anonymous
    January 04, 2004
    This is not a bug. The slow replacement will indeed be slow because it will laziliy attempt to make a replacement at every place within the 5004 characters... yep lazily.

    The "fast" operation - although still lazy and much slower than it could be - will only ever attempt to match from its anchored marker.
  • Anonymous
    January 04, 2004
    The comment has been removed
  • Anonymous
    January 04, 2004
    I reckon that they are not semantically equivalent; consider the string "fredfred" and how many times replacement should be done in each case.

    Even better is the case where the string may be partially overlapping with itself; how should the pattern ".*abab" be matched with the string "ababab"?
  • Anonymous
    January 04, 2004
    Err... woops :-) When I wrote lazy I did - of course - mean greedy.
  • Anonymous
    January 04, 2004
    Jeffrey Friedl's "Mastering Regular Expressions 2nd Ed." has some great discussions of performance implications when using greedy and non-greedy expressions. It's well worth the read and has a specific chapter on the .NET implementation of regular expressions in PDF format that is freely downloadable.

    http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf


  • Anonymous
    January 05, 2004
    Roland,

    In both cases, there is no difference. Both patterns greedly match all the garbage at the start of the string and match the trailing "fred" / "abab".
  • Anonymous
    January 05, 2004
    Kelly,

    Thanks for the link to the .NET chapter. I actually read the Regexp book a long time ago and have it lying infront of me now :-). No obvious reasons are shown in the book though (quick perusal of the index...).
  • Anonymous
    January 05, 2004
    But can anyone show me where the "fast" and the "slow" patterns give different results?

    It doesn't matter if the engine is doing "the right thing" in the slow case; if it can do "the right thing" faster then it should, and not doing it is a (possibly low priority) bug with performance.

  • Anonymous
    January 05, 2004
    Peter, these things will never give different results; they will both find a match against only the last instance of "fred" contained within any given text.

    One of the things that Jeff Friedl mentions in his book - the Chapter about optimizing and perf I think - is that whenever possible you show expose leading/trailing characters in the pattern which allow many optimizations to kick in. It may be that, what you percieve as a bug is simply a bunch of optimizations "not" kicking in :)
  • Anonymous
    January 05, 2004
    Hey Darren,

    It may just be a "missing feature" but we still call those "bugs" at Microsoft :-)

    Do you remember the "news" story about there being 60,000 (or whatever) "bugs" in Windows 2000? Well now you know where they all come from. Everything at Microsoft is considered a bug -- perf issues, spelling errors, help text, new feature suggestions, icon changes, design ideas, re-arranging toolbar buttons, renaming menu items, you name it.

    We'll see if they accept my request for a change in Whudbey :-)
  • Anonymous
    January 05, 2004
    > but we still call those "bugs" at Microsoft :-)

    :-) Heh, thanks for the heads-up!

    > We'll see if they accept my request for a change in Whudbey :-)

    Great, hey... while they've got that file checked-out :-) can you ask them to fix my "pet-peeve" too

    http://regexblogs.com/dneimke/archive/2003/11/23/181.aspx

    See the section on that page titled: "My Biggest Gripe about Named Groups".

    Cheers (and Happy New Year)!
    - Darren
  • Anonymous
    January 08, 2004
    The comment has been removed
  • Anonymous
    January 10, 2004
    > "anything followed by fred" is semantically the same as "anything from the start of the string followed by fred"

    Actually it's not. In the string

    abfred

    "anything followed by fred" has three possible values

    fred
    bfred
    abfred

    whereas "anything from the start of the string followed by fred" has only one possible value

    abfred

    The unanchored .* results in quadratic backtracking.

    On my "dumb blog entries to write someday" list is one on understanding regex performance. It is very easy to write a regex with pathetic perf.
  • Anonymous
    January 11, 2004
    Raymond, my point is that this could (should ?) be special cased because it is such a basic example and it has horrible perf for no good reason. Why does the engine keep looking for text when it already knows the text doesn't exist anywhere in the string?

    I look forward to your upcoming entry though ;-)
  • Anonymous
    January 17, 2004
    The comment has been removed