Regex perf bug
While trying to get a WordML -> HTML conversion that would work reasonably well with .TEXT, I came across an interesting performance bug in the Regex class of .NET.
Run the following code and see what happens. It doesn't produce any output; just step over each line in the debugger and note how long the "slow" search takes:
using System.Text.RegularExpressions;
class Class1
{
static void Main()
{
string s = "fred" + new string(' ', 5000);
Regex slow = new Regex(".*fred", RegexOptions.None);
Regex fast = new Regex("^.*fred", RegexOptions.None);
fast.Replace(s, "");
slow.Replace(s, "");
}
}
In theory the two expressions should be the same -- "anything followed by fred" is semantically the same as "anything from the start of the string followed by fred" -- but for some reason they behave very differently. I'll see if it's a known bug tomorrow (having trouble accessing the database from home).
Comments
- Anonymous
January 04, 2004
This is not a bug. The slow replacement will indeed be slow because it will laziliy attempt to make a replacement at every place within the 5004 characters... yep lazily.
The "fast" operation - although still lazy and much slower than it could be - will only ever attempt to match from its anchored marker. - Anonymous
January 04, 2004
The comment has been removed - Anonymous
January 04, 2004
I reckon that they are not semantically equivalent; consider the string "fredfred" and how many times replacement should be done in each case.
Even better is the case where the string may be partially overlapping with itself; how should the pattern ".*abab" be matched with the string "ababab"? - Anonymous
January 04, 2004
Err... woops :-) When I wrote lazy I did - of course - mean greedy. - Anonymous
January 04, 2004
Jeffrey Friedl's "Mastering Regular Expressions 2nd Ed." has some great discussions of performance implications when using greedy and non-greedy expressions. It's well worth the read and has a specific chapter on the .NET implementation of regular expressions in PDF format that is freely downloadable.
http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf - Anonymous
January 05, 2004
Roland,
In both cases, there is no difference. Both patterns greedly match all the garbage at the start of the string and match the trailing "fred" / "abab". - Anonymous
January 05, 2004
Kelly,
Thanks for the link to the .NET chapter. I actually read the Regexp book a long time ago and have it lying infront of me now :-). No obvious reasons are shown in the book though (quick perusal of the index...). - Anonymous
January 05, 2004
But can anyone show me where the "fast" and the "slow" patterns give different results?
It doesn't matter if the engine is doing "the right thing" in the slow case; if it can do "the right thing" faster then it should, and not doing it is a (possibly low priority) bug with performance. - Anonymous
January 05, 2004
Peter, these things will never give different results; they will both find a match against only the last instance of "fred" contained within any given text.
One of the things that Jeff Friedl mentions in his book - the Chapter about optimizing and perf I think - is that whenever possible you show expose leading/trailing characters in the pattern which allow many optimizations to kick in. It may be that, what you percieve as a bug is simply a bunch of optimizations "not" kicking in :) - Anonymous
January 05, 2004
Hey Darren,
It may just be a "missing feature" but we still call those "bugs" at Microsoft :-)
Do you remember the "news" story about there being 60,000 (or whatever) "bugs" in Windows 2000? Well now you know where they all come from. Everything at Microsoft is considered a bug -- perf issues, spelling errors, help text, new feature suggestions, icon changes, design ideas, re-arranging toolbar buttons, renaming menu items, you name it.
We'll see if they accept my request for a change in Whudbey :-) - Anonymous
January 05, 2004
> but we still call those "bugs" at Microsoft :-)
:-) Heh, thanks for the heads-up!
> We'll see if they accept my request for a change in Whudbey :-)
Great, hey... while they've got that file checked-out :-) can you ask them to fix my "pet-peeve" too
http://regexblogs.com/dneimke/archive/2003/11/23/181.aspx
See the section on that page titled: "My Biggest Gripe about Named Groups".
Cheers (and Happy New Year)!
- Darren - Anonymous
January 08, 2004
The comment has been removed - Anonymous
January 10, 2004
> "anything followed by fred" is semantically the same as "anything from the start of the string followed by fred"
Actually it's not. In the string
abfred
"anything followed by fred" has three possible values
fred
bfred
abfred
whereas "anything from the start of the string followed by fred" has only one possible value
abfred
The unanchored .* results in quadratic backtracking.
On my "dumb blog entries to write someday" list is one on understanding regex performance. It is very easy to write a regex with pathetic perf. - Anonymous
January 11, 2004
Raymond, my point is that this could (should ?) be special cased because it is such a basic example and it has horrible perf for no good reason. Why does the engine keep looking for text when it already knows the text doesn't exist anywhere in the string?
I look forward to your upcoming entry though ;-) - Anonymous
January 17, 2004
The comment has been removed