Knowing when NOT to use RegEx to match Strings (System.Text.RegularExpressions) [Kit George]
We recently got asked this question by a customer: "In C#, how do I ensure that a string entered into a text box is of the format: letter,number,letter,number,letter,number ?"
The first answer seems to be pretty straightforward: use RegEx! Regular Expressions are a pretty powerful mechanism for matching strings, and seem the obvious choice. However, you've always got to remeber that RegEx, while powerful, is also a pretty hefty mechanism for String matching. When you're looking for complex strings it's often a good choice (since writing the code yourself can be unbelevably tricky), but when what you're looking for is pretty simple (as in this case), then doing your own matching shouldn't be too tough, and is going to perform a lot more solidly.
Here's a test I wrote to demonstrate this clearly. I've included two forms of RegEx matching. The first (the single line call, looking for whitespace and the specified pattern) is showing how robust RegEx can get, but check the numbers below: the operation is suitably expensive. The second helps the RegEx out a little by doing some of the more exhaustive searching work for it (in this case, a simple trim, followed by a length check). It therefore, doesn't need to do the space matching itself. The final form is simply doing the string match yourself. In this case, I use Cahr.IsLetter and Char.IsDigit to simply step through and ensure the string is in the right format.
The results bear out that, when you can write the simple test for the pattern you want yourself, it's going to be worth doing that. Even with an optimization, RegEx is not as performant as writing a simple check yourself. When the pattern checking is omre complex, then RegEx can be far more usable.
Duration for StringMatch test 1798092 Ticks
Duration for RegExMatch1 test (no optimization) 31166928 Ticks (18x slower!!!)
Duration for RegExMatch2 test (optimization) 18380496 Ticks (10x slower)
using System;
using System.Text.RegularExpressions;
using System.IO;
class Test
{
static void Main(string[] args)
{
// include your own test cases if needed
string[] testCases = {"l2j1a9", " l2j1a9 ", "l2j1a9 ",
"a l2j1a9 ", " l2j1a9 a", "lwj1a9", " l2 j1a9 "};
long duration = 0;
Console.WriteLine("Standard Test:");
foreach(string test in testCases) {
duration += RunTest(test, false);
}
Console.WriteLine("Duration of tests = {0}\r\n", duration);
duration = 0;
Console.WriteLine("\r\nRegex Test:");
foreach(string test in testCases) {
duration += RunTest(test, true);
}
Console.WriteLine("Duration of tests = {0}", duration);
}
static long RunTest(string s, bool useRegex) {
bool result = false;
// this is just a timing run, for interest
long startTime = DateTime.Now.Ticks;
if (useRegex) {
for (int i=0;i<20000;i++) {
RegexMatch(s);
}
} else {
for (int i=0;i<20000;i++) {
StringMatch(s);
}
}
long duration = DateTime.Now.Ticks - startTime;
result = (useRegex ? RegexMatch(s) : StringMatch(s) );
if (result) {
Console.WriteLine("String '{0}' matches the requirement", s);
} else {
Console.WriteLine("String '{0}' does NOT match the requirement", s);
}
return duration;
}
static bool RegexMatch(string s) {
// Not optimized version. Read as:
// any (*) amount of space (\s) at the beginining (^)
// any word character (\w), any digit (\d), word, digit, word, digit
// any (*) amount of space (\s) at the end ($)
//Regex r = new Regex(@"(^\s*)\w\d\w\d\w\d(\s*$)");
// Optimized version. We take out ONLY the checks for the whitepace
// Try this instead of the above line, and see what the perf difference is
/*
Regex r = new Regex(@"\w\d\w\d\w\d");
s = s.Trim();
if (s.Trim().Length != 6)
return false;
*/
return r.IsMatch(s);
}
static bool StringMatch(string s) {
if (s.Trim().Length != 6)
return false;
char[] chars = s.Trim().ToCharArray();
return (Char.IsLetter(chars[0]) && Char.IsDigit(chars[1]) &&
Char.IsLetter(chars[2]) && Char.IsDigit(chars[3]) &&
Char.IsLetter(chars[4]) && Char.IsDigit(chars[5]));
}
}
Comments
Anonymous
February 21, 2005
It depends, but I'd have to say in most cases performance isn't the issue. Particular in validating user input, which doesn't happen that frequently. So RTI'ing string validation code to shave off a few clock ticks is probably a waste of a programmer's time.
Programmers should instead focus on which technique is clearer to understand for other programmers. Regexs have a reputation of being hard to read. But at least, in your simple example...
new Regex(@"wdwdwd");
...I find that very easy to read. Easier then the StringMatch method. So I'd probably use the Regex.Anonymous
February 21, 2005
The comment has been removedAnonymous
February 21, 2005
Ah, I went ahead and tried it on my own.
Standard test gives 2188984.
"Unoptimzed" regex, compiled, externally gives 3439832. (only a factor 1.6, compare to your 18!)
"Optimized" regex, compiled, externally gives 3752544. The trim call and length check actually made things slower!
I then tried the unoptimized regex with new instances like your original version, but still only get a factor 7 difference, so I guess that my system (with the 1.1 framework) simply behaves differently.Anonymous
February 21, 2005
Also, marking the two groups as non-capturing brought the factor down to 1.3 on my machine, but it is highly dependent on low timing resolution...Anonymous
February 21, 2005
Wow, thanks for jumping on me quickly all, that's great. Without a shadow of a doubt, there's ways to make RegEx do this way better (CN points this out pretty well by showing that compiling the Regex can help significantly: just remeber, you don't want too many precompiled regexes, it ends up being a bind), but I was mostly trying to illustrate that if you can do things cheaply and easily yourself, go ahead. I love that people were leaping up in defense of RegEx, but know when to try something else: if performance is critical especially (in this case). We do this in plenty of places in the framework, where we need to optimize the common scenario above others. One example is providing overloads for members that take base types (check out Console.WriteLine as an example), to help ensure that operations for those specific types (the most commonly used) are as optimized as possible. I would urge a similar approach hereAnonymous
February 21, 2005
The techniques that make sense for the relative handful of developers writing the .NET framework rarely make sense for the vast legions of developers who are building on top of the framework.
The guys at the top of the kernel-OS-framework pyramid don't have much in common with the lower (and order-of-magnitude larger!) levels of the coding pyramid.
I occasionally run into developers who point to the Linux source code as a model for development, and my question to them is always the same: since when are we writing an operating system? Everyone likes to think that they're working on something fantastically complicated that will be used by millions of developers, but the reality is.. somewhat less exciting.Anonymous
February 21, 2005
Well, I think the most important lesson is that the optimized Regex, relying on String.Trimming first, was actually inferior to the "unoptimized" regex, while both are inferior to C#.
Of course, there are good reasons to optimize for the common case, but it's just as important to check that the optimization really gives you a benefit. I've seen things like keeping WeakReferences to a lot of objects to allow aggressive GC if needed. The only problem with the code was that the data protected was a simple combination of a DateTime and a primary ID from the database. That made the WeakReference itself just as heavy as the object referenced...
Just my $ 0.02.Anonymous
February 21, 2005
FWIW, here's what I get on an Athlon FX-53:
Unoptimized = 7x slower
Optimized = 6.6x slower
(move Regex r up to outer loop)
Unoptimized = 2.4x slower
I think most of the perf penalty is just instantiating the Regex object over and over.
> "We do this in plenty of places in the framework"
And I agree, you should. But I dislike working with other developers who have delusions of grandeur and think they're working on the framework too. Hint: they're not.Anonymous
February 21, 2005
If I move the Regex declaration up there to be static and put in RegexOptions.Compiled, the performance of RegEx will be much better only 50% slower. (1093750 vs 1562500)Anonymous
February 21, 2005
The numbers gets even closer than 50% if you do a better Trimmming of your strings :)
You call trim to much man!
Also call the regex once before timing, so the dll is compiled and loaded.
My number are now
781260 vs 1093764 on a P4 2.8Anonymous
February 22, 2005
The comment has been removedAnonymous
February 22, 2005
static bool StringMatch2(string s) {
if (s.Trim().Length != 6)
return false;
return (Char.IsLetter(s, 0) && Char.IsDigit(s, 1) &&
Char.IsLetter(s, 2) && Char.IsDigit(s, 3) &&
Char.IsLetter(s, 4) && Char.IsDigit(s, 5));
}Anonymous
February 22, 2005
Well I've never seen such a good dose of feedback on an issue: awesome. Obviously I'll need to be more careful on making the right code available next time. I think Jeff had an especially good point on making sure we're aware of reality in these situationsAnonymous
February 22, 2005
I lost the reference, but one of your colleagues once wrote a blog entry on the RegexOptions.Compiled flag. To summarize, there are basically a few different modes of use and re-use for regexes.
Disclaimer: this is straight from memory, there may be errors...
First, the one-shot use, like in your post. This has a significant overhead cost of creating the regex. Instead of creating the Regex object on each call, you can as well call the static overloads of the methods in the Regex class. Creating a Regex instance in this case is mere overhead that makes your code less readable.
Second, the uncompiled, but cached use. Create the Regex instance once, but without the RegexOption.Compiled flag and store it somewhere. Then reuse that same regex instance. For many practical situations, putting it in a static field for the class using the regex works well. In other situations (typical for regexes that are not known at compile time), use nonstatic fields. This works faster than your solution in all situations. From my experience, this way of using Regexes is the best in most practical situations.
Third, the compiled and cached use. Create the regex instance once, but with the RegexOptions.Compiled flag. Your regex will be faster (about 30% IIRC), but at a very significant extra startup cost (you are creating a new dynamic assembly for each regex). Initially I used this option a lot in my code, but I noticed that the overhead cost is often more than it is worth! Measuring performance is the key to find out if this route is right for you.
Fourth: the Regex class provides options to compile a set of Regexes to a 'real' assembly, which can be referenced from your project. This is probably the fastest option for runtime, but too much of a hassle at design/compile time in most cases.Anonymous
February 28, 2005
Kit George is the program manager for the .NET Base Class Library team. Kit recently posted an entry on the BCL blog describing a solution to a customer problem: We recently got asked this question by a customer: "In...Anonymous
May 31, 2009
PingBack from http://outdoorceilingfansite.info/story.php?id=23218Anonymous
May 31, 2009
PingBack from http://outdoorceilingfansite.info/story.php?id=5582