Regular Expression performance [David Gutierrez]
I often get questions about Regex and what the RegexOptions.Compiled flag actually does. There are in fact three different modes that Regex can work in: interpreted (without the compiled flag), compiled on the fly (with the compiled flag), and precompiled. Each of these modes has its own performance trade-offs. I'm mainly talking about two costs here: startup performance, which is the initial cost of creating your Regex, and runtime performance, which is the cost of running matches.
Interpreted
This one is what you get by default when you don't pass in RegexOptions.Compiled as an option. Here are some interpreted usages of Regex:
Regex r = new Regex("abc*");
Regex.Match("1234bar", @"(\d*)bar");
We parse your expression into a set of custom opcodes, and then use an interpreter to run the expression later. The cost of creating the Regex is low, but this mode also has the lowest runtime performance of the three.
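As a rough sketch of how interpreted mode is typically used, the example below builds the Regex once without RegexOptions.Compiled and reuses it for several inputs. The class and method names (InterpretedExample, TryGetDigits) are made up for illustration and are not part of the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InterpretedExample
{
    // No RegexOptions.Compiled here, so this instance runs in interpreted mode.
    // The pattern is parsed once, when the Regex is constructed.
    private static readonly Regex DigitsBeforeBar = new Regex(@"(\d*)bar");

    // Hypothetical helper: extracts the digits that precede "bar", if any.
    public static bool TryGetDigits(string input, out string digits)
    {
        Match m = DigitsBeforeBar.Match(input);
        digits = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string digits;
        if (TryGetDigits("1234bar", out digits))
        {
            Console.WriteLine(digits); // prints "1234"
        }

        // The Compiled flag is not set on an interpreted Regex.
        Console.WriteLine((DigitsBeforeBar.Options & RegexOptions.Compiled) == 0); // prints "True"
    }
}
```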
Compiled on the fly
In this case you've passed in RegexOptions.Compiled:
Regex r = new Regex("abc*", RegexOptions.Compiled);
Regex.Match("1234bar", @"(\d*)bar", RegexOptions.Compiled);
In this case, we first do the work to parse the expression into opcodes. Then we do more work to turn those opcodes into actual IL using Reflection.Emit. As you can imagine, this mode trades increased startup time for quicker runtime: in practice, a compiled Regex takes about an order of magnitude longer to start up, but yields 30% better runtime performance. There are even more costs for compilation that should be mentioned, however. Emitting IL with Reflection.Emit loads a lot of code and uses a lot of memory, and that's not memory that you'll ever get back. In addition, in v1.0 and v1.1, we couldn't ever free the IL we generated, meaning you leaked memory by using this mode. We've fixed that problem in Whidbey. But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly.
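If you want to see the trade-off on your own expressions, a minimal timing sketch along these lines may help. It is only an assumption about how one might measure it, not a rigorous benchmark: the RegexModeTiming class and Measure method are hypothetical names, "startup" here means constructing the Regex plus running the first match, and the numbers will vary by pattern, input, and framework version.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexModeTiming
{
    // Hypothetical helper: returns startup cost, steady-state match cost,
    // and the number of successful matches for one pattern/options pair.
    public static (double startupMs, double matchMs, int matches) Measure(
        string pattern, RegexOptions options, string input, int iterations)
    {
        // Startup: construction plus the first match, so that any work
        // deferred until first use is charged to startup as well.
        Stopwatch sw = Stopwatch.StartNew();
        Regex regex = new Regex(pattern, options);
        regex.IsMatch(input);
        sw.Stop();
        double startupMs = sw.Elapsed.TotalMilliseconds;

        // Runtime: repeated matching with the already-built Regex.
        int matches = 0;
        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            if (regex.IsMatch(input))
            {
                matches++;
            }
        }
        sw.Stop();

        return (startupMs, sw.Elapsed.TotalMilliseconds, matches);
    }

    public static void Main()
    {
        const string pattern = @"(\d*)bar";
        const string input = "1234bar";

        var interpreted = Measure(pattern, RegexOptions.None, input, 100000);
        var compiled = Measure(pattern, RegexOptions.Compiled, input, 100000);

        Console.WriteLine("Interpreted: startup {0:F3} ms, matching {1:F3} ms",
            interpreted.startupMs, interpreted.matchMs);
        Console.WriteLine("Compiled:    startup {0:F3} ms, matching {1:F3} ms",
            compiled.startupMs, compiled.matchMs);
    }
}
```

Both modes should report the same number of matches; only the timings differ.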
Precompiled
Precompilation solves many of the problems associated with compiling on the fly. The idea is that you do all of the work of parsing and generating IL when you compile your app, ending up with a custom class derived from Regex. The big trade-off here is that you need to write a small app which will do the compilation for you (i.e. an app which calls Regex.CompileToAssembly(...) with the right parameters), and thus you need to know your important regexes in advance. In general this isn't much of a problem, since if you're writing a parser, you probably don't need to change your expressions at runtime. Your startup time is reduced to loading and JITing your class, which should be comparable to the startup cost of interpreted mode. Runtime performance will be identical to the compiled on the fly case. It's the best of both worlds!
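The small compilation app described above might look roughly like the sketch below. The type name DigitsBeforeBar, the namespace and assembly name MyApp.Regexes, and the RegexPrecompiler class are all placeholders chosen for illustration. Note that Regex.CompileToAssembly is a .NET Framework API; on .NET Core and later it throws PlatformNotSupportedException, so treat this as a sketch for the Framework only.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

public static class RegexPrecompiler
{
    // Describes each expression to precompile. All names here are
    // hypothetical; pick whatever fits your own project.
    public static RegexCompilationInfo[] DescribeRegexes()
    {
        return new[]
        {
            new RegexCompilationInfo(
                @"(\d*)bar",          // pattern
                RegexOptions.None,    // options baked into the generated class
                "DigitsBeforeBar",    // name of the generated Regex-derived class
                "MyApp.Regexes",      // namespace of the generated class
                true)                 // make the generated class public
        };
    }

    public static void Main()
    {
        // Emits MyApp.Regexes.dll into the current directory (.NET Framework only).
        AssemblyName assemblyName = new AssemblyName("MyApp.Regexes");
        Regex.CompileToAssembly(DescribeRegexes(), assemblyName);
    }
}
```

Your real app would then reference the generated assembly and instantiate the class directly, for example `Regex r = new MyApp.Regexes.DigitsBeforeBar();`, with no pattern string to parse at startup.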
Comments
- Anonymous
November 12, 2004
The Whidbey C++ project system can manage and link multiple files from multiple languages (e.g. C++, VB.Net and C# all in the same project). Is it possible to add a .regex filetype to the C++ project system which will also compile and link into a single assembly? - Anonymous
November 15, 2004
We explored doing something like what Doug mentioned as a general purpose mechanism, though it didn't end up happening in Whidbey. I'll enter a feature request to make precompiling easier somehow, though. - Anonymous
November 17, 2004
"In addition, in v1.0 and v1.1, we couldn't ever free the IL we generated, meaning you leaked memory by using this mode. We've fixed that problem in Whidbey."
As far as I know, Whidbey does not support unloading assemblies. Do you create an AppDomain for each compiled Regex (and take the cross-AppDomain performance hit), or do you use some sort of internal mechanism for unloading assemblies? - Anonymous
November 23, 2004
Lexp, you're right that Whidbey can't unload assemblies. What we've done is switch the compiled on the fly case to use a new form of Reflection.Emit called lightweight code-gen. Rather than generating assemblies, modules, types and methods, with lightweight code-gen you are only allowed to generate methods. In the end you receive a delegate to the generated method, and when that delegate is GC'ed, all of the IL and the JIT'ed code is reclaimed.
Note that the precompiled case can't use lightweight code-gen, so you still can't unload those types. - Anonymous
March 03, 2005
I am currently in the middle of a way-overdue refactoring of MhtBuilder, which uses regular expressions extensively. I noticed that I had sort of mindlessly added the RegexOptions.Compiled all over the place. It says "compiled" so it must be...