Regular Expression optimization done in Jscript 5.7 release
I modified my last blog many times. I was really amazed at the kind of response I got on this blog. Thanks for taking time and putting all the comments. Today I am going to talk at length the "Regular Expression" optimization we did in JScript that shipped with Windows Script 5.7 release. I could have changed my last blog but since I have already changed it so many times, I think it's just better to write a new one.
Web Developers who are using JSON protocol to describe their objects should read this blog as optimizations has been done in this area.If you are looking for other fixes that has gone in this release please get them from here. The improvements are quite evident for web sites which are using JSON protocol to define their objects. We got to know about this performance bug from our Windows Live team when they complained about poor performance of their web site while using JSON protocol. We were close to 30 times slower than the competition.
Now, let's take a moment to define what exactly was causing the problem. As part of using JSON protocol usage on Windows Live site, the code used to do pattern matching on a regular expression and further do some kind of replacement of matched string with a fixed specified string. The CPU time spend in doing this pattern matching and replace was very high which lead to the performance of site being bad.
Let’s take the example of the specified regular expression : /(["'])( \\.|((?!\1|\\).))*\1/g
While matching a pattern for this expressions, before going into the loop, JScript engine compares if source string right now is same as the source string that was present before entering the same loop last time. If source strings are same it doesn't enter the loop .Assumption being none of the characters in the source string were matched in the last iteration and there is no point in taking this loop again (because if this loop is taken again then it means engine is in infinite loop). Otherwise keep checking back all the loops down in the stack until either the above mentioned condition is true (both source strings are same) or it reaches bottom of the stack. If it reached bottom of the stack then this loop can be entered as it means that either this is the first time loop is being taken or loop consumed at least one character last time. Seems pretty perfect logic.
So what's the flaw with the above logic? What if the engine is going to take same loop but it matched one or more characters in the last branch. In this case above mentioned logic will still keep comparing all the loops till it reaches end of the stack and then it decides to take the loop. This is not correct. If few characters were consumed in the last loop then instead going till end of the stack, it must decide to take a loop as soon as it finds that last time part of the sources string was matched by this loop.
The fix actually addresses this bug. And as a result from being 30 times slower than the competition, we are now at par with competition. Now that's a good enough reason to start using the new JScript if you are using JSON protocol.
Thanks,
Don Raman.
Comments
Anonymous
December 03, 2007
That's great news. Thanks for the info. BTW, the example regex seems like it has potential for optimization by the user as well, e.g. by unrolling the loop and splitting into two patterns: /"[^"](?:.[^"])"|'[^'](?:.[^'])'/g On the subject of JScript regular expressions, would it be possible to fix this log-standing bug relating to multiple lookaheads at the same position ( http://regexadvice.com/blogs/mash/archive/2004/10/05/320.aspx )? It can be very difficult to work around for some kinds of patterns. Thanks!Anonymous
January 18, 2008
On the subject of JScript regular expressions, would it be possible to fix this log-standing bug relating to multiple lookaheads at the same position ( http://regexadvice.com/blogs/mash/archive/2004/10/05/320.aspx )? It can be very difficult to work around for some kinds of patterns. Thanks!Anonymous
January 19, 2008
BTW, the example regex seems like it has potential for optimization by the user as well, e.g. by unrolling the loop and splitting into two patterns: /"[^"](?:.[^"])"|'[^'](?:.[^'])'/g