Performance Quiz #8 -- The problems with parsing -- Part 1
I think I like parsers too much, I end up writing about them a lot. Maybe that's because about one program in three is, loosely speaking, a parser of some kind. All the more reason why we should be very good about writing them.
Unlike some of my previous quizes I'm not exactly sure where this one is going to go. I have a few ideas but I'm going to go on the voyage of discovery with you this time and see where it leads us; and maybe we'll take some twists and turns that I didn't expect based on what you all suggest. That should make it all the more fun.
To start us out, I wrote a quick and dirty little parser. It's motivated by a parser that's in a piece of code I'm working on right now. This parser basically takes boolean expressions that can look something like this: "(fact1 & fact2) | !fact3" and evaluates them to true or false based on whether or not fact1, fact2... etc. are true. The design is to assume that comparatively few facts will be true in any given run (maybe a few dozen) but there could be thousands of predicates (expressions) to evaluate. I've made a little harness that has the parser and some test code wrapped around it and it's posted here:
https://blogs.msdn.com/ricom/articles/495867.aspx
This time I'm going to ask some open ended questions:
Q1: What's wrong with this parser from a performance perspective?
Q2: What should we be doing to improve it?
Q3: How big of a difference is it likely to make?
These questions touch on the cornerstone of good engineering practices -- understanding what matters and what does not and prioritizing the work so that you don't spend a lot of time working on things that ultimately will not matter. But to really cement this think about what I consider to be the hardest the most important question of all:
Q4: If you hadn't written all the code yet, what would you do to get a better idea what was going to matter and what wasn't?
See the continuation in Performance Quiz #8 -- The problems with parsing -- Part 2
Comments
- Anonymous
November 22, 2005
I wrote the parser as a throwaway so there's several corners cut to keep the code simple while retaining its usefulness as a benchmark. Feel free to comment on non-performance aspects of the code if you like -- it might be educational to many of the readers -- but I probably won't say much more on those topics other than "of course you're right" :) - Anonymous
November 22, 2005
we could start by using a ref parameter in the EvaluatePredicate function - Anonymous
November 22, 2005
You say there may be thousands of predicates. Will each predicate be evaluated only once, or will each predicate have a chance of running thousands of times (as the benchmark does)?
If each predicate will be run many times, it would definitely be worthwhile to build a data-structure or generate IL for each predicate. Perhaps a 5x speed increase from doing this.
If each predicate will be run only a few times, it's much harder. I'd suggest stopping the parse as soon as you definitely knew the answer. For example, with "false & .... | something", when you reach the "&", you could skip over the "...." part (paying attention to parentheses!) until you reached an operator of lower precedence like the "|". I'm not sure how much faster this will be than the original parser, and it will be significantly more complicated. - Anonymous
November 22, 2005
Let me clarify that: the predicates tend to be evaluated in a batch. So if there were 1000 predicates we might run all 1000 of them and then a little later run all 1000 of them again and so forth. The benchmark magnifies the effect but I didn't want to burden the test case with 1000s of predicates so I substituted extra repetitions. I think that will matter somewhat in the results so we'll have to consider that when choosing where to go from here.
Compiling each predicate into IL certainly could be done.
Reader exercise: How could we get an idea what benefits that might bring? - Anonymous
November 22, 2005
Without actually running the code my guesses are below,
Q1:
(1) String allocations in GetToken() might be an issue
(2) Memory locality might be an issue
(3) Recursion might be an issue because of the time to create and destroy stacks
Q2:
(1) a. convert strInput to a char[]
b. move strInput to EvaluatePredicate() and use stackalloc
(2) a. move ichInput to EvaluatePredicate()
b. pass strInput and ichInput by ref to methods that need them
(3) Consider refactoring to the parser to have a dispatcher - Anonymous
November 22, 2005
Whoops, forgot 1 more guess ...
Q2:
(1) c. return char[] index range from GetToken() - Anonymous
November 22, 2005
q2. if the expression will be long enough, recursion may be the issue.
q4. I think that extensibility and parser efficiency would matter. - Anonymous
November 23, 2005
Is this a trick question, Rico? :-)
If it is, then my answer is that the problem is NP-Complete as it is equivalent to SAT3.
Cheers,
Dave Boyle - Anonymous
November 23, 2005
You can get a factor of a 1000 if you change string GetToken to char GetToken.
char GetToken returns:
'X' for null
'T' for a true fact
'F' for a false fact
'!','&','|','(',')' for operators - Anonymous
November 23, 2005
Oops copied the wrong number. It should read "a factor of 1.2". - Anonymous
November 23, 2005
I would also add
Thread.Sleep( 1000 );
before
clock.Start();
Otherwise the updating the icon in task bar is included in the timing. - Anonymous
November 23, 2005
Recursion. I'd like to see whether a stack-based implementation made a difference. However, I've not written a parser so apologies if I'm misinformed. - Anonymous
November 23, 2005
The comment has been removed - Anonymous
November 23, 2005
The facts must be assumed to change over time otherwise you could just evaluate the predicates once and for all which would make the test case very much less interesting.
This isn't a trick question, but of course there is no completely trivial answer or anything. It's just an interesting problem. - Anonymous
November 23, 2005
Q1: The biggest issue is the use of strings. The scanner doesn't classify the tokens. It merely chops the input up into pieces.
Q2: The simplest change: Use string constants for the delimiter tokens, and use Object.ReferenceEquals to check for equality. (This works because the string constants are interned. The interned token strings act as token identifiers.)
A more complex change is to throw out the recursive descent parser + tokenizer and replace them with a hand-coded parser. The language is simple enough to do that. There is some duplication between the scanner and the parser that would be eliminated this way.
Q3. I've always found this hard to quantify. It's especially hard to predict the impact of the hand-crafted parser. It may not end up being much faster than the optimized version of the current code.
Q4. We know that doing lots of small allocations can be relatively costly if there is not much other code being executed. With a parser, I would certainly look at ways to avoid unnecessary allocations.
Two comments on the test data:
1. The timings are somewhat dependent on the frequency and ordering of the operators, as well as the complexity of the predicates. There is too much variation at the moment to make accurate timings for minor optimizations.
2. The assumption that "comparatively few facts will be true" is violated: 60% of the used facts are true.
And a comment on the .NET BCL and parsing: On several occasions I have found it would be useful to have "substring" overloads of methods like TryParse, i.e. something like TryParse(string, startIndex, out result, out endIndex). Dictionary lookups on a substring would be useful in this scenario. - Anonymous
November 23, 2005
When I write parsers by hand I find it easier to always have a lookahead token instead of the repetitive testing needed to decide whether to get one and then consume it. But this is minor from a performance point of view, just a simplification to the development effort.
As for reading GetToken() (or the rest of the source), my eyes can't handle it. Did you parse your parser after posting it? - Anonymous
November 23, 2005
Jeffrey Writes:
>>2. The assumption that "comparatively few facts will be true" is violated: 60% of the used facts are true.
Perhaps I could have been clearer -- the number of true facts is comparatively small. I didn't mean to say anything about the ratio. - Anonymous
November 23, 2005
Norman writes:
>>As for reading GetToken() (or the rest of the source), my eyes can't handle it. Did you parse your parser after posting it?
UGH! When I changed the source from <PRE></PRE> to just fixed pitch fonts so that I didn't get bizarre paragraph breaks I inadvertantly caused all the & < and > to be HTML escaped. Yuck! I fixed it... - Anonymous
January 23, 2007
There were some great comments from the initial posting which I encourage you to read. Many people latched