Performance Quiz #3 -- Solution
And now the exciting solution to Performance Quiz #3!
There's actually an enormous amount of philosophy about good habits that can be learned from this little example, but I'd like to anchor the discussion in the numbers, so let's begin with those.
Time Analysis
I did 4 runs with everything cached and with all the jitting included in the measured time; this actually gives the slower RAD parser an advantage in terms of reported time, because the input file isn't so terribly large that load time can be ignored. These results were on my own desktop machine; I'm sure you'll get different numbers if you try the test, but the ratio should be similar.
Rad Parser: 1.092s
Classic Parser: 0.149s
Ratio: 7.366
So the RAD parser is over 7 times slower, and if I were to subtract out the cost of starting an empty program from both (about 50ms), you'd get more like a factor of 10 for just the code we wanted to test.
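The exact harness isn't shown here, but a minimal sketch of how you might reproduce this kind of end-to-end measurement (process start to exit, so jitting and load costs are included) follows; the executable and input file names are assumptions.

using System;
using System.Diagnostics;

class TimeRuns
{
    // Times a whole process run so that startup, jitting and file load are all
    // included, matching the way the numbers above were reported.
    static double TimeProcess(string exe, string args)
    {
        DateTime start = DateTime.UtcNow;     // coarse wall-clock timing is plenty at this scale
        using (Process p = Process.Start(exe, args))
        {
            p.WaitForExit();
        }
        return (DateTime.UtcNow - start).TotalSeconds;
    }

    static void Main()
    {
        // "RadParser.exe", "ClassicParser.exe" and "input.txt" are assumed names
        Console.WriteLine("Rad:     {0:F3}s", TimeProcess("RadParser.exe", "input.txt"));
        Console.WriteLine("Classic: {0:F3}s", TimeProcess("ClassicParser.exe", "input.txt"));
    }
}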
Space Analysis
These statistics were gathered by running the programs under CLRProfiler which tracks every managed allocation.
Rad Parser
Total Allocations 48307 Objects 2357857 Bytes
Top Ten Allocated Types
Count Bytes Type
42813 1408708 System.String
2514 713712 System.Int32 []
2515 200608 System.String []
14 10564 System.Object []
8 6183 System.Byte []
25 5690 System.Char []
171 3420 System.Text.StringBuilder
15 2160 System.Collections.Hashtable.bucket []
15 780 System.Collections.Hashtable
34 680 System.RuntimeType
Classic Parser
Total Allocations 539 Objects 46454 Bytes
Top Ten Allocated Types
Count Bytes Type
253 15262 System.String
14 10564 System.Object []
5 4864 System.Byte []
14 3712 System.Int32 []
21 3456 System.Char []
14 2016 System.Collections.Hashtable.bucket []
14 728 System.Collections.Hashtable
14 564 System.String []
6 384 System.Globalization.CultureInfo
16 320 System.RuntimeType
That result is simply stunning: the RAD parser allocates about 50 times the number of bytes, most of which obviously were just thrown away. Luckily for us the GC was able to regularly reclaim that memory, so the impact on total working set was considerably less (it should end up being near the size of your L2 cache because of GC heuristics). Nonetheless, allocating all that memory was not a good thing for performance.
Let's look at the allocations in a bit more detail.
I extracted the top 10 allocating stacks from both runs out of the CLRProfiler log and I've got them here in text format.
Let's do the classic parser first this time because we can dispense with it fairly quickly. I'm only going to show the first two of the top ten stacks.
Stack 1 allocates 7590 bytes
6 6240 System.Object []
1 72 System.OutOfMemoryException
1 72 System.StackOverflowException
1 72 System.ExecutionEngineException
1 72 System.Threading.ThreadAbortException
1 20 System.SharedStatics
1 108 System.AppDomain
11 782 System.String
1 20 System.String []
4 80 System.Security.Permissions.HostProtectionAttribute
1 24 System.Reflection.Assembly
1 28 Microsoft.Win32.Win32Native.InputRecord
Stack 2 allocates 4702 bytes
ClassicParser::Main static void (String[])
1 76 System.IO.FileStream
1 4108 System.Byte []
6 506 System.String
1 12 System.Int32
Now what's interesting here is that the main function is only stack #2; you can see the byte array there, a few string literals, one boxed integer, and of course the file stream itself. Some of those must be coming from inlining, because they don't appear directly in the source code for Main (it might make another interesting blog entry to run all of these down, actually). The biggest source of allocations had no managed stack associated with it at all -- these are objects created by the runtime at startup time to set up the environment, open the standard file streams, and get that first appdomain running. There are actually more of those further down, accounting for even smaller numbers of bytes, and all of them appear in the RAD parser as well.
So what's interesting here is that of the 46454 bytes allocated in total, our parser is directly responsible for a mere 4702 of them. Wow.
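For comparison, here is roughly the shape of the classic approach -- one FileStream, one byte buffer, and in-place digit conversion. This is a sketch of the technique, not the exact quiz code; the field layout (delimiter-separated, possibly negative integers) is an assumption.

using System;
using System.IO;

class ClassicSketch
{
    static void Main(string[] args)
    {
        using (FileStream fs = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        {
            byte[] buffer = new byte[(int)fs.Length];   // one buffer for the whole file
            fs.Read(buffer, 0, buffer.Length);          // (a robust version would loop until full)

            long sum = 0;
            int i = 0;
            while (i < buffer.Length)
            {
                // convert one (possibly negative) integer directly from the bytes
                bool negative = false;
                if (buffer[i] == (byte)'-') { negative = true; i++; }

                int value = 0;
                while (i < buffer.Length && buffer[i] >= (byte)'0' && buffer[i] <= (byte)'9')
                {
                    value = value * 10 + (buffer[i] - (byte)'0');
                    i++;
                }
                sum += negative ? -value : value;

                // skip delimiters and the CR/LF at the end of each line
                while (i < buffer.Length && (buffer[i] < (byte)'0' || buffer[i] > (byte)'9') && buffer[i] != (byte)'-')
                    i++;
            }
            Console.WriteLine(sum);
        }
    }
}

Notice that after the one buffer is allocated, the loop allocates nothing at all per line, which is why the parser itself accounts for only a few kilobytes above.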
Let's look now at the RAD parser stacks; there's a lot more meat on those.
Stack 1 allocates 985000 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
System.String::InternalSplitKeepEmptyEntries String[] (int32[] int32[] int32 int32)
System.String::Substring String (int32 int32)
40000 985000 System.String
Stack 2 allocates 710034 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
2500 710000 System.Int32 []
1 34 System.String
Stack 3 allocates 359436 bytes
RadParser::Main static void (String[])
System.IO.StreamReader::ReadLine String ()
System.String::CtorCharArrayStartLength String (wchar[] int32 int32)
2334 359436 System.String
Stack 4 allocates 200000 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
System.String::InternalSplitKeepEmptyEntries String[] (int32[] int32[] int32 int32)
2500 200000 System.String []
Stack 5 allocates 41208 bytes
RadParser::Main static void (String[])
System.IO.StreamReader::ReadLine String ()
System.Text.StringBuilder::.ctor void (String int32)
System.Text.StringBuilder::.ctor void (String int32 int32 int32)
System.String::GetStringForStringBuilder static String (String int32 int32 int32)
166 41208 System.String
Stack 6 allocates 7590 bytes
6 6240 System.Object []
1 72 System.OutOfMemoryException
1 72 System.StackOverflowException
1 72 System.ExecutionEngineException
1 72 System.Threading.ThreadAbortException
1 20 System.SharedStatics
1 108 System.AppDomain
11 782 System.String
1 20 System.String []
4 80 System.Security.Permissions.HostProtectionAttribute
1 24 System.Reflection.Assembly
1 28 Microsoft.Win32.Win32Native.InputRecord
All I can say is wow. We have to get down to stack 6 before we see the 7590 bytes of root allocations. You can see there's significant cost associated with splitting the strings -- making the arrays and the string pieces. Another hidden cost leaps right out here as well: since we don't know in advance how many string parts there are, Split makes a pass over the string to find the delimiter offsets, and those are stored in a temporary integer array.
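To make those stacks concrete, here is roughly the shape of the RAD loop that produces them. This is a sketch: the comma delimiter and the running sum are assumptions for illustration, not necessarily the exact quiz code.

using System;
using System.IO;

class RadSketch
{
    static void Main(string[] args)
    {
        long sum = 0;
        using (StreamReader reader = new StreamReader(args[0]))
        {
            string line;
            while ((line = reader.ReadLine()) != null)    // stacks 3 and 5: a new string per line
            {
                string[] fields = line.Split(',');        // stacks 1, 2, 4: substrings, offset int[], string[]
                foreach (string field in fields)
                    sum += Int32.Parse(field);            // converts each throw-away piece
            }
        }
        Console.WriteLine(sum);
    }
}

Every line costs one string from ReadLine, one temporary int[] of delimiter offsets inside Split, one string[] for the pieces, and one string per piece -- all of which become garbage immediately after the numeric conversion.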
Cost Per Byte
Another way of thinking about these parsers is to consider how many times each byte of the input file has to be touched. In the RAD parser we have to do newline parsing, then scanning for delimiters and splitting, then numeric conversions, so it's readily apparent that there will be many memory touches for each byte of the file. Consider just this: counting the CRLF there are 70 bytes per line of input and 2500 such lines, so that's 175,000 bytes to read. But we allocated over 13 times that many bytes to process the file the RAD way.
The classic parser is very economical by comparison. Even if the data had not been in the disk cache we would only add a few milliseconds of overhead to both algorithms for reading such a small file. The memory management costs of the rad way are easily greater than any reasonable I/O cost.
Warnings and Good Practice
So should you all abandon your RAD practices and go study Knuth's parsing techniques? Well... I dunno about that. But there are definitely some lessons here. I think my message is this: if you choose the RAD parser you have made your coding that much easier, but you still have a responsibility to measure the quality of your solution against some test data and see what your performance is going to be like. Every professional piece of software should have performance goals, even if they are very simple ones that are very easy to meet. Sometimes a goal like “Yeah, 10 seconds is fine” is all it takes; other times we have to have strict time and space budgets. If you were in “Yeah, 10 seconds is fine” mode then I think the RAD parser is a fine choice.
And here’s the second half of the message. If you decide to do something fancier than the RAD way, you can easily find yourself in a world of hurt on correctness issues. Even this fairly simple parsing problem has plenty of potential for mistakes -- I put in error checks for three cases, but did you notice that there’s no error check for an input token that is just a “-”? Some people noticed… It’s easy to make mistakes when you are reaching for sharper tools, which is all the more reason to understand your performance goals so that you can choose a plan that will economically meet them. Is the “classic” parser approach doomed to be full of errors? No, I think not. We just have to be more careful; there are well-understood techniques for creating provably correct parsers, but we’ll obviously be investing more in our algorithm to get there.
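For instance, the “-” check that is easy to miss looks something like this in a hand-rolled converter (a sketch, not the original quiz code):

using System;

class TokenCheckSketch
{
    // A hand-rolled converter has to reject a bare "-" explicitly -- a check that
    // Int32.Parse would have performed for you by throwing a FormatException.
    static bool TryParseToken(string token, out int value)
    {
        value = 0;
        int i = 0;
        bool negative = false;

        if (i < token.Length && token[i] == '-') { negative = true; i++; }

        if (i == token.Length)        // the easy-to-miss case: "" or just "-"
            return false;

        for (; i < token.Length; i++)
        {
            if (token[i] < '0' || token[i] > '9')
                return false;         // non-digit character
            value = value * 10 + (token[i] - '0');
        }

        if (negative) value = -value;
        return true;
    }

    static void Main()
    {
        int v;
        Console.WriteLine(TryParseToken("-42", out v) ? v.ToString() : "bad token");   // -42
        Console.WriteLine(TryParseToken("-", out v) ? v.ToString() : "bad token");     // bad token
    }
}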
So there’s your balance: don’t blindly choose the easiest approach, because it may be much too awful; and don’t naively choose a complex approach, because the effort required to achieve correctness may be daunting.
Set quantitative goals so that you can objectively make the necessary plans to meet them.
Comments
- Anonymous
June 30, 2004
The comment has been removed - Anonymous
June 30, 2004
Furthermore, I should probably go into more detail about how we approach API performance.
During initial development, we try to make a reasonable trade-off between correctness, performance, security and maintainability. Security is particularly relevant here, as there is zero tolerance for errors that could expose security holes in the platform. As an illustration of the trade-offs, we almost never use unsafe code during initial development, because the increased risk of introducing a security error is not usually justified by the small performance benefits you can get.
There can be interesting correctness, maintainability and code size trade-offs also. As an example, most number parsing and formatting code is shared for all numeric data types. We could make optimizations for simple cases of integer parsing or where invariant culture is used, but this would likely require a significant amount of additional code.
Thus, it is not the case that we make every API absolutely as fast as we can make it. We still try hard to make them very fast. However, we focus our time and resources on making the most important APIs as fast as possible.
For APIs at the base of the platform, we rely on partner scenarios and customer scenarios to drive which APIs to optimize further. Thus, there are some APIs such as String.CompareOrdinal, the Hashtable indexer and StringBuilder.Append that were very highly optimized in V1.0 because of how frequently they appeared in profiling logs for key scenarios. Unlike some of the APIs mentioned in Rico’s example, it would be extremely difficult to create a roll-your-own version of these that would be much faster than the ones in the BCL. For some primitive operations like for-each on Array and String.Length, it would be nearly impossible because they are special-cased by the JIT.
In some cases the performance is a consequence of the API design. In this example, String.Split and Int32.Parse force you into creating sub-strings. That does not mean that there is anything wrong with these, as it is the simplest abstraction to deal with. However, if there is significant customer or partner demand, we may need to create optional versions of these APIs that are a little more complicated to use or may have fewer features, but can go faster or eliminate the need for intermediate string objects. Having additional ways of doing the same thing makes the object model more complex and makes the libraries bigger, so you need to be really sure such new APIs are justified.
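As a concrete illustration of the kind of leaner API being described, here is a hypothetical parse-from-range helper; the ParseInt32(string, start, length) signature is invented for this sketch and is not an existing BCL method.

using System;

class RangeParseSketch
{
    // Hypothetical helper: parse an integer out of a (start, length) range of a
    // larger string, so no Split call and no substring objects are needed.
    static int ParseInt32(string s, int start, int length)
    {
        int end = start + length;
        int i = start;
        bool negative = false;

        if (i < end && s[i] == '-') { negative = true; i++; }
        if (i == end)
            throw new FormatException("no digits in range");

        int value = 0;
        for (; i < end; i++)
        {
            if (s[i] < '0' || s[i] > '9')
                throw new FormatException("unexpected character");
            value = value * 10 + (s[i] - '0');
        }
        return negative ? -value : value;
    }

    static void Main()
    {
        string line = "10,-25,3";
        int sum = 0, start = 0;
        for (int i = 0; i <= line.Length; i++)
        {
            if (i == line.Length || line[i] == ',')
            {
                sum += ParseInt32(line, start, i - start);   // no Split, no intermediate strings
                start = i + 1;
            }
        }
        Console.WriteLine(sum);   // prints -12
    }
}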
This is why your feedback on which APIs to duplicate or tune is very important. Please let us know:
http://lab.msdn.microsoft.com/productfeedback/ - Anonymous
July 01, 2004
I consider myself fortunate, in that my first job was with Intel writing a profiler, so I've always thought in terms of performance. The truth is most developers don't, and even if they wanted to, few companies encourage it. The pressure is always to "write code faster", not to "write faster code". I worked for a "Web Analytics company" that tracked over 100 million hits a day by processing log files. Each hit was processed over 8 times, not to mention how often each process "touched" a particular byte. Everyone knew better, but the pressure was always on to create the next feature. The sad truth is most companies rely on the system, and "Moore's law", to be performant. - Anonymous
July 01, 2004
The comment has been removed - Anonymous
July 01, 2004
Terry: Why is that sad? If the system meets the needs of the company, then isn't it doing its job? Hasn't the company, by doing this, stated that these features are more beneficial to it than just speeding up the performance of the current system? - Anonymous
July 02, 2004
I like to think of the process as applying the same discipline even when just a very small amount of perf work is all that's called for.
Looking at Cyrus's comment above I wonder: how do we know that it's meeting the needs of the company if we haven't got an idea of what those needs are?
And the answer: In the easy case the needs are so modest that any remotely sensible solution will meet them.
So all the performance work you need to do is check that your runtime (or whatever metric) is in, say, the "5 minute zone" or the "30 minute zone" or something like that. Very simple goal, little or no planning required to meet it, and ultra-simple validation.
These are the same principles you would use if it had been a harsh goal. But having checked them off you can be sure you've met the (modest) needs of your company. - Anonymous
July 02, 2004
A DSP compiler can do inlining/loop fusion to get rid of the arrays and use a scalar instead.
I guess if the BCL provided a partial entry point for Split, it would be possible for a compiler to do that conversion.
However, I'm not sure how general that is. - Anonymous
July 10, 2004
Yeah, I want to be aware of arrays and strings.