Performance Quiz #6 -- Looking at the sixth cut
Well, it's time for me to surrender. Sort of :)
Raymond pulls out all the stops in his sixth version, painting a big bullseye on his biggest remaining source of slowness: operator new. He turns in an excellent result here. On my benchmark machine I see the number drop from 124ms to 62ms -- a full 2x faster from start to finish. And, as noted in the footnote on my previous posting, the runtime for his application is now comparable to the CLR's startup overhead... I can't beat this time.
Let's look at the results table now to see how we ended up:
| Version | Execution Time (seconds) |
|---|---|
| Unmanaged v1 | 1.328 |
| Unmanaged v2 | 0.828 |
| Unmanaged v3 | 0.343 |
| Unmanaged v4 | 0.187 |
| Unmanaged v5 with bug | 0.296 |
| Unmanaged v5 corrected | 0.124 |
| Unoptimized managed port of v1 | 0.124 |
| Optimized managed port of v1 | 0.093 |
| Unmanaged v6 | 0.062 |
Six versions and quite a bit of work later, we've been soundly trumped. But before I discuss that, let me put up the internal profile of Raymond's version 6.
I've applied my usual filters to the call tree (nothing lower than 5% inclusive) and I also pruned out a couple of functions below HeapAlloc because they have long names and are boring :)
| Function Name (Sanitized) | Exclusive % | Inclusive % |
|---|---|---|
| _mainCRTStartup | 0 | 97.826 |
| _main | 0 | 97.826 |
| Dictionary::Dictionary(void) | 5.435 | 96.739 |
| MultiByteToWideChar | 19.565 | 25 |
| GetCPHashNode | 5.435 | 5.435 |
| operator new(unsigned int) | 1.087 | 16.304 |
| .. | 0 | 14.13 |
| .. | 0 | 14.13 |
| AllocateHeap | 4.348 | 13.043 |
| _free | 0 | 8.696 |
| FreeHeap | 2.174 | 8.696 |
| DictionaryEntry::Parse(...) | 1.087 | 33.696 |
| StringPool::AllocString(...) | 2.174 | 27.174 |
| _lstrcpynW | 19.565 | 25 |
| __SEH_prolog | 5.435 | 5.435 |
You can see that the memory allocation time is way down as a percentage -- and of course that's a smaller percentage of a smaller total time. I think he also gets a lot of raw speed from the improved locality that the new allocator provides. Interestingly, SEH overhead is up to a significant level in this run (over 5% for the first time), but it's still nothing to be worried about.
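To make the locality point concrete, here's a minimal sketch of the kind of bump-style pool that a function like StringPool::AllocString suggests. This is not Raymond's actual code -- the chunk size and the details are invented for illustration -- but the shape is typical: each allocation is a pointer increment out of a large block, nothing is freed individually, and strings that are parsed together end up adjacent in memory.

```cpp
// Sketch of a bump-style string pool (illustrative only, not Raymond's code).
// Strings are carved out of large chunks: allocation is a pointer increment,
// there is no per-string free, and neighboring strings share cache lines.
// Error handling is omitted for brevity.
#include <windows.h>
#include <string.h>
#include <vector>

class StringPool
{
public:
    StringPool() : m_cur(NULL), m_end(NULL) {}
    ~StringPool()
    {
        for (size_t i = 0; i < m_chunks.size(); i++)
            HeapFree(GetProcessHeap(), 0, m_chunks[i]);
    }

    // Copy the range [begin, end) into the pool, returning a null-terminated copy.
    LPWSTR AllocString(const WCHAR* begin, const WCHAR* end)
    {
        size_t cch = (end - begin) + 1;               // room for the terminator
        if (m_cur == NULL || (size_t)(m_end - m_cur) < cch)
            Grow(cch);
        LPWSTR psz = m_cur;
        memcpy(psz, begin, (cch - 1) * sizeof(WCHAR));
        psz[cch - 1] = L'\0';
        m_cur += cch;
        return psz;
    }

private:
    void Grow(size_t cchMin)
    {
        size_t cch = cchMin > 32768 ? cchMin : 32768; // 64KB chunks, an arbitrary choice
        WCHAR* chunk = (WCHAR*)HeapAlloc(GetProcessHeap(), 0, cch * sizeof(WCHAR));
        m_chunks.push_back(chunk);
        m_cur = chunk;
        m_end = chunk + cch;
    }

    WCHAR* m_cur;
    WCHAR* m_end;
    std::vector<WCHAR*> m_chunks;
};
```

Compared with calling operator new once per string, a pool like this turns thousands of heap operations into a handful of chunk allocations, which is exactly the kind of shift the profile above shows.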
So am I ashamed of my crushing defeat? Hardly. The managed code got a very good result for hardly any effort. To defeat the managed version, Raymond had to:
- Write his own file I/O (a sketch of one typical approach follows this list)
- Write his own string class
- Write his own allocator
- Write his own international mapping
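As an example of the first item, "write your own file I/O" on Win32 usually means mapping the file into memory and parsing it in place instead of pulling lines through the C++ iostreams. I'm not claiming this is exactly what version 6 does, but it's the usual shape of the trick; error handling is trimmed for brevity.

```cpp
// Sketch of roll-your-own file I/O: map the dictionary file and parse in place.
// Illustrative only; error handling and cleanup paths are trimmed.
#include <windows.h>

const char* MapDictionary(LPCWSTR pszFile, DWORD* pcb)
{
    HANDLE hFile = CreateFileW(pszFile, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return NULL;

    *pcb = GetFileSize(hFile, NULL);

    HANDLE hMap = CreateFileMappingW(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
    CloseHandle(hFile);                  // the mapping object keeps the file alive
    if (hMap == NULL) return NULL;

    const char* pb = (const char*)MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(hMap);                   // the view keeps the mapping alive
    return pb;                           // caller releases with UnmapViewOfFile
}
```

The payoff is that there is no per-line read call and no intermediate buffer: the parser walks the mapped bytes directly.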
Of course he used available lower-level libraries to do this, but that's still a lot of work. Can you call what's left an STL program? I don't think so: he kept the std::vector class, which ultimately was never a problem, and he kept the find function. Pretty much everything else is gone.
So, yup, you can definitely beat the CLR. And I think Raymond can make his program go even faster still.
Interestingly, the time to parse the file as reported by both programs' internal timers is about the same -- roughly 30ms each. The difference is in the overhead.
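(For the curious, an internal timer of this sort is nothing exotic. Here's a minimal sketch of how you might bracket just the parse on the unmanaged side. This is not either program's actual harness, just the idea of excluding process startup from the measurement.)

```cpp
// Minimal sketch of an internal timer around just the parsing work, so that
// process startup overhead is excluded from the reported number.
#include <windows.h>
#include <stdio.h>

int main()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    // ... construct the Dictionary here (the part both programs report) ...
    QueryPerformanceCounter(&end);

    printf("parse: %.1f ms\n",
           (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart);
    return 0;
}
```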
Tomorrow I'm going to talk about the space used by these programs and that will wrap it up. Though I think Raymond is going to go on and do some actual UI and so forth with this series. That should be fun to watch.
Comments
- Anonymous
May 19, 2005
Thanks. This has been an excellent series, Rico. Showing how powerful the CLR is, as well as showing some concrete examples of how to measure and improve within its walls. I think this series needs to become a cohesive article for MSDN. - Anonymous
May 19, 2005
Great series of posts. It was wonderful to see the managed code perform so well right out of the gate. And, your subsequent optimization was simple and effective. Thanks for taking the time and sharing this with us. - Anonymous
May 20, 2005
Great series of posts. The CLR is very impressive. - Anonymous
May 21, 2005
> Six versions and quite a bit of work later, we've been soundly trumped
Excuse me, but this is not fair. Raymond never tried to beat any "managed" version (I bet he doesn't even know about your effort -- you started it after he went on vacation). All he wanted to prove is that to optimize you need to measure, not guess where the bottleneck is. - Anonymous
May 21, 2005
I assure you Raymond doesn't need any defense :)
Remember I work with Raymond, we talked about this series of postings at some length quite some time ago and then again when this got to the top of his queue.
It's all supposed to be an educational illustration. The competitive language is just a facade to make it less of a dry read.
The outcome was never really in question; the journey is the illustration.
Actually I think the fact that Raymond just did what he thought was best (well in advance of my comments) at each stage makes it more interesting, not less. - Anonymous
May 31, 2005
So I was reading through one of my favorite MSDN blogs (http://blogs.msdn.com/oldnewthing/)
And he... - Anonymous
June 07, 2005
So -- excluding startup time -- what's the performance comparison like?
It seems to me that (total - startup) ~ 0.03s, about half of the total time for v6. If startup for v6 is ~0.00s, then the actual working bits still seem twice as fast in the CLR. - Anonymous
June 07, 2005
No, they each do the job in about 30ms. The CLR has 30ms more startup overhead basically.
Check out the space analysis for the reasons why. - Anonymous
June 12, 2005
That is kind of interesting.
Do you have figures on the difference in memory usage? - Anonymous
June 12, 2005
Oops, just discovered the next blog entry :) - Anonymous
July 31, 2006
I'm just not an expert. - Anonymous
January 04, 2007
PingBack from http://console.writeline.net/blog/?p=6