Performance Quiz #6 -- Looking at the second cut
Stefang jumped into the fray with his analysis in the comments from my last posting. Thank you Stefang. And it seems, as usual, things aren't always what they appear :)
Let's continue the story without me giving away the ending just yet though. How is Raymond is doing? My last posting studied his first effort, that's the one that I ported for Quiz #6, and today we'll look at his second effort and see how he did. Remember we were targetting the time in the getline function.
To get a rough idea if he's doing well I measured the raw execution time on my own machine, from process start to process exit.
Unmanaged Version 1: Time of execution 1.328
Unmanaged Version 2: Time of execution 0.828
So the new version is 1.6 times faster. Nice!
And, in some detail, this what I see for his version #2.
Function Name | Exclusive Percent | Inclusive Percent |
_main | 0 | 97.695 |
_mainCRTStartup | 0 | 97.695 |
Dictionary::Dictionary(void) | 0.231 | 94.744 |
std::getline<char,struct std::char_traits... etc. | 0 | 61.918 |
std::getline<char,struct std::char_traits... another etc | 0.922 | 59.382 |
std::basic_filebuf<char,...>::underflow(void) | 0.692 | 37.575 |
std::basic_filebuf<char,...> >::uflow(void) | 2.259 | 34.947 |
_fgetc | 3.043 | 32.826 |
std::_Fgetc(char &,struct _iobuf *) | 0.461 | 32.688 |
__lock_file | 1.614 | 21.254 |
And sure enough the time in getline has dropped to about 62%. Of course that's 62% of a smaller number too so it's even better than it looks.
Interestingly, we're still paying what looks a hefty toll for using a multi-threaded version of the runtime... that's what we get by default. But it would be interesting to see what you guys get if you statically link with a different runtime.
Function Name(sanitized) | ExclusivePercent | InclusivePercent |
RtlLeaveCriticalSection | 18.811 | 18.811 |
RtlEnterCriticalSection | 18.119 | 18.119 |
GetCPHashNode | 4.61 | 4.61 |
InterlockedIncrement | 4.38 | 4.38 |
InterlockedDecrement | 4.149 | 4.149 |
MultiByteToWideChar | 3.965 | 8.852 |
__lock | 3.827 | 19.64 |
_fgetc | 3.043 | 32.826 |
__SEH_prolog | 2.582 | 2.582 |
std::basic_filebuf<char,struct std::char_traits<char> >::uflow(void) | 2.259 | 34.947 |
And I'm going to tip my hand now without doing the analysis, here's how the numbers stack up so far:
Version | Execution Time (seconds) |
Unmanaged v1 | 1.328 |
Unmanaged v2 | 0.828 |
Unoptimized Managed Port | 0.124 |
Color me happy :) I get numbers comparable to what Stefang reported. But the story isn't over yet! Raymond's still got plenty of ammunition. Let's see what happens when he brings out the heavy artillery. And we've yet to see what the profile looks like for the managed code.
Comments
- Anonymous
May 12, 2005
This is a really interesting demonstration of how to engineer for performance. And to serialize the tale like a detective novel... Brilliant! Keep up the good work. - Anonymous
May 16, 2005
The timing for this program depends very much on compiler settings.
I have tried two variants:
1) cl /EHsc /MT cpp.cpp (Multi-threaded library, no optimization)
2) cl /EHsc /MT /O2 cpp.cpp (Multi-threaded library, optimize for speed)
I get the following times:
1) 1400 ms
2) 940 ms
The main reason for this is that much of the STL code gets inlined and optimized away when using /O2
I am really curious about your timings. You say that your program spends 62% in getline().
When I run the two program compiled with the different settings above I get the following getline() percentage on my machine (complied with VS2003)
1) 30%
2) 13%
What could be the possible reason why you get 62% and I never get over 30% ?
I suspect that you are using VS2005 and that the reason for the performance difference is in differences in the STL library for VS2003 and VS2005.
Perhaps there is a performance bug in STL for VS2005 ?
In that case, maybe you could do something about it before VS2005 is released ? - Anonymous
May 17, 2005
It's really hard to say what the root causes might be. I have an advantage you didn't have -- I'm measuring the binaries Raymond built because I asked him to send them to me so measurements would match his results as closely as possible.
Of course adding /O2 can really change things -- perhaps Raymond build with /O1 ? We'll have to ask him when he gets back. I could try building everything again to see what difference it makes. Building without /MT might be interesting. But it's a whole different exercise.
There may also be machine architecture issues. I'm running on a P4 at about 3Ghz with HT enabled for what it's worth. - Anonymous
May 17, 2005
The reason I suspect that there is a problem with the STL implementation is that you show a function called
std::_Fgetc(char &,struct _iobuf *) with 32% incl.
No such function is used when I run the program.
in my runs _fgetc calls filbuf directly, and _fgetc only uses 4% of the runtime.
Something is really weird here. - Anonymous
May 31, 2005
So I was reading through one of my favorite MSDN blogs (http://blogs.msdn.com/oldnewthing/)
And he...