Performance Quiz #6 -- Looking at the third cut
The fun continues as today we look at Raymond's third improvement.
Raymond starts using some pretty hefty Win32 magic with a mapped file view to eliminate once-and-for-all any costs associated with the getline function. And good news, his new version is 2.41 times faster than the previous version. Myself I don't think I would have gone for the mapped file. I think I would have just used regular I/O in bigger chunks and done my own line splitting at this point. I didn't ask Raymond but I bet he did it this way at least partly because he's comfortable with MapViewOfFile.
Version | Execution Time(seconds) |
Unmanaged v1 | 1.328 |
Unmanaged v2 | 0.828 |
Unmanaged v3 | 0.343 |
Unoptimized Managed Port |
0.124 |
Let's drill into the costs a little bit more, this time I'll show them in tree view. This is all the function with an inclusive cost of at least 5%. That trims the tree nicely and yet still shows the meat.
Sanitized Function Name | ExclusivePercent | InclusivePercent |
_mainCRTStartup | 0.00 | 96.10 |
_main | 0.00 | 96.10 |
Dictionary::Dictionary(void) | 0.12 | 87.94 |
DictionaryEntry::Parse(...) | 0.24 | 10.96 |
std::basic_string< ... >::assign(...) | 0.12 | 7.07 |
std::basic_string<... >::_Grow(unsigned int,bool) | 0.12 | 6.70 |
std::basic_string<... >::_Copy(unsigned int) | 0.24 | 6.46 |
operator new(unsigned int) | 0.00 | 6.09 |
.. | 0.00 | 6.09 |
.. | 0.00 | 6.09 |
AllocateHeap | 1.83 | 5.97 |
std::codecvt<...>::do_in(...) | 3.41 | 59.44 |
__Mbrtowc | 3.41 | 56.03 |
InterlockedDecrement | 13.03 | 13.03 |
InterlockedIncrement | 9.74 | 9.74 |
Mbrtowc_lk | 3.41 | 29.84 |
MultiByteToWideChar | 11.69 | 26.43 |
GetCPHashNode | 14.37 | 14.37 |
std::vector<DictionaryEntry,...>::insert(..etc1..) | 0.12 | 7.43 |
std::vector<DictionaryEntry,...>::insert(..etc2..) | 0.12 | 7.19 |
std::vector<DictionaryEntry,...>::~vector<…. >(void) | 0.00 | 7.92 |
std::vector<DictionaryEntry,... >::_Destroy(...) | 0.00 | 7.92 |
DictionaryEntry::~DictionaryEntry(void) | 0.24 | 7.92 |
std::basic_string<...>(void) | 0.24 | 7.19 |
std::basic_string< … >::_Tidy(bool) | 0.37 | 6.94 |
_free | 0.00 | 6.58 |
FreeHeap | 1.34 | 6.58 |
Looking at that tree we can make some interesting observations. First it's pretty clear that do_in is what we should target. There's a lot of work going on there, probably more than is necessary. Goodness knows why InterlockedIncrement and InterlockedDecrement are being called by _Mbrtowc -- and that's adding up to something. Anyway you slice it going directly to MultiByteToWideChar which is what's doing the actual work is seeming like a really good idea. I bet that's where Raymond goes next.
Also worth noting is that we're measuing the teardown cost of the dictionary and that is adding 7.92% to the overall compute time almost all of which is freeing the memory. There's a good lesson there -- many application try very hard to free their memory before they exit but I find this to generally be a total waste of time. The OS will give that memory back just fine -- the last thing you should do is force it all to be loaded back into the processe's working set and then the processors cache. It might have been swapped out! In this case it's not much anyway though.
We're working pretty hard and still almost 3x slower than the managed code. Stay tuned as Quiz #6 continues to develop :)
Comments
- Anonymous
May 13, 2005
The comment has been removed - Anonymous
May 13, 2005
In both versions I'm using total process time for the reported time. The profiler is only being used to see the breakdown. They are both being measured from process launch to process exit.
The time the program prints (in both cases) is only a fraction of the total run so I'm not using that. Although I suppose I could. But then GetTickCount() isn't especially accurate. - Anonymous
May 13, 2005
So, to clarify, you're not running a "profiling build" for measuring the total time? With the wrong settings, that alone can multiply run time. What's worse - it can shift which methods are actually consuming the most cycles, which of course is a delicate problem. - Anonymous
May 14, 2005
I'd be feeling pretty stupid right now if I had done that :) - Anonymous
May 14, 2005
Sounds good. Thanks! - Anonymous
May 15, 2005
Were measurements done including or excluding .exe startup time (before executing first user instruction)?
Were required libraries already present in memory (loaded by devenv.exe or something)?
Were required libraries present in disk cache?
Somehow I think real world scenerio should include a possibility that the dictionary is the first .NET program started (or that previous .NET program was unloaded long time ago). - Anonymous
May 15, 2005
The comment has been removed - Anonymous
May 15, 2005
The comment has been removed - Anonymous
May 15, 2005
Raymond is just going after the biggest things first -- which is only reasonable. At the moment he's still got bigger problems than that 8% overhead. Let's see what develops :) - Anonymous
May 31, 2005
So I was reading through one of my favorite MSDN blogs (http://blogs.msdn.com/oldnewthing/)
And he...