Performance Quiz #6 -- The fifth cut goes foul!
Well, today Raymond hits a snag in version 5 of the program. I started by profiling the code as provided (in fact I had the .exe from Raymond himself, so I didn't even have to build it). The new version is actually slower (start to finish) than the previous one. Why? Well, there's a little bug in there.
struct DictionaryEntry
{
    bool Parse(const WCHAR* begin, const WCHAR* end);
    void Destruct() {
        delete[] m_pszTrad;
        delete[] m_pszSimp;   // this throws an exception
        delete[] m_pszPinyin;
        delete[] m_pszEnglish;
    }
    LPWSTR m_pszTrad;
    LPWSTR m_pszSimp;         // this field is never initialized
    LPWSTR m_pszPinyin;
    LPWSTR m_pszEnglish;
};
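For the record, one straightforward way to close this hole is to null every pointer up front so that teardown is safe no matter how far parsing got. This is just a sketch of the idea (Raymond's actual fix may well differ, and the typedefs here stand in for the Windows headers):

```cpp
#include <cassert>

typedef wchar_t WCHAR;
typedef WCHAR* LPWSTR;

struct DictionaryEntry
{
    bool Parse(const WCHAR* begin, const WCHAR* end);

    // Null the pointers before any parsing happens, so Destruct()
    // is safe even if Parse() never ran or bailed out partway through.
    void Construct() {
        m_pszTrad = m_pszSimp = m_pszPinyin = m_pszEnglish = 0;
    }

    void Destruct() {
        // delete[] on a null pointer is a well-defined no-op,
        // so every field can be freed unconditionally.
        delete[] m_pszTrad;
        delete[] m_pszSimp;   // no longer touches garbage
        delete[] m_pszPinyin;
        delete[] m_pszEnglish;
    }

    LPWSTR m_pszTrad;
    LPWSTR m_pszSimp;
    LPWSTR m_pszPinyin;
    LPWSTR m_pszEnglish;
};
```

A Construct()/Destruct() pair keeps the struct in the same constructor-free style the rest of the code uses; the only change is that the fields can no longer hold stack garbage when Destruct() runs.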
Remember, the timings I'm doing cover the entire run of the program -- start to finish -- and not just the dictionary creation time. That's a little different from what Raymond is measuring. That choice actually gives me quite a handicap, because my startup time includes all of the code necessary to get the CLR going, which is not insignificant (see my previous posting). This particular bug causes an exception to be raised during the teardown of the dictionary, which in turn slows things down by far more than the savings.
Why bring this up at all? I mean, it's just a minor mistake that's easily corrected, right? True, but the point is that the code necessary to reach this performance level is a good bit trickier. When considering development costs you need to think very carefully about what sort of code you will have to write to get the performance that you need. Even great developers make mistakes, so if you imagine this code as part of some bigger project, all similarly written, you could easily see developers making unfortunate mistakes when working in that code base. That makes people less likely to want to do performance work (or any work), and so the product suffers. Many people have already commented that this code has lost most of its object-oriented nature. I think this little bug is a symptom of that fact. I don't want to conclude too much, but I think it's fair comment that the code is getting harder to maintain.
The upshot of all this is that I can't give Raymond full credit for an improvement here; it's sort of a foul tip. Luckily the problem is easily corrected, so I fixed it in my copy of the source and rebuilt. Unfortunately that means I can't be sure I built it in exactly the same way Raymond did, so it's possible there are other differences in my corrected version 5 number. But, including the correction, version 5 is now 1.5 times faster than version 4. It's neck-and-neck with the managed code, at 124ms overall.
| Version | Execution Time (seconds) |
| --- | --- |
| Unmanaged v1 | 1.328 |
| Unmanaged v2 | 0.828 |
| Unmanaged v3 | 0.343 |
| Unmanaged v4 | 0.187 |
| Unmanaged v5 With Bug | 0.296 |
| Unmanaged v5 Corrected | 0.124 |
| Unoptimized Managed port of v1 | 0.124 |
And here again are the function costs, broken down in tree view, showing only functions with an inclusive cost of at least 5% to keep the size of the report manageable.
| Name (Sanitized) | Exclusive % | Inclusive % |
| --- | --- | --- |
| _mainCRTStartup | 0 | 94.606 |
| _main | 0 | 94.191 |
| Dictionary::Dictionary(void) | 2.905 | 67.635 |
| MultiByteToWideChar | 7.884 | 9.959 |
| DictionaryEntry::Parse(...) | 2.075 | 42.324 |
| AllocString(wchar_t const *,wchar_t const *) | 0.415 | 27.386 |
| _lstrcpynW | 7.054 | 7.469 |
| operator new(unsigned int) | 0 | 19.502 |
| _malloc | 0.415 | 19.502 |
| AllocateHeap | 6.639 | 19.087 |
| operator new(unsigned int) | 0 | 9.544 |
| _malloc | 0 | 9.544 |
| AllocateHeap | 2.905 | 9.544 |
| operator new(unsigned int) | 0.415 | 6.224 |
| _malloc | 0 | 5.809 |
| RtlAllocateHeap | 1.66 | 5.809 |
| Dictionary::~Dictionary(void) | 0.83 | 26.556 |
| _free | 1.245 | 25.726 |
| FreeHeap | 6.224 | 22.822 |
| EnterCriticalSection | 6.224 | 6.224 |
So we can see some good things have happened. The string-copying cost is way down, at only about 7.5%. The vector handling is totally gone from this run (actually it fell below my 5% cutoff -- it's down at 2.9% at this point). The new winner is operator new, which comes in at a total of about 35% of the execution time, spread over three different places. Also becoming increasingly important is the dictionary teardown time: it's at 26.5% now (up from 18.84%).
I've often said that it's nearly impossible to use "malloc" just as it comes. Applications that need high performance invariably end up wrapping it, creating some kind of custom allocator that is better suited to their needs. Looks like that's what we'll have to do next.
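To give a flavor of what such a wrapper might look like, here's a minimal bump-allocator sketch. Everything in it (the StringArena name, the chunk size, the interface) is my own invention for illustration, not code from either version of the program. The key observation is that all of the dictionary's strings are born together and die together, so we can carve them out of big chunks and skip the per-string free entirely:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bump allocator sketch: grab memory from the heap in big chunks and
// hand out slices with pointer arithmetic.  Since the strings all die
// with the dictionary, teardown is one delete[] per chunk instead of
// one free per string.
class StringArena
{
public:
    explicit StringArena(size_t chunkChars = 64 * 1024)
        : m_chunkChars(chunkChars), m_next(0), m_left(0) {}

    ~StringArena() {
        for (size_t i = 0; i < m_chunks.size(); ++i)
            delete[] m_chunks[i];   // one free per chunk, not per string
    }

    wchar_t* Alloc(size_t chars) {
        if (chars > m_left) {
            // Current chunk exhausted: start a new one, oversized
            // requests get a chunk of exactly their own size.
            size_t size = chars > m_chunkChars ? chars : m_chunkChars;
            m_chunks.push_back(new wchar_t[size]);
            m_next = m_chunks.back();
            m_left = size;
        }
        wchar_t* p = m_next;    // bump the cursor past this allocation
        m_next += chars;
        m_left -= chars;
        return p;
    }

private:
    size_t m_chunkChars;
    wchar_t* m_next;
    size_t m_left;
    std::vector<wchar_t*> m_chunks;
};
```

The dictionary code would then call something like arena.Alloc(cch) where it currently says new wchar_t[cch], and the destructor's four deletes per entry collapse into a single pass over the chunk list.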
Hmmm... shall I post the optimized version of the managed code now or wait until the unmanaged code is soundly beating me? Decisions decisions...
Comments
- Anonymous, May 19, 2005
  That's not the only bug in his program... if DictionaryEntry::Parse is partially successful but ultimately returns false, any of the strings allocated before false is returned are lost forever.
- Anonymous, May 31, 2005
  So I was reading through one of my favorite MSDN blogs (http://blogs.msdn.com/oldnewthing/)
  And he...
- Anonymous, July 31, 2006
  I'm just not an expert.