Jaa


Performance Quiz #6 -- Looking at the sixth cut

Well, it's time for me to surrender.  Sort of :)

Raymond pulls out all the stops in his sixth version by painting a big bullseye on his biggest remaining source of slowness which is operator new.   He turns in an excellent result here.  On my benchmark machine I see the number drop from 124ms to 62ms -- a full 2x faster from start to finish.  And observing the footnote on my previous message, the runtime for his application is now comparable to the CLR's startup overhead...  I can't beat this time.

Let's look at the results table now to see how we ended up:

Version Execution Time(seconds)
Unmanaged v1 1.328
Unmanaged v2 0.828
Unmanaged v3 0.343
Unmanaged v4 0.187
Unmanaged v5 With Bug 0.296
Unmanaged v5 Corrected 0.124
Unoptimized Managed port of v1    0.124
Optimized Managed port of v1 0.093
Unmanaged v6 0.062

Six versions and quite a bit of work later, we've been soundly trumped.  But before I discuss that, let me put up the internal profile of Raymond's version 6

I've applied my usual filters to the call tree (nothing lower than 5% inclusive) and I also pruned out a couple of functions below HeapAlloc because they have long names and are boring :)

Function Name (Sanitized)     ExclusivePercent      InclusivePercent
  _mainCRTStartup 0 97.826
    _main 0 97.826
      Dictionary::Dictionary(void) 5.435 96.739
        MultiByteToWideChar 19.565 25
          GetCPHashNode 5.435 5.435
     operator new(unsigned int) 1.087 16.304
          .. 0 14.13
            .. 0 14.13
              AllocateHeap 4.348 13.043
        _free 0 8.696
          FreeHeap 2.174 8.696
        DictionaryEntry::Parse(...) 1.087 33.696
          StringPool::AllocString(...) 2.174 27.174
            _lstrcpynW 19.565 25
              __SEH_prolog 5.435 5.435

You can see that the memory allocation time is way down as a percentage, and of course that's a smaller percentage of a smaller total time.  I think he gets a lot of raw speed from his improved locality thanks to that new allocator as well.  Interestingly SEH overhead is up to a signifcant level in this run (now over 5% for the first time).  Still nothing to be worried about.

So am I ashamed by my crushing defeat?  Hardly.  The managed code got a very good result for hardly any effort. To defeat the managed Raymond had to:

  • Write his own file/io stuff
  • Write his own string class
  • Write his own allocator
  • Write his own international mapping

Of course he used available lower level libraries to do this, but that's still a lot of work.  Can you call what's left an STL program?  I don't think so, I think he kept the std::vector class which ultimately was never a problem and he kept the find function.  Pretty much everything else is gone.

So, yup, you can definately beat the CLR.  Raymond can make his program go even faster I think.

Interestingly, the time to parse the file as reported by both programs internal timers is about the same -- 30ms for each.  The difference is in the overhead.

Tomorrow I'm going to talk about the space used by these programs and that will wrap it up.  Though I think Raymond is going to go on and do some actual UI and so forth with this series.  That should be fun to watch.

Comments

  • Anonymous
    May 19, 2005
    Thanks. This has been an excellent series, Rico. Showing how powerful the CLR is, as well as showing some concrete examples of how to measure and improve within its walls. I think this series needs to become a cohesive article for MSDN.
  • Anonymous
    May 19, 2005
    Great series of posts. It was wonderful to see the managed code perform so well right out of the gate. And, your subsequent optimization was simple and effective. Thanks for taking the time and sharing this with us.
  • Anonymous
    May 20, 2005
    Great series of posts. The CLR is very impressive.
  • Anonymous
    May 20, 2005
    The comment has been removed
  • Anonymous
    May 21, 2005
    >Six versions and quite a bit of work later, >we've been soundly trumped

    Excuse me, but this is not fair. Raymond has never tried to beat any "managed" version (I bet he doesn't even know about your effort - you started it after he went to vacation) - all he wanted to prove is that to achieve optimization you need to measure and not to gues where the bottleneck is.
  • Anonymous
    May 21, 2005
    I assure you Raymond doesn't need any defense :)

    Remember I work with Raymond, we talked about this series of postings at some length quite some time ago and then again when this got to the top of his queue.

    It's all supposed to be an educational illustration. The competitive language is just a facade to make it less of a dry read.

    The outcome was never really in question, the journey is the illustration.

    Actually I think the fact that Raymond just did what he thought was best (well in advance of my comments) at each stage makes it more interesting, not less.
  • Anonymous
    May 31, 2005
    So I was reading through one of my favorite MSDN blogs (http://blogs.msdn.com/oldnewthing/)

    And he...
  • Anonymous
    June 07, 2005
    So -- excluding startup time -- what's the performance comparison like?

    It seems to me that (total - startup) ~ 0.03s, about half of the total time for v6. If startup for v6 is ~0.00s, then the actual working bits still seem twice as fast in the CLR.
  • Anonymous
    June 07, 2005
    No, they each do the job in about 30ms. The CLR has 30ms more startup overhead basically.

    Check out the space analysis for the reasons why.
  • Anonymous
    June 12, 2005
    That is kind of interesting.

    Do you have figures on the difference in memory usage?
  • Anonymous
    June 12, 2005
    Oops, just discovered the next blog entry :)
  • Anonymous
    July 31, 2006
    I'm just not an expert.
  • Anonymous
    January 04, 2007
    PingBack from http://console.writeline.net/blog/?p=6