Freigeben über


Bad Analysis Worse Than None

Once again I’ll begin by saying that I’m simplifying the below in the interest of not writing a novel, so please take this article as only approximately correct.  I try to make the principles more important than the details.

 

I’ve said before, that if you make no measurements, you’re pretty much doomed to miserable performance.  If you can get good performance without measuring then I think that you’ve missed your calling in life and probably you should change careers to “Professional Lottery Ticket Picker” and please call me up J

 

So the only thing worse than no data, is bad data, or badly interpreted data, or both.  And now without further ado, here is Rico’s free advice on bad data.

 

1) If you’re using an invasive profiler, don’t believe the times it gives you

 

When I say an invasive profiler, I mean one that instruments every call site in the program that’s being monitored.  And when I say don’t believe the times, I mean you might be better off just pretending it isn’t even giving you times at all.  There are several reasons why this is usually a good idea.

 

A typical invasive profiler might hook the prolog and epilog of every function.  It measures the time spent in each call and then, in digital defiance of the Uncertainty Principle, it attempts to subtract out its own influence.  Actually the problem here isn’t really quantum mechanical at all – the problem has to do with numerically unstable computations.

 

The function being measured is going to take some amount of time let’s call it x. And, performing the measurement is going to take some amount of time, in a fit of originality we’ll call that y.  When we hook the function and note its entry time and exit time, we’ll have actually measured x+2y.  We only want x so we’re going to have to subtract out the 2y. The way this is usually done is that at the beginning of the run we carefully measure how long it takes to take a measurement, thereby getting a high precision measurement of y.  And armed with that number we bravely multiply it by two and prepare to subtract out all the interference.

 

Several things go wrong at this point.

 

First, y is very small, and if our function is a small hot thing being called very often then x is also very small.  The worst thing you can do to kill your significant digits is try to subtract two things that are close in magnitude in order to get a (much smaller) difference.  That’s exactly what we’d be doing here for small functions, because x+2y is actually quite close to 2y.  The next thing we’ll do is repeat this operation, perhaps tens of thousands of times for frequently called functions, multiplying the potential error.  I can feel my Numerical Analysis professor shuddering.

 

Second, y has significant variability from function to function.  This is what’s making the first problem truly deadly.  You see, on modern processors the cost of doing pretty much anything is deeply dependant on what happened previously, and in some cases on what is to come.  Even simple looking instructions like ones that record a time into some kind of buffer can and are deeply affected by the processors cache.  As a result, the average measurement of y at program startup could be significantly different than what is observed in any given function.  That spells trouble.  I remember a friend working on one of these started finding that in some cases he was getting negative times in functions because the overhead was much less in those cases and subtracting out the 2y measured at startup was more than the entire observed x+2y in the run.  Ouch.

 

But all is not lost, the bigger the function, or more generally, the bigger the time period being measured, the greater the accuracy.  But remember, if the function calls other functions all those nested overheads will need to be subtracted out as part of the process so that’s not helping you at all.   Only the big meaty worker functions are truly getting the best measurements.  Small functions, put little trust in the number, large function, put more trust.  Don’t compare the time spent calling a small function thousands of times to the time spent in fewer calls to a big function as the error margin is likely to be huge (I’ve seen times that were off by a factor of 5).

 

What these profilers excel at is giving you a picture of what is being called and how often.  In fact the data here is definitive.  So do trust the counts and count origins and use those to guide your analysis.  In my experience, these kinds of runs are the most costly and the least reproducible from run to run (time-wise).  That makes it hard to know if you’re actually making progress.  Take the times with a big grain of salt.  Milk the counts.

 

2) If you’re using a sampling profiler, make sure you have plenty of samples

 

Would you believe that 4 out of 5 dentists really prefer Brand X toothpaste if only 5 dentists were actually surveyed?  Of course not.  Yet perfectly sane engineers believe that 10% of their startup execution is in Function X because 1 of their 10 samples happened to land in that function.

 

Sampling profilers work just as the name implies.  They sample the processors IP (and sometimes a whole callstack) periodically when some event occurs, usually the passage of a fairly small amount of time, like say 5 milliseconds.   I like sampling profilers because they are the least invasive and so you can mostly trust the times.  I say mostly because you must have a statistically valid sample size to make any conclusions at all. 

 

Where sampling profilers really fall down is when you’re trying to measure, in detail, a brief event that occurs infrequently.  Actually even getting detail on a long event that occurs infrequently is problematic because you’ll miss so much of the fine detail with the sampling.  You really need the event to be repeated so that you have plenty of chances to get a sample at many interesting points before things get good.

 

Even with these limitations, for my money, sampled data is the best.

 

3) Fear the interference

 

I sort of hinted at this earlier but it bears discussion on its own merits. 

 

When measuring a large program it’s temping to isolate the part that’s interesting to you and measure just that without running the rest.  You might even be able to boil the relevant part down into a so-called micro-benchmark.  It’s very temping considering the time savings and productivity gains you can get by reducing the analysis.  Don’t do it.  Or at least, don’t do it without being very careful.

 

A big part of how your program runs and where the costs are has to do with how it interferes with itself.  A simple example is where two algorithms both tuned to have good caching behavior might play very poorly together because collectively they create a very bad caching pattern.  But even algorithms that are playing comparatively well together are placing somewhat of a burden on each other. If you were to run only part of the program and measure that, you might not be able to conclude very much about how changing the part you measured would affect the whole.

 

The very best measurements, and the only ones that really count, are the ones that time what the user will experience.  Anything that isn’t the user experience is something we’ve created for our own productivity, and those are useful only to the extent that they are probative. Do they help us to find real user problems more quickly?  Do they predict the user experience at all?  What correction factor must be applied (if any)?

 

I think about 4 times a year I see a new implementation of memcpy that is purportedly better than all previous known memcpy’s.  Now the thing about memcpy is people have been thinking about this problem for a long time, so already you have to treat a newcomer with some suspicion.  But the real kicker is that the proof that is usually offered comes in the form of micro-benchmarks of the standard memcpy and the super-memcpy side-by-side.  This is kind of interesting but it may not be very probative.  What really matters is: “Is the new thing faster in the context of actual programs making calls to memcpy?” Is it as frugal with cache lines?  Are there other side-effects?  Does the benchmark accurately reflect the typical alignment and length sizes seen or is it all “easy cases?”

 

I think it’s fair to say that memcpy is not changed four times a year, notwithstanding the availability of new implementations at that rate.  Thorough analysis on even so small a thing is after all quite difficult.

Comments

  • Anonymous
    February 16, 2004
    The comment has been removed

  • Anonymous
    February 16, 2004
    What is the best profiler that shows memory leaks and exceptions

  • Anonymous
    February 16, 2004
    Good article. I noted that you did not discuss instrumenting code with analysis probes, such as performance counters, to help get a handle on performance issues. Even though it is manual and invasive I've found that to be a useful technique. What is your take on that?

  • Anonymous
    February 16, 2004

    a better memcpy?

    Funny, this remembers the Amiga and its Blitter dsp. The blitter would do all those kind of things, so why would anyone with common sense try to write code for that matter instead? Stupid, waste of time.

    How does this translate today? Simple, display cards have built-in blitters. I don't know see anyone anxious about performance issues try to reinvent memcpy, ignorance set aside, while the display card can do that.
    Of course, by analogy, texture tiling can serve some other purposes as well.

  • Anonymous
    February 16, 2004
    Problem is, how do you access the blitter on any gfx card in a standard way.

  • Anonymous
    February 17, 2004
    Rico,
    It would be very helpful if you could include the names of products that you recommend. In general, my feeling is that Microsoft's best practice development environment is not well distributed to Microsoft shops. It would be very helpful if your products included full lifecycle tools, or recommendations.
    Thanks.

  • Anonymous
    February 17, 2004
    The comment has been removed

  • Anonymous
    February 17, 2004

    "Problem is, how do you access the blitter on any gfx card in a standard way."

    IDirectDrawSurface->Blt(...)

    After all a direct draw surface is just a block.

  • Anonymous
    February 17, 2004
    How do I do that on other .NET runtimes without DirectX. That is a Proprietry method not standard.

  • Anonymous
    February 17, 2004

    DirectX is part of the OS, just in case.

  • Anonymous
    February 17, 2004
    The comment has been removed

  • Anonymous
    February 17, 2004
    Good article. In my experience using Quantify I find that the ratio's of how much time was spent in one method versus another to be accurate however the actual times reported are off as you say. For example if in reality MethodA takes 10 seconds and MethodB takes 20 seconds Quantify may report that MethodA takes 20 seconds and MethodB takes 40 seconds. However when I'm profiling all I really care about is the relative weights of the times in relation to each other, not benchmarks, so as long as the ratios are the same is all that matters.

  • Anonymous
    February 17, 2004
    Yeah but the second I hookinto DX I am no longer cross platform so I may aswell not use C#.

  • Anonymous
    February 17, 2004
    In my experience, even the relative times can be off. In particular small functions tend to get overweighed.

    The trick is to just be careful. Certainly don't take the times as gospel.

  • Anonymous
    February 18, 2004
    Wise words on profiling from Rico

  • Anonymous
    February 18, 2004
    "How does this translate today? Simple, display cards have built-in blitters. I don't know see anyone anxious about performance issues try to reinvent memcpy, ignorance set aside, while the display card can do that.
    Of course, by analogy, texture tiling can serve some other purposes as well. "
    No, this will not work :)

    The Amiga blitter worked on the same memory as the 680x0 chips did, using the uneven cycles of the CPU to access the ram via the bus so it wouldn't stall any system activity. (pretty clever system actually :)). In a PC, the GPU works on its own memory, so it can't copy memblocks around in mainmem, because it doesn't have access to that ram. Moving data back and forth the AGP bus is IMHO slower than having a CPU-based memcopy. Nevertheless, a DMA based chip somewhere in the system which does have mainmem access could do what you suggest.

  • Anonymous
    February 18, 2004
    The comment has been removed

  • Anonymous
    February 18, 2004
    Optical Hypertransport bus not fast enough for you?

  • Anonymous
    September 25, 2006
    Well, once again my elite readers have made a solution posting by me nearly redundant.  There were...

  • Anonymous
    December 21, 2006
    Well, once again my elite readers have made a solution posting by me nearly redundant. There were many

  • Anonymous
    June 04, 2008
    [Weblog specializing in performance optimization] [Virtual Source Code Control Systems: Promoting and Managing Projects using Visual SourceSafe]...