Testing performance

A couple of weeks ago, I described the test development priorities for the .NET Compact Framework team.  As part of that discussion, I stated that performance should be tested in parallel with the other forms of testing (unit, customer scenarios, etc.).  Today, I would like to spend some time talking about performance testing.

General considerations
When testing performance, there are a few key things to keep in mind. 

First, it is important to measure only the intended scenario.  In the past, I have seen a number of tests that were intended to time a data processing operation (ex: a sort algorithm) but inadvertently included the time to load the data from the file system.  As a result, the performance data being reported was quite inaccurate.
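As a simple sketch, the file I/O can be kept outside of the timed region so that only the processing operation is measured.  (LoadData, the file name, and the data type below are hypothetical placeholders; Scenario stands for the operation being timed, as in the later examples.)

// load the data before starting the timer, so file I/O is not part of the measurement
Int32[] data = LoadData(@"\My Documents\input.dat");   // hypothetical helper and path

// time only the data processing operation (ex: the sort)
Int32 start_ms = Environment.TickCount;
Scenario(data);
Int32 finish_ms = Environment.TickCount;
Int32 elapsed_ms = finish_ms - start_ms;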

Second, since today's operating systems are multi-tasking, it is recommended to test performance on as clean a system as possible (fewest running processes) and to run the performance test multiple times to ensure accurate data.  Depending on the scenario being timed, the jitter (variance in results) can be significant and is important to take into account when reporting results.  I will talk a bit more about performance reporting shortly.
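As a rough illustration of measuring that jitter (RunTimedScenario is a hypothetical helper that performs one timed run and returns its result in milliseconds), the spread between the fastest and slowest runs can be computed like this:

// collect several measurements and report the spread (jitter)
Int32 runs = 5;
Double[] results_ms = new Double[runs];
for(Int32 i = 0; i < runs; i++)
{
    results_ms[i] = RunTimedScenario();   // hypothetical helper; one timed run in milliseconds
}

Double fastest = results_ms[0];
Double slowest = results_ms[0];
for(Int32 i = 1; i < runs; i++)
{
    if(results_ms[i] < fastest) { fastest = results_ms[i]; }
    if(results_ms[i] > slowest) { slowest = results_ms[i]; }
}
Double jitter_ms = slowest - fastest;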

Lastly, I have found that the most meaningful performance measurements for the applications I have written are taken when the application is in sustained operation.  What I mean by this is that any Just-In-Time (JIT) compilation has already occurred and I am measuring the typical run-time performance of my scenario.  Most often, code is JIT compiled once per instance of the application, so the performance impact is felt only the first time the method is called.  There are some exceptions to this rule of thumb, and I have found that in those cases, even when pre-JIT compiling the scenario, the results are reasonably accurate.  What are these exceptions?  Memory pressure and moving the application to the background.  In each of these cases, the .NET Compact Framework's Garbage Collector will run and may discard JIT-compiled code, which must then be recompiled when the situation is resolved (memory has been freed, or the application is brought back to the foreground).  In both of these cases, the timing loop will reflect the time spent freeing memory and re-JIT compiling.

Macro-benchmarks
Macro-benchmarks are performance tests for long-duration scenarios.  By "long", I mean that the scenario takes a second or more to complete.  When testing macro-benchmarks, I typically run the scenario a handful of times (10-50) in a loop.  Because the scenario takes a significant amount of time to run, the time spent looping does not significantly impact the measurements.  When the timing loop is complete, dividing the time spent (the number of elapsed milliseconds) by the number of loop iterations gives the scenario's performance measurement, as shown in the following example.
// run the scenario once (to avoid timing the initial Just-In-Time compile)
Scenario(data);

// timing loop
Int32 iterations = 10;
Int32 start_ms = Environment.TickCount;
for(Int32 i = 0; i < iterations; i++)
{
    Scenario(data);
}
Int32 finish_ms = Environment.TickCount;

// calculate the performance
Double average_ms = (Double)(finish_ms - start_ms) / (Double)(iterations);
Macro-benchmarks are often based on customer scenarios, and improving their performance can lead to significant gains in customer satisfaction.  At times, making improvements to macro-benchmarks requires changes to the scenario implementation.  To address this, I highly recommend measuring your scenarios as early as possible in the development of the product.  That way, if the performance does not meet the requirements, there is time to write and test the new implementation.

Micro-benchmarks
Micro-benchmarks are performance tests on the small scale, measuring very short-duration operations.  To get an accurate measurement, performance tests need to run the scenario a large number of times (I typically run my micro-benchmark scenarios 10,000 or more times).  Also, it is important to keep in mind that, unlike with macro-benchmarks, the time spent looping can significantly impact the performance measurements.  To minimize this impact, I recommend what I call "playing optimizing compiler" and partially unrolling your timing loop.  The example below updates the macro-benchmark example for micro-benchmark testing.
// run the scenario once (to avoid timing the initial Just-In-Time compile)
Scenario(data);

Int32 iterations = 1000;
Int32 callsInLoop = 10;

// timing loop
Int32 start_ms = Environment.TickCount;
for(Int32 i = 0; i < iterations; i++)
{
    Scenario(data);
    Scenario(data);
    Scenario(data);
    Scenario(data);
    Scenario(data);
    Scenario(data);
    Scenario(data);
    Scenario(data);
    Scenario(data);
    Scenario(data);
}
Int32 finish_ms = Environment.TickCount;

// calculate the performance
Double average_ms = (Double)(finish_ms - start_ms) / (Double)(iterations * callsInLoop);
Since the scenario is very short, this test runs the code 10,000 times and partially unrolls the loop (1,000 iterations of 10 calls).  This significantly minimizes the impact of the time spent looping.  In one example scenario I tested, the time reported when using a tight loop (10,000 iterations of 1 call) was 1 millisecond.  When using the timing loop shown above, that time was reduced to 0.95 milliseconds - 5% faster than previously reported.  With additional loop unrolling (ex: 500 iterations of 20 calls) we can further improve the accuracy of the measurement.  Of course, there is a point of diminishing returns where continued unrolling becomes unrealistic and the improved measurement accuracy is no longer significant.

Reporting performance results
I mentioned results reporting earlier.  When I report performance results, I use one of two methods: raw speed reporting and what I call "gymnastics" reporting.

In raw speed reporting, I run my performance test multiple times (the exact number depending on whether I am timing a micro- or a macro-benchmark) and keep only the fastest result.  This approach helps to factor out the sometimes subtle differences in results when running on multi-tasking operating systems (ex: the scheduler runs a background task) and is closer to the maximum throughput of the code.
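A minimal sketch of raw speed reporting, assuming the individual run results have already been collected into an array (results_ms is a hypothetical name, as in the jitter example above), might look like this:

// raw speed reporting: keep only the fastest (smallest) result
Double fastest_ms = results_ms[0];
for(Int32 i = 1; i < results_ms.Length; i++)
{
    if(results_ms[i] < fastest_ms)
    {
        fastest_ms = results_ms[i];
    }
}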

In "gymnastics" reporting, I again run my performance test multiple times, but this time, I discard the extreme results (fastest and slowest) and average the remaining data.  The resulting data is closer to the typical performance that the customer will see during everyday use of the product.

Take care!
-- DK

Disclaimers:
This posting is provided "AS IS" with no warranties, and confers no rights.

Comments

  • Anonymous
    July 30, 2007
    One of the most common questions I hear towards the end of a product cycle is "are we ready to ship?".

  • Anonymous
    August 27, 2007
    In the first installment of this series, I detailed a list of key metrics used to determine when a product
