Visual Studio Performance Testing -- Noise is Enemy #1

Performance testing is essential to our quest to make Visual Studio provide a highly responsive user experience.

We do performance testing early and often. Before a new feature is checked into the main branch, a test build is created, and 100 to 200 tests are run to assess performance. These tests include scenarios for start-up, debugging, project build, editor interactions, and much more.

This sounds like a lot of work. Well, it is. But there’s even more work that takes place before a single bit is tested.

Here’s a brief summary of what we do. First, we work with Visual Studio product units (e.g., C#, debugger, XML editor, and about 30 others) to identify performance-sensitive user scenarios and to set response time goals for these scenarios. The product teams develop test cases for their scenarios. Our team, Developer Division Performance Engineering, reviews these tests to see whether they meet certain acceptance criteria. Once a test has been accepted, it is incorporated into the division-wide suite that is used to assess whether a check-in has degraded performance. We call such a degradation a performance regression.

Of course, we need a test infrastructure as well. We have a hundred machines dedicated to performance testing. We also have automation software that runs tests, analyzes the results, and notifies product teams of performance regressions. Initially, our perspective was that automating performance testing has unique requirements that demand a unique infrastructure. Now, we recognize that performance testing has much in common with functional testing. Further, we want to use features such as load testing and test case management that are part of Visual Studio Team Test, a direction that is being pursued by Developer Division as a whole (in part to “dog food” Visual Studio features).

During a performance test, we collect a large number of PerfMon counters, CPU profiles, ETW traces, and more. However, the decision whether there is a regression (and hence whether the changes can be checked in) currently depends on just two metrics: CPU Time and Elapsed Time. If either of these is “significantly” higher than it is supposed to be, the check-in is rejected. (We deal with memory and I/O in other ways.)
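To make the decision rule concrete, here is a minimal sketch in Python. It is not our actual automation: the names Measurement and is_regression are hypothetical, and it assumes that a threshold has already been set for each metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """CPU Time and Elapsed Time for one test, in milliseconds (hypothetical type)."""
    cpu_ms: float
    elapsed_ms: float


def is_regression(candidate: Measurement, threshold: Measurement) -> bool:
    """Reject the check-in if either metric exceeds its threshold."""
    return (candidate.cpu_ms > threshold.cpu_ms
            or candidate.elapsed_ms > threshold.elapsed_ms)
```

Note that one metric over its threshold is enough to flag a regression, even if the other metric improved.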

But what constitutes significantly higher CPU Time and Elapsed Time? This question has been a source of animated discussions with developers eager to check in their changes.

In the past, we used an adaptive but somewhat opaque approach to establish regression thresholds. The adaptive part allowed us to reset the performance baselines behind those thresholds automatically, but it did so in a complex way. As a result, developers often disputed a reported performance regression because the performance baseline was unclear to them. One of the lessons we learned is that effective performance testing requires a transparent methodology for setting regression thresholds.

Our latest approach is to periodically set regression thresholds based on the results of multiple runs of the same baseline build. This provides a clear basis for comparing performance results on check-in.
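As an illustration only, a threshold derived from repeated baseline runs might be computed as sketched below. The formula is an assumption made for this example (the mean of the baseline runs plus a fixed tolerance); the function name regression_threshold and the 10% default are hypothetical and do not describe our production rule.

```python
from statistics import mean


def regression_threshold(baseline_runs_ms, tolerance=0.10):
    """Return a threshold from repeated runs of the same baseline build.

    Assumption for this sketch: threshold = mean of baseline runs * (1 + tolerance).
    """
    if not baseline_runs_ms:
        raise ValueError("at least one baseline run is required")
    return mean(baseline_runs_ms) * (1 + tolerance)
```

Because the threshold comes from a fixed set of baseline runs, a developer can inspect exactly which numbers a check-in is being compared against.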

This approach has another advantage as well: it allows us to eliminate certain sources of noise. For example, variations in the chipsets used on our test machines caused some machines to have very long I/O times (more than 20% longer). As a result, we developed a machine acceptance procedure to eliminate these anomalous machines. The tests themselves raise concerns as well. For example, some tests use very little CPU, so a seemingly significant increase in CPU Time can be due to very small changes in the way context switching takes place during the run. Indeed, even for CPU-bound work, we have noticed that CPU Time can vary by 5% to 10% due to changes in the way context switching occurs.
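A machine acceptance check along these lines could look like the following sketch. The details are assumptions: it compares each machine's I/O time for the same workload against the median across all machines and flags any machine that is more than 20% slower. The function name anomalous_machines and the choice of the median as the reference are hypothetical.

```python
from statistics import median


def anomalous_machines(io_ms_by_machine, max_excess=0.20):
    """Return the names of machines whose I/O time exceeds the
    median across machines by more than max_excess (20% by default)."""
    reference = median(io_ms_by_machine.values())
    limit = reference * (1 + max_excess)
    return sorted(name for name, io_ms in io_ms_by_machine.items()
                  if io_ms > limit)
```

For example, with I/O times of 100, 104, 98, and 130 ms, the median is 102 ms, the limit is 122.4 ms, and only the 130 ms machine is flagged.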

These are all examples of test noise. Properly addressing test noise is critical to the effectiveness of the performance engineering process. Nothing is more frustrating to a developer than to spend several days pursuing a performance bug that turns out to be noise.

Test noise is any characteristic of a test that causes CPU Time and/or Elapsed Time to vary so much that we can’t reliably determine whether performance has improved or degraded. To assess whether a test is too noisy, we run it multiple times on the same machine with a baseline build and see whether we get similar results. By similar, we mean that no result varies by more than 10% from the mean of the run.
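The 10% criterion can be expressed in a few lines of Python. This is a minimal sketch of the rule as stated above; the function name is_too_noisy is hypothetical.

```python
from statistics import mean


def is_too_noisy(samples_ms, tolerance=0.10):
    """Return True if any sample deviates from the mean of the run
    by more than the given fraction (10% by default)."""
    run_mean = mean(samples_ms)
    return any(abs(x - run_mean) > tolerance * run_mean for x in samples_ms)
```

A run of 100, 104, 97, and 101 ms passes, while a run of 100, 100, 100, and 160 ms fails: its mean is 115 ms, and the 160 ms sample is 45 ms away from it.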

Below are results from a test with little noise for CPU Time. The horizontal axis shows the run, and the vertical axis is CPU Time in milliseconds. The asterisks are test executions or iterations that are done repeatedly within a run. In most cases, we do not include the first iteration because it often differs a great deal from the others due to initialization effects. (Of course, for start-up scenarios, we only use the first iteration!)
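The handling of the first iteration can be sketched as follows. The function name iterations_to_analyze and the startup_scenario flag are hypothetical names used only for this example.

```python
def iterations_to_analyze(iterations_ms, startup_scenario=False):
    """Select which iterations of a run are used for analysis.

    Start-up scenarios keep only the first iteration; all other
    scenarios drop it because of initialization effects.
    """
    if startup_scenario:
        return iterations_ms[:1]
    return iterations_ms[1:]
```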

We look for several things to see if a test is too noisy. As in the figure above, we want all of the performance counters to lie within the range of plus or minus 10% of the mean. Also, we look for a fairly even scatter of performance counters below and above the mean, but with no discernible pattern. If there is a pattern, it suggests that something is systematically biasing the test and needs to be fixed.
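We judge scatter by looking at the plots, but a rough numerical summary can help. The sketch below is an assumption about how one might do this, not a description of our tooling: it counts the values above and below the mean and reports the longest streak of consecutive values on the same side. The function name scatter_summary is hypothetical.

```python
from itertools import groupby
from statistics import mean


def scatter_summary(samples_ms):
    """Return (above, below, longest_streak) relative to the mean.

    Values exactly equal to the mean are ignored. A long streak on one
    side hints at a pattern such as drift rather than random scatter.
    """
    run_mean = mean(samples_ms)
    sides = [x > run_mean for x in samples_ms if x != run_mean]
    above = sum(sides)
    below = len(sides) - above
    longest = max((len(list(group)) for _, group in groupby(sides)), default=0)
    return above, below, longest
```

Both 98, 102, 99, 101, 97, 103 and 95, 96, 97, 103, 104, 105 have three values on each side of a mean of 100, but the longest streak is 1 for the first run and 3 for the second, which drifts steadily upward.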

Now, let’s look at some noisy tests. One case is that the test may actually be addressing more than one scenario with very different performance characteristics. Here’s an example of a test that appears to have this multiple scenario character:

Another possibility is that the test may occasionally do something that takes a lot longer, or the test may be so short that it is subject to intermittent activities occurring in the OS (e.g., lazy writes) or the CLR (e.g., garbage collection). Here’s an example of a test result that appears to have this kind of problem:

In this case, there are two extreme values. At first glance, it may seem that there is another problem as well since several of the iterations have values below the “mean-10%” line. However, this is an artifact of having two large values that make the mean larger.
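A small worked example shows how this artifact arises. The numbers are made up for illustration and are not taken from the test above: eight iterations at 100 ms and two extreme iterations at 200 ms. The function name count_below_band and the 150 ms cutoff used to separate the extremes are hypothetical.

```python
from statistics import mean


def count_below_band(samples_ms, reference_ms, tolerance=0.10):
    """Count samples that fall below reference_ms * (1 - tolerance)."""
    lower = reference_ms * (1 - tolerance)
    return sum(1 for x in samples_ms if x < lower)


# Illustrative data only: eight typical iterations and two extreme ones.
samples = [100.0] * 8 + [200.0] * 2

# With the extremes included, the mean is 120 ms and the lower line is 108 ms,
# so all eight typical iterations appear to fall below the band.
inflated_mean = mean(samples)
below_with_extremes = count_below_band(samples, inflated_mean)

# Without the extremes (hypothetical 150 ms cutoff), the mean is 100 ms
# and no iteration falls below the band.
typical = [x for x in samples if x < 150.0]
below_without_extremes = count_below_band(typical, mean(typical))
```

The typical iterations did not change between the two calculations; only the mean moved.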

It’s hard work to do all of this. But performance testing was central to performance gains in Visual Studio 2008 compared with previous releases. And, with the help of changes in our handling of test noise, more performance improvements are on the way.

We will share the best practices for performance testing that we develop for Visual Studio (and other Developer Division products) with our customers. Longer term, we hope to incorporate these best practices into Visual Studio tools so that you can be more effective with your performance testing.

by Joe Hellerstein, Ph.D.

Senior Architect, Developer Division Performance Engineering

Joe Hellerstein is the author of Feedback Control of Computer Systems, published by Wiley-IEEE Press in 2004. He had a long and distinguished career at IBM Research prior to joining Microsoft in 2006.

Comments

  • Anonymous
    May 19, 2008
    "In the past, we used an adaptive but somewhat opaque approach to establish regression thresholds." Would it be correct to say this was done around the VS 2005 release timeframe?

  • Anonymous
    May 20, 2008
    The "opaque" approach was used to identify performance regressions in the VS 2008 timeframe. Our experience evolving such a large and complex code base has been that once code that causes a performance regression makes it into the main branch, it is very difficult to pinpoint it later. Consequently, we have built an automated test infrastructure that serves as a quality gate to identify this code before we permit it to be integrated into the main branch. One of the problems we found was that we were reporting too many "regressions" that subsequently proved false upon investigation. Analytically, we view this as a problem of separating the signal from the noise. We then started digging into the sources of this measurement noise. We are blogging here about what we are learning about the noise, which is leading directly to improvements we are making in the process for the next product development cycle.   -- Mark Friedman

  • Anonymous
    June 19, 2008
    You have no idea how envious I am of your testing resources. We write distributed system software that can run on dozens of nodes separated by multiple timezones.  Our beancounters froze all hardware purchases for about 9 months b/c the 75 machines that we had in the testing and dev't labs were "too many machines" for the testing that we need to do on our 6 core products with legacies that go back anywhere from 10 to 25+ years. Yet they still beat us up every quarter for COPQ.