Lessons from the test lab: investigating a pleasant surprise
This post describes our recent investigation into an interesting performance problem: benchmarks that we were surprised to find running significantly faster than we expected on new hardware. Along the way we discuss useful benchmarking tools, how to validate results, and why it pays to know exactly what hardware you're running on.
This all started in our performance test lab. During the development of Visual Studio, each new build undergoes a suite of automated performance tests, running in a lab full of identical machines. These performance tests allow us to track Visual Studio's performance over time, and detect performance regressions (when something gets unexpectedly worse). We recently added a batch of new machines in our lab, and that's when the fun started.
Pop Quiz: How Much Faster?
Old machine: dual-core Intel Pentium D 830 processor, running at 3 GHz, with 1 GB of RAM.
New machine: quad-core Intel Xeon X5355 processor, running at 2.66 GHz, with 4 GB of RAM.
Given the differences in the two hardware configurations above, how much faster would you expect the new machine to be when running a Visual Studio performance test? Lower than, same as, twice, three times or four times the performance of the older machine?
One line of reasoning might look at the relative clock frequencies of the processors on the two machines. This might lead you to expect the newer processor cores to perform slower than the older cores, since their clock frequency is 11% lower. By this reasoning you might conclude that single-threaded applications would perform poorly on the new machine.
Another line of reasoning would factor in the number of cores in the two systems. Since the new machine has twice the number of cores, you might expect it to have about twice the performance on multi-threaded applications. (If you also accounted for the lower clock frequency, you'd end up with a figure of 1.78 times the performance of the old machine.)
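The first two lines of reasoning are simple enough to write down as arithmetic. The sketch below is only an illustration of those naive estimates, using the clock speeds and core counts quoted above; the function names are ours, not part of any tool. Note that the rounded 2.66 GHz figure yields roughly 1.77, while the 1.78 quoted above would be consistent with a slightly less rounded clock of about 2.667 GHz.

```python
# Naive spec-sheet estimates: clock ratio, optionally scaled by core count.
OLD_CLOCK_GHZ, OLD_CORES = 3.00, 2
NEW_CLOCK_GHZ, NEW_CORES = 2.66, 4


def clock_ratio(new_ghz, old_ghz):
    """Expected single-threaded speedup if only clock frequency mattered."""
    return new_ghz / old_ghz


def naive_multithreaded_speedup(new_ghz, new_cores, old_ghz, old_cores):
    """Expected speedup if performance scaled with cores times clock."""
    return (new_cores / old_cores) * clock_ratio(new_ghz, old_ghz)


single = clock_ratio(NEW_CLOCK_GHZ, OLD_CLOCK_GHZ)
multi = naive_multithreaded_speedup(
    NEW_CLOCK_GHZ, NEW_CORES, OLD_CLOCK_GHZ, OLD_CORES
)

print(f"single-threaded estimate: {single:.2f}x of the old machine")
print(f"multi-threaded estimate:  {multi:.2f}x of the old machine")
```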
A third approach might estimate the impact of RAM size. We’ve quadrupled the amount of RAM, so maybe any benchmarks that used to page to disk can now execute entirely in memory and hence will be orders of magnitude faster. (We'll cheat here and tell you that our benchmarks are generally not memory constrained.)
So far, all these options seem plausible. What's your guess?
What we naively expected to find lay somewhere between the first two lines of reasoning - that the new machines would be 1-2 times faster than the old machines, depending on the particular benchmark.
What we actually found is that many of our single-threaded CPU-bound benchmarks run about twice as fast on the new machine, while scalable multi-threaded benchmarks run up to four times as fast. This was a pleasant surprise, because it significantly reduces the overall time to run all the benchmarks. But it did leave us wondering why we were getting much greater speedups than our naive explanations would suggest. The rest of this post explores that question.
Using WinSAT and SPEC to Validate Benchmark Results
To make sure this wasn't a fluke result, we used the Windows System Assessment Tool (winsat.exe). This is a built-in tool that can quickly give a representative view of a machine's performance. It is multi-threaded, taking full advantage of all the cores on a machine. Here are the WinSAT CPU results:
Benchmark | Old Machine | New Machine | Speedup |
CPU – Compression (MB/s) | 70.5 | 262.0 | 3.7 |
CPU – Encryption (MB/s) | 52.3 | 139.3 | 2.7 |
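If you want to script runs like these across a lab, a thin wrapper around winsat.exe is one option. The sketch below is an assumption about how that might look, not the harness we used: the dictionary keys and function names are hypothetical, the mapping to the `cpu -compression`, `cpu -encryption`, and `mem` assessments is our best understanding of the WinSAT command line, and WinSAT generally needs an elevated prompt.

```python
import platform
import subprocess

# Hypothetical labels -> WinSAT arguments (assumed mapping; check `winsat /?`).
WINSAT_ASSESSMENTS = {
    "cpu_compression": ["cpu", "-compression"],
    "cpu_encryption": ["cpu", "-encryption"],
    "memory_bandwidth": ["mem"],
}


def winsat_command(assessment):
    """Build the argument list for one WinSAT assessment."""
    return ["winsat", *WINSAT_ASSESSMENTS[assessment]]


def run_assessment(assessment):
    """Run one assessment and return WinSAT's raw text output."""
    if platform.system() != "Windows":
        raise RuntimeError("WinSAT is only available on Windows")
    result = subprocess.run(
        winsat_command(assessment),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe anywhere.
    for name in WINSAT_ASSESSMENTS:
        print(name, "->", " ".join(winsat_command(name)))
```

Parsing the throughput figures out of the raw output is left out deliberately, since the exact output format is not something this post documents.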
We also wanted to validate our results against other real-world benchmarks. For this we turned to the SPEC website. SPEC produces a series of benchmark suites, plus a very formal process that ensures results are reproducible and can be compared fairly across different manufacturers. More importantly for our purposes, SPEC posts all reported benchmark results on its website. You won’t always be able to find your exact machine listed, but once a tool like CPU-Z has told you exactly what hardware you have, you can generally find results from a machine with the same CPU configuration and clock speed.
We used the "CINT2006" benchmarks – this is a widely-used benchmark suite concentrating on integer performance. We compared results for both CINT2006, which is a good test of single-threaded performance, and CINT2006 Rate, which tests the ability of a system to execute multiple copies of CINT2006, and is therefore a better test of multi-threaded performance. For two representative machines that are similar to our old and new hardware, here are the results:
Benchmark | Old Machine | New Machine | Speedup |
CINT2006 | 9.85 | 15.5 | 1.6 |
CINT2006 Rate | 18.0 | 44.4 | 2.5 |
The WinSAT and SPEC results confirm that the new machines are much faster than our naive expectations, even for benchmarks such as CINT2006 that cannot take advantage of the extra cores. So what were we missing?
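To make the gap concrete, the following sketch recomputes the speedups from the two tables above and sets them beside the naive estimates from the pop quiz. The pairing of each benchmark with a "single-threaded" or "multi-threaded" naive estimate is our own simplification for illustration.

```python
OLD_CLOCK_GHZ, NEW_CLOCK_GHZ = 3.00, 2.66
OLD_CORES, NEW_CORES = 2, 4

# Naive estimates: clock ratio alone, and clock ratio scaled by core count.
NAIVE = {
    "single": NEW_CLOCK_GHZ / OLD_CLOCK_GHZ,
    "multi": (NEW_CORES / OLD_CORES) * (NEW_CLOCK_GHZ / OLD_CLOCK_GHZ),
}

# benchmark -> (old machine, new machine, which naive estimate applies)
RESULTS = {
    "WinSAT CPU compression (MB/s)": (70.5, 262.0, "multi"),
    "WinSAT CPU encryption (MB/s)": (52.3, 139.3, "multi"),
    "SPEC CINT2006": (9.85, 15.5, "single"),
    "SPEC CINT2006 Rate": (18.0, 44.4, "multi"),
}

speedups = {name: new / old for name, (old, new, _) in RESULTS.items()}

for name, (_, _, kind) in RESULTS.items():
    measured = speedups[name]
    expected = NAIVE[kind]
    print(
        f"{name:32s} measured {measured:.1f}x, "
        f"naive {expected:.2f}x, ratio {measured / expected:.2f}"
    )
```

Every measured speedup comes out above its naive estimate, which is the discrepancy the rest of the post tries to explain.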
Using CPU-Z to Examine Machine Configurations
To answer this, we need a deeper understanding of the configurations of the two systems.
Unfortunately, finding detailed configuration information isn't always straightforward. For example, we know that level two (L2) cache size impacts performance, but Windows doesn't report it, and it's not easy to reboot into the BIOS to take a look at cache size when the machine is located in a remote test lab. This is where machine reporting tools like CPU-Z come in. You can run CPU-Z remotely on an unknown machine and get back a nicely formatted HTML report showing exactly what the hardware is. Here's a deeper look at our old and new systems:
Feature | Old Machine | New Machine |
CPU name | Pentium D 830 (“Smithfield”) | Xeon X5355 (“Clovertown”) |
CPU speed | 3.00 GHz | 2.66 GHz |
Number of cores | 2 | 4 |
L1 cache (per core) | 16 KB | 32 KB |
L2 cache (total) | 2 MB | 8 MB |
System RAM | 1 GB DDR2 | 4 GB DDR2 |
Using BCDEdit to Disable Cores
Now we can try to tease out the relative impacts of the many changes from the old configuration to the new configuration. The first and easiest step is to disable two out of four cores on a new machine, to enable a fairer "apples to apples" comparison of cores between old and new machines.
To do this we used the Windows BCDEdit tool, which replaces the old method of editing BOOT.INI by hand. Here we were particularly concerned with the order in which cores are disabled. This is important because the 8 MB of L2 cache in the Xeon “Clovertown” processors is divided: two of the four cores share 4 MB, and the other two cores share the other 4 MB. To keep our benchmark comparisons as fair as possible, we wanted to make sure that only one of the L2 caches was in use after disabling two cores. We used CPU-Z again after rebooting to confirm this.
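The concern about which cores stay enabled can be expressed as a small check. The sketch below is a model only: the core numbering and the cache labels are hypothetical, since the text above says that pairs of cores share a 4 MB L2 cache but not which core numbers belong to which pair. On a real machine you would confirm the layout with a tool like CPU-Z, as we did.

```python
# Hypothetical core -> L2 cache mapping for a four-core part where each
# pair of cores shares one 4 MB L2 cache. The numbering is an assumption.
L2_CACHE_OF_CORE = {0: "L2-A", 1: "L2-A", 2: "L2-B", 3: "L2-B"}
L2_CACHE_SIZE_MB = 4


def l2_caches_in_use(enabled_cores):
    """Return the set of L2 caches reachable from the enabled cores."""
    return {L2_CACHE_OF_CORE[core] for core in enabled_cores}


def is_fair_two_core_config(enabled_cores):
    """True if exactly two cores are enabled and they share one L2 cache."""
    cores = set(enabled_cores)
    return len(cores) == 2 and len(l2_caches_in_use(cores)) == 1


for cores in ({0, 1}, {0, 2}, {2, 3}):
    caches = l2_caches_in_use(cores)
    total_mb = len(caches) * L2_CACHE_SIZE_MB
    verdict = "fair" if is_fair_two_core_config(cores) else "not fair"
    print(f"cores {sorted(cores)}: {total_mb} MB of L2 in use, {verdict}")
```

Under this model, leaving one core enabled from each pair would give the two remaining cores 8 MB of L2 between them, which is exactly the configuration we wanted to avoid.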
Now we were in a position to do a fairer “cores to cores” comparison between the old and new machines. Here's a summary from WinSAT:
Benchmark | Old Machine | New (2 cores) | Speedup |
CPU – Compression (MB/s) | 70.5 | 131.9 | 1.9 |
CPU – Encryption (MB/s) | 52.3 | 69.7 | 1.3 |
Memory Bandwidth (MB/s) | 4,041 | 3,360 | 0.8 |
Now we can really see the advantage of the latest processors – on a core-for-core basis, they are 1.3-1.9x faster on the CPU-intensive WinSAT benchmarks, despite having lower clock frequencies.
Good, now on to the next… wait a second. Look at that memory bandwidth result. Our new machines have less memory bandwidth than the old machines? That doesn't look right: although memory performance hasn't been keeping pace with CPU speeds, it has been improving over time. Compared to a three-year-old machine, we'd expect these new machines to have slightly better memory bandwidth, and definitely not worse. What gives?
Memory Channels
A primary limiting factor for memory bandwidth is the number of memory channels that are in use. And this turns out to be the problem here: although the new machines have four memory channels and eight memory slots, only two of those slots are filled, because the vendor supplied us with two 2 GB memory modules per machine. This maximizes future expansion potential – we can take the machine up to 16 GB without throwing away any of our initial investment in memory. In the meantime, though, filling only two memory slots limits us to two memory channels in use. If instead we had four 1 GB memory modules we'd have four memory channels in use, improving memory interleaving from 2:1 to 4:1 and increasing memory bandwidth. To confirm this, we populated four memory slots on one of the new machines (going from 4 GB to 8 GB) and reran WinSAT:
Benchmark | 2 channels | 4 channels | Speedup |
Memory Bandwidth (MB/s) | 3,360 | 4,134 | 1.2 |
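The numbers in this section can be sanity-checked with a few lines of arithmetic. The sketch below uses only figures quoted in this post; the "ideal" speedup assumes bandwidth would scale linearly with channel count, which is a simplifying assumption of ours, and the comparison with the old machine assumes the WinSAT runs are directly comparable.

```python
MEMORY_SLOTS = 8
MODULE_SIZE_GB = 2

# WinSAT memory bandwidth results quoted in this post, in MB/s.
OLD_MACHINE = 4041
NEW_TWO_CHANNELS = 3360
NEW_FOUR_CHANNELS = 4134

# Filling every slot with the supplied module size gives the expansion limit.
max_capacity_gb = MEMORY_SLOTS * MODULE_SIZE_GB

# Idealised expectation: bandwidth proportional to channels in use.
ideal_speedup = 4 / 2
measured_speedup = NEW_FOUR_CHANNELS / NEW_TWO_CHANNELS
versus_old = NEW_FOUR_CHANNELS / OLD_MACHINE

print(f"maximum capacity with 2 GB modules: {max_capacity_gb} GB")
print(f"ideal 2 -> 4 channel speedup:       {ideal_speedup:.1f}x")
print(f"measured 2 -> 4 channel speedup:    {measured_speedup:.2f}x")
print(f"new (4 channels) vs old machine:    {versus_old:.2f}x")
```

The measured gain is well short of the idealised doubling, but it is enough to bring the new machine slightly ahead of the old one on this benchmark.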
Conclusions
It's always possible to run more experiments to further isolate and explain benchmark results, but after a while you reach a point of diminishing returns. With the results we have so far, we can already draw some useful conclusions.
The first conclusion is that our naive explanations greatly underestimated just how much better the newer processors are at executing real benchmarks, despite their slower clock speeds. The results from WinSAT and SPEC clearly show this, with core-to-core performance that is 1.3-1.9x faster on the new machines, depending on the benchmark.
This is perhaps the most important lesson for developers to learn: clock speeds are no longer a good indicator of true performance. Although clock speeds have plateaued, processor designers continue to find ways to make each new generation significantly faster than the last. In our case, the old machines have Pentium D processors (“Smithfield”), while the new machines have Xeon 5-series processors (“Clovertown”). And while the newer processors have slightly slower clock speeds, their micro-architecture executes more instructions per clock cycle.
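One rough way to see the per-clock improvement is to take the two-core WinSAT results from earlier and scale them by the clock ratio. This is a back-of-envelope sketch, not a measurement of instructions per cycle: throughput per GHz is only a proxy, and it also reflects cache and other differences between the machines.

```python
OLD_CLOCK_GHZ = 3.00
NEW_CLOCK_GHZ = 2.66

# WinSAT MB/s with both machines using two cores: (old, new with 2 cores).
TWO_CORE_RESULTS = {
    "CPU compression": (70.5, 131.9),
    "CPU encryption": (52.3, 69.7),
}


def per_clock_speedup(old_score, new_score):
    """Core-for-core speedup, normalised to equal clock frequency."""
    core_for_core = new_score / old_score
    return core_for_core * (OLD_CLOCK_GHZ / NEW_CLOCK_GHZ)


per_clock = {
    name: per_clock_speedup(old, new)
    for name, (old, new) in TWO_CORE_RESULTS.items()
}

for name, (old, new) in TWO_CORE_RESULTS.items():
    print(
        f"{name}: {new / old:.2f}x core-for-core, "
        f"{per_clock[name]:.2f}x per clock"
    )
```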
The second conclusion is that it's very hard to perform fair comparisons. The two machines have several configuration differences, including clock frequency, number of cores, core micro-architecture, cache sizes, bus speed, memory size and speed, and so on. We showed an example of isolating the effect of just one of these differences, the number of cores, using the BCDEdit tool. Isolating the effect of every single difference would require much more effort.
Indeed, some of these differences are interrelated, and it is hard to change one without affecting another. For example, CPU architects make their micro-architecture design decisions based on cache sizes. Now imagine a hypothetical experiment that tried to isolate the effect of L2 cache size by giving each core just 1 MB of cache. This would be especially hard on the newer processors, which have been designed on the assumption that they have 2 MB of L2 cache per core[1]. In trying to perform a fairer comparison, we would have actually handicapped one system!
Our final conclusion is that it truly pays to benchmark and compare systems. In our case, the simplest possible benchmark (WinSAT) showed an unexpected memory bandwidth loss, which we then traced back to a machine mis-configuration. So that was the final pleasant surprise: if we hadn't gotten curious about why the new machines were so much faster, we would never have found that they could be faster still!
David Berg
Sunny Egbo
Jonathan Hardwick
Peter Okonski
[1] Because two cores share a single 4 MB L2 cache on the Clovertown processors, the amount of cache used by each core is not fixed at 2 MB; it varies during program execution. Cache-hungry threads might get more of the cache, while less cache-hungry threads get less. Even when two cache-hungry threads run on the two cores, their memory hotspots are asynchronous; thus, the net effect is that each thread gets more of the cache when it needs it and less when it doesn’t.
Comments
Anonymous
June 18, 2008
Even though I've been doing general architecture work on Visual Studio for nearly a year now, my friends
Anonymous
June 18, 2008
To explain the memory bandwidth difference... The pentium 4 (830D) was a more memory hungry architecture. Risking a gross oversimplification I would say that the 830 D agressively transfers memory in its cache, even if it ends up not doing it. So the bandwidth measured by a benchmark in a best case scenario could be much higher than the actual bandwith that your software can use. About the amount of cache used by a thread. I was under the impression that Windows scheduled threads on different cores. i.e. thread x is not guaranteed and is not going to run always on the first core of the processor. So if you have two concurrent threads, they are probably running both on both (or all four) cores. In this case the amount of cache used by each thread would be undetermined (except by further analysis). My understanding is that the part where you say that thread gets more of the cache when it needs it is is true regardless of the number of cores or wether these cores are sharing a cache or not. What am I missing?
Anonymous
June 18, 2008
The comment has been removed
Anonymous
June 19, 2008
Thank you for the answer. I feel enlightened now! Especially about soft affinity. I kind of always worried about that.
Anonymous
June 19, 2008
One of our main roles in DevDiv Performance Engineering is to help other teams with performance investigations
Anonymous
July 01, 2008
Thanks Mark about the enlightning details about 'Soft Affinity', and the reference to your Win2003 Performance Engineering Handbook was also useful. Looking up to next part of "Mainstream NUMA" post.
Anonymous
December 23, 2008
Soma’s been talking about the upcoming Visual Studio 2010 release on his blog, which means I’m starting
Anonymous
May 02, 2009
Recently, a colleague of mine, Mark Friedman, posted a blog titled “Parallel Scalability Isn’t Child’s