Interpreting CPU Utilization for Performance Analysis

CPU hardware and features are rapidly evolving, and your performance testing and analysis methodologies may need to evolve as well. If you rely on CPU utilization as a crucial performance metric, you could be making some big mistakes interpreting the data. Read this post to get the full scoop; experts can scroll down to the end of the article for a summary of the key points.

 

If you’re the type of person who frequents our server performance blog, you’ve probably seen (or watched) this screen more than a few times:

[Screenshot: the Performance tab in Windows Task Manager]

This is, of course, the Performance tab in Windows Task Manager. The meaning of the Physical Memory counters is a question we regularly field on the perf team, but today I’m going to explain how CPU utilization (referred to here as CPU Usage) may not mean what you would expect!

 

[Note: In the screenshot above, CPU utilization is shown as a percentage in the top left. The two graphs on the top right show a short history of CPU usage for two cores. Each core gets its own graph in Task Manager.]

 

CPU utilization is a key performance metric. It can be used to track CPU performance regressions or improvements, and is a useful datapoint for performance problem investigations. It is also fairly ubiquitous; it is reported in numerous places in the Windows family of operating systems, including Task Manager (taskmgr.exe), Resource Monitor (resmon.exe), and Performance Monitor (perfmon.exe).

 

The concept of CPU utilization used to be simple. Assume you have a single core processor fixed at a frequency of 2.0 GHz. CPU utilization in this scenario is the percentage of time the processor spends doing work (as opposed to being idle). If this 2.0 GHz processor does 1 billion cycles worth of work in a second, it is 50% utilized for that second. Fairly straightforward.
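For the curious, here is a minimal sketch of what this time-based definition looks like in code. It uses the Win32 GetSystemTimes API to compare idle time against total time over a one-second window; the interval length and output formatting are just illustrative, and the result is subject to all of the caveats discussed later in this post.

```cpp
// Minimal sketch: compute system-wide CPU utilization as the fraction of
// time spent non-idle between two samples, using the Win32 GetSystemTimes API.
// This is a time-based measure; it says nothing about operating frequency,
// SMT, or cache sharing (see the pitfalls later in this post).
#include <windows.h>
#include <cstdio>

// Convert a FILETIME (100-ns units) to a 64-bit integer.
static unsigned long long ToU64(const FILETIME& ft) {
    ULARGE_INTEGER u;
    u.LowPart  = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart;
}

int main() {
    FILETIME idle1, kernel1, user1, idle2, kernel2, user2;

    GetSystemTimes(&idle1, &kernel1, &user1);   // first sample
    Sleep(1000);                                // measurement interval
    GetSystemTimes(&idle2, &kernel2, &user2);   // second sample

    // Kernel time already includes idle time, so total = kernel + user.
    unsigned long long idleDelta   = ToU64(idle2)   - ToU64(idle1);
    unsigned long long totalDelta  = (ToU64(kernel2) - ToU64(kernel1)) +
                                     (ToU64(user2)   - ToU64(user1));

    double utilization = totalDelta
        ? 100.0 * (double)(totalDelta - idleDelta) / (double)totalDelta
        : 0.0;

    printf("CPU utilization over the last second: %.1f%%\n", utilization);
    return 0;
}
```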

 

Current processor technology is much more complex. A single processor package may contain multiple cores with dynamically changing frequencies, hardware multithreading, and shared caches. These technological advances can change the behavior of CPU utilization reporting mechanisms and increase the difficulty of performance analysis for developers, testers, and administrators. The goal of this post is to explain the subtleties of CPU utilization on modern hardware, and to give readers an understanding of which CPU utilization measurements can and cannot be compared during performance analysis.

 

CPU Utilization’s Uses

For those who are unaware, CPU utilization is typically used to track CPU performance regressions or improvements when running a specific piece of code. Say a company is working on a beta of their product called “Foo.” In the first test run of Foo a few weeks ago, they recorded an average CPU utilization of 25% while Foo was executing. However, in the latest build the average CPU utilization during the test run is measured at 75%. Sounds like something’s gone awry.

 

CPU utilization can also be used to investigate performance problems. We expect this type of scenario to become common as more developers use the Windows Performance Toolkit to assist in debugging applications. Say that Foo gets released for beta. One customer says that when Foo is running, their system becomes noticeably less responsive. That may be a tough bug to root cause. However, if the customer submits an XPerf trace, CPU utilization (and many other nifty metrics) can be viewed per process. If Foo.exe typically uses 25% CPU on their lab test machines, but the customer trace shows Foo.exe is using 99% of the CPU on their system, this could be indicative of a performance bug.

 

Finally, CPU utilization has important implications for other system performance characteristics, namely power consumption. Some may think the magnitude of CPU utilization is only important if you’re bottlenecked on CPU at 100%, but that’s not at all the case. Each additional percent of CPU utilization consumes a bit more juice from the outlet, which costs money. If you’re paying the electricity bill for the datacenter, you certainly care about that!

 

Before I go further, I want to call out a specific caveat for the more architecturally-aware folks. Earlier, I used the phrase “cycles worth of work”. I will avoid defining the exact meaning of “work” for a non-idle processor, since that discussion can quickly become contentious. Metrics like Instructions Retired and Cycles per Instruction can be very architecture- and instruction-dependent and are not the focus of this discussion. Also, “work” may or may not include a plethora of activity, including floating point and integer computation, register moves, loads, stores, delays waiting for memory accesses and I/Os, etc. It is virtually impossible for every piece of functionality on a processor to be utilized during any given cycle, which leads to arguments about how much functionality must participate during “work” cycles.

 

Now, a few definitions:

Processor Package: The physical unit that gets attached to the system motherboard, containing one or more processor cores. In this blog post “processor” and “processor package” are synonymous.

Processor Core: An individual processing unit that is capable of executing instructions and doing computational work. In this blog post, the terms “CPU” and “core” are intended to mean the same thing. A “Quad-Core” processor implies four cores, or CPUs, per processor package.

Physical Core: Another name for an instance of a processor core.

Logical Core: A special subdivision of a physical core in systems supporting Simultaneous Multi-Threading (SMT). A logical core shares some of its execution path with one or more other logical cores. For example, a processor that supports Intel’s Hyper-Threading technology will have two logical cores per physical core. A “quad-core, Hyper-Threaded” processor will have 8 logical cores and 4 physical cores.

Non-Uniform Memory Access (NUMA): A type of system topology with multiple memory controllers, each responsible for a discrete bank of physical memory. Requests to each memory bank on the system may take different amounts of time, depending on where the request originated and which memory controller services the request.

NUMA node: A topological grouping of a memory controller, its associated CPUs, and its associated bank of physical memory on a NUMA system.

Hardware thread: A thread of code executing on a logical core.

Affinitization: The process of manually restricting a process or individual threads in a process to run on a specific core, package, or NUMA node.

Virtual Processor: An abstract representation of a physical CPU presented to a guest virtual machine.
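To tie these definitions to a live system, here is a rough sketch (not production code) that counts packages, physical cores, NUMA nodes, and logical CPUs using the GetLogicalProcessorInformationEx API available on Windows 7 / Server 2008 R2 and later. The loop structure follows the documented variable-length record format; the labels and output are purely illustrative, and a lower logical-CPU count than expected simply means SMT is off.

```cpp
// Sketch: count processor packages, physical cores, NUMA nodes, and logical
// CPUs with GetLogicalProcessorInformationEx (Windows 7 / Server 2008 R2+).
// The same API (RelationCache) can also show which cores share L2/L3 caches.
#include <windows.h>
#include <cstdio>
#include <vector>

static int CountRelations(LOGICAL_PROCESSOR_RELATIONSHIP rel) {
    DWORD len = 0;
    // First call with a null buffer just reports the required buffer size.
    GetLogicalProcessorInformationEx(rel, nullptr, &len);
    std::vector<BYTE> buffer(len);
    auto* base = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buffer.data());
    if (!GetLogicalProcessorInformationEx(rel, base, &len))
        return -1;

    // Records are variable-length, so walk the buffer using each record's Size.
    int count = 0;
    for (BYTE* ptr = buffer.data(); ptr < buffer.data() + len;) {
        auto* info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(ptr);
        ++count;
        ptr += info->Size;
    }
    return count;
}

int main() {
    printf("Processor packages: %d\n", CountRelations(RelationProcessorPackage));
    printf("Physical cores:     %d\n", CountRelations(RelationProcessorCore));
    printf("NUMA nodes:         %d\n", CountRelations(RelationNumaNode));
    printf("Logical CPUs:       %lu\n", GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}
```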

 

Comparisons & Pitfalls

CPU utilization data is almost always useful. It is a piece of information that tells you something about system performance. The real problem comes when you try to put one piece of data in context by comparing it to another piece of data from a different system or a different test run. Not all CPU utilization measurements are comparable - even two measurements taken on the same make and model of processor. There are a few sources of potential error for folks using utilization for performance analysis: hardware features and configuration, OS features and configuration, and measurement tools can all affect the validity of the comparison.

 

1. Be wary of comparing CPU utilization from different processor makes, families, or models.

This seems obvious, but recall the case study above, where the Foo performance team got a performance trace back from a customer and CPU utilization was very different from what was measured in the lab. The conclusion that 99% CPU utilization equals a bug is not valid if the processors are at all different, because you’re not comparing apples to apples. It can be a useful gut check, but treat it as such.

 

Key takeaway #1: Processor of type A @ 100% utilization IS NOT EQUAL TO Processor of type B @ 100% utilization

2. Resource sharing between physical cores may affect CPU utilization (for better or worse)

Single-core processors, especially on servers, are uncommon; multi-core chips are now the norm. This complicates a utilization metric for a few reasons. Most significantly, resource sharing between processor cores (logical and physical) in a package makes “utilization” a very hard-to-define concept. L3 caches are almost always shared amongst cores; L2 and L1 might also be shared. When resource sharing occurs, the net effect on performance is workload dependent. Applications that benefit from larger caches could suffer if cache space is shared between cores, but if your workload requires synchronization, it may be beneficial for all threads to be executing with shared cache. Cache misses and other cache effects on performance are not explicitly called out in the performance counter set. So the reported utilization includes time spent waiting for cache or memory accesses, and this time can grow or shrink based on the amount and kind of resource sharing.

 

Key takeaway #2: 2 HW threads on the same package @ 100% utilization IS NOT EQUAL TO 2 HW threads on different packages @ 100% utilization (for better or worse)

 

 

3. Resource sharing between logical cores may affect CPU utilization (for better or worse)

Resource sharing also occurs in execution pipelines when SMT technologies like Intel’s Hyper-threading are present. Logical cores are not the same as physical cores - execution units may be shared between multiple logical cores. Windows considers each logical core a CPU, but seeing the term “Processor 1” in Windows does not imply that the corresponding silicon is a fully functioning, individual CPU.

 

Consider 2 logical cores sharing some silicon on their execution path. If one of the logical cores is idle, and the other is running at full bore, we have 100% CPU utilization for one logical core. Now consider when both logical cores are active and running full bore. Can we really achieve double the “work” of the previous example? The answer is heavily dependent on the workload characteristics and the interaction of the workload with the resources being shared. SMT is a feature that improves performance in many scenarios, but it makes evaluating performance metrics more…interesting.

 

Key takeaway #3: 2 HW threads on the same logical core @ 100% utilization IS NOT EQUAL TO 2 HW threads on different logical cores @ 100% utilization (for better or worse)

 

 

4. NUMA latencies may affect CPU utilization (for better or worse)

An increasing percentage of systems have a NUMA topology. NUMA and resource sharing together imply that system topology can have dramatic effects on overall application performance. Similar to the previous two pitfalls, NUMA effects on performance are workload dependent.

 

If you want to see which cores belong to which NUMA nodes, right-click a process on the “Processes” tab of Task Manager and click “Set Affinity…”. You should get a window similar to the one below, which shows the CPU-to-node mapping on a NUMA-based server. Another way to get this information is to execute the “!NUMA” command in the Windows Debugger (windbg.exe).

[Screenshot: the processor affinity dialog in Task Manager, showing CPUs grouped by NUMA node]
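If you’d rather not click through the GUI, the sketch below shows roughly how the same CPU-to-node mapping can be pulled programmatically with GetNumaHighestNodeNumber and GetNumaNodeProcessorMaskEx (Windows 7 / Server 2008 R2 and later). It is illustrative only; error handling and multi-group details are kept to a minimum.

```cpp
// Sketch: print the logical-CPU-to-NUMA-node mapping programmatically,
// as an alternative to the Task Manager affinity dialog or windbg's !NUMA.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed (%lu)\n", GetLastError());
        return 1;
    }

    for (USHORT node = 0; node <= highestNode; ++node) {
        GROUP_AFFINITY affinity = {};
        if (!GetNumaNodeProcessorMaskEx(node, &affinity))
            continue;  // node may have no processors assigned

        printf("NUMA node %u (processor group %u): CPUs ", node, affinity.Group);
        for (unsigned cpu = 0; cpu < sizeof(KAFFINITY) * 8; ++cpu) {
            if (affinity.Mask & (static_cast<KAFFINITY>(1) << cpu))
                printf("%u ", cpu);
        }
        printf("\n");
    }
    return 0;
}
```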

 

Key takeaway #4: 2 HW threads on the same NUMA node @ 100% utilization IS NOT EQUAL TO 2 HW threads on different NUMA nodes @ 100% utilization (for better or worse)

 

 

5. Processor power management (PPM) may cause CPU utilization to appear artificially high

Power management features introduce more complexity to CPU utilization percentages. Processor power management (PPM) matches the CPU performance to demand by scaling the frequency and voltage of CPU’s. During low-intensity computational tasks like word processing, a core that nominally runs at 2.4 GHz rarely requires all 2.4 billion potential cycles per second. When fewer cycles are needed, the frequency can be scaled back, sometimes significantly (as low as 28% of maximum). This is very prevalent in the market - PPM is present on nearly every commodity processor shipped today (with the exception of some “low-power” processor SKUs), and Windows ships with PPM enabled by default in Vista, Windows 7, and Server 2008 / R2.

 

In environments where CPU frequency is dynamically changing (reminder: this is more likely than not), be very careful when interpreting the CPU utilization counter reported by Performance Monitor or any other current Windows monitoring tool. Utilization values are calculated based on the instantaneous (or possibly mean) operating frequency, not the maximum rated frequency.

 

Example: In a situation where your CPU is lightly utilized, Windows might reduce the operating frequency to 50% or even 28% of its maximum. When CPU utilization is calculated, Windows uses this reduced frequency as the reference point for 100%. If a CPU nominally rated at 2.0 GHz is running at 500 MHz, and all 500 million cycles available are used, CPU utilization is reported as 100%. Extending the example, a CPU that is 50% utilized at 28% of its maximum frequency is using approximately 14% of the maximum possible cycles during the measured interval, but CPU utilization would appear in the performance counter as 50%, not 14%.
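In other words, reported utilization can be roughly rescaled by the operating frequency to estimate how many of the maximum rated cycles were actually consumed. A tiny sketch of that arithmetic, using the numbers from the example above, is shown below; keep in mind this is only an approximation, since (as the side note below explains) the frequency counters are themselves sampled.

```cpp
// Sketch: normalize a reported utilization value by the operating frequency
// to estimate what fraction of the processor's *maximum rated* cycles was
// actually consumed over the interval.
#include <cstdio>

// Both inputs are percentages (0-100), e.g. from the "% Processor Time"
// and "% of Maximum Frequency" counters.
double NormalizedUtilization(double reportedUtilization, double pctOfMaxFrequency) {
    return reportedUtilization * pctOfMaxFrequency / 100.0;
}

int main() {
    // 100% utilized while running at 500 MHz on a 2.0 GHz part (25% of maximum)
    printf("%.0f%% of rated cycles\n", NormalizedUtilization(100.0, 25.0));  // 25%
    // 50% utilized at 28% of maximum frequency -> roughly 14% of rated cycles
    printf("%.0f%% of rated cycles\n", NormalizedUtilization(50.0, 28.0));   // 14%
    return 0;
}
```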

 

You can see instantaneous frequencies of CPUs in the “Performance Monitor” tool. See the “Processor Performance” object and select the “% of Maximum Frequency” counter.

 

[Side note related to Perfmon and power management: the “Processor Frequency” and “% of Maximum Frequency” counters are instantaneous samples, not averaged samples. Over a sample interval of one second, the frequency can change dozens of times, but the only frequency you’ll see is the instantaneous sample taken each second. Again, ETW or other more granular measurement tools should be used to obtain statistically better data for calculating utilization.]
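If you prefer to read frequencies programmatically rather than through Perfmon, one option is CallNtPowerInformation with the ProcessorInformation level, sketched below. This, too, returns an instantaneous snapshot, so the same sampling caveat applies; the local struct mirrors the documented output layout because the declaring header is a kernel-mode header.

```cpp
// Sketch: query per-CPU current and maximum frequencies via
// CallNtPowerInformation(ProcessorInformation, ...). Link against PowrProf.lib.
// Each call is an instantaneous snapshot, like the Perfmon counters above.
#include <windows.h>
#include <powrprof.h>
#include <cstdio>
#include <vector>

#pragma comment(lib, "PowrProf.lib")

// Mirrors the documented output layout for the ProcessorInformation level
// (the declaring header, ntpoapi.h, is kernel-mode).
struct ProcPowerInfo {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
};

int main() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);  // logical processors in the current group only

    std::vector<ProcPowerInfo> info(si.dwNumberOfProcessors);
    auto status = CallNtPowerInformation(
        ProcessorInformation, nullptr, 0,
        info.data(), (ULONG)(info.size() * sizeof(ProcPowerInfo)));
    if (status != 0) {
        printf("CallNtPowerInformation failed: 0x%lx\n", (unsigned long)status);
        return 1;
    }

    for (const auto& p : info) {
        double pctOfMax = p.MaxMhz ? 100.0 * p.CurrentMhz / p.MaxMhz : 0.0;
        printf("CPU %lu: %lu MHz of %lu MHz max (%.0f%% of maximum frequency)\n",
               p.Number, p.CurrentMhz, p.MaxMhz, pctOfMax);
    }
    return 0;
}
```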

Key takeaway #5: 2 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

 

Key takeaway #6: 4 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

 

6. Special Perfmon counters should be used to obtain CPU utilization in virtualized environments

Virtualization introduces more complexity, because allocation of work to cores is done by the hypervisor rather than the guest OS. If you want to view CPU utilization information via Performance Monitor, specific hypervisor-aware performance counters should be used. In the root partition of a Windows Server running Hyper-V, the “Hypervisor Root Virtual Processor % Total Runtime” counter can be used to track CPU utilization for the Virtual Processors to which a VM is assigned. For deeper analysis of Hyper-V Performance Counters and Processor Utilization in virtualized scenarios, see blog posts here and here.

 

Key takeaway #7: In a virtualized environment, unique Perfmon counters exposed by the hypervisor to the root partition should be used to get accurate CPU utilization information.

 

 

7. “% Processor Time” Perfmon counter data may not be statistically significant for short test runs

For someone performing performance testing and analysis, the ability to log CPU utilization data over time is critical. A data collector set can be configured via logman.exe to log the “% Processor Time” counter in the “Processor Information” object for this purpose. Unfortunately, counters logged in this fashion have a relatively coarse granularity in terms of time intervals; the minimum interval is one second. Relatively long sample durations are needed to ensure statistical significance in the utilization data. If you need higher precision, out-of-band Windows tools like XPerf in the Windows Performance Toolkit can measure and track CPU utilization with a finer time granularity using the Event Tracing for Windows (ETW) infrastructure.
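For illustration, here is a hedged sketch of the polling loop that Perfmon and logman effectively perform, using the PDH API with a one-second interval. The counter path is an assumption for recent Windows versions (older systems expose “\Processor(_Total)\% Processor Time” instead), and the same pattern applies to the hypervisor counters mentioned in pitfall #6 by swapping in the appropriate counter path.

```cpp
// Sketch: poll the "\Processor Information(_Total)\% Processor Time" counter
// once per second with the PDH API -- roughly what logman/Perfmon do.
#include <windows.h>
#include <pdh.h>
#include <cstdio>

#pragma comment(lib, "pdh.lib")

int main() {
    PDH_HQUERY query = nullptr;
    PDH_HCOUNTER counter = nullptr;

    if (PdhOpenQuery(nullptr, 0, &query) != ERROR_SUCCESS)
        return 1;
    // Counter path assumed for recent Windows versions; older systems may only
    // expose "\Processor(_Total)\% Processor Time".
    if (PdhAddEnglishCounterW(query,
            L"\\Processor Information(_Total)\\% Processor Time",
            0, &counter) != ERROR_SUCCESS)
        return 1;

    PdhCollectQueryData(query);  // prime the first sample

    for (int i = 0; i < 10; ++i) {
        Sleep(1000);             // 1-second interval: the same floor as logman
        PdhCollectQueryData(query);

        PDH_FMT_COUNTERVALUE value;
        if (PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, nullptr, &value)
                == ERROR_SUCCESS) {
            printf("%% Processor Time: %.1f\n", value.doubleValue);
        }
    }

    PdhCloseQuery(query);
    return 0;
}
```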

Key takeaway #8: Perfmon is a good starting point for measuring utilization but it has several limitations that can make it less than optimal. Consider alternatives like XPerf in the Windows Performance Toolkit.

 

 

 

Best Practices for Performance Testing and Analysis Involving CPU Utilization

If you want to minimize the chances that hardware and OS features or measurement tools skew your utilization measurements, consider the following few steps:

1. If you’re beginning to hunt down a performance problem or are attempting to optimize code for performance, start with the simplest configuration possible and add complexity back into the mix later.

a. Use the “High Performance” power policy in Windows or disable power management features in the BIOS to avoid processor frequency changes that can interfere with performance analysis.

b. Turn off SMT, overclocking, and other processor technologies that can affect the interpretation of CPU utilization metrics.

c. Affinitize application threads to a core. This will enhance repeatability and reduce run-to-run variation. Affinity masks can be specified programmatically, from the command line, or via the GUI in Task Manager (a short programmatic sketch follows Key Takeaway #9 below).

d. Do NOT continue to test or run in production using this configuration indefinitely. You should strive to test in out-of-box or planned production configuration, with all appropriate performance and power features enabled, whenever possible.

Key Takeaway #9: When analyzing performance issues or features, start with as simple a system configuration as possible, but be sure to analyze the typical customer configuration at some point as well.
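As promised in item 1c, here is a minimal sketch of programmatic affinitization using SetThreadAffinityMask. It pins the calling thread to a single logical CPU for repeatable measurements; the choice of CPU 0 and the surrounding scaffolding are placeholders for your own test harness.

```cpp
// Sketch: pin the current thread to one logical CPU for repeatable test runs
// (item 1c above). This is for test/debug repeatability, not a recommendation
// to ship affinitized code.
#include <windows.h>
#include <cstdio>

int main() {
    // Restrict the calling thread to logical CPU 0. SetThreadAffinityMask
    // returns the previous mask, or 0 on failure.
    DWORD_PTR mask = 0x1;  // bit 0 = logical CPU 0
    DWORD_PTR previousMask = SetThreadAffinityMask(GetCurrentThread(), mask);
    if (previousMask == 0) {
        printf("SetThreadAffinityMask failed (%lu)\n", GetLastError());
        return 1;
    }

    printf("Thread pinned to CPU 0; previous affinity mask was 0x%llx\n",
           (unsigned long long)previousMask);

    // ... run the workload being measured here ...

    // Non-programmatic equivalents: the command line (e.g. cmd's
    // "start /affinity" switch) or Task Manager's "Set Affinity..." dialog.
    return 0;
}
```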

2. Understand the system topology and where your application is running on the system in terms of cores, packages, and nodes when your application is not explicitly affinitized. Performance issues can suddenly appear in complex hardware topologies; ETW and XPerf in the Windows Performance Toolkit can help you to monitor this information.

a. Rebooting will generally change where unaffinitized work is allocated to CPUs on a machine. This can make topology-related performance issues reproduce intermittently, making them harder to root-cause and debug. Reboot and rerun tests several times, or explicitly affinitize to specific cores and nodes, to help flush out any issues related to system topology. This does not mean that the final implementation is required to use thread affinity, or that affinity should be used to work around potential issues; it just improves repeatability and clarity when testing and debugging.

3. Use the right performance sampling tools for the job. If your sample sets will cover a long period of time, Perfmon counters may be acceptable. ETW generally samples system state more frequently and is correspondingly more precise than Perfmon, making it effective even with shorter duration samples. Of course, there is a tradeoff - depending on the number of ETW “hooks” enabled, you may end up gathering significantly more data and your trace files may be large.

 

 

 

Finally, keep in mind that these problems are not isolated to the Windows operating system family. The increase in processor features and complexity over the past decade has made performance analysis, testing, and optimization a challenge on all platforms, regardless of OS or processor manufacturer.

And if you are comparing CPU utilization between two different test runs or systems, use the guidance in this post to double-check that the comparison makes sense. Making valid comparisons means you’ll spend more of your time chasing real performance issues.

 

 

Summary of Key Takeaways

Key takeaway #1: Processor of type A @ 100% utilization IS NOT EQUAL TO Processor of type B @ 100% utilization

Key takeaway #2: 2 HW threads on the same package @ 100% utilization IS NOT EQUAL TO 2 HW threads on different packages @ 100% utilization (for better or worse)

Key takeaway #3: 2 HW threads on the same logical core @ 100% utilization IS NOT EQUAL TO 2 HW threads on different logical cores @ 100% utilization (for better or worse)

Key takeaway #4: 2 HW threads on the same NUMA node @ 100% utilization IS NOT EQUAL TO 2 HW threads on different NUMA nodes @ 100% utilization (for better or worse)

Key takeaway #5: 2 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

Key takeaway #6: 4 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

Key takeaway #7: In a virtualized environment, unique Perfmon counters exposed by the hypervisor to the root partition should be used to get accurate CPU utilization information.

Key takeaway #8: Perfmon is a good starting point for measuring utilization but it has several limitations that can make it less than optimal. Consider alternatives like XPerf in the Windows Performance Toolkit.

Key Takeaway #9: When analyzing performance issues or features, start with as simple a system configuration as possible, but be sure to analyze the typical customer configuration at some point as well.

 

Feel free to reply with questions or additional (or alternative) perspectives, and good luck!

 

Matthew Robben

Program Manager

Windows Server Performance Team

Comments


  • Anonymous
    September 25, 2009
    The #5 point has long concerned me and I knew it was coming.  But I did not realize the % of Maximum Frequency counter was instantaneous.  This severely hampers the traditional capacity management approach of logging performance data at 1-5 minute samples over extended periods of time.  Even if both % Processor Time and % Max. Freq counters are captured in an attempt to normalize the utilization, the sample rate is way too low. So my question is: if I am still running Server 2003, will the frequency still be adjusted by default on modern hardware? BTW, thanks for the great post!  Keep them coming!

  • Anonymous
    October 30, 2009
All very salient points. One key point that is missing, however, is a basic understanding of how Windows calculates CPU utilization. Those with a background in OS internals might expect that the kernel would track CPU utilization by updating the thread and process structures at context switch. Windows, however, instead uses a sampling approach - every 15.6 msec on a multiprocessor and 10 msec on a uniprocessor. On this interval it checks the thread running on each core. That thread gets charged for the entire interval's worth of CPU. This can lead to severe sampling error in workloads with relatively short task times. Luckily xperf tracing can now show this disparity.