Unexplained High CPU Usage on Intel E5 Processors
An experience with unexplained high CPU utilization on a server with an Intel E5 CPU.
A client has two identical Hyper-V servers running almost identical VMs. Out of the blue, one of the servers started showing high CPU utilization: the host was bouncing between 35% and 50%, and the guests were at 99%. Shutting down the guests and rebooting the server made no difference; the host still sat at 35-50%. Disabling or disconnecting any unnecessary hardware did not help either. While experimenting with one of the guest machines, I noticed that system interrupts would sometimes show 99% CPU utilization, go away for a bit, and then come back with whichever process was active taking over the 99%. That led to checking system interrupts on each host machine and comparing them.
My previous tool of choice was KernView for 32-bit machines; however, it does not work on modern 64-bit machines. After some digging around on the internet, it turned out that KernRate works on 64-bit machines and is included in the Windows Driver Development Kit 7, available at http://www.microsoft.com/en-us/download/details.aspx?id=11800. With the default install path, the files are in C:\WinDDK\7600.16385.1\Tools\Other\amd64.
The goal is to log the output for a fixed time in order to allow for comparison. The command was 'kernrate -s 30 -yo filename.txt', which takes a 30-second sample and writes it to a file with the chosen name in the same path. I ran the command on both the host that was not having issues and the one that was. For brevity, only the relevant output is shown below.
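To make the sample repeatable on both hosts, the command can be wrapped in a small script. The following is a minimal Python sketch, not part of the original troubleshooting. It assumes the default WDK 7 install path quoted above and that the executable is named kernrate.exe; the function names build_kernrate_command and run_sample are hypothetical. The flags are exactly the ones from the command above.

```python
import subprocess
from pathlib import PureWindowsPath

# Default WDK 7 tool directory quoted in the article (assumes a default install).
KERNRATE_DIR = PureWindowsPath(r"C:\WinDDK\7600.16385.1\Tools\Other\amd64")


def build_kernrate_command(seconds, output_file):
    """Return the argument list for one fixed-length KernRate sample.

    Mirrors 'kernrate -s 30 -yo filename.txt' from the article.
    Assumption: the executable is called kernrate.exe.
    """
    exe = str(KERNRATE_DIR / "kernrate.exe")
    return [exe, "-s", str(seconds), "-yo", output_file]


def run_sample(seconds, output_file):
    """Run KernRate from its own directory so the log lands in that path."""
    return subprocess.run(
        build_kernrate_command(seconds, output_file),
        cwd=str(KERNRATE_DIR),
        check=True,  # raise if KernRate exits with an error
    )
```

Calling run_sample(30, "host-a.txt") on one host and run_sample(30, "host-b.txt") on the other, from an elevated prompt, should give two logs of the same length that can be compared side by side.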
Server specs (both servers are the same):
Dell 320
32GB RAM
Intel E5-2420 CPU (6 hyper-threaded cores)
Server 2012 with Hyper-V role installed
Server with issues:
Results for Kernel Mode:
OutputResults: KernelModuleCount = 147
Percentage in the following table is based on the Total Hits for the Kernel
ProfileTime 276703 hits, 10002 events per hit --------
Module      Hits      msec     %Total   Events/Sec
NTOSKRNL    138197    30074    49 %     45961508
HAL         126880    30074    45 %     42197704
WIN32K      7230      30026    2 %      2408394
NTFS        1030      30055    0 %      342773
Server without issues:
Results for Kernel Mode:
OutputResults: KernelModuleCount = 145
Percentage in the following table is based on the Total Hits for the Kernel
ProfileTime 289145 hits, 10009 events per hit --------
Module      Hits      msec     %Total   Events/Sec
NTOSKRNL    244130    29999    84 %     81452620
HAL         41760     29999    14 %     13932992
WIN32K      1650      29999    0 %      550513
IPMIDRV     625       30000    0 %      208520
On the server with issues, 45% of the kernel-mode hits land in the HAL, compared with 14% on the healthy server. HAL is short for Hardware Abstraction Layer, the piece of the operating system that allows other parts of the operating system to interact with the physical hardware of the computer. Modern versions of Windows automatically select the HAL based on the processor type, but I still verified that both servers were using the same one. I disabled any unnecessary hardware, turned the guest machines off, and updated drivers, running KernRate between each step, all with very similar results.
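The comparison can also be done programmatically instead of by eye. The sketch below is an illustration only: it assumes the report lines look like the excerpts quoted above (a ProfileTime line with the total hits, followed by one row per module), and the function name module_shares is hypothetical. It recomputes each module's share of the total kernel hits for the two hosts.

```python
import re

# One module row, e.g. "HAL 126880 30074 45 % 42197704"
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*%\s+(\d+)$")
# Total line, e.g. "ProfileTime 276703 hits, 10002 events per hit --------"
TOTAL = re.compile(r"ProfileTime\s+(\d+)\s+hits")


def module_shares(report):
    """Return {module: percent of total kernel hits} for one report excerpt."""
    total = None
    hits = {}
    for line in report.splitlines():
        line = line.strip()
        total_match = TOTAL.search(line)
        if total_match:
            total = int(total_match.group(1))
            continue
        row_match = ROW.match(line)
        if row_match:
            hits[row_match.group(1)] = int(row_match.group(2))
    if not total:
        raise ValueError("no ProfileTime line found")
    return {name: 100.0 * count / total for name, count in hits.items()}


# Excerpts copied from the two reports in this article.
FAULTY = """\
ProfileTime 276703 hits, 10002 events per hit --------
Module Hits msec %Total Events/Sec
NTOSKRNL 138197 30074 49 % 45961508
HAL 126880 30074 45 % 42197704
"""
HEALTHY = """\
ProfileTime 289145 hits, 10009 events per hit --------
Module Hits msec %Total Events/Sec
NTOSKRNL 244130 29999 84 % 81452620
HAL 41760 29999 14 % 13932992
"""

faulty = module_shares(FAULTY)
healthy = module_shares(HEALTHY)
print(f"HAL share: faulty {faulty['HAL']:.1f}%, healthy {healthy['HAL']:.1f}%")
```

A full KernRate log contains many more lines than these excerpts, so the two regular expressions may need adjusting before this is pointed at a real file.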
After many hours of testing, there was one last resort before declaring a bad CPU or motherboard and calling Dell for warranty service: the BIOS was updated and the server rebooted. All disabled devices were left disabled and the guest machines were left off in order to limit the changes and allow a baseline test. A few minutes after rebooting, Task Manager showed a pleasant 10% CPU utilization. After re-enabling all devices and turning the guests back on, everything seemed nice and fast, including guest performance. One final run with KernRate showed whether there was any difference in the results.
After BIOS update on bad machine:
OutputResults: KernelModuleCount = 144
Percentage in the following table is based on the Total Hits for the Kernel
ProfileTime 341514 hits, 10009 events per hit --------
Module      Hits      msec     %Total   Events/Sec
NTOSKRNL    332831    29999    97 %     111047217
HAL         6673      29999    1 %      2226409
IPMIDRV     835       29999    0 %      278593
NTFS        395       29999    0 %      131789
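For anyone wondering how the Events/Sec column relates to the others: judging only from the numbers in the tables above (this is an inference, not something taken from the KernRate documentation), it appears to be the hit count multiplied by the "events per hit" value from the ProfileTime line, divided by the sample time in seconds and truncated. The short Python check below uses a hypothetical helper, events_per_sec, to compare that formula against three rows from the reports.

```python
def events_per_sec(hits, events_per_hit, msec):
    """Inferred formula: hits * events per hit / seconds, truncated.

    Integer arithmetic is used so that the truncation is exact.
    """
    return hits * events_per_hit * 1000 // msec


# (label, hits, events per hit, msec, Events/Sec as reported by KernRate)
rows = [
    ("faulty host, NTOSKRNL", 138197, 10002, 30074, 45961508),
    ("healthy host, NTOSKRNL", 244130, 10009, 29999, 81452620),
    ("after BIOS update, HAL", 6673, 10009, 29999, 2226409),
]

for label, hits, per_hit, msec, reported in rows:
    computed = events_per_sec(hits, per_hit, msec)
    print(f"{label}: computed {computed}, reported {reported}")
```

Because Events/Sec scales with the hit count, the %Total column is the easier one to compare between hosts whose sample lengths differ slightly.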
The links below have more information, including a document that contains the errata for this issue. If you have a Dell, HP or other server with E5 processor(s), please be sure to update your BIOS for the best experience. Please be sure to take backups first in case anything goes wrong.
Original article: UNEXPLAINED HIGH CPU USAGE
Information from Intel: Intel E5 CPU Errata
Related article: windows bugchecks on vmware esxi with xeon e5-2670 cpus