SCOM 2016 – Agent (Health Service) high CPU utilization and service restart
I had an interesting issue with the SCOM 2016 agents in one of my customer’s environments and would like to share some troubleshooting information, in the hope that it helps those of you who experience the same behavior.
The problem
It all started one day when one of the engineers showed me a Performance Monitor log file in which it was clearly visible that the MonitoringHost.exe process on the server was causing regular CPU spikes (every 15 to 20 minutes) of up to 100%. The CPU utilization stayed high for a short moment and then dropped quickly. Such spikes can often go unnoticed, but in our case there were a number of side issues on a group of servers running multiple IIS Application Pools.
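If you export the Performance Monitor samples, a few lines of Python can confirm how regular the spikes really are. This is only a minimal sketch: the function names, the 90% threshold and the sample data are assumptions made for illustration, and the input is assumed to be a list of (timestamp, CPU percent) pairs that you have already parsed from your own export.

```python
from datetime import datetime, timedelta


def find_spike_starts(samples, threshold=90.0):
    """Return the timestamp at which each run of readings >= threshold begins.

    `samples` is an iterable of (timestamp, cpu_percent) pairs in time order.
    The threshold is a hypothetical value, not something defined by SCOM.
    """
    starts = []
    in_spike = False
    for timestamp, cpu_percent in samples:
        above = cpu_percent >= threshold
        if above and not in_spike:
            starts.append(timestamp)  # first reading of a new spike
        in_spike = above
    return starts


def spike_intervals(starts):
    """Return the time between each pair of consecutive spike starts."""
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


# Made-up readings, one per minute offset from an arbitrary start time.
base = datetime(2024, 1, 1, 12, 0)
readings = [(0, 5.0), (1, 100.0), (2, 98.0), (3, 4.0), (17, 100.0), (18, 6.0)]
samples = [(base + timedelta(minutes=m), cpu) for m, cpu in readings]

starts = find_spike_starts(samples)
print(spike_intervals(starts))
```

Intervals that cluster around the same value, as in our 15 to 20 minutes, hint at something scheduled or self-repeating rather than random load.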
Some theory
Before going over the possible causes, let’s briefly clarify what the MonitoringHost.exe process actually is and how exactly it relates to the Microsoft Monitoring Agent service (also called the Health Service). For this purpose I will quote the simple explanation provided by Microsoft:
“On a monitored Windows computer, the Operations Manager agent is listed as the Microsoft Monitoring Agent service. The Microsoft Monitoring Agent service collects event and performance data, executes tasks, and other workflows defined in a management pack.”
Further on we read:
“To run workflows, the service initiates MonitoringHost.exe processes using specified credentials. These processes monitor and collect event log data, performance counter data, Windows Management Instrumentation (WMI) data, and run actions such as scripts.”
Possible causes
There could be many different causes for such behavior. Here is a list of the most common ones:
- Poor workflow design (rules, monitors, discoveries, etc.).
This is by far the most frequent cause of high CPU utilization on the agent. Rules or monitors that run scripts too often, performance collection rules that collect “heavy” data sets too frequently, object discoveries that run at very short intervals: you name it, all of those can lead to these exact symptoms.
I have also selected two blog posts, which are very good examples of this:
SCOMpercentageCPUTimeCounter cause CPU Spike and SCOM 2012 - Agent Causing High CPU Utilization
So, what can we do about it? First of all, make sure your management packs are updated. Such problems are usually noticed quickly and reported back to the respective vendors, so it is a good idea to ensure that you are running the latest version of the management pack.
If a custom management pack is the suspected cause, then it will be a good idea to remove it from your production environment, optimize it and test it thoroughly in your test environment, before deploying it again to production.
- AV software
In order to properly scan files, AV products often lock them. Usually this is not a problem, but in some cases the antivirus software does not release those locks, or sets them too frequently, which eventually leads to serious performance issues.
It is very important to exclude not only the agent cache (Health Service State folder), but also some other SCOM components. The recommendations regarding the exclusions are listed in the following article:
Configuring antivirus exclusions for agent and components
- Known issues (primarily with the older versions of Operations Manager)
There are also some known bugs (like the one described in the article below) that might cause this, but those mainly affect older versions of SCOM, such as 2007, 2007 R2 and 2012. If you regularly update your SCOM infrastructure, you should be on the safe side.
Operations Manager agents consume 100 percent of CPU resources for the Monitoringhost.exe process
There are other possible causes for sure, but those listed above are the most common ones.
Now, back to the original problem. As in many cases, the troubleshooting started in the Operations Manager event log on the affected agents and luckily, in this case, the culprit was quickly found. On all affected agents there were multiple events like the one below, which were always logged shortly before the CPU spikes:
Log Name: Operations Manager
Source: Health Service Script
Event ID: 6024
Level: Warning
Description:
LaunchRestartHealthService.js : Launching Restart Health Service. Health Service exceeded Process\Handle Count or Private Bytes threshold.
So, what exactly was happening? Because of the high number of workflows run by the Health Service, its Handle Count or Private Bytes (the two counters named in the event) quickly reached their default thresholds, which are actually very easy to reach because the default values are low. When SCOM detects that an agent has reached a threshold, it restarts the Health Service behind your back, thus killing all the running workflows. Not the best action to take, especially when you are not informed about it, right?
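To make the mechanism easier to picture, here is a minimal Python sketch of the decision described above. It is not SCOM's actual implementation (the real check lives in a management pack workflow), and the class names, function names and threshold numbers are hypothetical placeholders, not the product's default values.

```python
from dataclasses import dataclass


@dataclass
class ProcessSample:
    """One reading of the two counters named in event 6024."""
    handle_count: int
    private_bytes: int


@dataclass
class Thresholds:
    """Hypothetical limits; placeholders, not SCOM's real defaults."""
    max_handle_count: int
    max_private_bytes: int


def exceeded_counters(sample, limits):
    """Return the names of the counters that are above their limit."""
    exceeded = []
    if sample.handle_count > limits.max_handle_count:
        exceeded.append("Handle Count")
    if sample.private_bytes > limits.max_private_bytes:
        exceeded.append("Private Bytes")
    return exceeded


def should_restart(sample, limits):
    """A restart is triggered as soon as either counter is over its limit."""
    return bool(exceeded_counters(sample, limits))


# Placeholder numbers chosen only to show the effect of a low limit.
low_limits = Thresholds(max_handle_count=1_000, max_private_bytes=100_000_000)
raised_limits = Thresholds(max_handle_count=10_000, max_private_bytes=1_000_000_000)
busy_agent = ProcessSample(handle_count=2_500, private_bytes=80_000_000)

print(should_restart(busy_agent, low_limits))     # True: Handle Count is over the low limit
print(should_restart(busy_agent, raised_limits))  # False: both counters are under the raised limits
```

The point of the sketch is that the same busy agent is either restarted again and again or left alone, depending only on where the limits are set.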
The connection to the high CPU usage behaviour
It was also easy to relate the Health Service restarts to the CPU spikes on the agents. Each time the Microsoft Monitoring Agent service starts again after a restart, it initializes all the workflows with their respective modules, spawns multiple MonitoringHost.exe processes and communicates back to the management server, which consumes additional system resources, CPU in particular.
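If you want to verify this relationship with data instead of by eye, you can compare the timestamps of the 6024 events with the start times of the CPU spikes. The following Python sketch assumes you have already extracted both lists yourself; the function name, the two-minute window and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def restarts_followed_by_spike(restart_times, spike_times, window=timedelta(minutes=2)):
    """Return the restart timestamps that have a spike starting within `window` after them.

    `window` is an assumed tolerance; adjust it to your sampling interval.
    """
    matched = []
    for restart in restart_times:
        # A spike counts only if it starts at or after the restart event.
        if any(timedelta(0) <= spike - restart <= window for spike in spike_times):
            matched.append(restart)
    return matched


# Made-up timestamps: two restart events and three CPU spikes.
restarts = [datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 16)]
spikes = [
    datetime(2024, 1, 1, 12, 1),
    datetime(2024, 1, 1, 12, 17),
    datetime(2024, 1, 1, 12, 40),
]

matched = restarts_followed_by_spike(restarts, spikes)
print(f"{len(matched)} of {len(restarts)} restarts were followed by a spike")
```

When nearly every restart has a matching spike, as in our environment, the restarts are a strong candidate for the cause and worth fixing first.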
The solution
Luckily for us, Kevin Holman has released a blog post, where he describes how to deal with this problem. And we all know that everything he writes about SCOM is a “must read”:
Stop Healthservice restarts in SCOM 2016
After increasing all thresholds as suggested in his article (you can also import a management pack that sets all the thresholds; the link is in the blog post), we immediately noticed that the CPU utilization on all previously affected agents dropped back to normal.
Conclusion
At the end of the story, it is time for a quick summary of the important stuff:
- If you are new to SCOM, note the name Kevin Holman. Read his blog posts, follow him on Twitter and make sure you carefully follow his recommendations.
- Make sure your vendor management packs are always up to date and your custom ones are well tested (and of course documented) before importing them into production.
- Don’t forget to forward the recommended Operations Manager antivirus exclusions to your security team or to whoever in your company is responsible for the antivirus product.
- Make sure your Health Service Resource thresholds are set according to the recommendations (see the article above).