

Operations Manager Health Service Restarts due to Exceeding Handle Count Threshold

Two of my coworkers, Chris Maiden and Phil Bracher, recently ran across an issue where many of their agents were restarting. They thought this issue might be affecting others so they summarized their experience below:

The MonitoringHost processes were causing the Healthservice to restart at least once a day on several of our agents, mostly servers running Windows Server 2008 R2 SP1 and hosting Exchange 2010. The problem shows up as gaps in performance data: because the Healthservice is restarting, the system has no data for the periods when the service was stopped.

Update: I have attached the MP Phil used to collect and view data on the handle count.

When we believe the Healthservice is being restarted on an agent, the first items to look at are the memory monitors associated with the Healthservice. There are four monitors, covering the Healthservice and MonitoringHost processes. The monitors change state when either the Handle Count or Private Bytes threshold is exceeded. Additionally, the parent aggregate monitor has a recovery script that restarts the Healthservice process. Kevin Holman has a great blog post about this process.
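To make the monitor logic concrete, here is a minimal sketch of the behavior described above: two counters checked for each of the two processes, with a restart triggered when any of them is over its threshold. This is a simplified model, not the actual management pack logic. The `Sample` type, the function names, and the threshold values passed in are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading for a process (hypothetical structure for illustration)."""
    process: str        # e.g. "HealthService" or "MonitoringHost"
    handle_count: int
    private_bytes: int


def breached_monitors(samples, handle_threshold, private_bytes_threshold):
    """Return (process, counter) pairs whose value exceeds its threshold."""
    breached = []
    for s in samples:
        if s.handle_count > handle_threshold:
            breached.append((s.process, "Handle Count"))
        if s.private_bytes > private_bytes_threshold:
            breached.append((s.process, "Private Bytes"))
    return breached


def recovery_should_restart(samples, handle_threshold, private_bytes_threshold):
    """Model of the aggregate monitor: any breached child triggers the recovery."""
    return bool(breached_monitors(samples, handle_threshold, private_bytes_threshold))


if __name__ == "__main__":
    # Made-up readings; thresholds are illustrative, not product defaults.
    readings = [
        Sample("HealthService", 1_200, 50_000_000),
        Sample("MonitoringHost", 15_500, 80_000_000),
    ]
    print(breached_monitors(readings, 15_000, 300_000_000))
    print(recovery_should_restart(readings, 15_000, 300_000_000))
```

In this model, overriding a threshold only changes when the aggregate flips; it does nothing about why the counter is growing.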

Since we knew the two main reasons for Healthservice restarts were related to memory (Private Bytes and Handle Count), we decided to create a few views to observe the trends for both processes. We could also have started alerting on the individual restarts per Kevin's blog, but we wanted to observe the effect of threshold overrides over time to see whether the trend would eventually stabilize. The views we created included both processes and their respective Private Bytes and Handle Count counters.

The views provided the following results for the MonitoringHost process. The agent can create multiple MonitoringHost processes, and in almost every case it was the initial process (MonitoringHost) causing the problem. The process was creating handles and not releasing them, at a rate of approximately 6,000 handles per day.
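If you export the collected handle counts, a leak rate like the one above can be estimated with a simple least-squares slope. The sketch below assumes samples given as (days elapsed, handle count) pairs; the function name and the sample values are made up for illustration and are not our measured data.

```python
def handles_per_day(samples):
    """Least-squares slope of handle count over time.

    samples: list of (days_elapsed, handle_count) pairs.
    Returns the estimated growth in handles per day.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    numerator = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical readings taken every 12 hours.
    readings = [(0.0, 1_500), (0.5, 4_500), (1.0, 7_500), (1.5, 10_500)]
    print(handles_per_day(readings))
```

A slope that stays clearly positive across several days points to a leak, while a slope that flattens toward zero suggests the process has simply settled at a higher working level.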

We were up to date on the most recent .NET patches (the agent uses the .NET Framework), so we decided to select a few test agents and begin overriding the threshold for the MonitoringHost Handle Count to see if it would stabilize. Our thinking was that if we could find the sweet spot, we could decide whether to raise the threshold and leave it there.

We started with a threshold of 15,000 handles. The Healthservice continued to restart. We then increased to 30,000 handles. Same result.

Even at 100,000 handles the process would eventually use up the whole allowance and a restart would occur. At this point we decided not to increase it again, but rather to look for any fixes that might resolve the problem, since this was obviously a leak. Looking through the various hotfixes for System Center Operations Manager, we found KB2878378.
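A quick back-of-the-envelope calculation shows why raising the threshold could not help with a steady leak. The sketch below assumes the leak holds near the approximate 6,000 handles per day we observed and that the count starts near zero after each restart. Both are simplifications, and the function name is hypothetical.

```python
def days_until_restart(threshold, leak_per_day, baseline=0):
    """Days for the handle count to climb from baseline to threshold.

    Assumes a constant leak rate; returns infinity if there is no growth.
    """
    if leak_per_day <= 0:
        return float("inf")
    return max(threshold - baseline, 0) / leak_per_day


if __name__ == "__main__":
    # The three thresholds we tried, against an assumed 6,000 handles/day leak.
    for threshold in (15_000, 30_000, 100_000):
        days = days_until_restart(threshold, 6_000)
        print(f"{threshold:>7} handles: about {days:.1f} days")
```

Under these assumptions each increase only pushes the restart further out, from a few days to a couple of weeks. It never removes it, which matches what we saw.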

The article speaks to a specific symptom where you might observe grey agents. Though we did not observe grey agents or have any of the affected advapi32.dll versions listed in the article, we decided to install the hotfix on one of our Exchange 2010 servers in the lab. The outcome looked promising, but we still leaked handles. Below you can see the agent without the patch and then with the patch applied.

[Figure: handle count for the agent without the patch and then with the patch applied]

We looked further for any hotfixes, for either Operations Manager or the operating system, that might affect this leak. We found the following two optional patches and decided to install them.

KB2685811 - Kernel-Mode Driver Framework version 1.11 update for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2.

KB2685813 - User-Mode Driver Framework version 1.11 update for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2.

After we installed both updates, the handle leak was resolved and the Healthservice restarts stopped.

Sample.Handle.Count.for.MonitoringHost.renametoxml

Comments

  • Anonymous
    April 22, 2014
    Phil, Chris and Russ, Good job on finding the resolution for this issue!!!

  • Anonymous
    May 08, 2014
    I have the same problem, but on Server 2012 R2.

  • Anonymous
    May 11, 2014
    I have Server 2012 R2, please help.

  • Anonymous
    May 11, 2014
    I use Server 2012 R2, please help.

  • Anonymous
    May 11, 2014
    Lothar, our scenario applied specifically to Windows Server 2008. You could also experience similar issues with other operating systems which may or may not be resolved with a hotfix. Further investigation would be needed and I suggest you open a case with support if you require assistance.

  • Anonymous
    May 13, 2014
    Thanks. Works as posted. It has been running for two weeks and the agents no longer restart every 24 hours. It works for all the Windows 2008 R2 servers, not just the Exchange 2010 servers.

  • Anonymous
    November 04, 2014
    Can you tell me what views you used and how they are configured please?

  • Anonymous
    November 06, 2014
    We're receiving this on just one of our 5 SCOM 2012 R2 (rollup3) management servers. Only difference is that 4 of them are Windows 2012 6.2 (build 9200), and the one that is constantly restarting the health service with exceeded handles is Windows 2012 6.3 (build 9600) - the latest Windows version/build. It only just exceeds though - management servers have their handle threshold set to 10,000 and this usually restarts once a day at around 10,100(!). We could obviously override the value for this one management server but it would be nice to know how to resolve/whether a hotfix might be forthcoming?

  • Anonymous
    November 06, 2014
    Anton, we created rules to collect the handle count and memory counters and then viewed those counters with views and reports. Steph, what you're seeing may not have anything to do with the fixes listed in this article. Setting the threshold higher is a good troubleshooting step in this case. If you set it to 20,000 for instance and it restarts after a few days then you know there is likely some sort of leak and would need to contact support to assist with troubleshooting. If it stops growing at 12,000 then it's likely not an issue and you just need to leave the threshold at a higher level.

  • Anonymous
    December 03, 2014
    Russ - did you create the rules in a SCOM View? Or in the local system's performance monitor? I want to create this view as well. What specific handle counts/memory counters did you use?

  • Anonymous
    December 04, 2014
    Phil Bracher shared this MP he created with me. It includes a performance collection rule and view. I have attached it to this blog.

  • Anonymous
    March 09, 2015
    Thanks, Russ. After importing this MP, should I expect any type of performance impact? It essentially creates this rule for every agent in the environment and gathers data going forward, correct? Did you leave this MP imported and this rule running for all systems, or did you only enable it for some systems for a short time? Thanks

  • Anonymous
    March 09, 2015
    Correct, it also includes a view for the data collected. The data isn't collected very often so it shouldn't be an issue to leave it in place if you want to view the data collected on an ongoing basis.