Health Service Heartbeat Failure, Diagnostics and Recoveries
I’ve seen plenty of questions come up in the forums and from customers regarding the Health Service Heartbeat Failure monitor, and its associated diagnostics and recoveries. I spent a little time digging further into these workflows and thought I’d share what I found here. Hope this helps those curious about what’s happening under the hood.
Communication Channel Basics
After an Operations Manager Agent is installed on a Windows computer, and after it is approved to establish a communication channel with an Operations Manager 2007 management group, the communication channel is maintained by the Health Service. If this communication channel is interrupted or dropped between the Agent and its primary Management Server (MS) for any reason, the Agent will make three attempts to re-establish communication with its primary MS, by default.
If the Agent is not able to re-establish the channel to its primary MS, it fails over to the next available MS. Failover configuration and the order of failover is another topic, and will not be covered here.
While the Agent is failed over to a secondary MS, it will attempt to re-establish communication with its primary MS every 60 seconds, by default. As soon as the Agent can establish communication with its primary MS again, it will disconnect from the secondary MS and fail back to its primary MS.
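The failover and failback behavior described above can be sketched as a small simulation. This is only an illustration of the decision logic, not the Health Service's actual implementation: the function names, the `is_reachable` callback, and the server names are all hypothetical, and only the two defaults (three attempts, 60 seconds) come from the text.

```python
from typing import Callable, Optional, Sequence

PRIMARY_RETRY_ATTEMPTS = 3   # default reconnect attempts, per the text
FAILBACK_POLL_SECONDS = 60   # default retry interval while failed over, per the text


def reconnect(primary: str,
              secondaries: Sequence[str],
              is_reachable: Callable[[str], bool],
              attempts: int = PRIMARY_RETRY_ATTEMPTS) -> Optional[str]:
    """Pick the server to use after the channel drops (hypothetical helper)."""
    # Try the primary MS first, up to the configured number of attempts.
    for _ in range(attempts):
        if is_reachable(primary):
            return primary
    # Primary is still unavailable: fail over to the first reachable secondary.
    for server in secondaries:
        if is_reachable(server):
            return server
    return None  # nothing reachable; the agent stays disconnected


def failback_check(current: Optional[str],
                   primary: str,
                   is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Intended to run once per FAILBACK_POLL_SECONDS while failed over."""
    # Fail back as soon as the primary answers; otherwise stay where we are.
    return primary if is_reachable(primary) else current


# Example: MS1 is down, so the agent lands on MS2, then fails back to MS1.
available = {"MS2", "MS3"}
current = reconnect("MS1", ["MS2", "MS3"], lambda s: s in available)
available.add("MS1")
current = failback_check(current, "MS1", lambda s: s in available)
```

The order of the `secondaries` list stands in for the failover order, which is configured separately and is not covered here.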
Health Service Heartbeat Failure Monitor
To briefly summarize the Heartbeat process, two configurable settings control Heartbeat behavior: the Heartbeat interval and the number of missed Heartbeats allowed. If the MS fails to receive a Heartbeat from an Agent computer for more than the specified number of intervals, the Health Service Heartbeat Failure monitor changes to a critical state and generates an alert.
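As a rough worked example of how the two settings combine, the sketch below multiplies the interval by the missed-Heartbeat count. The defaults used here (60 seconds and 3 missed Heartbeats) are the values mentioned in the comments at the end of this post; the function names are hypothetical, and the exact moment at which the MS evaluates the threshold is an assumption.

```python
def seconds_until_heartbeat_failure(interval_seconds: int = 60,
                                    missed_heartbeats: int = 3) -> int:
    """Approximate silence, in seconds, tolerated before the monitor goes critical."""
    return interval_seconds * missed_heartbeats


def heartbeat_failed(seconds_since_last_heartbeat: float,
                     interval_seconds: int = 60,
                     missed_heartbeats: int = 3) -> bool:
    """True once the agent has been silent for more than the allowed intervals."""
    threshold = seconds_until_heartbeat_failure(interval_seconds, missed_heartbeats)
    return seconds_since_last_heartbeat > threshold


print(seconds_until_heartbeat_failure())         # 60 * 3  -> 180
print(seconds_until_heartbeat_failure(180, 3))   # 180 * 3 -> 540
```

With the defaults, an agent that stays silent for roughly three minutes triggers the alert; raising the interval to 180 seconds stretches that to about nine minutes.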
Read more about Heartbeat and configuration here.
Diagnostic and Recovery Tasks
Two diagnostic tasks run when the Health Service Heartbeat Failure monitor changes to a critical state: Ping Computer on Heartbeat Failure and Check If Health Service Is Running.
Ping Computer on Heartbeat Failure
This diagnostic is defined in the Operations Manager 2007 Agent Management Library and is enabled by default. This workflow uses the Automatic Agent Management Account, which runs under the context of the Management Server Action Account by default, to execute WmiProbe, a probe action defined in the Microsoft System Center Library.
This probe is initiated on the Health Service Watcher. Since the Health Service Watcher is a perspective class hosted by the Root Management Server, this is where the WMI query is executed when the Health Service Heartbeat Failure monitor changes to a critical state. Even though the agent may be reporting to another MS, it is the RMS that sends the ICMP packet to the agent.
Unlike the traditional Ping.exe program we are all accustomed to, which sends four ICMP packets to the target host by default, the WMI query is executed only once and sends a single ICMP packet, so there is no calculation of percentage of lost packets one would expect to see with Ping.exe.
The following WMI query is executed on the RMS.
SELECT * FROM Win32_PingStatus WHERE Address = '$Config/NetworkTargetToPing$'
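To make the placeholder concrete, here is a minimal sketch of how the `$Config/NetworkTargetToPing$` token might be replaced with a target name before the query runs. The helper name and the example host are hypothetical, the quote check is an added safety assumption, and actually running the query against WMI (which requires Windows) is not shown.

```python
QUERY_TEMPLATE = (
    "SELECT * FROM Win32_PingStatus "
    "WHERE Address = '$Config/NetworkTargetToPing$'"
)


def build_ping_query(target: str) -> str:
    """Substitute the ping target into the query text (hypothetical helper)."""
    # Refuse empty names and characters that would break out of the quoted literal.
    if not target or any(ch in target for ch in "'\"\\"):
        raise ValueError(f"unsuitable ping target: {target!r}")
    return QUERY_TEMPLATE.replace("$Config/NetworkTargetToPing$", target)


print(build_ping_query("agent01.contoso.com"))
```

Each execution of a query like this yields a single Win32_PingStatus result, which matches the single-ping behavior described above.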
To verify the number of ICMP packets sent, I ran a traditional Ping.exe test and the WMI query used in this workflow and traced these using Netmon. The first two entries in the image below were captured from the WMI query, and the last eight entries captured were from a Ping.exe test using default parameters (four packets).
The WMI query results are passed to condition detection modules, which filter on StatusCode and trigger the appropriate write action. If StatusCode <> 0, the write action ComputerDown sets state to reflect that the computer is down. If StatusCode = 0, the write action ComputerUp sets state to reflect that the computer is up.
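The StatusCode filter boils down to a single comparison, sketched below. The function name is hypothetical; the two action names come from the text. Treating a missing StatusCode (`None`) as "down" is an assumption on my part, made because a ping that never resolves the target may not return a numeric code at all.

```python
from typing import Optional


def select_write_action(status_code: Optional[int]) -> str:
    """Map a Win32_PingStatus StatusCode to the write action described above."""
    # Only StatusCode = 0 counts as success; anything else, including a
    # missing code (assumption), is treated as the computer being down.
    return "ComputerUp" if status_code == 0 else "ComputerDown"


print(select_write_action(0))       # ComputerUp
print(select_write_action(11010))   # ComputerDown (11010 = request timed out)
print(select_write_action(None))    # ComputerDown
```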
The condition detection modules that filter StatusCode are actually the recovery tasks shown in the Health Service Heartbeat Failure monitor. These are the reserved recoveries, Reserved (Computer Not Reachable - Critical) and Reserved (Computer Not Reachable - Success), respectively.
Under the covers, these reserved recoveries are actually setting the state of the Computer Not Reachable monitor, which is defined in the System Center Core Monitoring MP. Ultimately, if StatusCode <> 0, the Computer Not Reachable monitor changes to a critical state and generates the Failed to Connect to Computer alert.
Since this is a diagnostic task which runs on a change to a degraded state, the Agent is pinged only once, when the Health Service Heartbeat Failure monitor changes to a critical state. If network-related problems arise after this monitor has changed to critical and the diagnostic task has run, there is no further monitoring of the ping status of this Agent, and no “Failed to Connect to Computer” alert will be generated.
We can understand the root cause better based on whether the Health Service Heartbeat Failure alert was generated along with the Failed to Connect to Computer alert. If the Health Service Heartbeat Failure alert was generated without the Failed to Connect to Computer alert, the issue is probably not a loss of network connectivity, nor a server that has shut down or become unresponsive. Both alerts together generally indicate that the server is completely unreachable due to a network outage, or that the server is down or unresponsive.
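This reasoning can be written down as a small lookup, shown here as a sketch. The function name and the wording of the returned hints are my own; the logic only restates the two cases above, plus the observation that the ping runs only as a diagnostic of the Heartbeat failure, so a lone Failed to Connect to Computer alert is not expected from this workflow.

```python
def likely_cause(heartbeat_failure_alert: bool,
                 failed_to_connect_alert: bool) -> str:
    """Rough triage hint based on which of the two alerts are present."""
    if heartbeat_failure_alert and failed_to_connect_alert:
        return "unreachable: network outage, or server down or unresponsive"
    if heartbeat_failure_alert:
        return "ping succeeded: probably not a network outage or a downed server"
    if failed_to_connect_alert:
        # The ping diagnostic only runs after a Heartbeat failure.
        return "unexpected: ping only runs after a Heartbeat failure"
    return "no alerts: nothing to triage"


print(likely_cause(True, True))
print(likely_cause(True, False))
```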
Check if Health Service is Running
This diagnostic is defined in the Operations Manager 2007 Agent Management Library and is enabled by default. This workflow uses the Automatic Agent Management Account, which runs under the context of the Management Server Action Account by default, to initiate QueryRemoteHS, a probe action defined in the Operations Manager 2007 Agent Management Library.
Specifically, this probe is initiated on the Health Service Watcher and queries Health Service state and configuration on the Agent, when the Health Service Heartbeat Failure monitor changes to a critical state. This probe module type is further defined in the Windows Core Library. It takes computer name and service name as configuration, and passes the query results through an expression filter and returns the startup type and current state of the Health Service.
If the service doesn't exist or the computer cannot be contacted, state will reflect this. Depending on the output of the diagnostic task, optional recovery workflows may be initiated (i.e., reinstall agent, enable and start Health Service, and continue Health Service if paused), but these recoveries are not enabled by default.
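As a loose illustration of how the diagnostic output might map to the optional recoveries, consider the sketch below. The pairing of each condition with a recovery is my assumption, inferred from the recovery names listed above rather than taken from the management pack; the function, its parameters, and the state strings are hypothetical.

```python
from typing import Optional


def suggest_recovery(reachable: bool,
                     service_exists: bool,
                     state: Optional[str],
                     recoveries_enabled: bool = False) -> Optional[str]:
    """Pick one of the optional recoveries named in the text, if any applies."""
    # The recoveries are disabled by default, and none can run against a
    # computer that cannot be contacted.
    if not recoveries_enabled or not reachable:
        return None
    if not service_exists:
        return "reinstall agent"
    if state == "paused":
        return "continue Health Service"
    if state == "stopped":
        return "enable and start Health Service"
    return None  # service is running, or the state is unknown


print(suggest_recovery(True, True, "paused", recoveries_enabled=True))
```

With the default `recoveries_enabled=False`, the function always returns `None`, mirroring the fact that these recoveries do nothing until someone enables them.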
Comments
Anonymous
January 01, 2003
HSW is from the perspective of the Root Management Server.
Anonymous
January 01, 2003
@Cristhian Without understanding much of your environment and more context around the problem, I would suggest adjusting the agent heartbeat interval. Typically, in all environments, I adjust the agent heartbeat interval from 60 to 180 seconds. The server heartbeat missed count is 3, by default, and I think 3 minutes is just too low of a threshold to generate alerts. In your case, you are also receiving failed to connect alerts, so it sounds like there are some network related issues. If this seems to happen consistently at a particular time, then I would probably ask your network team if they have observed any network related issues during that time. You can also view history of these types of alerts by running a SQL query against your data warehouse, which I have posted here: blogs.technet.com/.../heartbeat-failure-and-failed-to-connect-alerts-with-duration.aspx. Hope this sheds some light on the situation.
Anonymous
January 01, 2003
Magnus - this can be a bit tricky, but it's possible by implementing your own ping rule or monitor. What you could do is look at the current computer unreachable monitor (which is currently being used as a recovery task by setting its state), and leverage that knowledge to implement your own. This might be something I'll write about in the future, because I've seen this question come up a couple of times before. Thanks.
Anonymous
January 01, 2003
Hello Jonathan, I'm having the following situation: Almost every night some clients lose communication and generate the error: failed to connect to computer health service heartbeat failure but after some 5 minutes he returns to communicate, but the monitoring of availability still red and back to green only when I put the machine into maintenance mode and then shot dai works. Could you help me? Thank you!
Anonymous
January 01, 2003
Sridhar - to my knowledge, there has been no change to how the heartbeat failure and failed to connect workflows function in SCOM 2012. Whether the agent is in trusted, untrusted, in front or behind a gateway, the heartbeat and ping requests work the same. If you happen to not allow ICMP traffic from your management server(s) to your agent, then ICMP (failed to connect) will always follow a heartbeat failure.
Anonymous
February 02, 2011
Hello Jonathan, In section "Ping Computer on Heartbeat Failure", you mention "This probe is initiated on the Health Service Watcher. Since the Health Service Watcher is a perspective class hosted by the Root Management Server, this is where the WMI query is executed when the Health Service Heartbeat Failure monitor changes to a critical state." In section "Check if Health Service is Running", you mention "Specifically, this probe is initiated on the Health Service Watcher, which is the MS, and queries Health Service state and configuration on the Agent, when the Health Service Heartbeat Failure monitor changes to a critical state." So, is the Health Service Watcher hosted by the MS or the RMS? Thanks, Larry
Anonymous
March 07, 2011
Hello Jonathan, Don't know if this is the right "monitor" but how do you set up alerts (with a 15-minute interval, for example) which will monitor if a computer is down? And not as we have today, where this alert only comes once, and if someone closes this alert there will be no more until the Health Service State changes to okay. Earlier (before we used SCOM) we used a traditional ping based monitor to find out if a computer was up or not, and this worked pretty well. Thanks for any guidance in this matter, //Magnus
Anonymous
August 27, 2012
The comment has been removed
Anonymous
April 23, 2013
Jonathan, it would be great if you can shed some light on how this health service heartbeat failure works when agents are in an untrusted domain and Gateways are used. How does this work in SCOM 2012?