Alert notification troubleshooting in System Center Operations Manager
We frequently hear from customers using OpsMgr 2007 or OpsMgr 2012 who report that they are not getting alerts on agent availability when a server has been shut down. Often what they actually mean is that they are not being notified, so you first need to determine whether the alerts are being generated before troubleshooting the notifications. Here is a quick rundown of how agent availability monitoring and notifications work, along with some troubleshooting tips.
There is a distinct difference between not receiving an alert and not receiving a notification on an alert (e.g. e-mail, text, instant message). Let’s start with troubleshooting the alerts, since we will not receive a notification if we don’t first receive an alert. Generating the alerts is where most of the work happens, but if you are already getting these alerts you can skip this section and go on to troubleshooting the notification problems.
Alert Basics
As you probably already know, there are some global settings that determine whether or not an agent is heartbeating and these are found in the Administration workspace of the console in the Settings node. The settings are as follows:
- Agent -> Heartbeat: tells the agent how often to send a heartbeat, and tells the management server how often to expect one from the agents it is monitoring (e.g. 60 seconds)
- Server -> Heartbeat: tells the management server how many heartbeats can be missed before it pings the agent to test availability (e.g. 3)
The example values above are the defaults. With them, if the management server registers four missed heartbeats, an alert (Health Service Heartbeat Failure) is generated against the health service on the agent computer, indicating that it is no longer available. In this example, that means waiting at least 4 minutes (four 60-second intervals) before expecting the heartbeat failure alert. The management server then attempts to diagnose the problem by pinging the agent computer. If the ping is unsuccessful, another alert is generated indicating that the computer is no longer reachable (Computer Not Reachable). If the initial diagnostic ping is successful, no further action is taken.
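The arithmetic behind the 4-minute figure can be sketched as follows. This is only an illustration of the rule of thumb used in this article (heartbeat interval x (number of missed heartbeats + 1)); the function name and parameter names are made up for the example and are not part of Operations Manager.

```python
def min_wait_seconds(heartbeat_interval_s: int, missed_heartbeats_allowed: int) -> int:
    """Minimum time to wait before expecting a Health Service Heartbeat Failure alert.

    Applies the rule of thumb from this article:
    heartbeat interval x (number of missed heartbeats + 1).
    Both arguments are the values shown in the Settings node (or their overrides).
    """
    if heartbeat_interval_s <= 0:
        raise ValueError("heartbeat interval must be positive")
    if missed_heartbeats_allowed < 0:
        raise ValueError("number of missed heartbeats cannot be negative")
    return heartbeat_interval_s * (missed_heartbeats_allowed + 1)


if __name__ == "__main__":
    # Default settings: 60-second heartbeat, 3 missed heartbeats allowed.
    wait = min_wait_seconds(60, 3)
    print(f"{wait} seconds ({wait / 60:g} minutes)")
```

If either setting has been overridden for a particular agent or management server, plug in the overridden value rather than the global one.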
There are two monitors that we are concerned with at this point. To see them you should be in the Authoring workspace of the console in the Management Pack Objects->Monitors node and be scoped to the Health Service Watcher. The monitors are as follows with their corresponding paths:
- Health Service Watcher > Entity Health > Availability > Computer Not Reachable
- Health Service Watcher > Entity Health > Availability > Health Service Heartbeat Failure
Note: Although the heartbeat interval and the number of missed heartbeats are configured at a global level and thus affect every agent and management server in the management group, the number of missed heartbeats can be overridden at the management server level and the heartbeat interval can be overridden at the agent level. To check for overrides, open the properties of the management server or agent in the Device Management node of the Administration workspace. Also, both of these monitors are disabled by default for client computers (e.g. XP, Vista), but in most cases the missing alert or notification involves a server computer.
Troubleshooting Alerts
If you are not receiving the Health Service Heartbeat Failure alert after waiting the minimum time (heartbeat interval x (number of missed heartbeats + 1)), there are a few things you can check. When you stop the health service on an agent, both its management server and the RMS log a 20022 event in the Operations Manager event log, and a Health Service Heartbeat Failure alert is raised. The agent also appears grayed out in the Agent Managed node of the Administration workspace, and the Health Service Watcher shows Critical.
At this point, open the Monitoring workspace of the console and click the Discovered Inventory node. In the Action pane on the far right, choose Change Target Type -> View All Targets and select Health Service Watcher. The Discovered Inventory node now displays the health state of all discovered instances of the Health Service Watcher class, which is what the monitors above are targeted to. Find your agent in the list; it should show a status of Critical. If you click it and choose Health Explorer, you will see the critical status for the monitors above. If all of this looks as it should but you are still not receiving the Health Service Heartbeat Failure alert, check the following:
1) Make sure the global heartbeat settings are not set so high that you are expecting an alert before enough time has passed to trigger one. Confirm these settings by reviewing the following TechNet article:
Heartbeat and Heartbeat Failure Settings in Operations Manager 2007
2) The monitor that triggers this alert is in the System Center Core Monitoring MP and is targeted to the Health Service Watcher class. It has a default override to not generate alerts for Windows client computers (XP, Vista...) but alerts should be triggered by default for Windows server agents. If there are overrides other than this one you should consider those as a possible reason for not receiving the alerts.
3) Always check the Discovered Inventory view targeted to the Health Service Watcher class. If the watcher for the agent isn’t being monitored (it is missing or still shows healthy), this may be why you’re not getting the alert.
4) If your RMS is clustered, you must ensure that you have the “Use Network Name for Computer Name” option checked on the parameters tab of each of the clustered services. After checking this you should move the group to the other node to restart the services.
Typically, if you are getting the Health Service Heartbeat Failure alerts, you should also receive the Computer Not Reachable alert once the minimum time based on your heartbeat settings has passed. This monitor is a basic ping test and alerts if a ping is not returned successfully.
Notification Issues
If you are receiving the alerts in the Operations Manager console but are not being notified consistently, the issue may be with the channel in use (e.g. e-mail, text or IM) rather than with the notification workflows in OpsMgr 2007 itself. That is, OpsMgr 2007 is working, but for some other reason the e-mail didn’t get sent properly. In OpsMgr 2007 we can test notifications without relying on e-mail or instant messaging, to make sure that some internal process isn’t failing, by using a command as the notification channel. In the steps below we will use a command channel that writes each notification to a text file.
1) In the Settings node of the Administration workspace, select Notification and then the Command tab. Add a new Notification Command channel.
2) For the name, type any name. In Full path, type cmd.exe
3) In Command line parameters, type the following:
/c date /T >>c:\notification.log & Time /T >> c:\notification.log & ECHO SCOM notification >> c:\notification.log
Note that you can also use SCOM variables to insert alert details into the text file. As written, the command appends the current date, the current time, and the text “SCOM notification” to the file.
4) Set Initial Directory to c:
5) Click Apply, then OK.
6) In the Notification node, create a new recipient.
7) On the General tab, under Display name, type any name for the recipient. Select Always send notifications.
8) Under Notification devices, add a new device.
9) For Notification Channel, select the command channel created previously.
10) In the Delivery address for the selected channel, type a letter. It won’t be used in this case.
11) Select Always send notifications.
12) Type a name for the device.
13) Create a new subscription and select the recipient created earlier.
14) Click Next and accept the defaults until you get to the Alert Criteria. Here, select the criteria for which you want a notification created. In our case, we selected all boxes so that notifications arrive quickly. Then click Next twice, then Finish.
15) After this, you can restart the SCOM services to generate alerts quickly. It may take a few minutes for the log file to be created and the lines to be written.
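Once the command channel has been running for a while, a short script can help judge whether notifications are firing consistently. The following is a minimal sketch, assuming the log was written by the exact command shown above, so that each notification appends three lines in order: the date, the time, and the text "SCOM notification". The function name is made up for this example, and the date and time are kept as plain strings because their format depends on the regional settings of the server.

```python
from pathlib import Path
from typing import List, Tuple

# Marker text written by the ECHO part of the command channel above.
MARKER = "SCOM notification"


def read_notifications(lines: List[str]) -> List[Tuple[str, str]]:
    """Return one (date, time) pair of strings per logged notification.

    Blank lines and surrounding whitespace are ignored (ECHO leaves a
    trailing space before the redirect). A marker line without two
    preceding lines is skipped rather than guessed at.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    entries = []
    for i, line in enumerate(cleaned):
        if line == MARKER and i >= 2:
            entries.append((cleaned[i - 2], cleaned[i - 1]))
    return entries


if __name__ == "__main__":
    log_path = Path(r"c:\notification.log")  # path used in the command above
    if log_path.exists():
        found = read_notifications(log_path.read_text().splitlines())
        print(f"{len(found)} notification(s) logged")
        for date_text, time_text in found:
            print(date_text, time_text)
    else:
        print(f"{log_path} not found; has the subscription fired yet?")
```

Comparing the logged times against the alerts shown in the console gives a quick view of whether every alert produced a notification.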
Hopefully this gets you to a point where you at least know whether this is an alerting issue or a notification issue, and gives you an idea of how to approach troubleshooting each. If you determine that this is only a notification problem and the test notification works consistently, the next step is to determine why the specific notification channels outside of Operations Manager are not sending the notifications out.
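The overall decision flow in this article can be summarized in a few lines of code. This is only a sketch of the reasoning above, not a diagnostic tool; the function name, parameters, and returned strings are all hypothetical.

```python
def triage(alert_in_console: bool,
           notification_received: bool,
           command_channel_logged: bool) -> str:
    """Suggest where to look next, following the reasoning in this article.

    alert_in_console:       the alert shows up in the Operations Manager console
    notification_received:  the e-mail, text or IM actually arrived
    command_channel_logged: the test command channel wrote to its log file
    """
    if not alert_in_console:
        # No alert means no notification, so start with the alert itself.
        return "alerting issue: check heartbeat settings, overrides and the Health Service Watcher"
    if notification_received:
        return "no issue: alert and notification both arrived"
    if command_channel_logged:
        # OpsMgr produced a notification, so the problem is further downstream.
        return "channel issue: investigate the e-mail, text or IM channel outside Operations Manager"
    return "notification issue: investigate the notification workflows inside Operations Manager"


if __name__ == "__main__":
    # Alert is visible, no e-mail arrived, but the command channel logged it.
    print(triage(alert_in_console=True,
                 notification_received=False,
                 command_channel_logged=True))
```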
Dan Johnson | System Center Support Escalation Engineer
Comments
Anonymous
September 09, 2010
Does this work only for agentless managed, or both? I have agent managed ones, but I'm not getting the alert.
Anonymous
October 26, 2010
I am having an issue where I uninstall an agent from the console, then continue to receive notifications for that system. I have tried removing the agent from the SQL db and have verified that it is no longer in the console. Any suggestions?
Anonymous
December 03, 2010
So how do you set notification for a reboot? Most servers will be back online in four minutes (by the fourth check: 60 second heartbeat + 3 consecutive missed) and thus you wouldn't get any alerts for a reboot, only if other services are monitored. This is a basic request most admins would want, and the only solutions I've seen are A) create rules to look in the event logs or B) enable the PING STATUS monitor (Windows Server MP under Availability), which is disabled by default.
Anonymous
December 10, 2010
Ok. That works. My question is how do you put the alerts into a log file? What I want to do is send that log file using blat as a second source of alerts that are not critical. For example, below 1.5 GB of free space will send an alert, but not one that will cause panic. If the disk space is below 250MB then we get the alert. I am just figuring out how to send separate notifications with a different notification channel. Any suggestions? Thank you.
Anonymous
January 04, 2011
We're on SCOM 2007 R2 and I've configured the command channel notification as described above as we face intermittent issues with the notifications, but I see only the following lines in the notification.log:
SCOM notification
Tue 01/04/2011
07:24 AM
Will it not give detailed info for each & every alert being notified?
Anonymous
January 12, 2012
What group do users have to be in to be able to view the alerts from the monitor? I set up a view for our helpdesk for servers that are down but only admins can see them. I haven't found the group to enable monitoring for that makes the alerts viewable in their console.
Anonymous
February 11, 2012
This problem is because of the SCOM certificate. It means the SCOM certificate failed to import from SCOM to the server on which the error occurred. So we should delete the old certificate of that server in SCOM and create a new certificate. Then the problem will be solved.