

“Net view” as a diagnostic used when the heartbeat is missing

This all started as a question about how “Computer Not Reachable” works. It was asked in a newsgroup, and I thought I would help and provide some insight. First, I would like to repeat that, in my opinion, basing “Computer Not Reachable” detection on a missing heartbeat while targeting “Health Service Watcher” is at least questionable and not very fortunate (it is almost like trying to find causality without a root-cause engine). Because of performance limitations (mainly, we did not want to end up with as many workflows doing ICMP ping, which is a blocking operation from the OpsMgr perspective, as there are computers in the topology, or we lacked an efficient targeting story for this kind of detection), we now have this “nasty beast”: we execute a diagnostic on the heartbeat monitor state change, which is later supposed to (through a set of recoveries) cause a monitor state change for “Computer Not Reachable”. This whole design was not very reliable in RTM, and I had to introduce some mitigations and reliability improvements in SP1 (which of course was not entirely positive, because now two alerts are raised instead of the one raised in RTM, but believe me, you want reliability :) ).

Regardless of my rant, I decided to show how to add a custom diagnostic for the Heartbeat monitor and, eventually, to answer the question by providing a management pack that sets the state of a custom monitor based on recovery output. This may all be advanced material; please comment or ask questions through this blog if something requires further explanation (I will not discuss this on the newsgroup; all of this information is about undocumented functionality which may break in any future release!).

 

Adding a custom diagnostic is really easy. In this case everything we need (the monitor and the target) is public and accessible outside the management pack where those elements are defined. In the end it is just a matter of adding the following raw XML to a custom management pack (assuming you have all MP references set properly).

 

<Diagnostic ID="Microsoft.SystemCenter.Community.Diagnostic.NetViewDiagnostic" Comment="In response to heartbeat failure, net view machine" Accessibility="Internal" Enabled="true" Target="SCLibrary!Microsoft.SystemCenter.HealthServiceWatcher" Monitor="SC2007!Microsoft.SystemCenter.HealthService.Heartbeat" ExecuteOnState="Error" Remotable="true" Timeout="300">
  <Category>Maintenance</Category>
  <ProbeAction ID="Command" TypeID="System!System.CommandExecuterProbe">
    <ApplicationName><![CDATA[%windir%\System32\net.exe]]></ApplicationName>
    <WorkingDirectory />
    <CommandLine>view $Target/Property[Type="SCLibrary!Microsoft.SystemCenter.HealthServiceWatcher"]/HealthServiceName$</CommandLine>
    <TimeoutSeconds>30</TimeoutSeconds>
    <RequireOutput>true</RequireOutput>
  </ProbeAction>
</Diagnostic>

 

Acting on the result is where the real fun starts! First I need to mention that “Computer Not Reachable” is defined as internal, so it is not accessible; a new monitor must be defined rather than disabling the original through an override included with the management pack. I will not say much about why it is an “aggregate” monitor. The only thing I will say is that in OpsMgr an aggregate monitor is the only kind of monitor without a workflow, and the runtime magically makes all the state changes. In our case we set the state through a recovery, so we do not want to “WASTE” a workflow (and there would be as many of them as there are monitored computers in the enterprise) if we never expect that workflow to set the state anyway. Also, there are some “magic” modules I WILL NOT DISCUSS (maybe ever, but we will see in the next release); those modules set the state of the monitor. (Some of you may be willing to do some reverse engineering, and you may get an idea of how we set the state to critical when the recovery result reports that the “net view” command failed, and how we set the state when the command succeeded.) So here is the recap through screenshots:

 

1. After import, the monitor state is NEVER set until the net view command is executed

new computer has no state set

2. When the RMS recognizes that the heartbeat is missing, we execute the “net view” command inside the diagnostic. When net view succeeds, we set the state of the monitor to “Healthy”; otherwise we set it to “Critical”

net view succeeded 

success state change

net view failed

failure state change

Attached is a management pack that provides this functionality. It may be used AS IS; it confers no rights and comes with no support. Enjoy!

Microsoft.SystemCenter.Community.Availability.xml

Comments

  • Anonymous
    July 25, 2008
    Thanks again Marius for this. I have been testing it and the only thing I am finding is that it won't auto resolve itself and that I can't seem to reset the monitor itself, making it remain at critical. Any ideas?

  • Anonymous
    July 25, 2008
    Ok, scratch that last comment about it not auto resolving and the monitor resetting. It is doing it now. Thanks so much Marius. :)

  • Anonymous
    August 07, 2008
    Ok, I am having problems with this now. Like I commented earlier, the monitor is auto resolving, but many times it isn't. Sometimes it auto resolves within a few minutes, a couple of times it resolved itself an hour or two later, but more often than not right now, it isn't auto resolving at all until I go to the Health Explorer, go to the monitor, and run the diagnostic task. Any ideas on what is wrong or how to improve it? It was working beautifully when it first came out, but now it seems to have broken. On another note, and less important, how would I go about replacing the net view command with a script? I've tried many things, but keep getting errors when I try importing it. Thanks again, Matthew