Doing "Math" on Performance Counters in SCOM
If you look at the dashboard of a car, most of the instruments have obvious good and bad values, and the driver needs no additional information to interpret them. Ideally performance counters would work the same way, but that isn't always the case. Here are some real-world examples I've encountered where the data provided by the performance counter wasn't enough by itself to make a value judgment.
- Is the number of 404's per second on a web server excessive?
- I need to know that as a percentage of the total method requests per second
- Is a network adapter saturated?
- I need to convert the Bytes/sec performance counters into Bits/sec and then divide by the bandwidth.
- Is a single process using too much CPU time on a server?
- I need to divide the Process\% Processor Time counter by the number of logical cores in the machine (assuming the process is multi-threaded)
- Is a single process using too much memory on a server?
- I need to divide the working set of the process by the server's physical memory
- Is the non-paged pool approaching its limit?
- Okay, I'm dating myself to Windows 2003 here, but you get the idea.
So how do we deal with a situation where we need to monitor the health of a performance counter, but the information provided by that performance counter is not independently adequate to make a judgment?
In my Library MP I am providing:
Two Probe Actions
- Microsoft.PerfCounter.Two.Counters.Probe - This probe runs a PowerShell script that queries two performance counters and returns their values in a Property Bag.
- Microsoft.PerfCounter.Create.PerfData.Probe - This probe runs a PowerShell script that divides two numbers, and multiplies the result by a third. If any of the inputs are 0.0, it will return 0.0 without risking division by zero.
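The divide-then-multiply behavior described for Microsoft.PerfCounter.Create.PerfData.Probe can be sketched as follows. The real probe is a PowerShell script whose source is not shown here, so this is only an illustration in Python of the rule as stated above; the function name `create_perf_value` and its parameter names are hypothetical.

```python
def create_perf_value(numerator: float, denominator: float, multiplier: float) -> float:
    """Return (numerator / denominator) * multiplier.

    Mirrors the rule described for the probe: if any input is 0.0,
    return 0.0 instead of attempting the division.
    The name and signature are assumptions made for this sketch.
    """
    if numerator == 0.0 or denominator == 0.0 or multiplier == 0.0:
        return 0.0
    return (numerator / denominator) * multiplier


# A zero denominator yields 0.0 rather than raising ZeroDivisionError.
idle_site = create_perf_value(0.0, 0.0, 100.0)

# 25 of 50 requests per second expressed as a percentage.
busy_site = create_perf_value(25.0, 50.0, 100.0)

print(idle_site, busy_site)
```

Returning 0.0 for a zero input keeps the workflow producing a usable performance value even when, for example, a web site is receiving no requests at all.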
Two Data Sources
- Microsoft.PerfCounter.One.Counter.Plus.Denominator.Perf.DS - This data source uses System.Performance.DataProvider from the System.Performance.Library to sample a single performance counter. The results are then passed to Microsoft.PerfCounter.Create.PerfData.Probe.
- Microsoft.PerfCounter.Two.Counters.Perf.DS - This data source uses a PowerShell script to query two performance counters. The results are then passed to Microsoft.PerfCounter.Create.PerfData.Probe.
Six Monitor Types
- Microsoft.PerfCounter.Math.One.AverageThreshold.MonitorType
- Microsoft.PerfCounter.Math.One.ConsecutiveThreshold.MonitorType
- Microsoft.PerfCounter.Math.One.SimpleThreshold.MonitorType
- Microsoft.PerfCounter.Math.Two.AverageThreshold.MonitorType
- Microsoft.PerfCounter.Math.Two.ConsecutiveThreshold.MonitorType
- Microsoft.PerfCounter.Math.Two.SimpleThreshold.MonitorType
The monitor types are deliberately designed to closely match some types you will find in System.Performance.Library.
- System.Performance.AverageThreshold
- System.Performance.ConsecutiveSamplesThreshold
- System.Performance.ThresholdMonitorType
CAUTION! - There are several things you should be aware of before using this Library.
- Using either of the data sources will result in a PowerShell script being run that "does math." Having a SCOM agent run a few scripts at a slow interval should not cause any problems. But if you use a very short sampling interval and/or query a large number of performance counters, you do risk impacting agent performance.
- To sample non-instantaneous performance counters, Microsoft.PerfCounter.Two.Counters.Probe sleeps for three seconds so that it can take two samples. As with the previous caution, running a large number of scripts can impact the agent. In addition, the data is only accurate for the three-second window in which the samples were taken. If your performance counter has large, brief spikes, they might be missed.
- I do not allow you to use the "All Instances" feature of System.Performance.DataProvider and I do not allow the two counter probe to query all instances of a counter.
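To illustrate why a non-instantaneous counter needs two samples and a pause, here is a rough Python sketch of the general idea: read a raw value, wait, read it again, and divide the difference by the elapsed time. This is not the Library's PowerShell script, and it may not match how that script obtains its samples; `sample_rate`, `read_raw`, and the injectable `sleep` and `clock` parameters are all hypothetical names introduced for this example.

```python
import time
from typing import Callable


def sample_rate(
    read_raw: Callable[[], float],
    window_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Estimate a per-second rate from two raw readings taken window_seconds apart."""
    first_value, first_time = read_raw(), clock()
    sleep(window_seconds)  # the agent is blocked for the whole window
    second_value, second_time = read_raw(), clock()
    elapsed = second_time - first_time
    if elapsed <= 0.0:
        return 0.0
    return (second_value - first_value) / elapsed


# Demonstration with a fake counter and a fake clock so nothing actually sleeps.
fake = {"now": 0.0, "total": 0.0}


def fake_sleep(seconds: float) -> None:
    fake["now"] += seconds
    fake["total"] += 100.0 * seconds  # the counter grows by 100 units per second


rate = sample_rate(lambda: fake["total"], 3.0, sleep=fake_sleep, clock=lambda: fake["now"])
print(rate)
```

The result describes only the window between the two reads. Anything the counter did before the first read or after the second is invisible, which is why brief spikes can be missed.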
So what are some real-world examples of how this can be both safe and useful? For your convenience I have included an example MP that implements these examples. I have also included both sealed and unsealed copies of the Library.
- Collect uptime in days - ((System()\System UpTime / 86400.0) * 1.0)
- Collect the private bytes of a process as a percentage of the system commit limit - ((Process(TestLimit)\Private Bytes / Memory()\Commit Limit) * 100.0)
- Alert when Network Adapter is more than 50% used - ((NetworkAdapter(<object property>)\Bytes Received/sec / NetworkAdapter(<same object>)\Current Bandwidth) * 800.0)
- Alert when a process is using more than 20% of total CPU on the system - ((Process(CPUStress)\% Processor Time / <Logical Cores>) * 1.0)
- Alert when 404/sec > 50% of all web requests - ((Web Service(Default Web Site)\Not Found Errors/sec / Web Service(Default Web Site)\Total Method Requests/sec) * 100.0)
NOTE: You can get CPUStress and TestLimit here: https://live.sysinternals.com/Files
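To make the arithmetic in the examples above concrete, here is a short Python sketch that evaluates four of the expressions with made-up readings. Every number below is invented for illustration and does not come from a real server. Python is used here only as a calculator; the Library itself does this work in PowerShell.

```python
# Illustrative readings only (assumptions, not measured data).
uptime_seconds = 432_000.0             # System uptime counter, in seconds
bytes_received_per_sec = 6_250_000.0   # network adapter Bytes Received/sec
current_bandwidth_bits = 100_000_000.0 # Current Bandwidth, bits/sec (100 Mbit/s)
process_cpu_percent = 160.0            # Process % Processor Time; can exceed 100 on multi-core
logical_cores = 8.0
not_found_per_sec = 25.0               # Not Found Errors/sec
total_requests_per_sec = 50.0          # Total Method Requests/sec

# (numerator / denominator) * multiplier, as in the list above.
uptime_days = (uptime_seconds / 86400.0) * 1.0

# 800.0 = 8 bits per byte * 100 to express the ratio as a percentage.
network_percent = (bytes_received_per_sec / current_bandwidth_bits) * 800.0

# Dividing by the logical core count rescales the value to 0-100 for the whole system.
cpu_percent_of_system = (process_cpu_percent / logical_cores) * 1.0

not_found_percent = (not_found_per_sec / total_requests_per_sec) * 100.0

print(uptime_days, network_percent, cpu_percent_of_system, not_found_percent)
```

With these sample inputs, the adapter sits exactly at the 50% threshold and the process sits exactly at the 20% threshold, which shows how the multiplier turns a raw ratio into a number that can be compared directly against an alert threshold.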