SCOM Troubleshooting: Management server frequently greying out
A management server can grey out for several reasons, but while troubleshooting I’ve noticed one fix that works almost every time, when it applies.
Problem
All of a sudden, the management server(s) grey out. You check the services, and they are all running. For good measure, you restart them anyway, but to no avail. You then try flushing the health state folder cache on the affected MS, and sure enough, the MS becomes healthy again.
But after some time you notice that the MS has greyed out again. You flush the cache once more, it turns green, and after a while it goes grey again. This cycle continues.
In the event log you may see several events and not be sure where to start. Any of them could be the actual cause of the problem, or merely a consequence of it. That’s why you need to read carefully through each one and work out which event is the problem and which ones are the consequences.
The event we’re discussing here is one particular event, ID 4502. This event ID is logged for a number of different reasons and with different descriptions. The one we’re looking for goes something like this (sample only; your description will differ accordingly):
A module of type "Microsoft.EnterpriseManagement.Mom.Modules.SubscriptionDataSource.InstanceSpaceSubscriptionDataSource" reported an exception System.ArgumentNullException: Value cannot be null.
Parameter name: value
at System.Collections.CollectionBase.OnValidate(Object value)
at System.Collections.CollectionBase.System.Collections.IList.Add(Object value)
at Microsoft.EnterpriseManagement.Mom.Modules.SubscriptionDataSource.HttpRESTClient.PostDataAsync(Byte[] data, Object context)
at Microsoft.EnterpriseManagement.Mom.Modules.SubscriptionDataSource.SubscriptionDataSource`2.WriteToCloud(List`1 items, DateTime firstTryDateTime)
at Microsoft.EnterpriseManagement.Mom.Modules.SubscriptionDataSource.SubscriptionDataSource`2.PostAsync(List`1 items, DateTime firstTryDateTime) which was running as part of rule "Microsoft.SystemCenter.CollectInstanceSpace" running for instance "All Management Servers Resource Pool" with id:"{4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}" in management group "MG".
These events may come in conjunction with several others, but I like to fix this one first, as it solves the problem most of the time.
Analysis
The event might seem cryptic at first, especially if you aren’t used to troubleshooting, but it provides a valuable piece of information.
Note the last line of the description. It says:
"which was running as part of rule "Microsoft.SystemCenter.CollectInstanceSpace" running for instance "All Management Servers Resource Pool" with id:"{4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}" in management group "MG"."
Here you get some interesting information: exactly which rule/monitor is failing, and for which instance it is running.
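If you have many of these events to go through, pulling those fields out by hand gets tedious. The snippet below is a minimal sketch in Python that extracts the rule, instance, instance ID and management group from a description worded like the sample above. The pattern and the function name parse_workflow_failure are my own assumptions for illustration, not part of SCOM, and descriptions for monitors or other event variants may be worded differently.

```python
import re

# Matches the trailing part of the event 4502 description shown above.
# Assumption: the wording follows the sample; other variants may not match.
WORKFLOW_RE = re.compile(
    r'running as part of rule "(?P<rule>[^"]+)"\s+'
    r'running for instance "(?P<instance>[^"]+)"\s+'
    r'with id:"(?P<instance_id>\{[0-9A-Fa-f-]+\})"\s+'
    r'in management group "(?P<management_group>[^"]+)"'
)


def parse_workflow_failure(description):
    """Return a dict with rule, instance, instance_id and management_group,
    or None if the description does not match the expected wording."""
    match = WORKFLOW_RE.search(description)
    return match.groupdict() if match else None


# The last line of the sample event description from this post.
sample = (
    'which was running as part of rule "Microsoft.SystemCenter.CollectInstanceSpace" '
    'running for instance "All Management Servers Resource Pool" '
    'with id:"{4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}" in management group "MG".'
)

info = parse_workflow_failure(sample)
print(info["rule"])
print(info["instance"])
```

Returning None for anything that does not match keeps unrelated 4502 variants from being mistaken for a failing workflow.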
Ok, so we have the rule ID and the target. The rule is “Microsoft.SystemCenter.CollectInstanceSpace”. A quick glance at the System Center Wiki tells me that the display name of this rule is “Send Instancespace to the Cloud” and that it is a “System rule that sends instancespace up to the cloud.”
So what happens here is that the rule runs at its scheduled interval and fails. This causes the MS it’s running on to go grey. When you re-initialize the cache on the MS, everything is reset, and the MS becomes green. Then the rule runs at its interval and fails again, the MS goes grey again, and the cycle goes on.
Resolution
Ok, so now we have some solid information to work on.
Grab the rule name and find it under Rules in your console.
Once you do, take a look at its properties. You’ll see exactly what it is doing, any overrides, which MP it comes from, etc.
Now that you’ve found the rule that is the root of the problem, disable it. Then go back and flush the cache on the MS again. While it downloads the configuration again, keep an eye on the event log for any errors.
If the MS becomes and remains green, we’re done! If it goes back to grey, follow the process all over again, until there are no more failing workflows from rules/monitors causing the MS to go grey.
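When you have to repeat the process, it can help to tally which rules show up most often in the event descriptions, so you know which one to look at next. Here is a rough sketch in Python, assuming you have already exported the descriptions as plain strings (how you export them is up to you). The function name count_failing_rules, the shortened descriptions and the rule “Example.Hypothetical.Rule” are made up for illustration.

```python
import re
from collections import Counter

# Assumption: failing workflows are reported with the same
# 'running as part of rule "..."' wording as the sample event.
RULE_RE = re.compile(r'running as part of rule "([^"]+)"')


def count_failing_rules(descriptions):
    """Count how often each rule name appears across event descriptions."""
    counts = Counter()
    for text in descriptions:
        match = RULE_RE.search(text)
        if match:
            counts[match.group(1)] += 1
    return counts


# Shortened, hypothetical descriptions; real ones are much longer.
descriptions = [
    'exception A, running as part of rule "Microsoft.SystemCenter.CollectInstanceSpace" '
    'running for instance "All Management Servers Resource Pool"',
    'exception B, running as part of rule "Example.Hypothetical.Rule" '
    'running for instance "MS01"',
    'exception A, running as part of rule "Microsoft.SystemCenter.CollectInstanceSpace" '
    'running for instance "All Management Servers Resource Pool"',
    'An unrelated event description',
]

counts = count_failing_rules(descriptions)
for rule, count in counts.most_common():
    print(rule, count)
```

A high count only tells you where to look first. It does not by itself prove that rule is the cause rather than a consequence.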
One step further: if you notice that all these rules/monitors are from the same MP, chances are that the MP has been corrupted, and you may want to remove or update it.
Note that although this might solve your problem, a failing rule may not be the only thing causing the issue. For example, poorly performing databases can also result in this problem. So if the problem persists, look for other relevant events that might give you a hint.