Root cause analysis: Corrupted WMI
The Big question
"Who Broke the WMI and I need an RCA!!!"
This is a common question among server administrators and end users when they run into a broken or corrupted WMI. The most common cases where we see this are:
- System Center health client is not able to monitor and report the status of a monitored Client/Server
- Server not accessible on the network (NIC missing under Network Connections)
- MSINFO32 / BGINFO does not pull any reports
- Unable to run WMI related scripts locally or remotely
- Remotely unable to run WMI query to fetch information about a server
- Remote monitoring software fails to work
- Installation fails for applications that depend on WMI, such as disk management software or backup software
- The event logs show a lot of WMI events when you start investigating an issue
Is there a Root Cause analysis of the WMI Corruption that occurred?
Well, the straightforward answer is "NO". You cannot find an RCA for a corrupted WMI unless you are able to reproduce the issue consistently.
But Why?
Due to the complex nature of the database/repository that WMI maintains, once corruption occurs it cannot be traced back. The information is stored in a repository much like a database, and when a malfunction or an unexpected WMI action occurs, WMI is not intelligent enough to correct it. Rather, it accepts the change and becomes a victim.
Also, WMI does not maintain a backup of the repository. Once the repository is corrupt, the only options are to try recovering the namespace details by recompiling the available MOF files, or to perform a repository reset (a rebuild of the repository, which should always be considered the last option).
Note: A manual backup of the repository is possible, though, and taking one is advised if you want to play around with WMI.
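The options above map onto the built-in `winmgmt` and `mofcomp` command-line tools. The following is a minimal sketch in Python, not a definitive repair procedure. It only builds the command lines, ordered from least to most destructive, so they can be reviewed before anything is run. The backup path and the function names are hypothetical, and the `/verifyrepository` step is an addition that only checks consistency and changes nothing.

```python
import subprocess

# Hypothetical backup location, chosen for illustration only.
DEFAULT_BACKUP = r"C:\Temp\wmi-repository.bak"


def backup_command(backup_file=DEFAULT_BACKUP):
    """Build the command that takes a manual backup of the repository."""
    return ["winmgmt", "/backup", backup_file]


def recovery_commands(mof_files=()):
    """Build the recovery steps, from least to most destructive."""
    steps = [["winmgmt", "/verifyrepository"]]         # read-only consistency check
    steps += [["mofcomp", mof] for mof in mof_files]   # recompile the available MOF files
    steps.append(["winmgmt", "/resetrepository"])      # last option: rebuild the repository
    return steps


def run(command):
    """Run one step (elevated prompt on Windows) and return its exit code."""
    return subprocess.run(command, check=False).returncode
```

Printing the result of `recovery_commands(["example.mof"])` (a hypothetical file name) lists the steps in the order they would be tried. Each step is meant to be run and checked on its own, so the reset is reached only if the earlier steps do not help.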
Guess the Culprit?
Below are a few common scenarios that you can suspect of causing the issue.
This list is most helpful when you can reproduce the problem, although being able to reproduce it also means you already know who did it.
The one thing these scenarios have in common is that they frequently read from, modify, or write to the WMI repository.
- Any application that installs its own MOF files and adds its namespaces to the WMI repository during installation.
- Any application whose uninstallation attempts to remove its WMI namespace information from the repository.
Examples include the SMS client and server monitoring applications from hardware vendors or software vendors.
- RSoP logging, if enabled, which adds a large amount of user-related and computer-related information to the WMI namespace each time users log in to the systems.
- Hardening processes or software that modify the permissions on the repository folders or on WMI-related files and folder paths.
These are the major contributors, to name a few; this is not an exhaustive list. Windows continues to stabilise WMI as a component and to add more logging capabilities. The purpose of this blog is just to help you understand whether it is worth spending time on an RCA for WMI corruption, and whether the cause can be understood at all.
How to troubleshoot
You can enable WMI tracing, which is built into later OS versions, to help you track what accessed WMI before it broke.
However, it is similar to the event logs: you need to dig through a lot of events and work back to the time of the issue to narrow it down.
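As a rough illustration of that narrowing-down step, here is a small Python sketch. It assumes a Windows version that ships the `Microsoft-Windows-WMI-Activity/Trace` event channel, which can be switched on with `wevtutil`. It also assumes the events have already been exported and parsed into dictionaries with a `time` field. That record layout, the `client` field, and the function names are hypothetical and used only for this example.

```python
from datetime import datetime, timedelta

# Event channel used for WMI tracing on later Windows versions.
TRACE_LOG = "Microsoft-Windows-WMI-Activity/Trace"


def enable_trace_command(enabled=True):
    """Build the wevtutil command that switches the WMI trace log on or off."""
    return ["wevtutil", "sl", TRACE_LOG, "/e:true" if enabled else "/e:false"]


def events_before(events, incident_time, window_minutes=30):
    """Keep events logged in the window leading up to the incident, oldest first.

    `events` is a list of dicts with a datetime under "time" (hypothetical layout).
    """
    start = incident_time - timedelta(minutes=window_minutes)
    selected = [e for e in events if start <= e["time"] <= incident_time]
    return sorted(selected, key=lambda e: e["time"])
```

Calling `events_before(events, datetime(2024, 1, 1, 12, 0))` keeps only the records from the 30 minutes before noon, which is usually a much shorter list to read than the full log. The window size is an arbitrary default; widen it if the corruption may have gone unnoticed for some time.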
Repairing and resetting WMI/DCOM does not fall within the scope of this post, so you can refer to the following blogs and posts that cover it.
An excellent FAQ-style article with lots of related details: http://blogs.msdn.com/b/wmi/archive/2006/05/12/596266.aspx
Rebuild WMI: http://blogs.technet.com/b/askperf/archive/2009/04/13/wmi-rebuilding-the-wmi-repository.aspx