The case of the MIF and WMI repository
If you have spent much time around SCCM engineers – at least those of us from Microsoft – you will have heard us decry customers deleting the WMI repository to fix issues with WMI. It is true that deleting the repository often seems to fix an issue, but it can often cause others. In the early days, SMS and the OS were the only parties making use of WMI – not so any longer. More and more applications store data in and retrieve data from WMI, so before rebuilding it – which is what a repository deletion does – you have to factor in whether those other applications will be OK. I recently ran across an issue where I tried long and hard to avoid deleting the repository but, ultimately, delete it I did – and that was the only real fix.
The problem looked simple enough. A good number of hardware inventory MIFs were failing to process. Looking in dataldr.log, the error was clear:
CMachineSource::InsertMachine - Top Level group not found. All MIFs require a group that is the same name as the architecture.
OK, but what does this error actually mean? From the error I initially thought there was some sort of mismatch for this machine in the database. Architecture mapping info starts in the architecturemap table, so this seemed like a good place to start checking, but all checks there showed no problems in the database: the affected systems all mapped to the correct architecture. To confirm that no issues existed with the database we tried pulling the MIF from the child site and processing it directly on the central site – the same error was logged.
Looking through Bing search results on the error revealed that others resolved the issue by deleting the offending system from the database all the way from child to central, which didn’t make much sense to me based on what I was seeing. I don’t care for solutions with no explanation and didn’t think it would work, but we gave it a go, without success. Other folks solved the problem by reinstalling the client on the offending machine, but no reason was given why this fixed the problem. Still others fixed the issue by rebuilding the WMI repository. As mentioned above, rebuilding the WMI repository is just not a good solution, not to mention how invasive this would be across the number of clients exhibiting the problem. Also, I’m not one to pull the trigger on a shotgun approach like this (pun intended), so I kept digging. I wanted to understand what was actually causing the problem, particularly given that nobody had posted an explanation of what was really going on.
As a test, I pulled the WMI repository from a system exhibiting the issue and put it in place on a system in my test lab (same OS and service pack level). Sure enough, I started seeing the problem too. This narrowed things down: we had now confirmed that the issue was WMI corruption, yet all WMI checks I ran returned a consistent repository. So what was the problem with WMI? With this information we returned to an examination of the MIF and another look at the error. To really understand what was happening I had to dig into the code a bit, and when I did I realized that when dataldr loads a MIF it specifically looks in the header for the architecture type – in this case system.
Once the architecture type is known the rest of the MIF is scanned for a section that defines that architecture type – in this case dataldr was looking for a class named system, yet none was found. THAT is what is causing the error!
A sample of what a group definition looks like in a MIF is below. Note that the section that dataldr is looking for would have SYSTEM as the class name instead of MICROSOFT|Workstation_STATUS|1.0. In the problem case, the MIF did not contain any section defining the SYSTEM class. Other MIFs that were processing DID have this section so now we were onto the problem!
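To make the check concrete, here is a minimal Python sketch of the idea: collect the class name of every group in a MIF and see whether any of them matches the architecture. This is not dataldr's code. The function names, the simplified MIF fragments, and the case-insensitive comparison are all assumptions made for illustration.

```python
import re

# Matches lines such as:  Class = "MICROSOFT|Workstation_STATUS|1.0"
# (assumed layout: one Class line per group, value in double quotes)
CLASS_LINE = re.compile(r'^\s*Class\s*=\s*"([^"]*)"', re.IGNORECASE | re.MULTILINE)


def mif_group_classes(mif_text):
    """Return the Class value of every group found in the MIF text."""
    return CLASS_LINE.findall(mif_text)


def has_top_level_group(mif_text, architecture):
    """True if some group's class equals the architecture name.

    The comparison ignores case; that is an assumption of this sketch.
    """
    wanted = architecture.strip().lower()
    return any(c.strip().lower() == wanted for c in mif_group_classes(mif_text))


# Hypothetical, heavily simplified MIF fragments for illustration only.
GOOD_MIF = '''
Start Component
  Name = "System Information"
  Start Group
    Name = "System"
    ID = 1
    Class = "SYSTEM"
  End Group
  Start Group
    Name = "Workstation Status"
    ID = 2
    Class = "MICROSOFT|Workstation_STATUS|1.0"
  End Group
End Component
'''

# Same fragment with the SYSTEM group removed, as in the problem case.
BAD_MIF = GOOD_MIF.replace('Class = "SYSTEM"', 'Class = "MICROSOFT|OTHER|1.0"')

if __name__ == "__main__":
    print(has_top_level_group(GOOD_MIF, "System"))
    print(has_top_level_group(BAD_MIF, "System"))
```

Running something like this over a folder of failing MIFs is a quick way to confirm they all share the same gap before digging further.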
OK, so why were we missing the SYSTEM class in the MIF? To investigate that we took a look at the inventoryagent.log on the client and noticed that when inventory was processed it specifically looked at the CCM_System class to pull SYSTEM data. Interestingly, the inventoryagent.log showed that we were successfully able to query CCM_System!
Collection: Namespace = \\.\root\ccm\invagt; Query = SELECT __CLASS, __PATH, __RELPATH, Name, SMSID, Domain, SystemRole, SystemType, LocalDateTime FROM CCM_System; Timeout = 600 secs.
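If you need to see which classes the agent queried across a whole log, a small parser helps. The sketch below pulls the namespace, query, class name, and timeout out of a line shaped like the one above. The regular expression is based only on that single sample line, so treat the pattern and the function name as assumptions rather than a description of the log format.

```python
import re

# Pattern inferred from the one sample line above; other log lines may differ.
COLLECTION_RE = re.compile(
    r"Collection:\s*Namespace\s*=\s*(?P<namespace>[^;]+);\s*"
    r"Query\s*=\s*(?P<query>[^;]+);\s*"
    r"Timeout\s*=\s*(?P<timeout>\d+)\s*secs"
)


def parse_collection_line(line):
    """Return the parts of a 'Collection:' log line, or None if it does not match."""
    match = COLLECTION_RE.search(line)
    if match is None:
        return None
    query = match.group("query").strip()
    # The class being inventoried is the identifier after FROM in the WQL query.
    from_clause = re.search(r"\bFROM\s+(\w+)", query, re.IGNORECASE)
    return {
        "namespace": match.group("namespace").strip(),
        "query": query,
        "wmi_class": from_clause.group(1) if from_clause else None,
        "timeout_secs": int(match.group("timeout")),
    }


if __name__ == "__main__":
    sample = (
        r"Collection: Namespace = \\.\root\ccm\invagt; "
        r"Query = SELECT __CLASS, __PATH, __RELPATH, Name, SMSID, Domain, "
        r"SystemRole, SystemType, LocalDateTime FROM CCM_System; "
        r"Timeout = 600 secs."
    )
    print(parse_collection_line(sample))
```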
So since we were able to query WMI for the system info, one would expect that data to be in the inventory report – so what gives? To confirm our belief that the data was pulled correctly, we configured the problem client to write its inventory in XML format before sending it to the management point. This is done by creating a folder or file in ccm\inventory\temp called “archive_reports.sms”.
Note that with this file or folder created, inventory .xml reports will be written here on every cycle and will not be deleted. Use this for extra troubleshooting information only and then remove it and clean up the remaining .xml files!
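Since the cleanup step is easy to forget, it can help to script both halves. Here is a minimal Python sketch that creates the marker file and later removes it along with any leftover .xml reports. The function names are made up for this example, and the directory is passed in by the caller because the full path to ccm\inventory\temp depends on where the client is installed.

```python
from pathlib import Path

MARKER_NAME = "archive_reports.sms"


def enable_report_archiving(temp_dir):
    """Create the marker file so inventory reports are kept as .xml files."""
    marker = Path(temp_dir) / MARKER_NAME
    marker.touch(exist_ok=True)
    return marker


def disable_report_archiving(temp_dir):
    """Remove the marker and delete the archived .xml reports.

    Returns the names of the report files that were deleted.
    """
    temp = Path(temp_dir)
    marker = temp / MARKER_NAME
    if marker.is_dir():
        marker.rmdir()  # the marker may be a folder; rmdir only works if it is empty
    elif marker.exists():
        marker.unlink()
    removed = []
    for report in temp.glob("*.xml"):
        report.unlink()
        removed.append(report.name)
    return sorted(removed)
```

Call enable_report_archiving with the client's inventory temp folder, trigger a hardware inventory cycle, copy off the report you need, and then call disable_report_archiving on the same folder.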
Looking through the inventory report (small sample shown below) we STILL did not see any information for the SYSTEM class! OK, what gives? This isn’t making much sense. Back to digging.
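Eyeballing a large XML report for a missing class is error prone, so a scripted check can help. The sketch below lists every value of an attribute named Class in an XML document and tests whether a given class name is among them. The attribute name, the element names in the sample, and the helper names are assumptions for illustration; inspect a real archived report and adjust before relying on it.

```python
import xml.etree.ElementTree as ET


def classes_in_report(xml_text):
    """Collect the value of every 'Class' attribute in the document.

    Assumes each inventoried instance carries its class name in an
    attribute called 'Class'; verify this against a real report.
    """
    root = ET.fromstring(xml_text)
    return {el.attrib["Class"] for el in root.iter() if "Class" in el.attrib}


def report_mentions_class(xml_text, class_name):
    """True if the report contains an element for class_name (case-insensitive)."""
    wanted = class_name.lower()
    return any(c.lower() == wanted for c in classes_in_report(xml_text))


# Hypothetical, simplified report body used only to exercise the helpers.
SAMPLE_REPORT = """
<Report>
  <ReportBody>
    <Instance Class="Win32_ComputerSystem"><Name>PC01</Name></Instance>
    <Instance Class="Win32_OperatingSystem"><Caption>Windows</Caption></Instance>
  </ReportBody>
</Report>
"""

if __name__ == "__main__":
    print(sorted(classes_in_report(SAMPLE_REPORT)))
    print(report_mentions_class(SAMPLE_REPORT, "CCM_System"))
```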
Looking a bit deeper still, we found that the inventory agent reads the information from CCM_System but STORES it under the hardware inventory token class in WMI – a class called {00000000-0000-0000-0000-000000000001}. This class is located under the root\ccm\invagt namespace. Once the information is stored in that class, THEN it is written to the inventory report. Aha…so it must be that the system information isn’t listed in the token class. We investigated and, sure enough, the inventory agent was trying to store it there but could not, and it failed silently. On a working system the information is present.
So now we know exactly WHY the problem is happening, but how do we fix it on so many clients? That’s where it became clear that deleting the repository was the only reasonable option – and that’s what we did.
So that’s the story. I still am not in favor of deleting the repository until you know for sure why and that it is the only option. In our case the problem has not returned – but that is no guarantee that it won’t. Time will tell.
Comments
Anonymous
December 02, 2010
Interesting article. I have about 150,000 clients globally. I am going to guess and say that there are WMI issues with about 2% and the symptoms vary. This still equates to about 3000 clients scattered throughout the globe. We don't have enough technicians with the time or the skill to do this type of in-depth analysis on each potentially damaged WMI repository. Can't run WMIDiag on every system and analyze on a case-by-case basis. Rarely do the other suggestions of re-compiling mofs, re-registering DLLs, etc. solve the problem. And in your case the repository needed to be rebuilt anyway. For years I have yet to see any real solution to this problem. Most of our WMI problems seem to be with mobile laptop users. I suspect mobile systems are more likely to shut down ungracefully which may explain a larger occurrence of WMI corruption (just my opinion). I wish Microsoft would publish something just stating that this is a recurring problem that is acknowledged and it is doubtful you will ever be able to manage 100% of your systems successfully. Maybe 95% at best when you consider additional issues such as Windows Update problems.
Anonymous
March 31, 2014
Hi sir, while reading about the wsus patching issues i read something about " microsoft updateservice.dll" file may be i do not know the name correctly but it was something like this but now i am not able to recall what it was for may be regarding the wsus version or something and not able to find the information on internet. please help me in knowing this file if there is any....... regards mohit