Is your RMS updating configuration too frequently?

The RMS is responsible for maintaining a master list of agent configurations for the management group.  As new information arrives from discoveries that are running on agents, that configuration becomes stale and needs to be updated.  You can see this happening in the OpsMgr event log on the RMS – just look for events 21024 and 21025 as shown below.

Event Type:    Information
Event Source:    OpsMgr Connector
Event Category:    None
Event ID:    21024
Date:        6/5/2009
Time:        3:03:25 PM
User:        N/A
Computer:    STEVERACRMS
Description:
OpsMgr's configuration may be out-of-date for management group primarymg, and has requested updated configuration from the Configuration Service. The current(out-of-date) state cookie is "59 F7 28 D9 AB A3 21 0D 9B 0C FB 83 EC 87 33 28 08 C0 32 48 "

Event Type:    Information
Event Source:    OpsMgr Connector
Event Category:    None
Event ID:    21025
Date:        6/5/2009
Time:        3:03:29 PM
User:        N/A
Computer:    STEVERACRMS
Description:
OpsMgr has received new configuration for management group primarymg from the Configuration Service.  The new state cookie is "4D BD 9A A0 E6 11 26 C7 ED 0D 03 5F 55 BA FB 82 68 CC 7C 8F "

Updating configuration is a normal function for the RMS, and the frequency will flex based on the size of the management group (number of agents/number of management packs) and the amount of change within it.  During new agent deployment, for instance, you would expect RMS configuration updates to be more frequent because of all the discoveries firing on the new agents.  The same holds when agents are removed.  In general, however, you wouldn’t expect the frequency of RMS configuration updates to remain high for sustained periods of time.

Recently I’ve reviewed the OpsMgr event logs from the RMS of a couple of different management groups and noticed that RMS configuration updates were happening every couple of minutes, with the largest gap between updates at about 5 minutes – and this frequency was sustained.  Is this update frequency a concern?  If sustained, yes.  Here are a few reasons why:

1.  Configuration data is held in memory on the RMS – this is done for efficiency and to reduce impact on the disk.  When the configuration is updated the RMS has to adjust data in memory and also make adjustments to the disk based configuration file.  This isn’t a big deal under normal circumstances but with frequent RMS configuration updates there can be an impact.
2.  With frequent churn the RMS may not be able to obtain a consistent database state and processing may be delayed as a result.
3.  Configuration churn increases disk utilization, taking capacity that would normally be available for other processing.  In one management group – which was running borderline but still within the supported range for RMS hardware – this caused intermittent but very noticeable console hangs that would persist for 5-6 minutes at a time.
4.  Fully decked out hardware may mask the problem, but OpsMgr function will still be degraded as a result (even if not noticeably).

Here is a snip from the OpsMgr event log of an RMS that is being updated very frequently:

image
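
To put a number on "very frequently", one option is to export the times of the 21025 events and measure the gaps between them. The sketch below is only an illustration: the sample times are made up, the helper names are hypothetical, and the date format is assumed to be month/day/year as in the event text shown earlier.

```python
from datetime import datetime
from statistics import median


def parse_event_time(date_text, time_text):
    """Parse the Date and Time fields as they appear in the event text.

    Assumes a month/day/year date and a 12-hour clock with AM/PM.
    """
    return datetime.strptime(f"{date_text} {time_text}", "%m/%d/%Y %I:%M:%S %p")


def update_gaps_minutes(timestamps):
    """Return the gaps, in minutes, between consecutive event times."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]


# Hypothetical 21025 event times, made up for illustration only.
sample = [
    parse_event_time("6/5/2009", "3:03:29 PM"),
    parse_event_time("6/5/2009", "3:05:41 PM"),
    parse_event_time("6/5/2009", "3:08:02 PM"),
    parse_event_time("6/5/2009", "3:12:50 PM"),
]

gaps = update_gaps_minutes(sample)
print(f"updates: {len(sample)}")
print(f"median gap: {median(gaps):.1f} min, largest gap: {max(gaps):.1f} min")
```

If the largest gap stays at only a few minutes over many hours, that matches the sustained pattern described above.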

OK, so now we understand this isn’t a good thing – so what do we do about it?  The first thing is to look and see if your RMS is updating configuration frequently.  Not all environments will experience this.  It really depends on the number of agents, the number of management packs, what your environment looks like operationally, etc.  If your RMS is having frequent configuration updates we need to understand which discovery is causing this churn.  There is a series of SQL queries that will assist with this.  Some are taken from a blog here.

I like to run these queries in the order listed below.  Some of these queries run against the data warehouse – noted by DW – and others run against the operational database – noted by OpsDB.

Top discovery rules in the last 24 hours (DW) – This query will return the top discovery rules from the last 24-hour period along with the number of changes detected per rule.

select ManagedEntityTypeSystemName, DiscoverySystemName, count(*) As 'Changes'
from
(select distinct
MP.ManagementPackSystemName,
MET.ManagedEntityTypeSystemName,
PropertySystemName,
D.DiscoverySystemName, D.DiscoveryDefaultName,
MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
ME.Path, ME.Name,
C.OldValue, C.NewValue, C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())
) As #T
group by ManagedEntityTypeSystemName, DiscoverySystemName
order by count(*) DESC

Here are some sample results from running this query:

image

Looking at the results you can see right away that the IIS MP discovery is submitting a significant number of changes within a 24-hour period.  It is closely followed by the Dell MP.  This query was run on a management group with roughly 500 agents.  Doing the math, that means each of our agents submitted updated NNTP discovery data over 5 times in a 24-hour period.
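
The "doing the math" step is just the Changes count divided by the number of agents. Here is a minimal sketch of that arithmetic. The discovery names and change counts are hypothetical placeholders (the real values come from the query output), and the division assumes every agent runs the discovery, which will overstate the agent count for a pack that only applies to some servers.

```python
def changes_per_agent(rows, agent_count):
    """Map each discovery name to its average number of changes per agent."""
    return {name: changes / agent_count for name, changes in rows}


# Hypothetical (DiscoverySystemName, Changes) pairs; not taken from the real results.
rows = [("Example.Discovery.A", 2600), ("Example.Discovery.B", 900)]
AGENT_COUNT = 500  # "roughly 500 agents" in the text above

for name, rate in changes_per_agent(rows, AGENT_COUNT).items():
    print(f"{name}: {rate:.1f} changes per agent in 24 hours")
```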

When the OpsMgr agent executes a discovery it will only return data to the management servers if something was found to have changed.  So, in theory, you should be able to run a discovery every 15 minutes and see no configuration churn provided that the discovery is written according to best practices (meaning no frequently changing values are being discovered).  Please don’t take that as a recommendation – there really shouldn’t be a need to run discovery more often than every 12-24 hours – and there is little benefit from a more frequent schedule.
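
To see why a single volatile property defeats this "only send on change" behavior, here is a simplified model. It is not how the agent is actually implemented; it just compares each run's discovered properties with the previous run and counts how often data would be submitted. The property names and values are made up.

```python
def count_submissions(snapshots):
    """Count runs that would submit data, sending only when properties changed."""
    submitted = 0
    previous = None
    for current in snapshots:
        if current != previous:
            submitted += 1
            previous = current
    return submitted


# Four runs of a discovery whose properties never change after the first run.
stable = [{"Name": "Server01", "Version": "6.0"}] * 4

# Four runs of a discovery that includes a frequently changing reading.
noisy = [{"Name": "Server01", "TemperatureC": t} for t in (41, 43, 42, 44)]

print(count_submissions(stable))  # 1
print(count_submissions(noisy))   # 4
```

In this model the stable discovery submits once no matter how often it runs, while the noisy one submits on every run, so shortening its interval directly increases churn.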

Here are some other queries that can help pinpoint what is happening with our discoveries.  I’ll just list these out without much discussion.

Discovered objects in the last 24 hours (DW) – This query works with the query above and will list out all of the objects that have been discovered in the last 24-hour period.

select distinct
MP.ManagementPackSystemName,
MET.ManagedEntityTypeSystemName,
D.DiscoverySystemName,
D.DiscoveryDefaultName,
MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName',
MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
ME.Path, ME.Name,
ME.DWCreatedDateTime
from dbo.vManagedEntity ME
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ME.DWCreatedDateTime > dateadd(hh,-24,getutcdate())

Modified properties in the last 24 hours (DW) – This query will list out the specific properties that changed from one discovery run to the next.  Sometimes a property will be chosen for a discovery that will be noisy in most environments – like the temperature sensor discovery in the Dell MP.  These types of properties shouldn’t be included in a discovery but some still persist.  This query will track down such properties.

select distinct
MP.ManagementPackSystemName,
MET.ManagedEntityTypeSystemName,
PropertySystemName,
D.DiscoverySystemName, D.DiscoveryDefaultName,
MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
ME.Path, ME.Name,
C.OldValue, C.NewValue, C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())

Entities with the most properties (DW) – This query returns the 50 managed entities with the most property rows in the data warehouse.

select distinct top 50 mep.ManagedEntityRowId, me.FullName, Count(mep.managedEntityPropertyRowId) as 'Total'
from ManagedEntityProperty mep WITH(NOLOCK)
LEFT JOIN ManagedEntity me WITH(NOLOCK) on mep.ManagedEntityRowId = me.ManagedEntityRowId
group by mep.ManagedEntityRowId,me.FullName order by Total desc

Instance changes in the last hour (OpsDB) – This query will list out the instances that have changed in the last hour.  Used with the previous queries this can help validate on an hourly basis what is being seen in the warehouse based queries.  Instances that show up consistently in the query results may indicate an overactive discovery.

select discovery.discoveryname, modified.*
from DiscoverySourceToTypedManagedEntity with(nolock)
inner join typedmanagedentity with(nolock)
on typedmanagedentity.typedmanagedentityid = DiscoverySourceToTypedManagedEntity.typedmanagedentityid
inner join discoverysource with(nolock)
on discoverysource.discoverysourceid = DiscoverySourceToTypedManagedEntity.discoverysourceid
inner join discovery with(nolock)
on discoverysource.discoveryruleid = discovery.discoveryid
inner join
(
Select fullname, basemanagedentityid, basemanagedentity.lastmodified
from basemanagedentity with(nolock)
Where basemanagedentity.lastmodified >= dateadd(minute, -60, getutcdate())
)as modified
on modified.basemanagedentityid = typedmanagedentity.basemanagedentityid
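
One way to spot instances that "show up consistently" is to save the fullname column from several hourly runs of this query and count how many runs each instance appears in. The sketch below assumes you have exported those names yourself. The function name, the sample instance names, and the threshold are all hypothetical.

```python
from collections import Counter


def recurring_instances(hourly_results, min_hours):
    """Return full names that appear in at least min_hours of the hourly result sets."""
    seen = Counter()
    for rows in hourly_results:
        # Count each instance once per hour, even if it matched several discoveries.
        seen.update(set(rows))
    return sorted(name for name, hours in seen.items() if hours >= min_hours)


# Hypothetical fullname values from three consecutive hourly runs.
hourly_results = [
    ["Example.Class:server01", "Example.Class:server02", "Example.Class:server01"],
    ["Example.Class:server01"],
    ["Example.Class:server01", "Example.Class:server03"],
]

print(recurring_instances(hourly_results, min_hours=3))
```

An instance that is modified in every hourly sample is a reasonable candidate for a closer look with the warehouse queries above.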

Relationship changes in the last hour (OpsDB) – This query is similar to the one above but looks at relationships.

select discovery.discoveryname, relationship.*
from DiscoverySourceToRelationship with(nolock)
inner join Relationship with(nolock)
on Relationship.relationshipid = DiscoverySourceToRelationship.relationshipid
and Relationship.lastmodified >= dateadd(minute, -60, getutcdate())
inner join discoverysource with(nolock)
on discoverysource.discoverysourceid = DiscoverySourceToRelationship.discoverysourceid
inner join discovery with(nolock)
on discoverysource.discoveryruleid = discovery.discoveryid

OK, so we have reviewed the query results and it seems we have some overactive discoveries in the environment.  What can be done about it?  There are three options:

1.  Disable the discovery – It is a best practice to review all management packs before deploying to production and tune out unneeded components, such as discoveries.  If you don’t need a discovery, disable it.  In some cases, this is easy to do by override.  In other cases a particular discovery may be part of a ‘super discovery’ – meaning a single discovery that creates and populates multiple classes.  Unless an override is exposed for the specific discovery of interest, you won’t be able to individually impact its frequency.
2.  Tune it – If disabling isn’t an option then we can tune the discovery to execute less frequently.  While this won’t ‘fix’ the problem it is a viable workaround to reduce the discovery noise.  There are potential impacts here so be sure you know what you are doing.  In addition, some discoveries are ‘nested’ and have dependencies between one another.  For more information take a look at my blog post here.
3.  Live with it – I don’t like this option but sometimes it’s the only one available.  If the churn was observed in the OpsMgr event log but not noticed from a performance perspective, then your hardware is likely masking the issue or the issue isn’t severe enough to cause noticeable impact.  In such cases you might decide to leave things alone.  If this is your decision, make certain that you keep an eye on it to ensure performance impacts or other problems don’t appear as a result.

If you do notice a problem with a particular management pack in your environment, please don’t keep this to yourself.  If it is a Microsoft management pack, let us know so we can investigate any issues.  If it is a 3rd party management pack, please contact the vendor so the issue can be investigated.  Let me state again – just because one environment may see issues with a particular discovery doesn’t mean all environments will.  These issues may exist for one management group and not another.  It really does depend on the details of your management group.

Comments

  • Anonymous
    June 09, 2009
    PingBack from http://blogs.msdn.com/steverac/archive/2009/06/09/understanding-nested-management-pack-discoveries-and-how-does-they-impact-total-discovery-time.aspx

  • Anonymous
    June 09, 2009
    Hi Steve, thanks for the insights and for cross posting on Quae Nocent Docent. We all agree frequently changing discoveries are bad, but still, except for new agents and non-hosted classes, I don't understand the need for a reload on the RMS. Let me rephrase: how can I understand what's causing 21025 on the RMS? My RMS discovery is stale so it is clearly something I get from agents (I have about 20-25 21025 events per hour on the RMS). This is the difficult part: correlating something that changes on the agent with the reload on the RMS. I still haven't found the culprit. On this topic, from my observations, it seems that a change in an MP (say a new override) causes a reload of all the referenced MPs, not just the modified one. As you know, parsing those XML documents is CPU intensive; I think there's something that can be optimized here. :-)

  • Anonymous
    June 10, 2009
    An override will cause config churn, approving a pending agent will cause config churn – a single property (like a DB size that adjusts because of autogrow or a temperature reading) will cause config churn.  I don’t understand the deep details of why this is the case – but it is.  The way I find which attribute is causing the churn is with the second (I think) query on my blog – I think it’s one of your queries.  Good show on the queries BTW. Also, there are improvements for some of this in R2 - not all but some.

  • Anonymous
    June 11, 2009
    Right now I'm running R2 (we have been part of the RDP), but things are not that much better. In SP1, every time a property changed the agents would reload (21025); in R2 I don't see such a strict correlation. From your explanation I understand that every time a discovered property changes on an agent I get a 21025 on my RMS? Hmm, if this is the correct behavior we have a bad design. First, the RMS should implement some sort of "quiet" timeout, say no more than one reload every x minutes (possibly configurable). Second, starting with KB 958490, at every reload on the RMS the "unknown state" rollup monitors are asked to recalculate, which causes a complete reload on agents (without any 21025 event in the event log). I will work on this with the product team and follow up with a wrap-up post; this point is strategic for the proper behavior of any OpsMgr deployment.

  • Anonymous
    June 11, 2009
    There are definitely some areas where we can still make improvements for the 21025.  But R2 does make some headway on this.  Thanks for your comment and for being a part of the RDP!

  • Anonymous
    June 26, 2009
    The comment has been removed

  • Anonymous
    June 26, 2009
    No, R2/non R2 shouldn't be a factor.  My guess is that you are not selecting on the correct database - some of these are warehouse queries and others are OpsDB queries.