Troubleshooting console performance issues with Operations Manager 2007 in large environments
** Tim Helton**, one of Microsoft's top Support Escalation Engineers, has written some great tips that he uses to troubleshoot console performance issues in large environments. I took the liberty of posting his notes below:
========
These are some steps that have helped me troubleshoot console and general performance issues in OpsMgr SP1. If you are not on SP1, this should be your first course of action. The information below is primarily meant for large environments although some of these tips may still be useful for smaller installation (Less than 1000 agents).
SQL Performance:
Most console performance issues stem from poor SQL performance. The below steps are basic SQL troubleshooting but if you're not familiar with working with QL server then your resident QL Specialist should be engaged.
The OperationsManager Database should be on its own dedicated server - No other databases should live on this SQL server.
You should also try to maintain at most a 50GB OperationsManager DB for best performance and I'd recommend 30-40GB.
SQL Disk Configuration:
Disk performance issues make up the bulk of SQL performance problems. With that in mind here are some of my recommendations:
- SQL disks should be attached to a SAN, configured to use Raid 1+0
- Database and Log files should be stored on separate LUN's
- SAN configuration should be tested and verified by the SAN vendor for correct versions of drivers/firmware, and to make sure we have the required throughput.
- Virus Scanning/Filter drivers - Make sure SQL files are excluded from scanning. Uninstalling AV should have a significant improvement on disk performance as well if that's an option for you
Some Performance counters to look at:
- Physical Disk/Avg Disk Sec/Transfer - This is the number of seconds it takes to complete one Disk I/O. This number should be around .020 consistantly, and never spike over .050
- Physical Disk/%Idle Time - How much disk Idle time is being reported. Anything less than 20% can be considered 0%.
- Physical Disk/Disk Bytes/Sec - This is the throughput of the drive. Use this counter when the %Idle time is at 0%, and you will have a rough estimate of how much data you can push to the disk subsystem.
- Physical Disk/Average Disk Queue Length - If this number Spikes above 50, it may be an indication that the HBA queue depth needs to be adjusted. Engage the SAN vendor.
If these numbers look bad should take a look at SQL memory before pointing fingers at your SAN vendor.
SQL Memory Configuration:
Not enough memory, or improperly using SQL memory, could cause SQL to perform extra work that makes it look like the disk cant keep up. Make sure SQL has enough memory. SQL 2005 SP2 64 bit Enterprise Edition should be running on a 64 bit server for best performance. If we are not running Enterprise Edition, Online index operations are not available.
Some rough Estimates of memory needed on the SQL server:
- 2000 agents - 8GB
- 3500 Agents - 16GB
- 5000+ Agents - 32GB+
Some performance counters to look at:
Memory/Available Mbytes - If this number is under 200 MB then SQL has been configurated to take all the memory in the system. This is often not the most efficient for SQL. Try to determine how much memory is needed by other components on the system (OS/Sql Agents/Antivirus/Backup/Misc) and subtract that from the total system memory giving SQL the rest.
Here is an Example using an 8-way 64-bit server with 32GB of ram:
- Physical RAM: 32GB
- 2GB for the OS and other applications: 2GB
- Let's allocate 2GB for the MPA / Thread Stacks / Linked Servers etc: 2GB (on x64 SQL Server, 2MB is allocated per worker thread. The default number of Max Worker threads is 576 for an 8-processor server.)
- Let's reserve 3GB for all the other applications on the server (AV, backup etc): 3GB
So now our max server memory setting is: 32-2-2-3 = 25GB.
Here is the SQL command to limit the server to 25GB:
sp_configure 'max server memory', 25600
RECONFIGURE
GO
SQLServer:Buffer Manager/Page Life expectancy - This counter lets us know how long SQL expects a page to live inside its memory. This number should be greater than 300. The Page Life Expectancy counter tells you, on average, how long data pages are staying in the buffer. If this value gets below 300 seconds then this is a potential indication that your SQL Server could use more memory in order to boost performance. If this number falls well below 300 seconds it can be an indication that SQL may have benefited from more memory during that time period, although just because this number falls below 300 does not mean SQL needs more memory. We could have incorrect or out of date indexes. See the next Section.
SQLServer:Buffer Manager/Lazy Writes/sec - If this number increases during the time when Avg Disk Sec/Transfer increases, it is a good indication SQL needs more memory. Lazy writer is a process that moves dirty pages from the buffer to disk in order to free up buffer space. Lazy writer/sec counter tracks how many times a second the Lazy writer process is moving dirty pages. Generally speaking this should not be a high value, say more than 20 per second or so. Ideally it should be close to zero. If
it is zero, this indicates that your SQL Server's buffer cache is plenty big and SQL Server doesn't have to free up dirty pages, instead waiting for this to occur during regular checkpoints. If this value is high then a need for more memory is
indicated
Temp DB:
The Temp DB should be moved to a High Performance Disk, preferably a dedicated SAN LUN in large environments.
SQL Indexes:
Out of date or incorrect indexes can cause additional data to be read from the disk. This can cause our Page Life Expectancy to fall and increase our disk I/O. Pssdiag can be useful in identifying which Tables contain the most date and which indexes are out of date.
Identifying tables that are fragmented:
DBCC SHOWCONTIG WITH FAST Here is some sample output:
DBCC SHOWCONTIG scanning 'Alert' table...
Table: 'Alert' (1771153355); index ID: 1, database ID: 5
TABLE level scan performed.
- Pages Scanned................................: 936
- Extent Switches..............................: 427
- Scan Density [Best Count:Actual Count].......: 27.34% [117:428]
- Logical Scan Fragmentation ..................: 60.90%
In general, we would like the "Scan density" to be high (Above 80%) and the "Logical Scan Fragmentation" to be low (below 30%). What you might find is that *some* of the tables are more fragmented than others, because our built-in maintenance does not reindex all tables. This is especially true with tables like the raw perf, event, and localizedtext tables.
Reindexing:
This can be ran as a SQL agent job or manually and can have a huge effect on peformance:
USE OperationsManager
go
SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET NUMERIC_ROUNDABORT OFF
EXEC SP_MSForEachTable "Print 'Reindexing '+'?' DBCC DBREINDEX ('?')"
Maintenance built into OpsMgr:
There is a rule in the System Center Internal Library called "Optimize Indexes". This rule runs every night at 2:30am on the RMS and calls p_OptimizeIndexes. Make sure any standard maintenance you perform on the OpsMgr DB does not interfere
with this job.
See this blog post for more information:
http://blogs.technet.com/kevinholman/archive/2008/04/12/what-sql-maintenance-should-i-perform-on-my-opsmgr-databases.aspx
Tuning OpsMgr:
NOTE: The Event and Performance Data tables are updated in realtime and are not indexed. Opening these views can have a significant performance impact so when testing console performance all efforts should be taken to avoid these views.
Data Retention:
Data Retention can be a significant source of data in the OpsMgr DB. Reducing the amount of data in the database will offer significant performance increases. In a large environment set the Data Retention to 1 or 2 days for every category. After grooming and reindexing takes place you should see a console performance increase. This is accessed from Administration -> Settings -> grooming.
Open and Closed Alerts:
We should try to maintain less than 2,000 open and closed alerts in our database. If you see more than this then Management Packs may have to be tuned to disable noisy rules.
Tuning Management Packs:
In a large OpsMgr environment, every management pack added needs to be examined, tuned and tested. One way to do this is to remove all of the management packs and add them back one by one. As you add them back, identify what the management packs do and disable rules/discoveries that are unneeded. Tune each management pack for a couple of days , identify noisy rules, excessive perf data etc. Also make sure the newest Management packs are used. Especially with the complicated MP
such as Exchange, big performance issues have been addressed in newer versions. Research what needs to be done with each management pack that will be installed.
These next steps are best performed on a clean working system and repeated everytime a new management pack is being tested. See:
Tuning Alerts:
Run the two queries below. Identify what is causing alerts and either fix the issues or disable those rules:
Most Common Alerts in an Operational Database, by Alert Count
SELECT AlertStringName, AlertStringDescription, AlertParams, Name, SUM(1) AS
AlertCount, SUM(RepeatCount+1) AS AlertCountWithRepeatCount
FROM Alertview WITH (NOLOCK)
GROUP BY AlertStringName, AlertStringDescription, AlertParams, Name
ORDER BY AlertCount DESC
Most Common Alerts in an Operational Database, by Repeat Count
SELECT AlertStringName, AlertStringDescription, AlertParams, Name, SUM(1) AS
AlertCount, SUM(RepeatCount+1) AS AlertCountWithRepeatCount
FROM Alertview WITH (NOLOCK)
GROUP BY AlertStringName, AlertStringDescription, AlertParams, Name
ORDER BY AlertCountWithRepeatCount DESC
Tuning events:
Run the following queries to determine which events are prevalent in the database. Find the event collection rules collecting these events and disable them. If an especially noisy event rule is absolutely needed, please be aware this could cause performance issues. It may also be beneficial to tune the application causing the events.
All Events by count by day, with the total for entire database (this tells us how many events per day we are inserting, and helps us look for too many events, event storms, and the result after tuning rules that generate too many events):
SELECT CASE WHEN(GROUPING(CONVERT(VARCHAR(20), TimeAdded, 101)) = 1)
THEN 'All Days'
ELSE CONVERT(VARCHAR(20), TimeAdded, 101) END AS DayAdded,
COUNT(*) AS NumEventsPerDay
FROM EventAllView
GROUP BY CONVERT(VARCHAR(20), TimeAdded, 101) WITH ROLLUP
ORDER BY DayAdded DESC
Most Common Events by event number: (This helps us know which event ID's are the most common in the database)
SELECT top 50 Number, COUNT(*) AS TotalEvents
FROM EventView
GROUP BY Number
ORDER BY TotalEvents DESC
Most common events by event number and event publishername (This gives us the event publisher name to help see what is raising these events):
SELECT top 50 Number, Publishername, COUNT(*) AS TotalEvents
FROM EventAllView
GROUP BY Number, Publishername
ORDER BY TotalEvents DESC
Computers generating the most events (This shows us which computers create the most event traffic and use the most database space):
SELECT top 50 LoggingComputer, COUNT(*) AS TotalEvents
FROM EventallView
GROUP BY LoggingComputer
ORDER BY TotalEvents DESC
Performance Turning:
Use the following queries to identify which performance rules are taking the bulk of the database. Tune these rules to collect performance data at a larger interval, or disable the rule all together.
Performance insertions per day:
SELECT CASE WHEN(GROUPING(CONVERT(VARCHAR(20), TimeSampled, 101)) = 1)
THEN 'All Days' ELSE CONVERT(VARCHAR(20), TimeSampled, 101)
END AS DaySampled, COUNT(*) AS NumPerfPerDay
FROM PerformanceDataAllView
GROUP BY CONVERT(VARCHAR(20), TimeSampled, 101) WITH ROLLUP
ORDER BY DaySampled DESC
Most common performance insertions by perf counter name:
select pcv.countername, count (pcv.countername) as total from
performancedataallview as pdv, performancecounterview as pcv
where (pdv.performancesourceinternalid = pcv.performancesourceinternalid)
group by pcv.countername
order by count (pcv.countername) desc
Most common performance insertions by perf object and counter name (This is the most interesting - tells us specifically which perf insertions are the noisiest):
select pcv.objectname, pcv.countername, count (pcv.countername) as total from
performancedataallview as pdv, performancecounterview as pcv
where (pdv.performancesourceinternalid = pcv.performancesourceinternalid)
group by pcv.objectname, pcv.countername
order by count (pcv.countername) desc
Here are some examples of noisy performance rules that are enabled by default in OpsMgr:
Microsoft.Windows.Server.2000 Collection Rule for the Average Disk Queue Length
Microsoft.Windows.Server.2000 Collection Rule for Average Disk Seconds Per Read
Microsoft.Windows.Server.2000 Collection Rule for Average Disk Seconds Per Transfer
Microsoft.Windows.Server.2000 Collection Rule for Average Disk Seconds Per Write
Microsoft.Windows.Server.2000 Collection rule for Current Disk Queue Length
Microsoft.Windows.Server.2000 Collection Rule for Disk Bytes Per Second
Microsoft.Windows.Server.2000 Collection Rule for Disk Reads Per Second
Microsoft.Windows.Server.2000 Collection Rule for Disk Writes Per Second
Microsoft.Windows.Server.2000 Logical Disk Free Megabytes
Microsoft.Windows.Server.2000 % Logical Disk Free Space
Microsoft.Windows.Server.2000 Network Adapter Bytes Received per Second
Microsoft.Windows.Server.2000 Network Adapter Bytes Sent per Second
Microsoft.Windows.Server.2000 Network Adapter Bytes Total per Second
Microsoft.Windows.Server.2000 Memory Available Megabytes
Microsoft.Windows.Server.2000 Memory Page Reads per Second
Microsoft.Windows.Server.2000 Memory Pages per Second
Microsoft.Windows.Server.2000 Memory Page Writes per Second
Microsoft.Windows.Server.2000 Memory % Committed Bytes in Use
Microsoft.Windows.Server.2000 Page File Percentage Use
Microsoft.Windows.Server.2000 System Context Switches per Second
Microsoft.Windows.Server.2000 Processor % Processor Time Total
Microsoft.Windows.Server.2003 Collection Rule for the Average Disk Queue Length
Microsoft.Windows.Server.2003 Collection Rule for Average Disk Seconds Per Read
Microsoft.Windows.Server.2003 Collection Rule for Average Disk Seconds Per Transfer
Microsoft.Windows.Server.2003 Collection Rule for Average Disk Seconds Per Write
Microsoft.Windows.Server.2003 Collection rule for Current Disk Queue Length
Microsoft.Windows.Server.2003 Collection Rule for Disk Bytes Per Second
Microsoft.Windows.Server.2003 Collection Rule for Disk Reads Per Second
Microsoft.Windows.Server.2003 Collection Rule for Disk Writes Per Second
Microsoft.Windows.Server.2003 Logical Disk Free Megabytes
Microsoft.Windows.Server.2003 % Logical Disk Free Space
Microsoft.Windows.Server.2003 Network Adapter Bytes Received per Second
Microsoft.Windows.Server.2003 Network Adapter Bytes Sent per Second
Microsoft.Windows.Server.2003 Network Adapter Bytes Total per Second
Microsoft.Windows.Server.2003 Memory Available Megabytes
Microsoft.Windows.Server.2003 Memory Page Reads per Second
Microsoft.Windows.Server.2003 Memory Pages per Second
Microsoft.Windows.Server.2003 Memory Page Writes per Second
Microsoft.Windows.Server.2003 Memory % Committed Bytes in Use
Microsoft.Windows.Server.2003 Page File Percentage Use
Microsoft.Windows.Server.2003 System Context Switches per Second
Microsoft.Windows.Server.2003 Processor % Processor Time Total
Microsoft.Windows.Server.2003 Processor % Processor Time
All of these rules run on a 5 minute interval and can cause a significant amount of data to be inserted into the database. These rules, along with any rules you determine with the above queries, should be disabled or tuned to run at a larger interval (say, every 15-30 minutes).
**Console Registry keys:
**Automatically Polling the OpsMgr DB every 15 seconds from multiple consoles can negatively impact performance. For best performance in large environments, turning off Polling or increasing it to at least level 5 can help:
HKCU\Software\Microsoft\Microsoft Operations Manager\3.0\console\CacheParameters\PollingInterval
0 – Turn off console polling and only refresh when manually pressing F5
1 – Console will poll every 15 seconds (default)
2 – Console will poll ever 30 seconds
10 – This is the maximum value – anything above 10 will be treated as 10
Also note that:
HKCU\Software\Microsoft\Microsoft Operations Manager\3.0\console\CacheParameters\EnableContextMenuTasks
defaults to 0 in SP1. If this is re-enabled it could cause performance issues.
**Console Machine Performance:
**The server or workstation the console runs on should be in good shape. Disks should have ample free space (especially where the profile is located) and not be fragmented. We should have at least 1GB of free memory to run the console at top speed and antivirus should be configured to exclude Local console cache path:
Documents and Settings\username>\local settings\Application Data\Microsoft\Microsoft.MOM.UI.Console
Limit the number of open consoles to the bare minimum required.
**RMS Performance:
**On the Root Management Server, the most critical resource is RAM followed by CPU. Memory (see Memory/Available Mbytes in perfmon) should never go below 200 mb). In large environments the RMS should have fast disks, especially where the Health Service Store is located). Disks should not be fragmented.
No agents should report directly to the RMS. No Gateway servers should report
directly to the RMS.
When configuring read caching vs. write caching on disk controllers for the Root Management Server, allocating at least 50% of the cache to write caching is recommended.
Antivirus software should be configured to excluded the Health Service Store folder (C:\Program Files\System Center Operations Manager\Health Service Store, by default. Located on the shared disk in a cluster).
The console should not be run on the RMS, and for best performance the console machine should be physically close to the RMS, preferably on the same subnet).
Agents:
Depending on your hardware and the management packs you are running, you should not have more than 5,000 agents in a management group although this is not a hard limit. See the Operations Manager 2007 Performance and Scalability White Paper:
Networking:
The SQL server hosting the Operations Manager DB and Root Management Server should be on the same subnet. Gigabit networking is preferred.
Hotfix 948496 should be applied to the SQL server and the RMS:
KB948496: An update to turn off default SNP features is available for Windows Server 2003-based and Small Business Server 2003-based computers: http://support.microsoft.com/?kbid=948496.
Additional Links:
OpsMgr 2007 Capacity Planner Tool: http://www.microsoft.com/downloads/details.aspx?FamilyID=6fec1f12-a62c-4e8d-8a19-56879192adc3&displaylang=en.
OpsMgr 2007 Hardware Guidance (Satya Vel's Blog): http://blogs.technet.com/momteam/archive/2008/04/10/opsmgr-2007-hardware-guidance-what-hardware-do-i-buy.aspx