System Center Advisor: Finishing off my first list of advice….
When I last told you my story of installing System Center Advisor, I had seen some new alerts for a database I had created on this server called mavstothefinals (hey Lakers and Spurs lose in the first round and Mavs are up 2-0. It could happen…). I had received an email about new alerts for this database:
I resolved each of these issues and the next day these alerts were closed as I would expect (remember I told you that alerts are closed automatically when Advisor detects the problem is resolved).
What is left for me to resolve my initial list of advice from CSS?
I sorted these alerts based on “Path” which is the object that the alert applies to (SQL Instance, database, or the computer server). You can see the first 6 alerts are really the same 2 alerts that have fired for my system databases: model, master, and msdb. The first alert I’ve already talked about in the last post indicating I have not run a clean consistency check at least one time. I personally believe that for a new production installation this makes perfect sense. Why not see whether all of your production databases including system databases have had a clean consistency check before releasing this server into production? System databases are often overlooked for this but if they become damaged you may have problems operating or starting SQL Server. Remember that while the PAGE_VERIFY CHECKSUM is a great feature and will discover I/O problems with pages that were written to disk, what about problems with pages that occur while in memory? If a page becomes damaged in memory and is written to disk,, checksum will not catch this problem (there is a feature we have to do some checksum verification for pages in cache but not as robust was when reading from disk).
So let’s say we all agree that running a consistency check on these system databases makes sense at least one time to ensure they are clean. As I said earlier we are conservative about our rules to avoid making them too noisy for the entire world so we don’t advocate how often you should run DBCC CHECKDB on a regular basis. But run it at least once to ensure they are clean. So I’ve run the following commands on this SQL instance:
dbcc checkdb(master)
go
dbcc checkdb(model)
go
dbcc checkdb(msdb)
go
I received no errors so I expect by tomorrow for these rules to be closed:
What about the other alert on backup?
Even though this fires for msdb, master, and model I picked model to show you this rule. You may be reading this post and ask yourself why would we have a rule to check to see if you have a backup? What is behind this rule?
If you look at this description it describes what we check. We simply see whether you have EVER backed up the database (a FULL backup) since you created the database. For my user database from the last post, I had created an initial backup (but remember I didn’t have a log backup and I backed it up to the same volume as database). There are two main points I want you to take away about this rule:
1) You would be surprised how many people call CSS and don’t have a backup at all to restore from (no one ever calls me and says “Hey Bob I have a backup and just wanted to let you know I’m restoring it”). I must admit to you that when I first proposed this rule to the SCAdvisor team for the beta they said “Are you kidding. Who wouldn’t backup their database?” I pointed out to them that we have the years and years of customer experiences to prove it. They went along with my suggestion and in the early TAP program many customers hit this rule especially for system databases. One interesting story was that a TAP customer hit the rule and didn’t understand it because they were sure they were backing up their databases. Turns out their backup automation solution was failing and the software to notify them the backup automation was failing was also failing.
2) System databases including model should be backed up. Master may seem obvious to you but what about model? I’ve had a few folks ask me about backing up model and why we flag it. I turn around and ask back the question “What happens if model becomes damaged?” One possible problem is that SQL Server cannot be started if model cannot be opened because it is damaged. Now what do you do? Your production server is totally down because the engine cannot open model and you have no backup. The answer is you must rebuild all system databases to get model back into a state of health. (Note: We do store the initial model.mdf file in …\MSSQL\Binn\Templates so you may be able to copy from there but the documented supported method to recover model is restore from a backup).
So trusting this advice, I decided to create a backup of master, model, and msdb. Remember the rule I hit with my user database saying I shouldn’t backup to the same volume as the database files so I will make sure to backup these up to a different volume
This leaves me with 2 alerts:
The first alert has the following details:
I certainly plan to use tempdb on this server so would like to avoid hitting this problem so the update sounds like one I would like to apply. We in CSS picked this update because many customers have encountered the errors as described (605 and 824) and thought tempdb was damaged when this is actually not the case. As it turns out this is more of a query optimizer problem. I looked at the KB article that comes with this rule and it implies to me that SQL Server 2008 R2 RTM already has this fix. Then why did the alert fire? This is one of the nice features of Advisor: our ability to ensure a fix is “completely applied”. Look at the Information detected section in the right corner of these details:
It says a SQL Server update is not needed but says a trace flag update is needed. This is because for this fix to be applied you must enable a trace flag per the KB article (BTW. This is one of the reasons for this rule. Many, many people have missed that the trace flag is needed). The information detected tells me whether trace flag 4135 or 4199 is enabled.. Why 2 trace flags? This is where things get a bit complex. The original fix for this problem required trace flag 4135. However, we have also introduced a trace flag starting with certain builds, trace flag 4199, that enables several different “optimize fixes” including this one. Here is what the KB article for this fix says:
Note You can enable trace flag 4135 or trace flag 4199 to activate this fix. Trace flag 4135 was introduced in Cumulative Update package 3 for SQL Server 2008. Trace flag 4135 is also available in SQL Server 2008 Service Pack 1, in SQL Server 2008 Service Pack 2, and in SQL Server 2008 R2. Trace flag 4199 was introduced in Cumulative Update package 7 for SQL Server 2008, in Cumulative Update package 7 for SQL Server 2008 Service Pack 1, and in Cumulative Update package 1 for SQL Server 2008 R2. For more information about trace flag 4199, click the following article number to view the article in the Microsoft Knowledge Base:
974006 (https://support.microsoft.com/kb/974006/ ) Trace flag 4199 is added to control multiple query optimizer changes previously made under multiple trace flags
Since I have SQL Server 2008 R2 RTM.installed only trace flag 4135 is possible for me to use so that is what I’ll set to resolve this alert.
Let’s look at the last alert on my list which involves tempdb:
One of the most common issues we have seen over the years regarding the performance of applications using tempdb is allocation page contention (symptoms are high waits on PAGELATCH for PFS, GAM, and SGAM pages). And one of the most common reasons for this contention is the lack of multiple database files for tempdb. Spreading out the number of files for tempdb helps relieve bottlenecks for allocation pages in these files. Therefore, we built a rule in System Center Advisor to perform a very conservative check. If you have a SQL Server that is using more than one logical processor (we calculate this by looking for the number of schedulers in sys.dm_os_schedulers that are VISIBLE ONLINE) and you only have one tempdb file, we raise this alert. We don’t make any comments on how many files you should create (because this is a widely debated topic). We just know if you have one file and multiple schedulers you are likely to hit problems when using tempdb. We used this logic because that is a common customer call to CSS. A customer will call us with performance problems. We see it is high latch contention in tempdb involving allocation pages. And we find out there is only one tempdb file. The typical suggestion is to create one file for each logical processor but there is debate about what that limit should be.. So we don’t really go down that path. We just know if you only have 1 tempdb database file you are likely going to have problems. Why flag this for every customers? Because I have yet to see an application using SQL Server that does not use tempdb.
I hope the last few blog posts give you a feel for what System Center Advisor can help you stay ahead of problems managing your SQL Server. But to be honest the rules so far I have shown you are really some of the more basic checks. Of course some of the most basic checks can flag some of the most common mistakes. We have many other rules that can help you.
In my next series of posts on System Center Advisor, I will drill into all the other rules we have shipped so far for SQL Server (there are 50 total) and then talk more about Configuration Change History. To see a complete list of alerts, do the following:
From your Alert dashboard, select the Manage Alerts button:
From there pick the Available Alerts tab
All the alerts are listed here with a link to the KB articles for each rule.
Bob Ward, Microsoft
Comments
Anonymous
April 20, 2011
Great story about the failing backup and the failed monitoring system. "Quis custodiet ipsos custodes?"Anonymous
April 20, 2011
Great post, Bob! But to be fair to the our team, we didn't say "Are you kidding. Who wouldn’t backup their database?". :-) We did say that we were concerned with how noisy that alert could be and if customers would actually take action on it. It turns out customers are taking action on it, especially on production machines. For those who don't want to take action on it, they can hit the new ignore button and selectively ignore it where it doesn't make sense in their environment. Paul Mestemaker / Sr. Program Manager / System Center AdvisorAnonymous
February 13, 2013
System Center Advisor deserves more limelight - a wonderful product in addition to real time monitoring with System Center Operations Manager. Free for customers with software assurance.