Troubleshooting SQL Server deadlocks and performance when running SharePoint
I’ve worked on a lot of cases this year involving SharePoint. Here are some example problems I encountered recently. If you want to read the background or the formal advice first, here are some good starting links:
Plan for performance and capacity (Office SharePoint Server)
In addition to this, you need to consider what you can and can’t do from a SQL Server perspective when optimizing or troubleshooting your SharePoint deployment. Pay close attention to these articles:
https://support.microsoft.com/kb/932744/
https://support.microsoft.com/kb/841057/
In simple terms, don’t go amending the schema or the objects! So back to my main point…
Things to do first:
1. Collect some decent SQL Server diagnostic data – my preferred option, as ever, is SQLDIAG with the perfstats scripts.
2. Turn on trace flag 1222 to output deadlock information to the errorlog (see the sketch after this list), or capture deadlock graphs in your profiler traces.
3. Collect this data at periods of both good and bad performance, so you have a benchmark to compare against.
4. Aggregate the profiler output. Personally I use readtrace.exe directly from the RML Utilities, but others like to use SQLNexus as well.
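As a quick illustration of step 2, trace flag 1222 can be turned on for the whole instance without a restart. This is only a minimal sketch; if you want the flag to survive restarts, add -T1222 as a startup parameter instead.

-- Write deadlock graph information to the SQL Server errorlog for all sessions
DBCC TRACEON (1222, -1);

-- Confirm which trace flags are currently active globally
DBCC TRACESTATUS (-1);

-- Switch it off again once you have captured the deadlocks you are interested in
DBCC TRACEOFF (1222, -1);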
Here are two real examples of problems I encountered recently and diagnosed using the above data collection:
1. Are you experiencing significant parallelism (CXPACKET) waits? SQL Server installations have a default Max Degree of Parallelism (MAXDOP) of 0. I have observed several MOSS deployments this year where this is not the best MAXDOP, depending on the specifics of the data distribution and deployment. If you look in the perfstats output you might see evidence like this:
-- requests --
runtime session_id request_id ecid blocking_session_id blocking_ecid task_state wait_type
------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- ---------
2009-05-27T15:06:38.060 68 0 4 0 0 RUNNING CXPACKET
2009-05-27T15:06:38.060 68 0 0 0 4 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 96 0 0 0 0 SUSPENDED WAITFOR
2009-05-27T15:06:38.060 116 0 0 0 0 RUNNING
2009-05-27T15:06:38.060 128 0 12 0 0 RUNNING CXPACKET
2009-05-27T15:06:38.060 128 0 9 0 0 RUNNABLE CXPACKET
2009-05-27T15:06:38.060 128 0 0 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 5 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 11 0 0 RUNNABLE CXPACKET
2009-05-27T15:06:38.060 128 0 6 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 2 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 3 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 4 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 7 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 8 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 1 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 10 0 0 RUNNABLE CXPACKET
2009-05-27T15:06:38.060 138 0 10 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 12 0 0 RUNNABLE CXPACKET
2009-05-27T15:06:38.060 138 0 11 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 9 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 0 0 3 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 3 0 8 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 1 0 8 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 2 0 8 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 4 0 8 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 7 0 12 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 5 0 12 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 6 0 12 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 8 0 12 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 163 0 0 0 0 RUNNABLE
2009-05-27T15:06:38.060 170 0 0 163 0 SUSPENDED LCK_M_X
This is a classic example where a long blocking chain starts to build due to CXPACKET waits. Note that the same session_id is reported on multiple lines, but the ecid column changes. This indicates that the batch (session_id) is being processed in parallel by more than one worker (ecid). This extract is from very early in the problem; from here the chains grow and grow, and can eventually result in multiple deadlocks. In this scenario the batches being executed by sessions 128 and 138 (which are both running PROC_LISTUNGHOSTEDFIELDTEMPLATESINLIST) are picking parallel plans inappropriately. We worked around this by changing MAXDOP on the server to a value of 1. This is definitely something you should test in your environment, since it’s a server-wide setting. Currently, according to the above KB articles, there is no fixed recommendation about MAXDOP for MOSS deployments, but things can change, so you should check the current requirements of your version. In this scenario we saw a significant number of other procedures which encountered the same issue, such as proc_CheckDirectoryTree, proc_DeplGetListItemLinks and PROC_DEPLGETLISTITEMANDRESERVEID, amongst others. Typical errors you might see in the SharePoint application logs would be:
Unexpected query execution failure, error code 1205. Additional error information from SQL Server is included below. "Transaction (Process ID 163) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."
A global change of MAXDOP = 1 resolved this specific situation.
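For reference, the change itself is straightforward; below is a minimal sketch using sp_configure. Note that 'max degree of parallelism' is an advanced option, and that it affects every query on the instance, so test it before applying it in production.

-- 'max degree of parallelism' is an advanced option, so expose it first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Restrict each request to a single scheduler (MAXDOP 1)
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;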
2. Are you encountering long blocking chains when accessing the AllDocs table? This object is commonly accessed by a number of different stored procedures, and I’ve seen some scenarios where it becomes a point of contention. The most problematic one I came across recently resulted in numerous deadlocks, which first threw the above 1205 error to the SharePoint application log, and then in addition threw the following error:
InitializePublishingWebDefaults failed. Microsoft.SharePoint.SPException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) ---> System.Runtime.InteropServices.COMException (0x80004004): Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) at Microsoft.SharePoint.Library.SPRequestInternalClass.SetWebProps
This second error occurs as SharePoint does not roll back properly after encountering the deadlock. If you have collected your SQLDIAG data as noted above, you’ll see symptoms like this in the perfstats script output:
a) long blocking chains waiting on LCK_M_U
-- requests --
runtime session_id request_id ecid blocking_session_id blocking_ecid task_state wait_type wait_duration_ms
------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- -------------------------------------------------- --------------------
2009-05-28T10:01:40.973 60 0 0 197 0 SUSPENDED LCK_M_X 3201
2009-05-28T10:01:40.973 72 0 0 60 0 SUSPENDED LCK_M_S 3224
2009-05-28T10:01:40.973 77 0 0 0 0 SUSPENDED WAITFOR 9087
2009-05-28T10:01:40.973 84 0 0 72 0 SUSPENDED LCK_M_S 1018
2009-05-28T10:01:40.973 197 0 0 0 0 RUNNING 0
2009-05-28T10:01:40.973 268 0 0 270 0 SUSPENDED LCK_M_U 36299
2009-05-28T10:01:40.973 270 0 0 197 0 SUSPENDED LCK_M_U 96753
2009-05-28T10:01:40.973 277 0 0 270 0 SUSPENDED LCK_M_U 44290
2009-05-28T10:01:40.973 280 0 0 270 0 SUSPENDED LCK_M_U 60461
2009-05-28T10:01:40.973 292 0 0 270 0 SUSPENDED LCK_M_U 52389
b) the notableactivequeries section of the perfstats output may show that the plan_total_exec_count is very low. This indicates that you appear to be running a freshly compiled plan for the blocking statement. This is all the more likely if your system was running well and then suddenly started to degrade in performance and encounter errors
-- notableactivequeries --
runtime session_id request_id plan_total_exec_count plan_total_cpu_ms plan_total_duration_ms plan_total_physical_reads plan_total_logical_writes plan_total_logical_reads
------------------------------ ---------- ----------- --------------------- -------------------- ---------------------- ------------------------- ------------------------- ------------------------
2009-05-28T10:01:40.973 197 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 268 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 270 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 277 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 280 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 292 0 5 33057 406079 2550 0 2352015
c) when reviewing the error logs, you note that the statement shown below, executed within the procedure proc_UpdateTpWebMetaData, is always involved in the deadlocking
UPDATE
Docs
SET
Docs.BuildDependencySet = NULL
FROM
Docs INNER JOIN BuildDependencies
ON
BuildDependencies.SiteId = Docs.SiteId AND
BuildDependencies.LeafName = Docs.LeafName AND
BuildDependencies.DirName = Docs.DirName
WHERE
BuildDependencies.SiteId = @SiteId AND
BuildDependencies.TargetDirName = @PreviousCustomMasterDir AND
BuildDependencies.TargetLeafName = @PreviousCustomMasterLeaf AND
Docs.BuildDependencySet IS NOT NULL
The reason for this problem, in this particular scenario, is that a new plan had recently been compiled for this statement which uses a HASH join, as opposed to the nested loop join it was using earlier. This is caused by a lack of up-to-date statistics on the AllDocs table. We noted scenarios where the regular update statistics jobs (issued by SharePoint as part of the changes in 2007 SP2) were not thorough enough, and we were forced to run update statistics with the fullscan option on the AllDocs table.
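In this case the targeted maintenance was simply a manual statistics update on the table concerned; a minimal sketch, run in the affected content database, would be:

-- Rebuild all statistics on AllDocs by scanning every row
-- (this also invalidates cached plans that reference the table, so they recompile against the fresh statistics)
UPDATE STATISTICS dbo.AllDocs WITH FULLSCAN;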
The bottom line here is that although SharePoint does now do its own database maintenance, as described in KB 932744, there are potentially still times when you will need to do more detailed, targeted maintenance.
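If you want to check for the symptoms described in (b) outside of the perfstats scripts, the standard DMVs give much the same picture. The following is just a sketch; the LIKE filter on 'Docs' is an assumption about the object name you care about, so adjust it for your own environment.

-- Recently compiled plans and how often they have run
SELECT TOP (20)
    qs.creation_time,
    qs.execution_count,
    qs.total_logical_reads,
    st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%Docs%'
ORDER BY qs.creation_time DESC;

-- When were the statistics on AllDocs last updated?
SELECT s.name,
    STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.AllDocs');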
Comments
Anonymous
January 18, 2011
How out of date is too much for the statistics on the AllDocs table? We're experiencing blocking issues but the stats are at most a couple of days out of date. From a profiler trace, I can see that the stats on the BuildDependencies table are updated several times throughout the day; is there a reason why this would require auto stats updates quite frequently? We tend to be experiencing RESOURCE_SEMAPHORE_QUERY_COMPILE wait types on hundreds of procedure calls, all various ones... do you think this is related?
Anonymous
January 23, 2011
There are a number of different points here which are hard to cover fully in just the comments field:
- The answer to "how out of date is too much" is, as often in such situations, "it depends"! From a purely practical standpoint, they're out of date when you are getting degraded plans where the plan problem is caused by scans of AllDocs and can be fixed by updating the stats manually. Each system varies, so you can't give a fixed number; I've seen some that need updating every 3-4 hours and some every few weeks. Look for plans which change their usage of AllDocs when the stats have not been recently updated.
- Auto update stats will trigger on any table based on a number of system-controlled algorithms around the amount of change to the rows in the table. Basically it will trigger when SQL Server believes it appropriate, and you shouldn't consider this a problem unless the auto stats function itself starts blocking other processes or causing other perf issues.
- If you're seeing lots of RESOURCE_SEMAPHORE_QUERY_COMPILE waits though, it's very likely that this is unrelated. You might want to look at this article about that wait type: blogs.msdn.com/.../compilation-bottlenecks-error-8628-severity-17-state-0-part-1.aspx. These are compilation bottlenecks and need to be addressed as such. If you want to discuss in more detail or provide any data, drop me an email through the contact section of the blog.
Graham