Today's Annoyingly-Unwieldy Term: "Intra-Query Parallel Thread Deadlocks"
I've received a couple of questions in email and in comments about deadlocks involving mysterious-sounding non-lock resources like "exchangeEvent" and "threadpool". There are a couple of examples in the comments for post https://blogs.msdn.com/bartd/archive/2006/09/25/deadlock-troubleshooting-part-3.aspx, and here's a forum post on the topic: https://forums.microsoft.com/Forums/ShowPost.aspx?PostID=3913233&SiteID=1.
Here's one example (note that I've omitted the "inputbuf" and "executionStack" nodes for the sake of brevity and clarity):
__________________________________________________________________
deadlock-list
  deadlock victim=process38316d8
  process-list
    process id=process3808478 schedulerid=1 kpid=216 status=suspended spid=51 sbid=0 ecid=8
    process id=process3809ac8 schedulerid=1 kpid=5672 status=suspended spid=51 sbid=0 ecid=17
    process id=process38136d8 schedulerid=2 kpid=5644 status=suspended spid=51 sbid=0 ecid=16
    process id=process3813828 schedulerid=2 kpid=6064 status=suspended spid=51 sbid=0 ecid=9
    process id=process381c478 schedulerid=3 kpid=5292 status=suspended spid=51 sbid=0 ecid=10
    process id=process381d2e8 schedulerid=3 kpid=4372 status=suspended spid=51 sbid=0 ecid=19
    process id=process38265c8 schedulerid=4 kpid=5552 status=suspended spid=51 sbid=0 ecid=11
    process id=process3827ac8 schedulerid=4 kpid=5716 status=suspended spid=51 sbid=0 ecid=18
    process id=process38309b8 waittime=609 schedulerid=5 kpid=0
    process id=process38312e8 schedulerid=5 kpid=3204 status=suspended spid=51 sbid=0 ecid=6
    process id=process38316d8 schedulerid=5 kpid=5108 status=suspended spid=51 sbid=0 ecid=13
    process id=process383a718 schedulerid=6 kpid=5216 status=suspended spid=51 sbid=0 ecid=7
    process id=process383ada8 waittime=609 schedulerid=6 kpid=0
    process id=process383beb8 schedulerid=6 kpid=5852 status=suspended spid=51 sbid=0 ecid=14
    process id=process3845588 schedulerid=7 kpid=6096 status=suspended spid=51 sbid=0 ecid=15
    process id=process38456d8 schedulerid=7 kpid=760 status=suspended spid=51 sbid=0 ecid=0
    process id=process3845c18 schedulerid=7 kpid=5992 status=suspended spid=51 sbid=0 ecid=12
  resource-list
    threadpool id=scheduleree6080
      owner-list
        owner id=process38316d8
        owner id=process38312e8
      waiter-list
        waiter id=process38309b8
    exchangeEvent id=port80140950 nodeId=9
      owner-list
        owner event=pending id=process383ada8
        owner event=pending id=process38309b8
      waiter-list
        waiter event=e_waitPortOpen type=consumer id=process3813828
        waiter event=e_waitPortOpen type=consumer id=process3808478
        waiter event=e_waitPortOpen type=consumer id=process381c478
        waiter event=e_waitPortOpen type=consumer id=process38265c8
        waiter event=e_waitPortOpen type=consumer id=process3845c18
        waiter event=e_waitPortOpen type=consumer id=process38316d8
        waiter event=e_waitPortOpen type=consumer id=process383beb8
        waiter event=e_waitPortOpen type=producer id=process3845588
        waiter event=e_waitPortOpen type=producer id=process38136d8
        waiter event=e_waitPortOpen type=producer id=process3809ac8
        waiter event=e_waitPortOpen type=producer id=process3827ac8
        waiter event=e_waitPortOpen type=producer id=process381d2e8
    exchangeEvent id=port80140690 nodeId=5
      owner-list
        owner event=pending id=process383ada8
        owner event=pending id=process38309b8
      waiter-list
        waiter event=e_waitPortOpen type=consumer id=process38456d8
    exchangeEvent id=port80140c10 nodeId=12
      owner-list
        owner event=pending id=process383ada8
        owner event=pending id=process38309b8
      waiter-list
        waiter event=e_waitPortOpen type=producer id=process38312e8
        waiter event=e_waitPortOpen type=producer id=process383a718
    threadpool id=scheduleref6080
      owner-list
        owner id=process383beb8
        owner id=process383a718
      waiter-list
        waiter id=process383ada8
__________________________________________________________________
"exchangeEvent" Deadlock Resources
Some terminology, to better understand the trace flag 1222 deadlock output shown above:
spid = system process ID, AKA "session_id" -- to oversimplify slightly, this represents a connection to SQL
sbid = system batch ID, also called "request_id" -- a query that a spid is running
ecid = execution context ID -- a worker thread running part of a query
There's one thing that you should note about this deadlock right off the bat: all of the participants (the "process" nodes in the -T1222 output) are from the same session identifier (spid) and the same batch (sbid). They each have a different thread ID (kpid and ecid). Each "process" entry in this deadlock represents a different worker thread, but all of these worker threads are running part of a single large parallel query that was submitted by spid 51.
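You can observe these per-thread execution contexts yourself while a parallel query runs. Here's a minimal sketch (SQL 2005 and later; assumes the spid of interest is 51, as in the deadlock graph above):

-- One row per worker thread (task) for the session.  On a parallel query
-- you'll see many rows sharing the same session_id, each with a different
-- exec_context_id (the "ecid" in the deadlock graph).
SELECT t.session_id,          -- spid
       t.request_id,          -- sbid
       t.exec_context_id,     -- ecid
       t.scheduler_id,
       w.wait_type
FROM sys.dm_os_tasks AS t
LEFT JOIN sys.dm_os_waiting_tasks AS w
       ON w.waiting_task_address = t.task_address
WHERE t.session_id = 51;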
Another interesting thing about this deadlock is in the resource list: most deadlocks involve lock resources ("pagelock", "keylock", etc), but this one only deals with "exchangeEvent" and "threadpool" resources.
Deadlocks centering around exchangeEvent resources have been given the name "intra-query parallelism deadlock". (I know -- it just drips 'sexy', doesn't it?) They may be accompanied by this error message -- sent to the client app only, not logged in the SQL errorlogs:
Server: Msg 8650, Level 13, State 1, Line 1 Intra-query parallelism caused your server command (process ID #51) to deadlock. Rerun the query without intra-query parallelism by using the query hint option (maxdop 1).
An "exchangeEvent" resource indicates the presence of parallelism operators in a query plan. The idea is that the work for an operation like a large scan, sort, or join is divided up so that it can be executed on multiple child threads. There are "producer" threads that do the grunt work and feed sets of rows to "consumers". Intra-query parallelism requires signaling between these worker threads: the consumers may have to wait on producers to hand them more data, and the producers may have to wait for consumers to finish processing the last batch of data. Parallelism-related waits show up in SQL DMVs as CXPACKET or EXCHANGE wait types (note that the presence of these wait types is normal and simply indicates the presence of parallel query execution -- by themselves, these waits don't indicate that this type or any other type of deadlock is occurring).
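A quick way to see these wait types on your own server (a sketch; cumulative counters since the last service restart):

-- Nonzero CXPACKET/EXCHANGE numbers here are normal on any server that
-- runs parallel plans; they do NOT by themselves indicate a deadlock.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('CXPACKET', 'EXCHANGE');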
Wherever you have threads waiting for resources, there is a risk that they will end up in a circular blocking chain (thread A holding resource X and waiting for resource Y, thread B holding resource Y and waiting for resource X). The synchronization objects used in parallel query execution are no exception; in rare cases, the threads running a single query can end up deadlocking with one another. Most intra-query parallelism deadlocks are considered bugs, although some of them can be risky to fix, so a fix may not always be possible. If you run into one and you're already on the latest SQL service pack, your best bet may be to investigate workarounds. Luckily, this type of deadlock is relatively uncommon, and in most cases it's possible to work around the problem by eliminating parallelism in the query. Try one of these two approaches:
Workaround #1: Add an index or improve the query to eliminate the need for parallelism. In most cases, the use of parallelism in a query indicates that you have a very large scan, sort, or join that isn't supported by proper indexes. If you tune the query, you will often find that you end up with a much quicker and more efficient plan that doesn't use parallelism, and therefore isn't subject to this type of problem. Of course, in some queries (DSS/OLAP-type queries, in particular) it may be difficult to eliminate all large scans.
Workaround #2: Force single-threaded execution with an "OPTION (MAXDOP 1)" query hint at the end of the query. If you can't modify the query, you can apply the hint to any query with a plan guide (assuming that you're running SQL 2005 or later).
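Both forms of Workaround #2 sketched out below (the query, table names, and plan guide name are hypothetical; the statement text in the plan guide must match what the application actually sends, character for character):

-- Hint applied directly to the query:
SELECT c.CustomerID, SUM(o.TotalDue)
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID
OPTION (MAXDOP 1);

-- Same hint attached via a plan guide (SQL 2005+), for when the query
-- text can't be changed:
EXEC sp_create_plan_guide
    @name  = N'NoParallelismForBigReport',
    @stmt  = N'SELECT c.CustomerID, SUM(o.TotalDue)
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID',
    @type  = N'SQL',
    @module_or_batch = NULL,
    @params = NULL,
    @hints = N'OPTION (MAXDOP 1)';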
"threadpool" Deadlock Resources
A process waiting for a "threadpool" resource is actually waiting for a worker thread. There are a finite number of threads in SQL's thread pool, and if they are all in use, new requests must wait for an in-progress task to complete and free up a thread. Thread pool waits (in DMVs, the "THREADPOOL" or "UMSTHREAD" wait type) are typically a side effect of a massive resource contention problem -- most commonly, a large blocking chain. You should investigate what is tying up all of your worker threads, and eliminate that bottleneck. While it's not clear from this deadlock output alone, I suspect that in this case there may have been many other large untuned queries using lots of parallel threads, so excessive parallelism itself may have been the cause of the thread starvation.
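To hunt for the culprit, you can look at what is waiting and who is blocking whom. A minimal sketch (SQL 2005+):

-- Tasks queued for a worker thread, plus any blocked tasks -- useful for
-- spotting the blocking chain that is consuming the thread pool.
SELECT session_id, wait_type, wait_duration_ms, blocking_session_id
FROM sys.dm_os_waiting_tasks
WHERE wait_type = 'THREADPOOL'
   OR blocking_session_id IS NOT NULL;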
"resourceWait" Deadlock Resources
(UPDATE: The following info on "resourceWait" was added to this post 15 Oct 2009.)
A process waiting for a "resourceWait" resource is waiting for a "resource semaphore". Resource semaphores are typically used to govern memory used for query sorts and hashes. So the following wait graph:
resource-list
  keylock hobtid=72057594038845440 dbid=6 objectname=XXXTABLE indexname=YYYINDEX id=lockffffffff81314cc0 mode=X associatedObjectId=72057594038845440
    owner-list
      owner id=processebb108 mode=X
    waiter-list
      waiter id=processebae38 mode=S requestType=wait
  resourceWait
    owner-list
      owner id=processebae38
    waiter-list
      waiter id=processebb108
could be read as:
Spid A is waiting for a shared key lock, but is blocked by Spid B, who holds an exclusive lock on this key.
Spid B is waiting for more memory to run his query (and, eventually, to release the X key lock), but he is blocked by other spids, including Spid A, who are currently holding all of the memory available for this type of operation.
You could tackle this by looking for tuning opportunities in the queries run by both deadlock participants. They are probably running a query plan that involves a hash or sort operation. Remove this through indexing or query changes and you should eliminate the query's need to wait for a query memory grant. You could also try throwing RAM at the problem, but keep in mind that query workspace memory, the memory used for sorts and hashes, must be drawn from "visible buffer pool". On a 32-bit box, visible bpool is limited to approximately 2GB (3GB if you are running with /3GB). If SQL already has this much memory available to it, adding more won't help.
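You can see who holds and who is queued for query workspace memory with a DMV query like this sketch (SQL 2005 and later):

-- One row per query that has requested a workspace memory grant.
-- A non-null grant_time means the memory was granted; NULL means the
-- query is still queued on the resource semaphore.
SELECT session_id, requested_memory_kb, granted_memory_kb,
       grant_time, wait_time_ms
FROM sys.dm_exec_query_memory_grants
ORDER BY wait_time_ms DESC;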
Caveats
Just because you see “exchangeEvent” resources in your deadlock graph doesn’t necessarily mean that you are facing an intra-query parallelism deadlock. Sometimes the engine includes extraneous resources in the deadlock graph. This makes it important to find out how the waiters relate to one another so that you can determine which of the resources is an essential part of the circular blocking chain. The post https://blogs.msdn.com/bartd/archive/2006/09/09/Deadlock-Troubleshooting_2C00_-Part-1.aspx steps you through a deconstruction of -T1222 output so that you can get a clearer understanding of the relationships. As a rule of thumb, if there are any lock resources in your deadlock output (pagelock, keylock, rowlock) along with the exchangeEvent resources, you should suspect that the exchangeEvent resources are non-essential and that you are probably facing a “normal” deadlock.
If you're interested in more background info on parallel query execution, there's a great presentation by Craig Freedman attached to this blog post: https://blogs.msdn.com/craigfr/archive/2007/04/17/parallel-query-execution-presentation.aspx
Comments
Anonymous
March 16, 2009
I have seen a few users out there who think that deadlocks in SQL Server are a bug that has not been corrected.
Anonymous
November 22, 2009
The comment has been removed
Anonymous
December 03, 2009
Michael, I recommend trying Workaround #1 or Workaround #2 described in the post.
Anonymous
October 28, 2011
This article was very helpful to me and saved me a great deal of time. Thanks for taking the time to write it up.
Anonymous
September 19, 2013
Has anyone heard / experienced if this is still a problem with SQL 2012?
Anonymous
August 14, 2014
In SQL Server 2012 (SP1, CUx) you can still catch this deadlock with extended events, but: T1222 does not catch it anymore, no exception is raised, and the statement executes without problem. Looks like it is handled internally by SQL Server 2012.
Anonymous
November 24, 2014
This helped me to solve the deadlock issue. Thanks
Anonymous
June 07, 2016
Hey Bart, great write-up. Thanks for taking the time to do it. I recently had this problem in an ETL query where we were using an exclusive table lock to speed things up. I figured out that the spid was blocking itself and applied the maxdop change before I found your article, but it seems to me that this problem would happen 100% of the time if you had a parallelized query with a table lock. So, in an effort to leave a Google breadcrumb for someone else to find this page more easily, I wanted to comment to add some keywords so that future SQL nerds could find this page by searching for their error message. Without further ado:
was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim
TABLOCKX