Concurrency, Part 4 - Deadlocks
Yesterday I talked about critical sections and their use in protecting application data.
But critical sections are like duct tape - they hold applications together, but they also have their dark side. A couple of commenters have touched on this, but the biggest issue with critical sections is the "deadlock".
The simplest way of generating a deadlock is what's called a "lock order inversion". Say you have two objects, call them "Thing1" and "Thing2", each protected by its own critical section.
So what happens when you have:
    class Thing1
    {
        CRITICAL_SECTION _Thing1Lock;
        class Thing2 *_Thing2;
    public:
        void Lock() { EnterCriticalSection(&_Thing1Lock); }
        void Unlock() { LeaveCriticalSection(&_Thing1Lock); }
        void Thing1Method1()
        {
            Lock();
            ...
            _Thing2->Thing2Method2();
            ...
            Unlock();
        }
        void Thing1Method2()
        {
            Lock();
            ...
            Unlock();
        }
    };

    class Thing2
    {
        CRITICAL_SECTION _Thing2Lock;
        class Thing1 *_Thing1;
    public:
        void Lock() { EnterCriticalSection(&_Thing2Lock); }
        void Unlock() { LeaveCriticalSection(&_Thing2Lock); }
        void Thing2Method1()
        {
            Lock();
            ...
            _Thing1->Thing1Method2();
            ...
            Unlock();
        }
        void Thing2Method2()
        {
            Lock();
            ...
            Unlock();
        }
    };
So what happens when Thread1 calls:
thing1->Thing1Method1();
And Thread2 calls:
thing2->Thing2Method1();
?
| Thread 1 | Thread 2 |
|---|---|
| thing1->Thing1Method1() | thing2->Thing2Method1() |
| EnterCriticalSection(&_Thing1Lock) | EnterCriticalSection(&_Thing2Lock) |
| ... | ... |
| _Thing2->Thing2Method2() | _Thing1->Thing1Method2() |
| EnterCriticalSection(&_Thing2Lock) | EnterCriticalSection(&_Thing1Lock) |
At this point, you've deadlocked the process. Thread 1 is blocked waiting for Thread 2 to release _Thing2Lock, and Thread 2 is blocked waiting for Thread 1 to release _Thing1Lock.
And there's no way of getting out of this situation - neither thread will ever run again; you're totally stuck.
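To make this concrete, here's a minimal, self-contained sketch of the same lock order inversion, boiled down to two bare critical sections and two Win32 threads (the g_lockA/g_lockB names and the stripped-down structure are illustrative stand-ins for the Thing1/Thing2 code above, not part of the original example). Build it and run it, and it will almost always hang:

    #include <windows.h>

    CRITICAL_SECTION g_lockA;   // plays the role of _Thing1Lock
    CRITICAL_SECTION g_lockB;   // plays the role of _Thing2Lock

    // Mirrors Thread 1 above: acquires lock A, then lock B.
    DWORD WINAPI Thread1Proc(LPVOID)
    {
        EnterCriticalSection(&g_lockA);
        Sleep(100);                      // widen the window so the inversion reliably bites
        EnterCriticalSection(&g_lockB);  // blocks forever once the other thread owns lock B
        LeaveCriticalSection(&g_lockB);
        LeaveCriticalSection(&g_lockA);
        return 0;
    }

    // Mirrors Thread 2 above: acquires lock B, then lock A - the inverted order.
    DWORD WINAPI Thread2Proc(LPVOID)
    {
        EnterCriticalSection(&g_lockB);
        Sleep(100);
        EnterCriticalSection(&g_lockA);  // blocks forever once the other thread owns lock A
        LeaveCriticalSection(&g_lockA);
        LeaveCriticalSection(&g_lockB);
        return 0;
    }

    int main()
    {
        InitializeCriticalSection(&g_lockA);
        InitializeCriticalSection(&g_lockB);

        HANDLE threads[2];
        threads[0] = CreateThread(NULL, 0, Thread1Proc, NULL, 0, NULL);
        threads[1] = CreateThread(NULL, 0, Thread2Proc, NULL, 0, NULL);

        // Never returns: each thread is waiting for a lock the other one owns.
        WaitForMultipleObjects(2, threads, TRUE, INFINITE);
        return 0;
    }

Swap the acquisition order in either thread so that both take g_lockA before g_lockB and the hang disappears - which is exactly the fix described next.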
So how do you avoid deadlocks? Well, this is where the whole "concurrency discussion" starts getting tricky.
The simplest solution is, once again, to avoid the problem: define a strict ordering for your locks and stick with it.
So this would change Thing2::Thing2Method1() to be:
    void Thing2Method1()
    {
        _Thing1->Lock();
        Lock();
        ...
        _Thing1->Thing1Method2();
        ...
        Unlock();
        _Thing1->Unlock();
    }
And now, the flow of control is:
| Thread 1 | Thread 2 |
|---|---|
| thing1->Thing1Method1() | thing2->Thing2Method1() |
| EnterCriticalSection(&_Thing1Lock) | EnterCriticalSection(&_Thing1Lock) - blocks, Thread 1 owns it |
| ... | |
| _Thing2->Thing2Method2() | |
| EnterCriticalSection(&_Thing2Lock) | |
| ... | |
| LeaveCriticalSection(&_Thing2Lock) | |
| ... | |
| LeaveCriticalSection(&_Thing1Lock) | EnterCriticalSection(&_Thing1Lock) returns - the lock is now available |
| | EnterCriticalSection(&_Thing2Lock) |
| | ... |
| | _Thing1->Thing1Method2() |
| | EnterCriticalSection(&_Thing1Lock) - recursive, this thread already owns it |
| | ... |
| | LeaveCriticalSection(&_Thing1Lock) |
| | ... |
| | LeaveCriticalSection(&_Thing2Lock) |
| | LeaveCriticalSection(&_Thing1Lock) |
Voila, we've avoided the deadlock.
So that's the third principle of concurrent programming: Know your lock order and never, ever violate it.
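One way to make that principle hard to violate is to funnel every multiple-lock acquisition through a single helper that knows the order. Here's a small sketch along those lines (the EnterBoth/LeaveBoth helpers are illustrative, not part of the post's code): it enters two critical sections lowest-address-first, so every caller ends up using the same order - the same idea one of the commenters below suggests.

    #include <windows.h>

    // Acquires two critical sections in a globally consistent order (lowest
    // address first), so two callers that pass the same pair in opposite
    // order still lock them identically and can't invert.
    void EnterBoth(CRITICAL_SECTION *a, CRITICAL_SECTION *b)
    {
        if (a > b)
        {
            CRITICAL_SECTION *tmp = a; a = b; b = tmp;
        }
        EnterCriticalSection(a);
        if (b != a)             // tolerate being handed the same lock twice
        {
            EnterCriticalSection(b);
        }
    }

    // Releases in the reverse of the acquisition order.
    void LeaveBoth(CRITICAL_SECTION *a, CRITICAL_SECTION *b)
    {
        if (a > b)
        {
            CRITICAL_SECTION *tmp = a; a = b; b = tmp;
        }
        if (b != a)
        {
            LeaveCriticalSection(b);
        }
        LeaveCriticalSection(a);
    }

Comparing addresses gives you a total order for free, but a named, documented hierarchy scales better, because the intended order is visible in the code rather than being an accident of memory layout.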
But what happens if you can't avoid violating the lock order? It does happen, especially if your code calls out to other components.
Tomorrow, I'll talk about some ways of avoiding the problem.
Edit: Fixed typo in Thing2::Lock and Unlock, thanks ac.
Comments
- Anonymous
February 17, 2005
The hard part is maintenance, for the original programmer as well as anyone else who has to do it. If it's up to the programmer to remember to lock things in a specific order, it creates bugs that take forever to figure out. It's hard enough to remember what's in a class you haven't touched for 6 months, let alone what order you're supposed to lock things, let alone what things need to be locked. It's horrible.
A good idea just crossed my mind: Every time a thread tries to acquire a mutex/critical section/etc., the acquisition routine looks at the other mutexes that thread owns and makes sure they're being acquired in the correct order, i.e. by memory address. If they're out of order, then it does some magic to fix it. The magic could be difficult. This way locks aren't acquired in the wrong order and the philosophers can all dine (I think that's the right situation, correct me if I'm wrong).
Yes, it's horribly expensive during program runtime, but it's a lot faster during program development, testing, maintenance, release, and support.
Ugh. Every time I start thinking about concurrency, even easy concurrency, I cringe at how difficult it is and how many undiscovered bugs my code must still have.
Programming languages and tools will have to evolve better support for it; otherwise most programmers will be totally lost. It's simply too hard for non-Einsteins to get correct.
- Anonymous
February 17, 2005
Absolutely right, Ryan.
I've tried the "define a hierarchy and change your EnterCriticalSection logic to enforce it" approach - it's not too hard: you define a TLS entry that contains a linked list of the locks that have been acquired on the thread so far and check it on entry.
The problem is that this requires a HUGE amount of effort on the part of the development team, and there are times when lock inversions are benign - if two code paths can never be executed simultaneously (because of a higher-level lock, for example), then their relative lock order doesn't matter.
- Anonymous
February 17, 2005
I'm guessing that the Lock and Unlock functions in class Thing2 should be calling Enter/LeaveCriticalSection(&_Thing2Lock), not _Thing1Lock, right?
- Anonymous
February 17, 2005
The comment has been removed
- Anonymous
February 17, 2005
The comment has been removed
- Anonymous
February 17, 2005
Larry, this is a great series! Thanks.
If you write device drivers for Windows, you get lock inversion checking when you run the Driver Verifier. This is just so cool. Kudos to the team who did that; it must have saved man-centuries by now and contributed massively to kernel stability.
By the way, I don't believe that any lock inversion is benign. You often have to think really hard to convince yourself, and no matter how sure you are that it works today, there is no guarantee that it will work after the latest bug fix (which Murphy tells us will be a rushed emergency security patch).
- Anonymous
February 17, 2005
One thing I've done in the past to prevent this problem is this:
a) a unique integer is associated with each critical section
b) a stack of integers is kept in TLS for each thread
c) debug macros for EnterCriticalSection and LeaveCriticalSection push/pop the associated integer on that stack; if the value being pushed is less than a previously pushed value, an assert fires
This enforces a stricter policy than needed, but it solves this common problem :)
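A rough sketch of that scheme might look like this (the OrderedLock type, the EnterOrdered/LeaveOrdered wrappers, and the thread_local bookkeeping are illustrative stand-ins for the commenter's macros and raw TLS slot, not their actual code):

    #include <windows.h>
    #include <assert.h>

    // Each critical section gets a unique, ordered ID when it's created.
    struct OrderedLock
    {
        CRITICAL_SECTION cs;
        int id;
    };

    // Per-thread stack of the IDs of the locks this thread currently holds.
    const int MaxHeldLocks = 64;
    thread_local int g_heldLockIds[MaxHeldLocks];
    thread_local int g_heldLockCount = 0;

    void EnterOrdered(OrderedLock *lock)
    {
        // The new lock's ID must not be lower than any ID this thread already
        // holds; if it is, we're acquiring out of the documented order.
        for (int i = 0; i < g_heldLockCount; i++)
        {
            assert(lock->id >= g_heldLockIds[i] && "lock order violation");
        }
        EnterCriticalSection(&lock->cs);
        assert(g_heldLockCount < MaxHeldLocks);
        g_heldLockIds[g_heldLockCount++] = lock->id;
    }

    void LeaveOrdered(OrderedLock *lock)
    {
        LeaveCriticalSection(&lock->cs);
        // Pop this lock's ID; releases are assumed to be LIFO, which keeps
        // the bookkeeping simple.
        assert(g_heldLockCount > 0 && g_heldLockIds[g_heldLockCount - 1] == lock->id);
        g_heldLockCount--;
    }

In a release build the asserts compile away, and you'd typically #ifdef the bookkeeping out as well, so shipping code pays nothing for the checking.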
- Anonymous
February 17, 2005
This applies not only to systems programmers, but also to database programmers. I can't tell you how many bugs I've fixed in database software because two developers writing two separate sprocs access (and lock) the same database objects in a different order. It never seems to manifest itself in testing, but in a production environment the phone starts to ring with statements like "My process was chosen as a deadlock victim!!!"
- Anonymous
February 17, 2005
Another variation of the unique integer system is to assign each lock a priority, which doesn't need to be unique. If you try to lock multiple locks, then the new lock must have a higher priority than the previous lock. VMS used this trick with their IRQL locks (or something like that).
- Anonymous
February 19, 2005
I think the issue here is that you're overly protective. You should only lock around code that absolutely needs to be locked. Pointers to the other thing are instanced, not global, so calls to their methods do not need to be locked. If you stick to locking only around code that accesses a shared resource (and does absolutely nothing else), and have a single lock per resource, you would not have this problem. Instead, a lot of programmers decide to be "safe" and lock the whole method even though it shouldn't be.
- Anonymous
February 25, 2005
The comment has been removed
- Anonymous
February 25, 2005
Niclas,
You're right. I disagree on the APIs that do locking - there are very few examples where it's appropriate for an API to trust the integrity of the API's internal data structures to its callers; the callers will almost always get it wrong.
But you're right about multithreading not being hard - that's actually why I started writing this series - it's really NOT that hard, if you follow a few rules.