The Joys of Compiler and Processor Reordering
So I thought that a good first technical blog entry would be one about a common – but “hardly thought about by most programmers” problem called “reordering”. It is a subtle problem but very important to understand if you write lock-less multithreaded code or write code that directly reads and writes mapped hardware registers.
First we need to define reordering in concrete terms. I define it this way: reordering occurs when a program executes load and/or store instructions (load = read and store = write) in an order that is different than the order specified in the program’s source code. There are 2 flavors of reordering: compiler reordering and processor reordering. Reordering problems are specific to multi-threaded code running on single-processor or multi-processor systems. Reordering is not a problem for single threaded code. Also, reordering is not specific to kernel mode code although it has some interesting ramifications for kernel mode developers that I'll talk about in a future post. Let’s look at some code:
int
AddEm(
void
)
{
int a, b, c, z;
a = 1; // (1)
b = 2; // (2)
c = 3; // (3)
z = a + b + c; // (4)
return z;
}
In what order do the operations happen in this code (notice that I numbered the operations in the code)? The fact of the matter is – we don’t know! It could be 1,2,3,4 or 3,2,1,4 or 1,3,2,4 etc.. These are all legal optimizations that the compiler or processor can make. Why? Well – because to a single thread they are unobservable changes. A single thread can’t observe the intermediate stages of the values of a, b and c – it can only see their values at the point it sums them and stores the result in z – operation 4. The only thing that we know about the order of the operations in the above code is that the sequence will end with operation 4. If the sequence didn’t end with operation 4, then we would be able to observe the reordering in a single thread of execution – that would be really bad. Why would a compiler or processor want to do reordering anyway? There are many reasons that optimizations occur, however, let’s not focus on the reasons for the optimizations and focus on how to deal with them in our code (one reason for optimization might be combining memory accesses that are to the same processor cache line yet disparate in the source code). Now let's look at some multi-threaded code:
typedef struct _FOO {
ULONG A;
ULONG B;
ULONG C;
ULONG D;
BOOLEAN Initialized;
} FOO;
VOID
InitAFoo(FOO* f)
{
f->A = 4;
f->B = 8;
f->C = 15;
f->D = 16;
f->Initialized = TRUE;
}
//
// A global variable of type FOO
//
FOO f;
VOID
T1(VOID)
{
//
// Init the global FOO
//
InitAFoo(&f);
return;
}
VOID
T2(VOID)
{
ULONG AVal;
//
// See if our FOO is initialized
//
if (f.Initialized) {
AVal = f.A;
// Use AVal here
}
return;
}
This code is a little more interesting as we have 2 threads of execution that both use the same global data. This means that they can observe intermediate states in the global data as the other execution path modifies it. Let’s assume that the routine InitAFoo() gets reordered so that its execution order is like this:
VOID
InitAFoo(FOO* f)
{
f->A = 4; // (4)
f->B = 8; // (3)
f->C = 15; // (1)
f->D = 16; // (5)
f->Initialized = TRUE; // (2)
}
If T1() and T2() run in parallel then there exists a window of time where T2() can see f.Initialized equal to TRUE before f.A has been initialized. This is really bad if f.A is a pointer or anything that we use in any way for that matter.
I should point out here that T1() and T2() are sharing data without a lock. This type of progrmming raises the "difficult to get right" bar quite a bit. In fact if you use locks around your shared data then you never have to worry about reordering problems (more on why in a moment). T1() and T2() are (sort of) an example of lock free code. They are not technically “lock free” under the strictest definition – but the same principles apply – they modify global data without a shared locking mechanism. Lock-free code counts on state transitions for synchronization. In this case – the intended synchronizing state transition is when f.Initialized gets set to TRUE. The idea is that T1() is a producer (it produces initialized FOOs). T2() is a consumer of the FOOs. T2() knows that a FOO is ready for consumption when its Initialized field is set to TRUE.
The easiest way to fix this code is to associate a lock with f – and then acquire the lock every time the code reads or writes f. In other words – T1() would call InitAFoo() with the lock held and T2() would read f under the lock like so:
VOID
T1(VOID)
{
//
// Init the global FOO
//
Lock(fLock);
InitAFoo(&f);
Unlock(fLock);
return;
}
VOID
T2(VOID)
{
ULONG AVal;
//
// See if our FOO is initialized
//
Lock(fLock);
if (f.Initialized) {
AVal = f.A;
// Use AVal here
}
Unlock(fLock);
return;
}
Life is so much easier when you use locks. But just for the sake of this topic – let’s say we have a really good reason to - and decide not to use locks and to continue down this lock-less path.
So what is it about the code that messes us up here? It is the fact that there is no reliable order of initialization of f as observed by T1() . What we really need is a way to make sure that f.Initialized is always set to TRUE after the other fields of f have been initialized. How do we do that? Well – on most platforms there is a thing called a “memory barrier”. In the generic sense a memory barrier is a construct that prevents a compiler or processor from reordering instructions around it. Windows has a routine called MemoryBarrier() that does just this. But where exactly would we put this in our code? Well – where do we need the strict order of operations to occur? After the initialization of the fields of f and before f.Initialized. Basically stated – we don’t care what order the fields of f are initialized in - as long as they are all initialized before f.Initialized is set to TRUE. So we could modify our code to be:
VOID
InitAFoo(FOO* f)
{
f->A = 4;
f->B = 8;
f->C = 15;
f->D = 16;
MemoryBarrier();
f->Initialized = TRUE;
}
By changing the code in this way we force the order of operations as observed by T2() . With the code written as above, T2() can always know that if f.Initialized is TRUE then the fields of f are initialized. Note that MemoryBarrier() acts as a compiler and processor memory barrier. There are many flavors of barriers and they vary by platform. Some barriers only prevent writes, some only prevent reads. MemoryBarrier() is what is called a “full barrier” – it blocks both reads and writes from moving around it. There are also variable decorations like "volatile" that have barrier implications. If the specific barrier types and volatiles are interesting I can cover them in another post.
Note: For this code to be explicitly correct we would also have to do this in T2() :
VOID
T2(VOID)
{
ULONG AVal;
//
// See if our FOO is initialized
//
MemoryBarrier();
if (f.Initialized) {
AVal = f.A;
// Use AVal here
}
return;
}
The reason we need this barrier on the "read side" is very complicated and will not happen on any current processor that Windows supports. However, if you are a sick individual and want the whole story let me know and I'll post about it.
So why do locks prevent you from worrying about this? You guessed it – all locks will have an embedded memory barrier! If they didn’t they wouldn’t be a lock. My personal advice is to always write your code with locks. If you can prove to yourself that the locked paths are too slow – make sure you are only locking around the required elements of your code (don’t lock around a whole function if you just need to lock around a single assignment). If you perf is still suffering – then and only then should you put yourself through the torture of writing correct lock-free code. By the way – did I mention that every compiler and platform have different rules about how things can get reordered. More to come on this later.
Well – I hope I haven’t bored you to tears with this. I it is not interesting – please let me know and I will start an entry about Lost!
Comments
Anonymous
March 07, 2008
Very good post, and quite an interesting theme. But I didn't get one thing straight: you said that MemoryBarrier() blocks loads and stores from being reordered. I understand how that fixes our T1/T2 problem, but I didn't see how it acts as a lock, which is, to the best of my understanding, a means of halting one thread until the other one is finished doing something. Anyway, I hope you follow up on this theme (especially on the reason why MemoryBarrier() is also needed in T2)! Congratulations on your first (real) post.Anonymous
March 07, 2008
Cool! I am glad at least one person found it sort of interesting. :) So - to be clear - the memory barriers only enforce a consistent view and are not themselves locks. A lock requires the additional semantics of waiting and releasing. This code would allow a guaranteed pass/fail for T2 but would not function as a lock. This is the fundamental difference in code that uses locks and lock-free code. Thanks for the comment.Anonymous
March 09, 2008
Another sick individual here - I also thought it was interesting. I didn't know about the problem or about the API (actually a macro, apparently). Now I have to go back and check all my old code...Anonymous
March 09, 2008
Very interesting post. Can you please explain How MemoryBarrier() works to prevent reordering?Anonymous
March 09, 2008
'The reason we need this barrier on the "read side" is very complicated and will not happen on any current processor that Windows supports. However, if you are a sick individual and want the whole story let me know and I'll post about it.' I'm sick. Now you've locked yourself into posting why a read barrier is needed with this_code (even if not needed with the current set of Windows-friendly processors). Meanwhile, I think you need to assign operation numbers separately to the loads and stores. First for pedantry, where you say that operations (1) and (2) and (3) can occur in any order but have to occur before (4), that's because C's sequence points rule has an "as if" metarule so the sequence points don't really have to be enforced when violations cannot be observed. Now the effect at the unobservable level of actual operations is that the read sides of (1) and (2) and (3) have to occur before the read side of (4), but the store sides of (1) and (2) and (3) and (4) can occur in any order. In a multiprocessor system it would be possible to observe that z becomes 6 before a becomes 1.Anonymous
March 10, 2008
Hi, My guess for the read barrier is that you want to avaid an eager CPU to prefetch f.A before checking the initialized value which might retrieve a corrupted data. To my best of my knowledge, this can happen on IA64 machines (Which are supported on windows). So, was my guess right? Anyway, your BLOG is very interesting. Keep doing the good work!! Thanks, Eran.Anonymous
March 10, 2008
In a previous post on compiler and processor reordering, I said that for multi-threaded, lock-free codeAnonymous
March 10, 2008
Ok - I posted the "read side barrier" post. http://blogs.msdn.com/itgoestoeleven/archive/2008/03/11/the-joys-of-compiler-and-processor-reordering-why-you-need-the-read-side-barrier.aspxAnonymous
March 10, 2008
In a previous post on compiler and processor reordering, I said that for multi-threaded, lock-free codeAnonymous
June 13, 2009
話題の小向美奈子ストリップを隠し撮り!入念なボディチェックをすり抜けて超小型カメラで撮影した神動画がアップ中!期間限定配信の衝撃的映像を見逃すな