The Joys of Compiler and Processor Reordering: Why You Technically Need Read-Side Barriers

In a previous post on compiler and processor reordering, I said that for multi-threaded, lock-free code to be totally correct - you need a barrier on the read sides of the code - but that it was pretty complicated and wasn't required on any processors that Windows currently runs on. I also said that if there were equally as sick-minded individuals as myself that wanted to dive into the specifics of why - that I would post about it. There are - as it turns out - a lot of sickos out there and this is that post. :)

Consider this code:

//

// Global data definition and initialization

//

ULONG x = 1;

ULONG* p = &x;

ULONG y = 0;

VOID

T1(

    VOID

    )

{

    //

    // Set y to 1

    //

   

    y = 1;

    //

    // Compiler and processor barrier that prevents reodering

    // and forces an “invalidate” message to be sent to all

    // processors caching locations that have been written to by

    // this processor since the last barrier and before this

    // one – in other words the variable y

    //

    MemoryBarrier();

    //

    // Set p to point to y

    //

    p = &y;

}

VOID

T2(

    VOID

    )

{

    ULONG i = 1;

    //

    // Read the contents of the memory location pointed

    // to by p and store its contents in i

    //

   

    i = *p;

    ASSERT(i != 0);

}

In this code T1() is the producer – it populates p with the pointer to some variable. T2() consumes p. From reading this code we can observe the following facts about this code:

- p will only ever point to x or y

- x will always have a value of 1

- y will initially be set to 0, then set to 1

- a memory barrier is issued by T1() after y is set to 1 and before p is set to point to y

Given these facts we should be able reason that we could never see i with the value of 0. However, there exists at least one processor where this is possible – the Alpha 21264. It is a narrow windowed race condition – but it is real – and if you wrote code that ran on that processor you would have to code within its model and worry about this exact problem. So how exactly could this occur on this processor? Let’s explore.

First - before T1() or T2() start to run, x, y and p are all initialized. Then T1() starts running on processor P1 and T2() starts running on processor P2. At this point any dereference of p will access the location containing the variable x. T1() then sets y to the value 1, but this does not mean that the outside world knows about this update to y yet. That comes when we issue the call to MemoryBarrier() . When MemoryBarrier() is called a message is sent to all the other processors’ probe queues . A probe queue is a per processor queue of addresses that have been invalidated by other processors. Each processor has its own queue and processes it independently of other processors. So in this case – when T1() calls MemoryBarrier() – a message is sent to P2 indicating that the memory location that holds the contents of the variable y, is no longer valid – it contains stale data and should not be used. Excellent! This is exactly what we want. The next thing T1() does is set p to hold the address of y. This is perfectly correct from the point of view of T1() and P1. y is fully initialized before it is published via p. This is good.

So how does T2() ever see i as 0? Let’s follow the flow of T2() (a very important piece of information is that P2 currently has y cached in its processor cache with the value 0). The first thing that T2() does is define and initialize i to the value of 1. Next it dereferences the address contained in p. What address will that be? We don’t know – it is a race condition. What we do know is that it will either be the address of x or the address of y. First let’s look at the case where p is pointing to x. In this case we know the value of i will be 1 because x is and has always been 1 since it was initialized. Not too interesting of a case. So let’s consider the more interesting case where p is pointing to y. If p is pointing to y then we know the following: y was initialized to 0, was then changed to 1 by T1() before T1() issued a MemoryBarrier() call. Now we need to know what a MemoryBarrier() call does on this processor. Here is what the manual says:

If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox waits until the integer queue is

empty and then performs the following actions:

a. Sends all pending MAF and IOWB entries to the system port.

b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed

events. When the counter decrements from one to zero, the Cbox marks the

youngest probe queue entry.

c. Waits until the MAF contains no more Dstream references and the SQ, LQ, and

IOWB are empty.

When all of the above have occurred and a probe response has been sent to the system

for the marked probe queue entry, instruction execution continues with the

instruction after the MB .

The most important parts are highlighted – basically that text says “When a processor issues a memory barrier – it stalls instruction execution on the local processor, sends a message to all other processors letting them know to flush their caches because they now have invalid entries in them, flushes its own probe queue and continues executing instructions.”. Armed with that knowledge we can finally complete this long, weird, memory system - trivial pursuit sojourn. When T2() reads *p it is a 2 step process. First it loads the value of p – in this case the address of y -and then reads from that address. Because P2 has a cache entry for y – the read gets a cache hit. WAIT! T1() issued a MemoryBarrier() before setting p to hold the address of y, so how could P2 get a cache hit on y when T1() running on P1 just invalidated it?! Well – as it turns out – on the 21264 processor - local cache hits for reads are allowed to bypass the probe queue on the local processor! That’s right – the cache hit for the read of y on P2 bypasses the probe queue message telling P2 that y is not valid! Crazy-ness abounds!! So why would you ever design a processor this way? Because it is efficient, fast and beautiful - no blocking on cached reads. And any real programmer would know to put memory barriers before shared reads! Duh!! J

It would be an outright nightmare to program to this model. I have never had to do it – thank GOD! But assuming you did have to – how would you fix the code so that it worked correctly? Like this:

VOID

T2(

    VOID

    )

{

    ULONG i = 1;

    //

    // Read the contents of the memory location pointed

    // to by p and store its contents in i

    //

   

    MemoryBarrier();

    i = *p;

    ASSERT(i != 0);

}

The MemoryBarrier() in T2() forces P2 to drain its probe queue before continuing to execute instructions. This fixes the problem by not allowing the cache hit for the read to progress. This is also really weird because the memory barriers on this processor have local effects that are different that their global effects. And in some cases local memory barriers depend on remote barriers to complete their logical semantics. Like the local (local to P1) barrier – that depends on the remote barrier (local to P2 – but remote to P1) to complete the invalidate enforcement.

Well – I hope that I did a decent job explaining this problem and hope that it convinces you how hard it is to write correct lock-free code. Especially if it has to be portable to new processors that may have different memory barrier semantics than the processors you currently write code for. If I didn’t – or it didn’t – I hope it at least gave you something horrible and weird to think about for a few minutes. :D

Comments

  • Anonymous
    March 10, 2008
    PingBack from http://msdnrss.thecoderblogs.com/2008/03/11/the-joys-of-compiler-and-processor-reordering-why-you-technically-need-read-side-barriers/

  • Anonymous
    March 11, 2008
    The comment has been removed

  • Anonymous
    March 11, 2008
    From my reading of your explanation, adding that read barrier to T2 does not fix the problem.  Consider this sequence: T2 executes T2's memory barrier. P2 fetches y. T1 updates y and P1 finishes the update. T1 executes T1's memory barrier. T1 updates p and P1 finishes the update. T2 fetches p. T2 tries to fetch y but defectively designed CPU fetches its obsolete cached value. This CPU void where prohibited.  Cached value one-twentieth of one yen.

  • Anonymous
    March 11, 2008
    I don't see why P2 is fetching y in step 2. If it is a pre-fetch of *p it would fetch x. I am probably missing your point. Can you elaborate?

  • Anonymous
    March 11, 2008
    > I don't see why P2 is fetching y in step 2. It doesn't need a reason.  It's just a possibility.  The possibility is the same as it was in this part of the original posting:  "Let’s follow the flow of T2() (a very important piece of information is that P2 currently has y cached in its processor cache with the value 0)." > If it is a pre-fetch of *p It isn't a pre-fetch of *p.  It's a pre-fetch of y for unknown reasons.

  • Anonymous
    March 11, 2008
    Thanks for your elaborate explaination. While I perfectly understand your analysis and your solution, I think this example is somewhat different than the one you posted in an earlier post. In the earlier sample, we had in the reader:    if (f.Initialized) {        AVal = f.A;        // Use AVal here    } And I claimed that f.A can be retrieved even before the if statement is executed, and therefore can fetch stale data. Cannot this actually happen on modern CPUs like IA64 (And not on esoteric ones like the ALPHA)? Thanks, Eran.

  • Anonymous
    March 11, 2008
    > Can you elaborate? I think I did.  Can you undelete the comment where I think I posted an elaboration?

  • Anonymous
    March 12, 2008
    The comment has been removed

  • Anonymous
    March 12, 2008
    Eran - you might be correct on that fact. I am researching it now. From my reading of the IA64 memory model - the write side barrier should be enough - but the doc is very confusing and seems to potentially contradict itself in places. I'll reply back after a little more - re-reading. If you have specific text that leads you to believe you are right - please point me to it. Thanks!

  • Anonymous
    March 12, 2008
    The comment has been removed

  • Anonymous
    March 12, 2008
    To clarify, that CPU as described is broken in multiprocessor configurations.  It sounds like it would still be usable when it's the only CPU in a computer.

  • Anonymous
    March 23, 2008
    Hi, I searched the manual, but I must admit I got lost. And then it hits me that I might be searching in the wrong place... We are checking what the CPU does, but we forgot that there is another culprit: The compiler!!! The compiler can rearrange the commands to load f.A before checking the initialized flag. AFAK, the compiler can do it since it perceives a single-threaded program unless told otherwise (using volatile is one way of doing that). This can happen on any CPU supported by windows... This issue is very serious in asynchronous programming model when you deal with context objects (Like IRP in windows). One, must make sure that after it stores stuff in the context object, and there is a chance for the context to be grabbed and processed by other thread that a write-barrier is issued, and the other thread must (I think) issue a read barrier. Knowing how windows code works, I doubt that those barriers are issued. Most probably it works because there is usage of other primitives (Locks/Writing to device registers) that ensures barriers. Does that sounds right or am I over paranoid? Thanks, Eran.

  • Anonymous
    March 23, 2008
    "I must admit I got lost" Yeah I'll say.  This blog goes to eleven and yours was twelve.  Now I'm lost too, thirteen.  OK enough joking, let's lose count and continue. "The compiler can rearrange the commands to load f.A before checking the initialized flag. AFAK, the compiler can do it since it perceives a single-threaded program unless told otherwise (using volatile is one way of doing that)." Using volatile is one way of achieving that effect and it doesn't matter if the reason involves threading or not.  But I thought I read somewhere that Microsoft's compilers recognize Microsoft's MemoryBarrier instructions, so volatile wouldn't be needed in this particular case. [Regarding drivers] "Knowing how windows code works, I doubt that those barriers are issued. Most probably it works because there is usage of other primitives (Locks/Writing to device registers) that ensures barriers." From my understanding, barriers might still be used in coordinating CPUs (e.g. so that two CPUs don't think they obtained a lock on the same object).  But sure I don't think device controllers are expected to interact with memory barrier instructions.  From what Intel told me a few decades ago and my understanding of some manuals, I think that device controllers have to understand the effects of the LOCK prefix.

  • Anonymous
    March 24, 2008
    This is so exciting to find other people that love this low level stuff as much as me. The other day I was talking to this guy about reordering and he said "Dad - can you please just drive me to school now!?" :) >> The compiler can rearrange the commands to >> load f.A before checking the initialized >> flag. Norman is correct - the MS compilers treat any barrier intrinsic as a full optimization barrier. Also - this is a dependent load - the compiler will not reorder dependent loads. >> Regarding drivers Eran is correct - this is a huge problem! The kernel has to ensure that is uses proper barriers or it could publish something befoe it is initialized. Typically interlocked operations are used for pointer size data and locks are used for larger data. Also as Norman pointed out - for devices either locks are used or serialized instruction if available. For instance - the x86 has 'in' and 'out' that read from IO space and are always fully ordered by the processor.

  • Anonymous
    March 29, 2008
    The comment has been removed

  • Anonymous
    March 30, 2008
    "(My son usually likes to reorder my house and not my code...)" (You need to insert some barriers.) "every IRP eventuallly leads to someone writing to a device register" s/every/most/  ^^  especially if your device has been removed. s/eventuallly/eventuallly/  ~~  because you used the same spelling checker that Visual Studio used (it gives programmers an acccept button). "Regarding dependant loads" I'm trying to figure out what dependant load means.  A dependant barrier might mean this: 'And in some cases local memory barriers depend on remote barriers to complete their logical semantics.' because processors have to tell each other to invalidate their locally cached copies of values which are being changed (and non brain dead processors have to listen).  Does a dependant load mean that it depends on all preceding barriers having been executed properly?  Doesn't that mean that all loads are dependant? "I think that from the point of view of the compiler, no one accesses f.A (Again, the compiler assumes single threaded environment), and therefore it can think it is safe to pre-load it even before the "if"." I agree, both the compiler and the CPU can preload f.A until someone puts a barrier between the if and the load.  The C and C++ standards do not define any meanings for multithreaded programs.

  • Anonymous
    March 30, 2008
    Hi Norman, Thanks for your response. Forgive my english mistakes. Since it is not my natural/first language, I tend to include typos and grammar errors in everything I write. Best regards, Eran.

  • Anonymous
    March 31, 2008
    Even barriers don't help in my house! :) So a dependent load in the compiler is someting like: if (p) {   p->A; } It is illegal for the compiler to read p->A unless the compiler has already figured out that the "if (p)" test is true. This is from the point of view of a single thread of course. In our case however, "f" was already proven to be correct so I was wrong when I called it a dependent load - I had it in my mind that we were checking "f" not a field of "f". So you would need a barrier between them if you needed ordering semantics, and as Norman pointed out - that barrier would also be recognized as a compiler barrier as well. If the processor didn't allow such reordering the you would just need _ReadWriteBarrier to prevent the compiler from reordering. Thanks!

  • Anonymous
    March 31, 2008
    "Forgive my english mistakes. Since it is not my natural/first language" OK, then I have to revise one of my previous corrections. "every IRP eventuallly leads to someone writing to a device register" Previously I said: s/every/most/  ^^  especially if your device has been removed. A full correction is: Most IRPs eventually lead to someone writing to a device register. When you're programming, if Visual Studio offers you an acccept button, don't accept its word for it  ^^ Actually we see that computer instructions are your first/natural language, just as they are for me  ^_^ Now for the technical issue, "dependant load" in English or "dependent load" in American.  You didn't answer but you left it for the blog owner to confuse me further: 'It is illegal for the compiler to read p->A unless the compiler has already figured out that the "if (p)" test is true. This is from the point of view of a single thread of course.' No, even in the point of view of a single thread it is not illegal.  The standard notices that you didn't declare P to point to a volatile target.  A compiler with its extensions notices that there is no read barrier between the "if" test's expression and the subsequent fetch of p->A. 'So you would need a barrier between them if you needed ordering semantics,' I agree.  But this seems to contradict the preceding sentences, so I'm still wondering what you mean.

  • Anonymous
    March 31, 2008
    The comment has been removed

  • Anonymous
    April 01, 2008
    The comment has been removed

  • Anonymous
    April 01, 2008
    "They're coming to get - HA HA - HA HA!" - sorry about the paranoia. :) So - to really analyze this - you need to expand everything and then analyze it according the processor rules and compiler rules. If you can expand the code a bit we can try to analyze it. Thanks.

  • Anonymous
    April 01, 2008
    The comment has been removed

  • Anonymous
    April 01, 2008
    > BOOL bDoSomething; >  bDoSomething = /Compute according to   pContext fields/ >  startAsyncActivity(pContext); If startAsyncActivity starts another thread and the other thread accesses the same variable bDoSomething, then you'd better make it  volatile BOOL bDoSomething; Even just for plain old VC++6 MFC applications that was true.  Read some of Dr. Joseph Newcomer's essays at flounder.com.

  • Anonymous
    April 01, 2008
    The comment has been removed

  • Anonymous
    April 01, 2008
    The comment has been removed

  • Anonymous
    October 01, 2008
    The comment has been removed

  • Anonymous
    October 02, 2008
    BTW, i believe that T2() can be fixed with: {    int i;    register int *k;    k = p;    MemoryFence();    i = *k; } PS: The above (pseude)code for IA64 has 2 (pseudo)bugs: cmp.eq -> cmp.ne chk.s r33 -> chk.s r33, l1

  • Anonymous
    November 30, 2008
    Your description of the memory system on the alpha looks good, but I think you have an error in your solution -- a memory barrier inserted at the start of the code does not help.  I think this is what you need: VOID T2(VOID) {    ULONG i = 1;    ULONG* p2 = p;    MemoryBarrier();      i = *p2;    ASSERT(i != 0); } Similar problem in your previous post:    if (f.Initialized) {        MemoryBarrier();        AVal = f.A;