Larry gets taken to task on concurrency

In yesterday's blog post, I was taken to task, first by Bob Frankston and then by Raymond Chen, about comments I made, and I want to talk a bit about what they said.

Here are the relevant sections:

From Bob Frankston (yeah, the Bob Frankston, I'm honored):

I accidentally glanced at the comments and noticed the comment about 64 bit operations not being atomic. I guess I've been around too long and don't see any reason to assume any such operation is atomic. If the CLI doesn't make this promise then you can't assume it for any operation.
With HT processors we're going to see a lot of subtle bugs appear and they'll all be blamed on mysterious viruses or maybe ghosts left in the CPU from processes that died horrible deaths.
Seriously the current processors are the result of years of single threaded programming. Maybe we should make atomicity a well-defined part of the architecture instead of just assuming it because the program seems to work.

I responded:

Your point is actually a huge issue. On our currently supported 32bit processors, 32bit aligned memory operations are guaranteed (by the processor) to be atomic. However, in general, you're totally right - on 64bit processors this is explicitly NOT true - on those machines the rules on write ordering have been significantly relaxed, which can cause all sorts of issues.
[...]
Your comment about the ordering issue is why I suggested using the InterlockedExchangeAdd API - it asserts a memory barrier after the add to ensure that all (logical and physical) processors have a consistent view of the data.
Btw, the biggest likely item to fail in the presence of misordered writes is the Double Checked Lock Pattern - here's a great discussion of why it fails on modern processors: https://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
The problem is that there is a lot of existing code that uses it, which will start breaking in really subtle ways.
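For readers who haven't seen it, the double-checked locking pattern that the linked paper analyzes looks roughly like this in C (a minimal sketch of my own, not code from the paper; `Singleton` and `get_instance` are illustrative names):

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct { int value; } Singleton;

static Singleton *instance = NULL;
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

/* Classic double-checked locking.  The first check skips the lock on
 * the fast path; the bug is that on a processor with relaxed write
 * ordering, the store to 'instance' can become visible to another CPU
 * before the stores that initialize *instance, so a second thread can
 * see a non-NULL pointer to an uninitialized object. */
Singleton *get_instance(void) {
    if (instance == NULL) {              /* unlocked check (the racy part) */
        pthread_mutex_lock(&init_lock);
        if (instance == NULL) {          /* re-check under the lock */
            Singleton *p = malloc(sizeof *p);
            p->value = 42;               /* may become visible AFTER... */
            instance = p;                /* ...this pointer store does */
        }
        pthread_mutex_unlock(&init_lock);
    }
    return instance;
}
```

Single-threaded, the pattern works fine; the failure only shows up when a second processor observes the two stores out of order.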

And then Raymond chimed in (slightly edited):

"The problem with this is that 64bit math isn't atomic on 32bit processors."
Not even 32-bit math is atomic on 32-bit processors. Given
static int i = 0;
// Thread A
i += 2;
// Thread B
i += 3;
the final result might be i=2, or it could be i=3, or it could be i=5. If you want atomic arithmetic, you must use an Interlocked function.

Of course, Bob and Raymond are totally right. 

And this shows just how hard it is to understand concurrency issues.  In fact, both Raymond and I are correct in our assertions (he more than I, because his claim is more conservative than mine).  That's because we're talking about different aspects of the concurrency issue.  My comments (about 64-bit operations being atomic) have to do with the fact that a 32-bit processor has no way of accessing a 64-bit value atomically.  What that means is that when a 32-bit processor writes a 64-bit value, it does it in two memory operations.

And that means that between the two writes, another thread can come in and modify the value.  And that leads to corruption.  That can't happen with 32-bit values on current 32-bit processors, because they guarantee write ordering and the atomicity of aligned 32-bit writes.
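To make the torn-write failure concrete, here's a deterministic simulation of it in C: the union stands in for the two 32-bit stores a 32-bit processor would issue, and the observe callback stands in for a second thread being scheduled between them (a sketch of mine; it assumes a little-endian machine, like x86):

```c
#include <stdint.h>

/* Simulation of a "torn" 64-bit write on a 32-bit processor: the
 * write happens as two 32-bit stores, and a reader that runs between
 * them observes a value that was never actually written. */
typedef union {
    uint64_t whole;
    uint32_t half[2];   /* half[0] = low 32 bits on little-endian */
} Value64;

/* Write the low half, let the "reader" run, then write the high half.
 * Returns what the reader saw mid-write. */
uint64_t torn_write(Value64 *v, uint64_t new_value,
                    uint64_t (*observe)(const Value64 *)) {
    v->half[0] = (uint32_t)new_value;          /* first 32-bit store  */
    uint64_t seen = observe(v);                /* reader interleaves  */
    v->half[1] = (uint32_t)(new_value >> 32);  /* second 32-bit store */
    return seen;
}

static uint64_t read_whole(const Value64 *v) { return v->whole; }
```

Writing 0x100000001 over an initial 0 this way, the mid-write reader sees 1 - a value neither thread ever wrote on purpose.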

Raymond's comments have to do with memory accesses on machines with more than one CPU core.  His scenario can't happen on the currently supported single-core 32-bit architectures (it can on some of the 64-bit architectures, due to relaxed write ordering).  But if you've got more than one CPU core (either an HT machine or two distinct CPUs), then you can absolutely see it happen.

The reason for this is that most arithmetic (and logical) operations are read-alter-write operations.  They read a value from memory, perform some operation on it, and then write the value back.  If some other thread can come in during the read-alter-write operation, then any of the results that Raymond called out can occur.  For instance, in his example above, thread B reads the value of i (0) before thread A has written its result back; thread B then adds 3 to 0 and writes back 3, losing thread A's update.
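Here's Raymond's interleaving replayed deterministically, with the scheduler's role written out as explicit steps (a sketch of mine, not Raymond's code):

```c
/* Deterministic replay of the lost-update interleaving: both
 * "threads" read i before either writes it back, so one addition is
 * lost.  The step ordering below stands in for the scheduler. */
int lost_update(void) {
    int i = 0;

    int a = i;      /* thread A: read  (a == 0) */
    int b = i;      /* thread B: read  (b == 0) */
    a += 2;         /* thread A: alter */
    i = a;          /* thread A: write (i == 2) */
    b += 3;         /* thread B: alter */
    i = b;          /* thread B: write (i == 3; A's +2 is lost) */

    return i;
}
```

Reorder the steps and you can get 2 or 5 instead - exactly the three outcomes Raymond listed.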

Having said that, a variant of Raymond's scenario can fail even on single-core processors.  You see, the reason his code as written doesn't fail on a single-core processor is that he's adding a constant to a memory value, and the processor has an instruction for adding a constant value directly to a memory address:
        add [i], 3

But it doesn't take very many changes to the code to make it break on single-core machines:

static int i = 0;
int j = 2;
int k = 3;
// Thread A
i += j;
// Thread B
i += k;
 

Now Raymond's scenario will fail on a single-processor machine.  The reason is that performing the read-alter-write operation on two memory variables requires an explicit sequence of instructions:

        mov eax, [j]
        add [i], eax

If there's more than one thread modifying j, then BAM!™ we've got a problem.
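The fix is to make the read-alter-write a single atomic operation.  On Windows that's what InterlockedExchangeAdd does; as a sketch, here's the portable C11 equivalent (atomic_fetch_add, which compilers implement with a locked add on x86) surviving two genuinely concurrent threads:

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int i = 0;

/* Each thread performs 100,000 atomic read-alter-write additions of
 * its own increment; atomic_fetch_add plays the role that
 * InterlockedExchangeAdd plays in the Win32 API. */
static void *add_n(void *arg) {
    int n = *(int *)arg;
    for (int loop = 0; loop < 100000; loop++)
        atomic_fetch_add(&i, n);   /* atomic: no update can be lost */
    return NULL;
}

int run_threads(void) {
    int j = 2, k = 3;              /* the two per-thread increments */
    pthread_t a, b;
    pthread_create(&a, NULL, add_n, &j);
    pthread_create(&b, NULL, add_n, &k);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&i);        /* 100000*2 + 100000*3 = 500000 */
}
```

With plain `i += n` in the loop instead, the final total comes up short whenever the two threads' read-alter-write sequences interleave.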

So, when I wrote my original comment, I was thinking about the memory write case - relying on the fact that on 32-bit processors, a naturally aligned 32-bit memory read is guaranteed to be atomic (and for 64-bit processors, an aligned 64-bit memory read is similarly guaranteed to be atomic).  On currently supported 32-bit processors, however, a 64-bit memory read is NOT guaranteed to be atomic.

Please note that I'm calling out reads explicitly.  Aligned memory writes are guaranteed to be atomic on the current crop of 32-bit processors, but that is not the case for all of the currently supported 64-bit processors: they allow write misordering.  See the linked paper on the double-checked locking pattern above for more details on why this is a bad thing.
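As a sketch of why write misordering matters, here's the classic publish/consume shape in C11 atomics (my example, not from the post); the release/acquire pair plays the role of the memory barrier mentioned above:

```c
#include <stdatomic.h>

static int data = 0;            /* the payload being published   */
static atomic_int ready = 0;    /* the flag that publishes it    */

/* Writer: fill in the data, then set the flag.  Without the release
 * barrier, a processor with relaxed write ordering may make the
 * 'ready' store visible to other CPUs first, and a reader could see
 * ready==1 while 'data' still holds its stale value. */
void publish(int value) {
    data = value;
    /* release: all earlier stores become visible before 'ready' does */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: if the acquire load sees ready==1, it is also guaranteed
 * to see the store to 'data' that preceded it. */
int consume(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire))
        return data;
    return -1;   /* not published yet */
}
```

InterlockedExchangeAdd gets you the same guarantee by asserting a full barrier around the operation.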

I'm going to spend a bit of time over the next few days talking about concurrency and the issues associated with concurrency.  The good news is that with a relatively small set of common sense rules, most of the issues I talked about become irrelevant (unless you're looking to squeeze your scalability to the absolute highest levels).

Comments

  • Anonymous
    February 11, 2005
    You can do some interlocked 64-bit operations on x86 (see: lock cmpxchg8b).
  • Anonymous
    February 11, 2005
    The comment has been removed
  • Anonymous
    February 11, 2005
    The comment has been removed
  • Anonymous
    February 11, 2005
    I'm pretty sure the 32-bit PowerPC architecture doesn't guarantee that 32-bit writes are in-order. I know we're on an x86-family-centric blog, but it may be worth noting, 'cause you do have Linux and Mac readers.
  • Anonymous
    February 11, 2005
Chris, you may very well be right - the set of Windows-supported 32-bit processors (MIPS R8000, PPC, Alpha, x86) all enforce strict write ordering semantics.

From what I understand (and I'm FAR from an expert here, it's WAY out of my area of expertise), before Windows can run on a processor, there are some important requirements that the processor has to satisfy - for instance, a processor has to be able to map a single physical address to more than one virtual address; there have been processors that didn't support this and thus couldn't run Windows. The processor also has to support memory protection.

    There are other requirements, like the aligned memory read operation atomicity requirement, and (until lately) the write ordering requirement (which was relaxed (I believe) for Server 2003).

    Current PPC architectures may very well have relaxed the write ordering restriction, but the PPC that Windows supported back in the mid 90's didn't.

It's certainly true that current trends in processor design indicate that, in the future, the memory ordering assumptions that used to be valid will become less and less relevant.
  • Anonymous
    February 11, 2005
    This is going to sound negative, but I worry about developers out there on crushing budgets that won't have time/resources to test on multi-CPU when they write multi-thread apps. With CPU speed topping out, and multi-core the new wave, they will get pressure to multi-thread their app, and I (the customer) will be the one to "catch" it. I was relieved in the Win16 days when the CPU would not be switched until my message was done (except for drivers). Concurrent apps are an advanced coding feature; as such, maybe not for everyone.
  • Anonymous
    February 11, 2005
    James,
    That IS an issue, I believe it's the heart of Bob Frankston's comment. The good news is that hyperthreaded machines cost well under a thousand dollars, so hopefully that's not a huge issue.

    And the better news is that if you're careful upfront, and you don't push the envelope, you're likely to be ok.
  • Anonymous
    February 11, 2005
    The comment has been removed
  • Anonymous
    February 11, 2005
    John, if the alpha didn't ensure write ordering, how did we get around the DCLP problem? Or were we just lucky?
  • Anonymous
    February 11, 2005
    Even if you use a single instruction, you still have problems.

    add [i], 3

    is not atomic. It's a read-modify-write instruction, which is (internally by the processor) broken into a load, an add, and a store. If you want it to be atomic, you have to lock the bus:

    lock
    add [i], 3

    Else, another processor can change the value after you read it but before you can write the new value. If two processors try to do it at the same time, you can get either a 6 or a 3 increment, depending on the exact timing.

    (I do not recall if the lock prefix is allowed in userspace, but xchg has an implicit lock, and newer processors have other instructions like cmpxchg to help in these situations)

    By the way, that's why there is no real problem (besides performance) if the compiler uses 3 instructions instead of one; the meaning of the code stays the same.
  • Anonymous
    February 11, 2005
    Cesar, but on a single CPU core machine, even though it's internally broken up, there is still only one instruction being executed by the processor at a time. That's a CRITICAL caveat to my comments above.

    This may change in the future, and certainly isn't the case for machines with more than one core, but...

  • Anonymous
    February 11, 2005
    Ah, I see (sorry, didn't notice you meant by single core also single processor; the different meanings of "core" got me).

    But still, it explains why the compiler can break up the instruction (since it would break anyway with more than one core/cpu).
  • Anonymous
    February 11, 2005
    Could be - but I'd be surprised if the compiler would intentionally write bad code to save a developer some heartburn.

    But then again, I don't write compilers :)

    And, having said that, it doesn't matter - the i += 2 sequence isn't correctly protected, regardless.
  • Anonymous
    February 11, 2005
    The comment has been removed
  • Anonymous
    February 11, 2005
    Nicholas, that's true (as I mentioned they were internally broken up).

I don't have a Pentium manual on hand (I have an 8086, 186, 286, 386, 486 and i860 manual, but not a Pentium), but I'm pretty sure that the manual guarantees atomicity of instructions, even read/alter/write instructions like INC and ADD, if their operands are naturally aligned. Unfortunately, the newest manual I have with detailed information is the 386 manual, and it's essentially useless.

    Clearly in the future that restriction IS going to be relaxed, but...
  • Anonymous
    February 11, 2005
    I looked at the guaranteed atomic operations in the IA-32 System Programming Guide and they all appeared to be aligned read or write instructions plus a few unaligned accesses that fit certain bus and cache constraints. Instructions such as INC and ADD had to use a LOCK prefix if they were modifying a memory location.

    Interesting was that even LOCK INC is not atomic in some cases. Instruction fetch can pass a lock so you can get strange effects if you (or other processors) are executing the code that you're modifying.
  • Anonymous
    February 11, 2005
    Nicholas, the key thing is that the scenarios you describe all involve more than one core.

    If the system was able to mis-order operations, then:
    ....mov [i], 0
    ....inc [i]
    ....inc [i]

would not guarantee that i was 2. It's entirely possible that on the bus, the system turns this into a single memory write to the address of i, but as far as the code running on the processor is concerned, it's coherent.

    The memory re-ordering semantics absolutely apply for multi-core scenarios, but not for single core.

    But, as I (and others) have said, it doesn't matter - the multi-core/multi-cpu scenarios are real, and are going to become more and more common in the future.

    And the code's broken for those scenarios and thus needs to be fixed.
  • Anonymous
    February 11, 2005
    The single-core, single-thread situation on IA-32 requires atomic instructions if you think about it, because IA-32 uses instruction restart to handle faults. An interrupted ADD MEM,REG thus has to either leave the memory location intact or write the final result, or else there is no way to correctly restart. On a 68020 you'd be in trouble because those CPUs support instruction continuation by dumping internal state to the supervisor stack -- they could in theory suspend an ADD.L (A0),D0 between the load and the store!

    By the way, while IA-64 and PowerPC have weaker memory ordering than IA-32, AMD64 (x64) still has processor ordering, right? I'd think that if the memory rules were weaker that porting from IA-32 would be harder, which is one of AMD64's strengths.
  • Anonymous
    February 13, 2005
    Here's an interesting paper by Simon Peyton Jones (of Microsoft Research) on composable memory transactions...
    http://research.microsoft.com/Users/simonpj/papers/stm/
  • Anonymous
    February 13, 2005
    John Vert beat me to it, but I'll ask anyway. Do you even have any assurance that a 32-bit memory write operation will be atomic on a 64-bit system? It looks to me like it would be a read-modify-write operation.
  • Anonymous
    February 13, 2005
I find it odd that most C code snippets don't declare variables as volatile. With volatile set, the compiler should not keep the value in registers, but should read from and write to memory each time.
Without volatile, even:

    // thread 1
    while(bSomeBool) ..

    // thread 2
    .. bSomeBool = false

will eventually fail. Note that with volatile this works on multiproc, because one thread only stores and the other only loads :) But bSomeBool must be declared volatile; otherwise thread 1 could keep the bool in some register and never read it back from memory again.
    ---
BTW, while hyperthreaded machines will put in evidence some concurrency problems, they still may mask some relative to multi-core or, better, multiprocessor machines. For example, some instructions can be atomic on an HT processor, or at least "very-unlikely-to-cause-concurrency-problems," while being a disaster on real multiproc - this is because HT processors share many parts of the core between the two virtual CPUs, so there are a number of implicit rendezvous points in the architecture that can mask the concurrency problems.

So: test on real multiprocessor systems. And - above all - try to analyze all possible situations beforehand, because most concurrency bugs are detected only by sheer luck.
  • Anonymous
    February 13, 2005
    On Alpha MB/WMB gives you the tools to get around the DCLP problem.

    Also, I seem to vaguely recall that it was about the time of the NT port that PALcode level support of byte reads and writes happened.

    I think that there is a more worrying issue here: I guess I was "lucky" to have been bitten by the Alpha word tearing a long time ago. It isn't a lesson you need to learn more than once.

But a lot of people don't have these scars; worse still, many environments make parallelism seem deceptively simple (despite, as John says, it being an advanced feature). Worst of all, these problems are unlikely to occur unless you go looking for them in a stress-test lab (or by careful code review). Commercial pressures are such that many application vendors are unlikely to have the resources to do this (even if they know about the problem).