แชร์ผ่าน


Memory Model

One of the suggestions for a blog entry was the managed memory model. This is timely, because we’ve just been revising our overall approach to this confusing topic. For the most part, I write about product decisions that have already been made and shipped. In this note, I’m talking about future directions. Be skeptical.

So what is a memory model? It’s the abstraction that makes the reality of today’s exotic hardware comprehensible to software developers.

The reality of hardware is that CPUs are renaming registers, performing speculative and out-of-order execution, and fixing up the world during retirement. Memory state is cached at various levels in the system (L0 thru L3 on modern X86 boxes, presumably with more levels on the way). Some levels of cache are shared between particular CPUs but not others. For example, L0 is typically per-CPU but a hyper-threaded CPU may share L0 between the logical CPUs of a single physical CPU. Or an 8-way box may split the system into two hemispheres with cache controllers performing an elaborate coherency protocol between these separate hemispheres. If you consider caching effects, at some level all MP (multi-processor) computers are NUMA (non-uniform memory access). But there’s enough magic going on that even a Unisys 32-way can generally be considered as UMA by developers.

It’s reasonable for the CLR to know as much as possible about the cache architecture of your hardware so that it can exploit any imbalances. For example, the developers on our performance team have experimented with a scalable rendezvous for phases of the GC. The idea was that each CPU establishes a rendezvous with the CPU that is “closest” to it in distance in the cache hierarchy, and then one of this pair cascades up a tree to its closest neighbor until we reach a single root CPU. At that point, the rendezvous is complete. I think the jury is still out on this particular technique, but they have found some other techniques that really pay off on the larger systems.

Of course, it’s absolutely unreasonable for any managed developer (or 99.99% of unmanaged developers) to ever concern themselves with these imbalances. Instead, software developers want to treat all computers as equivalent. For managed developers, the CLR is the computer and it better work consistently regardless of the underlying machine

Although managed developers shouldn’t know the difference between a 4-way AMD server and an Intel P4 hyper-threaded dual proc, they still need to face the realities of today’s hardware. Today, I think the penalty of a CPU cache miss that goes all the way to main memory is about 1/10th the penalty of a memory miss that goes all the way to disk. And the trend is clear.

If you wanted good performance on a virtual memory system, you’ve always been responsible for relieving the paging system by getting good page density and locality in your data structures and access patterns.

In a similar vein, if you want good performance on today’s hardware, where accessing main memory is a small disaster, you must pack your data into cache lines and limit indirections. If you are building shared data structures, consider separating any data that’s subject to false sharing.

To some extent, the CLR can help you here. On MP machines, we use lock-free allocators which (statistically) guarantee locality for each thread’s allocations. Any compaction will (statistically) preserve that locality. Looking into the very far future – perhaps after our sun explodes – you could imagine a CLR that can reorganize your data structures to achieve even better performance.

This means that if you are writing single-threaded managed code to process a server request, and if you can avoid writing to any shared state, you are probably going to be pretty scalable without even trying.

Getting back to memory models, what is the abstraction that will make sense of current hardware? It’s a simplifying model where all the cache levels disappear. We pretend that all the CPUs are attached to a single shared memory. Now we just need to know whether all the CPUs see the same state in that memory, or if it’s possible for some of them to see reordering in the loads and stores that occur on other CPUs.

At one extreme, we have a world where all the CPUs see a single consistent memory. All the loads and stores expressed in programs are performed in a serialized manner and nobody perceives a particular thread’s loads or stores being reordered. That’s a wonderfully sane model which is easy for software developers to comprehend and program to. Unfortunately, it is far too slow and non-scalable. Nobody builds this.

At the other extreme, we have a world where CPUs operate almost entirely out of private cache. If another CPU ever sees anything my CPU is doing, it’s a total accident of timing. Because loads and stores can propagate to other CPUs in any random order, performance and scaling are great. But it is impossible for humans to program to this model.

In between those extremes are a lot of different possibilities. Those possibilities are explained in terms of acquire and release semantics:

  • A normal load or store can be freely reordered with respect to other normal load or store operations.
  • A load with acquire semantics creates a downwards fence. This means that normal loads and stores can be moved down past the load.acquire, but nothing can be moved to above the load.acquire.
  • A store with release semantics creates an upwards fence. This means that normal loads and stores can be moved above the store.release, but nothing can be moved to below the store.release.
  • A full fence is effectively an upwards and downwards fence. Nothing can move in either direction across a full fence.

A super-strong extreme model puts a full fence after every load or store. A super-weak extreme model uses normal loads and stores everywhere, with no fencing.

The most familiar model is X86. It’s a relatively strong model. Stores are never reordered with respect to other stores. But, in the absence of data dependence, loads can be reordered with respect to other loads and stores. Many X86 developers don’t realize that this reordering is possible, though it can lead to some nasty failures under stress on big MP machines.

In terms of the above, the memory model for X86 can be described as:

  1. All stores are actually store.release.
  2. All loads are normal loads.
  3. Any use of the LOCK prefix (e.g. ‘LOCK CMPXCHG’ or ‘LOCK INC’) creates a full fence.

Historically, Windows NT has run on Alpha and MIPS computers.

Looking forwards, Microsoft has announced that Windows will support Intel’s IA64 and AMD’s AMD64 processors. Eventually, we need to port the CLR to wherever Windows runs. You can draw an obvious conclusion from these facts.

AMD64 has the same memory model as X86.

IA64 specifies a weaker memory model than X86. Specifically, all loads and stores are normal loads and stores. The application must use special ld.acq and st.rel instructions to achieve acquire and release semantics. There’s also a full fence instruction, though I can’t remember the opcode (mf?).

Be especially skeptical when you read the next paragraph:

There’s some reason to believe that current IA64 hardware actually implements a stronger model than is specified. Based on informed hearsay and lots of experimental evidence, it looks like normal store instructions on current IA64 hardware are retired in order with release semantics.

If this is indeed the case, why would Intel specify something weaker than what they have built? Presumably they would do this to leave the door open for a weaker (i.e. faster and more scalable) implementation in the future.

In fact, the CLR has done exactly the same thing. Section 12.6 of Partition I of the ECMA CLI specification explains our memory model. This explains the alignment rules, byte ordering, the atomicity of loads and stores, volatile semantics, locking behavior, etc. According to that specification, an application must use volatile loads and volatile stores to achieve acquire and release semantics. Normal loads and stores can be freely reordered, as seen by other CPUs.

What is the practical implication of this? Consider the standard double-locking protocol:

if (a == null)

{

  lock(obj)

  {

    if (a == null) a = new A();

  }

}

This is a common technique for avoiding a lock on the read of ‘a’ in the typical case. It works just fine on X86. But it would be broken by a legal but weak implementation of the ECMA CLI spec. It’s true that, according to the ECMA spec, acquiring a lock has acquire semantics and releasing a lock has release semantics.

However, we have to assume that a series of stores have taken place during construction of ‘a’. Those stores can be arbitrarily reordered, including the possibility of delaying them until after the publishing store which assigns the new object to ‘a’. At that point, there is a small window before the store.release implied by leaving the lock. Inside that window, other CPUs can navigate through the reference ‘a’ and see a partially constructed instance.

We could fix this code in various ways. For example, we could insert a memory barrier of some sort after construction and before assignment to ‘a’. Or – if construction of ‘a’ has no side effects – we could move the assignment outside the lock, and use an Interlocked.CompareExchange to ensure that assignment only happens once. The GC would collect any extra ‘A’ instances created by this race.

I hope that this example has convinced you that you don’t want to try writing reliable code against the documented CLI model.

I wrote a fair amount of “clever” lock-free thread-safe code in version 1 of the CLR. This included techniques like lock-free synchronization between the class loader, the prestub (which traps first calls on methods so it can generate code for them), and AppDomain unloading so that I could back-patch MethodTable slots efficiently. But I have no desire to write any kind of code on a system that’s as weak as the ECMA CLI spec.

Even if I tried to write code that is robust under that memory model, I have no hardware that I could test it on. X86, AMD64 and (presumably) IA64 are stronger than what we specified.

In my opinion, we screwed up when we specified the ECMA memory model. That model is unreasonable because:

  • All stores to shared memory really require a volatile prefix.
  • This is not a productive way to code.
  • Developers will often make mistakes as they follow this onerous discipline.
  • These mistakes cannot be discovered through testing, because the hardware is too strong.

So what would make a sensible memory model for the CLR?

Well, first we would want to have a consistent model across all CLI implementations. This would include the CLR, Rotor, the Compact Frameworks, SPOT, and – ideally – non-Microsoft implementations like Mono. So putting a common memory model into an ECMA spec was definitely a good idea.

It goes without saying that this model should be consistent across all possible CPUs. We’re in big trouble if everyone is testing on X86 but then deploying on Alpha (which had a notoriously weak model).

We would also want to have a consistent model between the native code generator (JIT or NGEN) and the CPU. It doesn’t make sense to constrain the JIT or NGEN to order stores, but then allow the CPU to reorder those stores. Or vice versa.

Ideally, the IL generator would also follow the same model. In other words, your C# compiler should be allowed to reorder whatever the native code generator and CPU are allowed to reorder. There’s some debate whether the converse is true. Arguably, it is okay for an IL generator to apply more aggressive optimizations than the native code generator and CPU are permitted, because IL generation occurs on the developer’s box and is subject to testing.

Ultimately, that last point is a language decision rather than a CLR decision. Some IL generators, like ILASM, will rigorously emit IL in the sequence specified by the source code. Other IL generators, like Managed C++, might pursue aggressive reordering based on their own language rules and compiler optimization switches. If I had to guess, IL generators like the Microsoft compilers for C# and VB.NET would decide to respect the CLR’s memory model.

We’ve spent a lot of time thinking about what the correct memory model for the CLR should be. If I had to guess, we’re going to switch from the ECMA model to the following model. I think that we will try to persuade other CLI implementations to adopt this same model, and that we will try to change the ECMA specification to reflect this.

  1. Memory ordering only applies to locations which can be globally visible or locations that are marked volatile. Any locals that are not address exposed can be optimized without using memory ordering as a constraint since these locations cannot be touched by multiple threads in parallel.
  2. Non-volatile loads can be reordered freely.
  3. Every store (regardless of volatile marking) is considered a release.
  4. Volatile loads are considered acquire.
  5. Device oriented software may need special programmer care. Volatile stores are still required for any access of device memory. This is typically not a concern for the managed developer.

If you’re thinking this looks an awful lot like X86, AMD64 and (presumably) IA64, you are right. We also think it hits the sweet spots for compilers. Reordering loads is much more important for enabling optimizations than reordering stores.

So what happens in 10 years when these architectures are gone and we’re all using futuristic Starbucks computers with an ultra-weak model? Well, hopefully I’ll be living the good life in retirement on Maui. But the CLR’s native code generators will generate whatever instructions are necessary to keep stores ordered when executing your existing programs. Obviously this will sacrifice some performance.

The trade-off between developer productivity and computer performance is really an economic one. If there’s sufficient incentive to write code to a weak memory model so it can execute efficiently on future computers, then developers will do so. At that point, we will allow them to mark their assemblies (or individual methods) to indicate that they are “weak model clean”. This will permit the native code generator to emit normal stores rather than store.release instructions. You’ll be able to achieve high performance on weak machines, but this will always be “opt in”. And we won’t build this capability until there’s a real demand for it.

I personally believe that for mainstream computing, weak memory models will never catch on with human developers. Human productivity and software reliability are more important than the increment of performance and scaling these models provide.

Finally, I think the person asking about memory models was really interested in where he
should use volatile and fences in his code. Here’s my advice:

  • Use managed locks like Monitor.Enter (C# lock / VB.NET synclock) for synchronization, except where performance really requires you to be “clever”.
  • When you’re being “clever”, assume the relatively strong model I described above. Only loads are subject to re-ordering.
  • If you have more than a few places that you are using volatile, you’re probably being too clever. Consider backing off and using managed locks instead.
  • Realize that synchronization is expensive. The full fence implied by Interlocked.Increment can be many 100’s of cycles on modern hardware. That penalty may continue to grow, in relative terms.
  • Consider locality and caching effects like hot spots due to false sharing.
  • Stress test for days with the biggest MP box you can get your hands on.
  • Take everything I said with a grain of salt.

Comments

  • Anonymous
    May 17, 2003
    Chris you're my hero ;-) I believe that specifying a stronger memory model is absolutely the right thing to do. The disconnect between the memory model subtleties and the relatively highlevel worldview that the CLR otherwise provides is just too big, especially since this doesn't even result in better performance on today's (and probably tomorrow's) mainstream hardware.
  • Anonymous
    May 18, 2003
    There is one common case when I would like the full speed benefit of stores being reordered. This is when I have code that needs to be fast and ALL objects the method use are only accessed from one thread. E.g. think about inverting a matrix, I may call some other methods from within my method and then the CLR will not know that the array is only accessed by a single thread.MAYBE what I am asking for is to be able to mark an object instance as being only used by a single thread, and have the CLR give me an error if the object is used by another when running in debug mode.Remember for most people, a single thread on a single CPU is the normal case, It is only a few low level types like us that every have to use more then one thread.Is this a new version of the C/Fortran arrays rules, e.g. due to the fact that array can not overlap in Fortran, some maths code is a LOT faster in Fortran.
  • Anonymous
    May 18, 2003
    Ian,Presumably you are talking about store reordering performed by the native code generator. The IL generator can follow its language rules in reordering and the CPUs that we care about won't reorder anyway.We would have to enable store reordering in the native generator at the assembly, type or method level. We don't have a good way to enable this per-instance.I'm not opposed to our adding this relaxation in the future. But I'm not convinced that the performance benefits will justify the risk and extra programming effort for anyone. Finalization, subtyping, resurrection, statics and many other constructs often leave you exposed to multi-threaded access, even in the "single-threaded" server request case.You would have to be absolutely sure that none of your instances could escape before you would be able to safely enable this behavior.
  • Anonymous
    May 18, 2003
    Did you really mean it when you said that hitting main memory is only 1/10th the speed of hitting disk? I find this extremely hard to believe. While I'm totally prepared to believe that the relative cost of a L0 cache hit, and a main memory hit is several orders of magnitude, I can't believe that there is only a single order of magnitude difference between main memory and disk.Disk takes milliseconds to find and load data. Main memory looks pretty slow from the inside of a 3GHz CPU, but it's not going to take milliseconds is it?Or is there something I'm missing?
  • Anonymous
    May 18, 2003
    The comment has been removed
  • Anonymous
    May 19, 2003
    The comment has been removed
  • Anonymous
    May 19, 2003
    The comment has been removed
  • Anonymous
    May 19, 2003
    I'm not sure what the official word is here. It seems to me that each of the various Interlocked operations is a combination of a load.acquire and a store.release. The combination of these two operations would effectively form a full fence. If the official word seems to be heading in a different direction, I'll post here to warn you.
  • Anonymous
    May 28, 2003
    The comment has been removed
  • Anonymous
    May 29, 2003
    The comment has been removed
  • Anonymous
    June 23, 2003
    The comment has been removed
  • Anonymous
    June 24, 2003
    The comment has been removed
  • Anonymous
    June 30, 2003
    The comment has been removed
  • Anonymous
    July 19, 2003
    The comment has been removed
  • Anonymous
    July 22, 2003
    I believe the problem is not the memory model, but the way it is explained. The explanations usually concentrate on the detailed reordering rules, and don't give the programmer a simple rule of thumb on where to put "volatile".My article "Memory Consistency & .NET" in Dr. Dobb's Journal, Apr2003, Vol. 28 Issue 4, p46-50 gives a simple rule of thumb. When using shared memory, you are effectively sending a message from one thread to another. The last write of the message by the sender, and the first read of the message by the receiver, must both be volatile. That simple rule covers most cases in my experience.The double-check idiom is simple to write in CLI - just declare "a" volatile in the example and it is fixed. This follows immediately from the rule of thumb: the first thread through the double-check region is the sender; the other threads are the receivers.A typical JScript or VB.NET developer should use locks, and not conjure their own synchronization primitives. In that case, the locks in CLI already imply the necessary fences.
  • Anonymous
    September 13, 2003
    Hi All,I am a VB6 application developer and beginner to C#/.NET, what I am wondering about is why we need the CLR in the first place, is it like 90% for garbage collection only? I mean is, is the CLR mainly there to give us relief from allocation & deallocation of heap memory, and preventing from 'illegal memory' access like reading outside array bounds etc.
  • Anonymous
    June 07, 2004
    The comment has been removed
  • Anonymous
    June 10, 2004
    Does this spin version work? Why or why not? Cheers!

    public sealed class Singleton
    {
    private static int spinLock = 0; // lock not owned.
    private static Singleton value = null;
    private Singleton() {}

    public static Singleton Value()
    {
    // Get spin lock.
    while ( Interlocked.Exchange(ref spinLock, 1) != 0 )
    Thread.Sleep(0);

    // Do we have any mbarrier issues?
    if ( value == null )

    value = new Singleton();


    Interlocked.Exchange(ref spinLock, 0);
    return value;
    }
    }

    This would help answer a few related questions for me on how Interlocked works with mem barriers and cache, etc. TIA -- William
  • Anonymous
    June 30, 2004
    The comment has been removed
  • Anonymous
    August 21, 2007
    PingBack from http://mcobrien.org/wordpress/2005/09/pdc05_doubleche
  • Anonymous
    November 26, 2007
    In this article , Vance Morrison describes some of the issues involved in writing managed multithreaded
  • Anonymous
    November 26, 2007
    In this article , Vance Morrison describes some of the issues involved in writing managed multithreaded
  • Anonymous
    March 19, 2008
    PingBack from http://desktopcomputerreviewsblog.info/cbrummes-weblog-memory-model/
  • Anonymous
    March 23, 2008
    PingBack from http://jimmy69.wordpress.com/2008/03/23/memory-model/
  • Anonymous
    January 21, 2009
    PingBack from http://www.keyongtech.com/432159-the-interlocked-on-the-edge/2
  • Anonymous
    May 31, 2009
    PingBack from http://woodtvstand.info/story.php?id=9156
  • Anonymous
    June 01, 2009
    PingBack from http://portablegreenhousesite.info/story.php?id=9706
  • Anonymous
    June 01, 2009
    PingBack from http://uniformstores.info/story.php?id=14136
  • Anonymous
    June 08, 2009
    PingBack from http://insomniacuresite.info/story.php?id=8705
  • Anonymous
    June 08, 2009
    PingBack from http://cellulitecreamsite.info/story.php?id=7827
  • Anonymous
    June 16, 2009
    PingBack from http://fixmycrediteasily.info/story.php?id=14117
  • Anonymous
    June 16, 2009
    PingBack from http://lowcostcarinsurances.info/story.php?id=5116
  • Anonymous
    June 18, 2009
    PingBack from http://gardendecordesign.info/story.php?id=4908
  • Anonymous
    June 18, 2009
    PingBack from http://fancyporchswing.info/story.php?id=2804