Atomic Operations in C++ AMP

Hi, I’m Steve Deitz, a developer on the C++ AMP team, and in this post, I’ll review the atomic operations available in C++ AMP, mention some important caveats, and show a simple example.

Atomic operations are the only safe and correct way to communicate and synchronize between the threads in a non-tiled parallel_for_each and between the threads in different tiles of a tiled parallel_for_each.  Note that threads within the same tile of a tiled parallel_for_each can also communicate and synchronize using tile_static memory and calls to tiled_index::barrier.wait .

Let’s take a look at a list of these atomic operations.

A list of atomic operations

C++ AMP provides atomic operations for addition, subtraction, increment, decrement, exchange, compare-and-exchange, maximum, minimum, and, or, and xor.  These are defined in amp.h as follows:

int atomic_fetch_add(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_add(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic addition of val to the memory location pointed to by dest , and returns the original value at the memory location pointed to by dest.

int atomic_fetch_sub(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_sub(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic subtraction of val to the memory location pointed to by dest , and returns the original value at the memory location pointed to by dest.

int atomic_fetch_inc(int * dest) restrict(amp)
unsigned int atomic_fetch_inc(unsigned int * dest) restrict(amp)

Performs an atomic increment to the memory location pointed to by dest, and returns the original value at the memory location pointed to by dest.

int atomic_fetch_dec(int * dest) restrict(amp)
unsigned int atomic_fetch_dec(unsigned int * dest) restrict(amp)

Performs an atomic decrement to the memory location pointed to by dest, and returns the original value at the memory location pointed to by dest.

int atomic_exchange(int * dest, int val) restrict(amp)
unsigned int atomic_exchange(unsigned int * dest, unsigned int val) restrict(amp)
float atomic_exchange(float * dest, float val) restrict(amp)

Sets the value at the memory location pointed to by dest to val as an atomic operation, and returns the original value at the memory location pointed to by dest.

bool atomic_compare_exchange(int * dest, int * expected, int val) restrict(amp)
bool atomic_compare_exchange(unsigned int * dest, unsigned int * expected, unsigned int val) restrict(amp)

Performs an atomic compare-and-exchange of val to the memory location pointed to by dest, storing val at the memory location only if *expected matches the value at the memory location. Returns true if the operation is successful, false otherwise. If the operation is unsuccessful, *expected is set to the value pointed to by dest.

int atomic_fetch_max(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_max(unsigned int * dest, unsigned int val) restrict(amp)

Atomically computes the maximum of val and the value at the memory location pointed to by dest, storing the maximum into the memory location, and returns the original value at the memory location pointed to by dest.

int atomic_fetch_min(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_min(unsigned int * dest, unsigned int val) restrict(amp)

Atomically computes the minimum of val and the value at the memory location pointed to by dest, storing the minimum into the memory location, and returns the original value at the memory location pointed to by dest.

int atomic_fetch_and(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_and(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic bitwise-and operation of val to the memory location pointed to by dest , and returns the original value of the memory location pointed to by dest .

int atomic_fetch_or(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_or(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic bitwise-or operation of val to the memory location pointed to by dest , and returns the original value of the memory location pointed to by dest .

int atomic_fetch_xor(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_xor(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic bitwise-xor operation of val to the memory location pointed to by dest , and returns the original value of the memory location pointed to by dest .

Caveats

There are two important points to keep in mind when using C++ AMP’s atomic operations.  Fortunately, while these limitations are subtle, they shouldn’t affect most code.  They follow from the semantics of DirectX and the design of GPU hardware.  If you think that you’ve got a kernel that could benefit were these limitations somehow removed, please let us know.  We may be able to help you redesign your kernel, and we would be curious to see such examples.

  1. A normal read may not see the results of an atomic write to a particular memory location. Moreover, atomic operations should not be mixed with normal writes to the same memory location. This is sometimes referred to as weak atomicity.
  2. Unlike interlocked operations in C++, the above-listed atomic operations do not imply a memory fence of any kind.  Even the order of atomic operations on distinct variables may be rearranged. For example, if one thread atomically increments X and then atomically increments Y, it is possible for another thread to observe those increments in reverse order.

We will provide additional information about the finer details of the C++ AMP memory model in upcoming blog posts.

Example: Counting Exceptional Occurrences

Suppose you want to count the number of threads that run into an exceptional occurrence of some kind. This is similar to a reduction, but because there are presumably going to be very few occurrences, it may be better to simply use the atomic increment function.  Such a code may look as follows:

int numExceptionalOccurrences;
array_view<int> count(1, &numExceptionalOccurrences);
parallel_for_each(/* compute domain */, [=] (index<1> idx) restrict(amp)
    {
        if (/* exceptional occurrence */)
        {
            atomic_fetch_inc(&count(0));
        }
    });

Note the use of an array_view to write the number of exceptional occurrences to memory.  You must use either an array or an array_view to communicate results to the CPU, even when the results consist of just one item like in the example above.

Closing Thoughts

Atomic operations are a powerful abstraction and an important programming tool for getting certain kernels to perform well on GPUs.  C++ AMP exposes these operations through a simple interface so you can easily use them as necessary.

Comments

  • Anonymous
    January 05, 2012
    > The semantics will be similar except that if the exchange fails, the parameter expected will be updated to contain the value pointed to by dest.  In addition, there will be an overload for values of float type. Why is that better than: int cmpxchg(int* destination, int expected, int new_value) {  int old_value = *destination;  if(old_value == expected) {    *destination = new_value;  }  return old_value; } which avoids clobbering the caller's expected value.

  • Anonymous
    January 05, 2012
    DrPizza, good question.  The main reason we defined the interface as above is to be consistent with the atomic operations library in the new C++ standard: www.open-std.org/.../wg21.  If you search for 'atomic_compare' in the posted n3242.pdf, you'll see a similar interface. I can't think of any advantage over the function you've written, and I've seen your function supplied in other languages/libraries.  I'll see if I can find out why the C++ standards committee wrote it that way...

  • Anonymous
    January 05, 2012
    Hrm, I suppose it kind of makes sense to align with C++. I suppose ultimately it doesn't matter too much. There are algorithms where it's useful to call cmpxchg speculatively (as "helpers" to assist some other thread in making progress, for example) where you don't care about success or failure, but even then, clobbering the expected value is probably OK, because you probably aren't going to use it again anyway. It just seems a bit strange to me; why force it to be clobbered when it's really not necessary? Maybe they just preferred to be able to write if(cas(d, e, n)) instead of if(e == cas(d, e, n)).

  • Anonymous
    January 05, 2012
    Hi there Dr. Pizza, I suspect the motivation for the style that was adopted by the standard is to be able to allow for "spurious failures". i.e., the “old” value returned by the operation might be unchanged from the one you have expected, but the system still wants to be allowed to fail your request. This is not possible with the other style of the API. If you had to implement the latter in terms of the former, you'd have to spin until you have succeeded in actually transforming the value, but at that point it is no longer close-to-the-metal any more, and that’s an (at least informal) requirement.

  • Anonymous
    January 05, 2012
    Ah, I get you. So you can have a return value of false, even when *d == e. Is that to allow for LL/SC-based CAS emulation?

  • Anonymous
    January 06, 2012
    Dr. Pizza, yes, that would be one prominent example.

  • Anonymous
    May 01, 2012
    Is there a reason why many of these atomic functions do not support floats? Are there plans to add support?

  • Anonymous
    May 02, 2012
    Hi Steve, Thanks for asking the question. We aim to provide functionality that are portable across hardwares. Atomic functions on floats are not yet common today (except for atomic_exchange).  We will re-evaulate it in the future releases. If your scenario requires atomic operation on floats, you can consider building such operations on top of atomic_compare_exchange, assuming equality comparison using exact bits matching satisfies your need (i.e. you don't need to establish equality on NaN values). Thanks, Weirong

  • Anonymous
    March 31, 2015
    Hi, I noticed this thread dates from three years ago, so maybe specs have changed. Here's the problem: I have and image processing code and as part of a parall_for_each update loop, I have this code: for (int i1 = i - radius; i1 <= i + radius; i1++) for (int j1 = j - radius; j1 <= j + radius; j1++) { // Increment patch area with values from matched patch in the other fram dMod0P(i1, j1) += d1P(i1, j1); dDone0P(i1, j1)++; } this works. When, following your recommendation, I replaced this with: for (int i1 = i - radius; i1 <= i + radius; i1++) for (int j1 = j - radius; j1 <= j + radius; j1++) { // Increment patch area with values from matched patch in the other fram float u = dMod0P(i1, j1) + d1P(i1, j1); atomic_exchange(&dMod0P(i1, j1), u); atomic_fetch_inc(&dDone0P(i1, j1)); } I got distinct banding (diagonal lines, 45 degrees downwards) . I am puzzled by this. I am not using tiles. thanks bahman