Simplifying single-dimensional C++ AMP Code
Hello! My name is Daniel Griffing and I’m a test engineer on the C++ AMP team.
C++ AMP scales well to N-rank (N-dimensional) data: it provides types that describe the shape of the data (concurrency::extent) and identify a single element (concurrency::index) across its dimensions. Data is wrapped in a concurrency::array_view, which exposes it in a multi-dimensional way. For data with Rank=1, however, programmers are accustomed to specifying size and index with plain integer values (e.g. when using std::vector). This post looks at some ways to simplify C++ AMP code when dealing with 1-D data.
We’ll use a simple example of a function that multiplies each value in a std::vector by some scalar value for illustration.
A non-C++ AMP version of this code could be:
std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    // size() is unsigned, so use an unsigned index to avoid a signed/unsigned comparison
    for (std::size_t i = 0; i < outv.size(); ++i)
    {
        outv[i] = data[i] * multiplier;
    }
    return outv;
}
Here’s an equivalent function in C++ AMP code:
// Assumes #include <amp.h> and using namespace concurrency;
std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    array_view<float, 1> input(data.size(), data);
    array_view<float, 1> output(outv.size(), outv);
    parallel_for_each(extent<1>(data.size()), [=](index<1> idx) restrict(amp)
    {
        output[idx] = input[idx] * multiplier;
    });
    // copy the results back into outv
    output.synchronize();
    return outv;
}
This code differs from the non-C++ AMP code in two ways. We’ll look at ways to simplify the code for each of these items.
· The use of array_view<float, 1> allowing the transfer of data to and from the accelerator device.
· The use of the C++ AMP concurrency::parallel_for_each construct taking two arguments:
- An extent<1> defining the length of the data set
- An amp-restricted lambda function with an index<1> value as its only argument.
array_view<T, Rank = 1>
As you may have seen in previous blog posts, the array_view type is defined with a default value of 1 for its Rank template parameter.
So, we can write:
array_view<float> input(data.size(), data);
…instead of:
array_view<float, 1> input(data.size(), data);
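The mechanism at work is an ordinary default template argument. The sketch below uses a hypothetical stand-in type, view_sketch, rather than the real array_view (which needs amp.h and a C++ AMP-capable compiler). It only aims to show that omitting the argument names exactly the same type as spelling it out.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for array_view; NOT the C++ AMP type.
// Rank defaults to 1, mirroring how array_view declares its Rank parameter.
template <typename T, int Rank = 1>
class view_sketch
{
public:
    static constexpr int rank = Rank;

    view_sketch(int size, std::vector<T>& data) : size_(size), data_(&data) {}

    int size() const { return size_; }

    // Element access writes through to the wrapped vector.
    T& operator[](int i) const { return (*data_)[static_cast<std::size_t>(i)]; }

private:
    int size_;
    std::vector<T>* data_;
};

// Omitting Rank selects the default, so both spellings name one and the same type.
static_assert(std::is_same<view_sketch<float>, view_sketch<float, 1>>::value,
              "view_sketch<float> is view_sketch<float, 1>");
```

Because the two spellings are the same type, switching between them changes nothing about the generated code; it is purely a matter of readability.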
parallel_for_each with integers
In our Rank=1 example code, the extent<1> type expresses the size of the input data, and the lambda function takes an index<1> argument that provides the index into the data set. To minimize the changes from the original, non-C++ AMP code, it would be convenient to express both values using an integer type, as that code did.
A wrapper function for parallel_for_each over 1-D data that takes int arguments in place of extent<1> and index<1> can be written quickly and with a small amount of code, as follows.
template <typename Kernel>
void parallel_for_each(int ext_size, Kernel kernel)
{
    auto krn = [=] (index<1> idx) restrict(amp)
    {
        kernel(idx[0]);
    };
    concurrency::parallel_for_each(extent<1>(ext_size), krn);
}
We first wrap the user-provided kernel in a lambda invoking the kernel with the integer value idx[0]. The wrapper is then used in a call to the C++ AMP parallel_for_each signature.
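The same adapter pattern can be tried without a C++ AMP toolchain. The following is a minimal sketch in portable C++, under loud assumptions: index1_sketch, library_for_each_sketch, for_each_int_sketch and multiply_sketch are all hypothetical names, the "library" function is a plain sequential CPU loop, and restrict(amp) is omitted because it is specific to C++ AMP. It illustrates only the wrapping step, not accelerator execution.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for concurrency::index<1>: holds one integer coordinate.
struct index1_sketch
{
    int value;
    int operator[](int) const { return value; }
};

// Hypothetical stand-in for the library's parallel_for_each. It loops
// sequentially on the CPU and hands an index object to the kernel.
template <typename Kernel>
void library_for_each_sketch(int ext_size, const Kernel& kernel)
{
    for (int i = 0; i < ext_size; ++i)
    {
        kernel(index1_sketch{i});
    }
}

// The adapter: accept a kernel that takes an int, wrap it in a lambda that
// takes an index object, and forward that wrapper to the "library" function.
template <typename Kernel>
void for_each_int_sketch(int ext_size, Kernel kernel)
{
    assert(ext_size >= 0);
    auto krn = [=](index1_sketch idx)
    {
        kernel(idx[0]); // unwrap the single coordinate
    };
    library_for_each_sketch(ext_size, krn);
}

// Usage: same shape as the multiply example, with an int index in the kernel.
std::vector<float> multiply_sketch(const std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    const float* in = data.data();
    float* out = outv.data();
    for_each_int_sketch(static_cast<int>(data.size()), [=](int idx)
    {
        out[idx] = in[idx] * multiplier;
    });
    return outv;
}
```

The user-facing kernel never sees the index type; only the small wrapper lambda does, which is what keeps the calling code close to the original loop.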
In our example, the call to parallel_for_each can now be rewritten to use int arguments:
parallel_for_each(data.size(), [=](int idx) restrict(amp)
{
    output[idx] = input[idx] * multiplier;
});
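One detail worth keeping in mind: data.size() returns std::size_t, while the wrapper takes an int, so the value is narrowed implicitly at the call site. A minimal sketch of a checked conversion follows. The helper name to_int_size is hypothetical and not part of C++ AMP.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not part of C++ AMP): convert a container size to int,
// throwing instead of silently truncating when the value does not fit.
inline int to_int_size(std::size_t n)
{
    if (n > static_cast<std::size_t>(std::numeric_limits<int>::max()))
    {
        throw std::length_error("size does not fit in int");
    }
    return static_cast<int>(n);
}
```

With such a helper, the call would pass to_int_size(data.size()) as its first argument, making the conversion explicit.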
Summary
The resulting code, using the default Rank and our wrapper function, is:
std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    array_view<float> input(data.size(), data);
    array_view<float> output(outv.size(), outv);
    // multiply each element of the input array_view by multiplier
    parallel_for_each(data.size(), [=](int idx) restrict(amp)
    {
        output[idx] = input[idx] * multiplier;
    });
    output.synchronize();
    return outv;
}
In this post we looked at two ways to simplify C++ AMP code for data with Rank=1. We made use of the default value for Rank in array_view<> and implemented a wrapper for parallel_for_each that allows int indices in our execution kernel. This is one example of a higher-level abstraction, built on top of parallel_for_each, that can be achieved using template programming.
If you have feedback, I’d love to hear it in the comments section below.
Comments
Anonymous
January 24, 2012
FYI The html markup/formatting of the second section of code is not quite right. Displays as plain text not <pre>
Anonymous
January 25, 2012
Thanks Matthew! This has been fixed.
Anonymous
January 25, 2012
It still doesn't appear to be fixed. Talking about the code snippet immediately following "Here’s an equivalent function in C++ AMP code:".
Anonymous
January 27, 2012
This is a great trick. Thanks for sharing.