section example in C++ AMP
In this blog post I will give a simple example of using the section member function of array and array_view, demonstrating how to offset your origin point in order to operate on a smaller section of data in your computation. For example, if your data is a matrix that looks like this:
array_view<float, 2> qin(height, width, data);
Where height and width are divisible by 2, you can view it in four quarters as follows:
array_view<float, 2> q1 = qin.section(index<2>(0, 0), extent<2>(height/2, width/2));
array_view<float, 2> q2 = qin.section(index<2>(height/2,0), extent<2>(height/2, width/2));
array_view<float, 2> q3 = qin.section(index<2>(0,width/2), extent<2>(height/2, width/2));
array_view<float, 2> q4 = qin.section(index<2>(height/2, width/2));
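To make the offsetting concrete, here is a minimal CPU-only sketch of what a section does to indexing. It is a stand-in written in plain C++, not the C++ AMP implementation: the View2D struct and the make_view helper are hypothetical names invented for this illustration, and the sketch assumes a row-major buffer. It mirrors two of the overloads used above, one taking an origin plus an extent and one taking only an origin.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-only stand-in for a 2-D array_view; not part of C++ AMP.
struct View2D {
    float* base;       // first element of the underlying row-major buffer
    int parent_width;  // row stride of the underlying buffer
    int row0, col0;    // origin of this view inside the buffer
    int rows, cols;    // extent of this view

    // Element access relative to this view's own origin.
    float& operator()(int r, int c) const {
        assert(r >= 0 && r < rows && c >= 0 && c < cols);
        return base[static_cast<std::size_t>(row0 + r) * parent_width + (col0 + c)];
    }

    // Origin plus explicit extent, in the spirit of section(index, extent).
    View2D section(int r, int c, int h, int w) const {
        assert(r >= 0 && c >= 0 && h >= 0 && w >= 0);
        assert(r + h <= rows && c + w <= cols);
        return View2D{base, parent_width, row0 + r, col0 + c, h, w};
    }

    // Origin only: the extent covers the rest of the parent, like section(index).
    View2D section(int r, int c) const {
        return section(r, c, rows - r, cols - c);
    }
};

// Wrap a row-major buffer of height * width floats (hypothetical helper).
inline View2D make_view(std::vector<float>& data, int height, int width) {
    assert(data.size() == static_cast<std::size_t>(height) * width);
    return View2D{data.data(), width, 0, 0, height, width};
}
```

With a 4x4 buffer holding 0..15, make_view(data, 4, 4).section(2, 2) yields a 2x2 view whose element (0, 0) is the parent's element (2, 2). No data is copied; only the origin and extent change.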
Below is a complete code example that sums all elements in the array_view ‘qin’ and places the result in the first element. The algorithm views the data as two dimensional, splits it into four quarters, and adds the quarters element by element into a quarter-sized output ‘qout’. Repeating this operation with ‘qout’ as the new ‘qin’ leaves the overall reduction result in qin(0,0).
The code demonstrates the section functionality, but it is not intended to be (and indeed isn’t) an optimal implementation of a reduction algorithm (we have one of those in the pipeline); it was written simply to demonstrate usage of the section API.
1: #include <amp.h>
2: #include <cstdio>
3: #include <vector>
4: using namespace concurrency;
5: using std::vector;
6: int main()
7: {
8:     // a small data size for the example
9:     // this sample requires width and height to be equal and a power of 2
10:     int width = 16;
11:     int height = 16;
12:
13:     // generate dummy data
14:     vector<float> data(width * height);
15:
16:     for (int x = 0; x < (width * height); x++)
17:     {
18:         data[x] = x * 1.0f;
19:     }
20:
21:     // wrap data so it is ready to copy to the accelerator
22:     array_view<float,2> qin(height, width, data);
23:
24:     // repeat the reduction
25:     // until the data can't be reduced any further
26:     while (width > 1)
27:     {
28:         height /= 2;
29:         width /= 2;
30:         extent<2> quarterdim(height, width);
31:         array<float,2> qout(quarterdim);
32:
33:         // view the data in 4 quarters
34:         // create an array_view offset to each quarter
35:         const array_view<const float,2> q1 =
36:             qin.section(index<2>(0, 0) /*origin*/, quarterdim /*extent*/);
37:         const array_view<const float,2> q2 =
38:             qin.section(index<2>(height, 0), quarterdim);
39:         const array_view<const float,2> q3 =
40:             qin.section(index<2>(0, width), quarterdim);
41:         const array_view<const float,2> q4 =
42:             qin.section(index<2>(height, width));
43:
44:         // execute the kernel to accumulate all quarters into the output
45:         parallel_for_each(quarterdim, [=, &qout] (index<2> idx) restrict(amp)
46:         {
47:             // accumulate all quarters in the output quarter
48:             // using the same index but in a different section
49:             qout[idx] = q1[idx] + q2[idx] + q3[idx] + q4[idx];
50:         });
51:
52:         // set the output data array as the input view
53:         // for the next loop iteration
54:         // NOTE: this doesn't sync data from the GPU to the host
55:         qin = qout;
56:
57:         // only for demo, print the intermediate data
58:         // after every iteration
59:         for (int y = 0; y < height; y++)
60:         {
61:             for (int x = 0; x < width; x++)
62:             {
63:                 // accessing qin here forces a sync of that quarter back to the host
64:                 // this causes a performance hit
65:                 printf("%0.1f ", qin(y, x));
66:             }
67:             printf("\n");
68:         }
69:         printf("===============================================\n");
70:
71:     } // while loop
72:
73:     // the final summation result can be obtained from
74:     // qin(0,0) here
75: }
76: // Sample print out
77:
78: //272.0 276.0 280.0 284.0 288.0 292.0 296.0 300.0
79: //336.0 340.0 344.0 348.0 352.0 356.0 360.0 364.0
80: //400.0 404.0 408.0 412.0 416.0 420.0 424.0 428.0
81: //464.0 468.0 472.0 476.0 480.0 484.0 488.0 492.0
82: //528.0 532.0 536.0 540.0 544.0 548.0 552.0 556.0
83: //592.0 596.0 600.0 604.0 608.0 612.0 616.0 620.0
84: //656.0 660.0 664.0 668.0 672.0 676.0 680.0 684.0
85: //720.0 724.0 728.0 732.0 736.0 740.0 744.0 748.0
86: //===============================================
87: //1632.0 1648.0 1664.0 1680.0
88: //1888.0 1904.0 1920.0 1936.0
89: //2144.0 2160.0 2176.0 2192.0
90: //2400.0 2416.0 2432.0 2448.0
91: //===============================================
92: //7616.0 7680.0
93: //8640.0 8704.0
94: //===============================================
95: //32640.0
96: //===============================================
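As a cross-check on the sample output above, the same quarter-folding steps can be sketched on the CPU in plain C++. This is only a reference for the arithmetic, not C++ AMP code: fold_quarters and reduce_by_quarters are hypothetical names, and the sketch assumes a square, row-major matrix whose side is a power of 2, matching the constraint in the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One folding step: add the four quarters of a side x side row-major matrix
// element by element into a (side/2) x (side/2) result.
// Hypothetical CPU helper, not part of C++ AMP.
inline std::vector<float> fold_quarters(const std::vector<float>& in, int side) {
    assert(side > 1 && side % 2 == 0);
    assert(in.size() == static_cast<std::size_t>(side) * side);
    const int half = side / 2;
    std::vector<float> out(static_cast<std::size_t>(half) * half);
    for (int r = 0; r < half; ++r) {
        for (int c = 0; c < half; ++c) {
            out[r * half + c] = in[r * side + c]                    // q1
                              + in[(r + half) * side + c]           // q2
                              + in[r * side + (c + half)]           // q3
                              + in[(r + half) * side + (c + half)]; // q4
        }
    }
    return out;
}

// Repeat the folding step until a single element remains.
// Assumes side is a power of 2.
inline float reduce_by_quarters(std::vector<float> in, int side) {
    assert(side > 0 && (side & (side - 1)) == 0);
    while (side > 1) {
        std::vector<float> out = fold_quarters(in, side);
        in.swap(out);  // the output becomes the next input, like qin = qout
        side /= 2;
    }
    return in[0];
}
```

For the 16x16 input 0, 1, ..., 255 used in the sample, the first fold starts with 272 and 276, and the full reduction gives 32640, which agrees with the printout.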
Observe in the sample that the array_view objects captured in the kernel need only read access to the data, which is why I declared them as array_view<const float,2>.
Also notice that the creation of ‘q1’ (line 35) could use other section overloads to retrieve the same view. The first takes only an extent:
array_view<float,2> q1 = qin.section(quarterdim);
In this case the origin is taken to be index<2>(0, 0). (The index-only overload used for ‘q4’ on line 42 works the other way around: the extent is inferred to cover the rest of the parent array/array_view.) The second takes the origin and extent as plain integers:
array_view<float,2> q1 = qin.section(0, 0, height, width);
Similarly, ‘q2’ and ‘q3’ can be created using the latter section function call.
Finally, one might look closely at ‘q1’ and ask whether ‘qin’ could replace it and reduce the number of lines of code. The answer is “yes”, but that would introduce a performance overhead: instead of copying 4 quarters to GPU memory, this change would copy 3 quarters plus the whole matrix. Copying data back to the host would also copy the whole matrix instead of just one quarter of it.
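To put rough numbers on that overhead, here is a small arithmetic sketch in plain C++. It only counts elements under the copy model described in the paragraph above, which is taken from the text rather than measured, so treat it as an assumption. The TransferCount struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Element counts for one iteration, under the copy model stated in the text
// (an assumption, not a measurement). Hypothetical helper type.
struct TransferCount {
    std::size_t to_gpu;   // elements copied to GPU memory
    std::size_t to_host;  // elements copied back to the host
};

// Using four quarter-sized sections (q1..q4), as in the sample.
inline TransferCount with_q1_section(std::size_t height, std::size_t width) {
    assert(height % 2 == 0 && width % 2 == 0);
    const std::size_t quarter = (height / 2) * (width / 2);
    return TransferCount{4 * quarter, quarter};
}

// Using the whole view in place of q1, plus q2..q4.
inline TransferCount with_whole_qin(std::size_t height, std::size_t width) {
    assert(height % 2 == 0 && width % 2 == 0);
    const std::size_t quarter = (height / 2) * (width / 2);
    const std::size_t whole = height * width;
    return TransferCount{3 * quarter + whole, whole};
}
```

For the 16x16 matrix in the sample, the section version moves 256 elements to the GPU and 64 back, while the whole-view version moves 448 and 256.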
That completes my example for creating sub-sections using the section member function. Feel free to ask questions in the comments section below or in our MSDN forum.