Bitonic Sort Sample using C++ AMP

Bitonic Sort is one of the parallel sort algorithms suitable for GPU, and am sharing a C++ AMP implementation in this post. My sample is ported from a corresponding DirectX SDK sample which explains the algorithm in details, so please read the DirectCompute sort description on MSDN.

 

main – Entry point

Here the input data is randomly generated, then sorted using Bitonic sort algorithm and then verify if we have the sorted output data.

 

bitonic_sort_amp - C++ AMP Tiled Implementation

This function is called from main with input and output buffers. These buffers are encapsulated in concurrency::array objects and passed to kernels for computation. Initially the data is sorted in chunks of BITONIC_TILE_SIZE. If the data size is larger than BITONIC_TILE_SIZE, the algorithm breaks the data into multiples of BITONIC_TILE_SIZE – this is implemented inside the second for loop.

   // Then sort the rows and columns for the levels > than the block size
   // Transpose. Sort the Columns. Transpose. Sort the Rows.
   for (UINT level = (BITONIC_TILE_SIZE * 2); 
     level <= NUM_ELEMENTS; level = level * 2)

This takes data sorted along rows, interprets the data subset as a 2D matrix and transposes (see the Transpose section of the aforementioned article) to sort the column data, and then sorts column data. It does a transpose again, now to sort the row data and then sorts. This continues until all the data is sorted.

 

bitonic_sort_kernel - C++ AMP sorting kernel (Tiled)

This kernel, called from bitonic_sort_amp, is sorting a row of size BITONIC_TILE_SIZE number of elements at a time. All BITONIC_TILE_SIZE thread will read one element from GPU global memory into tile_static memory to avoid redundant reads. Then sorting is done in tile_static memory, here each thread will pick min or max, and then synchronizes before writing the result to tile_static memory. Eventually each thread will copy out the result from tile_static memory (indexed using the tile local id) to global memory (indexed using the global thread id).

 

transpose_kernel - C++ AMP matrix transpose kernel (Tiled)

This kernel, called from bitonic_sort_amp, transposes a 2D square matrix of size TRANSPOSE_TILE_SIZE * TRANSPOSE_TILE_SIZE. Given the input data as 1D vector, we could have used math to calculate linear address based on thread index and tile index to transpose. Another solution is to use view_as member function to view a 1D vector as a 2D matrix and then transform like a regular 2D matrix.

view_as - member function of concurrency::array

This function allows an array of higher rank to be reshaped into an array of lower rank or from lower rank to higher rank. Below is the code snippet where a lower rank array is reshaped to a higher rank

   array<int, 1> a(extent<1>(100));
    array_view<int, 2> av = a.view_as(extent<2>(10, 10));
Download the sample

Please download the attached project of the Bitonic Sort that we discussed here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.


BitonicSort.zip

Comments

  • Anonymous
    November 14, 2011
    This is great -- I've just spent the last week implementing this myself, though since I'm not as familiar with optimising memory access patterns I expect this to perform much better (benchmarking will have to wait for a few hours -- on the wrong laptop right now). Any chance of this being included in the final release (as a concurrency::parallel_sort(array[_view]) overload, perhaps)?

  • Anonymous
    November 14, 2011
    Okay, I'll revise my praise down somewhat - the tiling used here doesn't seem to improve performance as much as I'd expected. I've posted more details (and my own code) on the forum: social.msdn.microsoft.com/.../90090ae1-44a9-470a-8be9-f40b4392c3d1