Gaussian Blur using C++ AMP

03/14/2012

In image processing, applying a filter function is very common, and Gaussian blur is one such filter. In this blog post I’ll share a C++ AMP implementation.

main – Program entry point

main() creates an instance of the gaussian_blur class, applies the filter (execute), and validates the results (verify). The constructor generates random input data for the calculation.

gaussian_blur::execute

In the execute function, the kernel’s input data is stored in a concurrency::array. The function then invokes a parallel_for_each computation (the kernel) that uses the simple (non-tiled) model. The kernel is implemented in the gaussian_blur::gaussian_blur_simple_amp_kernel function. At the end, the result is copied from GPU memory back to host memory.

gaussian_blur::gaussian_blur_simple_amp_kernel

This function implements a C++ AMP kernel. One GPU thread is used per input data point. Each thread reads the neighboring data points along both the x-axis and the y-axis and applies the filter to its point. The filter is applied in the nested “for” loop that traverses the neighbors, and the “if” statement inside the loop keeps every access within the array bounds. The result is then stored in the output array.
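To make the per-point work concrete, here is a minimal CPU-side sketch in plain C++ of what a single thread computes. It is not the code from the sample: the function name blur_point, the row-major std::vector layout, the 3x3 filter size, and the binomial weights are all assumptions chosen for illustration, and I assume out-of-bounds neighbors are simply skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3x3 filter (a binomial approximation of a Gaussian).
// The filter size and weights used by the actual sample may differ.
constexpr int kRadius = 1;
constexpr float kWeights[3][3] = {
    {1.f / 16, 2.f / 16, 1.f / 16},
    {2.f / 16, 4.f / 16, 2.f / 16},
    {1.f / 16, 2.f / 16, 1.f / 16}};

// Computes the filtered value for the single point (row, col) of a
// row-major image. This mirrors the work of one thread in the kernel.
float blur_point(const std::vector<float>& in, int height, int width,
                 int row, int col) {
    assert(in.size() == static_cast<std::size_t>(height) * width);
    float sum = 0.0f;
    // Nested loop: traverse the neighbors along the y-axis and x-axis.
    for (int dy = -kRadius; dy <= kRadius; ++dy) {
        for (int dx = -kRadius; dx <= kRadius; ++dx) {
            const int y = row + dy;
            const int x = col + dx;
            // Bounds check: skip neighbors that fall outside the image.
            if (y >= 0 && y < height && x >= 0 && x < width) {
                const std::size_t idx =
                    static_cast<std::size_t>(y) * width + x;
                sum += kWeights[dy + kRadius][dx + kRadius] * in[idx];
            }
        }
    }
    return sum;
}
```

Calling blur_point for every (row, col) pair gives the whole blurred image. In the C++ AMP version, that outer iteration is what parallel_for_each distributes across GPU threads.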

gaussian_blur::verify

This function validates the results computed on the GPU. It recomputes the results on the CPU from the same input data, then compares the CPU and GPU results to determine correctness.
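The comparison step could look roughly like the sketch below. The helper name results_match and the absolute tolerance of 1e-4 are assumptions, not taken from the sample. Some tolerance is generally needed because CPU and GPU floating-point results can differ slightly in the last bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when the two result buffers have the
// same length and every pair of elements differs by at most `tolerance`.
bool results_match(const std::vector<float>& cpu,
                   const std::vector<float>& gpu,
                   float tolerance = 1e-4f) {
    assert(tolerance >= 0.0f);
    if (cpu.size() != gpu.size()) {
        return false;
    }
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        // Written as !(diff <= tolerance) so that a NaN on either side
        // is reported as a mismatch instead of silently passing.
        if (!(std::fabs(cpu[i] - gpu[i]) <= tolerance)) {
            return false;
        }
    }
    return true;
}
```

An absolute tolerance is the simplest choice and is reasonable when the values have a known, modest range. A relative tolerance may suit data with a wide dynamic range better.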

Download the sample

Please download the attached Gaussian blur sample discussed here, run it on your hardware, and read through the code to see how it works. As always, you will need Visual Studio 11.

gaussian_blur.zip

Comments

Anonymous
March 30, 2012
Yes, it causes divergent warps near the boundaries. Because most warps contain no divergent threads, this shouldn’t be a problem, so it isn’t worth optimizing in this simple sample. However, another way to improve performance is to use tiles, which improves resource utilization, like parallel_for_each(extent<2>(P1TS, P2TS).tile<TS, TS>(), [...] (tiled_index<TS, TS> tidx) Thanks.
