Sorting in PPL
The Parallel Patterns Library (PPL), a programming model that resembles the C++ Standard Template Library (STL), was introduced in Visual Studios 2010. PPL provides general-purpose containers and algorithms for performing fine-grained parallelism. Additional containers and algorithms were shipped as part of the Sample Pack/ConcRT Extras, and in its latest version (v0.33), an updated version of parallel sort (along with 2 other algorithms for sorting in parallel) were introduced. In this blog, I shall elaborate on these sorting algorithms and compare/contrast them:
The following parallel sorting algorithms are provided:
1. Parallel sort - a general-purpose compare-based in-place sort (parallel quick sort)
2. Parallel buffered sort - a faster general-purpose compare-based sort that require O(N) space (parallel merge sort + quick sort)
3. Parallel radix sort - an integer key-based stable sort that requires O(N) space
For general purpose compare based sorting algorithms, quicksort is very efficient for most of the cases. The standard sorting algorithm, std::sort, uses Introsort (quicksort/heapsort). Quicksort has a minimal average compare and swap times for sorting random sequences. The parallel version of quicksort has O((N/P) log(N)) time complexity when the number of processors P is much less than data length N, and in a practical situation, it usually achieves O(N/P) for many cases. The default sort algorithm parallel_sortis implemented with a parallel quicksort.
However, when the number of processors increases, parallel quicksort suffers from the time complexity changing to O((N/P) log(N/P) + 2N(P – 1)/P) since the initial splitting stage of quicksort cannot be parallelized, which affects the scalability of the algorithm over cores. Scalability bottleneck becomes more apparent especially when the compare operation is very expensive. This can be overcome by using a parallel merge to replace the initial stage of parallel quicksort and make the average time complexity O((N/P) log N). However, to make the parallel merge efficient, it will require O(N) extra space. This algorithm is implemented as the parallel_buffered_sort.
For the special objects with integer-based keys a more efficient algorithm may take advantage of the integer’s radix property. Hash-based radix sort can directly compute the destination of the key without requiring comparisons. This algorithm can achieve roughly O(N/P) complexity on a machine with P processors. It also requires O(N) space to avoid contention. This algorithm is implemented as the parallel_radixsort.
The space and time complexity relationship between the sort algorithms can be portrayed as follows:
Summary:
To summarize, the table below compares/contrasts the various parallel sorting algorithms in the sample pack:
Example:
Given below is a code sample illustrating sorting of a vector using serial sort and the different parallel sorts.
/// Sample code introducing the various parallel sorts
/// Compile with: /EHsc
#include <windows.h>
#include <vector>
#include <algorithm>
#include <iostream>
#include <ppl_extras.h>
const size_t NUM_ELEMS_TO_SORT = 10000000;
/// <summary>
/// Helper function that clocks and returns run times of a function
/// </summary>
template<typename Func>
const unsigned int ClockFunc(Func func)
{
unsigned __int64 start=0;
unsigned __int64 end=0;
QueryPerformanceCounter((LARGE_INTEGER *) &start);
func();
QueryPerformanceCounter((LARGE_INTEGER *) &end);
return (unsigned int)(end - start);
}
/// <summary>
/// Function that returns a vector with the same data(value) contents
/// on every call
/// </summary>
std::vector<size_t> GetVector()
{
//Set const seed
srand(0);
std::vector<size_t> vec(NUM_ELEMS_TO_SORT);
std::generate(vec.begin(),vec.end(),rand);
return std::move(vec);
}
int main()
{
//Std Sort (serial)
{
std::vector<size_t> v(GetVector());
std::cout << "Std Sort: " << ClockFunc([&v]() { std::sort(v.begin(), v.end()); }) << std::endl;
}
//Parallel Sort
{
std::vector<size_t> v(GetVector());
std::cout << "Parallel Sort: " << ClockFunc([&v]() { Concurrency::samples::parallel_sort(v.begin(), v.end()); }) << std::endl;
}
//Parallel Buffered Sort
{
std::vector<size_t> v(GetVector());
std::cout << "Parallel Buffered Sort: " << ClockFunc([&v]() { Concurrency::samples::parallel_buffered_sort(v.begin(), v.end()); }) << std::endl;
}
//Parallel Radix Sort
{
std::vector<size_t> v(GetVector());
std::cout << "Parallel Radix Sort: " << ClockFunc([&v]() { Concurrency::samples::parallel_radixsort(v.begin(), v.end()); }) << std::endl;
}
return 0;
}
Output:
When built for x86 retail and run on my 4-core machine:
Std Sort: 2342996
Parallel Sort: 861919
Parallel Buffered Sort: 977655
Parallel Radix Sort: 414180
Additionally, can also visualize the workings/behavior of the different sort algorithms with varying data patterns, concurrency index and comparator/hasher cost with the Parallel Sort Demo (under ConcRT_SamplePack->Samples->ParallelSortDemo) shipped as part of the Sample Pack.
Now that you are familiar with the different parallel sorting algorithms provided in the sample pack, is there a guideline on when to use a particular sort algorithm (and conversely, when not to use them)? In my next blog, I’ll explore this topic based on an experiment that I conducted.
Comments
Anonymous
March 14, 2011
As far as I know: "return std::move(vec);" won`t be faster then "return vec;" Am I wrong?Anonymous
March 14, 2011
Piotr, you are correct. Both return std::move(vec) and return vec call into move constructor of vector.