Concurrency::parallel_for and Concurrency::parallel_for_each

Marko Radmilac, Senior Developer for Microsoft’s Concurrency Runtime, offers the following discussion of parallel iteration in the Concurrency Runtime:

Let me say up front that this is my first blog, so I apologize if my writing format does not match what people have come to expect from a blog. I do hope, however, that this blog will contain enough information to provide some introductory information on the Parallel Patterns Library’s parallel_for and parallel_for_each constructs, when and how to use them. This blog has a Q and A format and contains both basic and advanced information, so feel free to jump to the section that is of most interest to you.

What is parallel_for and where do I find it? As you are probably aware (since you are reading this blog), the native C++ libraries in Visual Studio 2010 were extended to provide rich support for parallel programming. There are different layers at which users can interact with the parallel runtime, the highest of which is the Parallel Patterns Library (using the header file ppl.h). In it, users can find different constructs that allow them to quickly parallelize their programs without extensive knowledge of scheduling decisions, underlying threads, the surrounding environment, etc. One of these constructs is the parallel_for construct, which allows a user to parallelize a for-loop quickly. Its close cousin is parallel_for_each, which provides the same ease of use for parallelizing the std::for_each construct.

How do I use parallel_for and when? Parallel for is a construct which takes the body of a user-written for-loop, captured in a functor (for details on functors and lambdas please read Stephan's excellent blog), divides the work (the iterations) amongst the available computing resources (processors), and executes that work in parallel. Here's a simple example of a serial-to-parallel transformation:

for (int i = 0; i < n; i++)
{
    iter();
}

becomes

Concurrency::parallel_for(0, n,
    [] (int i)
    {
        iter();
    });

A naïve user might decide to go ahead and convert all for-loops in the program into parallel for-loops. That would be a mistake, because without an initial feasibility analysis this conversion might be detrimental to the program. There are two aspects that must be considered before a conversion is made, correctness and performance, and they are both explained in the following few paragraphs.

I have converted my for-loops into parallel for-loops. Why doesn’t my code work properly? Converting all for-loops in a program into parallel for-loops may yield an incorrect program. There are quite a few algorithms in which the individual iterations of the loop are not independent. They often require another iteration to have been executed beforehand, so executing them in isolation would likely yield wrong results. A typical example would be a partial sum computation “in place”, on the original array (look at std::partial_sum):

for (int i = 1; i < n; i++)
{
    // Array "a" contains both the original sequence
    // and the end result
    a[i] += a[i-1];
}

In order to compute the kth term in the resulting sequence, the (k-1)th term must be known. If the iterations were executed in parallel, the (k-1)th term might not yet be populated by the time the kth term is processed, yielding an incorrect result.

Concurrency::parallel_for(1, n,
    [] (int i)
    {
        a[i] += a[i-1]; // incorrect!
    });

I have converted my for-loops into parallel for-loops. Why does my code run slower than serial code? Converting all for-loops in a program into parallel for-loops may yield performance degradation instead of an improvement. Parallel for, as mentioned before, will divide the work amongst the available processors and set it up for parallel execution. The initial division of work, setting up worker threads, and invoking a functor for each iteration all incur a certain cost. That cost is not big, but it is not negligible either. If the work done in a single iteration and the number of iterations are both small, this overhead is likely to show up as a smaller-than-expected speedup, sometimes even a slowdown.

Concurrency::parallel_for(0, 100,
    [] (int i)
    {
        a[i] = b[i] + 1; // only a few cycles
    });

In the example above, there is one memory read, one memory write, and one add to be performed in a single iteration of the loop. On top of that, there are only 100 iterations in the entire loop. In this case, the cost of making a function call to perform one iteration is likely to be on the same order of magnitude as, if not greater than, the work in the iteration itself. Therefore, this loop is not a good candidate for parallelization.

I am frustrated! When should I use parallel_for then? You should analyze your program first and identify "hot spots", i.e. places in your code that are computationally intensive. Identify the for-loops that control that computation and try to rewrite them in a way that eliminates dependencies between iterations. Also, in the presence of nested for-loops, it is often more beneficial to parallelize at the outermost loop level, in order to increase the amount of work being done per iteration. Once this analysis is complete, apply parallel for to the given loops and watch your program run faster :)
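
As a hedged illustration of the nested-loop advice (this is a sketch, not code from the blog; the matrix shape and the scale_matrix helper are assumptions), the outer row loop below is parallelized while the cheap inner column loop stays serial, so each parallel iteration carries a full row's worth of work:

#include <ppl.h>
#include <vector>

// Sketch: parallelize the outer (row) loop so each parallel iteration does a
// whole row of work, instead of paying per-iteration overhead on the tiny
// inner (column) body.
void scale_matrix(std::vector<std::vector<double>>& m, double factor)
{
    Concurrency::parallel_for(0, static_cast<int>(m.size()),
        [&] (int row)
        {
            // The inner loop stays serial; its iterations are too cheap
            // to be worth parallelizing on their own.
            for (size_t col = 0; col < m[row].size(); ++col)
            {
                m[row][col] *= factor;
            }
        });
}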

I was using OpenMP and I noticed that OpenMP executes my workload much faster than parallel_for does. Why is that? OpenMP uses a simple mechanism for scheduling the parallel iterations. It counts the number of iterations, divides them up by the number of processors P, and then schedules P workers to execute individual chunks of work. Each iteration in OpenMP consists of only a function call to the outlined function (in parallel for’s case this is a functor/lambda). On the other hand, parallel for is a more general mechanism, does much more work during preparation for parallel invocation, and much more during each iteration. This impacts the total amount of work (and the number of iterations in the loop) needed for parallel for to be profitable, and that number is higher than OpenMP’s. Therefore, you must have a for-loop with just enough work to make OpenMP profitable, but not enough to do the same for parallel for. If you had less work, both would be slower than serial execution; if you had more, both would be equally profitable because the overhead is amortized by the amount of work being done.

Why does it take so much work per iteration for parallel_for to turn profitable? OpenMP does it with far less work per iteration. As mentioned above, parallel for is a much more general construct than any construct in OpenMP. It should be noted that the Concurrency Runtime (ConcRT) is not less performant than the OpenMP runtime, but the parallel for construct tries to address many more issues than a simple OpenMP for-loop, and these come at a cost. It is possible to reduce the generality of parallel for to a minimum, at which point it comes very close to the performance of the OpenMP counterpart. You can find a less generalized implementation of parallel_for in our sample pack, called parallel_for_fixed (it uses fixed partitioning of iterations, without range stealing). The reasons for the generality of the parallel for implementation are given in the following few paragraphs.

My workload inside the for-loop is uneven. Can parallel for help me distribute it properly? Yes, parallel for by default does range stealing and dynamic redistribution of work. That means that if any worker thread finishes its range, it will attempt to help other worker threads with their workloads. This ensures that all computing resources are fully utilized during parallel for execution. OpenMP does this to an extent with dynamic and guided scheduling, but parallel for does it on a per-iteration level without using interlocked operations to mark the progress.
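
Here is a minimal sketch of such an uneven workload (the work is simulated with a trivial inner loop whose length grows with i; that inner loop is an assumption for illustration, not part of the blog). Under the default parallel_for, workers that finish their initial ranges can steal iterations from the ranges holding the expensive tail, so that tail does not serialize on one thread:

#include <ppl.h>

// Sketch: per-iteration cost grows with i, so a static split into equal
// contiguous chunks would leave one worker with most of the work. Range
// stealing rebalances this automatically.
void uneven_example(int n)
{
    Concurrency::parallel_for(0, n,
        [] (int i)
        {
            volatile double x = 0.0;
            for (int k = 0; k < i * 100; ++k) // cost grows with i
            {
                x += k * 0.5;
            }
        });
}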

Can parallel for support both unsigned and signed loop indices? Do I have to worry about overflow? Yes, parallel_for handles both signed and unsigned indices equally well. As long as the type supports the basic operations found on an integer type, parallel for will handle it well. It will also make sure that the range does not overflow a signed type. This was not the case with OpenMP before version 3.0.
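
A minimal sketch of both index types (the arrays and the function name are assumptions for illustration):

#include <ppl.h>

// Sketch: the index type is deduced from the loop bounds, so signed and
// unsigned ranges both work with the same construct.
void index_type_example()
{
    static int a[200];
    static unsigned int b[200];

    Concurrency::parallel_for(-100, 100,   // signed range, including negatives
        [] (int i) { a[i + 100] = i; });

    Concurrency::parallel_for(0u, 200u,    // unsigned range
        [] (unsigned int i) { b[i] = i; });
}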

Do I have to worry about overloading the system? What if there are other parts of the process using the Concurrency Runtime? Parallel for automatically detects system load and identifies available processors to use. Therefore, it will be cooperative with the rest of the process, and use only available resources. Once other, previously busy resources become available, parallel for will quickly find use for them (they will join the computation). OpenMP provides this through its dynamic mode, although once a parallel region has started executing, any resources that become available after that point will not be used in the computation.

Can I have blocking code in my loop iterations? What happens if I block? Yes, you can have blocking in your code as long as it is cooperative (using ConcRT blocking APIs, or using Win32 blocking APIs on user-mode scheduling -- UMS). When blocking happens, parallel for and ConcRT ensure that another worker is scheduled to pick up where the previous one left off. As a result, parallel for enables users to write code that would not work properly as serial code. The prime example is enabling a forward blocking dependency, where the ith iteration is blocked until the jth iteration executes, with j > i. This code would deadlock if run serially. OpenMP does not provide any support for blocking between iterations.

Concurrency::event e;
volatile LONG counter = 0;

Concurrency::parallel_for(0, 64,
    [&] (int i)
    {
        InterlockedIncrement(&counter);

        if (i == 63)
        {
            // all other iterations blocked;
            // unblock them
            e.set();
        }
        else
        {
            // wait for iteration 63 to unblock
            e.wait();
        }
    });

// counter must be at 64 at this point

Can I cancel parallel for in the middle of computation? Yes, you can cancel parallel for at any point by throwing an exception from one of its iterations, or canceling its parent task group. This technique is very useful in performing long searches where computation can be stopped as soon as the result is found.

Concurrency::parallel_for(0, n,
    [] (int i)
    {
        if (found(i))
        {
            throw ItemFound(); // cancels work immediately
        }
    });

This can also be achieved by cancelling the parent task_group, thus enabling cancellation on the entire subtree of work.
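
A hedged sketch of the task_group approach (the task group, its name, and the search_with_cancel wrapper are assumptions; "found" is the predicate from the example above): running the parallel_for inside a task_group lets an iteration cancel the entire subtree by calling cancel() on that group instead of throwing.

#include <ppl.h>

bool found(int i); // assumed predicate, as in the example above

// Sketch: cancel the parallel_for (and any nested work) via its parent task_group.
void search_with_cancel(int n)
{
    Concurrency::task_group tg;
    tg.run_and_wait([&]
    {
        Concurrency::parallel_for(0, n,
            [&] (int i)
            {
                if (found(i))
                {
                    tg.cancel(); // cancels the whole subtree of work
                }
            });
    });
}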

Parallel for guarantees that only iterations that have already started will complete after cancellation is initiated. This ensures that resources are not wasted. OpenMP does not have any support for cancellation, although one can be built in using a flag.

volatile bool flag = false;

#pragma omp parallel for
for (int i = 0; i < n; i++)
{
    if (flag)
    {
        continue; // cancellation requested; skip the remaining work
    }
    else if (found(i))
    {
        flag = true; // cancels work
    }
}

It should be noted that even with this hand-crafted cancellation scheme OpenMP does not support cancellation of the entire sub-tree of computation, which parallel for does.

Can I throw exceptions inside the parallel for? Will that tear down my application? Yes, you can throw exceptions in the parallel for functor, and that will not result in the termination of your application. An exception on a worker thread will safely be transported to the main thread that initiated the parallel for (if the exception is not caught in the body of the functor, naturally), so the user will have an opportunity to catch the exception outside of parallel for. OpenMP has no exception handling support.
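
A minimal sketch of catching such an exception at the call site (the exception type, the trigger condition, and the function name are assumptions for illustration):

#include <ppl.h>
#include <stdexcept>
#include <stdio.h>

// Sketch: an exception thrown from an iteration is marshaled back to the
// thread that called parallel_for, so it can be caught around the call.
void exception_example(int n)
{
    try
    {
        Concurrency::parallel_for(0, n,
            [] (int i)
            {
                if (i == 42)
                {
                    throw std::runtime_error("bad iteration");
                }
            });
    }
    catch (const std::runtime_error& e)
    {
        printf("caught: %s\n", e.what());
    }
}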

Can I use parallel for with STL iterators? This is why we have parallel_for_each. If your iterator supports random access, it will use the same mechanism used for parallel_for. If your iterator only supports forward access, then we have provided code that reduces this case to random access, at which point the parallel_for construct is reused.
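
A short sketch of both cases (the containers and the doubling operation are assumptions for illustration):

#include <ppl.h>
#include <vector>
#include <list>

// Sketch: a random-access container maps directly onto the parallel_for
// machinery; a forward/bidirectional container (std::list here) goes through
// the fallback mentioned above.
void for_each_example()
{
    std::vector<int> v(1000, 1);
    Concurrency::parallel_for_each(v.begin(), v.end(),
        [] (int& x) { x *= 2; });

    std::list<int> l(1000, 1);
    Concurrency::parallel_for_each(l.begin(), l.end(),
        [] (int& x) { x *= 2; });
}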

How many iterations, exactly, does it take to make parallel for better than serial? How many cycles should the functor contain in order for parallel for to beat the serial implementation? This is a difficult question to answer. It depends on the machine configuration, whether there is other parallel work in the process, etc. Therefore, instead of giving a general and not very useful answer, I will attempt to give a concrete answer for the 4-core box that I am using, without any other work on it.

This is how I decided to measure it:

i) I will have an outer loop that controls the number of repetitions for the experiment. This loop amortizes the initial spin-up of threads in both OpenMP and ConcRT. Also, it magnifies the result to a level that is easily measurable (above 100 ms).

ii) I will have a dial that controls the amount of work being done per iteration.

iii) I will have a dial that controls the number of iterations in a loop.

When ii) is small and iii) is big we will measure the overhead per iteration (provided that the initial amount of work per iteration is small); when they are both small we will measure the overhead for the initial setup of threads, dividing work, etc.

The work being done will be a simple function call without any work in that function, and ii) will control how many times that function is called. This is what a simple version of this program looks like:

#include <ppl.h>
#include <windows.h>
#include <stdio.h>
#include <omp.h>

using namespace Concurrency;

#define RUN_TEST test_parallel_for

#define SIZE_OF_ARRAY 10000

#define REPEAT 10000

#define WORK 2

long long counter() {
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return li.QuadPart;
}

long long frequency() {
    LARGE_INTEGER li;
    QueryPerformanceFrequency(&li);
    return li.QuadPart;
}

// noworkhelper has an empty body {} and is compiled /Od in a separate
// object file, to avoid inlining and invariant code motion
void noworkhelper();

void work(int index)
{
    for (int i=0; i<WORK; i++)
    {
        noworkhelper();
    }
}

typedef void (*FNPTR) (int i);

FNPTR Func = work;

__declspec(noinline)
void test_parallel_for()
{
    parallel_for(0, SIZE_OF_ARRAY, Func);
}

__declspec(noinline)
void test_omp_for()
{
    #pragma omp parallel for
    for (int i = 0; i < SIZE_OF_ARRAY; i++)
    {
        Func(i);
    }
}

void run_data()
{
    for (int i = 0; i < REPEAT; i++)
    {
        RUN_TEST();
    }
}

int main()
{
    SchedulerPolicy policy(1, SchedulerKind, ThreadScheduler);
    Scheduler::SetDefaultSchedulerPolicy(policy);

    double etime = 0.0;

    long long start = counter();
    run_data();
    long long finish = counter();

    etime = (finish - start) * 1000.0 / frequency();
    printf("Elapsed time for test: %g ms\n", etime);

    Scheduler::ResetDefaultSchedulerPolicy();
}

Before showing the results, it should be noted that these are pure overhead tests, designed to show costs related to creating, running and maintaining parallel regions. Here are the raw tables:

 

                  R:100000   R:50000   R:10000   R:5000    R:1000    R:500
                  I:500      I:500     I:500     I:500     I:500     I:500
                  W:5        W:10      W:50      W:100     W:500     W:1000

Serial              713.2      670.9     637.1     649.4     632.9     630.2
OpenMP              295.1      265.5     197.5     189.6     173.4     184.6
Parallel for       1142.1      671.7     293.7     241.3     202.5     200.9
Fixed               564.3      388.9     232.7     233.6     183.3     206.7

R – the number of repeats, I – the number of iterations, W – the number of units of work; all measured times are in milliseconds.

[Figure: Small number of iterations, small work]

Now let's look at what happens with a high number of iterations and a low amount of work per iteration. This will tell us how costly each iteration is to execute in OpenMP and parallel for.

 

                  R:500      R:100     R:50      R:10      R:5       R:1
                  I:100000   I:100000  I:100000  I:100000  I:100000  I:100000
                  W:5        W:25      W:50      W:250     W:500     W:2500

Serial              712.1      645.2     638.4     637.1     632.7     630.5
OpenMP              198.6      194.6     170.9     168.3     169       168.5
Parallel for        420.3      207.6     219.9     169.1     170.6     164.9
Fixed               226.6      213.6     196.8     169.1     172.9     179.5

R – the number of repeats, I – the number of iterations, W – the number of units of work; all measured times are in milliseconds.

[Figure: Large number of iterations, small work]

Conclusion -- Parallel for is a general-purpose mechanism for executing serial for-loops in parallel. It is up to the user to determine the legality and profitability of such a transformation. The parallel for construct increases the robustness and flexibility of parallelized code by having built-in support for features such as cooperative blocking during iterations, immediate cancellation, signed integer iteration types with arbitrarily large ranges, dynamic redistribution of workload, support for STL iterators, etc. However, the generality of parallel for comes at a cost, and there are two types of overhead that make our parallel for implementation slower than OpenMP in some cases. The first one is the initial setup of the worker threads and the range-stealing mechanism, and the second one is the additional work being done per iteration to support cancellation and range stealing. This may result in poor scaling of the parallel for-loop if the amount of work being done per iteration is very small. If your program falls into that category, and you would still like to see speedup from executing in parallel, please use the specialized, fixed-partitioning version of parallel for, parallel_for_fixed, available in our samples collection. It will scale as well as OpenMP in the cases where limiting overhead is essential.
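
For readers curious what "fixed partitioning" means, here is a hedged, minimal sketch (this is not the parallel_for_fixed implementation from the sample pack; the function name, the default worker count, and the use of task_group are assumptions): the range is split once into contiguous chunks, one task per chunk, with no range stealing and no per-iteration bookkeeping.

#include <ppl.h>

// Sketch of fixed partitioning: split [first, last) into equal contiguous
// chunks up front and run each chunk as a single task.
template <typename Func>
void parallel_for_fixed_sketch(int first, int last, const Func& body, int workers = 4)
{
    int n = last - first;
    if (n <= 0) return;
    if (workers < 1) workers = 1;

    int chunk = (n + workers - 1) / workers;

    Concurrency::task_group tasks;
    for (int begin = first; begin < last; begin += chunk)
    {
        int end = (begin + chunk < last) ? (begin + chunk) : last;
        // Each task owns one contiguous sub-range; nothing is stolen or rebalanced.
        tasks.run([=, &body]
        {
            for (int i = begin; i < end; ++i)
            {
                body(i);
            }
        });
    }
    tasks.wait();
}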

Comments

  • Anonymous
    November 18, 2009
    Why doesn't parallel_for() respect the scheduler policy limits? In my case, I'm seeing three threads running on two cores, but I can't change that number.

  • Anonymous
    November 18, 2009
    The comment has been removed

  • Anonymous
    November 27, 2009
    That's a fair explanation and I might be using ConcRT and PPL wrong, but after this: SchedulerPolicy policy = CurrentScheduler::GetPolicy(); policy.SetPolicyValue(TargetOversubscriptionFactor, 1); policy.SetConcurrencyLimits(1, 1); CurrentScheduler::Create(policy); I'm still getting three threads for a parallel_for() call. Thanks!

  • Anonymous
    November 27, 2009
    I tried using SetDefaultSchedulerPolicy() and it works. Sorry.

  • Anonymous
    November 30, 2009
    The call to CurrentScheduler::GetPolicy() will wind up creating and attaching the default scheduler and returning a copy of its policy.  The CurrentScheduler::Create() call will create a new (second) scheduler attached to the same thread. If you run a parallel_for after this code, it would run on the second (newly) created scheduler.  I would expect, however, to see up to two threads (the main thread and the scheduler's) participating in such a parallel_for.

  • Anonymous
    March 01, 2010
Suppose that inside a parallel_for_each loop I do some computation and, based on that, push data into a vector. Do I have to put some serialization around the push, like this:    Concurrency::parallel_for(0, n,        [] (int i)        {           int value = GetValue();           if ( value )             myVect.push_back(value); // is this safe? I think I have to serialize the insertion in myVect?!        });

  • Anonymous
    March 01, 2010
    George, you have to synchronize any non-thread-safe operations, including std::vector<T>::push_back. However, take heart.  ConcRT ships with a concurrent_vector type which has a self-synchronized (thread-safe) version of push_back that is very efficient. For other operations within parallel_for (or any concurrent region), you need to ensure that they are thread-safe, or add your own synchronization around them.

  • Anonymous
    March 01, 2010
Hi George, if that vector is not thread safe, then yes, you must protect it. You can also use: (1) concurrent_vector, which is thread safe (http://msdn.microsoft.com/en-us/library/ee355343(VS.100).aspx); (2) combinable (http://msdn.microsoft.com/en-us/library/dd492850(VS.100).aspx) to avoid using locks; (3) concrt_reader_writer (http://msdn.microsoft.com/en-us/library/dd504907(VS.100).aspx).

  • Anonymous
    March 03, 2010
    Thanks to all who answered my question. Were any of these parallel algorithms implemented in the STL? I mean, it will be great to have something like parallel_sort! George.

  • Anonymous
    March 03, 2010
    Hi George, Please find more samples in Concurrency Runtime code samples(http://code.msdn.microsoft.com/concrtextras/Release/ProjectReleases.aspx?ReleaseId=3519)

  • Anonymous
    March 05, 2010
    Is there a way to set parallel_for_each to use more resources on the computer where it runs? I run a task inside this loop and on a 8 core server uses only around 60%. thanks, George.

  • Anonymous
    March 05, 2010
    George, a version of parallel_sort will be in the next version of the sample pack. For your performance issues, I'd encourage you to first take a look at the issue with the concurrency visualizer: http://msdn.microsoft.com/en-us/library/dd537632(VS.100).aspx You also may find some help with oversubscription and scheduler policies, take a look at these help topics: http://msdn.microsoft.com/en-us/library/dd984036(VS.100).aspx

  • Anonymous
    March 06, 2010
    Rick >>a version of parallel_sort will be in the next version of the sample pack. ..... Great! When is the new sample pack going to be released? Thanks, George.

  • Anonymous
    March 09, 2010
    it's live now @ http://code.msdn.com/concrtextras

  • Anonymous
    April 01, 2010
    Hi again, I have now a version of our app using the PPL and works great on my development computer. However, I need to test it on a server and I need the runtime C distributable which I cannot find. Another odd thing is the missing of an installation project in the 2010 rc version. Thanks for any help, George. PS: if anyone wants to send me and email my address is: tihenea@comcast.net

  • Anonymous
    April 05, 2010
    I tried to use this function from the extras but just replacing the parallel_for_each with this one gives some compile errors regarding expecting 8 arguments, etc. The error is inside ppl_extras. Anybody having the same issue? Thanks, George.

  • Anonymous
    April 05, 2010
    thanks George, we'll take a look.

  • Anonymous
    April 07, 2010
    Hi, Here is a fragment from the docs which is not clear to me: ........... •The concurrent_vector class does not store its elements contiguously in memory. Therefore, you cannot use the concurrent_vector class in all the ways that you can use an array. For example, for a variable named v of type concurrent_vector, the expression &v[0]+2 produces undefined behavior. ........... Does this mean that a construct like this will not work? .............. concurrent_vector<TMyC> myC; size_t dist = 10; // this is < vector size parallel_for_each(myC.begin(), myC.begin() + dist, [&]( { ... } ) Thanks, George.

  • Anonymous
    April 07, 2010
    The comment has been removed

  • Anonymous
    April 09, 2010
    Hi George, Regarding your comment: "I have now a version of our app using the PPL and works great on my development computer. However, I need to test it on a server and I need the runtime C distributable which I cannot find." You can link C Runtime statically into your application and move your .exe to the server as you wish. Rick Molloy has provided this information on how to link statically in one of his blogs (thanks to Rick): "To do this open the solution explorer pane and right click on your C++ project and select properties. Under the C/C++ ->Code Generation property page under Runtime Library select "Multi-threaded (/MT)" for release builds and "Multi-threaded Debug (/MTd)" for debug builds.  Also note that if you're using MFC in your application on the General Property Page you'll need to change the "Use of MFC" setting to something other than "Use MFC in a Shared DLL." Atilla Gunal Concurrency Runtime Team

  • Anonymous
    May 28, 2010
    parallel_for_each_fixed I tried to use this function from the extras but just replacing the parallel_for_each with this one gives some compile errors regarding expecting 8 arguments, etc. The error is inside ppl_extras. Anybody having the same issue? Thanks, George.

  • Anonymous
    June 27, 2011
    How do I parallelize when I have two iterators that should be incremented and not just one? Say: for(;begin!=end;++begin,++dst) {...}

  • Anonymous
    June 27, 2011
    Hi someoneelse, parallel_transform may cover your scenario. Please find it in the latest sample pack : archive.msdn.microsoft.com/concrtextras /// <summary> ///     This template function is semantically equivalent to <c>std::transform</c>, except that ///     the iteration is done in parallel and ordering is unspecified. /// </summary> parallel_transform(_Input_iterator1 _First1, _Input_iterator1 _Last1, _Input_iterator2 _First2,    _Output_iterator _Result, const _Binary_operator& _Binary_op)

  • Anonymous
    August 22, 2012
    OH....it`s too hard for me..........

  • Anonymous
    November 27, 2012
The "I have converted my for-loops into parallel for-loops. Why doesn't my code work properly?" example is wrong. Concurrency::parallel_for(0, n,    [] (int i)    {        a[i] += a[i-1]; // incorrect!    }); It compiles and runs under the VS2010 compiler.