Jaa


Auto-Vectorizer in Visual Studio 2012 – Rules for Loop Body

 

If you haven’t read previous posts in this series about auto-vectorization, you may want to begin at the beginning.

In the last post, we explained the rules for C++ loop headers that make your code amenable to auto-vectorization; we might say, make it auto-vectorizer-friendly.  Prompted by an online question from NitPick (our astute reader, who seems to enjoy dual citizenship in the blog world, and in the real world), let’s tidy off that discussion with a further, fourth rule:

Loop counter must be of type int or size_t

Consider the following, proto-typical loop: 

  1. for (int n = L; n < U; ++n) a[n] = b[n] + c[n];

This will vectorize.  And likewise, if we change int to size_t, it will also vectorize.  (In the latter case, you would get a C4018 warning that n < U is a signed/unsigned mismatch, unless you were careful to define U as an unsigned type)

       

[NitPick asks:  What about other data types as loop counter?  Answer: we have char, short, int, long and long long; together with their 5 unsigned cousins.  Some of these do indeed inhibit auto-vectorization in this first release, related to the indexing operations a[n] , b[n] and c[n] within the loop body.   You can run the experiment quite easily, testing all 10 cases.  You will find that any loop variable, which is unsigned, and narrower than the platform’s natural integer size, will inhibit vectorization.  So, on x86, where the natural integer is 32 bits wide, unsigned char or unsigned short, don’t work.  On x64, where the natural integer is 64 bits wide, unsigned char, unsigned short, unsigned long don’t work.  (Recall that long is just 32 bits wide).  But it’s simpler to use int or size_t and avoid the headache]

OK, if you follow these four simple rules for the loop header, the auto-vectorizer can start its next step of working out how to vectorize the loop body.  So let’s now explore the patterns of loop body the auto-vectorizer recognizes.

Remember, a couple of posts ago, I left you with a puzzle? – a list of six, one-line loops, and asked you to work out which ones might vectorize and which not?  In fact, the auto-vectorizer hits all but one of them.  Let’s explain the examples, one-by-one.

 

Example 1 – Embarrassingly Parallel

  1. for (int n = 0; n < 1000; ++n) a[n] = b[n] + c[n];

This loop is very simple – in the literature of parallel processing, it’s called “embarrassingly parallel”; I’ve also come across the description “delightfully parallel”.  Each iteration of the loop operates upon memory locations that are different from every other iteration.  There is no overlap, or interference between iterations.  For example, when n = 8, we have a[8] = b[8] + c[8] .  And there is no other value of n for which this loop accesses those same three locations.

[NitPick objects that the array called a could overlap array b in memory.  This would cause iterations to interfere with each other.  This problem – where the same memory locations can be accessed by different names in the source program – is called aliasing.  It’s a bane in the life of every optimizing compiler.  The auto-vectorizer includes runtime checks against such aliasing.  We will explain those in a later post.  For the moment, let’s assume that our examples do not include any aliasing.  Thank you NitPick, for keeping us on the right path]

The auto-vectorizer will transform this loop into its equivalent (we are using the same ‘slicing’ pseudo-code introduced in previous posts):

  1. for (int n = 0; n < 1000; n += 4) a[n:n+3] = b[n:n+3] + c[n:n+3];

This new loop will execute 250 iterations, calculating 4 results each time.  On the first iteration it will calculate a[0:3] = b[0:3] + c[0:3] .  On the next, a[4:7] = b[4:7] + c[4:7] .  And so on.

 

 

Example 2 – Non-Divisible Upper Bound

  1. for (int n = 0; n < 999; ++n) a[n] = b[n] + c[n];

The only change from Example 1 is that the loop’s upper bound is 999, rather than 1000.  It’s not divisible by 4 (remember, we are assuming our array elements are floats or ints, where each SSE vector register can hold 4 at-a-time).  The auto-vectorizer transforms this loop as before, but is smart enough to know it must add scalar additions for the last 3 iterations.  In effect, it transforms the code as follows:

  1. for (int n = 0; n < 996; n += 4) a[n:n+3] = b[n:n+3] + c[n:n+3];
  2. a[996] = b[996] + c[996];
  3. a[997] = b[997] + c[997];
  4. a[998] = b[998] + c[998];

 

 

Example 3 – Calling Math Functions

  1. for (int n = 0; n < 1000; ++n) a[n] = sin(0.01 * n);

This example is interesting because it includes a call to a math library function (sin, in this case). The auto-vectorizer includes vectorized versions of all the math functions defined in “math.h”.  The vector library calculates four sins on each call.  We thereby vectorize this loop too.

 

 

Example 4 – Iota

  1. for (int n = 0; n < 1000; ++n) a[n] = n;

It might appear that this loop would pose a problem for the auto-vectorizer.  Just think for a moment: how would you make use of vector instructions to implement this loop? . . .

In fact, the auto-vectorizer defines a variable, let’s call it seq for the sake of argument, as a short, 4-element array in memory, holding the values [0,1,2,3]

[NitPick asks what this notation means – a list of numbers enclosed in square brackets.  Answer: we just invented it, to simplify our explanations.  It’s not standard C++.  In this blog post, it denotes a list of numbers, stored in consecutive locations – either in memory, or as adjacent fields within a vector register.  NitPick: but, doesn’t this mean that [3] is ambiguous – since it could denote a vector with a single element, or the operation of indexing into element number 3 of an array?  Yes, but we promise never to use a vector that has less than 2 elements in order to avoid this flaw in our notation, honest]

The auto-vectorizer transforms the original loop to look like this:

  1. seq = [0,1,2,3];
  2. XMM0 = [0,0,0,0];
  3. for (int n = 0; n < 1000; n += 4) {
  4.     a[n:n+3] = seq + XMM0;          // [0,1,2,3] + [0,0,0,0]
  5.     XMM0 += [4,4,4,4];              // [4,4,4,4], [8,8,8,8], etc.
  6. }

Line 1 is where our variable seq is set up to hold the four values [0,1,2,3] .  Line 2 sets up the SSE register called XMM0 to hold the vector [0,0,0,0] .  In line 4, we set a[n:n+3] to hold the vector addition of seq with XMM0.  In line 5, we vector-add [4,4,4,4] into XMM0 (there’s an SSE instruction for that!).  So we see that the expression seq + XMM0 will take on successive values [0,1,2,3] , [4,5,6,7] , etc.  Quite neat, right?

[NitPick again: why did you call this example “Iota”? Answer: see iota]

 

 

Example 5 – Forward Dependency  

  1. for (int n = 0; n < 500; ++n) a[n] = a[n+1];

 

Suppose that we have populated the array a by setting a[n] = n throughout.  So a = [0,1,2,3,4…999] . The loop above, run on a single core, with no vectorization, will therefore compute the result a = [1,2,3,4…500] .  Does this vectorize safely?  In other words, does the following pseudo-code calculate the correct answer:

  1. for (int n = 0; n < 500; n += 4) a[n:n+3] = a[n+1:n+4];

Let’s work it out, on paper.  On the first iteration, we calculate a[0:3] = a[1:4] .  In pseudo-code, this step is performed as:

  1. XMM0 = a[1:4] = [1,2,3,4];
  2. a[0:3] = XMM0;

In other words, we copy the values of a[1:4] into XMM0, then write those 4 values into a[0:3] .  Similarly for the next iteration.  A little doodling with pencil and paper should convince you that this transformation is safe.  The auto-vectorizer reaches the same conclusion and performs this transformation.

Note that this calculation involved dependence between two different iterations of the same loop.  To explain what this means, consider the original loop, and its first two iterations:

  1. a[0] = a[1];   // original
  2. a[1] = a[2];   // original

Notice how a[1] is read on the first iteration, and then updated on the second.  So we have to perform the calculations in this order.  We cannot switch their order like this:

  1. a[1] = a[2];   // switched
  2. a[0] = a[1];   // switched

Because this would result in making a[1] = 2 and then also making a[0] = 2.  There is dependence between the original first two iterations (in fact, between each iteration and its neighbor): the second original iteration Writes the value of a[1] after the previous original iteration has Read that same value.  Hardware engineers call this a Write-After-Read, or WAR, dependence.  Compiler writers call it an anti-dependence.  In the same way that we cannot switch the order of these two steps and get the right answer, neither can we run them in parallel on different processors.  But we can run them in parallel in a vector register.  We will touch upon the subject of dependence more, in a future post.  For the moment, just be grateful that the auto-vectorizer takes care of these worries on your behalf.

 

 

Example 6 – Backward Dependency

  1. for (int n = 1; n < 500; ++n) a[n] = a[n-1];

This example differs from the previous only in that the right hand side of the assignment indexes [n-1] rather than [n+1] .   The consequences are profound: this one does not vectorize safely!  I’ve brought it into the picture, in order to break the illusion that the auto-vectorizer can conquer any old loop it comes across in your code.

Let’s look at what the simple scalar loop does.  Again we’ll assume that we have populated a = [0,1,2,3,4…999] .  We get the sequence of assignments:

Iteration Assignment a, after assignment
    [0, 1, 2, 3, 4, 5 ... 999]
First a[1] = a[0] [0, 0, 2, 3, 4, 5 ... 999]
Second a[2] = a[1] [0, 0, 0, 3, 4, 5 ... 999]
Third a[3] = a[2] [0, 0, 0, 0, 4, 5 ... 999]

So the loop will set all the elements 0 thru 499 to the value of a[0] .  A strange thing to do perhaps, but that is what the original code specified.  What would happen if we blindly vectorize this loop? – if we transform it to become:  

  1. for (int n = 1; n < 500; ++n) a[n] = a[n-1];

On the first iteration, we calculate a[1:4] = a[0:3] .  Recall that, in pseudo-assembly-code, this step is performed as:

  1. XMM0 = a[0:3];  // [0,1,2,3]
  2. a[1:4] = XMM0;    // [0,1,2,3]

So, starting with a = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14...999] , we obtain, after this first vectorized iteration, a = [0,0,1,2,3,5,6,7,8,9,10,11,12,13,14...999] .  Already our answer differs wildly from the correct answer, which, at this stage, should be a = [0,0,0,0,0,5,6,7,8,9,10,11,12,13,14...999] .  The following diagram depicts what’s going on:

         

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 0 1 2 3 5 6 7 8 9 10 11 12 13 14

       

The first row depicts the original a – the ‘before’picture.  The 4-element vector [0,1,2,3] shown in bold-red, is used to stomp over the original entries for a[1:4] , yielding the second row – the after picture.  The answer is already wrong.  And it just gets worse as we complete successive vector iterations.  Here we see the next step in the evolving carnage:

0 0 1 2 3 5 6 7 8 9 10 11 12 13 14
0 0 1 2 3 3 5 6 7 9 10 11 12 13 14

Anyhow, you get the idea: if the auto-vectorizer were to blindly vectorize this loop, it would produce wrong answers.  So it doesn’t!  Auto-vectorization is safe – it will never vectorize a loop if that would produce wrong answers – answers that differ from the single-thread, scalar calculation implied by the original C++ code.

       

At this point, you might ask how the auto-vectorizer can examine a C++ loop, and tell whether it’s safe to vectorize.  However, that would lead us into a topic called “loop dependence analysis”.  Not sure whether anyone is interested.  Please add a comment below if you’d like a post explaining what’s going on.

Meantime, let’s keep with practical concerns: in the next post, I’ll explain how you find out which loops the auto-vectorizer vectorized, and which not.

Comments

  • Anonymous
    May 08, 2012
    Hi Jim, Re: Example 4 Indeed this is code you generate, but why not generate a smaller (on my sandy bridge %10 faster) code? seq = [0,1,2,3]; XMM0 = [0,0,0,0]; for (int n = 0; n < 1000; n += 4) {    a[n:n+3] = seq;              seq += XMM0 } Note: one less addition

  • Anonymous
    May 08, 2012
    Oops, was too quick to post Should read XMM0 = [4,4,4,4] However the code I see generated in VS11 beta is actually different than both of these version. My version is still 10% faster

  • Anonymous
    May 08, 2012
    re: Example 5 Didn't vectorize until I changed a from vector<int> to int * Is this a beta issue? Thanks

  • Anonymous
    May 08, 2012
    Q: Do all array accessed need to be in the form array[index]? It seems that if a reference is involved then vectorization doesn't happen. For example: for (size_t i = 0; i < length; ++i) { // this is vectorized entries[i].fileName += offset; entries[i].objName += offset; } for (size_t i = 0; i < length; ++i) { // this isn't DebugEntry &entry = entries[i]; entry.fileName += offset; entry.objName += offset; } Where DebugEntry is struct DebugEntry { char *fileName, *objName; }, entries is DebugEntry * and length is size_t.

  • Anonymous
    May 08, 2012
    The comment has been removed

  • Anonymous
    May 09, 2012
    Hi Jim, Example 3: How would define the /fp:precise with respect to vectorization. I would define as: the values produced in the loop are binary equal regardless of whether the loop is vectorized or not. With that definition: We seems to agree that sum += a[i] won't be vectorized without /fp:fast However we seem to disagree on example 3. Special vector functions used by vectorizer traditionally trade ulps (sometimes as little as 0.5ulp) for speed. My point is that I would count Example 3 being vectorized under /fp:precise as a bug in VS11 beta. p.s. sorry of all this dump here. feel free to delete my comments

  • Anonymous
    May 09, 2012
    @NitPick: The machine code for Example 4 (Iota) was a simplified picture.  Below is the actual code.  This is taken from an x64 Release build, by setting and hitting a breakpoint in the loop, and looking at the "Disassembly" window: int main() {  movdqa  xmm1,xmmword ptr [__xmm@00000003000000020000000100000000]    xor         eax,eax              // n = rax = 0  lea         rcx,[a]                                 // rcx = &a[0]  nop        word ptr [rax+rax]              // 15-byte nop, for padding  movd        xmm0,eax               // xmm0 = [0,0,0,0]    add         eax,4               // n = eax += 4    lea          rcx,[rcx+10h]               // rcx += 16    (&a[n])    pshufd    xmm0,xmm0,0               // xmm0 = xmm00    paddd     xmm0,xmm1               // xmm0 += xmm1    movdqu  xmmword ptr [rcx-10h],xmm0    // a[n:n+3] = xmm0    cmp       eax,3E8h               // eax < 1000?  jl          main+20h             // yes, so loop } I've added comments to say what each line does.  Notice that "xor eax,eax" actually clears all bits in rax.  The notation xmm00 refers to the low 32 bits of the xmm0 register - just like in the "Registers" window. Can you reply with the machine code (or intrinsics) in your faster version, please?

  • Anonymous
    May 09, 2012
    Re: Example 3, do namespace qualifications have an impact? E.g., if I include cmath instead of math.h and use std::sin instead of sin, does it still vectorize?

  • Anonymous
    May 09, 2012
    @NitPick: Example 5 - right, vector<int> does not vectorize, but int[] does.  We have not fixed this in the next drop, I'm afraid.

  • Anonymous
    May 09, 2012
    @ildjarn: Nope, it doesn't matter.  If you look inside cmath, you can see that it says #include <math.h> [For the Beta drop, you will find the sources for the CRT in the folder: Program Files (x86)Microsoft Visual Studio 11.0VCcrtsrc )

  • Anonymous
    May 09, 2012
    The comment has been removed

  • Anonymous
    May 09, 2012
    @Jim re: faster code for Iota Here is the version that actually compiles and runs in VS2011 (think in VS2010 as well) __m128i step = _mm_set_epi32(4,4,4,4) __m128i curv = _mm_set_epi32(3,2,1,0) __m128i *pa = (_m128i *)a; for(int i < 0; i < size/4; i++) {   _mm_storeu_si128(pa+1, curv);  curv = _mm_add_epi32(curv, step); } This loop inside of another loop run in 570ms The VS2011 loop runs in 690ms Now aligning data by optionally dealing with the first unaligned ints manually Results in 530ms Yes, this is on sandy bridge CPU and aligned writes are still very important. So, I am wrong it's more like 30% difference from what I can do by hand. Way better than VS2010 would produce though.

  • Anonymous
    May 09, 2012
    The comment has been removed

  • Anonymous
    May 09, 2012
    re: /fp:fast Simple test (assume a and b are vector<double> for (int n = 0; n < size; ++n)  a[n] = sin(double(n)); #pragma loop(no_vector) for (int n = 0; n < size; n++)  b[n] = sin(double(n)); First loop calls ___vdecl_sin2 (which forwards to ___sse4_sin2 on my machine) Second loop calls __libm_sse2_sin_precise Now, I am testing if there is any difference for (int n = 0; n < size; n++)  if(a[n] != b[n])  .... There are plenty. For example for n = 4,10,11,14..... They all differ in the last bit (great, that the 0.5 ulp I mentioned before) I will try to check if in fact second version is more accurate. My installation of vs11 beta doesn't have source for those functions, to easily check the source of the differences... But, it's not even my point. My main point is that vectorization changed the result. To me that should not be allowed under precise model.

  • Anonymous
    May 09, 2012
    @JimHogg What about reduction loops?   You've shown in the Channel9 interview that this vectorizes:    for(int i=0; i<N; ++i)    sum += a[i]; but I've tried and saw that this doesn't:    for(int i=0; i<N; ++i)    sum += a[i]*b[i]; This is not a contrived example - most linear algebra operations consist in such sums of products (vector-vector dot product, matrix-vector product, matrix-matrix product..).   Vectorizing these could be a huge performance benefit. I also tested that Intel compiler successfully vectorizes this particular loop. Are there rules for which reductions currently vectorize and which do not? thanks, -Ofek

  • Anonymous
    May 09, 2012
    @NitPick:  re /fp:fast First, to put things in perspective for our readers, we are talking here of a difference in the very last bit of a double's 53-bit significand. Next, you proposed that /fp:fast is required if the vector result is not bit-for-bit identical with the scalar /fp:precise.  This, I would submit, is wrong.  I would argue that the obvious, and correct criterion is: the scalar /fp:precise and vector results must be "equally accurate".  By "equally accurate" I mean that if the first lies within plus-or-minus N bits of the infinite-precision result  - then so must the second. Your measurements show that results of sin(x) for scalar /fp:precise and vector differ by at most 1 bit in 53.  This is an astounding level of consistency!  In fact, because the N-bit tolerance spans either side of the mathematical result, it's likely that some of the vector results lie closer to the correct result than the scalar /fp:precise.  I therefore re-assert that this example of vectorization does NOT need to seek the "protection" of /fp:fast. [However, we are in agreement that the "sum" example is only legitimate under /fp:fast, because the algorithm assumes associativity of floating-point addition.  I guess I should explain this in a later post] [In passing, it's just too tempting to contrast the difference between these mathematical results, and the reality of the phyiscal world we inhabit: an accuracy of 1 bit in 53 corresponds to around 1 in 10^16.  This corresponds to measuring the distance between Seattle and New York to within about 4 Angstrom - the same scale as the diameter of an atom.  And it exceeds, by many orders of magnitude,  mankind's best endeavours to measure experimentally, the fundamental physical constants such as G, the electron mass, or the fine structure constant]

  • Anonymous
    May 10, 2012
    I wonder if it would be more efficient to use 'filling' elements when the upper bound is non-divisible. Then scalar additions would not be necessary. But perhaps setting up those filling elements is more expencive than those 1 to 3 scalar additions.

  • Anonymous
    May 10, 2012
    @Chris: Sorry, I don't understand the "filling" elements suggestion.  Can you explain a little more, please?

  • Anonymous
    May 10, 2012
    My question was based on this sentence: Example 2 - "The auto-vectorizer transforms this loop as before, but is smart enough to know it must add scalar additions for the last 3 iterations." Would it be more efficient to still use vector additions for those last 3 iterations? It would be possible to fill the vector registers with only 3 values, and make the 4th value be a filling value. Later those filling values can just be discarded.

  • Anonymous
    May 10, 2012
    @OfekShilon: Hi Ofek.  I enjoyed your blog on the Auto Vectorizer.  Here's the cross-link for readers - hope you don't mind. thetweaker.wordpress.com/.../a-day-with-vs11-beta-part-2-auto-vectorizer On auto-vectorizing the "dot-product" example - yes, we do get that one. For the int case, it just works - well, so long as you prevent Dead-Code-Elimination (DCE) optimization from throwing away the entire computation! For the float case, we do NOT auto-vectorize for the default /fp:precise setting.   You need to throw /fp:fast, and then we DO auto-vectorize. Here is a snippet that demonstrates.  The two "cout" calls keep the results "live" and prevent DCE. [I checked this on latest bits.  I think it worked on Beta bits too, but I've not checked back] int inta[1000], intb[1000]; float floata[1000], floatb[1000]; int main() {  int intsum = 0;  for (int n = 0; n < 1000; ++n) intsum += inta[n] * intb[n];  cout << intsum << endl;  float floatsum = 0.0f;  for (int n = 0; n < 1000; ++n) floatsum += floata[n] * floatb[n];  cout << floatsum << endl; }

  • Anonymous
    May 10, 2012
    @OfekShilon Works in beta as well and as Jim said your loop probably didn't vectorize because your a/b vectors are double/float and floating point model is not /fp:fast For reduction you need that switch @Jim Speaking of that loop. Compiler produces sub-optimal results there as well. It could be something like 40% faster The issue is that compiler generated mulps follow by addps. Normally CPU has 2 different ports one that deals with mulps and one to deal with addps. So you can achieve 0.5 CPI. The problem is that in VS11 the addps can proceed because it's waiting on the results of mulps. Would you like me to post the suggested code using intrinsics?

  • Anonymous
    May 10, 2012
    The comment has been removed

  • Anonymous
    May 10, 2012
    @Chris: OK, I understand now.   On the final iteration, we would need to store the "remnant" - the 1, 2 or 3 elements from the XMM register - into memory.  We cannot store 4 values into memory, then tidy up the mess, of course, since that might just have overshot the array bounds.   Unfortunately storing 1, 2 or 3 elements within the SSE instruction set is a little messy.  For a remnant of 1 use MOVSS.  For a remnant of 2 use MOVLPS.  For a remnant of 3, we would need to shuffle register contents before storing. So, overall, it's simpler to run a scalar loop on the "remnant".  But looking to the future, the AVX instruction set might make your approach easier - and important for loops with a small trip-count.

  • Anonymous
    May 10, 2012
    @Jim   I seriously hope you were kidding with your physics example. While for the final result the last bit may not be important it's accuracy of intermediate calculations that we are often concerned with.  Imagine I have 2 loops and I compute sin one was vectorized and the other wasn't Now I do (a[i] - b[i]) * 10e20 and what was supposed to be zero is now a very large difference!  People sometimes don't want FMA. Even though  FMA gives you better accuracy in the last bit. Even MS compiler supported #pragma fp_contract off   In my mind the entire goal of adding /fp:model is to create a mode for consistent results

  • Anonymous
    May 10, 2012
    @NitPick: Yes, please post your suggested code patterns.  Eric Brumer, one of the Devs on the auto-vectorizer team, will take a look.  Thankyou!

  • Anonymous
    May 10, 2012
    The comment has been removed

  • Anonymous
    May 13, 2012
    Why are you so picky on the type of the iteration variable?  I fear that other compilers will have other, incompatible requirements. Auto vectorization may also fail with boost foreach (no idea what type it uses).

  • Anonymous
    May 14, 2012
    @Bruno: It just came down to what the team could get done in the time available.  We have a long list of additional work to tackle for the next release, extending the patterns we cope with, and filling in annoying little holes, like restrictions on the type of the induction variable.

  • Anonymous
    May 14, 2012
    @Jim re: for(int i=0; i<N; ++i)    sum += a[i]*b[i]; Assuming pa and pb are "redefined" as __m128 *pa = (__m128 *)(a); __m128 *pb = (__m128 *)(b); VS11 generates roughly this code (re-written with intrinsics) __m128 psum = _mm_setzero_ps(); int loopSize = size / 4; for (int i = 0; i < loopSize; i++) {   __m128 p = _mm_mul_ps(pa[i], pb[i]);   psum = _mm_add_ps(psum, p); } // Eat remainder and combine the results Note, I am still at the mercy of compiler when it comes to register selection, re-use etc. when I use intrinsic functions. VS11 did a superb job for me these loops  (I recall VS2010 before SP1 didn't, in case someone wants to try with older compiler) Next step is to unroll the loop (this will start processing 8 elements at a time for floats) Note: there is absolutely NO performance improvement when I'll do that. for (int i = 0; i < loopSize; i+=2) {  __m128 p = _mm_mul_ps(pa[i], pb[i]);   psum = _mm_add_ps(psum, p);   p =_mm_mul_ps(pa[i+1], pb[i+1]);   psum = _mm_add_ps(psum, p); } Compiler nicely fold constants offsets (+1) . And finally the important step I mentioned above __m128 psum1 = _mm_setzero_ps(); __m128 psum2 = _mm_setzero_ps(); int loopSize = size / 4; for (int i = 0; i < loopSize; i+=2) {  __m128 p = _mm_mul_ps(pa[i], pb[i]);  psum1 = _mm_add_ps(psum1, p);  p =_mm_mul_ps(pa[i+1], pb[i+1]);  psum2 = _mm_add_ps(psum2, p); } psum1 = _mm_add_ps(psum1, psum2); // the remainder of the code is now almost the same as in the first loop Now, the loop contains exactly the same instruction one by one as the unrolled version! Except that it uses one more register and multiplication and additions are now independent. As the reslult CPI is now at 0.34 (much better) and the loop is now 40% faster (for my CPU/array sizes etc...)

  • Anonymous
    May 15, 2012
    @NitPick: Thankyou.  Disappointing there was no improvement on unrolling the vectorized loop - maybe upping the unroll to 4 or 8 might see some win?   Anyhow, I will pass this info on to Eric, and the rest of the Dev team, as we plan work for Dev12.

  • Anonymous
    May 16, 2012
    @Jim The fact that there was no improvement for just unrolling is a good thing. CPU works really hard to "unroll" your loop on its own. Things like register renaming and loop detection play a big role here. In fact in the compiler you do NOT want to unroll for modern CPUs where it's no needed as this will only pollute cache and sometimes defeat loop detector etc... If there would be no data dependency you would not want to unroll this loop. I showed the unrolled loop to emphasize that it is not unrolling but rather data dependency that is a problem

  • Anonymous
    May 17, 2012
    Does VC11 generate SSE4.1's BLENDPS/BLENDVPS and their double-precision/AVX variants?

  • Anonymous
    May 18, 2012
    @NitPick: Eric has kindly offered to add a post that gives a detailed analysis of the loop performance.  (May be a week or two until he clears urgent work on VS11)

  • Anonymous
    May 18, 2012
    @ysitu: Not sure whether we emit those particular SSE4.1 instructions.  (We don't emit any AVX instructions in this release) In the blog so far, we haven't dug down into the actual machine code instructions the compiler emits.  But it's perhaps worth explaining to folks that the SSE set is not a small addition by Intel or AMD!  SSE adds around 230 instructions to the already huge x86/x64 set.

  • Anonymous
    June 16, 2012
    Example 6: I am assuming that it can vectorize this: for (int n = 0; n < 1000; ++n) a[n] = x; and if it can, then shouldn't it be able to vectorize this: for (int n = 1; n < 500; ++n) a[n] = a[n-1]; by converting it to for (int n = 1; n < 500; ++n) a[n] = a[0]; Is this just a "for the future" case?

  • Anonymous
    December 21, 2014
    The comment has been removed

  • Anonymous
    January 13, 2015
    How to do a Loop in a loop in a loop sir.... JUST COMMENT ME BACK....

  • Anonymous
    January 13, 2015
    @Jim Hogg Sir post in the comment....the loop in a loop in a loop

  • Anonymous
    October 02, 2015
    This would seem to be an example of Embarrassingly Parallel, but it won't vectorize for me. Anyone shed some light as to why? __declspec( align( 16 ) ) class PhysicsSystem { public:    static const int32_t MaxEntities = 65535;    __declspec( align( 16 ) ) struct VectorizedXYZ    {        double      mX[ MaxEntities ];        double      mY[ MaxEntities ];        double      mZ[ MaxEntities ];        VectorizedXYZ()        {            memset( mX, 0, sizeof( mX ) );            memset( mY, 0, sizeof( mY ) );            memset( mZ, 0, sizeof( mZ ) );        }    };    void Update( double dt )    {        for ( int32_t i = 0; i < MaxEntities; ++i ) <== 1200        {            mTmp.mX[ i ] = mPos.mX[ i ] + mVel.mX[ i ] * dt;            mTmp.mY[ i ] = mPos.mY[ i ] + mVel.mY[ i ] * dt;            mTmp.mZ[ i ] = mPos.mZ[ i ] + mVel.mZ[ i ] * dt;        }    } private:        VectorizedXYZ   mTmp;    VectorizedXYZ   mPos;    VectorizedXYZ   mVel; };