

Auto-Vectorizer in Visual Studio 2012 – Did It Work?

 

If you’ve not read previous posts in this series about auto-vectorization, you may want to begin at the beginning.

This post will explain how to find out which loops in your C++ program were auto-vectorized.   Here is an example program, stored in a file called Source.cpp, with which to experiment:

  1. int main() {
  2.   const int N = 50;           // array dimensions
  3.   int a[N], b[N], c[N];
  4.   for (int n = 0; n < N; ++n) a[n] = b[n] * c[n];
  5. }

To keep things simple, I have left out the code that initializes arrays b and c.  However, that doesn’t matter for the purpose of this post.

Running from the Command Line

Let’s start by running this program from the command-line (we’ll explain what to do in the Visual Studio IDE in a few minutes).

cl /c /O2 /Qvec-report:1 Source.cpp

This command tells the compiler to compile Source.cpp, but not to go on and link (that’s the /c switch). The /O2 switch tells the compiler to generate code that is optimized for speed.  This is crucial: the auto-vectorizer kicks in only when you enable optimization.  Finally, the /Qvec-report:1 switch tells the compiler to report which loops were successfully vectorized.  (Remember that these command-line switches are case-sensitive: so spell them as shown).  And here is the output:

Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50520 for x64
Copyright (C) Microsoft Corporation. All rights reserved.

source.cpp

--- Analyzing function: main
c:\source.cpp(4) : loop vectorized

This confirms that the loop on line 4 of Source.cpp was indeed vectorized.

Please note that the /Qvec-report:1 switch is not present in the Beta drop of VS 11 from February.  But it will be included in the next public drop, available soon.

The compiler also provides a /Qvec-report:2 switch.  This one tells you which loops were successfully auto-vectorized, and which were not, with a reason code.  Here is another snippet that includes a second loop (on line 5):

  1. int main() {
  2.   const int N = 50;           // array dimensions
  3.   int a[N], b[N], c[N];
  4.   for (int n = 0; n < N; ++n) a[n] = b[n] * c[n];
  5.   for (int n = 1; n < N; ++n) a[n] = a[n-1] + 7;   // start at 1 so a[n-1] stays in bounds
  6. }

And here is the corresponding report:

Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50520 for x64
Copyright (C) Microsoft Corporation. All rights reserved.

source.cpp

--- Analyzing function: main
c:\source.cpp(4) : loop vectorized
c:\source.cpp(5) : loop not vectorized due to reason '1200'

As you can see, the compiler auto-vectorized the loop on line 4 (as before), but failed to auto-vectorize the one on line 5, with a reason code of 1200.  This loop is similar to Example 6 – Backward Dependency that we analyzed in a previous post.  Vectorizing this loop would produce wrong results, and the auto-vectorizer is smart enough to know this.
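Because every report line has the same shape, a build script can check the report mechanically.  The sketch below is one hypothetical way to do that in C++.  The names VecReportEntry and parse_vec_report_line are made up for this illustration, and the code assumes the exact line format shown in the output above, which may change in later drops.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record for one line of /Qvec-report output.
struct VecReportEntry {
    std::string file;         // e.g. "c:\\source.cpp"
    int line = 0;             // source line of the loop
    bool vectorized = false;  // true for "loop vectorized"
    int reason = 0;           // reason code, or 0 if the loop was vectorized
};

// Parses lines shaped like:
//   c:\source.cpp(4) : loop vectorized
//   c:\source.cpp(5) : loop not vectorized due to reason '1200'
// Returns false for anything else (banner text, blank lines, and so on).
bool parse_vec_report_line(const std::string& text, VecReportEntry& out) {
    const std::string sep = ") : loop ";
    const std::size_t close = text.find(sep);
    if (close == std::string::npos) return false;
    const std::size_t open = text.rfind('(', close);
    if (open == std::string::npos) return false;

    VecReportEntry entry;
    entry.file = text.substr(0, open);
    try {
        entry.line = std::stoi(text.substr(open + 1, close - open - 1));
    } catch (...) {
        return false;  // the text between the parentheses was not a number
    }

    const std::string rest = text.substr(close + sep.size());
    const std::string prefix = "not vectorized due to reason '";
    if (rest.rfind("vectorized", 0) == 0) {
        entry.vectorized = true;
    } else if (rest.rfind(prefix, 0) == 0) {
        try {
            entry.reason = std::stoi(rest.substr(prefix.size()));
        } catch (...) {
            return false;  // no numeric reason code after the quote
        }
    } else {
        return false;
    }
    out = entry;
    return true;
}
```

A script could feed each line of the build log through this function and fail the build if a loop it cares about comes back with vectorized set to false.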

Before going on to explain the various reason codes, let’s catch up and explain how to request these results from the Visual Studio IDE.

Running from the IDE

For your project, select the “Release” (rather than “Debug”) configuration. (You can check the project properties to confirm that, under the covers, this sets the /O2 switch, just as we did above from the command-line).

In addition, navigate to “Property Pages”, “Configuration Properties”, “C/C++”, “Command Line”, “Additional Options” and add: /Qvec-report:1.  Here’s a screen-shot:

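[Screenshot: the /Qvec-report:1 switch entered under “Additional Options” on the C/C++ “Command Line” property page]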

The build shown in the screenshot is for x64, but you can equally well choose x86.  Now, whenever you build the project, the output will include a report saying which loops were successfully vectorized.  As with the command-line report, please note that the /Qvec-report:1 switch is not present in the Beta drop of VS11 from February.  But it will be included in the next public drop of VS11, available soon.

Reasons why Vectorization Was Not Possible

Recall that the auto-vectorizer is 100% safe: it will NEVER vectorize a loop if there is the slightest chance the generated code would produce wrong answers – answers different from those implied by the original sequential C++ code.

[NitPick again: what exactly are the answers implied by the sequential execution of a C++ program?  Answer: this is a deep question.  For our tiny examples, we will simply assume the answer is “obvious”.  For the general problem, try a web search for the topic “programming language semantics”]

Ensuring safety requires some pretty deep analysis of the input code.  It turns out that sometimes a loop would actually be safe to vectorize, but the analysis cannot prove it so.  The auto-vectorizer therefore refuses to vectorize that loop.  We say that its judgments are “conservative”.

The warnings from a /Qvec-report:2 run specify any of about 30 reason codes for why a given loop was not vectorized.

The reason codes are discovered and emitted from several layers deep within the compiler.  This can sometimes make it difficult to relate the specific issue back to the original C++ code, several layers above.  For example, the report may be produced from a loop in a function whose body has been inlined into its caller – so the original function, at this point in the analysis, no longer exists!  Bear this in mind as you read the explanations below for each reason code.  We will publish a fuller explanation, with examples, as part of the MSDN documentation – this will guide you on tweaking your code so that it vectorizes.

 

Reason Code   Explanation
500           This is a generic message – it covers several cases: for example, the loop includes multiple exits, or the loop header does not end by incrementing the induction variable
501           Induction variable is not local; or upper bound is not loop-invariant
502           Induction variable is stepped in some manner other than a simple +1
503           Loop includes Exception-Handling or switch statements
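To make the 500s a little more concrete, here are three small loops written to match the descriptions above.  This is only a sketch: the function names are hypothetical, and I have not confirmed which reason code, if any, the compiler actually reports for each one.

```cpp
// Matches the description of 500: the early return gives the loop a second exit.
int find_first_negative(const int* a, int n) {
    for (int i = 0; i < n; ++i) {
        if (a[i] < 0) return i;
    }
    return -1;
}

// Matches the description of 502: the induction variable steps by 2, not by +1.
void double_even_slots(int* a, int n) {
    for (int i = 0; i < n; i += 2) a[i] *= 2;
}

// The plain shape the 500s ask for: a local induction variable, a
// loop-invariant upper bound, a single exit, and a step of +1.
void double_all(int* a, int n) {
    for (int i = 0; i < n; ++i) a[i] *= 2;
}
```

Compiling functions like these with /Qvec-report:2 is the way to find out what the compiler really makes of them.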

 

Reason Code   Explanation
1100          Loop contains control flow – if, ?:
1101          Loop contains a non-vectorizable conversion operation (may be implicit)
1102          Loop contains non-arithmetic, or other non-vectorizable operations
1103          Loop body includes shift operations whose size might vary within the loop
1104          Loop body includes scalar variables
1105          Loop includes a non-recognized reduction operation
1106          Inner loop already vectorized: cannot also vectorize outer loop

 

Reason Code   Explanation
1200          Loop contains loop-carried data dependences
1201          Array base changes during the loop
1202          Field within a struct is not 32 or 64 bits wide
1203          Loop body includes non-contiguous accesses into an array

Reason code 1200 says the loop contains loop-carried data dependences which prevent vectorization.  This means that different iterations of the loop interfere with each other in such a way that vectorizing the loop would produce wrong answers.  More precisely, the auto-vectorizer cannot prove to itself that there are no such data dependences.
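To see the interference concretely, the sketch below compares the sequential meaning of a[n] = a[n-1] + 7 with a hand-written imitation of a 4-wide vector step, in which all four loads happen before any of the four stores.  The function names, the array size of 8 and the lane count of 4 are assumptions made for this illustration; this is not the code the compiler would emit.  The loop starts at index 1 so that a[n-1] stays inside the array.

```cpp
#include <array>

const int N = 8;  // small size chosen for this illustration

// Sequential meaning: each element depends on the one written just before it.
std::array<int, N> run_sequential(std::array<int, N> a) {
    for (int n = 1; n < N; ++n) a[n] = a[n - 1] + 7;
    return a;
}

// Imitates a naive 4-wide vector step: load four a[n-1] values, then store
// four results.  Three of the four loads read values that are now stale.
std::array<int, N> run_naive_vector(std::array<int, N> a) {
    const int W = 4;  // assumed lane count (four 32-bit ints per 128-bit register)
    int n = 1;
    for (; n + W <= N; n += W) {
        int loaded[W];
        for (int k = 0; k < W; ++k) loaded[k] = a[n - 1 + k];  // all loads first
        for (int k = 0; k < W; ++k) a[n + k] = loaded[k] + 7;  // then all stores
    }
    for (; n < N; ++n) a[n] = a[n - 1] + 7;  // scalar remainder
    return a;
}
```

Starting from an all-zero array, the sequential version ends with 49 in the last element, while the imitation ends with 28.  That mismatch is exactly what the auto-vectorizer refuses to risk.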

[NitPick asks: what is this “Data Dependence” thing you keep dragging into the conversation?  Answer: it lies at the heart of vectorization safety, and uses some interesting math – affine transformations and systems of Diophantine equations.  However, no-one commented last time that they wanted more details, so I’ll skip explanations]

 

Reason Code   Explanation
1300          Loop body contains no (or very little) computation
1301          Loop stride is not +1
1302          Loop is a “do-while”
1303          Too few loop iterations for vectorization to be a win
1304          Loop includes assignments that are of different size
1305          Not enough type information

 

Reason Code   Explanation
1400          User specified #pragma loop(no_vector)
1401          /kernel switch specified
1402          /arch:IA32 switch specified
1403          /favor:ATOM switch specified and loop includes operations on doubles
1404          /O1 or /Os switch specified

The 1400s reason codes are straightforward – you specified some option that is just plain incompatible with vectorization.
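As an illustration of 1400, the sketch below marks one loop with #pragma loop(no_vector), the pragma named in the table.  The function name scale_scalar_only is hypothetical, and the _MSC_VER guard is my own addition so that other compilers never see the pragma.

```cpp
// Multiplies each element by factor.  The pragma asks the Microsoft compiler
// to leave this particular loop as scalar code.
void scale_scalar_only(float* data, int count, float factor) {
#ifdef _MSC_VER
#pragma loop(no_vector)
#endif
    for (int i = 0; i < count; ++i) data[i] *= factor;
}
```

Going by the table, a loop marked this way should show up in a /Qvec-report:2 run with reason code 1400.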

 

Reason Code   Explanation
1500          Possible aliasing on multi-dimensional arrays
1501          Possible aliasing on arrays-of-structs
1502          Possible aliasing and array index is other than n + K
1503          Possible aliasing and array index has multiple offsets
1504          Possible aliasing – would require too many runtime checks
1505          Possible aliasing – but runtime checks are too complex

The 1500s reason codes are all about aliasing – where a location in memory can be accessed by two different names.
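Here is a small sketch of what possible aliasing looks like in source code.  Inside add_arrays the compiler cannot tell, from the pointers alone, whether dst overlaps a or b.  The second version uses the __restrict keyword, which is the programmer's promise that the pointers do not overlap.  Both function names are hypothetical, and whether either loop is actually vectorized depends on the compiler and its switches.

```cpp
// As far as the compiler can tell, dst, a and b might refer to overlapping memory.
void add_arrays(float* dst, const float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}

// Same loop, but the caller guarantees that the three arrays do not overlap.
// Breaking that promise makes the program's behavior undefined.
void add_arrays_restrict(float* __restrict dst,
                         const float* __restrict a,
                         const float* __restrict b,
                         int n) {
    for (int i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}
```

Called with three separate arrays, the two functions compute the same result.  The difference lies only in what the compiler is allowed to assume.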

Finally, note that the reason codes listed above apply to this first release of the auto-vectorizer.  Subsequent releases will likely stop emitting many of these warnings, as we make the compiler ‘smarter’ at recognizing more and more loop patterns.

The topic of aliasing cropped up earlier.  The time seems ripe to explain this term – what it means; why it’s a nuisance; how the auto-vectorizer deals with it. Although the alias analysis performed by a compiler is complex, we can explain the nub of the problem, via examples, in just a few paragraphs.  Let’s aim to do that in the next post.

Comments

  • Anonymous
    May 23, 2012
    Just fixed a typo - /arch:ATOM should in fact be /favor:ATOM.  (Thanks to Juan for spotting this)

  • Anonymous
    May 23, 2012
    Is there a pragma or similar to have the compiler emit a warning or error if a specific loop wasn't possible to vectorize? This could be very useful if you have some very performance critical code and want to be sure that it vectorizes as part of the build process. I see a #pragma loop(no_vector) mentioned, but it looks like it isn't yet documented.

  • Anonymous
    May 24, 2012
    @David: No, we don't have a way to "monitor" a given loop (although it has been requested; and we have considered it).  Best alternative is to specify the /Qvec-report and scan the build logs to check for any regression. Documentation for auto-vectorization is still in progress.  This blog gives the most up-to-date info, although we are still tweaking final details (like the exact format of the auto-vectorization warning messages).

  • Anonymous
    May 26, 2012
    Is there a pragma and flag to be able to force vectorization to happen, similar to #pragma vector in Intel compiler? Will you publish a table comparing vector #pragmas/flag alternatives to the Intel Compiler? wiki.duke.edu/.../Intel+Compiler+Optimizations Thanks

  • Anonymous
    May 27, 2012
    @David: No, we are not supporting a way to force vectorization.  As you know, such a flag is suitable only for expert users: if you mis-use, or mis-understand, then the program will give wrong answers. We have no plans to summarize differences between flags/switches/options in the Microsoft compiler, versus others.

  • Anonymous
    May 27, 2012
    Does the __restrict keyword help with vectorization, by indicating when two pointers that could alias are, in fact, guaranteed not to?

  • Anonymous
    May 30, 2012
    @Bruce: Yes, __restrict helps.  In this first release, for auto-vectorization, we benefit by avoiding alias checks.  And it's on the TODO list to make wider use of the __restrict "user guarantee".

  • Anonymous
    May 30, 2012
    What are the rules for function calls in the loop? For example a[i] = f(a);. Does f have to be an intrinsic function like the ones in the math library or are non-intrinsic functions also allowed? What are the rules for the array types? For example let the arrays be: unsigned char a[1000], b[1000];. And the operation: a[i] += b[i];. Would this perform 16 vector additions at a time since the vector registers are 128 bits wide?

  • Anonymous
    May 31, 2012
    @ysitu: Just a follow-up on one detail of my previous answer - specifically, my last sentence "If used beyond the loop, however . . . ".  As your comment points out, the "t = 0" statement, immediately following the loop body, kills that instance of t.  Compiler dataflow analysis catches this fact, and auto-vectorization goes ahead. In passing, many readers may be puzzled by some of the more detailed questions/answers going on in this blog, wondering "what on earth are they talking about"?  If interested, check out the classic text "Advanced Compiler Design & Implementation" by Muchnick.  It provides details on compiler optimizations.

  • Anonymous
    June 01, 2012
    I've been experimenting with the auto-vectorizer. For some reason it works when the loop counter is short, but not unsigned short. Why doesn't it accept unsigned short? loop not vectorized due to reason '1200'

  • Anonymous
    June 01, 2012
    @Chris: This is a restriction with first release.  See the previous post, in the first section, that explains the restrictions: blogs.msdn.com/.../auto-vectorizer-in-visual-studio-11-rules-for-loop-body.aspx Workaround is to stick with int or size_t

  • Anonymous
    June 01, 2012
    @Jim: Alright. I've tested the auto-vectorizer in some scenarios now and I must say it's very good. In some cases I actually get the maximum theoretical speedup or more (for example with floats I even exceeded 4x a little bit). Looking forward to new posts on the vectorizer. :)

  • Anonymous
    July 05, 2012
    Are there plans to replace the codes with the text version so that we don't need to have this blog post handy?

  • Anonymous
    July 10, 2012
    AFAIK the precision of floating point operations is different in SIMD and non-SIMD instructions. Is it somehow handled by auto-vectorizer?

  • Anonymous
    July 11, 2012
    @Bogdan: Answer is a little complicated, as follows: In VS11, the default floating-point model uses XMM registers.  (Previously, the default was x87).  So, a scalar calculation, such as (1.2f + 3.4f) will use the low 32 bits of some XMM register.  And an analogous, vector, SIMD calculation, would use all 128 bits of some XMM register to perform 4 additions in parallel.  So results for instructions such as +, -, * and so on, built into the chip, will be the same, whether the calculation is scalar or vector. However, if you are calculating math functions, such as sin or cos, we have a scalar library, and a vector (SIMD) library.  Results from either library are close, but not identical.  This issue, by coincidence, was raised a week or two back, within the team.  Currently under investigation.

  • Anonymous
    September 30, 2012
    Does alignment affect the vectorizer? In x86 apps the default is 8 byte alignment, but from what I've read, SSE2 works better with 16 byte alignment.  Should we be changing our project settings to 16 bytes even for x86 apps?

  • Anonymous
    October 08, 2012
    @Gary: Sorry for the late response - I was out on vacation last week.  Yes, alignment matters, some.  If, at compile time, we 'know' an array, at runtime, will be 16-byte aligned, then we will generate a vector instruction, such as MOVAPS, that itself assumes the source address is so-aligned.  If the compiler cannot make that determination, then it will emit an instruction, such as MOVUPS, that works on a source address that is 16-byte aligned, or not.  (There's a small perf hit, compared with a raw MOVAPS). The default alignment of an array is that of the element type.  So a static array of floats would only be guaranteed to be 4-byte aligned.  Different if allocated on the heap (8 bytes on x86, I think?).  Maybe different if allocated as a local variable on the stack. Is it worth going thru your code and sprinkling __declspec(align(#)) ?  I am doubtful.  First, your code might involve a loop that steps thru your [float] starting at offset 1, 2, 3, etc which invalidates the intended alignment.  Second, on Nehalem architectures forward, the performance hit for unaligned vector instructions is quite small.  Third, the compiler may lose track of alignment and therefore emit unaligned vector instructions anyway (on the todo list). But I'd be interested in any results if you did choose to experiment with this. Thanks, Jim

  • Anonymous
    October 13, 2012
    Thanks.  I'm afraid I have had very unsuccessful results when using auto-vectorization and auto-parallelization with a large (1 million+ lines of code) project.  I'm curious how many instances in, for example, Office or Windows you got.  My project had only 2 instances of vectorization and no instances of parallelization.

  • Anonymous
    October 21, 2012
    Is it possible that using /Zc:forScope- could be the reason why it's failing so badly to find any candidates?  I noticed that note about the loop counter being local (but it also says "the function" so I thought it would be ok to use forScope- )

  • Anonymous
    October 23, 2012
    @Gary: I meant for the "local" in message 501 to be local-to-that-function, in contrast to global.  Whether the induction variable is scoped to the for loop, or escapes afterwards, should not impact whether the loop can be vectorized.   On the more general point - making the auto-vectorizer hit more loops than it currently does: yes, part of our ongoing work is to make the vectorizer recognize more loop patterns - in effect, to reduce the number of reason codes we emit.  So some of them will simply 'die' as we make the auto-vectorizer smarter.