Auto-Vectorizer in Visual Studio 2012 - Overview

This post introduces and explains a new compiler feature in Visual Studio 2012 called auto-vectorization. (A previous post, "what is vectorization?", provides background.)

The Visual Studio 2012 auto-vectorizer tries to make loops in your code run faster by automatically vectorizing them, that is to say, by using the SSE instructions available in all current mainline Intel and AMD chips. Auto-vectorization is on by default. You don’t need to request this speedup. You don’t need to throw a compiler switch. You don’t need to set environment variables or registry entries. You don’t need to change your C++ code. You don’t need to insert #pragmas. The compiler just goes ahead and does it. It all comes for free.

So, in a sense, you don’t really need to know anything about auto-vectorization: simply recompile your program, using Visual Studio 2012.  Of course, this is an optimization, so you need to choose a “release” configuration (/O2 for command-line junkies) rather than a “debug” configuration (/Od) to see the benefit. The compiler will analyze your loops and generate code that makes them run faster.
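
As a rough illustration of the kind of loop the compiler looks at, here is a minimal sketch. The function name `add_arrays` and its parameters are made up for this example, and whether any particular loop actually gets vectorized depends on the compiler's own analysis.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: element-wise addition over contiguous float arrays.
// Each iteration is independent of the others, which is the kind of loop an
// auto-vectorizer can typically map onto SSE instructions that process four
// floats at a time. Build in a Release configuration (/O2) to enable it.
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Nothing in the source asks for vectorization; the loop is ordinary C++ and behaves the same whether or not the compiler vectorizes it.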

The following list gives you an idea of the kind of topics on auto-vectorization we will be covering in future blog posts:

Stay tuned, and if you have suggestions for other topics, please feel free to comment below.

Comments

  • Anonymous
    April 16, 2012
    C++ only or .NET CLR code as well?

  • Anonymous
    April 16, 2012
    Given that the name of this blog is Parallel Programming in Native Code, I would expect the blog authors to really only be familiar with the native side of things.  You would probably have to ask the CLR team whether their engine performs this type of optimization.

  • Anonymous
    April 17, 2012
    @Nick - At the moment, auto-vectorization is implemented only in the native C++ compiler. @SimonRev - In fact, the C++ team and the CLR JIT team know each other quite well. Several members of each team have, at some time, worked in the other (including myself). That said, I am not aware that the CLR team has imminent plans to support auto-vectorization in their JIT.

  • Anonymous
    May 02, 2012
    Is it possible to specify (for example with a pragma statement) that the compiler should emit a warning if a critical for-loop suddenly cannot be auto-vectorized? Then one could catch casual maintenance changes that suddenly ruin performance.

  • Anonymous
    May 05, 2012
    @Rolf Kristensen: We've thought about this one - assertions on results of the compilation process. One problem would be how many such conditions to support - e.g., asserts on constant-folding, constant-prop, CSE, DCE, invariant-code motion, auto-vectorization, etc. Instead, we have a /Qvec-report flag to report which loops were vectorized - so comparing build logs would be a route to discovering why performance dropped in any build. (That's not in the Beta drop - it was added since.)
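
To get a feel for what such a report distinguishes, one could compile a small file containing one loop that is a good candidate and one that is not. The sketch below is only illustrative: the function names and the file name `loops.cpp` are hypothetical, and the report's exact wording is not reproduced here. `/Qvec-report:1` lists the loops that were vectorized, and `/Qvec-report:2` also lists the loops that were not, with a reason code.

```cpp
#include <cassert>
#include <cstddef>

// Assumed command line: cl /O2 /Qvec-report:2 /c loops.cpp

// Independent iterations: a likely candidate for vectorization.
void scale(float* data, std::size_t n, float factor)
{
    assert(n == 0 || data != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// Each iteration reads the value written by the previous one (a loop-carried
// dependence), so the iterations cannot simply be executed four at a time.
void prefix_sum(float* data, std::size_t n)
{
    assert(n == 0 || data != nullptr);
    for (std::size_t i = 1; i < n; ++i) {
        data[i] += data[i - 1];
    }
}
```

Saving the report from a known-good build and diffing it against later builds is one way to notice that a loop has stopped being vectorized.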

  • Anonymous
    May 10, 2012
    @Rolf Kristensen: I guess one solution is performance unit tests, where critical operations are profiled/timed automatically, just as functionality is tested automatically. It would be nice if unit testing tools had better support for this, so that one could mark tests as performance tests, and any decrease in performance would then be reported when the tests were run.
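
A minimal sketch of such a timing test, assuming a hypothetical kernel `saxpy` and a budget chosen by the test author. A real test would need warm-up runs, repeated measurements and a budget calibrated for the build machine; none of that is shown here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical kernel under test: y = y + a * x.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += a * x[i];
    }
}

// Runs the kernel once and returns the elapsed wall-clock time in milliseconds.
double time_saxpy_ms(float a, const std::vector<float>& x, std::vector<float>& y)
{
    const auto start = std::chrono::steady_clock::now();
    saxpy(a, x, y);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// The "performance assertion": true when the measurement fits the budget.
bool within_budget(double elapsed_ms, double budget_ms)
{
    return elapsed_ms <= budget_ms;
}
```

A test runner could call `time_saxpy_ms` on a large input and fail the build when `within_budget` returns false.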

  • Anonymous
    May 19, 2012
    Hi Jim, in the Ch9 interview about auto-vectorization you mentioned compile time and data alignment issues as optimization limiting constraints.

  1. Compile time: Currently, we already have "Debug" and "Release" configurations, the first with little or no optimization. I wouldn't mind if the compiler took hours to complete at the highest optimization level (the Release config is built only overnight).
  2. Data alignment: As for structures, I learned that the compiler silently inserts padding bytes (unless disabled using pragma pack). As for large data chunks, many years ago, I allocated extra space using VirtualAlloc and did some pointer maths in order to provide aligned data chunks. So I would also not mind if the compiler (or library functions) would always silently provide aligned data chunks (unless disabled somehow).

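
The "extra space plus pointer maths" approach described in point 2 can be sketched portably as follows. This is an assumption-laden illustration: it uses `std::malloc` rather than `VirtualAlloc`, and the names `AlignedBuffer`, `allocate_aligned_floats` and `free_aligned_floats` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper type: keeps the raw block (needed for freeing) next to
// the aligned pointer that the computation actually uses.
struct AlignedBuffer {
    void*  raw;   // pointer returned by malloc; this is what gets freed
    float* data;  // first 16-byte-aligned address inside the raw block
};

AlignedBuffer allocate_aligned_floats(std::size_t count)
{
    const std::size_t alignment = 16;  // an SSE register is 16 bytes wide
    AlignedBuffer buf = { nullptr, nullptr };

    // Over-allocate so an aligned start address always fits in the block.
    buf.raw = std::malloc(count * sizeof(float) + alignment - 1);
    if (buf.raw == nullptr) {
        return buf;
    }

    // Round the address up to the next multiple of the alignment.
    const std::uintptr_t mask = alignment - 1;
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(buf.raw);
    const std::uintptr_t aligned = (address + mask) & ~mask;
    assert(aligned % alignment == 0);

    buf.data = reinterpret_cast<float*>(aligned);
    return buf;
}

void free_aligned_floats(AlignedBuffer& buf)
{
    std::free(buf.raw);
    buf.raw = nullptr;
    buf.data = nullptr;
}
```

With the Microsoft C runtime, `_aligned_malloc` and `_aligned_free` do this bookkeeping for you.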
  • Anonymous
    June 01, 2012
    @Frank:
  1.  Interesting idea.  I would guess that the Release build moves results pretty far up the flattening curve of code-quality - the usual curve of "diminishing returns", but I don't think we have studied that recently.
  2.  We are investigating what you describe.  Sometimes, the auto-vectorizer could peel off a few scalar iterations from the loop, leaving the remainder on an aligned boundary.  Where this does not work, and depending upon buffer size, we could silently copy to an aligned buffer to help vectorization.

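
The peeling idea in point 2 can be illustrated by hand. This sketch is not the compiler's implementation; it only shows the shape of the transformation, with hypothetical function names: scalar iterations run until the pointer reaches a 16-byte boundary, a main loop then works on aligned blocks of four floats, and a scalar loop handles the leftovers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of leading floats to process one at a time before `p` reaches a
// 16-byte boundary (never more than n).
std::size_t peel_count(const float* p, std::size_t n)
{
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t misalignment = address % 16;
    // Peeling only works if the pointer is at least float-aligned.
    assert(misalignment % sizeof(float) == 0);
    if (misalignment == 0) {
        return 0;
    }
    const std::size_t peel = (16 - misalignment) / sizeof(float);
    return peel < n ? peel : n;
}

void scale_peeled(float* data, std::size_t n, float factor)
{
    std::size_t i = 0;

    // Peeled scalar iterations up to the aligned boundary.
    for (const std::size_t peel = peel_count(data, n); i < peel; ++i) {
        data[i] *= factor;
    }

    // Main loop: data + i is 16-byte aligned here, and each trip covers four
    // floats, the group a single SSE instruction would process.
    for (; i + 4 <= n; i += 4) {
        data[i]     *= factor;
        data[i + 1] *= factor;
        data[i + 2] *= factor;
        data[i + 3] *= factor;
    }

    // Remainder: fewer than four elements left.
    for (; i < n; ++i) {
        data[i] *= factor;
    }
}
```

The result is identical to a plain scalar loop; only the order in which the work is grouped changes.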
  • Anonymous
    October 23, 2014
    When I tried matrix multiplication, the innermost loop was not vectorized. The reason given was 1203: "Loop body includes non-contiguous accesses into an array". Here's the code, where A, B and C are two-dimensional float arrays:

        for (size_t i = 0; i < N; ++i) {
            for (size_t k = 0; k < N; ++k) {
                for (size_t j = 0; j < N; ++j) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }

    Memory access in that loop is contiguous, but in terms of the second coordinate. However, when I turned those two-dimensional arrays into one-dimensional arrays for the innermost loop, it did get vectorized:

        for (size_t i = 0; i < N; ++i) {
            float* c = C[i];              // row i of C
            for (size_t k = 0; k < N; ++k) {
                float a = A[i][k];        // constant during the inner loop
                float* b = B[k];          // row k of B
                for (size_t j = 0; j < N; ++j) {
                    c[j] += a * b[j];
                }
            }
        }

    How about auto-vectorizing loops over multidimensional arrays by the last coordinate, as in this case, when all other coordinates are constant during the loop?

  • Anonymous
    February 26, 2015
    Interesting, but what CPU targets are used? Which version of the SIMD instruction set is used, does the compiler emit different code for differently supported CPU features, or does it use the lowest-common-denominator?