RyuJIT CTP4: Now with more SIMD types, and better OS support!

Hi, folks. It’s been a busy month around here. We’ve been working on all sorts of stuff that I can’t talk about right now, but in the meantime, we’ve also been responding to feedback on the SIMD types. So, since it’s busy, I’m just going to list off the details, and link to other places for more information.

  1. Probably the biggest news is that if you install the 4.5.2 runtime (check the .NET blog for details on that), you can use RyuJIT CTP4 on Windows Vista, 7, and 8, as well as Windows Server 2008, 2008 R2, and 2012. In the CTP1 FAQ, I made mention that 4.5.1 on “downlevel” OS’es looked different from a code generation perspective. Well, that’s been addressed in the 4.5.2 update, so we’re happy to support RyuJIT CTP4 across all platforms that support 4.5.2.
  2. Nearly all the available Vector<T> types are now accelerated! The only ones missing are Vector<uint> and Vector<ulong>. In addition, there are a handful of other methods that we are now accelerating, including the CopyTo() method, which means any performance you have measured is now completely invalidated! Wait, no I mean, any performance you measured could potentially be faster!
  3. The fixed-size vector types are all mutable, now. This was the single biggest piece of feedback we received, so we took it.

There you have it. For now, you can download the CTP4 bits from here. The BCL SIMD NuGet package has also been updated, so update that, and you should be good to go. Same directions as before for how to use the types, enable RyuJIT, all that stuff. As always, send feedback to ryujit@microsoft.com. Happy RyuJIT-ing!
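For readers who have not tried the types yet, here is a minimal sketch of what using an accelerated Vector<T> together with CopyTo() might look like. It is written against the current System.Numerics API, which is an assumption: the CTP-era NuGet package may differ in details (a comment below, for instance, refers to Vector<int>.Length rather than Count). The class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
static class VectorAddSketch
{
    // Adds two equal-length float arrays element-wise,
    // processing Vector<float>.Count lanes per iteration.
    public static float[] Add(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Arrays must have the same length.");

        var result = new float[a.Length];
        int width = Vector<float>.Count; // lane count depends on the hardware and JIT
        int i = 0;

        // Vectorized main loop: load, add, store.
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.Length; i++)
            result[i] = a[i] + b[i];

        return result;
    }
}
```

Calling VectorAddSketch.Add with {1, 2, 3, 4, 5} and {10, 20, 30, 40, 50} yields {11, 22, 33, 44, 55} regardless of the vector width, because the scalar tail handles whatever the vector loop leaves over.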

-Kev

Comments

  • Anonymous
    May 12, 2014
    Looking forward to testing out CTP4.
  • Anonymous
    May 12, 2014
    I installed CTP4 on Windows 7. My application runs fine normally. After enabling RyuJIT with set COMPLUS_AltJit=*, my application crashes with System.AccessViolationException when it runs :(
  • Anonymous
    May 13, 2014
    VS Update 2 = Blue Screen on restart / breaks my Win8, breaks VS 2013.  Suggestion(s)?
  • Anonymous
    May 13, 2014
    @David, can you get us a repro case? That seems like a bad thing! If you can get a repro, please send it to ryujit@microsoft.com.
  • Anonymous
    May 13, 2014
    @AzureSky: Uninstall VS Update 2? Upgrade to Win 8.1 (it's better than Win8 in every possible way)? Insert additional general purpose troubleshooting guidance, here :-)
  • Anonymous
    May 13, 2014
    The comment has been removed
  • Anonymous
    May 13, 2014
    By the way, how might I uninstall VS Update 2?  There is nothing listed in the Control Panel Programs and Features list.  There is nothing added to my start menu, etc, no new program groups or anything.  That might be because VS 2013 Update Setup failed to complete?  VS (in 'About') says Update 2 is installed, but quite apparently it's not installed properly.  I think my VS is in some strange zombie land.  Anyway, what I did earlier was run the VS 2013.0 Setup Repair and that at least got me a Visual Studio that works again.  To get my Desktop back in the first place (to get past the Blue Screens) I had to use a System Restore Point (created by the VS Update 2 installer); so I used that to fix Windows first, of course, followed by the VS 2013.0 Setup Repair.  Anyways, VS still says Update 2 "is installed", so things are not 100% rolled back, which is why I might want to try Update 2 Uninstall; but where?  how?
  • Anonymous
    May 13, 2014
    @AzureSky: Lemme bark up a couple trees to see what's possible, here. I had problems with VS Update 2 preview, and I wound up fully uninstalling VS 2013, then reinstalling it, which took care of the problem for me...
  • Anonymous
    May 13, 2014
    Any clues would be much appreciated, thanks Kevin.
  • Anonymous
    May 13, 2014
    Where can we report bugs? Just tested this on Windows 7 64-bit with IronScheme: NullReferenceException. You can repro by downloading the latest version from ironscheme.codeplex.com and running IronScheme.Console-v4.exe (which targets 64-bit .NET 4+).
  • Anonymous
    May 13, 2014
    Good news is when I NGEN my app, and run tests which involve a bit of codegen, I see a very decent speedup o/ ryujit: 6.8 secs, normal jit: 9.6 secs
  • Anonymous
    May 14, 2014
    still waiting for 32-bit version...
  • Anonymous
    May 14, 2014
    @leppie: send e-mail to RyuJIT@microsoft.com. @Azarien: a 32-bit version won't provide a dramatic improvement in JIT times like the 64-bit one does, but it should provide improved code quality (and SIMD capabilities). That's not happening in the short term, though.
  • Anonymous
    May 14, 2014
    Have sent a repro to that email. Not heard anything. You get it :)?
  • Anonymous
    May 14, 2014
    Are you planning to implement support for further SIMD intrinsics like reciprocal and reciprocal square root?
  • Anonymous
    May 14, 2014
    @David: Yes, I got that repro, and I think a dev has already fixed the issue. I'll make sure that we ping you with status. @Nov0x: The idea for SIMD is to get stuff in customers' hands and listen to what people want/need. I'll talk to the BCL folks (Immo & co.) to see how they're collecting customer feedback.
  • Anonymous
    May 14, 2014
    What is the time frame for RyuJIT to go into production? Four tech previews is pretty comprehensive.
  • Anonymous
    May 14, 2014
    @LKeene: It's planned to go into the next "major" release of the .NET framework. It's already in the new ASP.NET thing that was announced at TechEd. (I don't know the official name of that!) To be clear, CTP4, and this is probably going to be true of any additional previews before we get a .NET beta that includes RyuJIT, was primarily to address SIMD feedback.
  • Anonymous
    May 14, 2014
    @Kevin Frei: "The idea for SIMD is to get stuff in customers hands and listen to what people want/need." I'm not sure if this was mentioned already, but one thing that's certainly missing is shuffling. Without that, some more advanced SIMD stuff can't be implemented efficiently, for example matrix multiplication.
  • Anonymous
    May 16, 2014
    The comment has been removed
  • Anonymous
    May 16, 2014
    @Mike Danes, @CodesInChaos: Both shuffle and shift operations are on our to-do list and under consideration.
  • Anonymous
    May 17, 2014
    I agree.  Would like to see bit shift and (at least some sort of) shuffle operators (needn't be the full SSE equivalent but at least something that would make doing matrix functions doable.)  Please do implement bit shifting and at least some type of shuffle operator.  Thank you in advance for any efforts in this direction.
  • Anonymous
    May 20, 2014
    The comment has been removed
  • Anonymous
    May 20, 2014
    Also: Please add SIMD 'ROR' and 'ROL' (bit rotate) operators.
  • Anonymous
    May 22, 2014
    Yes, all the above please! Bit shifting, rotations, shuffles, and unsigned type (uint, ulong) support. These are essential for fast cryptography implementations, and much else.
  • Anonymous
    May 22, 2014
    Also please add a new constructor to the Vector<T> type: (T[] values, int index, int length). Otherwise, making a vector of a certain length requires creating a new array and copying over the requisite section, completely negating the performance advantages.
  • Anonymous
    May 26, 2014
    The comment has been removed
  • Anonymous
    May 27, 2014
    @OnurG I tested it on VB.NET x64 .NET 4.5.1 and Visual C++ x64 (with full optimizations selected). Results: 17 seconds in VB.NET (WinForms), 16 seconds with C++ (console application). A .NET console app might perform slightly better, possibly delivering the same result as the C++ console app. I'm still using VS 2013.0 (not RyuJIT).
  • Anonymous
    May 28, 2014
    The comment has been removed
  • Anonymous
    May 28, 2014
    @Mike Yes, I am sure.  Ensure you have set 'remove integer overflow checks' and 'enable optimizations' in Release build.  BTW, I've always noticed VB.NET being on par with C++ when it comes to straightforward loops, math, etc (which of course it should do, because it should emit similar ASM) and which is why I found OnurG's assertion a bit odd; therefore I verified it for myself.  The results I achieved were expected and thus unsurprising to me.
  • Anonymous
    May 28, 2014
    @AzureSky: Well, I did my own conversion to VB.NET and I get exactly the same result I get with the C# version, 20 seconds. And of course, optimizations are enabled and overflow checks are disabled. Anyway, the claim that C# runs magnitudes slower is indeed odd, at least for the presented code.
  • Anonymous
    May 28, 2014
    @Mike  Interesting.  Perhaps it's something inherent to RyuJIT, or perhaps the way code is generated for your hardware. (I'm running an old AMD CPU.)
  • Anonymous
    June 04, 2014
    Any word on TCO support? It's a key feature for F#.
  • Anonymous
    June 04, 2014
    Is this still beta? It has no TCE
  • Anonymous
    June 04, 2014
    Yes, any word on support for the "tail." instruction?  Uniformity with the other existing JITs is critical for this to be safe to install. [ Don Syme, for the Visual F# Tools team ]
  • Anonymous
    June 04, 2014
    The tail prefix is fully supported, and has been since CTP2. One of our devs actually ran the whole of the F# test suite against RyuJIT a few weeks after CTP2, because it was such a different set of IL than the C#/VB stuff that most of our tests consist of. Oh, and the F# self build, too :-)
  • Anonymous
    June 04, 2014
    And tail calls are optimized. I wonder, however, if we're doing something goofy with tail-prefixed calls. Got sample code that's not working right? Send it to RyuJIT@microsoft.com :-)
  • Anonymous
    June 10, 2014
    Kevin, any chances of another release with those bug fixes? I'd love to try this out to see if it can improve our run time but no such luck with that JIT level crash.
  • Anonymous
    June 15, 2014
    The comment has been removed
  • Anonymous
    June 16, 2014
    Add me as another vote for bitshift operators. Is there any plan to optimize built-in algorithms to use SIMD instructions? Things like Array.IndexOf<byte> have an optimized path, but because of the indirection they use they are terribly slow. A SIMDified version could probably gain quite a bit. A really nice benefit of the SIMD types is that they avoid bound checking. When you visit the elements inside a vector it does not emit a bound check, which is awesome.
  • Anonymous
    June 16, 2014
    "A really nice benefit of the SIMD types is that they avoid bound checking. When you visit the elements inside a vector it does not emit a bound check which is awesome." That's true only when the index is a constant; if the index is a variable, a bound check must be performed. And beware that accessing a vector element via a variable index will result in the vector register being written to memory.
  • Anonymous
    June 16, 2014
    What are the chances of supporting things like FMA3/4 and gather-scatter in the full release?
  • Anonymous
    June 17, 2014
    The comment has been removed
  • Anonymous
    June 17, 2014
    "long v=vector[0]; would be smart enough to copy it straight to an available register if need be. (setting up debugging for ryujit is annoying)." That one is fast: it generates a single instruction. What's slow is something like vector[i], where i is a non-const variable.
  • Anonymous
    June 20, 2014
    So I finally got around to testing the debugged output, and was somewhat disappointed in some of what is output. If you compute an Equals result and then want to test it against zero or all ones, it generates another full equal: it does a full equal operation against zero again (requiring loading zero and everything). So I had to update my code, and it was quite a bit of a gain to remove the test against zero. So now, I just test both halfs of the simd register at once. (e.g res= Vector.Equals(v1,v2) if(v3==Vector<int>.Zero)..., should only be one actual pcmpeqb but it's actually two, and it has to put zero into an SSE register. It also doesn't do it the smart way, like pxor xmm3,xmm3; instead it writes zero to a long and then does a 64-bit shuffle.)
  • Anonymous
    June 20, 2014
    "e.g res= Vector.Equals(v1,v2) if(v3==Vector<int>.Zero)..., should only be one actual pcmpeqb"
    The code generated for Vector.Equals<byte> is dubious:
        var vect = new Vector<byte>(array, startIndex);
            lea         eax,[r8+0Fh]
            cmp         eax,r10d
            jae         000007FE8E1E201D
            movups      xmm1,xmmword ptr [rcx+r8+10h]
        var res = Vector.Equals<byte>(vect, comparer);
            movaps      xmm2,xmm0
            mov         eax,80808080h
            movd        xmm3,eax
            pshufd      xmm3,xmm3,0
            psubb       xmm1,xmm3
            psubb       xmm2,xmm3
            pcmpeqb     xmm1,xmm2
    The compiler appears to try to compensate for the fact that SSE integer comparison instructions work with signed values, but Equals doesn't need this; pcmpeqb is all that's needed.
    "So now, I just test both halfs of the simd register at once."
    Also try simplifying the branches, there are too many:
        var vl = Vector<byte>.AsVectorInt64(res);
        long v0 = vl[0];
        long v1 = vl[1];
        if ((v0 | v1) == 0)
            continue;
        // v0 holds the lower 8 bytes; only move on to v1 when v0 has no match
        if (v0 == 0)
        {
            startIndex += 8;
            v0 = v1;
        }
        return startIndex + DebruijnFindByte(v0);
    Anyway, this is kind of pointless because as soon as AVX2 support is added the code no longer works correctly.
  • Anonymous
    June 20, 2014
    Has the RyuJIT team tested CTP4 on a Haswell-based machine yet? I've installed CTP4 on a few different machines for testing, and it generally seems to work quite well, save for the NRE bug (which Kevin already said was fixed). However, I installed RyuJIT on my new Haswell-based desktop at home and once I enabled RyuJIT, every .NET program I tried to run crashes immediately. Disabling RyuJIT allows the programs to run normally again, so this seems like a code-generation bug in RyuJIT. Haswell has AVX2 support, so perhaps that has something to do with it (as Mike Danes mentioned in his comment). Also -- any word on when the next release will be out with the fix for the NRE bug?
  • Anonymous
    June 21, 2014
    The comment has been removed
  • Anonymous
    June 22, 2014
    The comment has been removed
  • Anonymous
    July 16, 2014
    Installed and tested on our huge project, where application startup time is around 21 sec in debug mode. .NET 4.5.2 is installed, as well as RyuJIT with the environment variable set. Win 7 x64 used for testing. I don't see any improvement whatsoever in startup time. Any idea if there is a problem with the configuration?
  • Anonymous
    July 17, 2014
    I doubt you'll see much difference in the startup time if you're running the application in debug mode. The startup improvements are mainly related to optimizations which are performed only in release mode.
  • Anonymous
    July 17, 2014
    We have to send a lot of data over the network so lots of copying to and from byte arrays occur; we control the network stack so don't need to worry about endianness. But, with the accelerated CopyTo being to float[] what would the preferred method be to take advantage of hardware accel/intrinsics? (happy to do unsafe casting and fixed blocks)
  • Anonymous
    August 06, 2014
    I was trying to benchmark the difference between SIMD and Non-SIMD code by writing my own Vector4f and comparing to the speed of System.Numerics.Vector4f.  I found that Vector4f I implemented was in most cases faster than the System.Numerics, but to my astonishment it was also almost 4x faster than the same code compiled using VS2010 and not using protojit.dll.  Is RyuJIT clever enough to realize it can match my own version of Vector4f to SIMD instructions?  Will all of my current codebase instantly benefit from SIMD when RyuJIT sees certain patterns and/or operations?
  • Anonymous
    October 21, 2014
    Anyone else getting "DEBUG: Error 2203: Database: C:\windows\Installer\inprogressinstallinfo.ipi. Cannot open database file. System error -2147287037" when installing?
  • Anonymous
    October 29, 2014
    Hi Kevin - thanks for RyuJIT! I have an Intel Haswell computer with AVX2 (256-bit SIMD, I believe) and have downloaded your Mandelbrot sample. In the debugger, I see that Vector<int>.Length returns "4". Do we expect to work with 8 ints at a time on AVX2? Does that mean that I have not configured CTP4 correctly? Cheers, Jiri
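
Several comments above discuss searching a byte array with Vector.Equals and then locating the matching lane, and note that code hard-wired to two 64-bit halves stops working once the vector width changes. The following is a minimal, width-agnostic sketch of that idea, not the commenters' actual code. It assumes the current System.Numerics API (Vector<byte>.Count, Vector.EqualsAll), which may differ from the CTP4 package, and the class and method names are hypothetical. It scans the lanes of a matching chunk with a plain loop instead of a de Bruijn lookup, trading some speed for simplicity.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
static class ByteSearchSketch
{
    // Returns the index of the first occurrence of value in data, or -1 if absent.
    public static int IndexOf(byte[] data, byte value)
    {
        int width = Vector<byte>.Count;          // 16 with SSE2, may be larger elsewhere
        var comparer = new Vector<byte>(value);  // broadcast value to every lane
        int i = 0;

        for (; i <= data.Length - width; i += width)
        {
            // Matching lanes become 0xFF, all other lanes become 0.
            Vector<byte> res = Vector.Equals(new Vector<byte>(data, i), comparer);

            if (Vector.EqualsAll(res, Vector<byte>.Zero))
                continue; // no match in this chunk

            // Scan lanes in order so the first match wins.
            // Indexing with a variable is slow, but this runs at most once per call.
            for (int j = 0; j < width; j++)
            {
                if (res[j] != 0)
                    return i + j;
            }
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
                return i;
        }

        return -1;
    }
}
```

Because nothing in the loop depends on the register being 16 bytes wide, the same code should keep returning the first match if Vector<byte>.Count grows with wider hardware support.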