RyuJIT CTP4: Now with more SIMD types, and better OS support!
Hi, folks. It’s been a busy month around here. We’ve been working on all sorts of stuff that I can’t talk about right now, but in the meantime, we’ve also been responding to feedback on the SIMD types. So, since it’s busy, I’m just going to list off the details, and link to other places for more information.
- Probably the biggest news is that if you install the 4.5.2 runtime (check the .NET blog for details on that), you can use RyuJIT CTP4 on Windows Vista, 7, and 8, as well as Windows Server 2008, 2008 R2, and 2012. In the CTP1 FAQ, I made mention that 4.5.1 on “downlevel” OS’es looked different from a code generation perspective. Well, that’s been addressed in the 4.5.2 update, so we’re happy to support RyuJIT CTP4 across all platforms that support 4.5.2.
- Nearly all the available Vector<T> types are now accelerated! The only ones missing are Vector<uint> and Vector<ulong>. In addition, there are a handful of other methods that we are now accelerating, including the CopyTo() method, which means any performance you have measured is now completely invalidated! Wait, no, I mean any performance you measured could potentially be faster!
- The fixed-size vector types are all mutable, now. This was the single biggest piece of feedback we received, so we took it.
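To make the items above a little more concrete, here is a minimal sketch of the kind of code they affect. It is an illustration only, not one of the CTP samples: the class name SimdSketch and the Add helper are made up for this example, and it uses Vector<float>.Count as found in the released System.Numerics.Vectors package (the preview package may expose the same value under a different name, such as Length).

```csharp
using System;
using System.Numerics;

static class SimdSketch
{
    // Hypothetical helper: adds two equal-length arrays element-wise,
    // processing Vector<float>.Count elements per iteration.
    public static float[] Add(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("length mismatch");

        var result = new float[a.Length];
        int width = Vector<float>.Count; // hardware-dependent lane count
        int i = 0;

        // Vectorized part: load, add, and store with CopyTo().
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.Length; i++)
        {
            result[i] = a[i] + b[i];
        }
        return result;
    }

    static void Main()
    {
        float[] sum = Add(new float[] { 1, 2, 3, 4, 5 },
                          new float[] { 10, 20, 30, 40, 50 });
        Console.WriteLine(string.Join(", ", sum));

        // The fixed-size types are mutable: a component can be assigned in place.
        var p = new Vector3(1f, 2f, 3f);
        p.X = 10f;
        Console.WriteLine(p.X + p.Y + p.Z);
    }
}
```

The scalar tail loop keeps the result correct whatever the vector width turns out to be on a given machine, so the same code should behave identically whether or not the JIT accelerates it.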
There you have it. For now, you can download the CTP4 bits from here. The BCL SIMD NuGet package has also been updated, so update that, and you should be good to go. Same directions as before for how to use the types, enable RyuJIT, all that stuff. As always, send feedback to ryujit@microsoft.com. Happy RyuJIT-ing!
-Kev
Comments
- Anonymous
May 12, 2014
Looking forward to testing out CTP4.
- Anonymous
May 12, 2014
I installed CTP4 on Windows 7. My application runs fine normally. After enabling RyuJIT with set COMPLUS_AltJit=*, my application crashes with System.AccessViolationException :(
- Anonymous
May 13, 2014
VS Update 2 = Blue Screen on restart / breaks my Win8, breaks VS 2013. Suggestion(s)?
- Anonymous
May 13, 2014
@David, can you get us a repro case? That seems like a bad thing! If you can get a repro, please send it to ryujit@microsoft.com.
- Anonymous
May 13, 2014
@AzureSky: Uninstall VS Update 2? Upgrade to Win 8.1 (it's better than Win8 in every possible way)? Insert additional general purpose troubleshooting guidance here :-)
- Anonymous
May 13, 2014
The comment has been removed
- Anonymous
May 13, 2014
By the way, how might I uninstall VS Update 2? There is nothing listed in the Control Panel Programs and Features list. There is nothing added to my start menu, no new program groups or anything. That might be because VS 2013 Update Setup failed to complete? VS (in 'About') says Update 2 is installed, but quite apparently it's not installed properly. I think my VS is in some strange zombie land. Anyway, what I did earlier was run the VS 2013.0 Setup Repair, and that at least got me a Visual Studio that works again. To get my Desktop back in the first place (to get past the Blue Screens) I had to use a System Restore Point (created by the VS Update 2 installer); so I used that to fix Windows first, of course, followed by the VS 2013.0 Setup Repair. Anyway, VS still says Update 2 "is installed", so things are not 100% rolled back, which is why I might want to try uninstalling Update 2; but where? How?
- Anonymous
May 13, 2014
@AzureSky: Lemme bark up a couple trees to see what's possible here. I had problems with VS Update 2 preview, and I wound up fully uninstalling VS 2013, then reinstalling it, which took care of the problem for me...
- Anonymous
May 13, 2014
Any clues would be much appreciated, thanks Kevin.
- Anonymous
May 13, 2014
Where can we report bugs? Just tested this on Windows 7 64-bit with IronScheme: NullReferenceException. You can repro by downloading the latest version from ironscheme.codeplex.com and running IronScheme.Console-v4.exe (which targets 64-bit .NET 4+).
- Anonymous
May 13, 2014
Good news is, when I NGEN my app and run tests which involve a bit of codegen, I see a very decent speedup o/ RyuJIT: 6.8 secs; normal JIT: 9.6 secs.
- Anonymous
May 14, 2014
Still waiting for a 32-bit version...
- Anonymous
May 14, 2014
@leppie: Send e-mail to RyuJIT@microsoft.com.
@Azarien: A 32-bit version won't provide a dramatic improvement in JIT times like the 64-bit one does, though it should provide improved code quality (and SIMD capabilities). That's not happening in the short term, however.
- Anonymous
May 14, 2014
I have sent a repro to that email but have not heard anything. Did you get it? :)
- Anonymous
May 14, 2014
Are you planning to implement support for further SIMD intrinsics like reciprocal and reciprocal square root?
- Anonymous
May 14, 2014
@David: Yes, I got that repro, and I think a dev has already fixed the issue. I'll make sure that we ping you with status.
@Nov0x: The idea for SIMD is to get stuff in customers' hands and listen to what people want/need. I'll talk to the BCL folks (Immo & co.) to see how they're collecting customer feedback.
- Anonymous
May 14, 2014
What is the time frame for RyuJIT to go into production? Four tech previews is pretty comprehensive.
- Anonymous
May 14, 2014
@LKeene: It's planned to go into the next "major" release of the .NET framework. It's already in the new ASP.NET thing that was announced at TechEd. (I don't know the official name of that!) To be clear, CTP4 was primarily to address SIMD feedback, and this is probably going to be true of any additional previews before we get a .NET beta that includes RyuJIT.
- Anonymous
May 14, 2014
@Kevin Frei: "The idea for SIMD is to get stuff in customers hands and listen to what people want/need."
I'm not sure if this was mentioned already, but one thing that's certainly missing is shuffling. Without that, some more advanced SIMD stuff can't be implemented efficiently, for example matrix multiplication.
- Anonymous
May 16, 2014
The comment has been removed
- Anonymous
May 16, 2014
@Mike Danes, @CodesInChaos: Both shuffle and shift operations are on our to-do list and under consideration.
- Anonymous
May 17, 2014
I agree. Would like to see bit shift and (at least some sort of) shuffle operators (needn't be the full SSE equivalent, but at least something that would make matrix functions doable). Please do implement bit shifting and at least some type of shuffle operator. Thank you in advance for any efforts in this direction.
- Anonymous
May 20, 2014
The comment has been removed
- Anonymous
May 20, 2014
Also: Please add SIMD 'ROR' and 'ROL' (bit rotate) operators.
- Anonymous
May 22, 2014
Yes, all the above please! Bit shifting, rotations, shuffles, and unsigned type (uint, ulong) support. These are essential for fast cryptography implementations, and much else.
- Anonymous
May 22, 2014
Also, please add a new constructor to the Vector<T> type: (T[] values, int index, int length). Otherwise, making a vector of a certain length requires allocating a new array and copying over the requisite section, completely negating the performance advantages.
- Anonymous
May 26, 2014
The comment has been removed
- Anonymous
May 27, 2014
@OnurG I tested it on VB.NET x64 .NET 4.5.1 and Visual C++ x64 (with full optimizations selected). Results: 17 seconds in VB.NET (WinForms), 16 seconds with C++ (console application). A .NET console app might perform slightly better, possibly delivering the same result as the C++ console app. I'm still using VS 2013.0 (not RyuJIT).
- Anonymous
May 28, 2014
The comment has been removed
- Anonymous
May 28, 2014
@Mike Yes, I am sure. Ensure you have set 'remove integer overflow checks' and 'enable optimizations' in the Release build. BTW, I've always noticed VB.NET being on par with C++ when it comes to straightforward loops, math, etc. (which of course it should be, because it should emit similar ASM), which is why I found OnurG's assertion a bit odd; therefore I verified it for myself. The results I achieved were expected and thus unsurprising to me.
- Anonymous
May 28, 2014
@AzureSky: Well, I did my own conversion to VB.NET and I get exactly the same result I get with the C# version, 20 seconds. And of course, optimizations are enabled and overflow checks are disabled.
Anyway, the claim that C# runs magnitudes slower is indeed odd, at least for the presented code.
- Anonymous
May 28, 2014
@Mike Interesting. Perhaps it's something inherent to RyuJIT, or perhaps the way code is generated for your hardware. (I'm running an old AMD CPU.)
- Anonymous
June 04, 2014
Any word on TCO support? It's a key feature for F#.
- Anonymous
June 04, 2014
Is this still beta? It has no TCE.
- Anonymous
June 04, 2014
Yes, any word on support for the "tail." instruction? Uniformity with the other existing JITs is critical for this to be safe to install. [ Don Syme, for the Visual F# Tools team ]
- Anonymous
June 04, 2014
The tail prefix is fully supported, and has been since CTP2. One of our devs actually ran the whole of the F# test suite against RyuJIT a few weeks after CTP2, because it was such a different set of IL than the C#/VB stuff that most of our tests consist of. Oh, and the F# self build, too :-)
- Anonymous
June 04, 2014
And tail calls are optimized. I wonder, however, if we're doing something goofy with tail-prefixed calls. Got sample code that's not working right? Send it to RyuJIT@microsoft.com :-)
- Anonymous
June 10, 2014
Kevin, any chance of another release with those bug fixes? I'd love to try this out to see if it can improve our run time, but no such luck with that JIT-level crash.
- Anonymous
June 15, 2014
The comment has been removed
- Anonymous
June 16, 2014
Add me as another vote for bitshift operators.
Is there any plan to optimize built-in algorithms to use SIMD instructions? Things like Array.IndexOf<byte> have an optimized path, but because of the indirection they use, they are terribly slow. A SIMDified version could probably gain quite a bit.
A really nice benefit of the SIMD types is that they avoid bound checking. When you visit the elements inside a vector it does not emit a bound check which is awesome.
- Anonymous
June 16, 2014
"A really nice benefit of the SIMD types is that they avoid bound checking. When you visit the elements inside a vector it does not emit a bound check which is awesome."
That's true only when the index is a constant; if the index is a variable, then a bound check must be performed. And beware that accessing a vector element via a variable index will result in the vector register being written to memory.
- Anonymous
June 16, 2014
What are the chances of supporting things like FMA3/4 and gather-scatter in the full release?
- Anonymous
June 17, 2014
The comment has been removed
- Anonymous
June 17, 2014
"long v=vector[0]; would be smart enough to copy it straight to an available register if need be. (setting up debugging for ryujit is annoying)."
That one is fast; it generates a single instruction. What's slow is something like vector[i], where i is a non-const variable.
- Anonymous
June 20, 2014
So I finally got around to looking at the debugged output, and was somewhat disappointed by some of the code that is generated. When you have a comparison result and then want to test it against zero or all ones, it generates another full equality operation: it does a full equals against zero again (requiring loading zero and everything). So I had to update my code, and removing the test against zero was quite a bit of a gain. So now, I just test both halves of the SIMD register at once. (E.g. res = Vector.Equals(v1, v2); if (v3 == Vector<int>.Zero) ... should only be one actual pcmpeqb, but it's actually two, and it has to put zero into an SSE register. It also doesn't do it the smart way, like pxor xmm3,xmm3; instead it writes zero to a long and then does a 64-bit shuffle.)
- Anonymous
June 20, 2014
"e.g res= Vector.Equals(v1,v2) if(v3==Vector<int>.Zero)..., should only be one actual pcmpeqb"
The code generated for Vector.Equals<byte> is dubious:
var vect = new Vector<byte>(array, startIndex);
    lea eax,[r8+0Fh]
    cmp eax,r10d
    jae 000007FE8E1E201D
    movups xmm1,xmmword ptr [rcx+r8+10h]
var res = Vector.Equals<byte>(vect, comparer);
    movaps xmm2,xmm0
    mov eax,80808080h
    movd xmm3,eax
    pshufd xmm3,xmm3,0
    psubb xmm1,xmm3
    psubb xmm2,xmm3
    pcmpeqb xmm1,xmm2
The compiler appears to try to compensate for the fact that SSE integer comparison instructions work with signed values, but Equals doesn't need this; pcmpeqb is all that's needed.
"So now, I just test both halves of the SIMD register at once."
Also try simplifying the branches; there are too many:
var vl = Vector<byte>.AsVectorInt64(res);
long v0 = vl[0];
long v1 = vl[1];
if ((v0 | v1) == 0) continue;
if (v1 != 0)
{
    startIndex += 8;
    v0 = v1;
}
return startIndex + DebruijnFindByte(v0);
Anyway, this is kind of pointless because as soon as AVX2 support is added the code no longer works correctly.
- Anonymous
June 20, 2014
Has the RyuJIT team tested CTP4 on a Haswell-based machine yet? I've installed CTP4 on a few different machines for testing, and it generally seems to work quite well, save for the NRE bug (which Kevin already said was fixed). However, I installed RyuJIT on my new Haswell-based desktop at home, and once I enabled RyuJIT, every .NET program I tried to run crashed immediately. Disabling RyuJIT allows the programs to run normally again, so this seems like a code-generation bug in RyuJIT. Haswell has AVX2 support, so perhaps that has something to do with it (as Mike Danes mentioned in his comment).
Also, any word on when the next release will be out with the fix for the NRE bug?
- Anonymous
June 21, 2014
The comment has been removed
- Anonymous
June 22, 2014
The comment has been removed
- Anonymous
July 16, 2014
Installed and tested on our huge project, where our application startup time is around 21 sec in debug mode. .NET 4.5.2 is installed, as is RyuJIT, with the environment variable set. Win 7 x64 was used for testing. I don't see any improvement whatsoever in startup time. Any idea whether there is a problem with the configuration?
- Anonymous
July 17, 2014
I doubt you'll see much difference in the startup time if you're running the application in debug mode. The startup improvements are mainly related to optimizations which are performed only in release mode.
- Anonymous
July 17, 2014
We have to send a lot of data over the network, so lots of copying to and from byte arrays occurs; we control the network stack, so we don't need to worry about endianness. But with the accelerated CopyTo being to float[], what would the preferred method be to take advantage of hardware acceleration/intrinsics? (Happy to do unsafe casting and fixed blocks.)
- Anonymous
August 06, 2014
I was trying to benchmark the difference between SIMD and non-SIMD code by writing my own Vector4f and comparing it to the speed of System.Numerics.Vector4f. I found that the Vector4f I implemented was in most cases faster than the System.Numerics one, but to my astonishment it was also almost 4x faster than the same code compiled using VS2010 and not using protojit.dll. Is RyuJIT clever enough to realize it can match my own version of Vector4f to SIMD instructions? Will all of my current codebase instantly benefit from SIMD when RyuJIT sees certain patterns and/or operations?
- Anonymous
October 21, 2014
Anyone else getting "DEBUG: Error 2203: Database: C:\windows\Installer\inprogressinstallinfo.ipi. Cannot open database file. System error -2147287037" when installing?
- Anonymous
October 29, 2014
Hi Kevin - thanks for RyuJIT! I have an Intel Haswell computer with AVX2 (256-bit SIMD, I believe) and have downloaded your Mandelbrot sample. In the debugger, I see that Vector<int>.Length returns "4". Should we expect to work with 8 ints at a time on AVX2? Does that mean that I have not configured CTP4 correctly?
Cheers,
Jiri