Why IL?

One of the earliest and most frequently-asked questions we got when we announced the Roslyn project was "is this like LLVM for .NET?"

No, Roslyn is not anything like LLVM for .NET. LLVM stands for Low-Level Virtual Machine; as I understand it (admittedly never having used it), compiler "front ends" take in code written in some language -- say C++ -- and spit out equivalent code written in the LLVM language. Another compiler then takes the code written in the LLVM language and translates it into optimized machine code.

We already have such a system for .NET; in fact, .NET is entirely built upon it and always has been, so Roslyn isn't it. The C#, VB and other compilers take in programs written in those languages and spit out code written in the Common Intermediate Language (CIL, also commonly called MSIL or just IL). Then another compiler -- either the jitter, which runs "just in time" at runtime, or the NGEN tool which runs before runtime -- translates the IL into optimized machine code that can actually run on the target platform.
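
To make the two stages concrete, here is a trivial C# method and, roughly, the IL the C# compiler emits for it -- a simplified sketch of what a disassembler such as ildasm would show. The jitter's job is then to turn that IL into machine code for whatever processor the program actually runs on.

    static int Add(int x, int y)
    {
        return x + y;
    }

    // Roughly the IL generated for Add (simplified):
    .method private hidebysig static int32 Add(int32 x, int32 y) cil managed
    {
        ldarg.0    // push the first argument onto the evaluation stack
        ldarg.1    // push the second argument
        add        // pop both, push their sum
        ret        // return whatever is on top of the stack
    }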

I'm occasionally asked why we use this strategy; why not just have the C# compiler write out optimized machine code directly, and skip the middleman? Why have two compilers to go from C# to machine code when you could have one?

There are a number of reasons, but they pretty much all boil down to one good reason: the two-compilers-with-an-intermediate-language system is much less expensive in our scenario.

That might seem counterintuitive; after all, now we have two languages to specify, two languages to analyze, and so on. To understand why this is such a big win you have to look at the larger picture.

Suppose you have n languages: C#, VB, F#, JScript .NET, and so on. Suppose you have m different runtime environments: Windows machines running on x86 or x64, XBOX 360, phones, Silverlight running on the Mac... and suppose you go with the one-compiler strategy for each. How many compiler back-end code generators do you end up writing? For each language you need a code generator for each target environment, so you end up writing n x m code generators.

Suppose instead you have every language generate code into IL, and then you have one jitter per target environment. How many code generators do you end up writing?  One per language to go to IL, and one per environment to go from IL to the target machine code. That's only n + m, which is far less than n x m for reasonably-sized values of n and m.
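
To put concrete numbers on it: with, say, five languages and six target environments, the direct approach needs 5 x 6 = 30 back-end code generators, while the intermediate-language approach needs only 5 + 6 = 11 -- five front ends that emit IL plus six jitters.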

Moreover, there are other economies in play as well. IL is deliberately designed so that it is very easy for compiler writers to generate correct IL. I'm an expert on the semantic analysis of the C# language, not on efficient code generation for cellular phone chipsets. If I had to write a new backend code generator for every platform the .NET framework runs on, I'd be spending all my time doing a bad job of writing code generators instead of doing a good job writing semantic analyzers.

The cost savings go the other way too; if you want to support a new chipset then you just write yourself a jitter for that chipset and all the languages that compile to IL suddenly start working; you only had to write *one* jitter to get n languages on your new platform.

This cost-saving strategy of putting an intermediate language in the middle is not at all new; it goes back to at least the late 1960's. My favourite example of this strategy is the Infocom Z-Machine; the Infocom developers wrote their games in a language (Zork Implementation Language) that compiled to an intermediate Z-Code language, and then wrote Z-Machine interpreters for a variety of different platforms; as a result they could write n games and have them run on m different platforms at a cost of n + m, not n x m. (This approach also had the enormous benefit that they could implement virtual memory management on hardware that did not support virtual memory natively; if the game was too big to fit into memory, the interpreter could simply discard code that wasn't being used at the moment and page it back in again later as needed.)

Next time I'll talk a bit about why IL is specified the way it is.

Comments

  • Anonymous
    November 18, 2011
    I believe the "is this like LLVM for .NET?" question was sloppily phrased. One LLVM project is Clang, the new C++ compiler, and one Clang project is the Clang Static Analyzer -- clang-analyzer.llvm.org -- which I believe is similar to Roslyn, although I haven't used either Roslyn or the Clang Static Analyzer.

  • Anonymous
    November 18, 2011
    This reminds me of a question I've had about jitters. I recall from my days as a C++ developer that the default for native code generation was code that ran better on a 586 architecture but would nevertheless run on a 386, i.e., it would use only the 386 instruction set (albeit preferring instructions that had been optimized on the Pentium) and completely avoid any Pentium-only instructions. A jitter, on the other hand, knowing exactly what platform it's running on, could generate code very specific to that CPU. You mention that this is done at the macro level out of necessity (i.e., separate jitters for desktop, phone, Xbox, etc.), but is it done at the micro level (i.e., jitters specific to different but similar Intel CPUs)? If not, any plans for that in the future?

  • Anonymous
    November 18, 2011
    James: I'm pretty sure the 32-bit and 64-bit jitters for the desktop are quite different. Next time sounds interesting. I've quite often wondered why IL is stack-based and not, for example, expression-based. (Of course, it couldn't be based on the .NET Expression<T> types because they didn't exist when IL was created. But the idea did exist.)

  • Anonymous
    November 18, 2011
    @configurator, it's probably because stack-based is very simple to implement. Remember, the m jitters cannot be much more complex to write than the n compilers, because then n + m ends up just as bad as n x m.
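
    To illustrate -- purely a toy sketch, nothing like the real jitter -- every stack-machine opcode is a couple of lines of interpreter code, and a front end emitting those opcodes never has to think about register allocation or expression trees:

        // Toy illustration only: a minimal stack-machine evaluator in C#.
        // Real IL and the real jitter are far more sophisticated than this.
        using System.Collections.Generic;

        enum Op { LoadConst, Add, Mul }

        static class TinyStackMachine
        {
            public static int Evaluate(IEnumerable<(Op op, int arg)> code)
            {
                var stack = new Stack<int>();
                foreach (var (op, arg) in code)
                {
                    switch (op)
                    {
                        case Op.LoadConst: stack.Push(arg); break;
                        case Op.Add: stack.Push(stack.Pop() + stack.Pop()); break;
                        case Op.Mul: stack.Push(stack.Pop() * stack.Pop()); break;
                    }
                }
                return stack.Pop();
            }
        }

        // (2 + 3) * 4 flattens into a simple instruction sequence:
        // Evaluate(new[] { (Op.LoadConst, 2), (Op.LoadConst, 3), (Op.Add, 0),
        //                  (Op.LoadConst, 4), (Op.Mul, 0) })  ==> 20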

  • Anonymous
    November 18, 2011
    I'm curious whether you have to give up anything by using IL -- performance, flexibility, etc.

  • Anonymous
    November 18, 2011
    And when Windows adds new CPU types (ARM, for example), all that needs to be done is write an ARM implementation of the CLR, and everything compiled to MSIL will magically work.

  • Anonymous
    November 18, 2011
    Now please write a post on what Roslyn IS (as opposed to what it is not) and what it is good for. I have seen the announcement and think I know what it is but have seen nothing on why you guys think it is important and what you think people should do with it. [I think you haven't been looking very hard. Did you read the whitepaper? In brief: what is it? A set of tools for programmatically analyzing C# and VB programs. What is it good for? Programmatically analyzing C# and VB programs. Why is it important? The question cannot be answered because it omits part of the predicate; things are not "important" on their own; they are important to someone. Why it is important to me might be very different from why (or if) it is important to you. Since I am in the business of doing nothing all day every day but analyzing C# code, it is very important to me. What should people do with programmatic code analysis tools? They should programmatically analyze C# and VB code. -- Eric]

  • Anonymous
    November 18, 2011
    PaulTDessci: Imagine you wanted to write your own IDE (like Visual Studio) for C#: you want syntax highlighting, IntelliSense, refactoring, squiggly lines to show up under syntax errors, the "go to definition" feature, an immediate mode window, and so on. Right now, creating all of these features essentially requires you to write almost a complete C# compiler because there's no way to get anything out of the current C# compiler other than assemblies or error messages. You have to write a parser to do syntax highlighting, a semantic analyzer to be able to go to definitions, etc. What Roslyn is intended to do is allow you to create these features without having to write your own compiler. It will give you a syntax tree, perform semantic analysis, and more. You can simply give Roslyn the source code and ask it for a syntax tree or what symbol the user's hovering their cursor over. Of course, there's no need to restrict its features to code editors; it's also intended for use by static analysis tools, and obviously it will be used for actually compiling source code to assemblies (whether to compile your project to an EXE or to compile ASPX files in memory). What Roslyn is not intended to do is allow you to create new languages on top of C#, whether it's macros, metaprogramming, aspects, or whatever. Just because it's not intended for such things, though, doesn't mean you can't or shouldn't do it.
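
    For example, the "hand Roslyn some source, get a syntax tree and a semantic model back" part looks roughly like the sketch below; the namespaces and method names here follow the Microsoft.CodeAnalysis API shape and may differ from whatever Roslyn drop you happen to have:

        // Sketch only -- assumes the Microsoft.CodeAnalysis.CSharp package;
        // CTP builds use somewhat different namespaces and method names.
        using Microsoft.CodeAnalysis.CSharp;

        var tree = CSharpSyntaxTree.ParseText("class C { void M() { int x = 1 + 2; } }");
        var root = tree.GetRoot();   // the full syntax tree: walk it for highlighting,
                                     // refactoring, "go to definition", and so on
        var compilation = CSharpCompilation.Create("Demo", new[] { tree });
        var model = compilation.GetSemanticModel(tree);   // symbols, types, binding info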

  • Anonymous
    November 18, 2011
    Another benefit for application developers is that they don't have to build and distribute binaries for each target machine architecture. Instead, there is a single set of IL-compiled binaries to distribute. This simplifies ALM and deployment significantly. In addition, users don't have to know which architecture they are running in order to download/run the application.

  • Anonymous
    November 18, 2011
    If I'm not mistaken, GCC does more or less the same thing:  All the compilers generate an intermediate code, and the backend generates machine code.

  • Anonymous
    November 18, 2011
    It's all in the name, isn't it? "Common Language Runtime" tells you all about the "why" and the benefits :D

  • Anonymous
    November 20, 2011
    The LLVM language is at a much lower level than the .NET IL. It isn't a replacement for the CIL, but could rather be used as a common backend for .NET, Java and low-level (C-like) languages. A common backend and optimizer for VC++ and .NET would sure be nice. This way compilers for different source and target platforms can share large parts of the optimizer and code generator, and it's possible to add additional optimization passes or backends as a third party... In fact Mono uses it as one of its backends, which leads to very fast code (but I think it's traded for a higher JIT time). I think Microsoft's equivalent of LLVM would be part of Phoenix, and not the traditional .NET stack. I remember a talk about Phoenix mentioning such a low-level intermediate language. The equivalent of Roslyn would be something like MonoDevelop's NRefactory.

  • Anonymous
    November 21, 2011
    Looks like I was a victim of the invisible blog posting timeout, so I'll try again. In the late 80's, DEC (Digital Equipment Corporation) had compilers for almost a dozen languages on its CISC-based VAX architecture and was about to run into the m x n problem as it was developing its RISC-based Alpha architecture. It developed a CIL for its compilers and turned it into a 12 + 2 problem.

  • Anonymous
    November 21, 2011
    The CIL is all good, but there is one problem with it -- can it evolve? I guess it should evolve for two reasons. First, to allow for more optimizations by preserving more of the higher-level language's code structure. Second, to move things like WeakReference into the CIL itself and make them a native feature of the higher-level languages.

  • Anonymous
    November 22, 2011
    The concept of introducing an intermediate language in the middle is a few years older than the late 1960s. It goes back to UNCOL, proposed by Melvin E. Conway in 1958. Here is an interesting link: homepage.ntlworld.com/.../uncol.html.