Past performance is no guarantee of future results

Before I get started today, a couple housekeeping notes. First off, sorry for no blog the last three weeks; I have been crazy busy adding features to the Roslyn C# semantic analyzer. More on that in an upcoming episode. Second, check out the snazzy new Developer Tools blog aggregation page; it's one stop shopping for lots of great information, and it's mostly purple. Vindication!

OK, now that we've got the metablogging out of the way, on to a question I get occasionally:

Is compiling the same C# program twice guaranteed to produce the same binary output?

No.

Well, that was an easy blog to write.

Maybe some more background might be useful.

What's up with that?

Well, a language is fundamentally nothing more than a set, possibly infinite, of strings (*); a string is a valid program if it is in the set, and invalid if it is not. A compiler is a program that takes a string, determines if it is a valid member of a particular language, and if it is, then produces as its output an equivalent (**) valid string in a different language. For example, the C# compiler takes as its input a string stored in a ".cs" file and produces as its output a program written in the MSIL language. There are of course a whole host of pragmatic details; the C# compiler produces MSIL that has been encoded into a binary "portable executable" format rather than into a human-readable format. And the C# compiler also takes in as inputs programs written in this binary format, in the form of library references. But at a fundamental level, the job that the C# compiler does for you is to take in C# code, analyze it for correctness, and produce an equivalent program written in the PE format.

Any given C# program has infinitely many equivalent MSIL programs; as a trivial example, we can insert as many extra nop "no operation" instructions as we like, anywhere we like. As a silly example, we can insert any verifiable code whatsoever in any unreachable code path! And as a more interesting example, there are frequently situations in which the compiler must "make up" unique names for things like anonymous methods, anonymous types, fields of closure classes, and so on. I thought I'd talk a bit about those unique names.

When the compiler needs to generate a unique name for something, its basic strategy is to start from a pattern that cannot possibly be a legal C# name, because it contains characters that MSIL considers legal in identifiers but C# does not. It then inserts into the pattern strings such as the name of the current method, and finally puts a unique number on the end. I described the exact pattern we use in this StackOverflow answer; note that this is an implementation detail of the compiler that we are considering changing in the future. (People occasionally complain that the names are too long and are eating up space in the metadata tables; we could make them much shorter.) The key point here is that uniqueness is guaranteed by incrementing a counter, which is a pretty easy way to guarantee uniqueness. As a practical matter we are not going to run out of unique identifiers; they only need to be unique within a given assembly, and no assembly has more than two billion compiler-generated identifiers.
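
If you are curious what these names look like in practice, here is a minimal sketch of how to observe them. The exact shapes of the generated names (closure "display classes" and the like) are an implementation detail and may differ from one compiler version to the next; the sketch simply lists whatever non-public nested types the compiler happened to invent for a class containing a capturing lambda.

    using System;
    using System.Reflection;

    class Program
    {
        static Func<int, int> MakeAdder(int x)
        {
            // Capturing the local 'x' forces the compiler to synthesize a
            // closure class whose name is not legal in C# source.
            return y => x + y;
        }

        static void Main()
        {
            // List the non-public nested types the compiler invented for this
            // class; the exact names are an implementation detail.
            foreach (Type t in typeof(Program).GetNestedTypes(BindingFlags.NonPublic))
                Console.WriteLine(t.Name);
        }
    }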

This means that the actual content of a compiler-generated identifier is determined by the order in which the compiler gets around to analyzing a particular hunk of code that needs to generate a magical identifier. We make no guarantees that this order is deterministic! Historically it has been mostly deterministic; the C# compiler's analysis engine in the past has been single-threaded, and, if given the same list of files twice, will analyze the files in the same order and then analyze all the types in the program in the same order (***). But note the assumptions here:

First off, we are assuming that we always get the same list of files every time, in the same order. But that's in some cases up to the operating system. When you say "csc *.cs", the order in which the operating system proffers up the list of matching files is an implementation detail of the operating system; the compiler does not sort that list into a canonical order.
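
If a build needs a canonical order, it has to impose one itself. Here is a minimal sketch of a hypothetical build step that sorts the file list before handing it to the compiler; again, the compiler itself does no such sorting.

    using System;
    using System.IO;
    using System.Linq;

    class CanonicalFileList
    {
        static void Main()
        {
            // Enumerate the .cs files ourselves and sort them with an ordinal
            // comparison so every build sees the same order, regardless of
            // what order the operating system happens to return them in.
            var files = Directory.EnumerateFiles(".", "*.cs")
                                 .OrderBy(f => f, StringComparer.Ordinal)
                                 .ToList();
            Console.WriteLine(string.Join(" ", files)); // pass this explicit list to csc
        }
    }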

Second, we are assuming that the analysis engine of the compiler is single-threaded; there is no requirement that it be single threaded, and in fact we have toyed with making it multi-threaded in the past and likely will again in the future. Analyzing large programs is a great example of an "embarrassingly parallel" problem. Once the large-scale structure of the program -- all the types, methods, fields, and so on -- is known, then every method body can be analyzed in parallel; the contents of one method body never affect the analysis of a different method body. But once you make the compiler multi-threaded, there is no guarantee of any kind that the order is deterministic, and hence no guarantee that two compilations produce the same magic identifiers.
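
To illustrate the point -- this is not compiler code, and the generated-name shape here is made up -- here is a sketch of why a shared counter plus parallel analysis gives you unique names, but not necessarily the same names from one run to the next:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    class ParallelNaming
    {
        static int counter;

        static void Main()
        {
            var names = new ConcurrentDictionary<string, string>();
            // Pretend each string is a method body being analyzed on some thread.
            Parallel.ForEach(new[] { "M1", "M2", "M3", "M4" }, method =>
            {
                // Each analysis grabs the next counter value; the value is unique,
                // but which method gets which number depends on thread scheduling.
                int n = Interlocked.Increment(ref counter);
                names[method] = "<" + method + ">b__" + n;
            });
            foreach (var pair in names)
                Console.WriteLine(pair.Key + " -> " + pair.Value);
        }
    }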

Now, all this is very interesting, which is why I told you all about it. I could have just cut right to the chase, which is to say: the C# compiler by design never produces the same binary twice. The C# compiler embeds a freshly generated GUID in every assembly, every time you run it, thereby ensuring that no two assemblies are ever bit-for-bit identical. To quote from the CLI specification:

The Mvid column shall index a unique GUID [...] that identifies this instance of the module. [...] The Mvid should be newly generated for every module [...] While the [runtime] itself makes no use of the Mvid, other tools (such as debuggers [...]) rely on the fact that the Mvid almost always differs from one module to another.

So there you go; the runtime specification requires that every module (an assembly being a collection of one or more modules) have a unique identifier.
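
You can see this for yourself; here is a small sketch that prints the MVID of a compiled assembly. Build the same source twice and the two outputs will report different GUIDs.

    using System;
    using System.Reflection;

    class ShowMvid
    {
        static void Main(string[] args)
        {
            // Pass the path to a compiled assembly on the command line;
            // the module version ID is the per-compilation GUID described above.
            Assembly assembly = Assembly.LoadFrom(args[0]);
            Console.WriteLine(assembly.ManifestModule.ModuleVersionId);
        }
    }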

Moreover: let's not forget here that it's not the IL that runs; yet another compiler will translate the IL to machine code. The jitter certainly does not guarantee that jit compiling the same code twice produces the same machine code; the jitter can be using all kinds of runtime information to tweak the generated code to optimize it for size, for speed, for debuggability, and so on, at its whim.

Assuming that the compiler produces correct code, why would you care if compiling the same program twice produces exactly the same output or not?

The first customer to ask me whether the C# compiler was deterministic had an interesting scenario. The customer was in an industry where their code was subject to government regulation. The government wanted to obtain the source code and the binary code that was going to be actually burned into a chip from the customer. The regulators were planning on performing a security review of the human-readable source code before allowing devices containing the chip to be approved for use in the state. Of course, the obvious attack by an unscrupulous provider is to make the source code and binary code not match; the source is benign and the binaries are hostile.

The regulator's strategy was going to be to compile the source code and then do a bit-for-bit comparison to the binary code, which, as you now know, will always fail in C#.

What I suggested to them was: since the regulator was going to be doing the compilation anyways for the purposes of comparison, then the regulatory body should only take in the source code in the first place; they can give the binaries back to the company. That way the regulators have a guarantee that the binaries match the source code, because the regulators produced the binaries in the first place.

I believe they ignored my advice and ended up doing a diff and then verifying that the differing bits were only found in the Mvid column mentioned above. Of course I strongly cautioned them that they were taking advantage of an undocumented and completely unsupported compiler feature. We make no guarantee that the compiler's behaviour is repeatable in this way. More generally: the C# compiler was not designed to be a component of a security system, so don't use it as one.


(*) And a string is of course a finite, ordered sequence of characters drawn from some alphabet.

(**) What precisely makes a program in one language "equivalent" to a program in another language is a deep question that we're going to completely ignore.

(***) The details are quite interesting; basically we start by walking each file, one after the other, and build up a tree of "top level" items: namespaces, types, methods, and so on. Once we have that tree, we iterate over it in a tree traversal that effectively does a partial order sort. The goal is to find an ordering, if one exists, that guarantees that the metadata emitting engine never has to emit metadata "out of order"; we prefer base type metadata to be emitted before derived type metadata and require that outer type metadata be emitted before nested type metadata. There are unusual code topologies in which our preferences cannot be met; in those topologies, the emitter can have poor performance. We therefore have a vested interest in ensuring that we find a good ordering to enumerate over all the types of a program.
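
A toy sketch of that kind of ordering is below; this is not the compiler's actual algorithm, just an illustration of a traversal that always emits an outer type before its nested types and, where possible, a base type before its derived types.

    using System.Collections.Generic;

    class TypeNode
    {
        public string Name;
        public TypeNode DeclaringType; // the outer type, if this is a nested type
        public TypeNode BaseType;      // the base type, if it is defined in this program
    }

    static class EmitOrder
    {
        public static List<TypeNode> Order(IEnumerable<TypeNode> types)
        {
            var result = new List<TypeNode>();
            var visited = new HashSet<TypeNode>();
            foreach (var type in types)
                Emit(type, visited, result);
            return result;
        }

        static void Emit(TypeNode type, HashSet<TypeNode> visited, List<TypeNode> result)
        {
            if (type == null || !visited.Add(type))
                return;                                // nothing to emit, or already handled
            Emit(type.DeclaringType, visited, result); // requirement: outer before nested
            Emit(type.BaseType, visited, result);      // preference: base before derived
            result.Add(type);
        }
    }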

Comments

  • Anonymous
    May 31, 2012
    The thing you refer to as "MSIL" and "IL" is the thing that's currently officially "CIL", right? (If you can't keep track of its name it makes me feel less bad that I can't :))

  • Anonymous
    May 31, 2012
    I've seen this "not guaranteed to produce the exact same output" have another effect: sometimes, even without making any changes, the ordering of types, methods, properties, interfaces on a class, etc. as enumerated via reflection will be different. I found this out because a code generator I wrote was producing different output every other time I ran it, and thus constantly committing "new" changes to the generated code. I simply changed it to sort everything by name before generating the code and then I was happy (the tool itself was always happy, but I prefer generated code to stay the same if the input is the same :)).

  • Anonymous
    May 31, 2012
> they only need to be unique within a given assembly, and no assembly has more than two billion compiler-generated identifiers
    Challenge accepted! (Now I'm actually curious if there's a defined compiler error for running out of identifiers!)

  • Anonymous
    May 31, 2012
Fascinating. Just after reading this post, I read an article about UEFI secure boot: mjg59.dreamwidth.org/12368.html That article describes a situation with some similarities to the one described by your customer, but has an interesting wrinkle that renders your solution untenable: the parties providing and signing the code may be mutually untrusting. In the Linux scenario, Linux distributors need to rely on Microsoft to sign their bootloaders for them to be able to boot on Windows 8 certified machines*, due to UEFI secure boot. But they may be unwilling to put blind faith in Microsoft not to have tampered with the binaries. I don't think it's actually a problem in that case, because I think Microsoft doesn't actually need to compile or even see the code - they just provide a key while retaining the ability to revoke it later should it be found to be used by malware - but it's an interesting problem nonetheless: in the absence of a deterministic compiler, how can two mutually distrustful parties come to agreement that a single binary is the result of compiling the same source code?

    • Unless they mess around in the BIOS to disable secure boot (unpleasant for normal users) or register their own personal signing keys (extremely unpleasant for most users).
  • Anonymous
    May 31, 2012
Nondeterministic builds would be problematic in regulated environments. I know one company that compiles their products on three separate build machines in parallel and verifies that all the binaries agree (which involves masking out timestamps).  Their system has caught problems with leaks in what was supposed to be a hermetic build environment, bugs in the process that fetches code from source control, and a flaky hard drive in one of the build lab machines. They also routinely pull an old version from source control, build it, and compare to the binaries built in the past.  These ongoing audits ensure their processes are working as intended.  (It's like routinely restoring a file from a backup, just to make sure your backups are still working.) If you spend a tremendous effort in a final validation of a particular binary, it's nice to know that you're shipping what you validated and that you can reproduce it if necessary.

  • Anonymous
    May 31, 2012
    The issue of code provenance is an important one in security and, at some point, compiler vendors are going to have to support some method of validating that a particular binary is the result of a particular compiler transforming a particular set of source code. What I expect is needed is to have a secure source control system that can produce an archive of a set of source code with an associated label and sign it to certify that it can guarantee that any pull of that label will always produce that identical archive. Then the compiler needs to be able to sign the executable, baking in its own signature and that of the source code. Of course, the environment in which the secure source code control system and build system run must be able to guarantee that they have not been tampered with. So if you want to securely rely that if a person certified that a certain set of source code was approved then a particular executable is approved, you need to build an entire secure infrastructure for source control and build. Not trivial. Lots of moving parts. But it will probably be required within a few years.

  • Anonymous
    May 31, 2012
Over the years I have worked with (typically government, especially military) projects that had the "repeatable except for well-defined differences" requirement on compilation/builds. In a few cases this dictated the technologies that could be used [meaning it eliminated .NET], which IMHO is quite unfortunate as the project really could have benefited from a managed language platform. I believe there is real value (in very specific circumstances) to a robust repeatable build. In such an environment, source code changes that did not impact the semantics would have no impact on the output. Regardless of file processing order, order of "blocks" within a file (think running ReSharper "code cleanup" to re-org to a fixed pattern), or addition/deletion/modification of comments, the same output [except for well-defined items such as the MVID] would result.

  • Anonymous
    May 31, 2012
[followup to previous post].... In certain real-time environments there is another consideration: different memory layouts will have an impact on the cache performance of the cpu/system. For 99.9999% of applications this should not be an issue, but it did burn me about 3 years ago.


  • Anonymous
    May 31, 2012
    Glad you are back, Eric! I had two theories: you were really busy or on a long vacation :) I follow every question you answer and every comment you make on StackOverflow and this hiatus had been killing me! What you write always makes me see many new angles and ways to look at a problem, and, I believe, has made me evolve as a professional. Nice article, I always enjoy the ones like this because there is no other place on the web where we can get to know stuff like this about the C# compiler. The title misled me :)

  • Anonymous
    May 31, 2012
    The government request immediately reminded me of the classic Reflections on Trusting Trust by Ken Thompson (cm.bell-labs.com/.../trust.html).


  • Anonymous
    June 02, 2012
@Jeroen Mostert For JITed languages it gets even worse: no guarantees that the JIT won't introduce some security risks (and considering how complicated those things can get...), and since performance is much more important there than for your usual compiler, the JIT (at least for JVMs) will compile methods in parallel and in a non-deterministic manner (butterfly effect; there are so many little counters that influence compilation behavior, and with all those threads running in parallel, chances are some timing will change from run to run).

  • Anonymous
    June 04, 2012
Similar to David's comment, back when I was working with C++, when discussing the same question, I was told that due to the different amounts of memory available on different machines, the C++ compiler may do different optimization analysis and yield different object code from the same source code. Is that also a concern in C#? And referring to Jeroen's comment, I recall a story about the AT&T UNIX source, which had a backdoor password hard-coded into it. To keep it a secret, a different string was used in the source code; the AT&T C compiler would look for that string and then substitute the actual string in its place.

  • Anonymous
    June 04, 2012
    Would the change of compiler generated identifiers have impact on the name of the backing store for auto-implemented properties? That makes me worry about deserialization of old objects...

  • Anonymous
    June 07, 2012
Compiling the same source and project into an assembly should produce identical results except for possibly the assembly header. We have that requirement for our C++ DLLs for our business applications. This greatly helps us with cumulative patching of our applications. We can binary compare all of the DLLs laid down in a base, base + patch level X, and base + patch level X + 1, and ensure that only the DLLs that should be modified are modified. Binary bit-by-bit identical output is needed in many environments (device drivers, operating systems, government, etc.).

  • Anonymous
    June 17, 2012
    Reading this, I was reminded of my Calculus courses and the concept that an antiderivative for a given function is not unique, but unique up to some constant. Probably a gross simplification, but interesting--to me--all the same.

  • Anonymous
    June 24, 2012
Identical output for given input would also be useful in testing. Say you produce some code and test it, and later on you refactor it in a minor way (renaming non-public variable names, adding comments, etc.). If the refactored version produced binary output identical to the previous version, then you wouldn't need to retest it.

  • Anonymous
    July 01, 2012
Would it be possible, via some compiler setting, to compile the code in the traditional way (single-threaded), at least in exceptional cases such as an audit, to help with comparison (except for the GUID)?

  • Anonymous
    July 02, 2012
    Are the outputs the same up to arbitrary identifiers (and timestamps)?  Would it be a burden to make that guarantee?