What does the optimize switch do?

I was asked recently exactly what optimizations the C# compiler performs when you specify the optimize switch. Before I answer that, I want to make sure that something is perfectly clear. The compiler’s “usage” string is not lying when it says:

/debug[+|-]     Emit debugging information
/optimize[+|-]  Enable optimizations

Emitting debug information and optimizing the generated IL are orthogonal; they have no effect on each other at all (*). The usual thing to do is to have debug on and optimizations off, or vice versa, but the other two combinations are perfectly legal. (And while I’m at it, turning debug information generation on does not also do /d:DEBUG; that’s also orthogonal. Again, the sensible thing to do is to keep /debug and /d:DEBUG in sync, but you do not have to; you might want to debug the optimized version without assertions and we’re not going to stop you.)
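
To make the third kind of orthogonality concrete, here is a minimal sketch; the class, method and file names are made up for illustration. Whether the DEBUG symbol is defined controls both #if DEBUG blocks and calls to [Conditional("DEBUG")] methods, no matter what /debug and /optimize are set to.

```csharp
using System;
using System.Diagnostics;

static class SwitchDemo
{
    // Hypothetical helper: returns true only when compiled with /d:DEBUG.
    public static bool DebugSymbolDefined()
    {
#if DEBUG
        return true;
#else
        return false;
#endif
    }

    // Calls to this method are dropped at the call site unless DEBUG is defined.
    [Conditional("DEBUG")]
    public static void Trace(string message)
    {
        Console.WriteLine("trace: " + message);
    }

    static void Main()
    {
        Trace("only printed when compiled with /d:DEBUG");
        Console.WriteLine("DEBUG defined: " + DebugSymbolDefined());
    }
}
```

Compiling this with, say, /debug+ /optimize+ and no /d:DEBUG should print only the second line, while adding /d:DEBUG to any of the four debug/optimize combinations should bring the trace line back.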

From this point on, when I say “emitting” I mean “producing metadata”, and when I say “generating” I mean “producing IL”.

So, first off, there are some optimizations that the compiler always performs, regardless of the optimize flag.

The language semantics require us to do constant folding because we need to detect that this is an error:

switch(x) {
case 2:
case 1+1:

The compiler will replace usages of constant fields with the constant, but does not perform any more sophisticated constant propagation, even though the compiler has a reachability analyzer.
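
As a small illustration of folding and constant substitution, consider the sketch below; the names are hypothetical. The const fields and the constant expressions in the case labels are all evaluated at compile time, which is what lets the compiler reject duplicate labels.

```csharp
using System;

static class FoldingDemo
{
    const int One = 1;
    const int Two = One + One;   // folded to 2 at compile time

    public static string Describe(int x)
    {
        switch (x)
        {
            case Two:            // usage of a constant field, same as "case 2:"
                return "two";
            case Two + One:      // constant expression, folded to 3
                return "three";
            // Adding "case 1 + 1:" here would be rejected as a duplicate label.
            default:
                return "other";
        }
    }

    static void Main()
    {
        Console.WriteLine(Describe(2)); // prints "two"
        Console.WriteLine(Describe(3)); // prints "three"
    }
}
```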

Speaking of which, there are two kinds of reachability analysis we perform for dead code elimination. First, there is the “according to spec” reachability. The specification calls out that reachability analysis consumes compile-time constants. This is the reachability analysis we use to give “unreachable code” warnings, to determine whether local variables are definitely assigned on all code paths, and to determine whether the end point of a non-void method is reachable. If this first pass determines that code is unreachable then we never generate IL for it. For example, this is optimized away:

if (false) M();

But if the expression involves non-compile-time constants then this first form of reachability analysis would tell you that the call to N here is reachable, and therefore is a definite assignment error:

int x = M();
int y;
if (x * 0 != 0) N(y);

If you fixed that up so that y was definitely assigned, then the first-pass reachability analyzer would NOT trim this code because it still thinks that the call is reachable.

But clearly you and I know that the call is not reachable, because we know more about arithmetic than the specified reachability analyzer. We perform a second pass of reachability analysis during code generation which is much smarter about these things. We can do that because the second pass only affects codegen, it does not affect language semantics, error reporting, and so on.
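
Here is the fixed-up version of that example as a runnable sketch, with hypothetical bodies for M and N. The specified analysis treats the call to N as reachable because x is not a compile-time constant, so y has to be definitely assigned; at run time the call can never happen.

```csharp
using System;

static class ReachabilityDemo
{
    static bool nCalled;

    static int M() { return 42; }
    static void N(int value) { nCalled = true; }

    // Returns true if N was ever called.
    public static bool Run()
    {
        nCalled = false;
        int x = M();
        int y = 0;             // assigned, so the definite assignment check passes
        if (x * 0 != 0) N(y);  // reachable according to the spec, but never executes
        return nCalled;
    }

    static void Main()
    {
        Console.WriteLine(Run()); // prints "False"
    }
}
```

By contrast, writing if (false) N(y); in the same place would be caught by the first pass: you would get an "unreachable code" warning and no IL for the call.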

The existence of that second pass implies that we do simple arithmetic optimizations on expressions which are only partially constant. For example, if you have a method M that returns an integer, then code like

if (M() * 0 == 0) N();

can be generated as though you’d written just:

M();
N();

We have lots of simple number and type algebra optimizers that look for things like adding zero to an integer, multiplying integers by one, using “null” as an argument to “is” or “as”, concatenation of literal strings, and so on. The expression optimizations always happen, whether you’ve set the optimize flag or not; whether basic blocks that are determined to be unreachable after those optimizations are trimmed depends on the flag.
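
The important property of these rewrites is that side effects survive. The sketch below, with hypothetical logging versions of M and N, shows the behaviour any legal code generation has to preserve: M still runs even though its result cannot affect the condition.

```csharp
using System;
using System.Collections.Generic;

static class AlgebraDemo
{
    public static readonly List<string> Log = new List<string>();

    static int M() { Log.Add("M"); return 7; }
    static void N() { Log.Add("N"); }

    public static string Run()
    {
        Log.Clear();
        if (M() * 0 == 0) N();   // may be generated as though it were: M(); N();
        return string.Join(",", Log);
    }

    static void Main()
    {
        Console.WriteLine(Run()); // prints "M,N"
    }
}
```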

We also perform some small optimizations on some call sites and null checks. (Though not as many as we could.) For example, suppose you have a non-virtual instance method M on a reference type C, and a method GetC that returns a C. If you say GetC().M() then we generate a callvirt instruction to call M. Why? Because it is legal to generate a callvirt for an instance method on a reference type, and callvirt automatically inserts a null check. The non-virtual call instruction does not do a null check; we'd have to generate extra code to check whether GetC returns null. So that's a small code size optimization. But we can do even better than that; if you have (new C()).M(), we generate a call instruction because we know that the result of the "new" operator is never null. That gives us a small time optimization because we can skip the nanosecond it takes to do the null check. Like I said, it's a small optimization.
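
The null check is observable from C#. In this sketch (C, GetC and the demo class are illustrative names), M is non-virtual and never touches "this", yet calling it on a null reference still throws, as the language requires.

```csharp
using System;

class C
{
    // Non-virtual, and does not use "this" at all.
    public string M() { return "hello"; }
}

static class CallDemo
{
    // Hypothetical factory that happens to return null.
    static C GetC() { return null; }

    public static string CallOnNull()
    {
        try
        {
            return GetC().M();   // the receiver is null-checked before M runs
        }
        catch (NullReferenceException)
        {
            return "NullReferenceException";
        }
    }

    public static string CallOnNew()
    {
        return (new C()).M();    // the result of "new" is never null
    }

    static void Main()
    {
        Console.WriteLine(CallOnNull()); // prints "NullReferenceException"
        Console.WriteLine(CallOnNew());  // prints "hello"
    }
}
```

Which instruction was actually generated for each call site can be checked by looking at the compiled assembly in a disassembler such as ildasm.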

The /optimize flag does not change a huge amount of our emitting and generation logic. We try to always generate straightforward, verifiable code and then rely upon the jitter to do the heavy lifting of optimizations when it generates the real machine code. But we will do some simple optimizations with that flag set. For example, with the flag set:

  • Expressions which are determined to be only useful for their side effects are turned into code that merely produces the side effects.
  • We omit generating code for things like int foo = 0; because we know that the memory allocator will initialize fields to default values.
  • We omit emitting and generating empty static class constructors. (Which typically happens if the static constructor set all the fields to their default value and the previous optimization eliminated all of them.)
  • We omit emitting a field for any hoisted locals that are unused in an iterator block. (This includes the case where the local in question is used only inside an anonymous function in the iterator block, in which case it is going to become hoisted into a field of the closure class for the anonymous function. No need to hoist it twice if we don’t need to.)
  • We attempt to minimize the number of local variable and temporary slots allocated. For example, if you have:

for (int i = …)  {…}
for (int i = …) {…}

then the compiler could generate code to re-use the local variable storage reserved for i when the second i comes along. (We eschew this optimization if the locals have different names because then it gets hard to emit sensible debug info, which we still want to do even for the optimized build. However, the jitter is free to perform this optimization if it wishes to.)

  • Also, if you have a local which is never used at all, then there is no storage allocated for it if the flag is set.

  • Similarly, the compiler is more aggressive about re-using the unnamed temporary slots sometimes used to store results of subexpression calculations.

  • Also, with the flag set the compiler is more aggressive about generating code that throws away “temporary” values quickly for things like controlling variables of switch statements, the condition in an “if” statement, the value being returned, and so on. In the non-optimized build these values are treated as unnamed local variables, loaded from and stored to specific locations. In the optimized build they can often be just kept on the stack proper.

  • We eliminate pretty much all of the “breakpoint convenience” no-ops.

  • If a try block is empty then clearly the catch blocks are not reachable and can be trimmed. (Finally blocks of empty tries are preserved as protected regions because they have unusual behaviour when faced with certain exceptions; see the comments for details.)

  • If we have an instruction which branches to LABEL1, and the instruction at LABEL1 branches to LABEL2, then we rewrite the first instruction as a branch straight to LABEL2. Same with branches that go to returns.

  • We look for “branch over a branch” situations. For example, here we go to LABEL1 if condition is false, otherwise we go to LABEL2.

    brfalse condition, LABEL1
    br LABEL2
    LABEL1: somecode

    Since we are simply branching over another branch, we can rewrite this as simply "if condition is true, go to LABEL2":

    brtrue condition, LABEL2
    somecode

  • We look for “branch to nop” situations. If a branch goes to a nop then you can make it branch to the instruction after the nop.

  • We look for “branch to next” situations; if a branch goes to the next instruction then you can eliminate it.

  • We look for two return instructions in a row; this happens sometimes and obviously we can turn it into a single return instruction.
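
To give a feel for how simple these branch rewrites are, here is a toy sketch of the "branch to a branch" case. It is not the compiler's actual code; the Instr type, the opcode strings and the array-index targets are all assumptions made for illustration.

```csharp
using System;

static class BranchThreadingDemo
{
    // Hypothetical toy instruction: an opcode name plus a target index (-1 if none).
    public struct Instr
    {
        public string Op;
        public int Target;
        public Instr(string op, int target) { Op = op; Target = target; }
    }

    // For every branch, skip over any chain of unconditional "br" instructions
    // sitting at its target, so that it jumps straight to the final destination.
    public static void ThreadBranches(Instr[] code)
    {
        for (int i = 0; i < code.Length; i++)
        {
            if (code[i].Target < 0) continue;
            int target = code[i].Target;
            int hops = 0;
            // The hop limit guards against cycles such as "L: br L".
            while (code[target].Op == "br" && hops < code.Length)
            {
                target = code[target].Target;
                hops++;
            }
            code[i].Target = target;
        }
    }

    static void Main()
    {
        Instr[] code =
        {
            new Instr("brfalse", 2), // 0: branches to LABEL1
            new Instr("call", -1),   // 1
            new Instr("br", 4),      // 2: LABEL1, branches on to LABEL2
            new Instr("call", -1),   // 3
            new Instr("ret", -1),    // 4: LABEL2
        };
        ThreadBranches(code);
        Console.WriteLine(code[0].Op + " " + code[0].Target); // prints "brfalse 4"
    }
}
```

The other rewrites in the list (branch over a branch, branch to nop, branch to next) can be written as similar small passes over the instruction list.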

That’s pretty much it. These are very straightforward optimizations; there’s no inlining of IL, no loop unrolling, no interprocedural analysis whatsoever. We let the jitter team worry about optimizing the heck out of the code when it is actually spit into machine code; that’s the place where you can get real wins.


(*) A small lie. There is some interaction between the two when we generate the attributes that describe whether the assembly is Edit’n’Continuable during debugging.

Comments

  • Anonymous
    June 11, 2009
    An awful lot of these optimizations (maybe all) are language-independent. Do you share the optimizer with VB? Indeed, it looks like you could write an optimizer that reads IL and writes better IL. Other than compilation efficiency, is there any reason this isn't done?
    You are not the first person to notice this. :-) We could do that. Or, one of the things we are considering for future versions of C# and VB is to have the compilers output a common "semantic tree" format which can be fed into a common metadata emitter and/or IL generator. However, it's not yet clear whether this would be a net win from a cost-vs-benefit analysis. We have two IL generators that work perfectly well; why throw away much of that existing, debugged code and spend a few months writing a common infrastructure in order to achieve a false economy of shared code? It seems potentially penny-wise, pound-foolish. We'll make the decision that is right for the business case. -- Eric

  • Anonymous
    June 11, 2009
    I apologize for a comment about the format of the post, rather than about the post itself, but... Very often in this blog, I find that once text formatted as "code" appears, all of the text following that remains formatted as "code" rather than reverting back to the normal format. I looked at the HTML source, and I can see that although most of the HTML is valid XML, with each <p> tag having a corresponding </p> close tag, when the <span class="code"></span> shows up, the <p> element is NOT properly closed. In this particular post, the example would be the first section of text where the compiler options are described.  The entire paragraph is contained by a <span class="code"> element, and there is an open <p> tag, but no close <p> tag. Whose bug is this?  I'm not sure.  The DOCTYPE for the page isn't XHTML, so technically a <p> tag need not have a close tag.  On the other hand, when a <p> tag doesn't have a close tag, I believe that the <p> element is considered to continue until the next block element, which in this case is the next <p> element.  Because the <p> tag for the compiler options text is inside the <span class="code"> element, this means that logically, the <p> element and <span> element overlap, which isn't allowed even in HTML (unclosed elements are allowed, but otherwise the same hierarchical rules apply). To further confuse the issue, some browsers figure it out (e.g. Opera) and others don't (e.g. Safari).  Presumably in the case where the text is displayed as expected, the HTML parser auto-closes any unclosed elements contained within an element that is being closed. Anyway, I thought I'd mention this, in the hopes that it's simple to fix the problem. I have no idea how the HTML is actually being generated, and I suppose if it's auto-generated by some utility, it might be hard to request and/or have implemented a bug-fix so that the <p> element is correctly closed even when contained within a <span> element. 
    But, if it's being hand-generated, or generated by something that is easily fixed, it sure would be helpful to those of us using certain browsers. :) Thanks!

  • Anonymous
    June 11, 2009
    @pete.d: I think it parses it as <span> <p>code</p> <p>text</p> ..... and a </span> at the end. Now about the post: isn't callvirt more 'complicated', because it needs to use the right method in the virtual whatchamacallit-tables (vtables?), thus making the optimization of using callvirt instead of call counter-productive? (I assume the answer is 'no, it's not', otherwise you wouldn't make this optimization, but I'd like to know why not.) [[ "callvirt" takes a method token as its argument; the jitter looks up the method token in the metadata tables to see if it is actually a virtual method. If it is, then it generates a null check and a vtable call; if not, then it generates a null check and a direct call. -- Eric ]]

  • Anonymous
    June 11, 2009
    I've always developed using debug, no optimisation and deployed w/o debug and optimised (because it's the 'Right Thing To Do'). However, I've never really tested the performance characteristics, since there was basically no cost in doing the optimisation. However, reading your post, it sounds like the /optimise flag doesn't do much optimising at all. Is there much of a difference for the majority of cases? If so, are the gains spread across the scenarios, or is there one or two optimisations that give the majority of the speedups? Interesting post, by the way.

  • Anonymous
    June 11, 2009
    Configurator.... I think it quite clearly explained that the JIT makes the determination if "CallVirt" is actually "virtual" (in the sense of using the "v-table") or is a direct call. Travis.... Hopefully you "oversimplified"... Shipping a different build configuration [NoDebug/Optimized] than the one you subject to all of your testing [Debug/NoOptimize] is dangerous. Years back I established a policy of doing all work in the "shipping" configuration [I do not call it "release" since that is just a name, and does not describe the actual settings], and only switching to a "diagnostic" [again I do not call it "Debug" for the same reason] if there is a bug that I simply cannot pin down in a reasonable time. The amount of grief this has saved me over the years [I did the same thing with C++ and with C before that] is priceless.

  • Anonymous
    June 11, 2009
    > That’s pretty much it. These are very straightforward optimizations; there’s no inlining of IL, no loop unrolling, no interprocedural analysis whatsoever. We let the jitter team worry about optimizing the heck out of the code when it is actually spit into machine code; that’s the place where you can get real wins.
    I would have thought that it would make more of a difference to do these types of optimizations in the IL - the less the jitter needs to do, the faster things can start up. These optimizations can't be free.
    But the C# compiler knows nothing about the target environment. The optimizations you want to make on compact framework running on a cell phone are very different than the optimizations you want to make on a 64 bit server. So let the jitter worry about it; the jitter knows how to optimize for the current environment. -- Eric

  • Anonymous
    June 11, 2009
    WOW!
    > We attempt to eliminate generation of "protected" regions. For instance, if a try block is empty then clearly the catch blocks are not reachable and the finally can just be an ordinary code region.
    This is dangerous! In fact that is how some people create "protected" regions in the first place. Code that is executed within finally block will not be interrupted by Abort exception for example.
    You are correct. I was misremembering. We eliminate catch blocks, but the finally blocks live on as finally blocks. I've corrected the text; thanks for the note. -- Eric

  • Anonymous
    June 11, 2009
    "We attempt to eliminate generation of "protected" regions. For instance, if a try block is empty then clearly the catch blocks are not reachable and the finally can just be an ordinary code region." I've seen code like this in Microsoft's own implementation of the ASP.NET Cache: try{}
    finally
    {
     Monitor.Enter(this._lock);
     acquiredLock = true;
    }
    Does this optimization mean that the above code might not run the "finally" block in a protected region?
    See above. -- Eric

  • Anonymous
    June 11, 2009
    +1 ThreadAbortExceptions do not interrupt finally blocks so don't replace finally blocks with ordinary code regions.

  • Anonymous
    June 11, 2009
    >Or, one of the things we are considering for future versions of C# and VB is to have the compilers output a common "semantic tree" format which can be fed into a common metadata emitter and/or IL generator.
    Now that you mention it, I remembered that .NET 4.0 beta1 has a whole lot of goodies in System.Linq.Expressions - it's owned by the C# team, isn't it? My first impression after looking at it is that it actually pretty much covers the "semantic tree" for a single method, including things such as lambdas with all the nitty-gritty done behind the scenes. The only thing that seems to be missing in terms of C# feature coverage there is iterator methods. And it can generate output into a MethodBuilder too, so it's almost a compiler writer's toolbox... ... almost - because it can only generate static methods (why oh why?), and seemingly cannot generate a bunch of cross-calling methods at a time.
    To generate non-static methods we need to generate them on some class. What class? We only have expression trees and statement trees; what we need are declaration trees. I would very much like to have the ability to represent types as trees. Then we would have a "compiler writer's toolbox" as you say. Whether we will get to that point or not, I don't know; guessing would be speculation about the feature sets of unannounced projects that have no budgets at this time. That kind of guessing is seldom productive. -- Eric

  • Anonymous
    June 11, 2009
    I admit, I don't get the concern about ThreadAbortException. Personally, I would have been happy if Thread.Abort(), Thread.Suspend(), and Thread.Resume() had just been left out of .NET altogether. There are so many other ways that Thread.Abort() can produce unpredictable results, I can't say that I'd be all that worried about the code in a finally block winding up unprotected. Now, if there's some other important benefit to protected code regions I'm overlooking, then perhaps that's a reason for concern. But I'm surprised anyone, never mind three people, care about "correctness" in the context of a ThreadAbortException. IMHO, by definition any code using ThreadAbortException is incorrect, regardless of what the compiler's doing. :p
    I don't like thread aborts either. But wishing doesn't make them go away. We've got to write a code generator for a language that targets the environment we're given, and the environment we're given has somewhat goofy thread abort semantics. -- Eric

  • Anonymous
    June 11, 2009
    @Thomas Fields are initialized to default values, but I'm not sure the same is guaranteed to happen for local variables, since they are allocated to the stack.
    If you ever wish to become sure, then I encourage you to read the CLI spec, Partition II, section 24.4.4 and Partition III, section 1.8.1.1. -- Eric

  • Anonymous
    June 11, 2009
    "because we know that the result of the "new" operator is never null" - well, there is a corner-case... - see: http://stackoverflow.com/questions/194484#194671

  • Anonymous
    June 12, 2009
    "why are structs required to have a public parameterless constructor instead of requiring that the programmer initializes the structs himself" structs are not allowed to have parameterless constructors, this is a good thing since structs can be created in their 'blank' state by the creation of either default(T) or new T[]. This will not call a constructor so it avoid confusion. If you really need a constructor that does something similar consider having a private one that takes object and pass null to it.

  • Anonymous
    June 12, 2009
    Eric, are you planning to also mention emitting the Debuggable attribute with different values? This, of course, affects the JIT a lot. Or are you just planning a new post?

  • Anonymous
    June 12, 2009
    @Pop Catalin You're absolutely right about the CLR needing support for it if it were to be this way. Obviously this is not a simple change to make now, and I don't think anyone should bother (if people would be making changes to the CLR type system, I'd rather have true non-nullable reference types instead of subtly different structs) but I'm wondering why it wasn't designed like this in the first place. It is obvious that it is necessary for the CLR to be allowed to construct a struct with its fields initialized to defaults, if you want to be able to declare a variable of any struct type without initializing the variable. But I don't see why that is necessary. What is the point of allowing you to declare but not initialize a struct variable? Why would you allow the use of unassigned variables at all? @ShuggyCoUk >structs are not allowed to have parameterless constructors That's not true. You are not allowed to define parameterless constructors, but all structs have them implicitly. >If you really need a constructor that does something similar consider having a private one that takes object and pass null to it. How would that help me prevent the creation of a struct with its fields initialized to default values?

  • Anonymous
    June 12, 2009
    @ Joren's "why are structs required to have a public parameterless constructor" Allow me to restate the core point already suggested by Pop Catalin and ShuggyCoUk. This goes back to Eric's recent posts about the essential difference between value and reference types, doesn't it? Structs are value types, so even having a reference to one requires an initialized instance, hence the requirement of a parameterless constructor. Null reference semantics is a luxury reserved for reference types, by the nature of the difference between value and reference. This is why Nullable<> must exist, i.e., why null reference semantics can't be given to value types inherently.

  • Anonymous
    June 12, 2009
    @Joren "What is the point of allowing you to declare but not initialize a struct variable?" I think you wanted the subjunctive, i.e., what *would* be the point of that, since it's hypothetical. In fact, by the nature of value types, declaration of a struct variable entails initialization. Cf Eric's recent posts: http://blogs.msdn.com/ericlippert/archive/2009/04/27/the-stack-is-an-implementation-detail.aspx http://blogs.msdn.com/ericlippert/archive/2009/05/04/the-stack-is-an-implementation-detail-part-two.aspx You're asking to "prevent the creation of a struct with its fields initialized to default values". This would be a breaking change to fundamental idioms. To take a concrete example, consider an array of value types vs an array of reference types: var a = new MyValueType[1]; var b = new MyReferenceType[1]; Each array's item is initialized to the default value appropriate to its type. For b[0] that's simply null, but a[0] holds an instance of MyValueType, right from the get-go. Where is that instance going to come from, if not a default constructor?
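
    The array example in the comment above can be run as a small sketch. MyValueType and MyReferenceType are the commenter's placeholder names; the X fields and the demo class are added here purely for illustration.

```csharp
using System;

struct MyValueType { public int X; }
class MyReferenceType { public int X; }

static class ArrayDefaultsDemo
{
    // The element of a new struct array is a zeroed instance, not null.
    public static int ValueElement()
    {
        var a = new MyValueType[1];
        return a[0].X;
    }

    // The element of a new class array is a null reference.
    public static bool ReferenceElementIsNull()
    {
        var b = new MyReferenceType[1];
        return b[0] == null;
    }

    static void Main()
    {
        Console.WriteLine(ValueElement());           // prints "0"
        Console.WriteLine(ReferenceElementIsNull()); // prints "True"
    }
}
```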

  • Anonymous
    June 12, 2009
    > To generate non-static methods we need to generate them on some class. What class? We only have expression trees and statement trees; what we need are declaration trees. Eric, I was specifically talking about Expression.CompileToMethod(MethodBuilder) - http://msdn.microsoft.com/en-us/library/dd728224(VS.100).aspx - which already gets the MethodBuilder for a particular TypeBuilder. So I can produce the "declaration tree" using TypeBuilder/MethodBuilder as needed, and then "fill in" the methods with Expression.CompileToMethod. The problem is that:

  1. It explicitly checks that MethodBuilder is for a static method, and throws otherwise (not sure why, since I don't see why the generated code would be any different - it would just ignore the implicit receiver).
  2. The Expression API does not provide any way to reference "this" in the tree.
    Admittedly, I might well be trying to use this for a purpose it wasn't even remotely intended for, and the limitations I'm running into aren't really limitations for the real scenario. But I'm genuinely curious who this overload is intended for, then. Is it there just so that we can compile expression trees to AppDomains other than the current one?

  • Anonymous
    June 12, 2009
    Thank you, Eric.  Sorry for not doing my homework. I thought the answer would be more complicated and was not sure where to look. To complement what you said, the C# 3.0 Specification says in 7.4.4 Function member invocation: The value of E is checked to be valid. If the value of E is null, a System.NullReferenceException is thrown and no further steps are executed.

  • Anonymous
    June 12, 2009
    What about things like common subexpression elimination and loop invariant hoisting? Those seem like things that the JITter could do, but would be applicable to all environments so the compiler could do it.
    Very few "optimizations" are guaranteed to always be improvements. Both of your examples trade increased use of temporary slots for decreased time: the classic space vs time tradeoff. Some CPUs have huge numbers of available registers, so generating more temporaries is usually a win. But on CPUs with a small number of available registers (a large number of our users use computers with x86 architecture!) sometimes generating more temporaries means that you need to move things that would have otherwise always been in registers onto the stack, and then back into registers again later. If the total cost of those data moves is larger than the cost of doing the calculation multiple times, then this optimization just made the codegen worse. We do not know what the register architecture is going to be when we generate the IL; therefore we leave decisions like that up to the jitter. -- Eric

  • Anonymous
    June 12, 2009
    "We let the jitter team worry about optimizing the heck out of the code..." That seems like another small lie as the C/C++ compiler appears to apply much more aggressive optimizations than the jitter. I'm not talking C/C++ specific stuff either. It's a problem the jitter teams seems to face... they are tasked with loading the assembly as fast as possible AND produce as optimal code as possible. That's a Win/Lose proposition.

  • Anonymous
    June 12, 2009
    >The non-virtual call instruction does not do a null check; we'd have to generate extra code to check whether GetC returns null.
    If only the Common Type System had some sort of non-nullable reference type there would be no need for such a trick. Every .NET developer struggles every day with NullReferenceException, while it could have been solved statically at compile time :o/ http://codebetter.com/blogs/patricksmacchia/archive/2007/07/25/i-want-non-nullable-types-in-c-4.aspx

  • Anonymous
    June 13, 2009
    Is there ever a situation where '/debug-' is desirable? Why does the '/debug' switch exist?

  • Anonymous
    June 15, 2009
    @Joren - I missed the reference-type note, thanks. @Pop Catalin - indeed, I'm very much aware of this; but the perception of this behaviour (especially with the GetType() boxing) can easily be deceptive. Re the whole constructor/not thing on value-types, isn't that one of those odd areas where the CLI and C# specs have different view, but just agree to disagree? Or has my memory gone to putty...?

  • Anonymous
    June 15, 2009
    @Joren, re no default constructor for structs. Another variant on the above answers is that structs are value types, and all value types can be zero. The byproduct of which means the CLR zeroes them for you when you create an array of them. You can also think of the zero-struct as the same thing as a null pointer if it makes you happy. i.e. string[] arr = new string[100]; creates an array of string references. It doesn't call 100 string constructors either. You have to then set all the instances yourself. It's not really the case, but it gets the point across. A side benefit to the CLR writers is that it's really easy to zero out a chunk of the (underlying) memory, and the CLR guarantees you don't get uninitialized data (does it, or is it the compiler's contract, I don't know which, whatever). While these may not be the published reasons, they're still consistent with the way the world is.
    enum Silly {
      One = 1,
      Two = 2
    }
    It's like enums. Despite being first-class citizens, the underlying storage is an int (value type) and the default is 0. So if you have an uninitialized enum as a field, the CLR will, by default, initialize your enum field to zero like it says it will. However, like the Silly enum above, you won't be able to test your enums in a switch because you have no value which corresponds to 0. You win some, you lose some. You end up with the best-practice rule of 'always have a 0 value for an enum as a sensible default or suffer'.

  • Anonymous
    June 17, 2009
    JITs have the capability to fast load, and excessively optimize, at the same time. Well, not actually at the same time; however, JITs can, theoretically, load a rough version and optimize the code in a background task. I heard that the CLI already does that, but I'm not exactly sure.

  • Anonymous
    August 12, 2009
    I realise the blog post is old, I hope someone gets it :) I was wondering why you would not use both /debug and /optimize at the same time to get the best of both worlds? Does using /debug with /optimize produce less optimal code than if /debug were omitted? Why wouldn't you want to use /optimize in the first place? Cheers

  • Anonymous
    December 09, 2011
    @Rudy: I'm late (!) to the party, so you probably won't see this. But you can use 0 in your switch:
    switch (sillyValue) {
       case Silly.One:
           break;
       case Silly.Two:
           break;
       case 0:
           break;
    }