Not as easy as it looks

My colleague Kevin works on (among many other things) the refactoring engine in the C# IDE. He and I were at the end of last year discussing the possible cases for a hypothetical "eliminate variable" refactoring. I thought that it might be of interest to you guys to get a glimpse of the sorts of issues that the IDE team faces on a daily basis.

First off, what is the "eliminate variable" refactoring? It is seemingly straightforward, but as we'll see, appearances can be deceiving. Suppose you have a local variable that is initialized in its declaration, never written to otherwise, and you have only one place where that variable is used. (A generalization of this refactoring is to replace every usage; for simplicity, let's consider only the single usage case.) For example:

void M()
{
int x = 2 + R();
Q();
N(x);
}

You can eliminate the variable by refactoring the code to:

void M()
{
Q();
N(2 + R());
}

Now, obviously this changes the meaning of the code. If Q() throws an exception then now R() is never called, whereas before it was always called. If Q() observed a side effect of R() in the first version, it does not after the refactoring. And so on; there are numerous ways in which this code is different. We assume that the user is all right with these changes; that's why they're using the refactoring.

How, at a high level, would you implement this? It seems straightforward:

  • Find the text of the initializer
  • Find the single usage (which the IDE already knows how to do somehow)
  • Textually replace the single usage with the initializer text
  • Erase the declaration and initialization of the variable

That last step might be slightly tricky if you have something like

int y, x = 2 + R();

because we have to make sure to erase the comma and the declaration and initialization of x, but not erase the "int y;". But that is not hard to deal with.

You also have to think a bit about comments. What should happen here?

// x is blah blah blah
int x = 2 /* blahblah */ + R(); // Blah!
Q();
N(x);

Should that become

// x is blah blah blah
/* blahblah */ // Blah!
Q();
N(2 + R());

Or should any of the three comments before, inside or after the declaration be inserted where x used to be used? Notice that if the one that comes after is inserted, it's going to have to be changed to an inline comment, or a newline is going to have to be inserted in the inside of the call. Also, a comment refers to a local variable that has been eliminated, which seems bogus.

Assume that we can solve these problems. Are we done? Let's apply our refactoring to a few test cases. In every case, we eliminate "x".

const short s = 123;
int x = s;
N(x);

becomes

const short s = 123;
N(s);

Is that right? Not necessarily. What if N has two overloads, one for int and one for short? Before we called N(int); now we are calling N(short)! This seems qualitatively different than merely reordering a possible side effect; now our refactoring is changing overload resolution, which seems wrong. Really we should be generating

const short s = 123;
N((int)s);

OK, is that the right refactoring? Let's keep on trying test cases.

int x = 123;
N(ref x);

becomes... what? We can't say N(ref 123) or even N(ref (int)123).

In this case, the refactoring engine has to fail. There is no way to eliminate the variable when what you're passing is a reference to the variable itself.

How about this?

int x = R() + S();
N(Q() * x);

becomes

N(Q() * R() + S());

Argh, we have now introduced a semantic change because of operator precedence. That should be

N(Q() * (R() + S()));

How about this?

Func<int, int> x = z=>z+1;
object n = x;

becomes

object n = z=>z+1;

Nope; you can't implicitly convert a lambda to anything other than a delegate or expression type. That's got to be

object n = (Func<int, int>)(z=>z+1);

Whew. What have we learned so far? To eliminate T x = expr, there are scenarios where you can use:

  • nothing at all; give an error
  • expr
  • (expr)
  • (T)expr
  • (T)(expr)

An exercise for the reader: find scenarios in which none of the above are good enough either. Can you find a scenario in which you have to say ((T)expr)?  How about ((T)(expr))? The latter was the worst I could find easily; are there any more that I missed?

Understandably, we would not want to implement this refactoring in a hypothetical future version of Visual Studio if every time you did an elimination of a variable, the compiler conservatively inserted four parens plus a cast; what we really need are heuristics which can tell us when each of the above forms is necessary. That is a surprisingly tricky problem in language analysis!

And these are just the issues that Kevin and I came up with in a few minutes of whiteboard doodling; it is entirely possible that on deeper analysis, we'll find more interesting problems.

Next time you use the refactoring tools in Visual Studio, think about everything that is happening behind the scenes to make those tools work seamlessly. Like playing the piano, the unseen effort required to make it look easy is enormous.

Comments

  • Anonymous
    January 12, 2011
    Has Kevin seen ReSharper's equivalent of "eliminate variable"?

  • Anonymous
    January 12, 2011
    All of which makes me even more impressed by tools like ReSharper that have dozens of refactorings like this. I'm always a little bit surprised when I run across another one of these edge cases and they get it right!

  • Anonymous
    January 12, 2011
    The comment has been removed

  • Anonymous
    January 12, 2011
    How about static void Main()
    {
      Funny a = null;
      Func<int> b = () => 1;
      Console.WriteLine(a + b);
      Console.WriteLine(a + ((Func<int>)(() => 1)));
    }
    class Funny {
      public static int operator +(Funny a, Func<int> b) { return 1 + b(); }
    } Yep, that's the same case that I came up with. Great minds think alike! - Eric

  • Anonymous
    January 12, 2011
    What about a generalized refactoring function to insert* or remove unnecessary parentheses and casts in a selected expression? This is something that could be useful on its own, and it could simply hand off to that after it's made its own version with ((T)(expr)) - though this could result in the undesirable effect of altering the text of expr. Are there any situations where this could result in needing to cast to an anonymous type?

  • yes, insert.

  • Anonymous
    January 13, 2011
    Seems to me that you'd start with ((T)(expr)) and then look to see which parts of it were redundant, and eliminate them. Having "eliminate redundant parentheses" and "eliminate redundant casts" as primitive operations within the refactoring engine seems like something that would be needed, or at least useful, for all kinds of refactoring operations (and possibly it'd be useful to expose them directly to end users, too).

  • Anonymous
    January 13, 2011
    Sometimes you need to rename variables, e.g. inline "f" here:        var f = (Func<int, int>) (x => x);        using (null)        {          var x = "sdfsdf";          var y = f(1);        }

  • Anonymous
    January 13, 2011
    @Khoth you code has error, because you cannot use "y" in M() before it is declared.

  • Anonymous
    January 13, 2011
    Another example, inline "numbers":      int[] numbers = {1,2,3};      Action<int[]> a = ints =>{};      a(numbers);

  • Anonymous
    January 13, 2011
    @Knoth giving it another thought, we can change your sample as follows, inline "x" here: class C
    {
     private int y = 2;
     private void M()
     {
       var x = (Func<int>) (() => y); 
       {
         int y = 4; // shadows field y
         int p = x();
       }
     }
    } It should change y in lamda to "this.y" Nice. Because the two simple names 'y' are not first used in overlapping declaration spaces, they are permitted to have different meanings. The refactoring causes the declaration spaces to overlap.

  • Anonymous
    January 13, 2011
    var x = 123; var y = new {x};

  • Anonymous
    January 13, 2011
    The comment has been removed

  • Anonymous
    January 13, 2011
    I tend to have the opposite problem of the first one stated. I find myself with void M() { Q(); N(2 + R()); } and I'll want to pull out 2+R(). Since I'm assigning it to a variable, I'd like it if the Visual Studio IDE atuomatically put the variable I just created as the parameter in the call to N(). Sure, that would create more work if I was trying to change the meaning of the code and wanted to call a parameterless N(), but I can't never recall a time where I wanted that, and I cut and pasted the old parameter. If I cut and paste a parameter into an assignment of a variable, that should give the IDE enough of a hint to put the variable in the spot of the parameter for me.

  • Anonymous
    January 13, 2011
    I don't know whether you'd count this, but we may need to include "checked" or "unchecked". For example, take:    unchecked
       {
           int z = X() + Y();
           checked
           {
               M(z + 1);
           }
           // Other code (just for simplicity so we can't completely remove the unchecked block)
       } If we want to extract z, I think we'd need:    unchecked
       {
           checked
           {
               M(unchecked(X() + Y()) + 1);
           }
           // Other code
       } That could probably be combined with the example from SSLaks. Nice one! - Eric

  • Anonymous
    January 13, 2011
    The comment has been removed

  • Anonymous
    January 13, 2011
    The comment has been removed

  • Anonymous
    January 13, 2011
    The comment has been removed

  • Anonymous
    January 13, 2011
    The comment has been removed

  • Anonymous
    January 13, 2011
    A somewhat coerced example, but replacing x here would fail since C as a dynamic can only be implicitly cast to int.  The variable cannot be eliminated.    class C : System.Dynamic.DynamicObject    {        public override bool TryConvert(ConvertBinder binder, out object result)        {            if (!binder.Explicit && binder.ReturnType == typeof(int))                {                    result = 3;                    return true;                }            return base.TryConvert(binder, out result);        }    }    class Program    {        static void Main(string[] args)        {            Action<int> A = i => Console.WriteLine("A(int) called.");            int x = (dynamic)(new C());            A(x); // works            // A ((dynamic)(new C())) and A((int)((dynamic)(new C()))) throw.        }    }

  • Anonymous
    January 13, 2011
    Eric, I question if this type of refactoring is something that you want to encourage.  Let's assume that your first two examples don't have any gotchas like side effects or exceptions thrown in Q.  I would still say that the functions are different from a readability/maintainability standpoint.  For example, I would write the first first function if 2 + R() has some meaning in and of itself.  I would write the 2nd function if 2+R() only means something to N() but doesn't have any meaning in another context. I'd be concerned that having Visual Studio do this automatically will encourage collapsing the first function into the second function all the time at the expense of readability.

  • Anonymous
    January 13, 2011
    The comment has been removed

  • Anonymous
    January 13, 2011
    Why do you need the extra parentheses around ((Func<int>)(() => 1)) in SSLaks's example? They seem redundant to me. Another example where you would need extra parentheses with functions is when they're applied: Func<int> x = () => 1; int y = x(); would turn into int y = ((Func<int>)(() => 1))();

  • Anonymous
    January 13, 2011
    Another example for the extra parentheses: implicit cast and call member class Bd {    public int Value;    public static implicit operator Bd(int value) { return new Bd { Value = value }; } } Bd x = 1 + 2; Console.WriteLine("The value is " + x.Value); would be turned into Console.WriteLine("The value is " + ((Bd)(1 + 2)).Value);

  • Anonymous
    January 13, 2011
    Here's another interesting case. int x = 42; var properties = new { x }; would need to be converted to var properties = new { x = 42 };

  • Anonymous
    January 13, 2011
    And last one for today: Expressions go more wrong than anything else. int x = 1; Expression<Func<int>> e1 = () => x; Expression<Func<int>> e2 = () => 1; These two are completely different.

  • Anonymous
    January 13, 2011
    I hope you compensate Kevin well. I'd have been fired by now for making too many tests fail.

  • Anonymous
    January 13, 2011
    The comment has been removed

  • Anonymous
    January 13, 2011
    Designing a language explicitly for perfect refactorability (instead of using a syntax compatible with existing mindsets) may be quite interesting. How far could the code editing process be shifted from traditional text editing towards snippet selection + refactoring?

  • Anonymous
    January 13, 2011
    continuing configurator's puash to find yet more cases requiring ((TYPE)(X)): DayOfWeek d = default(int); Console.WriteLine(d.ToString()); Console.WriteLine(((DayOfWeek)(default(int))).ToString()); msdn.microsoft.com/.../2bxt6kc4.aspx and I believe anything with higher precedence than typecast which is Unary and postfix (if prefix as with ~ you can get away without the outer brackets)  which you can make applicable.

  • Anonymous
    January 13, 2011
    The comment has been removed

  • Anonymous
    January 14, 2011
    You cannot replace a variable's all occurences with its initializers either, but only if the initializer is pure. Case in point: int f( ) {     ++x;     return 1; } int y = f( ); Console.Write( y ); Console.Write( y ); Calling f two times will increment x twice instead of only once. Blind text replacement (definitely a bad idea) is going to break things like this: int x = k( ); f( x: x );

  • Anonymous
    January 14, 2011
    Eric, why do you have to make up new names for well-understood patterns? This refactoring is called "inline temp(orary)" and it's been called that for more than a decade now (Fowler's Refactoring was published in 1999, and the Refactoring Browser's seminal paper was released in 1997). Eliminate aliases to remove, and "Remove Variable" is a very different refactoring: it completely removes a binding and its initialization from the source code if the binding is never used. First: a quick web search will show that other people have called this refactoring "eliminate variable". This is not a made-up-by-me new name. Second, so what if it was? We're having an informal conversation here about the difficulties of implementing simple refactorings; the name we choose to informally speak about the hypothetical feature is irrelevant to the discussion. Surely there are more constructive ways to participate in this discussion than sniping about nomenclature. Third, "inline temp" is a horrid name compared to "eliminate variable" for numerous reasons.(Just a few: inlining is primarily associated with JIT code generation, the variable in question might not be thought of as logically a "temporary", one can imagine extending the refactoring to other kinds of variables other than "temporary" locals, and the fact that the name emphasizes the elimination of a varible is a strong and helpful hint that you'd better not be using the thing as a variable, but rather, as a value.) Fourth, it is far more important that a name be clearly understood by customers who are not already familiar with the concept than that it conform to some alleged standard name understood only by persons already steeped in the literature of refactoring. -- Eric

  • Anonymous
    January 14, 2011
    To get the obvious out of the way, consider:    struct X { public int Y; }    static void Main()    {        X x = new X();        x.Y = 123;    } "x" cannot be substituted with "new X()", nor with anything, really - "new X().Y" is not an lvalue (I don't know the proper concise C# term for this, if it has one).

  • Anonymous
    January 14, 2011
    @Pavel Minaev   struct X { public int Y; }   static void Main()   {      new X().Y = 123;   } This should be perfectly valid...

  • Anonymous
    January 14, 2011
    @Harold: go ahead and try it!

  • Anonymous
    January 14, 2011
    Another case where four parentheses and a cast are necessary:         Func<int, int> x = z => z + 1; //eliminate x int h = x.GetHashCode(); This becomes: int h = ((Func<int, int>)(z => z + 1)).GetHashCode(); More interestingly, in a similar vein as Ilya Ryzhenkov's shadowing example, here's a contrived case where the elimination needs to be blocked. Action<int> f = n => { int x = n; }; { int x = 5; f(3); } This is the result, but it doesn't work, since the declaration of x inside of the lambda conflicts with the other declaration: { int x = 5; ((Action<int>)(n => { int x = n; }))(3); }

  • Anonymous
    January 16, 2011
       int x = R() + S();    N(Q() * x); becomes    N(Q() * R() + S()); Also has the problem that if Q() throws an exception and R() or S() have side effects then it will change behaviour. Do we assume the refactorer is aware of this or stop the refactoring in this case?

  • Anonymous
    January 17, 2011
    @Pavel Minaev: ReSharper can deal with that. I've used it many times. It'll refactor it to:   struct X { public int Y; }   static void Main()   {       X x = new X { Y = 123 };   } The quick fix is called "use initializer syntax" IIRC.

  • Anonymous
    January 17, 2011
    The comment has been removed

  • Anonymous
    January 18, 2011
    For the case:    int x = 123;    N(ref x); I believe this could become:    N(123); But only if N is a method on a COM interface in C# 4.0. :)

  • Anonymous
    October 04, 2011
    I'm with jader3rd. A much more useful refactoring would be common subexpression factoring, to find all the instances of the same expression in a method, and replace them with a variable.