String interning and String.Empty

09/28/2009

Here's a curious program fragment:

object obj = "Int32";
string str1 = "Int32";
string str2 = typeof(int).Name;
Console.WriteLine(obj == str1); // true
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // false !?

Surely if A equals B, and B equals C, then A equals C; that's the transitive property of equality. It appears to have been thoroughly violated here.

Well, first off, though the transitive property is desirable, this is just one of many situations in which equality is intransitive in C#. You shouldn't rely upon transitivity in general, though of course there are many specific cases where it is valid. As an exercise, you might want to see how many other intransitivities you can come up with. Post 'em in the comments; I'd love to see what obscure ones you can come up with. (Incidentally, one of the interview questions I got when applying for this team was to invent a performant algorithm for determining intransitivities in a simplified version of the 'better method' algorithm.)

Second, what's happening here is we're mixing two different kinds of equality that just happen to use the same operator syntax. We're mixing reference equality with value equality. Objects are compared by reference; in the first and third comparison we are testing if the two object references both refer to exactly the same object. In the second comparison we are checking to see if the two strings have the same content, regardless of whether they are the same object or not. In fact, the compiler warns you about this situation; this should produce a "possible unintended reference comparison" warning.

That might need a bit more explanation. In .NET you can have two strings that have identical content but are different objects. When you compare those strings as strings, they're equal, but when you compare them as objects, they're not.
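Here is a minimal sketch of that distinction, assuming an ordinary console program. The class name EqualityKinds and its helper methods are made up for illustration. It builds a second "Int32" at runtime so that it is a separate object from the literal, then compares the pair both ways.

```csharp
using System;

public static class EqualityKinds
{
    // Builds the text at runtime so the result is a new object, not the literal.
    public static string BuildAtRuntime()
    {
        return new string(new[] { 'I', 'n', 't', '3', '2' });
    }

    // Value comparison: both static types are string, so == compares content.
    public static bool SameContent(string a, string b)
    {
        return a == b;
    }

    // Reference comparison: asks whether both references point to one object.
    public static bool SameObject(object a, object b)
    {
        return object.ReferenceEquals(a, b);
    }

    public static void Main()
    {
        string literal = "Int32";
        string built = BuildAtRuntime();

        Console.WriteLine(SameContent(literal, built)); // True
        Console.WriteLine(SameObject(literal, built));  // False
    }
}
```

Calling object.ReferenceEquals, or casting both operands to object, states the intent to compare references explicitly instead of leaving it to the static types of the operands.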

That explains why the second comparison is true -- it's a value comparison -- and why the third comparison is false -- it's a reference comparison. But it doesn't explain why the first and third comparisons are inconsistent with each other.

This is the result of a small optimization. If you have two identical string literals in one compilation unit then the code we generate ensures that only one string object is created by the CLR for all instances of that literal within the assembly. This optimization is called "string interning".

String.Empty is not a constant; it's a read-only field in another assembly. The compiler therefore cannot treat it as one more copy of the empty string literal in your assembly, and the two are not guaranteed to be the same object.

This explains why the first comparison is true: the two literals in fact get turned into the same string object. And it explains why the third comparison is false: the literal and the computed value are turned into different objects.

Knowing that, you can now make an educated guess as to why we have this bizarre behaviour:

object obj = "";
string str1 = "";
string str2 = String.Empty;
Console.WriteLine(obj == str1); // true
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // sometimes true, sometimes false?!

Some versions of the .NET runtime automatically intern the empty string at runtime, some do not!

But why, you might ask, do we not perform this interning optimization at runtime on every string? Why not aggressively turn all value-equal strings into reference-equal strings? Surely it is wasteful to have two identical strings around when you could have half as much memory.

The answer is that the TANSTAAFL Principle applies here, bigtime. That is, There Ain't No Such Thing As A Free Lunch. Interning has two positive effects: it decreases memory consumption and decreases time required to compare two strings. (Because if all strings are interned at runtime then all string comparisons can be cheap reference comparisons.) But those positive effects have a cost: allocating a new string now requires that you do a search of all string objects in memory to see if you have one that matches already. In our existing optimization, the cost is small; we can know at compile time what string literals are in a given assembly and which are identical. With the proposed optimization, that cost is imposed at runtime, and it could be a very large fraction of the time spent allocating strings.

In order to keep the time cost down, you'd have to build a hash table of all strings in memory. That means either computing the hashes frequently, which is itself expensive in time, or storing the hashes somewhere. If we do the latter then suddenly we are increasing the memory burden for strings that are not duplicated. That is, our optimization makes the normal scenario -- the vast majority of pairs of strings are not equal to each other -- take up more memory, so that a rare scenario saves on memory. That seems like a bad bargain; you usually want to optimize for the likely case.
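To make the cost concrete, here is a rough sketch of such a table. SimpleInternPool is a hypothetical class written for this illustration; it is not how the CLR implements interning. Every call to its Intern method pays for a hash computation and a lookup, and the table holds an entry for every distinct string whether or not a duplicate ever shows up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pool for illustration; not the CLR's actual mechanism.
public sealed class SimpleInternPool
{
    // Maps each distinct content to the one canonical object holding it.
    private readonly Dictionary<string, string> pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public int Count { get { return pool.Count; } }

    // Returns the canonical object for the given content, adding it if new.
    public string Intern(string value)
    {
        if (value == null) throw new ArgumentNullException("value");

        string canonical;
        if (pool.TryGetValue(value, out canonical))
        {
            return canonical; // an equal string was seen before
        }

        pool.Add(value, value); // first time: this object becomes canonical
        return value;
    }
}

public static class InternPoolDemo
{
    public static void Main()
    {
        var pool = new SimpleInternPool();
        string a = pool.Intern(new string('x', 3));
        string b = pool.Intern(new string('x', 3));

        Console.WriteLine(object.ReferenceEquals(a, b)); // True
        Console.WriteLine(pool.Count);                   // 1
    }
}
```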

There are also serious lifetime problems with interned strings. When can they be safely garbage collected? What if a new copy of the string is created while the old one is being collected on another thread? The safest thing to do is to make interned strings immortal, which looks like a memory leak. Memory leaks are bad for performance, particularly when the optimization you're doing is an attempt to save memory. TANSTAAFL!

In short, it is in the general case not worth it to intern all strings. However, it might be worth it in some specific cases. For example, if you were building a compiler in C#, odds are good that you are going to be producing a lot of strings that are the same at runtime. Our C# compiler is written in C++, in which we have written our own custom string interning layer so that we can do cheap reference comparisons on all strings in your program. Odds are good that "int" is going to appear tens, hundreds or thousands of times in a given program; it seems silly to allocate the same string over and over again. If you were writing a compiler in C#, or had some other application in which you felt that it was worth your while to ensure that thousands of identical strings do not consume lots of memory, you can force the runtime to intern your string with the String.Intern method.
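A small sketch of String.Intern in use. InternDemo and BuildKeyword are illustrative names, not part of any library. Both strings are built at runtime so that they start out as separate objects.

```csharp
using System;
using System.Text;

public static class InternDemo
{
    // Builds "int" at runtime so each call yields a separate object.
    public static string BuildKeyword()
    {
        return new StringBuilder().Append('i').Append("nt").ToString();
    }

    public static void Main()
    {
        string first = BuildKeyword();
        string second = BuildKeyword();

        // Same content, different objects.
        Console.WriteLine(first == second);                        // True
        Console.WriteLine(object.ReferenceEquals(first, second)); // False

        // String.Intern returns the pool's reference for that content,
        // so both calls hand back the same object.
        string internedFirst = String.Intern(first);
        string internedSecond = String.Intern(second);
        Console.WriteLine(object.ReferenceEquals(internedFirst, internedSecond)); // True
    }
}
```

After interning, a cheap reference comparison gives the same answer as a content comparison for any two strings that both came back from String.Intern.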

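Conversely, if you hate interning with an unreasoning passion, you can tell the runtime that an assembly does not require string-literal interning by applying the CompilationRelaxations attribute with the NoStringInterning flag. Note that this marks interning as not required; it does not forbid the runtime from interning anyway.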
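A sketch of applying that attribute at assembly level, using the CompilationRelaxationsAttribute class and the CompilationRelaxations enum from System.Runtime.CompilerServices. RelaxationsDemo is a made-up class that only reads the attribute back through reflection. Whether literals actually end up as separate objects is still up to the runtime.

```csharp
using System;
using System.Runtime.CompilerServices;

// Marks this assembly as not requiring string-literal interning.
[assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)]

public static class RelaxationsDemo
{
    // Reads the attribute back from this assembly; returns 0 if it is absent.
    public static int ReadRelaxations()
    {
        var attribute = (CompilationRelaxationsAttribute)Attribute.GetCustomAttribute(
            typeof(RelaxationsDemo).Assembly,
            typeof(CompilationRelaxationsAttribute));
        return attribute == null ? 0 : attribute.CompilationRelaxations;
    }

    public static void Main()
    {
        Console.WriteLine(ReadRelaxations());
    }
}
```

As one commenter below observes, the C# compiler may already emit this attribute on its own, so seeing it in a decompiled assembly does not mean anyone wrote it by hand.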

Anyway, to come back to the question of transitivity: object reference equality actually is transitive. It's also symmetric (A==B implies B==A) and reflexive (A==A), so it is an equivalence relation. Similarly, string value equality is transitive, symmetric and reflexive, since it uses a straight "character by character" ordinal comparison. But when you mix the two, then equality is no longer transitive. That's weird, but hopefully now understandable.

Comments

Anonymous
September 28, 2009
I wonder if an empty string is special-cased in some way. No matter what I've tried, it looks like an empty string will always refer to the same instance as String.Empty. I got the expected result using other non-interned strings though:

object obj = "ab";
string str1 = "ab";
string str2 = "a" + new string(new char[] { 'b' }); // prevent compiler from computing "a" + "b"

@John Kraft: Doesn't really affect the point of the article, but still kind of curious.

You guys are absolutely right. Some versions of the framework intern string.Empty and some do not! I learned something new today; I've updated the text accordingly. Thanks! -- Eric
Anonymous
September 28, 2009
"Conversely, if you hate interning with an unreasoning passion, you can force the runtime to turn off all string interning in an assembly with the CompilationRelaxation attribute." That is not what I read at <http://msdn.microsoft.com/en-us/library/system.runtime.compilerservices.compilationrelaxations.aspx>: NoStringInterning Marks an assembly as not requiring string-literal interning. "Not requiring" and "Forbidding" are two different things.
Anonymous
September 28, 2009
This post accidentally reveals a deep dark secret, that I have long suspected..... C# was created by the Ringworld Engineers!!!!!!! Just think how much this really explains <grin> [ps: I know Heinlein used it decades before Niven...]
Anonymous
September 28, 2009
Interesting that you wrote a post on this. This is one of my favorite interview questions to developers and they almost ALWAYS get it wrong!
Anonymous
September 28, 2009
Why not make String.Empty a constant? I believe that String.Empty conforms to the definition of a constant.
Anonymous
September 28, 2009
In Java, the comparison of two string objects using "==" always results in a reference comparison. Therefore string comparison is always done using String.equals(). The same concept of literal pools applies in Java, though. Sample this:

String str1 = "xyz";
Object obj1 = "xyz";
String str2 = new String("xyz");
System.out.println(str1 == obj1); // true
System.out.println(str1 == str2); // false
System.out.println(str2 == obj1); // false
System.out.println(str1.equals(obj1)); // true
System.out.println(str2.equals(obj1)); // true
System.out.println(((String)obj1).equals(str1)); // true

I always thought the same was true for C#. Interesting, now I know... Thanks! :)
Anonymous
September 28, 2009
eh.. 'better method' algorithm? Was ist das?
Anonymous
September 28, 2009
@Franklin, if String.Empty were a constant (IL "literal"), its value would be inserted into IL at compile time - so it wouldn't be any different from just using "". In particular, it would only be interned once per assembly. But since it's actually a static readonly field, there's just one single instance shared between all code using String.Empty. I'm not sure if this has any distinct advantages, or if it is even the rationale for making it non-constant, but I can't think of any other points of difference.
Anonymous
September 28, 2009
Excellent post. Could you please clarify the following bit for me: "only one string object is created by the CLR for all instances of that literal within the assembly." Are you saying that if assembly A and assembly B both contain the same literal string there will be two copies of this in memory or am I reading this backwards? Because as far as I have been able to observe, that is not the case.
Anonymous
September 28, 2009
I wonder why the C# compiler is still written in C++... // Ryan
Anonymous
September 28, 2009
The compiler is probably still written in C++ because at first they had to, and now it'd be a waste to throw out all that perfectly good code.
Anonymous
September 28, 2009
Something is bothering me here.

object obj = "";
string str1 = "";
string str2 = String.Empty;
Console.WriteLine(obj == str1); // true
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // sometimes true, sometimes false?!

If my understanding is correct, this means that "" is interned, while string.Empty depends on the .NET runtime version. Then, wouldn't it be better to always use "" rather than string.Empty? By the way, I always wondered where the "use string.Empty" good practice came from.
Anonymous
September 28, 2009
A slight modification of the code at the beginning of the post provides yet another illustration of the difference between the comparison by reference and the comparison by value:

object obj = "Int32";
StringBuilder sb = new StringBuilder("Int32");
string str1 = sb.ToString();
string str2 = typeof(int).Name;
Console.WriteLine(obj == str1); // False, this time!!!
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // false !?

Well, it's self-explanatory, pretty much: the call to StringBuilder.ToString() defeats the interning, somehow, so that the two "Int32" do not end up being the same object. (I've used VS 2008 SP1, Standard Edition, on 64-bit Windows 7 Ultimate RTM; it may be different on other .NET versions, of course.)
Anonymous
September 28, 2009
Not equality, but an intransitive string comparison:

string s1 = "-0.67:-0.33:0.33";
string s2 = "0.67:-0.33:0.33";
string s3 = "-0.67:0.33:-0.33";
Console.WriteLine(s1.CompareTo(s2));
Console.WriteLine(s2.CompareTo(s3));
Console.WriteLine(s1.CompareTo(s3));
Anonymous
September 29, 2009
It might be good to mention another optimization:

string x = new string(new char[0]);
string y = new string(new char[0]);
Console.WriteLine(object.ReferenceEquals(x, y)); // true

From http://stackoverflow.com/questions/194484/whats-the-strangest-corner-case-youve-seen-in-c-or-net
Anonymous
September 29, 2009
@Brian

> Are you saying that if assembly A and assembly B both contain the same literal string there will be two copies of this in memory or am I reading this backwards? Because as far as I have been able to observe, that is not the case.

You're right, and I'm wrong (and I have no idea where I got this notion from). In fact, it's quite obvious now that I think of it - there's only one string pool, so assemblies don't matter. Which, obviously, means that my guess at the rationale of String.Empty is entirely wrong, as well. Back to square one.

@Denis

> the call to StringBuilder.ToString() defeats the interning, somehow

That one is actually pretty straightforward (and Eric has already explained it in the post): only literals (including those produced by constant expressions at compile-time, like "a"+"b") are interned by default. The return value of StringBuilder.ToString() is not a literal.
Anonymous
September 29, 2009
As usual Eric, fantastic article. Timely as well. Read your article this morning and this afternoon discovered some weird behavior with the equality operator. I expected the first 6 to be True (especially #6 as I would think it would be equal for both reference and value equality.) What is going on here?

double d1 = double.NaN;
double? d2 = double.NaN;
double d3 = double.NaN;
double d4 = d3;
Console.WriteLine(d1 == d2.Value); //False
Console.WriteLine(d1 == d3); //False
Console.WriteLine(d2.Value == d3); //False
Console.WriteLine(d2.Value == double.NaN); //False
Console.WriteLine(d1 == double.NaN); //False
Console.WriteLine(d4 == d3); //False
Console.WriteLine(double.IsNaN(d1)); //true
Console.WriteLine(double.IsNaN(d2.Value)); //true

First off, reference equality doesn't come into it; you have no reference types at all in this program fragment. NaN means "not a number", and NaNs are special. In particular, the floating point standard requires that NaN == NaN be false. Basically, NaN means "the result is unknown or nonsensical." You have two results which are unknown or nonsensical. Let's suppose the two results are the total sales for October 10th, which are unknown, and the total sales for February 31st, which are nonsensical. You compare them for equality. Does it make any sense to say "why yes, those two figures are equal!"? Of course not. So NaNs never equal each other. Note that "null" in VB has this same property; if you compare null to null in VB, you get null, not true or false. See the IEEE 754 specification for more details. -- Eric
Anonymous
September 29, 2009
So why isn't String.Empty a constant ? (I know this was asked before, but it seems the only answer given was later invalidated). I guess since it appears that String.Empty IS interned it probably doesn't make any difference, but I'm interested in the answer.
Anonymous
September 29, 2009
Joren, of course, but apart from the costs, what else could be a reason not to switch to C#? Don't get me wrong, I love C++, but it would be nice to see the C# compiler being self-hosting: http://en.wikipedia.org/wiki/Self-hosting // Ryan
Anonymous
October 01, 2009
@Ryan Heath - There have been hints that this may happen. It's likely it's something they're currently looking into.
Anonymous
October 01, 2009
@Ryan, "Apart from the costs" I don't think there is a reason not to write the C# compiler in C#. But that's like asking, apart from my height and lack of athletic ability, for what other reason can't I be an All-Star professional basketball player? You have to live in reality. Cost is usually the reason that desirable things don't get done, in software and the rest of the world. In fact the C# team has talked about exposing the compiler as a managed service to aid metaprogramming, scripting, and other scenarios. Anders himself spoke about it at PDC last year. So I imagine we might see it happen. But it has to make it to the top of the priority list, past a whole lot of other desirable things (as Eric has often spoken of).
Anonymous
October 05, 2009
It seems that string interning also happens between assemblies (VS2008 SP1):

object obj = new StringContainer().Value; // in another project, returns "Int32" as object
string str1 = "Int32";
string str2 = typeof(int).Name;
Console.WriteLine(obj == str1); // true. I was expecting false
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // false
Anonymous
October 09, 2009
If I run this code in a new console application, I get the behavior you indicate (true/true/false). When I look at the generated assembly in Reflector, however, the CompilationRelaxations attribute is present, with string literal interning disabled [CompilationRelaxations(8)]. So if string interning is disabled, why am I getting the behavior that should only occur if string literal interning is enabled?
Anonymous
November 30, 2009
Thanks for the great explanation. I almost completely forgot the idea of string interning from my time with C++ after I had moved to .NET. We had a discussion about const fields in this question: http://stackoverflow.com/questions/1819117/c-do-const-fields-use-less-memory As I understand it, references to const fields are replaced with their actual values at compile time. Const fields need to be of a value type, with string being some sort of exception. Does the interning rule kick in when the compiler detects that a constant is a string?
