Interning Strings and immutability

Managed strings are subject to ‘interning’. This is the process where the system notices that the same string is used in several places, so it can fold all the references to the same unique instance.

Interning happens two ways in the CLR.

 

  1. It happens when you explicitly call System.String.Intern(). Obviously the string returned from this service might be different from the one you pass in, since we might already have an intern’ed instance that has been handed out to the application.
  2. It happens automatically, when you load an assembly. All the string literals in the assembly are intern’ed. This is expensive and – in retrospect – may have been a mistake. In the future we might consider allowing individual assemblies to opt-in or opt-out. Note that it is always a mistake to rely on some other assembly to have implicitly intern’ed the strings it gives you. Through versioning, that other assembly might start composing a string rather than using a literal.
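To make the two paths concrete, here is a minimal sketch. The class name InternDemo and the helper Compose are hypothetical names introduced for illustration. It builds a string at run time so the compiler cannot fold it into a literal, then asks String.Intern for the pooled instance. The last two lines print whether the pooled instance is the composed object or the literal; on a typical JIT-compiled runtime the literal is already in the pool, but that is exactly the kind of implicit behavior that should not be relied upon.

```csharp
using System;

public static class InternDemo
{
    // Hypothetical helper: builds "hello" at run time, so the result is a
    // fresh object rather than a literal loaded from the assembly.
    public static string Compose()
    {
        return new string(new[] { 'h', 'e', 'l', 'l', 'o' });
    }

    public static void Main()
    {
        string literal = "hello";     // a literal from this assembly
        string composed = Compose();  // same characters, separate object

        // Equal by value, but not the same instance.
        Console.WriteLine(literal == composed);                       // True
        Console.WriteLine(object.ReferenceEquals(literal, composed)); // False

        // String.Intern returns the pool's instance, which may differ from
        // the argument that was passed in.
        string interned = string.Intern(composed);
        Console.WriteLine(object.ReferenceEquals(interned, composed));
        Console.WriteLine(object.ReferenceEquals(interned, literal));
    }
}
```

The only guarantee used here is that String.Intern returns one shared instance for any given sequence of characters.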

 

One thing that might not be immediately obvious is that we intern strings across all AppDomains. That’s because assemblies can be loaded as domain-neutral. When this happens, we execute the same code bytes at the same address in all AppDomains into which that assembly has been loaded. Since we can burn the addresses of string literals into our native code as immediate data, we clearly benefit from intern’ing across all AppDomains rather than using per-AppDomain indirections in the code. However, this approach does add overhead to intern’ing: we are forced to use per-AppDomain reference counts into a shared intern’ing table, so that we can unload intern’ed strings accurately when the last AppDomain using them is itself unloaded.

Normally, strings should be compared with String.Equals and similar mechanisms. Note that the String class defines operator== to be String.Equals. However, if two strings are both known to have been intern’ed, then they can be compared directly with a faster reference check. In other words, you could call Object.operator==() rather than String.operator==(); in C#, that means casting both operands to object before applying ==, or calling Object.ReferenceEquals. This is only recommended for highly performance-sensitive scenarios when you really know what you are doing.
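As a rough sketch of that reference check: the class InternedCompare and the helper SameInstance are hypothetical names. The strings are built at run time and interned explicitly, so nothing depends on what some assembly happened to intern implicitly. The third string shows why the shortcut is only safe when both sides are known to be interned.

```csharp
using System;

public static class InternedCompare
{
    // Hypothetical helper: true only when both references point at the same
    // object. It answers "equal contents?" only if both strings are interned.
    public static bool SameInstance(string a, string b)
    {
        return object.ReferenceEquals(a, b); // same as (object)a == (object)b
    }

    public static void Main()
    {
        // Build the strings at run time, then intern them explicitly.
        string first = string.Intern(new string('x', 3));
        string second = string.Intern(new string('x', 3));
        string notInterned = new string('x', 3);

        Console.WriteLine(SameInstance(first, second));      // True
        Console.WriteLine(SameInstance(first, notInterned)); // False
        Console.WriteLine(first == notInterned);             // True
    }
}
```

The second line is the trap: the contents match, but the reference check says no because one side never went through the intern pool.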

Of course, string intern’ing only works if strings are immutable. If they were mutable, then the sharing of strings that is implicit in intern’ing would corrupt all kinds of application assumptions – as we will see.

The good news is that strings are immutable… mostly. And they are immutable for many good reasons that have nothing to do with intern’ing. For example, immutable strings eliminate a whole host of multi-threaded race conditions where one thread uses a string while another thread mutates it. In some cases, those race conditions could be used to mount security attacks. For example, you could satisfy a FileIOPermission demand with a string pointing to an innocuous section of the file system, and then use another thread to quickly change the string to point to a sensitive file before the underlying CreateFile occurs.

So how can strings be mutated?

Well, you can certainly use C#’s ‘unsafe’ feature or equivalent unverifiable ILASM or Managed C++ code to write into a string’s buffer. In those cases, some highly trusted code is performing some clearly dirty operations. This case isn’t going to happen by accident.
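For illustration only, here is a minimal sketch of the ‘unsafe’ route. The class UnsafeMutation and the helper OverwriteFirstChar are hypothetical names, and the code must be compiled with unsafe code enabled (for example the /unsafe compiler switch). It deliberately mutates a string built at run time rather than a literal, so that the intern pool is left alone.

```csharp
using System;

public static class UnsafeMutation
{
    // Hypothetical helper: overwrites the first character of 'target' in place.
    // Needs unsafe code to be enabled at compile time.
    public static unsafe void OverwriteFirstChar(string target, char replacement)
    {
        if (target.Length == 0) return;  // nothing to overwrite

        fixed (char* p = target)         // pins the string and exposes its buffer
        {
            p[0] = replacement;
        }
    }

    public static void Main()
    {
        string original = new string('a', 4); // run-time string, not a literal
        string alias = original;              // second reference, same object

        OverwriteFirstChar(original, 'z');

        // The change is visible through every reference to the object.
        Console.WriteLine(alias);             // zaaa
    }
}
```

Nothing here happens by accident: the unsafe keyword, the fixed statement and the compiler switch all have to be asked for explicitly.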

A more serious concern comes with marshaling. Here’s a program that uses PInvoke to accidentally mutate a string. Since the string happens to have been intern’ed, it has the effect of changing a string literal in an unrelated part of the application. We pass ‘computerName’ to the PInvoke, but ‘otherString’ gets changed too!

 

using System;
using System.Runtime.InteropServices;

public class Class1
{
    static void Main(string[] args)
    {
        // Two identical literals: interning folds them into a single instance.
        String computerName = "strings are always immutable";
        String otherString = "strings are always immutable";

        // Passing the string by value lets native code write into its buffer.
        int len = computerName.Length;
        GetComputerName(computerName, ref len);

        // Prints the mutated contents, although otherString was never passed.
        Console.WriteLine(otherString);
    }

    [DllImport("kernel32", CharSet=CharSet.Unicode)]
    static extern bool GetComputerName(
        [MarshalAs(UnmanagedType.LPWStr)] string name,
        ref int len);
}

 

 

And here’s the same program written to avoid this problem:

 

using System;
using System.Runtime.InteropServices;

public class Class1
{
    static void Main(string[] args)
    {
        String computerName = "strings are always immutable";
        String otherString = "strings are always immutable";

        // 'ref' on the managed side: a different string is handed back
        // instead of the original buffer being overwritten.
        int len = computerName.Length;
        GetComputerName(ref computerName, ref len);

        // The shared literal is untouched.
        Console.WriteLine(otherString);
    }

    [DllImport("kernel32", CharSet=CharSet.Unicode)]
    static extern bool GetComputerName(
        [MarshalAs(UnmanagedType.VBByRefStr)]
        ref string name,
        ref int len);
}

 

 

In this second case, VBByRefStr is used for the marshaling directive. The argument is treated as ‘byref’ on the managed side, but remains ‘byval’ on the unmanaged side. If the unmanaged side scribbles into the buffer, it won’t pollute the managed string, which remains immutable. Instead, a different string is back-propagated to the managed side, thereby preserving managed string immutability.

If you are coding in VB, you can pretend that the VBByRefStr is actually byval on the managed side. The compiler works its magic on your behalf, so you don’t actually realize that you now have a different string. C# works no such magic, so I had to explicitly add the ‘ref’ keyword in all the right places.

If you’re like me, you probably find all the marshaling directives bewildering. I can’t recommend Adam Nathan’s book enough. It is “.NET and COM – The Complete Interoperability Guide”. It truly is the bible for interop.

Nevertheless, even with the book it’s easy to make a lot of mistakes. There’s a feature in the new CLR release called Customer Debug Probes. It makes finding certain kinds of bugs much easier. Fortunately for all of us, it’s particularly geared to finding bugs with marshaling and other Interop issues.

Comments

  • Anonymous
    April 22, 2003
    Chris, again you make it appear easy to just babble on coherently about a difficult topic -- your words poke a finger in the eye of complexity. If I tried to explain this to someone in as many words, it would be meaningless. Keep it up! Oisin
  • Anonymous
    May 15, 2003
    First of all, I'd like to congratulate you on your weblog; it's really impressive. And now the question: when you talk about mutating strings and say: "Well, you can certainly use C#’s ‘unsafe’ feature or equivalent unverifiable ILASM or Managed C++ code to write into a string’s buffer." could you give me an example of this with C# and pointers? I know how to do it with MSIL, but in C# I don't know how to make a pointer point to the memory address pointed to by an object reference (a string in this case). Maybe this is a stupid question, but I would really appreciate your answer. Thanks a lot for sharing your knowledge with us.
  • Anonymous
    May 19, 2003
    Thanks a lot for your answer. I'd just found one way to make it work:
    // stupid sample code
    fixed (char* s1p = s1) { for (int i = 0; i < 4; i++) { *(s1p + i) = 't'; } }
    At first I was quite confused by all this because I thought the fixed statement was something "optional" (if you were willing to have problems with the GC and object relocation), and when I tried to compile the code above without the fixed statement, csc gave me this not too clear error: "cannot implicitly convert type string to char*". With the fixed statement (so it looks like it is more or less compulsory) there is no compiler error. There was a small problem with the code you submitted. I guess you are one of the developers of the string class, so you are used to working with private fields and methods like m_firstChar and FastAllocateString, but I'm in the "outer world", so I don't have access to those private fields and methods :-)
  • Anonymous
    April 14, 2004
    Chris,
    There are times where if there were a safe place to have all your strings interned, you can write far more efficient code...
    like in Object Spaces, they seem to assume that you can compare strings using == and do this many thousands of times per second as the cost of integer compares...
    for my own O/R mapping projects, it's either use == for strings, find some painful way of avoiding strings altogether (source code generation) or generate dynamic assemblies (yes, this is the ideal).
  • Anonymous
    April 14, 2004
    If you control the strings, you can explicitly intern them before proceeding with any comparisons (String.Intern, String.IsInterned). Rather than going through operator== (which performs String comparisons on objects that are statically typed as String), you can use Object.ReferenceEquals (which just compares the two references).
  • Anonymous
    June 04, 2004
    If you Intern a string in an AppDomain, the Intern'ed string will remain until the AppDomain is unloaded. So it's reasonable for you to Intern all the ownerNames, but it wouldn't be reasonable for you to Intern random test strings to see if you have a match.

    Instead, use the String.IsInterned method on the random test string. If it isn't already interned, there's no point in trying to match it. If it is interned already, then you aren't growing the set of Interned strings and you can now proceed with an efficient comparison against your Interned ownerNames.