UTF-8 strings in C#
I was doing a code review a few days ago on a web service library. There was a message which was supposed to be stored in UTF-8, since the service is internationalized. The code looked similar to this:
1: string someRandomText;
2: UTF8Encoding encoder = new UTF8Encoding();
3: byte[] bytes = Encoding.UTF8.GetBytes(someRandomText);
4: string utf8ReturnString = encoder.GetString(bytes);
I didn’t notice this at first glance, but then I went back and thought about it. Can you figure it out? I’ll give you a few seconds … If you do a search, you’ll find something pretty similar. Given that it’s the top result on both search engines, I figured it’s worth talking about and analyzing this really simple code.
So what do we have here? Line 1 is just there to declare the input. Assume someRandomText contains some .. err .. random text :) Assume it contains a bunch of characters, not just your usual ASCII text. Line 2 simply initializes the encoder. So far so good. Even line 3 is perfectly OK: it encodes the string into a byte array of UTF-8 data. As you probably suspected, line 4 is the culprit here, because it decodes those bytes straight back into a string.
I guess it’s only fair to give a little background on how strings are stored in C#. The short answer is UTF-16. Normally, this is perfectly fine, and for most applications, UTF-16 should be the default encoding. However, for web services, UTF-8 is the usual encoding. XML is usually encoded in UTF-8, and since web services work over SOAP, it’s only natural to encode the text similarly.
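To make the difference between the in-memory form and the encoded form concrete, here is a minimal sketch. The sample text is my own, and the snippet is written as top-level statements, so it assumes C# 9 or later. It compares the size of the same text in both encodings:

```csharp
using System;
using System.Text;

// Hypothetical sample: "héllo €" (7 characters, two of them non-ASCII).
string sample = "h\u00E9llo \u20AC";

// string.Length counts UTF-16 code units, not bytes.
int codeUnits = sample.Length;

// Encoding.Unicode is .NET's name for little-endian UTF-16.
int utf16Bytes = Encoding.Unicode.GetByteCount(sample);

// The same text encoded as UTF-8, the way it would usually travel in XML.
int utf8Bytes = Encoding.UTF8.GetByteCount(sample);

Console.WriteLine($"UTF-16 code units: {codeUnits}");
Console.WriteLine($"UTF-16 bytes:      {utf16Bytes}");
Console.WriteLine($"UTF-8 bytes:       {utf8Bytes}");
```

For this sample the counts should come out as 7, 14 and 10: every character here takes two bytes in UTF-16, while UTF-8 spends one byte on each ASCII character, two on "é" and three on "€".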
Now, back to our little problem. Do you notice what’s going on here? The schema would look something like:
string (UTF-16) -> byte[] (UTF-8) -> string (UTF-16)
So we’re really not doing anything. You might as well have just returned someRandomText and saved some processor cycles.
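If you want to convince yourself, here is a small sketch of the same round trip with a check at the end. The input text is made up, and the snippet uses top-level statements (C# 9 or later). It assumes well-formed text; a string with unpaired surrogates would not survive the trip unchanged.

```csharp
using System;
using System.Text;

// Hypothetical input with plenty of non-ASCII characters.
string someRandomText = "na\u00EFve caf\u00E9 \u65E5\u672C\u8A9E";

UTF8Encoding encoder = new UTF8Encoding();

// string (UTF-16) -> byte[] (UTF-8)
byte[] bytes = Encoding.UTF8.GetBytes(someRandomText);

// byte[] (UTF-8) -> string (UTF-16): decoding simply undoes the encoding.
string utf8ReturnString = encoder.GetString(bytes);

// Ordinal comparison checks the UTF-16 code units one by one.
bool identical = string.Equals(someRandomText, utf8ReturnString, StringComparison.Ordinal);
Console.WriteLine(identical);
```

This should print True: the "UTF-8 string" that comes out is the very same UTF-16 string that went in.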
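If you really want the UTF-8 form of the text, you have to keep working with the byte array itself rather than turning it back into a string. To look at what was actually encoded, for instance, you’ll have to walk the byte array and print each byte in hex format. A little bit more work, but at least it will do what you wanted it to do :)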
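Here is one way that could look, as a rough sketch rather than the code from the review. The message text is made up, the MemoryStream is only a stand-in for whatever file or network stream you really have, and the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical message text: "café".
string message = "caf\u00E9";

// The UTF-8 form of the text lives in this byte array, not in any string.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(message);

// Walk the array and format each byte as two hex digits.
StringBuilder hex = new StringBuilder();
foreach (byte b in utf8Bytes)
{
    hex.Append(b.ToString("X2")).Append(' ');
}
Console.WriteLine(hex.ToString().TrimEnd());

// To store or send the text as UTF-8, hand the bytes to a stream as they are.
// The MemoryStream stands in for the real destination.
using (MemoryStream stream = new MemoryStream())
{
    stream.Write(utf8Bytes, 0, utf8Bytes.Length);
    Console.WriteLine(stream.Length);
}
```

For "café" this should print 63 61 66 C3 A9 followed by 5. The two bytes C3 A9 are the UTF-8 encoding of "é", which is exactly the detail that disappears once the bytes are decoded back into a string.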
-Cos
Comments
- Anonymous
December 26, 2008
.. and that code would look like?
- Anonymous
December 27, 2008
To print all the bytes you can use:
for (int i = 0; i < bytes.Length; i++)
{
    Console.Write("{0:X2} ", bytes[i]);
}
If you actually want to use them, then the code will vary depending on what you do. But a string is UTF-16 encoded, which cannot be changed.