Don't Use Encoding.Default
So you want to save some data and don't know which Encoding to use. My biggest suggestion is: please do NOT use Encoding.Default.
Huh? That can't be right.
You heard me right, please don't use Encoding.Default. Encoding.Default sounds like the right thing to do (after all, it does say "Default" right there in its name), and it's pretty easy, and it even seems to work OK, but there are some pretty big gotchas with Encoding.Default.
- Encoding.Default returns the current system code page. If someone changes the code page, or if the saved data is shared with a different machine, it might be decoded as gibberish. If you use Latin-based languages you might not notice this very quickly, but once you start thinking globally you'll find all sorts of strange encoding/code page related bugs.
- Since Encoding.Default changes depending on what machine you're using, you might find users sending data files to other users who then complain that those files are corrupt. The files probably aren't really corrupt; the recipients are probably just using the wrong Encoding to decode them.
- Encoding.Default provides an "ANSI" code page, which can only support a small fraction of the characters in Unicode, particularly for single-byte locales such as those used in the US. That means users can probably enter characters that would be translated to ? or cause fallback behavior (see the sketch after this list).
- Encoding.Default doesn't provide any information about which Encoding it is. So if you do use it, it would be wise to use some sort of higher-level protocol to explicitly declare what Encoding the file is encoded in. Some encodings, like UTF-8 or UTF-16, allow for a byte order mark that can be used as a signature, so a reader can be fairly certain the file is correctly encoded.
- Encoding.Default uses best fit behavior, which is bad; see "Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided".
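To make the ? and round-trip gotchas concrete, here is a minimal sketch, assuming the .NET Framework behavior described above (Encoding.Default as the machine's ANSI code page); the class name, sample string, and the exact output noted in the comments are illustrative only:

```csharp
using System;
using System.Text;

class EncodingDefaultDemo
{
    static void Main()
    {
        // Characters that most single-byte "ANSI" code pages (e.g. 1252) can't represent.
        string original = "Résumé – 日本語";

        // Round trip through Encoding.Default (the system ANSI code page on the
        // .NET Framework): unrepresentable characters get best-fit mapped or become '?'.
        byte[] ansiBytes = Encoding.Default.GetBytes(original);
        string ansiRoundTrip = Encoding.Default.GetString(ansiBytes);

        // Round trip through UTF-8: every character the framework can handle survives.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        string utf8RoundTrip = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine("Default round trip: " + ansiRoundTrip); // e.g. "Résumé – ???" on code page 1252
        Console.WriteLine("UTF-8 round trip:   " + utf8RoundTrip); // "Résumé – 日本語" intact
    }
}
```

On a machine set to a different code page the Encoding.Default result would differ again, which is exactly the portability problem.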
So if you can't use Encoding.Default, what should you use? I'd recommend UTF-8 (Encoding.UTF8) or UTF-16 (Encoding.Unicode). Either of these supports all of the characters that the framework can handle, so no more unexpected ?s. For English, UTF-8 is effectively as efficient as code page 1252, but UTF-8 supports unexpected characters that 1252 would drop. UTF-16 is a better choice for most scripts that require double-byte encodings. For most scripts, UTF-8 or UTF-16 have only slightly larger data sizes than the more restrictive "default" encoding, and the extra confidence that the data is correctly encoded is almost always worth that small cost in data size. Even for a web site served over a dial-up modem, the size difference would be a negligible fraction of the total text and graphics size.
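As a concrete example, here is a minimal sketch of saving and reloading text with an explicitly chosen UTF-8 encoding; the file name, class name, and sample text are hypothetical:

```csharp
using System.IO;
using System.Text;

class SaveWithUtf8
{
    static void Main()
    {
        string path = "data.txt"; // hypothetical file name
        string text = "Text in any script: Ωμέγα, 中文, Résumé";

        // State the encoding explicitly. Encoding.UTF8 used with StreamWriter also
        // writes the UTF-8 byte order mark, which acts as a signature for readers.
        using (StreamWriter writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.WriteLine(text);
        }

        // Read it back with the same explicit encoding (never Encoding.Default).
        string roundTrip;
        using (StreamReader reader = new StreamReader(path, Encoding.UTF8))
        {
            roundTrip = reader.ReadToEnd();
        }
    }
}
```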
Comments
- Anonymous
March 15, 2005
Wow! That's a pretty important tip!
Does FxCop catch uses of Encoding.Default?
-Don
- Anonymous
March 15, 2005
I don't think it does. There are cases where it's useful. For example, if you're a console app writing directly to the console, then Encoding.Default is what the console would use. So it is useful on occasion; however, I'd still try to avoid it.
Obviously if you talk to the console directly you need the "default" encoding, though it would be nice if it had a less generic name :-) Note that Console.WriteLine() and the like accept nice, normal Unicode Strings, so even then the default encoding isn't really necessary.
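A minimal sketch of that point (the class name and sample string are just for illustration): Console.WriteLine takes an ordinary Unicode string, and the runtime handles the conversion to the console's code page, so application code never needs to touch Encoding.Default itself.

```csharp
using System;

class ConsoleDemo
{
    static void Main()
    {
        // An ordinary Unicode string; no Encoding.Default anywhere.
        // Whether every character actually displays depends on the console's
        // code page and font, but the conversion is handled by the runtime.
        Console.WriteLine("Straße – 中文 – résumé");
    }
}
```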
- Anonymous
April 08, 2005
I could not agree more. What is particularly dangerous is that most APIs (.NET and Java) don't force the caller to explicitly define which encoding is to be applied. So some people won't consider the consequences of using the default encoding, while others might not even know what encodings are all about. I just stumbled over some of these cases; more at arnosoftwaredev.blogspot.com/2005/03/default-encoding-considered-harmful.html.