Encoding fun with XmlWriter and StreamWriter

This post provides the results for yesterday's post, where we discussed encoding for XML, Unicode and the presence of the Byte Order Mark (BOM).

Without further ado, here they are. Let's start with the first two. The first is an XmlWriter directly over a stream, with Encoding.UTF8. The second is an XmlWriter directly over a stream as well, but with a new UTF8Encoding instance that has the BOM explicitly turned off.

Stream Encoding: (no stream)
  XML Encoding:  Unicode (UTF-8)
  EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D
  22 31 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 75
  74 66 2D 38 22 3F 3E 3C 68 65 6C 6C 6F 3E 23 31 3C
  2F 68 65 6C 6C 6F 3E
Stream Encoding: (no stream)
  XML Encoding:  Unicode (UTF-8)
  3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 2E
  30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 75 74 66 2D
  38 22 3F 3E 3C 68 65 6C 6C 6F 3E 23 31 3C 2F 68 65
  6C 6C 6F 3E

Note the first three bytes of the first dump, EF BB BF: that is the UTF-8 BOM. By default, the BOM is present in the output the XmlWriter produces - you'll need to opt out of it if you don't want it. The BOM isn't really necessary for UTF-8, as the encoding works in single-byte units and so has no byte order to mark, but the encoding provides for it because it's a strong hint to other programs that this is indeed a UTF-8 file as opposed to an ANSI file or some other default system encoding. Other than the BOM presence, the streams are identical.
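To make the three-byte difference concrete, here is a minimal sketch. It is written in Python rather than .NET, purely as an assumption of this example, since the byte-level facts do not depend on the language. It encodes the document text from the dumps above with and without the UTF-8 BOM and prints the bytes 17 per line, the same layout as the dumps. The helper name `hex_dump` is made up for illustration.

```python
import codecs

# The document text, as decoded from the hex dumps above.
XML_TEXT = '<?xml version="1.0" encoding="utf-8"?><hello>#1</hello>'


def hex_dump(data: bytes, width: int = 17) -> str:
    """Format bytes as uppercase hex, `width` bytes per line (hypothetical helper)."""
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        lines.append(" ".join(f"{b:02X}" for b in chunk))
    return "\n".join(lines)


# "utf-8-sig" is Python's codec name for UTF-8 with a leading BOM;
# plain "utf-8" never writes one.
with_bom = XML_TEXT.encode("utf-8-sig")
without_bom = XML_TEXT.encode("utf-8")

print(hex_dump(with_bom))
print()
print(hex_dump(without_bom))

# Apart from the three BOM bytes, the two byte sequences are the same.
print(with_bom[:3] == codecs.BOM_UTF8, with_bom[3:] == without_bom)
```

A consumer that does not expect the BOM will see those three bytes as stray characters in front of the XML declaration, which is the usual reason to opt out.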

Next, let's take the following two, which include UTF-16, represented by Encoding.Unicode, and big-endian UTF-16, available as Encoding.BigEndianUnicode. Typically the former is preferred, as it matches the way .NET strings are already represented in memory on little-endian machines.

Stream Encoding: (no stream)
  XML Encoding:  Unicode
  FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65
  00 72 00 73 00 69 00 6F 00 6E 00 3D 00 22 00 31 00
  2E 00 30 00 22 00 20 00 65 00 6E 00 63 00 6F 00 64
  00 69 00 6E 00 67 00 3D 00 22 00 75 00 74 00 66 00
  2D 00 31 00 36 00 22 00 3F 00 3E 00 3C 00 68 00 65
  00 6C 00 6C 00 6F 00 3E 00 23 00 31 00 3C 00 2F 00
  68 00 65 00 6C 00 6C 00 6F 00 3E 00
Stream Encoding: (no stream)
  XML Encoding:  Unicode (Big-Endian)
  FE FF 00 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00
  65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 22 00 31
  00 2E 00 30 00 22 00 20 00 65 00 6E 00 63 00 6F 00
  64 00 69 00 6E 00 67 00 3D 00 22 00 55 00 54 00 46
  00 2D 00 31 00 36 00 42 00 45 00 22 00 3F 00 3E 00
  3C 00 68 00 65 00 6C 00 6C 00 6F 00 3E 00 23 00 31
  00 3C 00 2F 00 68 00 65 00 6C 00 6C 00 6F 00 3E

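The bytes that actually change are the BOM (FF FE versus FE FF) and the encoding name in the declaration (utf-16 versus UTF-16BE), but you can see that throughout the stream, the byte order within every two-byte unit is inverted.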
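The inversion can be checked mechanically. The following is a small Python sketch (again an assumption of this example, not the .NET program from the post) that encodes the element text in both byte orders and confirms that swapping every pair of adjacent bytes turns one output into the other. The helper `swap_pairs` is hypothetical.

```python
import codecs

# Only the element is encoded here, because the two dumps above also
# differ in the encoding name inside the XML declaration.
TEXT = "<hello>#1</hello>"

# The explicit -le / -be codecs write no BOM, so it is prepended by hand.
little = codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be")


def swap_pairs(data: bytes) -> bytes:
    """Swap each pair of adjacent bytes (hypothetical helper)."""
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)


print(little[:6].hex(" ").upper())  # FF FE 3C 00 68 00
print(big[:6].hex(" ").upper())     # FE FF 00 3C 00 68
print(swap_pairs(little) == big)    # True
```

The BOM is the same code point, U+FEFF, in both cases. Only its serialized byte order differs, which is exactly what lets a reader detect which variant it is looking at.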

Finally, let's look at what happens when we layer a StreamWriter in between, configured to write ASCII, and an XmlWriter with a UTF-16 encoding.

Stream Encoding: US-ASCII
  XML Encoding:  Unicode
  3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 2E
  30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 75 73 2D 61
  73 63 69 69 22 3F 3E 3C 68 65 6C 6C 6F 3E 23 31 3C
  2F 68 65 6C 6C 6F 3E

You can see here that the stream encoding wins - this is plain ASCII with no BOM, and even the XML declaration reads us-ascii rather than utf-16. It's still best to give the StreamWriter and the XmlWriter the same encoding, so that the intended encoding is used throughout the stack.
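The general principle, that whichever layer turns characters into bytes decides the encoding, can be sketched outside .NET as well. The Python below is only an analogy under stated assumptions: `io.TextIOWrapper` stands in for the role of the StreamWriter, and the markup is written as a plain string because Python has no direct XmlWriter counterpart in this setup. It then performs the consistency check suggested above by comparing the declared encoding with the one the text layer actually used.

```python
import codecs
import io
import re

raw = io.BytesIO()
# The text layer plays the role of the StreamWriter: it alone owns the encoding.
text_layer = io.TextIOWrapper(raw, encoding="ascii", newline="")

# Whatever produces the markup only hands characters to the text layer.
text_layer.write('<?xml version="1.0" encoding="us-ascii"?><hello>#1</hello>')
text_layer.flush()
data = raw.getvalue()

# Read the declared encoding back out of the bytes that were written.
declared = re.search(rb'encoding="([^"]+)"', data).group(1).decode("ascii")

# codecs.lookup normalizes aliases, so "us-ascii" and "ascii" compare equal.
consistent = codecs.lookup(declared).name == codecs.lookup(text_layer.encoding).name
print(declared, text_layer.encoding, consistent)
```

A check like this is cheap to run in a test, and it catches the case where the declaration and the real bytes drift apart.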

If you ran the program, you'll have noticed that it actually ends with an exception. Let's look at the evil-laughter configuration tomorrow - muhaha!

Enjoy!