XmlWriter, encodings and BOM
Today I want to talk about XmlWriter and the generation of a Byte Order Mark (BOM).
XmlWriter provides an API that generates, unsurprisingly, XML. This XML will typically end up as a managed string of characters or possibly a sequence of bytes. Of course, text transformed into bytes implies an encoding, as previously discussed.
Now, XML has its own ways of determining a document's encoding: by peeking at the first bytes that make up an opening <?xml declaration or, more explicitly, by reading the encoding attribute on that declaration.
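To make the "peeking" part concrete, here is a minimal sketch that shows how the five characters of <?xml come out under three encodings. The class and helper names are mine, made up for illustration. Encoding.GetBytes never emits a BOM, so these are just the encoded characters a parser would get to sniff.

```csharp
using System;
using System.Text;

public static class DeclarationBytes {
    // Hypothetical helper: hex dump of a string under the given encoding.
    public static string Hex(Encoding encoding, string text) {
        return BitConverter.ToString(encoding.GetBytes(text)).Replace("-", " ");
    }

    public static void Main() {
        Console.WriteLine("UTF-8:     " + Hex(Encoding.UTF8, "<?xml"));
        Console.WriteLine("UTF-16 LE: " + Hex(Encoding.Unicode, "<?xml"));
        Console.WriteLine("UTF-16 BE: " + Hex(Encoding.BigEndianUnicode, "<?xml"));
    }
}
```

UTF-8 gives 3C 3F 78 6D 6C, while the two UTF-16 flavors interleave zero bytes on opposite sides (3C 00 3F 00 ... versus 00 3C 00 3F ...), which is enough for a parser to tell them apart.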
Unicode is used for all sorts of purposes, not just XML encoding, and so it also has its own mechanism, the byte order mark, to distinguish between little-endian and big-endian encodings, which determine which byte comes first in UTF-16 and UTF-32. A BOM is also allowed for UTF-8, for that matter, even though there is no byte order to distinguish there.
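In .NET, each Encoding reports the BOM it would emit through GetPreamble(). The sketch below simply prints those preambles for a handful of encodings. Again, the class and helper names are hypothetical and only there for illustration.

```csharp
using System;
using System.Text;

public static class PreambleDemo {
    // Hypothetical helper: the BOM an encoding would emit, as hex.
    public static string PreambleHex(Encoding encoding) {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0
            ? "(none)"
            : BitConverter.ToString(preamble).Replace("-", " ");
    }

    public static void Main() {
        Console.WriteLine("UTF-8:          " + PreambleHex(Encoding.UTF8));
        Console.WriteLine("UTF-8 (no BOM): " + PreambleHex(new UTF8Encoding(false)));
        Console.WriteLine("UTF-16 LE:      " + PreambleHex(Encoding.Unicode));
        Console.WriteLine("UTF-16 BE:      " + PreambleHex(Encoding.BigEndianUnicode));
        Console.WriteLine("UTF-32 LE:      " + PreambleHex(Encoding.UTF32));
    }
}
```

Encoding.UTF8 reports EF BB BF, Encoding.Unicode reports FF FE, Encoding.BigEndianUnicode reports FE FF, and a UTF8Encoding constructed with false reports nothing at all. Whether a given writer actually emits the preamble is a separate question.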
How do these mechanisms interact when using the .NET Framework classes? Let's write some code!
First, we'll write a short helper method to display the contents of a byte array.
// The snippets in this post assume using directives for
// System, System.IO, System.Text and System.Xml.
private static void ShowBuffer(string linePrefix, byte[] bytes, long length) {
    int bytesOnLine = 0;
    for (long i = 0; i < length; i++) {
        if (bytesOnLine == 0) {
            Console.Write(linePrefix);
        }
        Console.Write("{0:X2} ", bytes[i]);
        bytesOnLine++;
        // Break the line after 16 bytes.
        if (bytesOnLine >= 16) {
            Console.WriteLine();
            bytesOnLine = 0;
        }
    }
}
Next, let's write a method to write out some short XML.
private static void WriteXml(XmlWriter xmlWriter) {
    xmlWriter.WriteStartElement("hello");
    xmlWriter.WriteString("#1");
    xmlWriter.WriteEndElement();
    xmlWriter.Flush();
}
We'll try different combinations of layering an XmlWriter with some encoding over a StreamWriter with a different encoding (or directly over a stream) to see what happens. These two methods will help us out.
private static long WriteEncodedXml(
    Encoding streamEncoding,
    Encoding xmlEncoding,
    Stream stream) {
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Encoding = xmlEncoding;
    settings.Indent = false;
    if (streamEncoding != null) {
        using (StreamWriter writer = new StreamWriter(stream, streamEncoding))
        using (XmlWriter xmlWriter = XmlWriter.Create(writer, settings)) {
            WriteXml(xmlWriter);
            return stream.Length;
        }
    } else {
        using (XmlWriter xmlWriter = XmlWriter.Create(stream, settings)) {
            WriteXml(xmlWriter);
            return stream.Length;
        }
    }
}
private static void ShowXmlEncoding(
    Encoding streamEncoding,
    Encoding xmlEncoding) {
    Console.WriteLine("Stream Encoding: " +
        ((streamEncoding == null) ?
            "(no stream)" : streamEncoding.EncodingName));
    Console.WriteLine(" XML Encoding: " + xmlEncoding.EncodingName);
    MemoryStream stream = new MemoryStream();
    long length = WriteEncodedXml(streamEncoding, xmlEncoding, stream);
    byte[] bytes = stream.GetBuffer();
    ShowBuffer(" ", bytes, length);
    Console.WriteLine();
}
Finally, here is the method to drive it all.
public static void Main(string[] args) {
    // First encoding is for stream writer, second is XML writer.
    ShowXmlEncoding(null, Encoding.UTF8);
    ShowXmlEncoding(null,
        new UTF8Encoding(/* encoderShouldEmitUTF8Identifier */false));
    ShowXmlEncoding(null, Encoding.Unicode);
    ShowXmlEncoding(null, Encoding.BigEndianUnicode);
    ShowXmlEncoding(Encoding.ASCII, Encoding.Unicode);
    // Muhaha.
    Encoding muhaha = Encoding.GetEncoding(
        "x-IA5-Norwegian",
        new EncoderExceptionFallback(),
        new DecoderExceptionFallback());
    ShowXmlEncoding(null, muhaha);
}
You can run this now and see what comes up. Tomorrow, a short analysis of some interesting results.
Enjoy!