XmlWriterSettings Encoding Being Ignored?
I had an interesting exchange in an internal mailing list today, and thought that my readers could possibly benefit from this clarification as well.
Scenario
You are working with the XmlWriter class and writing XML to some output, such as a StringBuilder or the console. You are expecting the XML declaration to look like:
<?xml version="1.0" encoding="utf-8"?>
However, when you inspect the generated XML, it instead looks like:
<?xml version="1.0" encoding="utf-16"?>
Background
The XmlWriter class has an overloaded Create method that accepts an XmlWriterSettings object. XmlWriterSettings allows you to specify whether the output XML will be indented, what the indentation characters will be, how newlines are handled, and how encoding is handled.
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.UTF8;
settings.Indent = true;
settings.IndentChars = ("\t");
settings.NewLineChars = Environment.NewLine;
settings.NewLineHandling = NewLineHandling.Replace;
settings.OmitXmlDeclaration = false;
When you use the XmlWriterSettings type, you expect the encoding that you specified to show up in the output. For instance, consider the following code that parses an XML string and writes it to the console, first via a StringBuilder and then directly to Console.Out.
static void Main(string[] args)
{
    string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><root xmlns=\"foo\"><bar/></root>";
    XElement elt = XElement.Parse(xml);
    XmlWriterSettings settings = GetWriterSettings();
    WriteToString(elt, settings);
    WriteDirectlyToConsole(elt, settings);
    WriteToInterimStream(elt, settings);
}
static XmlWriterSettings GetWriterSettings()
{
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Encoding = Encoding.UTF8;
    settings.Indent = true;
    settings.IndentChars = ("\t");
    settings.NewLineChars = Environment.NewLine;
    settings.NewLineHandling = NewLineHandling.Replace;
    settings.OmitXmlDeclaration = false;
    return settings;
}
static void WriteToString(XElement elt, XmlWriterSettings settings)
{
    Console.WriteLine("====WRITING TO STRINGBUILDER====");
    StringBuilder sb = new StringBuilder();
    XmlWriter writer = XmlWriter.Create(sb, settings);
    elt.WriteTo(writer);
    writer.Flush();
    writer.Close();
    Console.WriteLine(sb.ToString());
    Console.WriteLine();
}
static void WriteDirectlyToConsole(XElement elt, XmlWriterSettings settings)
{
    Console.WriteLine("====WRITING DIRECTLY TO CONSOLE.OUT====");
    XmlWriter writer = XmlWriter.Create(Console.Out, settings);
    elt.WriteTo(writer);
    writer.Flush();
    writer.Close();
    Console.WriteLine();
    Console.WriteLine();
}
The output of these two methods looks like the following:
====WRITING TO STRINGBUILDER====
<?xml version="1.0" encoding="utf-16"?>
<root xmlns="foo">
<bar />
</root>
====WRITING DIRECTLY TO CONSOLE.OUT====
<?xml version="1.0" encoding="IBM437"?>
<root xmlns="foo">
<bar />
</root>
Notice that we specified UTF-8 as the encoding in the XmlWriterSettings class, yet the declarations in the output say UTF-16 and IBM437. The short explanation is that a StringBuilder holds characters, not bytes, and the characters in a .NET string are always UTF-16 code units. When the target is a StringBuilder or a TextWriter, the XmlWriter takes the encoding from the text writer and the XmlWriterSettings.Encoding value is overridden. Similarly, the Console.Out property is a TextWriter that uses the console's own encoding for displaying text in a console window... in this case, it is IBM437.
The XML declaration states the intended encoding, and it should match the encoding of the underlying stream. Imagine that your XML content claims to be UTF-8 encoded, but the underlying stream is actually UTF-16. A parser would try to decode the UTF-16 content as UTF-8 and end up with some pretty odd-looking output.
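To make the mechanism concrete, here is a minimal sketch of how the declaration follows the text writer's Encoding property rather than the settings. The Utf8StringWriter and DeclarationDemo names are hypothetical helpers invented for this illustration; they are not part of the original code.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original post: a StringWriter
// that reports UTF-8 instead of the default UTF-16.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class DeclarationDemo
{
    // Writes an empty <root /> document to the given StringWriter and
    // returns the text that was produced.
    public static string WriteRoot(StringWriter target)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        // This value is overridden by the encoding the text writer reports.
        settings.Encoding = Encoding.UTF8;
        using (XmlWriter writer = XmlWriter.Create(target, settings))
        {
            writer.WriteStartElement("root");
            writer.WriteEndElement();
        }
        return target.ToString();
    }

    public static void Main()
    {
        // A plain StringWriter reports UTF-16, so the declaration says utf-16.
        Console.WriteLine(new StringWriter().Encoding.WebName);
        Console.WriteLine(WriteRoot(new StringWriter()));

        // The subclass reports UTF-8, so the declaration says utf-8.
        Console.WriteLine(WriteRoot(new Utf8StringWriter()));
    }
}
```

Keep in mind that the resulting string is still UTF-16 in memory either way. A declaration that says utf-8 only becomes accurate if the text is later encoded to bytes as UTF-8, which is exactly the mismatch risk described above.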
Solution
If you need the UTF-8 encoding to be preserved, you need to write to a backing store that holds bytes rather than characters, so that the writer can apply the encoding from the settings. One way to do that is to use a MemoryStream.
static void WriteToInterimStream(XElement elt, XmlWriterSettings settings)
{
    Console.WriteLine("====WRITING TO INTERIM STREAM====");
    using (MemoryStream memStream = new MemoryStream())
    {
        // Disposing the writer flushes it. The stream stays open because
        // XmlWriterSettings.CloseOutput defaults to false.
        using (XmlWriter writer = XmlWriter.Create(memStream, settings))
        {
            elt.WriteTo(writer);
        }
        // Set the pointer back to the beginning of the stream to be read
        memStream.Position = 0;
        using (XmlReader reader = XmlReader.Create(memStream))
        {
            // Advance the cursor to the XML declaration node
            reader.Read();
            // Read the XML declaration (version and encoding)
            Console.WriteLine("<?xml {0} ?>", reader.Value);
            reader.MoveToContent();
            // Read the content
            Console.WriteLine(reader.ReadOuterXml());
        }
    }
}
The output of this method is what you would expect.
====WRITING TO INTERIM STREAM====
<?xml version="1.0" encoding="utf-8" ?>
<root xmlns="foo">
<bar />
</root>
The difference is that we serialize to, and deserialize from, a byte-oriented backing store, so the writer can actually apply the UTF-8 encoding and the declaration matches the stored bytes.
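If you want the UTF-8 bytes themselves rather than a reader over them, a sketch along the following lines may help. The Utf8BytesDemo class and its Serialize method are hypothetical names used only for illustration. One detail worth knowing is that Encoding.UTF8 emits a byte order mark when writing to a stream, whereas new UTF8Encoding(false) does not.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of the original post.
public static class Utf8BytesDemo
{
    // Serializes the element to UTF-8 bytes, with or without a byte order mark.
    public static byte[] Serialize(XElement elt, bool emitBom)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Encoding = new UTF8Encoding(emitBom);
        using (MemoryStream stream = new MemoryStream())
        {
            // Disposing the writer flushes everything into the stream.
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                elt.WriteTo(writer);
            }
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        XElement elt = XElement.Parse("<root xmlns=\"foo\"><bar/></root>");

        // Without a byte order mark, the bytes decode straight to the XML text.
        byte[] bytes = Serialize(elt, false);
        Console.WriteLine(Encoding.UTF8.GetString(bytes));

        // With a byte order mark, the first three bytes are EF BB BF.
        byte[] withBom = Serialize(elt, true);
        Console.WriteLine("{0:X2} {1:X2} {2:X2}", withBom[0], withBom[1], withBom[2]);
    }
}
```

Whether you want the byte order mark depends on the consumer. XmlReader handles it either way, but some downstream tools that treat the bytes as plain text may not.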