XmlWriterSettings Encoding Being Ignored?

I had an interesting exchange in an internal mailing list today, and thought that my readers could possibly benefit from this clarification as well. 

Scenario

You are working with the XmlWriter class and writing XML to some output target, such as a string or the console.  You expect the XML declaration to look like:

 <?xml version="1.0" encoding="utf-8"?>

However, when you inspect the generated XML, it instead looks like:

 <?xml version="1.0" encoding="utf-16"?>

Background

The XmlWriter class has an overloaded Create method that accepts an XmlWriterSettings object.  XmlWriterSettings lets you specify whether the output XML is indented, which indentation characters are used, how newlines are handled, and which encoding is requested.

XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.UTF8;
settings.Indent = true;
settings.IndentChars = ("\t");
settings.NewLineChars = Environment.NewLine;
settings.NewLineHandling = NewLineHandling.Replace;
settings.OmitXmlDeclaration = false;

When you use the XmlWriterSettings type, you expect the encoding that you specified to show up in the output.  For instance, consider the following code, which parses an XML string and writes it to the console twice: first via a StringBuilder, then directly to Console.Out.

static void Main(string[] args)
{
    string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><root xmlns=\"foo\"><bar/></root>";
    XElement elt = XElement.Parse(xml);
    XmlWriterSettings settings = GetWriterSettings();
    WriteToString(elt, settings);
    WriteDirectlyToConsole(elt, settings);
    WriteToInterimStream(elt, settings);           
}

static XmlWriterSettings GetWriterSettings()
{
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Encoding = Encoding.UTF8;
    settings.Indent = true;
    settings.IndentChars = ("\t");
    settings.NewLineChars = Environment.NewLine;
    settings.NewLineHandling = NewLineHandling.Replace;
    settings.OmitXmlDeclaration = false;
    return settings;
}
        

static void WriteToString(XElement elt, XmlWriterSettings settings)
{
    Console.WriteLine("====WRITING TO STRINGBUILDER====");
    StringBuilder sb = new StringBuilder();
    XmlWriter writer = XmlWriter.Create(sb, settings);
    elt.WriteTo(writer);
    writer.Flush();
    writer.Close();

    Console.WriteLine(sb.ToString());
    Console.WriteLine();
}

static void WriteDirectlyToConsole(XElement elt, XmlWriterSettings settings)
{
    Console.WriteLine("====WRITING DIRECTLY TO CONSOLE.OUT====");
    XmlWriter writer = XmlWriter.Create(Console.Out, settings);

    elt.WriteTo(writer);
    writer.Flush();
    writer.Close();
    Console.WriteLine();
    Console.WriteLine();
}

The output of these two methods looks like the following:

====WRITING TO STRINGBUILDER====
<?xml version="1.0" encoding="utf-16"?>
<root xmlns="foo">
        <bar />
</root>

====WRITING DIRECTLY TO CONSOLE.OUT====
<?xml version="1.0" encoding="IBM437"?>
<root xmlns="foo">
        <bar />
</root>

Notice that we specified UTF-8 as the encoding in XmlWriterSettings, yet the declarations in the output say UTF-16 and IBM437.  The short explanation is that a StringBuilder cannot contain bytes.  It is designed to contain characters, and strings in .NET always hold UTF-16 encoded values.  Similarly, Console.Out is a TextWriter that uses a specific encoding for displaying text in a console window; in this case, that encoding is IBM437.  When the XmlWriter writes to a character-based target like these, the declaration reflects the target's encoding rather than the one in the settings.
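
To see where those two encoding names come from, you can ask each text target which encoding it advertises.  The following is a minimal sketch rather than part of the original sample; the class name TextWriterEncodingProbe and the helper EncodingNameOf are made up for illustration.  It relies on the fact that a StringBuilder passed to XmlWriter.Create is written through a StringWriter.

```csharp
using System;
using System.IO;
using System.Text;

static class TextWriterEncodingProbe
{
    // Hypothetical helper: returns the encoding name a TextWriter advertises.
    public static string EncodingNameOf(TextWriter writer)
    {
        return writer.Encoding.WebName;
    }

    static void Main()
    {
        using (StringWriter sw = new StringWriter())
        {
            // StringWriter stores characters in a StringBuilder, so it reports UTF-16.
            Console.WriteLine(EncodingNameOf(sw));
        }

        // Depends on the console's active code page, so this value varies by machine.
        Console.WriteLine(EncodingNameOf(Console.Out));
    }
}
```

The first line printed should be utf-16 on any machine, while the second depends on the console configuration, which is why the sample output above shows IBM437.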

The XML declaration states the encoding of the document, so it must match the way the underlying store actually encodes the text.  Imagine XML content that claims to be UTF-8 while the underlying store holds UTF-16.  A parser would try to read the UTF-16 encoded content as UTF-8 and end up with some pretty odd looking output.
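
As a rough illustration of that kind of mismatch, the sketch below encodes a small fragment as UTF-16 and then decodes the bytes as if they were UTF-8.  It does not involve an XML parser at all, and the class name EncodingMismatchDemo and helper DecodeUtf16BytesAsUtf8 are hypothetical names chosen for this example.

```csharp
using System;
using System.Text;

static class EncodingMismatchDemo
{
    // Hypothetical helper: decodes UTF-16 bytes as if they were UTF-8.
    public static string DecodeUtf16BytesAsUtf8(string text)
    {
        // Encoding.Unicode is UTF-16 little endian; GetBytes adds no byte order mark.
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(text);
        return Encoding.UTF8.GetString(utf16Bytes);
    }

    static void Main()
    {
        string original = "<bar />";
        string decoded = DecodeUtf16BytesAsUtf8(original);

        // Every ASCII character is now followed by a NUL character,
        // so the decoded text is twice as long as the original.
        Console.WriteLine(original.Length);
        Console.WriteLine(decoded.Length);
        Console.WriteLine(decoded.IndexOf('\0') >= 0);
    }
}
```

The decoded string is no longer the markup that was written, which is the sort of side effect a mismatched declaration invites.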

Solution

If you need the UTF-8 encoding to be preserved, you need to write to a backing store that holds bytes rather than characters, so that the content really can be UTF-8 encoded.  One way to do that is to use a MemoryStream.

static void WriteToInterimStream(XElement elt, XmlWriterSettings settings)
{
    Console.WriteLine("====WRITING TO INTERIM STREAM====");
    using (MemoryStream memStream = new MemoryStream())
    {
        //Disposing the writer flushes it; the stream stays open because CloseOutput defaults to false
        using (XmlWriter writer = XmlWriter.Create(memStream, settings))
        {
            elt.WriteTo(writer);
        }

        //Set the pointer back to the beginning of the stream to be read
        memStream.Position = 0;

        using (XmlReader reader = XmlReader.Create(memStream))
        {
            //Advance the cursor
            reader.Read();

            //Read the XML declaration
            Console.WriteLine("<?xml {0} ?>", reader.Value);
            reader.MoveToContent();
            //Read the content
            Console.WriteLine(reader.ReadOuterXml());
        }
    }
}

The output of this method is what you would expect:

====WRITING TO INTERIM STREAM====
<?xml version="1.0" encoding="utf-8" ?>
<root xmlns="foo">
        <bar />
</root>

The difference is that we are able to serialize and deserialize to a backing store capable of storing UTF-8 encoded content.
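
If you only need the UTF-8 document as text, for example to log it, one possible variation is to decode the bytes from the MemoryStream yourself instead of reading them back with an XmlReader.  This is a sketch under a few assumptions: the class name Utf8XmlExample and helper ToUtf8XmlString are hypothetical, and it uses new UTF8Encoding(false) instead of Encoding.UTF8 so that no byte order mark ends up at the start of the decoded string.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class Utf8XmlExample
{
    // Hypothetical helper: serializes an element to UTF-8 bytes, then decodes them for display.
    public static string ToUtf8XmlString(XElement elt)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        // UTF8Encoding(false) omits the byte order mark, so the decoded
        // string starts directly with the XML declaration.
        settings.Encoding = new UTF8Encoding(false);
        settings.Indent = true;
        settings.OmitXmlDeclaration = false;

        using (MemoryStream memStream = new MemoryStream())
        {
            // Disposing the writer flushes everything into the stream.
            using (XmlWriter writer = XmlWriter.Create(memStream, settings))
            {
                elt.WriteTo(writer);
            }

            // The stream holds UTF-8 bytes, so decode them as UTF-8.
            return Encoding.UTF8.GetString(memStream.ToArray());
        }
    }

    static void Main()
    {
        XElement elt = XElement.Parse("<root xmlns=\"foo\"><bar/></root>");
        Console.WriteLine(ToUtf8XmlString(elt));
    }
}
```

Keep in mind that the returned string is itself UTF-16 in memory, so the utf-8 declaration is only accurate again once the text is written back out as UTF-8 bytes.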
