Serializing Encoded XML Documents using LINQ to XML

Writing encoded (utf-8, utf-16, etc.) documents using LINQ to XML is pretty straightforward, but there is one interesting wrinkle in the semantics. When you serialize to a file on disk, you can set the encoding in the XML declaration, and the resulting XML document will be serialized in that encoding. However, if you write to a stream or writer that supports only one specific encoding, the XML document is encoded to match it, overriding the encoding you specified, and the XML declaration is adjusted accordingly.

The following example serializes the document as utf-8. The encoding of the serialized XML is specified by the encoding property of the XDeclaration, which is set through the second parameter of its constructor.

XDocument doc8 = new XDocument(
    new XDeclaration("1.0", "utf-8", "yes"),
    new XElement("Root", 1));
doc8.Save("encoded-utf-8.xml");
Console.WriteLine(File.ReadAllText("encoded-utf-8.xml"));

This outputs:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Root>1</Root>

You can change the encoding of this document like this:

XDocument doc16 = new XDocument(
    new XDeclaration("1.0", "utf-16", "yes"),
    new XElement("Root", 1));
doc16.Save("encoded-utf-16.xml");
Console.WriteLine(File.ReadAllText("encoded-utf-16.xml"));

The resulting XML document looks like this:

<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<Root>1</Root>

However, all strings in .NET are encoded using utf-16. If you create a StringWriter object over a StringBuilder and serialize some XML to it, the output is utf-16 even though the XML declaration asks for utf-8, and the declaration is rewritten to say so:

XDocument doc = new XDocument(
    new XDeclaration("1.0", "utf-8", "yes"),
    new XElement("Root", 1));
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb))
using (XmlWriter xw = XmlWriter.Create(sw))
    doc.WriteTo(xw);
Console.WriteLine(sb.ToString());

This outputs:

<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<Root>1</Root>
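
If you need the declaration to say utf-8 while still serializing to a string, one commonly used approach is to give XmlWriter a StringWriter that reports a different encoding. The following is a minimal sketch, not something from LINQ to XML itself: Utf8StringWriter and Utf8StringDemo are hypothetical names introduced here for illustration. It assumes that XmlWriter takes the encoding name for the declaration from the Encoding property of the TextWriter it is given.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper (not part of the framework): a StringWriter that
// reports UTF-8 instead of the default UTF-16.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class Utf8StringDemo
{
    // Serializes the document to a string through the UTF-8-reporting writer.
    public static string Serialize(XDocument doc)
    {
        using (StringWriter sw = new Utf8StringWriter())
        {
            using (XmlWriter xw = XmlWriter.Create(sw))
                doc.WriteTo(xw);
            return sw.ToString();
        }
    }

    public static void Main()
    {
        XDocument doc = new XDocument(
            new XDeclaration("1.0", "utf-8", "yes"),
            new XElement("Root", 1));
        Console.WriteLine(Serialize(doc));
    }
}
```

With this writer the declaration should read encoding="utf-8". Note that the string itself is still held in memory as utf-16, like every .NET string; only the declaration changes. That is mainly useful when the string will later be written out as utf-8 bytes.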

The same thing can happen when writing to the console:

XDocument doc = new XDocument(
    new XDeclaration("1.0", "utf-8", "yes"),
    new XElement("Root", 1));
using (XmlWriter xw = XmlWriter.Create(Console.Out))
    doc.WriteTo(xw);

Here is what this snippet outputs:

<?xml version="1.0" encoding="IBM437" standalone="yes"?>
<Root>1</Root>

This makes sense: by default, the encoding of the console is set to the IBM437 code page, whereas the Windows code page is 1252.

So you can specify your desired encoding, and if the target you are serializing to supports that encoding, you will get what you asked for. If the target supports only one particular encoding, that is what you will get, and the XML declaration will reflect it.
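
To illustrate the first half of that statement without touching the disk, here is a small sketch that saves to a MemoryStream, which is a plain byte stream with no encoding of its own. It assumes XDocument.Save(Stream) is available (.NET 4.0 or later) and that it honors the encoding in the XDeclaration the same way the file-based Save does. StreamEncodingDemo, SaveToBytes, and Decode are hypothetical names used only for this example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class StreamEncodingDemo
{
    // Saves the document to an in-memory byte stream and returns the raw bytes.
    public static byte[] SaveToBytes(XDocument doc)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            doc.Save(ms);
            return ms.ToArray();
        }
    }

    // Decodes the bytes back to text. StreamReader is asked to detect the
    // encoding from the byte order mark, falling back to UTF-8.
    public static string Decode(byte[] bytes)
    {
        using (StreamReader reader =
            new StreamReader(new MemoryStream(bytes), Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        XDocument doc16 = new XDocument(
            new XDeclaration("1.0", "utf-16", "yes"),
            new XElement("Root", 1));
        byte[] bytes = SaveToBytes(doc16);
        Console.WriteLine("{0} bytes", bytes.Length);
        Console.WriteLine(Decode(bytes));
    }
}
```

Because the byte stream imposes no encoding, the declaration in the decoded text should still say utf-16, and the byte count should be roughly double what the same document produces when declared as utf-8.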

Comments

  • Anonymous
    April 12, 2010
    Well, thank you for confirming the annoying behavior that I've been experiencing in my program. However, tips on how to fix or get around this problem would have been useful.