LINQ Farm: Preserving Formatting with LINQ to XML

In a previous post, you saw how to work with line numbers when using LINQ to XML to read a file. This post continues in the same vein, but this time the focus is on how to:

  • Read in an XML file with an arbitrary format, and then write it back out to disk in exactly the same format.
  • Read in an XML file with an arbitrary format, and write it back out with standard formatting.

Some scenarios where you might need this functionality include reading a document from one location and writing it to another location, or reading in a document, editing it, and writing it back out with new data, but the same format.

You can download the source for the code shown in this post.

How to Preserve Formatting

Consider the following block of XML:

<?xml version="1.0" encoding="utf-8" ?>
<alpha><beta> 
 sam 
</beta></alpha>

This document has an extremely idiosyncratic layout. Tags are nested on the same line and a text node stands on its own, surrounded by linefeeds. If you wish to read in this document and write it back out with the formatting preserved, here is how to proceed:

XDocument x = XDocument.Load(readFileName, LoadOptions.PreserveWhitespace);
x.Save(writeFileName, SaveOptions.DisableFormatting);

When calling XDocument.Load, pass in the LoadOptions.PreserveWhitespace flag from the following enumeration:

public enum LoadOptions
{
  None = 0,
  PreserveWhitespace = 1,
  SetBaseUri = 2,
  SetLineInfo = 4
}

When writing the document back out to disk, use SaveOptions.DisableFormatting, from the following enumeration:

public enum SaveOptions
{
  None = 0,
  DisableFormatting = 1
}

If you take these two steps, then the XML you write to disk will have the same formatting as the document you read in.
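
If you want to try the two flags without touching the disk, here is a minimal sketch that applies them to an in-memory string. It is an illustration rather than part of the downloadable sample: the class RoundTripDemo and the method RoundTrip are hypothetical names, and it uses XDocument.Parse and ToString in place of Load and Save. ToString does not emit the XML declaration, so the sample string leaves it out.

```csharp
using System;
using System.Xml.Linq;

public static class RoundTripDemo
{
    // Hypothetical helper: parse with white space preserved, then serialize
    // without letting LINQ to XML re-indent the result.
    public static string RoundTrip(string xml)
    {
        XDocument doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);
        return doc.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        // Build the oddly formatted sample with the platform's line ending.
        string nl = Environment.NewLine;
        string original = "<alpha><beta> " + nl + " sam " + nl + "</beta></alpha>";

        string copy = RoundTrip(original);
        Console.WriteLine(copy);
        Console.WriteLine("Same layout: {0}", copy == original);
    }
}
```

The sample string is built with Environment.NewLine because the writer may emit the platform's line ending when it serializes a text node, so a comparison against a string with different line endings could report a mismatch even though the layout is the same.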

Modifying Oddly Formatted XML

If you read in XML that has an odd format, it is likely that linefeeds play a role in that formatting. Suppose we want to modify the XML document shown previously so that it looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<alpha><beta>
  Sue  
</beta></alpha>

The value sam from the original document has been replaced with the word Sue. To make this substitution properly, you need to take care to preserve the linefeeds in the original document. In this example I show a relatively mindless way to preserve linefeeds. For good measure, I also add in code for working with line numbers.

XDocument x2 = XDocument.Load(readFileName,
    LoadOptions.PreserveWhitespace | LoadOptions.SetLineInfo);

XText value = (from c in x2.Elements().DescendantNodes().OfType<XText>()
               where c.Value.Trim().Length > 0
               select c).Single();

IXmlLineInfo x = (IXmlLineInfo)value;
Console.WriteLine("Line Number: {0}", x.LineNumber);
value.Value = Environment.NewLine + "  Sue  " + Environment.NewLine;

x2.Save(createFileName, SaveOptions.DisableFormatting);

This code reads in the original XML file with the PreserveWhitespace flag. I've also OR'd in the LoadOptions.SetLineInfo flag. I do this not because I have any real need to do so, but simply so you can see how to pass two flags to the Load method.

The next block of code queries the document for all the text nodes that consist of something more than pure white space. I then output the line number of the node that was found. Finally, I modify the node, replacing the original text with the word Sue:

value.Value = Environment.NewLine + "  Sue  " + Environment.NewLine;

To modify the node, I simply replace the Value property with new text. Note that the code explicitly adds a pair of linefeeds to preserve the original, idiosyncratic formatting. If you are confused about what is happening here, look again at the original document. It begins with two tags:

<alpha><beta>

Then there is a linefeed, the word sam, and another linefeed and finally the closing tags:

</beta></alpha>

The code I've written simply preserves those linefeeds. I'm doing this not because I think it is a good idea to include white space like this in a document, but simply to show you that it is possible to preserve it if you have a need or desire to do so.

After modifying the document, the XML is written back to disk using the same technique explained in the previous section.
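
The listing above hard-codes the linefeeds and spaces around the new value. As an alternative, here is a hedged sketch that copies whatever white space surrounded the old text instead of spelling it out. The class WhitespaceEdit and its methods ReplaceCore and Edit are hypothetical names invented for this illustration, and the sketch works on an in-memory string rather than a file.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class WhitespaceEdit
{
    // Hypothetical helper: replace the visible text of a node while keeping
    // whatever white space originally came before and after it.
    public static void ReplaceCore(XText node, string newText)
    {
        string old = node.Value;
        string withoutLeading = old.TrimStart();
        string leading = old.Substring(0, old.Length - withoutLeading.Length);
        string trailing = withoutLeading.Substring(withoutLeading.TrimEnd().Length);
        node.Value = leading + newText + trailing;
    }

    // Hypothetical helper: apply ReplaceCore to the single non-blank text node
    // and return the document serialized without re-indentation.
    public static string Edit(string xml, string newText)
    {
        XDocument doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);
        XText value = doc.DescendantNodes().OfType<XText>()
                         .Single(t => t.Value.Trim().Length > 0);
        ReplaceCore(value, newText);
        return doc.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        string nl = Environment.NewLine;
        string original = "<alpha><beta> " + nl + " sam " + nl + "</beta></alpha>";
        Console.WriteLine(Edit(original, "Sue"));
    }
}
```

Like the query in the post, Single assumes the document holds exactly one text node with visible content. A document with several such nodes would need a more specific query.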

Cleaning Up XML

If you discover an XML file that has idiosyncratic formatting that you want to remove, LINQ to XML makes it fairly easy to clean up such a document. By default, LINQ to XML will write out a document with proper formatting. All you have to do is use the Load and Save methods without any of the LoadOptions or SaveOptions flags shown in the previous two sections:

XDocument x2 = XDocument.Load(readFileName);
x2.Save(createFileName);

This code will not, however, remove arbitrary white space from a text node. For instance, the text node described above that begins and ends with a linefeed will still have linefeeds when you write the document back out. This may be exactly what you want. On the other hand, you may feel it leaves your document in a pretty ugly state. Here is how to clean up the problem:

XDocument x2 = XDocument.Load(readFileName);

var query = from c in x2.Elements().DescendantNodes().OfType<XText>()
            select c;

foreach (var item in query)
{
    item.Value = item.Value.Trim();
}

x2.Save(createFileName);

This code finds all the text nodes in the document and strips away their leading and trailing white space with the standard string Trim method. When you are done, you can write the document back out to disk. It will now have the clean formatting you expect:

<?xml version="1.0" encoding="utf-8"?>
<alpha>
  <beta>sam</beta>
</alpha>

Finally, if you want to strip away all white space, and end up with a document that sits on a single line, you can write code that looks just like the listing shown earlier in this section, except that the call to Save the document should look like this:

x2.Save(createFileName, SaveOptions.DisableFormatting);

The output you create will appear all on one line, like this:

<?xml version="1.0" encoding="utf-8"?><alpha><beta>sam</beta></alpha>

If you did not clean up the text nodes as shown above, then your code would end up with linefeeds in it from the text nodes. The result, in our case, would look like this:

<?xml version="1.0" encoding="utf-8"?><alpha><beta>
  sam  
</beta></alpha>

The point here is that DisableFormatting turns off LINQ to XML’s attempts to format your XML properly. If you read in strangely formatted XML with the PreserveWhitespace flag, then using this flag will preserve that idiosyncratic formatting. If you read in XML with no attempt to preserve the original formatting, as we do in this section, then writing to disk with this option strips away all formatting, leaving the document on one line except for any white space inside a text node.
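
To see how the load and save flags interact, here is a small sketch that pushes the same oddly formatted string through all four combinations. The class OptionMatrix and the method Reserialize are hypothetical names, and the sketch uses Parse and ToString on an in-memory string, so no XML declaration appears in the output.

```csharp
using System;
using System.Xml.Linq;

public static class OptionMatrix
{
    // Hypothetical helper: parse with one set of load options and serialize
    // with one set of save options.
    public static string Reserialize(string xml, LoadOptions load, SaveOptions save)
    {
        return XDocument.Parse(xml, load).ToString(save);
    }

    public static void Main()
    {
        string nl = Environment.NewLine;
        string odd = "<alpha><beta> " + nl + " sam " + nl + "</beta></alpha>";

        LoadOptions[] loads = { LoadOptions.None, LoadOptions.PreserveWhitespace };
        SaveOptions[] saves = { SaveOptions.None, SaveOptions.DisableFormatting };

        // Print the result of every load/save combination for comparison.
        foreach (LoadOptions load in loads)
        {
            foreach (SaveOptions save in saves)
            {
                Console.WriteLine("Load: {0}, Save: {1}", load, save);
                Console.WriteLine(Reserialize(odd, load, save));
                Console.WriteLine();
            }
        }
    }
}
```

Running the sketch against your own documents is a quick way to check which combination gives the layout you are after before you commit to writing files.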

Working with Text Files

In this post I've focused on reading in and writing XML files using the LINQ to XML API. It's perhaps useful to recall that you can also use standard .NET IO in order to perform similar tasks. This can be particularly helpful when you are testing your code, and want to be sure that your LINQ to XML routines are behaving as expected.

Here is a simple routine for reading in an XML file, or any other text file:

private static void ReadAsText(string fileName)
{
    string s = File.ReadAllText(fileName);
    Console.WriteLine(s);
}

And here is the C# call for writing a file to disk:

// fileName is the path to write to and s is the text to store
File.WriteAllText(fileName, s);

If you download the sample code associated with this post, you will see that I use these routines to confirm that my code is working correctly.
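
Here is a rough sketch of how such a check might look. It is not taken from the sample download: the class FileCheck and the method SameText are hypothetical names, and the temporary files stand in for readFileName and writeFileName. It reads both files as text and compares the results.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

public static class FileCheck
{
    // Hypothetical helper: true when both files contain identical text.
    public static bool SameText(string firstFile, string secondFile)
    {
        return File.ReadAllText(firstFile) == File.ReadAllText(secondFile);
    }

    public static void Main()
    {
        // Temporary files stand in for the post's readFileName and writeFileName.
        string readFileName = Path.GetTempFileName();
        string writeFileName = Path.GetTempFileName();
        string nl = Environment.NewLine;

        File.WriteAllText(readFileName,
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>" + nl +
            "<alpha><beta> " + nl + " sam " + nl + "</beta></alpha>");

        // Round trip the document with both formatting-preserving flags.
        XDocument doc = XDocument.Load(readFileName, LoadOptions.PreserveWhitespace);
        doc.Save(writeFileName, SaveOptions.DisableFormatting);

        Console.WriteLine("Files match: {0}", SameText(readFileName, writeFileName));
        Console.WriteLine(File.ReadAllText(writeFileName));

        File.Delete(readFileName);
        File.Delete(writeFileName);
    }
}
```

If the check reports a mismatch, printing both files as text, as the last WriteLine call does for the copy, is usually the fastest way to spot what changed.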

Summary

In this post you've seen how to preserve white space in an XML document, and how to strip it out and return to standard formatting. You also had a chance to learn how to edit an idiosyncratically structured XML document without changing its format.

Download the source.

Comments

  • Anonymous
    October 09, 2008
    In a similar vein to these posts, is there a simple way to find the end position of an XML node? I wish to highlight the source text of a node in a RichTextBox, for example. I can find the start using the XDocument.Load SetLineInfo option, but how can I reliably find the end of the EndElement? A simple xelement.ToString().Length doesn't work because the ToString() format can be different from the source. Is there a way without parsing the source myself?