Document-Centric Transforms using LINQ to XML

When thought of in a certain way, XML documents come in two flavors – data-centric and document-centric.  Further, there are two types of document-centric documents.  This post presents my thoughts about approaches to various types of document-centric transformations – data-centric to document-centric, document-centric to data-centric, and document-centric to document-centric.  Then, I’ll tie my thoughts back to Open XML transformations.

This blog is inactive.
New blog: EricWhite.com/blog

Data-centric to data-centric is, of course, the scenario that LINQ to XML absolutely shines at.  There’s been a lot written about this.  This post won’t focus on these types of transformations, but instead will give my thoughts on the wrinkle that document-centric XML documents give to transformations.

First, I’ll define what I mean by data-centric and document-centric XML documents.

Data-Centric XML Document

A data-centric XML document contains regular repeating elements.  Child elements of a given element might all have the same tag name, or they might not.  Typically, child element order doesn’t matter.  There are lots of examples of this – many types of transforms of a relational database to XML result in data-centric XML.  RSS feeds are another.

Here’s a data-centric XML document:

<Customers>
  <Customer>
    <Name>Bob</Name>
    <Age>45</Age>
  </Customer>
  <Customer>
    <Name>Jill</Name>
    <Age>37</Age>
  </Customer>
</Customers>

Document-Centric XML Document

Document-centric XML documents have the characteristic that the child elements of a given element are much less bounded – you might have many child elements of a given name, or you might have none.  You might have ‘recursion’ in the hierarchy – element A is a child of element B, which is itself a child of a different element A.  A number of examples: Open XML word processing markup, XHTML, and XPS.

I further divide document-centric XML documents into two camps – those that contain mixed content, and those that don’t.  Mixed content is a variety of XML where significant text nodes and elements are interspersed.  Insignificant text nodes are the white space that provides indenting when formatting XML.  Open XML word processing markup doesn’t contain mixed content, whereas XHTML does:

An Open XML paragraph that contains some bold text:

<w:p>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>def</w:t>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>

An XHTML document that contains significant text nodes interspersed with element start and end tags:

<html>
  <head></head>
  <body>
    <p>abc<b>def</b>ghi</p>
  </body>
</html>
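To make the distinction concrete, here is a minimal sketch that lists the child nodes of an element and tests for mixed content.  The class and method names (MixedContentSketch, DescribeChildren, HasMixedContent) are mine, invented for illustration.  For the XHTML paragraph above, the text nodes and the b element show up as siblings; for the Open XML paragraph, text only ever appears alone inside w:t.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class MixedContentSketch
{
    // Describe each child node: in mixed content, text nodes and elements are siblings.
    public static List<string> DescribeChildren(XElement element)
    {
        return element.Nodes()
            .Select(n => n is XText t ? "text:" + t.Value
                       : n is XElement e ? "element:" + e.Name.LocalName
                       : "other:" + n.NodeType)
            .ToList();
    }

    // True when an element has both child elements and non-whitespace text nodes.
    public static bool HasMixedContent(XElement element)
    {
        return element.Elements().Any()
            && element.Nodes().OfType<XText>().Any(t => t.Value.Trim().Length > 0);
    }
}
```

Calling DescribeChildren on the parsed p element should yield a text node, the b element, and another text node, in that order.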

Types of Transformations

If we’re going to divide the XML world into the categories of data-centric and document-centric, it follows that there are four types of transformations.

Data-Centric => Data-Centric

There is a lot to say (and has been said) about these types of transformations.  In the LINQ to XML documentation, I included a tutorial on pure functional transformations of XML.  I also have a tutorial on my blog on composing queries in the pure functional style.
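As a reminder of what this style looks like, here is a small sketch that projects the Customers document shown earlier into a different data-centric shape.  The target vocabulary (People, Person, and the name and age attributes) is hypothetical, chosen only for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class DataCentricSketch
{
    // Project each Customer into a flatter, attribute-based element, ordered by name.
    // No existing tree is modified; the result is built in a single expression.
    public static XElement ToPeople(XElement customers)
    {
        return new XElement("People",
            from c in customers.Elements("Customer")
            orderby (string)c.Element("Name")
            select new XElement("Person",
                new XAttribute("name", (string)c.Element("Name")),
                new XAttribute("age", (int)c.Element("Age"))));
    }
}
```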

Data-Centric => Document-Centric

These transforms are report writers for databases – take some subset of records, transform to XML, then transform that XML into another form – XPS, for instance.  The transform may be based on another source document, the report definition.  These types of transforms are straightforward to write in the pure functional style.  Based on the simplicity or complexity of the report definition, this type of transform could be a few hundred lines of code or many thousands.

There are also many good examples of transforming data-centric XML to Open XML markup.  We may want to transform a collection of records into a table in a word processing document, or into rows and cells in a worksheet.
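A rough sketch of the records-to-table case, using the Customers document from earlier, might look like the following.  The names TableSketch, Cell, and ToTable are mine.  This produces only the skeleton of a table: a table in a real word processing document also needs table properties and a grid definition (w:tblPr, w:tblGrid), which are omitted here.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class TableSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // One table cell holding a single paragraph with one run of text.
    static XElement Cell(string text) =>
        new XElement(w + "tc",
            new XElement(w + "p",
                new XElement(w + "r",
                    new XElement(w + "t", text))));

    // One w:tr per Customer, one w:tc per child element (Name, Age).
    public static XElement ToTable(XElement customers) =>
        new XElement(w + "tbl",
            new XAttribute(XNamespace.Xmlns + "w", w.NamespaceName),
            customers.Elements("Customer").Select(c =>
                new XElement(w + "tr",
                    c.Elements().Select(f => Cell(f.Value)))));
}
```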

Document-Centric => Data-Centric

We write this type of transform when querying an Open XML document for some aspect of the markup.  If we want to retrieve a collection of comments from a document, or if we want a collection of content controls, then we write a query that iterates over certain descendant elements, projecting a regular data structure – perhaps a collection of strings or anonymous types.  The query that I develop in my functional programming tutorial is a document-centric => data-centric transform.

Another example is finding all hyperlinks in an XHTML document.  It is easy to write a LINQ query to retrieve a collection of links and transform the collection to a regular repeating data structure.
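The link query might be sketched like this.  LinkSketch and GetLinks are names I made up, and the sketch assumes the elements are in no namespace, as in the XHTML samples in this post; a document that declares the XHTML namespace (http://www.w3.org/1999/xhtml) would need an XNamespace in front of the element name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class LinkSketch
{
    // Project every <a> that has an href into a regular (Text, Href) pair,
    // no matter how deeply it is nested in the document.
    public static List<(string Text, string Href)> GetLinks(XElement html)
    {
        return html.Descendants("a")
            .Where(a => a.Attribute("href") != null)
            .Select(a => (Text: a.Value, Href: (string)a.Attribute("href")))
            .ToList();
    }
}
```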

Document-Centric => Document-Centric

This is where it starts to get a little more involved.  There are a variety of these types of transformations.

Common-vocabulary document-centric transform: Sometimes we want to transform an XML tree to a new tree in the same vocabulary.  For example, we can transform an Open XML document into a new Open XML document with modified contents - comments are removed or revisions accepted.  Another example - replace content controls with other markup based on the contents of the content controls.

Different-vocabulary document-centric transform: Sometimes we want to transform from one document-centric vocabulary to another one – Open XML => XHTML, or XHTML => Open XML.  With this type of transform, the ease with which we write the transform is directly related to whether the two vocabularies have a similar structure.  For instance, there is much that is parallel between Open XML and XHTML.  There is a body element.  The body contains paragraphs and tables.  Tables contain rows, which contain cells.  Tables can contain other tables in cells.

XSLT works well for these types of transformations – you write a pattern to match a node, and then supply the transformation for just that node.  In the case of XSLT, you can indicate to the transform engine to ‘continue processing rules for child elements’, so that you can specify the transforms for those child elements in their own rules.  If you are aware of Flat OPC, it is pretty easy to process Open XML documents using XSLT.

Some time ago, I wrote a post on an approach for using LINQ to XML annotations for doing this type of transform.  In that post, I was proving out whether you could write document-centric transforms using LINQ to XML in a style similar to XSLT.  It’s easy to read the code to specify the transform if you read LINQ code easily, but there are obvious problems with the approach, not the least of which is that annotating a tree in that fashion might have performance issues if you are working with too large a tree.

Even though Open XML and XHTML have similar structures, there are places where the structures are not parallel, and in those cases, you still must jump through hoops.  In XSLT, this often means generating intermediate trees to use in subsequent transforms.  I’ve seen XSLT transforms where the first thing the developer did was to transform the tree to a new tree with new attributes on elements – the purpose of the attributes was to aid further transforms.  If using the LINQ to XML approach that uses annotations, you must deal with the same issues – parts of the transformation are expressed as nice mappings between a pattern that matches nodes and the subsequent transform of those nodes, and parts of the transformation deal with abstractions that often must be explained in comments.  It’s just more complicated to do these types of transformations.

Open XML Document-Centric Transforms

There are lots of examples of interesting common-vocabulary document-centric Open XML transforms.  Removing comments and accepting revisions are two, but there are many others.

I've needed to write a number of these over the last couple of years.  Because I know the size of documents that I potentially need to process (>2 million nodes), I rejected the annotations approach for simple transforms of Open XML documents.  For performance reasons, it just wouldn’t fly.

I also rejected using XSLT – I really don’t want to step out into another language.  XSLT is an attractive approach if you already have an XSLT transform written, or if you are particularly fluent in XSLT.  You must deal with converting the OPC (Zip) file to the Flat OPC format, but this is easy.  But when I’m writing little examples that show how to do something interesting in Open XML, XSLT isn’t appropriate.

So, for instance, for the code to accept tracked revisions, I opted for the tree-modification approach.  This isn’t ideal from a functional programming purist’s point of view, but it performs well in the real world.  You have to be careful when coding, but no big deal.
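To give a flavor of the tree-modification style, here is a deliberately simplified sketch.  It is not the actual revision-accepting code: it handles only w:ins and w:del, whereas real revision markup includes many more elements (w:rPrChange, w:moveFrom, w:moveTo, and so on).  The names TreeModificationSketch and AcceptSimpleRevisions are hypothetical.  The comments point out where the care is needed.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class TreeModificationSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Mutates the tree in place: drops deleted content, unwraps inserted content.
    public static void AcceptSimpleRevisions(XElement root)
    {
        // The Remove extension method snapshots the sequence before removing,
        // so it is safe to call directly on a lazy Descendants() query.
        root.Descendants(w + "del").Remove();

        // Materialize the query before mutating the tree.  Reverse() visits
        // inner elements first, so a nested w:ins is unwrapped before its
        // ancestor is replaced.
        foreach (XElement ins in root.Descendants(w + "ins").Reverse().ToList())
            ins.ReplaceWith(ins.Nodes());
    }
}
```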

Recursive Approach to Transforms

Lately, I’ve been writing more of these types of transforms in a recursive style.  The gist of this technique is that you write a recursive function to clone a tree, and while cloning, you trim nodes, or transform nodes, or whatever.

This approach has good performance, and it is appealing in that when you are writing a more complicated recursive transform, you can write it in terms of other simpler recursive transforms.  The code should be written with no side effects, and if so, transforms are easy to write and debug.
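In its simplest form, the clone-and-trim idea can be sketched as follows.  RecursiveCloneSketch and CloneWithout are names I invented, and the predicate parameter is just one way to decide what to trim.  The sketch relies on two behaviors of LINQ to XML: null content is ignored when constructing an element, and nodes or attributes that already have a parent are copied when added to a new one.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class RecursiveCloneSketch
{
    // Returns a transformed copy of the node, or null to drop it.
    // The source tree is never modified.
    public static object CloneWithout(XNode node, Func<XElement, bool> shouldRemove)
    {
        if (node is XElement element)
        {
            if (shouldRemove(element))
                return null;    // trim this element and everything below it

            return new XElement(element.Name,
                element.Attributes(),   // copied when added to the new element
                element.Nodes().Select(n => CloneWithout(n, shouldRemove)));
        }

        // Text, comments, and so on are copied when added to the new parent.
        return node;
    }
}
```

Because the function has no side effects, a more complicated transform can call it, or a sibling function of the same shape, on any subtree without worrying about the state of the source document.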

This approach has a drawback.  It’s somewhat harder to intuitively see the mapping between the pattern that matches a node, and the transform for that pattern.  However, we don’t lose this entirely.  For example, we may want to write a recursive transform of an XHTML document to Open XML.  Here is the XHTML document:

<html>
  <head></head>
  <body>
    <p>abc<b>def</b>ghi</p>
  </body>
</html>

We want to transform it to this document:

<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>abc</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>def</w:t>
      </w:r>
      <w:r>
        <w:t>ghi</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>

We can write the recursive transform like this:

using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static object XHtmlToOpenXml(XNode node)
    {
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        XElement element = node as XElement;
        if (element != null)
        {
            // html => w:document, declaring the w: namespace prefix
            if (element.Name == "html")
                return new XElement(w + "document",
                    new XAttribute(XNamespace.Xmlns + "w", w.NamespaceName),
                    element.Elements().Select(e => XHtmlToOpenXml(e)));

            if (element.Name == "body")
                return new XElement(w + "body",
                    element.Elements().Select(e => XHtmlToOpenXml(e)));

            // p => w:p; Nodes() rather than Elements(), so text nodes are transformed too
            if (element.Name == "p")
                return new XElement(w + "p",
                    element.Nodes().Select(n => XHtmlToOpenXml(n)));

            // b => a run with bold run properties
            if (element.Name == "b")
                return new XElement(w + "r",
                    new XElement(w + "rPr",
                        new XElement(w + "b")),
                    new XElement(w + "t",
                        element.Value));
        }

        // a text node becomes a plain run
        XText t = node as XText;
        if (t != null)
            return new XElement(w + "r",
                new XElement(w + "t", t.Value));

        // ignore all other nodes
        return null;
    }

    static void Main(string[] args)
    {
        XElement root = XElement.Parse(