Transforming WordprocessingML to Simpler XML for Easier Processing

This is one in a series of posts on transforming Open XML WordprocessingML to XHtml.  You can find the complete list of posts here.

When building a document processing system based on Open XML WordprocessingML, one approach to making your software system more robust is to use a technique where you make use of WordprocessingML style information to first transform a WordprocessingML document to a simpler form of XML.  You then can write code to further process or query the simpler XML.  Because your code operates on the simpler XML document, it is easier to validate that your code is correct.  You stand a better chance that your software will have the desired behavior in more circumstances.  This post presents one approach to transforming and processing a WordprocessingML document.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCThe following screen-shot uses a feature of Microsoft Word that displays the style name for every paragraph to the paragraph's left.  Your document might look like the following:

The WordprocessingML markup for the document looks like this:

<w:pw:rsidR="00DD5B8D"
w:rsidRDefault="00D56D15"
w:rsidP="00D56D15">
<w:pPr>
<w:pStylew:val="Heading1"/>
</w:pPr>
<w:r>
<w:t>Introduction to WordprocessingML</w:t>
</w:r>
</w:p>
<w:sdt>
<w:sdtPr>
<w:aliasw:val="Overview"/>
<w:tagw:val="Overview"/>
<w:idw:val="17452686"/>
<w:placeholder>
<w:docPartw:val="DefaultPlaceholder_22675703"/>
</w:placeholder>
</w:sdtPr>
<w:sdtEndPr>
<w:rPr>
<w:rFontsw:asciiTheme="minorHAnsi"
w:eastAsiaTheme="minorHAnsi"
w:hAnsiTheme="minorHAnsi"
w:cstheme="minorBidi"/>
<w:bw:val="0"/>
<w:bCsw:val="0"/>
<w:colorw:val="auto"/>
<w:szw:val="22"/>
<w:szCsw:val="22"/>
</w:rPr>
</w:sdtEndPr>
<w:sdtContent>
<w:pw:rsidR="00D56D15"
w:rsidRDefault="00D56D15"
w:rsidP="00D56D15">
<w:pPr>
<w:pStylew:val="Heading2"/>
</w:pPr>
<w:r>
<w:t>Overview</w:t>
</w:r>
</w:p>
<w:pw:rsidR="00D56D15"
w:rsidRDefault="00D56D15">
<w:r>
<w:t>On the Insert tab, the galleries include items.</w:t>
</w:r>
</w:p>
</w:sdtContent>
</w:sdt>
<!-- some content elided to simplify the listing -->
</w:sdt>

You could transform this markup to something along the lines of the following, which is easier to further process.

<z:documentxmlns:z="https://www.adventureworks.com/sample">
<z:pstyle="Heading1">Introduction to WordprocessingML</z:p>
<z:contentControltag="Overview">
<z:pstyle="Heading2">Overview</z:p>
<z:pstyle="Normal">On the Insert tab, the galleries include items.</z:p>
</z:contentControl>
<z:contentControltag="Section">
<z:pstyle="Heading2">Next Section</z:p>
<z:pstyle="Normal">You can use these galleries to insert tables.</z:p>
</z:contentControl>
</z:document>

The first step in writing a transform to simpler XML is to accept tracked changes.  With these types of transforms, where you first transform valid WordprocessingML to another simpler form of valid WordprocessingML, you probably want to use a technique where you make all changes to a temporary in-memory document.  You probably do not want to write the simpler XML back to the original source document.  Simplifying Open XML WordprocessingML Queries by First Accepting Revisions presents the simplest approach for doing this.

The second step is to further process the document (which now contains no tracked changes), and remove features of Open XML that are not interesting to your transform.  Enabling Better Transformations by Simplifying Open XML WordprocessingML Markup introduces the MarkupSimplifier class, which is part of the PowerTools for Open XML project on CodePlex.  That class makes it very easy to reliably remove unused features of WordprocessingML.

One more important tool in your toolbox is to use the LogicalChildrenContent axis method.  Mastering Text in Open XML Word-Processing Documents introduces the LogicalChildrenContent axis method, and explains when you want to use it.

A transform from WordprocessingML to a simpler XML vocabulary is a document-centric transform.  The blog post Document-Centric Transforms using LINQ to XML explores the nature of these types of transforms.  XSLT is a common tool for writing these types of transforms.  It is designed with the express purpose of building them.  Recursive Approach to Pure Functional Transformations of  XML introduces one approach for writing these types of transforms using C# 3.0.  This approach allows you to write an absolute minimum of C# code to transform the XML.

The following is a complete listing of the transform that produces the simpler XML shown at the beginning of this post.  You can see that it doesn't take much code to do the transform – less than 90 lines of code for the entire example, and only 31 lines of code for the transform function.  This example can be found in the HtmlConverter.zip download under the Downloads tab at PowerTools for Open XML.  This code will work properly regardless of whether the WordprocessingML markup contains revisions, smart tags, or any of the other interesting features of WordprocessingML.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

class Program
{
static XNamespace Z = "https://www.adventureworks.com/sample";

static object TransformToSimpleXml(XNode node, string defaultParagraphStyleId)
{
XElement element = node as XElement;
if (element != null)
{
if (element.Name == W.document)
return new XElement(Z + "document",
new XAttribute(XNamespace.Xmlns + "z", Z),
element.Element(W.body).Elements()
.Select(e => TransformToSimpleXml(e, defaultParagraphStyleId)));
if (element.Name == W.p)
{
string styleId = (string)element.Elements(W.pPr)
.Elements(W.pStyle).Attributes(W.val).FirstOrDefault();
if (styleId == null)
styleId = defaultParagraphStyleId;
return new XElement(Z + "p",
new XAttribute("style", styleId),
element.LogicalChildrenContent(W.r).Elements(W.t).Select(t => (string)t)
.StringConcatenate());
}
if (element.Name == W.sdt)
return new XElement(Z + "contentControl",
new XAttribute("tag", (string)element.Elements(W.sdtPr)
.Elements(W.tag).Attributes(W.val).FirstOrDefault()),
element.Elements(W.sdtContent).Elements()
.Select(e => TransformToSimpleXml(e, defaultParagraphStyleId)));
return null;
}
return node;
}

static void Main(string[] args)
{
byte[] byteArray = File.ReadAllBytes("Test.docx");
using (MemoryStream memoryStream = new MemoryStream())
{
memoryStream.Write(byteArray, 0, byteArray.Length);
using (WordprocessingDocument wordDoc =
WordprocessingDocument.Open(memoryStream, true))
{
RevisionAccepter.AcceptRevisions(wordDoc);
SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
RemoveComments = true,
RemoveContentControls = false,
RemoveEndAndFootNotes = true,
RemoveFieldCodes = true,
RemoveLastRenderedPageBreak = true,
RemovePermissions = true,
RemoveProof = true,
RemoveRsidInfo = true,
RemoveSmartTags = true,
RemoveSoftHyphens = true,
ReplaceTabsWithSpaces = true,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
string defaultParagraphStyleId = wordDoc.MainDocumentPart
.StyleDefinitionsPart.GetXDocument().Root.Elements(W.style)
.Where(e => (string)e.Attribute(W.type) == "paragraph" &&
(string)e.Attribute(W._default) == "1")
.Select(s => (string)s.Attribute(W.styleId))
.FirstOrDefault();
XElement simplerXml = (XElement)TransformToSimpleXml(
wordDoc.MainDocumentPart.GetXDocument().Root,
defaultParagraphStyleId);
Console.WriteLine(simplerXml);
}
}
}
}

Comments

  • Anonymous
    March 07, 2010
    Hi Eric, Firstly thanks for all of the work and examples that you have done so far. Its really helped! I have a question concerning the transformation and would appreciate it if you could help. I have tested the code and it works fine except that the "Numbering" for the headings are gone. I have checked the other posts that you have done using the ListItemRetriever but cannot get it to work (my lack of knowledge) in a similar way to the styleId part (they would both be under "if (element.Name == W.p)". You you please post an example using ListtItemRetriever in the cóntext of transforming to simpler XML (including the non-heading paragraphs). Regards Tubs