Transforming Open XML Documents using XSLT

Transforming Open XML documents using XSLT is an interesting scenario.  However, Open XML documents are stored using the Open Packaging Conventions (OPC): they are essentially ZIP files that contain XML and binary parts.  XSLT processors can’t open and transform such files.  But if we first convert the document to a different format, the Flat OPC format, which represents the whole package as a single XML file, we can then transform the document using XSLT.  Perhaps the most compelling reason to use XSLT on Open XML documents is document generation.  You can take a source ‘template’ Open XML document and a source XML data document, and produce a finished, formatted Open XML document with content derived from the source XML data document.

This post is one in a series of four posts that present this approach to transforming Open XML documents using XSLT.  The four posts are:

Transforming Open XML Documents using XSLT (This Post)

Presents an overview of the transformation process of Open XML documents using XSLT, and why this is important.  Also presents the ‘Hello World’ XSLT transform of an Open XML document.

Transforming Open XML Documents to Flat OPC Format

This post describes the process of conversion of an Open XML (OPC) document into a Flat OPC document, and presents the C# function, OpcToFlat.

Transforming Flat OPC Format to Open XML Documents

This post describes the process of conversion of a Flat OPC file back to an Open XML document, and presents the C# function, FlatToOpc.

The Flat OPC Format

Presents a description and examples of the Flat OPC format.

This approach is particularly important in SharePoint – it allows us to write and install a SharePoint feature that can transform Open XML documents in a general way using XSL style sheets stored in document libraries.  XSLT developers can then create a variety of XSL transforms of Open XML documents without writing and installing server-side code for each type of transform.  I’ll be writing about this powerful technique in the near future.

As you can see in the code in the linked posts, the conversion to and from the Flat OPC format is simple – less than 100 lines of code for each type of conversion.
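
To give a feel for what the conversion to Flat OPC involves, here is a rough sketch of it.  This is not the OpcToFlat function from the linked post; the class and method names (FlatOpcSketch, PartToElement, OpcToFlatSketch) are made up for illustration.  It assumes that every part whose content type ends in ‘xml’ can be embedded as pkg:xmlData and that everything else is stored as Base64 text in pkg:binaryData, and it hard-codes the Word progid, so treat it as a starting point rather than a complete implementation.

```csharp
using System;
using System.IO;
using System.IO.Packaging;   // WindowsBase.dll on .NET Framework
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch, not the OpcToFlat function from the linked post.
static class FlatOpcSketch
{
    static readonly XNamespace Pkg =
        "http://schemas.microsoft.com/office/2006/xmlPackage";

    // Wraps a single package part in a pkg:part element.
    static XElement PartToElement(PackagePart part)
    {
        var element = new XElement(Pkg + "part",
            new XAttribute(Pkg + "name", part.Uri.ToString()),
            new XAttribute(Pkg + "contentType", part.ContentType));

        using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
        {
            if (part.ContentType.EndsWith("xml", StringComparison.OrdinalIgnoreCase))
            {
                // Assumption: XML parts are embedded directly as child XML.
                element.Add(new XElement(Pkg + "xmlData",
                    XDocument.Load(stream).Root));
            }
            else
            {
                // Assumption: all other parts (images, etc.) become Base64 text.
                using (var buffer = new MemoryStream())
                {
                    stream.CopyTo(buffer);
                    element.Add(new XElement(Pkg + "binaryData",
                        Convert.ToBase64String(buffer.ToArray(),
                            Base64FormattingOptions.InsertLineBreaks)));
                }
            }
        }
        return element;
    }

    // Reads an OPC package and returns a single Flat OPC XML document.
    static XDocument OpcToFlatSketch(string path)
    {
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            return new XDocument(
                new XDeclaration("1.0", "UTF-8", "yes"),
                // Assumption: the source is a word-processing document.
                new XProcessingInstruction("mso-application",
                    "progid=\"Word.Document\""),
                new XElement(Pkg + "package",
                    new XAttribute(XNamespace.Xmlns + "pkg", Pkg.NamespaceName),
                    package.GetParts().Select(part => PartToElement(part))));
        }
    }

    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: FlatOpcSketch source.docx flat.xml");
            return;
        }
        OpcToFlatSketch(args[0]).Save(args[1]);
    }
}
```

The resulting file is ordinary XML, so an XSLT processor can read it.  The linked posts cover details that this sketch leaves out.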

The program OpcXsltTransform (attached) uses the code in the above posts, and the classes in System.Xml.Xsl to perform a transform using a supplied XSL style sheet.
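
The XSLT step in the middle of that pipeline can be sketched in a few lines with System.Xml.Xsl.  This is an illustration only, not the code from the attached project: the class name XsltStepSketch and its three positional arguments are made up, and it assumes that the input has already been converted to a Flat OPC file on disk.

```csharp
using System;
using System.Xml.Xsl;

// Hypothetical sketch of the XSLT step only; the conversion to and from
// Flat OPC is assumed to happen before and after this program runs.
static class XsltStepSketch
{
    static void Main(string[] args)
    {
        if (args.Length != 3)
        {
            Console.Error.WriteLine(
                "usage: XsltStepSketch flatIn.xml transform.xsl flatOut.xml");
            return;
        }

        // Compile the style sheet once, then apply it to the Flat OPC input.
        var transform = new XslCompiledTransform();
        transform.Load(args[1]);
        transform.Transform(args[0], args[2]);
    }
}
```

The output of this step is still Flat OPC XML; it has to be converted back to an OPC package before Word can open it as a DOCX.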

To run OpcXsltTransform, you supply as arguments the source Open XML document, the destination Open XML document, and the name of the XSL style sheet.  You can optionally supply a fourth argument, -OutputIntermediate.  If you supply this argument, OpcXsltTransform saves the Flat OPC file to disk after converting the source Open XML document, and saves the new Flat OPC file to disk after the XSL transform.  This can be helpful in debugging the XSL style sheet.  The name of the source intermediate file is the same as the source DOCX, but with a file extension of ‘.xml’.  The name of the destination intermediate file is the same as the destination DOCX, but with a file extension of ‘.xml’.  Here is the usage:

DocXslTransform -source source.docx -destination dest.docx -xsl transform.xsl [-outputIntermediate]

Here is an artificially simplistic XSL style sheet that works with the Flat OPC format.  It finds all paragraphs that have a text node that contains ‘Hello World’ and replaces those text nodes with a new one that contains ‘Goodbye World’.

<?xml version='1.0'?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    version='1.0'>
  <!-- Replace any text node that contains exactly 'Hello World' -->
  <xsl:template match="w:document/w:body/w:p/w:r/w:t[node()='Hello World']">
    <w:t>Goodbye World</w:t>
  </xsl:template>
  <!-- The following transform is the identity transform -->
  <xsl:template match="/|@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

This style sheet, as well as the DOCX that it transforms, is included in the bin/debug directory in the attached ZIP file.  You can build the project and run it to see the transform take place.

OpcXsltTransform.zip

Comments

  • Anonymous
    October 08, 2008
    how about transform to html files?

  • Anonymous
    October 08, 2008
    Sure, no prob, modify the program that starts the XSLT transform so that it doesn't convert the XML that results from the transform back to a DOCX.  Then, write the XSL to transform from the Flat OPC to XHTML (or HTML).  The file that results from the XSLT transform can be whatever you want it to be. -Eric

  • Anonymous
    December 15, 2008
    i'd very much like to use something like the FlatToOpc in my Visual 2005 web project.  How?

  • Anonymous
    December 15, 2008
    Hi Sylvain, System.IO.Packaging is available for .NET 3.0, and should work with C# 2.0.  You would need to rewrite OpcToFlatOpc and FlatToOpc to work with C# 2.0.  Given that the code is less than 100 lines long, shouldn't be too difficult.  Does this answer your question? -Eric

  • Anonymous
    August 10, 2009
    This is exactly what FlexDoc does, except that it doesn't use the flatOpc-format: it works directly on the xml of the header-, footer- and maindocumentpart. Check it out here: http://flexdoc.codeplex.com.

  • Anonymous
    June 22, 2010
    Is there similar one for excel format conversion ?

  • Anonymous
    June 22, 2010
    Hi SR,  I've not yet written an XSLT conversion for excel format conversion. -Eric

  • Anonymous
    July 10, 2012
    Hi Eric, I am very new to this technique. I have a word file, when I get the XML of that file the paragraph split into multiple RunItem. If I select first occurrence of the node in XSL I am getting only the first portion of the paragraph. Could you help me how to handle this using OpcXsltTransform.

  • Anonymous
    July 17, 2012
    Hi Erich, Is this support XSLT 2.0?

  • Anonymous
    July 18, 2012
    Hi Dhanasekaran, It certainly can work.  Microsoft does not have an XSLT 2.0 processor - however, you can use any of the other XSLT processors that you can use with .NET.  This series of blog posts is really about transforming Open XML to XML that you can then process using any number of approaches to transform XML. -Eric