

OpenXmlDiff.Exe: A Utility to Find the Differences Between Two Open XML Documents

This blog post introduces a small command-line utility (OpenXmlDiff.Exe, code attached to this page) that compares two Open XML documents and produces a textual report of the differences in markup between them.  This utility was born out of sheer frustration.  I've needed it for months.  I've always accomplished my goals without it, but it would have saved time on several occasions.  I've heard rumors of such a tool existing elsewhere, but it has always eluded me.

This blog is inactive.
New blog: EricWhite.com/blog

(Oct 27, 2008 - The Open XML SDK development team has built an Open XML Diff program that's very nice - find out about it here.)

Using this utility, we can create two documents with a very small difference between them, and see the exact changes that the difference caused.  Sometimes this alone is enough to explain the markup.  But if further explanation is necessary, the diff makes it easy to find the relevant places in the Open XML specification.  For our purposes, the currently published Ecma 376 specification works just fine.

This utility is a small (350 line) program written in C# 3.0.  Any edition of Visual Studio 2008 will compile it.  OpenXmlDiff uses another program, XmlDiff.exe, and another DLL, XmlDiffPatch.dll.  These files are included in a download from the MSDN XML Developer's Center.

To build and run this utility:

  • Create a directory where we'll work (say, C:\OpenXmlDiff).
  • Download and install the "XML Diff and Patch Utility" from https://msdn.microsoft.com/en-us/xml/bb190622.aspx.  By default, the utility will install in C:\Program Files\XmlDiffPatch.
  • This utility also uses the Open XML SDK, so if you don't have it already, you will need to download and install it.
  • Copy XmlDiff.Exe and XmlDiffPatch.dll from C:\Program Files\XmlDiffPatch\bin to C:\OpenXmlDiff.
  • Download and unzip the OpenXmlDiff.zip file.  You can unzip it in C:\OpenXmlDiff if you like.
  • Open the solution in Visual Studio 2008 and build the solution.
  • Copy the executable of OpenXmlDiff.Exe that we just built to C:\OpenXmlDiff.  The executable will be located in the subdirectory .\bin\debug.

We now have assembled all three files that we'll need to execute the diff.  Another approach is to add C:\Program Files\XmlDiffPatch\bin to your path, and copy OpenXmlDiff.Exe to C:\Program Files\XmlDiffPatch\bin.  Or whatever.

Next, use Word 2007 to create a document with some content, and name it, say, "Doc1.docx".

Close Word, and copy Doc1.docx to Doc2.docx.

Edit Doc2.docx with Word, and make a small change.  For this blog post, I'll highlight the second word, and make it bold.  Save Doc2.docx, and close Word.

At the command prompt:

C:\OpenXmlDiff>OpenXmlDiff Doc1.docx Doc2.docx >Report.txt

Open Report.txt in an editor.

The report tells us that the documents have the same set of parts, and it lists the parts that are identical:

Comparison Report
Original File: Doc1.docx
Modified File: Doc2.docx

Documents have the same parts

Identical Parts
===============
/docProps/app.xml
/word/theme/theme1.xml
/word/endnotes.xml
/word/fontTable.xml
/word/footnotes.xml
/word/styles.xml
/word/webSettings.xml
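
The part-level portion of a report like this can be approximated in a few lines.  The following is a minimal Python sketch, not the C# code in OpenXmlDiff.zip.  It treats each .docx as a ZIP archive and compares the raw bytes of each entry, which is an assumption of this sketch: the utility itself relies on XmlDiff with the IgnoreWhitespace option, so a byte comparison may flag parts that differ only in whitespace.  The function name compare_packages is hypothetical.

```python
import zipfile


def compare_packages(original_path, modified_path):
    """Classify the entries of two Open XML packages (ZIP archives).

    Note: ZIP entry names have no leading slash (word/document.xml rather
    than /word/document.xml), and the listing also includes packaging
    entries such as [Content_Types].xml and the _rels folders.
    """
    with zipfile.ZipFile(original_path) as a, zipfile.ZipFile(modified_path) as b:
        names_a = set(a.namelist())
        names_b = set(b.namelist())
        common = sorted(names_a & names_b)
        # Byte-for-byte comparison: an assumption of this sketch, stricter
        # than an XML-aware comparison that ignores whitespace.
        identical = [n for n in common if a.read(n) == b.read(n)]
        changed = [n for n in common if n not in identical]
    return {
        "only_in_original": sorted(names_a - names_b),
        "only_in_modified": sorted(names_b - names_a),
        "identical": identical,
        "changed": changed,
    }
```

Calling compare_packages("Doc1.docx", "Doc2.docx") returns a dictionary of four lists.  When both "only_in" lists are empty, the documents have the same set of parts.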

The report then lists something called a diffgram for the parts that have changed.  More on what a diffgram is in a moment.  We can see that three parts were changed:

  • Uri: /word/document.xml
  • Uri: /word/settings.xml
  • Uri: /docProps/core.xml

When we look at the changes to settings.xml and core.xml, it is apparent that these changes are some housekeeping changes injected by Word, and not relevant.  However, the changes to /word/document.xml are what we're looking for.  We see something like the following.  I've formatted a bit to make it more readable on a blog: the original diffgram includes lots of namespace declarations, but I've deleted them from the following listings.

Part Comparison
===============
Uri: /word/document.xml
ContentType: application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
<xd:xmldiff version="1.0"
srcDocHash="15564052487353082289"
options="IgnoreWhitespace "
fragments="no"
xmlns:xd="http://schemas.microsoft.com/xmltools/2002/xmldiff">
<xd:node match="2">
<xd:node match="1">
<xd:node match="1">
<xd:add>
<w:r>
<w:t xml:space="preserve">On</w:t>
</w:r>
</xd:add>
<xd:node match="1">
<xd:add>
<w:rPr>
<w:b />
</w:rPr>
</xd:add>
<xd:node match="1">
<xd:change match="1">the</xd:change>
</xd:node>
</xd:node>
<xd:add>
<w:r>
<w:t xml:space="preserve">Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look.</w:t>
</w:r>
</xd:add>
</xd:node>
</xd:node>
</xd:node>
</xd:xmldiff>

The diffgram isn't really the differences between the two files.  It is a definition of a set of changes that, when applied to the first file, will result in the second file.  (XmlPatch.Exe is a utility that can apply a diffgram, but that isn't relevant here.)  Instead, we'll use the diffgram to figure out exactly what changed in our document as a result of bolding the second word in the first paragraph.
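
To get a quick overview of a diffgram before reading it line by line, it can help to list each operation together with the chain of match values that leads to it.  The following Python sketch does that under some assumptions: the sample is a cut-down version of the diffgram above with the w namespace declared (the listing in this post has the declarations removed, so it won't parse as-is), and the names SAMPLE_DIFFGRAM and list_operations are hypothetical.  It handles only nested xd:node elements and the operations directly inside them.

```python
import xml.etree.ElementTree as ET

XD = "{http://schemas.microsoft.com/xmltools/2002/xmldiff}"

# Cut-down sample based on the diffgram in this post (text shortened).
SAMPLE_DIFFGRAM = (
    '<xd:xmldiff version="1.0"'
    ' xmlns:xd="http://schemas.microsoft.com/xmltools/2002/xmldiff"'
    ' xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<xd:node match="2"><xd:node match="1"><xd:node match="1">'
    '<xd:add><w:r><w:t xml:space="preserve">On</w:t></w:r></xd:add>'
    '<xd:node match="1">'
    '<xd:add><w:rPr><w:b /></w:rPr></xd:add>'
    '<xd:node match="1"><xd:change match="1">the</xd:change></xd:node>'
    '</xd:node>'
    '<xd:add><w:r><w:t xml:space="preserve">Insert tab</w:t></w:r></xd:add>'
    '</xd:node></xd:node></xd:node>'
    '</xd:xmldiff>'
)


def list_operations(diffgram_xml):
    """Return (operation, match_path) pairs in document order."""
    operations = []

    def walk(element, path):
        for child in element:
            if child.tag == XD + "node":
                # xd:node only navigates; descend and extend the path.
                walk(child, path + [child.get("match")])
            elif child.tag.startswith(XD):
                # Any other xd: element (add, change, ...) is an operation.
                name = child.tag[len(XD):]
                match = child.get("match")
                full_path = path + ([match] if match else [])
                operations.append((name, "/".join(full_path)))

    walk(ET.fromstring(diffgram_xml), [])
    return operations
```

Each pair gives the operation name and a slash-separated path of match values, so list_operations(SAMPLE_DIFFGRAM) yields one entry for each xd:add and xd:change in the sample.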

The diffgram schema is documented in a CHM that is installed with the XML diff and patch utilities.

The <xd:node match="2"> element (the first child element of the root element of the diffgram) specifies that to find the relevant change, we need to find the second node at the root level of the main document part of Doc1.docx.  The match attribute is a 1-based index into the nodes at the relevant hierarchical level in the XML.  Using Visual Studio (and the Power Tools for Visual Studio, which includes the very cool Open XML editor), we can open the main document part (/word/document.xml) for Doc1.docx and see:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
<w:body>
<w:p w:rsidR="00807E1B"
w:rsidRDefault="00807E1B">
<w:r>
<w:t>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look.</w:t>
</w:r>
</w:p>

The first node is the XML declaration:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

The second node is the <w:document> node:

<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">

This is the node that the previously mentioned diffgram element is referring to.

The next element of the diffgram is <xd:node match="1">, which indicates that we should look at the first child of the previously selected element.  That is the <w:body> element, the first child of the <w:document> element.

And so on.
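
The navigation described above can be sketched in Python.  This is an illustration rather than anything taken from OpenXmlDiff: the names SAMPLE_PART, significant_children, and resolve_match_path are hypothetical, and the sample is a cut-down main document part.  Two assumptions are worth calling out.  Python's minidom does not expose the XML declaration as a node, so the sketch inserts a placeholder for it to keep the indexes aligned with the walk-through above, where the declaration counts as the first node.  It also skips whitespace-only text nodes, to mirror the IgnoreWhitespace option shown in the diffgram.

```python
from xml.dom import minidom

# Cut-down main document part (text shortened, most namespaces omitted).
SAMPLE_PART = (
    b'<?xml version="1.0" encoding="utf-8" standalone="yes"?>\n'
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/'
    b'wordprocessingml/2006/main">'
    b'<w:body><w:p><w:r><w:t>On the Insert tab</w:t></w:r></w:p></w:body>'
    b'</w:document>'
)


def significant_children(node):
    """Child nodes, skipping whitespace-only text nodes."""
    return [
        child for child in node.childNodes
        if not (child.nodeType == child.TEXT_NODE and not child.data.strip())
    ]


def resolve_match_path(xml_bytes, path):
    """Follow 1-based match indexes and return the name of each node visited."""
    document = minidom.parseString(xml_bytes)
    level = significant_children(document)
    if xml_bytes.lstrip().startswith(b"<?xml"):
        # None stands in for the XML declaration, which minidom omits.
        level = [None] + level
    names = []
    for index in path:
        node = level[index - 1]  # match values are 1-based
        if node is None:
            names.append("xml-declaration")
            level = []
        else:
            names.append(node.nodeName)
            level = significant_children(node)
    return names
```

For the sample, resolve_match_path(SAMPLE_PART, [2, 1, 1]) visits w:document, then w:body, then w:p, which is the same route the first three xd:node elements of the diffgram take.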

Then in the diffgram, we see the following.  (namespaces were deleted)

<xd:add>
<w:r>
<w:t xml:space="preserve">On</w:t>
</w:r>
</xd:add>

This is the markup that indicates the addition of a run and text node in the relevant spot.  Because I bolded the second word, this caused the markup to be split into three text runs, with formatting on the middle run.  The above indicates the addition of the first run.

And we see the new, formatted middle run.  This is where we see the addition of the <w:rPr> element and its child <w:b/> element, which indicate that this run is bolded:  (namespaces deleted)

<xd:node match="1">
<xd:add>
<w:rPr>
<w:b />
</w:rPr>
</xd:add>

And then we see the change of this text run to contain just the bolded word.

<xd:node match="1">
<xd:change match="1">the</xd:change>
</xd:node>
</xd:node>

And then we see the rest of the paragraph, not bolded:  (namespaces deleted)

<xd:add>
<w:r>
<w:t xml:space="preserve">Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look.</w:t>
</w:r>
</xd:add>
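
Putting the pieces together, the net effect of bolding one word is that a single run becomes three runs, with a <w:rPr> containing <w:b/> on the middle one.  The following Python sketch builds that shape directly.  It is only an illustration of the resulting markup, not how Word or OpenXmlDiff works internally: the function name bold_range is hypothetical, and where the spaces land between runs is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)  # serialize with the familiar w: prefix


def bold_range(text, start, end):
    """Return a <w:p> element whose runs split text so text[start:end] is bold."""
    paragraph = ET.Element("{%s}p" % W)
    pieces = [
        (text[:start], False),
        (text[start:end], True),
        (text[end:], False),
    ]
    for piece, bold in pieces:
        if not piece:
            continue  # no empty runs when the range touches either end
        run = ET.SubElement(paragraph, "{%s}r" % W)
        if bold:
            # Run properties come before the text in the run.
            properties = ET.SubElement(run, "{%s}rPr" % W)
            ET.SubElement(properties, "{%s}b" % W)
        t = ET.SubElement(run, "{%s}t" % W)
        t.text = piece
        if piece != piece.strip():
            # Leading or trailing whitespace must be marked for preservation.
            t.set(XML_SPACE, "preserve")
    return paragraph
```

For example, ET.tostring(bold_range("On the Insert tab", 3, 6), encoding="unicode") serializes a paragraph with three runs, where only the run containing "the" carries the <w:rPr> element.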

This is a very simple scenario, but I've used this in a more complicated scenario.  I had a situation where I knew that there was a change to markup *somewhere* else in the document that I needed to know about.  I didn't even know what part the change might be in.  This utility found what I needed to know, and pointed me to the right place in the spec.

Possible enhancements: it would be cool if this were a GUI, and when you clicked on a part that had differences, it opened a window that displayed the differences graphically in an HTML viewer.  Converting a diffgram to HTML has been done before, so maybe it wouldn't be too hard.  Any volunteers?

OpenXmlDiff.zip

Comments

  • Anonymous
    June 16, 2008
    I found the documentation for the diffgram. I've updated the original post with the location. Also, I

  • Anonymous
    June 18, 2008
    When learning about Open XML or developing Open XML solutions, it's very common to find yourself wondering

  • Anonymous
    June 18, 2008
    In my enthusiasm to move the extraneous namespaces from the middle of the diffgram, I introduced a bug

  • Anonymous
    June 21, 2008
    Eric, I'm still using VS 2005, and so I can't compile OpenXmlDiff. Any chance of a pre-built executable? thanks .. Jason

  • Anonymous
    June 25, 2008
    Hi Jason, I wish I could.  But since this is a blog that is hosted on a Microsoft server, to post a pre-built executable is a long, drawn out process.  This won't happen unless there is a compelling business reason.  One approach - VS 2008 will run side-by-side with 2005 - you could install the express edition to build.  Also would give you the chance to play with some cool technologies, such as LINQ :)  Or if there are any volunteers who want to build an executable and post a link as a comment here, I'd be appreciative. -Eric

  • Anonymous
    July 13, 2008
    This post presents some code to remove personal information from an Open XML word processing document.

  • Anonymous
    July 16, 2008
    (July 16, 2008: This approach has been replaced with a better version.) I had a thought that the instructions

  • Anonymous
    August 25, 2010
    Nice Comparison between open XML Documents. Thanks Litera Corp (http://www.litera.com)