Comparing Two Open XML Documents using the Zip Extension Method

Sometimes we want to compare two word processing documents to see if they contain the same content.  I’m working on a blog post to merge comments from multiple Open XML documents into a single document.  This is based on a feature in Word 2007 that allows you to lock a document and prevent changes to content, yet allows users to add comments to the document.  However, we don’t want to attempt to merge comments if the documents don’t contain the same content.

(Note: this is a post on comparing two Open XML word processing documents.  For a post on comparing two XML documents, see Equality Semantics of LINQ to XML Trees.)

One Approach to Comparing Two Documents for Equivalency

If two documents contain exactly the same content, they will have the same number of paragraphs, tables, content controls, and more, and these elements will occur in the same order, and have the same content.  However, two paragraphs may contain the same content yet their XML representation may be very different if one has a comment and the other does not – the paragraph with the comment may have its runs split differently.  I’ve written a previous post that examines run splitting in detail, and contains a method to report where the run splits are, and a method that splits runs based on a list of split locations.

The following markup shows a very simple paragraph.  We can see the paragraph element, the run element, and the text element.

<w:p>
    <w:r>
        <w:t>abcdefghi</w:t>
    </w:r>
</w:p>

If we select “def” in the above text, and add a comment, the markup changes to look like this:

<w:p>
    <w:r>
        <w:t>abc</w:t>
    </w:r>
    <w:commentRangeStart w:id="0"/>
    <w:r>
        <w:t>def</w:t>
    </w:r>
    <w:commentRangeEnd w:id="0"/>
    <w:r>
        <w:rPr>
            <w:rStyle w:val="CommentReference"/>
        </w:rPr>
        <w:commentReference w:id="0"/>
    </w:r>
    <w:r>
        <w:t>ghi</w:t>
    </w:r>
</w:p>

We can write a query that returns a collection of a very specific subset of the elements in the XML document.  This is the subset of elements that won’t change if the contents of the document don’t change.  This query consists of all elements in the document except:

  • w:commentRangeStart and w:commentRangeEnd – these elements will be added when the user adds comments to a document.  Most commonly, these elements occur under the paragraph element (to be trimmed, see below), but it’s valid for these elements to be children of the body element, so we should trim them explicitly.
  • w:proofErr – this element is added automatically by Word when there are spelling or grammar errors, and has no effect on content.  Word can (and will) add this element even though the document is locked for editing with the exception of being able to add comments.  Therefore, we want to trim this element from the collection returned by a query that we’re going to use to determine document equivalency.
  • Finally, we want to eliminate all of the descendants of paragraphs from the query, as these elements can change quite a bit even if the contents of the document don’t change.  Instead, we want to write a bit of code to determine whether two paragraphs are equivalent.

Here is the query that returns a collection of the elements that we’re interested in:

XDocument xDoc1 = doc1.MainDocumentPart.GetXDocument();

var doc1Elements = xDoc1
    .Descendants()
    .Where(e => e.Name != W.commentRangeStart &&
        e.Name != W.commentRangeEnd &&
        e.Name != W.proofErr &&
        !e.Ancestors(W.p).Any());
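
To see what this query keeps, here is a minimal, self-contained sketch that runs the same filter over a small hand-written document.  The class name QueryDemo, the method KeptElementNames, and the sample markup are assumptions made for illustration; they are not part of the code in this post, and the sketch uses only LINQ to XML, not the Open XML SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class QueryDemo
{
    // WordprocessingML main namespace.
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: body-level comment markers, a paragraph,
    // a proofing marker, and a table with one cell.
    public const string SampleXml =
        @"<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
            <w:body>
              <w:commentRangeStart w:id='0'/>
              <w:p>
                <w:r><w:t>abc</w:t></w:r>
                <w:r><w:t>def</w:t></w:r>
              </w:p>
              <w:commentRangeEnd w:id='0'/>
              <w:proofErr w:type='spellStart'/>
              <w:tbl>
                <w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr>
              </w:tbl>
            </w:body>
          </w:document>";

    // Applies the same filter as the query above and returns the local
    // names of the elements that survive, in document order.
    public static List<string> KeptElementNames(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        XName p = w + "p";
        return doc
            .Descendants()
            .Where(e => e.Name != w + "commentRangeStart" &&
                e.Name != w + "commentRangeEnd" &&
                e.Name != w + "proofErr" &&
                !e.Ancestors(p).Any())
            .Select(e => e.Name.LocalName)
            .ToList();
    }
}
```

Calling QueryDemo.KeptElementNames(QueryDemo.SampleXml) on this sample yields document, body, p, tbl, tr, tc, p.  The runs and text elements inside the paragraphs are trimmed, as are the comment range and proofing markers, but the paragraph inside the table cell is kept because none of its ancestors is a paragraph.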

We can query two word processing documents, and if the elements in the returned collection are not in the exact same order, then the documents are different.  And if corresponding paragraphs contain the same content, per whatever algorithm that we define, then we can say that the documents contain the same content.  In the example that I present in this post, I validate paragraph equivalency by checking actual textual content, disregarding formatting changes for runs within the paragraph.  In my case, this is good enough, as the transformation that I wrote (and will present in an upcoming post) that moves comments from one document to another will work properly if the paragraphs have the same text.

The above query works just fine for documents that contain tables and content controls.  The markup for tables, content controls, and the paragraphs will be in the same order and have the same content if the documents are equivalent.

We could change this query easily enough to define document equality in just about any way we want.  If we want to disregard bookmarks, it’s easy enough to remove them from the results of the query.
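
As a rough sketch of that idea, the filter can be driven by a set of element names to ignore, with w:bookmarkStart and w:bookmarkEnd added to the set.  The class BookmarkTrim and the method ComparableElements are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BookmarkTrim
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Elements that should not influence the comparison.  The first three
    // come from the query in this post; the bookmark markers are the addition.
    static readonly HashSet<XName> Ignored = new HashSet<XName>
    {
        w + "commentRangeStart",
        w + "commentRangeEnd",
        w + "proofErr",
        w + "bookmarkStart",
        w + "bookmarkEnd"
    };

    // Returns every element that is not ignored and not inside a paragraph.
    public static IEnumerable<XElement> ComparableElements(XDocument doc)
    {
        XName p = w + "p";
        return doc
            .Descendants()
            .Where(e => !Ignored.Contains(e.Name) && !e.Ancestors(p).Any());
    }
}
```

Keeping the ignored names in one set means that changing the definition of document equality is a one-line change, at the cost of the query no longer spelling out each exclusion inline.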

The above query is not the most efficient way to do this – more efficient would be to write an iterator that goes through the Descendants axis and trims appropriately.  But queries show intent in a better way, and in my informal testing, the above query performs well enough as is for many scenarios.
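
For comparison, here is one way such an iterator might look.  This is a sketch, not code I have measured: the names TrimmedAxis and DescendantsTrimmed are made up for this example.  It walks the tree recursively and never descends into a paragraph, so it avoids calling Ancestors for every element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TrimmedAxis
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XName P = w + "p";

    // Same three exclusions as the query in this post.
    static readonly HashSet<XName> Ignored = new HashSet<XName>
    {
        w + "commentRangeStart",
        w + "commentRangeEnd",
        w + "proofErr"
    };

    // Yields, in document order, the elements the query would return:
    // everything except ignored elements and descendants of paragraphs.
    public static IEnumerable<XElement> DescendantsTrimmed(XContainer root)
    {
        foreach (XElement child in root.Elements())
        {
            if (!Ignored.Contains(child.Name))
                yield return child;

            // A paragraph is yielded but its content is not visited.
            if (child.Name != P)
                foreach (XElement descendant in DescendantsTrimmed(child))
                    yield return descendant;
        }
    }
}
```

Passing the XDocument itself as the root gives the same sequence as the query, including the root element.  Nested iterators carry their own overhead on deep trees, so whether this is actually faster for a given document is something to measure rather than assume.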

Now that we’ve defined the query that will return the elements that won’t change if the document content doesn’t change, we can define another query that determines if two queries, evaluated on two Open XML documents, contain the same items, and in the same order.  This is a job for the Zip extension method, which arrives with .NET Framework 4.0.

The Zip Extension Method

The Zip extension method processes two sequences, matching up each item in one sequence with a corresponding item in another sequence.  While this method won’t be part of the framework until .NET Framework 4.0, a simple implementation that we can use with C# 3.0 is trivial:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
    this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> func)
{
    // dispose both enumerators when iteration ends or is abandoned
    using (var ie1 = first.GetEnumerator())
    using (var ie2 = second.GetEnumerator())
    {
        // stops as soon as either sequence runs out of items
        while (ie1.MoveNext() && ie2.MoveNext())
            yield return func(ie1.Current, ie2.Current);
    }
}

Note:  Bart De Smet has a great explanation of the Zip extension method, as well as this example of the implementation of it.  That post also has a good explanation of how iterators work, using IL to explain them.  With regard to iterators, it’s also useful to read section 8.14 in the C# 3.0 specification.

Using the Zip Extension Method

If we have one sequence that contains names, and another sequence that contains ages, and we know that the two sequences contain corresponding elements, we can project a new collection of anonymous objects:

string[] names = new[] { "Jim", "Bob", "Susan" };
int[] ages = new[] { 50, 35, 41 };
var q = names.Zip(ages, (name, age) => new
{
    Name = name,
    Age = age
});
foreach (var item in q)
    Console.WriteLine(item);

When you run this example, you see:

{ Name = Jim, Age = 50 }
{ Name = Bob, Age = 35 }
{ Name = Susan, Age = 41 }

Notice that for the projection, we write a lambda expression that takes two arguments – each pair of corresponding items from the two source collections is passed as arguments to the lambda expression.

The following query uses the Zip extension method to project a collection of Booleans indicating if the element or paragraph is equivalent:

IEnumerable<bool> correspondingElementEquivalency = doc1Elements.Zip(doc2Elements, (e1, e2) =>
{
    if (e1.Name != e2.Name)
        return false;
    // determine if two paragraphs contain the same content
    if (e1.Name == W.p && (GetParagraphText(e1) != GetParagraphText(e2)))
        return false;
    return true;
});

GetParagraphText is defined as:

// return the text of a paragraph with revisions accepted
public static string GetParagraphText(XElement p)
{
    return p.Descendants(W.r)
        .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom)
        .Descendants(W.t)
        .Select(t => (string)t)
        .StringConcatenate();
}
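
To tie this back to the two paragraphs shown near the top of this post, the following sketch applies the same logic to both the plain paragraph and the commented one.  It is self-contained for illustration: the class ParagraphTextDemo and the two string constants are assumptions, and string.Concat (available from .NET Framework 4.0) stands in for the StringConcatenate extension method.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ParagraphTextDemo
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    const string Ns =
        "xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'";

    // The simple paragraph: a single run.
    public const string Plain =
        "<w:p " + Ns + "><w:r><w:t>abcdefghi</w:t></w:r></w:p>";

    // The same paragraph after a comment is added to "def".
    public const string Commented =
        "<w:p " + Ns + ">" +
        "<w:r><w:t>abc</w:t></w:r>" +
        "<w:commentRangeStart w:id='0'/>" +
        "<w:r><w:t>def</w:t></w:r>" +
        "<w:commentRangeEnd w:id='0'/>" +
        "<w:r><w:rPr><w:rStyle w:val='CommentReference'/></w:rPr>" +
        "<w:commentReference w:id='0'/></w:r>" +
        "<w:r><w:t>ghi</w:t></w:r>" +
        "</w:p>";

    // Same logic as GetParagraphText: concatenate the text of every run
    // that is not inside a deletion or a move source.
    public static string GetParagraphText(XElement p)
    {
        return string.Concat(
            p.Descendants(w + "r")
                .Where(r => r.Parent.Name != w + "del" &&
                    r.Parent.Name != w + "moveFrom")
                .Descendants(w + "t")
                .Select(t => (string)t));
    }
}
```

Both GetParagraphText(XElement.Parse(ParagraphTextDemo.Plain)) and GetParagraphText(XElement.Parse(ParagraphTextDemo.Commented)) return abcdefghi, even though the run structure of the two paragraphs is quite different.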

So then, we can use the Any extension method to determine if the documents are equivalent:

return ! correspondingElementEquivalency.Any(e => e != true);

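This will be pretty efficient, as it uses lazy evaluation, and the Any extension method will terminate processing as soon as the code determines that the documents are different.  Note that Zip stops at the end of the shorter sequence, so this check alone will not notice extra elements at the end of one of the documents.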
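
One caveat worth keeping in mind is that Zip pairs items only until the shorter sequence ends, so trailing elements in the longer document never reach the lambda.  If that matters for your scenario, a length-aware comparison is one option.  The following is a sketch under that assumption; StrictCompare and AllPairsMatch are hypothetical names, not part of the code in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StrictCompare
{
    // Returns true only if both sequences have the same number of items
    // and every corresponding pair satisfies the predicate.  Like Any,
    // it stops at the first mismatch.
    public static bool AllPairsMatch<T>(
        IEnumerable<T> first,
        IEnumerable<T> second,
        Func<T, T, bool> equivalent)
    {
        using (var e1 = first.GetEnumerator())
        using (var e2 = second.GetEnumerator())
        {
            while (true)
            {
                bool has1 = e1.MoveNext();
                bool has2 = e2.MoveNext();
                if (has1 != has2)
                    return false;   // one sequence ended before the other
                if (!has1)
                    return true;    // both ended together, no mismatch seen
                if (!equivalent(e1.Current, e2.Current))
                    return false;
            }
        }
    }
}
```

With this helper, the same lambda that is passed to Zip could be passed as the equivalent argument, and comparing { 1, 2 } with { 1, 2, 3 } reports a difference, where the Zip-and-Any form would report a match.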

One Final Note

This code doesn’t process math markup – if the two documents contain a math formula, and one of the documents is commented, then this query will report that the documents differ.  The structure and approach to take are exactly parallel to the approach that I take with comments in regular paragraphs.  Extending this to math markup is another post.

The Code

Following is an example that compares two documents to determine if they have the same content.  The code is quite short.  Note this uses the Open XML SDK.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class Extensions
{
    // load the part's XML once and cache it as an annotation on the part
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
            return xdoc;
        using (StreamReader streamReader = new StreamReader(part.GetStream()))
        using (XmlReader xmlReader = XmlReader.Create(streamReader))
            xdoc = XDocument.Load(xmlReader);
        part.AddAnnotation(xdoc);
        return xdoc;
    }

    public static string StringConcatenate(this IEnumerable<string> source)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string s in source)
            sb.Append(s);
        return sb.ToString();
    }

    public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
        this IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> func)
    {
        // dispose both enumerators when iteration ends or is abandoned
        using (var ie1 = first.GetEnumerator())
        using (var ie2 = second.GetEnumerator())
        {
            // stops as soon as either sequence runs out of items
            while (ie1.MoveNext() && ie2.MoveNext())
                yield return func(ie1.Current, ie2.Current);
        }
    }
}

public static class W
{
    // the namespace name must match exactly; it uses http, not https
    public static XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static XName p = w + "p";
    public static XName r = w + "r";
    public static XName t = w + "t";
    public static XName commentRangeStart = w + "commentRangeStart";
    public static XName commentRangeEnd = w + "commentRangeEnd";
    public static XName proofErr = w + "proofErr";
    public static XName del = w + "del";
    public static XName moveFrom = w + "moveFrom";
}

class Program
{
    // return the text of a paragraph with revisions accepted
    public static string GetParagraphText(XElement p)
    {
        return p.Descendants(W.r)
            .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom)
            .Descendants(W.t)
            .Select(t => (string)t)
            .StringConcatenate();
    }

    // returns true if the documents contain the same content, otherwise false
    private static bool CompareDocuments(WordprocessingDocument doc1,
        WordprocessingDocument doc2)
    {
        XDocument xDoc1 = doc1.MainDocumentPart.GetXDocument();
        XDocument xDoc2 = doc2.MainDocumentPart.GetXDocument();

        var doc1Elements = xDoc1
            .Descendants()
            .Where(e => e.Name != W.commentRangeStart &&
                e.Name != W.commentRangeEnd &&
                e.Name != W.proofErr &&
                !e.Ancestors(W.p).Any());

        var doc2Elements = xDoc2
            .Descendants()
            .Where(e => e.Name != W.commentRangeStart &&
                e.Name != W.commentRangeEnd &&
                e.Name != W.proofErr &&
                !e.Ancestors(W.p).Any());

        IEnumerable<bool> correspondingElementEquivalency = doc1Elements
            .Zip(doc2Elements, (e1, e2) =>
            {
                if (e1.Name != e2.Name)
                    return false;
                // determine if two paragraphs contain the same content
                if (e1.Name == W.p && (GetParagraphText(e1) != GetParagraphText(e2)))
                    return false;
                return true;
            });

        return !correspondingElementEquivalency.Any(e => e != true);
    }

    static void Main(string[] args)
    {
        using (WordprocessingDocument doc1 = WordprocessingDocument.Open("Test3a.docx", false))
        using (WordprocessingDocument doc2 = WordprocessingDocument.Open("Test3b.docx", false))
        {
            bool same = CompareDocuments(doc1, doc2);

            Console.WriteLine(same);
        }
    }
}

Comments

  • Anonymous
    February 16, 2010
    Good one. In the similar way is it possible to compare word by word or letter by letter with formatting?