Remove Rsid Attributes and Elements before Comparing Open XML Documents

A convenient way to explore Open XML markup is to create a small document, modify the document slightly in the Word user interface, save it, and then compare it with the Open XML Diff utility that comes with the Open XML SDK V2.  However, Word adds extraneous elements and attributes that enable merging of two documents that have forked.  These elements and attributes show up as changed, and obscure the differences that we’re looking for.  An easy way to deal with this is to remove these elements and attributes before comparing documents.  We can safely do so without changing the content of the document.  This post presents a bit of code to do this.

For more information on rsid elements and attributes, see Brian Jones’s blog post on them.

This post also contains two of my most commonly used little extension methods – to get an XDocument from an Open XML part, and to save that XDocument back into the word processing document.  The XDocument is stored as an annotation on the Open XML part.

This little program takes any number of files as arguments, and strips these extraneous elements and attributes from each of the files.  Its use:

C:\> RemoveRsid Test1.docx Test2.docx

Here is the listing of this program (code is attached to this post, as well):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
    // Returns the part's XML as an XDocument. The document is loaded on
    // first use and cached as an annotation on the part.
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
            return xdoc;
        using (StreamReader streamReader = new StreamReader(part.GetStream()))
            xdoc = XDocument.Load(XmlReader.Create(streamReader));
        part.AddAnnotation(xdoc);
        return xdoc;
    }

    // Writes the cached XDocument, if there is one, back into the part.
    public static void SaveXDocument(this OpenXmlPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
        {
            using (XmlWriter xw =
                XmlWriter.Create(part.GetStream(FileMode.Create, FileAccess.Write)))
                xdoc.WriteTo(xw);
        }
    }
}

class Program
{
    // Get rid of every rsid attribute/element in the doc.
    // They exist to enable merging of forked documents; not something
    // we're interested in here. If we don't delete these nodes, they
    // show up as changed.
    private static void CleanUp(XDocument doc)
    {
        // The namespace name uses http, not https. With the wrong
        // scheme, none of the queries below match anything.
        XNamespace w =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        doc.Descendants().Attributes(w + "rsidTr").Remove();
        doc.Descendants().Attributes(w + "rsidSect").Remove();
        doc.Descendants().Attributes(w + "rsidRDefault").Remove();
        doc.Descendants().Attributes(w + "rsidR").Remove();
        doc.Descendants().Attributes(w + "rsidDel").Remove();
        doc.Descendants().Attributes(w + "rsidP").Remove();
        doc.Descendants(w + "rsid").Remove();
    }

    static void Main(string[] args)
    {
        foreach (var file in args)
        {
            // Open each document for editing (second argument is true).
            using (WordprocessingDocument doc =
                WordprocessingDocument.Open(file, true))
            {
                XDocument xDoc = doc.MainDocumentPart.GetXDocument();
                CleanUp(xDoc);
                doc.MainDocumentPart.SaveXDocument();

                foreach (var h in doc.MainDocumentPart.HeaderParts)
                {
                    xDoc = h.GetXDocument();
                    CleanUp(xDoc);
                    h.SaveXDocument();
                }

                foreach (var f in doc.MainDocumentPart.FooterParts)
                {
                    xDoc = f.GetXDocument();
                    CleanUp(xDoc);
                    f.SaveXDocument();
                }
            }
        }
    }
}
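
The listing removes a fixed list of attribute names, so it leaves behind any rsid attribute that is not on the list (w:rsidRPr, for example). One possible variation, shown here only as a sketch, matches every attribute in the WordprocessingML namespace whose local name starts with "rsid". The names RsidStripper, StripRsids, and EqualIgnoringRsids are made up for this illustration and are not part of the attached program or of the Open XML SDK. The sketch needs only LINQ to XML, so it can be tried on in-memory markup:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of RemoveRsid.cs or the Open XML SDK.
public static class RsidStripper
{
    // WordprocessingML main namespace (http, not https).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Removes every attribute in the w namespace whose local name starts
    // with "rsid", and every w:rsid element. Modifies doc in place.
    public static void StripRsids(XDocument doc)
    {
        doc.Descendants()
           .Attributes()
           .Where(a => a.Name.Namespace == W &&
                       a.Name.LocalName.StartsWith("rsid", StringComparison.Ordinal))
           .Remove();
        doc.Descendants(W + "rsid").Remove();
    }

    // Compares two documents while ignoring rsid markup. Works on copies,
    // so the caller's documents are left untouched.
    public static bool EqualIgnoringRsids(XDocument a, XDocument b)
    {
        XDocument copyA = new XDocument(a);
        XDocument copyB = new XDocument(b);
        StripRsids(copyA);
        StripRsids(copyB);
        return XNode.DeepEquals(copyA, copyB);
    }
}
```

To use it in the program above, the body of CleanUp could call RsidStripper.StripRsids(doc) instead of listing each attribute. The trade-off is that a prefix match is broader than an explicit list, so it assumes that no attribute you want to keep has a name beginning with "rsid".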

RemoveRsid.cs

Comments

  • Anonymous
    November 17, 2008
    Zeyad Rajabi has started a series of very useful hands-on posts over on Brian Jones's blog about working

  • Anonymous
    November 21, 2008
    As usual, here is a batch of this week's links on Open XML. Technical posts in no particular order

  • Anonymous
    February 13, 2014
    I have images and text in my rich text control, but the above code for removing rsid attributes does not work for my scenario. When the ContentControl.Enter event fires in my VSTO Word add-in, I cache the control's WordML. When focus leaves the control and the ContentControl.Exit event fires, I get the control's WordML again and compare the two after removing the rsids. The comparison returns false.

  • Anonymous
    March 07, 2014
    @Kashif, as I understand it, this is not the scenario that RSID values support. Mainly they are there for the situation where: 1) you send a single document to multiple people, without revision tracking turned on, 2) each user independently modifies the document, and 3) you use Word to create a new document that contains revision tracking as appropriate for each user. I don't know the semantics that Word follows while editing the document, particularly if you are retrieving the markup and examining the RSID values. -Eric