The Use of Extension Methods to Manage Open XML Document Changes in PowerTools for Open XML

PowerTools for Open XML uses an interesting approach that makes it easy to write cmdlets that modify Open XML documents.  The approach isn’t very complicated, but parts of it need some explanation so that developers who are extending the PowerTools can understand what’s going on.  It builds on the techniques detailed in Technical Improvements in the Open XML SDK and Using LINQ to XML Events and Annotations to Track if an XML Tree has Changed.  This post explains in detail the approach that we took in PowerTools.

Note from Eric White:  This is another guest post by Bob McClellan.  I’ve met and worked with an awful lot of developers in my career, and Bob is one of the best developers that I’ve ever worked with.  He’s an expert in more areas than it is possible to list here, but a few relevant areas are C++, Windows development, C#, .NET, LINQ, Open XML, WPF, and SQL.  His contributions to the Open XML conversation are well documented: he wrote this C code implementing the legacy hashing algorithm in WordprocessingML and the KParts proof-of-concept of an implementation of embedding linked objects from an Open XML document.  He also wrote the code that enables document composability in v1.1 of PowerTools for Open XML.  Bob, as you can see from the various guest posts on my blog, is also a good writer.  Find out more about Bob here.

The Open XML SDK makes it very easy to create an XDocument object from a part by using a Stream.  It does not, however, have any way to keep track of changes to that object.  The PowerTools for Open XML uses a couple of simple extensions to the Open XML SDK classes to manage XDocument objects and changes to those documents.

Design Requirements

We wanted to meet certain requirements with these extensions.  First, we didn’t want to add another class or new member variables and so on.  We wanted the addition to be lightweight.  Second, a part should only need to be read once for as long as the package is open.  Third, changes should not be written out until we are ready to close the package (or when a logical group of changes is complete).

Synchronization Issues

In theory, it is a bad idea to represent the same data multiple times in a program.  If the multiple values that should be the same become different for some reason, then the program will most certainly not work correctly.  The best way to avoid that kind of problem is to store information just once and then there is nothing to synchronize.  However, we often break this rule for various reasons and performance is one of them.  When we are dealing with an Open XML package, we don’t want to have to keep reading and writing parts during a series of changes to that document.  In particular, if the part is very large, many reads and writes could be time-consuming.  As you will see below, the compromise is to keep the in-memory version of the document in an XDocument that is tightly coupled with the part so that it is unlikely that we will lose synchronization between the two.

Extension Methods

Extension methods are methods that are defined to appear as if they are member methods of an existing class.  The limitation is that extension methods cannot add any member variables to the class, but there are ways around that limitation, as you will see below.  The two extension methods that we will be examining are:

public static XDocument GetXDocument(this OpenXmlPart part)
public static void FlushParts(this OpenXmlPackage doc)

The first method gets an XDocument for a particular part.  The keyword “this” on the first parameter signifies that it is an extension method for the OpenXmlPart class.  The method first checks whether an XDocument has already been created for that part.  If so, it returns that XDocument; otherwise, it creates one.

The second method is used to write out any changes to the XDocuments that have been created.  It is an extension to the OpenXmlPackage class because changes to all parts will be written out when this method is called.

Caching XDocuments

As mentioned above, we want to make sure that an XDocument is only read once no matter how many times we might examine it or modify it while the package is open.  An easy way to do this without creating new objects is by using Annotations on the part in the Open XML SDK.  Annotations are simply objects that can be attached to other objects.  We can avoid reading XDocuments more than once by attaching the XDocument to the part as an annotation.  Here is the GetXDocument extension method:

public static XDocument GetXDocument(this OpenXmlPart part)
{
    XDocument xdoc = part.Annotation<XDocument>();
    if (xdoc != null)
        return xdoc;
    try
    {
        using (StreamReader sr = new StreamReader(part.GetStream()))
        using (XmlReader xr = XmlReader.Create(sr))
        {
            xdoc = XDocument.Load(xr);
            xdoc.Changed += ElementChanged;
            xdoc.Changing += ElementChanged;
        }
    }
    catch (XmlException)
    {
        xdoc = new XDocument();
        xdoc.AddAnnotation(new ChangedSemaphore());
    }
    part.AddAnnotation(xdoc);
    return xdoc;
}

The parts that deal with ElementChanged and the ChangedSemaphore will be explained in the next section.  The first line calls a method that tries to retrieve an annotation that is an XDocument object.  If there is no XDocument object for that part, the method returns null.  The next line checks to see if the XDocument was returned.  If so, it was read in a previous call and there’s nothing else we need to do except return that XDocument.

If there is no annotation for the XDocument, then we need to read it in.  The process for reading an XDocument from a part is next.  A StreamReader wraps the stream for the part, an XmlReader is created on top of it, and the static XDocument.Load method reads the XDocument from that XmlReader.  If that process fails with an XmlException, then we can assume that the part doesn’t have any content yet, so we create an empty XDocument.  In either case, once we have the new XDocument object, we then add an annotation with that object so that this part will not be read again.
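To see the caching idea in isolation, here is a minimal sketch that runs without the Open XML SDK.  StubPart is a hypothetical stand-in for OpenXmlPart that I made up for illustration; it only mimics the AddAnnotation and Annotation<T> calls used above, and it counts how often its content is parsed.  The real extension method reads from the part’s stream instead of a string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in for OpenXmlPart (NOT an SDK class): raw XML plus an annotation bag.
class StubPart
{
    private readonly List<object> annotations = new List<object>();
    public string Xml;
    public int ReadCount;   // how many times the XML was actually parsed

    public StubPart(string xml) { Xml = xml; }
    public void AddAnnotation(object annotation) { annotations.Add(annotation); }
    public T Annotation<T>() where T : class { return annotations.OfType<T>().FirstOrDefault(); }
}

static class StubPartExtensions
{
    // Same shape as GetXDocument above: return the cached XDocument if one is attached,
    // otherwise parse the content once and attach the result as an annotation.
    public static XDocument GetXDocument(this StubPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
            return xdoc;
        part.ReadCount++;
        xdoc = XDocument.Parse(part.Xml);
        part.AddAnnotation(xdoc);
        return xdoc;
    }
}

static class CachingDemo
{
    public static void Main()
    {
        StubPart part = new StubPart("<root><child/></root>");
        XDocument first = part.GetXDocument();
        XDocument second = part.GetXDocument();
        Console.WriteLine(ReferenceEquals(first, second)); // prints True
        Console.WriteLine(part.ReadCount);                 // prints 1
    }
}
```

Both calls hand back the same XDocument instance, which is exactly why changes made through one reference are visible to every other caller.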

Tracking Changes

There were a few lines of code in the GetXDocument method that are used to track changes to that XDocument.  The basic approach to tracking changes is to use the ChangedSemaphore object to “tag” the XDocuments that have changed.

private class ChangedSemaphore { }

The class has no content; it is just used to identify which XDocuments have changed.  The other two lines of code from that method attach event handlers that will be called when the XDocument is changed by any other method calls.  Here is the code for the event handler:

private static EventHandler<XObjectChangeEventArgs> ElementChanged =
    new EventHandler<XObjectChangeEventArgs>(ElementChangedHandler);

private static void ElementChangedHandler(object sender,
    XObjectChangeEventArgs e)
{
    XDocument xDocument = ((XObject)sender).Document;
    if (xDocument != null)
    {
        xDocument.Changing -= ElementChanged;
        xDocument.Changed -= ElementChanged;
        xDocument.AddAnnotation(new ChangedSemaphore());
    }
}

This method is called when the XDocument is changed.  It starts by removing itself as an event handler.  Once we have detected a change, there is no reason to have the event handler called again upon subsequent changes.  Next is the addition of the ChangedSemaphore as an annotation for the XDocument.  As you will see in the next section, we will use that annotation to determine which parts need to be written.  If you look back to the GetXDocument method, you will also see that the ChangedSemaphore object is added as an annotation for a new XDocument because we assume that a newly created empty XDocument will always be changed.
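The same tagging technique can be tried with nothing but LINQ to XML.  The sketch below is not the PowerTools code; Track, IsDirty and OnChange are names I chose for illustration, and the XDocument is parsed from a string rather than read from a part.  It shows that a tracked document starts out clean and picks up the marker annotation on its first modification.

```csharp
using System;
using System.Xml.Linq;

static class ChangeTrackingDemo
{
    // Empty marker class, used only as an annotation that means "this document changed".
    private class ChangedSemaphore { }

    // Handler for both Changing and Changed. When a node is being added, the
    // Changing event fires before the node is attached, so sender.Document can
    // still be null; the Changed event that follows finds the document.
    private static void OnChange(object sender, XObjectChangeEventArgs e)
    {
        XDocument doc = ((XObject)sender).Document;
        if (doc != null)
        {
            doc.Changing -= OnChange;   // one notification is enough
            doc.Changed -= OnChange;
            doc.AddAnnotation(new ChangedSemaphore());
        }
    }

    // Hypothetical helper: start watching a document for its first change.
    public static XDocument Track(XDocument doc)
    {
        doc.Changing += OnChange;
        doc.Changed += OnChange;
        return doc;
    }

    // Hypothetical helper: true once the marker annotation has been attached.
    public static bool IsDirty(XDocument doc)
    {
        return doc.Annotation<ChangedSemaphore>() != null;
    }

    public static void Main()
    {
        XDocument doc = Track(XDocument.Parse("<root><child/></root>"));
        Console.WriteLine(IsDirty(doc)); // prints False
        doc.Root.Element("child").SetAttributeValue("id", "1");
        Console.WriteLine(IsDirty(doc)); // prints True
    }
}
```

The null check matters: without it, the handler would fail on the Changing event for a node that is not yet part of any document.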

Writing Changes

The process of writing out the changes is handled by the FlushParts method and its helper method shown below:

public static void FlushParts(this OpenXmlPackage doc)
{
    HashSet<OpenXmlPart> visited = new HashSet<OpenXmlPart>();
    foreach (IdPartPair item in doc.Parts)
        FlushPart(item.OpenXmlPart, visited);
}
private static void FlushPart(OpenXmlPart part, HashSet<OpenXmlPart> visited)
{
    visited.Add(part);
    XDocument xdoc = part.Annotation<XDocument>();
    if (xdoc != null && xdoc.Annotation<ChangedSemaphore>() != null)
    {
        using (XmlWriter xw = XmlWriter.Create(part.GetStream(FileMode.Create, FileAccess.Write)))
        {
            xdoc.Save(xw);
        }
        xdoc.RemoveAnnotations<ChangedSemaphore>();
        xdoc.Changing += ElementChanged;
        xdoc.Changed += ElementChanged;
    }
    foreach (IdPartPair item in part.Parts)
        if (!visited.Contains(item.OpenXmlPart))
            FlushPart(item.OpenXmlPart, visited);
}

The FlushParts method calls its helper method for each part in the package.  The FlushPart helper method checks the XDocument for that part to see if it has changed and, if it has, writes it.  It then recursively calls itself for each part related to that part.  The HashSet collection is used to keep track of which parts have already been checked.  It is needed because the parts of a package can be referenced from multiple parts.  It is even possible that there could be “loops” of references that would cause the method to enter an infinite recursion.  Instead, each part is added to the “visited” collection as it is checked and then that collection is checked to be sure the part is not processed a second time.

Writing the document is just a matter of getting the stream for the part and then using the XmlWriter object to write it out.  Once it is written, the method also removes the ChangedSemaphore annotation and sets the event handlers for changes again.  This is done because the package remains open after the FlushParts call.  If additional changes are made, we want to be sure they are detected.
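The role of the “visited” collection is easier to see on a small graph.  The following sketch leaves out the Open XML SDK entirely: GraphPart is a hypothetical stand-in for a part with related parts, and the relationships (including the loop from the image back to the document) are invented purely to exercise the traversal.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a part and its related parts (NOT an SDK class).
class GraphPart
{
    public string Name;
    public List<GraphPart> Parts = new List<GraphPart>();
    public GraphPart(string name) { Name = name; }
}

static class TraversalDemo
{
    // Same recursion shape as FlushPart: mark the part, process it, then
    // recurse only into related parts that have not been visited yet.
    private static void Visit(GraphPart part, HashSet<GraphPart> visited, List<string> order)
    {
        visited.Add(part);
        order.Add(part.Name);   // FlushPart would write the part here if it had changed
        foreach (GraphPart related in part.Parts)
            if (!visited.Contains(related))
                Visit(related, visited, order);
    }

    // Returns the names of all parts reachable from root, each exactly once.
    public static List<string> VisitAll(GraphPart root)
    {
        List<string> order = new List<string>();
        Visit(root, new HashSet<GraphPart>(), order);
        return order;
    }

    public static void Main()
    {
        GraphPart document = new GraphPart("document");
        GraphPart header = new GraphPart("header");
        GraphPart image = new GraphPart("image");
        document.Parts.Add(header);
        document.Parts.Add(image);
        header.Parts.Add(image);      // the image is referenced from two parts
        image.Parts.Add(document);    // an artificial loop back to the start

        Console.WriteLine(string.Join(", ", VisitAll(document)));
        // prints: document, header, image
    }
}
```

Without the visited check, the loop from the image back to the document would recurse until the stack overflowed.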

Summary

I hope this shows how a very small amount of carefully designed code can create very powerful functionality.  Although we have a little bit of a compromise by storing a copy of the part in memory, that risk is reduced by the simplicity of the code that handles it.  There is still a risk, though.  If any code that uses these methods also loads, processes and writes a part directly, then the part and its cached XDocument can become out of sync, and we don’t have any way to detect that this has happened.  It’s all or nothing with this approach, but as long as you always use GetXDocument to get a part and use FlushParts before you close the package, your code will work properly.

-Bob McClellan
