Jaa


Generalized Approach to Open XML Markup References

The post Move/Insert/Delete Paragraphs in Word Processing Documents using the Open XML SDK introduces the C# code that enables document composability – building new Open XML word processing documents from portions of existing ones.  That is the basis for the most important new functionality in PowerTools for Open XML v1.1 – the Merge-OpenXmlDocument cmdlet.  That C# module takes interrelated markup into account when assembling new documents.  A few people have had questions about how to maintain/enhance the C# code presented in that post.  This post presents a basic explanation of the approach taken in that code.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCNote from Eric White: This post is another guest post by Bob McClellan.  Bob wrote the above mentioned C# code.  His explanations here of that code will help developers who want to leverage that code.  In addition, his code serves as a good model for how to keep references in sync.

The following diagram shows how chart references in the main document refer to chart parts, which in turn have references to binary parts.  In the case of charts, the binary parts contain the numeric data behind the charts.

When copying markup from the MainDocumentPart to another document, you could, in theory, copy all of the chart and binary parts to the new document.  However, this would be wasteful of space if the newly assembled document doesn’t contain a reference to a specific chart and binary part because the specific relevant markup in the MainDocumentPart wasn’t moved to the new document.  Instead, the assembly of the new document is driven from the markup in the MainDocumentPart – the code only copies such chart and binary parts to the new document as required.  After copying all necessary parts to the new document, the markup from MainDocumentPart is moved to the new document, and then all references are fixed up.

The code begins by finding all “c:chart” elements in the relevant region of markup in the MainDocumentPart.   For each reference, the code copies the chart part and then changes the ID in the reference to the new value.   In order to copy the chart part, the code also needs to find the reference to binary part and do the same thing: copy and change the reference ID.

Not all references are ID’s of entire parts, though.  In the following sections, I describe the different varieties of references that appear in Open XML word processing documents.

References to Entire Parts

This type of reference is used for Charts, Headers, Footers, Diagrams, Images, and Embedded Binary Objects.  If the part is XML (Charts, Headers and Footers), then it also may contain references to other parts.  Headers and Footers, for example, may contain references to images.  The method for copying references has already been described above.

Note that Diagrams are a special case.  Each diagram contains four references to other parts.  The same method is used for copying references, but it must be done four times for each diagram element.

References to Elements in a Single Part

This type of reference is used for Comments, Footnotes, Endnotes and Numbering.  The code still uses the approach of finding the references, copying the object, and then changing the reference, but there are a few differences.  I will use footnotes as an example.  All footnotes are stored in a single part, but each footnote element has a unique ID attribute and that ID is also stored in the reference.  The ID for a new part could be generated by the Open XML SDK, but I do not create a new part for each footnote, so instead the code generates the ID.  Before copying parts, the code gets the current highest ID for footnotes in the destination document.  It then increments that ID for each footnote that the code needs to copy.  Once the footnote part exists in the new document, copying a footnote involves copying a single element, rather than an entire part.

There is a special case of this type of reference used for Numbering.  The Numbering part is referenced by paragraphs that do automatic sequential numbering, like outline numbering or just simple numbered list paragraphs.  Numbering is special because the numbering elements also contain references to abstract numbering elements.  Both of these elements are stored in a single Numbering part.  It is important that references to the same elements remain valid after copying, otherwise the numbering will start over at 1 for every paragraph.  An additional complexity is that the numbering elements can override values from the abstract numbering elements.  Even with all these complexities, the process of copying these elements correctly is not too difficult.  We can avoid duplication of abstract numbering elements by looking at the “nsid” attribute, rather than comparing each individual value.  We can avoid duplication of the numbering elements by looking at the abstract ID, but if there are any overrides, then the code adds a unique element to the Numbering part.

References to Styles

Styles could be handled the same way as other references to a single part, but I chose not to do so for a couple of reasons.  Styles are identified by name, rather than an ID.  If the same name appears in two documents that we are merging together, we generally would like the style names to stay the same.  In fact, it is desirable to have the style change to match the others being copied.  As a result, there is no need to make each style unique.  Instead, the code merges the styles; if the style already exists in the destination, then we leave it alone, otherwise we copy it in.  Therefore, there is no need to change the references.

Of course, we could still check all the references to be sure we aren’t copying unused styles.  However, there are a great many elements that contain style references.  They can appear in paragraphs, runs, headers, footers, other styles, and more.  In general, it doesn’t matter that much if we end up copying a few styles that are never used.

Other References

There are a couple of other cases that only occur once, as far as I know.  Hyperlinks may contain references to an external relationship.  An external relationship is very much like a part except that it doesn’t really have any content.  The Open XML SDK generates a new ID, just like when creating a part.  This case is just like “References to Entire Parts” except that there is no content to copy, just the URI value that is set on the external relationship.

The last special case involves content controls that reference data from an external source, also known as “data-bound content controls.”  The data reference is done in a “w:dataBinding” element with an attribute named “storeItemID.”  That ID is a GUID that identifies a Custom XML part.  It is common to use the same Custom XML part for several or all of the content controls in a particular document.  Due to the unique nature of GUIDs, the code assumes that a GUID will be unique across all documents, so we never need to change any of the references.  However, since a Custom XML part can contain any XML that a user may desire, the GUID cannot be stored directly in the Custom XML part.  Instead, each Custom XML part contains a relationship to a Custom XML Properties part and the properties part contains the GUID for that custom XML.  This is nothing like any of the other reference, but it is actually quite easy to do the copying.  I found it easiest to start by building a unique list of all the GUID’s.  Then it is simply a matter of copying each Custom XML part with a GUID that is in that list.

Conclusion

After thinking for quite a while about the variety of possible approaches to maintaining referential integrity within Open XML markup while enabling composability of documents, we settled on this approach.  We feel that this approach has proven to be robust, and it performs well.

-Bob McClellan

Comments

  • Anonymous
    April 12, 2009
    Hi Eric, Have been following the evolution of the assembly of a document closely.Your posts are very informative. All we are missing now is the best method of finding a section of text to replace/append in a host document. We can structure the host doc so that sections/ paragraphs begin with say a Header1 and a unique number. What is the cleanest way to find the beginning/end of the target section so we can replace it. Would appreciate a pointer to some code as we are still very green finding our way around the xml.doc Thanks, Kerry

  • Anonymous
    April 12, 2009
    Hi Kerry, In the post, http://blogs.msdn.com/ericwhite/archive/2009/02/16/finding-paragraphs-by-style-name-or-content-in-an-open-xml-word-processing-document.aspx, (and subsequent posts, I developed a query to find paragraphs by content or style.  Does that code address your situation? -Eric

  • Anonymous
    April 13, 2009
    Eric, Apologies- did see the post but had forgotten, too much to learn... Thanks, will get back to let you now how we get on, Kerry

  • Anonymous
    April 15, 2009
    Eric, Well that code was way above my knowledge base..., but the Powershell scripts you posted ;-http://blogs.msdn.com/ericwhite/archive/2009/03/19/announcing-the-release-of-powertools-for-open-xml-v1-1.aspx seem to work well with a minimum of effort ! How would I incorporate these into a module for downloading docs from MOSS and uploading the modified versions? Again thanks for your great posts, Kerry

  • Anonymous
    April 16, 2009
    The comment has been removed