Open XML SDK... The Basics

Hi, my name is Ali and I'm a developer on the Word team. I have been part of the feature team working on the Open XML SDK and in this post I will be digging a little deeper into the SDK and its design concepts.

In their many blog posts, Zeyad Rajabi and Eric White demonstrated a number of cool solutions (for example, "Taking Advantage of Bound Content Controls" and "Accepting Track Changes and Removing Comments from a SharePoint Document") one can build by using the Open XML SDK to create, read, or modify a document, spreadsheet, or presentation. Each post had a scenario to achieve and showed the code to make it happen. In this post, I wanted to have a more detailed discussion on what's in the SDK and the rationale behind some of the designs. I don't have a scenario, just some notes that will hopefully give an insight into what is in the SDK and why.

I'm a Word guy, so my examples are all from the WordprocessingML portions, but everything applies to the other formats as well.

It's just XML.

Beyond packages and parts, the SDK contains a hierarchy of types such as Paragraph. To use the SDK successfully, it's important to know that these classes simply represent XML elements and behave like them too, with the added benefit of strong typing.

At the root of the hierarchy is the (abstract) OpenXmlElement class. OpenXmlElement programmatically represents any old element that appears in an Open XML file. The name is quite clear about being a representation of an XML element. The members provided are what you would expect on an XML element too: FirstChild, InnerText, NextSibling(), Append(OpenXmlElement), etc.

OpenXmlElement provides some functionality all by itself (as all good base classes should), which we will get to in a moment, but the classes derived from it are much more interesting. For every element defined in the Open XML specification, there is a corresponding class in the SDK, deriving (indirectly) from OpenXmlElement. First off, there are two more abstract classes that distinguish between elements that can have other elements as children and those that cannot: OpenXmlCompositeElement and OpenXmlLeafElement.

Paragraph, for example, derives from OpenXmlCompositeElement. "So what?" I hear you ask. Well, let's start with the simple stuff. If you've written code that manipulates XmlElement objects, you know how tedious it is to be constantly checking the name and namespace to see which element you're looking at. With a type hierarchy representing the elements, you always get the most derived type for an element from the Open XML SDK methods that return OpenXmlElement objects. For example, the Body.FirstChild property will return an instance of the Paragraph class if the first child of the w:body element in the document you're reading is indeed a w:p. If not, it will return some other type. So, instead of checking the element name and namespace, you can check its type, like so:

// child is any OpenXmlElement, for example the value of body.FirstChild
Paragraph p = child as Paragraph;
if (p != null)
{
    // Do something ...
}

Let me be clear about one thing: there's no magic here. The SDK did in fact check the name and namespace to choose the class Paragraph, just like your code would have done. The value added is that you don't have to repeatedly write those checks yourself. Of course, having a type has other benefits too. For starters, you can write functions that take an argument of type Paragraph and the compiler will enforce that contract for you.
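To make that concrete, here is a minimal sketch of the type check applied to every child of a body. It assumes the DocumentFormat.OpenXml package is referenced and parses a tiny w:body from an XML string, so it is the SDK that picks the child types. The names ChildTypeDemo, Describe and ParseSampleBody are mine, chosen for illustration only.

```csharp
using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class ChildTypeDemo
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Describe an element by its SDK type rather than by name and namespace.
    public static string Describe(OpenXmlElement child)
    {
        Paragraph p = child as Paragraph;
        if (p != null)
        {
            return "paragraph: " + p.InnerText;
        }

        Table t = child as Table;
        if (t != null)
        {
            return "table";
        }

        // Anything without special handling still exposes its XML name.
        return "other: " + child.LocalName;
    }

    // Parse a small w:body from raw XML so the SDK has to choose the child types.
    public static Body ParseSampleBody()
    {
        return new Body(
            "<w:body xmlns:w=\"" + W + "\">" +
            "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>" +
            "<w:tbl/>" +
            "</w:body>");
    }

    public static void Main()
    {
        foreach (OpenXmlElement child in ParseSampleBody().ChildElements)
        {
            Console.WriteLine(Describe(child));
        }
    }
}
```

Because Describe takes the base type, the same function works for any element. A function that only makes sense for paragraphs could take a Paragraph parameter instead and let the compiler enforce that.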

Furthermore, when elements are distinguished by their type, you can use generics to make your code even cleaner. Remember I said OpenXmlElement by itself has some useful functionality? Take a look at the GetFirstChild<T>() method, where T derives from OpenXmlElement. As you can imagine, this method returns the first child that is of a specific type. Again, all the SDK is doing is checking the name and namespace for you, but the resulting cleanup of the code is quite noticeable. Descendants<T>(), RemoveAllChildren<T>(), ReplaceChild<T>(), Ancestors<T>(), etc. all work similarly.
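A short sketch of a few of these generic methods working together might look like the following. It builds a small body in memory, so no document file is needed. GenericQueryDemo and its helper methods are illustrative names, not part of the SDK.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

public static class GenericQueryDemo
{
    // Two paragraphs with a table in between, built entirely in memory.
    public static Body BuildSampleBody()
    {
        return new Body(
            new Paragraph(new Run(new Text("First"))),
            new Table(),
            new Paragraph(new Run(new Text("Second")), new Run(new Text("!"))));
    }

    // Count all runs at any depth below the body.
    public static int CountRuns(Body body)
    {
        return body.Descendants<Run>().Count();
    }

    // Walk upward from the last Text element to the paragraph that contains it.
    public static string TextOfLastParagraph(Body body)
    {
        Text lastText = body.Descendants<Text>().Last();
        Paragraph owner = lastText.Ancestors<Paragraph>().First();
        return owner.InnerText;
    }

    public static void Main()
    {
        Body body = BuildSampleBody();

        // GetFirstChild<T>() returns null when there is no child of that type.
        Console.WriteLine(body.GetFirstChild<Table>() != null);
        Console.WriteLine(CountRuns(body));
        Console.WriteLine(TextOfLastParagraph(body));

        // Remove every direct child of one type in a single call.
        body.RemoveAllChildren<Table>();
        Console.WriteLine(body.ChildElements.Count);
    }
}
```

Descendants<T>() and Ancestors<T>() return lazy sequences, which is why they combine so naturally with Count(), First() and Last() from System.Linq.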

Another benefit of having these types is being able to use the 'new' keyword. Want to create a w:b (bold) element? Just write:

Bold b = new Bold();

Remember that this is just a fancy way to manipulate the XML. What we've done here is create an XML element named 'b' in the WordprocessingML namespace (depending on which using statement the type Bold resolves to, of course). This element doesn't belong to any document yet; you have to add it (using InsertAfter, AppendChild, etc.).
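As a rough sketch of that last step, the code below creates a detached Bold element and then attaches it to a run through a RunProperties element. NewElementDemo and MakeBoldRun are names I made up for the example. PrependChild is used because the run properties have to come before the text inside a run.

```csharp
using System;
using DocumentFormat.OpenXml.Wordprocessing;

public static class NewElementDemo
{
    // Create a run whose text is bold by attaching a freshly created w:b element.
    public static Run MakeBoldRun(string text)
    {
        Run run = new Run(new Text(text));

        // A new element is detached: it has no parent until it is added somewhere.
        Bold b = new Bold();
        if (b.Parent != null)
        {
            throw new InvalidOperationException("A new element should not have a parent.");
        }

        RunProperties rPr = new RunProperties();
        rPr.AppendChild(b);

        // w:rPr must precede the run content, so insert it as the first child.
        run.PrependChild(rPr);
        return run;
    }

    public static void Main()
    {
        Run run = MakeBoldRun("important");

        // It really is just XML: print what was built.
        Console.WriteLine(run.OuterXml);
    }
}
```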

Having the Open XML specification allowed the SDK to represent individual elements as their own types, and add facilities that make manipulating the elements easier because of their types.

First Class Properties

So far, we've only used the schema as a list of element names and namespaces, albeit a very long list. The schema tells us much more than that, however, in particular which elements and attributes are allowed as children of any given element. For example, the w:sdt (a structured document tag, run level) element in WordprocessingML can have at most one child of type w:sdtPr (SDT properties). Given what we saw above, we can use the generic GetFirstChild<T>() to find this properties element, like so:

SdtProperties sdtPr = sdtRun.GetFirstChild<SdtProperties>();
if (sdtPr != null)
{
    // Do something ...
}

Note that we don't need to cast the return value of GetFirstChild. This is better than looping through all the children and checking names and namespaces to find our sdtPr element, but it's still tedious, because we have to repeat it at every level. Fortunately, the SDK's SdtRun class has a property called SdtProperties that represents this child, so the code snippet becomes:

SdtProperties sdtPr = sdtRun.SdtProperties;
if (sdtPr != null)
{
    // Do something ...
}
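The two snippets should find the very same child object. Here is a small sketch that checks this on an SdtRun built in memory. SdtPropertyDemo, BuildSdt and the alias value "Title" are made up for the example.

```csharp
using System;
using DocumentFormat.OpenXml.Wordprocessing;

public static class SdtPropertyDemo
{
    // A run-level SDT with a properties child and an empty content child.
    public static SdtRun BuildSdt()
    {
        return new SdtRun(
            new SdtProperties(new SdtAlias { Val = "Title" }),
            new SdtContentRun());
    }

    public static void Main()
    {
        SdtRun sdtRun = BuildSdt();

        SdtProperties viaGeneric = sdtRun.GetFirstChild<SdtProperties>();
        SdtProperties viaProperty = sdtRun.SdtProperties;

        // Both routes lead to the same w:sdtPr element.
        Console.WriteLine(ReferenceEquals(viaGeneric, viaProperty));

        // On an SDT without a w:sdtPr child, the property is simply null.
        Console.WriteLine(new SdtRun().SdtProperties == null);
    }
}
```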

This property can be read and written. Let's take the RunProperties object this time.

RunProperties rPr = new RunProperties();
rPr.Italic = new Italic();
rPr.Bold = new Bold();
rPr.NoProof = new NoProof();

What just happened here? We created a w:rPr element and assigned a new instance of the Bold class to its Bold property. This creates a w:b element, and the assignment adds it as a child of our rPr element. The same goes for the Italic and NoProof properties. Again, there's no magic here; these are all just shorthand for operations you could perform on the raw XML, allowing for much less tedious code.

There's even more useful functionality in those four lines. Here's the equivalent without using the first class properties. Can you spot what's wrong?

RunProperties rPr = new RunProperties();
rPr.AppendChild(new Italic());
rPr.AppendChild(new Bold());
rPr.AppendChild(new NoProof());

This snippet actually creates a schema-invalid document. The schema specifies the children of the rPr element as a sequence, so order matters. Bold (w:b) must come before Italic (w:i) for the file to be valid according to its schema. The snippet using the property assignments gets this right because the code behind those assignments knows the required order, whereas AppendChild simply adds each child at the end, in the order it is called.
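One way to see the difference for yourself is to list the child element names after each approach. This is only a sketch. OrderDemo, ViaProperties, ViaAppendChild and ChildNames are illustrative names of mine.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class OrderDemo
{
    // Property assignments: the SDK decides where each child goes.
    public static RunProperties ViaProperties()
    {
        RunProperties rPr = new RunProperties();
        rPr.Italic = new Italic();
        rPr.Bold = new Bold();
        rPr.NoProof = new NoProof();
        return rPr;
    }

    // AppendChild: each child lands at the end, in call order.
    public static RunProperties ViaAppendChild()
    {
        RunProperties rPr = new RunProperties();
        rPr.AppendChild(new Italic());
        rPr.AppendChild(new Bold());
        rPr.AppendChild(new NoProof());
        return rPr;
    }

    // Comma-separated local names of the direct children, in document order.
    public static string ChildNames(OpenXmlElement parent)
    {
        return string.Join(",", parent.ChildElements.Select(c => c.LocalName));
    }

    public static void Main()
    {
        Console.WriteLine(ChildNames(ViaProperties()));
        Console.WriteLine(ChildNames(ViaAppendChild()));
    }
}
```

If the SDK behaves as described above, the first line lists b before i, while the second keeps the call order and lists i before b.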

Let me summarize what I've been showing through examples. Where the schema allows at most one instance of a child element, the class representing the parent element gets a property of that type, with the same name. The value of the property can be read or written, and assigning to the property adds the child element to the parent (or replaces the existing one).

How about attributes? Well, attributes by definition fit the criteria above (each attribute must appear at most once on an element). So attributes declared in the schema always get first class properties. For example, we can add underline to our run properties above.

RunProperties rPr = new RunProperties();
rPr.Italic = new Italic();
rPr.Bold = new Bold();
rPr.Underline = new Underline();
rPr.Underline.Val = UnderlineValues.DotDotDash;

That last line is what we're interested in. First off, notice the property called Val on the Underline element. This represents the w:val attribute on that element. As usual, by assigning to this property we're creating an instance of that attribute and placing it on the element. The values of the attribute are of course strings, but when the schema specifies them to be members of an enumeration, the SDK contains a corresponding CLR enumeration as well. The enumeration's name is designed to be easily recognizable: it's the type name followed by Values (UnderlineValues here). There's quite a bit of syntactic sugar that goes into that assignment, involving implicit conversions and all, but I'll leave that as an exercise to the reader. I will point out, however, that the Val attribute here is strongly typed. The following statement will not compile:

rPr.Underline.Val = BooleanValues.False;
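Reading the attribute back deserves the same care as writing it, because either the element or the attribute may be missing. A minimal sketch of a defensive read follows. AttributeDemo and DescribeUnderline are hypothetical helpers, not SDK members.

```csharp
using System;
using DocumentFormat.OpenXml.Wordprocessing;

public static class AttributeDemo
{
    // Describe the underline setting, checking each level for null on the way down.
    public static string DescribeUnderline(RunProperties rPr)
    {
        if (rPr.Underline == null)
        {
            return "no w:u element";
        }

        if (rPr.Underline.Val == null)
        {
            return "w:u without w:val";
        }

        // Val wraps the enumeration; Value exposes the strongly typed member.
        return rPr.Underline.Val.Value == UnderlineValues.DotDotDash
            ? "dot-dot-dash underline"
            : "some other underline";
    }

    public static void Main()
    {
        RunProperties rPr = new RunProperties();
        Console.WriteLine(DescribeUnderline(rPr));

        rPr.Underline = new Underline();
        Console.WriteLine(DescribeUnderline(rPr));

        rPr.Underline.Val = UnderlineValues.DotDotDash;
        Console.WriteLine(DescribeUnderline(rPr));
    }
}
```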

It's just XML, really.

Ok, back to the beginning. I hope I've demonstrated how the SDK adds value in small simple ways that add up to writing better, cleaner code to create, read or manipulate Open XML files. I'll finish off with a few short points.

  • Don't let those properties lull you into thinking the objects are always there. You really are manipulating XML elements and attributes, and sometimes they're simply absent. rPr.Underline may return null, for example, if the w:rPr has no underline child. Even if that's not null, rPr.Underline.Val may return null if the element is present, but the val attribute isn't.
  • Creating elements by using the new keyword is fun and all, but even that can get quite tedious for large hierarchies of elements, particularly when the elements being created are always the same. For example, imagine your application needs to add a table to a document. Furthermore, the first three rows of the table are always the same (some sort of header, say). Instead of creating a Table object, then three TableRow objects, and a bunch of TableCell objects for each of those rows, and then adding the cells to the rows, and the rows to the table, you can simply use a constructor provided by the Table class that takes an XML string as input. Of course the XML supplied must be well-formed (so you will have to close the w:tbl tag, etc.), but you can then add more rows to the resulting Table object.
  • LINQ. Most of the examples Zeyad and Eric have posted use LINQ extensively. While the Open XML SDK was built to work well with LINQ, there's definitely no requirement to use LINQ. Typically when generating xml elements (like the examples above with the rPr element) the simple imperative code comes out much more readable than the LINQ equivalents. Conversely, the LINQ queries (is that redundant?) come out more concise when the point is to ask a simple question like, "does this document have any SDT tags whose alias is a given value?"
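The last two points can be sketched together: a table seeded from an XML string and then extended in code, and a one-line LINQ query for an SDT alias. ShortcutsDemo, BuildTable, HasSdtWithAlias and the sample values ("Name", "CustomerName", "Contoso") are all made up for illustration. The table XML is deliberately minimal and leaves out the w:tblPr and w:tblGrid elements a real table needs.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class ShortcutsDemo
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Seed the fixed header row from XML, then add a data row with regular objects.
    public static Table BuildTable(string firstDataCell)
    {
        // Simplified: a real table also needs w:tblPr and w:tblGrid.
        Table table = new Table(
            "<w:tbl xmlns:w=\"" + W + "\">" +
            "<w:tr><w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc></w:tr>" +
            "</w:tbl>");

        table.AppendChild(new TableRow(
            new TableCell(new Paragraph(new Run(new Text(firstDataCell))))));
        return table;
    }

    // LINQ query: is there any SDT below root whose alias equals the given value?
    public static bool HasSdtWithAlias(OpenXmlElement root, string alias)
    {
        // Val may be null when w:alias has no w:val attribute.
        return root.Descendants<SdtAlias>().Any(a => a.Val?.Value == alias);
    }

    public static void Main()
    {
        Table table = BuildTable("Ali");
        Console.WriteLine(table.Elements<TableRow>().Count());

        Paragraph p = new Paragraph(
            new SdtRun(
                new SdtProperties(new SdtAlias { Val = "CustomerName" }),
                new SdtContentRun(new Run(new Text("Contoso")))));

        Console.WriteLine(HasSdtWithAlias(p, "CustomerName"));
        Console.WriteLine(HasSdtWithAlias(p, "Missing"));
    }
}
```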

-Ali

Comments

  • Anonymous
    January 13, 2009
    How would I iterate through a document looking for particular elements such as bookmarks, fields, etc.?

  • Anonymous
    January 14, 2009
    The easiest way to iterate through a document looking for a particular element is to use Descendants<T>(), where T is the type of element you are looking for. For example, let's say you are looking for all BookmarkStart elements within the main document part. You can accomplish this by doing the following: foreach (BookmarkStart b in doc.Descendants<BookmarkStart>()) { /* do something */ } At this point you can iterate over and inspect all the BookmarkStart elements. Zeyad Rajabi

  • Anonymous
    January 15, 2009
    A fairly rich week for technical articles: Validating Open XML identifiers (relationship ID, content type,

  • Anonymous
    January 28, 2009
    For the past few posts, I have been concentrating on showing you guys solutions to real world scenarios.

  • Anonymous
    February 06, 2009
    One of the more common scenarios related to a Wordprocessing document is the need to sanitize a document