Microformats and Open XML

Like most people, I hadn't heard of microformats a year ago. But now the concept seems to be gathering momentum. Microformats have their limitations, but they offer a practical way to solve a common interoperability problem: how to add structured data to existing documents (typically HTML) without changing the underlying schema or breaking existing implementations. The basic concept is that a microformat is a set of "class" attributes that can be added to spans in an HTML page to tag content with semantic meaning. For example, <span class="family-name">Mahugh</span>.

There have been microformats defined for reviews (hReview), calendar items (hCalendar), business cards (hCard), and other applications, and microformat-tagged content is starting to appear on many web pages. Search engines, aggregators, and other types of software can use the microformat to determine what a piece of information means in a particular context (what's being reviewed, when the meeting is scheduled, where the email address is in a business card) without any impact on the visual rendering of the HTML content itself.
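To make that concrete, here's a minimal sketch (in Python, using only the standard library) of how a piece of software might pull microformat-tagged values out of a page by looking at class attributes. The class name ClassTextCollector and the function extract_fields are hypothetical names made up for this illustration, and real hCard parsers handle many more cases than this does.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collects the text of elements whose class attribute names a wanted field."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}
        self._stack = []  # (tag, matched class name or None) for each open element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self._stack.append((tag, match))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag so unclosed tags like <br> do not leak.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Credit the text to every enclosing element that matched a wanted class.
        for _, name in self._stack:
            if name:
                self.found[name] = self.found.get(name, "") + data


def extract_fields(html, wanted):
    """Return a dict mapping each wanted class name to the text tagged with it."""
    parser = ClassTextCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


sample = (
    '<div class="vcard">'
    '<span class="given-name">Doug</span> '
    '<span class="family-name">Mahugh</span>'
    "</div>"
)
fields = extract_fields(sample, ["given-name", "family-name"])
```

Notice that the code never looks at which tags are used or how they are nested. It only cares about the class names, which is why the visual markup can change without breaking the consumer.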

I won't go into the details of microformats here (there's plenty of information over on microformats.org if you're interested), but I wanted to use the hCard microformat as an example of how Open XML's custom schema support allows binding of structured document tags (or content controls, as they're called in Word) to arbitrary nodes in a custom XML part.

Mr. Doug Mahugh
One Microsoft Way
Redmond, WA 98052
dmahugh@microsoft.com
Phone: +1-425-882-8080

Microformat sample: hCard

Consider, for example, the hCard shown here. This sample is just a DIV containing my contact info, with some minimal CSS styling in a style attribute and the appropriate hCard attributes added to tag the content with its meaning. If you take a look at the HTML on this page (view source and search for the phone number or something), you'll see that the surrounding div has a class of vcard, and then the content within is tagged as follows:

An hCard as a custom XML part

Now this DIV is just some XML (XHTML). And as I've mentioned before, we can put any well-formed XML in an Open XML document as a custom XML part, and then do creative things with that XML to enable interoperability between various systems. For example, we can put that DIV into a custom XML part, and then bind content controls to the hCard fields (as identified by their classes in the markup above).

I've attached some sample documents that demonstrate this concept. For example, here's a screen shot of content controls bound to the nodes from the hCard-format custom XML part:

Those content controls have 2-way binding to the nodes in the custom XML part. For example, you can correct a typo in my name and then save the DOCX, and your correction is written to the appropriate node in the custom XML part. Or you can replace the custom XML part with a different hCard, and that hCard's data will appear in the content controls.

As a simple example of this concept, you can go to this Live Clipboard demo page and grab a sample hCard from there. Just right-click the orange icon next to any of the names listed, and select Copy. Then you can paste that text into the custom XML part in the attached sample document, and the content controls will now be bound to that hCard. For example, here's what you'll see if you paste in the first example from that page (this is hCard2.docx in the attached samples):

And here's the hCard source data for that example:

You'll notice the syntax isn't identical to the previous example, but the same hCard class attributes are used, and therefore all of the data binding works the same in both instances. So here we have two simple examples of the same document, with the same visual presentation, providing an interactive editing capability for different instances of business data.

This ability to swap out custom XML parts enables a variety of development scenarios for Open XML solutions. For example, a custom XML part can be generated by some type of system, packaged in an Open XML document, and then travel with the document as a "data payload" that can be displayed and edited through the document interface. Later in the business process, the custom XML part can be extracted (by any programming environment that supports ZIP packages) and passed on to other systems as a clean instance of business data with none of the Open XML schemas included in it. Custom XML parts allow for simple and consistent separation of business data and document-formatting information.
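As a rough sketch of the extract and swap steps, here's how they might look in Python with the standard zipfile module. The part name customXml/item1.xml is an assumption (it's a common location, but check the package relationships in your own document), and read_custom_xml and replace_custom_xml are hypothetical helper names for this illustration.

```python
import io
import zipfile

# Assumed location of the custom XML part; the real name can differ per document.
DEFAULT_PART = "customXml/item1.xml"


def read_custom_xml(docx_bytes, part_name=DEFAULT_PART):
    """Return the raw bytes of the custom XML part inside a DOCX package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(part_name)


def replace_custom_xml(docx_bytes, new_xml, part_name=DEFAULT_PART):
    """Return a copy of the package with the custom XML part swapped out."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as source, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            # Copy every part unchanged except the one being replaced.
            data = new_xml if item.filename == part_name else source.read(item.filename)
            target.writestr(item.filename, data)
    return out.getvalue()


# Build a tiny stand-in package in memory (not a complete, valid DOCX).
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _package:
    _package.writestr("word/document.xml", "<doc/>")
    _package.writestr(DEFAULT_PART, "<div class='vcard'>old</div>")
original = _buffer.getvalue()
updated = replace_custom_xml(original, b"<div class='vcard'>new</div>")
```

Because the custom XML part is copied in and out as-is, what comes out of read_custom_xml is just the business data, with no WordprocessingML mixed into it.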

Technical details

If you're interested in understanding the binding mechanism that connects custom XML parts to content controls, take a look at the markup in the attached sample document. There are two aspects involved: a GUID for identifying the custom XML part (since there could be more than one), and the XPath expression for selecting the particular node within the custom XML part. The GUID is stored in a separate "custom XML properties" part, which has a relationship to the custom XML part.

This approach makes the custom XML part itself entirely independent of the details of the binding. The XPath expressions are stored in the structured document tags within the document body, and the GUID is stored in the custom XML properties part. So the custom XML part itself can be replaced at any time, and the new part will populate the content controls as we saw above. And because the XPaths are only looking for a vcard with certain classes inside, the binding is very flexible and tolerant of changes in the format of the custom XML part.
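If you'd like to poke at the bindings programmatically, here's a small Python sketch that lists the GUID and XPath for each bound content control in a document body. It relies on my understanding that each structured document tag carries a w:dataBinding element with w:storeItemID and w:xpath attributes. The GUID and the XPath in the sample markup below are made up for illustration and are not taken from the attached documents, and list_bindings is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative markup only: the GUID and XPath here are invented for this sketch.
SAMPLE_DOCUMENT = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:sdt>
      <w:sdtPr>
        <w:dataBinding w:storeItemID="{00000000-0000-0000-0000-000000000001}"
                       w:xpath="//*[@class='family-name']"/>
      </w:sdtPr>
      <w:sdtContent><w:p><w:r><w:t>Mahugh</w:t></w:r></w:p></w:sdtContent>
    </w:sdt>
  </w:body>
</w:document>"""


def list_bindings(document_xml):
    """Return (store item GUID, XPath) pairs for every data-bound content control."""
    root = ET.fromstring(document_xml)
    bindings = []
    for binding in root.iter(f"{{{W}}}dataBinding"):
        guid = binding.get(f"{{{W}}}storeItemID")
        xpath = binding.get(f"{{{W}}}xpath")
        bindings.append((guid, xpath))
    return bindings


bindings = list_bindings(SAMPLE_DOCUMENT)
```

The GUID returned here is what you would match against the ID stored in the custom XML properties part to find out which custom XML part a given content control is bound to.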

This is important in our microformat example because not all web pages are structured the same way, and the path to a piece of microformat-tagged data can vary considerably between two different HTML documents. For example, one person might put their first and last names in different cells in a table, whereas another person might put them in a single paragraph, and yet another person might put them in separate DIVs. But as long as they're tagged with the appropriate microformat attributes, the data values can be mapped to content controls in a manner that will work with any of these variations.
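Here's a quick way to convince yourself of that, sketched in Python. The three layouts below are invented samples (a table, a paragraph, and separate DIVs), and the lookup uses the limited XPath subset in Python's ElementTree rather than the exact expressions stored in the sample documents, so treat it as an illustration of the idea and not as the binding Word performs. It also assumes each element has a single class value.

```python
import xml.etree.ElementTree as ET

# Three invented layouts carrying the same hCard fields.
TABLE_LAYOUT = """<body><div class="vcard"><table><tr>
<td class="given-name">Doug</td><td class="family-name">Mahugh</td>
</tr></table></div></body>"""

PARAGRAPH_LAYOUT = """<body><div class="vcard"><p>
<span class="given-name">Doug</span> <span class="family-name">Mahugh</span>
</p></div></body>"""

DIV_LAYOUT = """<body><div class="vcard">
<div class="given-name">Doug</div>
<div class="family-name">Mahugh</div>
</div></body>"""


def hcard_field(xml_text, field):
    """Find a field by class name anywhere inside the first vcard element."""
    root = ET.fromstring(xml_text)
    vcard = root.find(".//*[@class='vcard']")
    if vcard is None:
        return None
    node = vcard.find(f".//*[@class='{field}']")
    return node.text if node is not None else None


family_names = [
    hcard_field(layout, "family-name")
    for layout in (TABLE_LAYOUT, PARAGRAPH_LAYOUT, DIV_LAYOUT)
]
```

The same two lookups work against all three layouts because they select on the class attribute and ignore the element names and nesting in between.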

Next up on our tour of Open XML's custom schema support will be custom content tagging. I mentioned this briefly in my last post, but I'd like to cover this concept in more detail because it's a powerful enabler of interoperability between Open XML documents and XML-aware business applications. I'll cover that later this week.

hCard-samples.zip

Comments

  • Anonymous
    March 04, 2007
    Dude, you're stealing my demos! :)

  • Anonymous
    March 05, 2007
    Hey, I thought you were busy this week.  Ciao!

  • Anonymous
    March 05, 2007
    I've blogged a few times about the support for custom defined schema in the OpenXML formats. ( http://blogs.msdn.com/brian_jones/archive/tags/Custom+Schema+Solutions/default.aspx

  • Anonymous
    March 05, 2007
    Is it possible to add and reference custom XML parts in EXCEL documents? I was able to add a reference to an external XML document, but can I do the same with an XML document built into the XLSX?

  • Anonymous
    March 06, 2007
    You can embed an XML document in an XLSX using the same mechanism you use for embedding it in a DOCX (as a custom XML part), but there is no support in spreadsheets for the 2-way data binding that can be used with content controls.  One thing that compensates for this a bit is that the simplicity of the sheetdata content makes it easier to read and write data in the worksheet itself (as opposed to within a WordprocessingML document, which has more complex markup to navigate).

  • Anonymous
    March 06, 2007
    Hey nice write-up Doug.  I think I finally understand Microformats conceptually now.  I always wondered how it could be flexible enough to handle different formats in the XML.  Thanks!

  • Anonymous
    March 09, 2007
    As a brief follow-up to my post on microformats and Open XML , I'd like to show a slightly different

  • Anonymous
    March 11, 2007
    Shurik, You can do 1-way binding of XML nodes in an external XML part to a spreadsheet.  Is this what you want to do? If you'd like to send me a sample XML part, I'll bind it to a spreadsheet and send you a sample.  Very busy week this week, but feel free to send something if you'd like and I'll get to it in a few days.  It's dmahugh at the usual place. - Doug

  • Anonymous
    March 14, 2007
    Doug, I generate EXCEL files on a web server, and then users download autogenerated EXCELs by clicking on a link. So I need everything to be in a single file and external XML parts will not work for me :( But anyway thank you for the clarification.

  • Anonymous
    March 15, 2007
    Sounds good. I tried the paste operation and wound up with a bunch of angle brackets though. What's the procedure to get the data cleanly into the docx?

  • Anonymous
    March 16, 2007
    Hi Jon, It's not clear to me what you did -- did you paste some XML into the custom XML part?  If you paste directly into the document, then of course the tags and angle brackets are just text in the document and will appear as such. In general, the answer to your question is that you just put XML in the custom XML part, which is a discrete file in the ZIP package that constitutes the docx.  Then the XPath binding populates the content controls from the appropriate nodes in the custom XML part. Feel free to send me a sample document if you'd like and I'll take a look.