Custom Defined Schemas

I've talked a lot about the value of "Custom Schema" support in Office. Anytime I give talks on the file formats, I make sure to spend some time on the support for custom schemas as well. I don't think I've ever given a basic intro on this blog to the difference between reference schemas and custom defined schemas, though, so if you haven't seen one of my presentations you may not know what I'm talking about when I refer to custom defined schemas. Each Office application has different levels of support, and it's good to start investigating the functionality...

There are of course an infinite number of valuable uses for Office documents. Obviously the Office applications are more than just a better typewriter. If you look at Word, for example, one of the big investments has been in making better-looking documents. We've done a ton of work this release, for instance, to let you easily add great-looking content to your document.

Making a great looking document though is only one part of making a document valuable. I think it's best to use a really simple example. Let's take the following document:

Using the basic Word functionality, it's pretty easy to format this document so that any human can easily look at it and understand the information that's being conveyed. It's clear that this is a conference report that was made on July 17th. You can quickly find out what the summary was as well as who attended the conference. This information can all be saved out as XML thanks to the reference schemas. In Word, we defined an XML schema called WordprocessingML that fully represents all of that formatting and layout information as XML. The reference schemas are used for conveying all the application specific information:

So the information that says "John Doe" is bold text and "Health Agency" is italic text can be represented as XML thanks to the reference schemas:

<w:p>
  <w:r>
    <w:rPr> <w:b /> </w:rPr>
    <w:t>John Doe</w:t>
  </w:r>
  <w:r>
    <w:rPr> <w:i /> </w:rPr>
    <w:t>Health Agency</w:t>
  </w:r>
</w:p>

The reference schemas are great for representing all of the application's information. Everything you do in Word, you want to be saved out and persisted, and the reference schemas allow for that (you don't lose anything when saving as XML). In a word processing document the reference schemas are used to convey all the display-oriented information like bold, italics, paragraphs, tables, styles, etc. Reference schemas enable long-term archivability of the formats as well as interoperability with other applications and solutions. This assumes there is good documentation for the reference schemas, which we will have via the Ecma process.

The thing that you don't get with reference schemas though is the ability to easily structure content using your own semantics. In the above example, if you wanted to quickly search for all conference reports where "John Doe" had been an attendee, you'd be kind of stuck. Any type of business logic you wanted to run on these documents would be extremely difficult, because the reference schemas are there to let humans easily read the content, not programs. Let's say you wanted to write a solution that took all the conference reports that "John Doe" had attended, and created a single document that was a list of all those conferences and the summary of each. If the application you are using doesn't support custom defined schemas, then you're stuck using features like style names, bookmarks, tables, or some other type of hack. Those approaches don't allow for any real hierarchy, and there isn't really a good way of specifying the structure so that the right type of validation can be done. Up until the introduction of the custom defined schema support in Word 2003 though, those hacks were the only options people had. I've seen plenty of solutions people have built using all of those methods, some of which were extremely impressive given the constraints. Unfortunately though, they all fall short of the goal.
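
To make the limitation concrete, here is a minimal Python sketch of the kind of workaround described above: guessing that bold text is an attendee's name. It parses the WordprocessingML fragment shown earlier using only the standard library. The namespace URI (the Word 2003 WordML one) and the helper name runs_with_format are assumptions for illustration, since the fragment in this post doesn't declare a namespace.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the fragment in the post does not declare one.
W = "http://schemas.microsoft.com/office/word/2003/wordml"

FRAGMENT = f"""
<w:p xmlns:w="{W}">
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t>John Doe</w:t>
  </w:r>
  <w:r>
    <w:rPr><w:i/></w:rPr>
    <w:t>Health Agency</w:t>
  </w:r>
</w:p>
"""


def runs_with_format(paragraph, flag):
    """Return the text of every run whose run properties contain
    the given formatting element (for example 'b' or 'i')."""
    ns = {"w": W}
    texts = []
    for run in paragraph.findall("w:r", ns):
        if run.find(f"w:rPr/w:{flag}", ns) is not None:
            texts.append(run.findtext("w:t", default="", namespaces=ns))
    return texts


paragraph = ET.fromstring(FRAGMENT)
bold = runs_with_format(paragraph, "b")
italic = runs_with_format(paragraph, "i")
```

The code finds "John Doe" only because the text happens to be bold. Nothing in the markup says he's an attendee, and any other bold text in the document would match just as well.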

This is where the custom schema support comes in. If you really want to treat these documents as a source of data and integrate them with your business processes, you need the ability to structure them in your own schemas. You want to specify what the date was, who the attendees were, and even what department they worked for:

This is the advantage XML can give you. The combination of namespaces, XSDs, and even XPath allows you to add your own structures to the documents, validate those structures, and even navigate them so that they can integrate better with your solutions. With the custom defined schema support, you can get this kind of information out of the document:

<ConferenceReport>
  <Date>3/24/2004</Date>
  <Attendees>
    <Attendee Name="John Doe">
      <Department>Health Agency</Department>
      <Potential>
        <Sales>100</Sales>
        <Growth>25%</Growth>
      </Potential>
    </Attendee>
  </Attendees>
</ConferenceReport>

That's much more useful when you care more about the data than the presentation information. It represents the business information that's stored in the document rather than just the display information. This really helps to enable system integration.
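
As a rough sketch of that kind of integration, the Python below searches a small set of conference reports for a given attendee, using only the standard library. The element names follow the example above. The second report, the name "Jane Roe", and the function name reports_attended_by are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two sample reports. The first mirrors the example in this post;
# the second is hypothetical data added so the search has something to skip.
REPORTS = [
    """<ConferenceReport>
         <Date>3/24/2004</Date>
         <Attendees>
           <Attendee Name="John Doe">
             <Department>Health Agency</Department>
           </Attendee>
         </Attendees>
       </ConferenceReport>""",
    """<ConferenceReport>
         <Date>5/2/2004</Date>
         <Attendees>
           <Attendee Name="Jane Roe">
             <Department>Finance</Department>
           </Attendee>
         </Attendees>
       </ConferenceReport>""",
]


def reports_attended_by(name, documents):
    """Return (date, department) for each report listing `name` as an attendee."""
    matches = []
    for xml_text in documents:
        report = ET.fromstring(xml_text)
        # Walk the custom structure directly instead of guessing from formatting.
        for attendee in report.findall("./Attendees/Attendee"):
            if attendee.get("Name") == name:
                date = report.findtext("Date", default="").strip()
                department = attendee.findtext("Department", default="").strip()
                matches.append((date, department))
    return matches


john_reports = reports_attended_by("John Doe", REPORTS)
```

Because the attendee is marked up as an attendee, the query doesn't depend on how the name happens to be formatted in the document.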

It's really important to realize the potential of documents in your organization. Too often people think of documents as just being a bunch of formatted text, and don't see document collections as the valuable databases that they are. People spend a lot of time producing that valuable content, and you want to make sure that you can fully leverage that content. You'll see that in Office '12' we've done even more work to help you integrate with the data in your documents. In my post earlier this month I talked a bit about the new content controls in Word, and how you could bind those controls to your own XML data. There is a ton of momentum in this area that's been building up since Office 2000, and it just keeps getting better! I really love this stuff if you can't tell :-)

-Brian

Comments

  • Anonymous
    January 26, 2006
    Do you really think it is worth organizing a document like that? If there were a way for Word to automatically recognize letters and apply labels such as address, signature, or position, then it would be useful. I think it is a waste of time because, as far as I can see, I would rarely use it.

    Simply searching thru my documents for John Doe is enough. I can look thru them for what I need. I will only get so many results that it will not be hard to search thru them.

  • Anonymous
    January 27, 2006
    The comment has been removed

  • Anonymous
    January 27, 2006
    The comment has been removed

  • Anonymous
    January 27, 2006
    Hi Keith, the key to making WordML useful in technical writing is to just understand the constructs of the format. If you want to work in a schema like DocBook, you should find the structures in DocBook that can be mapped to similar structures in WordML, and just transform back and forth. Then, for any additional structures that don't map, leverage the custom defined schema support. I posted back in the summer about how you can get started working with custom schema in WordML: http://blogs.msdn.com/brian_jones/archive/2005/07/26/443572.aspx

    It's important to note though that our goal with the custom schema support was not to turn Word into an XML editor like XMetal. Instead we wanted to bring structure to existing Word scenarios. We've seen a large number of customer solutions that used things like styles and bookmarks to imply semantics for certain portions of their documents, and we wanted to make that easier and more robust. You can work with any schema you want in Word, but you'll find that the more complex the schema is, the harder it will be to work with (as you probably expected).

    -Brian

  • Anonymous
    January 31, 2006
    The comment has been removed

  • Anonymous
    February 01, 2006
    Maybe with an Optimus Keyboard it'll be more comfortable and thus it'll be worth it.

    Anyway, I think a really good equation editor should be a top priority. And I mean an equation editor with a quality similar to LaTeX. Not picture equations, please.

  • Anonymous
    February 08, 2006
    Brian, could you elaborate on how you've expanded Word 12 to allow programmatic infusion of data within a Web application? I am currently doing this in a Web app with Word 2003 XML docs and custom XML schemas. I load the XML into a DOM and use a custom class that lets me throw any business object or reader at it and fill the Word XML document with the data from the class or reader.

    It really works well but the code is somewhat involved in that I have to load up the WordML and my custom schemas and call a ProcessNodes function that uses XPath queries to find nodes named similar to the class or reader field name.

    Will this get easier in Office 12 ?

    Also, I would love to save this out to PDF programmatically. What are the chances of that? Will there be a class that I can use from .NET to do this?

    Sorry for the cross post...found these XML articles after the fact.

    Thanks

  • Anonymous
    February 13, 2006
    The comment has been removed

  • Anonymous
    March 27, 2006
    Links to blog posts that contain useful technical information for developers.  Open XML is a new standard, but there's some good information already available if you know where to look.

  • Anonymous
    June 08, 2006
    This is the third post by Zeyad Rajabi who owns the XHTML output from Word's new blogging feature. In...

  • Anonymous
    June 12, 2006
    If you're heading out to TechEd this week like I am, you should definitely plan on attending Tristan...

  • Anonymous
    June 14, 2006
    We have developed an application that creates charts with time series. The application has support for OLE and users frequently embed charts from our application in Word or PowerPoint.
    We have added support for "Document Summary Properties" in order to make indexing of our own documents available to Desktop Search. This works fine and is very popular.
    But users also want to search for charts that are embedded using OLE in Office documents. Is there any way we can make that possible? Can we implement IFilter on the OLE objects, supply Office with some searchable properties, or perhaps make use of XML somehow?

  • Anonymous
    June 20, 2006
    Brian,

    Is there a way I could modify the custom schema dynamically, while it is attached to an open document?

    Suppose I have created a custom schema which defines a set of possible elements in a document. Then the user wants to extend that set and add his own specific elements. In order for the document to keep validating against the schema, the schema needs to be modified. Can it be done? If not, what other options would you suggest for my task?

    What I've tried so far is disconnecting and reconnecting the schema, but I lose all tags in that case, so that won't do. You can also close Word and purge its cache (somewhere in Local Settings) so that the new schema takes effect, but this is tedious too...

  • Anonymous
    June 22, 2006
    Best of all people w can talk...

  • Anonymous
    October 23, 2006
    I posted earlier this year on the support for custom defined schema in wordprocessingML via the new content

  • Anonymous
    March 21, 2007
    I was just looking at Karel De Vriendt's ODEF (Open Document Exchange Formats) Workshop Conclusions