WordprocessingML Document Model

I thought it might be worthwhile to give a bit of an overview of the WordprocessingML model that you see in the Open XML standard. Some people who've worked with other formats like HTML or DocBook are curious why WordprocessingML doesn't use the same model as either of those formats, and there's actually a pretty straightforward reason. If you find this interesting, you might also want to check out the Introduction to Word document post I made last year.

WordprocessingML was built with the legacy base of Word documents in mind, and it follows the original model you see in those documents pretty closely. This is why Open XML exists. There are other formats out there, but Open XML is the only one that was designed with compatibility with legacy Office documents in mind. We had actually tried other approaches in the past, such as the original SpreadsheetML format in Office XP and the HTML support that was in Office 2000. In each of those cases, though, when we tried to convert existing binary documents into these new models, it quickly showed itself to be fairly problematic. Just take a look at the HTML output from Word for any complex document if you don't know what I'm talking about. In the original SpreadsheetML format we actually used the ISO date format that you're seeing a lot of the Anti-OpenXML folks claim we should use in Open XML. That also proved to be very problematic, both for formula fix-up and for general loading performance, which is why we went with a simple integer value and kept the 1900 leap year bug.
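To make the "simple integer value" and the 1900 leap year bug concrete, here is a rough Python sketch of how a reader might turn a date serial number into a calendar date. It assumes the 1900 date system, where serial 1 is January 1, 1900 and serial 60 is the nonexistent February 29, 1900. The function name serial_to_date is made up for illustration, and this is not how Excel itself is implemented.

```python
from datetime import date, timedelta
from typing import Optional


def serial_to_date(serial: int) -> Optional[date]:
    """Convert a whole-day serial number to a calendar date.

    Assumes the 1900 date system: serial 1 is 1900-01-01, and serial 60
    is the fictional 1900-02-29 kept for compatibility. Returns None for
    serial 60 because that day does not exist in the real calendar.
    """
    if serial < 1:
        raise ValueError("this sketch only handles serials >= 1")
    if serial == 60:
        return None  # the phantom leap day
    if serial < 60:
        # Before the phantom day, serials count from 1899-12-31.
        return date(1899, 12, 31) + timedelta(days=serial)
    # After the phantom day, every serial is one higher than a true day count.
    return date(1899, 12, 30) + timedelta(days=serial)
```

The only special handling is around serial 60. Everything after it is a single addition, which is the kind of cheap conversion the numeric representation allows.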

Back to Word though... with WordprocessingML, the model we use is actually a very flat model. Jim Mason talks about this in the Open XML discussions going on in the US V1 committee. You can read this dialog in the archive that Sun's Jon Bosak set up: https://www.ibiblio.org/bosak/v1mail/200706/2007Jun05-132133.eml Here's some of what Jim had to say:

Word came to Microsoft from a program originally developed at Xerox PARC. PARC was very much into object-oriented programming, like Smalltalk, and Word partakes of that spirit. It's also related to stack-oriented, postfix-operator programming languages. Patrick mentions "sections": in Word, a section is just a fairly high-level container to which properties, like margins or page numbering, can be attached. So think of text entry as shoving bytes on a stack. Then at some point you put in an operator that pops the stack and applies itself to everything that comes off. Nesting of sections is somewhat limited; they tend to pop the stack.

I've spent a lot of time working with folks who were long-term users of WordPerfect and who were getting frustrated making the transition to Word. All of them were used to running WordPerfect in split-screen "reveal codes" mode. That's easy when a file is linear and text and codes just came one thing after another. They were always looking for a similar mode in Word and couldn't understand that Word didn't have any such linear structure. They were also used to seeing margins set first, rather than as a postfix property of a whole section after the section content was complete, and they couldn't understand how Word was doing things.

Another aspect of Word's data structure, I've come to recognize, is that a lot of it is implicit. All the data necessary to create the rendition of a Word file in Word itself is in the file, but it's not easily extracted. It becomes evident only when the file is instantiated in the program, which knows how to deal with the implicit parts.

In XML we're used to clearly marked (tagged) containment and hierarchies. We think, for example, of having a <sect1> that contains a <title> and a bunch of <para> elements. If we have a <sect2>, it's clearly delimited inside the <sect1>. That's the sort of thing we see in applications like DocBook (Norm, correct me if you need to).

Of course there have been unstructured document models, like HTML, that tagged only locally, and you could put <H1> and <P> and <H2> in the file in any crazy order. Word appears to be like that: you can put in text and apply styles in any crazy order, so you can have the first paragraph of a document in "Normal" style, then one in "Heading2", one in "NumberedList" and then one in "Heading1". But just as a Web browser attempts to recover from any crazy codes you throw at it, Word attempts to build -- in memory -- a structure from the styles. You can see this structure if you put Word into "Outline View" rather than "Layout View" or "Normal View". This structure acts pretty much like the hierarchical structure with proper containment that we expect from the usual XML application. But it is, so far as I can tell, something that happens only when the file is in memory, and it's driven entirely from the styles. (By default, it works with Word's predefined Heading1, Heading2, etc., styles, though you can map user-generated styles to act like those.)

Jim is pretty close here, but is a bit off on this last part that I've italicized. Word never creates this hierarchy.  Word is always a streaming (cp-based) format.  The "outline" view is an artificial inference of the structure.

So the question isn't whether or not it's a conventional XML structure. Both ODF and Open XML are XML structures; one is flat and the other has a deeper hierarchy (or at least attempts to do that).

The WordprocessingML format represents a stream of content (the data) and the formatting associated with it. Word does not work on this data in a hierarchical manner, nor does it infer a hierarchy when working with it. As such, there is no hierarchy stored in the file format. The way that you impose any type of hierarchy or semantics is through the use of structured document tags (SDTs) like content controls, custom XML, etc. That hierarchy will then be reflected in the document content and in the file format.
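Here is a small Python sketch of what that flat stream looks like to a consumer. The SAMPLE fragment is a hand-written, stripped-down document body (a real package part carries more markup), and the helper name flat_paragraphs is hypothetical. The point is only that the headings and the body text are siblings under the body element, each carrying its style as a property.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample for illustration; styles appear in "any crazy order".
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Intro text</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Details</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Overview</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def flat_paragraphs(xml_text):
    """Return (style id or None, text) for each top-level paragraph, in order."""
    body = ET.fromstring(xml_text).find("w:body", NS)
    result = []
    for p in body.findall("w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # A paragraph's text is spread over runs; join every w:t inside it.
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        result.append((style_id, text))
    return result
```

Calling flat_paragraphs(SAMPLE) gives back a plain list in document order. Nothing in the markup says that "Details" belongs to "Overview" or the other way around.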

If you intend to use WordprocessingML as a pure data interchange format, and you want the data to be hierarchical in nature, then you will want to use the SDTs in your document for this hierarchy. We actually do this today in our workflows at Microsoft, such as our spec library, where we leverage the SDTs to structure the specs for easy interrogation of the spec collection.
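As a rough illustration of reading that kind of hierarchy back out, here is a Python sketch that walks nested block-level SDTs. The sample markup is hand-written and simplified, the tag values "spec" and "requirement" are invented for the example, and sdt_tree is a hypothetical helper, not anything from our spec library tooling.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample: one SDT tagged "spec" containing one tagged "requirement".
SDT_SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:sdt>
      <w:sdtPr><w:tag w:val="spec"/></w:sdtPr>
      <w:sdtContent>
        <w:p><w:r><w:t>Feature spec</w:t></w:r></w:p>
        <w:sdt>
          <w:sdtPr><w:tag w:val="requirement"/></w:sdtPr>
          <w:sdtContent>
            <w:p><w:r><w:t>Must open legacy files</w:t></w:r></w:p>
          </w:sdtContent>
        </w:sdt>
      </w:sdtContent>
    </w:sdt>
  </w:body>
</w:document>"""


def sdt_tree(parent):
    """Recursively collect the SDTs that are direct children of `parent`."""
    nodes = []
    for sdt in parent.findall("w:sdt", NS):
        tag = sdt.find("w:sdtPr/w:tag", NS)
        content = sdt.find("w:sdtContent", NS)
        paragraphs = content.findall("w:p", NS) if content is not None else []
        # Text of the paragraphs directly inside this SDT (not nested ones).
        text = " ".join(
            "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
            for p in paragraphs
        )
        nodes.append({
            "tag": tag.get(f"{{{W}}}val") if tag is not None else None,
            "text": text,
            "children": sdt_tree(content) if content is not None else [],
        })
    return nodes


body = ET.fromstring(SDT_SAMPLE).find("w:body", NS)
tree = sdt_tree(body)
```

Here the nesting comes straight from the markup: the "requirement" node is a child of "spec" because its SDT sits inside the other one's content, not because of how either one is formatted.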

Another approach folks have used to get semantics out of the document is through the use of styles. Remember, though, that styles are flat, since they are just a property of the paragraph or run of text.

The vital thing to understand is that formatting itself should not be viewed as structure. The "view" of the data is not PART of the data. The "view" is separate. The fact that you have Heading 2 after Heading 1 does not imply a structural relationship between the two headings, merely that they LOOK different. In a world that espouses the separation of data and view, this is a great model. There is no attempt to try to invent some hierarchical representation based on the view of the data.

There are places where Word actually attempts at runtime to give the user an impression of hierarchy based simply on the formatting, but this is artificial. As I said, Word never actually creates this hierarchy.  WordprocessingML is always a streaming format.  The "outline" view is an artificial inference of the structure that can be provided by the application.
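To show what an application-side inference like that might look like, here is a Python sketch that builds an outline from a flat list of (style, text) pairs. It assumes style names of the form Heading1 through Heading9 map to outline levels, which is an assumption for the example; the function names are hypothetical and this is not Word's actual algorithm.

```python
import re


def heading_level(style):
    """Map an assumed 'HeadingN' style name to its level, else None."""
    match = re.fullmatch(r"Heading([1-9])", style or "")
    return int(match.group(1)) if match else None


def infer_outline(paragraphs):
    """Build a nested outline from a flat sequence of (style, text) pairs."""
    root = {"level": 0, "title": None, "body": [], "children": []}
    stack = [root]  # path from the root to the most recent heading
    for style, text in paragraphs:
        level = heading_level(style)
        if level is None:
            # Non-heading text attaches to whichever heading came last.
            stack[-1]["body"].append(text)
            continue
        # Close any open headings at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"level": level, "title": text, "body": [], "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


# The same "crazy order" Jim describes: Normal, Heading2, NumberedList, Heading1.
outline = infer_outline([
    ("Normal", "Intro"),
    ("Heading2", "Stray"),
    ("NumberedList", "Item"),
    ("Heading1", "Chapter"),
    ("Heading2", "Section"),
])
```

The tree exists only in the consumer's memory. Change the level mapping and you get a different tree from exactly the same file, which is why this inference is a view and not part of the data.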

This is why we've tried to stay really clear on the goals of Open XML. We needed to create an XML format that developers could use, but also one that our customers could move their existing binary documents into without fear of their documents being negatively affected. This is also why we think that choice of formats is a good thing. Office (the application team) knew that we needed Open XML in order to best meet our customers' current needs. But as other applications come along, or if customers actually express an interest in another format, we will help improve the translation tools out there so people have that choice.

This also shows why ISO standardization is important. These formats meet an important need, and if you look at the numbers of downloads, you'll see that the number of Open XML documents out there will continue to grow. Over the coming years we will see millions of Open XML documents out there. If ISO approves the Open XML standard, then it helps to ensure that everyone will have long term access to the specs that help them read and write these files.

Open XML Community Quote of the day


Captaris Inc — United States

"Captaris has customers all over the globe who capture, manage and deliver documents that support vital business processes. Consequently they rely on international document standards to insure the open access, portability, security and long-term retrieval of the information. Our customers also want to have a choice in open document formats. On behalf of our customers, we support Open XML and believe it should be ratified by the ISO national standards bodies."

- Dan Lucarini, Senior Director

-Brian

Comments

  • Anonymous
    July 11, 2007
    Very interesting.  This streaming approach is what Lotus Notes uses in its rich text format as well, so I am very familiar with it, but I had not realized that this was why Open XML has some of the structure it does.  It actually makes it somewhat similar to the DXL format used by Lotus Notes to dump content in XML for processing.  I can see why it might be a challenge to convert that to a more "true" hierarchical format, but you also make processing of the streamed format much more complex.  Again, this makes a lot of sense as a  storage facility for legacy documents, which I know is one major goal, but it makes less sense as a general purpose modern file format.  It was for these very reasons that XHTML was developed as a follow on to HTML, although it could only correct some of the inadequacies. In any case, thanks for the elucidation, as it should make our development with Open XML (if we do indeed go that way) easier to design.

  • Anonymous
    July 11, 2007
    Hi Brian,
    > In the original SpreadsheetML format we actually used
    > the ISO date format that you're seeing a lot of the
    > Anti-OpenXML folks claim we should use in Open XML.
    We are not "Anti-OpenXML". I am merely doing my job in the standards process in reviewing the Microsoft Office Open XML spec as requested. Pointing out issues with evidence is part of the job. It is up to Microsoft/Ecma to deal with it.
    > That also proved to be very problematic both from a
    > formula fix-up issues as well as general loading
    > performance, which is why we went with a simple integer
    > value and kept the 1900 leap year bug.

  1. The formula "fix-up" should not be pooh-pooh'ed away and swept under the carpet under the guise of "legacy"  issues. Binary MS documents will always have to be "translated" into MSOOXML. Take advantage of this opportunity to fix the wrongs set by VisiCalc or Lotus 1-2-3 or whatever application that was the market leader 20 years ago.
  2. The "General Loading Performance" issue is proven not an issue when it comes to ISO 8601 dates vs numeric (internal representation) dates. Sure ISO 8601 dates take approximately 50% longer to decode from a string over numeric dates, but to put this in human perspective, it only takes an extra 0.1 seconds (0.1s) to decode 1 million ISO 8601 dates. Given that an average spreadsheet would only have a fraction of a million date entries, I would confidently say that the decision to use internal numbers to represent dates is a wrong one, and "performance" is not the reason why it was chosen.
  3. Dates are not always just "simple integers". When there is a time element, they become floats too. So the question remains: a) Why decrease readability in this new XML format? (MSOW2K3XML was in ISO 8601) b) Why have confusing epochs when ISO 8601 automatically removes this bizarre requirement c) Why limit dates to 1900s? Microsoft Excel users have to resort to string based sorts for their prior-1900 date needs. d) What is the justification to NOT use ISO 8601? Ref: "Malaysia's History is ill-formed" http://www.openmalaysiablog.com/2007/06/malaysias_histo.html "Will readability hinder Performance?" http://www.openmalaysiablog.com/2007/06/will-readabil-1.html Brian, I hope you are taking this constructively. I have provided numerous suggestions in improving the MSOOXML spec. The reluctance of Microsoft making any changes just makes your chances of getting MSOOXML through to ISO so much harder! Dont start blaming IBM (or some other convenient scapegoat) if your efforts fail; you just have yourselves to blame for not listening to your "customers". Please do not use the "legacy base" argument again. If Microsoft was so interested in providing access to information on the legacy document base of all the documents, then the effort would be the standardisation of the Office BINARY formats. Going through another abstraction layer  in XML to represent binary information is definitely not the most efficient way to do it. Call it what it is: The legacy problem, and the future file format issue.
  4. The legacy problem can be solved if Microsoft standardised the binary Office formats (and Ecma would be ideal for this)
  5. The future file format problem could be solved with MSOOXML or ODF or UDF or what-have-you. But ALL YOUR CUSTOMERS will benefit from just one standardised format. Trying to combine the two distinct problems just convolutes matters, and really compromises the potential of both efforts. yk.
  • Anonymous
    July 11, 2007
    [quote]Sure ISO 8601 dates take approximately 50% longer to decode from a string over numeric dates[/quote]This seems very uncertain to me especially since ISO 8601 dates have extra elements to convert like timezones and periods. As a former assembly programmer I would think 500%-1000% more parsing and interpreting processing time would be more likely. Do you have any numbers to show that converting large amounts of ISO 8601 dates only takes 50% more time or of converting a million of them in only 0.1 of a second? Also there is added complexity as subtracting ISO dates could lead to either a period or a numeric value whilst in a numeric representation those are the same. Also I do not understand you using a readability issue. If that is the main reason for using ISO 8601 then it is really a non-issue as the format just is not created for human reading and it is neither a goal nor a requirement. They should have dropped the date issue from Lotus 1-2-3 or at least use a deprecated conversing format so it would not enter in new document. On the other hand, although ugly, it is a very low impact issue in implementations. I do think that OOXML could add additional ISO 8601 support. Then it can support both speedy performance as well as complex dates, timezones and dates between 1582 and 1900 depending on its implementation. I would certainly advise MS and Ecma to add the limited XML standard version of ISO 8601 dates to SpreadsheetML for ISO issue resolution or for its next version.

  • Anonymous
    July 12, 2007
    The comment has been removed

  • Anonymous
    July 12, 2007
    Hmmm, couldn't you fix date issue #2 by changing the start date of the Excel calendar (i.e., day 0 would become 12/31/1899.) This would eliminate the recalculation problem--only those formulas referring to the first two months of 1900 would see different results. (And people shouldn't be using these dates because of the erroneous leap year!) It'd probably be worth amending this error simply to take away the #1 objection to OOXML. Also, it's not only old WordPerfect users who are baffled by sections. They're probably the single most unintuitive feature in Word. (There is no indication that page layout, text direction, etc. of the preceding x pages are stored in the section break that follows them. There is no indication that they function like containers, or which pages they contain.) Users press backspace or cut text and bam!, the margins shift, the orientation turns to landscape, page numbers change--all for no apparent reason. (Section breaks are hidden by default.)

  • Anonymous
    July 12, 2007
    >> Sure ISO 8601 dates take approximately 50%
    >> longer to decode from a string over numeric dates
    > This seems very uncertain to me especially since
    > ISO 8601 dates have extra elements to convert like
    > timezones and periods. As a former assembly programmer
    > I would think 500%-1000% more parsing and interpreting
    > processing time would be more likely. Do you have any
    > numbers to show that converting large amounts of ISO
    > 8601 dates only takes 50% more time or of converting a
    > million of them in only 0.1 of a second?
    Hi Hal, Yes, I do have numbers to show that converting large amounts of ISO 8601 dates take an incredibly small amount of time. On my 2 year old 1.7GHz laptop, 5.5 million ISO dates are converted from strings in memory in just over 1.5 seconds. The investigation is written here: "Will readability hinder Performance?" http://www.openmalaysiablog.com/2007/06/will-readabil-1.html The source code is also available there for compilation and verification. yk

  • Anonymous
    July 12, 2007
    Brian,
    > Just update the spec to say that negative values are
    > allowed on the number and you then have pre-1900 dates.
    Yes, allowing negative numbers would solve that problem.
    > #2 has been discussed ad-nauseam.
    With no resolution but to ignore the problem. Using ISO dates would fix it once and for all. SpreadsheetML v2003 (according to your take on its evolution) used ISO dates. Why did it regress back to internal numbers?
    > This has an impact on the performance of load and
    > save (especially when you have a larger spreadsheet
    > which could easily have hundreds of thousands of
    > dates)
    The numbers show otherwise, Brian. Hundreds of thousands of dates will take up, at most, 0.3 seconds of extra wait time. The bottleneck would not be processing; it would probably be disk I/O speed. It only takes 186 micro seconds per ISO date. Conversion between the two, evidently, is no big deal. Performance of ISO dates will NOT have an impact. yk.

  • Anonymous
    July 12, 2007
    > 186 micro seconds per ISO date Sorry, Correction; 281 micro seconds per ISO date. yk.

  • Anonymous
    July 13, 2007
    yk, If it really were 281 micro seconds extra per date, then there wouldn't be any question at all around this issue... it would just be unacceptable. That would mean that parsing a spreadsheet that was just 10,000 rows by 24 columns would take an extra 67 seconds to parse (over a minute!). Luckily it's not that bad, otherwise ODF would be almost impossible to work with for any spreadsheet. The OpenOffice guys have probably done a pretty good job writing efficient parsers for ISO dates, so the best way to test this out is to start with them. In OpenOffice create a decent sized worksheet with a bunch of numbers. I used one that was 24 columns and 10,000 rows. Definitely a decent size, but nowhere close to the big ones we see in the real world. Save that file as "numbers.ods". Now select all the cells and format them as dates, and save that file as "dates.ods". Now just measure the time it takes to open each file and the time it takes to save each file.
    Here are the numbers I came up with:
    Numbers.ods Open = 8.25 seconds
    Dates.ods Open = 13.31 seconds
    Numbers.ods Save = 7.82 seconds
    Dates.ods Save = 18.41 seconds
    HUGE difference between whether something is a number or a date since they have to do that conversion into the ISO date format.

    Now, if you have Excel 2007 do the same thing with an .xlsx file. Since .xlsx uses the number approach for storing dates in the format and .ods uses the ISO string, you can see if it makes a difference.
    Numbers.xlsx Open = 2.47 seconds
    Dates.xlsx Open = 2.35 seconds
    Numbers.xlsx Save = 2.03 seconds
    Dates.xlsx Save = 1.85 seconds
    NO SIGNIFICANT difference at all between dates and numbers (in fact the dates were actually faster). This is because the dates are being stored with the faster numeric representation. -Brian

  • Anonymous
    July 13, 2007
    For the full picture we need XLS format results for Excel 2007 and OpenOffice.org. Then real slowdown is closer to (T(OOo,ODS)/T(Excel,XLSX))/(T(OOo,XLS)/T(Excel,XLS)), adjusted to the relative efficiency of OOo and Excel.

  • Anonymous
    July 14, 2007
    It is interesting that real world Excel date-handling performance of pre-1900 dates is zero. Besides, it is not an additional 67 seconds - just 67 seconds. The difference would be arrived at by loading Excel 2007 on the 1.7GHz Centrino laptop and doing the same test (converting) and then subtracting that time. In the example : ISO 8601 - 1.5 seconds/5.5 million dates on 1.7GHz Centrino vs Excel (so far) 2.35 seconds/.25 million dates on undefined machine.

  • Anonymous
    July 14, 2007
    The comment has been removed

  • Anonymous
    July 15, 2007
    It is no bug, it's a feature? Please stone ECMA.

  • Anonymous
    July 15, 2007
    Hi Brian,
    > If it really were 281 micro seconds extra
    > per date, then there wouldn't be any
    > question at all around this issue... it
    > would just be unnacceptable. That would mean
    > that parsing a spreadsheet that was just
    > 10,000 rows by 24 columns would take an
    > extra 67 seconds to parse (over a minute!).
    You just found my first mistake in my calculation! It's not micro, it's nano seconds. I must have added the 1000 * without updating the units. Thanks. So it's 281 nano seconds PER ISO date. (Not extra) 24 x 10K dates will take a grand total of 0.067s. Not 67s. yk.

  • Anonymous
    December 30, 2007
    It's been quite a year for those who have been blogging about the Open XML file formats. Here's a look