Introduction to Word documents
Now that folks have had a chance to work with Beta 1 for a few months, I wanted to take some time to give a high-level overview of the three different document formats. Today I'm going to focus on Word. Obviously there is a huge set of features and functionality in Word, and I won't be able to do much more than scratch the surface today (but hopefully this will be a good start).
Document
We use a large number of pieces of information to construct a Word document. If you focus only on the pieces that actually provide the content for the document, though, you can break it out into a collection of subdocuments. We call those subdocuments 'stories', and there are 6 top-level stories that make up a document:
- The main story - this is the core body of the document, and is really the only one that's required to make a document.
- Headers & Footers - There can be one or more of these, and they are tied to a section.
- Footnotes & Endnotes - Anchors for the footnotes and endnotes live in the body, but the actual content is stored separately.
- Subdocuments - There is a feature that allows for the document to be broken out into a collection of subdocuments.
- Frames
- Comments
Once you have the collection of stories, you then focus on the other parts of the file that specify the properties that should be used for those stories (i.e., layout, formatting, etc.). For the most part, all the stories in a document share a common set of properties. These properties are contained within:
- Style information
- Bullets and numbering information
- Font information
- Document settings
Style Information
A style defines a specific set of formatting properties that can then be referenced by a content object. A great example of a style would be the "Normal" paragraph style, which in Word 2003 is defined as having the following properties: Font = Times New Roman; Font Size = 12 point; Justification = Left; Line Spacing = Single.
Word supports five different style types:
- Paragraph Styles
- Character Styles
- Linked Styles (both paragraph and character)
- Table Styles
- List Styles
Style cascading (or inheritance) is a fairly important and complex area. Multiple style types can be applied to the same part of a file, so the properties must be applied in a specific order. It's possible for a property set by one style type to actually be removed or supplemented by other style types that follow it.
Styles of any given type can also inherit from other styles of that type. For example, the Heading 1 paragraph style is based on (and inherits from) the Normal paragraph style.
Here is a diagram that shows a simple view of how style information is applied. There are some additional complexities not outlined here, but this covers most cases.
If you look at the above diagram, you'll see that the first type applied is the Table style type. This will affect Tables, Paragraphs, and Characters (or runs) within that paragraph. The next level is the List style type. This affects the paragraph properties. A list style can also bring in a paragraph style, but that's a bit more complexity than I want to get into today. Paragraph and then Character styles are the next two applied, and the final piece is direct formatting, which will override everything else. That's why folks involved in more complex documents like to avoid direct formatting if at all possible, since you can then manage the styles, and don't have to worry about direct formatting overriding those styles.
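To make the ordering concrete, here is a minimal sketch of the cascade in Python. It is a simplified model and not how Word itself is implemented: styles are plain dictionaries, the "Heading1" and "Emphasis" properties are made up for illustration, and the extra complexities mentioned above (such as one style removing a property set by another) are ignored. The "Normal" properties are the Word 2003 ones listed earlier.

```python
# Order in which formatting sources are applied; later entries win.
CASCADE_ORDER = ["table", "list", "paragraph", "character"]

# Hypothetical style table. Only the "Normal" values come from the text above.
STYLES = {
    "Normal": {"based_on": None,
               "props": {"font": "Times New Roman", "size": 12, "justification": "left"}},
    "Heading1": {"based_on": "Normal", "props": {"size": 16, "bold": True}},
    "Emphasis": {"based_on": None, "props": {"italic": True}},
}


def resolve_based_on(style_id, styles):
    """Flatten one style by walking its based-on chain, applying parents first."""
    chain = []
    seen = set()
    current = style_id
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(styles[current])
        current = styles[current].get("based_on")
    props = {}
    for style in reversed(chain):
        props.update(style["props"])
    return props


def effective_formatting(applied, styles, direct=None):
    """Apply table, list, paragraph, and character styles, then direct formatting."""
    result = {}
    for kind in CASCADE_ORDER:
        style_id = applied.get(kind)
        if style_id is not None:
            result.update(resolve_based_on(style_id, styles))
    # Direct formatting is applied last, so it overrides every style.
    result.update(direct or {})
    return result


formatting = effective_formatting(
    {"paragraph": "Heading1", "character": "Emphasis"},
    STYLES,
    direct={"size": 20},
)
print(formatting)
```

In this sketch the directly applied size of 20 wins over the 16 from the hypothetical Heading1 style and the 12 inherited from Normal, which is exactly why direct formatting makes styles harder to manage.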
Now let's talk about this at the XML level, and how a style is applied. The properties of the style are contained in the style definitions:
And the paragraph then just references the style via the style ID:
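As a rough illustration of that pairing, the sketch below parses a small style definition and a paragraph that references it by style ID. The element names and the namespace follow the later published Open XML markup, so treat them as an assumption: the Beta 1 files may differ in detail. The style content itself (bold, size 32 half-points) is invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML spec (assumed; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Illustrative style definition; the property values are made up.
STYLES_XML = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:pPr><w:keepNext/></w:pPr>
    <w:rPr><w:b/><w:sz w:val="32"/></w:rPr>
  </w:style>
</w:styles>
"""

# A paragraph that points at the style through its ID.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t>Introduction</w:t></w:r>
</w:p>
"""


def qn(name):
    """Build the namespace-qualified name ElementTree uses for w: attributes."""
    return f"{{{W}}}{name}"


# Index the style definitions by style ID.
styles_root = ET.fromstring(STYLES_XML.strip())
styles = {s.get(qn("styleId")): s for s in styles_root.findall("w:style", NS)}

# Look up the style the paragraph refers to.
paragraph = ET.fromstring(PARAGRAPH_XML.strip())
style_id = paragraph.find("w:pPr/w:pStyle", NS).get(qn("val"))
style = styles[style_id]

based_on = style.find("w:basedOn", NS).get(qn("val"))
half_points = int(style.find("w:rPr/w:sz", NS).get(qn("val")))  # sizes are in half-points
print(style_id, "is based on", based_on, "with font size", half_points / 2)
```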
Bullets and Numbering
Although it's not always obvious, any bullet/numbering definition consists of nine levels, each of which has Paragraph properties (e.g. margins) and Item properties (e.g. bullet vs. numbering, numbering type, etc.) defined. The behavior of the numbering is specified in two parts: the Bullets & numbering definition, and then the actual Bullets & numbering instance, which is a specific instance of a given definition.
The Bullets and numbering definition specifies the properties for any or all of the nine levels. The instance then inherits from a definition: it includes a reference to that definition, plus any additional overrides for one or more levels.
Let's get into an example of how this would look in XML. Here is what a numbering definition looks like:
Then, after the numbering definition, there is a numbering instance that references the definition, and itself has an ID.
And the paragraph then just references the numbering instance via the list property settings.
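The definition/instance/paragraph chain can be sketched like this. The markup uses the element names from the later published Open XML spec (abstractNum for the definition, num for the instance, numPr on the paragraph), which is an assumption relative to Beta 1, and only two of the nine levels are spelled out to keep it short. The specific IDs, formats, and the start override are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML spec (assumed; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

NUMBERING_XML = f"""
<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0">
      <w:start w:val="1"/>
      <w:numFmt w:val="decimal"/>
      <w:lvlText w:val="%1."/>
    </w:lvl>
    <w:lvl w:ilvl="1">
      <w:start w:val="1"/>
      <w:numFmt w:val="lowerLetter"/>
      <w:lvlText w:val="%2)"/>
    </w:lvl>
  </w:abstractNum>
  <w:num w:numId="1">
    <w:abstractNumId w:val="0"/>
    <w:lvlOverride w:ilvl="0"><w:startOverride w:val="5"/></w:lvlOverride>
  </w:num>
</w:numbering>
"""

PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>
  <w:r><w:t>First item</w:t></w:r>
</w:p>
"""


def attr(element, name):
    """Read a w:-namespaced attribute."""
    return element.get(f"{{{W}}}{name}")


def resolve_level(numbering, num_id, ilvl):
    """Follow instance -> definition for one level, then apply instance overrides."""
    num = next(n for n in numbering.findall("w:num", NS) if attr(n, "numId") == num_id)
    abstract_id = attr(num.find("w:abstractNumId", NS), "val")
    abstract = next(a for a in numbering.findall("w:abstractNum", NS)
                    if attr(a, "abstractNumId") == abstract_id)
    level = next(l for l in abstract.findall("w:lvl", NS) if attr(l, "ilvl") == ilvl)
    resolved = {
        "format": attr(level.find("w:numFmt", NS), "val"),
        "text": attr(level.find("w:lvlText", NS), "val"),
        "start": int(attr(level.find("w:start", NS), "val")),
    }
    # The instance may override parts of the definition for specific levels.
    for override in num.findall("w:lvlOverride", NS):
        start = override.find("w:startOverride", NS)
        if attr(override, "ilvl") == ilvl and start is not None:
            resolved["start"] = int(attr(start, "val"))
    return resolved


numbering = ET.fromstring(NUMBERING_XML.strip())
num_pr = ET.fromstring(PARAGRAPH_XML.strip()).find("w:pPr/w:numPr", NS)
level_info = resolve_level(numbering,
                           attr(num_pr.find("w:numId", NS), "val"),
                           attr(num_pr.find("w:ilvl", NS), "val"))
print(level_info)
```

Here the paragraph picks up the decimal format from the definition, while the starting value comes from the override on the instance.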
Font Information
Often, you can't rely on a specific font being on a user's machine. To make sure a document being passed around still looks good on a machine that doesn't have a font used in the document, additional information can also be stored in the document. This is done in two ways: via the font embedding functionality, and via the font type data that we write out. The font type data specifies characteristics of the font which are used to find a suitable replacement when the specified font is unavailable.
Document Settings
All settings that are pertinent to the document are stored in separate parts within the document package. The settings can really be divided into two groups: those that affect presentation, and those that are just pure application settings.
The settings that affect presentation are things like compatibility options (e.g. lay out tables like Word 97), as well as web settings such as div behaviors or frameset data. The pure application settings are things like view or zoom state. They may affect how the document appears within the application, but not the actual layout of the document.
Story Content
So, let's get back into the concept of "stories" serving as the main building blocks of the document. Within each story, there is the actual content, which consists of block level structures:
- Paragraphs
- Tables
- Structured Document Tags (custom XML; smartTags; content controls)
- Range Permissions
And within each paragraph, there is a collection of inline structures:
- Runs
- Structured Document Tags (same as at the block level)
- Comments, tracked changes, bookmarks
- Drawings
- Fields
- Hyperlinks
There are a few basic structural rules that are in play here. First, all text in a word-processing document is contained within a run. A run is a region of text with a common set of properties. The second rule is that all runs must be contained within a paragraph. A paragraph, of course, is a collection of one or more runs that is displayed as a unit (this is analogous to the HTML <p> tag).
So let's look at an example. The following text:
The quick brown fox.
would look like this in XML:
Notice that a paragraph is just a flat list of runs. There is no additional nesting, which is different from the HTML <span> model. I'm not saying one is better than the other, just pointing out that it's different.
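For readers who want something to experiment with, here is a small sketch of the flat run model. It assumes that the word "quick" is bold (the formatting of the sample sentence is not visible in plain text), and it uses the element names and namespace from the later published Open XML spec, which may differ from Beta 1.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML spec (assumed; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# One paragraph, three sibling runs. Bold on "quick" is an assumption.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">The </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>quick</w:t></w:r>
  <w:r><w:t xml:space="preserve"> brown fox.</w:t></w:r>
</w:p>
"""

paragraph = ET.fromstring(PARAGRAPH_XML.strip())

# Each run carries its own properties; runs are never nested inside each other.
runs = []
for run in paragraph.findall("w:r", NS):
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    is_bold = run.find("w:rPr/w:b", NS) is not None
    runs.append((text, is_bold))

full_text = "".join(text for text, _ in runs)
print(runs)
print(full_text)
```

Changing the formatting in the middle of a sentence therefore means splitting the text into separate sibling runs rather than wrapping part of it in a child element.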
A paragraph may be at any location that allows for block level content. For example, it could be at the top level within a story (i.e., header, footer, main document); nested within a table cell; or nested within a structured document tag or some other structured markup.
Tables
Tables in Word (at least at the base level) are fairly similar to tables in HTML. A Word table consists of a table element which can have a set of properties assigned to it. Then within the table element is a collection of rows, and within each row is a collection of cells. Here is a basic example of a table in WordprocessingML:
Individual table cells can contain block level content. This means a table cell can contain not just a paragraph, but also another table. This allows for tables to be nested in other tables.
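The table/row/cell nesting, including a table inside a cell, can be sketched as follows. The element names (tbl, tr, tc) and the namespace are taken from the later published Open XML spec and are an assumption relative to Beta 1; the cell contents are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML spec (assumed; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

TABLE_XML = f"""
<w:tbl xmlns:w="{W}">
  <w:tblPr><w:tblW w:w="0" w:type="auto"/></w:tblPr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>A2</w:t></w:r></w:p></w:tc>
    <w:tc>
      <w:tbl>
        <w:tr><w:tc><w:p><w:r><w:t>nested</w:t></w:r></w:p></w:tc></w:tr>
      </w:tbl>
      <w:p/>
    </w:tc>
  </w:tr>
</w:tbl>
"""


def qn(name):
    """Build the namespace-qualified tag name for a w: element."""
    return f"{{{W}}}{name}"


def read_table(tbl):
    """Return rows -> cells -> block content, recursing into nested tables."""
    rows = []
    for tr in tbl.findall("w:tr", NS):
        row = []
        for tc in tr.findall("w:tc", NS):
            content = []
            for block in tc:
                if block.tag == qn("tbl"):
                    content.append(read_table(block))  # a table inside a cell
                elif block.tag == qn("p"):
                    content.append("".join(t.text or "" for t in block.iter(qn("t"))))
            row.append(content)
        rows.append(row)
    return rows


rows = read_table(ET.fromstring(TABLE_XML.strip()))
print(rows)
```

Each cell comes back as a list of block-level items, so a cell holding a nested table plus an empty paragraph shows up as a list containing another table structure followed by an empty string.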
Custom Defined XML
The custom defined XML support allows users to embed their own XML within a WordprocessingML file. For example, if you wanted to have the following structure in your document:
You could just insert that XML using a custom XML tag:
That gives you additional structure in your document, and allows you to parse the file looking for your structures.
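As a sketch of what "parsing the file looking for your structures" might look like, the example below walks a body and collects the text inside each custom XML tag. Everything specific here is hypothetical: the invoice/customer/total element names, the urn:example:invoice namespace, and the sample text are invented, and the customXml element with its uri and element attributes follows the later published Open XML markup rather than anything confirmed for Beta 1.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML spec (assumed; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical custom schema: invoice > customer, total.
BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:customXml w:uri="urn:example:invoice" w:element="invoice">
    <w:customXml w:uri="urn:example:invoice" w:element="customer">
      <w:p><w:r><w:t>Contoso</w:t></w:r></w:p>
    </w:customXml>
    <w:p>
      <w:r><w:t xml:space="preserve">Total: </w:t></w:r>
      <w:customXml w:uri="urn:example:invoice" w:element="total">
        <w:r><w:t>42.00</w:t></w:r>
      </w:customXml>
    </w:p>
  </w:customXml>
</w:body>
"""


def qn(name):
    """Build the namespace-qualified name for a w: element or attribute."""
    return f"{{{W}}}{name}"


def extract_custom_xml(body):
    """Map each custom element name to the text it wraps (block level or inline)."""
    found = {}
    for node in body.iter(qn("customXml")):
        name = node.get(qn("element"))
        found[name] = "".join(t.text or "" for t in node.iter(qn("t")))
    return found


custom = extract_custom_xml(ET.fromstring(BODY_XML.strip()))
print(custom)
```

Note that the custom tags wrap ordinary paragraphs and runs, so the document still renders normally for a consumer that ignores them.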
Sections
Sections in a word-processing document specify a number of properties. By default, a document contains one section, but additional sections can be inserted to either change some of those properties for a specific portion of the document, or even just to create some additional structure (such as a page break).
The types of information that live with a section are:
- Page properties (page size; page orientation; margins)
- Header/footer references
- Footnote/endnote properties
- Column properties
- Line numbering
- Text direction (RTL vs. LTR; top-to-bottom vs. bottom-to-top)
There are four types of sections: Continuous; Next page (start this section on the next page); Even (start on the next even page); and Odd (start on the next odd page).
The last section of the document (which for the most part is the only section) is stored at the end of the body. All other additional sections inserted are stored as a paragraph property.
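The storage rule above can be sketched as a small routine that splits body paragraphs into sections: a section ends at any paragraph whose properties carry section properties, and the final section is closed by the section properties at the end of the body. The sectPr, type, and pgSz names come from the later published Open XML spec and are an assumption for Beta 1; the paragraph text is placeholder content.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML spec (assumed; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Cover page</w:t></w:r></w:p>
  <w:p>
    <w:pPr>
      <w:sectPr><w:type w:val="nextPage"/></w:sectPr>
    </w:pPr>
    <w:r><w:t>End of first section</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Second section text</w:t></w:r></w:p>
  <w:sectPr>
    <w:type w:val="continuous"/>
    <w:pgSz w:w="12240" w:h="15840"/>
  </w:sectPr>
</w:body>
"""


def qn(name):
    """Build the namespace-qualified name for a w: element or attribute."""
    return f"{{{W}}}{name}"


def section_type(sect_pr):
    """Return the declared section type, or None when it is not specified."""
    type_el = sect_pr.find("w:type", NS)
    return type_el.get(qn("val")) if type_el is not None else None


def split_sections(body):
    """Group paragraph text by section, returning (type, paragraphs) pairs."""
    sections, current = [], []
    for child in body:
        if child.tag == qn("p"):
            current.append("".join(t.text or "" for t in child.iter(qn("t"))))
            sect_pr = child.find("w:pPr/w:sectPr", NS)
            if sect_pr is not None:  # this paragraph ends a section
                sections.append((section_type(sect_pr), current))
                current = []
        elif child.tag == qn("sectPr"):  # the last section, stored at the end of the body
            sections.append((section_type(child), current))
            current = []
    return sections


sections = split_sections(ET.fromstring(BODY_XML.strip()))
print(sections)
```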
Headers and Footers
There are three types of headers and footers. The main one is the Odd page header. If that's the only one that exists, then it is applied to all pages of the document. Optionally, an override header can exist for the even pages, as well as for the first page. Headers are specified for each section, so if you want a different header used, you'll need to create a new section.
Headers and footers are stored in separate parts within the package. There is one part for each header and each footer. Each section then refers to its header(s) and footer(s) by an explicit relationship reference:
The type of the header or footer is actually declared at the root of the part.
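Here is a rough sketch of how a consumer might resolve those references to parts. One caveat: it follows the later published Open XML markup, where each reference in the section properties carries both a type (default, first, even) and a relationship ID, so it does not match the Beta 1 detail described above of declaring the type at the root of the part. The relationship IDs and part names (header1.xml and so on) are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespaces from the published Open XML spec (assumed; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
NS = {"w": W}

# Section properties that point at header/footer parts by relationship ID.
SECT_XML = f"""
<w:sectPr xmlns:w="{W}" xmlns:r="{R}">
  <w:headerReference w:type="default" r:id="rId1"/>
  <w:headerReference w:type="first" r:id="rId2"/>
  <w:footerReference w:type="default" r:id="rId3"/>
</w:sectPr>
"""

# The relationships part that maps each ID to a target part in the package.
RELS_XML = f"""
<Relationships xmlns="{PKG_REL}">
  <Relationship Id="rId1" Type="{R}/header" Target="header1.xml"/>
  <Relationship Id="rId2" Type="{R}/header" Target="header2.xml"/>
  <Relationship Id="rId3" Type="{R}/footer" Target="footer1.xml"/>
</Relationships>
"""

# Map relationship ID -> target part name.
rels_root = ET.fromstring(RELS_XML.strip())
targets = {rel.get("Id"): rel.get("Target")
           for rel in rels_root.findall("p:Relationship", {"p": PKG_REL})}

# Resolve every header/footer reference in the section to its part.
sect_pr = ET.fromstring(SECT_XML.strip())
resolved = {}
for kind in ("header", "footer"):
    for ref in sect_pr.findall(f"w:{kind}Reference", NS):
        key = (kind, ref.get(f"{{{W}}}type"))
        resolved[key] = targets[ref.get(f"{{{R}}}id")]

print(resolved)
```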
Closing
Well, that was probably enough for one day. I know that I kept this still at a relatively high level. I'll definitely try to dig deeper into the details on the areas that folks are more interested in.
I'm going to be offline for the next week or so, but hopefully I'll have time to at least check comments every once in a while.
-Brian
Comments
Anonymous
February 02, 2006
Great information.
So are you telling us that if a table will "live" in the document, we should start with the Table Style first and format in the order illustrated above? Is there harm when a Paragraph style is applied to a cell?
Again, great information!
Jeffrey
Anonymous
February 03, 2006
So Word 12 docs are broken up into XML modules. Does this affect what we had in Office 2003 VBA? Will these modules be aggregated into a single XML property such that Range.XML() remains unchanged?
Anonymous
February 03, 2006
Bryan: We're definitely not changing the result of Range.XML this release - it will continue to return WordprocessingML that matches the Word 2003 XML schemas. There will also be a method to return an XML serialized version of the new file format, whose schemas are different.
Anonymous
February 04, 2006
Only a question: Will MathML be natively supported in Microsoft Word 12?
Anonymous
February 06, 2006
All great stuff.
Can I request a further article on how the different files fit together in a package, especially the customXML stuff? I've worked out (I think) the four places where the name of the xml data files is defined and referenced, but I'm interested in hearing it from the horse's mouth, so to speak. I'd also like to see an example with more than one xml datastore in the package (and with more meaningful names than item1.xml, too!)
Thanks in advance
Anonymous
February 07, 2006
Very interesting... wish I was in the beta crowd too!
A question on Word and XML and datafiles which I didn't see touched on. Word 11 can take in data for a merge from a wide variety of formats including CSV, RTF, straight from a database, etc... but not XML. Will Word 12 import XML data directly? Without the need for ASP or other work arounds?
Thanks for the intriguing look at Word internals.
Anonymous
February 08, 2006
I currently use XML Schemas in Word XML files that I create in Word. I then use these files as templates from within my ASP.NET applications to load the XML into a DOM, cycle the XML Schema elements, and infuse SQL data into the document to produce tailored Word Documents and save to Disk as Word XML files.
I now have the need to save these Word XML files as PDF files and cannot find a good tool to do this programmatically from the ASP.NET application. If you know of a third party tool that can perform this operation, I'd love to know what it is.
But my real question is: will there be a way, with all the new XML and PDF functionality, to perform the task above in an easier fashion AND to save the Word XML to a PDF programmatically?
Thanks in advance, keep up the good work!
Anonymous
February 10, 2006
Hey Brian.
Good stuff, but one of the things that I haven't seen mentioned yet is Word's OfficeArt. Currently, Word seems to still use vml to describe autoshapes and vector objects, while Excel and PowerPoint both leverage the new OfficeArt schemas. Is Word planning to continue using vml, or will this eventually be changed before the final release so that even Word uses oartml?
Thanks, and my condolences regarding the super bowl. Maybe next year, buddy.
Anonymous
February 17, 2006
Well, in the WordprocessingML, it would be more like this:
<r>
<t xml:space='preserve'>do not do this </t>
</r>
<r><rPr><i/></rPr>
<t>yet</t>
</r>
<r>
<t>.</t>
</r>
Notice that on the first text node, it specifies that leading and trailing space should be preserved. If that wasn't there, then when you opened the file "this" and "yet" would have appeared as one word (where the 2nd part is italicized).
-Brian
Anonymous
February 25, 2006
Hey Brian, thanks for keeping up with the posts.
I just got off yet ANOTHER contract where the client was wanting to produce data infused documents and send out as PDFs. They also need to have the ability to alter the document template as business needs dictate. Not sure about others, but I'm finding this requirement on just about every workflow related project that I get involved with lately.
Word's ability to work with custom XML schemas is awesome for programmatically infusing data into word templates stored on disk. Unfortunately, the lack of a programmatic way to go from WordML to PDF is preventing the above scenario from becoming a reality. It forces us to scour the web looking for the final piece to the puzzle. So far I have found only one solution that claims to do the conversion of WordML to PDF but it is cost prohibitive ($1600 which is ridiculous in my opinion, at 4 times the cost of Word itself).
You mentioned going from WordML to XSL-FO, and then to PDF. Could you elaborate on this, and what all is needed to go from WordML - to XSL-FO - to PDF? Also, if I'm barking up the wrong tree and you're not the one to talk to about generating PDFs from WordML (programmatically from .Net apps) then let me know. I looked through Cindy's blog (now defunct) and the other guys', and have seen my same question asked several times by others, but have yet to see any quality answers on the topic.
If MS recognizes the importance of embedding PDF generation into the entire Office suite, then I would think that they would recognize the importance of exposing this same functionality to developers for workflow related applications.
If there was a specific blog on this topic, I would be willing to bet that there would be substantial interest.
Thanks a lot for your time and comments.
Anonymous
March 06, 2006
Brian, Please can you tell me. Will the ability to add my own custom xml tags be available across the whole Office 12 range?
Or just the professional set (as is currently)?
I would like to use custom xml tags to define business structures and then process that xml "in document" via some custom code.
Anonymous
March 06, 2006
Hey Ian, it will be available in all SKUs: http://blogs.msdn.com/brian_jones/archive/2005/09/20/472146.aspx
-Brian
Anonymous
March 21, 2006
Hey Brian;
I have an xml doc. I have a dotx template with embedded custom xml using a predefined xsd. Now I want to create a new document by merging the xml and the dotx to create a docx that ultimately will be PDF'ed. The question is - or am I just dumb - how do I get the xml merged with the template?
Anonymous
March 27, 2006
Links to blog posts that contain useful technical information for developers. Open XML is a new standard, but there's some good information already available if you know where to look.
Anonymous
June 15, 2006
Is there any XSL to convert WordML to RTF?
Please?
With image support would be nice......
Anonymous
July 11, 2007
I thought it might be worthwhile to give a bit of an overview of the WordprocessingML model that you