Example Office 12 XML File

Article
06/20/2005

I wanted to get an example document posted so people get a chance to look through the new Office 12 XML formats and see what the similarities and differences are with the Word 2003 XML format. I took a basic document and saved it out in the new format, as well as in Word 2003's XML format. This is still very early code, so a number of the structures could still change, but I'm pretty confident this is close to what the final version will look like. Also, the majority of the file size is taken up by an embedded picture, so you won't see a significant file size saving with the new format compared to the current binary formats.

You will see right away that it's just pure XML representing the file. I read a post on a blog today where the author mistakenly thought these new formats weren't XML, but instead just XML-based. I guess if that's referring to the fact that we use ZIP as a container it would be true, but other than ZIP, everything else is pure XML following the W3C XML 1.0 standard. I still remember when we decided to go with ZIP as the container... it was a pretty straightforward decision. There were already a number of other formats out there using XML and ZIP, so we figured that would be the best way to go if we wanted people to have an easier time working with our files. Using a single flat XML file wasn't really ever given serious consideration just because of the file size bloat. This was especially true for PowerPoint, where presentations often contain tons of pictures, and having to encode those to store in a single XML file just didn't make a lot of sense.

So anyone want to see an example of the format? If you download the following zip file, https://jonesxml.com/resources/BasicDocument.zip, you will see three documents that have identical content, but in different formats. There is a binary document (.doc) you can open in Word, and you'll see some text and a picture. There is then an equivalent .xml file that was saved in Word 2003 with the XML format. The third file is a .docx file that I saved using the latest build of Word 12. That's the file you guys will find the most interesting. Open the file using any ZIP tool, and you can start to explore. Let me give you a basic description of what you are seeing:

Root Folder

If you are using the shell's ZIP support (just rename the file to have a .zip extension), you'll see that at the root level of the package there is an xml file called [Content_Types].xml, and three folders: "_rels", "docProps", and "word".
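If you'd rather look at the root level from code than from the shell, here's a minimal Python sketch using the standard zipfile module. The package is a stand-in built in memory so the snippet runs on its own; the part names under "docProps" and "word" are placeholders I made up for illustration, and the function name is hypothetical.

```python
import io
import zipfile


def top_level_entries(package):
    """Return the sorted names found at the root of a ZIP package.

    Folders are reported with a trailing slash.
    """
    with zipfile.ZipFile(package) as zf:
        roots = set()
        for name in zf.namelist():
            head, sep, _rest = name.partition("/")
            roots.add(head + sep)  # sep is "/" for folders, "" for files
        return sorted(roots)


# Stand-in package (assumption): same root layout as described above,
# but with dummy contents and made-up names inside the folders.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as zf:
    for part in ("[Content_Types].xml", "_rels/.rels",
                 "docProps/core.xml", "word/wordDocument.xml"):
        zf.writestr(part, "<placeholder/>")

print(top_level_entries(sample))
```

To try it on the real file, pass the path to the .docx instead of the in-memory sample.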

[Content_Types].xml

If you haven't read through the first part of the Metro Spec, I would recommend it. Office uses the same ZIP conventions that the metro folks do, as I described in this earlier post. We worked together on designing a logical model for documents, and then mapped that into ZIP. Since ZIP doesn't have a content type property on each part, we instead use this XML part to describe the content types that appear in the package. By reading this part (which always has the same URI "/[Content_Types].xml") you can quickly see what type of content the file consists of. There is a default mapping for extensions, as well as overrides for specific URIs.
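As a rough illustration of the "default mapping plus overrides" idea, here's a small Python sketch. The element and attribute names (Default, Extension, Override, PartName, ContentType), the namespace, and the content type strings in the sample are all assumptions made for the example, so check them against the actual part in the package before relying on them.

```python
import posixpath
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def load_content_types(xml_text):
    """Return (defaults, overrides) parsed from a content types part."""
    defaults, overrides = {}, {}
    for el in ET.fromstring(xml_text):
        kind = local_name(el.tag)
        if kind == "Default":        # extension -> content type
            defaults[el.get("Extension").lower()] = el.get("ContentType")
        elif kind == "Override":     # specific URI -> content type
            overrides[el.get("PartName")] = el.get("ContentType")
    return defaults, overrides


def content_type_of(part_uri, defaults, overrides):
    """An override for a specific URI wins over the extension default."""
    if part_uri in overrides:
        return overrides[part_uri]
    ext = posixpath.splitext(part_uri)[1].lstrip(".").lower()
    return defaults.get(ext)


# Hypothetical sample; names, namespace, and types are assumptions.
SAMPLE = """<Types xmlns="urn:example:content-types">
  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="jpg" ContentType="image/jpeg"/>
  <Override PartName="/word/wordDocument.xml"
            ContentType="application/x-example-word-main+xml"/>
</Types>"""

defaults, overrides = load_content_types(SAMPLE)
print(content_type_of("/word/media/image0.jpg", defaults, overrides))
print(content_type_of("/word/wordDocument.xml", defaults, overrides))
```

Matching on local names rather than a fixed namespace keeps the sketch usable even if the namespace URI changes between builds.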

_rels Folder

The _rels folders are where you go to find the relationships for any given part. To find the relationships for a part, you just look for the _rels folder that is a sibling of your part. If the part has relationships, the _rels folder will contain a file that has your original part name with ".rels" appended to it. For example, if the content types part had any relationships, there would be a file called "[Content_Types].xml.rels" inside the _rels folder.
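The naming rule is simple enough to write down in a few lines of Python. This is just a sketch of the convention described above; the function name is mine, and treating an empty string as "the package itself" is a choice made for the example.

```python
import posixpath


def rels_part_for(part_name):
    """Return the name of the relationships part for `part_name`.

    The empty string stands for the package itself, whose relationships
    live in "_rels/.rels".
    """
    folder, filename = posixpath.split(part_name.lstrip("/"))
    # Sibling "_rels" folder, original file name plus ".rels".
    return posixpath.join(folder, "_rels", filename + ".rels")


print(rels_part_for("[Content_Types].xml"))    # _rels/[Content_Types].xml.rels
print(rels_part_for("word/wordDocument.xml"))  # word/_rels/wordDocument.xml.rels
print(rels_part_for(""))                       # _rels/.rels
```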

_rels/.rels

The root level _rels folder always contains a part called ".rels". This URI ("/_rels/.rels") and "/[Content_Types].xml" are the only two reserved URIs for parts in files that adhere to our conventions. This is where the "package relationships" are located. Whenever you open a file using these conventions, you always start by going to the _rels/.rels file. All relationship files are represented with XML. If you open it in a text editor you'll see a bunch of XML that outlines each relationship for that part. In this example document, the top level parts are two metadata parts, and the wordDocument.xml part. That's what we'll look at next.
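To give a feel for what "start at _rels/.rels" looks like in code, here's a hedged Python sketch that reads a relationships file into a dictionary keyed by relationship id. The Relationship element with Id, Type, and Target attributes is an assumption about the markup, and the namespace, type URIs, and docProps part names in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Relationship(NamedTuple):
    id: str
    type: str
    target: str


def parse_relationships(xml_text):
    """Return {relationship id: Relationship} for one relationships part."""
    rels = {}
    for el in ET.fromstring(xml_text):
        # Compare local names so the sketch does not depend on the namespace.
        if el.tag.rsplit("}", 1)[-1] == "Relationship":
            rel = Relationship(el.get("Id"), el.get("Type"), el.get("Target"))
            rels[rel.id] = rel
    return rels


# Hypothetical package-level relationships: the main document plus two
# metadata parts. Every URI and docProps name here is a placeholder.
PACKAGE_RELS = """<Relationships xmlns="urn:example:relationships">
  <Relationship Id="rId1" Type="urn:example:rel/main-document"
                Target="word/wordDocument.xml"/>
  <Relationship Id="rId2" Type="urn:example:rel/core-properties"
                Target="docProps/core.xml"/>
  <Relationship Id="rId3" Type="urn:example:rel/app-properties"
                Target="docProps/app.xml"/>
</Relationships>"""

for rel in parse_relationships(PACKAGE_RELS).values():
    print(rel.id, rel.type, rel.target)
```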

word/wordDocument.xml

This is the main part for any Word document. If you crack it open in an XML editor (I just use IE to view it), you'll see a pretty basic XML file. This is where you'll start to see the differences between the new format, and the Word 2003 XML format. A bunch of the stuff that was at the beginning of the document in 2003 is now broken out into separate parts. The body of the document is what's contained in this part. As you look around in this part, there are a couple of things I want to call out.

Embedded picture

Notice that the picture isn't embedded in the XML like it was in Word 2003. You'll see there is some markup describing how the picture is laid out, but the picture data itself isn't there. Instead, there is the following tag:

<v:imagedata w:rel="rId5" o:title="bulls" />

This is the reference to the image file. In the new format, all references are done via relationships. The wordDocument.xml part has a relationship to the image part. In order to find the image, we just need to go to the relationships file for wordDocument.xml and find the relationship id "rId5". Looking back at the ZIP package, notice that there is a _rels folder in the same directory as the wordDocument.xml part. Open that folder and you'll see a file called wordDocument.xml.rels. If you open this up in a text editor you'll see that "rId5" is a relationship of type "https://schemas.microsoft.com/office/2006/relationships/image", and it points to the file image0.jpg in the media folder.
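Here's roughly what that lookup could look like in Python: take the id from the markup, find it in the part's relationships file, and resolve the target relative to the folder the source part lives in. The sketch assumes the relationship is stored as a Relationship element with Id and Target attributes, that the target is written relative to the "word" folder, and that the media folder sits under "word"; the package is a dummy built in memory, and the namespace is a placeholder.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile


def resolve_target(source_part, target):
    """Resolve a relationship target against the folder of its source part."""
    if target.startswith("/"):
        return target.lstrip("/")          # already package-absolute
    base = posixpath.dirname(source_part)  # e.g. "word"
    return posixpath.normpath(posixpath.join(base, target))


def read_related_part(zf, source_part, rel_id):
    """Follow relationship `rel_id` of `source_part`; return (name, bytes)."""
    folder, filename = posixpath.split(source_part)
    rels_part = posixpath.join(folder, "_rels", filename + ".rels")
    for el in ET.fromstring(zf.read(rels_part)):
        if el.get("Id") == rel_id:
            name = resolve_target(source_part, el.get("Target"))
            return name, zf.read(name)
    raise KeyError(rel_id)


# Stand-in package (assumption): dummy parts, placeholder namespace.
RELS = """<Relationships xmlns="urn:example:relationships">
  <Relationship Id="rId5"
      Type="https://schemas.microsoft.com/office/2006/relationships/image"
      Target="media/image0.jpg"/>
</Relationships>"""

package = io.BytesIO()
with zipfile.ZipFile(package, "w") as zf:
    zf.writestr("word/wordDocument.xml", "<placeholder/>")
    zf.writestr("word/_rels/wordDocument.xml.rels", RELS)
    zf.writestr("word/media/image0.jpg", b"dummy image bytes")

with zipfile.ZipFile(package) as zf:
    name, data = read_related_part(zf, "word/wordDocument.xml", "rId5")

print(name, len(data))
```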

I'll talk more about relationships in future posts, but I hope the basic usefulness is clear. The relationships files allow you to quickly navigate through the package without having to open up each part. If I wanted to find all images that are referenced in the wordDocument, I don't even need to open the wordDocument.xml part. I just open the relationships file and look for all relationships that are of type "https://schemas.microsoft.com/office/2006/relationships/image". If I want to change this to point at a different image, I just edit the relationship, and don't need to modify the application level XML. This is especially useful for external relationships, as described next.
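Both ideas (finding every image by relationship type, and repointing one without touching the document XML) fit in a short Python sketch. It assumes Relationship elements with Id, Type, and Target attributes; the namespace, the hyperlink type URI, and the targets in the sample are placeholders, and the function names are mine.

```python
import xml.etree.ElementTree as ET

IMAGE_TYPE = "https://schemas.microsoft.com/office/2006/relationships/image"


def targets_of_type(rels_xml, rel_type):
    """List the targets of every relationship with the given type."""
    return [el.get("Target") for el in ET.fromstring(rels_xml)
            if el.get("Type") == rel_type]


def retarget(rels_xml, rel_id, new_target):
    """Return the relationships XML with `rel_id` pointing at `new_target`."""
    root = ET.fromstring(rels_xml)
    for el in root:
        if el.get("Id") == rel_id:
            el.set("Target", new_target)
            # ElementTree may rename namespace prefixes when serializing;
            # the resulting XML is equivalent.
            return ET.tostring(root, encoding="unicode")
    raise KeyError(rel_id)


# Hypothetical relationships file for wordDocument.xml.
RELS = """<Relationships xmlns="urn:example:relationships">
  <Relationship Id="rId4" Type="urn:example:rel/hyperlink"
                Target="http://blog.example/"/>
  <Relationship Id="rId5" Type="{image}" Target="media/image0.jpg"/>
</Relationships>""".replace("{image}", IMAGE_TYPE)

print(targets_of_type(RELS, IMAGE_TYPE))
updated = retarget(RELS, "rId5", "media/image1.jpg")
print(targets_of_type(updated, IMAGE_TYPE))
```

Note that wordDocument.xml never gets opened here; only the relationships text changes.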

Hyperlink

Back in wordDocument.xml, notice the inline markup for the hyperlink. The tag is just <w:hyperlink w:rel="rId4" w:history="1">. It doesn't actually have the URL inline. Just as references to other parts in the ZIP use relationships, so do external references. If you go back to the relationships file for wordDocument.xml, you'll see that rId4 is a relationship of type hyperlink, and it points to my blog. This is true not just for hyperlinks, but for any external reference: linked images, templates, etc. This makes it much easier to do link fix-up if you're moving files from one server to another. Or if you want to remove all external references for security reasons, you just edit the relationships.
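A link fix-up pass along those lines might look like the Python sketch below. One big assumption: it relies on external relationships being flagged with a TargetMode="External" attribute, which may not match what this early build writes, so check the actual relationships file first. The server names, namespace, and hyperlink type URI are placeholders.

```python
import xml.etree.ElementTree as ET


def fix_up_links(rels_xml, old_prefix, new_prefix):
    """Rewrite external targets that start with `old_prefix`.

    Returns (new XML text, number of relationships changed).
    Assumes external relationships carry TargetMode="External".
    """
    root = ET.fromstring(rels_xml)
    changed = 0
    for el in root:
        target = el.get("Target", "")
        if el.get("TargetMode") == "External" and target.startswith(old_prefix):
            el.set("Target", new_prefix + target[len(old_prefix):])
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical relationships file: one external hyperlink, one internal image.
RELS = """<Relationships xmlns="urn:example:relationships">
  <Relationship Id="rId4" Type="urn:example:rel/hyperlink"
                Target="http://oldserver.example/blog/" TargetMode="External"/>
  <Relationship Id="rId5"
      Type="https://schemas.microsoft.com/office/2006/relationships/image"
      Target="media/image0.jpg"/>
</Relationships>"""

new_rels, count = fix_up_links(RELS, "http://oldserver.example/",
                               "http://newserver.example/")
print(count)
```

Internal relationships are left alone, and the hyperlink markup in wordDocument.xml keeps pointing at rId4.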

There are a bunch of other things I want to talk about with this file, but the post is already getting too long. The main thing I wanted to get across here was how the different pieces of the files are laid out, and how you go about navigating them. Please play around with the file a bit. Let me know what areas of the formats you'd like me to describe in greater detail.

-Brian

Comments

Anonymous
June 20, 2005
Nifty! Nice choice of examples about keeping relationships outside of the content parts.
Anonymous
June 20, 2005
The comment has been removed
Anonymous
June 20, 2005
Will the Word2005 file format use clearer element and attribute names, as opposed to the Word2003 file format (e.g. "Paragraph" instead of "p")?
Anonymous
June 21, 2005
The comment has been removed
Anonymous
June 21, 2005
TO MARIO:
I thought that attributes are not a part of any namespace by default. You have to prefix them explicitly.
Anonymous
June 21, 2005
Thanks for posting the examples! I will be waiting for a similar post on your Excel blog
Anonymous
June 21, 2005
Mario – You’re right that in most cases people don’t qualify their attributes, and just assume the namespace of the parent element applies to the attribute. In Word 2003 schemas we qualified the attributes, and we currently are doing it for the 12 schemas. We may decide to drop that though and just go with unqualified attributes as you suggest (not really a big deal either way).

LexP – Currently the naming conventions are very similar to what they were in 2003. I was thinking about changing this because now that we use ZIP compression, there shouldn’t be a big impact on file size. There is a big impact though on performance, as longer names require more parsing. Because of that, for elements that occur often in the files, we try to use very short names. For elements that only occur a couple times though and are more rare, we will often use more verbose names.
I had been thinking about providing a tool that could make this easier, but I wasn’t sure how useful it would be. Maybe you can give me your opinion… I was going to have someone build a simple XSLT that converted every element from the short tag names to longer more verbose names. I would create a group of “debug” namespaces that matched the Office namespaces and allow people to transform between the two. These “debug” namespaces wouldn’t be supported by the applications, instead they would just be for putting the document into a temporary state that’s more readable. Does that sound useful? Or is it already getting too inconvenient?

Keith – I think we’ll be able to provide the first draft of the schemas a bit before beta 1. I’m currently thinking it will be around the time of PDC which is the 2nd week in September, but I’m not positive.
It’s funny that you bring up the pretty printing, because we were actually arguing about it yesterday. The current plan is that we will not pretty print our XML parts. The example I posted has pretty printing, but it wouldn’t in the final version. The reason for this is that there is actually a significant performance hit when you have to take the additional time to pretty print the files. Since these formats are going to be the default, we need to make sure they have fast open and save times. By not pretty printing though, it makes them harder to work with if you are editing the XML directly by hand. Many XML editors currently have pretty printing functionality though (FrontPage & VS for example), so I’m hoping it won’t be that big of a deal. What do you think?
Your relationship ID point is interesting. The IDs just need to be unique. If you were creating a file from scratch, you could call them anything you want. There is a type attribute on each relationship as well though, which allows you to understand more about how it’s used. While there is nothing to help you know in what order they are used, you will know when a relationship is pointing at an image vs. a stylesheet (just as an example). I’d like to hear from more folks on their first impressions of relationships.

Bruce – I’ll try to get some Excel stuff pulled together soon. There are more significant differences from the SpreadsheetML to the new Excel format than what you see with Word, so I’ll need to start with something simple and make sure I explain everything properly. Excel has done a lot of work to make the new format a faster more efficient format, which means far less information on each cell, and instead separate collections of properties that reference the grid. As an example, with a named range, instead of having the name on each cell, you would instead have a separate named range element that references the range in the grid that it applies to. I’ll get an example together to make this more clear.

Everyone – Keep the comments coming. We announced this early so we could get more feedback from folks on the formats. Let me know what you liked or disliked about the 2003 schemas. What would you like to see changed? What do you think about this first look at the new formats?

-Brian
Anonymous
June 21, 2005
Brian said:
<blockquote>
It’s funny that you bring up the pretty printing, because we were actually arguing about it yesterday. The current plan is that we will not pretty print our XML parts. [...] The reason for this is that there is actually a significant performance hit when you have to take the additional time to pretty print the files.
</blockquote>

Your speed argument seems to trump human-readability quite completely. I'm fine with that. Developers should be able to get pretty printing on their end pretty easily.

<blockquote>
Your relationship ID point is interesting. The IDs just need to be unique. If you were creating a file from scratch, you could call them anything you want.
</blockquote>

Would hand naming be preserved after opening and saving using Word?

<blockquote>
There is a type attribute on each relationship as well though, which allows you to understand more about how it’s used.
</blockquote>

Sure, it seems pretty reasonable at this point. I'm just quivering because I work too much with XML & other stuff that auto-generates number strings for all XRefs (cross-references). For example, I can ask Word 2003 to "Toggle Field Codes" on a simple XRef and I get the not very helpful "REF _Ref107126818 h". Another program gives me nice DocBook XML strings like <pre><indexterm id="IXT-33-296798"><primary>algorithms</primary></indexterm></pre>

Because cross references tend to break for us we've gone to some lengths to get meaningful names for IDs so that we can fix them later. Here's what we've gotten Word 2003 to say instead of something like the above "REF XREF70988_Figure_111 h", which is a cross reference to the paragraph with the text "Figure 1-11" (implemented with auto-numbering).

Any human-readable text that finds its way into IDs would seem like a big improvement from my perspective.

<blockquote>
I’d like to hear from more folks on their first impressions of relationships.
</blockquote>

Ditto, I may be in the minority.

Thanks,
Keith
Anonymous
June 21, 2005
Frankly I don't understand such an economy on element names. Excel2003 already has readable and verbose XML format with Pascal casing, while word2003 has camel old-style "economy" naming convention.

Please pay attention to unification of:
1) verboseness of XML format (p vs Paragraph)
2) naming convention (camel, Pascal)
3) attribute vs. element usage in sample scenarios (<CoreProperties Title="" .. /> vs. <CoreProperties><Title>... )

As to me, I prefer XAML-style naming convention with clear element and attribute names.
Anonymous
June 21, 2005
The comment has been removed
Anonymous
June 21, 2005
For relID readable names I think an option to change the Id name inside Word (properties or so) should be enough.

Pretty printing should be an option, a checkbox, in the 'save as' dialog.

Will I notice the speed difference in saving and loading binary format compared to new format, especially with large files?
Anonymous
June 22, 2005
Brian,
thanks for the early insights. Nice picture as well.
Nevertheless, with reference to xml, people will have to cry about every little thing:
For once, it bugs me when it comes to the vector markup schema in Office 12. MS has always been very reserved in promoting vml. There's little known documentation available, it got almost completely erased from the msdn dvds and the schema wouldn't ship with the Office2003 schemas at first?!?

I realize that MS is tempted to avoid svg vs. vml tumults. The office team has always been doing vml over svg and preferred not to talk about it. Since you started talking, I'd expect you to come down on one or the other side of the fence. Please explain your decision and stand in for your solution...

The pity of it is that they don't have Longhorns in San Miguel de Allende;-))
Anonymous
June 22, 2005
The comment has been removed
Anonymous
June 22, 2005
Walter b: Are you talking about a "foreign" XML element commingled among/under/within the MS Office Word 12 ones? That's an interesting topic. I notice that OASIS OpenDocument specifies default behavior for those and I wonder if that is what you have in mind.

I think the OASIS OpenDocument foreign elements/attributes are handled a little like the HTML rules where foreign attributes are ignored and any foreign-element content leaks into the surrounding understood element content. I don't know if there's a way to control preservation as the result of editing of the containing understood element. (These things need to be thought through before a default behavior is invented, or else invent a default behavior that is easy to change in an upward compatible way later [;<).

Part of the problem is not knowing whether the editing has an impact on what the function of the foreign element is. You can even have this problem with something simple like adding a document property to the property sheet. The Office 12 scheme of things could have a way (via attribute) for the keep/discard/fail behavior required when the containing element is foreign to the processing application, but I don't know if that would solve your problem.

[Wishing there was a newsgroup as a better place for discussions like this ... ]
Anonymous
June 22, 2005
I should point out that the foreign element/attribute problem applies more widely in any scheme where an XML Document can be extended by "foreign" elements/attributes pretty much anywhere. WebDAV is kinda/sorta snarled up in this too. Someone else can say what Sharepoint does about it.
Anonymous
June 22, 2005
Hi Brian,

It looks like the relationships tags are going to be the key to user-navigation of these files, so yes, I think more descriptive tags would be better, and/or some sort of organisation of them in the relationships file. I think it would also be immensely helpful if MS could knock together a tool that would make it easy for us to navigate around these files, editing the XML as we go. E.g. a simple text editor, where every relationship link can be clicked to get to the definition, which can be clicked again to get to the target.

I'd like to also add my vote to a 'Verbose' or 'Debug' option in the Save As dialog, which would run the xml through your 'Flesh out the tags' transform, with a preprocessor to do the reverse when the files are loaded.

Stephen
Anonymous
June 22, 2005
Brian,

Please consider adopting just one convention from the OpenDocument standard: that of placing a "mimetype" file first in the zip archive, uncompressed, whose contents are the MIME type of the entire document. This makes it easy to identify documents by examining the first few bytes.

See Section 17.4 of the OpenDocument v1.0 standard, http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.0-os.pdf
Anonymous
June 23, 2005
Brian,

: There are more significant differences from
: the SpreadsheetML to the new Excel format
: than what you see with Word, so I’ll need to
: start with something simple and make sure I
: explain everything properly.

As one who is familiar with many of the issues you had to face when developing the new Excel format, I can certainly appreciate this.

However, some of us are already pretty familiar with the earlier Excel formats and the architecture they're based on, and we'd like to start getting our hands dirty. Would it be possible to post a sophisticated workbook in the new format (and including the BIFF equivalent) for our benefit?

We here in my office are excited about this new file format and what it means for the customers you and we have in common.
Anonymous
June 23, 2005
Keith – Hand naming of the relationships most likely would not be preserved. It’s important to understand the target use of relationships. Preservation of IDs on objects is something that really needs to be part of the application runtime, and not just the file format. The parts that we break our files into often aren’t truly separate objects in the applications memory structures. We are asked to provide IDs on different types of objects, like embedded documents, images, tables, paragraphs, etc. Those are things we could decide to persist via the relationship ID rather than other inline markup, but it would really depend on the object itself, and not the more generic concept of parts and relationships.
Your point about cross references is very valid, but that is more of an application feature level thing, as opposed to something directly related to parts and relationships in the file format. Let me know what you’d like to see out of the references features.

Ryan – You definitely shouldn’t feel out of place. I’ve had some fairly technical posts so far, but I also plan to touch on a lot of topics more related to IT folks. I also will have some posts that are more at an “intro to XML” level. These new formats do make us more stable, but not necessarily in the way you describe. The new formats also remove a lot of the limitations we had in previous formats, which allows us to explore some of the constraints we had in the older versions. In Word, we’re also looking at the problems that come with longer documents and large tables, but there isn’t anything specific I can say right now as far as new functionality goes.

Ignace – Many of the relationships and parts aren’t exposed at runtime. They are only generated when we go persist the file. We don’t have performance numbers yet in comparing the old binary formats with the new XML formats (it’s still a bit too early).

Paul – VML has been around for a long time. The reason we didn’t have the schemas for VML when we first released the 2003 schemas was that we had to go back and create it. All the code for generating VML had been done long before XSD came about. We had XSDs for all the other schemas because we’d actually built them directly into our build process. Our code would pull the tag names to be used directly from the XSD files at build time. It was a lot easier for us to clean those up and make them available publicly. For VML we had to get someone to generate it from scratch based on the implementations.

Orcmid – We’ll probably wait until we get closer to the betas to get a newsgroup setup, but it may come sooner. I don’t really have enough time to monitor a newsgroup myself, and it would be best to not distract the rest of the development team until we get closer to ship. Like I said though, maybe it will make sense to set something up sooner.

Stephen – I’ve been looking around for some resources to put together a tool like you suggest for navigating the files. Not sure if we’ll get it together though or if we’ll have to rely on a third party to do it. It would definitely be very cool.

Kalelb – I’ll try to get something together soon. Most likely I’ll start with some really simple files though. The reason I posted the Word example file first is that the Word format is the furthest along. I’ll check the latest builds of Excel and see if it’s in a state where an example file would be useful.
Anonymous
June 23, 2005
Is there any merit in being able to have multiple documents in the same zip, so that shared style information, shared images, etc. are not duplicated?
Anonymous
June 23, 2005
That's a great question Mark. A number of people naturally wonder if this new format means we could create V2 of the "binder." The binder was essentially a document format that would allow for multiple files to be stored in one file. The scenario was more around having a project that had some Word documents, Excel spreadsheets, and PPT presentations, and you wanted to have them all stored together.
The ZIP container does lend itself well to that concept, but it's not something we're planning on doing this version. It is something we kept in mind while we were architecting the logical model for our documents though, so there isn't anything that would prevent us from moving in that direction if we decided it was worthwhile.
-Brian
Anonymous
June 23, 2005
The comment has been removed
Anonymous
June 24, 2005
Visual Studio Team System User Education - Process Planning Guide

David has written a nice guideline...
Anonymous
July 05, 2005
This post is for those of you interested in learning the basics behind WordprocessingML. That’s the schema...
Anonymous
November 25, 2005
Format Comparison Between ODF and MS XML
by Carrera, D'Arcus, Eisenberg
http://groklaw.net/article.php?story=20051125144611543
Anonymous
March 21, 2006
I found your page from google but i like it so much
