Readability vs. Performance of XML formats

This isn't really news for most folks, but I wanted to explain it better for those of you who are new to XML. As I’ve mentioned before, the new formats are going to be fairly similar to WordprocessingML from Office 2003 in that they will not be pretty printed and will use fairly short tag names. Both of these decisions were made for performance reasons. The side effect is that if you open an Office XML file in Notepad or some other text editor, it will look overwhelming at first. This is just cosmetic though, and I'll explain why it's that way.

Shorter tag names

We can generate and read these formats a lot faster if the tag names are shorter. When opening or generating a file, we spend a good amount of the time just parsing the XML text, and the longer the tag names, the longer that takes. When parsing the file, we use a trie to look up element and attribute names, and the cost of each lookup is proportional to the length of the name. Double the length, double the lookup time.
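To make the trie point concrete, here is a minimal sketch in Python. It is not the parser Office uses; the `TagTrie` class and the tag ids are made up for illustration. It only shows that a trie lookup takes one step per character of the name.

```python
class TagTrie:
    """A tiny character trie mapping tag names to ids (illustrative only)."""

    # Child keys are single characters, so this multi-character key
    # can never collide with one of them.
    END = "$id"

    def __init__(self):
        self.root = {}

    def insert(self, name, tag_id):
        node = self.root
        for ch in name:
            node = node.setdefault(ch, {})
        node[self.END] = tag_id

    def lookup(self, name):
        """Return (tag_id or None, number of character steps taken)."""
        node = self.root
        steps = 0
        for ch in name:
            if ch not in node:
                return None, steps
            node = node[ch]
            steps += 1
        return node.get(self.END), steps


trie = TagTrie()
trie.insert("c", 1)      # hypothetical id for the short name
trie.insert("cell", 2)   # hypothetical id for the long name

print(trie.lookup("c"))     # (1, 1): one step
print(trie.lookup("cell"))  # (2, 4): four steps
```

The step count returned by `lookup` is just the length of the name, which is the proportionality described above.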

So, for any element that is repeated often throughout the file, we use a short tag name to cut back on this time (for example, Excel uses “<c>” instead of “<cell>”). That may not seem like a big deal, but if you have a file with ten thousand cells (which is not too uncommon in Excel), that means you have 30K fewer bytes to parse in the start tags alone for that particular element type, and the same again for the matching end tags. As you can imagine, that adds up pretty quickly when you think about all the tags we need to generate.
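The arithmetic is easy to check. The helper below is a hypothetical sketch, not anything from Office, and it assumes ASCII tag names at one byte per character.

```python
def bytes_saved(long_name, short_name, count, include_end_tags=True):
    """Bytes saved by renaming an element that occurs `count` times.

    Assumes ASCII names (one byte per character) and, when
    include_end_tags is True, that every element has both a start
    tag and an end tag.
    """
    per_tag = len(long_name) - len(short_name)
    tags_per_element = 2 if include_end_tags else 1
    return per_tag * tags_per_element * count


# Ten thousand cells, <c> instead of <cell>
print(bytes_saved("cell", "c", 10_000, include_end_tags=False))  # 30000
print(bytes_saved("cell", "c", 10_000))                          # 60000
```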

Pretty Printing

For those of you who aren’t familiar with XML, “pretty printing” is a way of writing out XML so that it’s easier for people to read with a text editor. Imagine the following XML:

<wordDocument>
    <body>
        <p>
            <r>
                <t>Hello World</t>
            </r>
        </p>
    </body>
</wordDocument>

That’s pretty easy to read, because there are line breaks and indentations for each element. The problem with this approach is that the application writing out the file has to generate all of that extra whitespace, and anything reading the file has to parse it again. In Office, our files will have thousands of tags in them. If we were to pretty print each file, it would mean that saving a file would take more time. Since file save times affect everyone, and pretty printing only affects those people who want to look at the XML in plain text, we decided to optimize for performance (most other applications that have XML as their default format do the same thing). Here is what that same XML would look like when saved out from Office:

<wordDocument><body><p><r><t>Hello World</t></r></p></body></wordDocument>
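To get a feel for the overhead, here is a small sketch that writes the same chain of nested elements both ways and compares the sizes. The `write` function is a toy serializer invented for this example (it handles only one chain of nested tags around a single text node), and the four-space indent is an assumption.

```python
def write(tags, text, pretty=False, indent="    "):
    """Serialize one chain of nested elements around a single text node."""
    parts = []
    last = len(tags) - 1
    # Start tags, outermost first; the innermost element carries the text.
    for depth, tag in enumerate(tags):
        pad = indent * depth if pretty else ""
        if depth == last:
            parts.append(f"{pad}<{tag}>{text}</{tag}>")
        else:
            parts.append(f"{pad}<{tag}>")
    # End tags for the enclosing elements, innermost first.
    for depth in range(last - 1, -1, -1):
        pad = indent * depth if pretty else ""
        parts.append(f"{pad}</{tags[depth]}>")
    return ("\n" if pretty else "").join(parts)


tags = ["wordDocument", "body", "p", "r", "t"]
compact = write(tags, "Hello World")
pretty = write(tags, "Hello World", pretty=True)

print(len(compact), "characters compact")
print(len(pretty), "characters pretty printed")
print(len(pretty) - len(compact), "characters of pure whitespace")
```

Even in this tiny example the whitespace is a large share of the pretty printed output, and in a real file the indentation grows with the nesting depth.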

Most XML editors out there today (like Visual Studio and even FrontPage) have functionality built in that allows you to apply pretty printing to a file. This means that if you want to look at the XML, you can just load it up in one of those applications and apply the pretty printing to it. You can also load the file in IE, which automatically applies an XSLT that gives you a pretty decent view.
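If you would rather script it than open an editor, most XML libraries can do the same thing. As one example (my choice of tool for illustration, not something Office ships), Python's standard `xml.dom.minidom` module has a `toprettyxml` method:

```python
from xml.dom import minidom

compact = "<wordDocument><body><p><r><t>Hello World</t></r></p></body></wordDocument>"

# Parse the compact text, then re-serialize it with line breaks and
# a four-space indent. minidom also prepends an XML declaration.
pretty = minidom.parseString(compact).toprettyxml(indent="    ")

print(pretty)
```

Treat the result as a reading aid only, since the added whitespace is not part of the original file.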

-Brian

Comments

  • Anonymous
    June 23, 2005
    Hi Brian, Internally, will O12 use MSXML to parse and generate these file formats, or will you develop dedicated code for this?

  • Anonymous
    June 23, 2005
    The decision you made about the readability vs. performance trade-off appears to be the appropriate one. Creating verbose XML with pretty printing would be the right choice if the major consumer of the documents were a human. However, almost all of these documents would be consumed by a parser in a typical scenario.

    During development you can format the documents using an XML editor or tool. Another approach to the pretty printing issue is to activate it programmatically. Since this feature is only useful for developers, including an option in the Save As dialog doesn't seem to make much sense.

    After all, we all are familiar with the <p>, <c>, <li>... HTML tags.

    In my opinion, the key point is to create an XML vocabulary that is as homogeneous as possible. The criteria include capitalization, use of attributes vs. elements, verbose vs. efficient names... Having a consistent set of rules across all of the XML vocabularies of the MS Office System can make the difference.

  • Anonymous
    June 24, 2005
    WordprocessingML and SpreadsheetML are unnecessarily long, so I think shortening the words inside each tag is one approach. But how about outputting simplified markup in the first place?
    Think of one <cell> tag versus too many <c></c><c></c><c></c>....
    Obviously, one <cell> tag is much easier for both humans and machines. So I would like a feature in the Office suite to save as simplified XML output, for repurposing in other simple tasks.
    Thank you.

  • Anonymous
    July 05, 2005
    This post is for those of you interested in learning the basics behind WordprocessingML. That’s the schema...

  • Anonymous
    April 11, 2006
    I've really dropped the ball here over the past several months. I'd been meaning to post some example...

  • Anonymous
    May 10, 2006
    It's been awhile since I've talked in detail about the SpreadsheetML schema and I apologize. I had a...
