Performance of an XML file format for spreadsheets

I've blogged in the past about some of the things we did with the SpreadsheetML format to help improve the performance of opening and saving those files. While a file format most likely won't affect the performance of an application once the file is loaded, it can significantly impact the performance of the actual load or save. The majority of a wordprocessing file is usually the actual text content and anything you can do to avoid making the storage of that content complex the better. The majority of a presentation is usually the rich media and styling information and it's important to store those in an efficient manner. The biggest challenge though hands down in terms of file format performance though is a spreadsheet.

It's very easy to get Spreadsheets comprised of millions of cells. In older versions of Excel, it used to be that the maximum number of columns allowed in a worksheet was 256 and the maximum number of rows was 64,000. That meant that a single worksheet in a workbook could consist of over 16 million cells. An of course there can me many worksheets in a workbook.

There were a number of customers who were hitting this limit and wanted us to increase the number of rows and columns allowed in a worksheet (this was actually one of Excel's top requests). This had been a request for awhile, but it was going to be extremely difficult to increase this limit in the old binary formats so we didn't do it. The move to the new Open XML formats though allowed us to increase the limits, so now in Excel 2007, you can have up to 16,000 columns and 1 million rows. That means you could have a worksheet with 16 billion cells. I can't imaging anyone hitting that limit, but to be honest, I couldn't believe folks were hitting the older limits.

So, while that's a look at the extreme end of things, it's not too uncommon to have very large spreadsheets, and a workbook with a million cells of data is definitely something we see a lot of.

Now, while in a Wordprocessing document you may potentially have a million words (which would be about 2000 pages), it's not as likely. More importantly though, the structure of the XML doesn't have as big of an impact. In wordprocessingML, you don't have every word contained in it's own XML element. So you while may have a million words, your XML element count would probably only be in the tens of thousands.

In a spreadsheet though, you have each cell represented by at least one, if not more XML elements which means that a spreadsheet of any reasonable size can easily have millions of XML elements and attributes. This can have significant implications on the load and save performance of the files.

A couple examples:

There are a number of things we looked into doing to help improve the performance of these files both using current technology as well as thinking about future approaches. Some of the approaches we took that are easiest to understand are:

  1. Reduce the need to fully parse repeating data - In order to cut back on parsing large strings repeated multiple times, or formulas that are repeated down an entire column, spreadsheetML does a lot of sharing. I already talked about the huge benefits we can get from the shared formulas work in a post last spring. The shared string table can give similar results.

  2. Tag Size - Contrary to this post from IBM's Rob Wier, the size of XML elements actually does directly affect performance times. The more text you have to parse, the longer it will take. Rob I hope you didn't misunderstand some of my earlier points on this though. I don't mean in any way to give the impression that tag size is the biggest factor in terms of file format performance because it's not. There are a number of other architectural issues that play a much bigger role. That said, the size of the tags does have an impact (as I've blogged about before).

    The simplest way to see the impact for yourself is to just take an application like OpenOffice, and get a decent sized spreadsheet (I took a file with just one worksheet, that was 15 columns by 15,000 rows). Save it in ODF and make 2 copies. In one of the copies open content.xml and change some of the namespace prefixes to make them much longer. I aimed to make it so that the resulting content.xml file was about 3 times larger than the original since this helps replicate the size difference you would see if we changed SpreadsheetML so that it's tag length was longer and more descriptive like ODF (it would probably be more than that, but I found that making content.xml 3 times larger was more than enough to get a noticable slowdown in load times). I used namespace prefix approach because it's the easiest way to replicate the effect of a larger element name without changing the actual element names (since that would prevent the file from being readable by OpenOffice). In my test, by essentially tripling the size of the element names, I found that the file went from loading in about 9 seconds to taking about 11 seconds.

  3. Splitting the XML into multiple parts - In SpreadsheetML we break the XML content out into multiple parts in the ZIP (each worksheet, the string table, pivot table data, etc.). This can help a ton with on demand loading, or even loading multiple pieces at once. For example, it means that if you had a multi-proc machine, you could load each worksheet on a separate thread. It also means that you could decide to only load on sheet and you wouldn't have to parse through all the XML from the other sheets.

  4. Relationships stored outside of content - By storing all the relationships from one part to another in separate (and much smaller) files, it makes it really easy to see what other parts you should load when you are loading a particular part of the file. If it weren't for the relationships, you'd actually have to parse through the content of the XML to determine what other resources that part used. For example, if you wanted to just load one slide in a presentation, you can also see what images are used on that slide before you start parsing the XML for the slide.

There are clearly a large number of factors that play into the performance of a particular format. As you can see, it primarily comes into affect when loading or saving a file. If you do any type of delay loading though or incremental saves, it can also start to have an impact on your editing performance though. The key issues I've focused on primarily affect the general load and save times. The examples above are just a handful of issues we looked at when designing the formats.

-Brian

Comments

  • Anonymous
    October 25, 2006
    The comment has been removed

  • Anonymous
    October 26, 2006
    Hitting the limits is not as hard as you think. I hit the old 65k row limit. I have also hit the new 1 million row limit. Many data archives, e.g. the renowned General Social Survey and World Values Surveys, have hundreds, if not thousands, of columns, and hundreds of thousands, if not millions of rows. And the U.S. census, even a 1% extract, would still overwhelm Excel 2007 with 3 million rows. Incidentally, have you tried opening the part 4 of the OpenXML spec in Word B2TR? It takes about 10 minutes on my computer, and that is not including post-opening repagination.

  • Anonymous
    October 26, 2006
    I think the 1 million row limit is too low myself. would have preffered 16 million rows by 1000 columns in such dimensions. We use excel to manipulate certain large query dumps and that often results in large amounts of rows but very few columns. Upping the limit to a million row is a major advancement for us even so much that we already use some Office 2007 test version for that. The performance of the 2007 beta version with such large files is mediocre to poor but still faster then OOo 1.x which seems uable to handle certain large loads. I hope the performance of the final version will improve quite a bit still.

  • Anonymous
    October 26, 2006
    It would seem that Francis and hAl are manipulating data on a scale that is better suited to a database.

  • Anonymous
    October 26, 2006
    Brian - I tend to think the whole debate about tag size (your #2) if overrated, both by Rob Weir and you.  It seems unlikely to make a huge difference in speed, but if it does, I hardly think that the "human-readable" ODF tags are wildly preferable.  Let's face it, anybody manipulating XML is going to have to look up names and such anyway, so I see every reason to keep things short and simple.  This is definitely an area where Open XML seems better to me, although I do think even Open XML loses some of the advantage by nesting elements more deeply than I would like. Still, I would guess that I think that because I have not seen more complex examples where it is important. The issues in your #3 and #4 seem the more important and relevant.  Rob even states in his post that because of Microsoft's limitations on beta benchmarks, he can't use a real life example, but this is a case where his performance evaluation clearly doesn't take into account the very important logic of on-demand parsing and use.  It makes complete sense to split things up into more smaller files, and this is an area where I think ODF could be improved.  If the breaking into smaller files is done logically, and I don't know the details of how this is done in Open XML, this could make a huge difference in larger, complex documents.  If I join the ODF TC, which I am considering, I would definitely advocate for looking at how Open XML implements both #3 and #4 to see how ODF could be improved.  These are the sorts of areas where I hope both ODF and Open XML TCs look at the other spec and see what can be learned.  Not just copied, but learned.

  • Ben
  • Anonymous
    October 26, 2006
    Dear Brian, Is there a theoretical limit on how many cells SpreadsheetML can handle, or are the limits simply the results of hardware limitation?

  • Anonymous
    October 26, 2006
    @Doug We export data from the database because it is easier to manipulate in excel than with the database tools we have access to. Also it is easy to combine data from other databases or sorted files into the spreadsheet afterwards.

  • Anonymous
    October 26, 2006
    Doug: much of this information is flat, and a spreadsheet is essentially a flat-file database. The data sources I mentioned has no relationships (or objects), so a relational (or object-oriented) database offers no advantages. And databases usually do not offer the sort of functions that a good spreadsheet does. Besides, easy-to-use desktop databases that handle this volume of data are few and far between. (Access' 2007 unchanged 256 column limit means it is even more limited than Excel.)

  • Anonymous
    October 26, 2006
    Adam: That's an insane argument! Don't develop performant file formats because in 7 YEARS computers will be fast enough to handle a slower one? In the meantime, nobody is going to use the slow one!

  • Anonymous
    October 26, 2006
    Adam, Dean: it's even worse than that - there are plenty of 7 year old computers around and those people buy today will stay around for years and years, so designing a format that is dog slow right now is not an option. There are people who would suffer degraded performance for their ideology, for the rest of us there are better formats :)

  • Anonymous
    October 28, 2006
    And if you get the performance now, 7 years from now it will simply be astounding!

  • Anonymous
    October 31, 2006
    Why can't you get the best of both worlds?  Why not have both a format with long names and a corresponding version with short ones.  In fact I don't understand why you even need a textual representation of xml.  Couldn't you have a binary version of xml which could be automatically viewed as a textual representation through the action of a simple one to one translator.  All you need is some standard binary serialization format for XML.  I think W3C is doing some work in this area: http://www.w3.org/XML/Binary/ You would then have a zip container which when unzipped gives you a serialized binary which when unserialized gives you pure xml.   I guess the problem with the solution is that there is no standardized way of doing binary serialization of xml and therefore serializing it yourself makes your format less open.  However I think serialization could be done in an extremely simple, transparent and easy way which would be very easy to duplicate.  

  • Anonymous
    November 01, 2006
    @Ben I don't think "human readable" tags are worth the speed hit because humans are generally not going to be mucking with XML themselves via eye-balling.  99.999999999% of the time computer programs will be manipulating the XML, not human beings.

  • Anonymous
    November 02, 2006
    I think it's also worth looking at the XML itself and deciding what you think is human readable. HTML, which most folks would say is human readable does the following for a table: <table>
     <tr>
       <tc>A1</tc>
       <tc>B1</tc>
     </tr>
    </table> SpreadsheetML does this: <sheetData>
     <row>
       <c><v>A1</v></c>
       <c><v>B1</v></c>
     </row>
    </sheetData> ODF does this: <table:table>
     <table:table-row>
       <table:table-cell>
         <text:p>A1</text:p>
       </table:table-cell>
       <table:table-cell>
         <text:p>B1</text:p>
       </table:table-cell>
     </table:table-row>
    </table:table> So, while OpenXML does use terse tag names, it's just like HTML where you just look up the meaning and from then on it's pretty straightforward. -Brian

  • Anonymous
    July 20, 2007
    I'll be offline for the next week or so. Sorry if I don't answer your e-mails or comments during that

  • Anonymous
    July 20, 2007
    I&#39;ll be offline for the next week or so. Sorry if I don&#39;t answer your e-mails or comments during

  • Anonymous
    January 31, 2008
    There have been some discussions over the past several years about the harmonization of Open XML with

  • Anonymous
    January 31, 2008
    There have been some discussions over the past several years about the harmonization of Open XML with

  • Anonymous
    January 31, 2008
    Brian Jones has a post this afternoon on the concept of harmonization of document formats, and in particular

  • Anonymous
    January 31, 2008
    Brian Jones has a post this afternoon on the concept of harmonization of document formats, and in particular

  • Anonymous
    June 01, 2008
    I've blogged in the past about some of the things we did with the SpreadsheetML format to help improve the performance of opening and saving those files. While a file format most likely won't affect the performance of an application once the file is loaded

  • Anonymous
    June 05, 2008
    I've blogged in the past about some of the things we did with the SpreadsheetML format to help improve the performance of opening and saving those files. While a file format most likely won't affect the performance of an application once the file is loaded