Design Goals Behind SpreadsheetML
It's been awhile since I've talked in detail about the SpreadsheetML schema and I apologize. I had a number of posts back in the summer which talked through Office XP's SpreadsheetML format that we built about 6 years ago, but obviously a lot has changed since then.
The new SpreadsheetML that is part of the Open XML formats coming with Office 2007 had to undergo serious work in order to make it ready to be the default format. As you all know, the majority of folks don't really care about what kind of format they are using, they just want it to work (remember that most end users have never even heard of XML). We wanted our formats to play a more vital role in business processes though, which is why we've slowly been progressing towards these new default XML formats. We want people to be able to easily build solutions on top of the formats, but at the same time, we don't want the average end user to feel much noticeable difference with the change (at least no negative differences).
That leads me to why we had to restructure SpreadsheetML from the original design. The two issues with the SpreadsheetML format from 6 years ago was that it wasn't full fidelity, and it wasn't optimized for performance/file size. The term "Full fidelity" just means that everything that is in your file can be saved into the format without fear of it being modified or lost. The old SpreadsheetML format didn't support a number of feature like images, charts, objects, etc. So we had to add all those additional things to the format.
The second part (performance) was a really important and challenging one. We wanted to move to an open format so that people could build solutions around our formats. Like many other applications out there, we chose a combination of ZIP and XML to achieve this. We had to write the XML though in such a way that it could be parsed extremely efficiently so that the file open and save experience wouldn't get significantly slower. There have been a number of articles related to this issue, where people have complained about performance in other applications that use XML as their format. Of course we had to keep this in mind with our design, and for those of you who have played around with it I'm sure you've noticed the difference.
While I'm not going to go into a full description on the SpreadsheetML format, I'd at least like to give you a brief introduction. A SpreadsheetML package has a few different pieces that it's comprised of. Let's lok at a basic diagram of the pieces of a spreadsheet:
The main parts I wanted to call out for today are:
- "sheet1" – This is the data for the worksheet. Each worksheet is stored as its own XML file within the ZIP package which means you can easily get at your data within a particular sheet without having to parse all the other sheets.
- "sharedStrings" – Any string (not number, just string) used in the sheet is actually stored in a separate location. There is a part called the "Shared string table" that stores all the strings used in the files. So, if you have a column called "states", and "Washington" appears 100 times in the spreadsheet, it will only need to be saved into the file once, and then just referenced.
I think an example might be best to help show what I'm talking about. Let's take a spreadsheet that looks like this:
ID |
Num |
Resource |
1 |
543 |
F068BP106B.DWG |
2 |
248 |
F068BP106B.DWG |
In the Open XML file, there would be an XML file that contained the strings used, that would look like this:
Shared String Table
<sst xmlns="https://schemas.openxmlformats.org/spreadsheetml/2006/5/main">
<si>
<t>ID</t>
</si>
<si>
<t>Num</t>
</si>
<si>
<t>Resource</t>
</si>
<si>
<t>F068BP106B.DWG</t>
</si>
</sst>
Then, in the main sheet, there would be cell values, and pointers into the string table wherever a string occurs:
Sheet1
<worksheet xmlns="https://schemas.openxmlformats.org/spreadsheetml/2006/5/main">
<sheetData>
<row>
<c t="s">
<v>0</v>
</c>
<c t="s">
<v>1</v>
</c>
<c t="s">
<v>2</v>
</c>
</row>
<row>
<c>
<v>1</v>
</c>
<c> <v>0</v> </c> <c t="s"> <v>3</v> </c> </row> <row> <c> <v>2</v> </c> <c> <v>0</v> </c> <c t="s"> <v>3</v> </c> </row> </sheetData>
</worksheet>
Notice that in the first row, each cell has the attribute t="s" which means it's a string. Then, the value is interpreted as in index into the string table, rather than an actual number value. In the 2nd and 3rd rows, the first two cells are interpreted as numbers, so they don't have the t="s" attribute, and the values are actual values.
This may seem a bit complex, but remember that while this format was designed for developers to be able to use, it we couldn't take the hit that comes with making it completely intuitive. Believe me, as a developer, I would have loved to make the formats more verbose and straight forward, but that would have meant that everyone else opening the files would have to suffer for it. If the example above was a more complex set of data with a number of separate worksheets, each with a few thousand rows, you can imagine how quickly the savings of the string table and terse tag names would add up. I had a couple posts back in the summer talking about some other basic things we do to make sure that the formats are quick and efficient.
This tradeoff of who you design around and how you way ease of use versus efficiency is something folks have to look at every day when they design products. Whether it's an API, a user interface, or a file format, you need to decide which target user you are going to give more weight to when you make your design decisions. We had to give more weight to the end user, and instead require a bit more knowledge from the developer. That's why the Ecma documentation is so important. We need to make sure that the format is documented 100% and there are no barrier to interoperability. The great group of people we have on TC45 are really helping a lot here. As I said last week, the Novell guys have already built some working code that allows Gnumeric to open and save Spreadsheet files in the Open XML format. I'm sure we'll see more and more implementations as we provide better documentation and get closer to a complete standard. It's really exciting! That's one of the great things we'll see more and more of up on the openxmldeveloper.org site.
-Brian
Comments
- Anonymous
May 10, 2006
Is the key for the Shared String Table really determined by the position of the string in the table? What happens when a string is deleted?
I would have thought it would be something like
<si key="0">
<t>ID</t>
</si> - Anonymous
May 11, 2006
Yes, it's done based on the position. That makes it faster to read the string table into memory when loading the file.
It's pretty easy to query for a particular node based on it's index rather than an attribute value, so really the only complexity this ads is in situations where you want to delete a string (as you mention).
Since a string only appears once in the string table though, even if you've removed all reference to that string there isn't much benefit to deleting it (it's not taking up much space). You can just leave it around, or worse case (if you are removing it for some kind of privacy reason or just want to simplify) you can just zero it out.
At some point later (like when the entire file is loaded and resaved) the string table and all references can be cleaned up. Excel (for example) will recreate the string table whenever the file is saved. Other applications can do the same if they want.
-Brian - Anonymous
May 11, 2006
The comment has been removed - Anonymous
May 11, 2006
Adam,
Don't forget the file size and parse times are significantly different. You don't parse the zipped XML, you instead unzip and then parse. The longer the tag names, the longer it takes you to parse the file.
-Brian - Anonymous
May 11, 2006
Adam brought up an interesting point. Is MS Office using DOM or SAX when parsing the XML files? As I understand it, it is using SAX.
Assuming it is using SAX, how does Excel resolve references to other cells, particularly to sheets not yet loaded? In the binary format there used to be a way to jump around rows in groups of 32 using indexes/dbcells, so you could get to a cell on another sheet pretty quickly directly from the file. Is there any corresponding shortcut in the XML format, or does one have to parse the entire sheet to get at the desired source of the reference? Given a worse case scenario, you may end up having to read all sheets just to load the first one.
This final question probably needs to be directed at the Excel people, but it's still format related. I noticed that diagonal borders still cannot be formatted independently of each other. What is so difficult about allowing the up diagonal to be blue and the down diagonal red? This seems like a most ridiculous limitation :) I assume this is because in the binary format there was no way to store it, but with the new format this should be really easy. - Anonymous
May 11, 2006
So, does that mean it's actually impossible to save strings on the sheet itself, or just that by default Excel will store strings in a table?
If you can't put strings on the sheet, that would mean that in general you're going to need at least two XSLT transforms if you're converting some other XML, since you're having to output two XML trees. :/ - Anonymous
May 11, 2006
Just curious, what are the .DWG resources you mention here? Just made up? Autodesk DWG files? Or something particular to Excel or OpenXML/SpreadsheetML?
thanks - Anonymous
May 11, 2006
Andrea,
You are correct, we use SAX for parsing the files. The references to other cells are not fully resolved until the entire book is loaded. I'll try to provide some more information for you around what you would do if you only wanted to read one sheet, but the short of it is that yes, depending on to what level you want to get at, you may end up having to parse all sheets in order to properly display just one (there is a last cached value on most things though, so as I said it depends on what level you want to go to).
Alex,
No, you don't have to use the string table. It's used by Excel (and I would assume other Spreadsheet applications that care about performance optimizations), but it is not required. You can instead just do the following if you want the string inline:
<c t="inlineStr"><is><t>my string</t></is></c>
-Brian - Anonymous
May 11, 2006
Oli, those are just made up. I grabbed an example file and just trimmed it down to a couple of rows. Not sure what they were resources for. The original file was from this article: http://blogs.zdnet.com/Ou/?p=120
-Brian - Anonymous
May 11, 2006
Brian, by "unzip and then parse", does that mean you write an uncompressed copy of the zip file to a temp location and then read that copy with the parser? - Anonymous
May 11, 2006
Hey Adam, sorry if I wasn't clear. I'm not implying that you need to write to a temp file (we don't). I mean that you need to uncompress the XML before you can start to parse it. This could all be done in memory, but either way at the end of the day you are parsing XML, not the binary blob the results from the compression. I'm not talking about the hit it takes to read from disk, I'm talking about the hit it takes to actually parse over the text as you interpret the tags, attributes, and values. The longer the tag names, the more characters you have to look at to determine the tag name.
-Brian - Anonymous
May 11, 2006
The comment has been removed - Anonymous
May 12, 2006
Hey Adam, those are some great questions. Some of this is actually at a much deeper level than I deal with, but I've been talking with some folks on my team about it. Each application has slightly different approaches, and rather than write it up in comments I think what I'll do is pull together more of the details and write a new post about it. Sound good?
-Brian - Anonymous
May 12, 2006
Looking forward to it. - Anonymous
May 14, 2006
I would agree with Adam somewhat.
This just sounds like another layer of compression designed to get the data set size down a bit to get the parser through it quicker.
The extra indirection means that a string for any given cell could be in two places, which sounds like reading OpenXML with simple tools (e.g., XSL) is going to be a pain :( I guess at least you can ignore the string table for output, which reduces the pain somewhat. - Anonymous
May 16, 2006
Surprisingly, I haven't seen much information out there discussing the performance impacts of XML tag... - Anonymous
May 16, 2006
Adam, sorry that I didn't get to this right away. I had a couple things I wanted to verify. Here is some more information: http://blogs.msdn.com/brian_jones/archive/2006/05/16/599096.aspx
Alex, many spreadsheet applications actually practice the use of string tables even for their internal memory structures. It's much faster, especially if you are talking about a large spreadsheet. Image a table on cencus data where you have a column called "state", and "North Dakota" appears 50,000 times. Do you want to read in the string "North Dakota" that many times? Or would you rather have a string table of the 50 states, and each cell just have the value of which row it is in the string table (ie "15"). You're right that compression can solve the same issue, but that doesn't help you when it comes down to actually parsing the XML as it's being decompressed.
-Brian - Anonymous
May 29, 2006
I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML.... - Anonymous
June 05, 2006
As we move forward with the standardization of the Office Open XML formats, it's interesting to look... - Anonymous
July 11, 2006
There were a lot of great comments from last week's announcement about the creation of an open source... - Anonymous
March 15, 2007
I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML.