Spreadsheet performance - Shared Formulas
I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML. There's a whole host of things we've done to make sure that the move to XML formats is a huge benefit to developers, without it actually having a significant negative impact on our end users. Moving our formats into an open XML format significantly increases the relevance and value of Office files because they can have more interactive roles with business processes and systems. If the files aren't open, then they can't fit into as many scenarios and we lessen their value. That's was the whole reason we started the move to XML formats so many years ago, because we wanted to significantly increase the value of Office documents.
The big key here though is to make sure that people will actually use the formats. If our users decide to stick with the old binary formats, then we lose out on that opportunity. So it's really important to make sure that for the average end user (who doesn't really care or put any thought into file formats) doesn't see any significant losses from moving into the new XML formats.
A big area for Spreadsheets is performance. It was really scary to move from the Excel binary formats that were so damned fast into an XML format that we knew couldn't match up. There was a lot of work though analyzing every aspect of file load times to make sure that we could keep the performance drop to an absolute minimum, and we actually have really done a great job. We are absolutely trying to design for the developer, but the priority is of course given to the end user experience when it comes down to difficult design decisions.
I already talked about a few things we've done like keeping tag and attribute lengths small. I also mentioned the shared string table and I'll go into more detail on the benefits of that later. Today I want to discuss some optimizations we've done around formulas.
Shared Formulas
Most folks who've done any type of spreadsheet work are familiar with formulas. More relevant to this post though, you are probably also familiar with using a common formula in a column to make a similar calculation for each row. For example, image the following table:
Product ID |
Price |
Quantity |
Total |
1 |
5.45 |
9 |
49.05 |
2 |
3.99 |
15 |
59.85 |
Most likely, that fourth column is a formula so that if you could look at the formulas, the table would look like this:
Product ID |
Price |
Quantity |
Total |
1 |
5.45 |
9 |
=B2*C2 |
2 |
3.99 |
15 |
=B3*C3 |
Now, for this case, it probably doesn't look like there are a lot of optimizations we can do around the formula storage. The XML for this table would look something like this (in shorthand):
<
table>
<row>
<c><v>Product ID</v></c>
<c><v>Price</v></c>
<c><v>Quantity</v></c>
<c><v>Total</v></c>
</row>
<row>
<c><v>1</v></c>
<c><v>5.45</v></c>
<c><v>9</v></c>
<c>
<f>=B2*C2</f>
<v>49.05</v>
</c>
</row>
<row>
<c><v>2</v></c>
<c><v>3.99</v></c>
<c><v>15</v></c>
<c>
<f>=B3*C3</f>
<v>59.85</v>
</c>
</row>
</table>
The above XML representation would be pretty good if the table were this simple. Imagine though if this were actually more of a real world spreadsheet. Let's say this is tracking a large number of orders, so maybe you have more like 10,000 rows instead of just 2. Well that would mean that you have to parse that formula and figure it out 10,000 times. This would be a huge performance hit, especially if the formula was more complex.
You've probably noticed in Excel that if you type a formula into the first row and then copy it down through all the other rows, Excel can automatically adjust that formula in each row so that it's making the right calculation (ie you put =B2*C2 in row 2 and when you copy that into row 3 it says =B3*C3). Well, why shouldn't we do the same things in the file format? No reason to write out the formula 10,000 times and have to parse it 10,000 times. A complex formula can take awhile to parse, so we need to optimize around as little parsing as possible. Here is what the XML would actually look like using this approach:
<table>
<row>
<c><v>Product ID</v></c>
<c><v>Price</v></c>
<c><v>Quantity</v></c>
<c><v>Total</v></c>
</row>
<row>
<c><v>1</v></c>
<c><v>5.45</v></c>
<c><v>9</v></c>
<c>
<f t="shared" si="0" ref="B2:B3">=B2*C2</f>
<v>49.05</v>
</c>
</row>
<row>
<c><v>2</v></c>
<c><v>3.99</v></c>
<c><v>15</v></c>
<c>
<f t="shared" si="0"/>
<v>59.85</v>
</c>
</row>
</table>
Notice that now the formulas say they are "shared" and they have an id of "0" so that we know they match up. The first instance of the formula stores the actual formula value, as well as specifies the range that it applies to so that as the rest of the spreadsheet is parsed, you know that you can ignore those other cells and just apply the shared formula.
I admit that it makes development a bit more difficult, but as I said before, we kind of had to take the training wheels off and actually think more about the end users as well. There are hundreds of millions of users, and we need to keep them in mind. We definitely want these formats to be accessible to developers (otherwise we should have just stuck with the old binaries), and that's why we are taking the Ecma standardization so seriously. We need to make sure that we fully document these formats so that anyone can build solutions around them. We are also looking to folks in the openxmldeveloper.org community to help out here by building tools that work on top of the formats. This way we get the best of both worlds. We get a very fast format that's also easy to develop against!
-Brian
Comments
- Anonymous
May 29, 2006
Brian,
You might want to fix your examples. In the tables row 2 and 3 are different,(Product 1 and 2). In the code examples you repeate the data from row 2 in row 3. - Anonymous
May 29, 2006
PingBack from http://niklasblog.com/?p=969 - Anonymous
May 29, 2006
Great catch John, thanks. I've updated the examples to match the tables... - Anonymous
May 29, 2006
The comment has been removed - Anonymous
May 29, 2006
Looks a lot like BIFF with angle brackets so far...
Case in point, what about business solutions that have to update formula values as part of the file update? Since this blog is about the so-called power of Xml and associated API, how do you achieve this without a running Excel instance? - Anonymous
May 29, 2006
"Since this blog is about the so-called power of Xml and associated API, how do you achieve this without a running Excel instance?" -- Mike
Perhaps invest in better coders? This appears to be very sensible and straightforward optimization, not rocket science by any stretch of imagination. C'mon, do you believe every user out there must suffer from performance drops because your coders cannot get their act together? Bleh. - Anonymous
May 29, 2006
Biff,
I think you missed my point. How can you possibly say you can perform calculations without relying on a running instance of Excel? You realize that in order to do that you end up rewriting Excel, right? - Anonymous
May 29, 2006
"How can you possibly say you can perform calculations without relying on a running instance of Excel? You realize that in order to do that you end up rewriting Excel, right?" -- Mike
Wrong. You're free to use any or all of Quattro Pro, OOo Calc, Gnumeric, KSpread, and more pure-bred calculation engines like KDCalc and SpreadsheetGear - plug them in and update your data. You might get different results, but we're talking file formats here, right? - Anonymous
May 29, 2006
I guess if you're doing string sharing etc. it's probably a nice idea to extend that to different parts, like formulas.
Brian, is there a way of turning these optimisations off individually or not? I'm just wondering if it's possible to try and actually get some numbers on what difference these changes make.
I have the same worry as Mike - the more complex it is to interpret a spreadsheet, the less useful the OXML format becomes IMHO. But, since OXML doesn't store result values (as far as I know?) you have to process the formulas anyway, so I guess this isn't much extra work on top of that.
Biff - you probably could use alternative engines for some spreadsheets. Gnumeric in particular has a highly compatible implementation, which contains all the functions Excel does and gives more accurate results. But the others are not as compatible.
Since OXML is standardising the functions available, maybe other spreadsheets will build in a compatibility module. It would be better if the formula stuff could be standardised on its own, though. - Anonymous
May 29, 2006
>> I think you missed my point. How can you possibly say you can perform calculations without relying on a running instance of Excel? You realize that in order to do that you end up rewriting Excel, right? <<
I feel like I'm in the twilight zone:
Group:We HATE Microsoft, err... Open Standards Advocates:
"We DEMAND Open Standards to encourage competition and to not lock out other vendors from developing their own versions of office software that can openly read and write to this format"
Group:Same as above:
"How can you possibly say you can perform calculations without relying on a running instance of Excel? You realize that in order to do that you end up rewriting Excel, right?"
Me: "Du doo du doo... Du doo du doo... Do not adjust your set... Do not... "
To quote Biff:
Bleh!
---
Okay, sarcasm aside... The point of an open standard is to ensure that anybody, regardless of who they are, has the ability to openly use this standard for the primary purpose of interop. Or in other words, no matter what tool you might choose to use, you can expect that this same document format will be accessible to open, edit, save, repeat, as long as that vendor of choice has implemented support for this open standard.
"You realize that in order to do that you end up rewriting Excel, right?"
Yep! And thats the whole point. - Anonymous
May 29, 2006
@Alex,
Hey, I'm diggin' your comments... VERY constructive... It's fantastic to see that you have taken to this position as you are obviously someone who has a lot of value to add to the conversation.
Learning things from folks is what I hope to do everyday of my life. Today, you help me learn something I didn't know yesterday.
Thanks! :D - Anonymous
May 29, 2006
Brian Jones: Open XML Formats : Spreadsheet performance - Shared Formulas via a comment to the above linked, Biff (unfortunately I don't have a link (Biff, if you happen to read this and have a preferred link, let me know... - Anonymous
May 29, 2006
@Biff,
http://www.oreillynet.com/xml/blog/2006/05/comment_of_the_day.html - Anonymous
May 29, 2006
Biff said "Quattro Pro, OOo Calc, Gnumeric, KSpread, and more pure-bred calculation engines like KDCalc and SpreadsheetGear"
I didn't know those apps had native support for importing/exporting .xlsx files in order to perform calculations outside of Excel. Thanks for the info.
;-) - Anonymous
May 30, 2006
"I didn't know those apps had native support for importing/exporting .xlsx files in order to perform calculations outside of Excel." -- Mike
So your attention span is under 25 words, is it what you wanted to say? <g> - Anonymous
May 30, 2006
That was sarcasm.
Btw, are you a Microsoft employee? I'm willing to know if someone of your "caliber" is inside or outside the fence. - Anonymous
May 30, 2006
"That was sarcasm. " -- Mike
Doubt it.
"Btw, are you a Microsoft employee?" -- Mike
No, I'm but a lowly Microsoft shill, can't you tell? - Anonymous
May 30, 2006
Alex: The spreadsheet has to contain the values shown for each formula. Doing a recalc may be very expensive or may even change the values (in case of RAND or iterative computations).
It appears you didn't notice that the formula cells have both an "f" tag (for the formula) and a "v" tag (for the value. - Anonymous
May 31, 2006
Hi Brian,
I just came across an interesting Word topic that would be great to hear on from you: Master Documents
Is there still a need for master documents in 2007 (meaning can it better handle documents with >500 pages, more than 32 MB of text, etc?) Do master documents work better (hence don't corrupt easily) when used with new XML file formats?
Thanks,
Patrick - Anonymous
June 05, 2006
As we move forward with the standardization of the Office Open XML formats, it's interesting to look... - Anonymous
June 08, 2006
By using a reference for relative identical formulars it makes it more difficult to read and understand an OOXML document on the fly.
You also have the problem when updating a document writing a "new" formular because you have to check all the existing formulars if one of them is relative identical - or is it only when the formulars are in the same or extended area like original in B2:B5 but now B1:B5.
You could also implement it in a way so you both have the reference and the formular so the application only have to read one of the two but have to save them both. Quickly to read but slower to write. - Anonymous
June 09, 2006
Claus, it is true that it's a bit more difficult if your goal is to go to a specific cell and analyze it without looking anywhere else. Otherwise though, it's fairly straightforward.
Your suggestion of duplicating the data isn't really useful unless we actually enforced that they matched. Otherwise a consumer wouldn't know which one to trust. We try to avoid data duplication unless it gives a perf gain, and even in that case we try to make sure we enforce that the duplicated data is valid (otherwise it's much harder for consumers to know which value to pick).
If you think about it, the shared formulas not only make things significantly faster, but they also make them easier. If you have 50,000 rows of data, and one of the columns is a calculation of some of the other columns, then it's much easier to update that formula in this case (you only need to edit one cell). Without the shared formulas, you'd have to edit all 50,000 instances of the formula.
-Brian - Anonymous
June 09, 2006
What about bugs in dates:
http://openxmldeveloper.org/forums/thread/280.aspx - Anonymous
July 11, 2006
There were a lot of great comments from last week's announcement about the creation of an open source... - Anonymous
March 15, 2007
I've blogged in the past about some of the things we did with the SpreadsheetML format to help improve