File size reduction for Open XML
I spend a lot of time working on the adoption of the Open XML Formats.
For IT organizations, migrating document formats in Office can be a daunting task, and the benefits are not always immediately obvious. Microsoft has spent a fair bit of time on tools and guidance to make the introduction of Open XML easier, and I'll dive deep on those in future posts. But I wanted to use this opportunity to discuss one of the primary reasons why you should let Open XML in, and how it can help. This will be the first in a three-part series on file size reduction, document "sanitization" and improvements in document format security.
A tangible benefit of Open XML is file size reduction. Reducing file sizes means lower storage costs and reduced bandwidth consumption. Particularly for those paying for bandwidth on a meter, this can be quite helpful.
Why are Open XML files smaller? With Open XML and the Open Packaging Conventions, the file architecture is much more modular, and the parts of the file are stored in a compressed ZIP archive. XML content lends itself very well to ZIP compression, so we do see great results for text-intensive files like word processing documents and spreadsheets. The benefits don’t translate as well for presentation files, because those tend to be image-intensive (and images, which are usually compressed already, gain little from ZIP compression), but even those are smaller.
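Because an Open XML file is an ordinary ZIP archive, you can look at the individual parts and how well each one compresses. Here is a minimal Python sketch using only the standard library; the function names `part_sizes` and `total_ratio` are made up for this illustration and are not part of any Office tooling.

```python
import zipfile


def part_sizes(package):
    """Return (part name, uncompressed bytes, compressed bytes) for each part.

    `package` is a path or file-like object for a ZIP-based package,
    such as a .docx file.
    """
    with zipfile.ZipFile(package) as archive:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in archive.infolist()
        ]


def total_ratio(package):
    """Compressed size as a fraction of uncompressed size, over all parts."""
    sizes = part_sizes(package)
    uncompressed = sum(raw for _, raw, _ in sizes)
    compressed = sum(packed for _, _, packed in sizes)
    return compressed / uncompressed if uncompressed else 1.0
```

Calling `part_sizes("report.docx")` on one of your own files (the file name here is hypothetical) should list parts such as the main document XML alongside any embedded images, which makes it easy to see which parts benefit from compression and which do not.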
The data in this post is a preview of a more comprehensive study we’re working on, but I thought I’d share some of the early returns. There’s no real magic in the study; it’s a pretty simple project. If you want to try this for yourself, you can do what we’re doing: use your favorite search engine / content store to retrieve 100 documents in each of the word processing (Word 97-2003), spreadsheet (Excel 97-2003) and presentation (PowerPoint 97-2003) formats, and convert them to Open XML. Results will always vary slightly depending on your data set, but they should be somewhat consistent with what we’re showing here.
You can do the document conversion using the desktop products, or the Office Migration Planning Manager (and the Office File Conversion tool, specifically), which has a command line interface. Other conversion tools are also available. Quality / results will vary depending on the translation environment.
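Once the files are converted, collecting the before and after sizes is straightforward. The sketch below assumes (this is my assumption, not a requirement of any conversion tool) that each converted .docx sits in the same folder as its source .doc and shares its base name; `paired_sizes` is a hypothetical helper name.

```python
from pathlib import Path


def paired_sizes(folder):
    """Map each base file name to (.doc bytes, .docx bytes).

    Only files that exist in both formats in `folder` are included.
    Assumes lowercase extensions and matching base names.
    """
    folder = Path(folder)
    pairs = {}
    for binary in sorted(folder.glob("*.doc")):
        converted = binary.with_suffix(".docx")
        if converted.exists():
            pairs[binary.stem] = (binary.stat().st_size, converted.stat().st_size)
    return pairs
```

The result is a dictionary you can feed into whatever summary you prefer, for example per-file differences or percentage reductions.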
This post will only discuss the Word documents converted using Word 2007, but the data will illustrate the survey results clearly.
Word Documents: Converting .DOC to .DOCX

| | "docx" Sizes | "doc" Sizes | Size Change | Storage Gain |
|---|---|---|---|---|
| Median | 30Kb | 69Kb | 29Kb | 52% |
| Minimum | 11Kb | 20Kb | -2Kb | -2% |
| Maximum | 559Kb | 975Kb | 784Kb | 87% |
| 25th percentile | 18Kb | 35Kb | 15Kb | 40% |
| 50th percentile | 30Kb | 69Kb | 29Kb | 52% |
| 75th percentile | 76Kb | 160Kb | 67Kb | 62% |
A median size reduction of 52% for documents is quite significant, and translates to real savings for disks and network traffic. We can assume a linear correlation between document size and the number of packets transmitted over a network; therefore we can assume a similar result in bandwidth consumption (bandwidth consumption data will be published in the final paper as well.)
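If you collect your own size pairs, summary figures like these can be reproduced with a few lines of Python. This is a sketch under my own assumptions: it treats the reduction as a per-file percentage of the original .doc size and summarizes that column on its own, which would explain why the columns in a table like the one above do not simply subtract from one another. The sample numbers are made up and are not the study's data.

```python
from statistics import median, quantiles


def reduction_stats(pairs):
    """Summarize size reduction for (doc_size, docx_size) pairs.

    Returns the median per-file reduction and the 25th/50th/75th
    percentiles, all as fractions of the original .doc size.
    Needs at least two pairs, because quantiles() does.
    """
    reductions = [(doc - docx) / doc for doc, docx in pairs if doc > 0]
    q25, q50, q75 = quantiles(reductions, n=4)
    return {"median": median(reductions), "p25": q25, "p50": q50, "p75": q75}


# Made-up sizes in KB, for illustration only (not the study's data).
sample = [(100, 40), (200, 120), (50, 45), (400, 100), (80, 40)]
stats = reduction_stats(sample)
```

With this made-up sample the median per-file reduction is 50%, while comparing the column medians (100 KB for .doc, 45 KB for .docx) would suggest 55%. The two approaches answer slightly different questions, so it is worth deciding which one you mean before quoting a number.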
Don’t believe it? Try this simple test:
Create a simple document in Word 2007. A great way to generate sample text in Word is by using a formula: “=rand(10,5)”, where 10 is the number of paragraphs in your document, and 5 is the number of sentences per paragraph. You can use this formula to generate documents of increasing length. In doing so, the benefit of compression in Open XML becomes instantly clear. I conducted this test 5 times, on documents ranging from 10 paragraphs of text to over 60 pages. (I have attached them here for you to use.)
I simply added the text, saved the file in binary format first, then saved the file again as Open XML. There is no formatting beyond my default template: no tables, images or anything other than simple paragraphs. As the documents increase in length, the benefit of compression is obvious:
| Sample file name | .doc size | .docx size |
|---|---|---|
| Test 1 | 31k | 11k |
| Test 2 | 86k | 13k |
| Test 3 | 147k | 15k |
| Test 4 | 269k | 18k |
| Test 5 | 513k | 26k |
If you’re a graph type, we can make the relationship more clear:
This isn’t to say that 5,000 page documents stored using Open XML are going to be 1 – 2 % of their original size, but this is to point out that it is very easy to demonstrate real space savings with Open XML. Depending on the nature of the documents you are creating, especially if they are text-intensive, the size difference can be quite dramatic.
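For those who prefer numbers to graphs, the five test results above can be turned into ratios with a short Python snippet. The sizes are copied from the test table; `docx_share` is just a name chosen for this sketch.

```python
# Sizes in KB, copied from the table of =rand() test files above.
tests = {
    "Test 1": (31, 11),
    "Test 2": (86, 13),
    "Test 3": (147, 15),
    "Test 4": (269, 18),
    "Test 5": (513, 26),
}


def docx_share(doc_kb, docx_kb):
    """Size of the .docx as a percentage of the .doc, rounded to one decimal."""
    return round(100 * docx_kb / doc_kb, 1)


shares = {name: docx_share(doc, docx) for name, (doc, docx) in tests.items()}
for name, share in shares.items():
    print(f"{name}: .docx is {share}% of the .doc size")
```

The share falls from roughly 35% for Test 1 to roughly 5% for Test 5, so for these particular files the relative saving grows with document length.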
We’ll eventually publish the full data set in a more detailed (and scientific) white paper, which is due out in late January. But as an introductory post, I thought I’d make this an easy one, with a pretty clear benefit. I’ll let you work out the math for your own storage & bandwidth savings, but if you can ask yourself “what would I gain if my files were half of their current size?” – I’ll bet the answer will usually be a good one.
Comments
Anonymous
January 01, 2003
It's been quite a year for those who have been blogging about the Open XML file formats. Here's
Anonymous
January 01, 2003
Some blogs you wait a long time for and here is one. Gray Knowlton, Group product manager in Office
Anonymous
January 01, 2003
People often ask me how much smaller Open XML documents are than corresponding Office binary documents.
Anonymous
January 01, 2003
Hi Dave, This is interesting feedback. Quite right that mileage will vary by the specific file in question, the Zip algorithm used as well as transmission modes. And the results you're seeing with the binary compression are similar to (and partly the reason for) the existence of the new XLSB format -- a new Binary for Excel 2007 that uses the OPC like Open XML, but uses binary parts instead of XML parts. So, acknowledging that one could potentially do more to shrink the document sizes using compression tools, we put the size reduction side by side with the modularity, extensibility, reduced data corruption, custom schema support and so on... where we are today with Open XML is a pretty good spot. But we'll definitely work to improve this for the future.
Anonymous
January 01, 2003
One of the most readily measurable benefits of Open XML is the reduction in file sizes. Even in times
Anonymous
January 01, 2003
Hi David, Correct that the "=rand" formula for Word pulls text from the help file, resulting in much repeated text. The size reduction research is based on a more realistic data set, so yes, the results will reflect what is stored in existing documents. As for this test, repeated text or not, the size of the XML format compares very well to the (identical content in the) binary format; so in the comparative sense, a valid observation.
Anonymous
December 18, 2007
In a recent test I found that a text-only Office 97 Word document compressed to about 75% the size of the MSO-XML version of the same text. The zipped version of "Test 5 Binary.doc" is 15kb, vs 26k in the MSO-XML format. This is a 42% reduction and vice-versa the MSO-XML version is 170% the size of the MSO binary version. Zipping the MSO-XML format document shrinks it to 23k, so there is some room at the bottom. You can do the math on 23/15. As to saving bandwidth, aren't some transmission modes already compressing the data? I recall that compressing an ideally compressed file can make it larger.
Anonymous
December 20, 2007
With XML the compression should be good, relative to arbitrarily chosen data, given the small character set (~100 out of 256) and the significant repetition of tags that's likely for the body of the document. I had asked another MS blogger almost the same question, but at the time it had not occurred to me to re-zip the .docx file. My question then was what additional information was in the .docx that left it so much larger than the .doc-zipped file. There was an unsatisfying answer. Ideal compression removes all redundancy and leaves only information. A bigger ideally compressed file should generally have more information. Now I am curious if information (unused placeholders?) is left out of the .docx that was in the .doc. Perhaps it's the lack of ideal compression. <ramble> File and data handling has been of some interest to me because of the number of times vendors have (burned me) provided unsatisfying results. Like modems that convert all LFs to LFCRs. Had to write a program that figured out where all the LFs were, pass that info separately, and then use that file to take out the neighboring CRs to get binary files back in shape. Or a printer adapter that specifically filtered a useful character. No soft fix for that. Company A told us (many users) there was no way to find the reason behind a software fault. Fortunately, two things occurred. I got a copy of the format manual and I didn't know how the entire thing was supposed to work. So my first attempt failed to run right. But it left a little file, which I read and found within the information Company A said could not be determined. Turns out, in their implementation, they delete that file as part of cleanup. Sigh. It did not require a fix, Company A just didn't know how the software they bought worked either. (Same software as TRON used - from MAGI Corp. Clever name.) <ramble>
Anonymous
January 03, 2008
"Create a simple document in Word 2007. A great way to generate sample text in Word is by using a formula: “=rand(10,5)”, where 10 is the number of paragraphs"
As far as I can see, this causes the same sentences to be repeated multiple times, and the entire document to consist of these repetitions. This makes a very bad test for the file size testing as it will give highly exaggerated figures for zip compression (or pretty much any other text compression for that matter). Your first suggested mechanism, of sampling real documents, should (given a good enough sample) give much more realistic results.