Binary Documentation (.doc, .xls, .ppt) and Translator Project Site are now live

As promised last month, the binary documentation (.doc, .xls, .ppt) is now live. In addition, the project to create an open source translator (binary -> Open XML) has now been formed on SourceForge, and the development roadmap has been published. Read my earlier post for more background on this: https://blogs.msdn.com/brian_jones/archive/2008/01/16/mapping-documents-in-the-binary-format-doc-xls-ppt-to-the-open-xml-format.aspx

Here's an overview of what's now available:

Office Binary (doc, xls, ppt) Translator to Open XML

The "Office Binary (doc, xls, ppt) Translator to Open XML" project is now live on SourceForge: https://b2xtranslator.sourceforge.net/

As you may remember, this was a request from a number of national bodies, and while Ecma TC45 believed it was outside of the scope of DIS 29500, they did talk with Microsoft and come to this agreement:

Nonetheless, Ecma International discussed this subject with Microsoft Corporation, the author of the Binary Formats. To make it even easier for third party conversion of Binary Format-to-DIS 29500, Microsoft agreed to:

  • Initiate a Binary Format-to-ISO/IEC JTC 1 DIS 29500 Translator Project on the open source software development web site SourceForge (https://sourceforge.net/ ) in collaboration with independent software vendors. The Translator Project will create software tools, plus guidance, showing how a document written using the Binary Formats can be translated to DIS 29500. The Translator will be available under the open source Berkeley Software Distribution (BSD) license, and anyone can use the mapping, submit bugs and feedback, or contribute to the Project. The Translator Project will start on February 15, 2008.  
  • Make it even easier to get access to the Binary Formats documentation by posting it and making it available for a direct download on the Microsoft web site no later than February 15, 2008. The Binary Formats have been under a covenant not to sue and Microsoft will also make them available under its Open Specification Promise (see www.microsoft.com/interop/osp) by the time they are posted.

We will modify DIS 29500 to include an informative reference to the SourceForge project.

While the project is still in its infancy, you can see what the planned project roadmap is, as well as an early draft of a mapping table between the Word binary format (.doc) and the Open XML format (.docx).

Microsoft Office Binary (doc, xls, ppt) File Formats

The binary documentation itself is available up here: https://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

  • Word 97-2007 Binary File Format (.doc) Specification PDF | XPS
  • PowerPoint 97-2007 Binary File Format (.ppt) Specification PDF | XPS
  • Excel 97-2007 Binary File Format (.xls) Specification PDF | XPS
  • Office Drawing 97-2007 Binary Format Specification PDF | XPS

It's all covered under the Open Specification Promise.

Another Surprise

Another great surprise in all of this is that we've made the documentation for a few other supporting technologies available as it may be of use to folks implementing the binary formats: https://www.microsoft.com/interop/docs/supportingtechnologies.mspx

The technologies included are:

  • Windows Compound Binary File Format Specification PDF | XPS
  • Windows Metafile Format (.wmf) Specification PDF | XPS
  • Ink Serialized Format (ISF) Specification PDF | XPS

These technologies are also all available under the Open Specification Promise.
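For anyone who wants to poke at these files right away, here is a minimal sketch of how you might tell the two families of containers apart by their leading bytes. It assumes the well-known 8-byte signature that opens a compound binary file (the container used by .doc, .xls, and .ppt) and the ZIP local file header signature that opens an Open XML package. The function names `sniff_office_container` and `sniff_office_file` are hypothetical, made up for this illustration, and are not part of the specifications or the translator project.

```python
# Hypothetical helpers for illustration only; not taken from the specs or the translator.

# Signature at the start of a compound binary file (the .doc/.xls/.ppt container).
COMPOUND_FILE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# ZIP local file header signature; Open XML packages (.docx, .xlsx, .pptx) are ZIP files.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_office_container(data: bytes) -> str:
    """Classify leading bytes as 'compound-binary', 'zip-package', or 'unknown'."""
    if data.startswith(COMPOUND_FILE_MAGIC):
        return "compound-binary"
    if data.startswith(ZIP_MAGIC):
        return "zip-package"
    return "unknown"


def sniff_office_file(path: str) -> str:
    """Read the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_office_container(handle.read(8))
```

Note that this only identifies the container. It does not say whether a compound file holds a Word, Excel, or PowerPoint document; for that you would need to read the streams inside it as described in the specifications.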

Have a great weekend everyone!

-Brian

Comments

  • Anonymous
    February 15, 2008
    Wow! Who would have thought... But you just know the spin will be a spinnin' soon.

  • Anonymous
    February 15, 2008
    Brian, The .doc specification document title is: MICROSOFT OFFICE WORD 97-2007 BINARY FILE FORMAT SPECIFICATION. However, there is nothing there that describes Word 2007 stuff. The documentation is barely up to Word 2003. At a glance, there are no FIB and no DOP records related to Word 2007.

  • Anonymous
    February 15, 2008
    Yesterday I noted that the Office Binary <-> OpenXML translator project had started on SourceForge

  • Anonymous
    February 15, 2008
    @Andre, I think the 15th was the expected start date of the project.

  • Anonymous
    February 16, 2008
    It will be interesting to see how the project progresses. On the SourceForge page: "Milestone 2 June 30th, 2008: Final Word translator". I think that is way too optimistic. I can't go by without predicting that it will be far from final by that date. Mappings between some DOC and DOCX structures are not straightforward. For example, revisions and table formatting are written very differently, and it will take time to get them right even with help from Microsoft. It is more likely to take a year than 5 months to get to the first final release.

  • Anonymous
    February 18, 2008
    Why should anybody use a M$ format? There is an open and free standard with good documentation: ODF. The only reason for implementing the specs of DOC, XLS, PPT is to get better import filters in OpenOffice.org.

  • Anonymous
    February 18, 2008
    Standard Bodies should request the mapping between DOC and DOCX, otherwise delete those functions from the proposed standard. Now they are in a good position to do so, since you have published some documentation about the DOC format. But the documentation about the DOCX format is still missing...

  • Anonymous
    February 18, 2008
    As promised, Brian Jones has announced the posting of the Microsoft Office Binary Format specs in this

  • Anonymous
    February 18, 2008
    Brian Jones has announced the posting of the Microsoft Office Binary Format specs on this blog. Along

  • Anonymous
    February 18, 2008
    Binary documentation and translator project. On Friday Brian Jones covered the availability of the Office

  • Anonymous
    February 18, 2008
    Orcad, I'm not sure what to make of your confused comments. You say: "Standard Bodies should request the mapping between DOC and DOCX". If you had read Brian's blog, you would realize that they have already considered and rejected this. You say: "...otherwise delete those functions from the proposed standard". Do you mean delete the information about the mapping from the OOXML specification? It's not there in the OOXML standard to be deleted! (And it is a standard, ECMA, by the way.) You say: "But the documentation about the DOCX format is still missing..." From where? It's documented in the OOXML standard.

  • Anonymous
    February 18, 2008
    Brian Jones, Senior Program Manager just broke the news in his post today. Quoting from him: "As

  • Anonymous
    February 18, 2008
    Brian, I commend you and Microsoft for releasing this information. Hopefully, groups like OpenOffice.org and KOffice can get favorable rulings from their legal eagles that will allow them to improve binary compatibility. While I still disagree about the necessity for OOXML, I have always believed in giving thanks where it is due.

  • Anonymous
    February 18, 2008
    Just a quick note: before those specifications were made freely available, you had to email Microsoft. In fact, the steps are described here: http://support.microsoft.com/kb/840817 "Microsoft Office Binary File Formats Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program." See, there is one binary file format mentioned that is not being made freely available: VBA. What's the logic? VBA is certainly part of the binary formats (not just Office 97-2003, but Office 2007 as well: see the .bin parts). Another specification that would be welcome is the new encryption specification (Encryption stream, etc.). This is not documented anywhere either, to the best of my knowledge. Then, we have all the application-level bits. But I'll leave that for later. The boundary between document-level interoperability and application-level interoperability is unclear due to the amount of application-level bits that are stored in files (for instance, all those bits that are used to pre-check options). Just a quick comment on the specifications.

  1. It would be handy to have a TOC-browsable version of it. As in MSDN Library, screenshot: http://www.arstdesign.com/BBS/picsupload/Office97doc.gif If you guys have a .doc/.docx version of the files, some people could convert them to .html, then .chm (compressed html help). As for the specifications, it is obvious that there remain substantial unspecified or missing records, but I believe these are the true Microsoft internal specs. The reason I think so is that when you have the source code, those specs are handy references. The problem in this scenario is, obviously, when you don't have the source code...
  2. I believe those specs were really incrementally updated as each major Office version shipped. In fact, the portions covering the Office 97 formats did not change at all (including typos and errors), and sections were added to account for releases 2000/XP/2003 and 2007 (compatibility mode).

  • Anonymous
    February 20, 2008
    In case anyone missed it, there are some interesting comments from Joel Spolsky: http://www.joelonsoftware.com/items/2008/02/19.html

  • Anonymous
    February 20, 2008
    Brian Jones Open XML Formats Binary Documentation (.doc, .xls, .ppt) and Translator Project Site By any chance, the news

  • Anonymous
    February 21, 2008
    For folks excited by the availability of the binary file format specs last week, but concerned about

  • Anonymous
    February 21, 2008
    This information is just promises that come too late, made only for the purpose of getting the bad OOXML format accepted as an ISO standard. And the translation project on SourceForge also consists only of promises that may well never be fully or conveniently implemented. The deadline, which is really short, is after the ISO vote! In French we would say "ceci ne sert qu'à noyer le poisson" (literally, "this only serves to drown the fish", that is, to muddy the waters). And we do not want to be those fish.

  • Anonymous
    February 22, 2008
    It seems that the OOXML format was designed with backward compatibility with the binary formats in mind. Now I wonder what that means. I take it as: easy and faithful conversion to the new formats. Pro: upgrading existing implementations of the binary formats to support OOXML is easier. Con: if OOXML makes it as a standard, then all future implementations of OOXML will have to support parts of the old binary formats! Such a choice therefore seems profitable in the short term for the existing implementations of the binary formats. Frankly, the only one I know of is Microsoft's. Load any .doc or .xls file in OpenOffice and you'll see what "messed-up layout" really means. Stephane above and other people on this blog comment on the shortcomings of the specs of the binary formats. They could be the cause of the difficulty in recreating the exact layout of MS Office documents. I am worried about the long term implications of this choice. Any government committing to the choice of a standard does so in the hope of having its archives readable for centuries. If that means having to fill the gaps in the specs and trying to emulate a decade of quirks, this is clearly not going to benefit anyone in the long term. If Microsoft is truly concerned about the long term interest of its customers, it should design its standard as clearly as possible, in a way that guarantees the possibility of writing a fully compliant implementation from scratch in 200 years. And I think that means either dropping the "backward compatibility" with binary formats or making the latter (and any associated quirks) become an international standard too. The reasonable choice would obviously be a clean break with the binary formats. Why won't Microsoft do it? The answer is simple. The long term interest of its customers is to be able to easily develop fully compliant implementations from scratch in the future. In the short term, that would allow competing products to become fully compatible with MS Office. Which is precisely what Microsoft has been fighting against for years.

  • Anonymous
    February 22, 2008
    Microsoft has now released documentation for the Office binary formats (.doc, .xls, .ppt) in addition to kicking off the project for an open source binary to Open XML converter (.doc to .docx). They threw in WMF for good measure. The...

  • Anonymous
    February 25, 2008
    This is a great blog about Microsoft formats. Thanks for the information. I have a few questions... The Office binary format specification for Excel 97-2003 mostly covers only the file format for Excel workbooks. For workspace files, it says "Excel creates several other files, some of which are documented in this material. The workspace file (.XLW extension in Microsoft Windows) and the toolbar file (.XLB extension in Microsoft Windows) are not covered in this document. The files are used to configure Excel's UI and do not contain user data." This information is too coarse for people who have to archive and process files generated by Excel. Is there any documentation about the Excel Workspace (.xlw) and Excel Template (.xlt) file formats? In addition, what is the difference between the XML spreadsheet format (.xml) generated by Excel 2003 versus the current OOXML format for spreadsheets (SpreadsheetML)? Are they different XML schemas? How compatible are they?

  • Anonymous
    February 27, 2008
    Now if Microsoft would just release the Visio format documentation, all could be happy in the world. Well, maybe not quite that far, but it'd be a step.

  • Anonymous
    February 28, 2008
    "The workspace file (.XLW extension in Microsoft Windows) and the toolbar file (.XLB extension in Microsoft Windows) are not covered in this document." Microsoft probably considers those file formats to be application-level, as opposed to document-level. This can be argued, obviously. But it should be remembered that anything application-level remains undocumented. For instance, VBA remains undocumented.

  • Anonymous
    May 12, 2008
    I'm catching up with a bunch of Open XML blogging from ages ago, so apologies if some of these are old