The myth of the Binary Key
I don't know where this myth began, but I have seen enough references to it at this point that I think it's time to call it out directly. There is no such thing as a binary key that you need to unlock the Microsoft Office XML formats. They are just pure XML files that are fully documented (and have been for a while now). This isn't something where I'm asking you to just trust me; instead you can go and look for yourself. Take Office 2003 and save any Word document you have as XML. Now open that file up in a text editor and take a look. (If you don't have your own copy of Office 2003, try this free lab online that lets you play with the XML functionality).
I'm trying to figure out how this rumor was started, and I have a couple ideas, so let's try and track this down. Let's talk a bit about the format so that you can understand what's there. Take any XML file saved from Word 2003:
- Processing Instruction: As I discussed in this post (https://blogs.msdn.com/brian_jones/archive/2005/07/07/436647.aspx), if you try to open the file in IE, it will most likely be redirected to Word for opening because we put the following declaration in a processing instruction at the top of the file: <?mso-application progid="Word.Document"?> If you want to open the file in IE, you'll need to delete that PI. Is that the mythic binary key folks are talking about? It doesn't affect the way the file is displayed. All it does is tell the shell that Word can open the file.
- Pretty Printing: As I discussed in this post (https://blogs.msdn.com/brian_jones/archive/2005/06/23/432018.aspx), if you open the file in a text editor, you'll see that it's pretty hard to read because we don't "pretty print" the file. You'll either need to remove the PI and open in IE, or open in an editor that has pretty printing built in (like FrontPage or Visual Studio). Maybe this is what has confused people into thinking there is a binary key? It's obviously not, though; skipping the extra whitespace is just a way of laying out the XML to make the files more efficient.
- Objects: Word allows you to embed images; video; ActiveX controls; and OLE Objects. These are all foreign to Word though, and when they are stored they need to be stored in their native formats. In 2003, we base64 encode them and store them in a binary tag. For Office 12, since we are using a ZIP package to store the files, we can just keep them as separate binary files within the ZIP (so a JPEG will just be a separate .jpg file in the ZIP). I really doubt this is the "binary key", since it really isn't even owned by us. Any format you create will need to store foreign objects, unless the application decides it's not going to support those features.
- Handful of obscure legacy features: There are a handful of obscure legacy features where certain pieces of the data are stored in a <binData> tag. We did this because of resource constraints when building the original XML file. An example of this would be some of our old legacy fields. We just weren't able to get to them, but we only did this for features where the use of them was very, very low. For Office 12, we've done the extra work so that even these features are now represented in XML. So if this is the binary key, then it will go away, but I highly doubt this would be the "binary key" people talk about as it occurs so rarely.
- VB Project: If you have code that is embedded in your document, that would also be stored as a binary object. This is an area I can understand that some folks might want to see stored as text, but we didn't go that route. In fact, we're moving away from storing code directly in the files in general, as I've already discussed in an earlier blog post (https://blogs.msdn.com/brian_jones/archive/2005/07/12/438262.aspx). The default format won't even have these objects so if this is the "binary key", it's going away. I highly doubt this is the "binary key" though as it has nothing to do with the document itself, just with solutions that run on top of the document, and the majority of documents out there don't have it anyway.
- Namespaces: Someone commented in my last post that the Office files have namespaces in them and if you change the value of the namespace the file behaves in a goofy way. Anyone familiar with XML knows what's going on here, but I understand that a number of you are new to XML. Namespaces are a very important part of the XML standard. They allow you to identify what type of XML you are working with. If it weren't for namespaces, it would be very difficult to work with XML files unless you had control over everything (their creation, storage, and consumption). The point raised here though is really an interesting one. Notice that if you change the namespaces around, Word can still open the file. This is because we support opening all XML files as a result of our custom defined schema support. You can take a WordML file and add your own XML tags in your own namespace, and we'll support opening them, validating them while the file is being edited, and saving them out. The namespace issue obviously isn't a "binary key", and it's one of the major building blocks of XML.
- Byte Order Mark (Unicode) [10/18/2005 - I added this one after it was brought to my attention by Dare] - Dare points out that it could be that some folks unfamiliar with Unicode are having problems with the Unicode BOM:
I wouldn't be surprised if the alleged "binary key" was just a byte order mark which caused problems when trying to process the XML file using non-Unicode savvy tools. I suspect some of the ODF folks who had problems with the XML file would get some use out of Sam Ruby's Just Use XML talk at this year's XML 2005 conference.
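To make the first two items in the list above concrete, here is a minimal Python sketch that reads a Word 2003 style XML file, reports any processing instructions sitting at the top, and pretty prints the content. The sample document is a hypothetical, heavily trimmed stand-in rather than real Word output, and the function name inspect is made up for illustration.

```python
import xml.dom.minidom

# Hypothetical, heavily trimmed stand-in for a file saved as XML from Word 2003.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<?mso-application progid="Word.Document"?>'
    b'<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
    b'<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>'
    b'</w:wordDocument>'
)

def inspect(raw):
    """Return (processing instructions, pretty-printed root element)."""
    doc = xml.dom.minidom.parseString(raw)
    # Processing instructions placed before the root element are plain
    # children of the document node, readable by any XML parser.
    pis = [(n.target, n.data) for n in doc.childNodes
           if n.nodeType == n.PROCESSING_INSTRUCTION_NODE]
    # Re-serialize with indentation so the single long line becomes readable.
    pretty = doc.documentElement.toprettyxml(indent="  ")
    return pis, pretty

pis, pretty = inspect(SAMPLE)
print(pis)
print(pretty)
```

The processing instruction shows up as an ordinary target and data pair, and the parser needs no special knowledge of it to read the rest of the file.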
My theory is that the "binary key" idea came about because someone just took a quick look at the file format without really doing their homework. For example, if you combine #2 and #3, you would probably see a binary blob in most files that appears to be at the top. The reason is that if the file has an image or some other kind of object in it, then, since the file isn't pretty printed, the first line break would come from the base64 encoded data. That would mean that it would look like there is some binary data right at the top. The weird thing here though is that some of the folks that were saying there is a binary key supposedly spent a lot of time looking at all kinds of document formats and investigated them in order to create a universal file format capable of representing every document that ever existed. I would think they would have looked a little closer and seen that there really isn't a "binary key" to unlock the documents. They are already unlocked.
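As a rough illustration of why that blob is not a key, the following Python sketch pulls the base64 text out of any element whose local name is binData and decodes it back to the original bytes. The element nesting, the w:name value, and the payload are assumptions made up for this example; only the binData tag name and the use of base64 come from the description above.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical payload standing in for an embedded image.
PAYLOAD = b"\x89PNG\r\n\x1a\n" + bytes(range(32))

# Hypothetical fragment; real files wrap the data in more markup.
SAMPLE = (
    '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
    '<w:body><w:p><w:r><w:pict>'
    '<w:binData w:name="wordml://example.png">'
    + base64.encodebytes(PAYLOAD).decode("ascii") +
    '</w:binData>'
    '</w:pict></w:r></w:p></w:body>'
    '</w:wordDocument>'
)

def decode_bin_data(xml_text):
    """Decode every binData element in the document to raw bytes."""
    root = ET.fromstring(xml_text)
    blobs = []
    for el in root.iter():
        # Compare on the local name so the sketch does not depend on the prefix.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == "binData" and el.text:
            # b64decode ignores the line breaks inside the encoded text.
            blobs.append(base64.b64decode(el.text))
    return blobs

blobs = decode_bin_data(SAMPLE)
print(len(blobs), blobs[0][:8])
```

Nothing beyond a standard base64 decoder is needed, which is the point: the data is the embedded object in its native format, not something that gates access to the document.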
To learn more, go check out the documentation. It's up there for free and anyone can download it. Or play around with the free labs. Or read my "Intro to Word XML" posts. The easiest way for us to have good discussions on these topics is for everyone to actually look into it themselves rather than relying on random news stories. I understand not everyone has the time to look into it, but unfortunately there is a lot of false information out there.
-Brian
Comments
- Anonymous
October 17, 2005
I think one of the things that bothered people about the XML format - though not, perhaps, this mysterious 'binary key' - is that the reference schemas you point to are available only packaged as an MSI file - these aren't executables or a full-scale software deployment, so why should they be packaged in a proprietary, Windows-only format? This sends a rather odd "You can only read the documentation if you're running Windows" message, which could be easily avoided by packaging the files in a platform-neutral format.
I'm guessing the reason for this packaging is to allow the files to be signed and verified, but I think it just generates pointless hostility without providing much value. - Anonymous
October 17, 2005
Hey Avner, I hear you on that. Did you see that for the Office 12 previews we actually provided two alternatives for the documentation? We provide the .msi file, but we also have a ZIP file that contains the XSDs and HTML files. Check that out and let me know if you like the format better: http://www.microsoft.com/downloads/info.aspx?na=46&p=2&SrcDisplayLang=en&SrcCategoryId=&SrcFamilyId=15805380-f2c0-4b80-9ad1-2cb0c300aef9&u=http%3a%2f%2fdownload.microsoft.com%2fdownload%2fb%2f5%2fb%2fb5b64679-4d6b-43ec-ba50-5891ca11cf15%2fOffice12XMLSchemaReference.zip
The main reason they had decided to go with that other approach back in 2003 was that most documentation we provide in Office had been done that way. They create a .chm file, which is what the msi installs (in addition to the .xsd files). It's typical for all of our help, and it gives you a pretty cool UI for navigating through all the topics. It of course has the negative impact of not working on all platforms, which is ironic given the fact that we want to allow everyone the ability to work with these files. So as you said, it does send a rather mixed message. I'm sorry about that.
I'll look into it some more and see if we can backport the solution we used for the O12 schemas and provide it for the Office 2003 schemas as well.
-Brian - Anonymous
October 17, 2005
Is it possible the "binary key" they're referring to is the UUID stored in the XML headers? It doesn't appear to have anything to do with the document, but if it's not used for something it wouldn't have been included. If that's the case, then as far as I can tell it can be removed or just ignored safely. - Anonymous
October 17, 2005
Hey Todd, that's actually just another namespace declaration. It's saying that "dt:" prefix maps to that specific namespace. You'll see that we don't really use that prefix on many (if any) elements.
Namespaces are just essentially URIs, so you can use a URL as we do for some of our namespaces, but it isn't required.
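A small Python sketch of that point: the prefix is only a local alias, and a namespace-aware parser keys everything on the URI string, whether it looks like a URL or a uuid. The property name Owner and the surrounding props element are hypothetical; the uuid value is the one from the dt namespace declaration.

```python
import xml.etree.ElementTree as ET

DT_NS = "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"

# Two hypothetical fragments that differ only in the prefix bound to the namespace.
WITH_DT_PREFIX = f'<props xmlns:dt="{DT_NS}"><Owner dt:dt="string">Todd</Owner></props>'
WITH_OTHER_PREFIX = f'<props xmlns:anything="{DT_NS}"><Owner anything:dt="string">Todd</Owner></props>'

def dt_of_owner(xml_text):
    """Look up the namespaced dt attribute by URI, ignoring the prefix."""
    root = ET.fromstring(xml_text)
    return root.find("Owner").get(f"{{{DT_NS}}}dt")

print(dt_of_owner(WITH_DT_PREFIX))
print(dt_of_owner(WITH_OTHER_PREFIX))
```

Both lookups return the same value, so the uuid acts as an opaque identifier and nothing has to be resolved or unlocked to read the attribute.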
BTW, in Office 12, we actually aren't using that namespace anymore anyway, so if folks did have an issue with it (which I can't really imagine), it will go away.
-Brian - Anonymous
October 17, 2005
Hi Brian, I've recently written about a problem with undocumented binary data which I believe will continue to exist in the Office 12 XML formats:
http://sixlegs.com/blog/java/please-document-emf-plus.html
It would be great if you could look into this. - Anonymous
October 17, 2005
Phil Windley on Microsoft, ODF, and state governments:
http://blogs.zdnet.com/BTL/?p=2024#comments - Anonymous
October 17, 2005
Berlind:
http://blogs.zdnet.com/BTL/?p=2020#more-2020
Hey you guys, I am posting all these pro-ODF links, why don't you post some anti-ODF links? I mean, let's make things even. - Anonymous
October 17, 2005
Until Brian gets the 2003 documentation in zip format like the Office 12 format documentation, interested parties can check the online SDK for information on Word 2003's XML format:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wordxmlcdk/html/welcomewordcdk_hv01147170.asp - Anonymous
October 18, 2005
The comment has been removed - Anonymous
October 18, 2005
The MSI file format is really just a cab file, and there are tons of open source cab extractors out there. For example cabextract:
http://www.kyz.uklinux.net/cabextract.php
The MSI file isn't preventing anyone from extracting the files, unless they just want to complain. - Anonymous
October 18, 2005
The comment has been removed - Anonymous
October 18, 2005
I am curious if this comment on Groklaw explains where the myth of the binary key originates:
http://www.groklaw.net/comment.php?mode=display&sid=20051016105739574&title=What%20Binary%20key%3F&type=article&order=&hideanonymous=0&pid=369059#c369217
As I read it, his point (less the rhetoric) is not that you need a binary key just to access the document, but that one cannot do useful transformations of the information in the document without the alleged binary key.
??? - Anonymous
October 18, 2005
Brian: what you say only convinces me further that that UUID is the "binary key" being referred to. A simple publication of the schema and notes in the Office 2003 XML documentation about how to identify the schema for the "dt" namespace should suffice to clear up the problem.
I'd also note that one of the conventions behind using URLs to identify schemas associated with namespaces is that canonically the XSD itself can be retrieved from that URL. This lets programs that don't natively know the schema retrieve and process it. MS doesn't appear to do this. - Anonymous
October 18, 2005
Wondering - I think you're right that the myth is you can't do transformations without the "binary key." The reason that's not true though is that there is no "binary key." Anyone can come along and build transformations, as I've been posting about for the past few months. Everything is represented as XML and fully documented. I think unfortunately this myth is being spread because people just haven't taken the time to look into it and instead are going off of assumptions.
-Brian - Anonymous
October 18, 2005
Todd, I don't really see how that namespace could be the "binary key" seeing as how it's fairly irrelevant for parsing, consuming, transforming, and generating documents. According to the articles this "binary key" somehow prevented people from writing transforms into and out of our format. The dt namespace just describes what the datatype is for custom document properties (that's hardly a "key" to the document). Custom document properties only exist for a document if you go to File -> Properties -> Custom Tab, and then add a property yourself.
You're right though that we should document it. Here's basically what the schema would be:
<?xml version="1.0" encoding="UTF-16" ?>
<xsd:schema targetNamespace="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"
attributeFormDefault="unqualified">
<xsd:simpleType name="dtType">
<xsd:annotation>
<xsd:documentation>Defines the datatypes of custom properties</xsd:documentation>
</xsd:annotation>
<xsd:restriction base="xsd:string">
<xsd:enumeration value="string"/>
<xsd:enumeration value="dateTime.tz"/>
<xsd:enumeration value="float"/>
<xsd:enumeration value="boolean"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
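To show how a consumer might use a schema like this, here is a hedged Python sketch that reads the enumeration values out of an XSD and flags custom properties whose dt value is not among them. The schema text is deliberately abbreviated to three values, and the props document, its property names, and the dt:dt attribute form are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
DT_NS = "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"

# Abbreviated stand-in for the schema; it lists only some of the values.
SCHEMA = f'''<xsd:schema targetNamespace="{DT_NS}" xmlns:xsd="{XSD_NS}">
  <xsd:simpleType name="dtType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="string"/>
      <xsd:enumeration value="float"/>
      <xsd:enumeration value="boolean"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>'''

# Hypothetical custom properties; "blob" is deliberately not a listed type.
PROPS = (
    f'<props xmlns:dt="{DT_NS}">'
    '<Budget dt:dt="float">12.5</Budget>'
    '<Mystery dt:dt="blob">x</Mystery>'
    '</props>'
)

def allowed_values(schema_text):
    """Collect every enumeration value declared in the schema."""
    root = ET.fromstring(schema_text)
    return {e.get("value") for e in root.iter(f"{{{XSD_NS}}}enumeration")}

def invalid_properties(props_text, allowed):
    """Return the names of properties whose dt value is not allowed."""
    key = f"{{{DT_NS}}}dt"
    root = ET.fromstring(props_text)
    return [el.tag for el in root.iter()
            if key in el.attrib and el.get(key) not in allowed]

allowed = allowed_values(SCHEMA)
print(sorted(allowed))
print(invalid_properties(PROPS, allowed))
```

This is not a full XSD validator; it only demonstrates that the namespace carries a short list of type names that any tool can read and check.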
In response to your second point, you are right that many folks use URLs for their namespaces and then either have the XSD, or links to resources at that site. We use URLs for most of ours and will continue doing so as we move forward. We haven't yet set up a site though with resources for those URLs. Instead the schemas are all on msdn right now. I've been somewhat pushing for us to do this for some time (it's never really been at the top of my list though), but we just haven't pulled together the resources yet for managing the site.
-Brian - Anonymous
October 18, 2005
Brian: the UUID may not be a "binary key" the way you're thinking of it, but it sure looks like one to someone outside the Windows internals world. It'd be similar to you seeing 'xmlns:dt="xlu:F37A019D"' in the xsd:schema tag. That number obviously means something or it wouldn't be there, but you've no idea what to do with that number unless I happen tell you it's just a random number that always refers to a certain schema, or that you use it as a 4-byte unsigned integer (x = hex, l = longword (32 bits), u = unsigned) key in a particular database and the value for that key's the URL to the schema.
I'd add one really important thing to the documentation of the "dt" namespace: what it's for. If it's just for attaching type information to user-defined properties, say so in the docs. That'll make it clear to people when they need to worry about that namespace and when they can safely just ignore it. I'd note that this is a regular complaint about Microsoft formats: there's always something in there that isn't documented anywhere, and you can't tell if you can just ignore it or if it's actually important. Eg. the Windows credentials in the optional credentials field in the Active Directory Kerberos implementation: it was filled in with something, there wasn't anything anywhere on what was being put there (at least until MIT's lawyers hauled out the LARTs), and ignoring it like every other Kerberos implementation did (it was an optional field, after all) caused Windows clients to fail for no readily-apparent reason. - Anonymous
October 18, 2005
I hear you Todd. It's definitely important to make it as clear and straightforward as possible. We aren't going to have this namespace in the O12 schemas, so it shouldn't be an issue going forward. I just wanted to make it clear that even for 2003 it isn't really an issue as it has a small presence and it's only used as I described.
Going forward I'll also talk to folks to make sure if there are other things like this they get properly discussed and documented so that it's clear there isn't a "key". :-)
-Brian - Anonymous
October 18, 2005
Todd, that is the way schemas have been found for years, and there is no requirement that a URL with a real schema-related resource be used. Look at the namespaces used for OpenDocument, for example, especially the urn-based ones.
[And I agree with Brian it is really great to have a resource that provides something authoritative about the namespace, such as the schema and anything else useful.]
It would be splendid if someone who ran into the problem of getting styles to convert would show what it was that they couldn't find that they thought was in a binary key. I'd like to see the key too. I don't understand why that's so hard. It would allow this to be cleared up. - Anonymous
October 18, 2005
Gary Edwards has another comment on the binary key. His position seems to be that there is one in 2003 ML. He thinks maybe it is going to be gone for Office 12 XML, but he doesn't care because he is doing SOA, and the vast majority of Microsoft desktops are not even on Office 2003.
www.groklaw.net/comment.php?mode=display&sid=20051016105739574&title=Arent+the+Open+XML+spec+available+on+the+MS+website%3F&type=article&order=&hideanonymous=0&pid=369237#c369577 - Anonymous
October 18, 2005
The comment has been removed - Anonymous
October 18, 2005
Did you read this paper
http://www.ccianet.org/modules.php?op=modload&name=News&file=article&sid=566
http://72.14.207.104/search?q=cache:I3OAWmMjWfcJ:www.ccianet.org/papers/CCIA-XML.pdf+ccia-xml&hl=en
http://www.ccianet.org/papers/CCIA-XML.pdf
?
CONCLUSIONS
XML can be a powerful tool for achieving interoperability. The support of XML as a data description language and the use of XML schema for application file formats is gaining widespread acceptance throughout the computing industry.
While Microsoft has released a definition of the XML schema used by their Word 2003 and Excel 2003 applications, these disclosures clearly lack information which is necessary for interested parties to achieve complete interoperability with Microsoft Office 2003’s entire feature set. Despite the fact that Microsoft promotes these disclosures as a prime example of their interest in supporting interoperability, the disclosures are incomplete and therefore effectively unusable; as a result, they have very little value as interoperability tools. Further, if these disclosures are being promoted as interoperability tools, but if in reality they cannot be used as such, one might wonder about the true motivations behind the disclosures, and indeed if those motivations have anything to do with interoperability at all. - Anonymous
October 18, 2005
Eduardo, based on those articles, it would appear that Gary Edwards has not fully looked at the XML formats. As I've repeated numerous times, there is no such thing as a "binary key" that needs to be reverse engineered in order to support Word documents. I'm not talking about only Office 12, this is also true with Office 2003. Please go have a look for yourself and give me some examples of what doesn't work. I'd really like to get your feedback. If there is something that you don't like, please let me know!
-Brian - Anonymous
October 18, 2005
James, I've never suggested we were innovative in deciding to use XML to represent our file formats. The whole reason we're doing it is that XML is a widespread standard and that allows people to easily access our files. You should know though that we've been using XML to represent pieces of our files since back in 1997 when we first started working on the Office 2000 HTML support. The latest move to default formats is just part of an evolution.
On top of that, I don't know of any other Office software packages out there that have even close to the level of support for custom defined schema. That's pretty powerful functionality, since it allows you to use your own schema definitions to mark up the files that contain your data.
There is a royalty free license that allows you to freely work on any of your files. That license is perpetual and we've publicly committed to providing the license from this point forward, so you'll always have access to it.
-Brian - Anonymous
October 18, 2005
Marcos, did that article lead you to believe that our formats aren't interoperable? The examples of what is inaccessible are actually pretty weak (other than the obvious point that we haven't yet fully XMLized Excel, and there is no XML format for PowerPoint yet). I completely admit there wasn't a complete story for Excel or PowerPoint in 2003, but Word's XML support was close to 100% and was fully documented. The examples raised in the article are pretty obscure. The only two things mentioned are embedded macro buttons, and embedded objects.
Let's talk about the first one. What percentage of documents out there have embedded macro buttons? We're going to be even better about representing everything as XML in 12, but in 2003, you already have almost everything there, with only a handful of the more obscure features missed.
The second example of embedded OLE object is just how OLE embedding works. Since Excel's XML support wasn't full fidelity in 2003, it would make no sense to persist an embedded Excel object as XML. An embedded object's persistence is determined by the OLE server, not the container. That said, in Office 12, we've actually done the work so that embedded Word, PPT, and Excel files actually will be stored in XML, so this will be a non-issue.
I'll continue to explain that there is nothing preventing interoperability with pretty much any document out there. Like I said there are a couple minor things that are being added for Office 12, but for someone to claim that's a "binary key" breaking interoperability in Word 2003 is just showing that they haven't actually looked into it (or they are really stretching).
If anyone has examples of serious interoperability problems they've come across, please let me know. I really want to dig into this issue.
-Brian - Anonymous
October 19, 2005
The comment has been removed - Anonymous
October 19, 2005
Brian, I don't have a Windows computer, and I don't have the technical knowledge to figure this one out myself. I'm not sure who is right on the issue, so I've asked Gary Edwards to look over the discussion here and make a comment. - Anonymous
October 19, 2005
The comment has been removed - Anonymous
October 19, 2005
To mystere,
I tried to download it, and it is not a cab file. It is an msi file. I don't have a supported system for the msi, so I just draw the natural assumption. Microsoft only wants to pretend the schema is open. Otherwise, why would they make the documentation available in such a manner? You seem to be missing a point. Microsoft is trying to claim that its format is open. They have to convince the world. The traditional Microsoft tactic of requiring that people come grovel before them and take what they wish to share just won't work. Brian has been told many times about this problem and no changes have been forthcoming. I conclude that the pretense of an open format is just that, a pretense.
Good day, - Anonymous
October 19, 2005
To mystere,
Even if I knew how to transform an msi file into a cab file (is it just a name change or is there more?), the author's own web site for cabextract would stop me. He says it may be illegal to do so. Maybe he is wrong, but if he thinks that, why should I chance it?
Good day, - Anonymous
October 20, 2005
Ralph, like I said above, we have it in ZIP form here:
http://www.microsoft.com/downloads/info.aspx?na=46&p=2&SrcDisplayLang=en&SrcCategoryId=&SrcFamilyId=15805380-f2c0-4b80-9ad1-2cb0c300aef9&u=http%3a%2f%2fdownload.microsoft.com%2fdownload%2fb%2f5%2fb%2fb5b64679-4d6b-43ec-ba50-5891ca11cf15%2fOffice12XMLSchemaReference.zip
-Brian - Anonymous
October 20, 2005
to BrianJones,
Great, thank you. I missed that before somehow. I downloaded it and I will review it.
Have a great day, - Anonymous
October 20, 2005
No problem. Have a look and let me know if you have any questions. They are an early preview of the Office12 schemas so there is still a ton of information that we'll be filling in as we get closer to shipping, but it should serve as a great start.
-Brian - Anonymous
October 20, 2005
Good for Dare. I thought of that and shrugged it off as not something that fit into the "explanation" offered in the Groklaw account. But it seems Dare has found a deeper source for the problem, too.
Although the practice of using a BOM on UTF-8 is not widely known, and I always thought it was a mistake, the XML 1.0 3rd edition specification of 2003-10-30 recognizes it in section 4.3.3 and in non-normative Appendix F.
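For anyone who wants to see this in practice, a minimal Python sketch follows. It builds a UTF-8 XML document with a byte order mark in front, shows that a conforming parser accepts it, and shows what the same leading bytes look like to a tool that assumes a single-byte encoding. The sample document and the function names are made up for illustration.

```python
import codecs
import xml.etree.ElementTree as ET

TEXT = '<?xml version="1.0" encoding="UTF-8"?><doc>hi</doc>'

# "utf-8-sig" prepends the three-byte UTF-8 byte order mark when encoding.
WITH_BOM = TEXT.encode("utf-8-sig")

def parse_with_bom(raw):
    """Parse bytes that start with a BOM; the parser consumes the mark itself."""
    return ET.fromstring(raw)

def naive_view(raw):
    """What the first three bytes look like when read as Latin-1."""
    return raw[:3].decode("latin-1")

root = parse_with_bom(WITH_BOM)
print(WITH_BOM[:3] == codecs.BOM_UTF8, root.tag, root.text)
print(repr(naive_view(WITH_BOM)))
```

The three stray characters at the start of the naive view are plausibly what would look like a mysterious header to a tool that is not Unicode aware, while the XML parser reads the file without complaint.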
What's funny, of course, is that any creation of an OpenDocument XML file in UTF-8 and saved from Notepad will have that very same "binary key," assuming that is what happened. And of course, Office Open XML (Office "12" flavor) can get it that way too. - Anonymous
October 20, 2005
I was just reading that comment by Gary Edwards, and this sentence:
"although i sometimes wondered if people know there is a difference between the traditional MS "binary formats", and the "binary key" that is in the header file of every MSXML file."
The use of the term "header file" makes me think he's talking about the BOM. I don't think the XML spec actually mentions whether a BOM is permissible in an XML document (certainly it doesn't say anything about UTF-8 XML documents, which clearly don't need a BOM anyway) but at the same time, it also doesn't say they're not permissible... - Anonymous
October 20, 2005
Oh, I stand corrected, orcmid. Apparently it does explicitly say that a BOM is OK. Well, that just reinforces my point anyway. - Anonymous
October 20, 2005
Yes Dean, makes you wonder what happens if you use UTF16 and BOMs with OpenDocument XML files. I don't think I'll be trying that.
I think I will add that to my request for clarification about XML prologs on the OpenDocument comment list though. - Anonymous
October 24, 2005
Eduardo, any luck yet on contacting Gary Edwards and finding out what he was talking about? I really want to know what this binary key is that he's been talking to so many different folks about. Just this weekend I was looking around at some other sites and saw a bunch of references to this "binary key", yet no one had more detailed information.
It was actually pretty funny. On one site (that one run by the guy with a paralegal background) someone even posted a fake Word XML document and folks jumped on that as being the binary key. I think the initial post was just as an example of what a binary key would look like, but not everyone got that.
Anyway, I really do want to find out where this myth started since it's being referenced all over the place and I'd like to find out if there is something I'm missing (so we can fix it).
-Brian - Anonymous
October 24, 2005
Why don't you just write Edwards an email yourself and ask him to leave a comment here? I found his email address via google, it is gary.edwards@OpenStack.us. Funny enough it was in an email in a mailing list archive where he mentioned your blog, so I assume he knows who you are. - Anonymous
October 24, 2005
Brian, no reply yet from Gary, or Florian Reuter, who I also e-mailed. - Anonymous
October 25, 2005
Thanks Eduardo. I just sent Gary an e-mail as well to see if any of the suggestions in the post were what led to the confusion.
-Brian - Anonymous
November 01, 2005
Last week I sent this to gary.edwards@OpenStack.us but haven't heard back:
Hey Gary, I was wondering if you could help me understand what the Binary Key in the MS Office XML formats you’ve been referring to is. Is this something you’ve seen in the WordprocessingML format for Word? Or is it in the SpreadsheetML format from Excel?
I posted some thoughts on my blog about what the misunderstanding might have stemmed from and I was wondering if any of those sounded like the culprit: http://blogs.msdn.com/brian_jones/archive/2005/10/17/481983.aspx
I’d like to get this resolved soon so that we can make any corrections needed. Obviously the goal when we first started moving towards XML back in Office 2000 was to represent our data in an open and interoperable way. I feel like we’ve finally achieved that and would hate it if we somehow overlooked something that’s as big as what you’ve been saying.
Thanks for your help.
-Brian - Anonymous
November 08, 2005
Brian,
I think this was a very reasonable, if not to say friendly, email. Just the right tone. I also really hope you will get an answer back at some point, since this seems to be such a silly discussion, that one should be able to settle in minutes, if everyone is interested in resolving this.
If you ever get an answer, could you please post it as new blog article? I am getting a bit tired of checking the comments on this one, but would like to know if something comes up here.
Best,
David - Anonymous
November 09, 2005
Hey David, if I hear back I'll create a new post, so you don't have to keep checking back here. :-)
-Brian - Anonymous
November 10, 2005
Could this be (close to) the origin of the binary key meme?:
http://theequityexchange.com/OpenStack/docs/XML%20Security%20and%20XMP%20Metadata%20Header.html - Anonymous
July 27, 2006
The comment has been removed