Texas looks at the interoperability of file formats

For those of you interested in the policy/politics side of file formats, I've seen a couple of folks point out this bill currently proposed in Texas (https://www.capitol.state.tx.us/tlodocs/80R/billtext/html/SB00446I.htm).

As all of you know by now, I think it's very cool to see this attention being paid to file formats and the important role they play in all of our lives. I've been working on this stuff for years, and it's always fun to see other folks talking about your work. Here are the traits they'd like to see in a file format in Texas:

Each electronic document created, exchanged, or maintained by a state agency must be created, exchanged, or maintained in an open, Extensible Markup Language based file format, specified by the department, that is:

  1. interoperable among diverse internal and external platforms and applications;
  2. published without restrictions or royalties;
  3. fully and independently implemented by multiple software providers on multiple platforms without any intellectual property reservations for necessary technology; and
  4. controlled by an open industry organization with a well-defined inclusive process for evolution of the standard.

It's great to look at things like this and think about the scenarios folks have in mind. Rather than talk about motivations in terms of "levels of openness", I think it's easier to discuss it in terms of scenarios or use cases. Most policies around file formats are there to ensure the following:

  1. Long term availability – You want to know that 100 years from now, you'll still be able to access your data. This is a complex problem, as it can affect everything from the software you use to the hardware you use that software on. The key in terms of file formats is that everything in the file format is fully documented, and the stewardship for that documentation belongs to an independent standards body. The ISO, Ecma, OASIS, and the W3C are all examples of organizations people feel comfortable trusting with the stewardship of that documentation.
  2. Freely available – You want to make sure that you don't need to worry about someone else holding rights over your documents. If there is IP behind the format technology for instance, you want to make sure there is some type of license available that will work for you. Not only that, but you want to make sure this will work for anyone else that you want to have access to your documents. All formats out there take slightly different approaches here (PDF, OpenXML, ODF, HTML, etc.), so it's important to pay attention to this.
  3. Fully interoperable and accessible – You want to know that people on other systems can still work with your files. This means that the format needs to be fully documented, and that there is nothing in the format that would prevent it from working on a different system. A great indicator here is to look at the number of applications that support the format, and which systems those applications run on. HTML is a great example of an interoperable format. OpenXML and ODF are both fully interoperable as well, but both are much younger. So while you don't see as many applications supporting OpenXML and ODF as you do HTML, you'll clearly start to see more and more pop up as time goes by.
  4. Meets customer and end user scenarios/needs – This is really the key. Without this, the formats won't be used. Plain text meets the three goals above, but obviously wouldn't work for most folks' documents. You need to make sure the end user doesn't see any ill effects when you try to meet the other three objectives.

There are a lot of other factors that can help you achieve these four goals, but those are all implementation decisions, and they don't necessarily prevent you from achieving your goals. For example, using existing technologies like ZIP and XML helps you achieve #3 because there are already tools out there that support them (they aren't necessary for success, though). You could invent your own technology and still achieve #3, assuming you fully document that new technology, but it's often easier to leverage what's already there, which can help you achieve more rapid adoption in the community.
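
To make the ZIP and XML point concrete, here's a minimal sketch (my illustration, not anything from the bill or the spec) of pulling the text out of an OpenXML .docx using only the Python standard library; the file name is hypothetical:

    import zipfile
    import xml.etree.ElementTree as ET

    # WordprocessingML's main namespace, from the published Ecma-376 spec.
    W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

    def extract_text(docx_path):
        """Pull the plain text out of a .docx with no Office-specific tooling."""
        with zipfile.ZipFile(docx_path) as package:
            # The main document part lives at a well-known path in the package.
            xml_bytes = package.read("word/document.xml")
        root = ET.fromstring(xml_bytes)
        # Every run of document text sits in a <w:t> element.
        return "".join(node.text or "" for node in root.iter(W_NS + "t"))

    if __name__ == "__main__":
        print(extract_text("example.docx"))  # hypothetical input file

Any language with ZIP and XML libraries can do the same, which is exactly the head start that reusing existing technologies gives a new format.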

If you look at the bill in Texas, you can see that they have these goals in mind, and have set out four criteria to help meet them:

  1. interoperable among diverse internal and external platforms and applications – It's super important to be both interoperable and accessible, as I discussed above.
  2. published without restrictions or royalties – This is really about meeting the first and second goals I identified above. You need to make sure that you will always have the ability to open these files, and you don't want to be forced to pay for that access. This is extremely important, especially when it comes to the core document content. You also want the ability to easily scan the files to see whether the end user has decided to embed some content that is restricted (see the sketch after this list).
  3. fully and independently implemented by multiple software providers on multiple platforms without any intellectual property reservations for necessary technology – This really helps to show that you aren't going to be tied to a specific application for accessing the content. The reason this is important to ask for is that you want to ensure that the files can be accessed by as many folks as possible.
  4. controlled by an open industry organization with a well-defined inclusive process for evolution of the standard – This is more of a future-looking goal. It's not about accessing the document of today, but ensuring that if new ideas come along, they can be added to the format. It's a bit harder to create concrete scenarios here, because obviously you don't want to allow random changes that don't undergo extensive review. You need to make sure, for instance, that future changes to the spec don't break existing implementations. Very important topic.
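
Here's the scanning sketch promised in point 2 above (my illustration, assuming a WordprocessingML package): because the format is a documented ZIP archive, checking a file for embedded objects takes only a few lines; embedded OLE objects live under the word/embeddings/ part path.

    import zipfile

    def list_embedded_parts(docx_path):
        """Return the names of embedded-object parts inside a .docx package."""
        with zipfile.ZipFile(docx_path) as package:
            # Embedded OLE objects are stored under word/embeddings/.
            return [name for name in package.namelist()
                    if name.startswith("word/embeddings/")]

    if __name__ == "__main__":
        for part in list_embedded_parts("report.docx"):  # hypothetical file
            print("embedded object:", part)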

As I said at the beginning, it's fun seeing so much attention being paid to file formats. It's always important to remove the more "religious" aspects from the debate and really drill into the scenarios. What are you trying to do with the documents, and what do you want to see put in place to help you succeed?

-Brian

Comments

  • Anonymous
    February 06, 2007
    So, how does OpenXML hold up against the proposed bill? Specifically, how would it hold up right now, and on December 1, when the bill would go into effect (especially considering that the ISO approval might fail on the fast track)? Thanks, Patrick

  • Anonymous
    February 06, 2007
    Speaking of scenarios, why did you guys choose NOT to rely on Microsoft Office's own test suite to "validate" the CleverAge translator? Could it be because it draws all the wrong conclusions? "It's always important to remove the more "religious" aspects from the debate, and really drill into the scenarios." Really? You won't win this game, Brian. One counter-example is enough to silence the claims you have made over and over on this blog. Again, you prefer to create a reality distortion field, hoping that your audience does not understand. Your blog reads like a marketing brochure.

  • Anonymous
    February 06, 2007
    In this regard there are two simple steps Microsoft can take to demonstrate their commitment to customers over their commitment to protecting the Windows and Office monopolies.

  1. Publish the secret specifications of the MS-Office binary blob formats.
  2. Include native ODF support in MS-Office. Why does Microsoft need the "billions and billions" of existing Office documents to remain unreadable or poorly readable outside of MS-Office? Why does MS-Office support dozens of alternative file formats, including poorly defined "standards" such as RTF and CSV, yet fail to support OpenDocument Format natively? Sean DALY.
  • Anonymous
    February 07, 2007
    Patrick, I think that even without the ISO approval it still meets the bill's criteria, as it's already an Ecma standard. I can't think of a single application/scenario that can't be met without ISO approval. The ISO submission was something that a number of governments had requested, which is why Ecma went that route. I view that more as an additional endorsement, as well as an alternative approach for the maintenance of the spec, rather than something that actually opens up more scenarios and use cases.

    Stephane, that's really a question that's best for the translator team's blog: http://odf-converter.sourceforge.net/blog/index.php I believe they looked to a number of different sources for their test cases. For instance, there is an ODF conformance test suite out there that they leveraged, as well as some government compliance documents. Again, that's a question I would take up with them if you feel they didn't pick the right set.

    Sean,
  1. The binaries are already documented and freely available (they have been for some time). Here is the most up-to-date article describing how you can get to them: http://support.microsoft.com/kb/840817
  2. There are plenty of file formats out there that meet the needs of this bill. There aren't many folks out there who specifically want ODF. Instead, they want formats with long-term archivability; royalty-free access; interoperability; and that meet their user needs. ODF fits the bill some of the time, but so do HTML, PDF, plain text, OpenXML, DocBook, etc. The ODF format was just a small blip on everyone's radar until the news around the commonwealth of Massachusetts. You can see this just by looking at the OASIS committee's meeting minutes. Leading up to the point where they submitted the standard for final approval, the attendance rate was very low. Only 2 people attended more than 75% of the meetings. It wasn't until all the press around Massachusetts hit that they got more people participating, but at that point the 1.0 version was already done (which is the one that went to ISO). We were already well on the way to the first Beta of Office 2007 when all this picked up, and it was far too late to add an additional format, even if customer demand increased. That's one of the reasons why we went the route of supporting the open source converter. The other reason was that by making it an open source project, everything would be transparent, and there would be no chance of people misinterpreting the results. -Brian
  • Anonymous
    February 07, 2007
    Brian, that Texas bill is very interesting, but it could also end up stifling document creation and technological innovation, as well as raising costs. The bill's aims are laudable. However, mandating XML for all government uses is not the best way to achieve them. In fact, it's incredibly short-sighted. It has been shown time and time again that prescribing a certain technology often proves costly and counterproductive in the long run:
  1. What happens if the technology stipulated isn't up to the task at hand? XML is not suitable for all uses (a fact evinced by the XLSB format in Excel 2007). Forcing XML onto inappropriate applications will only increase the compliance burden without necessarily achieving the law's enumerated goals.
  2. What if better technologies appear? XML may be today's darling, but there is no guarantee that it will be the best solution twenty or thirty years down the road. Laws stay on the books a long time. At some point, technology will have moved on: XML will be obsolete. Mandating its use even under such circumstances will make data more arcane and less accessible. Such a hard-coded stipulation (instead of a call for best management practices) will do the exact opposite of what the law intends. The law is also impractical. Here are a few questions:
  3. What is an electronic document? What about files that are not destined for interchange? E.g., do temporary/swap/cache files, raw input files (e.g., from cameras and scientific instruments), and database store files count?
  4. Do the file formats for limited-use, custom in-house solutions have to be published? Does this apply to legacy systems (e.g., mainframes)? That could be a documentation nightmare. Public benefit would be minimal.
  5. It is not realistic to expect every format used by an entity as large as government to be implemented by multiple providers on multiple platforms. There is a lot of specialty and custom software in use. What happens if the government fails in this? Will the government be forced to subsidize additional software development to ensure that every piece of software it uses is available on at least one other platform and from one other company?
  6. Are open industry organizations necessary for legacy formats? What about for special-use software? Does it make sense to encumber the development of custom software with the need for such corporatist structures?

    A better law would call for open formats without mentioning XML. It would state that the formats to be used must adhere to best management practices (i.e., being open and accessible) but allow exceptions in cases where documents are not intended for interchange, where optimization is necessary, etc.
  • Anonymous
    February 07, 2007
    Francis, those are great points. I agree that it's very important to not mandate specific technologies but to instead stay focused on the actual goals. Let's take long term archivability as an example. People view PDF as an archival format, and it obviously doesn't use XML. Also, as you point out, not every file you create needs to be archived. Whenever we start to work on a version of Office, we focus on the scenarios we want to solve. We try to not think about the technologies until we have first agreed on which scenarios we think are more important. For example, one of the broad scenarios behind the new file formats was that we wanted developers outside of Microsoft to easily write solutions that could read and write our formats, which would increase the value of Office as a platform. We then created more detailed scenarios around the types of solutions we wanted to enable for those developers. We then took those scenarios as well as a few other broad scenarios and we reached the final decision of going with XML and ZIP. The same approach needs to be applied here. That's why I tried to talk about the actual goals first, rather than the specific technologies. -Brian
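
As a companion to the reading sketch earlier in the post, here is a hedged illustration of the developer scenario Brian describes (mine, with made-up document content): emitting a minimal but valid .docx with nothing beyond a ZIP library, using the standard OPC part names and the WordprocessingML namespace.

    import zipfile

    # The three parts every minimal WordprocessingML package needs; the part
    # names and namespaces are the standard published ones.
    CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
      <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
      <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
    </Types>"""

    RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
      <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
    </Relationships>"""

    DOCUMENT = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:body><w:p><w:r><w:t>Hello from a hand-rolled OpenXML package.</w:t></w:r></w:p></w:body>
    </w:document>"""

    def write_minimal_docx(path):
        """Write a tiny, valid .docx: just a ZIP archive of well-known XML parts."""
        with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as pkg:
            pkg.writestr("[Content_Types].xml", CONTENT_TYPES)
            pkg.writestr("_rels/.rels", RELS)
            pkg.writestr("word/document.xml", DOCUMENT)

    if __name__ == "__main__":
        write_minimal_docx("hello.docx")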

  • Anonymous
    February 07, 2007
    Brian: PDF does in fact use XML. The specification defines a number of pieces that are represented as XML. To name just two examples: XMP metadata and XFA forms. When you think about it, this isn't that different from OOXML, which uses a binary container (ZIP) to store data in binary (PNG, etc.) and XML forms. Of course, it goes without saying that as a much newer format OOXML represents much more of its data as XML, but it isn't nearly as black and white as you claim.

  • Anonymous
    February 07, 2007
    The comment has been removed

  • Anonymous
    February 07, 2007
    It's good to see this topic addressed here. I'm stopping by since the /. coverage claims that OOXML won't meet the bill's definitions. As usual, great humor was had in reading the opposite. More humor and great wonderment was had in reading the complaints about previous and now deprecated technologies, and the additional and hyper-generalized spurious claims and zealotry. Cheers!

  • Anonymous
    February 07, 2007
    Andrew, great point. I did know that PDF makes use of XML, and I didn't mean to imply otherwise. My point was more to show that PDF at its core is not based on XML, but that doesn't prevent it from being a great format for long-term archival.

    Stephane, I could go talk to the translator team to get a better understanding of the methodology they used to come up with their test files and then report that back to you. I thought it would be more efficient, though, if you just asked them directly. I don't work directly on that project, so I don't have the knowledge to drill into their design decisions. -Brian

  • Anonymous
    February 07, 2007
    Brian - The binary blob MS-Office formats have not been published. They are available for licensing to "companies or agencies", under terms which are secret. Why is this so? Why not just publish the formats? This secrecy threatens the MS-OOXML specification's candidacy for standardization, since the document refers in many instances to previous versions of Office and the closed binary blob formats. Hasn't the time come to publish them? Sean

  • Anonymous
    February 07, 2007
    "The binary blob MS-Office formats have not been published." They have been published, in fact those specs were part of MSDN for a long time. Just buy MSDN Library CD March 1998. I guess you have to get in touch with MS Ireland for MSDN CD orders.

  • Anonymous
    February 07, 2007
    The comment has been removed

  • Anonymous
    February 07, 2007
    Sean, in what sections are there references to binary blobs? If you could point out what you're referencing, I think I could better understand what you're looking for.

    Stephane, I'm not trying to sidestep the technical issues, and I'm well aware of your knowledge in the area (that's why I enjoy these discussions with you). I'm not sure if you want to get into a discussion around the strategy behind the converter approach in general (i.e., an open source converter), or if you are more interested in the methods the team used for coming up with their test suite (with which I'm not very familiar). -Brian

  • Anonymous
    February 07, 2007
    Wow.  You are amazingly generous.  I must confess a certain cynicism around these latest legislative adventures, and I have probably gone too far in that direction in my own appraisal: http://orcmid.com/blog/2007/02/latest-oox-odf-fud-spat-states-prepare.asp I think your view of the policies that open-format adoption can be part of is very clear and would be important in establishing a legislative history too.  Your four points are excellent.

  • Anonymous
    February 07, 2007
    Sun Microsystems just raised its hand: http://www.sun.com/aboutsun/pr/2007-02/sunflash.20070207.1.xml This is an intriguing development.  I think, at this point, this is mostly good news.  It could be used in an argument that OOX is unnecessary, but we'll have to see what the caveats are concerning the Sun Conversion Technology.

  • Anonymous
    February 07, 2007
    You said: "Whenever we start to work on a version of Office, we focus on the scenarios we want to solve." And Microsoft's most basic scenario is how to lock-in customers and lock-out major competitive solutions.

  • Anonymous
    February 07, 2007
    Brian said "I'm not sure if you want to get into a discussion around the strategy behind the converter approach in general (ie an open source converter), or if you are more interested in the methods the team used for coming up with their test suite (of which I'm not very familiar)." My original question was whether Microsoft had used their own test suite (i.e. corpus of documents) to validate this project before you made the announcement. Not getting my answer, I thought a different angle would be to ask what you meant by "complete" in this blog post : http://blogs.msdn.com/brian_jones/archive/2007/02/02/odf-to-openxml-conversion-complete.aspx I also note that the announcement comes with no strings attached, meaning that it should do what it is expected to do, and do it right. CleverAge's own test suite means nothing to me. I have seen the list of features they support that goes into their test suite. Who made that list? How do you go from that arbitrary list (i.e. they are big features not even mentioned) to extrapolate into the billions of Office documents out there? I mean, how can this project make sense at all without relying on Microsoft's own Office test suite?

  • Anonymous
    February 08, 2007
    Stephane, you are clearly trying to make a mountain out of a molehill here, for your own purposes. I think it's pretty clear that Brian was simply announcing the completion of release 1.0 of the converter by a non-Microsoft group (CleverAge, plus other contributors to this open source project). Microsoft may have kicked off the project, but that does not mean it owns or controls this converter, nor is it required to do acceptance testing. Further, as a 1.0 release, the development group is not guaranteeing that it is bug-free; Microsoft cannot guarantee this either, because it is not their converter, despite what you claim to believe. If it bothers you that Microsoft does not own the converter, and thus you cannot hold them to task for any issues with it, then just learn to live with it.

  • Anonymous
    February 08, 2007
    Ian, I beg to differ. Microsoft has made a mountain of this project. Have you been following it since last year? And they are the ones who have lately been putting out PR announcing all sorts of things, including that it was "complete". When Microsoft says it's "complete", one may understand it in the typical light of Microsoft declaring a product complete. It's very simple: why have they participated in something that they knew from day one would not hold water? They dare criticize IBM and others for all this politics, FUD, and so on; well, sorry, but they are participating in it too. Now that, I think, the technical merits of the whole project are pretty questionable, I would not be surprised if Microsoft brings in their top spin doctors to add a layer of politics on top of it. There is no technical merit in this project for one simple reason (and I know they worked hard, so it's a heart-breaker): reading/writing in full fidelity mandates a lot more knowledge of the internals, which in fact is much like rendering the document in memory by taking care of the semantics. Those who originally thought they would do a couple of translations and be done with it were wrong. The smart brass in the Microsoft Office team has known that since day one. Plus, you have to understand that this project will just fly in Microsoft's face now that governments are being told that it's the official solution to the interoperability problem. There is a ton more to be said about it. I was just asking an innocent question, and apparently that is already too much to ask...

  • Anonymous
    February 08, 2007
    Stephane, your logic is totally wrong. The fact that Microsoft has chosen to get a lot of publicity from this does not imply that this is a product that they own, control, or are responsible for. Nothing you say can alter this basic fact, no matter how often you repeat it. So, if you have any issues with the quality of the 1.0 release (I know I do!!!), it's not Microsoft's responsibility at all; it's the development team's. As for your comment about the project having no technical merit, you seem not to understand that it was never meant to do reading/writing in full fidelity, because that is technically impossible -- the ODF specification and the Ecma OOXML spec are not in one-to-one correspondence.

  • Anonymous
    February 08, 2007
    I have, and likely will have, no need for the converter, but I have checked it out from time to time. One of the things I read on their blog/release notes is that they test via "double conversion". I'm not sure of the correct technical term for this, but they convert a document, then convert it back, and verify the results. Surely this doesn't say anything about supporting every feature, but it does cast doubt on any kind of requirement for "mirroring" or understanding an in-memory binary image of it. Just rolling with the FUD machine, hahahaha
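
For readers unfamiliar with the technique, a round-trip ("double conversion") check can be sketched roughly as below; the converter functions and normalize() are hypothetical stand-ins, not the actual CleverAge harness:

    def odf_to_ooxml(doc: str) -> str:
        return doc  # stand-in: pretend the conversion is lossless

    def ooxml_to_odf(doc: str) -> str:
        return doc  # stand-in: pretend the conversion is lossless

    def normalize(doc: str) -> str:
        # Hypothetical: reduce a document to the features under test (text,
        # styles, tables, ...) so incidental markup differences don't fail it.
        return doc.strip()

    def round_trip_ok(original: str) -> bool:
        """ODF -> OOXML -> ODF, then compare the normalized results."""
        round_tripped = ooxml_to_odf(odf_to_ooxml(original))
        return normalize(round_tripped) == normalize(original)

    if __name__ == "__main__":
        assert round_trip_ok("<office:document>sample</office:document>")
        print("round trip preserved the document")

As the commenter notes, a passing round trip shows the converter is self-consistent, not that it supports every feature.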

  • Anonymous
    February 08, 2007
    Ian, have you read the press? CleverAge has done work on behalf of Microsoft. Governments are being told that this project is a solution to the interoperability problem.

    "Full fidelity because that is technically impossible." Eh, eh, I knew someone would address this. I will not comment on whether or not it is impossible, as you say. It's in fact more complicated than that. Because you have markup (+ blobs) on both sides, you may think the translator is that clever thing which, going from ODF to OOXML, does the following:

        ODF ==> translator ==> OOXML

    If you state this, it's because you believe the semantics are entirely captured by the markup. You believe there is no discrepancy. Here is the reality:

        ODF ==> intermediate representation (1) ==> translate ==> intermediate representation (2) ==> OOXML

    Problem: the intermediate representations hold pretty much all of the semantics. I won't comment on whether representation (1) is doable, but I'll assert that representation (2) is only doable by Microsoft's own application. Why? Because there is semantics that OOXML does not store at all. And it is the same for translating formats the other way around. I would have hoped that OOXML would be that wonderfully designed format that captures 100% of the semantics itself. It's not the case, and it puts the burden on CleverAge. That's why it's a mistake to have delegated this part of the work outside the fence. You may not agree, but that's how I see it. And while the translator only tries to do something with Word documents, it's going to get worse with Excel spreadsheets (this is my area of expertise).

  • Anonymous
    February 08, 2007
    The comment has been removed

  • Anonymous
    February 08, 2007
    @Stephane The facts that 1) CleverAge has done other projects for Microsoft in the past, and 2) Microsoft is making some PR from this, do not make Microsoft responsible for this project. Can we end this argument at that? You're just repeating yourself. As for the technical impossibility of 100% fidelity, we are in agreement with the final conclusion. However, you have introduced a spurious argument to reach that conclusion. You seem to be referring to the well-discussed legacy rendering aspects of the Ecma spec. (Correct me if I am wrong.) As you undoubtedly know, those are optional minor details of layout as done in older versions of Word, and they are ignored by Word 2007 itself when converting old files. So they are basically irrelevant. I myself wouldn't make such a big deal out of them. People who do seem to have other agendas. For example, like you, they omit to mention that the ODF spec has a similar set of attributes whose rendering is unspecified. This omission of a similar ODF behaviour allows them to make it seem that it's all Microsoft's fault, which of course is their real aim.

  • Anonymous
    February 08, 2007
    Dennis, spreadsheet formulas were defined pretty well by MS Excel a number of years ago, don't you think? OpenOffice, or any other competing product, has no chance to stand if it does not accurately replicate MS Excel's run-time behaviors in formulas, including the bugs. So it's not like there are no specs. It's very unfortunate that MS Excel's implementation is the spec.

    Ian, here is an excerpt from the MS press pass: "(...) Microsoft Committed to Interoperability The Open XML Translator is one among many interoperability projects Microsoft has undertaken. Microsoft continues to work with others in the industry to deliver products that are interoperable by design (...)" Very bold statements, MeSay.

    Ian said, "You seem to be referring to the well-discussed legacy rendering aspects of the Ecma spec." No, I am not talking about rendering in general. To render something, you need to know what you are doing. It's just not in the same league as reading and writing markup and binary blobs, unless the markup and binary blobs are so well designed that they capture all of the semantics.

  • Anonymous
    February 15, 2007
    I'm sure most of you have had those annoying conversations with folks on a topic where that person views…

  • Anonymous
    March 02, 2007
    I just saw that the Novell folks have released a version of OpenOffice with support for the Ecma Office…
