Transforming Word XML to XSL-FO
We put an article up back in the winter on transforming from WordprocessingML into XSL-FO. From there, you can go into other formats like PDF. Not sure if you guys have already seen this, but if not you should check it out: https://msdn.microsoft.com/office/understanding/word/codesamples/default.aspx?pull=/library/en-us/odc_wd2003_ta/html/officewordwordmltoxsl-fo.asp
Moving into fixed formats is pretty difficult because if you really want full fidelity you also need to understand Word's layout functionality. Fixed formats are formats that describe how the text and information are laid out on a page. PDF and XPS are both examples of fixed formats. The Word format is a flow-based format. If you add a paragraph somewhere in the WordprocessingML, then when you open the file back up, the page layout will of course be different (everything after that paragraph just got shoved down). This of course means that we don't store information like page breaks in the format. If we did, and we enforced it, then it would be significantly more difficult to work with the files, as you'd have to recalculate those things any time you modified the text.
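To make the flow-based point concrete, here is a minimal sketch in Python. It assumes a stripped-down Word 2003 WordprocessingML document written by hand for illustration (it is not taken from the article), and the helper names are hypothetical. The edit is a single insert into the body; there are no page positions in the XML to fix up afterwards.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 WordprocessingML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)

# Hand-written sample document (an assumption for illustration only).
DOC = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def make_paragraph(text):
    """Build a w:p element holding a single run of text."""
    p = ET.Element(f"{{{W}}}p")
    r = ET.SubElement(p, f"{{{W}}}r")
    t = ET.SubElement(r, f"{{{W}}}t")
    t.text = text
    return p


def insert_paragraph(xml_text, index, text):
    """Insert a new paragraph at position `index` among the children of w:body."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{W}}}body")
    body.insert(index, make_paragraph(text))
    return ET.tostring(root, encoding="unicode")


def paragraph_texts(xml_text):
    """Return the text of every paragraph, in document order."""
    root = ET.fromstring(xml_text)
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        for p in root.iter(f"{{{W}}}p")
    ]


updated = insert_paragraph(DOC, 1, "Inserted paragraph.")
print(paragraph_texts(updated))
# ['First paragraph.', 'Inserted paragraph.', 'Second paragraph.']
```

A fixed format would need the position of every later line recomputed after the same change. Here the layout is simply worked out again the next time the file is opened.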
I love examples like this that show some of the stuff you can do once you get Office documents in XML. I'm trying to gather a list of similar articles we should provide as we go through the Betas and people start working with the new formats. Are these kinds of articles useful? Are there other similar articles you'd like to see? At one point we had started to build a similar one showing how to go into DocBook, but I'm not sure what happened with it. I'll see if I can dig it up.
-Brian
Comments
- Anonymous
August 11, 2005
The comment has been removed - Anonymous
August 11, 2005
Tobek, I disagree. I work a lot with law firms, and the primary reason we convert Word documents to PDF is to reduce/remove metadata and ensure it's a (universally) "read only" document. There are metadata removal tools that we use, too, but PDF is seen as the ultimate metadata eliminator (yes, I know it's not strictly true). - Anonymous
August 11, 2005
Those are both interesting comments. Tobek, when you say "freeze document", do you mean just essentially making it read-only? Or is there more to it than that?
Evans, why do you think it is that PDF is seen as the ultimate metadata eliminator? I've seen plenty of PDF files with additional information. Is the issue that it's harder for people to see that hidden data? Or is there just a misperception of how locked down and clean a PDF file is?
Have you seen this add-in for Office that lets you remove all hidden data from an Office file: http://www.microsoft.com/downloads/details.aspx?FamilyID=144e54ed-d43e-42ca-bc7b-5446d34e5360&displaylang=en
Do you think tools like this would give people more confidence or is there something else to it as well?
-Brian - Anonymous
August 11, 2005
I've certainly seen the confidence that people have in PDF, as Evans mentions. Part of that, from what I could gather, has to do with the UI of the reader, of all things. With a really simple and minimal UI, there is a feeling that there's no room to hide the metadata. I recall one person saying to me, of Word, "with all those menus and dialogs, you never know where they have my salary written in there." - Anonymous
August 11, 2005
The comment has been removed - Anonymous
August 12, 2005
> At one point we had started to build a similar one showing how to go into DocBook, but I'm not sure what happened with it. I'll see if I can dig it up.
I would love to see that.
Thanks,
Keith - Anonymous
August 12, 2005
Brian, I, too, have seen PDFs with additional information. The vast majority of legal documents are repurposed to make new legal documents. Attorneys sometimes make a lateral transfer to a different firm, or a client will switch counsel. In both cases, the documents will frequently travel with them. More than track changes, the document stats and document properties, last editor, previous file locations, etc., can divulge information that is embarrassing at the least. It's not uncommon for us to use documents created years ago. Clients don't really like it when you're billing lots of hours for a document created in 1993.
I've used the Remove Hidden Data Tool, but it doesn't adequately meet the needs of the legal community. I've found people freak out when they receive a Word document that doesn't behave like a Word document. Another application, iScrub from Esquire Innovations (http://www.esqinc.com/), has a "Metasealant" feature that produces a protected Word document. I'm not especially fond of that, either. I need to re-read the XPS specs, but part of that made me a little uncomfortable.
There is definitely a false sense of security in the PDF format. Users rarely apply security to their PDF documents. A major factor in the PDF vs. protected Word document is the user experience. The Acrobat reader is pretty universal, and people are very comfortable with it. As in Tobek's example, most courts require documents filed electronically to be PDF. Unlike Tobek, we rarely have to keep copies of every single document that goes out the door (thank goodness!), although the document management systems frequently used log that event.
I'm really excited about the new format and what it will allow us to do. I'm especially thrilled about the document stability / document corruption features built in. That's one of our biggest problems. I'll have to look at the article you referenced, Brian.
Evans - Anonymous
August 14, 2005
Brian, that tool to clean up Word documents hardly works at all. Try getting a document that's been edited a lot, run it through the tool, and then do a "strings" on it (view all the strings in a hex editor if your OS doesn't have an equivalent command). You'll find yourself horribly surprised. I've lost all confidence in Microsoft coming up with a useful security or privacy tool. - Anonymous
August 15, 2005
The comment has been removed - Anonymous
August 15, 2005
I'll dig into the tool a bit more and see what the issues are.
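For anyone without a "strings" command handy, the check described in the comment above can be approximated with a short script. This is a minimal sketch in Python, not the real utility: the function name is hypothetical, the sample bytes (user name and file path) are made up, and it scans for UTF-16LE runs as well as ASCII on the assumption that binary Word files may hold text in either form.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of printable characters found in raw bytes.

    Looks for plain ASCII runs first, then for UTF-16LE runs
    (a printable ASCII byte followed by a zero byte, repeated).
    """
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in utf16_pat.finditer(data)]
    return found


# Made-up bytes standing in for a "cleaned" file that still leaks data.
sample = (
    b"\x00\x01C:\\Users\\jdoe\\draft.doc\x00\xff"
    + "Previous Author".encode("utf-16-le")
    + b"\x02ab\x03"
)

for s in extract_strings(sample):
    print(s)
# C:\Users\jdoe\draft.doc
# Previous Author
```

Running this over a file before and after a cleaning tool gives a rough view of what the tool left behind, such as old file paths or user names.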
You guys are right that this will be much easier with the new formats. It won't be as simple as just deleting a couple of parts, but it will be relatively easy. We call this kind of data PII (Personally Identifiable Information). The definition of PII is actually different depending on who you talk to, and it even changes over time. The new formats will give that flexibility so that we can publish how all the information is stored, and people can build tools that allow you to remove whatever you want. It's pretty exciting.
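As a rough illustration of the kind of tool people could build, here is a minimal Python sketch. It rests on assumptions the post does not spell out: that the file is a ZIP package laid out like Office Open XML, and that author details live in a docProps/core.xml part. The function name is hypothetical, and it clears only two fields, so it is nowhere near a complete scrubber.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces assumed for the core properties part.
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"

# Keep well-known prefixes stable when the part is written back out.
ET.register_namespace("cp", CP)
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", "http://purl.org/dc/terms/")
ET.register_namespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")

CORE_PART = "docProps/core.xml"  # assumed location of author metadata
FIELDS_TO_CLEAR = (f"{{{DC}}}creator", f"{{{CP}}}lastModifiedBy")


def scrub_core_properties(package_bytes):
    """Return a copy of the package with the author fields emptied.

    Every other part is copied through unchanged.
    """
    src = zipfile.ZipFile(io.BytesIO(package_bytes))
    out_buf = io.BytesIO()
    with src, zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == CORE_PART:
                root = ET.fromstring(data)
                for tag in FIELDS_TO_CLEAR:
                    for el in root.iter(tag):
                        el.text = ""  # blank the value, keep the element
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(info.filename, data)
    return out_buf.getvalue()
```

To use it, read a file's bytes, pass them to scrub_core_properties, and write the result to a new file. Which fields count as PII is a policy choice, so the list of fields to clear is kept separate from the copying logic.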
-Brian - Anonymous
August 16, 2005
The comment has been removed - Anonymous
August 17, 2005
No, I can't, because they're docs from work (I don't use Word at home). Some things I remember include user names (besides the last one to save), previous file paths, footnotes that do not show up in the doc, and other strings that don't mean much to me, but look interesting to a snooper. Just get some old document at Microsoft that's been edited by many people, and try it. - Anonymous
September 29, 2005
The comment has been removed - Anonymous
October 20, 2005
I can't be the only one who wants to go in the reverse direction, can I? I have reports that typically need to go out as PDF, and occasionally be read into Word. I was hoping to be able to put out XSL-FO, and run a FO processor for PDF, and XSLT from XSL-FO to WordProcessingML, but it looks like to do so I'd need to write the XSLT myself. I hope someone here tells me I'm an idiot and points me in another direction . . . - Anonymous
June 15, 2006
Well, we did the original conversion styles and many enhancements since. If this type of technology interests you, you may wish to try out the beta from our new partner.
They enhanced the technology to turn Word into an XSL style designer, both for designing WordML and for streaming input to the RenderX XSL-FO engine.