Manage your documents: Using Office XML formats and XSLTs

06/16/2005

One of the main reasons we announced this new file format so early was that we wanted to give people an opportunity to start building different types of solutions on top of the file formats. I’m pushing for an early release of the schemas (sometime before Beta 1), but that still leaves us with a few months before they would be out. So, in the meantime, the best way to start playing around with potential solutions is to use Office 2003. There is already a ton of XML support in that product.

While the announcement of these new default XML formats is a big deal, it is definitely not the first time we’ve worked with XML. In Office 2000 (which we started developing in 1997) we built an HTML format that leveraged XML for representing things like document properties and other application-specific information. This was done because HTML didn't support all of our features and we didn't want people to lose information when saving as a web page. It was unfortunate because it didn't look like "pure" HTML, but it was necessary to support our customers' data. Starting in 1999 we began building the SpreadsheetML format that shipped with Excel in Office XP. Then in 2001 we started working on the WordprocessingML format, which is now available in Word 2003. So, as you can see, we’ve been doing stuff with XML in Office for the past 8 years.

Why the brief history lesson, you ask? It’s important to understand that the new formats coming out with Office 12 are based on the work we’ve done up through Office 2003. So, if you build solutions on top of Word 2003’s XML, those will map fairly easily into the new file formats. For Word, the only big difference with the new format is that we break the single XML file into multiple files and wrap them all up in a ZIP package (we’ve actually designed a logical model for structuring documents from multiple pieces, which we then mapped into ZIP). Today I want to show an example of something you can do with WordprocessingML in Word 2003.

There have been a number of questions around support for other XML formats (there are tons of them out there). As I’ve described, since the formats are XML and fully documented, anyone can build transforms to go from our format into another (or vice versa). I decided I would post a really simple transform that runs against Word 2003 XML just to give folks an example. This transform will get rid of all the tracked changes and comments in a file. It does the exact same thing as if you were editing the file directly in Word and chose to accept all revisions. This transform is something that people could leverage as part of a workflow process. Imagine if you had documents you wanted to publish and you wanted to make sure there weren’t any deletions or comments in the files. I’m sure you’ve heard of people getting burned by posting documents on a server that had deletions in them. Often times the end user didn’t realize the deletions were still there, and there wasn’t an easy way for administrators to write an automated process to remove those deletions. Well, using XML, it’s easy to write solutions that manipulate Office documents without having to run the applications themselves.

Here are the steps for trying this solution out:

1. Download this ZIP file and put the two enclosed files on your desktop (https://jonesxml.com/resources/trackChangesExample.zip).
2. Open the file called "FileFormatDev.xml" in Word 2003. Notice that there are a bunch of comments and deletions.
3. Open the file in Internet Explorer (or any text / XML editor) and look at its contents. There is a ton of XML there, but you only really need to care about certain parts. Do a search for “aml:” and you’ll see all the tags we use for representing those comments and revisions.
4. Now open the XSLT "acceptRevisionsAndDeleteComments.xslt" in a text editor or IE, and take a look. It’s a pretty simple transform. If you are familiar with XSLT, you’ll see that all it does is rewrite all the WordprocessingML except for the comments and revisions. It strips those out.
5. You’ll now need to apply the XSLT to the Word document to remove those comments and revisions. There are a number of ways you can do this. Most XML parsers out there can do this for you. You can also use Word to do this directly. In Word, we allow you to save XML files through transforms as well as open them through transforms. That’s what we’ll do just to keep it simple.
6. Open the whitepaper (FileFormatDev.xml) in Word again and go to the Save As… dialog (File -> Save As…). The file type in that dialog should be “XML Document (.xml)”.
7. Notice that there is a checkbox in the dialog called “Apply Transform”. Go ahead and select that, and you will then have the “Transform...” button enabled.
8. Click on the “Transform...” button and go find the transform that you downloaded in step 1.
9. You’ve now told Word to save the file as an XML document, and then after the save is done, apply the specified transform. That means that if the XSLT does its job right, you should get a WordprocessingML file that has all the comments and revisions removed.
10. Rename the file so you can compare the results, then press the “Save” button. There will be a warning letting you know you are saving through a transform and that some of the document information might be lost. Go ahead and press “OK”. Once the file is saved, go ahead and shut down Word.
11. Open the file in Word again, and you’ll see that the comments and revisions are now gone. Remember that while in these steps we applied the XSLT with Word, you can do it anywhere. You don’t need to have Word on the machine. You could use any XML parser that supports XSLT and apply it to your documents.
    1. As a quick aside, you may notice that the XML file that you saved doesn't open as easily in IE or a text editor. There are two things going on here. The first is that we put the following PI (processing instruction) at the top of our files: <?mso-application progid="Word.Document"?>. We actually have a shell handler that sees this in the XML and associates the file with Word (even though the extension is just .xml). We do something similar with our HTML files. The problem is that if you try to open the file in Internet Explorer, it will see that PI and hand the file off to Word. You can open the file in Notepad and delete the PI, and then it will open in IE.
    2. The other thing you may notice if you open the file in an XML or text editor is that it's just one long stream of text. We don't "pretty print" our XML files, so if you look at them as plain text, they're hard to read. We do this because it improves the performance of saving and loading. It makes the files a bit more difficult to work with though. One option is to open the file in IE (after removing the PI), since IE will apply a transform to lay it out better. Another option is to use an XML editor (Visual Studio, FrontPage, etc.) that gives you the option to format the file. That will apply "pretty printing" to the document for easier reading.
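
Both of those asides can also be handled with a few lines of script instead of Notepad. The following is only a rough sketch, assuming Python and its standard xml.dom.minidom module (neither is part of the original download); the SAMPLE string is a made-up stand-in for a real saved file. The indented output is meant for reading, not for handing back to Word, since the added whitespace is not part of the original document.

```python
from xml.dom import minidom

# Tiny stand-in for a saved Word 2003 XML file (illustrative only).
SAMPLE = (
    '<?xml version="1.0"?>'
    '<?mso-application progid="Word.Document"?>'
    '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
    '<w:body><w:p><w:r><w:t>Hi</w:t></w:r></w:p></w:body>'
    '</w:wordDocument>'
)


def strip_pi_and_indent(xml_text):
    """Return an indented copy of the XML without the mso-application PI."""
    doc = minidom.parseString(xml_text)
    # Processing instructions that precede the root element are direct
    # children of the document node.
    for node in list(doc.childNodes):
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "mso-application":
            doc.removeChild(node)
    return doc.toprettyxml(indent="  ")


readable = strip_pi_and_indent(SAMPLE)
print(readable)
```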

So, that's just one example of writing a tool that manipulates a Word document. If you were going to try to do something like this with the binary formats, it would have been extremely difficult. Most people that are trying to do this today usually end up writing code that automates the Office applications. The advantage with the XSLT is that you don’t need to have the Office applications involved (in the demo we had Word apply the XSLT, but you could have used any number of tools to do it).
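
To make the "no Office required" point concrete, here is a rough sketch of the same idea in plain Python using only the standard library. It is not the XSLT from the download and it does not use XSLT at all; it just walks the tree and rewrites it. It assumes that insertions, deletions and comments appear as aml:annotation elements whose w:type is "Word.Insertion", "Word.Deletion" or starts with "Word.Comment", with the tracked content wrapped in aml:content. The SAMPLE document is made up for illustration. It also ignores the harder case of a deletion that spans a paragraph mark, and it can leave behind the empty run that used to hold a comment.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
AML = "http://schemas.microsoft.com/aml/2001/core"
ANNOTATION = "{%s}annotation" % AML
CONTENT = "{%s}content" % AML
TYPE = "{%s}type" % W

# Made-up fragment: one kept run, one insertion, one deletion, one comment.
SAMPLE = (
    '<w:wordDocument xmlns:w="%s" xmlns:aml="%s"><w:body><w:p>'
    '<w:r><w:t>kept </w:t></w:r>'
    '<aml:annotation aml:id="0" w:type="Word.Insertion"><aml:content>'
    '<w:r><w:t>added</w:t></w:r></aml:content></aml:annotation>'
    '<aml:annotation aml:id="1" w:type="Word.Deletion"><aml:content>'
    '<w:r><w:delText>gone</w:delText></w:r></aml:content></aml:annotation>'
    '<aml:annotation aml:id="2" w:type="Word.Comment.Start"/>'
    '<w:r><aml:annotation aml:id="2" w:type="Word.Comment"><aml:content>'
    '<w:p><w:r><w:t>note</w:t></w:r></w:p></aml:content></aml:annotation></w:r>'
    '<aml:annotation aml:id="2" w:type="Word.Comment.End"/>'
    '</w:p></w:body></w:wordDocument>' % (W, AML)
)


def accept_revisions(elem):
    """Rewrite elem's children in place, accepting tracked changes."""
    kept = []
    for child in list(elem):
        accept_revisions(child)  # clean nested content first
        kind = child.get(TYPE, "") if child.tag == ANNOTATION else None
        if kind == "Word.Insertion":
            # Accept the insertion: keep its content, drop the wrapper.
            content = child.find(CONTENT)
            if content is not None:
                kept.extend(list(content))
        elif kind == "Word.Deletion" or (kind or "").startswith("Word.Comment"):
            continue  # drop deleted text and comments entirely
        else:
            kept.append(child)  # everything else (e.g. bookmarks) is kept
    elem[:] = kept


doc = ET.fromstring(SAMPLE)
accept_revisions(doc)
print([t.text for t in doc.iter("{%s}t" % W)])
```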

Let me know if you guys have any questions or if the XSLT doesn’t work for you. I think in my next post I’ll talk more about the Word schema and how we designed it. At first glance it’s a fairly intimidating schema, but as you learn about it, it’s pretty basic and straightforward. There are just a ton of features in Word, so we had to create XML to represent them all. That doesn't mean that you need to deal with them all though if you're just trying to do something simple. Also, does anyone feel like it would be useful to have some posts talking about more of the basics around XML? Or does everyone feel like they are already up to speed on everything I've discussed and just want to see more technical posts?

-Brian

Comments

Anonymous
June 16, 2005
Sorry everyone, I posted the wrong XSLT. It will still remove the tracked changes, but it leaves the comments.
In addition, it's a bit more complex than it needs to be. I'm at home right now so I can't update it. Feel free to still play with it though, as it still does most of what I had described. I'll post a more up to date one when I get in tomorrow morning.
Anonymous
June 17, 2005
OK, I just updated it. The XSLT should now remove comments. It's also a bit easier to look at and figure out what's going on.

Bob - Office isn't using the ZIP technology that comes with WinXP, so we aren't limited to only that platform.
Anonymous
June 17, 2005
Two questions. Will the new XML format for Excel be similar to SpreadsheetML or a totally new schema? Also how are you going to handle the ZIP/XML package format in MSXML? For example, to open an XML document in XSLT you use the document() function which is expecting a well-formed single XML document. How are you going to expose an Office Open XML to this kind of call?? If there is no support for this in XSLT then you have to unzip the files to get at the particular XML file you need. This pretty much cancels out the benefit of the compression or the format.
Anonymous
June 17, 2005
Hey Bruce. The new schema for Excel will be different from the existing SpreadsheetML schema. There will be some similarities, but it will be much more aligned with how Excel internally represents the grid.

Your second question is a great question. There are a couple alternatives here. If you want to operate on the file as a single XML file, we will have a serialization method you can run that will convert the ZIP package into a single XML file. This is what you would do if you just wanted to run a single XSLT against the thing. As you say though, that cancels out the packaging and compression benefits of the format.
Alternatively, you can use System.IO.Packaging provided in the WinFx SDK to navigate the ZIP package and relationships to access each part that makes up the file (http://blogs.msdn.com/brian_jones/archive/2005/06/06/425750.aspx). If you are building a solution on top of the format, this will often be the better way to go. It's also just ZIP, so you could use any existing ZIP library out there if you didn't want to use the WinFx SDK. If you instead just want to apply a single XSLT though, then the serialization format would be what you want.
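
As a rough illustration of the "it's just ZIP" point, here is a minimal sketch using Python's standard zipfile module rather than System.IO.Packaging. The part names (parts/main.xml, parts/props.xml) and the demo package are invented for the example, since the real package layout isn't published yet, and the sketch ignores the relationships between parts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_demo_package():
    """Build a tiny in-memory ZIP with two XML parts (hypothetical names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("parts/main.xml", "<doc><p>Hello</p></doc>")
        pkg.writestr("parts/props.xml", "<props><title>Demo</title></props>")
    return buf.getvalue()


def read_part(package_bytes, part_name):
    """Parse a single XML part without extracting the package to disk."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as pkg:
        with pkg.open(part_name) as part:
            return ET.parse(part).getroot()


package = build_demo_package()
root = read_part(package, "parts/main.xml")
print(root.find("p").text)
```

Only the requested part is decompressed, so the rest of the package stays compressed.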

-Brian
Anonymous
June 17, 2005
Brian-

I'm catching up on your posts as I take a break from working my way through a real-world WordML/XSLT application. I'd love to see as much information and as many examples as you'd care to post showing how XSLT can be used to manipulate WordML.

Of course, everyone building Word solutions will want more documentation on the parts of the schema that address their own specific needs. So I might as well share just a bit about the specific solution I'm working on: We're sending a Word document to PDF format and taking tracked changes along for the ride as PDF Comments. Our latest approach is to use XSLT to transform the document twice: once to mark each revision in the document by either changing the font color (insertions) or adding a 1-pixel character (deletions), and once to provide an XML document with information about each revision. From there another process will read the PDF and use the colors, marks, and metadata to add the annotations.

So far, we haven't hit anything we couldn't figure out by inspecting the WordML (although marking deleted rows within tables is a bit tricky). But examples of how you and others are using and transforming WordML, such as the one you just posted, help us figure things out that much faster.

Jan Fransen
OfficeZealot
Anonymous
June 17, 2005
Hi Brian.
All this is very good and exciting!
This particular post brings up an interesting thought...what I'd like to see is information on using Word to create the XSL files for the transforms. Perhaps it's already available in the SDK somewhere, and I've just missed it?

Thanks.
Darryl
Anonymous
June 17, 2005
Darryl, what are you thinking you'd like from the XSLTs? Are you looking for XSLTs to go from WordML into another format? Or do you want XSLTs to go from your XML into WordML?

-Brian
Anonymous
June 20, 2005
XSLTs to go from my XML into WordML. But it just struck me that this is the purpose of the XSLT Inference tool.
Darryl
Anonymous
June 20, 2005
Hi Brian

You keep talking about us using the stuff in System.IO.Packaging. I guess that assumes we're all coding in .NET. Is the same functionality provided anywhere that can be called from VBA/VB6/VBScript? It would be a travesty if we had to rely on third-party libraries to get into these files from VBA.

Regards

Stephen Bullen
Anonymous
June 20, 2005
Hi Brian,

Okay, tried the updated XSLT.

1. File size result: Orcmid had the same result I did (the result file after accepting changes and removing comments and deletions was a larger .XML or .DOC file than the original). I see also that the new XML file is UTF-16 although both your original and XSLT show UTF-8. Not sure what I'm missing in this case :)

2. With the Word 2000->2003 add-ins available when processing this (save as XML & apply transform), would there then have been a choice to save as XML or as OffXML (i.e. end up with a zipped package) with a checkbox (similar to the choice to specify a Transform during Save As)?

3. One Word-side issue that could confuse folks doing this type of save via the U.I.

If you are working in a .doc file, make changes and use File=>Save As (.doc with new name), the document on your screen matches the 'as saved' condition and the file name & path at the top of your open Word window reflect the new name.

If you save as XML and apply a transform, the Word window changes to reflect the new file name, but what you see on the screen is still the 'pre-transformed' flavor of the document. If you close and reopen (from the MRU on the File menu in Word) then you get the 'saved' version. Seems like a 'refresh' (reopen?) choice would be needed in Word to keep things 'in synch' (for the person used to working on .doc files).

4. Yes - examples, more please :)

Bob Buckland ?:-)
Anonymous
June 20, 2005
An Inference Tool walk through would be sweet. Could you please include in your examples info on how to use a single element in multiple locations in the XSLT?

For example, the XML File contains:
<contact>
<first_name>Darryl</first_name>
<last_name>Hover</last_name>
...
</contact>

The final document is to contain multiple references to the <last_name> element as in a letter as follows:

Darryl Hover
9999 My Street
My City, MY 12345

Dear Mr. Hover
...

Thanks
Darryl
Anonymous
June 20, 2005
Stephen, the only solid plans right now for APIs are the managed ones I’ve been referring to. It’s just ZIP and XML though, so anyone can build a tool for accessing the files. I agree with you that it would be nice to have something simple that is available through VBA, and, like you said, it would be nice to not have to rely on third-party technologies. I’m looking into what we can do, and will probably have an update on this later on in the summer or early fall.

Bob, you’re right about there being user confusion when saving through an XSLT. We actually don’t see that as being much of an end user scenario though. Ideally the save through XSLT would be leveraged as part of a larger solution that has specific types of XML it wants out of Word.

Darryl, I’m not sure when I’ll get the example pulled together, but I’ll try to include your suggestion.

-Brian
Anonymous
June 22, 2005
So if I have a document that is being generated on the fly in Word using XML, every time the webpage is run, the document is regenerated. Does your XSLT have to be applied every time I run the webpage, or will it recognize that the XSLT has already been used?
Anonymous
June 30, 2005
Thanks for the post Andrew.
The issue of the deletion spanning the paragraphs is that you also need to account for the fact that that paragraph mark at the end of the first paragraph needs to be removed. You need to merge the remainders of those two paragraphs into one.

Take this as an example:

"First paragraph
Second One"

If you selected from "paragraph" to "Second" and hit delete, the result would be one paragraph that says "First One".
That is what I was accounting for with the added complexity of the XSLT. If you do the above example and apply the XSLT you suggest, it will result in this:

"First
one"

Instead of this:

"First one"

Make sense?
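
Here is a toy sketch of that merge logic in Python. It is a deliberately simplified model, not WordprocessingML and not the XSLT itself: each paragraph is a list of (text, is_deleted) runs plus a hypothetical mark_deleted flag that says whether the paragraph mark at its end was deleted.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    runs: list                  # list of (text, is_deleted) tuples
    mark_deleted: bool = False  # True if the paragraph mark itself was deleted


def accept_deletions(paragraphs):
    """Drop deleted runs and merge across deleted paragraph marks."""
    result = []
    carry = ""  # surviving text still waiting for a paragraph mark to end it
    for para in paragraphs:
        carry += "".join(text for text, deleted in para.runs if not deleted)
        if not para.mark_deleted:
            result.append(carry)
            carry = ""
    if carry:
        result.append(carry)
    return result


# "First paragraph" / "Second One" with "paragraph" through "Second" deleted.
PARAS = [
    Paragraph([("First ", False), ("paragraph", True)], mark_deleted=True),
    Paragraph([("Second ", True), ("One", False)]),
]

print(accept_deletions(PARAS))
```

Ignoring mark_deleted and only dropping the deleted runs would give two paragraphs, "First " and "One", which is the split result described above.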

As for the pretty printing, it's not clear yet how we'll package it. Most likely there will be a separate tool to apply pretty printing, rather than built-in functionality.
-Brian
Anonymous
July 07, 2005
Hi Brian,
I have a requirement to use Word 2003 as an XML editor. Is it possible to use macros dynamically in Smart Documents, i.e. the macros are downloaded from a server when the instance is started?
Would it be possible to demonstrate this with an example?

Also, I would like to achieve this transformation through code, i.e. trigger the transform on some event. Can I have an example of this?

Cheers
Rajiv
Anonymous
July 12, 2005
I just read this post on document security: http://news.com.com/Document+security+Tell+me+another+joke/2010-1071_3-5783062.html?part=rss&tag=5783062&subj=news...
