Word XML's Context Free Chunks: Building a document from multiple pieces of content
This is a bit of a more obscure feature that I like to point out every now and then. It's great if you are interested in building up a Word document from multiple pieces of content. This is a common scenario I've talked with a number of folks about over the years. In the design of WordprocessingML it was clear that we should make it easier to bring document fragments with rich formatting into an existing Word XML document without having to do a bunch of extra clean up work around style name conflicts, etc. This is why we created the context free chunk element, to allow people to insert a block of content where all the style and list definitions were defined locally for that chunk rather than for the entire document.
Example Scenario:
The simplified version of the scenario is that you want to dynamically generate a document by bringing in content from other sources. The contents of the document could depend on a number of outside factors, such as who is authoring it, what they are writing about, what the conditions of the market are, etc. For example, imagine you work for an investment bank and you want a solution that automatically generates a report template based on the company, the industry, and the analyst who is going to write the report. To do this, you need to bring in content from all over the place and use it to create the document. There may be disclosure clauses that you want to insert, a chart that shows historical financial figures, a boilerplate description of the company, etc.
Basics
This is just one example of something I've talked to a number of people about supporting. Rather than dig into the scenarios more though, I want to talk about one bit of functionality we had in the WordprocessingML schema for Office 2003 that was designed to address this case. I was planning to talk about this a bit later as I wanted to talk about more intro-level stuff first, but thanks to a great post last week by John Durant, I figured I would briefly describe it now.
One of the difficult problems with almost any document format is deciding how to move new content in. This sounds like it should be easy, but there are a number of issues to deal with. In the WordprocessingML schema, all styles and list definitions are declared at the beginning of the file. Then in the content below, the various objects (tables, paragraphs, text runs) can reference those styles. This is a very common model (similar to CSS in HTML), which means there isn't a ton of repeated data. The problem comes with adding new content somewhere in the body though. When you add new content, if you can't define everything local to that content, then you need to parse the style definitions of the source document, make sure that all the styles referenced in your new content are already declared in the target, and guarantee there are no collisions.
Target Document:
Let's say you had a file that looked like this:
Introduction
- First item in the list
- Second item in the list
The XML for that might look something like this (I'm just going to use shorthand so this isn't really a valid Word XML file):
<wordDocument>
  <styles>
    <style styleId="H1">
      <fontSize val="18"/>
    </style>
  </styles>
  <lists>
    <list id="1" type="1, 2, 3"/>
  </lists>
  <body>
    <p style="H1">Introduction</p>
    <p list="1">First item in the list</p>
    <p list="1">Second item in the list</p>
  </body>
</wordDocument>
Source Document:
Now let's say we want to have a solution that adds the following content:
Disclaimer
It is important to understand the following issues:
- Something confusing
- Something else confusing
The XML file for that would probably look something like this:
<wordDocument>
  <styles>
    <style styleId="H2">
      <fontSize val="16"/>
    </style>
  </styles>
  <lists>
    <list id="1" type="a, b, c"/>
  </lists>
  <body>
    <p style="H2">Disclaimer</p>
    <p list="1">Something confusing</p>
    <p list="1">Something else confusing</p>
  </body>
</wordDocument>
Getting the result we want:
So, if our goal is to add the content from the source into the content of the target to create a new document, we need to worry about a couple of things. The first is that our source document uses the "H2" style, but that isn't defined in our target document. This means we'll need to update the style information for the target. The second problem is that our source document uses a list with id "1" that has "a, b, c" styled numbers. In the target document, there is also a list with id "1", but its number style is different. If we don't fix this, then the two list items in the source would belong to the same list that's already in the target, and you would end up with this:
Introduction
1. First item in the list
2. Second item in the list
Disclaimer
It is important to understand the following issues:
3. Something confusing
4. Something else confusing
Obviously that isn't what we want. To correct this, we would need to modify the source document so that the list uses a different id, and then add the list definition to the top of the target document.
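To make that manual fix-up concrete, here is a rough sketch in Python that works on the same shorthand XML as above (so it is not real WordprocessingML). The function name `merge_by_hand` is made up for this illustration. It copies over any style the target doesn't declare, gives each source list a fresh id, and re-points the source paragraphs at the new ids. It assumes list ids are integers and that a style with a matching name should simply keep the target's definition.

```python
import copy
import xml.etree.ElementTree as ET

TARGET = """<wordDocument>
  <styles><style styleId="H1"><fontSize val="18"/></style></styles>
  <lists><list id="1" type="1, 2, 3"/></lists>
  <body>
    <p style="H1">Introduction</p>
    <p list="1">First item in the list</p>
    <p list="1">Second item in the list</p>
  </body>
</wordDocument>"""

SOURCE = """<wordDocument>
  <styles><style styleId="H2"><fontSize val="16"/></style></styles>
  <lists><list id="1" type="a, b, c"/></lists>
  <body>
    <p style="H2">Disclaimer</p>
    <p list="1">Something confusing</p>
    <p list="1">Something else confusing</p>
  </body>
</wordDocument>"""


def merge_by_hand(target, source):
    """Append source's body to target, fixing up style and list references."""
    # Copy styles the target does not declare yet (H2 in this example).
    declared = {s.get("styleId") for s in target.iter("style")}
    for style in source.find("styles"):
        if style.get("styleId") not in declared:
            target.find("styles").append(copy.deepcopy(style))

    # Give every source list a fresh id so it cannot join a target list.
    next_id = max((int(l.get("id")) for l in target.iter("list")), default=0) + 1
    new_ids = {}
    for lst in source.find("lists"):
        new_ids[lst.get("id")] = str(next_id)
        renamed = copy.deepcopy(lst)
        renamed.set("id", str(next_id))
        target.find("lists").append(renamed)
        next_id += 1

    # Move the paragraphs across, pointing them at the renumbered lists.
    for para in source.find("body"):
        moved = copy.deepcopy(para)
        if moved.get("list") in new_ids:
            moved.set("list", new_ids[moved.get("list")])
        target.find("body").append(moved)
    return target


merged = merge_by_hand(ET.fromstring(TARGET), ET.fromstring(SOURCE))
print(ET.tostring(merged, encoding="unicode"))
```

Even for this toy format, the merge code has to understand every kind of definition a paragraph can point at, and a real file has many more of them than styles and lists.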
Easier way:
Of course there is an easier way to do all of this. Building up a document from multiple parts was an important scenario for us. Because of that, we created an element in our schema called a cfChunk. The cfChunk allows you to create a temporary "mini-document". You can place a cfChunk within an existing WordprocessingML file, and then within that cfChunk you can make new style and list definitions that apply locally to that chunk. When Word opens the file, we'll merge that content with the rest of the file and take care of any conflicts. This is similar to what happens when you copy content from one document and paste it into another. If the style names match, then we'll inherit the definitions from the target document. If the style from the source doesn't yet exist, we'll create it. The cfChunk is one of those pieces of functionality that's rarely talked about, but it's extremely useful. I think the main reason it isn't talked about is that for someone to see the benefits of it, they need to already understand the basics of WordprocessingML.
So, in order to get the file looking like we want:
Introduction
- First item in the list
- Second item in the list
Disclaimer
It is important to understand the following issues:
- Something confusing
- Something else confusing
We just do this:
<wordDocument>
  <styles>
    <style styleId="H1">
      <fontSize val="18"/>
    </style>
  </styles>
  <lists>
    <list id="1" type="1, 2, 3"/>
  </lists>
  <body>
    <p style="H1">Introduction</p>
    <p list="1">First item in the list</p>
    <p list="1">Second item in the list</p>
    <cfChunk>
      <styles>
        <style styleId="H2">
          <fontSize val="16"/>
        </style>
      </styles>
      <lists>
        <list id="1" type="a, b, c"/>
      </lists>
      <body>
        <p style="H2">Disclaimer</p>
        <p list="1">Something confusing</p>
        <p list="1">Something else confusing</p>
      </body>
    </cfChunk>
  </body>
</wordDocument>
Go ahead and try it out for yourself. You can imagine taking a template with a bunch of placeholder XML tags and posting it up on the server. Then your solution could just grab all the pieces of content you need, wrap them in a cfChunk tag, and swap them out with the placeholder XML tags in your template. I'm really pushing to extend this functionality for the new schemas in Word 12, so let me know if you find it useful or if there is some other kind of behavior you'd like to see added.
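Here is a minimal sketch of that placeholder idea in Python, again using the shorthand XML from this post rather than real WordprocessingML. The `<placeholder name="..."/>` tag and the function names `to_chunk` and `fill_template` are invented for the example; a real solution would pick its own placeholder markup. The point is that the source's style and list definitions travel with the content untouched.

```python
import xml.etree.ElementTree as ET

TEMPLATE = """<wordDocument>
  <styles><style styleId="H1"><fontSize val="18"/></style></styles>
  <lists><list id="1" type="1, 2, 3"/></lists>
  <body>
    <p style="H1">Introduction</p>
    <placeholder name="disclaimer"/>
  </body>
</wordDocument>"""

DISCLAIMER = """<wordDocument>
  <styles><style styleId="H2"><fontSize val="16"/></style></styles>
  <lists><list id="1" type="a, b, c"/></lists>
  <body>
    <p style="H2">Disclaimer</p>
    <p list="1">Something confusing</p>
    <p list="1">Something else confusing</p>
  </body>
</wordDocument>"""


def to_chunk(source_root):
    """Re-wrap a whole source document as a cfChunk, definitions included."""
    chunk = ET.Element("cfChunk")
    chunk.extend(list(source_root))  # styles, lists and body go in unchanged
    return chunk


def fill_template(template_root, pieces):
    """Swap each <placeholder name="..."/> in the body for a cfChunk."""
    body = template_root.find("body")
    for index, child in enumerate(list(body)):
        if child.tag == "placeholder" and child.get("name") in pieces:
            body.remove(child)
            body.insert(index, to_chunk(pieces[child.get("name")]))
    return template_root


result = fill_template(ET.fromstring(TEMPLATE),
                       {"disclaimer": ET.fromstring(DISCLAIMER)})
print(ET.tostring(result, encoding="unicode"))
```

Notice that nothing here renames list ids or copies style definitions into the template's header; both lists keep id "1", and resolving that is left to Word when it opens the file.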
-Brian
Comments
Anonymous
July 21, 2005
Brian:
This is great stuff and exactly what we have been looking for. Please continue to push this in the new schemas for Word 12. I would be very interested in you expanding on the extended functionality you allude to.
I would like to see you write about or point us to some guidelines for deleting an arbitrary chunk/block of text from a WordML document (not using Word) such that the document will stay well-formed and valid. In some sense this is the opposite of <cfChunk>.
An issue related to deletion is whether Word will clean up un-referenced styles, fonts, etc. when it opens a document that does not refer to them any longer (because a chunk/block was deleted from the document).
Thank you much
Anonymous
July 22, 2005
One problem I ran into in the past was when I copied content between Word documents created with different localized versions of Office. "Heading 1" in the English version was "Hoofding 1" in the Dutch version. I remember Word didn't adjust the names in the merged document, so there was now a style named "Hoofding 1" in an English document.
Anonymous
July 26, 2005
Hey Brad, are you curious about how to delete any random selection of text? Or something that is more structured.
In answer to your question about whether we clean up styles, the answer is no. When we open the file even if the style isn't in use we keep it around because it may actually be part of the template and used later. People often create a simple template that has a bunch of predefined styles that aren't currently in use, but they are kept around so that people using the template can take advantage of them. You'll need to delete the style if you don't want to use it.
A tool that might be cool for someone to build would be one that cleans up all the styles by deleting any style or list definition that isn't in use. I've seen similar tools that use Word's object model, but it would also be cool to do it using WordML.
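As a rough idea of what such a tool could look like, here is a sketch in Python. It assumes the Word 2003 namespace and the usual element names for style references (w:pStyle, w:rStyle, w:tblStyle, plus w:basedOn, w:next and w:link between styles); the function name `remove_unused_styles` is made up, and the sketch only handles styles, not list definitions, so treat it as a starting point rather than a complete tool.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for Word 2003 WordprocessingML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
REFERENCING_TAGS = ("pStyle", "rStyle", "tblStyle", "basedOn", "next", "link")


def w(name):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def references(scope):
    """Style ids mentioned anywhere inside the given element."""
    return {el.get(w("val")) for tag in REFERENCING_TAGS for el in scope.iter(w(tag))}


def remove_unused_styles(root):
    """Delete w:style definitions that nothing in the document refers to."""
    styles = root.find(w("styles"))
    if styles is None:
        return []
    by_id = {s.get(w("styleId")): s for s in styles.findall(w("style"))}

    # Keep default styles and everything referenced outside w:styles...
    keep = {sid for sid, s in by_id.items() if s.get(w("default")) == "on"}
    for child in root:
        if child is not styles:
            keep |= references(child)
    # ...then follow basedOn/next/link chains until nothing new turns up.
    pending = list(keep)
    while pending:
        style = by_id.get(pending.pop())
        if style is None:
            continue
        for sid in references(style) - keep:
            keep.add(sid)
            pending.append(sid)

    removed = [sid for sid in by_id if sid not in keep]
    for sid in removed:
        styles.remove(by_id[sid])
    return removed


DOC = f"""<w:wordDocument xmlns:w="{W}">
  <w:styles>
    <w:style w:type="paragraph" w:default="on" w:styleId="Normal"/>
    <w:style w:type="paragraph" w:styleId="Heading1"><w:basedOn w:val="Base"/></w:style>
    <w:style w:type="paragraph" w:styleId="Base"/>
    <w:style w:type="paragraph" w:styleId="Unused"/>
  </w:styles>
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Hi</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

root = ET.fromstring(DOC)
removed = remove_unused_styles(root)
print(removed)
```

Keep in mind the template case mentioned above: a style that looks unused today may be there on purpose.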
Ignace, I've seen that issue before. It's pretty tough given that many style names are user defined, so we couldn't really try to do a translation. For the styles that are predefined though I thought there was something smarter that we did, but maybe that isn't the case. I'll check it out.
-Brian
Anonymous
August 05, 2005
Brian:
You asked:
"...are you curious about how to delete any random selection of text? Or something that is more structured."
My answer:
No more random than what you can select within Word. Doing this using the OM is simple; what I want is some guidelines for doing it directly on WordML, either through the XML DOM or some other way.
Of course there are the obvious things to consider, being sure the resulting document is left schema valid, etc. However, I suspect that internally Word is following some algorithm for modifying its WordML representation when text is deleted using the OM. We would like to match this by working on WordML directly in our own code, running on a server.
When we create documents on the server we not only need to insert arbitrary chunks of text (cfChunk) into an existing document, but we also need to delete arbitrary chunks of text. The areas of text to be conditionally deleted are designated by special markers in <t> elements. The markers may be in the same <t>, in <t>s in different runs, or in <t>s in different paragraphs. I want to process this WordML directly to carry out these potential deletions but make sure I end up with a schema valid WordML document when I am done.
Any guidance on doing this would be great.
Thanks
Anonymous
August 18, 2005
Brad - your question on deleting text is actually fairly easy to do with XSLT. Since a Word document is a series of <w:p> tags, you only need to remove the <w:t> content and/or <w:p> tags between your marks. If you want to maintain the page layout, just remove the <w:t> content and leave the <w:p> tags in place. Of course this is the simple case; next you will ask about images, tables, lists... These get more complicated. How to solve this one depends on what specifically you are trying to accomplish.
Anonymous
August 29, 2005
This works beautifully! Saves me eons of time in making sure that the inference tool actually kept the styles I needed (which it nearly never did). We need a WordML book -- if I only had time to write one.
Anonymous
October 18, 2005
Hi,
This is excellent information, and exactly what I was looking for (I am creating one Word document from other Word documents). My only question is how to use this in conjunction with embedded objects?
It seems Word saves embedded objects in one XML element (<w:docOleData>) under the root <w:wordDocument> element. How can this element be created when there are multiple embedded objects, and how can they correctly be referenced later in the document in the appropriate <o:OLEObject> elements?
It's a shame I can't use the new "12" format where this would not be a problem!
Regards,
Paul.
Anonymous
November 07, 2005
Hi all,
I need to do WordML-to-XML cleanup work. If anyone knows how to do this, it would help me a lot.
Thanks
Anonymous
March 07, 2006
I am also looking to merge several Word documents with embedded images/objects. I noticed the rID{#} tags and also the rsidR tag, which has a number that appears to be unique per document. It seems the document.xml.rels file should also contain the rsidR as an element so I could place "duplicate" rIDs inside this file and move my embedded images to the media directory. I am contemplating the best way to merge a minimum of 7 documents being worked on by 7 different people.
Anonymous
March 31, 2006
Hi Adam, Brian
Have you found a solution for the way of merging the documents? I thought that chunks will be the solution but it just imports pieces of content.
What I need is to build a document from others that may contain images, objects, everything. It is for a group of researchers who work separately on chapters of an article.
Thanks for your help.
Anonymous
April 01, 2006
Hi Brian,
I am also looking for a solution to merge complete documents.
I would appreciate your feedback on whether we are approaching our situation in a recommended fashion. We have a 100+ document that we need to dynamically populate with data. We have broken the document into logical chapters. Each chapter has its own custom schema. We perform multiple xml/XSLT transforms then save to chapter##.xml. We would now like to merge all the chapters into one master document. Would cfchunk be the recommended approach?
Thanks.
Anonymous
April 24, 2006
Juan Carlos / Brian,
I played around with the cfChunk tag, but yes, I am still looking for a "Word Merge" or "Word Append" solution. So far I think it might have to be custom, but still bugging the product team and a few others for support.
So far I have:
Users use a standard template.
Users use the standard styles in the template.
Users place "placeholders" for images that will be inserted by "me" later.
It looks like I will need to:
Open the document.xml.rels files of the source document.
Find the "max" Relationship Id="rId?" tag in the document.xml.rels file.
Open the document to append and build a mapping of its rIds to the max rId + 1 from the source.
Replace all the rIds in the document.xml and document.xml.rels.
Determine any Targets in the document.xml.rels not in the source document.
Copy them to the source document.
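The renumbering part of the steps above might be sketched like this in Python. The namespaces are the ones used by released Open XML packages, which is an assumption here since beta files may differ, and the function names `max_rid` and `append_relationships` are invented for the example. To avoid duplicating parts such as styles, the sketch only carries over relationships that the appended body actually references. It does not copy media files or resolve clashing Target file names.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespaces (taken from released Open XML packages).
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_TAG = f"{{{PKG_REL}}}Relationship"


def max_rid(rels_root):
    """Highest N among Relationship Id="rIdN" entries (0 if there are none)."""
    numbers = []
    for rel in rels_root.iter(REL_TAG):
        match = re.fullmatch(r"rId(\d+)", rel.get("Id", ""))
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0)


def append_relationships(base_rels, extra_rels, extra_doc):
    """Renumber the appended document's rIds past the base document's maximum."""
    ref_prefix = f"{{{DOC_REL}}}"
    # Ids the appended body really uses (r:id, r:embed, r:link, ...).
    used = {value
            for element in extra_doc.iter()
            for name, value in element.attrib.items()
            if name.startswith(ref_prefix)}

    next_number = max_rid(base_rels) + 1
    mapping = {}
    for rel in extra_rels.iter(REL_TAG):
        if rel.get("Id") in used:
            mapping[rel.get("Id")] = f"rId{next_number}"
            next_number += 1
            attrs = dict(rel.attrib)
            attrs["Id"] = mapping[rel.get("Id")]
            ET.SubElement(base_rels, REL_TAG, attrs)

    # Rewrite the references inside the appended body to the new ids.
    for element in extra_doc.iter():
        for name, value in list(element.attrib.items()):
            if name.startswith(ref_prefix) and value in mapping:
                element.set(name, mapping[value])
    return mapping


BASE_RELS = ET.fromstring(f"""<Relationships xmlns="{PKG_REL}">
  <Relationship Id="rId1" Type="example/styles" Target="styles.xml"/>
  <Relationship Id="rId4" Type="example/image" Target="media/image1.png"/>
</Relationships>""")
EXTRA_RELS = ET.fromstring(f"""<Relationships xmlns="{PKG_REL}">
  <Relationship Id="rId1" Type="example/styles" Target="styles.xml"/>
  <Relationship Id="rId2" Type="example/image" Target="media/image2.png"/>
</Relationships>""")
EXTRA_DOC = ET.fromstring(f"""<doc xmlns:r="{DOC_REL}">
  <picture r:embed="rId2"/>
</doc>""")

mapping = append_relationships(BASE_RELS, EXTRA_RELS, EXTRA_DOC)
print(mapping)
```

The Type values and the `<doc>`/`<picture>` elements in the sample data are placeholders, not real schema names; only the attribute namespace matters to the remapping.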
These steps do not handle the more complex items like OLE objects and such. I have a limited scope and think I can get away with this approach (though they might “throw” in a PowerPoint slide). The cfChunk tag does not seem to handle the external objects since their references are in the document.xml.rels file.
The documents I am working with also “contain” content types, and I am still looking at what (if anything) I will need to do to them.
Anonymous
April 25, 2006
Hi Brian,
We've been using cfChunk pretty successfully (building WordML through XSLT). One problem we've hit is embedding page orientation changes (we have a landscape page we'd like to include); Word seems to ignore them! Is there any way the contents of a cfChunk can change the orientation of a page? Currently it's looking like we'll need to change the orientation in the 'master' document (outside the cfChunk).
Thanks!
Chris.
Anonymous
April 25, 2006
To solve my Word Append problem, I actually resorted to the Microsoft.Office.Interop.Word DLL (not to derail the blog). This does not really take advantage of the XML and the cfChunk tag, but it worked for appending documents together, and it keeps the formatting along with embedded images and such. I will be testing to see how robust it is, but I ran it in a loop 1,000 times on a VPC and it ran successfully. The document was fairly complex, and I opened and closed Word each time to see how well it handled the volume.
private void button1_Click(object sender, EventArgs e)
{
    Microsoft.Office.Interop.Word.ApplicationClass oWord = null;
    Microsoft.Office.Interop.Word.Document oDocument = null;
    Microsoft.Office.Interop.Word.Document oDocumentDest = null;
    object saveChanges = false;
    object missing = System.Reflection.Missing.Value;
    object fileName = @"C:\Source.docx";
    object isVisible = false; // change to true to see Word working
    object readOnly = false;
    object destFilename = @"C:\Destination.docx";
    try
    {
        // Copy the document to preserve the styles/headers/etc. set in it (this uses the source file as a template)
        System.IO.File.Copy(@"C:\Source.docx", @"C:\Destination.docx", true);
        oWord = new Microsoft.Office.Interop.Word.ApplicationClass();
        oWord.Visible = false; // change to true to see Word working
        // Open a document to copy from
        oDocument = oWord.Documents.Open(ref fileName, ref missing, ref readOnly, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref isVisible, ref missing, ref missing, ref missing, ref missing, ref missing);
        oDocument.Activate();
        // Select and copy from the original document
        oDocument.Select();
        oWord.Selection.Copy();
        // Open the document to paste to
        oDocumentDest = oWord.Documents.Open(ref destFilename, ref missing, ref readOnly, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref isVisible, ref missing, ref missing, ref missing, ref missing, ref missing);
        // Clear all the data that came over from the source when the file was copied
        oDocumentDest.Select();
        oWord.Selection.Delete(ref missing, ref missing);
        // Paste several times to see if it works
        oDocumentDest.Activate();
        oWord.Selection.Paste();
        oWord.Selection.Paste();
        oWord.Selection.Paste();
        // Save
        oDocumentDest.SaveAs(ref destFilename, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.ToString());
    }
    finally
    {
        // Close the documents and quit Word, even if something failed
        if (oDocumentDest != null)
        {
            oDocumentDest.Close(ref saveChanges, ref missing, ref missing);
            oDocumentDest = null;
        }
        if (oDocument != null)
        {
            oDocument.Close(ref saveChanges, ref missing, ref missing);
            oDocument = null;
        }
        if (oWord != null)
        {
            oWord.Quit(ref saveChanges, ref missing, ref missing);
            oWord = null;
        }
    }
}
Anonymous
August 08, 2006
This question has come up a few times, most recently over on the OpenXMLDeveloper site (http://openxmldeveloper.org/forums/477/ShowThread.aspx#477)...
Anonymous
July 03, 2008
How to get content of a node or bookmark as HTML, WordML, RTF? Many customers of Aspose.Words ask this...