Using DocumentBuilder with Content Controls for Document Assembly

DocumentBuilder is an example class that’s part of the PowerTools for Open XML project that enables you to assemble new documents from existing documents.  One of the problems to solve when moving markup from one document to another is that of interrelated markup – markup in one paragraph often depends on markup in other paragraphs, or in other parts of the Open XML package.  DocumentBuilder fixes up interrelated markup when assembling a new document from existing documents.  This post shows how to use DocumentBuilder in concert with content controls to control the document assembly.

Zeyad Rajabi wrote a blog post on using content controls to control document assembly.  His post uses the altChunk approach for document assembly.  This post presents code that mirrors the code in his post, except that this code uses DocumentBuilder.  I’ve covered altChunk also, in How to Use altChunk for Document Assembly.

The updated post Inserting / Deleting / Moving Paragraphs in Open XML Wordprocessing Documents documents interrelationships in paragraph markup in detail.

The post Move/Insert/Delete Paragraphs in Word Processing Documents using the Open XML SDK introduces the DocumentBuilder class.

See Comparison of altChunk to the DocumentBuilder Class for more information about both approaches to document assembly.

The gist of the approach is that you insert content controls in the ‘template’ document, setting the tag of each content control to the name of the document that you want inserted at the point of the content control.  For example, in the following document, SolarOverview.docx will replace the content control in the assembled document:
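As a rough illustration of the tag lookup itself, here is a minimal sketch that reads the tag of each content control from a hand-built <w:body> element and derives a file name from it.  The TagSketch class, the TagsInBody method, and the hand-built body are assumptions made for illustration; they are not part of DocumentBuilder or of the sample documents.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TagSketch
{
    // WordprocessingML main namespace.
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the w:tag value of each block-level content control in the body.
    // Content controls without a tag are skipped.
    public static string[] TagsInBody(XElement body)
    {
        return body
            .Elements(W + "sdt")
            .Select(sdt => (string)sdt
                .Elements(W + "sdtPr")
                .Elements(W + "tag")
                .Attributes(W + "val")
                .FirstOrDefault())
            .Where(tag => tag != null)
            .ToArray();
    }

    public static void Main()
    {
        // Hand-built body (illustrative only): one paragraph, then one
        // content control tagged "SolarOverview".
        XElement body = new XElement(W + "body",
            new XElement(W + "p"),
            new XElement(W + "sdt",
                new XElement(W + "sdtPr",
                    new XElement(W + "tag",
                        new XAttribute(W + "val", "SolarOverview"))),
                new XElement(W + "sdtContent",
                    new XElement(W + "p"))));

        // The tag plus ".docx" names the document to insert.
        foreach (string tag in TagsInBody(body))
            Console.WriteLine(tag + ".docx");   // prints SolarOverview.docx
    }
}
```

Unlike the example code in this post, this sketch tolerates a content control that has no tag by skipping it, which may or may not be the behavior you want.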

Example Code

The example takes a ‘template’ document, solar-system.docx, and inserts eleven documents into it.  As I mentioned, each inserted document replaces a content control.  This example demonstrates one approach to coding document assembly using content controls and DocumentBuilder:

static void Main(string[] args)
{
    using (WordprocessingDocument solarSystem =
        WordprocessingDocument.Open("solar-system.docx", false))
    {
        // the namespace name is an identifier and must use http, not https
        XNamespace w =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        // get children elements of the <w:body> element
        var q1 = solarSystem
            .MainDocumentPart
            .GetXDocument()
            .Root
            .Element(w + "body")
            .Elements();

        // project collection of tuples containing element and type
        var q2 = q1
            .Select(
                e =>
                {
                    string keyForGroupAdjacent = ".NonContentControl";
                    if (e.Name == w + "sdt")
                        keyForGroupAdjacent = e.Element(w + "sdtPr")
                            .Element(w + "tag")
                            .Attribute(w + "val")
                            .Value;
                    if (e.Name == w + "sectPr")
                        keyForGroupAdjacent = null;
                    return new
                    {
                        Element = e,
                        KeyForGroupAdjacent = keyForGroupAdjacent
                    };
                }
            ).Where(e => e.KeyForGroupAdjacent != null);

        // group by type
        var q3 = q2.GroupAdjacent(e => e.KeyForGroupAdjacent);

        // validate existence of files referenced in content controls
        foreach (var f in q3.Where(g => g.Key != ".NonContentControl"))
        {
            string filename = f.Key + ".docx";
            FileInfo fi = new FileInfo(filename);
            if (!fi.Exists)
            {
                Console.WriteLine("{0} doesn't exist.", filename);
                Environment.Exit(0);
            }
        }

        // project collection with opened WordProcessingDocument
        // ToList() runs the query once, so the documents disposed below
        // are the same objects that were opened here
        var q4 = q3
            .Select(g => new
            {
                Group = g,
                Document = g.Key != ".NonContentControl" ?
                    WordprocessingDocument.Open(g.Key + ".docx", false) :
                    solarSystem
            })
            .ToList();

        // project collection of OpenXml.PowerTools.Source
        var sources = q4
            .Select(
                g =>
                {
                    if (g.Group.Key == ".NonContentControl")
                        return new Source(
                            g.Document,
                            g.Group
                                .First()
                                .Element
                                .ElementsBeforeSelf()
                                .Count(),
                            g.Group
                                .Count(),
                            false);
                    else
                        return new Source(g.Document, false);
                }
            ).ToList();

        DocumentBuilder.BuildDocument(sources, "solar-system-new.docx");

        // dispose of the opened WordprocessingDocument objects
        foreach (var g in q4)
            if (g.Group.Key != ".NonContentControl")
                g.Document.Dispose();
    }
}

How the Code Works

The code consists of chained queries that eventually build up a list of OpenXml.PowerTools.Source objects, which is what we pass to DocumentBuilder.BuildDocument to specify the sources for the document assembly.

When building up the list of source objects, we treat the two kinds of content in the ‘template’ document differently.  Where the ‘template’ document contains paragraphs or tables, we need to include a source object with the source document set to the ‘template’ document, and the source range set to the range of those paragraphs or tables.  Where the ‘template’ document contains a content control, we need to include a source object with the source document set to the document being imported.  We don’t need to set a range – we simply import the entire document.

In other words, we need to group together all adjacent paragraphs that aren’t content controls, and we need to process each content control separately.  This is a job for the GroupAdjacent extension method.  If we create a key such that all non-content-control elements have the same key, and each content control has its own key, then we’ll end up with groups of paragraphs to import from the template document, and separate groups that contain one content control each.  As I develop the query, I’ll show intermediate results so that you can see exactly what I mean.
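To make the idea of grouping adjacent items concrete, here is a small self-contained sketch that collapses runs of equal adjacent keys into a key, a start index, and a count, which is roughly the information a ranged source needs.  The AdjacentSketch class and its Runs method are hypothetical stand-ins written for this illustration; they are not the GroupAdjacent implementation listed at the end of this post, and the start index here is simply the position in the key list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AdjacentSketch
{
    // Collapses runs of equal adjacent keys into (key, start index, count).
    public static List<Tuple<string, int, int>> Runs(IEnumerable<string> keys)
    {
        var runs = new List<Tuple<string, int, int>>();
        int index = 0;
        foreach (string key in keys)
        {
            int lastPos = runs.Count - 1;
            if (lastPos >= 0 && runs[lastPos].Item1 == key)
            {
                // same key as the previous item: extend the current run
                runs[lastPos] = Tuple.Create(
                    key, runs[lastPos].Item2, runs[lastPos].Item3 + 1);
            }
            else
            {
                // key changed: start a new run at this position
                runs.Add(Tuple.Create(key, index, 1));
            }
            index++;
        }
        return runs;
    }

    public static void Main()
    {
        string[] keys =
        {
            ".NonContentControl", ".NonContentControl",
            "Earth", ".NonContentControl", "Mars"
        };

        // prints:
        // .NonContentControl: start 0, count 2
        // Earth: start 2, count 1
        // .NonContentControl: start 3, count 1
        // Mars: start 4, count 1
        foreach (var run in Runs(keys))
            Console.WriteLine("{0}: start {1}, count {2}",
                run.Item1, run.Item2, run.Item3);

        // The standard GroupBy operator would merge both .NonContentControl
        // runs into a single group, losing their positions: prints 3
        Console.WriteLine(keys.GroupBy(k => k).Count());
    }
}
```

The contrast with GroupBy is the reason an adjacency-aware operator is needed here: the position of each run of template paragraphs matters, so two separate runs must stay separate.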

The result of the first query is a collection of the child elements of the <w:body> element:

// get children elements of the <w:body> element
var q1 = solarSystem
    .MainDocumentPart
    .GetXDocument()
    .Root
    .Element(w + "body")
    .Elements();

This is pretty simple – no need to show the output from this query.

Here is the second query:

// project collection of tuples containing element and type
var q2 = q1
    .Select(
        e =>
        {
            string keyForGroupAdjacent = ".NonContentControl";
            if (e.Name == w + "sdt")
                keyForGroupAdjacent = e.Element(w + "sdtPr")
                    .Element(w + "tag")
                    .Attribute(w + "val")
                    .Value;
            if (e.Name == w + "sectPr")
                keyForGroupAdjacent = null;
            return new
            {
                Element = e,
                KeyForGroupAdjacent = keyForGroupAdjacent
            };
        }
    ).Where(e => e.KeyForGroupAdjacent != null);

// temporary code to dump q2
foreach (var item in q2)
    Console.WriteLine(item.KeyForGroupAdjacent);
Environment.Exit(0);

If the child element of the <w:body> element is a content control, then the KeyForGroupAdjacent member of the anonymous type is set to the tag value of the content control.

If the child element is not a content control, then KeyForGroupAdjacent is set to “.NonContentControl”, which is an invalid filename – no chance to conflict with the tag values of the content controls.

If the child element is a section marker (<w:sectPr>), then we want to ignore that child element.  Setting the KeyForGroupAdjacent to null, and then filtering out those null items takes care of that.

When we dump out q2 to the console, we see:

.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
.NonContentControl
SolarOverview
Sun
Mercury
Venus
Earth
.NonContentControl
Mars
Jupiter
Saturn
Uranus
Neptune
Pluto

Next, we use the GroupAdjacent extension method to group the .NonContentControls together:

// group by type
var q3 = q2.GroupAdjacent(e => e.KeyForGroupAdjacent);

// temporary code to dump q3
foreach (var g in q3)
    Console.WriteLine("{0}: {1}", g.Key, g.Count());
Environment.Exit(0);

When we run this, we see:

.NonContentControl: 21
SolarOverview: 1
Sun: 1
Mercury: 1
Venus: 1
Earth: 1
.NonContentControl: 1
Mars: 1
Jupiter: 1
Saturn: 1
Uranus: 1
Neptune: 1
Pluto: 1

Next, the code validates that the .DOCX files referenced by the content controls exist:

// validate existence of files referenced in content controls
foreach (var f in q3.Where(g => g.Key != ".NonContentControl"))
{
    string filename = f.Key + ".docx";
    FileInfo fi = new FileInfo(filename);
    if (!fi.Exists)
    {
        Console.WriteLine("{0} doesn't exist.", filename);
        Environment.Exit(0);
    }
}

Then, the code projects a collection of anonymous types that include the group, as well as the open WordprocessingDocument objects:

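// project collection with opened WordProcessingDocument
// ToList() runs the query once, so the documents disposed later
// are the same objects that were opened here
var q4 = q3
    .Select(g => new
    {
        Group = g,
        Document = g.Key != ".NonContentControl" ?
            WordprocessingDocument.Open(g.Key + ".docx", false) :
            solarSystem
    })
    .ToList();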

The observant will notice that opening these documents very definitely introduces state to this very not-pure query.  We’ll need to close/dispose of those documents later.  I’ve been fermenting an idea about wrappers around the Open XML SDK that give true functional composability to Open XML documents.  This approach would eliminate this issue of classes that implement IDisposable.  If when I open that bottle it hasn’t turned to vinegar, I’ll blog it.
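One consequence of putting a side effect inside a query is worth spelling out: LINQ’s Select is lazy, so every enumeration of an unmaterialized query runs the projection again.  The following minimal sketch shows the effect using a hypothetical FakeDoc class as a stand-in for WordprocessingDocument (FakeDoc, LazySketch, and OpenCount are names invented for this illustration).  Enumerating the lazy query twice creates two sets of objects, while materializing it with ToList creates one set, so the objects that get disposed are the ones that were used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for WordprocessingDocument.
// Counts how many instances have been created.
public sealed class FakeDoc : IDisposable
{
    public static int Opened;
    public bool Disposed { get; private set; }
    public FakeDoc() { Opened++; }
    public void Dispose() { Disposed = true; }
}

public static class LazySketch
{
    // Returns how many FakeDoc instances are created when the query
    // is enumerated twice, with or without materializing it first.
    public static int OpenCount(bool materialize)
    {
        FakeDoc.Opened = 0;
        IEnumerable<FakeDoc> docs =
            new[] { "Sun", "Mercury" }.Select(name => new FakeDoc());
        if (materialize)
            docs = docs.ToList();

        // first pass: stands in for building the list of sources
        foreach (FakeDoc d in docs) { }

        // second pass: stands in for the cleanup loop
        foreach (FakeDoc d in docs)
            d.Dispose();

        return FakeDoc.Opened;
    }

    public static void Main()
    {
        Console.WriteLine(OpenCount(false)); // prints 4: each pass opened its own set
        Console.WriteLine(OpenCount(true));  // prints 2: one set, opened once
    }
}
```

In the lazy case the first set of objects is never disposed, because the cleanup pass creates and disposes a second set instead.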

Finally, we’re ready to project the list of OpenXml.PowerTools.Source objects:

// project collection of OpenXml.PowerTools.Source
var sources = q4
    .Select(
        g =>
        {
            if (g.Group.Key == ".NonContentControl")
                return new Source(
                    g.Document,
                    g.Group
                        .First()
                        .Element
                        .ElementsBeforeSelf()
                        .Count(),
                    g.Group
                        .Count(),
                    false);
            else
                return new Source(g.Document, false);
        }
    ).ToList();

Last, the code calls DocumentBuilder.BuildDocument and disposes of all of the opened WordprocessingDocument objects (except the ‘template’ document, which will be disposed when execution leaves the scope of the using statement).

DocumentBuilder.BuildDocument(sources, "solar-system-new.docx");

// dispose of the opened WordprocessingDocument objects
foreach (var g in q4)
    if (g.Group.Key != ".NonContentControl")
        g.Document.Dispose();

The entire example, including the implementation of the GroupAdjacent extension method and the GetXDocument extension method, follows.  I’ve attached the source file and the sample documents to this post.  This code works with version 1.1.1 of DocumentBuilder (and not prior versions).  You can download DocumentBuilder.zip, which contains DocumentBuilder, from https://www.CodePlex.com/PowerTools.  It’s under the ‘Downloads’ tab.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXml.PowerTools;

public class GroupOfAdjacent<TSource, TKey> : IEnumerable<TSource>, IGrouping<TKey, TSource>
{
    public TKey Key { get; set; }
    private List<TSource> GroupList { get; set; }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return ((System.Collections.Generic.IEnumerable<TSource>)this).GetEnumerator();
    }

    System.Collections.Generic.IEnumerator<TSource> System.Collections.Generic.IEnumerable<TSource>.GetEnumerator()
    {
        foreach (var s in GroupList)
            yield return s;
    }

    public GroupOfAdjacent(List<TSource> source, TKey key)
    {
        GroupList = source;
        Key = key;
    }
}

public static class LocalExtensions
{
    // loads the part's XML once and caches it as an annotation on the part
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
            return xdoc;
        using (StreamReader streamReader =
            new StreamReader(part.GetStream()))
            xdoc = XDocument.Load(XmlReader.Create(streamReader));
        part.AddAnnotation(xdoc);
        return xdoc;
    }

    // groups consecutive items that share a key; a key that reappears
    // later in the sequence starts a new group
    public static IEnumerable<IGrouping<TKey, TSource>> GroupAdjacent<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        TKey last = default(TKey);
        bool haveLast = false;
        List<TSource> list = new List<TSource>();

        foreach (TSource s in source)
        {
            TKey k = keySelector(s);
            if (haveLast)
            {
                if (!k.Equals(last))
                {
                    yield return new GroupOfAdjacent<TSource, TKey>(list, last);
                    list = new List<TSource>();
                    list.Add(s);
                    last = k;
                }
                else
                {
                    list.Add(s);
                    last = k;
                }
            }
            else
            {
                list.Add(s);
                last = k;
                haveLast = true;
            }
        }
        if (haveLast)
            yield return new GroupOfAdjacent<TSource, TKey>(list, last);
    }
}

class DocProc
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument solarSystem =
            WordprocessingDocument.Open("solar-system.docx", false))
        {
            // the namespace name is an identifier and must use http, not https
            XNamespace w =
                "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

            // get children elements of the <w:body> element
            var q1 = solarSystem
                .MainDocumentPart
                .GetXDocument()
                .Root
                .Element(w + "body")
                .Elements();

            // project collection of tuples containing element and type
            var q2 = q1
                .Select(
                    e =>
                    {
                        string keyForGroupAdjacent = ".NonContentControl";
                        if (e.Name == w + "sdt")
                            keyForGroupAdjacent = e.Element(w + "sdtPr")
                                .Element(w + "tag")
                                .Attribute(w + "val")
                                .Value;
                        if (e.Name == w + "sectPr")
                            keyForGroupAdjacent = null;
                        return new
                        {
                            Element = e,
                            KeyForGroupAdjacent = keyForGroupAdjacent
                        };
                    }
                ).Where(e => e.KeyForGroupAdjacent != null);

            // group by type
            var q3 = q2.GroupAdjacent(e => e.KeyForGroupAdjacent);

            // validate existence of files referenced in content controls
            foreach (var f in q3.Where(g => g.Key != ".NonContentControl"))
            {
                string filename = f.Key + ".docx";
                FileInfo fi = new FileInfo(filename);
                if (!fi.Exists)
                {
                    Console.WriteLine("{0} doesn't exist.", filename);
                    Environment.Exit(0);
                }
            }

            // project collection with opened WordProcessingDocument
            // ToList() runs the query once, so the documents disposed below
            // are the same objects that were opened here
            var q4 = q3
                .Select(g => new
                {
                    Group = g,
                    Document = g.Key != ".NonContentControl" ?
                        WordprocessingDocument.Open(g.Key + ".docx", false) :
                        solarSystem
                })
                .ToList();

            // project collection of OpenXml.PowerTools.Source
            var sources = q4
                .Select(
                    g =>
                    {
                        if (g.Group.Key == ".NonContentControl")
                            return new Source(
                                g.Document,
                                g.Group
                                    .First()
                                    .Element
                                    .ElementsBeforeSelf()
                                    .Count(),
                                g.Group
                                    .Count(),
                                false);
                        else
                            return new Source(g.Document, false);
                    }
                ).ToList();

            DocumentBuilder.BuildDocument(sources, "solar-system-new.docx");

            // dispose of the opened WordprocessingDocument objects
            foreach (var g in q4)
                if (g.Group.Key != ".NonContentControl")
                    g.Document.Dispose();
        }
    }
}

DocumentBuilderContentControls.zip

Comments

  • Anonymous
    June 02, 2010
    Thanks a lot! I really need this code. But also I need one which could extract all ( text, tables and Images) from a content control.... and place it in another docx. Could you help me?? Regards

  • Anonymous
    October 20, 2011
    What version of the XML Powertools do I need to run the above code?

  • Anonymous
    February 11, 2013
    I do not wish to replace the content controls. I want the copied content to be the content of the rich text content control so that my users could edit it further as the document I have is protected and I grant editing rights to different content controls to different users. I have tried using AltChunk and that works fine for the first run. It has somehow corrupted my content control and at the second run the content control is not identified by the SdtElement in open xml nor by the ContentControl object in VSTO. Will document builder resolve this issue?