Splitting Runs in Open XML Word Processing Document Paragraphs
In Open XML word-processing document markup, paragraphs contain runs, and runs contain text elements. Sometimes when transforming a document, we may want to split runs differently than in the original document. This post presents two small functions that help us deal with paragraphs and runs: one determines the split locations of runs, and the other splits runs.
Note: I no longer recommend this approach. Instead, I recommend breaking up runs into multiple runs, each containing a single character. Then you can search for text, not with a method that finds a string in a string, but with a custom method that matches up runs (each one character long) with the characters of a string. Then you can replace the matched runs with the new content. Finally, you can coalesce adjacent runs with identical formatting, so that the end result is neat and clean markup. You can find a screen cast that discusses this in detail, as well as sample code to do this here.
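As a rough illustration of the approach described in the note, here is a minimal sketch. It is not the sample code referred to above; the class and method names are hypothetical, and it assumes that every text run is a direct child of the paragraph and contains only an optional w:rPr and a single w:t.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SingleCharRunSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static bool IsTextRun(XElement e)
    {
        return e.Name == w + "r" && e.Elements(w + "t").Any();
    }

    // Build a run with a copy of the given run properties (rPr may be null).
    static XElement MakeRun(XElement rPr, string text)
    {
        bool edgeSpace = text.StartsWith(" ") || text.EndsWith(" ");
        return new XElement(w + "r",
            rPr == null ? null : new XElement(rPr),
            new XElement(w + "t",
                edgeSpace ? new XAttribute(XNamespace.Xml + "space", "preserve") : null,
                text));
    }

    // Step 1: replace every text run with one run per character.
    public static XElement SplitIntoSingleCharRuns(XElement paragraph)
    {
        return new XElement(paragraph.Name, paragraph.Attributes(),
            paragraph.Elements().SelectMany(e => Explode(e)));
    }

    static IEnumerable<XElement> Explode(XElement e)
    {
        if (!IsTextRun(e))
        {
            yield return new XElement(e);   // copy non-text children unchanged
            yield break;
        }
        string text = string.Concat(e.Elements(w + "t").Select(t => (string)t));
        foreach (char c in text)
            yield return MakeRun(e.Element(w + "rPr"), c.ToString());
    }

    // Step 2: index of the run at which the search string starts, or -1.
    public static int FindText(XElement exploded, string search)
    {
        List<string> chars = exploded.Elements(w + "r")
            .Select(r => (string)r.Element(w + "t"))
            .ToList();
        for (int i = 0; i + search.Length <= chars.Count; i++)
            if (search.Select((c, j) => chars[i + j] == c.ToString()).All(b => b))
                return i;
        return -1;
    }

    // Step 3: merge adjacent text runs whose run properties are identical.
    public static XElement CoalesceRuns(XElement paragraph)
    {
        var result = new XElement(paragraph.Name, paragraph.Attributes());
        foreach (XElement e in paragraph.Elements())
        {
            XElement last = result.Elements().LastOrDefault();
            bool merge = last != null && IsTextRun(e) && IsTextRun(last) &&
                XNode.DeepEquals(e.Element(w + "rPr"), last.Element(w + "rPr"));
            if (merge)
                last.ReplaceWith(MakeRun(last.Element(w + "rPr"),
                    (string)last.Element(w + "t") + (string)e.Element(w + "t")));
            else
                result.Add(new XElement(e));
        }
        return result;
    }
}
```

The replacement step (swapping the matched runs for new content) is left out of the sketch. Runs nested inside content controls would need the recursive treatment discussed later in this post.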
Word 2007 has a neat feature where you can lock a document and disallow editing of the content, yet allow the user to add comments. You can send this document for review to a number of users, and after the reviewers return the documents, it would be handy to have some code that merges comments from all documents into a single document. I’m currently working on a blog post that shows how to do this. However, adding a comment to a paragraph can cause runs to be split, which adds a bit of complexity.
Paragraphs, Runs, and Text Elements
The following markup shows a very simple paragraph. We can see the paragraph element, the run element, and the text element.
<w:p>
  <w:r>
    <w:t>abcdefghi</w:t>
  </w:r>
</w:p>
If we select “def” in the above text, and add a comment, the markup changes to look like this:
<w:p>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t>def</w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>
In this paragraph, we can see the commentRangeStart and commentRangeEnd elements. In addition, we can see a special run that contains information on the styling of the text that is commented. This special run contains a commentReference element.
If we want to programmatically insert a comment into a document, we need to split runs as appropriate so that we can insert commentRangeStart, commentRangeEnd, and the special run that contains commentReference into the paragraph.
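Once the runs are split at the right places, adding the comment markup to the paragraph amounts to surrounding the target run with the three elements shown above. The following is a minimal sketch under that assumption; the class and method names are hypothetical, the target run must be a direct child of the paragraph, and the comment itself (which lives in the document's separate comments part) is not created here.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CommentMarkupSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Return a new paragraph in which 'run' (a direct child of 'paragraph')
    // is wrapped in a comment range and followed by the comment reference run.
    public static XElement WrapRunInCommentRange(XElement paragraph, XElement run, int id)
    {
        return new XElement(paragraph.Name,
            paragraph.Attributes(),
            paragraph.Elements().SelectMany(e => e != run
                ? new[] { new XElement(e) }   // copy other children unchanged
                : new[]
                {
                    new XElement(w + "commentRangeStart", new XAttribute(w + "id", id)),
                    new XElement(e),
                    new XElement(w + "commentRangeEnd", new XAttribute(w + "id", id)),
                    new XElement(w + "r",
                        new XElement(w + "rPr",
                            new XElement(w + "rStyle",
                                new XAttribute(w + "val", "CommentReference"))),
                        new XElement(w + "commentReference", new XAttribute(w + "id", id)))
                }));
    }
}
```

The sketch builds a new paragraph rather than modifying the original, in keeping with the functional style used in the rest of this post.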
Note that a paragraph can be split into runs for a variety of reasons, and that there are a number of other valid child elements of the paragraph element. For example, because the above text isn’t a correctly spelled word, and isn’t a sentence with proper grammar, the markup can include w:proofErr elements:
<w:p>
  <w:proofErr w:type="spellStart"/>
  <w:proofErr w:type="gramStart"/>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t>def</w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:proofErr w:type="gramEnd"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd"/>
</w:p>
When splitting runs, we want to honor those existing run splits, and make sure that we don’t disturb those other elements.
As Open XML developers know, content controls are very powerful features of Open XML. They enable a vast number of scenarios – we can make our documents smarter. However, they add an interesting twist to markup. The element for content controls is w:sdt, which contains another element, w:sdtContent, which contains the contents. This means that runs that we potentially want to split occur at different levels of the XML hierarchy:
<w:p>
  <w:r>
    <w:t>123</w:t>
  </w:r>
  <w:sdt>
    <w:sdtContent>
      <w:r>
        <w:t>4567</w:t>
      </w:r>
    </w:sdtContent>
  </w:sdt>
  <w:r>
    <w:t>890</w:t>
  </w:r>
</w:p>
We may need to split runs at any level - as a child of the paragraph, or as content in a content control. A recursive transform handles this nicely, because it visits runs wherever they occur in the hierarchy.
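To make the idea of a recursive transform concrete, here is a minimal sketch. The names are hypothetical, and this is not the implementation presented later in this post. It rebuilds the tree and hands every run, at whatever depth it occurs, to a caller-supplied function.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RecursiveTransformSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Clone the tree, replacing each w:r (at any depth) with the result of
    // transformRun. The result may be one element or a sequence of elements.
    public static object Transform(XNode node, Func<XElement, object> transformRun)
    {
        XElement e = node as XElement;
        if (e == null)
            return node;                // text and other nodes are copied as-is
        if (e.Name == w + "r")
            return transformRun(e);
        return new XElement(e.Name,
            e.Attributes(),
            e.Nodes().Select(n => Transform(n, transformRun)));
    }
}
```

Because the function recurses through w:sdt and w:sdtContent like any other element, a run inside a content control is treated the same way as a run that is a direct child of the paragraph.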
Determining Run Split Locations
The first piece of functionality that we need is a method to return an array of integers indicating where run splits are. If we are moving comments from one document to another, then we want to find out where the run splits are in the source document so that we can create the same run splits in the destination document.
Here’s the prototype of a simple method to do so:
static int[] RunSplitLocations(XElement paragraph)
The following paragraph markup contains three runs that contain text (the run that holds the commentReference element has no text, so it is ignored):
<w:p>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t>def</w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>
If we call RunSplitLocations for this paragraph, it returns an array that contains:
0
3
6
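Each split location is simply the sum of the text lengths of the runs that precede it. As a small sketch of that arithmetic (the helper name is hypothetical and it works on plain lengths rather than on XML), a single pass with a running offset is enough:

```csharp
using System.Collections.Generic;

public static class SplitLocationSketch
{
    // Given the text length of each run, in document order, return the
    // character offset at which each run starts.
    public static int[] FromRunLengths(IEnumerable<int> runLengths)
    {
        var locations = new List<int>();
        int offset = 0;
        foreach (int length in runLengths)
        {
            locations.Add(offset);
            offset += length;
        }
        return locations.ToArray();
    }
}
```

For the paragraph above, the run lengths are 3, 3, and 3, which gives the locations 0, 3, and 6.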
Splitting Runs
If we have another document that contains no comments in this paragraph, and we want to split runs so that we can insert a comment on the middle three characters, we can call another method that takes an array of integers to do the splitting:
public static XElement SplitRunsInParagraph(XElement p, int[] positions)
If we have a paragraph with this markup:
<w:p>
  <w:r>
    <w:t>abcdefghi</w:t>
  </w:r>
</w:p>
And we call SplitRunsInParagraph passing an array that contains 0, 3, and 6, it returns a paragraph that looks like this:
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:r>
    <w:t>def</w:t>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>
As I previously mentioned, the paragraph may contain child elements other than runs. SplitRunsInParagraph will leave those other elements in place. Also, a run can contain styling information, which we also want to leave in place.
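At its core, splitting a run means cutting its text at the given positions. The following sketch isolates that step on a plain string. The helper name is hypothetical, and it assumes the positions are sorted, distinct, start at 0, and fall inside the text.

```csharp
using System.Linq;

public static class TextSplitSketch
{
    // Cut 'text' into pieces that start at each of the given positions.
    // The last piece runs to the end of the text.
    public static string[] SplitAt(string text, int[] positions)
    {
        return positions
            .Select((p, i) => i != positions.Length - 1
                ? text.Substring(p, positions[i + 1] - p)
                : text.Substring(p))
            .ToArray();
    }
}
```

Calling SplitAt("abcdefghi", new[] { 0, 3, 6 }) yields "abc", "def", and "ghi", which are the texts of the three runs in the result above.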
Now that we have some methods to determine where run splits are, and to create run splits, it will be pretty simple to write a pure functional transform to move comments from one document to another (if the documents contain the exact same content, with the exception of comments).
The Code
The following example contains RunSplitLocations and SplitRunsInParagraph. This code uses a node cloning technique similar to what I presented in this post. In addition, the code uses the pre-atomization approach that I showed in this post. This code implements a pure functional transformation - no side effects anywhere, which will make it easy to use when writing the next transformation.
Here’s the code (also attached):
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class Extensions
{
    // load the XML of the part once, and cache it as an annotation on the part
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
            return xdoc;
        using (StreamReader streamReader = new StreamReader(part.GetStream()))
            xdoc = XDocument.Load(XmlReader.Create(streamReader));
        part.AddAnnotation(xdoc);
        return xdoc;
    }

    public static string StringConcatenate(this IEnumerable<string> source)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string s in source)
            sb.Append(s);
        return sb.ToString();
    }
}

public static class W
{
    // the WordprocessingML main namespace (the URI scheme is http)
    public static XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    public static XName t = w + "t";
    public static XName r = w + "r";
    public static XName del = w + "del";
    public static XName body = w + "body";
    public static XName p = w + "p";
    public static XName moveFrom = w + "moveFrom";
}
class Program
{
    static int GetRunLength(XElement e)
    {
        return e
            .Descendants(W.t)
            .Select(t => (string)t)
            .StringConcatenate()
            .Length;
    }

    // return the run split locations for all runs in the paragraph
    static int[] RunSplitLocations(XElement paragraph)
    {
        // find the runs that don't have w:del or w:moveFrom as parent elements
        var runElements = paragraph
            .Descendants(W.r)
            .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&
                e.Descendants(W.t).Any());
        // determine the run length of each run
        var runs = runElements
            .Select(r => new {
                RunElement = r,
                RunLength = GetRunLength(r)
            });
        // determine the split locations
        var runSplits = runs
            .Select(r => runs
                .TakeWhile(a => a.RunElement != r.RunElement)
                .Select(z => z.RunLength)
                .Sum());
        return runSplits.ToArray();
    }

    // if value starts or ends with a space, return xml:space="preserve" attribute
    // else return null (an empty value also returns null)
    static XAttribute XmlSpacePreserved(string value)
    {
        if (value.Length > 0 &&
            (value[0] == ' ' || value[value.Length - 1] == ' '))
            return new XAttribute(XNamespace.Xml + "space", "preserve");
        else
            return null;
    }
    private class RunSplits
    {
        public XElement RunElement { get; set; }
        public int RunLength { get; set; }
        public int RunLocation { get; set; }
    }

    private static object RunTransform(XElement element,
        int[] positions, IEnumerable<RunSplits> runSplits)
    {
        // split runs that have child text elements
        if (element.Name == W.r && element.Descendants(W.t).Any())
        {
            // get text of run
            string text = element
                .Descendants(W.t)
                .Select(t => (string)t).StringConcatenate();
            // find run in runSplits
            RunSplits rs = runSplits.First(r => r.RunElement == element);
            // find list of splits in this run
            var splitsInThisRun = positions
                .Where(p => p >= rs.RunLocation && p < rs.RunLocation + rs.RunLength);
            // adjust splits so that split locations are relative to this run instead of
            // relative to the beginning of the paragraph
            var splitsIntext = splitsInThisRun
                .Select(p => p - rs.RunLocation)
                .ToArray();
            // project collection of strings that will be in the new, split runs
            var splitText = splitsIntext
                .Select((p, i) =>
                    i != splitsIntext.Length - 1 ?
                        text.Substring(p, splitsIntext[i + 1] - p) :
                        text.Substring(p)
                );
            // project collection of runs that will replace the original run
            return splitText.Select(r =>
                new XElement(W.r,
                    rs.RunElement.Elements().Where(e => e.Name != W.t),
                    new XElement(W.t,
                        XmlSpacePreserved(r),
                        r)));
        }
        // clone elements other than runs
        // must be recursive to handle custom XML markup and content controls
        return new XElement(element.Name,
            element.Attributes(),
            element.Nodes().Select(n =>
            {
                XElement e = n as XElement;
                if (e != null)
                    return RunTransform(e, positions, runSplits);
                return n;
            })
        );
    }
    public static XElement SplitRunsInParagraph(XElement p, int[] positions)
    {
        // find the runs that don't have w:del or w:moveFrom as parent elements
        var runElements = p
            .Descendants(W.r)
            .Where(e => e.Parent.Name != W.del && e.Parent.Name != W.moveFrom &&
                e.Descendants(W.t).Any());
        // calculate the run length of each run
        var runs = runElements
            .Select(r => new {
                RunElement = r,
                RunLength = GetRunLength(r)
            });
        // calculate the location of each split
        var runSplits = runs
            .Select(r => new RunSplits {
                RunElement = r.RunElement,
                RunLength = r.RunLength,
                RunLocation = runs
                    .TakeWhile(a => a.RunElement != r.RunElement)
                    .Select(z => z.RunLength)
                    .Sum()
            });
        // the positions argument contains a list of locations where splits will be added
        // to the paragraph. In addition, runs may already be split at various places, and
        // we want those splits to remain, so we need to create the complete list of
        // locations where we want run splits.
        // create ordered union of desired splits and existing splits
        int[] allSplits = runSplits
            .Select(rs => rs.RunLocation)
            .Concat(positions)
            .OrderBy(s => s)
            .Distinct()
            .ToArray();
        // transform the paragraph to a new paragraph with new splits in runs
        return new XElement(W.p,
            p.Elements().Select(e => RunTransform(e, allSplits, runSplits))
        );
    }

    static void Main(string[] args)
    {
        using (WordprocessingDocument doc1 =
            WordprocessingDocument.Open("Test.docx", true))
        {
            XDocument doc = doc1.MainDocumentPart.GetXDocument();
            XElement p = doc.Root.Element(W.body).Element(W.p);
            //XElement newPara = SplitRunsInParagraph(p, new[] { 12, 15 });
            XElement newPara = SplitRunsInParagraph(p, new[] { 10 });
            Console.WriteLine(newPara);
        }
    }
}
Comments
Anonymous
September 11, 2009
How to insert a comment in a paragraph after finding a specific text.
Anonymous
September 11, 2009
Hi Syed, This would be a process of splitting nodes at the point where you want to attach the comment. In other words, if you want the comment attachment point to start at some point in the paragraph, and end at another point, runs must be split at those points. This would make a good blog post - I'll add it to my list. -Eric
Anonymous
February 16, 2010
Hi Eric, Is it possible to split runs such that each word comes in a different run(when there is no difference in style within the word). Thanks, Sandeep
Anonymous
February 17, 2010
Hi Sandeep, It certainly is possible. You can use the Open XML SDK to process a document and arbitrarily split runs, even if there is no formatting differences. There is no functionality to do this automatically in the SDK - you have to manipulate the markup directly. Try it out - use the Microsoft Visual Studio Tools for the Office System to open a DOCX file, find a run, copy and paste it, and modify the contents of the two runs appropriately, then save and open it using Word. The document will look the same. Note that there is no guarantee that Word will not combine the runs, although I believe that it doesn't do so in Word 2007. -Eric
Anonymous
March 19, 2014
The comment has been removed
Anonymous
March 19, 2014
Hi Paul, Find the w:sdtContent element, delete all of its children elements, then new up a new paragraph as a child of the w:sdtContent element. You can add paragraph properties (w:pPr) as a child of the paragraph (w:p) element, and you can add run properties (w:rPr) as a child of the run that you insert. If you like, before deleting the children of the w:sdtContent element, you can grab the paragraph and run properties out of the existing content, and then add those properties back into the new content that you are generating. Cheers, Eric