Retrieving the Text of the Paragraphs

Our next goal is to retrieve the text of the paragraphs in the document. Text is stored in w:t (text) elements, which are contained in w:r (run) elements, which in turn are children of the w:p (paragraph) element. The text of a paragraph may be split across multiple w:t elements, so we have to concatenate the text of all of them.
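
To make that structure concrete, here is a minimal sketch that parses a hand-written paragraph and concatenates its text nodes. The sample XML, the class name RunTextSketch, and the helper GetText are made up for illustration, and string.Concat stands in for the StringConcatenate extension method used later in this topic.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class RunTextSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: concatenates the w:t text found in the runs
    // that are direct children of the given w:p element.
    public static string GetText(XElement paragraph)
    {
        return string.Concat(
            paragraph
                .Elements(w + "r")
                .Descendants(w + "t")
                .Select(t => (string)t));
    }

    static void Main()
    {
        // Hand-written sample: one sentence split across three runs.
        XElement para = XElement.Parse(
            @"<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
                <w:r><w:t xml:space='preserve'>Hello, </w:t></w:r>
                <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
                <w:r><w:t xml:space='preserve'> world</w:t></w:r>
              </w:p>");

        Console.WriteLine(GetText(para)); // Hello, bold world
    }
}
```

Formatting properties such as w:rPr contain no w:t elements, so they contribute nothing to the result.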

Even though we could modify our query to include the code to extract the text of each paragraph, for demonstration purposes we're going to approach the problem in a different way.  We're going to write a new query that uses our first query as its source.  Because of lazy evaluation, this is essentially as efficient as modifying the first query.  The approach creates more short-lived objects on the heap, but if it makes our code clearer, it is a good tradeoff.
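
The effect of lazy evaluation can be seen in a small sketch that has nothing to do with Open XML. The class name DeferredQuerySketch, the Calls counter, and the sample numbers are all hypothetical; the point is only that layering a second query on the first executes nothing until the result is enumerated, and that each source item is then projected once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredQuerySketch
{
    // Counts how many times the first query's projection runs.
    public static int Calls;

    public static IEnumerable<string> BuildQuery(IEnumerable<int> source)
    {
        // First query: nothing is executed here.
        var first = source.Select(n => { Calls++; return n * 2; });

        // Second query layered on the first: still nothing is executed.
        return first.Select(n => "item " + n);
    }

    static void Main()
    {
        var query = BuildQuery(new[] { 1, 2, 3 });
        Console.WriteLine(Calls);   // 0: building the queries ran no projections

        foreach (var s in query)
            Console.WriteLine(s);   // item 2, item 4, item 6

        Console.WriteLine(Calls);   // 3: each source item was projected once
    }
}
```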

We can add the query to our program, as follows:

string defaultStyle =
    (string)styleDoc.Root
        .Elements(w + "style")
        .Where(style =>
            (string)style.Attribute(w + "type") == "paragraph" &&
            (string)style.Attribute(w + "default") == "1")
        .First()
        .Attribute(w + "styleId");

var paragraphs =
    mainPartDoc.Root
        .Element(w + "body")
        .Descendants(w + "p")
        .Select(p =>
        {
            string style = GetParagraphStyle(p);
            string styleName = style == null ? defaultStyle : style;
            return new
            {
                ParagraphNode = p,
                Style = styleName
            };
        }
    );

var paragraphsWithText =
    paragraphs.Select(p =>
        new
        {
            ParagraphNode = p.ParagraphNode,
            Style = p.Style,
            Text = p.ParagraphNode
                .Elements(w + "r")
                .Descendants(w + "t")
                .StringConcatenate(s => (string)s)
        }
    );

The above code uses the StringConcatenate aggregate operator that we showed in the aggregation topic.

One of the features of Word is "Track Changes": when a user turns it on, the document records all changes to its text in the Open XML markup.  The above code only works if there are no tracked changes.  However, it is easy to modify our code so that we retrieve the correct text for each paragraph (the text as it reads with all changes accepted) regardless of whether there are tracked changes.  Inserted runs are wrapped in a w:ins element, and deleted runs are wrapped in a w:del element, where their text is stored in w:delText rather than w:t elements.  To retrieve the text, we need to find all of the children of the w:p element that have the name w:r or w:ins, and ignore all other elements.  We can modify the last of the three above queries, as follows:

var paragraphsWithText =
    paragraphs.Select(p =>
        new
        {
            ParagraphNode = p.ParagraphNode,
            Style = p.Style,
            Text = p.ParagraphNode
                .Elements()
                .Where(z => z.Name == w + "r" || z.Name == w + "ins")
                .Descendants(w + "t")
                .StringConcatenate(s => (string)s)
        }
    );
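
As a rough check of that filter, the sketch below runs the same Elements/Where/Descendants chain over a hand-written paragraph that contains one tracked deletion and one tracked insertion. The sample XML (including the revision ids, author, and dates), the class name TrackedChangesSketch, and the helper GetText are assumptions made for illustration; string.Concat again stands in for StringConcatenate.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class TrackedChangesSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: keeps only w:r and w:ins children of the
    // paragraph, then concatenates the w:t text found beneath them.
    public static string GetText(XElement paragraph)
    {
        return string.Concat(
            paragraph
                .Elements()
                .Where(z => z.Name == w + "r" || z.Name == w + "ins")
                .Descendants(w + "t")
                .Select(t => (string)t));
    }

    static void Main()
    {
        // Hand-written sample: "slow " was deleted and "quick " was inserted.
        XElement para = XElement.Parse(
            @"<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
                <w:r><w:t xml:space='preserve'>The </w:t></w:r>
                <w:del w:id='1' w:author='Someone' w:date='2007-11-08T00:00:00Z'>
                  <w:r><w:delText xml:space='preserve'>slow </w:delText></w:r>
                </w:del>
                <w:ins w:id='2' w:author='Someone' w:date='2007-11-08T00:00:00Z'>
                  <w:r><w:t xml:space='preserve'>quick </w:t></w:r>
                </w:ins>
                <w:r><w:t>fox</w:t></w:r>
              </w:p>");

        Console.WriteLine(GetText(para)); // The quick fox
    }
}
```

The w:del element is dropped by the Where clause, so the deleted text never reaches the result.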

This approach introduces a small issue.  In LINQ to XML, all names are atomized; that is, if two XName objects are in the same namespace and have the same local name, they are the same instance.  It takes a little work for LINQ to XML to atomize a name each time an expression such as w + "r" is evaluated, and the lambda expression passed to Where is evaluated for every child element of every paragraph.  In certain scenarios, atomization can account for a significant percentage of processor time, but it is easy to minimize.  This post describes atomization in more detail.  If we pre-atomize our names, our query will execute faster, at least in theory.  In practice, I can't say that I've ever been in a situation where this made a difference, but when processing huge files, it might.  In any case, when I have code like this, I generally pre-atomize my XName objects:

XName r = w + "r";
XName ins = w + "ins";

var paragraphsWithText =
    paragraphs.Select(p =>
        new
        {
            ParagraphNode = p.ParagraphNode,
            Style = p.Style,
            Text = p.ParagraphNode
                .Elements()
                .Where(z => z.Name == r || z.Name == ins)
                .Descendants(w + "t")
                .StringConcatenate(s => (string)s)
        }
    );
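
Atomization itself is easy to observe. The following sketch (the class name AtomizationSketch and its methods are hypothetical) builds the same name twice and checks that both variables refer to the same XName instance, which is why comparing against a pre-atomized name is cheap.

```csharp
using System;
using System.Xml.Linq;

static class AtomizationSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds the same name twice and reports whether both results
    // are one and the same object.
    public static bool SameInstance()
    {
        XName first = w + "r";
        XName second = w + "r";
        return object.ReferenceEquals(first, second);
    }

    // Reports whether a name parsed from an element is the same
    // instance as a name built in code.
    public static bool ParsedNameIsSameInstance()
    {
        XElement run = XElement.Parse(
            "<w:r xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'/>");
        XName r = w + "r";
        return object.ReferenceEquals(run.Name, r);
    }

    static void Main()
    {
        Console.WriteLine(SameInstance());             // True
        Console.WriteLine(ParsedNameIsSameInstance()); // True
    }
}
```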

The complete program now looks like this.  The code is attached to this page.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
    public static string GetPath(this XElement el)
    {
        return
            el
            .AncestorsAndSelf()
            .Aggregate("", (seed, i) => i.Name.LocalName + "/" + seed);
    }

    public static string StringConcatenate(
        this IEnumerable<string> source)
    {
        return source.Aggregate(
            new StringBuilder(),
            (s, i) => s.Append(i),
            s => s.ToString());
    }

    public static string StringConcatenate<T>(
        this IEnumerable<T> source,
        Func<T, string> projectionFunc)
    {
        return source.Aggregate(
            new StringBuilder(),
            (s, i) => s.Append(projectionFunc(i)),
            s => s.ToString());
    }
}

class Program
{
    readonly static XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static XDocument LoadXDocument(OpenXmlPart part)
    {
        XDocument xdoc;
        using (StreamReader streamReader = new StreamReader(part.GetStream()))
            xdoc = XDocument.Load(XmlReader.Create(streamReader));
        return xdoc;
    }

    public static string GetParagraphStyle(XElement para)
    {
        return (string)para.Elements(w + "pPr")
                           .Elements(w + "pStyle")
                           .Attributes(w + "val")
                           .FirstOrDefault();
    }

    static void Main(string[] args)
    {
        const string filename = "SampleDoc.docx";

        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open(filename, true))
        {
            MainDocumentPart mainPart = wordDoc.MainDocumentPart;
            StyleDefinitionsPart stylePart = mainPart.StyleDefinitionsPart;
            XDocument mainPartDoc = LoadXDocument(mainPart);
            XDocument styleDoc = LoadXDocument(stylePart);

            string defaultStyle =
                (string)styleDoc.Root
                    .Elements(w + "style")
                    .Where(style =>
                        (string)style.Attribute(w + "type") == "paragraph" &&
                        (string)style.Attribute(w + "default") == "1")
                    .First()
                    .Attribute(w + "styleId");

            var paragraphs =
                mainPartDoc.Root
                    .Element(w + "body")
                    .Descendants(w + "p")
                    .Select(p =>
                    {
                        string style = GetParagraphStyle(p);
                        string styleName = style == null ? defaultStyle : style;
                        return new
                        {
                            ParagraphNode = p,
                            Style = styleName
                        };
                    }
                );

            XName r = w + "r";
            XName ins = w + "ins";

            var paragraphsWithText =
                paragraphs.Select(p =>
                    new
                    {
                        ParagraphNode = p.ParagraphNode,
                        Style = p.Style,
                        Text = p.ParagraphNode
                            .Elements()
                            .Where(z => z.Name == r || z.Name == ins)
                            .Descendants(w + "t")
                            .StringConcatenate(s => (string)s)
                    }
                );

            foreach (var p in paragraphsWithText)
                Console.WriteLine("{0} {1}",
                    p.Style != null ?
                        p.Style.PadRight(12) :
                        "".PadRight(12),
                    p.Text);
        }
    }
}

RetrievingTheTextOfTheParagraphs.cs

Comments

  • Anonymous
    October 14, 2009
    Change the p.Text to p.ParagraphNode.Value