Retrieving the Paragraphs

Our first goal is to retrieve all paragraphs in the document, along with the style of the paragraph.

To review, all paragraphs are children of the "body" element, and have a tag of "w:p".  If the style of the paragraph is other than the default style, then there will be a child element, "w:pPr".  "w:pPr" has a child element, "w:pStyle".  The style name is stored in the "w:val" attribute of the "w:pStyle" element.  The XML looks something like this.  Note that there may or may not be "w:pPr" and "w:pStyle" elements.

<w:body>
  <w:p>
    <w:pPr>
      <w:pStyle w:val="Heading1"/>
    </w:pPr>
    <w:r>
      <w:t xml:space="preserve">Parsing </w:t>
    </w:r>
    <w:r>
      <w:t>WordprocessingML</w:t>
    </w:r>
    <w:r>
      <w:t xml:space="preserve"> with LINQ to XML</w:t>
    </w:r>
  </w:p>
  <w:p>
    <w:r>
      <w:t>The following example prints to the console.</w:t>
    </w:r>
  </w:p>
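
To make the structure concrete, here is a minimal sketch that walks the w:p / w:pPr / w:pStyle / @w:val chain one step at a time, checking for null at each level. The class name ParagraphStyleSketch, the helpers StyleOf and SampleBody, and the inline XML are assumptions made for illustration; they are not part of the attached program.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ParagraphStyleSketch
{
    // WordprocessingML main namespace, conventionally bound to the "w" prefix.
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the style name of one w:p element,
    // or null when w:pPr, w:pStyle, or w:val is absent.
    public static string StyleOf(XElement p)
    {
        XElement pPr = p.Element(w + "pPr");
        if (pPr == null) return null;
        XElement pStyle = pPr.Element(w + "pStyle");
        if (pStyle == null) return null;
        XAttribute val = pStyle.Attribute(w + "val");
        return val == null ? null : val.Value;
    }

    // Illustrative fragment modeled on the XML shown above.
    public static XElement SampleBody()
    {
        return XElement.Parse(
            @"<w:body xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
                <w:p>
                  <w:pPr><w:pStyle w:val='Heading1'/></w:pPr>
                  <w:r><w:t>Parsing WordprocessingML</w:t></w:r>
                </w:p>
                <w:p>
                  <w:r><w:t>The following example prints to the console.</w:t></w:r>
                </w:p>
              </w:body>");
    }

    public static void Main()
    {
        foreach (XElement p in SampleBody().Elements(w + "p"))
            Console.WriteLine(StyleOf(p) ?? "(default style)");
    }
}
```

With the sample fragment, this prints Heading1 for the first paragraph and (default style) for the second. The explicit null checks are verbose, which is part of the motivation for the more compact idiom used in the query later in this topic.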

To make it easy to see the nodes that we find, I made an extension method for XElement that prints the path to the element from the root element.  That extension method is:

public static class LocalExtensions
{
    public static string GetPath(this XElement el)
    {
        return
            el
            .AncestorsAndSelf()
            .Aggregate("", (seed, i) => i.Name.LocalName + "/" + seed);
    }
}
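
As a quick illustration of what GetPath produces, the following sketch builds a tiny tree by hand and asks for the path of its innermost element. The GetPath method is repeated so that the sketch compiles on its own; the class names PathExtensions and GetPathSketch and the sample tree are assumptions for illustration only.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class PathExtensions
{
    // Same technique as GetPath above: AncestorsAndSelf yields the element
    // first and the root last, so each name is prepended to the path so far.
    public static string GetPath(this XElement el)
    {
        return el.AncestorsAndSelf()
                 .Aggregate("", (seed, i) => i.Name.LocalName + "/" + seed);
    }
}

public static class GetPathSketch
{
    // Builds document/body/p/r/t and returns the path of the t element.
    public static string PathOfFirstText()
    {
        XElement root =
            new XElement("document",
                new XElement("body",
                    new XElement("p",
                        new XElement("r",
                            new XElement("t", "Hello")))));
        return root.Descendants("t").First().GetPath();
    }

    public static void Main()
    {
        Console.WriteLine(PathOfFirstText());   // prints document/body/p/r/t/
    }
}
```

Note that the path uses local names only, so elements in different namespaces with the same local name produce the same path segment.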

Another approach to the problem of identifying nodes is an extension method that generates an XPath expression that specifically identifies any node on which you invoke the extension method.  This is a more descriptive approach, although the extension method is significantly longer than the one above.  (When you are done with this tutorial, review this post, and you can see that it is implemented in a pure fashion - that code is just a bunch of queries!)

Here is the program (in its entirety) that contains our first query.  This code is attached to this page:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
    public static string GetPath(this XElement el)
    {
        return
            el
            .AncestorsAndSelf()
            .Aggregate("", (seed, i) => i.Name.LocalName + "/" + seed);
    }
}

class Program
{
    readonly static XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static XDocument LoadXDocument(OpenXmlPart part)
    {
        XDocument xdoc;
        using (StreamReader streamReader = new StreamReader(part.GetStream()))
            xdoc = XDocument.Load(XmlReader.Create(streamReader));
        return xdoc;
    }

    static void Main(string[] args)
    {
        const string filename = "SampleDoc.docx";

        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open(filename, true))
        {
            MainDocumentPart mainPart = wordDoc.MainDocumentPart;
            XDocument mainPartDoc = LoadXDocument(mainPart);

            var paragraphs =
                mainPartDoc.Root.Element(w + "body").Descendants(w + "p")
                .Select(p =>
                    new
                    {
                        ParagraphNode = p,
                        Style = (string)p.Elements(w + "pPr").Elements(w + "pStyle")
                            .Attributes(w + "val").FirstOrDefault()
                    }
                );

            foreach (var p in paragraphs)
                Console.WriteLine("{0} {1}",
                    p.Style != null ?
                        p.Style.PadRight(12) :
                        "".PadRight(12),
                    p.ParagraphNode.GetPath());
        }
    }
}

While the query itself is written in a declarative style, I have no problem with using a plain old foreach statement to iterate through the results and print them to the console.  The important thing to remember here is that we want to compose our queries in a functional style, but when it comes time to print to the console, we are outside of our query, and using imperative code will not affect the composability of the code in the query.

We declared one local variable and treated it as immutable, even though C# has no way to enforce immutability for a local variable.  Static members are different: we can tag them with readonly and let the language enforce immutability.

We can follow the flow of types through this query.

The mainPartDoc variable, of course, is an XDocument object.  The type of the Root property is XElement, and contains the root element of the document.  When we use the Element axis:

mainPartDoc.Root.Element(w + "body")

The result is also an XElement object.

When we "dot" into the Descendants axis:

mainPartDoc.Root.Element(w + "body").Descendants(w + "p")

Then the result is IEnumerable<XElement>.
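
One way to see this flow of types is to spell out each intermediate result with an explicit type instead of chaining the calls. The sketch below does that against a small in-memory document; the class name TypeFlowSketch, its helper methods, and the sample XML are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TypeFlowSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Each step of the chained expression, with its type written out.
    public static int CountParagraphs(XDocument doc)
    {
        XElement root = doc.Root;                          // Root property: XElement
        XElement body = root.Element(w + "body");          // Element axis: XElement
        IEnumerable<XElement> paragraphs =
            body.Descendants(w + "p");                     // Descendants axis: IEnumerable<XElement>
        return paragraphs.Count();
    }

    // Illustrative stand-in for the main document part.
    public static XDocument SampleDocument()
    {
        return XDocument.Parse(
            "<w:document xmlns:w='" + w.NamespaceName + "'>" +
            "<w:body>" +
            "<w:p><w:r><w:t>One</w:t></w:r></w:p>" +
            "<w:p><w:r><w:t>Two</w:t></w:r></w:p>" +
            "</w:body>" +
            "</w:document>");
    }

    public static void Main()
    {
        Console.WriteLine(CountParagraphs(SampleDocument()));   // prints 2
    }
}
```

If the document had no w:body element, the Element axis would return null and the call to Descendants would throw, so this sketch assumes a well-formed main document part.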

When we "dot" into the Select operator, the lambda creates a new anonymous type with two members, ParagraphNode, which is an XElement object, and Style, which is a string:

var paragraphs =
    mainPartDoc.Root.Element(w + "body").Descendants(w + "p")
    .Select(p =>
        new
        {
            ParagraphNode = p,
            Style = (string)p.Elements(w + "pPr").Elements(w + "pStyle")
                .Attributes(w + "val").FirstOrDefault()
        }
    );

The code to determine the style deserves a little explanation. This expression:

(string)p.Elements(w + "pPr").Elements(w + "pStyle")
    .Attributes(w + "val").FirstOrDefault()

is an idiom that we can use whenever an element or attribute may or may not exist.  The result of the expression is the value of the element or attribute if it exists.  The expression evaluates to null if the element or attribute is missing.

The way that this works is that we first call p.Elements(w + "pPr") on an XElement object.  This returns a collection of elements, with type IEnumerable<XElement>.  Then, when we dot into Elements again:

p.Elements(w + "pPr").Elements(w + "pStyle")

it calls the Elements extension method that takes a collection of elements, and returns all child elements of each element in the source collection.  In our example, the source collection will have only one element in it, unless the element didn't exist, in which case the source collection will be an empty collection.  The extension method is perfectly happy to receive either a collection with one element in it, or an empty collection.  If the source is an empty collection, the extension method returns an empty collection also.

We then dot into the Attributes extension method, which operates in a similar way.  It returns all attributes of each element in its source collection.  Again, perfectly happy to receive an empty collection, or a collection with one element in it.

We then dot into the FirstOrDefault extension method, which returns the first item in the collection or, if the collection is empty, the default value for the type.  At this point the collection contains XAttribute objects.  XAttribute is a reference type, and the default value for reference types is null.  Therefore, FirstOrDefault returns either the first attribute in the source collection or null.

We then add a cast to string at the beginning of the expression:

(string)p.Elements(w + "pPr").Elements(w + "pStyle")
    .Attributes(w + "val").FirstOrDefault()

The cast to string converts an XAttribute to a string.  LINQ to XML defines this explicit conversion so that if the value being cast is null, the conversion returns null.  See this topic for a more detailed explanation of casting using LINQ to XML.

So by using this idiom, we get the value of the w:pPr/w:pStyle/@w:val attribute if it exists, and null if it does not.  Either way, the code will not throw a null reference exception.
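
To see the difference this makes, the following sketch compares the idiom with a naive chain of Element(...).Attribute(...).Value calls on a paragraph that has no w:pPr element. The class OptionalAttributeSketch, its methods, and the hand-built paragraphs are assumptions for illustration; only the idiom itself comes from the query above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class OptionalAttributeSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // The idiom from the query: yields null when any link in the chain is missing.
    public static string StyleByIdiom(XElement p)
    {
        return (string)p.Elements(w + "pPr").Elements(w + "pStyle")
            .Attributes(w + "val").FirstOrDefault();
    }

    // Naive alternative: Element returns null for a missing w:pPr,
    // so the next call dereferences null and throws.
    public static string StyleByValue(XElement p)
    {
        return p.Element(w + "pPr").Element(w + "pStyle").Attribute(w + "val").Value;
    }

    public static bool NaiveThrows(XElement p)
    {
        try { StyleByValue(p); return false; }
        catch (NullReferenceException) { return true; }
    }

    // A paragraph with the style "Code".
    public static XElement StyledParagraph()
    {
        return new XElement(w + "p",
            new XElement(w + "pPr",
                new XElement(w + "pStyle", new XAttribute(w + "val", "Code"))),
            new XElement(w + "r", new XElement(w + "t", "x")));
    }

    // A paragraph with no w:pPr at all.
    public static XElement PlainParagraph()
    {
        return new XElement(w + "p",
            new XElement(w + "r", new XElement(w + "t", "y")));
    }

    public static void Main()
    {
        Console.WriteLine(StyleByIdiom(StyledParagraph()));         // prints Code
        Console.WriteLine(StyleByIdiom(PlainParagraph()) == null);  // prints True
        Console.WriteLine(NaiveThrows(PlainParagraph()));           // prints True
    }
}
```

Both versions return the same value when the attribute is present; they differ only in how they behave when part of the chain is absent.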

When we run this example, we see something like this:

Heading1     document/body/p/
             document/body/p/
Code         document/body/p/
Code         document/body/p/
Code         document/body/p/
Code         document/body/p/
Code         document/body/p/
Code         document/body/p/
Code         document/body/p/
Code         document/body/p/
             document/body/p/
Code         document/body/p/

This is what we expected.

We can also write this query using a query expression, as follows:

var paragraphs =
    from p in mainPartDoc.Root.Element(w + "body").Descendants(w + "p")
    select new
    {
        ParagraphNode = p,
        Style = (string)p.Elements(w + "pPr").Elements(w + "pStyle")
            .Attributes(w + "val").FirstOrDefault()
    };

This gets rid of the lambda expression in the Select extension method.  However, I just have to say that after you get used to them, lambda expressions become very natural and clear.

When writing using a query expression, we can take advantage of the let clause to further clarify our query.  By using the let clause, the projection in the select clause becomes clearer:

var paragraphs =
    from p in mainPartDoc.Root.Element(w + "body").Descendants(w + "p")
    let style = (string)p.Elements(w + "pPr").Elements(w + "pStyle")
        .Attributes(w + "val").FirstOrDefault()
    select new
    {
        ParagraphNode = p,
        Style = style,
    };
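
A further benefit of let, beyond readability, is that the value it names can be reused in later clauses without repeating the expression. The sketch below is one possible illustration: it uses the style in both a where clause and the select clause to keep only paragraphs with a non-default style. The class LetClauseSketch, its helpers, and the sample paragraphs are assumptions for illustration, and the filter is not part of the original program.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class LetClauseSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: builds a w:p, with w:pPr/w:pStyle only when a style is given.
    // A null content argument is ignored by the XElement constructor.
    static XElement Paragraph(string style)
    {
        return new XElement(w + "p",
            style == null ? null : new XElement(w + "pPr",
                new XElement(w + "pStyle", new XAttribute(w + "val", style))),
            new XElement(w + "r", new XElement(w + "t", "text")));
    }

    public static XElement SampleBody()
    {
        return new XElement(w + "body",
            Paragraph("Heading1"), Paragraph(null), Paragraph("Code"));
    }

    // The style expression appears once and is used in both where and select.
    public static string[] NonDefaultStyles(XElement body)
    {
        var query =
            from p in body.Descendants(w + "p")
            let style = (string)p.Elements(w + "pPr").Elements(w + "pStyle")
                .Attributes(w + "val").FirstOrDefault()
            where style != null
            select style;
        return query.ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", NonDefaultStyles(SampleBody())));
        // prints Heading1, Code
    }
}
```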

RetrievingTheParagraphs.cs

Comments

  • Anonymous
    August 16, 2007
    Talking about the Any() you say that "Due to lazy evaluation, this is not an expensive operation.". The problem I see, if I'm not wrong, is that the "list" is scanned (not necessarily completely) two times: once for the Any, once for the First(), possibly looking at the same items. What happens if you try to call the First(), and check if it's not null? Maybe you need a local variable, is this the reason because you decided to use the Any() construct?

  • Anonymous
    April 23, 2008
    Fabrizio, Here is the thing - when you use the Any operator, the query aborts just as soon as you get the first element in the query.  For all practical purposes, the sub-query that terminates in Any just follows a couple of links in a linked list. -Eric