Jaa


Retrieving the Paragraphs - VB

[Table of Contents] [Next Topic]

Our first goal is to retrieve all paragraphs in the document, along with the style of the paragraph.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCTo review, all paragraphs are children of the "body" element, and have a tag of "w:p".  If the style of the paragraph is other than the default style, then there will be a child element, "w:pPr".   "w:Pr" has a child element, "w:pStyle".  The style name is an attribute of the "w:pStyle" element named "w:val".  The XML looks something like this.  Note that there may or may not be "w:pPr" and "w:pStyle" elements.

<w:body>
<w:p>
<w:pPr>
<w:pStylew:val="Heading1"/>
</w:pPr>
<w:r>
<w:txml:space="preserve">Parsing </w:t>
</w:r>
<w:r>
<w:t>WordprocessingML</w:t>
</w:r>
<w:r>
<w:txml:space="preserve"> with LINQ to XML</w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t>The following example prints to the console.</w:t>
</w:r>
</w:p>

To make it easy to see the nodes that we find, I made an extension method for XElement that prints the path to the element from the root element.  That extension method is:

<System.Runtime.CompilerServices.Extension()> _
Public Function GetPath(ByVal el As XElement) As String
Return el _
.AncestorsAndSelf _
.InDocumentOrder _
.Aggregate("", Function(seed, i) seed & "/" & i.Name.LocalName)
End Function

Another approach to the problem of identifying nodes is an extension method that generates an XPath expression that specifically identifies any node on which you invoke the extension method.  This is a more descriptive approach, although the extension method is significantly longer than the one above.  (When you are done with this tutorial, review this post, and you can see that it is implemented in a pure fashion - that code is just a bunch of queries!)

Here is the program (in its entirety) that contains our first query.

Imports System.IO
Imports System.Xml
Imports DocumentFormat.OpenXml.Packaging

Module Module1
<System.Runtime.CompilerServices.Extension()> _
Public Function GetPath(ByVal el As XElement) As String
Return el _
.AncestorsAndSelf _
.InDocumentOrder _
.Aggregate("", Function(seed, i) seed & "/" & i.Name.LocalName)
End Function

Public Function LoadXDocument(ByVal part As OpenXmlPart) _
As XDocument
Using streamReader As StreamReader = New StreamReader(part.GetStream())
Using xmlReader As XmlReader = xmlReader.Create(streamReader)
Return XDocument.Load(xmlReader)
End Using
End Using
End Function

Sub Main()
Dim w As XNamespace = _
"https://schemas.openxmlformats.org/wordprocessingml/2006/main"
Dim filename As String = "SampleDoc.docx"
Using wordDoc As WordprocessingDocument = _
WordprocessingDocument.Open(filename, True)
Dim mainPart As MainDocumentPart = _
wordDoc.MainDocumentPart
Dim mainPartDoc As XDocument = LoadXDocument(mainPart)
Dim paragraphs = _
mainPartDoc.Root _
.Element(w + "body") _
.Descendants(w + "p") _
.Select(Function(p) _
New With { _
.ParagraphNode = p, _
.Style = CStr(p.Elements(w + "pPr") _
.Elements(w + "pStyle") _
.Attributes(w + "val") _
.FirstOrDefault()) _
} _
)

For Each p In paragraphs
Dim s As String
If Not p.Style Is Nothing Then
s = p.Style.PadRight(12)
Else
s = "".PadRight(12)
End If
Console.WriteLine("{0} {1}", s, _
p.ParagraphNode.GetPath())
Next
End Using
End Sub
End Module

While the query itself is written in a declarative style, I have no problem with using a plain old foreach statement to iterate through the results and print them to the console.  The important thing to remember here is that we want to compose our queries in a functional style, but when it comes time to print to the console, we are outside of our query, and using imperative code will not affect the composability of the code in the query.

We declared one local variable, and treated it as immutable, even though the language doesn't support it.

We can follow the flow of types through this query.

The mainPartDoc variable, of course, is an XDocument object.  The type of the Root property is XElement, and contains the root element of the document.  When we use the Element axis:

mainPartDoc.Root _
.Element(w + "body")

The result is also is an XElement object.

When we "dot" into the Descendants axis:

mainPartDoc.Root _
.Element(w + "body") _
.Descendants(w + "p")

Then the result is IEnumerable(Of XElement).

When we "dot" into the Select operator, the lambda creates a new anonymous type with two members, ParagraphNode, which is an XElement object, and Style, which is a string:

Dim paragraphs = _
mainPartDoc.Root _
.Element(w + "body") _
.Descendants(w + "p") _
.Select(Function(p) _
New With { _
.ParagraphNode = p, _
.Style = CStr(p.Elements(w + "pPr") _
.Elements(w + "pStyle") _
.Attributes(w + "val") _
.FirstOrDefault()) _
} _
)

The code to determine the style deserves a little explanation. This expression:

CStr(p.Elements(w + "pPr") _
.Elements(w + "pStyle") _
.Attributes(w + "val") _
.FirstOrDefault())

is an idiom that we can use whenever an element or attribute may or may not exist.  The result of the expression is the value of the element or attribute if it exists.  The expression evaluates to null if the element or attribute is missing.

The way that this works is that we first call p.Elements(w + "pPr") on an XElement object.  This returns a collection of elements, with type IEnumerable(Of XElement).  Then, when we dot into Elements again:

p.Elements(w + "pPr") _
.Elements(w + "pStyle")

it calls the Elements extension method that takes a collection of elements, and returns all child elements of each element in the source collection.  In our example, the source collection will have only one element in it, unless the element didn't exist, in which case the source collection will be an empty collection.  The extension method is perfectly happy to receive either a collection with one element in it, or an empty collection.  If the source is an empty collection, the extension method returns an empty collection also.

We then dot into the Attributes extension method, which operates in a similar way.  It returns all attributes of each element in its source collection.  Again, perfectly happy to receive an empty collection, or a collection with one element in it.

We then dot into the FirstOrDefault extension method which returns the first element in the collection, or if the collection is empty, it returns the default value for the type.  XElement is a reference type, and the default value for reference types is null.  Therefore, FirstOrDefault either returns the first element in the source collection, or null.

We then add a cast to string at the beginning of the expression:

CStr(p.Elements(w + "pPr") _
.Elements(w + "pStyle") _
.Attributes(w + "val") _
.FirstOrDefault())

The cast to string will cast an XAttribute to string.  The definition of the cast explicit conversion in LINQ to XML is that if the value being cast is null, the explicit conversion returns null.  See this topic for a more detailed explanation of casting using LINQ to XML.

So by using this idiom, we can write code that if the w:pPr/w:pStyle/@w:val attribute exists, we get the value of it.  If the w:pPr/w:pStyle/@w:val attribute does not exist, we get null.  However, the code will not throw a null reference exception.

When we run this example, we see something like this:

Heading1 document/body/p/
document/body/p/
Code document/body/p/
Code document/body/p/
Code document/body/p/
Code document/body/p/
Code document/body/p/
Code document/body/p/
Code document/body/p/
Code document/body/p/
document/body/p/
Code document/body/p/

This is what we expected.

We can also write this query using a query expression, as follows:

Dim paragraphs = _
From p In mainPartDoc.Root _
.Element(w + "body") _
.Descendants(w + "p") _
Select New With { _
.ParagraphNode = p, _
.Style = CStr(p.Elements(w + "pPr") _
.Elements(w + "pStyle") _
.Attributes(w + "val") _
.FirstOrDefault()) _
}

This gets rid of the lambda expression in the Select extension method.  However, just have to say, after you get used to them, lambda expressions become very natural and clear.

When writing using a query expression, we can take advantage of the let clause to further clarify our query.  By using the let clause, the projection in the select clause becomes clearer:

Dim paragraphs = _
From p In mainPartDoc.Root _
.Element(w + "body") _
.Descendants(w + "p") _
Let style = CStr(p.Elements(w + "pPr") _
.Elements(w + "pStyle") _
.Attributes(w + "val") _
.FirstOrDefault()) _
Select New With { _
.ParagraphNode = p, _
.Style = style _
}

Sometimes when using the method syntax for a complicated query, you dot into method after method after method, as you can see in the example in the previous topic.  If you don’t break these lines up, they can get very long, and a little hard to read.  The approach that you see here where I line up the dots vertically works well for me.  This is just a matter of personal preference.

[Table of Contents] [Next Topic] [Blog Map]