Writing LINQ to XML Queries using the OpenXmlDocument Class

Next, we can write a set of queries to extract useful information from the document.

There are a few points to note about this code:

· This code will correctly assemble the text for each paragraph, even if Track Changes has been turned on, and the paragraph contains inserted (or deleted) text.

· To see a large number of examples of these types of queries, see Querying XML Trees. In particular, it may be helpful to work through the Pure Functional Transformations of XML.

· This code uses casting to retrieve the element and attribute values. For more information, see How to: Retrieve the Value of an Element.

· This code uses the approach of projecting into an anonymous type. See How to: Project an Anonymous Type for more information.

· Because the style node for the paragraph is optional and might not exist, this code uses the technique detailed in How to: Filter on an Optional Element.

· To concatenate the text of all of the w:t nodes, this code uses an extension method named StringConcatenate, as described in Refactoring Using an Extension Method.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Examples.LtxOpenXml;
using System.Xml.Linq;
using System.IO.Packaging;

class Program
{
    // These are namespace identifiers, not web addresses. They must match
    // the strings stored in the package exactly (http, not https).
    public const string DocumentRelationshipType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
    public const string StylesRelationshipType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles";

    public const string WordProcessingMLNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static void Main(string[] args)
    {
        string filename = "OfficeXMLMarkupExplained_en.docx";
        //string filename = "Test.docx";

        // a good convention to use is to name the XNamespace
        // variable with the same name as the namespace prefix,
        // and to name XName variables with the local name of the element
        XNamespace w = WordProcessingMLNamespace;
        XName r = w + "r";
        XName ins = w + "ins";

        using (OpenXmlDocument doc = new OpenXmlDocument(filename))
        {
            Relationship documentRelationship =
                (
                    from dr in doc.Relationships
                    where dr.RelationshipType == DocumentRelationshipType
                    select dr
                ).FirstOrDefault();

            Relationship stylesRelationship =
                (
                    from sr in documentRelationship.Relationships
                    where sr.RelationshipType == StylesRelationshipType
                    select sr
                ).FirstOrDefault();

            string defaultStyle = (string)(
                from style in stylesRelationship
                    .XDocument.Root.Elements(w + "style")
                where (string)style.Attribute(w + "type") == "paragraph" &&
                    (string)style.Attribute(w + "default") == "1"
                select style
            ).First().Attribute(w + "styleId");

            XDocument xDoc = documentRelationship.XDocument;

            // Find all paragraphs in the document.
            var paragraphs =
                from p in xDoc
                    .Root
                    .Element(w + "body")
                    .Descendants(w + "p")
                let styleNode = p
                    .Elements(w + "pPr")
                    .Elements(w + "pStyle")
                    .FirstOrDefault()
                select new
                {
                    ParagraphElement = p,
                    StyleName = styleNode != null ?
                        (string)styleNode.Attribute(w + "val") :
                        defaultStyle,
                    // in the following query, we need to select both
                    // the r and ins elements in order to assemble the text
                    // properly for paragraphs that have tracked changes.
                    Text = p
                        .Elements()
                        .Where(z => z.Name == r || z.Name == ins)
                        .Descendants(w + "t")
                        .StringConcatenate(element => (string)element)
                };

            Console.WriteLine("DefaultStyle: {0}", defaultStyle);

            foreach (var p in paragraphs)
            {
                Console.WriteLine("Style: {0} Text: >{1}<",
                    p.StyleName.PadRight(16), p.Text);
            }
        }
    }
}

public static class LocalExtensions
{
    // Applies func to each item and appends the results into one string.
    public static string StringConcatenate<T>(this IEnumerable<T> source,
        Func<T, string> func)
    {
        StringBuilder sb = new StringBuilder();
        foreach (T item in source)
            sb.Append(func(item));
        return sb.ToString();
    }
}
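
If you want to experiment with the paragraph query without a .docx file or the OpenXmlDocument class, the following is a minimal sketch that runs the same kind of query against a small in-memory fragment. The names ParagraphTextSketch, SampleBody, and StylesAndText are made up for this illustration, and the sample markup is hypothetical. It also assumes a newer compiler than the listing above: it returns value tuples (C# 7 or later) instead of an anonymous type so that the result can leave the method, and it uses string.Concat (.NET 4 or later) in place of the StringConcatenate extension method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ParagraphTextSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample markup: a styled paragraph containing a normal run,
    // an inserted run, and a deleted run, followed by a paragraph with no w:pPr.
    public static XElement SampleBody()
    {
        return new XElement(w + "body",
            new XElement(w + "p",
                new XElement(w + "pPr",
                    new XElement(w + "pStyle",
                        new XAttribute(w + "val", "Heading1"))),
                new XElement(w + "r", new XElement(w + "t", "Hello")),
                new XElement(w + "ins",
                    new XElement(w + "r", new XElement(w + "t", ", world"))),
                new XElement(w + "del",
                    new XElement(w + "r", new XElement(w + "delText", " (old)")))),
            new XElement(w + "p",
                new XElement(w + "r", new XElement(w + "t", "Plain"))));
    }

    // Returns one (Style, Text) pair per paragraph. A paragraph without a
    // w:pStyle element falls back to the supplied default style name.
    public static List<(string Style, string Text)> StylesAndText(
        XElement body, string defaultStyle)
    {
        var query =
            from p in body.Descendants(w + "p")
            let styleNode = p
                .Elements(w + "pPr")
                .Elements(w + "pStyle")
                .FirstOrDefault()
            select (
                Style: styleNode != null
                    ? (string)styleNode.Attribute(w + "val")
                    : defaultStyle,
                // Only w:r and w:ins children contribute text, so the
                // content of the w:del element is left out.
                Text: string.Concat(p
                    .Elements()
                    .Where(z => z.Name == w + "r" || z.Name == w + "ins")
                    .Descendants(w + "t")
                    .Select(t => (string)t)));
        return query.ToList();
    }
}
```

Calling ParagraphTextSketch.StylesAndText(ParagraphTextSketch.SampleBody(), "Normal") should give two entries: the first with style Heading1 and text "Hello, world", the second with the fallback style Normal and text "Plain".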

 

Don’t be discouraged if all of this code doesn’t make sense at first glance. It takes some study. However, this functional style of query construction is very powerful. Once you become familiar with this approach, you can write such code very easily. It tends to be shorter than the equivalent imperative code. In my experience, it works correctly more often, with less debugging. I personally believe that over the next few years, with the introduction of functional programming capabilities in both C# and VB, this style will become more and more familiar. It will eventually become absolutely essential for .NET developers to be familiar with this style of coding.
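
As a rough illustration of that comparison, here is a sketch that assembles the text of a single paragraph in both styles. The class and method names (StyleComparisonSketch, TextImperative, TextFunctional) are invented for this example, and the functional version uses the standard Aggregate operator with a StringBuilder rather than the StringConcatenate extension method, so that the block stands on its own.

```csharp
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class StyleComparisonSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Imperative style: nested loops, an explicit filter, and a mutable builder.
    public static string TextImperative(XElement paragraph)
    {
        StringBuilder sb = new StringBuilder();
        foreach (XElement child in paragraph.Elements())
        {
            if (child.Name != w + "r" && child.Name != w + "ins")
                continue;
            foreach (XElement t in child.Descendants(w + "t"))
                sb.Append(t.Value);
        }
        return sb.ToString();
    }

    // Functional style: one expression that filters, flattens, and folds.
    public static string TextFunctional(XElement paragraph)
    {
        return paragraph
            .Elements()
            .Where(z => z.Name == w + "r" || z.Name == w + "ins")
            .Descendants(w + "t")
            .Aggregate(
                new StringBuilder(),
                (sb, t) => sb.Append((string)t),
                sb => sb.ToString());
    }
}
```

Both methods should return the same string for any w:p element. The difference is that the functional version reads as a description of the result, with no loop state to keep track of.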

In the near future, I’m going to augment the above code, and create a new class, WordprocessingML, in the same namespace, which will derive from the OpenXmlDocument class. This class will contain functions and properties for some of the most common operations that you want to do on a WordprocessingML document. Following that, I’ll create a SpreadsheetML class that also derives from the OpenXmlDocument class.
