Open XML SDK and LINQ to XML
In this post, I’m presenting some code that uses the Open XML SDK and LINQ to XML to query an Open XML document. I’ve posted on this before, but this version is cleaner and smaller. It also is an example of code written in the functional style.
This blog is inactive.
New blog: EricWhite.com/blog
The most common need is to retrieve the paragraphs of a word processing document. But I've also had the need to dig into a spreadsheet, and get the data out for a row, a set of rows, or all the cells in a column. You can think of the code presented here as a (very) small library on top of the Open XML SDK.
Note: The Open XML API Version 2.0 will contain additional functionality that enables easier processing of the XML contents of parts. However, it's not available yet. It will be the best way to query documents in the future, but for now, the approach presented in this post is interesting.
I have a set of goals for this code:
- The code should be as short and simple as possible. You should be able to understand it or change it with an absolute minimum of effort (provided that you are familiar with LINQ and LINQ to XML).
- For robustness and simplicity of code, it should be based on the Open XML SDK. Another previous version was based on System.IO.Packaging (approx 500 lines long, this one is much shorter). Annotations, newly added to the SDK, make the SDK easier to work with.
- It should be written in a pure functional style, using LINQ queries where appropriate.
- The queries should be lazy. This little library should not maintain any state outside of the XDocument XML trees corresponding to each part.
To accomplish these goals, I used the following approach:
Using Nominal Tuple Types as Part of the API
In my functional programming tutorial, I talked about the idea of using nominal types for tuples instead of anonymous types. I use this approach in this code. Using types that are defined using the automatic property syntax helps in shortening the code.
I use the approach that is in the spirit of a dynamic language - to shorten the code, make all properties public, and use automatic properties. This makes the properties read/write. However, all the code is written in a functional style - no objects are mutated after creation and initialization. So, for example, here is the definition of the class that encapsulates a comment for a paragraph:
public class Comment
{
public int Id { get; set; }
public string Text { get; set; }
public string Author { get; set; }
}
Annotating Each Part with its XDocument
This code uses the approach that each part in the package is annotated with its XDocument. There is a small extension method to get an XDocument for a part. If the annotation already exists, we avoid de-serializing, and just return the annotation.
public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr =
new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
Then, any time we have a part, we can call GetXDocument(), knowing that we aren’t causing any extra work for deserializing:
return (string)mainDocument
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(style =>
(string)style.Attribute(w + "type") ==
"paragraph" &&
(string)style.Attribute(w + "default") ==
"1")
.First()
.Attribute(w + "styleId");
An Extension Method on MainDocumentPart
To retrieve the paragraphs of a document, I implemented an extension method on the MainDocumentPart. The signature is:
public static IEnumerable<Paragraph> Paragraphs(
this MainDocumentPart mainDocument)
Using the Open XML SDK, you navigate to the MainDocumentPart. Once you have it, you can call the Paragraphs method:
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
{
foreach (var p in doc.MainDocumentPart.Paragraphs())
{
Console.WriteLine("Style: {0} Text: >{1}<",
p.StyleName.PadRight(16), p.Text);
foreach (var c in p.Comments())
Console.WriteLine(
" Comment Author:{0} Text:>{1}<",
c.Author, c.Text);
}
}
The Paragraphs method returns a collection of Paragraph objects. The Paragraph class contains some public properties, such as Text and Style. It also contains a method, Comments, which returns a collection of comments for each paragraph. (A paragraph, of course, can have more than one comment.) It also contains a public property ParagraphElement, which contains the <w:p> XElement for the paragraph. The code takes care of the housekeeping necessary to aggregate multiple <w:t> text nodes into a single string containing the text of the paragraph.
If you run this code on the small document that is attached to this post, you see:
Printing Paragraphs
===================
Style: Normal Text: >Paragraph 1<
Comment Author:Eric White (DPE) Text:>Comment 1<
Comment Author:Eric White (DPE) Text:>Comment 2<
Style: Normal Text: >Paragraph 2<
Style: Normal Text: >Paragraph 3<
Comment Author:Eric White (DPE) Text:>Comment 3<
Style: Normal Text: >Paragraph 4<
There is also an extension method on the MainDocumentPart to retrieve the default style for the document. Its signature:
public static string DefaultStyle(this MainDocumentPart mainDocument)
The Paragraphs method uses this extension method, but you are free to use it too.
The code to retrieve the paragraphs and comments for a Wordprocessing document is 88 lines long.
An Extension Method on WorksheetPart
To retrieve the rows and cells of a spreadsheet, I implemented an extension method on the WorksheetPart. The signature is:
public static IEnumerable<Row> Rows(
this WorksheetPart worksheetPart)
The Row class contains a method, Cells, which returns a collection of the cells for the row. Its signature:
public IEnumerable<Cell> Cells()
Using the Open XML SDK, you navigate to the WorksheetPart. Once you have it, you can iterate through the rows, and for each row, you can iterate through the cells in the row:
Console.WriteLine("Contents of Spreadsheet");
Console.WriteLine("=======================");
using (SpreadsheetDocument doc =
SpreadsheetDocument.Open(filename, false))
{
WorksheetPart worksheet =
(WorksheetPart)doc.WorkbookPart.GetPartById("rId1");
foreach (var row in worksheet.Rows())
{
Console.WriteLine(" RowId:{0}", row.RowId);
Console.WriteLine(" Spans:{0}", row.Spans);
foreach (var cell in row.Cells())
{
Console.WriteLine(" Column:{0}", cell.Column);
Console.WriteLine(" ColumnId:{0}", cell.ColumnId);
if (cell.Type != null)
Console.WriteLine(" Type:{0}", cell.Type);
if (cell.Value != null)
Console.WriteLine(" Value:{0}", cell.Value);
if (cell.Formula != null)
Console.WriteLine(" Formula:>{0}<", cell.Formula);
if (cell.SharedString != null)
Console.WriteLine(" SharedString:>{0}<", cell.SharedString);
}
}
Console.WriteLine();
}
When run on the small spreadsheet attached to this post, you see:
Contents of Spreadsheet
=======================
RowId:1
Spans:1:3
Column:A1
ColumnId:A
Value:1
Column:B1
ColumnId:B
Value:2
Column:C1
ColumnId:C
Value:3
RowId:2
Spans:1:3
Column:A2
ColumnId:A
Type:s
Value:0
SharedString:>a<
Column:B2
ColumnId:B
Type:s
Value:1
SharedString:>b<
Column:C2
ColumnId:C
Type:s
Value:2
SharedString:>c<
If you want to retrieve just the cells for a specific column, you can just tack on a Where method call to the Cells method call:
WorksheetPart worksheet =
(WorksheetPart)doc.WorkbookPart.GetPartById("rId1");
foreach (var row in worksheet.Rows())
{
Console.WriteLine(" RowId:{0}", row.RowId);
Console.WriteLine(" Spans:{0}", row.Spans);
foreach (var cell in row.Cells().Where(c => c.ColumnId == "B"))
{
Console.WriteLine(" Column:{0}", cell.Column);
Console.WriteLine(" ColumnId:{0}", cell.ColumnId);
if (cell.Type != null)
Console.WriteLine(" Type:{0}", cell.Type);
if (cell.Value != null)
Console.WriteLine(" Value:{0}", cell.Value);
if (cell.Formula != null)
Console.WriteLine(" Formula:>{0}<", cell.Formula);
if (cell.SharedString != null)
Console.WriteLine(" SharedString:>{0}<", cell.SharedString);
}
}
The code to get rows and cells for spreadsheets is 69 lines long. This meets my definition of “short”.
Comments
Anonymous
July 09, 2008
Eric White shows some interesting use of LINQ to XML to query an Open XML document. Something that isAnonymous
July 15, 2008
Eric White on Linq and Open XML. Eric White has been posting some great code samples lately, includingAnonymous
August 17, 2008
Ce post n’a pas voulu partir ni jeudi ni vendredi, le voici donc ! Des mise à jours à n’en plus finirAnonymous
November 14, 2008
(November 14, 2008 - I've updated my approach for querying Open XML documents using LINQ to XML. YouAnonymous
November 14, 2008
[Blog Map] Excel has a very cool feature where you can declare that a range of cells is a table. It isAnonymous
January 06, 2009
What about the body of the extension method Paragraphs? Is there some optimal way of writing that?Anonymous
January 06, 2009
Hi Sten, did you take a look at the Paragraphs extension method (attached to this blog post?) I think it is pretty optimal: public static IEnumerable<Paragraph> Paragraphs(this MainDocumentPart mainDocument)
{
XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
XName r = w + "r";
XName ins = w + "ins";
string defaultStyle = mainDocument.DefaultStyle();
return
from p in mainDocument.GetXDocument()
.Root.Element(w + "body").Descendants(w + "p")
let styleNode = p.Elements(w + "pPr").Elements(w + "pStyle")
.FirstOrDefault()
select new Paragraph(mainDocument)
{
ParagraphElement = p,
StyleName = styleNode != null ?
(string)styleNode.Attribute(w + "val") :
defaultStyle,
// in the following query, need to select both the r and ins elements
// to assemble the text properly for paragraphs that have tracked changes.
Text = p.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element)
};
}Anonymous
January 06, 2009
Thanks a bunch! I'm trying to use OOXML API 2 and its class structure instead of XDocument, but in either case the extraction of Text tip is very helpful.Anonymous
April 23, 2009
Can one easily write and alter the content as easily?Anonymous
April 27, 2009
Hi Pierre, The code that I've presented here doesn't write or alter content. It's not hard to modify documents, but it is more work than querying them. I'm working on some ideas around this - I'll blog them as soon as I have time. :) -EricAnonymous
September 15, 2009
Querying OpenXML spreadsheets works very well with your example, but how about accessing individual columns of a row by index? That is, instead of using "Where(c => c.ColumnId == "B")" the query would be something like "Where(c => c.Index == 2)". What would be the best way to achieve this? I could write a small function to convert 1 to A, 2 to B and so on, but that seems very inelegant; and inelegant solutions seldom create maintainable code.Anonymous
January 15, 2010
Interesting article. I am using wordprocessingml to fill xml templates created in word 2003. Is there a better solution available where user can create word templates in 2007 and i can insert html text in templates and save as docx using wordprocessingml?Anonymous
February 18, 2010
we know the filename.vbproj can we open it as XML doc and it contains all the controls used in project so this is the way to find your coontrols througt LINQ TO XML > Dim s As XDocument = XDocument.Load(filename.vbproj) Dim LsForms = s.Root.Nodes.ToList Dim LItemGroup As New List(Of XElement) For Each c In LsForms If TypeOf c Is XElement Then Dim cc = CType(c, XElement) if cc.Name.LocalName = "ItemGroup" Then Dim co = cc.Nodes.Where(Function(f) TypeOf f Is XElement AndAlso CType(f, XElement).Name.LocalName = "Compile").Select(Function(f) CType(f, XElement).FirstAttribute.Value).ToList ' your code here to test (co)... end if end if nextAnonymous
February 21, 2010
Hi Bilal, I have never looked at the XML that makes up a vbproj, so unfortunately don't have any insights for you. -EricAnonymous
January 18, 2012
How difficult would it be to search for a Range of cells in a spreadsheet. For example A2:F8 without searching the whole table and checking the "cell.Column" property? Possibly a TableRange() class.