Jaa


Open XML SDK and LINQ to XML

In this post, I’m presenting some code that uses the Open XML SDK and LINQ to XML to query an Open XML document. I’ve posted on this before, but this version is cleaner and smaller. It also is an example of code written in the functional style.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOC

 

The most common need is to retrieve the paragraphs of a word processing document. But I've also had the need to dig into a spreadsheet, and get the data out for a row, a set of rows, or all the cells in a column. You can think of the code presented here as a (very) small library on top of the Open XML SDK.

 

Note: The Open XML API Version 2.0 will contain additional functionality that enables easier processing of the XML contents of parts. However, it's not available yet. It will be the best way to query documents in the future, but for now, the approach presented in this post is interesting.

 

I have a set of goals for this code:

  • The code should be as short and simple as possible. You should be able to understand it or change it with an absolute minimum of effort (provided that you are familiar with LINQ and LINQ to XML).
  • For robustness and simplicity of code, it should be based on the Open XML SDK. Another previous version was based on System.IO.Packaging (approx 500 lines long, this one is much shorter). Annotations, newly added to the SDK, make the SDK easier to work with.
  • It should be written in a pure functional style, using LINQ queries where appropriate.
  • The queries should be lazy. This little library should not maintain any state outside of the XDocument XML trees corresponding to each part.

 

To accomplish these goals, I used the following approach:

 

Using Nominal Tuple Types as Part of the API 

 

In my functional programming tutorial, I talked about the idea of using nominal types for tuples instead of anonymous types. I use this approach in this code. Using types that are defined using the automatic property syntax helps in shortening the code.

 

I use the approach that is in the spirit of a dynamic language - to shorten the code, make all properties public, and use automatic properties. This makes the properties read/write. However, all the code is written in a functional style - no objects are mutated after creation and initialization. So, for example, here is the definition of the class that encapsulates a comment for a paragraph:

 

public class Comment

{

    public int Id { get; set; }

    public string Text { get; set; }

    public string Author { get; set; }

}

 

Annotating Each Part with its XDocument 

 

This code uses the approach that each part in the package is annotated with its XDocument. There is a small extension method to get an XDocument for a part. If the annotation already exists, we avoid de-serializing, and just return the annotation.

 

public static XDocument GetXDocument(this OpenXmlPart part)

{

    XDocument xdoc = part.Annotation<XDocument>();

    if (xdoc != null)

        return xdoc;

    using (StreamReader sr =

           new StreamReader(part.GetStream()))

        using (XmlReader xr = XmlReader.Create(sr))

            xdoc = XDocument.Load(xr);

    part.AddAnnotation(xdoc);

    return xdoc;

}

 

Then, any time we have a part, we can call GetXDocument(), knowing that we aren’t causing any extra work for deserializing:

 

return (string)mainDocument

        .StyleDefinitionsPart

        .GetXDocument()

        .Root

        .Elements(w + "style")

        .Where(style =>

          (string)style.Attribute(w + "type") ==

            "paragraph" &&

          (string)style.Attribute(w + "default") ==

            "1")

        .First()

        .Attribute(w + "styleId");

 

An Extension Method on MainDocumentPart 

 

To retrieve the paragraphs of a document, I implemented an extension method on the MainDocumentPart. The signature is:

 

public static IEnumerable<Paragraph> Paragraphs(

  this MainDocumentPart mainDocument)

 

Using the Open XML SDK, you navigate to the MainDocumentPart. Once you have it, you can call the Paragraphs method:

 

using (WordprocessingDocument doc =

    WordprocessingDocument.Open(filename, false))

{

    foreach (var p in doc.MainDocumentPart.Paragraphs())

    {

        Console.WriteLine("Style: {0} Text: >{1}<",

            p.StyleName.PadRight(16), p.Text);

        foreach (var c in p.Comments())

            Console.WriteLine(

              " Comment Author:{0} Text:>{1}<",

              c.Author, c.Text);

    }

}

 

The Paragraphs method returns a collection of Paragraph objects. The Paragraph class contains some public properties, such as Text and Style. It also contains a method, Comments, which returns a collection of comments for each paragraph. (A paragraph, of course, can have more than one comment.) It also contains a public property ParagraphElement, which contains the <w:p> XElement for the paragraph. The code takes care of the housekeeping necessary to aggregate multiple <w:t> text nodes into a single string containing the text of the paragraph.

 

If you run this code on the small document that is attached to this post, you see:

 

Printing Paragraphs

===================

Style: Normal Text: >Paragraph 1<

  Comment Author:Eric White (DPE) Text:>Comment 1<

  Comment Author:Eric White (DPE) Text:>Comment 2<

Style: Normal Text: >Paragraph 2<

Style: Normal Text: >Paragraph 3<

  Comment Author:Eric White (DPE) Text:>Comment 3<

Style: Normal Text: >Paragraph 4<

 

There is also an extension method on the MainDocumentPart to retrieve the default style for the document. Its signature:

 

public static string DefaultStyle(this MainDocumentPart mainDocument)

 

The Paragraphs method uses this extension method, but you are free to use it too.

 

The code to retrieve the paragraphs and comments for a Wordprocessing document is 88 lines long.

 

An Extension Method on WorksheetPart 

 

To retrieve the rows and cells of a spreadsheet, I implemented an extension method on the WorksheetPart. The signature is:

 

public static IEnumerable<Row> Rows(

  this WorksheetPart worksheetPart)

 

The Row class contains a method, Cells, which returns a collection of the cells for the row. Its signature:

 

public IEnumerable<Cell> Cells()

 

Using the Open XML SDK, you navigate to the WorksheetPart. Once you have it, you can iterate through the rows, and for each row, you can iterate through the cells in the row:

 

Console.WriteLine("Contents of Spreadsheet");

Console.WriteLine("=======================");

using (SpreadsheetDocument doc =

    SpreadsheetDocument.Open(filename, false))

{

    WorksheetPart worksheet =

        (WorksheetPart)doc.WorkbookPart.GetPartById("rId1");

 

    foreach (var row in worksheet.Rows())

    {

        Console.WriteLine(" RowId:{0}", row.RowId);

        Console.WriteLine(" Spans:{0}", row.Spans);

        foreach (var cell in row.Cells())

        {

            Console.WriteLine(" Column:{0}", cell.Column);

            Console.WriteLine(" ColumnId:{0}", cell.ColumnId);

            if (cell.Type != null)

                Console.WriteLine(" Type:{0}", cell.Type);

            if (cell.Value != null)

                Console.WriteLine(" Value:{0}", cell.Value);

            if (cell.Formula != null)

                Console.WriteLine(" Formula:>{0}<", cell.Formula);

            if (cell.SharedString != null)

               Console.WriteLine(" SharedString:>{0}<", cell.SharedString);

        }

    }

    Console.WriteLine();

}

 

When run on the small spreadsheet attached to this post, you see:

 

Contents of Spreadsheet

=======================

  RowId:1

  Spans:1:3

    Column:A1

      ColumnId:A

      Value:1

    Column:B1

      ColumnId:B

      Value:2

    Column:C1

      ColumnId:C

      Value:3

  RowId:2

  Spans:1:3

    Column:A2

      ColumnId:A

      Type:s

      Value:0

      SharedString:>a<

    Column:B2

      ColumnId:B

      Type:s

      Value:1

      SharedString:>b<

    Column:C2

      ColumnId:C

      Type:s

      Value:2

      SharedString:>c<

 

If you want to retrieve just the cells for a specific column, you can just tack on a Where method call to the Cells method call:

 

WorksheetPart worksheet =

    (WorksheetPart)doc.WorkbookPart.GetPartById("rId1");

 

foreach (var row in worksheet.Rows())

{

    Console.WriteLine(" RowId:{0}", row.RowId);

    Console.WriteLine(" Spans:{0}", row.Spans);

    foreach (var cell in row.Cells().Where(c => c.ColumnId == "B"))

    {

        Console.WriteLine(" Column:{0}", cell.Column);

        Console.WriteLine(" ColumnId:{0}", cell.ColumnId);

        if (cell.Type != null)

            Console.WriteLine(" Type:{0}", cell.Type);

        if (cell.Value != null)

            Console.WriteLine(" Value:{0}", cell.Value);

        if (cell.Formula != null)

            Console.WriteLine(" Formula:>{0}<", cell.Formula);

        if (cell.SharedString != null)

            Console.WriteLine(" SharedString:>{0}<", cell.SharedString);

    }

}

 

The code to get rows and cells for spreadsheets is 69 lines long. This meets my definition of “short”.

LtxOpenXml.zip

Comments

  • Anonymous
    July 09, 2008
    Eric White shows some interesting use of LINQ to XML to query an Open XML document. Something that is

  • Anonymous
    July 15, 2008
    Eric White on Linq and Open XML. Eric White has been posting some great code samples lately, including

  • Anonymous
    August 17, 2008
    Ce post n’a pas voulu partir ni jeudi ni vendredi, le voici donc ! Des mise à jours à n’en plus finir

  • Anonymous
    November 14, 2008
    (November 14, 2008 - I've updated my approach for querying Open XML documents using LINQ to XML. You

  • Anonymous
    November 14, 2008
    [Blog Map] Excel has a very cool feature where you can declare that a range of cells is a table. It is

  • Anonymous
    January 06, 2009
    What about the body of the extension method Paragraphs? Is there some optimal way of writing that?

  • Anonymous
    January 06, 2009
    Hi Sten, did you take a look at the Paragraphs extension method (attached to this blog post?) I think it is pretty optimal: public static IEnumerable<Paragraph> Paragraphs(this MainDocumentPart mainDocument)
    {
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        XName r = w + "r";
        XName ins = w + "ins";
        string defaultStyle = mainDocument.DefaultStyle();
        return
            from p in mainDocument.GetXDocument()
                .Root.Element(w + "body").Descendants(w + "p")
            let styleNode = p.Elements(w + "pPr").Elements(w + "pStyle")
                .FirstOrDefault()
            select new Paragraph(mainDocument)
            {
                ParagraphElement = p,
                StyleName = styleNode != null ?
                    (string)styleNode.Attribute(w + "val") :
                    defaultStyle,
                // in the following query, need to select both the r and ins elements
                // to assemble the text properly for paragraphs that have tracked changes.
                Text = p.Elements()
                    .Where(z => z.Name == r || z.Name == ins)
                    .Descendants(w + "t")
                    .StringConcatenate(element => (string)element)
            };
    }

  • Anonymous
    January 06, 2009
    Thanks a bunch! I'm trying to use OOXML API 2 and its class structure instead of XDocument, but in either case the extraction of Text tip is very helpful.

  • Anonymous
    April 23, 2009
    Can one easily write and alter the content as easily?

  • Anonymous
    April 27, 2009
    Hi Pierre, The code that I've presented here doesn't write or alter content.  It's not hard to modify documents, but it is more work than querying them.  I'm working on some ideas around this - I'll blog them as soon as I have time.  :) -Eric

  • Anonymous
    September 15, 2009
    Querying OpenXML spreadsheets works very well with your example, but how about accessing individual columns of a row by index? That is, instead of using "Where(c => c.ColumnId == "B")" the query would be something like "Where(c => c.Index == 2)". What would be the best way to achieve this? I could write a small function to convert 1 to A, 2 to B and so on, but that seems very inelegant; and inelegant solutions seldom create maintainable code.

  • Anonymous
    January 15, 2010
    Interesting article. I am using wordprocessingml to fill xml templates created in word 2003. Is there a better solution available where user can create word templates in 2007 and i can insert html text in templates and save as docx using wordprocessingml?

  • Anonymous
    February 18, 2010
    we know the filename.vbproj can we open it as XML doc and it contains all the controls used in project so this is the way to find your coontrols througt LINQ TO XML > Dim s As XDocument = XDocument.Load(filename.vbproj) Dim LsForms = s.Root.Nodes.ToList Dim LItemGroup As New List(Of XElement)   For Each c In LsForms      If TypeOf c Is XElement Then        Dim cc = CType(c, XElement)         if cc.Name.LocalName = "ItemGroup" Then                            Dim co = cc.Nodes.Where(Function(f) TypeOf f Is XElement AndAlso CType(f, XElement).Name.LocalName = "Compile").Select(Function(f) CType(f, XElement).FirstAttribute.Value).ToList ' your code here to test (co)... end if end if next

  • Anonymous
    February 21, 2010
    Hi Bilal, I have never looked at the XML that makes up a vbproj, so unfortunately don't have any insights for you. -Eric

  • Anonymous
    January 18, 2012
    How difficult would it be to search for a Range of cells in a spreadsheet. For example A2:F8 without searching the whole table and checking the "cell.Column" property? Possibly a TableRange() class.