Sdílet prostřednictvím


How to Extract Comments from Open XML Documents

This post is based on an interesting query - a user of Open XML wanted a general way to extract the comments from Open XML documents and save them in a common metadata server. This post contains a short example that iterates through all files in a directory, and extracts all comments from them, and outputs some XML containing the comments. The directory can contain all types of Open XML documents: WordprocessingML, SpreadsheetML, and PresentationML documents.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCThe resulting XML will look something like this:

<Root>
<Comment Source="docx" Author="Eric White">
<Text space="preserve">Comment 1</Text>
</Comment>
<Comment Source="docx" Author="Eric White">
<Text space="preserve">Another comment in a word doc.</Text>
</Comment>
<Comment Source="xlsx" Author="Eric White">
<Text space="preserve">Eric White:
This is a comment in an Excel spreadsheet.</Text>
</Comment>
<Comment Source="xlsx" Author="Eric White">
<Text space="preserve">Eric White:
Another comment.</Text>
</Comment>
<Comment Source="pptx" Author="Eric White">
<Text space="preserve">Another comment.</Text>
</Comment>
<Comment Source="pptx" Author="Eric White">
<Text space="preserve">This is a PPT comment.</Text>
</Comment>
</Root>

Following is the example, in its entirety. This example was interesting, in that I originally write it using queries instead of the for loops in the extension methods, and in this case, I think that the code is more readable using an imperative style of coding.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using Microsoft.Office.DocumentFormat.OpenXml.Packaging;
namespace LtxOpenXml
{
public static class Extensions
{
public static XDocument LoadXDocument(this OpenXmlPart part)
{
XDocument xdoc;
using (StreamReader streamReader = new StreamReader(part.GetStream()))
xdoc = XDocument.Load(XmlReader.Create(streamReader));
return xdoc;
}
public static string StringConcatenate(this IEnumerable<string> source)
{
StringBuilder sb = new StringBuilder();
foreach (string s in source)
sb.Append(s);
return sb.ToString();
}
}
class Program
{
static IEnumerable<XElement> ExtractFromDocx(string filename)
{
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filename, true))
{
MainDocumentPart mainPart = wordDoc.MainDocumentPart;
CommentsPart commentsPart = mainPart.CommentsPart;
XDocument cDoc = commentsPart.LoadXDocument();
XNamespace w = "https://schemas.openxmlformats.org/wordprocessingml/2006/main";
foreach (var c in cDoc.Root.Elements(w + "comment"))
{
yield return new XElement("Comment",
new XAttribute("Source", "docx"),
new XAttribute("Author", (string)c.Attribute(w + "author")),
new XElement("Text",
new XAttribute("space", "preserve"),
c.Descendants(w + "t").Select(t => (string)t).StringConcatenate())
);
}
}
}
static IEnumerable<XElement> ExtractFromXlsx(string filename)
{
using (SpreadsheetDocument spreadDoc = SpreadsheetDocument.Open(filename, true))
{
XNamespace s = "https://schemas.openxmlformats.org/spreadsheetml/2006/main";
foreach (var wsp in spreadDoc.WorkbookPart.WorksheetParts)
{
var xwp = wsp.WorksheetCommentsPart;
if (xwp != null)
{
XDocument xd = xwp.LoadXDocument();
var authorArray = xd
.Root
.Element(s + "authors")
.Elements(s + "author")
.Select(c => (string)c).ToArray();
foreach (var c in xd.Root.Element(s + "commentList").Elements(s + "comment"))
{
yield return new XElement("Comment",
new XAttribute("Source", "xlsx"),
new XAttribute("Author", authorArray[(int)c.Attribute("authorId")]),
new XElement("Text",
new XAttribute("space", "preserve"),
c.Element(s + "text").Descendants(s + "t").Select(t => (string)t).StringConcatenate())
);
}
}
}
}
}
static IEnumerable<XElement> ExtractFromPptx(string filename)
{
using (PresentationDocument pDoc = PresentationDocument.Open(filename, true))
{
XNamespace p = "https://schemas.openxmlformats.org/presentationml/2006/main";
var cap = pDoc.PresentationPart.CommentAuthorsPart;
if (cap != null)
{
var capXDocument = cap.LoadXDocument();
foreach (var slide in pDoc.PresentationPart.SlideParts)
{
var cp = slide.SlideCommentsPart;
if (cp != null)
{
var cpXDocument = cp.LoadXDocument();
foreach (var c in cpXDocument.Root.Elements(p + "cm"))
{
yield return new XElement("Comment",
new XAttribute("Source", "pptx"),
new XAttribute("Author", (string)capXDocument
.Root
.Elements(p + "cmAuthor")
.Where(z => (string)z.Attribute("id") == (string)c.Attribute("authorId"))
.FirstOrDefault().Attribute("name")
),
new XElement("Text",
new XAttribute("space", "preserve"),
(string)c.Element(p + "text"))
);
}
}
}
}
}
}
static void Main(string[] args)
{
XElement root = new XElement("Root",
Directory.GetFiles(".", "*.docx").Select(f => ExtractFromDocx(f)),
Directory.GetFiles(".", "*.xlsx").Select(f => ExtractFromXlsx(f)),
Directory.GetFiles(".", "*.pptx").Select(f => ExtractFromPptx(f))
);
Console.WriteLine(root);
}
}
}

Comments

  • Anonymous
    January 16, 2008
    This post is based on an interesting query - a user of Open XML wanted a general way to extract the comments

  • Anonymous
    January 16, 2008
    This post is based on an interesting query - a user of Open XML wanted a general way to extract the comments

  • Anonymous
    January 18, 2008
    Google's support for the Open XML formats continues to improve. They recently started identifying DOCX,

  • Anonymous
    July 15, 2009
    This is almost exactly what I was looking for! (Isn't it always?) So you can pull the comment, but can you also get the related cell? I need to pull these, then stick them in a database table with a reference to the cell value that I'm sticking in another database table. Thanks!