Parsing WordML using XLinq
(This is a note added on 4/17/2008. I just want to acknowledge that the approach taken in this blog post is the wrong approach! :-)
I first posted this on August 1, 2006, before I had the necessary functional programming epiphanies. To see the correct approach, go through this tutorial.)
----------------------------------------------------------------------
Recently, I needed a code testing harness that would do exactly what I wanted, and none existed. I want to grab a code snippet directly from my Word document, compile it, run it, and validate the output.
In more technical terms, I want to parse some WordML to grab text formatted with a given style. Further, I want to put a comment on the first line of the formatted text and be able to grab that comment. The comment will contain the metadata that tells the harness how to compile and run the code.
My Word docs are stored in WordML (which is XML). My experiment was to see how easy it would be to pick apart the WordML using XLinq. This is the result.
First, I needed to see what the WordML looked like. Word saves a WordML file without any indenting, which makes it difficult to see the element tags and the structure of the document. So I used the following program to indent the file:
using System;
using System.IO;
using System.Xml;

namespace Indent
{
    class Program
    {
        static void Main(string[] args)
        {
            foreach (string s in args)
            {
                XmlDocument doc = new XmlDocument();
                doc.Load(s);

                // Doc.xml becomes Doc_Indented.xml; Path.ChangeExtension
                // copes with extensions of any length.
                string newName = Path.ChangeExtension(s, null) + "_Indented.xml";

                // Dispose the writer so the output is flushed and the file is closed.
                using (XmlTextWriter writer = new XmlTextWriter(newName, null))
                {
                    writer.Formatting = Formatting.Indented;
                    doc.Save(writer);
                }
            }
        }
    }
}
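With the LINQ to XML API as it later shipped (System.Xml.Linq), the same indentation is available from the default serialization, because XElement.ToString() and XDocument.Save() indent unless told otherwise. The following is a minimal sketch that assumes the released API rather than the preview used in this post; the class and method names are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

static class IndentSketch
{
    // Re-serializes XML text with indentation, which is the default
    // behavior of XElement.ToString().
    public static string Indent(string xml)
    {
        return XElement.Parse(xml).ToString();
    }

    static void Main()
    {
        // Hypothetical input: a small document saved on a single line.
        Console.WriteLine(Indent("<a><b><c>text</c></b></a>"));
    }
}
```

For files, XDocument.Load(s).Save(newName) would be the equivalent call pair. Insignificant white space is not kept by this round trip unless LoadOptions.PreserveWhitespace is passed when loading.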
The Word doc that we're using for this sample is attached to this blog entry.
After building this little app, running it, and looking at my re-formatted WordML file, I see:
<w:p>
  <w:pPr>
    <w:pStyle w:val="Code" />
  </w:pPr>
  <w:r>
    <w:t>using</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd" />
  <w:r>
    <w:t> System;</w:t>
  </w:r>
  <aml:annotation aml:id="0" w:type="Word.Comment.End" />
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference" />
      <w:rFonts w:ascii="Times New Roman" w:h-ansi="Times New Roman" />
      <wx:font wx:val="Times New Roman" />
    </w:rPr>
    <aml:annotation aml:id="0" aml:author="Eric White" aml:createdate="2006-08-01T11:50:00Z" w:type="Word.Comment" w:initials="EW">
      <aml:content>
        <w:p>
          <w:pPr>
            <w:pStyle w:val="CommentText" />
          </w:pPr>
          <w:r>
            <w:rPr>
              <w:rStyle w:val="CommentReference" />
            </w:rPr>
            <w:annotationRef />
          </w:r>
          <w:r>
            <w:t>&lt;Test </w:t>
          </w:r>
          <w:proofErr w:type="spellStart" />
          <w:r>
            <w:t>SnipId</w:t>
          </w:r>
          <w:proofErr w:type="spellEnd" />
          <w:r>
            <w:t>="000101" TestId="0001"/&gt;</w:t>
          </w:r>
        </w:p>
      </aml:content>
    </aml:annotation>
  </w:r>
</w:p>
<w:proofErr w:type="gramStart" />
<w:p>
  <w:pPr>
    <w:pStyle w:val="Code" />
  </w:pPr>
  <w:r>
    <w:t>using</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd" />
  <w:r>
    <w:t> </w:t>
  </w:r>
  <w:proofErr w:type="spellStart" />
  <w:r>
    <w:t>System.Collections.Generic</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd" />
  <w:r>
    <w:t>;</w:t>
  </w:r>
</w:p>
<w:proofErr w:type="gramStart" />
<w:p>
  <w:pPr>
    <w:pStyle w:val="Code" />
  </w:pPr>
  <w:r>
    <w:t>using</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd" />
  <w:r>
    <w:t> </w:t>
  </w:r>
  <w:proofErr w:type="spellStart" />
  <w:r>
    <w:t>System.Text</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd" />
  <w:r>
    <w:t>;</w:t>
  </w:r>
</w:p>
I can see where the Word comment is. It is stored in an annotation element whose type is Word.Comment:
<aml:annotation aml:id="0" aml:author="Eric White" aml:createdate="2006-08-01T11:50:00Z" w:type="Word.Comment" w:initials="EW">
So in XLinq, I can issue a query to select all annotations:
var commentNodes =
    from annos in wordDoc.Descendants(aml + "annotation")
    where (string)annos.Attribute(w + "type") == "Word.Comment"
    select annos;
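The query relies on two XNamespace values, aml and w, that stand for the namespaces behind the aml: and w: prefixes. As a self-contained sketch of the same filter, assuming the released System.Xml.Linq API and a small hypothetical fragment in place of the real document (the class and method names are illustrative only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class CommentQuery
{
    // WordML 2003 namespaces behind the aml: and w: prefixes.
    static readonly XNamespace aml = "http://schemas.microsoft.com/aml/2001/core";
    static readonly XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns the aml:id of every annotation whose w:type is exactly "Word.Comment".
    public static List<string> FindCommentIds(XElement root)
    {
        return root.Descendants(aml + "annotation")
                   .Where(a => (string)a.Attribute(w + "type") == "Word.Comment")
                   .Select(a => (string)a.Attribute(aml + "id"))
                   .ToList();
    }

    static void Main()
    {
        // Hypothetical fragment: one comment end marker and one comment.
        XElement root = XElement.Parse(
            "<w:body xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'" +
            " xmlns:aml='http://schemas.microsoft.com/aml/2001/core'>" +
            "<aml:annotation aml:id='0' w:type='Word.Comment.End'/>" +
            "<w:r><aml:annotation aml:id='0' w:type='Word.Comment'/></w:r>" +
            "</w:body>");

        // The Word.Comment.End marker is filtered out, so one id is printed.
        foreach (string id in FindCommentIds(root))
            Console.WriteLine("comment id: " + id);   // prints "comment id: 0"
    }
}
```

Casting the attribute to string rather than reading its Value means an annotation without a w:type attribute compares as null instead of throwing.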
Word breaks the text up into separate runs, but it is easy to reassemble: paragraphs are contained in 'p' elements, and text is contained in 't' elements. The following XLinq code assembles the text:
StringBuilder comment = new StringBuilder();
foreach (var p in commentNode.Descendants(w + "p"))
{
    foreach (var t in p.Descendants(w + "t"))
        comment.Append(t.Value);
    comment.Append("\n");
}
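The same reassembly can be written as a single expression. This is only a sketch, assuming the released System.Xml.Linq API; the GetText helper and the sample fragment are hypothetical, with the fragment modeled on the comment text in the listing above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class TextAssembly
{
    static readonly XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Joins the w:t text inside each w:p, one output line per paragraph.
    public static string GetText(XElement container)
    {
        return string.Concat(
            container.Descendants(w + "p")
                     .Select(p => string.Concat(
                         p.Descendants(w + "t").Select(t => t.Value)) + "\n"));
    }

    static void Main()
    {
        // Hypothetical fragment: Word has split one line of text over three runs.
        XElement content = XElement.Parse(
            "<content xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'>" +
            "<w:p><w:r><w:t>&lt;Test </w:t></w:r><w:r><w:t>SnipId</w:t></w:r>" +
            "<w:r><w:t>=\"000101\" TestId=\"0001\"/&gt;</w:t></w:r></w:p>" +
            "</content>");

        Console.Write(GetText(content));   // prints <Test SnipId="000101" TestId="0001"/>
    }
}
```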
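Once we have found and extracted the relevant comment, we need to jump up two ancestors: from the annotation to the run ('r') that contains it, and from the run to the paragraph ('p') that contains the run: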
var codePara = commentNode.Parent.Parent;
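Parent.Parent depends on the annotation sitting exactly two levels below the paragraph. A variant that asks for the nearest enclosing paragraph by name is sketched below; it assumes the released System.Xml.Linq API, and the helper name and sample fragment are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ParagraphLookup
{
    static readonly XNamespace aml = "http://schemas.microsoft.com/aml/2001/core";
    static readonly XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns the nearest enclosing w:p of a node, or null if there is none.
    // Ancestors() yields the closest ancestor first.
    public static XElement EnclosingParagraph(XElement node)
    {
        return node.Ancestors(w + "p").FirstOrDefault();
    }

    static void Main()
    {
        // Hypothetical fragment with the same nesting as the listing: p > r > annotation.
        XElement para = XElement.Parse(
            "<w:p xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'" +
            " xmlns:aml='http://schemas.microsoft.com/aml/2001/core'>" +
            "<w:pPr><w:pStyle w:val='Code'/></w:pPr>" +
            "<w:r><aml:annotation aml:id='0' w:type='Word.Comment'/></w:r>" +
            "</w:p>");
        XElement comment = para.Descendants(aml + "annotation").First();

        // For this nesting both routes reach the same paragraph element.
        Console.WriteLine(
            object.ReferenceEquals(comment.Parent.Parent, EnclosingParagraph(comment)));   // prints True
    }
}
```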
Now we have the node of the first paragraph of the code in the Word doc. The logic next consists of:
- If we are still on a paragraph with the Code style
- Get all the pieces of the paragraph that hold text (runs, plus annotations of type Word.Insertion)
- Get rid of the pieces that just hold a comment
- Assemble the text
- Move on to the next paragraph
This is the code to do this:
while (codePara != null)
{
    XElement c1, c2;

    // skip the proofing markers that Word places between paragraphs
    if (codePara.Name.LocalName == "proofErr")
    {
        // "as" yields null when there is no next node or it is not an element,
        // which ends the loop instead of throwing
        codePara = codePara.NextNode as XElement;
        continue;
    }

    // if there is a pPr that has a pStyle with val="Code"
    if (
        ((c1 = codePara.Element(w + "pPr")) != null) &&
        ((c2 = c1.Element(w + "pStyle")) != null) &&
        ((string)c2.Attribute(w + "val") == "Code")
    )
    {
        // select all of the nodes that have content
        var interestingPieces =
            from s in codePara.Elements()
            where (s.Name == w + "r") ||
                ((s.Name == aml + "annotation") &&
                 ((string)s.Attribute(w + "type") == "Word.Insertion"))
            select s;

        // get rid of all pieces that just carry a comment
        List<XElement> le = new List<XElement>();
        foreach (var i in interestingPieces)
        {
            var e = i.Element(aml + "annotation");
            if (e != null && (string)e.Attribute(w + "type") == "Word.Comment")
                continue;
            le.Add(i);
        }

        foreach (var t in le.Descendants(w + "t"))
            code.Append(t.Value);
        code.Append("\n");

        // move on to the next paragraph; null ends the loop
        codePara = codePara.NextNode as XElement;
    }
    else
        break;
}
The above code works even when change tracking has been turned on, and there are changes in the text. The entire program follows:
using System;
using System.Collections.Generic;
using System.Text;
// Namespace names from the released .NET Framework; the 2006 LINQ preview
// called these System.Query and System.Xml.XLinq.
using System.Linq;
using System.Xml.Linq;

namespace WordMLReader
{
    class Program
    {
        static void WordMLReader(string fn)
        {
            XElement wordDoc = null;
            try
            {
                wordDoc = XElement.Load(fn);
            }
            catch (System.Xml.XmlException e)
            {
                Console.WriteLine(e.ToString());
                return;
            }

            // WordML namespace names use http, not https
            XNamespace aml = "http://schemas.microsoft.com/aml/2001/core";
            XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

            var commentNodes =
                from annos in wordDoc.Descendants(aml + "annotation")
                where (string)annos.Attribute(w + "type") == "Word.Comment"
                select annos;

            foreach (var commentNode in commentNodes)
            {
                StringBuilder comment = new StringBuilder();
                StringBuilder code = new StringBuilder();

                foreach (var p in commentNode.Descendants(w + "p"))
                {
                    foreach (var t in p.Descendants(w + "t"))
                        comment.Append(t.Value);
                    comment.Append("\n");
                }

                // annotation -> run -> paragraph
                var codePara = commentNode.Parent.Parent;

                while (codePara != null)
                {
                    XElement c1, c2;

                    // skip the proofing markers that Word places between paragraphs
                    if (codePara.Name.LocalName == "proofErr")
                    {
                        // "as" yields null when there is no next node or it is
                        // not an element, which ends the loop instead of throwing
                        codePara = codePara.NextNode as XElement;
                        continue;
                    }

                    // if there is a pPr that has a pStyle with val="Code"
                    if (
                        ((c1 = codePara.Element(w + "pPr")) != null) &&
                        ((c2 = c1.Element(w + "pStyle")) != null) &&
                        ((string)c2.Attribute(w + "val") == "Code")
                    )
                    {
                        // select all of the nodes that have content
                        var interestingPieces =
                            from s in codePara.Elements()
                            where (s.Name == w + "r") ||
                                ((s.Name == aml + "annotation") &&
                                 ((string)s.Attribute(w + "type") == "Word.Insertion"))
                            select s;

                        // get rid of all pieces that just carry a comment
                        List<XElement> le = new List<XElement>();
                        foreach (var i in interestingPieces)
                        {
                            var e = i.Element(aml + "annotation");
                            if (e != null && (string)e.Attribute(w + "type") == "Word.Comment")
                                continue;
                            le.Add(i);
                        }

                        foreach (var t in le.Descendants(w + "t"))
                            code.Append(t.Value);
                        code.Append("\n");

                        // move on to the next paragraph; null ends the loop
                        codePara = codePara.NextNode as XElement;
                    }
                    else
                        break;
                }

                Console.WriteLine("============= This is the code =============");
                Console.WriteLine(code);
                Console.WriteLine("============================================");
                Console.WriteLine("");
                Console.WriteLine("============= This is the comment =============");
                Console.WriteLine(comment);
                Console.WriteLine("===============================================");
            }
        }

        static void Main(string[] args)
        {
            WordMLReader("CodeInDoc.xml");
        }
    }
}
When you run the code against the attached Word doc, you see:
============= This is the code =============
using System;
using System.Collections.Generic;
using System.Text;
using System.Query;
using System.Xml.XLinq;
using System.Data.DLinq;
namespace WordMLReader
{
class Program
{
static void (string[] args)
{
Console.WriteLine("Hello");
}
}
}
============================================
============= This is the comment =============
<Test SnipId="000101" TestId="0001"/>
===============================================
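The note at the top of this post says that a more functional style is the better approach. One way the paragraph-walking loop might look in that style is sketched below. It is not the code from the tutorial the note refers to; it assumes the released System.Xml.Linq API, all helper names and the sample fragment are hypothetical, and unlike the loop it skips over non-element nodes between paragraphs instead of stopping at them.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class CodeParagraphs
{
    static readonly XNamespace aml = "http://schemas.microsoft.com/aml/2001/core";
    static readonly XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

    // True for a w:p whose pPr/pStyle has val="Code".
    static bool IsCodeParagraph(XElement e)
    {
        return e.Name == w + "p" &&
               (string)e.Elements(w + "pPr").Elements(w + "pStyle")
                        .Attributes(w + "val").FirstOrDefault() == "Code";
    }

    // True for a piece that directly contains a Word.Comment annotation.
    static bool HoldsComment(XElement piece)
    {
        return piece.Elements(aml + "annotation")
                    .Any(a => (string)a.Attribute(w + "type") == "Word.Comment");
    }

    // Text of the run of Code-styled paragraphs starting at firstPara, one line each.
    public static string GetCode(XElement firstPara)
    {
        var paragraphs = new[] { firstPara }
            .Concat(firstPara.ElementsAfterSelf())
            .Where(e => e.Name != w + "proofErr")    // skip proofing markers
            .TakeWhile(e => IsCodeParagraph(e));     // stop at the first non-Code element

        return string.Concat(paragraphs.Select(p =>
            string.Concat(
                p.Elements()
                 .Where(e => e.Name == w + "r" ||
                             (e.Name == aml + "annotation" &&
                              (string)e.Attribute(w + "type") == "Word.Insertion"))
                 .Where(e => !HoldsComment(e))
                 .Descendants(w + "t")
                 .Select(t => t.Value)) + "\n"));
    }

    static void Main()
    {
        // Hypothetical fragment: two Code paragraphs, then a Normal one.
        XElement body = XElement.Parse(
            "<w:body xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'" +
            " xmlns:aml='http://schemas.microsoft.com/aml/2001/core'>" +
            "<w:p><w:pPr><w:pStyle w:val='Code'/></w:pPr>" +
            "<w:r><w:t>using</w:t></w:r><w:r><w:t> System;</w:t></w:r>" +
            "<w:r><aml:annotation aml:id='0' w:type='Word.Comment'><aml:content>" +
            "<w:p><w:r><w:t>meta</w:t></w:r></w:p></aml:content></aml:annotation></w:r>" +
            "</w:p>" +
            "<w:proofErr w:type='gramStart'/>" +
            "<w:p><w:pPr><w:pStyle w:val='Code'/></w:pPr><w:r><w:t>class C { }</w:t></w:r></w:p>" +
            "<w:p><w:pPr><w:pStyle w:val='Normal'/></w:pPr><w:r><w:t>prose</w:t></w:r></w:p>" +
            "</w:body>");

        // Prints the two code lines: "using System;" and "class C { }".
        Console.Write(GetCode(body.Elements(w + "p").First()));
    }
}
```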