Parsing WordML

This blog is inactive. New blog: EricWhite.com/blog

The first problem that we're going to tackle is retrieving some specific text from a Word document that has been saved as XML. The document contains text that has the style "Code". We want to find all consecutive paragraphs that have this style and retrieve their text. We also want to put comments on some of these paragraphs and retrieve the text of those comments.

Further, if the Word document contains multiple separate blocks of text styled as Code, we want to grab each block as a separate chunk of text, along with its associated comments.
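To make the goal concrete, here is a minimal sketch of the grouping step, written in Python rather than as the LINQ to XML query this series builds. It assumes a heavily simplified stand-in for a Word 2003 XML document, where each `w:p` paragraph carries its style in `w:pPr/w:pStyle`. The sample document and the helper names (`paragraph_style`, `paragraph_text`, `code_blocks`) are made up for illustration, and comments are left out entirely.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Namespace used by Word 2003 XML (WordML) documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Hypothetical, heavily simplified document: two separate blocks styled "Code".
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Intro text.</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Code"/></w:pPr><w:r><w:t>var x = 1;</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Code"/></w:pPr><w:r><w:t>Console.WriteLine(x);</w:t></w:r></w:p>
    <w:p><w:r><w:t>More prose.</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Code"/></w:pPr><w:r><w:t>var y = 2;</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def paragraph_style(p):
    """Return the paragraph's style name, or None if it has no explicit style."""
    style = p.find("w:pPr/w:pStyle", NS)
    return style.get(f"{{{W}}}val") if style is not None else None


def paragraph_text(p):
    """Concatenate the text of all runs in the paragraph."""
    return "".join(t.text or "" for t in p.iterfind("w:r/w:t", NS))


def code_blocks(root, style_name="Code"):
    """Group consecutive paragraphs with the given style into separate blocks."""
    paragraphs = root.iter(f"{{{W}}}p")
    blocks = []
    # groupby starts a new group each time the key changes, so every run of
    # adjacent "Code" paragraphs becomes its own block.
    for is_code, run in groupby(paragraphs, key=lambda p: paragraph_style(p) == style_name):
        if is_code:
            blocks.append("\n".join(paragraph_text(p) for p in run))
    return blocks


blocks = code_blocks(ET.fromstring(SAMPLE))
for block in blocks:
    print(block)
    print("---")
```

The two adjacent Code paragraphs end up in one block and the later, isolated one in another, which is the "separate chunks" behavior described above. A real document has more structure than this sample, so treat it as an illustration of the idea only.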

For those who are interested, the reason I need this code is that I have hundreds of code snippets in my LINQ to XML documentation. Each time I get a new version of LINQ to XML, I want to automatically test all of these snippets and make sure that they still work. What I have done is put build instructions (written in XML) in Word comments on the code in the docs. I then run a program that extracts the snippets and uses the build instructions to compile and build the code. The code tester runs the code and verifies the output. In most cases, the expected output is also in the Word document, styled as Code. This is a Good Thing: it not only makes sure that the code compiles, it also verifies that when readers of the docs run a snippet, they will see the output that is shown in the docs. Of course, it is a little more complicated than this. For instance, there are facilities to specify a language for each snippet, to copy files that the snippet requires in order to run, to validate against a file that the snippet writes, and so on. But once we have the query to retrieve the text styled as Code and the comments, the rest of the code tester is pretty simple stuff.

To accomplish this task, we'll start by writing a simple query, then enhance it using additional standard query operators and a couple of extension methods that help us retrieve exactly what we want from the docs.

Next: The Source Word Document
