Searching Open XML documents
Searching documents for text strings is a common task in many types of applications. There are many possible variations on this simple concept: whether to search the body of the document or its metadata, whether to restrict your search to specific document types or document sections, and so on. At its core, however, all of these search scenarios have the same basic objective: iterate through all nodes of a particular type, and check whether those nodes contain the string we're looking for.
To get a feel for how this might be accomplished in the world of Open XML documents, let's take a look at a specific example: searching for text within the body of a wordprocessingml document. We'll build a simple WinForms application where you can select a folder and specify a search string. The application then scans all the DOCX documents in that folder (and its sub-folders) and displays a list of the matches found.
In this example, we'll only be searching the main document part, and we'll only be looking at the t (text) nodes. Note that this simplicity is in part a result of the design of the file formats. For example, we don't have to think about whether we're looking at deleted text that's still in the document (because Track Changes is turned on, say), because that text is stored in a special node to distinguish it from the actual text of the document. And since the text nodes only contain text and nothing else (no formatting information, for example), we can simply check whether our search string occurs in the value of a text node, and if it does then this document is a "hit" for our search.
One final note before we get started: this is not a sample of industrial-strength best-practices code. I've left out lots of error-checking you'd probably want to do, and I've even included explicit references to namespace prefixes and other things you'd never do in production code. The goal here is simply to illustrate how to search text in Open XML documents, as simply and clearly as possible.
Program Structure
The sample application (source code attached) includes some basic code for selecting a folder, enabling double-click to launch a hit from the hit list, enabling the OK button when appropriate, and so on. But the heart of the matter, the actual search functionality, is in the SearchFiles() and SearchDocx() methods.
SearchFiles() is a recursive method that searches all of the files in a specified folder (passed as a DirectoryInfo object) for a specified string of text. For any sub-folders contained within the search folder, SearchFiles() calls itself recursively to drill down to whatever depth is required to search every DOCX file under the specified folder. Add some code to trap access-denied errors for some folders (as you'll need when searching your entire C: drive under Vista), and our SearchFiles() method looks like this:
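A minimal sketch of such a recursive search is shown below. The signature is an assumption made for illustration: the hit list is passed in as a plain List<string>, and the per-document search is passed in as a delegate (here called searchDocx, standing in for the SearchDocx() method described in the next section) so that the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FolderSearch
{
    // Hypothetical signature: 'searchDocx' and 'hits' are illustration-only
    // parameters, not necessarily what the sample application uses.
    public static void SearchFiles(DirectoryInfo folder, string searchFor,
                                   Func<string, string, bool> searchDocx,
                                   List<string> hits)
    {
        try
        {
            // Check every DOCX file directly inside this folder.
            foreach (FileInfo file in folder.GetFiles("*.docx"))
            {
                if (searchDocx(file.FullName, searchFor))
                    hits.Add(file.FullName);
            }

            // Recurse into each sub-folder, to whatever depth is required.
            foreach (DirectoryInfo sub in folder.GetDirectories())
                SearchFiles(sub, searchFor, searchDocx, hits);
        }
        catch (UnauthorizedAccessException)
        {
            // Access denied for this folder: skip it and keep searching the rest.
        }
    }
}
```

Because the try/catch sits inside the recursive method, an access-denied error only skips the folder that raised it; sibling folders are still searched.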
Searching a document
SearchDocx() is where we search a specific DOCX file. It's based on the HowToGetToDocPart.CS code snippet that comes with the Visual Studio code snippets for Open XML development. This particular snippet shows how to get the document start part, and in our sample app that part is named StartPart. Here's the code that takes that part and searches it for the text string (searchFor) that we're looking for:
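As a rough sketch of that search loop (not necessarily identical to the sample app's code), the method below reads the start part as a stream and checks each w:t node. The class and method names are hypothetical, and the code assumes the caller has already opened the start part as a Stream and converted searchFor to lower case.

```csharp
using System;
using System.IO;
using System.Xml;

static class DocxTextSearch
{
    // Hypothetical helper: 'startPartStream' is the XML of the main document
    // part; 'searchFor' is assumed to be lower case already.
    public static bool ContainsText(Stream startPartStream, string searchFor)
    {
        using (XmlReader reader = XmlReader.Create(startPartStream))
        {
            bool inText = false; // true while we are inside a w:t element

            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Explicit prefix match, as in the discussion here;
                        // not recommended for production code.
                        inText = reader.Name == "w:t" && !reader.IsEmptyElement;
                        break;

                    case XmlNodeType.EndElement:
                        inText = false;
                        break;

                    case XmlNodeType.Text:
                        // Lower-case the node text so the match is case-insensitive.
                        if (inText && reader.Value.ToLower().Contains(searchFor))
                            return true; // first hit is enough for this document
                        break;
                }
            }
        }
        return false;
    }
}
```

Note that each text node is checked on its own, so a phrase that Word has split across several runs would not be found by a loop like this.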
This code is extremely simple, and it also runs very fast. Here are a few things to note about what's going on:
- We're using the XmlReader class here, instead of the DOM approach using XmlDocument. The DOM is much more flexible and gives you random-access read/write capability anywhere in the document, but loading an entire document into an XmlDocument adds a lot of overhead. Use XmlDocument when you need the flexibility it provides, and use XmlReader when your needs are simple and you're optimizing performance.
- We're not instantiating an XmlTextReader directly. Instead, we're using the XmlReader.Create() method, which returns a suitable reader implementation for us. This is the preferred approach for .NET 2.0 and above: the factory pattern shields you from the internal implementation details, and it also enables some internal optimizations that have made XML readers extremely fast.
- Note that we're doing a lower-case match: the text of each node is compared in lower case, and the searchFor parameter was already converted to lower case outside this loop, so the search is case-insensitive.
- We're referring to the desired node as "w:t", but using an explicit namespace prefix like this is not recommended practice. (More on this topic below.)
Variations for other document types
This sample searches word-processing documents, but a similar approach can also be used for spreadsheets and presentations. Let's look at how this code would differ if we were searching spreadsheets or presentations instead of word-processing documents.
For spreadsheets, there are two types of text that need to be searched: inline strings and shared strings. For inline strings, we would iterate through the worksheet relationships, searching the t nodes that occur inside "is" (inline string) nodes. For shared strings, we would search the shared-strings part for the t nodes inside "si" nodes.
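To make the shared-strings case concrete, here is a minimal sketch that scans a shared-strings part for t nodes that occur inside si nodes. The class and method names are hypothetical, the caller is assumed to supply the part as a Stream along with a lower-case searchFor, and elements are matched by the spreadsheetml main namespace rather than by prefix.

```csharp
using System;
using System.IO;
using System.Xml;

static class SharedStringSearch
{
    // Main spreadsheetml namespace.
    const string SpreadsheetNs =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Hypothetical helper: 'sharedStringsStream' is the XML of the
    // shared-strings part; 'searchFor' is assumed to be lower case already.
    public static bool ContainsText(Stream sharedStringsStream, string searchFor)
    {
        using (XmlReader reader = XmlReader.Create(sharedStringsStream))
        {
            bool insideSi = false; // inside an si (string item) element
            bool inText = false;   // inside a t element within an si

            while (reader.Read())
            {
                bool isSheetNs = reader.NamespaceURI == SpreadsheetNs;
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (isSheetNs && reader.LocalName == "si")
                            insideSi = !reader.IsEmptyElement;
                        inText = insideSi && isSheetNs
                                 && reader.LocalName == "t"
                                 && !reader.IsEmptyElement;
                        break;

                    case XmlNodeType.EndElement:
                        if (isSheetNs && reader.LocalName == "si")
                            insideSi = false;
                        inText = false;
                        break;

                    case XmlNodeType.Text:
                        if (inText && reader.Value.ToLower().Contains(searchFor))
                            return true;
                        break;
                }
            }
        }
        return false;
    }
}
```

The inline-string case would follow the same pattern against each worksheet part, tracking "is" nodes instead of "si" nodes.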
For presentations, all text is contained in t (text) nodes in the slide parts. Note that these t nodes are in the drawingml namespace, as opposed to the wordprocessingml namespace. PowerPoint uses an "a" prefix for drawingml, so it writes these nodes out as "a:t" instead of "w:t" as used by Word, but you should never count on namespace prefixes because they can change. For example, you could create a perfectly valid wordprocessingml document that uses the "z" prefix instead of "w" for the wordprocessingml namespace.
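One way to avoid depending on prefixes is to compare each element's local name and namespace URI instead of its prefixed name. The sketch below illustrates the idea; the class, method, and constant names are hypothetical, and the caller is assumed to pass the part as a Stream along with a lower-case searchFor.

```csharp
using System;
using System.IO;
using System.Xml;

static class NamespaceAwareSearch
{
    // Namespace URIs for the two kinds of t (text) nodes discussed above.
    public const string WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    public const string DrawingNs =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Hypothetical helper: searches t elements that belong to 'textNamespace',
    // whatever prefix the document happens to use for that namespace.
    public static bool ContainsText(Stream partStream, string searchFor,
                                    string textNamespace)
    {
        using (XmlReader reader = XmlReader.Create(partStream))
        {
            bool inText = false;
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Match on local name + namespace URI, never on prefix.
                        inText = reader.LocalName == "t"
                                 && reader.NamespaceURI == textNamespace
                                 && !reader.IsEmptyElement;
                        break;
                    case XmlNodeType.EndElement:
                        inText = false;
                        break;
                    case XmlNodeType.Text:
                        if (inText && reader.Value.ToLower().Contains(searchFor))
                            return true;
                        break;
                }
            }
        }
        return false;
    }
}
```

With this approach, a word-processing document that writes its text nodes as "z:t" is searched by passing WordNs, and a slide part is searched by passing DrawingNs, with no change to the loop itself.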
The details of managing namespaces are outside the scope of this post, but Wouter Van Vugt has covered some of the details on his blog. One thing you'll find saves some time and hassle is to use an all-in-one schema file with the XmlSchemaSet class to avoid circular-reference issues, as Wouter explains here. I'm going to implement that in this sample app, along with some of the variations mentioned above, and will post an updated version after those changes are done.