Open XML SDK Code behind the Custom XML Markup Detection Tool

Gray recently posted on his blog about a scanning tool that detects custom XML markup in Word Open XML files (*.docx, *.docm, *.dotm, and *.dotx). The tool is built with the Open XML SDK, and in this post we want to show you how the solution works.

Solution

The scenario we want to support is this: given a directory, find all Word Open XML documents that contain custom XML markup (w:customXml elements). To accomplish this, we need to take the following actions:

  1. Given a directory, get all Word Open XML files
  2. For each file found, open the document with the Open XML SDK
  3. Get all XML parts contained within the document
  4. For each part, count the number of occurrences of custom XML markup. Count the total across all the parts contained within the document
  5. Show results of scan

If you want to jump straight into the code, feel free to download this solution here.

The solution uses the December 2009 CTP of the Open XML SDK 2.0 for Microsoft Office, which you can learn more about in this introduction to the Open XML SDK. You can certainly use version 1.0 of the Open XML SDK, but version 2.0 makes things a lot easier, especially with all the improvements added to the December 2009 CTP as described in the CTP announcement blog post.

Step 1 – Find all Word documents in a directory

For the sake of simplicity, we are going to build a command-line solution that expects one argument: the directory to be scanned by the tool. Given this directory and all of its subdirectories, we look for all Word Open XML documents, which have the following extensions: .docx, .docm, .dotx, and .dotm. Performing this task is pretty simple with the Directory and DirectoryInfo classes and the GetFiles method. The only issue is that GetFiles accepts only one search pattern at a time, so we call it once per extension. Files or directories that cannot be scanned, for example due to file permissions, are reported in a separate list. Here is a code snippet that looks for all files matching multiple extensions:

    // Namespaces used by the snippets in this post
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    private static string lookfor = "*.docx;*.docm;*.dotx;*.dotm";
    private static string[] extensions = lookfor.Split(new char[] { ';' });

    // Files with the above extensions will be added to this list
    private static List<string> files = new List<string>();

    // Track files with errors (such as IRM-protected documents we cannot load)
    // in a separate list
    private static Dictionary<string, object> errors = new Dictionary<string, object>();

    private static int folderCount = 0;

    static void ShowHelp()
    {
        Console.WriteLine();
        Console.WriteLine("DetectCustomXMLMarkup.exe - Detect Custom XML Markup in Documents Tool");
        Console.WriteLine("Usage:");
        Console.WriteLine(" DetectCustomXMLMarkup.exe [Path To Folder]");
        Console.WriteLine(@"Example: DetectCustomXMLMarkup.exe C:\temp");
        Console.WriteLine("For more information about Custom XML Markup, please see Microsoft Knowledge Base article 978951");
        Console.WriteLine();
    }

    //Recursive method to get all files in a directory and all its sub-directories
    static void GetAllFiles(string parentFolder)
    {
        //Print a progress dot for every tenth folder visited
        folderCount++;
        if (folderCount % 10 == 0)
            Console.Write(".");

        try
        {
            //GetFiles takes one search pattern at a time, so call it once per extension
            foreach (string ext in extensions)
            {
                files.AddRange(Directory.GetFiles(parentFolder, ext, SearchOption.TopDirectoryOnly));
            }
            foreach (string subFolder in Directory.GetDirectories(parentFolder))
            {
                GetAllFiles(subFolder);
            }
        }
        //all files/directories that cannot be accessed will be reported as an error
        catch (Exception e)
        {
            errors.Add(parentFolder, e.Message.Replace("\n", string.Empty).Replace("\r", string.Empty));
        }
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            ShowHelp();
            return;
        }

        //args[0] represents the directory path to scan
        DirectoryInfo di = new DirectoryInfo(args[0]);
        if (di.Exists)
        {
            Console.Write("Scanning for Word Open XML files under {0}", di.FullName);
            Console.WriteLine();

            GetAllFiles(di.FullName);

            Console.WriteLine();
            Console.WriteLine();
            Console.WriteLine("Found {0} Word Open XML files...", files.Count);

            ...
        }
        else
        {
            Console.WriteLine("Path is incorrect or cannot be accessed. Try again or specify another path.");
            Console.WriteLine();
            ShowHelp();
            Console.WriteLine();
        }
    }
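As an alternative to calling GetFiles once per pattern, the extension check can also be done in code on any list of paths. The following is only a minimal sketch and not part of the downloadable solution: the WordFileFilter class and its method names are hypothetical, and it assumes that a case-insensitive match on the four extensions listed above is what you want.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the downloadable solution.
public static class WordFileFilter
{
    // The four Word Open XML extensions named in this post.
    private static readonly HashSet<string> WordExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".docx", ".docm", ".dotx", ".dotm"
        };

    // True when the path ends in one of the Word Open XML extensions.
    public static bool IsWordOpenXml(string path)
    {
        return WordExtensions.Contains(Path.GetExtension(path));
    }

    // Filters any sequence of paths down to Word Open XML files.
    public static List<string> Filter(IEnumerable<string> paths)
    {
        return paths.Where(IsWordOpenXml).ToList();
    }
}

public static class WordFileFilterDemo
{
    public static void Main()
    {
        var candidates = new[] { "a.docx", "b.DOTM", "c.doc", "d.txt" };

        // Prints a.docx and b.DOTM; the legacy .doc file and the text file are skipped.
        foreach (string path in WordFileFilter.Filter(candidates))
        {
            Console.WriteLine(path);
        }
    }
}
```

Keeping the extensions in a set means a new file type only needs one extra entry, at the cost of enumerating every file in a folder rather than letting the file system apply the pattern.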

Step 2 – Open Word documents with the Open XML SDK

Now that we have all the Word files to scan, our next step is to open each of these documents with the Open XML SDK. The Open XML SDK should be able to handle most Word Open XML files. However, a document may occasionally have issues that prevent the SDK from opening it. For example, IRM-protected documents cannot be opened with the Open XML SDK. To ensure our solution keeps running despite these issues, we simply wrap the SDK Open call in a try/catch block. Any errors caught are recorded in the errors list. Here is the code snippet to accomplish this task:

    static void Main(string[] args)
    {
        ...

        int fCount = 0;
        bool addFinalLine = false;

        //Go through all files found
        foreach (string file in files)
        {
            //Start a new progress line every 100 files, and print a dot every 5 files
            if (fCount++ % 100 == 0)
            {
                if (addFinalLine)
                {
                    Console.WriteLine();
                }
                Console.Write("Scanning files [{0}/{1}]", fCount, files.Count);
                addFinalLine = true;
            }
            else if (fCount % 5 == 0)
            {
                Console.Write(".");
                addFinalLine = true;
            }

            string thisFile = string.Empty;

            try
            {
                //Open the document read-only
                using (WordprocessingDocument myDoc = WordprocessingDocument.Open(file, false))
                {
                    //Get all parts within the package
                    //Count occurrences of custom XML markup
                    ...
                }
            }
            //any file that cannot be opened is reported as an error; line breaks
            //are stripped so that each error stays on one line of the log
            catch (Exception e)
            {
                errors.Add(file, e.Message.Replace("\n", string.Empty).Replace("\r", string.Empty));
            }
        }

        ...
    }
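The "keep going on failure" pattern used above can be isolated from the SDK and tried on its own. This is a rough sketch under stated assumptions: the ResilientScan class is hypothetical, and the scanOne delegate merely stands in for opening a document with the SDK and counting markup.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the per-file try/catch pattern; scanOne stands in
// for opening a document with the SDK and counting custom XML markup.
public static class ResilientScan
{
    public static void Run(
        IEnumerable<string> files,
        Func<string, int> scanOne,
        IDictionary<string, int> results,
        IDictionary<string, string> errors)
    {
        foreach (string file in files)
        {
            try
            {
                int count = scanOne(file);
                if (count > 0)
                {
                    results[file] = count;
                }
            }
            catch (Exception e)
            {
                // Strip line breaks so each error stays on one log line.
                errors[file] = e.Message.Replace("\r", string.Empty).Replace("\n", string.Empty);
            }
        }
    }
}

public static class ResilientScanDemo
{
    public static void Main()
    {
        var results = new Dictionary<string, int>();
        var errors = new Dictionary<string, string>();

        ResilientScan.Run(
            new[] { "good.docx", "protected.docx", "clean.docx" },
            file =>
            {
                // Simulate a document that cannot be opened.
                if (file == "protected.docx")
                {
                    throw new InvalidOperationException("cannot open\r\npackage");
                }
                return file == "good.docx" ? 3 : 0;
            },
            results,
            errors);

        // Prints "Files with markup: 1, errors: 1"
        Console.WriteLine("Files with markup: {0}, errors: {1}", results.Count, errors.Count);
    }
}
```

One failing file ends up in the errors collection while the files after it are still scanned, which is the behavior we want from a tool that may run over thousands of documents.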

Step 3 – Get all parts within a document

At this point we have the file opened with the Open XML SDK. The next step is to get all the parts within the package. For this task we are going to leverage some source code from Eric White's post on how to create a list of all parts in an Open XML document. Essentially, we use two methods, GetAllParts and AddPart, to recursively find all XML-based parts within a package. Here is the code snippet necessary to accomplish this task:

    static void Main(string[] args)
    {
        ...
        //file is the path string from the foreach loop in Step 2
        using (WordprocessingDocument myDoc = WordprocessingDocument.Open(file, false))
        {
            //Get all parts within the package
            List<OpenXmlPart> parts = GetAllParts(myDoc);
            ...
        }

        ...
    }

    //This method is used to recursively find all parts within a package
    static void AddPart(HashSet<OpenXmlPart> partList, OpenXmlPart part)
    {
        //skip parts we have already collected (a part can be referenced more than once)
        if (partList.Contains(part))
            return;

        //only add parts that are xml based
        if (part.ContentType.EndsWith("+xml"))
            partList.Add(part);

        foreach (IdPartPair p in part.Parts)
            AddPart(partList, p.OpenXmlPart);
    }

    //This method starts the recursion from the parts directly related to the package
    static List<OpenXmlPart> GetAllParts(WordprocessingDocument doc)
    {
        HashSet<OpenXmlPart> partList = new HashSet<OpenXmlPart>();

        foreach (IdPartPair p in doc.Parts)
            AddPart(partList, p.OpenXmlPart);

        return partList.ToList();
    }
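To see why the traversal needs a set of already-seen parts, it helps to run the same idea on a tiny in-memory graph. The sketch below is an illustration only: FakePart is a hypothetical stand-in for OpenXmlPart, and unlike the snippet above it tracks visited parts separately from collected XML parts, so a shared non-XML part (such as an image) is also visited only once.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for OpenXmlPart: a content type plus related parts.
public sealed class FakePart
{
    public string ContentType { get; }
    public List<FakePart> Children { get; } = new List<FakePart>();

    public FakePart(string contentType)
    {
        ContentType = contentType;
    }
}

public static class PartWalker
{
    // Returns every reachable part whose content type ends in "+xml".
    public static List<FakePart> CollectXmlParts(IEnumerable<FakePart> roots)
    {
        var visited = new HashSet<FakePart>();
        var xmlParts = new List<FakePart>();
        foreach (FakePart root in roots)
        {
            Visit(root, visited, xmlParts);
        }
        return xmlParts;
    }

    private static void Visit(FakePart part, HashSet<FakePart> visited, List<FakePart> xmlParts)
    {
        // HashSet.Add returns false when the part was already seen,
        // which stops shared parts and cycles from being walked twice.
        if (!visited.Add(part))
        {
            return;
        }
        if (part.ContentType.EndsWith("+xml", StringComparison.Ordinal))
        {
            xmlParts.Add(part);
        }
        foreach (FakePart child in part.Children)
        {
            Visit(child, visited, xmlParts);
        }
    }
}

public static class PartWalkerDemo
{
    public static void Main()
    {
        var main = new FakePart("application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml");
        var styles = new FakePart("application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml");
        var header = new FakePart("application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml");
        var image = new FakePart("image/png");

        main.Children.Add(styles);
        main.Children.Add(header);
        main.Children.Add(image);
        header.Children.Add(image); // the same image is shared by two parts

        // Prints "XML parts found: 3"
        Console.WriteLine("XML parts found: {0}", PartWalker.CollectXmlParts(new[] { main }).Count);
    }
}
```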

Step 4 – Count the occurrences of Custom XML markup

Now that we have all the XML-related parts contained within our Word document, the next step is to scan each of those parts for custom XML markup. This task is pretty easy with the Open XML SDK. Every file in which custom XML markup is detected is added to the results list, together with its count. Here is the code snippet necessary to accomplish this task:

    // Files with detected custom XML markup will be added to this list
    private static Dictionary<string, object> results = new Dictionary<string, object>();

    static void Main(string[] args)
    {
        ...
        //file is the path string from the foreach loop in Step 2
        using (WordprocessingDocument myDoc = WordprocessingDocument.Open(file, false))
        {
            int count = 0;

            //Get all parts within the package
            List<OpenXmlPart> parts = GetAllParts(myDoc);

            //Total the occurrences across every part in the document
            foreach (OpenXmlPart part in parts)
            {
                count += NumberOccurrencesCustomXMLMarkup(part);
            }

            if (count > 0)
            {
                results.Add(file, count);
            }
        }

        ...
    }

    //Count all instances of custom XML markup within a given part
    static int NumberOccurrencesCustomXMLMarkup(OpenXmlPart part)
    {
        int count = 0;

        if ((part != null) && (part.RootElement != null))
        {
            count += part.RootElement.Descendants<CustomXmlBlock>().Count();
            count += part.RootElement.Descendants<CustomXmlCell>().Count();
            count += part.RootElement.Descendants<CustomXmlRow>().Count();
            count += part.RootElement.Descendants<CustomXmlRuby>().Count();
            count += part.RootElement.Descendants<CustomXmlRun>().Count();
        }
        return count;
    }

The NumberOccurrencesCustomXMLMarkup method simply counts instances of the following SDK classes, each of which represents custom XML markup:

  • CustomXmlBlock
  • CustomXmlCell
  • CustomXmlRow
  • CustomXmlRuby
  • CustomXmlRun

Pretty easy stuff!
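All five classes correspond to the same w:customXml element in the raw markup; the SDK picks the class based on where the element appears (block level, run level, table row, and so on). As a cross-check, the same count can be sketched with LINQ to XML on a part's raw XML. This is not how the tool does it: the CustomXmlCounter class is hypothetical, the sample markup is made up for illustration, and the sketch assumes the part uses the usual (transitional) WordprocessingML namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper that counts w:customXml elements without the SDK.
public static class CustomXmlCounter
{
    // WordprocessingML main namespace (the "w" prefix).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts w:customXml elements at any depth in the given part XML.
    public static int Count(string partXml)
    {
        return XDocument.Parse(partXml).Descendants(W + "customXml").Count();
    }
}

public static class CustomXmlCounterDemo
{
    public static void Main()
    {
        // Made-up sample: one block-level w:customXml wrapping a paragraph,
        // and one run-level w:customXml inside that paragraph.
        string xml =
            "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
            "<w:body>" +
            "<w:customXml w:uri='urn:example' w:element='invoice'>" +
            "<w:p><w:customXml w:uri='urn:example' w:element='total'>" +
            "<w:r><w:t>42</w:t></w:r>" +
            "</w:customXml></w:p>" +
            "</w:customXml>" +
            "<w:p><w:r><w:t>plain text</w:t></w:r></w:p>" +
            "</w:body></w:document>";

        // Prints "Custom XML markup elements: 2"
        Console.WriteLine("Custom XML markup elements: {0}", CustomXmlCounter.Count(xml));
    }
}
```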

Step 5 – Reporting the results

The last step in the solution is to report the results as a text-based, tab-delimited log file. Here is the code snippet to accomplish this task:

    static void Main(string[] args)
    {
        ...
        foreach (string file in files)
        {
            ...
        }

        if (addFinalLine)
        {
            Console.WriteLine();
        }
        File.Delete("output.log");

        Console.WriteLine("Writing results file 'output.log'...");

        //Output results to log file
        using (StreamWriter sw = new StreamWriter("output.log"))
        {
            sw.WriteLine("Scanned folder and sub-folders: {0}", di.FullName);
            sw.WriteLine();
            sw.WriteLine("Files with Custom XML Markup: [{0}/{1}]", results.Count, files.Count);

            if (results.Count != 0)
            {
                sw.WriteLine("File Name\tCustom XML Markup References");
                foreach (string filename in results.Keys)
                {
                    sw.WriteLine("{0}\t{1}", filename, results[filename]);
                }
            }

            if (errors.Count != 0)
            {
                sw.WriteLine();
                sw.WriteLine("Errors reported: [{0}]", errors.Count);
                sw.WriteLine("File Name\tError message");
                foreach (string filename in errors.Keys)
                {
                    sw.WriteLine("{0}\t{1}", filename, errors[filename]);
                }
            }
        }
        Console.WriteLine();
        Console.WriteLine("Scan completed.");
        Console.WriteLine();
        Console.WriteLine("Files with Custom XML Markup: [{0}/{1}]", results.Count, files.Count);
        Console.WriteLine("Errors reported: [{0}]", errors.Count);
        Console.WriteLine();
        Console.WriteLine("See 'output.log' for more details.");
        Console.WriteLine();
        Console.WriteLine("Press any key to continue...");
        Console.ReadKey();

        ...
    }
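If you want to reuse the report format elsewhere, one option is to write against a TextWriter instead of a fixed file name. The sketch below is a simplified variant and not the tool's code: the ReportWriter class is hypothetical, and it covers only the results portion of the log.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper that writes the results section of the log
// to any TextWriter (a file, Console.Out, or a StringWriter).
public static class ReportWriter
{
    public static void Write(TextWriter writer, IDictionary<string, int> results, int totalFiles)
    {
        writer.WriteLine("Files with Custom XML Markup: [{0}/{1}]", results.Count, totalFiles);

        if (results.Count != 0)
        {
            // A tab between the columns lets the log open cleanly in a spreadsheet.
            writer.WriteLine("File Name\tCustom XML Markup References");
            foreach (KeyValuePair<string, int> entry in results)
            {
                writer.WriteLine("{0}\t{1}", entry.Key, entry.Value);
            }
        }
    }
}

public static class ReportWriterDemo
{
    public static void Main()
    {
        var results = new Dictionary<string, int>
        {
            { "contract.docx", 12 },
            { "template.dotx", 3 }
        };

        // Write the same report to the console; a StreamWriter would work the same way.
        ReportWriter.Write(Console.Out, results, 40);
    }
}
```

Because the method only depends on TextWriter, the same code can feed a log file in production and a StringWriter when you want to check the output in a test.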

End Result

Running this code on a directory, we end up with a tab-delimited log file that lists all the files that contain custom XML markup. Here is a screenshot of what the log file looks like when opened in Microsoft Excel:

[Screenshot: output.log opened in Microsoft Excel]

Hopefully this solution shows you how easy it is to interrogate an Open XML file with the Open XML SDK.

Brian Jones + Zeyad Rajabi

Comments

  • Anonymous
    January 27, 2010
    It was my understanding that Microsoft made the decision to remove custom XML from files: http://www.computerworld.com/s/article/9142627/Microsoft_yanks_Custom_XML_from_Word_offers_patch_to_OEMs Does this create any issues with this detection tool?

  • Anonymous
    January 27, 2010
    Michael, the patched version of Word will ignore custom XML markup when .DOCX, .DOCM, and .XML files are loaded into Word, and the markup will not be re-saved into the document. The purpose of the tool is to help organizations identify any solutions that may be affected. The presence of documents containing the markup would indicate that a solution may be in place that is creating the markup in question. If a user has opened and re-saved a document containing the markup with a patched version of Word, the markup would no longer be present, and the document would not be identified by the tool. You can read more about the use case for the scanning tool here: http://blogs.technet.com/gray_knowlton/archive/2010/01/20/scanning-tool-to-detect-custom-xml-markup-in-docx-and-docm-files.aspx

  • Anonymous
    January 29, 2010
    That does look simple! Sadly, the SDK cannot be installed on Mac OS or Linux, as it is distributed as an MSI.

  • Anonymous
    February 21, 2010
    Brian, the provided 'solution' won't help those who store their documents in so-called 'non-filesystem' locations, such as databases or embedded scenarios (OLE embedding, etc.). Last week I demoed a possible solution for you that is independent of these scenarios. I'll contact you or John D. to see if there is an option to work this out and maybe put it on CodePlex for the community. It needs some extra work before it can be provided to the public, and I need some information for that.