Validate Open XML Documents using the Open XML SDK 2.0

Open XML developers create new documents in a variety of ways: by transforming an existing document into a new one, or by programmatically altering an existing document and saving it back to disk.  In either case, it is valuable to use the Open XML SDK 2.0 to determine whether the new or altered document, spreadsheet, or presentation contains invalid markup.

Validation was particularly useful when I was writing the code to accept tracked revisions and the Open XML WordprocessingML markup simplifier.  I wrote a small program that iterates through all documents in a directory tree, programmatically alters or transforms each document, and then validates the result.  This allowed me to run the code on thousands of documents and make sure that it would not create invalid documents.

The use of the validator is simple:

  • Open your document/spreadsheet/presentation as usual using the Open XML SDK.
  • Instantiate an OpenXmlValidator object (from the DocumentFormat.OpenXml.Validation namespace).
  • Call the OpenXmlValidator.Validate method, passing the open document.  This method returns a collection of ValidationErrorInfo objects.  If the collection is empty, then the document is valid.  You can validate before and after modifying the document.

Here is the simplest code to validate a document.

using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

class Program
{
    static void Main(string[] args)
    {
        // Open the document read-only (second argument is false).
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", false))
        {
            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc);

            // An empty error collection means the document is valid.
            if (!errors.Any())
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
        }
    }
}
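
The same three steps apply to spreadsheets and presentations, because the Validate method accepts any open package.  As a minimal sketch, assuming a file named Book.xlsx exists in the working directory (the file name and class name are placeholders for illustration only), spreadsheet validation might look like this:

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

class SpreadsheetValidation
{
    static void Main(string[] args)
    {
        // "Book.xlsx" is a hypothetical file name used only for illustration.
        using (SpreadsheetDocument spreadsheet =
            SpreadsheetDocument.Open("Book.xlsx", false))
        {
            OpenXmlValidator validator = new OpenXmlValidator();

            // Materialize the results once so they can be counted.
            var errors = validator.Validate(spreadsheet).ToList();

            Console.WriteLine(errors.Count == 0
                ? "Spreadsheet is valid"
                : "Spreadsheet is not valid");
        }
    }
}
```

For a presentation, the equivalent would presumably swap in PresentationDocument.Open; the validator code itself stays the same.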

While debugging your code, it is helpful to know exactly where each error is.  You can iterate through the errors, printing:

  • The content type for the part that contains the error.
  • An XPath expression that identifies the element that caused the error.
  • An error message.

Here is code to do that:

using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", false))
        {
            OpenXmlValidator validator = new OpenXmlValidator();

            // Materialize the results once, so that counting and printing
            // both work from the same list.
            var errors = validator.Validate(wordDoc).ToList();
            if (errors.Count == 0)
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
            Console.WriteLine();

            // Print the message, the containing part, and the location of each error.
            foreach (var error in errors)
            {
                Console.WriteLine("Error description: {0}", error.Description);
                Console.WriteLine("Content type of part with error: {0}",
                    error.Part.ContentType);
                Console.WriteLine("Location of error: {0}", error.Path.XPath);
            }
        }
    }
}
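
The directory-tree test harness mentioned at the start of this post is not shown here, but a rough sketch of the idea follows.  The ModifyDocument method is a hypothetical placeholder for whatever transform is being tested, and working on a temporary copy of each file is my assumption about how to keep the test corpus intact; neither is taken from the original program.

```csharp
using System;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

class BatchValidator
{
    // Hypothetical placeholder: put the modification or transform under test here.
    static void ModifyDocument(WordprocessingDocument wordDoc)
    {
    }

    static void Main(string[] args)
    {
        // Root of the directory tree; defaults to the current directory.
        string root = args.Length > 0 ? args[0] : ".";
        OpenXmlValidator validator = new OpenXmlValidator();
        int invalidCount = 0;

        foreach (string path in
            Directory.GetFiles(root, "*.docx", SearchOption.AllDirectories))
        {
            // Work on a temporary copy so the original document is never altered.
            string copy = Path.Combine(Path.GetTempPath(),
                Guid.NewGuid().ToString() + ".docx");
            try
            {
                File.Copy(path, copy);
                using (WordprocessingDocument wordDoc =
                    WordprocessingDocument.Open(copy, true))
                {
                    ModifyDocument(wordDoc);
                    var errors = validator.Validate(wordDoc).ToList();
                    if (errors.Count > 0)
                    {
                        invalidCount++;
                        Console.WriteLine("{0}: {1} error(s)", path, errors.Count);
                    }
                }
            }
            catch (Exception e)
            {
                // Report unreadable packages instead of stopping the whole run.
                Console.WriteLine("{0}: could not be processed ({1})",
                    path, e.Message);
            }
            finally
            {
                File.Delete(copy);
            }
        }

        Console.WriteLine("Documents with validation errors: {0}", invalidCount);
    }
}
```

Validating each document once before calling ModifyDocument as well would help separate errors that were already present in the source document from errors introduced by the code under test.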

As a developer, you will want to open a document, modify it in some fashion, and then validate that your modifications were correct.  The following example opens a document for writing, modifies it to make it invalid, and then validates.  To make an invalid document, it adds a text element (w:t) as a child element of a paragraph (w:p) instead of a run (w:r).

This approach to document validation works if you are using the Open XML SDK strongly-typed object model.  It also works if you are using another XML programming technology, such as LINQ to XML.  The following example shows the document modification code written using two approaches.

using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

public static class MyExtensions
{
    // Returns the part's XML as an XDocument, caching it as an annotation
    // on the part so that later calls return the same instance.
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument partXDocument = part.Annotation<XDocument>();
        if (partXDocument != null)
            return partXDocument;
        using (Stream partStream = part.GetStream())
        using (XmlReader partXmlReader = XmlReader.Create(partStream))
            partXDocument = XDocument.Load(partXmlReader);
        part.AddAnnotation(partXDocument);
        return partXDocument;
    }

    // Writes the cached XDocument back into the part.
    public static void PutXDocument(this OpenXmlPart part)
    {
        XDocument partXDocument = part.GetXDocument();
        if (partXDocument != null)
        {
            using (Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write))
            using (XmlWriter partXmlWriter = XmlWriter.Create(partStream))
                partXDocument.Save(partXmlWriter);
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", true))
        {
            // Open XML SDK strongly-typed object model code that modifies a document,
            // making it invalid.
            wordDoc.MainDocumentPart.Document.Body.InsertAt(
                new Paragraph(
                    new Text("Test")), 0);

            // LINQ to XML code that modifies a document, making it invalid.
            // Note that the WordprocessingML namespace URI begins with http, not https.
            XDocument d = wordDoc.MainDocumentPart.GetXDocument();
            XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
            d.Descendants(w + "body").First().AddFirst(
                new XElement(w + "p",
                    new XElement(w + "t", "Test")));
            wordDoc.MainDocumentPart.PutXDocument();

            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc).ToList();
            if (errors.Count == 0)
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
            Console.WriteLine();
            foreach (var error in errors)
            {
                Console.WriteLine("Error description: {0}", error.Description);
                Console.WriteLine("Content type of part with error: {0}",
                    error.Part.ContentType);
                Console.WriteLine("Location of error: {0}", error.Path.XPath);
            }
        }
    }
}

When you run this example, it produces the following output:

Document is not valid

Error description: The element has invalid child element
'http://schemas.openxmlformats.org/wordprocessingml/2006/main:t'.
List of possible elements expected:
<http://schemas.openxmlformats.org/wordprocessingml/2006/main:pPr>.
Content type of part with error:
application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
Location of error: /w:document[1]/w:body[1]/w:p[1]
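
The error goes away when the text element is wrapped in a run, which is the structure the schema expects (w:p contains w:r, which contains w:t).  The following is a sketch of the corrected strongly-typed insertion, assuming the same Test.docx file as above and assuming it was valid before the change:

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", true))
        {
            // Valid structure: paragraph (w:p) -> run (w:r) -> text (w:t).
            wordDoc.MainDocumentPart.Document.Body.InsertAt(
                new Paragraph(
                    new Run(
                        new Text("Test"))), 0);

            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc).ToList();
            Console.WriteLine(errors.Count == 0
                ? "Document is valid"
                : "Document is not valid");
        }
    }
}
```

The LINQ to XML equivalent would nest an XElement named w + "r" between the w + "p" and w + "t" elements in the same way.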

Comments

  • Anonymous
    March 13, 2010
    Hi Eric, could you say a bit about what, in the eyes of the Open XML SDK 2.0, constitutes a "valid document"? The documents created by Microsoft Office 2007 do not always validate against ISO/IEC 29500, but Open XML SDK validation comes out fine. I have implemented an OOXML validator using System.IO.Packaging and the schemas for ISO/IEC 29500, and it reports different results than the SDK. The validator is available at http://is29500validator.codeplex.com .

  • Anonymous
    March 14, 2010
    Hi Jesper, Doug Mahugh will be covering SDK validation on his blog this week – if you can provide a specific example of a document or repro steps, I’ll make sure he addresses it in his blog post. -Eric

  • Anonymous
    May 06, 2012
    Thanks for clear explanation!

  • Anonymous
    July 04, 2013
    For now, if you have missing references in your docx and try to open it with WordprocessingDocument.Open(filepath, true), you'll get an InvalidOperationException with a message that is not very informative (something like "package contains a reference to missing part"). It would be much more helpful if we could see the file name of the missing part.