Finding Open XML Errors with Open XML SDK Validation

In a previous post, I gave you an overview of the functionality added to the Open XML SDK 2.0 August 2009 CTP. Today, I want to take a deep dive into the schema-level and semantic-level validation support within the SDK. Specifically, I am going to show you the Open XML SDK code needed to validate your Open XML files.

If you've played around with manipulating Open XML files, there is a good chance that at some point your resulting document was considered invalid or corrupt by the Office applications. You've probably even seen one of these dialogs:

What do you do when you get into this state? A lot of the time, the application error dialogs don't really help you debug the issue. That's where the Open XML SDK can help you out. With just a few lines of code, you can identify key pieces of information that tell you what the error is and where to find it within the package.

Validation with the Open XML SDK 2.0 is accomplished via the OpenXmlValidator class. This class allows you to enumerate all the errors within a file, where each error is represented by a ValidationErrorInfo object. The ValidationErrorInfo class stores the following information:

  • A user-friendly description of the error
  • An XPath to the exact location of the error
  • The part where this error exists
  • Other elements or parts that are related to this error

Here is a code snippet you can reuse to validate Word documents:

    // Requires: using System; using DocumentFormat.OpenXml.Packaging;
    //           using DocumentFormat.OpenXml.Validation;
    try
    {
        OpenXmlValidator validator = new OpenXmlValidator();
        int count = 0;

        // The using block closes the package once validation is finished.
        using (WordprocessingDocument doc = WordprocessingDocument.Open("InvalidFile.docx", true))
        {
            foreach (ValidationErrorInfo error in validator.Validate(doc))
            {
                count++;
                Console.WriteLine("Error " + count);
                Console.WriteLine("Description: " + error.Description);
                Console.WriteLine("Path: " + error.Path.XPath);
                Console.WriteLine("Part: " + error.Part.Uri);
                Console.WriteLine("-------------------------------------------");
            }
        }

        Console.ReadKey();
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }

The same code can be used to validate Excel and PowerPoint documents. All you need to do is change the Open method to be one of the following:

foreach (ValidationErrorInfo error in validator.Validate(PresentationDocument.Open("InvalidFile.pptx", true)))

or

foreach (ValidationErrorInfo error in validator.Validate(SpreadsheetDocument.Open("InvalidFile.xlsx", true)))
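Since all three document classes derive from OpenXmlPackage, the swap can also be done at run time. The following is only a rough sketch: the PackageValidator class and its OpenPackage helper are hypothetical names made up for this illustration, and it opens each file read-only on the assumption that validation alone does not need write access.

```csharp
using System;
using System.IO;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

public static class PackageValidator
{
    // Hypothetical helper: picks the SDK document class from the file extension.
    public static OpenXmlPackage OpenPackage(string path)
    {
        switch (Path.GetExtension(path).ToLowerInvariant())
        {
            case ".docx": return WordprocessingDocument.Open(path, false);
            case ".xlsx": return SpreadsheetDocument.Open(path, false);
            case ".pptx": return PresentationDocument.Open(path, false);
            default:
                throw new NotSupportedException("Unsupported extension: " + path);
        }
    }

    public static void Main(string[] args)
    {
        // "InvalidFile.docx" is a placeholder; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "InvalidFile.docx";
        OpenXmlValidator validator = new OpenXmlValidator();

        using (OpenXmlPackage package = OpenPackage(path))
        {
            foreach (ValidationErrorInfo error in validator.Validate(package))
            {
                Console.WriteLine(error.Part.Uri + ": " + error.Description);
            }
        }
    }
}
```

The extension check is a convenience for this sketch only; it says nothing about what the file actually contains.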

Pretty simple stuff! If you want to jump straight into the code, feel free to download this solution here.

Let's walk through validating and fixing a corrupt Word document. Given this corrupt document, the Open XML SDK detects the following errors:

Let's look at each of these errors.

Error 1

  • Description: The attribute 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:rsidR' has invalid value '006B4C'. The actual length according to datatype 'hexBinary' is not equal to the specified length. The expected length is 4.
  • Path: /w:document[1]/w:body[1]/w:p[1]
  • Part: /word/document.xml

Let's take a look at the XML within the main document part:

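The error indicates that the length of the value for rsidR is not correct. The length of a hexBinary value is counted in bytes, so an expected length of 4 means eight hex digits, while 006B4C supplies only three bytes. We can fix this issue by changing the value to 00006B4C.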
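If you would rather not edit the part by hand, the same padding can be scripted. Here is a minimal sketch using LINQ to XML rather than the SDK; the RsidFixer class and the XML fragment are hypothetical stand-ins for the real /word/document.xml, and it assumes that left-padding with zeros is the fix you want.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RsidFixer
{
    // WordprocessingML main namespace.
    const string WNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace W = WNs;

    // Left-pads w:rsidR values shorter than 8 hex digits (4 bytes).
    // Returns the number of attributes changed.
    public static int PadShortRsids(XElement root)
    {
        var shortRsids = root.DescendantsAndSelf()
                             .Attributes(W + "rsidR")
                             .Where(a => a.Value.Length < 8)
                             .ToList();

        foreach (XAttribute rsid in shortRsids)
        {
            rsid.Value = rsid.Value.PadLeft(8, '0');
        }
        return shortRsids.Count;
    }

    public static void Main()
    {
        // Hypothetical fragment; the real part contains much more markup.
        XElement doc = XElement.Parse(
            "<w:document xmlns:w=\"" + WNs + "\"><w:body>" +
            "<w:p w:rsidR=\"006B4C\"/></w:body></w:document>");

        int changed = PadShortRsids(doc);
        Console.WriteLine(changed + " value(s) fixed");
        Console.WriteLine(doc.Descendants(W + "p").First().Attribute(W + "rsidR").Value);
    }
}
```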

Error 2

  • Description: Element 'DocumentFormat.OpenXml.Wordprocessing.Footnote' referenced by 'footnoteReference@id' does not exist in part '/word/footnotes.xml'. The reference value is '3'.
  • Path: /w:document[1]/w:body[1]/w:p[6]/w:r[2]/w:footnoteReference[1]
  • Part: /word/document.xml

Let's take a look at the XML within the main document part:

Let's take a look at the XML within the footnotes part:

The error indicates that there is a reference to a footnote using the value "3", but no such value exists in the footnotes part. Let's go ahead and change the footnoteReference to have a value of "2".
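To get a feel for what this semantic check is doing, here is a rough sketch that cross-checks the two parts with LINQ to XML. It is not how the SDK implements the rule; the FootnoteChecker class and both XML fragments are hypothetical and only mirror the situation described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FootnoteChecker
{
    // WordprocessingML main namespace.
    const string WNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace W = WNs;

    // Returns the ids used by w:footnoteReference that no w:footnote defines.
    public static List<string> FindDanglingReferences(XElement document, XElement footnotes)
    {
        var defined = new HashSet<string>(
            footnotes.Descendants(W + "footnote")
                     .Select(f => (string)f.Attribute(W + "id")));

        return document.Descendants(W + "footnoteReference")
                       .Select(r => (string)r.Attribute(W + "id"))
                       .Where(id => !defined.Contains(id))
                       .ToList();
    }

    public static void Main()
    {
        // Hypothetical fragments standing in for the two parts.
        XElement document = XElement.Parse(
            "<w:document xmlns:w=\"" + WNs + "\"><w:body><w:p><w:r>" +
            "<w:footnoteReference w:id=\"3\"/></w:r></w:p></w:body></w:document>");
        XElement footnotes = XElement.Parse(
            "<w:footnotes xmlns:w=\"" + WNs + "\">" +
            "<w:footnote w:id=\"1\"/><w:footnote w:id=\"2\"/></w:footnotes>");

        foreach (string id in FindDanglingReferences(document, footnotes))
        {
            Console.WriteLine("No footnote with id " + id);
        }
    }
}
```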

Error 3

  • Description: Attribute 'id' should have unique value. Its current value '1' duplicates with others.
  • Path: /w:endnotes[1]/w:endnote[4]
  • Part: /word/endnotes.xml

Let's take a look at the XML within the endnotes part:

The error indicates that more than one endnote specifies the same id value. Let's go ahead and change the values to be unique.
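A uniqueness check like this one is easy to approximate yourself. The sketch below groups endnotes by id with LINQ to XML; the EndnoteChecker class and the fragment (including its id values) are hypothetical, chosen so that the fourth endnote repeats the id "1" as in the error above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class EndnoteChecker
{
    // WordprocessingML main namespace.
    const string WNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace W = WNs;

    // Returns each w:id value that appears on more than one w:endnote.
    public static List<string> FindDuplicateIds(XElement endnotes)
    {
        return endnotes.Elements(W + "endnote")
                       .GroupBy(e => (string)e.Attribute(W + "id"))
                       .Where(g => g.Count() > 1)
                       .Select(g => g.Key)
                       .ToList();
    }

    public static void Main()
    {
        // Hypothetical fragment standing in for /word/endnotes.xml.
        XElement endnotes = XElement.Parse(
            "<w:endnotes xmlns:w=\"" + WNs + "\">" +
            "<w:endnote w:id=\"-1\"/><w:endnote w:id=\"0\"/>" +
            "<w:endnote w:id=\"1\"/><w:endnote w:id=\"1\"/></w:endnotes>");

        foreach (string id in FindDuplicateIds(endnotes))
        {
            Console.WriteLine("Duplicate endnote id: " + id);
        }
    }
}
```

Keep in mind that renumbering an endnote also means updating any endnoteReference in the main document part that points to it.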

End Result

After making these fixes we should be able to open the fixed document with no issues as shown below:

Try out the validation functionality and let us know what you think.

Zeyad Rajabi

Comments

  • Anonymous
    October 06, 2009
    With some errors I get the same log as the information displayed by Microsoft Word: no more, no less. With other errors, however, I get nothing. The code simply falls into an exception. Sample cases are when some part or resource is missing from the document. I am generating documents on the fly with C++ and I really need some better methods than this one. Anyway, thank you for the post.

  • Anonymous
    October 08, 2009
    Thanks for the feedback, Ahmed. Is there any way you can send me a copy of the document you mention above? We want to continue improving our validation functionality, and having the document will be quite useful.