Open XML File Formats: What is it, and how can I get started?
While I was at Tech Ed, a lot of people were interested in finding a way to programmatically generate documents without Interop. Some of the business scenarios involved generating over 5,000 documents, and some IT professionals were interested in finding the best option. A great option to solve this business need is the Open XML File Formats.
Some people have been following the news and are even ahead of most of us, already building solutions that generate documents using the Open XML File Formats. Others are not familiar with this technology and want to learn more about it, so here is a quick introduction for those of you who want to learn what it is and how you can get started. I have to warn you that this is going to be a long blog entry, but I promise it's worth the read.
What is it?
The new formats improve file and data management, data recovery, and interoperability with line-of-business systems. They extend what is possible with the binary files of earlier versions. Any application that supports XML can access and work with data in the new file format. The application does not need to be part of the Microsoft Office system or even a Microsoft product. Users can also use standard transformations to extract or repurpose the data. In addition, security concerns are drastically reduced because the information is stored in XML, which is essentially plain text. Thus, the data can pass through corporate firewalls without hindrance.
The new Open XML File Formats take advantage of the Open Packaging Conventions, which describe the method for packaging information in a file format and describe metadata, parts, and relationships. The new Open XML Format, with a few minor exceptions, is written entirely in XML and is contained in a .zip file. This creates significant advantages over the old binary file format:
- The file size is much smaller because of ZIP compression.
- The file is much more robust because it is broken up into different document parts. Should one part become damaged (for example, a part describing headers), the rest of the document remains intact and still opens successfully.
- The file is easier to work with programmatically because of the new structure. For example, it is easier to access embedded content, such as images, because they are stored in their native format inside the file.
- Custom XML is also easier to work with because it is stored in its own part, separate from the XML that describes the bulk of a document.
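To make the point about embedded content concrete, here is a minimal Python sketch (Python is used purely for illustration) that pulls images out of a package with nothing but a ZIP reader. The function name `extract_images` and the list of extensions are assumptions made for this example, and selecting entries by file extension is a shortcut: a more careful tool would follow the package relationships instead.

```python
import io
import posixpath
import zipfile

# Assumption for this sketch: these extensions are treated as "images".
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".emf", ".wmf"}


def extract_images(package):
    """Return {entry name: raw bytes} for every ZIP entry that looks like an image.

    `package` may be a file path or a binary file-like object.
    """
    images = {}
    with zipfile.ZipFile(package) as zf:
        for name in zf.namelist():
            extension = posixpath.splitext(name)[1].lower()
            if extension in IMAGE_EXTENSIONS:
                # The bytes are the image in its native format, ready to save as-is.
                images[name] = zf.read(name)
    return images
```

Because the images are stored in their native format, the returned bytes can be written straight to disk with the same extension.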
The old binary file format was created when priorities in software differed from the priorities of today. Back then, the ability to transfer a Word document from computer to computer using a floppy disc ranked very high, and the tight structure of a binary format worked well. As software advanced, other priorities became clear, such as the ability to write code against a file format and make it as robust as possible. XML is a clear solution.
Microsoft began to address this issue in previous versions of Microsoft Office by introducing SpreadsheetML and WordprocessingML. However, only now, with the 2007 release of Microsoft Office, have the goals that were conceived as far back as 1999 been accomplished fully. By including the XML File Format inside a ZIP container, the benefit of a small compressed file format is also realized. Excel 2007 and PowerPoint 2007 share this new file format technology, described by the Open Packaging Conventions. Together, the shared formats are called the Microsoft Office Open XML Formats. The new Word 2007 XML Format is the default file format, although the old binary file format is still available in the 2007 Microsoft Office system.
An easy way to look inside the new file format is to save a Word 2007 document in the new default format and then rename the file with a .zip extension. By double-clicking the renamed file, you can open and look at its contents. Inside the file, you can see the document parts that make up the file, along with the relationships that describe how the parts interact with one another. However, it is important to note that, with a few exceptions defined within the Open Packaging Conventions, the actual file directory structure is arbitrary. The relationships of the files within the package, not the file structure, are what determine file validity. You can rearrange and rename the parts of a Word 2007 file inside its .zip container if you update the relationships properly so that the document parts continue to relate to one another as designed. If the relationships are accurate, the file opens without error. The initial file structure in a Word 2007 file is simply the default structure created by Word. This default structure enables developers to determine the composition of Word 2007 files easily.
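The same inspection can be done from code without renaming anything. The following is a minimal Python sketch, assuming only that the package contains the package-level relationship item `_rels/.rels` defined by the Open Packaging Conventions. The function names `list_parts` and `read_root_relationships` are made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by relationship items (.rels) in the Open Packaging Conventions.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def list_parts(package):
    """Return the names of all ZIP entries in the package, in archive order."""
    with zipfile.ZipFile(package) as zf:
        return zf.namelist()


def read_root_relationships(package):
    """Return (Id, Type, Target) tuples read from the package-level _rels/.rels item."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
    return [
        (rel.get("Id"), rel.get("Type"), rel.get("Target"))
        for rel in root.findall(f"{{{RELS_NS}}}Relationship")
    ]
```

Both functions accept a file path or a binary file-like object. The relationship targets, not the folder names, tell you where each part lives, which is why a package whose parts were renamed still opens as long as the targets were updated.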
How can I get started?
The easiest way to modify a Word 2007 XML file programmatically is to use the System.IO.Packaging namespace in the Microsoft® Windows® Software Development Kit (SDK) for Beta 2 of Windows Vista and WinFX Runtime Components. Using this technology, you can easily update header and footer files programmatically across numerous Word 2007 documents stored on a server.
We recently published some resources that might interest you if you are trying to learn more about the Open XML File Formats:
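As a language-neutral illustration of the idea (this is not the System.IO.Packaging API), here is a minimal Python sketch that swaps the content of one part, such as a header, for new content. The function name `replace_part` is hypothetical, and a part name like `word/header1.xml` is only the name Word happens to use by default, so treat it as an assumption. Python's `zipfile` module cannot rewrite an entry in place, so the sketch copies every entry into a new package.

```python
import io
import zipfile


def replace_part(src_package, part_name, new_content):
    """Return the bytes of a new package in which one part's content is replaced.

    `src_package` may be a file path or a binary file-like object.
    `new_content` may be str (stored as UTF-8) or bytes.
    Raises KeyError if the package has no entry called `part_name`.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(src_package) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        if part_name not in src.namelist():
            raise KeyError(f"no part named {part_name!r} in package")
        for info in src.infolist():
            # Copy every entry unchanged except the one being replaced.
            if info.filename == part_name:
                data = new_content
            else:
                data = src.read(info.filename)
            dst.writestr(info.filename, data)
    return out.getvalue()
```

To update many documents on a server, you would call this once per file and write the returned bytes back out. The relationships do not need to change because the part keeps its name.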
- Introducing the Microsoft Office (2007) Open XML File Formats: Learn the benefits of the Microsoft Office (2007) Open XML Formats. Users can exchange data between Office applications and enterprise systems using XML and ZIP technologies. Documents are universally accessible. And, you reduce the risk of damaged files.
- Walkthrough: Word 2007 XML Format: I wrote this article to provide a deep dive into the Word 2007 XML Format architecture, key components, and ways in which you can programmatically modify content.
- Setting Word Document Properties the Office 2007 Way: Ken Getz wrote this great MSDN Magazine article on how to extract document properties using the Open XML File Formats.
- Open XML File Format Code Snippets for Visual Studio 2005: Kevin Boske and Ken Getz have been working together for months to create a set of code samples in C# and VB.NET (Visual Studio 2005 Code Snippets) that will help you to accomplish the following tasks:
Open XML Snippets
- Open XML: Get OfficeDocument Part: Given an Open XML file, retrieve the part with the http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument relationship type.
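The downloadable snippets are in C# and VB.NET. The following is not one of them, just a minimal Python sketch of the same lookup, assuming the relationship type URI from the published Open XML specification (it uses the `http` scheme). The function name `get_office_document_part` is made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOCUMENT_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)


def get_office_document_part(package):
    """Return (part_name, part_bytes) for the main document part, or None if absent."""
    with zipfile.ZipFile(package) as zf:
        rels = ET.fromstring(zf.read("_rels/.rels"))
        for rel in rels.findall(f"{{{RELS_NS}}}Relationship"):
            if rel.get("Type") == OFFICE_DOCUMENT_REL:
                # Package-level targets resolve against the package root, so a
                # leading slash can simply be dropped to get the ZIP entry name.
                part_name = rel.get("Target").lstrip("/")
                return part_name, zf.read(part_name)
    return None
```

For a Word document the result is usually `word/document.xml`, but the point of going through the relationship is that the code keeps working if the part has a different name.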
Microsoft Office Excel Snippets
- Excel: Add Custom UI: This snippet adds a custom UI Ribbon part to a given workbook.
- Excel: Delete Comments by a specific User: This snippet deletes all comments from a given user from a given workbook.
- Excel: Delete Worksheet: This snippet deletes the specified worksheet from within a given workbook and resets the selected worksheet to the next one on the list. Returns true if successful, false if failure.
- Excel: Delete Excel 4.0 Macro sheets: This snippet deletes all the Excel 4.0 Macro (XLM) sheets from a given workbook.
- Excel: Retrieve hidden rows or columns: This snippet returns a list of hidden row numbers or column names from a given workbook and worksheet.
- Excel: Export Chart: Given a workbook and title of a chart, this snippet exports the chart as a Chart (.crtx) file.
- Excel: Get Cell Value: Given a workbook, worksheet and cell address, this snippet returns the value of the cell as a string.
- Excel: Get Comments as XML: Given a workbook, this snippet returns all the comments as an XmlDocument.
- Excel: Get Hidden Worksheets: This snippet returns a list containing the name and type of all hidden sheets in a given workbook.
- Excel: Get Worksheet Information: This snippet returns a list containing the name and type of all sheets in a given workbook.
- Excel: Get Cell for Reading: Given a workbook, worksheet and cell address, this snippet demonstrates how to navigate to the cell to retrieve its contents. The cell must exist for the function to find it.
- Excel: Get Cell for Writing: Given a workbook, worksheet and cell address, this snippet demonstrates how to navigate to the cell to set its value. If the cell does not exist, the snippet creates it.
- Excel: Insert Custom XML: Given a workbook and a custom XML value, this snippet inserts the custom XML into the workbook.
- Excel: Insert Header or Footer: Given a workbook, worksheet and text to insert and a header or footer type, this snippet inserts the header or footer with the given text into the worksheet.
- Excel: Insert a Numeric Value into a Cell: Given a workbook, worksheet, cell address and numeric value, this snippet inserts the value into the cell.
- Excel: Insert a String Value into a Cell: Given a workbook, worksheet, cell address and string value, this snippet inserts the value into the cell.
- Excel: Set Recalc Option: Given a workbook and a RecalcOption, this snippet sets the recalculation property to the new option.
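To give a feel for what a snippet like "Get Cell Value" has to do, here is a minimal Python sketch (not the actual snippet). It assumes the SpreadsheetML namespace from the published specification, that the caller already knows the worksheet part name (for example `xl/worksheets/sheet1.xml`, Excel's default), and that shared strings live in `xl/sharedStrings.xml`. A complete solution would resolve both names through the workbook relationships. The function name `get_cell_value` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SS_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"s": SS_NS}

# Assumption: Excel's default location for the shared string table.
SHARED_STRINGS_PART = "xl/sharedStrings.xml"


def get_cell_value(package, sheet_part, address):
    """Return the value of the cell at `address` as a string, or None if it is absent."""
    with zipfile.ZipFile(package) as zf:
        sheet = ET.fromstring(zf.read(sheet_part))
        cell = sheet.find(f".//s:c[@r='{address}']", NS)
        if cell is None:
            return None
        value = cell.find("s:v", NS)
        if value is None:
            return None
        if cell.get("t") == "s":
            # Shared-string cells store an index into the shared string table.
            table = ET.fromstring(zf.read(SHARED_STRINGS_PART))
            item = table.findall("s:si", NS)[int(value.text)]
            # Join every text run found inside the string item.
            return "".join(t.text or "" for t in item.iter(f"{{{SS_NS}}}t"))
        return value.text
```

Numbers come back as the text stored in the file, so the caller decides how to convert them.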
Microsoft Office PowerPoint Snippets
- PowerPoint: Delete Comments by User: Given a presentation and a user name, this snippet deletes all comments by that user.
- PowerPoint: Delete Slide by Title: Given a presentation and slide title, this snippet deletes the first instance of a slide with that title (titles are not unique).
- PowerPoint: Get Slide Count: This snippet returns the number of slides in a given presentation.
- PowerPoint: Get Slide Titles: Given a presentation, this snippet returns a list of the slide titles in the order presented.
- PowerPoint: Modify Slide Title: Given a presentation, old slide title, and new slide title, this snippet changes the first instance of a slide with the given title to the new value. The snippet returns true if successful, false if not successful.
- PowerPoint: Reorder Slides: Given a presentation, an original position, and a new position, this snippet attempts to place the slide from the original position into the new position within the deck. If the original position is outside the range of the number of slides in the deck, it uses the last slide. If the new position is outside the range of slides in the deck, it puts the selected slide at the end of the deck. The snippet returns the location where the slide was placed, or -1 on failure.
- PowerPoint: Replace Image: Given a presentation, slide title and image file, this snippet replaces the first image on the slide with the given image.
- PowerPoint: Retrieve Slide Location by Title: Given a presentation and a slide title, this snippet returns the 0-based location of the first slide with a matching title.
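As an illustration of how little is needed for something like "Get Slide Count", here is a minimal Python sketch (not the actual snippet). It assumes the PresentationML namespace from the published specification and that the main presentation part is `ppt/presentation.xml`, which is PowerPoint's default name. The function name `get_slide_count` is made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"


def get_slide_count(package, presentation_part="ppt/presentation.xml"):
    """Return the number of slides listed in the presentation's slide ID list."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read(presentation_part))
    # Each slide in the deck has one sldId entry; their order is the slide order.
    return len(root.findall(f"{{{P_NS}}}sldIdLst/{{{P_NS}}}sldId"))
```

Because the order of the `sldId` entries is the order of the deck, reordering slides comes down to moving one of those entries.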
Microsoft Office Word Snippets
- Word: Accept Revisions: Given a document and an author name, this snippet accepts the revisions by that author.
- Word: Add Header: Given a document and a stream containing valid header content, add the stream content as a header in the document.
- Word: Convert DOCM to DOCX: Given a macro-enabled document (.docm), this snippet removes the VBA project and converts the file to a macro-free Word Document (.docx).
- Word: Remove Comments: Given a Word Document, this snippet removes all the comments.
- Word: Remove Headers and Footers: This snippet removes all headers and footers from a given Word document.
- Word: Remove Hidden Text: This snippet removes any hidden text in a given document.
- Word: Replace Style: Given a document and valid header content, this snippet adds the content as a header in the document.
- Word: Retrieve Application Property: Given a document name and an app property, this snippet returns the value of the property.
- Word: Retrieve Core Property: Given a document name and a core property, this snippet returns the value of the property.
- Word: Retrieve Custom Property: Given a document name and a custom property, this snippet returns the value of the property.
- Word: Retrieve Table of Contents: Given a document name, this snippet returns a table of contents as an XmlDocument.
- Word: Set Application Property: This snippet sets a property’s value given a document name, application property and value. The snippet returns the old value if successful.
- Word: Set Core Property: Given a document name, a core property, and property value, this snippet sets the property value.
- Word: Set Custom Property: Given a document name, a custom property, and a value, this snippet sets the property’s value. If the property does not exist, create it. Returns true if successful, false if not.
- Word: Set Print Orientation: Given a document name, this snippet sets the print orientation for all sections in the document.
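For a flavor of the property snippets, here is a minimal Python sketch of reading a core property (not the actual snippet). It assumes the core properties part is `docProps/core.xml`, which is the Office default. Strictly speaking the part should be located through its package relationship. The function name `get_core_property` and the prefix map are choices made for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the default location Office uses for the core properties part.
CORE_PART = "docProps/core.xml"

# Prefixes a caller may use in `qualified_name`.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def get_core_property(package, qualified_name):
    """Return the text of a core property such as 'dc:title', or None if it is not set."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read(CORE_PART))
    element = root.find(qualified_name, NS)
    return None if element is None else element.text
```

For example, `get_core_property("report.docx", "dc:creator")` would return the author of a hypothetical `report.docx`, and the same function works for Excel and PowerPoint files because the core properties part is shared by all three formats.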
Download them here!
Finally, if you want to stay current with new resources to work with the Open XML File Formats, go to the XML in Office Developer Portal. We launched this portal recently to create a special section of the MSDN Office Developer Center where you will find bloggers, technical articles, code samples, developer documentation, and multimedia presentations on working with XML in Office.
Happy Office XML programming!
Comments
Anonymous
June 26, 2006
I just spotted a great post on the TechEd Bloggers feed from Erika Ehrli - "Open XML File Formats: What...
Anonymous
July 05, 2006
An important announcement was made today by Microsoft, which was a nice surprise for us too:...
Anonymous
July 07, 2006
Yeah, I know long time no B(log). I was real busy those days, but don’t worry I will tell you what was...
Anonymous
July 09, 2006
Remember my last post about ODF?
Things go quite fast in our industry. Indeed, Microsoft has decided...
Anonymous
July 23, 2006
Does somebody know where I can get the detailed Open XML file format reference?
I want to write a program to read and write Open XML files directly.
Anonymous
July 23, 2006
In fact, I have got a PDF version of the reference: Office Open XML Document Interchange Specification / Ecma TC45 Working Draft 1.3 / Public Distribution May 2006.
But I want a .chm version of it.
If somebody knows where it is, please give me a link, thanks a lot!
Anonymous
July 25, 2006
Microsoft PacWest SharePoint Server Newsletter – July 2006
Update on Download Availability of...
Anonymous
August 14, 2006
As far as I have seen, it is not enough to change the custom property value in the docProps/custom.xml file, because the original value of the custom property inserted in the document is still there (in the word/document.xml file). I had to manually update all field codes in the document using Word (F9) before the new custom property value was shown.
Comments?
Anonymous
August 16, 2006
whui1978, you can find the ECMA spec here:
http://www.ecma-international.org/activities/office%20open%20xml%20formats/tc45_fd_xml_docform.zip
I haven't seen a chm.
Lars, read my August blog entries; you will find samples on how to read and write the Office XML File Formats. I am writing a new document, and I change the custom.xml file programmatically without changing anything else. I only replace the file and it works, so I am not sure why you are experiencing this behavior.
Anonymous
September 23, 2007
I have to convert my PPT file to XML using C#, which I was able to do using System.IO.Packaging. Now I have to reassemble the package and get my original PPT back. Please help me find a way of doing it.
Anonymous
September 23, 2007
How can I retrieve my files once I have saved them in the Open XML format?
Anonymous
August 21, 2008
The examples for PresentationML show how to count, reorder, and delete slides, etc. In other words, they work on an existing slide deck. I am trying to append new pptx slides (one at a time) to create a new slide deck, and I don't see an example of how to programmatically (using Open XML) append one pptx to another. If you have any code samples, could you please post them? Thanks. -ss
Anonymous
February 17, 2009
Does anybody know whether it is possible to save a file as ".doc", i.e. Word 2003, using Open XML? I could create a docx file but not a doc. If yes, please let me know how. Thanks & Regards, John
Anonymous
February 24, 2009
I am not able to get the header in the .docx file correctly. I am not able to find the structure of the Word document by renaming it with a .zip extension. I have an image in the header of the .docx, and I want to find the name given to it in the structure. How do I do this?
Anonymous
January 07, 2013
I am using EPPlus/OpenXml to create a binary format (compressed format) of an input .xlsx file. After getting the binary format with reduced size, when I try to convert it back to xlsx, its size does not increase and is the same as the compressed format, which suggests that the file may be corrupted. The code used is as follows.

Step 1. Generate the binary file:

    Public Sub ConvertExcelToBinaryUsingEPPlus(ByVal fileName As String, ByVal outPutBinaryFile As String)
        Dim stream As New FileStream(fileName, FileMode.Open)
        Dim fileInfo As New FileInfo(outPutBinaryFile)
        Dim excelPackage As New ExcelPackage(stream)
        excelPackage.Compression = CompressionOption.Maximum
        excelPackage.SaveAs(fileInfo)
    End Sub

Step 2. Generate the .xlsx file from the binary file as input:

    Public Sub ConvertBinaryToExcelUsingEPPlus(ByVal binaryFileName As String, ByVal outPutXlsxFile As String)
        Dim stream As New FileStream(binaryFileName, FileMode.OpenOrCreate)
        Dim fileInfo As New FileInfo(outPutXlsxFile)
        Dim excelPackage As New ExcelPackage(stream)
        excelPackage.SaveAs(fileInfo)
    End Sub

Please suggest.