Transforming Open XML Documents to Flat OPC Format

Transforming Open XML documents using XSLT is an interesting scenario, but before we can do so, we need to convert the Open XML document into the Flat OPC format.  We then perform the XSLT transform, producing a new file in the Flat OPC format, and then convert back to Open XML (OPC) format.  This post is one in a series of four posts that present this approach to transforming Open XML documents using XSLT.  The four posts are:

Transforming Open XML Documents using XSLT

Presents an overview of the transformation process of Open XML documents using XSLT, and why this is important.  Also presents the ‘Hello World’ XSLT transform of an Open XML document.

Transforming Open XML Documents to Flat OPC Format (This Post)

This post describes how to convert an Open XML (OPC) document into a Flat OPC document, and presents the C# function, OpcToFlatOpc.

Transforming Flat OPC Format to Open XML Documents

This post describes the process of conversion of a Flat OPC file back to an Open XML document, and presents the C# function, FlatToOpc.

The Flat OPC Format

Presents a description and examples of the Flat OPC format.

This blog is inactive.
New blog: EricWhite.com/blog

About the Code

The code presented in this post uses LINQ to XML and System.IO.Packaging to perform the conversion to Flat OPC.

The signature of the function to convert from an Open XML document to Flat OPC is:

static XDocument OpcToFlatOpc(string path);

You pass as an argument the path to the Open XML document.  The method returns an XDocument object, which you can then modify as necessary, transform using XSLT, serialize to the standard output, or save to a file.
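As one illustration of the "transform using XSLT" option, here is a minimal sketch that applies a stylesheet to an XDocument such as the one OpcToFlatOpc returns. The FlatOpcXslt class, its Transform method, and the IdentityXslt stylesheet are hypothetical names introduced for this example only; they are not part of the attached code. The identity stylesheet simply copies its input and stands in for a real transform.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

static class FlatOpcXslt
{
    // Hypothetical stand-in stylesheet: copies every node and attribute unchanged.
    public const string IdentityXslt =
        @"<xsl:stylesheet version='1.0'
              xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
            <xsl:template match='@*|node()'>
              <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
            </xsl:template>
          </xsl:stylesheet>";

    // Hypothetical helper: runs the given XSLT 1.0 markup over an XDocument
    // and returns the result as a new XDocument.
    public static XDocument Transform(XDocument input, string xsltMarkup)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(xsltMarkup)))
            xslt.Load(styleReader);

        var result = new XDocument();
        // The result tree is populated once the writer is disposed.
        using (XmlReader reader = input.CreateReader())
        using (XmlWriter writer = result.CreateWriter())
            xslt.Transform(reader, writer);
        return result;
    }
}
```

Assuming the method above, a call might look like FlatOpcXslt.Transform(OpcToFlatOpc("Test.docx"), FlatOpcXslt.IdentityXslt), after which the result can be saved with XDocument.Save.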

The code to convert a binary part to a base 64 string uses the System.Convert.ToBase64String method.  The base 64 string needs to be broken up into lines of 76 characters (see The Flat OPC Format for more detail).  The code uses the technique described in Chunking a Collection into Groups of Three to do the chunking.
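If the LINQ-based chunking is hard to follow, the same line breaking can be sketched with a plain loop. This is an alternative illustration, not the code used in this post; Base64Chunker and ToChunkedBase64 are hypothetical names introduced here.

```csharp
using System;
using System.Text;

static class Base64Chunker
{
    // Hypothetical helper: encodes the bytes as base 64 and ends each group
    // of up to lineLength characters with Environment.NewLine.
    public static string ToChunkedBase64(byte[] bytes, int lineLength = 76)
    {
        string base64 = Convert.ToBase64String(bytes);
        var sb = new StringBuilder();
        for (int i = 0; i < base64.Length; i += lineLength)
        {
            // The last chunk may be shorter than lineLength.
            int count = Math.Min(lineLength, base64.Length - i);
            sb.Append(base64, i, count).Append(Environment.NewLine);
        }
        return sb.ToString();
    }
}
```

The framework also offers Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks), which inserts a line break after every 76 characters, so a hand-written loop is only needed when a different line length or line terminator is wanted.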

If you are not familiar with this style of programming, I recommend that you read this Functional Programming Tutorial.

The conversion code adds the appropriate XML processing instruction to the resulting Flat OPC XML document based on the filename of the source Open XML document.  If the source document has the .docx extension, then the code adds the XML processing instruction for Word.  If the source document has the .pptx extension, then the code adds the XML processing instruction for PowerPoint.  For any other extension, the code adds no processing instruction.

Here is the code to perform the transform (also attached):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.IO;
using System.IO.Packaging;
using System.Xml;
using System.Xml.Schema;

class Program
{
    // Converts one package part into a pkg:part element. XML parts are
    // embedded as XML; all other parts are embedded as base 64 text.
    static XElement GetContentsAsXml(PackagePart part)
    {
        // XML namespace names are compared as exact strings, so this URI
        // must keep the http scheme.
        XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

        if (part.ContentType.EndsWith("xml"))
        {
            using (Stream str = part.GetStream())
            using (StreamReader streamReader = new StreamReader(str))
            using (XmlReader xr = XmlReader.Create(streamReader))
                return new XElement(pkg + "part",
                    new XAttribute(pkg + "name", part.Uri),
                    new XAttribute(pkg + "contentType", part.ContentType),
                    new XElement(pkg + "xmlData",
                        XElement.Load(xr)
                    )
                );
        }
        else
        {
            using (Stream str = part.GetStream())
            using (BinaryReader binaryReader = new BinaryReader(str))
            {
                int len = (int)binaryReader.BaseStream.Length;
                byte[] byteArray = binaryReader.ReadBytes(len);
                // The following expression creates the base 64 string, then
                // chunks it into lines of 76 characters, each followed by a
                // newline.
                string base64String = (System.Convert.ToBase64String(byteArray))
                    .Select
                    (
                        (c, i) => new
                        {
                            Character = c,
                            Chunk = i / 76
                        }
                    )
                    .GroupBy(c => c.Chunk)
                    .Aggregate(
                        new StringBuilder(),
                        (s, i) =>
                            s.Append(
                                i.Aggregate(
                                    new StringBuilder(),
                                    (seed, it) => seed.Append(it.Character),
                                    sb => sb.ToString()
                                )
                            )
                            .Append(Environment.NewLine),
                        s => s.ToString()
                    );
                return new XElement(pkg + "part",
                    new XAttribute(pkg + "name", part.Uri),
                    new XAttribute(pkg + "contentType", part.ContentType),
                    new XAttribute(pkg + "compression", "store"),
                    new XElement(pkg + "binaryData", base64String)
                );
            }
        }
    }

    // Returns the mso-application processing instruction for .docx and .pptx
    // files, or null for any other extension (XDocument ignores null content).
    static XProcessingInstruction GetProcessingInstruction(string path)
    {
        if (path.EndsWith(".docx", StringComparison.OrdinalIgnoreCase))
            return new XProcessingInstruction("mso-application",
                "progid=\"Word.Document\"");
        if (path.EndsWith(".pptx", StringComparison.OrdinalIgnoreCase))
            return new XProcessingInstruction("mso-application",
                "progid=\"PowerPoint.Show\"");
        return null;
    }

    static XDocument OpcToFlatOpc(string path)
    {
        // Open read-only. Package.Open(path) alone defaults to
        // FileMode.OpenOrCreate with read/write access.
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

            XDeclaration declaration = new XDeclaration("1.0", "UTF-8", "yes");
            XDocument doc = new XDocument(
                declaration,
                GetProcessingInstruction(path),
                new XElement(pkg + "package",
                    new XAttribute(XNamespace.Xmlns + "pkg", pkg.ToString()),
                    package.GetParts().Select(part => GetContentsAsXml(part))
                )
            );
            return doc;
        }
    }

    static void Main(string[] args)
    {
        XDocument doc;
        doc = OpcToFlatOpc("Test.docx");
        doc.Save("Test.xml", SaveOptions.DisableFormatting);
        doc = OpcToFlatOpc("Test2.pptx");
        doc.Save("Test2.xml", SaveOptions.DisableFormatting);
    }
}

OpcToFlat.zip

Comments

  • Anonymous
    October 08, 2008
    How can we transform it directly to an HTML file?

  • Anonymous
    October 24, 2008
    Important Safety Tip for Office Open XML - Flatten Your Package!

  • Anonymous
    January 13, 2010
    Why no XLSX for GetProcessingInstruction?

  • Anonymous
    January 13, 2010
    Hi David, the reason I included the processing instructions that I did is that you can directly open these XML files in Word and PowerPoint.  The processing instruction enables opening by double clicking.  If your only purpose is to transform via XSLT, then sure, replace the processing instruction with one for XLSX. -Eric

  • Anonymous
    February 19, 2010
    BTW - thank you a ton for these posts. Incredibly helpful!!!

  • Anonymous
    February 26, 2010
    Hi Eric, Thank you very much for the post and your blog as a whole. The processing above is OK for a “standard” docx file. How would one go about when processing an embedded docx file into another one? Here is a scenario:

  1. Generating SubDoc1.docx.
  2. Merging SubDoc1.docx into MainDoc.docx.
  3. “Flatten” the MainDoc.docx to an OPC Format. The problem is that SubDoc1.docx is part of the package. I see that it has to be processed recursively, but which parts of SubDoc1.docx need to be extracted? Thank you!

  • Anonymous
    December 01, 2010
    I have a problem similar to Bukabi's: after merging the docx (via altChunk), I'm having trouble converting the merged docx to Flat OPC format (OpcToFlat). Is there a workaround for this? Thanks. Your blog has been a great help to me.

  • Anonymous
    December 02, 2010
    Hi LMK and Bukabi, It depends on what you're trying to do, but after creating the document with altChunk elements, you need to merge the imported documents/html/etc. into the original document so that the document contains ordinary WordprocessingML markup (i.e. paragraphs, runs, text).  There are two ways to do this - either use Word to open and save the document (perhaps using automation), or to use Word Automation Services (msdn.microsoft.com/.../ff742315.aspx).  Then you can flatten, or process in a variety of ways. -Eric

  • Anonymous
    June 10, 2014
    Can I write new content to an Open XML file using C# code? I want to replace the old path of a video in a PowerPoint presentation Open XML file with a new path. How can I do that?

  • Anonymous
    December 03, 2014
    Thanks Eric for the code, which works great in our scenario. However, I noticed that the flat file created is almost double the size of the original Word document. The original Word document has a few graphics embedded, and I know converting images to binaries increases size by about 37%. But in my testing, the resulting flat file was double the size. Any thoughts?

  • Anonymous
    December 03, 2014
    Hi Alex, yes, you are right, Flat OPC will be quite a bit larger, due to non-compression, and due to base64 encoded binary parts.  Nothing can be done about this, really. I'm glad the code is helpful to you. Cheers, Eric

  • Anonymous
    December 03, 2014
    Thanks Eric for your quick reply. I really enjoy reading your articles.

  • Anonymous
    December 03, 2014
    To Alex Wong: there is nothing you can control there. Documents from Office 2007 or later are themselves compressed.

  • Anonymous
    January 22, 2015
    Thanks Eric for the tool. I used the code to convert a docx to flat OOXML, then converted the OOXML back to a docx using the Open XML for Office API. The font type was changed for the whole document. Any ideas why this happened? Alan

  • Anonymous
    September 16, 2015
    How do I convert an XML file into another file?