Embedding Any File Type, Like PDF, in an Open XML File

In my last post, I showed you guys how to embed an Excel spreadsheet within a Word document without the need to invoke an OLE Server. In today's post I am going to show you how to embed any file in an Open XML file. Specifically, I am going to show you how to embed a PDF file into a Word document. Note that this approach requires you to invoke an OLE Server to embed the file into an Open XML file.

This post uses version 2 of the Open XML SDK.

If you just want to jump straight into the code, feel free to download this solution here.


Solution

To embed a PDF file into a Word document we can take the following actions:

  1. Create a template in Word that contains a content control that will be used to demarcate the region where the embedded object will be inserted
  2. Open up the Word document via the Open XML SDK and access its main document part
  3. Invoke the OLE server application associated with PDF files to create an IStorage and an image of the embedded object
  4. Add an image part to the document
  5. Feed the data from the generated image into the added image part
  6. Add an embedded object part to the document
  7. Feed the data from the generated IStorage into the embedded object part
  8. Determine the prog id of the application associated with PDF files
  9. Create a paragraph that contains the embedded object
  10. Locate the content control that will contain the embedded object
  11. Swap out the content control for the newly created paragraph
  12. Save changes made to the Word document

Note that the steps outlined above are just one method to accomplish this scenario. The steps above are very similar to my previous post showing you how to embed an Excel spreadsheet within a Word document. The main difference is in how we go about adding the embedded object to the Word document. No application, at least on my computer, has written out a subkey IPersistStorageType under HKCR\CLSID\{Apps_OLE_Storage_CLSID} for PDF files, which means there is no way for us to know the required structure of an IStorage containing a PDF file. Instead we are required to rely on the OLE server application associated with PDF files to generate the appropriate IStorage.

For the sake of this example, let's say I am starting with the following Word document:

[Image: Embed1]

This document contains a content control, named "EmbedObject," which will contain my embedded object. In addition, let's say I have the following PDF file I wish to embed:

[Image: Embed2]

The Code

As mentioned in my previous post, embedding an object in a document requires both a visual representation of the object and the underlying data. In this post, I am going to show you how to generate the IStorage and the image representing the embedded object by invoking the OLE Server associated with PDF files. To create the underlying data for a non-Office embedded object we need to look up the prog id of the application associated with the file format extension. To get this data we need to look under HKCR\.XXX within the registry, where XXX is the file format extension (e.g. pdf). Under this key you should see at least two values: "(Default)" and "Content Type". The data stored in "(Default)" is the prog id of the application associated with the file format. On my computer, the prog id associated with PDF files is "AcroExch.Document".
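Here is a minimal sketch of that registry lookup, in case you would rather read the prog id at run time than hard-code it. The helper names NormalizeExtension and LookupProgId are made up for this illustration, and the sketch assumes the prog id fits in a 256-byte buffer. On anything other than Windows the lookup simply returns an empty string.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: turns "PDF", "pdf" or ".Pdf" into the key name ".pdf".
std::string NormalizeExtension(std::string ext)
{
    assert(!ext.empty());
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (ext[0] != '.')
        ext.insert(ext.begin(), '.');
    return ext;
}

// Hypothetical helper: returns the "(Default)" value of HKCR\<ext>,
// or an empty string if the lookup fails or this is not Windows.
std::string LookupProgId(const std::string& extension)
{
#ifdef _WIN32
    const std::string key = NormalizeExtension(extension);
    char buffer[256] = {};               // assumption: prog ids are short
    DWORD size = sizeof(buffer);
    // Passing a null value name selects the key's default value.
    LSTATUS status = RegGetValueA(HKEY_CLASSES_ROOT, key.c_str(), nullptr,
                                  RRF_RT_REG_SZ, nullptr, buffer, &size);
    return status == ERROR_SUCCESS ? std::string(buffer) : std::string();
#else
    (void)extension;
    return std::string();
#endif
}
```

Calling LookupProgId("pdf") on a machine set up like mine would be expected to hand back the same prog id I found by browsing the registry by hand. An empty result means no application has registered itself for that extension.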

Since we don't know the structure of the embedded object we shouldn't use the content type associated with the file format extension. Instead, we should use the generic content type for embedded objects, which is "application/vnd.openxmlformats-officedocument.oleObject".

Our next step is to create the IStorage and an image representation for the embedded object. As mentioned in the Solution section above, we need to invoke the OLE Server associated with PDF files. Below is the C++ code needed to accomplish this task:

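    //********** This snippet is C++ code *************//
    #include <windows.h>
    #include <ole2.h>
    #include <comdef.h>   // IStoragePtr, IOleObjectPtr, IDataObjectPtr

    // CheckHr is not shown in this post; it is assumed to be an error-checking
    // helper that bails out on a failed HRESULT.
    // emfFile (assumed here to be passed in by the caller) is the path the
    // EMF image is written to. OLE must already be initialized (OleInitialize).
    HRESULT PackageOleObject(LPCTSTR inputFile, LPCTSTR outputFile, LPCTSTR emfFile)
    {
        HRESULT hr = S_OK;
        IStoragePtr pStorage = NULL;
        IOleObjectPtr pOle = NULL;
        IDataObjectPtr pdo = NULL;
        FORMATETC fetc;
        STGMEDIUM stgm;
        HENHMETAFILE hmeta = NULL;

        // Create a compound storage document.
        hr = StgCreateStorageEx(
            outputFile,
            STGM_READWRITE | STGM_SHARE_EXCLUSIVE | STGM_CREATE | STGM_TRANSACTED,
            STGFMT_DOCFILE, 0, NULL, NULL,
            IID_IStorage, reinterpret_cast<void**>(&pStorage));
        CheckHr(hr);

        // Create OLE package from file.
        hr = OleCreateFromFile(CLSID_NULL, inputFile, ::IID_IOleObject,
            OLERENDER_NONE, NULL, NULL, pStorage, (void**)&pOle);
        CheckHr(hr);

        // Start the OLE server so it can render the object.
        hr = OleRun(pOle);
        CheckHr(hr);

        hr = pOle->QueryInterface(IID_IDataObject, (void**)&pdo);
        CheckHr(hr);

        // Ask for the content rendered as an enhanced metafile.
        fetc.cfFormat = CF_ENHMETAFILE;
        fetc.dwAspect = DVASPECT_CONTENT;
        fetc.lindex = -1;
        fetc.ptd = NULL;
        fetc.tymed = TYMED_ENHMF;

        stgm.hEnhMetaFile = NULL;
        stgm.tymed = TYMED_ENHMF;
        stgm.pUnkForRelease = NULL;
        hr = pdo->GetData(&fetc, &stgm);
        CheckHr(hr);

        // Create image metafile for object.
        hmeta = CopyEnhMetaFile(stgm.hEnhMetaFile, emfFile);

        hr = pStorage->Commit(STGC_DEFAULT);
        CheckHr(hr);

        pOle->Close(OLECLOSE_SAVEIFDIRTY);
        ReleaseStgMedium(&stgm);        // frees the metafile returned by GetData
        if (hmeta != NULL)
            DeleteEnhMetaFile(hmeta);   // releases the handle; the file on disk stays

        return hr;
    }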

The above C++ code snippet will create two output files that represent the IStorage and the image representation for our embedded object.
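If you want a quick sanity check on those two files before handing them to the SDK, a sketch like the one below may help. It only looks at the leading bytes: a compound file (the on-disk form of an IStorage) starts with the eight-byte signature D0 CF 11 E0 A1 B1 1A E1, and an EMF starts with a header record of type 1 that carries the signature " EMF" at byte offset 40. The helper names ReadHead, LooksLikeCompoundFile and LooksLikeEmf are made up for this illustration and are not part of the downloadable solution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Reads up to `count` bytes from the start of a file.
// Returns fewer bytes (possibly none) if the file is short or cannot be opened.
std::vector<unsigned char> ReadHead(const std::string& path, std::size_t count)
{
    assert(count > 0);
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes(count);
    in.read(reinterpret_cast<char*>(bytes.data()),
            static_cast<std::streamsize>(count));
    bytes.resize(in ? count : static_cast<std::size_t>(in.gcount()));
    return bytes;
}

// True if the bytes begin with the compound file signature.
bool LooksLikeCompoundFile(const std::vector<unsigned char>& head)
{
    static const unsigned char kSignature[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    return head.size() >= 8 && std::memcmp(head.data(), kSignature, 8) == 0;
}

// True if the bytes begin with an EMF header record (type 1)
// and carry the " EMF" signature at offset 40.
bool LooksLikeEmf(const std::vector<unsigned char>& head)
{
    static const unsigned char kType[4] = { 0x01, 0x00, 0x00, 0x00 };
    static const unsigned char kSignature[4] = { 0x20, 0x45, 0x4D, 0x46 };
    return head.size() >= 44
        && std::memcmp(head.data(), kType, 4) == 0
        && std::memcmp(head.data() + 40, kSignature, 4) == 0;
}
```

For example, LooksLikeCompoundFile(ReadHead(binPath, 8)) and LooksLikeEmf(ReadHead(emfPath, 44)) should both come back true for the files written by the C++ snippet, where binPath and emfPath are whatever paths you passed in. This only checks the headers, so it will catch an empty or misnamed file but not a damaged one.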

We are now ready to accomplish the rest of the steps. Here is how you add the appropriate image data and embedded object data to a Word file:

    // Requires: using System.IO; using DocumentFormat.OpenXml.Packaging;
    using (WordprocessingDocument myDoc = WordprocessingDocument.Open(output, true))
    {
        MainDocumentPart mainPart = myDoc.MainDocumentPart;

        //Note that I created this emf file using my C++ solution
        ImagePart imagePart = mainPart.AddImagePart(ImagePartType.Emf);
        using (FileStream imageStream = File.Open("output.emf", FileMode.Open))
        {
            imagePart.FeedData(imageStream);
        }

        EmbeddedObjectPart embeddedObjectPart = mainPart.AddEmbeddedObjectPart(
            @"application/vnd.openxmlformats-officedocument.oleObject");

        //Note that I created this bin file using my C++ solution
        using (FileStream embedStream = File.Open("input.pdf.bin", FileMode.Open))
        {
            embeddedObjectPart.FeedData(embedStream);
        }

        ...
    }

I should note that both the image and the embedded data were created using my C++ code that I showed you earlier in this post. The next step is to create a paragraph that represents our embedded object. Using the Document Reflector to help me out, I was able to create the following method:

    // Assumes these aliases at the top of the file:
    //   using V = DocumentFormat.OpenXml.Vml;
    //   using OVML = DocumentFormat.OpenXml.Vml.Office;
    static Paragraph CreateEmbeddedPDFParagraph(string imageId, string embedId, string progId)
    {
        Paragraph p =
            new Paragraph(
                new Run(
                    new EmbeddedObject(
                        new V.Shapetype(
                            new V.Stroke() { JoinStyle = V.StrokeJoinStyleValues.Miter },
                            new V.Formulas(
                                new V.Formula() { Equation = "if lineDrawn pixelLineWidth 0" },
                                new V.Formula() { Equation = "sum @0 1 0" },
                                new V.Formula() { Equation = "sum 0 0 @1" },
                                new V.Formula() { Equation = "prod @2 1 2" },
                                new V.Formula() { Equation = "prod @3 21600 pixelWidth" },
                                new V.Formula() { Equation = "prod @3 21600 pixelHeight" },
                                new V.Formula() { Equation = "sum @0 0 1" },
                                new V.Formula() { Equation = "prod @6 1 2" },
                                new V.Formula() { Equation = "prod @7 21600 pixelWidth" },
                                new V.Formula() { Equation = "sum @8 21600 0" },
                                new V.Formula() { Equation = "prod @7 21600 pixelHeight" },
                                new V.Formula() { Equation = "sum @10 21600 0" }),
                            new V.Path()
                            {
                                AllowGradientShape = V.BooleanValues.T,
                                ConnectionPointType = OVML.ConnectValues.Rectangle,
                                AllowExtrusion = V.BooleanValues.F
                            },
                            new OVML.Lock()
                            {
                                Extension = V.ExtensionHandlingBehaviorValues.Edit,
                                AspectRatio = OVML.BooleanValues.T
                            }
                        )
                        {
                            Id = "_x0000_t75",
                            CoordinateSize = "21600,21600",
                            Filled = V.BooleanValues.F,
                            Stroked = V.BooleanValues.F,
                            OptionalNumber = 75,
                            PreferRelative = V.BooleanValues.T,
                            EdgePath = "m@4@5l@4@11@9@11@9@5xe"
                        },
                        // The shape that shows the image of the embedded object.
                        new V.Shape(
                            new V.ImageData() { Title = "", RelationshipId = imageId }
                        )
                        {
                            Id = "_x0000_i1025",
                            Style = "width:459pt;height:594pt",
                            Ole = V.BooleanEntryWithBlankValues.Empty,
                            Type = "#_x0000_t75"
                        },
                        // Ties the shape to the embedded object part and its prog id.
                        new OVML.OleObject()
                        {
                            Type = OVML.OLEValues.Embed,
                            ProgId = progId,
                            ShapeId = "_x0000_i1025",
                            DrawAspect = OVML.OLEDrawAspectValues.Content,
                            ObjectId = "_1309181277",
                            Id = embedId
                        }
                    )
                    { DxaOriginal = (UInt32Value)9180U, DyaOriginal = (UInt32Value)11881U })
            );
        return p;
    }
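The Style value and the DxaOriginal/DyaOriginal values in the method above appear to describe the same size in two units: points in the style string, and twips (twentieths of a point) on the embedded object. 459pt x 20 = 9180 matches exactly, and 594pt x 20 = 11880 is within one twip of the 11881 in the listing, presumably a rounding difference. If you need a different size, a small sketch like this keeps the two in step. The helper names PointsToTwips and ShapeStyle are made up for this illustration.

```cpp
#include <cassert>
#include <string>

// One point is 20 twips (a twip is a twentieth of a point).
unsigned int PointsToTwips(unsigned int points)
{
    return points * 20u;
}

// Builds a VML style string for a shape of the given size in points.
std::string ShapeStyle(unsigned int widthPt, unsigned int heightPt)
{
    assert(widthPt > 0 && heightPt > 0);
    return "width:" + std::to_string(widthPt) + "pt;height:" +
           std::to_string(heightPt) + "pt";
}
```

ShapeStyle(459, 594) reproduces the style string used above, and PointsToTwips(459) gives the 9180 used for DxaOriginal.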

The last step of the solution is to swap out the content control for this newly created paragraph. Here is the code snippet to accomplish this task:

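    // Requires: using System.Linq;
    Paragraph p = CreateEmbeddedPDFParagraph(
        mainPart.GetIdOfPart(imagePart),
        mainPart.GetIdOfPart(embeddedObjectPart),
        "AcroExch.Document");

    // Find the content control by its alias, skipping controls that have none.
    SdtBlock sdt = mainPart.Document.Descendants<SdtBlock>()
        .First(s =>
        {
            SdtProperties props = s.GetFirstChild<SdtProperties>();
            Alias alias = (props == null) ? null : props.GetFirstChild<Alias>();
            return alias != null && alias.Val.Value.Equals("EmbedObject");
        });

    // Put the new paragraph where the content control was, then save.
    OpenXmlElement parent = sdt.Parent;
    parent.InsertAfter(p, sdt);
    sdt.Remove();
    mainPart.Document.Save();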

End Result

Running this code I should end up with a document that looks like the following:

[Image: Embed3]

Upon activating the embedded object I will see the following:

[Image: Embed4]

Let me know if you guys are interested in more solutions around embedded objects.

Zeyad Rajabi

Added video to blog post

Comments

  • Anonymous
    July 21, 2009
    Excellent article!!!  Do you know about a solution for merging (not embedding) PDF files and Office 2003 files into a docx file?  Thanks in advance for your answer

  • Anonymous
    July 21, 2009
    Johann - Thanks. To merge PDF content within Open XML files you will need some kind of API that understands PDF. I found several doing a quick internet search. As for Office 2003 files, there are 3rd party APIs that can help you out. You could also programmatically access the content of binary documents yourself. Check out the following link: http://msdn.microsoft.com/en-us/library/cc313153.aspx to see the specification for the .doc file format. Another approach for handling Office 2003 files is to convert them to Open XML. You can use the OMPM tool to do bulk conversions for you.

  • Anonymous
    July 28, 2009
    The program runs very well in my PC. However, I got a HRESULT=0x80070057 error code when I tried to convert a TXT file to an OLE object under Win2003 server. There is also no EMF image file in the output when 0x80070057 error exists.

  • Anonymous
    August 03, 2009
    Is there any way to detect where in a document a particular part is referenced, with SDK v2? It would be nice to be able to look at, say, a Paragraph, and get a list of Parts referenced. There are several use cases:

      • removing a paragraph from a document; what set of Parts need to be removed as well?
      • extracting a single paragraph from a document; what set of Parts need to come along?
    Word obviously has this information because it strips out unreferenced Parts when saving.