Mastering Text in Open XML Word-Processing Documents

Processing text in Open XML word-processing documents seems deceptively simple at first: you have the body of the document, paragraphs and tables in the body, and rows and cells in tables, just like HTML, right? Then it seems deceptively hard: you see the markup for revision tracking, numbered and bulleted lists, and content controls, plus markup that doesn't affect text, such as bookmarks and comments. Styles might seem like they don't impact text, but in the case of numbered and bulleted lists, they do. The truth is somewhere in the middle. There is a lot to keep track of, but each of these features, taken by itself, is not very complicated.

This is one in a series of posts on transforming Open XML WordprocessingML to XHtml.  You can find the complete list of posts here.

That said, there are some basic ideas and abstractions that can simplify how you think about word-processing markup. These abstractions are relevant regardless of whether you are working with word-processing markup using the Open XML SDK 2.0 strongly-typed object model, using the Open XML SDK with LINQ to XML, or using some other platform, such as Java or PHP. We can write some code that will help us to deal with these abstractions. The code will 'surface' just those elements that you are interested in, and surface them in an organized, predictable manner. In the MSDN article, Mastering Text in Open XML WordprocessingML Documents, I present C# code written with both LINQ to XML and with the Open XML SDK 2.0 strongly-typed object model. It is not a lot of code. Because the semantics of a few useful methods are defined carefully, they are easy to implement in whatever language and platform you are using.
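
As a rough illustration of what 'surfacing' content can mean, here is a minimal LINQ to XML sketch. It is not the code from the MSDN article: the class name LogicalContentSketch, the methods LogicalChildren and ParagraphText, and the list of wrapper elements that are looked through are all assumptions made for this example, and the list is certainly incomplete. The idea is to return the runs of a paragraph while looking through wrappers such as content controls and tracked insertions, and skipping tracked deletions.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LogicalContentSketch
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Wrapper elements whose children are treated as if they were children
    // of the parent. This list is an assumption for the sketch, not a spec.
    private static readonly HashSet<XName> Transparent = new HashSet<XName>
    {
        W + "sdt", W + "sdtContent", W + "ins", W + "hyperlink",
        W + "smartTag", W + "fldSimple"
    };

    // Returns the "logical" children of parent that have the given name.
    public static IEnumerable<XElement> LogicalChildren(XElement parent, XName name)
    {
        foreach (XElement child in parent.Elements())
        {
            if (child.Name == name)
            {
                yield return child;
            }
            else if (Transparent.Contains(child.Name))
            {
                foreach (XElement nested in LogicalChildren(child, name))
                    yield return nested;
            }
            // Anything else (w:del, w:pPr, w:sdtPr, bookmarks, ...) is skipped.
        }
    }

    // Concatenates the text of all logical runs in a paragraph.
    public static string ParagraphText(XElement paragraph)
    {
        return string.Concat(
            LogicalChildren(paragraph, W + "r")
                .SelectMany(r => r.Elements(W + "t"))
                .Select(t => (string)t));
    }
}
```

Calling LogicalContentSketch.ParagraphText on a w:p element returns the visible text as one string, whether or not some of its runs sit inside a content control or a tracked insertion.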

Comments

  • Anonymous
    January 19, 2010
    This is a nightmare compared to ODF, right? So much code for this kind of processing.

  • Anonymous
    January 20, 2010
    Hi Haba, Thanks for your comment. I disagree that this is more difficult than ODF. While it might seem on the surface that ODF is simpler than Open XML, the added complexity is due to the issues that content controls add. ODT has no corresponding functionality (as far as I know). I've seen literally hundreds of applications that use content controls to create better ways to use documents. There is no question that having constructs that allow us to give structure to document content is very valuable. Further, while Open XML has a content structure of paragraphs/runs/text elements, ODT has mixed text content (interspersed text nodes and elements). It is very much open to debate as to which form presents more processing challenges. Finally, there are issues with ODT that seriously limit it, such as the lack of table styles and of revision tracking that works properly for tables. The LogicalChildrenContent code that I present in this post is less than 200 lines long, which isn't a lot of code. The code to use those extension methods is only a few lines long. Programming with document formats, almost by definition, has its complications. When you have a document format that has as many features and capabilities as Open XML, we always want to generalize and abstract functionality so that working with it becomes easier. -Eric

  • Anonymous
    February 16, 2010
    Hi Eric, Is it possible to read word by word with all the style attributes? -Sandeep

  • Anonymous
    February 16, 2010
    Hi Sandeep, Yes, it is possible, but it's non-trivial.  You have to 'roll up' styles, assembling the style information for each run.  (See posts #4, #5, and #6 in this series.)  I'm currently working on enhancing the XHtml converter to properly render styled text.  This project will demonstrate in detail how to find out the styling information for each run, paragraph, table, and list item. -Eric

  • Anonymous
    March 11, 2010
    Seems to me that the OpenXML sdk ought to contain a lot of this type of code. Realistically, it should wrap all these concepts up into objects much like the Word object model itself. Requiring developers to get down and dirty with the details of a complex file format like this, to do simple things like replace mail merge fields or insert text into content controls, just seems wrong. In playing with the SDK, I've found it's trivially easy to generate a "corrupt" document that Word can't even load, yet try to do that with the Word object model. It's virtually impossible. Maybe what I'm talking about is a layer ON TOP OF the OpenXML sdk, but the sdk as it stands now is only partially helpful. Just my 2c

  • Anonymous
    March 11, 2010
    Hi Darin, With regard to corrupt documents, it is useful to use the SDK to validate documents after you generate or modify them. http://blogs.msdn.com/ericwhite/archive/2010/03/04/validate-open-xml-documents-using-the-open-xml-sdk-2-0.aspx With regard to added functionality, as I understand it, the SDK team has considered, or is considering, additional functionality. But the first, and arguably the hardest, task was putting together the strongly-typed OM and validation functionality. -Eric

  • Anonymous
    March 17, 2010
    Hi Eric, is it possible to extract only the text in a paragraph with a certain style? I can't seem to find a solution for this. This is an example of my document.xml, and I only need to retrieve the text with a style on it.
        <w:body>
          <w:p w:rsidR="00C11C22" w:rsidRDefault="00C11C22">
            <w:pPr>
              <w:pStyle w:val="LLTitle" />
            </w:pPr>
            <w:r>
              <w:t>My Document</w:t>
            </w:r>
          </w:p>
          <w:p w:rsidR="00C11C22" w:rsidRDefault="00C11C22">
            <w:pPr>
              <w:pStyle w:val="LLAuthor" />
            </w:pPr>
            <w:r>
              <w:t>Jeroen</w:t>
            </w:r>
          </w:p>
          <w:p w:rsidR="00C11C22" w:rsidRDefault="00C11C22" />
          <w:p w:rsidR="00C11C22" w:rsidRDefault="00C11C22" />
          <w:p w:rsidR="00C11C22" w:rsidRDefault="00C11C22">
            <w:r>
              <w:t>This is normal content.</w:t>
            </w:r>
          </w:p>
          <w:p w:rsidR="00C11C22" w:rsidRDefault="00C11C22">
            <w:pPr>
              <w:pStyle w:val="LLKeyword" />
            </w:pPr>
            <w:r>
              <w:t>Test, Test Document, My Document</w:t>
            </w:r>
          </w:p>
          <w:p w:rsidR="00C11C22" w:rsidRDefault="00C11C22">
            <w:r>
              <w:t>It will not be extracted.</w:t>
            </w:r>
          </w:p>
        </w:body>
    So my goal is to retrieve the text "My Document", "Jeroen" and "Test, Test Document, My Document". The text "This is normal content." should not be retrieved. Can you show me how to do this? PS: I used the LINQ to XML method.

  • Anonymous
    March 17, 2010
    Hi Jeroen, The query to return a collection of paragraphs that have styles would be something like this:
        doc.MainDocumentPart.GetXDocument().Root
            .Element(w + "body")
            .LogicalChildrenContent(w + "p")
            .Where(p => p.Elements(w + "pPr").Elements(w + "pStyle").Any());
    If you want to retrieve the text of each paragraph, the following returns a collection of strings that contain the contents of the paragraphs with styles:
        doc.MainDocumentPart.GetXDocument().Root
            .Element(w + "body")
            .LogicalChildrenContent(w + "p")
            .Where(p => p.Elements(w + "pPr").Elements(w + "pStyle").Any())
            .Select(p => p.LogicalChildrenContent(w + "r")
                .LogicalChildrenContent(w + "t")
                .Select(t => (string)t)
                .StringConcatenate());
    (I have not tested this code.) Take a look at this post and the follow-on posts: http://blogs.msdn.com/ericwhite/archive/2009/02/16/finding-paragraphs-by-style-name-or-content-in-an-open-xml-word-processing-document.aspx
    I recommend you spend a bit of time and go through this: http://blogs.msdn.com/ericwhite/pages/FP-Tutorial.aspx
    -Eric
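
The queries in this reply depend on the LogicalChildrenContent and StringConcatenate extension methods from the MSDN article, which are not shown here. For readers who want something that compiles on its own, the following is a minimal sketch of the same idea using only System.Xml.Linq. The class and method names (StyledParagraphSketch, StyledParagraphTexts) are made up for this example, and because it uses plain Elements calls it assumes simple markup: paragraphs directly under w:body and runs directly under w:p, with no content controls or tracked revisions in between.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class StyledParagraphSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every paragraph in body that carries a w:pStyle.
    public static List<string> StyledParagraphTexts(XElement body)
    {
        return body
            .Elements(W + "p")
            // Keep only paragraphs that have a w:pPr/w:pStyle element.
            .Where(p => p.Elements(W + "pPr").Elements(W + "pStyle").Any())
            // Join the text of all w:r/w:t elements of each paragraph.
            .Select(p => string.Concat(
                p.Elements(W + "r").Elements(W + "t").Select(t => (string)t)))
            .ToList();
    }
}
```

Given the w:body element of the main document part, StyledParagraphTexts returns one string per styled paragraph. To select a single style, the Where clause could additionally compare the w:val attribute of w:pStyle with the style id.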

  • Anonymous
    April 18, 2010
    Hi Eric, I have a requirement to create docx documents from a template (docx). I need to search for the delimited placeholders and replace them with data from a database. The XML is pretty tedious to read and modify. For example, if I have "Hello [#name#]" in the document, the XML represents it as:
        <w:p w:rsidRPr="00EE3654" w:rsidR="00EE7F3D" w:rsidRDefault="004F5D94">
          <w:pPr><w:rPr><w:lang w:val="en-GB" /></w:rPr></w:pPr>
          <w:r>
            <w:rPr><w:lang w:val="en-GB" /></w:rPr>
            <w:t xml:space="preserve">Hello </w:t>
          </w:r>
          <w:r w:rsidR="00EE7F3D">
            <w:rPr><w:lang w:val="en-GB" /></w:rPr>
            <w:t>[#</w:t>
          </w:r>
          <w:r w:rsidR="008C733B">
            <w:rPr><w:lang w:val="en-GB" /></w:rPr>
            <w:t>name</w:t>
          </w:r>
          <w:r w:rsidR="00EE7F3D">
            <w:rPr><w:lang w:val="en-GB" /></w:rPr>
            <w:t>#]</w:t>
          </w:r>
        </w:p>
    Is there an easier way to do this, or do we have any tools for this? Please help.

  • Anonymous
    April 19, 2010
    Hi Rama, I don't know of any tools for this. It is a non-trivial problem. The approach taken in this blog post can help, in that you can reliably determine the text for each paragraph, and retrieve the text as a single string. Then, you can re-construct the markup from the changed text. If you have to propagate run properties through to the newly generated markup, you have a difficult programming task that requires a fair amount of detailed design work. -Eric

  • Anonymous
    April 19, 2010
    Thanks Eric, I thought so, and did it exactly the same way. I am reading through each paragraph's inner text and determining whether the placeholder exists, then deleting all the w:t tags and creating one w:t with the modified inner text. I am applying the first run's w:rPr to the entire text. The disadvantage is that there may be a loss of style, as I am applying the first run's style to the whole text. Open XML should have the whole text of a paragraph in a single w:t tag, or at least the runs with the same style. I know I'm being over-ambitious ;)
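
The approach described in this exchange (read the paragraph text as one string, replace the placeholder, then rebuild the markup as a single run that reuses the first run's properties) can be sketched roughly as follows with LINQ to XML. The names PlaceholderSketch and ReplaceInParagraph are invented for this example. The sketch assumes the runs are direct children of the paragraph, and it deliberately shows the limitation discussed above: formatting of all runs but the first is lost, as is any non-text run content such as tabs, breaks, or drawings.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PlaceholderSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces placeholder in the paragraph's text, collapsing all runs into one.
    // Returns true if the paragraph was changed.
    public static bool ReplaceInParagraph(XElement paragraph, string placeholder, string value)
    {
        if (string.IsNullOrEmpty(placeholder))
            return false;

        List<XElement> runs = paragraph.Elements(W + "r").ToList();
        string text = string.Concat(runs.Elements(W + "t").Select(t => (string)t));
        if (runs.Count == 0 || !text.Contains(placeholder))
            return false;

        // Reuse a copy of the first run's properties (if any) for the merged run.
        XElement firstRunProps = runs[0].Element(W + "rPr");
        var merged = new XElement(W + "r",
            firstRunProps == null ? null : new XElement(firstRunProps),
            new XElement(W + "t",
                // Preserve leading and trailing spaces in the merged text.
                new XAttribute(XNamespace.Xml + "space", "preserve"),
                text.Replace(placeholder, value)));

        // Put the merged run where the first run was, then drop the originals.
        runs[0].AddBeforeSelf(merged);
        foreach (XElement run in runs)
            run.Remove();
        return true;
    }
}
```

A caller would loop over the paragraphs of the document and call ReplaceInParagraph(p, "[#name#]", someValue) for each one. Paragraphs that do not contain the placeholder are left untouched, so their run structure and formatting are preserved.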

  • Anonymous
    July 04, 2010
    I have a similar issue with something I'm working on, only the data comes from a file downloaded from some hardware. I found that the text looks exactly the same in the Word document without all the redundant padding.
        <w:t>[#name#]</w:t>
    gives you the same result as
        <w:t>[#</w:t></w:r>
        <w:r w:rsidR="008C733B"><w:rPr><w:lang w:val="en-GB" /></w:rPr><w:t>name</w:t></w:r>
        <w:r w:rsidR="00EE7F3D"><w:rPr><w:lang w:val="en-GB" /></w:rPr><w:t>#]</w:t></w:r>
    It seems to be bloated in the same way as HTML output from Word. I have no problem opening and editing the XML version of the file, but how do I append a second file to the first and ensure that each appended file starts on a new page?

  • Anonymous
    May 08, 2012
    The comment has been removed

  • Anonymous
    May 08, 2012
    Hi Parinaaz, I think that you may want to use DocumentBuilder. See openxmldeveloper.org/.../documentbuilder.aspx for more information. Take a look at this video first: openxmldeveloper.org/.../new-screen-cast-short-and-sweet-intro-to-documentbuilder-2-0.aspx -Eric