Removing Comments from a Wordprocessing Document Programmatically

One of the more common scenarios related to a Wordprocessing document is the need to sanitize a document in order to remove personally identifiable information. What do I mean by personally identifiable information? Well, I am talking about, among other things, comments, revisions, personal information such as author name, and hidden text. This type of content may need to be stripped out of a document before the document gets sent outside a corporation.

This scenario is so important to Office that we added a Document Inspector feature in Office 2007, which is able to find and remove these types of personally identifiable information. You can find this feature by clicking the Office button | Prepare | Inspect Document. Here is what the feature looks like:

How do I perform the same actions programmatically, let's say on the server? Well, here is where the Open XML SDK can help. Today I am going to show you how to remove comments within a Wordprocessing document. This post is similar to Eric's post on using LINQ to remove comments from a document, except I will show you a solution that builds on top of version 2 of the Open XML SDK.

The Document

Imagine I have a document that has multiple comments, where some of the comments may even contain images. If you crack open the package you will notice that a Wordprocessing document that contains comments will have the following content:

  • The document will contain a Comments part, which contains the content of every comment
  • If applicable, the Comments part will reference other parts associated with a given comment. For example, if a comment contains an image, the comments part will reference an image part
  • The main document part will contain references to comments via a comments reference element
  • The main document will demarcate regions that are associated with a comment via comment range start and end elements
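
To make the last two bullets concrete, here is a minimal sketch of what those markers look like in the main document part's XML and how they could be counted. It uses plain LINQ to XML rather than the Open XML SDK so that it runs without a package on disk. The sample markup, the CommentMarkupDemo class, and the CountCommentMarkers helper are all made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CommentMarkupDemo
{
    // WordprocessingML main namespace (the "w" prefix in the main document part)
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written sample: one commented run, roughly as it might appear in a document
    public const string SampleXml = @"
<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
  <w:body>
    <w:p>
      <w:commentRangeStart w:id=""0""/>
      <w:r><w:t>Commented text</w:t></w:r>
      <w:commentRangeEnd w:id=""0""/>
      <w:r><w:commentReference w:id=""0""/></w:r>
    </w:p>
  </w:body>
</w:document>";

    // Counts the range start, range end, and reference elements tied to comments
    public static int CountCommentMarkers(XElement root)
    {
        return root.Descendants().Count(e =>
            e.Name == W + "commentRangeStart" ||
            e.Name == W + "commentRangeEnd" ||
            e.Name == W + "commentReference");
    }

    public static void Main()
    {
        XElement root = XElement.Parse(SampleXml);
        Console.WriteLine(CountCommentMarkers(root)); // prints 3
    }
}
```

The w:id attribute is what ties each marker back to the matching comment stored in the Comments part.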

Here is a screenshot of an example document with comments:

Solution

To remove comments from a Wordprocessing document we need to take the following actions:

  1. Open up the Wordprocessing document via the Open XML SDK
  2. Access the main document part, which will give us access to all other parts within the package
  3. Delete the Comments part and all parts referenced by the Comments part
  4. Find all elements within the main document part associated with comments
  5. Delete all those found elements
  6. Save changes made to the document

My post will talk about using version 2 of the SDK.

If you just want to jump straight into the code, feel free to download this solution here.

The Code

The following code snippet accomplishes all six tasks discussed in the Solution section above. It builds upon some of the topics discussed in the Traversing in the Open XML SDK DOM and Open XML SDK... The Basics posts. In particular, the Descendants() method is used to find the specific elements associated with comments, and the generic OpenXmlElement class is used for manipulation. Also note that deleting a part via the Open XML SDK deletes not only that part but all parts referenced by it as well.

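    using System.Collections.Generic;
    using System.Linq;
    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    static void RemoveComments(string filename)
    {
        // Open up the document for editing
        using (WordprocessingDocument myDoc = WordprocessingDocument.Open(filename, true))
        {
            // Access main document part
            MainDocumentPart mainPart = myDoc.MainDocumentPart;

            // Delete the comments part, plus any other part referenced, like image parts.
            // The part is null when the document has no comments, so check first.
            if (mainPart.WordprocessingCommentsPart != null)
            {
                mainPart.DeletePart(mainPart.WordprocessingCommentsPart);
            }

            // Find all elements that are associated with comments.
            // ToList() materializes the query so elements can be removed while looping.
            List<OpenXmlElement> elementList = mainPart.Document.Descendants()
                .Where(el => el is CommentRangeStart || el is CommentRangeEnd || el is CommentReference)
                .ToList();

            // Delete every found element
            foreach (OpenXmlElement e in elementList)
            {
                e.Remove();
            }

            // Save changes
            mainPart.Document.Save();
        }
    }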
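
The note about DeletePart also taking referenced parts with it can be explored before running the removal. The read-only sketch below lists whatever the Comments part points to, such as image parts. CommentPartInspector and ListReferencedParts are hypothetical names, not part of the SDK, and the sketch assumes the Open XML SDK assembly is referenced.

```csharp
using System;
using DocumentFormat.OpenXml.Packaging;

public static class CommentPartInspector
{
    // Hypothetical helper: prints each part the Comments part references
    // and returns how many were found
    public static int ListReferencedParts(string filename)
    {
        // Open read-only; nothing is modified here
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filename, false))
        {
            WordprocessingCommentsPart commentsPart =
                doc.MainDocumentPart.WordprocessingCommentsPart;

            // No Comments part means the document has no comments
            if (commentsPart == null)
            {
                Console.WriteLine("No comments part found.");
                return 0;
            }

            int count = 0;
            foreach (IdPartPair pair in commentsPart.Parts)
            {
                // Relationship id, target location, and content type of the child part
                Console.WriteLine("{0} -> {1} ({2})",
                    pair.RelationshipId, pair.OpenXmlPart.Uri, pair.OpenXmlPart.ContentType);
                count++;
            }
            return count;
        }
    }
}
```

Running this on a document whose comments contain images should list those image parts, which are the ones expected to disappear along with the Comments part.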

End Result

Putting everything together and running my code, I will end up with a document that is completely devoid of comments. Sweet!
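
One way to double-check that result is a small read-only helper like the sketch below. HasComments is a hypothetical name, not something the SDK provides. It assumes the same Open XML SDK types used in the code above and reports whether a Comments part or any comment marker is still present.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class CommentCheck
{
    // Hypothetical helper: returns true if the document still carries comment content
    public static bool HasComments(string filename)
    {
        // Open read-only; nothing is modified here
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filename, false))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;

            // A surviving Comments part means comment text is still in the package
            if (mainPart.WordprocessingCommentsPart != null)
            {
                return true;
            }

            // Otherwise look for leftover markers in the main document part
            return mainPart.Document.Descendants<CommentRangeStart>().Any()
                || mainPart.Document.Descendants<CommentRangeEnd>().Any()
                || mainPart.Document.Descendants<CommentReference>().Any();
        }
    }
}
```

Calling HasComments on the file after RemoveComments has run should return false.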

Here is a screenshot of the final document:

Zeyad Rajabi

Comments

  • Anonymous
    February 09, 2009
    I would like to make a suggestion for a new topic: Copying worksheets within a workbook. I found Todde's solution here: http://www.codeproject.com/KB/aspnet/CopyExcelSheet.aspx but that doesn't utilize the OOXML API. I also tried to adapt your "Clone the slide template" code to work with spreadsheets, but was unsuccessful there. Let me know what you think about this topic, thanks!

  • Anonymous
    February 11, 2009
    For copying worksheets, you could refer to a sample code here which adds a new worksheet into the package: http://msdn.microsoft.com/en-us/library/cc881781(office.14).aspx You will need to adjust the first few lines that creates a blank sheet with the code in this sample where a part's content is copied to another part: http://msdn.microsoft.com/en-us/library/bb463673(office.14).aspx The resulting code will be well reduced

  • Anonymous
    February 12, 2009
    Hey, don't get me wrong, here, this OpenXML SDk looks very nice and all, but...what about us VBA developers? Is there any way (a COM Interop wrapper or somesuch) for US to use this nifty new tool in OUR applications? Your previous "how to generate a Word Document from an external Database" perfectly captured one of the reporting tasks we MS Access/VBA Developers are tasked with every day. Only, your solution requires VB.Net/VS.net (or at least VSTA, which is still not readily available for "the rest of us"). So, how about it? Is there some sort of solution or offering out there for us, or will we once again be the forgotten, "red-headed stepchildren" developers of Microsoft-based Office/VBA solutions? (please hurry up with an answer, as I'm holding my breath...)

  • Anonymous
    February 12, 2009
    I was missing two pieces:

      1. I wasn't using the using(SpreadsheetDocument...){} structure, so I needed to add Package.Flush() to save the relationships, and
      2. That sample doesn't include copying DefinedNames associated with the worksheet. I've added the code below for copying the DefinedNames from one worksheet to another in case anyone else was wondering about this.

        // Copy the Named Regions from Workbook.xml
        List<DefinedName> myNewNames = new List<DefinedName>();
        DefinedNames myNames = myWBpart.Workbook.GetFirstChild<DefinedNames>();
        foreach (DefinedName name in myNames)
        {
            if (name.InnerText.Split('!')[0].Trim('\'') == oldWorksheetName)
            {
                DefinedName newName = new DefinedName();
                newName.Name = name.Name + "_" + newWorksheetName;
                newName.Text = name.Text.Replace(oldWorksheetName, newWorksheetName);
                myNewNames.Add(newName);
            }
        }
        foreach (DefinedName myNewName in myNewNames)
        {
            myNames.AppendChild<DefinedName>(myNewName);
        }

  • Anonymous
    February 13, 2009
    Anthony, Good suggestion. Next week I will have a post up talking about how to copy a worksheet. Mburns, Good question. I will need to look into how to call managed code from VB. I know there is a way, but I will need to ask some people. Zeyad Rajabi

  • Anonymous
    February 18, 2009
    (MBurns has died from hypoxia)
