OpenXML & VSTO & VBA - Finding a reliable mechanism for reading the correct value of CharactersWithSpaces 'extended-properties' in Word documents [part 2/2].
This article is split across two blog posts, and this is part #2 (part #1 is published as a separate post). In part #1, I demonstrated how words are counted using OpenXML and warned about the dangers of not getting it right. Here is a short summary of the alternatives we have.

Can we obtain 100% reliable statistics?

Method #1: When Word receives a query about the document statistics (either manually through the Ribbon interface or through the VBA object model), it computes the results and always reports the correct numbers. All we have to do next is save and close the file, then send it to a program which accesses the OpenXML structure and reads those values. The only difficult part is finding a way to trigger the update every time, so that the information stored in the file is up to date.

Advantages:
> easy to implement;
> the OpenXML code used to read the values is very simple;
> works for all kinds of input files, even very complex ones (containing embedded charts, shapes, nested tables, etc.);

Disadvantages:
> we can only force the Word Count update if we rely on a VBA / VSTO add-in installed on the client side;
> the automated statistics-update add-in somehow has to be deployed to all end users;

Method #2: Write our own OpenXML code and count the words ourselves.

Advantages:
> no need for 'helper' tools;

Disadvantages:
> because the OpenXML format is VERY complex, the code will run reliably only for basic input files. If you want to extend the program to handle all kinds of input documents, the complexity of the code grows to the point where continuing the project is no longer feasible (you will very likely be forced to write individual rules for all kinds of exceptions and special conditions for XML text tags, which may appear in different combinations);

In this article, I would like to present the first method, where we use VBA to make Word count the words for us.
But first, we have to trigger the problem:
Triggering the Word statistics mismatch problem

1. Create a new Word document, type " =rand(1) " (without the quotes), then press the Enter key;
2. Save the file as .docx, then close it;
3. Open the file using an OpenXML editor, or rename the document from .docx to .zip, open the docProps folder and then open the app.xml file. Note the values of these XML items:
   > Pages;
   > Words;
   > Characters;
   > Lines;
   > Paragraphs;
   > CharactersWithSpaces;
4. Close the editor or, if you renamed the file to .zip, restore its original extension. Open it again in Word;
5. On the Review tab, in the Proofing group, click Word Count;
6. Compare the statistics in the Word Count dialog with those noted from the app.xml file. Result: the numbers are clearly different;
7. Close the document again. You should be prompted to save it. Go ahead and confirm, to store the updated document information;
8. Open its internal OpenXML structure again; this time you should see that the numbers match.

If we slightly change the order of the steps above:
> create a new file;
> add some text;
> save the file;
> keep the document open, then go to 'Review' > 'Proofing';
> click on 'Word Count';
> close the file;
> open it using an OpenXML editor, or rename the document from .docx to .zip, open the docProps folder and then open the app.xml file;

... you'll notice that the correct statistics are stored in the document.

But something interesting happens: when we try to close the Word document after viewing the statistics, even though we didn't make any modification (we simply clicked on 'Word Count'), the application prompts us to save the file again!

The same behavior occurs if I go to VBA and execute: "Debug.Print ActiveDocument.ComputeStatistics(wdStatisticCharactersWithSpaces)".
What this means is that when we first saved the file, Word only stored a rough estimate of the character count; when we clicked on 'Word Count', the application updated its statistics. Since we know that clicking the 'Word Count' button or executing the VBA instruction is enough to get the correct value stored in OpenXML, we can take advantage of this to force a computation before the user triggers a save in Word. This way, each time the end user sends a file to an automated program or script, it will contain up-to-date statistics.
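On the receiving side, the automated script only has to read the values back from the package. Here is a minimal sketch in Python of what such a reader could look like. It is not the code from this article: the function name read_docx_statistics is hypothetical, and the sketch assumes the statistics live in the conventional docProps/app.xml part and use the Transitional extended-properties namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (assumption: Transitional format).
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

# The statistics elements we noted from app.xml in the steps above.
STAT_NAMES = (
    "Pages",
    "Words",
    "Characters",
    "Lines",
    "Paragraphs",
    "CharactersWithSpaces",
)


def read_docx_statistics(docx):
    """Return the statistics stored in a .docx as a dict of ints.

    `docx` is a file path or a binary file-like object. Elements that
    are missing from app.xml are simply left out of the result.
    """
    # A .docx is a ZIP package; app.xml is assumed to sit in docProps/.
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("docProps/app.xml"))

    stats = {}
    for name in STAT_NAMES:
        element = root.find(f"{{{EP_NS}}}{name}")
        if element is not None and element.text:
            stats[name] = int(element.text.strip())
    return stats
```

Calling read_docx_statistics on a saved file returns whatever Word last wrote into the package, so the result is only as fresh as the last statistics update before the save. That is exactly why the update has to be forced on the client side.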
A simple solution using VBA ...

I wrote a simple Word DOTM add-in which triggers the statistics update before each save:
As you can see, everything seems to be working ... but not for all scenarios. Let's suppose the end user starts by opening an older .DOC file, then does a SaveAs to store it as an OpenXML format document. In this case, the newly saved .DOCX file will contain unreliable word count information. But why?

It seems this issue is caused by the fact that my code receives a handle that points at the old .DOC file when this event occurs:

Private Sub clsWd_DocumentBeforeSave(ByVal Doc As Document ....

Therefore Word computes the correct statistics, but for a different file. This is not a problem in a normal Save action, but with a SaveAs we get a handle on the new document only after we exit the BeforeSave event handler, and by that time it is too late for the code to act.

A less simple solution using VBA :) ...

There is no AfterSave event in Word, but I tried to simulate one:
> first, I save the name of the present document (DOC);
> then I schedule a delayed call to another function (timerCallback), where I check whether the active document (it becomes a DOCX after SaveAs completes) has the same name as the one I recorded before;
> if the names are not identical, we probably executed a SaveAs, so we trigger the computation again;
> I chose to introduce a 200ms delay, but it doesn’t seem to matter whether this interval is smaller or larger. The Application.OnTime call schedules a timer callback at the first available moment, but the callback runs only after the SaveAs function completes, so by that time we already have a new document name and can detect whether we have to trigger a new computation;

The source code is more complex now, and with the added complexity there may be problems. It’s up to you to decide whether to keep the first design and just instruct your users to perform a normal Save after they convert a document, or to adopt this more complicated design.
Here is the output:

As shown in the areas highlighted in red, we start by opening a document in the older format: rand1.doc. When we trigger a SaveAs, my BeforeSave code runs, but it just counts the words inside this old file and exits. The save itself is performed automatically by Word, and only after that internal function (which we cannot hook into) has finished do we notice that the ActiveDocument has changed to rand9.docx. When control returns to the VBA macros and they get their chance to run (Application.OnTime schedules my task to execute at the first available slot), my code performs the comparison and detects the new document, rand9.docx, which is highlighted in blue. It triggers another word count and exits the routine.

The End. I hope you enjoyed my article.
For any questions, feel free to add a comment or write me at cristib@microsoft.com.