How to convert a Word document to other formats using PowerShell
I recently borrowed a Sony Reader Touch Edition from someone I know to try it out. As I started using Sony’s own library manager, I quickly got bored. I then tried the open source Calibre, which turned out to have a much better interface but had a major flaw when it comes to supporting the Sony Reader: it didn’t support importing Word documents into the library, despite the Sony Reader’s ability to read them. It can, however, import filtered HTML files, which Word can produce. Given my lazy nature, I did not want to convert a bunch of Word documents by hand, so I set out to write a PowerShell script to do the work for me.
The script turned out to be much simpler than I thought. Here it is for everyone’s benefit.
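# Converts every Word document in $docpath to filtered HTML in $htmlpath.
param([string]$docpath, [string]$htmlpath = $docpath)
# "*.doc*" matches both .doc and .docx files.
$srcfiles = Get-ChildItem $docpath -Filter "*.doc*"
# Load the Word interop assembly so the WdSaveFormat type resolves
# (assumes the Office primary interop assemblies are installed).
Add-Type -AssemblyName Microsoft.Office.Interop.Word
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML")
$word = New-Object -ComObject Word.Application
$word.Visible = $false
function saveas-filteredhtml($doc)
{
    $opendoc = $word.Documents.Open($doc.FullName)
    # Build the target from the file name without its extension;
    # "$htmlpath\$doc.fullname.html" would expand $doc and append literal text.
    $target = Join-Path $htmlpath ($doc.BaseName + ".html")
    $opendoc.SaveAs([ref]$target, [ref]$saveFormat)
    $opendoc.Close()
}
ForEach ($doc in $srcfiles)
{
    Write-Host "Processing :" $doc.FullName
    saveas-filteredhtml $doc
}
# Close Word once all documents have been converted.
$word.Quit()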
Save this code to convertdoc-tohtml.ps1 and you can run it on a set of Word documents regardless of doc or docx extension. Also, for efficiency I am using -Filter in Get-ChildItem instead of piping to Where-Object or using an If statement in the script. Why? Because it says right in the help:
“Filters are more efficient than other parameters, because the provider applies them when retrieving the objects, rather than having Windows PowerShell filter the objects after they are retrieved.”
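To make the difference concrete, here is a minimal sketch that runs both approaches against a throwaway folder. The folder name, the file names, and the "*.doc*" pattern are made up for illustration; the point is only that both calls select the same files, while -Filter lets the provider do the selecting.

```powershell
# Build a throwaway folder with a few empty files (names are illustrative only).
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ("filterdemo_" + [guid]::NewGuid().ToString("N"))
New-Item -ItemType Directory -Path $demo | Out-Null
"a.doc", "b.docx", "c.txt" | ForEach-Object {
    New-Item -ItemType File -Path (Join-Path $demo $_) | Out-Null
}

# Provider-side filtering: the file system provider only hands back matches.
$filtered = @(Get-ChildItem $demo -Filter "*.doc*")

# Pipeline filtering: every item is retrieved first, then kept or discarded.
$piped = @(Get-ChildItem $demo | Where-Object { $_.Extension -like ".doc*" })

# Both lists should contain the same two documents.
$filtered.Name
$piped.Name

# Clean up the demo folder.
Remove-Item $demo -Recurse -Force
```

On a folder this small the speed difference is not noticeable; it should matter more as the number of non-matching files grows.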
To run the script, you simply need to point it to the folder where your source documents are and provide an output folder if you wish. If not provided, the source folder will also be used as the output folder. Here’s how you can run it:
.\convertdoc-tohtml.ps1 -docpath "C:\Documents" -htmlpath "C:\Output"
If you want to know how the script can be changed to save in a different format, refer to the WdSaveFormat enumeration members on MSDN.
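As a rough sketch of what such a change could look like, the snippet below maps a few WdSaveFormat member names to their numeric values and a matching file extension, then builds the output path for one document. The helper name Get-SaveTarget, its parameters, and the sample paths are hypothetical and not part of the original script. It does not start Word, so it only shows the bookkeeping side of the change.

```powershell
# A few WdSaveFormat members with their numeric values and a fitting extension.
$formats = @{
    wdFormatText         = @{ Value = 2;  Extension = ".txt"  }
    wdFormatRTF          = @{ Value = 6;  Extension = ".rtf"  }
    wdFormatFilteredHTML = @{ Value = 10; Extension = ".html" }
    wdFormatPDF          = @{ Value = 17; Extension = ".pdf"  }
}

# Hypothetical helper: returns the output path and format value for one document.
function Get-SaveTarget([string]$DocumentPath, [string]$OutputFolder, [string]$FormatName)
{
    $format = $formats[$FormatName]
    if ($null -eq $format) { throw "Unknown format: $FormatName" }
    # Swap the source extension for the one that fits the chosen format.
    $baseName = [System.IO.Path]::GetFileNameWithoutExtension($DocumentPath)
    [pscustomobject]@{
        Path   = [System.IO.Path]::Combine($OutputFolder, $baseName + $format.Extension)
        Format = $format.Value
    }
}

# Example: where would report.docx go if saved as PDF?
$target = Get-SaveTarget "C:\Documents\report.docx" "C:\Output" "wdFormatPDF"
$target.Path
$target.Format
```

The resulting path and format value could then be passed to the document's SaveAs call in place of the hard-coded HTML path and wdFormatFilteredHTML.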
Comments
Anonymous
January 01, 2003
@John, not sure if my script can do that. Do post solution if you find one.
Anonymous
September 23, 2011
Thanks for the article. What if I want to transform the paragraphs (from the Word document) by enclosing them within <p> and </p> HTML tags and write them into the HTML file?