How to decompile Compiled HTML Help (.CHM) files and extract information using PowerShell
What is Compiled HTML Help (.CHM)?
Microsoft Compiled HTML Help is a Microsoft proprietary online help format, consisting of a collection of HTML pages, an index and other navigation tools. The files are compressed and deployed in a binary format with the extension .CHM, for Compiled HTML. The format is often used for software documentation, such as the help files for the Sysinternals tools.
How to decompile HTML help
Today a friend and I were looking for a way to decompile .CHM files into HTML and then parse the HTML DOM to extract some information. After some research I found that Windows ships with a command-line utility, HH.exe, which can decompile .CHM files to HTML using its -decompile option.
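As a rough sketch of the underlying call: HH.exe takes the -decompile switch, followed by the destination folder and then the .CHM file. The paths below are hypothetical placeholders, and the call is skipped if HH.exe is not at the assumed location.

```powershell
# Hypothetical paths, used only for illustration
$chm  = 'C:\Temp\Example.chm'
$dest = 'C:\Temp\Example'

# Argument order: -decompile <destination folder> <chm file>
$hhArgs = "-decompile $dest $chm"

# Assumes HH.exe lives in C:\Windows; nothing runs if it is not found there
$hh = 'C:\Windows\hh.exe'
if (Test-Path $hh) {
    # -Wait blocks until HH.exe has finished extracting
    Start-Process -FilePath $hh -ArgumentList $hhArgs -Wait
}
```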
So I wrapped the commands up into a PowerShell function, like below:
<#
    PURPOSE : Script utilizes HH.exe to decompile the Compiled HTML Help file (.CHM)
    AUTHOR  : Prateek Singh
    BLOG    : http:\\Geekeefy.wordpress.com
#>
Function Get-DecompiledHTMLHelp
{
    [cmdletbinding()]
    param(
        [String] $Destination,
        [String] $Filename
    )

    $EXE = 'C:\Windows\hh.exe'

    If(-not (Test-Path $Destination))
    {
        "Destination folder doesn't exist"
    }
    elseIf(-not (Test-Path $Filename))
    {
        "Target .chm file not found, please make sure you're entering the full path and file name"
    }
    else
    {
        # -Wait blocks until HH.exe exits, so the counts below see the extracted files
        Start-Process -FilePath $EXE -ArgumentList "-decompile $Destination $Filename" -Wait

        # Group the extracted items into folders (True) and files (False)
        $FilesAndFolder = Get-ChildItem $Destination -Recurse | Group-Object PSIsContainer
        $FolderCount = ($FilesAndFolder | Where-Object {$_.Name -eq $true}).Count
        $FileCount   = ($FilesAndFolder | Where-Object {$_.Name -eq $false}).Count

        Write-Host "Decompiled into $(if($FolderCount -gt 0){$FolderCount}else{0}) Folders and $(if($FileCount){$FileCount}else{0}) Files to Destination $Destination" -ForegroundColor Yellow
    }
}
Pass the function the path to a Compiled HTML Help file (.CHM) and a destination folder for the decompiled content. The function decompiles the file and saves the content to the target destination, like in the following image.
Extracting information from HTML files using HTML <Tags>
To extract information from the HTML files, use the function Create-HTMLDomFromFile to create a DOM structure from the HTML content and pull the text residing under a specific HTML tag, like below
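The body of Create-HTMLDomFromFile is not reproduced here, so the following is only a minimal sketch of one way such a function could work, not the original implementation. It uses a hypothetical stand-in named New-HtmlDomFromFile, assumes a Windows machine where the 'HTMLFile' COM object is available, and uses a made-up file path and tag name.

```powershell
# Hypothetical stand-in for Create-HTMLDomFromFile (the original is not shown).
# Windows-only: relies on the 'HTMLFile' COM object.
function New-HtmlDomFromFile {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    # Read the whole file as a single string (requires PowerShell 3.0 or later)
    $source = Get-Content -Path $Path -Raw
    $dom = New-Object -ComObject 'HTMLFile'

    try {
        # This method is exposed on some systems only
        $dom.IHTMLDocument2_write($source)
    }
    catch {
        # Commonly used fallback when the method above is unavailable
        $dom.write([System.Text.Encoding]::Unicode.GetBytes($source))
    }

    # Return the populated DOM object
    $dom
}

# Hypothetical page from a decompiled .CHM; skipped if the file does not exist
$page = 'C:\Temp\Example\index.htm'
if (Test-Path $page) {
    $dom = New-HtmlDomFromFile -Path $page
    # Pull the text under every <h1> tag
    $dom.getElementsByTagName('h1') | ForEach-Object { $_.innerText }
}
```

Swapping 'h1' for another tag name (for example 'p' or 'td') would pull the text under that tag instead.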