Getting Information from Web Pages via PowerShell

In many cases, the information we need is available on one or more web pages, and we need to process it repeatedly. To automate such a recurring task in PowerShell, we need to read and parse HTML. For example, a question was recently posted on the Microsoft SharePoint 2010 forum:

I love how SharePoint 2010 has the page: http://Server:880/_admin/PatchStatus.aspx. However, looking at this page each server has about 100 patches.  Scrolling through this list is difficult to see if one of the servers is missing a patch or has patches that other servers do not. Is there a way to export the information on the PatchStatus.aspx page in Central Admin to an excel spreadsheet?

For the purpose of this exercise, let’s say we want to get the titles of the posts in this URL: http://superwidgets.wordpress.com/category/sql

First, let’s read in the HTML code of this page:

http://superwidgets.files.wordpress.com/2014/08/html21.jpg?w=590
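
The screenshot is not reproduced as text here, so the following is a minimal sketch of this step rather than the exact command shown. It assumes PowerShell 3.0 or later (where Invoke-WebRequest is available), network access to the blog, and the variable names $URI and $HTML used in the script later in this post.

```powershell
# Address of the page to read (taken from the text above)
$URI = 'http://superwidgets.wordpress.com/category/sql/'

# Download the page; the result is a web response object, not just a string
$HTML = Invoke-WebRequest -Uri $URI

# Quick sanity checks on what came back
$HTML.StatusCode        # HTTP status code of the response
$HTML.Content.Length    # size of the raw HTML, in characters
```

Keeping the whole response object in $HTML, rather than only its Content string, is what makes the property exploration in the next step possible.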

Now, let’s pipe that to Get-Member to see what kind of object we got and its available methods and properties:

http://superwidgets.files.wordpress.com/2014/08/html31.jpg?w=590
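
As a rough sketch of the Get-Member step, the example below uses a small stand-in object so that it runs without network access. The stand-in and its two properties are assumptions for illustration only; a real Invoke-WebRequest response exposes many more members.

```powershell
# Stand-in for the real response so the sketch runs offline.
# In practice: $HTML = Invoke-WebRequest -Uri $URI
$HTML = [pscustomobject]@{
    StatusCode = 200
    Content    = '<html><body><h2 class="entry-title">Sample</h2></body></html>'
}

# List every member of the object along with its type name
$HTML | Get-Member

# Narrow the listing to properties only and keep just their names
$propertyNames = ($HTML | Get-Member -MemberType Properties).Name
$propertyNames
```

Filtering with -MemberType Properties is a convenient way to shorten the listing when only the data-carrying members are of interest.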

After exploring the properties of the $HTML object, and with some background knowledge of HTML, we can tell that the information we’re looking for is in the HTML body.

Using the same technique, we can explore further properties of the $HTML object, such as ParsedHtml:

$HTML.ParsedHtml | Get-Member

This shows that it’s an HTMLDocumentClass object with many events, methods, and properties:

http://superwidgets.files.wordpress.com/2014/08/html7.jpg?w=590

The three methods most useful to us here are:

getElementById
getElementsByName
getElementsByTagName

Now that we know how to extract the information we need from a web page, let’s look at the specifics of the web page at hand: http://superwidgets.wordpress.com/category/sql. Open it in IE, for example, press F12 to open the DOM Explorer at the bottom, expand the HTML tags, and move the mouse over the tags one by one. Notice that the top IE pane highlights the element under the mouse pointer by changing its background color. This gives us a visual indication of which element in the HTML code corresponds to which text or area of the page.

http://superwidgets.files.wordpress.com/2014/08/html6.jpg?w=590

You can see the article titles we’re interested in are the ones that start with:

<h2 class="entry-title">

We can now write the following few lines of PowerShell script code to complete the task:

http://superwidgets.files.wordpress.com/2014/08/html8.jpg?w=590

# Script to display post titles in the SQL category of the Superwidgets blog
# Sam Boutros - 08/10/2014
$URI = 'http://superwidgets.wordpress.com/category/sql/'
$HTML = Invoke-WebRequest -Uri $URI
($HTML.ParsedHtml.getElementsByTagName('h2') | Where-Object { $_.className -eq 'entry-title' }).innerText

In line 5, we select HTML elements by the "h2" tag, filter on className equal to "entry-title", and read the innerText property.

Output looks like this:

http://superwidgets.files.wordpress.com/2014/08/html9.jpg?w=590

These are exactly the article titles we set out to get.
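
The ParsedHtml property relies on the Internet Explorer engine and is not exposed by Invoke-WebRequest in PowerShell 6 and later. For those environments, one possible fallback is a regular expression against the raw Content string. This is a swapped-in technique, not the method used in the script above, and it is more fragile than a real HTML parser. The sample markup below is hypothetical and only stands in for $HTML.Content.

```powershell
# Hypothetical sample markup standing in for $HTML.Content
$content = @'
<h2 class="entry-title"><a href="/post-1">First sample post</a></h2>
<p>Body text</p>
<h2 class="entry-title"><a href="/post-2">Second &amp; last sample post</a></h2>
'@

# Capture everything between <h2 ... class="entry-title" ...> and </h2>
# (?is) = case-insensitive, and '.' also matches line breaks
$pattern = '(?is)<h2[^>]*class\s*=\s*"entry-title"[^>]*>(.*?)</h2>'

$titles = [regex]::Matches($content, $pattern) | ForEach-Object {
    # Strip nested tags (such as the <a> link), then decode HTML entities
    $text = $_.Groups[1].Value -replace '<[^>]+>', ''
    [System.Net.WebUtility]::HtmlDecode($text).Trim()
}

$titles
```

The pattern assumes the class attribute is written exactly as class="entry-title" with double quotes. Pages that use single quotes or several class names on the same element would need a looser pattern.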

This information can be further processed, logged, stored, or repackaged into HTML, CSV, or other reports.
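
As one illustration of repackaging the output, the sketch below writes a list of titles to a CSV file. The titles, the column names (Retrieved, Title), and the file name are all assumptions made for the example. In practice the titles would come from the script above.

```powershell
# Hypothetical titles standing in for the script's output
$titles = 'First sample post', 'Second sample post'

# Wrap each title in an object so Export-Csv has named columns
$report = $titles | ForEach-Object {
    [pscustomobject]@{
        Retrieved = (Get-Date).ToString('yyyy-MM-dd')
        Title     = $_
    }
}

# Arbitrary output location: a file in the current user's temp folder
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'post-titles.csv'
$report | Export-Csv -Path $csvPath -NoTypeInformation
```

Export-Csv builds its columns from object properties, so plain strings are wrapped in objects first. Piping the strings directly would export their Length property instead of the text.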