Microcode: PowerShell Scripting Tricks: Scripting The Web (Part 1) (Get-Web)

Several of the last posts have tackled how to take the wild world of data and start to turn it into PowerShell objects, so that it’s easier to make heads or tails out of it.  Once all of that data is in a form that PowerShell can use more effectively in the object pipeline, you can use PowerShell to slice and dice that data in wonderfully effective ways.

Data mining is a strange art, one that I find scripting languages are normally a little better at than compiled languages. I find that PowerShell can be incredibly useful for data miners for two important reasons. First, it can pull on an unprecedented array of technologies (COM, WMI, .NET, SQL, regular expressions, web services, the command line) in order to get structured data. Second, and more importantly, since you can easily create objects on the fly in PowerShell, and since PowerShell has wonderful string processing, you can often use PowerShell to extract structured data out of unstructured data.
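For instance (a sketch of mine, not code from this series), a regular expression plus Add-Member can turn a line of free-form text into a real object. The sample line and property names below are made up purely for illustration:

    $line = 'John Smith, 42, Seattle'
    if ($line -match '^(?<Name>[^,]+),\s*(?<Age>\d+),\s*(?<City>.+)$') {
        # Build an object on the fly from the named capture groups
        $person = New-Object PSObject
        $person | Add-Member NoteProperty Name $matches.Name
        $person | Add-Member NoteProperty Age  ([int]$matches.Age)
        $person | Add-Member NoteProperty City $matches.City
        $person
    }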

Being able to pull data out of the mist and give it form is a very valuable skill, because people do not think in structured data.  While it might be useful to the computing world to have most information in structured data, most people disseminating information don’t think or record their thoughts with rigorous structure.  However, many people do record their thoughts in semi-rigorous structure, like the sentences and paragraphs you’re reading now.

The fact of the matter is that tons of data floats around the web, and it takes only a little bit of work and a small amount of art to extract it. Because the web where people record their thoughts is largely HTML, it is possible to learn a few ways to pull the data from the little structure that exists.

The first piece of the toolkit for extracting data from the web is a function I've called Get-Web. Get-Web simply downloads web pages; it wraps part of the System.Net.WebClient object.

Using WebClient has pros and cons. The biggest pro is that it relies on .NET rather than on a particular browser, which means that you can use it without IE. The biggest con is that a lot of web pages check which browser is asking in order to change how they display.
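One common mitigation (a sketch of mine, not something Get-Web does itself) is to set a browser-like User-Agent header on the WebClient before downloading, so pages that sniff the browser behave more normally. The user-agent string here is only an example:

    $webclient = New-Object Net.WebClient
    # Pretend to be a browser for pages that check the user agent
    $webclient.Headers.Add('User-Agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)')
    $html = $webclient.DownloadString('https://www.msn.com/')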

Get-Web is below.  As with Get-HashtableAsObject, I’m using comments to declare some nifty inline help.

function Get-Web($url, 
    [switch]$self,
    $credential, 
    $toFile,
    [switch]$bytes)
{
    #.Synopsis
    #    Downloads a file from the web
    #.Description
    #    Uses System.Net.WebClient (not the browser) to download data
    #    from the web.
    #.Parameter self
    #    Uses the default credentials when downloading that page (for downloading intranet pages)
    #.Parameter credential
    #    The credentials to use to download the web data
    #.Parameter url
    #    The page to download (e.g. www.msn.com)    
    #.Parameter toFile
    #    The file to save the web data to
    #.Parameter bytes
    #    Download the data as bytes   
    #.Example
    #    # Downloads www.live.com and outputs it as a string
    #    Get-Web https://www.live.com/
    #.Example
    #    # Downloads www.msn.com and saves it to a file
    #    Get-Web https://www.msn.com/ -toFile www.msn.com.html
    $webClient = New-Object Net.WebClient
    if ($credential) {
        $webClient.Credentials = $credential
    }
    if ($self) {
        $webClient.UseDefaultCredentials = $true
    }
    if ($toFile) {
        # DownloadFile needs an absolute path, so resolve paths without
        # a drive qualifier against the current directory
        if (-not "$toFile".Contains(":")) {
            $toFile = Join-Path $pwd $toFile
        }
        $webClient.DownloadFile($url, $toFile)
    } else {
        if ($bytes) {
            $webClient.DownloadData($url)
        } else {
            $webClient.DownloadString($url)
        }
    }
}

To walk through a few examples of Get-Web, start by simply pointing it at any webpage:

   Get-Web https://en.wikipedia.org/

To save a page to disk:

   Get-Web https://www.msn.com/ -toFile www.msn.com.html
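The other parameters work the same way. For instance (the intranet address below is hypothetical, and the favicon URL is just a stand-in for any binary resource):

    # Download an intranet page with your current Windows credentials
    Get-Web http://intranet/ -self

    # Download binary data as bytes and write it to a file
    $data = Get-Web https://www.msn.com/favicon.ico -bytes
    [IO.File]::WriteAllBytes("$pwd\favicon.ico", $data)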

Just downloading the data is only the first step. All that being able to download web data gives you is a way to get the mist into a bottle; it doesn't help you give it form. The next piece will cover how to pull the data out of the web and into PowerShell.
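As a tiny preview (my own sketch here, not necessarily how the next post will do it), a regular expression can already pull something structured, like link targets, out of a downloaded page:

    $html = Get-Web https://en.wikipedia.org/
    # Extract every href target from the raw HTML
    [regex]::Matches($html, 'href="([^"]+)"') |
        ForEach-Object { $_.Groups[1].Value }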

Hope this helps,

James Brundage [MSFT]

Comments

  • Anonymous
    December 01, 2008
    Actually, it would be really cool if you posted these scripts to PoshCode.org (even when they're partly duplicates) ... PoshCode has script-tag based embedding too, so you can get syntax-highlighted code on your blog ;)

  • Anonymous
    December 10, 2008
    I was going to make a comment about how you could use split-path and its Qualifier parameter instead of looking for a colon. Then I messed around with it and didn't like that it throws an exception if there's no qualifier. Nevermind. :) P.S. Test-Path needs more guts, I just realized. P.P.S. The bad thing about using net.webclient is that you cannot reliably convert it to [xml] in one step because many webpages are not xhtml compliant.  Joel addressed that here: http://huddledmasses.org/get-web-another-round-of-wget-for-powershell/

  • Anonymous
    April 24, 2014
I am finding in many places a need to log in to the site and then retrieve data. How can I use System.Net.WebClient to POST to the site, retain the session, and then use the DownloadString method?