Using PowerShell to Harvest Data from a Web Page

PowerShell can be useful for parsing or harvesting data from the web by means of the “Invoke-WebRequest” cmdlet. Among other things, it returns a collection of the links on the page. The code below loads a page from Wikipedia and loops through that collection, looking for a certain pattern to find all the cities in a state. It then writes each “city, state” pair to a file.

Calling “Invoke-WebRequest” returns a Microsoft.PowerShell.Commands.HtmlWebResponseObject (the variable $site in my sample code below), which is part of the Microsoft.PowerShell.Commands.Utility assembly. Useful members include AllElements, Forms, Headers, Images, InputFields, Links, ParsedHtml, RawContent, and StatusCode (see the complete list here: HtmlWebResponseObject Members).
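To get a feel for these members before writing a parser, it can help to print a few of them. The snippet below is only a sketch: it assumes Windows PowerShell 5.1 (newer PowerShell versions return a different response type, and link properties such as innerText may be empty there), and it assumes the Wikipedia page used in this post is still reachable at the same address.

```powershell
# Sketch: inspect a few members of the response object.
# Assumption: the URL below is the same page used elsewhere in this post.
$site = Invoke-WebRequest -Uri "http://en.wikipedia.org/wiki/List_of_cities_in_Texas"

Write-Host "Status code:  $($site.StatusCode)"
Write-Host "Content type: $($site.Headers['Content-Type'])"
Write-Host "Link count:   $($site.Links.Count)"

# Show the first few links with the properties the harvesting script relies on
$site.Links |
    Select-Object -First 5 -Property href, title, innerText |
    Format-Table -AutoSize
```

Looking at the href, title, and innerText columns for a handful of links is usually enough to spot a pattern worth filtering on.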

cls
$state = "Texas" 
#example: http://en.wikipedia.org/wiki/List_of_cities_in_Texas 
$url = "http://en.wikipedia.org/wiki/List_of_cities_in_$state" 
$harvestFile = "c:\TexasCities.txt" 
$site = Invoke-WebRequest -Uri $url 
#$elements = $site.AllElements | where { $_.id -eq "100 Largest Cities in Texas by Population" } 
#Write-Host "Matches = $($elements.length)"
foreach ($link in $site.Links) 
  {
    # -like is null-safe, so links that have no href or title are simply skipped 
    if ($link.href -like "/wiki/*_$state" -and $link.title -like "*, $state")
        {
        write-host "$($link.innerText) $($link.href)" 
        $outrow = "$($link.innerText), $state" 
        add-content $harvestFile $outrow   #write name of city to output file 
        }
    else 
        {
        write-host "Other $($link.innerText) $($link.href)" 
        # this is our signal to stop processing, the cities repeat now by descending order of population 
        if ($link.href -like "#cite_note-2*")
            {
                Write-Host "Stopping because found #cite_note-2" 
                break
            }
        }
  }

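The only trick above is knowing when to stop parsing. The page first lists the cities alphabetically and then repeats them in a list of the 100 largest cities in the state, sorted by descending population. I stop at the anchor tag (#cite_note-2) that marks the beginning of that second list, so each city is written only once. Of course, as with any parser, if the web page changes, the parser might break and need updating.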
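Because the page can change, one option is to pull the matching logic into a small function that can be tried against hand-made link objects without touching the network. The following is a minimal sketch, not the script used for this post: the function name Get-CityName, its parameters, and the sample links are all made up for illustration, and the default stop marker is simply the one that worked for the Texas page.

```powershell
# Hypothetical helper: emits "City, State" strings for links that match the
# city pattern, and stops at the first link whose href starts with $StopHref.
function Get-CityName {
    param(
        [object[]] $Links,                      # objects with href, title, innerText
        [string]   $State,
        [string]   $StopHref = "#cite_note-2"   # page-specific assumption
    )
    foreach ($link in $Links) {
        if ($link.href -like "$StopHref*") { break }
        if ($link.href -like "/wiki/*_$State" -and $link.title -like "*, $State") {
            "$($link.innerText), $State"
        }
    }
}

# Made-up sample links standing in for $site.Links
$sample = @(
    [pscustomobject]@{ href = "/wiki/Abbott,_Texas";  title = "Abbott, Texas";  innerText = "Abbott" }
    [pscustomobject]@{ href = "/wiki/Texas";          title = "Texas";          innerText = "Texas" }
    [pscustomobject]@{ href = "/wiki/Abilene,_Texas"; title = "Abilene, Texas"; innerText = "Abilene" }
    [pscustomobject]@{ href = "#cite_note-2";         title = $null;            innerText = "[2]" }
    [pscustomobject]@{ href = "/wiki/Houston,_Texas"; title = "Houston, Texas"; innerText = "Houston" }
)

$cities = @(Get-CityName -Links $sample -State "Texas")
$cities
```

With the sample data, only Abbott and Abilene are returned; Houston is ignored because it comes after the stop marker. Against a real page, the same function could be called with $site.Links and its output piped to Add-Content.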

Example Data Harvested

Abbott, Texas
Abernathy, Texas
Abilene, Texas
Ackerly, Texas
Addison, Texas
Adrian, Texas
Agua Dulce, Texas
Alamo, Texas
Alamo Heights, Texas
