PowerShell can be useful for parsing or harvesting data from the web by means of the “Invoke-WebRequest” cmdlet. Among other things, it returns the collection of links found on a page. The code below loads a page from Wikipedia and loops through that collection, looking for a certain pattern to find all the cities in a state. It then writes each “city, state” pair to a file.
In Windows PowerShell, calling “Invoke-WebRequest” returns a Microsoft.PowerShell.Commands.HtmlWebResponseObject (the variable $site in my sample code below), which is part of the Microsoft.PowerShell.Commands.Utility assembly. Useful members include AllElements, Forms, Headers, Images, InputFields, Links, ParsedHtml, RawContent, and StatusCode (see the complete list under HtmlWebResponseObject Members).
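As a quick way to get a feel for a few of those members, here is a minimal sketch. The function name Get-ResponseSummary is hypothetical (it is not part of my script or of PowerShell itself), and it only reads the StatusCode, Links, and Images properties, so it can be tried on a stand-in object without touching the network.

```powershell
# Hypothetical helper (not part of the original script): summarizes a few
# members of a web response. It only reads properties, so any object that
# exposes StatusCode, Links, and Images will work.
function Get-ResponseSummary {
    param(
        [Parameter(Mandatory)]
        $Response
    )

    [PSCustomObject]@{
        StatusCode = $Response.StatusCode
        LinkCount  = @($Response.Links).Count
        ImageCount = @($Response.Images).Count
        # First few hrefs, handy for eyeballing what the page links to
        FirstLinks = @($Response.Links |
                       Select-Object -First 3 |
                       ForEach-Object { $_.href })
    }
}
```

Against a live page this would be called as Get-ResponseSummary -Response (Invoke-WebRequest -Uri $url).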
Clear-Host
$state = "Texas"
# Example: http://en.wikipedia.org/wiki/List_of_cities_in_Texas
$url = "http://en.wikipedia.org/wiki/List_of_cities_in_$state"
$harvestFile = "c:\TexasCities.txt"

$site = Invoke-WebRequest -Uri $url
#$elements = $site.AllElements | where { $_.id -eq "100 Largest Cities in Texas by Population" }
#Write-Host "Matches = $($elements.length)"

foreach ($link in $site.Links)
{
    # A city link has an href such as /wiki/Abilene,_Texas and a title such as "Abilene, Texas"
    if ($link.href.StartsWith("/wiki/") -and $link.href.EndsWith("_$state") -and $link.title.EndsWith(", $state"))
    {
        Write-Host "$($link.innerText) $($link.href)"
        $outrow = "$($link.innerText), $state"
        Add-Content $harvestFile $outrow   # write "city, state" to the output file
    }
    else
    {
        Write-Host "Other $($link.innerText) $($link.href)"
        if ($link.href.StartsWith("#cite_note-2"))
        {
            # This is our signal to stop processing; after this point the
            # cities repeat in descending order of population
            Write-Host "Stopping because found #cite_note-2"
            break
        }
    }
}
The only trick above is knowing when to stop parsing. I stop at the first link whose href starts with “#cite_note-2”, an anchor tag that marks the beginning of a list of the 100 largest cities in the state, sorted by descending population; past that point the same cities repeat. Of course, as with any parser, if the web page changes, the parser might break and need updating.
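The filter and the stop condition can also be pulled out into a function, which makes them easy to try against a handful of made-up links before pointing them at a live page. This is only a sketch under a few assumptions: the name Select-CityName and its parameters are hypothetical, the default stop marker is the “#cite_note-2” anchor from my script, and a missing href or title is treated as an empty string rather than an error.

```powershell
# Hypothetical helper: emits "city, state" strings from a collection of link
# objects, stopping at the first link whose href starts with $StopHref.
function Select-CityName {
    param(
        [object[]] $Links,
        [Parameter(Mandatory)] [string] $State,
        [string] $StopHref = '#cite_note-2'   # marker used by the script above
    )

    foreach ($link in $Links) {
        # Casting to [string] turns a missing href or title into '' so the
        # method calls below cannot fail on $null.
        $href  = [string]$link.href
        $title = [string]$link.title

        if ($href.StartsWith($StopHref)) {
            break   # past this marker the cities repeat
        }

        if ($href.StartsWith('/wiki/') -and
            $href.EndsWith("_$State") -and
            $title.EndsWith(", $State")) {
            "$($link.innerText), $State"   # same output row as the script
        }
    }
}
```

With the variables from the script, the loop would then shrink to something like: Select-CityName -Links $site.Links -State $state | Add-Content $harvestFile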
Example Data Harvested
Abbott, Texas
Abernathy, Texas
Abilene, Texas
Ackerly, Texas
Addison, Texas
Adrian, Texas
Agua Dulce, Texas
Alamo, Texas
Alamo Heights, Texas