Scraping Test Data from the Web – List of Cities for a State

First, I found the data on Wikipedia. It was displayed in columns, but I had an intuition that the columns were just formatting and that the HTML behind the page would be a plain list. I was right.

This is the page that I started with: http://en.wikipedia.org/wiki/List_of_cities_in_Texas.

I right-clicked, chose “View source”, and then copied the HTML surrounding the cities into Notepad++.

With a few minutes of editing, I was able to remove the lines of HTML that didn’t look like this:

<code>
<ul>
  <li><a href="/wiki/Abbott,_Texas" title="Abbott, Texas">Abbott</a></li>
</ul>
<ul>
  <li><a href="/wiki/Abernathy,_Texas" title="Abernathy, Texas">Abernathy</a></li>
</ul>
<ul>
  <li><a href="/wiki/Abilene,_Texas" title="Abilene, Texas">Abilene</a></li>
</ul>
<ul>
  <li><a href="/wiki/Ackerly,_Texas" title="Ackerly, Texas">Ackerly</a></li>
</ul>
</code>

Then, with this text in Notepad++, I used Find/Replace with the RegEx capture feature.

NOTE: Be sure to set the “Search Mode” to “Regular expression” BEFORE clicking “Replace All”.
And of course, when it doesn’t work, press Ctrl-Z to undo and try again.

Let me explain the “Find what” RegEx:

.* title="(.*)">.*
Then “Replace with”: \1

Results:

Abbott, Texas
Abernathy, Texas
Abilene, Texas
Ackerly, Texas
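
For anyone who would rather script this than click through Notepad++, the same find/replace can be sketched in Python with the standard `re` module. This is only an illustration, not part of the original workflow: the variable names are made up, and it assumes each anchor's title attribute comes last, immediately before the closing `>`, which is what the pattern expects.

```python
import re

# Sample lines, assuming the title attribute sits right before the ">".
# The variable names here are illustrative only.
html_lines = [
    '<li><a href="/wiki/Abbott,_Texas" title="Abbott, Texas">Abbott</a></li>',
    '<li><a href="/wiki/Abilene,_Texas" title="Abilene, Texas">Abilene</a></li>',
]

# Same pattern as the Notepad++ "Find what" box: one capture group.
pattern = re.compile(r'.* title="(.*)">.*')

# Replace each whole line with just the captured group (\1).
cities = [pattern.sub(r"\1", line) for line in html_lines]

for city in cities:
    print(city)
```

The pattern consumes the whole line, so substituting `\1` leaves only the captured title text, just as “Replace All” does in the editor.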

The parentheses indicate the area to capture (I only need to capture one phrase).
If I wanted to capture the city as \1 and the state as \2, I could do something like this:

.* title="(.*), (.*)">.*

Then “Replace with”: \1 \2

There’s a nice RegEx Cheat Sheet here: http://www.cheatography.com/davechild/cheat-sheets/regular-expressions

But briefly, the . means any character, and the * means “0 or more”. Then I’m looking for an exact match of the phrase title=" (i.e. the word “title”, an equals sign, and a double quote). I want to capture the city, which is everything up to the comma. (I doubt any city names have commas in them, but who knows. Because .* is greedy, the split actually happens at the last comma-plus-space inside the title, so such a name should still come through in one piece.) Then I want to skip a space and capture the next set of characters up to the next double quote followed by a greater-than sign, and then match any remaining string of characters.
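
To check how the two capture groups behave, including the comma worry, here is a small Python sketch. It is an assumption-laden test rather than anything from the original session: the second sample line is a made-up title with an extra comma, invented purely to exercise the edge case, and Python's `re` is standing in for the Notepad++ engine (both treat `.*` as greedy).

```python
import re

# Two capture groups: city before the comma-space, state after it.
pattern = re.compile(r'.* title="(.*), (.*)">.*')

real = '<li><a href="/wiki/Abilene,_Texas" title="Abilene, Texas">Abilene</a></li>'
# Hypothetical title with an extra comma, made up to test the edge case.
made_up = '<li><a href="/wiki/X" title="Fort Example, North, Texas">X</a></li>'

results = []
for line in (real, made_up):
    m = pattern.match(line)
    results.append(m.groups())  # (city, state)

for city, state in results:
    print(repr(city), repr(state))
```

The first group is greedy, so it gives back characters only until the last comma-plus-space that still lets the rest of the pattern match. That is why the made-up name keeps its internal comma in the city part.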

Then in the “Replace with” textbox, I simply enter \1 (or \1 \2 if capturing city and state separately). In the second case I could also put \1/\2 or \1-\2, using my own desired separator or no separator at all.

Results:

Abbott Texas
Abernathy Texas
Abilene Texas
Ackerly Texas
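
The choice of separator can also be sketched in Python. The function name `city_state_lines` and its `sep` parameter are hypothetical, chosen only for this example. One deliberate difference from Notepad++: “Replace All” leaves non-matching lines (such as `<ul>`) untouched, whereas this sketch skips them.

```python
import re

PATTERN = re.compile(r'.* title="(.*), (.*)">.*')

def city_state_lines(lines, sep=" "):
    """Return 'city<sep>state' for every line that matches PATTERN."""
    out = []
    for line in lines:
        m = PATTERN.match(line)
        if m is None:
            continue  # skip lines such as <ul> and </ul>
        # Join the groups by hand so any separator works as-is.
        out.append(m.group(1) + sep + m.group(2))
    return out

sample = [
    "<ul>",
    '  <li><a href="/wiki/Abbott,_Texas" title="Abbott, Texas">Abbott</a></li>',
    "</ul>",
    "<ul>",
    '  <li><a href="/wiki/Ackerly,_Texas" title="Ackerly, Texas">Ackerly</a></li>',
    "</ul>",
]

print(city_state_lines(sample))
print(city_state_lines(sample, sep="/"))
```

Joining the groups by hand, instead of building a replacement string like `\1/\2`, avoids surprises if the separator itself contains a backslash.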

