Scraping Test Data from the Web – List of Cities for a State – Using RegEx and Notepad++

First, I found the data on Wikipedia. It was displayed in columns, but I had an intuition that the columns were just formatting and that the data in the HTML behind the page would not be in columns. I was right.

This is the page that I started with: http://en.wikipedia.org/wiki/List_of_cities_in_Texas.

I right-clicked the page, chose “View source”, then copied the HTML surrounding the cities into Notepad++.

With a few minutes of editing, I was able to remove the lines of HTML that didn’t look like this:

<li><a href="/wiki/Abbott,_Texas" title="Abbott, Texas">Abbott</a></li>
<li><a href="/wiki/Abernathy,_Texas" title="Abernathy, Texas">Abernathy</a></li>
<li><a href="/wiki/Abilene,_Texas" title="Abilene, Texas">Abilene</a></li>
<li><a href="/wiki/Ackerly,_Texas" title="Ackerly, Texas">Ackerly</a></li>
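That manual cleanup could also be scripted rather than done by hand. Here is a minimal sketch in Python, assuming the page source has already been saved as text. The names keep_city_lines and CITY_LINE are made up for this illustration, and the sample HTML is a trimmed stand-in rather than the real page.

```python
import re

# Hypothetical pattern: a list item whose link carries a title attribute.
# [^>]* allows other attributes (such as href) before or after title.
CITY_LINE = re.compile(r'<li><a [^>]*title="[^"]*"[^>]*>')

def keep_city_lines(html_text):
    """Return only the lines that look like <li><a ... title="...">...</a></li>."""
    return [line.strip() for line in html_text.splitlines()
            if CITY_LINE.search(line)]

# Trimmed stand-in for the copied page source (an assumption, not the real page).
sample = """<h2>A</h2>
<ul>
  <li><a href="/wiki/Abbott,_Texas" title="Abbott, Texas">Abbott</a></li>
  <li><a href="/wiki/Abernathy,_Texas" title="Abernathy, Texas">Abernathy</a></li>
</ul>
<p>Some unrelated paragraph.</p>"""

city_lines = keep_city_lines(sample)
for line in city_lines:
    print(line)
```

Only the two list-item lines survive; the heading, the list wrapper and the paragraph are dropped.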

Then I pasted this into Notepad++ and used Find/Replace with the RegEx capture feature.

[Screenshot: Notepad++ Find/Replace dialog with a RegEx capture]

NOTE: Be sure to set the “Search Mode” to “Regular expression” BEFORE clicking “Replace All”.
And of course, when it doesn’t work, press Ctrl+Z to undo and try again.

Let me explain this “Find” RegEx:

.* title="(.*)">.*
Then replace with \1

Results:
Abbott, Texas
Abernathy, Texas
Abilene, Texas
Ackerly, Texas
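The same find/replace can be tried outside Notepad++. Below is a minimal sketch using Python's re.sub with the pattern from above. It assumes each line has the title attribute last, immediately before the closing >, which the pattern relies on. The variable names are just for illustration.

```python
import re

# Four sample lines, with title as the last attribute before ">" (assumed).
html = "\n".join([
    '<li><a href="/wiki/Abbott,_Texas" title="Abbott, Texas">Abbott</a></li>',
    '<li><a href="/wiki/Abernathy,_Texas" title="Abernathy, Texas">Abernathy</a></li>',
    '<li><a href="/wiki/Abilene,_Texas" title="Abilene, Texas">Abilene</a></li>',
    '<li><a href="/wiki/Ackerly,_Texas" title="Ackerly, Texas">Ackerly</a></li>',
])

# "." does not match a newline by default, so every match stays within
# one line, much like "Replace All" working through the file line by line.
result = re.sub(r'.* title="(.*)">.*', r'\1', html)
print(result)
```

Each whole line is replaced by the text captured inside the parentheses, giving the four "City, Texas" lines shown in the results.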

The parentheses mark the part to capture (here I only need to capture one phrase).
If I wanted to capture the city as \1 and the state as \2, I could do something like this:
.* title="(.*), (.*)">.*

Then replace with \1 \2

[Screenshot: Notepad++ Find/Replace dialog with two RegEx captures]

There’s a nice RegEx Cheat Sheet here: http://www.cheatography.com/davechild/cheat-sheets/regular-expressions

But briefly, the . means any character, and the * means “0 or more”, so .* matches any run of characters on the line. Then I’m looking for an exact match of the phrase title=" (i.e. the word “title”, an equals sign, and a double quote). I want to capture the city, which is everything up to the comma. (I was worried about city names that contain commas, though I doubt there are any. Because .* is greedy, the first capture actually runs to the last comma-plus-space in the title, so such a name would still land in \1.) Then I want to skip the space and capture the next set of characters, up to the double quote that is followed by a greater-than sign, and then match any string of characters after that.
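One way to check the comma question is simply to try it. The sketch below uses Python's re module with the two-capture pattern. Since .* is greedy, the first group takes as much as it can, so it stops at the last comma-plus-space inside the title. The title "Made, Up, Texas" is invented for the test, and split_title is a hypothetical helper name.

```python
import re

# Two-capture pattern from the post: city in group 1, state in group 2.
PATTERN = re.compile(r'.* title="(.*), (.*)">.*')

def split_title(line):
    """Return (city, state) captured from one line, or None if no match."""
    match = PATTERN.match(line)
    return match.groups() if match else None

normal = '<li><a href="/wiki/Abilene,_Texas" title="Abilene, Texas">Abilene</a></li>'
# Invented example: a city name that itself contains a comma.
tricky = '<li><a href="/wiki/Example" title="Made, Up, Texas">Made Up</a></li>'

print(split_title(normal))  # ('Abilene', 'Texas')
print(split_title(tricky))  # ('Made, Up', 'Texas')
```

A state name containing a comma followed by a space would be the case that splits in the wrong place.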

Then in the “Replace with” textbox, I simply enter \1 (or \1 \2 if capturing city and state separately). I could also put \1/\2 or \1-\2 to use my own separator, or \1\2 for no separator at all.

Results:

Abbott Texas
Abernathy Texas
Abilene Texas
Ackerly Texas
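To compare separators quickly, here is a small sketch in Python that applies the two-capture pattern with several replacement strings. It assumes the same line layout as above, with title as the last attribute. The function name format_city is made up for illustration.

```python
import re

PATTERN = r'.* title="(.*), (.*)">.*'

def format_city(line, replacement):
    """Rewrite one line using the given replacement template (\\1 = city, \\2 = state)."""
    return re.sub(PATTERN, replacement, line)

line = '<li><a href="/wiki/Ackerly,_Texas" title="Ackerly, Texas">Ackerly</a></li>'

# Space, slash, hyphen, and no separator at all.
for replacement in (r'\1 \2', r'\1/\2', r'\1-\2', r'\1\2'):
    print(format_city(line, replacement))
```

This prints Ackerly Texas, Ackerly/Texas, Ackerly-Texas and AckerlyTexas, one per line.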
