Scraping Test Data from the Web – List of Cities for a State – Using RegEx and Notepad++

First, I found data on Wikipedia. It was formatted into columns, but I had an intuition that the columns were just formatting and that the data in the HTML behind the page would not be arranged that way. I was right.

This is the page that I started with:

I right-clicked, chose “View source”, then copied the HTML surrounding the cities into Notepad++.

With a few minutes of editing, I was able to remove the lines of HTML that didn’t look like this:


Then I pasted this into Notepad++ and used Find/Replace with the RegEx capture feature.


NOTE: Be sure to set the “Search Mode” to “Regular expression” BEFORE clicking “Replace All”.
And of course, when it doesn’t work, press Ctrl-Z to undo and try again.

Let me explain this “Find what” RegEx:

.* title="(.*)">.*

Then, in “Replace with”: \1

This gives:

Abbott, Texas
Abernathy, Texas
Abilene, Texas
Ackerly, Texas
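
If you would rather script this than click through Notepad++, here is a rough sketch of the same find/replace in Python. The sample lines are hypothetical, modeled on typical Wikipedia list markup (the page's actual HTML is not reproduced here and may differ), and `extract_titles` is just an illustrative helper name.

```python
import re

# Hypothetical sample lines; the real page's markup may differ.
html_lines = [
    '<li><a href="/wiki/Abbott,_Texas" title="Abbott, Texas">Abbott</a></li>',
    '<li><a href="/wiki/Abernathy,_Texas" title="Abernathy, Texas">Abernathy</a></li>',
]

# Same pattern as typed into the Find box.
PATTERN = re.compile(r'.* title="(.*)">.*')


def extract_titles(lines):
    """Replace each matching line with the text captured from its title attribute."""
    # r"\1" plays the role of the Replace value: keep only capture group 1.
    return [PATTERN.sub(r"\1", line) for line in lines]


for title in extract_titles(html_lines):
    print(title)
```

Note that `.*` is greedy, so this relies on each line holding only one `title="..."` attribute, which is why the one-link-per-line cleanup beforehand matters.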

The parentheses indicate the area to capture (I only need to capture one phrase).
If I wanted to capture the city as \1 and the state as \2, I could do something like this:

.* title="(.*), (.*)">.*

Then, in “Replace with”: \1 \2


There’s a nice RegEx Cheat Sheet here:

But briefly, the . means any character, and the * means “0 or more”. Then I’m looking for an exact match of the phrase title=" (i.e. the word “title”, an equals sign, and a double quote). I want to capture the city, which is everything up to the comma. (This will not work on city names that have commas in them; I doubt there are any, but who knows…) Then I want to skip the comma and the space and capture the next set of characters up to the next double quote followed by a greater-than sign, then any string of characters.
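
To make that breakdown concrete, here is the two-group pattern spelled out piece by piece, as a sketch in Python's verbose regex mode. The sample line and the helper name `split_city_state` are my own illustrations, not taken from the Wikipedia page.

```python
import re

# The two-group pattern, one piece per line. In re.VERBOSE mode unescaped
# whitespace is ignored, so each literal space is written as "[ ]".
CITY_STATE = re.compile(r'''
    .*            # any characters, zero or more (everything before the attribute)
    [ ]title="    # the literal text: space, title, equals sign, double quote
    (.*)          # group 1: the city
    ,[ ]          # the comma and the space between city and state
    (.*)          # group 2: the state
    ">            # the closing double quote followed by the greater-than sign
    .*            # the rest of the line
''', re.VERBOSE)


def split_city_state(line):
    """Return (city, state) for a matching line, or None if it does not match."""
    match = CITY_STATE.match(line)
    return match.groups() if match else None


# Hypothetical sample line.
line = '<li><a href="/wiki/Abilene,_Texas" title="Abilene, Texas">Abilene</a></li>'
print(split_city_state(line))
```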

Then in the “Replace with” textbox, I simply enter \1 (or \1 \2 if capturing city and state separately). In the second case I could also put \1/\2 or \1-\2, using my own desired separator or no separator at all.


Abbott Texas
Abernathy Texas
Abilene Texas
Ackerly Texas
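
The separator choice is easy to try out in a script as well. This is a small sketch, again assuming a hypothetical sample line and an illustrative helper name (`reformat_line`), that runs the two-group pattern with several replacement strings.

```python
import re

# Same two-group pattern as in the Find box.
PATTERN = re.compile(r'.* title="(.*), (.*)">.*')


def reformat_line(line, replacement):
    """Rewrite a matching line using the given replacement template."""
    return PATTERN.sub(replacement, line)


# Hypothetical sample line.
line = '<li><a href="/wiki/Ackerly,_Texas" title="Ackerly, Texas">Ackerly</a></li>'

# Space, slash, dash, and no separator at all.
for replacement in (r"\1 \2", r"\1/\2", r"\1-\2", r"\1\2"):
    print(reformat_line(line, replacement))
```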

