First, I found the data on Wikipedia. It was displayed in columns, but I had an intuition that in the HTML behind the page the data would not be in columns, that the columns were just formatting, and I was right.
This is the page that I started with: http://en.wikipedia.org/wiki/List_of_cities_in_Texas.
I right-clicked, chose “View source”, then copied the HTML surrounding the cities into Notepad++.
With a few minutes of editing, I was able to remove the lines of HTML that didn’t look like this:
<code>
<ul>
<li><a title="Abbott, Texas" href="/wiki/Abbott,_Texas">Abbott</a></li>
</ul>
<ul>
<li><a title="Abernathy, Texas" href="/wiki/Abernathy,_Texas">Abernathy</a></li>
</ul>
<ul>
<li><a title="Abilene, Texas" href="/wiki/Abilene,_Texas">Abilene</a></li>
</ul>
<ul>
<li><a title="Ackerly, Texas" href="/wiki/Ackerly,_Texas">Ackerly</a></li>
</ul>
</code>
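If you would rather script the cleanup than do it by hand, here is a rough Python sketch of the same idea. The `html` excerpt and the `city_line` pattern are my own illustration, not what I actually ran, and the filter is a little stricter than my manual edit: it keeps only the `<li>` lines that contain a titled link.

```python
import re

# Hypothetical excerpt of the page source; the real page has many more lines.
html = """\
<h2>A</h2>
<ul>
<li><a title="Abbott, Texas" href="/wiki/Abbott,_Texas">Abbott</a></li>
</ul>
<p>Some unrelated paragraph.</p>
<ul>
<li><a title="Abernathy, Texas" href="/wiki/Abernathy,_Texas">Abernathy</a></li>
</ul>
"""

# Keep only list items that wrap a link carrying a title attribute.
city_line = re.compile(r'<li><a [^>]*title="[^"]*"')
kept = [line for line in html.splitlines() if city_line.search(line)]

print("\n".join(kept))
```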
Then, in Notepad++, I used Find/Replace with the RegEx capture feature.
NOTE: Be sure to set the “Search Mode” to “Regular expression” BEFORE clicking “Replace All”.
And of course, when it doesn’t work, press Ctrl+Z to undo and try again.
Let me explain this find “RegEx”:
.* title="(.*)">.*
Then Replace \1
Result:
Abbott, Texas
Abernathy, Texas
Abilene, Texas
Ackerly, Texas
The parentheses indicate the area to capture (I only need to capture one phrase).
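The same find/replace can be sketched in Python with `re.sub`, which also uses `\1` for the first captured group. The two sample lines are an assumption on my part: they put `title` as the last attribute before the `>`, which is what this particular pattern needs in order to capture only the title text.

```python
import re

# Assumed input: one link per line, with title as the last attribute before ">".
lines = """\
<li><a href="/wiki/Abbott,_Texas" title="Abbott, Texas">Abbott</a></li>
<li><a href="/wiki/Abernathy,_Texas" title="Abernathy, Texas">Abernathy</a></li>
"""

# "." does not match a newline by default, so each line is matched
# and replaced on its own, much like "Replace All" in the editor.
result = re.sub(r'.* title="(.*)">.*', r'\1', lines)

print(result)
# Abbott, Texas
# Abernathy, Texas
```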
If I wanted to capture the city as \1 and the state as \2, I could do something like this:
.* title="(.*), (.*)">.*
Then Replace \1 \2
There’s a nice RegEx Cheat Sheet here: http://www.cheatography.com/davechild/cheat-sheets/regular-expressions
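But briefly, the . means any character, and the * means “0 or more”. Then I’m looking for an exact match of the phrase title=" (i.e. the word “title” followed by an equals sign and a double quote). I want to capture the city, which is everything up to the comma. (A city name with a comma in it is a worry in theory, but because .* is greedy the split happens at the last comma-and-space, so it should still come out right. I doubt there are any such names anyway, but who knows…) Then I want to skip a space and capture the next set of characters up to the double quote followed by a greater-than sign, and then match any string of characters to the end of the line.
One caveat: the pattern relies on title being the last attribute before the greater-than sign. If href follows it on the same line, the greedy .* runs on through the href as well. In that case .* title="([^"]*)".* is a safer pattern, because [^"]* cannot cross a double quote.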
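To see the difference between the greedy `.*` and a negated character class, here is a small Python sketch. The `[^"]*` and `[^",]*` patterns are my own variations, not the ones I used in Notepad++, and the sample line has `href` after `title`.

```python
import re

line = '<li><a title="Abbott, Texas" href="/wiki/Abbott,_Texas">Abbott</a></li>'

# Greedy (.*) runs to the last '">' on the line, so it swallows the href too.
greedy = re.sub(r'.* title="(.*)">.*', r'\1', line)

# [^"]* cannot cross a double quote, so the capture stops where the title ends.
tight = re.sub(r'.* title="([^"]*)".*', r'\1', line)

# Two groups: the city stops at the first comma, the state follows ", ".
split = re.sub(r'.* title="([^",]*), ([^"]*)".*', r'\1 \2', line)

print(greedy)  # Abbott, Texas" href="/wiki/Abbott,_Texas
print(tight)   # Abbott, Texas
print(split)   # Abbott Texas
```

Note that `[^",]*` splits at the first comma rather than the last, so a city name containing a comma would come out differently than with the greedy version.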
Then in the “Replace with” textbox, I simply enter \1 (or \1 \2 if capturing city and state separately; I could also put \1/\2 or \1-\2 with my own desired separator, or \1\2 for no separator at all).
With \1 \2, the result is:
Abbott Texas
Abernathy Texas
Abilene Texas
Ackerly Texas
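For trying out different separators in a script, something like this Python sketch would do. The helper name `city_state` and the `TITLE` pattern (which uses `[^"]*` in place of my original `.*`) are made up for illustration.

```python
import re

# Hypothetical pattern: city before the first comma, state up to the closing quote.
TITLE = re.compile(r'.* title="([^",]*), ([^"]*)".*')

def city_state(line: str, sep: str = " ") -> str:
    """Replace a link line with 'city<sep>state'."""
    # A function replacement means sep is inserted literally,
    # so a backslash in sep is not treated as an escape.
    return TITLE.sub(lambda m: m.group(1) + sep + m.group(2), line)

line = '<li><a title="Abilene, Texas" href="/wiki/Abilene,_Texas">Abilene</a></li>'
print(city_state(line))        # Abilene Texas
print(city_state(line, "/"))   # Abilene/Texas
print(city_state(line, "-"))   # Abilene-Texas
print(city_state(line, ""))    # AbileneTexas
```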