Capturing Text Using RegEx in PowerShell

Regular expressions go all the way back to 1956. I think I first saw them in Perl, but today in 2015 they are very useful in PowerShell, C#, and almost every other language.

Regular Expressions have three main purposes:
1. Validate if text conforms to a pattern
2. Capture (extract) a string (or series of strings) from another string
3. Replace a pattern with a new text.
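
Here is a minimal sketch of all three uses side by side in PowerShell. The sample string and the variable names are made up for illustration only.

```powershell
# Hypothetical sample string, used only for illustration.
$text = 'Order 12345 shipped to Irving TX'

# 1. Validate: -match returns $true or $false.
$isValid = $text -match '^Order \d+'

# 2. Capture: after a successful -match, $Matches holds the captured groups.
$orderNumber = $null
if ($text -match 'Order (\d+)') {
    $orderNumber = $Matches[1]
}

# 3. Replace: -replace returns a new string with every match substituted.
$masked = $text -replace '\d+', '#####'

Write-Host "Valid=$isValid Order=$orderNumber Masked=$masked"
```

The rest of this post only deals with the capture case.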

In this post, I’ll be discussing #2 on the list above. Here is the function I created.

Function GetCityStateFromKeyword([String] $keyword)
{
   #Write-Host "GetCityStateFromKeyword" 
   $pattern = "in (.*?)($|.?\d{5}?.?)"
   #the first set of parentheses captures the city/state into $Matches 
   #the second set of parentheses is needed to look for $ (which is end of line) 
   #or a zip code following the city/state 
   $isMatch = $keyword -match $pattern 
   $returnCityState = "ERROR" 
   #Write-Host "GetCityStateFromKeyword RegEx `$Matches.Count=$($Matches.Count)" 
   #test $isMatch rather than $Matches.Count: a failed -match leaves $Matches 
   #holding the results of the previous successful match 
   if ($isMatch) 
      {
         $returnCityState = $Matches[1]
      }
   
   return $returnCityState
}

First, let's look at some of my sample data:

best deals on appliances in Irving TX 75039 
best deals on appliances in Irving TX 75039 bestbuy 
best deals on appliances in Irving TX
best deals on appliances in Irving TX 75039-1234 

$myKeyword = "best deals on appliances in Irving TX 75039"
$cityState = GetCityStateFromKeyword $myKeyword
Write-Host "City/State=$cityState" 

My goal was to extract (or capture) the city and state from each of the above lines, or return the string "ERROR" if none is found. The result I want for any of the lines above is simply:

Irving TX 
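
To check that claim against every sample line, a loop like the following can be used. This is only a sketch: Get-CityStateSketch is a hypothetical condensed copy of the function (so the block runs on its own), and the last sample line is invented to exercise the "ERROR" path.

```powershell
# Hypothetical condensed copy of the function so this block is self-contained.
# Single quotes are used so PowerShell never tries to expand the $ in the pattern.
function Get-CityStateSketch([string] $keyword) {
    if ($keyword -match 'in (.*?)($|.?\d{5}?.?)') { return $Matches[1] }
    return 'ERROR'
}

$samples = @(
    'best deals on appliances in Irving TX 75039',
    'best deals on appliances in Irving TX 75039 bestbuy',
    'best deals on appliances in Irving TX',
    'best deals on appliances in Irving TX 75039-1234',
    'no location here'   # invented line without "in " to show the ERROR path
)

# Collect one result per sample line, then print them.
$results = foreach ($s in $samples) { Get-CityStateSketch $s }
$results | ForEach-Object { Write-Host "City/State=$_" }
```

The first four lines should each give Irving TX, and the invented fifth line should give ERROR.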

Now, let's look in detail at my pattern.

   $pattern = "in (.*?)($|.?\d{5}?.?)"

Parentheses have two purposes in RegEx:
1. Specify what to capture
2. Wrap around a list of alternatives, which are delimited by the pipe symbol.

.* means 0 or more characters, and adding the question mark to make it .*? makes it "non-greedy". If you don't specify the non-greedy operator (the question mark), then the .* will "suck up" as much text as it can, which here means everything through the zip code to the end of the line. So, (.*?) will capture just my city and state.
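
The difference is easy to see by running the same keyword through a greedy and a non-greedy version of the pattern. This is a small sketch; the variable names $greedy and $lazy are just for illustration.

```powershell
$text = 'best deals on appliances in Irving TX 75039'

# Greedy: .* runs to the end of the line, so the zip code lands inside group 1.
$null = $text -match 'in (.*)($|.?\d{5}?.?)'
$greedy = $Matches[1]

# Non-greedy: .*? stops at the first point where the rest of the pattern can match.
$null = $text -match 'in (.*?)($|.?\d{5}?.?)'
$lazy = $Matches[1]

Write-Host "Greedy='$greedy' NonGreedy='$lazy'"
```

With the greedy version, group 1 comes back as Irving TX 75039; with the non-greedy version it is Irving TX.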

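What does ($|.?\d{5}?.?) mean? It is a list of two alternatives separated by the pipe symbol. The first alternative, $, matches the end of the line. In the second, \d stands for a digit and \d{5} means exactly 5 digits, i.e. the zip code. The ? mark unfortunately has two meanings. After another quantifier, as in .*?, it is the non-greedy operator. After a single item, as in .?, it means 0 or 1 of that item (as compared to .* which means 0 or more characters). So the character before and after the zip code is optional, and because of the $ alternative, the zip code itself is optional too. The ? after \d{5} is read as the non-greedy operator, but since {5} always matches exactly 5 digits it has no effect and could be dropped.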
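
A few one-line checks make the quantifiers concrete: .? allows zero or one character, .* allows zero or more, and \d{5}? still requires exactly five digits. The test strings and variable names below are invented for illustration.

```powershell
# .? allows zero or one character after the "a".
$zeroChars   = 'a'   -match '^a.?$'   # $true  (zero characters)
$oneChar     = 'ab'  -match '^a.?$'   # $true  (one character)
$twoCharsOpt = 'abc' -match '^a.?$'   # $false (two characters is too many)

# .* allows zero or more characters after the "a".
$twoCharsAny = 'abc' -match '^a.*$'   # $true

# \d{5}? behaves the same as \d{5}: exactly five digits are required.
$fiveDigits = '75039' -match '^\d{5}?$'   # $true
$fourDigits = '7503'  -match '^\d{5}?$'   # $false
```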

NOTE: My requirement was only for US zip codes. I would have to change the logic to handle Canadian postal codes, which are 6 characters alternating letters and digits, in the form A1A 1A1.

Now let's look at this line of code:

 $isMatch = $keyword -match $pattern 

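$isMatch will be a boolean. $keyword is my parameter to which the RegEx pattern will be applied. -match is the PowerShell operator (case-insensitive by default), and we've already discussed the value of the $pattern variable above. -match itself returns only $true or $false; when it succeeds, PowerShell fills in the automatic variable $Matches, a hashtable keyed by group number. $Matches[0] holds the entire matched text, and each capture from the () provided in the pattern is stored under the next number. Thus, my city/state is stored in $Matches[1]. (Regular arrays in PowerShell are 0-based; the index is 1 here only because 0 is taken by the whole match.) If the match fails, $Matches is left unchanged from the last successful match, so check $isMatch before reading it.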
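
To see what -match actually leaves behind, the entries of $Matches can be inspected one by one. The following is a sketch using the first sample line; the variable names and the 'nothing to find' string are invented for illustration.

```powershell
$keyword = 'best deals on appliances in Irving TX 75039'
$pattern = 'in (.*?)($|.?\d{5}?.?)'   # single quotes: no variable expansion

$isMatch = $keyword -match $pattern   # a boolean, not the captures

$whole = $Matches[0]   # entire matched text
$city  = $Matches[1]   # first set of parentheses
$tail  = $Matches[2]   # second set of parentheses (the zip code part)
Write-Host "0='$whole' 1='$city' 2='$tail'"

# A failed match returns $false and does not clear $Matches,
# so the old capture is still there afterwards.
$failed = 'nothing to find' -match $pattern
$stale  = $Matches[1]
Write-Host "Failed match returned $failed, but `$Matches[1] is still '$stale'"
```

The last two lines show why the function should test the boolean result instead of $Matches.Count.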
