The “joy” of Reg-ex part 1.

One of the good things in PowerShell is support for regular expressions; in fact, I suspect some Unix sys-admins might laugh at their Windows counterparts for not getting to grips with regular expressions sooner.
The downside is that regular expressions are an area which gives a lot of people a serious headache.

So let's start from first principles.

1. Regular expressions are about looking for text that matches a pattern. When matching text is found various follow-up operations can be performed such as replacing it.

2. The patterns defined by regular expressions allow “classes” of characters to be specified, for example “word” characters, digits and spaces.

3. The patterns can specify groups of alternatives or repetitions.

Everything else stems from that: simple. The thing with simple ideas is that we build complex practices on them.

In PowerShell we can use the -match operator to test an expression, so here are some examples:

"cat" -match "at"   returns true: “cat” contains “at”, so we have a match. That was easy. We can also specify alternates.
"dogma" -match "cat|dog"   returns true because the test pattern translates as “cat or dog” (the pipe sign is “or”) and dogma is a match for dog.
"AlleyCat" -match "cat|dog"   also returns true because it contains cat (-match ignores case by default).

If we want to specify the words cat or dog rather than substrings, we can use the first of the special characters: \b means a word boundary.
So "dogma" -match "cat\b|dog\b"   returns false but "Alleycat" -match "cat\b|dog\b"   returns true. We can specify a boundary before as well as after the text to get the exact word.
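The examples so far can be pasted into a PowerShell prompt as they stand. This is only a sketch to experiment with: the last two lines, with \b on both sides and the test string "my cat", are made up for illustration and are not from the examples above.

```powershell
# Each -match returns True or False; the comparison ignores case by default.
"cat"      -match "at"               # True: "cat" contains "at"
"dogma"    -match "cat|dog"          # True: contains "dog"
"AlleyCat" -match "cat|dog"          # True: contains "cat"

# \b after the text means the word has to end there.
"dogma"    -match "cat\b|dog\b"      # False: "dog" is followed by "ma"
"Alleycat" -match "cat\b|dog\b"      # True: "cat" ends the word

# \b on both sides only matches the exact word.
"Alleycat" -match "\b(cat|dog)\b"    # False: "cat" is not a word on its own
"my cat"   -match "\b(cat|dog)\b"    # True
```

The brackets in the last pattern group the alternatives so that both boundaries apply to cat and to dog.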

In regular expressions, the wildcards that most people are used to divide into classes of characters (another place where we use special characters) and repetitions. \w is a word character (a letter, digit or underscore), \s is a whitespace character and \d is a digit. A dot stands for any character at all (other than a line break, by default). Changing case reverses the meaning: \W is any non-word character, \S is any non-space character, and so on.
If we want to specify alternates we can write them as [aeiou], or as [a-z] for a range. We can reverse the selection of alternates with ^, so [^aeiou] is any character that is not a vowel.
"oat" -match "\b[a-z]at" returns true, but replacing the letter o with a zero as in "0at" -match "\b[a-z]at" returns false.

"oat" -match "\b[a-z]at" will only return true if there is exactly 1 letter between the start of the word and “at”, so chat won’t match. We can specify repetition: {2} means exactly 2 repetitions, {2,10} means between 2 and 10 repetitions, and {2,} means at least two. We have short-hands for these: * means any number including 0, + means any non-zero number and ? means zero times or once but no more. Incidentally, if we need to match a character which has a special use (the different kinds of brackets, . * ? and \) we “escape” it by prefixing it with a \

"\b[a-z]at" will find a match with cat and hat, but not at or chat
"\b[a-z]+at" will find a match with cat, hat and chat, but not at
"\b[a-z]?at" will find a match with cat, hat and at, but not chat
"\b[a-z]*at" will find a match with cat, hat, at and chat

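Character classes, repetition and escaping are easier to believe when you see them run. The following is a rough sketch rather than anything definitive: the test strings and the variable names $words and $patterns are invented for illustration. The patterns start with \b so that matching begins at the start of a word; without it, "[a-z]at" would also match chat, because hat sits inside it.

```powershell
# Character classes
"room 101" -match "\d"           # True: contains a digit
"room 101" -match "\s"           # True: contains a space
"room"     -match "\W"           # False: every character is a word character
"r-2"      -match "r.2"          # True: the dot stands for the "-"
"sky"      -match "[aeiou]"      # False: no vowel
"sky"      -match "[^aeiou]"     # True: s is not a vowel
"0at"      -match "\b[a-z]at"    # False: 0 is not in the range a-z

# Repetition: try each pattern against the same four words.
$words    = "at", "cat", "hat", "chat"
$patterns = "\b[a-z]at", "\b[a-z]+at", "\b[a-z]?at", "\b[a-z]*at"
foreach ($pattern in $patterns) {
    # With an array on the left, -match returns the elements that match.
    $hits = $words -match $pattern
    "{0,-12} matches: {1}" -f $pattern, ($hits -join ", ")
}

# Escaping: an unescaped dot matches anything, an escaped one only a dot.
"125" -match "1.5"               # True
"125" -match "1\.5"              # False
"1.5" -match "1\.5"              # True
[regex]::Escape("a.b*")          # a\.b\*
```

The loop should list cat and hat for the first pattern, add chat for the second, swap chat for at in the third, and show all four words for the last. [regex]::Escape() is the .NET helper that adds the backslashes for you.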
This requires unlearning some automatic behaviours we’ve learnt. At most command lines we can use * to mean “any combination of characters, including none”, so A* means “A followed by anything”. In regular expressions "A*" will always match, because it means “containing any number of instances of ‘A’, including none”. In regular expressions the syntax is A.*   (A followed by any character, repeated any number of times). Similarly, where we use ? to stand for “a single character”, in regular expressions we use . (a dot).
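PowerShell has both kinds of pattern, which makes the difference easy to see side by side: the -like operator takes command-line style wildcards, while -match takes regular expressions. A small sketch, with test strings invented for illustration; the ^ and $ used here pin the pattern to the start and end of the text, because -match on its own looks for the pattern anywhere in the string.

```powershell
# Wildcards with -like: * is "anything", ? is "one character".
"Apple"  -like  "A*"       # True: A followed by anything
"Banana" -like  "A*"       # False: does not start with A
"cat"    -like  "c?t"      # True

# Regular expressions with -match.
"xyz"    -match "A*"       # True: zero instances of A still counts
"Apple"  -match "^A.*"     # True: A at the start, then any characters
"Banana" -match "^A.*"     # False
"cat"    -match "^c.t$"    # True: the dot is the single character
```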

There are a couple of other special characters worth noting: ^ means start of line and $ means end of line (in .NET, unless multi-line mode is switched on, that is the start and end of the whole text). These last two are very useful in scripts, where you often need to test for something which begins or ends with a given piece of text. For example, if in PowerShell you declare a variable to hold some text, say $myString = "The cat sat on the mat", the .NET string type has an EndsWith() method, so $myString.EndsWith("at") returns true. Great. Except we often want to do something with the text, like replace it, and PowerShell has a -replace operator too. If we want to say “replace the HTM at the end of a file name with HTML” we can do $myString -replace "HTM$","HTML"   Similarly, if we’re looking at text and we want to cope with trailing spaces, strings have a Trim() method, but regular expressions can cope with punctuation as well: $myString -match "at\W*$"   will match even if there is punctuation and spaces between “at” and the end of the line.
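Here is a short sketch of the anchors at work. The file names and the string with trailing punctuation are made up for illustration.

```powershell
$myString = "The cat sat on the mat"
$myString.EndsWith("at")                      # True
$myString -match "at$"                        # True: "at" at the end
$myString -match "^The"                       # True: "The" at the start
$myString -match "^cat"                       # False: "cat" is not at the start

# Only the HTM at the very end is replaced.
"index.HTM" -replace "HTM$", "HTML"           # index.HTML
"HTM.HTM"   -replace "HTM$", "HTML"           # HTM.HTML

# Trailing punctuation and spaces defeat EndsWith() but not the pattern.
"The cat sat on the mat!  ".EndsWith("at")    # False
"The cat sat on the mat!  " -match "at\W*$"   # True
```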

So far so good. We can also use a -split operator in PowerShell: again, .NET strings have a Split() method, but if we try this
$string = "The cat sat,  on the dog's mat" ; $string.Split(" ")

it will return a blank extra line for each additional space between “sat,” and “on”, and the comma will be welded to sat for good. We could split the text at any non-word character by using -split "\W"   Unfortunately, regular expressions don’t consider ' to be a word character, so 's will be split off from “dog”. This is easily fixed by using $string -split "[^\w']+"   which says: split where you find something which is neither a word character nor an apostrophe, treating multiple occurrences as one.
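The three ways of splitting can be compared directly. This is a sketch which assumes a test string with two spaces after the comma and a plain straight apostrophe; a curly apostrophe would need to go inside the square brackets instead.

```powershell
$string = "The cat sat,  on the dog's mat"

# The string method: an empty entry for the extra space, comma stuck to "sat".
$string.Split(" ")

# Any non-word character: empty entries again, and "dog's" becomes "dog" and "s".
$string -split "\W"

# Runs of anything that is neither a word character nor an apostrophe.
$string -split "[^\w']+"
# The, cat, sat, on, the, dog's, mat (one per line)
```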

The last thing I wanted to mention is one I always have to double check, and that is something called “greedy” / “lazy” behaviour. Suppose I want to change something in the first tag of a piece of XML. I might look for a match on "^<.*>"   which says find the start of the text, then a <, then any other characters and finally a >. This will match everything up to the last > it can reach (the whole document, if it sits on one line), because * finds as many characters as it can before the final >. If we want the fewest characters, the * must be followed by a ? sign, giving "^<.*?>"
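To see what each version actually grabbed, we can look in $Matches, the automatic variable that -match fills in after a successful match. A minimal sketch, assuming a one-line piece of XML invented for illustration:

```powershell
$xml = "<root><child>text</child></root>"

# Greedy: .* runs on to the last > it can find.
$null = $xml -match "^<.*>"
$Matches[0]                          # <root><child>text</child></root>

# Lazy: .*? stops at the first > it can find.
$null = $xml -match "^<.*?>"
$Matches[0]                          # <root>

# So only the lazy version is safe for changing just the first tag.
$xml -replace "^<.*?>", "<newroot>"  # <newroot><child>text</child></root>
```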

In the next post I’ll look at a couple of ways we can put regular expressions to work, including one of the best tools in PowerShell: Select-String.