Regular Expressions via Active Patterns
Edit 2/8/2010: Updating code samples for recent language changes.
In my last post I introduced Active Patterns. Sure, they seem neat, but what do you actually do with them? This post covers how to leverage Active Patterns to make your code simpler and easier to understand, in particular how to use Regular Expressions in Active Patterns.
Our first pair of examples will show how to simplify the code required to extract groups from regular expressions. If you have ever needed to do a lot of string parsing, you probably know how tedious Regular Expressions can be. But with Active Patterns we can make the matching not only straightforward but also a natural fit for the ‘match and extract’ workflow.
We will define two Active Patterns. The first is a Parameterized Active Pattern that returns the captured groups given a regex and an input string. The second takes those groups and returns three subgroup values. The examples parse a simple DateTime string and the version number from the F# Interactive window.
Note how the second Active Pattern, Match3, calls into the first using ‘match (|Match|_|) pat inp with’.
open System
open System.Text.RegularExpressions
(* Returns the captured groups of the first match, if applicable. *)
let (|Match|_|) (pat:string) (inp:string) =
    let m = Regex.Match(inp, pat)
    // Note the List.tail, since the first group is always the entirety of the matched string.
    if m.Success
    then Some (List.tail [ for g in m.Groups -> g.Value ])
    else None

(* Expects the match to find exactly three groups, and returns them. *)
let (|Match3|_|) (pat:string) (inp:string) =
    match (|Match|_|) pat inp with
    | Some (fst :: snd :: trd :: []) -> Some (fst, snd, trd)
    | Some [] -> failwith "Match3 succeeded, but no groups found. Use '(.*)' to capture groups."
    | Some _ -> failwith "Match3 succeeded, but did not find exactly three groups."
    | None -> None
// ----------------------------------------------------------------------------
// DateTime.Now.ToString() = "2/22/2008 3:48:13 PM" (the exact format depends on the current culture)
let month, day, year =
    match DateTime.Now.ToString() with
    // Digits only, so the year group does not swallow the time of day.
    | Match3 @"(\d+)/(\d+)/(\d+)" (a,b,c) -> (a,b,c)
    | _ -> failwith "Match Not Found."

let fsiStartupText = @"
MSR F# Interactive, (c) Microsoft Corporation, All Rights Reserved
F# Version 2.0.0.0, compiling for .NET Framework Version v2.0.50727"

let major, minor, dot =
    match fsiStartupText with
    | Match3 @".*F# Version (\d+)\.(\d+)\.(\d+).*" (a,b,c) -> (a,b,c)
    | _ -> failwith "Match not found."
A More Complex Example
The code seems simple enough, but notice how the Active Pattern removes any need to deal with the System.Text.RegularExpressions namespace at the call site. You simply match the input string against the regex string and get a tuple of values back.
We can take this concept one step further with a more complex example. Consider the task of extracting all the URLs from a webpage. To do this we will use two Active Patterns: one to convert an HTML blob into a list of URLs (the strings from the RegEx MatchCollection) and a second to normalize relative URL paths (“/foo.aspx” => “https://.../foo.aspx”).
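The 'match and get values back' idea pays off most when several input layouts are tried in one match expression. The following is a minimal sketch, not part of the original samples: parseDate is a hypothetical helper, the two date layouts are assumptions chosen for illustration, and Match is redefined here (using Seq.cast) only so the snippet stands on its own.

```fsharp
open System.Text.RegularExpressions

// Same idea as the Match pattern in this post: Some [captured groups] on success, None otherwise.
let (|Match|_|) (pat: string) (inp: string) =
    let m = Regex.Match(inp, pat)
    if m.Success then
        // Skip the first group, which is always the whole matched string.
        m.Groups |> Seq.cast<Group> |> Seq.map (fun g -> g.Value) |> Seq.toList |> List.tail |> Some
    else
        None

// Hypothetical helper (an assumption for illustration): tries two layouts in turn
// and returns (year, month, day) as integers, or None if neither layout matches.
let parseDate (text: string) =
    match text with
    | Match @"^(\d{4})-(\d{2})-(\d{2})$" [ y; m; d ] -> Some (int y, int m, int d)
    | Match @"^(\d{1,2})/(\d{1,2})/(\d{4})" [ m; d; y ] -> Some (int y, int m, int d)
    | _ -> None
```

Matching the group list directly with a list pattern such as [ y; m; d ] means a regex with the wrong number of groups simply falls through to the next case instead of throwing, which is a different trade-off from the failwith calls in Match3.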
(* Ensures that the input string starts with the given prefix. *)
let (|EnsurePrefix|) (prefix:string) (str:string) =
    if not (str.StartsWith(prefix))
    then prefix + str
    else str

(* Returns the text of every match, if there is at least one. *)
let (|Matches|_|) (pat:string) (inp:string) =
    let m = Regex.Matches(inp, pat)
    // Each element of the collection is a whole match (not a group), so keep all of them.
    if m.Count > 0
    then Some [ for g in m -> g.Value ]
    else None

(* Expects the match to find exactly one group, and returns it. *)
let (|Match1|_|) (pat:string) (inp:string) =
    match (|Match|_|) pat inp with
    | Some (fst :: []) -> Some (fst)
    | Some [] -> failwith "Match1 succeeded, but no groups found. Use '(.*)' to capture groups."
    | Some _ -> failwith "Match1 succeeded, but did not find exactly one group."
    | None -> None
// ----------------------------------------------------------------------------
open System.Net
open System.IO
// Returns the HTML from the designated URL
let http (url : string) =
    let req = WebRequest.Create(url)
    // 'use' is equivalent to 'using' in C# for an IDisposable
    use resp = req.GetResponse()
    use stream = resp.GetResponseStream()
    use reader = new StreamReader(stream)
    let html = reader.ReadToEnd()
    html
// Get all URLs from an HTML blob
let getOutgoingUrls html =
    // Regex for URLs
    let linkPat = "href=\"([^\"]*)\""
    match html with
    // The matches are the full strings which our Regex matched, e.g. href="...".
    // Re-apply the pattern to each one to pull out just the captured URL, without the quotes.
    | Matches linkPat urls ->
        urls |> List.map (fun url ->
            match url with
            | Match1 linkPat justUrl -> justUrl
            | _ -> failwith "Unexpected URL format.")
    | _ -> []

// Maps relative URLs to their fully-qualified path.
// Any URL that does not already start with root gets root prepended.
let normalizeRelativeUrls root urls =
    urls |> List.map (fun url -> (|EnsurePrefix|) root url)
// ----------------------------------------------------------------------------
let blogUrl = "https://blogs.msdn.com/chrsmith"
let blogHtml = http blogUrl
printfn "Printing links from %s..." blogUrl
blogHtml
|> getOutgoingUrls
|> normalizeRelativeUrls blogUrl
|> List.iter (fun url -> printfn "%s" url)
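The script above needs network access, so here is a minimal offline sketch of the same extract-then-normalize flow. Everything in it is an assumption for illustration: sampleHtml is a made-up fragment, https://example.com is a placeholder root, and AllFirstGroups is a hypothetical pattern that reads the first capture group of each match directly. EnsurePrefix is repeated (with an ordinal comparison) only to keep the snippet self-contained.

```fsharp
open System
open System.Text.RegularExpressions

// Single-case active pattern: always succeeds, yielding the string with the prefix ensured.
let (|EnsurePrefix|) (prefix: string) (str: string) =
    if str.StartsWith(prefix, StringComparison.Ordinal) then str else prefix + str

// Hypothetical pattern: the first capture group of every match of pat in inp.
// Assumes pat contains at least one capture group.
let (|AllFirstGroups|) (pat: string) (inp: string) =
    Regex.Matches(inp, pat)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value)
    |> Seq.toList

// Made-up HTML fragment standing in for a downloaded page.
let sampleHtml =
    "<a href=\"/about.aspx\">About</a> <a href=\"https://example.com/contact\">Contact</a>"

let links =
    match sampleHtml with
    | AllFirstGroups "href=\"([^\"]*)\"" urls ->
        // EnsurePrefix used in pattern position, directly in the lambda's argument.
        urls |> List.map (fun (EnsurePrefix "https://example.com" url) -> url)
```

Here links evaluates to a two-element list: the relative "/about.aspx" gains the root prefix, while the already-absolute link is left unchanged. Using EnsurePrefix inside the lambda's argument shows one thing a single-case Active Pattern can do that a plain function call cannot: transform a value at the point where it is bound.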
As you can see, Active Patterns are a powerful addition to F# and definitely something to keep in mind when you find yourself writing a lot of repetitive code.
Comments
Anonymous
March 27, 2008
In response to an earlier post a reader wrote: Sent From: namin Subject: (|EnsurePrefix|) -- why is it
Anonymous
June 08, 2009
I'm trying to get a grip on active patterns, but in this example I'm having a hard time grasping how their use has a benefit over functions. In particular, it seems to me that the (|Match1|) and the (|EnsurePrefix|) patterns could have just as easily been written as functions and it wouldn't change anything. What have I missed?
Anonymous
June 12, 2009
Great question. I've addressed the difference between single-case active patterns and function calls in: http://blogs.msdn.com/chrsmith/archive/2008/03/27/single-case-active-patterns-vs-function-calls.aspx If that doesn't clear things up let me know. Cheers!