Share via


Writing a Regular Expression parser in Haskell: Part 3

The third module in the simple regular expression parser is called: NFAtoDFA.  Which as you might have guessed, takes the NFA that resulted from the first module and converts it into a DFA.  The structure that the DFA uses is the same that the NFA uses since they are both finite state machines.

 

Converting an NFA to a DFA requires mapping sets of nodes in the NFA to a single node in the DFA.  Many nodes in a NFA will correspond to one node in the DFA.  Making this change requires updating transitions to point to and from sets of nodes.  To manage this transformation I create a state monad using the following context:

 

    1: -- The state which we pass to build the DFA
    2: data ConvertContext = ConvertContext { nfa :: FiniteMachine,
    3:                                        trans :: [Transition],
    4:                                        setMap :: Map.Map (Set Node) Integer,
    5:                                        setStack :: [Set Node],
    6:                                        begin :: Node,
    7:                                        accept :: Set Node,
    8:                                        nextNode :: Node
    9:                                      } deriving (Show, Eq)
   10: type ConvertState a = State ConvertContext a

 

Most of the code in this module is just managing this context and updating it according to two operation:

  1. Epsilon Closure
  2. Set Move

These are explained in more detail in this article.

Basically, epsilon closure is the process of taking a set of initial nodes and returning a new set of all nodes you can traverse to purely on epsilon transitions.  To help with this I created some smaller methods to build up to an epsilon closure.

First are a couple methods (findToNodes and closure):

    1: closure trans value nodes oldSet = Set.union (findToNodes trans value nodes) oldSet
    2:  
    3: -- Search the table of transitions to find all nodes you can reach given an initial set of nodes
    4: findToNodes trans value fromNodes = foldr match Set.empty trans
    5:     where 
    6:       match (from, to, val) nodes
    7:           | (from == fromNodes) && (val == value) = Set.insert to nodes
    8:           | otherwise = nodes

findToNodes searches a transition table for all nodes which go from any node in fromNodes on value.  It will builds up a set with all the to  nodes that match. 

closure wraps findToNodes to let us easily union together an initial set and the nodes we can reach from that set.

With this in hand we can write clearly a epsilon closure function:

    1: -- Given an initial set of nodes, find the set of all nodes you can reach by taking 
    2: -- transitions on epsilon only
    3: epsilonClosure trans nodes = foldUntilRepeat Set.union Set.empty $
    4:                              iterate (Set.fold (closure trans epsilon) Set.empty) nodes
    5:  
    6:  

 

This function takes full advantage of the lazy nature of Haskell.  It repeats the closure on epsilon over and over and streams its results into our function foldUntilRepeat.   This method does what it says, it will fold the values that are streamed in until it sees the same value twice. 

 

The set move is just combination of what you have already seen:

    1: -- Given a starting set of nodes the set of all nodes that you can reach on a given value
    2: -- This includes epislonClosure on the desitination nodes
    3: moveClosure trans value nodes = epsilonClosure trans $ 
    4:                                 Set.fold (closure trans value) Set.empty nodes
    5:  

 

With these functions in hand, this module just becomes calling them and updating the context until we have no more nodes in the NFA to process.

 

In the next installment I will discuss using the output of this modules to match a regex against a string.

Also, once again the code is attached.

NFAtoDFA.hs

Comments

  • Anonymous
    June 09, 2008
    The first module in my simple regular expression parse is called RegexToNFA. This module exposes the