แชร์ผ่าน


A Simple Lexer in C# That Uses Regular Expressions

Lately, I’ve been working on a (personal) project to build a DSL using Vaughan Pratt’s top down operator precedence parsing technique, which I discovered through Douglas Crockford’s paper on the same subject (and through my friend Matt Manela). Because Crockford’s paper leaves the lexer as an exercise for the reader, I set out to write a very simple lexer in C# that would provide tokens to my parser.

Design

I chose to use regular expressions for token definitions rather than a more traditional, character-based scanner; my time developing ColorCode (which is essentially a specialized lexer) left a mental imprint for a regular expression-based lexer and I decided to run with it. The token definitions, then, are just a collection of regular expressions wrapped with an associated token type identifier (e.g., “operator” or “literal”) and a flag indicating whether they should be ignored:

     public class TokenDefinition
    {

        public bool IsIgnored { get; set; }
        public Regex Regex { get; set; }
        public string Type { get; set; }
    }

In Crockford’s paper, the lexer provides an array of tokens to the parser, but I want this lexer to fit C#’s paradigms as best I can. To that end, I think the lexer should return IEnumerable<Token>. This, then, is my lexer’s contract:

     public interface ILexer
    {
        void AddDefinition(TokenDefinition tokenDefinition);
        IEnumerable<Token> Tokenize(string source);
    }

 

Each token will have a type, value, and position (that is, its position within the source code):

     public class Token
    {
        public TokenPosition Position { get; set; }
        public string Type { get; set; }
        public string Value { get; set; }
    }
     public class TokenPosition
    {
        public int Column { get; set; }
        public int Index { get; set; }
        public int Line { get; set; }
    }

That’s the entire contract. It’s design is very simple, as you’d expect.

Implementation

The tokenize method will test the current source code position against each definition’s regular expression until it finds a match. When it finds a match, it will yield the matched token. If it doesn’t find a match, it will throw an exception for an unrecognized symbol. It will also track the current index, line number, and column position within the source code as it tokenizes. It will finally yield an end token when it reaches the end of the source code.

     public IEnumerable<Token> Tokenize(string source)
    {
        int currentIndex = 0;
        int currentLine = 1;
        int currentColumn = 0;

        while (currentIndex < source.Length)
        {
            TokenDefinition matchedDefinition = null;
            int matchLength = 0;

            foreach (var rule in tokenDefinitions)
            {
                var match = rule.Regex.Match(source, currentIndex);

                if (match.Success && (match.Index - currentIndex) == 0)
                {
                    matchedDefinition = rule;
                    matchLength = match.Length;
                    break;
                }
            }

            if (matchedDefinition == null)
            {
                throw new Exception(string.Format("Unrecognized symbol '{0}' at index {1} (line {2}, column {3}).", source[currentIndex], currentIndex, currentLine, currentColumn));
            }
            else
            {
                var value = source.Substring(currentIndex, matchLength);

                if (!matchedDefinition.IsIgnored)
                    yield return new Token(matchedDefinition.Type, value, new TokenPosition(currentIndex, currentLine, currentColumn));

                var endOfLineMatch = endOfLineRegex.Match(value);
                if (endOfLineMatch.Success)
                {
                    currentLine += 1;
                    currentColumn = value.Length - (endOfLineMatch.Index + endOfLineMatch.Length);
                }
                else
                {
                    currentColumn += matchLength;
                }

                currentIndex += matchLength;
            }
        }

        yield return new Token("(end)", null, new TokenPosition(currentIndex, currentLine, currentColumn));
    }

Simple, as promised.

Because it evaluates the token definitions top to bottom, you can use the order for precedence (if you care about the precedence of lexical definitions; I haven’t encountered that need myself yet).

The complete source is available for download.

SimpleLexer.zip

Comments

  • Anonymous
    November 11, 2010
    Thanks for posting the code.  Was simple and came in useful.

  • Anonymous
    April 08, 2011
    What license is this Lexer released under? I'd love to use it for some code here at work.

  • Anonymous
    July 23, 2012
    Thank you for posting the source code.

  • Anonymous
    August 05, 2014
    Thanks for uploading the source code. It helped a lot.

  • Anonymous
    March 21, 2015
    The comment has been removed

  • Anonymous
    June 05, 2015
    Hi Drew, I've used such an approach in the past A small remark - unless you add the TokenDefinitions in the right order, you may not return  the longest match eg  < and <= An alternative is not to break when a match is found, but try all patterns and return the longest match. There's a performance penalty with such an approach

  • Anonymous
    June 05, 2015
    Sorry - too fast - too little thinking Think using negative lookaheads is the solution so that < is not matched when a = follows