Jaa


Building a Simple Recursive Descent Parser (Completed Simple Parser)

In this post, I complete the recursive descent parser that will parse the simple grammar that I presented previously in this series.  In this post, I’ll examine in detail the NospaceExpression class, which needs to implement operator precedence.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCNote:  I have to apologize for the delay in writing this post.  First summer vacation took some time, and then I became involved in a project of writing two papers – one that summarizes SharePoint application development, and another that summarizes Office Client application development.  I’ve been literally obsessing about every possible way that you as a developer can extend SharePoint Server 2010, and every way that you can extend the Office 2010 client applications.  The SharePoint paper is nearing publication.  The Office Client paper will be along in a bit.  While obsessing, it has been difficult for me to mentally switch tracks back to this series.  But here we go again…

This post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.  You can find the complete list of posts here.

The NospaceExpression Class

Here is the entire implementation of the NospaceExpression class.  I explained all of the code that has a gray background in the previous post, Building a Simple Recursive Descent Parser (continued).

public static NospaceExpression Produce(IEnumerable<Symbol> symbols)
{
// nospace-expression = open-parenthesis expression close-parenthesis
// / numerical-constant
// / prefix-operator expression
// / expression infix-operator expression

if (!symbols.Any())
return null;

if (symbols.First() is OpenParenthesis && symbols.Last() is CloseParenthesis)
{
Expression e = Expression.Produce(symbols.Skip(1).SkipLast(1));
if (e != null)
return new NospaceExpression(new OpenParenthesis(), e, new CloseParenthesis());
}

// expression, infix-operator, expression
var z = symbols.Rollup(0, (t, d) =>
{
if (t is OpenParenthesis)
return d + 1;
if (t is CloseParenthesis)
return d - 1;
return d;
});
var symbolsWithIndex = symbols.Select((s, i) => new
{
Symbol = s,
Index = i,
});
var z2 = symbolsWithIndex.Zip(z, (v1, v2) => new
{
SymbolWithIndex = v1,
Depth = v2,
});
var operatorList = z2
.Where(x => x.Depth == 0 &&
x.SymbolWithIndex.Index != 0 &&
InfixOperator.Produce(x.SymbolWithIndex.Symbol) != null)
.ToList();
if (operatorList.Any())
{
int minPrecedence = operatorList
.Select(o2 => OperatorPrecedence[o2.SymbolWithIndex.Symbol.ToString()]).Min();
var op = operatorList
.Last(o2 => OperatorPrecedence[o2.SymbolWithIndex.Symbol.ToString()] == minPrecedence);
if (op != null)
{
var expressionTokenList1 = symbols.TakeWhile(t => t != op.SymbolWithIndex.Symbol);
Expression e1 = Expression.Produce(expressionTokenList1);
if (e1 == null)
throw new ParserException("Invalid expression");
var expressionTokenList2 = symbols
.SkipWhile(t => t != op.SymbolWithIndex.Symbol).Skip(1);
Expression e2 = Expression.Produce(expressionTokenList2);
if (e2 == null)
throw new ParserException("Invalid expression");
InfixOperator io = new InfixOperator(op.SymbolWithIndex.Symbol);
return new NospaceExpression(e1, io, e2);
}
}

NumericalConstant n = NumericalConstant.Produce(symbols);
if (n != null)
return new NospaceExpression(n);

PrefixOperator p = PrefixOperator.Produce(symbols.FirstOrDefault());
if (p != null)
{
Expression e = Expression.Produce(symbols.Skip(1));
if (e != null)
return new NospaceExpression(p, e);
}

return null;
}

Let’s work through the new code.  Let’s consider how the new code behaves when parsing the following formula:

"1+((2+3)*4)^5"

When determining if an expression has an infix operator, we specifically need to ignore operators that are in parentheses.  Those operators will have their chance to be recognized when NospaceExpression.Produce sees that an expression starts with an open parenthesis and ends with a close parenthesis, as processed by this code:

if (!symbols.Any())
return null;

if (symbols.First() is OpenParenthesis && symbols.Last() is CloseParenthesis)
{
Expression e = Expression.Produce(symbols.Skip(1).SkipLast(1));
if (e != null)
return new NospaceExpression(new OpenParenthesis(), e, new CloseParenthesis());
}

So we need to ignore operators that are inside parentheses.  The easiest way to determine if an operator is inside of parentheses is to use the Rollup extension method, as follows:

// expression, infix-operator, expression
var z = symbols.Rollup(0, (t, d) =>
{
if (t is OpenParenthesis)
return d + 1;
if (t is CloseParenthesis)
return d - 1;
return d;
});

This use of the Rollup extension method returns a new collection of integers that has the same number of elements in it as the source collection.  The integer is the depth of parentheses at that point.  In the following snippet, I add a bit of code to print the items in the collection, and then exit:

// expression, infix-operator, expression
var z = symbols.Rollup(0, (t, d) =>
{
if (t is OpenParenthesis)
return d + 1;
if (t is CloseParenthesis)
return d - 1;
return d;
});
foreach (var item in z)
Console.WriteLine(item);
Environment.Exit(0);

When I run this example for the test formula "1+((2+3)*4)^5", I see:
0
0
1
2
2
2
2
1
1
1
0
0
0

When looking for an infix operator, I can ignore all operators where the depth is not zero.

Later on, I’m going to need the index of the symbol of the infix operator that we find.  The next projection uses symbols as its source, and projects an anonymous type that includes the symbol and its index:

var symbolsWithIndex = symbols.Select((s, i) => new
{
Symbol = s,
Index = i,
});
foreach (var item in symbolsWithIndex)
Console.WriteLine(item);
Environment.Exit(0);

When I run this for the test formula, I see:

{ Symbol = 1, Index = 0 }
{ Symbol = +, Index = 1 }
{ Symbol = (, Index = 2 }
{ Symbol = (, Index = 3 }
{ Symbol = 2, Index = 4 }
{ Symbol = +, Index = 5 }
{ Symbol = 3, Index = 6 }
{ Symbol = ), Index = 7 }
{ Symbol = *, Index = 8 }
{ Symbol = 4, Index = 9 }
{ Symbol = ), Index = 10 }
{ Symbol = ^, Index = 11 }
{ Symbol = 5, Index = 12 }

And next, use the Zip extension method to zip together the last two queries to create a list of items that contains:

·         The symbol

·         Its index

·         Its parenthesis depth

var z = symbols.Rollup(0, (t, d) =>
{
if (t is OpenParenthesis)
return d + 1;
if (t is CloseParenthesis)
return d - 1;
return d;
});
var symbolsWithIndex = symbols.Select((s, i) => new
{
Symbol = s,
Index = i,
});
var z2 = symbolsWithIndex.Zip(z, (v1, v2) => new
{
SymbolWithIndex = v1,
Depth = v2,
});
foreach (var item in z2)
Console.WriteLine(item);
Environment.Exit(0);

This produces:

{ SymbolWithIndex = { Symbol = 1, Index = 0 }, Depth = 0 }
{ SymbolWithIndex = { Symbol = +, Index = 1 }, Depth = 0 }
{ SymbolWithIndex = { Symbol = (, Index = 2 }, Depth = 1 }
{ SymbolWithIndex = { Symbol = (, Index = 3 }, Depth = 2 }
{ SymbolWithIndex = { Symbol = 2, Index = 4 }, Depth = 2 }
{ SymbolWithIndex = { Symbol = +, Index = 5 }, Depth = 2 }
{ SymbolWithIndex = { Symbol = 3, Index = 6 }, Depth = 2 }
{ SymbolWithIndex = { Symbol = ), Index = 7 }, Depth = 1 }
{ SymbolWithIndex = { Symbol = *, Index = 8 }, Depth = 1 }
{ SymbolWithIndex = { Symbol = 4, Index = 9 }, Depth = 1 }
{ SymbolWithIndex = { Symbol = ), Index = 10 }, Depth = 0 }
{ SymbolWithIndex = { Symbol = ^, Index = 11 }, Depth = 0 }
{ SymbolWithIndex = { Symbol = 5, Index = 12 }, Depth = 0 }

Next, we can look for any infix operators where the Depth == 0.  In addition, we filter out operators that are found in the very first position, because that will not be an infix operator, but instead a prefix operator.

// expression, infix-operator, expression
var z = symbols.Rollup(0, (t, d) =>
{
if (t is OpenParenthesis)
return d + 1;
if (t is CloseParenthesis)
return d - 1;
return d;
});
var symbolsWithIndex = symbols.Select((s, i) => new
{
Symbol = s,
Index = i,
});
var z2 = symbolsWithIndex.Zip(z, (v1, v2) => new
{
SymbolWithIndex = v1,
Depth = v2,
});
var operatorList = z2
.Where(x => x.Depth == 0 &&
x.SymbolWithIndex.Index != 0 &&
InfixOperator.Produce(x.SymbolWithIndex.Symbol) != null)
.ToList();
foreach (var item in operatorList)
Console.WriteLine(item);
Environment.Exit(0);

This produces a list of the two possible operators that are part of an expression that uses an infix operator:

{ SymbolWithIndex = { Symbol = +, Index = 1 }, Depth = 0 }
{ SymbolWithIndex = { Symbol = ^, Index = 11 }, Depth = 0 }

Next, of the operators that could possibly be the infix operator for this expression, we find the minimum precedence of any operators.  When implementing a top-down parser, we process operators with the least precedence before processing operators of higher precedence.  When looking at the expression tree, operators with the least precedence are nearer to the root of the tree.  Operators with the highest precedence will be closer to the leaves, so that they will be evaluated before the operators with lower precedence.  If this isn’t clear, I think that it will be when you see the entire parse tree.  In this particular case, we need to find the + operator, not the ^ operator, as + has lower precedence.

To enable processing of precedence of operators, I defined a small dictionary of operators and their precedence:

public static Dictionary<string, int> OperatorPrecedence = new Dictionary<string, int>()
{
{ "^", 3 },
{ "*", 2 },
{ "/", 2 },
{ "+", 1 },
{ "-", 1 },
};

In the following listing, the highlighted line finds the minimum precedence of the operators that are in play.

// expression, infix-operator, expression
var z = symbols.Rollup(0, (t, d) =>
{
if (t is OpenParenthesis)
return d + 1;
if (t is CloseParenthesis)
return d - 1;
return d;
});
var symbolsWithIndex = symbols.Select((s, i) => new
{
Symbol = s,
Index = i,
});
var z2 = symbolsWithIndex.Zip(z, (v1, v2) => new
{
SymbolWithIndex = v1,
Depth = v2,
});
var operatorList = z2
.Where(x => x.Depth == 0 &&
x.SymbolWithIndex.Index != 0 &&
InfixOperator.Produce(x.SymbolWithIndex.Symbol) != null)
.ToList();
if (operatorList.Any())
{
int minPrecedence = operatorList
.Select(o2 => OperatorPrecedence[o2.SymbolWithIndex.Symbol.ToString()]).Min();
var op = operatorList
.Last(o2 => OperatorPrecedence[o2.SymbolWithIndex.Symbol.ToString()] == minPrecedence);
if (op != null)
{
var expressionTokenList1 = symbols.TakeWhile(t => t != op.SymbolWithIndex.Symbol);
Expression e1 = Expression.Produce(expressionTokenList1);
if (e1 == null)
&nbs

Comments

  • Anonymous
    December 29, 2010
    A good point was made by a reader - this example uses the Zip extension method, which is included in .NET 4.0.  It builds and runs as-is using Visual Studio 2010.  However, if you are using Visual Studio 2008, you need to include the Zip extension method.  A simple implementation can be found at: blogs.msdn.com/.../comparing-two-open-xml-documents-using-the-zip-extension-method.aspx The implementation of the extension method is about half way down the post. -Eric

  • Anonymous
    February 10, 2011
    This code assumes IEnumerable<Expression> is IEnumerable<Symbol> (true in C#4, not in C#3) to run this code in VS2008, you need to change the constructor for Symbol:        public Symbol(params Object[] symbols) {            List<Symbol> ls = new List<Symbol>();            foreach (var item in symbols) {                if (item is Symbol)                    ls.Add((Symbol)item);                else if (item is IEnumerable<Symbol>)                    foreach (var item2 in (IEnumerable<Symbol>)item)                        ls.Add(item2);                else if (item is IEnumerable)                    foreach (var item2 in (IEnumerable)item)                        ls.Add(item2 as Symbol);                else                    // If this error is thrown, the parser is coded incorrectly.                    throw new ParserException("Internal error");            }            ConstituentSymbols = ls;        }

  • Anonymous
    February 10, 2013
    Thanks for the post Eric. The above source deals with all the simple calculations (shown above in valid formulas). But what if they have cell references. Like (=SUM(A1:A5)) OR =A5+B5. if i put any of them in the valid formula list, it is throwing exceptions. It is not treating as valid formula. Can you suggest a way to overcome this as well ? Thanks and Regards, Kishor

  • Anonymous
    February 10, 2013
    Hi Kishore, To extend as you suggest it is necessary to add the appropriate productions to the grammar and then write theropriate code. I still have designs on writing a cool parser for excel formulas.  All in good time, I hope. -Eric

  • Anonymous
    April 15, 2014
    Not sure if this blog is still being monitored... I'm keen to take this parser one step further by adding support for mathematical functions, e.g. "1 + Sqr(4) * Abs(-3)"? Can you provide any pointers as to where to start? Thanks in advance.

  • Anonymous
    April 15, 2014
    Silly question - I've just realised that this code only parses expressions, and doesn't actually evaluate them. What would be involved in implementing this functionality?

  • Anonymous
    May 06, 2014
    This code is tripped up by expressions containing multiple terms. E.g. "(1+2)+(3+4)" becomes "1+2)+(3+4" under the influence of: if (symbols.First() is OpenParenthesis && symbols.Last() is CloseParenthesis) {  Expression e = Expression.Produce(symbols.Skip(1).SkipLast(1));  if (e != null)    return new NospaceExpression(new OpenParenthesis(), e, new CloseParenthesis()); } This problem requires an additional test to recognize this circumstance then skip over this code. I created a method HasMultipleTerms() that does this by counting parentheses.

  • Anonymous
    February 05, 2015
    Same questions- looks like this code only parses expressions, is it possible to get the result after processing the expressions. It will be helpful to see the implementation.