Jaa


Rollup Extension Method: Create Running Totals using LINQ to Objects

Recently, I had need for a new extension method for IEnumerable<T>, which I call "Rollup".  This extension method is sort of a cross between the Select and Aggregate extension methods.  Like Select, when using this extension method you write a lambda expression to project the new value in a new collection.  Unlike Select, in the projection lambda expression, you are passed the projected value for the immediately preceding element.  The previous value for the first element in the projected collection is a seed value that you pass to the Rollup extension method.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCThis extension method has use in two situations.  First, it allows you to write a query that produces a running total.  Second, it allows you to process elements in context.  I'll explain more about what I mean by this, but first, let's look at the simpler use, which is to produce a running total.

int[] a = new [] { 3, 1, 4, 1, 5, 9 };
var runningTotal = a.Rollup(0, (s, x) => s + x);
foreach (var i in runningTotal)
Console.WriteLine(i);

This outputs:

3
4
8
9
14
23

The first argument to Rollup is the seed (0, in this case).  The second argument is a lambda expression with two arguments.  The first argument (s) is the current aggregation value, and the second argument (x) the element in the source collection.

Here is the implementation of the Rollup extension method.  It is pretty simple:

public static IEnumerable<TResult> Rollup<TSource, TResult>(
this IEnumerable<TSource> source,
TResult seed,
Func<TSource, TResult, TResult> projection)
{
TResult nextSeed = seed;
foreach (TSource src in source)
{
TResult projectedValue = projection(src, nextSeed);
nextSeed = projectedValue;
yield return projectedValue;
}
}

The resulting collection has exactly the same number of elements in it as the source collection, so if you want to produce a new collection that has the original values and the running total, you can use the Zip extension method, which was introduced with version 4.0 of the .NET framework.  If you are working with C# 3.0 and .NET 3.5, you can find a super simple implementation of Zip in this blog post.

int[] a = new [] { 3, 1, 4, 1, 5, 9 };
var runningTotal = a.Rollup(0, (s, x) => s + x);
var b = a.Zip(runningTotal, (a1, a2) => new
{
Data = a1,
RunningTotal = a2,
});
foreach (var item in b)
Console.WriteLine(item);

And as you might expect, this produces:

{ Data = 3, RunningTotal = 3 }
{ Data = 1, RunningTotal = 4 }
{ Data = 4, RunningTotal = 8 }
{ Data = 1, RunningTotal = 9 }
{ Data = 5, RunningTotal = 14 }
{ Data = 9, RunningTotal = 23 }

Well, this is interesting, but this isn't the situation that prompted me to write Rollup.  I need to parse some Open XML WordprocessingML fields (see section 17.16 in IS29500), and those fields have syntax like this:

DATE @ "dd, MM" h

When I break this up into tokens, the tokens should be:

Token 1

DATE

Token 2

@

Token 3

"dd, MM"

Token 4

h

In other words, I need to break on spaces when not in a quoted token, but don't break on spaces within a quoted token.  To break this string into tokens, I first define an enum:

public enum TokenState
{
InWhiteSpace,
InToken,
OnOpeningQuote,
InQuotedToken,
OnClosingQuote,
};

I can then write a query that processes characters of the string in context:

string s = @" DATE @ ""dd, MM"" h";
var tokenStates = s.Rollup(TokenState.InWhiteSpace,
(c, previousState) =>
{
switch (previousState)
{
case TokenState.InWhiteSpace:
if (c == '"')
return TokenState.OnOpeningQuote;
if (!Char.IsWhiteSpace(c))
return TokenState.InToken;
break;
case TokenState.InToken:
if (c == '"')
return TokenState.OnOpeningQuote;
if (Char.IsWhiteSpace(c))
return TokenState.InWhiteSpace;
break;
case TokenState.OnOpeningQuote:
return TokenState.InQuotedToken;
case TokenState.InQuotedToken:
if (c == '"')
return TokenState.OnClosingQuote;
break;
case TokenState.OnClosingQuote:
return TokenState.InWhiteSpace;
}
return previousState;
});
foreach (var item in tokenStates)
Console.WriteLine(item);

In this example, the seed is the starting state (TokenState.InWhiteSpace) for this tiny lexical parser.  Then, each projection is responsible for returning the state to be passed to the element following, given the current state and the current character.

This example produces:

InWhiteSpace
InToken
InToken
InToken
InToken
InWhiteSpace
InToken
InToken
InWhiteSpace
OnOpeningQuote
InQuotedToken
InQuotedToken
InQuotedToken
InQuotedToken
InQuotedToken
InQuotedToken
OnClosingQuote
InWhiteSpace
InToken
InToken

Again, we can use the Zip extension method to combine the state with the characters in the source string:

string s = @" DATE @ ""dd, MM"" h";
var tokenStates = s.Rollup(TokenState.InWhiteSpace,
(c, previousState) =>
{
switch (previousState)
{
case TokenState.InWhiteSpace:
if (c == '"')
return TokenState.OnOpeningQuote;
if (!Char.IsWhiteSpace(c))
return TokenState.InToken;
break;
case TokenState.InToken:
if (c == '"')
return TokenState.OnOpeningQuote;
if (Char.IsWhiteSpace(c))
return TokenState.InWhiteSpace;
break;
case TokenState.OnOpeningQuote:
return TokenState.InQuotedToken;
case TokenState.InQuotedToken:
if (c == '"')
return TokenState.OnClosingQuote;
break;
case TokenState.OnClosingQuote:
return TokenState.InWhiteSpace;
}
return previousState;
});
var characters = s.Zip(tokenStates, (a1, a2) => new
{
C = a1,
State = a2,
});
foreach (var item in characters)
Console.WriteLine(item);

Now the example outputs:

{ C = , State = InWhiteSpace }
{ C = D, State = InToken }
{ C = A, State = InToken }
{ C = T, State = InToken }
{ C = E, State = InToken }
{ C = , State = InWhiteSpace }
{ C = , State = InToken }
{ C = @, State = InToken }
{ C = , State = InWhiteSpace }
{ C = ", State = OnOpeningQuote }
{ C = d, State = InQuotedToken }
{ C = d, State = InQuotedToken }
{ C = ,, State = InQuotedToken }
{ C = , State = InQuotedToken }
{ C = M, State = InQuotedToken }
{ C = M, State = InQuotedToken }
{ C = ", State = OnClosingQuote }
{ C = , State = InWhiteSpace }
{ C = , State = InToken }
{ C = h, State = InToken }

For demonstration purposes, I oversimplified this use of rollup, because I will need to handle escaped quotes in quoted strings, escaped quotes outside of quoted strings, etc.  I wanted to keep this as simple as possible when explaining this use of Rollup.  When implementing somewhat more complex semantics, the exact same principles apply, but the code will be a little longer.

Now that we have this information about each character in the string, we can use the GroupAdjacent extension method to group together the characters of tokens.  Along the way, we filter out the opening and closing quotes, as well as white space.

string s = @" DATE @ ""dd, MM"" h";
var tokenStates = s.Rollup(TokenState.InWhiteSpace,
(c, previousState) =>
{
switch (previousState)
{
case TokenState.InWhiteSpace:
if (c == '"')
return TokenState.OnOpeningQuote;
if (!Char.IsWhiteSpace(c))
return TokenState.InToken;
break;
case TokenState.InToken:
if (c == '"')
return TokenState.OnOpeningQuote;
if (Char.IsWhiteSpace(c))
return TokenState.InWhiteSpace;
break;
case TokenState.OnOpeningQuote:
return TokenState.InQuotedToken;
case TokenState.InQuotedToken:
if (c == '"')
return TokenState.OnClosingQuote;
break;
case TokenState.OnClosingQuote:
return TokenState.InWhiteSpace;
}
return previousState;
});
var characters = s.Zip(tokenStates, (a1, a2) => new
{
C = a1,
State = a2,
});
var charactersGroupedByToken = characters.GroupAdjacent(c => c.State);
var tokens = charactersGroupedByToken
.Where(g => g.Key != TokenState.InWhiteSpace &&
g.Key != TokenState.OnOpeningQuote &&
g.Key != TokenState.OnClosingQuote)
.Select(g =>
g.Aggregate(new StringBuilder(), (sb, c) => sb.Append(c.C), sb => sb.ToString()));
foreach (var item in tokens)
Console.WriteLine("Token >{0}<", item);

This outputs:

Token >DATE<
Token >@<
Token >dd, MM<
Token >h<

This is what I wanted.  It didn't take a lot of code, and realizes the benefits of functional programming, which is that it is easy to debug.

After parsing in this fashion, I very well may add a ToArray call at the end, so that subsequent processing of the token list doesn't need to go back to the string.

After I develop the complete version of an Open XML WordprocessingML field parser, I'll blog it.

Comments

  • Anonymous
    February 24, 2010
    Cool. Nice post. I love this kind of posts where I can get new ideas and ways to look and solve everyday problems in software development :)

  • Anonymous
    February 28, 2010
    Thanks, dbsteppr.  This is one of those topics where I wondered why I didn't figure this out about a year ago! -Eric