Retrieving the Two Code/Comment Groups
Now we want to retrieve each block of code and comments as a separate group.
The problem is that the GroupBy operator doesn't do what we want. It reads the whole sequence and puts every element that has the same key into a single group, no matter where the element appears in the document. It would join our two groups of code (which we want to keep separate).
For instance, if we amend the query, as follows:
var paraGroups =
    wordDoc
    .Element(w + "body")
    .Descendants(w + "p")
    .Select(p =>
        new {
            ParagraphNode = p,
            Style = GetParagraphStyle(p),
            ParaText =
                p
                .Elements(w + "r")
                .Elements(w + "t")
                .StringConcatenate(t => (string)t)
        }
    )
    .Select(p =>
        new {
            ParagraphNode = p.ParagraphNode,
            Style = p.Style,
            CommentOrCode =
                p.Style == "Code" ||
                p.Style == "CommentText",
            ParaText = p.ParaText
        }
    )
    .GroupBy(g => g.CommentOrCode);

foreach (var g in paraGroups)
{
    Console.WriteLine("========");
    foreach (var p in g)
    {
        Console.WriteLine(p.ParaText);
    }
}
Then we see:
========
This is a heading.
This is some normal test.
See the following code for an example of how to do something:
This is more text.
========
using System;
<Test SnipId="000101" TestId="0001" Lang="C#9">
<!-- validation instructions go here -->
</Test>
using System.Collections.Generic;
using System.Text;
using System.Query;
using System.Xml.XLinq;
using System.Data.DLinq;
namespace WordMLReader
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Hello");
}
}
}
using System.Text;
<Test SnipId="000201" TestId="0002" Lang="C#9">
<!-- validation instructions go here -->
</Test>
using System.Query;
using System.Xml.XLinq;
using System.Data.DLinq;
namespace WordMLReader
{
class Program
{
This grouped all of our code and comments together, and grouped the non-code/comments together, which is not what we want.
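The same effect can be reproduced without a Word document. The following is a minimal sketch, assuming the released System.Linq namespace rather than the preview namespaces used in this post; the GroupByDemo class and its sample data are hypothetical stand-ins for the paragraph styles in document order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupByDemo
{
    // Hypothetical sample data: paragraph styles in document order.
    // There are two separate runs of "Code" paragraphs.
    public static readonly string[] Styles =
        { "Normal", "Code", "Code", "Normal", "Code", "Code" };

    // Groups by whether the style is "Code" and reports
    // each group's key and element count as "key:count".
    public static List<string> Summarize()
    {
        return Styles
            .GroupBy(s => s == "Code")
            .Select(g => g.Key + ":" + g.Count())
            .ToList();
    }
}
```

Calling GroupByDemo.Summarize() yields only two entries, "False:2" and "True:4": the two separate runs of code paragraphs have been merged into one group of four.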
By the way, you will notice that we devolved into some loops:
foreach (var g in paraGroups)
{
    Console.WriteLine("========");
    foreach (var p in g)
    {
        Console.WriteLine(p.ParaText);
    }
}
For the interested, those loops could have been written like this:
paraGroups
    .Select(
        g =>
            "========" +
            Environment.NewLine +
            g.Select(
                i => i.ParaText +
                    Environment.NewLine
            )
            .StringConcatenate()
    )
    .ForEach(s =>
        Console.WriteLine("{0}", s)
    );
However, I am not convinced that the functional approach is better here! It certainly is harder to read. Besides, using foreach to iterate over a collection is still declarative. It just uses a different style of notation.
As it turns out, there isn't a standard query operator that does exactly what we want. We want an operator that starts a new group each time a field changes from one element to the next, and that doesn't merge groups just because they share a key. So let's write one. First, we need a ChangeGroup class that we can iterate through for each grouping. We'll derive it from List<T>, and specify that it implements IGrouping<int, T>:
public class ChangeGroup<T> : List<T>, IGrouping<int, T>
{
    private int key;
    public int Key {
        get {
            return key;
        }
        set {
            key = value;
        }
    }
}
Now, we'll implement the GroupOnChange operator:
public static IEnumerable<IGrouping<int, T>> GroupOnChange<T, K>(
    this IEnumerable<T> source,
    Func<T, K> changeFieldSelector)
{
    int count = 0;
    ChangeGroup<T> cg = null;
    K lastChangeField = default(K);
    bool haveLastChangeField = false;
    List<IGrouping<int, T>> newGroupList =
        new List<IGrouping<int, T>>();
    foreach (var t in source)
    {
        var changeField = changeFieldSelector(t);
        if (!haveLastChangeField ||
            !changeField.Equals(lastChangeField))
        {
            cg = new ChangeGroup<T>();
            newGroupList.Add(cg);
            cg.Key = count++;
            cg.Add(t);
            lastChangeField = changeField;
            haveLastChangeField = true;
        }
        else
        {
            cg.Add(t);
            lastChangeField = changeField;
        }
    }
    foreach (var g in newGroupList)
        yield return g;
}
To use this operator, we pass it a lambda that selects a value from each element. Whenever that value changes from one element to the next, the operator starts a new group. This operator is implemented using imperative code, but there are no side-effects. It simply projects a sequence of groups, each of which contains a sequence of elements of type T.
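To make the behavior concrete, here is a small sketch of a group-on-change operator applied to a plain string array. It is not the implementation from this post: it assumes the released System.Linq namespace, it yields each group as soon as the next change is seen instead of collecting all groups in a list first, and it compares values with EqualityComparer<K>.Default so that a null selector result does not throw. The names Run, RunGrouping, GroupOnChangeLazy, and Demo are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical group type: a list of elements plus the ordinal of the run.
public class Run<T> : List<T>, IGrouping<int, T>
{
    public int Key { get; }
    public Run(int key) { Key = key; }
}

public static class RunGrouping
{
    // Starts a new group each time the selected value differs from the
    // value selected for the previous element. Groups are yielded one at
    // a time, each only after it is complete.
    public static IEnumerable<IGrouping<int, T>> GroupOnChangeLazy<T, K>(
        this IEnumerable<T> source,
        Func<T, K> changeFieldSelector)
    {
        var comparer = EqualityComparer<K>.Default;
        Run<T> current = null;
        K last = default(K);
        int count = 0;
        foreach (var item in source)
        {
            K field = changeFieldSelector(item);
            if (current == null || !comparer.Equals(field, last))
            {
                // The previous run (if any) is finished; hand it out.
                if (current != null)
                    yield return current;
                current = new Run<T>(count++);
            }
            current.Add(item);
            last = field;
        }
        // Hand out the final run, unless the source was empty.
        if (current != null)
            yield return current;
    }

    // Formats each group as "key:elements" for hypothetical sample styles.
    public static List<string> Demo()
    {
        string[] styles = { "Normal", "Code", "Code", "Normal", "Code" };
        return styles
            .GroupOnChangeLazy(s => s == "Code")
            .Select(g => g.Key + ":" + string.Join(",", g))
            .ToList();
    }
}
```

RunGrouping.Demo() returns four entries, "0:Normal", "1:Code,Code", "2:Normal", and "3:Code". The two runs of code stay separate, which is the behavior GroupBy could not give us.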
Note: you can write this extension method using only declarative code, but it is quite warped and inverted, IMHO.
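For the curious, one possible loop-free formulation is sketched below. It is only an illustration under stated assumptions, not the version alluded to above: it uses the released System.Linq namespace, returns plain sequences rather than IGrouping instances, and buffers the source in a list so that it can look back at the previous element. The names DeclarativeGrouping, GroupOnChangeDeclarative, and Demo are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeclarativeGrouping
{
    // Returns runs of adjacent elements whose selected value is equal,
    // expressed with query operators only (no explicit loops).
    public static IEnumerable<IEnumerable<T>> GroupOnChangeDeclarative<T, K>(
        this IEnumerable<T> source,
        Func<T, K> changeFieldSelector)
    {
        var comparer = EqualityComparer<K>.Default;

        // Pair each element with its index and its selected value.
        var keyed = source
            .Select((item, index) =>
                new { item, index, key = changeFieldSelector(item) })
            .ToList();

        // An element starts a run if it is the first element, or if its
        // value differs from the value of the element before it.
        var starts = keyed
            .Where(x => x.index == 0 ||
                        !comparer.Equals(x.key, keyed[x.index - 1].key))
            .Select(x => x.index)
            .ToList();

        // Each run extends from its start index to the next start index
        // (or to the end of the list for the last run).
        return starts.Select((start, n) =>
            keyed
                .Skip(start)
                .Take((n + 1 < starts.Count ? starts[n + 1] : keyed.Count) - start)
                .Select(x => x.item));
    }

    // Formats each run of a hypothetical sample sequence as a comma-separated string.
    public static List<string> Demo()
    {
        return new[] { 1, 1, 2, 2, 2, 1 }
            .GroupOnChangeDeclarative(n => n)
            .Select(run => string.Join(",", run))
            .ToList();
    }
}
```

DeclarativeGrouping.Demo() returns "1,1", "2,2,2", and "1". The index arithmetic needed to find where each run ends is a fair example of why the imperative version is easier to follow.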
The whole program is:
using System;
using System.Collections.Generic;
using System.Text;
using System.Query;
using System.Xml.XLinq;

namespace WordMLReader
{
    public delegate void VoidFunc<T0>(T0 a0);

    public class ChangeGroup<T> : List<T>, IGrouping<int, T>
    {
        private int key;
        public int Key {
            get {
                return key;
            }
            set {
                key = value;
            }
        }
    }

    public static class MySequence {
        public static void ForEach<T>(
            this IEnumerable<T> source,
            VoidFunc<T> func)
        {
            foreach (var i in source)
                func(i);
        }

        public static string GetPath(this XElement el)
        {
            return
                el
                .SelfAndAncestors()
                .Aggregate("",
                    (seed, i) =>
                        i.Name.LocalName + "/" + seed
                );
        }

        public static string StringConcatenate(
            this IEnumerable<string> source)
        {
            StringBuilder sb = new StringBuilder();
            foreach (var s in source)
                sb.Append(s);
            return sb.ToString();
        }

        public static string StringConcatenate<T>(
            this IEnumerable<T> source,
            Func<T, string> projectionFunc
        )
        {
            StringBuilder sb = new StringBuilder();
            foreach (var s in source)
                sb.Append(projectionFunc(s));
            return sb.ToString();
        }

        public static IEnumerable<IGrouping<int, T>>
            GroupOnChange<T, K>(
            this IEnumerable<T> source,
            Func<T, K> changeFieldSelector)
        {
            int count = 0;
            ChangeGroup<T> cg = null;
            K lastChangeField = default(K);
            bool haveLastChangeField = false;
            List<IGrouping<int, T>> newGroupList =
                new List<IGrouping<int, T>>();
            foreach (var t in source)
            {
                var changeField = changeFieldSelector(t);
                if (!haveLastChangeField ||
                    !changeField.Equals(lastChangeField))
                {
                    cg = new ChangeGroup<T>();
                    newGroupList.Add(cg);
                    cg.Key = count++;
                    cg.Add(t);
                    lastChangeField = changeField;
                    haveLastChangeField = true;
                }
                else
                {
                    cg.Add(t);
                    lastChangeField = changeField;
                }
            }
            foreach (var g in newGroupList)
                yield return g;
        }
    }
    class Program
    {
        readonly static XNamespace aml =
            "http://schemas.microsoft.com/aml/2001/core";
        readonly static XNamespace w =
            "http://schemas.microsoft.com/office/word/2003/wordml";
        readonly static XNamespace wsp =
            "http://schemas.microsoft.com/office/word/2003/wordml/sp2";
        static string GetParagraphStyle(XElement p)
        {
            return
                p
                .Elements(w + "pPr")
                .Elements(w + "pStyle")
                .Any()
                ?
                    (string)(p
                        .Elements(w + "pPr")
                        .Elements(w + "pStyle")
                        .First()
                        .Attribute(w + "val"))
                :
                    "Default";
        }

        static void Main(string[] args)
        {
            XElement wordDoc = XElement.Load("CodeInDoc.xml");
            var paraGroups =
                wordDoc
                .Element(w + "body")
                .Descendants(w + "p")
                .Select(p =>
                    new {
                        ParagraphNode = p,
                        Style = GetParagraphStyle(p),
                        ParaText =
                            p
                            .Elements(w + "r")
                            .Elements(w + "t")
                            .StringConcatenate(t => (string)t)
                    }
                )
                .Select(p =>
                    new {
                        ParagraphNode = p.ParagraphNode,
Comments
- Anonymous
March 01, 2007
Thanks for an interesting article series. The creation of a List followed by "foreach (...) yield return g" in GroupOnChange() seems counter to the whole idea of "yield return" though. I'd remove the list and insert the code "if(cg != null) yield return cg;" right before "cg = new ChangeGroup<T>();" and replace the foreach at the end with "yield return cg;". That way you yield one group at a time without temporary storage. Or is there some benefit to doing all the grouping at once?