Retrieving the Two Code Groups

There are two groups of paragraphs in our document that are styled as "Code".  The first group contains the C# code that we want to test.  The second group contains a single paragraph that is the output of the code in the first group.  Next in the process of formulating our query, we want to retrieve each block of code as a separate group.

The problem is that the GroupBy extension method doesn't do what we want.  It groups all items with the same key together, regardless of whether they are separated by other items in the collection.  It would join our two groups of code, which we want to keep separate.

For instance, if we amend the code to group the paragraphs, adding one more query to the bottom of our string of queries, as follows:

string defaultStyle =
    (string)styleDoc.Root
        .Elements(w + "style")
        .Where(style =>
            (string)style.Attribute(w + "type") == "paragraph" &&
            (string)style.Attribute(w + "default") == "1")
        .First()
        .Attribute(w + "styleId");

var paragraphs =
    mainPartDoc.Root
        .Element(w + "body")
        .Descendants(w + "p")
        .Select(p =>
        {
            string style = GetParagraphStyle(p);
            string styleName = style == null ? defaultStyle : style;
            return new
            {
                ParagraphNode = p,
                Style = styleName
            };
        }
    );

XName r = w + "r";
XName ins = w + "ins";

var paragraphsWithText =
    paragraphs.Select(p =>
        new
        {
            ParagraphNode = p.ParagraphNode,
            Style = p.Style,
            Text = p.ParagraphNode
                .Elements()
                .Where(z => z.Name == r || z.Name == ins)
                .Descendants(w + "t")
                .StringConcatenate(s => (string)s)
        }
    );

var groupedCodeParagraphs =
    paragraphsWithText.GroupBy(p => p.Style);

foreach (var group in groupedCodeParagraphs)
{
    Console.WriteLine("Group of paragraphs styled {0}", group.Key);
    foreach (var p in group)
        Console.WriteLine("{0} {1}",
            p.Style != null ?
                p.Style.PadRight(12) :
                "".PadRight(12),
            p.Text);
    Console.WriteLine();
}

Then we see:

Group of paragraphs styled Heading1
Heading1 Parsing WordprocessingML with LINQ to XML

Group of paragraphs styled Normal
Normal The following example prints to the console.
Normal This example produces the following output:

Group of paragraphs styled Code
Code using System;
Code
Code class Program {
Code public static void Main(string[] args) {
Code Console.WriteLine("Hello World");
Code }
Code }
Code
Code Hello World

This grouped the "Hello World" with the code, which is not what we want.
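
The same effect can be reproduced without a document at all.  The following is a minimal sketch, assuming a made-up array of style names that mirrors the order of the paragraphs in the sample document; GroupByDemo and Summarize are hypothetical names used only for this illustration.  It uses nothing but the standard GroupBy operator:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupByDemo
{
    // Returns one "key: count" line per group produced by the standard GroupBy.
    public static List<string> Summarize(IEnumerable<string> styles)
    {
        return styles
            .GroupBy(s => s)
            .Select(g => g.Key + ": " + g.Count())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical style sequence mirroring the sample document's paragraphs.
        string[] styles = { "Heading1", "Normal", "Code", "Code", "Normal", "Code" };

        // GroupBy yields one group per distinct key, so the two separate
        // runs of "Code" end up in a single group of three.
        foreach (string line in Summarize(styles))
            Console.WriteLine(line);
    }
}
```

There are three groups here, not five, because GroupBy keys on the value alone and ignores position.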

As it turns out, there isn't a standard query operator that does exactly what we want.  We want an operator that groups only adjacent items with a common key.  So let's write one.  In addition to the GroupAdjacent extension method, we need a GroupOfAdjacent class that we can iterate through for each grouping.  It only takes a couple dozen lines of code to implement this.

GroupAdjacent is lazy.  It uses deferred execution: the source is not iterated until the results are iterated.

As the source for each group is iterated, GroupAdjacent populates a List<TSource> with the elements for that group, so it uses somewhat more memory than, say, the Where extension method, which never holds on to objects in the source collection.  However, this is the correct behavior, because a caller may iterate a group more than once.  The following code is valid and should give the expected results:

int[] ia = new int[] { 1, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0 };

var groups = ia.GroupAdjacent(i => i);

foreach (var g in groups)
{
    Console.WriteLine("Group {0}", g.Key);
    foreach (var i in g)
        Console.WriteLine(i);

    // it is perfectly valid to iterate through the group more than once.
    foreach (var i in g)
        Console.WriteLine(i);

    Console.WriteLine();
}

To use GroupAdjacent, we pass it a lambda that selects a key for each item.  Whenever the key changes from one item to the next, the operator starts a new group.  GroupAdjacent then returns a sequence of groups, each of which contains a sequence of items of type TSource.
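
The key selector and the lazy behavior described above can be seen together in a small, self-contained sketch.  Everything in it is an assumption made for illustration: GroupAdjacentSketch is a condensed stand-in for GroupAdjacent that yields key/list pairs rather than GroupOfAdjacent objects, and CountingSource is a made-up source that counts how many items have been pulled from it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupAdjacentSketchDemo
{
    // Condensed stand-in for GroupAdjacent: yields (key, items) pairs rather
    // than GroupOfAdjacent objects, purely to keep this sketch short.
    public static IEnumerable<KeyValuePair<TKey, List<T>>> GroupAdjacentSketch<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        var comparer = EqualityComparer<TKey>.Default;
        List<T> list = null;
        TKey last = default(TKey);
        foreach (T item in source)
        {
            TKey k = keySelector(item);
            // A change of key closes the current group.
            if (list != null && !comparer.Equals(k, last))
            {
                yield return new KeyValuePair<TKey, List<T>>(last, list);
                list = null;
            }
            if (list == null)
                list = new List<T>();
            list.Add(item);
            last = k;
        }
        if (list != null)
            yield return new KeyValuePair<TKey, List<T>>(last, list);
    }

    public static int ItemsRead;

    // Hypothetical source that counts how many items have been pulled from it.
    public static IEnumerable<string> CountingSource()
    {
        foreach (string s in new[] { "apple", "avocado", "banana", "blueberry", "apricot" })
        {
            ItemsRead++;
            yield return s;
        }
    }

    public static void Main()
    {
        // The key is the first letter; a new group starts whenever it changes.
        var groups = CountingSource().GroupAdjacentSketch(s => s[0]);
        Console.WriteLine("Items read after building the query: {0}", ItemsRead);

        foreach (var g in groups)
            Console.WriteLine("{0}: {1} (items read so far: {2})",
                g.Key, string.Join(", ", g.Value), ItemsRead);
    }
}
```

Building the query reads nothing from the source.  Each group is produced only once the first item of the next group has been read, so "apricot" lands in a third group of its own even though it shares a key with the first group.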

The whole program is attached to this page.  Here is the listing:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public class GroupOfAdjacent<TSource, TKey> :
    IEnumerable<TSource>, IGrouping<TKey, TSource>
{
    public TKey Key { get; set; }
    private List<TSource> GroupList { get; set; }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return ((System.Collections.Generic.IEnumerable<TSource>)this).GetEnumerator();
    }

    System.Collections.Generic.IEnumerator<TSource>
        System.Collections.Generic.IEnumerable<TSource>.GetEnumerator()
    {
        foreach (var s in GroupList)
            yield return s;
    }

    public GroupOfAdjacent(List<TSource> source, TKey key)
    {
        GroupList = source;
        Key = key;
    }
}

public static class LocalExtensions
{
    public static string GetPath(this XElement el)
    {
        return
            el
            .AncestorsAndSelf()
            .Aggregate("", (seed, i) => i.Name.LocalName + "/" + seed);
    }

    public static string StringConcatenate(
        this IEnumerable<string> source)
    {
        return source.Aggregate(
            new StringBuilder(),
            (s, i) => s.Append(i),
            s => s.ToString());
    }

    public static string StringConcatenate<T>(
        this IEnumerable<T> source,
        Func<T, string> projectionFunc)
    {
        return source.Aggregate(
            new StringBuilder(),
            (s, i) => s.Append(projectionFunc(i)),
            s => s.ToString());
    }

    public static IEnumerable<IGrouping<TKey, TSource>> GroupAdjacent<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        // The default comparer handles null keys, where k.Equals(last) would throw.
        EqualityComparer<TKey> comparer = EqualityComparer<TKey>.Default;
        TKey last = default(TKey);
        bool haveLast = false;
        List<TSource> list = new List<TSource>();

        foreach (TSource s in source)
        {
            TKey k = keySelector(s);
            if (haveLast)
            {
                if (!comparer.Equals(k, last))
                {
                    // The key changed: emit the finished group and start a new one.
                    yield return new GroupOfAdjacent<TSource, TKey>(list, last);
                    list = new List<TSource>();
                    list.Add(s);
                    last = k;
                }
                else
                {
                    list.Add(s);
                    last = k;
                }
            }
            else
            {
                // First item in the source.
                list.Add(s);
                last = k;
                haveLast = true;
            }
        }
        // Emit the final group, if the source was not empty.
        if (haveLast)
            yield return new GroupOfAdjacent<TSource, TKey>(list, last);
    }
}

class Program
{
    // The WordprocessingML namespace URI uses the http scheme, not https.
    readonly static XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static XDocument LoadXDocument(OpenXmlPart part)
    {
        XDocument xdoc;
        using (StreamReader streamReader = new StreamReader(part.GetStream()))
            xdoc = XDocument.Load(XmlReader.Create(streamReader));
        return xdoc;
    }

    public static string GetParagraphStyle(XElement para)
    {
        return (string)para.Elements(w + "pPr")
                           .Elements(w + "pStyle")
                           .Attributes(w + "val")
                           .FirstOrDefault();
    }

    static void Main(string[] args)
    {
        const string filename = "SampleDoc.docx";

        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open(filename, true))
        {
            MainDocumentPart mainPart = wordDoc.MainDocumentPart;
            StyleDefinitionsPart stylePart = mainPart.StyleDefinitionsPart;
            XDocument mainPartDoc = LoadXDocument(mainPart);
            XDocument styleDoc = LoadXDocument(stylePart);

            string defaultStyle =
                (string)styleDoc.Root
                    .Elements(w + "style")
                    .Where(style =>
                        (string)style.Attribute(w + "type") == "paragraph" &&
                        (string)style.Attribute(w + "default") == "1")
                    .First()
                    .Attribute(w + "styleId");

            var paragraphs =
                mainPartDoc.Root
                    .Element(w + "body")
                    .Descendants(w + "p")
                    .Select(p =>
                    {
                        string style = GetParagraphStyle(p);
                        string styleName = style == null ? defaultStyle : style;
                        return new
                        {
                            ParagraphNode = p,
                            Style = styleName
                        };
                    }
                );

            XName r = w + "r";
            XName ins = w + "ins";

            var paragraphsWithText =
                paragraphs.Select(p =>
                    new
                    {
                        ParagraphNode = p.ParagraphNode,
                        Style = p.Style,
                        Text = p.ParagraphNode
                            .Elements()
                            .Where(z => z.Name == r || z.Name == ins)
                            .Descendants(w + "t")
                            .StringConcatenate(s => (string)s)
                    }
                );

            var groupedCodeParagraphs =
                paragraphsWithText.GroupAdjacent(p => p.Style);

            foreach (var group in groupedCodeParagraphs)
            {
                Console.WriteLine("Group of paragraphs styled {0}", group.Key);
                Console.WriteLine("===================");
                foreach (var p in group)
                    Console.WriteLine("{0} {1}",
                        p.Style != null ?
                            p.Style.PadRight(12) :
                            "".PadRight(12),
                        p.Text);
                Console.WriteLine();
            }
        }
    }
}

When we run it, it produces the following output:

Group of paragraphs styled Heading1
===================
Heading1 Parsing WordprocessingML with LINQ to XML

Group of paragraphs styled Normal
===================
Normal The following example prints to the console.

Group of paragraphs styled Code
===================
Code using System;
Code
Code class Program {
Code public static void Main(string[] args) {
Code Console.WriteLine("Hello World");
Code }
Code }
Code

Group of paragraphs styled Normal
===================
Normal This example produces the following output:

Group of paragraphs styled Code
===================
Code Hello World

This is what we want.

RetrievingTheTwoCodeGroups.cs

Comments

  • Anonymous
    April 24, 2008
    Looking at the GroupAdjacent extension method, I thought about two things:
  1. Where the code says else { list.Add(s); last = k; }, that last statement can be omitted, since the else catches all cases where k.Equals(last).
  2. If the foreach is substituted with a while, you can skip two ifs, one of them being executed on every iteration of the loop. With those two changes the code would look similar to the code below (only to show the intention, not meant for production, e.g. not tested :-) )

    public static IEnumerable<IGrouping<TKey, TSource>> GroupAdjacent<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        var list = new List<TSource>();
        var en = source.GetEnumerator();
        if (!en.MoveNext())
        {
            yield break;
        }
        var k = keySelector(en.Current);
        list.Add(en.Current);
        TKey last = k;
        while (en.MoveNext())
        {
            TSource s = en.Current;
            k = keySelector(s);
            if (!k.Equals(last))
            {
                yield return new GroupOfAdjacent<TSource, TKey>(list, last);
                list = new List<TSource> { s };
                last = k;
            }
            else
            {
                list.Add(s);
            }
        }
        yield return new GroupOfAdjacent<TSource, TKey>(list, last);
    }