Ease of Maintenance of LINQ Code

I believe that code written in the functional style is easier to maintain.  For one thing, maintainability is the very reason for many of the characteristics of functional code.  No state is maintained, so we don’t have to worry about corrupting any state.  If a variable is in scope, then it has its value, and it will never have another value.  And the idea of composability is all about being able to inject, surround, and refactor code without making it brittle.  In this post, I’m going to show the process of maintaining and modifying a somewhat more involved query.

In a previous post, I wrote a somewhat (although not very) involved query that searched an Open XML document for specific styles and text.  I identified the incremental changes that I made to the query as I developed it.  After I delivered that code to Bob McClellan, who is building a PowerShell cmdlet using it, he responded with a number of good points.  If we take Select-String as a model for the cmdlet that Bob’s building, then Select-String has some capabilities that my code doesn’t.

· Select-String allows for regular expressions

· Select-String allows us to specify whether the search is case-sensitive

In addition, in my previous query I only matched on the style ID, and not on the style name.  This means that someone would need to look for “Heading1”, and could not specify “heading 1” (with a space in it).  It would be nice to allow matching on either.  We also decided that the issues of case sensitivity and regular expressions apply only to searching for content, not to searching for styles.  Most documents have only a limited number of styles in them, and people know exactly what they want to search for.  Regular expressions and case insensitivity are not so important when searching for style IDs or names.

So now, my task is to modify that previous query, and add these capabilities to it.  As before, I’ll show each set of changes, identified by highlighting.

(Update Feb 22, 2009 - After writing this post, I continued evolving this query, cleaning up the code in this next post.)

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item));
return sb.ToString();
}

public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func, string separator)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item)).Append(separator);
return sb.ToString().Trim(separator.ToCharArray());
}

public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr = new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
}

class Program
{
static bool ContainsAny(string stringToSearch, IEnumerable<string> searchStrings)
{
foreach (var s in searchStrings)
if (stringToSearch.Contains(s))
return true;
return false;
}

static IEnumerable<string> GetInheritedStyles(WordprocessingDocument doc, string styleName)
{
string localStyleName = styleName;
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

yield return styleName;
while (true)
{
XElement style = doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(e => (string)e.Attribute(w + "type") == "paragraph" &&
(string)e.Element(w + "name").Attribute(w + "val") == localStyleName)
.FirstOrDefault();

if (style == null)
yield break;

var basedOn = (string)style
.Elements(w + "basedOn")
.Attributes(w + "val")
.FirstOrDefault();

if (basedOn == null)
yield break;

yield return basedOn;
localStyleName = basedOn;
}
}

static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";
XName r = w + "r";
XName ins = w + "ins";

var defaultStyleName = (string)doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(style =>
(string)style.Attribute(w + "type") == "paragraph" &&
(string)style.Attribute(w + "default") == "1")
.First()
.Attribute(w + "styleId");

var q1 = doc
.MainDocumentPart
.GetXDocument()
.Root
.Element(w + "body")
.Elements()
.Select((p, i) =>
{
var styleNode = p
.Elements(w + "pPr")
.Elements(w + "pStyle")
.FirstOrDefault();
var styleName = styleNode != null ?
(string)styleNode.Attribute(w + "val") :
defaultStyleName;
return new
{
Element = p,
Index = i,
StyleName = styleName
};
}
);

var q2 = q1
.Select(i =>
{
string text = null;
if (i.Element.Name == w + "p")
text = i.Element.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element);
else
text = i.Element
.Descendants(w + "p")
.StringConcatenate(p => p
.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element),
Environment.NewLine
);

return new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = text
};
}
);

var q3 = q2
.Select(i =>
new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = i.Text,
InheritedStyles = GetInheritedStyles(doc, i.StyleName)
.StringConcatenate(s => s, "\t")
}
);

int[] q4 = null;
if (styleSearchString != null)
q4 = q3
.Where(i => ContainsAny(i.InheritedStyles, styleSearchString))
.Select(i => i.Index)
.ToArray();

int[] q5 = null;
if (contentSearchString != null)
q5 = q3
.Where(i => ContainsAny(i.Text, contentSearchString))
.Select(i => i.Index)
.ToArray();

int[] q6 = null;
if (q4 != null && q5 != null)
q6 = q4.Intersect(q5).ToArray();
else
q6 = q5 != null ? q5 : q4;

return q6;
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null);
}

static void Main(string[] args)
{
int[] results = SearchInDocument("Test.docx", new[] { "Heading" }, new[] { "Hello", "aaa" });
foreach (var i in results)
Console.WriteLine(i);
}
}

Step 1

The first thing I need to do is modify the method parameters to allow a regular expression to be specified.  I made two methods out of ContainsAny: ContainsAnyStyles and ContainsAnyContent.  I precompiled all of the regular expressions, which makes sense because each one is tested against every paragraph.  In addition, I modified the StringConcatenate overload that allows for a separator, per Marc’s suggestion on the previous post.

This step was pretty easy – just standard refactoring.  We’ve all done stuff like this before.
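
To see why the separator overload was worth changing, here is a minimal sketch that compares the two ways of dropping the trailing separator. The class and method names (SeparatorDemo, JoinWithTrim, JoinWithLength) are made up for this illustration and are not part of the attached code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SeparatorDemo
{
    // Earlier approach: append the separator after every item, then Trim the
    // separator's characters. Trim treats its argument as a set of characters
    // and works on both ends, so it can also remove characters that belong to
    // the first or last item.
    public static string JoinWithTrim(IEnumerable<string> items, string separator)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string item in items)
            sb.Append(item).Append(separator);
        return sb.ToString().Trim(separator.ToCharArray());
    }

    // Revised approach: remove exactly one trailing separator by shortening
    // the buffer, leaving the items themselves untouched.
    public static string JoinWithLength(IEnumerable<string> items, string separator)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string item in items)
            sb.Append(item).Append(separator);
        // >= (an assumption of this sketch) so a single empty item also loses its separator.
        if (sb.Length >= separator.Length)
            sb.Length -= separator.Length;
        return sb.ToString();
    }

    public static void Main()
    {
        string[] items = { "-a", "b-" };
        Console.WriteLine(JoinWithTrim(items, "-"));   // a-b
        Console.WriteLine(JoinWithLength(items, "-")); // -a-b-
    }
}
```

With a tab separator and style names the difference rarely shows up, but the length-based version does not depend on what the items happen to contain.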

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item));
return sb.ToString();
}

public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func, string separator)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item)).Append(separator);
// >= so that a single empty item still has its trailing separator removed
if (sb.Length >= separator.Length)
sb.Length -= separator.Length;
return sb.ToString();
}

public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr = new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
}

class Program
{
static bool ContainsAnyStyles(string stringToSearch, IEnumerable<string> searchStrings)
{
foreach (var s in searchStrings)
if (stringToSearch.Contains(s))
return true;
return false;
}

static bool ContainsAnyContent(string stringToSearch, IEnumerable<string> searchStrings,
IEnumerable<Regex> regularExpressions, bool isRegularExpression, bool caseInsensitive)
{
if (isRegularExpression)
{
foreach (var r in regularExpressions)
if (r.IsMatch(stringToSearch))
return true;
}
else
{
// Braces matter here: without them, the else would bind to the inner if.
foreach (var s in searchStrings)
if (stringToSearch.Contains(s))
return true;
}
return false;
}

static IEnumerable<string> GetInheritedStyles(WordprocessingDocument doc, string styleName)
{
string localStyleName = styleName;
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

yield return styleName;
while (true)
{
XElement style = doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(e => (string)e.Attribute(w + "type") == "paragraph" &&
(string)e.Element(w + "name").Attribute(w + "val") == localStyleName)
.FirstOrDefault();

if (style == null)
yield break;

var basedOn = (string)style
.Elements(w + "basedOn")
.Attributes(w + "val")
.FirstOrDefault();

if (basedOn == null)
yield break;

yield return basedOn;
localStyleName = basedOn;
}
}

static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString,
bool isRegularExpression, bool caseInsensitive)
{
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";
XName r = w + "r";
XName ins = w + "ins";

RegexOptions options;
Regex[] regularExpressions = null;
if (isRegularExpression)
{
if (caseInsensitive)
options = RegexOptions.IgnoreCase | RegexOptions.Compiled;
else
options = RegexOptions.Compiled;
regularExpressions = contentSearchString
.Select(s => new Regex(s, options)).ToArray();
}

var defaultStyleName = (string)doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(style =>
(string)style.Attribute(w + "type") == "paragraph" &&
(string)style.Attribute(w + "default") == "1")
.First()
.Attribute(w + "styleId");

var q1 = doc
.MainDocumentPart
.GetXDocument()
.Root
.Element(w + "body")
.Elements()
.Select((p, i) =>
{
var styleNode = p
.Elements(w + "pPr")
.Elements(w + "pStyle")
.FirstOrDefault();
var styleName = styleNode != null ?
(string)styleNode.Attribute(w + "val") :
defaultStyleName;
return new
{
Element = p,
Index = i,
StyleName = styleName
};
}
);

var q2 = q1
.Select(i =>
{
string text = null;
if (i.Element.Name == w + "p")
text = i.Element.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element);
else
text = i.Element
.Descendants(w + "p")
.StringConcatenate(p => p
.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element),
Environment.NewLine
);

return new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = text
};
}
);

var q3 = q2
.Select(i =>
new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = i.Text,
InheritedStyles = GetInheritedStyles(doc, i.StyleName)
.StringConcatenate(s => s, "\t")
}
);

int[] q4 = null;
if (styleSearchString != null)
q4 = q3
.Where(i => ContainsAnyStyles(i.InheritedStyles, styleSearchString))
.Select(i => i.Index)
.ToArray();

int[] q5 = null;
if (contentSearchString != null)
q5 = q3
.Where(i => ContainsAnyContent(i.Text, contentSearchString, regularExpressions,
isRegularExpression, caseInsensitive))
.Select(i => i.Index)
.ToArray();

int[] q6 = null;
if (q4 != null && q5 != null)
q6 = q4.Intersect(q5).ToArray();
else
q6 = q5 != null ? q5 : q4;

return q6;
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString,
bool isRegularExpression, bool caseInsensitive)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString,
isRegularExpression, caseInsensitive);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString, bool isRegularExpression, bool caseInsensitive)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null,
isRegularExpression, caseInsensitive);
}

static void Main(string[] args)
{
int[] results = SearchInDocument("Test.docx", new[] { "Heading" }, new[]
{ "h.*o", "aaa" }, true, true);
foreach (var i in results)
Console.WriteLine(i);
}
}

This allows us to search on regular expressions, also supporting case insensitivity.  That was pretty easy.
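
As a small illustration of what the two flags do, the following sketch builds the Regex objects once and then tests a string against them. It is a simplified stand-in for the logic in SearchInDocument, and the names RegexSearchDemo, Compile, and MatchesAny are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSearchDemo
{
    // Build the Regex objects once, up front, so that the per-paragraph
    // test only has to call IsMatch.
    public static Regex[] Compile(string[] patterns, bool caseInsensitive)
    {
        RegexOptions options = RegexOptions.Compiled;
        if (caseInsensitive)
            options |= RegexOptions.IgnoreCase;
        return patterns.Select(p => new Regex(p, options)).ToArray();
    }

    // True if at least one expression matches somewhere in the text.
    public static bool MatchesAny(string text, Regex[] expressions)
    {
        return expressions.Any(r => r.IsMatch(text));
    }

    public static void Main()
    {
        string[] patterns = { "h.*o", "aaa" };
        Console.WriteLine(MatchesAny("Hello world", Compile(patterns, true)));  // True
        Console.WriteLine(MatchesAny("Hello world", Compile(patterns, false))); // False
    }
}
```

The second call prints False because "Hello world" contains no lowercase h, so "h.*o" only matches once IgnoreCase is set.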

Step 2

Next, we want to collect style names in addition to the style IDs, and search for those.  This was a little bit of a re-write of the GetInheritedStyles method.  I renamed the method to GetAllStyleIdsAndNames to accurately reflect its functionality.  I refactored the method ContainsAnyStyles to take an IEnumerable<string> for the list of styles of the paragraph, rather than a concatenated string.  And I modified ContainsAnyContent to allow for case insensitivity when searching for strings that are not regular expressions.
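
The idea of walking the basedOn chain and collecting both IDs and names can be sketched on its own, without the Open XML SDK, against a tiny hand-written styles part. The XML fragment, the class StyleChainDemo, and the method IdsAndNames are all invented for this illustration, and the cycle guard is an extra precaution of the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class StyleChainDemo
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical, trimmed-down styles part: Heading1 is based on Normal.
    public const string StylesXml =
        @"<w:styles xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
            <w:style w:type='paragraph' w:default='1' w:styleId='Normal'>
              <w:name w:val='Normal'/>
            </w:style>
            <w:style w:type='paragraph' w:styleId='Heading1'>
              <w:name w:val='heading 1'/>
              <w:basedOn w:val='Normal'/>
            </w:style>
          </w:styles>";

    // Yield the ID and the name of a style, then of each style it is based on.
    public static IEnumerable<string> IdsAndNames(XDocument styles, string styleId)
    {
        // Guard against a cyclic basedOn chain in a malformed document.
        var seen = new HashSet<string>();
        while (styleId != null && seen.Add(styleId))
        {
            yield return styleId;
            XElement style = styles.Root
                .Elements(w + "style")
                .FirstOrDefault(e =>
                    (string)e.Attribute(w + "type") == "paragraph" &&
                    (string)e.Attribute(w + "styleId") == styleId);
            if (style == null)
                yield break;
            string name = (string)style
                .Elements(w + "name").Attributes(w + "val").FirstOrDefault();
            if (name != null)
                yield return name;
            // Move up the chain; null when the style has no basedOn element.
            styleId = (string)style
                .Elements(w + "basedOn").Attributes(w + "val").FirstOrDefault();
        }
    }

    public static void Main()
    {
        XDocument styles = XDocument.Parse(StylesXml);
        Console.WriteLine(string.Join(", ", IdsAndNames(styles, "Heading1").Distinct()));
        // Heading1, heading 1, Normal
    }
}
```

Distinct is needed because a style such as Normal has the same ID and name, so it would otherwise appear twice.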

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item));
return sb.ToString();
}

public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func, string separator)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item)).Append(separator);
// >= so that a single empty item still has its trailing separator removed
if (sb.Length >= separator.Length)
sb.Length -= separator.Length;
return sb.ToString();
}

public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr = new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
}

class Program
{
static bool ContainsAnyStyles(IEnumerable<string> stylesToSearch,
IEnumerable<string> searchStrings)
{
foreach (var style in stylesToSearch)
foreach (var s in searchStrings)
if (style == s)
return true;
return false;
}

static bool ContainsAnyContent(string stringToSearch, IEnumerable<string> searchStrings,
IEnumerable<Regex> regularExpressions, bool isRegularExpression, bool caseInsensitive)
{
if (isRegularExpression)
{
foreach (var r in regularExpressions)
if (r.IsMatch(stringToSearch))
return true;
}
else
if (caseInsensitive)
{
foreach (var s in searchStrings)
if (stringToSearch.ToLower().Contains(s.ToLower()))
return true;
}
else
{
foreach (var s in searchStrings)
if (stringToSearch.Contains(s))
return true;
}

return false;
}

static IEnumerable<string> GetAllStyleIdsAndNames(WordprocessingDocument doc, string styleId)
{
string localStyleId = styleId;
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

yield return styleId;

string styleNameForFirstStyle = (string)doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(e => (string)e.Attribute(w + "type") == "paragraph" &&
(string)e.Attribute(w + "styleId") == styleId)
.Elements(w + "name")
.Attributes(w + "val")
.FirstOrDefault();

if (styleNameForFirstStyle != null)
yield return styleNameForFirstStyle;

while (true)
{
XElement style = doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(e => (string)e.Attribute(w + "type") == "paragraph" &&
(string)e.Attribute(w + "styleId") == localStyleId)
.FirstOrDefault();

if (style == null)
yield break;

var basedOn = (string)style
.Elements(w + "basedOn")
.Attributes(w + "val")
.FirstOrDefault();

if (basedOn == null)
yield break;

yield return basedOn;

XElement basedOnStyle = doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(e => (string)e.Attribute(w + "type") == "paragraph" &&
(string)e.Attribute(w + "styleId") == basedOn)
.FirstOrDefault();

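// A basedOn value may point at a style that is missing from the part.
string basedOnStyleName = basedOnStyle == null ? null : (string)basedOnStyle
.Elements(w + "name")
.Attributes(w + "val")
.FirstOrDefault();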

if (basedOnStyleName != null)
yield return basedOnStyleName;

localStyleId = basedOn;
}
}

static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString,
bool isRegularExpression, bool caseInsensitive)
{
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";
XName r = w + "r";
XName ins = w + "ins";

RegexOptions options;
Regex[] regularExpressions = null;
if (isRegularExpression)
{
if (caseInsensitive)
options = RegexOptions.IgnoreCase | RegexOptions.Compiled;
else
options = RegexOptions.Compiled;
regularExpressions = contentSearchString
.Select(s => new Regex(s, options)).ToArray();
}

var defaultStyleName = (string)doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(style =>
(string)style.Attribute(w + "type") == "paragraph" &&
(string)style.Attribute(w + "default") == "1")
.First()
.Attribute(w + "styleId");

var q1 = doc
.MainDocumentPart
.GetXDocument()
.Root
.Element(w + "body")
.Elements()
.Select((p, i) =>
{
var styleNode = p
.Elements(w + "pPr")
.Elements(w + "pStyle")
.FirstOrDefault();
var styleName = styleNode != null ?
(string)styleNode.Attribute(w + "val") :
defaultStyleName;
return new
{
Element = p,
Index = i,
StyleName = styleName
};
}
);

var q2 = q1
.Select(i =>
{
string text = null;
if (i.Element.Name == w + "p")
text = i.Element.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element);
else
text = i.Element
.Descendants(w + "p")
.StringConcatenate(p => p
.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element),
Environment.NewLine
);

return new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = text
};
}
);

var q3 = q2
.Select(i =>
new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = i.Text,
InheritedStyles = GetAllStyleIdsAndNames(doc, i.StyleName).Distinct()
}
);

int[] q4 = null;
if (styleSearchString != null)
q4 = q3
.Where(i => ContainsAnyStyles(i.InheritedStyles, styleSearchString))
.Select(i => i.Index)
.ToArray();

int[] q5 = null;
if (contentSearchString != null)
q5 = q3
.Where(i => ContainsAnyContent(i.Text, contentSearchString, regularExpressions,
isRegularExpression, caseInsensitive))
.Select(i => i.Index)
.ToArray();

int[] q6 = null;
if (q4 != null && q5 != null)
q6 = q4.Intersect(q5).ToArray();
else
q6 = q5 != null ? q5 : q4;

return q6;
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString,
bool isRegularExpression, bool caseInsensitive)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString,
isRegularExpression, caseInsensitive);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString, bool isRegularExpression, bool caseInsensitive)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null,
isRegularExpression, caseInsensitive);
}

static void Main(string[] args)
{
int[] results = SearchInDocument("Test.docx", new[] { "Normal" }, new[]
{ "h.*o", "aaa" }, true, false);
foreach (var i in results)
Console.WriteLine(i);
}
}

And with that, we’re done!
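
One possible variation on the plain-string branch of ContainsAnyContent, shown only as a sketch: instead of lower-casing both strings, pass a StringComparison to IndexOf. This assumes ordinal (culture-independent) matching is acceptable, which is not quite the same as what ToLower does in every culture. The names PlainTextSearchDemo and ContainsAny are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlainTextSearchDemo
{
    // True if the text contains any of the search strings, optionally ignoring case.
    public static bool ContainsAny(string text, IEnumerable<string> searchStrings,
        bool caseInsensitive)
    {
        // Assumption: ordinal comparison is what the caller wants.
        StringComparison comparison = caseInsensitive
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;
        return searchStrings.Any(s => text.IndexOf(s, comparison) >= 0);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsAny("Hello world", new[] { "hello" }, true));  // True
        Console.WriteLine(ContainsAny("Hello world", new[] { "hello" }, false)); // False
    }
}
```

This avoids allocating lower-cased copies of the paragraph text for every search string, and the single Any call replaces both foreach loops.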

Code is attached.

Program.cs

Comments

  • Anonymous
    February 20, 2009
    You can simplify further, you know:

        static bool ContainsAny(string stringToSearch, IEnumerable<string> searchStrings)
        {
            return searchStrings.Any(s => stringToSearch.Contains(s));
        }

    ... :) Making as much use of the query operators as possible should make it easier to switch to the Parallel Extensions when the time comes.

  • Anonymous
    February 20, 2009
    Ahh, nice suggestion.  Thanks! -Eric

  • Anonymous
    April 17, 2009
    It is hard to maintain the XML if it has XNamespace!

  • Anonymous
    April 20, 2009
    Hi Jack, Working with namespaces is one of the areas that folks new to XML development have issues with.  Believe me, it's easier with LINQ to XML compared to other XML programming technologies - the issues around namespace prefixes are significantly simplified!  That said, this is one of the issues I spent a lot of time on when I wrote the LINQ to XML documentation.  There are a large number of examples where I included two examples - an example that doesn't use a namespace, and an example that does.  Take a look at these topics:  http://msdn.microsoft.com/en-us/library/bb387093.aspx -Eric