Querying LINQ to XML Nodes in Reverse Document Order with Better Performance
Occasionally I need to query LINQ to XML nodes in reverse document order. I’m currently writing some LINQ to XML queries over Open XML documents where I need to select paragraph nodes based on content in the immediately preceding paragraph. However, nodes in LINQ to XML are forward-linked only. We can see evidence of this in the XNode.NodesBeforeSelf and XElement.ElementsBeforeSelf methods: they return their results in document order, not reverse document order. This was by design. LINQ to XML aims to provide great performance for the vast majority of scenarios with the smallest possible memory footprint. The need to process nodes in reverse document order is rare, so the designers decided that reducing memory footprint was more important than fast reverse-document-order traversal, and that was a good decision. But the need does exist.
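As a quick illustration of that forward-only behavior, consider a toy tree (the element names below are just for illustration): ElementsBeforeSelf called on the last child returns the preceding elements first-to-last, in document order.

using System;
using System.Xml.Linq;

class ForwardOnlyDemo
{
    static void Main()
    {
        XElement root = new XElement("Root",
            new XElement("A", 1),
            new XElement("B", 2),
            new XElement("C", 3));

        // Prints "A" then "B": the preceding siblings come back in
        // document order, not reverse document order.
        foreach (XElement e in root.Element("C").ElementsBeforeSelf())
            Console.WriteLine(e.Name);
    }
}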
(Update June 25, 2009: fixed bugs in the event handlers associated with deleting the last node and inserting a node at the beginning of the list.)
In my scenario (a functional transform that processes Open XML document revisions), I may need to process 80,000 (or more) paragraphs. If we use the XNode.PreviousNode property, we won’t get acceptable performance. Fortunately, there is an easy workaround that lets us query in reverse document order in a way that performs well:
- We define a new class, PreviousNodeAnnotation, that contains one public field, public XNode PreviousNode;.
- We add instances of this class as annotations on nodes that we need to query in reverse document order.
In the following small example, I select elements based on the value of the immediately preceding element, for a document of 50,000 nodes. Because LINQ to XML nodes are singly linked, XNode.PreviousNode has to walk the sibling list from the beginning to find the preceding node; each access is therefore O(n), and the slow version as a whole is O(n²). When I increased the document size to 80,000 nodes (the size of one of the documents that I need to query), the execution time of the slow version exceeded my patience. In any case, it is clear that I can’t use XNode.PreviousNode for my scenario.
using System;
using System.Linq;
using System.Xml.Linq;

class PreviousNodeAnnotation
{
    public XNode PreviousNode;
    public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }
}

class Program
{
    static int DocumentSize = 50000;

    static void SlowPreviousNodeAccess()
    {
        // create a tree with lots of nodes
        XElement root = new XElement("Root",
            Enumerable.Range(0, DocumentSize).Select(i => new XElement("Child", i)));

        // query for all elements where the previous element has a value of 1000
        DateTime start = DateTime.Now;
        var q = root
            .Elements()
            .Where(e =>
            {
                XElement p = e.PreviousNode as XElement;
                return (string)p == "1000";
            });
        var q2 = q.ToList(); // force iteration
        TimeSpan duration = DateTime.Now - start;
        Console.WriteLine(duration);
    }

    static void FastPreviousNodeAccess()
    {
        // create a tree with lots of nodes
        XElement root = new XElement("Root",
            Enumerable.Range(0, DocumentSize).Select(i => new XElement("Child", i)));

        // initialize previous node annotations
        XElement prev = null;
        foreach (var item in root.Elements())
        {
            item.AddAnnotation(new PreviousNodeAnnotation(prev));
            prev = item;
        }

        // query for all elements where the previous element has a value of 1000
        DateTime start = DateTime.Now;
        var q = root
            .Elements()
            .Where(e =>
            {
                XElement p = e
                    .Annotation<PreviousNodeAnnotation>()
                    .PreviousNode as XElement;
                return (string)p == "1000";
            });
        var q2 = q.ToList(); // force iteration
        TimeSpan duration = DateTime.Now - start;
        Console.WriteLine(duration);
    }

    static void Main(string[] args)
    {
        FastPreviousNodeAccess();
        SlowPreviousNodeAccess();
    }
}
On my old, slow laptop, the fast version runs in about 0.015 seconds, while the slow version takes about 30 seconds.
We can expand on this technique a bit by declaring three extension methods:
public static XNode PreviousNodeFast(this XNode node)
public static IEnumerable<XNode> NodesBeforeSelfFast(this XNode node)
public static IEnumerable<XElement> ElementsBeforeSelfFast(this XElement element)
It’s convenient to use these methods in queries.
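For example, here is a quick sketch of how they might be used. It assumes the PreviousNodeAnnotation class and the three extension methods are defined as in the full listing below, and that the usual System, System.Linq, and System.Xml.Linq namespaces are imported.

// build and annotate a small tree
XElement root = new XElement("Root",
    Enumerable.Range(0, 10).Select(i => new XElement("Child", i)));

XNode prev = null;
foreach (XNode node in root.Nodes())
{
    node.AddAnnotation(new PreviousNodeAnnotation(prev));
    prev = node;
}

// select each element whose immediately preceding element has the value 5
var q = root
    .Elements()
    .Where(e => (string)(e.PreviousNodeFast() as XElement) == "5")
    .ToList();

// walk everything before the last child, in reverse document order
foreach (XElement e in root.Elements().Last().ElementsBeforeSelfFast())
    Console.WriteLine((string)e);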
In addition, while I’m partial to pure transforms with no side effects, we can declare two event handlers that keep the previous node annotations in sync when adding or deleting nodes. Here’s an example that includes these extension methods and event handlers:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

class PreviousNodeAnnotation
{
    public XNode PreviousNode;
    public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }
}

public static class Extensions
{
    public static XNode PreviousNodeFast(this XNode node)
    {
        return node.Annotation<PreviousNodeAnnotation>().PreviousNode;
    }

    public static IEnumerable<XNode> NodesBeforeSelfFast(this XNode node)
    {
        XNode currentNode = node;
        while (true)
        {
            XNode prevNode = currentNode.PreviousNodeFast();
            if (prevNode == null)
                yield break;
            else
                yield return prevNode;
            currentNode = prevNode;
        }
    }

    public static IEnumerable<XElement> ElementsBeforeSelfFast(this XElement element)
    {
        return NodesBeforeSelfFast(element).OfType<XElement>();
    }
}

class Program
{
    static void ValidatePreviousNodes(XElement element)
    {
        XNode prev = null;
        foreach (XNode node in element.Nodes())
        {
            if (node.PreviousNodeFast() != prev)
            {
                Console.WriteLine("ERROR: previous nodes are invalid");
                Environment.Exit(0);
            }
            prev = node;
        }
        Console.WriteLine("Validated");
    }

    static void Main(string[] args)
    {
        // create a tree with lots of nodes
        XElement root = new XElement("Root",
            Enumerable.Range(0, 100).Select(i => new XElement("Child", i)));

        // set up previous node annotations after tree creation
        XElement prev = null;
        foreach (var item in root.Elements())
        {
            item.AddAnnotation(new PreviousNodeAnnotation(prev));
            prev = item;
        }

        // add event handlers to take care of adding / deleting nodes
        root.Changed += new EventHandler<XObjectChangeEventArgs>((o, e) =>
        {
            if (e.ObjectChange == XObjectChange.Add)
            {
                Console.WriteLine("Add");
                XNode node = o as XNode;
                // o could be an XAttribute, in which case it's not applicable
                if (node != null)
                {
                    node.AddAnnotation(new PreviousNodeAnnotation(node.PreviousNode));
                    if (node.NextNode != null)
                    {
                        node.NextNode.RemoveAnnotations<PreviousNodeAnnotation>();
                        node.NextNode.AddAnnotation(new PreviousNodeAnnotation(node));
                    }
                }
            }
        });
        root.Changing += new EventHandler<XObjectChangeEventArgs>((o, e) =>
        {
            if (e.ObjectChange == XObjectChange.Remove)
            {
                Console.WriteLine("Remove");
                XNode node = o as XNode;
                // o could be an XAttribute, in which case it's not applicable
                if (node != null)
                {
                    if (node.NextNode != null)
                    {
                        node.NextNode.RemoveAnnotations<PreviousNodeAnnotation>();
                        node.NextNode
                            .AddAnnotation(new PreviousNodeAnnotation(node.PreviousNode));
                    }
                }
            }
        });

        ValidatePreviousNodes(root);
        root.Elements().ElementAt(3).AddAfterSelf(new XElement("NewChild", 999));
        ValidatePreviousNodes(root);
        root.Nodes().ElementAt(2).Remove();
        ValidatePreviousNodes(root);
        root.Nodes().ElementAt(3).NodesBeforeSelfFast().Remove();
        ValidatePreviousNodes(root);
        ValidatePreviousNodes(root);
        root.AddFirst(new XElement("ANode", 1));
        ValidatePreviousNodes(root);
        root.Add(new XElement("ANode", 2));
        ValidatePreviousNodes(root);
        root.Nodes().Last().NodesBeforeSelfFast().Remove();
        ValidatePreviousNodes(root);
        root.Add(new XElement("ANode", 2));
        ValidatePreviousNodes(root);
        root.Add(new XElement("ANode", 2));
        root.Add(new XElement("ANode", 2));
        root.Add(new XElement("ANode", 2));
        ValidatePreviousNodes(root);
        root.Nodes().First().Remove();
        ValidatePreviousNodes(root);
        root.Add(new XElement("ANode", 2));
        ValidatePreviousNodes(root);
        root.Nodes().Last().Remove();
        root.Add(Enumerable.Range(0, 100).Select(i => new XElement("Child", i)));
        XElement last = root.Elements().ElementAt(50);
        foreach (var item in last.NodesBeforeSelfFast())
        {
            Console.WriteLine(item);
        }
    }
}
I would personally only define these extension methods in the module where I need good performance of reverse document order queries. It would be messy to have these extension methods in scope for modules that don't set up the annotations.
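One way to do that, sketched below with an illustrative namespace name of my own choosing, is to put the extension methods (and the annotation class) in a dedicated namespace, so that only source files that opt in with a using directive see the Fast methods:

using System.Xml.Linq;

// "ReverseDocumentOrder" is a hypothetical namespace name used for illustration.
namespace ReverseDocumentOrder
{
    class PreviousNodeAnnotation
    {
        public XNode PreviousNode;
        public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }
    }

    public static class Extensions
    {
        public static XNode PreviousNodeFast(this XNode node)
        {
            return node.Annotation<PreviousNodeAnnotation>().PreviousNode;
        }

        // NodesBeforeSelfFast and ElementsBeforeSelfFast go here as well,
        // exactly as in the listing above.
    }
}

Modules that never set up the annotations simply don't add using ReverseDocumentOrder;, so the Fast methods never come into scope for them.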
Code is attached.
Comments
Anonymous
June 25, 2009
You say "The need to process nodes in reverse document order is rare" but that is only if you remove the scenario of processing documents from what people do with XML. Database people have seemed congenitally certain that whatever they need is all that document processing people need. But it isn't so. In XML, you can write that an element X has a label Y with a value Z as <X Y="Z" /> <X><Y>Z</Y></X> <label Y="Z"><X/></label> <container> <containerPr><Y>Z</Y></containerPr><X/> </container> and so on. So the heuristic for XML processing systems should not be that "forward axis is normal, other axes are weird" but rather something like "processing one element requires accessing other elements in inverse proportion to their degree of separation." Indeed, you could classify documents, schemas and programs by the degrees of separation (XPath location steps needed to process each node.) I would guess an average (mode) of 2 for HTML, 4 for ODF and 5 for OOXML.Anonymous
June 26, 2009
Hi Rick, I believe that you are right that 'document-centric' XML tree processing is the scenario that requires the most from XML processing systems, including the ability to process nodes in reverse document order. I certainly agree that document-centric XML processing requires more interesting axes than database scenarios. I can't comment on ODF - I'm not an expert - but certainly there are places in Open XML that require quite a few (4-5) XPath location steps to process nodes. Regarding the decision of the designers of LINQ to XML not to include a doubly linked list for nodes, I believe it is a good one. I've written a lot of LINQ to XML queries over Open XML documents, and until now, there have been only one or two scenarios that require reverse document order queries that perform well. I just counted the nodes of one of my largest Open XML documents - it contains > 2,400,000 XML nodes in the main document part. If there were a previous sibling link in each node, then this would consume 8 MB that would rarely be used. In addition, maintaining previous node links would add processing overhead - not a lot, but it adds up. But as seen here, it is trivial to add reverse document order axes that perform well. We only need to enable them for the few scenarios where we actually need them. -Eric