Streaming with LINQ to XML - Part 2

In the first post in this series we gave some background to a problem the LINQ to XML design team has been working on for some time: how to easily yet efficiently work with very large XML documents.  In today's world, developers have a somewhat unpleasant choice between doing this efficiently with fairly difficult APIs such as the XmlReader/XmlWriter or SAX, and doing this easily with DOM or XSLT and accepting a fairly steep performance penalty as documents get very large.

Let's consider a real world example - Wikipedia abstract files.  Wikipedia offers free copies of all content to interested users, on an immense number of topics and in several human languages.  Needless to say, this requires terabytes of storage, but entries are indexed in abstract.xml files in each directory in a hierarchy arranged by language and content type.  There doesn't seem to be a published schema for these abstract files, but each has the basic format:

<feed>
  <doc>
    <title></title>
    <url></url>
    <abstract></abstract>
    <links>
      <sublink linktype="nav"><anchor></anchor><link></link></sublink>
      <sublink linktype="nav"><anchor></anchor><link></link></sublink>
    </links>
  </doc>
  .
  . [lots and lots more "doc" elements]
  .
</feed>

Something one might want to do with these files is to find the URLs of articles that might be interesting given information in the <title> or <abstract> elements.  For example, here is a conventional LINQ to XML program that opens an abstracts file and prints the URLs of entries that contain 'Shakespeare' in the <abstract>.  (If you want to run this, it is best to copy a small subset of a real Wikipedia file such as abstract.xml to the appropriate local directory.  I do NOT recommend clicking on this link: it's about 10 MB of XML, which will keep your browser busy for a while and possibly choke up your internet connection.)

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.IO;

namespace SimpleStreaming
{
    class Program
    {
        static void Main(string[] args)
        {
            XElement abstracts = XElement.Load(@"abstract.xml");
            IEnumerable<string> bardQuotes =
                from el in abstracts.Elements()
                where el.Element("abstract").Value
                   .Contains("Shakespeare")
                select (string)el.Element("url");
            foreach (string str in bardQuotes)
            {
                Console.WriteLine(str);
            }
        }
    }
}

Note that this is a typical LINQ to XML program - we query over the top level elements in the tree of abstracts for those which contain an <abstract> subelement with a value that contains the string "Shakespeare", then print out the values of the <url> subelements.   Of course, actually running this program with a multi-megabyte input file will consume a lot of time and memory; it would be more efficient to query over a stream of top level elements in the raw file of abstracts, and perform the very same LINQ subqueries (and transformations, etc.) that are possible when querying over an XElement tree in memory. 

As noted in the earlier post, we did not manage to find a design that would do this in a generic, discoverable, easy to use, yet efficient way.  Instead, we hope to teach you how to do this in a custom, understandable, easy to use, and efficient way... with just a bit of code you can tailor to your particular data formats and use cases.  In other words, to abuse the old cliche, rather than giving you a streaming class and feeding you for a day, we'll teach you to stream and let you feed yourself for a lifetime. [groan]  But seriously folks, with just a little bit of learning about the XmlReader and some powerful features of C# and .NET, you can extend LINQ to XML to process huge quantities of XML almost as efficiently as you can with pure XmlReader code, but in a way that any LINQ developer can exploit without knowing the implementation details.

The key is to write a custom axis method that functions much like the built-in axes such as Elements(), Attributes(), etc. but operates over a specific type of XML data.  An axis method typically returns a collection such as IEnumerable<XElement>. In the example here, we walk the stream with an XmlReader, turn each matching element into an XElement with the static ReadFrom method (defined on XNode and called here as XElement.ReadFrom), and return the collection by using yield return. This provides the deferred execution semantics necessary to make the custom axis method work well with huge data sources, while still allowing the application program to use ordinary LINQ to XML classes and methods to filter and transform the results.
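
To make the deferred execution point concrete, here is a minimal sketch that has nothing to do with XML. The names DeferredDemo and Numbers are hypothetical, chosen only for illustration. The iterator logs every value it produces, so we can see that nothing runs until the query is enumerated and that enumeration stops as soon as First() has its answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    // Hypothetical iterator: records each value in 'log' just before yielding it.
    public static IEnumerable<int> Numbers(List<string> log)
    {
        for (int i = 1; i <= 1000000; i++)
        {
            log.Add("produced " + i);
            yield return i;
        }
    }

    public static void Main()
    {
        var log = new List<string>();

        // Building the query does not run the iterator body.
        IEnumerable<int> evens = Numbers(log).Where(n => n % 2 == 0);
        Console.WriteLine(log.Count);   // prints 0

        // First() pulls values only until one passes the filter.
        int first = evens.First();
        Console.WriteLine(first + " after " + log.Count + " items");   // prints "2 after 2 items"
    }
}
```

A streaming axis method relies on exactly this behavior: the consumer's foreach or query operator decides how far into the source the reader ever advances.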

Specifically, we will modify only a couple of lines in the application:

    XElement abstracts = XElement.Load(@"abstract.xml");

goes away, because we do not want to load the big data source into an XElement tree.  Let's replace it with a simple reference to a big data source:

    string inputUrl = @"https://download.wikimedia.org/enwikiquote/20070225/enwikiquote-20070225-abstract.xml";

Next,

    from el in abstracts.Elements()

morphs into a call to the custom axis method we are going to write, passing the URL of the data to process and the name of the element that we expect to stream over:

    from el in SimpleStreamAxis(inputUrl, "doc")

Writing the custom axis method is a bit trickier (but not as scary as the name might sound), and requires only a bare minimum of knowledge about the XmlReader class (and Intellisense will help with that). The key steps are to:

a) create a reader over the inputUrl file:

    using (XmlReader reader = XmlReader.Create(inputUrl))

b) move to the content of the file and start reading:

    reader.MoveToContent();
    while (reader.Read())

c) pay attention only to XML element content (ignore processing instructions, comments, whitespace, etc. for simplicity ... especially since the Wikipedia files don't contain this stuff):

    switch (reader.NodeType)
    {
        case XmlNodeType.Element:

d) if the element has the name that we were told to stream over, read that content into an XElement object and yield return it:

    if (reader.Name == matchName)
    {
        XElement el = XElement.ReadFrom(reader) as XElement;
        if (el != null)
            yield return el;
    }
    break;

e) close the XmlReader when we're done:

    reader.Close();

That's not so hard, is it?  The simple example program is now:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.IO;

namespace SimpleStreaming
{
    class Program
    {
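        // Custom axis method: lazily yields each matchName element found in
        // the document at inputUrl, one XElement at a time.
        static IEnumerable<XElement> SimpleStreamAxis(
                       string inputUrl, string matchName)
        {
            using (XmlReader reader = XmlReader.Create(inputUrl))
            {
                // Move to the first content node, then walk the nodes after it.
                reader.MoveToContent();
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (reader.Name == matchName)
                            {
                                // Build an in-memory tree for just this element.
                                XElement el = XElement.ReadFrom(reader) 
                                                      as XElement;
                                if (el != null)
                                    yield return el;
                            }
                            break;
                    }
                }
                // Explicit close; the using block would also dispose the reader.
                reader.Close();
            }
        }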

        static void Main(string[] args)
        {
            string inputUrl = 
               @"https://download.wikimedia.org/enwikiquote/20070225/enwikiquote-20070225-abstract.xml";
            IEnumerable<string> bardQuotes =
                from el in SimpleStreamAxis(inputUrl, "doc")
                where el.Element("abstract").Value.Contains("Shakespeare")
                select (string)el.Element("url");

            foreach (string str in bardQuotes)
            {
                Console.WriteLine(str);
            }
        }
    }
}
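
One caveat about the loop above: ReadFrom consumes the whole element and leaves the reader positioned on whatever node follows it, and the next reader.Read() then steps past that node. This is harmless when matching elements are separated by whitespace, as appears to be the case in the Wikipedia files, but when they sit directly next to each other every second one is skipped. The following is a sketch of one way around that, not the approach used in this post. The names AdjacentSafeStreaming and StreamAxisNoSkip are hypothetical, and the method takes a TextReader instead of a URL (an assumption made so the sketch runs against an in-memory string).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class AdjacentSafeStreaming
{
    // Hypothetical variant of SimpleStreamAxis that advances the reader
    // only when the current node was NOT consumed by ReadFrom.
    public static IEnumerable<XElement> StreamAxisNoSkip(
        TextReader input, string matchName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            reader.Read();                      // step past the root start tag
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == matchName)
                {
                    // ReadFrom already moves the reader to the node after
                    // the element, so there is no Read() on this branch.
                    XElement el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                        yield return el;
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    public static void Main()
    {
        // Three adjacent <doc> elements with no whitespace between them.
        string xml = "<feed><doc><url>a</url></doc><doc><url>b</url></doc>"
                   + "<doc><url>c</url></doc></feed>";
        foreach (XElement doc in StreamAxisNoSkip(new StringReader(xml), "doc"))
            Console.WriteLine((string)doc.Element("url"));   // prints a, b, c
    }
}
```

With the same input, the while (reader.Read()) form of the loop would yield only the first and third elements.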
  

The actual results contain more than just Shakespeare quotes; feel free to add whatever logic it takes to exploit Wikipedia's conventions in a more sophisticated way.  Likewise, you might wish to experiment with other LINQ to XML techniques to transform the matching elements into RSS or HTML data.  Or you might wish to experiment with a more sophisticated query language, e.g. an XPath subset, rather than using the simple name matching scheme here.  The possibilities are endless!  We'll explore some in a bit more depth in the next installment, and address a question that the LINQ to XML design team wrestled with for a long time: how to handle documents with a more complex structure, such as a header containing contextual data that needs to be preserved, or more deeply nested documents where you want to stream over multiple levels of the hierarchy.
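
As one small illustration of the transformation idea, here is a sketch that turns a sequence of doc elements into an HTML list of links using ordinary functional construction. The names AbstractsToHtml and ToHtmlList are hypothetical, the sample feed is made up, and the example.org URL is a placeholder; in a real program the input sequence could just as well come from a streaming axis method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class AbstractsToHtml
{
    // Hypothetical helper: builds <ul><li><a href="url">title</a></li>...</ul>
    // from any sequence of <doc> elements. Missing <url> or <title>
    // children become empty strings instead of throwing.
    public static XElement ToHtmlList(IEnumerable<XElement> docs)
    {
        return new XElement("ul",
            from doc in docs
            select new XElement("li",
                new XElement("a",
                    new XAttribute("href", (string)doc.Element("url") ?? ""),
                    (string)doc.Element("title") ?? "")));
    }

    public static void Main()
    {
        // Made-up sample data in the shape of an abstracts file.
        XElement feed = XElement.Parse(
            "<feed><doc><title>Hamlet</title>" +
            "<url>http://example.org/Hamlet</url>" +
            "<abstract>A play by Shakespeare</abstract></doc></feed>");

        Console.WriteLine(ToHtmlList(feed.Elements("doc")));
    }
}
```

Note that the XElement constructor enumerates the query while building the list, so the output tree is held in memory even if the input is streamed.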

TreeBard.cs

Comments

  • Anonymous
    March 26, 2007
    I would think a simpler implementation would be:

        using(XmlReader reader = XmlReader.Create(inputUrl))
        {
            while(reader.ReadToFollowing(matchName))
                yield return (XElement)XElement.ReadFrom(reader);
        }

    and I would be inclined to actually turn this into an extension method on XmlReader:

        public static IEnumerable<XElement> StreamElements(this XmlReader reader, string matchName)
        {
            while(reader.ReadToFollowing(matchName))
                yield return (XElement)XElement.ReadFrom(reader);
        }

    and then your main would have:

        using(XmlReader reader = XmlReader.Create(inputUrl))
        {
            var bardQuotes =
                from el in reader.StreamElements("doc")
                where el.Element("abstract").Value
                    .Contains("Shakespeare")
                select (string)el.Element("url");
            foreach (string str in bardQuotes)
                Console.WriteLine(str);
        }

  • Anonymous
    March 26, 2007
    Thanks, those are good ideas.

  • Anonymous
    April 17, 2007
    I agree. In most cases, XML files that need streaming are actually just a huge number of 2nd-level elements, but each of those elements would fit nicely into an XElement. More often than not, these elements even have the same name, so MKane's implementation is already quite specialized. A simple method with the signature

        public static IEnumerable<XElement> StreamElements (this XmlReader reader)

    would be fine for many cases where DOM is too fat. After all, you can still filter your stuff in the where clause. Or you could pass a filter method (quite like a where clause):

        StreamElements (XmlReader reader, Func<XmlReader,bool> predicate)

        from el in reader.StreamElements (reader => reader.Name == "doc")
        where el.Element ("abstract") ...

    In this case, the first predicate would operate on the reader (performance), while the second gets XElements (ease of use).

    I suppose you already tried to get the XLINQ stuff to support IQueryable instead of IEnumerable and transform XElement conditions to XmlReader conditions and got nowhere? Anyway, even with two separate predicates this could still prove to be quite a lot easier than writing custom axis methods for every single bit of code. Of course, for parsing huge OpenXML files, we'd still have to write specialized axis methods. But including a few methods like those above (or maybe even just the one with the predicate) would be a quite good solution for many situations.

  • Anonymous
    April 17, 2007
    btw, this should work too:

        public static IEnumerable<XmlReader> Where (
            this XmlReader reader,
            Func<XmlReader,bool> predicate)
        {
          while (reader.Read())
            yield return reader;
          else
            reader.Skip();
        }

        reader.ReadToDescendant("feed");
        from r in reader
        where r.NodeType == XmlNodeType.Element && r.Name == "doc"
        select (XElement) XElement.ReadFrom (reader)
        into el
        where el.Element ("abstract") ...

  • Anonymous
    April 17, 2007
    sorry, i was in the middle of writing when my finger fell on the enter key... here's the final snippet (though untested):

        public static IEnumerable<XmlReader> Where (
            this XmlReader reader,
            Func<XmlReader, bool> predicate)
        {
          while (reader.Read())
          {
            if (predicate(reader))
              yield return reader;
            else
              reader.Skip();
          }
        }

        public static void Main()
        {
          string inputUrl = @"http://download.wikimedia.org/enwikiquote/20070225/enwikiquote-20070225-abstract.xml";
          using (XmlReader reader = XmlReader.Create(inputUrl))
          {
            reader.ReadToDescendant("feed");
            var bardQuotes = from r in reader
                             where r.NodeType == XmlNodeType.Element && r.Name == "doc"
                             select (XElement)XElement.ReadFrom(reader)
                             into el
                             where el.Element("abstract").Value.Contains("Shakespeare")
                             select (string)el.Element("url");
            foreach (var quote in bardQuotes)
              Console.WriteLine(quote);
          }
        }

  • Anonymous
    April 17, 2007
    thinking about it, this could become rather risky, considering that

  1. the user can choose to not read entire elements from the reader in the select clause (which would leave the reader in a probably unexpected position)
  2. the user can apply further query expressions like "order", which is probably going to mess up everything quite a bit.

    an enumeration of XmlReaders that always return the same instance, only in different states does not look like such a good idea from this point of view. it might therefore be better to return an IEnumerable<XNode> in the first place. it could still look nice: the Where() method would change its yield to:

        yield return (XNode) XNode.ReadFrom(reader);

    the user code would be:

        var bardQuotes = from r in reader
                         where r.NodeType == XmlNodeType.Element && r.Name == "doc"
                         select (XElement) r into el
                         where el.Element("abstract").Value.Contains("Shakespeare")
                         select (string)el.Element("url");

    note that casting r to XElement in the first select does not look very logical though. this works better with lambda syntax, where we can change names:

        reader.Where(r => r.NodeType == XmlNodeType.Element && r.Name == "doc")
              .Select(node => (XElement)node)
              .Where(el => el.Element("abstract").Value.Contains("Shakespeare"))
              .Select(el => (string)el.Element("url"));

    in this code it's obvious to the reader, that the select targets a node, not a reader. hm... and of course, this still works only with sibling nodes, any other navigation would still have to appear outside the from/select clause or in a custom axis method like you suggest. (but I still believe that's a minority scenario.)

  • Anonymous
    April 17, 2007
    sorry for my inverted thinking/posting order ;-)

  • Anonymous
    April 17, 2007
    a dirty workaround for the node/r lambda arg name problem:

        var bardQuotes = from node in reader
                         where reader.NodeType == XmlNodeType.Element && reader.Name == "doc"
                         select (XElement) node

    but that's maybe even worse. ok, last post for now. promise.
