Use of GetElementsByTagName considered harmful

As you may already be aware, there is an article on msdn about the great performance improvements we made in the V2 xml stack. This is a pretty big event for the team, since increasing performance was one of the big goals of this release.

While I posted this to give a shout-out to the team for the great perf work in they did for Whidbey, I want to take a minute to point out another interesting part of the article. Up at the top of this article you will find a this paragraph:

2. We also found that Sun is using the GetElementsByTagName method in C#, and this results in a mismatch of the C# and Java versions. Although both harnesses use XmlElement.GetElemensByTagName(), the C# implementation of this method keeps track of a live list of nodes which is negatively impacted by some of the DOM test scenarios that have edits. A better method for element selection in the C# harness is XmlElement.SelectNodes() which matches the Java harness in functionality.

When I first read this, it caused me some concern. Were we somehow ‘cheating’ by changing the nature of the test? Was there something wrong with GetElementsByTagName that I (and the various xml developers out there) should be aware of? I did a little digging, and wanted to share the results.

The W3C Level1 DOM spec has this to say about the results of GetElementsByTagName (I underlined the last sentence for emphasis):

The DOM also specifies a NodeList interface to handle ordered lists of Nodes, such as the children of a Node, or the elements returned by the Element.getElementsByTagName method, and also a NamedNodeMap interface to handle unordered sets of nodes referenced by their name attribute, such as the attributes of an Element. NodeLists and NamedNodeMaps in the DOM are "live", that is, changes to the underlying document structure are reflected in all relevant NodeLists and NamedNodeMaps.

 

In other words, GetElementsByTagName is, according to the spec, supposed to return a ‘live list’ where changes to the underlying DOM are reflected in the returned NodeList. The MS implementation of this function conforms to the spec, but the Sun/Java implementation does not.

Unfortunately, this conformance comes at a cost. In this case, the cost is a non-trivial hit to the speed and working set of code which uses GetElementsByTagName. Let me explain why.

 

In order to keep the NodeLists returned from GetElementsByTagName live, each list needs to ‘register’ somehow with the related element so that it can be notified when the children of that element change. When we first implemented this function, we thought about adding a series of events to XmlElement. NodeLists could register for these events and react appropriately. Unfortunately, this turns out to bloat the size of a DOM tree, and the performance costs caused by the related cache misses were unacceptable. Because of this we moved to a model where only XmlDocument exposed ‘OnChange’ events. When an item is added into the DOM, the XmlDocument.NodeInsterted event fires, each NodeList can determine if the new element should be added to its list or not, and then take the appropriate action.

 

This solution has the advantage that it adds no extra overhead to scenarios which do not call GetElementsByTagName, and the overhead is minimal even for scenarios which do use GetElementsByTagName, as long as those scenarios do not change the DOM (in this case the only overhead is the cost of registering for the event and a small increase in allocated memory for the created delegate).

 

The real problem shows up when you combine calls to GetElementsByTagName and edits to the DOM. Imagine that you had a moderate sized DOM (a few hundred elements) and for each element you called GetElementsByTagName, which results in a few hundred NodeLists registering for the XmlDocument.NodeInsterted. Now, after walking through the tree, you add a few elements. Each time you add an element, the NodeInserted event fires, and hundreds of delegate functions will get called. As you can imagine, this can cause quite a hit to the performance of your application.

 

The bottom line? If you are using GetElementsByTagName, especially in scenarios where you edit the DOM, you should investigate using SelectNodes instead. SelectNodes takes an XPath expression, but all you would need to do to get back the same elements as what is called by GetElementsByTagName would be to pass in “\\ElementName” [edit: that should read "//ElementName] to SelectNodes where you would have passed “ElementName” in to GetElementsByTagName. As always, you should measure how this changes your perf before switching over, especially in cases where you are not editing the DOM, since SelectNodes does a little more work to support the full syntax, but don’t be afraid to try. Even if your perf is equivalent in both cases, you may want to look at switching over, since we are very focused on improving XPath performance, so you may find your perf improve over time along with our XPath implementation (note: as always Microsoft’s plans are always subject to change, while we currently are focused on XPath performance, some future change may alter this in the future).

Comments

  • Anonymous
    July 19, 2005
    It's been quiet on this blog as we get closer to Whidbey RTM.  Meanwhile, Erik Saltwell - the dev...

  • Anonymous
    July 19, 2005
    It's //ElementName, not \ElementName.

  • Anonymous
    July 19, 2005
    Or even "descendant::ElementName". That probably would be faster.

  • Anonymous
    July 25, 2005
    Thanks Oleg (so much for my proof readong skills :-). I updated the page.

  • Anonymous
    July 25, 2005
    I would be surprised if descendent gave you much better perf here, since you are - by definition of the API call - going to be at the root of your subtree, so everything will be your descendant. I haven't tried it though, so I would be interested to see what the numbers look like.

  • Anonymous
    November 01, 2005
    I think the real problem with GetElementsByTagName is not live collection and event handlers themselves, but inability to remove/dispose collection when it is not need any more. XmlNodeList is not IDisposable, so once it is created and handlers are registered, it's for the document lifetime. Second issues is that second call to GetElementsByTagName for the same name returns new live collection and register more event handleres. In my experience these 2 factors hurt performance the most, not the live property of the collection alone.

  • Anonymous
    November 16, 2005
    Dima, thanks these are well thought out comments. There are some areas where I disagree with you - 1) It is not always for the document lifetime, there are times where we detect that the list is no longer around and destroy un-register, but this is really just a technicality. I agree that we should have put more thought into IDisposable before the code went into the wild. 2) I don't know if I would recomend that we return the same list. This has some usage implications which I don't care for.

    If you would like to discuss this more, please feel free to drop me an email at erik.saltwell@microsoft.com

  • Anonymous
    November 20, 2006
    embattle Brussels  http://www.hotelsfamily.info/uninviting_United%20Kingdom/swum_Wales/embattle_Swansea_1.html

  • Anonymous
    November 21, 2006
    indigested London  http://www.hotelsman.info/bod_United%20Kingdom/accouchement_England/indigested_Folkestone_1.html

  • Anonymous
    December 14, 2006
    piccalilli Edinburgh  http://www.barcelohotels.info/religious_United%20Kingdom/inability_England/piccalilli_York_1.html

  • Anonymous
    December 15, 2006
    prejudice Berlin  http://www.mynhotel.info/sleet_Germany/cuneiform_Bavaria%20(Bayern)/prejudice_Munich_1.html

  • Anonymous
    February 17, 2007
    mmm.. nice design, I must say..

  • Anonymous
    February 24, 2007
    Lavoro eccellente! ..ringraziamenti per le informazioni..realmente lo apprezzo: D

  • Anonymous
    February 25, 2007
    The information I found here was rather helpful. Thank you for this.

  • Anonymous
    March 04, 2007
    luogo interessante, soddisfare interessante, buon!

  • Anonymous
    March 10, 2007
    Luogo molto buon:) Buona fortuna!

  • Anonymous
    March 15, 2007
    luogo interessante, soddisfare interessante, buon!

  • Anonymous
    March 17, 2007
    Lo trovo piuttosto impressionante. Lavoro grande fatto..)

  • Anonymous
    March 18, 2007
    Stupore! ho una sensibilit molto buona circa il vostro luogo!!!!

  • Anonymous
    April 09, 2007
    Interessare, molto interessante. Come avete fatto questo?

  • Anonymous
    April 12, 2007
    L'information interessante que vous avez! I'am allant revenir bientot.

  • Anonymous
    May 22, 2007
    Thank you for clearing this up. I had huge problems when using GetElementsByTagName on an XML file of 3MB, where appending child nodes something took eons. But I didnt find the other SelectNodes() much faster (as I did alot of SelectNodes()), so I ended up writing a quick one that satisfied my needs and nothing more. (ugly but quick :) private static XmlElement GetFirstElementByTagName(XmlElement parent, String tagName) { foreach (XmlNode node in parent.ChildNodes) { if ((node is XmlElement) && (node.Name == tagName)) { return (XmlElement)node; } } return null; }

  • Anonymous
    August 14, 2007
    I do not think so go to http://apartments.waw.pl/