Effective Xml Part 4: Let me project this (Xml file) for you

Xml is ubiquitous. No doubt about it. It is used almost everywhere and by almost everyone, including places where huge amounts of data are processed. This means the Xml files (or streams) used there are also huge, and the bigger the Xml file, the harder it is to process. The two biggest problems are:

  • You need to query the document with a couple of XPath expressions or transform it with an Xslt file, but the document is too big even to be loaded (a rule of thumb is that a loaded Xml document needs up to 5 times as much memory as its size on disk). When you try to load the document you get an OutOfMemoryException, and that’s about where your Xml processing ends.
  • You are able to load the document, but all the queries or transformations are sloooow (and I assume it’s not because the queries or Xslt stylesheets are poorly written; if you are not sure, see Effective Xml Part 3).

These are problems indeed, but there is a good chance they are solvable. First, take a look at the structure of the source Xml. Then look at the XPath expressions or the Xslt stylesheet. How much information from the source Xml are you actually using? Probably, the bigger the file and the more complex its structure, the smaller the share of the data you actually use. So, if you don’t use some data, what’s the point of even trying to load it? Filter this data out. You can do it in a streaming fashion. Instead of using the stock XmlReader from the Xml API, implement your own reader that reports only the nodes you really need and ignores everything else (i.e. projects the document). Depending on how little you actually need, the savings can be substantial. Now your document can fit in memory and the queries or transformations will be faster, because they don’t need to process nodes or attributes that are never used. If you don’t feel like writing your own reader, you can try using XPathReader https://msdn.microsoft.com/en-us/library/ms950778.aspx (note that the article is aged and may be using some old APIs, but the basic idea is the same).
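
To make the idea concrete, here is a minimal sketch of a projecting reader. It is written in Python rather than .NET, using the standard library's xml.sax, where XMLFilterBase plays roughly the role that a custom wrapper around XmlReader would play. The class name ProjectingFilter, the project helper and the sample element names (orders, order, notes) are all made up for illustration. The sketch uses the simplest form of projection: a set of element names whose whole subtrees are dropped while the document streams through.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class ProjectingFilter(XMLFilterBase):
    """Streaming filter that drops every subtree rooted at a skipped element.

    Hypothetical helper for illustration; element names are matched as plain
    tag names (namespace processing is left at the parser default, i.e. off).
    """

    def __init__(self, parent, skip):
        super().__init__(parent)
        self._skip = set(skip)
        self._skip_depth = 0  # > 0 while we are inside a skipped subtree

    def startElement(self, name, attrs):
        if self._skip_depth or name in self._skip:
            self._skip_depth += 1  # swallow the event, track nesting
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._skip_depth:
            self._skip_depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._skip_depth:
            super().characters(content)


def project(source, skip):
    """Parse `source` (a file name or file object) and return the projected Xml as text."""
    out = io.StringIO()
    reader = ProjectingFilter(xml.sax.make_parser(), skip)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(source)
    return out.getvalue()


# Made-up sample: the bulky <notes> elements are never needed by our queries.
sample = (
    b'<orders><order id="1"><customer>Ann</customer>'
    b"<notes>a very long comment</notes><total>10</total></order></orders>"
)
print(project(io.BytesIO(sample), skip={"notes"}))
```

The projected output still has the same overall structure, so it can be loaded and queried as usual, only with less to load. A "keep" list instead of a "skip" list would work along the same lines, but it needs a bit more bookkeeping to retain the ancestors of the kept nodes so that the result stays well-formed.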

If the above steps don’t help, you may try splitting your one big task into a few smaller tasks that you can run sequentially. Doing this will hopefully enable you to achieve your goal.

Pawel Kadluczka