Performance tips of using XML data in SQL Server
Author: Kun Cheng
Reviewers: Peter Carlin, Mike Ruthruff, Thomas Kejser, Nicholas Dritsas
XML data type is usually used to store semi-structured data with great flexibility and query capabilities. It’s a good choice for developing platform agnostic applications to separate storage of complex data from processing at application layer, which understands the complex data relationship. To achieve optimal performance of querying XML data in SQL Server, extra steps need to be taken to ensure query access pattern matches how XML data is stored and indexed.
Here are 3 quick tips from query performance perspective based on real customer experience.
1. Promoting key XML attribute/property to relational column. Promoting to relational column will get you to on-par performance with relational queries, the best you can get out of modern RDBMS. To achieve optimal performance and scalability with XML data, I strongly recommend you first consider this option. The idea is identifying hot elements or attributes that your workload queries on the most. Then add computed columns using user functions to promote the elements or attributes out of XML data. Note it does require application code change to modify queries to take advantage of the promoted columns. A simple example below is a table with “XMLDATA” that has a hot element “a1”. Here are steps to add a computed column to promote the element (Note adding the computed column is offline operation for the table).
CREATE FUNCTION udf_get_a1 (@xData xml)
RETURNS varchar(30)
WITH SCHEMABINDING
BEGIN
DECLARE @a1 varchar(30)
SELECT @a1 = @xData.value('(/Dept[1]/orders/a1/text())[1]', 'varchar(30)')
RETURN @a1
END
Go
ALTER TABLE Orders
ADD a1 AS dbo.udf_get_a1(XMLDATA) PERSISTED
Go
CREATE INDEX ind_a1 ON Orders(a1)
Go
Now the original query below
SELECT *
FROM Orders t
WHERE t.XMLDATA.exist(N'(/Dept[1]/orders/a1/text())[1][.="Company10001Special"]') = 1
Needs to be rewritten as
SELECT *
FROM Orders
WHERE a1='Company10001Special'
In a customer lab, we observed exponential performance gain by using this method (1000+ times faster!!). To caution against over-promotion, when you add too many promoted columns, the overhead of index change could reduce the performance gain.
2. Typed or Untyped XML data? Typed XML means there is validating schema defined against the data. For untyped XML, the whole XML data is treated as a big string. To decide whether or not to define typed XML, you need to examine your XML query access pattern. Generally speaking, schema speeds up lookups since the data types of XML elements/attributes are known. But it would slow down INSERT due to overhead of validating new data. For UPDATE, it would benefit from faster lookups like SELECT, but incur same schema validation overhead like INSERT.
Better off |
Worse off |
|
Typed (w/ schema) |
Faster SELECT |
Slower INSERT |
Untyped (w/o schema) |
Faster INSERT |
Slower SELECT |
3. What XML indexes to add? Again depending on query access pattern, you may choose different XML indexes. Note XML index strategy is different than relational index. Popular DTA (Database Tuning Advisor) tool wouldn’t be able to recommend XML indexes. Fortunately there are only 3 types of secondary XML indexes to consider in addition to the required primary XML index.
Secondary XML index type |
Benefits this type of queries |
Path |
XPath queries with explicit path expression (XMLcol.exist ('/Dept/Orders/[@id="10001"]') = 1) |
Value |
XPath queries with no explicit path expression but with value predicate (XMLcol.exist ('//[@id="10001"]') = 1) |
Property |
XPath queries with multiple row results |
As straight forward as it sounds, you may find it difficult to examine your application and find out what appropriate indexes to add. The best way I recommend is to add all 3 types of XML indexes and test your workload to find out which XML indexes were used to determine the ones to keep (Be aware of rebuilding index affecting system availability). You can build XML indexes on either untyped or typed data. But you will get more gain with indexes on typed data. I see up to 50 times faster query response (SELECT queries) compared to plain untyped XML! However for DML queries, they could be significantly impacted by the overhead of XML index change plus schema validation. So again the key is to test your workload to find out if it makes sense to add XML indexes.
In conclusion, Promotion to relational column is the primary option you should consider when working with XML data. Whether or not to use XML schema and XML indexes, you need to be careful about your decision. Test your workload and identify the XQueries, which are critical to the overall performance. There is no DTA to help you, no query hints apply to XQueries either. Examine the query plans of those XQueries thoroughly to eliminate inefficiency.