XML: Typed and Untyped
Occasionally we hear from people who are surprised to find that their XML data uses more space when typed than when untyped. In general, this is to be expected. Typed XML has some advantages over untyped XML, namely smarter query plans and the ability to constrain user input, but size usually isn't one of them. In addition to storing all the same markup information needed for untyped XML, typed XML also has type annotations. Storing values in their binary forms rather than as Unicode strings does offer some opportunities for savings, but these savings usually aren't enough to make up for the extra space needed for type annotations. The space requirements for typed values in bytes (excluding markup) are as follows:
- xs:boolean - 1 byte
- xs:decimal - 17 bytes
- xs:float - 4 bytes
- xs:double - 8 bytes
- xs:date/xs:time/xs:dateTime - 8 bytes
- xs:hexBinary - 1 byte per 2 characters of input
- xs:base64Binary - 3 bytes per 4 characters of input
Note that all types derived from xs:decimal (xs:integer, xs:long, xs:byte, etc.) take 17 bytes. All other types, as well as all text nodes in untyped XML, are stored as Unicode strings, and use two bytes per character. The biggest opportunities for saving are with binary types (xs:hexBinary values takes up four times as much space when untyped), and to a lesser extent xs:dateTime.
The size of the type annotation varies. For attributes and elements with complex content, the annotation is seven bytes: One for the extension token (indicating an extension of the standard binary XML format), one for the number of bytes in the extension, one for a set of flags, and four for the type ID.
For elements with simple content, there are two type annotations: one for the element itself, and one for the value. The annotation for the value is the same as the annotation for an attribute: seven bytes. The annotation for the element itself is 11 bytes, the the first seven being the same as for attributes and values, and the remaining four being a pointer to the value. The pointer is necessary because an arbitrary number of attributes may come between the element and its value. This gives a total of 18 bytes for elements with simple content.
So an element typed as xs:decimal will require 35 bytes between the value and type annotation. Only if the value is at least 18 digits long does this give any savings over the untyped representation. On the other hand, an attribute of type xs:dateTime needs only 15 bytes for its value and annotation if typed, whereas its string representation would require 40-58 bytes. While it's possible in principle for a document to be smaller as typed XML than as untyped XML, space savings should not generally be expected without a specific reason, such as use of large binary values or many dateTime values.
Comments
Anonymous
December 03, 2006
Quick question - I can't find documentation on the ranges for SQL 2K5 XQuery xs:float and xs:double data types. It appears the xs:float range might be different from the W3C spec and also different from the SQL REAL data type. I've come up with -3.4E+38 to -1.401298E-45, 0, +1.401298E-45 to +3.4E+38 for xs:float (still working on xs:double.) Is this correct? ThanksAnonymous
December 04, 2006
Other than a lack of support for NaN, xs:float and xs:double should conform to the IEEE-754 standard, which is what the W3C spec requires: <blockquote>The basic value space of float consists of the values m * 2^e, where m is an integer whose absolute value is less than 2^24, and e is an integer between -149 and 104, inclusive.</blockquote> The smallest possible positive float value is 1 * 2^-149, or about 1.401298e-45, and the largest possible finite float value is (2^24 - 1) * 2^104, or about 3.4028237e+38. Are you saying that the range is only slightly different from what you expected (i.e., at the sixth or seventh decimal place), or did you expect something radically different (e.g., 1.0e-149 to 9.99999e104)? SQL REAL also conforms to the IEEE-754 standard with a few limitations, namely that it doesn't support NaN or +/-INF. It also has limited support for denormalized values (those with an absolute value less than about 1.1754944e−38), which is why xs:float appears to have a greater range than REAL. You can't assign the literal value 1.401298E-45 to a REAL variable, but you can get there with something like this: <blockquote>declare @r real set @r = 1.401298E-37 set @r = @r / 100000000 select @r</blockquote>Anonymous
December 05, 2006
Thanks, my tests were only slightly off from the standard, but now I understand a little better why. Also, the documented range for SQL REAL (in BOL anyway) is -3.40E + 38 to -1.18E-38, 0 and 1.18E-38 to 3.40E+38, which is an even greater deviance than I expected, and that's what was throwing me off. Of course a deviance in the 38th place after the decimal point isn't that big a deal unless you're doing some sort of extremely specialized sub-atomic physics calculations (or trying to calculate your odds in Vegas lol), but I was just wondering if I my calcs were correct. Thanks again!