T-SQL text hashing in SQL Servers 2000 and 2005
Recently several users of SQL Server ran into interesting issues related to hashing of T-SQL text by SQL Server for caching purposes. This blog post will describe how SQL Server performs this hashing, and the differences in behavior between 2000 and 2005 versions of SQL Server.
In SQL Server, plans for queries submitted from the client (as opposed to queries in stored procedures or other stored objects) are cached using the text of the query as the key. The plan cache is logically organized as a large hash table, and a hash of the text of the query is computed in order to lookup the correct bucket in the hash table during cache insertions and lookups.
In SQL Server 2000, the hash is computed from the first 4000 characters of the query text. This allows hash computation to be fast for large queries, but of course is problematic if an application submits different batches that have a large common initial part, because such queries will all go into the same hash bucket.
In both SQL Server 2000 and 2005, for parameterized queries (submitted, for example, using sp_executesql or sp_prepare APIs), the parameter declaration string is prepended to the text of the query for the purposes of determining T-SQL text that is hashed. For example, if the application calls sp_executesql as follows:
sp_executesql N'select * from t where a = @p and b = @q', N'@p int, @q varchar(300)', 3, 'hello'
the T-SQL text used for cache lookup will be
(@p int, @q varchar(300))select * from t where a = @p and b = @q
The limitation in SQL Server 2000 that only first 4000 characters are hashed could cause problems with applications that use a lot of parameters, because it is possible for the declaration string to be more than 4000 characters long. If an application uses the same names and types of parameters in multiple queries, then the declaration strings will be the same, but the actual queries will be different. Because only 4000 characters are hashed, SQL Server will send all these queries to a single hash bucket. Cache lookup will spend a lot of time searching through this bucket and comparing strings, and if these queries come from multiple connections, it will also cause a lot of spinlock contention, since only one thread can access a hash bucket at the same time. Performance suffers.
A workaround for SQL Server 2000 applications that use parameterized queries with many parameters is to vary the names of their parameters between different queries to avoid hash bucket collisions.
In SQL Server 2005 we hash the entire query text, but we skip the parameter declaration string for hashing purposes. This fixes the problems from SQL Server 2000 with an parameterized queries with many parameters.
Unfortunately in some scenarios, the new behavior can also cause pains, when coupled with the auto-parameterization feature of SQL Server. Suppose an application is doing inserts into a table with numeric columns and that the constants inserted vary in scale/precision. For
example, suppose two of the submitted batches are:
insert t values (2.1, 3.12)
go
insert t values (3.33, 4.1)
go
SQL Server auto-parameterization feature converts the two samples batches to:
(@1 numeric(2,1), @2 numeric(3,2))insert [t] values(@1 , @2)
(@1 numeric(3,2), @2 numeric(2,1))insert [t] values(@1 , @2)
Thus the two batches differ only in the declaration string, not the query text, and so will hash to the same value. If there is a large number of queries like this, differing in scale and precision of the constants use, they will all go into the same bucket.
A workaround for this problem in SQL Server 2005 is for the application to explicitly parameterize queries on the client side with some common scale/precision for the numeric parameters.
SQL Server 2005 SP2 will hash the entire T-SQL text, including both the query text and the declaration string, solving such problems for good.