Consider Data Distribution when using OPTIMIZE FOR UNKNOWN
The Microsoft SQL Programmability & API Development Team Blog recently posted an excellent article titled "OPTIMIZE FOR UNKNOWN – a little known SQL Server 2008 feature."
I can definitely think of a few scenarios where this feature will be useful, however as with most performance tuning tools, you must have a good handle on the data distribution of the underlying data. I'll illustrate my point by referencing the Production.TransactionHistoryArchive table in the 2008 version of the AdventureWorks database. This table has an non-indexed column, ReferenceOrderLineID. If you look at the distinct values for this column across the 89,000+ rows, you'll see an uneven distribution. For example, there are only two rows with a ReferenceOrderLineID value of "72" - but more than a thousand rows for a ReferenceOrderLineID value of "5."
Create the following stored procedure in the 2008 version of the AdventureWorks database:
IF EXISTS (SELECT name from sys.procedures
WHERE name = 'usp_SEL_ReferenceOrderLineID_TransactionHistoryArchive')
BEGIN
DROP PROCEDURE dbo.usp_SEL_ReferenceOrderLineID_TransactionHistoryArchive
END
GO
CREATE PROCEDURE dbo.usp_SEL_ReferenceOrderLineID_TransactionHistoryArchive
@ReferenceOrderLineID int
AS
SELECT ReferenceOrderID
FROM Production.TransactionHistoryArchive
WHERE ReferenceOrderLineID = @ReferenceOrderLineID
OPTION (OPTIMIZE FOR (@ReferenceOrderLineID UNKNOWN))
GO
This procedure takes the ReferenceOrderLineID as an input parameter, which is then used in the WHERE clause of a simple SELECT query. In the last line of the SELECT, I use the new OPTIMIZE FOR UNKNOWN option, meaning that the query optimizer will use statistics to determine the ReferenceOrderLineID value when caching the stored procedure plan, rather than using the input parameter value passed on the initial execution.
Let's add a nonclustered index to cover the ReferenceOrderLineID column:
CREATE NONCLUSTERED INDEX ix_TransactionHistoryArchive_ReferenceOrderLineID
ON Production.TransactionHistoryArchive (ReferenceOrderLineID)
Now let's execute this procedure, along with the XML SHOWPLAN:
SET STATISTICS XML ON
EXEC dbo.usp_SEL_ReferenceOrderLineID_TransactionHistoryArchive 72
SET STATISTICS XML OFF
This procedure call returns two data rows and the associated XML SHOWPLAN. Because we have a nonclustered index to cover this column, and only two rows out of 80K returned, we should expect an index seek for a ReferenceOrderLineID of 72. However that is not what happened. Instead we see an index scan. Why? This is answered indirectly by evaluating the XML SHOWPLAN ParameterList node:
<ParameterList>
<ColumnReference Column="@ReferenceOrderLineID" ParameterRuntimeValue="(72)" />
</ParameterList>
Anything missing? The answer is YES - we are missing a ParameterCompiledValue value. Had we not used the OPTIMIZE FOR UNKNOWN hint in the stored procedure, an initial execution of the stored procedure would have shown the following value for the ParameterList node:
<ParameterList>
<ColumnReference Column="@ReferenceOrderLineID" ParameterCompiledValue="(72)" ParameterRuntimeValue="(72)" />
</ParameterList>
Most values in the ReferenceOrderLineID column are not very unique (high density). So when using OPTIMIZE FOR UNKNOWN in this case, the query execution plan is compiled to benefit the majority of high density values and therefore perform an index scan instead of an index seek. So in this case the drawback for using this hint will be encountered when passing in a lower density ReferenceOrderLineID value.
Also, if you look at the estimated row count in the plan - you'll see it was 1222 rows, versus the 2 rows that were actually returned. So we have another clue that OPTIMIZE FOR UNKNOWN selected a ReferenceOrderLineID query plan that would benefit 1222 rows. 1222 also happens to be the output of the following query that takes the average of row counts across each distinct value:
SELECT AVG(row_count)
FROM
(SELECT ReferenceOrderLineID, COUNT(*) row_count
FROM Production.TransactionHistoryArchive
GROUP BY ReferenceOrderLineID) a
The behavior of OPTIMIZE FOR UNKNOWN will be beneficial under parameter sniffing scenarios where initial execution values are not realistic and will cause a less optimal plan for consecutive calls using realistic input parameters. However this new feature isn't a free ride - you'll still need to understand the data distribution and anticipated query patterns before you can decide if it makes sense to use this hint.
Update January 5th 2009: The blog I referenced above originates from the MSSQLISV blog (https://blogs.msdn.com/mssqlisv/) and the author was Peter Scharlock, Senior Program Manager from the SQLCAT team. Thanks for the update Peter!
Comments
- Anonymous
January 07, 2009
PingBack from http://www.codedstyle.com/consider-data-distribution-when-using-optimize-for-unknown/