Conor vs. Misbehaving Parameterized Queries + OPTIMIZE FOR hints

In my previous post, I debugged a customer problem for you, including the various guesses I had to make and why they matter.

The specific problem was likely related to parameter sensitivity, an issue in query optimization where the optimizer will try to generate a plan that is optimal for one parameter value but perhaps not optimal over all parameter values likely to be used in the future calling pattern of this query.  I had recommended using the optimizer hint OPTIMIZE FOR (@p = <value>) to nudge the optimizer to use a value that was better for the average workload.

Another user wants to better understand the difference between OPTIMIZE FOR (@p = <value>) and OPTIMIZE FOR (@p UNKNOWN), a new hint that was added in SQL Server 2008.  This new hint optimizes for an average value rather than a specific value.  It is probably a good time to review a few background details of parameterized query plans so that I can use some of the vocabulary we use internally to discuss specific customer cases.
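For reference, here is what the two hints look like in context (a minimal sketch – the table T, column col, and the literal value 42 are hypothetical):

-- Optimize for a specific, representative parameter value.
SELECT * FROM T WHERE col = @p
OPTION (OPTIMIZE FOR (@p = 42));

-- SQL Server 2008 and later: optimize for an "average" value derived from
-- the statistics rather than whatever value is sniffed at compile time.
SELECT * FROM T WHERE col = @p
OPTION (OPTIMIZE FOR (@p UNKNOWN));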

So, the abbreviated summary (which you can find in earlier blog posts if you have not seen it before):

1. Queries in SQL Server (all current versions) are compiled, and the resulting plans are usually cached.  A cached plan is used for all subsequent executions as long as it does not recompile.

2. Queries with parameter values _can_ (but do not always) have their values “sniffed”, meaning that the optimizer takes a peek at the parameter’s current value in the containing T-SQL batch and uses that value to generate a more optimal plan.  You can tell whether a particular parameter value was sniffed by looking at the showplan XML output (see the sketch just after this list), but please understand that the current system is imperfect in terms of sniffing.  For the rest of this post, I will assume that values ARE sniffed, for simplicity in describing the behavior that happens in the system.

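If you want to check a cached plan, a query along these lines will pull its showplan XML; the ParameterList element in that XML records the compiled (sniffed) value for each parameter.  (A minimal sketch – the LIKE filter is a hypothetical placeholder for your own query text.)

-- Find cached plans for statements touching table T and inspect their XML.
SELECT st.text, qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.text LIKE '%FROM T WHERE col%';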

When the optimizer knows a specific value to sniff, it can determine things like whether a table scan is likely to be faster than a seek + bookmark lookup for a given filter.  For example, “SELECT * FROM T WHERE col = @p” would likely generate a scan plan if col = @p is non-selective.  If it is highly selective, then the seek may be optimal.  This seek/scan plan flip is one common case where parameter sniffing can yield a more optimal plan for a specific parameter value.  Whether this plan is optimal for the workload is another matter entirely.  Unfortunately, the optimizer is in a tricky spot.  Sniffing the value often leads to a much better plan – so much so that customers complain if they don’t get that plan.  So, there is an interaction between the initial sniffed value (and the plan generated from it) and the average calling pattern for that parameter.  (It gets more complicated when you have many parameters.)
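To make the seek/scan flip concrete, here is a minimal sketch (sp_executesql keeps the statement parameterized; the table, column, and values are hypothetical, with 42 rare and 5001 common in the data):

-- A rare value is sniffed at first execution, so a seek + bookmark lookup
-- plan is compiled and cached.
EXEC sp_executesql N'SELECT * FROM T WHERE col = @p', N'@p INT', @p = 42;

-- The common value reuses the cached seek plan, even though a scan would
-- have been the better choice had it been sniffed instead.
EXEC sp_executesql N'SELECT * FROM T WHERE col = @p', N'@p INT', @p = 5001;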

There are a number of different common “plan flips” where the optimizer generates different plans based on what is sniffed.  In addition to seek/scan flips, I often see index flips, where the optimizer will choose a seek on index 1 vs. a seek on index 2 (see the sketch below).  I also see join order flips in OLTP applications.  The main ingredient here is a workload repetitive enough for these patterns to show up.
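An index flip, sketched (the Orders table, its columns, and the index names are all hypothetical):

-- Assume two single-column indexes: IX_Orders_CustomerId, IX_Orders_ProductId.
SELECT * FROM Orders WHERE CustomerId = @cust AND ProductId = @prod
-- If the sniffed @cust is highly selective, expect a seek on IX_Orders_CustomerId;
-- if the sniffed @prod is the selective one, a seek on IX_Orders_ProductId instead.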

Even if you do have plan flips, this does not necessarily mean that your application will have problems.  The relative performance of those plans may be very close, meaning that there is no substantial benefit in picking one over the other.  Even if the performance isn’t great, it might be that the query is just unimportant to the application – perhaps it is not really visible to the end user and the whole job takes a while, so some variability in one query is insignificant.  The key point is that this problem includes the perception of the user and is not just some arbitrary metric defined in the optimizer.

The most common class of parameter sensitive problem (PSP) we see is where sniffing one value gives “good” performance (from the perspective of the customer) for all parameter values, while sniffing + caching the plan for another parameter value leads to “bad” performance for some executions.

Common Case (Atypical parameter):

                    Executing value 1   Executing value 2
  Sniffed value 1   “good”              “bad”
  Sniffed value 2   “good”              “good”

Less Common (Distinct Sets of Optimal Plans/Values): 

                    Executing value 1   Executing value 2
  Sniffed value 1   “good”              “bad”
  Sniffed value 2   “bad”               “good”

90% of the cases I see (where I can also get enough information to categorize) relate to having an uncommon value sniffed.  Taking our earlier example, if @p is a very frequent value in the data distribution, the filter WHERE col = @p is highly unselective and the scan plan may be generated.  However, this value may be very infrequently used as a parameter.  As a result, most of the time the seek plan is generated, but occasionally the scan plan might be generated instead if this infrequent value happens to be passed when the query needs to be recompiled.  Understanding the frequency of calling values vs. the frequency of values in the data distribution is necessary to fully understand which case you have.  Note that I could construct an example that is the opposite of this, where picking the scan plan is the right answer in most cases but sniffing a very infrequent value with low representation in the data set could cache a seek plan.  The metric is overall workload time.

The other kind of problem we see is that plans are really only optimal for certain sets of parameters.  If you use the wrong plan with the wrong parameter, then the performance is “bad” (again, from the perspective of the customer).  This is harder because you actually don’t want to cache plans like this, so adding OPTION (RECOMPILE) is usually needed if you want consistent performance.  (You can also split the query into N copies and call each separate copy based on the values you see, via T-SQL; see the sketch below.)
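Both workarounds, sketched (the branch value and the wrapper procedure names are hypothetical):

-- Recompile on every execution so each parameter set gets its own plan;
-- nothing is cached for reuse.  (Assume @p is a parameter of the batch.)
SELECT * FROM T WHERE col = @p
OPTION (RECOMPILE);

-- Or route each class of values to its own copy of the query (here, two
-- wrapper procedures), so each copy caches a plan suited to its values.
IF @p = 5001
    EXEC dbo.GetRows_CommonValue @p;   -- caches the scan-friendly plan
ELSE
    EXEC dbo.GetRows_RareValue @p;     -- caches the seek-friendly plan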

To optimize overall workload throughput, the other piece of the equation is compilation time.  If you have a query that is very expensive to run but cheap to compile, OPTION (RECOMPILE) fixes all PSP problems easily :).  If you have a PSP problem where the performance difference between the “good” and “bad” cases is larger than the total compilation time, you might very well be able to use OPTION (RECOMPILE) to bound that query’s performance.  If you have a critical code path that needs faster performance than compiling each time allows, then you should consider the OPTIMIZE FOR hints.  If you are in the common case, you can hint a regularly used parameter value and that will avoid the “bad” performance case.  If you are in the less common case, you may need to revert to OPTION (RECOMPILE) or perhaps split your queries into different copies so each set of parameters can get its own plan.

Now I’ve given enough background to explain OPTIMIZE FOR … UNKNOWN.  There are some customers who have the common PSP problem but do not know a common value to pass.  Perhaps their data distribution changes over time.  Perhaps they don’t know their data set that well.  OPTIMIZE FOR … UNKNOWN tells the optimizer to use an “average” value to optimize the query, and this often (but not always) will avoid the “bad” squares in the taxonomy I described.

Here’s an example of how this works:

SELECT * FROM T WHERE col = @p

Let’s say that T has 10000 rows.  Let’s say the values 1-5000 each appear in the table once, and the value 5001 appears 5000 times.  Furthermore, the column “col” is in an index, and we also have a lot of other columns that are not in that index.  If the optimizer sniffs any of the values 1-5000, it is likely to generate the seek + fetch plan.  For the value 5001, the scan plan is optimal.
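A sketch of that setup, if you want to reproduce it (all names are hypothetical; the cross joins are just cheap row generators):

CREATE TABLE T
(
    id      INT IDENTITY PRIMARY KEY,
    col     INT NOT NULL,
    payload CHAR(200) NOT NULL DEFAULT ''  -- stand-in for the unindexed columns
);

-- Values 1-5000, once each.
INSERT INTO T (col)
SELECT n
FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b) AS r
WHERE n <= 5000;

-- The value 5001, 5000 times.
INSERT INTO T (col)
SELECT TOP (5000) 5001
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

CREATE NONCLUSTERED INDEX IX_T_col ON T (col);

-- With the hint, the plan is compiled from the density rather than a sniffed
-- value.  (Assume @p is a parameter of the surrounding batch or procedure.)
SELECT * FROM T WHERE col = @p
OPTION (OPTIMIZE FOR (@p UNKNOWN));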

When the optimizer is reasoning about parameters, it might not have a specific value to use.  So, it reverts to using the density/frequency information stored in our histograms.  That stores a representation of how common an “average” value is in the data distribution.  In this case, the first 5000 values exist once and the last value is there 5000 times, so selecting an “average” value in an equality predicate would yield ~2 rows (10000 rows / 5001 distinct values).  2 may be a low enough cardinality that the seek + fetch plan is still optimal and the customer is happy.  Note that running the parameter 5001 with that plan may still be very slow, but perhaps it is never/rarely done.  That parameter will get suboptimal performance.

The use of the frequency/density information varies a bit based on the scalar predicate or relational operator.  GROUP BY col would use the density, but an arbitrary scalar condition in a WHERE clause may or may not (the conditions are too complex to explain, and we don’t document them all publicly anyway).
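If you are curious what density the optimizer sees, DBCC SHOW_STATISTICS exposes it (shown here against the hypothetical index from the earlier sketch):

-- "All density" in the density vector is 1 / (# of distinct values) – here
-- roughly 1/5001, so a parameterless estimate is ~10000 * (1/5001) ≈ 2 rows.
DBCC SHOW_STATISTICS ('dbo.T', 'IX_T_col') WITH DENSITY_VECTOR;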

I hope this gives a bit more insight into parameter sensitivity problems and how to solve them.

Happy Querying!

Conor

Comments

  • Anonymous
    August 12, 2010
    Great post. Thanks for this information.

  • Anonymous
    August 12, 2010
Thanks for these two articles, Conor.  I hadn't understood how SQL Server managed query plans before. What are the other reasons why a query is suddenly MUCH slower than usual?

    I have an ETL SSIS project which runs a series of SQL procs to clean up and augment the data it has loaded.  Normally each of the several updates takes a few seconds - but sometimes one or another takes many hours, on the same data as took seconds in a previous run.  If I start the whole job again, the same update usually runs normally, in seconds.

    I don't think it's a query plan issue, as the parameter values actually used should not result in a different plan - and also running dbcc freeproccache first doesn't stop the problem.  I've also tried rebuilding the indices just before the procedures run, without much benefit, so I don't think it's caused by index fragmentation.  Finally, I added EXEC sp_updatestats to the start of the update tasks in case it was a problem with out of date statistics, yet the problem still occurs.

    Nor is it a hardware problem, as it happens on both the development machine (old AMD single core running Windows 7) and the production server (dual Xeon, 16 GB RAM, Windows Server 2008); both run SQL Server 2008 sp1. The database size is 10 GBytes. Nothing else is using the server at the time and there don't appear to be any locked transactions.

    The only thing I can find out of the ordinary when a query goes slow - other than the duration! - is hundreds of millions of logical reads, against a few thousand physical reads. Any ideas?

  • Anonymous
    August 12, 2010
    Usually this would be a plan change, and I would first check to see if it is parameter sniffing.  The kinds of cases where this would happen are seek/scan flips (as described in the post) and sometimes there are Nested Loops + Spools on the inner side of the NL.  This would happen if you have a "gatekeeper row" case, which is conceptually similar to a parameter but not actually a parameter.  I guess I have my next topic on which to blog :).

  • Anonymous
    August 12, 2010
    The comment has been removed

  • Anonymous
    August 13, 2010
And of course, soon after I posted that, the job once again took 100 minutes for a step that normally takes 2 seconds, DESPITE the WITH RECOMPILE.

  • Anonymous
    August 13, 2010
The main thing I can recommend is capturing the query plans (SQL Profiler, DMVs) when you see this case happening.  This would give you insight into what is happening - table scan, lots of random IOs, etc.
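    For example, a sketch along these lines snapshots the most expensive cached statements and their plans while the slow step is running (standard DMVs in SQL 2008):

    -- Rank cached statements by cumulative logical reads and grab their plans.
    SELECT TOP (20)
        qs.total_logical_reads,
        qs.execution_count,
        st.text,
        qp.query_plan
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
    ORDER BY qs.total_logical_reads DESC;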

  • Anonymous
    August 15, 2010
    The comment has been removed

  • Anonymous
    August 15, 2010
    Please read my earlier blog post on batch compilation in SQL Server: blogs.msdn.com/.../conor-vs-dynamic-sql-vs-procedures-vs-plan-quality-for-parameterized-queries.aspx

  • Anonymous
    August 15, 2010
    The comment has been removed

  • Anonymous
    August 15, 2010
    The comment has been removed

  • Anonymous
    August 16, 2010
Let me start by stating I am quite new to the world of SQL Server and DB administration in general. I'm trying to learn as much as I can. My understanding of parameterized queries and cached plans is that each subsequent query must be identical to the cached version. My question is: how identical must future queries be? Does it have to be identical character for character? Is the optimizer smart enough to recognize simple differences in column or predicate order? What if the only difference is the name of the database being queried?

  • Anonymous
    August 17, 2010
    Thanks Leon, that's a useful script.

  • Anonymous
    August 20, 2010
Caching is done with an exact string match today (a memcmp, broadly speaking).

  • Anonymous
    May 09, 2012
    The comment has been removed