Are My Statistics Correct?

The question is often “Are my statistics up-to-date?” which can be a bit misleading.   I can make sure I have up-to-date statistics but the statistics may not be accurate. 

I recently engaged in an issue where the statistics were rebuilt nightly.   A maintenance job change had been made moving from FULLSCAN to WITH SAMPLE statistics creation/update that dramatically altered the statistical layout.  The underlying data was skewed and as such the execution plan generation(s) varied significantly.  Queries running in 1 minute now took over an hour to complete using an alternate plan with significant memory grants and TEMPDB usage.

As you can imagine this issue has resulted in a series of DCR asks from the product team.  

The dilemma we all run into is what level of SAMPLED statistics is appropriate?   The answer is you have to test but that is not always feasible and in the case of Microsoft CSS we generally don’t have histogram, historical states to revisit.  

Microsoft CSS is engaged to help track down the source of a poorly performing query.   It is common step to locate possible cardinality mismatches and study them closer.   Studying the statistics dates, row modification counter(s), atypical parameters usage and the like are among the fundamental troubleshooting steps.

The script, below, is one way Microsoft CSS may use to help determine the accuracy of the current statistics.   You can use similar techniques to check the accuracy of your statistical, SAMPLING choices or to store historical information.  The example loads a specific histogram for the ‘SupportCases’ table then executes queries to, using the key values and range information to determine actual counts (as if FULLSCAN) had been executed.   The final select of the captured data can be used to detect variations in current actual vs the in use histogram.

create table #tblHistogram
(
vData sql_variant,
range_rows bigint,
eq_rows bigint,
distinct_range_rows bigint,
avg_range_rows bigint,
actual_eq_rows bigint DEFAULT(NULL),
actual_range_rows bigint DEFAULT(NULL)
)
go

create

procedure #spHistogram @strTable sysname, @strIndex sysname
as

dbcc show_statistics(@strTable, @strIndex) with HISTOGRAM
go

truncate

table #tblHistogram
go

insert

into #tblHistogram (vData, range_rows, eq_rows, distinct_range_rows, avg_range_rows)
exec #spHistogram 'SupportCases', 'cix_SupportCases'
go

-- EQ_ROWS

update #tblHistogram
set actual_eq_rows = (select count(*) from SupportCases with(NOLOCK) where ServiceRequestNumber = h.vData)
from #tblHistogram h;

-- RANGE_ROWS

with BOUNDS (LowerBound, UpperBound)
as
(
select LAG(vData) over(order by vData) as [LowerBound], vData [UpperBound] from #tblHistogram
)

update

#tblHistogram
set actual_range_rows = ActualRangeRows
from (select LowerBound, UpperBound,
(select count(*) from SupportCases with(NOLOCK) where ServiceRequestNumber > LowerBound and ServiceRequestNumber < UpperBound) as ActualRangeRows from BOUNDS
) as t
where vData = t.UpperBound
go

select

/*TOP 10 NEWID(),*/ vData, eq_rows, actual_eq_rows, range_rows, actual_range_rows from #tblHistogram
where eq_rows <> actual_eq_rows or range_rows <> actual_range_rows
--order by 1
go

Testing the script I leveraged UPDATE STATISTICS with SAMPLE 1 PERCENT and skewed data in my table.   This resulted in several steps of the histogram having a statistical variation of +200% from the actual (FULLSCAN) values.

I continued to test variants of SAMPLE PERCENTAGE until the statistical relevance level from actuals fell within a noise range.   For my data this was 65 PECENT.   SAMPLING at 65 PERCENT allows reduction of statistics creation/modification time while retaining the necessary statistical relevance.

Bob Dorr - Principal SQL Server Escalation Engineer

Comments

  • Anonymous
    November 11, 2015
    Hi Bob, thanks for the post. Glad to see more attention on this as 'up-to-date but wrong' statistics is a problem I've been dealing with lately. I think it's a somewhat neglected topic and it seems there are many DBAs who aren't aware of the distinction between out-of-date stats and up-to-date but wrong stats, nor of the fact that there are some datasets that you can run UPDATE STATISTICS WITH FULLSCAN on all day but they will still be inaccurately represented in the histogram. One thing I have read is that FULLSCAN can go parallel and therefore often finishes sooner than SAMPLE n PERCENT. Have you found this to be the case in practice? As you might know, Kim Tripp has also done some interesting work in this area and has written some scripts that compare the histogram steps with the actual data in an indexed column to quantify the skew, and then generate filtered statistics on the values that are inaccurately represented in the histogram. I have a 2008 -> 2014 upgrade for a project that has historically been a stats disaster coming soon, so I will make use of the technique above along with Tripp's scripts and see how I get on. Also eager to hear more about those DCRs in the future. : ) (Urban Dictionary tells me that's the Microsoft term for Design Change Request.) Thanks, James