SQL Server 2008 Statistics

Artículo
03/18/2009

Here are some notes on "SQL Server 2008 Statistics" I took while attending an advanced class on SQL Server taught by Kimberly Tripp (https://sqlskills.com/AboutKimberlyLTripp.asp).

Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without the right pre-requisite knowledge and the right context. Reading the post from top to bottom will help.

Understanding the job of the cost estimator

Table Scan (always an option)
Useful Indexes (look at WHERE, JOIN, GROUP BY, ORDER BY, covering)
Other options?

Statistics

Statistics = density information about a table kept by SQL
Help SQL "estimate" how many rows a certain query will return
Stored as a BLOB, but you can query the data using "DBCC SHOW_STATISTICS"

Not always right

Statistics cannot be right for every case, or else it would be actual data
By definition, it is not all the data, it's a sample.
How frequently do you update? Depends on how the data is changing.
Statistically, certain types of data are consistent over time.
Unless that are some atypical events related to your data.

Could be statistically correct after update

Look at the city column for an employee table for a company based in Seattle
Think of the frequency for Seattle, Redmond (suburb) and Spokane (hours away)
This would probably not change much (statistically) quickly
High number of rows with low selectivity - this is harder.

Looking at stats

Things to look in your statistics (DBCC SHOW_STATISTICS):
How old are your statistics?
How many rows sampled in the statistics (out of the total rows)?
How many steps in the statistics (up to 201 steps in SQL Server 2008)?
Steps = number of rows in your histogram.
What is the density of that key?
Density * # of Rows = average rows returned for a given value
You also get the average length of the column.
For every step: actual value (high for step), total rows, count of unique rows.
Data could be sampled (not based on every row, but just a subset).

AUTO

Generating statistics is usually not so expensive, it's done with sampling
The query optimizer will actually create statistics on the fly, as queries come in.
Future queries will benefit from having created that.
If you don't have the stats, SQL will not be sure of the usefulness of an index.
AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS are ON by default.

Stats with no index?

Statistics do make sense in cases when you don't have indexes on a column
For instance, for helping with deciding how to do sorts
For instance, in a JOIN, to figure out which side has a smaller set of rows.

Let SQL do it

Let SQL Server do its thing with stats
You may start with a query taking a long time, than it improves "by itself"
This could be the results of statistic being autogenerated and helping you
Careful: On Read-only database, SQL cannot create statistics

Pre-create statistics

You can also pre-create the statistics
You could pre-create for every column of every table, but you probably shouldn't
It probably helps generate them for every column of your non-clustered indexes
You can do that using "sp_createstats 'indexonly', 'fullscan'"

Leave it on

Leave autocreate/autoupdate on. It's fine in most cases
For most cases, you better off leaving them on
If you know what you're doing, you can manage it manually
You can still leave it on, since it will do only what's missing
In that case, use STATISTICS_NORECOMPUTE
Having one specific "problem child" database with scheduled updates with full scans.
Details at https://www.sqlskills.com/BLOGS/KIMBERLY/post/Auto-update-statistics-and-auto-create-statistics-should-you-leave-them-on-andor-turn-them-on.aspx

Stats get old

After they are created, statistics will get old over time, as data gets updated
You can use sp_updatestats to query
You can also query sys.stats and sys.indexes
You can use DBCC SHOW_STATISTICS (with HISTOGRAM?)

Auto updates

Statistics are automatically updated if AUTO_UPDATE_STATISTICS are ON (default)
They get updated when about 20% of the data changes
In SQL 7, tracks row mod, immediately updates
In SQL 2000, tracks row mod, updates when needed
In SQL 2005, tracks column mod, updates when needed
In SQL 2008, tracks column mod, updates when needed. New: Filtered stats.

Async updates

Even if stats need updating, you can use old stats if new ones are not ready
To do that, use ALTER DATABASE dbname AUTO_UPDATE_STATISTICS_ASYNC ON
Use with caution

Sampling

Sampling is usually OK, but could create issues if data is not evenly distributed
You might miss something significant if your sampling does not give you enough
You can disable autoupdate and update statistics manually using a full scan
To do that "UPDATE STATISTICS... WITH FULL SCAN".

Very large sets

For very large data sets, even a full scan (no sampling) might not be enough.
Think a few billion sales records, looking at statistics for customer_number.
In that case, you have 1000 large customers and 1 million small ones.
With only 201 steps, you just can't get good statistics for both large and small.
For those, consider partitioning into multiple tables to get better stats.
In SQL 2008, consider filtered statistics (for instance, WHERE customer_type=Large).
Automatically created for filtered indexes. Can be created manually.

Conclusion

Statistics will only help
However, it is always better to have the right indexes
As usual, you can't optimize for every single query
Pick your battles.

For details, see "Statistics Used by the Query Optimizer in Microsoft SQL Server 2005" at:
https://technet.microsoft.com/en-us/library/cc966419.aspx

Compartir a través de

SQL Server 2008 Statistics

Recursos adicionales