SQL Server: Using Statistics with Columnstore Indexes

Article
1/17/2024

The query optimizer uses statistics to estimate the cardinality (number of rows) in the output from an operator in a query plan. Statistics play an important role in getting a good query plan, so it's worth some effort to ensure that the optimizer has the right information when choosing a query plan.

1) Keep statistics up to date

Keeping statistics up to date is important so that the query optimizer has accurate information on which to base its cardinality estimates when evaluating various query plan alternatives. Up to date statistics are important whether or not you use columnstore indexes, but the effects of stale statistics can be especially noticeable when you have a columnstore index because the performance difference can be quite large between a good query plan that uses columnstore indexes and batch mode processing and a not-so-good query plan that either does not use an available columnstore index when it should or underestimates the size of a memory grant request, causing a hash join to spill to disk. If you add data with ascending data values, read this blog post about keep statistics updated in the presence of ascending keys: Ascending Keys and Auto Quick Corrected Statistics.

2) Columnstore indexes do not have their own statistics

Statistics are not created automatically when you create a columnstore index. The optimizer will use statistics that exist on the base table, whether those statistics were created automatically or manually. Hence, you will generally want to create statistics on all the columns you expect to query or leave auto create statistics turned on.

When you create a columnstore index, a statistics object is created, but the values reported by DBCC SHOW_STATISTICS are all NULL. The statistics object associated with a columnstore index is used only for database cloning. A database clone is a statistics-only copy of the database, used for investigating query plan issues. For more information about database cloning, see http://blogs.msdn.com/b/psssql/archive/2009/07/08/attach-of-the-clone-databases.aspx

3) Create multicolumn statistics on correlated columns

The optimizer estimates how much each filter and join in a query plan will reduce the amount of data to be processed by the next operator. When filters are on columns with highly correlated data, the optimizer can overestimate the data-reducing effect of multiple filters and thus underestimate the size of a join. Creating multicolumn statistics on columns that you know, from your design, are correlated will help the optimizer make better cardinality estimations. This recommendation is especially important for the fact tables in your design. Better cardinality estimates for joins can result in more accurate memory grants and avoid spills to tempdb. When designing multicolumn statistics, consider what data are correlated and which columns will have predicates in the same query. The optimizer can use a multicolumn statistic only if the columns in the predicates match a prefix of the columns in the statistics definition. For example, if you create a multicolumn statistic on (A, B, C), the statistics can be used for predicates on A, B, and C, predicates on A and B, or a predicate on A only. The statistic cannot be used for queries with predicates only on A and C or predicates only on B and C.

Share via

SQL Server: Using Statistics with Columnstore Indexes

1) Keep statistics up to date

2) Columnstore indexes do not have their own statistics

3) Create multicolumn statistics on correlated columns

See also

Additional resources