SQL Server: Maintaining Batch Processing with two or More DISTINCT Aggregates
Certain queries that use at least two DISTINCT aggregates, may get less-than-optimal query plans with the columnstore index. They still probably will be no worse than older row-store plans, but because pure batch-mode columnstore plans are so fast, you may notice the difference more. Here's to higher expectations!
All is not lost though. You can rewrite these queries to run fast by doing almost all their work in batch mode. For example, consider the following query:
-- slow plan with spool<br>
<br>
select p.Date, <br>
count(distinct p.UserId) as UserIdCount, <br>
count(distinct p.MediaId) as MediaIdCount<br>
from dbo.Purchase p, dbo.Media m <br>
where p.MediaId = m.MediaId <br>
and m.Category in ('Horror') <br>
group by p.Date
The dbo.Purchase table has about 101 million rows, and a columnstore index on it. dbo.Media has a few thousand rows. In SQL Server 2012 RTM, this query generates a plan with a table spool. It saves millions of rows into an intermediate spool file with a single thread. A separate branch of the query then reads the spool, so each of the aggregates is computed separately. This can take on the order of half a minute on a 4-core processor. You can improve on this dramatically with the following query rewrite. It computes each aggregate separately with a query block that contains only a single aggregate with grouping, and then joins the two results together on the grouping key to form the final result.
-- workaround <br>
<br>
with DistinctMediaIds (Date, MediaIdCount) as <br>
( <br>
select p.Date, <br>
count(distinct p.MediaId) as MediaIdCount <br>
from dbo.Purchase p, dbo.Media m <br>
where p.MediaId = m.MediaId <br>
and m.Category in ('Horror') <br>
group by p.Date <br>
), <br>
DistinctUserIds (Date, UserIdCount) as <br>
( <br>
select p.Date, <br>
count(distinct p.UserId) as UserIdCount <br>
from dbo.Purchase p, dbo.Media m <br>
where p.MediaId = m.MediaId <br>
and m.Category in ('Horror') <br>
group by p.Date <br>
) <br>
select m.Date, m.MediaIdCount, u.UserIdCount <br>
from DistinctMediaIds m join DistinctUserIds u on m.Date=u.Date
The above query produces the same result as the first one but it runs in about a three quarters of a second, compared to half a minute or so for the original query.
It used to be the case that queries with multiple aggregates that included only a single distinct aggregate would use a spool. SQL Server 2012 RTM has an improvement to eliminate the spool in that case, and in general gives good performance for queries with a single distinct aggregate. Nevertheless, the resulting query plan still may not be as efficient as this particular style of query rewrite. Keep this rewrite technique in mind if you need to tune your application when using columnstore indexes. It follows a common general pattern: summarize data into small sets of rows using query plans that run fully in batch mode, and then combine the results. You can use the same pattern in multiple ways to tune columnstore queries in SQL Server, if the optimizer doesn't produce a plan that's fast enough for your needs.