SQL: Ensuring Your Data is Sorted or Nearly Sorted by Date to Benefit from Date Range Elimination
The most common type of filter in data warehouse queries is by date. Columnstore segment elimination helps you skip entire one-million-row segments if the system can determine that no rows qualify, simply by looking at the minimum and maximum values for a column in a segment. So you usually will want to make sure that your segments are sorted, or nearly sorted, by date, so date filters can be executed as fast as possible.
Consider the following table
Fact(Date, ...)
and assume that you will create a columnstore that includes all columns of this table. There are two ways to make sure that the data is sorted, or nearly sorted, by Date when it is placed into columnstore segments.
- Create a clustered B-tree index on Date before you build the columnstore index.
- Partition the table by Date, and use a reasonably narrow partition width (e.g. Day or Week).
Data is inserted into a columnstore index during the index build process in the order of the underlying clustered B-tree index, if there is one. So method 1 is preferred to give the best segment elimination.
Even if the underlying data in the table is a heap, if you partition the table by Date (method 2), every segment within a partition has dates greater than all preceding partitions and less than all following partitions. For example, if you partition by day, and there are five million rows per day, then there will be five columnstore segments per day. If the underlying table does not have a B-tree index on date, then data for these five segments will not be sorted by Date. But the table will be approximately ordered by Date simply due to the partitioning. This approximate order will be enough so that you'll get very good partition elimination for filters on Date. The following table illustrates this for two days worth of data.
partition 1 Date = 20120301 |
|
|||||
partition 2 Date = 20120302 |
|
If you use partitioning to ensure order by Date, be sure that there is at least almost a full segment worth (one million rows) of data for each partition. If the number of rows per partition is a lot less than a full segment worth, then the per-segment overhead associated with scanning the data will be somewhat high. We've done experiments, and once the number of rows per segment is less than about 100,000, query performance starts to drop off noticably. So, for example, if you are using partitioning to ensure data ordering to improve segment elimination, and you have about a million rows per week, it will most likely be better to partition by week, not by day. That will ensure that your segments are, on average, relatively full. This will reduce per-segment scanning overhead.
Late Arriving Updates
If you have late arriving updates, but an insert-only update pattern, then partitioning by Day or Week may not be convenient. That's because you may have to update multiple partitions during your load operations if you partition by date. If you have this pattern, you can still use partitioning to maintain data approximately sorted by date. Instead of partitioning by date, you can, for example, create an integer load_date column (e.g. make it the date of the load job in YYYYMMDD format). Create a partition for each load job, and give it a new load_date. Partiiton by load_date. Because late arriving updates are almost always for data from the recent past (at most a few days or weeks old), this will keep the data close to ordered by date. The data is naturally clustered by date. So, you will still get very effective segment elimination for date ranges. If you want to load more than once a day, you can generalize this idea to partition by load_job_id which can contain the date and time of the load. That will make the load_job_id unique, and ordered by date, even if you load multiple times a day.
The following script illustrates this. It assumes that there is one load per date. It uses a column called load_date to identify each load.
/*
This demo creates an N partition table with a date column,
a load_date column, and a measure. It illustrates how you
can get date range segment elimination even when you don't
partition by date, if your load_date is correlated with date.
If you have trouble building or updating a table with a
large number of partitions, you may need to increase
'min server memory' with sp_configure and try again.
*/
declare @N int = 2000;
-- create a partition function with N partitions
declare @start_date datetime = '20000101'; -- start date
declare @value_list varchar(max) = '';
set nocount on;
declare @d datetime = @start_date;
declare @i int = 1;
while(@i<>@N)
begin
if @i <> 1
set @value_list += ',';
set @value_list += convert(varchar(30),@d,112);
set @d += 1;
set @i+=1;
end
declare @cmd varchar(max) =
'create partition function pf (int) as range left for values ('
- @value_list
- ')';
-- create the function
exec ( @cmd );
create partition scheme ps as partition pf all to ([PRIMARY]);
-- Create the fact table partitioned by "load_date," not the
-- actual date of the business events,
-- which is stored in a different column, "date."
create table fact (date int, load_date int, value int) on ps(load_date);
-- add a row to each partition
set @i = 1;
set @d = @start_date;
while(@i<>@N)
begin
-- Insert rows with closely correlated date and
-- load_date (this normally would occur naturally).
insert fact values(convert(varchar(30),@d,112),
convert(varchar(30),@d+1,112),
@i+1);
set @d += 1;
set @i+=1;
end
-- Create the columnstore index. It inherits the
-- partitioning scheme of the table.
create nonclustered columnstore index ncci on fact(date, load_date, value);
go
-- Run a query that filters to a narrow date range. -- This query gets extremely good segment elimination -- and runs in about a tenth of a second. For larger -- data volumes, this segment elimination will improve query -- performance tremendously.
-- See the topic on verifying segment elimination to see how you
-- can check the segment elimination is happening.
select date, load_date, sum(value)
from fact
where date between 20000101 and 20000107
group by date, load_date
go
Return to main SQL Server columnstore index tuning page.