Udostępnij za pośrednictwem


Columnstore Indexes

An xVelocity memory optimized columnstore index, groups and stores data for each column and then joins all the columns to complete the whole index. This differs from traditional indexes which group and store data for each row and then join all the rows to complete the whole index. For some types of queries, the SQL Server query processor can take advantage of the columnstore layout to significantly improve query execution times. The explosive growth of data warehousing, decision support and BI applications has generated an urgency to read and process very large data sets quickly and accurately into useful information and knowledge. This growth of data volumes and escalating expectations are challenging to maintain or improve through incremental performance improvements. SQL Server columnstore index technology is especially appropriate for typical data warehousing data sets. Columnstore indexes can transform the data warehousing experience for users by enabling faster performance for common data warehousing queries such as filtering, aggregating, grouping, and star-join queries.

Contents

Basics

  • Basics: Columnstore Indexes Described

  • Basics: Columnstore Index Restrictions and Limitations

  • Demonstration Example: Columnstore Indexes with a Partitioned Table

  • Basics: Typical Columnstore Index Scenarios

  • Basics: Bitmap Filter Optimizations

Best Practices

  • Best Practices: Updating Data in a Columnstore Index

  • Best Practices: Choosing Columns for a Column Store Index

  • Best Practices: Partitioned Tables

How to

  • How To: Create a Column Store Index

  • How to: Determine the Size of a Columnstore Index

  • How To: Troubleshoot the Performance of a Column Store Index

Basics: Columnstore Indexes Described

Columnstore indexes in the SQL Server Database Engine can be used to significantly speed-up the processing time of common data warehousing queries. Typical data warehousing workloads involve summarizing large amounts of data. The techniques typically used in data warehousing and decision support systems to improve performance are pre-computed summary tables, indexed views, OLAP cubes, and so on. Although these can greatly speed up query processing, these techniques can be inflexible, difficult to maintain, and must be designed specifically for each query problem.

For example, consider a fact table F1 with dimension key columns dk1 and dk2. Let M be an aggregate function such as SUM. Instead of calculating M over column dk1 every time that a query referring to M(dk1) is run, a summary table F2(dk1, M) can be created and used so the result can be pre-computed and the query can execute faster. However, if a new query referring to M(dk2) is needed, a new summary table F3(dk2, M) with this information has to be created. When the number of columns in a table increases and with many possible functions, this approach becomes difficult to maintain and does not easily cover all the queries that are needed.

This overhead can be significant for the user. By using SQL Server columnstore indexes, users can reduce the overhead of the other solutions. Columnstore indexes also enable queries to compute the results quickly so that pre-computation is not required.

The key characteristics of SQL Server column store technology are as follows:

  • Columnar data format – Unlike the traditional row based organization of data (called rowstore format), in columnar database systems such as SQL Server with columnstore indexes, data is grouped and stored one column at a time. SQL Server query processing can take advantage of the new data layout and significantly improve query execution time.

  • Faster query results – Columnstore indexes can produce faster results for the following reasons:

    • Only the columns needed must be read. Therefore, less data is read from disk to memory and later moved from memory to processor cache.

    • Columns are heavily compressed. This reduces the number of bytes that must be read and moved.

    • Most queries do not touch all columns of the table. Therefore, many columns will never be brought into memory. This, combined with excellent compression, improves buffer pool usage, which reduces total I/O.

    • Advanced query execution technology processes chunks of columns called batches in a streamlined manner, reducing CPU usage.

  • Key columns – There is no concept of key columns in a columnstore index so the limitation on the number of key columns in an index (16) does not apply to columnstore indexes.

  • Clustered index key – If a base table is a clustered index, all columns in the clustering key must be present in the nonclustered columnstore index. If a column in the clustering key is not listed in the create index statement, it will be added to the columnstore index automatically.

  • Partitioning – Columnstore indexes works with table partitioning. No change to the table partitioning syntax is required. A columnstore index on a partitioned table must be partition-aligned with the base table. Therefore a nonclustered columnstore index can only be created on a partitioned table if the partitioning column is one of the columns in the columnstore index.

  • Record Size – The index key record size limitation of 900 bytes also does not apply to columnstore indexes.

  • Query processing – Along with the columnstore index, SQL Server introduces batch processing to take advantage of the columnar orientation of the data. The columnstore structure and batch processing both contribute to performance gains, but investigating performance issues might be more complex than if only one factor were involved.

  • Table cannot be updated – For SQL Server 2012, a table with a columnstore index cannot be updated. For workarounds, see Best Practices: Updating Data in a Columnstore Index

For syntax on how to create a columnstore index, see CREATE COLUMNSTORE INDEX (Transact-SQL).

Data Types

The common business data types can be included in a columnstore index. The following data types can be included in a columnstore index.

  • char and varchar

  • nchar and nvarchar (except varchar(max) and nvarchar(max))

  • decimal (and numeric) (Except with precision greater than 18 digits.)

  • int, bigint, smallint, and tinyint

  • float (and real)

  • bit

  • money and smallmoney

  • All date and time data types (except datetimeoffset with scale greater than 2)

The following data types cannot be included in a columnstore index:

  • binary and varbinary

  • ntext, text, and image

  • varchar(max) and nvarchar(max)

  • uniqueidentifier

  • rowversion (and timestamp)

  • sql_variant

  • decimal (and numeric) with precision greater than 18 digits

  • datetimeoffset with scale greater than 2

  • CLR types (hierarchyid and spatial types)

  • xml

The Potential for Poor Performance

Decision support query performance is often improved when columnstore indexes are used on large tables, but some queries and even whole workloads may perform worse. By using a cost-based approach, the query optimizer typically decides to use a columnstore index only when it improves the overall performance of the query. However, the cost models the optimizer uses are approximate and sometimes the optimizer chooses to use the columnstore index for a table when it would have been better to use a row store (B-tree or heap) to access the table. If this occurs, use the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query hint, or use an index hint to direct the optimizer to a row store index. The optimizer might still include some information from the columnstore index. Therefore, in rare cases, this option might not resolve the performance problem. If your workload performance is not helped by a columnstore index, and you cannot use index hints to correct the issue, drop the columnstore index to revert to row store processing.

Problem areas

SQL Server columnstore indexes and column-based query processing are optimized for typical data warehousing queries that feature a large fact table, and moderate to small dimension tables joined in a star schema configuration, and then grouped and aggregated. Although the number of rows in the fact table is large, these queries typically return a comparatively small result set because the data is aggregated. Performance of queries using the columnstore index can be slow when one or more of the following conditions are true.

  • The result set is large because the data is not aggregated. (Returning a large result set is inherently slower than returning a small result set.)

  • There is no join, filtering, or aggregation. In this case there is no batch processing. Therefore, the benefit from the columnstore index is limited to the benefits of compression and reading fewer columns.

  • Two large tables must be joined together in a way that creates large hash tables that cannot fit into memory and must spill to disk.

  • Many columns are returned, which cause more of the columnstore index to be retrieved.

  • A join condition to a columnstore-indexed table includes more than one column.

You can use the methods described earlier in this section to work around slow processing with the columnstore if it occurs for one of these reasons.

Top

Basics: Columnstore Index Restrictions and Limitations

Basic Restrictions

A columnstore index:

  • Cannot have more than 1024 columns.

  • Cannot be clustered. Only nonclustered columnstore indexes are available.

  • Cannot be a unique index.

  • Cannot be created on a view or indexed view.

  • Cannot include a sparse column.

  • Cannot act as a primary key or a foreign key.

  • Cannot be changed using the ALTER INDEX statement. Drop and re-create the columnstore index instead. (You can use ALTER INDEX to disable and rebuild a columnstore index.)

  • Cannot be created with the INCLUDE keyword.

  • Cannot include the ASC or DESC keywords for sorting the index. Columnstore indexes are ordered according to the compression algorithms. Sorting is not allowed in the index. Values selected from a columnstore index might be sorted by the search algorithm, but you must use the ORDER BY clause to guarantee sorting of a result set.

  • Does not use or keep statistics in the manner of a traditional index.

  • Cannot contain a column with a FILESTREAM attribute. Other columns in the table that are not used in the index can contain the FILESTREAM attribute.

A table with a columnstore index cannot be updated

To work around this problem, see Best Practices: Updating Data in a Columnstore Index.

Effects of Limited Memory

Column store processing is optimized for in-memory processing. SQL Server implements mechanisms that enable data and most data structures to spill to disk when insufficient memory is available. If severe memory restrictions are present, processing uses the row store. There may be instances in which the columnstore index is chosen as an access method but memory is insufficient to build needed data structures. By starting as a columnstore operation and then defaulting to a slower code path, there might be some decrease in performance in cases of queries encountering a severe memory restriction. The effective memory requirement for any query depends on the specific query. Building a columnstore index requires approximately 8 megabytes times the number of columns in the index times the DOP (degree of parallelism).  Generally the memory requirements increase as the proportion of columns that are strings increases. Therefore, decreasing the DOP can reduce the memory requirements for building the columnstore index.

Evaluation of some expressions will be accelerated more than others

Some common expressions are evaluated using a batch mode instead of a single row at a time mode when the columnstore index is used. The batch mode will provide additional query acceleration, in addition to the advantages of using a columnstore index. Not every query execution operator is enabled for batch mode processing.

The columnstore index does not support SEEK

If the query is expected to return a small fraction of the rows, the optimizer is unlikely to select the columnstore index (for example: needle-in-the-haystack type queries). If the table hint FORCESEEK is used, the optimizer will not consider the columnstore index.

Columnstore indexes cannot be combined with the following features:

  • Page and row compression, and vardecimal storage format (A columnstore index is already compressed in a different format.)

  • Replication

  • Change tracking

  • Change data capture

  • Filestream

Top

Demonstration Example: Columnstore Indexes with a Partitioned Table

The examples in this topic use a partitioned table named FactResellerSalesPtnd created in the AdventureWorksDW2012 sample database. To test a columnstore index in a partitioned table, connect to the AdventureWorksDW2012 database and execute the following code to create a partitioned version of the fact table.

[!UWAGA]

For information about the sample databases and instructions on how to download the database, see AdventureWorks Sample Databases.

Create the FactResellerSalesPtnd table

  • Execute the following code to create a partitioned version of the FactResellerSales table named FactResellerSalesPtnd.

    USE AdventureWorksDW2012;
    GO
    
    CREATE PARTITION FUNCTION [ByOrderDateMonthPF](int) AS RANGE RIGHT 
    FOR VALUES (
        20050701, 20050801, 20050901, 20051001, 20051101, 20051201, 
        20060101, 20060201, 20060301, 20060401, 20060501, 20060601, 
        20060701, 20060801, 20060901, 20061001, 20061101, 20061201, 
        20070101, 20070201, 20070301, 20070401, 20070501, 20070601, 
        20070701, 20070801, 20070901, 20071001, 20071101, 20071201, 
        20080101, 20080201, 20080301, 20080401, 20080501, 20080601, 
        20080701, 20080801, 20080901, 20081001, 20081101, 20081201
    ) 
    GO
    
    CREATE PARTITION SCHEME [ByOrderDateMonthRange] 
    AS PARTITION [ByOrderDateMonthPF] 
    ALL TO ([PRIMARY]) 
    GO
    
    -- Create a partitioned version of the FactResellerSales table
    CREATE TABLE [dbo].[FactResellerSalesPtnd]( 
        [ProductKey] [int] NOT NULL, 
        [OrderDateKey] [int] NOT NULL, 
        [DueDateKey] [int] NOT NULL, 
        [ShipDateKey] [int] NOT NULL, 
        [CustomerKey] [int] NOT NULL, 
        [EmployeeKey] [int] NOT NULL, 
        [PromotionKey] [int] NOT NULL, 
        [CurrencyKey] [int] NOT NULL, 
        [SalesTerritoryKey] [int] NOT NULL, 
        [SalesOrderNumber] [nvarchar](20) NOT NULL, 
        [SalesOrderLineNumber] [tinyint] NOT NULL, 
        [RevisionNumber] [tinyint] NULL, 
        [OrderQuantity] [smallint] NULL, 
        [UnitPrice] [money] NULL, 
        [ExtendedAmount] [money] NULL, 
        [UnitPriceDiscountPct] [float] NULL, 
        [DiscountAmount] [float] NULL, 
        [ProductStandardCost] [money] NULL, 
        [TotalProductCost] [money] NULL, 
        [SalesAmount] [money] NULL, 
        [TaxAmt] [money] NULL, 
        [Freight] [money] NULL, 
        [CarrierTrackingNumber] [nvarchar](25) NULL, 
        [CustomerPONumber] [nvarchar](25) NULL,
        OrderDate [datetime] NULL,
        DueDate [datetime] NULL,
        ShipDate [datetime] NULL
    ) ON ByOrderDateMonthRange(OrderDateKey);
    GO
    
    -- Using simple or bulk logged recovery mode, and then the TABLOCK 
    -- hint on the target table of the INSERT…SELECT is a best practice
    -- because it causes minimal logging and is therefore much faster.
    ALTER DATABASE AdventureWorksDW2012 SET RECOVERY SIMPLE;
    GO
    
    -- Copy the data from the FactResellerSales into the new table
    INSERT INTO dbo.FactResellerSalesPtnd WITH(TABLOCK)
    SELECT * FROM dbo.FactResellerSales;
    GO
    
    -- Create the columnstore index
    CREATE NONCLUSTERED COLUMNSTORE INDEX [csindx_FactResellerSalesPtnd]
    ON [FactResellerSalesPtnd]
    ( 
        [ProductKey], 
        [OrderDateKey], 
        [DueDateKey], 
        [ShipDateKey], 
        [CustomerKey], 
        [EmployeeKey], 
        [PromotionKey], 
        [CurrencyKey], 
        [SalesTerritoryKey], 
        [SalesOrderNumber], 
        [SalesOrderLineNumber], 
        [RevisionNumber], 
        [OrderQuantity], 
        [UnitPrice], 
        [ExtendedAmount], 
        [UnitPriceDiscountPct], 
        [DiscountAmount], 
        [ProductStandardCost], 
        [TotalProductCost], 
        [SalesAmount], 
        [TaxAmt], 
        [Freight], 
        [CarrierTrackingNumber], 
        [CustomerPONumber], 
        [OrderDate],
        [DueDate],
        [ShipDate]
    );
    

Now execute a query that can benefit from the columnstore index and confirm that the columnstore index is used.

Test the Columnstore Index

  1. Press Ctrl+M, or on the Query menu, select Include Actual Execution Plan. This turns on a graphical representation of the actual execution plan used by SQL Server Management Studio.

  2. In the Query Editor window, execute the following query.

    SELECT SalesTerritoryKey, SUM(ExtendedAmount) AS SalesByTerritory
    FROM FactResellerSalesPtnd
    GROUP BY SalesTerritoryKey;
    

    In the Results window, under the Execution plan tab, confirm that the query plan selected an index scan of the nonclustered csindx_FactResellerSalesPtnd index.

    [!UWAGA]

    For more information about graphical showplan icons, see Showplan Logical and Physical Operators Reference.

    Top

Basics: Typical Columnstore Index Scenarios

Star and snowflake database schema are usually found in dimensional data warehouses and data marts where the speed of the data retrieval is more important than the efficiency of data manipulations. Column store technology in SQL Server 2012 can detect and speedup queries targeting star and snowflake schemas.

Examples:

[!UWAGA]

Batch processing might not be used in the following examples because the tables are not large enough. Batch execution mode, such as parallel processing, is only used for more expensive queries.

A: An aggregate query joining two tables

  • Consider a star-join query that calculates how many of product 215 we sold each quarter. The fact table named FactResellerSalesPtnd is partitioned on the OrderDateKey column. One of the dimension tables named DimDate links to the fact table through a primary key – foreign key relationship on the date key.

    SELECT d.CalendarYear,
        d.CalendarQuarter,
        COUNT(*) AS NumberSold
    FROM dbo.FactResellerSalesPtnd AS f
        JOIN dbo.DimDate AS d
        ON f.OrderDateKey = d.DateKey
    WHERE ProductKey = 215
    GROUP BY d.CalendarYear, d.CalendarQuarter
    ORDER BY d.CalendarYear, d.CalendarQuarter;
    

Additionally, there will be queries that may only target a single table. In such cases, SQL Server tries to take advantage of the power of batch execution and column store technology to speed up query execution also.

B: A simple aggregate query on a single table

  • The fact table is FactResellerSalesPtnd and this table is partitioned on the OrderDateKey column. The following query returns the number of rows and the number of orders.

    SELECT COUNT(*) AS NumberOfRows, 
        COUNT(DISTINCT(f.SalesOrderNumber)) AS NumberOfOrders
    FROM dbo.FactResellerSalesPtnd AS f;
    

    For typical data-warehousing scenario queries, a speed increase between 1.5 times and 10 times are typical when columnstore indexes and batch execution mode are used during query execution. For some star join queries, the speed-up is much larger.

Basics: Bitmap Filter Optimizations

In addition to the layout of the data in columnar format, SQL Server uses bitmap filters that are passed to the storage engine for enhancing the performance during query execution. The bitmap filters increase the query execution speed by reducing the number of rows that are included, before any joins are implemented, and therefore reducing the number of rows that are processed by the join operator. The bitmap is created on the build side of a hash-join, but the actual bitmap checks are performed on the probe side of the hash-join. You can observe the use of bitmap filters by using either the graphical or xml execution plan.

Best Practices: Updating Data in a Columnstore Index

Tables that have a columnstore index cannot be updated. There are three ways to work around this problem.

  • To update a table with a columnstore index, drop the columnstore index, perform any required INSERT, DELETE, UPDATE, or MERGE operations, and then rebuild the columnstore index.

  • Partition the table and switch partitions. For a bulk insert, insert data into a staging table, build a columnstore index on the staging table, and then switch the staging table into an empty partition. For other updates, switch a partition out of the main table into a staging table, disable or drop the columnstore index on the staging table, perform the update operations, rebuild or re-create the columnstore index on the staging table, and then switch the staging table back into the main table.

  • Place static data into a main table with a columnstore index, and put new data and recent data likely to change, into a separate table with the same schema that does not have a columnstore index. Apply updates to the table with the most recent data. To query the data, rewrite the query as two queries, one against each table, and then combine the two result sets with UNION ALL. The sub-query against the large main table will benefit from the columnstore index. If the updateable table is much smaller, the lack of the columnstore index will have less effect on performance. While it is also possible to query a view that is the UNION ALL of the two tables, you may not see a clear performance advantage. The performance will depend on the query plan, which will depend on the query, the data, and cardinality estimations. The advantage of using a view is that an INSTEAD OF trigger on the view can divert updates to the table that does not have a columnstore index and the view mechanism would be transparent to the user and to applications. If you use either of these approaches with UNION ALL, test the performance on typical queries and decide whether the convenience of using this approach outweighs any loss of performance benefit.

[!UWAGA]

Do not create a columnstore index as a mechanism to make a table read-only. The restriction on updating tables having a columnstore index is not guaranteed in future releases. When read-only behavior is required it should be enforced by creating a read-only filegroup and moving the table to that filegroup.

Best Practices: Choosing Columns for a Column Store Index

Some of the performance benefit of a columnstore index is derived from the compression techniques that reduce the number of data pages that must be read and manipulated to process the query. Compression works best on character or numeric columns that have large amounts of duplicated values. For example, dimension tables might have columns for postal codes, cities, and sales regions. If many postal codes are located in each city, and if many cities are located in each sales region, then the sales region column would be the most compressed, the city column would have somewhat less compression, and the postal code would have the least compression. Although all columns are good candidates for a columnstore index, adding the sales region code column to the columnstore index will achieve the greatest benefit from columnstore compression, and the postal code will achieve the least.

Top

Best Practices: Partitioned Tables

Columnstore indexes are designed to support queries in very large data warehouse scenarios, where partitioning is common. Partitioning is recommended when the data in a table with a columnstore index must be periodically updated. For more information about how to update partitions of a columnstore index, see the previous section Best Practices: Updating Data in a Columnstore Index.

Top

How to: Create a Columnstore Index

Creating a columnstore index is like creating any other index. You can create a columnstore index by using Transact-SQL or by using SQL Server Management Studio graphical tools.

Creating a columnstore index by using Transact-SQL

  • In a Query Editor window, execute the CREATE COLUMNSTORE INDEX statement. For an example, see Create the FactResellerSalesPtnd table above. For more information, see CREATE COLUMNSTORE INDEX (Transact-SQL).

Creating a columnstore index by using SQL Server Management Studio

  1. Using Management Studio, use Object Explorer to connect to an instance of the SQL Server Database Engine.

  2. In Object Explorer, expand the instance of SQL Server, expand Databases, expand a database, expand a table, right-click a table, point to New Index, and then click Non-Clustered Columnstore Index.

  3. In the Index name dialog box, under the General tab, type a name for the new index, and then click Add.

  4. In the Select Columns dialog box, select the columns to participate in the columnstore index, and then click OK two times to create the index.

How to: Determine the Size of a Columnstore Index

A columnstore index consists of both segments and dictionaries. The following example demonstrates how to determine the total size of a columnstore index (on the FactResellerSalesPtnd table) by combining the on_disk_size columns from sys.column_store_segments and sys.column_store_dictionaries.

SELECT SUM(on_disk_size_MB) AS TotalSizeInMB
FROM
(
   (SELECT SUM(css.on_disk_size)/(1024.0*1024.0) on_disk_size_MB
    FROM sys.indexes AS i
    JOIN sys.partitions AS p
        ON i.object_id = p.object_id 
    JOIN sys.column_store_segments AS css
        ON css.hobt_id = p.hobt_id
    WHERE i.object_id = object_id('FactResellerSalesPtnd') 
    AND i.type_desc = 'NONCLUSTERED COLUMNSTORE') 
  UNION ALL
   (SELECT SUM(csd.on_disk_size)/(1024.0*1024.0) on_disk_size_MB
    FROM sys.indexes AS i
    JOIN sys.partitions AS p
        ON i.object_id = p.object_id 
    JOIN sys.column_store_dictionaries AS csd
        ON csd.hobt_id = p.hobt_id
    WHERE i.object_id = object_id('FactResellerSalesPtnd') 
    AND i.type_desc = 'NONCLUSTERED COLUMNSTORE') 
) AS SegmentsPlusDictionary

How To: Troubleshoot the Performance of a Column Store Index

To determine whether a columnstore index is being used, examine the query execution plan. Three elements are present when the maximum benefit is achieved.

  • The columnstore index is present in the query execution plan.

    Skanowanie indeksu magazynu kolumn

    columnstore index scan operator icon

    If the columnstore index is not being used and you believe that the columnstore could benefit the query, evaluate the query performance while forcing the use of the columnstore index by using the WITH (INDEX(<indexname>)) hint. The following example demonstrates a query with an index hint.

    SELECT d.CalendarYear,
        d.CalendarQuarter,
        COUNT(*) AS NumberSold
    FROM dbo.FactResellerSalesPtnd AS f WITH (INDEX(csindx_FactResellerSalesPtnd))
        JOIN dbo.DimDate AS d
        ON f.OrderDateKey = d.DateKey
    WHERE ProductKey = 215
    GROUP BY d.CalendarYear, d.CalendarQuarter
    ORDER BY d.CalendarYear, d.CalendarQuarter;
    
  • When you move the pointer over the columnstore index icon in the graphical query plan the actual execution mode is listed as batch instead of row.

  • A bitmap physical operator icon is present in the graphical execution plan indicating that bitmap filter is reducing the number of rows before a join operation.

    Ikona operatora mapy bitowej

    bitmap operator icon

Top

CREATE COLUMNSTORE INDEX (Transact-SQL)

sys.column_store_dictionaries (Transact-SQL)

sys.column_store_segments (Transact-SQL)