Freigeben über


Extreme 25x compression of JSON data using CLUSTERED COLUMNSTORE INDEXES

CLUSTERED COLUMNSTORE INDEXES (CCI) provide extreme data compression. In Azure SQL Database and SQL Server vNext you can create CCI on tables with NVARCHAR(MAX) columns. Since JSON is stored as NVARCHAR type, now you can store huge volumes of JSON data in tables with CCI. In this post, I will show you how you can get 25x compression on a table that contains JSON/NVARCHAR(MAX) column using CCI.

In this experiment, I will use publicly available ContosoDW database where we have FactOnlineSales table with 12 million records. This fact table is related to a number of dimension tables such as DimStore, DimCustomer, DimProduct, etc., as it is shown on the following figure:

contosodwsales

 

Imagine that all these related tables are stored in NoSQL-style as a single JSON document. In that case we would have a single table that will have some Data NVARCHAR(MAX) column where we would store data from all related tables as JSON text - something like:

 DROP TABLE IF EXISTS SalesRecords
GO
CREATE TABLE SalesRecords(
 OrderNumber nvarchar(20) not null,
 ProductKey int not null,
 StoreKey int not null,
 Date datetime not null,
 CustomerKey int not null,
 Data nvarchar(max) not null,
 INDEX cci CLUSTERED COLUMNSTORE
)

In this example, I will keep int key columns as separate columns and store all other columns from FactOnlineSales table and all columns from the related tables in a single Data column. The query that will de-normalize all related dimensions into a single column is at the end of this post, and it looks like:

 INSERT INTO SalesRecords(OrderNumber, StoreKey, ProductKey, Date, CustomerKey, Data)
SELECT FactOnlineSales.SalesOrderNumber, FactOnlineSales.StoreKey, FactOnlineSales.ProductKey, FactOnlineSales.DateKey, FactOnlineSales.CustomerKey,
 (SELECT ... FROM DimensionTables FOR JSON PATH) as Data

I'm joining all related dimension tables, format them as JSON text using FOR JSON clause and loading everything into SalesRecords table. This query will populate a table with CCI index.

I will also create a copy of this table but without CCI (plain heap table) using the following query:

 select *
 into SalesRecordsRS
 from SalesRecords

Now, I will compare sizes of the table with CCI and the table without CCI using the following query:

 exec sp_spaceused 'SalesRecordsRS'
exec sp_spaceused 'SalesRecords'

The results of these queries are shown below:

contosodwsales-space-used

A table without CCI has 101.020.920 KB, while the table with CCI has only 4.047.128 KB in data column. With CCI we can compress 100GB table to 4GB with 24.96 compression ratio!

Compression is not important only for storage savings. The following query on a table with CCI this query is executed in 46 seconds, while on a heap table execution takes 13 min 45 seconds.

 select min(datalength(data)), avg(datalength(data)), max(datalength(data))
 from SalesRecords

Smaller disk io and batch execution provided by CCI enables you to run 18x faster queries.

Conclusion

CLUSTERED COLUMNSTORE INDEXES provide extreme data compression in SQL Server and Azure SQL Database. With NVARCHAR(MAX) support in CCI indexes you can use them on your JSON data stored is database and get high 25x compression. Therefore, CCI is a perfect solution if you need to store a large volume of JSON data in your SQL Database.

ContosoDW database is publicly available for download, so you can use this database and the script below to re-create this table and try this in your environment.

SQL Script used to populate the table is:

 INSERT INTO SalesRecords(OrderNumber, StoreKey, ProductKey, Date, CustomerKey, Data)
SELECT FactOnlineSales.SalesOrderNumber, FactOnlineSales.StoreKey, FactOnlineSales.ProductKey, FactOnlineSales.DateKey, FactOnlineSales.CustomerKey,
 (SELECT FactOnlineSales.PromotionKey, 
 FactOnlineSales.SalesOrderLineNumber, FactOnlineSales.SalesQuantity, FactOnlineSales.SalesAmount, FactOnlineSales.CurrencyKey, 
 FactOnlineSales.ReturnQuantity, FactOnlineSales.ReturnAmount, FactOnlineSales.DiscountQuantity, FactOnlineSales.DiscountAmount, FactOnlineSales.TotalCost, 
 FactOnlineSales.UnitCost, FactOnlineSales.UnitPrice,
 DimProduct.ProductName AS [Product.Name], DimProduct.ProductDescription AS [Product.Description], DimProduct.Manufacturer AS [Product.Manufacturer], 
 DimProduct.BrandName AS [Product.Brand], DimProduct.ClassName AS [Product.Class], DimProduct.StyleName AS [Product.Style], 
 DimProduct.ColorName AS [Product.Color], DimProduct.Size AS [Product.Size], DimProduct.SizeRange AS [Product.SizeRange], 
 DimProduct.Weight AS [Product.Weight], DimProduct.UnitCost AS [Product.UnitCost], DimProduct.UnitPrice AS [Product.UnitPrice], 
 DimProduct.ImageURL AS [Product.ImageURL], DimProduct.ProductURL AS [Product.URL], 
 DimProductSubcategory.ProductSubcategoryLabel AS [Product.SubcategoryLabel], DimProductSubcategory.ProductSubcategoryName AS [Product.SubcategoryName], 
 DimProductSubcategory.ProductSubcategoryDescription AS [Product.SubcategoryDescription], DimProductCategory.ProductCategoryLabel AS [Product.CategoryLabel], 
 DimProductCategory.ProductCategoryName AS [Product.CategoryName], DimProductCategory.ProductCategoryDescription AS [Product.CategoryDescription], 
 DimCustomer.CustomerLabel AS [Customer.Label], DimCustomer.Title AS [Customer.Title], DimCustomer.FirstName AS [Customer.FirstName], 
 DimCustomer.MiddleName AS [Customer.MiddleName], DimCustomer.LastName AS [Customer.LastName], DimCustomer.NameStyle AS [Customer.NameStyle], 
 DimCustomer.BirthDate AS [Customer.BirthDate], DimCustomer.MaritalStatus AS [Customer.MaritalStatus], DimCustomer.Suffix AS [Customer.Suffix], 
 DimCustomer.Gender AS [Customer.Gender], DimCustomer.EmailAddress AS [Customer.EmailAddress], DimCustomer.YearlyIncome AS [Customer.YearlyIncome], 
 DimCustomer.TotalChildren AS [Customer.TotalChildren], DimCustomer.NumberChildrenAtHome AS [Customer.NumberChildrenAtHome], 
 DimCustomer.Education AS [Customer.Education], DimCustomer.Occupation AS [Customer.Occupation], 
 DimCustomer.HouseOwnerFlag AS [Customer.HouseOwnerFlag], DimCustomer.AddressLine1 AS [Customer.AddressLine1], 
 DimCustomer.NumberCarsOwned AS [Customer.NumberCarsOwned], DimCustomer.AddressLine2 AS [Customer.AddressLine2], 
 DimCustomer.Phone AS [Customer.Phone], DimCustomer.CompanyName AS [Customer.CompanyName], DimGeography_1.CityName AS [Customer.CityName], 
 DimGeography_1.StateProvinceName AS [Customer.StateProvinceName], DimGeography_1.RegionCountryName AS [Customer.RegionCountryName], 
 DimGeography_1.ContinentName AS [Customer.ContinentName], DimGeography_1.GeographyType AS [Customer.GeographyType], 
 JSON_QUERY(CONCAT('{"type": "Feature","geometry": {"type": "Point","coordinates": [',DimGeography_1.Geometry.STX,',', DimGeography_1.Geometry.STY,']}}')) AS [Customer.Geometry],
 DimCurrency.CurrencyName AS [Currency.Name], DimCurrency.CurrencyDescription AS [Currency.Description], 
 DimCurrency.CurrencyLabel AS [Currency.Label], DimPromotion.PromotionLabel AS [Promotion.Label], DimPromotion.PromotionName AS [Promotion.Name], 
 DimPromotion.PromotionDescription AS [Promotion.Description], DimPromotion.DiscountPercent AS [Promotion.DiscountPercent], 
 DimPromotion.PromotionType AS [Promotion.Type], DimPromotion.PromotionCategory AS [Promotion.Category], DimPromotion.StartDate AS [Promotion.StartDate], 
 DimPromotion.EndDate AS [Promotion.EndDate], DimPromotion.MinQuantity AS [Promotion.MinQuantity], DimPromotion.MaxQuantity AS [Promotion.MaxQuantity], 
 DimStore.StoreName AS [Store.Name], DimStore.StoreDescription AS [Store.Description], DimStore.StoreManager AS [Store.Manager], 
 DimStore.StoreType AS [Store.Type], DimStore.Status AS [Store.Status], DimStore.OpenDate AS [Store.OpenDate], DimStore.CloseDate AS [Store.CloseDate], 
 DimStore.ZipCode AS [Store.ZipCode], DimStore.ZipCodeExtension AS [Store.ZipCodeExtension], DimStore.StorePhone AS [Store.Phone], 
 DimStore.StoreFax AS [Store.Fax], DimStore.AddressLine1 AS [Store.AddressLine1], DimStore.AddressLine2 AS [Store.AddressLine2], 
 JSON_QUERY(CONCAT('{"type": "Feature","geometry": {"type": "Point","coordinates": [',DimStore.Geometry.STX,',', DimStore.Geometry.STY,']}}')) AS [Store.Geometry], 
 JSON_QUERY(CONCAT('{"type": "Feature","geometry": {"type": "Point","coordinates": [',DimStore.GeoLocation.Lat,',', DimStore.GeoLocation.Long,']}}')) AS [Store.GeoLocation], 
 DimGeography.CityName AS [Store.CityName], 
 DimGeography.StateProvinceName AS [Store.StateProvinceName], DimGeography.RegionCountryName AS [Store.RegionCountryName], 
 DimGeography.ContinentName AS [Store.ContinentName], DimGeography.GeographyType AS [Store.GeographyType],
 JSON_QUERY(CONCAT('{"type": "Feature","geometry": {"type": "Point","coordinates": [',DimGeography.Geometry.STX,',', DimGeography.Geometry.STY,']}}')) AS [Store.Geo.Location], 
 DimGeography.GeographyKey AS [Store.GeographyKey], DimEntity.EntityLabel AS [Store.Entity.Label], 
 DimEntity.EntityName AS [Store.Entity.Name], DimEntity.EntityDescription AS [Store.Entity.Description], DimEntity.EntityType AS [Store.Entity.Type], 
 DimEntity.Status AS [Store.Entity.Status], DimDate.FullDateLabel AS [Date.FullDateLabel], DimDate.DateDescription AS [Date.DateDescription], 
 DimDate.CalendarYear AS [Date.CalendarYear], DimDate.CalendarMonthLabel AS [Date.CalendarMonthLabel], DimDate.FiscalYear AS [Date.FiscalYear], 
 DimDate.FiscalMonth AS [Date.FiscalMonth], DimDate.FiscalYearLabel AS [Date.FiscalYearLabel], DimDate.CalendarYearLabel AS [Date.CalendarYearLabel], 
 DimDate.CalendarHalfYear AS [Date.CalendarHalfYear], DimDate.CalendarHalfYearLabel AS [Date.CalendarHalfYearLabel], DimDate.Datekey AS [Date.Datekey], 
 DimDate.CalendarQuarter AS [Date.CalendarQuarter], DimDate.CalendarQuarterLabel AS [Date.CalendarQuarterLabel], 
 DimDate.CalendarMonth AS [Date.CalendarMonth], DimDate.CalendarWeek AS [Date.CalendarWeek], DimDate.CalendarWeekLabel AS [Date.CalendarWeekLabel], 
 DimDate.CalendarDayOfWeekLabel AS [Date.CalendarDayOfWeekLabel], DimDate.CalendarDayOfWeek AS [Date.CalendarDayOfWeek], 
 DimDate.FiscalHalfYear AS [Date.FiscalHalfYear], DimDate.FiscalHalfYearLabel AS [Date.FiscalHalfYearLabel], DimDate.FiscalQuarter AS [Date.FiscalQuarter], 
 DimDate.FiscalQuarterLabel AS [Date.FiscalQuarterLabel], DimDate.FiscalMonthLabel AS [Date.FiscalMonthLabel]
 FOR JSON PATH, WITHOUT_ARRAY_WRAPPER) as Data
FROM FactOnlineSales INNER JOIN
 DimDate ON FactOnlineSales.DateKey = DimDate.Datekey INNER JOIN
 DimStore ON FactOnlineSales.StoreKey = DimStore.StoreKey INNER JOIN
 DimProduct ON FactOnlineSales.ProductKey = DimProduct.ProductKey INNER JOIN
 DimPromotion ON FactOnlineSales.PromotionKey = DimPromotion.PromotionKey INNER JOIN
 DimCurrency ON FactOnlineSales.CurrencyKey = DimCurrency.CurrencyKey INNER JOIN
 DimCustomer ON FactOnlineSales.CustomerKey = DimCustomer.CustomerKey INNER JOIN
 DimGeography ON DimStore.GeographyKey = DimGeography.GeographyKey INNER JOIN
 DimProductSubcategory ON DimProduct.ProductSubcategoryKey = DimProductSubcategory.ProductSubcategoryKey INNER JOIN
 DimProductCategory ON DimProductSubcategory.ProductCategoryKey = DimProductCategory.ProductCategoryKey INNER JOIN
 DimGeography DimGeography_1 ON DimCustomer.GeographyKey = DimGeography_1.GeographyKey INNER JOIN
 DimEntity ON DimStore.EntityKey = DimEntity.EntityKey

Comments

  • Anonymous
    February 11, 2017
    Hi there,I'm using SQL Server 2016 Dev. SP1You can't create nvarchar(max) in cci (I changed compatibility level to 2016) - you will get an error.https://msdn.microsoft.com/en-us/library/gg492153.aspxThanks
    • Anonymous
      February 14, 2017
      From above: "In Azure SQL Database and SQL Server vNext you can create CCI on tables with NVARCHAR(MAX) columns". 2016 SP1 is not mentioned.
    • Anonymous
      February 14, 2017
      Hi, LOBS in CCI are supported in SQL Server vNext and Azure SQL Database. This feature is not available in SQL Server 2016.
      • Anonymous
        February 18, 2017
        Thank you