Поделиться через


Top 10 Best Practices for Building a Large Scale Relational Data Warehouse

As of September 1, 2013 we decided to remove SQLCAT.COM site and use MSDN as the primary vehicle to post new SQL Server content. This was done to minimize reader confusion and to streamline content publication. MSDN SQLCAT blogs already includes most SQLCAT.COM Content and will continue to be updated with more SQLCAT learnings. You can also find a collection of our work in SQLCAT Guidance eBooks.

To make the transition to MSDN smoother, we are in the process of reposting a few of SQLCAT.COM’s blogs that are still being very actively viewed.  Following is reposting of one of the SQLCAT.Com blogs that still draws huge interest. You can find this, and other guidance in the SQLCAT's Guide to Relational Engine eBook.

Also, Follow us on Twitter as we normally use our Twitter handles @SQLCAT and @WinAzureCAT to announce news and new content.  

Top 10 Best Practices for Building a Large Scale Relational Data Warehouse

 

Building a large scale relational data warehouse is a complex task. This article describes some design techniques that can help in architecting an efficient large scale relational data warehouse with SQL Server. Most large scale data warehouses use table and index partitioning, and therefore, many of the recommendations here involve partitioning. Most of these tips are based on experiences building large data warehouses on SQL Server 2005.

 

1 - Consider partitioning large fact tables 

  • Consider partitioning fact tables that are 50 to 100GB or larger.
  • Partitioning can provide manageability and often performance benefits.
    • Faster, more granular index maintenance.
    • More flexible backup / restore options.
    • Faster data loading and deleting
  • Faster queries when restricted to a single partition..
  • Typically partition the fact table on the date key.
    • Enables sliding window.
  • Enables partition elimination.

 

2- Build clustered index on the date key of the fact table

  • This supports efficient queries to populate cubes or retrieve a historical data slice.
  • If you load data in a batch window then use the options ALLOW_ROW_LOCKS = OFF and ALLOW_PAGE_LOCKS = OFF for the clustered index on the fact table. This helps speed up table scan operations during query time and helps avoid excessive locking activity during large updates.
  • Build nonclustered indexes for each foreign key. This helps ‘pinpoint queries' to extract rows based on a selective dimension predicate.Use filegroups for administration requirements such as backup / restore, partial database availability, etc.

 

3 - Choose partition grain carefully

  • Most customers use month, quarter, or year.
  • For efficient deletes, you must delete one full partition at a time.
  • It is faster to load a complete partition at a time.
    • Daily partitions for daily loads may be an attractive option.
    • However, keep in mind that a table can have a maximum of 1000 partitions.
  • Partition grain affects query parallelism.
    • For SQL Server 2005:
      • Queries touching a single partition can parallelize up to MAXDOP (maximum degree of parallelism).
      • Queries touching multiple partitions use one thread per partition up to MAXDOP.
    • For SQL Server 2008:
      • Parallel threads up to MAXDOP are distributed proportionally to scan partitions, and multiple threads per partition may be used even when several partitions must be scanned.
  • Avoid a partition design where only 2 or 3 partitions are touched by frequent queries, if you need MAXDOP parallelism (assuming MAXDOP =4 or larger). 


4 - Design dimension tables appropriately

  • Use integer surrogate keys for all dimensions, other than the Date dimension. Use the smallest possible integer for the dimension surrogate keys. This helps to keep fact table narrow.
  • Use a meaningful date key of integer type derivable from the DATETIME data type (for example: 20060215).
    • Don't use a surrogate Key for the Date dimension
    • Easy to write queries that put a WHERE clause on this column, which will allow partition elimination of the fact table.
  • Build a clustered index on the surrogate key for each dimension table, and build a non-clustered index on the Business Key (potentially combined with a row-effective-date) to support surrogate key lookups during loads.
  • Build nonclustered indexes on other frequently searched dimension columns.
  • Avoid partitioning dimension tables.
  • Avoid enforcing foreign key relationships between the fact and the dimension tables, to allow faster data loads. You can create foreign key constraints with NOCHECK to document the relationships; but don’t enforce them. Ensure data integrity though Transform Lookups, or perform the data integrity checks at the source of the data.


5 - Write effective queries for partition elimination

  •  Whenever possible, place a query predicate (WHERE condition) directly on the partitioning key (Date dimension key) of the fact table.

 

6 - Use Sliding Window technique to maintain data

  • Maintain a rolling time window for online access to the fact tables. Load newest data, unload oldest data. 
  • Always keep empty partitions at both ends of the partition range to guarantee that the partition split (before loading new data) and partition merge (after unloading old data) do not incur any data movement. 
  • Avoid split or merge of populated partitions. Splitting or merging populated partitions can be extremely inefficient, as this may cause as much as 4 times more log generation, and also cause severe locking. 
  • Create the load staging table in the same filegroup as the partition you are loading. 
  • Create the unload staging table in the same filegroup as the partition you are deleteing. 
  • It is fastest to load newest full partition at one time, but only possible when partition size is equal to the data load frequency (for example, you have one partition per day, and you load data once per day).
  • If the partition size doesn't match the data load frequency, incrementally load the latest partition. 
  • Various options for loading bulk data into a partitioned table are discussed in the whitepaper https://www.microsoft.com/technet/prodtechnol/sql/bestpractice/loading_bulk_data_partitioned_table.mspx.
  • Always unload one partition at a time.

 

7- Efficiently load the initial data

  • Use SIMPLE or BULK LOGGED recovery model during the initial data load.
  • Create the partitioned fact table with the Clustered index.
  • Create non-indexed staging tables for each partition, and separate source data files for populating each partition.
  • Populate the staging tables in parallel.
    • Use multiple BULK INSERT, BCP or SSIS tasks.
      • Create as many load scripts to run in parallel as there are CPUs, if there is no IO bottleneck. If IO bandwidth is limited, use fewer scripts in parallel.
      • Use 0 batch size in the load.
      • Use 0 commit size in the load.
      • Use TABLOCK.
      • Use BULK INSERT if the sources are flat files on the same server. Use BCP or SSIS if data is being pushed from remote machines.
  • Build a clustered index on each staging table, then create appropriate CHECK constraints.
  • SWITCH all partitions into the partitioned table.
  • Build nonclustered indexes on the partitioned table.
  • Possible to load 1 TB in under an hour on a 64-CPU server with a SAN capable of 14 GB/Sec throughput (non-indexed table). Refer to SQLCAT blog entry https://blogs.msdn.com/sqlcat/archive/2006/05/19/602142.aspx for details.


8 - Efficiently delete old data

  • Use partition switching whenever possible.

  • To delete millions of rows from nonpartitioned, indexed tables

    • Avoid DELETE FROM ...WHERE ...
      • Huge locking and logging issues
      • Long rollback if the delete is canceled
    • Usually faster to
      • INSERT the records to keep into a non-indexed table
      • Create index(es) on the table
      • Rename the new table to replace the original
  • As an alternative, ‘trickle' deletes using the following repeatedly in a loop

    DELETE TOP (1000) ... ;

COMMIT

  • Another alternative is to update the row to mark as deleted, then delete later during non critical time.

 

9 - Manage statistics manually

  • Statistics on partitioned tables are maintained for the table as a whole.
  •  Manually update statistics on large fact tables after loading new data.
  •  Manually update statistics after rebuilding index on a partition.
  •  If you regularly update statistics after periodic loads, you may turn off autostats on that table.
  •  This is important for optimizing queries that may need to read only the newest data.
  •  Updating statistics on small dimension tables after incremental loads may also help performance. Use FULLSCAN option on update statistics on dimension tables for more accurate query plans.

 

10 -  Consider efficient backup strategies

  • Backing up the entire database may take significant amount of time for a very large database.
    • For example, backing up a 2 TB database to a 10-spindle RAID-5 disk on a SAN may take 2 hours (at the rate 275 MB/sec).
  • Snapshot backup using SAN technology is a very good option.
  • Reduce the volume of data to backup regularly.
    • The filegroups for the historical partitions can be marked as READ ONLY.
    • Perform a filegroup backup once when a filegroup becomes read-only.
    • Perform regular backups only on the read / write filegroups.
  • Note that RESTOREs of the read-only filegroups cannot be performed in parallel.

Comments

  • Anonymous
    October 04, 2013
    tks, very good

  • Anonymous
    October 07, 2013
    Something else I'd like to add:  instead of deleting directly from a partitioned table, SWITCH the partition with an empty single-partition-sized table and then safely TRUNCATE the latter.

    • Anonymous
      April 22, 2017
      Unless you have FK's that might prevent you from doing so
  • Anonymous
    November 21, 2013
    Awesome information! Thanks for the share.

  • Anonymous
    March 21, 2014
    Why not use the date data type directly rather than an integer surrogate key? What disadvantage am I missing?

  • Anonymous
    May 20, 2014
    @Martin Smith Star schema optimisations and possibly, depending on type choosen, size.

  • Anonymous
    June 30, 2014
    How does this change for 2012 / 2014?  I notice it's a little out of date  (like only supporting 1000 partitions)

  • Anonymous
    July 03, 2014
    Star join optimisation works fine with date datatype key.

  • Anonymous
    August 27, 2014
    @Martin Smith, the date dimension table usually contains a number of extra "derived" fields that enable efficient grouping and aggregating. For example, each date in the date dim will have quarter, week number, month etc that can be used directly without the need to use functions (which slow the queries down).

  • Anonymous
    October 01, 2014
    Thanks for this collection of points which I appreciate, however I beg to differ on Build clustered index on the date key of the fact table since this is not a unique key on a Fact Table for instance you could have many records having the same DateKey? Moreover in most of the datamarts I have used or created, the datekeys are foreignKeys to the Date or time dimension. I would much appreciate if you could please clarify. Thanks my email is roy.singer4@gmail.com

    • Anonymous
      June 09, 2016
      Building clustered index on the date key of the fact table is a good choice/common practice. Roy: clustered index does not have to be unique. Clustered index is used in physical location/position of the records in the table which will be used mostly for easy/faster retrieval when date is used in filtering criteria especially in table partitioned with date clustered index.
    • Anonymous
      January 21, 2017
      A clustered index does not need to be unique, but you can use primary key clustered indexes on the fact table, which are unique, by using the main date key that's always used to isolate records plus another column or set of columns to make it unique. For instance, use the following pseudo code pattern: "CREATE TABLE dbo.order_facts(order_fact_id bigint identity(1,1) not null, order_date_key int not null, ..., primary key clustered (order_date_key, order_fact_id));" At the very least, this has the benefit to logically group the underlying data pages together for each date leading to very fast pinpointing (i.e., table seek) of the main batch of records commonly consumed. You can still manage foreign keys on any column (i.e., including order_date_key in the prior example) if you choose to employ them, though it can be tedious for common fact-table loading scenarios.
  • Anonymous
    November 26, 2014
    @ROY I think we should include DateKey of fact table in the definition of composite key along with the dimension key of the lowest level grain in that fact table.

  • Anonymous
    April 06, 2015
    Is this accurate? "...to allow faster data loads. You can create foreign key constraints with NOCHECK to document the relationships; but don’t enforce them". Foreign key constraints created with NOCHECK will enforce referential integrity from the moment that they are created, they just won't check existing values for referential integrity. So a data load running against an FK with NOCHECK will be subject to the same referential integrity checks as an FK created without NOCHECK. In order to circumvent the overhead of referential integrity checking during a data load, surely the FK would need to be disabled entirely. But if you're going to disable a FK when any data is added or modified, and also re-enable it afterwards without checking the integrity of existing values, then you've got to wonder what is the point of having it in the first place. From BOL:- "When a FOREIGN KEY constraint is added to an existing column or columns in the table, by default, the Database Engine examines the existing data in the columns to make sure that all values, except NULL, exist in the columns of the referenced PRIMARY KEY or UNIQUE constraint. However, by specifying WITH NOCHECK, the Database Engine can be prevented from checking the data in the column against the new constraint and made to add the new constraint regardless of the data in the column. The WITH NOCHECK option is useful when the existing data already meets the new FOREIGN KEY constraint, or when a business rule requires the constraint to be enforced only from this point forward".

  • Anonymous
    May 06, 2015
    Good set of questions, helped in understanding improve ETL performance. Thanks for this. Regards, Nilesh.