SQL Data Warehousing Tips 1 - Building a Period table
Recently, I started a personal project for analyzing stock market trends and calculating the optimal trading strategies. I've learned a lot of new stuff about LINQ and SQL 2008 from this. Based on that I'm opening a new series on data warehouse tips.
In my project, one of the things I quickly realized after calculating that I would need to generate about 1 billion rows in order to have sufficient detail for analysis of 5,000 publicly traded stocks with 1 year of history was that I needed to consolidate the data. It was obvious I needed a way to summarize data into periods and use those as dimensions for analyzing the performance, rather than calculating all of this from the detailed data.
What would be nice, would be to have periods that could overlap and align. For example, I want to be able to go back and look at last month, last quarter, last 6 months, and last year. However each one of these requires a different set of transactions because even though the closing date is the same, the opening date is different. The variation in open date affects the price at which a stock would be purchased at (or short-sold at), thus leading to a different set of transactions based on an entry/exit strategy. In order to focus on the technical aspect of this, I won't get into all of the application design considerations. Suffice to say, that discrete, yet aligned periods were needed.
So I immediately wrote a stored procedure (final version below) to try to insert my periods and decided to leverage the merge capability.
Below is the table definition of the Period Table
CREATE TABLE [dbo].[Period](
[PeriodId] [smallint] NOT NULL,
[StartDate] [date] NOT NULL,
[EndDate] [date] NOT NULL,
[MonthsInPeriod] AS (datediff(month,[StartDate],dateadd(day,(1),[EndDate]))),
CONSTRAINT [PK_Period] PRIMARY KEY CLUSTERED
(
[PeriodId] ASC
)
) ON [PRIMARY]
Note, that we can calculate the number of months in the period (assumes all periods are in 1 month increments) easily using a computed field with datediff
And the Merging (upsert) stored procedure
CREATE PROCEDURE [dbo].[UpsertMonthlyPeriod]
@PeriodDate date,
@Months int
AS BEGIN
DECLARE @EndDate date
SET @EndDate = DATEADD(DD,-1,DATEADD(MM,@Months, @PeriodDate))
MERGE INTO Period AS Target
USING (VALUES (@PeriodDate, @EndDate)) AS Source (NewStartDate, NewEndDate)
ON Target.StartDate = Source.NewStartDate
AND Target.EndDate =Source.NewEndDate
WHEN NOT MATCHED BY TARGET THEN
INSERT (StartDate, EndDate)
VALUES (NewStartDate, NewEndDate);
END
This is very simple example of the new Merge functionality. The USING portion identifies the values to use for mapping the source data to the target, which in this case was simply start-date and end-date (remember I am basically allowing overlapping periods for analysis purposes).
The below stored procedure then actually generates the periods by calling the Upsert stored proc:
CREATE PROCEDURE [dbo].[SetupPeriods]
@StartDate date,
@Periods int,
@CreateCummulative bit
AS BEGIN
DECLARE @x int = 0
DECLARE @PeriodDate date
SET @PeriodDate = @StartDate
WHILE @x < @Periods
BEGIN
-- Create Monthly periods
DECLARE @PeriodMonths int
EXEC dbo.UpsertMonthlyPeriod @PeriodDate = @PeriodDate, @Months=1
-- Create Quarerly periods
IF @x % 3 = 0 OR @CreateCummulative =1
BEGIN
EXEC dbo.UpsertMonthlyPeriod @PeriodDate = @PeriodDate, @Months=3
END
-- Create Bi-Annual periods
IF @x % 6 = 0 OR @CreateCummulative =1
BEGIN
EXEC dbo.UpsertMonthlyPeriod @PeriodDate = @PeriodDate, @Months=6
END
-- Create Annual periods
IF @x % 12 = 0 OR @CreateCummulative =1
BEGIN
EXEC dbo.UpsertMonthlyPeriod @PeriodDate = @PeriodDate, @Months=12
END
SET @x = @x+1
SET @PeriodDate = DATEADD(MM,@x,@StartDate)
END
END
The end result is the data shown below:
PeriodId StartDate EndDate MonthsInPeriod
8041 2008-04-01 2008-04-30 1
8043 2008-04-01 2008-06-30 3
8046 2008-04-01 2008-09-30 6
8040 2008-04-01 2009-03-31 12
8051 2008-05-01 2008-05-31 1
8053 2008-05-01 2008-07-31 3
8056 2008-05-01 2008-10-31 6
8050 2008-05-01 2009-04-30 12
8061 2008-06-01 2008-06-30 1
8063 2008-06-01 2008-08-31 3
8066 2008-06-01 2008-11-30 6
8071 2008-07-01 2008-07-31 1
8073 2008-07-01 2008-09-30 3
8076 2008-07-01 2008-12-31 6
8081 2008-08-01 2008-08-31 1
8083 2008-08-01 2008-10-31 3
8086 2008-08-01 2009-01-31 6
8091 2008-09-01 2008-09-30 1
8093 2008-09-01 2008-11-30 3
8096 2008-09-01 2009-02-28 6
8101 2008-10-01 2008-10-31 1
8103 2008-10-01 2008-12-31 3
8106 2008-10-01 2009-03-31 6
8111 2008-11-01 2008-11-30 1
8113 2008-11-01 2009-01-31 3
8116 2008-11-01 2009-04-30 6
8121 2008-12-01 2008-12-31 1
8123 2008-12-01 2009-02-28 3
9011 2009-01-01 2009-01-31 1
9013 2009-01-01 2009-03-31 3
9021 2009-02-01 2009-02-28 1
9023 2009-02-01 2009-04-30 3
9031 2009-03-01 2009-03-31 1
9041 2009-04-01 2009-04-30 1
...
10011 2010-01-01...
..
But, wait! Where did the PeriodId come from? It's not an identity field and the numbers appear to follow a pattern.
Can you guess the pattern?
...
Yes, that's right - last 2 digits of year + 2 digit month + length of period (except that 12 month period is considered "0"). This was so that it could still fit in smallint field. I've got millions of rows linking back to the period and want to keep the index small, so didn't want the extra digit.
The trick for this was an insert/update trigger shown below:
CREATE TRIGGER [dbo].[Period_IU_Trig]
ON [dbo].[Period]
AFTER INSERT, UPDATE
AS
BEGIN
-- Set Period Id to be meaningful (YY + MM + Period Length)
SET NOCOUNT ON;
UPDATE Period
SET PeriodId = CAST(CAST(DATEPART(YY,INSERTED.StartDate) - 2000 AS VARCHAR(2)) +
CAST(RIGHT(100 + (DATEPART(MM,INSERTED.StartDate)),2) AS VARCHAR(2)) +
CAST((CASE WHEN INSERTED.MonthsInPeriod = 12 THEN 0 ELSE INSERTED.MonthsInPeriod END)AS CHAR(1)) AS SMALLINT)
FROM INSERTED
WHERE INSERTED.PeriodId = Period.PeriodId
END
Note, I normally don't advocate having "meaningful" surrogate keys (an oxymoron), but in this case, its very handy in testing to quickly identify the period without going back to the source table.
To present this to the users, I provide a view that formats the dates to show the date range in descending order of the period along with the length in months. For this scenario, this is easier to work with than a calendar control and works great for a drop down as you can see below
The stored procedure for formatting this is shown below:
CREATE procedure [Reports].[SelectCurrentPeriod]
AS
SELECT TOP (100) PERCENT PeriodId,
CAST(StartDate AS VARCHAR(10)) + ' To '
+ CAST(EndDate AS VARCHAR(10))
+ ' ('
+ CAST(MonthsInPeriod AS VARCHAR(2))
+ ' Months)' as PeriodDesc
FROM dbo.Period
WHERE (EndDate <
(SELECT MAX(LoadDate) AS Expr1
FROM dbo.LoadHistory))
There you have it, a period table and some supporting procedures. For our next tip, we'll look at how the period table actually comes into play when linking up with the data.