Compartilhar via


Re-summarizing data using UNPIVOT, ROW_NUMBER and CTEs–Denali Style

 

Last month I did a write up on a common scenario that I have seen with my customers where data needs to be re-summarized and the customer would like to avoid writing data into intermediate tables. For example, they may receive a flat file from an external source that has sales amount by month – but instead of each row containing a month’s worth of data, each row contains a year’s worth of data with one month in each column.

The first portion of the post we talked about unpivoting the data. This can be done in many ways – the most common using either SSIS or T-SQL’s unpivot function. This would be similar in SQL Server Denali.

What has changed is our approach to re-summarizing the data. What if we wanted to look at YTD sales or 3 Month rolling totals. SQL Server has expanded the capabilities of the window functions beyond the partition and order clauses that we had in SQL Server 2008 R2.

To start with, create a table and populate that table with data as we did in the previous post.

In SQL Server 2008 R2 we created a CTE that unpivoted the data, and then we joined that CTE on itself multiple times to create the offsets that we needed. This works OK for small-medium sized sets of data but we really need to do a lot of work to get these results.

image

This is a bit of an eye-chart. The intent here is not to go over every step of the execution plan but to show that there is a LOT of work that needs to take place to generate the results we need. Specifically, the base table is accessed, filtered and aggregated 5 times.

So, what improvements have been made in the next release of SQL Server and how is it going to help us here?

The next version of SQL Server includes a “row or range” clause to the “over” clause of a select statement.

The new statement looks like this:

WITH unpivoted_cte as (
SELECT VendorId = VendorId
, MonthAbbr = MonthAbbr
, Amount = Amount
, MonthId = row_number() over(partition by VendorId order by MonthAbbr)
FROM
(SELECT VendorId
, _01_Jan = Jan
, _02_Feb = Feb
, _03_Mar = Mar
, _04_Apr = Apr
, _05_May = May
, _06_Jun = Jun
, _07_Jul = Jul
, _08_Aug = Aug
, _09_Sep = Sep
, _10_Oct = Oct
, _11_Nov = Nov
, _12_Dec = Dec
FROM dbo.MonthlyData) AS p
UNPIVOT
(Amount FOR MonthAbbr IN
(_01_Jan, _02_Feb, _03_Mar, _04_Apr, _05_May, _06_Jun
, _07_Jul, _08_Aug, _09_Sep, _10_Oct, _11_Nov, _12_Dec
)
)AS unpvt
)
SELECT VendorId
, MonthId
, Amount
, Total_Ytd = SUM(Amount) OVER (
PARTITION BY VendorId
ORDER BY MonthId
ROWS UNBOUNDED PRECEDING
)
, Total_Rolling3 = SUM(Amount) OVER (
PARTITION BY VendorId
ORDER BY MonthId
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
)
FROM unpivoted_cte
WHERE VendorId = 999

The execution plan of this statement?

image

OK, Still a fair number of steps but it is a much more linear plan and the base table is only accessed the one time.

If I remove the where clause (where vendorId = 999) this query returns 120,000 rows based on the population script from last month. The old plan executes 6x slower than the new plan.

In future posts we will dive into some of the details of how this query works and look at some other improvements to the window functions that we can take advantage of in SQL Server Denali!