Query a data warehouse
When the dimension and fact tables in a data warehouse have been loaded with data, you can use SQL to query the tables and analyze the data they contain. The Transact-SQL syntax used to query tables in a Synapse dedicated SQL pool is similar to SQL used in SQL Server or Azure SQL Database.
Aggregating measures by dimension attributes
Most data analytics with a data warehouse involves aggregating numeric measures in fact tables by attributes in dimension tables. Because of the way a star or snowflake schema is implemented, queries to perform this kind of aggregation rely on JOIN
clauses to connect fact tables to dimension tables, and a combination of aggregate functions and GROUP BY
clauses to define the aggregation hierarchies.
For example, the following SQL queries the FactSales and DimDate tables in a hypothetical data warehouse to aggregate sales amounts by year and quarter:
SELECT dates.CalendarYear,
dates.CalendarQuarter,
SUM(sales.SalesAmount) AS TotalSales
FROM dbo.FactSales AS sales
JOIN dbo.DimDate AS dates ON sales.OrderDateKey = dates.DateKey
GROUP BY dates.CalendarYear, dates.CalendarQuarter
ORDER BY dates.CalendarYear, dates.CalendarQuarter;
The results from this query would look similar to the following table:
CalendarYear | CalendarQuarter | TotalSales |
---|---|---|
2020 | 1 | 25980.16 |
2020 | 2 | 27453.87 |
2020 | 3 | 28527.15 |
2020 | 4 | 31083.45 |
2021 | 1 | 34562.96 |
2021 | 2 | 36162.27 |
... | ... | ... |
You can join as many dimension tables as needed to calculate the aggregations you need. For example, the following code extends the previous example to break down the quarterly sales totals by city based on the customer's address details in the DimCustomer table:
SELECT dates.CalendarYear,
dates.CalendarQuarter,
custs.City,
SUM(sales.SalesAmount) AS TotalSales
FROM dbo.FactSales AS sales
JOIN dbo.DimDate AS dates ON sales.OrderDateKey = dates.DateKey
JOIN dbo.DimCustomer AS custs ON sales.CustomerKey = custs.CustomerKey
GROUP BY dates.CalendarYear, dates.CalendarQuarter, custs.City
ORDER BY dates.CalendarYear, dates.CalendarQuarter, custs.City;
This time, the results include a quarterly sales total for each city:
CalendarYear | CalendarQuarter | City | TotalSales |
---|---|---|---|
2020 | 1 | Amsterdam | 5982.53 |
2020 | 1 | Berlin | 2826.98 |
2020 | 1 | Chicago | 5372.72 |
... | ... | ... | .. |
2020 | 2 | Amsterdam | 7163.93 |
2020 | 2 | Berlin | 8191.12 |
2020 | 2 | Chicago | 2428.72 |
... | ... | ... | .. |
2020 | 3 | Amsterdam | 7261.92 |
2020 | 3 | Berlin | 4202.65 |
2020 | 3 | Chicago | 2287.87 |
... | ... | ... | .. |
2020 | 4 | Amsterdam | 8262.73 |
2020 | 4 | Berlin | 5373.61 |
2020 | 4 | Chicago | 7726.23 |
... | ... | ... | .. |
2021 | 1 | Amsterdam | 7261.28 |
2021 | 1 | Berlin | 3648.28 |
2021 | 1 | Chicago | 1027.27 |
... | ... | ... | .. |
Joins in a snowflake schema
When using a snowflake schema, dimensions may be partially normalized; requiring multiple joins to relate fact tables to snowflake dimensions. For example, suppose your data warehouse includes a DimProduct dimension table from which the product categories have been normalized into a separate DimCategory table. A query to aggregate items sold by product category might look similar to the following example:
SELECT cat.ProductCategory,
SUM(sales.OrderQuantity) AS ItemsSold
FROM dbo.FactSales AS sales
JOIN dbo.DimProduct AS prod ON sales.ProductKey = prod.ProductKey
JOIN dbo.DimCategory AS cat ON prod.CategoryKey = cat.CategoryKey
GROUP BY cat.ProductCategory
ORDER BY cat.ProductCategory;
The results from this query include the number of items sold for each product category:
ProductCategory | ItemsSold |
---|---|
Accessories | 28271 |
Bits and pieces | 5368 |
... | ... |
Note
JOIN clauses for FactSales and DimProduct and for DimProduct and DimCategory are both required, even though no fields from DimProduct are returned by the query.
Using ranking functions
Another common kind of analytical query is to partition the results based on a dimension attribute and rank the results within each partition. For example, you might want to rank stores each year by their sales revenue. To accomplish this goal, you can use Transact-SQL ranking functions such as ROW_NUMBER
, RANK
, DENSE_RANK
, and NTILE
. These functions enable you to partition the data over categories, each returning a specific value that indicates the relative position of each row within the partition:
- ROW_NUMBER returns the ordinal position of the row within the partition. For example, the first row is numbered 1, the second 2, and so on.
- RANK returns the ranked position of each row in the ordered results. For example, in a partition of stores ordered by sales volume, the store with the highest sales volume is ranked 1. If multiple stores have the same sales volumes, they'll be ranked the same, and the rank assigned to subsequent stores reflects the number of stores that have higher sales volumes - including ties.
- DENSE_RANK ranks rows in a partition the same way as RANK, but when multiple rows have the same rank, subsequent rows are ranking positions ignore ties.
- NTILE returns the specified percentile in which the row falls. For example, in a partition of stores ordered by sales volume,
NTILE(4)
returns the quartile in which a store's sales volume places it.
For example, consider the following query:
SELECT ProductCategory,
ProductName,
ListPrice,
ROW_NUMBER() OVER
(PARTITION BY ProductCategory ORDER BY ListPrice DESC) AS RowNumber,
RANK() OVER
(PARTITION BY ProductCategory ORDER BY ListPrice DESC) AS Rank,
DENSE_RANK() OVER
(PARTITION BY ProductCategory ORDER BY ListPrice DESC) AS DenseRank,
NTILE(4) OVER
(PARTITION BY ProductCategory ORDER BY ListPrice DESC) AS Quartile
FROM dbo.DimProduct
ORDER BY ProductCategory;
The query partitions products into groupings based on their categories, and within each category partition, the relative position of each product is determined based on its list price. The results from this query might look similar to the following table:
ProductCategory | ProductName | ListPrice | RowNumber | Rank | DenseRank | Quartile |
---|---|---|---|---|---|---|
Accessories | Widget | 8.99 | 1 | 1 | 1 | 1 |
Accessories | Knicknak | 8.49 | 2 | 2 | 2 | 1 |
Accessories | Sprocket | 5.99 | 3 | 3 | 3 | 2 |
Accessories | Doodah | 5.99 | 4 | 3 | 3 | 2 |
Accessories | Spangle | 2.99 | 5 | 5 | 4 | 3 |
Accessories | Badabing | 0.25 | 6 | 6 | 5 | 4 |
Bits and pieces | Flimflam | 7.49 | 1 | 1 | 1 | 1 |
Bits and pieces | Snickity wotsit | 6.99 | 2 | 2 | 2 | 1 |
Bits and pieces | Flange | 4.25 | 3 | 3 | 3 | 2 |
... | ... | ... | ... | ... | ... | ... |
Note
The sample results demonstrate the difference between RANK
and DENSE_RANK
. Note that in the Accessories category, the Sprocket and Doodah products have the same list price; and are both ranked as the 3rd highest priced product. The next highest priced product has a RANK of 5 (there are four products more expensive than it) and a DENSE_RANK of 4 (there are three higher prices).
To learn more about ranking functions, see Ranking Functions (Transact-SQL) in the Azure Synapse Analytics documentation.
Retrieving an approximate count
While the purpose of a data warehouse is primarily to support analytical data models and reports for the enterprise; data analysts and data scientists often need to perform some initial data exploration, just to determine the basic scale and distribution of the data.
For example, the following query uses the COUNT
function to retrieve the number of sales for each year in a hypothetical data warehouse:
SELECT dates.CalendarYear AS CalendarYear,
COUNT(DISTINCT sales.OrderNumber) AS Orders
FROM FactSales AS sales
JOIN DimDate AS dates ON sales.OrderDateKey = dates.DateKey
GROUP BY dates.CalendarYear
ORDER BY CalendarYear;
The results of this query might look similar to the following table:
CalendarYear | Orders |
---|---|
2019 | 239870 |
2020 | 284741 |
2021 | 309272 |
... | ... |
The volume of data in a data warehouse can mean that even simple queries to count the number of records that meet specified criteria can take a considerable time to run. In many cases, a precise count isn't required - an approximate estimate will suffice. In such cases, you can use the APPROX_COUNT_DISTINCT
function as shown in the following example:
SELECT dates.CalendarYear AS CalendarYear,
APPROX_COUNT_DISTINCT(sales.OrderNumber) AS ApproxOrders
FROM FactSales AS sales
JOIN DimDate AS dates ON sales.OrderDateKey = dates.DateKey
GROUP BY dates.CalendarYear
ORDER BY CalendarYear;
The APPROX_COUNT_DISTINCT
function uses a HyperLogLog algorithm to retrieve an approximate count. The result is guaranteed to have a maximum error rate of 2% with 97% probability, so the results of this query with the same hypothetical data as before might look similar to the following table:
CalendarYear | ApproxOrders |
---|---|
2019 | 235552 |
2020 | 290436 |
2021 | 304633 |
... | ... |
The counts are less accurate, but still sufficient for an approximate comparison of yearly sales. With a large volume of data, the query using the APPROX_COUNT_DISTINCT
function completes more quickly, and the reduced accuracy may be an acceptable trade-off during basic data exploration.
Note
See the APPROX_COUNT_DISTINCT function documentation for more details.