Fan-out Querying for Federations in SQL Azure (Part 2): Scalable Fan-out Queries with TOP, ORDER BY, DISTINCT and Other Powerful Aggregates, MapReduce Style!
Welcome back. In the previous post: Introduction to Fan-out Querying, we covered the basics and defined the fragments that make up the fan-out query namely the member and summary queries. Fan-out querying refer to querying multiple members.
We looked at queries examples in the previous post with member queries with no summary queries, that is the member queries were simply UNIONed. In this post, we’ll take a look at Summary queries. when are they needed? what are some common patterns and examples?
Summary queries are required for post processing the member queries. Simply put, summary queries can help reshape the unioned member results into the desired final share. Summary queries refer to object generated by the member queries and depending on the implementation can be executed on the client side or the server side. That said, today Federations do not provide a built in server side processing option for summary queries. Here are a few options for processing summary queries:
- LINQ To DataSets offers a great option for querying datasets. Some examples here. LINQ is best suited for the job in my opinion with flexible language constructs, dynamic execution and parallelism options.
- ADO.Net Expressions in DataSets offers a number of options for summary query processing as well. For example, DataColumn.Expressions allow you to add aggregate expressions to your Dataset for this type of processing. Or you can use DataTable.Compute for processing a rollup value.
- Obviously server-side full fan-out processing is also an option. This option refers to server side running member and summary query in a single round-trip from the client to SQL Azure. However as of Jan 2012, this is not built into SQL Azure Federations. We’ll take a look at a simulated version of this in the sample tool here; You can use the deployment closer to your database for efficiency;
Americas Deployment: https://federationsutility-scus.cloudapp.net/
European Deployment: https://federationsutility-weu.cloudapp.net/
Asian Deployment: https://federationsutility-seasia.cloudapp.net/
The tool provides a basic and full ‘fanout’ page. Basic page contains only member queries through a simplified interface. The results are simply unioned (or ‘union all’-ed to be precise) together. Full page provides additional capabilities including member and summary queries. Both member and summary queries is expressed in TSQL. The full page also allows for parallelism and allows specifying a federation key range other than all members. Tool has a help page with detailed notes on each of these capabilities.
Lets dive in and take a look at where summary queries can be useful. We’ll start with Ordering, TOP, aggregations and finally DISTINCT processing. By the way, for the examples in this post, I’ll continue to use the BlogsRUs_DB schema posted at the bottom of this article.
GROUP BY and HAVING with Fan-out Queries
With simple group-by items the rule is simple; if the grouping is aligned to the federation key in fan-out queries in federations, processing of group-by and having needs no special consideration. Simply union the results (or union-all to be precise) and we are done. However processing unaligned groupings (any grouping that does not include the federation key) requires a summary query.
When grouping isn’t aligned to federation key, grouping isn’t completely done with member queries. that means processing HAVING will generate incorrect results in member queries. Lets take the example form the previous post; here is the query simply reporting the months and counts of posts with more than a million posts;
SELECT DATEPART(mm, be.created_date) mon, COUNT(be.blog_entry_title) cnt
FROM blog_entries_tbl be
GROUP BY DATEPART(mm, be.created_date)
HAVING COUNT(be.blog_entry_title) > 100000
The grouping on month in each member yields the results from each member but grouping all members isn’t fully done yet! We need to GROUP_BY the same columns and expressions to finish grouping before we can apply the HAVING predicate.
So to correctly process HAVING predicate, we need to push t to the summary query;
SELECT DATEPART(mm, be.created_date) mon, COUNT(be.blog_entry_title) cnt
FROM blog_entries_tbl be
GROUP BY DATEPART(mm, be.created_date)
and the summary query should process the HAVING.
SELECT mon, sum(cnt) FROM #Table
GROUP BY mon
HAVING sum(cnt) > 100000
‘ORDER BY’ and ‘TOP’ with Fan-out Queries:
Lets start with ordering and TOP functionality in TSQL. The member queries can include ORDER BY and TOP but you will need the summary query to reprocess and finalize the ordering and top filtering of the member query results. Take the following example; The query is calculating the top 10 blogs created across the entire BlogsRUs_DB. Remember the blog_entries_tbl is federated on blog_id.
SELECT TOP 10 blog_entry_text FROM blog_entries_tbl
ORDER BY created_date DESC
Here is how to break this into a fan-out query with a member and a summary query; The #Table in the summary query refer to the resultset generated from the member query.
Member Query:
SELECT TOP 10 blog_entry_text, created_date FROM blog_entries_tbl
ORDER BY created_date DESC
Summary Query:
SELECT TOP 10 blog_entry_text FROM #Table
ORDER BY created_date DESC
Here is the output for the member query:
And the output with the summary query. Now the output only contains the true TOP 10 rows.
Additive Aggregates: MIN, MAX & SUM with Fan-out Queries:
When the processing of the MIN, MAX and SUM align with a grouping on the federation key, a UNION ALL of the results can simply yield the result. For example the count of blog entries per blog could be proceed with the following query;
SELECT blog_id, COUNT(blog_entry_id)
FROM blog_entries_tbl GROUP BY blog_id
Same applies to the OVER clause used for defining windows when processing aggregates. that is, as long as the PARTITION BY includes the federation key, a UNION ALL of the results is sufficient for the summary query.
However when the grouping does not align to the federation key, you will need summary queries. Here is the example that gets us the latest date that a blog entry was created. The grouping in the absence of a GROUP BY clause is a single bucket.
SELECT MAX(created_date) FROM blog_entries_tbl
Here is the way to break this down into a fan-out query;
Member Query:
SELECT MAX(created_date) FROM blog_entries_tbl
Summary Query:
SELECT MAX(Column1) FROM #Table
Fan-out Queries Processing AVG and COUNT
Average is a none additive aggregate thus it takes a rewrite to take an average when grouping does not align to the federation key. Here is an example of an average aggregate that aligns with a federation key: the next query calculates the average days between the blog entry and the last comment for the post for bloggers. the inner query gets the MAX date for the last comment on the blog entry. AVG is calculated per blogger for all the blog entries with the days between the create date of the blog and the last comment date on all blog entry.
SELECT blog_id, AVG(LEN(blog_entry_text))
FROM blog_entries_tbl
GROUP BY blog_id
There is no need for a summary query other than a UNION ALL of the results when fanning this query out given it is aligned to the federation key: blog_id.
However when the grouping is not aligned, there is work to do! Average and Count are not additive. That is, one cannot take 2 averages and average those to get the correct average or take 2 counts and use another count to calculate the output of 2 counts; we need SUM in the case of 2 counts to correctly get the final correct count and we need to use sum of values we want to average and the item count to calculate avg correctly from multiple member queries. Here is an example: Lets take the average length of blog entries across all each month. This time the grouping is on the month of the blog posts.
SELECT DATEPART(mm,be.created_date) month_of_entry,
AVG(LEN(blog_entry_text))
FROM blog_entries_tbl
GROUP BY DATEPART(mm,be.created_date)
Lets use SUM and COUNT to get to the correct average.
Member Query:
SELECT DATEPART(mm,created_date) month_of_entry,
SUM(LEN(blog_entry_text)) sum_len_blog_entry,
COUNT(blog_entry_text) count_blog_entry
FROM blog_entries_tbl
GROUP BY DATEPART(mm,created_date)
Summary Query:
SELECT month_of_entry,
SUM(sum_len_blog_entry)/SUM(count_blog_entry) avg_len_blog_entry
FROM #Table
GROUP BY month_of_entry
‘DISTINCT’ with Fan-out Queries:
DISTINCT is fairly easy to calculate much like the other additive aggregates. As long as the grouping is on the federation key. When grouping isn’t aligned to the federation key, a summary query reapplying the distinct for de-duplication is needed. Here is the query for DISTINCT count of languages used for blog comments across our entire dataset;
SELECT DISTINCT bec.language_id
FROM blog_entry_comments_tbl bec JOIN language_code_tbl lc
ON bec.language_id=lc.language_id
To break this apart, you have to run distinct in each member and then rerun the distinct on the summary query much like MIN and MAX;
Member Query:
SELECT DISTINCT bec.language_id
FROM blog_entry_comments_tbl bec JOIN language_code_tbl lc
ON bec.language_id=lc.language_id
Summary Query:
SELECT DISTINCT language_id FROM #Table
Now lets take a look at some more challenging none additive aggregates.
Fan-out Queries with TON N PERCENT and DISTINCT COUNT
TOP N PERCENT is not as popular as TOP N but for those who use it, lets dissect what you need to do with fanout queries. TOP N PERCENT simply require you collect all results without TOP from members and only apply the TOP N PERCENT in the summary query. Imagine the same query we used for TOP with TOP 10 PERCENT. Query gets latest 10% blog entries.
SELECT TOP 10 PERCENT blog_entry_text FROM blog_entries_tbl
ORDER BY created_date DESC
Here are the member and summary queries.
Member Query:
SELECT blog_entry_text FROM blog_entries_tbl
ORDER BY created_date DESC
Summary Query:
SELECT TOP 10 PERCENT blog_entry_text FROM #Table
ORDER BY create_date DESC
Alarm bell should go off whenever we are not able to push the filtration predicate (TOP clause) down to the member queries. So processing TOP N PERCENT is costlier that TOP N and additive keywords.
DISTINCT COUNT calculation is trivial if grouping is on the federation key. However when grouping isn’t aligned, it takes more work to calculate distinct count given that it isn’t additive. That means; you cannot simply add distinct-counts from each member given you may not know if you are counting certain things multiple times without full de-duplication of all the items. Here is an example; this query get the distinct count of languages per month across the whole resultset;
SELECT DATEPART(mm,created_date), COUNT(DISTINCT bec.language_id)
FROM blog_entry_comments_tbl bec JOIN language_code_tbl lc
ON bec.language_id=lc.language_id
GROUP BY DATEPART(mm,created_date)
We need to centralized processing so all comments language ids across days first need to be grouped across all members and then distinct can be calculated on that resultset. Here is the member and summary query for calculating distinct-count;
Member Query:
SELECT DISTINCT DATEPART(mm,created_date), bec.language_id
FROM blog_entry_comments_tbl bec JOIN language_code_tbl lc
ON bec.language_id=lc.language_id
Summary Query:
SELECT Column1, COUNT(DISTINCT language_id)
FROM #Table
GROUP BY Column1
Here is what the member query output looks like;
Here is the output with the summary query;
In conclusion, additive operations like MIN, MAX, COUNT or SUM and predicates like TOP or order by can easily be processed with fan-outs using simple modifications to original queries to break them into member and summary queries. With additive queries big advantage is that you can push these operation to the member queries for efficiencies. In cases where you need to deal with none-additive operations like average, there is some rewrite to help push filtrations and aggregate predicates to member queries. However there are a few cases like TOP N PERCENT or DISTINCT COUNT, where fan-out queries may require larger dataset shoveling thus will be more expensive to calculate because you cannot push these predicates to member queries and can only process them in the summary queries.
Happy fan-out querying.
*Sample Schema
Here is the schema I used for the sample queries above.
-- Connect to BlogsRUs_DB
CREATE FEDERATION Blogs_Federation(id bigint RANGE)
GO
USE FEDERATION blogs_federation (id=-1) WITH RESET, FILTERING=OFF
GO
CREATE TABLE blogs_tbl(
blog_id bigint not null,
user_id bigint not null,
blog_title nchar(256) not null,
created_date datetimeoffset not null DEFAULT getdate(),
updated_date datetimeoffset not null DEFAULT getdate(),
language_id bigint not null default 1,
primary key (blog_id)
)
FEDERATED ON (id=blog_id)
GO
CREATE TABLE blog_entries_tbl(
blog_id bigint not null,
blog_entry_id bigint not null,
blog_entry_title nchar(256) not null,
blog_entry_text nchar(2000) not null,
created_date datetimeoffset not null DEFAULT getdate(),
updated_date datetimeoffset not null DEFAULT getdate(),
language_id bigint not null default 1,
blog_style bigint null,
primary key (blog_entry_id,blog_id)
)
FEDERATED ON (id=blog_id)
GO
CREATE TABLE blog_entry_comments_tbl(
blog_id bigint not null,
blog_entry_id bigint not null,
blog_comment_id bigint not null,
blog_comment_title nchar(256) not null,
blog_comment_text nchar(2000) not null,
user_id bigint not null,
created_date datetimeoffset not null DEFAULT getdate(),
updated_date datetimeoffset not null DEFAULT getdate(),
language_id bigint not null default 1
primary key (blog_comment_id,blog_entry_id,blog_id)
)
FEDERATED ON (id=blog_id)
GO
CREATE TABLE language_code_tbl(
language_id bigint primary key,
name nchar(256) not null,
code nchar(256) not null
)
GO