Jaa


CTEs (Common Table Expressions)

CTEs or common table expressions, which are new in SQL Server 2005, provide an easy way to break a complex SQL statement down into smaller more manageable queries.  CTEs are is some ways very much like views.  Unlike a view which can be created once and used by many SQL statements, a CTE is associated with a single SQL statement.  There are several excellent CTE examples, including recursive CTE examples, in Books Online.  Rather than construct entirely new examples for this post, I've started with the examples from Books Online.  To run these examples, you'll need to install the Adventure Works Cycles OLTP sample database.

Let's start with the following simple example:

WITH DirReps(ManagerID, DirectReports) AS
(
    SELECT ManagerID, COUNT(*) 
    FROM HumanResources.Employee AS e
    WHERE ManagerID IS NOT NULL
    GROUP BY ManagerID
)
SELECT ManagerID, DirectReports
FROM DirReps
ORDER BY ManagerID

This query is identical to the following simpler query:

SELECT ManagerID, COUNT(*) AS DirectReports
FROM HumanResources.Employee AS e
WHERE ManagerID IS NOT NULL
GROUP BY ManagerID
ORDER BY ManagerID

SQL Server expands the CTE without materializing it.  Thus, both queries produce the same query plan:

  |--Compute Scalar(DEFINE:([Expr1002]=CONVERT_IMPLICIT(int,[Expr1005],0)))
       |--Stream Aggregate(GROUP BY:([e].[ManagerID]) DEFINE:([Expr1005]=Count(*)))
            |--Index Seek(OBJECT:([HumanResources].[Employee].[IX_Employee_ManagerID] AS [e]), SEEK:([e].[ManagerID] IsNotNull) ORDERED FORWARD)

CTEs are more useful when a query includes multiple instances of the same sub-select.  For example, the following query computes the number of orders sold by and the most recent order date for each Adventure Works employee and then computes the same information for each employee's manager.  Rather than repeat the sub-select that computes the number of orders and the most recent order date, we can use a CTE to make the query simpler and clearer:

WITH Sales_CTE (SalesPersonID, NumberOfOrders, MaxDate)
AS
(
    SELECT SalesPersonID, COUNT(*), MAX(OrderDate)
    FROM Sales.SalesOrderHeader
    GROUP BY SalesPersonID
)
SELECT E.EmployeeID, OS.NumberOfOrders, OS.MaxDate,
    E.ManagerID, OM.NumberOfOrders, OM.MaxDate
FROM HumanResources.Employee AS E
    JOIN Sales_CTE AS OS
    ON E.EmployeeID = OS.SalesPersonID
    LEFT OUTER JOIN Sales_CTE AS OM
    ON E.ManagerID = OM.SalesPersonID
ORDER BY E.EmployeeID

We could write this query without the CTE as:

SELECT E.EmployeeID, OS.NumberOfOrders, OS.MaxDate,
    E.ManagerID, OM.NumberOfOrders, OM.MaxDate
FROM HumanResources.Employee AS E
    JOIN (
        SELECT SalesPersonID, COUNT(*) AS NumberOfOrders, MAX(OrderDate) AS MaxDate
        FROM Sales.SalesOrderHeader
        GROUP BY SalesPersonID
        ) AS OS
    ON E.EmployeeID = OS.SalesPersonID
    LEFT OUTER JOIN (
        SELECT SalesPersonID, COUNT(*) AS NumberOfOrders, MAX(OrderDate) AS MaxDate
        FROM Sales.SalesOrderHeader
        GROUP BY SalesPersonID
        ) AS OM
    ON E.ManagerID = OM.SalesPersonID
ORDER BY E.EmployeeID

Alternatively, we could use a view although that requires explicitly creating a view (and dropping it if you do not want to persist it).  Views are less convenient and more costly (the DDL is not free) than CTEs.  If you do create and drop views on the fly, you must also be careful not to use a name that will not conflict with other sessions that might also create the same view.

CREATE VIEW Sales_View (SalesPersonID, NumberOfOrders, MaxDate) AS
SELECT SalesPersonID, COUNT(*), MAX(OrderDate)
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID
GO
SELECT E.EmployeeID, OS.NumberOfOrders, OS.MaxDate,
    E.ManagerID, OM.NumberOfOrders, OM.MaxDate
FROM HumanResources.Employee AS E
    JOIN Sales_View AS OS
    ON E.EmployeeID = OS.SalesPersonID
    LEFT OUTER JOIN Sales_View AS OM
    ON E.ManagerID = OM.SalesPersonID
ORDER BY E.EmployeeID;
GO
DROP VIEW Sales_View
GO

All three of these variations produce the same query plan:

  |--Nested Loops(Left Outer Join, WHERE:([E].[ManagerID]=[Sales].[SalesOrderHeader].[SalesPersonID]))
       |--Sort(ORDER BY:([E].[EmployeeID] ASC))
       |    |--Nested Loops(Inner Join, OUTER REFERENCES:([Sales].[SalesOrderHeader].[SalesPersonID]))
       |         |--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1021],0)))
       |         |    |--Hash Match(Aggregate, HASH:([Sales].[SalesOrderHeader].[SalesPersonID]), RESIDUAL:(...) DEFINE:([Expr1021]=COUNT(*), [Expr1006]=MAX([Sales].[SalesOrderHeader].[OrderDate])))
       |         |         |--Clustered Index Scan(OBJECT:([Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID]))
       |         |--Clustered Index Seek(OBJECT:([HumanResources].[Employee].[PK_Employee_EmployeeID] AS [E]), SEEK:([E].[EmployeeID]=[Sales].[SalesOrderHeader].[SalesPersonID]) ORDERED FORWARD)
       |--Table Spool
            |--Compute Scalar(DEFINE:([Expr1010]=[Expr1010], [Expr1011]=[Expr1011]))
                 |--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(int,[Expr1022],0)))
                      |--Hash Match(Aggregate, HASH:([Sales].[SalesOrderHeader].[SalesPersonID]), RESIDUAL:(...) DEFINE:([Expr1022]=COUNT(*), [Expr1011]=MAX([Sales].[SalesOrderHeader].[OrderDate])))
                           |--Clustered Index Scan(OBJECT:([Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID]))

Notice that this query plan computes the CTE result twice (once for each reference).  We have two scans of the SalesOrderHeader table and two hash aggregates.  Even with the CTE (or the view), SQL Server still computes the result twice.  Do not be fooled by the table spool.  This spool merely caches the CTE result to avoid computing the entire aggregate once per employee.

We could force SQL Server to compute the CTE result only once by explicitly materializing it using a table variable, a temp table, or an indexed view.  Here is an example using a table variable:

DECLARE @Sales_Data TABLE (SalesPersonID INT, NumberOfOrders INT, MaxDate DATETIME)

INSERT @Sales_Data
SELECT SalesPersonID, COUNT(*), MAX(OrderDate)
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID

SELECT E.EmployeeID, OS.NumberOfOrders, OS.MaxDate,
    E.ManagerID, OM.NumberOfOrders, OM.MaxDate
FROM HumanResources.Employee AS E
    JOIN @Sales_Data AS OS
    ON E.EmployeeID = OS.SalesPersonID
    LEFT OUTER JOIN @Sales_Data AS OM
    ON E.ManagerID = OM.SalesPersonID
ORDER BY E.EmployeeID

This rewrite has two statements and, thus, has two plans.  The first plan materializes the CTE result into a table variable while the second plan reads this result twice.

  |--Table Insert(OBJECT:(@Sales_Data), SET:([SalesPersonID] = [Sales].[SalesOrderHeader].[SalesPersonID],[NumberOfOrders] = [Expr1007],[MaxDate] = [Expr1008]))
       |--Top(ROWCOUNT est 0)
            |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1012],0)))
                 |--Hash Match(Aggregate, HASH:([Sales].[SalesOrderHeader].[SalesPersonID]), RESIDUAL:(...) DEFINE:([Expr1012]=COUNT(*), [Expr1008]=MAX([Sales].[SalesOrderHeader].[OrderDate])))
                      |--Clustered Index Scan(OBJECT:([Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID]))

  |--Nested Loops(Left Outer Join, WHERE:([E].[ManagerID]=[OM].[SalesPersonID]))
       |--Nested Loops(Inner Join, OUTER REFERENCES:([OS].[SalesPersonID]))
       |    |--Sort(ORDER BY:([OS].[SalesPersonID] ASC))
       |    |    |--Table Scan(OBJECT:(@Sales_Data AS [OS]))
       |    |--Clustered Index Seek(OBJECT:([HumanResources].[Employee].[PK_Employee_EmployeeID] AS [E]), SEEK:([E].[EmployeeID]=[OS].[SalesPersonID]) ORDERED FORWARD)
       |--Table Scan(OBJECT:(@Sales_Data AS [OM]))

Explicitly materializing a CTE (or view or sub-select) result does not always yield better performance.  First, we must consider the cost of creating and writing the temp table.  If the cost of computing the CTE is not too great, it may be cheaper simply to compute the CTE result multiple times.  Second, in some cases, the query optimizer may be able to choose a better plan and avoid computing the entire CTE result.  For example, suppose that we only want to compute the number of orders for a single employee:

DECLARE @EID INT
SET @EID = 268;

WITH Sales_CTE (SalesPersonID, NumberOfOrders, MaxDate)
AS
(
    SELECT SalesPersonID, COUNT(*), MAX(OrderDate)
    FROM Sales.SalesOrderHeader
    GROUP BY SalesPersonID
)
SELECT E.EmployeeID, OS.NumberOfOrders, OS.MaxDate,
    E.ManagerID, OM.NumberOfOrders, OM.MaxDate
FROM HumanResources.Employee AS E
    JOIN Sales_CTE AS OS
    ON E.EmployeeID = OS.SalesPersonID
    LEFT OUTER JOIN Sales_CTE AS OM
    ON E.ManagerID = OM.SalesPersonID
WHERE E.EmployeeID = @EID

This query yields a different plan from the original query:

  |--Nested Loops(Left Outer Join, OUTER REFERENCES:([E].[ManagerID]))
       |--Nested Loops(Inner Join)
       |    |--Clustered Index Seek(OBJECT:([HumanResources].[Employee].[PK_Employee_EmployeeID] AS [E]), SEEK:([E].[EmployeeID]=[@EID]) ORDERED FORWARD)
       |    |--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1021],0)))
       |         |--Stream Aggregate(DEFINE:([Expr1021]=Count(*), [Expr1006]=MAX([Sales].[SalesOrderHeader].[OrderDate])))
       |              |--Clustered Index Scan(OBJECT:([Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID]), WHERE:([Sales].[SalesOrderHeader].[SalesPersonID]=[@EID]) )
       |--Compute Scalar(DEFINE:([Expr1010]=[Expr1010], [Expr1011]=[Expr1011]))
            |--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(int,[Expr1022],0)))
                 |--Stream Aggregate(DEFINE:([Expr1022]=Count(*), [Expr1011]=MAX([Sales].[SalesOrderHeader].[OrderDate])))
                      |--Clustered Index Scan(OBJECT:([Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID]), WHERE:([E].[ManagerID]=[Sales].[SalesOrderHeader].[SalesPersonID]) )

While the original plan computed the number of orders and the most recent order date for all employees, this plan only computes this information for a single employee and for that employee's manager.  Observe the WHERE predicates on the clustered index scans of the SalesOrderHeader table.  Moreover, notice that this plan uses scalar stream aggregates instead of the hash aggregates in the original plan.  Finally, in this example, the optimizer chose scans, but if the tables involved were larger and the predicates were more selective, it would have chosen index seeks to avoid scanning the entire table.

In my next post I'll take a look at what is perhaps the most important use of CTEs: recursive CTEs.

Comments

  • Anonymous
    October 25, 2007
    One of the most important uses of CTEs is to write recursive queries. In fact, CTEs provide the only

  • Anonymous
    December 28, 2007
    I think I understand CTE's but wonder if/why it's more efficient than just creating a temp table?

  • Anonymous
    January 03, 2008
    In some cases using a temp table or a table variable may be more efficient if the CTE is referenced multiple times and is very expensive to compute.  The last couple of examples in this post discuss this issue.  The very last example shows a case where the CTE may be more efficient than a temp table.  HTH.

  • Anonymous
    June 07, 2013
    Really good one !!!

  • Anonymous
    September 14, 2015
    Nice explantion