SEMIJOIN (U-SQL)
Summary
Semijoins are U-SQL’s way to filter a rowset based on the inclusion of its rows in another rowset. Other SQL dialects express this with the SELECT * FROM A WHERE A.key IN (SELECT B.key FROM B)
pattern. There are two variants: LEFT SEMIJOIN
and RIGHT SEMIJOIN
.
A LEFT SEMIJOIN
(or just SEMIJOIN
) gives only those rows in the left rowset that have a matching row in the right rowset.
The RIGHT SEMIJOIN
gives only those rows in the right rowset that have a matching row in the left rowset.
The join expression in the ON clause specifies how to determine the match.
Examples
Given the following rowsets:
EmpName | DepID |
---|---|
Rafferty | 31 |
Jones | 33 |
Heisenberg | 33 |
Robinson | 34 |
Smith | 34 |
Williams | null |
DeptID | DepName |
---|---|
31 | Sales |
33 | Engineering |
34 | Clerical |
35 | Marketing |
The following query finds all employees that are in valid departments by finding all the employees in the left @employees rowset that have a depID that is listed in the right @departments rowset):
@employees = SELECT *
FROM (VALUES
("Rafferty", (int?) 31)
, ("Jones", (int?) 33)
, ("Heisenberg", (int?) 33)
, ("Robinson", (int?) 34)
, ("Smith", (int?) 34)
, ("Williams", (int?) null)) AS E(EmpName, DepID);
@departments = SELECT *
FROM (VALUES
((int) 31, "Sales")
, ((int) 33, "Engineering")
, ((int) 34, "Clerical")
, ((int) 35, "Marketing")) AS D(DepID, DepName);
@emps_in_valid_dept =
SELECT e.EmpName, e.DepID
FROM @employees AS e
LEFT SEMIJOIN (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d
ON e.DepID == d.DepID;
OUTPUT @emps_in_valid_dept
TO "/output/rsLeftSemiJoinEmployeesInValidDept.csv"
USING Outputters.Csv();
The resulting rowset looks like:
EmpName | DepID |
---|---|
Rafferty | 31 |
Jones | 33 |
Heisenberg | 33 |
Robinson | 34 |
Smith | 34 |
The following query finds all departments with at least one employee:
@employees = SELECT *
FROM (VALUES
("Rafferty", (int?) 31)
, ("Jones", (int?) 33)
, ("Heisenberg", (int?) 33)
, ("Robinson", (int?) 34)
, ("Smith", (int?) 34)
, ("Williams", (int?) null)) AS E(EmpName, DepID);
@departments = SELECT *
FROM (VALUES
((int) 31, "Sales")
, ((int) 33, "Engineering")
, ((int) 34, "Clerical")
, ((int) 35, "Marketing")) AS D(DepID, DepName);
@depts_with_emps =
SELECT d.DepName, d.DepID
FROM @employees AS e
RIGHT SEMIJOIN (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d
ON e.DepID == d.DepID;
OUTPUT @depts_with_emps
TO "/output/rsRightSemiJoinDepartmentsWithEmployees.csv"
USING Outputters.Csv();
The query return the rowset:
DepName | DepID |
---|---|
Sales | 31 |
Engineering | 33 |
Clerical | 34 |
Additional Example
An example of comparing SEMIJOIN against a subquery with IN is available at Subqueries with IN/NOT IN and SEMIJOIN/ANTISEMIJOIN.