GROUP BY子句
適用於:Databricks SQL
Databricks Runtime
GROUP BY
子句是用來根據一組指定的群組表達式來分組數據列,並根據一或多個指定的聚合函數,計算數據列群組的匯總。
Databricks SQL 也支援進階匯總,透過 GROUPING SETS
、CUBE
、ROLLUP
子句,針對相同的輸入記錄集執行多個匯總。
分組表達式和進階聚合可以在 GROUP BY
子句中混合,並巢狀在 GROUPING SETS
子句中。
如需詳細資訊,請參閱混合/巢狀群組分析一節。
FILTER
當條件子句附加至聚合函數時,只有符合條件的數據列會被傳遞至該函數。
語法
GROUP BY ALL
GROUP BY group_expression [, ...] [ WITH ROLLUP | WITH CUBE ]
GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } ( grouping_set [, ...] ) } [, ...]
grouping_set
{ expression |
( [ expression [, ...] ] ) }
聚合函數定義為
aggregate_name ( [ DISTINCT ] expression [, ...] ) [ FILTER ( WHERE boolean_expression ) ]
參數
全部
適用於:
的 Databricks SQL
的 Databricks Runtime 12.2 LTS 以上
一種速記法可以將所有不包含聚合函數的
SELECT
清單表示式新增為group_expression
。 如果沒有這類表達式存在GROUP BY ALL
,就相當於省略GROUP BY
導致全域匯總的 子句。GROUP BY ALL
不保證會產生一組可解析的群組表達式。 如果產生的子句格式不正確,Azure Databricks 會引發 UNRESOLVED_ALL_IN_GROUP_BY 或 MISSING_AGGREGATION 。group_expression
指定將數據列分組在一起的準則。 數據列的分組是根據群組表達式的結果值來執行。 群組表達式可以是資料行名稱,如
GROUP BY a
,資料行位置如GROUP BY 0
,或是像GROUP BY a + b
的運算式。 如果group_expression
包含匯總函數,Azure Databricks 就會引發GROUP_BY_AGGREGATE錯誤。群組集
群組集是由括弧中的零個或多個逗號分隔表達式所指定。 當群組集只有一個專案時,可以省略括弧。 例如,
GROUPING SETS ((a), (b))
與GROUPING SETS (a, b)
相同。群組集
將
GROUPING SETS
之後指定的每個群組集的列進行分組。 例如:GROUP BY GROUPING SETS ((warehouse), (product))
在語義上等同於GROUP BY warehouse
與GROUP BY product
結果的聯集。這個子句是
UNION ALL
的簡寫,UNION ALL
運算子的每個運算部分會對GROUPING SETS
子句中指定的每個群組集進行匯總處理。同樣地,
GROUP BY GROUPING SETS ((warehouse, product), (product), ())
在語意上相當於GROUP BY warehouse, product
和GROUP BY product
結果的聯集,及全域彙總的結果。
注意
為了 Hive 的相容性,Databricks SQL 允許 GROUP BY ... GROUPING SETS (...)
。
通常會忽略GROUP BY
表達式,但如果它們除了GROUPING SETS
表達式以外還包含其他額外的表達式,這些額外的表達式將會被包含在群組表達式中,其值始終為 Null。
例如,在 SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b)
中,c 數據行的輸出一律為 null。
ROLLUP
在單一語句中指定多個匯總層級。 這個子句是用來根據多個群組集計算匯總。
ROLLUP
是GROUPING SETS
的簡寫。 例如:GROUP BY warehouse, product WITH ROLLUP
或
GROUP BY ROLLUP(warehouse, product)
相當於
GROUP BY GROUPING SETS((warehouse, product), (warehouse), ())
而
GROUP BY ROLLUP(warehouse, product, (warehouse, location))
相當於
GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ())
規格的
ROLLUP
N 元素會產生 N+1GROUPING SETS
。立方體
CUBE
子句是用來根據GROUP BY
子句中指定的群組數據行組合來執行匯總。CUBE
是GROUPING SETS
的簡寫。 例如:GROUP BY warehouse, product WITH CUBE
或
GROUP BY CUBE(warehouse, product)
相當於:
GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ())
而
GROUP BY CUBE(warehouse, product, location)
相當於以下内容:
GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (product, location), (warehouse), (product), (location), ())
規格的
CUBE
N 元素會產生 2^NGROUPING SETS
。aggregate_name
聚合函數名稱(MIN、MAX、COUNT、SUM、AVG 等)。
區別
移除輸入列中的重複項目後,傳遞至彙總函數。
FILTER
篩選輸入的行,其中
WHERE
子句中的boolean_expression
評估為 true 的會被傳遞到聚合函數;其他行則會被捨棄。
混合/巢狀群組分析
GROUP BY
子句可以包含多個 group_expressions
、多個 CUBE
,以及多個 ROLLUP
和 GROUPING SETS
。
GROUPING SETS
也可以包含巢狀的 CUBE
、ROLLUP
或 GROUPING SETS
子句。 例如:
GROUPING SETS(ROLLUP(warehouse, location),
CUBE(warehouse, location)),
GROUPING SETS(warehouse,
GROUPING SETS(location,
GROUPING SETS(ROLLUP(warehouse, location),
CUBE(warehouse, location))))
CUBE
和 ROLLUP
只是 GROUPING SETS
的語法糖。
請參閱上述各節以了解如何將CUBE
和ROLLUP
轉譯成GROUPING SETS
。
group_expression
在此內容中可以視為單一群組 GROUPING SETS
。
針對 GROUP BY
子句中的多個 GROUPING SETS
,Databricks SQL 透過將原始 GROUPING SETS
執行交叉乘積來生成單一的 GROUPING SETS
。
對於GROUPING SETS
巢狀於GROUPING SETS
子句中的情況,Databricks SQL 會採取其群組集並剝除它們。 例如,下列查詢:
GROUP BY warehouse,
GROUPING SETS((product), ()),
GROUPING SETS((location, size),
(location),
(size),
());
GROUP BY warehouse,
ROLLUP(product),
CUBE(location, size);
相當於以下內容:
GROUP BY GROUPING SETS( (warehouse, product, location, size),
(warehouse, product, location),
(warehouse, product, size),
(warehouse, product),
(warehouse, location, size),
(warehouse, location),
(warehouse, size),
(warehouse))
而
GROUP BY GROUPING SETS(GROUPING SETS(warehouse),
GROUPING SETS((warehouse, product)))`
相當於
GROUP BY GROUPING SETS((warehouse),
(warehouse, product))`.
範例
CREATE TEMP VIEW dealer (id, city, car_model, quantity) AS
VALUES (100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);
-- Sum of quantity per dealership. Group by `id`.
> SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
id sum(quantity)
--- -------------
100 32
200 33
300 13
-- Use column position in GROUP by clause.
> SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
id sum(quantity)
--- -------------
100 32
200 33
300 13
-- Multiple aggregations.
-- 1. Sum of quantity per dealership.
-- 2. Max quantity per dealership.
> SELECT id, sum(quantity) AS sum, max(quantity) AS max
FROM dealer GROUP BY id ORDER BY id;
id sum max
--- --- ---
100 32 15
200 33 20
300 13 8
-- Count the number of distinct dealers in cities per car_model.
> SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
car_model count
------------ -----
Honda Civic 3
Honda CRV 2
Honda Accord 3
-- Count the number of distinct dealers in cities per car_model, using GROUP BY ALL
> SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY ALL;
car_model count
------------ -----
Honda Civic 3
Honda CRV 2
Honda Accord 3
-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
> SELECT id,
sum(quantity) FILTER (WHERE car_model IN ('Honda Civic', 'Honda CRV')) AS `sum(quantity)`
FROM dealer
GROUP BY id ORDER BY id;
id sum(quantity)
--- -------------
100 17
200 23
300 5
-- Aggregations using multiple sets of grouping columns in a single statement.
-- Following performs aggregations based on four sets of grouping columns.
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. Empty grouping set. Returns quantities for all city and car models.
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
ORDER BY city;
city car_model sum
--------- ------------ ---
null null 78
null HondaAccord 33
null HondaCRV 10
null HondaCivic 35
Dublin null 33
Dublin HondaAccord 10
Dublin HondaCRV 3
Dublin HondaCivic 20
Fremont null 32
Fremont HondaAccord 15
Fremont HondaCRV 7
Fremont HondaCivic 10
San Jose null 13
San Jose HondaAccord 8
San Jose HondaCivic 5
-- Group by processing with `ROLLUP` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), ())
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY city, car_model WITH ROLLUP
ORDER BY city, car_model;
city car_model sum
--------- ------------ ---
null null 78
Dublin null 33
Dublin HondaAccord 10
Dublin HondaCRV 3
Dublin HondaCivic 20
Fremont null 32
Fremont HondaAccord 15
Fremont HondaCRV 7
Fremont HondaCivic 10
San Jose null 13
San Jose HondaAccord 8
San Jose HondaCivic 5
-- Group by processing with `CUBE` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY city, car_model WITH CUBE
ORDER BY city, car_model;
city car_model sum
--------- ------------ ---
null null 78
null HondaAccord 33
null HondaCRV 10
null HondaCivic 35
Dublin null 33
Dublin HondaAccord 10
Dublin HondaCRV 3
Dublin HondaCivic 20
Fremont null 32
Fremont HondaAccord 15
Fremont HondaCRV 7
Fremont HondaCivic 10
San Jose null 13
San Jose HondaAccord 8
San Jose HondaCivic 5
--Prepare data for ignore nulls example
> CREATE TEMP VIEW person (id, name, age) AS
VALUES (100, 'Mary', NULL),
(200, 'John', 30),
(300, 'Mike', 80),
(400, 'Dan' , 50);
--Select the first row in column age
> SELECT FIRST(age) FROM person;
first(age, false)
--------------------
NULL
--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
> SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
first(age, true) last(id, false) sum(id)
------------------- ------------------ ----------
30 400 1000