shuffle query

Article
11/23/2024

Applies to: ✅ Microsoft Fabric ✅ Azure Data Explorer ✅ Azure Monitor ✅ Microsoft Sentinel

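The shuffle query is a semantics-preserving transformation used with a set of operators that support the shuffle strategy. Depending on the data involved, querying with the shuffle strategy can yield better performance. The shuffle query strategy is the better choice when the shuffle key (a join key, summarize key, make-series key, or partition key) has high cardinality and the regular operator query hits query limits.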

You can use the following operators with the shuffle command: join, summarize, make-series, and partition.

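To use the shuffle query strategy, add the expression hint.strategy = shuffle or hint.shufflekey = <key>. When you use hint.strategy=shuffle, the operator data is shuffled by all the keys. Use this expression when the compound key is unique but each individual key isn't unique enough, so that the data is shuffled using all the keys of the shuffled operator.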

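When data is partitioned with the shuffle strategy, the data load is shared across all cluster nodes. Each node processes one partition of the data. The default number of partitions equals the number of cluster nodes.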

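You can override the number of partitions with the syntax hint.num_partitions = total_partitions. This is useful when the cluster has a small number of nodes, so the default number of partitions is small, and the query fails or takes a long time to run.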
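As a minimal sketch of this hint in use, the following self-contained query overrides the partition count for a shuffled summarize. The Orders table, its columns, and the value 4 are hypothetical and chosen only for illustration; a suitable partition count depends on your cluster and data.

```kusto
// Hypothetical sample data, defined inline so the query runs without any existing table.
let Orders = datatable(CustomerId: long, Amount: real)
[
    1, 10.0,
    2, 25.5,
    1, 4.5,
    3, 7.0,
    2, 1.5
];
Orders
// Shuffle on the group-by key and request 4 partitions instead of the default (one per cluster node).
| summarize hint.strategy = shuffle hint.num_partitions = 4 TotalAmount = sum(Amount) by CustomerId
| order by CustomerId asc
```

Because the shuffle strategy is semantics-preserving, the hints affect only how the work is distributed; the totals per CustomerId are the same with or without them.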

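Note

Using many partitions can consume more cluster resources and degrade performance. Choose the partition number carefully: start with hint.strategy = shuffle alone, and then increase the number of partitions gradually.

In some cases, hint.strategy = shuffle is ignored, and the query doesn't run with the shuffle strategy. This happens when:

The join operator has another shuffle-compatible operator (join, summarize, make-series, or partition) on its left side or right side.
The summarize operator appears after another shuffle-compatible operator (join, summarize, make-series, or partition) in the query.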

Syntax

With `hint.strategy` = `shuffle`

T | DataExpression | join hint.strategy = shuffle ( DataExpression )

T | summarize hint.strategy = shuffle DataExpression

T | Query | partition hint.strategy = shuffle ( SubQuery )

With `hint.shufflekey` = key

T | DataExpression | join hint.shufflekey = key ( DataExpression )

T | summarize hint.shufflekey = key DataExpression

T | make-series hint.shufflekey = key DataExpression

T | Query | partition hint.shufflekey = key ( SubQuery )

Learn more about syntax conventions.

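Parameters

Name	Type	Required	Description
T	`string`	✔️	The tabular source whose data is to be processed by the operator.
DataExpression	`string`		An implicit or explicit tabular transformation expression.
Query	`string`		A transformation expression that runs on the records of T.
key	`string`		Use a `join` key, `summarize` key, `make-series` key, or `partition` key.
SubQuery	`string`		A transformation expression.

Note

Either DataExpression or Query must be specified, depending on the chosen syntax.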

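Examples

Use summarize with shuffle

A shuffle strategy query with the summarize operator shares the load across all cluster nodes, where each node processes one partition of the data.

Run the query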

StormEvents
| summarize hint.strategy = shuffle count(), avg(InjuriesIndirect) by State
| count

Output

Count
67
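
The same aggregation can also name its shuffle key explicitly. The following variant is a sketch that assumes the same StormEvents sample table; it replaces hint.strategy = shuffle with hint.shufflekey = State, following the summarize form listed in the syntax section.

```kusto
StormEvents
// Shuffle the data explicitly on State, the group-by key of this aggregation.
| summarize hint.shufflekey = State count(), avg(InjuriesIndirect) by State
| count
```

With a single group-by key, both hints shuffle on the same column, so either form can be used here.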

Use join with shuffle

Run the query

StormEvents
| where State has "West"
| where EventType has "Flood"
| join hint.strategy=shuffle 
    (
    StormEvents
    | where EventType has "Hail"
    | project EpisodeId, State, DamageProperty
    )
    on State
| count

Output

Count
103

Use make-series with shuffle

Run the query

StormEvents
| where State has "North"
| make-series hint.shufflekey = State sum(DamageProperty) default = 0 on StartTime in range(datetime(2007-01-01 00:00:00.0000000), datetime(2007-01-31 23:59:00.0000000), 15d) by State

Output

State	sum_DamageProperty	StartTime
NORTH DAKOTA	[60000,0,0]	["2006-12-31T00:00:00.0000000Z","2007-01-15T00:00:00.0000000Z","2007-01-30T00:00:00.0000000Z"]
NORTH CAROLINA	[20000,0,1000]	["2006-12-31T00:00:00.0000000Z","2007-01-15T00:00:00.0000000Z","2007-01-30T00:00:00.0000000Z"]
ATLANTIC NORTH	[0,0,0]	["2006-12-31T00:00:00.0000000Z","2007-01-15T00:00:00.0000000Z","2007-01-30T00:00:00.0000000Z"]

Use partition with shuffle

Run the query

StormEvents
| partition hint.strategy=shuffle by EpisodeId
(
    top 3 by DamageProperty
    | project EpisodeId, State, DamageProperty
)
| count

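Output

Count
22345

Compare hint.strategy=shuffle and hint.shufflekey=key

When you use hint.strategy=shuffle, the shuffled operator is shuffled by all the keys. In the following example, the query shuffles the data using both EpisodeId and EventId as keys:

Run the query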

StormEvents
| where StartTime > datetime(2007-01-01 00:00:00.0000000)
| join kind = inner hint.strategy=shuffle (StormEvents | where DamageCrops > 62000000) on EpisodeId, EventId
| count

Output

Count
14

The following query uses hint.shufflekey = key. The query above is equivalent to this query.

Run the query

StormEvents
| where StartTime > datetime(2007-01-01 00:00:00.0000000)
| join kind = inner hint.shufflekey = EpisodeId hint.shufflekey = EventId (StormEvents | where DamageCrops > 62000000) on EpisodeId, EventId
| count

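Output

Count
14

Shuffle the data with multiple keys

In some cases, hint.strategy=shuffle is ignored, and the query doesn't run with the shuffle strategy. In the following example, the join has a summarize on its left side, so using hint.strategy=shuffle doesn't apply the shuffle strategy to the query:

Run the query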

StormEvents
| where StartTime > datetime(2007-01-01 00:00:00.0000000)
| summarize count() by EpisodeId, EventId
| join kind = inner hint.strategy=shuffle (StormEvents | where DamageCrops > 62000000) on EpisodeId, EventId

Output

EpisodeId	EventId	...	EpisodeId1	EventId1	...
1030	4407	...	1030	4407	...
1030	13721	...	1030	13721	...
2477	12530	...	2477	12530	...
2103	10237	...	2103	10237	...
2103	10239	...	2103	10239	...
...	...	...	...	...	...

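To overcome this issue and run with the shuffle strategy, choose the key that is common to the summarize and join operations. In this case, that key is EpisodeId. Use the hint hint.shufflekey to specify the shuffle key on the join as hint.shufflekey = EpisodeId:

Run the query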

StormEvents
| where StartTime > datetime(2007-01-01 00:00:00.0000000)
| summarize count() by EpisodeId, EventId
| join kind = inner hint.shufflekey=EpisodeId (StormEvents | where DamageCrops > 62000000) on EpisodeId, EventId

Output

EpisodeId	EventId	...	EpisodeId1	EventId1	...
1030	4407	...	1030	4407	...
1030	13721	...	1030	13721	...
2477	12530	...	2477	12530	...
2103	10237	...	2103	10237	...
2103	10239	...	2103	10239	...
...	...	...	...	...	...

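Use summarize with shuffle to improve performance

In this example, using the summarize operator with the shuffle strategy improves performance. The source table has 150M records, and the cardinality of the group-by key is 10M, spread over 10 cluster nodes.

Using the summarize operator without the shuffle strategy, the query ends after 1:08 and the memory usage peak is ~3 GB: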

orders
| summarize arg_max(o_orderdate, o_totalprice) by o_custkey 
| where o_totalprice < 1000
| count

Output

Count
1086

When the shuffle strategy is used with summarize, the query ends after ~7 seconds and the memory usage peak is 0.43 GB:

orders
| summarize hint.strategy = shuffle arg_max(o_orderdate, o_totalprice) by o_custkey 
| where o_totalprice < 1000
| count

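Output

Count
1086

The following example demonstrates performance on a cluster that has two cluster nodes, with a table that has 60M records, where the cardinality of the group-by key is 2M.

Running the query without hint.num_partitions uses only two partitions (the number of cluster nodes), and the following query takes ~1:10 minutes: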

lineitem 
| summarize hint.strategy = shuffle dcount(l_comment), dcount(l_shipdate) by l_partkey 
| consume

If the number of partitions is set to 10, the query ends after 23 seconds:

lineitem 
| summarize hint.strategy = shuffle hint.num_partitions = 10 dcount(l_comment), dcount(l_shipdate) by l_partkey 
| consume

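Use join with shuffle to improve performance

The following example shows how using the shuffle strategy with the join operator improves performance.

The examples were sampled on a cluster with 10 nodes, where the data is spread over all of these nodes.

The query's left-side source table has 15M records, where the cardinality of the join key is ~14M. The query's right-side source has 150M records, and the cardinality of the join key is 10M. The query ends after ~28 seconds and the memory usage peak is 1.43 GB: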

customer
| join
    orders
on $left.c_custkey == $right.o_custkey
| summarize sum(c_acctbal) by c_nationkey

When the shuffle strategy is used with the join operator, the query ends after ~4 seconds and the memory usage peak is 0.3 GB:

customer
| join
    hint.strategy = shuffle orders
on $left.c_custkey == $right.o_custkey
| summarize sum(c_acctbal) by c_nationkey

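In another example, we try the same queries on a larger dataset with the following conditions:

The left-side source of the join is 150M, and the cardinality of the key is 148M.
The right-side source of the join is 1.5B, and the cardinality of the key is ~100M.

The query with only the join operator hits limits and times out after 4 minutes. However, when the shuffle strategy is used with the join operator, the query ends after ~34 seconds and the memory usage peak is 1.23 GB.

The following example shows the improvement on a cluster that has two cluster nodes, with a table where the cardinality of the join key is 2M. Running the query without hint.num_partitions uses only two partitions (the number of cluster nodes), and the following query takes ~1:10 minutes: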

lineitem
| summarize dcount(l_comment), dcount(l_shipdate) by l_partkey
| join
    hint.shufflekey = l_partkey   part
on $left.l_partkey == $right.p_partkey
| consume

When the number of partitions is set to 10, the query ends after 23 seconds:

lineitem
| summarize dcount(l_comment), dcount(l_shipdate) by l_partkey
| join
    hint.shufflekey = l_partkey  hint.num_partitions = 10    part
on $left.l_partkey == $right.p_partkey
| consume
