autocluster plugin

Applies to: ✅ Microsoft Fabric ✅ Azure Data Explorer

autocluster finds common patterns of discrete attributes (dimensions) in the data. It then reduces the results of the original query, whether it's 100 or 100,000 rows, to a few patterns. The plugin was developed to help analyze failures (such as exceptions or crashes) but can potentially work on any filtered dataset. The plugin is invoked with the evaluate operator.

Note

autocluster is largely based on the Seed-Expand algorithm from the following paper: Algorithms for Telemetry Data Mining using Discrete Attributes.

Syntax

T | evaluate autocluster ([SizeWeight [, WeightColumn [, NumSeeds [, CustomWildcard [, ... ]]]]])

Learn more about syntax conventions.

Parameters

The parameters must be ordered as specified in the syntax. To use the default value for a parameter, pass the tilde string '~' in its place. For more information, see Examples.

| Name | Type | Required | Description |
|--|--|--|--|
| T | string | ✔️ | The input tabular expression. |
| SizeWeight | double | | A double between 0 and 1 that controls the balance between generic (high coverage) and informative (many shared) values. Increasing this value typically reduces the number of patterns while expanding coverage. Conversely, decreasing this value generates more specific patterns with more shared values and a lower percentage coverage. The default is 0.5. The formula is a weighted geometric mean with weights SizeWeight and 1-SizeWeight. |
| WeightColumn | string | | Weights each row in the input according to the value in the specified column. Each row has a default weight of 1. The argument must be the name of an integer column. A weight column is commonly used to account for sampling, bucketing, or aggregation of the data that is already embedded in each row. |
| NumSeeds | int | | Determines the number of initial local search points. Adjusting the number of seeds affects the quantity or quality of the results, depending on the structure of the data. Increasing the number of seeds can improve results at the cost of a slower query. Decreasing below five yields negligible improvements, while increasing above 50 rarely generates more patterns. The default is 25. |
| CustomWildcard | string | | A type literal that sets the wildcard value for a specific type in the results table, indicating no restriction on this column. The default is null, which represents an empty string. If the default is a legitimate value in the data, use a different wildcard value, such as *. You can include multiple custom wildcards by adding them consecutively. |
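
As a minimal sketch of the WeightColumn parameter, the following query pre-aggregates the StormEvents sample table used in the Examples section and then passes the resulting count as the row weight. The column name EventCount is hypothetical, and passing the weight column as a column reference rather than a quoted string is an assumption you should verify against your cluster.

```kusto
// Sketch only: EventCount is a hypothetical column name introduced for illustration.
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
// Collapse identical rows into one row per combination, keeping the row count.
| summarize EventCount = count() by State, EventType, Damage
// '~' keeps the default SizeWeight; EventCount (type long) weights each aggregated row.
| evaluate autocluster('~', EventCount)
```

Because each aggregated row stands in for EventCount original rows, the weighting is intended to make the pattern counts reflect the original events rather than the number of aggregated rows.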

Returns

The autocluster plugin usually returns a small set of patterns. The patterns capture portions of the data with shared common values across multiple discrete attributes. Each pattern in the results is represented by a row.

The first column is the segment ID. The next two columns are the count and percentage of rows from the original query that the pattern captures. The remaining columns are from the original query. Each holds either a specific value from the column or a wildcard value (null by default), which means that the column's value varies within the pattern.

The patterns aren't distinct, may be overlapping, and usually don't cover all the original rows. Some rows may not fall under any pattern.

Tip

Use where and project in the input pipe to reduce the data to just what you're interested in.

When you find an interesting row, you might want to drill into it further by adding its specific values to your where filter.
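
The drill-down described in this tip might look like the following sketch. It assumes the StormEvents sample table from the Examples section and a pattern with EventType equal to "Hail" and Damage equal to "NO". The choice of Source and BeginLocation as follow-up dimensions is illustrative only.

```kusto
// Sketch only: narrow the input to one pattern, then cluster on other dimensions.
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
// Add the specific values of the interesting pattern to the filter.
| where EventType == "Hail" and Damage == "NO"
// Project a different set of discrete attributes to explore within that pattern.
| project State, Source, BeginLocation
| evaluate autocluster()
```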

Examples

Using evaluate

T | evaluate autocluster()

Using autocluster

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.6)

Output

| SegmentId | Count | Percent | State | EventType | Damage |
|--|--|--|--|--|--|
| 0 | 2278 | 38.7 | | Hail | NO |
| 1 | 512 | 8.7 | | Thunderstorm Wind | YES |
| 2 | 898 | 15.3 | TEXAS | | |

Using custom wildcards

StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.2, '~', '~', '*')

Output

| SegmentId | Count | Percent | State | EventType | Damage |
|--|--|--|--|--|--|
| 0 | 2278 | 38.7 | * | Hail | NO |
| 1 | 512 | 8.7 | * | Thunderstorm Wind | YES |
| 2 | 898 | 15.3 | TEXAS | * | * |
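
When the input mixes column types, several custom wildcards can be listed consecutively, one type literal per type. The following is a hedged sketch rather than a verified result: it assumes InjuriesDirect is an int column in StormEvents and that -1 never occurs in it, so -1 can safely mark "any value".

```kusto
// Sketch only: '*' is the wildcard for string columns, int(-1) for int columns.
StormEvents
| where monthofyear(StartTime) == 5
| project State, EventType, InjuriesDirect
// The three '~' arguments keep the defaults for SizeWeight, WeightColumn, and NumSeeds.
| evaluate autocluster('~', '~', '~', '*', int(-1))
```

Choose each wildcard so that it can't be confused with a real value in a column of that type.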