Using hll() and tdigest()
Applies to: ✅ Microsoft Fabric ✅ Azure Data Explorer ✅ Azure Monitor ✅ Microsoft Sentinel
Suppose you want to calculate the count of distinct users every day over the last seven days. You can run `summarize dcount(user)` once a day with a span filtered to the last seven days. This method is inefficient, because each time the calculation is run, there's a six-day overlap with the previous calculation. You can also calculate an aggregate for each day, and then combine these aggregates. This method requires you to "remember" the last six results, but it's much more efficient.
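For example, here's a minimal sketch of the second approach. The table name `DailyUserHll` and the source table `Users` with columns `Timestamp` and `user` are hypothetical; the idea is to persist one `hll()` value per day and merge the stored values instead of rescanning seven days of raw data.

```kusto
// Step 1 (run once a day): persist today's intermediate hll() state.
// The table and column names here are placeholders.
.set-or-append DailyUserHll <|
    Users
    | where Timestamp >= startofday(now())
    | summarize UsersHll = hll(user) by Day = startofday(Timestamp)
```

```kusto
// Step 2 (run whenever the report is needed): merge the last seven stored values.
DailyUserHll
| where Day > ago(7d)
| summarize MergedHll = hll_merge(UsersHll)
| project DistinctUsersLast7Days = dcount_hll(MergedHll)
```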
Partitioning queries as described is easy for simple aggregates, such as `count()` and `sum()`. It can also be useful for complex aggregates, such as `dcount()` and `percentiles()`. This article explains how Kusto supports such calculations.
The following examples show how to use `hll`/`tdigest` and demonstrate that using these functions is highly performant in some scenarios:
Important
The results of `hll`, `hll_if`, `hll_merge`, `tdigest`, and `tdigest_merge` are objects of type `dynamic`, which can then be processed by other functions (`dcount_hll`, `percentile_tdigest`, `percentiles_array_tdigest`, and `percentrank_tdigest`). The encoding of this object might change over time (for example, due to a software upgrade); however, such changes will be done in a backward-compatible manner, so you can store such values persistently and reference them in queries reliably.
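For example, a minimal illustration of that round trip: `hll()` produces a `dynamic` intermediate value, and `dcount_hll()` turns it back into an estimated distinct count.

```kusto
// hll() returns a dynamic intermediate state; dcount_hll() extracts the estimate from it.
range x from 1 to 10 step 1
| summarize HllState = hll(x)
| project ApproximateDistinctCount = dcount_hll(HllState)
```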
Note
In some cases, the dynamic objects generated by the `hll` or `tdigest` aggregate functions may be big and exceed the default `MaxValueSize` property in the encoding policy. If so, the object will be ingested as null.
For example, when persisting the output of the `hll` function with accuracy level 4, the size of the `hll` object exceeds the default `MaxValueSize`, which is 1 MB. To avoid this issue, modify the encoding policy of the column as shown in the following examples.
range x from 1 to 1000000 step 1
| summarize hll(x,4)
| project sizeInMb = estimate_data_size(hll_x) / pow(1024,2)
Output
| sizeInMb |
|---|
| 1.0000524520874 |
Ingesting this object into a table before applying this kind of policy will ingest null:
.set-or-append MyTable <| range x from 1 to 1000000 step 1
| summarize hll(x,4)
Querying the table shows that a null value was ingested:

MyTable
| project isempty(hll_x)
Output
| Column1 |
|---|
| 1 |
To avoid ingesting null, use the special encoding policy type `bigobject`, which overrides the `MaxValueSize` to 2 MB like this:
.alter column MyTable.hll_x policy encoding type='bigobject'
Ingest another value into the same table:
.set-or-append MyTable <| range x from 1 to 1000000 step 1
| summarize hll(x,4)
This time, the value is ingested successfully. Querying the table now shows both rows:
MyTable
| project isempty(hll_x)
Output
| Column1 |
|---|
| 1 |
| 0 |
Example: Count with binned timestamp
There's a table, `PageViewsHllTDigest`, containing `hll` values of pages viewed in each hour. You want these values binned to `12h`. Merge the `hll` values using the `hll_merge()` aggregate function, with the timestamp binned to `12h`. Use the function `dcount_hll` to return the final `dcount` value:
PageViewsHllTDigest
| summarize merged_hll = hll_merge(hllPage) by bin(Timestamp, 12h)
| project Timestamp , dcount_hll(merged_hll)
Output
| Timestamp | dcount_hll_merged_hll |
|---|---|
| 2016-05-01 12:00:00.0000000 | 20056275 |
| 2016-05-02 00:00:00.0000000 | 38797623 |
| 2016-05-02 12:00:00.0000000 | 39316056 |
| 2016-05-03 00:00:00.0000000 | 13685621 |
To bin the timestamp to `1d`:
PageViewsHllTDigest
| summarize merged_hll = hll_merge(hllPage) by bin(Timestamp, 1d)
| project Timestamp , dcount_hll(merged_hll)
Output
| Timestamp | dcount_hll_merged_hll |
|---|---|
| 2016-05-01 00:00:00.0000000 | 20056275 |
| 2016-05-02 00:00:00.0000000 | 64135183 |
| 2016-05-03 00:00:00.0000000 | 13685621 |
The same query may be done over the values of `tdigest`, which represent the `BytesDelivered` in each hour:
PageViewsHllTDigest
| summarize merged_tdigests = merge_tdigest(tdigestBytesDel) by bin(Timestamp, 12h)
| project Timestamp , percentile_tdigest(merged_tdigests, 95, typeof(long))
Output
| Timestamp | percentile_tdigest_merged_tdigests |
|---|---|
| 2016-05-01 12:00:00.0000000 | 170200 |
| 2016-05-02 00:00:00.0000000 | 152975 |
| 2016-05-02 12:00:00.0000000 | 181315 |
| 2016-05-03 00:00:00.0000000 | 146817 |
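The Important callout above also lists `percentiles_array_tdigest()`. As a sketch over the same table, you can read several percentiles from the merged `tdigest` values in a single pass:

```kusto
// Merge the hourly tdigest values per day, then return several percentiles at once.
PageViewsHllTDigest
| summarize merged_tdigests = merge_tdigest(tdigestBytesDel) by bin(Timestamp, 1d)
| project Timestamp, percentiles_array_tdigest(merged_tdigests, 50, 90, 95)
```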
Example: Temporary table
With datasets that are too large, Kusto limits are reached when you run regular periodic queries that calculate `percentile()` or `dcount()` over the entire dataset.
To solve this problem, newly added data can be added to a temp table as `hll` or `tdigest` values, using `hll()` when the required operation is `dcount` and `tdigest()` when the required operation is percentile, by way of set/append commands or an update policy. The intermediate results of `dcount` or `tdigest` are then saved into another dataset, which should be smaller than the large target one.
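For example, here's a minimal sketch of how such a temp table might be populated with a set-or-append command. The hourly binning and the column names `hllPage` and `tdigestBytesDel` match the `PageViewsHllTDigest` table used below, but the exact schema and filter are assumptions:

```kusto
// Append one hll value (for dcount) and one tdigest value (for percentiles)
// per hour of newly arrived data.
.set-or-append PageViewsHllTDigest <|
    PageViews
    | where Timestamp > ago(1d)
    | summarize hllPage = hll(Page), tdigestBytesDel = tdigest(BytesDelivered)
        by bin(Timestamp, 1h)
```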
When you need the final results, use the `hll`/`tdigest` mergers `hll_merge()`/`tdigest_merge()`. Then, after getting the merged values, invoke `dcount_hll()`/`percentile_tdigest()` on them to get the final result of `dcount` or percentiles.
Assume there's a table, PageViews, into which data is ingested daily, and that for each day you want to calculate the distinct count of pages viewed after date = datetime(2016-05-01 18:00:00.0000000).
Run the following query:
PageViews
| where Timestamp > datetime(2016-05-01 18:00:00.0000000)
| summarize percentile(BytesDelivered, 90), dcount(Page,2) by bin(Timestamp, 1d)
Output
| Timestamp | percentile_BytesDelivered_90 | dcount_Page |
|---|---|---|
| 2016-05-01 00:00:00.0000000 | 83634 | 20056275 |
| 2016-05-02 00:00:00.0000000 | 82770 | 64135183 |
| 2016-05-03 00:00:00.0000000 | 72920 | 13685621 |
This query aggregates all the values every time you run it (for example, if you want to run it many times a day).
If you save the `hll` and `tdigest` values (which are the intermediate results of `dcount` and percentile) into a temp table, `PageViewsHllTDigest`, using an update policy or set/append commands, you only need to merge the values and then use `dcount_hll`/`percentile_tdigest` with the following query:
PageViewsHllTDigest
| summarize percentile_tdigest(merge_tdigest(tdigestBytesDel), 90), dcount_hll(hll_merge(hllPage)) by bin(Timestamp, 1d)
Output
| Timestamp | percentile_tdigest_merge_tdigests_tdigestBytesDel | dcount_hll_hll_merge_hllPage |
|---|---|---|
| 2016-05-01 00:00:00.0000000 | 84224 | 20056275 |
| 2016-05-02 00:00:00.0000000 | 83486 | 64135183 |
| 2016-05-03 00:00:00.0000000 | 72247 | 13685621 |
This query should be more performant, as it runs over a smaller table. In this example, the first query runs over ~215M records, while the second one runs over just 32 records.
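If you want to confirm the difference in size, a simple sketch is to compare the row counts of the two tables (table names as used above):

```kusto
// Compare how many rows each approach has to scan.
union
    (PageViews | count | project Table = "PageViews", Rows = Count),
    (PageViewsHllTDigest | count | project Table = "PageViewsHllTDigest", Rows = Count)
```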
Example: Intermediate results
The retention query: assume you have a table that summarizes when each Wikipedia page was viewed (sample size is 10M), and you want to find, for each pair of dates date1 and date2, the percentage of pages viewed on both date1 and date2 relative to the pages viewed on date1 (where date1 < date2).
The trivial way uses join and summarize operators:
// Get the total pages viewed each day
let totalPagesPerDay = PageViewsSample
| summarize by Page, Day = startofday(Timestamp)
| summarize count() by Day;
// Join the table to itself to get, for each page, every
// pair of days on which it was viewed.
// Then count the pages for each pair of days to get how
// many pages were viewed on both date1 and date2.
PageViewsSample
| summarize by Page, Day1 = startofday(Timestamp)
| join kind = inner
(
PageViewsSample
| summarize by Page, Day2 = startofday(Timestamp)
)
on Page
| where Day2 > Day1
| summarize count() by Day1, Day2
| join kind = inner
totalPagesPerDay
on $left.Day1 == $right.Day
| project Day1, Day2, Percentage = count_*100.0/count_1
Output
| Day1 | Day2 | Percentage |
|---|---|---|
| 2016-05-01 00:00:00.0000000 | 2016-05-02 00:00:00.0000000 | 34.0645725975255 |
| 2016-05-01 00:00:00.0000000 | 2016-05-03 00:00:00.0000000 | 16.618368960101 |
| 2016-05-02 00:00:00.0000000 | 2016-05-03 00:00:00.0000000 | 14.6291376489636 |
The above query took ~18 seconds.
When you use the `hll()`, `hll_merge()`, and `dcount_hll()` functions, the equivalent query ends after ~1.3 seconds, showing that the `hll` functions speed up the query above by ~14 times:
let Stats=PageViewsSample | summarize pagehll=hll(Page, 2) by day=startofday(Timestamp); // saving the hll values (intermediate results of the dcount values)
let day0=toscalar(Stats | summarize min(day)); // finding the min date over all dates.
let dayn=toscalar(Stats | summarize max(day)); // finding the max date over all dates.
let daycount=tolong((dayn-day0)/1d); // finding the range between max and min
Stats
| project idx=tolong((day-day0)/1d), day, pagehll
| mv-expand pidx=range(0, daycount) to typeof(long)
// Extend the column to get the dcount value from hll'ed values for each date (same as totalPagesPerDay from the above query)
| extend key1=iff(idx < pidx, idx, pidx), key2=iff(idx < pidx, pidx, idx), pages=dcount_hll(pagehll)
// For each pair of dates, merge the hll'ed values to get the total dcount over both dates.
// This helps to get the pages viewed in both date1 and date2 (see the description below about intersection_size).
| summarize (day1, pages1)=arg_min(day, pages), (day2, pages2)=arg_max(day, pages), union_size=dcount_hll(hll_merge(pagehll)) by key1, key2
| where day2 > day1
// To get pages viewed in date1 and also date2, look at the merged dcount of date1 and date2, subtract it from pages of date1 + pages on date2.
| project pages1, day1,day2, intersection_size=(pages1 + pages2 - union_size)
| project day1, day2, Percentage = intersection_size*100.0 / pages1
Output
| day1 | day2 | Percentage |
|---|---|---|
| 2016-05-01 00:00:00.0000000 | 2016-05-02 00:00:00.0000000 | 33.2298494510578 |
| 2016-05-01 00:00:00.0000000 | 2016-05-03 00:00:00.0000000 | 16.9773830213667 |
| 2016-05-02 00:00:00.0000000 | 2016-05-03 00:00:00.0000000 | 14.5160020350006 |
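The `intersection_size` column computed above follows from the inclusion-exclusion principle: the number of pages viewed on both days equals the count for day1, plus the count for day2, minus the count of pages viewed on either day (the merged `dcount`), that is, `intersection_size = pages1 + pages2 - union_size`.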
Note
The results of the queries aren't 100% accurate due to the error of the `hll` functions. For more information about the errors, see `dcount()`.