Using hll() and tdigest()
Applies to: ✅ Microsoft Fabric ✅ Azure Data Explorer ✅ Azure Monitor ✅ Microsoft Sentinel
Suppose you want to calculate the count of distinct users every day over the last seven days. You can run `summarize dcount(user)` once a day with a span filtered to the last seven days. This method is inefficient, because each time the calculation is run, there's a six-day overlap with the previous calculation. You can also calculate an aggregate for each day, and then combine these aggregates. This method requires you to "remember" the last six results, but it's much more efficient.
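For example, here's a minimal sketch of the second approach. The table name `DailyUserHll` and the source table `Users` with columns `Timestamp` and `user` are hypothetical; the idea is to persist one `hll()` value per day and merge the stored values instead of rescanning seven days of raw data.

```kusto
// Step 1 (run once a day): persist today's intermediate hll() state.
// The table and column names here are placeholders.
.set-or-append DailyUserHll <|
    Users
    | where Timestamp >= startofday(now())
    | summarize UsersHll = hll(user) by Day = startofday(Timestamp)
```

```kusto
// Step 2 (run whenever the report is needed): merge the last seven stored values.
DailyUserHll
| where Day > ago(7d)
| summarize MergedHll = hll_merge(UsersHll)
| project DistinctUsersLast7Days = dcount_hll(MergedHll)
```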
Partitioning queries as described is easy for simple aggregates, such as `count()` and `sum()`. It can also be useful for complex aggregates, such as `dcount()` and `percentiles()`. This article explains how Kusto supports such calculations.
The following examples show how to use `hll`/`tdigest` and demonstrate that using these functions is highly performant in some scenarios:
Important
The results of `hll`, `hll_if`, `hll_merge`, `tdigest`, and `tdigest_merge` are objects of type `dynamic`, which can then be processed by other functions (`dcount_hll`, `percentile_tdigest`, `percentiles_array_tdigest`, and `percentrank_tdigest`). The encoding of this object might change over time (for example, due to a software upgrade); however, such changes will be done in a backward-compatible manner, so you can store such values persistently and reference them in queries reliably.
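For example, a minimal illustration of that round trip: `hll()` produces a `dynamic` intermediate value, and `dcount_hll()` turns it back into an estimated distinct count.

```kusto
// hll() returns a dynamic intermediate state; dcount_hll() extracts the estimate from it.
range x from 1 to 10 step 1
| summarize HllState = hll(x)
| project ApproximateDistinctCount = dcount_hll(HllState)
```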
Note
In some cases, the dynamic objects generated by the `hll` or `tdigest` aggregate functions may be big and exceed the default `MaxValueSize` property in the encoding policy. If so, the object will be ingested as null.
For example, when persisting the output of the `hll` function with accuracy level 4, the size of the `hll` object exceeds the default `MaxValueSize`, which is 1 MB. To avoid this issue, modify the encoding policy of the column as shown in the following examples.
range x from 1 to 1000000 step 1
| summarize hll(x,4)
| project sizeInMb = estimate_data_size(hll_x) / pow(1024,2)
Output
| sizeInMb |
|---|
| 1.0000524520874 |
Ingesting this object into a table before applying this kind of policy will ingest null:
.set-or-append MyTable <| range x from 1 to 1000000 step 1
| summarize hll(x,4)
Querying the table shows that a null value was ingested:

MyTable
| project isempty(hll_x)
Output
| Column1 |
|---|
| 1 |
To avoid ingesting null, use the special encoding policy type `bigobject`, which overrides the `MaxValueSize` to 2 MB like this:
.alter column MyTable.hll_x policy encoding type='bigobject'
Ingest another value into the same table:
.set-or-append MyTable <| range x from 1 to 1000000 step 1
| summarize hll(x,4)
This time, the value is ingested successfully. Querying the table now shows both rows:
MyTable
| project isempty(hll_x)
Output
| Column1 |
|---|
| 1 |
| 0 |
Example: Count with binned timestamp
There's a table, `PageViewsHllTDigest`, containing `hll` values of pages viewed in each hour. You want these values binned to `12h`. Merge the `hll` values using the `hll_merge()` aggregate function, with the timestamp binned to `12h`. Use the function `dcount_hll` to return the final `dcount` value:
PageViewsHllTDigest
| summarize merged_hll = hll_merge(hllPage) by bin(Timestamp, 12h)
| project Timestamp , dcount_hll(merged_hll)
Output
| Timestamp | dcount_hll_merged_hll |
|---|---|
| 2016-05-01 12:00:00.0000000 | 20056275 |
| 2016-05-02 00:00:00.0000000 | 38797623 |
| 2016-05-02 12:00:00.0000000 | 39316056 |
| 2016-05-03 00:00:00.0000000 | 13685621 |
To bin the timestamp to `1d`:
PageViewsHllTDigest
| summarize merged_hll = hll_merge(hllPage) by bin(Timestamp, 1d)
| project Timestamp , dcount_hll(merged_hll)
Output
| Timestamp | dcount_hll_merged_hll |
|---|---|
| 2016-05-01 00:00:00.0000000 | 20056275 |
| 2016-05-02 00:00:00.0000000 | 64135183 |
| 2016-05-03 00:00:00.0000000 | 13685621 |
The same query may be done over the values of `tdigest`, which represent the `BytesDelivered` in each hour:
PageViewsHllTDigest
| summarize merged_tdigests = merge_tdigest(tdigestBytesDel) by bin(Timestamp, 12h)
| project Timestamp , percentile_tdigest(merged_tdigests, 95, typeof(long))
Output
| Timestamp | percentile_tdigest_merged_tdigests |
|---|---|
| 2016-05-01 12:00:00.0000000 | 170200 |
| 2016-05-02 00:00:00.0000000 | 152975 |
| 2016-05-02 12:00:00.0000000 | 181315 |
| 2016-05-03 00:00:00.0000000 | 146817 |
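The Important callout above also lists `percentiles_array_tdigest()`. As a sketch over the same table, you can read several percentiles from the merged `tdigest` values in a single pass:

```kusto
// Merge the hourly tdigest values per day, then return several percentiles at once.
PageViewsHllTDigest
| summarize merged_tdigests = merge_tdigest(tdigestBytesDel) by bin(Timestamp, 1d)
| project Timestamp, percentiles_array_tdigest(merged_tdigests, 50, 90, 95)
```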
Example: Temporary table
With datasets that are too large, Kusto limits are reached when you run regular periodic queries that calculate `percentile()` or `dcount()` over the entire dataset.
To solve this problem, newly added data can be added to a temp table as `hll` or `tdigest` values, using `hll()` when the required operation is `dcount` and `tdigest()` when the required operation is percentile, by way of set/append commands or an update policy. The intermediate results of `dcount` or `tdigest` are then saved into another dataset, which should be smaller than the large target one.
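For example, here's a minimal sketch of how such a temp table might be populated with a set-or-append command. The hourly binning and the column names `hllPage` and `tdigestBytesDel` match the `PageViewsHllTDigest` table used below, but the exact schema and filter are assumptions:

```kusto
// Append one hll value (for dcount) and one tdigest value (for percentiles)
// per hour of newly arrived data.
.set-or-append PageViewsHllTDigest <|
    PageViews
    | where Timestamp > ago(1d)
    | summarize hllPage = hll(Page), tdigestBytesDel = tdigest(BytesDelivered)
        by bin(Timestamp, 1h)
```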
When you need the final results, use the `hll`/`tdigest` mergers `hll_merge()`/`tdigest_merge()`. Then, after getting the merged values, invoke `dcount_hll()`/`percentile_tdigest()` on them to get the final result of `dcount` or percentiles.
Assume there's a table, PageViews, into which data is ingested daily, and that for each day you want to calculate the distinct count of pages viewed after date = datetime(2016-05-01 18:00:00.0000000).
Run the following query:
PageViews
| where Timestamp > datetime(2016-05-01 18:00:00.0000000)
| summarize percentile(BytesDelivered, 90), dcount(Page,2) by bin(Timestamp, 1d)
Output
| Timestamp | percentile_BytesDelivered_90 | dcount_Page |
|---|---|---|
| 2016-05-01 00:00:00.0000000 | 83634 | 20056275 |
| 2016-05-02 00:00:00.0000000 | 82770 | 64135183 |
| 2016-05-03 00:00:00.0000000 | 72920 | 13685621 |
This query aggregates all the values every time you run it (for example, if you want to run it many times a day).
If you save the `hll` and `tdigest` values (which are the intermediate results of `dcount` and percentile) into a temp table, `PageViewsHllTDigest`, using an update policy or set/append commands, you only need to merge the values and then use `dcount_hll`/`percentile_tdigest` with the following query:
PageViewsHllTDigest
| summarize percentile_tdigest(merge_tdigest(tdigestBytesDel), 90), dcount_hll(hll_merge(hllPage)) by bin(Timestamp, 1d)
Output
| Timestamp | percentile_tdigest_merge_tdigests_tdigestBytesDel | dcount_hll_hll_merge_hllPage |
|---|---|---|
| 2016-05-01 00:00:00.0000000 | 84224 | 20056275 |
| 2016-05-02 00:00:00.0000000 | 83486 | 64135183 |
| 2016-05-03 00:00:00.0000000 | 72247 | 13685621 |
This query should be more performant, as it runs over a smaller table. In this example, the first query runs over ~215M records, while the second one runs over just 32 records.
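If you want to confirm the difference in size, a simple sketch is to compare the row counts of the two tables (table names as used above):

```kusto
// Compare how many rows each approach has to scan.
union
    (PageViews | count | project Table = "PageViews", Rows = Count),
    (PageViewsHllTDigest | count | project Table = "PageViewsHllTDigest", Rows = Count)
```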
Example: Intermediate results
The retention query: assume you have a table that summarizes when each Wikipedia page was viewed (sample size is 10M), and you want to find, for each pair of dates date1 and date2, the percentage of pages viewed on both date1 and date2 relative to the pages viewed on date1 (where date1 < date2).
The trivial way uses join and summarize operators:
// Get the total pages viewed each day
let totalPagesPerDay = PageViewsSample
| summarize by Page, Day = startofday(Timestamp)
| summarize count() by Day;
// Join the table to itself to get, for each page, every
// pair of days on which it was viewed.
// Then count the pages for each pair of days to get how
// many pages were viewed on both date1 and date2.
PageViewsSample
| summarize by Page, Day1 = startofday(Timestamp)
| join kind = inner
(
PageViewsSample
| summarize by Page, Day2 = startofday(Timestamp)
)
on Page
| where Day2 > Day1
| summarize count() by Day1, Day2
| join kind = inner
totalPagesPerDay
on $left.Day1 == $right.Day
| project Day1, Day2, Percentage = count_*100.0/count_1
Output
| Day1 | Day2 | Percentage |
|---|---|---|
| 2016-05-01 00:00:00.0000000 | 2016-05-02 00:00:00.0000000 | 34.0645725975255 |
| 2016-05-01 00:00:00.0000000 | 2016-05-03 00:00:00.0000000 | 16.618368960101 |
| 2016-05-02 00:00:00.0000000 | 2016-05-03 00:00:00.0000000 | 14.6291376489636 |
The above query took ~18 seconds.
When you use the `hll()`, `hll_merge()`, and `dcount_hll()` functions, the equivalent query ends after ~1.3 seconds, showing that the `hll` functions speed up the query above by ~14 times:
let Stats=PageViewsSample | summarize pagehll=hll(Page, 2) by day=startofday(Timestamp); // saving the hll values (intermediate results of the dcount values)
let day0=toscalar(Stats | summarize min(day)); // finding the min date over all dates.
let dayn=toscalar(Stats | summarize max(day)); // finding the max date over all dates.
let daycount=tolong((dayn-day0)/1d); // finding the range between max and min
Stats
| project idx=tolong((day-day0)/1d), day, pagehll
| mv-expand pidx=range(0, daycount) to typeof(long)
// Extend the column to get the dcount value from hll'ed values for each date (same as totalPagesPerDay from the above query)
| extend key1=iff(idx < pidx, idx, pidx), key2=iff(idx < pidx, pidx, idx), pages=dcount_hll(pagehll)
// For each pair of dates, merge the hll'ed values to get the total dcount over both dates.
// This helps to get the pages viewed in both date1 and date2 (see the description below about intersection_size).
| summarize (day1, pages1)=arg_min(day, pages), (day2, pages2)=arg_max(day, pages), union_size=dcount_hll(hll_merge(pagehll)) by key1, key2
| where day2 > day1
// To get pages viewed in date1 and also date2, look at the merged dcount of date1 and date2, subtract it from pages of date1 + pages on date2.
| project pages1, day1,day2, intersection_size=(pages1 + pages2 - union_size)
| project day1, day2, Percentage = intersection_size*100.0 / pages1
Output
| day1 | day2 | Percentage |
|---|---|---|
| 2016-05-01 00:00:00.0000000 | 2016-05-02 00:00:00.0000000 | 33.2298494510578 |
| 2016-05-01 00:00:00.0000000 | 2016-05-03 00:00:00.0000000 | 16.9773830213667 |
| 2016-05-02 00:00:00.0000000 | 2016-05-03 00:00:00.0000000 | 14.5160020350006 |
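The `intersection_size` column computed above follows from the inclusion-exclusion principle: the number of pages viewed on both days equals the count for day1, plus the count for day2, minus the count of pages viewed on either day (the merged `dcount`), that is, `intersection_size = pages1 + pages2 - union_size`.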
Note
The results of the queries aren't 100% accurate due to the error of the `hll` functions. For more information about the errors, see `dcount()`.