How to optimize ADX Cluster with very high CPU usage?

Ori Bandel 30 Reputation points
2024-06-03T08:36:02.5766667+00:00

We're currently running 3 nodes in our ADX cluster, which are under extreme load: very heavy calculations drive all nodes to 100% CPU for long periods (30-60 minutes).

The calculations can, and will, be optimized, but that is not the issue. The main problem is: what is the best way (or ways) to optimize our cluster to reduce the CPU workload? This is critical because ADX is our analytics system, and when CPU sits at ~100% the dashboards fail, simple queries time out, etc.

Another important factor: we're using ~55%-60% of our cache, and to speed up analytics we're considering increasing the hot cache period from 30 days to 60 days. We have no ingestion issues whatsoever. We're currently running 3 Standard_L8as_v3 nodes.

We can approach this from a few angles:

  • Scaling Up (SKU) - stronger or more suitable SKU
  • Scaling Out (Nodes) - more or auto-scaling
  • other solutions?
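As a concrete starting point before choosing an angle, it can help to identify which queries actually burn the CPU. A minimal sketch in KQL using the built-in `.show commands-and-queries` command (the 1-hour window and 10-row limit are illustrative choices, not specific to this setup):

```kusto
// Sketch: find the heaviest queries of the last hour by total CPU time.
// The time window and row limit are arbitrary examples.
.show commands-and-queries
| where StartedOn > ago(1h) and CommandType == "Query"
| top 10 by TotalCpu desc
| project StartedOn, User, TotalCpu, Duration, Text
```

This makes it easier to tell whether a handful of heavy calculations or broad user volume is saturating the nodes.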

From the research I've done, scaling up seems more relevant, but it would be great to hear what you think and about your experience scaling your ADX cluster resources :)

Another critical question: is there any risk to data ingestion (I assume not), any downtime, or any other data-related risk with the process, and how might each approach play out?

Many thanks!

p.s.

We tried increasing the number of nodes to 6 for roughly 1.5-2 hours, but this had no visible effect. We're constantly optimizing our functions/syntax to be more efficient, but user volume is growing faster than our optimization rate, so we're considering adding resources.


Accepted answer
  1. PRADEEPCHEEKATLA 90,226 Reputation points
    2024-06-04T05:13:19.3433333+00:00

    @Ori Bandel - Thanks for the question and using MS Q&A platform.

    Based on the information you provided, it seems that you are experiencing high CPU usage on your ADX cluster due to heavy calculations. You are considering scaling up (SKU) or scaling out (nodes) to optimize your cluster and reduce the CPU workload.

    Scaling up to a stronger or more suitable SKU can help improve the performance of your cluster. However, it may not be the most cost-effective solution. Scaling out by adding more nodes or using auto-scaling can also help distribute the workload and reduce the CPU usage.

    In terms of data ingestion, there should not be any risk as long as you follow the best practices for scaling your cluster. It is important to monitor your cluster and adjust the resources as needed to ensure optimal performance.

Before scaling up or out, you may want to consider optimizing your calculations to reduce the workload on your cluster. Additionally, you mentioned that you are using 55%-60% of your cache and are considering increasing the hot cache period from 30 days to 60 days. This can also help improve the performance of your analytics system.
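For reference, the hot cache period mentioned in the question can be changed with a caching policy command; a sketch (the database name is a placeholder):

```kusto
// Sketch: extend the hot cache from 30 to 60 days at database scope.
// "MyAnalyticsDb" is a placeholder; the policy can also be set per table.
.alter database MyAnalyticsDb policy caching hot = 60d
```

Note that a longer hot period consumes more of the SSD cache on each node, so it interacts with the SKU choice being discussed.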

    Overall, it is important to find the right balance between performance and cost when optimizing your ADX cluster. You may want to experiment with different configurations and monitor the performance to determine the best solution for your specific needs.

    I hope this helps! Let me know if you have any other questions.

    1 person found this answer helpful.

2 additional answers

  1. Ori Bandel 30 Reputation points
    2024-06-26T06:40:11.2666667+00:00

    Following up on the solution:

    • We decided to scale up - based on the documentation and the answer above, scaling up seemed the more suitable solution for high-CPU-load issues
    • It turned out quite well - we chose the next SKU/VM size up, which doubled our CPU power
    • The >99% CPU periods we experienced dropped by tens of percent (to less than half), which is a great outcome for our data infra
    • The price is clearly not cheap, but it has proven valuable: we see fewer failures and better data stability
    • This can't be our final solution - as our user base grows we will explore more options: other scaling, code improvements, better integrations, etc.

    I marked the relevant answer as 'accepted answer' :)

    Reader of the future - if you read this and want to consult, please do so! You can reach out here or via the common/professional channels :)

    Tnx!

    1 person found this answer helpful.

  2. Ori Bandel 30 Reputation points
    2024-08-02T20:08:38.02+00:00

    Adding:

    A few weeks ago we also implemented scaling out and increased the number of nodes. This may have had a minor impact on CPU usage. It did improve the overall experience of ADX users with the dashboards/queries, but nothing I can attach a metric/number to.

    IMPORTANT -- the major change that almost completely eliminated the 100% CPU usage was restructuring and optimizing the summary tables we build via ETLs in the cluster (a process that creates summary tables, for example for different periods, out of the raw tables).

    When we restructured the process to run much less often (AND with higher relevance), it made a huge difference.

    One lesson here is that ADX is NOT optimized for building complex summary tables via a straight .set-or-replace (or a similar command); if you do use that approach, it is much better to wrap it in smart logic.
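One common alternative to rebuilding summary tables with periodic `.set-or-replace` ETLs is a materialized view, which ADX maintains incrementally over newly ingested data; a sketch (table, view, and column names are hypothetical):

```kusto
// Sketch: let ADX maintain a daily summary incrementally instead of
// rebuilding it in bulk. All names here are placeholders.
.create materialized-view DailyEventSummary on table RawEvents
{
    RawEvents
    | summarize EventCount = count(), LastSeen = max(Timestamp)
        by EventType, bin(Timestamp, 1d)
}
```

Because the view is updated incrementally, the recurring CPU spikes of a full rebuild are largely avoided, which matches the lesson described above.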

    1 person found this answer helpful.
