Analysis of Cost Saved

Jayaswaroop Adabala 20 Reputation points
2024-12-11T05:47:30.34+00:00

Hi,

Current Solution: Delta table from HiveMetastore -> Staging Table with Business Logic in Databricks -> PowerBI

Current Solution in Detail:

  • My main responsibility is to consume data from the delta tables in hive_metastore in Databricks and build the business transformations in a notebook using PySpark/Spark SQL.
  • Once I finish all the transformations, I create a staging table in a newly created database. The table I create has around 200k records with a size of 57834891 bytes (~55 MB).

the cluster we use -

  • 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)
  • Driver Node: Standard_DS3_v2 14GB Memory, 4 Cores
  • Worker Node: Standard_DS3_v2 14GB Memory, 4 Cores
  • No.of workers : 8

We also have a scheduler running in Databricks that refreshes this table once a week.

From this table, I load the data into the Power BI desktop app, which is also refreshed once a week.

New Solution to Propose: Delta table from HiveMetastore -> PowerBI

New Proposed Solution in Detail:

  • Here I am planning to ingest the data directly into Power BI from hive_metastore and perform the business logic with a SQL query, using the following Power BI function: Databricks.Query("adb-123xxxxxxx2.2.azuredatabrick")
  • By doing this I can:
    • Avoid creating the notebook
    • Avoid creating the staging table
    • Reduce the load on the cluster
    • Remove the need for a scheduler in Databricks
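As a rough illustration of the proposed approach, a Power Query M expression along these lines runs the business logic as a native SQL query against a Databricks SQL endpoint. Note that the hostname, warehouse HTTP path, table, and query below are hypothetical placeholders, and the exact shape of the Databricks.Query call can differ by connector version, so treat this as a sketch rather than a working query:

```m
let
    // Hypothetical workspace host and SQL warehouse HTTP path — replace with your own
    Source = Databricks.Query(
        "adb-1234567890123456.2.azuredatabricks.net",
        "/sql/1.0/warehouses/abcdef1234567890"
    ),
    // Business logic pushed down as Spark SQL instead of a Databricks notebook;
    // table and columns are illustrative only
    Result = Source("
        SELECT customer_id, SUM(amount) AS total_amount
        FROM hive_metastore.my_schema.sales_delta
        GROUP BY customer_id
    ")
in
    Result
```

One practical caveat: with this pattern the transformation runs at every Power BI refresh, so a SQL warehouse (or cluster) still has to be available at refresh time.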

I want to understand how much I could save by making this change. I expect an approximate cost so that I can propose this to our team.


Accepted answer
  1. Smaran Thoomu 18,630 Reputation points Microsoft Vendor
    2024-12-11T11:12:28.0033333+00:00

    Hi @Jayaswaroop Adabala
    Welcome to Microsoft Q&A platform and thanks for posting your query here.

    It's great to see that you are looking for ways to optimize your solution and save costs. Based on the information you provided, I can suggest a few potential cost savings that you could achieve by switching to the new proposed solution.

    In your current setup, you're running a Databricks cluster with 1 driver node and 8 worker nodes, using Standard_DS3_v2 for both the driver and the workers:

    • VM Cost: $0.185/hour per node.
    • DBU Cost: $0.400/hour per node.
    • Total Cost per Node: $0.585/hour.

    For your cluster setup with 1 driver + 8 worker nodes (9 nodes):

    • Total Hourly Cluster Cost: $0.585 × 9 = $5.265/hour.

    Assuming a refresh schedule of 1 hour per week:

    • Weekly Cost: $5.265 × 1 = $5.27/week.
    • Monthly Cost (4 weeks): ~$21.06/month.
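    The arithmetic above can be sketched as a short script (the VM and DBU rates are the example figures quoted above, and the 1-hour weekly refresh is an assumption — plug in your actual rates and runtime):

    ```python
    # Cost estimate for the current cluster, using the example rates above.
    VM_COST_PER_HOUR = 0.185    # Standard_DS3_v2 rate per node (example figure)
    DBU_COST_PER_HOUR = 0.400   # DBU rate per node (example figure)

    nodes = 1 + 8               # 1 driver + 8 workers
    cost_per_node_hour = VM_COST_PER_HOUR + DBU_COST_PER_HOUR  # 0.585
    hourly_cluster_cost = cost_per_node_hour * nodes           # 5.265

    hours_per_week = 1          # assumed duration of the weekly refresh
    weekly_cost = hourly_cluster_cost * hours_per_week
    monthly_cost = weekly_cost * 4                             # ~21.06

    print(f"Hourly: ${hourly_cluster_cost:.3f}/hour")
    print(f"Weekly: ${weekly_cost:.2f}/week")
    print(f"Monthly: ${monthly_cost:.2f}/month")
    ```
    
    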

    Besides direct costs, maintaining notebooks, staging tables, and schedulers adds to development and operational effort. These are important but harder to quantify as they depend on your team's time and complexity.

    By moving business logic directly into Power BI and querying the Hive Metastore:

    1. Databricks Cluster Cost: Eliminated entirely for this workflow, saving ~$21/month in direct cluster costs.
    2. Operational Efficiency: No need to manage notebooks, staging tables, or schedulers. Simplifies the overall architecture, saving time and reducing potential points of failure.

    Also, it's difficult to provide an exact estimate of the cost savings without knowing more about your specific usage patterns and pricing model. However, based on the information you provided, I would expect that you could achieve significant cost savings by switching to the new proposed solution.

    I hope this helps! Let me know if you have any other questions.

