Modern analytics architecture with Azure Databricks

Azure Databricks
Microsoft Fabric
Power BI
Azure Data Lake Storage

Solution ideas

This article describes a solution idea. Your cloud architect can use this guidance to help visualize the major components for a typical implementation of this architecture. Use this article as a starting point to design a well-architected solution that aligns with your workload's specific requirements.

This solution outlines a modern data architecture. Azure Databricks forms the core of the solution. This platform works seamlessly with other services, such as Azure Data Lake Storage Gen2, Microsoft Fabric, and Power BI.

Apache® and Apache Spark™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Architecture

Architecture diagram showing how a modern data architecture collects, processes, analyzes, and visualizes data.

Download a Visio file of this architecture.

Dataflow

  1. Azure Databricks ingests raw streaming data from Azure Event Hubs using Delta Live Tables.

  2. Fabric Data Factory loads raw batch data into Data Lake Storage Gen2.

  3. For data storage:

    • Data Lake Storage Gen2 houses data of all types, such as structured, unstructured, and semi-structured. It also stores batch and streaming data.

    • Delta Lake forms the curated layer of the data lake. It stores the refined data in an open-source format.

    • Azure Databricks works well with a medallion architecture that organizes data into layers:

      • Bronze: Holds raw data.
      • Silver: Contains cleaned, filtered data.
      • Gold: Stores aggregated data that's useful for business analytics.
  4. The analytical platform ingests data from the disparate batch and streaming sources. Data scientists use this data for these tasks:

    • Data preparation.
    • Data exploration.
    • Model preparation.
    • Model training.

    MLflow manages parameter, metric, and model tracking in data science code runs. The coding possibilities are flexible:

    • Code can be in SQL, Python, R, and Scala.
    • Code can use popular open-source libraries and frameworks such as Koalas, pandas, and scikit-learn, which are pre-installed and optimized.
    • Practitioners can optimize for performance and cost with single-node and multi-node compute options.
  5. Machine learning models are available in several formats:

    • Azure Databricks stores information about models in the MLflow Model Registry. The registry makes models available through batch, streaming, and REST APIs.
    • The solution can also deploy models to Azure Machine Learning web services or Azure Kubernetes Service (AKS).
  6. Services that work with the data connect to a single underlying data source to ensure consistency. For instance, users can run SQL queries on the data lake with Azure Databricks SQL warehouses.

  7. Users can mirror gold data sets out of Databricks Unity Catalog into Fabric. Databricks mirroring in Fabric allows users to integrate the two platforms without data movement or data replication.

  8. Power BI generates analytical and historical reports and dashboards from the unified data platform. This service uses these features when working with Azure Databricks:

    • A built-in Azure Databricks connector for visualizing the underlying data.
    • Optimized Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers.
    • Direct Lake mode, which is available with Databricks mirroring in Fabric and loads Power BI semantic models for higher-performance queries.
  9. The solution uses Unity Catalog and Azure services for collaboration, performance, reliability, governance, and security:

    • Databricks Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

    • Microsoft Purview provides data discovery services, sensitive data classification, and governance insights across the data estate.

    • Azure DevOps offers continuous integration and continuous deployment (CI/CD) and other integrated version control features.

    • Azure Key Vault securely manages secrets, keys, and certificates.

    • Microsoft Entra ID provides single sign-on (SSO) for Azure Databricks users, and SCIM provisioning keeps users and groups in sync. Azure Databricks supports automated user provisioning with Microsoft Entra ID for these tasks:

      • Creating new users and groups.
      • Assigning each user an access level.
      • Removing users and denying them access.
    • Azure Monitor collects and analyzes Azure resource telemetry. By proactively identifying problems, this service maximizes performance and reliability.

    • Microsoft Cost Management provides financial governance services for Azure workloads.
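
The medallion layers in step 3 can be illustrated with a minimal sketch. It uses plain Python lists in place of Delta tables, so it only shows the shape of the bronze-to-silver-to-gold refinement, not how Azure Databricks or Delta Live Tables would implement it. The record fields (`device_id`, `reading`) and the sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events, as they might land in the bronze layer.
bronze = [
    {"device_id": "a1", "reading": "21.5"},
    {"device_id": "a1", "reading": "22.5"},
    {"device_id": "b2", "reading": None},   # missing value
    {"device_id": "b2", "reading": "19.0"},
    {"device_id": "", "reading": "30.0"},   # missing key
]


def to_silver(rows):
    """Clean and filter: drop incomplete rows and cast readings to float."""
    return [
        {"device_id": r["device_id"], "reading": float(r["reading"])}
        for r in rows
        if r["device_id"] and r["reading"] is not None
    ]


def to_gold(rows):
    """Aggregate: compute the average reading per device."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["device_id"]].append(r["reading"])
    return {device: sum(vals) / len(vals) for device, vals in grouped.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'a1': 22.0, 'b2': 19.0}
```

In a real workspace, each function would typically correspond to a table definition in a pipeline, with the bronze layer left untouched so that the silver and gold layers can be rebuilt from it.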

Components

The solution uses the following components.

Core components

  • Azure Databricks is a data analytics platform that uses Spark clusters to process large data streams. It cleans and transforms unstructured data, combines it with structured data, and can train and deploy machine learning models. In this architecture, Databricks serves as the central tool for data ingestion, processing, and serving, providing a unified environment for managing the entire data lifecycle.

  • Azure Databricks SQL warehouses are compute resources that let you query and explore data on Databricks. In this architecture, you can use SQL warehouses to connect directly to your data from Power BI.

  • Azure Databricks Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. In this architecture, Delta Live Tables helps you define transformations to perform on your data and manage task orchestration, cluster management, monitoring, data quality, and error handling within Databricks.

  • Microsoft Fabric is an end-to-end analytics and data platform designed for enterprises needing a unified solution. The platform offers services like Data Engineering, Data Factory, Data Science, Real-Time Analytics, Data Warehouse, and Databases. In this architecture, we mirror Unity Catalog tables into Fabric and use Direct Lake in Power BI for better performance.

  • Data Factory in Microsoft Fabric provides a modern data integration experience to ingest, prepare, and transform data from a rich set of data sources in Fabric. In this architecture, built-in connectors to several data sources provide quick ingestion into ADLS or OneLake, where Databricks later retrieves and further transforms the batch data.

  • Event Hubs is a fully managed, big data streaming platform. As a Platform as a Service (PaaS), it provides event ingestion capabilities. In this architecture, Event Hubs is utilized for streaming data, which Databricks can connect to and process using Spark Streaming or Delta Live Tables.

  • Data Lake Storage Gen2 is a scalable and secure data lake for high-performance analytics. It handles multiple petabytes of data and supports hundreds of gigabits of throughput. ADLS can store structured, semi-structured, and unstructured data. In this architecture, we use ADLS to store both batch and streaming data.

  • Azure Machine Learning is a cloud-based environment that helps you build, deploy, and manage predictive analytics solutions. With these models, you can forecast behavior, outcomes, and trends. In this architecture, Azure Machine Learning can use data transformed by Databricks for model training and inference.

  • AKS is a highly available, secure, and fully managed Kubernetes service. AKS makes it easy to deploy and manage containerized applications. In this architecture, AKS is leveraged to host machine learning models in a containerized environment for scalable inferencing.

  • Delta Lake is a storage layer that uses an open file format. This layer runs on top of cloud storage such as Data Lake Storage Gen2. Delta Lake supports data versioning, rollback, and transactions for updating, deleting, and merging data. In this architecture, Delta works as the primary file format for writing and reading data from ADLS.

  • MLflow is an open-source platform for managing the machine learning lifecycle. Its components monitor machine learning models during training and running. In this architecture, as with Azure Machine Learning, you can use MLflow in Databricks to manage your machine learning lifecycle, including training and inference on the Unity Catalog data that you transformed within Databricks.
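
As a small illustration of how Delta tables in Data Lake Storage Gen2 are usually addressed, the sketch below builds `abfss://` URIs, the URI scheme of the Azure Blob File System driver. The storage account name, the choice of one container per medallion layer, and the `sales/orders` path are assumptions made for the example, not part of this architecture.

```python
def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Build an abfss:// URI for a path in a Data Lake Storage Gen2 container."""
    base = f"abfss://{container}@{account}.dfs.core.windows.net"
    clean = path.strip("/")
    return f"{base}/{clean}" if clean else base


# Hypothetical storage account and one container per medallion layer.
ACCOUNT = "contosolake"
LAYERS = ("bronze", "silver", "gold")

layer_paths = {layer: abfss_uri(ACCOUNT, layer, "sales/orders") for layer in LAYERS}

for layer, uri in layer_paths.items():
    print(layer, uri)
```

A Spark job could pass one of these URIs as the location of a Delta table. With Unity Catalog, tables are more commonly referenced by name, and the path is managed for you.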

Reporting and governing components

  • Databricks Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. In this architecture, Unity Catalog works as the primary tool within Databricks to manage and secure data access.

  • Power BI is a collection of software services and apps. These services create and share reports that connect and visualize unrelated sources of data. Together with Azure Databricks, Power BI can provide root cause determination and raw data analysis. In this architecture, Power BI is used for creating dashboards and reports that provide insights into the data processed by Databricks and Fabric.

  • Microsoft Purview manages on-premises, multicloud, and software as a service (SaaS) data. This governance service maintains data landscape maps. Features include automated data discovery, sensitive data classification, and data lineage. In this architecture, Purview is used to scan and keep track of data ingested in Unity Catalog, Fabric, Power BI, and ADLS.

  • Azure DevOps is a DevOps orchestration platform. This SaaS provides tools and environments for building, deploying, and collaborating on applications. In this architecture, Azure DevOps is used for automating the deployment of Azure infrastructure. Additionally, you could leverage GitHub for automation and version control of Databricks code, for better collaboration, tracking of changes, and integration with CI/CD pipelines.

  • Azure Key Vault stores and controls access to secrets such as tokens, passwords, and API keys. Key Vault also creates and controls encryption keys and manages security certificates. In this architecture, Key Vault is used to store SAS keys from ADLS. These keys are then used in Databricks and other services for authentication.

  • Microsoft Entra ID offers cloud-based identity and access management services. These features provide a way for users to sign in and access resources. In this architecture, Microsoft Entra ID is used for authenticating and authorizing users and services in Azure.

  • SCIM allows you to set up provisioning to the Azure Databricks account using Microsoft Entra ID. In this architecture, it’s used for managing users accessing Databricks workspaces.

  • Azure Monitor collects and analyzes data on environments and Azure resources. This data includes app telemetry, such as performance metrics and activity logs. In this architecture, Azure Monitor is used for monitoring the health of compute resources in Databricks and Azure Machine Learning, as well as other components that send logs to Azure Monitor.

  • Microsoft Cost Management manages cloud spending. By using budgets and recommendations, this service organizes expenses and shows how to reduce costs. In this architecture, Microsoft Cost Management is used for monitoring and controlling the cost of the entire solution.
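
To make the SCIM provisioning tasks more concrete, the following sketch builds the kind of JSON messages that the SCIM 2.0 standard (RFC 7643 and RFC 7644) defines for creating a user and for deactivating one. Microsoft Entra ID normally generates and sends these messages itself, so this is only an illustration of the message shape. The helper names and the example user are hypothetical, and no request is sent.

```python
import json

# Schema identifiers defined by the SCIM 2.0 standard.
SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"
SCIM_PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def create_user_payload(user_name: str, display_name: str) -> dict:
    """Body of a SCIM request that creates a new, active user."""
    return {
        "schemas": [SCIM_USER_SCHEMA],
        "userName": user_name,
        "displayName": display_name,
        "active": True,
    }


def deactivate_user_payload() -> dict:
    """Body of a SCIM PATCH request that marks an existing user inactive."""
    return {
        "schemas": [SCIM_PATCH_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    }


# Hypothetical user, for illustration only.
new_user = create_user_payload("dana@contoso.example", "Dana Example")
print(json.dumps(new_user, indent=2))
print(json.dumps(deactivate_user_payload(), indent=2))
```

Keeping provisioning in the identity provider means that removing a user in Microsoft Entra ID also removes that user's access to Azure Databricks, without a separate manual step.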

Scenario details

Modern data architectures meet these criteria:

  • Unify data, analytics, and AI workloads.
  • Run efficiently and reliably at any scale.
  • Provide insights through analytics dashboards, operational reports, or advanced analytics.

This solution outlines a modern data architecture that achieves these goals. Azure Databricks forms the core of the solution. This platform works seamlessly with other services. Together, these services provide a solution with these qualities:

  • Simple: Unified analytics, data science, and machine learning simplify the data architecture.
  • Open: The solution supports open-source code, open standards, and open frameworks. It also works with popular integrated development environments (IDEs), libraries, and programming languages. Through native connectors and APIs, the solution works with a broad range of other services, too.
  • Collaborative: Data engineers, data scientists, and analysts work together with this solution. They can use collaborative notebooks, IDEs, dashboards, and other tools to access and analyze common underlying data.

Potential use cases

The system that Swiss Re Group built for its Property & Casualty Reinsurance division inspired this solution. Besides the insurance industry, any area that works with big data or machine learning can also benefit from this solution. Examples include:

  • The energy sector
  • Retail and e-commerce
  • Banking and finance
  • Medicine and healthcare

Next steps

To learn about related solutions, see this information: