Create a modern analytics architecture by using Azure Databricks

Azure Databricks
Microsoft Fabric
Power BI
Azure Data Lake Storage

Solution ideas

This article describes a solution idea. Your cloud architect can use this guidance to help visualize the major components for a typical implementation of this architecture. Use this article as a starting point to design a well-architected solution that aligns with your workload's specific requirements.

This solution outlines the key principles and components of modern data architectures. Azure Databricks forms the core of the solution. This platform works seamlessly with other services, such as Azure Data Lake Storage, Microsoft Fabric, and Power BI.

Apache® and Apache Spark™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Architecture

*Architecture diagram that shows how a modern data architecture collects, processes, analyzes, and visualizes data.*

Download a Visio file of this architecture.

Dataflow

  1. Azure Databricks ingests raw streaming data from Azure Event Hubs by using Delta Live Tables. (A pipeline sketch appears after this list.)

  2. Fabric Data Factory loads raw batch data into Data Lake Storage.

  3. For data storage:

    • Data Lake Storage houses all types of data, including structured, unstructured, and partially structured data. It also stores batch and streaming data.

    • Delta Lake forms the curated layer of the data lake. It stores the refined data in an open-source format.

    • Azure Databricks works well with a medallion architecture that organizes data into layers:

      • Bronze layer: Holds raw data.
      • Silver layer: Contains cleaned, filtered data.
      • Gold layer: Stores aggregated data that's useful for business analytics.
  4. The analytical platform ingests data from the disparate batch and streaming sources. Data scientists use this data for tasks like:

    • Data preparation.
    • Data exploration.
    • Model preparation.
    • Model training.

    MLflow manages parameter, metric, and model tracking in data science code runs, as shown in the tracking sketch after this list. The coding possibilities are flexible:

    • Code can be in SQL, Python, R, and Scala.
    • Code can use popular open-source libraries and frameworks such as Koalas, Pandas, and scikit-learn, which are preinstalled and optimized.
    • Users can optimize for performance and cost by using single-node and multiple-node compute options.
  5. Machine learning models are available in the following ways:

    • Azure Databricks stores information about models in the MLflow Model Registry. The registry makes models available through batch, streaming, and REST APIs.
    • The solution can also deploy models to Azure Machine Learning web services or Azure Kubernetes Service (AKS).
  6. Services that work with the data connect to a single underlying data source to help ensure consistency. For instance, you can run SQL queries on the data lake by using Azure Databricks SQL warehouses, as shown in the connector sketch after this list.

  7. You can mirror gold datasets out of Azure Databricks Unity Catalog into Fabric. Use Azure Databricks mirroring in Fabric to integrate the data without moving or replicating it.

  8. Power BI generates analytical and historical reports and dashboards from the unified data platform. This service uses the following features when it works with Azure Databricks:

    • A built-in Azure Databricks connector for visualizing the underlying data.
    • Optimized Java Database Connectivity and Open Database Connectivity drivers.
    • Direct Lake with Azure Databricks mirroring in Fabric to load Power BI semantic models for higher-performance queries.
  9. The solution uses Unity Catalog and Azure services for collaboration, performance, reliability, governance, and security:

    • Azure Databricks Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

    • Microsoft Purview provides data discovery services, sensitive data classification, and governance insights across the data estate.

    • Azure DevOps offers continuous integration and continuous deployment (CI/CD) and other integrated version control features.

    • Azure Key Vault helps you securely manage secrets, keys, and certificates.

    • Microsoft Entra ID and System for Cross-domain Identity Management (SCIM) provisioning provide single sign-on for Azure Databricks users and groups. Azure Databricks supports automated user provisioning with Microsoft Entra ID to:

      • Create new users and groups.
      • Assign each user an access level.
      • Remove users and deny them access.
    • Azure Monitor collects and analyzes Azure resource telemetry. By proactively identifying problems, this service maximizes performance and reliability.

    • Microsoft Cost Management provides financial governance services for Azure workloads.
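
The following sketch shows how steps 1 and 3 might look in code: a Delta Live Tables pipeline that reads streaming events from Event Hubs through its Kafka-compatible endpoint and organizes them into bronze, silver, and gold tables. It's a minimal sketch, not a definitive pipeline; the namespace, event hub name, connection string, schema, and table names are hypothetical placeholders.

```python
import dlt
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

# Hypothetical event payload; replace with your schema.
event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

@dlt.table(comment="Bronze: raw events from Event Hubs (Kafka-compatible endpoint).")
def bronze_events():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
        .option("subscribe", "<event-hub-name>")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        # The connection string should come from a secret scope; see the
        # Key Vault sketch later in this article.
        .option(
            "kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
            'username="$ConnectionString" password="<connection-string>";',
        )
        .load()
    )

@dlt.table(comment="Silver: parsed, cleaned, and filtered events.")
@dlt.expect_or_drop("valid_reading", "reading IS NOT NULL")
def silver_events():
    return (
        dlt.read_stream("bronze_events")
        .select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

@dlt.table(comment="Gold: windowed per-device aggregates for business analytics.")
def gold_device_stats():
    return (
        dlt.read_stream("silver_events")
        .withWatermark("event_time", "10 minutes")
        .groupBy(window(col("event_time"), "5 minutes"), col("device_id"))
        .agg(avg("reading").alias("avg_reading"))
    )
```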
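
For steps 4 and 5, this sketch tracks parameters and metrics with MLflow and registers the trained model so that it can be served. It assumes a prepared pandas DataFrame named features_df with a target column; the registered model name is a hypothetical Unity Catalog path.

```python
import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# features_df is assumed to be a prepared pandas DataFrame with a "target" column.
X_train, X_test, y_train, y_test = train_test_split(
    features_df.drop(columns=["target"]), features_df["target"], random_state=42
)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 8}
    mlflow.log_params(params)  # track hyperparameters

    model = RandomForestRegressor(**params).fit(X_train, y_train)
    mlflow.log_metric("mae", mean_absolute_error(y_test, model.predict(X_test)))

    # Register the model so it can serve batch, streaming, and REST requests (step 5).
    # The three-level name assumes models are registered in Unity Catalog.
    mlflow.sklearn.log_model(
        model, "model", registered_model_name="main.ml_models.demand_forecast"
    )
```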
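
And for step 6, a sketch that queries a gold table from outside the workspace by using the open-source Databricks SQL Connector for Python. The host name, HTTP path, token, and table name are placeholders.

```python
from databricks import sql  # pip install databricks-sql-connector

# Placeholders: workspace host, the SQL warehouse's HTTP path, and a credential.
with sql.connect(
    server_hostname="<workspace>.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # Query a hypothetical gold-layer table governed by Unity Catalog.
        cursor.execute(
            "SELECT device_id, AVG(avg_reading) AS reading "
            "FROM main.analytics.gold_device_stats "
            "GROUP BY device_id"
        )
        for row in cursor.fetchall():
            print(row)
```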

Components

This solution uses the following components.

Core components

  • Azure Databricks is a data analytics platform that uses Spark clusters to process large data streams. It cleans and transforms unstructured data and combines it with structured data. It can also train and deploy machine learning models. In this architecture, Azure Databricks serves as the central tool for data ingestion, processing, and serving. It provides a unified environment for managing the entire data lifecycle.

  • Azure Databricks SQL warehouses are compute resources that you can use to query and explore data on Azure Databricks. In this architecture, you can use SQL warehouses to connect Power BI directly to your data.

  • Azure Databricks Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. In this architecture, Delta Live Tables helps you define transformations to perform on your data. It also helps you manage task orchestration, cluster management, monitoring, data quality, and error handling within Azure Databricks.

  • Microsoft Fabric is an end-to-end analytics and data platform for organizations that need a unified solution. The platform provides services like Data Engineering, Data Factory, Data Science, Real-Time Intelligence, Data Warehouse, and Databases. This architecture mirrors Unity Catalog tables into Fabric and uses Direct Lake in Power BI for better performance.

  • Data Factory in Microsoft Fabric is a modern data integration platform that you can use to ingest, prepare, and transform data from a rich set of data sources in Fabric. This architecture uses built-in connectors to several data sources for quick ingestion into Data Lake Storage or OneLake. Azure Databricks later retrieves and further transforms the batch data.

  • Event Hubs is a fully managed, big data streaming platform. As a platform as a service, it provides event ingestion capabilities. This architecture uses Event Hubs for streaming data. Azure Databricks can connect to this data and process it by using Spark Streaming or Delta Live Tables. (See the producer sketch after this list.)

  • Data Lake Storage is a scalable and secure data lake for high-performance analytics. It handles multiple petabytes of data and supports hundreds of gigabits of throughput. Data Lake Storage can store structured, partially structured, and unstructured data. This architecture uses Data Lake Storage to store both batch and streaming data.

  • Machine Learning is a cloud-based environment that helps you build, deploy, and manage predictive analytics solutions. By using these models, you can forecast behavior, outcomes, and trends. In this architecture, Machine Learning uses data that Azure Databricks transforms for training and inferring models.

  • AKS is a highly available, secure, and fully managed Kubernetes service. AKS makes it easy to deploy and manage containerized applications. In this architecture, AKS hosts machine learning models in a containerized environment for scalable inferencing.

  • Delta Lake is a storage layer that uses an open file format. This layer runs on top of cloud storage solutions like Data Lake Storage. Delta Lake supports data versioning, rollback, and transactions for updating, deleting, and merging data. In this architecture, Delta Lake works as the primary file format for writing and reading data from Data Lake Storage. (See the upsert sketch after this list.)

  • MLflow is an open-source platform for managing the machine learning lifecycle. Its components monitor machine learning models during training and operation. In this architecture, similar to Machine Learning, you can use MLflow in Azure Databricks to manage your machine learning lifecycle. Train models and run inference by using the Unity Catalog data that you transform within Azure Databricks.
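
To illustrate the ingestion side of the Event Hubs component, here's a minimal producer sketch that uses the azure-eventhub package. The connection string, hub name, and payload are hypothetical.

```python
import json
from azure.eventhub import EventData, EventHubProducerClient  # pip install azure-eventhub

# Placeholders: the Event Hubs connection string and hub (topic) name.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>", eventhub_name="<event-hub-name>"
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"device_id": "sensor-1", "reading": 21.7})))
    producer.send_batch(batch)  # events become available to Spark Streaming or DLT
```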
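
The following sketch demonstrates the Delta Lake capabilities described earlier: an ACID upsert via MERGE and time travel to an earlier table version. The storage paths and join condition are hypothetical.

```python
from delta.tables import DeltaTable  # included in Databricks Runtime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

silver_path = "abfss://lake@<account>.dfs.core.windows.net/silver/events"

# A hypothetical batch of corrections staged elsewhere in the lake.
updates = spark.read.format("delta").load(
    "abfss://lake@<account>.dfs.core.windows.net/staging/event_updates"
)

# Upsert: update matching rows and insert new ones in one ACID transaction.
(
    DeltaTable.forPath(spark, silver_path)
    .alias("t")
    .merge(updates.alias("s"), "t.device_id = s.device_id AND t.event_time = s.event_time")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read an earlier version of the table for audit or rollback.
previous = spark.read.format("delta").option("versionAsOf", 0).load(silver_path)
```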

Reporting and governing components

  • Azure Databricks Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. In this architecture, Unity Catalog works as the primary tool within Azure Databricks to manage and secure data access.

  • Power BI is a collection of software services and apps. These services create and share reports that connect and visualize unrelated sources of data. Together with Azure Databricks, Power BI can provide root cause determination and raw data analysis. This architecture uses Power BI to create dashboards and reports that provide insights into the data that Azure Databricks and Fabric process.

  • Microsoft Purview manages on-premises, multicloud, and software as a service (SaaS) data. This governance service maintains data landscape maps. Its features include automated data discovery, sensitive data classification, and data lineage. This architecture uses Microsoft Purview to scan and track data that's ingested in Unity Catalog, Fabric, Power BI, and Data Lake Storage.

  • Azure DevOps is a DevOps orchestration platform. This SaaS provides tools and environments to build, deploy, and collaborate on applications. This architecture uses Azure DevOps to automate the deployment of Azure infrastructure. You can also use GitHub for automation and version control of Azure Databricks code for better collaboration, change tracking, and integration with CI/CD pipelines.

  • Key Vault stores and controls access to secrets, such as tokens, passwords, and API keys. Key Vault also creates and controls encryption keys and manages security certificates. This architecture uses Key Vault to store shared access signature keys from Data Lake Storage. These keys are then used in Azure Databricks and other services for authentication. (See the secret-retrieval sketch after this list.)

  • Microsoft Entra ID offers cloud-based identity and access management services. These features provide a way for users to sign in and access resources. This architecture uses Microsoft Entra ID to authenticate and authorize users and services in Azure.

  • SCIM allows you to set up provisioning to the Azure Databricks account by using Microsoft Entra ID. This architecture uses SCIM to manage users who access Azure Databricks workspaces.

  • Azure Monitor collects and analyzes data in environments and Azure resources. This data includes app telemetry, such as performance metrics and activity logs. This architecture uses Azure Monitor to monitor the health of compute resources in Azure Databricks and Machine Learning and other components that send logs to Azure Monitor.

  • Cost Management helps you manage cloud spending. By using budgets and recommendations, this service organizes expenses and shows you how to reduce costs. This architecture uses Cost Management to help monitor and control the cost of the entire solution.
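
As a sketch of the Key Vault integration, this notebook snippet reads a SAS token from a Key Vault-backed secret scope and configures Spark to use it for Data Lake Storage access, following the documented fixed-SAS-token pattern. The scope, secret, and storage account names are placeholders.

```python
# In an Azure Databricks notebook, dbutils and spark are available implicitly.
# "kv-scope" and "adls-sas-token" are placeholders for a Key Vault-backed
# secret scope and the secret's name in Key Vault.
sas_token = dbutils.secrets.get(scope="kv-scope", key="adls-sas-token")

# Configure Spark to authenticate to Data Lake Storage with the SAS token.
account = "<storage-account>"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "SAS")
spark.conf.set(
    f"fs.azure.sas.token.provider.type.{account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
)
spark.conf.set(f"fs.azure.sas.fixed.token.{account}.dfs.core.windows.net", sas_token)
```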

Scenario details

Modern data architectures:

  • Unify data, analytics, and AI workloads.
  • Run efficiently and reliably at any scale.
  • Provide insights through analytics dashboards, operational reports, or advanced analytics.

This solution outlines a modern data architecture that achieves these goals. Azure Databricks forms the core of the solution. This platform works seamlessly with other services. Together, these services provide a solution that is:

  • Simple: Unified analytics, data science, and machine learning simplify the data architecture.
  • Open: The solution supports open-source code, open standards, and open frameworks. It also works with popular integrated development environments (IDEs), libraries, and programming languages. Through native connectors and APIs, the solution works with a broad range of other services, too.
  • Collaborative: Data engineers, data scientists, and analysts work together with this solution. They can use collaborative notebooks, IDEs, dashboards, and other tools to access and analyze common underlying data.

Potential use cases

The system that Swiss Re Group built for its Property & Casualty Reinsurance division inspired this solution. Beyond the insurance industry, any area that works with big data or machine learning can benefit from this solution. Examples include:

  • The energy sector.
  • Retail and e-commerce.
  • Banking and finance.
  • Medicine and healthcare.

Next steps

To learn about related solutions, see the following guides and architectures.