Solution ideas
This article describes a solution idea. Your cloud architect can use this guidance to help visualize the major components for a typical implementation of this architecture. Use this article as a starting point to design a well-architected solution that aligns with your workload's specific requirements.
This article provides a machine learning operations (MLOps) architecture and process that uses Azure Databricks. Data scientists and engineers can use this standardized process to move machine learning models and pipelines from development to production.
This solution can take advantage of full automation, continuous monitoring, and robust collaboration, so it targets level 4 of MLOps maturity. This architecture uses the "promote code that generates the model" approach rather than the "promote models" approach. The "promote code that generates the model" approach focuses on writing and managing the code that generates machine learning models. The recommendations in this article include options for automated or manual processes.
Architecture
Download a Visio file of this architecture.
Workflow
The following workflow corresponds to the preceding diagram. Use source control and storage components to manage and organize code and data.
Source control: This project's code repository organizes the notebooks, modules, and pipelines. You can create development branches to test updates and new models. Develop code in Git-supported notebooks or integrated development environments (IDEs) that integrate with Git folders so that you can sync with your Azure Databricks workspaces. Source control promotes machine learning pipelines from the development environment, to testing in the staging environment, and to deployment in the production environment.
Lakehouse production data: As a data scientist, you have read-only access to production data in the development environment. The development environment can have mirrored data and redacted confidential data. You also have read and write access in a dev storage environment for development and experimentation. We recommend that you use a lakehouse architecture for data in which you store Delta Lake-format data in Azure Data Lake Storage. A lakehouse provides a robust, scalable, and flexible solution for data management. To define access controls, use Microsoft Entra ID credential passthrough or table access controls.
The following environments make up the main workflow.
Development
In the development environment, you develop machine learning pipelines.
Perform exploratory data analysis (EDA): Explore data in an interactive, iterative process. You might not deploy this work to staging or production. Use tools like Databricks SQL, the dbutils.data.summarize command, and Databricks AutoML.
Develop model training and other machine learning pipelines: Develop machine learning pipelines as modular code, and orchestrate the code via Databricks Notebooks or an MLflow Project. In this architecture, the model training pipeline reads data from the feature store and other lakehouse tables. The pipeline trains and tunes the model and logs model parameters and metrics to the MLflow tracking server. The feature store API logs the final model. These logs include the model, its inputs, and the training code.
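The train, tune, and log loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the article's implementation: the `RunRecord` class and the in-memory list of runs are hypothetical stand-ins for the MLflow tracking server, and the one-feature ridge model is a toy. In an Azure Databricks notebook, the same loop would typically record values with `mlflow.log_params` and `mlflow.log_metric` inside an `mlflow.start_run()` block.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical stand-in for one tracked run."""
    params: dict
    metrics: dict = field(default_factory=dict)
    model: dict = field(default_factory=dict)


def fit_slope(xs, ys, l2):
    """Fit y = w * x with L2 regularization (closed form for one feature)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)


def mse(w, xs, ys):
    """Mean squared error of the one-feature linear model."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_and_tune(train, valid, l2_grid):
    """Train one model per hyperparameter value and record every run."""
    runs = []
    for l2 in l2_grid:
        w = fit_slope(*train, l2)
        run = RunRecord(params={"l2": l2}, model={"slope": w})
        run.metrics["valid_mse"] = mse(w, *valid)
        runs.append(run)
    # The best run is the one with the lowest validation error.
    best = min(runs, key=lambda r: r.metrics["valid_mse"])
    return best, runs


train = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
valid = ([4.0, 5.0], [8.0, 10.0])
best_run, all_runs = train_and_tune(train, valid, l2_grid=[0.0, 1.0, 10.0])
print(best_run.params, best_run.metrics, best_run.model)
```

Keeping every run, not only the best one, is what makes later comparison of parameters and metrics possible.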
Commit code: To promote the machine learning workflow toward production, commit the code for featurization, training, and other pipelines to source control. In the code base, place machine learning code and operational code in different folders so that team members can develop code at the same time. Machine learning code is code that's related to the model and data. Operational code is code that's related to Databricks jobs and infrastructure.
This core cycle of activities that you do when you write and test code is referred to as the innerloop process. To perform the innerloop process for the development phase, use Visual Studio Code in combination with the dev container CLI and the Databricks CLI. You can write the code and do unit testing locally. You should also submit, monitor, and analyze the model pipelines from the local development environment.
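As an illustration of local unit testing, the following sketch keeps a feature computation in a small, importable function and tests it with the standard library's `unittest` module. The function name, the record fields, and the test cases are hypothetical and chosen only for the example.

```python
import unittest


def add_trip_features(record):
    """Return a copy of a trip record with derived features added."""
    minutes = record["duration_seconds"] / 60.0
    return {
        **record,
        "duration_minutes": minutes,
        # Guard against zero-length trips so the feature is always defined.
        "avg_speed_kmh": record["distance_km"] / (minutes / 60.0) if minutes > 0 else 0.0,
    }


class AddTripFeaturesTest(unittest.TestCase):
    def test_derives_duration_and_speed(self):
        out = add_trip_features({"distance_km": 10.0, "duration_seconds": 1800})
        self.assertAlmostEqual(out["duration_minutes"], 30.0)
        self.assertAlmostEqual(out["avg_speed_kmh"], 20.0)

    def test_zero_duration_does_not_divide_by_zero(self):
        out = add_trip_features({"distance_km": 1.0, "duration_seconds": 0})
        self.assertEqual(out["avg_speed_kmh"], 0.0)


def run_tests():
    """Run the suite programmatically and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTripFeaturesTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()


if __name__ == "__main__":
    print("tests passed:", run_tests())
```

Because the function has no dependency on a cluster, the same tests can run on a laptop and later in CI infrastructure without changes.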
Staging
In the staging environment, continuous integration (CI) infrastructure tests changes to machine learning pipelines in an environment that mimics production.
Merge request: When you submit a merge request or pull request against the staging (main) branch of the project in source control, a continuous integration and continuous delivery (CI/CD) tool like Azure DevOps runs tests.
Run unit tests and CI tests: Unit tests run in CI infrastructure, and integration tests run in end-to-end workflows on Azure Databricks. If tests pass, the code changes merge.
Build a release branch: When you want to deploy the updated machine learning pipelines to production, you can build a new release. A deployment pipeline in the CI/CD tool redeploys the updated pipelines as new workflows.
Production
Machine learning engineers manage the production environment, where machine learning pipelines directly serve end applications. The key pipelines in production refresh feature tables, train and deploy new models, run inference or serving, and monitor model performance.
Feature table refresh: This pipeline reads data, computes features, and writes to feature store tables. You can configure this pipeline to run continuously in streaming mode, on a schedule, or on a trigger.
Model training: In production, you can configure the model training or retraining pipeline to run either on a trigger or on a schedule to train a fresh model on the latest production data. Models automatically register to Unity Catalog.
Model evaluation and promotion: When a new model version is registered, the CD pipeline triggers, which runs tests to ensure that the model will perform well in production. When the model passes tests, Unity Catalog tracks its progress via model stage transitions. Tests include compliance checks, A/B tests to compare the new model with the current production model, and infrastructure tests. Lakehouse tables record test results and metrics. You can optionally require manual sign-offs before models transition to production.
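The promotion logic described above can be sketched as a simple gate. This is a hedged illustration only: the `Candidate` class, the `promotion_decision` function, and the single accuracy comparison are hypothetical, and a real A/B test compares models on live traffic rather than on one offline metric.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of a newly registered model version."""
    version: int
    accuracy: float
    passed_compliance: bool
    passed_infra_tests: bool


def promotion_decision(candidate, production_accuracy, min_gain=0.0,
                       require_signoff=False, signed_off=False):
    """Return (approved, reasons) for promoting a candidate model version."""
    reasons = []
    if not candidate.passed_compliance:
        reasons.append("compliance checks failed")
    if not candidate.passed_infra_tests:
        reasons.append("infrastructure tests failed")
    if candidate.accuracy < production_accuracy + min_gain:
        reasons.append("candidate does not beat the production model")
    if require_signoff and not signed_off:
        reasons.append("manual sign-off is pending")
    # The candidate is approved only when no check produced a blocking reason.
    return (not reasons, reasons)


candidate = Candidate(version=7, accuracy=0.91,
                      passed_compliance=True, passed_infra_tests=True)
approved, reasons = promotion_decision(candidate, production_accuracy=0.89,
                                       require_signoff=True)
print(approved, reasons)
```

Returning the list of blocking reasons, and not just a Boolean, gives the CD pipeline something concrete to write to the lakehouse tables that record test results.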
Model deployment: When a model enters production, it's deployed for scoring or serving. The most common deployment modes include:
Batch or streaming scoring: For latencies of minutes or longer, batch and streaming are the most cost-effective options. The scoring pipeline reads the latest data from the feature store, loads the latest production model version from Unity Catalog, and performs inference in a Databricks job. It can publish predictions to lakehouse tables, a Java Database Connectivity (JDBC) connection, flat files, message queues, or other downstream systems.
Online serving (REST APIs): For low-latency use cases, you generally need online serving. MLflow can deploy models to Mosaic AI Model Serving, cloud provider serving systems, and other systems. In all cases, the serving system initializes with the latest production model from Unity Catalog. For each request, it fetches features from an online feature store and makes predictions.
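For the online serving mode, a client calls the model over REST. The following sketch builds, but doesn't send, such a request by using only the Python standard library. It assumes the endpoint is reachable at `/serving-endpoints/<name>/invocations` and accepts a `dataframe_records` payload. The workspace URL, endpoint name, token, and record fields are placeholders.

```python
import json
import urllib.request


def build_scoring_request(workspace_url, endpoint_name, token, records):
    """Build (but do not send) a scoring request for a serving endpoint."""
    url = f"{workspace_url.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_scoring_request(
    workspace_url="https://adb-0000000000000000.0.azuredatabricks.net",  # placeholder
    endpoint_name="churn-model",  # hypothetical endpoint name
    token="<access-token>",  # placeholder; load real tokens from a secret store
    records=[{"customer_id": 42, "tenure_months": 18}],
)
print(request.get_method(), request.full_url)
# To send the request: urllib.request.urlopen(request).read()
```

Because the serving system looks up features in the online feature store, the request body often needs to carry only lookup keys and any features that are known at request time.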
Monitoring: Continuous or periodic workflows monitor input data and model predictions for drift, performance, and other metrics. You can use the Delta Live Tables framework to automate monitoring for pipelines, and store the metrics in lakehouse tables. Databricks SQL, Power BI, and other tools can read from those tables to create dashboards and alerts. To monitor application metrics, logs, and infrastructure, you can also integrate Azure Monitor with Azure Databricks.
Drift detection and model retraining: This architecture supports both manual and automatic retraining. Schedule retraining jobs to keep models fresh. After detected drift crosses a preconfigured threshold that you set in the monitoring step, the retraining pipelines analyze the drift and trigger retraining. You can configure pipelines to trigger automatically, or you can receive a notification and then run the pipelines manually.
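One way to turn "drift crosses a preconfigured threshold" into a check is the population stability index (PSI). The article doesn't prescribe a drift metric, so the use of PSI, the function names, the bin count, and the 0.2 threshold (a common rule of thumb) are all assumptions made for this sketch.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compute PSI between a baseline sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when the baseline has a single distinct value.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so out-of-range values land in the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(expected, actual, threshold=0.2):
    """Flag drift when PSI exceeds the preconfigured threshold."""
    return population_stability_index(expected, actual) > threshold


baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(needs_retraining(baseline, baseline), needs_retraining(baseline, shifted))
```

The Boolean result can either start the retraining pipeline directly or raise a notification for a manual run, which matches the two options described above.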
Components
A data lakehouse architecture unifies the elements of data lakes and data warehouses. Use a lakehouse to get data management and performance capabilities that are typically found in data warehouses but with the low-cost, flexible object stores that data lakes offer.
- Delta Lake is the recommended open-source data format for a lakehouse. Azure Databricks stores data in Data Lake Storage and provides a high-performance query engine.
MLflow is an open-source project for managing the end-to-end machine learning lifecycle. MLflow has the following components:
The tracking feature tracks experiments, so you can record and compare parameters, metrics, and model artifacts.
- Databricks autologging extends MLflow automatic logging to track machine learning experiments and automatically logs model parameters, metrics, files, and lineage information.
MLflow Model is a format that you can use to store and deploy models from any machine learning library to various model-serving and inference platforms.
Unity Catalog provides centralized access control, auditing, lineage, and data-discovery capabilities across Azure Databricks workspaces.
Mosaic AI Model Serving hosts MLflow models as REST endpoints.
Azure Databricks provides a managed MLflow service that has enterprise security features, high availability, and integrations with other Azure Databricks workspace features.
Databricks Runtime for Machine Learning automates the creation of a cluster that's optimized for machine learning and preinstalls popular machine learning libraries like TensorFlow, PyTorch, and XGBoost. It also preinstalls Azure Databricks for Machine Learning tools, like AutoML and feature store clients.
A feature store is a centralized repository of features. Use the feature store to discover and share features and help prevent data skew between model training and inference.
Databricks SQL integrates with a variety of tools so that you can author queries and dashboards in your favorite environments without adjusting to a new platform.
Git folders provide integration with your Git provider in the Azure Databricks workspace, which improves notebook or code collaboration and IDE integration.
Workflows and jobs provide a way to run non-interactive code in an Azure Databricks cluster. For machine learning, jobs provide automation for data preparation, featurization, training, inference, and monitoring.
Alternatives
You can tailor this solution to your Azure infrastructure. Consider the following customizations:
Use multiple development workspaces that share a common production workspace.
Exchange one or more architecture components for your existing infrastructure. For example, you can use Azure Data Factory to orchestrate Databricks jobs.
Integrate with your existing CI/CD tooling via Git and Azure Databricks REST APIs.
Use Microsoft Fabric or Azure Synapse Analytics as alternative services for machine learning capabilities.
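As a sketch of integrating existing CI/CD tooling through the Azure Databricks REST APIs, the following code builds a request that triggers an existing job via the Jobs API `run-now` operation. It only constructs the request and doesn't send it. The workspace URL, token, job ID, and parameter names are placeholders, and the API version in the path (`2.1`) is an assumption to verify against your workspace.

```python
import json
import urllib.request


def build_run_now_request(workspace_url, token, job_id, notebook_params=None):
    """Build (but do not send) a request that triggers an existing job."""
    url = f"{workspace_url.rstrip('/')}/api/2.1/jobs/run-now"
    payload = {"job_id": job_id}
    if notebook_params:
        # Optional key-value parameters passed to notebook tasks.
        payload["notebook_params"] = notebook_params
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


run_request = build_run_now_request(
    workspace_url="https://adb-0000000000000000.0.azuredatabricks.net",  # placeholder
    token="<access-token>",  # placeholder; inject from the CI/CD secret store
    job_id=123,  # hypothetical job ID
    notebook_params={"env": "staging"},
)
print(run_request.get_method(), run_request.full_url)
# To send the request: urllib.request.urlopen(run_request).read()
```

A CI/CD tool that can issue an authenticated HTTPS call can use the same pattern, so the pipeline definition stays independent of any one vendor's Databricks plug-in.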
Scenario details
This solution provides a robust MLOps process that uses Azure Databricks. You can replace all elements in the architecture, so you can integrate other Azure services and partner services as needed. This architecture and description are adapted from the e-book The Big Book of MLOps. The e-book explores this architecture in more detail.
MLOps helps reduce the risk of failures in machine learning and AI systems and improves the efficiency of collaboration and tooling. For an introduction to MLOps and an overview of this architecture, see Architect MLOps on the lakehouse.
Use this architecture to:
Connect your business stakeholders with machine learning and data science teams. Use this architecture to incorporate notebooks and IDEs for development. Business stakeholders can view metrics and dashboards in Databricks SQL, all within the same lakehouse architecture.
Make your machine learning infrastructure data-centric. This architecture treats machine learning data just like other data. Machine learning data includes data from feature engineering, training, inference, and monitoring. This architecture reuses tooling for production pipelines, dashboarding, and other general data processing for machine learning data processing.
Implement MLOps in modules and pipelines. As with any software application, use the modularized pipelines and code in this architecture to test individual components and decrease the cost of future refactoring.
Automate your MLOps processes as needed. In this architecture, you can automate steps to improve productivity and reduce the risk of human error, but you don't need to automate every step. Azure Databricks permits UI and manual processes in addition to APIs for automation.
Potential use cases
This architecture applies to all types of machine learning, deep learning, and advanced analytics. Common machine learning and AI techniques in this architecture include:
- Classical machine learning, like linear models, tree-based models, and boosting.
- Modern deep learning with frameworks like TensorFlow and PyTorch.
- Custom analytics, like statistics, Bayesian methods, and graph analytics.
The architecture supports both small data (single machine) and large data (distributed computing and GPU-accelerated). In each stage of the architecture, you can choose compute resources and libraries to adapt to your scenario's data and problem dimensions.
The architecture applies to all types of industries and business use cases. Azure Databricks customers that use this architecture include small and large organizations in the following industries:
- Consumer goods and retail services
- Financial services
- Healthcare and life sciences
- Information technology
For examples, see Databricks customers.
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal authors:
- Brandon Cowen | Senior Cloud Solution Architect
- Prabal Deb | Principal Software Engineer