Jaa


What are all the Delta things in Azure Databricks?

This article is an introduction to the technologies collectively branded Delta on Azure Databricks. Delta refers to technologies related to or in the Delta Lake open source project.

This article answers:

  • What are the Delta technologies in Azure Databricks?
  • What do they do? Or what are they used for?
  • How are they related to and distinct from one another?

What are the Delta things used for?

Delta is a term introduced with Delta Lake, the foundation for storing data and tables in the Databricks lakehouse. Delta Lake was conceived of as a unified data management system for handling transactional real-time and batch big data, by extending Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

Delta Lake: OS data management for the lakehouse

Delta Lake is an open-source storage layer that brings reliability to data lakes by adding a transactional storage layer on top of data stored in cloud storage (on AWS S3, Azure Storage, and GCS). It allows for ACID transactions, data versioning, and rollback capabilities. It allows you to handle both batch and streaming data in a unified way.

Delta tables are built on top of this storage layer and provide a table abstraction, making it easy to work with large-scale structured data using SQL and the DataFrame API.

Delta tables: Default data table architecture

Delta table is the default data table format in Azure Databricks and is a feature of the Delta Lake open source data framework. Delta tables are typically used for data lakes, where data is ingested via streaming or in large batches.

See:

Delta Live Tables: Data pipelines

Delta Live Tables manage the flow of data between many Delta tables, thus simplifying the work of data engineers on ETL development and management. The pipeline is the main unit of execution for Delta Live Tables. Delta Live Tables offers declarative pipeline development, improved data reliability, and cloud-scale production operations. Users can perform both batch and streaming operations on the same table and the data is immediately available for querying. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. Delta Live Tables enhanced autoscaling can handle streaming workloads which are spiky and unpredictable.

See the Delta Live Tables tutorial.

Delta tables vs. Delta Live Tables

Delta table is a way to store data in tables, whereas Delta Live Tables allows you to describe how data flows between these tables declaratively. Delta Live Tables is a declarative framework that manages many delta tables, by creating them and keeping them up to date. In short, Delta tables is a data table architecture while Delta Live Tables is a data pipeline framework.

Delta: Open source or proprietary?

A strength of the Azure Databricks platform is that it doesn’t lock customers into proprietary tools: Much of the technology is powered by open source projects, which Azure Databricks contributes to.

The Delta OSS projects are examples:

Delta Live Tables is a proprietary framework in Azure Databricks.

What are the other Delta things on Azure Databricks?

Below are descriptions of other features that include Delta in their name.

Delta Sharing

An open standard for secure data sharing, Delta Sharing enables data sharing between organizations regardless of their compute platform.

Delta engine

A query optimizer for big data that uses Delta Lake open source technology included in Databricks. Delta engine optimizes the performance of Spark SQL, Databricks SQL, and DataFrame operations by pushing computation to the data.

Delta Lake transaction log (AKA DeltaLogs)

A single source of truth tracking all changes that users make to the table and the mechanism through which Delta Lake guarantees atomicity. See the Delta transaction log protocol on GitHub.

The transaction log is key to understanding Delta Lake, because it is the common thread that runs through many of its most important features:

  • ACID transactions
  • Scalable metadata handling
  • Time travel
  • And more.