Ingest data into an Azure Databricks lakehouse
Azure Databricks offers various methods for ingesting data into a lakehouse backed by Delta Lake. This article lists supported ingestion tools and guidance on which method to use based on criteria like data source and latency.
Ingestion methods
You can ingest data into Databricks using the following methods:
- Batch ingestion of a set of data rows for infrequent processing
- Streaming ingestion of individual data rows or sets of data rows as they arrive for real-time processing
Ingested data is loaded into Delta tables that can then be used across your downstream data and AI use cases. Because of Databricks’ Lakehouse architecture, you do not need to duplicate your data across use cases, and you can leverage Unity Catalog for centralized access control, auditing, lineage, and data discovery across all of your data.
Batch ingestion
With batch ingestion, you load data as sets (or batches) of rows into Databricks, often based on a schedule (for example, every day) or triggered manually. This represents the “extract” piece of traditional extract, transform, load (ETL) use cases. You can use batch ingestion to load data from:
- Local files like CSVs
- Cloud object storage, including Amazon S3, Azure Data Lake Storage, and Google Cloud Storage
- SaaS applications like Salesforce and databases like SQL Server
Batch ingestion supports a wide range of file source formats, including CSV, TSV, JSON, XML, Avro, ORC, Parquet, and text files.
Databricks supports both traditional batch ingestion and incremental batch ingestion options. While traditional batch ingestion processes all records each time it runs, incremental batch ingestion automatically detects new records in the data source and ignores records that have already been ingested. This means less data needs to be processed and, as a result, ingestion jobs run faster and use compute resources more efficiently.
Traditional (one-time) batch ingestion
You can upload local data files or download files from a public URL using the add data UI. See Upload files.
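Beyond the UI, a one-time batch load can also be expressed in code. The following is a minimal sketch that reads CSV files from an illustrative source path (for example, a Unity Catalog volume) and overwrites a Delta table; the path and table name are placeholders, not part of the product:

# Minimal sketch of a one-time (traditional) batch load.
# The source path and target table name are illustrative placeholders.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/Volumes/<catalog>/<schema>/<volume>/customers/")
)
df.write.format("delta").mode("overwrite").saveAsTable("<catalog>.<schema>.customers")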
Incremental batch ingestion
This section describes supported incremental batch ingestion tools.
Streaming tables
The CREATE STREAMING TABLE SQL command lets you load data incrementally into streaming tables from cloud object storage. See CREATE STREAMING TABLE.
Example: Incremental batch ingestion using streaming tables
CREATE OR REFRESH STREAMING TABLE customers
AS SELECT * FROM STREAM read_files(
  "/databricks-datasets/retail-org/customers/",
  format => "csv"
)
Cloud object storage connector
Auto Loader, the built-in cloud object storage connector, allows you to incrementally and efficiently process new data files as they arrive in Amazon S3 (S3), Azure Data Lake Storage Gen2 (ADLS Gen2), or Google Cloud Storage (GCS). See Auto Loader.
Example: Incremental batch ingestion using Auto Loader
df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("rescuedDataColumn", "_rescued_data")
.schema("/databricks-datasets/retail-org/customers/schema")
.load("/databricks-datasets/retail-org/customers/")
Fully-managed connectors
LakeFlow Connect provides fully-managed connectors to ingest from SaaS applications like Salesforce and databases like SQL Server. Managed connectors are available using the following:
- Databricks UI
- Databricks CLI
- Databricks APIs
- Databricks SDKs
- Databricks Asset Bundles (DABs)
See LakeFlow Connect.
Streaming ingestion
With streaming ingestion, you continuously load data rows or batches of data rows as they are generated so that you can query the data as it arrives in near real time. You can use streaming ingestion to load streaming data from sources like Apache Kafka, Amazon Kinesis, Google Pub/Sub, and Apache Pulsar.
Databricks also supports streaming ingestion using built-in connectors. These connectors allow you to incrementally and efficiently process new data as it arrives from streaming sources. See Configure streaming data sources.
Example: Streaming ingestion from Kafka
# Continuously read records from a Kafka topic as a streaming DataFrame.
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<server:ip>")
    .option("subscribe", "topic1")
    .option("startingOffsets", "latest")
    .load()
)
Batch and streaming ingestion with Delta Live Tables
Databricks recommends using Delta Live Tables to build reliable and scalable data processing pipelines. Delta Live Tables supports both batch and streaming ingestion, and you can ingest data from any data source supported by Auto Loader.
Example: Incremental batch ingestion using Delta Live Tables
import dlt

@dlt.table
def customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/databricks-datasets/retail-org/customers/")
    )
Example: Streaming ingestion from Kafka using Delta Live Tables
import dlt

@dlt.table
def kafka_raw():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<server:ip>")
        .option("subscribe", "topic1")
        .option("startingOffsets", "latest")
        .load()
    )
Ingestion schedules
You can ingest data as a one-time operation, on a recurring schedule, or continuously.
- For near real-time streaming use cases, use continuous mode.
- For batch ingestion use cases, ingest one time or set a recurring schedule.
See Ingestion with Jobs and Triggered vs. continuous pipeline mode.
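In code, the difference between scheduled and continuous ingestion usually comes down to the trigger used when writing the stream. The following is a minimal sketch with Auto Loader and Structured Streaming; the paths and table name are illustrative placeholders:

# Triggered (batch-style) run: process all new files since the last run, then stop.
# Suitable for one-time or scheduled ingestion. Paths and names are placeholders.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
    .load("<path-to-source-files>")
    .writeStream
    .option("checkpointLocation", "<path-to-checkpoint>")
    .trigger(availableNow=True)  # omit the trigger to keep the stream running continuously
    .toTable("<catalog>.<schema>.customers")
)

Omitting the trigger (or using a short processing-time trigger) keeps the same query running continuously for near real-time use cases.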
Ingestion partners
Many third-party tools support batch or streaming ingestion into Databricks. Databricks validates various third-party integrations, although the steps to configure access to source systems and ingest data vary by tool. See Ingestion partners for a list of validated tools. Some technology partners are also featured in Databricks Partner Connect, which provides a UI that simplifies connecting third-party tools to lakehouse data.
DIY ingestion
Databricks provides a general compute platform. As a result, you can create your own ingestion connectors using any programming language supported by Databricks, like Python or Java. You can also import and leverage popular open source connector libraries like data load tool (dlt), Airbyte, and Debezium.
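As a minimal sketch of a DIY connector (the REST endpoint, response fields, and target table name below are hypothetical), you could pull records from a source API with standard Python and append them to a Delta table:

import requests

# Hypothetical REST endpoint; replace with your source system's API.
records = requests.get("https://api.example.com/v1/customers", timeout=30).json()

# Build a DataFrame from the fetched records and append them to a Delta table.
# The target table name is an illustrative placeholder.
df = spark.createDataFrame(records)
df.write.format("delta").mode("append").saveAsTable("<catalog>.<schema>.customers_raw")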
Ingestion alternatives
Databricks recommends ingestion for most use cases because it scales to accommodate high data volumes, low-latency querying, and third-party API limits. Ingestion copies data from your source systems to Azure Databricks, which results in duplicate data that might become stale over time. If you don’t want to copy data, you can use the following tools:
- Lakehouse Federation allows you to query external data sources without moving your data.
- Delta Sharing allows you to securely share data across platforms, clouds, and regions.
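Both options surface the external or shared data as catalogs and tables that you can query in place. As a minimal sketch (the catalog, schema, and table names are illustrative placeholders):

# Query a table exposed through a foreign (federated) catalog or a Delta Sharing
# catalog without copying the data into the lakehouse. Names are placeholders.
df = spark.read.table("<foreign_or_shared_catalog>.<schema>.<table>")
df.limit(10).show()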
When to use Delta Sharing
Choose Delta Sharing for the following scenarios:
- Limiting data duplication
- Querying the freshest possible data
When to use Lakehouse Federation
Choose Lakehouse Federation for the following scenarios:
- Ad hoc reporting or proof-of-concept work on your ETL pipelines
Considerations when choosing an ingestion method
| Consideration | Guidance |
|---|---|
| Data source | If LakeFlow Connect native connectors exist for the data source, this will be the simplest way to ingest data. For data sources not supported by LakeFlow Connect, extract the data from the source and then use Auto Loader to ingest the data into Databricks. For local files, use the Databricks UI to upload the data. |
| Latency | If you want to analyze data in near real time, use streaming to leverage incremental processing. With streaming, the data is available for querying as soon as each record arrives. Otherwise, use batch ingestion. |
| Data movement | If you are unable to copy data from your source systems to Databricks, use Lakehouse Federation or Delta Sharing. |
Migrate data to Delta Lake
To learn how to migrate existing data to Delta Lake, see Migrate data to Delta Lake.
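For example, existing Parquet data can often be converted to Delta Lake in place instead of being re-ingested. The following is a minimal sketch using the CONVERT TO DELTA command; the path is an illustrative placeholder:

# Convert an existing directory of Parquet files into a Delta table in place.
# The path is an illustrative placeholder.
spark.sql("CONVERT TO DELTA parquet.`<path-to-parquet-directory>`")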
COPY INTO (legacy)
The CREATE STREAMING TABLE SQL command is the recommended alternative to the legacy COPY INTO SQL command for incremental ingestion from cloud object storage. See COPY INTO. For a more scalable and robust file ingestion experience, Databricks recommends that SQL users leverage streaming tables instead of COPY INTO.
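For reference, a legacy COPY INTO load looks like the following sketch; the target table and source path are illustrative placeholders. COPY INTO is idempotent, so files that have already been loaded are skipped on later runs:

# Legacy incremental load from cloud object storage with COPY INTO.
# The target table (which must already exist) and source path are placeholders.
spark.sql("""
  COPY INTO <catalog>.<schema>.customers
  FROM '<path-to-source-files>'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")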
Additional resources
What is the difference between streaming and incremental batch ingestion?