Ingest data into an Azure Databricks lakehouse
Azure Databricks offers various methods for ingesting data into a lakehouse backed by Delta Lake. This article lists supported ingestion tools and guidance on which method to use based on criteria like data source and latency.
Ingestion methods
You can ingest data into Databricks using the following methods:
- Batch ingestion of a set of data rows for infrequent processing
- Streaming ingestion of individual data rows or sets of data rows as they arrive for real-time processing
Ingested data is loaded into Delta tables that can then be used across your downstream data and AI use cases. Because of Databricks’ Lakehouse architecture, you do not need to duplicate your data across use cases, and you can leverage Unity Catalog for centralized access control, auditing, lineage, and data discovery across all of your data.
Batch ingestion
With batch ingestion, you load data as sets (or batches) of rows into Databricks, often based on a schedule (for example, every day) or triggered manually. This represents the “extract” piece of traditional extract, transform, load (ETL) use cases. You can use batch ingestion to load data from:
- Local files like CSVs
- Cloud object storage, including Amazon S3, Azure Data Lake Storage, and Google Cloud Storage
- SaaS applications like Salesforce and databases like SQL Server
Batch ingestion supports a wide range of file source formats, including CSV, TSV, JSON, XML, Avro, ORC, Parquet, and text files.
Databricks supports both traditional batch ingestion and incremental batch ingestion options. While traditional batch ingestion processes all records each time it runs, incremental batch ingestion automatically detects new records in the data source and ignores records that have already been ingested. This means less data needs to be processed and, as a result, ingestion jobs run faster and use compute resources more efficiently.
Traditional (one-time) batch ingestion
You can upload local data files or download files from a public URL using the add data UI. See Upload files.
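Beyond the UI, a one-time batch load can also be expressed in code. The following is a minimal sketch that reads CSV files from an illustrative source path (for example, a Unity Catalog volume) and overwrites a Delta table; the path and table name are placeholders, not part of the product:

# Minimal sketch of a one-time (traditional) batch load.
# The source path and target table name are illustrative placeholders.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/Volumes/<catalog>/<schema>/<volume>/customers/")
)
df.write.format("delta").mode("overwrite").saveAsTable("<catalog>.<schema>.customers")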
Incremental batch ingestion
This section describes supported incremental batch ingestion tools.
Streaming tables
The CREATE STREAMING TABLE SQL command lets you load data incrementally into streaming tables from cloud object storage. See CREATE STREAMING TABLE.
Example: Incremental batch ingestion using streaming tables
CREATE OR REFRESH STREAMING TABLE customers
AS SELECT * FROM STREAM read_files(
  "/databricks-datasets/retail-org/customers/",
  format => "csv"
)
Cloud object storage connector
Auto Loader, the built-in cloud object storage connector, allows you to incrementally and efficiently process new data files as they arrive in Amazon S3 (S3), Azure Data Lake Storage Gen2 (ADLS Gen2), or Google Cloud Storage (GCS). See Auto Loader.
Example: Incremental batch ingestion using Auto Loader
df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("rescuedDataColumn", "_rescued_data")
.schema("/databricks-datasets/retail-org/customers/schema")
.load("/databricks-datasets/retail-org/customers/")
Fully-managed connectors
LakeFlow Connect provides fully-managed connectors to ingest from SaaS applications like Salesforce and databases like SQL Server. Managed connectors are available using the following:
- Databricks UI
- Databricks CLI
- Databricks APIs
- Databricks SDKs
- Databricks Asset Bundles (DABs)
See LakeFlow Connect.
Streaming ingestion
With streaming ingestion, you continuously load data rows or batches of data rows as they are generated so that you can query the data as it arrives in near real time. You can use streaming ingestion to load streaming data from sources like Apache Kafka, Amazon Kinesis, Google Pub/Sub, and Apache Pulsar.
Databricks also supports streaming ingestion using built-in connectors. These connectors allow you to incrementally and efficiently process new data as it arrives from streaming sources. See Configure streaming data sources.
Example: Streaming ingestion from Kafka
# Continuously read records from a Kafka topic as a streaming DataFrame.
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<server:ip>")
    .option("subscribe", "topic1")
    .option("startingOffsets", "latest")
    .load()
)
Batch and streaming ingestion with Delta Live Tables
Databricks recommends using Delta Live Tables to build reliable and scalable data processing pipelines. Delta Live Tables supports both batch and streaming ingestion, and you can ingest data from any data source supported by Auto Loader.
Example: Incremental batch ingestion using Delta Live Tables
import dlt

@dlt.table
def customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/databricks-datasets/retail-org/customers/")
    )
Example: Streaming ingestion from Kafka using Delta Live Tables
import dlt

@dlt.table
def kafka_raw():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<server:ip>")
        .option("subscribe", "topic1")
        .option("startingOffsets", "latest")
        .load()
    )
Ingestion schedules
You can ingest data as a one-time operation, on a recurring schedule, or continuously.
- For near real-time streaming use cases, use continuous mode.
- For batch ingestion use cases, ingest one time or set a recurring schedule.
See Ingestion with Jobs and Triggered vs. continuous pipeline mode.
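In code, the difference between scheduled and continuous ingestion usually comes down to the trigger used when writing the stream. The following is a minimal sketch with Auto Loader and Structured Streaming; the paths and table name are illustrative placeholders:

# Triggered (batch-style) run: process all new files since the last run, then stop.
# Suitable for one-time or scheduled ingestion. Paths and names are placeholders.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
    .load("<path-to-source-files>")
    .writeStream
    .option("checkpointLocation", "<path-to-checkpoint>")
    .trigger(availableNow=True)  # omit the trigger to keep the stream running continuously
    .toTable("<catalog>.<schema>.customers")
)

Omitting the trigger (or using a short processing-time trigger) keeps the same query running continuously for near real-time use cases.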
Ingestion partners
Many third-party tools support batch or streaming ingestion into Databricks. Databricks validates various third-party integrations, although the steps to configure access to source systems and ingest data vary by tool. See Ingestion partners for a list of validated tools. Some technology partners are also featured in Databricks Partner Connect, which provides a UI that simplifies connecting third-party tools to lakehouse data.
DIY ingestion
Databricks provides a general compute platform. As a result, you can create your own ingestion connectors using any programming language supported by Databricks, like Python or Java. You can also import and leverage popular open source connector libraries like data load tool (dlt), Airbyte, and Debezium.
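As a minimal sketch of a DIY connector (the REST endpoint, response fields, and target table name below are hypothetical), you could pull records from a source API with standard Python and append them to a Delta table:

import requests

# Hypothetical REST endpoint; replace with your source system's API.
records = requests.get("https://api.example.com/v1/customers", timeout=30).json()

# Build a DataFrame from the fetched records and append them to a Delta table.
# The target table name is an illustrative placeholder.
df = spark.createDataFrame(records)
df.write.format("delta").mode("append").saveAsTable("<catalog>.<schema>.customers_raw")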
Ingestion alternatives
Databricks recommends ingestion for most use cases because it scales to accommodate high data volumes, low-latency querying, and third-party API limits. Ingestion copies data from your source systems to Azure Databricks, which results in duplicate data that might become stale over time. If you don’t want to copy data, you can use the following tools:
- Lakehouse Federation allows you to query external data sources without moving your data.
- Delta Sharing allows you to securely share data across platforms, clouds, and regions.
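Both options surface the external or shared data as catalogs and tables that you can query in place. As a minimal sketch (the catalog, schema, and table names are illustrative placeholders):

# Query a table exposed through a foreign (federated) catalog or a Delta Sharing
# catalog without copying the data into the lakehouse. Names are placeholders.
df = spark.read.table("<foreign_or_shared_catalog>.<schema>.<table>")
df.limit(10).show()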
When to use Delta Sharing
Choose Delta Sharing for the following scenarios:
- Limiting data duplication
- Querying the freshest possible data
When to use Lakehouse Federation
Choose Lakehouse Federation for the following scenarios:
- Ad hoc reporting or proof-of-concept work on your ETL pipelines
Considerations when choosing an ingestion method
| Consideration | Guidance |
|---|---|
| Data source | If LakeFlow Connect native connectors exist for the data source, this will be the simplest way to ingest data. For data sources not supported by LakeFlow Connect, extract the data from the source and then use Auto Loader to ingest the data into Databricks. For local files, use the Databricks UI to upload the data. |
| Latency | If you want to analyze data in near real time, use streaming to leverage incremental processing. With streaming, the data is available for querying as soon as each record arrives. Otherwise, use batch ingestion. |
| Data movement | If you are unable to copy data from your source systems to Databricks, use Lakehouse Federation or Delta Sharing. |
Migrate data to Delta Lake
To learn how to migrate existing data to Delta Lake, see Migrate data to Delta Lake.
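For example, existing Parquet data can often be converted to Delta Lake in place instead of being re-ingested. The following is a minimal sketch using the CONVERT TO DELTA command; the path is an illustrative placeholder:

# Convert an existing directory of Parquet files into a Delta table in place.
# The path is an illustrative placeholder.
spark.sql("CONVERT TO DELTA parquet.`<path-to-parquet-directory>`")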
COPY INTO (legacy)
The CREATE STREAMING TABLE SQL command is the recommended alternative to the legacy COPY INTO SQL command for incremental ingestion from cloud object storage. See COPY INTO. For a more scalable and robust file ingestion experience, Databricks recommends that SQL users leverage streaming tables instead of COPY INTO.
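For reference, a legacy COPY INTO load looks like the following sketch; the target table and source path are illustrative placeholders. COPY INTO is idempotent, so files that have already been loaded are skipped on later runs:

# Legacy incremental load from cloud object storage with COPY INTO.
# The target table (which must already exist) and source path are placeholders.
spark.sql("""
  COPY INTO <catalog>.<schema>.customers
  FROM '<path-to-source-files>'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")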
Additional resources
What is the difference between streaming and incremental batch ingestion?