Develop DLT pipelines

Article
03/04/2025

Developing and testing pipeline code differs from developing and testing other Apache Spark workloads. This article provides an overview of the supported functionality, best practices, and considerations for developing pipeline code. For more recommendations and best practices, see Applying software development & DevOps best practices to DLT pipelines.

Note

You must add source code to a pipeline configuration to validate code or run an update. See Configure a DLT pipeline.

What files are valid for pipeline source code?

DLT pipeline code can be written in Python or SQL. A single pipeline can mix Python and SQL source code files, but each file can contain only one language. See Develop pipeline code with Python and Develop pipeline code with SQL.

You can use notebooks and workspace files when specifying source code for a pipeline. Workspace files represent Python or SQL scripts authored in your preferred IDE or the Databricks file editor. See What are workspace files?.

If you develop Python code as modules or libraries, you must install and import the code and then call methods from a Python notebook or workspace file configured as source code. See Manage Python dependencies for DLT pipelines.

Note

If you need to use arbitrary SQL commands in a Python notebook, you can use the syntax pattern spark.sql("<QUERY>") to run SQL as Python code.

Unity Catalog functions allow you to register arbitrary Python user-defined functions for use in SQL. See User-defined functions (UDFs) in Unity Catalog.
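The spark.sql("<QUERY>") pattern from the note above can be sketched as follows. This is a minimal illustration, not a prescribed approach: the helper name recent_rows, its parameters, and the table and column names are hypothetical. The spark argument is assumed to be the SparkSession available in a Databricks notebook, passed in explicitly so the helper can also be exercised with a stand-in object.

```python
def recent_rows(spark, table_name, days=1):
    """Run a SQL query from Python and return whatever spark.sql returns.

    `spark` is assumed to be a SparkSession (or any object with a
    compatible `sql` method). The table and column names are illustrative.
    """
    # Build the SQL text in Python, then hand it to spark.sql.
    query = (
        f"SELECT * FROM {table_name} "
        f"WHERE date > current_date() - INTERVAL {int(days)} DAY"
    )
    return spark.sql(query)
```

In a notebook, a call such as recent_rows(spark, "prod.input_data") would return a DataFrame that later Python code can transform further.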

Overview of DLT development features

DLT extends and leverages many Azure Databricks features, and introduces new features and concepts. The following table provides a brief overview of concepts and features that support pipeline code development:

Feature	Description
Development mode	New pipelines are configured to run in development mode by default. Databricks recommends using development mode for interactive development and testing. See Development and production modes.
Validate	A `Validate` update verifies the correctness of pipeline source code without running an update on any tables. See Check a pipeline for errors without waiting for tables to update.
Notebooks	Notebooks configured as source code for a DLT pipeline provide interactive options for validating code and running updates. See Develop and debug DLT pipelines in notebooks.
Parameters	Leverage parameters in source code and pipeline configurations to simplify testing and extensibility. See Use parameters with DLT pipelines.
Databricks Asset Bundles	Databricks Asset Bundles allow you to move pipeline configurations and source code between workspaces. See Convert a DLT pipeline into a Databricks Asset Bundle project.
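As a rough sketch of the Parameters row, the helper below reads an input location from the pipeline configuration and falls back to a development default. It assumes that configuration key-value pairs set on the pipeline are readable through spark.conf.get. The key name mypipeline.input_path, the default path, and the helper name are hypothetical.

```python
def input_path(spark, default="/development/data"):
    """Return the input location configured for this pipeline.

    Assumes pipeline configuration values are exposed through spark.conf.
    "mypipeline.input_path" is a hypothetical key chosen for illustration.
    """
    return spark.conf.get("mypipeline.input_path", default)
```

With this shape, the same source code can point at different data in each pipeline by changing only the pipeline configuration.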

Create sample datasets for development and testing

Databricks recommends creating development and test datasets to test pipeline logic with expected data and potentially malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:

Select a subset of data from a production dataset.
Use anonymized or artificially generated data for sources containing PII.
Create test data with well-defined outcomes based on downstream transformation logic.
Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations.
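One way to prepare records that break schema expectations is to generate them in plain Python and write them out as JSON lines. The sketch below is only an illustration: the function names are hypothetical, and the two-field schema (date, sensor_reading) is assumed from the sample queries in this article.

```python
import json
from datetime import datetime


def build_test_records():
    """Return JSON lines mixing well-formed and deliberately broken records."""
    valid = [
        {"date": "2021/09/04", "sensor_reading": 22.4},
        {"date": "2021/09/05", "sensor_reading": 21.5},
    ]
    broken = [
        {"date": "2021/09/06"},                           # missing field
        {"date": "not-a-date", "sensor_reading": 20.1},   # unparseable date
        {"date": "2021/09/07", "sensor_reading": "n/a"},  # wrong type
    ]
    lines = [json.dumps(record) for record in valid + broken]
    # A truncated line simulates a corrupt record that is not valid JSON.
    lines.append('{"date": "2021/09/08", "sensor_reading": ')
    return lines


def is_well_formed(line):
    """Check one JSON line against the assumed two-field schema."""
    try:
        record = json.loads(line)
        datetime.strptime(record["date"], "%Y/%m/%d")
    except (KeyError, TypeError, ValueError):
        # ValueError also covers json.JSONDecodeError.
        return False
    return isinstance(record.get("sensor_reading"), (int, float))
```

Because each record has a known outcome, the number of rows that the transformation logic should accept or reject can be checked against the expected count.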

For example, if you have a notebook that defines a dataset using the following code:

CREATE OR REFRESH STREAMING TABLE input_data AS
SELECT * FROM STREAM read_files("/production/data", format => "json")

You could create a sample dataset containing specific records using a query like the following:

CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 AS sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 AS sensor_reading

The following example demonstrates filtering published data to create a subset of the production data for development or testing:

CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT * FROM prod.input_data
WHERE date > current_date() - INTERVAL 1 DAY

To use these different datasets, create multiple pipelines that all include the notebooks implementing the transformation logic. Each pipeline reads from the input_data dataset, but is also configured to include the notebook that creates the version of that dataset specific to its environment.
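The per-environment layout described above might be expressed as pipeline settings along these lines. This is a sketch under assumptions: the notebook paths and pipeline names are hypothetical, and the name, development, and libraries fields are assumed to follow the shape of pipeline JSON settings.

```python
def pipeline_settings(environment):
    """Build a settings dict for the 'dev' or 'prod' pipeline.

    All notebook paths are hypothetical placeholders.
    """
    # Notebook that defines input_data differs per environment.
    ingestion_notebooks = {
        "dev": "/Repos/project/ingest_sample_data",
        "prod": "/Repos/project/ingest_production_data",
    }
    if environment not in ingestion_notebooks:
        raise ValueError(f"unknown environment: {environment}")

    # Notebook with the transformation logic is shared by every pipeline.
    transformations = "/Repos/project/transformations"

    return {
        "name": f"sensor_pipeline_{environment}",
        "development": environment != "prod",
        "libraries": [
            {"notebook": {"path": ingestion_notebooks[environment]}},
            {"notebook": {"path": transformations}},
        ],
    }
```

Only the first library entry changes between environments, so the transformation logic under test is the same code that runs in production.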
