Develop Delta Live Tables pipelines
Developing and testing pipeline code differs from other Apache Spark workloads. This article provides an overview of supported functionality, best practices, and considerations when developing pipeline code. For more recommendations and best practices, see Applying software development & DevOps best practices to Delta Live Table pipelines.
Note
You must add source code to a pipeline configuration to validate code or run an update. See Configure a Delta Live Tables pipeline.
What files are valid for pipeline source code?
Delta Live Tables pipeline code can be Python or SQL. You can have a mix of Python and SQL source code files backing a single pipeline, but each file can only contain one language. See Develop pipeline code with Python and Develop pipeline code with SQL.
You can use notebooks and workspace files when specifying source code for a pipeline. Workspace files represent Python or SQL scripts authored in your preferred IDE or the Databricks file editor. See What are workspace files?.
If you develop Python code as modules or libraries, you must install and import the code and then call methods from a Python notebook or workspace file configured as source code. See Manage Python dependencies for Delta Live Tables pipelines.
Note
If you need to use arbitrary SQL commands in a Python notebook, you can use the syntax pattern spark.sql("<QUERY>") to run SQL as Python code.
Unity Catalog functions allow you to register arbitrary Python user-defined functions for use in SQL. See User-defined functions (UDFs) in Unity Catalog.
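The spark.sql pattern from the note can be sketched as a small helper. This is a minimal illustration, not a prescribed approach: the function name recent_rows, the table name, and the date column are hypothetical, and spark is assumed to be the SparkSession that Databricks makes available in notebooks, passed in explicitly here.

```python
def recent_rows(spark, table_name, days=1):
    """Run a SQL query from Python code and return whatever spark.sql returns.

    Assumptions: `spark` is an active SparkSession, and `table_name` has a
    `date` column. Both names are placeholders for illustration only.
    """
    if not isinstance(days, int) or days < 1:
        raise ValueError("days must be a positive integer")
    # table_name is interpolated directly, so pass only trusted values.
    query = (
        f"SELECT * FROM {table_name} "
        f"WHERE date > current_date() - INTERVAL {days} DAY"
    )
    return spark.sql(query)
```

In a notebook this might be called as recent_rows(spark, "prod.input_data"), which yields a DataFrame that the rest of the Python code can transform.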
Overview of Delta Live Tables development features
Delta Live Tables extends and leverages many Azure Databricks features, and introduces new features and concepts. The following table provides a brief overview of concepts and features that support pipeline code development:
Feature | Description |
---|---|
Development mode | New pipelines are configured to run in development mode by default. Databricks recommends using development mode for interactive development and testing. See Development and production modes. |
Validate | A Validate update verifies the correctness of pipeline source code without running an update on any tables. See Check a pipeline for errors without waiting for tables to update. |
Notebooks | Notebooks configured as source code for a Delta Live Tables pipeline provide interactive options for validating code and running updates. See Develop and debug Delta Live Tables pipelines in notebooks. |
Parameters | Leverage parameters in source code and pipeline configurations to simplify testing and extensibility. See Use parameters with Delta Live Tables pipelines. |
Databricks Asset Bundles | Databricks Asset Bundles allow you to move pipeline configurations and source code between workspaces. See Convert a Delta Live Tables pipeline into a Databricks Asset Bundle project. |
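
As a rough illustration of the Parameters row, the sketch below reads a source path from a configuration object and falls back to a development default. The key mypipeline.input_path, the default path, and the function name are assumptions made for this example. The function only needs an object with a get(key, default) method, which spark.conf provides and which a plain dict can stand in for during local testing.

```python
def resolve_input_path(conf, key="mypipeline.input_path",
                       default="/dev/sample/data"):
    """Look up a pipeline parameter, falling back to a development default.

    `conf` is any object exposing get(key, default), such as spark.conf in a
    pipeline or a dict in a local test. The key and default are hypothetical.
    """
    path = conf.get(key, default)
    if not path:
        raise ValueError(f"parameter {key!r} is set but empty")
    return path
```

In pipeline source code this might be called as resolve_input_path(spark.conf), so the same code can point at different sources depending on the key-value pairs set in each pipeline's configuration.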
Create sample datasets for development and testing
Databricks recommends creating development and test datasets to test pipeline logic with expected data and potentially malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:
- Select a subset of data from a production dataset.
- Use anonymized or artificially generated data for sources containing PII.
- Create test data with well-defined outcomes based on downstream transformation logic.
- Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations.
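The last two approaches in this list can be combined in a small generator script. The following is a sketch under stated assumptions: the field names date and sensor_reading mirror the sensor example used in this article, the function name and the specific malformed cases are illustrative, and writing the lines to a location your pipeline reads from is left out.

```python
import json
import random


def make_test_records(n_valid=3, seed=0):
    """Return JSON lines: well-formed records first, then malformed ones."""
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    valid = [
        {"date": f"2021/09/{day:02d}",
         "sensor_reading": round(rng.uniform(20.0, 25.0), 1)}
        for day in range(1, n_valid + 1)
    ]
    malformed = [
        {"date": None, "sensor_reading": 21.5},           # null date
        {"date": "2021/09/04", "sensor_reading": "n/a"},  # wrong value type
        {"date": "2021/09/05"},                           # missing field
    ]
    lines = [json.dumps(record) for record in valid + malformed]
    # A truncated line that is not valid JSON at all.
    lines.append('{"date": "2021/09/06", "sensor_reading": ')
    return lines
```

Keeping the malformed cases explicit makes it easier to state in advance which records downstream logic should keep, drop, or flag.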
For example, if you have a notebook that defines a dataset using the following code:
CREATE OR REFRESH STREAMING TABLE input_data AS SELECT * FROM STREAM read_files("/production/data", format => "json")
You could create a sample dataset containing specific records using a query like the following:
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 AS sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 AS sensor_reading
The following example demonstrates filtering published data to create a subset of the production data for development or testing:
CREATE OR REFRESH MATERIALIZED VIEW input_data AS SELECT * FROM prod.input_data WHERE date > current_date() - INTERVAL 1 DAY
To use these different datasets, create multiple pipelines with the notebooks implementing the transformation logic. Each pipeline can read data from the LIVE.input_data dataset but is configured to include the notebook that creates the dataset specific to the environment.
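One way to picture this layout is as per-environment pipeline settings that share the transformation notebooks and differ only in the notebook that defines input_data. The sketch below builds such settings as Python dictionaries. The notebook paths, pipeline names, and helper function are hypothetical, and the dictionary layout (name, development, libraries) is assumed to follow the pipeline settings JSON.

```python
import json


def pipeline_settings(env, dataset_notebook, transform_notebooks):
    """Build settings for one environment's pipeline (illustrative only)."""
    paths = [dataset_notebook, *transform_notebooks]
    return {
        "name": f"sensor_pipeline_{env}",
        "development": env != "prod",  # only prod runs in production mode
        "libraries": [{"notebook": {"path": p}} for p in paths],
    }


# Hypothetical paths: shared logic plus one dataset notebook per environment.
shared = ["/Repos/project/transformations"]
dev = pipeline_settings("dev", "/Repos/project/input_data_dev", shared)
prod = pipeline_settings("prod", "/Repos/project/input_data_prod", shared)

print(json.dumps(dev, indent=2))
```

Because both pipelines include the same transformation notebooks, the logic under test is identical and only the definition of input_data changes between environments.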