Configure a DLT pipeline

Article
03/04/2025

This article describes the basic configuration for DLT pipelines using the workspace UI.

Databricks recommends developing new pipelines using serverless. For configuration instructions for serverless pipelines, see Configure a serverless DLT pipeline.

The configuration instructions in this article use Unity Catalog. For instructions for configuring pipelines with legacy Hive metastore, see Use DLT pipelines with legacy Hive metastore.

This article discusses functionality for the current default publishing mode for pipelines. Pipelines created before February 5, 2025, might use the legacy publishing mode and LIVE virtual schema. See LIVE schema (legacy).

Note

The UI has an option to display and edit settings in JSON. You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.

JSON configuration files are also helpful when deploying pipelines to new environments or using the CLI or REST API.

For a complete reference to the DLT JSON configuration settings, see DLT pipeline configurations.

Configure a new DLT pipeline

To configure a new DLT pipeline, do the following:

Click DLT in the sidebar.
Click Create Pipeline.
Provide a unique Pipeline name.
(Optional) Use the file picker to configure notebooks and workspace files as Source code.
- If you don’t add any source code, a new notebook is created for the pipeline in a new directory in your user directory. After you create the pipeline, a link to this notebook appears under the Source code field in the Pipeline details panel.
- Use the Add source code button to add additional source code assets.
Select Unity Catalog under Storage options.
Select a Catalog. This setting controls the default catalog and the storage location for pipeline metadata.
Select a Schema in the catalog. By default, streaming tables and materialized views defined in the pipeline are created in this schema.
In the Compute section, check the box next to Use Photon Acceleration. For additional compute configuration considerations, see Compute configuration options.
Click Create.

A pipeline created with these recommended settings runs in Triggered mode and uses the Current channel. This configuration suits many use cases, including development and testing, and works well for production workloads that run on a schedule. For details on scheduling pipelines, see DLT pipeline task for jobs.
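The UI steps above correspond to a JSON specification that you can view or edit from the UI. The following is a minimal sketch in Python that assembles such a specification. The pipeline name, catalog, schema, and notebook path are hypothetical placeholders, and the field names (`catalog`, `schema`, `photon`, `continuous`, `channel`) should be checked against the DLT pipeline configurations reference before use.

```python
import json


def build_pipeline_settings(name, catalog, schema, notebook_paths):
    """Return a settings dict that mirrors the UI choices described above."""
    return {
        "name": name,
        "catalog": catalog,        # default catalog for the pipeline
        "schema": schema,          # default schema for tables and views
        "libraries": [{"notebook": {"path": p}} for p in notebook_paths],
        "photon": True,            # Use Photon Acceleration
        "continuous": False,       # False corresponds to Triggered mode
        "channel": "CURRENT",      # Current channel
    }


# All values below are hypothetical placeholders.
settings = build_pipeline_settings(
    name="example-pipeline",
    catalog="main",
    schema="example_schema",
    notebook_paths=["/Users/someone@example.com/example-notebook"],
)
print(json.dumps(settings, indent=2))
```

The printed JSON can be pasted into the JSON settings view or kept as a starting point for deployments through the CLI or REST API.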

Compute configuration options

Databricks recommends always using Enhanced autoscaling. Default values for other compute configurations work well for many pipelines.

Serverless pipelines remove compute configuration options. For configuration instructions for serverless pipelines, see Configure a serverless DLT pipeline.

Use the following settings to customize compute configurations:

Workspace admins can configure a Cluster policy. Compute policies allow admins to control what compute options are available to users. See Select a cluster policy.
You can optionally configure Cluster mode to run with Fixed size or Legacy autoscaling. See Optimize the cluster utilization of DLT pipelines with Autoscaling.
For workloads with autoscaling enabled, set Min workers and Max workers to bound scaling behavior. See Configure compute for a DLT pipeline.
You can optionally turn off Photon acceleration. See What is Photon?.

Use Cluster tags to help monitor costs associated with DLT pipelines. See Configure cluster tags.
Configure Instance types to specify the type of virtual machines used to run your pipeline. See Select instance types to run a pipeline.
- Select a Worker type optimized for the workloads configured in your pipeline.
- You can optionally select a Driver type that differs from your worker type. This can be useful for reducing costs in pipelines with large worker types and low driver compute utilization or for choosing a larger driver type to avoid out-of-memory issues in workloads with many small workers.
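The compute options above map to a cluster entry in the pipeline's JSON settings. The sketch below builds one such entry in Python. The instance type names and tag values are placeholders, and the field names (`label`, `autoscale`, `mode`, `node_type_id`, `driver_node_type_id`, `custom_tags`) are assumptions to verify against the DLT pipeline configurations reference.

```python
import json


def build_cluster_spec(min_workers, max_workers, worker_type,
                       driver_type=None, tags=None):
    """Build one cluster entry that uses Enhanced autoscaling."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    spec = {
        "label": "default",
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
            "mode": "ENHANCED",
        },
        "node_type_id": worker_type,
    }
    if driver_type is not None:
        # Optional: a driver type that differs from the worker type.
        spec["driver_node_type_id"] = driver_type
    if tags:
        # Cluster tags help attribute costs to a team or project.
        spec["custom_tags"] = dict(tags)
    return spec


# Placeholder instance types and tags; choose values that fit your workload.
cluster = build_cluster_spec(
    min_workers=1,
    max_workers=5,
    worker_type="Standard_D3_v2",
    driver_type="Standard_D4_v2",
    tags={"team": "data-eng"},
)
print(json.dumps({"clusters": [cluster]}, indent=2))
```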

Other configuration considerations

The following configuration options are also available for pipelines:

The Advanced product edition gives you access to all DLT features. You can optionally run pipelines using the Pro or Core product editions. See Choose a product edition.
You might choose to use the Continuous pipeline mode when running pipelines in production. See Triggered vs. continuous pipeline mode.
If your workspace is not configured for Unity Catalog or your workload needs to use legacy Hive metastore, see Use DLT pipelines with legacy Hive metastore.
Add Notifications for email updates based on success or failure conditions. See Add email notifications for pipeline events.
Use the Configuration field to set key-value pairs for the pipeline. These configurations serve two purposes:
- Set arbitrary parameters you can reference in your source code. See Use parameters with DLT pipelines.
- Configure pipeline settings and Spark configurations. See DLT properties reference.
Use the Preview channel to test your pipeline against pending DLT runtime changes and trial new features.
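Several of the options above can also be expressed in the JSON settings. The following Python sketch combines notifications, configuration key-value pairs, pipeline mode, and channel. The recipient address and the `mypipeline.*` keys are hypothetical, and the alert names and field names are assumptions to confirm in the DLT pipeline configurations reference.

```python
import json


def build_optional_settings(recipients, parameters,
                            use_preview=False, continuous=False):
    """Assemble optional pipeline settings as a dict."""
    return {
        "continuous": continuous,
        "channel": "PREVIEW" if use_preview else "CURRENT",
        "notifications": [
            {
                "email_recipients": list(recipients),
                "alerts": ["on-update-success", "on-update-failure"],
            }
        ],
        # Configuration values are stored as strings.
        "configuration": {key: str(value) for key, value in parameters.items()},
    }


# Hypothetical recipient and parameter names.
optional = build_optional_settings(
    recipients=["data-team@example.com"],
    parameters={"mypipeline.start_date": "2025-01-01",
                "mypipeline.batch_limit": 500},
    use_preview=True,
)
print(json.dumps(optional, indent=2))
```

In Python pipeline source code, a parameter set this way is typically read with `spark.conf.get("mypipeline.start_date")`. See Use parameters with DLT pipelines for the supported approach.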

Choose a product edition

Select the DLT product edition that best fits your pipeline requirements. The following product editions are available:

Core to run streaming ingest workloads. Select the Core edition if your pipeline doesn’t require advanced features such as change data capture (CDC) or DLT expectations.
Pro to run streaming ingest and CDC workloads. The Pro product edition supports all of the Core features, plus support for workloads that require updating tables based on changes in source data.
Advanced to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The Advanced product edition supports the features of the Core and Pro editions and includes data quality constraints with DLT expectations.

You can select the product edition when you create or edit a pipeline. You can choose a different edition for each pipeline. See the DLT product page.

Note: If your pipeline includes features that the selected product edition doesn’t support, such as expectations, you receive an error message that explains the reason. You can then edit the pipeline to select the appropriate edition.
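The edition descriptions above amount to a simple decision rule: expectations require Advanced, CDC requires at least Pro, and streaming ingest alone runs on Core. The helper below is an illustrative sketch of that rule. The function name is hypothetical, and the uppercase edition strings are an assumption about how the JSON `edition` setting is written.

```python
def choose_edition(needs_cdc: bool, needs_expectations: bool) -> str:
    """Return the lowest edition that covers the requested features."""
    if needs_expectations:
        return "ADVANCED"  # includes Core and Pro features
    if needs_cdc:
        return "PRO"       # includes Core features
    return "CORE"          # streaming ingest only


print(choose_edition(needs_cdc=True, needs_expectations=False))
```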

Configure source code

You can use the file selector in the DLT UI to configure the source code defining your pipeline. Pipeline source code is defined in Databricks notebooks or in SQL or Python scripts stored in workspace files. When you create or edit your pipeline, you can add one or more notebooks, workspace files, or a combination of both.

Because DLT automatically analyzes dataset dependencies to construct the processing graph for your pipeline, you can add source code assets in any order.
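To see why declaration order does not matter, consider a toy dependency graph. The sketch below is not how DLT is implemented. It uses Python's standard `graphlib` module and hypothetical dataset names to show that the same execution order results no matter how the declarations are listed.

```python
from graphlib import TopologicalSorter


def execution_order(declarations):
    """Order datasets so each one runs after the datasets it reads from.

    declarations: list of (dataset, [upstream datasets]) in any order.
    """
    graph = {name: set(upstream) for name, upstream in declarations}
    return list(TopologicalSorter(graph).static_order())


# The same three hypothetical datasets, declared in two different orders.
order_a = execution_order([
    ("gold", ["silver"]),
    ("silver", ["bronze"]),
    ("bronze", []),
])
order_b = execution_order([
    ("bronze", []),
    ("gold", ["silver"]),
    ("silver", ["bronze"]),
])
print(order_a)
print(order_b)
```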

You can modify the JSON file to include DLT source code defined in SQL and Python scripts stored in workspace files. The following example includes notebooks and workspace files:

{
  "name": "Example pipeline 3",
  "storage": "dbfs:/pipeline-examples/storage-location/example3",
  "libraries": [
    { "notebook": { "path": "/example-notebook_1" } },
    { "notebook": { "path": "/example-notebook_2" } },
    { "file": { "path": "/Workspace/Users/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.sql" } },
    { "file": { "path": "/Workspace/Users/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.py" } }
  ]
}
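A `libraries` list like the one above can also be generated programmatically, which may help when the same pipeline is deployed to several environments. The helper name below is hypothetical, and the paths are placeholders.

```python
import json


def build_libraries(notebook_paths, file_paths):
    """Combine notebooks and workspace files into one libraries list."""
    libraries = [{"notebook": {"path": p}} for p in notebook_paths]
    libraries += [{"file": {"path": p}} for p in file_paths]
    return libraries


# Placeholder paths; replace them with paths from your own workspace.
libraries = build_libraries(
    notebook_paths=["/example-notebook_1", "/example-notebook_2"],
    file_paths=["/Workspace/Shared/example/apply_changes_into.sql"],
)
print(json.dumps({"libraries": libraries}, indent=2))
```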

Manage external dependencies for pipelines that use Python

DLT supports using external dependencies in your pipelines, such as Python packages and libraries. To learn about options and recommendations for using dependencies, see Manage Python dependencies for DLT pipelines.

Use Python modules stored in your Azure Databricks workspace

In addition to implementing your Python code in Databricks notebooks, you can use Databricks Git folders or workspace files to store your code as Python modules. Storing your code as Python modules is especially useful when you have common functionality that you want to reuse across multiple pipelines or across notebooks in the same pipeline. To learn how to use Python modules with your pipelines, see Import Python modules from Git folders or workspace files.
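The general pattern is to put the module's directory on `sys.path` and then import it. The sketch below simulates this locally: a temporary directory stands in for a workspace or Git folder, and the module name `shared_transforms` and its function are hypothetical. In a real pipeline, point the path at the directory that holds your module instead.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A temporary directory stands in for a workspace or Git folder.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "shared_transforms.py").write_text(
    "def clean_name(value):\n"
    "    return value.strip().lower()\n"
)

# Make the directory importable, then import the shared module.
sys.path.append(str(module_dir))
shared_transforms = importlib.import_module("shared_transforms")

print(shared_transforms.clean_name("  Alice "))
```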
