Run an update on a DLT pipeline

Article
03/04/2025

This article explains pipeline updates and provides details on how to trigger an update.

What is a pipeline update?

After you create a pipeline and are ready to run it, you start an update. A pipeline update does the following:

Starts a cluster with the correct configuration.
Discovers all the defined tables and views and checks for analysis errors such as invalid column names, missing dependencies, and syntax errors.
Creates or updates tables and views with the most recent data available.

Using a validate update, you can check for problems in a pipeline’s source code without waiting for tables to be created or updated. This feature is useful when developing or testing pipelines because it lets you quickly find and fix errors in your pipeline, such as incorrect table or column names.

How are pipeline updates triggered?

Use one of the following options to start pipeline updates:

Update trigger	Details
Manual	You can manually trigger pipeline updates from the pipeline UI, the pipelines list, or a notebook attached to a pipeline. See Manually trigger a pipeline update and Develop and debug DLT pipelines in notebooks.
Scheduled	You can schedule updates for pipelines using jobs. See DLT pipeline task for jobs.
Programmatic	You can programmatically trigger updates using third-party tools, APIs, and CLIs. See Run a DLT pipeline in a workflow and Pipeline API.
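
As a rough illustration of the programmatic option, the sketch below builds a request for the Pipelines API endpoint `POST /api/2.0/pipelines/{pipeline_id}/updates` using only the Python standard library. The workspace URL, token, and pipeline ID are placeholders, the helper names are invented for this example, and the request-body fields (`full_refresh`, `validate_only`) should be checked against the Pipeline API reference before use.

```python
import json
import urllib.request


def build_update_payload(full_refresh=False, validate_only=False):
    """Build the JSON body for a pipeline update request."""
    payload = {}
    if full_refresh:
        payload["full_refresh"] = True
    if validate_only:
        payload["validate_only"] = True
    return payload


def build_update_request(host, token, pipeline_id, payload):
    """Create (but do not send) the HTTP request that starts an update."""
    url = f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_update(host, token, pipeline_id, **options):
    """Send the request and return the parsed JSON response."""
    request = build_update_request(
        host, token, pipeline_id, build_update_payload(**options)
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `start_update("https://<workspace-url>", "<token>", "<pipeline-id>")` would request a default refresh, and passing `validate_only=True` would request a Validate update instead.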

Manually trigger a pipeline update

Use one of the following options to manually trigger a pipeline update:

Click the Start button on the pipeline details page.
From the pipelines list, click the start icon in the Actions column.

Note

The default behavior for manually triggered pipeline updates is to refresh all datasets defined in the pipeline.

Pipeline refresh semantics

The following table describes the behaviors for materialized views and streaming tables for default refresh and full refresh:

Update type	Materialized view semantics	Streaming table semantics
Refresh (default)	Updates results to reflect the current results for the defining query.	Processes new records through logic defined in streaming tables and flows.
Full refresh	Updates results to reflect the current results for the defining query.	Clears data from streaming tables, clears state information (checkpoints) from flows, and reprocesses all records from the data source.

By default, all materialized views and streaming tables in a pipeline refresh with each update. You can optionally omit tables from updates using the following features:

Select tables for refresh: Use this UI to add or remove materialized views and streaming tables before running an update. See Start a pipeline update for selected tables.
Refresh failed tables: Start an update for failed materialized views and streaming tables, including downstream dependencies. See Start a pipeline update for failed tables.

Both of these features support default refresh semantics or full refresh. You can optionally use the Select tables for refresh dialog to exclude additional tables when running a refresh for failed tables.

Should I use a full refresh?

Databricks recommends running full refreshes only when necessary. A full refresh always reprocesses all records from the specified data sources through the logic that defines the dataset. The time and resources needed to complete a full refresh increase with the size of the source data.

Materialized views return the same results whether default or full refresh is used. Using a full refresh with streaming tables resets all state processing and checkpoint information and can result in dropped records if input data is no longer available.

Databricks recommends full refresh only when the input data sources contain the data needed to recreate the desired state of the table or view. Consider the following scenarios, in which input source data is no longer available, and the outcome of running a full refresh in each:

Data source	Reason input data is absent	Outcome of full refresh
Kafka	Short retention threshold	Records no longer present in the Kafka source are dropped from the target table.
Files in object storage	Lifecycle policy	Data files no longer present in the source directory are dropped from the target table.
Records in a table	Deleted for compliance	Only records present in the source table are processed.
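
The scenarios above can be illustrated with a toy model. This is not DLT code; `ToyStreamingTable` and its methods are hypothetical names that only mimic the described semantics: a default refresh processes records past a checkpoint, while a full refresh clears the target and the checkpoint and reprocesses whatever the source still retains.

```python
class ToyStreamingTable:
    """Toy stand-in for a streaming table fed by an offset-addressed source."""

    def __init__(self):
        self.rows = []        # records written to the target table
        self.checkpoint = -1  # highest source offset already processed

    def refresh(self, source):
        """Default refresh: process only records newer than the checkpoint."""
        for offset in sorted(source):
            if offset > self.checkpoint:
                self.rows.append(source[offset])
                self.checkpoint = offset

    def full_refresh(self, source):
        """Full refresh: clear data and checkpoint, then reprocess the source."""
        self.rows = []
        self.checkpoint = -1
        self.refresh(source)


# The source maps offsets to records, like a Kafka topic with retention.
source = {0: "a", 1: "b", 2: "c"}
table = ToyStreamingTable()
table.refresh(source)            # table.rows is now ["a", "b", "c"]

# Retention expires the two oldest records; one new record arrives.
del source[0], source[1]
source[3] = "d"

table.refresh(source)            # keeps history: ["a", "b", "c", "d"]
rows_after_default = list(table.rows)

table.full_refresh(source)       # only retained records remain: ["c", "d"]
rows_after_full = list(table.rows)
```

In this model the default refresh preserves records that have since aged out of the source, whereas the full refresh loses them, which is the behavior the table describes for a Kafka source with a short retention threshold.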

To prevent full refreshes from being run on a table or view, set the table property pipelines.reset.allowed to false. See DLT table properties. You can also use an append flow to append data to an existing streaming table without requiring a full refresh.

Start a pipeline update for selected tables

You can optionally reprocess data for only selected tables in your pipeline. For example, during development you might change only a single table and want to reduce testing time, or a pipeline update might fail and you want to refresh only the failed tables.

Note

You can use selective refresh only with triggered pipelines.

To start an update that refreshes selected tables only, on the Pipeline details page:

Click Select tables for refresh. The Select tables for refresh dialog appears.

If you do not see the Select tables for refresh button, confirm that the Pipeline details page displays the latest update and that the update is complete. If a DAG is not shown for the latest update, for example, because the update failed, the Select tables for refresh button is not displayed.
To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.
Click Refresh selection.

Note

The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data already ingested for the selected tables, click the down caret next to the Refresh selection button and click Full Refresh selection.
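
For pipelines driven through the API rather than the UI, a comparable selection can be expressed in the update request body. The sketch below assumes the `refresh_selection` and `full_refresh_selection` fields of the Pipelines API update request; the helper name and the table names are made up for illustration, and the overlap check is a choice made for this example rather than a documented rule.

```python
def build_selective_payload(refresh=(), full_refresh=()):
    """Build an update request body that targets only the named tables."""
    refresh, full_refresh = list(refresh), list(full_refresh)
    overlap = set(refresh) & set(full_refresh)
    if overlap:
        # In this sketch, a table is listed for one kind of refresh, not both.
        raise ValueError(f"tables selected twice: {sorted(overlap)}")
    if not refresh and not full_refresh:
        raise ValueError("select at least one table")
    payload = {}
    if refresh:
        payload["refresh_selection"] = refresh
    if full_refresh:
        payload["full_refresh_selection"] = full_refresh
    return payload


# Hypothetical table names: default refresh for one, full refresh for another.
payload = build_selective_payload(
    refresh=["sales_orders_cleaned"],
    full_refresh=["sales_order_in_la"],
)
```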

Start a pipeline update for failed tables

If a pipeline update fails because of errors in one or more tables in the pipeline graph, you can start an update of only failed tables and any downstream dependencies.

Note

Excluded tables are not refreshed, even if they depend on a failed table.

To update failed tables, on the Pipeline details page, click Refresh failed tables.

To update only selected failed tables:

Click the down caret next to the Refresh failed tables button and click Select tables for refresh. The Select tables for refresh dialog appears.
To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.
Click Refresh selection.

Note

The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data already ingested for the selected tables, click the down caret next to the Refresh selection button and click Full Refresh selection.
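
The selection rule in this section (failed tables plus their downstream dependencies, minus excluded tables) can be sketched as a graph traversal. This is an illustration only, not how DLT computes the set internally; the function name, the graph encoding, and the table names are hypothetical, and the sketch assumes that excluding a table does not exclude tables further downstream.

```python
from collections import deque


def tables_to_refresh(downstream, failed, excluded=()):
    """Return failed tables and everything downstream, minus exclusions.

    downstream maps each table name to the tables that read from it.
    """
    selected = set()
    queue = deque(failed)
    while queue:
        table = queue.popleft()
        if table in selected:
            continue
        selected.add(table)
        queue.extend(downstream.get(table, ()))
    return selected - set(excluded)


# Hypothetical pipeline graph: raw -> cleaned -> {daily, by_region}
graph = {
    "raw": ["cleaned"],
    "cleaned": ["daily", "by_region"],
}

all_failed = tables_to_refresh(graph, failed=["cleaned"])
without_daily = tables_to_refresh(graph, failed=["cleaned"], excluded=["daily"])
```

Here `all_failed` contains `cleaned`, `daily`, and `by_region`, while `without_daily` omits `daily` even though it depends on the failed table.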

Check a pipeline for errors without waiting for tables to update

Important

The DLT Validate update feature is in Public Preview.

To check whether a pipeline’s source code is valid without running a full update, use Validate. A Validate update resolves the definitions of datasets and flows defined in the pipeline but does not materialize or publish any datasets. Errors found during validation, such as incorrect table or column names, are reported in the UI.

To run a Validate update, click the down caret next to Start on the pipeline details page and click Validate.

After the Validate update is complete, the event log shows events related only to the Validate update, and no metrics are displayed in the DAG. If errors are found, details are available in the event log.

You can see results for only the most recent Validate update. If the Validate update was the most recently run update, you can see the results by selecting it in the update history. If another update is run after the Validate update, the results are no longer available in the UI.

Development and production modes

You can optimize pipeline execution by switching between development and production modes. Use the environment toggle buttons in the Pipelines UI to switch between these two modes. By default, pipelines run in development mode.

When you run your pipeline in development mode, the DLT system does the following:

Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the pipelines.clusterShutdown.delay setting. See Configure compute for a DLT pipeline.
Disables pipeline retries so you can immediately detect and fix errors.

In production mode, the DLT system does the following:

Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
Retries execution in the event of specific errors, such as a failure to start a cluster.

Note

Switching between development and production modes only controls cluster and pipeline execution behavior. Storage locations and target schemas in the catalog for publishing tables must be configured as part of pipeline settings and are not affected when switching between modes.
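
In pipeline settings, the mode is commonly represented by a boolean `development` field, and the shutdown delay by an entry under `configuration`. The sketch below assumes that JSON layout; the helper name, the pipeline name, the `target` value, and the `60s` delay are illustrative, so confirm the field names against your pipeline's settings JSON.

```python
import copy
import json


def with_mode(settings, development, shutdown_delay=None):
    """Return a copy of the pipeline settings with the mode switched."""
    updated = copy.deepcopy(settings)
    updated["development"] = development
    if shutdown_delay is not None:
        updated.setdefault("configuration", {})[
            "pipelines.clusterShutdown.delay"
        ] = shutdown_delay
    return updated


# Hypothetical settings; the target schema is untouched by the mode switch.
base = {"name": "example_pipeline", "target": "analytics", "development": True}

dev = with_mode(base, development=True, shutdown_delay="60s")
prod = with_mode(base, development=False)
print(json.dumps(prod, indent=2))
```

Only the `development` flag and the optional delay change between `dev` and `prod`; the target schema stays the same, mirroring the note above.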
