Run an update on a Delta Live Tables pipeline

This article explains pipeline updates and provides details on how to trigger an update.

What is a pipeline update?

After you create a pipeline and are ready to run it, you start an update. A pipeline update does the following:

  • Starts a cluster with the correct configuration.
  • Discovers all the defined tables and views and checks for any analysis errors, such as invalid column names, missing dependencies, and syntax errors.
  • Creates or updates tables and views with the most recent data available.

Using a validate update, you can check for problems in a pipeline’s source code without waiting for tables to be created or updated. This feature is useful when developing or testing pipelines because it lets you quickly find and fix errors in your pipeline, such as incorrect table or column names.

How are pipeline updates triggered?

Use one of the following options to start pipeline updates:

| Update trigger | Details |
| --- | --- |
| Manual | You can manually trigger pipeline updates from the pipeline UI, the pipelines list, or a notebook attached to a pipeline. See Manually trigger a pipeline update and Develop and debug Delta Live Tables pipelines in notebooks. |
| Scheduled | You can schedule updates for pipelines using jobs. See Delta Live Tables pipeline task for jobs. |
| Programmatic | You can programmatically trigger updates using third-party tools, APIs, and CLIs (see the sketch after this table). See Run a Delta Live Tables pipeline in a workflow and Pipeline API. |
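
For example, a script or an orchestration tool can start a default refresh by calling the Pipeline API directly. The following is a minimal sketch; the workspace host, token, and pipeline ID are placeholders for your own values.

```python
import os
import requests

# Minimal sketch: start a pipeline update through the Pipeline API.
# The host, token, and pipeline ID below are placeholders.
host = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "1234-567890-abcdef"     # hypothetical pipeline ID

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={},  # an empty body requests a default refresh of all tables
)
resp.raise_for_status()
print("Started update:", resp.json().get("update_id"))
```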

Manually trigger a pipeline update

Use one of the following options to manually trigger a pipeline update:

  • Click the Start button on the pipeline details page.
  • From the pipelines list, click the right arrow icon in the Actions column.

Note

The default behavior for manually triggered pipeline updates is to refresh all datasets defined in the pipeline.

Pipeline refresh semantics

The following table describes the behaviors for materialized views and streaming tables for default refresh and full refresh:

| Update type | Materialized view semantics | Streaming table semantics |
| --- | --- | --- |
| Refresh (default) | Updates results to reflect the current results of the defining query. | Processes new records through the logic defined in streaming tables and flows. |
| Full refresh | Updates results to reflect the current results of the defining query. | Clears data from streaming tables, clears state information (checkpoints) from flows, and reprocesses all records from the data source. |

By default, all materialized views and streaming tables in a pipeline refresh with each update. You can optionally omit tables from updates using the following features:

  • Refresh selection: Refresh only the tables you select. See Start a pipeline update for selected tables.
  • Refresh failed tables: Refresh only the tables that failed in the previous update and their downstream dependencies. See Start a pipeline update for failed tables.

Both of these features support default refresh semantics or full refresh. You can optionally use the Select tables for refresh dialog to exclude additional tables when running a refresh for failed tables.

Should I use a full refresh?

Databricks recommends running full refreshes only when necessary. A full refresh always reprocesses all records from the specified data sources through the logic that defines the dataset. The time and resources required to complete a full refresh scale with the size of the source data.

Materialized views return the same results whether a default or full refresh is used. Using a full refresh with streaming tables resets all processing state and checkpoint information and can result in dropped records if input data is no longer available.

Databricks recommends a full refresh only when the input data sources contain the data needed to recreate the desired state of the table or view. Consider the following scenarios in which input source data is no longer available, and the outcome of running a full refresh in each case:

| Data source | Reason input data is absent | Outcome of full refresh |
| --- | --- | --- |
| Kafka | Short retention threshold | Records no longer present in the Kafka source are dropped from the target table. |
| Files in object storage | Lifecycle policy | Data files no longer present in the source directory are dropped from the target table. |
| Records in a table | Deleted for compliance | Only records present in the source table are processed. |

To prevent full refreshes from being run on a table or view, set the table property pipelines.reset.allowed to false. See Delta Live Tables table properties. You can also use an append flow to append data to an existing streaming table without requiring a full refresh.
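
For example, in a Python pipeline you can set this property when you define the table. The following is a minimal sketch; the table name and source path are hypothetical, and spark is provided by the pipeline runtime.

```python
import dlt

# Minimal sketch: block full refreshes on this streaming table by setting the
# pipelines.reset.allowed table property to false. The table name and source
# path are hypothetical.
@dlt.table(
    name="orders_bronze",
    table_properties={"pipelines.reset.allowed": "false"},
)
def orders_bronze():
    # Incrementally ingest new files with Auto Loader; `spark` is provided by
    # the Delta Live Tables runtime.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/raw_orders")
    )
```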

Start a pipeline update for selected tables

You can optionally reprocess data for only selected tables in your pipeline. For example, during development, you might change only a single table and want to reduce testing time, or a pipeline update might fail and you want to refresh only the failed tables.

Note

You can use selective refresh only with triggered pipelines.

To start an update that refreshes selected tables only, on the Pipeline details page:

  1. Click Select tables for refresh. The Select tables for refresh dialog appears.

    If you do not see the Select tables for refresh button, confirm that the Pipeline details page displays the latest update and that the update is complete. If a DAG is not shown for the latest update, for example, because the update failed, the Select tables for refresh button is not displayed.

  2. To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.

  3. Click Refresh selection.

    Note

    The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data already ingested for the selected tables, click the down caret next to the Refresh selection button and then click Full Refresh selection.
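
If you trigger updates programmatically, you can make the same selection through the Pipeline API. The following is a minimal sketch that assumes the start-update request accepts refresh_selection and full_refresh_selection fields as described in the Pipeline API reference; the host, token, pipeline ID, and table names are placeholders.

```python
import os
import requests

# Minimal sketch: selectively refresh tables through the Pipeline API. The
# refresh_selection and full_refresh_selection fields are assumptions based on
# the Pipeline API reference; all identifiers below are placeholders.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "1234-567890-abcdef"  # hypothetical pipeline ID

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "refresh_selection": ["sales_orders_cleaned"],   # default refresh semantics
        "full_refresh_selection": ["sales_orders_raw"],  # reprocess all records
    },
)
resp.raise_for_status()
print("Started update:", resp.json().get("update_id"))
```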

Start a pipeline update for failed tables

If a pipeline update fails because of errors in one or more tables in the pipeline graph, you can start an update of only failed tables and any downstream dependencies.

Note

Excluded tables are not refreshed, even if they depend on a failed table.

To update failed tables, on the Pipeline details page, click Refresh failed tables.

To update only selected failed tables:

  1. Click the down caret next to the Refresh failed tables button and click Select tables for refresh. The Select tables for refresh dialog appears.

  2. To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.

  3. Click Refresh selection.

    Note

    The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data already ingested for the selected tables, click the down caret next to the Refresh selection button and then click Full Refresh selection.

Check a pipeline for errors without waiting for tables to update

Important

The Delta Live Tables Validate update feature is in Public Preview.

To check whether a pipeline’s source code is valid without running a full update, use Validate. A Validate update resolves the definitions of datasets and flows defined in the pipeline but does not materialize or publish any datasets. Errors found during validation, such as incorrect table or column names, are reported in the UI.

To run a Validate update, on the pipeline details page, click the down caret next to Start and then click Validate.

After the Validate update is complete, the event log shows events related only to the Validate update, and no metrics are displayed in the DAG. If errors are found, details are available in the event log.

You can see results only for the most recent Validate update. If the Validate update was the most recent update, you can see its results by selecting it in the update history. If another update runs after the Validate update, the results are no longer available in the UI.
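
You can also start a Validate update programmatically. The following is a minimal sketch that assumes the start-update request in the Pipeline API accepts a validate_only flag; the host, token, and pipeline ID are placeholders.

```python
import os
import requests

# Minimal sketch: start a Validate update through the Pipeline API. The
# validate_only field is an assumption based on the Pipeline API reference;
# the host, token, and pipeline ID are placeholders.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "1234-567890-abcdef"  # hypothetical pipeline ID

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"validate_only": True},  # resolve definitions without updating any tables
)
resp.raise_for_status()
print("Started Validate update:", resp.json().get("update_id"))
```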

Development and production modes

You can optimize pipeline execution by switching between development and production modes. Use the toggle buttons in the Pipelines UI to switch between these two modes. By default, pipelines run in development mode.

When you run your pipeline in development mode, the Delta Live Tables system does the following:

  • Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this behavior with the pipelines.clusterShutdown.delay setting in the pipeline configuration (a sketch of this setting follows this list). See Configure compute for a Delta Live Tables pipeline.
  • Disables pipeline retries so you can immediately detect and fix errors.
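
For illustration, the following sketch shows where these development-mode settings live in a pipeline's settings, expressed as a Python dict. The 30-minute shutdown delay and its format are assumed example values.

```python
# Minimal sketch of the pipeline settings fields relevant to development mode.
# The delay value and its format are illustrative assumptions.
pipeline_settings_fragment = {
    "development": True,  # run the pipeline in development mode
    "configuration": {
        # Keep the development-mode cluster running for 30 minutes after an
        # update instead of the default two hours.
        "pipelines.clusterShutdown.delay": "30m",
    },
}
```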

In production mode, the Delta Live Tables system does the following:

  • Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
  • Retries execution in the event of specific errors, such as a failure to start a cluster.

Note

Switching between development and production modes only controls cluster and pipeline execution behavior. Storage locations and target schemas in the catalog for publishing tables must be configured as part of pipeline settings and are not affected when switching between modes.