Run an update on a Delta Live Tables pipeline
This article explains what a Delta Live Tables pipeline update is and how to run one.
After you create a pipeline and are ready to run it, you start an update. A pipeline update does the following:
- Starts a cluster with the correct configuration.
- Discovers all the defined tables and views and checks for analysis errors such as invalid column names, missing dependencies, and syntax errors.
- Creates or updates tables and views with the most recent data available.
Using a validate update, you can check for problems in a pipeline’s source code without waiting for tables to be created or updated. This feature is useful when developing or testing pipelines because it lets you quickly find and fix errors in your pipeline, such as incorrect table or column names.
To learn how to create a pipeline, see Configure a Delta Live Tables pipeline.
You can orchestrate pipeline updates with Databricks jobs or other tools. See Run a Delta Live Tables pipeline in a workflow.
Start a pipeline update
Azure Databricks has several options to start pipeline updates, including the following:
- In the Delta Live Tables UI, you have the following options:
- Click the Start button on the pipeline details page.
- From the pipelines list, click in the Actions column.
- To start an update in a notebook, attach the notebook to a configured pipeline and click Start. See Develop and debug Delta Live Tables pipelines in notebooks.
- You can trigger pipelines programmatically using the API or CLI. See Pipeline API.
- You can schedule the pipeline as a job using the Delta Live Tables UI or the jobs UI. See Schedule a pipeline.
Note
The default behavior for manually triggered pipeline updates using any of these methods is to refresh all.
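As a rough illustration of triggering an update programmatically, the sketch below builds (but does not send) an HTTP request to start an update. It assumes the Pipelines REST API exposes `POST /api/2.0/pipelines/{pipeline_id}/updates` and that an empty JSON body requests the default update; the host, pipeline ID, and token are placeholders, and the helper name `build_start_update_request` is hypothetical.

```python
import json
import urllib.request


def build_start_update_request(host: str, pipeline_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that starts a pipeline update."""
    # Endpoint path is an assumption based on the Pipelines REST API.
    url = f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates"
    body = json.dumps({}).encode("utf-8")  # empty body: assumed to request the default update
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values, for illustration only.
request = build_start_update_request(
    "https://example.azuredatabricks.net", "my-pipeline-id", "my-token"
)
print(request.get_method(), request.full_url)
# To actually send the request: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything is triggered.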
How Delta Live Tables updates tables and views
Important
A full refresh of a streaming table or materialized view truncates and recomputes the table or view to reflect the current state of its input data sources. For streaming tables, checkpoints are also reset. If records have been removed from the data sources, for example, because of data retention policies, manual deletion, or sources with short retention periods such as Kafka, the state of the table or view after a full refresh might differ from the previous state. Additionally, the time and resources needed to complete a full refresh scale with the size of the source data.
Databricks recommends running full refreshes only when necessary, and when the input data sources contain the data to recreate the state of the table or view. To prevent full refreshes from being run on a table or view, set the table property pipelines.reset.allowed to false. See Delta Live Tables table properties. You can also use an append flow to append data to an existing streaming table without requiring a full refresh.
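The following toy model, written in plain Python rather than Delta Live Tables code, illustrates why a full refresh can change the contents of a streaming table when the source no longer retains older records. The functions `refresh` and `full_refresh`, the offset-keyed source, and the checkpoint representation are all simplifications invented for this sketch.

```python
def refresh(table: list, checkpoint: int, source: dict) -> tuple:
    """Incremental refresh: append only source records newer than the checkpoint."""
    new_offsets = sorted(offset for offset in source if offset > checkpoint)
    table = table + [source[offset] for offset in new_offsets]
    checkpoint = max(new_offsets, default=checkpoint)
    return table, checkpoint


def full_refresh(source: dict) -> tuple:
    """Full refresh: start from an empty table and a reset checkpoint, then reload."""
    return refresh([], -1, source)


# The source initially retains offsets 0 through 2.
source = {0: "a", 1: "b", 2: "c"}
table, checkpoint = refresh([], -1, source)

# A retention policy removes offset 0, and one new record arrives.
del source[0]
source[3] = "d"

incremental, _ = refresh(table, checkpoint, source)
rebuilt, _ = full_refresh(source)
print(incremental)  # ['a', 'b', 'c', 'd']
print(rebuilt)      # ['b', 'c', 'd']
```

The incremental refresh keeps record `a` because it was ingested earlier, while the full refresh loses it because the source no longer holds it.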
Which tables and views are updated, and how they are updated, depends on the update type:
- Refresh all: All tables are updated to reflect the current state of their input data sources. For streaming tables, new rows are appended to the table.
- Full refresh all: All tables are updated to reflect the current state of their input data sources. For streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.
- Refresh selection: The behavior of refresh selection is identical to refresh all but allows you to refresh only selected tables. Selected tables are updated to reflect the current state of their input data sources. For streaming tables, new rows are appended to the table.
- Full refresh selection: The behavior of full refresh selection is identical to full refresh all but allows you to perform a full refresh of only selected tables. Selected tables are updated to reflect the current state of their input data sources. For streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.
For existing materialized views, an update has the same behavior as a SQL REFRESH on a materialized view. For new materialized views, the behavior is the same as a SQL CREATE operation.
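When updates are started through the API, the four update types described above can be expressed as different request bodies. The sketch below assumes the start-update request accepts the fields `full_refresh`, `refresh_selection`, and `full_refresh_selection`; check these names against the Pipeline API reference before relying on them. The helper `update_request_body` and the table names are hypothetical.

```python
def update_request_body(update_type: str, tables=None) -> dict:
    """Map an update type to a request body (field names are assumptions)."""
    tables = list(tables or [])
    if update_type == "refresh all":
        return {"full_refresh": False}
    if update_type == "full refresh all":
        return {"full_refresh": True}
    if update_type in ("refresh selection", "full refresh selection"):
        if not tables:
            raise ValueError("a selection update needs at least one table")
        key = "refresh_selection" if update_type == "refresh selection" else "full_refresh_selection"
        return {key: tables}
    raise ValueError(f"unknown update type: {update_type!r}")


print(update_request_body("refresh all"))
print(update_request_body("full refresh selection", ["sales_orders_cleaned"]))
```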
Start a pipeline update for selected tables
You can optionally reprocess data for only selected tables in your pipeline. For example, during development you might change only a single table and want to reduce testing time, or a pipeline update might fail and you want to refresh only the failed tables.
Note
You can use selective refresh only with triggered pipelines.
To start an update that refreshes selected tables only, on the Pipeline details page:
Click Select tables for refresh. The Select tables for refresh dialog appears.
If you do not see the Select tables for refresh button, confirm that the Pipeline details page displays the latest update and that the update is complete. If a DAG is not shown for the latest update, for example, because the update failed, the Select tables for refresh button is not displayed.
To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.
Click Refresh selection.
Note
The Refresh selection button displays the number of selected tables in parentheses.
To reprocess data already ingested for the selected tables, click the drop-down arrow next to the Refresh selection button and click Full Refresh selection.
Start a pipeline update for failed tables
If a pipeline update fails because of errors in one or more tables in the pipeline graph, you can start an update of only failed tables and any downstream dependencies.
Note
Excluded tables are not refreshed, even if they depend on a failed table.
To update failed tables, on the Pipeline details page, click Refresh failed tables.
To update only selected failed tables:
Click the drop-down arrow next to the Refresh failed tables button and click Select tables for refresh. The Select tables for refresh dialog appears.
To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.
Click Refresh selection.
Note
The Refresh selection button displays the number of selected tables in parentheses.
To reprocess data already ingested for the selected tables, click the drop-down arrow next to the Refresh selection button and click Full Refresh selection.
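To make the "failed tables and any downstream dependencies" rule concrete, here is a small graph-traversal sketch. It is not how Delta Live Tables computes the set internally; the function `tables_to_refresh`, the example graph, and the choice to stop traversal at excluded tables are assumptions made for illustration.

```python
from collections import deque


def tables_to_refresh(downstream: dict, failed: set, excluded: set) -> set:
    """Return failed tables plus their downstream dependents, skipping excluded tables."""
    selected = set()
    queue = deque(table for table in failed if table not in excluded)
    while queue:
        table = queue.popleft()
        if table in selected:
            continue
        selected.add(table)
        for child in downstream.get(table, ()):
            # Assumption: an excluded table is skipped, and nothing is reached through it.
            if child not in excluded:
                queue.append(child)
    return selected


# Hypothetical pipeline graph: each key maps to the tables that read from it.
downstream = {
    "raw": ["clean"],
    "clean": ["daily", "audit"],
    "daily": ["report"],
}
print(sorted(tables_to_refresh(downstream, failed={"clean"}, excluded={"audit"})))
# ['clean', 'daily', 'report']
```

In this example `audit` depends on the failed table `clean` but is not refreshed because it is excluded, while `raw` is untouched because it is upstream of the failure.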
Check a pipeline for errors without waiting for tables to update
Important
The Delta Live Tables Validate update feature is in Public Preview.
To check whether a pipeline’s source code is valid without running a full update, use Validate. A Validate update resolves the definitions of datasets and flows defined in the pipeline but does not materialize or publish any datasets. Errors found during validation, such as incorrect table or column names, are reported in the UI.
To run a Validate update, click the drop-down arrow next to Start on the pipeline details page and click Validate.
After the Validate update is complete, the event log shows events related only to the Validate update, and no metrics are displayed in the DAG. If errors are found, details are available in the event log.
You can see results for only the most recent Validate update. If the Validate update was the most recently run update, you can see the results by selecting it in the update history. If another update is run after the Validate update, the results are no longer available in the UI.
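If you work with validation results outside the UI, a sketch like the following could help pick the errors for one update out of a list of event log records. It assumes a validate-only update can be requested with a `validate_only` field in the start-update request body, and that event records carry `level`, `message`, and `origin.update_id` fields; confirm both against the Pipeline API reference. The sample events are invented.

```python
import json

# Request body for a validate-only update (field name is an assumption).
VALIDATE_BODY = json.dumps({"validate_only": True})


def validation_errors(events: list, update_id: str) -> list:
    """Return the error messages that belong to the given update."""
    return [
        event["message"]
        for event in events
        if event.get("level") == "ERROR"
        and event.get("origin", {}).get("update_id") == update_id
    ]


# Invented sample events, for illustration only.
sample_events = [
    {"level": "INFO", "message": "Update started.", "origin": {"update_id": "u-2"}},
    {"level": "ERROR", "message": "Column 'custmer_id' not found.", "origin": {"update_id": "u-2"}},
    {"level": "ERROR", "message": "Older failure.", "origin": {"update_id": "u-1"}},
]
print(validation_errors(sample_events, "u-2"))
```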
How to choose pipeline boundaries
A Delta Live Tables pipeline can process updates to a single table, many tables with dependent relationships, many tables without relationships, or multiple independent flows of tables with dependent relationships. This section contains considerations to help determine how to break up your pipelines.
Larger Delta Live Tables pipelines have several benefits. These include the following:
- More efficiently use cluster resources.
- Reduce the number of pipelines in your workspace.
- Reduce the complexity of workflow orchestration.
Some common recommendations on how processing pipelines should be split include the following:
- Split functionality at team boundaries. For example, your data team might maintain pipelines to transform data while your data analysts maintain pipelines that analyze the transformed data.
- Split functionality at application-specific boundaries to reduce coupling and facilitate the re-use of common functionality.
Development and production modes
You can optimize pipeline execution by switching between development and production modes. Use the buttons in the Pipelines UI to switch between these two modes. By default, pipelines run in development mode.
When you run your pipeline in development mode, the Delta Live Tables system does the following:
- Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the pipelines.clusterShutdown.delay setting. See Configure compute for a Delta Live Tables pipeline.
- Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system does the following:
- Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
- Retries execution in the event of specific errors, such as a failure to start a cluster.
Note
Switching between development and production modes only controls cluster and pipeline execution behavior. Storage locations and target schemas in the catalog for publishing tables must be configured as part of pipeline settings and are not affected when switching between modes.
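For pipelines managed as JSON settings rather than through the UI buttons, the mode and the cluster shutdown delay might be expressed as in the sketch below. It assumes the pipeline settings accept a boolean `development` field and that `pipelines.clusterShutdown.delay` is set under `configuration` with a duration string such as `60s`; the helper name and pipeline name are hypothetical.

```python
import json


def development_settings(name: str, shutdown_delay: str = "60s") -> dict:
    """Build a pipeline settings fragment for development mode (field names assumed)."""
    return {
        "name": name,
        "development": True,  # False would select production mode
        "configuration": {
            # Overrides the default two-hour cluster lifetime in development mode.
            "pipelines.clusterShutdown.delay": shutdown_delay,
        },
    }


settings = development_settings("example-pipeline")
print(json.dumps(settings, indent=2))
```

Storage locations and target schemas are deliberately left out of this fragment, because they are configured separately and do not change with the mode.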
Schedule a pipeline
You can start a triggered pipeline manually or run the pipeline on a schedule with an Azure Databricks job. You can create and schedule a job with a single pipeline task directly in the Delta Live Tables UI or add a pipeline task to a multi-task workflow in the jobs UI. See Delta Live Tables pipeline task for jobs.
To create a single-task job and a schedule for the job in the Delta Live Tables UI:
- Click Schedule > Add a schedule. If the pipeline is included in one or more scheduled jobs, the Schedule button is updated to show the number of existing schedules, for example, Schedule (5).
- Enter a name for the job in the Job name field.
- Set the Schedule to Scheduled.
- Specify the period, starting time, and time zone.
- Configure one or more email addresses to receive alerts on pipeline start, success, or failure.
- Click Create.
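The same schedule can be described as job settings when you create the job through an API or tool instead of the UI. The sketch below assumes the Jobs API accepts a `pipeline_task` inside `tasks`, a `schedule` with `quartz_cron_expression` and `timezone_id`, and `email_notifications` keyed by `on_start`, `on_success`, and `on_failure`; verify the field names against the Jobs API reference. The function name, job name, pipeline ID, and email address are placeholders.

```python
import json


def scheduled_pipeline_job(job_name: str, pipeline_id: str, cron: str,
                           timezone_id: str, emails: list) -> dict:
    """Build job settings for a single pipeline task on a schedule (field names assumed)."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "run_pipeline",
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        },
        "email_notifications": {
            "on_start": list(emails),
            "on_success": list(emails),
            "on_failure": list(emails),
        },
    }


# Quartz cron for every day at 06:00 in the given time zone.
job = scheduled_pipeline_job(
    "nightly-pipeline", "my-pipeline-id", "0 0 6 * * ?", "UTC", ["data-team@example.com"]
)
print(json.dumps(job, indent=2))
```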