Run an update on a Delta Live Tables pipeline

This article explains what a Delta Live Tables pipeline update is and how to run one.

After you create a pipeline and are ready to run it, you start an update. A pipeline update does the following:

  • Starts a cluster with the correct configuration.
  • Discovers all the defined tables and views and checks for analysis errors such as invalid column names, missing dependencies, and syntax errors.
  • Creates or updates tables and views with the most recent data available.

Using a validate update, you can check for problems in a pipeline’s source code without waiting for tables to be created or updated. This feature is useful when developing or testing pipelines because it lets you quickly find and fix errors in your pipeline, such as incorrect table or column names.

To learn how to create a pipeline, see Configure a Delta Live Tables pipeline.

You can orchestrate pipeline updates with Databricks jobs or other tools. See Run a Delta Live Tables pipeline in a workflow.

Start a pipeline update

Azure Databricks has several options to start pipeline updates, including the following:

  • In the Delta Live Tables UI, you have the following options:
    • Click the Start button on the pipeline details page.
    • From the pipelines list, click the right arrow icon in the Actions column.
  • To start an update in a notebook, attach the notebook to a configured pipeline and click Start. See Develop and debug Delta Live Tables pipelines in notebooks.
  • You can trigger pipelines programmatically using the API or CLI. See Pipeline API.
  • You can schedule the pipeline as a job using the Delta Live Tables UI or the jobs UI. See Schedule a pipeline.

Note

By default, a manually triggered pipeline update started with any of these methods is a refresh all update.
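As a rough sketch of the programmatic option, the following snippet builds (but does not send) an HTTP request that starts an update. It assumes the Pipelines REST API endpoint `POST /api/2.0/pipelines/{pipeline_id}/updates` with bearer-token authentication. The workspace URL, pipeline ID, and token are placeholders, and `build_start_update_request` is a hypothetical helper name, not part of any Databricks library.

```python
import json
import urllib.request


def build_start_update_request(host, pipeline_id, token, full_refresh=False):
    """Build (without sending) a request that starts a pipeline update.

    Assumes the Pipelines REST API endpoint
    POST /api/2.0/pipelines/{pipeline_id}/updates.
    """
    url = f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates"
    body = json.dumps({"full_refresh": full_refresh}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; replace with your workspace URL, pipeline ID, and token.
request = build_start_update_request(
    "https://adb-1234567890123456.7.azuredatabricks.net",
    "my-pipeline-id",
    "my-token",
)
# To actually start the update, send the request:
#   urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before triggering a real update.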

How Delta Live Tables updates tables and views

Important

A full refresh of a streaming table or materialized view truncates and recomputes the table or view to reflect the current state of its input data sources. For streaming tables, checkpoints are also reset. If records have been removed from the data sources, for example, because of data retention policies, manual deletion, or sources with short retention periods such as Kafka, the state of the table or view after a full refresh might differ from the previous state. Additionally, the time and resources required to complete a full refresh grow with the size of the source data.

Databricks recommends running full refreshes only when necessary, and when the input data sources contain the data to recreate the state of the table or view. To prevent full refreshes from being run on a table or view, set the table property pipelines.reset.allowed to false. See Delta Live Tables table properties. You can also use an append flow to append data to an existing streaming table without requiring a full refresh.
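A minimal sketch of the table property mentioned above, assuming the pipeline is written in Python and the property is passed through the `table_properties` argument of the `@dlt.table` decorator. The table name and source name in the commented usage are hypothetical.

```python
# Table properties for a table that should never be fully refreshed.
# Property values are passed as strings.
NO_FULL_REFRESH = {"pipelines.reset.allowed": "false"}

# Inside pipeline source code, the dictionary would be passed to the decorator.
# The lines below are left as comments because `dlt` and `spark` are only
# available when the code runs as part of a pipeline:
#
#   import dlt
#
#   @dlt.table(table_properties=NO_FULL_REFRESH)
#   def raw_events():                                    # hypothetical table
#       return spark.readStream.table("source_events")   # hypothetical source
```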

Which tables and views are updated, and how they are updated, depends on the update type:

  • Refresh all: All tables are updated to reflect the current state of their input data sources. For streaming tables, new rows are appended to the table.
  • Full refresh all: All tables are updated to reflect the current state of their input data sources. For streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.
  • Refresh selection: The behavior of refresh selection is identical to refresh all but allows you to refresh only selected tables. Selected tables are updated to reflect the current state of their input data sources. For streaming tables, new rows are appended to the table.
  • Full refresh selection: The behavior of full refresh selection is identical to full refresh all but allows you to perform a full refresh of only selected tables. Selected tables are updated to reflect the current state of their input data sources. For streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.

For existing materialized views, an update has the same behavior as a SQL REFRESH on a materialized view. For new materialized views, the behavior is the same as a SQL CREATE operation.
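To make the four update types concrete, the following sketch maps each one to a request body for a programmatic update. The field names `full_refresh`, `refresh_selection`, and `full_refresh_selection` are assumed to match the Pipelines REST API; `update_request_body` and the table names are hypothetical.

```python
def update_request_body(update_type, tables=None):
    """Return a request body for one of the four update types.

    Field names are assumed to follow the Pipelines REST API.
    """
    tables = list(tables or [])
    if update_type == "refresh_all":
        return {"full_refresh": False}
    if update_type == "full_refresh_all":
        return {"full_refresh": True}
    if update_type in ("refresh_selection", "full_refresh_selection"):
        # Selective updates name the tables to refresh explicitly.
        if not tables:
            raise ValueError(f"{update_type} requires at least one table")
        return {update_type: tables}
    raise ValueError(f"unknown update type: {update_type}")


# Example: fully refresh only two (hypothetical) tables.
body = update_request_body("full_refresh_selection", ["orders", "customers"])
```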

Start a pipeline update for selected tables

You can optionally reprocess data for only selected tables in your pipeline. For example, during development you might change only a single table and want to reduce testing time, or a pipeline update might fail and you want to refresh only the failed tables.

Note

You can use selective refresh only with triggered pipelines.

To start an update that refreshes selected tables only, on the Pipeline details page:

  1. Click Select tables for refresh. The Select tables for refresh dialog appears.

    If you do not see the Select tables for refresh button, confirm that the Pipeline details page displays the latest update and that the update is complete. If a DAG is not shown for the latest update, for example, because the update failed, the Select tables for refresh button is not displayed.

  2. To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.

  3. Click Refresh selection.

    Note

    The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data already ingested for the selected tables, click the down caret next to the Refresh selection button and click Full Refresh selection.

Start a pipeline update for failed tables

If a pipeline update fails because of errors in one or more tables in the pipeline graph, you can start an update of only failed tables and any downstream dependencies.

Note

Excluded tables are not refreshed, even if they depend on a failed table.

To update failed tables, on the Pipeline details page, click Refresh failed tables.
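The selection rule described above (failed tables plus their downstream dependencies, minus excluded tables) can be modeled as a graph traversal. This is only an illustrative model, not the product's implementation: the function name, the graph, and the table names are hypothetical, and the sketch assumes that tables downstream of an excluded table are still selected.

```python
from collections import deque


def tables_to_refresh(downstream, failed, excluded=()):
    """Return failed tables plus everything downstream of them, minus excluded tables.

    `downstream` maps each table name to the tables that read from it.
    """
    seen = set()
    queue = deque(failed)
    while queue:
        table = queue.popleft()
        if table in seen:
            continue
        seen.add(table)
        # Follow edges to every table that depends on this one.
        queue.extend(downstream.get(table, ()))
    # Excluded tables are dropped even if they depend on a failed table.
    return seen - set(excluded)


graph = {"bronze": ["silver"], "silver": ["gold", "audit"], "gold": [], "audit": []}
selected = tables_to_refresh(graph, failed=["silver"], excluded=["audit"])
```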

To update only selected failed tables:

  1. Click the down arrow next to the Refresh failed tables button and click Select tables for refresh. The Select tables for refresh dialog appears.

  2. To select the tables to refresh, click each table. The selected tables are highlighted and labeled. To remove a table from the update, click the table again.

  3. Click Refresh selection.

    Note

    The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data already ingested for the selected tables, click the down caret next to the Refresh selection button and click Full Refresh selection.

Check a pipeline for errors without waiting for tables to update

Important

The Delta Live Tables Validate update feature is in Public Preview.

To check whether a pipeline’s source code is valid without running a full update, use Validate. A Validate update resolves the definitions of datasets and flows defined in the pipeline but does not materialize or publish any datasets. Errors found during validation, such as incorrect table or column names, are reported in the UI.

To run a Validate update, click the down caret next to Start on the pipeline details page and click Validate.

After the Validate update is complete, the event log shows events related only to the Validate update, and no metrics are displayed in the DAG. If errors are found, details are available in the event log.

You can see results for only the most recent Validate update. If the Validate update was the most recently run update, you can see the results by selecting it in the update history. If another update is run after the Validate update, the results are no longer available in the UI.

How to choose pipeline boundaries

A Delta Live Tables pipeline can process updates to a single table, many tables with dependent relationships, many tables without relationships, or multiple independent flows of tables with dependent relationships. This section contains considerations to help determine how to break up your pipelines.

Larger Delta Live Tables pipelines have several benefits. These include the following:

  • More efficiently use cluster resources.
  • Reduce the number of pipelines in your workspace.
  • Reduce the complexity of workflow orchestration.

Some common recommendations on how processing pipelines should be split include the following:

  • Split functionality at team boundaries. For example, your data team might maintain pipelines to transform data while your data analysts maintain pipelines that analyze the transformed data.
  • Split functionality at application-specific boundaries to reduce coupling and facilitate the re-use of common functionality.

Development and production modes

You can optimize pipeline execution by switching between development and production modes. Use the environment toggle buttons in the Pipelines UI to switch between these two modes. By default, pipelines run in development mode.

When you run your pipeline in development mode, the Delta Live Tables system does the following:

  • Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the pipelines.clusterShutdown.delay setting. See Configure compute for a Delta Live Tables pipeline.
  • Disables pipeline retries so you can immediately detect and fix errors.

In production mode, the Delta Live Tables system does the following:

  • Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
  • Retries execution in the event of specific errors, such as a failure to start a cluster.

Note

Switching between development and production modes only controls cluster and pipeline execution behavior. Storage locations and target schemas in the catalog for publishing tables must be configured as part of pipeline settings and are not affected when switching between modes.
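For pipelines managed through JSON settings rather than the UI, the mode and the cluster shutdown delay might look like the fragment below. This is a sketch that assumes the `development` flag and the `configuration` map of the pipeline settings; the pipeline name and the `60s` value are illustrative only.

```python
import json

# Fragment of pipeline settings; only the keys relevant to this section are shown.
pipeline_settings = {
    "name": "example-pipeline",  # hypothetical pipeline name
    "development": True,         # False selects production mode
    "configuration": {
        # Keep the development cluster for 60 seconds instead of the default two hours.
        "pipelines.clusterShutdown.delay": "60s",
    },
}

print(json.dumps(pipeline_settings, indent=2))
```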

Schedule a pipeline

You can start a triggered pipeline manually or run the pipeline on a schedule with an Azure Databricks job. You can create and schedule a job with a single pipeline task directly in the Delta Live Tables UI or add a pipeline task to a multi-task workflow in the jobs UI. See Delta Live Tables pipeline task for jobs.

To create a single-task job and a schedule for the job in the Delta Live Tables UI:

  1. Click Schedule > Add a schedule. If the pipeline is included in one or more scheduled jobs, the Schedule button is updated to show the number of existing schedules, for example, Schedule (5).
  2. Enter a name for the job in the Job name field.
  3. Set the Schedule to Scheduled.
  4. Specify the period, starting time, and time zone.
  5. Configure one or more email addresses to receive alerts on pipeline start, success, or failure.
  6. Click Create.
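The same single-task job can be described as a job settings payload. The sketch below assumes the Jobs API fields `tasks[].pipeline_task`, `schedule.quartz_cron_expression`, `schedule.timezone_id`, and `email_notifications`; the helper name, job name, pipeline ID, and email address are hypothetical.

```python
def scheduled_pipeline_job(job_name, pipeline_id, cron, timezone, emails):
    """Build job settings for a single pipeline task that runs on a schedule."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "run_pipeline",
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
        "schedule": {
            # Quartz cron fields: second minute hour day-of-month month day-of-week
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
            "pause_status": "UNPAUSED",
        },
        "email_notifications": {
            "on_start": list(emails),
            "on_success": list(emails),
            "on_failure": list(emails),
        },
    }


# Run the pipeline every day at 02:00 UTC and alert one address.
job = scheduled_pipeline_job(
    "nightly-pipeline",
    "my-pipeline-id",
    "0 0 2 * * ?",
    "UTC",
    ["data-team@example.com"],
)
```

The fields of the payload correspond to the UI steps above: the job name, the period and starting time (cron expression), the time zone, and the alert recipients.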