共用方式為


教學課程:在本機建立、執行及測試 dbt 模型

本教學課程將逐步引導您瞭解如何在本機建立、執行及測試 dbt 模型。 您也可以以 Azure Databricks 作業工作的形式執行 dbt 專案。 如需詳細資訊,請參閱 在 Azure Databricks 作業中使用 dbt 轉換。

開始之前

若要遵循本教學課程,您必須先將 Azure Databricks 工作區連線到 dbt Core。 如需詳細資訊,請參閱連線至 dbt Core

步驟 1:建立和執行模型

在此步驟中,您會使用您慣用的文本編輯器來建立 模型,這些模型是 select 語句,這些語句會根據該資料庫中的現有數據,建立新的檢視(預設值) 或資料庫中的新數據表。 此程式會根據 範例數據集 diamonds 數據表建立模型。

使用下列程式代碼來建立此數據表。

DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
  1. 在項目的 models 目錄中,使用下列 SQL 語句建立名為 diamonds_four_cs.sql 的檔案。 此語句只會從 diamonds 數據表中選取每個鑽石的克拉、切工、顏色和淨度詳細數據。 config 區塊會指示 dbt 根據這個語句在資料庫中建立數據表。

    {{ config(
      materialized='table',
      file_format='delta'
    ) }}
    
    select carat, cut, color, clarity
    from diamonds
    

    提示

    如需其他 config 選項,例如使用 Delta 檔格式和 merge 累加策略,請參閱 dbt 檔中的 Databricks 組態

  2. 在項目的 models 目錄中,使用下列 SQL 語句建立名為 diamonds_list_colors.sql 的第二個檔案。 此語句會從 diamonds_four_cs 數據表的 [colors] 數據行中選取唯一值,先依字母順序排序結果。 由於沒有 config 區塊,此模型將指示 dbt 根據此陳述式在資料庫中建立檢視。

    select distinct color
    from {{ ref('diamonds_four_cs') }}
    sort by color asc
    
  3. 在項目的 models 目錄中,使用下列 SQL 語句建立名為 diamonds_prices.sql 的第三個檔案。 此陳述式依色彩計算寶石的平均價格,並依平均價格從最高到最低對結果進行排序。 此模型將指示 dbt 根據此陳述式在資料庫中建立檢視。

    select color, avg(price) as price
    from diamonds
    group by color
    order by price desc
    
  4. 啟用虛擬環境后,使用上述三個檔案的路徑執行 dbt run 命令。 在 default 資料庫中(如 profiles.yml 檔案中所指定),dbt 會建立一個名為 diamonds_four_cs 的數據表,以及名為 diamonds_list_colorsdiamonds_prices的兩個檢視表。 dbt 會從其相關的 .sql 檔名取得這些檢視和數據表名稱。

    dbt run --model models/diamonds_four_cs.sql models/diamonds_list_colors.sql models/diamonds_prices.sql
    
    ...
    ... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]
    ... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]
    ... | 2 of 3 START view model default.diamonds_list_colors................. [RUN]
    ... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]
    ... | 3 of 3 START view model default.diamonds_prices...................... [RUN]
    ... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]
    ... |
    ... | Finished running 1 table model, 2 view models ...
    
    Completed successfully
    
    Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
    
  5. 執行下列 SQL 程式代碼來列出新檢視的相關信息,以及從數據表和檢視中選取所有數據列。

    如果您要連線到叢集,您可以從連線到叢集的筆記本執行此 SQL 程式代碼,並將 SQL 指定為筆記本的預設語言。 如果要連線至 SQL 倉儲,您可以從查詢執行此 SQL 程式碼。

    SHOW views IN default;
    
    +-----------+----------------------+-------------+
    | namespace | viewName             | isTemporary |
    +===========+======================+=============+
    | default   | diamonds_list_colors | false       |
    +-----------+----------------------+-------------+
    | default   | diamonds_prices      | false       |
    +-----------+----------------------+-------------+
    
    SELECT * FROM diamonds_four_cs;
    
    +-------+---------+-------+---------+
    | carat | cut     | color | clarity |
    +=======+=========+=======+=========+
    | 0.23  | Ideal   | E     | SI2     |
    +-------+---------+-------+---------+
    | 0.21  | Premium | E     | SI1     |
    +-------+---------+-------+---------+
    ...
    
    SELECT * FROM diamonds_list_colors;
    
    +-------+
    | color |
    +=======+
    | D     |
    +-------+
    | E     |
    +-------+
    ...
    
    SELECT * FROM diamonds_prices;
    
    +-------+---------+
    | color | price   |
    +=======+=========+
    | J     | 5323.82 |
    +-------+---------+
    | I     | 5091.87 |
    +-------+---------+
    ...
    

步驟 2:建立和執行更複雜的模型

在此步驟中,您會為一組相關的數據表建立更複雜的模型。 這些數據表包含三支球隊在一個賽季中進行六場比賽的虛構體育聯盟的相關信息。 此程式會建立數據表、建立模型,以及執行模型。

  1. 執行下列 SQL 程式代碼來建立必要的資料表。

    如果您要連線到叢集,您可以從連線到叢集的筆記本執行此 SQL 程式代碼,並將 SQL 指定為筆記本的預設語言。 如果要連線至 SQL 倉儲,您可以從查詢執行此 SQL 程式碼。

    此步驟中的數據表和檢視會從 zzz_ 開始,以協助識別它們作為此範例的一部分。 您不需要針對自己的數據表和檢視遵循此模式。

    DROP TABLE IF EXISTS zzz_game_opponents;
    DROP TABLE IF EXISTS zzz_game_scores;
    DROP TABLE IF EXISTS zzz_games;
    DROP TABLE IF EXISTS zzz_teams;
    
    CREATE TABLE zzz_game_opponents (
    game_id INT,
    home_team_id INT,
    visitor_team_id INT
    ) USING DELTA;
    
    INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
    INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
    INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
    INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
    INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
    INSERT INTO zzz_game_opponents VALUES (6, 3, 2);
    
    -- Result:
    -- +---------+--------------+-----------------+
    -- | game_id | home_team_id | visitor_team_id |
    -- +=========+==============+=================+
    -- | 1       | 1            | 2               |
    -- +---------+--------------+-----------------+
    -- | 2       | 1            | 3               |
    -- +---------+--------------+-----------------+
    -- | 3       | 2            | 1               |
    -- +---------+--------------+-----------------+
    -- | 4       | 2            | 3               |
    -- +---------+--------------+-----------------+
    -- | 5       | 3            | 1               |
    -- +---------+--------------+-----------------+
    -- | 6       | 3            | 2               |
    -- +---------+--------------+-----------------+
    
    CREATE TABLE zzz_game_scores (
    game_id INT,
    home_team_score INT,
    visitor_team_score INT
    ) USING DELTA;
    
    INSERT INTO zzz_game_scores VALUES (1, 4, 2);
    INSERT INTO zzz_game_scores VALUES (2, 0, 1);
    INSERT INTO zzz_game_scores VALUES (3, 1, 2);
    INSERT INTO zzz_game_scores VALUES (4, 3, 2);
    INSERT INTO zzz_game_scores VALUES (5, 3, 0);
    INSERT INTO zzz_game_scores VALUES (6, 3, 1);
    
    -- Result:
    -- +---------+-----------------+--------------------+
    -- | game_id | home_team_score | visitor_team_score |
    -- +=========+=================+====================+
    -- | 1       | 4               | 2                  |
    -- +---------+-----------------+--------------------+
    -- | 2       | 0               | 1                  |
    -- +---------+-----------------+--------------------+
    -- | 3       | 1               | 2                  |
    -- +---------+-----------------+--------------------+
    -- | 4       | 3               | 2                  |
    -- +---------+-----------------+--------------------+
    -- | 5       | 3               | 0                  |
    -- +---------+-----------------+--------------------+
    -- | 6       | 3               | 1                  |
    -- +---------+-----------------+--------------------+
    
    CREATE TABLE zzz_games (
    game_id INT,
    game_date DATE
    ) USING DELTA;
    
    INSERT INTO zzz_games VALUES (1, '2020-12-12');
    INSERT INTO zzz_games VALUES (2, '2021-01-09');
    INSERT INTO zzz_games VALUES (3, '2020-12-19');
    INSERT INTO zzz_games VALUES (4, '2021-01-16');
    INSERT INTO zzz_games VALUES (5, '2021-01-23');
    INSERT INTO zzz_games VALUES (6, '2021-02-06');
    
    -- Result:
    -- +---------+------------+
    -- | game_id | game_date  |
    -- +=========+============+
    -- | 1       | 2020-12-12 |
    -- +---------+------------+
    -- | 2       | 2021-01-09 |
    -- +---------+------------+
    -- | 3       | 2020-12-19 |
    -- +---------+------------+
    -- | 4       | 2021-01-16 |
    -- +---------+------------+
    -- | 5       | 2021-01-23 |
    -- +---------+------------+
    -- | 6       | 2021-02-06 |
    -- +---------+------------+
    
    CREATE TABLE zzz_teams (
    team_id INT,
    team_city VARCHAR(15)
    ) USING DELTA;
    
    INSERT INTO zzz_teams VALUES (1, "San Francisco");
    INSERT INTO zzz_teams VALUES (2, "Seattle");
    INSERT INTO zzz_teams VALUES (3, "Amsterdam");
    
    -- Result:
    -- +---------+---------------+
    -- | team_id | team_city     |
    -- +=========+===============+
    -- | 1       | San Francisco |
    -- +---------+---------------+
    -- | 2       | Seattle       |
    -- +---------+---------------+
    -- | 3       | Amsterdam     |
    -- +---------+---------------+
    
  2. 在項目的 models 目錄中,使用下列 SQL 語句建立名為 zzz_game_details.sql 的檔案。 此語句會建立一個數據表,提供每個遊戲的詳細數據,例如小組名稱和分數。 config 區塊會指示 dbt 根據這個語句在資料庫中建立數據表。

    -- Create a table that provides full details for each game, including
    -- the game ID, the home and visiting teams' city names and scores,
    -- the game winner's city name, and the game date.
    
    {{ config(
      materialized='table',
      file_format='delta'
    ) }}
    
    -- Step 4 of 4: Replace the visitor team IDs with their city names.
    select
      game_id,
      home,
      t.team_city as visitor,
      home_score,
      visitor_score,
      -- Step 3 of 4: Display the city name for each game's winner.
      case
        when
          home_score > visitor_score
            then
              home
        when
          visitor_score > home_score
            then
              t.team_city
      end as winner,
      game_date as date
    from (
      -- Step 2 of 4: Replace the home team IDs with their actual city names.
      select
        game_id,
        t.team_city as home,
        home_score,
        visitor_team_id,
        visitor_score,
        game_date
      from (
        -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
        select
          g.game_id,
          go.home_team_id,
          gs.home_team_score as home_score,
          go.visitor_team_id,
          gs.visitor_team_score as visitor_score,
          g.game_date
        from
          zzz_games as g,
          zzz_game_opponents as go,
          zzz_game_scores as gs
        where
          g.game_id = go.game_id and
          g.game_id = gs.game_id
      ) as all_ids,
        zzz_teams as t
      where
        all_ids.home_team_id = t.team_id
    ) as visitor_ids,
      zzz_teams as t
    where
      visitor_ids.visitor_team_id = t.team_id
    order by game_date desc
    
  3. 在項目的 models 目錄中,使用下列 SQL 語句建立名為 zzz_win_loss_records.sql 的檔案。 此陳述式會建立一個檢視,其中列出該賽季隊伍的勝負記錄。

    -- Create a view that summarizes the season's win and loss records by team.
    
    -- Step 2 of 2: Calculate the number of wins and losses for each team.
    select
      winner as team,
      count(winner) as wins,
      -- Each team played in 4 games.
      (4 - count(winner)) as losses
    from (
      -- Step 1 of 2: Determine the winner and loser for each game.
      select
        game_id,
        winner,
        case
          when
            home = winner
              then
                visitor
          else
            home
        end as loser
      from {{ ref('zzz_game_details') }}
    )
    group by winner
    order by wins desc
    
  4. 啟用虛擬環境之後,請使用上述兩個檔案的路徑執行 dbt run 命令。 在 default 資料庫中(如 profiles.yml 檔案中所指定),dbt 會建立一個名為 zzz_game_details 的數據表,以及一個名為 zzz_win_loss_records的檢視表。 dbt 會從其相關的 .sql 檔名取得這些檢視和數據表名稱。

    dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql
    
    ...
    ... | 1 of 2 START table model default.zzz_game_details.................... [RUN]
    ... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]
    ... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]
    ... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]
    ... |
    ... | Finished running 1 table model, 1 view model ...
    
    Completed successfully
    
    Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
    
  5. 執行下列 SQL 程式代碼來列出新檢視的相關信息,以及從數據表和檢視中選取所有數據列。

    如果您要連線到叢集,您可以從連線到叢集的筆記本執行此 SQL 程式代碼,並將 SQL 指定為筆記本的預設語言。 如果要連線至 SQL 倉儲,您可以從查詢執行此 SQL 程式碼。

    SHOW VIEWS FROM default LIKE 'zzz_win_loss_records';
    
    +-----------+----------------------+-------------+
    | namespace | viewName             | isTemporary |
    +===========+======================+=============+
    | default   | zzz_win_loss_records | false       |
    +-----------+----------------------+-------------+
    
    SELECT * FROM zzz_game_details;
    
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | game_id | home          | visitor       | home_score | visitor_score | winner        | date       |
    +=========+===============+===============+============+===============+===============+============+
    | 1       | San Francisco | Seattle       | 4          | 2             | San Francisco | 2020-12-12 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 2       | San Francisco | Amsterdam     | 0          | 1             | Amsterdam     | 2021-01-09 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 3       | Seattle       | San Francisco | 1          | 2             | San Francisco | 2020-12-19 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 4       | Seattle       | Amsterdam     | 3          | 2             | Seattle       | 2021-01-16 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 5       | Amsterdam     | San Francisco | 3          | 0             | Amsterdam     | 2021-01-23 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    | 6       | Amsterdam     | Seattle       | 3          | 1             | Amsterdam     | 2021-02-06 |
    +---------+---------------+---------------+------------+---------------+---------------+------------+
    
    SELECT * FROM zzz_win_loss_records;
    
    +---------------+------+--------+
    | team          | wins | losses |
    +===============+======+========+
    | Amsterdam     | 3    | 1      |
    +---------------+------+--------+
    | San Francisco | 2    | 2      |
    +---------------+------+--------+
    | Seattle       | 1    | 3      |
    +---------------+------+--------+
    

步驟 3:建立並執行測試

在此步驟中,您將建立測試,這些測試是您對模型所做的判斷提示。 執行這些測試時,dbt 會告訴您專案中的每個測試是通過還是失敗。

有兩種類型的測試。 架構測試套用在 YAML 中,返回未通過斷言的記錄數量。 此數字為零時,表示所有記錄都通過,因此測試也通過。 資料測試是必須傳回零個記錄才算通過的特定查詢。

  1. 在項目的 models 目錄中,使用下列內容建立名為 schema.yml 的檔案。 此檔案包含架構測試,可判斷指定的數據行是否具有唯一值、不是 Null、只有指定的值或組合。

    version: 2
    
    models:
      - name: zzz_game_details
        columns:
          - name: game_id
            tests:
              - unique
              - not_null
          - name: home
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: visitor
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: home_score
            tests:
              - not_null
          - name: visitor_score
            tests:
              - not_null
          - name: winner
            tests:
              - not_null
              - accepted_values:
                  values: ['Amsterdam', 'San Francisco', 'Seattle']
          - name: date
            tests:
              - not_null
      - name: zzz_win_loss_records
        columns:
          - name: team
            tests:
              - unique
              - not_null
              - relationships:
                  to: ref('zzz_game_details')
                  field: home
          - name: wins
            tests:
              - not_null
          - name: losses
            tests:
              - not_null
    
  2. 在項目的 tests 目錄中,使用下列 SQL 語句建立名為 zzz_game_details_check_dates.sql 的檔案。 此檔案包含一個資料測試,用於確定常規賽季外是否進行了任何比賽。

    -- This season's games happened between 2020-12-12 and 2021-02-06.
    -- For this test to pass, this query must return no results.
    
    select date
    from {{ ref('zzz_game_details') }}
    where date < '2020-12-12'
    or date > '2021-02-06'
    
  3. 在項目的 tests 目錄中,使用下列 SQL 語句建立名為 zzz_game_details_check_scores.sql 的檔案。 此檔案包含一個資料測試,用於確定是否有任何分數為負或任何比賽為平局。

    -- This sport allows no negative scores or tie games.
    -- For this test to pass, this query must return no results.
    
    select home_score, visitor_score
    from {{ ref('zzz_game_details') }}
    where home_score < 0
    or visitor_score < 0
    or home_score = visitor_score
    
  4. 在項目的 tests 目錄中,使用下列 SQL 語句建立名為 zzz_win_loss_records_check_records.sql 的檔案。 此檔案包含一個資料測試,用於確定任何隊伍是否有負數的勝場或敗場記錄、是否有超過進行比賽數的勝負記錄,或者進行的比賽數是否多於允許的比賽數。

    -- Each team participated in 4 games this season.
    -- For this test to pass, this query must return no results.
    
    select wins, losses
    from {{ ref('zzz_win_loss_records') }}
    where wins < 0 or wins > 4
    or losses < 0 or losses > 4
    or (wins + losses) > 4
    
  5. 啟用虛擬環境後,執行 dbt test 命令。

    dbt test --models zzz_game_details zzz_win_loss_records
    
    ...
    ... | 1 of 19 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]
    ... | 1 of 19 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...]
    ...
    ... |
    ... | Finished running 19 tests ...
    
    Completed successfully
    
    Done. PASS=19 WARN=0 ERROR=0 SKIP=0 TOTAL=19
    

步驟 4:清理

您可以執行下列 SQL 程式代碼,以刪除您為此範例建立的數據表和檢視。

如果您要連線到叢集,您可以從連線到叢集的筆記本執行此 SQL 程式代碼,並將 SQL 指定為筆記本的預設語言。 如果要連線至 SQL 倉儲,您可以從查詢執行此 SQL 程式碼。

DROP TABLE zzz_game_opponents;
DROP TABLE zzz_game_scores;
DROP TABLE zzz_games;
DROP TABLE zzz_teams;
DROP TABLE zzz_game_details;
DROP VIEW zzz_win_loss_records;

DROP TABLE diamonds;
DROP TABLE diamonds_four_cs;
DROP VIEW diamonds_list_colors;
DROP VIEW diamonds_prices;

疑難排解

如需搭配 Azure Databricks 使用 dbt Core 以及如何解決常見問題的相關信息,請參閱 在 dbt Labs 網站上取得協助

下一步

以 Azure Databricks 作業工作的形式執行 dbt Core 專案。 請參閱在 Azure Databricks 作業中使用 dbt 轉換

其他資源

探索 dbt Labs 網站上的下列資源: