Load data using COPY INTO with a service principal

Artikkeli
08/12/2024

This article describes how to use the COPY INTO command to load data from an Azure Data Lake Storage Gen2 (ADLS Gen2) container in your Azure account into a table in Databricks SQL.

The steps in this article assume that your admin has configured a SQL warehouse to use a Azure Databricks service principal so that you can access your source files in ADLS Gen2. If your admin configured a Unity Catalog external location with a storage credential, see Load data using COPY INTO with Unity Catalog volumes or external locations instead. If your admin gave you temporary credentials (a Blob SAS token), follow the steps in Load data using COPY INTO with temporary credentials instead.

Databricks recommends using the COPY INTO command for incremental and bulk data loading with Databricks SQL.

Note

COPY INTO works well for data sources that contain thousands of files. Databricks recommends that you use Auto Loader for loading millions of files, which is not supported in Databricks SQL.

Before you begin

Before you load data into Azure Databricks, make sure you have the following:

Access to data in ADLS Gen2. Your admin must first complete the steps in Configure data access for ingestion so your Databricks SQL warehouse can read your source files.
A Databricks SQL warehouse.
The Can manage permission on the SQL warehouse.
The path to your data in an ADLS Gen2 container.
Familiarity with the Databricks SQL user interface.

Step 1: Confirm access to data in cloud storage

To confirm that you have access to the correct data in cloud object storage, do the following:

In the sidebar, click Create > Query.
In the SQL editor’s menu bar, select a SQL warehouse.
In the SQL editor, paste the following code:
```
select * from csv.<path>
```
Replace <path> with the ADLS Gen2 container path that you received from your admin. For example, abfss://<container>@<storage-account>.dfs.core.windows.net/<folder>.
Click Run.

Step 2: Create a table

This step describes how to create a table in your Azure Databricks workspace to hold the incoming data.

In the SQL editor, paste the following code:

CREATE TABLE <catalog_name>.<schema_name>.<table_name> (
  tpep_pickup_datetime  TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  pickup_zip INT,
  dropoff_zip INT
);

Click Run.

Step 3: Load data from cloud storage into the table

This step describes how to load data from an ADLS Gen2 container into the table in your Azure Databricks workspace.

In the sidebar, click Create > Query.
In the SQL editor’s menu bar, select a SQL warehouse and make sure the SQL warehouse is running.
In the SQL editor, paste the following code. In this code, replace:
- <container> with the name of your ADLS Gen2 container within your storage account.
- <storage-account> with the name of your ADLS Gen2 storage account.
- <folder> with the name of the folder that contains your data.
- <blob-sas-token> with the value of the Blob SAS token that you received from your admin.
```
COPY INTO <catalog-name>.<schema-name>.<table-name>
FROM 'abfss://<container>@<storage-account>.dfs.core.windows.net/<folder>'
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'true'
);

SELECT * FROM <catalog-name>.<schema-name>.<table-name>;
```
Note

FORMAT_OPTIONS differs by FILEFORMAT. In this case, the header option instructs Azure Databricks to treat the first row of the CSV file as a header, and the inferSchema options instructs Azure Databricks to automatically determine the data type of each field in the CSV file.
Click Run.

Note

If you click Run again, no new data is loaded into the table. This is because the COPY INTO command only processes what it considers to be new data.

Clean up

You can clean up the associated resources in your workspace if you no longer want to keep them.

Delete the tables

In the sidebar, click Create > Query.
Select a SQL warehouse and make sure that the SQL warehouse is running.

Paste the following code:

DROP TABLE <catalog-name>.<schema-name>.<table-name>;

Click Run.
Hover over the tab for this query, and then click the X icon.

Delete the queries in the SQL editor

In the sidebar, click SQL Editor.
In the SQL editor’s menu bar, hover over the tab for each query that you created for this tutorial, and then click the X icon.

Additional resources

The COPY INTO reference article

Jaa