CREATE EXTERNAL TABLE AS SELECT (CETAS) (Transact-SQL)

Applies to: SQL Server 2022 (16.x) and later, Azure Synapse Analytics, Analytics Platform System (PDW)

Creates an external table and then exports, in parallel, the results of a Transact-SQL SELECT statement.

  • Azure Synapse Analytics and Analytics Platform System support Hadoop or Azure Blob storage.
  • SQL Server 2022 (16.x) and later versions support CREATE EXTERNAL TABLE AS SELECT (CETAS) to create an external table and then export, in parallel, the result of a Transact-SQL SELECT statement to Azure Data Lake Storage (ADLS) Gen2, Azure Storage Account V2, and S3-compatible object storage.

Note

The capabilities and security of CETAS for Azure SQL Managed Instance are different from SQL Server or Azure Synapse Analytics. For more information, see the Azure SQL Managed Instance version of CREATE EXTERNAL TABLE AS SELECT.

Note

The capabilities and security of CETAS for serverless pools in Azure Synapse Analytics are different from SQL Server. For more information, see CETAS with Synapse SQL.

Transact-SQL syntax conventions

Syntax

CREATE EXTERNAL TABLE { [ [ database_name . [ schema_name ] . ] | schema_name . ] table_name }
    [ (column_name [ , ...n ] ) ]
    WITH (
        LOCATION = 'hdfs_folder' | '<prefix>://<path>[:<port>]' ,
        DATA_SOURCE = external_data_source_name ,
        FILE_FORMAT = external_file_format_name
        [ , <reject_options> [ , ...n ] ]
    )
    AS <select_statement>
[;]

<reject_options> ::=
{
    | REJECT_TYPE = value | percentage
    | REJECT_VALUE = reject_value
    | REJECT_SAMPLE_VALUE = reject_sample_value
}

<select_statement> ::=
    [ WITH <common_table_expression> [ , ...n ] ]
    SELECT <select_criteria>

Arguments

[ [ database_name . [ schema_name ] . ] | schema_name . ] table_name

The one- to three-part name of the table to create in the database. For an external table, the relational database stores only the table metadata.

[ ( column_name [ ,...n ] ) ]

The name of a table column.

LOCATION

Applies to: Azure Synapse Analytics and Analytics Platform System

'hdfs_folder'
Specifies where to write the results of the SELECT statement on the external data source. The location is a folder name and can optionally include a path that's relative to the root folder of the Hadoop cluster or Blob storage. PolyBase creates the path and folder if they don't already exist.

The external files are written to hdfs_folder and named QueryID_date_time_ID.format, where ID is an incremental identifier and format is the exported data format. An example is QID776_20160130_182739_0.orc.

LOCATION must point to a folder and have a trailing /, for example: aggregated_data/.

Applies to: SQL Server 2022 (16.x) and later

prefix://path[:port] provides the connectivity protocol (prefix), path and optionally the port, to the external data source, where the result of the SELECT statement will be written.

If the destination is S3-compatible object storage, a bucket must first exist, but PolyBase can create subfolders if necessary. SQL Server 2022 (16.x) supports Azure Data Lake Storage Gen2, Azure Storage Account V2, and S3-compatible object storage. ORC files aren't currently supported.

DATA_SOURCE = external_data_source_name

Specifies the name of the external data source object that contains the location where the external data is stored or will be stored. The location is either a Hadoop cluster or Azure Blob storage. To create an external data source, use CREATE EXTERNAL DATA SOURCE (Transact-SQL).

FILE_FORMAT = external_file_format_name

Specifies the name of the external file format object that contains the format for the external data file. To create an external file format, use CREATE EXTERNAL FILE FORMAT (Transact-SQL).

REJECT options

REJECT options don't apply at the time this CREATE EXTERNAL TABLE AS SELECT statement is run. Instead, they're specified here so that the database can use them at a later time when it imports data from the external table. Later, when the CREATE TABLE AS SELECT statement selects data from the external table, the database will use the reject options to determine the number or percentage of rows that can fail to import before it stops the import.

  • REJECT_VALUE = reject_value

    Specifies the value or the percentage of rows that can fail to import before the database halts the import.

  • REJECT_TYPE = value | percentage

    Clarifies whether the REJECT_VALUE option is a literal value or a percentage.

    • value

      Used if REJECT_VALUE is a literal value, not a percentage. The database stops importing rows from the external data file when the number of failed rows exceeds reject_value.

      For example, if REJECT_VALUE = 5 and REJECT_TYPE = value, the database stops importing rows after five rows have failed to import.

    • percentage

      Used if REJECT_VALUE is a percentage, not a literal value. The database stops importing rows from the external data file when the percentage of failed rows exceeds reject_value. The percentage of failed rows is calculated at intervals. Only valid in dedicated SQL pools when TYPE=HADOOP.

  • REJECT_SAMPLE_VALUE = reject_sample_value

    Required when REJECT_TYPE = percentage. Specifies the number of rows to attempt to import before the database recalculates the percentage of failed rows.

    For example, if REJECT_SAMPLE_VALUE = 1000, the database will calculate the percentage of failed rows after it has attempted to import 1000 rows from the external data file. If the percentage of failed rows is less than reject_value, the database attempts to load another 1000 rows. The database continues to recalculate the percentage of failed rows after it attempts to import each additional 1000 rows.

    Note

    Because the database computes the percentage of failed rows at intervals, the actual percentage of failed rows can exceed reject_value.

    Example:

    This example shows how the three REJECT options interact (a code sketch follows this list). If REJECT_TYPE = percentage, REJECT_VALUE = 30, and REJECT_SAMPLE_VALUE = 100, the following scenario could occur:

    • The database attempts to load the first 100 rows, of which 25 fail and 75 succeed.
    • The percent of failed rows is calculated as 25%, which is less than the reject value of 30%. So, there's no need to halt the load.
    • The database attempts to load the next 100 rows. This time, 25 succeed and 75 fail.
    • The percent of failed rows is recalculated as 50%. The percentage of failed rows has exceeded the 30% reject value.
    • The load fails with 50% failed rows after attempting to load 200 rows, which is larger than the specified 30% limit.
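
To make these options concrete, here's a minimal, illustrative CETAS sketch that records them in the table definition. The object names (hadoop_ds, text_ff, dimCustomer) are placeholders, and REJECT_TYPE = percentage assumes an Azure Synapse dedicated SQL pool with a Hadoop-type data source:

-- Illustrative only; reject options are stored with the table definition
-- and applied later, when data is imported from the external table.
CREATE EXTERNAL TABLE ext_Customer
    WITH (
            LOCATION = '/customer_export/',
            DATA_SOURCE = hadoop_ds,
            FILE_FORMAT = text_ff,
            REJECT_TYPE = percentage,
            REJECT_VALUE = 30,
            REJECT_SAMPLE_VALUE = 100
            ) AS
SELECT *
FROM dimCustomer;
GO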

WITH common_table_expression

Specifies a temporary named result set, known as a common table expression (CTE). For more information, see WITH common_table_expression (Transact-SQL).

SELECT <select_criteria>

Populates the new table with the results from a SELECT statement. select_criteria is the body of the SELECT statement that determines which data to copy to the new table. For information about SELECT statements, see SELECT (Transact-SQL).

Note

The ORDER BY clause in the SELECT statement has no effect on CETAS.

Column options

  • column_name [ ,...n ]

    The column list doesn't allow the column options mentioned in CREATE TABLE. Instead, you can provide an optional list of one or more column names for the new table. The columns in the new table use the names you specify. When you specify column names, the number of columns in the column list must match the number of columns in the SELECT results. If you don't specify any column names, the new target table uses the column names from the SELECT statement results.

    You can't specify any other column options such as data types, collation, or nullability. Each of these attributes is derived from the results of the SELECT statement. However, you can use the SELECT statement to change the attributes. For an example, see Use CETAS to change column attributes.

Permissions

To run this command, the database user needs all of these permissions or memberships:

  • ALTER SCHEMA permission on the local schema that will contain the new table or membership in the db_ddladmin fixed database role.
  • CREATE TABLE permission or membership in the db_ddladmin fixed database role.
  • SELECT permission on any objects referenced in the select_criteria.

The login needs all of these permissions:

  • ADMINISTER BULK OPERATIONS
  • ALTER ANY EXTERNAL DATA SOURCE
  • ALTER ANY EXTERNAL FILE FORMAT
  • In general, you need to have permissions to List folder content and Write to the LOCATION folder for CETAS.
  • In Azure Synapse Analytics and Analytics Platform System, you need permission to read and write to the external folder on the Hadoop cluster or in Azure Blob storage.
  • In SQL Server 2022 (16.x), you must also set proper permissions on the external location: Write permission to output the data and Read permission to access it.
  • For Azure Blob Storage and Azure Data Lake Gen2, the SHARED ACCESS SIGNATURE token must be granted the following privileges on the container: Read, Write, List, Create.
  • For Azure Blob Storage, the Allowed Services: Blob checkbox must be selected to generate the SAS token.
  • For Azure Data Lake Gen2, the Allowed Services: Container and Object checkboxes must be selected to generate the SAS token.

Important

The ALTER ANY EXTERNAL DATA SOURCE permission grants any principal the ability to create and modify any external data source object, so it also grants the ability to access all database scoped credentials on the database. This permission must be considered as highly privileged and must be granted only to trusted principals in the system.

Error handling

When CREATE EXTERNAL TABLE AS SELECT exports data to a text-delimited file, there's no rejection file for rows that fail to export.

When you create the external table, the database attempts to connect to the external location. If the connection fails, the command fails, and the external table is not created. It can take a minute or more for the command to fail because the database retries the connection at least three times.

If CREATE EXTERNAL TABLE AS SELECT is canceled or fails, the database makes a one-time attempt to remove any new files and folders already created on the external data source.

In Azure Synapse Analytics and Analytics Platform System, the database reports any Java errors that occur on the external data source during the data export.

Remarks

After the CREATE EXTERNAL TABLE AS SELECT statement finishes, you can run Transact-SQL queries on the external table. These operations import data into the database for the duration of the query unless you import by using the CREATE TABLE AS SELECT statement.

The external table name and definition are stored in the database metadata. The data is stored in the external data source.

The CREATE EXTERNAL TABLE AS SELECT statement always creates a nonpartitioned table, even if the source table is partitioned.

For SQL Server 2022 (16.x), the option allow polybase export must be enabled by using sp_configure. For more information, see Set allow polybase export configuration option.
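
As a minimal sketch, enabling the option looks like the following; 'show advanced options' is included in case your instance requires it, and sysadmin permissions are assumed:

-- Enable PolyBase export for CETAS (SQL Server 2022 and later)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO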

For query plans in Azure Synapse Analytics and Analytics Platform System, created with EXPLAIN, the database uses these query plan operations for external tables: External shuffle move, External broadcast move, External partition move.

In Analytics Platform System, as a prerequisite for creating an external table, the appliance administrator needs to configure Hadoop connectivity. For more information, see "Configure Connectivity to External Data (Analytics Platform System)" in the Analytics Platform System documentation, which you can download from the Microsoft Download Center.

Limitations and restrictions

Because external table data resides outside of the database, backup and restore operations only operate on data stored in the database. As a result, only the metadata is backed up and restored.

The database doesn't verify the connection to the external data source when restoring a database backup that contains an external table. If the original source isn't accessible, the metadata restore of the external table will still succeed, but SELECT operations on the external table fail.

The database doesn't guarantee data consistency between the database and the external data. You, the customer, are solely responsible to maintain consistency between the external data and the database.

Data manipulation language (DML) operations aren't supported on external tables. For example, you can't use the Transact-SQL UPDATE, INSERT, or DELETE statements to modify the external data.

CREATE TABLE, DROP TABLE, CREATE STATISTICS, DROP STATISTICS, CREATE VIEW, and DROP VIEW are the only data definition language (DDL) operations allowed on external tables.

Limitations and restrictions for Azure Synapse Analytics

  • In Azure Synapse Analytics dedicated SQL pools, and Analytics Platform System, PolyBase can consume a maximum of 33,000 files per folder when running 32 concurrent PolyBase queries. This maximum number includes both files and subfolders in each HDFS folder. If the degree of concurrency is less than 32, a user can run PolyBase queries against folders in HDFS that contain more than 33,000 files. We recommend that users of Hadoop and PolyBase keep file paths short and use no more than 30,000 files per HDFS folder. When too many files are referenced, a JVM out-of-memory exception occurs.

  • In serverless SQL pools, external tables can't be created in a location where you currently have data. To reuse a location that has been used to store data, the location must be manually deleted on ADLS. For more limitations and best practices, see Filter optimization best practices.

In Azure Synapse Analytics dedicated SQL pools, and Analytics Platform System, when CREATE EXTERNAL TABLE AS SELECT selects from an RCFile, the column values in the RCFile must not contain the pipe (|) character.

SET ROWCOUNT (Transact-SQL) has no effect on CREATE EXTERNAL TABLE AS SELECT. To achieve a similar behavior, use TOP (Transact-SQL).
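
For example, the following sketch limits the exported rows with TOP; the object names (my_ds, my_ff) are illustrative and aren't defined in this article:

-- TOP limits the rows written to the external table;
-- SET ROWCOUNT would be ignored by CETAS.
CREATE EXTERNAL TABLE ext_SampleRows
    WITH (
            LOCATION = '/sample_rows/',
            DATA_SOURCE = my_ds,
            FILE_FORMAT = my_ff
            ) AS
SELECT TOP (100) *
FROM dbo.FactInternetSales;
GO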

Review Naming and Referencing Containers, Blobs, and Metadata for limitations on file names.

Character errors

The following characters, when present in the data, can cause errors (including rejected records) when you use CREATE EXTERNAL TABLE AS SELECT to write Parquet files.

In Azure Synapse Analytics and Analytics Platform System, this also applies to ORC files.

  • |
  • " (quotation mark character)
  • \r\n
  • \r
  • \n

To use CREATE EXTERNAL TABLE AS SELECT on data that contains these characters, first run the CREATE EXTERNAL TABLE AS SELECT statement to export the data to delimited text files, and then convert those files to Parquet or ORC by using an external tool.

Working with parquet

When working with parquet files, CREATE EXTERNAL TABLE AS SELECT generates one parquet file per available CPU, up to the configured maximum degree of parallelism (MAXDOP). Each file can grow to 190 GB; after that, SQL Server generates more parquet files as needed.

The query hint OPTION (MAXDOP n) affects only the SELECT part of CREATE EXTERNAL TABLE AS SELECT; it has no influence on the number of parquet files. Only the database-level and instance-level MAXDOP settings are considered.
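
To illustrate where the hint goes (object names are placeholders), the OPTION clause sits at the end of the SELECT, just as with the join hint in Example B:

CREATE EXTERNAL TABLE ext_SalesParquet
    WITH (
            LOCATION = '/sales_parquet/',
            DATA_SOURCE = my_ds,
            FILE_FORMAT = my_parquet_ff
            ) AS
SELECT *
FROM dbo.FactInternetSales
OPTION (MAXDOP 4); -- parallelizes only the SELECT part
GO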

Locking

Takes a shared lock on the SCHEMARESOLUTION object.

Supported data types

CETAS can be used to store result sets with the following SQL data types:

  • binary
  • varbinary
  • char
  • varchar
  • nchar
  • nvarchar
  • smalldatetime
  • date
  • datetime
  • datetime2
  • datetimeoffset
  • time
  • decimal
  • numeric
  • float
  • real
  • bigint
  • tinyint
  • smallint
  • int
  • bit
  • money
  • smallmoney

Examples

A. Create a Hadoop table by using CREATE EXTERNAL TABLE AS SELECT

Applies to: Azure Synapse Analytics and Analytics Platform System

The following example creates a new external table named hdfsCustomer that uses the column definitions and data from the source table dimCustomer.

The table definition is stored in the database, and the results of the SELECT statement are exported to the /pdwdata/customer.tbl file on the Hadoop external data source customer_ds. The file is formatted according to the external file format customer_ff.

The file name is generated by the database and contains the query ID for ease of aligning the file with the query that generated it.

The path hdfs://xxx.xxx.xxx.xxx:5000/files/ preceding the Customer directory must already exist. If the Customer directory doesn't exist, the database creates the directory.

Note

This example specifies the port as 5000. If the port isn't specified, the database uses 8020 as the default port.

The resulting Hadoop location and file name will be hdfs://xxx.xxx.xxx.xxx:5000/files/Customer/QueryID_YearMonthDay_HourMinutesSeconds_FileIndex.txt.

-- Example is based on AdventureWorks
CREATE EXTERNAL TABLE hdfsCustomer
    WITH (
            LOCATION = '/pdwdata/customer.tbl',
            DATA_SOURCE = customer_ds,
            FILE_FORMAT = customer_ff
            ) AS

SELECT *
FROM dimCustomer;
GO

B. Use a query hint with CREATE EXTERNAL TABLE AS SELECT

Applies to: Azure Synapse Analytics and Analytics Platform System

This query shows the basic syntax for using a query join hint with the CREATE EXTERNAL TABLE AS SELECT statement. After the query is submitted, the database uses the hash join strategy to generate the query plan. For more information on join hints and how to use the OPTION clause, see OPTION Clause (Transact-SQL).

-- Example is based on AdventureWorks
CREATE EXTERNAL TABLE dbo.FactInternetSalesNew
    WITH (
            LOCATION = '/files/Customer',
            DATA_SOURCE = customer_ds,
            FILE_FORMAT = customer_ff
            ) AS

SELECT T1.*
FROM dbo.FactInternetSales T1
INNER JOIN dbo.DimCustomer T2
    ON (T1.CustomerKey = T2.CustomerKey)
OPTION (HASH JOIN);
GO

C. Use CETAS to change column attributes

Applies to: Azure Synapse Analytics and Analytics Platform System

This example uses CETAS to change data types, nullability, and collation for several columns in the FactInternetSales table.

-- Example is based on AdventureWorks
CREATE EXTERNAL TABLE dbo.FactInternetSalesNew
    WITH (
            LOCATION = '/files/Customer',
            DATA_SOURCE = customer_ds,
            FILE_FORMAT = customer_ff
            ) AS

SELECT T1.ProductKey AS ProductKeyNoChange,
    T1.OrderDateKey AS OrderDate,
    T1.ShipDateKey AS ShipDate,
    T1.CustomerKey AS CustomerKeyNoChange,
    T1.OrderQuantity AS Quantity,
    CAST(T1.SalesAmount AS MONEY) AS SalesAmount
FROM dbo.FactInternetSales T1
INNER JOIN dbo.DimCustomer T2
    ON (T1.CustomerKey = T2.CustomerKey)
OPTION (HASH JOIN);
GO

D. Use CREATE EXTERNAL TABLE AS SELECT to export data as parquet

Applies to: SQL Server 2022 (16.x)

The following example creates a new external table named ext_sales that uses the data from the table SalesOrderDetail of AdventureWorks2022. The allow polybase export configuration option must be enabled.

The result of the SELECT statement is saved on previously configured S3-compatible object storage, through the external data source named s3_eds and the database scoped credential named s3_dsc. The parquet file location will be <ip>:<port>/cetas/sales.parquet, where cetas is the previously created storage bucket.

Note

Delta format is currently only supported as read-only.

-- Credential to access the S3-compatible object storage
CREATE DATABASE SCOPED CREDENTIAL s3_dsc
    WITH IDENTITY = 'S3 Access Key',
        SECRET = '<accesskeyid>:<secretkeyid>';
GO

-- S3-compatible object storage data source
CREATE EXTERNAL DATA SOURCE s3_eds
    WITH (
            LOCATION = 's3://<ip>:<port>',
            CREDENTIAL = s3_dsc
            );
GO

-- External File Format for PARQUET
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
    WITH (FORMAT_TYPE = PARQUET);
GO

CREATE EXTERNAL TABLE ext_sales
    WITH (
            LOCATION = '/cetas/sales.parquet',
            DATA_SOURCE = s3_eds,
            FILE_FORMAT = ParquetFileFormat
            ) AS

SELECT *
FROM AdventureWorks2022.[Sales].[SalesOrderDetail];
GO

E. Use CREATE EXTERNAL TABLE AS SELECT from delta table to parquet

Applies to: SQL Server 2022 (16.x)

The following example creates a new external table named Delta_to_Parquet that uses Delta-format data located on an S3-compatible object storage named s3_delta, and writes the result to another data source named s3_parquet as a parquet file. To do so, the example uses the OPENROWSET command. The allow polybase export configuration option must be enabled.

-- External File Format for PARQUET
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
    WITH (FORMAT_TYPE = PARQUET);
GO

CREATE EXTERNAL TABLE Delta_to_Parquet
    WITH (
            LOCATION = '/backup/sales.parquet',
            DATA_SOURCE = s3_parquet,
            FILE_FORMAT = ParquetFileFormat
            ) AS

SELECT *
FROM OPENROWSET(BULK '/delta/sales_fy22/', FORMAT = 'DELTA', DATA_SOURCE = 's3_delta') AS [r];
GO

F. Use CREATE EXTERNAL TABLE AS SELECT with a view as the source (wasbs prefix)

Applies to: Azure Synapse Analytics serverless SQL pools and dedicated SQL pools.

This example shows template code for writing CETAS with a user-defined view as the source, using managed identity for authentication and the wasbs: prefix.

CREATE DATABASE [<mydatabase>];
GO

USE [<mydatabase>];
GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL [WorkspaceIdentity] WITH IDENTITY = 'managed identity';
GO

CREATE EXTERNAL FILE FORMAT [ParquetFF] WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO

CREATE EXTERNAL DATA SOURCE [SynapseSQLwriteable] WITH (
    LOCATION = 'wasbs://<mystorageaccount>.blob.core.windows.net/<mycontainer>/<mybaseoutputfolderpath>',
    CREDENTIAL = [WorkspaceIdentity]
);
GO

CREATE EXTERNAL TABLE [dbo].[<myexternaltable>] WITH (
        LOCATION = '<myoutputsubfolder>/',
        DATA_SOURCE = [SynapseSQLwriteable],
        FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM [<myview>];
GO

G. Use CREATE EXTERNAL TABLE AS SELECT with a view as the source (https prefix)

Applies to: Azure Synapse Analytics serverless SQL pools and dedicated SQL pools.

This example shows template code for writing CETAS with a user-defined view as the source, using managed identity for authentication and the https: prefix.

CREATE DATABASE [<mydatabase>];
GO

USE [<mydatabase>];
GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL [WorkspaceIdentity] WITH IDENTITY = 'managed identity';
GO

CREATE EXTERNAL FILE FORMAT [ParquetFF] WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO

CREATE EXTERNAL DATA SOURCE [SynapseSQLwriteable] WITH (
    LOCATION = 'https://<mystorageaccount>.dfs.core.windows.net/<mycontainer>/<mybaseoutputfolderpath>',
    CREDENTIAL = [WorkspaceIdentity]
);
GO

CREATE EXTERNAL TABLE [dbo].[<myexternaltable>] WITH (
        LOCATION = '<myoutputsubfolder>/',
        DATA_SOURCE = [SynapseSQLwriteable],
        FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM [<myview>];
GO

CREATE EXTERNAL TABLE AS SELECT (CETAS) for Azure SQL Managed Instance

Applies to: Azure SQL Managed Instance

Creates an external table and then exports, in parallel, the results of a Transact-SQL SELECT statement.

You can use CREATE EXTERNAL TABLE AS SELECT (CETAS) to complete the following tasks:

  • Create an external table on top of Parquet or CSV files in Azure Blob storage or Azure Data Lake Storage (ADLS) Gen2.
  • Export, in parallel, the results of a T-SQL SELECT statement into the created external table.
  • For more data virtualization capabilities of Azure SQL Managed Instance, see Data virtualization with Azure SQL Managed Instance.

Note

This content applies to Azure SQL Managed Instance only. For other platforms, choose the appropriate version of CREATE EXTERNAL TABLE AS SELECT from the dropdown selector.

Transact-SQL syntax conventions

Syntax

CREATE EXTERNAL TABLE [ [ database_name . [ schema_name ] . ] | schema_name . ] table_name
    WITH (
        LOCATION = 'path_to_folder/',  
        DATA_SOURCE = external_data_source_name,  
        FILE_FORMAT = external_file_format_name
        [, PARTITION ( column_name [ , ...n ] ) ]
)
    AS <select_statement>  
[;]

<select_statement> ::=  
    [ WITH <common_table_expression> [ ,...n ] ]  
    SELECT <select_criteria>

Arguments

[ [ database_name . [ schema_name ] . ] | schema_name . ] table_name

The one- to three-part name of the table to create. For an external table, only the table metadata is stored. No actual data is moved or stored.

LOCATION = 'path_to_folder'

Specifies where to write the results of the SELECT statement on the external data source. The root folder is the data location specified in the external data source. LOCATION must point to a folder and have a trailing /. Example: aggregated_data/.

The destination folder for the CETAS must be empty. If the path and folder do not already exist, they are created automatically.

DATA_SOURCE = external_data_source_name

Specifies the name of the external data source object that contains the location where the external data will be stored. To create an external data source, use CREATE EXTERNAL DATA SOURCE (Transact-SQL).

FILE_FORMAT = external_file_format_name

Specifies the name of the external file format object that contains the format for the external data file. To create an external file format, use CREATE EXTERNAL FILE FORMAT (Transact-SQL). Only external file formats with FORMAT_TYPE=PARQUET and FORMAT_TYPE=DELIMITEDTEXT are currently supported. GZip compression for DELIMITEDTEXT format is not supported.
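
For instance, a minimal sketch of a supported DELIMITEDTEXT file format; the name CsvFileFormat and the format options are illustrative, and GZip compression is deliberately omitted because it isn't supported here:

CREATE EXTERNAL FILE FORMAT CsvFileFormat
    WITH (
            FORMAT_TYPE = DELIMITEDTEXT,
            FORMAT_OPTIONS (
                FIELD_TERMINATOR = ',',
                STRING_DELIMITER = '"',
                USE_TYPE_DEFAULT = TRUE
                )
            );
GO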

[, PARTITION ( column name [ , ...n ] ) ]

Partitions the output data into multiple parquet file paths. Partitioning happens by the given columns (column_name), matching each wildcard (*) in the LOCATION to the respective partitioning column. The number of columns in the PARTITION clause must match the number of wildcards in the LOCATION. At least one column must not be used for partitioning. A short sketch follows; Example D later in this article shows the full pattern.
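
Here, two wildcards in LOCATION pair with the two PARTITION columns; my_ds, ParquetFF, and dbo.Orders are placeholder names:

-- Each * in LOCATION maps, in order, to a column in the PARTITION clause.
CREATE EXTERNAL TABLE ext_OrdersPartitioned
    WITH (
            LOCATION = 'orders/year=*/month=*/',
            DATA_SOURCE = my_ds,
            FILE_FORMAT = ParquetFF,
            PARTITION ( [Year], [Month] )
            ) AS
SELECT *,
    YEAR(OrderDate) AS [Year],
    MONTH(OrderDate) AS [Month]
FROM dbo.Orders;
GO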

WITH <common_table_expression>

Specifies a temporary named result set, known as a common table expression (CTE). For more information, see WITH common_table_expression (Transact-SQL).

SELECT <select_criteria>

Populates the new table with the results from a SELECT statement. select_criteria is the body of the SELECT statement that determines which data to copy to the new table. For information about SELECT statements, see SELECT (Transact-SQL).

Note

The ORDER BY clause in SELECT isn't supported for CETAS.

Permissions

Permissions in storage

You need to have permissions to list folder content and write to the LOCATION path for CETAS to work.

Supported authentication methods are managed identity or a Shared Access Signature (SAS) token.

  • If you are using managed identity for authentication, make sure that the service principal of your SQL managed instance has a role of Storage Blob Data Contributor on the destination container.
  • If you are using an SAS token, Read, Write, and List permissions are required.
  • For Azure Blob Storage, the Allowed Services: Blob checkbox must be selected to generate the SAS token.
  • For Azure Data Lake Gen2, the Allowed Services: Container and Object checkboxes must be selected to generate the SAS token.

A user-assigned managed identity isn't supported. Microsoft Entra ID (formerly Azure Active Directory) passthrough authentication isn't supported.

Permissions in the SQL managed instance

To run this command, the database user needs all of these permissions or memberships:

  • ALTER SCHEMA permission on the local schema that will contain the new table or membership in the db_ddladmin fixed database role.
  • CREATE TABLE permission or membership in the db_ddladmin fixed database role.
  • SELECT permission on any objects referenced in the select_criteria.

The login needs all of these permissions:

  • ADMINISTER BULK OPERATIONS
  • ALTER ANY EXTERNAL DATA SOURCE
  • ALTER ANY EXTERNAL FILE FORMAT

Important

The ALTER ANY EXTERNAL DATA SOURCE permission grants any principal the ability to create and modify any external data source object, so it also grants the ability to access all database scoped credentials on the database. This permission must be considered as highly privileged and must be granted only to trusted principals in the system.

Supported data types

CETAS stores result sets with the following SQL data types:

  • binary
  • varbinary
  • char
  • varchar
  • nchar
  • nvarchar
  • smalldatetime
  • date
  • datetime
  • datetime2
  • datetimeoffset
  • time
  • decimal
  • numeric
  • float
  • real
  • bigint
  • tinyint
  • smallint
  • int
  • bit
  • money
  • smallmoney

Note

LOBs larger than 1 MB can't be used with CETAS.

Limitations and restrictions

  • CREATE EXTERNAL TABLE AS SELECT (CETAS) for Azure SQL Managed Instance is disabled by default. For more information, see the next section, Disabled by default.
  • For more information on limitations or known issues with data virtualization in Azure SQL Managed Instance, see Limitations and Known issues.

Because external table data resides outside of the database, backup and restore operations only operate on data stored in the database. As a result, only the metadata is backed up and restored.

The database doesn't verify the connection to the external data source when restoring a database backup that contains an external table. If the original source isn't accessible, the metadata restore of the external table will still succeed, but SELECT operations on the external table fail.

The database doesn't guarantee data consistency between the database and the external data. You, the customer, are solely responsible to maintain consistency between the external data and the database.

Data manipulation language (DML) operations aren't supported on external tables. For example, you can't use the Transact-SQL UPDATE, INSERT, or DELETE statements to modify the external data.

CREATE TABLE, DROP TABLE, CREATE STATISTICS, DROP STATISTICS, CREATE VIEW, and DROP VIEW are the only data definition language (DDL) operations allowed on external tables.

External tables can't be created in a location where you currently have data. To reuse a location that has been used to store data, the location must be manually deleted on ADLS.

SET ROWCOUNT (Transact-SQL) has no effect on CREATE EXTERNAL TABLE AS SELECT. To achieve a similar behavior, use TOP (Transact-SQL).

Review Naming and Referencing Containers, Blobs, and Metadata for limitations on file names.

Storage types

Files can be stored in Azure Data Lake Storage Gen2 or Azure Blob Storage. To query files, you need to provide the location in a specific format and use the location type prefix corresponding to the type of external source and endpoint/protocol, such as the following examples:

--Blob Storage endpoint
abs://<container>@<storage_account>.blob.core.windows.net/<path>/<file_name>.parquet

--Data Lake endpoint
adls://<container>@<storage_account>.dfs.core.windows.net/<path>/<file_name>.parquet

Important

The provided Location type prefix is used to choose the optimal protocol for communication and to leverage any advanced capabilities offered by the particular storage type. Using the generic https:// prefix is disabled. Always use endpoint-specific prefixes.

Disabled by default

CREATE EXTERNAL TABLE AS SELECT (CETAS) allows you to export data from your SQL managed instance into an external storage account, so there is potential for data exfiltration risk with these capabilities. Therefore, CETAS is disabled by default for Azure SQL Managed Instance.

Enable CETAS

CETAS for Azure SQL Managed Instance can only be enabled via a method that requires elevated Azure permissions; it can't be enabled via T-SQL. Because of the risk of unauthorized data exfiltration, CETAS can't be enabled via the sp_configure T-SQL stored procedure, and instead requires user action outside of the SQL managed instance.

Permissions to enable CETAS

To enable CETAS via Azure PowerShell, the user running the command must have the Contributor or SQL Security Manager Azure RBAC role for your SQL managed instance.

A custom role can be created for this as well, requiring the Read and Write actions for Microsoft.Sql/managedInstances/serverConfigurationOptions.

Methods to enable CETAS

To run the PowerShell commands on a computer, the Az module version 9.7.0 or newer must be installed locally. Or, consider using Azure Cloud Shell to run Azure PowerShell at shell.azure.com.

First, log in to Azure and set the proper context for your subscription:

Login-AzAccount
$SubscriptionID = "<YourSubscriptionIDHERE>"
Select-AzSubscription -SubscriptionName $SubscriptionID

To manage the server configuration option "allowPolybaseExport", adjust the following PowerShell scripts to your subscription and SQL managed instance name, then run the commands. For more information, see Set-AzSqlServerConfigurationOption and Get-AzSqlServerConfigurationOption.

# Enable ServerConfigurationOption with name "allowPolybaseExport"
Set-AzSqlServerConfigurationOption -ResourceGroupName "resource_group_name" -InstanceName "ManagedInstanceName" `
-Name "allowPolybaseExport" -Value 1

To disable the server configuration option "allowPolybaseExport":

# Disable ServerConfigurationOption with name "allowPolybaseExport"
Set-AzSqlServerConfigurationOption -ResourceGroupName "resource_group_name" -InstanceName "ManagedInstanceName" `
-Name "allowPolybaseExport" -Value 0

To get the current value of the server configuration option "allowPolybaseExport":

# Get ServerConfigurationOptions with name "allowPolybaseExport"
Get-AzSqlServerConfigurationOption -ResourceGroupName "resource_group_name" -InstanceName "ManagedInstanceName" `
-Name "allowPolybaseExport"

Verify status of CETAS

At any time you can check the current status of the CETAS configuration option.

Connect to your SQL managed instance. Run the following T-SQL and observe the value column of the response. Once the server configuration change is complete, the results of this query should match your desired setting.

SELECT [name], [value] FROM sys.configurations WHERE name ='allow polybase export';

Troubleshoot

For more steps to troubleshoot data virtualization in Azure SQL Managed Instance, see Troubleshoot. Error handling and common error messages for CETAS in Azure SQL Managed Instance follow.

Error handling

When CREATE EXTERNAL TABLE AS SELECT exports data to a text-delimited file, there's no rejection file for rows that fail to export.

When you create the external table, the database attempts to connect to the external location. If the connection fails, the command fails, and the external table won't be created. It can take a minute or more for the command to fail because the database retries the connection at least three times.

Common error messages

The following common error messages for CETAS in Azure SQL Managed Instance come with quick explanations.

  1. Specifying a location already existing in the storage.

    Solution: Clear storage location (including snapshot), or change location parameter in query.

    Sample error message: Msg 15842: Cannot create external table. External table location already exists.

  2. Column values formatted using JSON objects.

    Solution: Convert value column to a single VARCHAR or NVARCHAR column, or a set of columns with explicitly defined types.

    Sample error message: Msg 16549: Values in column value are formatted as JSON objects and cannot be written using CREATE EXTERNAL TABLE AS SELECT.

  3. Location parameter invalid (for example, multiple //).

    Solution: Fix location parameter.

    Sample error message: Msg 46504: External option 'LOCATION' is not valid. Ensure that the length and range are appropriate.

  4. Missing one of the required options (DATA_SOURCE, FILE_FORMAT, LOCATION).

    Solution: Add the missing parameter to CETAS query.

    Sample error message: Msg 46505: Missing required external DDL option 'FILE_FORMAT'

  5. Access problems (invalid credential, expired credential or credential with insufficient permissions). Alternate possibility is an invalid path, where the SQL managed instance received an Error 404 from storage.

    Solution: Verify credential validity and permissions. Alternatively, validate that the path is valid and the storage exists. Use the URL path adls://<container>@<storage_account>.dfs.core.windows.net/<path>/.

    Sample error message: Msg 15883: Access check for '<Read/Write>' operation against '<Storage>' failed with HRESULT = '0x...'

  6. Location part of DATA_SOURCE contains wildcards.

    Solution: Remove wildcards from the location.

    Sample error message: Msg 16576: Location provided by DATA_SOURCE '<data source name>' cannot contain any wildcards.

  7. Number of wildcards in LOCATION parameter and number of partitioned columns do not match.

    Solution: Ensure same number of wildcards in LOCATION as partition columns.

    Sample error message: Msg 16577: Number of wildcards in LOCATION must match the number of columns in PARTITION clause.

  8. Column name in PARTITION clause does not match any columns in the list.

    Solution: Make sure that columns in PARTITION are valid.

    Sample error message: Msg 16578: The column name '<column name>' specified in the PARTITION option does not match any column specified in the column list

  9. Column specified more than once in PARTITION list.

    Solution: Make sure that columns in PARTITION clause are unique.

    Sample error message: Msg 16579: A column has been specified more than once in the PARTITION list. Column '<column name>' is specified more than once.

  10. Column was specified more than once in PARTITION list, or it matches no columns from SELECT list.

    Solution: Ensure no duplicates are there in partition list, and the partition columns exist in SELECT part.

    Sample error messages: Msg 11569: Column <column name> has been specified more than once in the partition columns list. Please try again with a valid parameter. or Msg 11570: Column <column name> specified in the partition columns list does not match any columns in SELECT clause. Please try again with a valid parameter.

  11. Using all columns in PARTITION list.

    Solution: At least one of the columns from SELECT part must not be in PARTITION part of the query.

    Sample error message: Msg 11571: All output columns in DATA_EXPORT query are declared as PARTITION columns. DATA_EXPORT query requires at least one column to export.

  12. Feature is disabled.

    Solution: Enable the feature, using the Disabled by default section in this article.

    Sample error message: Msg 46552: Writing into an external table is disabled. See 'https://go.microsoft.com/fwlink/?linkid=2201073' for more information

Locking

Takes a shared lock on the SCHEMARESOLUTION object.

Examples

A. Use CETAS with a view to create an external table using the managed identity

This example provides code for writing CETAS with a view as the source, using a system-assigned managed identity for authentication.

CREATE DATABASE [<mydatabase>];
GO

USE [<mydatabase>];
GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL [WorkspaceIdentity] WITH IDENTITY = 'managed identity';
GO

CREATE EXTERNAL FILE FORMAT [ParquetFF] WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO

CREATE EXTERNAL DATA SOURCE [SQLwriteable] WITH (
    LOCATION = 'adls://container@mystorageaccount.dfs.core.windows.net/mybaseoutputfolderpath',
    CREDENTIAL = [WorkspaceIdentity]
);
GO

CREATE EXTERNAL TABLE [dbo].[<myexternaltable>] WITH (
        LOCATION = '<myoutputsubfolder>/',
        DATA_SOURCE = [SQLwriteable],
        FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM [<myview>];
GO

B. Use CETAS with a view to create an external table with SAS authentication

This example provides code for writing CETAS with a view as the source, using a Shared Access Signature (SAS) token for authentication.

CREATE DATABASE [<mydatabase>];
GO

USE [<mydatabase>];
GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL SAS_token
WITH
  IDENTITY = 'SHARED ACCESS SIGNATURE',
  -- Remove ? from the beginning of the SAS token
  SECRET = '<azure_shared_access_signature>' ;
GO

CREATE EXTERNAL FILE FORMAT [ParquetFF] WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO

CREATE EXTERNAL DATA SOURCE [SQLwriteable] WITH (
    LOCATION = 'adls://container@mystorageaccount.dfs.core.windows.net/mybaseoutputfolderpath',
    CREDENTIAL = [SAS_token]
);
GO

CREATE EXTERNAL TABLE [dbo].[<myexternaltable>] WITH (
        LOCATION = '<myoutputsubfolder>/',
        DATA_SOURCE = [SQLwriteable],
        FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM [<myview>];
GO

C. Create an external table into a single parquet file on the storage

The next two examples show how to offload some of the data from a local table into an external table stored as parquet file(s) in an Azure Blob storage container. They're designed to work with the AdventureWorks2022 database. This example shows how to create an external table as a single parquet file; the next example shows how to create an external table partitioned into multiple folders with parquet files.

The example below uses managed identity for authentication. As such, make sure that your Azure SQL Managed Instance service principal has the Storage Blob Data Contributor role on your Azure Blob storage container. Alternatively, you can modify the example and use Shared Access Signature (SAS) tokens for authentication.

In the following sample, you create an external table as a single parquet file in Azure Blob Storage, selecting from the SalesOrderHeader table for orders older than 1-Jan-2014:

--Example 1: Creating an external table into a single parquet file on the storage, selecting from SalesOrderHeader table for orders older than 1-Jan-2014:
USE [AdventureWorks2022]
GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Strong Password';
GO

CREATE DATABASE SCOPED CREDENTIAL [CETASCredential]
    WITH IDENTITY = 'managed identity';
GO

CREATE EXTERNAL DATA SOURCE [CETASExternalDataSource]
WITH (
    LOCATION = 'abs://container@storageaccount.blob.core.windows.net',
    CREDENTIAL = [CETASCredential] );
GO

CREATE EXTERNAL FILE FORMAT [CETASFileFormat]
WITH(
    FORMAT_TYPE=PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
    );
GO

-- Count how many rows we plan to offload
SELECT COUNT(*) FROM [AdventureWorks2022].[Sales].[SalesOrderHeader] WHERE
        OrderDate < '2014-01-01';

-- CETAS write to a single file, archive all data older than 1-Jan-2014:
CREATE EXTERNAL TABLE SalesOrdersExternal
WITH (
    LOCATION = 'SalesOrders/',
    DATA_SOURCE = [CETASExternalDataSource],
    FILE_FORMAT = [CETASFileFormat])
AS 
    SELECT 
        *
    FROM 
        [AdventureWorks2022].[Sales].[SalesOrderHeader]
    WHERE
        OrderDate < '2014-01-01';

-- you can query the newly created external table
SELECT COUNT (*) FROM SalesOrdersExternal;

D. Create a partitioned external table into multiple parquet files stored in a folder tree

This example builds on the previous example to show how to create an external table and partition it into multiple folders with parquet files. You can use partitioned tables to gain performance benefits if your data set is large.

Create an external table from the SalesOrderHeader data by using the steps from Example C, but partition the external table by OrderDate year and month. When querying partitioned external tables, we can benefit from partition elimination for performance.

--CETAS write to a folder hierarchy (partitioned table):
CREATE EXTERNAL TABLE SalesOrdersExternalPartitioned
WITH (
    LOCATION = 'PartitionedOrders/year=*/month=*/', 
    DATA_SOURCE = CETASExternalDataSource,
    FILE_FORMAT = CETASFileFormat,
    --year and month will correspond to the two respective wildcards in folder path    
    PARTITION (
        [Year],
        [Month]
        ) 
    )
AS
    SELECT
        *,
        YEAR(OrderDate) AS [Year],
        MONTH(OrderDate) AS [Month]
    FROM [AdventureWorks2022].[Sales].[SalesOrderHeader]
    WHERE
        OrderDate < '2014-01-01';
GO

-- you can query the newly created partitioned external table
SELECT COUNT (*) FROM SalesOrdersExternalPartitioned;
