Connect to Amazon S3

Artikkeli
11/22/2024

This article explains how to connect to AWS S3 from Azure Databricks.

Access S3 buckets with URIs and AWS keys

You can set Spark properties to configure a AWS keys to access S3.

Databricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the AWS key while allowing users to access S3. To create a secret scope, see Manage secret scopes.

The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to S3. See Compute permissions and Collaborate using Databricks notebooks.

To set Spark properties, use the following snippet in a cluster’s Spark configuration to set the AWS keys stored in secret scopes as environment variables:

AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}

You can then read from S3 using the following commands:

aws_bucket_name = "my-s3-bucket"

df = spark.read.load(f"s3a://{aws_bucket_name}/flowers/delta/")
display(df)
dbutils.fs.ls(f"s3a://{aws_bucket_name}/")

Access S3 with open-source Hadoop options

Databricks Runtime supports configuring the S3A filesystem using open-source Hadoop options. You can configure global properties and per-bucket properties.

Global configuration

# Global S3 configuration
spark.hadoop.fs.s3a.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.endpoint <aws-endpoint>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS

Per-bucket configuration

You configure per-bucket properties using the syntax spark.hadoop.fs.s3a.bucket.<bucket-name>.<configuration-key>. This lets you set up buckets with different credentials, endpoints, and so on.

For example, in addition to global S3 settings you can configure each bucket individually using the following keys:

# Set up authentication and endpoint for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>

# Configure a different KMS encryption key for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>

Deprecated patterns for storing and accessing data from Azure Databricks

The following are deprecated storage patterns:

Databricks no longer recommends mounting external data locations to Databricks Filesystem. See Mounting cloud object storage on Azure Databricks.

Important

The S3A filesystem enables caching by default and releases resources on ‘FileSystem.close()’. To avoid other threads using a reference to the cached file system incorrectly, do not explicitly use the ‘FileSystem.close().
The S3A filesystem does not remove directory markers when closing an output stream. Legacy applications based on Hadoop versions that do not include HADOOP-13230 can misinterpret them as empty directories even if there are files inside.

Jaa