What is Auto Loader directory listing mode?

Auto Loader uses directory listing mode by default. In directory listing mode, Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly start Auto Loader streams without any permission configurations other than access to your data on cloud storage.

For best performance with directory listing mode, use Databricks Runtime 9.1 or above. This article describes the default functionality of directory listing mode as well as optimizations based on lexical ordering of files.
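As a minimal sketch, the following Python snippet starts an Auto Loader stream that relies on the default directory listing mode. It assumes a Databricks notebook where spark is already defined; the input, checkpoint, and output paths are hypothetical placeholders.

# Minimal Auto Loader stream using the default directory listing mode.
# All paths below are hypothetical placeholders; replace them with your own.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")   # format of the incoming files
      .load("abfss://container@account.dfs.core.windows.net/some/path"))

(df.writeStream
   .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/checkpoints/some_path")
   .start("abfss://container@account.dfs.core.windows.net/bronze/some_path"))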

How does directory listing mode work?

Azure Databricks has optimized directory listing mode for Auto Loader to discover files in cloud storage more efficiently than other Apache Spark options.

For example, if you have files being uploaded every 5 minutes as /some/path/YYYY/MM/DD/HH/fileName, the Apache Spark file source lists all subdirectories in parallel to find all the files in these directories. The following calculation estimates the total number of LIST API calls to object storage for one year of hourly directories:

1 (base directory) + 365 (days) * 24 (hours per day) = 8761 calls

By receiving a flattened response from storage, Auto Loader reduces the number of API calls to the number of files in storage divided by the number of results returned by each API call, greatly reducing your cloud costs. The following table shows the number of files returned by each API call for common object storage:

Object storage    Results returned per call
S3                1000
ADLS Gen2         5000
GCS               1024
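To make the comparison concrete, the following back-of-the-envelope Python calculation contrasts the per-directory listing above with Auto Loader's flattened listing; the one million file count is a hypothetical figure, not one taken from this article.

# Per-directory listing: one LIST call per hourly subdirectory for a year, plus the base directory.
spark_file_source_calls = 1 + 365 * 24
print(spark_file_source_calls)            # 8761

# Flattened listing: calls ~= total files / results returned per call.
total_files = 1_000_000                   # hypothetical number of files in storage
results_per_call = 5000                   # ADLS Gen2 returns up to 5000 results per call
auto_loader_calls = -(-total_files // results_per_call)   # ceiling division
print(auto_loader_calls)                  # 200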

Incremental listing (deprecated)

Important

This feature has been deprecated. Databricks recommends using file notification mode instead of incremental listing.

Note

Available in Databricks Runtime 9.1 LTS and above.

Incremental listing is available for Azure Data Lake Storage Gen2 (abfss://), S3 (s3://), and GCS (gs://).

For lexicographically generated files, Auto Loader leverages the lexical file ordering and optimized listing APIs to improve the efficiency of directory listing by listing from recently ingested files rather than listing the contents of the entire directory.

By default, Auto Loader automatically detects whether a given directory is applicable for incremental listing by checking and comparing file paths of previously completed directory listings. To ensure eventual completeness of data in auto mode, Auto Loader automatically triggers a full directory list after completing 7 consecutive incremental lists. You can control the frequency of full directory lists by setting cloudFiles.backfillInterval to trigger asynchronous backfills at a given interval.
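Although the feature is deprecated, a stream that still relies on incremental listing can pin the behavior and schedule periodic backfills with the options below; the source path, file format, and one-day interval are hypothetical choices, not recommendations.

# Auto Loader with incremental listing (deprecated) and a periodic asynchronous backfill.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.useIncrementalListing", "auto")   # "auto" (default), "true", or "false"
      .option("cloudFiles.backfillInterval", "1 day")       # trigger a full directory list daily
      .load("s3://my-bucket/database_schema_name/table_name/"))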

Lexical ordering of files

For files to be lexically ordered, new files must be uploaded with a prefix that is lexicographically greater than existing files. Some examples of lexically ordered directories are shown below.

Versioned files

Delta Lake makes commits to table transaction logs in a lexical order.

<path-to-table>/_delta_log/00000000000000000000.json
<path-to-table>/_delta_log/00000000000000000001.json <- guaranteed to be written after version 0
<path-to-table>/_delta_log/00000000000000000002.json <- guaranteed to be written after version 1
...

AWS DMS uploads CDC files to AWS S3 in a versioned manner.

database_schema_name/table_name/LOAD00000001.csv
database_schema_name/table_name/LOAD00000002.csv
...
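The zero padding is what makes these names sort correctly. The short Python illustration below (with made-up version numbers) shows that padded names sort in numeric order while unpadded names do not.

# Zero padding keeps lexical (string) order aligned with numeric order.
padded = [f"LOAD{n:08d}.csv" for n in (1, 2, 10)]
print(sorted(padded))     # ['LOAD00000001.csv', 'LOAD00000002.csv', 'LOAD00000010.csv']

unpadded = [f"LOAD{n}.csv" for n in (1, 2, 10)]
print(sorted(unpadded))   # ['LOAD1.csv', 'LOAD10.csv', 'LOAD2.csv']  <- 10 sorts before 2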

Date partitioned files

Files can be uploaded in a date partitioned format. Some examples of this are:

// <base-path>/yyyy/MM/dd/HH:mm:ss-randomString
<base-path>/2021/12/01/10:11:23-b1662ecd-e05e-4bb7-a125-ad81f6e859b4.json
<base-path>/2021/12/01/10:11:23-b9794cf3-3f60-4b8d-ae11-8ea320fad9d1.json
...

// <base-path>/year=yyyy/month=MM/day=dd/hour=HH/minute=mm/randomString
<base-path>/year=2021/month=12/day=04/hour=08/minute=22/442463e5-f6fe-458a-8f69-a06aa970fc69.csv
<base-path>/year=2021/month=12/day=04/hour=08/minute=22/8f00988b-46be-4112-808d-6a35aead0d44.csv <- this may be uploaded before the file above as long as processing happens less frequently than a minute

When files are uploaded with date partitioning, some things to keep in mind are:

  • Months, days, hours, and minutes need to be left-padded with zeros to ensure lexical ordering (for example, upload as hour=03 rather than hour=3, and 2021/05/03 rather than 2021/5/3); see the sketch after this list.
  • Files don’t necessarily have to be uploaded in lexical order within the deepest directory, as long as processing happens less frequently than the parent directory’s time granularity.
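The sketch below shows one way to produce correctly padded, date partitioned upload paths using Python's datetime formatting; the base path and partition key names are hypothetical.

from datetime import datetime, timezone
import uuid

# strftime zero-pads %m, %d, %H, and %M, so the resulting paths sort lexically by time.
base_path = "abfss://container@account.dfs.core.windows.net/events"   # hypothetical base path
now = datetime(2021, 12, 4, 8, 22, tzinfo=timezone.utc)               # example timestamp

path = now.strftime(f"{base_path}/year=%Y/month=%m/day=%d/hour=%H/minute=%M/") + f"{uuid.uuid4()}.csv"
print(path)   # .../year=2021/month=12/day=04/hour=08/minute=22/<random-uuid>.csv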

Several cloud services can be configured to upload files in a date partitioned lexical ordering.

Change source path for Auto Loader

In Databricks Runtime 11.3 LTS and above, you can change the directory input path for Auto Loader configured with directory listing mode without having to choose a new checkpoint directory.

Warning

This functionality is not supported for file notification mode. If file notification mode is used and the path is changed, you might fail to ingest files that are already present in the new directory at the time of the directory update.

For example, to run a daily ingestion job that loads all data from a directory structure organized by day, such as /YYYYMMDD/, you can use the same checkpoint to track ingestion state across a different source directory each day, while retaining state for files ingested from all previously used source directories.
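As an illustrative sketch of that daily pattern, assuming directory listing mode and hypothetical paths, only the source directory changes between runs while the checkpoint location stays the same:

from datetime import date

# The checkpoint stays the same across runs; only the source directory changes each day.
# All paths below are hypothetical placeholders.
source_path = f"abfss://container@account.dfs.core.windows.net/landing/{date.today():%Y%m%d}/"
checkpoint_path = "abfss://container@account.dfs.core.windows.net/checkpoints/daily_ingest"

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(source_path))

(df.writeStream
   .option("checkpointLocation", checkpoint_path)
   .trigger(availableNow=True)   # process available files, then stop (batch-style daily run)
   .start("abfss://container@account.dfs.core.windows.net/bronze/daily_ingest"))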