แชร์ผ่าน


Dynamic file pruning

Dynamic file pruning, can significantly improve the performance of many queries on Delta Lake tables. Dynamic file pruning triggers for queries that contain filter statements or WHERE clauses. You must use Photon-enabled compute to use dynamic file pruning in MERGE, UPDATE, and DELETE statements. Only SELECT statements leverage dynamic file pruning when Photon is not used.

Dynamic file pruning is especially efficient for non-partitioned tables, or for joins on non-partitioned columns. The performance impact of dynamic file pruning is often correlated to the clustering of data so consider using Z-Ordering to maximize the benefit.

For background and use cases for dynamic file pruning, see Faster SQL queries on Delta Lake with dynamic file pruning.

Configuration

Dynamic file pruning is controlled by the following Apache Spark configuration options:

  • spark.databricks.optimizer.dynamicFilePruning (default is true): The main flag that directs the optimizer to push down filters. When set to false, dynamic file pruning will not be in effect.
  • spark.databricks.optimizer.deltaTableSizeThreshold (default is 10,000,000,000 bytes (10 GB)): Represents the minimum size (in bytes) of the Delta table on the probe side of the join required to trigger dynamic file pruning. If the probe side is not very large, it is probably not worthwhile to push down the filters and we can just simply scan the whole table. You can find the size of a Delta table by running the DESCRIBE DETAIL table_name command and then looking at the sizeInBytes column.
  • spark.databricks.optimizer.deltaTableFilesThreshold (default is 10): Represents the number of files of the Delta table on the probe side of the join required to trigger dynamic file pruning. When the probe side table contains fewer files than the threshold value, dynamic file pruning is not triggered. If a table has only a few files, it is probably not worthwhile to enable dynamic file pruning. You can find the size of a Delta table by running the DESCRIBE DETAIL table_name command and then looking at the numFiles column.