What is Photon?
This article explains the benefits of running your workloads on the Photon query engine.
Photon is a high-performance Azure Databricks-native vectorized query engine that runs your SQL workloads and DataFrame API calls faster to reduce your total cost per workload. Photon is compatible with Apache Spark APIs, so it works with your existing code.
Photon features
The following are key features and advantages of using Photon.
- Support for SQL and equivalent DataFrame operations with Delta and Parquet tables.
- Accelerated queries that process data faster and include aggregations and joins.
- Faster performance when data is accessed repeatedly from the disk cache.
- Robust scan performance on tables with many columns and many small files.
- Faster Delta and Parquet writing using
UPDATE
,DELETE
,MERGE INTO
,INSERT
, andCREATE TABLE AS SELECT
, including wide tables that contain thousands of columns. - Replaces sort-merge joins with hash-joins.
- For AI and ML workloads, Photon improves performance for applications using Spark SQL, Spark DataFrames, feature engineering, GraphFrames, and xgboost4j.
Photon enablement
Photon enablement varies by compute type:
Photon runs by default on SQL warehouses and serverless compute for notebooks and workflows.
Photon is enabled by default on compute running Databricks Runtime 9.1 LTS and above.
Photon can be enabled manually on compute running Databricks Runtime 15.2 for Machine Learning or above.
Configure Photon enablement
To enable or disable Photon on all-purpose and jobs compute, select the Use Photon Acceleration checkbox in the Compute UI.
Photon is not enabled by default on any compute created with the Clusters API or Jobs API. To enable Photon, you must set the runtime_engine
attribute to PHOTON
.
Supported instance types
Photon supports a number of instance types on the driver and worker nodes. Photon instance types consume DBUs at a different rate than the same instance type running the non-Photon runtime. For more information about Photon instances and DBU consumption, see the Azure Databricks pricing page.
Supported operators, expressions, and data types
The following are the operators, expressions, and data types that Photon covers.
Operators
- Scan, Filter, Project
- Hash Aggregate/Join/Shuffle
- Nested-Loop Join
- Null-Aware Anti Join
- Union, Expand, ScalarSubquery
- Delta/Parquet Write Sink
- Sort
- Window Function
Expressions
- Comparison / Logic
- Arithmetic / Math (most)
- Conditional (IF, CASE, etc.)
- String (common ones)
- Casts
- Aggregates(most common ones)
- Date/Timestamp
Data types
- Byte/Short/Int/Long
- Boolean
- String/Binary
- Decimal
- Float/Double
- Date/Timestamp
- Struct
- Array
- Map
Features that require Photon
The following are features that require Photon.
- Predictive I/O for read and write. See What is predictive I/O?.
- H3 geospatial expressions. See H3 geospatial functions.
- Dynamic file pruning in
MERGE
,UPDATE
, andDELETE
statements. See Dynamic file pruning.
Limitations
- Structured Streaming: Photon currently supports stateless streaming with Delta, Parquet, CSV, and JSON. Stateless Kafka and Kinesis streaming is supported when writing to a Delta or Parquet sink.
- Photon does not support UDFs, RDD APIs, or Dataset APIs.
- Photon doesn’t impact queries that normally run in under two seconds.
If your workload hits an unsupported operation, the compute resource switches to the standard runtime engine for the remainder of the workload.