Train Spark ML models on Databricks Connect with pyspark.ml.connect
Important
This feature is in Public Preview.
This article provides an example that demonstrates how to use the pyspark.ml.connect module to run distributed training of Spark ML models and model inference on Databricks Connect.
What is pyspark.ml.connect?
Spark 3.5 introduces pyspark.ml.connect, which is designed to support Spark Connect mode and Databricks Connect. Learn more about Databricks Connect.
The pyspark.ml.connect module consists of common learning algorithms and utilities, including classification, feature transformers, ML pipelines, and cross validation. It provides interfaces similar to those of the legacy pyspark.ml module, but currently contains only a subset of the algorithms in pyspark.ml. The supported algorithms are listed below:
- Classification algorithm: pyspark.ml.connect.classification.LogisticRegression
- Feature transformers: pyspark.ml.connect.feature.MaxAbsScaler and pyspark.ml.connect.feature.StandardScaler
- Evaluators: pyspark.ml.connect.evaluation.RegressionEvaluator, pyspark.ml.connect.evaluation.BinaryClassificationEvaluator, and pyspark.ml.connect.evaluation.MulticlassClassificationEvaluator
- Pipeline: pyspark.ml.connect.pipeline.Pipeline
- Model tuning: pyspark.ml.connect.tuning.CrossValidator
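As a rough illustration of how the classes listed above might fit together, the following sketch chains a StandardScaler and a LogisticRegression in a Pipeline, fits it, and runs inference on the same data. The toy rows, the column names (features, scaled_features, label), the hyperparameter values, and the helper names build_pipeline and train_and_score are assumptions made for illustration; they are not taken from the example notebook. The sketch also assumes an environment where pyspark 3.5 or above and its ML dependencies are installed.

```python
from typing import Any

# Toy rows of (features, label). Illustrative data only, not from the article.
TOY_ROWS = [
    ([1.0, 2.0], 1),
    ([2.0, -1.0], 1),
    ([-3.0, -2.0], 0),
    ([-1.0, -2.0], 0),
]


def build_pipeline() -> Any:
    """Return an unfitted pipeline: standardize the features, then classify."""
    # Imported inside the function so that defining this helper does not
    # itself require pyspark and its ML dependencies to be importable.
    from pyspark.ml.connect.classification import LogisticRegression
    from pyspark.ml.connect.feature import StandardScaler
    from pyspark.ml.connect.pipeline import Pipeline

    scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
    classifier = LogisticRegression(
        featuresCol="scaled_features",  # consume the scaler's output column
        labelCol="label",
        maxIter=20,          # hypothetical values chosen for a tiny dataset
        learningRate=0.01,
    )
    return Pipeline(stages=[scaler, classifier])


def train_and_score(spark: Any) -> Any:
    """Fit the pipeline on the toy rows and return a DataFrame with predictions.

    `spark` is assumed to be an existing Spark Connect or Databricks Connect
    session; this function does not create one.
    """
    dataset = spark.createDataFrame(TOY_ROWS, schema=["features", "label"])
    model = build_pipeline().fit(dataset)   # distributed training
    return model.transform(dataset)         # model inference
```

With Databricks Connect configured, a session can typically be obtained with DatabricksSession.builder.getOrCreate() from the databricks.connect package and passed to train_and_score. The example notebook remains the authoritative walkthrough; this sketch only shows the general shape of the API.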
Requirements
- Set up Databricks Connect on your clusters. See Compute configuration for Databricks Connect.
- Databricks Runtime 14.0 ML or higher installed.
- Cluster access mode of Assigned.
Example notebook
The following notebook demonstrates how to use distributed ML on Databricks Connect:
Distributed ML on Databricks Connect
For reference information about APIs in pyspark.ml.connect, Databricks recommends the Apache Spark API reference.