Migrate to Databricks Connect for Python
This article describes how to migrate from Databricks Connect for Databricks Runtime 12.2 LTS and below to Databricks Connect for Databricks Runtime 13.3 LTS and above for Python. Databricks Connect enables you to connect popular IDEs, notebook servers, and custom applications to Azure Databricks clusters. See What is Databricks Connect?. For the Scala version of this article, see Migrate to Databricks Connect for Scala.
Note
Before you begin to use Databricks Connect, you must set up the Databricks Connect client.
Follow these guidelines to migrate your existing Python code project or coding environment from Databricks Connect for Databricks Runtime 12.2 LTS and below to Databricks Connect for Databricks Runtime 13.3 LTS and above.
Install the correct version of Python as listed in the installation requirements to match your Azure Databricks cluster, if it is not already installed locally.
Upgrade your Python virtual environment to use the correct version of Python to match your cluster, if needed. For instructions, see your virtual environment provider’s documentation.
With your virtual environment activated, uninstall PySpark from your virtual environment:
pip3 uninstall pyspark
With your virtual environment still activated, uninstall Databricks Connect for Databricks Runtime 12.2 LTS and below:
pip3 uninstall databricks-connect
With your virtual environment still activated, install Databricks Connect for Databricks Runtime 13.3 LTS and above:
pip3 install --upgrade "databricks-connect==14.0.*" # Or X.Y.* to match your cluster version.
Note
Databricks recommends that you append the "dot-asterisk" notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
Update your Python code to initialize the spark variable (which represents an instantiation of the DatabricksSession class, similar to SparkSession in PySpark). See Compute configuration for Databricks Connect.
Migrate your RDD APIs to use DataFrame APIs, and migrate your SparkContext to use alternatives.
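As a sketch of the spark initialization step: with Databricks Connect for Databricks Runtime 13.3 LTS and above, the DatabricksSession builder replaces the PySpark SparkSession entry point. The workspace URL, token, and cluster ID below are placeholder values that you would replace with your own workspace details; running this requires a reachable Azure Databricks cluster.

```python
from databricks.connect import DatabricksSession

# Build a Spark session against a remote Azure Databricks cluster.
# The host, token, and cluster_id values are placeholders for illustration.
spark = DatabricksSession.builder.remote(
    host="https://adb-1234567890123456.7.azuredatabricks.net",
    token="<personal-access-token>",
    cluster_id="<cluster-id>",
).getOrCreate()

# The resulting session is used like a regular SparkSession for
# SQL and DataFrame operations.
spark.sql("SELECT 1 AS ok").show()
```

You can also omit the remote() arguments and let the builder pick up connection details from environment variables or a Databricks configuration profile.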
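To illustrate the RDD migration step, the following hedged sketch shows a typical filter-and-map pipeline rewritten from the RDD API (which is not available over Databricks Connect) to the equivalent DataFrame API. It assumes a spark session created with DatabricksSession, and the column names are illustrative.

```python
from pyspark.sql import functions as F

# Before (Databricks Runtime 12.2 LTS and below): RDD APIs via SparkContext.
# rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("c", 3)])
# result = rdd.filter(lambda kv: kv[1] > 1).map(lambda kv: kv[0]).collect()

# After (Databricks Runtime 13.3 LTS and above): the same logic with
# DataFrame APIs, which work over Databricks Connect.
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])
rows = df.where(F.col("value") > 1).select("key").collect()
result = [row["key"] for row in rows]
```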
Set Hadoop configurations
On the client, you can set Hadoop configurations using the spark.conf.set API, which applies to SQL and DataFrame operations. Hadoop configurations set on the sparkContext must instead be set in the cluster configuration or using a notebook, because configurations set on the sparkContext are not tied to user sessions but apply to the entire cluster.
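As a configuration sketch of the distinction above: a session-scoped setting goes through spark.conf.set on the client, while a cluster-wide Hadoop setting belongs in the cluster configuration. The storage account name and key below are placeholders.

```python
# Session-scoped: applies to SQL and DataFrame operations in this session
# and can be set from Databricks Connect client code. The account name and
# key are placeholder values.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-access-key>",
)

# Cluster-wide: a Hadoop configuration set on the sparkContext, such as the
# commented line below, is NOT available through Databricks Connect. Set it
# in the cluster's Spark configuration or from a notebook instead.
# spark.sparkContext.hadoopConfiguration().set(...)  # not supported here
```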