Install Databricks Connect for Scala

Note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

This article describes how to install Databricks Connect for Scala. For background, see What is Databricks Connect? For the Python version of this article, see Install Databricks Connect for Python.

Requirements

  • Your target Azure Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.
  • The Java Development Kit (JDK) installed on your development machine. Databricks recommends that your JDK version match the JDK version on your Azure Databricks cluster. To find the JDK version on your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. For instance, Zulu 8.70.0.23-CA-linux64 corresponds to JDK 8. See Databricks Runtime release notes versions and compatibility.
  • Scala installed on your development machine. Databricks recommends that your Scala version match the Scala version on your Azure Databricks cluster. To find the Scala version on your cluster, refer to the “System environment” section of the Databricks Runtime release notes for your cluster. See Databricks Runtime release notes versions and compatibility.
  • A Scala build tool on your development machine, such as sbt.

Set up the client

After you meet the requirements for Databricks Connect, complete the following steps to set up the Databricks Connect client.

Step 1: Add a reference to the Databricks Connect client

  1. In your Scala project’s build file, such as build.sbt for sbt, pom.xml for Maven, or build.gradle for Gradle, add the following reference to the Databricks Connect client:

    sbt

    libraryDependencies += "com.databricks" % "databricks-connect" % "14.0.0"
    

    Maven

    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>databricks-connect</artifactId>
      <version>14.0.0</version>
    </dependency>
    

    Gradle

    implementation 'com.databricks:databricks-connect:14.0.0'
    
  2. Replace 14.0.0 with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven central repository.

Step 2: Configure connection properties

In this section, you configure properties to establish a connection between Databricks Connect and your remote Azure Databricks cluster. These properties include settings to authenticate Databricks Connect with your cluster.

Databricks Connect for Scala on Databricks Runtime 13.3 LTS and above includes the Databricks SDK for Java. This SDK implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach makes setting up and automating authentication with Azure Databricks more centralized and predictable. It enables you to configure Azure Databricks authentication once and then use that configuration across multiple Azure Databricks tools and SDKs without further authentication configuration changes.

  1. Collect the following configuration properties.

  2. Configure the connection within your code. Databricks Connect searches for configuration properties in the following order and stops at the first option that provides them. The details for each option appear after the following table:

    Configuration properties option                             Applies to
    ----------------------------------------------------------  ----------------------------------------------------------
    1. The DatabricksSession class’s remote() method            Azure Databricks personal access token authentication only
    2. An Azure Databricks configuration profile                All Azure Databricks authentication types
    3. The SPARK_REMOTE environment variable                    Azure Databricks personal access token authentication only
    4. The DATABRICKS_CONFIG_PROFILE environment variable       All Azure Databricks authentication types
    5. An environment variable for each configuration property  All Azure Databricks authentication types
    6. An Azure Databricks configuration profile named DEFAULT  All Azure Databricks authentication types

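
The first-match-wins search described above can be pictured with a short sketch. This is not the Databricks Connect implementation: ConfigSource, firstConfigured, and record are hypothetical names, and each lookup stands in for one row of the table.

```scala
// Illustrative only: these names are hypothetical and are not part of
// Databricks Connect or the Databricks SDK for Java.
object PrecedenceSketch {
  final case class ConfigSource(name: String, lookup: () => Option[String])

  // Evaluates sources in order and returns the first one that yields a value.
  // The iterator is lazy, so sources after the first match are never consulted.
  def firstConfigured(sources: Seq[ConfigSource]): Option[(String, String)] =
    sources.iterator
      .map(source => source.lookup().map(value => (source.name, value)))
      .collectFirst { case Some(hit) => hit }

  // Records which sources were actually consulted, to make the laziness visible.
  var consulted: List[String] = List.empty

  def record(name: String, value: Option[String]): ConfigSource =
    ConfigSource(name, () => { consulted = consulted :+ name; value })

  // Stand-in lookups: the first source has no value, the next two do.
  val resolved: Option[(String, String)] = firstConfigured(Seq(
    record("remote() method", None),
    record("configuration profile", Some("profile-value")),
    record("SPARK_REMOTE", Some("env-value"))
  ))
}
```

In this sketch the third source is never consulted, because the second one already supplied a value. The same reasoning explains why a setting made in code takes effect even when environment variables are also set.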
    1. The DatabricksSession class’s remote() method

      For this option, which applies to Azure Databricks personal access token authentication only, specify the workspace instance name, the Azure Databricks personal access token, and the ID of the cluster.

      You can initialize the DatabricksSession class in several ways, as follows:

      • Set the host, token, and clusterId fields in DatabricksSession.builder.
      • Use the Databricks SDK’s Config class.
      • Specify a Databricks configuration profile along with the clusterId field.

      Databricks does not recommend that you directly specify these connection properties in your code. Instead, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide your own implementation of the retrieve* functions to get the necessary properties from the user or from some other configuration store, such as Azure Key Vault.

      The code for each of these approaches is as follows:

      // Set the host, token, and clusterId fields in DatabricksSession.builder.
      // If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
      // cluster's ID, you do not also need to set the clusterId field here.
      import com.databricks.connect.DatabricksSession
      
      val spark = DatabricksSession.builder()
        .host(retrieveWorkspaceInstanceName())
        .token(retrieveToken())
        .clusterId(retrieveClusterId())
        .getOrCreate()
      
      // Use the Databricks SDK's Config class.
      // If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
      // cluster's ID, you do not also need to set the clusterId field here.
      import com.databricks.connect.DatabricksSession
      import com.databricks.sdk.core.DatabricksConfig
      
      val config = new DatabricksConfig()
        .setHost(retrieveWorkspaceInstanceName())
        .setToken(retrieveToken())
      val spark = DatabricksSession.builder()
        .sdkConfig(config)
        .clusterId(retrieveClusterId())
        .getOrCreate()
      
      // Specify a Databricks configuration profile along with the clusterId field.
      // If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
      // cluster's ID, you do not also need to set the clusterId field here.
      import com.databricks.connect.DatabricksSession
      import com.databricks.sdk.core.DatabricksConfig
      
      val config = new DatabricksConfig()
        .setProfile("<profile-name>")
      val spark = DatabricksSession.builder()
        .sdkConfig(config)
        .clusterId(retrieveClusterId())
        .getOrCreate()
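
The retrieve* functions used above are left for you to implement. As one possible sketch, the following object reads each value from an environment variable. The object name, the required helper, and the MYAPP_* variable names are all assumptions made for this example; they are not defined by Databricks Connect.

```scala
// Hypothetical implementations of the retrieve* placeholders used above.
// The MYAPP_* variable names are assumptions made for this sketch.
object ConnectionProperties {
  // Returns the trimmed value of `name`, or fails with a message naming the setting.
  def required(name: String, env: Map[String, String] = sys.env): String =
    env.get(name).map(_.trim).filter(_.nonEmpty).getOrElse(
      throw new IllegalStateException(s"Required setting $name is missing or empty"))

  def retrieveWorkspaceInstanceName(): String = required("MYAPP_DATABRICKS_HOST")
  def retrieveToken(): String = required("MYAPP_DATABRICKS_TOKEN")
  def retrieveClusterId(): String = required("MYAPP_DATABRICKS_CLUSTER_ID")
}
```

After `import ConnectionProperties._`, the calls in the preceding snippets resolve to these helpers. To read from a secret store instead, only the body of required would need to change.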
      
    2. An Azure Databricks configuration profile

      For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

      The required configuration profile fields for each authentication type are as follows:

      Then set the name of this configuration profile through the DatabricksConfig class.

      You can specify cluster_id in a few ways, as follows:

      • Include the cluster_id field in your configuration profile, and then just specify the configuration profile’s name.
      • Specify the configuration profile name along with the clusterId field.

      If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify the cluster_id or clusterId fields.

      The code for each of these approaches is as follows:

      // Include the cluster_id field in your configuration profile, and then
      // just specify the configuration profile's name:
      import com.databricks.connect.DatabricksSession
      import com.databricks.sdk.core.DatabricksConfig
      
      val config = new DatabricksConfig()
        .setProfile("<profile-name>")
      val spark = DatabricksSession.builder()
        .sdkConfig(config)
        .getOrCreate()
      
      // Specify the configuration profile name along with the clusterId field.
      // In this example, retrieveClusterId() assumes some custom implementation that
      // you provide to get the cluster ID from the user or from some other
      // configuration store:
      import com.databricks.connect.DatabricksSession
      import com.databricks.sdk.core.DatabricksConfig
      
      val config = new DatabricksConfig()
        .setProfile("<profile-name>")
      val spark = DatabricksSession.builder()
        .sdkConfig(config)
        .clusterId(retrieveClusterId())
        .getOrCreate()
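
To make the shape of a configuration profile concrete, the following sketch embeds a sample file and checks whether a profile carries cluster_id. It assumes personal access token authentication, where the profile fields are host and token, and it uses made-up values. ProfileSketch, parse, and hasClusterId are hypothetical helpers written for this example. The parser is deliberately simplified and is not the one the Databricks SDK for Java uses.

```scala
// Hypothetical helper for inspecting profile text; not part of any Databricks library.
object ProfileSketch {
  private val Header = """\[([^\]]+)\]""".r
  private val Field = """([^=]+)=(.*)""".r

  // Sample profile file contents with made-up placeholder values.
  val sample: String =
    """[DEFAULT]
      |host  = https://adb-1234567890123456.7.azuredatabricks.net
      |token = <access-token-value>
      |
      |[dev]
      |host       = https://adb-1234567890123456.7.azuredatabricks.net
      |token      = <access-token-value>
      |cluster_id = 0123-456789-abcdefgh
      |""".stripMargin

  // Parses INI-style text into profile name -> (field name -> value).
  def parse(contents: String): Map[String, Map[String, String]] = {
    val start = (Option.empty[String], Map.empty[String, Map[String, String]])
    val (_, profiles) = contents.split("\\r?\\n").map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#") && !line.startsWith(";"))
      .foldLeft(start) {
        case ((_, acc), Header(name)) =>
          val key = name.trim
          (Some(key), acc.updated(key, acc.getOrElse(key, Map.empty[String, String])))
        case ((Some(current), acc), Field(key, value)) =>
          (Some(current), acc.updated(current, acc(current) + (key.trim -> value.trim)))
        case (state, _) => state // ignore anything that is not a header or a field
      }
    profiles
  }

  // True when the named profile exists and defines cluster_id.
  def hasClusterId(contents: String, profile: String): Boolean =
    parse(contents).get(profile).exists(_.contains("cluster_id"))
}
```

Here the dev profile could be used with setProfile alone, while DEFAULT would also need the clusterId field or the DATABRICKS_CLUSTER_ID environment variable.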
      
    3. The SPARK_REMOTE environment variable

      For this option, which applies to Azure Databricks personal access token authentication only, set the SPARK_REMOTE environment variable to the following string, replacing the placeholders with the appropriate values.

      sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>
      

      Then initialize the DatabricksSession class as follows:

      import com.databricks.connect.DatabricksSession
      
      val spark = DatabricksSession.builder().getOrCreate()
      

      To set environment variables, see your operating system’s documentation.
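
If you generate the SPARK_REMOTE value from a script or a test harness, a small helper can assemble it from its three parts. The following is a minimal sketch that follows the format shown above. SparkRemoteSketch and sparkRemote are hypothetical names, and the example values are made-up placeholders.

```scala
// Hypothetical helper; it only formats the string and does not open a connection.
object SparkRemoteSketch {
  // Builds the connection string in the format shown above.
  def sparkRemote(workspaceInstanceName: String, accessToken: String, clusterId: String): String =
    s"sc://$workspaceInstanceName:443/;token=$accessToken;x-databricks-cluster-id=$clusterId"

  // Made-up placeholder values, for illustration only.
  val example: String = sparkRemote(
    "adb-1234567890123456.7.azuredatabricks.net",
    "<access-token-value>",
    "0123-456789-abcdefgh")
}
```

Because the resulting string contains the access token, treat it like any other secret and avoid writing it to logs.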

    4. The DATABRICKS_CONFIG_PROFILE environment variable

      For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

      If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

      The required configuration profile fields for each authentication type are as follows:

      Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:

      import com.databricks.connect.DatabricksSession
      
      val spark = DatabricksSession.builder().getOrCreate()
      

      To set environment variables, see your operating system’s documentation.

    5. An environment variable for each configuration property

      For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the supported Databricks authentication type that you want to use.

      The required environment variables for each authentication type are as follows:

      Then initialize the DatabricksSession class as follows:

      import com.databricks.connect.DatabricksSession
      
      val spark = DatabricksSession.builder().getOrCreate()
      

      To set environment variables, see your operating system’s documentation.
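
Before creating the session, it can help to confirm that the expected variables are present. The following sketch assumes personal access token authentication, which uses DATABRICKS_HOST and DATABRICKS_TOKEN in addition to DATABRICKS_CLUSTER_ID. Other authentication types need a different list. EnvCheckSketch and missingVariables are hypothetical names introduced for this example.

```scala
// Hypothetical pre-flight check; not part of Databricks Connect.
object EnvCheckSketch {
  // Variables for personal access token authentication, plus the cluster ID.
  val patVariables: Seq[String] =
    Seq("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")

  // Returns the names in `required` that are unset or blank in `env`.
  def missingVariables(required: Seq[String], env: Map[String, String] = sys.env): Seq[String] =
    required.filterNot(name => env.get(name).exists(_.trim.nonEmpty))
}
```

Calling `EnvCheckSketch.missingVariables(EnvCheckSketch.patVariables)` at startup returns an empty sequence when everything is set, and otherwise names the variables to fix.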

    6. An Azure Databricks configuration profile named DEFAULT

      For this option, create or identify an Azure Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the supported Databricks authentication type that you want to use.

      If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.

      The required configuration profile fields for each authentication type are as follows:

      Name this configuration profile DEFAULT.

      Then initialize the DatabricksSession class as follows:

      import com.databricks.connect.DatabricksSession
      
      val spark = DatabricksSession.builder().getOrCreate()