Databricks Connect for R

11/15/2024

Note

This article covers sparklyr integration with Databricks Connect for Databricks Runtime 13.0 and above. This integration is neither provided by Databricks nor directly supported by Databricks.

For questions, go to the Posit Community.

To report issues, go to the Issues section of the sparklyr repository in GitHub.

For more information, see Databricks Connect v2 in the sparklyr documentation.

This article demonstrates how to quickly get started with Databricks Connect by using R, sparklyr, and RStudio Desktop.

For the Python version of this article, see Databricks Connect for Python.
For the Scala version of this article, see Databricks Connect for Scala.

Databricks Connect enables you to connect popular IDEs such as RStudio Desktop, notebook servers, and other custom applications to Azure Databricks clusters. See What is Databricks Connect?.

Tutorial

This tutorial uses R, RStudio Desktop, and Python 3.10. If you don’t already have them installed, install R and RStudio Desktop, and then install Python 3.10.

For supplemental information about this tutorial, see the “Databricks Connect” section of Spark Connect, and Databricks Connect v2 on the sparklyr website.

Requirements

To complete this tutorial, you must meet the following requirements:

Your target Azure Databricks workspace and cluster must meet the requirements for Compute configuration for Databricks Connect.
You must have your cluster ID available. To get your cluster ID, in your workspace, click Compute on the sidebar, and then click your cluster’s name. In your web browser’s address bar, copy the string of characters between clusters and configuration in the URL.
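
Copying the cluster ID by hand is easy to get wrong, so as an illustration, the following base R sketch pulls it out of a pasted URL. The extract_cluster_id helper and the example URL are hypothetical and are not part of sparklyr or pysparklyr. The sketch assumes only that the ID sits between clusters/ and /configuration in the URL, as described above.

```r
# Hypothetical helper (not part of any package): returns the text between
# "clusters/" and "/configuration" in a cluster page URL.
extract_cluster_id <- function(url) {
  match <- regmatches(url, regexec("clusters/([^/?#]+)/configuration", url))[[1]]
  if (length(match) < 2) {
    stop("Could not find a cluster ID in the URL.", call. = FALSE)
  }
  match[2]  # Element 1 is the full match; element 2 is the captured ID.
}

# Made-up URL for illustration only.
example_url <- paste0(
  "https://adb-1234567890123456.7.azuredatabricks.net/",
  "compute/clusters/0123-456789-abcdefgh/configuration"
)
extract_cluster_id(example_url)
```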

Step 1: Create a personal access token

Note

Databricks Connect for R authentication currently only supports Azure Databricks personal access tokens.

This tutorial uses Azure Databricks personal access token authentication for authenticating with your Azure Databricks workspace.

If you already have an Azure Databricks personal access token, skip to Step 2. If you are not sure whether you already have an Azure Databricks personal access token, you can follow this step without affecting any other Azure Databricks personal access tokens in your user account.

To create a personal access token, follow the steps in Azure Databricks personal access tokens for workspace users.

Step 2: Create the project

Start RStudio Desktop.
On the main menu, click File > New Project.
Select New Directory.
Select New Project.
For Directory name and Create project as subdirectory of, enter the new project directory’s name and where to create this new project directory.
Select Use renv with this project. If prompted to install an updated version of the renv package, click Yes.
Click Create Project.

Step 3: Add the Databricks Connect package and other dependencies

On the RStudio Desktop main menu, click Tools > Install Packages.
Leave Install from set to Repository (CRAN).
For Packages, enter the following list of packages that are prerequisites for the Databricks Connect package and this tutorial:
```
sparklyr,pysparklyr,reticulate,usethis,dplyr,dbplyr
```
Leave Install to Library set to your R virtual environment.
Make sure that Install dependencies is selected.
Click Install.

When you are prompted in the Console view (View > Move Focus to Console) to proceed with the installation, enter Y. The sparklyr and pysparklyr packages and their dependencies are installed in your R virtual environment.

In the Console pane, use reticulate to install Python by running the following command. (Databricks Connect for R requires reticulate and Python to be installed first.) In the following command, replace 3.10 with the major and minor version of the Python version that is installed on your Azure Databricks cluster. To find this major and minor version, see the “System environment” section of the release notes for your cluster’s Databricks Runtime version in Databricks Runtime release notes versions and compatibility.
```
reticulate::install_python(version = "3.10")
```
In the Console pane, install the Databricks Connect package by running the following command. In the following command, replace 13.3 with the Databricks Runtime version that is installed on your Azure Databricks cluster. To find this version, on your cluster’s details page in your Azure Databricks workspace, on the Configuration tab, see the Databricks Runtime Version box.
```
pysparklyr::install_databricks(version = "13.3")
```
If you do not know the Databricks Runtime version for your cluster or you do not want to look it up, you can run the following command instead, and pysparklyr will query the cluster to determine the correct Databricks Runtime version to use:
```
pysparklyr::install_databricks(cluster_id = "<cluster-id>")
```
If you later want your project to connect to a different cluster that has the same Databricks Runtime version as the one that you just specified, pysparklyr will use the same Python environment. If the new cluster has a different Databricks Runtime version, run the pysparklyr::install_databricks command again with the new Databricks Runtime version or cluster ID.
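
The install_python call above expects only the major and minor parts of the Python version, while release notes may list a full version such as 3.10.12. As a minimal sketch, the following base R helper trims a full version string down to that form. The function name python_major_minor is hypothetical, and the version strings are examples only.

```r
# Hypothetical helper: keep only the major and minor parts of a version string.
python_major_minor <- function(version) {
  parts <- strsplit(version, ".", fixed = TRUE)[[1]]
  if (length(parts) < 2) {
    stop("Expected a version such as '3.10' or '3.10.12'.", call. = FALSE)
  }
  paste(parts[1:2], collapse = ".")
}

# Example input only; use the version listed for your cluster's runtime.
python_major_minor("3.10.12")
```

The result could then be passed as the version argument of reticulate::install_python.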

Step 4: Set environment variables for the workspace URL, access token, and cluster ID

Databricks recommends that you do not hard-code sensitive or changing values such as your Azure Databricks workspace URL, Azure Databricks personal access token, or Azure Databricks cluster ID into your R scripts. Instead, store these values separately, for example in local environment variables. This tutorial uses RStudio Desktop’s built-in support for storing environment variables in a .Renviron file.

Create an .Renviron file to store the environment variables, if this file does not already exist, and then open it for editing. To do this, in the RStudio Desktop Console, run the following command:
```
usethis::edit_r_environ()
```

In the .Renviron file that appears (View > Move Focus to Source), enter the following content. In this content, replace the following placeholders:
- Replace <workspace-url> with your per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.
- Replace <personal-access-token> with your Azure Databricks personal access token from Step 1.
- Replace <cluster-id> with your cluster ID from this tutorial’s requirements.
```
DATABRICKS_HOST=<workspace-url>
DATABRICKS_TOKEN=<personal-access-token>
DATABRICKS_CLUSTER_ID=<cluster-id>
```
Save the .Renviron file.
Load the environment variables into R: on the main menu, click Session > Restart R.
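
After R restarts, it can help to confirm that the three variables were actually picked up before you try to connect. The following is a minimal sketch that uses only base R. The missing_env_vars helper is hypothetical, and it reports variable names only, so the token value is never printed.

```r
# Hypothetical helper: return the names of environment variables that are
# unset or empty.
missing_env_vars <- function(var_names) {
  values <- Sys.getenv(var_names, unset = "")
  var_names[!nzchar(values)]
}

required <- c("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")
not_set <- missing_env_vars(required)

if (length(not_set) > 0) {
  message("Not set: ", paste(not_set, collapse = ", "))
} else {
  message("All Databricks Connect environment variables are set.")
}
```

If any names are reported as not set, check the .Renviron file for typos and restart R again.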


Step 5: Add code

On the RStudio Desktop main menu, click File > New File > R Script.

Enter the following code into the file and then save the file (File > Save) as demo.R:

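```
# sparklyr provides spark_connect(); dplyr and dbplyr provide the table interface.
library(sparklyr)
library(dplyr)
library(dbplyr)

# Read the connection settings from the environment variables set in Step 4.
sc <- sparklyr::spark_connect(
  master     = Sys.getenv("DATABRICKS_HOST"),
  cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID"),
  token      = Sys.getenv("DATABRICKS_TOKEN"),
  method     = "databricks_connect",
  envname    = "r-reticulate"
)

# Reference the trips table in the nyctaxi schema of the samples catalog.
trips <- dplyr::tbl(
  sc,
  dbplyr::in_catalog("samples", "nyctaxi", "trips")
)

# Print the first five rows.
print(trips, n = 5)
```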

Step 6: Run the code

In RStudio Desktop, in the toolbar for the demo.R file, click Source.
In the Console, the first five rows of the trips table appear.
In the Connections view (View > Show Connections), you can explore available catalogs, schemas, tables, and views.

Step 7: Debug the code

In the demo.R file, click the gutter next to print(trips, n = 5) to set a breakpoint.
In the toolbar for the demo.R file, click Source.
When the code pauses at the breakpoint, you can inspect variables in the Environment view (View > Show Environment).
On the main menu, click Debug > Continue.
In the Console, the first five rows of the trips table appear.
