

Use dbx to sync local files with remote workspaces in real time

Important

This documentation has been retired and might not be updated.

Databricks recommends that instead of dbx sync, you use Databricks CLI version 0.205 or above, which includes functionality similar to dbx sync through the databricks sync command.

The Databricks extension for Visual Studio Code also includes functionality similar to dbx sync integrated into the Visual Studio Code IDE. Note that dbx sync can synchronize file changes from a local development machine to DBFS, workspace locations, and Databricks Git folders in your Azure Databricks workspaces. The Databricks extension for Visual Studio Code supports synchronizing file changes only to workspace user (/Users) files and Databricks Git folders (/Repos).

Note

This article covers dbx by Databricks Labs, which is provided as-is and is not supported by Databricks through customer technical support channels. Questions and feature requests can be communicated through the Issues page of the databrickslabs/dbx repo on GitHub.

You can perform real-time synchronization of changes to files on your local development machine with their corresponding files in your Azure Databricks workspaces by using dbx by Databricks Labs. These workspace files can be in DBFS or in Databricks Git folders.

Real-time file synchronization with dbx (also known as dbx sync) is useful in rapid code development scenarios. For example, you can use a local integrated development environment (IDE) for productivity features such as syntax highlighting, smart code completion, code linting, and testing and debugging. You can then go immediately to your workspace and run your updated code.

You can use dbx sync by itself, with automated jobs, or with an IDE.

dbx sync development workflows

There are two development workflows for dbx sync, one with DBFS and another with Databricks Git folders.

The typical development workflow with dbx sync and DBFS is:

  1. Identify a local directory that contains the files you want to synchronize to DBFS.
  2. Identify the path in DBFS that you want your local directory to synchronize with (or let dbx sync create a default DBFS path for you).
  3. Run dbx sync dbfs to synchronize your local directory to the DBFS path. dbx sync begins watching your local directory for any file changes.
  4. Make changes to files in your local directory as needed. dbx sync applies those changes to the corresponding files in the DBFS path in real time.

The typical development workflow with dbx sync and Databricks Git folders is:

  1. Create a repository with a Git provider that Databricks Git folders supports, if you do not have a repository available already.
  2. Clone your repo into your Azure Databricks workspace.
  3. Clone your repo into your local development machine.
  4. Run dbx sync repo to associate your local cloned repo with your workspace cloned repo. dbx sync begins watching your local directory for any file changes.
  5. Make changes to files in your local cloned repo as needed. dbx sync applies those changes to the corresponding files in Databricks Git folders in real time.
  6. Periodically push updated files from the cloned repo in your workspace to your Git provider, so that the repo stays up to date with your Git provider.

Important

dbx sync only performs one-way, real-time synchronization of file changes from your local development machine to your remote workspace. Therefore, Databricks does not recommend that you initiate changes in your Azure Databricks workspace to files that are monitored by dbx sync. If you must make such workspace-initiated file changes, then you must also do the following:

  • For file changes in DBFS, make the corresponding changes to the local files manually.
  • For file changes in Databricks Git folders, push the file changes from your workspace to your Git provider. Then, on your local development machine, pull those file changes from your Git provider.
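The push-then-pull round trip for Databricks Git folders can be sketched with plain Git commands. This is a minimal illustration and not part of dbx: it assumes Git is installed, and it simulates both the Git provider (a bare repository) and the workspace clone with temporary local directories. The file name notebook.py and the commit identity are hypothetical.

```shell
# Simulate a Git provider with a bare repository in a temporary directory.
work="$(mktemp -d)"
git init --quiet --bare "$work/provider.git"

# Two clones: one stands in for the workspace Git folder, one for the local machine.
git clone --quiet "$work/provider.git" "$work/workspace"
git clone --quiet "$work/provider.git" "$work/local"

# A change is initiated on the workspace side and committed there.
echo 'print("edited in workspace")' > "$work/workspace/notebook.py"
git -C "$work/workspace" add notebook.py
git -C "$work/workspace" -c user.name="Example" -c user.email="example@example.com" \
  commit --quiet -m "Edit made in the workspace"

# Push the change from the workspace clone to the Git provider.
branch="$(git -C "$work/workspace" symbolic-ref --short HEAD)"
git -C "$work/workspace" push --quiet origin "$branch"

# On the local development machine, pull the change from the Git provider.
git -C "$work/local" pull --quiet origin "$branch"
```

After the pull, the local clone contains the workspace-initiated change, so later local edits start from the same file contents that the workspace has.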

Requirements

If you want to use dbx sync with Databricks Git folders, the following is suggested, though not required, in your Azure Databricks workspace:

  • A clone of your repository with your Git provider.

On your local development machine, you must have the following installed:

  • Python version 3.8 or above. To check whether Python is installed, and to check your installed Python version, run python --version in your terminal or PowerShell.

    python --version
    

    Note

    Some installations of python may require you to use python3 instead of python. If so, substitute python with python3 throughout this article.

  • pip. To check whether pip is installed, and to check your installed pip version, run pip --version or python -m pip --version.

    pip --version
    
    # Or...
    
    python -m pip --version
    

    Note

    Some installations of pip may require you to use pip3 instead of pip. If so, substitute pip with pip3 throughout this article.

  • dbx version 0.8.0 or above. To check whether dbx is installed, and to check your installed dbx version, run dbx --version. To install dbx from the Python Package Index (PyPI), run pip install dbx or python -m pip install dbx. (dbx includes dbx sync.)

    # Check whether dbx is installed, and check its version.
    dbx --version
    
    # Install dbx.
    pip install dbx
    
    # Or...
    python -m pip install dbx
    

    Note

    For more information about dbx, see dbx by Databricks Labs and the dbx documentation.

  • The Databricks CLI version 0.18 or below, set up with authentication. The legacy Databricks CLI (Databricks CLI version 0.17) is automatically installed when you install dbx. This authentication can be set up on your local development machine in one or both of the following locations:

    • Within the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (starting with legacy Databricks CLI version 0.8.0).
    • In an Azure Databricks configuration profile within your .databrickscfg file.

    dbx looks for authentication credentials in these two locations, in the order listed, and uses only the first set of matching credentials that it finds.

    Note

    If you use a .databrickscfg file, dbx sync looks in this file for a configuration profile named DEFAULT. To specify a different profile, use the --profile option when you run the dbx sync command, as described later in this article.

    dbx does not support the use of a .netrc file for authentication.

  • If you want to use dbx sync with Databricks Git folders, a local clone of your repository with your Git provider, while not required, is suggested. To perform a local clone, consult your Git provider’s documentation.
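As a sketch of the two authentication locations listed above, the following sets the environment variables and then writes an equivalent DEFAULT configuration profile. The host URL and token values are hypothetical placeholders. The profile is written to a temporary file here (an assumption made so that an existing configuration is not overwritten); in practice the file is .databrickscfg in your home directory.

```shell
# Option 1: environment variables.
# Hypothetical placeholder values; replace them with your workspace URL and token.
export DATABRICKS_HOST="https://adb-1234567890123456.7.azuredatabricks.net"
export DATABRICKS_TOKEN="example-token-value"

# Option 2: a configuration profile named DEFAULT.
# Written to a temporary file so that a real ~/.databrickscfg is left untouched.
cfg_file="$(mktemp)"
cat > "$cfg_file" <<EOF
[DEFAULT]
host = $DATABRICKS_HOST
token = $DATABRICKS_TOKEN
EOF

# Show the resulting profile.
cat "$cfg_file"
```

Only one of the two options is needed, because dbx uses the first set of matching credentials that it finds.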

Use DBFS with dbx sync

  1. From the terminal or PowerShell on your local development machine, change to the directory that contains the files you want to synchronize to DBFS in your Azure Databricks workspace.

  2. Run the dbx sync command to synchronize your local directory to DBFS in your workspace, as follows. (Do not forget the dot (.) at the end, which represents your current directory.)

    dbx sync dbfs --source .
    

    Tip

    To specify a different source directory, replace the dot (.) with a different path.

    Note

    If the error Error: No such command 'sync' appears, your installation of dbx is likely out of date. To fix this, run pip install --upgrade dbx==<version> or python -m pip install --upgrade dbx==<version>, where <version> is the latest version of dbx. This version number can be found on the PyPI webpage for dbx.

    pip install --upgrade dbx==<version>
    
    # Or...
    python -m pip install --upgrade dbx==<version>
    
  3. dbx sync begins synchronizing files in your current local directory with files in the following DBFS path in your workspace. dbx sync confirms this by printing Target base path followed by the DBFS path, for example:

    /tmp/users/<your-Databricks-username>/<local-directory-name>
    

    Tip

    To specify a different username or DBFS path, specify the --user and --dest options, respectively, when you run dbx sync.

  4. Make changes to your local files, as needed.

    Important

    You must keep your terminal or PowerShell open for dbx sync to continue synchronizing. If you close your terminal or PowerShell, dbx sync stops watching for file changes and stops synchronizing. To resume file change synchronization, repeat this procedure from the beginning.

  5. As needed, verify your file changes in the preceding path in DBFS in your workspace.
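To anticipate where your files land before you run the command, the default target path pattern from step 3 can be reproduced with a small helper. This is an illustrative sketch only: default_dbfs_target is a hypothetical function, not part of dbx, and it assumes the last component of the path is the local directory name taken from the source directory. The username and directory are made-up values.

```shell
# Hypothetical helper (not part of dbx): builds the default target path
# pattern /tmp/users/<username>/<local-directory-name>.
# $1 is the Databricks username, $2 is the local source directory.
default_dbfs_target() {
  printf '/tmp/users/%s/%s\n' "$1" "$(basename "$2")"
}

# Example values are made up for illustration.
default_dbfs_target "someone@example.com" "/home/someone/my-project"
```

Compare the printed value with the Target base path that dbx sync reports. If they differ, trust the dbx sync output.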

Use Databricks Git folders with dbx sync

  1. From the terminal or PowerShell on your local development machine, change to the root directory that contains the clone of the repository with your Git provider.

  2. In your Azure Databricks workspace, identify the name of the Databricks Git folder that you want to synchronize your local cloned repo to. You can find this repo name by clicking Git folders in your workspace’s sidebar.

  3. On your local development machine, run the dbx sync command to synchronize your local cloned repository to the Databricks Git folder in your workspace as follows, replacing <your-repo-name> with the name of your repo in Databricks Git folders. (Do not forget the dot (.) at the end, which represents your current directory.)

    dbx sync repo -d <your-repo-name> --source .
    

    Tip

    To specify a different source directory, replace the dot (.) with a different path.

    Note

    If the error Error: No such command 'sync' appears, your installation of dbx is likely out of date. To fix this, run pip install --upgrade dbx==<version> or python -m pip install --upgrade dbx==<version>, where <version> is the latest version of dbx. This version number can be found on the PyPI webpage for dbx.

    pip install --upgrade dbx==<version>
    
    # Or...
    python -m pip install --upgrade dbx==<version>
    
  4. dbx sync begins synchronizing files in your local cloned repository with files in Databricks Git folders in your workspace. dbx sync confirms this by printing Target base path followed by the Databricks Git folders path, for example:

    /Repos/<your-Databricks-username>/<your-repo-name>
    

    Tip

    To specify a different username or repo name, specify the --user and --dest-repo options, respectively, when you run dbx sync.

  5. Make changes to your local files, as needed.

    Important

    You must keep your terminal or PowerShell open for dbx sync to continue synchronizing. If you close your terminal or PowerShell, dbx sync stops watching for file changes and stops synchronizing. To resume file change synchronization, repeat this procedure from the beginning.

  6. As needed, verify your file changes in Databricks Git folders in your workspace.
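The options mentioned in this procedure (--dest-repo, --user, --source, and --profile) can be combined in a single invocation. The following sketch assembles such a command and the target path it is expected to report. It prints the command rather than running it, so it works without dbx installed. The repo name, username, and profile name are hypothetical, and the expected path simply follows the /Repos/<your-Databricks-username>/<your-repo-name> pattern shown in step 4.

```shell
# Hypothetical values for illustration only.
user_name="someone@example.com"
repo_name="my-repo"
profile_name="DEV"

# Assemble the command as a string (dry run); it is printed, not executed.
sync_cmd="dbx sync repo --dest-repo $repo_name --user $user_name --source . --profile $profile_name"

# Target path that dbx sync is expected to print, based on the pattern in step 4.
expected_target="/Repos/$user_name/$repo_name"

echo "Command:              $sync_cmd"
echo "Expected target path: $expected_target"
```

To run the command for real, copy the printed line into your terminal from the root directory of your local cloned repository.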
