Quickstart: Deploy a Managed Apache Spark Cluster with Azure Databricks

Azure Managed Instance for Apache Cassandra provides automated deployment and scaling operations for managed open-source Apache Cassandra datacenters. This feature accelerates hybrid scenarios and reduces ongoing maintenance.

This quickstart demonstrates how to use the Azure portal to create a fully managed Apache Spark cluster inside the Azure Virtual Network of your Azure Managed Instance for Apache Cassandra cluster. You create the Spark cluster in Azure Databricks. Later, you can create or attach notebooks to the cluster, read data from different data sources, and analyze that data to gain insights.

For more detailed instructions, see Deploying Azure Databricks in your Azure Virtual Network (Virtual Network Injection).

Prerequisites

If you don't have an Azure subscription, create a free account before you begin.

Create an Azure Databricks cluster

Follow these steps to create an Azure Databricks cluster in the Virtual Network that contains your Azure Managed Instance for Apache Cassandra cluster:

  1. Sign in to the Azure portal.

  2. In the left navigation pane, locate Resource groups. Navigate to the resource group that contains the Virtual Network where your managed instance is deployed.

  3. Open the Virtual Network resource, and make a note of the Address space:

    Screenshot shows where to get the address space of your Virtual Network.

  4. From the resource group, select Add and search for Azure Databricks in the search field:

    Screenshot shows a search for Azure Databricks.

  5. Select Create to create an Azure Databricks account:

    Screenshot shows Azure Databricks offering with the Create button selected.

  6. Enter the following values:

    • Workspace name Provide a name for your Databricks workspace.
    • Region Make sure to select the same region as your Virtual Network.
    • Pricing Tier Choose between Standard, Premium, or Trial. For more information on these tiers, see the Databricks pricing page.

    Screenshot shows a dialog box where you can enter workspace name, region, and pricing tier for the Databricks account.

  7. Next, select the Networking tab, and enter the following details:

    • Deploy Azure Databricks workspace in your Virtual Network (VNet) Select Yes.
    • Virtual Network From the dropdown, choose the Virtual Network where your managed instance exists.
    • Public Subnet Name Enter a name for the public subnet.
    • Public Subnet CIDR Range Enter an IP range for the public subnet.
    • Private Subnet Name Enter a name for the private subnet.
    • Private Subnet CIDR Range Enter an IP range for the private subnet.

    To avoid range collisions, select ranges toward the upper end of the Virtual Network address space that don't overlap any existing subnet. If necessary, use a visual subnet calculator to divide the ranges:

    Screenshot shows the Visual Subnet Calculator with two highlighted identical network addresses.

    The following screenshot shows example details on the networking pane:

    Screenshot shows specified public and private subnet names.

  8. Select Review and create and then Create to deploy the workspace.

  9. After the workspace is created, select Launch Workspace.

  10. You're redirected to the Azure Databricks portal. From the portal, select New Cluster.

  11. In the New cluster pane, accept the default values for all fields except the following:

    • Cluster Name Enter a name for the cluster.
    • Databricks Runtime Version We recommend selecting Databricks runtime version 7.5 or higher, for Spark 3.x support.

    Screenshot shows the New Cluster dialog box with a Databricks Runtime Version selected.

  12. Expand Advanced Options and add the following configuration. Make sure to replace the node IPs and credentials:

    spark.cassandra.connection.host <node1 IP>,<node2 IP>,<node3 IP>
    spark.cassandra.auth.password cassandra
    spark.cassandra.connection.port 9042
    spark.cassandra.auth.username cassandra
    spark.cassandra.connection.ssl.enabled true
    
  13. Add the Apache Spark Cassandra Connector library to your cluster to connect to both native and Azure Cosmos DB Cassandra endpoints. In your cluster, select Libraries > Install New > Maven, and then enter com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0 as the Maven coordinates.

Screenshot that shows searching for Maven packages in Databricks.
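
The subnet ranges in step 7 and the Spark configuration in step 12 can be prepared before you open the portal. The following is a minimal sketch that uses only the Python standard library. The function names plan_subnets and render_spark_conf are hypothetical helpers, not part of any Azure or Databricks API. The address space 10.0.0.0/16, the existing subnet 10.0.0.0/24, the /24 subnet size, and the node IPs are illustrative assumptions, so replace them with the values from your own Virtual Network and cluster.

```python
import ipaddress


def plan_subnets(address_space, used, new_prefix=24, count=2):
    """Pick `count` free subnets, starting from the top of the address space.

    address_space: the Virtual Network address space noted in step 3.
    used: CIDR ranges of subnets that already exist in the Virtual Network.
    new_prefix: prefix length of each new subnet (/24 is an assumption here).
    """
    space = ipaddress.ip_network(address_space)
    taken = [ipaddress.ip_network(u) for u in used]
    free = []
    # Walk the candidate subnets from the highest address downward.
    for candidate in reversed(list(space.subnets(new_prefix=new_prefix))):
        if not any(candidate.overlaps(t) for t in taken):
            free.append(candidate)
            if len(free) == count:
                return free
    raise ValueError("not enough free address space for the requested subnets")


def render_spark_conf(node_ips, username, password, port=9042):
    """Build the Spark configuration lines for the Advanced Options box."""
    # ip_address() validates each entry; strip() removes stray spaces so the
    # host list is a clean comma-separated string.
    hosts = ",".join(str(ipaddress.ip_address(ip.strip())) for ip in node_ips)
    return "\n".join([
        f"spark.cassandra.connection.host {hosts}",
        f"spark.cassandra.connection.port {port}",
        f"spark.cassandra.auth.username {username}",
        f"spark.cassandra.auth.password {password}",
        "spark.cassandra.connection.ssl.enabled true",
    ])


# Illustrative values only; substitute your own.
public_subnet, private_subnet = plan_subnets("10.0.0.0/16", ["10.0.0.0/24"])
print(public_subnet)   # 10.0.255.0/24
print(private_subnet)  # 10.0.254.0/24
print(render_spark_conf(["10.0.0.4", "10.0.0.5", "10.0.0.6"], "cassandra", "cassandra"))
```

Taking subnets from the top of the address space mirrors the advice to select higher ranges, and the overlap check guards against collisions with subnets that already exist. Check the subnet size you choose against the current Azure Databricks Virtual Network requirements before you use it.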

Clean up resources

If you no longer need this managed instance cluster, delete it by following these steps:

  1. From the left-hand menu of Azure portal, select Resource groups.
  2. From the list, select the resource group you created for this quickstart.
  3. On the resource group Overview pane, select Delete resource group.
  4. In the next window, enter the name of the resource group to delete, and then select Delete.

Next steps

In this quickstart, you learned how to create a fully managed Apache Spark cluster inside the Virtual Network of your Azure Managed Instance for Apache Cassandra cluster. Next, you can learn how to manage the cluster and datacenter resources: