Submit Spark jobs by using command-line tools

Applies to: SQL Server 2019 (15.x)

This article provides guidance about how to use command-line tools to run Spark jobs on SQL Server Big Data Clusters.

Important

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.

Prerequisites

To run the examples in this article, you need the Azure Data CLI (azdata) and the curl command-line tool, with connectivity to your SQL Server Big Data Clusters endpoints.

Spark jobs that use azdata or Livy

This article provides examples of how to use command-line patterns to submit Spark applications to SQL Server Big Data Clusters.

The Azure Data CLI azdata bdc spark commands surface all capabilities of SQL Server Big Data Clusters Spark on the command line. This article focuses on job submission. But azdata bdc spark also supports interactive modes for Python, Scala, SQL, and R through the azdata bdc spark session command.

If you need direct integration with a REST API, use standard Livy calls to submit jobs. This article uses the curl command-line tool in the Livy examples to run the REST API call. For a detailed example that shows how to interact with the Spark Livy endpoint by using Python code, see Use Spark from the Livy endpoint on GitHub.

Simple ETL that uses Big Data Clusters Spark

This extract, transform, and load (ETL) application follows a common data engineering pattern. It loads tabular data from an Apache Hadoop Distributed File System (HDFS) landing-zone path, and then writes the data to an HDFS processed-zone path by using a Spark SQL table format.

Download the sample application's dataset. Then create Spark applications by using PySpark, Spark Scala, or Spark SQL.

The following sections walk through the exercise for the PySpark solution. You can run the application by using azdata or by calling the Livy endpoint with curl.

This example uses the following PySpark application. It's saved as a Python file named parquet_etl_sample.py on the local machine.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read clickstream_data from storage pool HDFS into a Spark data frame. Applies column renames.
df = spark.read.option("inferSchema", "true").csv('/securelake/landing/criteo/test.txt', sep='\t', 
    header=False).toDF("feat1","feat2","feat3","feat4","feat5","feat6","feat7","feat8",
    "feat9","feat10","feat11","feat12","feat13","catfeat1","catfeat2","catfeat3","catfeat4",
    "catfeat5","catfeat6","catfeat7","catfeat8","catfeat9","catfeat10","catfeat11","catfeat12",
    "catfeat13","catfeat14","catfeat15","catfeat16","catfeat17","catfeat18","catfeat19",
    "catfeat20","catfeat21","catfeat22","catfeat23","catfeat24","catfeat25","catfeat26")

# Print the data frame's inferred schema
df.printSchema()

tot_rows = df.count()
print("Number of rows:", tot_rows)

# Drop the managed table if it already exists
spark.sql("DROP TABLE IF EXISTS dl_clickstream")

# Write the data frame to an HDFS managed table in Parquet format
df.write.format("parquet").mode("overwrite").saveAsTable("dl_clickstream")

print("Sample ETL pipeline completed")

Copy the PySpark application to HDFS

Store the application in HDFS so the cluster can access it for execution. As a best practice, standardize and govern application locations within the cluster to streamline administration.

In this example, all ETL pipeline applications are stored under the hdfs:/apps/ETL-Pipelines path. The sample application is stored at hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py.

Run the following command to upload parquet_etl_sample.py from the local development or staging machine to the HDFS cluster.

azdata bdc hdfs cp --from-path parquet_etl_sample.py --to-path "hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py"

Run the Spark application

Use the following azdata command to submit the application to SQL Server Big Data Clusters Spark for execution.

The azdata command runs the application by using commonly specified parameters. For complete parameter options for azdata bdc spark batch create, see azdata bdc spark.

This application requires the spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation configuration parameter, so the command passes that setting through the --config option. This pattern shows how to pass configuration settings into the Spark session.

You can use the --config option to specify multiple configuration parameters. You can also set them inside the application itself, by setting the configuration on the SparkSession object; a sketch of that approach follows the command.

azdata bdc spark batch create -f hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py \
--config '{"spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation":"true"}' \
-n MyETLPipelinePySpark --executor-count 2 --executor-cores 2 --executor-memory 1664m
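
As an alternative to the --config option, the same configuration can be set inside the application itself. The following is a minimal PySpark sketch of that approach, setting the parameter on the SparkSession builder:

from pyspark.sql import SparkSession

# Set the required configuration when the session is built,
# instead of passing it on the command line with --config.
spark = (
    SparkSession.builder
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
    .getOrCreate()
)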

Warning

The "name" or "n" parameter for batch name should be unique each time a new batch is created.

Monitor Spark jobs

The azdata bdc spark batch commands provide management actions for Spark batch jobs, and the curl commands call the equivalent Livy REST endpoints. A Python sketch that polls batch state through the Livy endpoint follows the command examples.

To list all running jobs, use one of the following commands.

  • The azdata command:

    azdata bdc spark batch list -o table
    
  • The curl command, using Livy:

    curl -k -u <USER>:<PASSWORD> -X GET <LIVY_ENDPOINT>/batches
    

To get information about a Spark batch with the given ID, use one of the following commands. The batch ID is returned by azdata bdc spark batch create.

  • The azdata command:

    azdata bdc spark batch info --batch-id 0
    
  • The curl command, using Livy:

    curl -k -u <USER>:<PASSWORD> -X GET <LIVY_ENDPOINT>/batches/<BATCH_ID>
    

To get state information for a Spark batch with the given ID, use one of the following commands.

  • The azdata command:

    azdata bdc spark batch state --batch-id 0
    
  • The curl command, using Livy:

    curl -k -u <USER>:<PASSWORD> -X GET <LIVY_ENDPOINT>/batches/<BATCH_ID>/state
    

To get the logs for a Spark batch with the given ID, use one of the following commands.

  • The azdata command:

    azdata bdc spark batch log --batch-id 0
    
  • The curl command, using Livy:

    curl -k -u <USER>:<PASSWORD> -X GET <LIVY_ENDPOINT>/batches/<BATCH_ID>/log
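
You can also monitor a batch programmatically by polling the same Livy state endpoint. The following Python sketch is a minimal example that assumes the <LIVY_ENDPOINT>, <USER>, <PASSWORD>, and batch ID placeholders used above; it polls the state until the batch reaches a terminal Livy state (success, dead, or killed).

import time

import requests

# Placeholders: use the same Livy endpoint, credentials, and batch ID as above.
livy_endpoint = "<LIVY_ENDPOINT>"  # Include the scheme, for example https://...
auth = ("<USER>", "<PASSWORD>")
batch_id = 0

terminal_states = {"success", "dead", "killed"}
while True:
    # GET <LIVY_ENDPOINT>/batches/<BATCH_ID>/state returns JSON such as {"id": 0, "state": "running"}.
    response = requests.get(
        f"{livy_endpoint}/batches/{batch_id}/state",
        auth=auth,
        verify=False,  # Corresponds to the -k option in the curl examples.
    )
    state = response.json()["state"]
    print("Batch state:", state)
    if state in terminal_states:
        break
    time.sleep(10)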
    

Next steps

For information about troubleshooting Spark code, see Troubleshoot a PySpark notebook.

Comprehensive Spark sample code is available at SQL Server Big Data Clusters Spark samples on GitHub.

For more information about SQL Server Big Data Clusters and related scenarios, see SQL Server Big Data Clusters.