Submit Spark jobs on SQL Server big data cluster in Visual Studio Code

03/03/2023

Important

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.

Learn how to use Spark & Hive Tools for Visual Studio Code to create and submit PySpark scripts for Apache Spark. This article first describes how to install Spark & Hive Tools in Visual Studio Code, and then walks through how to submit jobs to Spark.

Spark & Hive Tools can be installed on platforms that are supported by Visual Studio Code, which include Windows, Linux, and macOS. Below you'll find the prerequisites for different platforms.

Prerequisites

The following items are required for completing the steps in this article:

A SQL Server big data cluster. See SQL Server Big Data Clusters.
Visual Studio Code.
Python and the Python extension on Visual Studio Code.
Mono. Mono is only required for Linux and macOS.
Set up PySpark interactive environment for Visual Studio Code.
A local directory named SQLBDCexample. This article uses C:\SQLBDC\SQLBDCexample.

Install Spark & Hive Tools

After you have completed the prerequisites, you can install Spark & Hive Tools for Visual Studio Code. Complete the following steps to install Spark & Hive Tools:

Open Visual Studio Code.
From the menu bar, navigate to View > Extensions.
In the search box, enter Spark & Hive.
Select Spark & Hive Tools, published by Microsoft, from the search results, and then select Install.
Reload when needed.

Open work folder

Complete the following steps to open a work folder, and create a file in Visual Studio Code:

From the menu bar, navigate to File > Open Folder... > C:\SQLBDC\SQLBDCexample, then select the Select Folder button. The folder appears in the Explorer view on the left.
From the Explorer view, select the folder, SQLBDCexample, and then the New File icon next to the work folder.
Name the new file with the .py (Spark script) file extension. This example uses HelloWorld.py.

Copy and paste the following code into the script file:

from operator import add
from pyspark.sql import SparkSession, Row

# Create (or reuse) the Spark session for this job.
spark = SparkSession\
   .builder\
   .appName("PythonWordCount")\
   .getOrCreate()

# Sample rows; col1 holds the text whose words are counted.
data = [Row(col1='pyspark and spark', col2=1), Row(col1='pyspark', col2=2), Row(col1='spark vs hadoop', col2=2), Row(col1='spark', col2=2), Row(col1='hadoop', col2=2)]
df = spark.createDataFrame(data)
lines = df.rdd.map(lambda r: r[0])

# Split each line into words, pair each word with 1, and sum the pairs per word.
counters = lines.flatMap(lambda x: x.split(' ')) \
   .map(lambda x: (x, 1)) \
   .reduceByKey(add)

# Bring the results to the driver and sort them by count, highest first.
output = counters.collect()
sortedCollection = sorted(output, key=lambda r: r[1], reverse=True)

for (word, count) in sortedCollection:
   print("%s: %i" % (word, count))
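
To sanity-check what HelloWorld.py should print before you submit it to a cluster, the same word count can be reproduced without Spark. The following is a minimal sketch in plain Python and is not part of Spark & Hive Tools. It mirrors the flatMap, map, and reduceByKey steps with collections.Counter; the names LINES and count_words are hypothetical.

```python
from collections import Counter

# Same sample sentences as the col1 values in HelloWorld.py.
LINES = [
    "pyspark and spark",
    "pyspark",
    "spark vs hadoop",
    "spark",
    "hadoop",
]


def count_words(lines):
    """Local stand-in for flatMap(split) -> map((word, 1)) -> reduceByKey(add)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(" "))
    # Sort by count, highest first, as the script does with sorted(..., reverse=True).
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for word, count in count_words(LINES):
        print("%s: %i" % (word, count))
```

The per-word counts should match what the cluster reports. The order of words that share the same count may differ on Spark, because the order of the collected results is not guaranteed.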

Link a SQL Server big data cluster

Before you can submit scripts to your clusters from Visual Studio Code, you need to link a SQL Server big data cluster.

From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: Link a Cluster.
Select the linked cluster type SQL Server Big Data.
Enter the SQL Server Big Data endpoint.
Enter the SQL Server big data cluster user name.
Enter the password for the user admin.
Set the display name of the big data cluster (optional).
List the clusters, and review the OUTPUT view for verification.

List clusters

From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: List Cluster.
Review the OUTPUT view. The view shows your linked clusters.

Set default cluster

Reopen the folder SQLBDCexample created earlier if closed.
Select the file HelloWorld.py created earlier and it will open in the script editor.
Link a cluster if you haven't yet done so.
Right-click the script editor, and select Spark / Hive: Set Default Cluster.
Select a cluster as the default cluster for the current script file. The tools automatically update the configuration file .vscode\settings.json.
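
To confirm what was written to the configuration file after you set a default cluster, you can inspect it outside the editor. The following is a minimal sketch, not part of the tools. The function name load_workspace_settings is hypothetical, and the sketch assumes the file is plain JSON without comments. It only lists the keys it finds, because the exact key that the tools write is not stated in this article.

```python
import json
from pathlib import Path


def load_workspace_settings(work_folder):
    """Return the parsed .vscode/settings.json of a work folder, or {} if absent.

    Assumption: the file is plain JSON. Visual Studio Code also accepts
    comments in settings.json, which json.loads would reject.
    """
    settings_path = Path(work_folder) / ".vscode" / "settings.json"
    if not settings_path.is_file():
        return {}
    return json.loads(settings_path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Path taken from this article; adjust it to your own work folder.
    settings = load_workspace_settings(r"C:\SQLBDC\SQLBDCexample")
    for key in sorted(settings):
        print(key, "=", settings[key])
```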

Submit interactive PySpark queries

You can submit interactive PySpark queries by following the steps below:

Reopen the folder SQLBDCexample created earlier if closed.
Select the file HelloWorld.py created earlier and it will open in the script editor.
Link a cluster if you haven't yet done so.
Select all the code, right-click the script editor, and then select Spark: PySpark Interactive to submit the query, or use the shortcut Ctrl + Alt + I.
Select the cluster if you haven't specified a default cluster. After a few moments, the Python Interactive results appear in a new tab. The tools also allow you to submit a block of code instead of the whole script file using the context menu.
(Optional) Enter %%info, and then press Shift + Enter to view job information.

Note

When Python Extension Enabled is cleared in the settings (it is selected by default), the submitted PySpark interaction results use the old window.

Submit PySpark batch job

Reopen the folder SQLBDCexample created earlier if closed.
Select the file HelloWorld.py created earlier and it will open in the script editor.
Link a cluster if you haven't yet done so.
Right-click the script editor, and then select Spark: PySpark Batch, or use shortcut Ctrl + Alt + H.
Select the cluster if you haven't specified a default cluster. After you submit a Python job, submission logs appear in the OUTPUT window in Visual Studio Code. The Spark UI URL and YARN UI URL are shown as well. You can open either URL in a web browser to track the job status.

Apache Livy configuration

Apache Livy configuration is supported. You can set it in .vscode\settings.json in the workspace folder. Currently, Livy configuration supports only Python scripts. For more details, see the Livy README.

How to trigger Livy configuration

Method 1

From the menu bar, navigate to File > Preferences > Settings.
In the Search settings text box, enter HDInsight Job Submission: Livy Conf.
Select Edit in settings.json for the relevant search result.

Method 2

Submit a file. Notice that the .vscode folder is added automatically to the work folder. You can find the Livy configuration by selecting settings.json under .vscode.

The project settings:

Livy configuration

Note

For settings driverMemory and executorMemory, set the value with unit, for example 1gb or 1024mb.
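
The unit requirement in the note above can be checked before the values go into settings.json. The following is a minimal sketch, not part of the tools. The function name check_livy_conf is hypothetical, and it checks only the mb and gb units mentioned in the note. The settings key hdinsightJobSubmission.livyConf is an assumption inferred from the setting title HDInsight Job Submission: Livy Conf; confirm it against your own settings.json.

```python
import json
import re

# Accepts values such as "1gb" or "1024mb": digits followed by a unit.
# Assumption: only the mb and gb units from the note are checked here.
_MEMORY_PATTERN = re.compile(r"^[1-9][0-9]*(mb|gb)$", re.IGNORECASE)


def check_livy_conf(livy_conf):
    """Raise ValueError if driverMemory or executorMemory lacks a unit."""
    for key in ("driverMemory", "executorMemory"):
        value = livy_conf.get(key)
        if value is not None and not _MEMORY_PATTERN.match(str(value)):
            raise ValueError(f"{key} must include a unit, for example 1gb: {value!r}")
    return livy_conf


if __name__ == "__main__":
    conf = check_livy_conf({
        "driverMemory": "1gb",
        "executorMemory": "1024mb",
        "numExecutors": 2,
    })
    # Assumption: the key name below is inferred, not confirmed by this article.
    print(json.dumps({"hdinsightJobSubmission.livyConf": conf}, indent=4))
```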

Supported Livy configurations

POST /batches

Request body

Name	Description	Type
file	File containing the application to execute	path (required)
proxyUser	User to impersonate when running the job	string
className	Application Java/Spark main class	string
args	Command-line arguments for the application	list of strings
jars	Jars to be used in this session	list of strings
pyFiles	Python files to be used in this session	list of strings
files	Files to be used in this session	list of strings
driverMemory	Amount of memory to use for the driver process	string
driverCores	Number of cores to use for the driver process	int
executorMemory	Amount of memory to use per executor process	string
executorCores	Number of cores to use for each executor	int
numExecutors	Number of executors to launch for this session	int
archives	Archives to be used in this session	list of strings
queue	The name of the YARN queue to which the job is submitted	string
name	The name of this session	string
conf	Spark configuration properties	map of key=val
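
To make the request body concrete, the fields in the table above can be assembled into a JSON document. The following is a minimal sketch, not the tools' own implementation. The function name build_batch_request and the file path are hypothetical, and the sketch only builds and prints the body; it does not send it to a cluster.

```python
import json

# Field names copied from the request body table above.
BATCH_FIELDS = {
    "file", "proxyUser", "className", "args", "jars", "pyFiles", "files",
    "driverMemory", "driverCores", "executorMemory", "executorCores",
    "numExecutors", "archives", "queue", "name", "conf",
}


def build_batch_request(file, **options):
    """Return a dict for a POST /batches body; 'file' is the only required field."""
    if not file:
        raise ValueError("file is required")
    unknown = set(options) - BATCH_FIELDS
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    body = {"file": file}
    # Drop options left as None so the body carries only explicit settings.
    body.update({k: v for k, v in options.items() if v is not None})
    return body


if __name__ == "__main__":
    body = build_batch_request(
        "/tmp/HelloWorld.py",  # hypothetical path that the cluster can read
        name="PythonWordCount",
        executorMemory="1gb",
        numExecutors=2,
        conf={"spark.sql.shuffle.partitions": "8"},
    )
    print(json.dumps(body, indent=2))
```

Rejecting unknown field names early is a design choice of this sketch; it catches typos such as a misspelled executorMemory before a job is submitted.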

Response Body

The created batch object.

Name	Description	Type
id	The session id	int
appId	The application id of this session	string
appInfo	The detailed application info	map of key=val
log	The log lines	list of strings
state	The batch state	string
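
The batch object described above can be read into a typed structure for easier handling. The following is a minimal sketch, not the tools' own implementation. The names BatchInfo and parse_batch are hypothetical, and the sample response text is made up for illustration only.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BatchInfo:
    """Fields of the batch object listed in the response body table."""
    id: int
    state: str
    appId: Optional[str] = None
    appInfo: Dict[str, Optional[str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


def parse_batch(response_text):
    """Parse a JSON batch object, tolerating missing or null optional fields."""
    raw = json.loads(response_text)
    return BatchInfo(
        id=raw["id"],
        state=raw["state"],
        appId=raw.get("appId"),
        appInfo=raw.get("appInfo") or {},
        log=raw.get("log") or [],
    )


if __name__ == "__main__":
    # Hypothetical response text, not captured from a real cluster.
    sample = '{"id": 0, "state": "starting", "appId": null, "appInfo": {}, "log": []}'
    batch = parse_batch(sample)
    print(batch.id, batch.state)
```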

Note

The assigned Livy configuration is displayed in the OUTPUT pane when you submit the script.

Additional features

Spark & Hive Tools for Visual Studio Code supports the following features:

IntelliSense autocomplete. Suggestions pop up for keywords, methods, variables, and more. Different icons represent different types of objects.
IntelliSense error marker. The language service underlines editing errors in the Hive script.
Syntax highlighting. The language service uses different colors to differentiate variables, keywords, data types, functions, and more.

Unlink cluster

From the menu bar, navigate to View > Command Palette..., and then enter Spark / Hive: Unlink a Cluster.
Select the cluster to unlink.
Review the OUTPUT view for verification.

Next steps

For more information on SQL Server big data clusters and related scenarios, see SQL Server Big Data Clusters.
