SynapseSparkStep Class

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Creates an Azure ML Synapse step that submits and executes a Python script.

Create an Azure ML Pipeline step that runs a Spark job on a Synapse Spark pool.

Inheritance
azureml.pipeline.core._synapse_spark_step_base._SynapseSparkStepBase
SynapseSparkStep

Constructor

SynapseSparkStep(file, source_directory, compute_target, driver_memory, driver_cores, executor_memory, executor_cores, num_executors, name=None, app_name=None, environment=None, arguments=None, inputs=None, outputs=None, conf=None, py_files=None, jars=None, files=None, allow_reuse=True, version=None)

Parameters

Name Description
file
Required
str

The name of a Synapse script relative to source_directory.

source_directory
Required
str

A folder that contains the Python script, Conda environment, and other resources used in the step.

compute_target
Required

The compute target to use.

driver_memory
Required
str

Amount of memory to use for the driver process.

driver_cores
Required
int

Number of cores to use for the driver process.

executor_memory
Required
str

Amount of memory to use per executor process.

executor_cores
Required
int

Number of cores to use for each executor.

num_executors
Required
int

Number of executors to launch for this session.

name
Default value: None
str

The name of the step. If unspecified, file is used.

app_name
Default value: None
str

The app name used to submit the Apache Spark job.

environment
Default value: None

The AML environment to use in this SynapseSparkStep.

arguments
Default value: None

Command-line arguments for the Synapse script file.

inputs
Default value: None

A list of inputs.

outputs
Default value: None

A list of outputs.

conf
Default value: None

Spark configuration properties.

py_files
Default value: None

Python files to be used in this session; a parameter of the Livy API.

jars
Default value: None

Jar files to be used in this session; a parameter of the Livy API.

files
Default value: None

Files to be used in this session; a parameter of the Livy API.

allow_reuse
Default value: True

Indicates whether the step should reuse previous results when re-run with the same settings.

version
Default value: None
str

An optional version tag to denote a change in functionality for the step.
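
The optional parameters can be used to tune the Spark session. Below is a minimal sketch of a step that sets a few of them; the compute target name "synapse", the helper file utils.py, and the conf value are illustrative assumptions, not values from this reference:


   from azureml.pipeline.steps import SynapseSparkStep

   # Illustrative step that passes Spark configuration properties and extra
   # Python files to the Livy session; names and values are placeholders.
   step = SynapseSparkStep(
       file="pyspark_job.py",
       source_directory="./script",
       compute_target="synapse",      # name of an attached Synapse Spark pool
       driver_memory="7g",
       driver_cores=4,
       executor_memory="7g",
       executor_cores=2,
       num_executors=2,
       conf={"spark.sql.shuffle.partitions": "200"},  # standard Spark property
       py_files=["utils.py"],         # shipped to the session via the Livy API
       allow_reuse=True)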

Remarks

A SynapseSparkStep is a basic, built-in step to run a Python Spark job on Synapse Spark pools. It takes a main file name and other optional parameters such as arguments for the script, the compute target, inputs, and outputs.

The best practice for working with SynapseSparkStep is to use a separate folder for scripts and any dependent files associated with the step, and specify that folder with the source_directory parameter. Following this best practice has two benefits. First, it helps reduce the size of the snapshot created for the step because only what is needed for the step is snapshotted. Second, the step's output from a previous run can be reused if there are no changes to the source_directory that would trigger a re-upload of the snapshot.


   from azureml.core import Dataset
   from azureml.pipeline.steps import SynapseSparkStep
   from azureml.data import HDFSOutputDatasetConfig

   # get input dataset
   input_ds = Dataset.get_by_name(workspace, "weather_ds").as_named_input("weather_ds")

   # register pipeline output as dataset
   output_ds = HDFSOutputDatasetConfig("synapse_step_output",
                                       destination=(workspace.datastores['datastore'], "dir")
                                       ).register_on_complete(name="registered_dataset")

   step_1 = SynapseSparkStep(
       name="synapse_step",
       file="pyspark_job.py",
       source_directory="./script",
       inputs=[input_ds],
       outputs=[output_ds],
       compute_target="synapse",
       driver_memory="7g",
       driver_cores=4,
       executor_memory="7g",
       executor_cores=2,
       num_executors=1,
       conf={})

SynapseSparkStep only supports DatasetConsumptionConfig as input and HDFSOutputDatasetConfig as output.
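
For orientation, here is a minimal sketch of what a script such as pyspark_job.py could look like, assuming the input and output locations are passed to it as command-line arguments through the arguments parameter (for example, arguments=["--input", input_ds, "--output", output_ds]); the argument names and the CSV format are illustrative:


   import argparse

   from pyspark.sql import SparkSession

   # Parse the dataset paths that the pipeline step passes on the command line.
   parser = argparse.ArgumentParser()
   parser.add_argument("--input")
   parser.add_argument("--output")
   args = parser.parse_args()

   spark = SparkSession.builder.getOrCreate()

   # Read the input data, then write the result to the HDFS output location.
   df = spark.read.option("header", "true").csv(args.input)
   df.write.mode("overwrite").option("header", "true").csv(args.output)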

Methods

create_node

Create a node for the Synapse script step.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.

create_node(graph, default_datastore, context)

Parameters

Name Description
graph
Required

The graph object to add the node to.

default_datastore
Required

The default datastore.

context
Required
azureml.pipeline.core._GraphContext

The graph context.

Returns

Type Description

The created node.
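
create_node is called for you when the step is added to a pipeline. As a rough sketch, assuming a Workspace object ws and the step_1 defined in the Remarks example (the experiment name is made up):


   from azureml.core import Experiment
   from azureml.pipeline.core import Pipeline

   # Building the pipeline invokes create_node for each step internally.
   pipeline = Pipeline(workspace=ws, steps=[step_1])

   # Submit the pipeline as an experiment run and wait for completion.
   pipeline_run = Experiment(ws, "synapse-spark-sample").submit(pipeline)
   pipeline_run.wait_for_completion()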