SynapseSparkStep Class
Note
This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Creates an Azure ML Pipeline step that submits and executes a Python script as a Spark job on a Synapse Spark pool.
- Inheritance: azureml.pipeline.core._synapse_spark_step_base._SynapseSparkStepBase → SynapseSparkStep
Constructor
SynapseSparkStep(file, source_directory, compute_target, driver_memory, driver_cores, executor_memory, executor_cores, num_executors, name=None, app_name=None, environment=None, arguments=None, inputs=None, outputs=None, conf=None, py_files=None, jars=None, files=None, allow_reuse=True, version=None)
Parameters

Name | Description |
---|---|
file Required | The name of a Synapse script relative to source_directory. |
source_directory Required | A folder that contains the Python script, conda env, and other resources used in the step. |
compute_target Required | The compute target to use. |
driver_memory Required | Amount of memory to use for the driver process. |
driver_cores Required | Number of cores to use for the driver process. |
executor_memory Required | Amount of memory to use per executor process. |
executor_cores Required | Number of cores to use for each executor. |
num_executors Required | Number of executors to launch for this session. |
name | The name of the step. If unspecified, file is used. |
app_name | The app name used to submit the Spark job. |
environment | The AML environment that will be leveraged in this SynapseSparkStep. |
arguments | Command-line arguments for the Synapse script file. |
inputs | A list of inputs. |
outputs | A list of outputs. |
conf | Spark configuration properties. |
py_files | Python files to be used in this session; a parameter of the Livy API. |
jars | Jar files to be used in this session; a parameter of the Livy API. |
files | Files to be used in this session; a parameter of the Livy API. |
allow_reuse | Indicates whether the step should reuse previous results when re-run with the same settings. Defaults to True. |
version | An optional version tag to denote a change in functionality for the step. |
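The conf parameter accepts Spark configuration properties as a plain dictionary. A minimal sketch, assuming standard Spark property names (the specific settings below are illustrative, not required by Azure ML):

```python
# Hypothetical Spark configuration to pass through the `conf` parameter.
# These are standard Spark property names, shown only as an example.
spark_conf = {
    "spark.sql.shuffle.partitions": "200",        # tune shuffle parallelism
    "spark.dynamicAllocation.enabled": "false",   # fixed executor count, matching num_executors
}
```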
Remarks

A SynapseSparkStep is a basic, built-in step that runs a Python Spark job on a Synapse Spark pool. It takes a main file name and other optional parameters such as arguments for the script, compute target, inputs, and outputs.

The best practice for working with SynapseSparkStep is to use a separate folder for scripts and any dependent files associated with the step, and to specify that folder with the source_directory parameter. Following this best practice has two benefits. First, it helps reduce the size of the snapshot created for the step, because only what the step needs is snapshotted. Second, the step's output from a previous run can be reused if there are no changes to the source_directory that would trigger a re-upload of the snapshot.
```python
from azureml.core import Dataset
from azureml.pipeline.steps import SynapseSparkStep
from azureml.data import HDFSOutputDatasetConfig

# get input dataset
input_ds = Dataset.get_by_name(ws, "weather_ds").as_named_input("weather_ds")

# register pipeline output as dataset
output_ds = HDFSOutputDatasetConfig("synapse_step_output",
                                    destination=(ws.datastores['datastore'], "dir")
                                    ).register_on_complete(name="registered_dataset")

step_1 = SynapseSparkStep(
    name="synapse_step",
    file="pyspark_job.py",
    source_directory="./script",
    inputs=[input_ds],
    outputs=[output_ds],
    compute_target="synapse",
    driver_memory="7g",
    driver_cores=4,
    executor_memory="7g",
    executor_cores=2,
    num_executors=1,
    conf={})
```
SynapseSparkStep only supports DatasetConsumptionConfig as input and HDFSOutputDatasetConfig as output.
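For orientation, the Synapse script itself typically reads its input and output locations from the command-line arguments supplied through the arguments parameter. The sketch below is a hypothetical pyspark_job.py; the --input and --output flag names and the CSV/Parquet formats are assumptions for illustration, not part of the SynapseSparkStep API:

```python
# Hypothetical contents of pyspark_job.py; the --input/--output flag names
# assume the step was created with
# arguments=["--input", input_ds, "--output", output_ds],
# which is not shown in the example above.
import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--input", help="path of the consumed dataset")
parser.add_argument("--output", help="path for the step output")
args = parser.parse_args()

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv(args.input, header=True)      # read the input dataset
df.write.mode("overwrite").parquet(args.output)   # write results to the HDFS output
```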
Methods

create_node | Create a node for the Synapse script step. This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow. |

create_node

Create a node for the Synapse script step.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow.
create_node(graph, default_datastore, context)
Parameters

Name | Description |
---|---|
graph Required | The graph object to add the node to. |
default_datastore Required | The default datastore. |
context Required (azureml.pipeline.core._GraphContext) | The graph context. |
Returns

Type | Description |
---|---|
Node | The created node. |
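Because create_node is invoked by the SDK itself, the usual pattern is simply to add the step to a Pipeline and submit it; Azure ML then calls create_node on each step while building the pipeline graph. A minimal sketch, assuming a workspace ws, the step_1 from the example above, and a hypothetical experiment name:

```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# Construct the pipeline; Azure ML calls create_node on each step internally.
pipeline = Pipeline(workspace=ws, steps=[step_1])

# Submit under an experiment ("synapse-demo" is a placeholder name) and wait.
pipeline_run = Experiment(ws, "synapse-demo").submit(pipeline)
pipeline_run.wait_for_completion()
```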