SynapseSparkStep Class
Note
This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Creates an Azure ML Pipeline step that submits and executes a Python script as a Spark job on a Synapse Spark pool.
- Inheritance: azureml.pipeline.core._synapse_spark_step_base._SynapseSparkStepBase → SynapseSparkStep
Constructor
SynapseSparkStep(file, source_directory, compute_target, driver_memory, driver_cores, executor_memory, executor_cores, num_executors, name=None, app_name=None, environment=None, arguments=None, inputs=None, outputs=None, conf=None, py_files=None, jars=None, files=None, allow_reuse=True, version=None)
Parameters

Name | Description |
---|---|
file Required | The name of a Synapse script relative to source_directory. |
source_directory Required | A folder that contains the Python script, conda env, and other resources used in the step. |
compute_target Required | The compute target to use. |
driver_memory Required | Amount of memory to use for the driver process. |
driver_cores Required | Number of cores to use for the driver process. |
executor_memory Required | Amount of memory to use per executor process. |
executor_cores Required | Number of cores to use for each executor. |
num_executors Required | Number of executors to launch for this session. |
name | The name of the step. If unspecified, file is used. |
app_name | The app name used to submit the Spark job. |
environment | The AML environment that will be leveraged in this SynapseSparkStep. |
arguments | Command-line arguments for the Synapse script file. |
inputs | A list of inputs. |
outputs | A list of outputs. |
conf | Spark configuration properties. |
py_files | Python files to be used in this session; a parameter of the Livy API. |
jars | Jar files to be used in this session; a parameter of the Livy API. |
files | Files to be used in this session; a parameter of the Livy API. |
allow_reuse | Indicates whether the step should reuse previous results when re-run with the same settings. Defaults to True. |
version | An optional version tag to denote a change in functionality for the step. |
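The conf parameter accepts Spark configuration properties as a plain dictionary. A minimal sketch, assuming standard Spark property names (the specific settings below are illustrative, not required by Azure ML):

```python
# Hypothetical Spark configuration to pass through the `conf` parameter.
# These are standard Spark property names, shown only as an example.
spark_conf = {
    "spark.sql.shuffle.partitions": "200",        # tune shuffle parallelism
    "spark.dynamicAllocation.enabled": "false",   # fixed executor count, matching num_executors
}
```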
Remarks

A SynapseSparkStep is a basic, built-in step that runs a Python Spark job on a Synapse Spark pool. It takes a main file name and other optional parameters such as arguments for the script, compute target, inputs, and outputs.

The best practice for working with SynapseSparkStep is to use a separate folder for scripts and any dependent files associated with the step, and to specify that folder with the source_directory parameter. Following this best practice has two benefits. First, it helps reduce the size of the snapshot created for the step, because only what the step needs is snapshotted. Second, the step's output from a previous run can be reused if there are no changes to the source_directory that would trigger a re-upload of the snapshot.
```python
from azureml.core import Dataset
from azureml.pipeline.steps import SynapseSparkStep
from azureml.data import HDFSOutputDatasetConfig

# get input dataset
input_ds = Dataset.get_by_name(ws, "weather_ds").as_named_input("weather_ds")

# register pipeline output as dataset
output_ds = HDFSOutputDatasetConfig("synapse_step_output",
                                    destination=(ws.datastores['datastore'], "dir")
                                    ).register_on_complete(name="registered_dataset")

step_1 = SynapseSparkStep(
    name="synapse_step",
    file="pyspark_job.py",
    source_directory="./script",
    inputs=[input_ds],
    outputs=[output_ds],
    compute_target="synapse",
    driver_memory="7g",
    driver_cores=4,
    executor_memory="7g",
    executor_cores=2,
    num_executors=1,
    conf={})
```
SynapseSparkStep only supports DatasetConsumptionConfig as input and HDFSOutputDatasetConfig as output.
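For orientation, the Synapse script itself typically reads its input and output locations from the command-line arguments supplied through the arguments parameter. The sketch below is a hypothetical pyspark_job.py; the --input and --output flag names and the CSV/Parquet formats are assumptions for illustration, not part of the SynapseSparkStep API:

```python
# Hypothetical contents of pyspark_job.py; the --input/--output flag names
# assume the step was created with
# arguments=["--input", input_ds, "--output", output_ds],
# which is not shown in the example above.
import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--input", help="path of the consumed dataset")
parser.add_argument("--output", help="path for the step output")
args = parser.parse_args()

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv(args.input, header=True)      # read the input dataset
df.write.mode("overwrite").parquet(args.output)   # write results to the HDFS output
```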
Methods

create_node | Create a node for the Synapse script step. This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow. |

create_node

Create a node for the Synapse script step.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow.
create_node(graph, default_datastore, context)
Parameters

Name | Description |
---|---|
graph Required | The graph object to add the node to. |
default_datastore Required | The default datastore. |
context Required (azureml.pipeline.core._GraphContext) | The graph context. |
Returns

Type | Description |
---|---|
Node | The created node. |
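Because create_node is invoked by the SDK itself, the usual pattern is simply to add the step to a Pipeline and submit it; Azure ML then calls create_node on each step while building the pipeline graph. A minimal sketch, assuming a workspace ws, the step_1 from the example above, and a hypothetical experiment name:

```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# Construct the pipeline; Azure ML calls create_node on each step internally.
pipeline = Pipeline(workspace=ws, steps=[step_1])

# Submit under an experiment ("synapse-demo" is a placeholder name) and wait.
pipeline_run = Experiment(ws, "synapse-demo").submit(pipeline)
pipeline_run.wait_for_completion()
```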