DatasetSnapshot Class


Manages Dataset snapshots with operations to get a snapshot, return its status, and convert it to a dataframe.

Note

This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.

A DatasetSnapshot object is returned from the create_snapshot method of the Dataset class.

A Dataset snapshot is a combination of a profile and an optional materialized copy of the data.

To learn more about Dataset Snapshots, go to https://aka.ms/azureml/howto/createsnapshots

Inheritance: builtins.object -> DatasetSnapshot

Constructor

DatasetSnapshot(workspace, snapshot_name, dataset_id, definition_version=None, time_stamp=None, profile_action_id=None, datastore_name=None, relative_path=None, dataset_name=None)

Parameters

Name	Description
workspace Required	Workspace The workspace the Dataset is registered in.
snapshot_name Required	str The name of the Dataset snapshot.
dataset_id Required	str The identifier of the Dataset.
definition_version	str The definition version of the Dataset. Default value: None
time_stamp	datetime The snapshot creation time. Default value: None
profile_action_id	str The snapshot profile action ID. Default value: None
datastore_name	str The snapshot datastore name. Default value: None
relative_path	str The relative path to the snapshot data. Default value: None
dataset_name	str The name of the Dataset. Default value: None

Methods

compare_profiles	Compare the profile of this Dataset snapshot with the profile of rhs_dataset_snapshot. If profiles do not exist, this method will raise an exception.
get	Get the snapshot of Dataset by snapshot name.
get_all	Get all the snapshots of the given Dataset.
get_profile	Get the profile of the Dataset snapshot.
get_status	Get the Dataset snapshot creation status.
is_data_snapshot_available	Check if the materialized copy of the snapshot is available.
to_pandas_dataframe	Create a Pandas DataFrame by loading the data saved with the snapshot.
to_spark_dataframe	Create a Spark DataFrame by loading the data saved with the snapshot.
wait_for_completion	Wait for the completion of DatasetSnapshot generation.

compare_profiles

Compare the profile of this Dataset snapshot with the profile of rhs_dataset_snapshot.

If profiles do not exist, this method will raise an exception.

compare_profiles(rhs_dataset_snapshot, include_columns=None, exclude_columns=None, histogram_compare_method=HistogramCompareMethod.WASSERSTEIN)

Parameters

Name	Description
rhs_dataset_snapshot Required	DatasetSnapshot The Dataset snapshot to compare with.
include_columns	list[str] A list of column names to be included in the comparison. Default value: None
exclude_columns	list[str] A list of column names to be excluded from the comparison. Default value: None
histogram_compare_method	HistogramCompareMethod An enum describing the comparison method, for example: WASSERSTEIN or ENERGY. Default value: HistogramCompareMethod.WASSERSTEIN

Returns

Type	Description
azureml.dataprep.api.engineapi.typedefinitions.DataProfileDifference	The difference between the profiles.
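
The call above can be sketched as a small wrapper. This is an illustrative sketch only: the helper name compare_snapshots and its columns argument are hypothetical, and both arguments are assumed to be DatasetSnapshot objects (for example, ones returned by DatasetSnapshot.get). Only compare_profiles and its include_columns parameter come from this reference.

```python
def compare_snapshots(baseline, candidate, columns=None):
    """Return the profile difference between two snapshot objects.

    `baseline` and `candidate` are assumed to be DatasetSnapshot instances.
    `columns` (hypothetical helper argument) is passed through as
    include_columns; None is the documented default for that parameter.
    """
    # compare_profiles raises an exception if either profile does not exist,
    # so callers may want to wrap this call in their own error handling.
    return baseline.compare_profiles(candidate, include_columns=columns)
```

The histogram comparison method is left at its documented default (HistogramCompareMethod.WASSERSTEIN) because the helper does not pass histogram_compare_method.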

get

Get the snapshot of Dataset by snapshot name.

static get(workspace, snapshot_name, dataset_name=None, dataset_id=None)

Parameters

Name	Description
workspace Required	Workspace The workspace the Dataset is registered in.
snapshot_name Required	str The name of the Dataset snapshot.
dataset_name	str The name of the Dataset. Default value: None
dataset_id	uuid The identifier of the Dataset. Default value: None

Returns

Type	Description
DatasetSnapshot	A DatasetSnapshot object.

get_all

Get all the snapshots of the given Dataset.

static get_all(workspace, dataset_name)

Parameters

Name	Description
workspace Required	Workspace The workspace the Dataset is registered in.
dataset_name Required	str The name of the Dataset.

Returns

Type	Description
list[DatasetSnapshot]	A list of Dataset snapshots.
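
One way to inspect the returned list is to collect a few fields from each snapshot. The sketch below is illustrative: summarize_snapshots is a hypothetical helper, and its input is assumed to be the list returned by get_all. It relies only on the name attribute and the get_status and is_data_snapshot_available methods documented on this page.

```python
def summarize_snapshots(snapshots):
    """Build one summary dict per snapshot in `snapshots`.

    `snapshots` is assumed to be an iterable of DatasetSnapshot objects,
    such as the result of DatasetSnapshot.get_all(workspace, dataset_name).
    """
    return [
        {
            "name": snap.name,                               # snapshot name
            "status": snap.get_status(),                     # creation status string
            "has_data": snap.is_data_snapshot_available(),   # materialized copy present?
        }
        for snap in snapshots
    ]
```

Note that get_status and is_data_snapshot_available are called once per snapshot, so the cost of the helper grows with the number of snapshots.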

get_profile

Get the profile of the Dataset snapshot.

get_profile()

Returns

Type	Description
azureml.dataprep.DataProfile	The DataProfile of the Dataset snapshot.

get_status

Get the Dataset snapshot creation status.

get_status()

Returns

Type	Description
str	The status of the Dataset snapshot.

is_data_snapshot_available

Check if the materialized copy of the snapshot is available.

is_data_snapshot_available()

Returns

Type	Description
bool	True if the data snapshot is available.

to_pandas_dataframe

Create a Pandas DataFrame by loading the data saved with the snapshot.

to_pandas_dataframe()

Returns

Type	Description
DataFrame	A Pandas DataFrame.

Remarks

The Pandas DataFrame is fully materialized in memory. If the snapshot was created with create_data_snapshot=False, then an exception is thrown. To check if the snapshot contains data, use the function is_data_snapshot_available.
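
The check described in these remarks can be sketched as a guard around the load. This is a minimal sketch, assuming snapshot is a DatasetSnapshot object; the helper name snapshot_to_pandas and the choice to return None instead of raising are assumptions made for illustration.

```python
def snapshot_to_pandas(snapshot):
    """Load the snapshot data into pandas, or return None if no data was saved.

    `snapshot` is assumed to be a DatasetSnapshot object.
    """
    # A snapshot created with create_data_snapshot=False has no materialized
    # copy, and to_pandas_dataframe would throw, so check availability first.
    if not snapshot.is_data_snapshot_available():
        return None
    # The full DataFrame is materialized in memory at this point.
    return snapshot.to_pandas_dataframe()
```

Returning None keeps the caller's control flow simple; raising a custom error would be an equally reasonable design if a missing data copy should be treated as a failure.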

to_spark_dataframe

Create a Spark DataFrame by loading the data saved with the snapshot.

to_spark_dataframe()

Returns

Type	Description
DataFrame	A Spark DataFrame.

Remarks

The Spark DataFrame returned is only an execution plan and does not actually contain any data, because Spark DataFrames are lazily evaluated. If the snapshot was created with create_data_snapshot=False, an exception is thrown when you try to access the data. To check if the snapshot contains data, use is_data_snapshot_available.

wait_for_completion

Wait for the completion of DatasetSnapshot generation.

wait_for_completion(show_output=True, status_update_frequency=10)

Parameters

Name	Description
show_output	bool Indicates if the method will print the output. Default value: True
status_update_frequency	int The Action run status update frequency in seconds. Default value: 10
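
A typical pattern is to wait quietly and then read the final status. The sketch below is illustrative: wait_for_snapshot and poll_seconds are hypothetical names, snapshot is assumed to be a DatasetSnapshot object, and the set of possible status strings is not listed in this reference, so the helper returns the status without interpreting it.

```python
def wait_for_snapshot(snapshot, poll_seconds=30):
    """Wait for snapshot generation to finish, then return its status string.

    `snapshot` is assumed to be a DatasetSnapshot object. `poll_seconds`
    (hypothetical helper argument) is passed as status_update_frequency.
    """
    # show_output=False suppresses the printed progress output.
    snapshot.wait_for_completion(
        show_output=False,
        status_update_frequency=poll_seconds,
    )
    # Report whatever status the service gives once waiting has finished.
    return snapshot.get_status()
```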

Attributes

dataset_id

Get the Dataset identifier.

Returns

Type	Description
str	The Dataset ID.

name

Get the Dataset snapshot name.

Returns

Type	Description
str	The Dataset snapshot name.

workspace

Get the Azure Machine Learning workspace where the Dataset is registered.

Returns

Type	Description
Workspace	The workspace where the Dataset is registered.
