DatasetSnapshot Class

Manages Dataset snapshots, with operations to get a snapshot, return its status, and convert it to a DataFrame.

Note

This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.

A DatasetSnapshot object is returned from the create_snapshot method of the Dataset class.

A Dataset snapshot is a combination of a profile and an optional materialized copy of the data.

To learn more about Dataset Snapshots, go to https://aka.ms/azureml/howto/createsnapshots

Inheritance
builtins.object
DatasetSnapshot

Constructor

DatasetSnapshot(workspace, snapshot_name, dataset_id, definition_version=None, time_stamp=None, profile_action_id=None, datastore_name=None, relative_path=None, dataset_name=None)

Parameters

Name Description
workspace
Required
azureml.core.Workspace

The workspace the Dataset is registered in.

snapshot_name
Required
str

The name of the Dataset snapshot.

dataset_id
Required
str

The identifier of the Dataset.

definition_version
str

The definition version of the Dataset.

Default value: None
time_stamp

The snapshot creation time.

Default value: None
profile_action_id
str

The snapshot profile action ID.

Default value: None
datastore_name
str

The snapshot data store name.

Default value: None
relative_path
str

The relative path to the snapshot data.

Default value: None
dataset_name
str

The name of the Dataset.

Default value: None

Methods

compare_profiles

Compare the profile of this Dataset snapshot with the profile of rhs_dataset_snapshot.

If profiles do not exist, this method will raise an exception.

get

Get the snapshot of Dataset by snapshot name.

get_all

Get all the snapshots of the given Dataset.

get_profile

Get the profile of the Dataset snapshot.

get_status

Get the Dataset snapshot creation status.

is_data_snapshot_available

Check if the materialized copy of the snapshot is available.

to_pandas_dataframe

Create a Pandas DataFrame by loading the data saved with the snapshot.

to_spark_dataframe

Create a Spark DataFrame by loading the data saved with the snapshot.

wait_for_completion

Wait for the completion of DatasetSnapshot generation.

compare_profiles

Compare the profile of this Dataset snapshot with the profile of rhs_dataset_snapshot.

If profiles do not exist, this method will raise an exception.

compare_profiles(rhs_dataset_snapshot, include_columns=None, exclude_columns=None, histogram_compare_method=HistogramCompareMethod.WASSERSTEIN)

Parameters

Name Description
rhs_dataset_snapshot
Required

The Dataset snapshot to compare with.

include_columns

A list of column names to be included in the comparison.

Default value: None
exclude_columns

A list of column names to be excluded from the comparison.

Default value: None
histogram_compare_method

An enum describing the comparison method, for example: WASSERSTEIN or ENERGY.

Default value: HistogramCompareMethod.WASSERSTEIN

Returns

Type Description
azureml.dataprep.api.engineapi.typedefinitions.DataProfileDifference

The difference between the profiles.
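To give a feel for what a WASSERSTEIN comparison measures, the following is a minimal sketch of the first-order Wasserstein distance between two equal-size one-dimensional samples. It illustrates the metric only; it is not the SDK's implementation, and the function name `wasserstein_1d` and the sample values are made up for this example.

```python
def wasserstein_1d(xs, ys):
    """First-order Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # For equal-size empirical samples, the optimal transport plan pairs the
    # i-th smallest value of one sample with the i-th smallest of the other.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical column values from two snapshots of the same Dataset.
baseline = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]

print(wasserstein_1d(baseline, baseline))  # 0.0: identical distributions
print(wasserstein_1d(baseline, shifted))   # 1.0: every value moved by 1
```

A distance of zero means the two samples have the same distribution; larger values indicate that more "mass" has to move further to turn one distribution into the other.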

get

Get the snapshot of Dataset by snapshot name.

static get(workspace, snapshot_name, dataset_name=None, dataset_id=None)

Parameters

Name Description
workspace
Required

The workspace the Dataset is registered in.

snapshot_name
Required
str

The name of the Dataset snapshot.

dataset_name

The name of the Dataset.

Default value: None
dataset_id

The identifier of the Dataset.

Default value: None

Returns

Type Description

A DatasetSnapshot object.

get_all

Get all the snapshots of the given Dataset.

static get_all(workspace, dataset_name)

Parameters

Name Description
workspace
Required

The workspace the Dataset is registered in.

dataset_name
Required

The name of the Dataset.

Returns

Type Description

A list of Dataset snapshots.

get_profile

Get the profile of the Dataset snapshot.

get_profile()

Returns

Type Description
azureml.dataprep.DataProfile

The DataProfile of the Dataset snapshot.

get_status

Get the Dataset snapshot creation status.

get_status()

Returns

Type Description
str

The status of the Dataset snapshot.

is_data_snapshot_available

Check if the materialized copy of the snapshot is available.

is_data_snapshot_available()

Returns

Type Description

True if the data snapshot is available.

to_pandas_dataframe

Create a Pandas DataFrame by loading the data saved with the snapshot.

to_pandas_dataframe()

Returns

Type Description

A Pandas DataFrame.

Remarks

The Pandas DataFrame is fully materialized in memory. If the snapshot was created with create_data_snapshot=False, an exception is thrown. To check whether the snapshot contains data, use the is_data_snapshot_available method.

to_spark_dataframe

Create a Spark DataFrame by loading the data saved with the snapshot.

to_spark_dataframe()

Returns

Type Description

A Spark DataFrame.

Remarks

The Spark DataFrame returned is only an execution plan and does not contain any data, because Spark DataFrames are lazily evaluated. If the snapshot was created with create_data_snapshot=False, an exception is thrown when you try to access the data. To check whether the snapshot contains data, use the is_data_snapshot_available method.

wait_for_completion

Wait for the completion of DatasetSnapshot generation.

wait_for_completion(show_output=True, status_update_frequency=10)

Parameters

Name Description
show_output

Indicates whether the method prints its output.

Default value: True
status_update_frequency
int

The frequency, in seconds, at which the action run status is updated.

Default value: 10

Attributes

dataset_id

Get the Dataset identifier.

Returns

Type Description
str

The Dataset ID.

name

Get the Dataset snapshot name.

Returns

Type Description
str

The Dataset snapshot name.

workspace

Get the Azure Machine Learning workspace where the Dataset is registered.

Returns

Type Description

The workspace where the Dataset is registered.