DatasetSnapshot Class
Manages Dataset snapshots with operations to get a snapshot, return its status, and convert it to a dataframe.
Note
This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.
A DatasetSnapshot object is returned from the create_snapshot method of the Dataset class.
A Dataset snapshot combines a profile with an optional materialized copy of the data.
To learn more about Dataset snapshots, see https://aka.ms/azureml/howto/createsnapshots
Inheritance
builtins.object -> DatasetSnapshot
Constructor
DatasetSnapshot(workspace, snapshot_name, dataset_id, definition_version=None, time_stamp=None, profile_action_id=None, datastore_name=None, relative_path=None, dataset_name=None)
Parameters
Name | Description |
---|---|
workspace (Required) | azureml.core.Workspace. The workspace the Dataset is registered in. |
snapshot_name (Required) | The name of the Dataset snapshot. |
dataset_id (Required) | The identifier of the Dataset. |
definition_version | The definition version of the Dataset. Default value: None |
time_stamp | The snapshot creation time. Default value: None |
profile_action_id | The snapshot profile action ID. Default value: None |
datastore_name | The snapshot datastore name. Default value: None |
relative_path | The relative path to the snapshot data. Default value: None |
dataset_name | The name of the Dataset. Default value: None |
Methods
Method | Description |
---|---|
compare_profiles | Compare the current dataset profile with rhs_dataset profile. If profiles do not exist, this method will raise an exception. |
get | Get the snapshot of Dataset by snapshot name. |
get_all | Get all the snapshots of the given Dataset. |
get_profile | Get the profile of the Dataset snapshot. |
get_status | Get the Dataset snapshot creation status. |
is_data_snapshot_available | Check if the materialized copy of the snapshot is available. |
to_pandas_dataframe | Create a Pandas DataFrame by loading the data saved with the snapshot. |
to_spark_dataframe | Create a Spark DataFrame by loading the data saved with the snapshot. |
wait_for_completion | Wait for the completion of DatasetSnapshot generation. |
compare_profiles
Compare the current dataset profile with rhs_dataset profile.
If profiles do not exist, this method will raise an exception.
compare_profiles(rhs_dataset_snapshot, include_columns=None, exclude_columns=None, histogram_compare_method=HistogramCompareMethod.WASSERSTEIN)
Parameters
Name | Description |
---|---|
rhs_dataset_snapshot (Required) | The Dataset snapshot to compare with. |
include_columns | A list of column names to be included in the comparison. Default value: None |
exclude_columns | A list of column names to be excluded from the comparison. Default value: None |
histogram_compare_method | An enum describing the comparison method, for example: WASSERSTEIN or ENERGY. Default value: HistogramCompareMethod.WASSERSTEIN |
Returns
Type | Description |
---|---|
azureml.dataprep.api.engineapi.typedefinitions.DataProfileDifference | The difference between the profiles. |
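As a minimal sketch of the call above, the helper below restricts a profile comparison to a chosen set of columns. The helper name `compare_selected_columns` is hypothetical, and both arguments are assumed to be DatasetSnapshot objects whose profiles already exist.

```python
def compare_selected_columns(lhs_snapshot, rhs_snapshot, columns):
    """Compare the profiles of two snapshots on the given columns only.

    Both snapshots are assumed to be DatasetSnapshot objects with existing
    profiles; per the documentation, compare_profiles raises otherwise.
    """
    # include_columns narrows the comparison. histogram_compare_method is
    # left at its documented default (HistogramCompareMethod.WASSERSTEIN).
    return lhs_snapshot.compare_profiles(
        rhs_snapshot,
        include_columns=list(columns),
    )
```

Passing `exclude_columns` instead would follow the same pattern when it is easier to name the columns to skip.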
get
Get the snapshot of Dataset by snapshot name.
static get(workspace, snapshot_name, dataset_name=None, dataset_id=None)
Parameters
Name | Description |
---|---|
workspace (Required) | The workspace the Dataset is registered in. |
snapshot_name (Required) | The name of the Dataset snapshot. |
dataset_name | The name of the Dataset. Default value: None |
dataset_id | The identifier of the Dataset. Default value: None |
Returns
Type | Description |
---|---|
A DatasetSnapshot object. |
get_all
Get all the snapshots of the given Dataset.
static get_all(workspace, dataset_name)
Parameters
Name | Description |
---|---|
workspace (Required) | The workspace the Dataset is registered in. |
dataset_name (Required) | The name of the Dataset. |
Returns
Type | Description |
---|---|
A list of Dataset snapshots. |
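One way to work with the returned list is sketched below: it keeps only the snapshots that have a materialized copy of the data. The helper name `snapshots_with_data` is hypothetical, and `snapshots` is assumed to be the list returned by get_all.

```python
def snapshots_with_data(snapshots):
    """Return the snapshots that have a materialized copy of their data.

    `snapshots` is assumed to be the list returned by
    DatasetSnapshot.get_all(workspace, dataset_name).
    """
    # is_data_snapshot_available() is True only when the snapshot stores
    # data in addition to its profile.
    return [s for s in snapshots if s.is_data_snapshot_available()]
```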
get_profile
Get the profile of the Dataset snapshot.
get_profile()
Returns
Type | Description |
---|---|
azureml.dataprep.DataProfile | The DataProfile of the Dataset snapshot. |
get_status
Get the Dataset snapshot creation status.
get_status()
Returns
Type | Description |
---|---|
The status of Dataset snapshot. |
is_data_snapshot_available
Check if the materialized copy of the snapshot is available.
is_data_snapshot_available()
Returns
Type | Description |
---|---|
True if the data snapshot is available. |
to_pandas_dataframe
Create a Pandas DataFrame by loading the data saved with the snapshot.
to_pandas_dataframe()
Returns
Type | Description |
---|---|
A Pandas DataFrame. |
Remarks
The Pandas DataFrame is fully materialized in memory. If the snapshot was created with create_data_snapshot=False, then an exception is thrown. To check if the snapshot contains data, use the function is_data_snapshot_available.
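The check described in these remarks can be sketched as a small guard. The helper name `load_pandas_if_available` is hypothetical, and `snapshot` is assumed to be a DatasetSnapshot object.

```python
def load_pandas_if_available(snapshot):
    """Return the snapshot data as a Pandas DataFrame, or None.

    `snapshot` is assumed to be a DatasetSnapshot object.
    """
    # Snapshots created with create_data_snapshot=False hold no data, so
    # to_pandas_dataframe() would raise. Check availability first.
    if not snapshot.is_data_snapshot_available():
        return None
    # Loads the full materialized copy into memory.
    return snapshot.to_pandas_dataframe()
```

Returning None rather than raising is a design choice of this sketch; callers that prefer an error can call to_pandas_dataframe directly.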
to_spark_dataframe
Create a Spark DataFrame by loading the data saved with the snapshot.
to_spark_dataframe()
Returns
Type | Description |
---|---|
A Spark DataFrame. |
Remarks
The Spark DataFrame returned is only an execution plan and does not actually contain any data, because Spark DataFrames are lazily evaluated. If the snapshot was created with create_data_snapshot=False, an exception is thrown when you try to access the data. To check if the snapshot contains data, use is_data_snapshot_available.
wait_for_completion
Wait for the completion of DatasetSnapshot generation.
wait_for_completion(show_output=True, status_update_frequency=10)
Parameters
Name | Description |
---|---|
show_output | Indicates if the method will print the output. Default value: True |
status_update_frequency | The Action run status update frequency in seconds. Default value: 10 |
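A common pattern, sketched here under assumptions, is to block until snapshot generation finishes and then read the final status. The helper name `wait_and_get_status` and the 30-second polling interval are illustrative; `snapshot` is assumed to be a DatasetSnapshot object.

```python
def wait_and_get_status(snapshot, poll_seconds=30):
    """Block until snapshot generation completes, then return its status.

    `snapshot` is assumed to be a DatasetSnapshot object; `poll_seconds`
    is an illustrative polling interval, not a recommended value.
    """
    # Suppress progress output and poll at the requested interval.
    snapshot.wait_for_completion(
        show_output=False,
        status_update_frequency=poll_seconds,
    )
    # get_status() reports the snapshot creation status.
    return snapshot.get_status()
```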
Attributes
dataset_id
name
workspace
Get the Azure Machine Learning workspace where the Dataset is registered.
Returns
Type | Description |
---|---|
The workspace where the Dataset is registered. |