Convert to Dataset
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
Converts data input to the internal Dataset format used by Microsoft Machine Learning.
Category: Data Format Conversions
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
Module overview
This article describes how to use the Convert to Dataset module in Machine Learning Studio (classic) to convert any data that you might need for an experiment to the internal format used by Studio (classic).
Conversion is not required in most cases, because Machine Learning implicitly converts data to its native dataset format when any operation is performed on the data.
However, saving data to the dataset format is recommended if you have performed some kind of normalization or cleaning on a set of data, and you want to ensure that the changes are used in further experiments.
Note
Convert to Dataset changes only the format of the data, and it does not save a new copy of the data in the workspace. To save the dataset, double-click the output port, select Save as dataset, and type a new name.
How to use Convert to Dataset
We recommend that you use the Edit Metadata module to prepare the dataset before using Convert to Dataset. You can add or change column names, adjust data types, and so forth.
Add the Convert to Dataset module to your experiment. You can find this module in the Data Format Conversions category in Machine Learning Studio (classic).
Connect it to any module that outputs a dataset.
As long as the data is tabular, you can convert it to a dataset. This includes data loaded using Import Data, data created by using Enter Data Manually, data generated by code in custom modules, datasets transformed by using Apply Transformation, or datasets that were generated or modified by using Apply SQL Transformation.
In the Action dropdown list, indicate if you want to do any cleanup on the data before saving the dataset:
None: Use the data as is.
SetMissingValue: Specify a placeholder that is inserted in the dataset wherever there is a missing value. The default placeholder is the question mark character (?), but you can use the Custom missing value option to type a different value.
ReplaceValues: Use this option to specify a single exact value to be replaced with any other exact value. For example, assuming your data contains the string `obs` used as a placeholder for missing values, you could specify a custom replacement operation using these options:
Set Replace to Custom.
For Custom value, type the value you want to find. In this case, you would type `obs`.
For New value, type the new value to replace the original string with. In this case, you might type `?`.
Note that the ReplaceValues operation applies only to exact matches. For example, these strings would not be affected: `obs.`, `obsolete`.
SparseOutput: Indicates that the dataset is sparse. By creating a sparse data vector, you can ensure that missing values do not affect a sparse data distribution. After choosing this option, you must indicate how missing values and zero values should be handled.
To remove any value other than zero, click the Remove option and type a single value to remove. You can remove missing values, or set a custom value to delete from the vector. Only exact matches will be removed. For example, if you type `x` in the Remove value text box, the row `xx` would not be affected.
By default, the option Remove zeroes is set to `True`, meaning that all zero values are removed when the sparse column is created.
Run the experiment, or right-click the Convert to Dataset module and select Run selected.
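The SetMissingValue and ReplaceValues actions described in the preceding steps can be illustrated outside Studio (classic) with a short Python sketch. This is not the module's implementation. The function names `set_missing_value` and `replace_values`, and the use of `None` to represent a missing cell, are assumptions made for illustration only. The sketch shows the exact-match behavior: a cell changes only when it equals the full search value.

```python
# Illustrative sketch only: mimics the described behavior of the
# SetMissingValue and ReplaceValues actions on a small in-memory table.
# The function names and the use of None for "missing" are assumptions.

def set_missing_value(rows, placeholder="?"):
    """Return a copy of rows with every missing cell (None) set to placeholder."""
    return [[placeholder if cell is None else cell for cell in row] for row in rows]


def replace_values(rows, find, new_value):
    """Return a copy of rows where cells exactly equal to `find` become `new_value`."""
    return [[new_value if cell == find else cell for cell in row] for row in rows]


table = [
    ["obs", "obs.", "obsolete"],
    ["a", None, "obs"],
]

# Only the cells that are exactly "obs" change; "obs." and "obsolete" do not.
cleaned = replace_values(table, find="obs", new_value="?")

# The missing cell receives the default placeholder.
filled = set_missing_value(table)

print(cleaned)
print(filled)
```

Both helpers return new lists and leave `table` unchanged, which loosely mirrors the fact that Convert to Dataset produces an output dataset rather than modifying its input.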
Results
- To save the resulting dataset with a new name, right-click the output of Convert to Dataset and select Save as Dataset.
Examples
You can see examples of how the Convert to Dataset module is used in the Azure AI Gallery:
CRM sample: Reads from a shared dataset and saves a copy of the dataset in the local workspace.
Flight Delay example: Saves a dataset that has been cleaned by replacing missing values so that you can use it for future experiments.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Any module that takes a dataset as input can also take data in the CSV, TSV, or ARFF formats. Before any module code is executed, the inputs are preprocessed in a way that is equivalent to running the Convert to Dataset module on the input.
You cannot convert from the SVMLight format to dataset.
When specifying a custom replace operation, the search and replace operation applies to complete values; partial matches are not allowed. For example, you can replace a 3 with a -1 or with 33, but you cannot replace a 3 in a two-digit number such as 35.
For custom replace operations, the replacement fails silently if the replacement value does not conform to the current data type of the column.
If you need to save numerical data that is sparse and has missing values, note that Studio (classic) internally supports sparse arrays by using SparseVector, a class in the Math.NET Numerics library. Prepare your data so that it uses zeros and has missing values, and then use Convert to Dataset with the Action set to SparseOutput and Remove Zeros = TRUE.
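As a rough illustration of what SparseOutput with Remove Zeros = TRUE does to a numeric column, the following Python sketch stores only the entries that survive removal, keyed by their original index. It is neither the Math.NET SparseVector class nor the module's code. The `to_sparse` helper, its `remove_zeros` and `remove` parameters, and the use of `None` for a missing value are hypothetical names chosen for this example.

```python
# Illustrative sketch only: a dictionary-based sparse representation.
# `to_sparse` and its parameters are hypothetical, and None stands in for a
# missing value; this is not the Math.NET SparseVector class.

def to_sparse(values, remove_zeros=True, remove=None):
    """Map index -> value, skipping exact matches of `remove` and, optionally, zeros."""
    sparse = {}
    for index, value in enumerate(values):
        if value == remove:              # exact match only (default: missing values)
            continue
        if remove_zeros and value == 0:  # drop zeros when requested
            continue
        sparse[index] = value
    return sparse


column = [0, 3.5, None, 0, 7, 0]

compact = to_sparse(column)                         # zeros and missing values dropped
with_zeros = to_sparse(column, remove_zeros=False)  # only missing values dropped

# Exact matching: removing "x" leaves "xx" in place.
letters = to_sparse(["x", "xx"], remove="x")

print(compact)
print(with_zeros)
print(letters)
```

Keeping the original index as the key means the position of each surviving value is preserved, which is the property a sparse vector needs in order to be expanded back to its full length.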
Expected inputs
Name | Type | Description |
---|---|---|
Dataset | Data Table | Input dataset |
Module parameters
Name | Range | Type | Default | Description |
---|---|---|---|---|
Action | List | Action Method | None | Action to apply to input dataset |
Output
Name | Type | Description |
---|---|---|
Results dataset | Data Table | Output dataset |