Replace Discrete Values
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
Replaces discrete values from one column with numeric values based on another column
Category: Statistical Functions
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
Module overview
This article describes how to use the Replace Discrete Values module in Machine Learning Studio (classic), to generate a probability score that can be used to represent a discrete value. This score can be useful for understanding the information value of the discrete values.
How it works:
You select a column that contains the discrete (or categorical) value, and then select another column to use for reference.
Depending on whether the second column is categorical or non-categorical, the module computes one of the following values:
- The conditional probability for the second column given the values in the first column.
- The mean and standard deviation for each group of values in the first column.
The module outputs both a dataset with the scores, and a function that you can save and apply to other datasets.
How to configure Replace Discrete Values
Tip
We recommend working with only one pair of columns at a time. The module does not raise an error if you select multiple columns to analyze. However, in practice, if you choose multiple columns, they are matched by an internal heuristic, not by order of selection.
Therefore, we recommend that you select a single pair of columns each time, one for Discrete columns and one for Replacement columns.
If you need to generate scores for multiple columns, use separate instances of Replace Discrete Values.
Add the Replace Discrete Values module to your experiment. You can find this module in the Statistical Functions group in the experiment items list in Machine Learning Studio (classic).
Connect a dataset containing at least one column of categorical data.
Discrete columns: Click Launch column selector to choose a column that contain discrete (or categorical) values.
Any discrete columns that you select must be categorical. If you get an error, use the Edit Metadata module to change the column type.
Replacement columns: Click Launch column selector to choose the column that contains the values to use in computing a replacement score.
If you select multiple columns for Discrete columns, you must choose an equal number of replacement columns.
Run the experiment.
Note
You cannot choose which statistical function to apply. The module calculates an appropriate measure, based on the data type of the column selected for Replacement column.
Results
The module computes one of the following values for each pair of columns:
If the second column contains categorical values, the module computes the conditional probability of the second column, given the values in the first column.
For example, assume you chose
occupation
from the Census dataset as the discrete column and choosegender
as the replacement column. The output of the module would be the:P(gender | occupation)
If the second column contains non-categorical values that can be converted to numbers (such as numeric or Boolean values not marked as categorical), the module outputs the mean and standard deviation for each group of values in the first column.
For example, assume you use
occupation
as the Discrete column and the other column is the numeric columnhours-per-week
. The module would output these new values:Mean(hours-per-week | occupation)
Std-Dev(hours-per-week | occupation)
In addition to the probability scores, the module also outputs a transformed dataset. In this dataset, the column selected as the Replacement columns is replaced with a column containing the computed scores.
Tip
The columns in the source dataset are not actually changed or deleted by the operation; the score columns are new ones generated by the module and output instead of the source data.
To view the source values together with the probability scores, use the Add Columns module.
Examples
The usage of Replace Discrete Values can be illustrated by some simple examples.
Example 1 - Replace a categorical value with a probability score
The following table contains a categorical column X, and a column Y with True/False values that are treated as categorical values. When you use Replace Discrete Values, it calculates a conditional probability score for the probability of Y given X, as shown in the third column.
X | Y | P(Y|X) |
---|---|---|
Blue | 0 | P(Y=0|X=Blue) = 0.5 |
Blue | 1 | P(Y=1|X=Blue) = 0.5 |
Green | 0 | P(Y=0|X=Green) = 2/3 |
Green | 0 | P(Y=0|X=Green) = 2/3 |
Green | 1 | P(Y=1|X=Green) = 1/3 |
Red | 0 | P(Y=0|X=Red) = .75 |
Red | 0 | P(Y=0|X=Red) = .75 |
Red | 1 | P(Y=1|X=Red) = .25 |
Red | 0 | P(Y=0|X=Red) = .75 |
Example 2 - Calculate mean and standard deviation based on a noncategorical column
When the second column is numerical, Replace Discrete Values calculates the mean and standard deviation instead of a conditional probability score.
The following example is based on the Auto Prices sample dataset, simplified as follows:
A small subset of columns was selected.
Only the top 30 rows were extracted, by using the Head option of the Partition and Sample module.
The Replace Discrete Values module was used to compute the mean and standard deviation for vehicle curb weight. given the categorical column,
num-of-doors
.
The following table illustrates the results:
Body | Num-of-doors | Curb-weight | Mean(curb-weight|num-of-doors) | Std-Dev(curb-weight|num-of-doors) |
---|---|---|---|---|
std | two | 2548 | 2429.785714 | 507.45699 |
std | four | 2337 | 2625.6 | 493.409877 |
std | two | 2507 | 2429.785714 | 507.45699 |
turbo | four | 3086 | 2625.6 5 | 493.409877 |
std | four | 1989 | 2625.6 | 493.409877 |
turbo | 2191 | |||
std | four | 2535 | 2625.6 | 493.409877 |
You can verify the mean for each group of values by using the AVERAGEIF
function in Excel.
Example 3 - Handling missing values
This example demonstrates how missing values (nulls) propagate to the results when conditional probability scores are calculated.
If the discrete value column and the calculation lookup column contains any missing values, the missing values are propagated to the new column.
If the discrete value column contains only missing values, the module cannot process the column and an error message appears.
X | Y | P(Y|X) |
---|---|---|
1 | True | P(Y=true|X=1) = 1/2 |
1 | False | P(Y=false|X=1) = 1/2 |
2 | True | P(Y=true|X=2) = 1/3 |
2 | False | P(Y=false|X=2) = 1/3 |
2 | Null | P(Y=null|X=2) = null |
Technical notes
You must ensure that any discrete columns you want to replace are categorical, or the module will return an error. To do this, use the Edit Metadata module.
If the second column contains Boolean values, the True-False values are processed as numeric with FALSE and TRUE equivalent to 0 and 1 respectively.
The formula for the standard deviation column calculates the population standard deviation. Therefore, N is used in the denominator instead of (N - 1).
If the second column contains noncategorical data (numeric or Boolean values), the module computes the mean and standard deviation of Y for the given value of X.
That is, for each row in the dataset indexed by
i
:Mean(Y│X)i = Mean(Y│X = Xi)
StdDev(Y│X)i = StdDev(Y│X = Xi)
If the second column contains categorical data or values that are neither numeric nor Boolean, the module computes the conditional probability of Y for the given value of X.
Any Boolean values in the second column are processed as numeric data with FALSE and TRUE equivalent to 0 and 1 respectively.
If there is a class in the discrete column, such that a row with a missing value is present in the second column, the sum of conditional probabilities within the class is less than one.
Expected inputs
Name | Type | Description |
---|---|---|
Dataset | Data Table | Input dataset |
Module parameters
Name | Range | Type | Default | Description |
---|---|---|---|---|
Discrete columns | Any | ColumnSelection | Selects the columns that contain discrete values | |
Replacement columns | Any | ColumnSelection | Selects the columns that contain the data to use in place of the discrete values |
Outputs
Name | Type | Description |
---|---|---|
Supplemented dataset | Data Table | Dataset with replaced data |
Transform function | ITransform interface | Definition of the transform function, which can be applied to other datasets |
Exceptions
Exception | Description |
---|---|
Error 0001 | Exception occurs if one or more specified columns of the data set couldn't be found. |
Error 0003 | Exception occurs if one or more of inputs are null or empty. |
Error 0020 | Exception occurs if the number of columns in some of the datasets passed to the module is too small. |
Error 0021 | Exception occurs if the number of rows in some of the datasets passed to the module is too small. |
Error 0017 | Exception occurs if one or more specified columns have a type that is unsupported by the current module. |
Error 0026 | Exception occurs when columns with the same name are not allowed. |
Error 0022 | Exception occurs if the number of selected columns in the input dataset does not equal the expected number. |
For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.
For a list of API exceptions, see Machine Learning REST API Error Codes.