Fast Forest Quantile Regression
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
Creates a quantile regression model
Category: Machine Learning / Initialize Model / Regression
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
Module overview
This article describes how to use the Fast Forest Quantile Regression module in Machine Learning Studio (classic), to create a regression model that can predict values for a specified number of quantiles.
Quantile regression is useful if you want to understand more about the distribution of the predicted value, rather than get a single mean prediction value. This method has many applications, including:
- Predicting prices
- Estimating student performance or applying growth charts to assess child development
- Discovering predictive relationships in cases where there is only a weak relationship between variables
This regression algorithm is a supervised learning method, which means it requires a tagged dataset that includes a label column. Because it is a regression algorithm, the label column must contain only numerical values.
More about quantile regression
There are many different types of regression. In the most basic sense, regression means fitting a model to a target expressed as a numeric vector. However, statisticians have been developing increasingly advanced methods for regression.
The simplest definition of quantile is a value that divides a set of data into equal-sized groups; thus, the quantile values mark the boundaries between groups. Statistically speaking, quantiles are values taken at regular intervals from the inverse of the cumulative distribution function (CDF) of a random variable.
Whereas linear regression models attempt to predict the value of a numeric variable using a single estimate, the mean, sometimes you need to predict the range or entire distribution of the target variable. Techniques such as Bayesian regression and quantile regression have been developed for this purpose.
Quantile regression helps you understand the distribution of the predicted value. Tree-based quantile regression models, such as the one used in this module, have the additional advantage that they can be used to predict non-parametric distributions.
For additional implementation details and resources, see the Technical Notes section.
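To make the definition of a quantile concrete, here is a minimal sketch using plain NumPy (an assumption outside Studio (classic), shown only for illustration). It evaluates the empirical inverse CDF of a synthetic sample at 0.25, 0.5, and 0.75; the three boundary values split the data into four roughly equal-sized groups.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=10_000)  # synthetic sample

# The 0.25, 0.5, and 0.75 quantiles (quartiles) are the empirical
# inverse CDF evaluated at those probabilities.
quartiles = np.quantile(values, [0.25, 0.5, 0.75])
print(quartiles)

# Roughly 25%, 50%, and 75% of the sample falls at or below each boundary.
print([round((values <= q).mean(), 3) for q in quartiles])
```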
How to configure Fast Forest Quantile Regression
You configure the properties of the regression model using this module, and then train it using one of the training modules.
Configuration steps differ considerably depending on whether you are providing a fixed set of parameters or setting up a parameter sweep.
- To create a quantile regression model using fixed parameters
- To create a quantile regression model using a parameter sweep
Create a quantile regression model using fixed parameters
Assuming you know how you want to configure the model, you can provide a specific set of values as arguments. When you train the model, use Train Model.
Add the Fast Forest Quantile Regression module to your experiment in Studio (classic).
Set the Create trainer mode option to Single Parameter.
For Number of Trees, type the maximum number of trees that can be created in the ensemble. Creating more trees generally increases accuracy, but at the cost of longer training time.
For Number of Leaves, type the maximum number of leaves, or terminal nodes, that can be created in any tree.
For Minimum number of training instances required to form a leaf, specify the minimum number of examples that are required to create any terminal node (leaf) in a tree.
By increasing this value, you increase the threshold for creating new rules. For example, with the default value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the training data would have to contain at least 5 cases that meet the same conditions.
For Bagging fraction, specify a number between 0 and 1 that represents the fraction of samples to use when building each group of quantiles. Samples are chosen randomly, with replacement.
For Feature fraction, type a number between 0 and 1 that indicates the fraction of total features to use when building any particular tree. Features are always chosen randomly.
For Split fraction, type a number between 0 and 1 that represents the fraction of features to use in each split of the tree. The features used are always chosen randomly.
For Quantile sample count, type the number of cases to evaluate when estimating the quantiles.
For Quantiles to be estimated, type a comma-separated list of the quantiles for which you want the model to train and create predictions.
For example, if you want to build a model that estimates quartiles, you would type `0.25, 0.5, 0.75`.
Optionally, type a value for Random number seed to seed the random number generator used by the model. The default is 0, meaning a random seed is chosen.
You should provide a value if you need to reproduce results across successive runs on the same data.
Select the Allow unknown categorical levels option to create a group for unknown values.
If you deselect it, the model can accept only the values that are contained in the training data.
If you select this option, the model might be less precise for known values, but it can provide better predictions for new (unknown) values.
Connect a training dataset, select a single label column, and connect Train Model.
Run the experiment.
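If you prefer to see comparable settings in code, the following is a minimal open-source sketch, not the Studio (classic) module or the Fast Forest algorithm itself. It uses scikit-learn's GradientBoostingRegressor with the quantile (pinball) loss and fits one model per requested quantile. The comments map the arguments loosely to the module options described above; the mapping is approximate (for example, scikit-learn's subsample draws without replacement, whereas the module's Bagging fraction samples with replacement).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

quantiles = [0.25, 0.5, 0.75]           # ~ "Quantiles to be estimated"
models = {}
for q in quantiles:
    models[q] = GradientBoostingRegressor(
        loss="quantile", alpha=q,       # optimize the pinball loss for quantile q
        n_estimators=100,               # ~ "Number of Trees"
        max_leaf_nodes=20,              # ~ "Number of Leaves"
        min_samples_leaf=10,            # ~ "Minimum number of training instances required to form a leaf"
        subsample=0.7,                  # ~ "Bagging fraction" (here: sampled without replacement)
        max_features=0.7,               # ~ "Feature fraction"
        random_state=0,                 # ~ "Random number seed"
    ).fit(X_train, y_train)

# Each model predicts one quantile of the target for every test row,
# giving a lower, median, and upper estimate per instance.
preds = {q: m.predict(X_test[:5]) for q, m in models.items()}
print(preds)
```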
Use a parameter sweep to create a quantile regression model
If you are not sure of the optimal parameters for the model, you can configure a parameter sweep, and provide a range of values as arguments. When you train the model, use the Tune Model Hyperparameters module.
Add the Fast Forest Quantile Regression module to your experiment in Studio (classic).
Set the Create trainer mode option to Parameter Range.
A parameter sweep is recommended if you are not sure of the best parameters. By specifying multiple values and using the Tune Model Hyperparameters module to train the model, you can find the optimal set of parameters for your data.
After choosing a parameter sweep, for each property that is tunable, you can set either a single value, or multiple values. For example, you might decide to fix the number of trees, but randomly change other values that control the way each tree is built.
- If you type a single value, that value is used across all iterations of the sweep, even if other values change.
- Type a comma-separated list of discrete values to use. These values are used in combination with other properties.
- Use the Range Builder to define a range of continuous values.
During the training process, the Tune Model Hyperparameters module iterates over various combinations of the values to build the best model.
For Maximum number of leaves per tree, type the total number of leaves, or terminal nodes, to allow in each tree.
For Number of trees constructed, type the number of iterations to perform when constructing the ensemble. By creating more trees, you can potentially get better coverage, at the expense of increased training time.
For Minimum number of samples per leaf node, indicate how many cases are required to create a leaf node.
By increasing this value, you increase the threshold for creating new rules. For example, with the default value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the training data would have to contain at least 5 cases that meet the same conditions.
In Range for bagging fraction, type the fraction of samples to use when building each group of quantiles. Samples are chosen randomly, with replacement.
Each fraction should be a number between 0 and 1; separate multiple fractions by using commas.
In Range for feature fraction, type the fraction of total features to use when building each group of quantiles. Features are chosen randomly.
Each fraction should be a number between 0 and 1; separate multiple fractions by using commas.
In Range for split fraction, specify the fraction of features to use in each split. The actual features used are chosen randomly.
Each fraction should be a number between 0 and 1; separate multiple fractions by using commas.
In Sample count used to estimate the quantiles, indicate how many samples should be evaluated when estimating the quantiles. If you type a number greater than the number of available samples, all samples are used.
In Required quantile values, type a comma-separated list of the quantiles on which you want the model to train. For example, if you want to build a model that estimates quartiles, you would type `0.25, 0.5, 0.75`.
In Random number seed, type a value to seed the random number generator used by the model. Specifying a seed is useful when you need to reproduce the same results across runs.
The default is 0, meaning a random seed is chosen.
Select the Allow unknown values for categorical features option to create a group for unknown values in the training or validation sets.
If you deselect this option, the model can accept only the values that are contained in the training data.
If you select this option, the model might be less precise for known values, but it can provide better predictions for new (unknown) values.
Connect a training dataset, select the label column, and connect the Tune Model Hyperparameters module.
Note
Do not use Train Model. If you configure a parameter range but train using Train Model, it uses only the first value in the parameter range list.
Run the experiment.
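As a rough open-source analog of the parameter sweep, and again assuming scikit-learn rather than Studio (classic), the sketch below uses GridSearchCV to search similar ranges and scores each candidate with the pinball loss at the median. It stands in for Tune Model Hyperparameters only conceptually; it is not the same sweep logic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_pinball_loss
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=2000, n_features=10, noise=15.0, random_state=0)

# Score candidates by the pinball (quantile) loss at the median; lower is better.
median_pinball = make_scorer(mean_pinball_loss, alpha=0.5, greater_is_better=False)

# An exhaustive grid over ranges similar to the module defaults; trim it for speed.
param_grid = {
    "n_estimators": [16, 32, 64],       # ~ "Number of trees constructed"
    "max_leaf_nodes": [16, 32, 64],     # ~ "Maximum number of leaves per tree"
    "min_samples_leaf": [1, 5, 10],     # ~ "Minimum number of samples per leaf node"
    "subsample": [0.5, 0.75],           # ~ "Range for bagging fraction"
    "max_features": [0.5, 0.75],        # ~ "Range for feature fraction"
}

search = GridSearchCV(
    GradientBoostingRegressor(loss="quantile", alpha=0.5, random_state=0),
    param_grid,
    scoring=median_pinball,
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)  # roughly analogous to inspecting the sweep output
```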
Results
After training is complete:
- To see the final hyperparameters of the optimized model, right-click the output of Tune Model Hyperparameters and select Visualize.
Examples
For examples of how to use this module, see the Azure AI Gallery:
- Quantile Regression: Demonstrates how to build and interpret a quantile regression model, using the auto price dataset.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Implementation details
The Fast Forest Quantile Regression module in Machine Learning is an implementation of random forest quantile regression using decision trees. Random forests help avoid the overfitting that can occur with individual decision trees. A decision tree is a binary tree-like flow chart: at every interior node, you decide which of the two child nodes to continue to, based on the value of one of the features of the input.
In each leaf node, a value is returned. At the interior nodes, the decision is based on the test `x ≤ v`, where x is the value of a feature in the input sample and v is one of the possible values of this feature. The functions that a regression tree can produce are piecewise-constant functions.
In a random forest, an ensemble of trees is created by using bagging to select a random subset of samples and features of the training data, and then fitting a decision tree to each subset. Unlike the standard random forest algorithm, which averages the outputs of all the trees, Fast Forest Quantile Regression keeps all the predicted labels in the trees specified by the parameter Quantile sample count and outputs the distribution, so that the user can view the quantile values for the given instance.
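The following is a simplified sketch of that idea, assuming scikit-learn and NumPy rather than the module's own implementation: it fits an ordinary random forest, maps each test row to the leaf it reaches in every tree, pools the training labels stored in those leaves, and reads quantiles off the pooled empirical distribution. A full quantile regression forest (Meinshausen) also weights observations by leaf size, which this sketch omits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=0)
forest.fit(X_train, y_train)

# Leaf index reached by every sample in every tree: shape (n_samples, n_trees).
train_leaves = forest.apply(X_train)
test_leaves = forest.apply(X_test[:5])

def predict_quantiles(test_leaf_row, quantiles=(0.25, 0.5, 0.75)):
    """Pool the training labels that share a leaf with the test row in each
    tree, then read the requested quantiles off that empirical distribution."""
    pooled = np.concatenate([
        y_train[train_leaves[:, t] == leaf_id]
        for t, leaf_id in enumerate(test_leaf_row)
    ])
    return np.quantile(pooled, quantiles)

for row in test_leaves:
    print(predict_quantiles(row))
```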
Related research
For more information about quantile regression, see these books and articles:
- Quantile Regression Forests. Nicolai Meinshausen. http://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf
- Random Forests. Leo Breiman.
Module parameters
Name | Type | Range | Optional | Default | Description |
---|---|---|---|---|---|
Create trainer mode | CreateLearnerMode | List: Single Parameter\|Parameter Range | Required | Single Parameter | Create advanced learner options |
Number of Trees | Integer | | mode: Single Parameter | 100 | Specify the number of trees to be constructed |
Number of Leaves | Integer | | mode: Single Parameter | 20 | Specify the maximum number of leaves per tree. The default number is 20 |
Minimum number of training instances required to form a leaf | Integer | | mode: Single Parameter | 10 | Indicates the minimum number of training instances required to form a leaf |
Bagging fraction | Float | | mode: Single Parameter | 0.7 | Specifies the fraction of training data to use for each tree |
Feature fraction | Float | | mode: Single Parameter | 0.7 | Specifies the fraction of features (chosen randomly) to use for each tree |
Split fraction | Float | | mode: Single Parameter | 0.7 | Specifies the fraction of features (chosen randomly) to use for each split |
Quantile sample count | Integer | Max: 2147483647 | mode: Single Parameter | 100 | Specifies the number of instances used in each node to estimate quantiles |
Quantiles to be estimated | String | | mode: Single Parameter | "0.25;0.5;0.75" | Specifies the quantiles to be estimated |
Random number seed | Integer | | Optional | | Provide a seed for the random number generator used by the model. Leave blank for default. |
Allow unknown categorical levels | Boolean | | Required | true | If true, create an additional level for each categorical column. Levels in the test dataset not available in the training dataset are mapped to this additional level. |
Maximum number of leaves per tree | ParameterRangeSettings | [16;128] | mode: Parameter Range | 16; 32; 64 | Specify the range for the maximum number of leaves allowed per tree |
Number of trees constructed | ParameterRangeSettings | [1;256] | mode: Parameter Range | 16; 32; 64 | Specify the range for the maximum number of trees that can be created during training |
Minimum number of samples per leaf node | ParameterRangeSettings | [1;10] | mode: Parameter Range | 1; 5; 10 | Specify the range for the minimum number of cases required to form a leaf |
Range for bagging fraction | ParameterRangeSettings | [0.25;1.0] | mode: Parameter Range | 0.25; 0.5; 0.75 | Specifies the range for the fraction of training data to use for each tree |
Range for feature fraction | ParameterRangeSettings | [0.25;1.0] | mode: Parameter Range | 0.25; 0.5; 0.75 | Specifies the range for the fraction of features (chosen randomly) to use for each tree |
Range for split fraction | ParameterRangeSettings | [0.25;1.0] | mode: Parameter Range | 0.25; 0.5; 0.75 | Specifies the range for the fraction of features (chosen randomly) to use for each split |
Sample count used to estimate the quantiles | Integer | | mode: Parameter Range | 100 | Sample count used to estimate the quantiles |
Required quantile values | String | | mode: Parameter Range | "0.25;0.5;0.75" | Required quantile values used during the parameter sweep |
Outputs
Name | Type | Description |
---|---|---|
Untrained model | ILearner interface | An untrained quantile regression model that can be connected to the Train Generic Model or Cross Validate Model modules. |