Summarize Data
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
Generates a basic descriptive statistics report for the columns in a dataset
Category: Statistical Functions
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
Module overview
This article describes how to use the Summarize Data module in Machine Learning Studio (classic), to create a set of standard statistical measures that describe each column in the input table.
Such summary statistics are useful when you want to understand the characteristics of the complete dataset. For example, you might need to know:
- How many missing values are there in each column?
- How many unique values are there in a feature column?
- What is the mean and standard deviation for each column?
The module calculates the important scores for each column, and returns a row of summary statistics for each variable (data column) provided as input.
Tip
You might already know that you can get a short list of statistics by using the Visualize option in Studio (classic). However, this visualization is created based on some top number of rows. In contrast, the Summarize Data module computes its statistics on all rows of data.
How to use Summarize Data
Add the Summarize Data module to your experiment. You can find this module in the Statistical Functions category in Studio (classic).
Connect the dataset for which you want to generate a report.
If you want to report on only some columns, use the Select Columns in Dataset module to project a subset of columns to work with.
No additional parameters are required. By default, the module analyzes all columns that are provided as input, and depending on the type of values in the columns, outputs a relevant set of statistics as described in the Results section.
Run the experiment, or right-click the module, and select Run selected.
Results
The report from the module can include the following statistics.
The exact statistics that are generated depend on the column data type. See the Technical notes section for details.
The assumption is made that the instances belong to a representative sample of a population. If you need to compute statistics on a population, use the options in the Compute Elementary Statistics module, which can compute either sample or population statistics.
Column name | Description |
---|---|
Feature | Name of the column |
Count | Count of all rows |
Unique Value Count | Number of unique values in column |
Missing Value Count | Number of unique values in column |
Min | Lowest value in column |
Max | Highest value in column |
Mean | Mean of all column values |
Mean Deviation | Mean deviation of column values |
1st Quartile | Value at first quartile |
Median | Median column value |
3rd Quartile | Value at third quartile |
Mode | Mode of column values |
Range | Integer representing the number of values between the maximum and minimum values |
Sample Variance | Variance for column; see Note |
Sample Standard Deviation | Standard deviation for column; see Note |
Sample Skewness | Skewness for column; see Note |
Sample Kurtosis | Kurtosis for column; see Note |
P0.5 | 0.5% percentile |
P1 | 1% percentile |
P5 | 5% percentile |
P95 | 95% percentile |
P99.5 | 99.5% percentile |
Tip
Output the statistics report as a tabular dataset, so that you can use the data in BI reporting tools, or use the values as input to another operation in the experiment.
Examples
For examples of how to use the Summarize Data module in an experiment, see the Azure AI Gallery:
Download dataset from UCI: Reads a dataset in CSV format by using its URL in the UCI Machine Learning Repository, and generates some basic statistics about the dataset.
Dataset Processing and Analysis: Loads the dataset into the workspace, changes column names, and adds metadata.
Prediction of student performance: Reads data stored in TSV format from Azure Blob storage.
Technical notes
For numeric and Boolean columns, you can output the mean, median, mode, and standard deviation.
For non-numeric columns, only the values for Count, Unique value count, and Missing value count are computed. For other statistics, a null value is returned.
Columns that contain Boolean values are processed using these rules:
When calculating Min, a logical AND is applied.
When calculating Max, a logical OR is applied
When computing Range, the module first checks whether the number of unique values in the column equals 2.
When computing any statistic that requires floating-point calculations, values of True are treated as 1.0, and values of False are treated as 0.0.
Expected inputs
Name | Type | Description |
---|---|---|
Dataset | Data Table | Input dataset |
Output
Name | Type | Description |
---|---|---|
Results dataset | Data Table | A profile of the input dataset that contains descriptive statistics |
Exceptions
Exception | Description |
---|---|
Error 0003 | Exception occurs if one or more inputs are null or empty. |
Error 0020 | Exception occurs if the number of columns in some of the datasets passed to the module is too small. |
Error 0021 | Exception occurs if the number of rows in some of the datasets passed to the module is too small. |
For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.
For a list of API exceptions, see Machine Learning REST API Error Codes.