Data preparation for forecasting
This article describes how AutoML prepares data for forecasting training and the data settings you can configure. You can adjust these options during experiment setup in the AutoML UI.
To configure these settings using the AutoML API, refer to the AutoML Python API reference.
Supported data feature types
AutoML supports only the following feature types. Other types, such as images, are not supported.
- Numeric (ByteType, ShortType, IntegerType, LongType, FloatType, and DoubleType)
- Boolean
- String (categorical or English text)
- Timestamps (TimestampType, DateType)
- ArrayType[Numeric] (Databricks Runtime 10.4 LTS ML and above)
- DecimalType (Databricks Runtime 11.3 LTS ML and above)
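As a rough pre-flight check, the list above can be turned into a small validator. The sketch below is not part of AutoML. It assumes you pass in (column, type string) pairs in the shape returned by a Spark DataFrame's dtypes attribute, and the helper names is_supported and unsupported_columns are hypothetical. It does not check the Databricks Runtime version requirements for ArrayType and DecimalType.

```python
import re

# Spark SQL simple type strings for the numeric types listed above
# (ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType).
NUMERIC = {"tinyint", "smallint", "int", "bigint", "float", "double"}

# Scalar types from the supported list: numeric, Boolean, String, timestamps.
SUPPORTED_SCALARS = NUMERIC | {"boolean", "string", "timestamp", "date"}


def is_supported(type_string: str) -> bool:
    """Return True if a Spark simple type string is in the supported list."""
    t = type_string.strip().lower()
    if t in SUPPORTED_SCALARS:
        return True
    # DecimalType renders as "decimal(precision,scale)".
    if re.fullmatch(r"decimal\(\d+,\s*\d+\)", t):
        return True
    # ArrayType[Numeric] renders as "array<element_type>".
    match = re.fullmatch(r"array<(.+)>", t)
    return bool(match) and match.group(1) in NUMERIC


def unsupported_columns(dtypes):
    """Given (column, type_string) pairs, list columns outside the supported types."""
    return [name for name, t in dtypes if not is_supported(t)]


# Example schema; the column names are made up for illustration.
schema = [
    ("ds", "timestamp"),
    ("store_id", "string"),
    ("sales", "double"),
    ("price", "decimal(10,2)"),
    ("lags", "array<double>"),
    ("photo", "binary"),
    ("tags", "array<string>"),
]
print(unsupported_columns(schema))
```

With a real Spark DataFrame, you would likely call unsupported_columns(df.dtypes) before starting an experiment.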
Impute missing values
In Databricks Runtime 10.4 LTS ML and above, you can specify how null values are imputed. In the UI, select a method from the drop-down in the Impute with column in the table schema. In the API, use the imputers parameter. For more information, see the AutoML Python API reference.
By default, AutoML selects an imputation method based on the column type and content.
Note
If you specify a non-default imputation method, AutoML does not perform semantic type detection.
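To make the effect of a fixed imputation method concrete, here is a minimal plain-Python sketch of three common strategies. It is an illustration only: the impute helper and the strategy names "mean", "median", and "constant" are assumptions for this example and are not taken from the AutoML API, so check the API reference for the values the imputers parameter accepts.

```python
from statistics import mean, median


def impute(values, strategy="mean", fill_value=None):
    """Replace None entries in a list using a simple strategy.

    The strategy names are illustrative, not the AutoML API's.
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# A column with two nulls and one outlier (made-up data).
column = [10.0, None, 14.0, 100.0, None]
print(impute(column, "mean"))
print(impute(column, "median"))
print(impute(column, "constant", fill_value=0.0))
```

The outlier pulls the mean well above the median, which is one reason the choice of method can matter for a forecasting target or covariate.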
Split forecasting data into train, validation, and test sets
AutoML splits your data into three sets for training, validation, and testing.
For forecasting tasks, AutoML uses time series cross-validation. This method incrementally extends the training dataset chronologically and performs validation on subsequent time points. Evaluating over several segments of time gives a more robust estimate of a model’s performance, because each fold tests the model against data that comes after its training period.
The number of cross-validation folds depends on input table characteristics such as the number of time series, the presence of covariates, and the time series length.
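The expanding-window idea described above can be sketched in a few lines. This is a generic illustration, not AutoML's implementation: the function expanding_window_splits and its parameters are hypothetical, and AutoML chooses the number of folds and the cutoffs internally.

```python
def expanding_window_splits(n_points, n_folds, horizon):
    """Yield (train_indices, validation_indices) for time series cross-validation.

    Each fold trains on all points before a cutoff and validates on the next
    `horizon` points. Later folds use a later cutoff, so the training window grows.
    """
    first_cutoff = n_points - n_folds * horizon
    if first_cutoff <= 0:
        raise ValueError("not enough points for the requested folds and horizon")
    for fold in range(n_folds):
        cutoff = first_cutoff + fold * horizon
        train = list(range(cutoff))
        validation = list(range(cutoff, cutoff + horizon))
        yield train, validation


# 12 chronologically ordered observations, 3 folds, 2-step validation window.
for train, validation in expanding_window_splits(n_points=12, n_folds=3, horizon=2):
    print(f"train 0..{train[-1]}  validate {validation}")
```

In every fold the validation indices are strictly later than the training indices, so no future information leaks into training.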
Time series aggregation
For forecasting problems, when there are multiple values for a timestamp in a time series, AutoML uses the average of the values.
To use the sum instead, edit the source code notebook generated by the trial runs. In the Aggregate data by … cell, change .agg(y=(target_col, "avg")) to .agg(y=(target_col, "sum")), as shown:
# Sum, rather than average, the values that share a timestamp within each series
group_cols = [time_col] + id_cols
df_aggregation = df_loaded \
  .groupby(group_cols) \
  .agg(y=(target_col, "sum")) \
  .reset_index() \
  .rename(columns={ time_col : "ds" })
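To see how the two aggregations differ, the following plain-Python sketch collapses duplicate timestamps both ways. It is independent of the generated notebook: the aggregate helper and the sample rows are made up for illustration.

```python
from collections import defaultdict


def aggregate(rows, how="avg"):
    """Collapse rows sharing (timestamp, series id) into a single value."""
    groups = defaultdict(list)
    for ts, series_id, value in rows:
        groups[(ts, series_id)].append(value)
    if how == "avg":
        return {key: sum(vals) / len(vals) for key, vals in groups.items()}
    if how == "sum":
        return {key: sum(vals) for key, vals in groups.items()}
    raise ValueError(f"unsupported aggregation: {how}")


# Two readings for series A on 2024-01-01, one on 2024-01-02 (made-up data).
rows = [
    ("2024-01-01", "A", 10.0),
    ("2024-01-01", "A", 30.0),
    ("2024-01-02", "A", 5.0),
]
print(aggregate(rows, "avg"))
print(aggregate(rows, "sum"))
```

Averaging suits measurements such as prices or temperatures, while summing is usually the better fit when each row is a partial count, such as individual sales transactions.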