Forecasting with AutoML (classic compute)

Artikkeli
12/05/2024

Use AutoML to automatically finding the best forecasting algorithm and hyperparameter configuration to predict values based on time-series data.

Time series forecasting is only available for Databricks Runtime 10.0 ML or above.

Set up forecasting experiment with the UI

You can set up a forecasting problem using the AutoML UI with the following steps:

In the sidebar, select Experiments.
In the Forecasting card, select Start training.

The forecasting UI defaults to serverless forecasting. To access forecasting with your own compute, select revert back to the old experience.

Configure the AutoML experiment

The Configure AutoML experiment page displays. On this page, you configure the AutoML process, specifying the dataset, problem type, target or label column to predict, metric to use to evaluate and score the experiment runs, and stopping conditions.
In the Compute field, select a cluster running Databricks Runtime 10.0 ML or above.
Under Dataset, click Browse. Navigate to the table you want to use and click Select. The table schema appears.
Click in the Prediction target field. A dropdown menu appears, listing the columns shown in the schema. Select the column you want the model to predict.
Click in the Time column field. A drop-down appears showing the dataset columns that are of type timestamp or date. Select the column containing the time periods for the time series.
For multi-series forecasting, select the column(s) that identify the individual time series from the Time series identifiers drop-down. AutoML groups the data by these columns as different time series and trains a model for each series independently. If you leave this field blank, AutoML assumes that the dataset contains a single time series.
In the Forecast horizon and frequency fields, specify the number of time periods into the future for which AutoML should calculate forecasted values. In the left box, enter the integer number of periods to forecast. In the right box, select the units.

Note

To use Auto-ARIMA, the time series must have a regular frequency where the interval between any two points must be the same throughout the time series. The frequency must match the frequency unit specified in the API call or in the AutoML UI. AutoML handles missing time steps by filling in those values with the previous value.
In Databricks Runtime 11.3 LTS ML and above, you can save prediction results. To do so, specify a database in the Output Database field. Click Browse and select a database from the dialog. AutoML writes the prediction results to a table in this database.
The Experiment name field shows the default name. To change it, type the new name in the field.

You can also:

Advanced configurations

Open the Advanced Configuration (optional) section to access these parameters.

The evaluation metric is the primary metric used to score the runs.
In Databricks Runtime 10.4 LTS ML and above, you can exclude training frameworks from consideration. By default, AutoML trains models using frameworks listed under AutoML algorithms.
You can edit the stopping conditions. Default stopping conditions are:
- For forecasting experiments, stop after 120 minutes.
- In Databricks Runtime 10.4 LTS ML and below, for classification and regression experiments, stop after 60 minutes or after completing 200 trials, whichever happens first. For Databricks Runtime 11.0 ML and above, the number of trials is not used as a stopping condition.
- In Databricks Runtime 10.4 LTS ML and above, for classification and regression experiments, AutoML incorporates early stopping; it stops training and tuning models if the validation metric is no longer improving.
In Databricks Runtime 10.4 LTS ML and above, you can select a time column to split the data for training, validation, and testing in chronological order (applies only to classification and regression).
Databricks recommends not populating the Data directory field. Doing so triggers the default behavior of securely storing the dataset as an MLflow artifact. A DBFS path can be specified, but in this case, the dataset does not inherit the AutoML experiment’s access permissions.

Run the experiment and monitor the results

To start the AutoML experiment, click Start AutoML. The experiment starts to run, and the AutoML training page appears. To refresh the runs table, click .

View experiment progress

From this page, you can:

Stop the experiment at any time.
Open the data exploration notebook.
Monitor runs.
Navigate to the run page for any run.

With Databricks Runtime 10.1 ML and above, AutoML displays warnings for potential issues with the dataset, such as unsupported column types or high cardinality columns.

Note

Databricks does its best to indicate potential errors or issues. However, this may not be comprehensive and may not capture the issues or errors you may be searching for.

To see any warnings for the dataset, click the Warnings tab on the training page or the experiment page after the experiment completes.

AutoML warnings

View results

When the experiment completes, you can:

Register and deploy one of the models with MLflow.
Select View notebook for best model to review and edit the notebook that created the best model.
Select View data exploration notebook to open the data exploration notebook.
Search, filter, and sort the runs in the runs table.
See details for any run:
- The generated notebook containing source code for a trial run can be found by clicking into the MLflow run. The notebook is saved in the Artifacts section of the run page. You can download this notebook and import it into the workspace, if downloading artifacts is enabled by your workspace administrators.
- To view the run results, click in the Models column or the Start Time column. The run page appears, showing information about the trial run (such as parameters, metrics, and tags) and artifacts created by the run, including the model. This page also includes code snippets that you can use to make predictions with the model.

To return to this AutoML experiment later, find it in the table on the Experiments page. The results of each AutoML experiment, including the data exploration and training notebooks, are stored in a databricks_automl folder in the home folder of the user who ran the experiment.

Register and deploy a model

You can register and deploy your model with the AutoML UI:

Select the link in the Models column for the model to register. When a run completes, the top row is the best model (based on the primary metric).
Select to register the model in Model Registry.
Select Models in the sidebar to navigate to the Model Registry.
Select the name of your model in the model table.
From the registered model page, you can serve the model with Model Serving.

No module named ‘pandas.core.indexes.numeric

When serving a model built using AutoML with Model Serving, you may get the error: No module named 'pandas.core.indexes.numeric.

This is due to an incompatible pandas version between AutoML and the model serving endpoint environment. You can resolve this error by running the add-pandas-dependency.py script. The script edits the requirements.txt and conda.yaml for your logged model to include the appropriate pandas dependency version: pandas==1.5.3

Modify the script to include the run_id of the MLflow run where your model was logged.
Re-registering the model to the MLflow model registry.
Try serving the new version of the MLflow model.

Jaa