Import from Web URL via HTTP

Important

Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.

Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.

ML Studio (classic) documentation is being retired and may not be updated in the future.

This article describes how to use the Import Data module in Machine Learning Studio (classic) to read data from a public web page for use in a machine learning experiment.

Note

Applies to: Machine Learning Studio (classic) only

Similar drag-and-drop modules are available in Azure Machine Learning designer.

The following restrictions apply to data published on a web page:

  • Data must be in one of the supported formats: CSV, TSV, ARFF, or SvmLight. Data in any other format causes errors.
  • No authentication is required or supported. Data must be publicly available.
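
The restrictions above can be checked locally before you configure the module. The following Python sketch is only an illustration: the function name `check_public_url` is hypothetical, and the checks (an http or https scheme, a host name, and no credentials embedded in the URL) are assumptions about what a reasonable pre-check looks like, not a description of how Import Data validates its input.

```python
from urllib.parse import urlparse

def check_public_url(url):
    """Return a list of problems found; an empty list means no obvious issue.

    Hypothetical local pre-check, not part of Studio (classic).
    """
    problems = []
    parts = urlparse(url)
    # Assumption: the Web URL via HTTP source expects an http or https URL.
    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parts.netloc:
        problems.append("missing host name")
    # Authentication is not supported, so credentials in the URL cannot help.
    if parts.username or parts.password:
        problems.append("embedded credentials are not supported")
    return problems

print(check_public_url(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"))
print(check_public_url("ftp://user:secret@example.com/data.csv"))
```

A URL that passes this check can still fail to load, for example if the server requires a login or the file is in an unsupported format.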

How to import data via HTTP

There are two ways to get data: use the wizard to set up the data source, or configure it manually.

Use the Data Import Wizard

  1. Add the Import Data module to your experiment. You can find the module in Studio (classic), in the Data Input and Output category.

  2. Click Launch Import Data Wizard and select Web URL via HTTP.

  3. Paste in the URL, and select a data format.

  4. When configuration is complete, right-click the module, and select Run Selected.

To edit an existing data connection, start the wizard again. The wizard loads all previous configuration details so that you don't have to start again from scratch.

Manually set properties in the Import Data module

The following steps describe how to manually configure the import source.

  1. Add the Import Data module to your experiment. You can find the module in Studio (classic), in the Data Input and Output category.

  2. For Data source, select Web URL via HTTP.

  3. For URL, type or paste the full URL of the page that contains the data you want to load.

    The URL should include the site URL and the full path, with file name and extension, to the page that contains the data to load.

    For example, the following page contains the Iris data set from the machine learning repository of the University of California, Irvine:

    https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

  4. For Data format, select one of the supported data formats from the list.

    We recommend that you always check the data beforehand to determine the format. The UC Irvine page uses the CSV format. Other supported data formats are TSV, ARFF, and SvmLight.

  5. If the data is in CSV or TSV format, use the File has header row option to indicate whether or not the source data includes a header row. The header row is used to assign column names.

  6. Select the Use cached results option if you don't expect the data to change much, or if you want to avoid reloading the data each time you run the experiment.

    When this option is selected, the experiment loads the data the first time the module is run, and thereafter uses a cached version of the dataset.

    If you want to re-load the dataset each time the experiment runs, deselect the Use cached results option. Results are also re-loaded if there are any changes to the parameters of Import Data.

  7. Run the experiment.
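
Step 4 recommends checking the data beforehand to determine its format. One way to do that is to download a few lines and inspect them. The Python sketch below is a rough heuristic under stated assumptions: the function `guess_format` is hypothetical, it looks only at the text you pass in, and it does not reproduce how Import Data parses any of these formats.

```python
import re

# Assumption: an SvmLight feature token looks like "<index>:<value>".
SVMLIGHT_FEATURE = re.compile(r"^\d+:\S+$")

def guess_format(sample):
    """Guess CSV, TSV, ARFF, or SvmLight from a few lines of text."""
    lines = [ln for ln in sample.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    # ARFF files declare the dataset with an @relation line.
    if any(ln.lstrip().lower().startswith("@relation") for ln in lines):
        return "ARFF"
    first = lines[0]
    tokens = first.split()
    # SvmLight rows: a label followed by index:value pairs.
    if len(tokens) > 1 and all(SVMLIGHT_FEATURE.match(t) for t in tokens[1:]):
        return "SvmLight"
    if "\t" in first:
        return "TSV"
    if "," in first:
        return "CSV"
    return "unknown"

# A comma-separated row in the style of the Iris data set.
print(guess_format("5.1,3.5,1.4,0.2,Iris-setosa\n"))
```

Treat the result as a hint only, and confirm it by looking at the file yourself before you select a Data format.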

Results

When complete, click the output dataset and select Visualize to see if the data was imported successfully.

Examples

For examples of machine learning experiments that get data from public web sites, see the Azure AI Gallery.

Technical notes

This section contains implementation details, tips, and answers to frequently asked questions.

Common questions

Can I filter data as it is being read from the source?

No. That option is not supported with this data source.

After reading the data into Machine Learning Studio (classic), you can split the dataset, use sampling, and so forth to get just the rows you want:

  • Write some simple R code in the Execute R Script module to get a portion of the data by rows or columns.

  • Use the Split Data module with a relative expression or a regular expression to isolate the data you want.

  • If you loaded more data than you need, overwrite the cached dataset by reading a new dataset, and saving it with the same name.
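
The idea behind the options above is that filtering happens after the full dataset has been read. The following Python sketch illustrates that logic on a small in-memory sample. It is not the code you would paste into a Studio (classic) module: the function `select_rows` and the sample rows are hypothetical, and only the Python standard library is used.

```python
import csv
import io

def select_rows(csv_text, column_index, wanted_value):
    """Keep only rows whose value in column_index equals wanted_value."""
    reader = csv.reader(io.StringIO(csv_text))
    # Skip blank lines so that indexing never fails on an empty row.
    return [row for row in reader if row and row[column_index] == wanted_value]

# Sample rows in the style of the Iris data set (no header row).
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

setosa = select_rows(sample, 4, "Iris-setosa")
print(len(setosa), "rows kept")
```

The same row or column selection can be written in R for the Execute R Script module, or expressed as a condition in Split Data.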

How can I avoid re-loading the same data unnecessarily?

If your source data changes, you can refresh the dataset and add new data by re-running Import Data.

If you don't want to re-read from the source each time you run the experiment, set the Use cached results option to TRUE. When this option is set to TRUE, the module checks whether the experiment has run previously using the same source and same input options. If a previous run is found, the data in the cache is used, instead of re-loading the data from the source.
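
The caching behavior described above can be pictured as a lookup keyed on the source and the input options. The Python sketch below is a simplified model, not the module's implementation: the names `import_data`, `fake_fetch`, and `_cache`, and the example URL, are all hypothetical.

```python
# Hypothetical in-memory cache keyed on the source and the input options.
_cache = {}

def import_data(url, data_format, has_header, use_cached_results, fetch):
    """Return cached data when allowed and available; otherwise fetch."""
    key = (url, data_format, has_header)
    if use_cached_results and key in _cache:
        return _cache[key]
    data = fetch(url)
    _cache[key] = data
    return data

calls = []

def fake_fetch(url):
    """Stand-in for a download; records each call instead of using the network."""
    calls.append(url)
    return "rows from " + url

u = "https://example.com/data.csv"  # hypothetical URL
import_data(u, "CSV", False, True, fake_fetch)  # first run: fetches
import_data(u, "CSV", False, True, fake_fetch)  # same settings: cache hit
import_data(u, "CSV", True, True, fake_fetch)   # option changed: fetches again
import_data(u, "CSV", True, False, fake_fetch)  # caching off: always fetches
print(len(calls), "downloads")
```

In this model, changing any parameter produces a new key, which matches the statement that results are re-loaded when the parameters of Import Data change.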

Why was an extra row added at the end of my dataset?

If the Import Data module encounters a row of data that is followed by an empty line or a trailing new line character, an extra row is added at the end of the table. This new row contains missing values.

The reason for interpreting a trailing new line as a new row is that Import Data cannot determine the difference between an actual empty line and an empty line that is created by the user pressing ENTER at the end of a file.

Because some machine learning algorithms support missing data and would thus treat this line as a case (which in turn could affect the results), you should use Clean Missing Data to check for missing values (particularly rows that are completely empty), and remove them as needed.

Before you check for empty rows, you might also want to divide the dataset by using Split Data. Doing so keeps rows with partially missing values, which represent actual missing values in the source data, apart from the trailing empty row. Use the Select head N rows option to read the first part of the dataset into a separate container from the last line.
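
The effect of a trailing newline can be reproduced with a few lines of Python. This is a minimal sketch, assuming a naive parser that splits on newline characters. It is not the parser that Import Data uses, and the function names and sample text are hypothetical.

```python
def naive_parse(text):
    """Split on newlines, then commas; a trailing newline yields an empty row."""
    return [line.split(",") for line in text.split("\n")]

def drop_empty_rows(rows):
    """Remove rows in which every cell is empty; keep partially filled rows."""
    return [row for row in rows if any(cell.strip() for cell in row)]

# The second row has one genuinely missing value; the text ends with a newline.
text = "5.1,3.5,Iris-setosa\n4.9,,Iris-setosa\n"

rows = naive_parse(text)
cleaned = drop_empty_rows(rows)
print(len(rows), "rows parsed,", len(cleaned), "rows after cleaning")
```

The row with a single missing cell survives the cleanup, which mirrors the advice above to remove completely empty rows while leaving real missing values in place.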

Why are some characters in my source file not displayed correctly?

Machine Learning Studio (classic) supports UTF-8 encoding. If your source file uses a different encoding, the characters might not be imported correctly.
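
If you control the source file, one option is to convert it to UTF-8 before you publish it. The Python sketch below shows why a mismatch garbles characters and how a re-encoding step might look. The function `reencode_to_utf8` is hypothetical, and the sketch assumes you already know the file's original encoding (Latin-1 in this example).

```python
def reencode_to_utf8(raw, source_encoding):
    """Decode bytes using their real encoding, then encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

original = "Zürich,café"
latin1_bytes = original.encode("latin-1")

# Reading Latin-1 bytes as UTF-8 replaces the accented characters.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Converting first preserves the text.
utf8_bytes = reencode_to_utf8(latin1_bytes, "latin-1")
roundtrip = utf8_bytes.decode("utf-8")

print(misread)
print(roundtrip)
```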

Module parameters

| Name | Range | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Data source | List | Data Source Or Sink | Azure Blob Storage | Data source can be HTTP, FTP, anonymous HTTPS or FTPS, a file in Azure BLOB storage, an Azure table, an Azure SQL Database, an on-premises SQL Server database, a Hive table, or an OData endpoint. |
| URL | any | String | none | URL for HTTP |
| Data format | CSV, TSV, ARFF, SvmLight | Data Format | CSV | File type of HTTP source |
| CSV or TSV has header row | TRUE/FALSE | Boolean | FALSE | Indicates if CSV or TSV file has a header row |
| Use cached results | TRUE/FALSE | Boolean | FALSE | Module executes only if valid cache does not exist. Otherwise, cached data from previous execution is used. |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Results dataset | Data Table | Dataset with downloaded data |

Exceptions

| Exception | Description |
| --- | --- |
| Error 0027 | An exception occurs when two objects have to be the same size, but they are not. |
| Error 0003 | An exception occurs if one or more of the inputs are null or empty. |
| Error 0029 | An exception occurs when an invalid URI is passed. |
| Error 0030 | An exception occurs when it is not possible to download a file. |
| Error 0002 | An exception occurs if one or more parameters could not be parsed or converted from the specified type to the type required by the target method. |
| Error 0048 | An exception occurs when it is not possible to open a file. |
| Error 0046 | An exception occurs when it is not possible to create a directory on the specified path. |
| Error 0049 | An exception occurs when it is not possible to parse a file. |

For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.

For a list of API exceptions, see Machine Learning REST API Error Codes.

See also

Import Data
Export Data
Import from Hive Query
Import from Azure SQL Database
Import from Azure Table
Import from Azure Blob Storage
Import from Data Feed Providers
Import from On-Premises SQL Server Database