Apply SQL Transformation
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
Runs a SQLite query on input datasets to transform the data
Category: Data Transformation / Manipulation
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
Module overview
This article describes how to use the Apply SQL Transformation module in Machine Learning Studio (classic), to specify a SQL query on an input dataset or datasets.
SQL is handy when you need to modify your data in complex ways, or persist the data for use in other environments. For example, using the Apply SQL Transformation module, you can:
Create tables for results and save the datasets in a portable database.
Perform custom transformations on data types, or create aggregates.
Execute SQL query statements to filter or alter data and return the query results as a data table.
Important
The SQL engine used in this module is SQLite. If you are unfamiliar with SQLite syntax, be sure to read the syntax and usage section of this article for examples.
What is SQLite?
SQLite is a public domain relational database management system that is contained in a C programming library. SQLite is a popular choice as an embedded database for local storage in web browsers.
SQLite was originally designed in 2000 for the U.S. Navy, to support serverless transactions. It is a self-contained database engine that has no management system and hence requires no configuration or administration.
How to configure Apply SQL Transformation
The module can take up to three datasets as inputs. When you reference the datasets connected to each input port, you must use the names t1
, t2
, and t3
. The table number indicates the index of the input port.
The remaining parameter is a SQL query, which uses the SQLite syntax. This module supports all standard statements of the SQLite syntax. For a list of unsupported statements, see the Technical Notes section.
General syntax and usage
When typing multiple lines in the SQL Script text box, use a semi-colon to terminate each statement. Otherwise, line breaks are converted to spaces.
For example, the following statements are equivalent:
SELECT * from t1;
SELECT * from t1;
You can add comments by using either
--
at the beginning of each line, or by enclosing text using/* */
.For example, this statement is valid:
SELECT * from t1 /*WHERE ItemID BETWEEN 1 AND 100*/;
If a column name duplicates the name of a reserved keyword, syntax highlighting is applied to the text inside the SQL Script text box. To avoid confusion, you should enclose column names with square brackets (to follow the Transact-SQL convention) or backticks or double quotation marks (the ANSI SQL convention).
For example, in the following query on the Blood Donation dataset, Time is a valid column name but is also a reserved keyword.
SELECT Recency, Frequency, Monetary, Time, Class FROM t1 WHERE Time between 3 and 20;
If you run the query as is, the query might return the correct results, but depending on the dataset, it might return an error. Here are some examples of how to avoid the issue:
-- Transact-SQL SELECT [Recency], [Frequency], [Monetary], [Time], [Class] FROM t1 WHERE [Time] between 3 and 20; -- ANSI SQL SELECT "Recency", "Frequency", "Monetary", "Time", "Class" FROM t1 WHERE `Time` between 3 and 20;
Note
Syntax highlighting remains on the keyword even after it is enclosed in quotes or brackets.
SQLite is case insensitive, except for a few commands that have case-sensitive variants with different meanings (GLOB vs. glob).
SELECT statement
In the SELECT
statement, column names that include spaces or other characters prohibited in identifiers must be enclosed in double quotation marks, square brackets, or backtick characters (`).
For example, this query references the Two-Class Iris dataset on t1
, but one column name contains a prohibited character, so the column name is enclosed in quotation marks.
SELECT class, "sepal-length" FROM t1;
You can add a WHERE
clause to filter values in the dataset.
SELECT class, "sepal-length" FROM t1 WHERE "sepal-length" >5.0;
The SQLite syntax does not support the TOP
keyword, which is used in Transact-SQL. Instead, you can use the LIMIT
keyword, or a FETCH
statement.
For example, compare these queries on the Bike Rental dataset.
-- unsupported in SQLite
SELECT TOP 100 [dteday] FROM t1 ;
ORDER BY [dteday] DESC;
-- Returns top 100
SELECT [dteday] FROM t1 LIMIT 100 ;
ORDER BY [dteday] DESC;
-- Returns top 100. Note that FETCH is on a new line.
SELECT [dteday] FROM t1 - ;
FETCH FIRST 100 rows ONLY;
ORDER BY [dteday] DESC;
Joins
The following examples use the Restaurant Ratings dataset on the input port corresponding to t1
, and the Restaurant Features dataset on the input port corresponding to t2
.
The following statement joins the two tables to create a dataset that combines the specified restaurant features with average ratings for each restaurant.
SELECT DISTINCT(t2.placeid),
t2.name, t2.city, t2.state, t2.price, t2.alcohol,
AVG(rating) AS 'AvgRating'
FROM t1
JOIN t2
ON t1.placeID = t2.placeID
GROUP BY t2.placeid;
Aggregate functions
This section provides basic examples of some common SQL aggregate functions, using SQLite.
Aggregate functions currently supported are: AVG
, COUNT
, MAX
, MIN
, SUM
, TOTAL
.
The following query returns a dataset containing the restaurant ID, along with the average rating for the restaurant.
SELECT DISTINCT placeid,
AVG(rating) AS ‘AvgRating’,
FROM t1
GROUP BY placeid
Working with strings
SQLite supports the double pipe operator for concatenating strings.
The following statement creates a new column by concatenating two text columns.
SELECT placeID, name,
(city || '-' || state) AS 'Target Region',
FROM t1
Warning
The Transact-SQL string concatenation operator is not supported: + (String Concatenation). For example, the expression ('city + '-' + state) AS 'Target Region'
in the example query would return 0 for all values.
However, even though the operator is not supported for this data type, no error is raised in Machine Learning. Be sure to verify the results of Apply SQL Transformation before using the resulting dataset in an experiment.
COALESCE and CASE
COALESCE
evaluates multiple arguments, in order, and returns the value of the first expression that does not evaluate to NULL.
For example, this query on the Steel Annealing Multi-Class dataset returns the first non-null flag from a list of columns assumed to have mutually exclusive values. If no flag is found, the string “none” is returned.
SELECT classes, family, [product-type],
COALESCE(bt,bc,bf,[bw/me],bl, "none") AS TemperType
FROM t1;
The CASE
statement is useful for testing values and returning a new value based on the evaluated results. SQLite supports the following syntax for CASE
statements:
CASE WHEN [condition] THEN [expression] ELSE [expression] END
CASE [expression] WHEN [value] THEN [expression] ELSE [expression] END
For example, suppose you had previously used the Convert to Indicator Values module to create a set feature columns containing true-false values. The following query collapses the values in multiple feature columns into a single multivalued column.
SELECT userID, [smoker-0], [smoker-1],
CASE
WHEN [smoker-0]= '1' THEN 'smoker'
WHEN [smoker-1]= '1' THEN 'nonsmoker'
ELSE 'unknown'
END AS newLabel
FROM t1;
Examples
For an example of how this module might be used in machine learning experiments, see this sample in the Azure AI Gallery:
- Apply SQL Transformation: Uses the Restaurant Ratings, Restaurant Features, and Restaurant Customers dataset to illustrate simple joins, select statements, and aggregate functions.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
An input is always required on port 1.
If the input dataset has column names, the columns in the output dataset will use the column names from the input dataset.
If the input dataset does not have column names, the column names in the table are automatically created by using the following naming convention: T1COL1, T1COL2, T1COL3, and so on, where the numbers indicate the index of each column in the input dataset.
For column identifiers that contain a space or other special characters, always enclose the column identifier in square brackets or double quotation marks when referring to the column in the
SELECT
orWHERE
clauses.
Unsupported statements
Although SQLite supports much of the ANSI SQL standard, it does not include many features supported by commercial relational database systems. For more information, see SQL as Understood by SQLite. Also, be aware of the following restrictions when creating SQL statements:
SQLite uses dynamic typing for values, rather than assigning a type to a column as in most relational database systems. It is weakly typed, and allows implicit type conversion.
LEFT OUTER JOIN
is implemented, but notRIGHT OUTER JOIN
orFULL OUTER JOIN
.You can use
RENAME TABLE
andADD COLUMN
statements with theALTER TABLE
command, but other clauses are not supported, includingDROP COLUMN
,ALTER COLUMN
, andADD CONSTRAINT
.You can create a VIEW within SQLite, but thereafter views are read-only. You cannot execute a
DELETE
,INSERT
, orUPDATE
statement on a view. However, you can create a trigger that fires on an attempt toDELETE
,INSERT
, orUPDATE
on a view and perform other operations in the body of the trigger.
In addition to the list of non-supported functions provided on the official SQLite site, the following wiki provides a list of other unsupported features: SQLite - Unsupported SQL
Expected inputs
Name | Type | Description |
---|---|---|
Table1 | Data Table | Input dataset1 |
Table2 | Data Table | Input dataset2 |
Table3 | Data Table | Input dataset3 |
Module parameters
Name | Range | Type | Default | Description |
---|---|---|---|---|
SQL Query Script | any | StreamReader | SQL query statement |
Outputs
Name | Type | Description |
---|---|---|
Results dataset | Data Table | Output dataset |
Exceptions
Exception | Description |
---|---|
Error 0001 | An exception occurs if one or more specified columns of the dataset couldn't be found. |
Error 0003 | An exception occurs if one or more of the input datasets is null or empty. |
Error 0069 | SQL logic error or missing database |
For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.
For a list of API exceptions, see Machine Learning REST API Error Codes.