Split Data using Regular Expression
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
This article describes how to use the Regular Expression Split option in the Split Data module of Machine Learning Studio (classic). This option is useful when you need to apply filter criteria to a text column. For example, you might divide your dataset by whether a particular product is mentioned.
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
You can use a regular expression split on a single text column. You define a regular expression that includes the text column name, and then set conditions that apply to the column, such as "begins with", "contains", or "does not contain".
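As a rough illustration of how such conditions can be written as regular expressions, the following Python sketch uses the standard `re` module. The `CONDITIONS` table, the `matches` helper, and the sample strings are hypothetical, and Python's regex dialect may differ in places from the one the module uses, so treat this as an approximation rather than the module's behavior.

```python
import re

# Hypothetical patterns illustrating three common conditions.
CONDITIONS = {
    "begins with": re.compile(r"^Gryphon"),
    "contains": re.compile(r"Gryphon"),
    # Negative lookahead: succeeds only if "Gryphon" appears nowhere in the value.
    "does not contain": re.compile(r"^(?!.*Gryphon)"),
}


def matches(condition: str, value: str) -> bool:
    """Return True if the value satisfies the named condition."""
    return CONDITIONS[condition].search(value) is not None


samples = ["Gryphon sighted at dawn", "A Gryphon appeared", "Nothing to report"]
for condition in CONDITIONS:
    # Print the sample values that satisfy each condition.
    print(condition, [s for s in samples if matches(condition, s)])
```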
For general information about data partitioning for machine learning experiments, see Split Data and Partition and Split.
Related tasks
Other options in the Split Data module:
Split data using relative expressions: Apply an expression to numeric data.
Split recommender datasets: Divide datasets that are used in recommendation models. The dataset should have three columns: items, users, and ratings.
Use a regular expression to divide a dataset
Add the Split Data module to your experiment, and connect the dataset you want to split as its input.
For Splitting mode, select Regular expression split.
In the Regular expression box, type a valid regular expression. Some examples are provided here.
The regular expression is applied only to the specified column, which must be a string data type.
For help composing regular expressions, see the Regular Expression Language - Quick Reference.
Run the experiment, or right-click the module and select Run selected.
Based on the regular expression you provide, the dataset is divided into two sets of rows: rows with values that match the expression and all remaining rows.
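The two-way division described above can be sketched outside Studio (classic) in a few lines of Python. This is only an approximation under stated assumptions: `regex_split`, the `Row` alias, and the sample rows are hypothetical names, and the sketch uses Python's `re.search` rather than the module's own expression syntax.

```python
import re
from typing import Dict, List, Tuple

Row = Dict[str, str]


def regex_split(rows: List[Row], column: str, pattern: str) -> Tuple[List[Row], List[Row]]:
    """Partition rows by whether the given string column matches the pattern."""
    compiled = re.compile(pattern)
    matched: List[Row] = []
    remaining: List[Row] = []
    for row in rows:
        # re.search matches anywhere in the value unless the pattern is anchored.
        target = matched if compiled.search(row[column]) else remaining
        target.append(row)
    return matched, remaining


# Hypothetical sample data: does a review mention the product "Widget"?
rows = [
    {"Text": "The Widget arrived on time"},
    {"Text": "Shipping was slow"},
    {"Text": "I would buy the Widget again"},
]
first, second = regex_split(rows, "Text", r"Widget")
print(len(first), len(second))  # 2 1
```

Every row lands in exactly one of the two outputs, which mirrors the behavior of the two output ports.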
Examples
The following examples demonstrate how to divide a dataset using the Regular Expression option.
Single whole word
This example puts into the first dataset all rows that contain the text Gryphon in the column Text, and puts other rows into the second output of Split Data:
\"Text" Gryphon
Substring
This example looks for the specified characters at the start of each value in the second column of the dataset, denoted here by the index value of 1. The match is case-sensitive.
(\1) ^[a-f]
The first result dataset contains all rows where the index column begins with one of these characters: a, b, c, d, e, f. All other rows are directed to the second output.
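To see the effect of the `^` anchor and of case sensitivity, here is a small Python sketch of the same pattern. The rows are hypothetical, and the column is selected by tuple position rather than by the module's `(\1)` syntax, so this only approximates the regular expression part of the example.

```python
import re

# Hypothetical rows; the second column (position 1) is the one being tested.
rows = [
    ("1", "apple"),
    ("2", "Banana"),
    ("3", "fig"),
    ("4", "grape"),
]

# Anchored at the start of the value; matches lowercase a through f only.
pattern = re.compile(r"^[a-f]")

first = [row for row in rows if pattern.search(row[1])]
second = [row for row in rows if not pattern.search(row[1])]

print(first)   # [('1', 'apple'), ('3', 'fig')]
print(second)  # [('2', 'Banana'), ('4', 'grape')]
```

"Banana" falls into the second output because the uppercase B is outside the range a-f.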
String match on IP addresses
This example divides some server log data into two categories for analysis: connections behind the firewall and connections with IP addresses outside the firewall. The regular expression is applied to the IP_Address field (a string data type).
(\IP_Address) ^10\.
The first output contains all addresses that begin with 10.
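When writing this kind of pattern, note that in standard regular expression syntax square brackets denote a character class, so a pattern such as `^[10]` matches any value that starts with the single character 1 or 0, whereas `^10\.` matches only values whose first octet is 10. The Python sketch below compares the two on hypothetical addresses; the variable names are illustrative.

```python
import re

# Hypothetical log values for illustration only.
addresses = ["10.0.0.5", "192.168.1.20", "100.64.0.1", "8.8.8.8", "0.0.0.0"]

char_class = re.compile(r"^[10]")  # starts with the character '1' OR '0'
literal_10 = re.compile(r"^10\.")  # starts with the octet 10 followed by a dot

by_class = [a for a in addresses if char_class.search(a)]
by_literal = [a for a in addresses if literal_10.search(a)]

print(by_class)    # ['10.0.0.5', '192.168.1.20', '100.64.0.1', '0.0.0.0']
print(by_literal)  # ['10.0.0.5']
```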