TabularDatasetFactory Class

Contains methods to create a tabular dataset for Azure Machine Learning.

A TabularDataset is created using the from_* methods in this class, for example, the method from_delimited_files.

For more information on working with tabular datasets, see the notebook https://aka.ms/tabulardataset-samplenotebook.

Inheritance: builtins.object > TabularDatasetFactory

Constructor

TabularDatasetFactory()

Methods

from_delimited_files	Create a TabularDataset to represent tabular data in delimited files (e.g. CSV and TSV).
from_json_lines_files	Create a TabularDataset to represent tabular data in JSON Lines files (http://jsonlines.org/).
from_parquet_files	Create a TabularDataset to represent tabular data in Parquet files.
from_sql_query	Create a TabularDataset to represent tabular data in SQL databases.
register_dask_dataframe	Create a dataset from a dask dataframe. Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.
register_pandas_dataframe	Create a dataset from a pandas dataframe.
register_spark_dataframe	Create a dataset from a Spark dataframe. Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

from_delimited_files

Create a TabularDataset to represent tabular data in delimited files (e.g. CSV and TSV).

static from_delimited_files(path, validate=True, include_path=False, infer_column_types=True, set_column_types=None, separator=',', header=True, partition_format=None, support_multi_line=False, empty_as_string=False, encoding='utf8')

Parameters

Name	Description
path Required	Union[str, list[str], DataPath, list[DataPath], (Datastore, str), list[(Datastore, str)]] The path to the source files, which can be a single value or a list of URL strings (http[s]|abfs[s]|wasb[s]), DataPath objects, or tuples of Datastore and relative path. Note that a list of paths can't include both URLs and datastores.
validate Optional	bool Whether to validate that data can be loaded from the returned dataset. Defaults to True. Validation requires that the data source is accessible from the current compute. To disable validation, infer_column_types must also be set to False.
include_path Optional	bool Whether to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or when the file path itself carries useful information.
infer_column_types Optional	bool Whether to infer column data types. Defaults to True. Type inference requires that the data source is accessible from the current compute. Currently, type inference only reads the first 200 rows. If a column contains values of more than one type, it is better to provide the desired type as an override via the set_column_types argument. See the Remarks section for code samples that use set_column_types.
set_column_types Optional	dict[str, DataType] A dictionary that sets column data types, where the key is the column name and the value is a DataType. Defaults to None.
separator Optional	str The separator used to split columns. Defaults to ','.
header Optional	bool or PromoteHeadersBehavior Controls how column headers are promoted when reading from files. Defaults to True, which assumes all files have the same header. When header=False, files are read as having no header. More options can be specified using the enum values of PromoteHeadersBehavior.
partition_format Optional	str The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
support_multi_line Optional	bool By default (support_multi_line=False), all line breaks, including those in quoted field values, are interpreted as record breaks. Reading data this way is faster and better optimized for parallel execution on multiple CPU cores. However, it may silently produce more records with misaligned field values. Set this to True when the delimited files are known to contain quoted line breaks. Given the following CSV file as an example, the data is read differently depending on support_multi_line.

   A,B,C
   A1,B1,C1
   A2,"B
   2",C2

   from azureml.core import Dataset, Datastore
   from azureml.data.datapath import DataPath

   # default behavior: support_multi_line=False
   dataset = Dataset.Tabular.from_delimited_files(path=datastore_path)
   print(dataset.to_pandas_dataframe())
   #      A   B     C
   #  0  A1  B1    C1
   #  1  A2   B  None
   #  2  2"  C2  None

   # to handle quoted line breaks
   dataset = Dataset.Tabular.from_delimited_files(path=datastore_path,
                                                  support_multi_line=True)
   print(dataset.to_pandas_dataframe())
   #      A       B   C
   #  0  A1      B1  C1
   #  1  A2  B\r\n2  C2

empty_as_string Optional	bool Whether empty field values should be loaded as empty strings. The default (False) reads empty field values as nulls. Passing True reads empty field values as empty strings. This has no effect on values converted to numeric or datetime types, because empty values are converted to nulls.
encoding Optional	str The file encoding. Defaults to 'utf8'. Supported encodings are 'utf8', 'iso88591', 'latin1', 'ascii', 'utf16', 'utf32', 'utf8bom' and 'windows1252'.
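
The effect described for support_multi_line can be reproduced locally with Python's standard csv module. The following is a minimal sketch, not the SDK's own parser: the helper names split_every_line_break and split_respecting_quotes are hypothetical, and the sample text is the CSV file from the support_multi_line description with \r\n line endings assumed.

```python
import csv
import io

# Sample from the support_multi_line description: the second data record
# has a line break inside the quoted value of column B.
SAMPLE = 'A,B,C\r\nA1,B1,C1\r\nA2,"B\r\n2",C2\r\n'


def split_every_line_break(text):
    """Treat every line break as a record break (similar to support_multi_line=False)."""
    return [line.split(",") for line in text.splitlines() if line]


def split_respecting_quotes(text):
    """Keep quoted line breaks inside the field (similar to support_multi_line=True)."""
    # newline="" passes the original \r\n through to the csv module untouched.
    return list(csv.reader(io.StringIO(text, newline="")))


naive_rows = split_every_line_break(SAMPLE)
quoted_rows = split_respecting_quotes(SAMPLE)

# The first row of each result is the header.
print(len(naive_rows) - 1, "data records when every line break splits a record")
print(len(quoted_rows) - 1, "data records when quoted line breaks are preserved")
print(repr(quoted_rows[2][1]))
```

Splitting on every line break turns the quoted record into two misaligned records, which is the silent failure the parameter description warns about.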

Returns

Type	Description
TabularDataset	Returns a TabularDataset object.

Remarks

from_delimited_files creates an object of the TabularDataset class, which defines the operations to load data from delimited files into a tabular representation.

For the data to be accessible by Azure Machine Learning, the delimited files specified by path must be located in a Datastore, behind public web URLs, or at a URL of Blob, ADLS Gen1, or ADLS Gen2 storage. The user's AAD token is used in a notebook or local Python program if it directly calls one of these functions: FileDataset.mount, FileDataset.download, FileDataset.to_path, TabularDataset.to_pandas_dataframe, TabularDataset.to_dask_dataframe, TabularDataset.to_spark_dataframe, TabularDataset.to_parquet_files, TabularDataset.to_csv_files. The identity of the compute target is used for data access authentication in jobs submitted by Experiment.submit. Learn more: https://aka.ms/data-access

Column data types are by default inferred from data in the delimited files. Providing set_column_types will override the data type for the specified columns in the returned TabularDataset.


   from azureml.core import Dataset, Datastore

   # create tabular dataset from a single file in datastore
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   tabular_dataset_1 = Dataset.Tabular.from_delimited_files(path=(datastore,'weather/2018/11.csv'))

   # create tabular dataset from a single directory in datastore
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   tabular_dataset_2 = Dataset.Tabular.from_delimited_files(path=(datastore,'weather/'))

   # create tabular dataset from all csv files in the directory
   tabular_dataset_3 = Dataset.Tabular.from_delimited_files(path=(datastore,'weather/**/*.csv'))

   # create tabular dataset from multiple paths
   data_paths = [(datastore, 'weather/2018/11.csv'), (datastore, 'weather/2018/12.csv')]
   tabular_dataset_4 = Dataset.Tabular.from_delimited_files(path=data_paths)

   # create tabular dataset from url
   tabular_dataset_5 = Dataset.Tabular.from_delimited_files(path='https://url/weather/2018/12.csv')

   # use `set_column_types` to set column data types
   from azureml.data import DataType
   data_types = {
       'ID': DataType.to_string(),
       'Date': DataType.to_datetime('%d/%m/%Y %I:%M:%S %p'),
       'Count': DataType.to_long(),
       'Latitude': DataType.to_float(),
       'Found': DataType.to_bool()
   }
   web_path = [
       'https://url/weather/2018/11.csv',
       'https://url/weather/2018/12.csv'
   ]
   tabular = Dataset.Tabular.from_delimited_files(path=web_path, set_column_types=data_types)
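
Before passing a format string to DataType.to_datetime, it can help to check it against a few raw values locally. The sketch below uses only the standard library and assumes that the format directives in the sample above carry the same meaning as in Python's datetime.strptime; the helper try_parse and the raw values are hypothetical.

```python
from datetime import datetime

# Format string copied from the set_column_types sample.
DATE_FORMAT = "%d/%m/%Y %I:%M:%S %p"


def try_parse(raw_value, date_format=DATE_FORMAT):
    """Return the parsed datetime, or None if raw_value does not fit date_format."""
    try:
        return datetime.strptime(raw_value, date_format)
    except ValueError:
        return None


# Hypothetical raw values, as they might appear in a 'Date' column.
parsed = try_parse("25/12/2018 07:30:00 PM")
rejected = try_parse("2018-12-25 19:30:00")

print(parsed)
print(rejected)
```

A value that does not fit the format is reported as None here; how the service itself treats such values is not covered by this sketch.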

from_json_lines_files

Create a TabularDataset to represent tabular data in JSON Lines files (http://jsonlines.org/).

static from_json_lines_files(path, validate=True, include_path=False, set_column_types=None, partition_format=None, invalid_lines='error', encoding='utf8')

Parameters

Name	Description
path Required	Union[str, list[str], DataPath, list[DataPath], (Datastore, str), list[(Datastore, str)]] The path to the source files, which can be a single value or a list of URL strings (http[s]|abfs[s]|wasb[s]), DataPath objects, or tuples of Datastore and relative path. Note that a list of paths can't include both URLs and datastores.
validate Optional	bool Whether to validate that data can be loaded from the returned dataset. Defaults to True. Validation requires that the data source is accessible from the current compute.
include_path Optional	bool Whether to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or when the file path itself carries useful information.
set_column_types Optional	dict[str, DataType] A dictionary that sets column data types, where the key is the column name and the value is a DataType. Defaults to None.
partition_format Optional	str The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.jsonl' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.jsonl' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
invalid_lines Optional	str How to handle lines that are invalid JSON. Supported values are 'error' and 'drop'. Defaults to 'error'.
encoding Optional	str The file encoding. Defaults to 'utf8'. Supported encodings are 'utf8', 'iso88591', 'latin1', 'ascii', 'utf16', 'utf32', 'utf8bom' and 'windows1252'.
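
The two invalid_lines choices can be pictured with a small standard-library reader. This is only an illustration of the documented 'error' and 'drop' behaviors, not the SDK's implementation; the function read_json_lines and the sample text are hypothetical.

```python
import json


def read_json_lines(text, invalid_lines="error"):
    """Parse JSON Lines text, mimicking the documented 'error' and 'drop' choices."""
    if invalid_lines not in ("error", "drop"):
        raise ValueError("invalid_lines must be 'error' or 'drop'")
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            if invalid_lines == "error":
                raise ValueError(f"line {line_number} is not valid JSON") from exc
            # invalid_lines == "drop": skip the bad line and keep going
    return records


# The second line is truncated and therefore not valid JSON.
SAMPLE = '{"ID": 1, "Found": true}\n{"ID": 2, "Found": \n{"ID": 3, "Found": false}\n'

dropped = read_json_lines(SAMPLE, invalid_lines="drop")
print([record["ID"] for record in dropped])
```

With invalid_lines="error", the same sample raises on the second line instead of returning a partial result.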

Returns

Type	Description
TabularDataset	Returns a TabularDataset object.

Remarks

from_json_lines_files creates an object of the TabularDataset class, which defines the operations to load data from JSON Lines files into a tabular representation.

For the data to be accessible by Azure Machine Learning, the JSON Lines files specified by path must be located in a Datastore, behind public web URLs, or at a URL of Blob, ADLS Gen1, or ADLS Gen2 storage. The user's AAD token is used in a notebook or local Python program if it directly calls one of these functions: FileDataset.mount, FileDataset.download, FileDataset.to_path, TabularDataset.to_pandas_dataframe, TabularDataset.to_dask_dataframe, TabularDataset.to_spark_dataframe, TabularDataset.to_parquet_files, TabularDataset.to_csv_files. The identity of the compute target is used for data access authentication in jobs submitted by Experiment.submit. Learn more: https://aka.ms/data-access

Column data types are read from data types saved in the JSON Lines files. Providing set_column_types will override the data type for the specified columns in the returned TabularDataset.


   from azureml.core import Dataset, Datastore

   # create tabular dataset from a single file in datastore
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   tabular_dataset_1 = Dataset.Tabular.from_json_lines_files(path=(datastore,'weather/2018/11.jsonl'))

   # create tabular dataset from a single directory in datastore
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   tabular_dataset_2 = Dataset.Tabular.from_json_lines_files(path=(datastore,'weather/'))

   # create tabular dataset from all jsonl files in the directory
   tabular_dataset_3 = Dataset.Tabular.from_json_lines_files(path=(datastore,'weather/**/*.jsonl'))

   # create tabular dataset from multiple paths
   data_paths = [(datastore, 'weather/2018/11.jsonl'), (datastore, 'weather/2018/12.jsonl')]
   tabular_dataset_4 = Dataset.Tabular.from_json_lines_files(path=data_paths)

   # create tabular dataset from url
   tabular_dataset_5 = Dataset.Tabular.from_json_lines_files(path='https://url/weather/2018/12.jsonl')

   # use `set_column_types` to set column data types
   from azureml.data import DataType
   data_types = {
       'ID': DataType.to_string(),
       'Date': DataType.to_datetime('%d/%m/%Y %I:%M:%S %p'),
       'Count': DataType.to_long(),
       'Latitude': DataType.to_float(),
       'Found': DataType.to_bool()
   }
   web_path = [
       'https://url/weather/2018/11.jsonl',
       'https://url/weather/2018/12.jsonl'
   ]
   tabular = Dataset.Tabular.from_json_lines_files(path=web_path, set_column_types=data_types)

from_parquet_files

Create a TabularDataset to represent tabular data in Parquet files.

static from_parquet_files(path, validate=True, include_path=False, set_column_types=None, partition_format=None)

Parameters

Name	Description
path Required	Union[str, list[str], DataPath, list[DataPath], (Datastore, str), list[(Datastore, str)]] The path to the source files, which can be a single value or a list of URL strings (http[s]|abfs[s]|wasb[s]), DataPath objects, or tuples of Datastore and relative path. Note that a list of paths can't include both URLs and datastores.
validate Optional	bool Whether to validate that data can be loaded from the returned dataset. Defaults to True. Validation requires that the data source is accessible from the current compute.
include_path Optional	bool Whether to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or when the file path itself carries useful information.
set_column_types Optional	dict[str, DataType] A dictionary that sets column data types, where the key is the column name and the value is a DataType. Defaults to None.
partition_format Optional	str The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
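
To make the partition_format rules concrete, the sketch below turns a format string into a regular expression and applies it to the example path from the description. It is a hypothetical, standard-library illustration of the documented behavior, not the service's implementation; the function extract_partitions and its helpers are assumptions introduced here.

```python
import re
from datetime import datetime

# Datetime tokens named in the partition_format description, mapped to the
# digits they match and the datetime() keyword they fill.
DATE_TOKENS = {
    "yyyy": (r"\d{4}", "year"),
    "MM": (r"\d{2}", "month"),
    "dd": (r"\d{2}", "day"),
    "HH": (r"\d{2}", "hour"),
    "mm": (r"\d{2}", "minute"),
    "ss": (r"\d{2}", "second"),
}
PLACEHOLDER = re.compile(r"\{(\w+)(?::([^}]+))?\}")
TOKEN = re.compile("yyyy|MM|dd|HH|mm|ss")


def extract_partitions(path, partition_format):
    """Return {column: value} parsed from the end of path, or None if no match."""
    pattern = ""
    fields = []  # one (column, datetime_keyword_or_None) entry per capture group
    position = 0
    for placeholder in PLACEHOLDER.finditer(partition_format):
        pattern += re.escape(partition_format[position:placeholder.start()])
        column, layout = placeholder.group(1), placeholder.group(2)
        if layout is None:
            # '{column_name}' captures one path segment as a string
            pattern += "([^/]+)"
            fields.append((column, None))
        else:
            # '{column_name:yyyy/MM/dd}' captures each datetime part separately
            layout_position = 0
            for token in TOKEN.finditer(layout):
                pattern += re.escape(layout[layout_position:token.start()])
                digits, keyword = DATE_TOKENS[token.group()]
                pattern += "(" + digits + ")"
                fields.append((column, keyword))
                layout_position = token.end()
            pattern += re.escape(layout[layout_position:])
        position = placeholder.end()
    # The format must run to the end of the file path.
    pattern += re.escape(partition_format[position:]) + "$"

    found = re.search(pattern, path)
    if found is None:
        return None
    values, date_parts = {}, {}
    for (column, keyword), text in zip(fields, found.groups()):
        if keyword is None:
            values[column] = text
        else:
            defaults = {"year": 1, "month": 1, "day": 1}
            date_parts.setdefault(column, defaults)[keyword] = int(text)
    for column, parts in date_parts.items():
        values[column] = datetime(**parts)
    return values


result = extract_partitions(
    "../Accounts/2019/01/01/data.parquet",
    "/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet",
)
print(result)
```

For the example path, the sketch yields a string value for Department and a datetime value for PartitionDate, matching the columns named in the description.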

Returns

Type	Description
TabularDataset	Returns a TabularDataset object.

Remarks

from_parquet_files creates an object of the TabularDataset class, which defines the operations to load data from Parquet files into a tabular representation.

For the data to be accessible by Azure Machine Learning, the Parquet files specified by path must be located in a Datastore, behind public web URLs, or at a URL of Blob, ADLS Gen1, or ADLS Gen2 storage. The user's AAD token is used in a notebook or local Python program if it directly calls one of these functions: FileDataset.mount, FileDataset.download, FileDataset.to_path, TabularDataset.to_pandas_dataframe, TabularDataset.to_dask_dataframe, TabularDataset.to_spark_dataframe, TabularDataset.to_parquet_files, TabularDataset.to_csv_files. The identity of the compute target is used for data access authentication in jobs submitted by Experiment.submit. Learn more: https://aka.ms/data-access

Column data types are read from data types saved in the Parquet files. Providing set_column_types will override the data type for the specified columns in the returned TabularDataset.


   from azureml.core import Dataset, Datastore

   # create tabular dataset from a single file in datastore
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   tabular_dataset_1 = Dataset.Tabular.from_parquet_files(path=(datastore,'weather/2018/11.parquet'))

   # create tabular dataset from a single directory in datastore
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   tabular_dataset_2 = Dataset.Tabular.from_parquet_files(path=(datastore,'weather/'))

   # create tabular dataset from all parquet files in the directory
   tabular_dataset_3 = Dataset.Tabular.from_parquet_files(path=(datastore,'weather/**/*.parquet'))

   # create tabular dataset from multiple paths
   data_paths = [(datastore, 'weather/2018/11.parquet'), (datastore, 'weather/2018/12.parquet')]
   tabular_dataset_4 = Dataset.Tabular.from_parquet_files(path=data_paths)

   # create tabular dataset from url
   tabular_dataset_5 = Dataset.Tabular.from_parquet_files(path='https://url/weather/2018/12.parquet')

   # use `set_column_types` to set column data types
   from azureml.data import DataType
   data_types = {
       'ID': DataType.to_string(),
       'Date': DataType.to_datetime('%d/%m/%Y %I:%M:%S %p'),
       'Count': DataType.to_long(),
       'Latitude': DataType.to_float(),
       'Found': DataType.to_bool()
   }
   web_path = [
       'https://url/weather/2018/11.parquet',
       'https://url/weather/2018/12.parquet'
   ]
   tabular = Dataset.Tabular.from_parquet_files(path=web_path, set_column_types=data_types)

from_sql_query

Create a TabularDataset to represent tabular data in SQL databases.

static from_sql_query(query, validate=True, set_column_types=None, query_timeout=30)

Parameters

Name	Description
query Required	Union[DataPath, (Datastore, str)] A SQL-kind datastore and a query.
validate Optional	bool Whether to validate that data can be loaded from the returned dataset. Defaults to True. Validation requires that the data source is accessible from the current compute.
set_column_types Optional	dict[str, DataType] A dictionary that sets column data types, where the key is the column name and the value is a DataType. Defaults to None.
query_timeout Optional	The wait time, in seconds, before terminating the attempt to execute a command and generating an error. Defaults to 30 seconds.

Returns

Type	Description
TabularDataset	Returns a TabularDataset object.

Remarks

from_sql_query creates an object of the TabularDataset class, which defines the operations to load data from SQL databases into a tabular representation. Currently, only MSSQLDataSource is supported.

For the data to be accessible by Azure Machine Learning, the SQL database specified by query must be located in Datastore and the datastore type must be of a SQL kind.

Column data types are read from data types in SQL query result. Providing set_column_types will override the data type for the specified columns in the returned TabularDataset.


   from azureml.core import Dataset, Datastore
   from azureml.data.datapath import DataPath

   # create tabular dataset from a SQL database in datastore
   datastore = Datastore.get(workspace, 'mssql')
   query = DataPath(datastore, 'SELECT * FROM my_table')
   tabular = Dataset.Tabular.from_sql_query(query, query_timeout=10)
   df = tabular.to_pandas_dataframe()

   # use `set_column_types` to set column data types
   from azureml.data import DataType
   data_types = {
       'ID': DataType.to_string(),
       'Date': DataType.to_datetime('%d/%m/%Y %I:%M:%S %p'),
       'Count': DataType.to_long(),
       'Latitude': DataType.to_float(),
       'Found': DataType.to_bool()
   }
   tabular = Dataset.Tabular.from_sql_query(query, set_column_types=data_types)

register_dask_dataframe

Note

This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Create a dataset from a dask dataframe.

static register_dask_dataframe(dataframe, target, name, description=None, tags=None, show_progress=True)

Parameters

Name	Description
dataframe Required	dask.dataframe.core.DataFrame The dask dataframe to be uploaded.
target Required	Union[DataPath, Datastore, tuple(Datastore, str)] The datastore path where the dataframe parquet data will be uploaded. A GUID folder is generated under the target path to avoid conflicts.
name Required	str The name of the registered dataset.
description Optional	str A text description of the dataset. Defaults to None.
tags Optional	dict[str, str] Dictionary of key-value tags to give the dataset. Defaults to None.
show_progress Optional	bool Whether to show progress of the upload in the console. Defaults to True.

Returns

Type	Description
TabularDataset	The registered dataset.

register_pandas_dataframe

Create a dataset from a pandas dataframe.

static register_pandas_dataframe(dataframe, target, name, description=None, tags=None, show_progress=True, row_group_size=None, make_target_path_unique=True)

Parameters

Name	Description
dataframe Required	DataFrame The in-memory dataframe to be uploaded.
target Required	Union[DataPath, Datastore, tuple(Datastore, str)] The datastore path where the dataframe parquet data will be uploaded. A GUID folder is generated under the target path to avoid conflicts.
name Required	str The name of the registered dataset.
description Optional	str A text description of the dataset. Defaults to None.
tags Optional	dict[str, str] Dictionary of key-value tags to give the dataset. Defaults to None.
show_progress Optional	bool Whether to show progress of the upload in the console. Defaults to True.
row_group_size Optional	Maximum size of the row group to use when writing the parquet file. Defaults to None.
make_target_path_unique Optional	Whether a unique subfolder should be created in the target. Defaults to True.

Returns

Type	Description
TabularDataset	The registered dataset.

register_spark_dataframe

Note

This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Create a dataset from a Spark dataframe.

static register_spark_dataframe(dataframe, target, name, description=None, tags=None, show_progress=True)

Parameters

Name	Description
dataframe Required	DataFrame The Spark dataframe to be uploaded.
target Required	Union[DataPath, Datastore, tuple(Datastore, str)] The datastore path where the dataframe parquet data will be uploaded. A GUID folder is generated under the target path to avoid conflicts.
name Required	str The name of the registered dataset.
description Optional	str A text description of the dataset. Defaults to None.
tags Optional	dict[str, str] Dictionary of key-value tags to give the dataset. Defaults to None.
show_progress Optional	bool Whether to show progress of the upload in the console. Defaults to True.

Returns

Type	Description
TabularDataset	The registered dataset.
