Data sources that connect to Data Map

02/13/2025

This article lists the supported data sources, file types, and scanning concepts in Microsoft Purview Data Map.

Data source listing by type

The tables below show all data sources that have technical metadata available in Microsoft Purview Data Map, along with other supported capabilities. Select a data source name in the Data source column for instructions on connecting that source to Data Map.

Azure

Azure resources are only available in the same tenant as your Microsoft Purview account, unless noted otherwise on each data source's page.

Data source	Can automatically apply classifications	Can apply sensitivity labels to Data Map assets	Can apply policies	Data lineage	Accessible in live view
Select link for connection and scanning instructions.	Select Yes for scanning instructions. Learn how classifications are applied during scanning.	Learn about sensitivity labeling (preview).	Select Yes to see supported policies; for example, data owner, self-service access, or protection.	Select Yes for details.	Learn about live view.
Multiple sources	Yes	Source dependent	Yes	No	Limited
Azure Blob Storage	Yes	Yes	Yes (preview)	Limited*	Yes
Azure Cosmos DB (API for NoSQL)	Yes	No	No	No*	No
Azure Data Explorer	Yes	No	No	No*	No
Azure Data Factory	No	No	No	Yes	No
Azure Data Lake Storage Gen2	Yes	Yes	Yes (preview)	Limited*	Yes
Azure Data Share	No	No	No	Yes	No
Azure Database for MySQL	Yes	No	No	No*	No
Azure Database for PostgreSQL	Yes	No	No	No*	No
Azure Databricks Hive Metastore	No	No	No	Yes	No
Azure Databricks Unity Catalog	Yes	No	No	No	No
Azure Dedicated SQL pool (formerly SQL DW)	Yes	No	No	No*	No
Azure Files	Yes	Yes	No	Limited*	No
Azure Machine Learning	No	No	No	Yes	No
Azure SQL Database	Yes	Yes	Yes	Yes (Preview)	Yes
Azure SQL Managed Instance	Yes	No	Yes	No*	No
Azure Synapse Analytics (Workspace)	Yes	No	No	Yes - Synapse pipelines	No

* Besides the lineage on assets within the data source, lineage is also supported if the dataset is used as a source/sink in a Data Factory or Synapse pipeline.

Database

Data source	Can automatically apply classifications	Can apply sensitivity labels to Data Map assets	Can apply policies	Data lineage	Accessible in live view
Select link for connection and scanning instructions.	Select Yes for scanning instructions. Learn how classifications are applied during scanning.	Learn about sensitivity labeling (preview).	Select Yes to see supported policies; for example, data owner, self-service access, or protection.	Select Yes for details.	Learn about live view.
Amazon RDS	Yes	No	No	No	No
Amazon Redshift	No	No	No	No	No
Cassandra	No	No	No	Yes	No
Db2	No	No	No	Yes	No
Google BigQuery	No	No	No	Yes	No
Hive Metastore Database	No	No	No	Yes*	No
MongoDB	No	No	No	No	No
MySQL	No	No	No	Yes	No
Oracle	Yes	No	No	Yes*	No
PostgreSQL	No	No	No	Yes	No
SAP Business Warehouse	No	No	No	No	No
SAP HANA	No	No	No	No	No
Snowflake	Yes	No	No	Yes	No
SQL Server	Yes	No	No	No*	No
SQL Server on Azure-Arc	Yes	No	Yes	No*	No
Teradata	Yes	No	No	Yes*	No

* Besides the lineage on assets within the data source, lineage is also supported if the dataset is used as a source/sink in a Data Factory or Synapse pipeline.

File

Data source	Can automatically apply classifications	Can apply sensitivity labels to Data Map assets	Can apply policies	Data lineage	Accessible in live view
Select link for connection and scanning instructions.	Select Yes for scanning instructions. Learn how classifications are applied during scanning.	Learn about sensitivity labeling (preview).	Select Yes to see supported policies; for example, data owner, self-service access, or protection.	Select Yes for details.	Learn about live view.
Amazon S3	Yes	No	No	Limited*	No
Hadoop Distributed File System (HDFS)	Yes	No	No	No	No

* Besides the lineage on assets within the data source, lineage is also supported if the dataset is used as a source/sink in a Data Factory or Synapse pipeline.

Services and apps

Data source	Can automatically apply classifications	Can apply sensitivity labels to Data Map assets	Can apply policies	Data lineage	Accessible in live view
Select link for connection and scanning instructions.	Select Yes for scanning instructions. Learn how classifications are applied during scanning.	Learn about sensitivity labeling (preview).	Select Yes to see supported policies; for example, data owner, self-service access, or protection.	Select Yes for details.	Learn about live view.
Airflow	No	No	No	Yes	No
Dataverse	Yes	No	No	No	No
Erwin	No	No	No	Yes	No
Fabric	No	No	No	Yes	Yes
Looker	No	No	No	Yes	No
Power BI	No	No	No	Yes	Yes**
Qlik Sense	No	No	No	No	No
Salesforce	No	No	No	No	No
SAP ECC	No	No	No	Yes*	No
SAP S/4HANA	No	No	No	Yes*	No
Tableau	No	No	No	No	No

* Besides the lineage on assets within the data source, lineage is also supported if the dataset is used as a source/sink in a Data Factory or Synapse pipeline.

** Power BI items in a Fabric tenant are available using live view.

Note

Currently, the Microsoft Purview Data Map can't scan an asset that has /, \, or # in its name. To scope your scan and avoid scanning assets that have those characters in the asset name, use the example in Register and scan an Azure SQL Database.
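
As a rough illustration of the restriction above, the following Python sketch checks asset names for the three characters before you scope a scan. The function name `is_scannable_name` and the sample names are hypothetical; they aren't part of any Microsoft Purview API.

```python
# Characters that, per the note above, prevent Data Map from scanning an asset.
UNSUPPORTED_CHARS = frozenset("/\\#")


def is_scannable_name(asset_name: str) -> bool:
    """Return True if the name contains none of the unsupported characters."""
    return not any(ch in UNSUPPORTED_CHARS for ch in asset_name)


# Hypothetical asset names, used only for illustration.
names = ["SalesOrders", "archive/2024", "temp#1", "dbo\\audit"]
skipped = [name for name in names if not is_scannable_name(name)]
print(skipped)
```

A check like this only helps you find the assets to exclude; the exclusion itself is configured in the scan scope.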

Important

If you plan on using a self-hosted integration runtime, scanning some data sources requires extra setup on the self-hosted integration runtime machine, such as a JDK, the Visual C++ Redistributable, or a specific driver. Refer to the article for your source for details; any requirements are listed in its Prerequisites section.

Data Map scanner regions

The following is a list of all the Azure data source (data center) regions where the Microsoft Purview Data Map scanner runs. If your Azure data source is in a region outside of this list, the scanner will run in the region of your Microsoft Purview instance.

Australia East
Australia Southeast
Brazil South
Canada Central
Canada East
Central India
China North 3
East Asia
East US
East US 2
France Central
Germany West Central
Japan East
Korea Central
North Central US
North Europe
Qatar Central
South Africa North
South Central US
Southeast Asia
Switzerland North
UAE North
UK South
USGov Virginia
West Central US
West Europe
West US
West US 2
West US 3
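
The fallback rule described above can be sketched in Python as a simple lookup. The function name `scanner_region` is hypothetical, and the region names are copied from the list above rather than retrieved from any service.

```python
# Regions where the Data Map scanner runs, copied from the list above.
SCANNER_REGIONS = {
    "Australia East", "Australia Southeast", "Brazil South", "Canada Central",
    "Canada East", "Central India", "China North 3", "East Asia", "East US",
    "East US 2", "France Central", "Germany West Central", "Japan East",
    "Korea Central", "North Central US", "North Europe", "Qatar Central",
    "South Africa North", "South Central US", "Southeast Asia",
    "Switzerland North", "UAE North", "UK South", "USGov Virginia",
    "West Central US", "West Europe", "West US", "West US 2", "West US 3",
}


def scanner_region(source_region: str, purview_region: str) -> str:
    """Return the region in which the scanner is expected to run."""
    if source_region in SCANNER_REGIONS:
        return source_region
    # Outside the list, the scanner runs in the Purview instance's region.
    return purview_region


print(scanner_region("Japan East", "West Europe"))
print(scanner_region("Norway East", "West Europe"))
```

In this sketch, a source in Japan East is scanned in Japan East, while a source in a region that isn't listed falls back to the region of the Microsoft Purview instance.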

File types supported for scanning

The file types listed below are supported for scanning, schema extraction, and classification where applicable. Data Map also supports custom file extensions and custom parsers.

The following structured file formats are supported by extension for scanning, schema extraction, and asset-level and column-level classification:

AVRO
CSV
GZIP
JSON
ORC
PARQUET
PSV
SSV
TSV
TXT
XML

The following document file formats are supported by extension for scanning and asset-level classification:

DOC
DOCM
DOCX
DOT
ODP
ODS
ODT
PDF
POT
PPS
PPSX
PPT
PPTM
PPTX
XLC
XLS
XLSB
XLSM
XLSX
XLT
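
As an illustration, the two lists above can be expressed as a lookup that reports which level of support to expect for a file. This is a minimal sketch: the function name `scan_support` and the returned labels are hypothetical, and the lookup matches the listed names literally, so it doesn't account for custom file extensions or custom parsers.

```python
from pathlib import PurePosixPath

# Extensions copied from the structured and document lists above.
STRUCTURED = {"AVRO", "CSV", "GZIP", "JSON", "ORC", "PARQUET",
              "PSV", "SSV", "TSV", "TXT", "XML"}
DOCUMENT = {"DOC", "DOCM", "DOCX", "DOT", "ODP", "ODS", "ODT", "PDF", "POT",
            "PPS", "PPSX", "PPT", "PPTM", "PPTX", "XLC", "XLS", "XLSB",
            "XLSM", "XLSX", "XLT"}


def scan_support(file_name: str) -> str:
    """Classify a file by its extension (case-insensitive)."""
    extension = PurePosixPath(file_name).suffix.lstrip(".").upper()
    if extension in STRUCTURED:
        return "structured"  # schema extraction, asset and column classification
    if extension in DOCUMENT:
        return "document"    # asset-level classification only
    return "unlisted"        # not in either list above


for name in ["sales.parquet", "plan.DOCX", "image.png"]:
    print(name, scan_support(name))
```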

Note

Known limitations:

The Microsoft Purview Data Map scanner only supports schema extraction for the structured file types listed above.
For AVRO, ORC, and PARQUET file types, the scanner does not support schema extraction for files that contain complex data types (for example, MAP, LIST, STRUCT).
The scanner supports scanning Snappy-compressed PARQUET files for schema extraction and classification.
For GZIP file types, the GZIP file must contain a single CSV file. GZIP files are subject to system and custom classification rules. Scanning a GZIP file that contains multiple files, or any file type other than CSV, isn't currently supported.
For delimited file types (CSV, PSV, SSV, TSV, TXT):
- Delimited files with only one column can't be identified as CSV files and will have no schema.
- Data type detection isn't supported. The data type is listed as "string" for all columns.
- Only comma (‘,’), semicolon (‘;’), vertical bar (‘|’), and tab (‘\t’) are supported as delimiters.
- Delimited files with fewer than three rows can't be identified as CSV files if they use a custom delimiter. For example, a file with a ~ delimiter and fewer than three rows can't be identified as a CSV file.
- If a field contains double quotes, the double quotes can appear only at the beginning and end of the field and must be matched. Double quotes that appear in the middle of the field, or that appear at the beginning and end but aren't matched, are treated as bad data, and no schema is parsed from the file.
- Rows that have a different number of columns than the header row are treated as error rows. The ratio (number of error rows / number of rows sampled) must be less than 0.1.
For Parquet files, if you are using a self-hosted integration runtime, you need to install the 64-bit JRE 11 (Java Runtime Environment) or OpenJDK on your IR machine. Check our Java Runtime Environment section at the bottom of the page for an installation guide.
Currently, the delta format isn't supported. If you scan the delta format directly from a storage data source such as Azure Data Lake Storage (ADLS Gen2), the set of parquet files from the delta format is parsed and handled as a resource set, as described in Understanding resource sets. In addition, the columns used for partitioning aren't recognized as part of the schema for the resource set.
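
To make the delimited-file rules above more concrete, here's a minimal Python sketch of the double-quote rule and the error-row ratio. It isn't the scanner's implementation. The function names are hypothetical, rows are split naively on the delimiter, and the header row is assumed to be excluded from the number of rows sampled.

```python
def field_quotes_ok(field: str) -> bool:
    """Quotes are acceptable only as a matched pair wrapping the whole field."""
    if '"' not in field:
        return True
    return (
        len(field) >= 2
        and field[0] == '"'
        and field[-1] == '"'
        and '"' not in field[1:-1]
    )


def error_row_ratio(lines: list[str], delimiter: str = ",") -> float:
    """Share of data rows whose column count differs from the header row."""
    if not lines:
        return 0.0
    header, *rows = lines
    if not rows:
        return 0.0
    expected = len(header.split(delimiter))  # naive split, assumption
    errors = sum(1 for row in rows if len(row.split(delimiter)) != expected)
    return errors / len(rows)


# Hypothetical sample: the second data row is missing a column.
sample = ["id,name,city", "1,Ann,Oslo", "2,Bob", "3,Cy,Rome"]
ratio = error_row_ratio(sample)
print(ratio, ratio < 0.1)
```

In this sample, one of three data rows is an error row, so the ratio is above the 0.1 limit.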

Schema extraction

For data sources which support schema extraction during scan, the asset schema won't be directly truncated by the number of columns.

Nested data

Nested data is only supported for JSON content. For all system supported file types, if there's nested JSON content in a column, then the scanner parses the nested JSON data and surfaces it within the schema tab of the asset.

Nested data, or nested schema parsing, isn't supported in SQL. A column with nested data will be reported and classified as is, and subdata won't be parsed.
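
As an illustration of what parsing nested JSON content into a schema can look like, the following Python sketch walks a JSON value and lists the path of each leaf. The dotted-path notation, the `[]` marker for arrays, and the function name `schema_paths` are assumptions made for this example; they don't describe how Data Map names nested fields.

```python
import json


def schema_paths(value, prefix=""):
    """Yield a path for every leaf value found in parsed JSON content."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from schema_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from schema_paths(child, prefix + "[]")
    else:
        yield prefix


# Hypothetical column value that holds nested JSON content.
column_value = (
    '{"customer": {"name": "Ann", "address": {"city": "Oslo"}},'
    ' "tags": ["a", "b"]}'
)

# dict.fromkeys removes duplicate paths while keeping first-seen order.
paths = list(dict.fromkeys(schema_paths(json.loads(column_value))))
print(paths)
```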

Sampling data for classification

In Data Map terminology:

L1 scan: Extracts basic information and metadata like file name, size, and fully qualified name
L2 scan: Extracts schema for structured file types and database tables
L3 scan: Extracts schema where applicable and subjects the sampled file to system and custom classification rules

Learn more about customizing the scan levels.

The Microsoft Purview Data Map scanner samples data in the following way:

For structured file types, it samples the top 128 rows in each column or the first 1 MB, whichever is lower.
For document file formats, it samples the first 20 MB of each file.
- If a document file is larger than 20 MB, it isn't subject to a deep scan (classification). In that case, Microsoft Purview captures only basic metadata like file name and fully qualified name.
For tabular data sources (SQL), it samples the top 128 rows.
For Azure Cosmos DB for NoSQL, up to 300 distinct properties from the first 10 documents in a container are collected for the schema. For each property, values from up to 128 documents or the first 1 MB are sampled.
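
The "128 rows or 1 MB, whichever is lower" limit for structured files can be sketched in Python as follows. This is only an illustration under stated assumptions: 1 MB is taken as 1,048,576 bytes, sizes are measured on UTF-8 encoded lines, and the function name `sample_rows` is hypothetical.

```python
import itertools

MAX_ROWS = 128
MAX_BYTES = 1024 * 1024  # assumption: 1 MB taken as 1,048,576 bytes


def sample_rows(lines):
    """Keep rows until either the row limit or the byte limit is reached."""
    sampled, used = [], 0
    for line in itertools.islice(lines, MAX_ROWS):
        size = len(line.encode("utf-8"))
        if used + size > MAX_BYTES:
            break  # the byte limit is reached before the row limit
        sampled.append(line)
        used += size
    return sampled


# Many small rows: the row limit applies.
print(len(sample_rows(f"row{i}" for i in range(1000))))
# A few very large rows: the byte limit applies.
print(len(sample_rows(["x" * 600_000] * 5)))
```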

Resource set file sampling

A folder or group of partition files is detected as a resource set in the Microsoft Purview Data Map if it matches a system resource set policy or a customer-defined resource set policy. If a resource set is detected, the scanner samples each folder that it contains. Learn more about resource sets here.

File sampling for resource sets by file types:

Delimited files (CSV, PSV, SSV, TSV) - 1 in 100 files are sampled (L3 scan) within a folder or group of partition files that are considered a 'Resource set'
Data Lake file types (Parquet, Avro, ORC) - 1 in 18446744073709551615 (the unsigned 64-bit integer maximum) files are sampled (L3 scan) within a folder or group of partition files that are considered a 'Resource set'
Other structured file types (JSON, XML, TXT) - 1 in 100 files are sampled (L3 scan) within a folder or group of partition files that are considered a 'Resource set'
SQL objects and Azure Cosmos DB entities - Each file is L3 scanned.
Document file types - Each file is L3 scanned. Resource set patterns don't apply to these file types.
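
To get a feel for the sampling rates above, the following Python sketch estimates how many files in a resource set would receive an L3 scan. It assumes that "1 in N" means at least one file per started group of N files. That's an assumption, because this article doesn't describe how the scanner selects files. The category labels and the function name `estimated_l3_files` are hypothetical.

```python
# Sampling rates copied from the list above ("1 in N files").
SAMPLE_ONE_IN = {
    "delimited": 100,                    # CSV, PSV, SSV, TSV
    "data_lake": 18446744073709551615,   # Parquet, Avro, ORC
    "other_structured": 100,             # JSON, XML, TXT
}


def estimated_l3_files(file_count: int, category: str) -> int:
    """Estimate L3-scanned files, rounding up (assumption, see lead-in)."""
    if file_count <= 0:
        return 0
    rate = SAMPLE_ONE_IN[category]
    return (file_count + rate - 1) // rate  # integer ceiling division


print(estimated_l3_files(250, "delimited"))
print(estimated_l3_files(250, "data_lake"))
```

Under this assumption, a resource set of 250 delimited files yields about three sampled files, while a Parquet resource set of any practical size yields one.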
