How to profile data sources in Azure Data Catalog
Important
Azure Data Catalog was retired on May 15, 2024.
For data catalog features, use the Microsoft Purview service, which offers unified data governance for your entire data estate.
Introduction
Microsoft Azure Data Catalog is a fully managed cloud service that serves as a system of registration and system of discovery for enterprise data sources. In other words, Azure Data Catalog is all about helping people discover, understand, and use data sources, and helping organizations to get more value from their existing data. When a data source is registered with Azure Data Catalog, its metadata is copied and indexed by the service, but the story doesn’t end there.
The Data Profiling feature of Azure Data Catalog examines the data from supported data sources in your catalog and collects statistics and information about that data. It's easy to include a profile of your data assets. When you register a data asset, choose Include Data Profile in the data source registration tool.
What is data profiling?
Data profiling examines the data in the data source being registered, and collects statistics and information about that data. During data source discovery, these statistics can help you determine whether the data is suitable for solving your business problem.
The following data sources support data profiling:
- SQL Server (including Azure SQL DB and Azure Synapse Analytics) tables and views
- Oracle tables and views
- Teradata tables and views
- Hive tables
Including data profiles when registering data assets helps users answer questions about data sources, including:
- Can it be used to solve my business problem?
- Does the data conform to particular standards or patterns?
- What are some of the anomalies of the data source?
- What are possible challenges of integrating this data into my application?
Note
You can also add documentation to an asset to describe how data could be integrated into an application. See How to document data sources.
How to include a data profile when registering a data source
It's easy to include a profile of your data source. When you register a data source, in the Objects to be registered panel of the data source registration tool, choose Include Data Profile.
To learn more about how to register data sources, see How to register data sources and Get started with Azure Data Catalog.
Filtering on data assets that include data profiles
To discover data assets that include a data profile, you can include has:tableDataProfiles or has:columnsDataProfiles as one of your search terms.
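As a rough illustration of how these filters combine with ordinary keywords, the following Python sketch assembles a search string. The helper name build_search_terms and its parameters are hypothetical and are not part of any Data Catalog SDK. The sketch also assumes that search terms are separated by spaces.

```python
def build_search_terms(keywords, table_profile=False, column_profile=False):
    """Join free-text keywords with the profile filters described above.

    Hypothetical helper for illustration only; it just builds a string.
    """
    terms = list(keywords)
    if table_profile:
        # Match assets that carry table-level profile information
        terms.append("has:tableDataProfiles")
    if column_profile:
        # Match assets that carry column-level profile information
        terms.append("has:columnsDataProfiles")
    # Assumption: terms are separated by spaces in the search box
    return " ".join(terms)


print(build_search_terms(["sales"], table_profile=True))
```

The resulting string can be pasted into the portal search box.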
Note
Selecting Include Data Profile in the data source registration tool includes both table and column-level profile information. However, the Data Catalog API allows data assets to be registered with only one set of profile information included.
Viewing data profile information
Once you find a suitable data source with a profile, you can view the data profile details. To view the data profile, select a data asset and choose Data Profile in the Data Catalog portal window.
A data profile in Azure Data Catalog shows table and column profile information including:
Object data profile
- Number of rows
- Table size
- When the object was last updated
Column data profile
- Column data type
- Number of distinct values
- Number of rows with NULL values
- Minimum, maximum, average, and standard deviation for column values
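To make the column-level statistics concrete, here is a minimal Python sketch that computes similar values for an in-memory list. It is not how Data Catalog computes its profiles. The function name profile_column is hypothetical, None stands in for NULL, and two choices are assumptions: the population standard deviation, and leaving NULLs out of the distinct count.

```python
import statistics


def profile_column(values):
    """Compute simple column statistics; None represents a NULL value."""
    non_null = [v for v in values if v is not None]
    profile = {
        # Rows whose value is NULL
        "null_count": len(values) - len(non_null),
        # Assumption: NULL is not counted as a distinct value
        "distinct_count": len(set(non_null)),
    }
    if non_null:
        profile["min"] = min(non_null)
        profile["max"] = max(non_null)
        profile["avg"] = statistics.mean(non_null)
        # Assumption: population (not sample) standard deviation
        profile["stdev"] = statistics.pstdev(non_null)
    return profile


sample = [10, 20, 20, None, 30]
print(profile_column(sample))
```

Comparing these numbers against what you expect from a column (for example, an unexpectedly high NULL count) is one way to spot anomalies before integrating the data.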
Summary
Data profiling provides statistics and information about registered data assets to help you determine whether the data is suitable for solving business problems. Along with annotating and documenting data sources, data profiles can give users a deeper understanding of your data.