Share via


How to data profile data sources in Azure Data Catalog

Important

Azure Data Catalog was retired on May 15, 2024.

For data catalog features, use the Microsoft Purview service, which offers unified data governance for your entire data estate.

Introduction

Microsoft Azure Data Catalog is a fully managed cloud service that serves as a system of registration and system of discovery for enterprise data sources. In other words, Azure Data Catalog is all about helping people discover, understand, and use data sources, and helping organizations to get more value from their existing data. When a data source is registered with Azure Data Catalog, its metadata is copied and indexed by the service, but the story doesn’t end there.

The Data Profiling feature of Azure Data Catalog examines the data from supported data sources in your catalog and collects statistics and information about that data. It's easy to include a profile of your data assets. When you register a data asset, choose Include Data Profile in the data source registration tool.

What is data profiling?

Data profiling examines the data in the data source being registered, and collects statistics and information about that data. During data source discovery, these statistics can help you determine the suitability of the data to solve their business problem.

The following data sources support data profiling:

  • SQL Server (including Azure SQL DB and Azure Synapse Analytics) tables and views
  • Oracle tables and views
  • Teradata tables and views
  • Hive tables

Including data profiles when registering data assets helps users answer questions about data sources, including:

  • Can it be used to solve my business problem?
  • Does the data conform to particular standards or patterns?
  • What are some of the anomalies of the data source?
  • What are possible challenges of integrating this data into my application?

Note

You can also add documentation to an asset to describe how data could be integrated into an application. See How to document data sources.

How to include a data profile when registering a data source

It's easy to include a profile of your data source. When you register a data source, in the Objects to be registered panel of the data source registration tool, choose Include Data Profile.

The Include Data Profile box is checked at the bottom of the Objects to be registered window.

To learn more about how to register data sources, see How to register data sources and Get started with Azure Data Catalog.

Filtering on data assets that include data profiles

To discover data assets that include a data profile, you can include has:tableDataProfiles or has:columnsDataProfiles as one of your search terms.

Note

Selecting Include Data Profile in the data source registration tool includes both table and column-level profile information. However, the Data Catalog API allows data assets to be registered with only one set of profile information included.

Viewing data profile information

Once you find a suitable data source with a profile, you can view the data profile details. To view the data profile, select a data asset and choose Data Profile in the Data Catalog portal window.

The data profile tab is selected at the top of the page, between columns and documentation.

A data profile in Azure Data Catalog shows table and column profile information including:

Object data profile

  • Number of rows
  • Table size
  • When the object was last updated

Column data profile

  • Column data type
  • Number of distinct values
  • Number of rows with NULL values
  • Minimum, maximum, average, and standard deviation for column values

Summary

Data profiling provides statistics and information about registered data assets to help you determine the suitability of the data to solve business problems. Along with annotating, and documenting data sources, data profiles can give users a deeper understanding of your data.

See Also