Data catalog

The data catalog registers and maintains metadata about your data in a centralized place and makes it available across the organization. It helps enterprises avoid duplicate data products caused by different project teams ingesting the same data.

We recommend you create a data catalog service to define the metadata of the data products stored across the data landing zones.

Cloud-scale analytics depends on Microsoft Purview to register enterprise data sources, classify them, ensure data quality, and offer secure, self-service access.

Microsoft Purview is a tenant-based service. It can communicate with each data landing zone through a Managed Virtual Network deployed to the region of your data landing zones. You can deploy Azure Managed Virtual Network integration runtimes (IRs) within Microsoft Purview Managed Virtual Networks in any available Microsoft Purview region. From there, the Managed Virtual Network IR can use private endpoints to securely connect to and scan the supported data sources. For more information, see Use Managed virtual network with your Microsoft Purview account. Creating a Managed Virtual Network IR within a Managed Virtual Network ensures that the data integration process is isolated and secure.

Note

Although this documentation focuses primarily on using Microsoft Purview for governance, enterprises might have invested in other products, such as Alation, Okera, or Collibra. These solutions are subscription based, and we recommend deploying them to the data management landing zone. Be aware that some custom integration might be required.

Data discovery

Data discovery reflects the state of all the data that the enterprise owns. This data is known as the data estate. During data discovery, the data estate is scanned and classified. The data scanning process connects directly to the data source according to a set schedule.

As you add a new data landing zone to the environment, the associated data lakes and polyglot persistence sources must be registered as sources for the data catalog crawlers to scan.

With automated discovery of your data estate to populate the catalog, you can:

  • Crawl metadata from Azure and on-premises data sources
  • Scan your data lakes, blobs, and other supported targets
  • Extract schema from your data targets for XML, TSV, CSV, PSV, SSV, JSON, Parquet, Avro, and ORC file types
  • Allow automated catalog updates through configurable scheduling of scans and scan rule sets
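As a small illustration of the file types listed above, the following sketch partitions a set of file paths by whether their extension appears in that list. It is a local pre-check under a simplifying assumption (the extension reflects the format), not a description of how Microsoft Purview itself decides what to extract. The function names and example paths are hypothetical.

```python
from pathlib import PurePosixPath

# File types listed above as supporting schema extraction.
SCHEMA_FILE_TYPES = frozenset(
    {"xml", "tsv", "csv", "psv", "ssv", "json", "parquet", "avro", "orc"}
)


def supports_schema_extraction(path: str) -> bool:
    """Return True when the file extension is one of the listed types."""
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix in SCHEMA_FILE_TYPES


def split_by_schema_support(paths):
    """Partition paths into (schema-extractable, other) lists."""
    supported, other = [], []
    for path in paths:
        (supported if supports_schema_extraction(path) else other).append(path)
    return supported, other
```

A check like this can help estimate, before the first scan, how much of a data lake is likely to surface with schema in the catalog and how much will appear as files only.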

Important

When you add a new data landing zone to the environment, register the associated data lakes and polyglot storage through Azure DevOps as sources for the data catalog crawlers to scan, so that the data can be governed and its integrity managed.
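The registration step can be scripted so that a pipeline runs it whenever a data landing zone is added. The following is a minimal sketch that only builds the registration bodies and does not send them. The `DataLake` type, the naming scheme, the storage account names, and the use of the zone name as the collection name are all hypothetical. The body shape (`kind`, `properties.endpoint`, `properties.collection`) reflects our understanding of the Microsoft Purview scanning REST API and is an assumption to verify against the API version you target.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataLake:
    """A storage account in a data landing zone (names are hypothetical)."""
    name: str
    storage_account: str


def build_registration(lake: DataLake, collection: str) -> dict:
    """Build a body that registers an ADLS Gen2 account as a data source.

    Assumption: field names follow the Purview scanning REST API;
    confirm them before sending the request.
    """
    return {
        "kind": "AdlsGen2",
        "properties": {
            "endpoint": f"https://{lake.storage_account}.dfs.core.windows.net/",
            "collection": {
                "referenceName": collection,
                "type": "CollectionReference",
            },
        },
    }


def registrations_for_zone(zone: str, lakes: list[DataLake]) -> dict[str, dict]:
    """Map each data source name to its registration body for one zone."""
    return {f"{zone}-{lake.name}": build_registration(lake, zone) for lake in lakes}
```

For example, `registrations_for_zone("dlz001", [DataLake("raw", "stdlz001raw")])` yields one entry keyed `dlz001-raw`. Keeping payload construction separate from the HTTP call makes the step easy to review and test in a pipeline before anything is registered.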

Data classification

Microsoft Purview allows you to apply system or custom data classifications on file, table, or column assets.

Data classifications are like subject tags. Microsoft Purview marks and identifies the content of specific data types found within your data estate during scanning. You use sensitivity labels to identify the categories of classification types within your organizational data. You can also use sensitivity labels to group the policies you wish to apply to each category. Microsoft Purview makes use of the same sensitive information types as Microsoft 365, allowing you to stretch your existing security policies and protections across your entire content and data estate.

Microsoft Purview can scan and automatically classify documents. For example, if you have a file named multiple.docx and it has a national ID number in its content, Microsoft Purview adds a classification such as EU National Identification Number in the asset detail page.
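To make the idea of pattern-based classification concrete, the following sketch tags a set of sampled values when enough of them match a regular expression. It is a simplified illustration, not Microsoft Purview's implementation. The `ClassificationRule` type, the `Employee ID` rule, its pattern, and the match threshold are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationRule:
    """A hypothetical pattern-based rule; not a Purview object."""
    name: str
    pattern: str            # regular expression each value must fully match
    min_match_ratio: float  # share of non-empty values that must match


def classify(values, rules):
    """Return the names of rules whose pattern matches enough sampled values."""
    # Ignore empty and whitespace-only values so they do not dilute the ratio.
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return []
    matched = []
    for rule in rules:
        regex = re.compile(rule.pattern)
        hits = sum(1 for v in sample if regex.fullmatch(v))
        if hits / len(sample) >= rule.min_match_ratio:
            matched.append(rule.name)
    return matched
```

With a rule such as `ClassificationRule("Employee ID", r"EMP-\d{6}", 0.6)`, a column where two of three non-empty values match is tagged, while a column with one match in three is not. A threshold of this kind is one way to tolerate a few malformed values without tagging a column on a single coincidental match.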

Microsoft Defender for SQL is a feature available for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics. It includes functionality for discovering and classifying sensitive data, surfacing and mitigating potential database vulnerabilities, and detecting anomalous activities that could indicate a threat to your database. Microsoft Defender for SQL provides a single go-to location for enabling and managing these capabilities.

Next steps