Data catalog
The data catalog registers and maintains information about your data in a centralized place and makes it available to the organization. It helps enterprises avoid duplicate data products caused by different project teams ingesting the same data. We recommend that you create a data catalog service to define the metadata of the data products stored across the data landing zones.
Cloud-scale analytics depends on Microsoft Purview to register enterprise data sources, classify them, ensure data quality, and offer secure, self-service access.
Microsoft Purview is a tenant-based service that can communicate with each data landing zone through a Managed Virtual Network deployed to the region of your data landing zones. You can deploy Azure Managed Virtual Network integration runtimes (IRs) within Microsoft Purview Managed Virtual Networks in any available Microsoft Purview region. From there, the Managed Virtual Network IR can use private endpoints to securely connect to and scan the supported data sources. For more information, see Use Managed virtual network with your Microsoft Purview account. Creating a Managed Virtual Network IR within a Managed Virtual Network ensures that the data integration process is isolated and secure.
When using Azure Databricks, we recommend using Azure Databricks Unity Catalog in addition to Microsoft Purview. Azure Databricks Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. For best practices for setting up Unity Catalog, see Unity Catalog best practices.
Note
Although this documentation focuses primarily on using Microsoft Purview for governance, enterprises might have invested in other products, such as Alation, Okera, or Collibra. These solutions are subscription-based, and we recommend deploying them to the data management landing zone. Be aware that some custom integration might be required.
Data discovery
Data discovery reflects the state of all the data that the enterprise owns. This data is known as the data estate. During data discovery, the data estate is scanned and classified. The data scanning process connects directly to the data source according to a set schedule.
As you add a new data landing zone to the environment, the associated data lakes and polyglot persistence sources must be registered as sources for the data catalog crawlers to scan.
With automated discovery of your data estate to populate the catalog, you can:
- Crawl metadata from Azure and on-premises data sources
- Scan your data lakes, blobs, and other supported targets
- Extract schema from your data targets for XML, TSV, CSV, PSV, SSV, JSON, Parquet, Avro, and ORC file types
- Allow automated catalog updates through configurable scheduling of scans and scan rule sets
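To make schema extraction concrete, the following is a minimal sketch of how a header and simple column types could be inferred from a delimited text file. It is an illustration only and doesn't represent how Microsoft Purview implements scanning. The function names, the delimiter mapping, and the three-type model (`int`, `float`, `string`) are assumptions made for this example.

```python
import csv
import io

# Assumed delimiters for three of the delimited formats in the list above.
DELIMITERS = {"csv": ",", "tsv": "\t", "psv": "|"}


def infer_type(values):
    """Return 'int', 'float', or 'string' for a list of sample strings."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def extract_schema(text, file_type, sample_rows=100):
    """Read the header row and infer a type per column from sampled rows."""
    reader = csv.reader(io.StringIO(text), delimiter=DELIMITERS[file_type])
    header = next(reader)
    # Sample at most `sample_rows` data rows after the header.
    rows = [row for _, row in zip(range(sample_rows), reader)]
    schema = []
    for index, column in enumerate(header):
        values = [row[index] for row in rows if index < len(row) and row[index] != ""]
        # Columns with no sampled values default to 'string'.
        schema.append((column, infer_type(values) if values else "string"))
    return schema


sample = "id|name|price\n1|a|2.5\n2|b|3\n"
print(extract_schema(sample, "psv"))
```

Sampling a bounded number of rows keeps the cost of a scan predictable on large files, at the price of possibly missing values that appear only later in the file.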
Important
When you add a new data landing zone to the environment, register the associated data lakes and polyglot storage through Azure DevOps as a source for the data catalog crawlers to scan, govern, and manage data integrity.
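As a rough sketch of the registration check described above, a deployment pipeline could compare the storage endpoints in a new data landing zone with the sources already registered in the catalog and report the gap. The function name and the endpoint values are hypothetical, and the sketch deliberately doesn't call any Microsoft Purview or Azure DevOps API. In practice, both lists would come from your own inventory and catalog tooling.

```python
def unregistered_sources(landing_zone_sources, registered_sources):
    """Return landing zone endpoints that aren't yet registered in the catalog."""

    def normalize(endpoint):
        # Compare case-insensitively and ignore a trailing slash.
        return endpoint.lower().rstrip("/")

    registered = {normalize(source) for source in registered_sources}
    return sorted(
        source for source in landing_zone_sources if normalize(source) not in registered
    )


# Hypothetical endpoints used only for illustration.
landing_zone = [
    "https://examplerawlake.dfs.core.windows.net/",
    "https://examplecuratedlake.dfs.core.windows.net/",
]
catalog = ["https://ExampleRawLake.dfs.core.windows.net"]

print(unregistered_sources(landing_zone, catalog))
```

Running a check like this as a pipeline step is one way to catch a data lake that was deployed but never registered for scanning.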
Data classification
Microsoft Purview allows you to apply system or custom data classifications on file, table, or column assets.
Data classifications are like subject tags. Microsoft Purview marks and identifies the content of specific data types found within your data estate during scanning. You use sensitivity labels to identify the categories of classification types within your organizational data. You can also use sensitivity labels to group the policies you wish to apply to each category. Microsoft Purview makes use of the same sensitive information types as Microsoft 365, allowing you to extend your existing security policies and protections across your entire content and data estate.
Microsoft Purview can scan and automatically classify documents. For example, if you have a file named multiple.docx and it has a national ID number in its content, Microsoft Purview adds a classification such as EU National Identification Number in the asset detail page.
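The idea of content-based classification can be sketched with plain regular expressions. The following example is conceptual only: the rule labels and patterns are hypothetical, they aren't the system classifications that Microsoft Purview ships, and the code doesn't reflect how Microsoft Purview evaluates its rules.

```python
import re

# Hypothetical rules for illustration: classification label -> pattern.
RULES = {
    "Example National ID": re.compile(r"\b[A-Z]{2}\d{6}\b"),
    "Email Address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def classify_text(text, rules=RULES):
    """Return the sorted labels of every rule that matches the text at least once."""
    return sorted(label for label, pattern in rules.items() if pattern.search(text))


document = "Employee AB123456 can be reached at jo@example.com"
print(classify_text(document))
```

A single asset can carry several classifications, which is why the function returns a list of labels rather than a single label.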
Microsoft Defender for SQL is a feature available for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics. It includes functionality for discovering and classifying sensitive data, surfacing and mitigating potential database vulnerabilities, and detecting anomalous activities that could indicate a threat to your database. Microsoft Defender for SQL provides a single go-to location for enabling and managing these capabilities.