Data lineage in Microsoft Purview
This article provides an overview of data lineage in the Microsoft Purview Unified Catalog. It also details how data systems can integrate with the catalog to capture lineage of data. Microsoft Purview can capture lineage for data in different parts of your organization's data estate, and at different levels of preparation including:
- Raw data staged from various platforms
- Transformed and prepared data
- Data used by visualization platforms
Use cases
Data lineage is broadly understood as the lifecycle of data: where it originates and how it moves over time across the data estate. It's used for backward-looking scenarios such as troubleshooting, root cause analysis in data pipelines, and debugging. Lineage is also used for data quality analysis, compliance, and “what if” scenarios, often referred to as impact analysis. Lineage is represented visually to show data moving from source to destination, including how the data was transformed. Given the complexity of most enterprise data environments, these views can be hard to understand without some consolidation or masking of peripheral data points.
Lineage experience in Unified Catalog
Unified Catalog connects with other data processing, storage, and analytics systems to extract lineage information. The information is combined to represent a generic, scenario-specific lineage experience in the catalog.
Your data estate might include systems for data extraction and transformation (ETL/ELT), analytics, and visualization. Each system captures rich static and operational metadata that describes the state and quality of the data within the system's boundary. The goal of lineage in Unified Catalog is to extract the movement, transformation, and operational metadata from each data system at the lowest grain possible.
The following example is a typical use case of data moving across multiple systems, where Unified Catalog would connect to each of the systems for lineage.
- Data Factory copies data from on-prem/raw zone to a landing zone in the cloud.
- Data processing systems like Synapse and Databricks process and transform data from the landing zone to the curated zone using notebooks.
- The data is further processed into analytical models for optimal query performance and aggregation.
- Data visualization systems consume the datasets and process them through their meta model to create BI dashboards, ML experiments, and so on.
Lineage granularity
This section covers the granularity at which Microsoft Purview gathers lineage information. This granularity can vary based on the data systems supported in Microsoft Purview.
Entity level lineage: Sources > Process > Targets
- Lineage is represented as a graph. Typically, it contains source and target entities in data storage systems that are connected by a process invoked by a compute system.
- Data systems connect to Unified Catalog to generate and report a unique object that references the physical object in the underlying data system, for example, a SQL stored procedure or a notebook.
- High-fidelity lineage, along with other metadata like ownership, is captured to show the lineage in a human-readable format for source and target entities. For example, lineage is shown at the Hive table level instead of at the partition or file level.
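The Sources > Process > Targets shape can be sketched as a small directed graph. The following is a minimal illustration in plain Python, not the Microsoft Purview API: the `LineageGraph` class, its methods, and the entity names (`onprem.sales`, `adf_copy`, and so on) are all hypothetical and exist only to show how a process node connects source and target entities and how the graph can be walked in either direction.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical in-memory lineage graph: entities linked through processes."""

    def __init__(self):
        self._downstream = defaultdict(set)  # node -> nodes it feeds
        self._upstream = defaultdict(set)    # node -> nodes it is fed by

    def add_process(self, process, sources, targets):
        """Record one process that reads `sources` and writes `targets`."""
        for source in sources:
            self._downstream[source].add(process)
            self._upstream[process].add(source)
        for target in targets:
            self._downstream[process].add(target)
            self._upstream[target].add(process)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal that returns every node reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def upstream(self, entity):
        """Everything the entity was derived from (root cause analysis)."""
        return self._walk(entity, self._upstream)

    def downstream(self, entity):
        """Everything that depends on the entity (impact analysis)."""
        return self._walk(entity, self._downstream)


# Hypothetical entities and processes, loosely modeled on a copy + notebook flow.
graph = LineageGraph()
graph.add_process("adf_copy", ["onprem.sales"], ["landing.sales"])
graph.add_process("notebook_clean", ["landing.sales"], ["curated.sales"])

print(sorted(graph.upstream("curated.sales")))
# ['adf_copy', 'landing.sales', 'notebook_clean', 'onprem.sales']
print(sorted(graph.downstream("onprem.sales")))
# ['adf_copy', 'curated.sales', 'landing.sales', 'notebook_clean']
```

Walking upstream corresponds to the backward-looking scenarios (troubleshooting, root cause), while walking downstream corresponds to impact analysis.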
Column or attribute level lineage
Column-level lineage identifies the attributes of a source entity that are used to create or derive attributes in the target entity. The name of the source attribute can be retained or renamed in the target. Systems like Azure Data Factory (ADF) can do a one-to-one copy from an on-premises environment to the cloud. For example: Table1/ColumnA -> Table2/ColumnA.
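As a rough illustration of column-level lineage, the sketch below stores each target column together with the source columns it was derived from, and then traces a column back to its origin. The dictionary layout and the `trace_column` helper are assumptions made for this example and don't reflect how Microsoft Purview stores column mappings. `Table3/ColumnB` is a hypothetical renamed column added to show that a rename doesn't break the trace. The sketch assumes the mappings contain no cycles.

```python
# Hypothetical column mappings: target (table, column) -> list of source columns.
column_lineage = {
    ("Table2", "ColumnA"): [("Table1", "ColumnA")],  # one-to-one copy, name retained
    ("Table3", "ColumnB"): [("Table2", "ColumnA")],  # derived column, renamed
}


def trace_column(target, lineage):
    """Return the set of origin columns that `target` ultimately derives from.

    A column with no recorded sources is treated as an origin.
    Assumes the lineage mappings are acyclic.
    """
    sources = lineage.get(target)
    if not sources:
        return {target}
    origins = set()
    for source in sources:
        origins |= trace_column(source, lineage)
    return origins


print(trace_column(("Table3", "ColumnB"), column_lineage))
# {('Table1', 'ColumnA')}
```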
Process execution status
To support root cause analysis and data quality scenarios, we capture the execution status of the jobs in data processing systems. The goal isn't to replace the monitoring capabilities of those data processing systems.
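To show why execution status is useful alongside lineage, the following sketch attaches a status to each process run and looks for failed processes upstream of a given entity. The `ProcessRun` and `RunStatus` types, the `failed_upstream` helper, and the entity names are all hypothetical. This is a plain-Python illustration of the root cause idea, not a description of how Microsoft Purview records or exposes run status.

```python
from dataclasses import dataclass
from enum import Enum


class RunStatus(Enum):
    """Hypothetical status values for a process run."""
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class ProcessRun:
    """Hypothetical record of one process execution and the entities it touched."""
    process: str
    sources: tuple
    targets: tuple
    status: RunStatus


def failed_upstream(entity, runs):
    """Return the names of failed processes anywhere upstream of `entity`."""
    failed, visited, stack = [], set(), [entity]
    while stack:
        current = stack.pop()
        for run in runs:
            if current in run.targets and run.process not in visited:
                visited.add(run.process)
                if run.status is RunStatus.FAILED:
                    failed.append(run.process)
                # Keep walking back through the inputs of this process.
                stack.extend(run.sources)
    return failed


# Hypothetical runs: the copy step failed, the notebook step succeeded.
runs = [
    ProcessRun("adf_copy", ("onprem.sales",), ("landing.sales",), RunStatus.FAILED),
    ProcessRun("notebook_clean", ("landing.sales",), ("curated.sales",), RunStatus.SUCCEEDED),
]

print(failed_upstream("curated.sales", runs))
# ['adf_copy']
```

In this example, the notebook run succeeded, but the data it produced still depends on a failed copy step, which is the kind of signal a data quality investigation would look for.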
Summary
Lineage is a critical feature of Unified Catalog to support quality, trust, and audit scenarios. The goal of Unified Catalog is to build a robust framework where all the data systems within your environment can naturally connect and report lineage. Once the metadata is available, Unified Catalog can bring together the metadata provided by data systems to power data governance use cases.