
Data quality with Microsoft Purview Unified Catalog

Data quality in Microsoft Purview Unified Catalog empowers governance domain and data owners to assess and oversee the quality of their data ecosystem, facilitating targeted actions for improvement. In today's AI-driven landscape, the reliability of data directly impacts the accuracy of AI-driven insights and recommendations. Without trustworthy data, there's a risk of eroding trust in AI systems and hindering their adoption.

Poor data quality or incompatible data structures can hamper business processes and decision-making capabilities. Data quality addresses these challenges by offering users the ability to evaluate data quality using no-code/low-code rules, including out-of-the-box (OOB) rules and AI-generated rules. These rules are applied at the column level and aggregated to provide scores at the levels of data assets, data products, and governance domains, ensuring end-to-end visibility of data quality within each domain.

Data quality also incorporates AI-powered data profiling capabilities, recommending columns for profiling while allowing human intervention to refine these recommendations. This iterative process not only enhances the accuracy of data profiling but also contributes to the continuous improvement of the underlying AI models.

By applying data quality, organizations can effectively measure, monitor, and enhance the quality of their data assets, bolstering the reliability of AI-driven insights and fostering confidence in AI-based decision-making processes.

Data quality life cycle

  1. Assign user(s) data quality steward permissions in Unified Catalog to use all data quality features.
  2. Register and scan a data source in your Microsoft Purview Data Map.
  3. Add your data asset to a data product.
  4. Set up a data source connection to prepare your source for data quality assessment.
  5. Configure and run data profiling for an asset in your data source.
    1. When profiling is complete, browse the results for each column in the data asset to understand your data's current structure and state.
  6. Set up data quality rules based on the profiling results, and apply them to your data asset. (A local sketch of what a profiling snapshot and a rule check compute follows this list.)
  7. Configure and run a data quality scan on a data product to assess the quality of all supported assets in the data product.
  8. Review your scan results to evaluate your data product's current data quality.
  9. Repeat steps 5-8 periodically over your data asset's life cycle to ensure it's maintaining quality.
  10. Continually monitor your data quality:
    1. Review data quality actions to identify and resolve problems.
    2. Set data quality notifications to alert you to quality issues.
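
To make steps 5 and 6 concrete, here's a minimal local sketch of the kinds of column-level statistics a profile reports and what a completeness rule checks. It uses pandas on a toy table purely as an illustration; it isn't the Purview service API, and the column names and 95% threshold are hypothetical.

```python
# Local illustration only: the Purview service computes profiling and rule
# scores itself; this sketch just shows what those measures mean.
import pandas as pd

# A toy customer asset with deliberate quality problems.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],                               # duplicate id
    "email": ["a@x.com", None, "c@x.com", "d@x.com", "d@x.com"],  # a null
})

# Step 5: a profiling snapshot per column (completeness, uniqueness, duplicates).
profile = {
    col: {
        "completeness": df[col].notna().mean(),      # share of non-null values
        "uniqueness": df[col].nunique() / len(df),   # distinct values / rows
        "duplicates": int(df[col].duplicated().sum()),
    }
    for col in df.columns
}
print(profile)

# Step 6: a completeness rule on 'email' with a hypothetical 95% threshold.
score = df["email"].notna().mean() * 100
print(f"email completeness score: {score:.0f} (pass: {score >= 95})")
```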

Supported data quality regions

Data quality is currently supported in the following regions.

Supported multicloud data sources

View the list of supported data sources.

Important

Data quality for Parquet files is designed to support:

  1. A directory with Parquet part files. For example: ./Sales/{Parquet Part Files}. The fully qualified name must follow https://(storage account).dfs.core.windows.net/(container)/path/path2/{SparkPartitions}. Make sure there are no {n} patterns in the directory/subdirectory structure; the FQN must lead directly to {SparkPartitions}.
  2. A directory with partitioned Parquet files, partitioned by columns within the dataset, such as sales data partitioned by year and month. For example: ./Sales/{Year=2018}/{Month=Dec}/{Parquet Part Files}.

Both of these scenarios, which present a consistent Parquet dataset schema, are supported. Limitation: arbitrary hierarchies of directories containing Parquet files aren't supported. We recommend presenting data in the structure described in (1) or (2).
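
As an illustration, both supported layouts fall out of standard Spark writes. The sketch below assumes PySpark is available and uses local placeholder paths; against ADLS, the destination would be the dfs.core.windows.net FQN shown above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-layouts").getOrCreate()

sales = spark.createDataFrame(
    [(2018, "Dec", 100.0), (2018, "Dec", 250.0), (2019, "Jan", 75.0)],
    ["Year", "Month", "Amount"],
)

# Layout (1): one directory of Parquet part files.
# Produces ./Sales_flat/part-00000-*.parquet, ...
sales.write.mode("overwrite").parquet("./Sales_flat")

# Layout (2): partitioned by columns within the dataset.
# Produces ./Sales_partitioned/Year=2018/Month=Dec/part-*.parquet, ...
sales.write.mode("overwrite").partitionBy("Year", "Month").parquet("./Sales_partitioned")
```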

Currently, Microsoft Purview can only run data quality scans using Managed Identity as the authentication option. Data quality services run on Apache Spark 3.4 and Delta Lake 2.4.

Data quality features

  • Data source connection configuration
    • Configure a connection to give the Purview data quality SaaS application read access to data for quality scanning and profiling.
    • Microsoft Purview uses Managed Identity as the authentication option.
  • Data profiling
    • AI-enabled data profiling experience.
    • Industry-standard statistical snapshot (distribution, min, max, standard deviation, uniqueness, completeness, duplicates, and more).
    • Drill down into column-level profiling measures.
  • Data quality rules
    • Out-of-the-box rules to measure six industry-standard data quality dimensions (completeness, consistency, conformity, accuracy, freshness, and uniqueness).
    • Custom rule creation with a number of out-of-the-box functions and expression values.
    • Auto-generated rules with an AI-integrated experience.
  • Data quality scanning
    • Select and assign rules to columns for a data quality scan.
    • Apply a data freshness rule at the entity/table level to measure the data freshness SLA.
    • Schedule data quality scan jobs for a time period (hourly, daily, weekly, monthly, and so on).
  • Data quality job monitoring
    • Monitor data quality job status (active, completed, failed, and so on).
    • Browse the data quality scanning history.
  • Data quality scoring
    • Data quality score at the rule level (the quality score for a rule applied to a column).
    • Data quality scores for data assets, data products, and governance domains (one governance domain can have many data products, one data product can have many data assets, and one data asset can have many columns); a sketch of this rollup follows this list.
  • Data quality for critical data elements (CDEs)
    • One of the key features of data quality: the ability to apply data quality rules to the logical construct of CDEs, which then propagate down to the physical data elements that comprise them. By defining data quality rules at the CDE level, organizations can establish specific criteria and thresholds that CDEs must meet to maintain their quality.
  • Data quality alerts
    • Configure alerts to notify data owners and data stewards when a data quality score misses the expected threshold.
    • Configure an email alias or distribution group to receive notifications about data quality issues.
  • Data quality actions
    • An actions center for data quality, with actions to address anomaly states, including diagnostic queries that help a data quality steward zero in on the specific data to fix for each anomaly state.
  • Data quality managed virtual network
    • A virtual network managed by data quality that connects with private endpoints to your Azure data sources.
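
As a rough illustration of the scoring rollup described above, the sketch below aggregates rule-level scores upward with a plain average. The averaging, the example scores, and the column names are assumptions for illustration; the service's actual aggregation logic may differ.

```python
# Illustrative rollup: rule scores -> asset score -> product score -> domain
# score. A plain average is assumed here; Purview's exact formula may differ.
from statistics import mean

# Rule-level scores (0-100) for the columns of one data asset.
asset_rule_scores = {
    "customer_id": [100, 98],  # e.g., uniqueness and completeness rules
    "email": [92],             # e.g., a conformity (format) rule
}

asset_score = mean(s for scores in asset_rule_scores.values() for s in scores)

# One data product can hold many assets; one governance domain, many products.
product_score = mean([asset_score, 88.0, 95.0])   # other asset scores assumed
domain_score = mean([product_score, 91.0])        # other product score assumed

print(f"asset={asset_score:.1f} product={product_score:.1f} domain={domain_score:.1f}")
```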

Data residency and encryption

Data quality metadata and the profiling summary are stored in a Microsoft-managed storage account. They're stored in the same region as the data source, so data residency remains intact. All data is encrypted. Data quality uses the Microsoft Purview resource provider's regional user data store for metadata, which handles all encryption and is common across all Microsoft Purview services. If you want more control over your data encryption with a customer-managed key (CMK), there's a separate process for it. (Learn more about Microsoft Purview Customer Key.)

Data quality compute pricing

Data quality usage is billed against pay-as-you-go Data Governance Processing Unit (DGPU) meters. One DGPU is the amount of service performance consumed over 60 minutes, and it's available in three performance options: basic, standard, and advanced. The basic SKU is the default performance option until a higher option is selected. Pricing is $15 per processing unit for the basic SKU, $60 for the standard SKU, and $240 for the advanced SKU. For example, if a customer runs 100 data quality rules in a single day, and each run consumes 0.02 DGPU on the basic SKU, the total for that day is 2 DGPU, costing the customer $30. Learn more about Microsoft Purview Unified Catalog pricing.
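
Spelling out the arithmetic in that example (using the per-DGPU prices listed above):

```python
# DGPU cost arithmetic from the example above: 100 rule runs in a day,
# each consuming 0.02 DGPU on the basic SKU at $15 per DGPU.
PRICE_PER_DGPU = {"basic": 15, "standard": 60, "advanced": 240}  # USD

runs_per_day = 100
dgpu_per_run = 0.02

daily_dgpu = runs_per_day * dgpu_per_run            # 2.0 DGPU
daily_cost = daily_dgpu * PRICE_PER_DGPU["basic"]   # $30.00

print(f"{daily_dgpu} DGPU/day -> ${daily_cost:.2f}/day")
```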

Here's an example of the processing units consumed by simple to complex rules at different data volumes, tested with the standard SKU. Each cell shows elapsed time and processing units (PU).

| Rule complexity | 10,000 records | 100,000 records | 1,000,000 records | 10,000,000 records | 100,000,000 records | 1,000,000,000 records |
| --- | --- | --- | --- | --- | --- | --- |
| Simple | 1m 1s / 0.02 | 1m 1s / 0.02 | 1m 1s / 0.02 | 1m 16s / 0.02 | 1m 16s / 0.02 | 1m 31s / 0.03 |
| Medium | 1m 1s / 0.02 | 1m 1s / 0.02 | 1m 1s / 0.02 | 1m 16s / 0.02 | 1m 31s / 0.03 | 2m 1s / 0.03 |
| High | 1m 1s / 0.02 | 1m 1s / 0.02 | 1m 31s / 0.03 | 1m 32s / 0.03 | 2m 1s / 0.03 | 2m 51s / 0.04 |

Limitation

  • vNet isn't supported for Google BigQuery, Snowflake, or Azure Databricks Unity Catalog.

Next steps

  1. Assign user(s) data quality steward permissions in Unified Catalog to use all data quality features.
  2. Set up a data source connection to prepare your source for data quality assessment.
  3. Configure and run data profiling for an asset in your data source.