Data quality for Google BigQuery (Preview)

Note

The Microsoft Purview Data Catalog is changing its name to Microsoft Purview Unified Catalog. All the features will stay the same. You'll see the name change when the new Microsoft Purview Data Governance experience is generally available in your region. Check the name in your region.

Supported capabilities

When scanning Google BigQuery source, Microsoft Purview supports:

  • Extracting technical metadata including:
    • Projects and datasets
    • Tables including the columns
    • Views including the columns
  • Fetching static lineage on assets relationships among tables and views.

When setting up scan, you can choose to scan an entire Google BigQuery project. You can also scope the scan to a subset of datasets matching the given names, or name patterns.

Known limitations

  • Currently, Microsoft Purview only supports scanning Google BigQuery datasets in US multi-regional location. If the specified dataset is in other location for example, us-east1 or EU, you observe scan completes but no assets shown up in Microsoft Purview.
  • When object is deleted from the data source, currently the subsequent scan isn't automatically removed the corresponding asset in Microsoft Purview.

Configure datamap scan to catalog Google BigQuery data in Microsoft Purview

Register a Google BigQuery project

  • Open the Microsoft Purview, and select Data Map on the left navigation.
  • Select Register.
  • On Register sources, select Google BigQuery. Select Continue.
    • Enter a Name that the data source will be listed within the Catalog.
    • Enter the ProjectID. This should be a fully qualified project ID. For example, mydomain.com: myProject
    • Select a collection from the list.
    • Select Register.

Set up a datamap scan for Google BigQuery project

  • Make sure a self-hosted integration runtime is set up. If it isn't set up, use the steps mentioned in prerequisites
  • Navigate to Sources.
  • Select the registered BigQuery project.
  • Select + New scan.
  • Provide the below details:
    • Name: The name of the scan
    • Connect via integration runtime: Select the configured self-hosted integration runtime
    • Credential: While configuring BigQuery credential, make sure to:
      • Select Basic Authentication as the Authentication method
      • Provide the email ID of the service account in the User name field. For example, xyz@developer.gserviceaccount.com
      • Follow below steps to generate the private key. Copy the entire JSON key file and store it as the value of a Key Vault secret.
      • To create a new private key from Google's cloud platform:
        • In the navigation menu, select IAM (Identity Access Management), and select Admin --> Service Accounts --> Select a project -->
        • Select the email address of the service account that you want to create a key for.
        • Select the Keys tab.
        • Select the Add key drop-down menu, then select Create new key.
        • Choose JSON format.
    • Specify the path to the JDBC (Java Database Connectivity) driver location in your machine where self-host integration runtime is running. For example: D:\Drivers\GoogleBigQuery.
    • Specify a list of BigQuery datasets to import. For example, dataset1;dataset2. When the list is empty, all available datasets are imported.
    • Maximum memory (in GB) available on your VM (Virtual Machine) to be used by scanning processes. This is dependent on the size of Google BigQuery project to be scanned.
  • Select Test connection.
  • Select Continue.
  • Choose your scan trigger. You can set up a schedule or ran the scan once.
  • Review your scan and select Save and Run.

Once scanned, the data assets in Google BigQuery project will be available on the data catalog search. For more details about how to connect and manage Google BigQuery in Microsoft Purview, follow this document.

Important

Deleting your scan does not delete catalog assets created from previous scans.

Set up connection to Google BigQuery project for data quality scan

At this point we have the scanned asset ready for cataloging and governance. Associate the scanned assets to the Data Products in a Governance Domain to set up Data Quality scan.

  1. Select Data quality > Governance Domain > Manage tab to create connection.

    Screenshot that shows connection page navigation.

  2. Configure connection

    • Add connection name and description
    • Select source type Google BigQuery
    • Add Project ID, Dataset name, and Table name
    • Select Service account private key
      • Add Azure subscription
      • Key vault connection
      • secret name
      • secret version
  3. Test connection to make the data source connection is successfully configured.

    Screenshot that shows how to set up google BigQuery connection.

    Screenshot that shows how to configure connection for google BigQuery.

Important

Data Quality stewards need read only access to Google BigQuery to setup data quality connection.

Profiling and data quality scanning for data in Google BigQuery

After completed connection setup successfully, you can profile, create and apply rules, and run Data Quality scan of your data in Google BigQuery. Follow the step-by-step guideline described in below documents:

Reference documents