Section 2: Set up and register your data

If you don't have data sources available for scanning, you can follow these steps to fully deploy an Azure Data Lake Storage (ADLS Gen2) example.

Tip

If you already have a data source in the same tenant as your Microsoft Purview account, skip ahead to the next part of this section to scan your assets.

In a real data estate, you find many different systems in use for different data applications. There are reporting environments like Fabric and Snowflake, where teams use copies of data to build analytical solutions and power their reports and dashboards. There are also operational data systems that power the applications teams or customers use to complete business processes, and those applications collect or add data based on decisions made during the process.

To create a more realistic data estate, we recommend showing many sources of data in the catalog, covering the breadth of data uses a company might have. The types of data required to power a use case can vary widely: business users need reports and dashboards, analysts need conformed dimensions and facts to build reports, and data scientists or data engineers need raw source data that came directly from the system that collects it. All of these, and more, help different users see the importance of finding, understanding, and accessing data in the same place.

For some other tutorials to add data to your estate, you can follow these guides:

Prerequisites

  • Subscription in Azure: Create Your Azure Free Account Today
  • Microsoft Entra ID for your tenant: Microsoft Entra ID Governance
  • A Microsoft Purview Account
    • Admin access to the Microsoft Purview account. This is the default if you created the Microsoft Purview account. See: Permissions in new Microsoft Purview portal preview | Microsoft Learn
  • All resources (Microsoft Purview, your data source, and Microsoft Entra ID) have to be in the same cloud tenant.

Steps to set up your data estate

Create and populate a storage account

  1. Follow along with this guide to create a storage account: Create a storage account for Azure Data Lake Storage Gen2
  2. Create containers for your new data lake:
    1. Navigate to the Overview page of your storage account.
    2. Select the Containers tab under the Data storage section.
    3. Select the + Container button.
    4. Name the container 'bronze' and select the Create button.
    5. Repeat these steps to create a 'gold' container.
  3. Download some example CSV data from data.gov: Covid-19 Vaccination And Case Trends by Age Group, United States
  4. Upload the CSV to the container named 'bronze' in the storage account you created.
  5. Select the container named 'bronze' and select the Upload button.
  6. Browse to the location where you saved the CSV and select the Covid-19_Vaccination_Case _Trends file.
  7. Select Upload.
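
If the portal rejects a container name, it usually helps to check the name against the Azure container naming rules before retrying. The following is a minimal sketch of those rules in Python (3 to 63 characters; lowercase letters, digits, and hyphens; every hyphen surrounded by a letter or digit). The function name `is_valid_container_name` is hypothetical and isn't part of any Azure SDK.

```python
import re

# Container naming rules as a regular expression:
# - 3 to 63 characters in total
# - lowercase letters, digits, and hyphens only
# - starts with a letter or digit
# - every hyphen must be followed by a letter or digit, which also rules out
#   consecutive hyphens and a trailing hyphen
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:[a-z0-9]|-(?=[a-z0-9])){2,62}")


def is_valid_container_name(name: str) -> bool:
    """Hypothetical helper: True if `name` satisfies the rules above."""
    return bool(_CONTAINER_NAME.fullmatch(name))


# 'bronze' and 'gold' are the names used in this tutorial; the others are
# made-up counterexamples.
for candidate in ("bronze", "gold", "Gold", "my--data", "ab"):
    print(f"{candidate!r}: {is_valid_container_name(candidate)}")
```

Both names used in this tutorial pass the check. The uppercase, double-hyphen, and too-short examples don't.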

Create an Azure Data Factory

This step demonstrates how data moves between layers of a medallion data lake and ensures the data is in a standardized format that consumers would expect to use. It's a prerequisite for running Data Quality.

  1. Follow this guide to create an Azure Data Factory: Create an Azure Data Factory

  2. Copy the data from the CSV in the 'bronze' container to the 'gold' container as a Delta format table using this Azure Data Factory guide: Transform data using a mapping data flow

  3. Open the Azure Data Factory (ADF) experience from the Azure portal by selecting the Launch studio button on the Overview tab of the ADF resource created.

    Screenshot of launch ADF Studio from Azure portal.

  4. Select the Author tab in ADF studio.

    Screenshot of the select author in left navigation menu of Azure Data Factory.

  5. Select the + button and pick Data flow from the drop-down menu.

    Screenshot of the button to create a data flow.

  6. Name the dataflow 'CSVtoDeltaC19VaxTrends'.

  7. Select Add Source in the empty box.

    Screenshot of adding a data source for the dataflow.

  8. Set Source settings to:

    1. Output stream name: 'C19csv'
    2. Description: leave blank
    3. Source type: Inline
    4. Inline dataset type: Delimited Text
    5. Linked Service: Select the data lake where you stored the CSV
  9. Set Source options to:

    1. File mode: File
    2. File path: /bronze/ Covid-19_Vaccination_Case _Trends
    3. Allow no files found: leave unchecked
    4. Change data capture: leave unchecked
    5. Compression type: None
    6. Encoding: Default(UTF-8)
    7. Column delimiter: Comma (,)
    8. Row delimiter: Default(\r, \n, or \r\n)
    9. Quote character: Double quote (")
    10. Escape character: Backslash (\)
    11. First row as header: checked
    12. Leave the rest as defaults
  10. Select the small + next to the source you created and select Sink.

    Screenshot of creating a sink for the dataflow.

  11. Configure the sink, which defines the format and location of the stored data, so that the data moves from a CSV in 'bronze' to a Delta table in 'gold'.

    1. Set the Sink values (leave all settings as default unless specified)
    2. Sink type: Inline
    3. Inline dataset type: Delta
    4. Linked service: the same data lake as used in the source. Only the container is different.
  12. Set the Settings values (leave all settings as default unless specified)

    1. Folder path: gold/Covid19 Vaccine and Case Trends
  13. Type the folder path manually. The folder doesn't exist yet, so it can't be selected by browsing, and this name is how we want the data to be stored.

  14. Select Validate. This checks your data flow and provides instructions to fix any errors.

  15. Select Publish all.

    Screenshot of publishing the dataflow.

  16. Select the + button and select Pipeline from the drop-down menu.

    Screenshot of creating a pipeline.

  17. Name your pipeline 'CSV to Delta C19 Vax Trends'.

  18. Select the dataflow you created in the previous steps ('CSVtoDeltaC19VaxTrends') and drag and drop it onto the open pipeline tab.

  19. Select Validate.

  20. Select Publish.

  21. Select Debug (use activity runtime) to run the pipeline.

    Screenshot of running the pipeline to create delta table.

    Tip

    If you hit errors about spaces or invalid characters for Delta format, open the downloaded CSV and make corrections. Then reupload the CSV to the bronze container, overwriting the original, and rerun your pipeline.

  22. Navigate to your gold container in the data lake and you should now see the new Delta table created during the pipeline.
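
If the pipeline fails because of spaces or other invalid characters in column names, you can correct the CSV header with a script instead of by hand. The sketch below assumes the usual Delta restriction on column names (no space, comma, semicolon, braces, parentheses, newline, tab, or equals sign when column mapping isn't enabled) and uses Python's default CSV dialect, which may not match every setting of the ADF source. The helper names and the sample rows are made up for illustration; they aren't taken from the data.gov file.

```python
import csv
import io
import re

# Characters assumed to be rejected in Delta column names when column
# mapping is not enabled: space , ; { } ( ) newline tab =
_INVALID = re.compile(r"[ ,;{}()\n\t=]+")


def clean_column_name(name: str) -> str:
    """Replace each run of invalid characters with one underscore and trim the ends."""
    return _INVALID.sub("_", name.strip()).strip("_")


def clean_csv_header(src, dst) -> list:
    """Copy CSV rows from src to dst, rewriting only the header row."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = [clean_column_name(col) for col in next(reader)]
    writer.writerow(header)
    writer.writerows(reader)  # data rows are copied unchanged
    return header


# Made-up sample input standing in for the downloaded file.
sample = io.StringIO(
    "Date Administered,Age Group (Vax),7-day avg cases\n"
    "2021-03-01,18-24,12.5\n"
)
cleaned = io.StringIO()
print(clean_csv_header(sample, cleaned))
print(cleaned.getvalue())
```

To use this on the real file, open the downloaded CSV and a new output file with `open(..., newline="")` in place of the `io.StringIO` objects, then upload the cleaned file to the bronze container. The sketch doesn't check whether two cleaned names collide, so review the returned header before uploading.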

Scan your assets

If you haven't scanned data assets into your Microsoft Purview Data Map, then you can follow these steps to populate your data map.

Scanning sources in your data estate automatically collects the metadata of the data assets (tables, files, folders, reports, and so on) in those sources. By registering a data source and creating the scan, you establish technical ownership over the sources and assets displayed in the catalog, and you control who can access which metadata in Microsoft Purview. Sources and assets registered and stored at the domain level sit at the highest level of the access hierarchy. Typically, it's best to create some collections to scan the asset metadata into, so that you establish the correct access hierarchy for that data.
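
The access hierarchy can be pictured as a tree with the domain at the root and collections below it. The sketch below is only an illustration of that idea: the collection names and the `PARENTS` mapping are hypothetical, and the code doesn't call Microsoft Purview. It checks the rule used later in this section, that a scan's collection must be the collection where the source was registered or one of its children.

```python
from typing import Dict, Optional

# Hypothetical hierarchy: each collection maps to its parent.
# The domain is the root and has no parent.
PARENTS: Dict[str, Optional[str]] = {
    "contoso-purview": None,          # the domain, named after the account
    "data-lake": "contoso-purview",
    "gold-assets": "data-lake",
    "finance": "contoso-purview",
}


def is_same_or_descendant(collection: str, ancestor: str) -> bool:
    """Return True if `collection` is `ancestor` or sits somewhere below it."""
    current: Optional[str] = collection
    while current is not None:
        if current == ancestor:
            return True
        current = PARENTS[current]  # walk one level up toward the domain
    return False


# A source registered in 'data-lake' can be scanned into 'data-lake' itself
# or into 'gold-assets', but not into the sibling 'finance' collection.
for target in ("data-lake", "gold-assets", "finance"):
    print(target, is_same_or_descendant(target, "data-lake"))
```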

If you've chosen to use Microsoft Fabric or SQL, you can use these guides to provide access:

Register your data lake and scan your assets

  1. In Microsoft Purview Data Map, on the Domains tab, select Role assignments for the domain (it has the same name as your Microsoft Purview account):

    1. Add yourself to the domain as data source admin and data curator.
      1. Select the person icon next to the role Data source admin.
      2. Search for your name as it appears in Microsoft Entra ID (you might need to enter your full name, spelled exactly as it is in Microsoft Entra ID).
      3. Select OK.
      4. Repeat these steps for data curator.

    Screenshot of adding required access permissions to a collection.

  2. Register the data lake:

    1. Select the Data sources tab.
    2. Select Register.
    3. Select the Azure Data Lake Storage Gen2 storage type.

    Screenshot of registering a data source.

  3. Provide the details to connect:

    1. Subscription (optional)
    2. Data Source Name (this will be the name of the ADLS Gen2 source)
    3. Collection where asset metadata should be stored (optional)
    4. Select Register
  4. Once registration of the data source is complete, you can configure the scan. Registration signifies that Microsoft Purview is connected to the data source and has placed it in the correct collection for ownership. Scanning will then read the metadata from the source and populate the assets in the data map.

  5. Select the source you registered on the Data sources tab.

    Screenshot of creating a scan for your data source.

  6. Select New scan and provide the details:

    1. Use the default integration runtime for this scan
    2. Credential should be Microsoft Purview MSI (system)
    3. Scan level is Auto Detect
    4. Select a collection or use the domain (collection must be the same collection or child collection of where the data source was registered)
    5. Select Continue

    Tip

    At this point, Microsoft Purview tests the connection to validate that a scan can be done. If you haven't granted the Microsoft Purview MSI reader access on the data source, the test fails. If you aren't the data source owner or don't have user access contributor, the scan fails because it expects you to be authorized to create the connection.

  7. Now select only the 'gold' container, where we placed the Delta table earlier in this section. This prevents scanning any other data assets in your data store.

    1. You should have only one blue check, next to gold. Alternatively, you can leave checks next to everything: the scan then covers the full source and still creates the assets we'll use, plus more.
    2. Select Continue.
  8. On the Select a scan rule set screen, use the default scan rule set.

  9. Select Continue

  10. On the Set a scan trigger screen, you set the scanning frequency, so that the scan keeps populating the data map as you add data assets to the gold container of the lake. For this tutorial, select Once.

  11. Select Continue.

  12. Select Save and Run. This creates a scan that reads only the metadata from the gold container of your data lake and populates the table we'll use in Microsoft Purview Unified Catalog in the next sections. If you select only Save, the scan doesn't run, and you won't see the assets. Once the scan is running, you'll see the scan you created with a Last run status of Queued. When the status reads Completed, your assets are ready for the next section. This could take a few minutes or hours, depending on how many assets you have in your source.
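
Because a scan can take minutes or hours, you might prefer to poll its status rather than refresh the page. The following sketch shows one generic way to wait for a terminal status. It makes several assumptions: `get_status` is a hypothetical callable that you would implement yourself (for example, on top of whatever API or tool you use to read the Last run status), and the status names other than Queued are guesses rather than confirmed Microsoft Purview values.

```python
import time
from typing import Callable

# Assumed terminal statuses; only 'Queued' is named in this tutorial, so
# treat the rest as placeholders to adjust for your environment.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_scan(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Call get_status() until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)  # still queued or running, so wait and retry
    raise TimeoutError(f"scan not finished after {max_polls} polls")


# Demo with a fake status source standing in for a real lookup.
fake_statuses = iter(["Queued", "Queued", "Completed"])
result = wait_for_scan(lambda: next(fake_statuses), poll_seconds=0)
print(result)
```

The `max_polls` cap keeps the loop from running forever if the scan stalls. With the defaults, the function gives up after about an hour.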

Next steps

Section 3 - Publish data products