Retrieving schema of scanned assets(files in blob,sql tables) likee column names,datatypes from purview using synapse pyspark notebook.

Wasim 0 Reputation points
2024-11-04T07:18:16.8733333+00:00

Hi there, we are having a purview account,synapse with pyspakr, storage account. We are scanning the files in purview that are stored in storage, we are able to see the schema in purview for those files,now we are trying to retrieve the schema f for the column in those files in synapse pyspark notebook. We are doing this coz we want to dynamically create the tables in sql database using these column schemas for hese files. But for that need to retrieve this schema. Can anybody tell how we can do that? Thanks in adv.

Also we are referring to this article - https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058

Init we could get the aseet list with files naes, guid etc. How can we then get the schema for those files using the guid if at all possible.

Also ,in the article the code is connecting to storage account and reading file ,not sure why it needs to connect to storage account when we are trying to get the schema from purview .

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,002 questions
Microsoft Purview
Microsoft Purview
A Microsoft data governance service that helps manage and govern on-premises, multicloud, and software-as-a-service data. Previously known as Azure Purview.
1,227 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. Smaran Thoomu 16,890 Reputation points Microsoft Vendor
    2024-11-04T15:08:22.2933333+00:00

    Hi @Wasim

    Welcome to Microsoft Q&A platform and thanks for posting your query here.

    Regarding retrieving the schema of scanned assets from Azure Purview using a Synapse PySpark notebook. I understand that you want to dynamically create tables in your SQL database based on the column schemas of the files scanned in Purview.

    To retrieve the schema for the files you have scanned in Azure Purview, you can utilize the Azure Purview REST API. Here’s a step-by-step approach to achieve this:

    1. Since you already have the GUIDs for each asset, you can use Purview’s REST API to retrieve the schema details. Specifically, the GET /catalog/api/atlas/v2/entity/guid/{guid} endpoint will allow you to fetch metadata, including the schema, for each asset. You can then parse this information in your PySpark notebook to dynamically generate SQL tables.
    2. In your Synapse PySpark notebook, you can send API requests to Purview to get this metadata. Libraries like requests in Python will help you query the REST API within your PySpark environment. Once you get the metadata response, you can parse it to extract the column names and data types.
    3. The article you mentioned connects to the storage account to access the data directly, but for schema retrieval only, that’s not necessary. Purview’s API should provide the metadata you need without requiring access to the storage account itself.

    To summarize:

    • Use the Purview API to get schema details based on the GUID.
    • Parse the API response in Synapse to get column information.
    • Skip direct file reading unless it’s required for other tasks.

    I hope this helps. Please let me know if you have any questions.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.