Hi @Wasim
Welcome to the Microsoft Q&A platform, and thanks for posting your query here.
I understand that you want to retrieve the schema of scanned assets from Azure Purview in a Synapse PySpark notebook, and then dynamically create tables in your SQL database based on the column schemas of the files scanned in Purview.
To retrieve the schema for the files you have scanned in Azure Purview, you can utilize the Azure Purview REST API. Here’s a step-by-step approach to achieve this:
- Since you already have the GUIDs for each asset, you can use Purview’s REST API to retrieve the schema details. Specifically, the `GET /catalog/api/atlas/v2/entity/guid/{guid}` endpoint will allow you to fetch metadata, including the schema, for each asset. You can then parse this information in your PySpark notebook to dynamically generate SQL tables.
- In your Synapse PySpark notebook, you can send API requests to Purview to get this metadata. Libraries like `requests` in Python will help you query the REST API within your PySpark environment. Once you get the metadata response, you can parse it to extract the column names and data types.
- The article you mentioned connects to the storage account to access the data directly, but for schema retrieval only, that’s not necessary. Purview’s API should provide the metadata you need without requiring access to the storage account itself.
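As a rough illustration of the steps above, here is a minimal sketch of calling the entity endpoint with `requests` and pulling column names and types out of the response. Several things here are assumptions rather than confirmed details of your setup: the host pattern `https://{account_name}.purview.azure.com`, the helper names (`build_entity_url`, `fetch_entity`, `extract_columns`), the use of a bearer token that you obtain separately, and the idea that column entities appear under `referredEntities` with `name` and `type` (or `data_type`) attributes. Please inspect a real response for your asset type first; for some assets you may need a follow-up request on the GUID of the attached schema entity.

```python
import requests


def build_entity_url(account_name: str, guid: str) -> str:
    # Assumed host pattern for the Purview account; adjust to your endpoint.
    return (
        f"https://{account_name}.purview.azure.com"
        f"/catalog/api/atlas/v2/entity/guid/{guid}"
    )


def fetch_entity(account_name: str, guid: str, token: str, timeout: int = 30) -> dict:
    # `token` is assumed to be a valid Azure AD bearer token for Purview,
    # acquired elsewhere (for example via a service principal).
    response = requests.get(
        build_entity_url(account_name, guid),
        headers={"Authorization": f"Bearer {token}"},
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


def extract_columns(payload: dict) -> list:
    # Assumption: column entities are listed under "referredEntities" and
    # their type name ends with "column". Verify against a real response.
    columns = []
    for entity in payload.get("referredEntities", {}).values():
        if not entity.get("typeName", "").endswith("column"):
            continue
        attrs = entity.get("attributes", {})
        name = attrs.get("name")
        if not name:
            continue  # skip entries without a usable column name
        # Fall back to "string" when no type attribute is present.
        data_type = attrs.get("type") or attrs.get("data_type") or "string"
        columns.append((name, data_type))
    return columns
```

In a notebook you would call `extract_columns(fetch_entity(account_name, guid, token))` once per GUID and keep the resulting list of `(name, type)` pairs for the table-creation step.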
To summarize:
- Use the Purview API to get schema details based on the GUID.
- Parse the API response in Synapse to get column information.
- Skip direct file reading unless it’s required for other tasks.
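
For the table-creation part, one possible approach is to turn a list of `(column name, data type)` pairs into a `CREATE TABLE` statement. The sketch below is only illustrative: the `TYPE_MAP` mapping, the fallback to `NVARCHAR(MAX)`, and the helper names `quote_identifier` and `build_create_table` are assumptions, so please adapt them to the data types Purview actually reports for your files and to your target SQL database.

```python
# Hypothetical mapping from source type names to SQL Server types.
TYPE_MAP = {
    "string": "NVARCHAR(MAX)",
    "int": "INT",
    "long": "BIGINT",
    "double": "FLOAT",
    "boolean": "BIT",
    "date": "DATE",
    "timestamp": "DATETIME2",
}


def quote_identifier(name: str) -> str:
    # Bracket-quote an identifier and escape any closing bracket inside it.
    return "[" + name.replace("]", "]]") + "]"


def build_create_table(table_name: str, columns: list) -> str:
    # `columns` is a list of (column_name, data_type) tuples.
    if not columns:
        raise ValueError("at least one column is required")
    definitions = []
    for name, data_type in columns:
        # Unknown types fall back to NVARCHAR(MAX) (an assumption).
        sql_type = TYPE_MAP.get(data_type.lower(), "NVARCHAR(MAX)")
        definitions.append(f"    {quote_identifier(name)} {sql_type}")
    body = ",\n".join(definitions)
    return f"CREATE TABLE {quote_identifier(table_name)} (\n{body}\n);"
```

For example, `build_create_table("sales", [("id", "int"), ("city", "string")])` produces a statement with an `INT` column and an `NVARCHAR(MAX)` column, which you could then execute against your SQL database from the notebook.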
I hope this helps. Please let me know if you have any questions.