Retrieving the schema of scanned assets (files in blob storage, SQL tables), such as column names and data types, from Purview using a Synapse PySpark notebook

Wasim 0

Hi there. We have a Purview account, Synapse with PySpark, and a storage account. We scan the files stored in the storage account with Purview, and we can see the schema for those files in Purview. Now we are trying to retrieve the column schema for those files in a Synapse PySpark notebook. We are doing this because we want to dynamically create tables in a SQL database from the column schemas of these files, but for that we first need to retrieve the schema. Can anybody tell us how to do that? Thanks in advance.

Also we are referring to this article - https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058

Initially we could get the asset list with file names, GUIDs, etc. How can we then get the schema for those files using the GUID, if that is possible at all?

Also, the code in the article connects to the storage account and reads the file. We are not sure why it needs to connect to the storage account when we are trying to get the schema from Purview.

1 answer

Smaran Thoomu 16,890 Reputation points Microsoft Vendor

2024-11-04T15:08:22.2933333+00:00
Hi @Wasim

Welcome to Microsoft Q&A platform and thanks for posting your query here.

I understand that you want to retrieve the schema of scanned assets from Azure Purview in a Synapse PySpark notebook, so that you can dynamically create tables in your SQL database based on the column schemas of the files scanned in Purview.

To retrieve the schema for the files you have scanned in Azure Purview, you can utilize the Azure Purview REST API. Here’s a step-by-step approach to achieve this:

Since you already have the GUIDs for each asset, you can use Purview’s REST API to retrieve the schema details. Specifically, the GET /catalog/api/atlas/v2/entity/guid/{guid} endpoint will allow you to fetch metadata, including the schema, for each asset. You can then parse this information in your PySpark notebook to dynamically generate SQL tables.

In your Synapse PySpark notebook, you can send API requests to Purview to get this metadata. Libraries like requests in Python will help you query the REST API within your PySpark environment. Once you get the metadata response, you can parse it to extract the column names and data types.

The article you mentioned connects to the storage account to access the data directly, but for schema retrieval only, that’s not necessary. Purview’s API should provide the metadata you need without requiring access to the storage account itself.

To summarize:

Use the Purview API to get schema details based on the GUID.

Parse the API response in Synapse to get column information.

Skip direct file reading unless it’s required for other tasks.
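For the end goal in the question (creating SQL tables from the retrieved schema), the parsed column names and types can be turned into a CREATE TABLE statement. The sketch below is only an illustration: `build_create_table`, the `TYPE_MAP` mapping, and the sample column list are hypothetical and not part of Purview, so adjust the mapping to whatever type names your scanned assets actually report.

```python
# Hypothetical mapping from type names seen in catalog metadata to SQL types;
# extend it to match what your own assets report.
TYPE_MAP = {
    "string": "NVARCHAR(4000)",
    "int": "INT",
    "long": "BIGINT",
    "double": "FLOAT",
    "boolean": "BIT",
    "timestamp": "DATETIME2",
}

def build_create_table(table_name, columns, default_type="NVARCHAR(4000)"):
    """Build a CREATE TABLE statement from (column_name, type_name) pairs."""
    if not columns:
        raise ValueError("at least one column is required")
    lines = []
    for name, type_name in columns:
        # Unknown type names fall back to a wide text column
        sql_type = TYPE_MAP.get(str(type_name).lower(), default_type)
        # Bracket-quote identifiers and escape any closing bracket
        safe_name = str(name).replace("]", "]]")
        lines.append(f"    [{safe_name}] {sql_type}")
    safe_table = table_name.replace("]", "]]")
    return f"CREATE TABLE [{safe_table}] (\n" + ",\n".join(lines) + "\n);"

# Example with made-up columns
sample_columns = [("id", "int"), ("name", "string"), ("created", "timestamp")]
ddl = build_create_table("customers", sample_columns)
print(ddl)
```

The generated statement can then be executed against the SQL database with whichever connector you already use in Synapse.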

I hope this helps. Please let me know if you have any questions.
Wasim 0 Reputation points

2024-11-05T07:07:39.02+00:00

@Smaran Thoomu thanks for the response. I want to correct the question I originally posted: how can I also get all the assets with their GUIDs, and then use these GUIDs to get the schema? Is there any example implementation you can point me to?

Also, I was referring to this code: https://github.com/dcnsakthi/blogs/blob/main/azure/purview/exporting_assets_via_restapi.ipynb Using this code, I get output like this:

Here is the code I am using to get this output (#Code1). My question is: can I make use of anything from it, like the id, to get the schema of the assets? I tried that in #Code2, but instead of the schema I get this output:

#Code1
# Exporting Data Assets using REST API in Python - Notebook
from azure.purview.catalog import PurviewCatalogClient
from azure.identity import ClientSecretCredential
from azure.core.exceptions import HttpResponseError
import pandas as pd
from pandas.io.json import json_normalize

keywords = "*"
#export_csv_path = "purview_search_export.csv"
purview_account_name = ""
client_id= ""
client_secret= ""
tenant_id=""
resource_url = "https://purview.azure.net"
data_catalog_name = ""
purview_api_url = f"https://{purview_account_name}.purview.azure.com"

# Acquire an access token
token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
purview_endpoint = f"https://{purview_account_name}.purview.azure.com/"
purview_scan_endpoint = f"https://{purview_account_name}.scan.purview.azure.com/"

def get_credentials():
    credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
    return credentials

def get_catalog_client():
    credentials=get_credentials()
    client=PurviewCatalogClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
    return client

body_input={ "keywords": keywords }

try:
    catlog_client=get_catalog_client()
except ValueError as e:
    print(e)

try:
    response=catlog_client.discovery.query(search_request=body_input)
    df=pd.DataFrame(response)
    display(df)
    #jdf= pd.json_normalize(df.value)
    #jdf.to_csv(export_csv_path, index=False)
except HttpResponseError as e:
    print(e)

#code2-------------------------------------------------------------
import requests
import json
import pandas as pd

purview_account=purview_account_name

# Function to authenticate and retrieve access token
def azuread_auth(tenant_id:
    url = f
    payload = f
    headers = {
    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()  # Raise an error for bad responses
    return response.json()[

# Function to get asset schema by GUID
def get_asset_schema(access_token:
    url = f
    headers = {
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Raise an error for bad responses
    return response.json()

# Get the access token
access_token = azuread_auth(tenant_id, client_id, client_secret)

# List of asset GUIDs (replace with your actual GUIDs)
#asset_guids = ["guid1", "guid2", "guid3"]  # Add your asset GUIDs here
#asset_guids=[""]
asset_guids=[

# Fetch and process the schema for each asset
for guid in asset_guids:
    asset_schema = get_asset_schema(access_token, purview_account, guid)
    # Parse schema from the response (this may vary based on your schema structure)
    schema_info = asset_schema.get(
    # Create a list of columns for easier readability
    columns = [(col[
    # Example: Create a DataFrame from the schema and display it
    schema_df = pd.DataFrame(columns, columns=[
    print(f

The output is the second screenshot.

Wasim 0 Reputation points

2024-11-05T07:09:12.2366667+00:00

If there is any sample implementation of this, please let me know.

Smaran Thoomu 16,890 Reputation points Microsoft Vendor

2024-11-05T14:41:35.4533333+00:00

Hi Wasim

Thanks for your follow-up question and for sharing the code you’re working with. I can see that you’re looking to retrieve all asset GUIDs and then use those GUIDs to get the schema details for each asset. I’ll provide guidance on how you can achieve this in two parts:

Getting all Asset GUIDs: You can use the code snippet you shared in #Code1 to search and list assets within Purview. The GUIDs for each asset can be extracted from the response dataframe. Once you have this list of GUIDs, you can pass it as input to the schema retrieval function.
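As a rough sketch of that extraction: the search response from #Code1 is a dictionary whose `value` list holds one item per asset, and each item's `id` should be the asset GUID. The helper name `extract_guids` and the sample response below are made up for illustration, and the field names (`id`, `name`, `entityType`) are assumptions to check against your own output.

```python
def extract_guids(search_response):
    """Return (guid, name, entity_type) tuples from a search response dict."""
    results = []
    for item in search_response.get("value", []):
        guid = item.get("id")
        if guid:  # skip items that carry no id
            results.append((guid, item.get("name"), item.get("entityType")))
    return results

# Made-up response in the assumed shape of a search result
sample_response = {
    "@search.count": 2,
    "value": [
        {"id": "guid-1", "name": "sales.csv", "entityType": "azure_blob_path"},
        {"id": "guid-2", "name": "Customers", "entityType": "azure_sql_table"},
    ],
}

assets = extract_guids(sample_response)
asset_guids = [guid for guid, _, _ in assets]
print(asset_guids)
```

In #Code1 this would be called as `extract_guids(response)` on the search result. Keeping the entity type next to each GUID helps later, because files and SQL tables may expose their columns differently.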

Retrieving Schema for Each Asset by GUID: After obtaining the list of GUIDs, you can use #Code2 to retrieve the schema details for each asset. I noticed that there might be a few issues in your code for #Code2, so here are some suggestions:

Correcting the Azure AD Authentication Function: Ensure that the authentication function is complete and returning the correct token. Here's an example:
import requests

def azuread_auth(tenant_id, client_id, client_secret):
    # Request a token with the OAuth 2.0 client-credentials flow
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    # Pass the form fields as a dict so requests URL-encodes them
    # (a client secret may contain characters such as & or +)
    payload = {
        "client_id": client_id,
        "scope": "https://purview.azure.net/.default",
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    }
    response = requests.post(url, data=payload)
    response.raise_for_status()
    return response.json().get("access_token")

Fetching Schema Information Using GUIDs: Use the GET /catalog/api/atlas/v2/entity/guid/{guid} endpoint to fetch schema details for each GUID. Here’s how to implement it:
def get_asset_schema(access_token, purview_account, guid):
    # Fetch the entity metadata for one GUID
    url = f"https://{purview_account}.purview.azure.com/catalog/api/atlas/v2/entity/guid/{guid}"
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json"
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response.json()

Looping through GUIDs to Extract Schema Information: After you have the access token and the list of GUIDs, you can fetch and parse schema details:
import pandas as pd

# Assuming asset_guids contains your list of GUIDs
asset_guids = ["guid1", "guid2", "guid3"]  # Replace with actual GUIDs
for guid in asset_guids:
    asset_schema = get_asset_schema(access_token, purview_account, guid)
    # The response nests the asset under "entity"; which key holds the columns
    # depends on the asset type, so "schemaElements" may be absent
    elements = asset_schema.get("entity", {}).get("attributes", {}).get("schemaElements") or []
    columns = [(col["name"], col["typeName"]) for col in elements]
    schema_df = pd.DataFrame(columns, columns=["Column Name", "Data Type"])
    print(f"Schema for GUID {guid}:")
    print(schema_df)
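Depending on the asset type, the column details may not sit directly on the asset but on related column entities returned under `referredEntities` in the same response. The sketch below assumes that layout. `extract_columns`, the attribute names `data_type` and `type`, and the sample response are assumptions for illustration, so compare them against the JSON you actually get back.

```python
def extract_columns(entity_response):
    """Collect (column_name, data_type) pairs from referred column entities."""
    columns = []
    referred = entity_response.get("referredEntities", {})
    for ref in referred.values():
        # Assumption: column entities have a type name ending in "column"
        if not str(ref.get("typeName", "")).endswith("column"):
            continue
        attrs = ref.get("attributes", {})
        # Assumption: the type is stored under "data_type" or "type"
        data_type = attrs.get("data_type") or attrs.get("type")
        columns.append((attrs.get("name"), data_type))
    return columns

# Made-up response in the assumed shape of an entity lookup
sample = {
    "entity": {"guid": "guid-1", "typeName": "azure_sql_table"},
    "referredEntities": {
        "c1": {"typeName": "azure_sql_column",
               "attributes": {"name": "id", "data_type": "int"}},
        "c2": {"typeName": "azure_sql_column",
               "attributes": {"name": "name", "data_type": "nvarchar"}},
        "s1": {"typeName": "azure_sql_schema", "attributes": {"name": "dbo"}},
    },
}

print(extract_columns(sample))
```

If no columns come back for an asset, the schema may sit one hop further away (for example on a separate schema entity), in which case a second request with that entity's GUID would be needed.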

This approach should help you retrieve the schema information in a structured format for each asset.

If you need a sample implementation, you may refer to the Azure Purview REST API documentation for additional examples on retrieving entity details by GUID.

I hope this helps. Please let me know if you have any questions.

Wasim 0 Reputation points

2024-11-05T15:04:17.8366667+00:00

Hi Smaran, thanks for the response,

Smaran Thoomu 16,890 Reputation points Microsoft Vendor

2024-11-06T07:55:11.1066667+00:00

@Wasim, following up to see whether the above answer was helpful. If it answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.