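Tutorial: Use the Microsoft Purview Python SDK

This tutorial introduces the Microsoft Purview Python SDK. You can use the SDK to perform the most common Microsoft Purview operations programmatically, rather than through the Microsoft Purview governance portal.

In this tutorial, you'll learn how to use the SDK to:

  • Grant the permissions required to work with Microsoft Purview programmatically
  • Register a Blob Storage container as a data source in Microsoft Purview
  • Define and run a scan
  • Search the catalog
  • Delete a data source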

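Prerequisites

This tutorial requires:

Important

For these scripts, the endpoint value differs depending on which Microsoft Purview portal you're using. Endpoint for the classic Microsoft Purview governance portal: purview.azure.com/ Endpoint for the new Microsoft Purview portal: purview.microsoft.com/

So if you're using the new portal, your endpoint value looks like this: "https://consotopurview.scan.purview.microsoft.com"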

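Give Microsoft Purview access to the storage account

Before Microsoft Purview can scan the contents of the storage account, you need to assign it the right role.

  1. Go to your storage account in the Azure portal.

  2. Select "Access control (IAM)".

  3. Select the "Add" button, and then select "Add role assignment".

    Screenshot of the Access control menu in the storage account, with the "Add" button selected and "Add role assignment" selected.

  4. In the next window, search for the "Storage Blob Data Reader" role and select it:

    Screenshot of the "Add role assignment" menu, with "Storage Blob Data Reader" selected from the list of available roles.

  5. Go to the "Members" tab, and then select "Select members":

    Screenshot of the "Add role assignment" menu, with the "+ Select members" button selected.

  6. A new pane appears on the right. Search for and select the name of your existing Microsoft Purview instance.

  7. You can then select "Review + assign".

Microsoft Purview now has the read permission it needs to scan your Blob Storage.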

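Grant your application access to your Microsoft Purview account

  1. First, you need the client ID, tenant ID, and client secret from your service principal. To find this information, select Microsoft Entra ID.

  2. Then select "App registrations".

  3. Select your application and locate the required information:

    • Name

    • Client ID (or application ID)

    • Tenant ID (or directory ID)

      Screenshot of the service principal page in the Azure portal, with "Client ID" and "Tenant ID" highlighted.

    • Client secret

      Screenshot of the service principal page in the Azure portal, with the "Certificates & secrets" tab selected, showing the available client certificates and secrets.

  4. You now need to give the relevant Microsoft Purview roles to your service principal. To do so, go to your Microsoft Purview instance. Select "Open Microsoft Purview governance portal", or open the Microsoft Purview governance portal directly and choose the instance that you deployed.

  5. In the Microsoft Purview governance portal, select "Data map", and then "Collections":

    Screenshot of the left menu in the Microsoft Purview governance portal. The "Data map" tab is selected, and then the "Collections" tab is selected.

  6. Select the collection you want to work with, and go to the "Role assignments" tab. Add the service principal to the following roles:

    • Collection admins
    • Data source admins
    • Data curators
    • Data readers
  7. For each role, select the "Edit role assignments" button, and then select the role you want to add the service principal to. Alternatively, select the "Add" button next to each role, and add the service principal by searching for its name or client ID, as shown here:

    Screenshot of the "Role assignments" menu under a collection in the Microsoft Purview governance portal. The "Add user" button next to the "Collection admins" tab is selected. The "Add or remove collection admins" pane is shown, with a search for the service principal in the text box.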

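Install the Python packages

  1. Open a new command prompt or terminal.
  2. Install the Azure identity package for authentication:
    pip install azure-identity
    
  3. Install the Microsoft Purview scanning client package:
    pip install azure-purview-scanning
    
  4. Install the Microsoft Purview administration client package:
    pip install azure-purview-administration
    
  5. Install the Microsoft Purview catalog client package:
    pip install azure-purview-catalog
    
  6. Install the Microsoft Purview account package:
    pip install azure-purview-account
    
  7. Install the Azure Core package:
    pip install azure-core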
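
Before moving on, it can be useful to confirm that the packages actually landed in the active Python environment. The following is a minimal sketch, not part of the SDK: it uses only the standard library, and the helper name `installed_versions` is hypothetical. The package names are the ones from the pip commands above.

```python
from importlib import metadata

# Distribution names taken from the pip commands in this tutorial.
REQUIRED_PACKAGES = [
    "azure-identity",
    "azure-purview-scanning",
    "azure-purview-administration",
    "azure-purview-catalog",
    "azure-purview-account",
    "azure-core",
]


def installed_versions(packages):
    """Return {package: version string, or None when it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED_PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as not installed usually means that pip ran against a different interpreter or virtual environment than the one running your script.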
    

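Create a Python script file

Create a plain text file, and save it as a Python script with the .py suffix. For example: tutorial.py.

Instantiate the scanning, catalog, and administration clients

In this section, you'll learn how to instantiate:

  • The scanning client, which is useful for registering data sources, creating and managing scan rules, triggering scans, and so on.
  • The catalog client, which is useful for interacting with the catalog through searching, browsing discovered assets, identifying the sensitivity of your data, and so on.
  • The administration client, which is useful for interacting with the Microsoft Purview Data Map itself, for operations like listing collections.

First, you need to authenticate to Microsoft Entra ID. To do that, you'll use the client secret you created.

  1. Start with the required import statements: our three clients, the credentials statement, and an Azure exceptions statement.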

    from azure.purview.scanning import PurviewScanningClient
    from azure.purview.catalog import PurviewCatalogClient
    from azure.purview.administration.account import PurviewAccountClient
    from azure.identity import ClientSecretCredential 
    from azure.core.exceptions import HttpResponseError
    
  2. Specify the following information in the code:

    • Client ID (or application ID)
    • Tenant ID (or directory ID)
    • Client secret
    client_id = "<your client id>" 
    client_secret = "<your client secret>"
    tenant_id = "<your tenant id>"
    
  3. Specify your endpoints:

    Important

    The endpoint value differs depending on which Microsoft Purview portal you're using. Endpoint for the classic Microsoft Purview governance portal: https://{your_purview_account_name}.purview.azure.com/ Endpoint for the new Microsoft Purview portal: https://api.purview-service.microsoft.com

    Scan endpoint for the classic Microsoft Purview governance portal: https://{your_purview_account_name}.scan.purview.azure.com/ Scan endpoint for the new Microsoft Purview portal: https://api.scan.purview-service.microsoft.com

    purview_endpoint = "<endpoint>"
    
    purview_scan_endpoint = "<scan endpoint>"
    
  4. Now you can instantiate the three clients:

    def get_credentials():
        credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
        return credentials
    
    def get_purview_client():
        credentials = get_credentials()
        client = PurviewScanningClient(endpoint=purview_scan_endpoint, credential=credentials, logging_enable=True)  
        return client
    
    def get_catalog_client():
        credentials = get_credentials()
        client = PurviewCatalogClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
        return client
    
    def get_admin_client():
        credentials = get_credentials()
        client = PurviewAccountClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
        return client
    

Many of our scripts start with these same steps, because we need these clients to interact with the account.
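
Because every script repeats the same credentials and endpoints, one option is to read them from environment variables rather than hard-coding the client secret in each file. The following is a minimal sketch under stated assumptions: the environment variable names (`PURVIEW_ACCOUNT_NAME` and so on) and the helper `load_settings` are hypothetical, and the endpoints are built in the classic-portal format described above.

```python
import os


def load_settings(env=os.environ):
    """Read service principal settings and derive classic-portal endpoints.

    The variable names below are illustrative, not defined by the SDK.
    A missing variable raises KeyError, so misconfiguration fails early.
    """
    account = env["PURVIEW_ACCOUNT_NAME"]
    return {
        "client_id": env["PURVIEW_CLIENT_ID"],
        "client_secret": env["PURVIEW_CLIENT_SECRET"],
        "tenant_id": env["PURVIEW_TENANT_ID"],
        # Classic governance portal format; adjust for the new portal.
        "purview_endpoint": f"https://{account}.purview.azure.com",
        "purview_scan_endpoint": f"https://{account}.scan.purview.azure.com",
    }
```

You could then pass `settings["client_id"]`, `settings["purview_scan_endpoint"]`, and the other values to the `get_credentials` and client functions shown in this section.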

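Register a data source

In this section, you'll register your Blob Storage.

  1. As discussed in the previous section, first import the clients you need to access your Microsoft Purview account. Also import the Azure error response package so you can troubleshoot, and ClientSecretCredential to construct your Azure credentials.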

    from azure.purview.administration.account import PurviewAccountClient
    from azure.purview.scanning import PurviewScanningClient
    from azure.core.exceptions import HttpResponseError
    from azure.identity import ClientSecretCredential
    
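  2. Gather the resource ID for your storage account by following this guide: get the resource ID for a storage account.

  3. Then, in your Python file, define the following information so that you can register the Blob Storage programmatically: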

    storage_name = "<name of your Storage Account>"
    storage_id = "<id of your Storage Account>"
    rg_name = "<name of your resource group>"
    rg_location = "<location of your resource group>"
    reference_name_purview = "<name of your Microsoft Purview account>"
    
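  4. Provide the name of the collection where you want to register your Blob Storage. (It should be the same collection where you applied permissions earlier. If it isn't, first apply permissions to this collection.) If it's the root collection, use the same name as your Microsoft Purview instance.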

    collection_name = "<name of your collection>"
    
  5. Create a function to construct the credentials used to access your Microsoft Purview account:

    client_id = "<your client id>" 
    client_secret = "<your client secret>"
    tenant_id = "<your tenant id>"
    
    
    def get_credentials():
         credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
         return credentials
    
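  6. All collections in the Microsoft Purview Data Map have a friendly name and a name.

    • The friendly name is the one you see on the collection. For example: Sales.
    • The name for all collections (except the root collection) is a six-character name assigned by the Data Map.

    Python needs this six-character name to reference any subcollection. To automatically convert your friendly name to the six-character collection name needed in your script, add this block of code:

    Important

    The endpoint value differs depending on which Microsoft Purview portal you're using. Endpoint for the classic Microsoft Purview governance portal: purview.azure.com/ Endpoint for the new Microsoft Purview portal: purview.microsoft.com/

    So if you're using the new portal, your endpoint value looks like this: "https://consotopurview.scan.purview.microsoft.com"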

    def get_admin_client():
         credentials = get_credentials()
         client = PurviewAccountClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
         return client
    
    try:
      admin_client = get_admin_client()
    except ValueError as e:
        print(e)
    
    # Look up the six-character internal name that matches the friendly name.
    collection_list = admin_client.collections.list_collections()
    for collection in collection_list:
        if collection["friendlyName"].lower() == collection_name.lower():
            collection_name = collection["name"]
    
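  7. For both clients, and depending on the operation, you also need to provide an input body. To register a source, you need to provide an input body for data source registration: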

    ds_name = "<friendly name for your data source>"
    
    body_input = {
            "kind": "AzureStorage",
            "properties": {
                "endpoint": f"https://{storage_name}.blob.core.windows.net/",
                "resourceGroup": rg_name,
                "location": rg_location,
                "resourceName": storage_name,
                "resourceId": storage_id,
                "collection": {
                    "type": "CollectionReference",
                    "referenceName": collection_name
                },
                "dataUseGovernance": "Disabled"
            }
    }    
    
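  8. Now you can call your Microsoft Purview clients and register the data source.

    Important

    The endpoint value differs depending on which Microsoft Purview portal you're using. Endpoint for the classic Microsoft Purview governance portal: https://{your_purview_account_name}.purview.azure.com/ Endpoint for the new Microsoft Purview portal: https://api.purview-service.microsoft.com

    If you're using the classic portal, your scan endpoint value is: https://{your_purview_account_name}.scan.purview.azure.com If you're using the new portal, your scan endpoint value is: https://scan.api.purview-service.microsoft.com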

    def get_purview_client():
         credentials = get_credentials()
         # purview_scan_endpoint is the scan endpoint you specified when instantiating the clients
         client = PurviewScanningClient(endpoint=purview_scan_endpoint, credential=credentials, logging_enable=True)
         return client
    
    try:
        client = get_purview_client()
    except ValueError as e:
        print(e)
    
    try:
        response = client.data_sources.create_or_update(ds_name, body=body_input)
        print(response)
        print(f"Data source {ds_name} successfully created or updated")
    except HttpResponseError as e:
        print(e)
    

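When the registration process succeeds, you can see an enriched body response from the client.

In the following sections, you'll scan the data source you registered and search the catalog. Each of those scripts is structured similarly to this registration script.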
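
If you register more than one storage account, the registration body can be produced by a small function instead of a module-level dictionary. This is a minimal sketch: the helper name `build_registration_body` is hypothetical, and the fields it emits are exactly the ones used in the input body shown in this section.

```python
def build_registration_body(storage_name, storage_id, rg_name, rg_location, collection_name):
    """Build the input body for registering an Azure Storage data source."""
    return {
        "kind": "AzureStorage",
        "properties": {
            "endpoint": f"https://{storage_name}.blob.core.windows.net/",
            "resourceGroup": rg_name,
            "location": rg_location,
            "resourceName": storage_name,
            "resourceId": storage_id,
            "collection": {
                "type": "CollectionReference",
                # Internal (six-character) collection name, not the friendly name.
                "referenceName": collection_name,
            },
            "dataUseGovernance": "Disabled",
        },
    }
```

The returned dictionary can be passed as `body` to `client.data_sources.create_or_update(ds_name, body=...)`, as in the registration step above.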

Full code

from azure.purview.scanning import PurviewScanningClient
from azure.identity import ClientSecretCredential 
from azure.core.exceptions import HttpResponseError
from azure.purview.administration.account import PurviewAccountClient

client_id = "<your client id>" 
client_secret = "<your client secret>"
tenant_id = "<your tenant id>"
purview_endpoint = "<endpoint>"
purview_scan_endpoint = "<scan endpoint>"
storage_name = "<name of your Storage Account>"
storage_id = "<id of your Storage Account>"
rg_name = "<name of your resource group>"
rg_location = "<location of your resource group>"
collection_name = "<name of your collection>"
ds_name = "<friendly data source name>"

def get_credentials():
	credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
	return credentials

def get_purview_client():
	credentials = get_credentials()
	client = PurviewScanningClient(endpoint=purview_scan_endpoint, credential=credentials, logging_enable=True)  
	return client

def get_admin_client():
	credentials = get_credentials()
	client = PurviewAccountClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
	return client

try:
	admin_client = get_admin_client()
except ValueError as e:
        print(e)

collection_list = admin_client.collections.list_collections()
for collection in collection_list:
	if collection["friendlyName"].lower() == collection_name.lower():
		collection_name = collection["name"]


body_input = {
	"kind": "AzureStorage",
	"properties": {
		"endpoint": f"https://{storage_name}.blob.core.windows.net/",
		"resourceGroup": rg_name,
		"location": rg_location,
		"resourceName": storage_name,
 		"resourceId": storage_id,
		"collection": {
			"type": "CollectionReference",
			"referenceName": collection_name
		},
		"dataUseGovernance": "Disabled"
	}
}

try:
	client = get_purview_client()
except ValueError as e:
        print(e)

try:
	response = client.data_sources.create_or_update(ds_name, body=body_input)
	print(response)
	print(f"Data source {ds_name} successfully created or updated")
except HttpResponseError as e:
    print(e)
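
The friendly-name lookup in the script above also appears in the scanning script, so it may be worth factoring into a function. The following is a minimal sketch: `resolve_collection_name` is a hypothetical helper, and it assumes each collection is a dictionary with "friendlyName" and "name" keys, as the loop above does. Like that loop, it falls back to the name you passed in when nothing matches.

```python
def resolve_collection_name(collections, friendly_name):
    """Return the internal name of the collection whose friendly name matches.

    The comparison is case-insensitive. If no collection matches, the input
    is returned unchanged, mirroring the loop in the tutorial script.
    """
    wanted = friendly_name.lower()
    for collection in collections:
        if collection.get("friendlyName", "").lower() == wanted:
            return collection["name"]
    return friendly_name


# Example with hand-written data standing in for the service response.
sample = [
    {"friendlyName": "Sales", "name": "a1b2c3"},
    {"friendlyName": "Finance", "name": "d4e5f6"},
]
print(resolve_collection_name(sample, "sales"))
```

In the script, you would call it as `collection_name = resolve_collection_name(admin_client.collections.list_collections(), collection_name)`.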

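Scan the data source

You can scan a data source in two steps:

  1. Create a scan definition
  2. Trigger a scan run

In this tutorial, you'll use the default scan rules for Blob Storage containers. However, you can also create custom scan rules programmatically with the Microsoft Purview scanning client.

Now let's scan the data source you registered earlier.

  1. Add import statements to generate a unique identifier, and to call the Microsoft Purview scanning client, the Microsoft Purview administration client, the Azure error response package for troubleshooting, and the client secret credential for gathering your Azure credentials.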

    import uuid
    from azure.purview.scanning import PurviewScanningClient
    from azure.purview.administration.account import PurviewAccountClient
    from azure.core.exceptions import HttpResponseError
    from azure.identity import ClientSecretCredential 
    
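  2. Create the scanning client using your credentials:

    client_id = "<your client id>" 
    client_secret = "<your client secret>"
    tenant_id = "<your tenant id>"
    # Defined here because get_purview_client() uses it to build the scan endpoint
    reference_name_purview = "<name of your Microsoft Purview account>"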
    
    def get_credentials():
         credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
         return credentials
    
    def get_purview_client():
         credentials = get_credentials()
         client = PurviewScanningClient(endpoint=f"https://{reference_name_purview}.scan.purview.azure.com", credential=credentials, logging_enable=True)  
         return client
    
    try:
         client = get_purview_client()
    except ValueError as e:
         print(e)
    
  3. Add the code to gather the internal name of your collection. (For more information, see the previous section):

    collection_name = "<name of the collection where you will be creating the scan>"
    purview_endpoint = "<endpoint>"
    
    def get_admin_client():
         credentials = get_credentials()
         client = PurviewAccountClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
         return client
    
    try:
        admin_client = get_admin_client()
    except ValueError as e:
        print(e)
    
    # Look up the six-character internal name that matches the friendly name.
    collection_list = admin_client.collections.list_collections()
    for collection in collection_list:
        if collection["friendlyName"].lower() == collection_name.lower():
            collection_name = collection["name"]
    
  4. Then, create a scan definition:

    ds_name = "<name of your registered data source>"
    scan_name = "<name of the scan you want to define>"
    reference_name_purview = "<name of your Microsoft Purview account>"
    
    body_input = {
            "kind":"AzureStorageMsi",
            "properties": { 
                "scanRulesetName": "AzureStorage", 
                "scanRulesetType": "System", #We use the default scan rule set 
                "collection": 
                    {
                        "referenceName": collection_name,
                        "type": "CollectionReference"
                    }
            }
    }
    
    try:
        response = client.scans.create_or_update(data_source_name=ds_name, scan_name=scan_name, body=body_input)
        print(response)
        print(f"Scan {scan_name} successfully created or updated")
    except HttpResponseError as e:
        print(e)
    
  5. Once the scan is defined, you can trigger a scan run with a unique ID:

    run_id = uuid.uuid4() #unique id of the new scan
    
    try:
        response = client.scan_result.run_scan(data_source_name=ds_name, scan_name=scan_name, run_id=run_id)
        print(response)
        print(f"Scan {scan_name} successfully started")
    except HttpResponseError as e:
        print(e)
    

Full code

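import uuid
from azure.purview.scanning import PurviewScanningClient
from azure.purview.administration.account import PurviewAccountClient
from azure.identity import ClientSecretCredential
from azure.core.exceptions import HttpResponseError

ds_name = "<name of your registered data source>"
scan_name = "<name of the scan you want to define>"
reference_name_purview = "<name of your Microsoft Purview account>"
# Classic-portal endpoints; adjust these if you use the new Microsoft Purview portal.
purview_endpoint = f"https://{reference_name_purview}.purview.azure.com"
purview_scan_endpoint = f"https://{reference_name_purview}.scan.purview.azure.com"
client_id = "<your client id>" 
client_secret = "<your client secret>"
tenant_id = "<your tenant id>"
collection_name = "<name of the collection where you will be creating the scan>"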

def get_credentials():
	credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
	return credentials

def get_purview_client():
	credentials = get_credentials()
	client = PurviewScanningClient(endpoint=purview_scan_endpoint, credential=credentials, logging_enable=True)  
	return client

def get_admin_client():
	credentials = get_credentials()
	client = PurviewAccountClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
	return client

try:
	admin_client = get_admin_client()
except ValueError as e:
        print(e)

collection_list = admin_client.collections.list_collections()
for collection in collection_list:
	if collection["friendlyName"].lower() == collection_name.lower():
		collection_name = collection["name"]


try:
	client = get_purview_client()
except ValueError as e:
	print(e)

body_input = {
	"kind":"AzureStorageMsi",
	"properties": { 
		"scanRulesetName": "AzureStorage", 
		"scanRulesetType": "System",
		"collection": {
			"type": "CollectionReference",
			"referenceName": collection_name
		}
	}
}

try:
	response = client.scans.create_or_update(data_source_name=ds_name, scan_name=scan_name, body=body_input)
	print(response)
	print(f"Scan {scan_name} successfully created or updated")
except HttpResponseError as e:
	print(e)

run_id = uuid.uuid4() #unique id of the new scan

try:
	response = client.scan_result.run_scan(data_source_name=ds_name, scan_name=scan_name, run_id=run_id)
	print(response)
	print(f"Scan {scan_name} successfully started")
except HttpResponseError as e:
	print(e)

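Search the catalog

Once the scan completes, assets have likely been discovered and even classified. This process can take some time to finish after scanning, so you might need to wait before running the next portion of code. Wait for your scan to show as completed and for the assets to appear in the Microsoft Purview Data Catalog.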
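
If you'd rather wait in code than watch the portal, a generic polling loop is one option. The sketch below is deliberately independent of the SDK: `wait_until_done` is a hypothetical helper, `get_status` is any function you supply that returns the current scan status as a string (for example, by reading the scan's run history), and the terminal status names are assumptions that you should check against what your account actually returns.

```python
import time


def wait_until_done(get_status,
                    terminal_states=("Succeeded", "Failed", "Canceled"),
                    interval_seconds=30,
                    timeout_seconds=3600,
                    sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal state.

    Returns the terminal state, or raises TimeoutError once roughly
    timeout_seconds have been spent waiting between polls.
    """
    waited = 0
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still '{status}' after {waited} seconds")
        sleep(interval_seconds)
        waited += interval_seconds
```

Passing `sleep` as a parameter keeps the function easy to try out without real delays. Even after the scan reports completion, assets can take additional time to appear in the catalog.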

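Once the assets are ready, you can use the Microsoft Purview catalog client to search the whole catalog.

  1. This time, import the catalog client instead of the scanning client. Also include HttpResponseError and ClientSecretCredential.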

    from azure.purview.catalog import PurviewCatalogClient
    from azure.identity import ClientSecretCredential 
    from azure.core.exceptions import HttpResponseError
    
  2. Create a function to get the credentials used to access your Microsoft Purview account, and instantiate the catalog client.

    client_id = "<your client id>" 
    client_secret = "<your client secret>"
    tenant_id = "<your tenant id>"
    reference_name_purview = "<name of your Microsoft Purview account>"
    
    def get_credentials():
         credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
         return credentials
    
    def get_catalog_client():
        credentials = get_credentials()
        # The catalog client uses the account endpoint, not the scan endpoint
        client = PurviewCatalogClient(endpoint=f"https://{reference_name_purview}.purview.azure.com", credential=credentials, logging_enable=True)
        return client
    
    try:
        client_catalog = get_catalog_client()
    except ValueError as e:
        print(e)  
    
  3. Configure your search criteria and keywords in the input body:

    keywords = "keywords you want to search"
    
    body_input={
        "keywords": keywords
    }
    

    Only keywords are specified here, but keep in mind that you can add many other fields to further refine your query.

  4. Search the catalog:

    try:
        response = client_catalog.discovery.query(search_request=body_input)
        print(response)
    except HttpResponseError as e:
        print(e)
    

Full code

from azure.purview.catalog import PurviewCatalogClient
from azure.identity import ClientSecretCredential 
from azure.core.exceptions import HttpResponseError

client_id = "<your client id>" 
client_secret = "<your client secret>"
tenant_id = "<your tenant id>"
reference_name_purview = "<name of your Microsoft Purview account>"
keywords = "<keywords you want to search for>"

def get_credentials():
	credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
	return credentials

def get_catalog_client():
	credentials = get_credentials()
	client = PurviewCatalogClient(endpoint=f"https://{reference_name_purview}.purview.azure.com", credential=credentials, logging_enable=True)
	return client

body_input={
	"keywords": keywords
}

try:
	catalog_client = get_catalog_client()
except ValueError as e:
	print(e)

try:
	response = catalog_client.discovery.query(search_request=body_input)
	print(response)
except HttpResponseError as e:
	print(e)
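
Printing the raw response is fine for a first test, but a short summary is often easier to read. The sketch below assumes that the search response is a dictionary with a "value" list whose entries may carry "name", "entityType", and "qualifiedName" keys. That shape is an assumption, so confirm it against the response you actually receive. The helper name `summarize_search_response` is hypothetical.

```python
def summarize_search_response(response):
    """Return a list of (name, entity type, qualified name) tuples.

    Missing keys become None, so partially populated hits do not raise.
    """
    hits = response.get("value", []) if isinstance(response, dict) else []
    return [
        (hit.get("name"), hit.get("entityType"), hit.get("qualifiedName"))
        for hit in hits
    ]


# Example with a hand-written response standing in for the service reply.
sample_response = {
    "value": [
        {
            "name": "sales.csv",
            "entityType": "azure_blob_path",
            "qualifiedName": "https://mystore.blob.core.windows.net/data/sales.csv",
        }
    ]
}
for name, entity_type, qualified_name in summarize_search_response(sample_response):
    print(f"{name} ({entity_type}): {qualified_name}")
```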

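Delete a data source

In this section, you'll learn how to delete the data source you registered earlier. This operation is fairly simple, and is done with the scanning client.

  1. Import the scanning client. Also include HttpResponseError and ClientSecretCredential.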

    from azure.purview.scanning import PurviewScanningClient
    from azure.identity import ClientSecretCredential 
    from azure.core.exceptions import HttpResponseError
    
  2. Create a function to get the credentials used to access your Microsoft Purview account, and instantiate the scanning client.

    client_id = "<your client id>" 
    client_secret = "<your client secret>"
    tenant_id = "<your tenant id>"
    reference_name_purview = "<name of your Microsoft Purview account>"
    
    def get_credentials():
         credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
         return credentials
    
    def get_scanning_client():
        credentials = get_credentials()
        client = PurviewScanningClient(endpoint=f"https://{reference_name_purview}.scan.purview.azure.com", credential=credentials, logging_enable=True)
        return client
    
    try:
        client_scanning = get_scanning_client()
    except ValueError as e:
        print(e)  
    
  3. Delete the data source:

    ds_name = "<name of the registered data source you want to delete>"
    try:
        response = client_scanning.data_sources.delete(ds_name)
        print(response)
        print(f"Data source {ds_name} successfully deleted")
    except HttpResponseError as e:
        print(e)
    

Full code

from azure.purview.scanning import PurviewScanningClient
from azure.identity import ClientSecretCredential 
from azure.core.exceptions import HttpResponseError


client_id = "<your client id>" 
client_secret = "<your client secret>"
tenant_id = "<your tenant id>"
reference_name_purview = "<name of your Microsoft Purview account>"
ds_name = "<name of the registered data source you want to delete>"

def get_credentials():
	credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
	return credentials

def get_scanning_client():
	credentials = get_credentials()
	client = PurviewScanningClient(endpoint=f"https://{reference_name_purview}.scan.purview.azure.com", credential=credentials, logging_enable=True) 
	return client

try:
	client_scanning = get_scanning_client()
except ValueError as e:
	print(e)  

try:
	response = client_scanning.data_sources.delete(ds_name)
	print(response)
	print(f"Data source {ds_name} successfully deleted")
except HttpResponseError as e:
	print(e)

Next steps