Hi Saurabh,
Thanks for your reply.
Please confirm, as I wasn't able to follow your answer.
I only have issues accessing Azure Blob Storage ("wasbs"), NOT Data Lake Gen2 ("abfss"). My Synapse workspace does not have a managed VNet; only the blob storage accounts are in a VNet.
Let me give you an example. Using linked services I am able to:
blob_sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service)
spark.conf.set(
    f"fs.azure.sas.{container}.{account}.dfs.core.windows.net",
    blob_sas_token)

abfss_path = f'abfss://{blob_container_name}@{blob_account_name}.dfs.core.windows.net/path/part0.snappy.parquet'
df = spark.read.load(abfss_path, format='parquet')
display(df.limit(10))
I'm even able to read this without getting the SAS token and passing it via the Spark config.
If I take the same code but point it at Blob Storage (wasbs), I get the error:
Py4JJavaError: An error occurred while calling o167.load. : org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Public access is not permitted on this storage account.
Which I find weird, given that I can access it from the SQL pool.
The equivalent code, as above:
blob_sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service)
spark.conf.set(
    f"fs.azure.sas.{container}.{account}.blob.core.windows.net",
    blob_sas_token)

wasbs_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/path/part0.snappy.parquet'
df = spark.read.load(wasbs_path, format='parquet')
display(df.limit(10))
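In case it helps, one variant I was considering, assuming the WASB driver reads its SAS key from the Hadoop configuration rather than from the Spark session config (I have not confirmed this), would be to set it there instead. A minimal sketch with placeholder names (the token would really come from the linked service):

```python
# Placeholder values; in my notebook these come from the linked service.
container = "mycontainer"
account = "myaccount"
blob_sas_token = "<sas-token-from-linked-service>"

# Note the WASB config key uses the blob endpoint,
# unlike the ABFS key, which uses the dfs endpoint.
sas_key = f"fs.azure.sas.{container}.{account}.blob.core.windows.net"

# In a Synapse notebook ('spark' is the session there), set it on the
# Hadoop configuration instead of via spark.conf.set:
# spark.sparkContext._jsc.hadoopConfiguration().set(sas_key, blob_sas_token)
```

Is that the right direction, or is something else causing the "Public access is not permitted" error?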
** For each of them I have created a linked service.
Any advice or idea on how I should be doing this is very much welcome.