How to access ADLS Gen2 storage with restricted network access from a PySpark notebook

Uros Stojiljkovic 0 Reputation points Microsoft Employee
2024-08-15T14:06:01.63+00:00

I'm trying to access ADLS Gen2 storage from a PySpark notebook in my Synapse workspace. The storage account has public network access enabled from selected virtual networks and IP addresses: my IP address is added to the firewall rules, access from trusted Azure services is enabled, and my Synapse workspace's resource instances are explicitly added to the allowlist (see the image attached below).

Here's how I'm trying to read the file from ADLS using Spark. I tried with and without specifying the linked service, and neither works:

spark_df_1 = spark.read.parquet('abfss://<storage-container>@<storage-account>.dfs.core.windows.net/<file-path>')

spark_df_2 = spark.read.parquet('abfss://<storage-container>@<storage-account>.dfs.core.windows.net/<file-path>', storage_options={'linked_service': '<linked-service-name>'})

This is the error I'm getting:

Py4JJavaError: An error occurred while calling o5482.parquet. : java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation.", 403

It is important to mention that when I enable public network access from all networks for my storage account, the code above works properly. And to make things even more interesting, for some reason I'm able to read from ADLS using pandas when I specify the linked service in storage_options; I only get the error when using Spark. So, the code below works properly even when public network access is not enabled from all networks:

pandas_df = pandas.read_parquet('abfss://<storage-container>@<storage-account>.dfs.core.windows.net/<file-path>', storage_options={'linked_service': '<linked-service-name>'})

I'm trying to understand why pandas is able to access the storage while Spark isn't. Am I passing the linked service correctly when using Spark?

By the way, I also tried to do the same thing as in the official docs and in a similar question I found, but that didn't help. However, I'm not sure whether public network access was restricted in those cases (probably not, since it wasn't mentioned). The restricted public network access seems to be the main source of my issues, because everything works as expected when access from all networks is enabled. Below is the piece of code I tried, but I'm still getting the same access denied error as above:

source_full_storage_account_name = "<storage-account>.dfs.core.windows.net"
# Tell Synapse which linked service to use when authenticating to this storage account
spark.conf.set(f"spark.storage.synapse.{source_full_storage_account_name}.linkedServiceName", "<linked-service-name>")
# Route ABFS authentication through the linked-service-based token provider
spark.conf.set(f"fs.azure.account.oauth.provider.type.{source_full_storage_account_name}", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")
# Apply the same provider to the underlying Hadoop configuration as well
sc._jsc.hadoopConfiguration().set(f"fs.azure.account.oauth.provider.type.{source_full_storage_account_name}", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")


df = spark.read.parquet('abfss://<storage-container>@<storage-account>.dfs.core.windows.net/<file-path>')

Here's what the storage account networking setup looks like:

[Screenshot: storage account networking settings, with public network access enabled from selected virtual networks and IP addresses]


1 answer

  1. Nehruji R 7,556 Reputation points Microsoft Vendor
    2024-08-16T13:16:23.34+00:00

    Hello Uros Stojiljkovic,

    Greetings! Welcome to Microsoft Q&A Platform.

    I understand that you’re encountering a 403 Forbidden error while trying to read a Parquet file using PySpark. This error typically indicates that the request lacks the necessary permissions to perform the operation, or that the storage firewall rejected the request.

    Please consider checking the following to troubleshoot the issue further:

    Ensure that the account or service principal you’re using has the necessary permissions to read from the specified location. For ADLS Gen2 this typically means an Azure RBAC data-plane role such as Storage Blob Data Reader on the storage account or container.
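
    As a quick authentication sanity check, the snippet below asks the notebook's identity for a storage token. This is a minimal sketch assuming a Synapse notebook where mssparkutils is available; acquiring a token only proves authentication works, not that RBAC or the storage firewall will allow the read.

    # Sketch: confirm the notebook's identity can obtain an Azure Storage token at all.
    # "Storage" is the audience alias for Azure Storage in Synapse.
    from notebookutils import mssparkutils

    token = mssparkutils.credentials.getToken("Storage")
    print("Acquired a storage token, length:", len(token))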

    Verify that your authentication credentials (e.g., access keys, tokens) are correctly configured and have not expired.
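
    To rule out a credential problem entirely, one option is to temporarily authenticate with the account access key instead of the linked service. This is a sketch using the standard ABFS fs.azure.account.key setting; the placeholders match the question, and in practice the key should come from Key Vault rather than be pasted into the notebook.

    # Sketch: temporary account-key authentication to isolate credential issues.
    # <storage-account> and <account-access-key> are placeholders.
    account = "<storage-account>.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.key.{account}", "<account-access-key>")
    df = spark.read.parquet(f"abfss://<storage-container>@{account}/<file-path>")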

    Double-check the path you’re reading from and ensure it is correct. Sometimes a typo or an incorrect path can lead to access issues.
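
    A directory listing is a quick way to validate the container and path before reading. A minimal sketch, again assuming a Synapse notebook with mssparkutils; <parent-directory> is a placeholder.

    # Sketch: list the parent directory to confirm the container and path exist.
    from notebookutils import mssparkutils

    for f in mssparkutils.fs.ls("abfss://<storage-container>@<storage-account>.dfs.core.windows.net/<parent-directory>"):
        print(f.name, f.isDir, f.size)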

    Ensure that your Spark environment is correctly configured with the necessary access credentials. This might involve setting environment variables or configuring Spark properties.
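
    As an alternative to the linked-service token provider, Spark can also be pointed at a service principal directly through the standard ABFS OAuth properties. This is a sketch; all <...> values are placeholders, and the secret would normally be fetched from Key Vault.

    # Sketch: service-principal (OAuth) authentication via standard ABFS configs.
    account = "<storage-account>.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<client-id>")
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}", "<client-secret>")
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")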

    If you’re working within a corporate network, there might be network policies or firewalls that restrict access to certain resources. If you are using a virtual network (vNet) with service endpoints for the storage account, ensure that the vNet and the storage account are configured to allow traffic between them, and check whether any network security groups or route tables are blocking the communication.
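
    Probing the DFS endpoint directly can help separate network problems from authorization problems: any HTTP response (even 401/403) proves the endpoint is reachable, a timeout points at a network block, and a 403 body mentioning AuthorizationFailure matches a storage-firewall rejection. A sketch using the requests library, with the same placeholders as the question.

    # Sketch: probe the DFS endpoint to distinguish network blocks from auth failures.
    import requests

    url = "https://<storage-account>.dfs.core.windows.net/<storage-container>?resource=filesystem"
    try:
        resp = requests.get(url, timeout=10)
        print("Reached storage endpoint, HTTP status:", resp.status_code)
        print(resp.text)  # the error code in the body hints at firewall vs. credential issues
    except requests.exceptions.RequestException as exc:
        print("Could not reach storage endpoint:", exc)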

    Review the logs and diagnostics of both the Synapse workspace and the storage account to gather more information about the error and identify any potential issues.

    Please check the permissions required for data access when using an account access key or Azure RBAC: https://learn.microsoft.com/en-us/azure/storage/blobs/assign-azure-role-data-access?tabs=portal and https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-access-azure-active-directory

    Hope the above information helps! Please let us know if you have any further queries; I’m happy to assist you further.


    Please "Accept the answer” and “up-vote” wherever the information provided helps you, this can be beneficial to other community members.

