How to Set Up AZUREML_OBO_SERVICE_ENDPOINT for a Spark Component in Azure ML

Abdelkhalek Hamdi 40 Reputation points
2025-02-16T18:29:40.1166667+00:00

I am running an Azure ML pipeline that includes a component on serverless Spark compute. The component fetches data from Azure Data Lake, repartitions it, and then writes it back to Azure Data Lake in Feather format using pandas. The goal is to leverage Spark's parallelism to write all partitions in parallel, as illustrated below:

import pandas as pd

def write_feather(partition):
    pandas_df = pd.DataFrame([row.asDict() for row in partition])
    pandas_df.to_feather(write_path + "data.feather")

df = spark.read.parquet(read_path)
df = df.repartition(col1, col2)  # repartition takes columns as separate arguments, not a list
df.foreachPartition(write_feather)

Data is read and repartitioned successfully; however, when the job reaches the writing step using pandas, it throws the following error:

java.lang.IllegalStateException: Could not find configuration value for spark.yarn.appMasterEnv.AZUREML_OBO_SERVICE_ENDPOINT

The error suggests that executors failed to obtain an access token to write data back to Azure Data Lake. I attempted to set the missing variable using the following approach, but the same error persists:

spark = SparkSession.builder.\
    config("spark.yarn.appMasterEnv.AZUREML_OBO_SERVICE_ENDPOINT", "https://login.microsoftonline.com/

1 answer

  1. JAYA SHANKAR G S 485 Reputation points Microsoft Vendor
    2025-02-19T06:25:35.2833333+00:00

    Hello Abdelkhalek Hamdi,

    Whatever Spark configuration you set applies only to the Spark context; it is not used by pandas. To write data with pandas, you need to pass the storage account credentials directly.

    Below are the ways you can pass them:

    
    # Service principal credentials
    st = {'tenant_id': 'tenant_id_value', 'client_id': 'client_id_value', 'client_secret': 'client_secret_value'}
    # or st = {'account_key': '<account_key>'}
    # or st = {'sas_token': 'sas_token_value'}
    # or st = {'connection_string': 'connection_string_value'}

    def write_feather(partition):
        pandas_df = pd.DataFrame([row.asDict() for row in partition])
        pandas_df.to_feather(write_path + "data.feather", storage_options=st)

    df.foreachPartition(write_feather)
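
    Note that in the snippet above every partition writes to the same file name, so partitions will overwrite one another. A minimal sketch of one way to avoid that (the uuid suffix is my own addition; it assumes write_path is an abfss:// URL ending in "/" and that adlfs is installed so pandas/fsspec can resolve it):

    import uuid
    import pandas as pd

    def write_feather(partition):
        rows = [row.asDict() for row in partition]
        if not rows:
            return  # nothing to write for an empty partition
        pandas_df = pd.DataFrame(rows)
        # Unique file name per partition so executors do not overwrite each other
        pandas_df.to_feather(f"{write_path}data_{uuid.uuid4().hex}.feather", storage_options=st)

    df.foreachPartition(write_feather)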
    
    

    I would recommend using a service principal or user identity.

    If you are using a service principal, get the tenant_id, client_id, and client_secret from Key Vault as shown here:

    
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate()
    token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary

    # Set up service principal tenant ID, client ID and secret from Azure Key Vault
    client_id = token_library.getSecret("<KEY_VAULT_NAME>", "<CLIENT_ID_SECRET_NAME>")
    tenant_id = token_library.getSecret("<KEY_VAULT_NAME>", "<TENANT_ID_SECRET_NAME>")
    client_secret = token_library.getSecret("<KEY_VAULT_NAME>", "<CLIENT_SECRET_NAME>")
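
    These secrets can then be plugged straight into the storage_options dict from the earlier snippet, for example:

    # Build storage_options for pandas from the Key Vault secrets above
    st = {
        'tenant_id': tenant_id,
        'client_id': client_id,
        'client_secret': client_secret,
    }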
    
    

    Make sure you have granted the Contributor and Storage Blob Data Contributor roles to your service principal or identity.

    Also, according to this documentation, there are two mechanisms to access ADLS Gen2:

    • User identity passthrough
    • Service principal-based data access

    However, user identity passthrough does not seem to work with the pandas API, so for now the workaround is to use a service principal. I will update later regarding user identity passthrough usage with pandas.

    Output:

    [screenshot of the output]

    In ADLS Gen2:

    [screenshot of the files written to ADLS Gen2]

    If you have any further queries, do let me know. If the solution worked for you, please accept the answer and give feedback by clicking Yes.

