Hello Everyone
I'm using Azure Synapse to run my Spark pipelines. I've built my pipeline to use custom library code from the utils package of the same project. To make utils.pipeline available, I packaged the entire project as a Python wheel and uploaded it as a workspace package to the serverless Spark pool. When I try to import it with the statement from utils.pipeline import *, I get the error ModuleNotFoundError: No module named 'utils'.
Any help would be really appreciated. The configuration details are below.
Project Hierarchy:
-code_repo
--src
---xyz
----main.py (Entry file)
---utils
----pipeline.py (Reusable functions are available)
-setup.py
-requirements.txt
setup.py (used to package the project files):
from setuptools import setup, find_packages

setup(
    name="Data_Project",
    version="0.1.0",
    author="Maniganda",
    packages=find_packages(),
    include_package_data=True,
    description='Data Engineering Project'
)
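One thing that might be worth checking is whether the built wheel actually ships utils as a top-level package. With setup.py at the repository root and the code under src, find_packages() with its default arguments only descends into directories that contain an __init__.py, so utils may be missing from the wheel or may end up nested under src. A minimal sketch for inspecting the wheel, relying only on the fact that a wheel is a zip archive (the helper names are hypothetical, not part of any library):

```python
import zipfile


def top_level_names(wheel_path):
    """Return the sorted top-level entries inside a wheel (a wheel is a zip archive)."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted({name.split("/", 1)[0] for name in wheel.namelist()})


def contains_package(wheel_path, package):
    """Return True if the wheel ships `<package>/...` at its top level."""
    prefix = package + "/"
    with zipfile.ZipFile(wheel_path) as wheel:
        return any(name.startswith(prefix) for name in wheel.namelist())
```

Calling contains_package on the wheel in the dist folder with "utils" as the package name should return True if from utils.pipeline import * is expected to work. If it returns False, the packaging step rather than the Spark pool is the more likely place to look.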
Spark Pool:
Spark version: 3.4
Python: 3.10
Error on executing the Spark pipeline:
2024-12-24 20:55:40,730 INFO SignalUtils [main]: Registering signal handler for TERM
2024-12-24 20:55:41,249 INFO SignalUtils [main]: Registering signal handler for HUP
2024-12-24 20:55:41,250 INFO SignalUtils [main]: Registering signal handler for INT
2024-12-24 20:55:42,200 WARN NativeCodeLoader [AsyncAppender-Dispatcher-Thread-3]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-12-24 20:55:42,438 INFO ApplicationMaster [main]: ApplicationAttemptId: appattempt_1735073652210_0001_000001
2024-12-24 20:55:43,858 WARN MetricsConfig [main]: Cannot locate configuration: tried hadoop-metrics2-azure-file-system.properties,hadoop-metrics2.properties
2024-12-24 20:55:43,873 INFO MetricsSystemImpl [main]: Scheduled Metric snapshot period at 10 second(s).
2024-12-24 20:55:43,873 INFO MetricsSystemImpl [main]: azure-file-system metrics system started
2024-12-24 20:55:44,218 INFO ApplicationMaster [main]: Starting the user application in a separate Thread
2024-12-24 20:55:44,233 INFO ApplicationMaster [main]: Waiting for spark context initialization...
2024-12-24 20:55:44,381 INFO PythonRunner$ [Driver]: Initialized PythonRunnerOutputStream plugin org.apache.spark.microsoft.tools.api.plugin.MSToolsPythonRunnerOutputStreamPlugin.
2024-12-24 20:55:52,826 ERROR ApplicationMaster [Driver]: User application exited with status 1, error msg: Traceback (most recent call last):
File "/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1735073652210_0001/container_1735073652210_0001_01_000001/annotation.py", line 18, in <module>
from utils.pipeline import *
ModuleNotFoundError: No module named 'utils'
2024-12-24 20:55:52,831 INFO ApplicationMaster [Driver]: Final app status: FAILED, exitCode: 13, (reason: User application exited with status 1, error msg: Traceback (most recent call last):
File "/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1735073652210_0001/container_1735073652210_0001_01_000001/annotation.py", line 18, in <module>
from utils.pipeline import *
ModuleNotFoundError: No module named 'utils'
)
2024-12-24 20:55:52,839 ERROR ApplicationMaster [main]: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:322)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:525)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:284)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:967)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:966)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1907)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:966)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.PySparkUserAppException: User application exited with 1 : Traceback (most recent call last):
File "/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1735073652210_0001/container_1735073652210_0001_01_000001/annotation.py", line 18, in <module>
from utils.pipeline import *
ModuleNotFoundError: No module named 'utils'
at org.apache.spark.deploy.PythonRunner$.runPythonProcess(PythonRunner.scala:124)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:103)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:757)
2024-12-24 20:55:52,885 INFO ShutdownHookManager [shutdown-hook-0]: Shutdown hook called
2024-12-24 20:55:53,051 INFO MetricsSystemImpl [shutdown-hook-0]: Stopping azure-file-system metrics system...
2024-12-24 20:55:53,052 INFO MetricsSystemImpl [shutdown-hook-0]: azure-file-system metrics system stopped.
2024-12-24 20:55:53,052 INFO MetricsSystemImpl [shutdown-hook-0]: azure-file-system metrics system shutdown complete.
End of LogType:stderr
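To narrow down the ModuleNotFoundError above, a small check could be placed at the top of the entry file, before the failing import, to see whether the driver's Python interpreter can locate utils at all. This is only a diagnostic sketch using the standard library; describe_module is a hypothetical helper name:

```python
import importlib.util
import sys


def describe_module(name):
    """Report where a top-level module would be imported from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not importable from this interpreter"
    # Regular modules and packages have an origin; namespace packages only have search locations.
    location = spec.origin or list(spec.submodule_search_locations or [])
    return f"{name}: found at {location}"


# Example use inside the job (output goes to the driver log):
# print(describe_module("utils"))
# print(sys.path)
```

If the output says utils is not importable, the package was not installed into the pool's environment under that name, which would point back to the wheel contents or to the package not having been applied to the pool.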
***********************************************************************
Additionally, I found another error, TokenNotFoundInConfigurationException, when uploading the workspace package and applying it to the serverless Spark pool. I've created a linked service with SAS authentication to the Azure ADLS account that the Synapse pools use as storage.
Error:
2024-12-24 20:48:28,430 ERROR TokenLibrary$ [Thread-38]: No SasToken found in Configuration for conf: fs.azure.sas.python.gqxzenxanqd46y8etk1vhpob.blob.core.windows.net
java.lang.RuntimeException: No SasToken found in Configuration for conf: fs.azure.sas.python.gqxzenxanqd46y8etk1vhpob.blob.core.windows.net
at com.microsoft.azure.synapse.tokenlibrary.TokenLibrary$.getSystemSasToken(TokenLibrary.scala:120)
at com.microsoft.azure.synapse.tokenlibrary.TokenLibrary$.getSystemSasToken(TokenLibrary.scala:87)
at com.microsoft.azure.synapse.tokenlibrary.TokenLibrary.getSystemSasToken(TokenLibrary.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.microsoft.azure.synapse.tokenlibrary.util.TokenNotFoundInConfigurationException: No SasToken found in conf for Key: fs.azure.sas.python.gqxzenxanqd46y8etk1vhpob.blob.core.windows.net
at com.microsoft.azure.synapse.tokenlibrary.TokenLibraryInternal.getSasTokenOnlyFromConfiguration(TokenLibraryInternal.scala:593)
at com.microsoft.azure.synapse.tokenlibrary.TokenLibraryInternal.getSasTokenFromCacheOrConfiguration(TokenLibraryInternal.scala:561)
at com.microsoft.azure.synapse.tokenlibrary.TokenLibrary$.getSystemSasToken(TokenLibrary.scala:105)
... 14 more
2024-12-24 21:00:06,862 INFO TokenLibrary$ [Thread-38]: Getting SasToken for confKey: fs.azure.sas.library.gqxzenxanqd46y8etk1vhpob.blob.core.windows.net