How to use JSON files packaged in a wheel in an Azure Synapse Spark job definition or notebook

manigandan 0 Reputation points
2025-01-06T16:14:49.98+00:00

Hello Everyone

I'm having trouble accessing a JSON file that is packaged in a Python wheel from a Synapse Spark job definition. My configuration is below. Any help would be truly appreciated. Thanks.

I'm using the setup.py file below to package the PySpark project.

from setuptools import setup, find_packages
from glob import glob

setup(
    name="Data_Project",
    version="1.0.0",
    author="Mani",
    author_email="****",
    packages=find_packages(),
    package_data={"Data_Project": ["*.json"]},
    data_files=[
        ("Data_Project", glob('code_repo/src/schemas/*.json'))
    ],
    include_package_data=True,
    description='Data Engineering Project'
)

Here is the project structure.

-code_repo
--src
---xyz
-----main.py (Entry file)
---utils
-----pipeline.py (Reusable functions are available)
---schemas
-----data.json
-----abc.json
-setup.py
-requirements.txt

I'm using the python setup.py bdist_wheel command to package the project as a wheel. I've unzipped and validated the packaged wheel: it has the .json files under the code_repo/src/schemas directory. I've also validated the RECORD file, which lists the JSON files under the schemas directory.

I've uploaded the wheel as a workspace package in Synapse Studio. I can access the util files successfully, but when I try to access a .json file from the schemas folder I get a "no such file" error.

Tried approach: I'm trying to access the .json file from the main.py file in the xyz directory.

import json
import logging
import os

schema_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "schemas", "xyz.json")
logging.info("Schema path: %s", schema_path)  # the original string lacked formatting, so the path was never logged
with open(schema_path, 'r', encoding='utf-8') as file:
    schema_json = json.load(file)

I've used a Synapse notebook to verify that the package installed successfully by running os.system("pip list"); the custom project appears among the installed libraries.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. Smaran Thoomu 18,965 Reputation points Microsoft Vendor
    2025-01-07T11:07:10.42+00:00

    Hi @manigandan
    Welcome to Microsoft Q&A platform.
    Thank you for sharing the details of your issue! I understand that working with files packaged inside a Python wheel can be tricky, especially in environments like Azure Synapse. Let's work through this step by step.

    When you package your JSON files into a wheel using setup.py, where they end up after installation depends on how they were declared, and the installed layout may not mirror your source tree. A path built from __file__ can therefore point to a location that doesn't exist on the Spark pool. Instead, access the files as package resources using Python's built-in tools.

    To fix this issue, you can use the files function from the importlib.resources module, which is available in Python 3.9 or newer and is designed for exactly this purpose. If you're using an older version of Python, you can install the backport library importlib-resources, which provides the same function under the module name importlib_resources.

    Here's an example of how you can update your code in main.py to properly load the JSON file using importlib.resources:

    import json
    from importlib.resources import files
    
    # Make sure this matches your installed package layout (schemas must be importable)
    from Data_Project import schemas
    
    # Access the JSON file
    schema_file = files(schemas).joinpath("xyz.json")
    with schema_file.open('r', encoding='utf-8') as file:
        schema_json = json.load(file)
    
    # Optional: Print to confirm it's working
    print("Loaded schema:", schema_json)
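
    To try the same pattern outside Synapse, here is a minimal, self-contained sketch. It does not use your wheel: the demo_project.schemas package, the data.json contents, and the load_schema and make_demo_package helpers are all hypothetical names made up for this illustration. It also shows the fallback import for Python versions older than 3.9.

```python
import json
import sys
import tempfile
from pathlib import Path

try:
    from importlib.resources import files  # standard library, Python 3.9+
except ImportError:
    from importlib_resources import files  # backport: pip install importlib-resources


def load_schema(package, filename):
    """Load a JSON file shipped inside an importable package."""
    resource = files(package).joinpath(filename)
    with resource.open("r", encoding="utf-8") as handle:
        return json.load(handle)


def make_demo_package(root):
    """Create a throwaway package demo_project.schemas containing data.json."""
    schemas_dir = Path(root) / "demo_project" / "schemas"
    schemas_dir.mkdir(parents=True)
    (schemas_dir.parent / "__init__.py").write_text("")
    (schemas_dir / "__init__.py").write_text("")
    (schemas_dir / "data.json").write_text(json.dumps({"fields": ["id", "name"]}))


with tempfile.TemporaryDirectory() as tmp:
    make_demo_package(tmp)
    sys.path.insert(0, tmp)  # make the demo package importable
    schema = load_schema("demo_project.schemas", "data.json")
    sys.path.remove(tmp)
    print(schema)
```

    In your job, the equivalent call would pass the dotted name of whichever package actually contains the JSON files in the installed wheel. Nothing else depends on where that package lives on disk.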
    

    If you can't use importlib.resources, another way to do this is with pkgutil. Here's an example:

    import pkgutil
    import json
    
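    # Access the JSON file. get_data returns None if the package cannot be
    # located or its loader does not support get_data.
    data = pkgutil.get_data('Data_Project.schemas', 'xyz.json')
    if data is None:
        raise ImportError("Could not load resources from Data_Project.schemas")
    schema_json = json.loads(data.decode('utf-8'))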
    
    # Optional: Print to confirm it's working
    print("Loaded schema:", schema_json)
    

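    Before you try these solutions, make sure that the JSON files are included in your wheel inside an importable package and that the package is installed correctly in Synapse. Keep in mind that package_data places files inside the package directory, whereas data_files installs them relative to the environment prefix, outside the package. You can also debug the path in your code by printing the resolved path and checking whether it exists.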
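
    Both checks can be scripted. The sketch below is only an illustration: the helper names json_entries_in_wheel and installed_json_files are hypothetical, and it relies only on the standard library's zipfile and importlib.metadata modules (the latter is available from Python 3.8).

```python
import zipfile
from importlib.metadata import PackageNotFoundError, files


def json_entries_in_wheel(wheel_path):
    """Return the archive paths of all .json files inside a wheel (a zip file)."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return [name for name in wheel.namelist() if name.endswith(".json")]


def installed_json_files(dist_name):
    """Return on-disk paths of installed .json files, or None if not installed."""
    try:
        entries = files(dist_name)
    except PackageNotFoundError:
        return None
    # files() returns None when the distribution has no RECORD metadata.
    return [
        str(entry.locate())
        for entry in entries or []
        if str(entry).endswith(".json")
    ]
```

    Running print(installed_json_files("Data_Project")) in a Synapse notebook should show where the JSON files were actually installed. If they are not inside an importable package directory, the resource-based loaders above will not find them.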

    If you're still running into issues, feel free to share any error messages you're seeing. I'll be happy to help you out.

    I hope this advice helps you move forward! Let me know if you have any more questions.
