Partitioned parquet file into an Azure Database

Iwan 65

I inserted a parquet file into an Azure database and it had low throughput so I thought if I partitioned the file I could load the partitions in parallel.

I partitioned the file on DefaultRating using pyspark and tried to insert but I'm not getting the right settings as it's not copying at all anymore.

Below is one of the partition folders based on DefaultRating:

User's image

Below is the source dataset settings, a simple copy data activity with the parquet dataset including the snappy part files in each partition folder.

User's image

This returned a path not found error. When the filename was the Output_25_10_24.parquet nothing was written:

User's image

Smaran Thoomu 16,735 Reputation points Microsoft Vendor

2024-11-07T12:59:14.0133333+00:00
Hi @Iwan
Welcome to Microsoft Q&A platform and thanks for posting your query here.

As I understand it looks like you're having trouble with partitioned Parquet files in Azure Data Factory (ADF). Here are a few things to check and try:

Correct File Path Configuration: The "path not found" error may be due to an incorrect file path or wildcard usage.

Make sure your source dataset's file path correctly matches your files. If you have multiple partitioned files, use a wildcard like part-* for the filename.

Double-check the folder structure to ensure it matches the path specified in the dataset.

Partitioned Files Handling:

ADF might have trouble processing deeply nested partition folders if the root folder isn't specified properly.

If you enabled partition discovery, make sure ADF correctly detects all partitions and files.

Dataflow for Complex Scenarios: If you still have issues or need better performance, try using a Dataflow activity instead of Copy activity. Dataflow allows better control over partitioning and parallel processing.

Ensure your ADF setup has enough Data Integration Units (DIUs) for better throughput.

Increase the parallel copy settings if needed for faster copying.

Start by testing with a small number of files to make sure everything is configured correctly. If the "path not found" error keeps happening, double-check the error details to see what's missing or misconfigured.

I hope this helps! Let me know if you have any further questions.
Iwan 65 Reputation points

2024-11-07T15:16:45.18+00:00

@Smaran Thoomu

Filename:

Output_25_10_24.parquet/DefaultRating=1.0/part-00003-bb8410a2-e623-47e5-8aa8-00ef961b2c61.c000.snappy.parquet

inserts that partition successfully.

Filename:

Output_25_10_24.parquet/DefaultRating=1.0/part-*

returned an invalid status code 'NotFound'

I've tried this:

So I'm not sure how to even get one partition folder.

I have enabled partition discovery per my screenshot and I got the partition root path by stepping through the root folder path.
Smaran Thoomu 16,735 Reputation points Microsoft Vendor

2024-11-07T18:35:24.0166667+00:00

@Iwan It seems that the wildcard pattern might not be working as expected. Sometimes, issues can arise from the syntax of the path itself. Try adjusting the syntax slightly. For example, you might try:

{source container}/Output_25_10_24.parquet/DefaultRating=1.0/part-*

If the above doesn't work, try:

{source container}/Output_25_10_24.parquet/DefaultRating=1.0/part-*.parquet

As a troubleshooting step, you can also try specifying a single file path without using wildcards, just to confirm that the basic path and file access are functioning correctly.

Share via

Partitioned parquet file into an Azure Database

Your answer