Yep - storing partitioned Parquet files, as shown in your image, is generally better for Spark environments than storing a single large file. The primary benefits of this approach include:
- Optimized query performance:
- Spark can prune partitions and read only the required files instead of scanning a huge single file.
- If your data is partitioned by a frequently used filter column (e.g., `date`), Spark will process queries much faster (see the sketch after this list).
- Parallelism and scalability:
- Spark can distribute the workload across multiple executors and process several smaller files in parallel.
- A single large file can become a bottleneck, as only a few executors may be able to process it simultaneously.
- Fault tolerance and data skipping:
- If a task processing one small file fails, only that partition needs to be retried, rather than reprocessing a massive file.
- Partition directories (together with Parquet footer statistics) let Spark skip data that is irrelevant to a query.
- Efficient reads and writes:
- Writing partitioned files is more efficient because Spark can append to or overwrite only the affected partitions, instead of rewriting a single large file, which can be slow and may require shuffling.
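To make the partition-pruning point concrete, here is a minimal PySpark sketch. The paths (`/data/events`, `/data/events_partitioned`) and the `date` column are illustrative assumptions, not taken from your setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-parquet-demo").getOrCreate()

# Illustrative source path and column names - adjust to your own data.
df = spark.read.parquet("/data/events")

# Write the data partitioned by a frequently filtered column (here: date).
(df.write
   .mode("overwrite")
   .partitionBy("date")
   .parquet("/data/events_partitioned"))

# A read that filters on the partition column only touches the matching
# date=... directories, so Spark skips the rest of the dataset entirely.
subset = (spark.read.parquet("/data/events_partitioned")
          .filter("date = '2024-01-15'"))
subset.show()
```

You can verify the pruning by looking at `subset.explain()` and checking the `PartitionFilters` entry in the file scan node of the plan.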
A single file (or fewer, larger files) might be more suitable if you have:
- A large number of small files (the so-called "small files problem"). If your partitioning creates too many tiny files, Spark can spend excessive time managing metadata instead of processing data. In such cases, consider using coalesce() or repartition() in Spark to merge smaller files into larger ones (a brief sketch follows this list).
- Schema evolution complexity. Managing schema evolution across multiple partitioned files can be tricky if different partitions have slight variations in schema.
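If you do run into the small-files problem, a compaction pass with coalesce()/repartition() could look roughly like this; again, the paths and the target file count are assumptions you would tune to your own data volume:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read the over-partitioned dataset (illustrative path).
df = spark.read.parquet("/data/events_partitioned")

# coalesce(n) reduces the number of output files without a full shuffle;
# use repartition(n) instead if the data is skewed and needs rebalancing.
(df.coalesce(8)
   .write
   .mode("overwrite")
   .parquet("/data/events_compacted"))
```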
So if partitions are meaningful (e.g., date-based, category-based), stick with partitioned files.
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin