About partition strategies on Synapse Spark Pools

Jona 640 Reputation points
2025-02-17T15:14:08.6966667+00:00

Hi

I recently posted this question:

https://learn.microsoft.com/en-us/answers/questions/2153368/about-partitioned-parquet-files-on-adls2

I wonder if there is a way to customize the partition size on Synapse Spark. I have a 40 GB table that generates a couple of thousand files, and I'm afraid of losing performance with this number of partitions when reading them on a Spark pool.

I would like this table to generate partition files of roughly 100 MB, and other tables to use different partition file sizes.

Regards

Azure Synapse Analytics

Accepted answer
  1. phemanth 13,905 Reputation points Microsoft Vendor
    2025-02-18T19:59:22.5866667+00:00

    @Jona

    Welcome to the Microsoft Q&A forum.

    In Synapse Pipelines, you can achieve similar partitioning and control over file sizes using the Copy Activity with parallelism and dynamic partitioning. Here are some steps to help you:

    Parallel Copy: Use the Degree of copy parallelism setting to control the number of parallel threads for copying data. This can help distribute the load and improve performance.

    • Go to the Source settings of the Copy Activity.
    • Set the Degree of copy parallelism to a value that suits your needs (default is 20, maximum is 50).

    Dynamic Partitioning: If your source table is partitioned, you can leverage these partitions for parallel copying.

    • In the Source settings, enable the Physical partitions of table option.
    • If your source table is not partitioned, you can define dynamic partition ranges based on a column (e.g., date or ID).

    Custom Partition Ranges: Define custom partition ranges using parameters and dynamic content expressions.

    • Create pipeline parameters for the partition column and ranges.
    • Use these parameters in the Source settings to define the partition ranges dynamically.
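
    If you would rather do the equivalent range-based parallel read from a Synapse Spark notebook instead of the Copy Activity, the Spark JDBC reader exposes similar options. The sketch below is illustrative only; the server, table, partition column and bounds are placeholders, not values from your environment.

    # Sketch: a Spark-side analogue of dynamic partition ranges using the JDBC reader.
    # All names and bounds below are placeholders.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")
          .option("dbtable", "dbo.MyTable")        # hypothetical source table
          .option("partitionColumn", "Id")         # numeric or date/time column to split on
          .option("lowerBound", "1")               # start of the partition column range
          .option("upperBound", "10000000")        # end of the partition column range
          .option("numPartitions", "40")           # number of parallel range queries
          .option("user", "<user>")
          .option("password", "<password>")
          .load())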

    Optimize Performance: Ensure your cluster size and resources are appropriate for the data size and partition count.

    • Monitor and adjust the Data Integration Units (DIUs) and nodes to handle the load efficiently.

    For more detailed guidance, please refer to the Copy Activity performance and scalability guide. I hope the above steps resolve the issue; please let us know if it persists. Thank you.

    1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Amira Bedhiafi 28,851 Reputation points
    2025-02-17T16:18:45.3566667+00:00

    You can control the number of partitions using the repartition and coalesce functions in Spark.

    • Repartition: shuffles the data and creates new partitions; you can use it to increase or decrease the number of partitions significantly:
        
        df = df.repartition(400)   # full shuffle; can increase or decrease the partition count
        
      
    • Coalesce: reduces the number of partitions without a full shuffle, so it is usually more efficient than repartition when you are decreasing the number of partitions.
        
        df = df.coalesce(100)   # merges existing partitions; can only reduce the count
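
    A quick way to choose between the two is to check how many partitions the DataFrame currently has; getNumPartitions is a standard RDD API, and the target of 400 below is only an illustration, not a recommendation.

    current = df.rdd.getNumPartitions()   # partitions the DataFrame has right now
    target = 400                          # illustrative target, not a recommendation

    if current > target:
        df = df.coalesce(target)          # shrinking: no full shuffle needed
    else:
        df = df.repartition(target)       # growing or rebalancing: requires a shuffle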
        
      

    If you want to achieve a specific partition size, say 100 MB, you can calculate the number of partitions from the total data size and the desired partition size.

    
    desired_partition_size_mb = 100                                    # target size per partition file
    total_size_mb = 40 * 1024                                          # total data size: 40 GB in MB
    num_partitions = int(total_size_mb / desired_partition_size_mb)    # ~409 partitions
    df = df.repartition(num_partitions)
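
    If you do not want to hard-code the 40 GB figure, you can estimate the size of the data already sitting in ADLS and derive the partition count from it. The sketch below assumes the Synapse mssparkutils file utilities; the folder path and the recursive helper are hypothetical, not part of the original answer.

    from notebookutils import mssparkutils

    def folder_size_bytes(path):
        # Hypothetical helper: mssparkutils.fs.ls is not recursive, so walk subfolders manually.
        total = 0
        for item in mssparkutils.fs.ls(path):
            if item.isDir:
                total += folder_size_bytes(item.path)
            else:
                total += item.size
        return total

    # Placeholder path -- point this at the table's folder in ADLS.
    source_path = "abfss://<container>@<storage_account>.dfs.core.windows.net/<path>"
    total_size_mb = folder_size_bytes(source_path) / (1024 * 1024)
    num_partitions = max(1, int(total_size_mb / desired_partition_size_mb))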
    

    When you write data to ADLS, you can control the file size by repartitioning the DataFrame before writing.

    
    # Parquet writes one file per partition, so this produces roughly num_partitions files
    df.repartition(num_partitions).write.parquet("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
    

    You can also set Spark configurations to control the default number of partitions and shuffle partitions.

    
    spark.conf.set("spark.sql.shuffle.partitions", "200")
    

    Keep in mind:

    • Larger clusters can handle more partitions efficiently. Ensure your cluster size is appropriate for the data size and partition count.
    • Make sure data is evenly distributed across partitions to avoid skew, which can impact performance.
    • Parquet is generally efficient for large datasets, but ensure compression settings are optimized.
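
    As a sketch of the last two points, these are standard Spark 3.x settings (Synapse Spark pools run Spark 3.x); the values shown are examples only, not tuned recommendations.

    # Let adaptive query execution coalesce small shuffle partitions toward a target size,
    # which helps when data is skewed or the fixed shuffle partition count is too high.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(100 * 1024 * 1024))

    # Choose the Parquet compression codec explicitly (snappy is the usual default).
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")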
