About partition strategies on Synapse Spark Pools

Jona 640 Reputation points
2025-02-17T15:14:08.6966667+00:00

Hi

I recently posted this question:

https://learn.microsoft.com/en-us/answers/questions/2153368/about-partitioned-parquet-files-on-adls2

I wonder if there is a way to customize the partition size on Synapse Spark. I have a 40 GB table that generates a couple of thousand files, and I'm afraid of losing performance with this number of partitions when reading them on a Spark pool.

I would like this table to generate partition files of roughly 100 MB, and other tables to use different partition file sizes.

Regards

Azure Synapse Analytics

Accepted answer
  1. phemanth 13,905 Reputation points Microsoft Vendor
    2025-02-18T19:59:22.5866667+00:00

    @Jona

    Welcome to the Microsoft Q&A forum.

    In Synapse Pipelines, you can achieve similar partitioning and control over file sizes using the Copy Activity with parallelism and dynamic partitioning. Here are some steps to help you:

    Parallel Copy: Use the Degree of copy parallelism setting to control the number of parallel threads for copying data. This can help distribute the load and improve performance.

    • Go to the Source settings of the Copy Activity.
    • Set the Degree of copy parallelism to a value that suits your needs (default is 20, maximum is 50).

    Dynamic Partitioning: If your source table is partitioned, you can leverage these partitions for parallel copying.

    • In the Source settings, enable the Physical partitions of table option.
    • If your source table is not partitioned, you can define dynamic partition ranges based on a column (e.g., date or ID).

    Custom Partition Ranges: Define custom partition ranges using parameters and dynamic content expressions.

    • Create pipeline parameters for the partition column and ranges.
    • Use these parameters in the Source settings to define the partition ranges dynamically.
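
    If you would rather do the equivalent range-based parallel read from a Synapse Spark notebook instead of the Copy Activity, the Spark JDBC reader exposes similar options. The sketch below is illustrative only; the server, table, partition column and bounds are placeholders, not values from your environment.

    # Sketch: a Spark-side analogue of dynamic partition ranges using the JDBC reader.
    # All names and bounds below are placeholders.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")
          .option("dbtable", "dbo.MyTable")        # hypothetical source table
          .option("partitionColumn", "Id")         # numeric or date/time column to split on
          .option("lowerBound", "1")               # start of the partition column range
          .option("upperBound", "10000000")        # end of the partition column range
          .option("numPartitions", "40")           # number of parallel range queries
          .option("user", "<user>")
          .option("password", "<password>")
          .load())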

    Optimize Performance: Ensure your cluster size and resources are appropriate for the data size and partition count.

    • Monitor and adjust the Data Integration Units (DIUs) and nodes to handle the load efficiently.

    For more detailed guidance, please refer to the Copy Activity performance and scalability guide. I hope the above steps resolve the issue; please let us know if it persists. Thank you.

    1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Amira Bedhiafi 28,851 Reputation points
    2025-02-17T16:18:45.3566667+00:00

    You can control the number of partitions using the repartition and coalesce functions in Spark.

    • Repartition: shuffles the data and creates new partitions; you can use it to increase or decrease the number of partitions significantly:
        
        df = df.repartition(400)   # full shuffle; can increase or decrease the partition count
        
      
    • Coalesce: reduces the number of partitions without a full shuffle, so it is usually more efficient than repartition when you are decreasing the number of partitions.
        
        df = df.coalesce(100)   # merges existing partitions; can only reduce the count
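
    A quick way to choose between the two is to check how many partitions the DataFrame currently has; getNumPartitions is a standard RDD API, and the target of 400 below is only an illustration, not a recommendation.

    current = df.rdd.getNumPartitions()   # partitions the DataFrame has right now
    target = 400                          # illustrative target, not a recommendation

    if current > target:
        df = df.coalesce(target)          # shrinking: no full shuffle needed
    else:
        df = df.repartition(target)       # growing or rebalancing: requires a shuffle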
        
      

    If you want to achieve a specific partition size, say 100 MB, you can calculate the number of partitions from the total data size and the desired partition size.

    
    desired_partition_size_mb = 100                                    # target size per partition file
    total_size_mb = 40 * 1024                                          # total data size: 40 GB in MB
    num_partitions = int(total_size_mb / desired_partition_size_mb)    # ~409 partitions
    df = df.repartition(num_partitions)
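
    If you do not want to hard-code the 40 GB figure, you can estimate the size of the data already sitting in ADLS and derive the partition count from it. The sketch below assumes the Synapse mssparkutils file utilities; the folder path and the recursive helper are hypothetical, not part of the original answer.

    from notebookutils import mssparkutils

    def folder_size_bytes(path):
        # Hypothetical helper: mssparkutils.fs.ls is not recursive, so walk subfolders manually.
        total = 0
        for item in mssparkutils.fs.ls(path):
            if item.isDir:
                total += folder_size_bytes(item.path)
            else:
                total += item.size
        return total

    # Placeholder path -- point this at the table's folder in ADLS.
    source_path = "abfss://<container>@<storage_account>.dfs.core.windows.net/<path>"
    total_size_mb = folder_size_bytes(source_path) / (1024 * 1024)
    num_partitions = max(1, int(total_size_mb / desired_partition_size_mb))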
    

    When you write data to ADLS, you can control the file size by repartitioning the DataFrame before writing.

    
    # Parquet writes one file per partition, so this produces roughly num_partitions files
    df.repartition(num_partitions).write.parquet("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
    

    You can also set Spark configurations to control the default number of partitions and shuffle partitions.

    
    spark.conf.set("spark.sql.shuffle.partitions", "200")
    

    Keep in mind:

    • Larger clusters can handle more partitions efficiently. Ensure your cluster size is appropriate for the data size and partition count.
    • Make sure data is evenly distributed across partitions to avoid skew, which can impact performance.
    • Parquet is generally efficient for large datasets, but ensure compression settings are optimized.
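
    As a sketch of the last two points, these are standard Spark 3.x settings (Synapse Spark pools run Spark 3.x); the values shown are examples only, not tuned recommendations.

    # Let adaptive query execution coalesce small shuffle partitions toward a target size,
    # which helps when data is skewed or the fixed shuffle partition count is too high.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(100 * 1024 * 1024))

    # Choose the Parquet compression codec explicitly (snappy is the usual default).
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")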
