MySQL Flexible Server Replication Lag (~37 mins) During Peak Load Without CPU, Memory, or IOPS Bottlenecks

Niket Kumar Singh 435 Reputation points
2025-02-14T08:13:56.8366667+00:00

Environment Details:

  • Primary Server:
  • Configuration: Business Critical, E16ds_v5 (16 vCores, 128 GiB RAM, 2227 GB Storage, 10000 IOPS)
  • Replication Type: Binlog-based replication (gtid_mode = OFF)
    • MySQL Version: 8.0
  • Replica Server:
  • Configuration: Business Critical, E8ds_v5 (8 vCores, 64 GiB RAM, 2338 GB Storage, 10000 IOPS)
    • MySQL Version: 8.0
    • Replication Type: Binlog-based
    • Parallel Workers (slave_parallel_workers): 32
    • Pending Jobs Size (slave_pending_jobs_size_max): 1GB
    • Disk & IOPS: No burst observed, stable at 10000 IOPS
    • CPU & Memory Usage: No spikes during lag periods
    We are observing consistent replication lag (~37 minutes) during peak load, even though:
    • No CPU, memory, or IOPS burst is observed on the Primary or Replica.
      • Binlog processing is happening, but the replica is not catching up.
      • Long-running transactions on the Primary are minimal.
Azure Database for MySQL
Azure Database for MySQL
An Azure managed MySQL database service for app development and deployment.
915 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Prasad Chaganti 0 Reputation points Microsoft External Staff
    2025-02-14T18:21:06.9133333+00:00

    hi @Niket Kumar Singh

    Thank you for reaching out. Based on your environment details and observations, the replication lag (~37 minutes) during peak load despite stable CPU, memory, and IOPS suggests that the issue could be related to one or more of the following factors:

    Possible Causes & Recommendations:

    1. Parallel Replication Bottlenecks

    Even though you have 32 parallel workers (slave_parallel_workers = 32), MySQL replication executes transactions in a single-threaded manner per schema. If your workload is concentrated on a few schemas, parallelism may not be fully utilized.

    Solution: Consider increasing binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count to optimize group commits.

    1. Replica Thread Throughput Limitations

    The SQL thread on the replica might be struggling to apply binlog events fast enough, even if parallel workers exist.

    Solution: Monitor the replication thread latency using:

          SHOW PROCESSLIST;
    
          SHOW SLAVE STATUS\G;
    

    Look for events where the "Waiting for prior transaction to commit" state appears frequently.

    3)Large Transactions or Row-Based Replication Overhead

    Even though you mentioned long-running transactions are minimal, large transactions (especially row-based replication) can block parallel workers from processing other events until completion.

    Solution:Check for large transactions using:

    SELECT * FROM performance_schema.events_transactions_history_long;

    Consider reducing the transaction size or enabling logical replication for better parallelism.

    1. Replica Disk Write Latency

    While IOPS is stable at 10,000, disk latency can still be a factor. If the fsync() operations on the replica take longer than expected, the replica might slow down.

    Solution: Monitor disk write latency using:

            SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';
    
            SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs_per_sec'
    

    If high latency is observed, consider switching to IO-optimized storage.

    5)Network Latency Between Primary & Replica

    Even with no CPU/IOPS bottlenecks, network congestion between primary and replica can slow binlog transfer

    Solution: Check replication network latency using

           ping <replica_ip>
    

    If high latency is observed, consider deploying the replica closer to the primary region.

    Additional Debugging Steps

    1. Enable Performance Schema and monitor replication bottlenecks SELECT * FROM performance_schema.replication_applier_status_by_worker;
    2. MySQL binary log file size growth SHOW BINARY LOGS;
    3. Check if temporary tables are causing stalls SHOW GLOBAL STATUS LIKE 'Created_tmp_tables';

    If the replication lag continues even after applying these optimizations, please follow the below steps

    1.Upgrade the replica instance to match the primary’s resources for better performance.

    2.Use MySQL Group Replication (if applicable) to improve synchronization and reduce lag

    3.Switch to Logical Replication, such as MySQL Replication with binlog_row_image=MINIMAL, to optimiz

    For more information, please refer the documents:

    https://learn.microsoft.com/en-us/azure/mysql/how-to-troubleshoot-replication-latency

    I hope this information helps. Please do let us know if you have any further queries.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.