MySQL Flexible Server Replication Lag (~37 mins) During Peak Load Without CPU, Memory, or IOPS Bottlenecks

Question

Environment Details:

Primary Server:
Configuration: Business Critical, E16ds_v5 (16 vCores, 128 GiB RAM, 2227 GB Storage, 10000 IOPS)
Replication Type: Binlog-based replication (gtid_mode = OFF)
- MySQL Version: 8.0
Replica Server:
Configuration: Business Critical, E8ds_v5 (8 vCores, 64 GiB RAM, 2338 GB Storage, 10000 IOPS)
- MySQL Version: 8.0
- Replication Type: Binlog-based
- Parallel Workers (slave_parallel_workers): 32
- Pending Jobs Size (slave_pending_jobs_size_max): 1GB
- Disk & IOPS: No burst observed, stable at 10000 IOPS
- CPU & Memory Usage: No spikes during lag periods
We are observing consistent replication lag (~37 minutes) during peak load, even though:
- No CPU, memory, or IOPS burst is observed on the Primary or Replica.
  - Binlog processing is happening, but the replica is not catching up.
  - Long-running transactions on the Primary are minimal.

Answer

Thank you for reaching out. Based on your environment details and observations, the replication lag (~37 minutes) during peak load despite stable CPU, memory, and IOPS suggests that the issue could be related to one or more of the following factors:

Possible Causes & Recommendations:

Parallel Replication Bottlenecks

Even though you have 32 parallel workers (slave_parallel_workers = 32), MySQL replication executes transactions in a single-threaded manner per schema. If your workload is concentrated on a few schemas, parallelism may not be fully utilized.

Solution: Consider increasing binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count to optimize group commits.

Replica Thread Throughput Limitations

The SQL thread on the replica might be struggling to apply binlog events fast enough, even if parallel workers exist.

Solution: Monitor the replication thread latency using:

      SHOW PROCESSLIST;

      SHOW SLAVE STATUS\G;

Look for events where the "Waiting for prior transaction to commit" state appears frequently.

3)Large Transactions or Row-Based Replication Overhead

Even though you mentioned long-running transactions are minimal, large transactions (especially row-based replication) can block parallel workers from processing other events until completion.

Solution:Check for large transactions using:

SELECT * FROM performance_schema.events_transactions_history_long;

Consider reducing the transaction size or enabling logical replication for better parallelism.

Replica Disk Write Latency

While IOPS is stable at 10,000, disk latency can still be a factor. If the fsync() operations on the replica take longer than expected, the replica might slow down.

Solution: Monitor disk write latency using:

        SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';

        SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs_per_sec'

If high latency is observed, consider switching to IO-optimized storage.

5)Network Latency Between Primary & Replica

Even with no CPU/IOPS bottlenecks, network congestion between primary and replica can slow binlog transfer

Solution: Check replication network latency using

       ping

If high latency is observed, consider deploying the replica closer to the primary region.

Additional Debugging Steps

Enable Performance Schema and monitor replication bottlenecks SELECT * FROM performance_schema.replication_applier_status_by_worker;
MySQL binary log file size growth SHOW BINARY LOGS;
Check if temporary tables are causing stalls SHOW GLOBAL STATUS LIKE 'Created_tmp_tables';

If the replication lag continues even after applying these optimizations, please follow the below steps

1.Upgrade the replica instance to match the primary’s resources for better performance.

2.Use MySQL Group Replication (if applicable) to improve synchronization and reduce lag

3.Switch to Logical Replication, such as MySQL Replication with binlog_row_image=MINIMAL, to optimiz

For more information, please refer the documents:

https://learn.microsoft.com/en-us/azure/mysql/how-to-troubleshoot-replication-latency

I hope this information helps. Please do let us know if you have any further queries.

Share via

MySQL Flexible Server Replication Lag (~37 mins) During Peak Load Without CPU, Memory, or IOPS Bottlenecks

1 answer

Your answer