Thank you for reaching out. Based on your environment details and observations, the replication lag (~37 minutes) during peak load despite stable CPU, memory, and IOPS suggests that the issue could be related to one or more of the following factors:
Possible Causes & Recommendations:
- Parallel Replication Bottlenecks
Even though you have 32 parallel workers (slave_parallel_workers = 32), MySQL replication executes transactions in a single-threaded manner per schema. If your workload is concentrated on a few schemas, parallelism may not be fully utilized.
Solution: Consider increasing binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count to optimize group commits.
- Replica Thread Throughput Limitations
The SQL thread on the replica might be struggling to apply binlog events fast enough, even if parallel workers exist.
Solution: Monitor the replication thread latency using:
SHOW PROCESSLIST;
SHOW SLAVE STATUS\G;
Look for events where the "Waiting for prior transaction to commit" state appears frequently.
3)Large Transactions or Row-Based Replication Overhead
Even though you mentioned long-running transactions are minimal, large transactions (especially row-based replication) can block parallel workers from processing other events until completion.
Solution:Check for large transactions using:
SELECT * FROM performance_schema.events_transactions_history_long;
Consider reducing the transaction size or enabling logical replication for better parallelism.
- Replica Disk Write Latency
While IOPS is stable at 10,000, disk latency can still be a factor. If the fsync() operations on the replica take longer than expected, the replica might slow down.
Solution: Monitor disk write latency using:
SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';
SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs_per_sec'
If high latency is observed, consider switching to IO-optimized storage.
5)Network Latency Between Primary & Replica
Even with no CPU/IOPS bottlenecks, network congestion between primary and replica can slow binlog transfer
Solution: Check replication network latency using
ping <replica_ip>
If high latency is observed, consider deploying the replica closer to the primary region.
Additional Debugging Steps
- Enable Performance Schema and monitor replication bottlenecks SELECT * FROM performance_schema.replication_applier_status_by_worker;
- MySQL binary log file size growth SHOW BINARY LOGS;
- Check if temporary tables are causing stalls SHOW GLOBAL STATUS LIKE 'Created_tmp_tables';
If the replication lag continues even after applying these optimizations, please follow the below steps
1.Upgrade the replica instance to match the primary’s resources for better performance.
2.Use MySQL Group Replication (if applicable) to improve synchronization and reduce lag
3.Switch to Logical Replication, such as MySQL Replication with binlog_row_image=MINIMAL, to optimiz
For more information, please refer the documents:
https://learn.microsoft.com/en-us/azure/mysql/how-to-troubleshoot-replication-latency
I hope this information helps. Please do let us know if you have any further queries.