参数配置不当导致GreatSQL异步复制IO线程中断
1.问题发生
某个客户的一套GreatSQL数据库,异步复制的slave节点的IO线程无法启动,报错信息如下:
2026-02-16T12:36:40.086620+08:00 12452 [ERROR] [MY-010557] [Repl] Error reading packet from server for channel '':Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633, 8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740, 8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5', and the missing transaction is '8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718' (server errno=1236)
2026-02-16T12:36:40.086664+08:00 12452 [ERROR] [MY-013114] [Repl] Slave I/0 for channel: '' Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633, 8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740, 8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5', and the missing transaction is '8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718'', Error code: MY-013114
此日志信息显示:由于主库的binlog已清除,IO线程中断。用户现场DBA进行了clone恢复。
但问题原因究竟是什么呢?为什么中间丢失了一个事务8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718呢?
2. 问题分析
2.1 error.log解读
仔细阅读error.log中的报错信息,可以解读出如下信息:
- slave在启动IO线程时向master节点发送的GTID事务集是:
8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633,
8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740,
8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5'
- slave已经执行的GTID事务集不连续,中间缺少了事务
8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718。 - slave在缺少该事务
8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718后,执行了257021个事务。 - IO线程启动失败的原因是:master节点中包含事务
19789718的binlog已被清理。
2.2 初步判断
检查用户的binlog_expire_logs_seconds是604800(7天),并不短。
根据GTID有空洞的现象,判断与参数slave_skip_errors数相关,检查用户的参数配置slave_skip_errors是ALL。
参数slave_skip_errors默认(OFF)情况下,当slave上发生错误时,复制进程会停止,让DBA有机会解决数据中的不一致问题。当配置slave_skip_errors配置错误类型时,当slave上某条语句返回的错误属于该变量值中列出的任一错误类型,复制SQL线程仍会继续执行。当slave_skip_errors=ALL时,slave上发生任何错误,SQL线程都会继续执行。
3. 实验验证
3.1 实验设计
- 搭建异步复制的环境,slave端设置并行复制,参数
slave_skip_errors=all。 - 在slave端制造事务阻塞,让slave SQL线程,执行失败。
- 在slave端,开启事务执行update语句,不提交;
- 在master端执行修改同一行数据的update语句;
- 观察slave端事务的执行情况,检查是否会出现GTID空洞。
- 重启slave,检查是否出现与用户相同的报错信息。
3.2 实验1:slave_skip_errors=ALL模拟问题
- 搭建异步复制的环境,slave端设置并行复制,参数
slave_skip_errors=ALL
检查slave端的参数
GreatSQL> SELECT @@global.slave_parallel_workers,@@global.innodb_lock_wait_timeout,@@global.slave_skip_errors\G
*************************** 1. row ***************************
@@global.slave_parallel_workers: 64
@@global.innodb_lock_wait_timeout: 10
@@global.slave_skip_errors: ALL
1 row in set, 2 warnings (0.01 sec)
- 在slave端,开始事务执行update语句(test1的id是主键),不提交;
BEGIN;
UPDATE test1 SET c='dddddddddddd' WHERE id=2;
- 在master端执行修改相同行的update语句,该语句在slave端应用时,会被阻塞。
UPDATE test1 SET c='eeeeeeeee' WHERE id=2;
- 在master端使用sysbench执行其他事务。
- 检查master端和slave端的gtid_execued
-- master端
GreatSQL> SELECT @@gtid_executed;
+----------------------------------------------------------------------------------------+
| @@gtid_executed |
+----------------------------------------------------------------------------------------+
| 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-84055 |
+----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
GreatSQL> SELECT * FROM test1 WHERE id=2;
+----+-------+-----------+-------------------------------------------------------------+
| id | k | c | pad |
+----+-------+-----------+-------------------------------------------------------------+
| 2 | 50248 | eeeeeeeee | 28733802923-10548894641-11867531929-71265603657-36546888392 |
+----+-------+-----------+-------------------------------------------------------------+
1 row in set (0.00 sec)
-- slave端
GreatSQL> SELECT @@gtid_executed;
+--------------------------------------------------------------------------------------------------------------------------------------------+
| @@gtid_executed |
+--------------------------------------------------------------------------------------------------------------------------------------------+
| 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1 |
+--------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
GreatSQL> SELECT * FROM test1 WHERE id=2;
+----+-------+--------------+-------------------------------------------------------------+
| id | k | c | pad |
+----+-------+--------------+-------------------------------------------------------------+
| 2 | 50248 | dddddddddddd | 28733802923-10548894641-11867531929-71265603657-36546888392 |
+----+-------+--------------+-------------------------------------------------------------+
1 row in set (0.00 sec)
可见事务c12440a7-12c1-11f1-95ad-00163e379f11:40805没有在slave端执行,slave端的GTID出现可空洞。
- 检查slave端的error.log
在error.log中有该事务的如下信息
2026-02-27T09:26:49.530843+08:00 29 [Note] [MY-010584] [Repl] Slave SQL for channel '': Worker 18 failed executing transaction 'c12440a7-12c1-11f1-95ad-00163e379f11:43805' at master log binlog.000003, end_log_pos 266711465; Could not execute Update_rows event on table testdb.test1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log binlog.000003, end_log_pos 266711465, Error_code: MY-001205
可知由于锁等待超时,事务c12440a7-12c1-11f1-95ad-00163e379f11:43805执行失败。
- 检查slave的状态
GreatSQL> SHOW slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.138.128
Master_User: greatdb
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: binlog.000003
Read_Master_Log_Pos: 337004154
Relay_Log_File: relaylog.000004
Relay_Log_Pos: 337003634
Relay_Master_Log_File: binlog.000003
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 337004154
Relay_Log_Space: 337003837
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1006503306
Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:1-84055
Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
可见io和sql线程都正常,数据不一致的隐患未及时发现。
- master端purge binlog后,重启slave,slave正常。
GreatSQL> SHOW slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.138.128
Master_User: greatdb
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: binlog.000004
Read_Master_Log_Pos: 237
Relay_Log_File: relaylog.000007
Relay_Log_Pos: 407
Relay_Master_Log_File: binlog.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 237
Relay_Log_Space: 860
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1006503306
Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:1-84055
Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
- slave端reset slave后,再start slave,io线程无法启动,报错如下:
2026-02-27T11:04:19.165355+08:00 31406 [ERROR] [MY-010557] [Repl] Error reading packet from server for channel '': Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-109033, f8afd36c-12c1-11f1-a6e8-00163e892deb:1', and the missing transactions are 'c12440a7-12c1-11f1-95ad-00163e379f11:43805' (server_errno=1236)
2026-02-27T11:04:19.165486+08:00 31406 [ERROR] [MY-013114] [Repl] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-109033, f8afd36c-12c1-11f1-a6e8-00163e892deb:1', and the missing transactions are 'c12440a7-12c1-11f1-95ad-00163e379f11:43805'', Error_code: MY-013114
报错信息和用户现场的报错信息相同。
3.3 实验2:slave_skip_errors=OFF重复实验
- 搭建异步复制的环境,slave端设置并行复制,参数
slave_skip_errors=OFF
检查slave端的参数
GreatSQL> SELECT @@global.slave_parallel_workers,@@global.innodb_lock_wait_timeout,@@global.slave_skip_errors\G
*************************** 1. row ***************************
@@global.slave_parallel_workers: 64
@@global.innodb_lock_wait_timeout: 10
@@global.slave_skip_errors: OFF
1 row in set, 2 warnings (0.01 sec)
- 在slave端,开始事务执行update语句,不提交;
BEGIN;
UPDATE test1 SET c='from slave222' WHERE id=2;
- 在master端执行修改相同行的update语句,该语句在slave端应用时,会被阻塞。
UPDATE test1 SET c='from master1111' WHERE id=2;
- 在master端使用sysbench执行其他事务。
- 检查slave的状态,发现sql线程已停止。
GreatSQL> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.138.128
Master_User: greatdb
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: binlog.000004
Read_Master_Log_Pos: 61931905
Relay_Log_File: relaylog.000004
Relay_Log_Pos: 411
Relay_Master_Log_File: binlog.000004
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1205
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c12440a7-12c1-11f1-95ad-00163e379f11:109034' at master log binlog.000004, end_log_pos 43620845. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 43620411
Relay_Log_Space: 18312445
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1205
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c12440a7-12c1-11f1-95ad-00163e379f11:109034' at master log binlog.000004, end_log_pos 43620845. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1006503306
Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 260227 11:18:58
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:109034-119519
Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-109033,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
GreatSQL> SELECT * from performance_schema.replication_applier_status_by_worker limit 1\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_ERROR_NUMBER: 1205
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'c12440a7-12c1-11f1-95ad-00163e379f11:109034' at master log binlog.000004, end_log_pos 43620845; Lock wait timeout exceeded; try restarting transaction
LAST_ERROR_TIMESTAMP: 2026-02-27 11:18:58.506075
LAST_APPLIED_TRANSACTION:
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION: c12440a7-12c1-11f1-95ad-00163e379f11:109034
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 2026-02-27 11:16:28.454244
APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2026-02-27 11:16:28.454244
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP: 2026-02-27 11:16:28.463083
LAST_APPLIED_TRANSACTION_RETRIES_COUNT: 0
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION_RETRIES_COUNT: 10
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 1205
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE: Lock wait timeout exceeded; try restarting transaction
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 2026-02-27 11:18:43.502374
1 row in set (0.00 sec)
可见:由于锁等待超时,事务c12440a7-12c1-11f1-95ad-00163e379f11:109034无法执行,从而导致slave的sql线程异常退出,可以有机会修复异常数据。
- slave端的事务提交或回滚后,阻塞解除。启动slave,事务开始正常应用。
GreatSQL> START slave;
Query OK, 0 rows affected, 1 warning (0.48 sec)
GreatSQL> SHOW slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.138.128
Master_User: greatdb
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: binlog.000004
Read_Master_Log_Pos: 61931905
Relay_Log_File: relaylog.000004
Relay_Log_Pos: 875819
Relay_Master_Log_File: binlog.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
......
-- gtid已追齐
GreatSQL> SELECT @@gtid_executed\G
*************************** 1. row ***************************
@@gtid_executed: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-119519,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1
1 row in set (0.00 sec)
-- 数据已同步
GreatSQL> SELECT * FROM test1 WHERE id=2\G
*************************** 1. row ***************************
id: 2
k: 50248
c: from master1111
pad: 28733802923-10548894641-11867531929-71265603657-36546888392
1 row in set (0.00 sec)
总结
- 在生产数据库上要保持slave_skip_errors的默认值OFF,避免主从数据不一致的风险。
- 对于常见的错误,也需要仔细观察,从蛛丝马迹中找出问题的根本原因。