Redis Source Code Analysis: Master-Replica Replication

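For the principles behind Redis master-replica replication, see my companion article on replication principles ("Redis原理之主从复制"). The two articles are best read together.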

1. The redisServer structure

struct redisServer {
    list *slaves, *monitors;    /* List of slaves and MONITORs */
    
    /* Replication (master) */
    char replid[CONFIG_RUN_ID_SIZE+1];  /* My current replication ID. */
    char replid2[CONFIG_RUN_ID_SIZE+1]; /* replid inherited from master*/
    long long master_repl_offset;   /* My current replication offset */
    long long second_replid_offset; /* Accept offsets up to this for replid2. */
  
    int repl_ping_slave_period;     /* Master pings the slave every N seconds */
    char *repl_backlog;             /* Replication backlog for partial syncs */
    long long repl_backlog_size;    /* Backlog circular buffer size */
    long long repl_backlog_histlen; /* Backlog actual data length */
    long long repl_backlog_idx;     /* Backlog circular buffer current offset,
                                       that is the next byte will'll write to.*/
    long long repl_backlog_off;     /* Replication "master offset" of first
                                       byte in the replication backlog buffer.*/

    int repl_min_slaves_to_write;   /* Min number of slaves to write. */
    int repl_min_slaves_max_lag;    /* Max lag of <count> slaves to write. */
    int repl_good_slaves_count;     /* Number of slaves with lag <= max_lag. */

    /* Replication (slave) */
    char *masterauth;               /* AUTH with this password with master */
    char *masterhost;               /* Hostname of master */
    int masterport;                 /* Port of master */
    client *master;     /* Client that is master for this slave */
     
    int repl_serve_stale_data; /* Serve stale data when link is down? */
    int repl_slave_ro;          /* Slave is read only? */
    
}

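1) replid: the server's replication ID (called the run ID before Redis 4.0), a random string of CONFIG_RUN_ID_SIZE = 40 characters. On a master, replid identifies the server's own replication history; on a replica, it holds the replication ID of the master it replicates from.

2) repl_ping_slave_period: the interval at which the master sends heartbeat PINGs to all of its replicas. Because master and replicas exchange data over long-lived TCP connections, a periodic heartbeat is needed to detect dead links. It is set with repl-ping-slave-period (or repl-ping-replica-period) and defaults to 10. serverCron calls replicationCron once per second, so a value of 10 means every 10 seconds.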

  //replication.c#replicationCron
  ...
  /* First, send PING according to ping_slave_period. */
    if ((replication_cron_loops % server.repl_ping_slave_period) == 0 &&
        listLength(server.slaves))
    {
        /* Note that we don't send the PING if the clients are paused during
         * a Redis Cluster manual failover: the PING we send will otherwise
         * alter the replication offsets of master and slave, and will no longer
         * match the one stored into 'mf_master_offset' state. */
        int manual_failover_in_progress =
            server.cluster_enabled &&
            server.cluster->mf_end &&
            clientsArePaused();

        if (!manual_failover_in_progress) {
            ping_argv[0] = createStringObject("PING",4);
            replicationFeedSlaves(server.slaves, server.slaveseldb,
                ping_argv, 1);
            decrRefCount(ping_argv[0]);
        }
    }
...

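3) repl_backlog: the replication backlog, a circular buffer that caches the write commands the master has executed and propagated, so that a reconnecting replica can be served by a partial resync. Its size is held in repl_backlog_size, is set with repl-backlog-size, and defaults to 1 MB.

4) repl_backlog_off: the replication offset of the first byte stored in the backlog.

5) repl_backlog_histlen: the number of bytes of command data currently stored in the backlog.

6) repl_backlog_idx: the index just past the last byte written to the backlog, that is, the position where the next write starts.

7) slaves: a linked list of all connected replicas; the value of each node is a client.

8) repl_good_slaves_count: the number of replicas currently considered healthy. The master records in repl_ack_time the time of each replica's last successful heartbeat. A replica whose lag exceeds repl_min_slaves_max_lag (10 seconds by default) is considered unhealthy. The threshold is set with min-slaves-max-lag or min-replicas-max-lag. The check is performed in refreshGoodSlavesCount.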

//replication.c#refreshGoodSlavesCount
void refreshGoodSlavesCount(void) {
    listIter li;
    listNode *ln;
    int good = 0;
    //Nothing to check if the min-slaves options are not configured.
    if (!server.repl_min_slaves_to_write ||
        !server.repl_min_slaves_max_lag) return;

    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        client *slave = ln->value;
        time_t lag = server.unixtime - slave->repl_ack_time;

        if (slave->replstate == SLAVE_STATE_ONLINE &&
            lag <= server.repl_min_slaves_max_lag) good++;
    }
    server.repl_good_slaves_count = good;
}

The fields repl_min_slaves_to_write, repl_min_slaves_max_lag and repl_good_slaves_count are also checked before a command is executed: if fewer than repl_min_slaves_to_write replicas are healthy, write commands are rejected.

 //server.c#processCommand
 /* Don't accept write commands if there are not enough good slaves and
     * user configured the min-slaves-to-write option. */
    if (server.masterhost == NULL &&
        server.repl_min_slaves_to_write &&
        server.repl_min_slaves_max_lag &&
        c->cmd->flags & CMD_WRITE &&
        server.repl_good_slaves_count < server.repl_min_slaves_to_write)
    {
        flagTransaction(c);
        addReply(c, shared.noreplicaserr);
        return C_OK;
    }

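9) masterauth: when the master is configured with "requirepass <password>", a replica must authenticate before it can synchronize from the master. The replica is therefore configured with "masterauth <password>", the password it sends when requesting synchronization.

10) masterhost: the master's hostname or IP address; masterport: the master's port.

11) master: once the connection is established, the replica becomes a client of the master, and likewise the master becomes a client of the replica. On the replica, master therefore stores the master as an object of type client.

12) repl_serve_stale_data: whether the replica keeps serving commands while the link to the master is down. It is set with slave-serve-stale-data or replica-serve-stale-data and defaults to 1, meaning the replica continues to serve requests after the link breaks. The check is made in processCommand, before the command is executed.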

 //server.c#processCommand   
/* Only allow commands with flag "t", such as INFO, SLAVEOF and so on,
     * when slave-serve-stale-data is no and we are a slave with a broken
     * link with master. */
    if (server.masterhost && server.repl_state != REPL_STATE_CONNECTED &&
        server.repl_serve_stale_data == 0 &&
        !(c->cmd->flags & CMD_STALE))
    {
        flagTransaction(c);
        addReply(c, shared.masterdownerr);
        return C_OK;
    }

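13) repl_slave_ro: whether the replica is read-only (rejects write commands). It is set with slave-read-only or replica-read-only and defaults to 1, meaning the replica rejects write commands unless they come from its master. The check is made in processCommand, before the command is executed.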

//server.c#processCommand   
/* Don't accept write commands if this is a read only slave. But
     * accept write commands if this is our master. */
    if (server.masterhost && server.repl_slave_ro &&
        !(c->flags & CLIENT_MASTER) &&
        c->cmd->flags & CMD_WRITE)
    {
        addReply(c, shared.roslaveerr);
        return C_OK;
    }
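The four backlog fields described in items 3) to 6) are easier to follow with a toy model. The sketch below is not Redis code: toy_backlog and its functions are hypothetical names, the buffer is only 16 bytes, and it assumes, as the Redis backlog does, that the first replicated byte has offset 1.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the replication backlog. Field names mirror redisServer,
 * but the struct and functions are illustrative only. */
typedef struct {
    char buf[16];             /* repl_backlog (tiny, for demonstration) */
    long long size;           /* repl_backlog_size */
    long long histlen;        /* repl_backlog_histlen */
    long long idx;            /* repl_backlog_idx: next write position */
    long long off;            /* repl_backlog_off: offset of first byte */
    long long master_offset;  /* master_repl_offset */
} toy_backlog;

void toy_backlog_init(toy_backlog *b) {
    memset(b, 0, sizeof(*b));
    b->size = (long long)sizeof(b->buf);
    b->off = 1; /* assumption: the first replicated byte has offset 1 */
}

/* Append len bytes, wrapping around at the end of the buffer. */
void toy_backlog_feed(toy_backlog *b, const char *p, size_t len) {
    b->master_offset += (long long)len;
    while (len > 0) {
        assert(b->idx < b->size);
        size_t room = (size_t)(b->size - b->idx);
        size_t n = len < room ? len : room;
        memcpy(b->buf + b->idx, p, n);
        b->idx += (long long)n;
        if (b->idx == b->size) b->idx = 0;
        b->histlen += (long long)n;
        p += n;
        len -= n;
    }
    /* Old data has been overwritten once more than 'size' bytes were fed. */
    if (b->histlen > b->size) b->histlen = b->size;
    b->off = b->master_offset - b->histlen + 1;
}

/* A replica asking to resume at 'offset' can be served from the backlog
 * only if that offset still falls inside the retained window. */
int toy_backlog_covers(const toy_backlog *b, long long offset) {
    return offset >= b->off && offset <= b->off + b->histlen;
}
```

After feeding 20 bytes into the 16-byte buffer, only the last 16 bytes remain, so the window starts at offset 5 and a request for offset 4 can no longer be answered from the backlog.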

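2. Replica-side analysis

A user enables replication by running the slaveof command. When a Redis server receives slaveof, it must actively connect to the master and request synchronization. The handler for slaveof is replicaofCommand; its core is shown below.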

//replication.c#replicaofCommand   
...
//strcasecmp returns 0 when the two strings match (ignoring case)
if (!strcasecmp(c->argv[1]->ptr,"no") &&
        !strcasecmp(c->argv[2]->ptr,"one")) {
        if (server.masterhost) {
            replicationUnsetMaster();
            sds client = catClientInfoString(sdsempty(),c);
            serverLog(LL_NOTICE,"MASTER MODE enabled (user request from '%s')",
                client);
            sdsfree(client);
        }
    }
...

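As the code shows, "slaveof no one" turns replication off: the link to the master is closed and the replica becomes an ordinary Redis instance. Otherwise replicaofCommand only records the master's IP address and port (in the else branch, not shown above). The connection to the master is established asynchronously, in replicationCron.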

//replication.c#replicationCron
/* Check if we should connect to a MASTER */
    if (server.repl_state == REPL_STATE_CONNECT) {
        serverLog(LL_NOTICE,"Connecting to MASTER %s:%d",
            server.masterhost, server.masterport);
        if (connectWithMaster() == C_OK) {
            serverLog(LL_NOTICE,"MASTER <-> REPLICA sync started");
        }
    }

Besides initiating the connection, connectWithMaster registers a file event whose handler is syncWithMaster and updates repl_state. The code is as follows:

//replication.c#connectWithMaster
int connectWithMaster(void) {
    int fd;

    fd = anetTcpNonBlockBestEffortBindConnect(NULL,
        server.masterhost,server.masterport,NET_FIRST_BIND_ADDR);
    if (fd == -1) {
        serverLog(LL_WARNING,"Unable to connect to MASTER: %s",
            strerror(errno));
        return C_ERR;
    }

    if (aeCreateFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE,syncWithMaster,NULL) ==
            AE_ERR)
    {
        close(fd);
        serverLog(LL_WARNING,"Can't create readable event for SYNC");
        return C_ERR;
    }

    server.repl_transfer_lastio = server.unixtime;
    server.repl_transfer_s = fd;
    server.repl_state = REPL_STATE_CONNECTING;
    return C_OK;
}

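replicationCron also detects timeouts on the master link, periodically sends heartbeats, periodically reports the replication offset, and so on. The timeout check is shown below: repl_transfer_lastio stores the time of the last I/O with the master, and repl_timeout is the timeout, set with repl-timeout and defaulting to 60 seconds. On timeout the replica closes the connection itself.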

//replication.c#replicationCron
 /* Non blocking connection timeout? */
    if (server.masterhost &&
        (server.repl_state == REPL_STATE_CONNECTING ||
         slaveIsInHandshakeState()) &&
         (time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
    {
        serverLog(LL_WARNING,"Timeout connecting to the MASTER...");
        cancelReplicationHandshake();
    }

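The replica also periodically reports its own replication offset to the master with "REPLCONF ACK <offset>". The master stores the time it received the command in repl_ack_time and uses it to decide whether the replica is healthy. The implementation: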

//replication.c#replicationCron
if (server.masterhost && server.master &&
        !(server.master->flags & CLIENT_PRE_PSYNC))
        replicationSendAck();
//replication.c#replicationSendAck
void replicationSendAck(void) {
    client *c = server.master;

    if (c != NULL) {
        c->flags |= CLIENT_MASTER_FORCE_REPLY;
        addReplyMultiBulkLen(c,3);
        addReplyBulkCString(c,"REPLCONF");
        addReplyBulkCString(c,"ACK");
        addReplyBulkLongLong(c,c->reploff);
        c->flags &= ~CLIENT_MASTER_FORCE_REPLY;
    }
}
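To make the wire format concrete, the following sketch builds the bytes that the three addReply calls above produce: a RESP array of three bulk strings. build_replconf_ack is a hypothetical helper written for this article, not a Redis function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode "REPLCONF ACK <offset>" as a RESP array of three bulk strings.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the buffer is too small. Hypothetical helper, not Redis code. */
int build_replconf_ack(char *buf, size_t buflen, long long offset) {
    char num[32];
    int numlen, n;

    assert(buf != NULL);
    numlen = snprintf(num, sizeof(num), "%lld", offset);
    n = snprintf(buf, buflen,
                 "*3\r\n$8\r\nREPLCONF\r\n$3\r\nACK\r\n$%d\r\n%s\r\n",
                 numlen, num);
    if (n < 0 || (size_t)n >= buflen) return -1;
    return n;
}
```

For an offset of 1234, the buffer holds `*3\r\n$8\r\nREPLCONF\r\n$3\r\nACK\r\n$4\r\n1234\r\n`, which is 37 bytes.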

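2.1 Replica connection process

After receiving slaveof, the replica connects to the master and requests synchronization. Internally this takes several steps:

1) Connect the socket.

2) Send a PING to confirm that the connection works.

3) Authenticate with a password (if required).

4) Exchange information.

5) Send the PSYNC command.

6) Receive the RDB file and load it.

7) The link is established; wait for the master to propagate commands.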

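2.2 repl_state values

While connecting to the master, repl_state tracks the progress of the replication handshake. The macros and their meanings are as follows.

Macro definition | Description
#define REPL_STATE_NONE 0 | Replication is not enabled; the server is an ordinary Redis instance.
#define REPL_STATE_CONNECT 1 | The socket connection to the master has yet to be initiated.
#define REPL_STATE_CONNECTING 2 | The non-blocking socket connection to the master has been initiated.
#define REPL_STATE_RECEIVE_PONG 3 | PING has been sent; waiting for the master's PONG reply.
#define REPL_STATE_SEND_AUTH 4 | About to send the authentication request.
#define REPL_STATE_RECEIVE_AUTH 5 | Authentication request sent; waiting for the master's reply.
#define REPL_STATE_SEND_PORT 6 | About to send the listening port.
#define REPL_STATE_RECEIVE_PORT 7 | Listening port sent; waiting for the master's reply.
#define REPL_STATE_SEND_IP 8 | About to send the IP address.
#define REPL_STATE_RECEIVE_IP 9 | IP address sent; waiting for the master's reply. The master records this IP and port and reports them through ROLE and to Sentinel.
#define REPL_STATE_SEND_CAPA 10 | Replication has been extended over time and different Redis versions support different features, so the replica is about to tell the master which replication capabilities it supports.
#define REPL_STATE_RECEIVE_CAPA 11 | Capabilities sent; waiting for the master's reply.
#define REPL_STATE_SEND_PSYNC 12 | About to send the PSYNC command.
#define REPL_STATE_RECEIVE_PSYNC 13 | PSYNC sent; waiting for the master's reply.
#define REPL_STATE_TRANSFER 14 | Receiving the RDB file.
#define REPL_STATE_CONNECTED 15 | The RDB file has been received and loaded and replication is established. The replica only needs to wait for data propagated by the master.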
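When reading logs or debugging, it helps to turn these numeric values back into names. The sketch below is a hypothetical helper pair (the toy_ names are not part of Redis). The handshake range check mirrors the slaveIsInHandshakeState call seen earlier in replicationCron, on the assumption that it covers RECEIVE_PONG through RECEIVE_PSYNC.

```c
#include <assert.h>
#include <stddef.h>

/* Names indexed by the numeric value of each REPL_STATE_* macro. */
static const char *const toy_repl_state_names[] = {
    "NONE", "CONNECT", "CONNECTING", "RECEIVE_PONG",
    "SEND_AUTH", "RECEIVE_AUTH", "SEND_PORT", "RECEIVE_PORT",
    "SEND_IP", "RECEIVE_IP", "SEND_CAPA", "RECEIVE_CAPA",
    "SEND_PSYNC", "RECEIVE_PSYNC", "TRANSFER", "CONNECTED"
};

/* Map a repl_state value to a printable name (hypothetical helper). */
const char *toy_repl_state_name(int state) {
    size_t n = sizeof(toy_repl_state_names) / sizeof(toy_repl_state_names[0]);
    assert(n == 16);
    if (state < 0 || (size_t)state >= n) return "UNKNOWN";
    return toy_repl_state_names[state];
}

/* Assumption: the handshake spans RECEIVE_PONG (3) to RECEIVE_PSYNC (13). */
int toy_in_handshake(int state) {
    return state >= 3 && state <= 13;
}
```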

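2.3 Replica and master interaction

The replica's state transitions happen during this exchange. They are implemented in syncWithMaster, the handler registered for the file event.

1) REPL_STATE_CONNECTING -> REPL_STATE_RECEIVE_PONG. When the state is CONNECTING, the replica sends a PING, sets the state to RECEIVE_PONG, and the function returns.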

//replication.c#syncWithMaster
/* Send a PING to check the master is able to reply without errors. */
    if (server.repl_state == REPL_STATE_CONNECTING) {
        serverLog(LL_NOTICE,"Non blocking connect for SYNC fired the event.");
        /* Delete the writable event so that the readable event remains
         * registered and we can wait for the PONG reply. */
        aeDeleteFileEvent(server.el,fd,AE_WRITABLE);
        server.repl_state = REPL_STATE_RECEIVE_PONG;
        /* Send the PING, don't check for errors at all, we have the timeout
         * that will take care about this. */
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"PING",NULL);
        if (err) goto write_error;
        return;
    }

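2) REPL_STATE_RECEIVE_PONG -> REPL_STATE_SEND_AUTH -> REPL_STATE_RECEIVE_AUTH. When the state is RECEIVE_PONG, the replica reads the master's PONG reply from the socket and sets the state to SEND_AUTH. The function does not return here; execution falls through. If masterauth is configured, the replica sends an authentication request and sets the state to RECEIVE_AUTH; otherwise it moves straight to SEND_PORT.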

//replication.c#syncWithMaster
/* Receive the PONG command. */
    if (server.repl_state == REPL_STATE_RECEIVE_PONG) {
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
		...
        server.repl_state = REPL_STATE_SEND_AUTH;
    }
//replication.c#syncWithMaster
/* AUTH with the master if required. */
    if (server.repl_state == REPL_STATE_SEND_AUTH) {
        if (server.masterauth) {
            err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"AUTH",server.masterauth,NULL);
            if (err) goto write_error;
            server.repl_state = REPL_STATE_RECEIVE_AUTH;
            return;
        } else {
            server.repl_state = REPL_STATE_SEND_PORT;
        }
    }

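3) REPL_STATE_RECEIVE_AUTH -> REPL_STATE_SEND_PORT. When the state is RECEIVE_AUTH, the replica reads the master's reply from the socket and sets the state to SEND_PORT. The function does not return here; execution falls through.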

//replication.c#syncWithMaster
if (server.repl_state == REPL_STATE_RECEIVE_AUTH) {
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        ...
        server.repl_state = REPL_STATE_SEND_PORT;
    }

4) REPL_STATE_SEND_PORT -> REPL_STATE_RECEIVE_PORT. When the state is SEND_PORT, the replica sends its port to the master, sets the state to RECEIVE_PORT, and the function returns.

//replication.c#syncWithMaster
if (server.repl_state == REPL_STATE_SEND_PORT) {
        sds port = sdsfromlonglong(server.slave_announce_port ?
            server.slave_announce_port : server.port);
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"REPLCONF",
                "listening-port",port, NULL);
        ...
        server.repl_state = REPL_STATE_RECEIVE_PORT;
        return;
    }

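5) REPL_STATE_RECEIVE_PORT -> REPL_STATE_SEND_IP -> REPL_STATE_RECEIVE_IP. When the state is RECEIVE_PORT, the replica reads the master's reply from the socket, sets the state to SEND_IP, and falls through. It then sends its IP address to the master and sets the state to RECEIVE_IP.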

//replication.c#syncWithMaster 
if (server.repl_state == REPL_STATE_RECEIVE_PORT) {
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        ...
        server.repl_state = REPL_STATE_SEND_IP;
    }
//replication.c#syncWithMaster
if (server.repl_state == REPL_STATE_SEND_IP) {
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"REPLCONF",
                "ip-address",server.slave_announce_ip, NULL);
        ...
        server.repl_state = REPL_STATE_RECEIVE_IP;
        return;
    }

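6) REPL_STATE_RECEIVE_IP -> REPL_STATE_SEND_CAPA -> REPL_STATE_RECEIVE_CAPA. When the state is RECEIVE_IP, the replica reads the master's reply from the socket, sets the state to SEND_CAPA, and falls through. It then sends "REPLCONF capa eof capa psync2" to declare the replication capabilities it supports. When the master receives psync and a full resync is required, it normally persists the dataset to an RDB file and, once that completes, sends the file to the replica. When the replica supports "eof", the master can instead stream the dataset in RDB format directly over the socket, avoiding unnecessary reads and writes on the local disk. Redis 4.0 introduced the psync2 protocol for replication; "psync2" declares that the replica supports it. Finally, the replica sets the state to RECEIVE_CAPA and the function returns.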

//replication.c#syncWithMaster
if (server.repl_state == REPL_STATE_RECEIVE_IP) {
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        ...
        server.repl_state = REPL_STATE_SEND_CAPA;
    }
//replication.c#syncWithMaster
if (server.repl_state == REPL_STATE_SEND_CAPA) {
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"REPLCONF",
                "capa","eof","capa","psync2",NULL);
        ...
        server.repl_state = REPL_STATE_RECEIVE_CAPA;
        return;
    }

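7) REPL_STATE_RECEIVE_CAPA -> REPL_STATE_SEND_PSYNC -> REPL_STATE_RECEIVE_PSYNC. When the state is RECEIVE_CAPA, the replica reads the master's reply from the socket, sets the state to SEND_PSYNC, and falls through. It then calls slaveTryPartialResynchronization to attempt a partial resync and sets the state to RECEIVE_PSYNC.

slaveTryPartialResynchronization performs two operations: 1) it tries to obtain the master's replication ID and the replication offset, and sends the psync command to the master; 2) it reads and parses the reply to psync and determines whether a full or a partial resync will be performed. Its second argument selects which of the two operations to run.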

//replication.c#syncWithMaster
if (server.repl_state == REPL_STATE_RECEIVE_CAPA) {
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        ...
        server.repl_state = REPL_STATE_SEND_PSYNC;
    }
//replication.c#syncWithMaster 
if (server.repl_state == REPL_STATE_SEND_PSYNC) {
        if (slaveTryPartialResynchronization(fd,0) == PSYNC_WRITE_ERROR) {
            err = sdsnew("Write error sending the PSYNC command.");
            goto write_error;
        }
        server.repl_state = REPL_STATE_RECEIVE_PSYNC;
        return;
    }

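8) REPL_STATE_RECEIVE_PSYNC -> REPL_STATE_TRANSFER. slaveTryPartialResynchronization is called again to read and parse the psync reply. If it returns PSYNC_CONTINUE, a partial resync can be performed; the function has already set the state to REPL_STATE_CONNECTED internally, so syncWithMaster simply returns. Otherwise a full resync is required and the replica prepares to receive the RDB file from the master: a file event is registered with readSyncBulkPayload as the handler, and the state is set to TRANSFER.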

//replication.c#syncWithMaster
psync_result = slaveTryPartialResynchronization(fd,1);
...
 if (psync_result == PSYNC_CONTINUE) {
        serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.");
        return;
 }
...
if (aeCreateFileEvent(server.el,fd, AE_READABLE,readSyncBulkPayload,NULL)
            == AE_ERR){...}
 server.repl_state = REPL_STATE_TRANSFER;

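9) readSyncBulkPayload receives and loads the RDB file, and sets the state to REPL_STATE_CONNECTED once loading completes. At that point the replica has successfully established the link with the master and only needs to receive and execute the commands the master propagates.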
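The transitions in steps 1) to 9) can be condensed into a single function. The sketch below is a simplified model, not the Redis implementation: toy_replica_cfg, toy_next_state and toy_steps_to_connected are hypothetical names, every I/O step is assumed to succeed, only the full resync path is modeled, and the IP exchange is skipped when no announce IP is configured.

```c
#include <assert.h>
#include <stddef.h>

enum {
    REPL_STATE_NONE = 0, REPL_STATE_CONNECT, REPL_STATE_CONNECTING,
    REPL_STATE_RECEIVE_PONG, REPL_STATE_SEND_AUTH, REPL_STATE_RECEIVE_AUTH,
    REPL_STATE_SEND_PORT, REPL_STATE_RECEIVE_PORT, REPL_STATE_SEND_IP,
    REPL_STATE_RECEIVE_IP, REPL_STATE_SEND_CAPA, REPL_STATE_RECEIVE_CAPA,
    REPL_STATE_SEND_PSYNC, REPL_STATE_RECEIVE_PSYNC, REPL_STATE_TRANSFER,
    REPL_STATE_CONNECTED
};

/* Hypothetical stand-in for the two settings that change the path. */
typedef struct {
    int has_masterauth;   /* masterauth is configured */
    int has_announce_ip;  /* slave-announce-ip / replica-announce-ip is set */
} toy_replica_cfg;

/* One step of the handshake, assuming success and a full resync. */
int toy_next_state(int state, const toy_replica_cfg *cfg) {
    assert(cfg != NULL);
    switch (state) {
    case REPL_STATE_SEND_AUTH:
        return cfg->has_masterauth ? REPL_STATE_RECEIVE_AUTH
                                   : REPL_STATE_SEND_PORT;
    case REPL_STATE_RECEIVE_AUTH:
        return REPL_STATE_SEND_PORT;
    case REPL_STATE_SEND_IP:
        return cfg->has_announce_ip ? REPL_STATE_RECEIVE_IP
                                    : REPL_STATE_SEND_CAPA;
    case REPL_STATE_NONE:
    case REPL_STATE_CONNECTED:
        return state;     /* terminal states in this model */
    default:
        return state + 1; /* every other state advances to the next macro */
    }
}

/* Number of transitions from CONNECT until CONNECTED is reached. */
int toy_steps_to_connected(const toy_replica_cfg *cfg) {
    int state = REPL_STATE_CONNECT, steps = 0;
    while (state != REPL_STATE_CONNECTED) {
        state = toy_next_state(state, cfg);
        steps++;
    }
    return steps;
}
```

With neither option configured the model needs 12 transitions; with both it needs 14, because RECEIVE_AUTH and RECEIVE_IP are visited as well.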

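3. Master-side analysis

As described above, after receiving slaveof the replica connects to the master and requests synchronization. The main flow is:

1) Connect the socket.

2) Send a PING to confirm that the connection works.

3) Authenticate with a password (if required).

4) Exchange information with the REPLCONF command.

5) Send the PSYNC command.

6) Receive the RDB file and load it.

7) The link is established; wait for the master to propagate commands.

This section covers how the master handles steps 4 to 7.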

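3.1 REPLCONF

The master handles REPLCONF in replconfCommand, shown below. It parses the request arguments and stores them in the client object. It mainly records:

  • The IP address and port the replica listens on.
  • The client's capability flags. eof means the master may stream the dataset in RDB format directly over the socket (the master must also have repl-diskless-sync enabled), avoiding disk reads and writes. psync2 means the replica supports the psync2 protocol and understands the master's "+CONTINUE <new_repl_id>" reply.
  • The replica's replication offset and the time of the last interaction.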
//replication.c#replconfCommand
void replconfCommand(client *c) {
    int j;
	...
    /* Process every option-value pair. */
    for (j = 1; j < c->argc; j+=2) {
        if (!strcasecmp(c->argv[j]->ptr,"listening-port")) {
            long port;

            if ((getLongFromObjectOrReply(c,c->argv[j+1],
                    &port,NULL) != C_OK))
                return;
            c->slave_listening_port = port;
        } else if (!strcasecmp(c->argv[j]->ptr,"ip-address")) {
            sds ip = c->argv[j+1]->ptr;
            if (sdslen(ip) < sizeof(c->slave_ip)) {
                memcpy(c->slave_ip,ip,sdslen(ip)+1);
            } else {
                addReplyErrorFormat(c,"REPLCONF ip-address provided by "
                    "replica instance is too long: %zd bytes", sdslen(ip));
                return;
            }
        } else if (!strcasecmp(c->argv[j]->ptr,"capa")) {
            /* Ignore capabilities not understood by this master. */
            if (!strcasecmp(c->argv[j+1]->ptr,"eof"))
                c->slave_capa |= SLAVE_CAPA_EOF;
            else if (!strcasecmp(c->argv[j+1]->ptr,"psync2"))
                c->slave_capa |= SLAVE_CAPA_PSYNC2;
        } else if (!strcasecmp(c->argv[j]->ptr,"ack")) {
            /* REPLCONF ACK is used by slave to inform the master the amount
             * of replication stream that it processed so far. It is an
             * internal only command that normal clients should never use. */
            long long offset;

            if (!(c->flags & CLIENT_SLAVE)) return;
            if ((getLongLongFromObject(c->argv[j+1], &offset) != C_OK))
                return;
            if (offset > c->repl_ack_off)
                c->repl_ack_off = offset;
            c->repl_ack_time = server.unixtime;
            /* If this was a diskless replication, we need to really put
             * the slave online when the first ACK is received (which
             * confirms slave is online and ready to get more data). This
             * allows for simpler and less CPU intensive EOF detection
             * when streaming RDB files. */
            if (c->repl_put_online_on_ack && c->replstate == SLAVE_STATE_ONLINE)
                putSlaveOnline(c);
            /* Note: this command does not reply anything! */
            return;
        } else if (!strcasecmp(c->argv[j]->ptr,"getack")) {
            /* REPLCONF GETACK is used in order to request an ACK ASAP
             * to the slave. */
            if (server.masterhost && server.master) replicationSendAck();
            return;
        } else {
            addReplyErrorFormat(c,"Unrecognized REPLCONF option: %s",
                (char*)c->argv[j]->ptr);
            return;
        }
    }
    addReply(c,shared.ok);
}

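3.2 psync

The master handles the psync command in syncCommand, which calls masterTryPartialResynchronization to decide between a full and a partial resync. A partial resync is possible only if:

1) The replication ID and the replication offset sent by the replica are valid.

2) The replication offset is still contained in the replication backlog.

For a partial resync, the master replies "+CONTINUE" or "+CONTINUE <replid>", depending on the replica's capabilities. It then sends the replica the part of the backlog that starts at the offset given in the PSYNC request. Finally it refreshes the number of healthy replicas.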

//replication.c#masterTryPartialResynchronization
//Check that the replication ID matches and the offset is valid.
if (strcasecmp(master_replid, server.replid) &&
    (strcasecmp(master_replid, server.replid2) ||
     psync_offset > server.second_replid_offset))
{
    ...
    goto need_full_resync;
}
//Check that the requested offset is still inside the replication backlog.
if (!server.repl_backlog ||
    psync_offset < server.repl_backlog_off ||
    psync_offset > (server.repl_backlog_off + server.repl_backlog_histlen))
{
    ...
    goto need_full_resync;
}
//From here on a partial resync is performed; mark the client as a replica.
c->flags |= CLIENT_SLAVE;
c->replstate = SLAVE_STATE_ONLINE;
c->repl_ack_time = server.unixtime;
c->repl_put_online_on_ack = 0;
//Add the client to the slaves list.
listAddNodeTail(server.slaves,c);

/* We can't use the connection buffers since they are used to accumulate
 * new commands at this stage. But we are sure the socket send buffer is
 * empty so this write will never fail actually. */
//Reply +CONTINUE in the form the replica understands.
if (c->slave_capa & SLAVE_CAPA_PSYNC2) {
    buflen = snprintf(buf,sizeof(buf),"+CONTINUE %s\r\n", server.replid);
} else {
    buflen = snprintf(buf,sizeof(buf),"+CONTINUE\r\n");
}
if (write(c->fd,buf,buflen) != buflen) {
    freeClientAsync(c);
    return C_OK;
}
//Send the backlog contents starting at psync_offset to the replica.
psync_len = addReplyReplicationBacklog(c,psync_offset);
//Refresh the number of healthy replicas.
refreshGoodSlavesCount();
return C_OK;
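The two conditions can be isolated into a pure predicate, which makes the double negation in the first if easier to read. The sketch below is an illustration under assumptions: toy_master and toy_can_partial_resync are hypothetical names, and the struct carries only the fields the excerpt consults.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h> /* strcasecmp (POSIX) */

/* Hypothetical snapshot of the master fields used by the check. */
typedef struct {
    const char *replid;             /* server.replid */
    const char *replid2;            /* server.replid2 */
    long long second_replid_offset; /* server.second_replid_offset */
    int has_backlog;                /* server.repl_backlog != NULL */
    long long backlog_off;          /* server.repl_backlog_off */
    long long backlog_histlen;      /* server.repl_backlog_histlen */
} toy_master;

/* Returns 1 if a partial resync is possible, 0 if a full resync is needed. */
int toy_can_partial_resync(const toy_master *m, const char *req_replid,
                           long long psync_offset) {
    int same_history;

    assert(m != NULL && req_replid != NULL);
    /* Condition 1: the ID is our current one, or our previous one with an
     * offset that does not go past the point where the histories diverged. */
    same_history =
        strcasecmp(req_replid, m->replid) == 0 ||
        (strcasecmp(req_replid, m->replid2) == 0 &&
         psync_offset <= m->second_replid_offset);
    if (!same_history) return 0;

    /* Condition 2: the requested offset is still held in the backlog. */
    if (!m->has_backlog ||
        psync_offset < m->backlog_off ||
        psync_offset > m->backlog_off + m->backlog_histlen) return 0;
    return 1;
}
```

A first-time replica sends "?" as the replication ID, which matches neither replid nor replid2, so the predicate returns 0 and a full resync follows.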

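When the master decides a full resync is needed, it forks a child process to produce an RDB snapshot and sends the snapshot to the replica. There are two ways to do this:

1) Stream it directly to the replica over the socket.

2) Write it to a local file and, once that completes, send the file to the replica.

The choice depends on repl_diskless_sync, which is set with repl-diskless-sync. It defaults to 0, so by default the master first persists the data to a local file and then sends the file to the replica.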

//replication.c#startBgsaveForReplication
int startBgsaveForReplication(int mincapa) {
    int retval;
    int socket_target = server.repl_diskless_sync && (mincapa & SLAVE_CAPA_EOF);
    ...
    if (socket_target)
          retval = rdbSaveToSlavesSockets(rsiptr);
       else
          retval = rdbSaveBackground(server.rdb_filename,rsiptr);
    ...
}
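The socket_target expression combines a configuration flag with a capability bitmask. The sketch below isolates that decision. It assumes mincapa is the bitwise AND of the capabilities of all replicas that will share one transfer, and that the two capability bits have the values shown. toy_min_capa and toy_use_socket_target are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bit values for the capability flags. */
#define SLAVE_CAPA_EOF    (1 << 0)
#define SLAVE_CAPA_PSYNC2 (1 << 1)

/* Capabilities common to every waiting replica (bitwise AND). */
int toy_min_capa(const int *capas, int n) {
    int mincapa = -1, i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        mincapa = (mincapa == -1) ? capas[i] : (mincapa & capas[i]);
    return mincapa == -1 ? 0 : mincapa;
}

/* Diskless transfer needs both the config flag and EOF support everywhere. */
int toy_use_socket_target(int repl_diskless_sync, int mincapa) {
    return repl_diskless_sync && (mincapa & SLAVE_CAPA_EOF);
}
```

A single replica without the eof capability is enough to force the disk-based path for the whole group, even when repl-diskless-sync is enabled.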

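3.3 Command propagation

Every time the master executes a write command, it records the command in the replication backlog and broadcasts it to all replicas. This is implemented in replicationFeedSlaves, which is called from server.c#propagate.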

//server.c#propagate
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    if (server.aof_state != AOF_OFF && flags & PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    if (flags & PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}

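replicationFeedSlaves first checks whether the database the command targets is the one last selected on the replication stream; if not, a SELECT command is propagated first. Each write command is then appended to the replication backlog and sent to all replicas.

//replication.c#replicationFeedSlaves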
void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
    listNode *ln;
    listIter li;
    int j, len;
    char llstr[LONG_STR_SIZE];
	...
    //If the database differs from the last one selected, propagate a SELECT first.
    /* Send SELECT command to every slave if needed. */
    if (server.slaveseldb != dictid) {
        robj *selectcmd;

        /* For a few DBs we have pre-computed SELECT command. */
        if (dictid >= 0 && dictid < PROTO_SHARED_SELECT_CMDS) {
            selectcmd = shared.select[dictid];
        } else {
            int dictid_len;

            dictid_len = ll2string(llstr,sizeof(llstr),dictid);
            selectcmd = createObject(OBJ_STRING,
                sdscatprintf(sdsempty(),
                "*2\r\n$6\r\nSELECT\r\n$%d\r\n%s\r\n",
                dictid_len, llstr));
        }

        /* Add the SELECT command into the backlog. */
        if (server.repl_backlog) feedReplicationBacklogWithObject(selectcmd);

        /* Send it to slaves. */
        listRewind(slaves,&li);
        while((ln = listNext(&li))) {
            client *slave = ln->value;
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) continue;
            addReply(slave,selectcmd);
        }

        if (dictid < 0 || dictid >= PROTO_SHARED_SELECT_CMDS)
            decrRefCount(selectcmd);
    }
    server.slaveseldb = dictid;
    //Append the current command to the replication backlog.
    /* Write the command to the replication backlog if any. */
    if (server.repl_backlog) {
        char aux[LONG_STR_SIZE+3];

        /* Add the multi bulk reply length. */
        aux[0] = '*';
        len = ll2string(aux+1,sizeof(aux)-1,argc);
        aux[len+1] = '\r';
        aux[len+2] = '\n';
        feedReplicationBacklog(aux,len+3);

        for (j = 0; j < argc; j++) {
            long objlen = stringObjectLen(argv[j]);

            /* We need to feed the buffer with the object as a bulk reply
             * not just as a plain string, so create the $..CRLF payload len
             * and add the final CRLF */
            aux[0] = '$';
            len = ll2string(aux+1,sizeof(aux)-1,objlen);
            aux[len+1] = '\r';
            aux[len+2] = '\n';
            feedReplicationBacklog(aux,len+3);
            feedReplicationBacklogWithObject(argv[j]);
            feedReplicationBacklog(aux+len+1,2);
        }
    }
    //Propagate the command to all replicas.
    /* Write the command to every slave. */
    listRewind(slaves,&li);
    while((ln = listNext(&li))) {
        client *slave = ln->value;

        /* Don't feed slaves that are still waiting for BGSAVE to start */
        if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) continue;

        /* Feed slaves that are waiting for the initial SYNC (so these commands
         * are queued in the output buffer until the initial SYNC completes),
         * or are already in sync with the master. */

        /* Add the multi bulk length. */
        addReplyMultiBulkLen(slave,argc);

        /* Finally any additional argument that was not stored inside the
         * static buffer if any (from j to argc). */
        for (j = 0; j < argc; j++)
            addReplyBulk(slave,argv[j]);
    }
}
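The interplay between slaveseldb and the emitted stream is the part of this function that is easiest to miss. The sketch below reproduces it with a single in-memory stream standing in for the backlog and replica buffers. toy_feed and its functions are hypothetical, and unlike Redis the sketch uses strlen, so it is not binary-safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical replication stream plus the last database selected on it. */
typedef struct {
    int slaveseldb;   /* -1 forces a SELECT before the next command */
    char stream[256];
    size_t len;
} toy_feed;

void toy_feed_init(toy_feed *f) {
    f->slaveseldb = -1;
    f->stream[0] = '\0';
    f->len = 0;
}

/* Append one command as a RESP array. Returns the new length or -1. */
int toy_append_cmd(char *out, size_t cap, size_t used,
                   int argc, const char **argv) {
    int j, n;

    assert(used < cap);
    n = snprintf(out + used, cap - used, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap - used) return -1;
    used += (size_t)n;
    for (j = 0; j < argc; j++) {
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[j]), argv[j]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}

/* Emit SELECT only when the target database changes, then the command. */
int toy_feed_replicas(toy_feed *f, int dictid, int argc, const char **argv) {
    int r;

    if (f->slaveseldb != dictid) {
        char db[16];
        const char *sel[2];
        snprintf(db, sizeof(db), "%d", dictid);
        sel[0] = "SELECT";
        sel[1] = db;
        r = toy_append_cmd(f->stream, sizeof(f->stream), f->len, 2, sel);
        if (r < 0) return -1;
        f->len = (size_t)r;
        f->slaveseldb = dictid;
    }
    r = toy_append_cmd(f->stream, sizeof(f->stream), f->len, argc, argv);
    if (r < 0) return -1;
    f->len = (size_t)r;
    return 0;
}
```

Feeding two SET commands for database 0 produces one SELECT followed by two SETs; a later command for database 1 triggers a second SELECT.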

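4. First synchronization walkthrough

This section walks through a first synchronization, using the case where the RDB is persisted to disk.

4.1 Replica

1) The replica runs slaveof <master_ip> <master_port>. It records the master's host and port and performs some cleanup.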

//replication.c#replicaofCommand
void replicaofCommand(client *c) {
   ...
   replicationSetMaster(c->argv[1]->ptr, port);
   ...
}
//replication.c#replicationSetMaster
void replicationSetMaster(char *ip, int port) {
    int was_master = server.masterhost == NULL;

    sdsfree(server.masterhost);
    server.masterhost = sdsnew(ip);
    server.masterport = port;
    if (server.master) {
        freeClient(server.master);
    }
    ......
    server.repl_state = REPL_STATE_CONNECT;
}

2) replicationCron connects to the master. Once the connection has been initiated, the resulting file descriptor is added to the event loop with syncWithMaster as the handler.

//replication.c#replicationCron
......
if (server.repl_state == REPL_STATE_CONNECT) {
	if (connectWithMaster() == C_OK) {
		serverLog(LL_NOTICE,"MASTER <-> REPLICA sync started");
	}
}
......
//replication.c#connectWithMaster
int connectWithMaster(void) {
    int fd;
    fd = anetTcpNonBlockBestEffortBindConnect(NULL,server.masterhost,server.masterport,NET_FIRST_BIND_ADDR);
   	......
    if (aeCreateFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE,syncWithMaster,NULL) ==
            AE_ERR){
        ......
        return C_ERR;
    }
    server.repl_transfer_lastio = server.unixtime;
    server.repl_transfer_s = fd;
    server.repl_state = REPL_STATE_CONNECTING;
    return C_OK;
}

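3) The replica sends a series of handshake commands, listed below. The master records the replica's listening-port and ip-address, which Sentinel and the ROLE command use. Both settings exist because of network setups such as NAT, as explained in the documentation of the replica-announce-ip and replica-announce-port options.

a) Send PING.

b) Send AUTH (if masterauth is configured).

c) Send REPLCONF listening-port <port>.

d) Send REPLCONF ip-address <announce_ip> (if slave-announce-ip or replica-announce-ip is set).

e) Send REPLCONF capa eof capa psync2 to declare the replica's replication capabilities. eof means the master may stream the dataset in RDB format directly over the socket (the master must also have repl-diskless-sync enabled), avoiding disk reads and writes. psync2 means the replica supports the psync2 protocol and understands the master's "+CONTINUE <new_repl_id>" reply.

4) Send the PSYNC command. On a first synchronization the replica sends PSYNC ? -1.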

//replication.c#slaveTryPartialResynchronization
...
psync_replid = "?";
memcpy(psync_offset,"-1",3);

reply = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"PSYNC",psync_replid,psync_offset,NULL);
...

5) The replica receives the master's FULLRESYNC reply and records the master's replication ID and offset.

//replication.c#slaveTryPartialResynchronization
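if (!strncmp(reply,"+FULLRESYNC",11)) {
	char *replid = NULL, *offset = NULL;
	/* Locate the replication ID and the offset inside the reply. */
	replid = strchr(reply,' ');
	if (replid) {
		replid++;
		offset = strchr(replid,' ');
		if (offset) offset++;
	}
	... /* validation of replid and offset omitted */
	memcpy(server.master_replid, replid, offset-replid-1);
	server.master_replid[CONFIG_RUN_ID_SIZE] = '\0';
	server.master_initial_offset = strtoll(offset,NULL,10);
	serverLog(LL_NOTICE,"Full resync from master: %s:%lld",
	                server.master_replid,
	                server.master_initial_offset);
	return PSYNC_FULLRESYNC;
}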
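As a self-contained illustration of this parsing step, the sketch below extracts the replication ID and offset from a "+FULLRESYNC <replid> <offset>" line and rejects replies whose ID is not exactly 40 characters long. toy_parse_fullresync and TOY_RUN_ID_SIZE are hypothetical names, and returning -1 on malformed input is a choice made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_RUN_ID_SIZE 40 /* stands in for CONFIG_RUN_ID_SIZE */

/* Parse "+FULLRESYNC <replid> <offset>". Returns 0 on success, -1 otherwise.
 * Hypothetical helper, not Redis code. */
int toy_parse_fullresync(const char *reply,
                         char replid[TOY_RUN_ID_SIZE + 1],
                         long long *offset) {
    const char *id, *off;

    assert(reply != NULL && replid != NULL && offset != NULL);
    if (strncmp(reply, "+FULLRESYNC", 11) != 0) return -1;
    id = strchr(reply, ' ');
    if (id == NULL) return -1;
    id++;
    off = strchr(id, ' ');
    if (off == NULL) return -1;
    off++;
    /* The ID sits between the two spaces and must be exactly 40 bytes. */
    if (off - id - 1 != TOY_RUN_ID_SIZE) return -1;
    memcpy(replid, id, TOY_RUN_ID_SIZE);
    replid[TOY_RUN_ID_SIZE] = '\0';
    *offset = strtoll(off, NULL, 10);
    return 0;
}
```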

6) The replica registers a file event whose handler is readSyncBulkPayload, which reads the RDB data from the socket. It then records the related state and timestamps.

//replication.c#syncWithMaster
if (aeCreateFileEvent(server.el,fd, AE_READABLE,readSyncBulkPayload,NULL) == AE_ERR){
   ...
}
server.repl_state = REPL_STATE_TRANSFER;
server.repl_transfer_size = -1;
server.repl_transfer_read = 0;
server.repl_transfer_last_fsync_off = 0;
server.repl_transfer_fd = dfd;
server.repl_transfer_lastio = server.unixtime; 

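7) Before reading the RDB data, the replica reads one line. With disk-based transfer the file size is known in advance, so the master sends $<file size>\r\n. With diskless transfer the stream starts with $EOF:<40-byte random string>\r\n and ends with the same <40-byte string>.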

//replication.c#readSyncBulkPayload
if (server.repl_transfer_size == -1) {
	if (syncReadLine(fd,buf,1024,server.repl_syncio_timeout*1000) == -1) {
		......
	}
	......
	server.repl_transfer_size = strtol(buf+1,NULL,10);
}
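The two preamble formats can be told apart with a few string checks. The sketch below is a hypothetical parser (toy_parse_bulk_header is not a Redis function). It assumes the trailing \r\n has already been stripped from the line, and it treats a too-short EOF mark as an error, which is a simplification chosen for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_RUN_ID_SIZE 40 /* length of the random EOF mark */

/* Parse the line that precedes the RDB payload.
 * Returns 0 for "$<size>" (size stored in *size),
 *         1 for "$EOF:<mark>" (mark copied, *size set to 0),
 *        -1 for anything else. */
int toy_parse_bulk_header(const char *line, long long *size,
                          char mark[TOY_RUN_ID_SIZE + 1]) {
    assert(line != NULL && size != NULL && mark != NULL);
    if (line[0] != '$') return -1;
    if (strncmp(line + 1, "EOF:", 4) == 0) {
        if (strlen(line + 5) < TOY_RUN_ID_SIZE) return -1;
        memcpy(mark, line + 5, TOY_RUN_ID_SIZE);
        mark[TOY_RUN_ID_SIZE] = '\0';
        *size = 0; /* unknown in advance: the end is found via the mark */
        return 1;
    }
    *size = strtoll(line + 1, NULL, 10);
    return 0;
}
```

In the disk-based case the caller then counts bytes until *size is reached; in the diskless case it compares the tail of the received data with the mark.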

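8) The replica reads data from the socket and writes it to the RDB file; after every 8 MB written it flushes to disk.

//replication.c#readSyncBulkPayload
//Read data from the socket
nread = read(fd,buf,readlen);
//Record the time of the last I/O with the master
server.repl_transfer_lastio = server.unixtime;
//Write to the temporary RDB file
nwritten = write(server.repl_transfer_fd,buf,nread);
//After every 8 MB written, flush that range to disk.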
if (server.repl_transfer_read >=
        server.repl_transfer_last_fsync_off + REPL_MAX_WRITTEN_BEFORE_FSYNC) {
	off_t sync_size = server.repl_transfer_read -
	                          server.repl_transfer_last_fsync_off;
	rdb_fsync_range(server.repl_transfer_fd,
	            server.repl_transfer_last_fsync_off, sync_size);
	server.repl_transfer_last_fsync_off += sync_size;
}
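The bookkeeping behind the periodic flush can be followed without any file I/O. The sketch below only tracks the counters; toy_transfer and toy_account_chunk are hypothetical names, and the 8 MB threshold is assumed to match REPL_MAX_WRITTEN_BEFORE_FSYNC.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FSYNC_EVERY (1024LL * 1024 * 8) /* assumed 8 MB threshold */

/* Counters mirroring repl_transfer_read and repl_transfer_last_fsync_off. */
typedef struct {
    long long read;           /* bytes received so far */
    long long last_fsync_off; /* offset up to which data was flushed */
    int fsyncs;               /* number of range flushes issued */
} toy_transfer;

/* Account for one chunk. Returns 1 if a range flush would be issued. */
int toy_account_chunk(toy_transfer *t, long long nread) {
    assert(t != NULL && nread >= 0);
    t->read += nread;
    if (t->read >= t->last_fsync_off + TOY_FSYNC_EVERY) {
        long long sync_size = t->read - t->last_fsync_off;
        /* Redis calls rdb_fsync_range(fd, last_fsync_off, sync_size) here. */
        t->last_fsync_off += sync_size;
        t->fsyncs++;
        return 1;
    }
    return 0;
}
```

Flushing in bounded ranges spreads the disk work over the transfer instead of leaving one large flush for the end.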

9) Once the expected number of bytes has been read, the transfer is marked as complete.

//replication.c#readSyncBulkPayload
if (!usemark) {
	if (server.repl_transfer_read == server.repl_transfer_size)
	       eof_reached = 1;
}

10) When the transfer is complete, the file is flushed to disk and renamed. If AOF is enabled, it is stopped first. The RDB file is then loaded into memory, and the connection to the master becomes the replica's master client.

//replication.c#readSyncBulkPayload
//Flush to disk
fsync(server.repl_transfer_fd);
......
//Rename the RDB file
rename(server.repl_transfer_tmpfile,server.rdb_filename);
...
//Stop AOF
if(aof_is_enabled) stopAppendOnly();
...
//Load the RDB file
rdbLoad(server.rdb_filename,&rsi);
...
//Use the connection to the master as the replica's master client
replicationCreateMasterClient(server.repl_transfer_s,rsi.repl_stream_db);

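11) The connection to the master becomes a client on the replica, flagged CLIENT_MASTER, and the current replication offset is recorded. server.master->reploff serves as the replica's replication offset; it is updated each time the replica executes a command propagated by the master.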

//replication.c#replicationCreateMasterClient
void replicationCreateMasterClient(int fd, int dbid) {
    server.master = createClient(fd);
    server.master->flags |= CLIENT_MASTER;
    server.master->authenticated = 1;
    server.master->reploff = server.master_initial_offset;
    server.master->read_reploff = server.master->reploff;
    memcpy(server.master->replid, server.master_replid,
        sizeof(server.master_replid));
    /* If master offset is set to -1, this master is old and is not
     * PSYNC capable, so we flag it accordingly. */
    if (server.master->reploff == -1)
        server.master->flags |= CLIENT_PRE_PSYNC;
    if (dbid != -1) selectDb(server.master,dbid);
}

12) The replication state is updated. If AOF was enabled, it is restarted, which immediately triggers an AOF rewrite so that the AOF file reflects the newly loaded data.

//replication.c#readSyncBulkPayload
server.repl_state = REPL_STATE_CONNECTED;
server.repl_down_since = 0;
......
if (aof_is_enabled) restartAOFAfterSYNC();

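4.2 Master

On the master, the entry point is syncCommand.

1) syncCommand adds the client to the server.slaves list. If this is the first replica, it also creates the replication backlog (repl_backlog).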

//replication.c#syncCommand
c->replstate = SLAVE_STATE_WAIT_BGSAVE_START;
if (server.repl_disable_tcp_nodelay)
        anetDisableTcpNoDelay(NULL, c->fd);
/* Non critical if it fails. */
c->repldbfd = -1;
c->flags |= CLIENT_SLAVE;
//Add the client to the server.slaves list
listAddNodeTail(server.slaves,c);
/* Create the replication backlog if needed. */
if (listLength(server.slaves) == 1 && server.repl_backlog == NULL) {
	......
    //Create repl_backlog
	createReplicationBacklog();
}

2) Then, in the third case handled by syncCommand (no BGSAVE in progress), startBgsaveForReplication is called.

//replication.c#syncCommand
.....
startBgsaveForReplication(c->slave_capa);

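3) The RDB data is sent to the replica; this can also be done without writing to disk. Here we discuss the disk-based case, in which rdbSaveBackground is called. The logic is the same as ordinary RDB persistence.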

//replication.c#startBgsaveForReplication
int socket_target = server.repl_diskless_sync && (mincapa & SLAVE_CAPA_EOF);
if (socket_target) {
	retval = rdbSaveToSlavesSockets(rsiptr);
} else {
	retval = rdbSaveBackground(server.rdb_filename,rsiptr);
}

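4) In the disk-based case, replicationSetupSlaveForFullResync is called for every replica waiting in SLAVE_STATE_WAIT_BGSAVE_START. Inside it, the master replies to the replica with +FULLRESYNC, followed by its replid and the initial offset.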

//replication.c#startBgsaveForReplication
if (!socket_target) {
	listRewind(server.slaves,&li);
	while((ln = listNext(&li))) {
		client *slave = ln->value;
		if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
			replicationSetupSlaveForFullResync(slave,getPsyncInitialOffset());
		}
	}
}

//replication.c#replicationSetupSlaveForFullResync
int replicationSetupSlaveForFullResync(client *slave, long long offset) {
	char buf[128];
	int buflen;
	slave->psync_initial_offset = offset;
	slave->replstate = SLAVE_STATE_WAIT_BGSAVE_END;
	/* We are going to accumulate the incremental changes for this
     * slave as well. Set slaveseldb to -1 in order to force to re-emit
     * a SELECT statement in the replication stream. */
	server.slaveseldb = -1;
	/* Don't send this reply to slaves that approached us with
     * the old SYNC command. */
	if (!(slave->flags & CLIENT_PRE_PSYNC)) {
		buflen = snprintf(buf,sizeof(buf),"+FULLRESYNC %s %lld\r\n",
                          server.replid,offset);
        if (write(slave->fd,buf,buflen) != buflen) {
            freeClientAsync(slave);
            return C_ERR;
        }
    }
    return C_OK;
}
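To make the reply format concrete, here is a minimal sketch that formats a "+FULLRESYNC <replid> <offset>\r\n" line and parses it back. The function names and TOY_RUN_ID_SIZE are hypothetical; only the wire format and the 40-character ID length come from the code above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_RUN_ID_SIZE 40  /* same length as CONFIG_RUN_ID_SIZE */

/* Format "+FULLRESYNC <replid> <offset>\r\n" into buf.
 * Returns the number of bytes written, or -1 if buf is too small. */
int toy_format_fullresync(char *buf, size_t buflen,
                          const char *replid, long long offset) {
    assert(buf != NULL && replid != NULL);
    int n = snprintf(buf, buflen, "+FULLRESYNC %s %lld\r\n", replid, offset);
    if (n < 0 || (size_t)n >= buflen) return -1;
    return n;
}

/* Parse a line produced by toy_format_fullresync.
 * Returns 0 on success, -1 if the line is not a well-formed FULLRESYNC. */
int toy_parse_fullresync(const char *line,
                         char replid[TOY_RUN_ID_SIZE + 1],
                         long long *offset) {
    const char *prefix = "+FULLRESYNC ";
    size_t plen = strlen(prefix);
    if (strncmp(line, prefix, plen) != 0) return -1;

    /* The replication ID must be exactly TOY_RUN_ID_SIZE characters. */
    const char *id = line + plen;
    const char *space = strchr(id, ' ');
    if (space == NULL || (size_t)(space - id) != TOY_RUN_ID_SIZE) return -1;
    memcpy(replid, id, TOY_RUN_ID_SIZE);
    replid[TOY_RUN_ID_SIZE] = '\0';

    if (sscanf(space + 1, "%lld", offset) != 1) return -1;
    return 0;
}
```

The replica needs both values: the replication ID identifies the history it is joining, and the offset is the starting point for its own replication offset.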

5) Once RDB persistence finishes and the child process exits, backgroundSaveDoneHandlerDisk runs and calls updateSlavesWaitingBgsave.

//rdb.c#backgroundSaveDoneHandlerDisk
void backgroundSaveDoneHandlerDisk(int exitcode, int bysignal) {
   ......
   updateSlavesWaitingBgsave((!bysignal && exitcode == 0) ? C_OK : C_ERR, RDB_CHILD_TYPE_DISK);
}

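6) updateSlavesWaitingBgsave opens the RDB file and reads its size. It then registers the replica connection's file descriptor for writable events in the event loop, with sendBulkToSlave as the handler.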

//replication.c#updateSlavesWaitingBgsave
if ((slave->repldbfd = open(server.rdb_filename,O_RDONLY)) == -1 ||
                    redis_fstat(slave->repldbfd,&buf) == -1) {
	freeClient(slave);
	serverLog(LL_WARNING,"SYNC failed. Can't open/stat DB after BGSAVE: %s", strerror(errno));
	continue;
}
slave->repldboff = 0;
// Record the size of the RDB file
slave->repldbsize = buf.st_size;
slave->replstate = SLAVE_STATE_SEND_BULK;
slave->replpreamble = sdscatprintf(sdsempty(),"$%lld\r\n",(unsigned long long) slave->repldbsize);
aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE, sendBulkToSlave, slave) == AE_ERR) {
	freeClient(slave);
	continue;
}

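7) sendBulkToSlave first sends "$<RDB file size>\r\n" to the replica, then reads data from the RDB file and writes it to the replica's socket. Finally it advances repldboff by the number of bytes actually written.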

//replication.c#sendBulkToSlave
if (slave->replpreamble) {
    nwritten = write(fd,slave->replpreamble,sdslen(slave->replpreamble));
    ......
}
lseek(slave->repldbfd,slave->repldboff,SEEK_SET);
buflen = read(slave->repldbfd,buf,PROTO_IOBUF_LEN);
...
nwritten = write(fd,buf,buflen);
...
slave->repldboff += nwritten;
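The pattern in sendBulkToSlave (send a length preamble, then send the file in chunks, advancing the offset by what was actually written) can be tried out without sockets. The sketch below is a simplified model under stated assumptions: toy_sink is a hypothetical in-memory stand-in for a non-blocking socket that may accept only part of a write, and TOY_CHUNK plays the role of PROTO_IOBUF_LEN. Unlike the real function, the model always starts the payload on a later event.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_CHUNK 16   /* plays the role of PROTO_IOBUF_LEN */

/* In-memory stand-in for a socket that may accept only part of a write. */
typedef struct toy_sink {
    char data[512];
    size_t len;
    size_t max_per_write;
} toy_sink;

static size_t toy_write(toy_sink *s, const char *buf, size_t n) {
    if (n > s->max_per_write) n = s->max_per_write;
    assert(s->len + n <= sizeof(s->data));
    memcpy(s->data + s->len, buf, n);
    s->len += n;
    return n;
}

typedef struct toy_transfer {
    const char *rdb;       /* models the RDB file contents */
    size_t rdb_size;       /* models slave->repldbsize */
    size_t off;            /* models slave->repldboff */
    char preamble[32];     /* models slave->replpreamble */
    size_t preamble_left;  /* preamble bytes not yet sent */
} toy_transfer;

void toy_transfer_init(toy_transfer *t, const char *rdb, size_t size) {
    t->rdb = rdb;
    t->rdb_size = size;
    t->off = 0;
    t->preamble_left = (size_t)snprintf(t->preamble, sizeof(t->preamble),
                                        "$%zu\r\n", size);
}

/* Handle one "socket is writable" event.
 * Returns 1 when the whole payload has been sent, 0 otherwise. */
int toy_send_step(toy_transfer *t, toy_sink *s) {
    if (t->preamble_left > 0) {
        size_t total = strlen(t->preamble);
        size_t n = toy_write(s, t->preamble + (total - t->preamble_left),
                             t->preamble_left);
        t->preamble_left -= n;
        return 0;   /* in this model the payload starts on a later event */
    }
    size_t want = t->rdb_size - t->off;
    if (want > TOY_CHUNK) want = TOY_CHUNK;
    /* Advance by what was actually written, not by what was requested. */
    t->off += toy_write(s, t->rdb + t->off, want);
    return t->off == t->rdb_size;
}
```

Because the offset only moves by the number of bytes the sink accepted, a short write simply means the next event resumes from the same position, which is the reason the original code seeks to repldboff before every read.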

8) When the whole RDB file has been transferred, the writable event is removed from the event loop and putSlaveOnline is called.

//replication.c#sendBulkToSlave
if (slave->repldboff == slave->repldbsize) {
	close(slave->repldbfd);
	slave->repldbfd = -1;
	aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
	putSlaveOnline(slave);
}

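9) putSlaveOnline changes the replica's state to SLAVE_STATE_ONLINE and records repl_ack_time. It then registers a writable event on the replica's socket so that the commands accumulated while the RDB was being generated and sent are delivered to the replica, and finally refreshes the count of "good" replicas. The registered handler, sendReplyToClient, is what sends the data in the replication buffer to the replica.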

//replication.c#putSlaveOnline
void putSlaveOnline(client *slave) {
    slave->replstate = SLAVE_STATE_ONLINE;
    slave->repl_put_online_on_ack = 0;
    slave->repl_ack_time = server.unixtime; /* Prevent false timeout. */
    if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE,
        sendReplyToClient, slave) == AE_ERR) {
        serverLog(LL_WARNING,"Unable to register writable event for replica bulk transfer: %s", strerror(errno));
        freeClient(slave);
        return;
    }
    refreshGoodSlavesCount();
    serverLog(LL_NOTICE,"Synchronization with replica %s succeeded",
        replicationGetSlaveName(slave));
}

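A simple question: the replication buffer is sent right after the RDB, so can the two get mixed up on the replica? In the disk-based case the master announces the file size first, so the replica reads exactly that many bytes of RDB data and the replication buffer that follows cannot run into it. The diskless case is handled differently: the master puts the replica online and starts sending the replication buffer only after it receives the first REPLCONF ACK from the replica.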
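The length-prefix argument can be checked with a small sketch. Assuming the replica already holds a buffer that starts with "$<len>\r\n", the hypothetical toy_split_stream below locates the RDB payload and the first byte of whatever follows it. This is an illustration of the framing only; the real replica reads the payload incrementally from the socket.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Where the RDB payload lives and where the propagated commands begin. */
typedef struct toy_split {
    const char *rdb;    /* first byte of the RDB payload */
    size_t rdb_len;     /* length announced by the "$<len>\r\n" preamble */
    const char *rest;   /* first byte after the payload */
    size_t rest_len;
} toy_split;

/* Split "$<len>\r\n<len bytes><anything else>".
 * Returns 0 on success, -1 if the preamble is malformed or data is short. */
int toy_split_stream(const char *buf, size_t buflen, toy_split *out) {
    assert(buf != NULL && out != NULL);
    if (buflen < 4 || buf[0] != '$') return -1;

    /* Find the "\r\n" that terminates the preamble. */
    const char *cr = memchr(buf, '\r', buflen);
    if (cr == NULL || (size_t)(cr - buf) + 1 >= buflen || cr[1] != '\n')
        return -1;

    /* Parse the decimal length; cap the digit count to avoid overflow. */
    size_t ndigits = (size_t)(cr - buf) - 1;
    if (ndigits == 0 || ndigits > 18) return -1;
    unsigned long long len = 0;
    for (const char *p = buf + 1; p < cr; p++) {
        if (*p < '0' || *p > '9') return -1;
        len = len * 10 + (unsigned long long)(*p - '0');
    }

    size_t header = (size_t)(cr - buf) + 2;
    if (len > buflen - header) return -1;   /* payload not fully present */

    out->rdb = buf + header;
    out->rdb_len = (size_t)len;
    out->rest = out->rdb + out->rdb_len;
    out->rest_len = buflen - header - out->rdb_len;
    return 0;
}
```

Since the boundary is computed from the announced length alone, the payload may contain any bytes, including "\r\n" or "*", without being confused with the command stream.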

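5. Questions

5.1 How are commands executed while the RDB is being generated and transferred delivered to the replica?

The following applies when repl-diskless-sync is set to no, that is, when the RDB data is first persisted to a file.

1) replicationFeedSlaves appends each executed command to repl_backlog and also to the output buffer of every replica client, except replicas still in SLAVE_STATE_WAIT_BGSAVE_START.

//replication.c#replicationFeedSlaves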
......
listRewind(slaves,&li);
while((ln = listNext(&li))) {
	client *slave = ln->value;
	if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) continue;
	addReplyMultiBulkLen(slave,argc);
	for (j = 0; j < argc; j++)
	    addReplyBulk(slave,argv[j]);
}

2) addReplyBulk calls addReply, which in turn calls prepareClientToWrite.

//networking.c#addReply
void addReply(client *c, robj *obj) {
    if (prepareClientToWrite(c) != C_OK) return;
    ......
    // (omitted) append the data to the output buffer
}

3) prepareClientToWrite calls clientInstallWriteHandler. The client is added to clients_pending_write only if its replication state is REPL_STATE_NONE, or if the state is SLAVE_STATE_ONLINE and repl_put_online_on_ack is 0.

//networking.c#clientInstallWriteHandler
void clientInstallWriteHandler(client *c) {
   
    if (!(c->flags & CLIENT_PENDING_WRITE) &&
        (c->replstate == REPL_STATE_NONE ||
         (c->replstate == SLAVE_STATE_ONLINE && !c->repl_put_online_on_ack)))
    {
        c->flags |= CLIENT_PENDING_WRITE;
        listAddNodeHead(server.clients_pending_write,c);
    }
}
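The condition in clientInstallWriteHandler is easier to reason about as a standalone predicate. The sketch below is hypothetical: the enum mirrors the master-side replica states by name only (its numeric values are not Redis's), and toy_client keeps just the three fields the condition reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the states a client can be in on the master. */
typedef enum toy_replstate {
    TOY_STATE_NONE,              /* ordinary client, not a replica */
    TOY_STATE_WAIT_BGSAVE_START,
    TOY_STATE_WAIT_BGSAVE_END,
    TOY_STATE_SEND_BULK,
    TOY_STATE_ONLINE
} toy_replstate;

typedef struct toy_client {
    toy_replstate replstate;
    bool put_online_on_ack;   /* models repl_put_online_on_ack */
    bool pending_write;       /* models the CLIENT_PENDING_WRITE flag */
} toy_client;

/* True when the client may be queued on the pending-write list right now. */
bool toy_can_queue_write(const toy_client *c) {
    assert(c != NULL);
    if (c->pending_write) return false;               /* already queued */
    if (c->replstate == TOY_STATE_NONE) return true;  /* ordinary client */
    /* A replica qualifies only once it is online and not waiting for ACK. */
    return c->replstate == TOY_STATE_ONLINE && !c->put_online_on_ack;
}
```

A replica in any of the intermediate states is rejected, so whatever is appended to its output buffer stays there until the state changes.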

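4) Replies are written back to clients in beforeSleep, through handleClientsWithPendingWrites: each client is taken from clients_pending_write and its pending output is written to the socket.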

//networking.c#handleClientsWithPendingWrites
int handleClientsWithPendingWrites(void) {
	listIter li;
	listNode *ln;
	int processed = listLength(server.clients_pending_write);
	listRewind(server.clients_pending_write,&li);
	while((ln = listNext(&li))) {
		client *c = listNodeValue(ln);
		c->flags &= ~CLIENT_PENDING_WRITE;
		listDelNode(server.clients_pending_write,ln);
		......
		if (writeToClient(c->fd,c,0) == C_ERR) continue;
		......
	}
	return processed;
}

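5) So while the RDB is being generated and transferred, the replica's state is neither REPL_STATE_NONE nor SLAVE_STATE_ONLINE, and the propagated commands stay in its client output buffer without being sent.

6) After the RDB has been generated, backgroundSaveDoneHandlerDisk is called, which calls updateSlavesWaitingBgsave. That function reads the file size and registers the replica's socket file descriptor in the event loop with sendBulkToSlave as the handler.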

//replication.c#updateSlavesWaitingBgsave
......
slave->repldboff = 0;
slave->repldbsize = buf.st_size;
slave->replstate = SLAVE_STATE_SEND_BULK;
slave->replpreamble = sdscatprintf(sdsempty(),"$%lld\r\n",(unsigned long long) slave->repldbsize);
aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE, sendBulkToSlave, slave) == AE_ERR) {
	freeClient(slave);
	continue;
}
.....

7) sendBulkToSlave sends the RDB data to the replica. Once everything has been sent, it removes the writable event and calls putSlaveOnline.

//replication.c#sendBulkToSlave
......
if (slave->repldboff == slave->repldbsize) {
	close(slave->repldbfd);
	slave->repldbfd = -1;
	aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
	putSlaveOnline(slave);
}

8) putSlaveOnline changes the replication state, sets repl_put_online_on_ack to 0, and registers another event handler, sendReplyToClient, which mainly calls writeToClient.

//replication.c#putSlaveOnline
void putSlaveOnline(client *slave) {
	slave->replstate = SLAVE_STATE_ONLINE;
	slave->repl_put_online_on_ack = 0;
	slave->repl_ack_time = server.unixtime; /* Prevent false timeout. */
	if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE,
	        sendReplyToClient, slave) == AE_ERR) {
		......
		return;
	}
	refreshGoodSlavesCount();
	......
}
//networking.c#sendReplyToClient
void sendReplyToClient(aeEventLoop *el, int fd, void *privdata, int mask) {
    UNUSED(el);
    UNUSED(mask);
    writeToClient(fd,privdata,1);
}

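9) In writeToClient, once all data in the output buffer has been sent, the writable event for that socket is removed from the event loop. From then on, commands propagated to the replica are handled the same way as replies to ordinary clients.

//networking.c#writeToClient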
......
if (!clientHasPendingReplies(c)) {
	c->sentlen = 0;
	if (handler_installed) aeDeleteFileEvent(server.el,c->fd,AE_WRITABLE);
	/* Close connection after entire reply has been sent. */
	if (c->flags & CLIENT_CLOSE_AFTER_REPLY) 
	{
		freeClient(c);
		return C_ERR;
	}
}
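The "drain the buffer, then unregister the handler" behaviour of writeToClient can be reduced to a few lines. This is a hypothetical model, not the Redis implementation: toy_out tracks only the number of pending bytes and whether a writable handler is installed, and room stands for how many bytes the socket accepts on one event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_out {
    size_t pending;          /* bytes still waiting in the output buffer */
    bool handler_installed;  /* models the AE_WRITABLE file event */
} toy_out;

/* Handle one writable event: send up to `room` bytes and unregister the
 * handler once nothing is left. Returns the number of bytes sent. */
size_t toy_on_writable(toy_out *c, size_t room) {
    assert(c != NULL);
    size_t n = c->pending < room ? c->pending : room;
    c->pending -= n;
    if (c->pending == 0) c->handler_installed = false;
    return n;
}
```

Leaving the handler installed while bytes remain is what guarantees that the commands buffered during the RDB transfer keep flowing until the backlog of output is empty.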

6. References

  1. 《Redis 5设计与源码分析》 (Redis 5 Design and Source Code Analysis)

  2. Redis 5.0.12 source code