Writing Raft Step by Step: SOFAJRaft Log Replication Source Code Analysis (5)


Log Replication

1. The Pipeline Mechanism

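Before diving into log replication, we need to understand sofa-jraft's pipeline mechanism. First, a quick recap of the Raft protocol: Raft requires a follower's log to be in the same order as the leader's, and it also forbids holes in a follower's log. The situation in the figure below is therefore not allowed: follower01 has a hole in its log, and follower02 has entries out of order.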

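To avoid these situations we could use a simple "request-response-request" model: send a request, wait for its response, then send the next request. This guarantees log ordering, and failures are easy to handle because the request can simply be retried. However, its performance is predictably poor, and log replication in Raft is a very frequent operation.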

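To improve replication performance, jraft uses a pipeline mechanism in addition to batch replication. Specifically, jraft no longer sends replication requests one at a time; it allows up to a window's worth of requests to be outstanding at once. In other words, the serial request-response-request model is parallelized, similar to TCP's sliding window algorithm.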

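Within the window, the leader may send requests back to back without waiting for responses. jraft calls a request that has been sent but not yet answered an inflight request. After sending a request, the leader records it, assigns it a seqId that captures the order of the requests, and pushes it onto a first-in-first-out queue.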

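When a response comes back, the leader's RPC callback supplies the seqId that was assigned at send time, so the leader can find the request that the response belongs to.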

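When the leader receives a response that does not correspond to the first request in the inflight request queue, it does not process it yet. Instead it puts the response into the pending response priority queue, which is ordered by seqId so that it stays consistent with the order of the inflight request queue. If the response does correspond to the first request, it is handled by the normal log replication logic, which we will not go into for now.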

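In the figure below, solid yellow circles are inflight requests whose responses have been received, dashed yellow circles are inflight requests whose responses have not arrived yet, and requiredNextSeq points to the next response the leader expects. Gray circles are inflight requests and responses that have already been processed.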
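
The bookkeeping described above can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not jraft's Replicator: the class name PipelineSketch and its methods are hypothetical, and a response is reduced to its seq number. Only the names requiredNextSeq, the inflight queue, and the pending response priority queue come from the article.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

/** Hypothetical, simplified model of the pipeline bookkeeping; not jraft's Replicator. */
final class PipelineSketch {
    // FIFO of requests that were sent but not yet processed, identified by seq.
    private final Queue<Integer> inflights = new ArrayDeque<>();
    // Responses that arrived early, ordered by seq.
    private final PriorityQueue<Integer> pendingResponses = new PriorityQueue<>();
    private int reqSeq = 0;          // seq to assign to the next request
    private int requiredNextSeq = 0; // seq of the response that must be processed next

    /** Records a sent request and returns the seq assigned to it. */
    int send() {
        final int seq = reqSeq++;
        inflights.add(seq);
        return seq;
    }

    /** Buffers a response and returns the seqs that became processable, in order. */
    List<Integer> onResponse(final int seq) {
        pendingResponses.add(seq);
        final List<Integer> processed = new ArrayList<>();
        // Drain only while the smallest buffered seq is exactly the one we expect.
        while (!pendingResponses.isEmpty() && pendingResponses.peek() == requiredNextSeq) {
            pendingResponses.remove();
            inflights.remove(); // the head of the FIFO is the matching request
            processed.add(requiredNextSeq++);
        }
        return processed;
    }
}
```

With three requests in flight, a response for seq 1 that arrives first is only buffered; once the response for seq 0 arrives, both 0 and 1 are processed in order.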

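Although the pipeline mechanism preserves the order in which log entries are committed and improves transfer efficiency, it cannot guarantee ordering over the network. After receiving requests, a peer may not replicate the log and apply it to the state machine in the order the leader wrote it. As the figure below shows, in the ideal case replication between leader and follower is strictly ordered. In practice, jraft uses boltRpc as its communication layer, and that framework uses a connection pool by default, so the order in which the replicator sends log entries and the order in which the follower receives them are unpredictable. jraft does recover from out-of-order entries by retransmitting, but because log replication is so frequent, this causes a large number of unnecessary retransmissions and degrades performance.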

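jraft therefore uses two measures to mitigate the problem as far as possible:

  1. Use a single connection
  2. Handle requests on the follower directly in the IO thread (single-threaded processing)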

This way, the peer processes incoming requests in the same order as the leader sent them, as far as that is possible.

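The inflight request queue has one more problem. Suppose the response to some request keeps failing and the retry limit is exceeded: the inflight request queue may then be stuck forever. If the queue had no size limit, none of the requests behind it could ever be processed.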

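When jraft hits this situation it rebuilds the inflight request queue and increments a version field. The version is used to decide whether a received response belongs to a previous version of the inflight request queue; if so, the response is ignored.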
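
A minimal sketch of the version check, assuming a hypothetical class VersionedInflights that tracks only seq numbers: each request captures the version at send time, and a reset bumps the version so that late responses from the old queue are recognized and dropped. The field name version and the method name resetInflights follow the source; everything else is illustrative.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of version-tagged inflight requests; not the full jraft class. */
final class VersionedInflights {
    private final Queue<Integer> inflights = new ArrayDeque<>();
    private int version = 0;
    private int reqSeq = 0;

    /** Sends a request and returns the version captured at send time. */
    int send() {
        inflights.add(reqSeq++);
        return version;
    }

    /** Drops all inflight requests and invalidates their future responses. */
    void resetInflights() {
        version++;
        inflights.clear();
    }

    /** A response is accepted only if its request was sent under the current version. */
    boolean accept(final int stateVersion) {
        return stateVersion == version;
    }

    int inflightCount() {
        return inflights.size();
    }
}
```

A response whose captured version is older than the current one is simply ignored, which is the same test as `stateVersion != r.version` in onRpcReturned.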

2. Probe

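In the heartbeat section of the previous chapter we mentioned that when a heartbeat request is rejected, i.e. response.success = false, the leader sends a probeRequest. This section looks at what probeRequest does in detail. Sending a probeRequest actually shares the sendEmptyEntries method with heartbeats, except that false is passed in.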

private void sendProbeRequest() {
    sendEmptyEntries(false);
}

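Although both are empty AppendEntriesRequests, the handling differs from a heartbeat in three main ways:

  1. It changes the local state
  2. It uses onRpcReturned as the callback
  3. It pushes the request onto the inflight request queue, which means the response to this request must be processed in order.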

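If a heartbeat merely renews the leader's lease over and over, a probe request is how the leader actively determines the log difference between itself and a follower.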

    private void sendEmptyEntries(final boolean isHeartbeat,
                                  final RpcResponseClosure<AppendEntriesResponse> heartBeatClosure) {
        final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
        if (!fillCommonFields(rb, this.nextIndex - 1, isHeartbeat)) {
            // id is unlock in installSnapshot
            installSnapshot();
            if (isHeartbeat && heartBeatClosure != null) {
                RpcUtils.runClosureInThread(heartBeatClosure, new Status(RaftError.EAGAIN,
                    "Fail to send heartbeat to peer %s, group %s", this.options.getPeerId(), this.options.getGroupId()));
            }
            return;
        }
        try {
            final long monotonicSendTimeMs = Utils.monotonicMs();

            if (isHeartbeat) {
                // ...... heartbeat logic omitted
            } else {
                // No entries and empty data means this is a probe request
                rb.setData(ByteString.EMPTY);
                final AppendEntriesRequest request = rb.build();
                // 1. Record the local state
                this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
                this.statInfo.firstLogIndex = this.nextIndex;
                this.statInfo.lastLogIndex = this.nextIndex - 1;
                this.probeCounter++;
                setState(State.Probe);
                final int stateVersion = this.version;
                final int seq = getAndIncrementReqSeq();
                // 2. Send the request and register the callback
                final Future<Message> rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(),
                    request, -1, new RpcResponseClosureAdapter<AppendEntriesResponse>() {

                        @Override
                        public void run(final Status status) {
                            onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request,
                                getResponse(), seq, stateVersion, monotonicSendTimeMs);
                        }

                    });
                // 3. Add the request to the inflight queue
                addInflight(RequestType.AppendEntries, this.nextIndex, 0, 0, seq, rpcFuture);
            }
        } finally {
            unlockId();
        }
    }

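The node that receives this request handles it the same way as a heartbeat request, so we won't repeat that here. Let's focus on the callback that runs on the sender once the response arrives.

The core of this callback is queuing and ordering the responses:

  1. Use version to check that the response is still valid
  2. pendingResponsePriorityQueue orders the responses by seq
  3. Process responses in request order, and stop as soon as there is a gap in the responses
  4. If continueSendEntries is true, continue sending log entries
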
static void onRpcReturned(final ThreadId id, final RequestType reqType, final Status status, final Message request,
                              final Message response, final int seq, final int stateVersion, final long rpcSendTime) {
        if (id == null) {
            return;
        }
        final long startTimeMs = Utils.nowMs();
        Replicator r;
        if ((r = (Replicator) id.lock()) == null) {
            return;
        }

        // Stale response, i.e. a response to an inflight request of a previous version; just ignore it.
        if (stateVersion != r.version) {
            id.unlock();
            return;
        }

        // 1. Put the response into the pending response priority queue
        final PriorityQueue<RpcResponse> holdingQueue = r.pendingResponses;
        holdingQueue.add(new RpcResponse(reqType, seq, status, request, response, rpcSendTime));

        if (holdingQueue.size() > r.raftOptions.getMaxReplicatorInflightMsgs()) {
            // The replicator has too many pending responses; rebuild the inflights
            r.resetInflights();
            r.setState(State.Probe);
            r.sendProbeRequest();
            return;
        }

        // Marks whether we should keep sending requests after the responses are processed
        boolean continueSendEntries = false;

        try {
            int processed = 0;
            while (!holdingQueue.isEmpty()) {
                final RpcResponse queuedPipelinedResponse = holdingQueue.peek();

                // Sequence mismatch, waiting for next response.
                if (queuedPipelinedResponse.seq != r.requiredNextSeq) {
                    // There is a gap in the responses; leave the loop
                    if (processed > 0) {
                        break;
                    } else {
                        // Nothing was processed; wait for the missing response and do not send more requests
                        continueSendEntries = false;
                        id.unlock();
                        return;
                    }
                }
                holdingQueue.remove();
                processed++;
                final Inflight inflight = r.pollInflight();
                if (inflight == null) {
                    // The inflight queue was cleared, so this response belongs to a previous
                    // version of the inflight queue; ignore it.
                    continue;
                }
                if (inflight.seq != queuedPipelinedResponse.seq) {
                    // The request and response do not match; reset the inflight queue and start over.
                    // In theory this branch should never be reached.
                    r.resetInflights();
                    r.setState(State.Probe);
                    continueSendEntries = false;
                    r.block(Utils.nowMs(), RaftError.EREQUEST.getNumber());
                    return;
                }
                try {
                    switch (queuedPipelinedResponse.requestType) {
                        case AppendEntries:
                            // The callback of a probe request also arrives here: its request type is AppendEntries
                            continueSendEntries = onAppendEntriesReturned(id, inflight, queuedPipelinedResponse.status,
                                (AppendEntriesRequest) queuedPipelinedResponse.request,
                                (AppendEntriesResponse) queuedPipelinedResponse.response, rpcSendTime, startTimeMs, r);
                            break;
                        case Snapshot:
                            continueSendEntries = onInstallSnapshotReturned(id, r, queuedPipelinedResponse.status,
                                (InstallSnapshotRequest) queuedPipelinedResponse.request,
                                (InstallSnapshotResponse) queuedPipelinedResponse.response);
                            break;
                    }
                } finally {
                    if (continueSendEntries) {
                        // Success, increase the response sequence.
                        r.getAndIncrementRequiredNextSeq();
                    } else {
                        // The id is already unlocked in onAppendEntriesReturned/onInstallSnapshotReturned, we SHOULD break out.
                        break;
                    }
                }
            }
        } finally {
            if (continueSendEntries) {
                // Continue sending requests
                r.sendEntries();
            }
        }
    }

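Replicator#onAppendEntriesReturned is the method that actually handles the probe response. Its core logic is:

  1. If the request failed because of the network or because the peer crashed, rebuild the inflight request queue
  2. If the peer rejected the probe request, i.e. response.success == false
    1. If the peer is too busy to handle it, wait for a while before retrying
    2. If the peer's term is greater than the current node's term (the peer was network-partitioned, kept incrementing its term, and then recovered), destroy the replicator and raise the node's own term
    3. If the peer has fewer log entries than the leader, set the replicator's nextIndex to the first entry the peer is missing
    4. If the peer has more log entries than the leader, keep decrementing nextIndex to find the last position where the logs agree
    5. In all of the cases above, rebuild the inflight request queue
  3. If the request succeeded but the response term differs from the replicator's current term, treat it as a failed request and rebuild the inflight request queue. (I have not figured out under what circumstances this happens.)
  4. If the request is a normal log request, commit the log entries according to the response
  5. Update the local replicator state

// Non-essential code has been removed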
private static boolean onAppendEntriesReturned(final ThreadId id, final Inflight inflight, final Status status,
                                               final AppendEntriesRequest request,
                                               final AppendEntriesResponse response, final long rpcSendTime,
                                               final long startTimeMs, final Replicator r) {
    // The logs are misaligned: rebuild the inflight queue and discard the earlier requests and responses
    if (inflight.startIndex != request.getPrevLogIndex() + 1) {
        LOG.warn(
            "Replicator {} received invalid AppendEntriesResponse, in-flight startIndex={}, request prevLogIndex={}, reset the replicator state and probe again.",
            r, inflight.startIndex, request.getPrevLogIndex());
        r.resetInflights();
        r.setState(State.Probe);
        r.sendProbeRequest();
        return false;
    }

    if (!status.isOk()) {
        // The follower crashed or there is some other network problem
        // If the follower crashes, any RPC to the follower fails immediately,
        // so we need to block the follower for a while instead of looping until
        // it comes back or be removed
        // dummy_id is unlock in block
        notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
        if ((r.consecutiveErrorTimes++) % 10 == 0) {
            LOG.warn("Fail to issue RPC to {}, consecutiveErrorTimes={}, error={}, groupId={}", r.options.getPeerId(),
                r.consecutiveErrorTimes, status, r.options.getGroupId());
        }
        r.resetInflights();
        r.setState(State.Probe);
        // unlock in block
        r.block(startTimeMs, status.getCode());
        return false;
    }
    r.consecutiveErrorTimes = 0;
    if (!response.getSuccess()) {
        // Target node is busy, sleep for a while.
        // a. If the peer is too busy to handle the request, rebuild the inflight request queue and wait for a while before retrying
        if (response.getErrorResponse().getErrorCode() == RaftError.EBUSY.getNumber()) {
          r.resetInflights();
          r.setState(State.Probe);
          // unlock in block
          r.block(startTimeMs, status.getCode());
          return false;
        }

        if (response.getTerm() > r.options.getTerm()) {
            // The leader's term is stale: destroy the replicator and raise the leader's term
            final NodeImpl node = r.options.getNode();
            r.notifyOnCaughtUp(RaftError.EPERM.getNumber(), true);
            r.destroy();
            node.increaseTermTo(response.getTerm(), new Status(RaftError.EHIGHERTERMRESPONSE,
                "Leader receives higher term heartbeat_response from peer:%s, group:%s", r.options.getPeerId(), r.options.getGroupId()));
            return false;
        }
        if (rpcSendTime > r.lastRpcSendTimestamp) {
            r.lastRpcSendTimestamp = rpcSendTime;
        }
        // Fail, reset the state to try again from nextIndex.
        r.resetInflights();
        // prev_log_index and prev_log_term doesn't match
        if (response.getLastLogIndex() + 1 < r.nextIndex) {
            // The peer has fewer entries than the leader: update nextIndex so that the leader sends the right entries next time
            r.nextIndex = response.getLastLogIndex() + 1;
        } else {
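            // The peer has more entries than the leader, including (uncommitted) entries from older terms that must be truncated.
            // Keep decrementing nextIndex to probe for the point where the logs agree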
            if (r.nextIndex > 1) {
                r.nextIndex--;
            } else {
                // The peer's log does not match the leader's at all; this is very unlikely
                LOG.error("Peer={} declares that log at index=0 doesn't match, which is not supposed to happen, groupId={}",
                    r.options.getPeerId(), r.options.getGroupId());
            }
        }
        // dummy_id is unlock in _send_heartbeat
        r.sendProbeRequest();
        return false;
    }
    // success
    if (response.getTerm() != r.options.getTerm()) {
        // The response term differs, meaning the response's term is lower than the current node's term; rebuild the inflights
        r.resetInflights();
        r.setState(State.Probe);
        id.unlock();
        return false;
    }
    if (rpcSendTime > r.lastRpcSendTimestamp) {
        r.lastRpcSendTimestamp = rpcSendTime;
    }
    final int entriesSize = request.getEntriesCount();
    if (entriesSize > 0) {
        // Commit
        if (r.options.getReplicatorType().isFollower()) {
            // Only commit index when the response is from follower.
            r.options.getBallotBox().commitAt(r.nextIndex, r.nextIndex + entriesSize - 1, r.options.getPeerId());
        }
    }

    r.setState(State.Replicate);
    r.blockTimer = null;
    r.nextIndex += entriesSize;
    r.hasSucceeded = true;
    r.notifyOnCaughtUp(RaftError.SUCCESS.getNumber(), false);
    // dummy_id is unlock in _send_entries
    if (r.timeoutNowIndex > 0 && r.timeoutNowIndex < r.nextIndex) {
        r.sendTimeoutNow(false, false);
    }
    return true;
}

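The source shows that the job of probeRequest is to keep correcting the log difference between the leader and a peer (follower or learner) by adjusting nextIndex, which helps the leader send the right log entries to the peer.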
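
The nextIndex correction can be isolated into a small pure function. The sketch below restates the branch of onAppendEntriesReturned that runs when a probe is rejected; the class ProbeSketch and the method adjustNextIndex are hypothetical names introduced for illustration, and the term check and the inflight reset are left out.

```java
/** Hypothetical helper restating the nextIndex adjustment from the rejected-probe branch. */
final class ProbeSketch {
    private ProbeSketch() {
    }

    /**
     * @param nextIndex        the leader's current guess for the peer's next entry
     * @param peerLastLogIndex the last log index reported in the rejected response
     * @return the nextIndex to use for the following probe
     */
    static long adjustNextIndex(final long nextIndex, final long peerLastLogIndex) {
        if (peerLastLogIndex + 1 < nextIndex) {
            // The peer has fewer entries: jump straight to its first missing index.
            return peerLastLogIndex + 1;
        }
        // The peer's log is at least as long but diverges: step back one entry at a time.
        return nextIndex > 1 ? nextIndex - 1 : nextIndex;
    }
}
```

For example, with nextIndex = 10 a peer reporting lastLogIndex = 5 moves nextIndex directly to 6, while a peer reporting lastLogIndex = 12 only moves it back to 9, and further probes continue from there.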

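3. Log Replication

3.1. Leader Log Replication

The entry point for writing a log entry on the leader is NodeImpl#apply. The application wraps the callback and the data to write into a task, and the leader publishes it to the ApplyDisruptor's event queue.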

@Override
public void apply(final Task task) {
    if (this.shutdownLatch != null) {
        // Shutdown check
        ThreadPoolsFactory.runClosureInThread(this.groupId, task.getDone(), new Status(RaftError.ENODESHUTDOWN, "Node is shutting down."));
        throw new IllegalStateException("Node is shutting down");
    }
    // The task carries the data and the application-level callback
    Requires.requireNonNull(task, "Null task");

    final LogEntry entry = new LogEntry();
    entry.setData(task.getData());

    // Wrap it as an event
    final EventTranslator<LogEntryAndClosure> translator = (event, sequence) -> {
      event.reset();
      event.done = task.getDone();
      event.entry = entry;
      event.expectedTerm = task.getExpectedTerm();
    };

    // Publish the event in blocking or non-blocking mode
    switch(this.options.getApplyTaskMode()) {
      case Blocking:
        this.applyQueue.publishEvent(translator);
        break;
      case NonBlocking:
      default:
        if (!this.applyQueue.tryPublishEvent(translator)) {
          String errorMsg = "Node is busy, has too many tasks, queue is full and bufferSize="+ this.applyQueue.getBufferSize();
            ThreadPoolsFactory.runClosureInThread(this.groupId, task.getDone(),
              new Status(RaftError.EBUSY, errorMsg));
          LOG.warn("Node {} applyQueue is overload.", getNodeId());
          this.metrics.recordTimes("apply-task-overload-times", 1);
          if(task.getDone() == null) {
            throw new OverloadException(errorMsg);
          }
        }
        break;
    }
}

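Taking rheakv as an example, the callback carried by the task looks like the following: on success it writes the result into the response, and on failure it writes the error code. As for when this callback gets invoked, we will leave that as a teaser for now. (PS: if you know the Raft algorithm, you already know when it is called.)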

this.rawKVStore.put(key, value, new BaseKVStoreClosure() {
                @Override
                public void run(final Status status) {
                    if (status.isOk()) {
                        response.setValue((Boolean) getData());
                    } else {
                        setFailure(request, response, status, getError());
                    }
                    closure.sendResponse(response);
                }
            });

3.1.1. ApplyDisruptor

applyDisruptor is a high-performance message queue. LogEntryAndClosureHandler is registered as the event handler here; note that events are consumed by a single thread.

this.applyDisruptor = DisruptorBuilder.<LogEntryAndClosure> newInstance() //
    .setRingBufferSize(this.raftOptions.getDisruptorBufferSize()) //
    .setEventFactory(new LogEntryAndClosureFactory()) //
    .setThreadFactory(new NamedThreadFactory("JRaft-NodeImpl-Disruptor-", true)) //
    .setProducerType(ProducerType.MULTI) //
    .setWaitStrategy(new BlockingWaitStrategy()) //
    .build();
this.applyDisruptor.handleEventsWith(new LogEntryAndClosureHandler());
this.applyDisruptor.setDefaultExceptionHandler(new LogExceptionHandler<Object>(getClass().getSimpleName()));
this.applyQueue = this.applyDisruptor.start();

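LogEntryAndClosureHandler#onEvent is the core method for handling the leader's apply events. Note that it uses batching. The key variable endOfBatch depends on what is in the queue at the moment Disruptor delivers events: the last event available at that moment is delivered with endOfBatch == true. When the node is fairly idle, every task has endOfBatch equal to true; batching only really pays off when the node is busy.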

private class LogEntryAndClosureHandler implements EventHandler<LogEntryAndClosure> {
    // task list for batch
    // Batch buffer
    private final List<LogEntryAndClosure> tasks = new ArrayList<>(NodeImpl.this.raftOptions.getApplyBatch());

    @Override
    public void onEvent(final LogEntryAndClosure event, final long sequence, final boolean endOfBatch)
                                                                                                      throws Exception {
        if (event.shutdownLatch != null) {
            // Shutting down: process the last batch of tasks if possible
            if (!this.tasks.isEmpty()) {
                executeApplyingTasks(this.tasks);
                reset();
            }
            // Shutting down: drop tasks that are not in the batch, and log it
            final int num = GLOBAL_NUM_NODES.decrementAndGet();
            LOG.info("The number of active nodes decrement to {}.", num);
            event.shutdownLatch.countDown();
            return;
        }
        
        this.tasks.add(event);                                                                                                
        if (this.tasks.size() >= NodeImpl.this.raftOptions.getApplyBatch() || endOfBatch) {
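            // Execute the batch when either condition holds:
            // 1. The batch has reached the batch size configured in jraft (applyBatch)
            // 2. Disruptor has set the end-of-batch flag, endOfBatch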
            executeApplyingTasks(this.tasks);
            reset();
        }
    }

    // Clear the batch buffer
    private void reset() {
        for (final LogEntryAndClosure task : this.tasks) {
            task.reset();
        }
        this.tasks.clear();
    }
}
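
The batching rule in onEvent does not depend on Disruptor itself, so it can be tried out in isolation. The following is a minimal sketch under that assumption: BatchingHandlerSketch and its constructor parameters are hypothetical names, and the caller supplies the endOfBatch flag that Disruptor would normally provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-alone version of the batching rule; it does not depend on Disruptor. */
final class BatchingHandlerSketch<T> {
    private final int applyBatch;             // batch size threshold
    private final Consumer<List<T>> executor; // receives each completed batch
    private final List<T> tasks = new ArrayList<>();

    BatchingHandlerSketch(final int applyBatch, final Consumer<List<T>> executor) {
        this.applyBatch = applyBatch;
        this.executor = executor;
    }

    /** Buffers the event, then flushes when the batch is full or the queue is drained. */
    void onEvent(final T event, final boolean endOfBatch) {
        tasks.add(event);
        if (tasks.size() >= applyBatch || endOfBatch) {
            executor.accept(new ArrayList<>(tasks)); // hand over a copy so the buffer can be reused
            tasks.clear();
        }
    }
}
```

With a batch size of 2, the events a, b, c (where only c carries endOfBatch) produce the batches [a, b] and [c]: the first is cut by the size threshold and the second by the end-of-batch flag.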

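NodeImpl#executeApplyingTasks contains two key pieces of logic:

  1. It calls this.ballotBox.appendPendingTask(this.conf.getConf(), this.conf.isStable() ? null : this.conf.getOldConf(), task.done) to create a ballot for the current task and put it into the ballot box. Note that the ballot carries the application-level callback, task.done.
  2. It calls this.logManager.appendEntries(entries, new LeaderStableClosure(entries)) to write the log. This creates a LeaderStableClosure callback, whose job is to commit the log locally on the leader.
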
    private void executeApplyingTasks(final List<LogEntryAndClosure> tasks) {
        // Check whether the logManager is busy so that we can fail fast
        if (!this.logManager.hasAvailableCapacityToAppendEntries(1)) {
            // It's overload, fail-fast
            final List<Closure> dones = tasks.stream().map(ele -> ele.done).filter(Objects::nonNull)
                    .collect(Collectors.toList());
            ThreadPoolsFactory.runInThread(this.groupId, () -> {
                for (final Closure done : dones) {
                    done.run(new Status(RaftError.EBUSY, "Node %s log manager is busy.", this.getNodeId()));
                }
            });
            return;
        }

        this.writeLock.lock();
        try {
            final int size = tasks.size();
            if (this.state != State.STATE_LEADER) {
                // Check the node state
                final Status st = new Status();
                if (this.state != State.STATE_TRANSFERRING) {
                    st.setError(RaftError.EPERM, "Is not leader.");
                } else {
                    st.setError(RaftError.EBUSY, "Is transferring leadership.");
                }
                LOG.debug("Node {} can't apply, status={}.", getNodeId(), st);
                final List<Closure> dones = tasks.stream().map(ele -> ele.done)
                        .filter(Objects::nonNull).collect(Collectors.toList());
                ThreadPoolsFactory.runInThread(this.groupId, () -> {
                    for (final Closure done : dones) {
                        done.run(st);
                    }
                });
                return;
            }
            final List<LogEntry> entries = new ArrayList<>(size);
            for (int i = 0; i < size; i++) {
                final LogEntryAndClosure task = tasks.get(i);
                // The expected term does not match the current node's term, so the write fails
                if (task.expectedTerm != -1 && task.expectedTerm != this.currTerm) {
                    LOG.debug("Node {} can't apply task whose expectedTerm={} doesn't match currTerm={}.", getNodeId(),
                        task.expectedTerm, this.currTerm);
                    if (task.done != null) {
                        final Status st = new Status(RaftError.EPERM, "expected_term=%d doesn't match current_term=%d",
                            task.expectedTerm, this.currTerm);
                        ThreadPoolsFactory.runClosureInThread(this.groupId, task.done, st);
                        task.reset();
                    }
                    continue;
                }
                // Create the ballot for this entry and add it to the ballotBox
                if (!this.ballotBox.appendPendingTask(this.conf.getConf(),
                    this.conf.isStable() ? null : this.conf.getOldConf(), task.done)) {
                    ThreadPoolsFactory.runClosureInThread(this.groupId, task.done, new Status(RaftError.EINTERNAL, "Fail to append task."));
                    task.reset();
                    continue;
                }
                // set task entry info before adding to list.
                task.entry.getId().setTerm(this.currTerm);
                task.entry.setType(EnumOutter.EntryType.ENTRY_TYPE_DATA);
                entries.add(task.entry);
                task.reset();
            }
            // Call logManager to append the entries, with a new LeaderStableClosure as the callback
            this.logManager.appendEntries(entries, new LeaderStableClosure(entries));
            // update conf.first
            checkAndSetConfiguration(true);
        } finally {
            this.writeLock.unlock();
        }
    }
class LeaderStableClosure extends LogManager.StableClosure {

    public LeaderStableClosure(final List<LogEntry> entries) {
        super(entries);
    }

    @Override
    public void run(final Status status) {
        if (status.isOk()) {
            // Commit the log locally
            NodeImpl.this.ballotBox.commitAt(this.firstLogIndex, this.firstLogIndex + this.nEntries - 1,
                NodeImpl.this.serverId);
        } else {
            LOG.error("Node {} append [{}, {}] failed, status={}.", getNodeId(), this.firstLogIndex,
                this.firstLogIndex + this.nEntries - 1, status);
        }
    }
}
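
To make the role of commitAt concrete, here is a minimal majority-vote tracker. It is a sketch under simplifying assumptions, not jraft's BallotBox: the class BallotBoxSketch is hypothetical, peers are plain strings, there is no joint-consensus (old configuration) handling, and no closures are invoked. It only shows that the leader's local write counts as one vote and that an entry commits once a quorum has stored it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical majority-vote tracker; jraft's real BallotBox is considerably more involved. */
final class BallotBoxSketch {
    private final int quorum;
    // log index -> peers known to have persisted that entry
    private final Map<Long, Set<String>> votes = new HashMap<>();
    private long lastCommittedIndex = 0;

    BallotBoxSketch(final int peerCount) {
        this.quorum = peerCount / 2 + 1;
    }

    /** Records that peer has persisted [firstIndex, lastIndex]; returns the committed index. */
    long commitAt(final long firstIndex, final long lastIndex, final String peer) {
        // Votes for entries that are already committed are ignored.
        for (long i = Math.max(firstIndex, lastCommittedIndex + 1); i <= lastIndex; i++) {
            votes.computeIfAbsent(i, k -> new HashSet<>()).add(peer);
        }
        // Advance the commit point only over a contiguous run of majority-acknowledged entries.
        while (true) {
            final Set<String> granted = votes.get(lastCommittedIndex + 1);
            if (granted == null || granted.size() < quorum) {
                break;
            }
            votes.remove(lastCommittedIndex + 1);
            lastCommittedIndex++;
        }
        return lastCommittedIndex;
    }
}
```

In a three-node group, the leader's own vote for entries 1 and 2 commits nothing; one follower acknowledging entry 1 commits it, and a second follower acknowledging both entries then commits entry 2.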

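Next, look at logManager.appendEntries. Its core logic is:

  1. On a non-leader node, try to resolve conflicts; on the leader, assign indexes to the entries.
  2. If conflict resolution succeeds, set a checksum on each entry if configured to do so.
  3. If entries is not empty, add the entries to memory and set the callback's firstLogIndex to the index of the first entry.
  4. If entries is not empty, wake up all replicators; note that this only applies on the leader.
  5. Publish a flush-to-disk event. This involves the diskDisruptor and is examined in the next subsection. Note that the original StableClosure callback is passed to that Disruptor, and the callback carries the entries to be written.
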
@Override
public void appendEntries(final List<LogEntry> entries, final StableClosure done) {
    assert(done != null);

    Requires.requireNonNull(done, "done");
    if (this.hasError) {
        // Error case
        entries.clear();
        ThreadPoolsFactory.runClosureInThread(this.groupId, done, new Status(RaftError.EIO, "Corrupted LogStorage"));
        return;
    }
    boolean doUnlock = true;
    this.writeLock.lock();
    try {
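        // Try to resolve append conflicts: when the leader distributes entries to a peer they may conflict with the peer's local log; the peer tries to resolve the conflict, and false is returned on failure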
        if (!entries.isEmpty() && !checkAndResolveConflict(entries, done, this.writeLock)) {
            // If checkAndResolveConflict returns false, the done will be called in it.
            entries.clear();
            return;
        }
        for (int i = 0; i < entries.size(); i++) {
            final LogEntry entry = entries.get(i);
            // Set checksum after checkAndResolveConflict
            // Set the checksum
            if (this.raftOptions.isEnableLogEntryChecksum()) {
                entry.setChecksum(entry.checksum());
            }
            // Configuration change entry
            if (entry.getType() == EntryType.ENTRY_TYPE_CONFIGURATION) {
                Configuration oldConf = new Configuration();
                if (entry.getOldPeers() != null) {
                    oldConf = new Configuration(entry.getOldPeers(), entry.getOldLearners());
                }
                final ConfigurationEntry conf = new ConfigurationEntry(entry.getId(),
                    new Configuration(entry.getPeers(), entry.getLearners()), oldConf);
                this.configManager.add(conf);
            }
        }
        if (!entries.isEmpty()) {
            // Add to memory
            done.setFirstLogIndex(entries.get(0).getId().getIndex());
            this.logsInMemory.addAll(entries);
        }
        done.setEntries(entries);

        doUnlock = false;
        // Wake up the waiting replicators
        if (!wakeupAllWaiter(this.writeLock)) {
            notifyLastLogIndexListeners();
        }

        // publish event out of lock
        // Publish the flush-to-disk event; note that the closure is passed along
        this.diskQueue.publishEvent((event, sequence) -> {
          event.reset();
          event.type = EventType.OTHER;
          event.done = done;
        });
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}

3.1.2. DiskDisruptor

As with applyDisruptor, we only need to care about the handler here, and to know that events are consumed by a single thread.

this.disruptor = DisruptorBuilder.<StableClosureEvent> newInstance() //
        .setEventFactory(new StableClosureEventFactory()) //
        .setRingBufferSize(opts.getDisruptorBufferSize()) //
        .setThreadFactory(new NamedThreadFactory("JRaft-LogManager-Disruptor-", true)) //
        .setProducerType(ProducerType.MULTI) //
        /*
         *  Use timeout strategy in log manager. If timeout happens, it will call reportError to halt the node.
         */
        .setWaitStrategy(new TimeoutBlockingWaitStrategy(
            this.raftOptions.getDisruptorPublishEventWaitTimeoutSecs(), TimeUnit.SECONDS)) //
        .build();
this.disruptor.handleEventsWith(new StableClosureEventHandler());
this.disruptor.setDefaultExceptionHandler(new LogExceptionHandler<Object>(this.getClass().getSimpleName(),
        (event, ex) -> reportError(-1, "LogManager handle event error")));
this.diskQueue = this.disruptor.start();

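The core logic is implemented by StableClosureEventHandler, which works as follows:

  1. On a shutdown event, force-flush the data to disk
  2. If entries is not empty, call AppendBatcher.append to add the entries to the buffer
  3. Handle the LAST_LOG_ID, TRUNCATE_PREFIX, TRUNCATE_SUFFIX, and RESET events; before handling any of them, AppendBatcher is called to write the buffered entries to disk
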
private class StableClosureEventHandler implements EventHandler<StableClosureEvent> {
        LogId               lastId  = LogManagerImpl.this.diskId;
        List<StableClosure> storage = new ArrayList<>(256);
        AppendBatcher appendBatcher = new AppendBatcher(this.storage, 256, new ArrayList<>(),
                                        LogManagerImpl.this.diskId);

        @Override
        public void onEvent(final StableClosureEvent event, final long sequence, final boolean endOfBatch)
                                                                                                          throws Exception {
            if (event.type == EventType.SHUTDOWN) {
                // Shutdown event: flush the buffered data to disk
                this.lastId = this.appendBatcher.flush();
                setDiskId(this.lastId);
                LogManagerImpl.this.shutDownLatch.countDown();
                event.reset();
                return;
            }
            final StableClosure done = event.done;
            final EventType eventType = event.type;

            event.reset();

            if (done.getEntries() != null && !done.getEntries().isEmpty()) {
                // appendEntries event: add the entries to the buffer
                this.appendBatcher.append(done);
            } else {
                this.lastId = this.appendBatcher.flush();
                boolean ret = true;
                switch (eventType) {
                    // Triggered when the node calls LogManager#getLastLogId; elections, log writes, and so on all call this method
                    case LAST_LOG_ID:
                        ((LastLogIdClosure) done).setLastLogId(this.lastId.copy());
                        break;
                    // Triggered when the prefix of the local log is removed, usually while installing a snapshot
                    case TRUNCATE_PREFIX:
                        long startMs = Utils.monotonicMs();
                        try {
                            final TruncatePrefixClosure tpc = (TruncatePrefixClosure) done;
                            LOG.debug("Truncating storage to firstIndexKept={}.", tpc.firstIndexKept);
                            ret = LogManagerImpl.this.logStorage.truncatePrefix(tpc.firstIndexKept);
                        } finally {
                            LogManagerImpl.this.nodeMetrics.recordLatency("truncate-log-prefix", Utils.monotonicMs()
                                                                                                 - startMs);
                        }
                        break;
                    // Triggered when the suffix of the local log is removed, usually while resolving conflicts when the leader sends entries to a follower
                    case TRUNCATE_SUFFIX:
                        startMs = Utils.monotonicMs();
                        try {
                            final TruncateSuffixClosure tsc = (TruncateSuffixClosure) done;
                            LOG.warn("Truncating storage to lastIndexKept={}.", tsc.lastIndexKept);
                            ret = LogManagerImpl.this.logStorage.truncateSuffix(tsc.lastIndexKept);
                            if (ret) {
                                this.lastId.setIndex(tsc.lastIndexKept);
                                this.lastId.setTerm(tsc.lastTermKept);
                                Requires.requireTrue(this.lastId.getIndex() == 0 || this.lastId.getTerm() != 0);
                            }
                        } finally {
                            LogManagerImpl.this.nodeMetrics.recordLatency("truncate-log-suffix", Utils.monotonicMs()
                                                                                                 - startMs);
                        }
                        break;
                    // Triggered when the logManager is reset
                    case RESET:
                        final ResetClosure rc = (ResetClosure) done;
                        LOG.info("Resetting storage to nextLogIndex={}.", rc.nextLogIndex);
                        ret = LogManagerImpl.this.logStorage.reset(rc.nextLogIndex);
                        break;
                    default:
                        break;
                }

                if (!ret) {
                    reportError(RaftError.EIO.getNumber(), "Failed operation in LogStorage");
                } else {
                    // Run the callback
                    done.run(Status.OK());
                }
            }
            if (endOfBatch) {
                this.lastId = this.appendBatcher.flush();
                setDiskId(this.lastId);
            }
        }

    }

Now look at AppendBatcher#append. Its core logic is:

  1. Put the task into the storage list
  2. Put the entries into the toAppend list

List<StableClosure> storage;
List<LogEntry> toAppend;
void append(final StableClosure done) {
    if (this.size == this.cap || this.bufferSize >= LogManagerImpl.this.raftOptions.getMaxAppendBufferSize()) {
        flush();
    }
    this.storage.add(done);
    this.size++;
    this.toAppend.addAll(done.getEntries());
    for (final LogEntry entry : done.getEntries()) {
        this.bufferSize += entry.getData() != null ? entry.getData().remaining() : 0;
    }
}

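AppendBatcher#flush writes the buffered entries to persistent storage, which is RocksDB by default. It then invokes the StableClosure callback; on the leader this is LeaderStableClosure, which commits the entries in the ballot box on the leader's behalf.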

LogId flush() {
    if (this.size > 0) {
        this.lastId = appendToStorage(this.toAppend);
        for (int i = 0; i < this.size; i++) {
            this.storage.get(i).getEntries().clear();
            Status st = null;
            try {
                if (LogManagerImpl.this.hasError) {
                    st = new Status(RaftError.EIO, "Corrupted LogStorage");
                } else {
                    st = Status.OK();
                }
                this.storage.get(i).run(st);
            } catch (Throwable t) {
                LOG.error("Fail to run closure with status: {}.", st, t);
            }
        }
        this.toAppend.clear();
        this.storage.clear();

    }
    this.size = 0;
    this.bufferSize = 0;
    return this.lastId;
}

class LeaderStableClosure extends LogManager.StableClosure {

    public LeaderStableClosure(final List<LogEntry> entries) {
        super(entries);
    }

    @Override
    public void run(final Status status) {
        if (status.isOk()) {
            // Commit locally: the leader casts its own vote for these entries
            NodeImpl.this.ballotBox.commitAt(this.firstLogIndex, this.firstLogIndex + this.nEntries - 1,
                NodeImpl.this.serverId);
        } else {
            LOG.error("Node {} append [{}, {}] failed, status={}.", getNodeId(), this.firstLogIndex,
                this.firstLogIndex + this.nEntries - 1, status);
        }
    }
}
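The batching behaviour described above can be sketched in isolation. This is a minimal illustration and not jraft's code: `BatcherSketch`, `Task`, and the in-memory `disk` list are hypothetical stand-ins for AppendBatcher, StableClosure, and LogStorage. It only shows the two ideas from the listing: flush before appending when the batch is full, and run every task's callback after a single batched write.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

class BatcherSketch {
    /** Hypothetical stand-in for StableClosure: entries plus a completion callback. */
    static final class Task {
        final List<byte[]> entries;
        final Consumer<Boolean> done;

        Task(List<byte[]> entries, Consumer<Boolean> done) {
            this.entries = entries;
            this.done = done;
        }
    }

    static final class Batcher {
        private final int cap;            // max number of tasks per batch
        private final int maxBufferBytes; // max buffered payload per batch
        private final List<byte[]> disk;  // assumption: a list plays the role of LogStorage
        private final List<Task> tasks = new ArrayList<>();
        private final List<byte[]> toAppend = new ArrayList<>();
        private int bufferBytes;

        Batcher(int cap, int maxBufferBytes, List<byte[]> disk) {
            this.cap = cap;
            this.maxBufferBytes = maxBufferBytes;
            this.disk = disk;
        }

        void append(Task task) {
            // Flush first if the batch is already full by count or by size.
            if (tasks.size() == cap || bufferBytes >= maxBufferBytes) {
                flush();
            }
            tasks.add(task);
            toAppend.addAll(task.entries);
            for (byte[] e : task.entries) {
                bufferBytes += e.length;
            }
        }

        void flush() {
            if (!tasks.isEmpty()) {
                disk.addAll(toAppend);   // one batched "write"
                for (Task t : tasks) {
                    t.done.accept(true); // notify each task after the write
                }
                tasks.clear();
                toAppend.clear();
            }
            bufferBytes = 0;
        }
    }

    public static void main(String[] args) {
        List<byte[]> disk = new ArrayList<>();
        Batcher batcher = new Batcher(2, 1024, disk);
        for (int i = 0; i < 3; i++) {
            final int n = i;
            batcher.append(new Task(Collections.singletonList(new byte[8]),
                ok -> System.out.println("task " + n + " persisted: " + ok)));
        }
        batcher.flush(); // plays the role of the endOfBatch flush
        System.out.println("entries on disk: " + disk.size());
    }
}
```

With a capacity of 2, the third append triggers a flush of the first two tasks, and the final explicit flush corresponds to the `endOfBatch` branch in the Disruptor handler.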

AppendBatcher#flush persists the entries through LogManagerImpl#appendToStorage, which in turn calls LogStorage#appendEntries to write them to disk.

LogManagerImpl#appendToStorage
private LogId appendToStorage(final List<LogEntry> toAppend) {
    LogId lastId = null;
    if (!this.hasError) {
        final long startMs = Utils.monotonicMs();
        final int entriesCount = toAppend.size();
        this.nodeMetrics.recordSize("append-logs-count", entriesCount);
        try {
            int writtenSize = 0;
            for (int i = 0; i < entriesCount; i++) {
                final LogEntry entry = toAppend.get(i);
                writtenSize += entry.getData() != null ? entry.getData().remaining() : 0;
            }
            this.nodeMetrics.recordSize("append-logs-bytes", writtenSize);
            final int nAppent = this.logStorage.appendEntries(toAppend);
            if (nAppent != entriesCount) {
                LOG.error("**Critical error**, fail to appendEntries, nAppent={}, toAppend={}", nAppent,
                    toAppend.size());
                reportError(RaftError.EIO.getNumber(), "Fail to append log entries");
            }
            if (nAppent > 0) {
                lastId = toAppend.get(nAppent - 1).getId();
            }
            toAppend.clear();
        } finally {
            this.nodeMetrics.recordLatency("append-logs", Utils.monotonicMs() - startMs);
        }
    }
    return lastId;
}

RocksDBLogStorage#appendEntries
@Override
public int appendEntries(final List<LogEntry> entries) {
    if (entries == null || entries.isEmpty()) {
        return 0;
    }
    final int entriesCount = entries.size();
    final boolean ret = executeBatch(batch -> {
        final WriteContext writeCtx = newWriteContext();
        for (int i = 0; i < entriesCount; i++) {
            final LogEntry entry = entries.get(i);
            if (entry.getType() == EntryType.ENTRY_TYPE_CONFIGURATION) {
                addConfBatch(entry, batch);
            } else {
                writeCtx.startJob();
                addDataBatch(entry, batch, writeCtx);
            }
        }
        writeCtx.joinAll();
        doSync();
    });

    if (ret) {
        return entriesCount;
    } else {
        return 0;
    }
}

3.1.3. BallotBox

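The previous subsections walked through how the leader persists log entries. This subsection focuses on committing: when is an entry on the leader considered committed, and how does jraft implement the majority-commit rule?

  1. Leader local commit

As mentioned in the earlier subsection, once AppendBatcher finishes its flush it calls back LeaderStableClosure, which casts the leader's vote for the entries, so the details are not repeated here.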

AppendBatcher#flush
LogId flush() {
    if (this.size > 0) {
        this.lastId = appendToStorage(this.toAppend);
        for (int i = 0; i < this.size; i++) {
            this.storage.get(i).getEntries().clear();
            Status st = null;
            try {
                if (LogManagerImpl.this.hasError) {
                    st = new Status(RaftError.EIO, "Corrupted LogStorage");
                } else {
                    st = Status.OK();
                }
                // LeaderStableClosure#run
                this.storage.get(i).run(st);
            } catch (Throwable t) {
                LOG.error("Fail to run closure with status: {}.", st, t);
            }
        }
        this.toAppend.clear();
        this.storage.clear();

    }
    this.size = 0;
    this.bufferSize = 0;
    return this.lastId;
}

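  2. Follower vote commit

Besides the local vote obtained by persisting the entries itself, the leader also needs votes from followers before it can commit. BallotBox#commitAt is called from only two places: one is the LeaderStableClosure we have already seen, and the other is the Replicator (the component responsible for replicating logs from the leader to a peer).

Replicator#onAppendEntriesReturned is the callback that runs after the Replicator receives an AppendEntriesResponse. The core logic related to voting is:

If the node behind the current Replicator is a follower (not a learner), vote for the entries covered by the request just acknowledged.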

    private static boolean onAppendEntriesReturned(final ThreadId id, final Inflight inflight, final Status status,
                                                   final AppendEntriesRequest request,
                                                   final AppendEntriesResponse response, final long rpcSendTime,
                                                   final long startTimeMs, final Replicator r) {
        // Some code omitted
        final int entriesSize = request.getEntriesCount();
        if (entriesSize > 0) {
            if (r.options.getReplicatorType().isFollower()) {
                // Vote to commit the replicated entries
                // Only commit index when the response is from follower.
                r.options.getBallotBox().commitAt(r.nextIndex, r.nextIndex + entriesSize - 1, r.options.getPeerId());
            }
            if (LOG.isDebugEnabled()) {
                LOG.debug("Replicated logs in [{}, {}] to peer {}", r.nextIndex, r.nextIndex + entriesSize - 1,
                    r.options.getPeerId());
            }
        }

        r.setState(State.Replicate);
        r.blockTimer = null;
        r.nextIndex += entriesSize;
        r.hasSucceeded = true;
        r.notifyOnCaughtUp(RaftError.SUCCESS.getNumber(), false);
        // dummy_id is unlock in _send_entries
        if (r.timeoutNowIndex > 0 && r.timeoutNowIndex < r.nextIndex) {
            r.sendTimeoutNow(false, false);
        }
        return true;
    }

3. BallotBox#commitAt

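Finally, look at what BallotBox#commitAt actually does to get entries committed.

For each index in the given range it grants the peer's vote on that entry's ballot. The largest index whose ballot reaches a quorum becomes the new committed index, and every entry up to it is treated as committed. The committed index is then handed to the FSM (finite state machine) caller so the entries can be applied.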

public boolean commitAt(final long firstLogIndex, final long lastLogIndex, final PeerId peer) {
    // TODO  use lock-free algorithm here?
    final long stamp = this.stampedLock.writeLock();
    long lastCommittedIndex = 0;
    try {
        // 1. The ballot box has not been initialized yet (no pending index)
        if (this.pendingIndex == 0) {
            return false;
        }
        // 2. These entries have already been committed
        if (lastLogIndex < this.pendingIndex) {
            return true;
        }

        // 3. The index is beyond the range of pending ballots
        if (lastLogIndex >= this.pendingIndex + this.pendingMetaQueue.size()) {
            throw new ArrayIndexOutOfBoundsException();
        }

        // 4. Start from the larger of pendingIndex and firstLogIndex (entries before pendingIndex are already committed)
        final long startAt = Math.max(this.pendingIndex, firstLogIndex);
        Ballot.PosHint hint = new Ballot.PosHint();
        for (long logIndex = startAt; logIndex <= lastLogIndex; logIndex++) {
            final Ballot bl = this.pendingMetaQueue.get((int) (logIndex - this.pendingIndex));
            hint = bl.grant(peer, hint);
            // If the ballot reached a quorum, advance the committed index
            if (bl.isGranted()) {
                lastCommittedIndex = logIndex;
            }
        }
        if (lastCommittedIndex == 0) {
            return true;
        }
        // When removing a peer off the raft group which contains even number of
        // peers, the quorum would decrease by 1, e.g. 3 of 4 changes to 2 of 3. In
        // this case, the log after removal may be committed before some previous
        // logs, since we use the new configuration to deal the quorum of the
        // removal request, we think it's safe to commit all the uncommitted
        // previous logs, which is not well proved right now
        this.pendingMetaQueue.removeFromFirst((int) (lastCommittedIndex - this.pendingIndex) + 1);
        LOG.debug("Node {} committed log fromIndex={}, toIndex={}.", this.opts.getNodeId(), this.pendingIndex,
            lastCommittedIndex);
        this.pendingIndex = lastCommittedIndex + 1;
        this.lastCommittedIndex = lastCommittedIndex;
    } finally {
        this.stampedLock.unlockWrite(stamp);
    }
    // 5. Notify the FSM caller so the entries get applied to the state machine
    this.waiter.onCommitted(lastCommittedIndex);
    return true;
}
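To make the quorum bookkeeping easier to follow, here is a stripped-down sketch of the same idea. It is an illustration under assumptions, not jraft's implementation: `BallotBoxSketch` and its nested classes are hypothetical, peers are plain strings, a ballot is a set of voters compared against `peerCount / 2 + 1`, and locking and the position hint are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BallotBoxSketch {
    /** One ballot per pending log index; granted once a majority has voted. */
    static final class Ballot {
        private final Set<String> voters = new HashSet<>();
        private final int quorum;

        Ballot(int peerCount) {
            this.quorum = peerCount / 2 + 1;
        }

        void grant(String peer) {
            voters.add(peer); // a set, so a repeated vote is not counted twice
        }

        boolean isGranted() {
            return voters.size() >= quorum;
        }
    }

    static final class BallotBox {
        private final int peerCount;
        private final List<Ballot> pending = new ArrayList<>(); // pending.get(i) is index pendingIndex + i
        private long pendingIndex;
        private long lastCommittedIndex;

        BallotBox(int peerCount, long firstPendingIndex) {
            this.peerCount = peerCount;
            this.pendingIndex = firstPendingIndex;
        }

        void appendPendingTask() {
            pending.add(new Ballot(peerCount));
        }

        boolean commitAt(long firstLogIndex, long lastLogIndex, String peer) {
            if (pendingIndex == 0) {
                return false; // not initialized
            }
            if (lastLogIndex < pendingIndex) {
                return true;  // already committed
            }
            if (lastLogIndex >= pendingIndex + pending.size()) {
                throw new IndexOutOfBoundsException("index " + lastLogIndex);
            }
            long committed = 0;
            for (long i = Math.max(pendingIndex, firstLogIndex); i <= lastLogIndex; i++) {
                Ballot ballot = pending.get((int) (i - pendingIndex));
                ballot.grant(peer);
                if (ballot.isGranted()) {
                    committed = i;
                }
            }
            if (committed == 0) {
                return true;  // vote recorded, no quorum yet
            }
            // Everything up to the committed index leaves the pending queue.
            pending.subList(0, (int) (committed - pendingIndex) + 1).clear();
            pendingIndex = committed + 1;
            lastCommittedIndex = committed;
            return true;
        }

        long lastCommittedIndex() {
            return lastCommittedIndex;
        }
    }

    public static void main(String[] args) {
        BallotBox box = new BallotBox(3, 1);
        for (int i = 0; i < 3; i++) {
            box.appendPendingTask(); // indexes 1..3
        }
        box.commitAt(1, 3, "leader");
        box.commitAt(1, 2, "follower-1");
        System.out.println("committed up to " + box.lastCommittedIndex());
    }
}
```

In the three-node example the leader's vote alone commits nothing; the first follower acknowledging indexes 1 to 2 is what moves the committed index forward.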

3.1.4. FSM

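After the BallotBox advances the committed index, FSMCallerImpl#onCommitted wraps it in an ApplyTask of type COMMITTED and publishes it to the FSMCaller's Disruptor queue.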

@Override
public boolean onCommitted(final long committedIndex) {
    return enqueueTask((task, sequence) -> {
        task.type = TaskType.COMMITTED;
        task.committedIndex = committedIndex;
    });
}
private boolean enqueueTask(final EventTranslator<ApplyTask> tpl) {
    if (this.shutdownLatch != null) {
        // Shutting down
        LOG.warn("FSMCaller is stopped, can not apply new task.");
        return false;
    }
    
    this.taskQueue.publishEvent(tpl);
    return true;
}

ApplyTaskHandler#onEvent consumes these tasks and calls runApplyTask.

private class ApplyTaskHandler implements EventHandler<ApplyTask> {
    boolean      firstRun          = true;
    // max committed index in current batch, reset to -1 every batch
    private long maxCommittedIndex = -1;

    @Override
    public void onEvent(final ApplyTask event, final long sequence, final boolean endOfBatch) throws Exception {
        setFsmThread();
        this.maxCommittedIndex = runApplyTask(event, this.maxCommittedIndex, endOfBatch);
    }

    private void setFsmThread() {
        if (firstRun) {
            fsmThread = Thread.currentThread();
            firstRun = false;
        }
    }
}


private long runApplyTask(final ApplyTask task, long maxCommittedIndex, final boolean endOfBatch) {
        CountDownLatch shutdown = null;
        if (task.type == TaskType.COMMITTED) {
            // Track the largest committed index seen in this batch
            if (task.committedIndex > maxCommittedIndex) {
                maxCommittedIndex = task.committedIndex;
            }
            task.reset();
        } else {
            // Some code omitted
        }
        try {
            if (endOfBatch && maxCommittedIndex >= 0) {
                this.currTask = TaskType.COMMITTED;
                // Apply entries up to the largest committed index to the state machine
                doCommitted(maxCommittedIndex);
                maxCommittedIndex = -1L; // reset maxCommittedIndex
            }
            this.currTask = TaskType.IDLE;
            return maxCommittedIndex;
        } finally {
            if (shutdown != null) {
                shutdown.countDown();
            }
        }
    }

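This ends up in FSMCallerImpl#doCommitted, whose core logic is:

  1. Apply the operations contained in the log entries to the state machine
  2. Run the closures; these are the closures passed in by the application layer, typically used to set the response or record metrics
  3. Update lastAppliedIndex and lastAppliedTerm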
private void doCommitted(final long committedIndex) {
    if (!this.error.getStatus().isOk()) {
        return;
    }
    final long lastAppliedIndex = this.lastAppliedIndex.get();
    // We can tolerate the disorder of committed_index
    if (lastAppliedIndex >= committedIndex) {
        return;
    }
    this.lastCommittedIndex.set(committedIndex);
    final long startMs = Utils.monotonicMs();
    try {
        final List<Closure> closures = new ArrayList<>();
        final List<TaskClosure> taskClosures = new ArrayList<>();
        final long firstClosureIndex = this.closureQueue.popClosureUntil(committedIndex, closures, taskClosures);

        // Calls TaskClosure#onCommitted if necessary
        onTaskCommitted(taskClosures);

        Requires.requireTrue(firstClosureIndex >= 0, "Invalid firstClosureIndex");
        // Wrap the entries to apply in an iterator
        final IteratorImpl iterImpl = new IteratorImpl(this, this.logManager, closures, firstClosureIndex,
            lastAppliedIndex, committedIndex, this.applyingIndex);
        while (iterImpl.isGood()) {
            final LogEntry logEntry = iterImpl.entry();
            if (logEntry.getType() != EnumOutter.EntryType.ENTRY_TYPE_DATA) {
                if (logEntry.getType() == EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION) {
                    if (logEntry.getOldPeers() != null && !logEntry.getOldPeers().isEmpty()) {
                        // Joint stage is not supposed to be noticeable by end users.
                        this.fsm.onConfigurationCommitted(new Configuration(iterImpl.entry().getPeers()));
                    }
                }
                if (iterImpl.done() != null) {
                    // For other entries, we have nothing to do besides flush the
                    // pending tasks and run this closure to notify the caller that the
                    // entries before this one were successfully committed and applied.
                    // Run the application-level closure
                    iterImpl.done().run(Status.OK());
                }
                iterImpl.next();
                continue;
            }

            // Apply data task to user state machine
            doApplyTasks(iterImpl);
        }

        if (iterImpl.hasError()) {
            setError(iterImpl.getError());
            iterImpl.runTheRestClosureWithError();
        }
        long lastIndex = iterImpl.getIndex() - 1;
        final long lastTerm = this.logManager.getTerm(lastIndex);
        // Update lastAppliedIndex and lastAppliedTerm
        setLastApplied(lastIndex, lastTerm);
    } finally {
        this.nodeMetrics.recordLatency("fsm-commit", Utils.monotonicMs() - startMs);
    }
}
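The interplay between runApplyTask and doCommitted is essentially "coalesce commit notifications within a Disruptor batch, then apply once". A minimal sketch of that pattern follows. `ApplyBatchSketch.Handler` is a hypothetical class, and the `applied` list merely records which committed indexes would have been applied; no real state machine or Disruptor is involved.

```java
import java.util.ArrayList;
import java.util.List;

class ApplyBatchSketch {
    static final class Handler {
        private long maxCommittedIndex = -1; // reset to -1 after every batch
        private long lastAppliedIndex = 0;
        final List<Long> applied = new ArrayList<>(); // committed indexes actually applied

        /** Mirrors the shape of an event handler: one call per event, with an end-of-batch flag. */
        void onEvent(long committedIndex, boolean endOfBatch) {
            if (committedIndex > maxCommittedIndex) {
                maxCommittedIndex = committedIndex;
            }
            if (endOfBatch && maxCommittedIndex >= 0) {
                doCommitted(maxCommittedIndex);
                maxCommittedIndex = -1;
            }
        }

        private void doCommitted(long committedIndex) {
            // Out-of-order or stale commit notifications are tolerated and ignored.
            if (lastAppliedIndex >= committedIndex) {
                return;
            }
            applied.add(committedIndex);
            lastAppliedIndex = committedIndex;
        }

        long lastAppliedIndex() {
            return lastAppliedIndex;
        }
    }

    public static void main(String[] args) {
        Handler handler = new Handler();
        handler.onEvent(3, false);
        handler.onEvent(5, false);
        handler.onEvent(4, true); // end of batch: applies once, up to the largest index seen
        System.out.println("applied calls: " + handler.applied
            + ", lastApplied=" + handler.lastAppliedIndex());
    }
}
```

Three commit notifications in one batch result in a single apply pass, which is the reason for tracking only the maximum index instead of applying per event.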

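3.1.5. Summary

So far we have walked through the core logic of log replication on the leader side. The key components are:

  1. NodeImpl: the implementation of a Raft node
  2. LogManagerImpl: the implementation of log management
  3. ApplyDisruptor: the queue consumed when the leader handles write requests
  4. DiskDisruptor: the queue consumed for persistence events
  5. FSMCallerImpl: the proxy in front of the finite state machine
  6. Replicator: the component responsible for log replication, heartbeats, and related work from the leader to a peer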

3.2. Replicator

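The previous section covered the whole flow of the leader accepting a log entry, committing it, and applying it to the state machine. One piece is still missing: how the leader replicates entries to its peers (both learners and followers). This section looks at how the Replicator sends log entries to a peer.

ReplicatorGroupImpl#addReplicator creates a new Replicator and adds it to the current group.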

@Override
public boolean addReplicator(final PeerId peer, final ReplicatorType replicatorType, final boolean sync) {
    // Some code omitted
    final ThreadId rid = Replicator.start(opts, this.raftOptions);
    // Some code omitted
    return this.replicatorMap.put(peer, rid) == null;
}

Replicator.start is where the Replicator is actually created. Its core logic is:

  1. Notify the Replicator status listeners
  2. Start the heartbeatTimer
  3. Send a probe request
public static ThreadId start(final ReplicatorOptions opts, final RaftOptions raftOptions) {
    // Some code omitted
    final Replicator r = new Replicator(opts, raftOptions);
    if (!r.rpcService.connect(opts.getPeerId().getEndpoint())) {
        LOG.error("Fail to init sending channel to {}, group: {}.", opts.getPeerId(), opts.getGroupId());
        // Return and it will be retried later.
        return null;
    }
    // Some code omitted
    // Start replication
    r.id = new ThreadId(r, r);
    r.id.lock();
    notifyReplicatorStatusListener(r, ReplicatorEvent.CREATED);
    LOG.info("Replicator [group: {}, peer: {}, type: {}] is started", r.options.getGroupId(), r.options.getPeerId(), r.options.getReplicatorType());
    r.catchUpClosure = null;
    r.lastRpcSendTimestamp = Utils.monotonicMs();
    r.startHeartbeatTimer(Utils.nowMs());
    // id.unlock in sendEmptyEntries
    // Send the probe request
    r.sendProbeRequest();
    return r.id;
}

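From Section 2 (Probe) we know that the probe's job is to detect the difference between the leader's log and the peer's log and, under certain conditions, to reconcile that difference before log entries are sent to the peer.

When the logs of the leader and the peer either do not conflict or the conflict can be resolved, and the leader has more entries than the peer, continueSendEntries is true and Replicator#sendEntries is eventually called to send the entries.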

    static void onRpcReturned(final ThreadId id, final RequestType reqType, final Status status, final Message request,
                              final Message response, final int seq, final int stateVersion, final long rpcSendTime) {

        // Some code omitted
        try {
        // Some code omitted
                try {
                    switch (queuedPipelinedResponse.requestType) {
                        case AppendEntries:
                            // For the callback of a probe request, the request type is AppendEntries
                            continueSendEntries = onAppendEntriesReturned(id, inflight, queuedPipelinedResponse.status,
                                (AppendEntriesRequest) queuedPipelinedResponse.request,
                                (AppendEntriesResponse) queuedPipelinedResponse.response, rpcSendTime, startTimeMs, r);
                            break;
                        case Snapshot:
                            continueSendEntries = onInstallSnapshotReturned(id, r, queuedPipelinedResponse.status,
                                (InstallSnapshotRequest) queuedPipelinedResponse.request,
                                (InstallSnapshotResponse) queuedPipelinedResponse.response);
                            break;
                    }
                } finally {
                     // Some code omitted
                }
            }
        } finally {
        // Some code omitted
            if (continueSendEntries) {
                // Keep sending requests
                r.sendEntries();
            }
        }
    }

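The core logic of Replicator#sendEntries is:

  1. If the entries the peer needs have already been compacted, the leader synchronizes the peer by sending a snapshot
  2. If the peer's log is already in sync with the leader's, the replicator puts itself into a waiting state
  3. Otherwise it builds an AppendEntriesRequest and sends the entries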
void sendEntries() {
    boolean doUnlock = true;
    try {
        long prevSendIndex = -1;
        while (true) {
            final long nextSendingIndex = getNextSendIndex();
            if (nextSendingIndex > prevSendIndex) {
                if (sendEntries(nextSendingIndex)) {
                    prevSendIndex = nextSendingIndex;
                } else {
                    doUnlock = false;
                    // id already unlock in sendEntries when it returns false.
                    break;
                }
            } else {
                break;
            }
        }
    } finally {
        if (doUnlock) {
            unlockId();
        }
    }

}


private boolean sendEntries(final long nextSendingIndex) {
    final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
    // 1. The previous entry cannot be found because the log has been compacted; fall back to a snapshot
    if (!fillCommonFields(rb, nextSendingIndex - 1, false)) {
        // unlock id in installSnapshot
        installSnapshot();
        return false;
    }

    ByteBufferCollector dataBuf = null;
    final int maxEntriesSize = this.raftOptions.getMaxEntriesSize();
    final RecyclableByteBufferList byteBufList = RecyclableByteBufferList.newInstance();
    try {
        // 2. Collect as many entries as possible
        for (int i = 0; i < maxEntriesSize; i++) {
            final RaftOutter.EntryMeta.Builder emb = RaftOutter.EntryMeta.newBuilder();
            if (!prepareEntry(nextSendingIndex, i, emb, byteBufList)) {
                break;
            }
            rb.addEntries(emb.build());
        }
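        // 3. No entries were collected; two cases:
        //  3.1 the log has been compacted, so the peer can only be synchronized via snapshot
        //  3.2 the peer's log is in sync with the leader's, so the replicator should wait
        if (rb.getEntriesCount() == 0) {
            // 3.1 The entries have been compacted; synchronize via snapshot
            if (nextSendingIndex < this.options.getLogManager().getFirstLogIndex()) {
                // Nothing to send, and the follower's next index is below the local first log index:
                // replicate through a snapshot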
                installSnapshot();
                return false;
            }
            // 3.2 Pass in the index of the next entry the replicator expects to send
            waitMoreEntries(nextSendingIndex);
            return false;
        }
        if (byteBufList.getCapacity() > 0) {
            dataBuf = ByteBufferCollector.allocateByRecyclers(byteBufList.getCapacity());
            for (final ByteBuffer b : byteBufList) {
                dataBuf.put(b);
            }
            final ByteBuffer buf = dataBuf.getBuffer();
            BufferUtils.flip(buf);
            rb.setData(ZeroByteStringHelper.wrap(buf));
        }
    } finally {
        RecycleUtil.recycle(byteBufList);
    }

    // 4. Normal case: build the AppendEntriesRequest and send the entries
    final AppendEntriesRequest request = rb.build();
    this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
    this.statInfo.firstLogIndex = rb.getPrevLogIndex() + 1;
    this.statInfo.lastLogIndex = rb.getPrevLogIndex() + rb.getEntriesCount();

    final Recyclable recyclable = dataBuf;
    final int v = this.version;
    final long monotonicSendTimeMs = Utils.monotonicMs();
    final int seq = getAndIncrementReqSeq();

    this.appendEntriesCounter++;
    Future<Message> rpcFuture = null;
    try {
        rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(), request, -1,
            new RpcResponseClosureAdapter<AppendEntriesResponse>() {

                @Override
                public void run(final Status status) {
                    RecycleUtil.recycle(recyclable); // TODO: recycle on send success, not response received.
                    onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request, getResponse(),
                        seq, v, monotonicSendTimeMs);
                }
            });
    } catch (final Throwable t) {
        RecycleUtil.recycle(recyclable);
        ThrowUtil.throwException(t);
    }
    addInflight(RequestType.AppendEntries, nextSendingIndex, request.getEntriesCount(), request.getData().size(),
        seq, rpcFuture);

    return true;
}
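Stripped of buffer handling and RPC plumbing, the branching in sendEntries boils down to a three-way decision. The sketch below is a deliberate simplification and not jraft's logic verbatim: `SendDecisionSketch.decide` is a hypothetical helper that assumes the decision can be made from the next index to send and the leader's first and last log indexes alone, whereas the real code discovers the same facts through fillCommonFields and prepareEntry.

```java
class SendDecisionSketch {
    enum Action { INSTALL_SNAPSHOT, WAIT_MORE_ENTRIES, SEND_ENTRIES }

    /**
     * Decide what a replicator should do next.
     *
     * @param nextSendingIndex index of the next entry the peer needs
     * @param firstLogIndex    first index still present in the leader's log
     * @param lastLogIndex     last index in the leader's log
     */
    static Action decide(long nextSendingIndex, long firstLogIndex, long lastLogIndex) {
        if (nextSendingIndex < firstLogIndex) {
            // The entry was compacted away; only a snapshot can bring the peer up to date.
            return Action.INSTALL_SNAPSHOT;
        }
        if (nextSendingIndex > lastLogIndex) {
            // The peer already has everything; park until new entries arrive.
            return Action.WAIT_MORE_ENTRIES;
        }
        return Action.SEND_ENTRIES;
    }

    public static void main(String[] args) {
        // Leader log currently holds indexes 100..120.
        System.out.println(decide(50, 100, 120));
        System.out.println(decide(121, 100, 120));
        System.out.println(decide(110, 100, 120));
    }
}
```

The first two outcomes are the cases in which the real sendEntries(long) returns false; in both of them the id lock has already been released by installSnapshot or waitMoreEntries.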

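Replicator#waitMoreEntries deserves a closer look. Its argument is the index of the next entry the replicator expects to send; it puts the replicator into the idle state, where it stops sending entries and waits.

It also calls LogManagerImpl#wait, passing continueSending as the callback, and gets back a waitId. The waitId identifies which callback to invoke when the Replicator is woken up, since the callback carries the index of the next entry to send.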

private void waitMoreEntries(final long nextWaitIndex) {
    try {
        LOG.debug("Node {} waits more entries", this.options.getNode().getNodeId());
        if (this.waitId >= 0) {
            return;
        }
        this.waitId = this.options.getLogManager().wait(nextWaitIndex - 1,
            (arg, errorCode) -> continueSending((ThreadId) arg, errorCode), this.id);
        this.statInfo.runningState = RunningState.IDLE;
    } finally {
        unlockId();
    }
}
@Override
public long wait(final long expectedLastLogIndex, final NewLogCallback cb, final Object arg) {
    final WaitMeta wm = new WaitMeta(cb, arg, 0);
    return notifyOnNewLog(expectedLastLogIndex, wm);
}

private long notifyOnNewLog(final long expectedLastLogIndex, final WaitMeta wm) {
    this.writeLock.lock();
    try {
        // 1. There are still entries to send (or the manager is stopped): run the callback on the thread pool
        if (expectedLastLogIndex != this.lastLogIndex || this.stopped) {
            wm.errorCode = this.stopped ? RaftError.ESTOP.getNumber() : 0;
            ThreadPoolsFactory.runInThread(this.groupId, () -> runOnNewLog(wm));
            return 0L;
        }
        // 2. Nothing to send: allocate a new waitId and register the waiter
        long waitId = this.nextWaitId++;
        if (waitId < 0) {
            // Guard against overflow of the waitId counter
            // Valid waitId starts from 1, skip 0.
            waitId = this.nextWaitId = 1;
        }
        this.waitMap.put(waitId, wm);
        return waitId;
    } finally {
        this.writeLock.unlock();
    }
}

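When there is nothing to send, the Replicator is effectively "suspended". So when does it get woken up?

The answer is LogManagerImpl#wakeupAllWaiter, which takes every WaitMeta registered so far and runs their callbacks one by one.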

private boolean wakeupAllWaiter(final Lock lock) {
    if (this.waitMap.isEmpty()) {
        lock.unlock();
        return false;
    }
    final List<WaitMeta> wms = new ArrayList<>(this.waitMap.values());
    final int errCode = this.stopped ? RaftError.ESTOP.getNumber() : RaftError.SUCCESS.getNumber();
    this.waitMap.clear();
    lock.unlock();

    final int waiterCount = wms.size();
    for (int i = 0; i < waiterCount; i++) {
        final WaitMeta wm = wms.get(i);
        wm.errorCode = errCode;
        ThreadPoolsFactory.runInThread(this.groupId, () -> runOnNewLog(wm));
    }
    return true;
}

wakeupAllWaiter is called from LogManagerImpl#appendEntries, which means that every time the leader writes log entries it tries to wake up any Replicator that is waiting.

@Override
public void appendEntries(final List<LogEntry> entries, final StableClosure done) {
    // Some code omitted
    if (!wakeupAllWaiter(this.writeLock)) {
        notifyLastLogIndexListeners();
    }
    // Some code omitted
}

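From the code above we can see that a Replicator starts replicating entries at two points:

  1. When it is created, log synchronization is triggered through sendProbeRequest
  2. When the leader writes log entries, LogManagerImpl#wakeupAllWaiter wakes up any waiting Replicator, which then sends the entries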
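The park-and-wake handshake between the Replicator and the LogManager can be reduced to a small registry of callbacks keyed by waitId. The sketch below is hypothetical code written for illustration, not jraft's: `WaiterSketch.LogManager` uses `synchronized` instead of a write lock, invokes callbacks on the caller's thread instead of a thread pool, and `waitFor` stands in for LogManagerImpl#wait.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WaiterSketch {
    interface NewLogCallback {
        void onNewLog(int errorCode);
    }

    static final class LogManager {
        private final Map<Long, NewLogCallback> waitMap = new HashMap<>();
        private long lastLogIndex = 0;
        private long nextWaitId = 1; // valid wait ids start from 1; 0 means "not parked"

        /** Park the callback if the caller is caught up; otherwise fire it right away. */
        long waitFor(long expectedLastLogIndex, NewLogCallback cb) {
            synchronized (this) {
                if (expectedLastLogIndex == lastLogIndex) {
                    long waitId = nextWaitId++;
                    waitMap.put(waitId, cb);
                    return waitId;
                }
            }
            // New entries already exist, so there is nothing to wait for.
            cb.onNewLog(0);
            return 0L;
        }

        /** Appending entries wakes every parked waiter exactly once. */
        void appendEntries(int count) {
            List<NewLogCallback> toWake;
            synchronized (this) {
                lastLogIndex += count;
                toWake = new ArrayList<>(waitMap.values());
                waitMap.clear();
            }
            for (NewLogCallback cb : toWake) {
                cb.onNewLog(0); // callbacks run outside the lock
            }
        }
    }

    public static void main(String[] args) {
        LogManager logManager = new LogManager();
        long waitId = logManager.waitFor(0, code -> System.out.println("woken up, errorCode=" + code));
        System.out.println("parked with waitId=" + waitId);
        logManager.appendEntries(2); // triggers the parked callback
    }
}
```

Running callbacks outside the lock mirrors the way wakeupAllWaiter copies the waiters, clears the map, and releases the lock before dispatching.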

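4. Summary

In this article we got a rough picture of how the leader and its peers stay in sync: (1) the pipeline mechanism improves sending efficiency while keeping responses processed in order; (2) probes reconcile conflicts between the leader's log and a peer's log; (3) the Replicator and the other components carry out the actual log replication. This is only an overview, though. jraft contains many interesting designs and techniques that this article has not touched on. Later articles in this series will summarize some of those coding techniques, so stay tuned.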