Writing Raft step by step: sofa-jraft election source code analysis (4)

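1. Election

1.1. Raft election

Let's first recall how a Raft election works:

  1. When a follower's heartbeat times out, it turns itself into a candidate, increments its term, and sends an election request (RequestVote RPC) to the other nodes.
  2. A node that receives the RequestVote RPC responds based on the request:
    1. If the request's term is greater than its own, it updates its term and becomes a follower.
    2. It checks whether the log information carried by the request is at least as up-to-date as its own; if so, it grants the vote.
  3. The candidate that sent the RequestVote RPC changes its state according to the voting result, or, if the vote fails, waits for the next timeout to start another election.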
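
The receiving side of the steps above can be sketched as follows. This is a minimal illustration of textbook Raft vote handling, not jraft code; the class RaftVoteSketch and all of its fields are hypothetical names chosen for this example.

```java
/** Hypothetical sketch of how a node answers a RequestVote RPC in plain Raft. */
final class RaftVoteSketch {
    long currentTerm;
    String votedFor; // null means no vote has been cast in currentTerm
    final long lastLogTerm;
    final long lastLogIndex;

    RaftVoteSketch(long currentTerm, long lastLogTerm, long lastLogIndex) {
        this.currentTerm = currentTerm;
        this.lastLogTerm = lastLogTerm;
        this.lastLogIndex = lastLogIndex;
    }

    /** Returns true if the vote is granted to candidateId. */
    boolean onRequestVote(String candidateId, long term, long candLastLogTerm, long candLastLogIndex) {
        if (term < currentTerm) {
            return false; // stale candidate
        }
        if (term > currentTerm) {
            // A higher term always wins: adopt it and forget any earlier vote.
            // A real node would also step down to follower here.
            currentTerm = term;
            votedFor = null;
        }
        // "At least as up-to-date": compare the last log term first, then the last log index.
        boolean logOk = candLastLogTerm > lastLogTerm
            || (candLastLogTerm == lastLogTerm && candLastLogIndex >= lastLogIndex);
        if (logOk && (votedFor == null || votedFor.equals(candidateId))) {
            votedFor = candidateId;
            return true;
        }
        return false;
    }
}
```

Note that a request with a higher term raises the receiver's term even when the vote itself is refused because the candidate's log is stale.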

A small plug for myself: a more detailed introduction and the implementation details can be found in the article raft--(1) Election.

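1.2. Jraft election

Now consider the scenario in Figure 1. S0 is the leader in some term. Because of network jitter, S4 misses the heartbeat, becomes a candidate, and starts an election. Since S4's term is now greater than the leader's, S0 is forced to step down to follower when it receives the RequestVote RPC, even though S0's log may be newer than S4's. The cluster is then briefly unavailable until some node wins an election.

Figure 1: a possible election outcome

Was this election necessary? In fact S0, the leader of term 1, has the most complete log; only its term lags behind S4's. So something interesting happens: an illegitimate node makes the whole cluster give in, forcing an election and raising the term of the entire cluster.

To avoid this, sofa-jraft adds an extra step called prevote. As the name suggests, prevote is only a preliminary vote, and only a node that passes it may start a real vote.

  1. When a follower's election timeout fires, it does not start a real vote right away. Instead it tentatively proposes term + 1 (without actually incrementing its own term) and sends a pre-vote request (PreRequestVote RPC) carrying that term to the other nodes.
  2. A node that receives this RPC checks whether the requester is qualified to become the new leader (it compares terms and checks that the requester's log is at least as up-to-date as its own).
  3. The requesting node increments its term and starts the real vote only after it has received pre-vote grants from more than half of the nodes.

Back to the example above. After electionTimeout, S4 sends pre-vote requests (note that S4's term has not been incremented and it is still a follower). Since S4's log very likely lags behind S0's, the pre-vote fails, and S4 sensibly gives up incrementing its term and starting an election.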
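
The S4 scenario can be played through with a small model. This is a hypothetical sketch, not jraft code: the Peer class, its leaderAlive flag and the termAfterPreVote helper are assumptions made for illustration only.

```java
import java.util.List;

/** Hypothetical sketch of one pre-vote round. */
final class PreVoteSketch {

    /** The state a peer consults when answering a pre-vote. */
    static final class Peer {
        final long currTerm;
        final long lastLogTerm;
        final long lastLogIndex;
        final boolean leaderAlive; // true while the peer still hears from a leader

        Peer(long currTerm, long lastLogTerm, long lastLogIndex, boolean leaderAlive) {
            this.currTerm = currTerm;
            this.lastLogTerm = lastLogTerm;
            this.lastLogIndex = lastLogIndex;
            this.leaderAlive = leaderAlive;
        }

        boolean grantPreVote(long proposedTerm, long candLastLogTerm, long candLastLogIndex) {
            if (leaderAlive || proposedTerm < currTerm) {
                return false;
            }
            return candLastLogTerm > lastLogTerm
                || (candLastLogTerm == lastLogTerm && candLastLogIndex >= lastLogIndex);
        }
    }

    /** Returns the requester's term after the round: bumped only if a majority granted. */
    static long termAfterPreVote(long currTerm, long lastLogTerm, long lastLogIndex, List<Peer> others) {
        int granted = 1; // the requester counts itself
        for (Peer p : others) {
            // The request carries currTerm + 1, but nobody's term changes yet.
            if (p.grantPreVote(currTerm + 1, lastLogTerm, lastLogIndex)) {
                granted++;
            }
        }
        int quorum = (others.size() + 1) / 2 + 1;
        return granted >= quorum ? currTerm + 1 : currTerm;
    }
}
```

With four peers that still hear from the leader, the requester's term stays unchanged, so the rest of the cluster is never disturbed.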

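A complete pre-vote plus vote round looks like this:

1.3. Source code walkthrough

Note: this article was written on December 4, 2024, and all code is from the master branch as of that date.

The core jraft logic lives in the class com.alipay.sofa.jraft.core.NodeImpl, which is the abstraction of a Raft node. Its election-related fields are shown below. I have a habit when reading source code: deleting code. To make things easier to follow, I usually create a new branch and delete as much code as possible that is unrelated to the logic I want to read, so that this logic stands out clearly. While deleting code I usually also gain some understanding of how the other features work. It is clumsy, but it works for me, and I would be glad to hear about more efficient ways of reading source code.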

public class NodeImpl implements Node, RaftServerService {

    private static final Logger                                            LOG                      = LoggerFactory
    .getLogger(NodeImpl.class);


    public static final AtomicInteger                                      GLOBAL_NUM_NODES         = new AtomicInteger(
        0);

    protected final Lock                                                   writeLock                = this.readWriteLock
    .writeLock();
    protected final Lock                                                   readLock                 = this.readWriteLock
    .readLock();
    private volatile State                                                 state;
    private long                                                           currTerm;
    private volatile long                                                  lastLeaderTimestamp;
    private PeerId                                                         leaderId                 = new PeerId();
    private PeerId                                                         votedId;
    private final Ballot                                                   voteCtx                  = new Ballot();
    private final Ballot                                                   prevVoteCtx              = new Ballot();
    private ConfigurationEntry                                             conf;
    private final PeerId                                                   serverId;
    /** Other services */
    private final ConfigurationCtx                                         confCtx;
    /** Timers */
    private Scheduler                                                      timerManager;
    private RepeatedTimer                                                  electionTimer;
    private RepeatedTimer                                                  voteTimer;
    private RepeatedTimer                                                  stepDownTimer;
}

If you have read the first article in my raft series, you can spot the core election-related members right in the field declarations:

    private RepeatedTimer electionTimer;
    private RepeatedTimer voteTimer;

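Based on what we already know about the Raft algorithm, electionTimer is most likely the timer that starts an election when the heartbeat times out, and voteTimer is the timer that fires when a voting round times out. Compared with the raft node we implemented ourselves, there is one more timer here: stepDownTimer. To give the conclusion first, the leader uses this timer to check itself and to maintain its connections with the other nodes.

The timers and their scopes are shown in the figure below.

1.3.1. Triggering an election

Let's first locate where the ET (ElectionTimer) is started. IDEA's usage hints show that the ET is started in only one place.

That place is the stepDown method. Its core logic is to set the current node's state to follower and, if the node is not a learner (a learner does not take part in Raft elections but still replicates the leader's log), to start the electionTimer.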

private void stepDown(final long term, final boolean wakeupCandidate, final Status status) {
        if (!this.state.isActive()) {
            return;
        }
        if (this.state == State.STATE_CANDIDATE) {
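            // A candidate is about to become a follower, so stop the voteTimer
            stopVoteTimer();
        } else if (this.state.compareTo(State.STATE_TRANSFERRING) <= 0) {
            // A leader (or a leader transferring leadership) is about to become a follower, so stop the stepDownTimer
            stopStepDownTimer();
            // Clear the ballot box; ballotBox is related to log voting, ignore it for now
            this.ballotBox.clearPendingTasks();
            // signal fsm leader stop immediately
            // Related to stopping the finite state machine, ignore it for now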
            if (this.state == State.STATE_LEADER) {
                onLeaderStop(status);
            }
        }
        // reset leader_id
        resetLeaderId(PeerId.emptyPeer(), status);
        // Step down to follower
        this.state = State.STATE_FOLLOWER;
        // meta state
        if (term > this.currTerm) {
            this.currTerm = term;
            this.votedId = PeerId.emptyPeer();
            this.metaStorage.setTermAndVotedFor(term, this.votedId);
        }

        // ...... part of the code omitted
        if (!isLearner()) {
            this.electionTimer.restart();
        } else {
            LOG.info("Node {} is a learner, election timer is not started.", this.nodeId);
        }
    }

Next, look at what happens when the ElectionTimer fires: it calls handleElectionTimeout. RepeatedTimer is jraft's abstraction for scheduled tasks.

this.electionTimer = new RepeatedTimer(name, this.options.getElectionTimeoutMs(),
            TIMER_FACTORY.getElectionTimer(this.options.isSharedElectionTimer(), name)) {

            @Override
            protected void onTrigger() {
                handleElectionTimeout();
            }

            @Override
            protected int adjustTimeout(final int timeoutMs) {
                return randomTimeout(timeoutMs);
            }
        };
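
adjustTimeout randomizes the election timeout so that followers do not all time out at the same moment and split the vote. The body of randomTimeout is not shown in this article; a minimal sketch, assuming it picks a value in [timeoutMs, timeoutMs + maxDelayMs) where maxDelayMs is a hypothetical parameter, could look like this:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of a randomized election timeout. */
final class ElectionTimeoutSketch {
    /** Returns a timeout in [timeoutMs, timeoutMs + maxDelayMs); maxDelayMs is an assumed knob. */
    static int randomTimeout(int timeoutMs, int maxDelayMs) {
        if (maxDelayMs <= 0) {
            return timeoutMs; // no jitter configured
        }
        return ThreadLocalRandom.current().nextInt(timeoutMs, timeoutMs + maxDelayMs);
    }
}
```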

The logic of handleElectionTimeout is fairly simple:

  1. Check whether this election is legitimate
  2. Start a pre-vote

private void handleElectionTimeout() {
    boolean doUnlock = true;
    this.writeLock.lock();
    try {
        // 1.1 Only a follower may start an election
        if (this.state != State.STATE_FOLLOWER) {
            return;
        }
        // 1.2 If the current leader is still valid (its heartbeat has not expired), give up the election
        if (isCurrentLeaderValid()) {
            return;
        }
        resetLeaderId(PeerId.emptyPeer(), new Status(RaftError.ERAFTTIMEDOUT, "Lost connection from leader %s.",
            this.leaderId));

        // 1.3 Check whether this node is allowed to start an election; jraft supports per-node election priority
        if (!allowLaunchElection()) {
            return;
        }

        doUnlock = false;
        preVote();

    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}
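
isCurrentLeaderValid is not shown in this article. A minimal sketch, assuming it simply checks whether the last contact from the leader (the lastLeaderTimestamp field) is more recent than one election timeout, could look like this. The class and parameter names are hypothetical.

```java
/** Hypothetical sketch of the leader validity check used before starting an election. */
final class LeaderValiditySketch {
    /** Assumed rule: the leader is valid while less than one election timeout has passed since its last contact. */
    static boolean isCurrentLeaderValid(long monotonicNowMs, long lastLeaderTimestampMs, long electionTimeoutMs) {
        return monotonicNowMs - lastLeaderTimestampMs < electionTimeoutMs;
    }
}
```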

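1.3.2. PreVote

  • Pre-vote request

preVote is the core of the pre-vote implementation. Its logic is:

  1. Tentatively use the local term + 1 and send pre-vote requests to the other nodes.
  2. Vote for itself

Note the ABA check on currTerm. This is a good concurrency practice: release the lock around a blocking operation, then use an ABA check to make sure the state is still what we expect after the lock was released.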

    // in writeLock
    private void preVote() {
        long oldTerm;
        try {
            // ..... part of the code omitted
            oldTerm = this.currTerm;
        } finally {
            this.writeLock.unlock();
        }

        final LogId lastLogId = this.logManager.getLastLogId(true);

        boolean doUnlock = true;
        this.writeLock.lock();
        try {
            // The lock was released, so do an ABA check to make sure this run is still valid
            if (oldTerm != this.currTerm) {
                return;
            }
            // Initialize the pre-vote ballot
            this.prevVoteCtx.init(this.conf.getConf(), this.conf.isStable() ? null : this.conf.getOldConf());
            for (final PeerId peer : this.conf.listPeers()) {
                if (peer.equals(this.serverId)) {
                    continue;
                }
                if (!this.rpcService.connect(peer.getEndpoint())) {
                    continue;
                }
                final OnPreVoteRpcDone done = new OnPreVoteRpcDone(peer, this.currTerm);
                done.request = RequestVoteRequest.newBuilder() //
                    .setPreVote(true) // it's a pre-vote request.
                    .setGroupId(this.groupId) //
                    .setServerId(this.serverId.toString()) //
                    .setPeerId(peer.toString()) //
                // Tentatively use term + 1; currTerm itself is not changed
                    .setTerm(this.currTerm + 1) // next term
                    .setLastLogIndex(lastLogId.getIndex()) //
                    .setLastLogTerm(lastLogId.getTerm()) //
                    .build();
                this.rpcService.preVote(peer.getEndpoint(), done.request, done);
            }
            this.prevVoteCtx.grant(this.serverId);
            if (this.prevVoteCtx.isGranted()) {
                doUnlock = false;
                // Pre-vote granted by a quorum, start the real vote
                electSelf();
            }
        } finally {
            if (doUnlock) {
                this.writeLock.unlock();
            }
        }
    }
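
The release-and-recheck pattern used in preVote can be isolated as follows. This is a hypothetical sketch: TermGuardSketch, bumpTerm and runIfTermUnchanged are names invented for this example, and a plain ReentrantLock stands in for the node's write lock.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: drop the lock around slow work, then verify the term did not move. */
final class TermGuardSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private long currTerm = 1;

    void bumpTerm() {
        lock.lock();
        try {
            currTerm++;
        } finally {
            lock.unlock();
        }
    }

    /** Runs slowOp without holding the lock; returns true only if the term is unchanged afterwards. */
    boolean runIfTermUnchanged(Runnable slowOp) {
        long oldTerm;
        lock.lock();
        try {
            oldTerm = currTerm; // snapshot under the lock
        } finally {
            lock.unlock();
        }
        slowOp.run(); // blocking work happens outside the lock
        lock.lock();
        try {
            return oldTerm == currTerm; // stale if anyone changed the term meanwhile
        } finally {
            lock.unlock();
        }
    }
}
```

The design trade-off is that other threads are not blocked during the slow operation, at the cost of having to discard the work when the check fails.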
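
  • Handling a pre-vote request

Before looking at the real vote, let's see how a node handles a pre-vote RequestVoteRequest and how the response is handled.

  1. A node grants the pre-vote only when the request's term is greater than or equal to its own, the requester's log is at least as up-to-date as its own, and it believes the leader is no longer healthy (an optimization).
  2. An interesting detail is the use of a do { ... } while (false) block to control the flow: break leaves the block early. Compared with extracting a method and returning early, this avoids passing parameters and reduces if/else branches. This idiom is used all over the jraft code.

    @Override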
    public Message handlePreVoteRequest(final RequestVoteRequest request) {
        boolean doUnlock = true;
        this.writeLock.lock();
        try {
            if (!this.state.isActive()) {
                return RpcFactoryHelper //
                    .responseFactory() //
                    .newResponse(RequestVoteResponse.getDefaultInstance(), RaftError.EINVAL,
                        "Node %s is not in active state, state %s.", getNodeId(), this.state.name());
            }
            final PeerId candidateId = new PeerId();
            if (!candidateId.parse(request.getServerId())) {
                return RpcFactoryHelper //
                    .responseFactory() //
                    .newResponse(RequestVoteResponse.getDefaultInstance(), RaftError.EINVAL,
                        "Parse candidateId failed: %s.", request.getServerId());
            }
            boolean granted = false;
            // noinspection ConstantConditions
            do {
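                // The requester is not in the current configuration, do not grant
                if (!this.conf.contains(candidateId)) {
                    break;
                }
                // The current leader is still alive, do not grant
                if (this.leaderId != null && !this.leaderId.isEmpty() && isCurrentLeaderValid()) {
                    break;
                }
                // The request's term is smaller than this node's term, do not grant
                if (request.getTerm() < this.currTerm) {
                    // If this node is the leader, make sure the replicator for the requesting peer is working
                    checkReplicator(candidateId);
                    break;
                }
                // If this node is the leader, make sure the replicator for the requesting peer is working
                checkReplicator(candidateId);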

                doUnlock = false;
                this.writeLock.unlock();
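                // The lock has been released before this blocking call
                final LogId lastLogId = this.logManager.getLastLogId(true);

                doUnlock = true;
                this.writeLock.lock();
                // Compare the local last log with the request's last log: grant only if the requester's log is at least as up-to-date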
                final LogId requestLastLogId = new LogId(request.getLastLogIndex(), request.getLastLogTerm());
                granted = requestLastLogId.compareTo(lastLogId) >= 0;
            } while (false);

            return RequestVoteResponse.newBuilder() //
                .setTerm(this.currTerm) //
                .setGranted(granted) //
                .build();
        } finally {
            if (doUnlock) {
                this.writeLock.unlock();
            }
        }
    }
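
The do { ... } while (false) idiom from handlePreVoteRequest can be shown in isolation. This is a simplified, hypothetical sketch: the boolean parameters stand in for the real checks, and only the control flow mirrors the method above.

```java
/** Hypothetical sketch of the "breakable block" idiom. */
final class BreakBlockSketch {
    static boolean decide(boolean knownPeer, boolean leaderAlive, long requestTerm, long currTerm, boolean logOk) {
        boolean granted = false;
        do {
            if (!knownPeer) {
                break; // requester is not in the configuration
            }
            if (leaderAlive) {
                break; // a valid leader exists
            }
            if (requestTerm < currTerm) {
                break; // stale term
            }
            granted = logOk;
        } while (false); // the body runs exactly once
        // Shared epilogue: every path above ends up here and builds the same response.
        return granted;
    }
}
```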
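
  • Handling a pre-vote response

  1. Make sure the response is still valid: this node is still a follower and its term has not changed
  2. If the peer's term is greater than this node's, call stepDown
  3. Update the pre-vote ballot. If the pre-vote passes (majority rule), start the formal election.

public void handlePreVoteResponse(final PeerId peerId, final long term, final RequestVoteResponse response) {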
    boolean doUnlock = true;
    this.writeLock.lock();
    try {
        // 1. Make sure this node is still a follower
        if (this.state != State.STATE_FOLLOWER) {
            return;
        }
        // 2. This node's term has changed, so this is a stale response; skip it
        if (term != this.currTerm) {
            return;
        }
        // The peer's term is greater than ours, so this node is out of date;
        // stepDown resets the follower-related state
        if (response.getTerm() > this.currTerm) {
            stepDown(response.getTerm(), false, new Status(RaftError.EHIGHERTERMRESPONSE,
                "Raft node receives higher term pre_vote_response."));
            return;
        }
        if (response.getGranted()) {
            this.prevVoteCtx.grant(peerId);
            if (this.prevVoteCtx.isGranted()) {
                doUnlock = false;
                // Pre-vote passed, start the election
                electSelf();
            }
        }
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}
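
The ballot used by prevVoteCtx and voteCtx is initialized with both the current and (when the configuration is not stable) the old configuration. A minimal sketch of such a ballot, assuming a vote must reach a majority in each configuration, is shown below. BallotSketch is a hypothetical class for illustration and is not jraft's Ballot.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of majority counting over a current and an optional old configuration. */
final class BallotSketch {
    private final Set<String> pending = new HashSet<>();
    private final Set<String> oldPending = new HashSet<>();
    private int quorum;
    private int oldQuorum;

    /** conf entries are assumed distinct; oldConf may be null when the configuration is stable. */
    void init(List<String> conf, List<String> oldConf) {
        pending.clear();
        oldPending.clear();
        pending.addAll(conf);
        quorum = conf.size() / 2 + 1;
        oldQuorum = 0;
        if (oldConf != null) {
            oldPending.addAll(oldConf);
            oldQuorum = oldConf.size() / 2 + 1;
        }
    }

    /** Counts a grant at most once per peer and per configuration. */
    void grant(String peer) {
        if (pending.remove(peer)) {
            quorum--;
        }
        if (oldPending.remove(peer)) {
            oldQuorum--;
        }
    }

    boolean isGranted() {
        return quorum <= 0 && oldQuorum <= 0;
    }
}
```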

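1.3.3. ElectSelf (formal vote)

When the formal vote starts, the follower stops the electionTimer, changes its state to candidate, increments its term, and starts the voteTimer; see the code marked (1).

The candidate then sends RequestVoteRequest once more, except that this time the request's preVote field is false.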

private void electSelf() {
        long oldTerm;
        try {
            if (!this.conf.contains(this.serverId)) {
                return;
            }
            // Start of the code marked (1)
            if (this.state == State.STATE_FOLLOWER) {
                this.electionTimer.stop();
            }
            resetLeaderId(PeerId.emptyPeer(), new Status(RaftError.ERAFTTIMEDOUT,
                "A follower's leader_id is reset to NULL as it begins to request_vote."));
            this.state = State.STATE_CANDIDATE;
            this.currTerm++;
            this.votedId = this.serverId.copy();
            LOG.debug("Node {} start vote timer, term={} .", getNodeId(), this.currTerm);
            this.voteTimer.start();
            this.voteCtx.init(this.conf.getConf(), this.conf.isStable() ? null : this.conf.getOldConf());
            oldTerm = this.currTerm;
            // End of the code marked (1)
        } finally {
            this.writeLock.unlock();
        }

        final LogId lastLogId = this.logManager.getLastLogId(true);

        this.writeLock.lock();
        try {
            // vote need defense ABA after unlock&writeLock
            if (oldTerm != this.currTerm) {
                return;
            }
            for (final PeerId peer : this.conf.listPeers()) {
                if (peer.equals(this.serverId)) {
                    continue;
                }
                if (!this.rpcService.connect(peer.getEndpoint())) {
                    continue;
                }
                final OnRequestVoteRpcDone done = new OnRequestVoteRpcDone(peer, this.currTerm, this);
                done.request = RequestVoteRequest.newBuilder() //
                    .setPreVote(false) // It's not a pre-vote request.
                    .setGroupId(this.groupId) //
                    .setServerId(this.serverId.toString()) //
                    .setPeerId(peer.toString()) //
                    .setTerm(this.currTerm) //
                    .setLastLogIndex(lastLogId.getIndex()) //
                    .setLastLogTerm(lastLogId.getTerm()) //
                    .build();
                this.rpcService.requestVote(peer.getEndpoint(), done.request, done);
            }

            this.metaStorage.setTermAndVotedFor(this.currTerm, this.serverId);
            this.voteCtx.grant(this.serverId);
            if (this.voteCtx.isGranted()) {
                becomeLeader();
            }
        } finally {
            this.writeLock.unlock();
        }
    }

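As with the pre-vote, the node that receives a formal vote request handles it in handleRequestVoteRequest, and the candidate that sent the request handles the response in handleRequestVoteResponse.

  • handleRequestVoteRequest

This is the core method for handling a RequestVoteRequest. Whether the vote is granted depends on three conditions:

  1. The candidate's term >= this node's term

  2. The candidate's log is at least as up-to-date as this node's

  3. This node has not yet voted in this term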

    @Override
    public Message handleRequestVoteRequest(final RequestVoteRequest request) {
        boolean doUnlock = true;
        this.writeLock.lock();
        try {
            // (1) Validity checks (start)
            if (!this.state.isActive()) {
                return RpcFactoryHelper //
                    .responseFactory() //
                    .newResponse(RequestVoteResponse.getDefaultInstance(), RaftError.EINVAL,
                        "Node %s is not in active state, state %s.", getNodeId(), this.state.name());
            }
            final PeerId candidateId = new PeerId();
            if (!candidateId.parse(request.getServerId())) {
                return RpcFactoryHelper //
                    .responseFactory() //
                    .newResponse(RequestVoteResponse.getDefaultInstance(), RaftError.EINVAL,
                        "Parse candidateId failed: %s.", request.getServerId());
            }
            // (1) Validity checks (end)

            do {
                if (request.getTerm() >= this.currTerm) {
                    // If the request's term is greater than ours, update the term and become a follower
                    if (request.getTerm() > this.currTerm) {
                        stepDown(request.getTerm(), false, new Status(RaftError.EHIGHERTERMRESPONSE,
                            "Raft node receives higher term RequestVoteRequest."));
                    }
                } else {
                    // The request's term is smaller than ours: refuse the vote and break out
                    break;
                }
                doUnlock = false;
                this.writeLock.unlock();

                final LogId lastLogId = this.logManager.getLastLogId(true);

                doUnlock = true;
                this.writeLock.lock();
                // ABA check
                if (request.getTerm() != this.currTerm) {
                    LOG.warn("Node {} raise term {} when get lastLogId.", getNodeId(), this.currTerm);
                    break;
                }

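                // Check whether the candidate's log is at least as up-to-date as ours
                final boolean logIsOk = new LogId(request.getLastLogIndex(), request.getLastLogTerm())
                    .compareTo(lastLogId) >= 0;

                // Three conditions must hold:
                // 1. the candidate's term >= ours
                // 2. the candidate's log is at least as up-to-date as ours
                // 3. this node has not yet voted in this term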
                if (logIsOk && (this.votedId == null || this.votedId.isEmpty())) {
                    stepDown(request.getTerm(), false, new Status(RaftError.EVOTEFORCANDIDATE,
                        "Raft node votes for some candidate, step down to restart election_timer."));
                    this.votedId = candidateId.copy();
                    this.metaStorage.setVotedFor(candidateId);
                }
            } while (false);

            return RequestVoteResponse.newBuilder() //
                .setTerm(this.currTerm) //
                .setGranted(request.getTerm() == this.currTerm && candidateId.equals(this.votedId)) //
                .build();
        } finally {
            if (doUnlock) {
                this.writeLock.unlock();
            }
        }
    }

  • handleRequestVoteResponse

Next comes the response handler: once more than half of the votes have been received, it calls becomeLeader to become the leader.

public void handleRequestVoteResponse(final PeerId peerId, final long term, final RequestVoteResponse response) {
    this.writeLock.lock();
    try {
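        // The election round this response belongs to is already over, return
        if (this.state != State.STATE_CANDIDATE) {
            return;
        }
        // ABA check: this response comes from a stale election, return
        if (term != this.currTerm) {
            return;
        }
        if (response.getTerm() > this.currTerm) {
            // The peer's term is greater than ours: step down to follower and abort the election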
            stepDown(response.getTerm(), false, new Status(RaftError.EHIGHERTERMRESPONSE,
                                                           "Raft node receives higher term request_vote_response."));
            return;
        }
        // check granted quorum?
        if (response.getGranted()) {
            this.voteCtx.grant(peerId);
            if (this.voteCtx.isGranted()) {
                becomeLeader();
            }
        }
    } finally {
        this.writeLock.unlock();
    }
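}

  • becomeLeader

After becoming the leader, the node mainly does the following:

  1. Stop the voteTimer
  2. Change its own state
  3. Start the log replication threads (replicators) for followers and learners
  4. Reset the local pending commit index
  5. Start the stepDownTimer

private void becomeLeader() {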
    Requires.requireTrue(this.state == State.STATE_CANDIDATE, "Illegal state: " + this.state);
    // 1. Stop the vote timeout timer
    stopVoteTimer();
    // 2. Change this node's state
    this.state = State.STATE_LEADER;
    this.leaderId = this.serverId.copy();
    // 3. Update the replicatorGroup's term; replicators handle log replication
    this.replicatorGroup.resetTerm(this.currTerm);
    // Start follower's replicators
    for (final PeerId peer : this.conf.listPeers()) {
        if (peer.equals(this.serverId)) {
            continue;
        }
        // 4. Start the replicator for this follower
        if (!this.replicatorGroup.addReplicator(peer)) {
            LOG.error("Fail to add a replicator, peer={}.", peer);
        }
    }

    for (final PeerId peer : this.conf.listLearners()) {
        // 5. Start the replicator for this learner
        if (!this.replicatorGroup.addReplicator(peer, ReplicatorType.Learner)) {
            LOG.error("Fail to add a learner replicator, peer={}.", peer);
        }
    }

    // init commit manager
    this.ballotBox.resetPendingIndex(this.logManager.getLastLogIndex() + 1);
    // Register _conf_ctx to reject configuration changing before the first log
    // is committed.
    if (this.confCtx.isBusy()) {
        throw new IllegalStateException();
    }
    this.confCtx.flush(this.conf.getConf(), this.conf.getOldConf());
    // 6. Start the stepDownTimer
    this.stepDownTimer.start();
}

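1.3.4. Timer

If you paid close attention to the code above, you will have noticed three election-related timers:

  1. ElectionTimer. This one is easy to understand: it is the timer a follower uses to detect that the leader has timed out, and it is started only when the node's state is follower. When it fires, it calls handleElectionTimeout() to run the pre-vote and vote logic.

 this.electionTimer = new RepeatedTimer(name, this.options.getElectionTimeoutMs(),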
            TIMER_FACTORY.getElectionTimer(this.options.isSharedElectionTimer(), name)) {

            @Override
            protected void onTrigger() {
                handleElectionTimeout();
            }

            @Override
            protected int adjustTimeout(final int timeoutMs) {
                return randomTimeout(timeoutMs);
            }
        };

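  2. VoteTimer. This is the timer a candidate uses to detect that a voting round has timed out, and it is started only when the node's state is candidate. Depending on configuration, when the VoteTimer fires the node either steps down to follower and starts a pre-vote again, or starts a formal vote directly.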

this.voteTimer = new RepeatedTimer(name, this.options.getElectionTimeoutMs(), TIMER_FACTORY.getVoteTimer(
    this.options.isSharedVoteTimer(), name)) {

    @Override
    protected void onTrigger() {
        handleVoteTimeout();
    }

    @Override
    protected int adjustTimeout(final int timeoutMs) {
        return randomTimeout(timeoutMs);
    }
};

private void handleVoteTimeout() {
    this.writeLock.lock();
    if (this.state != State.STATE_CANDIDATE) {
        this.writeLock.unlock();
        return;
    }

    if (this.raftOptions.isStepDownWhenVoteTimedout()) {
        stepDown(this.currTerm, false, new Status(RaftError.ETIMEDOUT,
            "Vote timeout: fail to get quorum vote-granted."));
        preVote();
    } else {
        electSelf();
    }
}

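  3. StepDownTimer, started only when the node's state is leader.

This timer is more interesting. Before reading the source, consider an example: a Raft cluster of 5 nodes, originally led by S1, suffers a network partition at some point. S0, S2 and S4 are in one partition and S2 is elected as the new leader; since this partition holds a majority, it remains available. S1 and S3 are in the other partition, and S1 still holds on to its old leader role. Writes can still be sent to S1, but reads run into trouble: because this partition contains fewer than a majority of the original cluster, the logs written to it can never be committed.

To solve this problem, jraft introduces stepDownTimeout for the leader. The core idea is that the leader must confirm that it is still a legitimate leader. How? It maintains a lease with the other nodes, and if the conditions for remaining leader are no longer met, it steps itself down to follower.

With that example in mind, stepDownTimer is much easier to understand. Its core logic lives in the handleStepDownTimeout method.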

this.stepDownTimer = new RepeatedTimer(name, this.options.getElectionTimeoutMs() >> 1,
    TIMER_FACTORY.getStepDownTimer(this.options.isSharedStepDownTimer(), name)) {

    @Override
    protected void onTrigger() {
        handleStepDownTimeout();
    }
};

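This method checks whether the leader is still legitimate in two passes:

The first pass runs under the read lock; even if the leader turns out to be invalid, the node's state is not changed here.

The second pass runs under the write lock; the node's state is changed only if the leader is still invalid at this point.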

private void handleStepDownTimeout() {
    do {
        this.readLock.lock();
        try {
            if (this.state.compareTo(State.STATE_TRANSFERRING) > 0) {
                LOG.debug("Node {} stop step-down timer, term={}, state={}.", getNodeId(), this.currTerm,
                    this.state);
                return;
            }
            final long monotonicNowMs = Utils.monotonicMs();
            if (!checkDeadNodes(this.conf.getConf(), monotonicNowMs, false)) {
                break;
            }
            if (!this.conf.getOldConf().isEmpty()) {
                if (!checkDeadNodes(this.conf.getOldConf(), monotonicNowMs, false)) {
                    break;
                }
            }
            return;
        } finally {
            this.readLock.unlock();
        }
    } while (false);

    this.writeLock.lock();
    try {
        if (this.state.compareTo(State.STATE_TRANSFERRING) > 0) {
            LOG.debug("Node {} stop step-down timer, term={}, state={}.", getNodeId(), this.currTerm, this.state);
            return;
        }
        final long monotonicNowMs = Utils.monotonicMs();
        checkDeadNodes(this.conf.getConf(), monotonicNowMs, true);
        if (!this.conf.getOldConf().isEmpty()) {
            checkDeadNodes(this.conf.getOldConf(), monotonicNowMs, true);
        }
    } finally {
        this.writeLock.unlock();
    }
}

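As for how the leader is validated, see the comments in the code. In short: for each node, compute the leader's current time minus the time of the last successful RPC to that node. If this difference does not exceed leaderLeaseTimeout for more than half of the nodes, the leader is valid. Otherwise the leader is invalid and goes through the stepDown logic.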

private boolean checkDeadNodes(final Configuration conf, final long monotonicNowMs,
                               final boolean stepDownOnCheckFail) {
    for (final PeerId peer : conf.getLearners()) {
        // Keep the learners' replicators healthy
        checkReplicator(peer);
    }
    final List<PeerId> peers = conf.listPeers();
    final Configuration deadNodes = new Configuration();
    // If the leader is still valid, return true directly
    if (checkDeadNodes0(peers, monotonicNowMs, true, deadNodes)) {
        return true;
    }
    if (stepDownOnCheckFail) {
        // Step-down logic, executed only while holding the write lock
        LOG.warn("Node {} steps down when alive nodes don't satisfy quorum, term={}, deadNodes={}, conf={}.",
            getNodeId(), this.currTerm, deadNodes, conf);
        final Status status = new Status();
        status.setError(RaftError.ERAFTTIMEDOUT, "Majority of the group dies: %d/%d", deadNodes.size(),
            peers.size());
        stepDown(this.currTerm, false, status);
    }
    return false;
}

private boolean checkDeadNodes0(final List<PeerId> peers, final long monotonicNowMs, final boolean checkReplicator,
                                final Configuration deadNodes) {
    final int leaderLeaseTimeoutMs = this.options.getLeaderLeaseTimeoutMs();
    int aliveCount = 0;
    long startLease = Long.MAX_VALUE;
    for (final PeerId peer : peers) {
        // Iterate over all nodes that are eligible to vote
        if (peer.equals(this.serverId)) {
            aliveCount++;
            continue;
        }
        if (checkReplicator) {
            checkReplicator(peer);
        }
        // Time of the last successful RPC
        final long lastRpcSendTimestamp = this.replicatorGroup.getLastRpcSendTimestamp(peer);
        // now - time of the last successful RPC <= lease timeout
        // means this peer still acknowledges the leader
        if (monotonicNowMs - lastRpcSendTimestamp <= leaderLeaseTimeoutMs) {
            aliveCount++;
            if (startLease > lastRpcSendTimestamp) {
                // Use the earliest successful RPC time as the start of the lease
                startLease = lastRpcSendTimestamp;
            }
            continue;
        }
        if (deadNodes != null) {
            deadNodes.addPeer(peer);
        }
    }
    // If a majority of nodes are within the lease, the leader keeps its role
    if (aliveCount >= peers.size() / 2 + 1) {
        updateLastLeaderTimestamp(startLease);
        return true;
    }
    // The leader is no longer valid
    return false;
}
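
The lease arithmetic can be checked with the partition example from the start of this subsection. The sketch below is a simplified, hypothetical model of the counting in checkDeadNodes0: the map of timestamps, the 900 ms lease and the class name are assumptions for illustration.

```java
import java.util.Map;

/** Hypothetical sketch of the lease-based quorum check a leader runs on itself. */
final class LeaseQuorumSketch {
    /**
     * lastRpcMs maps every voting peer (including self) to the time of its last successful RPC.
     * Returns true if a majority of peers, counting self, are still within the lease.
     */
    static boolean leaderStillValid(String self, Map<String, Long> lastRpcMs, long nowMs, long leaseTimeoutMs) {
        int alive = 0;
        for (Map.Entry<String, Long> e : lastRpcMs.entrySet()) {
            if (e.getKey().equals(self) || nowMs - e.getValue() <= leaseTimeoutMs) {
                alive++;
            }
        }
        return alive >= lastRpcMs.size() / 2 + 1;
    }
}
```

For example, with S1 as leader at time 10000, a 900 ms lease, S3 last reached at 9500 and S0, S2, S4 last reached at 8000, only S1 and S3 count as alive. That is 2 of 5, below the quorum of 3, so S1 would step down.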

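Although stepDownTimeout cannot completely prevent the situation described at the start of this subsection, it does guarantee that after one leaseTimeout the client can detect that the leader has changed.

1.3.5. Heartbeat

Looking back at the code above, you will notice that one piece is still missing: the leader sending heartbeats to the followers. As the figure below shows, after becoming leader the node not only starts the StepDownTimer but also creates a replicator for every follower and learner, which manages the interaction with that node.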

private void becomeLeader() {
        // ...... code omitted
        // Start follower's replicators
        for (final PeerId peer : this.conf.listPeers()) {
            if (peer.equals(this.serverId)) {
                continue;
            }
            LOG.debug("Node {} add a replicator, term={}, peer={}.", getNodeId(), this.currTerm, peer);
            if (!this.replicatorGroup.addReplicator(peer)) {
                LOG.error("Fail to add a replicator, peer={}.", peer);
            }
        }
        // Start learner's replicators
        for (final PeerId peer : this.conf.listLearners()) {
            LOG.debug("Node {} add a learner replicator, term={}, peer={}.", getNodeId(), this.currTerm, peer);
            if (!this.replicatorGroup.addReplicator(peer, ReplicatorType.Learner)) {
                LOG.error("Fail to add a learner replicator, peer={}.", peer);
            }
        }
        // ...... code omitted
    }

This calls ReplicatorGroupImpl#addReplicator, which in turn calls Replicator#start to create a new replicator.

    @Override
    public boolean addReplicator(final PeerId peer, final ReplicatorType replicatorType, final boolean sync) {
        // ...... code omitted
        final ThreadId rid = Replicator.start(opts, this.raftOptions);
        // ...... code omitted
        return this.replicatorMap.put(peer, rid) == null;
    }

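The start method starts the heartbeatTimer and immediately sends an AppendEntriesRequest that carries no data (used to quickly synchronize log positions between the leader and the other node).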

    public static ThreadId start(final ReplicatorOptions opts, final RaftOptions raftOptions) {
        // ...... code omitted
        r.lastRpcSendTimestamp = Utils.monotonicMs();
        r.startHeartbeatTimer(Utils.nowMs());
        // id.unlock in sendEmptyEntries
        r.sendProbeRequest();
        return r.id;
    }

Each time the heartbeatTimer fires, it calls Replicator#onTimeout.

private void onTimeout(final ThreadId id) {
    if (id != null) {
        id.setError(RaftError.ETIMEDOUT.getNumber());
    } else {
        LOG.warn("Replicator {} id is null when timeout, maybe it's destroyed.", this);
    }
}

This eventually reaches Replicator's onError method, which sends the heartbeat.

@Override
public void onError(final ThreadId id, final Object data, final int errorCode) {
    final Replicator r = (Replicator) data;
    if (errorCode == RaftError.ESTOP.getNumber()) {
        // ...... omitted
    } else if (errorCode == RaftError.ETIMEDOUT.getNumber()) {
        // Send a heartbeat
        RpcUtils.runInThread(() -> sendHeartbeat(id));
    } else {
        // noinspection ConstantConditions
        Requires.requireTrue(false, "Unknown error code " + errorCode + " for replicator: " + r);
    }
}

private static void sendHeartbeat(final ThreadId id) {
    final Replicator r = (Replicator) id.lock();
    if (r == null) {
        return;
    }
    // unlock in sendEmptyEntries
    r.sendEmptyEntries(true);
}

private void sendEmptyEntries(final boolean isHeartbeat,
                              final RpcResponseClosure<AppendEntriesResponse> heartBeatClosure) {
    final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
    // ...... code omitted
    try {
        final long monotonicSendTimeMs = Utils.monotonicMs();
        if (isHeartbeat) {
            final AppendEntriesRequest request = rb.build();
            // Sending a heartbeat request
            this.heartbeatCounter++;
            RpcResponseClosure<AppendEntriesResponse> heartbeatDone;
            // Prefer passed-in closure.
            if (heartBeatClosure != null) {
                heartbeatDone = heartBeatClosure;
            } else {
                heartbeatDone = new RpcResponseClosureAdapter<AppendEntriesResponse>() {

                    @Override
                    public void run(final Status status) {
                        // Important: heartbeat response callback
                        onHeartbeatReturned(Replicator.this.id, status, request, getResponse(), monotonicSendTimeMs);
                    }
                };
            }
            this.heartbeatInFly = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(), request,
                this.options.getElectionTimeoutMs() / 2, heartbeatDone);
        } else {
            // ...... code omitted
        }
        LOG.debug("Node {} send HeartbeatRequest to {} term {} lastCommittedIndex {}", this.options.getNode()
            .getNodeId(), this.options.getPeerId(), this.options.getTerm(), rb.getCommittedIndex());
    } finally {
            unlockId();
    }
}

接着来看下节点在接受到心跳请求后的处理逻辑,如果留意了前面发送心跳请求的代码,可以看到心跳请求实际上也是一个AppendEntriesRequest。节点在收到请求后的处理逻辑在NodeImpl#handleAppendEntriesRequest中。

对于heartbeat request而言,核心逻辑如下

  1. 进行term与preLogTerm和preLogIndex的校验,以确保心跳请求的合法性。
  2. 更新leader最新请求时间
  3. 此外Leader还会顺便根据心跳提交日志。

    @Override
    public Message handleAppendEntriesRequest(final AppendEntriesRequest request, final RpcRequestClosure done) {
        boolean doUnlock = true;
        final long startMs = Utils.monotonicMs();
        this.writeLock.lock();
        final int entriesCount = request.getEntriesCount();
        boolean success = false;
        try {
            // 1. Check the state of the current node
            if (!this.state.isActive()) {
                return RpcFactoryHelper //
                    .responseFactory() //
                    .newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EINVAL,
                        "Node %s is not in active state, state %s.", getNodeId(), this.state.name());
            }

            final PeerId serverId = new PeerId();
            // 2. Validate the sender's serverId
            if (!serverId.parse(request.getServerId())) {
                return RpcFactoryHelper //
                    .responseFactory() //
                    .newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EINVAL,
                        "Parse serverId failed: %s.", request.getServerId());
            }

            // 3. Reject requests from a stale term
            if (request.getTerm() < this.currTerm) {
                return AppendEntriesResponse.newBuilder() //
                    .setSuccess(false) //
                    .setTerm(this.currTerm) //
                    .build();
            }

            // 4. Check the term; if this node is not yet a valid follower, step down
            checkStepDown(request.getTerm(), serverId);
            // 5. The sender is not the leader this node knows: more than one leader exists in the cluster
            if (!serverId.equals(this.leaderId)) {
                // Increase the term by 1 and make both leaders step down to minimize the
                // loss of split brain
                stepDown(request.getTerm() + 1, false, new Status(RaftError.ELEADERCONFLICT,
                    "More than one leader in the same term."));
                return AppendEntriesResponse.newBuilder() //
                    .setSuccess(false) //
                    .setTerm(request.getTerm() + 1) //
                    .build();
            }

            // 6. Record the time of the leader's latest request
            updateLastLeaderTimestamp(Utils.monotonicMs());

            // 7. While a snapshot is being installed, reject requests that carry entries
            if (entriesCount > 0 && this.snapshotExecutor != null && this.snapshotExecutor.isInstallingSnapshot()) {
                LOG.warn("Node {} received AppendEntriesRequest while installing snapshot.", getNodeId());
                return RpcFactoryHelper //
                    .responseFactory() //
                    .newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EBUSY,
                        "Node %s:%s is installing snapshot.", this.groupId, this.serverId);
            }

            final long prevLogIndex = request.getPrevLogIndex();
            final long prevLogTerm = request.getPrevLogTerm();
            final long localPrevLogTerm = this.logManager.getTerm(prevLogIndex);
            // 8. Compare the local log with the log position claimed by the sender
            if (localPrevLogTerm != prevLogTerm) {
                final long lastLogIndex = this.logManager.getLastLogIndex();
                return AppendEntriesResponse.newBuilder() //
                    .setSuccess(false) //
                    .setTerm(this.currTerm) //
                    .setLastLogIndex(lastLogIndex) //
                    .build();
            }

            // 9. No entries: this is either a probe request or a heartbeat request
            if (entriesCount == 0) {
                // heartbeat or probe request
                final AppendEntriesResponse.Builder respBuilder = AppendEntriesResponse.newBuilder() //
                    .setSuccess(true) //
                    .setTerm(this.currTerm) //
                    .setLastLogIndex(this.logManager.getLastLogIndex());
                doUnlock = false;
                this.writeLock.unlock();
                // see the comments at FollowerStableClosure#run()
                this.ballotBox.setLastCommittedIndex(Math.min(request.getCommittedIndex(), prevLogIndex));
                return respBuilder.build();
            }

            // ... log replication logic omitted
            checkAndSetConfiguration(true);
            success = true;
            return null;
        } finally {
            if (doUnlock) {
                this.writeLock.unlock();
            }
            final long processLatency = Utils.monotonicMs() - startMs;
            if (entriesCount == 0) {
                this.metrics.recordLatency("handle-heartbeat-requests", processLatency);
            } else {
                this.metrics.recordLatency("handle-append-entries", processLatency);
            }
            if (success) {
                // Don't stats heartbeat requests.
                this.metrics.recordSize("handle-append-entries-count", entriesCount);
            }
        }
    }
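
The heartbeat path above boils down to a handful of checks. Here is a minimal, self-contained sketch of that flow. It is not the jraft implementation: `HeartbeatCheckSketch`, `FollowerState`, and `Result` are hypothetical names, the log is reduced to an array of terms, a term of 0 is assumed to mean "no such entry", and locking, step-down, and snapshot handling are left out.

```java
/** Hypothetical, simplified model of a follower handling a heartbeat (not jraft code). */
final class HeartbeatCheckSketch {

    /** Outcome returned to the leader: success flag plus the follower's current term. */
    static final class Result {
        final boolean success;
        final long term;

        Result(final boolean success, final long term) {
            this.success = success;
            this.term = term;
        }
    }

    /** Minimal follower state; logTerms[i] holds the term of the entry at index i + 1. */
    static final class FollowerState {
        long currTerm;
        long lastLeaderTimestamp;
        long committedIndex;
        final long[] logTerms;

        FollowerState(final long currTerm, final long[] logTerms) {
            this.currTerm = currTerm;
            this.logTerms = logTerms.clone();
        }

        /** Term at the given index; 0 stands for "no such entry" (assumption of this sketch). */
        long termAt(final long index) {
            if (index < 1 || index > this.logTerms.length) {
                return 0;
            }
            return this.logTerms[(int) (index - 1)];
        }
    }

    static Result handleHeartbeat(final FollowerState s, final long reqTerm, final long prevLogIndex,
                                  final long prevLogTerm, final long leaderCommit, final long nowMs) {
        // 1. A request from a stale term is rejected and the newer term is reported back.
        if (reqTerm < s.currTerm) {
            return new Result(false, s.currTerm);
        }
        // Adopt the sender's term (stands in for the real step-down logic).
        s.currTerm = reqTerm;
        // 2. Remember when the leader was last heard from; this is the lease start.
        s.lastLeaderTimestamp = nowMs;
        // 3. The entry before the request must match the local log.
        if (s.termAt(prevLogIndex) != prevLogTerm) {
            return new Result(false, s.currTerm);
        }
        // 4. Advance the commit index, never past what has been matched and never backwards.
        s.committedIndex = Math.max(s.committedIndex, Math.min(leaderCommit, prevLogIndex));
        return new Result(true, s.currTerm);
    }

    public static void main(final String[] args) {
        final FollowerState follower = new FollowerState(2, new long[] { 1, 1, 2 });
        final Result stale = handleHeartbeat(follower, 1, 3, 2, 3, 100L);
        final Result ok = handleHeartbeat(follower, 2, 3, 2, 5, 200L);
        System.out.println("stale accepted: " + stale.success);
        System.out.println("valid accepted: " + ok.success + ", committedIndex=" + follower.committedIndex);
    }
}
```

Note that the timestamp is refreshed before the log comparison, mirroring the order of steps 6 and 8 in the listing: a heartbeat from a legitimate leader renews the lease even when the logs do not match yet.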

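In point 2 of the core logic above, the timestamp of the leader's last request is the key variable for deciding whether the current leader is still valid; it acts as the start time of a lease. The electionTimer contains the logic below: isCurrentLeaderValid checks whether the current leader is still valid and, if it is, this election timeout is skipped. The check itself is simply whether less than electionTimeoutMs has elapsed since the leader's last request.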

private void handleElectionTimeout() {
    boolean doUnlock = true;
    this.writeLock.lock();
    try {
        if (this.state != State.STATE_FOLLOWER) {
            return;
        }
        if (isCurrentLeaderValid()) {
            return;
        }
        // ..... code omitted
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}

private boolean isCurrentLeaderValid() {
    return Utils.monotonicMs() - this.lastLeaderTimestamp < this.options.getElectionTimeoutMs();
}
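
To make the lease idea concrete, here is a small sketch of the same check with the clock passed in explicitly. It is an illustration under assumptions rather than jraft code: `LeaderLeaseSketch` and `shouldStartPreVote` are hypothetical names, and time is supplied by the caller so the behavior is easy to follow.

```java
/** Hypothetical sketch of a lease-style leader validity check (not jraft code). */
final class LeaderLeaseSketch {

    private final long electionTimeoutMs;
    private long lastLeaderTimestamp;

    LeaderLeaseSketch(final long electionTimeoutMs, final long nowMs) {
        this.electionTimeoutMs = electionTimeoutMs;
        this.lastLeaderTimestamp = nowMs;
    }

    /** Called whenever a valid request from the leader arrives; restarts the lease. */
    void onLeaderRequest(final long nowMs) {
        this.lastLeaderTimestamp = nowMs;
    }

    /** The leader is considered valid while less than electionTimeoutMs has elapsed. */
    boolean isCurrentLeaderValid(final long nowMs) {
        return nowMs - this.lastLeaderTimestamp < this.electionTimeoutMs;
    }

    /** Only a follower whose leader lease has expired goes on to the pre-vote. */
    boolean shouldStartPreVote(final boolean isFollower, final long nowMs) {
        return isFollower && !isCurrentLeaderValid(nowMs);
    }

    /** A monotonic millisecond clock, so wall-clock adjustments cannot shorten the lease. */
    static long monotonicMs() {
        return System.nanoTime() / 1_000_000L;
    }

    public static void main(final String[] args) {
        final long start = monotonicMs();
        final LeaderLeaseSketch lease = new LeaderLeaseSketch(1000L, start);
        lease.onLeaderRequest(start + 500L);
        System.out.println("valid at +1400ms: " + lease.isCurrentLeaderValid(start + 1400L));
        System.out.println("pre-vote at +1500ms: " + lease.shouldStartPreVote(true, start + 1500L));
    }
}
```

The comparison is strict, as in the listing above: once exactly electionTimeoutMs has elapsed since the last request, the lease is treated as expired.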

Now let's look at how the leader's Replicator handles the heartbeat response:

    static void onHeartbeatReturned(final ThreadId id, final Status status, final AppendEntriesRequest request,
                                    final AppendEntriesResponse response, final long rpcSendTime) {
        if (id == null) {
            return;
        }
        final long startTimeMs = Utils.nowMs();
        Replicator r;
        if ((r = (Replicator) id.lock()) == null) {
            return;
        }
        boolean doUnlock = true;
        try {
            final boolean isLogDebugEnabled = LOG.isDebugEnabled();
            // ...... code omitted (declares and fills the StringBuilder sb used for debug logging)
            if (!status.isOk()) {
                if (isLogDebugEnabled) {
                    sb.append(" fail, sleep, status=") //
                        .append(status);
                    LOG.debug(sb.toString());
                }
                r.setState(State.Probe);
                notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
                // Timeout or some other network problem: schedule the heartbeat again
                r.startHeartbeatTimer(startTimeMs);
                return;
            }
            r.consecutiveErrorTimes = 0;
            // The peer's term is greater than the replicator's term
            if (response.getTerm() > r.options.getTerm()) {
                final NodeImpl node = r.options.getNode();
                r.notifyOnCaughtUp(RaftError.EPERM.getNumber(), true);
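                // Destroy the replicator
                r.destroy();
                // Two cases are possible here:
                // 1. The current node is a stale leader, so it runs the stepDown logic
                // 2. The current node's term is greater than the response term, so nothing happens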
                node.increaseTermTo(response.getTerm(), new Status(RaftError.EHIGHERTERMRESPONSE,
                    "Leader receives higher term heartbeat_response from peer:%s, group:%s", r.options.getPeerId(), r.options.getGroupId()));
                return;
            }
            if (!response.getSuccess() && response.hasLastLogIndex()) {
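                // Important: a failed heartbeat means the leader's log no longer lines up with the follower's.
                // Send a probe request to find the difference so it can be repaired later.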
                doUnlock = false;
                r.sendProbeRequest();
                r.startHeartbeatTimer(startTimeMs);
                return;
            }
            // Everything is fine
            if (rpcSendTime > r.lastRpcSendTimestamp) {
                r.lastRpcSendTimestamp = rpcSendTime;
            }
            r.startHeartbeatTimer(startTimeMs);
        } finally {
            if (doUnlock) {
                id.unlock();
            }
        }
    }
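
The branches in onHeartbeatReturned can be summarized as a small decision function. The sketch below is only a reading aid under assumptions: `HeartbeatResponseSketch`, `Action`, and `decide` are hypothetical names, and the real method performs the side effects (timers, listeners, locking) inline instead of returning a value.

```java
/** Hypothetical summary of how a leader reacts to a heartbeat response (not jraft code). */
final class HeartbeatResponseSketch {

    enum Action {
        RETRY_HEARTBEAT, // RPC failed: restart the heartbeat timer and try again
        STEP_DOWN,       // peer reported a higher term: this leader is stale
        SEND_PROBE,      // follower's log does not match: probe to find the difference
        CONTINUE         // all good: record the send time and schedule the next heartbeat
    }

    static Action decide(final boolean rpcOk, final long responseTerm, final long replicatorTerm,
                         final boolean success, final boolean hasLastLogIndex) {
        if (!rpcOk) {
            return Action.RETRY_HEARTBEAT;
        }
        if (responseTerm > replicatorTerm) {
            return Action.STEP_DOWN;
        }
        if (!success && hasLastLogIndex) {
            return Action.SEND_PROBE;
        }
        return Action.CONTINUE;
    }

    public static void main(final String[] args) {
        System.out.println(decide(false, 1, 1, false, false));
        System.out.println(decide(true, 3, 2, false, false));
        System.out.println(decide(true, 2, 2, false, true));
        System.out.println(decide(true, 2, 2, true, true));
    }
}
```

The order matters: a transport failure is handled before the response body is inspected, and the term check comes before the log check, since a stale leader has no business repairing anyone's log.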

1.4. Summary

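Compared with the original Raft algorithm, jraft adds quite a few optimizations. The preVote mechanism avoids unnecessary votes in the cluster, and a leader can raise its own term to cope with a candidate whose term is greater than its own, which is another way of avoiding unnecessary elections. On top of that, jraft implements a leader lease: within the lease period, peers unconditionally treat the leader as healthy, so election attempts fail fast.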