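1. Election
1.1. Raft Election
First, a quick recap of how a Raft election works:
- When a follower's heartbeat times out, it turns itself into a candidate, increments its term, and sends an election request (RequestVote RPC) to the other nodes.
- A node that receives the RequestVote RPC responds according to the request:
  - If the request's term is greater than its own, it updates its term and sets itself to follower.
  - It checks whether the log information carried by the request is at least as up to date as its own; if so, it grants its vote.
- The candidate that sent the RequestVote RPC changes its state according to the voting result, or, if the vote fails, waits for the next timeout to start another election.
A quick plug: for a more detailed introduction and implementation details, see my article "raft -- (1) Election".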
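1.2. JRaft Election
Now consider the scenario in the figure above. S0 is the leader in some term, and S4, having missed heartbeats because of network jitter, becomes a candidate and starts an election. Because S4's term is now higher than the leader's, S0 is forced to step down to follower when it receives the RequestVote RPC, even though S0's log may be newer than S4's. The cluster is then briefly unavailable until some node wins an election.
Figure 1: A possible election result
Was this election necessary? The Term 1 leader S0 actually holds the most complete log; only its term lags behind S4's. So something odd happens: an illegitimate node makes the whole cluster yield, forcing an election and raising the term of the entire cluster.
To avoid this situation, sofa-jraft adds an extra step: prevote. As the name suggests, prevote is only a preliminary vote; only a node that passes the pre-vote may start a real vote.
- When a follower's election timeout fires, it does not start a real vote right away. Instead, it takes a tentatively increased term (its own term is not actually incremented) and sends a preliminary vote request (PreRequestVote RPC) carrying that term to the other nodes.
- On receiving this RPC, the other nodes check whether the requester is eligible to become the new leader (they compare terms and check that the requester's log is at least as up to date as their own).
- The node that started the pre-vote increments its term and starts the real vote only after more than half of the nodes have granted the pre-vote.
Back to the example above. After electionTimeout, S4 sends a pre-vote request (note that S4's term has not been incremented and it is still a follower). Since S4's log very likely lags behind S0's, the pre-vote fails, and S4 sensibly gives up incrementing its term and starting an election.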
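The pre-vote rule described above can be sketched in a few lines. This is a minimal illustration, not jraft code: `PreVoteSketch`, `Peer`, `grantPreVote` and `preVotePasses` are hypothetical names, and the "does this peer still see a healthy leader" check is reduced to a boolean flag.

```java
import java.util.List;

/** Hypothetical model of the pre-vote decision; none of these names are jraft APIs. */
final class PreVoteSketch {

    static final class Peer {
        final long term;
        final long lastLogTerm;
        final long lastLogIndex;
        final boolean seesValidLeader; // assumption: stands in for a leader-liveness check

        Peer(long term, long lastLogTerm, long lastLogIndex, boolean seesValidLeader) {
            this.term = term;
            this.lastLogTerm = lastLogTerm;
            this.lastLogIndex = lastLogIndex;
            this.seesValidLeader = seesValidLeader;
        }

        /** Answers a pre-vote request without modifying this peer's own term. */
        boolean grantPreVote(long reqTerm, long reqLastLogTerm, long reqLastLogIndex) {
            if (seesValidLeader) {
                return false; // a healthy leader exists, no election is needed
            }
            if (reqTerm < term) {
                return false; // requester is behind
            }
            if (reqLastLogTerm != lastLogTerm) {
                return reqLastLogTerm > lastLogTerm;
            }
            return reqLastLogIndex >= lastLogIndex; // log must not be older than ours
        }
    }

    /** True if the requester (which counts its own vote) gets a majority of pre-votes. */
    static boolean preVotePasses(Peer requester, List<Peer> others) {
        int granted = 1;
        for (Peer p : others) {
            // The request carries term + 1, but requester.term itself stays unchanged.
            if (p.grantPreVote(requester.term + 1, requester.lastLogTerm, requester.lastLogIndex)) {
                granted++;
            }
        }
        int clusterSize = others.size() + 1;
        return granted >= clusterSize / 2 + 1;
    }
}
```

In the S0/S4 scenario every other node still sees a healthy leader, so `preVotePasses` returns false and no term in the cluster changes.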
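A complete pre-vote + vote flow looks like this:
1.3. Source Code Walkthrough
Note: this article was written on December 4, 2024, and all code is taken from the master branch as of that date.
The core jraft logic lives in the class com.alipay.sofa.jraft.core.NodeImpl, which is the abstraction of a raft node; its election-related member variables are shown below. I have a habit when reading source code: deleting code. To make things easier to understand, I usually create a new branch and delete as much code unrelated to the logic I want to read as possible, so that this part of the logic is clear enough. While deleting, I usually also pick up some understanding of how the other features are implemented. It is clumsy, but it does work for me, and I would be glad to hear about more efficient ways of reading source code.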
public class NodeImpl implements Node, RaftServerService {
private static final Logger LOG = LoggerFactory
.getLogger(NodeImpl.class);
public static final AtomicInteger GLOBAL_NUM_NODES = new AtomicInteger(
0);
protected final Lock writeLock = this.readWriteLock
.writeLock();
protected final Lock readLock = this.readWriteLock
.readLock();
private volatile State state;
private long currTerm;
private volatile long lastLeaderTimestamp;
private PeerId leaderId = new PeerId();
private PeerId votedId;
private final Ballot voteCtx = new Ballot();
private final Ballot prevVoteCtx = new Ballot();
private ConfigurationEntry conf;
private final PeerId serverId;
/** Other services */
private final ConfigurationCtx confCtx;
/** Timers */
private Scheduler timerManager;
private RepeatedTimer electionTimer;
private RepeatedTimer voteTimer;
private RepeatedTimer stepDownTimer;
}
If you have read the first article in my raft series, you can spot the core election-related members right in the member variables:
private RepeatedTimer electionTimer;
private RepeatedTimer voteTimer;
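Based on what we already know about the Raft algorithm, electionTimer is most likely the timed task that starts an election when heartbeats time out, and voteTimer is the timed task for a timeout during voting. Compared with the raft node we implemented ourselves, there is one more timer here: stepDownTimer. To give the conclusion up front, the leader uses this timer to check itself and to maintain its connections to the other nodes.
The figure below shows each timer and its scope.
1.3.1. Triggering an Election
Let's first locate where the ET (ElectionTimer) is started. IDEA's usage hints show that ET is started in only one place.
The corresponding method is stepDown. Its core logic sets the current node's state to Follower and, if the node is not a learner (a learner does not take part in raft elections but does replicate the leader's log), starts the electionTimer.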
private void stepDown(final long term, final boolean wakeupCandidate, final Status status) {
if (!this.state.isActive()) {
return;
}
if (this.state == State.STATE_CANDIDATE) {
// A candidate is about to become a follower, so stop the voteTimer
stopVoteTimer();
} else if (this.state.compareTo(State.STATE_TRANSFERRING) <= 0) {
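// A leader (or a leader that is transferring leadership) is about to become a follower, so stop the stepDownTimer
stopStepDownTimer();
// Clear pending tasks; ballotBox relates to log voting, ignore it for now
this.ballotBox.clearPendingTasks();
// signal fsm leader stop immediately
// Stops state-machine-related work; ignore it for now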
if (this.state == State.STATE_LEADER) {
onLeaderStop(status);
}
}
// reset leader_id
resetLeaderId(PeerId.emptyPeer(), status);
// Step down to follower
this.state = State.STATE_FOLLOWER;
// meta state
if (term > this.currTerm) {
this.currTerm = term;
this.votedId = PeerId.emptyPeer();
this.metaStorage.setTermAndVotedFor(term, this.votedId);
}
// ...... some code omitted
if (!isLearner()) {
this.electionTimer.restart();
} else {
LOG.info("Node {} is a learner, election timer is not started.", this.nodeId);
}
}
Next, look at what happens when the ElectionTimer fires: it calls handleElectionTimeout. RepeatedTimer is jraft's abstraction for timed tasks.
this.electionTimer = new RepeatedTimer(name, this.options.getElectionTimeoutMs(),
TIMER_FACTORY.getElectionTimer(this.options.isSharedElectionTimer(), name)) {
@Override
protected void onTrigger() {
handleElectionTimeout();
}
@Override
protected int adjustTimeout(final int timeoutMs) {
return randomTimeout(timeoutMs);
}
};
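The `adjustTimeout` hook above randomizes each election timeout so that followers are unlikely to time out at the same moment and split the vote. A minimal sketch of such a randomization follows; `RandomTimeoutSketch` and the `maxDelayMs` parameter are assumptions made for illustration, and the exact formula jraft uses may differ.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper, not the jraft implementation. */
final class RandomTimeoutSketch {

    /** Returns a timeout in the range [timeoutMs, timeoutMs + maxDelayMs). */
    static int randomTimeout(int timeoutMs, int maxDelayMs) {
        if (maxDelayMs <= 0) {
            return timeoutMs; // no jitter configured
        }
        return ThreadLocalRandom.current().nextInt(timeoutMs, timeoutMs + maxDelayMs);
    }
}
```

With a base of 1000 ms and a maximum delay of 1000 ms, each node would wait somewhere between 1000 and 1999 ms before starting a pre-vote.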
The logic of handleElectionTimeout is fairly simple:
- Check whether this election is legitimate
- Start the pre-vote
private void handleElectionTimeout() {
boolean doUnlock = true;
this.writeLock.lock();
try {
// 1.1 Check that the node is a follower; only a follower may start an election
if (this.state != State.STATE_FOLLOWER) {
return;
}
// 1.2 If the current leader is still valid (its heartbeat is fresh), give up the election
if (isCurrentLeaderValid()) {
return;
}
resetLeaderId(PeerId.emptyPeer(), new Status(RaftError.ERAFTTIMEDOUT, "Lost connection from leader %s.",
this.leaderId));
// 1.3 Check whether this node is allowed to start an election; jraft supports per-node election priority
if (!allowLaunchElection()) {
return;
}
doUnlock = false;
preVote();
} finally {
if (doUnlock) {
this.writeLock.unlock();
}
}
}
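1.3.2. PreVote
- Pre-vote request
preVote is the core implementation of the pre-vote. Specifically, it:
- Sends pre-vote requests to the other nodes carrying a tentatively increased term (currTerm + 1; the local term is left unchanged).
- Votes for itself.
Note the ABA check on currTerm here. It is a good concurrency practice: release the lock during a blocking operation, then, after re-acquiring it, use the check to confirm that the state did not change unexpectedly while the lock was released.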
// in writeLock
private void preVote() {
long oldTerm;
try {
// ..... some code omitted
oldTerm = this.currTerm;
} finally {
this.writeLock.unlock();
}
final LogId lastLogId = this.logManager.getLastLogId(true);
boolean doUnlock = true;
this.writeLock.lock();
try {
// The lock was released, so run the ABA check to make sure this run is still valid
if (oldTerm != this.currTerm) {
return;
}
// Initialize the ballot
this.prevVoteCtx.init(this.conf.getConf(), this.conf.isStable() ? null : this.conf.getOldConf());
for (final PeerId peer : this.conf.listPeers()) {
if (peer.equals(this.serverId)) {
continue;
}
if (!this.rpcService.connect(peer.getEndpoint())) {
continue;
}
final OnPreVoteRpcDone done = new OnPreVoteRpcDone(peer, this.currTerm);
done.request = RequestVoteRequest.newBuilder() //
.setPreVote(true) // it's a pre-vote request.
.setGroupId(this.groupId) //
.setServerId(this.serverId.toString()) //
.setPeerId(peer.toString()) //
// Tentatively increased term
.setTerm(this.currTerm + 1) // next term
.setLastLogIndex(lastLogId.getIndex()) //
.setLastLogTerm(lastLogId.getTerm()) //
.build();
this.rpcService.preVote(peer.getEndpoint(), done.request, done);
}
this.prevVoteCtx.grant(this.serverId);
if (this.prevVoteCtx.isGranted()) {
doUnlock = false;
// The pre-vote has been granted, start the real vote
electSelf();
}
} finally {
if (doUnlock) {
this.writeLock.unlock();
}
}
}
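The unlock-then-recheck pattern used by preVote (what this article calls the ABA check) can be isolated as follows. This is a minimal sketch under assumed names: `UnlockRecheckSketch`, `refresh` and `bumpTerm` are hypothetical, and the slow operation is passed in as a `LongSupplier` standing in for a blocking log read.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongSupplier;

/** Hypothetical illustration of "release the lock, do slow work, re-lock, re-check". */
final class UnlockRecheckSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private long term;                  // guarded by lock
    private long lastSeenLogIndex = -1; // guarded by lock

    /** Returns false if the term changed while the slow read ran without the lock. */
    boolean refresh(LongSupplier slowRead) {
        long oldTerm;
        lock.lock();
        try {
            oldTerm = term; // remember the state we started from
        } finally {
            lock.unlock();
        }

        long value = slowRead.getAsLong(); // blocking work, lock not held

        lock.lock();
        try {
            if (oldTerm != term) {
                return false; // the state moved on, abandon this run
            }
            lastSeenLogIndex = value;
            return true;
        } finally {
            lock.unlock();
        }
    }

    void bumpTerm() {
        lock.lock();
        try {
            term++;
        } finally {
            lock.unlock();
        }
    }

    long lastSeenLogIndex() {
        lock.lock();
        try {
            return lastSeenLogIndex;
        } finally {
            lock.unlock();
        }
    }
}
```

The trade-off is that other threads can make progress during the slow read, at the cost of occasionally discarding work whose precondition no longer holds.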
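- Handling the pre-vote request
Before looking at the voting logic, let's see how a node handles the pre-vote RequestVoteRequest and how the response is handled.
- A node grants its pre-vote only when the request's term is greater than or equal to its own, the requester's log is not older than its own, and (as an optimization) the node believes the leader is no longer healthy.
- One interesting detail is the use of do { ... } while (false) to control the flow, so that break can leave the block early. Compared with extracting a method and returning early, this avoids passing parameters and also reduces if/else branches. This idiom is everywhere in the jraft code.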
@Override
public Message handlePreVoteRequest(final RequestVoteRequest request) {
boolean doUnlock = true;
this.writeLock.lock();
try {
if (!this.state.isActive()) {
return RpcFactoryHelper //
.responseFactory() //
.newResponse(RequestVoteResponse.getDefaultInstance(), RaftError.EINVAL,
"Node %s is not in active state, state %s.", getNodeId(), this.state.name());
}
final PeerId candidateId = new PeerId();
if (!candidateId.parse(request.getServerId())) {
return RpcFactoryHelper //
.responseFactory() //
.newResponse(RequestVoteResponse.getDefaultInstance(), RaftError.EINVAL,
"Parse candidateId failed: %s.", request.getServerId());
}
boolean granted = false;
// noinspection ConstantConditions
do {
// The requesting candidateId is not part of the configuration: refuse to vote
if (!this.conf.contains(candidateId)) {
break;
}
// The current leader is alive: refuse to vote
if (this.leaderId != null && !this.leaderId.isEmpty() && isCurrentLeaderValid()) {
break;
}
// The request term is lower than this node's term: refuse to vote
if (request.getTerm() < this.currTerm) {
// If this node is the leader, make sure the replicator for the requesting peer is working
checkReplicator(candidateId);
break;
}
// If this node is the leader, make sure the replicator for the requesting peer is working
checkReplicator(candidateId);
doUnlock = false;
this.writeLock.unlock();
// Release the lock before the blocking operation
final LogId lastLogId = this.logManager.getLastLogId(true);
doUnlock = true;
this.writeLock.lock();
// Compare the local last log with the requester's last log: grant only if the requester's log is not older
final LogId requestLastLogId = new LogId(request.getLastLogIndex(), request.getLastLogTerm());
granted = requestLastLogId.compareTo(lastLogId) >= 0;
} while (false);
return RequestVoteResponse.newBuilder() //
.setTerm(this.currTerm) //
.setGranted(granted) //
.build();
} finally {
if (doUnlock) {
this.writeLock.unlock();
}
}
}
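The `do { ... } while (false)` idiom seen in handlePreVoteRequest can be shown in isolation. The sketch below is illustrative only: `BreakBlockSketch` and its parameters are hypothetical, and the response is reduced to a string so that the shared tail is visible.

```java
/** Hypothetical illustration of using do/while(false) as a breakable block. */
final class BreakBlockSketch {

    static String respond(boolean knownPeer, boolean leaderAlive, long reqTerm, long currTerm, boolean logOk) {
        boolean granted = false;
        do {
            if (!knownPeer) {
                break; // unknown requester
            }
            if (leaderAlive) {
                break; // a healthy leader exists
            }
            if (reqTerm < currTerm) {
                break; // stale request
            }
            granted = logOk;
        } while (false);
        // Shared tail: every path, granted or not, builds the response here.
        return "term=" + currTerm + ",granted=" + granted;
    }
}
```

Every rejection path falls through to the same response-building code, which is what a chain of early returns would otherwise have to duplicate.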
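- Handling the pre-vote response
- Make sure the response is valid: the node is still a follower and its term has not changed.
- If the peer's term is greater than its own, the node steps down.
- Update the pre-vote ballot. If the pre-vote passes (majority rule), start the formal election.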
public void handlePreVoteResponse(final PeerId peerId, final long term, final RequestVoteResponse response) {
boolean doUnlock = true;
this.writeLock.lock();
try {
// 1. Make sure this node is still a follower
if (this.state != State.STATE_FOLLOWER) {
return;
}
// 2. This node's term has changed, so this is a stale vote response; skip it
if (term != this.currTerm) {
return;
}
// The peer's term is higher than ours, so this node is out of date;
// stepDown resets the follower-related state
if (response.getTerm() > this.currTerm) {
stepDown(response.getTerm(), false, new Status(RaftError.EHIGHERTERMRESPONSE,
"Raft node receives higher term pre_vote_response."));
return;
}
if (response.getGranted()) {
this.prevVoteCtx.grant(peerId);
if (this.prevVoteCtx.isGranted()) {
doUnlock = false;
// The pre-vote passed, start the election
electSelf();
}
}
} finally {
if (doUnlock) {
this.writeLock.unlock();
}
}
}
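The majority counting behind `prevVoteCtx.grant(...)` and `isGranted()` can be sketched as a small ballot that also handles a joint configuration (new and old peer sets must both reach a majority). `BallotSketch` is a hypothetical simplification that identifies peers by strings and assumes each list holds distinct ids; it is not the jraft `Ballot` class.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical majority ballot; peers are plain strings for illustration. */
final class BallotSketch {
    private final Set<String> pending = new HashSet<>();
    private final Set<String> oldPending = new HashSet<>();
    private int quorum;    // votes still missing in the current configuration
    private int oldQuorum; // votes still missing in the old configuration, 0 if none

    /** oldConf may be null when the configuration is stable. */
    void init(List<String> conf, List<String> oldConf) {
        pending.clear();
        oldPending.clear();
        pending.addAll(conf);
        quorum = conf.size() / 2 + 1;
        oldQuorum = 0;
        if (oldConf != null) {
            oldPending.addAll(oldConf);
            oldQuorum = oldConf.size() / 2 + 1;
        }
    }

    /** Records a vote; a second vote from the same peer is ignored. */
    void grant(String peer) {
        if (pending.remove(peer)) {
            quorum--;
        }
        if (oldPending.remove(peer)) {
            oldQuorum--;
        }
    }

    boolean isGranted() {
        return quorum <= 0 && oldQuorum <= 0;
    }
}
```

In a stable five-node group, the node's own vote plus two granted responses satisfy the ballot; during a membership change both configurations must independently reach a majority.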
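1.3.3. ElectSelf: the Formal Vote
When the formal vote starts, the follower stops its electionTimer, changes its state to candidate, starts the voteTimer, and increments its term; see the code marked (1).
The candidate then sends a RequestVoteRequest once more, except that this time the request's preVote field is false.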
private void electSelf() {
long oldTerm;
try {
if (!this.conf.contains(this.serverId)) {
return;
}
// (1) begin
if (this.state == State.STATE_FOLLOWER) {
this.electionTimer.stop();
}
resetLeaderId(PeerId.emptyPeer(), new Status(RaftError.ERAFTTIMEDOUT,
"A follower's leader_id is reset to NULL as it begins to request_vote."));
this.state = State.STATE_CANDIDATE;
this.currTerm++;
this.votedId = this.serverId.copy();
LOG.debug("Node {} start vote timer, term={} .", getNodeId(), this.currTerm);
this.voteTimer.start();
this.voteCtx.init(this.conf.getConf(), this.conf.isStable() ? null : this.conf.getOldConf());
oldTerm = this.currTerm;
// (1) end
} finally {
this.writeLock.unlock();
}
final LogId lastLogId = this.logManager.getLastLogId(true);
this.writeLock.lock();
try {
// vote need defense ABA after unlock&writeLock
if (oldTerm != this.currTerm) {
return;
}
for (final PeerId peer : this.conf.listPeers()) {
if (peer.equals(this.serverId)) {
continue;
}
if (!this.rpcService.connect(peer.getEndpoint())) {
continue;
}
final OnRequestVoteRpcDone done = new OnRequestVoteRpcDone(peer, this.currTerm, this);
done.request = RequestVoteRequest.newBuilder() //
.setPreVote(false) // It's not a pre-vote request.
.setGroupId(this.groupId) //
.setServerId(this.serverId.toString()) //
.setPeerId(peer.toString()) //
.setTerm(this.currTerm) //
.setLastLogIndex(lastLogId.getIndex()) //
.setLastLogTerm(lastLogId.getTerm()) //
.build();
this.rpcService.requestVote(peer.getEndpoint(), done.request, done);
}
this.metaStorage.setTermAndVotedFor(this.currTerm, this.serverId);
this.voteCtx.grant(this.serverId);
if (this.voteCtx.isGranted()) {
becomeLeader();
}
} finally {
this.writeLock.unlock();
}
}
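As with the pre-vote, the node that receives a formal vote request handles it in handleRequestVoteRequest, and the candidate that sent the request handles the response in handleRequestVoteResponse.
- handleRequestVoteRequest
This is the core method for handling a RequestVoteRequest. Whether the node votes for the requester depends on three conditions:
- The requester's term >= its own
- The requester's log is not older than its own
- It has not yet voted in this round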
@Override
public Message handleRequestVoteRequest(final RequestVoteRequest request) {
boolean doUnlock = true;
this.writeLock.lock();
try {
// (1) validity checks: begin
if (!this.state.isActive()) {
return RpcFactoryHelper //
.responseFactory() //
.newResponse(RequestVoteResponse.getDefaultInstance(), RaftError.EINVAL,
"Node %s is not in active state, state %s.", getNodeId(), this.state.name());
}
final PeerId candidateId = new PeerId();
if (!candidateId.parse(request.getServerId())) {
return RpcFactoryHelper //
.responseFactory() //
.newResponse(RequestVoteResponse.getDefaultInstance(), RaftError.EINVAL,
"Parse candidateId failed: %s.", request.getServerId());
}
// (1) validity checks: end
do {
if (request.getTerm() >= this.currTerm) {
// If the request term is higher than ours, update the term and become a follower
if (request.getTerm() > this.currTerm) {
stepDown(request.getTerm(), false, new Status(RaftError.EHIGHERTERMRESPONSE,
"Raft node receives higher term RequestVoteRequest."));
}
} else {
// The request term is lower than ours: refuse to vote and break out
break;
}
doUnlock = false;
this.writeLock.unlock();
final LogId lastLogId = this.logManager.getLastLogId(true);
doUnlock = true;
this.writeLock.lock();
// ABA check
if (request.getTerm() != this.currTerm) {
LOG.warn("Node {} raise term {} when get lastLogId.", getNodeId(), this.currTerm);
break;
}
// Check that the requester's log is not older than ours
final boolean logIsOk = new LogId(request.getLastLogIndex(), request.getLastLogTerm())
.compareTo(lastLogId) >= 0;
// Three conditions must hold:
// 1. the requester's term >= ours
// 2. the requester's log is not older than ours
// 3. we have not yet voted in this round
if (logIsOk && (this.votedId == null || this.votedId.isEmpty())) {
stepDown(request.getTerm(), false, new Status(RaftError.EVOTEFORCANDIDATE,
"Raft node votes for some candidate, step down to restart election_timer."));
this.votedId = candidateId.copy();
this.metaStorage.setVotedFor(candidateId);
}
} while (false);
return RequestVoteResponse.newBuilder() //
.setTerm(this.currTerm) //
.setGranted(request.getTerm() == this.currTerm && candidateId.equals(this.votedId)) //
.build();
} finally {
if (doUnlock) {
this.writeLock.unlock();
}
}
}
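The "log not older than mine" test and the three voting conditions can be condensed into two pure functions. This is a sketch under assumptions: `VoteRuleSketch`, `compareLog` and `shouldGrant` are hypothetical names, and log positions are compared by term first and index second, which is the usual Raft ordering.

```java
/** Hypothetical condensed form of the formal vote rule; not jraft code. */
final class VoteRuleSketch {

    /** Orders log positions by term first, then by index. */
    static int compareLog(long termA, long indexA, long termB, long indexB) {
        int c = Long.compare(termA, termB);
        return c != 0 ? c : Long.compare(indexA, indexB);
    }

    static boolean shouldGrant(String candidate, long reqTerm, long reqLastTerm, long reqLastIndex,
                               long currTerm, long lastTerm, long lastIndex, String votedFor) {
        if (reqTerm < currTerm) {
            return false; // condition 1 fails: stale candidate
        }
        boolean newRound = reqTerm > currTerm; // a higher term resets any earlier vote
        if (!newRound && candidate.equals(votedFor)) {
            return true; // repeated request from the candidate we already voted for
        }
        boolean notVotedYet = newRound || votedFor == null || votedFor.isEmpty();
        boolean logIsOk = compareLog(reqLastTerm, reqLastIndex, lastTerm, lastIndex) >= 0;
        return logIsOk && notVotedYet;
    }
}
```

Note that a shorter log with a higher last term still counts as newer: `compareLog(2, 1, 1, 9)` is positive.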
- handleRequestVoteResponse
Next comes the response handler. Once votes from more than half of the nodes have been received, it calls becomeLeader to become the leader.
public void handleRequestVoteResponse(final PeerId peerId, final long term, final RequestVoteResponse response) {
this.writeLock.lock();
try {
// The election round this response belongs to ended long ago; return
if (this.state != State.STATE_CANDIDATE) {
return;
}
// ABA check: this response comes from a stale election; return
if (term != this.currTerm) {
return;
}
if (response.getTerm() > this.currTerm) {
// The peer's term is higher than ours: step down to follower and abort the election
stepDown(response.getTerm(), false, new Status(RaftError.EHIGHERTERMRESPONSE,
"Raft node receives higher term request_vote_response."));
return;
}
// check granted quorum?
if (response.getGranted()) {
this.voteCtx.grant(peerId);
if (this.voteCtx.isGranted()) {
becomeLeader();
}
}
} finally {
this.writeLock.unlock();
}
}
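- becomeLeader
After becoming leader, the node mainly does the following:
- Stops the voteTimer
- Changes its own state
- Starts the log replication threads for followers and learners
- Resets the local commit index
- Starts the stepDownTimer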
private void becomeLeader() {
Requires.requireTrue(this.state == State.STATE_CANDIDATE, "Illegal state: " + this.state);
// ...... logging omitted
// 1. Stop the vote timeout task
stopVoteTimer();
// 2. Change this node's state
this.state = State.STATE_LEADER;
this.leaderId = this.serverId.copy();
// 3. Reset the term of replicatorGroup; replicators are responsible for log replication
this.replicatorGroup.resetTerm(this.currTerm);
// Start follower's replicators
for (final PeerId peer : this.conf.listPeers()) {
if (peer.equals(this.serverId)) {
continue;
}
// 4. Start the replicator for each follower
if (!this.replicatorGroup.addReplicator(peer)) {
LOG.error("Fail to add a replicator, peer={}.", peer);
}
}
for (final PeerId peer : this.conf.listLearners()) {
// 5. Start the replicator for each learner
if (!this.replicatorGroup.addReplicator(peer, ReplicatorType.Learner)) {
LOG.error("Fail to add a learner replicator, peer={}.", peer);
}
}
// init commit manager
this.ballotBox.resetPendingIndex(this.logManager.getLastLogIndex() + 1);
// Register _conf_ctx to reject configuration changing before the first log
// is committed.
if (this.confCtx.isBusy()) {
throw new IllegalStateException();
}
this.confCtx.flush(this.conf.getConf(), this.conf.getOldConf());
// 6. Start the stepDownTimer
this.stepDownTimer.start();
}
1.3.4. Timer
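If you read the code above carefully, you will have noticed three election-related timers:
1. ElectionTimer. This one is easy to understand: it is the timed task a follower uses to detect that the leader has timed out, and it is started only while the node's state is follower. When it fires, it calls handleElectionTimeout(), which starts the pre-vote and vote logic.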
this.electionTimer = new RepeatedTimer(name, this.options.getElectionTimeoutMs(),
TIMER_FACTORY.getElectionTimer(this.options.isSharedElectionTimer(), name)) {
@Override
protected void onTrigger() {
handleElectionTimeout();
}
@Override
protected int adjustTimeout(final int timeoutMs) {
return randomTimeout(timeoutMs);
}
};
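2. VoteTimer: the timed task a candidate uses to detect a vote timeout; it is started only while the node's state is candidate. Depending on configuration, when it fires the node either steps down to follower and starts a pre-vote again, or directly starts another formal vote.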
this.voteTimer = new RepeatedTimer(name, this.options.getElectionTimeoutMs(), TIMER_FACTORY.getVoteTimer(
this.options.isSharedVoteTimer(), name)) {
@Override
protected void onTrigger() {
handleVoteTimeout();
}
@Override
protected int adjustTimeout(final int timeoutMs) {
return randomTimeout(timeoutMs);
}
};
private void handleVoteTimeout() {
this.writeLock.lock();
if (this.state != State.STATE_CANDIDATE) {
this.writeLock.unlock();
return;
}
if (this.raftOptions.isStepDownWhenVoteTimedout()) {
stepDown(this.currTerm, false, new Status(RaftError.ETIMEDOUT,
"Vote timeout: fail to get quorum vote-granted."));
preVote();
} else {
electSelf();
}
}
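3. StepDownTimer: started only while the node's state is leader.
This timer is the more interesting one. Before reading the source, consider an example: a raft cluster of 5 nodes, originally led by S1, suffers a network partition at some point. S0, S2 and S4 are in one network, and S2 is elected as the new leader; since this sub-cluster holds more than half of the nodes, it remains available. S1 and S3 are in the other network, and S1 keeps its identity as the old leader. Writing to S1 is still accepted, but problems appear when reading: because this sub-cluster holds no more than half of the original nodes, the entries written to it can never be committed.
To solve this problem, jraft gives the leader a step-down timeout. The core idea is that the leader must confirm that it is still a legitimate leader. How? By maintaining a lease with the other nodes; if the conditions for remaining leader are no longer met, it demotes itself to follower.
With this example in mind, stepDownTimer becomes much easier to follow. The core logic of this timed task is in handleStepDownTimeout.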
this.stepDownTimer = new RepeatedTimer(name, this.options.getElectionTimeoutMs() >> 1,
TIMER_FACTORY.getStepDownTimer(this.options.isSharedStepDownTimer(), name)) {
@Override
protected void onTrigger() {
handleStepDownTimeout();
}
};
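This method checks whether the leader is still valid in two passes:
The first pass runs under the read lock; even if the leader is found to be invalid, the node's state is not changed.
The second pass runs under the write lock; only here is the node's state changed if the leader is invalid.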
private void handleStepDownTimeout() {
do {
this.readLock.lock();
try {
if (this.state.compareTo(State.STATE_TRANSFERRING) > 0) {
LOG.debug("Node {} stop step-down timer, term={}, state={}.", getNodeId(), this.currTerm,
this.state);
return;
}
final long monotonicNowMs = Utils.monotonicMs();
if (!checkDeadNodes(this.conf.getConf(), monotonicNowMs, false)) {
break;
}
if (!this.conf.getOldConf().isEmpty()) {
if (!checkDeadNodes(this.conf.getOldConf(), monotonicNowMs, false)) {
break;
}
}
return;
} finally {
this.readLock.unlock();
}
} while (false);
this.writeLock.lock();
try {
if (this.state.compareTo(State.STATE_TRANSFERRING) > 0) {
LOG.debug("Node {} stop step-down timer, term={}, state={}.", getNodeId(), this.currTerm, this.state);
return;
}
final long monotonicNowMs = Utils.monotonicMs();
checkDeadNodes(this.conf.getConf(), monotonicNowMs, true);
if (!this.conf.getOldConf().isEmpty()) {
checkDeadNodes(this.conf.getOldConf(), monotonicNowMs, true);
}
} finally {
this.writeLock.unlock();
}
}
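As for how the leader is validated, see the comments in the code. In short: for each node, the leader computes the current time minus the time of its last successful RPC to that node. If that difference does not exceed leaderLeaseTimeout for more than half of the nodes, the leader is valid. Otherwise the leader is invalid and goes through the stepDown logic.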
private boolean checkDeadNodes(final Configuration conf, final long monotonicNowMs,
final boolean stepDownOnCheckFail) {
for (final PeerId peer : conf.getLearners()) {
// Make sure the learner's replicator is working
checkReplicator(peer);
}
final List<PeerId> peers = conf.listPeers();
final Configuration deadNodes = new Configuration();
// If the leader is valid, return true directly
if (checkDeadNodes0(peers, monotonicNowMs, true, deadNodes)) {
return true;
}
if (stepDownOnCheckFail) {
// Step-down logic, executed only while holding the write lock
LOG.warn("Node {} steps down when alive nodes don't satisfy quorum, term={}, deadNodes={}, conf={}.",
getNodeId(), this.currTerm, deadNodes, conf);
final Status status = new Status();
status.setError(RaftError.ERAFTTIMEDOUT, "Majority of the group dies: %d/%d", deadNodes.size(),
peers.size());
stepDown(this.currTerm, false, status);
}
return false;
}
private boolean checkDeadNodes0(final List<PeerId> peers, final long monotonicNowMs, final boolean checkReplicator,
final Configuration deadNodes) {
final int leaderLeaseTimeoutMs = this.options.getLeaderLeaseTimeoutMs();
int aliveCount = 0;
long startLease = Long.MAX_VALUE;
for (final PeerId peer : peers) {
// Iterate over all peers that are eligible to vote
if (peer.equals(this.serverId)) {
aliveCount++;
continue;
}
if (checkReplicator) {
checkReplicator(peer);
}
// Time of the last successful RPC to this peer
final long lastRpcSendTimestamp = this.replicatorGroup.getLastRpcSendTimestamp(peer);
// now - time of the last successful RPC <= lease timeout
// means this peer still acknowledges the leader
if (monotonicNowMs - lastRpcSendTimestamp <= leaderLeaseTimeoutMs) {
aliveCount++;
if (startLease > lastRpcSendTimestamp) {
// Use the earliest successful RPC time as the start of the lease
startLease = lastRpcSendTimestamp;
}
continue;
}
if (deadNodes != null) {
deadNodes.addPeer(peer);
}
}
// More than half of the nodes are within the lease, so the leader keeps its role
if (aliveCount >= peers.size() / 2 + 1) {
updateLastLeaderTimestamp(startLease);
return true;
}
// The leader is no longer valid
return false;
}
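The lease check performed by checkDeadNodes0 boils down to counting how many peers were reached recently enough. A minimal sketch follows; `LeaseCheckSketch` and `leaderStillValid` are hypothetical names, and timestamps are passed in as a plain list rather than read from a replicator group.

```java
import java.util.List;

/** Hypothetical reduction of the leader lease check; not jraft code. */
final class LeaseCheckSketch {

    /**
     * @param followerLastRpcMs time of the last successful RPC to each follower (leader excluded)
     * @param nowMs             current time on the same clock
     * @param leaseTimeoutMs    how long a successful RPC keeps a follower counted as alive
     */
    static boolean leaderStillValid(List<Long> followerLastRpcMs, long nowMs, long leaseTimeoutMs) {
        int alive = 1; // the leader counts itself
        for (long last : followerLastRpcMs) {
            if (nowMs - last <= leaseTimeoutMs) {
                alive++;
            }
        }
        int clusterSize = followerLastRpcMs.size() + 1;
        return alive >= clusterSize / 2 + 1;
    }
}
```

In the partition example, the old leader S1 can only reach S3, so at most 2 of 5 nodes count as alive and the check fails once the lease on the unreachable peers expires.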
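Although the stepDownTimer cannot completely prevent the scenario described at the start of this subsection, it does guarantee that after one leaseTimeout clients can tell that the leader has changed.
1.3.5. Heartbeat
Looking back at the code above, you will notice that the logic by which the leader sends heartbeats to each follower is still missing. First, see the figure below: after becoming leader, besides starting the StepDownTimer, the node also creates a replicator for every follower and learner, which manages its interaction with that node.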
private void becomeLeader() {
// ...... code omitted
// Start follower's replicators
for (final PeerId peer : this.conf.listPeers()) {
if (peer.equals(this.serverId)) {
continue;
}
LOG.debug("Node {} add a replicator, term={}, peer={}.", getNodeId(), this.currTerm, peer);
if (!this.replicatorGroup.addReplicator(peer)) {
LOG.error("Fail to add a replicator, peer={}.", peer);
}
}
// Start learner's replicators
for (final PeerId peer : this.conf.listLearners()) {
LOG.debug("Node {} add a learner replicator, term={}, peer={}.", getNodeId(), this.currTerm, peer);
if (!this.replicatorGroup.addReplicator(peer, ReplicatorType.Learner)) {
LOG.error("Fail to add a learner replicator, peer={}.", peer);
}
}
// ...... code omitted
}
This calls ReplicatorGroupImpl#addReplicator, which in turn calls Replicator#start to create a new replicator.
@Override
public boolean addReplicator(final PeerId peer, final ReplicatorType replicatorType, final boolean sync) {
// ...... code omitted
final ThreadId rid = Replicator.start(opts, this.raftOptions);
// ...... code omitted
return this.replicatorMap.put(peer, rid) == null;
}
The start method starts the heartbeatTimer and immediately sends an AppendEntriesRequest that carries no data (used to quickly sync the log between the leader and the other node).
public static ThreadId start(final ReplicatorOptions opts, final RaftOptions raftOptions) {
// ...... code omitted
r.lastRpcSendTimestamp = Utils.monotonicMs();
r.startHeartbeatTimer(Utils.nowMs());
// id.unlock in sendEmptyEntries
r.sendProbeRequest();
return r.id;
}
Each time the heartbeatTimer fires, it calls Replicator#onTimeout.
private void onTimeout(final ThreadId id) {
if (id != null) {
id.setError(RaftError.ETIMEDOUT.getNumber());
} else {
LOG.warn("Replicator {} id is null when timeout, maybe it's destroyed.", this);
}
}
This eventually reaches the Replicator's onError method, which sends the heartbeat.
@Override
public void onError(final ThreadId id, final Object data, final int errorCode) {
final Replicator r = (Replicator) data;
if (errorCode == RaftError.ESTOP.getNumber()) {
// ...... omitted
} else if (errorCode == RaftError.ETIMEDOUT.getNumber()) {
// Send a heartbeat
RpcUtils.runInThread(() -> sendHeartbeat(id));
} else {
// noinspection ConstantConditions
Requires.requireTrue(false, "Unknown error code " + errorCode + " for replicator: " + r);
}
}
private static void sendHeartbeat(final ThreadId id) {
final Replicator r = (Replicator) id.lock();
if (r == null) {
return;
}
// unlock in sendEmptyEntries
r.sendEmptyEntries(true);
}
private void sendEmptyEntries(final boolean isHeartbeat,
final RpcResponseClosure<AppendEntriesResponse> heartBeatClosure) {
final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
// ...... code omitted
try {
final long monotonicSendTimeMs = Utils.monotonicMs();
if (isHeartbeat) {
final AppendEntriesRequest request = rb.build();
// Sending a heartbeat request
this.heartbeatCounter++;
RpcResponseClosure<AppendEntriesResponse> heartbeatDone;
// Prefer passed-in closure.
if (heartBeatClosure != null) {
heartbeatDone = heartBeatClosure;
} else {
heartbeatDone = new RpcResponseClosureAdapter<AppendEntriesResponse>() {
@Override
public void run(final Status status) {
// Important: heartbeat response callback
onHeartbeatReturned(Replicator.this.id, status, request, getResponse(), monotonicSendTimeMs);
}
};
}
this.heartbeatInFly = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(), request,
this.options.getElectionTimeoutMs() / 2, heartbeatDone);
} else {
// ...... code omitted
}
LOG.debug("Node {} send HeartbeatRequest to {} term {} lastCommittedIndex {}", this.options.getNode()
.getNodeId(), this.options.getPeerId(), this.options.getTerm(), rb.getCommittedIndex());
} finally {
unlockId();
}
}
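Next, let's look at how a node handles a heartbeat request. If you paid attention to the heartbeat-sending code above, you saw that a heartbeat request is in fact also an AppendEntriesRequest. The receiving node handles it in NodeImpl#handleAppendEntriesRequest.
For a heartbeat request, the core logic is:
- Validate the term, prevLogTerm and prevLogIndex to make sure the heartbeat request is legitimate.
- Update the time of the leader's latest request.
- In addition, the receiving node advances its committed index along the way, based on the committed index carried by the heartbeat.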
@Override
public Message handleAppendEntriesRequest(final AppendEntriesRequest request, final RpcRequestClosure done) {
boolean doUnlock = true;
final long startMs = Utils.monotonicMs();
this.writeLock.lock();
final int entriesCount = request.getEntriesCount();
boolean success = false;
try {
// 1. Check this node's state
if (!this.state.isActive()) {
return RpcFactoryHelper //
.responseFactory() //
.newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EINVAL,
"Node %s is not in active state, state %s.", getNodeId(), this.state.name());
}
final PeerId serverId = new PeerId();
// 2. Validate the sender's serverId
if (!serverId.parse(request.getServerId())) {
return RpcFactoryHelper //
.responseFactory() //
.newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EINVAL,
"Parse serverId failed: %s.", request.getServerId());
}
// 3. Check whether the term is valid
if (request.getTerm() < this.currTerm) {
return AppendEntriesResponse.newBuilder() //
.setSuccess(false) //
.setTerm(this.currTerm) //
.build();
}
// 4. Check the term; if this node is not yet a valid follower, step it down
checkStepDown(request.getTerm(), serverId);
// 5. The sender is not the leader this node knows: more than one leader exists in the cluster
if (!serverId.equals(this.leaderId)) {
// Increase the term by 1 and make both leaders step down to minimize the
// loss of split brain
stepDown(request.getTerm() + 1, false, new Status(RaftError.ELEADERCONFLICT,
"More than one leader in the same term."));
return AppendEntriesResponse.newBuilder() //
.setSuccess(false) //
.setTerm(request.getTerm() + 1) //
.build();
}
// 6. Update the time of the leader's latest request
updateLastLeaderTimestamp(Utils.monotonicMs());
// 7. If a snapshot is being installed, reject this request as busy
if (entriesCount > 0 && this.snapshotExecutor != null && this.snapshotExecutor.isInstallingSnapshot()) {
LOG.warn("Node {} received AppendEntriesRequest while installing snapshot.", getNodeId());
return RpcFactoryHelper //
.responseFactory() //
.newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EBUSY,
"Node %s:%s is installing snapshot.", this.groupId, this.serverId);
}
final long prevLogIndex = request.getPrevLogIndex();
final long prevLogTerm = request.getPrevLogTerm();
final long localPrevLogTerm = this.logManager.getTerm(prevLogIndex);
// 8. Compare the local log with the requester's log
if (localPrevLogTerm != prevLogTerm) {
final long lastLogIndex = this.logManager.getLastLogIndex();
return AppendEntriesResponse.newBuilder() //
.setSuccess(false) //
.setTerm(this.currTerm) //
.setLastLogIndex(lastLogIndex) //
.build();
}
// 9. This may be a probeRequest or a heartbeatRequest
if (entriesCount == 0) {
// heartbeat or probe request
final AppendEntriesResponse.Builder respBuilder = AppendEntriesResponse.newBuilder() //
.setSuccess(true) //
.setTerm(this.currTerm) //
.setLastLogIndex(this.logManager.getLastLogIndex());
doUnlock = false;
this.writeLock.unlock();
// see the comments at FollowerStableClosure#run()
this.ballotBox.setLastCommittedIndex(Math.min(request.getCommittedIndex(), prevLogIndex));
return respBuilder.build();
}
// ... log replication logic omitted
checkAndSetConfiguration(true);
success = true;
return null;
} finally {
if (doUnlock) {
this.writeLock.unlock();
}
final long processLatency = Utils.monotonicMs() - startMs;
if (entriesCount == 0) {
this.metrics.recordLatency("handle-heartbeat-requests", processLatency);
} else {
this.metrics.recordLatency("handle-append-entries", processLatency);
}
if (success) {
// Don't stats heartbeat requests.
this.metrics.recordSize("handle-append-entries-count", entriesCount);
}
}
}
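Regarding the second point of the core logic above: the time of the leader's last request is an important variable for judging whether the current leader is valid, and it acts as the start time of a lease. The electionTimer handler contains the following logic: isCurrentLeaderValid decides whether the current leader is valid, and if it is, this election timeout is skipped. isCurrentLeaderValid is based on whether the time elapsed since the leader's last request is still below electionTimeoutMs.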
private void handleElectionTimeout() {
boolean doUnlock = true;
this.writeLock.lock();
try {
if (this.state != State.STATE_FOLLOWER) {
return;
}
if (isCurrentLeaderValid()) {
return;
}
// ..... code omitted
}
}
private boolean isCurrentLeaderValid() {
return Utils.monotonicMs() - this.lastLeaderTimestamp < this.options.getElectionTimeoutMs();
}
Finally, let's look at how the leader handles the heartbeat response:
static void onHeartbeatReturned(final ThreadId id, final Status status, final AppendEntriesRequest request,
final AppendEntriesResponse response, final long rpcSendTime) {
if (id == null) {
return;
}
final long startTimeMs = Utils.nowMs();
Replicator r;
if ((r = (Replicator) id.lock()) == null) {
return;
}
boolean doUnlock = true;
try {
final boolean isLogDebugEnabled = LOG.isDebugEnabled();
if (!status.isOk()) {
if (isLogDebugEnabled) {
sb.append(" fail, sleep, status=") //
.append(status);
LOG.debug(sb.toString());
}
r.setState(State.Probe);
notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
// Timeout or another network problem: restart the timer so the heartbeat is sent again
r.startHeartbeatTimer(startTimeMs);
return;
}
r.consecutiveErrorTimes = 0;
// The peer's term is higher than the replicator's term
if (response.getTerm() > r.options.getTerm()) {
final NodeImpl node = r.options.getNode();
r.notifyOnCaughtUp(RaftError.EPERM.getNumber(), true);
// Destroy the replicator
r.destroy();
// There are two cases here:
// 1. This node is a stale leader: run the stepDown logic
// 2. This node's term is already higher than the response term: do nothing
node.increaseTermTo(response.getTerm(), new Status(RaftError.EHIGHERTERMRESPONSE,
"Leader receives higher term heartbeat_response from peer:%s, group:%s", r.options.getPeerId(), r.options.getGroupId()));
return;
}
if (!response.getSuccess() && response.hasLastLogIndex()) {
// Important: a failed heartbeat means the log no longer matches the follower's
// Send a probeRequest to find the log difference so that it can be repaired later
doUnlock = false;
r.sendProbeRequest();
r.startHeartbeatTimer(startTimeMs);
return;
}
// Everything is fine
if (rpcSendTime > r.lastRpcSendTimestamp) {
r.lastRpcSendTimestamp = rpcSendTime;
}
r.startHeartbeatTimer(startTimeMs);
} finally {
if (doUnlock) {
id.unlock();
}
}
}
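1.4. Summary
Compared with the original Raft algorithm, jraft makes quite a few optimizations. The preVote mechanism avoids unnecessary votes in the cluster, and the leader can also raise its own term to cope with a candidate whose term is higher than its own, which is another way to avoid unnecessary elections. In addition, jraft implements a leader lease: within the lease period, peers unconditionally treat the leader as healthy, so election attempts fail fast.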