日志复制
1. Pipeline机制
在进入日志复制内容前,我们需要了解下sofa-jraft的pipeline机制。先回顾下raft协议:raft要求follower与leader的日志顺序一致,此外禁止follower的日志出现空洞。如下图的情况是不允许出现的:follower01出现了日志空洞,follower02则出现了乱序。
为了避免上述情况,我们可以采用最简单的"request-response-request"(请求-应答-请求)模型:发出一个请求,收到应答后再发送下一个。这样能够保证日志的顺序性,请求过程中发生异常也很好处理,直接重试即可。但可以预料,这种模式的性能比较低下,而日志复制在raft中是一个非常频繁的操作。
为了提高日志复制的性能,jraft除了采用批量复制外,还引入了pipeline机制。具体地说,jraft在日志复制时不再发一个请求、等一个应答,而是允许连续发送窗口大小数量的请求,把原本串行的请求-应答-请求模型并行化,类似于TCP的滑动窗口算法。
在窗口大小范围内,leader允许连续发送请求,不必等待响应到达,jraft把这种发送但未收到响应的请求称为inflight request。leader在发送请求后会将请求进行记录,为其分配一个seqId用于记录request的顺序,并压入一个先进先出的队列。
Peer在返回响应时,会在响应中携带该seqId,以便leader查询响应对应请求。
Leader在接收到响应时,如果该响应不是inflight request queue第一条请求对应的响应,则先不处理,将其放入pending response priority queue,该队列根据seqId排序,保证与inflight request queue顺序一致;如果是第一条请求对应的响应,则按正常日志复制逻辑处理,这里暂且不细究。
如下图,黄色实心圆表示已经接收到的inflight request的响应,黄色虚线圆表示未收到的inflight request响应,requiredNextSeq指向期望收到的下一条响应。灰色的圆表示已经被处理的inflight request和response。
虽然pipeline机制解决了日志复制处理的顺序性、提高了传输效率,但仍然无法保证日志在网络传输中的有序性:peer在接收到请求后,可能不会按照leader写入的顺序复制日志并应用到状态机。如下图,理想状况下leader和follower之间的复制是严格有序的;但现实中,由于jraft采用了SOFABolt作为通信层,而该框架默认使用连接池,这就导致replicator发送日志以及follower接收日志的顺序是无法预测的。虽然jraft在收到乱序日志时会通过重传解决,但日志复制十分频繁,这会造成大量不必要的重传,降低系统性能。
为此jraft采用了两个方法来尽可能解决上述问题:
- 采用单连接
- follower端直接在IO线程中处理请求(单线程处理)
通过这种方式,能够保证peer在接收到请求时尽最大可能保持与leader相同的处理顺序。
但上面的inflight request queue还存在一个问题:假设某个request的response一直失败,且超出了重传次数,此时inflight request queue可能会一直夯住——由于响应必须按序处理,后续所有请求的响应都无法被处理了。
jraft在碰到上述情况时会重建inflight request queue,并自增一个version字段,该字段用来判断接收到的响应是否属于上一个版本的inflight request queue,如果是则无视该响应。
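下面用一段极简的Java代码示意上述pipeline机制的骨架(笔者的示意代码,非jraft源码,省略了并发控制、超时重传与真正的响应处理逻辑):请求按seqId递增进入inflight FIFO队列;响应先进入按seq排序的优先级队列,只有队首seq等于requiredNextSeq时才出队处理;resetInflights通过自增version作废旧队列的响应。
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.PriorityQueue;
public class PipelineSketch {
    static class Response {
        final int seq;
        final int version;
        Response(int seq, int version) {
            this.seq = seq;
            this.version = version;
        }
    }
    private int version = 0;
    private int reqSeq = 0;          // 为每个请求递增分配的seqId
    private int requiredNextSeq = 0; // 期望处理的下一条响应的seq
    // FIFO队列,记录已发送但未收到响应的请求seq
    private final ArrayDeque<Integer> inflights = new ArrayDeque<>();
    // 按seq排序的pending response优先级队列
    private final PriorityQueue<Response> pendingResponses =
        new PriorityQueue<>(Comparator.comparingInt(r -> r.seq));
    int onSend() {
        final int seq = this.reqSeq++;
        this.inflights.add(seq); // 发送请求时记录inflight
        return seq;
    }
    void onResponse(final Response resp) {
        if (resp.version != this.version) {
            return; // 上一个版本inflight队列的响应,直接忽略
        }
        this.pendingResponses.add(resp);
        // 只按seq顺序消费;队首seq与requiredNextSeq不等说明出现空洞,停下等待
        while (!this.pendingResponses.isEmpty()
               && this.pendingResponses.peek().seq == this.requiredNextSeq) {
            final Response r = this.pendingResponses.poll();
            final int inflightSeq = this.inflights.poll(); // 与FIFO队首的请求配对,两者seq应相等
            // ......此处执行真正的响应处理逻辑......
            this.requiredNextSeq++;
        }
    }
    void resetInflights() {
        // 某个请求超出重试次数时重建队列;旧响应因version不匹配被丢弃
        this.version++;
        this.inflights.clear();
        this.pendingResponses.clear();
        this.requiredNextSeq = this.reqSeq;
    }
}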
2. 探针Probe
在上一章的心跳一节中,我们有提到在心跳请求被拒绝,即response.success == false时,leader会发送probeRequest。这一节我们就来详细讨论下probeRequest的作用。probeRequest实际上和心跳请求共用了sendEmptyEntries方法,只不过isHeartbeat参数传入了false。
private void sendProbeRequest() {
sendEmptyEntries(false);
}
虽然同样是空的AppendEntriesRequest,但处理逻辑与心跳不同,主要体现在:
- 更改本地state
- 使用onRpcReturned处理回调
- 将request压入了InFlight request queue,意味着该request的response需要被顺序处理。
如果说心跳请求只是不断地为leader节点进行续约,那么探针请求则是leader主动用来确定与follower之间日志差异的请求。
private void sendEmptyEntries(final boolean isHeartbeat,
final RpcResponseClosure<AppendEntriesResponse> heartBeatClosure) {
final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
if (!fillCommonFields(rb, this.nextIndex - 1, isHeartbeat)) {
// id is unlock in installSnapshot
installSnapshot();
if (isHeartbeat && heartBeatClosure != null) {
RpcUtils.runClosureInThread(heartBeatClosure, new Status(RaftError.EAGAIN,
"Fail to send heartbeat to peer %s, group %s", this.options.getPeerId(), this.options.getGroupId()));
}
return;
}
try {
final long monotonicSendTimeMs = Utils.monotonicMs();
if (isHeartbeat) {
// ......省略心跳逻辑
} else {
// 探针请求不携带日志数据
rb.setData(ByteString.EMPTY);
final AppendEntriesRequest request = rb.build();
// 1. 记录本地状态
this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
this.statInfo.firstLogIndex = this.nextIndex;
this.statInfo.lastLogIndex = this.nextIndex - 1;
this.probeCounter++;
setState(State.Probe);
final int stateVersion = this.version;
final int seq = getAndIncrementReqSeq();
// 2. 发起请求并设置回调
final Future<Message> rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(),
request, -1, new RpcResponseClosureAdapter<AppendEntriesResponse>() {
@Override
public void run(final Status status) {
onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request,
getResponse(), seq, stateVersion, monotonicSendTimeMs);
}
});
// 3. 将请求加入 inflight 队列
addInflight(RequestType.AppendEntries, this.nextIndex, 0, 0, seq, rpcFuture);
}
} finally {
unlockId();
}
}
节点收到该请求后的处理逻辑与心跳请求一致,这里就不多赘述了,主要来看下请求发起方在接收到响应后的回调。
这个回调方法的核心内容是对response的排队处理:
- 通过version确定response的合法性
- pendingResponsePriorityQueue通过seq来为响应进行排序
- 按照请求的顺序处理response,在出现response空洞时停止处理
- 如果continueSendEntries为true,则继续发送日志。
static void onRpcReturned(final ThreadId id, final RequestType reqType, final Status status, final Message request,
final Message response, final int seq, final int stateVersion, final long rpcSendTime) {
if (id == null) {
return;
}
final long startTimeMs = Utils.nowMs();
Replicator r;
if ((r = (Replicator) id.lock()) == null) {
return;
}
// 非法响应,即上一个版本的inflight request的响应,无视即可。
if (stateVersion != r.version) {
id.unlock();
return;
}
// 1. 响应先加入pending response优先级队列
final PriorityQueue<RpcResponse> holdingQueue = r.pendingResponses;
holdingQueue.add(new RpcResponse(reqType, seq, status, request, response, rpcSendTime));
if (holdingQueue.size() > r.raftOptions.getMaxReplicatorInflightMsgs()) {
// replicator等待响应太多,重建inflights
r.resetInflights();
r.setState(State.Probe);
r.sendProbeRequest();
return;
}
// 该变量用于标记处理完响应后,是否应该继续发送请求
boolean continueSendEntries = false;
try {
int processed = 0;
while (!holdingQueue.isEmpty()) {
final RpcResponse queuedPipelinedResponse = holdingQueue.peek();
// Sequence mismatch, waiting for next response.
if (queuedPipelinedResponse.seq != r.requiredNextSeq) {
// 响应出现空洞,跳出循环
if (processed > 0) {
break;
} else {
// 尚未处理任何响应,等待空洞处的响应到达,暂不发送新请求
continueSendEntries = false;
id.unlock();
return;
}
}
holdingQueue.remove();
processed++;
final Inflight inflight = r.pollInflight();
if (inflight == null) {
// inflight被清空了,说明该响应是之前版本的inflight queue的响应,忽略
// The previous in-flight requests were cleared.
continue;
}
if (inflight.seq != queuedPipelinedResponse.seq) {
// 请求和响应无法对应,需要重置inflight queue,重新生成请求和响应。
// 理论上不应该触发到这里的逻辑
r.resetInflights();
r.setState(State.Probe);
continueSendEntries = false;
r.block(Utils.nowMs(), RaftError.EREQUEST.getNumber());
return;
}
try {
switch (queuedPipelinedResponse.requestType) {
case AppendEntries:
// 对于probe request的回调,响应类型是AppendEntries
continueSendEntries = onAppendEntriesReturned(id, inflight, queuedPipelinedResponse.status,
(AppendEntriesRequest) queuedPipelinedResponse.request,
(AppendEntriesResponse) queuedPipelinedResponse.response, rpcSendTime, startTimeMs, r);
break;
case Snapshot:
continueSendEntries = onInstallSnapshotReturned(id, r, queuedPipelinedResponse.status,
(InstallSnapshotRequest) queuedPipelinedResponse.request,
(InstallSnapshotResponse) queuedPipelinedResponse.response);
break;
}
} finally {
if (continueSendEntries) {
// Success, increase the response sequence.
r.getAndIncrementRequiredNextSeq();
} else {
// The id is already unlocked in onAppendEntriesReturned/onInstallSnapshotReturned, we SHOULD break out.
break;
}
}
}
} finally {
if (continueSendEntries) {
// 继续发送请求
r.sendEntries();
}
}
}
Replicator#onAppendEntriesReturned是真正处理探针响应的方法,核心逻辑如下:
- 当peer因为网络或者自身crash导致请求失败时,重建inflight request queue
- 如果peer拒绝了这次probe request,即response.success == false:
  - 如果是因为peer过忙无法处理,则等待一段时间再尝试
  - 如果peer的term大于当前节点的term(例如peer经历网络隔离、不断自增term后恢复),则当前leader销毁replicator并提升自身term
  - 如果peer的日志少于leader的日志,更新replicator的nextIndex为peer缺失日志的起始位置
  - 如果peer的日志多于leader的日志,不断减少nextIndex,寻找最后一个日志相同的位置
  - 不论以上哪种情况,都重建inflight request queue
- response的term和replicator当前term不一致但请求成功,这点倒没想明白在什么情况下会出现;也算是一种失败请求,重建inflight request queue。
- 如果request是正常日志请求,则根据响应提交日志
- 更新本地replicator状态
// 删除了非主要逻辑的代码
private static boolean onAppendEntriesReturned(final ThreadId id, final Inflight inflight, final Status status,
final AppendEntriesRequest request,
final AppendEntriesResponse response, final long rpcSendTime,
final long startTimeMs, final Replicator r) {
// 日志未对齐,重建inflight queue,作废之前的请求和响应
if (inflight.startIndex != request.getPrevLogIndex() + 1) {
LOG.warn(
"Replicator {} received invalid AppendEntriesResponse, in-flight startIndex={}, request prevLogIndex={}, reset the replicator state and probe again.",
r, inflight.startIndex, request.getPrevLogIndex());
r.resetInflights();
r.setState(State.Probe);
r.sendProbeRequest();
return false;
}
if (!status.isOk()) {
// follower崩溃或者其他网络问题
// If the follower crashes, any RPC to the follower fails immediately,
// so we need to block the follower for a while instead of looping until
// it comes back or be removed
// dummy_id is unlock in block
notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
if ((r.consecutiveErrorTimes++) % 10 == 0) {
LOG.warn("Fail to issue RPC to {}, consecutiveErrorTimes={}, error={}, groupId={}", r.options.getPeerId(),
r.consecutiveErrorTimes, status, r.options.getGroupId());
}
r.resetInflights();
r.setState(State.Probe);
// unlock in in block
r.block(startTimeMs, status.getCode());
return false;
}
r.consecutiveErrorTimes = 0;
if (!response.getSuccess()) {
// Target node is busy, sleep for a while.
// a. 如果是因为peer过忙无法处理,则重建inflight request queue,并等待一段时间再尝试
if(response.getErrorResponse().getErrorCode() == RaftError.EBUSY.getNumber()) {
r.resetInflights();
r.setState(State.Probe);
// unlock in in block
r.block(startTimeMs, status.getCode());
return false;
}
if (response.getTerm() > r.options.getTerm()) {
// leader term落后,当前leader退位:销毁replicator并提升自身term
final NodeImpl node = r.options.getNode();
r.notifyOnCaughtUp(RaftError.EPERM.getNumber(), true);
r.destroy();
node.increaseTermTo(response.getTerm(), new Status(RaftError.EHIGHERTERMRESPONSE,
"Leader receives higher term heartbeat_response from peer:%s, group:%s", r.options.getPeerId(), r.options.getGroupId()));
return false;
}
if (rpcSendTime > r.lastRpcSendTimestamp) {
r.lastRpcSendTimestamp = rpcSendTime;
}
// Fail, reset the state to try again from nextIndex.
r.resetInflights();
// prev_log_index and prev_log_term doesn't match
if (response.getLastLogIndex() + 1 < r.nextIndex) {
// peer比leader的日志少,更新nextIndex,这样leader在下一次就会发送正确的日志了
r.nextIndex = response.getLastLogIndex() + 1;
} else {
// peer的日志比leader多,包含了历史term(未被提交)的日志,需要被截断
// 不断减少nextIndex,来探测一致的日志起点
if (r.nextIndex > 1) {
r.nextIndex--;
} else {
// peer的日志完全与leader不一致,这大概率不会出现
LOG.error("Peer={} declares that log at index=0 doesn't match, which is not supposed to happen, groupId={}",
r.options.getPeerId(), r.options.getGroupId());
}
}
// dummy_id is unlock in _send_heartbeat
r.sendProbeRequest();
return false;
}
// success
if (response.getTerm() != r.options.getTerm()) {
// 响应term不相同,说明response的term小于当前节点的term,重建inflights
r.resetInflights();
r.setState(State.Probe);
id.unlock();
return false;
}
if (rpcSendTime > r.lastRpcSendTimestamp) {
r.lastRpcSendTimestamp = rpcSendTime;
}
final int entriesSize = request.getEntriesCount();
if (entriesSize > 0) {
// 提交
if (r.options.getReplicatorType().isFollower()) {
// Only commit index when the response is from follower.
r.options.getBallotBox().commitAt(r.nextIndex, r.nextIndex + entriesSize - 1, r.options.getPeerId());
}
}
r.setState(State.Replicate);
r.blockTimer = null;
r.nextIndex += entriesSize;
r.hasSucceeded = true;
r.notifyOnCaughtUp(RaftError.SUCCESS.getNumber(), false);
// dummy_id is unlock in _send_entries
if (r.timeoutNowIndex > 0 && r.timeoutNowIndex < r.nextIndex) {
r.sendTimeoutNow(false, false);
}
return true;
}
通过源码可以看出,probeRequest的作用就是不断修正leader和peer(follower、learner)之间的日志差异,修正nextIndex的位置,协助leader发送正确的日志给peer。
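举个假设的例子(数字纯属虚构):leader的nextIndex为10,若probe响应显示follower只有6条日志(lastLogIndex == 6),nextIndex会被一次性修正为7;若follower在旧term残留了多余日志,则每次probe把nextIndex减1,直到找到双方一致的位置。这段修正逻辑可以概括为下面的示意代码(非jraft源码):
// 示意:probe响应驱动的nextIndex修正,对应onAppendEntriesReturned中的两个分支
static long adjustNextIndex(final long nextIndex, final long followerLastLogIndex) {
    if (followerLastLogIndex + 1 < nextIndex) {
        // follower日志少于leader:直接跳到follower最后一条日志的下一条
        return followerLastLogIndex + 1;
    }
    // follower残留了多余(未提交)的旧日志:逐步回退,探测最后的一致位置
    return nextIndex > 1 ? nextIndex - 1 : nextIndex;
}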
3. 日志复制
3.1. Leader日志复制
leader写日志的入口是NodeImpl#apply,应用层将回调和写入数据封装成一个task,leader将其发送到ApplyDisruptor的事件队列中。
@Override
public void apply(final Task task) {
if (this.shutdownLatch != null) {
// 停机校验
ThreadPoolsFactory.runClosureInThread(this.groupId, task.getDone(), new Status(RaftError.ENODESHUTDOWN, "Node is shutting down."));
throw new IllegalStateException("Node is shutting down");
}
// task携带数据以及应用层回调
Requires.requireNonNull(task, "Null task");
final LogEntry entry = new LogEntry();
entry.setData(task.getData());
// 包装成事件
final EventTranslator<LogEntryAndClosure> translator = (event, sequence) -> {
event.reset();
event.done = task.getDone();
event.entry = entry;
event.expectedTerm = task.getExpectedTerm();
};
// 同步或者异步发布事件
switch(this.options.getApplyTaskMode()) {
case Blocking:
this.applyQueue.publishEvent(translator);
break;
case NonBlocking:
default:
if (!this.applyQueue.tryPublishEvent(translator)) {
String errorMsg = "Node is busy, has too many tasks, queue is full and bufferSize="+ this.applyQueue.getBufferSize();
ThreadPoolsFactory.runClosureInThread(this.groupId, task.getDone(),
new Status(RaftError.EBUSY, errorMsg));
LOG.warn("Node {} applyQueue is overload.", getNodeId());
this.metrics.recordTimes("apply-task-overload-times", 1);
if(task.getDone() == null) {
throw new OverloadException(errorMsg);
}
}
break;
}
}
以rheakv为例,这里task携带的回调如下所示:成功时写入数据,失败时写入错误码。至于这个回调什么时候被调用,先卖个关子,暂且不谈。(ps:如果你了解raft算法,就应该清楚它什么时候被回调)
this.rawKVStore.put(key, value, new BaseKVStoreClosure() {
@Override
public void run(final Status status) {
if (status.isOk()) {
response.setValue((Boolean) getData());
} else {
setFailure(request, response, status, getError());
}
closure.sendResponse(response);
}
});
3.1.1. ApplyDisruptor
applyDisruptor是一个高性能的消息队列,这里指定了LogEntryAndClosureHandler作为事件处理器;需要注意的是,这里是单线程消费。
this.applyDisruptor = DisruptorBuilder.<LogEntryAndClosure> newInstance() //
.setRingBufferSize(this.raftOptions.getDisruptorBufferSize()) //
.setEventFactory(new LogEntryAndClosureFactory()) //
.setThreadFactory(new NamedThreadFactory("JRaft-NodeImpl-Disruptor-", true)) //
.setProducerType(ProducerType.MULTI) //
.setWaitStrategy(new BlockingWaitStrategy()) //
.build();
this.applyDisruptor.handleEventsWith(new LogEntryAndClosureHandler());
this.applyDisruptor.setDefaultExceptionHandler(new LogExceptionHandler<Object>(getClass().getSimpleName()));
this.applyQueue = this.applyDisruptor.start();
LogEntryAndClosureHandler#onEvent是处理leader apply事件的核心方法。值得注意的是这里采用了批处理的思想:关键变量endOfBatch由Disruptor决定,消费者一次取到的一批事件中,只有最后一个事件的endOfBatch为true。在节点比较空闲时,每个task的endOfBatch都等于true;只有在节点繁忙、事件积压时,批处理的作用才真正显现。
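在看jraft的handler之前,可以先通过一个最小化的Disruptor示例直观感受endOfBatch的语义(示意代码,非jraft源码,假设工程中已引入LMAX Disruptor依赖):消费者故意放慢速度制造积压后,一次会取到多个事件,其中只有最后一个事件的endOfBatch为true;消费及时时,每个事件都自成一批。
import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
public class EndOfBatchDemo {
    static class Event {
        long value;
    }
    public static void main(String[] args) {
        final Disruptor<Event> disruptor = new Disruptor<>(Event::new, 1024,
            r -> new Thread(r, "demo-consumer"), ProducerType.MULTI, new BlockingWaitStrategy());
        // 消费者sleep制造积压,方便观察endOfBatch的取值
        final EventHandler<Event> handler = (event, sequence, endOfBatch) -> {
            System.out.println("value=" + event.value + ", endOfBatch=" + endOfBatch);
            Thread.sleep(10);
        };
        disruptor.handleEventsWith(handler);
        final RingBuffer<Event> ringBuffer = disruptor.start();
        for (int i = 0; i < 10; i++) {
            final long v = i;
            ringBuffer.publishEvent((event, seq) -> event.value = v);
        }
        disruptor.shutdown(); // shutdown会等待积压事件全部消费完毕
    }
}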
private class LogEntryAndClosureHandler implements EventHandler<LogEntryAndClosure> {
// task list for batch
// 批处理
private final List<LogEntryAndClosure> tasks = new ArrayList<>(NodeImpl.this.raftOptions.getApplyBatch());
@Override
public void onEvent(final LogEntryAndClosure event, final long sequence, final boolean endOfBatch)
throws Exception {
if (event.shutdownLatch != null) {
// 停机,尽可能处理完最后一批任务
if (!this.tasks.isEmpty()) {
executeApplyingTasks(this.tasks);
reset();
}
// 停机,更新活跃节点计数,并唤醒等待停机完成的线程
final int num = GLOBAL_NUM_NODES.decrementAndGet();
LOG.info("The number of active nodes decrement to {}.", num);
event.shutdownLatch.countDown();
return;
}
this.tasks.add(event);
if (this.tasks.size() >= NodeImpl.this.raftOptions.getApplyBatch() || endOfBatch) {
// 满足两个条件就处理
// 1. 批大小达到 jraft配置的最小批数量
// 2. disruptor标记了批结束标志endOfBatch。
executeApplyingTasks(this.tasks);
reset();
}
}
// 清空批处理队列
private void reset() {
for (final LogEntryAndClosure task : this.tasks) {
task.reset();
}
this.tasks.clear();
}
}
在NodeImpl#executeApplyingTasks方法中有两个比较关键的逻辑
- 调用this.ballotBox.appendPendingTask(this.conf.getConf(),this.conf.isStable() ? null : this.conf.getOldConf(), task.done)为当前任务新建票据,并存入信箱。注意这里的票据携带了应用层的回调,task.done。
- 调用this.logManager.appendEntries(entries, new LeaderStableClosure(entries))写入日志,这里新建了一个LeaderStableClosure的回调,该回调的作用是leader本地提交日志。
private void executeApplyingTasks(final List<LogEntryAndClosure> tasks) {
// 查看logManager是否busy,以便fail fast
if (!this.logManager.hasAvailableCapacityToAppendEntries(1)) {
// It's overload, fail-fast
final List<Closure> dones = tasks.stream().map(ele -> ele.done).filter(Objects::nonNull)
.collect(Collectors.toList());
ThreadPoolsFactory.runInThread(this.groupId, () -> {
for (final Closure done : dones) {
done.run(new Status(RaftError.EBUSY, "Node %s log manager is busy.", this.getNodeId()));
}
});
return;
}
this.writeLock.lock();
try {
final int size = tasks.size();
if (this.state != State.STATE_LEADER) {
// 校验节点状态
final Status st = new Status();
if (this.state != State.STATE_TRANSFERRING) {
st.setError(RaftError.EPERM, "Is not leader.");
} else {
st.setError(RaftError.EBUSY, "Is transferring leadership.");
}
LOG.debug("Node {} can't apply, status={}.", getNodeId(), st);
final List<Closure> dones = tasks.stream().map(ele -> ele.done)
.filter(Objects::nonNull).collect(Collectors.toList());
ThreadPoolsFactory.runInThread(this.groupId, () -> {
for (final Closure done : dones) {
done.run(st);
}
});
return;
}
final List<LogEntry> entries = new ArrayList<>(size);
for (int i = 0; i < size; i++) {
final LogEntryAndClosure task = tasks.get(i);
// term与当前节点term不一致,写入失败
if (task.expectedTerm != -1 && task.expectedTerm != this.currTerm) {
LOG.debug("Node {} can't apply task whose expectedTerm={} doesn't match currTerm={}.", getNodeId(),
task.expectedTerm, this.currTerm);
if (task.done != null) {
final Status st = new Status(RaftError.EPERM, "expected_term=%d doesn't match current_term=%d",
task.expectedTerm, this.currTerm);
ThreadPoolsFactory.runClosureInThread(this.groupId, task.done, st);
task.reset();
}
continue;
}
// 初始化并添加appendEntries票据到ballotBox信箱
if (!this.ballotBox.appendPendingTask(this.conf.getConf(),
this.conf.isStable() ? null : this.conf.getOldConf(), task.done)) {
ThreadPoolsFactory.runClosureInThread(this.groupId, task.done, new Status(RaftError.EINTERNAL, "Fail to append task."));
task.reset();
continue;
}
// set task entry info before adding to list.
task.entry.getId().setTerm(this.currTerm);
task.entry.setType(EnumOutter.EntryType.ENTRY_TYPE_DATA);
entries.add(task.entry);
task.reset();
}
// 调用logManager添加entries,并且新建了一个LeaderStableClosure回调
this.logManager.appendEntries(entries, new LeaderStableClosure(entries));
// update conf.first
checkAndSetConfiguration(true);
} finally {
this.writeLock.unlock();
}
}
class LeaderStableClosure extends LogManager.StableClosure {
public LeaderStableClosure(final List<LogEntry> entries) {
super(entries);
}
@Override
public void run(final Status status) {
if (status.isOk()) {
// 本地提交日志
NodeImpl.this.ballotBox.commitAt(this.firstLogIndex, this.firstLogIndex + this.nEntries - 1,
NodeImpl.this.serverId);
} else {
LOG.error("Node {} append [{}, {}] failed, status={}.", getNodeId(), this.firstLogIndex,
this.firstLogIndex + this.nEntries - 1, status);
}
}
}
接着看logManager.appendEntries方法,该方法的核心逻辑如下
- 如果非leader节点,尝试解决冲突,如果是leader节点则会为entries分配index。
- 如果冲突解决成功,则根据配置为entries设置校验和。
- 如果entries非空,则将其添加到内存,并设置回调方法的firstLogIndex为第一个entry的index
- 如果entries非空,唤醒所有的replicator,注意这里得是leader
- 发布落盘事件,这里和diskDisruptor有关,放在下一小节中细究。要注意这里将原本的StableClosure回调传递给该Disruptor,并且该回调携带了需要写入的entries
@Override
public void appendEntries(final List<LogEntry> entries, final StableClosure done) {
assert(done != null);
Requires.requireNonNull(done, "done");
if (this.hasError) {
// 异常情况
entries.clear();
ThreadPoolsFactory.runClosureInThread(this.groupId, done, new Status(RaftError.EIO, "Corrupted LogStorage"));
return;
}
boolean doUnlock = true;
this.writeLock.lock();
try {
// 尝试解决日志append冲突,可能是leader向peer分发日志,与peer本地日志存在冲突,peer尝试解决冲突,失败返回false
if (!entries.isEmpty() && !checkAndResolveConflict(entries, done, this.writeLock)) {
// If checkAndResolveConflict returns false, the done will be called in it.
entries.clear();
return;
}
for (int i = 0; i < entries.size(); i++) {
final LogEntry entry = entries.get(i);
// Set checksum after checkAndResolveConflict
// 设置校验和
if (this.raftOptions.isEnableLogEntryChecksum()) {
entry.setChecksum(entry.checksum());
}
// 动态配置
if (entry.getType() == EntryType.ENTRY_TYPE_CONFIGURATION) {
Configuration oldConf = new Configuration();
if (entry.getOldPeers() != null) {
oldConf = new Configuration(entry.getOldPeers(), entry.getOldLearners());
}
final ConfigurationEntry conf = new ConfigurationEntry(entry.getId(),
new Configuration(entry.getPeers(), entry.getLearners()), oldConf);
this.configManager.add(conf);
}
}
if (!entries.isEmpty()) {
// 添加到内存
done.setFirstLogIndex(entries.get(0).getId().getIndex());
this.logsInMemory.addAll(entries);
}
done.setEntries(entries);
doUnlock = false;
// 唤醒等待的replicator
if (!wakeupAllWaiter(this.writeLock)) {
notifyLastLogIndexListeners();
}
// publish event out of lock
// 发布落盘事件,注意这里把closure传递进去了
this.diskQueue.publishEvent((event, sequence) -> {
event.reset();
event.type = EventType.OTHER;
event.done = done;
});
} finally {
if (doUnlock) {
this.writeLock.unlock();
}
}
}
3.1.2. DiskDisruptor
与applyDisruptor类似,这里我们也只需要关心相应的handler,并且知道是单线程消费。
this.disruptor = DisruptorBuilder.<StableClosureEvent> newInstance() //
.setEventFactory(new StableClosureEventFactory()) //
.setRingBufferSize(opts.getDisruptorBufferSize()) //
.setThreadFactory(new NamedThreadFactory("JRaft-LogManager-Disruptor-", true)) //
.setProducerType(ProducerType.MULTI) //
/*
* Use timeout strategy in log manager. If timeout happens, it will called reportError to halt the node.
*/
.setWaitStrategy(new TimeoutBlockingWaitStrategy(
this.raftOptions.getDisruptorPublishEventWaitTimeoutSecs(), TimeUnit.SECONDS)) //
.build();
this.disruptor.handleEventsWith(new StableClosureEventHandler());
this.disruptor.setDefaultExceptionHandler(new LogExceptionHandler<Object>(this.getClass().getSimpleName(),
(event, ex) -> reportError(-1, "LogManager handle event error")));
this.diskQueue = this.disruptor.start();
核心逻辑由StableClosureEventHandler实现,该处理的核心逻辑如下
- 如果是停机事件,强刷数据到磁盘
- 如果entries不为空,调用AppendBatcher.append添加entries到缓冲区
- 处理LAST_LOG_ID、TRUNCATE_PREFIX、TRUNCATE_SUFFIX、RESET事件,在处理事件前都会调用AppendBatcher将缓冲区的日志写入磁盘
private class StableClosureEventHandler implements EventHandler<StableClosureEvent> {
LogId lastId = LogManagerImpl.this.diskId;
List<StableClosure> storage = new ArrayList<>(256);
AppendBatcher appendBatcher = new AppendBatcher(this.storage, 256, new ArrayList<>(),
LogManagerImpl.this.diskId);
@Override
public void onEvent(final StableClosureEvent event, final long sequence, final boolean endOfBatch)
throws Exception {
if (event.type == EventType.SHUTDOWN) {
// 停机事件,刷新缓冲区数据到磁盘
this.lastId = this.appendBatcher.flush();
setDiskId(this.lastId);
LogManagerImpl.this.shutDownLatch.countDown();
event.reset();
return;
}
final StableClosure done = event.done;
final EventType eventType = event.type;
event.reset();
if (done.getEntries() != null && !done.getEntries().isEmpty()) {
// appendEntries事件,添加entry到缓冲区
this.appendBatcher.append(done);
} else {
this.lastId = this.appendBatcher.flush();
boolean ret = true;
switch (eventType) {
// 节点调用LogManager#getLastLogId方法时触发,选举、写日志等行为均会调用到该方法
case LAST_LOG_ID:
((LastLogIdClosure) done).setLastLogId(this.lastId.copy());
break;
// 删除本地日志前缀触发,一般在安装snapshot时触发
case TRUNCATE_PREFIX:
long startMs = Utils.monotonicMs();
try {
final TruncatePrefixClosure tpc = (TruncatePrefixClosure) done;
LOG.debug("Truncating storage to firstIndexKept={}.", tpc.firstIndexKept);
ret = LogManagerImpl.this.logStorage.truncatePrefix(tpc.firstIndexKept);
} finally {
LogManagerImpl.this.nodeMetrics.recordLatency("truncate-log-prefix", Utils.monotonicMs()
- startMs);
}
break;
// 删除本地日志后缀触发,一般是leader向follower发送日志时解冲突触发
case TRUNCATE_SUFFIX:
startMs = Utils.monotonicMs();
try {
final TruncateSuffixClosure tsc = (TruncateSuffixClosure) done;
LOG.warn("Truncating storage to lastIndexKept={}.", tsc.lastIndexKept);
ret = LogManagerImpl.this.logStorage.truncateSuffix(tsc.lastIndexKept);
if (ret) {
this.lastId.setIndex(tsc.lastIndexKept);
this.lastId.setTerm(tsc.lastTermKept);
Requires.requireTrue(this.lastId.getIndex() == 0 || this.lastId.getTerm() != 0);
}
} finally {
LogManagerImpl.this.nodeMetrics.recordLatency("truncate-log-suffix", Utils.monotonicMs()
- startMs);
}
break;
// logManager被重置时触发
case RESET:
final ResetClosure rc = (ResetClosure) done;
LOG.info("Resetting storage to nextLogIndex={}.", rc.nextLogIndex);
ret = LogManagerImpl.this.logStorage.reset(rc.nextLogIndex);
break;
default:
break;
}
if (!ret) {
reportError(RaftError.EIO.getNumber(), "Failed operation in LogStorage");
} else {
// 回调
done.run(Status.OK());
}
}
if (endOfBatch) {
this.lastId = this.appendBatcher.flush();
setDiskId(this.lastId);
}
}
}
再来看下AppendBatcher#append方法,核心逻辑是:
- 将任务放入storage列表
- 将entries放入toAppend列表
List<StableClosure> storage;
List<LogEntry> toAppend;
void append(final StableClosure done) {
if (this.size == this.cap || this.bufferSize >= LogManagerImpl.this.raftOptions.getMaxAppendBufferSize()) {
flush();
}
this.storage.add(done);
this.size++;
this.toAppend.addAll(done.getEntries());
for (final LogEntry entry : done.getEntries()) {
this.bufferSize += entry.getData() != null ? entry.getData().remaining() : 0;
}
}
AppendBatcher#flush方法则会将缓冲区的日志写入持久化存储(默认是RocksDB),随后逐个调用StableClosure回调。对于leader来说,这个回调就是前面展示过的LeaderStableClosure,即通过信箱BallotBox对日志进行提交投票。
LogId flush() {
if (this.size > 0) {
this.lastId = appendToStorage(this.toAppend);
for (int i = 0; i < this.size; i++) {
this.storage.get(i).getEntries().clear();
Status st = null;
try {
if (LogManagerImpl.this.hasError) {
st = new Status(RaftError.EIO, "Corrupted LogStorage");
} else {
st = Status.OK();
}
this.storage.get(i).run(st);
} catch (Throwable t) {
LOG.error("Fail to run closure with status: {}.", st, t);
}
}
this.toAppend.clear();
this.storage.clear();
}
this.size = 0;
this.bufferSize = 0;
return this.lastId;
}
AppendBatcher#flush中调用的LogManagerImpl#appendToStorage方法,最终通过LogStorage#appendEntries进行落盘
LogManagerImpl#appendToStorage
private LogId appendToStorage(final List<LogEntry> toAppend) {
LogId lastId = null;
if (!this.hasError) {
final long startMs = Utils.monotonicMs();
final int entriesCount = toAppend.size();
this.nodeMetrics.recordSize("append-logs-count", entriesCount);
try {
int writtenSize = 0;
for (int i = 0; i < entriesCount; i++) {
final LogEntry entry = toAppend.get(i);
writtenSize += entry.getData() != null ? entry.getData().remaining() : 0;
}
this.nodeMetrics.recordSize("append-logs-bytes", writtenSize);
final int nAppent = this.logStorage.appendEntries(toAppend);
if (nAppent != entriesCount) {
LOG.error("**Critical error**, fail to appendEntries, nAppent={}, toAppend={}", nAppent,
toAppend.size());
reportError(RaftError.EIO.getNumber(), "Fail to append log entries");
}
if (nAppent > 0) {
lastId = toAppend.get(nAppent - 1).getId();
}
toAppend.clear();
} finally {
this.nodeMetrics.recordLatency("append-logs", Utils.monotonicMs() - startMs);
}
}
return lastId;
}
RocksDBLogStorage#appendEntries
@Override
public int appendEntries(final List<LogEntry> entries) {
if (entries == null || entries.isEmpty()) {
return 0;
}
final int entriesCount = entries.size();
final boolean ret = executeBatch(batch -> {
final WriteContext writeCtx = newWriteContext();
for (int i = 0; i < entriesCount; i++) {
final LogEntry entry = entries.get(i);
if (entry.getType() == EntryType.ENTRY_TYPE_CONFIGURATION) {
addConfBatch(entry, batch);
} else {
writeCtx.startJob();
addDataBatch(entry, batch, writeCtx);
}
}
writeCtx.joinAll();
doSync();
});
if (ret) {
return entriesCount;
} else {
return 0;
}
}
3.1.3. BallotBox
在上面的小节中,我们已经完成了leader上日志持久化的流程梳理。本小节将专注于日志提交:leader上的日志什么时候被提交?jraft又是怎样实现半数提交(quorum)机制的呢?
1. leader本地提交
在3.1.2小节中我们已经提到,在AppendBatcher完成flush后,会回调LeaderStableClosure对日志进行提交投票,这里就不再赘述。
2. follower投票提交
除了本地持久化得到自己的一票,leader还必须获得来自follower的投票,才能够提交日志。BallotBox的commitAt方法只有两处调用:一处是我们熟悉的LeaderStableClosure,另一处则在Replicator中(Replicator是负责leader向peer复制日志的组件)。
下面的onAppendEntriesReturned(节选)是Replicator收到AppendEntriesResponse后的回调,与投票相关的核心逻辑是:
如果当前Replicator对应的节点是Follower,则根据response返回的信息提交对应日志。
private static boolean onAppendEntriesReturned(final ThreadId id, final Inflight inflight, final Status status,
final AppendEntriesRequest request,
final AppendEntriesResponse response, final long rpcSendTime,
final long startTimeMs, final Replicator r) {
// 省略部分代码
final int entriesSize = request.getEntriesCount();
if (entriesSize > 0) {
if (r.options.getReplicatorType().isFollower()) {
//提交日志
// Only commit index when the response is from follower.
r.options.getBallotBox().commitAt(r.nextIndex, r.nextIndex + entriesSize - 1, r.options.getPeerId());
}
if (LOG.isDebugEnabled()) {
LOG.debug("Replicated logs in [{}, {}] to peer {}", r.nextIndex, r.nextIndex + entriesSize - 1,
r.options.getPeerId());
}
}
r.setState(State.Replicate);
r.blockTimer = null;
r.nextIndex += entriesSize;
r.hasSucceeded = true;
r.notifyOnCaughtUp(RaftError.SUCCESS.getNumber(), false);
// dummy_id is unlock in _send_entries
if (r.timeoutNowIndex > 0 && r.timeoutNowIndex < r.nextIndex) {
r.sendTimeoutNow(false, false);
}
return true;
}
3. BallotBox#commitAt
最后再来看看BallotBox#commitAt方法究竟做了什么,才让日志提交
核心逻辑是根据传入的索引区间逐条投票,取最后一条投票通过的日志索引,将它及之前的日志全部标记为提交,最后根据该提交索引将日志应用到FSM(有限状态机,Finite State Machine)。
public boolean commitAt(final long firstLogIndex, final long lastLogIndex, final PeerId peer) {
// TODO use lock-free algorithm here?
final long stamp = this.stampedLock.writeLock();
long lastCommittedIndex = 0;
try {
// 1. 信箱未被初始化
if (this.pendingIndex == 0) {
return false;
}
// 2. 日志已经被提交了
if (lastLogIndex < this.pendingIndex) {
return true;
}
// 3. 提交索引越界
if (lastLogIndex >= this.pendingIndex + this.pendingMetaQueue.size()) {
throw new ArrayIndexOutOfBoundsException();
}
// 4. 计算投票起始位置,取两者中的较大值;raft规则:提交某条日志时会顺带提交之前任期的日志
final long startAt = Math.max(this.pendingIndex, firstLogIndex);
Ballot.PosHint hint = new Ballot.PosHint();
for (long logIndex = startAt; logIndex <= lastLogIndex; logIndex++) {
final Ballot bl = this.pendingMetaQueue.get((int) (logIndex - this.pendingIndex));
hint = bl.grant(peer, hint);
// 如果投票通过,更新提交索引
if (bl.isGranted()) {
lastCommittedIndex = logIndex;
}
}
if (lastCommittedIndex == 0) {
return true;
}
// When removing a peer off the raft group which contains even number of
// peers, the quorum would decrease by 1, e.g. 3 of 4 changes to 2 of 3. In
// this case, the log after removal may be committed before some previous
// logs, since we use the new configuration to deal the quorum of the
// removal request, we think it's safe to commit all the uncommitted
// previous logs, which is not well proved right now
this.pendingMetaQueue.removeFromFirst((int) (lastCommittedIndex - this.pendingIndex) + 1);
LOG.debug("Node {} committed log fromIndex={}, toIndex={}.", this.opts.getNodeId(), this.pendingIndex,
lastCommittedIndex);
this.pendingIndex = lastCommittedIndex + 1;
this.lastCommittedIndex = lastCommittedIndex;
} finally {
this.stampedLock.unlockWrite(stamp);
}
// 5. 将日志应用到状态机
this.waiter.onCommitted(lastCommittedIndex);
return true;
}
3.1.4. FSM
在BallotBox提交索引后,FSMCaller#onCommitted会将提交索引包装为ApplyTask事件,发布到FSMCaller内部Disruptor的消费队列中
@Override
public boolean onCommitted(final long committedIndex) {
return enqueueTask((task, sequence) -> {
task.type = TaskType.COMMITTED;
task.committedIndex = committedIndex;
});
}
private boolean enqueueTask(final EventTranslator<ApplyTask> tpl) {
if (this.shutdownLatch != null) {
// Shutting down
LOG.warn("FSMCaller is stopped, can not apply new task.");
return false;
}
this.taskQueue.publishEvent(tpl);
return true;
}
ApplyTaskHandler#onEvent方法则调用runApplyTask方法
private class ApplyTaskHandler implements EventHandler<ApplyTask> {
boolean firstRun = true;
// max committed index in current batch, reset to -1 every batch
private long maxCommittedIndex = -1;
@Override
public void onEvent(final ApplyTask event, final long sequence, final boolean endOfBatch) throws Exception {
setFsmThread();
this.maxCommittedIndex = runApplyTask(event, this.maxCommittedIndex, endOfBatch);
}
private void setFsmThread() {
if (firstRun) {
fsmThread = Thread.currentThread();
firstRun = false;
}
}
}
private long runApplyTask(final ApplyTask task, long maxCommittedIndex, final boolean endOfBatch) {
CountDownLatch shutdown = null;
if (task.type == TaskType.COMMITTED) {
// 更新最大提交索引
if (task.committedIndex > maxCommittedIndex) {
maxCommittedIndex = task.committedIndex;
}
task.reset();
} else {
// 省略部分代码
}
try {
if (endOfBatch && maxCommittedIndex >= 0) {
this.currTask = TaskType.COMMITTED;
// 应用最大索引日志到状态机
doCommitted(maxCommittedIndex);
maxCommittedIndex = -1L; // reset maxCommittedIndex
}
this.currTask = TaskType.IDLE;
return maxCommittedIndex;
} finally {
if (shutdown != null) {
shutdown.countDown();
}
}
}
最终调用FSMCallerImpl#doCommitted方法,这里核心的逻辑如下
- 将日志中包含的操作apply到状态机
- 回调closure,这里的closure是应用层传入的closure,一般用来设置response,或者记录某些指标
- 更新lastAppliedIndex和lastAppliedTerm
private void doCommitted(final long committedIndex) {
if (!this.error.getStatus().isOk()) {
return;
}
final long lastAppliedIndex = this.lastAppliedIndex.get();
// We can tolerate the disorder of committed_index
if (lastAppliedIndex >= committedIndex) {
return;
}
this.lastCommittedIndex.set(committedIndex);
final long startMs = Utils.monotonicMs();
try {
final List<Closure> closures = new ArrayList<>();
final List<TaskClosure> taskClosures = new ArrayList<>();
final long firstClosureIndex = this.closureQueue.popClosureUntil(committedIndex, closures, taskClosures);
// Calls TaskClosure#onCommitted if necessary
onTaskCommitted(taskClosures);
Requires.requireTrue(firstClosureIndex >= 0, "Invalid firstClosureIndex");
// 组装成迭代器
final IteratorImpl iterImpl = new IteratorImpl(this, this.logManager, closures, firstClosureIndex,
lastAppliedIndex, committedIndex, this.applyingIndex);
while (iterImpl.isGood()) {
final LogEntry logEntry = iterImpl.entry();
if (logEntry.getType() != EnumOutter.EntryType.ENTRY_TYPE_DATA) {
if (logEntry.getType() == EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION) {
if (logEntry.getOldPeers() != null && !logEntry.getOldPeers().isEmpty()) {
// Joint stage is not supposed to be noticeable by end users.
this.fsm.onConfigurationCommitted(new Configuration(iterImpl.entry().getPeers()));
}
}
if (iterImpl.done() != null) {
// For other entries, we have nothing to do besides flush the
// pending tasks and run this closure to notify the caller that the
// entries before this one were successfully committed and applied.
// 回调应用层closure
iterImpl.done().run(Status.OK());
}
iterImpl.next();
continue;
}
// Apply data task to user state machine
doApplyTasks(iterImpl);
}
if (iterImpl.hasError()) {
setError(iterImpl.getError());
iterImpl.runTheRestClosureWithError();
}
long lastIndex = iterImpl.getIndex() - 1;
final long lastTerm = this.logManager.getTerm(lastIndex);
// 更新lastAppliedIndex和lastAppliedTerm
setLastApplied(lastIndex, lastTerm);
} finally {
this.nodeMetrics.recordLatency("fsm-commit", Utils.monotonicMs() - startMs);
}
}
3.1.5. 小结
在上面我们已经完成了leader部分日志复制核心逻辑的梳理,核心组件有
- NodeImpl:节点的抽象实现
- LogManagerImpl:日志管理抽象实现
- ApplyDisruptor:Leader发起写日志的消费队列
- DiskDisruptor:持久化事件的消费队列
- FSMCallerImpl:有限状态机的代理实现
- Replicator:负责leader到peer的日志复制、心跳等功能的组件。
3.2. Replicator
在上一节中我们讲述了Leader接收日志-提交日志-应用到状态机的整个流程,但还缺失了一点:Leader是如何将日志复制给peer(包括learner和follower)的。在这一节中,我们将专注于探究Replicator是如何发送日志给Peer的。
ReplicatorGroupImpl#addReplicator负责新建并添加一个Replicator到当前的group下。
@Override
public boolean addReplicator(final PeerId peer, final ReplicatorType replicatorType, final boolean sync) {
// 省略部分代码
final ThreadId rid = Replicator.start(opts, this.raftOptions);
// 省略部分代码
return this.replicatorMap.put(peer, rid) == null;
}
Replicator.start则是真正新建的方法,核心逻辑如下
- 通知Replicator状态监听器
- 启动heartbeatTimer
- 发送探针
public static ThreadId start(final ReplicatorOptions opts, final RaftOptions raftOptions) {
// 省略部分代码
final Replicator r = new Replicator(opts, raftOptions);
if (!r.rpcService.connect(opts.getPeerId().getEndpoint())) {
LOG.error("Fail to init sending channel to {}, group: {}.", opts.getPeerId(), opts.getGroupId());
// Return and it will be retried later.
return null;
}
// 省略部分代码
// Start replication
r.id = new ThreadId(r, r);
r.id.lock();
notifyReplicatorStatusListener(r, ReplicatorEvent.CREATED);
LOG.info("Replicator [group: {}, peer: {}, type: {}] is started", r.options.getGroupId(), r.options.getPeerId(), r.options.getReplicatorType());
r.catchUpClosure = null;
r.lastRpcSendTimestamp = Utils.monotonicMs();
r.startHeartbeatTimer(Utils.nowMs());
// id.unlock in sendEmptyEntries
// 发送探针
r.sendProbeRequest();
return r.id;
}
在第2节探针Probe中我们已经了解到,探针的作用是同步leader和peer之间的日志差异,并在修正差异后继续发送日志给peer。
当leader和peer之间的日志冲突可解决或者没有冲突,并且leader的日志多于peer时,continueSendEntries为true,最终会调用Replicator#sendEntries方法发送日志。
static void onRpcReturned(final ThreadId id, final RequestType reqType, final Status status, final Message request,
final Message response, final int seq, final int stateVersion, final long rpcSendTime) {
// 省略部分代码
try {
// 省略部分代码
try {
switch (queuedPipelinedResponse.requestType) {
case AppendEntries:
// 对于probe request的回调,响应类型是AppendEntries
continueSendEntries = onAppendEntriesReturned(id, inflight, queuedPipelinedResponse.status,
(AppendEntriesRequest) queuedPipelinedResponse.request,
(AppendEntriesResponse) queuedPipelinedResponse.response, rpcSendTime, startTimeMs, r);
break;
case Snapshot:
continueSendEntries = onInstallSnapshotReturned(id, r, queuedPipelinedResponse.status,
(InstallSnapshotRequest) queuedPipelinedResponse.request,
(InstallSnapshotResponse) queuedPipelinedResponse.response);
break;
}
} finally {
// 省略部分代码
}
}
} finally {
// 省略部分代码
if (continueSendEntries) {
// 继续发送请求
r.sendEntries();
}
}
}
比较核心的逻辑是
- 如果peer需要的日志已经被压缩,leader会通过发送snapshot来进行同步
- 如果peer的日志已经与leader一致了,replicator会进入等待状态
- 如果一切正常,则构建AppendEntriesRequest,并发送日志
void sendEntries() {
boolean doUnlock = true;
try {
long prevSendIndex = -1;
while (true) {
final long nextSendingIndex = getNextSendIndex();
if (nextSendingIndex > prevSendIndex) {
if (sendEntries(nextSendingIndex)) {
prevSendIndex = nextSendingIndex;
} else {
doUnlock = false;
// id already unlock in sendEntries when it returns false.
break;
}
} else {
break;
}
}
} finally {
if (doUnlock) {
unlockId();
}
}
}
private boolean sendEntries(final long nextSendingIndex) {
final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
// 1. 无法从内存中获取日志,日志已经被压缩,这时只能通过snapshot同步
if (!fillCommonFields(rb, nextSendingIndex - 1, false)) {
// unlock id in installSnapshot
installSnapshot();
return false;
}
ByteBufferCollector dataBuf = null;
final int maxEntriesSize = this.raftOptions.getMaxEntriesSize();
final RecyclableByteBufferList byteBufList = RecyclableByteBufferList.newInstance();
try {
// 2. 尽可能获取entry
for (int i = 0; i < maxEntriesSize; i++) {
final RaftOutter.EntryMeta.Builder emb = RaftOutter.EntryMeta.newBuilder();
if (!prepareEntry(nextSendingIndex, i, emb, byteBufList)) {
break;
}
rb.addEntries(emb.build());
}
// 3. 两种情况:
// 3.1 无法从内存中获取日志,日志已经被压缩,这时只能通过snapshot同步
// 3.2 leader和peer日志一致,这时候需要把replicator置为等待状态
if (rb.getEntriesCount() == 0) {
if (nextSendingIndex < this.options.getLogManager().getFirstLogIndex()) {
// 要发送的entry数为0,且记录的follower下一条日志的index小于本地第一条日志
// 通过snapshot复制
// 通过snapshot复制
installSnapshot();
return false;
}
// 传入replicator期望发送的下一条日志的index
waitMoreEntries(nextSendingIndex);
return false;
}
if (byteBufList.getCapacity() > 0) {
dataBuf = ByteBufferCollector.allocateByRecyclers(byteBufList.getCapacity());
for (final ByteBuffer b : byteBufList) {
dataBuf.put(b);
}
final ByteBuffer buf = dataBuf.getBuffer();
BufferUtils.flip(buf);
rb.setData(ZeroByteStringHelper.wrap(buf));
}
} finally {
RecycleUtil.recycle(byteBufList);
}
// 4. 一切正常,则构建AppendEntriesRequest,并发送日志
final AppendEntriesRequest request = rb.build();
this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
this.statInfo.firstLogIndex = rb.getPrevLogIndex() + 1;
this.statInfo.lastLogIndex = rb.getPrevLogIndex() + rb.getEntriesCount();
final Recyclable recyclable = dataBuf;
final int v = this.version;
final long monotonicSendTimeMs = Utils.monotonicMs();
final int seq = getAndIncrementReqSeq();
this.appendEntriesCounter++;
Future<Message> rpcFuture = null;
try {
rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(), request, -1,
new RpcResponseClosureAdapter<AppendEntriesResponse>() {
@Override
public void run(final Status status) {
RecycleUtil.recycle(recyclable); // TODO: recycle on send success, not response received.
onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request, getResponse(),
seq, v, monotonicSendTimeMs);
}
});
} catch (final Throwable t) {
RecycleUtil.recycle(recyclable);
ThrowUtil.throwException(t);
}
addInflight(RequestType.AppendEntries, nextSendingIndex, request.getEntriesCount(), request.getData().size(),
seq, rpcFuture);
return true;
}
这里需要留意一下Replicator#waitMoreEntries方法,该方法的入参是期望发送的下一条日志的index,并让replicator进入idle状态等待,停止发送日志。
此外还调用了LogManagerImpl#wait方法,传入了continueSending的callback,并产生了一个waitId。waitId的作用是在唤醒Replicator时候确认调用哪一个回调,因为回调中携带了下一个要发送的日志的index。
private void waitMoreEntries(final long nextWaitIndex) {
try {
LOG.debug("Node {} waits more entries", this.options.getNode().getNodeId());
if (this.waitId >= 0) {
return;
}
this.waitId = this.options.getLogManager().wait(nextWaitIndex - 1,
(arg, errorCode) -> continueSending((ThreadId) arg, errorCode), this.id);
this.statInfo.runningState = RunningState.IDLE;
} finally {
unlockId();
}
}
@Override
public long wait(final long expectedLastLogIndex, final NewLogCallback cb, final Object arg) {
final WaitMeta wm = new WaitMeta(cb, arg, 0);
return notifyOnNewLog(expectedLastLogIndex, wm);
}
private long notifyOnNewLog(final long expectedLastLogIndex, final WaitMeta wm) {
this.writeLock.lock();
try {
// 1. 还有日志可发,提交到线程池中进行发送
if (expectedLastLogIndex != this.lastLogIndex || this.stopped) {
wm.errorCode = this.stopped ? RaftError.ESTOP.getNumber() : 0;
ThreadPoolsFactory.runInThread(this.groupId, () -> runOnNewLog(wm));
return 0L;
}
// 2. 无日志可发,自增waitId
long waitId = this.nextWaitId++;
if (waitId < 0) {
// waitId自增可能溢出为负,此时从1重新开始
// Valid waitId starts from 1, skip 0.
waitId = this.nextWaitId = 1;
}
this.waitMap.put(waitId, wm);
return waitId;
} finally {
this.writeLock.unlock();
}
}
如果是没有日志发送的情况,此时Replicator就已经被“挂起”了,那么Replicator是什么时候被唤醒的呢?
答案是LogManagerImpl#wakeupAllWaiter方法,该方法会取出所有等待中的WaitMeta,逐个执行回调。
private boolean wakeupAllWaiter(final Lock lock) {
if (this.waitMap.isEmpty()) {
lock.unlock();
return false;
}
final List<WaitMeta> wms = new ArrayList<>(this.waitMap.values());
final int errCode = this.stopped ? RaftError.ESTOP.getNumber() : RaftError.SUCCESS.getNumber();
this.waitMap.clear();
lock.unlock();
final int waiterCount = wms.size();
for (int i = 0; i < waiterCount; i++) {
final WaitMeta wm = wms.get(i);
wm.errorCode = errCode;
ThreadPoolsFactory.runInThread(this.groupId, () -> runOnNewLog(wm));
}
return true;
}
而wakeupAllWaiter方法被调用的时机是LogManagerImpl#appendEntries方法,也就是说leader在每次写入日志时,都会尝试唤醒处于等待状态的Replicator。
@Override
public void appendEntries(final List<LogEntry> entries, final StableClosure done) {
// 省略部分代码
if (!wakeupAllWaiter(this.writeLock)) {
notifyLastLogIndexListeners();
}
// 省略部分代码
}
通过上面的代码分析,我们可以看出Replicator复制日志的时机有两个
- 新建时候,通过sendProbeRequest触发日志同步
- leader写入日志时,调用LogManagerImpl#wakeupAllWaiter方法唤醒处于等待状态的Replicator,并发送日志。
4. 总结
在上面的内容中我们大致了解了Leader和peer之间是如何保证同步的:1. pipeline机制提高发送效率,保证顺序处理;2. probe探针修正leader和peer之间的日志冲突;3. replicator等组件实现日志复制。但这也仅仅是大致了解,在jraft中还有许多有意思的设计和技巧是本篇文章尚未关注的。在该系列的后续文章中,我将针对一些编码技巧进行总结,望大伙儿多多关注~