Elasticsearch Series, Part 2: Master Election in 7.x and Later
In the previous article in this series, "Master Election Before 7.x", we saw that Elasticsearch's master election before 7.x was based on the Bully algorithm. Starting with 7.0, Elasticsearch switched to an election implementation based on the Raft algorithm.
Why reimplement master election with Raft?
1. The discovery.zen.minimum_master_nodes setting told the cluster how many master-eligible nodes had to be present to elect a master (the quorum). Forgetting to set it, or setting it incorrectly, could leave the cluster briefly unavailable, and it had to be updated by hand whenever master-eligible nodes were added or removed (see the configuration sketch after this list).
2. The old election was slow: it took three rounds of pinging to discover the other nodes and complete an election.
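For context, this is roughly how the two generations of configuration compare (a sketch only; the node names and the value 2 below are made-up examples for a cluster with three master-eligible nodes):

# elasticsearch.yml before 7.0: the quorum had to be maintained by hand,
# e.g. (3 / 2) + 1 = 2 for a cluster with three master-eligible nodes.
discovery.zen.minimum_master_nodes: 2

# elasticsearch.yml in 7.x: only needed once, when bootstrapping a brand-new cluster;
# afterwards the voting configuration is managed automatically by the cluster itself.
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]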
A Brief Introduction to the Raft Algorithm
Raft is an algorithm designed to solve the distributed consensus problem.
A Raft node is always in one of three states: Follower, Candidate, or Leader.
A node starts out as a Follower. If it goes for a while without receiving a heartbeat from the Leader, it switches to the Candidate state and starts an election; if it then receives votes from a majority of the nodes, it becomes the Leader.
When voting, a node only grants its vote to a candidate whose state is at least as up to date as its own. Conversely, if a node that considers itself the Leader discovers a node with a newer term, it steps down and returns to the Follower state.
In addition, Raft divides time into terms. A term begins with an election; it ends either because the election fails to produce a Leader, or later because the elected Leader fails.
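To make those rules concrete, here is a minimal, self-contained sketch of the state machine in plain Java (purely illustrative; it is neither the Elasticsearch implementation nor a complete Raft implementation):

// Minimal illustration of Raft's node states and term rules (not production code).
class RaftNodeSketch {

    enum Mode { FOLLOWER, CANDIDATE, LEADER }

    private Mode mode = Mode.FOLLOWER;   // every node starts as a follower
    private long currentTerm = 0;

    // No heartbeat from the leader for a while: become a candidate and start an election.
    void onElectionTimeout() {
        mode = Mode.CANDIDATE;
        currentTerm++;                   // a new election opens a new term
        // ...request votes from all other nodes here...
    }

    // A vote is only granted if the candidate's term is not older than ours.
    boolean grantVote(long candidateTerm) {
        if (candidateTerm < currentTerm) {
            return false;
        }
        currentTerm = Math.max(currentTerm, candidateTerm);
        return true;
    }

    // Seeing a newer term means another election or leader exists; a leader steps down.
    void onTermSeen(long term) {
        if (term > currentTerm) {
            currentTerm = term;
            mode = Mode.FOLLOWER;
        }
    }

    // Votes from a majority of the cluster win the election.
    void onVotesReceived(int votes, int clusterSize) {
        if (mode == Mode.CANDIDATE && votes * 2 > clusterSize) {
            mode = Mode.LEADER;
        }
    }
}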
Overview of the Elasticsearch Election Flow
At a high level, the 7.x flow is: a freshly started master-eligible node becomes a candidate and discovers its peers; it runs a pre-vote round to check that it could win an election; once a quorum of pre-votes has been collected it bumps the term and sends a StartJoinRequest to the other nodes; those nodes answer with join requests, and when the initiating node has gathered joins from a quorum of the voting configuration it becomes the leader. The rest of this article walks through the source code behind each of these steps.
Elasticsearch Source Code Walkthrough
The low-level election interface is Discovery, and the new Raft-based implementation is Coordinator. When a node starts up, it calls startInitialJoin to kick off the election process.
@Override
public void startInitialJoin() {
    synchronized (mutex) {
        becomeCandidate("startInitialJoin");
    }
    clusterBootstrapService.scheduleUnconfiguredBootstrap();
}
At startup, the node first enters the Candidate state via becomeCandidate, which does some preparation for the election; scheduleUnconfiguredBootstrap then starts the bootstrap. becomeCandidate is relatively simple, so let's focus on scheduleUnconfiguredBootstrap:
void scheduleUnconfiguredBootstrap() {
    if (unconfiguredBootstrapTimeout == null) {
        return;
    }
    // If the local node is not master-eligible, return immediately and stay out of the election.
    if (transportService.getLocalNode().isMasterNode() == false) {
        return;
    }
    // Wait for unconfiguredBootstrapTimeout (3 seconds by default), then start bootstrapping.
    transportService.getThreadPool().scheduleUnlessShuttingDown(unconfiguredBootstrapTimeout, Names.GENERIC, new Runnable() {
        @Override
        public void run() {
            final Set<DiscoveryNode> discoveredNodes = getDiscoveredNodes();
            final List<DiscoveryNode> zen1Nodes = discoveredNodes.stream().filter(Coordinator::isZen1Node).collect(Collectors.toList());
            if (zen1Nodes.isEmpty()) {
                // No pre-7.0 (ZenPing) nodes were discovered, so start the Raft-style election.
                startBootstrap(discoveredNodes, emptyList());
            } else {
                logger.info("avoiding best-effort cluster bootstrapping due to discovery of pre-7.0 nodes {}", zen1Nodes);
            }
        }
    });
}
After the unconfiguredBootstrapTimeout (3 seconds by default) has elapsed, startBootstrap kicks off the bootstrap:
private void startBootstrap(Set<DiscoveryNode> discoveryNodes, List<String> unsatisfiedRequirements) {
    assert discoveryNodes.stream().allMatch(DiscoveryNode::isMasterNode) : discoveryNodes;
    assert discoveryNodes.stream().noneMatch(Coordinator::isZen1Node) : discoveryNodes;
    assert unsatisfiedRequirements.size() < discoveryNodes.size() : discoveryNodes + " smaller than " + unsatisfiedRequirements;
    if (bootstrappingPermitted.compareAndSet(true, false)) {
        doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
            unsatisfiedRequirements.stream().map(s -> BOOTSTRAP_PLACEHOLDER_PREFIX + s))
            .collect(Collectors.toSet())));
    }
}
After these checks pass, doBootstrap starts a new round of the election:
private void doBootstrap(VotingConfiguration votingConfiguration) {
    assert transportService.getLocalNode().isMasterNode();
    try {
        votingConfigurationConsumer.accept(votingConfiguration);
    } catch (Exception e) {
        // On failure, retry after 10 seconds.
        transportService.getThreadPool().scheduleUnlessShuttingDown(TimeValue.timeValueSeconds(10), Names.GENERIC,
            new Runnable() {
                @Override
                public void run() {
                    doBootstrap(votingConfiguration);
                }
            }
        );
    }
}
The real work is delegated to the votingConfigurationConsumer function; if it throws, the call is retried after 10 seconds. When the Coordinator is constructed, this consumer is set to Coordinator.setInitialConfiguration:
public boolean setInitialConfiguration(final VotingConfiguration votingConfiguration) {
    synchronized (mutex) {
        final ClusterState currentState = getStateForMasterService();
        // ...some basic validation...
        final List<DiscoveryNode> knownNodes = new ArrayList<>();
        knownNodes.add(getLocalNode());
        peerFinder.getFoundPeers().forEach(knownNodes::add);
        // If the discovered nodes do not form a quorum of the voting configuration, throw an exception.
        if (votingConfiguration.hasQuorum(knownNodes.stream().map(DiscoveryNode::getId).collect(Collectors.toList())) == false) {
            throw new CoordinationStateRejectedException("not enough nodes discovered to form a quorum in the initial configuration " +
                "[knownNodes=" + knownNodes + ", " + votingConfiguration + "]");
        }
        logger.info("setting initial configuration to {}", votingConfiguration);
        final CoordinationMetaData coordinationMetaData = CoordinationMetaData.builder(currentState.coordinationMetaData())
            .lastAcceptedConfiguration(votingConfiguration)
            .lastCommittedConfiguration(votingConfiguration)
            .build();
        MetaData.Builder metaDataBuilder = MetaData.builder(currentState.metaData());
        // automatically generate a UID for the metadata if we need to
        metaDataBuilder.generateClusterUuidIfNeeded(); // TODO generate UUID in bootstrapping tool?
        metaDataBuilder.coordinationMetaData(coordinationMetaData);
        // Initialize the cluster state.
        coordinationState.get().setInitialState(ClusterState.builder(currentState).metaData(metaDataBuilder).build());
        // Initialize the preVoteCollector's response.
        preVoteCollector.update(getPreVoteResponse(), null); // pick up the change to last-accepted version
        // Start the election scheduler.
        startElectionScheduler();
        return true;
    }
}
setInitialConfiguration performs some initialization and then, via startElectionScheduler, asynchronously calls PreVoteCollector.start to begin the pre-vote round.
private void startElectionScheduler() {
    electionScheduler = electionSchedulerFactory.startElectionScheduler(gracePeriod, new Runnable() {
        @Override
        public void run() {
            synchronized (mutex) {
                if (mode == Mode.CANDIDATE) {
                    final ClusterState lastAcceptedState = coordinationState.get().getLastAcceptedState();
                    // Fail fast: if the local node cannot possibly win the election, don't start one.
                    if (localNodeMayWinElection(lastAcceptedState) == false) {
                        return;
                    }
                    if (prevotingRound != null) {
                        prevotingRound.close();
                    }
                    final List<DiscoveryNode> discoveredNodes
                        = getDiscoveredNodes().stream().filter(n -> isZen1Node(n) == false).collect(Collectors.toList());
                    // Start the pre-vote round.
                    prevotingRound = preVoteCollector.start(lastAcceptedState, discoveredNodes);
                }
            }
        }
    });
}
preVoteCollector.start, shown below, sends a pre-vote request to every discovered node in turn.
void start(final Iterable<DiscoveryNode> broadcastNodes) {
    ...
    broadcastNodes.forEach(n -> transportService.sendRequest(n, REQUEST_PRE_VOTE_ACTION_NAME, preVoteRequest,
        new TransportResponseHandler<PreVoteResponse>() {
            @Override
            public PreVoteResponse read(StreamInput in) throws IOException {
                return new PreVoteResponse(in);
            }

            @Override
            public void handleResponse(PreVoteResponse response) {
                handlePreVoteResponse(response, n);
            }
        }));
}
When another node receives a pre-vote request, it handles it in PreVoteCollector.handlePreVoteRequest:
private PreVoteResponse handlePreVoteRequest(final PreVoteRequest request) {
    updateMaxTermSeen.accept(request.getCurrentTerm());
    Tuple<DiscoveryNode, PreVoteResponse> state = this.state;
    assert state != null : "received pre-vote request before fully initialised";
    final DiscoveryNode leader = state.v1();
    final PreVoteResponse response = state.v2();
    if (leader == null) {
        return response;
    }
    if (leader.equals(request.getSourceNode())) {
        return response;
    }
    throw new CoordinationStateRejectedException("rejecting " + request + " as there is already a leader");
}
It first calls updateMaxTermSeen to record the largest term seen so far; if the local node is the leader but a larger term has appeared, it gives up leadership and triggers a new election. After that, if there is currently no leader, or the leader is exactly the node that sent the request, it responds with its pre-vote; otherwise it rejects the request.
private void updateMaxTermSeen(final long term) {
    synchronized (mutex) {
        maxTermSeen = Math.max(maxTermSeen, term);
        final long currentTerm = getCurrentTerm();
        if (mode == Mode.LEADER && maxTermSeen > currentTerm) {
            // Bump our term. However if there is a publication in flight then doing so would cancel the publication, so don't do that
            // since we check whether a term bump is needed at the end of the publication too.
            if (publicationInProgress()) {
                logger.debug("updateMaxTermSeen: maxTermSeen = {} > currentTerm = {}, enqueueing term bump", maxTermSeen, currentTerm);
            } else {
                try {
                    logger.debug("updateMaxTermSeen: maxTermSeen = {} > currentTerm = {}, bumping term", maxTermSeen, currentTerm);
                    ensureTermAtLeast(getLocalNode(), maxTermSeen);
                    startElection();
                } catch (Exception e) {
                    logger.warn(new ParameterizedMessage("failed to bump term to {}", maxTermSeen), e);
                    becomeCandidate("updateMaxTermSeen");
                }
            }
        }
    }
}
When a pre-vote response comes back, it is handled by handlePreVoteResponse.
private void handlePreVoteResponse(final PreVoteResponse response, final DiscoveryNode sender) {
    // Update the maximum term seen so far.
    updateMaxTermSeen.accept(response.getCurrentTerm());
    // If the responder's last accepted term is newer than ours, or equal but with a higher version,
    // its state is fresher than ours, so its pre-vote is not counted.
    if (response.getLastAcceptedTerm() > clusterState.term()
        || (response.getLastAcceptedTerm() == clusterState.term()
        && response.getLastAcceptedVersion() > clusterState.getVersionOrMetaDataVersion())) {
        logger.debug("{} ignoring {} from {} as it is fresher", this, response, sender);
        return;
    }
    preVotesReceived.put(sender, response);
    // create a fake VoteCollection based on the pre-votes and check if there is an election quorum
    final VoteCollection voteCollection = new VoteCollection();
    final DiscoveryNode localNode = clusterState.nodes().getLocalNode();
    final PreVoteResponse localPreVoteResponse = getPreVoteResponse();
    preVotesReceived.forEach((node, preVoteResponse) -> voteCollection.addJoinVote(
        new Join(node, localNode, preVoteResponse.getCurrentTerm(),
            preVoteResponse.getLastAcceptedTerm(), preVoteResponse.getLastAcceptedVersion())));
    // Without a quorum of pre-votes, return without starting an election.
    if (electionStrategy.isElectionQuorum(clusterState.nodes().getLocalNode(), localPreVoteResponse.getCurrentTerm(),
        localPreVoteResponse.getLastAcceptedTerm(), localPreVoteResponse.getLastAcceptedVersion(),
        clusterState.getLastCommittedConfiguration(), clusterState.getLastAcceptedConfiguration(), voteCollection) == false) {
        return;
    }
    startElection.run();
}
Again the maximum term is updated first, and then the vote is checked for validity. Once a quorum of pre-votes has been received, startElection is called to ask the other nodes to join the local node.
private void startElection() {
    synchronized (mutex) {
        // The preVoteCollector is only active while we are candidate, but it does not call this method with synchronisation, so we have
        // to check our mode again here.
        if (mode == Mode.CANDIDATE) {
            if (localNodeMayWinElection(getLastAcceptedState()) == false) {
                logger.trace("skip election as local node may not win it: {}", getLastAcceptedState().coordinationMetaData());
                return;
            }
            final StartJoinRequest startJoinRequest
                = new StartJoinRequest(getLocalNode(), Math.max(getCurrentTerm(), maxTermSeen) + 1);
            logger.debug("starting election with {}", startJoinRequest);
            getDiscoveredNodes().forEach(node -> {
                if (isZen1Node(node) == false) {
                    joinHelper.sendStartJoinRequest(startJoinRequest, node);
                }
            });
        }
    }
}
startElection sends a StartJoinRequest to every discovered (non-Zen) node, asking them to join the local node. At the same time the term is incremented by one, marking the start of a new term.
When another node receives the StartJoinRequest, it responds by sending a join request back:
transportService.registerRequestHandler(START_JOIN_ACTION_NAME, Names.GENERIC, false, false,
    StartJoinRequest::new,
    (request, channel, task) -> {
        final DiscoveryNode destination = request.getSourceNode();
        sendJoinRequest(destination, Optional.of(joinLeaderInTerm.apply(request)));
        channel.sendResponse(Empty.INSTANCE);
    });
The Join is first constructed by joinLeaderInTerm, which also bumps the term and switches the node into the Candidate state:
private Join joinLeaderInTerm(StartJoinRequest startJoinRequest) {
    synchronized (mutex) {
        logger.debug("joinLeaderInTerm: for [{}] with term {}", startJoinRequest.getSourceNode(), startJoinRequest.getTerm());
        // handleStartJoin bumps the current term.
        final Join join = coordinationState.get().handleStartJoin(startJoinRequest);
        lastJoin = Optional.of(join);
        peerFinder.setCurrentTerm(getCurrentTerm());
        if (mode != Mode.CANDIDATE) {
            becomeCandidate("joinLeaderInTerm"); // updates followersChecker and preVoteCollector
        } else {
            followersChecker.updateFastResponseState(getCurrentTerm(), mode);
            preVoteCollector.update(getPreVoteResponse(), null);
        }
        return join;
    }
}
The join request is then sent via sendJoinRequest. That part is straightforward, so the original code is not reproduced here.
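Conceptually, that step just wraps the Join in a JoinRequest and ships it back to the election initiator over the transport layer. The sketch below illustrates only that idea; Node, Join, JoinRequest and Transport here are simplified stand-ins made up for illustration, not the real Elasticsearch classes:

import java.util.Optional;

// Illustrative stand-ins only -- not the real Elasticsearch classes. The point is simply that
// sendJoinRequest wraps the Join (our vote for the election initiator in the new term) in a
// JoinRequest and sends it back to that node, mirroring the transportService.sendRequest
// pattern used for pre-votes earlier in this article.
final class SendJoinSketch {

    static final class Node { final String id; Node(String id) { this.id = id; } }

    // In the real code a Join also carries the last accepted term/version, as seen in the
    // handlePreVoteResponse snippet above.
    static final class Join {
        final Node votingNode; final Node masterCandidate; final long term;
        Join(Node votingNode, Node masterCandidate, long term) {
            this.votingNode = votingNode; this.masterCandidate = masterCandidate; this.term = term;
        }
    }

    static final class JoinRequest {
        final Node sourceNode; final Optional<Join> optionalJoin;
        JoinRequest(Node sourceNode, Optional<Join> optionalJoin) {
            this.sourceNode = sourceNode; this.optionalJoin = optionalJoin;
        }
    }

    interface Transport { void send(Node destination, String action, Object request); }

    static void sendJoinRequest(Transport transport, Node localNode, Node destination, Optional<Join> optionalJoin) {
        transport.send(destination, "join", new JoinRequest(localNode, optionalJoin));
    }
}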
When a node receives the JoinRequest, it handles it in handleJoinRequest. It first connects to the node that sent the request. If the local node has already been elected master, it validates the join via sendValidateJoinRequest; otherwise it processes the join directly via processJoinRequest.
private void handleJoinRequest(JoinRequest joinRequest, JoinHelper.JoinCallback joinCallback) {
    transportService.connectToNode(joinRequest.getSourceNode(), ActionListener.wrap(ignore -> {
        final ClusterState stateForJoinValidation = getStateForMasterService();
        if (stateForJoinValidation.nodes().isLocalNodeElectedMaster()) {
            onJoinValidators.forEach(a -> a.accept(joinRequest.getSourceNode(), stateForJoinValidation));
            if (stateForJoinValidation.getBlocks().hasGlobalBlock(STATE_NOT_RECOVERED_BLOCK) == false) {
                // we do this in a couple of places including the cluster update thread. This one here is really just best effort
                // to ensure we fail as fast as possible.
                JoinTaskExecutor.ensureMajorVersionBarrier(joinRequest.getSourceNode().getVersion(),
                    stateForJoinValidation.getNodes().getMinNodeVersion());
            }
            sendValidateJoinRequest(stateForJoinValidation, joinRequest, joinCallback);
        } else {
            processJoinRequest(joinRequest, joinCallback);
        }
    }, joinCallback::onFailure));
}
On receiving the ValidateJoinRequest, the sender confirms that the JoinRequest is valid as long as it belongs to the same cluster (matching cluster UUID) and the versions, indices and so on are compatible:
transportService.registerRequestHandler(VALIDATE_JOIN_ACTION_NAME,
    ThreadPool.Names.GENERIC, ValidateJoinRequest::new,
    (request, channel, task) -> {
        final ClusterState localState = currentStateSupplier.get();
        if (localState.metaData().clusterUUIDCommitted() &&
            localState.metaData().clusterUUID().equals(request.getState().metaData().clusterUUID()) == false) {
            throw new CoordinationStateRejectedException("join validation on cluster state" +
                " with a different cluster uuid " + request.getState().metaData().clusterUUID() +
                " than local cluster uuid " + localState.metaData().clusterUUID() + ", rejecting");
        }
        joinValidators.forEach(action -> action.accept(transportService.getLocalNode(), request.getState()));
        channel.sendResponse(Empty.INSTANCE);
    });
Once the join has been confirmed as valid, processJoinRequest handles it. Note that if the election was not yet won before this join but is won after it, the node declares itself the Leader:
private void processJoinRequest(JoinRequest joinRequest, JoinHelper.JoinCallback joinCallback) {
    final Optional<Join> optionalJoin = joinRequest.getOptionalJoin();
    synchronized (mutex) {
        final CoordinationState coordState = coordinationState.get();
        final boolean prevElectionWon = coordState.electionWon();
        optionalJoin.ifPresent(this::handleJoin);
        joinAccumulator.handleJoinRequest(joinRequest.getSourceNode(), joinCallback);
        if (prevElectionWon == false && coordState.electionWon()) {
            becomeLeader("handleJoinRequest");
        }
    }
}
Looking at handleJoin: it first makes sure the term is up to date via ensureTermAtLeast, and then hands the join over to coordinationState.
private void handleJoin(Join join) {
    synchronized (mutex) {
        ensureTermAtLeast(getLocalNode(), join.getTerm()).ifPresent(this::handleJoin);
        ...
        coordinationState.get().handleJoin(join); // this might fail and bubble up the exception
    }
}
Now ensureTermAtLeast. When the StartJoinRequest was sent earlier, the term was incremented and that new term travels back inside the JoinRequest, so the first time a Join is processed getCurrentTerm is smaller than targetTerm. In that case the node "joins itself" to adopt the new term, which is effectively the node voting for itself.
private Optional<Join> ensureTermAtLeast(DiscoveryNode sourceNode, long targetTerm) {
    assert Thread.holdsLock(mutex) : "Coordinator mutex not held";
    if (getCurrentTerm() < targetTerm) {
        return Optional.of(joinLeaderInTerm(new StartJoinRequest(sourceNode, targetTerm)));
    }
    return Optional.empty();
}
Finally, coordinationState.handleJoin: isElectionQuorum checks whether a quorum of nodes has joined; if so, the local node has won the election.
public boolean handleJoin(Join join) {
    ...
    boolean added = joinVotes.addJoinVote(join);
    boolean prevElectionWon = electionWon;
    // Once a quorum of joins has been received, the election is won.
    electionWon = isElectionQuorum(joinVotes);
    if (electionWon && prevElectionWon == false) {
        logger.debug("handleJoin: election won in term [{}] with {}", getCurrentTerm(), joinVotes);
        lastPublishedVersion = getLastAcceptedVersion();
    }
    return added;
}
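The quorum rule itself is a simple majority over the voting configuration (the set of master-eligible node ids). Below is a minimal, standalone illustration of that rule in plain Java; it is a deliberate simplification, since the real isElectionQuorum also has to be satisfied against both the last committed and the last accepted voting configurations:

import java.util.Set;

// Minimal illustration of the majority rule behind isElectionQuorum (not the ES implementation).
final class QuorumSketch {

    // True if the nodes that have joined include more than half of the voting configuration.
    static boolean isElectionQuorum(Set<String> joinedNodeIds, Set<String> votingConfiguration) {
        long votesFromConfig = joinedNodeIds.stream().filter(votingConfiguration::contains).count();
        return votesFromConfig * 2 > votingConfiguration.size();
    }

    public static void main(String[] args) {
        Set<String> votingConfiguration = Set.of("node-1", "node-2", "node-3");
        System.out.println(isElectionQuorum(Set.of("node-1"), votingConfiguration));            // false: 1 of 3
        System.out.println(isElectionQuorum(Set.of("node-1", "node-2"), votingConfiguration));  // true: 2 of 3
    }
}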