Elasticsearch Series, Part 2: Master Election After 7.x


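In the previous article, "Elasticsearch Series, Part 2: Master Election Before 7.x", we saw that before 7.x Elasticsearch elected its master with the Bully algorithm. Starting with 7.0, Elasticsearch elects its master with an algorithm based on Raft.

Why was master election reimplemented on top of Raft?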

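1. The discovery.zen.minimum_master_nodes setting defines the minimum number of master-eligible nodes that must take part in an election. Forgetting to set it, or setting it incorrectly, can leave the cluster unavailable or expose it to split-brain, and the setting also has to be changed whenever master-eligible nodes are added.

2. The old election was slow: a node needed three rounds of pings to discover the other nodes and complete an election.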
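
To make the first point concrete, here is a minimal sketch of the arithmetic an operator had to redo by hand every time the number of master-eligible nodes changed. The class and method names are hypothetical and are not part of Elasticsearch; the formula is the usual majority rule, floor(n / 2) + 1.

```java
// Hypothetical helper for illustration only; not an Elasticsearch API.
final class ZenQuorumSketch {

    /** Majority of the master-eligible nodes: floor(n / 2) + 1. */
    static int minimumMasterNodes(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1;
    }
}
```

Growing a cluster from 3 to 5 master-eligible nodes moves the value from 2 to 3, which is exactly the kind of manual update that is easy to forget.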

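A Brief Introduction to Raft

Raft is an algorithm designed to solve the distributed consensus problem.

A Raft node is always in one of three states: Follower, Candidate, or Leader.

Every node starts as a Follower. If it receives no heartbeat from a Leader for some time, it becomes a Candidate and starts an election. If it then receives votes from a majority of the nodes, it becomes the Leader.

When voting, a node grants its vote to a candidate whose log is at least as up to date as its own. A Leader that discovers a node with a newer term gives up leadership and returns to the Follower state.

Raft also divides time into terms. A term begins with an election and ends either when the election fails to produce a Leader or when the Leader goes down.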
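
The role transitions described above can be sketched as a small state machine. This is a simplified illustration, not Elasticsearch code and not a full Raft implementation: the class name and methods are hypothetical, there is no log and no networking, and votes are assumed to come from distinct nodes.

```java
// Simplified sketch of Raft role transitions; all names are hypothetical.
final class RaftNodeSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private final int clusterSize;
    private Role role = Role.FOLLOWER;   // every node starts as a Follower
    private long currentTerm = 0;
    private int votesReceived = 0;

    RaftNodeSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    Role role() { return role; }
    long currentTerm() { return currentTerm; }

    /** No heartbeat arrived in time: start an election in a new term. */
    void onElectionTimeout() {
        if (role == Role.LEADER) {
            return;
        }
        role = Role.CANDIDATE;
        currentTerm++;
        votesReceived = 1;               // a candidate votes for itself
        maybeBecomeLeader();
    }

    /** A peer granted its vote for the given term (assumed to be a distinct peer). */
    void onVoteGranted(long term) {
        if (role != Role.CANDIDATE || term != currentTerm) {
            return;                      // stale or unexpected vote
        }
        votesReceived++;
        maybeBecomeLeader();
    }

    /** Any message carrying a higher term forces a step-down to Follower. */
    void onHigherTermSeen(long term) {
        if (term > currentTerm) {
            currentTerm = term;
            role = Role.FOLLOWER;
            votesReceived = 0;
        }
    }

    private void maybeBecomeLeader() {
        if (votesReceived * 2 > clusterSize) {
            role = Role.LEADER;
        }
    }
}
```

In a three-node cluster, a node that times out becomes a Candidate in term 1 with its own vote, and a single additional vote gives it a majority.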

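Overview of the Elasticsearch Election Flow

[Figure: flow diagram of the Elasticsearch election process]

Elasticsearch Source Code Walkthrough

The underlying interface for election is Discovery, and the new Raft-based implementation is the Coordinator class. When a node starts, it calls startInitialJoin to begin the election.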

    @Override
    public void startInitialJoin() {
        synchronized (mutex) {
            becomeCandidate("startInitialJoin");
        }
        clusterBootstrapService.scheduleUnconfiguredBootstrap();
    }

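On startup the node first enters the Candidate state via becomeCandidate, which does some preparation for the election. scheduleUnconfiguredBootstrap then schedules the step that leads to the election. becomeCandidate is relatively simple, so let's focus on scheduleUnconfiguredBootstrap.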

    void scheduleUnconfiguredBootstrap() {
        if (unconfiguredBootstrapTimeout == null) {
            return;
        }
        // Nodes that are not master-eligible return immediately and do not take part in the election.
        if (transportService.getLocalNode().isMasterNode() == false) {
            return;
        }
        // After waiting unconfiguredBootstrapTimeout (3 seconds by default), start the election.
        transportService.getThreadPool().scheduleUnlessShuttingDown(unconfiguredBootstrapTimeout, Names.GENERIC, new Runnable() {
            @Override
            public void run() {
                final Set<DiscoveryNode> discoveredNodes = getDiscoveredNodes();
                final List<DiscoveryNode> zen1Nodes = discoveredNodes.stream().filter(Coordinator::isZen1Node).collect(Collectors.toList());
                if (zen1Nodes.isEmpty()) {
                    // If none of the discovered nodes is a pre-7.0 (Zen) node, start the Raft-based election.
                    startBootstrap(discoveredNodes, emptyList());
                } else {
                    logger.info("avoiding best-effort cluster bootstrapping due to discovery of pre-7.0 nodes {}", zen1Nodes);
                }
            }
        });
    }

After waiting for unconfiguredBootstrapTimeout (3 seconds by default), the node calls startBootstrap to begin the election.

    private void startBootstrap(Set<DiscoveryNode> discoveryNodes, List<String> unsatisfiedRequirements) {
        assert discoveryNodes.stream().allMatch(DiscoveryNode::isMasterNode) : discoveryNodes;
        assert discoveryNodes.stream().noneMatch(Coordinator::isZen1Node) : discoveryNodes;
        assert unsatisfiedRequirements.size() < discoveryNodes.size() : discoveryNodes + " smaller than " + unsatisfiedRequirements;
        if (bootstrappingPermitted.compareAndSet(true, false)) {
            doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
                unsatisfiedRequirements.stream().map(s -> BOOTSTRAP_PLACEHOLDER_PREFIX + s))
                .collect(Collectors.toSet())));
        }
    }

After these checks, doBootstrap kicks off a new round of election.

    private void doBootstrap(VotingConfiguration votingConfiguration) {
        assert transportService.getLocalNode().isMasterNode();

        try {
            votingConfigurationConsumer.accept(votingConfiguration);
        } catch (Exception e) {
            // On failure, retry after 10 seconds.
            transportService.getThreadPool().scheduleUnlessShuttingDown(TimeValue.timeValueSeconds(10), Names.GENERIC,
                new Runnable() {
                    @Override
                    public void run() {
                        doBootstrap(votingConfiguration);
                    }
                }
            );
        }
    }

The work is delegated to the votingConfigurationConsumer function; if it throws, the call is retried after 10 seconds. This function is set to Coordinator.setInitialConfiguration when the Coordinator is initialized.

public boolean setInitialConfiguration(final VotingConfiguration votingConfiguration) {
        synchronized (mutex) {
            final ClusterState currentState = getStateForMasterService();
            // Some basic validation.
            final List<DiscoveryNode> knownNodes = new ArrayList<>();
            knownNodes.add(getLocalNode());
            peerFinder.getFoundPeers().forEach(knownNodes::add);
            // Throw if the known nodes do not form a quorum (more than half) of the proposed voting configuration.
            if(votingConfiguration.hasQuorum(knownNodes.stream().map(DiscoveryNode::getId).collect(Collectors.toList())) == false) {
                throw new CoordinationStateRejectedException("not enough nodes discovered to form a quorum in the initial configuration " +
                    "[knownNodes=" + knownNodes + ", " + votingConfiguration + "]");
            }

            logger.info("setting initial configuration to {}", votingConfiguration);
            final CoordinationMetaData coordinationMetaData = CoordinationMetaData.builder(currentState.coordinationMetaData())
                .lastAcceptedConfiguration(votingConfiguration)
                .lastCommittedConfiguration(votingConfiguration)
                .build();

            MetaData.Builder metaDataBuilder = MetaData.builder(currentState.metaData());
            // automatically generate a UID for the metadata if we need to
            metaDataBuilder.generateClusterUuidIfNeeded(); // TODO generate UUID in bootstrapping tool?
            metaDataBuilder.coordinationMetaData(coordinationMetaData);

            // Initialize the cluster state.
            coordinationState.get().setInitialState(ClusterState.builder(currentState).metaData(metaDataBuilder).build());
            // Initialize the preVoteCollector's response.
            preVoteCollector.update(getPreVoteResponse(), null); // pick up the change to last-accepted version
            // Start the election.
            startElectionScheduler();
            return true;
        }
}
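
The quorum check in setInitialConfiguration can be illustrated with a small model. This is a sketch under the assumption that a quorum means strictly more than half of the node IDs in the voting configuration; VotingConfigurationSketch is a hypothetical stand-in written for this article, not the real VotingConfiguration class.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for a voting configuration.
final class VotingConfigurationSketch {
    private final Set<String> nodeIds;

    VotingConfigurationSketch(Collection<String> nodeIds) {
        this.nodeIds = new HashSet<>(nodeIds);
    }

    /** True if the votes cover strictly more than half of the configured node IDs. */
    boolean hasQuorum(Collection<String> votes) {
        final Set<String> intersection = new HashSet<>(nodeIds);
        intersection.retainAll(votes);   // votes from unknown nodes do not count
        return intersection.size() * 2 > nodeIds.size();
    }
}
```

With a configuration of three nodes, two known nodes are enough to proceed, while one known node plus any number of nodes outside the configuration is not.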

After some initialization, setInitialConfiguration calls startElectionScheduler, which asynchronously invokes PreVoteCollector.start to begin the pre-vote.

private void startElectionScheduler() {
        electionScheduler = electionSchedulerFactory.startElectionScheduler(gracePeriod, new Runnable() {
            @Override
            public void run() {
                synchronized (mutex) {
                    if (mode == Mode.CANDIDATE) {
                        final ClusterState lastAcceptedState = coordinationState.get().getLastAcceptedState();
                        // Fail fast: if the local node cannot win the election, do not start one.
                        if (localNodeMayWinElection(lastAcceptedState) == false) {
                            return;
                        }

                        if (prevotingRound != null) {
                            prevotingRound.close();
                        }
                        final List<DiscoveryNode> discoveredNodes
                            = getDiscoveredNodes().stream().filter(n -> isZen1Node(n) == false).collect(Collectors.toList());
                        // Start the pre-voting round.
                        prevotingRound = preVoteCollector.start(lastAcceptedState, discoveredNodes);
                    }
                }
            }
        });
    }

preVoteCollector.start starts a pre-voting round. The round's start method, shown below, sends a pre-vote request to each node in turn.

void start(final Iterable<DiscoveryNode> broadcastNodes) {
  					...
            broadcastNodes.forEach(n -> transportService.sendRequest(n, REQUEST_PRE_VOTE_ACTION_NAME, preVoteRequest,
                new TransportResponseHandler<PreVoteResponse>() {
                    @Override
                    public PreVoteResponse read(StreamInput in) throws IOException {
                        return new PreVoteResponse(in);
                    }

                    @Override
                    public void handleResponse(PreVoteResponse response) {
                        handlePreVoteResponse(response, n);
                    }
                }));
        }

When another node receives the pre-vote request, it handles it in PreVoteCollector.handlePreVoteRequest.

    private PreVoteResponse handlePreVoteRequest(final PreVoteRequest request) {
        updateMaxTermSeen.accept(request.getCurrentTerm());

        Tuple<DiscoveryNode, PreVoteResponse> state = this.state;
        assert state != null : "received pre-vote request before fully initialised";

        final DiscoveryNode leader = state.v1();
        final PreVoteResponse response = state.v2();

        if (leader == null) {
            return response;
        }

        if (leader.equals(request.getSourceNode())) {
            return response;
        }

        throw new CoordinationStateRejectedException("rejecting " + request + " as there is already a leader");
    }

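The handler first calls updateMaxTermSeen to record the highest term seen so far. If the local node is the master but a higher term exists, it gives up leadership and starts a new election. After that, the node grants the pre-vote if it currently knows of no master or if the master it knows is the requesting node; otherwise it rejects the request.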

private void updateMaxTermSeen(final long term) {
        synchronized (mutex) {
            maxTermSeen = Math.max(maxTermSeen, term);
            final long currentTerm = getCurrentTerm();
            if (mode == Mode.LEADER && maxTermSeen > currentTerm) {
                // Bump our term. However if there is a publication in flight then doing so would cancel the publication, so don't do that
                // since we check whether a term bump is needed at the end of the publication too.
                if (publicationInProgress()) {
                    logger.debug("updateMaxTermSeen: maxTermSeen = {} > currentTerm = {}, enqueueing term bump", maxTermSeen, currentTerm);
                } else {
                    try {
                        logger.debug("updateMaxTermSeen: maxTermSeen = {} > currentTerm = {}, bumping term", maxTermSeen, currentTerm);
                        ensureTermAtLeast(getLocalNode(), maxTermSeen);
                        startElection();
                    } catch (Exception e) {
                        logger.warn(new ParameterizedMessage("failed to bump term to {}", maxTermSeen), e);
                        becomeCandidate("updateMaxTermSeen");
                    }
                }
            }
        }
    }
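
The decision made when a pre-vote request arrives can be condensed into a few lines. This is a simplified sketch with hypothetical names: node IDs are plain strings, a null leader means "no known leader", and a rejection is modelled as a false return value, whereas the code above throws an exception.

```java
// Simplified model of pre-vote request handling; all names are hypothetical.
final class PreVoteRequestSketch {
    private long maxTermSeen;
    private final String knownLeaderId;  // null when no leader is known

    PreVoteRequestSketch(long initialTerm, String knownLeaderId) {
        this.maxTermSeen = initialTerm;
        this.knownLeaderId = knownLeaderId;
    }

    long maxTermSeen() { return maxTermSeen; }

    /** Records the requester's term, then grants the pre-vote only if no other leader is known. */
    boolean handle(long requesterTerm, String requesterId) {
        maxTermSeen = Math.max(maxTermSeen, requesterTerm);
        return knownLeaderId == null || knownLeaderId.equals(requesterId);
    }
}
```

Note that the highest term is recorded even when the request ends up being rejected, which matches the order of operations in handlePreVoteRequest.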

When a pre-vote response arrives, it is handled by handlePreVoteResponse.

private void handlePreVoteResponse(final PreVoteResponse response, final DiscoveryNode sender) {
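            // Update the highest term seen.
            updateMaxTermSeen.accept(response.getCurrentTerm());
            // If the responder's last-accepted term is higher than ours, or equal with a higher version,
            // do not count this response as a vote.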
            if (response.getLastAcceptedTerm() > clusterState.term()
                || (response.getLastAcceptedTerm() == clusterState.term()
                && response.getLastAcceptedVersion() > clusterState.getVersionOrMetaDataVersion())) {
                logger.debug("{} ignoring {} from {} as it is fresher", this, response, sender);
                return;
            }

            preVotesReceived.put(sender, response);

            // create a fake VoteCollection based on the pre-votes and check if there is an election quorum
            final VoteCollection voteCollection = new VoteCollection();
            final DiscoveryNode localNode = clusterState.nodes().getLocalNode();
            final PreVoteResponse localPreVoteResponse = getPreVoteResponse();

            preVotesReceived.forEach((node, preVoteResponse) -> voteCollection.addJoinVote(
                new Join(node, localNode, preVoteResponse.getCurrentTerm(),
                preVoteResponse.getLastAcceptedTerm(), preVoteResponse.getLastAcceptedVersion())));
            // Return if a quorum of pre-votes has not been reached yet.
            if (electionStrategy.isElectionQuorum(clusterState.nodes().getLocalNode(), localPreVoteResponse.getCurrentTerm(),
                localPreVoteResponse.getLastAcceptedTerm(), localPreVoteResponse.getLastAcceptedVersion(),
                clusterState.getLastCommittedConfiguration(), clusterState.getLastAcceptedConfiguration(), voteCollection) == false) {
                return;
            }
            startElection.run();
        }
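
The two checks in handlePreVoteResponse, the freshness test and the quorum test, can be modelled as follows. This is a sketch with hypothetical names and one deliberate simplification: it checks a single set of voting node IDs, while the code above checks both the last committed and the last accepted configuration through electionStrategy.isElectionQuorum.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Simplified model of one pre-voting round; all names are hypothetical.
final class PreVoteRoundSketch {
    private final Set<String> votingNodeIds;
    private final long localTerm;
    private final long localVersion;
    private final Set<String> preVotes = new HashSet<>();

    PreVoteRoundSketch(Collection<String> votingNodeIds, long localTerm, long localVersion) {
        this.votingNodeIds = new HashSet<>(votingNodeIds);
        this.localTerm = localTerm;
        this.localVersion = localVersion;
    }

    /** Returns true once the responses collected so far justify starting a real election. */
    boolean onResponse(String sender, long lastAcceptedTerm, long lastAcceptedVersion) {
        boolean fresher = lastAcceptedTerm > localTerm
            || (lastAcceptedTerm == localTerm && lastAcceptedVersion > localVersion);
        if (fresher) {
            return false;                // the responder has newer state: ignore its pre-vote
        }
        preVotes.add(sender);
        Set<String> counted = new HashSet<>(preVotes);
        counted.retainAll(votingNodeIds);
        return counted.size() * 2 > votingNodeIds.size();
    }
}
```

A response from a node with fresher state never contributes to the quorum, so a stale node cannot talk itself into starting an election it is bound to lose.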

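Again the highest term is updated first, and then the response is checked for validity. Once a quorum of pre-votes has been received, startElection asks the other nodes to join the local node.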

private void startElection() {
        synchronized (mutex) {
            // The preVoteCollector is only active while we are candidate, but it does not call this method with synchronisation, so we have
            // to check our mode again here.
            if (mode == Mode.CANDIDATE) {
                if (localNodeMayWinElection(getLastAcceptedState()) == false) {
                    logger.trace("skip election as local node may not win it: {}", getLastAcceptedState().coordinationMetaData());
                    return;
                }

                final StartJoinRequest startJoinRequest
                    = new StartJoinRequest(getLocalNode(), Math.max(getCurrentTerm(), maxTermSeen) + 1);
                logger.debug("starting election with {}", startJoinRequest);
                getDiscoveredNodes().forEach(node -> {
                    if (isZen1Node(node) == false) {
                        joinHelper.sendStartJoinRequest(startJoinRequest, node);
                    }
                });
            }
        }
    }

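startElection sends a StartJoinRequest to every discovered node that is not a Zen node, asking it to join the local node. The request carries max(currentTerm, maxTermSeen) + 1 as its term, which marks the start of a new term.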
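
The term carried by the StartJoinRequest is worth a tiny worked example. The helper below is hypothetical and only restates the expression used in startElection.

```java
// Hypothetical helper restating the term computation from startElection.
final class ElectionTermSketch {

    /** The new term must exceed both the local term and any term seen from peers. */
    static long nextElectionTerm(long currentTerm, long maxTermSeen) {
        return Math.max(currentTerm, maxTermSeen) + 1;
    }
}
```

Taking the maximum first means a node that has heard of term 7 while still being in term 3 proposes term 8 rather than term 4.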

A node that receives a StartJoinRequest immediately sends a join request back to the sender.

transportService.registerRequestHandler(START_JOIN_ACTION_NAME, Names.GENERIC, false, false,
            StartJoinRequest::new,
            (request, channel, task) -> {
                final DiscoveryNode destination = request.getSourceNode();
                sendJoinRequest(destination, Optional.of(joinLeaderInTerm.apply(request)));
                channel.sendResponse(Empty.INSTANCE);
            });

joinLeaderInTerm first builds the Join. It also updates the term and, if the node is not already a Candidate, switches it to the Candidate state.

private Join joinLeaderInTerm(StartJoinRequest startJoinRequest) {
    synchronized (mutex) {
        logger.debug("joinLeaderInTerm: for [{}] with term {}", startJoinRequest.getSourceNode(), startJoinRequest.getTerm());
        // This updates the term.
        final Join join = coordinationState.get().handleStartJoin(startJoinRequest);
        lastJoin = Optional.of(join);
        peerFinder.setCurrentTerm(getCurrentTerm());
        if (mode != Mode.CANDIDATE) {
            becomeCandidate("joinLeaderInTerm"); // updates followersChecker and preVoteCollector
        } else {
            followersChecker.updateFastResponseState(getCurrentTerm(), mode);
            preVoteCollector.update(getPreVoteResponse(), null);
        }
        return join;
    }
}

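sendJoinRequest then sends the join request. That part is straightforward, so the code is omitted here.

When a node receives a JoinRequest, it handles it in handleJoinRequest. It first connects to the node that sent the JoinRequest. If the local node is already the elected master, it validates the join through sendValidateJoinRequest; otherwise it goes straight to processJoinRequest.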

private void handleJoinRequest(JoinRequest joinRequest, JoinHelper.JoinCallback joinCallback) {
        transportService.connectToNode(joinRequest.getSourceNode(), ActionListener.wrap(ignore -> {
            final ClusterState stateForJoinValidation = getStateForMasterService();

            if (stateForJoinValidation.nodes().isLocalNodeElectedMaster()) {
                onJoinValidators.forEach(a -> a.accept(joinRequest.getSourceNode(), stateForJoinValidation));
                if (stateForJoinValidation.getBlocks().hasGlobalBlock(STATE_NOT_RECOVERED_BLOCK) == false) {
                    // we do this in a couple of places including the cluster update thread. This one here is really just best effort
                    // to ensure we fail as fast as possible.
                    JoinTaskExecutor.ensureMajorVersionBarrier(joinRequest.getSourceNode().getVersion(),
                        stateForJoinValidation.getNodes().getMinNodeVersion());
                }
                sendValidateJoinRequest(stateForJoinValidation, joinRequest, joinCallback);
            } else {
                processJoinRequest(joinRequest, joinCallback);
            }
        }, joinCallback::onFailure));
    }

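On receiving a ValidateJoinRequest, the node confirms that the JoinRequest is valid as long as the cluster UUID matches and the versions, indices, and so on are compatible.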

transportService.registerRequestHandler(VALIDATE_JOIN_ACTION_NAME,
            ThreadPool.Names.GENERIC, ValidateJoinRequest::new,
            (request, channel, task) -> {
                final ClusterState localState = currentStateSupplier.get();
                if (localState.metaData().clusterUUIDCommitted() &&
                    localState.metaData().clusterUUID().equals(request.getState().metaData().clusterUUID()) == false) {
                    throw new CoordinationStateRejectedException("join validation on cluster state" +
                        " with a different cluster uuid " + request.getState().metaData().clusterUUID() +
                        " than local cluster uuid " + localState.metaData().clusterUUID() + ", rejecting");
                }
                joinValidators.forEach(action -> action.accept(transportService.getLocalNode(), request.getState()));
                channel.sendResponse(Empty.INSTANCE);
            });

Once the join is confirmed valid, processJoinRequest handles it. Note that if the election had not been won before this join but has been won after it, the node declares itself Leader.

    private void processJoinRequest(JoinRequest joinRequest, JoinHelper.JoinCallback joinCallback) {
        final Optional<Join> optionalJoin = joinRequest.getOptionalJoin();
        synchronized (mutex) {
            final CoordinationState coordState = coordinationState.get();
            final boolean prevElectionWon = coordState.electionWon();

            optionalJoin.ifPresent(this::handleJoin);
            joinAccumulator.handleJoinRequest(joinRequest.getSourceNode(), joinCallback);

            if (prevElectionWon == false && coordState.electionWon()) {
                becomeLeader("handleJoinRequest");
            }
        }
    }

Now look at handleJoin. It first calls ensureTermAtLeast to make sure the term is up to date, and then lets coordinationState handle the join.

    private void handleJoin(Join join) {
        synchronized (mutex) {
            ensureTermAtLeast(getLocalNode(), join.getTerm()).ifPresent(this::handleJoin);
						...
            coordinationState.get().handleJoin(join); // this might fail and bubble up the exception  
        }
    }

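Now look at ensureTermAtLeast. The term was incremented when the StartJoinRequest was sent, and the JoinRequest carries it back, so the first time a Join is handled getCurrentTerm is smaller than targetTerm. The node then updates its term by joining itself (that is, by voting for itself).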

    private Optional<Join> ensureTermAtLeast(DiscoveryNode sourceNode, long targetTerm) {
        assert Thread.holdsLock(mutex) : "Coordinator mutex not held";
        if (getCurrentTerm() < targetTerm) {
            return Optional.of(joinLeaderInTerm(new StartJoinRequest(sourceNode, targetTerm)));
        }
        return Optional.empty();
    }

Next comes coordinationState.handleJoin. isElectionQuorum decides whether a quorum of nodes has joined; if so, the local node wins the election.

public boolean handleJoin(Join join) {
        ...
        boolean added = joinVotes.addJoinVote(join);
        boolean prevElectionWon = electionWon;
        // The election is won once a quorum of joins has been received.
        electionWon = isElectionQuorum(joinVotes);
        if (electionWon && prevElectionWon == false) {
            logger.debug("handleJoin: election won in term [{}] with {}", getCurrentTerm(), joinVotes);
            lastPublishedVersion = getLastAcceptedVersion();
        }
        return added;
    }
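
Putting the last two steps together, the way joins accumulate until the election is won can be sketched like this. It is a simplified model with hypothetical names: node IDs are strings and the quorum is a simple majority of one voting configuration. Unlike the handleJoin above, which returns whether the vote was newly added, this sketch returns whether the join tipped the election, mirroring the prevElectionWon check in processJoinRequest.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Simplified model of join accumulation on the candidate; all names are hypothetical.
final class JoinCollectorSketch {
    private final Set<String> votingNodeIds;
    private final Set<String> joinVotes = new HashSet<>();
    private boolean electionWon;

    JoinCollectorSketch(Collection<String> votingNodeIds) {
        this.votingNodeIds = new HashSet<>(votingNodeIds);
    }

    boolean electionWon() { return electionWon; }

    /** Returns true only for the join that causes the election to be won for the first time. */
    boolean handleJoin(String sourceNodeId) {
        boolean prevElectionWon = electionWon;
        joinVotes.add(sourceNodeId);     // duplicate joins from one node count once
        Set<String> counted = new HashSet<>(joinVotes);
        counted.retainAll(votingNodeIds);
        electionWon = counted.size() * 2 > votingNodeIds.size();
        return prevElectionWon == false && electionWon;
    }
}
```

In a three-node configuration the candidate's own join plus one more is enough, and later joins no longer trigger the transition to Leader a second time.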

