Elasticsearch Series (2): Master Election After 7.x

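In the previous article, "Elasticsearch Series (2): Master Election Before 7.x", we saw that master election in Elasticsearch before 7.x was based on the Bully algorithm. Starting with 7.0, Elasticsearch switched to an algorithm based on Raft for master election.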

Why was master election reimplemented on top of Raft?

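1. The discovery.zen.minimum_master_nodes setting specifies the minimum number of master-eligible nodes that must take part in an election. If it is left unset or set incorrectly, the cluster can become temporarily unavailable, and a value that is too low also allows split-brain. Adding master-eligible nodes requires changing this setting as well.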

2. The old election was slow: it took three rounds of pings to discover the other nodes and complete an election.
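To make the first point concrete: the value operators had to keep in sync by hand was the majority of master-eligible nodes, (n / 2) + 1. The sketch below only illustrates that arithmetic; the class and method names are hypothetical and are not part of Elasticsearch.

```java
// Hypothetical helper, not an Elasticsearch API: computes the majority that
// discovery.zen.minimum_master_nodes had to be set to by hand before 7.0.
final class MinimumMasterNodesSketch {
    private MinimumMasterNodesSketch() {}

    /** Majority of n master-eligible nodes: floor(n / 2) + 1. */
    static int majority(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1;
    }

    public static void main(String[] args) {
        // Growing from 3 to 5 master-eligible nodes changes the required value,
        // which is why scaling the masters also meant editing the setting.
        System.out.println(majority(3)); // 2
        System.out.println(majority(5)); // 3
    }
}
```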

A Brief Introduction to Raft

Raft is an algorithm designed to solve the distributed consensus problem.

A Raft node is always in one of three states: Follower, Candidate, or Leader.

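Every node starts as a Follower. If it receives no heartbeat from a Leader for some time, it becomes a Candidate and starts an election. Once it has received votes from a majority of the nodes, it becomes the Leader.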

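When voting, a node grants its vote to a candidate whose state is at least as up to date as its own. A Leader that discovers a node with a newer term than its own gives up leadership and returns to the Follower state.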

Raft also divides time into terms. Each term begins with an election; it ends either without a Leader (when the election fails) or when the elected Leader goes down.
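The role transitions described above can be sketched as a tiny state machine. This is a simplified illustration of textbook Raft, not Elasticsearch code: every name in it is made up for the example, and it leaves out log comparison, timers, and networking.

```java
// Illustrative model of the Raft roles; all names are hypothetical.
final class RaftNodeSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private Role role = Role.FOLLOWER; // every node starts as a Follower
    private long currentTerm = 0;
    private final int clusterSize;

    RaftNodeSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** No heartbeat arrived in time: start an election in a new term. */
    void onElectionTimeout() {
        if (role != Role.LEADER) {
            currentTerm++;
            role = Role.CANDIDATE;
        }
    }

    /** Votes counted so far (including our own): a strict majority wins. */
    void onVotesCounted(int votes) {
        if (role == Role.CANDIDATE && votes * 2 > clusterSize) {
            role = Role.LEADER;
        }
    }

    /** Any message carrying a higher term forces a step-down to Follower. */
    void onTermSeen(long term) {
        if (term > currentTerm) {
            currentTerm = term;
            role = Role.FOLLOWER;
        }
    }

    Role role() { return role; }

    long currentTerm() { return currentTerm; }
}
```

In a three-node cluster, a Follower that times out becomes a Candidate in term 1, turns Leader once two votes are counted, and falls back to Follower as soon as it sees a higher term.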

Overview of the Elasticsearch Election Flow

Elasticsearch Source Code Walkthrough

The underlying interface for election is Discovery, and the new Raft-based implementation is Coordinator. When a node starts, it calls startInitialJoin to begin the election.

    @Override
    public void startInitialJoin() {
        synchronized (mutex) {
            becomeCandidate("startInitialJoin");
        }
        clusterBootstrapService.scheduleUnconfiguredBootstrap();
    }

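On startup the node first enters the Candidate state via becomeCandidate, which does some preparation for the election. scheduleUnconfiguredBootstrap then kicks off the election. becomeCandidate is relatively simple, so let's focus on scheduleUnconfiguredBootstrap: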

    void scheduleUnconfiguredBootstrap() {
        if (unconfiguredBootstrapTimeout == null) {
            return;
        }
        // If the local node is not master-eligible, return and stay out of the election.
        if (transportService.getLocalNode().isMasterNode() == false) {
            return;
        }
        // After waiting unconfiguredBootstrapTimeout (3 seconds by default), start the election.
        transportService.getThreadPool().scheduleUnlessShuttingDown(unconfiguredBootstrapTimeout, Names.GENERIC, new Runnable() {
            @Override
            public void run() {
                final Set<DiscoveryNode> discoveredNodes = getDiscoveredNodes();
                final List<DiscoveryNode> zen1Nodes = discoveredNodes.stream().filter(Coordinator::isZen1Node).collect(Collectors.toList());
                if (zen1Nodes.isEmpty()) {
                    // No Zen1 (pre-7.0) nodes were discovered, so start the Raft-based election.
                    startBootstrap(discoveredNodes, emptyList());
                } else {
                    logger.info("avoiding best-effort cluster bootstrapping due to discovery of pre-7.0 nodes {}", zen1Nodes);
                }
            }
        });
    }

After waiting for unconfiguredBootstrapTimeout (3 seconds by default), the election starts in startBootstrap.

    private void startBootstrap(Set<DiscoveryNode> discoveryNodes, List<String> unsatisfiedRequirements) {
        assert discoveryNodes.stream().allMatch(DiscoveryNode::isMasterNode) : discoveryNodes;
        assert discoveryNodes.stream().noneMatch(Coordinator::isZen1Node) : discoveryNodes;
        assert unsatisfiedRequirements.size() < discoveryNodes.size() : discoveryNodes + " smaller than " + unsatisfiedRequirements;
        if (bootstrappingPermitted.compareAndSet(true, false)) {
            doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
                unsatisfiedRequirements.stream().map(s -> BOOTSTRAP_PLACEHOLDER_PREFIX + s))
                .collect(Collectors.toSet())));
        }
    }
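The bootstrappingPermitted.compareAndSet(true, false) call above acts as a run-at-most-once guard: only the first caller flips the flag and proceeds. A minimal sketch of that pattern, with hypothetical class and field names and a counter standing in for doBootstrap, might look like this:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of a run-at-most-once guard built on AtomicBoolean.compareAndSet.
// The class and its members are hypothetical, not Elasticsearch code.
final class OneShotBootstrapSketch {
    private final AtomicBoolean permitted = new AtomicBoolean(true);
    private int runs = 0;

    /** Returns true only for the single caller that flips the flag. */
    boolean tryBootstrap() {
        if (permitted.compareAndSet(true, false)) {
            runs++; // stands in for doBootstrap(...); reached by one caller only
            return true;
        }
        return false;
    }

    int runs() { return runs; }
}
```

Because compareAndSet is atomic, concurrent callers cannot both observe true, so the guarded work is never started twice.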

After these checks, doBootstrap starts a new round of election.

    private void doBootstrap(VotingConfiguration votingConfiguration) {
        assert transportService.getLocalNode().isMasterNode();

        try {
            votingConfigurationConsumer.accept(votingConfiguration);
        } catch (Exception e) {
            // On failure, retry after 10 seconds.
            transportService.getThreadPool().scheduleUnlessShuttingDown(TimeValue.timeValueSeconds(10), Names.GENERIC,
                new Runnable() {
                    @Override
                    public void run() {
                        doBootstrap(votingConfiguration);
                    }
                }
            );
        }
    }

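The configuration is handled by the votingConfigurationConsumer function; if it throws, the call is retried after 10 seconds. When the Coordinator is initialized, this function is set to the Coordinator.setInitialConfiguration method.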

    public boolean setInitialConfiguration(final VotingConfiguration votingConfiguration) {
        synchronized (mutex) {
            final ClusterState currentState = getStateForMasterService();
            // some basic validation
            final List<DiscoveryNode> knownNodes = new ArrayList<>();
            knownNodes.add(getLocalNode());
            peerFinder.getFoundPeers().forEach(knownNodes::add);
            // Throw if the known nodes do not form a quorum (more than half) of the proposed voting configuration.
            if(votingConfiguration.hasQuorum(knownNodes.stream().map(DiscoveryNode::getId).collect(Collectors.toList())) == false) {
                throw new CoordinationStateRejectedException("not enough nodes discovered to form a quorum in the initial configuration " +
                    "[knownNodes=" + knownNodes + ", " + votingConfiguration + "]");
            }

            logger.info("setting initial configuration to {}", votingConfiguration);
            final CoordinationMetaData coordinationMetaData = CoordinationMetaData.builder(currentState.coordinationMetaData())
                .lastAcceptedConfiguration(votingConfiguration)
                .lastCommittedConfiguration(votingConfiguration)
                .build();

            MetaData.Builder metaDataBuilder = MetaData.builder(currentState.metaData());
            // automatically generate a UID for the metadata if we need to
            metaDataBuilder.generateClusterUuidIfNeeded(); // TODO generate UUID in bootstrapping tool?
            metaDataBuilder.coordinationMetaData(coordinationMetaData);

            // Initialize the cluster state.
            coordinationState.get().setInitialState(ClusterState.builder(currentState).metaData(metaDataBuilder).build());
            // Initialize the preVoteCollector's response.
            preVoteCollector.update(getPreVoteResponse(), null); // pick up the change to last-accepted version
            // Start the election.
            startElectionScheduler();
            return true;
        }
    }
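The hasQuorum check above is a majority test: more than half of the node IDs in the voting configuration must be among the nodes that are actually known. The sketch below shows one way to express that idea with plain sets. It is an assumption-laden illustration; QuorumSketch and its method are hypothetical and do not reproduce the real VotingConfiguration class.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical majority check over a voting configuration.
final class QuorumSketch {
    private QuorumSketch() {}

    /**
     * True when strictly more than half of the configured voters
     * appear in the set of known node IDs.
     */
    static boolean hasQuorum(Set<String> votingNodeIds, Set<String> knownNodeIds) {
        Set<String> intersection = new HashSet<>(votingNodeIds);
        intersection.retainAll(knownNodeIds); // only configured voters count
        return intersection.size() * 2 > votingNodeIds.size();
    }
}
```

With voters {a, b, c}, knowing {a, b, x} is a quorum (two of three), while knowing only {a, x} is not. An empty configuration never has a quorum under this definition.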

After some initialization, setInitialConfiguration calls startElectionScheduler, which asynchronously invokes PreVoteCollector.start to begin the pre-vote round.

    private void startElectionScheduler() {
        electionScheduler = electionSchedulerFactory.startElectionScheduler(gracePeriod, new Runnable() {
            @Override
            public void run() {
                synchronized (mutex) {
                    if (mode == Mode.CANDIDATE) {
                        final ClusterState lastAcceptedState = coordinationState.get().getLastAcceptedState();
                        // Fail fast: if the local node cannot possibly win, do not start an election.
                        if (localNodeMayWinElection(lastAcceptedState) == false) {
                            return;
                        }

                        if (prevotingRound != null) {
                            prevotingRound.close();
                        }
                        final List<DiscoveryNode> discoveredNodes
                            = getDiscoveredNodes().stream().filter(n -> isZen1Node(n) == false).collect(Collectors.toList());
                        // Start the pre-vote round.
                        prevotingRound = preVoteCollector.start(lastAcceptedState, discoveredNodes);
                    }
                }
            }
        });
    }

preVoteCollector.start creates a PreVotingRound, whose start method (shown below) sends a pre-vote request to every discovered node in turn.

        void start(final Iterable<DiscoveryNode> broadcastNodes) {
            ...
            broadcastNodes.forEach(n -> transportService.sendRequest(n, REQUEST_PRE_VOTE_ACTION_NAME, preVoteRequest,
                new TransportResponseHandler<PreVoteResponse>() {
                    @Override
                    public PreVoteResponse read(StreamInput in) throws IOException {
                        return new PreVoteResponse(in);
                    }

                    @Override
                    public void handleResponse(PreVoteResponse response) {
                        handlePreVoteResponse(response, n);
                    }
                }));
        }

When another node receives a pre-vote request, it handles it in PreVoteCollector.handlePreVoteRequest:

    private PreVoteResponse handlePreVoteRequest(final PreVoteRequest request) {
        updateMaxTermSeen.accept(request.getCurrentTerm());

        Tuple<DiscoveryNode, PreVoteResponse> state = this.state;
        assert state != null : "received pre-vote request before fully initialised";

        final DiscoveryNode leader = state.v1();
        final PreVoteResponse response = state.v2();

        if (leader == null) {
            return response;
        }

        if (leader.equals(request.getSourceNode())) {
            return response;
        }

        throw new CoordinationStateRejectedException("rejecting " + request + " as there is already a leader");
    }

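It first calls updateMaxTermSeen to record the largest term seen so far. If the local node is the Leader but has now seen a term larger than its own, it gives up leadership and starts a new election. After that, if there is currently no Leader, or the Leader is the requesting node, it returns its pre-vote response; otherwise it rejects the request.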

    private void updateMaxTermSeen(final long term) {
        synchronized (mutex) {
            maxTermSeen = Math.max(maxTermSeen, term);
            final long currentTerm = getCurrentTerm();
            if (mode == Mode.LEADER && maxTermSeen > currentTerm) {
                // Bump our term. However if there is a publication in flight then doing so would cancel the publication, so don't do that
                // since we check whether a term bump is needed at the end of the publication too.
                if (publicationInProgress()) {
                    logger.debug("updateMaxTermSeen: maxTermSeen = {} > currentTerm = {}, enqueueing term bump", maxTermSeen, currentTerm);
                } else {
                    try {
                        logger.debug("updateMaxTermSeen: maxTermSeen = {} > currentTerm = {}, bumping term", maxTermSeen, currentTerm);
                        ensureTermAtLeast(getLocalNode(), maxTermSeen);
                        startElection();
                    } catch (Exception e) {
                        logger.warn(new ParameterizedMessage("failed to bump term to {}", maxTermSeen), e);
                        becomeCandidate("updateMaxTermSeen");
                    }
                }
            }
        }
    }
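Taken together, handlePreVoteRequest and updateMaxTermSeen come down to two small decisions. The sketch below isolates them as pure functions so they are easy to reason about. The names are hypothetical, node identity is reduced to a string ID, and the publication-in-progress case from the real code is deliberately left out.

```java
// Hypothetical distillation of the two decisions made when a pre-vote
// request arrives; not the actual Elasticsearch implementation.
final class PreVoteRequestSketch {
    enum Mode { CANDIDATE, LEADER, FOLLOWER }

    private PreVoteRequestSketch() {}

    /** A Leader that has seen a term above its own must bump its term and re-elect. */
    static boolean leaderMustBumpTerm(Mode mode, long currentTerm, long maxTermSeen) {
        return mode == Mode.LEADER && maxTermSeen > currentTerm;
    }

    /** Grant the pre-vote when no leader is known, or the requester is that leader. */
    static boolean grantsPreVote(String knownLeaderId, String requesterId) {
        return knownLeaderId == null || knownLeaderId.equals(requesterId);
    }
}
```

A node that already follows a different leader rejects the request, which keeps a temporarily disconnected node from disrupting a healthy cluster.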

When a pre-vote response arrives, it is handled by handlePreVoteResponse:

        private void handlePreVoteResponse(final PreVoteResponse response, final DiscoveryNode sender) {
            // Update the largest term seen.
            updateMaxTermSeen.accept(response.getCurrentTerm());
            // If the responder's last-accepted term is larger than ours, or equal with a higher version, do not count this response.
            if (response.getLastAcceptedTerm() > clusterState.term()
                || (response.getLastAcceptedTerm() == clusterState.term()
                && response.getLastAcceptedVersion() > clusterState.getVersionOrMetaDataVersion())) {
                logger.debug("{} ignoring {} from {} as it is fresher", this, response, sender);
                return;
            }

            preVotesReceived.put(sender, response);

            // create a fake VoteCollection based on the pre-votes and check if there is an election quorum
            final VoteCollection voteCollection = new VoteCollection();
            final DiscoveryNode localNode = clusterState.nodes().getLocalNode();
            final PreVoteResponse localPreVoteResponse = getPreVoteResponse();

            preVotesReceived.forEach((node, preVoteResponse) -> voteCollection.addJoinVote(
                new Join(node, localNode, preVoteResponse.getCurrentTerm(),
                preVoteResponse.getLastAcceptedTerm(), preVoteResponse.getLastAcceptedVersion())));
            // Without a majority of pre-votes, return.
            if (electionStrategy.isElectionQuorum(clusterState.nodes().getLocalNode(), localPreVoteResponse.getCurrentTerm(),
                localPreVoteResponse.getLastAcceptedTerm(), localPreVoteResponse.getLastAcceptedVersion(),
                clusterState.getLastCommittedConfiguration(), clusterState.getLastAcceptedConfiguration(), voteCollection) == false) {
                return;
            }
            startElection.run();
        }
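The freshness check at the top of handlePreVoteResponse is a lexicographic comparison on the pair (last-accepted term, last-accepted version). A minimal sketch of just that comparison, using a hypothetical helper name, is:

```java
// Hypothetical helper isolating the "is the responder fresher than us?" test.
final class FreshnessSketch {
    private FreshnessSketch() {}

    /**
     * True when the responder's (term, version) pair is strictly greater than
     * the local pair: the term is compared first, the version breaks ties.
     */
    static boolean isFresher(long responseTerm, long responseVersion,
                             long localTerm, long localVersion) {
        return responseTerm > localTerm
            || (responseTerm == localTerm && responseVersion > localVersion);
    }
}
```

A response from a fresher node is not counted as a pre-vote, because that node holds newer cluster state than the local candidate.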

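As before, the largest term is updated first, and then the response is checked for validity. Once pre-votes from a majority have been collected, startElection asks the other nodes to join the local node: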

    private void startElection() {
        synchronized (mutex) {
            // The preVoteCollector is only active while we are candidate, but it does not call this method with synchronisation, so we have
            // to check our mode again here.
            if (mode == Mode.CANDIDATE) {
                if (localNodeMayWinElection(getLastAcceptedState()) == false) {
                    logger.trace("skip election as local node may not win it: {}", getLastAcceptedState().coordinationMetaData());
                    return;
                }

                final StartJoinRequest startJoinRequest
                    = new StartJoinRequest(getLocalNode(), Math.max(getCurrentTerm(), maxTermSeen) + 1);
                logger.debug("starting election with {}", startJoinRequest);
                getDiscoveredNodes().forEach(node -> {
                    if (isZen1Node(node) == false) {
                        joinHelper.sendStartJoinRequest(startJoinRequest, node);
                    }
                });
            }
        }
    }

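startElection sends a StartJoinRequest to every non-Zen1 node, asking it to join the local node. The request carries a term one higher than max(currentTerm, maxTermSeen), which marks the start of a new term.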
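The term carried by the StartJoinRequest is computed as Math.max(getCurrentTerm(), maxTermSeen) + 1, which makes it strictly larger than every term the candidate has observed. As a small worked example, with a hypothetical helper name:

```java
// Hypothetical helper mirroring the term arithmetic used in startElection.
final class ElectionTermSketch {
    private ElectionTermSketch() {}

    /** The new term is one above the largest term this node knows about. */
    static long nextElectionTerm(long currentTerm, long maxTermSeen) {
        return Math.max(currentTerm, maxTermSeen) + 1;
    }
}
```

For instance, a candidate at term 3 that has seen term 7 elsewhere proposes term 8 rather than 4, so its request is not immediately outdated.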

A node that receives a StartJoinRequest responds by sending a join request back to the sender:

    transportService.registerRequestHandler(START_JOIN_ACTION_NAME, Names.GENERIC, false, false,
            StartJoinRequest::new,
            (request, channel, task) -> {
                final DiscoveryNode destination = request.getSourceNode();
                sendJoinRequest(destination, Optional.of(joinLeaderInTerm.apply(request)));
                channel.sendResponse(Empty.INSTANCE);
            });

joinLeaderInTerm first builds the Join. In the process it updates the term and switches the node to the Candidate state:

private Join joinLeaderInTerm(StartJoinRequest startJoinRequest) {
    synchronized (mutex) {
        logger.debug("joinLeaderInTerm: for [{}] with term {}", startJoinRequest.getSourceNode(), startJoinRequest.getTerm());
        // this updates the term
        final Join join = coordinationState.get().handleStartJoin(startJoinRequest);
        lastJoin = Optional.of(join);
        peerFinder.setCurrentTerm(getCurrentTerm());
        if (mode != Mode.CANDIDATE) {
            becomeCandidate("joinLeaderInTerm"); // updates followersChecker and preVoteCollector
        } else {
            followersChecker.updateFastResponseState(getCurrentTerm(), mode);
            preVoteCollector.update(getPreVoteResponse(), null);
        }
        return join;
    }
}

sendJoinRequest then sends the join request. That part is straightforward, so the code is omitted here.

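A node that receives a JoinRequest handles it in handleJoinRequest. It first connects to the node that sent the JoinRequest. If the local node is already the elected master, it validates the join via sendValidateJoinRequest; otherwise (an election is still in progress) it processes the join directly with processJoinRequest.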

    private void handleJoinRequest(JoinRequest joinRequest, JoinHelper.JoinCallback joinCallback) {
        transportService.connectToNode(joinRequest.getSourceNode(), ActionListener.wrap(ignore -> {
            final ClusterState stateForJoinValidation = getStateForMasterService();

            if (stateForJoinValidation.nodes().isLocalNodeElectedMaster()) {
                onJoinValidators.forEach(a -> a.accept(joinRequest.getSourceNode(), stateForJoinValidation));
                if (stateForJoinValidation.getBlocks().hasGlobalBlock(STATE_NOT_RECOVERED_BLOCK) == false) {
                    // we do this in a couple of places including the cluster update thread. This one here is really just best effort
                    // to ensure we fail as fast as possible.
                    JoinTaskExecutor.ensureMajorVersionBarrier(joinRequest.getSourceNode().getVersion(),
                        stateForJoinValidation.getNodes().getMinNodeVersion());
                }
                sendValidateJoinRequest(stateForJoinValidation, joinRequest, joinCallback);
            } else {
                processJoinRequest(joinRequest, joinCallback);
            }
        }, joinCallback::onFailure));
    }

On receiving a ValidateJoinRequest, a node accepts the JoinRequest as valid as long as the cluster UUID matches and the versions, indices, and so on are compatible:

    transportService.registerRequestHandler(VALIDATE_JOIN_ACTION_NAME,
            ThreadPool.Names.GENERIC, ValidateJoinRequest::new,
            (request, channel, task) -> {
                final ClusterState localState = currentStateSupplier.get();
                if (localState.metaData().clusterUUIDCommitted() &&
                    localState.metaData().clusterUUID().equals(request.getState().metaData().clusterUUID()) == false) {
                    throw new CoordinationStateRejectedException("join validation on cluster state" +
                        " with a different cluster uuid " + request.getState().metaData().clusterUUID() +
                        " than local cluster uuid " + localState.metaData().clusterUUID() + ", rejecting");
                }
                joinValidators.forEach(action -> action.accept(transportService.getLocalNode(), request.getState()));
                channel.sendResponse(Empty.INSTANCE);
            });

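Once the join is confirmed valid, it is handled by processJoinRequest. As the code shows, if the election had not been won before this join but is won after it, the node declares itself Leader: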

    private void processJoinRequest(JoinRequest joinRequest, JoinHelper.JoinCallback joinCallback) {
        final Optional<Join> optionalJoin = joinRequest.getOptionalJoin();
        synchronized (mutex) {
            final CoordinationState coordState = coordinationState.get();
            final boolean prevElectionWon = coordState.electionWon();

            optionalJoin.ifPresent(this::handleJoin);
            joinAccumulator.handleJoinRequest(joinRequest.getSourceNode(), joinCallback);

            if (prevElectionWon == false && coordState.electionWon()) {
                becomeLeader("handleJoinRequest");
            }
        }
    }

Now look at handleJoin. It first calls ensureTermAtLeast to make sure the local term is up to date, and then lets coordinationState handle the join:

    private void handleJoin(Join join) {
        synchronized (mutex) {
            ensureTermAtLeast(getLocalNode(), join.getTerm()).ifPresent(this::handleJoin);
            ...
            coordinationState.get().handleJoin(join); // this might fail and bubble up the exception  
        }
    }

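Now ensureTermAtLeast. The StartJoinRequest was sent with the term incremented by one, and that term comes back in the JoinRequest, so the first time a Join is handled getCurrentTerm is smaller than targetTerm. At that point the node updates its term by joining itself, in effect voting for itself.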

    private Optional<Join> ensureTermAtLeast(DiscoveryNode sourceNode, long targetTerm) {
        assert Thread.holdsLock(mutex) : "Coordinator mutex not held";
        if (getCurrentTerm() < targetTerm) {
            return Optional.of(joinLeaderInTerm(new StartJoinRequest(sourceNode, targetTerm)));
        }
        return Optional.empty();
    }

Next comes coordinationState.handleJoin. It uses isElectionQuorum to determine whether joins have been received from a majority; if so, the local node has won the election.

    public boolean handleJoin(Join join) {
        ...
        boolean added = joinVotes.addJoinVote(join);
        boolean prevElectionWon = electionWon;
        // The election is won once joins from a majority have been received.
        electionWon = isElectionQuorum(joinVotes);
        if (electionWon && prevElectionWon == false) {
            logger.debug("handleJoin: election won in term [{}] with {}", getCurrentTerm(), joinVotes);
            lastPublishedVersion = getLastAcceptedVersion();
        }
        return added;
    }
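The interplay between handleJoin and the prevElectionWon check in processJoinRequest can be sketched as a small vote accumulator that reports the moment the election flips from "not won" to "won". This is a simplified illustration under stated assumptions: the names are hypothetical, nodes are plain string IDs, and only a single voting configuration is consulted rather than both the last-committed and last-accepted ones.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical accumulator of join votes; not the real CoordinationState.
final class JoinVotesSketch {
    private final Set<String> votingConfiguration;
    private final Set<String> joinVotes = new HashSet<>();
    private boolean electionWon = false;

    JoinVotesSketch(Set<String> votingConfiguration) {
        this.votingConfiguration = new HashSet<>(votingConfiguration);
    }

    /** Records a join; returns true only when this join makes the election won. */
    boolean handleJoin(String nodeId) {
        boolean previouslyWon = electionWon;
        joinVotes.add(nodeId);
        Set<String> counted = new HashSet<>(joinVotes);
        counted.retainAll(votingConfiguration); // joins from non-voters do not count
        electionWon = counted.size() * 2 > votingConfiguration.size();
        return electionWon && !previouslyWon; // the point where becomeLeader would run
    }

    boolean electionWon() { return electionWon; }
}
```

With voters {a, b, c}, the join from a is not enough, a join from an unknown node z changes nothing, and the join from b tips the election; later joins no longer trigger the transition.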

WeChat Official Account

This article is also published on the 政采云 technical team's WeChat official account.
