深入浅出Zookeeper源码(七):Leader选举

1,369 阅读14分钟
版本日期备注
1.02020.6.14文章首发
1.12020.8.16排版优化
1.22020.8.21优化措辞
1.32021.6.23标题从深入浅出Zookeeper(七):Leader选举改变为深入浅出Zookeeper源码(七):Leader选举

1. 前言

对于一个分布式集群来说,保证数据写入一致性最简单的方式就是依靠一个节点来调度和管理其他节点。在分布式系统中我们一般称其为Leader。

为什么是最简单的方式呢?我们想象一下,当我们写数据到Leader时,Leader写入自己的一份数据后,可能会做副本到Follower,那么拷贝的数量、及所在的位置都由该Leader来控制。但如果是多Leader调度,就要涉及到数据分区,请求负载均衡等问题了。

今天,笔者就和大家一起来看看ZK的选举流程。

2. 选举算法剖析-ZAB

这是一种典型的多数派算法,听名字就知道是为ZK而生了(Zookeeper Atomic Broadcast)。其Leader的选举主要关心节点的ID和数据ID,这其中数据ID越大,则表示数据越新,优先成为主。

2.1 选举时机

常见由两种场景触发选举,无论如何,至少得有两台ZK机器。

  • Startup触发:我们知道,每台zk都需要配置不同的myid,而当刚开始时,zxid必定都为0。这便意味着会挑选myid最大的zk节点作为leader。
  • Leader失联触发:zk节点每经过一次事务处理,都会更新zxid。那便意味着数据越新,zxid会越大。在这个选举过程中,会挑选出zxid的节点作为leader。

2.2 Zk选举过程剖析(带源码分析)

核心方法为org.apache.zookeeper.server.quorum.QuorumPeer.startLeaderElectionorg.apache.zookeeper.server.quorum.QuorumPeer.run,我们的源码分析也基于此展开。

下面的源码分析基于3.5.7版本。

2.2.1 Startup

我们得从QuorumPeerMain来看,因为这是启动的入口:

/**
 *
 * <h2>Configuration file</h2>
 *
 * When the main() method of this class is used to start the program, the first
 * argument is used as a path to the config file, which will be used to obtain
 * configuration information. This file is a Properties file, so keys and
 * values are separated by equals (=) and the key/value pairs are separated
 * by new lines. The following is a general summary of keys used in the
 * configuration file. For full details on this see the documentation in
 * docs/index.html
 * <ol>
 * <li>dataDir - The directory where the ZooKeeper data is stored.</li>
 * <li>dataLogDir - The directory where the ZooKeeper transaction log is stored.</li>
 * <li>clientPort - The port used to communicate with clients.</li>
 * <li>tickTime - The duration of a tick in milliseconds. This is the basic
 * unit of time in ZooKeeper.</li>
 * <li>initLimit - The maximum number of ticks that a follower will wait to
 * initially synchronize with a leader.</li>
 * <li>syncLimit - The maximum number of ticks that a follower will wait for a
 * message (including heartbeats) from the leader.</li>
 * <li>server.<i>id</i> - This is the host:port[:port] that the server with the
 * given id will use for the quorum protocol.</li>
 * </ol>
 * In addition to the config file. There is a file in the data directory called
 * "myid" that contains the server id as an ASCII decimal value.
 *
 */
@InterfaceAudience.Public
public class QuorumPeerMain {
    private static final Logger LOG = LoggerFactory.getLogger(QuorumPeerMain.class);

    private static final String USAGE = "Usage: QuorumPeerMain configfile";

    protected QuorumPeer quorumPeer;

    /**
     * To start the replicated server specify the configuration file name on
     * the command line.
     * @param args path to the configfile
     */
    public static void main(String[] args) {
        QuorumPeerMain main = new QuorumPeerMain();
        try {
            main.initializeAndRun(args);
        } catch (IllegalArgumentException e) {
            LOG.error("Invalid arguments, exiting abnormally", e);
            LOG.info(USAGE);
            System.err.println(USAGE);
            System.exit(2);
        } catch (ConfigException e) {
            LOG.error("Invalid config, exiting abnormally", e);
            System.err.println("Invalid config, exiting abnormally");
            System.exit(2);
        } catch (DatadirException e) {
            LOG.error("Unable to access datadir, exiting abnormally", e);
            System.err.println("Unable to access datadir, exiting abnormally");
            System.exit(3);
        } catch (AdminServerException e) {
            LOG.error("Unable to start AdminServer, exiting abnormally", e);
            System.err.println("Unable to start AdminServer, exiting abnormally");
            System.exit(4);
        } catch (Exception e) {
            LOG.error("Unexpected exception, exiting abnormally", e);
            System.exit(1);
        }
        LOG.info("Exiting normally");
        System.exit(0);
    }

    protected void initializeAndRun(String[] args)
        throws ConfigException, IOException, AdminServerException
    {
        QuorumPeerConfig config = new QuorumPeerConfig();
        if (args.length == 1) {
            config.parse(args[0]);
        }

        // Start and schedule the the purge task
        DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
                .getDataDir(), config.getDataLogDir(), config
                .getSnapRetainCount(), config.getPurgeInterval());
        purgeMgr.start();

        if (args.length == 1 && config.isDistributed()) {
            runFromConfig(config);
        } else {
            LOG.warn("Either no config or no quorum defined in config, running "
                    + " in standalone mode");
            // there is only server in the quorum -- run as standalone
            ZooKeeperServerMain.main(args);
        }
    }

    public void runFromConfig(QuorumPeerConfig config)
            throws IOException, AdminServerException
    {
      try {
          ManagedUtil.registerLog4jMBeans();
      } catch (JMException e) {
          LOG.warn("Unable to register log4j JMX control", e);
      }

      LOG.info("Starting quorum peer");
      try {
          ServerCnxnFactory cnxnFactory = null;
          ServerCnxnFactory secureCnxnFactory = null;

          if (config.getClientPortAddress() != null) {
              cnxnFactory = ServerCnxnFactory.createFactory();
              cnxnFactory.configure(config.getClientPortAddress(),
                      config.getMaxClientCnxns(),
                      false);
          }

          if (config.getSecureClientPortAddress() != null) {
              secureCnxnFactory = ServerCnxnFactory.createFactory();
              secureCnxnFactory.configure(config.getSecureClientPortAddress(),
                      config.getMaxClientCnxns(),
                      true);
          }

          quorumPeer = getQuorumPeer();
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                      config.getDataLogDir(),
                      config.getDataDir()));
          quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
          quorumPeer.enableLocalSessionsUpgrading(
              config.isLocalSessionsUpgradingEnabled());
          //quorumPeer.setQuorumPeers(config.getAllMembers());
          quorumPeer.setElectionType(config.getElectionAlg());
          quorumPeer.setMyid(config.getServerId());
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setConfigFileName(config.getConfigFilename());
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
          if (config.getLastSeenQuorumVerifier()!=null) {
              quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
          }
          quorumPeer.initConfigInZKDatabase();
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
          quorumPeer.setSslQuorum(config.isSslQuorum());
          quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
          if (config.sslQuorumReloadCertFiles) {
              quorumPeer.getX509Util().enableCertFileReloading();
          }

          // sets quorum sasl authentication configurations
          quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
          if(quorumPeer.isQuorumSaslAuthEnabled()){
              quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
              quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
              quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
              quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
              quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
          }
          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
          quorumPeer.initialize();
          
          quorumPeer.start();
          quorumPeer.join();
      } catch (InterruptedException e) {
          // warn, but generally this is ok
          LOG.warn("Quorum Peer interrupted", e);
      }
    }

    // @VisibleForTesting
    protected QuorumPeer getQuorumPeer() throws SaslException {
        return new QuorumPeer();
    }
}

我们从QuorumPeerMain.main() -> main.initializeAndRun(args) -> runFromConfig -> quorumPeer.start(),继续往下看QuorumPeer.java(这个类用于管理选举相关的逻辑):

    @Override
    public synchronized void start() {
        if (!getView().containsKey(myid)) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
         }
        loadDataBase();
        startServerCnxnFactory();
        try {
            adminServer.start();
        } catch (AdminServerException e) {
            LOG.warn("Problem starting AdminServer", e);
            System.out.println(e);
        }
        startLeaderElection();
        super.start();
    }

现在,我们来到核心代码startLeaderElection

    synchronized public void startLeaderElection() {
       try {
           if (getPeerState() == ServerState.LOOKING) {
               currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
           }
       } catch(IOException e) {
           RuntimeException re = new RuntimeException(e.getMessage());
           re.setStackTrace(e.getStackTrace());
           throw re;
       }

       // if (!getView().containsKey(myid)) {
      //      throw new RuntimeException("My id " + myid + " not in the peer list");
        //}
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(getQuorumAddress().getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        this.electionAlg = createElectionAlgorithm(electionType);
    }

逻辑非常的简单,如果处于Looking状态(服务器刚启动时默认为Looking),那么就发起选举的投票,并确认选举算法(从3.4.0开始,只有FastLeaderElection选举算法了),并将其发送出去。由于代码篇幅较大,这里不再粘出,感兴趣的读者可以自行阅读FastLeaderElection.Messenger.WorkerReceiver.run。其本质上就是一个线程,从存储选票的队列中取出vote,并发送。  在这里普及一下服务器状态:

  1. LOOKING:寻找Leader状态。当服务器处于该状态时,它认为当前集群中没有Leader。
  2. FOLLOWING:跟随者状态,表明当前服务器角色Follower。
  3. LEADING:领导者状态,表明当前服务器角色是Leader。
  4. OBSERVING:观察者状态,表明当前服务器是Observer。

接下来看QuorumPeer的相关核心代码:

    @Override
    public void run() {
        updateThreadName();

        LOG.debug("Starting quorum peer");
        try {
            jmxQuorumBean = new QuorumBean(this);
            MBeanRegistry.getInstance().register(jmxQuorumBean, null);
            for(QuorumServer s: getView().values()){
                ZKMBeanInfo p;
                if (getId() == s.id) {
                    p = jmxLocalPeerBean = new LocalPeerBean(this);
                    try {
                        MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                        jmxLocalPeerBean = null;
                    }
                } else {
                    RemotePeerBean rBean = new RemotePeerBean(this, s);
                    try {
                        MBeanRegistry.getInstance().register(rBean, jmxQuorumBean);
                        jmxRemotePeerBean.put(s.id, rBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                    }
                }
            }
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            jmxQuorumBean = null;
        }

        try {
            /*
             * Main loop
             */
            while (running) {
                switch (getPeerState()) {
                case LOOKING:
                    LOG.info("LOOKING");

                    if (Boolean.getBoolean("readonlymode.enabled")) {
                        LOG.info("Attempting to start ReadOnlyZooKeeperServer");

                        // Create read-only server but don't start it immediately
                        final ReadOnlyZooKeeperServer roZk =
                            new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);
    
                        // Instead of starting roZk immediately, wait some grace
                        // period before we decide we're partitioned.
                        //
                        // Thread is used here because otherwise it would require
                        // changes in each of election strategy classes which is
                        // unnecessary code coupling.
                        Thread roZkMgr = new Thread() {
                            public void run() {
                                try {
                                    // lower-bound grace period to 2 secs
                                    sleep(Math.max(2000, tickTime));
                                    if (ServerState.LOOKING.equals(getPeerState())) {
                                        roZk.startup();
                                    }
                                } catch (InterruptedException e) {
                                    LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                                } catch (Exception e) {
                                    LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                                }
                            }
                        };
                        try {
                            roZkMgr.start();
                            reconfigFlagClear();
                            if (shuttingDownLE) {
                                shuttingDownLE = false;
                                startLeaderElection();
                            }
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        } finally {
                            // If the thread is in the the grace period, interrupt
                            // to come out of waiting.
                            roZkMgr.interrupt();
                            roZk.shutdown();
                        }
                    } else {
                        try {
                           reconfigFlagClear();
                            if (shuttingDownLE) {
                               shuttingDownLE = false;
                               startLeaderElection();
                               }
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }                        
                    }
                    break;

在这里仅仅截取了Looking的相关逻辑,上半段的if主要处理只读服务——其用于handle只读client。else逻辑则是常见的情况,但是从代码块:

             reconfigFlagClear();
                            if (shuttingDownLE) {
                               shuttingDownLE = false;
                               startLeaderElection();
                               }
                            setCurrentVote(makeLEStrategy().lookForLeader());

其实区别不大。接着来看lookForLeader,为了篇幅,我们只截取Looking相关的代码:

    /**
     * Starts a new round of leader election. Whenever our QuorumPeer
     * changes its state to LOOKING, this method is invoked, and it
     * sends notifications to all other peers.
     */
    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = Time.currentElapsedTime();
        }
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);


注释说的很清楚,这个方法会开启新的一轮选举:当我们的服务器状态变为Looking,这个方法会被调用,被通知集群其他需要参与选举的服务器。那么在这段逻辑中,recvqueue会存放着相关的选举通知信息,取出一个。接下来有两个逻辑分支:

  1. 为空。想办法通知其他服务器。
  2. 有效的投票(即大家的选举轮次都是统一论次),那么便进行选票PK。

我们来看totalOrderPredicate这个方法:

    /**
     * Check if a pair (server id, zxid) succeeds our
     * current vote.
     *
     * @param id    Server identifier
     * @param zxid  Last zxid observed by the issuer of this vote
     */
    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */

        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }

理一下逻辑:

  1. 如果新的轮次大于内部投票轮次,则需要进行投票变更
  2. 如果选举轮次一致,并外部投票的ZXID大于内部投票的,则需要变更
  3. 如果选举轮次一致,并外部投票的SID大于内部投票的,则需要变更

经过这个逻辑,便可以确定外部投票优于内部投票——即更适合成为Leader。这时便会把外部选票信息来覆盖内部投票,并发送出去:

                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();

接下来就会判断集群中是否有过半的服务器认可该投票。

    /**
     * Termination predicate. Given a set of votes, determines if have
     * sufficient to declare the end of the election round.
     * 
     * @param votes
     *            Set of votes
     * @param vote
     *            Identifier of the vote received last
     */
    protected boolean termPredicate(Map<Long, Vote> votes, Vote vote) {
        SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
        voteSet.addQuorumVerifier(self.getQuorumVerifier());
        if (self.getLastSeenQuorumVerifier() != null
                && self.getLastSeenQuorumVerifier().getVersion() > self
                        .getQuorumVerifier().getVersion()) {
            voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
        }

        /*
         * First make the views consistent. Sometimes peers will have different
         * zxids for a server depending on timing.
         */
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())) {
                voteSet.addAck(entry.getKey());
            }
        }

        return voteSet.hasAllQuorums(); //是否超过一半
    }

否则的话会继续收集选票。

接下来便是更新服务器状态。

                         /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, logicalclock.get(), 
                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }

2.2.2 Leader失联

上文我们提到了QuorumPeer.java,里面有个main loop,不同的角色会在这个loop下做自己的事。直到退出。在这里,我们以Follower为例,进行分析:

                case FOLLOWING:
                    try {
                       LOG.info("FOLLOWING");
                        setFollower(makeFollower(logFactory));
                        follower.followLeader();
                    } catch (Exception e) {
                       LOG.warn("Unexpected exception",e);
                    } finally {
                       follower.shutdown();
                       setFollower(null);
                       updateServerState();
                    }
                    break;

follower.followLeader()

    /**
     * the main method called by the follower to follow the leader
     *
     * @throws InterruptedException
     */
    void followLeader() throws InterruptedException {
        self.end_fle = Time.currentElapsedTime();
        long electionTimeTaken = self.end_fle - self.start_fle;
        self.setElectionTimeTaken(electionTimeTaken);
        LOG.info("FOLLOWING - LEADER ELECTION TOOK - {} {}", electionTimeTaken,
                QuorumPeer.FLE_TIME_UNIT);
        self.start_fle = 0;
        self.end_fle = 0;
        fzk.registerJMX(new FollowerBean(this, zk), self.jmxLocalPeerBean);
        try {
            QuorumServer leaderServer = findLeader();            
            try {
                connectToLeader(leaderServer.addr, leaderServer.hostname);
                long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
                if (self.isReconfigStateChange())
                   throw new Exception("learned about role change");
                //check to see if the leader zxid is lower than ours
                //this should never happen but is just a safety check
                long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
                if (newEpoch < self.getAcceptedEpoch()) {
                    LOG.error("Proposed leader epoch " + ZxidUtils.zxidToString(newEpochZxid)
                            + " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
                    throw new IOException("Error: Epoch of leader is lower");
                }
                syncWithLeader(newEpochZxid);                
                QuorumPacket qp = new QuorumPacket();
                while (this.isRunning()) {
                    readPacket(qp);
                    processPacket(qp);
                }
            } catch (Exception e) {
                LOG.warn("Exception when following the leader", e);
                try {
                    sock.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
    
                // clear pending revalidations
                pendingRevalidations.clear();
            }
        } finally {
            zk.unregisterJMX((Learner)this);
        }
    }

跳往核心方法processPacket

   /**
     * Examine the packet received in qp and dispatch based on its contents.
     * @param qp
     * @throws IOException
     */
    protected void processPacket(QuorumPacket qp) throws Exception{
        switch (qp.getType()) {
        case Leader.PING:            
            ping(qp);            
            break;
        case Leader.PROPOSAL:           
            TxnHeader hdr = new TxnHeader();
            Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr);
            if (hdr.getZxid() != lastQueued + 1) {
                LOG.warn("Got zxid 0x"
                        + Long.toHexString(hdr.getZxid())
                        + " expected 0x"
                        + Long.toHexString(lastQueued + 1));
            }
            lastQueued = hdr.getZxid();
            
            if (hdr.getType() == OpCode.reconfig){
               SetDataTxn setDataTxn = (SetDataTxn) txn;       
               QuorumVerifier qv = self.configFromString(new String(setDataTxn.getData()));
               self.setLastSeenQuorumVerifier(qv, true);                               
            }
            
            fzk.logRequest(hdr, txn);
            break;
        case Leader.COMMIT:
            fzk.commit(qp.getZxid());
            break;
            
        case Leader.COMMITANDACTIVATE:
           // get the new configuration from the request
           Request request = fzk.pendingTxns.element();
           SetDataTxn setDataTxn = (SetDataTxn) request.getTxn();                                                                                                      
           QuorumVerifier qv = self.configFromString(new String(setDataTxn.getData()));                                
 
           // get new designated leader from (current) leader's message
           ByteBuffer buffer = ByteBuffer.wrap(qp.getData());    
           long suggestedLeaderId = buffer.getLong();
            boolean majorChange = 
                   self.processReconfig(qv, suggestedLeaderId, qp.getZxid(), true);
           // commit (writes the new config to ZK tree (/zookeeper/config)                     
           fzk.commit(qp.getZxid());
            if (majorChange) {
               throw new Exception("changes proposed in reconfig");
           }
           break;
        case Leader.UPTODATE:
            LOG.error("Received an UPTODATE message after Follower started");
            break;
        case Leader.REVALIDATE:
            revalidate(qp);
            break;
        case Leader.SYNC:
            fzk.sync();
            break;
        default:
            LOG.warn("Unknown packet type: {}", LearnerHandler.packetToString(qp));
            break;
        }
    }

case COMMITANDACTIVATE中,我们可以看到当其收到leader改变相关的消息时,就会抛出异常。接下来它自己就会变成LOOKING状态,开始选举。

那么如何确定leader不可用呢?答案是通过心跳指令。在一定时间内如果leader的心跳没有过来,那么则认为其已经不可用。

LeanerHandler.run里的case Leader.PING

                case Leader.PING:
                    // Process the touches
                    ByteArrayInputStream bis = new ByteArrayInputStream(qp
                            .getData());
                    DataInputStream dis = new DataInputStream(bis);
                    while (dis.available() > 0) {
                        long sess = dis.readLong();
                        int to = dis.readInt();
                        leader.zk.touch(sess, to);
                    }
                    break;

3. 其他常见选举算法

首先,我们要知道。选举算法的本质是 共识算法,而绝大多数共识算法就是为了解决分布式环境下数据一致性而诞生的。 而zk里所谓leader、follower之类的,无非也是个状态,基于zk这个语义下(上下文里)大家都认为一个leader是leader,才是有效的共识。

常见的共识算法都有哪些呢?现阶段的共识算法主要可以分成三大类:公链,联盟链和私链。下面描述这三种类别的特征:

  • 私链:私链的共识算法即区块链这个概念还没普及时的传统分布式系统里的共识算法,比如 zookeeper 的 zab 协议,就是类 paxos 算法的一种。私链的适用环境一般是不考虑集群中存在作恶节点,只考虑因为系统或者网络原因导致的故障节点。
  • 联盟链:联盟链中,经典的代表项目是 Hyperledger 组织下的 Fabric 项目, Fabric0.6 版本使用的就是 pbft 算法。联盟链的适用环境除了需要考虑集群中存在故障节点,还需要考虑集群中存在作恶节点。对于联盟链,每个新加入的节点都是需要验证和审核的。
  • 公链:公链不断需要考虑网络中存在故障节点,还需要考虑作恶节点,这一点和联盟链是类似的。和联盟链最大的区别就是,公链中的节点可以很自由的加入或者退出,不需要严格的验证和审核。

引用自:zhuanlan.zhihu.com/p/35847127;…

基于篇幅,接下来简单介绍下两个较为典型的共识算法。

3.1 Raft

Raft 算法是典型的多数派投票选举算法,其选举机制与我们日常生活中的民主投票机制类似,核心思想是“少数服从多数”。也就是说,Raft 算法中,获得投票最多的节点成为主。

采用 Raft 算法选举,集群节点的角色有 3 种:

  • Leader,即主节点,同一时刻只有一个 Leader,负责协调和管理其他节点;
  • Candidate,即候选者,每一个节点都可以成为 Candidate,节点在该角色下才可以被选为新的 Leader;
  • Follower,Leader 的跟随者,不可以发起选举。

Raft 选举的流程,可以分为以下几步:

  1. 初始化时,所有节点均为 Follower 状态。
  2. 开始选主时,所有节点的状态由 Follower 转化为 Candidate,并向其他节点发送选举请求。
  3. 其他节点根据接收到的选举请求的先后顺序,回复是否同意成为主。这里需要注意的是,在每一轮选举中,一个节点只能投出一张票。
  4. 若发起选举请求的节点获得超过一半的投票,则成为主节点,其状态转化为 Leader,其他节点的状态则由 Candidate 降为 Follower。Leader 节点与 Follower 节点之间会定期发送心跳包,以检测主节点是否活着。
  5. 当 Leader 节点的任期到了,即发现其他服务器开始下一轮选主周期时(或主节点挂了),Leader 节点的状态由 Leader 降级为 Follower,进入新一轮选主。

这个算法比起ZAB,较易实现,但由于消息通信量大,相比于ZAB,更适用于中小的场景。

3.2 Pow

PoW 算法,是以每个节点或服务器的计算能力(即“算力”)来竞争记账权的机制,因此是一种使用工作量证明机制的共识算法。也就是说,谁的算力强(解题快),谁获得记账权的可能性就越大。

比如发生一次交易,同时有三个节点(A、B、C)都收到了这个记账请求。A节点已经算出来了,那么就会通知BC节点进行验证——这是一种椭圆曲线加密算法,解题的速度会比验证的速度慢很多。当所有节点验证后,这个记账就记下来了。

听起来很公平。但PoW 机制每次达成共识需要全网共同参与运算,增加了每个节点的计算量,并且如果题目过难,会导致计算时间长、资源消耗多 ;而如果题目过于简单,会导致大量节点同时获得记账权,冲突多。这些问题,都会增加达成共识的时间。

4. 小结

在本文,我们先提到了zookeeper的leader选举,大致流程如下:

4.1 服务器启动时选举

  1. 每个Server会发出一个投票
  2. 接受来每个Server的投票
  3. 处理投票(对比zxid和myid)
  4. 统计投票,直到超过半数的机器收到相同的投票信息
  5. 更改服务器角色

4.2 服务器运行期间选举

服务器启动时选举非常的像,无非就是多了一个状态变更——当Leader挂了,余下的Follower都会将自己的服务器状态变更为LOOKING,然后进入选举流程。

4.3 一致性算法和共识算法

我们还提到了一致性算法和共识算法的概念,那么一致性与共识的区别是什么呢?在平常使用中,我们通常会混淆一致性和共识这两个概念,不妨在这儿说清:

  • 一致性:分布式系统中的多个节点之间,给定一系列的操作,在约定协议的保障下,对外界呈现的数据或状态是一致的。
  • 共识:分布式系统中多个节点之间,彼此对某个状态达成一致结果的过程。

即:一致性强调的是结果,共识强调的是达成一致的过程,共识算法是保障系统满足不同程度一致性的核心技术。

因此,结合上篇文章和这篇文章,ZAB应该是一种共识算法。