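Basic concepts

ZAB is an atomic broadcast protocol with crash recovery, designed specifically for ZooKeeper. What counts as a crash in ZooKeeper? It is the Leader of the zk cluster going down, after which a new Leader has to be elected. Leader election is also needed while the cluster is starting up.

While the cluster is starting up, or after the Leader has lost its connection to more than half of the hosts, the cluster enters recovery mode, and the most important phase of recovery mode is Leader election.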

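myid

This is the unique identifier of a server in the zk cluster. It is called myid, and each server is given its myid value when the cluster is configured. For example, with three zk servers the myid values are 1, 2 and 3.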

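Logical clock

The logical clock (logicalclock in the code) is an integer that numbers the election rounds: a server increments it every time it starts a new election and sends it in its notifications as electionEpoch. It is related to, but not the same field as, the epoch stored in a zxid: that epoch identifies a Leader's term and travels in notifications as the separate field peerEpoch.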
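To make the epoch stored in a zxid concrete, here is a minimal sketch. It assumes the usual ZooKeeper layout, in which the high 32 bits of the 64-bit zxid hold the epoch and the low 32 bits hold a counter within that epoch. The class ZxidSketch and its method names are illustrative only and are not ZooKeeper API.

```java
// Illustrative helper (not part of ZooKeeper): splits a 64-bit zxid into
// its epoch (assumed high 32 bits) and counter (assumed low 32 bits).
final class ZxidSketch {
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 3);
        System.out.println("zxid=0x" + Long.toHexString(zxid)); // zxid=0x500000003
        System.out.println("epoch=" + epochOf(zxid));           // epoch=5
        System.out.println("counter=" + counterOf(zxid));       // counter=3
    }
}
```

Under this layout, comparing two zxids as plain long values orders them by epoch first and by counter second.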

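zk states

Each host in the zk cluster is in a different state at different stages. A host can be in one of four states.

LOOKING: the election state (a Leader is being elected).
FOLLOWING: the follower state. Once the election has produced a Leader, a host with voting rights that was not elected changes from LOOKING to FOLLOWING.
OBSERVING: the observer state. Once the election has produced a Leader, a host without voting rights changes from LOOKING to OBSERVING.
LEADING: the leader state. Once the election is over, the elected host changes from LOOKING to LEADING.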

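Election algorithm

The Leader election that runs during cluster startup and the one that runs after the Leader is lost differ slightly, but they are essentially the same.

Leader election during cluster startup

A Leader election needs at least two hosts. Here we use a cluster of three hosts as the example, as shown in the figure below:

In the cluster initialization phase, when the first server, Server1, starts, it votes for itself and then publishes its vote. A vote contains the myid and zxid of the recommended server and is written (myid, zxid). At the beginning every host nominates itself, so Server1's vote is (1,0). Because the other machines have not started yet, Server1 receives no response and stays in the LOOKING state, which is a non-serving state: a single zk host cannot provide service.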

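When the second server, Server2, starts, the two machines can communicate with each other and each tries to find a Leader. The election proceeds as follows:

1. Each server casts a vote. Server1's vote is (1,0) and Server2's vote is (2,0); each sends its vote to the other machines in the cluster.
2. Receive the votes from the other servers. When a server receives a vote, it first checks that the vote is valid, for example that it belongs to the current round and comes from a server in the LOOKING state.
3. Process the votes. For every vote received, the server compares it against its own vote (the "PK"). The rules are:
- Check zxid first: the server with the larger zxid is preferred as Leader.
- If the zxids are equal, compare myid: the server with the larger myid becomes the Leader.
- Server1 holds the vote (1,0) and receives Server2's vote (2,0). It first compares the zxids, which are both 0, and then the myids. Server2's myid is larger, so Server1 changes its vote to (2,0) and votes again. Server2 does not need to change its vote; it simply sends its previous vote to all hosts in the cluster once more.
4. Count the votes. After each round of voting, every server tallies the votes and checks whether more than half of the machines have cast the same vote. Server1 and Server2 both find that two hosts have voted (2,0), which is more than half of the three hosts, so the new Leader is considered elected: Server2.
5. Change the server state. Once the Leader is determined, each server updates its state: a Follower changes to FOLLOWING and the Leader changes to LEADING.
6. Add a host. Server3 starts after the new Leader has been elected and wants to start a new round of election. Because the hosts in the cluster are no longer LOOKING but are serving normally in their own roles, Server3 can only join the cluster as a Follower.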
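The startup election above can be sketched in a few lines. This is a simplified model under stated assumptions, not ZooKeeper code: the names StartupElectionSketch, Ballot, beats and hasMajority are hypothetical, a ballot holds only (myid, zxid) as in the text, and the network is replaced by direct assignments.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the (myid, zxid) comparison and the majority count.
final class StartupElectionSketch {
    // A ballot is written (myid, zxid), as in the text.
    static final class Ballot {
        final long myid;
        final long zxid;

        Ballot(long myid, long zxid) {
            this.myid = myid;
            this.zxid = zxid;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Ballot)) return false;
            Ballot other = (Ballot) o;
            return myid == other.myid && zxid == other.zxid;
        }

        @Override public int hashCode() {
            return 31 * Long.hashCode(myid) + Long.hashCode(zxid);
        }

        @Override public String toString() {
            return "(" + myid + "," + zxid + ")";
        }
    }

    // PK rule: the larger zxid wins; on a tie the larger myid wins.
    static boolean beats(Ballot challenger, Ballot current) {
        return challenger.zxid > current.zxid
                || (challenger.zxid == current.zxid && challenger.myid > current.myid);
    }

    // True when strictly more than half of the cluster cast the given ballot.
    static boolean hasMajority(Map<Long, Ballot> castBallots, Ballot candidate, int clusterSize) {
        int count = 0;
        for (Ballot b : castBallots.values()) {
            if (candidate.equals(b)) count++;
        }
        return count > clusterSize / 2;
    }

    public static void main(String[] args) {
        int clusterSize = 3;
        Ballot server1 = new Ballot(1, 0);
        Ballot server2 = new Ballot(2, 0);

        // Server1 receives (2,0): same zxid, larger myid, so it adopts that ballot.
        if (beats(server2, server1)) server1 = server2;
        System.out.println("Server1 now votes " + server1); // Server1 now votes (2,0)

        Map<Long, Ballot> castBallots = new HashMap<>();
        castBallots.put(1L, server1);
        castBallots.put(2L, server2);
        System.out.println(hasMajority(castBallots, server2, clusterSize)); // true
    }
}
```

With two of three hosts agreeing on (2,0), the count 2 exceeds 3 / 2 = 1, so Server2 is elected even though Server3 has not started.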

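Election after the Leader is lost

While ZooKeeper is running, the Leader and the non-Leader servers each do their own job. A non-Leader server going down, or a new server joining, does not affect the Leader. If the Leader itself goes down, however, the whole cluster stops serving clients and enters a new round of Leader election, which is essentially the same as the election during startup.

Suppose three servers, Server1, Server2 and Server3, are running and the current Leader is Server2. If Server2 goes down at some point, a new round of Leader election starts, as shown in the figure below.

The election proceeds as follows:

1. Change state. After the Leader goes down, the remaining non-Observer servers change their state from FOLLOWING to LOOKING and enter the Leader election.
2. Each server casts a vote, again voting for itself first. At runtime, however, the zxid on each server may differ. Assume Server1's zxid is 111 and Server3's zxid is 333. In the first round Server1 and Server3 both vote for themselves, producing the votes (1,111) and (3,333), and each sends its vote to all machines in the cluster.
3. Receive the votes from the other servers. As in the startup election, each server first checks that a received vote is valid, for example that it belongs to the current round and comes from a server in the LOOKING state.
4. Process the votes. As in the startup election, each vote is compared by zxid first and by myid if the zxids are equal. Because the zxid in Server3's vote is larger than Server1's, Server1 changes its vote to (3,333). Server3 does not need to change its vote; it simply sends its previous vote to all hosts in the cluster once more.
5. Count the votes. Server1 and Server3 both tally the votes they have received and find that two of the three hosts voted (3,333), which is more than half, so the new Leader is considered elected: Server3.
6. Change the server state. Once the Leader is determined, each server updates its state: Server1 changes to FOLLOWING and Server3 changes to LEADING.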

Source code analysis

The FastLeaderElection class

The class Javadoc of FastLeaderElection reads as follows, with notes added:

/**
 * Implementation of leader election using TCP. It uses an object of the class
 * QuorumCnxManager to manage connections. Otherwise, the algorithm
 * is push-based as with the other UDP implementations.
 * Note: QuorumCnxManager manages the connections to the other servers. Apart
 * from running over TCP, the algorithm is push-based, just like the
 * UDP-based implementations.
 *
 * There are a few parameters that can be tuned to change its behavior. First,
 * finalizeWait determines the amount of time to wait until deciding upon a leader.
 * This is part of the leader election algorithm.
 * Note: finalizeWait is a constant in the code.
 */

public class FastLeaderElection implements Election {
}

Member variables

The finalizeWait variable

/**
 * Determine how much time a process has to wait
 * once it believes that it has reached the end of
 * leader election.
 *
 * Note: a notification always carries the sender's own vote, and every
 * notification sent is expected to be answered with a notification in return.
 */
// In the code this constant bounds how long the current server waits, after
// sending its notification, for notifications from other servers (which act as
// the replies). It is the initial value of that timeout.
final static int finalizeWait = 200;

maxNotificationInterval: the notification timeout starts at finalizeWait and grows after each timeout, but it can never exceed maxNotificationInterval.

/**
 * Upper bound on the amount of time between two consecutive
 * notification checks. This impacts the amount of time to get
 * the system up again after long partitions. Currently 60 seconds.
 */
// If the current server has sent its vote notification to some server, that
// server is expected to send back its own vote notification within the current
// time limit. If nothing has arrived when the limit is reached, the connection
// is assumed to have a problem. This constant caps that limit.
final static int maxNotificationInterval = 60000;

manager: the connection manager. FastLeaderElection uses TCP for communication between peers, and QuorumCnxManager manages those connections.

/**
 * Connection manager. Fast leader election uses TCP for
 * communication between peers, and QuorumCnxManager manages
 * such connections.
 */
QuorumCnxManager manager;

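The other member variables are:

// The server currently taking part in the election
QuorumPeer self;
Messenger messenger; // not needed for this walkthrough
// Logical clock: the number of the current election round
AtomicLong logicalclock = new AtomicLong(); /* Election instance */
// The current server's own recommendation
long proposedLeader; // the Leader this server recommends
long proposedZxid;   // the zxid of the recommended Leader
long proposedEpoch;  // the epoch of the recommended Leader

self is the server currently taking part in the election;
logicalclock is the logical clock, i.e. the number of the current election round;
proposedLeader is the Leader this server recommends;
proposedZxid is the zxid of the recommended Leader;
proposedEpoch is the epoch of the recommended Leader.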

The inner class Notification

The class Javadoc of the inner class Notification reads:

/**
 * Notifications are messages that let other peers know that
 * a given peer has changed its vote, either because it has
 * joined leader election or because it learned of
 * another peer with higher zxid or same zxid and higher
 * server id
 *
 * Note: a server that joins a new round of election first votes for itself;
 * the server id mentioned here is the myid.
 */
static public class Notification {
}

A Notification is the message that tells the other servers that the current server has changed its vote. Why would it change its vote? Either because it has joined a Leader election (in a new round it first votes for itself), or because it has learned of another server with a larger zxid, or with the same zxid but a larger server id (myid).

Its member variables are:

static public class Notification {
    /*
     * Format version, introduced in 3.4.6
     */

    public final static int CURRENTVERSION = 0x1;
    int version;

    /*
     * Proposed leader
     * ServerId (myid) of the Leader recommended by this vote
     */
    long leader;

    /*
     * zxid of the proposed leader
     */
    long zxid;

    /*
     * Epoch
     * Election round (logical clock) of the sender of this vote
     */
    long electionEpoch;

    /*
     * current state of sender
     * 1. LOOKING: taking part in a Leader election
     * 2. LEADING: currently the Leader
     * 3. FOLLOWING: currently a Follower
     * 4. OBSERVING: currently an Observer
     */
    QuorumPeer.ServerState state;

    /*
     * Address of sender
     * sid of the sending host
     */
    long sid;

    /*
     * epoch of the proposed leader
     */
    long peerEpoch;
}

leader: the ServerId (myid) of the Leader recommended by this vote
zxid: the zxid of the Leader recommended by this vote
electionEpoch: the election round (logical clock) of the sender of this vote
state: the state of the sending host, one of
- LOOKING: taking part in a Leader election
- LEADING: currently the Leader
- FOLLOWING: currently a Follower
- OBSERVING: currently an Observer
sid: the sid of the sending host
peerEpoch: the epoch of the Leader recommended by this vote

The QuorumCnxManager class

The class Javadoc of QuorumCnxManager reads:

/**
 * This class implements a connection manager for leader election using TCP. It
 * maintains one connection for every pair of servers.
 * The tricky part is to guarantee that there is exactly
 * one connection for every pair of servers that are operating correctly
 * and that can communicate over the network.
 *
 * If two servers try to start a connection concurrently, then the
 * connection manager uses a very simple tie-breaking mechanism
 * to decide which connection to drop based on the IP addressed of
 * the two parties.
 *
 * For every peer, the manager maintains a queue of messages to send. If the
 * connection to any particular peer drops, then the sender thread
 * puts the message back on the list. As this implementation
 * currently uses a queue implementation to maintain messages to send to
 * another peer, we add the message to the tail of the queue, thus
 * changing the order of messages. Although this is not a problem for the
 * leader election, it could be a problem when consolidating peer
 * communication. This is to be verified, though.
 *
 * Note: the structure is actually a Map. The key is the myid of another
 * server, and the value is a queue holding copies of the messages that could
 * not be sent to that server.
 *
 * Three cases:
 * 1. The queues of all ServerIds are empty: the current host is connected to
 *    the cluster and all of its messages have been sent.
 * 2. The queues of all ServerIds are non-empty: the current host has lost its
 *    connection to the cluster.
 * 3. Only some queues are non-empty: the connection between the current host
 *    and those particular servers has a problem.
 */

public class QuorumCnxManager {
    /**
     * Message queues: one send queue per peer
     */
     final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap;
}

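The structure of the send-queue map queueSendMap is as follows. Suppose there are 2 hosts in total and the current host has serverId=1 (myid=1); the figure shows the send queues of that host (note that every host maintains such a Map):

Three cases can occur in this Map:

1. The queues of all ServerIds are empty: the current host is connected to the cluster and all of its messages have been sent.
2. The queues of all ServerIds are non-empty: the current host has lost its connection to the cluster.
3. Only some queues are non-empty: the connection between the current host and those particular servers has a problem.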
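The three cases can be told apart with a simple scan over the map. The sketch below is an illustration under assumptions: SendQueueSketch, Connectivity and classify are hypothetical names, the peers 2 and 3 are made-up example ids, and only the map type is taken from the queueSendMap declaration shown above.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Classifies a send-queue map into the three cases described in the text.
final class SendQueueSketch {
    enum Connectivity { ALL_DELIVERED, ALL_PENDING, PARTIALLY_PENDING }

    static Connectivity classify(Map<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap) {
        int empty = 0;
        int nonEmpty = 0;
        for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
            if (queue.isEmpty()) empty++; else nonEmpty++;
        }
        if (nonEmpty == 0) return Connectivity.ALL_DELIVERED;   // case 1
        if (empty == 0) return Connectivity.ALL_PENDING;        // case 2
        return Connectivity.PARTIALLY_PENDING;                  // case 3
    }

    public static void main(String[] args) {
        // Hypothetical host with myid=1 and two peers, 2 and 3.
        Map<Long, ArrayBlockingQueue<ByteBuffer>> map = new ConcurrentHashMap<>();
        map.put(2L, new ArrayBlockingQueue<ByteBuffer>(1));
        map.put(3L, new ArrayBlockingQueue<ByteBuffer>(1));
        System.out.println(classify(map)); // ALL_DELIVERED

        // A message for peer 3 is still waiting in its queue.
        map.get(3L).offer(ByteBuffer.allocate(4));
        System.out.println(classify(map)); // PARTIALLY_PENDING
    }
}
```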

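The QuorumPeer class

Because this class represents a member of the quorum, and the quorum does not include Observers, the states described in its Javadoc do not include observing.

/**
 * This class manages the quorum protocol. There are three states this server
 * can be in:
 * (1) Leader election: each server will elect a leader (proposing itself as a
 * leader initially). (This is the LOOKING state.)
 * (2) Follower: the server will synchronize with the leader and replicate any
 * transactions. (A transaction here is a finalized Proposal; the tx in txid
 * stands for transaction.)
 * (3) Leader: the server will process requests and forward them to followers.
 * A majority of followers must log the request before it can be accepted.
 * (The requests meant here are write requests: on receiving one, the Leader
 * sends a proposal to all Followers, and the write is accepted only after a
 * majority of Followers have agreed.)
 *
 * This class will setup a datagram socket that will always respond with its
 * view of the current leader. The response will take the form of: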
 *
 * <pre>
 * int xid;
 *
 * long myid;
 *
 * long leader_id;
 *
 * long leader_zxid;
 * </pre>
 *
 * The request for the current leader will consist solely
 * of an xid: int xid;
 *
 */
public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider {
}

The election method lookForLeader

The Javadoc of the method reads:

/**
 * Starts a new round of leader election. Whenever our QuorumPeer
 * changes its state to LOOKING, this method is invoked, and it
 * sends notifications to all other peers.
 */
public Vote lookForLeader() throws InterruptedException {
}

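Preparation

Record when the election started

if (self.start_fle == 0) {
    // Record the starting time of the election
    self.start_fle = Time.currentElapsedTime();
}// ---- not our concern here

Create the data structures that record the votes

try {
    // Votes of the current round received by this server from the other servers.
    // The key is the ServerId of the voter and the value is its vote,
    // so one Entry represents one cast vote.
    // Important: the vote count is taken over this collection.
    HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
    // Votes received from servers that have already left the election
    // (state FOLLOWING or LEADING) [less important here]
    HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

    // Initial timeout for waiting after a notification is sent.
    // Note that "not" stands for notification.
    int notTimeout = finalizeWait;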

Vote for itself as the initial Leader

// ----------------- Vote for itself as the initial Leader -------------------
synchronized(this){
    // Atomically increment the logical clock by one
    logicalclock.incrementAndGet();
    // Recommend itself as the Leader
    // getInitId(): the id of the current server
    // getInitLastLoggedZxid(): the largest zxid logged on the current server
    // getPeerEpoch(): the current epoch, i.e. the epoch of the Leader before this election
    updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}

LOG.info("New election. My id =  " + self.getId() +
        ", proposed zxid=0x" + Long.toHexString(proposedZxid));
// Broadcast the vote to all the other servers in the cluster
sendNotifications();

/**
 * Returns the id of the current server
 */
private long getInitId(){
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
        return self.getId();
    else return Long.MIN_VALUE;
}

/**
 * Returns the largest zxid logged on the current server
 */
private long getInitLastLoggedZxid(){
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
        return self.getLastLoggedZxid();
    else return Long.MIN_VALUE;
}

/**
 * Returns the current epoch, i.e. the epoch of the Leader before this election
 */
private long getPeerEpoch(){
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
       try {
          return self.getCurrentEpoch();
       } catch(IOException e) {
          RuntimeException re = new RuntimeException(e.getMessage());
          re.setStackTrace(e.getStackTrace());
          throw re;
       }
    else return Long.MIN_VALUE;
}

/**
 * Updates the Leader recommended by the current server
 */
synchronized void updateProposal(long leader, long zxid, long epoch){
    if(LOG.isDebugEnabled()){
        LOG.debug("Updating proposal: " + leader + " (newleader), 0x"
                + Long.toHexString(zxid) + " (newzxid), " + proposedLeader
                + " (oldleader), 0x" + Long.toHexString(proposedZxid) + " (oldzxid)");
    }
    proposedLeader = leader;
    proposedZxid = zxid;
    proposedEpoch = epoch;
}

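The code above first increments the logical clock logicalclock by 1 atomically, then nominates itself by making itself its vote, and finally calls sendNotifications() to broadcast that vote to all the other servers in the cluster.

The sendNotifications() method is as follows:

/**
 * Send notifications to all peers upon a change in our vote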
 */
private void sendNotifications() {
    for (QuorumServer server : self.getVotingView().values()) {
        // ServerID of the current element, i.e. of the recipient of the notification
        long sid = server.id;

        // Build the notification message.
        // Note that it carries a sid: the id of the server the vote is sent to.
        // notmsg stands for "notification message", not "not message".
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch);
        if(LOG.isDebugEnabled()){
            LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x"  +
                  Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get())  +
                  " (n.round), " + sid + " (recipient), " + self.getId() +
                  " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
        }
        // Put notmsg on the send queue, from which it is sent
        sendqueue.offer(notmsg);
    }
}

In the code above, getVotingView() returns the non-Observer hosts of the cluster:

/**
 * Returns all non-Observer hosts
 * getView() returns all hosts in the cluster
 */
public Map<Long,QuorumPeer.QuorumServer> getVotingView() {
    return QuorumPeer.viewToVotingView(getView());
}

/**
 * viewToVotingView() removes the hosts whose type is Observer
 */
static Map<Long,QuorumPeer.QuorumServer> viewToVotingView(Map<Long,QuorumPeer.QuorumServer> view) {
    Map<Long,QuorumPeer.QuorumServer> ret = new HashMap<Long, QuorumPeer.QuorumServer>();
    // This loop filters out the Observers
    for (QuorumServer server : view.values()) {
        if (server.type == LearnerType.PARTICIPANT) {
            ret.put(server.id, server);
        }
    }
    return ret;
}

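There are two LearnerType values:

PARTICIPANT: a host with voting rights (that is, a Follower or the Leader)
OBSERVER: a host without voting rights

Back in sendNotifications, the line sendqueue.offer(notmsg) places the notification on the send queue.

Comparing its own vote with the others' votes to find the better Leader

The while loop

After casting its own vote, the host starts processing the votes it receives from the others. Until a Leader has been elected it keeps executing the body of the while loop. The loop condition is that the current server is in the LOOKING state and the election has not been stopped (stop is false).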

while ((self.getPeerState() == ServerState.LOOKING) &&
        (!stop)){
}

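The votes this host has received from the cluster are taken from recvqueue one at a time; each vote is one notification.

/**
 * recvqueue holds all notifications received from other servers. Take one
 * out to compare with this server's vote, removing it from the queue.
 * Note that the queue is a linked list with a dummy head node: the head
 * created with the list holds no element, and the first real value is
 * head.next. So the old head node is unlinked and the element of the next
 * node is returned, as the source of recvqueue.put also shows.
 */
// Unlink the head node and take the element of the first real node
Notification n = recvqueue.poll(notTimeout,
        TimeUnit.MILLISECONDS);

Following poll(): in short, it removes the head node and then returns the element of the first real node.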
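The externally visible behaviour of the timed poll can be checked in isolation. The sketch below assumes that recvqueue is a java.util.concurrent.LinkedBlockingQueue, as the dequeue() discussion suggests; PollSketch and pollOrNull are hypothetical names, and strings stand in for Notification objects.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Shows the two outcomes of a timed poll: an element, or null on timeout.
final class PollSketch {
    // Returns the next element, or null if nothing arrives within timeoutMs.
    static String pollOrNull(LinkedBlockingQueue<String> queue, long timeoutMs) {
        try {
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        }
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<String> recvqueue = new LinkedBlockingQueue<>();
        recvqueue.add("vote from server 2");

        // The first call returns the queued element and removes it from the queue.
        System.out.println(pollOrNull(recvqueue, 200)); // vote from server 2
        // The queue is now empty, so the second call waits 200 ms and returns null.
        System.out.println(pollOrNull(recvqueue, 200)); // null
    }
}
```

The null result is exactly the n == null case that lookForLeader has to handle when no notification arrives before notTimeout expires.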

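The figure illustrates this:

After dequeue() has run

Recall the data structure in QuorumCnxManager described earlier. It stores copies of the messages that could not be sent to each host, so every value is a queue. As before, suppose there are 2 hosts in total and the current host has serverId=1 (myid=1); the figure shows the send queues of that host (every host maintains such a Map):

Next, the code checks the queues in the send-queue map. If at least one of them is empty, the host has not lost contact with the other hosts, so it simply resends its notification to all hosts.

/**
 * Why is n == null? Because the notifications from the other hosts have
 * not all arrived. There may be a network problem, or this host may not
 * have managed to send anything out at all.
 */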
if(n == null){
    if(manager.haveDelivered()){
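        // Not all notifications received: resend
        sendNotifications();
    } else {
        // If this host has not managed to send out its vote, the other hosts cannot have
        // collected all votes either, so they will resend; this host reconnects and waits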
        manager.connectAll();
    }

    /*
     * Exponential backoff
     */
    // Double the timeout, i.e. the maximum time to wait for another host's vote after sending our own, capped at maxNotificationInterval
    int tmpTimeOut = notTimeout*2;
    notTimeout = (tmpTimeOut < maxNotificationInterval?
            tmpTimeOut : maxNotificationInterval);
    LOG.info("Notification time out: " + notTimeout);
}
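The effect of the doubling can be seen by iterating the update on its own. This is a minimal sketch: BackoffSketch and nextTimeout are hypothetical names, while the two constant values are the ones quoted earlier (200 and 60000).

```java
// Reproduces the timeout update from the snippet above in isolation.
final class BackoffSketch {
    static final int FINALIZE_WAIT = 200;               // initial timeout in ms
    static final int MAX_NOTIFICATION_INTERVAL = 60000; // upper bound in ms

    // Doubles the timeout and caps it at MAX_NOTIFICATION_INTERVAL.
    static int nextTimeout(int current) {
        int doubled = current * 2;
        return doubled < MAX_NOTIFICATION_INTERVAL ? doubled : MAX_NOTIFICATION_INTERVAL;
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT;
        StringBuilder sequence = new StringBuilder().append(timeout);
        while (timeout < MAX_NOTIFICATION_INTERVAL) {
            timeout = nextTimeout(timeout);
            sequence.append(" -> ").append(timeout);
        }
        // 200 -> 400 -> 800 -> 1600 -> 3200 -> 6400 -> 12800 -> 25600 -> 51200 -> 60000
        System.out.println(sequence);
    }
}
```

After nine consecutive timeouts the wait has reached the 60-second cap and stays there.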

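Following manager.haveDelivered(): it iterates over all send queues in QuorumCnxManager, i.e. over all values of the map. As soon as one queue is empty, the host has not lost contact with the whole cluster, and sendNotifications() is called again to resend the vote to all hosts, because not all notifications have been received.

/**
 * If at least one queue has size 0, this host has not lost contact with
 * the cluster and has managed to send out its vote.
 */
boolean haveDelivered() {
    // queueSendMap is the send-queue map of QuorumCnxManager described earlier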
    for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
        LOG.debug("Queue size: " + queue.size());
        if (queue.size() == 0) {
            return true;
        }
    }

    return false;
}

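Conversely, if none of the queues is empty, the host has lost contact with the other hosts and has to reconnect to all of them. Following manager.connectAll(): it takes all keys of the QuorumCnxManager map, i.e. the ids of all hosts in the cluster, and connects to them one by one.

public void connectAll(){
    long sid;
    // Iterate over all host ids
    for(Enumeration<Long> en = queueSendMap.keys();
        en.hasMoreElements();){
        sid = en.nextElement();
        // Connect to them one by one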
        connectOne(sid);
    }      
}

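Question: after reconnecting, why does the host not send its notification to all hosts?

As long as this host has not sent out its vote, the other hosts cannot have collected all votes either, so they will resend. This host only has to wait for their votes to arrive and can then count them.

Finally, the timeout is extended.

Next, the code checks that the sender of the notification (n.sid) has voting rights, and that the Leader recommended in that notification (n.leader) has voting rights as well.

// Check that the recommender (n.sid) and the recommended (n.leader) are both valid voters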
else if(validVoter(n.sid) && validVoter(n.leader)) {
}

/**
 * Checks whether sid is contained in the view (collection) of hosts with voting rights.
 * As seen above, getVotingView() returns all hosts that have voting rights.
 */
private boolean validVoter(long sid) {
    return self.getVotingView().containsKey(sid);
}

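Next, the election epoch carried in the received notification (n.electionEpoch) is compared with the logical clock of the current host. In most cases the two are equal, meaning both sides are in the same election round.

When the election epoch in the notification is greater than the current host's logical clock: the host first sets its logical clock to the notification's election epoch and clears recvset, then compares the Leader recommended in the notification with the current host itself. If the recommended Leader is more suitable, the host adopts the notification's vote; otherwise it votes for itself.

// n's logical clock is greater than this host's, so n's is the newest
if (n.electionEpoch > logicalclock.get()) {
    // Update this host's logical clock to the newest value, i.e. n's
    logicalclock.set(n.electionEpoch);
    // Clear recvset
    recvset.clear();
    // Decide whether the leader in n or this host itself is more suitable
    if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
            getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
        // The leader in n is more suitable:
        // this host adopts n's vote as its own
        updateProposal(n.leader, n.zxid, n.peerEpoch);
    } else {
        // Vote for itself again (remember that this host's logical clock has changed)
        updateProposal(getInitId(),
                getInitLastLoggedZxid(),
                getPeerEpoch());
    }
    // Publish the updated vote again
    sendNotifications();
}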

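Question 1: why is recvset, which is used to count the votes, cleared here?

Because the votes recorded in recvset belong to the previous round.

Question 2: if the election epoch in the notification is greater than the current host's logical clock, why is it still necessary to compare who is more suitable as Leader?

Because the host's logical clock has just been set to the notification's election epoch, so both votes now belong to the same round and have to be compared on their merits.

When the election epoch in the notification is smaller than the current host's logical clock: the notification is stale, and the code breaks out of the LOOKING case of the switch.

// n's logical clock is smaller than this host's: leave the switch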
else if (n.electionEpoch < logicalclock.get()) {
    if(LOG.isDebugEnabled()){
        LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                + Long.toHexString(n.electionEpoch)
                + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
    }
    break;
}

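When the election epoch in the notification equals the current host's logical clock: the code decides whether the Leader recommended in the notification or the Leader currently recommended by this host is more suitable. If the one in the notification is more suitable, the host changes its vote and publishes it again; otherwise the vote stays as it is.

// n's logical clock equals this host's: if n's vote is better, adopt it and publish again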
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
        proposedLeader, proposedZxid, proposedEpoch)) {
    updateProposal(n.leader, n.zxid, n.peerEpoch);
    sendNotifications();
}
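The three-way decision on the election epoch can be summarised as a small function. This is only a restatement of the branches shown above under hypothetical names (RoundCheckSketch, Action, classify); it leaves out the vote updates and the resending that the real branches perform.

```java
// Maps the comparison of n.electionEpoch with the local logical clock
// to the action taken in the LOOKING case.
final class RoundCheckSketch {
    enum Action { ADOPT_NEWER_ROUND, IGNORE_STALE, COMPARE_VOTES }

    static Action classify(long notificationElectionEpoch, long localLogicalClock) {
        if (notificationElectionEpoch > localLogicalClock) {
            // Sender is in a newer round: take over its clock and clear recvset.
            return Action.ADOPT_NEWER_ROUND;
        }
        if (notificationElectionEpoch < localLogicalClock) {
            // Sender is in an older round: the notification is dropped.
            return Action.IGNORE_STALE;
        }
        // Same round: the two votes are compared with totalOrderPredicate.
        return Action.COMPARE_VOTES;
    }

    public static void main(String[] args) {
        System.out.println(classify(3, 2)); // ADOPT_NEWER_ROUND
        System.out.println(classify(1, 2)); // IGNORE_STALE
        System.out.println(classify(2, 2)); // COMPARE_VOTES
    }
}
```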

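Following totalOrderPredicate(), which decides whether the Leader recommended in the notification is more suitable than the current choice. It first checks whether the newly recommended server is an Observer (weight 0). If so, no comparison is needed and the method returns false, because an Observer has no voting rights and cannot become Leader. Otherwise it compares the epochs first; if they are equal it compares the zxids; and if those are equal too it compares the server ids.

// Decide whether the new vote is more suitable for Leader than the current vote
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    // In a zk cluster only Observers have weight 0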
    if(self.getQuorumVerifier().getWeight(newId) == 0){
        return false;
    }
    
    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */
    
    return ((newEpoch > curEpoch) || 
            ((newEpoch == curEpoch) &&
            ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
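A few concrete inputs make the ordering easier to follow. The sketch below copies only the final boolean expression into a standalone method; VoteOrderSketch and moreSuitable are hypothetical names, and the weight check on the new candidate is deliberately left out.

```java
// Standalone copy of the (epoch, zxid, id) ordering, without the weight check.
final class VoteOrderSketch {
    static boolean moreSuitable(long newId, long newZxid, long newEpoch,
                                long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId))));
    }

    public static void main(String[] args) {
        // 1) A higher epoch wins even with a smaller zxid and a smaller id.
        System.out.println(moreSuitable(1, 10, 2, 3, 99, 1));   // true
        // 2) Same epoch: the higher zxid wins, as with (3,333) against (1,111).
        System.out.println(moreSuitable(3, 333, 1, 1, 111, 1)); // true
        // 3) Same epoch and zxid: the higher id wins, as with (2,0) against (1,0).
        System.out.println(moreSuitable(2, 0, 0, 1, 0, 0));     // true
        // An identical vote is not "more suitable", so nothing is resent.
        System.out.println(moreSuitable(2, 0, 0, 2, 0, 0));     // false
    }
}
```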

The id of the host that sent notification n and its vote are put into the map so that they can be counted

recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

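Work after the election ends

First, termPredicate checks whether the host's current vote is already backed by more than half of the hosts. recvset records the votes received from the hosts, and new Vote(proposedLeader, proposedZxid,logicalclock.get(), proposedEpoch) is the current host's vote, which may be for itself or for another host.

// ---------------------- Work after the election ends --------------------

// Decide whether this round of election should end:
// is the host's own vote, new Vote(proposedLeader, proposedZxid,
//                                        logicalclock.get(), proposedEpoch),
// backed by more than half of the hosts?
if (termPredicate(recvset,
        new Vote(proposedLeader, proposedZxid,
                logicalclock.get(), proposedEpoch))) {

    // Verify if there is any change in the proposed leader
    // Check whether any of the remaining notifications recommends a more suitable Leader.
    // The while loop below has two exits:
    // 1) the loop condition: no remaining notification is better than the current vote
    // 2) break: a remaining notification that is better than the current vote was found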
    while((n = recvqueue.poll(finalizeWait,
            TimeUnit.MILLISECONDS)) != null){
        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                proposedLeader, proposedZxid, proposedEpoch)){
            recvqueue.put(n);
            break;
        }
    }

    /*
     * This predicate is true once we don't read any new
     * relevant message from the reception queue
     */
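    // n == null means the while loop above exited through its loop condition:
    // no remaining notification is better than the current vote, so we can wrap up
    if (n == null) {
        // If the elected leader is this host, its state becomes LEADING;
        // otherwise it becomes the state returned by learningState()
        // Change the state of this host
        self.setPeerState((proposedLeader == self.getId()) ?
                ServerState.LEADING: learningState());

        // Build the final vote and return it to the caller of lookForLeader()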
        Vote endVote = new Vote(proposedLeader,
                                proposedZxid,
                                logicalclock.get(),
                                proposedEpoch);
        leaveInstance(endVote);
        return endVote;
    }
}
break;

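Following termPredicate(recvset, new Vote(proposedLeader, proposedZxid,logicalclock.get(), proposedEpoch)), which decides whether the current host's vote is already backed by more than half of the hosts.

It checks whether the number of hosts in the set of supporters exceeds half of all hosts in the cluster. Note: it has to exceed half; exactly half does not count.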
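A minimal sketch of such a check follows, assuming a plain majority quorum in which every voting member counts once. TermPredicateSketch, SimpleVote and reachedQuorum are hypothetical names; the real method delegates the final decision to the configured quorum verifier.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Counts the hosts in recvset whose vote equals the proposal and applies
// the "strictly more than half" rule.
final class TermPredicateSketch {
    static final class SimpleVote {
        final long leader, zxid, electionEpoch, peerEpoch;

        SimpleVote(long leader, long zxid, long electionEpoch, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
            this.peerEpoch = peerEpoch;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof SimpleVote)) return false;
            SimpleVote v = (SimpleVote) o;
            return leader == v.leader && zxid == v.zxid
                    && electionEpoch == v.electionEpoch && peerEpoch == v.peerEpoch;
        }

        @Override public int hashCode() {
            return Long.hashCode(leader) * 31 + Long.hashCode(zxid);
        }
    }

    static boolean reachedQuorum(Map<Long, SimpleVote> recvset, SimpleVote proposal, int votingMembers) {
        Set<Long> supporters = new HashSet<>();
        for (Map.Entry<Long, SimpleVote> entry : recvset.entrySet()) {
            if (proposal.equals(entry.getValue())) supporters.add(entry.getKey());
        }
        return supporters.size() > votingMembers / 2; // exactly half is not enough
    }

    public static void main(String[] args) {
        SimpleVote proposal = new SimpleVote(3, 333, 1, 1);
        Map<Long, SimpleVote> recvset = new HashMap<>();
        recvset.put(1L, proposal);
        recvset.put(3L, proposal);
        System.out.println(reachedQuorum(recvset, proposal, 3)); // true: 2 of 3
        System.out.println(reachedQuorum(recvset, proposal, 4)); // false: 2 of 4 is only half
    }
}
```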

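If the current vote is backed by more than half of the hosts, the host then takes the remaining notifications from the queue of received votes and checks whether any of them recommends a Leader that is more suitable than its own choice. Why is this check still needed?

Because one of the remaining notifications may recommend a Leader with a larger epoch, i.e. may come from a newer round of election.

The check is done with totalOrderPredicate(). If such a notification exists, it is put back into the queue so that the next iteration of the outer while loop processes it.

If none of the remaining notifications recommends a more suitable Leader, the inner while loop exits through its loop condition, with n == null.

The code below it then runs: the Leader in the host's own vote is the most suitable one. The host checks whether that Leader is itself. If so, it changes its state to LEADING; otherwise it takes the state returned by learningState().

learningState() checks whether the current host has voting rights: if it has, the state becomes FOLLOWING, otherwise OBSERVING.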
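The state change at the end of the election can be sketched as one function. The enums and the method stateAfterElection below are illustrative stand-ins defined locally for this example; they mirror the names used in the text but are not the ZooKeeper classes.

```java
// Derives the state a host takes once the election has been decided.
final class LearningStateSketch {
    enum LearnerType { PARTICIPANT, OBSERVER }
    enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    static ServerState stateAfterElection(long proposedLeader, long myId, LearnerType type) {
        if (proposedLeader == myId) {
            return ServerState.LEADING; // this host was elected
        }
        // Otherwise the role depends on whether the host has voting rights.
        return type == LearnerType.PARTICIPANT ? ServerState.FOLLOWING : ServerState.OBSERVING;
    }

    public static void main(String[] args) {
        System.out.println(stateAfterElection(3, 3, LearnerType.PARTICIPANT)); // LEADING
        System.out.println(stateAfterElection(3, 1, LearnerType.PARTICIPANT)); // FOLLOWING
        System.out.println(stateAfterElection(3, 4, LearnerType.OBSERVER));    // OBSERVING
    }
}
```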

The complete source of the lookForLeader method is as follows:

public Vote lookForLeader() throws InterruptedException {
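    // ------------------ Preparation before the election ------------------
    // JMX, a monitoring technology for Java applications [background only]
    // Register the election bean ---- not our concern here
    try {
        // self: the current host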
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(
                self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }
    if (self.start_fle == 0) {
        // Record the starting time of the election
        self.start_fle = Time.currentElapsedTime();
    }// ---- not our concern here
    try {
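        // Votes of the current round received by this server from the other servers.
        // The key is the ServerId of the voter and the value is its vote,
        // so one Entry represents one cast vote.
        // Important: the vote count is taken over this collection.
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
        // Votes received from servers that have already left the election
        // (state FOLLOWING or LEADING) [less important here]
        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

        // Initial timeout for waiting after a notification is sent.
        // Note that "not" stands for notification.
        int notTimeout = finalizeWait;

        // ----------------- Vote for itself as the initial Leader -------------------
        synchronized(this){
            // Increment the logical clock by one
            logicalclock.incrementAndGet();
            // Recommend itself as the Leader
            // getInitId(): the id of the current server
            // getInitLastLoggedZxid(): the largest zxid logged on the current server
            // getPeerEpoch(): the current epoch, i.e. the epoch of the Leader before this election
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        LOG.info("New election. My id =  " + self.getId() +
                ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        // Broadcast the vote to all the other servers in the cluster
        sendNotifications();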

        // ---------- Compare own vote with the others' votes to find the better Leader ---------------
        /*
         * Loop in which we exchange notifications until we find a leader
         *
         * The loop keeps running while this server is LOOKING and the election has not been stopped
         */

        while ((self.getPeerState() == ServerState.LOOKING) &&
                (!stop)){
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
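            /**
             * recvqueue holds all notifications received from other servers. Take one
             * out to compare with this server's vote, removing it from the queue.
             * Note that the queue is a linked list with a dummy head node: the head
             * created with the list holds no element, and the first real value is
             * head.next. So the old head node is unlinked and the element of the next
             * node is returned, as the source of recvqueue.put also shows.
             */
            // Unlink the head node and take the element of the first real node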
            Notification n = recvqueue.poll(notTimeout,
                    TimeUnit.MILLISECONDS);

            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
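            /**
             * Why is n == null? Because the notifications from the other hosts have
             * not all arrived. There may be a network problem, or this host may not
             * have managed to send anything out at all.
             */
            if(n == null){
                if(manager.haveDelivered()){
                    // Not all notifications received: resend
                    sendNotifications();
                } else {
                    // If this host has not managed to send out its vote, the other hosts cannot have
                    // collected all votes either, so they will resend; this host reconnects and waits
                    manager.connectAll();
                }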

                /*
                 * Exponential backoff
                 */
                // Double the timeout, i.e. the maximum time to wait for another host's vote after sending our own, capped at maxNotificationInterval
                int tmpTimeOut = notTimeout*2;
                notTimeout = (tmpTimeOut < maxNotificationInterval?
                        tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            }
            // Check that the recommender (n.sid) and the recommended (n.leader) are both valid voters
            else if(validVoter(n.sid) && validVoter(n.leader)) {
                /*
                 * Only proceed if the vote comes from a replica in the
                 * voting view for a replica in the voting view.
                 */
                switch (n.state) {
                case LOOKING:
                    // If notification > current, replace and send messages out
                    // Compare n's election epoch (n's logical clock for short) with the logical clock of this host's election

                    // n's logical clock is greater than this host's, so n's is the newest
                    if (n.electionEpoch > logicalclock.get()) {
                        // Update this host's logical clock to the newest value, i.e. n's
                        logicalclock.set(n.electionEpoch);
                        // Clear recvset
                        recvset.clear();
                        // Decide whether the leader in n or this host itself is more suitable
                        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            // The leader in n is more suitable:
                            // this host adopts n's vote as its own
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            // Vote for itself again (remember that this host's logical clock has changed)
                            updateProposal(getInitId(),
                                    getInitLastLoggedZxid(),
                                    getPeerEpoch());
                        }
                        // Publish the updated vote again
                        sendNotifications();

                    // n's logical clock is smaller than this host's: leave the switch
                    } else if (n.electionEpoch < logicalclock.get()) {
                        if(LOG.isDebugEnabled()){
                            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                    + Long.toHexString(n.electionEpoch)
                                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;

                    // n's logical clock equals this host's: if n's vote is better, adopt it and publish again
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if(LOG.isDebugEnabled()){
                        LOG.debug("Adding vote: from=" + n.sid +
                                ", proposed leader=" + n.leader +
                                ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }

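                    // Put the id of the sender of n and its vote into the map for counting
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    // ---------------------- Work after the election ends --------------------

                    // Decide whether this round of election should end:
                    // is the host's own vote, new Vote(proposedLeader, proposedZxid,
                    //                                        logicalclock.get(), proposedEpoch),
                    // backed by more than half of the hosts?
                    if (termPredicate(recvset,
                            new Vote(proposedLeader, proposedZxid,
                                    logicalclock.get(), proposedEpoch))) {

                        // Verify if there is any change in the proposed leader
                        // Check whether any of the remaining notifications recommends a more suitable Leader.
                        // The while loop below has two exits:
                        // 1) the loop condition: no remaining notification is better than the current vote
                        // 2) break: a remaining notification that is better than the current vote was found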
                        while((n = recvqueue.poll(finalizeWait,
                                TimeUnit.MILLISECONDS)) != null){
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch)){
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
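                        // n == null means the while loop above exited through its loop condition:
                        // no remaining notification is better than the current vote, so we can wrap up
                        if (n == null) {
                            // If the elected leader is this host, its state becomes LEADING;
                            // otherwise it becomes the state returned by learningState()
                            // Change the state of this host
                            self.setPeerState((proposedLeader == self.getId()) ?
                                    ServerState.LEADING: learningState());

                            // Build the final vote and return it to the caller of lookForLeader()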
                            Vote endVote = new Vote(proposedLeader,
                                                    proposedZxid,
                                                    logicalclock.get(),
                                                    proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: " + n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    if(n.electionEpoch == logicalclock.get()){
                        recvset.put(n.sid, new Vote(n.leader,
                                                      n.zxid,
                                                      n.electionEpoch,
                                                      n.peerEpoch));
                       
                        if(ooePredicate(recvset, outofelection, n)) {
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING: learningState());

                            Vote endVote = new Vote(n.leader, 
                                    n.zxid, 
                                    n.electionEpoch, 
                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify
                     * a majority is following the same leader.
                     */
                    outofelection.put(n.sid, new Vote(n.version,
                                                        n.leader,
                                                        n.zxid,
                                                        n.electionEpoch,
                                                        n.peerEpoch,
                                                        n.state));
       
                    if(ooePredicate(outofelection, outofelection, n)) {
                        synchronized(this){
                            logicalclock.set(n.electionEpoch);
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING: learningState());
                        }
                        Vote endVote = new Vote(n.leader,
                                                n.zxid,
                                                n.electionEpoch,
                                                n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                            n.state, n.sid);
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        try {
            if(self.jmxLeaderElectionBean != null){
                MBeanRegistry.getInstance().unregister(
                        self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
        LOG.debug("Number of connection processing threads: {}",
                manager.getConnectionThreadCount());
    }
}
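
The two checks that drive the LOOKING branch above, comparing two votes and counting whether a proposal has majority support, can be illustrated with a minimal sketch. This is not ZooKeeper's implementation: the names `QuorumSketch`, `SimpleVote`, `beats` and `hasQuorum` are hypothetical, the comparison order (epoch, then zxid, then server id) is assumed from the arguments passed to `totalOrderPredicate`, and server weights and observers are ignored.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified sketch of vote comparison and quorum counting; not ZooKeeper's own code. */
class QuorumSketch {

    /** Hypothetical, reduced vote: proposed leader sid, its last zxid, and its epoch. */
    static final class SimpleVote {
        final long leader;
        final long zxid;
        final long peerEpoch;

        SimpleVote(long leader, long zxid, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.peerEpoch = peerEpoch;
        }

        boolean sameAs(SimpleVote other) {
            return leader == other.leader && zxid == other.zxid && peerEpoch == other.peerEpoch;
        }
    }

    /** Returns true if candidate wins: higher epoch first, then higher zxid, then higher sid. */
    static boolean beats(SimpleVote candidate, SimpleVote current) {
        if (candidate.peerEpoch != current.peerEpoch) {
            return candidate.peerEpoch > current.peerEpoch;
        }
        if (candidate.zxid != current.zxid) {
            return candidate.zxid > current.zxid;
        }
        return candidate.leader > current.leader;
    }

    /** Returns true if strictly more than half of the voters sent a vote matching the proposal. */
    static boolean hasQuorum(Map<Long, SimpleVote> recvset, SimpleVote proposal, int voterCount) {
        int supporters = 0;
        for (SimpleVote v : recvset.values()) {
            if (v.sameAs(proposal)) {
                supporters++;
            }
        }
        return supporters > voterCount / 2;
    }

    public static void main(String[] args) {
        SimpleVote proposal = new SimpleVote(2, 0, 0);
        Map<Long, SimpleVote> recvset = new HashMap<>();     // key: sender sid
        recvset.put(1L, new SimpleVote(2, 0, 0));
        System.out.println(hasQuorum(recvset, proposal, 3)); // one of three voters: false
        recvset.put(2L, new SimpleVote(2, 0, 0));
        System.out.println(hasQuorum(recvset, proposal, 3)); // two of three voters: true
    }
}
```

In a three-server ensemble, two matching votes are enough, which is why Server1 and Server2 can finish the election before Server3 starts.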

Questions about the Leader election source code

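1. When one host sends its vote notification to the other hosts, do the others send their own notifications only after receiving it? Doesn't every host start by sending a vote that nominates itself as Leader?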

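The source-code analysis above is written from the point of view of a single host. When the election starts, every host sends its notifications asynchronously: no host waits to receive another host's notification before sending its own. Each host first votes for itself as Leader and broadcasts that vote.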
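
The behaviour described in this answer can be sketched with a small single-threaded simulation. It is a simplification under stated assumptions, not ZooKeeper's code: the class `SelfVoteSketch` and its members are hypothetical, all servers share the same epoch (so only zxid and sid are compared), and a server re-broadcasts only when it changes its vote.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Single-threaded simulation of the first election round; a sketch, not ZooKeeper's code. */
class SelfVoteSketch {

    /** Hypothetical notification: sender sid plus the (leader, zxid) it proposes. */
    static final class Notification {
        final long sid;
        final long leader;
        final long zxid;

        Notification(long sid, long leader, long zxid) {
            this.sid = sid;
            this.leader = leader;
            this.zxid = zxid;
        }
    }

    /** Outcome of the simulation: final proposal and notification count per server. */
    static final class Result {
        final long[] leader;
        final int[] received;

        Result(long[] leader, int[] received) {
            this.leader = leader;
            this.received = received;
        }
    }

    /** Puts the sender's current proposal into every other server's inbox. */
    static void broadcast(List<Deque<Notification>> inbox, int sender, long leader, long zxid) {
        for (int i = 0; i < inbox.size(); i++) {
            if (i != sender) {
                inbox.get(i).add(new Notification(sender + 1, leader, zxid));
            }
        }
    }

    /** Server i has sid i + 1 and last zxid zxids[i]. */
    static Result simulate(long[] zxids) {
        int n = zxids.length;
        long[] leader = new long[n];
        long[] proposedZxid = new long[n];
        int[] received = new int[n];
        List<Deque<Notification>> inbox = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            inbox.add(new ArrayDeque<>());
        }
        // Every server first proposes itself and broadcasts without waiting for anyone.
        for (int i = 0; i < n; i++) {
            leader[i] = i + 1;
            proposedZxid[i] = zxids[i];
            broadcast(inbox, i, leader[i], proposedZxid[i]);
        }
        // Deliver one notification per server per pass until all inboxes are empty.
        boolean delivered = true;
        while (delivered) {
            delivered = false;
            for (int i = 0; i < n; i++) {
                Notification m = inbox.get(i).poll();
                if (m == null) {
                    continue;
                }
                delivered = true;
                received[i]++;
                boolean better = m.zxid > proposedZxid[i]
                        || (m.zxid == proposedZxid[i] && m.leader > leader[i]);
                if (better) {
                    // Adopt the better proposal and announce the changed vote.
                    leader[i] = m.leader;
                    proposedZxid[i] = m.zxid;
                    broadcast(inbox, i, leader[i], proposedZxid[i]);
                }
            }
        }
        return new Result(leader, received);
    }

    public static void main(String[] args) {
        Result r = simulate(new long[] {0, 0, 0});
        for (int i = 0; i < r.leader.length; i++) {
            System.out.println("Server" + (i + 1) + " votes for " + r.leader[i]
                    + " after " + r.received[i] + " notifications");
        }
    }
}
```

With three servers that all start at zxid 0, every server ends up proposing sid 3, and each one receives more than two notifications because changed votes are broadcast again.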

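2. After a host sends its vote notification to the other 10 hosts, does it receive exactly 10 notifications back? Besides the 10 initial self-votes from the other hosts, it may also receive notifications from hosts that have changed their vote, right? [Yes]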

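A host can receive more than 10 notifications from the other hosts, because a host that changes its vote sends a new notification. The count is 10 only after deduplication by sender. See the answer to the first question above.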
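
Deduplication by sender follows from how votes are stored: in the source above, `recvset.put(n.sid, ...)` keys each vote by the sender's sid, so a later notification from the same host overwrites the earlier one. A minimal sketch of that effect, where the class `RecvSetSketch` and the `{sid, proposedLeader}` array encoding are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of per-sender deduplication with a map keyed by sid; not ZooKeeper's code. */
class RecvSetSketch {

    /**
     * Applies notifications in arrival order. Each entry is {sid, proposedLeader};
     * a later entry from the same sid replaces the earlier one.
     */
    static Map<Long, Long> collect(long[][] notifications) {
        Map<Long, Long> recvset = new HashMap<>();
        for (long[] n : notifications) {
            recvset.put(n[0], n[1]);
        }
        return recvset;
    }

    public static void main(String[] args) {
        // Server2 first votes for itself, Server3 votes for itself,
        // then Server2 changes its vote to Server3.
        long[][] received = { {2, 2}, {3, 3}, {2, 3} };
        Map<Long, Long> recvset = collect(received);
        // prints "3 notifications, 2 distinct senders"
        System.out.println(received.length + " notifications, "
                + recvset.size() + " distinct senders");
    }
}
```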

Describe ZooKeeper's Leader election