zookeeper-2. Overview


ZooKeeper from the perspective of computer organization

  1. cpu ZooKeeper provides functionality that is sensitive to latency. If it has to compete with other processes for CPU, or the ZooKeeper deployment serves multiple purposes, consider dedicating a CPU core to it so that context switching is not a problem.

  2. memory ZooKeeper is sensitive to swapping; any host running a ZooKeeper server should avoid swap.

  3. disk Disk performance is vital to keeping a ZooKeeper cluster healthy. Solid-state drives (SSDs) are recommended, because ZooKeeper must have low-latency disk writes to perform at its best. Every request to ZooKeeper must be committed to disk on the quorum servers before its result can be read.

ZooKeeper's transaction log must be on a dedicated device (a dedicated partition is not enough). Sharing the log device with other processes can cause seeks and contention, which in turn can cause multi-second delays.

  4. net Network bandwidth usage: because ZooKeeper tracks state, it is sensitive to timeouts caused by network latency. If the network bandwidth is saturated, you may experience hard-to-explain client session timeouts.

ZooKeeper from the perspective of the operating system

  1. process management A JVM process with multiple child threads; processes on different machines communicate with each other over sockets.

Number of open file handles: this limit should be enforced both system-wide and for the user running the ZooKeeper process, and the value should account for the maximum number of open file handles allowed. ZooKeeper frequently opens and closes connections and needs an available pool of file handles to draw from.

  2. memory management Heap usage is high, because the entire data set is kept in memory.

  3. filesystem The ZooKeeper data directory is created on the local filesystem.

  4. io protocol stack The number of servers in a ZooKeeper ensemble is always fixed, so server-to-server interaction in the cluster uses a comparatively reliable BIO (blocking I/O) long-lived-connection model.

Unlike server-to-server interaction, the number of ZooKeeper clients is unknown in advance, so to improve concurrency the client-server interaction uses an NIO model.

ZooKeeper from the perspective of the application

1. data structures

  • DataTree The tree maintains two parallel data structures: a hashtable that maps from full paths to DataNodes and a tree of DataNodes. All accesses to a path are through the hashtable. The tree is traversed only when serializing to disk (a minimal sketch follows this list).

  • DataNode This class contains the data for a node in the data tree.

A data node contains a reference to its parent, a byte array as its data, an array of ACLs, a stat object, and a set of its children's paths.

  • stat the stat for this node that is persisted to disk: czxid, mzxid, ctime, mtime, version, cversion, aversion, ephemeralOwner, pzxid

  • ZKDatabase This class maintains the in memory database of zookeeper server states that includes the sessions, datatree and the committed logs. It is booted up after reading the logs and snapshots from the disk.
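To make the hashtable-plus-tree layout concrete, here is a minimal sketch of the idea under simplified assumptions; the class and field names below are invented for illustration and are not the real org.apache.zookeeper.server.DataTree/DataNode code.

// Sketch of the idea only: a path-indexed hashtable plus parent/children links.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class NodeSketch {
    NodeSketch parent;                             // reference to the parent node
    byte[] data;                                   // the node's payload
    final Set<String> children = new HashSet<>();  // names of the child nodes
    // the real DataNode also carries an ACL reference and a persisted stat
    // (czxid, mzxid, ctime, mtime, version, cversion, aversion, ephemeralOwner, pzxid)
}

class TreeSketch {
    // hashtable: full path -> node; every path lookup goes through this map
    private final Map<String, NodeSketch> nodes = new ConcurrentHashMap<>();

    TreeSketch() {
        nodes.put("/", new NodeSketch());          // the root node
    }

    void createNode(String path, byte[] data) {
        int slash = path.lastIndexOf('/');
        String parentPath = (slash == 0) ? "/" : path.substring(0, slash);
        NodeSketch parent = nodes.get(parentPath);
        NodeSketch node = new NodeSketch();
        node.parent = parent;
        node.data = data;
        parent.children.add(path.substring(slash + 1));
        nodes.put(path, node);                     // O(1) access by full path
    }

    NodeSketch getNode(String path) {
        return nodes.get(path);                    // the parent/children links are
    }                                              // walked only when serializing
}

public class DataTreeSketch {
    public static void main(String[] args) {
        TreeSketch tree = new TreeSketch();
        tree.createNode("/app", "cfg".getBytes());
        tree.createNode("/app/lock", new byte[0]);
        System.out.println(tree.getNode("/app").children);  // [lock]
    }
}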

2. algorithm

zab

At the core of ZooKeeper is the Zab protocol (ZooKeeper Atomic Broadcast), the mechanism that keeps the servers in sync. Zab has two modes: recovery mode (leader election) and broadcast mode (synchronization).

3. design patterns

  • observer pattern

  • chain of responsibility ZooKeeperServer.setupRequestProcessors()
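A minimal sketch of the pattern as it is used for request processing; the interface and stage names here are invented for illustration, while the real chain is wired back-to-front in ZooKeeperServer.setupRequestProcessors() (on a standalone server: PrepRequestProcessor -> SyncRequestProcessor -> FinalRequestProcessor).

// Sketch only: simplified interface and invented processor names.
interface ProcessorSketch {
    void processRequest(String request);
}

class PrepStage implements ProcessorSketch {
    private final ProcessorSketch next;
    PrepStage(ProcessorSketch next) { this.next = next; }
    public void processRequest(String request) {
        System.out.println("prepare: " + request); // this stage's own work ...
        next.processRequest(request);              // ... then hand off down the chain
    }
}

class FinalStage implements ProcessorSketch {
    public void processRequest(String request) {
        System.out.println("apply: " + request);   // end of the chain
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        // the chain is built back-to-front, like setupRequestProcessors()
        ProcessorSketch chain = new PrepStage(new FinalStage());
        chain.processRequest("create /foo");
    }
}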

4. serialization

jute

Jute is used for on-disk serialization (SnapLog: snapshots and the transaction log) and for network packet serialization.
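A small round-trip example, assuming the org.apache.jute API (Record, BinaryOutputArchive, BinaryInputArchive) and using the jute-generated Stat record; the same mechanism serializes snapshots, transaction-log entries and network packets.

// Sketch: serialize a jute Record to bytes and read it back.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.jute.BinaryInputArchive;
import org.apache.jute.BinaryOutputArchive;
import org.apache.zookeeper.data.Stat;

public class JuteDemo {
    public static void main(String[] args) throws IOException {
        Stat stat = new Stat();
        stat.setCzxid(1L);
        stat.setVersion(3);

        // serialize: Record.serialize(OutputArchive, tag)
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        BinaryOutputArchive oa = BinaryOutputArchive.getArchive(bos);
        stat.serialize(oa, "stat");

        // deserialize: Record.deserialize(InputArchive, tag)
        BinaryInputArchive ia =
            BinaryInputArchive.getArchive(new ByteArrayInputStream(bos.toByteArray()));
        Stat copy = new Stat();
        copy.deserialize(ia, "stat");
        System.out.println(copy.getCzxid() + " " + copy.getVersion());
    }
}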

5. core classes

  • QuorumPeerMain main Thread

  • QuorumPeer the quorumpeer thread: loadDataBase (restores data), startServerCnxnFactory (opens the server socket for client connections), startLeaderElection (starts leader election). Fields: ZKDatabase, ServerCnxnFactory (selectorThreads, acceptThread; an NIO reactor programming model)

@Override
public synchronized void start() {
    if (!getView().containsKey(myid)) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    loadDataBase();
    startServerCnxnFactory();
    try {
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    startLeaderElection();
    startJvmPauseMonitor();
    super.start(); // calls run()
}

Main loop

while (running) {
  switch (getPeerState()) {
    case LOOKING:
        LOG.info("LOOKING");
        ServerMetrics.getMetrics().LOOKING_COUNT.add(1);
        ...
        break;
    case OBSERVING:
        try {
            LOG.info("OBSERVING");
            setObserver(makeObserver(logFactory));
            observer.observeLeader();
        } catch (Exception e) {
            LOG.warn("Unexpected exception", e);
        } finally {
            observer.shutdown();
            setObserver(null);
            updateServerState();

            // Add delay jitter before we switch to LOOKING
            // state to reduce the load of ObserverMaster
            if (isRunning()) {
                Observer.waitForObserverElectionDelay();
            }
        }
        break;
    case FOLLOWING:
        try {
            LOG.info("FOLLOWING");
            setFollower(makeFollower(logFactory));
            follower.followLeader();
        } catch (Exception e) {
            LOG.warn("Unexpected exception", e);
        } finally {
            follower.shutdown();
            setFollower(null);
            updateServerState();
        }
        break;
    case LEADING:
        LOG.info("LEADING");
        try {
            setLeader(makeLeader(logFactory));
            leader.lead();
            setLeader(null);
        } catch (Exception e) {
            LOG.warn("Unexpected exception", e);
        } finally {
            if (leader != null) {
                leader.shutdown("Forcing shutdown");
                setLeader(null);
            }
            updateServerState();
        }
        break;
    }
  }
* Observer(port: 2888)

  Observers are peers that do not take part in the atomic broadcast protocol. Instead, they are informed of successful proposals by the Leader. Observers therefore naturally act as a relay point for publishing the proposal stream and can relieve Followers of some of the connection load. Observers may submit proposals, but do not vote in their acceptance.
  
  field: ObserverZooKeeperServer
        A ZooKeeperServer for the Observer node type. Not much is different, but we anticipate specializing the request processors in the future.
  
* Follower
  This class has the control logic for the Follower.
  field: FollowerZooKeeperServer
        Just like the standard ZooKeeperServer. We just replace the request processors: FollowerRequestProcessor -> CommitProcessor -> FinalRequestProcessor A SyncRequestProcessor is also spawned off to log proposals from the leader.

* Leader
  This class has the control logic for the Leader.
  field: LeaderZooKeeperServer
         Just like the standard ZooKeeperServer. We just replace the request processors: PrepRequestProcessor -> ProposalRequestProcessor -> CommitProcessor -> Leader.ToBeAppliedRequestProcessor -> FinalRequestProcessor



* ServerCnxnFactory(port: 2181)
  * AcceptThread
    There is a single **AcceptThread** which accepts new connections and assigns them to a SelectorThread using a simple round-robin scheme to spread them across the SelectorThreads.

  * SelectorThread
    The **SelectorThread** receives newly accepted connections from the AcceptThread and is responsible for selecting for I/O readiness across the connections. This thread is the only thread that performs any non-threadsafe or potentially blocking calls on the selector (registering new connections and reading/writing interest ops). 

* FastLeaderElection
Implementation of leader election using TCP. It uses an object of the class QuorumCnxManager to manage connections. 

  * QuorumCnxManager(port: 3888)
    Unlike ServerCnxnFactory, it uses blocking I/O (BIO).
    This class implements a connection manager for leader election using TCP. It maintains one connection for every pair of servers. The tricky part is to guarantee that there is exactly one connection for every pair of servers that are operating correctly and that can communicate over the network. 
    * SendWorker
      **Thread** to send messages. Instance waits on a queue, and sends a message as soon as there is one available. If the connection breaks, it opens a new one (see the sketch after this list).
    * RecvWorker
      **Thread** to receive messages. Instance waits on a socket read. If the channel breaks, it removes itself from the pool of receivers.

  * Messenger
    **Multi-threaded** implementation of message handler. Messenger implements two sub-classes: WorkReceiver and WorkSender. The functionality of each is obvious from the name. Each of these spawns a new thread.
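A minimal sketch of the SendWorker idea (invented names, not the real QuorumCnxManager code): a per-peer thread blocks on a message queue and writes each message to the socket as soon as one is available.

// Sketch: one send-queue and one sender thread per peer.
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SendWorkerSketch extends Thread {
    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
    private final DataOutputStream out;

    SendWorkerSketch(Socket sock) throws IOException {
        out = new DataOutputStream(sock.getOutputStream());
    }

    /** Called by the election code to queue a message for this peer. */
    void send(byte[] msg) {
        queue.offer(msg);                // in this sketch a full queue simply drops the message
    }

    @Override
    public void run() {
        try {
            while (!isInterrupted()) {
                byte[] msg = queue.take();   // wait until a message is available
                out.writeInt(msg.length);    // length-prefixed frame
                out.write(msg);
                out.flush();
            }
        } catch (InterruptedException | IOException e) {
            // on a broken connection the real SendWorker opens a new one
        }
    }
}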
  

Design

Features

design goals

  • ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace, organized much like a standard file system. The namespace consists of data registers - znodes, in ZooKeeper parlance - which are similar to files and directories. Unlike a typical file system designed for storage, ZooKeeper keeps its data in memory, which is how it achieves high throughput and low latency.

The ZooKeeper implementation puts a premium on high performance, high availability, and strictly ordered access. The performance aspect means it can be used in large distributed systems; the reliability aspect keeps it from being a single point of failure; and the strict ordering means sophisticated synchronization primitives can be implemented at the client.

  • ZooKeeper is replicated. The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.

Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.

  • ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.
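That stamp is the zxid (ZooKeeper transaction id): a 64-bit number whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch. A small illustration (the helper names are ours, not ZooKeeper's):

// Illustration of the zxid layout: epoch in the high 32 bits, counter in the low 32 bits.
public class ZxidDemo {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 42);                 // 5th leader epoch, 42nd txn in it
        System.out.println(Long.toHexString(zxid));  // 50000002a
        System.out.println(epochOf(zxid) + " / " + counterOf(zxid));
    }
}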

  • ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

Because its goal is to serve as a foundation for building more complex services (such as synchronization), it provides a set of guarantees:

  • Sequential Consistency - Updates from a client will be applied in the order that they were sent.
  • Atomicity - Updates either succeed or fail. No partial results.
  • Single System Image - A client will see the same view of the service regardless of the server that it connects to. i.e., a client will never see an older view of the system even if the client fails over to a different server with the same session.
  • Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
  • Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.

Data model

ZooKeeper's data model is a tree of ZNodes. Internally, ZooKeeper keeps the contents of the entire tree in memory, in the manner of an in-memory database, and periodically writes it to disk.

A znode is the smallest unit of data in ZooKeeper. By default a znode can hold at most 1 MB of data.

  • Persistent node: once created, the node stays on the ZooKeeper servers until a delete operation explicitly removes it.

  • Ephemeral node: the node's lifetime is bound to the client session; if the session ends, the node is automatically cleaned up.

  • Sequential node: when the node is created, ZooKeeper records the creation order by automatically appending a numeric suffix to the node name, and the suffix becomes part of the new, full node name. The upper bound of the suffix is the maximum value of an integer.

  • In-memory data ZooKeeper's data model is a tree. The in-memory database stores the contents of the entire tree, including all node paths, node data, and ACL information, and ZooKeeper periodically persists this data to disk.

  1. DataTree DataTree is the core of the in-memory store: a tree structure that represents a complete copy of the data in memory. It contains no business logic related to networking, client connections, or request handling, and is an independent component.

  2. DataNode DataNode is the smallest unit of data storage. Besides the node's data content, ACL list, and node stat, it also keeps a reference to the parent node and the list of child nodes, and provides interfaces for manipulating the child list.

  3. ZKDatabase ZooKeeper's in-memory database, which manages all sessions, the DataTree store, and the transaction log. ZKDatabase periodically dumps snapshot data to disk, and when ZooKeeper starts it rebuilds a complete in-memory database from the on-disk transaction log and snapshot files.

  • On-disk data Disk data consists mainly of snapshot files and transaction log files.

  1. snapshot A full copy of the in-memory data at some point in time; by default a new snapshot file is generated after roughly 100,000 transaction log records.

  2. transaction log Transaction operations performed after a snapshot is taken are appended to the transaction log.

API

One of the design goals of ZooKeeper is providing a very simple programming interface. As a result, it supports only these operations:

  • create : creates a node at a location in the tree
  • delete : deletes a node
  • exists : tests if a node exists at a location
  • get data : reads the data from a node
  • set data : writes data to a node
  • get children : retrieves a list of children of a node
  • sync : waits for data to be propagated
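A short client-side sketch exercising these operations with the standard Java client; the connect string and the paths below are placeholders.

// Sketch: basic znode operations with the ZooKeeper Java client.
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class BasicOpsDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // create: a persistent node with some data
        zk.create("/demo", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: returns the node's Stat (czxid, mzxid, version, ...) or null
        Stat stat = zk.exists("/demo", false);

        // get data / set data (the expected version gives optimistic locking)
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println(new String(data));
        zk.setData("/demo", "v2".getBytes(), stat.getVersion());

        // get children / an ephemeral-sequential child, e.g. for locks or queues
        zk.create("/demo/member-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> children = zk.getChildren("/demo", false);
        System.out.println(children);

        // delete: children first; version -1 skips the version check
        zk.delete("/demo/" + children.get(0), -1);
        zk.delete("/demo", -1);
        zk.close();
    }
}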