Kafka Server - Controller: Handlers and State Machines


Zk Handler

The Kafka controller mainly registers the following two types of ZooKeeper handlers:

  • childChangeHandler
  • nodeChangeHandler

A childChangeHandler watches for changes to a node's children, while a nodeChangeHandler watches for changes to the node itself. The childChangeHandlers are:

  • brokerChangeHandler
  • topicChangeHandler
  • topicDeletionHandler
  • logDirEventNotificationHandler
  • isrChangeNotificationHandler
| handler | watched path | purpose |
| --- | --- | --- |
| brokerChangeHandler | /brokers/ids | detects brokers coming online or going offline |
| topicChangeHandler | /brokers/topics | detects topic creation |
| topicDeletionHandler | /admin/delete_topics | detects topic deletion |
| logDirEventNotificationHandler | /log_dir_event_notification | notifies the controller of log directory failures |
| isrChangeNotificationHandler | /isr_change_notification | reacts to partition ISR changes |

There are two nodeChangeHandlers:

  • preferredReplicaElectionHandler
  • partitionReassignmentHandler
| handler | watched path | purpose |
| --- | --- | --- |
| preferredReplicaElectionHandler | /admin/preferred_replica_election | triggers preferred partition leader election |
| partitionReassignmentHandler | /admin/reassign_partitions | drives partition replica reassignment |

Below we walk through each handler's processing logic. Understanding them tells us what the server is actually doing behind the scenes when we operate a Kafka cluster.

ZNodeChildChangeHandler

ZNodeChildChangeHandler is a trait that provides a path field and a handleChildChange method:

trait ZNodeChildChangeHandler {
  val path: String
  def handleChildChange(): Unit = {}
}

All of the childChangeHandlers listed above are subclasses of this trait.
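
The node-level handlers we will meet below (PartitionModificationsHandler, PreferredReplicaElectionHandler, PartitionReassignmentHandler) extend the sibling trait ZNodeChangeHandler instead, which looks roughly like this (simplified from kafka.zookeeper; treat the exact default bodies as an approximation):

trait ZNodeChangeHandler {
  val path: String
  def handleCreation(): Unit = {}
  def handleDeletion(): Unit = {}
  def handleDataChange(): Unit = {}
}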

BrokerChangeHandler

BrokerChangeHandler implements handleChildChange as follows:

override def handleChildChange(): Unit = {
    eventManager.put(controller.BrokerChange)
  }

It puts a BrokerChange event (a ControllerEvent) onto the controller's main event queue. As we saw earlier, once the controller's ControllerEventThread takes an event off the queue, it sets _state to the state named by the event and then invokes the event's process method.
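
As a refresher, a ControllerEvent is just a value that names the controller state it represents and carries its own processing logic. A minimal sketch of the abstraction (the real definitions are nested inside KafkaController, so this is only an approximation):

trait ControllerEvent {
  def state: ControllerState // the ControllerState the controller reports while this event is being processed
  def process(): Unit        // the work executed on the controller event thread
}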

The controller state specified by the BrokerChange event is BrokerChange, and its process method is:

override def process(): Unit = {
     if (!isActive) return
     val curBrokers = zkClient.getAllBrokersInCluster.toSet
     val curBrokerIds = curBrokers.map(_.id)
     val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
     val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
     val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
     val newBrokers = curBrokers.filter(broker => newBrokerIds(broker.id))
     controllerContext.liveBrokers = curBrokers
     val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
     val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
     val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
     info(s"Newly added brokers: ${newBrokerIdsSorted.mkString(",")}, " +
       s"deleted brokers: ${deadBrokerIdsSorted.mkString(",")}, all live brokers: ${liveBrokerIdsSorted.mkString(",")}")

     newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
     deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker)
     if (newBrokerIds.nonEmpty)
       onBrokerStartup(newBrokerIdsSorted)
     if (deadBrokerIds.nonEmpty)
       onBrokerFailure(deadBrokerIdsSorted)
   }
  1. Fetch the ids of all live brokers from /brokers/ids in ZooKeeper.
  2. Fetch the cached broker list from the controller context controllerContext (some of these brokers may already be offline).
  3. Subtract the cached brokers from the live brokers to get the newly added brokers, newBrokerIds.
  4. Subtract the live brokers from the cached brokers to get the offline brokers, deadBrokerIds.
  5. Apply the changes through the controller context: add the new brokers to, and remove the dead brokers from, the controller channel manager.
  6. Propagate the broker changes to the rest of the cluster (onBrokerStartup / onBrokerFailure).

Let's focus on steps 5 and 6.

Handling broker online/offline in the controller

Two methods do the work:

def addBroker(broker: Broker) {
    // be careful here. Maybe the startup() API has already started the request send thread
    brokerLock synchronized {
      if (!brokerStateInfo.contains(broker.id)) {
        addNewBroker(broker)
        startRequestSendThread(broker.id)
      }
    }
  }

def removeBroker(brokerId: Int) {
   brokerLock synchronized {
     removeExistingBroker(brokerStateInfo(brokerId))
   }
 }

Let's start with addNewBroker.

addNewBroker

private def addNewBroker(broker: Broker) {
   val messageQueue = new LinkedBlockingQueue[QueueItem]
   debug(s"Controller ${config.brokerId} trying to connect to broker ${broker.id}")
   val brokerNode = broker.node(config.interBrokerListenerName)
   val logContext = new LogContext(s"[Controller id=${config.brokerId}, targetBrokerId=${brokerNode.idString}] ")
   val networkClient = {
     val channelBuilder = ...
     val selector = ...
     new NetworkClient(
       ...
     )
   }
   val threadName = threadNamePrefix match {
     case None => s"Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
     case Some(name) => s"$name:Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
   }

   val requestRateAndQueueTimeMetrics = newTimer(
     RequestRateAndQueueTimeMetricName, TimeUnit.MILLISECONDS, TimeUnit.SECONDS, brokerMetricTags(broker.id)
   )

   val requestThread = new RequestSendThread(config.brokerId, controllerContext, messageQueue, networkClient,
     brokerNode, config, time, requestRateAndQueueTimeMetrics, stateChangeLogger, threadName)
   requestThread.setDaemon(false)

   val queueSizeGauge = newGauge(
     QueueSizeMetricName,
     new Gauge[Int] {
       def value: Int = messageQueue.size
     },
     brokerMetricTags(broker.id)
   )

   brokerStateInfo.put(broker.id, ControllerBrokerStateInfo(networkClient, brokerNode, messageQueue,
     requestThread, queueSizeGauge, requestRateAndQueueTimeMetrics))
 }

For brevity the NetworkClient construction code is omitted here; we will analyze it when we cover the network layer. From the code above we can see that the controller starts a dedicated thread for sending requests to each broker: it explicitly constructs a RequestSendThread, stores it in the ControllerChannelManager, and returns.

addBroker then calls startRequestSendThread to start the sending thread:

protected def startRequestSendThread(brokerId: Int) {
   val requestThread = brokerStateInfo(brokerId).requestSendThread
   if (requestThread.getState == Thread.State.NEW)
     requestThread.start()
 }

So what requests does it actually send?

RequestSendThread is a ShutdownableThread, the same base class we saw earlier with the ControllerEventThread in the eventManager; its defining trait is that it keeps taking items of some kind off a queue and processing them. The messageQueue passed in from addBroker serves exactly as that queue of request payloads, and RequestSendThread keeps pulling QueueItems from it:

case class QueueItem(apiKey: ApiKeys, request: AbstractRequest.Builder[_ <: AbstractRequest],
                     callback: AbstractResponse => Unit, enqueueTimeMs: Long)

and then sends each request to the target broker. In effect, addBroker sets up the communication channel to that broker: to send a request, the controller looks up the ControllerBrokerStateInfo for the broker id, takes its message queue, and simply enqueues the request message.
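
For illustration, this is roughly what that enqueue path looks like (a sketch modeled on ControllerChannelManager.sendRequest; the exact signature may differ between versions):

def sendRequest(brokerId: Int, apiKey: ApiKeys, request: AbstractRequest.Builder[_ <: AbstractRequest],
                callback: AbstractResponse => Unit = null): Unit = {
  brokerLock synchronized {
    brokerStateInfo.get(brokerId) match {
      case Some(stateInfo) =>
        // only enqueue here; the broker's RequestSendThread performs the actual network I/O
        stateInfo.messageQueue.put(QueueItem(apiKey, request, callback, time.milliseconds()))
      case None =>
        warn(s"Not sending request $request to broker $brokerId, since it is offline.")
    }
  }
}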

removeBroker

removeBroker is the exact inverse of addBroker:

def removeBroker(brokerId: Int) {
    brokerLock synchronized {
      removeExistingBroker(brokerStateInfo(brokerId))
    }
  }

where:

private def removeExistingBroker(brokerState: ControllerBrokerStateInfo) {
    try {
      // Shutdown the RequestSendThread before closing the NetworkClient to avoid the concurrent use of the
      // non-threadsafe classes as described in KAFKA-4959.
      // The call to shutdownLatch.await() in ShutdownableThread.shutdown() serves as a synchronization barrier that
      // hands off the NetworkClient from the RequestSendThread to the ZkEventThread.
      brokerState.requestSendThread.shutdown()
      brokerState.networkClient.close()
      brokerState.messageQueue.clear()
      removeMetric(QueueSizeMetricName, brokerMetricTags(brokerState.brokerNode.id))
      removeMetric(RequestRateAndQueueTimeMetricName, brokerMetricTags(brokerState.brokerNode.id))
      brokerStateInfo.remove(brokerState.brokerNode.id)
    } catch {
      case e: Throwable => error("Error while removing broker by the controller", e)
    }
  }
  1. Shut down the requestSendThread.
  2. Close the corresponding networkClient.
  3. Clear the message queue.
  4. Remove the metrics.
  5. Remove the broker from brokerStateInfo.

That concludes BrokerChangeHandler.

TopicChangeHandler

Getting straight to the point, here is the process method of the TopicChange event that TopicChangeHandler enqueues:

override def process(): Unit = {
      if (!isActive) return
      //step 1
      val topics = zkClient.getAllTopicsInCluster.toSet
      //step 2
      val newTopics = topics -- controllerContext.allTopics
      val deletedTopics = controllerContext.allTopics -- topics
      controllerContext.allTopics = topics

      //step 3 
      registerPartitionModificationsHandlers(newTopics.toSeq)
      //step 4
      val addedPartitionReplicaAssignment = zkClient.getReplicaAssignmentForTopics(newTopics)
      //step 5
      deletedTopics.foreach(controllerContext.removeTopic)
      //step 6
      addedPartitionReplicaAssignment.foreach {
        case (topicAndPartition, newReplicas) =>
        controllerContext.updatePartitionReplicaAssignment(topicAndPartition, newReplicas)
      }
      info(s"New topics: [$newTopics], deleted topics: [$deletedTopics], new partition replica assignment " +
        s"[$addedPartitionReplicaAssignment]")
      if (addedPartitionReplicaAssignment.nonEmpty)
        //step 7
        onNewPartitionCreation(addedPartitionReplicaAssignment.keySet)
    }

It breaks down into the following steps:

  1. Fetch the full list of topics from ZooKeeper.
  2. Compare it with the topics in the controller context to derive the newly added and deleted topic lists.
  3. Register a PartitionModificationsHandler on each newly added topic.
  4. Fetch the partition replica assignment of the new topics from ZooKeeper.
  5. Run removeTopic for each deleted topic.
  6. Update the partition assignment information in the controller context.
  7. Update the partitionStateMachine and replicaStateMachine with the new partitions (see the sketch below).
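
Step 7 is carried out by onNewPartitionCreation, which walks the new partitions and their replicas through the two state machines. Roughly (the exact leader-election strategy argument differs slightly between Kafka versions):

private def onNewPartitionCreation(newPartitions: Set[TopicPartition]): Unit = {
  info(s"New partition creation callback for ${newPartitions.mkString(",")}")
  // register the new partitions and replicas with the state machines ...
  partitionStateMachine.handleStateChanges(newPartitions.toSeq, NewPartition)
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, NewReplica)
  // ... then bring them online, which elects the initial leaders/ISR and sends
  // LeaderAndIsr and UpdateMetadata requests to the affected brokers
  partitionStateMachine.handleStateChanges(newPartitions.toSeq, OnlinePartition, Option(OfflinePartitionLeaderElectionStrategy))
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, OnlineReplica)
}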

Let's start with PartitionModificationsHandler.

PartitionModificationsHandler

class PartitionModificationsHandler(controller: KafkaController, eventManager: ControllerEventManager, topic: String) extends ZNodeChangeHandler {
  override val path: String = TopicZNode.path(topic)

  override def handleDataChange(): Unit = eventManager.put(controller.PartitionModifications(topic))
}

It watches for partition changes on a topic. When the data changes, it puts controller.PartitionModifications(topic) onto the controller's eventManager. This event's process method is:

override def process(): Unit = {
      if (!isActive) return
      // fetch the replica assignment of every partition of this topic
      val partitionReplicaAssignment = zkClient.getReplicaAssignmentForTopics(immutable.Set(topic))
      val partitionsToBeAdded = partitionReplicaAssignment.filter { case (topicPartition, _) =>
        controllerContext.partitionReplicaAssignment(topicPartition).isEmpty
      }
      if (topicDeletionManager.isTopicQueuedUpForDeletion(topic))
        if (partitionsToBeAdded.nonEmpty) {
          warn("Skipping adding partitions %s for topic %s since it is currently being deleted"
            .format(partitionsToBeAdded.map(_._1.partition).mkString(","), topic))

          restorePartitionReplicaAssignment(topic, partitionReplicaAssignment)
        } else {
          // This can happen if existing partition replica assignment are restored to prevent increasing partition count during topic deletion
          info("Ignoring partition change during topic deletion as no new partitions are added")
        }
      else {
        if (partitionsToBeAdded.nonEmpty) {
          info(s"New partitions to be added $partitionsToBeAdded")
          partitionsToBeAdded.foreach { case (topicPartition, assignedReplicas) =>
            controllerContext.updatePartitionReplicaAssignment(topicPartition, assignedReplicas)
          }
          onNewPartitionCreation(partitionsToBeAdded.keySet)
        }
      }
    }
  1. Fetch the replica assignment of each partition; this is a Map[TopicPartition, Seq[Int]].
  2. Compute the TopicPartitions to be added: partitionsToBeAdded.
  3. If the topic is currently being deleted, roll the assignment back (restorePartitionReplicaAssignment).
  4. Otherwise update the cached assignment for these TopicPartitions in the controller context.
  5. Update the partitionStateMachine and replicaStateMachine (onNewPartitionCreation, as sketched earlier).

In summary, when new partitions are added to a topic, the controller updates its assignment information for those TopicPartitions. When a new topic is created, essentially the same thing happens for the new topic: primarily an update of the topic's assignment.

TopicDeletionHandler

Its process method is:

override def process(): Unit = {
      if (!isActive) return
      var topicsToBeDeleted = zkClient.getTopicDeletions.toSet
      debug(s"Delete topics listener fired for topics ${topicsToBeDeleted.mkString(",")} to be deleted")
      val nonExistentTopics = topicsToBeDeleted -- controllerContext.allTopics
      if (nonExistentTopics.nonEmpty) {
        warn(s"Ignoring request to delete non-existing topics ${nonExistentTopics.mkString(",")}")
        zkClient.deleteTopicDeletions(nonExistentTopics.toSeq, controllerContext.epochZkVersion)
      }
      topicsToBeDeleted --= nonExistentTopics
      if (config.deleteTopicEnable) {
        if (topicsToBeDeleted.nonEmpty) {
          info(s"Starting topic deletion for topics ${topicsToBeDeleted.mkString(",")}")
          // mark topic ineligible for deletion if other state changes are in progress
          topicsToBeDeleted.foreach { topic =>
            val partitionReassignmentInProgress =
              controllerContext.partitionsBeingReassigned.keySet.map(_.topic).contains(topic)
            if (partitionReassignmentInProgress)
              topicDeletionManager.markTopicIneligibleForDeletion(Set(topic))
          }
          // add topic to deletion list
          topicDeletionManager.enqueueTopicsForDeletion(topicsToBeDeleted)
        }
      } else {
        // If delete topic is disabled remove entries under zookeeper path : /admin/delete_topics
        info(s"Removing $topicsToBeDeleted since delete topic is disabled")
        zkClient.deleteTopicDeletions(topicsToBeDeleted.toSeq, controllerContext.epochZkVersion)
      }
    }
  1. Fetch the list of topics to delete from /admin/delete_topics in ZooKeeper.
  2. Ignore topics that do not exist.
  3. If topic deletion is enabled in the configuration:
    • if the topic is currently undergoing partition reassignment, mark it ineligible for deletion for now;
    • otherwise, put the topic into the topicDeletionManager's deletion queue.
  4. If topic deletion is disabled, remove the corresponding entries from /admin/delete_topics.

So how does TopicDeletionManager actually delete a topic?

TopicDeletionManager

TopicDeletionManager manages the state machine for topic deletion. When a delete-topic command is issued, a node for the topic is created under /admin/delete_topics in ZooKeeper; the controller watches this path and deletes the corresponding topic. Before deleting, the controller checks the topic's state. If

  • a broker hosting one of the topic's replicas is currently offline, or
  • a partition reassignment for the topic is in progress,

the topic is judged ineligible for deletion. When

  • the broker hosting that replica comes back online, or
  • the partition reassignment completes,

the deletion of the topic resumes.

Each replica of a topic being deleted is in one of the following three states:

  • TopicDeletionStarted: a replica enters this state once onPartitionDeletion is invoked. When the controller observes a child change under delete_topics, it sends StopReplicaRequests to all replicas and registers a callback on the StopReplicaResponse, which runs as each replica acknowledges the deletion.
  • TopicDeletionSuccessful: the replica moves into this state or not depending on the error code in its StopReplicaResponse.
  • TopicDeletionFailed: likewise, the replica moves into this state or not depending on the error code in its StopReplicaResponse.

A topic is deleted successfully if and only if all of its replicas reach the TopicDeletionSuccessful state. If no replica remains in TopicDeletionStarted and at least one replica is in TopicDeletionFailed, the topic is marked for a deletion retry.
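
In the code these three states map onto members of the replica state machine: ReplicaDeletionStarted, ReplicaDeletionSuccessful and ReplicaDeletionIneligible. A simplified sketch of just the deletion-related states (the full ReplicaState trait also defines NewReplica, OnlineReplica, OfflineReplica and NonExistentReplica):

sealed trait ReplicaState
case object ReplicaDeletionStarted extends ReplicaState     // "TopicDeletionStarted" above
case object ReplicaDeletionSuccessful extends ReplicaState  // "TopicDeletionSuccessful" above
case object ReplicaDeletionIneligible extends ReplicaState  // "TopicDeletionFailed" above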

As mentioned above, when the topics to delete are read from ZooKeeper, enqueueTopicsForDeletion is called:

def enqueueTopicsForDeletion(topics: Set[String]) {
    if (isDeleteTopicEnabled) {
      topicsToBeDeleted ++= topics
      resumeDeletions()
    }
  }

It adds the topics to the manager's topicsToBeDeleted set and then resumes the deletion work:

private def resumeDeletions(): Unit = {
   val topicsQueuedForDeletion = Set.empty[String] ++ topicsToBeDeleted

   if (topicsQueuedForDeletion.nonEmpty)
     info(s"Handling deletion for topics ${topicsQueuedForDeletion.mkString(",")}")

   topicsQueuedForDeletion.foreach { topic =>
     // if all replicas are marked as deleted successfully, then topic deletion is done
      // all replicas of the topic have already been deleted successfully
     if (controller.replicaStateMachine.areAllReplicasForTopicDeleted(topic)) {
       // clear up all state for this topic from controller cache and zookeeper
       completeDeleteTopic(topic)
       info(s"Deletion of topic $topic successfully completed")
     } else {
        // at least one replica has already started deleting the topic
       if (controller.replicaStateMachine.isAtLeastOneReplicaInDeletionStartedState(topic)) {
         // ignore since topic deletion is in progress
         val replicasInDeletionStartedState = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionStarted)
         val replicaIds = replicasInDeletionStartedState.map(_.replica)
         val partitions = replicasInDeletionStartedState.map(_.topicPartition)
         info(s"Deletion for replicas ${replicaIds.mkString(",")} for partition ${partitions.mkString(",")} of topic $topic in progress")
       } else {
         // if you come here, then no replica is in TopicDeletionStarted and all replicas are not in
         // TopicDeletionSuccessful. That means, that either given topic haven't initiated deletion
         // or there is at least one failed replica (which means topic deletion should be retried).
          // at this point no replica is in the started state: either deletion has not begun at all,
          // or at least one replica has failed
         if (controller.replicaStateMachine.isAnyReplicaInState(topic, ReplicaDeletionIneligible)) {
           // mark topic for deletion retry
           markTopicForDeletionRetry(topic)
         }
       }
     }
     // Try delete topic if it is eligible for deletion.
     if (isTopicEligibleForDeletion(topic)) {
       info(s"Deletion of topic $topic (re)started")
       // topic deletion will be kicked off
       onTopicDeletion(Set(topic))
     } else if (isTopicIneligibleForDeletion(topic)) {
       info(s"Not retrying deletion of topic $topic at this time since it is marked ineligible for deletion")
     }
   }
 }

When the controller first kicks off deletion, it runs onTopicDeletion(Set(topic)):

private def onTopicDeletion(topics: Set[String]) {
   info(s"Topic deletion callback for ${topics.mkString(",")}")
   // send update metadata so that brokers stop serving data for topics to be deleted
   val partitions = topics.flatMap(controllerContext.partitionsForTopic)
   val unseenTopicsForDeletion = topics -- topicsWithDeletionStarted
   if (unseenTopicsForDeletion.nonEmpty) {
     val unseenPartitionsForDeletion = unseenTopicsForDeletion.flatMap(controllerContext.partitionsForTopic)
      // take all the partitions offline first
     controller.partitionStateMachine.handleStateChanges(unseenPartitionsForDeletion.toSeq, OfflinePartition)
     controller.partitionStateMachine.handleStateChanges(unseenPartitionsForDeletion.toSeq, NonExistentPartition)
     // adding of unseenTopicsForDeletion to topicsBeingDeleted must be done after the partition state changes
     // to make sure the offlinePartitionCount metric is properly updated
     topicsWithDeletionStarted ++= unseenTopicsForDeletion
   }

   controller.sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, partitions)
   topics.foreach { topic =>
     onPartitionDeletion(controllerContext.partitionsForTopic(topic))
   }
 }

The controller first sends an update-metadata request (carrying the partition leader information) to all live brokers, and then deletes each partition. The method executed when deleting partitions is:

private def onPartitionDeletion(partitionsToBeDeleted: Set[TopicPartition]) {
   info(s"Partition deletion callback for ${partitionsToBeDeleted.mkString(",")}")
   val replicasPerPartition = controllerContext.replicasForPartition(partitionsToBeDeleted)
   startReplicaDeletion(replicasPerPartition)
 }

It first collects all replicas of these partitions and then tells the replicas to delete the topic:

private def startReplicaDeletion(replicasForTopicsToBeDeleted: Set[PartitionAndReplica]) {
   replicasForTopicsToBeDeleted.groupBy(_.topic).keys.foreach { topic =>
     val aliveReplicasForTopic = controllerContext.allLiveReplicas().filter(p => p.topic == topic)
     val deadReplicasForTopic = replicasForTopicsToBeDeleted -- aliveReplicasForTopic
     val successfullyDeletedReplicas = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
     val replicasForDeletionRetry = aliveReplicasForTopic -- successfullyDeletedReplicas
     // move dead replicas directly to failed state 
      // if a replica is down at this point, mark it ReplicaDeletionIneligible
     controller.replicaStateMachine.handleStateChanges(deadReplicasForTopic.toSeq, ReplicaDeletionIneligible, new Callbacks())
     // send stop replica to all followers that are not in the OfflineReplica state so they stop sending fetch requests to the leader
      // take the replicas offline so they stop fetching from the leader
     controller.replicaStateMachine.handleStateChanges(replicasForDeletionRetry.toSeq, OfflineReplica, new Callbacks())
     debug(s"Deletion started for replicas ${replicasForDeletionRetry.mkString(",")}")
      // mark the replicas as ReplicaDeletionStarted
     controller.replicaStateMachine.handleStateChanges(replicasForDeletionRetry.toSeq, ReplicaDeletionStarted,
       new Callbacks(stopReplicaResponseCallback = (stopReplicaResponseObj, replicaId) =>
         eventManager.put(controller.TopicDeletionStopReplicaResponseReceived(stopReplicaResponseObj, replicaId))))
     if (deadReplicasForTopic.nonEmpty) {
       debug(s"Dead Replicas (${deadReplicasForTopic.mkString(",")}) found for topic $topic")
        // if any replica is offline, pause deletion of this topic for now
       markTopicIneligibleForDeletion(Set(topic))
     }
   }
 }

When moving the replicas to the ReplicaDeletionStarted state, the controller registers a callback that handles the responses:

eventManager.put(controller.TopicDeletionStopReplicaResponseReceived(stopReplicaResponseObj, replicaId))

The controller handles the response by enqueuing another ControllerEvent, TopicDeletionStopReplicaResponseReceived. Its process method is:

override def process(): Unit = {
      import JavaConverters._
      if (!isActive) return
      val stopReplicaResponse = stopReplicaResponseObj.asInstanceOf[StopReplicaResponse]
      debug(s"Delete topic callback invoked for $stopReplicaResponse")
      val responseMap = stopReplicaResponse.responses.asScala
      val partitionsInError =
        if (stopReplicaResponse.error != Errors.NONE) responseMap.keySet
        else responseMap.filter { case (_, error) => error != Errors.NONE }.keySet
      val replicasInError = partitionsInError.map(PartitionAndReplica(_, replicaId))
      // move all the failed replicas to ReplicaDeletionIneligible
      topicDeletionManager.failReplicaDeletion(replicasInError)
      if (replicasInError.size != responseMap.size) {
        // some replicas could have been successfully deleted
        val deletedReplicas = responseMap.keySet -- partitionsInError
        topicDeletionManager.completeReplicaDeletion(deletedReplicas.map(PartitionAndReplica(_, replicaId)))
      }
    }

The controller first collects partitionsInError, the partitions whose replica on this broker returned an error. For those replicas it calls topicDeletionManager.failReplicaDeletion. If not every partition replica failed (i.e., some replicas succeeded), it calls topicDeletionManager.completeReplicaDeletion for the successful ones. Let's look at failReplicaDeletion first.

def failReplicaDeletion(replicas: Set[PartitionAndReplica]) {
    if (isDeleteTopicEnabled) {
      val replicasThatFailedToDelete = replicas.filter(r => isTopicQueuedUpForDeletion(r.topic))
      if (replicasThatFailedToDelete.nonEmpty) {
        val topics = replicasThatFailedToDelete.map(_.topic)
        debug(s"Deletion failed for replicas ${replicasThatFailedToDelete.mkString(",")}. Halting deletion for topics $topics")
        controller.replicaStateMachine.handleStateChanges(replicasThatFailedToDelete.toSeq, ReplicaDeletionIneligible)
        markTopicIneligibleForDeletion(topics)
        resumeDeletions()
      }
    }
  }

This method marks the replicas as ReplicaDeletionIneligible and adds the topic to topicsIneligibleForDeletion so it can be deleted later.

def completeReplicaDeletion(replicas: Set[PartitionAndReplica]) {
   val successfullyDeletedReplicas = replicas.filter(r => isTopicQueuedUpForDeletion(r.topic))
   debug(s"Deletion successfully completed for replicas ${successfullyDeletedReplicas.mkString(",")}")
   controller.replicaStateMachine.handleStateChanges(successfullyDeletedReplicas.toSeq, ReplicaDeletionSuccessful)
   resumeDeletions()
 }

This method marks the replicas as ReplicaDeletionSuccessful.

Once the controller has told the replicas to delete the topic's partitions, we return to topicDeletionManager's resumeDeletions. At this point, if every replica has finished deleting its partitions, i.e., has reached ReplicaDeletionSuccessful, the topic deletion is finalized:

if (controller.replicaStateMachine.areAllReplicasForTopicDeleted(topic)) {
       // clear up all state for this topic from controller cache and zookeeper
       completeDeleteTopic(topic)
       info(s"Deletion of topic $topic successfully completed")
     }

completeDeleteTopic is implemented as follows:

private def completeDeleteTopic(topic: String) {
    // deregister partition change listener on the deleted topic. This is to prevent the partition change listener
    // firing before the new topic listener when a deleted topic gets auto created
    // unregister the topic's partition-modification listener
    controller.unregisterPartitionModificationsHandlers(Seq(topic))
    val replicasForDeletedTopic = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
    // controller will remove this replica from the state machine as well as its partition assignment cache
    // remove all replicas from the replica state machine's cache
    controller.replicaStateMachine.handleStateChanges(replicasForDeletedTopic.toSeq, NonExistentReplica)
    // remove the topic from the deletion queues and from the ZooKeeper nodes
    topicsToBeDeleted -= topic
    topicsWithDeletionStarted -= topic
    zkClient.deleteTopicZNode(topic, controllerContext.epochZkVersion)
    zkClient.deleteTopicConfigs(Seq(topic), controllerContext.epochZkVersion)
    zkClient.deleteTopicDeletions(Seq(topic), controllerContext.epochZkVersion)
    // drop the topic's replica assignment, partition leader info, etc. from the controller context
    controllerContext.removeTopic(topic)
  }

Note that after the controller sends the stop-replica requests, each replica is either still in the started state (it has not responded yet), or in the ineligible or successful state, depending on whether its response carried an error. resumeDeletions then checks: if at least one replica is still in the started state, the topic is skipped (deletion is still in progress); otherwise, if at least one replica is in the ineligible state, some replica failed to delete and the topic deletion needs to be retried later.

if (controller.replicaStateMachine.isAtLeastOneReplicaInDeletionStartedState(topic)) {
         // ignore since topic deletion is in progress
         val replicasInDeletionStartedState = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionStarted)
         val replicaIds = replicasInDeletionStartedState.map(_.replica)
         val partitions = replicasInDeletionStartedState.map(_.topicPartition)
         info(s"Deletion for replicas ${replicaIds.mkString(",")} for partition ${partitions.mkString(",")} of topic $topic in progress")
       } else {
         // if you come here, then no replica is in TopicDeletionStarted and all replicas are not in
         // TopicDeletionSuccessful. That means, that either given topic haven't initiated deletion
         // or there is at least one failed replica (which means topic deletion should be retried).
         if (controller.replicaStateMachine.isAnyReplicaInState(topic, ReplicaDeletionIneligible)) {
           // mark topic for deletion retry
           markTopicForDeletionRetry(topic)
         }
       }

The method that marks a topic for a deletion retry is:

private def markTopicForDeletionRetry(topic: String) {
    // reset replica states from ReplicaDeletionIneligible to OfflineReplica
    val failedReplicas = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionIneligible)
    info(s"Retrying delete topic for topic $topic since replicas ${failedReplicas.mkString(",")} were not successfully deleted")
    controller.replicaStateMachine.handleStateChanges(failedReplicas.toSeq, OfflineReplica)
  }

As for when the retry actually happens, we will fill that in later when we encounter it.

LogDirEventNotificationHandler

LogDirEventNotificationHandler corresponds to the LogDirEventNotification event; again we go straight to its process method:

override def process(): Unit = {
     if (!isActive) return
     val sequenceNumbers = zkClient.getAllLogDirEventNotifications
     try {
       // first read the sequence numbers of the child znodes, then use them to read the broker ids
       // carried by the corresponding log dir event notifications
       val brokerIds = zkClient.getBrokerIdsFromLogDirEvents(sequenceNumbers)
       onBrokerLogDirFailure(brokerIds)
     } finally {
       // delete processed children
       zkClient.deleteLogDirEventNotifications(sequenceNumbers, controllerContext.epochZkVersion)
     }
   }

When a log dir event notification is observed (it indicates a log directory error), the controller sends LeaderAndIsrRequests to the brokers to query the state of their replicas.
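
A rough sketch of onBrokerLogDirFailure (names taken from this code base; treat the body as an approximation): it pushes every replica hosted on the affected brokers back through the OnlineReplica transition, which causes LeaderAndIsr requests to be sent so the brokers can report which replicas are now offline.

private def onBrokerLogDirFailure(brokerIds: Seq[Int]): Unit = {
  info(s"Handling log directory failure for brokers ${brokerIds.mkString(",")}")
  // transitioning to OnlineReplica resends LeaderAndIsr requests for every replica on these brokers
  val replicasOnBrokers = controllerContext.replicasOnBrokers(brokerIds.toSet)
  replicaStateMachine.handleStateChanges(replicasOnBrokers.toSeq, OnlineReplica)
}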

IsrChangeNotificationHandler

IsrChangeNotificationHandler corresponds to the IsrChangeNotification event, whose process method is:

override def process(): Unit = {
      if (!isActive) return
      // step 1: again, read the sequence numbers first
      val sequenceNumbers = zkClient.getAllIsrChangeNotifications
      try {
        // read the ISR change notifications for these sequence numbers and the partitions they refer to
        val partitions = zkClient.getPartitionsFromIsrChangeNotifications(sequenceNumbers)
        if (partitions.nonEmpty) {
          updateLeaderAndIsrCache(partitions)
          processUpdateNotifications(partitions)
        }
      } finally {
        // delete the notifications
        zkClient.deleteIsrChangeNotifications(sequenceNumbers, controllerContext.epochZkVersion)
      }
    }

It first resolves the affected partitions via the sequence numbers of the ZooKeeper nodes, then reads each partition's current state from ZooKeeper to refresh the leader/ISR cache, and finally sends update-metadata requests to the other brokers.
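
processUpdateNotifications is little more than that broadcast; a rough sketch (an approximation of the method in this version):

private def processUpdateNotifications(partitions: Seq[TopicPartition]): Unit = {
  val liveBrokers: Seq[Int] = controllerContext.liveOrShuttingDownBrokerIds.toSeq
  debug(s"Sending UpdateMetadata request to brokers $liveBrokers for partitions $partitions")
  sendUpdateMetadataRequest(liveBrokers, partitions.toSet)
}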

All of the handlers covered so far are childChangeHandlers, which watch for changes to a node's children. The controller also has two handlers that watch the nodes themselves: PreferredReplicaElectionHandler and PartitionReassignmentHandler.

PreferredReplicaElectionHandler

The controller's ZNodeChangeHandlers implement handleCreation to react to the creation of a path. When /admin/preferred_replica_election is created, PreferredReplicaElectionHandler puts a PreferredReplicaLeaderElection event onto the controller's eventManager; its process method is:

override def process(): Unit = {
      if (!isActive) return

      // We need to register the watcher if the path doesn't exist in order to detect future preferred replica
      // leader elections and we get the `path exists` check for free
      if (zkClient.registerZNodeChangeHandlerAndCheckExistence(preferredReplicaElectionHandler)) {
        val partitions = zkClient.getPreferredReplicaElection
        val partitionsForTopicsToBeDeleted = partitions.filter(p => topicDeletionManager.isTopicQueuedUpForDeletion(p.topic))
        if (partitionsForTopicsToBeDeleted.nonEmpty) {
          error(s"Skipping preferred replica election for partitions $partitionsForTopicsToBeDeleted since the " +
            "respective topics are being deleted")
        }
        onPreferredReplicaElection(partitions -- partitionsForTopicsToBeDeleted)
      }
    }
  1. Check whether the znode exists (registering the watcher at the same time).
  2. Read the data under /admin/preferred_replica_election and parse it into a Set[TopicPartition].
  3. Filter out partitions whose topics are queued for deletion.
  4. Call onPreferredReplicaElection for the remaining partitions:

private def onPreferredReplicaElection(partitions: Set[TopicPartition], isTriggeredByAutoRebalance: Boolean = false) {
    info(s"Starting preferred replica leader election for partitions ${partitions.mkString(",")}")
    try {
      partitionStateMachine.handleStateChanges(partitions.toSeq, OnlinePartition, Option(PreferredReplicaPartitionLeaderElectionStrategy))
    } catch {
      case e: ControllerMovedException =>
        error(s"Error completing preferred replica leader election for partitions ${partitions.mkString(",")} because controller has moved to another broker.", e)
        throw e
      case e: Throwable => error(s"Error completing preferred replica leader election for partitions ${partitions.mkString(",")}", e)
    } finally {
      removePartitionsFromPreferredReplicaElection(partitions, isTriggeredByAutoRebalance)
    }
  }

The partition state machine involved here will be covered in detail later. In onPreferredReplicaElection, the partitionStateMachine triggers a partition leader election on the brokers and transitions the partitions to the OnlinePartition state. Once the election completes, removePartitionsFromPreferredReplicaElection is executed:

private def removePartitionsFromPreferredReplicaElection(partitionsToBeRemoved: Set[TopicPartition],
                                                          isTriggeredByAutoRebalance : Boolean) {
   for (partition <- partitionsToBeRemoved) {
     // check the status
     val currentLeader = controllerContext.partitionLeadershipInfo(partition).leaderAndIsr.leader
     val preferredReplica = controllerContext.partitionReplicaAssignment(partition).head
     if (currentLeader == preferredReplica) {
       info(s"Partition $partition completed preferred replica leader election. New leader is $preferredReplica")
     } else {
       warn(s"Partition $partition failed to complete preferred replica leader election to $preferredReplica. " +
         s"Leader is still $currentLeader")
     }
   }
   if (!isTriggeredByAutoRebalance) {
     zkClient.deletePreferredReplicaElection(controllerContext.epochZkVersion)
     // Ensure we detect future preferred replica leader elections
     eventManager.put(PreferredReplicaLeaderElection)
   }
 }
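
For completeness, the PreferredReplicaPartitionLeaderElectionStrategy used in onPreferredReplicaElection boils down to a very small rule. A sketch of the election function (the preferred replica is simply the first replica of the assignment, and it only wins if it is alive and already in the ISR):

def preferredReplicaPartitionLeaderElection(assignment: Seq[Int],
                                            isr: Seq[Int],
                                            liveReplicas: Set[Int]): Option[Int] = {
  // the head of the assignment is the "preferred" replica; elect it only if it is alive and in sync,
  // otherwise the partition keeps its current leader
  assignment.headOption.filter(id => liveReplicas.contains(id) && isr.contains(id))
}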

PartitionReassignmentHandler

PartitionReassignmentHandler watches /admin/reassign_partitions. Going straight to its event's process method:

override def process(): Unit = {
      if (!isActive) return

      // We need to register the watcher if the path doesn't exist in order to detect future reassignments and we get
      // the `path exists` check for free
      if (zkClient.registerZNodeChangeHandlerAndCheckExistence(partitionReassignmentHandler)) {
        val partitionReassignment = zkClient.getPartitionReassignment

        // Populate `partitionsBeingReassigned` with all partitions being reassigned before invoking
        // `maybeTriggerPartitionReassignment` (see method documentation for the reason)
        partitionReassignment.foreach { case (tp, newReplicas) =>
          val reassignIsrChangeHandler = new PartitionReassignmentIsrChangeHandler(KafkaController.this, eventManager,
            tp)
          controllerContext.partitionsBeingReassigned.put(tp, ReassignedPartitionsContext(newReplicas, reassignIsrChangeHandler))
        }

        maybeTriggerPartitionReassignment(partitionReassignment.keySet)
      }
    }
  1. If the znode exists, read the partitions and their target assignments.
  2. For each partition, create a PartitionReassignmentIsrChangeHandler, which watches the data of /brokers/topics/[topic]/partitions/[partition]/state (a sketch of this handler appears after the walkthrough below).
  3. Put each TopicPartition and its ReassignedPartitionsContext into controllerContext.partitionsBeingReassigned; the context holds the handler created in step 2.
  4. Call maybeTriggerPartitionReassignment:

private def maybeTriggerPartitionReassignment(topicPartitions: Set[TopicPartition]) {
    val partitionsToBeRemovedFromReassignment = scala.collection.mutable.Set.empty[TopicPartition]
    topicPartitions.foreach { tp =>
      if (topicDeletionManager.isTopicQueuedUpForDeletion(tp.topic)) {
        error(s"Skipping reassignment of $tp since the topic is currently being deleted")
        partitionsToBeRemovedFromReassignment.add(tp)
      } else {
        val reassignedPartitionContext = controllerContext.partitionsBeingReassigned.get(tp).getOrElse {
          throw new IllegalStateException(s"Initiating reassign replicas for partition $tp not present in " +
            s"partitionsBeingReassigned: ${controllerContext.partitionsBeingReassigned.mkString(", ")}")
        }
        val newReplicas = reassignedPartitionContext.newReplicas
        val topic = tp.topic
        val assignedReplicas = controllerContext.partitionReplicaAssignment(tp)
        if (assignedReplicas.nonEmpty) {
          if (assignedReplicas == newReplicas) {
            info(s"Partition $tp to be reassigned is already assigned to replicas " +
              s"${newReplicas.mkString(",")}. Ignoring request for partition reassignment.")
            partitionsToBeRemovedFromReassignment.add(tp)
          } else {
            try {
              info(s"Handling reassignment of partition $tp to new replicas ${newReplicas.mkString(",")}")
              // first register ISR change listener
              reassignedPartitionContext.registerReassignIsrChangeHandler(zkClient)
              // mark topic ineligible for deletion for the partitions being reassigned
              topicDeletionManager.markTopicIneligibleForDeletion(Set(topic))
              onPartitionReassignment(tp, reassignedPartitionContext)
            } catch {
              case e: ControllerMovedException =>
                error(s"Error completing reassignment of partition $tp because controller has moved to another broker", e)
                throw e
              case e: Throwable =>
                error(s"Error completing reassignment of partition $tp", e)
                // remove the partition from the admin path to unblock the admin client
                partitionsToBeRemovedFromReassignment.add(tp)
            }
          }
        } else {
            error(s"Ignoring request to reassign partition $tp that doesn't exist.")
            partitionsToBeRemovedFromReassignment.add(tp)
        }
      }
    }
    removePartitionsFromReassignedPartitions(partitionsToBeRemovedFromReassignment)
  }
  1. If the topic is queued for deletion, skip the reassignment.
  2. Otherwise fetch the ReassignedPartitionsContext that was just stored in the controller context.
  3. Take the new replica assignment from the reassignedPartitionContext; if it equals the current assignment, ignore the request.
  4. Otherwise register the ISR change listener in ZooKeeper and mark the topic ineligible for deletion while the reassignment runs.
  5. Perform the partition replica reassignment (onPartitionReassignment).
  6. Remove the TopicPartitions flagged above from controllerContext.partitionsBeingReassigned and unregister the ZooKeeper watches on their state nodes. If no TopicPartitions remain in partitionsBeingReassigned, also delete the /admin/reassign_partitions znode and re-watch it to catch the next reassignment; otherwise write the remaining assignment back to ZooKeeper (one may wonder whether this rewrite could re-trigger the handler in a loop).
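
For reference, the PartitionReassignmentIsrChangeHandler registered in step 2 of the process method is another small ZNodeChangeHandler, following the same pattern as PartitionModificationsHandler; a sketch of its shape (the event name is taken from this code base and may differ slightly across versions):

class PartitionReassignmentIsrChangeHandler(controller: KafkaController,
                                            eventManager: ControllerEventManager,
                                            partition: TopicPartition) extends ZNodeChangeHandler {
  // watches /brokers/topics/[topic]/partitions/[partition]/state
  override val path: String = TopicPartitionStateZNode.path(partition)

  // when the partition's leader/ISR data changes, queue an event so the controller can check
  // whether the reassigned replicas have caught up with the ISR
  override def handleDataChange(): Unit =
    eventManager.put(controller.PartitionReassignmentIsrChange(partition))
}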

Let's take a closer look at the partition replica reassignment process.

onPartitionReassignment

When an admin command triggers a partition reassignment, the /admin/reassign_partitions path is created and the ZooKeeper watcher fires; at that point the reassignment of the partition begins. For readability we use the following abbreviations:

  • RAR: reassigned replicas, the new replica assignment
  • OAR: original list of replicas, the previous replica assignment
  • AR: current assigned replicas, the assignment currently recorded for the partition

The code of onPartitionReassignment is as follows:

private def onPartitionReassignment(topicPartition: TopicPartition, reassignedPartitionContext: ReassignedPartitionsContext) {
    // first get the RAR
    val reassignedReplicas = reassignedPartitionContext.newReplicas
    // if some newly added replicas have not yet caught up with the ISR
    if (!areReplicasInIsr(topicPartition, reassignedReplicas)) {
      info(s"New replicas ${reassignedReplicas.mkString(",")} for partition $topicPartition being reassigned not yet " +
        "caught up with the leader")
      val newReplicasNotInOldReplicaList = reassignedReplicas.toSet -- controllerContext.partitionReplicaAssignment(topicPartition).toSet
      val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicPartition)).toSet
      //1. Update AR in ZK with OAR + RAR.
      // merge old and new replicas into the current assignment, i.e. AR
      updateAssignedReplicasForPartition(topicPartition, newAndOldReplicas.toSeq)
      //2. Send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR).
      // send a LeaderAndIsr request to every replica in AR, forcing a new leader epoch
      updateLeaderEpochAndSendRequest(topicPartition, controllerContext.partitionReplicaAssignment(topicPartition),
        newAndOldReplicas.toSeq)
      //3. replicas in RAR - OAR -> NewReplica
      // tell the newly added replicas to start up and catch up with the ISR
      startNewReplicasForReassignedPartition(topicPartition, reassignedPartitionContext, newReplicasNotInOldReplicaList)
      info(s"Waiting for new replicas ${reassignedReplicas.mkString(",")} for partition ${topicPartition} being " +
        "reassigned to catch up with the leader")
    } else {
      // all new replicas have caught up with the ISR
      //4. Wait until all replicas in RAR are in sync with the leader.
      val oldReplicas = controllerContext.partitionReplicaAssignment(topicPartition).toSet -- reassignedReplicas.toSet
      //5. replicas in RAR -> OnlineReplica
      // transition the replicas in RAR to OnlineReplica
      reassignedReplicas.foreach { replica =>
        replicaStateMachine.handleStateChanges(Seq(new PartitionAndReplica(topicPartition, replica)), OnlineReplica)
      }
      //6. Set AR to RAR in memory.
      //7. Send LeaderAndIsr request with a potential new leader (if current leader not in RAR) and
      //   a new AR (using RAR) and same isr to every broker in RAR
      moveReassignedPartitionLeaderIfRequired(topicPartition, reassignedPartitionContext)
      //8. replicas in OAR - RAR -> Offline (force those replicas out of isr)
      //9. replicas in OAR - RAR -> NonExistentReplica (force those replicas to be deleted)
      // the old replicas, OAR - RAR, are transitioned to offline and then to non-existent
      stopOldReplicasOfReassignedPartition(topicPartition, reassignedPartitionContext, oldReplicas)
      //10. Update AR in ZK with RAR.
      // update AR
      updateAssignedReplicasForPartition(topicPartition, reassignedReplicas)
      //11. Update the /admin/reassign_partitions path in ZK to remove this partition.
      removePartitionsFromReassignedPartitions(Set(topicPartition))
      //12. After electing leader, the replicas and isr information changes, so resend the update metadata request to every broker
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicPartition))
      // signal delete topic thread if reassignment for some partitions belonging to topics being deleted just completed
      topicDeletionManager.resumeDeletionForTopics(Set(topicPartition.topic))
    }
  }

The reassignment steps are:

  1. Update AR to OAR + RAR.
  2. Send LeaderAndIsr requests to every replica in OAR + RAR, starting a new leader epoch.
  3. Move the replicas in RAR - OAR to the NewReplica state.
  4. Wait until all replicas in RAR are in sync with the leader (see the sketch after this list).
  5. Move all replicas in RAR to OnlineReplica.
  6. Set AR to RAR in the controller context.
  7. If the current leader is not in RAR, elect a new leader from RAR; otherwise just increment the leader epoch.
  8. Move the replicas in OAR - RAR to OfflineReplica, remove them from the ISR, and send a LeaderAndIsr request to the leader announcing the shrunken ISR.
  9. Move the replicas in OAR - RAR to NonExistentReplica, which tells those brokers to physically delete the replica files on disk.
  10. Update the AR information in ZooKeeper.
  11. Remove this partition from the /admin/reassign_partitions znode.
  12. Since the replica and ISR information has changed, send update-metadata requests to every broker.
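
The "caught up with the ISR" check in step 4 (areReplicasInIsr in the code above) is roughly a read of the partition state znode followed by a subset test; a sketch (an approximation of the helper in this version):

private def areReplicasInIsr(partition: TopicPartition, replicas: Seq[Int]): Boolean = {
  // read /brokers/topics/[topic]/partitions/[partition]/state and check that every
  // reassigned replica is already a member of the ISR
  zkClient.getTopicPartitionStates(Seq(partition)).get(partition).exists { leaderIsrAndControllerEpoch =>
    replicas.forall(leaderIsrAndControllerEpoch.leaderAndIsr.isr.contains)
  }
}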