Kafka Server - Controller: Handlers and State Machines


Zk Handler

The Kafka controller mainly registers the following two types of ZooKeeper handlers:

  • childChangeHandler
  • nodeChangeHandler

A childChangeHandler watches for changes to a node's children, while a nodeChangeHandler watches for changes to the node itself. The childChangeHandlers are:

  • brokerChangeHandler
  • topicChangeHandler
  • topicDeletionHandler
  • logDirEventNotificationHandler
  • isrChangeNotificationHandler
| handler | watched path | purpose |
| --- | --- | --- |
| brokerChangeHandler | /brokers/ids | detects brokers coming online or going offline |
| topicChangeHandler | /brokers/topics | detects topic creation |
| topicDeletionHandler | /admin/delete_topics | detects topic deletion |
| logDirEventNotificationHandler | /log_dir_event_notification | notifies the controller of log directory failures |
| isrChangeNotificationHandler | /isr_change_notification | reacts to partition ISR changes |

There are two nodeChangeHandlers:

  • preferredReplicaElectionHandler
  • partitionReassignmentHandler
| handler | watched path | purpose |
| --- | --- | --- |
| preferredReplicaElectionHandler | /admin/preferred_replica_election | triggers preferred partition leader election |
| partitionReassignmentHandler | /admin/reassign_partitions | drives partition replica reassignment |

Below we walk through each handler's processing logic. Understanding them tells us what the server is actually doing behind the scenes when we operate a Kafka cluster.

ZNodeChildChangeHandler

ZNodeChildChangeHandler is a trait that provides a path field and a handleChildChange method:

trait ZNodeChildChangeHandler {
  val path: String
  def handleChildChange(): Unit = {}
}

All of the childChangeHandlers listed above are subclasses of this trait.
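
The node-level handlers we will meet below (PartitionModificationsHandler, PreferredReplicaElectionHandler, PartitionReassignmentHandler) extend the sibling trait ZNodeChangeHandler instead, which looks roughly like this (simplified from kafka.zookeeper; treat the exact default bodies as an approximation):

trait ZNodeChangeHandler {
  val path: String
  def handleCreation(): Unit = {}
  def handleDeletion(): Unit = {}
  def handleDataChange(): Unit = {}
}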

BrokerChangeHandler

BrokerChangeHandler implements handleChildChange as follows:

override def handleChildChange(): Unit = {
    eventManager.put(controller.BrokerChange)
  }

It puts a BrokerChange event (a ControllerEvent) onto the controller's main event queue. As we saw earlier, once the controller's ControllerEventThread takes an event off the queue, it sets _state to the state named by the event and then invokes the event's process method.
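
As a refresher, a ControllerEvent is just a value that names the controller state it represents and carries its own processing logic. A minimal sketch of the abstraction (the real definitions are nested inside KafkaController, so this is only an approximation):

trait ControllerEvent {
  def state: ControllerState // the ControllerState the controller reports while this event is being processed
  def process(): Unit        // the work executed on the controller event thread
}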

The controller state specified by the BrokerChange event is BrokerChange, and its process method is:

override def process(): Unit = {
     if (!isActive) return
     val curBrokers = zkClient.getAllBrokersInCluster.toSet
     val curBrokerIds = curBrokers.map(_.id)
     val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
     val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
     val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
     val newBrokers = curBrokers.filter(broker => newBrokerIds(broker.id))
     controllerContext.liveBrokers = curBrokers
     val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
     val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
     val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
     info(s"Newly added brokers: ${newBrokerIdsSorted.mkString(",")}, " +
       s"deleted brokers: ${deadBrokerIdsSorted.mkString(",")}, all live brokers: ${liveBrokerIdsSorted.mkString(",")}")

     newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
     deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker)
     if (newBrokerIds.nonEmpty)
       onBrokerStartup(newBrokerIdsSorted)
     if (deadBrokerIds.nonEmpty)
       onBrokerFailure(deadBrokerIdsSorted)
   }
  1. Fetch the ids of all live brokers from /brokers/ids in ZooKeeper.
  2. Fetch the cached broker list from the controller context controllerContext (some of these brokers may already be offline).
  3. Subtract the cached brokers from the live brokers to get the newly added brokers, newBrokerIds.
  4. Subtract the live brokers from the cached brokers to get the offline brokers, deadBrokerIds.
  5. Apply the changes through the controller context: add the new brokers to, and remove the dead brokers from, the controller channel manager.
  6. Propagate the broker changes to the rest of the cluster (onBrokerStartup / onBrokerFailure).

Let's focus on steps 5 and 6.

Handling broker online/offline in the controller

Two methods do the work:

def addBroker(broker: Broker) {
    // be careful here. Maybe the startup() API has already started the request send thread
    brokerLock synchronized {
      if (!brokerStateInfo.contains(broker.id)) {
        addNewBroker(broker)
        startRequestSendThread(broker.id)
      }
    }
  }

def removeBroker(brokerId: Int) {
   brokerLock synchronized {
     removeExistingBroker(brokerStateInfo(brokerId))
   }
 }

Let's start with addNewBroker.

addNewBroker

private def addNewBroker(broker: Broker) {
   val messageQueue = new LinkedBlockingQueue[QueueItem]
   debug(s"Controller ${config.brokerId} trying to connect to broker ${broker.id}")
   val brokerNode = broker.node(config.interBrokerListenerName)
   val logContext = new LogContext(s"[Controller id=${config.brokerId}, targetBrokerId=${brokerNode.idString}] ")
   val networkClient = {
     val channelBuilder = ...
     val selector = ...
     new NetworkClient(
       ...
     )
   }
   val threadName = threadNamePrefix match {
     case None => s"Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
     case Some(name) => s"$name:Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
   }

   val requestRateAndQueueTimeMetrics = newTimer(
     RequestRateAndQueueTimeMetricName, TimeUnit.MILLISECONDS, TimeUnit.SECONDS, brokerMetricTags(broker.id)
   )

   val requestThread = new RequestSendThread(config.brokerId, controllerContext, messageQueue, networkClient,
     brokerNode, config, time, requestRateAndQueueTimeMetrics, stateChangeLogger, threadName)
   requestThread.setDaemon(false)

   val queueSizeGauge = newGauge(
     QueueSizeMetricName,
     new Gauge[Int] {
       def value: Int = messageQueue.size
     },
     brokerMetricTags(broker.id)
   )

   brokerStateInfo.put(broker.id, ControllerBrokerStateInfo(networkClient, brokerNode, messageQueue,
     requestThread, queueSizeGauge, requestRateAndQueueTimeMetrics))
 }

For brevity the NetworkClient construction code is omitted here; we will analyze it when we cover the network layer. From the code above we can see that the controller starts a dedicated thread for sending requests to each broker: it explicitly constructs a RequestSendThread, stores it in the ControllerChannelManager, and returns.

addBroker then calls startRequestSendThread to start the sending thread:

protected def startRequestSendThread(brokerId: Int) {
   val requestThread = brokerStateInfo(brokerId).requestSendThread
   if (requestThread.getState == Thread.State.NEW)
     requestThread.start()
 }

So what requests does it actually send?

RequestSendThread is a ShutdownableThread, the same base class we saw earlier with the ControllerEventThread in the eventManager; its defining trait is that it keeps taking items of some kind off a queue and processing them. The messageQueue passed in from addBroker serves exactly as that queue of request payloads, and RequestSendThread keeps pulling QueueItems from it:

case class QueueItem(apiKey: ApiKeys, request: AbstractRequest.Builder[_ <: AbstractRequest],
                     callback: AbstractResponse => Unit, enqueueTimeMs: Long)

and then sends each request to the target broker. In effect, addBroker sets up the communication channel to that broker: to send a request, the controller looks up the ControllerBrokerStateInfo for the broker id, takes its message queue, and simply enqueues the request message.
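
For illustration, this is roughly what that enqueue path looks like (a sketch modeled on ControllerChannelManager.sendRequest; the exact signature may differ between versions):

def sendRequest(brokerId: Int, apiKey: ApiKeys, request: AbstractRequest.Builder[_ <: AbstractRequest],
                callback: AbstractResponse => Unit = null): Unit = {
  brokerLock synchronized {
    brokerStateInfo.get(brokerId) match {
      case Some(stateInfo) =>
        // only enqueue here; the broker's RequestSendThread performs the actual network I/O
        stateInfo.messageQueue.put(QueueItem(apiKey, request, callback, time.milliseconds()))
      case None =>
        warn(s"Not sending request $request to broker $brokerId, since it is offline.")
    }
  }
}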

removeBroker

removeBroker is the exact inverse of addBroker:

def removeBroker(brokerId: Int) {
    brokerLock synchronized {
      removeExistingBroker(brokerStateInfo(brokerId))
    }
  }

where:

private def removeExistingBroker(brokerState: ControllerBrokerStateInfo) {
    try {
      // Shutdown the RequestSendThread before closing the NetworkClient to avoid the concurrent use of the
      // non-threadsafe classes as described in KAFKA-4959.
      // The call to shutdownLatch.await() in ShutdownableThread.shutdown() serves as a synchronization barrier that
      // hands off the NetworkClient from the RequestSendThread to the ZkEventThread.
      brokerState.requestSendThread.shutdown()
      brokerState.networkClient.close()
      brokerState.messageQueue.clear()
      removeMetric(QueueSizeMetricName, brokerMetricTags(brokerState.brokerNode.id))
      removeMetric(RequestRateAndQueueTimeMetricName, brokerMetricTags(brokerState.brokerNode.id))
      brokerStateInfo.remove(brokerState.brokerNode.id)
    } catch {
      case e: Throwable => error("Error while removing broker by the controller", e)
    }
  }
  1. Shut down the requestSendThread.
  2. Close the corresponding networkClient.
  3. Clear the message queue.
  4. Remove the metrics.
  5. Remove the broker from brokerStateInfo.

That concludes BrokerChangeHandler.

TopicChangeHandler

Getting straight to the point, here is the process method of the TopicChange event that TopicChangeHandler enqueues:

override def process(): Unit = {
      if (!isActive) return
      //step 1
      val topics = zkClient.getAllTopicsInCluster.toSet
      //step 2
      val newTopics = topics -- controllerContext.allTopics
      val deletedTopics = controllerContext.allTopics -- topics
      controllerContext.allTopics = topics

      //step 3 
      registerPartitionModificationsHandlers(newTopics.toSeq)
      //step 4
      val addedPartitionReplicaAssignment = zkClient.getReplicaAssignmentForTopics(newTopics)
      //step 5
      deletedTopics.foreach(controllerContext.removeTopic)
      //step 6
      addedPartitionReplicaAssignment.foreach {
        case (topicAndPartition, newReplicas) =>
        controllerContext.updatePartitionReplicaAssignment(topicAndPartition, newReplicas)
      }
      info(s"New topics: [$newTopics], deleted topics: [$deletedTopics], new partition replica assignment " +
        s"[$addedPartitionReplicaAssignment]")
      if (addedPartitionReplicaAssignment.nonEmpty)
        //step 7
        onNewPartitionCreation(addedPartitionReplicaAssignment.keySet)
    }

It breaks down into the following steps:

  1. Fetch the full list of topics from ZooKeeper.
  2. Compare it with the topics in the controller context to derive the newly added and deleted topic lists.
  3. Register a PartitionModificationsHandler on each newly added topic.
  4. Fetch the partition replica assignment of the new topics from ZooKeeper.
  5. Run removeTopic for each deleted topic.
  6. Update the partition assignment information in the controller context.
  7. Update the partitionStateMachine and replicaStateMachine with the new partitions (see the sketch below).
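
Step 7 is carried out by onNewPartitionCreation, which walks the new partitions and their replicas through the two state machines. Roughly (the exact leader-election strategy argument differs slightly between Kafka versions):

private def onNewPartitionCreation(newPartitions: Set[TopicPartition]): Unit = {
  info(s"New partition creation callback for ${newPartitions.mkString(",")}")
  // register the new partitions and replicas with the state machines ...
  partitionStateMachine.handleStateChanges(newPartitions.toSeq, NewPartition)
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, NewReplica)
  // ... then bring them online, which elects the initial leaders/ISR and sends
  // LeaderAndIsr and UpdateMetadata requests to the affected brokers
  partitionStateMachine.handleStateChanges(newPartitions.toSeq, OnlinePartition, Option(OfflinePartitionLeaderElectionStrategy))
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, OnlineReplica)
}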

Let's start with PartitionModificationsHandler.

PartitionModificationsHandler

class PartitionModificationsHandler(controller: KafkaController, eventManager: ControllerEventManager, topic: String) extends ZNodeChangeHandler {
  override val path: String = TopicZNode.path(topic)

  override def handleDataChange(): Unit = eventManager.put(controller.PartitionModifications(topic))
}

It watches for partition changes on a topic. When the data changes, it puts controller.PartitionModifications(topic) onto the controller's eventManager. This event's process method is:

override def process(): Unit = {
      if (!isActive) return
      // fetch the replica assignment of every partition of this topic
      val partitionReplicaAssignment = zkClient.getReplicaAssignmentForTopics(immutable.Set(topic))
      val partitionsToBeAdded = partitionReplicaAssignment.filter { case (topicPartition, _) =>
        controllerContext.partitionReplicaAssignment(topicPartition).isEmpty
      }
      if (topicDeletionManager.isTopicQueuedUpForDeletion(topic))
        if (partitionsToBeAdded.nonEmpty) {
          warn("Skipping adding partitions %s for topic %s since it is currently being deleted"
            .format(partitionsToBeAdded.map(_._1.partition).mkString(","), topic))

          restorePartitionReplicaAssignment(topic, partitionReplicaAssignment)
        } else {
          // This can happen if existing partition replica assignment are restored to prevent increasing partition count during topic deletion
          info("Ignoring partition change during topic deletion as no new partitions are added")
        }
      else {
        if (partitionsToBeAdded.nonEmpty) {
          info(s"New partitions to be added $partitionsToBeAdded")
          partitionsToBeAdded.foreach { case (topicPartition, assignedReplicas) =>
            controllerContext.updatePartitionReplicaAssignment(topicPartition, assignedReplicas)
          }
          onNewPartitionCreation(partitionsToBeAdded.keySet)
        }
      }
    }
  1. Fetch the replica assignment of each partition; this is a Map[TopicPartition, Seq[Int]].
  2. Compute the TopicPartitions to be added: partitionsToBeAdded.
  3. If the topic is currently being deleted, roll the assignment back (restorePartitionReplicaAssignment).
  4. Otherwise update the cached assignment for these TopicPartitions in the controller context.
  5. Update the partitionStateMachine and replicaStateMachine (onNewPartitionCreation, as sketched earlier).

In summary, when new partitions are added to a topic, the controller updates its assignment information for those TopicPartitions. When a new topic is created, essentially the same thing happens for the new topic: primarily an update of the topic's assignment.

TopicDeletionHandler

Its process method is:

override def process(): Unit = {
      if (!isActive) return
      var topicsToBeDeleted = zkClient.getTopicDeletions.toSet
      debug(s"Delete topics listener fired for topics ${topicsToBeDeleted.mkString(",")} to be deleted")
      val nonExistentTopics = topicsToBeDeleted -- controllerContext.allTopics
      if (nonExistentTopics.nonEmpty) {
        warn(s"Ignoring request to delete non-existing topics ${nonExistentTopics.mkString(",")}")
        zkClient.deleteTopicDeletions(nonExistentTopics.toSeq, controllerContext.epochZkVersion)
      }
      topicsToBeDeleted --= nonExistentTopics
      if (config.deleteTopicEnable) {
        if (topicsToBeDeleted.nonEmpty) {
          info(s"Starting topic deletion for topics ${topicsToBeDeleted.mkString(",")}")
          // mark topic ineligible for deletion if other state changes are in progress
          topicsToBeDeleted.foreach { topic =>
            val partitionReassignmentInProgress =
              controllerContext.partitionsBeingReassigned.keySet.map(_.topic).contains(topic)
            if (partitionReassignmentInProgress)
              topicDeletionManager.markTopicIneligibleForDeletion(Set(topic))
          }
          // add topic to deletion list
          topicDeletionManager.enqueueTopicsForDeletion(topicsToBeDeleted)
        }
      } else {
        // If delete topic is disabled remove entries under zookeeper path : /admin/delete_topics
        info(s"Removing $topicsToBeDeleted since delete topic is disabled")
        zkClient.deleteTopicDeletions(topicsToBeDeleted.toSeq, controllerContext.epochZkVersion)
      }
    }
  1. Fetch the list of topics to delete from /admin/delete_topics in ZooKeeper.
  2. Ignore topics that do not exist.
  3. If topic deletion is enabled in the configuration:
    • if the topic is currently undergoing partition reassignment, mark it ineligible for deletion for now;
    • otherwise, put the topic into the topicDeletionManager's deletion queue.
  4. If topic deletion is disabled, remove the corresponding entries from /admin/delete_topics.

So how does TopicDeletionManager actually delete a topic?

TopicDeletionManager

TopicDeletionManager manages the state machine for topic deletion. When a delete-topic command is issued, a node for the topic is created under /admin/delete_topics in ZooKeeper; the controller watches this path and deletes the corresponding topic. Before deleting, the controller checks the topic's state. If

  • a broker hosting one of the topic's replicas is currently offline, or
  • a partition reassignment for the topic is in progress,

the topic is judged ineligible for deletion. When

  • the broker hosting that replica comes back online, or
  • the partition reassignment completes,

the deletion of the topic resumes.

Each replica of a topic being deleted is in one of the following three states:

  • TopicDeletionStarted: a replica enters this state once onPartitionDeletion is invoked. When the controller observes a child change under delete_topics, it sends StopReplicaRequests to all replicas and registers a callback on the StopReplicaResponse, which runs as each replica acknowledges the deletion.
  • TopicDeletionSuccessful: the replica moves into this state or not depending on the error code in its StopReplicaResponse.
  • TopicDeletionFailed: likewise, the replica moves into this state or not depending on the error code in its StopReplicaResponse.

A topic is deleted successfully if and only if all of its replicas reach the TopicDeletionSuccessful state. If no replica remains in TopicDeletionStarted and at least one replica is in TopicDeletionFailed, the topic is marked for a deletion retry.
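
In the code these three states map onto members of the replica state machine: ReplicaDeletionStarted, ReplicaDeletionSuccessful and ReplicaDeletionIneligible. A simplified sketch of just the deletion-related states (the full ReplicaState trait also defines NewReplica, OnlineReplica, OfflineReplica and NonExistentReplica):

sealed trait ReplicaState
case object ReplicaDeletionStarted extends ReplicaState     // "TopicDeletionStarted" above
case object ReplicaDeletionSuccessful extends ReplicaState  // "TopicDeletionSuccessful" above
case object ReplicaDeletionIneligible extends ReplicaState  // "TopicDeletionFailed" above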

As mentioned above, when the topics to delete are read from ZooKeeper, enqueueTopicsForDeletion is called:

def enqueueTopicsForDeletion(topics: Set[String]) {
    if (isDeleteTopicEnabled) {
      topicsToBeDeleted ++= topics
      resumeDeletions()
    }
  }

It adds the topics to the manager's topicsToBeDeleted set and then resumes the deletion work:

private def resumeDeletions(): Unit = {
   val topicsQueuedForDeletion = Set.empty[String] ++ topicsToBeDeleted

   if (topicsQueuedForDeletion.nonEmpty)
     info(s"Handling deletion for topics ${topicsQueuedForDeletion.mkString(",")}")

   topicsQueuedForDeletion.foreach { topic =>
     // if all replicas are marked as deleted successfully, then topic deletion is done
      // all replicas of the topic have already been deleted successfully
     if (controller.replicaStateMachine.areAllReplicasForTopicDeleted(topic)) {
       // clear up all state for this topic from controller cache and zookeeper
       completeDeleteTopic(topic)
       info(s"Deletion of topic $topic successfully completed")
     } else {
        // at least one replica has already started deleting the topic
       if (controller.replicaStateMachine.isAtLeastOneReplicaInDeletionStartedState(topic)) {
         // ignore since topic deletion is in progress
         val replicasInDeletionStartedState = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionStarted)
         val replicaIds = replicasInDeletionStartedState.map(_.replica)
         val partitions = replicasInDeletionStartedState.map(_.topicPartition)
         info(s"Deletion for replicas ${replicaIds.mkString(",")} for partition ${partitions.mkString(",")} of topic $topic in progress")
       } else {
         // if you come here, then no replica is in TopicDeletionStarted and all replicas are not in
         // TopicDeletionSuccessful. That means, that either given topic haven't initiated deletion
         // or there is at least one failed replica (which means topic deletion should be retried).
          // at this point no replica is in the started state: either deletion has not begun at all,
          // or at least one replica has failed
         if (controller.replicaStateMachine.isAnyReplicaInState(topic, ReplicaDeletionIneligible)) {
           // mark topic for deletion retry
           markTopicForDeletionRetry(topic)
         }
       }
     }
     // Try delete topic if it is eligible for deletion.
     if (isTopicEligibleForDeletion(topic)) {
       info(s"Deletion of topic $topic (re)started")
       // topic deletion will be kicked off
       onTopicDeletion(Set(topic))
     } else if (isTopicIneligibleForDeletion(topic)) {
       info(s"Not retrying deletion of topic $topic at this time since it is marked ineligible for deletion")
     }
   }
 }

When the controller first kicks off deletion, it runs onTopicDeletion(Set(topic)):

private def onTopicDeletion(topics: Set[String]) {
   info(s"Topic deletion callback for ${topics.mkString(",")}")
   // send update metadata so that brokers stop serving data for topics to be deleted
   val partitions = topics.flatMap(controllerContext.partitionsForTopic)
   val unseenTopicsForDeletion = topics -- topicsWithDeletionStarted
   if (unseenTopicsForDeletion.nonEmpty) {
     val unseenPartitionsForDeletion = unseenTopicsForDeletion.flatMap(controllerContext.partitionsForTopic)
      // take all the partitions offline first
     controller.partitionStateMachine.handleStateChanges(unseenPartitionsForDeletion.toSeq, OfflinePartition)
     controller.partitionStateMachine.handleStateChanges(unseenPartitionsForDeletion.toSeq, NonExistentPartition)
     // adding of unseenTopicsForDeletion to topicsBeingDeleted must be done after the partition state changes
     // to make sure the offlinePartitionCount metric is properly updated
     topicsWithDeletionStarted ++= unseenTopicsForDeletion
   }

   controller.sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, partitions)
   topics.foreach { topic =>
     onPartitionDeletion(controllerContext.partitionsForTopic(topic))
   }
 }

The controller first sends an update-metadata request (carrying the partition leader information) to all live brokers, and then deletes each partition. The method executed when deleting partitions is:

private def onPartitionDeletion(partitionsToBeDeleted: Set[TopicPartition]) {
   info(s"Partition deletion callback for ${partitionsToBeDeleted.mkString(",")}")
   val replicasPerPartition = controllerContext.replicasForPartition(partitionsToBeDeleted)
   startReplicaDeletion(replicasPerPartition)
 }

It first collects all replicas of these partitions and then tells the replicas to delete the topic:

private def startReplicaDeletion(replicasForTopicsToBeDeleted: Set[PartitionAndReplica]) {
   replicasForTopicsToBeDeleted.groupBy(_.topic).keys.foreach { topic =>
     val aliveReplicasForTopic = controllerContext.allLiveReplicas().filter(p => p.topic == topic)
     val deadReplicasForTopic = replicasForTopicsToBeDeleted -- aliveReplicasForTopic
     val successfullyDeletedReplicas = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
     val replicasForDeletionRetry = aliveReplicasForTopic -- successfullyDeletedReplicas
     // move dead replicas directly to failed state 
      // if a replica is down at this point, mark it ReplicaDeletionIneligible
     controller.replicaStateMachine.handleStateChanges(deadReplicasForTopic.toSeq, ReplicaDeletionIneligible, new Callbacks())
     // send stop replica to all followers that are not in the OfflineReplica state so they stop sending fetch requests to the leader
      // take the replicas offline so they stop fetching from the leader
     controller.replicaStateMachine.handleStateChanges(replicasForDeletionRetry.toSeq, OfflineReplica, new Callbacks())
     debug(s"Deletion started for replicas ${replicasForDeletionRetry.mkString(",")}")
      // mark the replicas as ReplicaDeletionStarted
     controller.replicaStateMachine.handleStateChanges(replicasForDeletionRetry.toSeq, ReplicaDeletionStarted,
       new Callbacks(stopReplicaResponseCallback = (stopReplicaResponseObj, replicaId) =>
         eventManager.put(controller.TopicDeletionStopReplicaResponseReceived(stopReplicaResponseObj, replicaId))))
     if (deadReplicasForTopic.nonEmpty) {
       debug(s"Dead Replicas (${deadReplicasForTopic.mkString(",")}) found for topic $topic")
        // if any replica is offline, pause deletion of this topic for now
       markTopicIneligibleForDeletion(Set(topic))
     }
   }
 }

When moving the replicas to the ReplicaDeletionStarted state, the controller registers a callback that handles the responses:

eventManager.put(controller.TopicDeletionStopReplicaResponseReceived(stopReplicaResponseObj, replicaId))

The controller handles the response by enqueuing another ControllerEvent, TopicDeletionStopReplicaResponseReceived. Its process method is:

override def process(): Unit = {
      import JavaConverters._
      if (!isActive) return
      val stopReplicaResponse = stopReplicaResponseObj.asInstanceOf[StopReplicaResponse]
      debug(s"Delete topic callback invoked for $stopReplicaResponse")
      val responseMap = stopReplicaResponse.responses.asScala
      val partitionsInError =
        if (stopReplicaResponse.error != Errors.NONE) responseMap.keySet
        else responseMap.filter { case (_, error) => error != Errors.NONE }.keySet
      val replicasInError = partitionsInError.map(PartitionAndReplica(_, replicaId))
      // move all the failed replicas to ReplicaDeletionIneligible
      topicDeletionManager.failReplicaDeletion(replicasInError)
      if (replicasInError.size != responseMap.size) {
        // some replicas could have been successfully deleted
        val deletedReplicas = responseMap.keySet -- partitionsInError
        topicDeletionManager.completeReplicaDeletion(deletedReplicas.map(PartitionAndReplica(_, replicaId)))
      }
    }

The controller first collects partitionsInError, the partitions whose replica on this broker returned an error. For those replicas it calls topicDeletionManager.failReplicaDeletion. If not every partition replica failed (i.e., some replicas succeeded), it calls topicDeletionManager.completeReplicaDeletion for the successful ones. Let's look at failReplicaDeletion first.

def failReplicaDeletion(replicas: Set[PartitionAndReplica]) {
    if (isDeleteTopicEnabled) {
      val replicasThatFailedToDelete = replicas.filter(r => isTopicQueuedUpForDeletion(r.topic))
      if (replicasThatFailedToDelete.nonEmpty) {
        val topics = replicasThatFailedToDelete.map(_.topic)
        debug(s"Deletion failed for replicas ${replicasThatFailedToDelete.mkString(",")}. Halting deletion for topics $topics")
        controller.replicaStateMachine.handleStateChanges(replicasThatFailedToDelete.toSeq, ReplicaDeletionIneligible)
        markTopicIneligibleForDeletion(topics)
        resumeDeletions()
      }
    }
  }

This method marks the replicas as ReplicaDeletionIneligible and adds the topic to topicsIneligibleForDeletion so it can be deleted later.

def completeReplicaDeletion(replicas: Set[PartitionAndReplica]) {
   val successfullyDeletedReplicas = replicas.filter(r => isTopicQueuedUpForDeletion(r.topic))
   debug(s"Deletion successfully completed for replicas ${successfullyDeletedReplicas.mkString(",")}")
   controller.replicaStateMachine.handleStateChanges(successfullyDeletedReplicas.toSeq, ReplicaDeletionSuccessful)
   resumeDeletions()
 }

This method marks the replicas as ReplicaDeletionSuccessful.

Once the controller has told the replicas to delete the topic's partitions, we return to topicDeletionManager's resumeDeletions. At this point, if every replica has finished deleting its partitions, i.e., has reached ReplicaDeletionSuccessful, the topic deletion is finalized:

if (controller.replicaStateMachine.areAllReplicasForTopicDeleted(topic)) {
       // clear up all state for this topic from controller cache and zookeeper
       completeDeleteTopic(topic)
       info(s"Deletion of topic $topic successfully completed")
     }

completeDeleteTopic is implemented as follows:

private def completeDeleteTopic(topic: String) {
    // deregister partition change listener on the deleted topic. This is to prevent the partition change listener
    // firing before the new topic listener when a deleted topic gets auto created
    // unregister the topic's partition-modification listener
    controller.unregisterPartitionModificationsHandlers(Seq(topic))
    val replicasForDeletedTopic = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
    // controller will remove this replica from the state machine as well as its partition assignment cache
    // remove all replicas from the replica state machine's cache
    controller.replicaStateMachine.handleStateChanges(replicasForDeletedTopic.toSeq, NonExistentReplica)
    // remove the topic from the deletion queues and from the ZooKeeper nodes
    topicsToBeDeleted -= topic
    topicsWithDeletionStarted -= topic
    zkClient.deleteTopicZNode(topic, controllerContext.epochZkVersion)
    zkClient.deleteTopicConfigs(Seq(topic), controllerContext.epochZkVersion)
    zkClient.deleteTopicDeletions(Seq(topic), controllerContext.epochZkVersion)
    // drop the topic's replica assignment, partition leader info, etc. from the controller context
    controllerContext.removeTopic(topic)
  }

Note that after the controller sends the stop-replica requests, each replica is either still in the started state (it has not responded yet), or in the ineligible or successful state, depending on whether its response carried an error. resumeDeletions then checks: if at least one replica is still in the started state, the topic is skipped (deletion is still in progress); otherwise, if at least one replica is in the ineligible state, some replica failed to delete and the topic deletion needs to be retried later.

if (controller.replicaStateMachine.isAtLeastOneReplicaInDeletionStartedState(topic)) {
         // ignore since topic deletion is in progress
         val replicasInDeletionStartedState = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionStarted)
         val replicaIds = replicasInDeletionStartedState.map(_.replica)
         val partitions = replicasInDeletionStartedState.map(_.topicPartition)
         info(s"Deletion for replicas ${replicaIds.mkString(",")} for partition ${partitions.mkString(",")} of topic $topic in progress")
       } else {
         // if you come here, then no replica is in TopicDeletionStarted and all replicas are not in
         // TopicDeletionSuccessful. That means, that either given topic haven't initiated deletion
         // or there is at least one failed replica (which means topic deletion should be retried).
         if (controller.replicaStateMachine.isAnyReplicaInState(topic, ReplicaDeletionIneligible)) {
           // mark topic for deletion retry
           markTopicForDeletionRetry(topic)
         }
       }

The method that marks a topic for a deletion retry is:

private def markTopicForDeletionRetry(topic: String) {
    // reset replica states from ReplicaDeletionIneligible to OfflineReplica
    val failedReplicas = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionIneligible)
    info(s"Retrying delete topic for topic $topic since replicas ${failedReplicas.mkString(",")} were not successfully deleted")
    controller.replicaStateMachine.handleStateChanges(failedReplicas.toSeq, OfflineReplica)
  }

As for when the retry actually happens, we will fill that in later when we encounter it.

LogDirEventNotificationHandler

LogDirEventNotificationHandler corresponds to the LogDirEventNotification event; again we go straight to its process method:

override def process(): Unit = {
     if (!isActive) return
     val sequenceNumbers = zkClient.getAllLogDirEventNotifications
     try {
       // first read the sequence numbers of the child znodes, then use them to read the broker ids
       // carried by the corresponding log dir event notifications
       val brokerIds = zkClient.getBrokerIdsFromLogDirEvents(sequenceNumbers)
       onBrokerLogDirFailure(brokerIds)
     } finally {
       // delete processed children
       zkClient.deleteLogDirEventNotifications(sequenceNumbers, controllerContext.epochZkVersion)
     }
   }

When a log dir event notification is observed (it indicates a log directory error), the controller sends LeaderAndIsrRequests to the brokers to query the state of their replicas.
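
A rough sketch of onBrokerLogDirFailure (names taken from this code base; treat the body as an approximation): it pushes every replica hosted on the affected brokers back through the OnlineReplica transition, which causes LeaderAndIsr requests to be sent so the brokers can report which replicas are now offline.

private def onBrokerLogDirFailure(brokerIds: Seq[Int]): Unit = {
  info(s"Handling log directory failure for brokers ${brokerIds.mkString(",")}")
  // transitioning to OnlineReplica resends LeaderAndIsr requests for every replica on these brokers
  val replicasOnBrokers = controllerContext.replicasOnBrokers(brokerIds.toSet)
  replicaStateMachine.handleStateChanges(replicasOnBrokers.toSeq, OnlineReplica)
}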

IsrChangeNotificationHandler

IsrChangeNotificationHandler corresponds to the IsrChangeNotification event, whose process method is:

override def process(): Unit = {
      if (!isActive) return
      // step 1: again, read the sequence numbers first
      val sequenceNumbers = zkClient.getAllIsrChangeNotifications
      try {
        // read the ISR change notifications for these sequence numbers and the partitions they refer to
        val partitions = zkClient.getPartitionsFromIsrChangeNotifications(sequenceNumbers)
        if (partitions.nonEmpty) {
          updateLeaderAndIsrCache(partitions)
          processUpdateNotifications(partitions)
        }
      } finally {
        // delete the notifications
        zkClient.deleteIsrChangeNotifications(sequenceNumbers, controllerContext.epochZkVersion)
      }
    }

It first resolves the affected partitions via the sequence numbers of the ZooKeeper nodes, then reads each partition's current state from ZooKeeper to refresh the leader/ISR cache, and finally sends update-metadata requests to the other brokers.
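
processUpdateNotifications is little more than that broadcast; a rough sketch (an approximation of the method in this version):

private def processUpdateNotifications(partitions: Seq[TopicPartition]): Unit = {
  val liveBrokers: Seq[Int] = controllerContext.liveOrShuttingDownBrokerIds.toSeq
  debug(s"Sending UpdateMetadata request to brokers $liveBrokers for partitions $partitions")
  sendUpdateMetadataRequest(liveBrokers, partitions.toSet)
}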

All of the handlers covered so far are childChangeHandlers, which watch for changes to a node's children. The controller also has two handlers that watch the nodes themselves: PreferredReplicaElectionHandler and PartitionReassignmentHandler.

PreferredReplicaElectionHandler

The controller's ZNodeChangeHandlers implement handleCreation to react to the creation of a path. When /admin/preferred_replica_election is created, PreferredReplicaElectionHandler puts a PreferredReplicaLeaderElection event onto the controller's eventManager; its process method is:

override def process(): Unit = {
      if (!isActive) return

      // We need to register the watcher if the path doesn't exist in order to detect future preferred replica
      // leader elections and we get the `path exists` check for free
      if (zkClient.registerZNodeChangeHandlerAndCheckExistence(preferredReplicaElectionHandler)) {
        val partitions = zkClient.getPreferredReplicaElection
        val partitionsForTopicsToBeDeleted = partitions.filter(p => topicDeletionManager.isTopicQueuedUpForDeletion(p.topic))
        if (partitionsForTopicsToBeDeleted.nonEmpty) {
          error(s"Skipping preferred replica election for partitions $partitionsForTopicsToBeDeleted since the " +
            "respective topics are being deleted")
        }
        onPreferredReplicaElection(partitions -- partitionsForTopicsToBeDeleted)
      }
    }
  1. Check whether the znode exists (registering the watcher at the same time).
  2. Read the data under /admin/preferred_replica_election and parse it into a Set[TopicPartition].
  3. Filter out partitions whose topics are queued for deletion.
  4. Call onPreferredReplicaElection for the remaining partitions:

private def onPreferredReplicaElection(partitions: Set[TopicPartition], isTriggeredByAutoRebalance: Boolean = false) {
    info(s"Starting preferred replica leader election for partitions ${partitions.mkString(",")}")
    try {
      partitionStateMachine.handleStateChanges(partitions.toSeq, OnlinePartition, Option(PreferredReplicaPartitionLeaderElectionStrategy))
    } catch {
      case e: ControllerMovedException =>
        error(s"Error completing preferred replica leader election for partitions ${partitions.mkString(",")} because controller has moved to another broker.", e)
        throw e
      case e: Throwable => error(s"Error completing preferred replica leader election for partitions ${partitions.mkString(",")}", e)
    } finally {
      removePartitionsFromPreferredReplicaElection(partitions, isTriggeredByAutoRebalance)
    }
  }

The partition state machine involved here will be covered in detail later. In onPreferredReplicaElection, the partitionStateMachine triggers a partition leader election on the brokers and transitions the partitions to the OnlinePartition state. Once the election completes, removePartitionsFromPreferredReplicaElection is executed:

private def removePartitionsFromPreferredReplicaElection(partitionsToBeRemoved: Set[TopicPartition],
                                                          isTriggeredByAutoRebalance : Boolean) {
   for (partition <- partitionsToBeRemoved) {
     // check the status
     val currentLeader = controllerContext.partitionLeadershipInfo(partition).leaderAndIsr.leader
     val preferredReplica = controllerContext.partitionReplicaAssignment(partition).head
     if (currentLeader == preferredReplica) {
       info(s"Partition $partition completed preferred replica leader election. New leader is $preferredReplica")
     } else {
       warn(s"Partition $partition failed to complete preferred replica leader election to $preferredReplica. " +
         s"Leader is still $currentLeader")
     }
   }
   if (!isTriggeredByAutoRebalance) {
     zkClient.deletePreferredReplicaElection(controllerContext.epochZkVersion)
     // Ensure we detect future preferred replica leader elections
     eventManager.put(PreferredReplicaLeaderElection)
   }
 }
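
For completeness, the PreferredReplicaPartitionLeaderElectionStrategy used in onPreferredReplicaElection boils down to a very small rule. A sketch of the election function (the preferred replica is simply the first replica of the assignment, and it only wins if it is alive and already in the ISR):

def preferredReplicaPartitionLeaderElection(assignment: Seq[Int],
                                            isr: Seq[Int],
                                            liveReplicas: Set[Int]): Option[Int] = {
  // the head of the assignment is the "preferred" replica; elect it only if it is alive and in sync,
  // otherwise the partition keeps its current leader
  assignment.headOption.filter(id => liveReplicas.contains(id) && isr.contains(id))
}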

PartitionReassignmentHandler

PartitionReassignmentHandler watches /admin/reassign_partitions. Going straight to its event's process method:

override def process(): Unit = {
      if (!isActive) return

      // We need to register the watcher if the path doesn't exist in order to detect future reassignments and we get
      // the `path exists` check for free
      if (zkClient.registerZNodeChangeHandlerAndCheckExistence(partitionReassignmentHandler)) {
        val partitionReassignment = zkClient.getPartitionReassignment

        // Populate `partitionsBeingReassigned` with all partitions being reassigned before invoking
        // `maybeTriggerPartitionReassignment` (see method documentation for the reason)
        partitionReassignment.foreach { case (tp, newReplicas) =>
          val reassignIsrChangeHandler = new PartitionReassignmentIsrChangeHandler(KafkaController.this, eventManager,
            tp)
          controllerContext.partitionsBeingReassigned.put(tp, ReassignedPartitionsContext(newReplicas, reassignIsrChangeHandler))
        }

        maybeTriggerPartitionReassignment(partitionReassignment.keySet)
      }
    }
  1. If the znode exists, read the partitions and their target assignments.
  2. For each partition, create a PartitionReassignmentIsrChangeHandler, which watches the data of /brokers/topics/[topic]/partitions/[partition]/state (a sketch of this handler appears after the walkthrough below).
  3. Put each TopicPartition and its ReassignedPartitionsContext into controllerContext.partitionsBeingReassigned; the context holds the handler created in step 2.
  4. Call maybeTriggerPartitionReassignment:

private def maybeTriggerPartitionReassignment(topicPartitions: Set[TopicPartition]) {
    val partitionsToBeRemovedFromReassignment = scala.collection.mutable.Set.empty[TopicPartition]
    topicPartitions.foreach { tp =>
      if (topicDeletionManager.isTopicQueuedUpForDeletion(tp.topic)) {
        error(s"Skipping reassignment of $tp since the topic is currently being deleted")
        partitionsToBeRemovedFromReassignment.add(tp)
      } else {
        val reassignedPartitionContext = controllerContext.partitionsBeingReassigned.get(tp).getOrElse {
          throw new IllegalStateException(s"Initiating reassign replicas for partition $tp not present in " +
            s"partitionsBeingReassigned: ${controllerContext.partitionsBeingReassigned.mkString(", ")}")
        }
        val newReplicas = reassignedPartitionContext.newReplicas
        val topic = tp.topic
        val assignedReplicas = controllerContext.partitionReplicaAssignment(tp)
        if (assignedReplicas.nonEmpty) {
          if (assignedReplicas == newReplicas) {
            info(s"Partition $tp to be reassigned is already assigned to replicas " +
              s"${newReplicas.mkString(",")}. Ignoring request for partition reassignment.")
            partitionsToBeRemovedFromReassignment.add(tp)
          } else {
            try {
              info(s"Handling reassignment of partition $tp to new replicas ${newReplicas.mkString(",")}")
              // first register ISR change listener
              reassignedPartitionContext.registerReassignIsrChangeHandler(zkClient)
              // mark topic ineligible for deletion for the partitions being reassigned
              topicDeletionManager.markTopicIneligibleForDeletion(Set(topic))
              onPartitionReassignment(tp, reassignedPartitionContext)
            } catch {
              case e: ControllerMovedException =>
                error(s"Error completing reassignment of partition $tp because controller has moved to another broker", e)
                throw e
              case e: Throwable =>
                error(s"Error completing reassignment of partition $tp", e)
                // remove the partition from the admin path to unblock the admin client
                partitionsToBeRemovedFromReassignment.add(tp)
            }
          }
        } else {
            error(s"Ignoring request to reassign partition $tp that doesn't exist.")
            partitionsToBeRemovedFromReassignment.add(tp)
        }
      }
    }
    removePartitionsFromReassignedPartitions(partitionsToBeRemovedFromReassignment)
  }
  1. If the topic is queued for deletion, skip the reassignment.
  2. Otherwise fetch the ReassignedPartitionsContext that was just stored in the controller context.
  3. Take the new replica assignment from the reassignedPartitionContext; if it equals the current assignment, ignore the request.
  4. Otherwise register the ISR change listener in ZooKeeper and mark the topic ineligible for deletion while the reassignment runs.
  5. Perform the partition replica reassignment (onPartitionReassignment).
  6. Remove the TopicPartitions flagged above from controllerContext.partitionsBeingReassigned and unregister the ZooKeeper watches on their state nodes. If no TopicPartitions remain in partitionsBeingReassigned, also delete the /admin/reassign_partitions znode and re-watch it to catch the next reassignment; otherwise write the remaining assignment back to ZooKeeper (one may wonder whether this rewrite could re-trigger the handler in a loop).
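
For reference, the PartitionReassignmentIsrChangeHandler registered in step 2 of the process method is another small ZNodeChangeHandler, following the same pattern as PartitionModificationsHandler; a sketch of its shape (the event name is taken from this code base and may differ slightly across versions):

class PartitionReassignmentIsrChangeHandler(controller: KafkaController,
                                            eventManager: ControllerEventManager,
                                            partition: TopicPartition) extends ZNodeChangeHandler {
  // watches /brokers/topics/[topic]/partitions/[partition]/state
  override val path: String = TopicPartitionStateZNode.path(partition)

  // when the partition's leader/ISR data changes, queue an event so the controller can check
  // whether the reassigned replicas have caught up with the ISR
  override def handleDataChange(): Unit =
    eventManager.put(controller.PartitionReassignmentIsrChange(partition))
}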

Let's take a closer look at the partition replica reassignment process.

onPartitionReassignment

When an admin command triggers a partition reassignment, the /admin/reassign_partitions path is created and the ZooKeeper watcher fires; at that point the reassignment of the partition begins. For readability we use the following abbreviations:

  • RAR: reassigned replicas, the new replica assignment
  • OAR: original list of replicas, the previous replica assignment
  • AR: current assigned replicas, the assignment currently recorded for the partition

The code of onPartitionReassignment is as follows:

private def onPartitionReassignment(topicPartition: TopicPartition, reassignedPartitionContext: ReassignedPartitionsContext) {
    // first get the RAR
    val reassignedReplicas = reassignedPartitionContext.newReplicas
    // if some newly added replicas have not yet caught up with the ISR
    if (!areReplicasInIsr(topicPartition, reassignedReplicas)) {
      info(s"New replicas ${reassignedReplicas.mkString(",")} for partition $topicPartition being reassigned not yet " +
        "caught up with the leader")
      val newReplicasNotInOldReplicaList = reassignedReplicas.toSet -- controllerContext.partitionReplicaAssignment(topicPartition).toSet
      val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicPartition)).toSet
      //1. Update AR in ZK with OAR + RAR.
      // merge old and new replicas into the current assignment, i.e. AR
      updateAssignedReplicasForPartition(topicPartition, newAndOldReplicas.toSeq)
      //2. Send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR).
      // send a LeaderAndIsr request to every replica in AR, forcing a new leader epoch
      updateLeaderEpochAndSendRequest(topicPartition, controllerContext.partitionReplicaAssignment(topicPartition),
        newAndOldReplicas.toSeq)
      //3. replicas in RAR - OAR -> NewReplica
      // tell the newly added replicas to start up and catch up with the ISR
      startNewReplicasForReassignedPartition(topicPartition, reassignedPartitionContext, newReplicasNotInOldReplicaList)
      info(s"Waiting for new replicas ${reassignedReplicas.mkString(",")} for partition ${topicPartition} being " +
        "reassigned to catch up with the leader")
    } else {
      // all new replicas have caught up with the ISR
      //4. Wait until all replicas in RAR are in sync with the leader.
      val oldReplicas = controllerContext.partitionReplicaAssignment(topicPartition).toSet -- reassignedReplicas.toSet
      //5. replicas in RAR -> OnlineReplica
      // transition the replicas in RAR to OnlineReplica
      reassignedReplicas.foreach { replica =>
        replicaStateMachine.handleStateChanges(Seq(new PartitionAndReplica(topicPartition, replica)), OnlineReplica)
      }
      //6. Set AR to RAR in memory.
      //7. Send LeaderAndIsr request with a potential new leader (if current leader not in RAR) and
      //   a new AR (using RAR) and same isr to every broker in RAR
      moveReassignedPartitionLeaderIfRequired(topicPartition, reassignedPartitionContext)
      //8. replicas in OAR - RAR -> Offline (force those replicas out of isr)
      //9. replicas in OAR - RAR -> NonExistentReplica (force those replicas to be deleted)
      // the old replicas, OAR - RAR, are transitioned to offline and then to non-existent
      stopOldReplicasOfReassignedPartition(topicPartition, reassignedPartitionContext, oldReplicas)
      //10. Update AR in ZK with RAR.
      // update AR
      updateAssignedReplicasForPartition(topicPartition, reassignedReplicas)
      //11. Update the /admin/reassign_partitions path in ZK to remove this partition.
      removePartitionsFromReassignedPartitions(Set(topicPartition))
      //12. After electing leader, the replicas and isr information changes, so resend the update metadata request to every broker
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicPartition))
      // signal delete topic thread if reassignment for some partitions belonging to topics being deleted just completed
      topicDeletionManager.resumeDeletionForTopics(Set(topicPartition.topic))
    }
  }

The reassignment steps are:

  1. Update AR to OAR + RAR.
  2. Send LeaderAndIsr requests to every replica in OAR + RAR, starting a new leader epoch.
  3. Move the replicas in RAR - OAR to the NewReplica state.
  4. Wait until all replicas in RAR are in sync with the leader (see the sketch after this list).
  5. Move all replicas in RAR to OnlineReplica.
  6. Set AR to RAR in the controller context.
  7. If the current leader is not in RAR, elect a new leader from RAR; otherwise just increment the leader epoch.
  8. Move the replicas in OAR - RAR to OfflineReplica, remove them from the ISR, and send a LeaderAndIsr request to the leader announcing the shrunken ISR.
  9. Move the replicas in OAR - RAR to NonExistentReplica, which tells those brokers to physically delete the replica files on disk.
  10. Update the AR information in ZooKeeper.
  11. Remove this partition from the /admin/reassign_partitions znode.
  12. Since the replica and ISR information has changed, send update-metadata requests to every broker.
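
The "caught up with the ISR" check in step 4 (areReplicasInIsr in the code above) is roughly a read of the partition state znode followed by a subset test; a sketch (an approximation of the helper in this version):

private def areReplicasInIsr(partition: TopicPartition, replicas: Seq[Int]): Boolean = {
  // read /brokers/topics/[topic]/partitions/[partition]/state and check that every
  // reassigned replica is already a member of the ISR
  zkClient.getTopicPartitionStates(Seq(partition)).get(partition).exists { leaderIsrAndControllerEpoch =>
    replicas.forall(leaderIsrAndControllerEpoch.leaderAndIsr.isr.contains)
  }
}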