Zk Handler
kafka的controller上面主要注册了以下两种类型的zk handler
- childChangeHandler
- nodeChangeHandler
childChangeHandler关注的是某个路径下子节点的变化,而nodeChangeHandler关注的是节点本身(创建、删除、数据)的变化。childChangeHandler有以下几种
- brokerChangeHandler
- topicChangeHandler
- topicDeletionHandler
- logDirEventNotificationHandler
- isrChangeNotificationHandler
| handler | 监听路径 | 作用 |
|---|---|---|
| brokerChangeHandler | /brokers/ids | 监听broker的上线和下线 |
| topicChangeHandler | /brokers/topics | 监听topic的创建 |
| topicDeletionHandler | /admin/delete_topics | 监听topic的删除 |
| logDirEventNotificationHandler | /log_dir_event_notification | 监听broker日志目录故障的通知 |
| isrChangeNotificationHandler | /isr_change_notification | partition isr的改变 |
nodeChangeHandler有两种
- preferredReplicaElectionHandler
- partitionReassignmentHandler
| handler | 监听路径 | 作用 |
|---|---|---|
| preferredReplicaElectionHandler | /admin/preferred_replica_election | 为了partition leader的选举 |
| partitionReassignmentHandler | /admin/reassign_partitions | 用于分区副本的迁移 |
下面我们逐一介绍这些handler的处理方法。理解了这些,我们就能知道在操作kafka时,server背后到底在做什么。
ZNodeChildChangeHandler
ZNodeChildChangeHandler是一个trait,提供了一个属性path和一个方法handleChildChange
trait ZNodeChildChangeHandler {
val path: String
def handleChildChange(): Unit = {}
}
childChangeHandler都是其子类。
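作为对照,后面要介绍的nodeChangeHandler(以及PartitionModificationsHandler)实现的是另一个trait ZNodeChangeHandler,它关注节点本身的创建、删除和数据变化,大致如下(示意,具体以所用版本的源码为准):
trait ZNodeChangeHandler {
  val path: String
  def handleCreation(): Unit = {}
  def handleDeletion(): Unit = {}
  def handleDataChange(): Unit = {}
}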
BrokerChangeHandler
BrokerChangeHandler的handle实现方法如下:
override def handleChildChange(): Unit = {
eventManager.put(controller.BrokerChange)
}
它在controller的主线程队列中添加进一个BrokerChange的event(ControllerEvent)。我们知道,controller的ControllerEventThread线程在拿到event以后,会将_state改成event中指定的状态,同时执行event的process方法。
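这些event都实现了ControllerEvent,它大致长这样(示意):
sealed trait ControllerEvent {
  def state: ControllerState
  def process(): Unit
}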
BrokerChange中指定的controller状态是BrokerChange,它的处理方法是
override def process(): Unit = {
if (!isActive) return
val curBrokers = zkClient.getAllBrokersInCluster.toSet
val curBrokerIds = curBrokers.map(_.id)
val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
val newBrokers = curBrokers.filter(broker => newBrokerIds(broker.id))
controllerContext.liveBrokers = curBrokers
val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
info(s"Newly added brokers: ${newBrokerIdsSorted.mkString(",")}, " +
s"deleted brokers: ${deadBrokerIdsSorted.mkString(",")}, all live brokers: ${liveBrokerIdsSorted.mkString(",")}")
newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker)
if (newBrokerIds.nonEmpty)
onBrokerStartup(newBrokerIdsSorted)
if (deadBrokerIds.nonEmpty)
onBrokerFailure(deadBrokerIdsSorted)
}
- 先从zk的/brokers/ids下面获取到线上所有的broker id
- 从controller的上下文controllerContext中获取到缓存的broker id列表(这些broker有些可能已经下线)
- 通过线上broker减去上下文中broker,得到新加入的broker列表newBrokerIds
- 通过上下文中broker减去线上broker,得到下线的broker列表deadBrokerIds
- controller上下文中执行加入新broker和下线broker的操作
- 将broker变化的信息同步到集群中
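用一个小例子说明上面第3、4步的集合运算(broker id是假设的,仅作说明):
// 假设zk中当前在线的broker为{1,2,4},controller缓存中的broker为{1,2,3}
val curBrokerIds = Set(1, 2, 4)
val liveOrShuttingDownBrokerIds = Set(1, 2, 3)
val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds  // Set(4):新上线的broker
val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds // Set(3):已下线的broker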
我们重点看第5点和第6点。
controller中执行broker上下线
主要是两个方法:
def addBroker(broker: Broker) {
// be careful here. Maybe the startup() API has already started the request send thread
brokerLock synchronized {
if (!brokerStateInfo.contains(broker.id)) {
addNewBroker(broker)
startRequestSendThread(broker.id)
}
}
}
和
def removeBroker(brokerId: Int) {
brokerLock synchronized {
removeExistingBroker(brokerStateInfo(brokerId))
}
}
先看addNewBroker
addNewBroker
private def addNewBroker(broker: Broker) {
val messageQueue = new LinkedBlockingQueue[QueueItem]
debug(s"Controller ${config.brokerId} trying to connect to broker ${broker.id}")
val brokerNode = broker.node(config.interBrokerListenerName)
val logContext = new LogContext(s"[Controller id=${config.brokerId}, targetBrokerId=${brokerNode.idString}] ")
val networkClient = {
val channelBuilder = ...
val selector = ...
new NetworkClient(
...
)
}
val threadName = threadNamePrefix match {
case None => s"Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
case Some(name) => s"$name:Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
}
val requestRateAndQueueTimeMetrics = newTimer(
RequestRateAndQueueTimeMetricName, TimeUnit.MILLISECONDS, TimeUnit.SECONDS, brokerMetricTags(broker.id)
)
val requestThread = new RequestSendThread(config.brokerId, controllerContext, messageQueue, networkClient,
brokerNode, config, time, requestRateAndQueueTimeMetrics, stateChangeLogger, threadName)
requestThread.setDaemon(false)
val queueSizeGauge = newGauge(
QueueSizeMetricName,
new Gauge[Int] {
def value: Int = messageQueue.size
},
brokerMetricTags(broker.id)
)
brokerStateInfo.put(broker.id, ControllerBrokerStateInfo(networkClient, brokerNode, messageQueue,
requestThread, queueSizeGauge, requestRateAndQueueTimeMetrics))
}
因为篇幅有限,这里先省略NetworkClient的构造代码,以后讲网络的时候再分析。从上面的代码能看出,controller为每个broker显式构造了一个用于发送请求的RequestSendThread线程,然后把它放入到ControllerChannelManager的brokerStateInfo中就返回了(此时线程还没有启动)。
addBroker接着调用startRequestSendThread方法启动这个线程
protected def startRequestSendThread(brokerId: Int) {
val requestThread = brokerStateInfo(brokerId).requestSendThread
if (requestThread.getState == Thread.State.NEW)
requestThread.start()
}
那么,到底是发的什么请求呢?
RequestSendThread和我们前面在eventManager中看到的ControllerEventThread一样,都是ShutdownableThread,它的特点是不断从队列中获取某种Item并处理。而addBroker时传入的messageQueue,正是用来存放请求内容的队列。RequestSendThread不断从队列中获取QueueItem
case class QueueItem(apiKey: ApiKeys, request: AbstractRequest.Builder[_ <: AbstractRequest],
callback: AbstractResponse => Unit, enqueueTimeMs: Long)
然后再发送请求到broker。所以addBroker相当于是维护了一个和broker进行通信的入口。当想要发送请求时,从broker id对应的ControllerBrokerStateInfo中获取到消息队列,再放入请求的消息即可。
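当controller中其他地方想给某个broker发请求时,大致就是像下面这样把QueueItem放进对应broker的messageQueue(示意性草图,基于上面描述的机制,签名和细节以实际版本为准):
def sendRequest(brokerId: Int, apiKey: ApiKeys, request: AbstractRequest.Builder[_ <: AbstractRequest],
                callback: AbstractResponse => Unit = null): Unit = {
  brokerLock synchronized {
    brokerStateInfo.get(brokerId) match {
      case Some(stateInfo) =>
        // 放入队列,由该broker对应的RequestSendThread异步取出并发送
        stateInfo.messageQueue.put(QueueItem(apiKey, request, callback, time.milliseconds()))
      case None =>
        // broker已下线,忽略本次请求
        warn(s"Not sending request $request to broker $brokerId, since it is offline.")
    }
  }
}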
removeBroker
removeBroker的过程与add正好相反
def removeBroker(brokerId: Int) {
brokerLock synchronized {
removeExistingBroker(brokerStateInfo(brokerId))
}
}
其中:
private def removeExistingBroker(brokerState: ControllerBrokerStateInfo) {
try {
// Shutdown the RequestSendThread before closing the NetworkClient to avoid the concurrent use of the
// non-threadsafe classes as described in KAFKA-4959.
// The call to shutdownLatch.await() in ShutdownableThread.shutdown() serves as a synchronization barrier that
// hands off the NetworkClient from the RequestSendThread to the ZkEventThread.
brokerState.requestSendThread.shutdown()
brokerState.networkClient.close()
brokerState.messageQueue.clear()
removeMetric(QueueSizeMetricName, brokerMetricTags(brokerState.brokerNode.id))
removeMetric(RequestRateAndQueueTimeMetricName, brokerMetricTags(brokerState.brokerNode.id))
brokerStateInfo.remove(brokerState.brokerNode.id)
} catch {
case e: Throwable => error("Error while removing broker by the controller", e)
}
}
- 先停止requestSendThread
- 停止对应的networkClient
- 清空队列
- 移除metrics
- 将broker从brokerStateInfo中移除
BrokerChangeHandler就介绍到这里。
TopicChangeHandler
我们直奔主题,看TopicChangeHandler对应的TopicChange的process方法
override def process(): Unit = {
if (!isActive) return
//step 1
val topics = zkClient.getAllTopicsInCluster.toSet
//step 2
val newTopics = topics -- controllerContext.allTopics
val deletedTopics = controllerContext.allTopics -- topics
controllerContext.allTopics = topics
//step 3
registerPartitionModificationsHandlers(newTopics.toSeq)
//step 4
val addedPartitionReplicaAssignment = zkClient.getReplicaAssignmentForTopics(newTopics)
//step 5
deletedTopics.foreach(controllerContext.removeTopic)
//step 6
addedPartitionReplicaAssignment.foreach {
case (topicAndPartition, newReplicas) =>
controllerContext.updatePartitionReplicaAssignment(topicAndPartition, newReplicas)
}
info(s"New topics: [$newTopics], deleted topics: [$deletedTopics], new partition replica assignment " +
s"[$addedPartitionReplicaAssignment]")
if (addedPartitionReplicaAssignment.nonEmpty)
//step 7
onNewPartitionCreation(addedPartitionReplicaAssignment.keySet)
}
分为以下几步:
- 从zk获取所有topic列表
- 与context中topic对比,获取到新添加和删除的topic列表
- 对于新添加的topic,在其上注册PartitionModificationsHandler
- 从zk获取新topic的partition replica分配情况
- 每个删除的topic执行removeTopic操作
- 更新controller context中partition assign的相关信息
- 对新增的partition执行onNewPartitionCreation,通过partitionStateMachine和replicaStateMachine等状态机完成新partition和replica的创建
我们从PartitionModificationsHandler开始看。
PartitionModificationsHandler
class PartitionModificationsHandler(controller: KafkaController, eventManager: ControllerEventManager, topic: String) extends ZNodeChangeHandler {
override val path: String = TopicZNode.path(topic)
override def handleDataChange(): Unit = eventManager.put(controller.PartitionModifications(topic))
}
它监听的是topic下partition发生改变的情况。当节点数据改变时,往controller的eventManager中放入controller.PartitionModifications(topic)。这个event的process方法为:
override def process(): Unit = {
if (!isActive) return
//获取每个partition的assignment情况
val partitionReplicaAssignment = zkClient.getReplicaAssignmentForTopics(immutable.Set(topic))
val partitionsToBeAdded = partitionReplicaAssignment.filter { case (topicPartition, _) =>
controllerContext.partitionReplicaAssignment(topicPartition).isEmpty
}
if (topicDeletionManager.isTopicQueuedUpForDeletion(topic))
if (partitionsToBeAdded.nonEmpty) {
warn("Skipping adding partitions %s for topic %s since it is currently being deleted"
.format(partitionsToBeAdded.map(_._1.partition).mkString(","), topic))
restorePartitionReplicaAssignment(topic, partitionReplicaAssignment)
} else {
// This can happen if existing partition replica assignment are restored to prevent increasing partition count during topic deletion
info("Ignoring partition change during topic deletion as no new partitions are added")
}
else {
if (partitionsToBeAdded.nonEmpty) {
info(s"New partitions to be added $partitionsToBeAdded")
partitionsToBeAdded.foreach { case (topicPartition, assignedReplicas) =>
controllerContext.updatePartitionReplicaAssignment(topicPartition, assignedReplicas)
}
onNewPartitionCreation(partitionsToBeAdded.keySet)
}
}
}
- 获取每个partition的assignment情况,结果是一个Map[TopicPartition, Seq[Int]]
- 获取待添加的新topicPartition:partitionsToBeAdded
- 如果topic正在被删除,则跳过新增的partition,并把assignment回滚为原来的样子
- 否则更新controller context中的topicPartition的分配缓存
- 执行onNewPartitionCreation,通过partitionStateMachine和replicaStateMachine等状态机完成新partition的创建
综上来看,当topic新增partition时,controller主要是更新自己缓存的topicPartition assignment信息并创建新的partition;当新添加topic时,对新topic执行的也是类似的操作,核心同样是topic assignment的更新。
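为了直观一点,getReplicaAssignmentForTopics返回的Map[TopicPartition, Seq[Int]]大致长这样(topic名和broker id是假设的):
import org.apache.kafka.common.TopicPartition

// 假设topic "foo"有2个partition,副本因子为3
val assignment: Map[TopicPartition, Seq[Int]] = Map(
  new TopicPartition("foo", 0) -> Seq(1, 2, 3), // partition 0的副本分布在broker 1、2、3上
  new TopicPartition("foo", 1) -> Seq(2, 3, 1)  // partition 1的副本分布在broker 2、3、1上
)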
TopicDeletionHandler
其process方法是:
override def process(): Unit = {
if (!isActive) return
var topicsToBeDeleted = zkClient.getTopicDeletions.toSet
debug(s"Delete topics listener fired for topics ${topicsToBeDeleted.mkString(",")} to be deleted")
val nonExistentTopics = topicsToBeDeleted -- controllerContext.allTopics
if (nonExistentTopics.nonEmpty) {
warn(s"Ignoring request to delete non-existing topics ${nonExistentTopics.mkString(",")}")
zkClient.deleteTopicDeletions(nonExistentTopics.toSeq, controllerContext.epochZkVersion)
}
topicsToBeDeleted --= nonExistentTopics
if (config.deleteTopicEnable) {
if (topicsToBeDeleted.nonEmpty) {
info(s"Starting topic deletion for topics ${topicsToBeDeleted.mkString(",")}")
// mark topic ineligible for deletion if other state changes are in progress
topicsToBeDeleted.foreach { topic =>
val partitionReassignmentInProgress =
controllerContext.partitionsBeingReassigned.keySet.map(_.topic).contains(topic)
if (partitionReassignmentInProgress)
topicDeletionManager.markTopicIneligibleForDeletion(Set(topic))
}
// add topic to deletion list
topicDeletionManager.enqueueTopicsForDeletion(topicsToBeDeleted)
}
} else {
// If delete topic is disabled remove entries under zookeeper path : /admin/delete_topics
info(s"Removing $topicsToBeDeleted since delete topic is disabled")
zkClient.deleteTopicDeletions(topicsToBeDeleted.toSeq, controllerContext.epochZkVersion)
}
}
- 从zk的/admin/delete_topics中获取到待删除的topic列表
- 忽略掉不存在的topic
- 如果配置准许删除topic,则:对于正在执行reassign操作的topic,先将其标记为暂不可删除(markTopicIneligibleForDeletion)
- 然后将所有待删除的topic放入topicDeletionManager的deletion队列中(enqueueTopicsForDeletion)
- 如果配置不准许删除topic,直接从/admin/delete_topics中删除对应的节点
那么TopicDeletionManager是怎样删除topic的呢?
TopicDeletionManager
TopicDeletionManager管理着topic删除动作的状态机。当调用命令删除topic时,在zk的/admin/delete_topics下面会创建对应topic的节点,controller监听着这个路径并且删除对应的topic。在删除topic之前会判断,如果
- topic的某个replica所在的broker此时下线了
- 正在进行这个topic的partition reassignment
则topic会判断为不能够删除。而当
- replica所在的broker上线了
- partition reassignment完成了
的时候,会继续topic的删除动作。
每一个待删除topic的replica都可能处于以下三种状态之一
- TopicDeletionStarted:当调用了onPartitionDeletion以后replica进入此状态。controller监听到delete_topics子节点变化后,会发送StopReplicaRequest到所有的replica,并在StopReplicaResponse上注册回调,在每个replica响应删除请求时执行。
- TopicDeletionSuccessful:StopReplicaResponse返回成功(无错误码)时,replica移入此状态
- TopicDeletionFailed:StopReplicaResponse返回错误码时,replica移入此状态
一个topic被删除成功,当且仅当所有的replica都进入了TopicDeletionSuccessful状态。如果没有replica还在TopicDeletionStarted状态,并且至少一个replica进入TopicDeletionFailed状态,就会将这个topic标记用来重试。
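TopicDeletionManager内部用topicsToBeDeleted和topicsIneligibleForDeletion两个集合来跟踪这些信息,一个简化的"此刻能否删除"的判定大致如下(示意,省略了删除是否已在进行中等判断):
// 示意:topic此刻是否可以(重新)发起删除
private def isTopicEligibleForDeletion(topic: String): Boolean =
  topicsToBeDeleted.contains(topic) && !topicsIneligibleForDeletion.contains(topic)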
上面提到,当从zk获取到待删除的topic时会执行enqueueTopicsForDeletion方法
def enqueueTopicsForDeletion(topics: Set[String]) {
if (isDeleteTopicEnabled) {
topicsToBeDeleted ++= topics
resumeDeletions()
}
}
它将待删除的topic放入到manager的集合topicsToBeDeleted中,然后继续执行topic的删除动作
private def resumeDeletions(): Unit = {
val topicsQueuedForDeletion = Set.empty[String] ++ topicsToBeDeleted
if (topicsQueuedForDeletion.nonEmpty)
info(s"Handling deletion for topics ${topicsQueuedForDeletion.mkString(",")}")
topicsQueuedForDeletion.foreach { topic =>
// if all replicas are marked as deleted successfully, then topic deletion is done
//如果replica都已经删除成功
if (controller.replicaStateMachine.areAllReplicasForTopicDeleted(topic)) {
// clear up all state for this topic from controller cache and zookeeper
completeDeleteTopic(topic)
info(s"Deletion of topic $topic successfully completed")
} else {
//至少有一个replica已经开始删除topic
if (controller.replicaStateMachine.isAtLeastOneReplicaInDeletionStartedState(topic)) {
// ignore since topic deletion is in progress
val replicasInDeletionStartedState = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionStarted)
val replicaIds = replicasInDeletionStartedState.map(_.replica)
val partitions = replicasInDeletionStartedState.map(_.topicPartition)
info(s"Deletion for replicas ${replicaIds.mkString(",")} for partition ${partitions.mkString(",")} of topic $topic in progress")
} else {
// if you come here, then no replica is in TopicDeletionStarted and all replicas are not in
// TopicDeletionSuccessful. That means, that either given topic haven't initiated deletion
// or there is at least one failed replica (which means topic deletion should be retried).
//此时,没有一个replica在start状态,表示replica都还没开始删除,或者至少有一个replica失败了
if (controller.replicaStateMachine.isAnyReplicaInState(topic, ReplicaDeletionIneligible)) {
// mark topic for deletion retry
markTopicForDeletionRetry(topic)
}
}
}
// Try delete topic if it is eligible for deletion.
if (isTopicEligibleForDeletion(topic)) {
info(s"Deletion of topic $topic (re)started")
// topic deletion will be kicked off
onTopicDeletion(Set(topic))
} else if (isTopicIneligibleForDeletion(topic)) {
info(s"Not retrying deletion of topic $topic at this time since it is marked ineligible for deletion")
}
}
}
当controller刚开始执行删除动作时,执行的是onTopicDeletion(Set(topic))方法
private def onTopicDeletion(topics: Set[String]) {
info(s"Topic deletion callback for ${topics.mkString(",")}")
// send update metadata so that brokers stop serving data for topics to be deleted
val partitions = topics.flatMap(controllerContext.partitionsForTopic)
val unseenTopicsForDeletion = topics -- topicsWithDeletionStarted
if (unseenTopicsForDeletion.nonEmpty) {
val unseenPartitionsForDeletion = unseenTopicsForDeletion.flatMap(controllerContext.partitionsForTopic)
//将所有的partition先下线
controller.partitionStateMachine.handleStateChanges(unseenPartitionsForDeletion.toSeq, OfflinePartition)
controller.partitionStateMachine.handleStateChanges(unseenPartitionsForDeletion.toSeq, NonExistentPartition)
// adding of unseenTopicsForDeletion to topicsBeingDeleted must be done after the partition state changes
// to make sure the offlinePartitionCount metric is properly updated
topicsWithDeletionStarted ++= unseenTopicsForDeletion
}
controller.sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, partitions)
topics.foreach { topic =>
onPartitionDeletion(controllerContext.partitionsForTopic(topic))
}
}
controller先是发送更新元信息请求(partition的leader)到所有存活的broker,然后再删除每个partition。删除partition时执行的方法是:
private def onPartitionDeletion(partitionsToBeDeleted: Set[TopicPartition]) {
info(s"Partition deletion callback for ${partitionsToBeDeleted.mkString(",")}")
val replicasPerPartition = controllerContext.replicasForPartition(partitionsToBeDeleted)
startReplicaDeletion(replicasPerPartition)
}
它先获取到这个partition所有的replica,再通知replica删除topic
private def startReplicaDeletion(replicasForTopicsToBeDeleted: Set[PartitionAndReplica]) {
replicasForTopicsToBeDeleted.groupBy(_.topic).keys.foreach { topic =>
val aliveReplicasForTopic = controllerContext.allLiveReplicas().filter(p => p.topic == topic)
val deadReplicasForTopic = replicasForTopicsToBeDeleted -- aliveReplicasForTopic
val successfullyDeletedReplicas = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
val replicasForDeletionRetry = aliveReplicasForTopic -- successfullyDeletedReplicas
// move dead replicas directly to failed state
//如果某个replica这时候挂了,replica标记为ReplicaDeletionIneligible
controller.replicaStateMachine.handleStateChanges(deadReplicasForTopic.toSeq, ReplicaDeletionIneligible, new Callbacks())
// send stop replica to all followers that are not in the OfflineReplica state so they stop sending fetch requests to the leader
//将replica都下线,停止向leader拉消息
controller.replicaStateMachine.handleStateChanges(replicasForDeletionRetry.toSeq, OfflineReplica, new Callbacks())
debug(s"Deletion started for replicas ${replicasForDeletionRetry.mkString(",")}")
//replica标记为ReplicaDeletionStarted状态
controller.replicaStateMachine.handleStateChanges(replicasForDeletionRetry.toSeq, ReplicaDeletionStarted,
new Callbacks(stopReplicaResponseCallback = (stopReplicaResponseObj, replicaId) =>
eventManager.put(controller.TopicDeletionStopReplicaResponseReceived(stopReplicaResponseObj, replicaId))))
if (deadReplicasForTopic.nonEmpty) {
debug(s"Dead Replicas (${deadReplicasForTopic.mkString(",")}) found for topic $topic")
//如果有replica下线,先暂停topic的删除
markTopicIneligibleForDeletion(Set(topic))
}
}
}
在将replica置为ReplicaDeletionStarted状态时,注册了StopReplicaResponse的回调:
eventManager.put(controller.TopicDeletionStopReplicaResponseReceived(stopReplicaResponseObj, replicaId))
controller处理响应的方式同样是向eventManager中放入一个ControllerEvent:TopicDeletionStopReplicaResponseReceived。它的处理方法是:
override def process(): Unit = {
import JavaConverters._
if (!isActive) return
val stopReplicaResponse = stopReplicaResponseObj.asInstanceOf[StopReplicaResponse]
debug(s"Delete topic callback invoked for $stopReplicaResponse")
val responseMap = stopReplicaResponse.responses.asScala
val partitionsInError =
if (stopReplicaResponse.error != Errors.NONE) responseMap.keySet
else responseMap.filter { case (_, error) => error != Errors.NONE }.keySet
val replicasInError = partitionsInError.map(PartitionAndReplica(_, replicaId))
// move all the failed replicas to ReplicaDeletionIneligible
topicDeletionManager.failReplicaDeletion(replicasInError)
if (replicasInError.size != responseMap.size) {
// some replicas could have been successfully deleted
val deletedReplicas = responseMap.keySet -- partitionsInError
topicDeletionManager.completeReplicaDeletion(deletedReplicas.map(PartitionAndReplica(_, replicaId)))
}
}
controller先获取到所有报错的partition对应的replica(partitionsInError/replicasInError)。对于报错的replica,执行topicDeletionManager的failReplicaDeletion;如果不是所有的partition replica都出错(即有删除成功的replica),再对成功的replica执行topicDeletionManager的completeReplicaDeletion方法。我们先看failReplicaDeletion方法
def failReplicaDeletion(replicas: Set[PartitionAndReplica]) {
if (isDeleteTopicEnabled) {
val replicasThatFailedToDelete = replicas.filter(r => isTopicQueuedUpForDeletion(r.topic))
if (replicasThatFailedToDelete.nonEmpty) {
val topics = replicasThatFailedToDelete.map(_.topic)
debug(s"Deletion failed for replicas ${replicasThatFailedToDelete.mkString(",")}. Halting deletion for topics $topics")
controller.replicaStateMachine.handleStateChanges(replicasThatFailedToDelete.toSeq, ReplicaDeletionIneligible)
markTopicIneligibleForDeletion(topics)
resumeDeletions()
}
}
}
这个方法将replica的状态标记为ReplicaDeletionIneligible,并且将这个topic标记为topicsIneligibleForDeletion等待后面再删除。
def completeReplicaDeletion(replicas: Set[PartitionAndReplica]) {
val successfullyDeletedReplicas = replicas.filter(r => isTopicQueuedUpForDeletion(r.topic))
debug(s"Deletion successfully completed for replicas ${successfullyDeletedReplicas.mkString(",")}")
controller.replicaStateMachine.handleStateChanges(successfullyDeletedReplicas.toSeq, ReplicaDeletionSuccessful)
resumeDeletions()
}
这个方法将replica标记为ReplicaDeletionSuccessful状态。
当controller通知完replica删除topic partition之后,我们再回到topicDeletionManager的resumeDeletions方法。此时如果每一个replica都完成了topic partition的删除,即进入了ReplicaDeletionSuccessful状态,则结束topic的删除
if (controller.replicaStateMachine.areAllReplicasForTopicDeleted(topic)) {
// clear up all state for this topic from controller cache and zookeeper
completeDeleteTopic(topic)
info(s"Deletion of topic $topic successfully completed")
}
其中,completeDeleteTopic的实现是
private def completeDeleteTopic(topic: String) {
// deregister partition change listener on the deleted topic. This is to prevent the partition change listener
// firing before the new topic listener when a deleted topic gets auto created
//移除topic的partition修改监听器
controller.unregisterPartitionModificationsHandlers(Seq(topic))
val replicasForDeletedTopic = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
// controller will remove this replica from the state machine as well as its partition assignment cache
//从replica状态机缓存中将replica都删除
controller.replicaStateMachine.handleStateChanges(replicasForDeletedTopic.toSeq, NonExistentReplica)
//从队列和zk的节点中将删除的topic删除掉
topicsToBeDeleted -= topic
topicsWithDeletionStarted -= topic
zkClient.deleteTopicZNode(topic, controllerContext.epochZkVersion)
zkClient.deleteTopicConfigs(Seq(topic), controllerContext.epochZkVersion)
zkClient.deleteTopicDeletions(Seq(topic), controllerContext.epochZkVersion)
//controller中移除这个topic的replica assignment和partition leader等信息
controllerContext.removeTopic(topic)
}
注意,在controller发出删除replica的请求(StopReplicaRequest)之后,一个replica的状态要么仍是started(即这个replica还没有响应请求),要么根据响应是否返回错误进入了Ineligible或者successful状态。在resumeDeletions方法中会继续判断:如果至少存在一个replica处于started状态,则跳过(因为删除还在进行中);否则如果至少有一个replica处于ineligible状态,则表示有replica删除失败,后面需要对该topic的删除进行重试。
if (controller.replicaStateMachine.isAtLeastOneReplicaInDeletionStartedState(topic)) {
// ignore since topic deletion is in progress
val replicasInDeletionStartedState = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionStarted)
val replicaIds = replicasInDeletionStartedState.map(_.replica)
val partitions = replicasInDeletionStartedState.map(_.topicPartition)
info(s"Deletion for replicas ${replicaIds.mkString(",")} for partition ${partitions.mkString(",")} of topic $topic in progress")
} else {
// if you come here, then no replica is in TopicDeletionStarted and all replicas are not in
// TopicDeletionSuccessful. That means, that either given topic haven't initiated deletion
// or there is at least one failed replica (which means topic deletion should be retried).
if (controller.replicaStateMachine.isAnyReplicaInState(topic, ReplicaDeletionIneligible)) {
// mark topic for deletion retry
markTopicForDeletionRetry(topic)
}
}
标记topic重试删除的方法是
private def markTopicForDeletionRetry(topic: String) {
// reset replica states from ReplicaDeletionIneligible to OfflineReplica
val failedReplicas = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionIneligible)
info(s"Retrying delete topic for topic $topic since replicas ${failedReplicas.mkString(",")} were not successfully deleted")
controller.replicaStateMachine.handleStateChanges(failedReplicas.toSeq, OfflineReplica)
}
至于什么时候重试,我们在后面遇到了再补充。
LogDirEventNotificationHandler
LogDirEventNotificationHandler对应的event是LogDirEventNotification,同样,我们直接看其process方法:
override def process(): Unit = {
if (!isActive) return
val sequenceNumbers = zkClient.getAllLogDirEventNotifications
try {
//先获取到子节点znode的sequenceNumber,再通过sequenceNumber读取相关的log dir event notification对应的broker id
val brokerIds = zkClient.getBrokerIdsFromLogDirEvents(sequenceNumbers)
onBrokerLogDirFailure(brokerIds)
} finally {
// delete processed children
zkClient.deleteLogDirEventNotifications(sequenceNumbers, controllerContext.epochZkVersion)
}
}
当监听到有log dir event notification时(表示某个broker的日志目录出现了错误),controller会向这些broker上的replica发送LeaderAndIsrRequest,以确认replica的状态。
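onBrokerLogDirFailure的做法大致如下(示意):把这些broker上的所有replica重新走一遍OnlineReplica的状态转移,借此发出LeaderAndIsrRequest,由broker在响应中报告replica是否仍然可用。
private def onBrokerLogDirFailure(brokerIds: Seq[Int]): Unit = {
  // 找出这些broker上的所有replica,通过replica状态机向它们发送LeaderAndIsrRequest
  val replicasOnBrokers = controllerContext.replicasOnBrokers(brokerIds.toSet)
  replicaStateMachine.handleStateChanges(replicasOnBrokers.toSeq, OnlineReplica)
}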
IsrChangeNotificationHandler
IsrChangeNotificationHandler对应的event是IsrChangeNotification,其process方法是:
override def process(): Unit = {
if (!isActive) return
//第一步依旧是先获取sequence number
val sequenceNumbers = zkClient.getAllIsrChangeNotifications
try {
//获取sequence number相关的isr change notification和对应的partition信息
val partitions = zkClient.getPartitionsFromIsrChangeNotifications(sequenceNumbers)
if (partitions.nonEmpty) {
updateLeaderAndIsrCache(partitions)
processUpdateNotifications(partitions)
}
} finally {
// delete the notifications
zkClient.deleteIsrChangeNotifications(sequenceNumbers, controllerContext.epochZkVersion)
}
}
首先获取通知节点的sequence number,再通过它们从zk读取发生isr变化的partition,更新controller缓存中的leader和isr信息,最后向集群中的broker发送更新元数据的请求。
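其中最后一步processUpdateNotifications大致如下(示意):
private def processUpdateNotifications(partitions: Seq[TopicPartition]): Unit = {
  val liveBrokers: Seq[Int] = controllerContext.liveOrShuttingDownBrokerIds.toSeq
  // 向所有存活的broker发送UpdateMetadataRequest,同步最新的leader和isr信息
  sendUpdateMetadataRequest(liveBrokers, partitions.toSet)
}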
前面介绍的几种handler都是childChangeHandlers,它们监听的都是某个节点下面子节点的变化,controller还有两个handler用于监听节点的变化:PreferredReplicaElectionHandler和PartitionReassignmentHandler
PreferredReplicaElectionHandler
controller的znodeChangeHandler都实现了handleCreation方法用来监听某个路径的创建。PreferredReplicaElectionHandler在/admin/preferred_replica_election创建时往controller的eventManager中放入PreferredReplicaLeaderElection,其process方法是:
override def process(): Unit = {
if (!isActive) return
// We need to register the watcher if the path doesn't exist in order to detect future preferred replica
// leader elections and we get the `path exists` check for free
if (zkClient.registerZNodeChangeHandlerAndCheckExistence(preferredReplicaElectionHandler)) {
val partitions = zkClient.getPreferredReplicaElection
val partitionsForTopicsToBeDeleted = partitions.filter(p => topicDeletionManager.isTopicQueuedUpForDeletion(p.topic))
if (partitionsForTopicsToBeDeleted.nonEmpty) {
error(s"Skipping preferred replica election for partitions $partitionsForTopicsToBeDeleted since the " +
"respective topics are being deleted")
}
onPreferredReplicaElection(partitions -- partitionsForTopicsToBeDeleted)
}
}
- 先判断节点是否存在
- 获取/admin/preferred_replica_election的数据,解析为Set[TopicPartition]
- 过滤掉需要被删除的topic
- 对剩余的partition执行onPreferredReplicaElection方法
private def onPreferredReplicaElection(partitions: Set[TopicPartition], isTriggeredByAutoRebalance: Boolean = false) {
info(s"Starting preferred replica leader election for partitions ${partitions.mkString(",")}")
try {
partitionStateMachine.handleStateChanges(partitions.toSeq, OnlinePartition, Option(PreferredReplicaPartitionLeaderElectionStrategy))
} catch {
case e: ControllerMovedException =>
error(s"Error completing preferred replica leader election for partitions ${partitions.mkString(",")} because controller has moved to another broker.", e)
throw e
case e: Throwable => error(s"Error completing preferred replica leader election for partitions ${partitions.mkString(",")}", e)
} finally {
removePartitionsFromPreferredReplicaElection(partitions, isTriggeredByAutoRebalance)
}
}
这里涉及到的partition state状态机,我们在后面会详细介绍。在onPreferredReplicaElection中,partitionStateMachine触发了各broker进行partition leader的选举,并且将partition的状态转移为OnlinePartition状态。当选举完成后,执行removePartitionsFromPreferredReplicaElection方法。
private def removePartitionsFromPreferredReplicaElection(partitionsToBeRemoved: Set[TopicPartition],
isTriggeredByAutoRebalance : Boolean) {
for (partition <- partitionsToBeRemoved) {
// check the status
val currentLeader = controllerContext.partitionLeadershipInfo(partition).leaderAndIsr.leader
val preferredReplica = controllerContext.partitionReplicaAssignment(partition).head
if (currentLeader == preferredReplica) {
info(s"Partition $partition completed preferred replica leader election. New leader is $preferredReplica")
} else {
warn(s"Partition $partition failed to complete preferred replica leader election to $preferredReplica. " +
s"Leader is still $currentLeader")
}
}
if (!isTriggeredByAutoRebalance) {
zkClient.deletePreferredReplicaElection(controllerContext.epochZkVersion)
// Ensure we detect future preferred replica leader elections
eventManager.put(PreferredReplicaLeaderElection)
}
}
PartitionReassignmentHandler
PartitionReassignmentHandler监听的路径是/admin/reassign_partitions, 直接看其event的process方法:
override def process(): Unit = {
if (!isActive) return
// We need to register the watcher if the path doesn't exist in order to detect future reassignments and we get
// the `path exists` check for free
if (zkClient.registerZNodeChangeHandlerAndCheckExistence(partitionReassignmentHandler)) {
val partitionReassignment = zkClient.getPartitionReassignment
// Populate `partitionsBeingReassigned` with all partitions being reassigned before invoking
// `maybeTriggerPartitionReassignment` (see method documentation for the reason)
partitionReassignment.foreach { case (tp, newReplicas) =>
val reassignIsrChangeHandler = new PartitionReassignmentIsrChangeHandler(KafkaController.this, eventManager,
tp)
controllerContext.partitionsBeingReassigned.put(tp, ReassignedPartitionsContext(newReplicas, reassignIsrChangeHandler))
}
maybeTriggerPartitionReassignment(partitionReassignment.keySet)
}
}
- 如果节点存在的话,先获取到partition和其分配情况
- 创建PartitionReassignmentIsrChangeHandler,用来监听/brokers/topics/<topic>/partitions/<partition>/state节点数据的变化。
- 在controller context的partitionsBeingReassigned中放入topicPartition和对应的ReassignedPartitionsContext。其中ReassignedPartitionsContext中注册了步骤2中创建的handler
- 执行maybeTriggerPartitionReassignment
private def maybeTriggerPartitionReassignment(topicPartitions: Set[TopicPartition]) {
val partitionsToBeRemovedFromReassignment = scala.collection.mutable.Set.empty[TopicPartition]
topicPartitions.foreach { tp =>
if (topicDeletionManager.isTopicQueuedUpForDeletion(tp.topic)) {
error(s"Skipping reassignment of $tp since the topic is currently being deleted")
partitionsToBeRemovedFromReassignment.add(tp)
} else {
val reassignedPartitionContext = controllerContext.partitionsBeingReassigned.get(tp).getOrElse {
throw new IllegalStateException(s"Initiating reassign replicas for partition $tp not present in " +
s"partitionsBeingReassigned: ${controllerContext.partitionsBeingReassigned.mkString(", ")}")
}
val newReplicas = reassignedPartitionContext.newReplicas
val topic = tp.topic
val assignedReplicas = controllerContext.partitionReplicaAssignment(tp)
if (assignedReplicas.nonEmpty) {
if (assignedReplicas == newReplicas) {
info(s"Partition $tp to be reassigned is already assigned to replicas " +
s"${newReplicas.mkString(",")}. Ignoring request for partition reassignment.")
partitionsToBeRemovedFromReassignment.add(tp)
} else {
try {
info(s"Handling reassignment of partition $tp to new replicas ${newReplicas.mkString(",")}")
// first register ISR change listener
reassignedPartitionContext.registerReassignIsrChangeHandler(zkClient)
// mark topic ineligible for deletion for the partitions being reassigned
topicDeletionManager.markTopicIneligibleForDeletion(Set(topic))
onPartitionReassignment(tp, reassignedPartitionContext)
} catch {
case e: ControllerMovedException =>
error(s"Error completing reassignment of partition $tp because controller has moved to another broker", e)
throw e
case e: Throwable =>
error(s"Error completing reassignment of partition $tp", e)
// remove the partition from the admin path to unblock the admin client
partitionsToBeRemovedFromReassignment.add(tp)
}
}
} else {
error(s"Ignoring request to reassign partition $tp that doesn't exist.")
partitionsToBeRemovedFromReassignment.add(tp)
}
}
}
removePartitionsFromReassignedPartitions(partitionsToBeRemovedFromReassignment)
}
- 如果这个topic正在被删除,则跳过它的reassignment
- 否则,先从controller context中获取到刚刚放入的ReassignedPartitionsContext
- 从reassignedPartitionContext获取到新的replica assignment,如果和原来的assignment相同,则忽略
- 否则在zk上面注册ISR change listener,并且暂停topic的删除
- 执行partition replica reassignment
- 将前面提到的需要忽略的topicPartition从controller context的partitionsBeingReassigned中移除,并取消zk上对其isr变化的监听。如果移除之后partitionsBeingReassigned为空,则删除zk上的/admin/reassign_partitions节点,并重新放入PartitionReassignment事件以便监听后续的reassignment;否则将剩余的assignment重新写回zk(这是否会再次触发上面的处理流程,值得留意)。大致逻辑见下面的示意代码。
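removePartitionsFromReassignedPartitions的一个示意性草图如下(方法名和细节以实际版本为准):
private def removePartitionsFromReassignedPartitions(partitionsToBeRemoved: Set[TopicPartition]): Unit = {
  // 取消这些partition上注册的ISR change handler
  partitionsToBeRemoved.map(controllerContext.partitionsBeingReassigned).foreach(_.unregisterReassignIsrChangeHandler(zkClient))
  val updatedPartitionsBeingReassigned = controllerContext.partitionsBeingReassigned -- partitionsToBeRemoved
  if (updatedPartitionsBeingReassigned.isEmpty) {
    // 没有剩余的reassignment:删除/admin/reassign_partitions节点,并重新放入事件以便监听后续的reassignment
    zkClient.deletePartitionReassignment(controllerContext.epochZkVersion)
    eventManager.put(PartitionReassignment)
  } else {
    // 否则把剩余的assignment写回zk
    val reassignment = updatedPartitionsBeingReassigned.mapValues(_.newReplicas).toMap
    zkClient.setOrCreatePartitionReassignment(reassignment, controllerContext.epochZkVersion)
  }
  controllerContext.partitionsBeingReassigned --= partitionsToBeRemoved
}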
我们着重看一下partition replica assignment的过程
onPartitionReassignment
当一个admin命令触发了partition的reassignment的任务时,会创建出/admin/reassign_partitions路径并且触发了zk的监听器,此时就开始了partition的重分配。为了方便理解,我们使用以下简称
- RAR:reassigned replicas 新的replica分配
- OAR: original list of replicas 原来的replica分配
- AR: current assigned replicas 当前的replica分配
onPartitionReassignment的代码如下:
private def onPartitionReassignment(topicPartition: TopicPartition, reassignedPartitionContext: ReassignedPartitionsContext) {
//先获取RAR
val reassignedReplicas = reassignedPartitionContext.newReplicas
//如果有新加入的replica没有跟上isr
if (!areReplicasInIsr(topicPartition, reassignedReplicas)) {
info(s"New replicas ${reassignedReplicas.mkString(",")} for partition $topicPartition being reassigned not yet " +
"caught up with the leader")
val newReplicasNotInOldReplicaList = reassignedReplicas.toSet -- controllerContext.partitionReplicaAssignment(topicPartition).toSet
val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicPartition)).toSet
//1. Update AR in ZK with OAR + RAR.
//把新老replica一起加入到当前的replica中,即AR中
updateAssignedReplicasForPartition(topicPartition, newAndOldReplicas.toSeq)
//2. Send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR).
//发送LeaderAndIsr请求到所有replica(即AR),强制进行新的leader epoch的选举
updateLeaderEpochAndSendRequest(topicPartition, controllerContext.partitionReplicaAssignment(topicPartition),
newAndOldReplicas.toSeq)
//3. replicas in RAR - OAR -> NewReplica
//通知新加入到replica跟上isr
startNewReplicasForReassignedPartition(topicPartition, reassignedPartitionContext, newReplicasNotInOldReplicaList)
info(s"Waiting for new replicas ${reassignedReplicas.mkString(",")} for partition ${topicPartition} being " +
"reassigned to catch up with the leader")
} else {
//如果所有的新replica都跟上isr
//4. Wait until all replicas in RAR are in sync with the leader.
val oldReplicas = controllerContext.partitionReplicaAssignment(topicPartition).toSet -- reassignedReplicas.toSet
//5. replicas in RAR -> OnlineReplica
//把RAR中的replica的状态转移为OnlineReplica
reassignedReplicas.foreach { replica =>
replicaStateMachine.handleStateChanges(Seq(new PartitionAndReplica(topicPartition, replica)), OnlineReplica)
}
//6. Set AR to RAR in memory.
//7. Send LeaderAndIsr request with a potential new leader (if current leader not in RAR) and
// a new AR (using RAR) and same isr to every broker in RAR
moveReassignedPartitionLeaderIfRequired(topicPartition, reassignedPartitionContext)
//8. replicas in OAR - RAR -> Offline (force those replicas out of isr)
//9. replicas in OAR - RAR -> NonExistentReplica (force those replicas to be deleted)
//老的replica,即OAR-RAR的状态先转移到offline再到nonexist
stopOldReplicasOfReassignedPartition(topicPartition, reassignedPartitionContext, oldReplicas)
//10. Update AR in ZK with RAR.
//更新AR
updateAssignedReplicasForPartition(topicPartition, reassignedReplicas)
//11. Update the /admin/reassign_partitions path in ZK to remove this partition.
removePartitionsFromReassignedPartitions(Set(topicPartition))
//12. After electing leader, the replicas and isr information changes, so resend the update metadata request to every broker
sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicPartition))
// signal delete topic thread if reassignment for some partitions belonging to topics being deleted just completed
topicDeletionManager.resumeDeletionForTopics(Set(topicPartition.topic))
}
}
reassignment的步骤是:
- 先将AR更新为RAR+OAR
- 发送LeaderAndIsr请求到RAR+OAR的replica中选举新的leader,开启新的leader epoch
- 将RAR-OAR的replica的状态设置为NewReplica
- 等待所有的replica跟上isr
- 把RAR的所有replica状态设置为OnlineReplica
- 将controller context中的AR设置为RAR
- 如果当前的leader不在RAR中,从RAR选举一个新的leader。否则将leader epoch加一。
- 将OAR-RAR中的replica设置为OfflineReplica,并且从isr中将OAR-RAR删掉,并且发送LeaderAndIsr请求到leader来通知新的isr
- 将OAR-RAR中的replica设置为NonExistentReplica状态,并且通知OAR-RAR物理删除磁盘上面的replica文件
- 更新zk上面的AR信息
- 从/admin/reassign_partitions节点中将这个partition删掉
- 因为replica和isr信息已经改变,发送元数据请求到所有的broker
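结合源码注释中给出的例子(假设OAR = {1,2,3},RAR = {4,5,6}),AR和zk中的leader/isr大致会经历以下变化(step编号对应上面代码注释中的step):
| 阶段 | AR | leader/isr |
|---|---|---|
| 初始状态 | {1,2,3} | 1/{1,2,3} |
| step 2之后 | {1,2,3,4,5,6} | 1/{1,2,3} |
| step 4之后 | {1,2,3,4,5,6} | 1/{1,2,3,4,5,6} |
| step 7之后 | {1,2,3,4,5,6} | 4/{1,2,3,4,5,6} |
| step 8之后 | {1,2,3,4,5,6} | 4/{4,5,6} |
| step 10之后 | {4,5,6} | 4/{4,5,6} |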