Controller startup
When a Kafka server starts, it always starts a KafkaController, which acts as the master of the Kafka cluster. Starting the controller component, however, does not mean this broker is the cluster's controller: it merely registers, on this broker, a listener that watches for session expiration, and then begins the controller election process. The controller starts that listener by registering a StateChangeHandler with ZooKeeper:
def startup() = {
  zkClient.registerStateChangeHandler(new StateChangeHandler {
    override val name: String = StateChangeHandlers.ControllerHandler
    override def afterInitializingSession(): Unit = {
      eventManager.put(RegisterBrokerAndReelect)
    }
    override def beforeInitializingSession(): Unit = {
      val expireEvent = new Expire
      eventManager.clearAndPut(expireEvent)
      // Block initialization of the new session until the expiration event is being handled,
      // which ensures that all pending events have been processed before creating the new session
      expireEvent.waitUntilProcessingStarted()
    }
  })
  eventManager.put(Startup)
  eventManager.start()
}
Reinitialization is triggered when the zkClient is closed, or when an event signalling session expiration is received:
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)
  inWriteLock(initializationLock) {
    if (!connectionState.isAlive) {
      zooKeeper.close()
      info(s"Initializing a new session to $connectString.")
      // retry forever until ZooKeeper can be instantiated
      var connected = false
      while (!connected) {
        try {
          zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher)
          connected = true
        } catch {
          case e: Exception =>
            info("Error when recreating ZooKeeper, retrying after a short sleep", e)
            Thread.sleep(1000)
        }
      }
    }
  }
  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
At its beginning and end, reinitialize invokes the corresponding handler callbacks. In this case, those callbacks run:
val expireEvent = new Expire
eventManager.clearAndPut(expireEvent)
expireEvent.waitUntilProcessingStarted()
and
eventManager.put(RegisterBrokerAndReelect)
The eventManager can be thought of as a queue backed by a thread called ControllerEventThread, whose job is to keep taking events off the queue and invoking their process methods. Above we saw two events being enqueued: Expire and RegisterBrokerAndReelect. Their process methods are:
//Expire
override def process(): Unit = {
  processingStarted.countDown()
  activeControllerId = -1
  onControllerResignation()
}
and
//RegisterBrokerAndReelect
override def process(): Unit = {
  zkClient.registerBroker(brokerInfo)
  Reelect.process()
}
When the session has just expired, the event enqueued is Expire, and the ZooKeeper thread must wait until processing of this event has started (controlled by the processingStarted CountDownLatch). Processing Expire first sets activeControllerId to -1 and then relinquishes controllership. We will not yet look into the details of relinquishing controllership; let's first look at what it takes to be elected controller, i.e. the process method of RegisterBrokerAndReelect:
//RegisterBrokerAndReelect
override def process(): Unit = {
  zkClient.registerBroker(brokerInfo)
  Reelect.process()
}
//Reelect
override def process(): Unit = {
  maybeResign()
  elect()
}
Before running in the election, maybeResign() checks whether this broker was the controller but no longer is; if so, it calls onControllerResignation(), the same method that handles the Expire event. That method implements the details of giving up controllership. After resigning if necessary, elect() starts a new round of election:
private def elect(): Unit = {
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  /*
   * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition,
   * it's possible that the controller has already been elected when we get here. This check will prevent the following
   * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
   */
  if (activeControllerId != -1) {
    // We can reach this point during initialization; if a controller already exists, abort the election
    debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
    return
  }
  try {
    // The election is a race to grab a znode in ZooKeeper; an exception means this broker was not elected
    val (epoch, epochZkVersion) = zkClient.registerControllerAndIncrementControllerEpoch(config.brokerId)
    controllerContext.epoch = epoch
    controllerContext.epochZkVersion = epochZkVersion
    activeControllerId = config.brokerId
    info(s"${config.brokerId} successfully elected as the controller. Epoch incremented to ${controllerContext.epoch} " +
      s"and epoch zk version is now ${controllerContext.epochZkVersion}")
    onControllerFailover()
  } catch {
    case e: ControllerMovedException =>
      maybeResign()
      if (activeControllerId != -1)
        debug(s"Broker $activeControllerId was elected as controller instead of broker ${config.brokerId}", e)
      else
        warn("A controller has been elected but just resigned, this will result in another round of election", e)
    case t: Throwable =>
      error(s"Error while electing or becoming controller on broker ${config.brokerId}. " +
        s"Trigger controller movement immediately", t)
      triggerControllerMove()
  }
}
The method that registers with ZooKeeper is:
def registerControllerAndIncrementControllerEpoch(controllerId: Int): (Int, Int) = {
  val timestamp = time.milliseconds()
  // Read /controller_epoch to get the current controller epoch and zkVersion,
  // create /controller_epoch with initial value if not exists
  val (curEpoch, curEpochZkVersion) = getControllerEpoch
    .map(e => (e._1, e._2.getVersion))
    // If no epoch exists yet, initialize it to 0
    .getOrElse(maybeCreateControllerEpochZNode())
  // Create /controller and update /controller_epoch atomically
  val newControllerEpoch = curEpoch + 1
  val expectedControllerEpochZkVersion = curEpochZkVersion
  debug(s"Try to create ${ControllerZNode.path} and increment controller epoch to $newControllerEpoch with expected controller epoch zkVersion $expectedControllerEpochZkVersion")
  def checkControllerAndEpoch(): (Int, Int) = {
    ...
  }
  def tryCreateControllerZNodeAndIncrementEpoch(): (Int, Int) = {
    ...
    // Create an EPHEMERAL node (deleted automatically when the session ends)
    transaction.create(ControllerZNode.path, ControllerZNode.encode(controllerId, timestamp),
      acls(ControllerZNode.path).asJava, CreateMode.EPHEMERAL)
    transaction.setData(ControllerEpochZNode.path, ControllerEpochZNode.encode(newControllerEpoch), expectedControllerEpochZkVersion)
    ...
  }
  tryCreateControllerZNodeAndIncrementEpoch()
}
The listener described above exists on every broker, not only on the controller. All brokers watch the controller znode and race to grab it once it disappears.
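In essence, the election is a compare-and-set race: whichever broker creates the ephemeral /controller znode first wins, and the epoch bump fences out requests from an older controller. A minimal in-memory sketch of that semantics, with an AtomicReference standing in for the znode (no real ZooKeeper here; tryElect and sessionExpired are made-up names):

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

// In-memory stand-ins for the /controller (ephemeral) and /controller_epoch znodes.
val controllerZNode = new AtomicReference[Option[Int]](None)
val controllerEpoch = new AtomicInteger(0)

// "Election": atomically create /controller if it is absent, then bump the epoch.
// Returns Some(newEpoch) if this broker won, None if another broker got there first.
def tryElect(brokerId: Int): Option[Int] =
  if (controllerZNode.compareAndSet(None, Some(brokerId)))
    Some(controllerEpoch.incrementAndGet())
  else
    None

// Session expiration deletes the ephemeral node, re-opening the race.
def sessionExpired(brokerId: Int): Unit =
  controllerZNode.compareAndSet(Some(brokerId), None)

val first  = tryElect(1) // broker 1 wins, epoch becomes 1
val second = tryElect(2) // broker 2 loses: /controller already exists
sessionExpired(1)        // broker 1's ZooKeeper session expires
val third  = tryElect(2) // broker 2 wins the new round, epoch becomes 2
```

In the real implementation the atomicity comes from a ZooKeeper transaction: the EPHEMERAL create of /controller and the versioned setData on /controller_epoch either both succeed or both fail, which is what makes the epoch usable as a fencing token.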
The controller's main tasks
As mentioned above, at startup every broker watches the controller znode in ZooKeeper, ready to run for election at any time. The broker that wins the election gets to register its brokerId in ZooKeeper. In elect(), once the broker has successfully claimed the znode (i.e. no exception was thrown), it then runs onControllerFailover, which covers the controller's main tasks:
private def onControllerFailover() {
  info("Registering handlers")
  // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
  // Register each handler into an in-memory map (no watcher is set in ZooKeeper yet)
  val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler, logDirEventNotificationHandler,
    isrChangeNotificationHandler)
  childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)
  val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
  nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)
  info("Deleting log dir event notifications")
  // Delete the children of the log_dir_event_notification znode
  zkClient.deleteLogDirEventNotifications(controllerContext.epochZkVersion)
  info("Deleting isr change notifications")
  // Delete the children of the isr_change_notification znode
  zkClient.deleteIsrChangeNotifications(controllerContext.epochZkVersion)
  info("Initializing controller context")
  // Initialize the controller's context
  initializeControllerContext()
  info("Fetching topic deletions in progress")
  // Fetch all topics pending deletion
  val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
  info("Initializing topic deletion manager")
  // Initialize the topicDeletionManager
  topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)
  // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
  // are started. This is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
  // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
  // partitionStateMachine.startup().
  info("Sending update metadata request")
  // Sync the list of live brokers to the other brokers
  sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
  replicaStateMachine.startup()
  partitionStateMachine.startup()
  info(s"Ready to serve as the new controller with epoch $epoch")
  // Check whether any partitions need to be reassigned
  maybeTriggerPartitionReassignment(controllerContext.partitionsBeingReassigned.keySet)
  topicDeletionManager.tryTopicDeletion()
  val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
  onPreferredReplicaElection(pendingPreferredReplicaElections)
  info("Starting the controller scheduler")
  kafkaScheduler.startup()
  if (config.autoLeaderRebalanceEnable) {
    scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS)
  }
  if (config.tokenAuthEnabled) {
    info("starting the token expiry check scheduler")
    tokenCleanScheduler.startup()
    tokenCleanScheduler.schedule(name = "delete-expired-tokens",
      fun = tokenManager.expireTokens,
      period = config.delegationTokenExpiryCheckIntervalMs,
      unit = TimeUnit.MILLISECONDS)
  }
}
After being elected, the controller goes through the following workflow:
- Register child-change and data-change handlers on the relevant znodes, to react to changes in znode children and znode data
- Delete the children of the log-dir and ISR change-notification znodes
- Initialize the controller context
- Fetch the list of topics pending deletion and initialize the topicDeletionManager
- Send updated broker metadata (including the live broker list) to the other brokers
- Start the replica and partition state machines
- Check whether any partition reassignment needs to be triggered
- Resume deletion of topics pending deletion
- Process any pending preferred-replica elections
- Start the controller scheduler