As we know, Kafka guarantees high availability of messages through its replica synchronization mechanism. Replication in Kafka is per partition: each partition has one leader and several followers, and a follower keeps itself eligible to take over as the new leader by fetching messages from the leader. So which followers qualify to become leader? Kafka's ISR mechanism ensures that at any moment there are followers that can be elected leader. All of these details live inside Kafka's replica synchronization machinery. Before walking through it, let's first look closely at the Partition and Replica classes.
Related classes
Partition
Kafka's Partition class is the abstraction of a physical topic partition. The Kafka server on each broker keeps a number of Partition objects, each caching information about that partition's replicas. Partition's member variables include:
- topic
- partitionId
- allReplicasMap: Pool[Int, Replica]; all replicas of this partition (both those already assigned and those in the process of being assigned to it)
- leaderIsrUpdateLock: a ReentrantReadWriteLock guarding leader/ISR updates
- zkVersion
- leaderEpoch
- leaderEpochStartOffsetOpt: the start offset of this leader epoch (only the leader has this information)
- leaderReplicaIdOpt
- inSyncReplicas
- controllerEpoch
Replica
A Replica represents one copy of a partition and is an abstraction over a single broker. Replica's member variables include:
- highWatermarkMetadata
- logEndOffsetMetadata
- _logStartOffset
- lastFetchLeaderLogEndOffset
- lastFetchTimeMs
- _lastCaughtUpTimeMs
A server holds all replicas of its partitions: some are local (the broker itself), others are remote. Replica's constructor takes an Option[Log]; if the log is undefined, the replica is remote, otherwise it is local.
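As a minimal sketch of this local/remote distinction (SketchReplica and SimpleLog are hypothetical stand-ins for the real kafka.cluster.Replica and Log classes, not the actual implementation):

```scala
// A hypothetical stand-in for Kafka's Log class.
case class SimpleLog(dir: String)

// A remote replica carries no Log; a local replica wraps one.
case class SketchReplica(brokerId: Int, log: Option[SimpleLog] = None) {
  def isLocal: Boolean = log.isDefined
}

val local  = SketchReplica(brokerId = 0, log = Some(SimpleLog("/data/kafka-logs")))
val remote = SketchReplica(brokerId = 1) // no Log => a remote replica
```

The same Option[Log] check is what lets one Replica class serve both roles without a type hierarchy.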
As we can see, Replica keeps plenty of offset bookkeeping about message fetching, mainly:
- hw(highWaterMark)
- leo(logEndOffset)
- lso(logStartOffset)
Grouped by which role owns the offset, they fall into three categories:
- leader
- local replica
- remote replica
So when are these three kinds of offsets updated? Next, let's look at how a replica fetches messages from the leader and stays in sync.
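To make the relationship among the three offsets concrete, here is a toy sketch; the numbers are made up, and only the invariant logStartOffset <= highWatermark <= logEndOffset reflects Kafka's actual behavior:

```scala
// Toy model of a replica's offset bookkeeping.
// Kafka maintains: logStartOffset <= highWatermark <= logEndOffset.
case class OffsetState(logStartOffset: Long, highWatermark: Long, logEndOffset: Long) {
  require(logStartOffset <= highWatermark && highWatermark <= logEndOffset)
}

// Messages in [highWatermark, logEndOffset) are appended locally
// but not yet confirmed replicated across the ISR.
def uncommitted(st: OffsetState): Long = st.logEndOffset - st.highWatermark

val s = OffsetState(logStartOffset = 100L, highWatermark = 250L, logEndOffset = 300L)
```

Here `uncommitted(s)` is 50: the leader has 50 messages that consumers cannot read yet because the HW has not advanced past them.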
RPC between replicas and the controller
When a Kafka cluster starts, the KafkaController component is launched; it plays the role of the master in a distributed system, driving leader election among a partition's replicas and announcing the result to all replicas. When a broker is elected controller, it registers a BrokerChange listener on /brokers/ids in ZooKeeper; whenever the children of that node change (a broker joins or goes offline), a designated callback fires. Earlier posts covered the controller in detail, so here we only look at the replica-related parts.
When a new broker starts, the controller's onBrokerStartup method is invoked. It:
- sends broker metadata to all brokers, so existing brokers can discover the new one
- triggers replica state changes. If the new broker hosts a replica of some partition, that replica's state is set to online (tracked in the replica state machine); and if the controller context already knows that partition's leader, the replica and leader information is synced to all relevant brokers
- triggers partition state changes, moving every partition in the offline or new state to online. This is where partition leader election happens; the election result is then propagated to the replicas and the leader
- checks whether any reassignment recorded in ZooKeeper involves the new broker, and if so executes the reassignment for the corresponding partitions.
A new broker starting up is only one of the triggers for leader and ISR changes: any event touching replicas can cause partition re-election and ISR movement. In Kafka, the controller is responsible for sending replica-related partition information to brokers; the concrete sends are batched in a ControllerBrokerRequestBatch object with the following member variables:
- controllerContext: the controller's context
- controllerId: the controller's id
- leaderAndIsrRequestMap: mutable.Map.empty[Int, mutable.Map[TopicPartition, LeaderAndIsrRequest.PartitionState]]
- stopReplicaRequestMap: a map from broker id to the list of StopReplicaRequestInfo destined for that broker
- updateMetadataRequestBrokerSet: mutable.Set.empty[Int]; the brokers that should receive the UpdateMetadata request
- updateMetadataRequestPartitionInfoMap: a map holding each partition and its PartitionState
When sending a leaderAndIsrRequest:
- leaderAndIsrRequestMap gains, for each target broker id, the partitions and their PartitionState
- updateMetadataRequestBrokerSet gains the brokers whose metadata this request affects (on startup, the newly joined broker)
- updateMetadataRequestPartitionInfoMap gains the partition and PartitionState information
Once these request parameters are assembled, the send method is invoked:
def sendRequestsToBrokers(controllerEpoch: Int) {
try {
val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerEpoch)
val leaderAndIsrRequestVersion: Short =
if (controller.config.interBrokerProtocolVersion >= KAFKA_1_0_IV0) 1
else 0
leaderAndIsrRequestMap.foreach { case (broker, leaderAndIsrPartitionStates) =>
leaderAndIsrPartitionStates.foreach { case (topicPartition, state) =>
val typeOfRequest =
if (broker == state.basePartitionState.leader) "become-leader"
else "become-follower"
stateChangeLog.trace(s"Sending $typeOfRequest LeaderAndIsr request $state to broker $broker for partition $topicPartition")
}
val leaderIds = leaderAndIsrPartitionStates.map(_._2.basePartitionState.leader).toSet
val leaders = controllerContext.liveOrShuttingDownBrokers.filter(b => leaderIds.contains(b.id)).map {
_.node(controller.config.interBrokerListenerName)
}
val leaderAndIsrRequestBuilder = new LeaderAndIsrRequest.Builder(leaderAndIsrRequestVersion, controllerId,
controllerEpoch, leaderAndIsrPartitionStates.asJava, leaders.asJava)
controller.sendRequest(broker, ApiKeys.LEADER_AND_ISR, leaderAndIsrRequestBuilder,
(r: AbstractResponse) => controller.eventManager.put(controller.LeaderAndIsrResponseReceived(r, broker)))
}
leaderAndIsrRequestMap.clear()
updateMetadataRequestPartitionInfoMap.foreach { case (tp, partitionState) =>
stateChangeLog.trace(s"Sending UpdateMetadata request $partitionState to brokers $updateMetadataRequestBrokerSet " +
s"for partition $tp")
}
val partitionStates = Map.empty ++ updateMetadataRequestPartitionInfoMap
val updateMetadataRequestVersion: Short =
if (controller.config.interBrokerProtocolVersion >= KAFKA_1_0_IV0) 4
else if (controller.config.interBrokerProtocolVersion >= KAFKA_0_10_2_IV0) 3
else if (controller.config.interBrokerProtocolVersion >= KAFKA_0_10_0_IV1) 2
else if (controller.config.interBrokerProtocolVersion >= KAFKA_0_9_0) 1
else 0
val updateMetadataRequest = {
val liveBrokers = if (updateMetadataRequestVersion == 0) {
// Version 0 of UpdateMetadataRequest only supports PLAINTEXT.
controllerContext.liveOrShuttingDownBrokers.map { broker =>
val securityProtocol = SecurityProtocol.PLAINTEXT
val listenerName = ListenerName.forSecurityProtocol(securityProtocol)
val node = broker.node(listenerName)
val endPoints = Seq(new EndPoint(node.host, node.port, securityProtocol, listenerName))
new UpdateMetadataRequest.Broker(broker.id, endPoints.asJava, broker.rack.orNull)
}
} else {
controllerContext.liveOrShuttingDownBrokers.map { broker =>
val endPoints = broker.endPoints.map { endPoint =>
new UpdateMetadataRequest.EndPoint(endPoint.host, endPoint.port, endPoint.securityProtocol, endPoint.listenerName)
}
new UpdateMetadataRequest.Broker(broker.id, endPoints.asJava, broker.rack.orNull)
}
}
new UpdateMetadataRequest.Builder(updateMetadataRequestVersion, controllerId, controllerEpoch, partitionStates.asJava,
liveBrokers.asJava)
}
updateMetadataRequestBrokerSet.foreach { broker =>
controller.sendRequest(broker, ApiKeys.UPDATE_METADATA, updateMetadataRequest, null)
}
updateMetadataRequestBrokerSet.clear()
updateMetadataRequestPartitionInfoMap.clear()
stopReplicaRequestMap.foreach { case (broker, replicaInfoList) =>
val stopReplicaWithDelete = replicaInfoList.filter(_.deletePartition).map(_.replica).toSet
val stopReplicaWithoutDelete = replicaInfoList.filterNot(_.deletePartition).map(_.replica).toSet
debug(s"The stop replica request (delete = true) sent to broker $broker is ${stopReplicaWithDelete.mkString(",")}")
debug(s"The stop replica request (delete = false) sent to broker $broker is ${stopReplicaWithoutDelete.mkString(",")}")
val (replicasToGroup, replicasToNotGroup) = replicaInfoList.partition(r => !r.deletePartition && r.callback == null)
// Send one StopReplicaRequest for all partitions that require neither delete nor callback. This potentially
// changes the order in which the requests are sent for the same partitions, but that's OK.
val stopReplicaRequest = new StopReplicaRequest.Builder(controllerId, controllerEpoch, false,
replicasToGroup.map(_.replica.topicPartition).toSet.asJava)
controller.sendRequest(broker, ApiKeys.STOP_REPLICA, stopReplicaRequest)
replicasToNotGroup.foreach { r =>
val stopReplicaRequest = new StopReplicaRequest.Builder(
controllerId, controllerEpoch, r.deletePartition,
Set(r.replica.topicPartition).asJava)
controller.sendRequest(broker, ApiKeys.STOP_REPLICA, stopReplicaRequest, r.callback)
}
}
stopReplicaRequestMap.clear()
} catch {
case e: Throwable =>
if (leaderAndIsrRequestMap.nonEmpty) {
error("Haven't been able to send leader and isr requests, current state of " +
s"the map is $leaderAndIsrRequestMap. Exception message: $e")
}
if (updateMetadataRequestBrokerSet.nonEmpty) {
error(s"Haven't been able to send metadata update requests to brokers $updateMetadataRequestBrokerSet, " +
s"current state of the partition info is $updateMetadataRequestPartitionInfoMap. Exception message: $e")
}
if (stopReplicaRequestMap.nonEmpty) {
error("Haven't been able to send stop replica requests, current state of " +
s"the map is $stopReplicaRequestMap. Exception message: $e")
}
throw new IllegalStateException(e)
}
}
}
While reading this code, keep two questions in mind: who is each request sent to, and what does it carry? Start with the leaderAndIsrRequest:
val leaderAndIsrRequestBuilder = new LeaderAndIsrRequest.Builder(leaderAndIsrRequestVersion, controllerId,
controllerEpoch, leaderAndIsrPartitionStates.asJava, leaders.asJava)
controller.sendRequest(broker, ApiKeys.LEADER_AND_ISR, leaderAndIsrRequestBuilder,
(r: AbstractResponse) => controller.eventManager.put(controller.LeaderAndIsrResponseReceived(r, broker)))
This request is sent to the keys of leaderAndIsrRequestMap, i.e. the brokers hosting replicas whose state changed; the payload is the partition information for those replicas.
After the leader and ISR information is synced, the controller sends the metadata update request:
val updateMetadataRequest = {
val liveBrokers = if (updateMetadataRequestVersion == 0) {
// Version 0 of UpdateMetadataRequest only supports PLAINTEXT.
controllerContext.liveOrShuttingDownBrokers.map { broker =>
val securityProtocol = SecurityProtocol.PLAINTEXT
val listenerName = ListenerName.forSecurityProtocol(securityProtocol)
val node = broker.node(listenerName)
val endPoints = Seq(new EndPoint(node.host, node.port, securityProtocol, listenerName))
new UpdateMetadataRequest.Broker(broker.id, endPoints.asJava, broker.rack.orNull)
}
} else {
controllerContext.liveOrShuttingDownBrokers.map { broker =>
val endPoints = broker.endPoints.map { endPoint =>
new UpdateMetadataRequest.EndPoint(endPoint.host, endPoint.port, endPoint.securityProtocol, endPoint.listenerName)
}
new UpdateMetadataRequest.Broker(broker.id, endPoints.asJava, broker.rack.orNull)
}
}
new UpdateMetadataRequest.Builder(updateMetadataRequestVersion, controllerId, controllerEpoch, partitionStates.asJava,
liveBrokers.asJava)
}
updateMetadataRequestBrokerSet.foreach { broker =>
controller.sendRequest(broker, ApiKeys.UPDATE_METADATA, updateMetadataRequest, null)
}
As we can see, the metadata update request is sent to all live brokers, and its payload is every broker's endpoints, rack, and related metadata.
At this point the controller-side work is complete. It did two things:
- when an event affecting replica state occurred (such as broker startup), it updated the controller context cache
- it sent the leaderAndIsrRequest and updateMetadata requests to the brokers
LeaderAndIsr API
The handler method for the leaderAndIsr API is:
def handleLeaderAndIsrRequest(request: RequestChannel.Request) {
// ensureTopicExists is only for client facing requests
// We can't have the ensureTopicExists check here since the controller sends it as an advisory to all brokers so they
// stop serving data to clients for the topic being deleted
val correlationId = request.header.correlationId
val leaderAndIsrRequest = request.body[LeaderAndIsrRequest]
def onLeadershipChange(updatedLeaders: Iterable[Partition], updatedFollowers: Iterable[Partition]) {
// for each new leader or follower, call coordinator to handle consumer group migration.
// this callback is invoked under the replica state change lock to ensure proper order of
// leadership changes
updatedLeaders.foreach { partition =>
if (partition.topic == GROUP_METADATA_TOPIC_NAME)
groupCoordinator.handleGroupImmigration(partition.partitionId)
else if (partition.topic == TRANSACTION_STATE_TOPIC_NAME)
txnCoordinator.handleTxnImmigration(partition.partitionId, partition.getLeaderEpoch)
}
updatedFollowers.foreach { partition =>
if (partition.topic == GROUP_METADATA_TOPIC_NAME)
groupCoordinator.handleGroupEmigration(partition.partitionId)
else if (partition.topic == TRANSACTION_STATE_TOPIC_NAME)
txnCoordinator.handleTxnEmigration(partition.partitionId, partition.getLeaderEpoch)
}
}
if (isAuthorizedClusterAction(request)) {
val response = replicaManager.becomeLeaderOrFollower(correlationId, leaderAndIsrRequest, onLeadershipChange)
sendResponseExemptThrottle(request, response)
} else {
sendResponseMaybeThrottle(request, throttleTimeMs => leaderAndIsrRequest.getErrorResponse(throttleTimeMs,
Errors.CLUSTER_AUTHORIZATION_FAILED.exception))
}
}
If authorization succeeds, replicaManager's becomeLeaderOrFollower method runs:
def becomeLeaderOrFollower(correlationId: Int,
leaderAndIsrRequest: LeaderAndIsrRequest,
onLeadershipChange: (Iterable[Partition], Iterable[Partition]) => Unit): LeaderAndIsrResponse = {
leaderAndIsrRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
stateChangeLogger.trace(s"Received LeaderAndIsr request $stateInfo " +
s"correlation id $correlationId from controller ${leaderAndIsrRequest.controllerId} " +
s"epoch ${leaderAndIsrRequest.controllerEpoch} for partition $topicPartition")
}
replicaStateChangeLock synchronized {
// Ignore the request if it carries a stale controller epoch
if (leaderAndIsrRequest.controllerEpoch < controllerEpoch) {
stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from controller ${leaderAndIsrRequest.controllerId} with " +
s"correlation id $correlationId since its controller epoch ${leaderAndIsrRequest.controllerEpoch} is old. " +
s"Latest known controller epoch is $controllerEpoch")
leaderAndIsrRequest.getErrorResponse(0, Errors.STALE_CONTROLLER_EPOCH.exception)
} else {
val responseMap = new mutable.HashMap[TopicPartition, Errors]
// Update cached controller info
val controllerId = leaderAndIsrRequest.controllerId
controllerEpoch = leaderAndIsrRequest.controllerEpoch
// First check partition's leader epoch
val partitionState = new mutable.HashMap[Partition, LeaderAndIsrRequest.PartitionState]()
val newPartitions = new mutable.HashSet[Partition]
leaderAndIsrRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
// Update the partition list cached on this broker
val partition = getPartition(topicPartition).getOrElse {
val createdPartition = getOrCreatePartition(topicPartition)
newPartitions.add(createdPartition)
createdPartition
}
val currentLeaderEpoch = partition.getLeaderEpoch
val requestLeaderEpoch = stateInfo.basePartitionState.leaderEpoch
if (partition eq ReplicaManager.OfflinePartition) {
stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from " +
s"controller $controllerId with correlation id $correlationId " +
s"epoch $controllerEpoch for partition $topicPartition as the local replica for the " +
"partition is in an offline log directory")
responseMap.put(topicPartition, Errors.KAFKA_STORAGE_ERROR)
} else if (requestLeaderEpoch > currentLeaderEpoch) {
// If the leader epoch is valid record the epoch of the controller that made the leadership decision.
// This is useful while updating the isr to maintain the decision maker controller's epoch in the zookeeper path
if(stateInfo.basePartitionState.replicas.contains(localBrokerId))
partitionState.put(partition, stateInfo)
else {
stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from controller $controllerId with " +
s"correlation id $correlationId epoch $controllerEpoch for partition $topicPartition as itself is not " +
s"in assigned replica list ${stateInfo.basePartitionState.replicas.asScala.mkString(",")}")
responseMap.put(topicPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION)
}
} else {
// Otherwise record the error code in response
stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from " +
s"controller $controllerId with correlation id $correlationId " +
s"epoch $controllerEpoch for partition $topicPartition since its associated " +
s"leader epoch $requestLeaderEpoch is not higher than the current " +
s"leader epoch $currentLeaderEpoch")
responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH)
}
}
// Partitions for which this broker becomes the leader
val partitionsTobeLeader = partitionState.filter { case (_, stateInfo) =>
stateInfo.basePartitionState.leader == localBrokerId
}
// Partitions for which this broker becomes a follower
val partitionsToBeFollower = partitionState -- partitionsTobeLeader.keys
val partitionsBecomeLeader = if (partitionsTobeLeader.nonEmpty)
makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, correlationId, responseMap)
else
Set.empty[Partition]
val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap)
else
Set.empty[Partition]
leaderAndIsrRequest.partitionStates.asScala.keys.foreach { topicPartition =>
/*
* If there is offline log directory, a Partition object may have been created by getOrCreatePartition()
* before getOrCreateReplica() failed to create local replica due to KafkaStorageException.
* In this case ReplicaManager.allPartitions will map this topic-partition to an empty Partition object.
* we need to map this topic-partition to OfflinePartition instead.
*/
if (localReplica(topicPartition).isEmpty && (allPartitions.get(topicPartition) ne ReplicaManager.OfflinePartition))
allPartitions.put(topicPartition, ReplicaManager.OfflinePartition)
}
// we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
// have been completely populated before starting the checkpointing there by avoiding weird race conditions
if (!hwThreadInitialized) {
startHighWaterMarksCheckPointThread()
hwThreadInitialized = true
}
val futureReplicasAndInitialOffset = new mutable.HashMap[TopicPartition, InitialFetchState]
for (partition <- newPartitions) {
val topicPartition = partition.topicPartition
if (logManager.getLog(topicPartition, isFuture = true).isDefined) {
partition.localReplica.foreach { replica =>
val leader = BrokerEndPoint(config.brokerId, "localhost", -1)
// Add future replica to partition's map
partition.getOrCreateReplica(Request.FutureLocalReplicaId, isNew = false)
// pause cleaning for partitions that are being moved and start ReplicaAlterDirThread to move
// replica from source dir to destination dir
logManager.abortAndPauseCleaning(topicPartition)
futureReplicasAndInitialOffset.put(topicPartition, InitialFetchState(leader,
partition.getLeaderEpoch, replica.highWatermark.messageOffset))
}
}
}
replicaAlterLogDirsManager.addFetcherForPartitions(futureReplicasAndInitialOffset)
replicaFetcherManager.shutdownIdleFetcherThreads()
replicaAlterLogDirsManager.shutdownIdleFetcherThreads()
onLeadershipChange(partitionsBecomeLeader, partitionsBecomeFollower)
new LeaderAndIsrResponse(Errors.NONE, responseMap.asJava)
}
}
}
In this method, the broker takes a different path for each partition depending on the role assigned to it: makeLeaders for partitions it leads, makeFollowers for partitions it follows. makeFollowers starts the threads that fetch from the leader; we will return to it shortly. Afterwards, the replicaManager checks whether the HW checkpoint thread has been started, and starts it if not.
makeLeaders
Its implementation is:
private def makeLeaders(controllerId: Int,
epoch: Int,
partitionState: Map[Partition, LeaderAndIsrRequest.PartitionState],
correlationId: Int,
responseMap: mutable.Map[TopicPartition, Errors]): Set[Partition] = {
partitionState.keys.foreach { partition =>
stateChangeLogger.trace(s"Handling LeaderAndIsr request correlationId $correlationId from " +
s"controller $controllerId epoch $epoch starting the become-leader transition for " +
s"partition ${partition.topicPartition}")
}
for (partition <- partitionState.keys)
responseMap.put(partition.topicPartition, Errors.NONE)
val partitionsToMakeLeaders = mutable.Set[Partition]()
try {
// First stop fetchers for all the partitions
// Stop the fetcher threads for these partitions
replicaFetcherManager.removeFetcherForPartitions(partitionState.keySet.map(_.topicPartition))
// Update the partition information to be the leader
partitionState.foreach{ case (partition, partitionStateInfo) =>
try {
// Invoke the partition's makeLeader method
if (partition.makeLeader(controllerId, partitionStateInfo, correlationId)) {
partitionsToMakeLeaders += partition
stateChangeLogger.trace(s"Stopped fetchers as part of become-leader request from " +
s"controller $controllerId epoch $epoch with correlation id $correlationId for partition ${partition.topicPartition} " +
s"(last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch})")
} else
stateChangeLogger.info(s"Skipped the become-leader state change after marking its " +
s"partition as leader with correlation id $correlationId from controller $controllerId epoch $epoch for " +
s"partition ${partition.topicPartition} (last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) " +
s"since it is already the leader for the partition.")
} catch {
case e: KafkaStorageException =>
stateChangeLogger.error(s"Skipped the become-leader state change with " +
s"correlation id $correlationId from controller $controllerId epoch $epoch for partition ${partition.topicPartition} " +
s"(last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) since " +
s"the replica for the partition is offline due to disk error $e")
val dirOpt = getLogDir(partition.topicPartition)
error(s"Error while making broker the leader for partition $partition in dir $dirOpt", e)
responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
}
}
} catch {
case e: Throwable =>
partitionState.keys.foreach { partition =>
stateChangeLogger.error(s"Error while processing LeaderAndIsr request correlationId $correlationId received " +
s"from controller $controllerId epoch $epoch for partition ${partition.topicPartition}", e)
}
// Re-throw the exception for it to be caught in KafkaApis
throw e
}
partitionState.keys.foreach { partition =>
stateChangeLogger.trace(s"Completed LeaderAndIsr request correlationId $correlationId from controller $controllerId " +
s"epoch $epoch for the become-leader transition for partition ${partition.topicPartition}")
}
partitionsToMakeLeaders
}
A partition's makeLeader is somewhat more involved than the makeFollower we will see later:
def makeLeader(controllerId: Int, partitionStateInfo: LeaderAndIsrRequest.PartitionState, correlationId: Int): Boolean = {
val (leaderHWIncremented, isNewLeader) = inWriteLock(leaderIsrUpdateLock) {
// Replicas newly assigned to this partition
val newAssignedReplicas = partitionStateInfo.basePartitionState.replicas.asScala.map(_.toInt)
// record the epoch of the controller that made the leadership decision. This is useful while updating the isr
// to maintain the decision maker controller's epoch in the zookeeper path
// Record the epoch of the controller that sent this leaderAndIsr request
controllerEpoch = partitionStateInfo.basePartitionState.controllerEpoch
// add replicas that are new
val newInSyncReplicas = partitionStateInfo.basePartitionState.isr.asScala.map(r => getOrCreateReplica(r, partitionStateInfo.isNew)).toSet
// remove assigned replicas that have been removed by the controller
(assignedReplicas.map(_.brokerId) -- newAssignedReplicas).foreach(removeReplica)
// Update the ISR to the replicas carried in the request
inSyncReplicas = newInSyncReplicas
newAssignedReplicas.foreach(id => getOrCreateReplica(id, partitionStateInfo.isNew))
val leaderReplica = localReplicaOrException
// Use this replica's LEO as the start offset of the new leader epoch
val leaderEpochStartOffset = leaderReplica.logEndOffset.messageOffset
info(s"$topicPartition starts at Leader Epoch ${partitionStateInfo.basePartitionState.leaderEpoch} from " +
s"offset $leaderEpochStartOffset. Previous Leader Epoch was: $leaderEpoch")
//We cache the leader epoch here, persisting it only if it's local (hence having a log dir)
// Update to the new leaderEpoch
leaderEpoch = partitionStateInfo.basePartitionState.leaderEpoch
leaderEpochStartOffsetOpt = Some(leaderEpochStartOffset)
zkVersion = partitionStateInfo.basePartitionState.zkVersion
// In the case of successive leader elections in a short time period, a follower may have
// entries in its log from a later epoch than any entry in the new leader's log. In order
// to ensure that these followers can truncate to the right offset, we must cache the new
// leader epoch and the start offset since it should be larger than any epoch that a follower
// would try to query.
// Cache each leaderEpoch together with its start offset
leaderReplica.epochs.foreach { epochCache =>
epochCache.assign(leaderEpoch, leaderEpochStartOffset)
}
val isNewLeader = !leaderReplicaIdOpt.contains(localBrokerId)
val curLeaderLogEndOffset = leaderReplica.logEndOffset.messageOffset
val curTimeMs = time.milliseconds
// initialize lastCaughtUpTime of replicas as well as their lastFetchTimeMs and lastFetchLeaderLogEndOffset.
(assignedReplicas - leaderReplica).foreach { replica =>
val lastCaughtUpTimeMs = if (inSyncReplicas.contains(replica)) curTimeMs else 0L
// Reset each follower's fetch bookkeeping: the leader LEO it last saw becomes curLeaderLogEndOffset, lastFetchTimeMs becomes now, and lastCaughtUpTimeMs becomes now for ISR members, 0 otherwise
replica.resetLastCaughtUpTime(curLeaderLogEndOffset, curTimeMs, lastCaughtUpTimeMs)
}
// If this broker is the partition's new leader
if (isNewLeader) {
// construct the high watermark metadata for the new leader replica
// Build the new HW metadata
leaderReplica.convertHWToLocalOffsetMetadata()
// mark local replica as the leader after converting hw
// Mark this broker as the leader replica
leaderReplicaIdOpt = Some(localBrokerId)
// reset log end offset for remote replicas
// Reset the remote replicas' fetch info
assignedReplicas.filter(_.brokerId != localBrokerId).foreach(_.updateLogReadResult(LogReadResult.UnknownLogReadResult))
}
// we may need to increment high watermark since ISR could be down to 1
(maybeIncrementLeaderHW(leaderReplica), isNewLeader)
}
// some delayed operations may be unblocked after HW changed
if (leaderHWIncremented)
tryCompleteDelayedRequests()
isNewLeader
}
The partition leader keeps fetch-progress bookkeeping for every other replica, mainly:
- _lastCaughtUpTimeMs
- lastFetchLeaderLogEndOffset
- lastFetchTimeMs
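These timestamps feed the ISR membership check: a follower whose _lastCaughtUpTimeMs is older than the replica.lag.time.max.ms config is judged out of sync. The following is a simplified sketch of that rule, not the real Partition.getOutOfSyncReplicas implementation:

```scala
// Minimal model of the leader's per-follower bookkeeping.
case class FollowerState(brokerId: Int, lastCaughtUpTimeMs: Long)

// A follower falls out of sync once it has not caught up to the
// leader's LEO within replicaLagTimeMaxMs (replica.lag.time.max.ms).
def isOutOfSync(f: FollowerState, nowMs: Long, replicaLagTimeMaxMs: Long): Boolean =
  nowMs - f.lastCaughtUpTimeMs > replicaLagTimeMaxMs

val now   = 100000L
val fresh = FollowerState(brokerId = 1, lastCaughtUpTimeMs = now - 2000)
val stale = FollowerState(brokerId = 2, lastCaughtUpTimeMs = now - 15000)
```

With the default lag limit of 10 seconds, `fresh` stays in the ISR while `stale` would be shrunk out of it.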
makeFollowers
In makeFollowers, the broker marks itself as a follower of the partition and starts the fetch thread. The implementation is:
private def makeFollowers(controllerId: Int,
epoch: Int,
partitionStates: Map[Partition, LeaderAndIsrRequest.PartitionState],
correlationId: Int,
responseMap: mutable.Map[TopicPartition, Errors]) : Set[Partition] = {
partitionStates.foreach { case (partition, partitionState) =>
stateChangeLogger.trace(s"Handling LeaderAndIsr request correlationId $correlationId from controller $controllerId " +
s"epoch $epoch starting the become-follower transition for partition ${partition.topicPartition} with leader " +
s"${partitionState.basePartitionState.leader}")
}
for (partition <- partitionStates.keys)
responseMap.put(partition.topicPartition, Errors.NONE)
val partitionsToMakeFollower: mutable.Set[Partition] = mutable.Set()
try {
// TODO: Delete leaders from LeaderAndIsrRequest
partitionStates.foreach { case (partition, partitionStateInfo) =>
val newLeaderBrokerId = partitionStateInfo.basePartitionState.leader
try {
metadataCache.getAliveBrokers.find(_.id == newLeaderBrokerId) match {
// Only change partition state when the leader is available
case Some(_) =>
// If this partition's leader is in the live broker list, call the partition's makeFollower method to initialize the follower state. It returns true/false indicating whether the partition's leadership actually changed
if (partition.makeFollower(controllerId, partitionStateInfo, correlationId))
partitionsToMakeFollower += partition
else
stateChangeLogger.info(s"Skipped the become-follower state change after marking its partition as " +
s"follower with correlation id $correlationId from controller $controllerId epoch $epoch " +
s"for partition ${partition.topicPartition} (last update " +
s"controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) " +
s"since the new leader $newLeaderBrokerId is the same as the old leader")
case None =>
// The leader broker should always be present in the metadata cache.
// If not, we should record the error message and abort the transition process for this partition
stateChangeLogger.error(s"Received LeaderAndIsrRequest with correlation id $correlationId from " +
s"controller $controllerId epoch $epoch for partition ${partition.topicPartition} " +
s"(last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) " +
s"but cannot become follower since the new leader $newLeaderBrokerId is unavailable.")
// Create the local replica even if the leader is unavailable. This is required to ensure that we include
// the partition's high watermark in the checkpoint file (see KAFKA-1647)
partition.getOrCreateReplica(localBrokerId, isNew = partitionStateInfo.isNew)
}
} catch {
case e: KafkaStorageException =>
stateChangeLogger.error(s"Skipped the become-follower state change with correlation id $correlationId from " +
s"controller $controllerId epoch $epoch for partition ${partition.topicPartition} " +
s"(last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) with leader " +
s"$newLeaderBrokerId since the replica for the partition is offline due to disk error $e")
val dirOpt = getLogDir(partition.topicPartition)
error(s"Error while making broker the follower for partition $partition with leader " +
s"$newLeaderBrokerId in dir $dirOpt", e)
responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
}
}
// For partitions whose leadership changed, stop the existing fetcher threads
replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(_.topicPartition))
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(s"Stopped fetchers as part of become-follower request from controller $controllerId " +
s"epoch $epoch with correlation id $correlationId for partition ${partition.topicPartition} with leader " +
s"${partitionStates(partition).basePartitionState.leader}")
}
partitionsToMakeFollower.foreach { partition =>
val topicPartitionOperationKey = new TopicPartitionOperationKey(partition.topicPartition)
tryCompleteDelayedProduce(topicPartitionOperationKey)
tryCompleteDelayedFetch(topicPartitionOperationKey)
}
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(s"Truncated logs and checkpointed recovery boundaries for partition " +
s"${partition.topicPartition} as part of become-follower request with correlation id $correlationId from " +
s"controller $controllerId epoch $epoch with leader ${partitionStates(partition).basePartitionState.leader}")
}
if (isShuttingDown.get()) {
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(s"Skipped the adding-fetcher step of the become-follower state " +
s"change with correlation id $correlationId from controller $controllerId epoch $epoch for " +
s"partition ${partition.topicPartition} with leader ${partitionStates(partition).basePartitionState.leader} " +
"since it is shutting down")
}
}
else {
// we do not need to check if the leader exists again since this has been done at the beginning of this process
val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map { partition =>
val leader = metadataCache.getAliveBrokers.find(_.id == partition.leaderReplicaIdOpt.get).get
.brokerEndPoint(config.interBrokerListenerName)
val fetchOffset = partition.localReplicaOrException.highWatermark.messageOffset
partition.topicPartition -> InitialFetchState(leader, partition.getLeaderEpoch, fetchOffset)
}.toMap
// Start fetcher threads towards the new leaders
replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)
partitionsToMakeFollower.foreach { partition =>
stateChangeLogger.trace(s"Started fetcher to new leader as part of become-follower " +
s"request from controller $controllerId epoch $epoch with correlation id $correlationId for " +
s"partition ${partition.topicPartition} with leader ${partitionStates(partition).basePartitionState.leader}")
}
}
} catch {
case e: Throwable =>
stateChangeLogger.error(s"Error while processing LeaderAndIsr request with correlationId $correlationId " +
s"received from controller $controllerId epoch $epoch", e)
// Re-throw the exception for it to be caught in KafkaApis
throw e
}
partitionStates.keys.foreach { partition =>
stateChangeLogger.trace(s"Completed LeaderAndIsr request correlationId $correlationId from controller $controllerId " +
s"epoch $epoch for the become-follower transition for partition ${partition.topicPartition} with leader " +
s"${partitionStates(partition).basePartitionState.leader}")
}
partitionsToMakeFollower
}
The replica fetch process
When a replica receives a become-follower request, it starts a thread that fetches from the partition leader. The number of fetch threads one broker runs against the same peer broker is capped; the fetcher id is computed as follows:
private[server] def getFetcherId(topicPartition: TopicPartition): Int = {
lock synchronized {
Utils.abs(31 * topicPartition.topic.hashCode() + topicPartition.partition) % numFetchersPerBroker
}
}
All fetch threads are kept in AbstractFetcherManager's fetcherThreadMap, whose key is
BrokerAndFetcherId(brokerAndInitialFetchOffset.leader, getFetcherId(topicPartition))
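The fetcher-id computation above can be extracted into a standalone sketch; kafkaAbs mirrors Kafka's Utils.abs, which maps Int.MinValue to 0 to dodge the math.abs overflow:

```scala
// Utils.abs in Kafka returns 0 for Int.MinValue (math.abs would overflow).
def kafkaAbs(n: Int): Int = if (n == Int.MinValue) 0 else math.abs(n)

// Same hash as getFetcherId: partitions hash into one of
// numFetchersPerBroker fetch threads per destination broker.
def getFetcherId(topic: String, partition: Int, numFetchersPerBroker: Int): Int =
  kafkaAbs(31 * topic.hashCode + partition) % numFetchersPerBroker

val id = getFetcherId("payments", 3, numFetchersPerBroker = 4)
```

The practical consequence: all partitions that hash to the same fetcher id share one thread and one fetch request towards that broker, so num.replica.fetchers bounds per-broker fetch parallelism.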
When partitions are added, an existing fetch thread towards the target broker is reused if present; otherwise a new thread is created:
def addAndStartFetcherThread(brokerAndFetcherId: BrokerAndFetcherId, brokerIdAndFetcherId: BrokerIdAndFetcherId): AbstractFetcherThread = {
val fetcherThread = createFetcherThread(brokerAndFetcherId.fetcherId, brokerAndFetcherId.broker)
fetcherThreadMap.put(brokerIdAndFetcherId, fetcherThread)
fetcherThread.start()
fetcherThread
}
The thread created here is a ReplicaFetcherThread. After it is constructed, note how the fetch state for each partition is initialized:
val currentState = partitionStates.stateValue(tp)
val updatedState = if (currentState != null && currentState.currentLeaderEpoch == initialFetchState.leaderEpoch) {
currentState
} else {
val initialFetchOffset = if (initialFetchState.offset < 0)
fetchOffsetAndTruncate(tp, initialFetchState.leaderEpoch)
else
initialFetchState.offset
PartitionFetchState(initialFetchOffset, initialFetchState.leaderEpoch, state = Truncating)
}
partitionStates.updateAndMoveToEnd(tp, updatedState)
The initial fetch offset here is the follower's current hw mentioned earlier. Note that if the fetcher thread's partitionStates already contains fetch state for this partition and the leader epoch has not changed, the existing state is kept as-is. Otherwise the partition fetch state is updated as follows:
- if initialFetchState.offset >= 0, updatedState = PartitionFetchState(initialFetchOffset, initialFetchState.leaderEpoch, state = Truncating), putting the partition into the Truncating state
- if initialFetchState.offset < 0, the starting offset is obtained via fetchOffsetAndTruncate, which resolves an offset against the leader (typically its leo) and truncates the local log if needed
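The initialization rule above can be condensed into a small sketch (the names and the leader-leo fallback here are illustrative, not Kafka's code):

```python
# Sketch: decide the initial per-partition fetch state, mirroring the
# logic above. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class FetchState:
    offset: int
    leader_epoch: int
    state: str  # "Truncating" or "Fetching"

def init_fetch_state(current, initial_offset, leader_epoch, leader_leo):
    # Keep the existing state if it is for the same leader epoch.
    if current is not None and current.leader_epoch == leader_epoch:
        return current
    # A negative initial offset means "resolve it against the leader",
    # sketched here as falling back to the leader's log end offset.
    offset = initial_offset if initial_offset >= 0 else leader_leo
    return FetchState(offset, leader_epoch, "Truncating")
```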
ReplicaFetcherThread
ReplicaFetcherThread extends AbstractFetcherThread, which in turn extends ShutdownableThread (a class we have met several times in kafka); its main loop simply repeats the doWork() method. The AbstractFetcherThread base class already implements the core fetch logic:
override def doWork() {
maybeTruncate()
maybeFetch()
}
The maybeTruncate method was covered earlier; its job is to truncate each partition that needs it down to the offset matching its leader epoch. In this article we focus on maybeFetch, which boils down to two important steps:
- building the fetch request
- processing the fetch response
Here we look at the two methods implemented by the subclass ReplicaFetcherThread:
protected def buildFetch(partitionMap: Map[TopicPartition, PartitionFetchState]): ResultWithPartitions[Option[FetchRequest.Builder]]
protected def processPartitionData(topicPartition: TopicPartition,
fetchOffset: Long,
partitionData: FetchData): Option[LogAppendInfo]
First, buildFetch, which constructs the fetch request:
override def buildFetch(partitionMap: Map[TopicPartition, PartitionFetchState]): ResultWithPartitions[Option[FetchRequest.Builder]] = {
val partitionsWithError = mutable.Set[TopicPartition]()
val builder = fetchSessionHandler.newBuilder()
partitionMap.foreach { case (topicPartition, fetchState) =>
// We will not include a replica in the fetch request if it should be throttled.
if (fetchState.isReadyForFetch && !shouldFollowerThrottle(quota, topicPartition)) {
try {
val logStartOffset = replicaMgr.localReplicaOrException(topicPartition).logStartOffset
builder.add(topicPartition, new FetchRequest.PartitionData(
fetchState.fetchOffset, logStartOffset, fetchSize, Optional.of(fetchState.currentLeaderEpoch)))
} catch {
case _: KafkaStorageException =>
// The replica has already been marked offline due to log directory failure and the original failure should have already been logged.
// This partition should be removed from ReplicaFetcherThread soon by ReplicaManager.handleLogDirFailure()
partitionsWithError += topicPartition
}
}
}
At the start of the method, a builder for the fetch request data is created; for each partition the fetchOffset and related fields are added, and each replica also attaches its own logStartOffset, the offset at which its local log begins. Reporting the lso lets the leader keep track of the partition's low watermark. Finally builder.build produces the fetch request data.
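As a rough model of what buildFetch assembles (throttling and storage errors reduced to simple stand-ins we invent here):

```python
# Sketch of buildFetch: collect per-partition fetch entries, skipping
# partitions that are throttled or not ready. Names are illustrative;
# a missing log-start-offset entry stands in for KafkaStorageException.
def build_fetch(partition_states, log_start_offsets, throttled,
                fetch_size=1048576):
    request, with_error = {}, set()
    for tp, st in partition_states.items():
        if not st["ready"] or tp in throttled:
            continue  # throttled or not ready: leave it out of this request
        if tp not in log_start_offsets:
            with_error.add(tp)  # local log unavailable
            continue
        request[tp] = {
            "fetch_offset": st["fetch_offset"],
            "log_start_offset": log_start_offsets[tp],  # follower's own lso
            "max_bytes": fetch_size,
            "current_leader_epoch": st["leader_epoch"],
        }
    return request, with_error
```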
Finally, the processPartitionData method:
override def processPartitionData(topicPartition: TopicPartition,
fetchOffset: Long,
partitionData: FetchData): Option[LogAppendInfo] = {
val replica = replicaMgr.localReplicaOrException(topicPartition)
val partition = replicaMgr.getPartition(topicPartition).get
val records = toMemoryRecords(partitionData.records)
maybeWarnIfOversizedRecords(records, topicPartition)
if (fetchOffset != replica.logEndOffset.messageOffset)
throw new IllegalStateException("Offset mismatch for partition %s: fetched offset = %d, log end offset = %d.".format(
topicPartition, fetchOffset, replica.logEndOffset.messageOffset))
if (isTraceEnabled)
trace("Follower has replica log end offset %d for partition %s. Received %d messages and leader hw %d"
.format(replica.logEndOffset.messageOffset, topicPartition, records.sizeInBytes, partitionData.highWatermark))
// Append the leader's messages to the log
val logAppendInfo = partition.appendRecordsToFollowerOrFutureReplica(records, isFuture = false)
if (isTraceEnabled)
trace("Follower has replica log end offset %d after appending %d bytes of messages for partition %s"
.format(replica.logEndOffset.messageOffset, records.sizeInBytes, topicPartition))
val followerHighWatermark = replica.logEndOffset.messageOffset.min(partitionData.highWatermark)
val leaderLogStartOffset = partitionData.logStartOffset
// for the follower replica, we do not need to keep
// its segment base offset the physical position,
// these values will be computed upon making the leader
replica.highWatermark = new LogOffsetMetadata(followerHighWatermark)
replica.maybeIncrementLogStartOffset(leaderLogStartOffset)
if (isTraceEnabled)
trace(s"Follower set replica high watermark for partition $topicPartition to $followerHighWatermark")
// Traffic from both in-sync and out of sync replicas are accounted for in replication quota to ensure total replication
// traffic doesn't exceed quota.
if (quota.isThrottled(topicPartition))
quota.record(records.sizeInBytes)
replicaMgr.brokerTopicStats.updateReplicationBytesIn(records.sizeInBytes)
logAppendInfo
}
Processing the fetch response consists of two main steps:
- appending the fetched records to the local log
- updating the replica's local metadata, namely hw and lso: the follower hw becomes min(local leo, leader hw), and if the leader's lso has advanced, the follower's lso advances with it (old log segments may then be deleted)
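The two update rules can be sketched as a toy model (not Kafka's code):

```python
# Sketch: how a follower updates its hw and lso after appending the
# leader's records. Variable names are ours.
def update_follower_offsets(local_leo, leader_hw, local_lso, leader_lso):
    # The follower's hw never runs ahead of what it has on disk.
    new_hw = min(local_leo, leader_hw)
    # The follower's lso only moves forward, following the leader.
    new_lso = max(local_lso, leader_lso)
    return new_hw, new_lso
```

A lagging follower caps its hw at its own leo; a fully caught-up follower adopts the leader's hw directly.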
How the leader handles fetch requests
The partition leader receives fetch requests from both consumers and followers; here we focus on follower requests. The kafka server API handler for fetch requests is handleFetchRequest:
...
if (fetchRequest.isFromFollower) {
// The follower must have ClusterAction on ClusterResource in order to fetch partition data.
if (authorize(request.session, ClusterAction, Resource.ClusterResource)) {
fetchContext.foreachPartition { (topicPartition, data) =>
if (!metadataCache.contains(topicPartition))
erroneous += topicPartition -> errorResponse(Errors.UNKNOWN_TOPIC_OR_PARTITION)
else
interesting += (topicPartition -> data)
}
} else {
fetchContext.foreachPartition { (part, _) =>
erroneous += part -> errorResponse(Errors.TOPIC_AUTHORIZATION_FAILED)
}
}
}
...
if (interesting.isEmpty)
processResponseCallback(Seq.empty)
else {
// call the replica manager to fetch messages from the local replica
replicaManager.fetchMessages(
fetchRequest.maxWait.toLong,
fetchRequest.replicaId,
fetchRequest.minBytes,
fetchRequest.maxBytes,
versionId <= 2,
interesting,
replicationQuota(fetchRequest),
processResponseCallback,
fetchRequest.isolationLevel)
}
...
Skipping the metrics, quota, and response-construction details, let's look at how the replicaManager handles the fetch:
def fetchMessages(timeout: Long,
replicaId: Int,
fetchMinBytes: Int,
fetchMaxBytes: Int,
hardMaxBytesLimit: Boolean,
fetchInfos: Seq[(TopicPartition, PartitionData)],
quota: ReplicaQuota = UnboundedQuota,
responseCallback: Seq[(TopicPartition, FetchPartitionData)] => Unit,
isolationLevel: IsolationLevel) {
val isFromFollower = Request.isValidBrokerId(replicaId)
val fetchOnlyFromLeader = replicaId != Request.DebuggingConsumerId && replicaId != Request.FutureLocalReplicaId
val fetchIsolation = if (isFromFollower || replicaId == Request.FutureLocalReplicaId)
FetchLogEnd
else if (isolationLevel == IsolationLevel.READ_COMMITTED)
FetchTxnCommitted
else
FetchHighWatermark
def readFromLog(): Seq[(TopicPartition, LogReadResult)] = {
val result = readFromLocalLog(
replicaId = replicaId,
fetchOnlyFromLeader = fetchOnlyFromLeader,
fetchIsolation = fetchIsolation,
fetchMaxBytes = fetchMaxBytes,
hardMaxBytesLimit = hardMaxBytesLimit,
readPartitionInfo = fetchInfos,
quota = quota)
if (isFromFollower) updateFollowerLogReadResults(replicaId, result)
else result
}
val logReadResults = readFromLog()
// check if this fetch request can be satisfied right away
val logReadResultValues = logReadResults.map { case (_, v) => v }
val bytesReadable = logReadResultValues.map(_.info.records.sizeInBytes).sum
val errorReadingData = logReadResultValues.foldLeft(false) ((errorIncurred, readResult) =>
errorIncurred || (readResult.error != Errors.NONE))
// respond immediately if 1) fetch request does not want to wait
// 2) fetch request does not require any data
// 3) has enough data to respond
// 4) some error happens while reading data
if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {
val fetchPartitionData = logReadResults.map { case (tp, result) =>
tp -> FetchPartitionData(result.error, result.highWatermark, result.leaderLogStartOffset, result.info.records,
result.lastStableOffset, result.info.abortedTransactions)
}
responseCallback(fetchPartitionData)
} else {
// construct the fetch results from the read results
val fetchPartitionStatus = logReadResults.map { case (topicPartition, result) =>
val fetchInfo = fetchInfos.collectFirst {
case (tp, v) if tp == topicPartition => v
}.getOrElse(sys.error(s"Partition $topicPartition not found in fetchInfos"))
(topicPartition, FetchPartitionStatus(result.info.fetchOffsetMetadata, fetchInfo))
}
val fetchMetadata = FetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit, fetchOnlyFromLeader,
fetchIsolation, isFromFollower, replicaId, fetchPartitionStatus)
val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, responseCallback)
// create a list of (topic, partition) pairs to use as keys for this delayed fetch operation
val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) }
// try to complete the request immediately, otherwise put it into the purgatory;
// this is because while the delayed fetch operation is being created, new requests
// may arrive and hence make this operation completable.
delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
}
}
In this method:
- first, the fetch isolation level is determined
- second, the log is read locally and the results collected
- third, for follower fetches, the follower's offset-related state on the leader is updated
- fourth, if the request can already be satisfied it is answered immediately; otherwise a DelayedFetch is constructed and watched until it can be completed
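The decision in the last step can be sketched as a predicate mirroring the four conditions in fetchMessages:

```python
# Sketch: should a fetch be answered immediately, or parked in the
# delayed-fetch purgatory? Mirrors the conditions in fetchMessages.
def respond_immediately(timeout_ms, fetch_infos, bytes_readable,
                        min_bytes, error_reading):
    return (timeout_ms <= 0                 # caller does not want to wait
            or not fetch_infos              # nothing was requested
            or bytes_readable >= min_bytes  # enough data already
            or error_reading)               # errors must be reported now
```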
There are three fetch isolation levels:
- FetchLogEnd => read up to the log end (follower fetches)
- FetchTxnCommitted => read up to the lastStableOffset (read_committed consumers)
- FetchHighWatermark => read up to the hw (ordinary consumer fetches)
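The mapping from isolation level to the upper read bound, as a sketch:

```python
# Sketch: the upper bound on the readable offset for each fetch isolation.
def max_readable_offset(isolation, hw, leo, last_stable_offset):
    if isolation == "FetchLogEnd":         # follower fetch
        return leo
    if isolation == "FetchHighWatermark":  # ordinary consumer
        return hw
    if isolation == "FetchTxnCommitted":   # read_committed consumer
        return last_stable_offset
    raise ValueError(f"unknown isolation: {isolation}")
```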
Moving on, first the local log read:
readFromLocalLog
def readFromLocalLog(replicaId: Int,
fetchOnlyFromLeader: Boolean,
fetchIsolation: FetchIsolation,
fetchMaxBytes: Int,
hardMaxBytesLimit: Boolean,
readPartitionInfo: Seq[(TopicPartition, PartitionData)],
quota: ReplicaQuota): Seq[(TopicPartition, LogReadResult)] = {
def read(tp: TopicPartition, fetchInfo: PartitionData, limitBytes: Int, minOneMessage: Boolean): LogReadResult = {
val offset = fetchInfo.fetchOffset
val partitionFetchSize = fetchInfo.maxBytes
val followerLogStartOffset = fetchInfo.logStartOffset
brokerTopicStats.topicStats(tp.topic).totalFetchRequestRate.mark()
brokerTopicStats.allTopicsStats.totalFetchRequestRate.mark()
try {
trace(s"Fetching log segment for partition $tp, offset $offset, partition fetch size $partitionFetchSize, " +
s"remaining response limit $limitBytes" +
(if (minOneMessage) s", ignoring response/partition size limits" else ""))
val partition = getPartitionOrException(tp, expectLeader = fetchOnlyFromLeader)
val adjustedMaxBytes = math.min(fetchInfo.maxBytes, limitBytes)
val fetchTimeMs = time.milliseconds
// Try the read first, this tells us whether we need all of adjustedFetchSize for this partition
val readInfo = partition.readRecords(
fetchOffset = fetchInfo.fetchOffset,
currentLeaderEpoch = fetchInfo.currentLeaderEpoch,
maxBytes = adjustedMaxBytes,
fetchIsolation = fetchIsolation,
fetchOnlyFromLeader = fetchOnlyFromLeader,
minOneMessage = minOneMessage)
val fetchDataInfo = if (shouldLeaderThrottle(quota, tp, replicaId)) {
// If the partition is being throttled, simply return an empty set.
FetchDataInfo(readInfo.fetchedData.fetchOffsetMetadata, MemoryRecords.EMPTY)
} else if (!hardMaxBytesLimit && readInfo.fetchedData.firstEntryIncomplete) {
// For FetchRequest version 3, we replace incomplete message sets with an empty one as consumers can make
// progress in such cases and don't need to report a `RecordTooLargeException`
FetchDataInfo(readInfo.fetchedData.fetchOffsetMetadata, MemoryRecords.EMPTY)
} else {
readInfo.fetchedData
}
LogReadResult(info = fetchDataInfo,
highWatermark = readInfo.highWatermark,
leaderLogStartOffset = readInfo.logStartOffset,
leaderLogEndOffset = readInfo.logEndOffset,
followerLogStartOffset = followerLogStartOffset,
fetchTimeMs = fetchTimeMs,
readSize = adjustedMaxBytes,
lastStableOffset = Some(readInfo.lastStableOffset),
exception = None)
} catch {
// NOTE: Failed fetch requests metric is not incremented for known exceptions since it
// is supposed to indicate un-expected failure of a broker in handling a fetch request
case e@ (_: UnknownTopicOrPartitionException |
_: NotLeaderForPartitionException |
_: UnknownLeaderEpochException |
_: FencedLeaderEpochException |
_: ReplicaNotAvailableException |
_: KafkaStorageException |
_: OffsetOutOfRangeException) =>
LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY),
highWatermark = -1L,
leaderLogStartOffset = -1L,
leaderLogEndOffset = -1L,
followerLogStartOffset = -1L,
fetchTimeMs = -1L,
readSize = 0,
lastStableOffset = None,
exception = Some(e))
case e: Throwable =>
brokerTopicStats.topicStats(tp.topic).failedFetchRequestRate.mark()
brokerTopicStats.allTopicsStats.failedFetchRequestRate.mark()
error(s"Error processing fetch operation on partition $tp, offset $offset", e)
LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY),
highWatermark = -1L,
leaderLogStartOffset = -1L,
leaderLogEndOffset = -1L,
followerLogStartOffset = -1L,
fetchTimeMs = -1L,
readSize = 0,
lastStableOffset = None,
exception = Some(e))
}
}
var limitBytes = fetchMaxBytes
val result = new mutable.ArrayBuffer[(TopicPartition, LogReadResult)]
var minOneMessage = !hardMaxBytesLimit
readPartitionInfo.foreach { case (tp, fetchInfo) =>
val readResult = read(tp, fetchInfo, limitBytes, minOneMessage)
val recordBatchSize = readResult.info.records.sizeInBytes
// Once we read from a non-empty partition, we stop ignoring request and partition level size limits
if (recordBatchSize > 0)
minOneMessage = false
limitBytes = math.max(0, limitBytes - recordBatchSize)
result += (tp -> readResult)
}
result
}
The real work happens in the inner read method:
- call readRecords on the partition to fetch the log data
- if the partition is throttled, return an empty record set
- if the first batch could not be read completely because it is too large, and the request allows it (!hardMaxBytesLimit, i.e. FetchRequest version >= 3), return an empty record set so the fetcher can make progress instead of surfacing a RecordTooLargeException
- otherwise return the fetched data
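The outer loop's byte accounting (limitBytes and minOneMessage) can be sketched as follows; read_fn is a stand-in for the inner read:

```python
# Sketch: reading several partitions under one shared byte budget.
# read_fn(tp, limit, min_one_message) -> number of bytes actually read.
def read_all(partitions, fetch_max_bytes, hard_limit, read_fn):
    limit = fetch_max_bytes
    # Unless the limit is hard, the first partition may return one
    # oversized message so the fetch always makes progress.
    min_one_message = not hard_limit
    results = {}
    for tp in partitions:
        n = read_fn(tp, limit, min_one_message)
        if n > 0:  # once data flows, stop ignoring the size limits
            min_one_message = False
        limit = max(0, limit - n)  # the budget never goes negative
        results[tp] = n
    return results
```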
Reading the log itself goes through the partition's readRecords method:
def readRecords(fetchOffset: Long,
currentLeaderEpoch: Optional[Integer],
maxBytes: Int,
fetchIsolation: FetchIsolation,
fetchOnlyFromLeader: Boolean,
minOneMessage: Boolean): LogReadInfo = inReadLock(leaderIsrUpdateLock) {
// decide whether to only fetch from leader
val localReplica = localReplicaWithEpochOrException(currentLeaderEpoch, fetchOnlyFromLeader)
/* Read the LogOffsetMetadata prior to performing the read from the log.
* We use the LogOffsetMetadata to determine if a particular replica is in-sync or not.
* Using the log end offset after performing the read can lead to a race condition
* where data gets appended to the log immediately after the replica has consumed from it
* This can cause a replica to always be out of sync.
*/
//determine the highest offset that may be fetched, based on fetchIsolation
val initialHighWatermark = localReplica.highWatermark.messageOffset
val initialLogStartOffset = localReplica.logStartOffset
val initialLogEndOffset = localReplica.logEndOffset.messageOffset
val initialLastStableOffset = localReplica.lastStableOffset.messageOffset
val maxOffsetOpt = fetchIsolation match {
case FetchLogEnd => None
case FetchHighWatermark => Some(initialHighWatermark)
case FetchTxnCommitted => Some(initialLastStableOffset)
}
val fetchedData = localReplica.log match {
case Some(log) =>
//read the log from the local disk
log.read(fetchOffset, maxBytes, maxOffsetOpt, minOneMessage,
includeAbortedTxns = fetchIsolation == FetchTxnCommitted)
case None =>
error(s"Leader does not have a local log")
FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY)
}
LogReadInfo(
fetchedData = fetchedData,
highWatermark = initialHighWatermark,
logStartOffset = initialLogStartOffset,
logEndOffset = initialLogEndOffset,
lastStableOffset = initialLastStableOffset)
}
That is it for readFromLocalLog.
updateFollowerLogReadResults
Notice that in readFromLog, readFromLocalLog is called first to read the local log, and then updateFollowerLogReadResults is called to update the fetch state the leader keeps for the follower:
def readFromLog(): Seq[(TopicPartition, LogReadResult)] = {
val result = readFromLocalLog(
replicaId = replicaId,
fetchOnlyFromLeader = fetchOnlyFromLeader,
fetchIsolation = fetchIsolation,
fetchMaxBytes = fetchMaxBytes,
hardMaxBytesLimit = hardMaxBytesLimit,
readPartitionInfo = fetchInfos,
quota = quota)
if (isFromFollower) updateFollowerLogReadResults(replicaId, result)
else result
}
Here is the updateFollowerLogReadResults method:
private def updateFollowerLogReadResults(replicaId: Int,
readResults: Seq[(TopicPartition, LogReadResult)]): Seq[(TopicPartition, LogReadResult)] = {
debug(s"Recording follower broker $replicaId log end offsets: $readResults")
readResults.map { case (topicPartition, readResult) =>
var updatedReadResult = readResult
nonOfflinePartition(topicPartition) match {
case Some(partition) =>
partition.getReplica(replicaId) match {
case Some(replica) =>
partition.updateReplicaLogReadResult(replica, readResult)
case None =>
warn(s"Leader $localBrokerId failed to record follower $replicaId's position " +
s"${readResult.info.fetchOffsetMetadata.messageOffset} since the replica is not recognized to be " +
s"one of the assigned replicas ${partition.assignedReplicas.map(_.brokerId).mkString(",")} " +
s"for partition $topicPartition. Empty records will be returned for this partition.")
updatedReadResult = readResult.withEmptyFetchInfo
}
case None =>
warn(s"While recording the replica LEO, the partition $topicPartition hasn't been created.")
}
topicPartition -> updatedReadResult
}
}
If the partition is still online, the replica object for replicaId is looked up and partition.updateReplicaLogReadResult(replica, readResult) is called to update that replica's fetch state (this is the per-follower fetch state kept on the leader):
def updateReplicaLogReadResult(replica: Replica, logReadResult: LogReadResult): Boolean = {
val replicaId = replica.brokerId
// No need to calculate low watermark if there is no delayed DeleteRecordsRequest
val oldLeaderLW = if (replicaManager.delayedDeleteRecordsPurgatory.delayed > 0) lowWatermarkIfLeader else -1L
replica.updateLogReadResult(logReadResult)
val newLeaderLW = if (replicaManager.delayedDeleteRecordsPurgatory.delayed > 0) lowWatermarkIfLeader else -1L
// check if the LW of the partition has incremented
// since the replica's logStartOffset may have incremented
val leaderLWIncremented = newLeaderLW > oldLeaderLW
// check if we need to expand ISR to include this replica
// if it is not in the ISR yet
val leaderHWIncremented = maybeExpandIsr(replicaId, logReadResult)
val result = leaderLWIncremented || leaderHWIncremented
// some delayed operations may be unblocked after HW or LW changed
if (result)
tryCompleteDelayedRequests()
debug(s"Recorded replica $replicaId log end offset (LEO) position ${logReadResult.info.fetchOffsetMetadata.messageOffset}.")
result
}
This does two things: it updates the replica's fetch state, and then checks whether the partition's watermarks have moved:
- the low watermark may advance, since the replica's logStartOffset may have grown
- the high watermark may advance, which also involves deciding whether the isr should be expanded
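The low-watermark side can be sketched as follows (a toy model: the lw is the minimum logStartOffset across replicas, and a follower's reported lso only moves it forward):

```python
# Sketch: recompute the partition low watermark after a follower
# reports a new logStartOffset. Names are illustrative.
def update_low_watermark(lsos, replica_id, new_lso):
    old_lw = min(lsos.values())
    lsos[replica_id] = max(lsos[replica_id], new_lso)  # lso only advances
    new_lw = min(lsos.values())
    return new_lw, new_lw > old_lw  # (current lw, did it increment?)
```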
The replica-side update is:
def updateLogReadResult(logReadResult: LogReadResult) {
if (logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.leaderLogEndOffset)
_lastCaughtUpTimeMs = math.max(_lastCaughtUpTimeMs, logReadResult.fetchTimeMs)
else if (logReadResult.info.fetchOffsetMetadata.messageOffset >= lastFetchLeaderLogEndOffset)
_lastCaughtUpTimeMs = math.max(_lastCaughtUpTimeMs, lastFetchTimeMs)
logStartOffset = logReadResult.followerLogStartOffset
logEndOffset = logReadResult.info.fetchOffsetMetadata
lastFetchLeaderLogEndOffset = logReadResult.leaderLogEndOffset
lastFetchTimeMs = logReadResult.fetchTimeMs
}
- if the fetched offset has reached the leader's current leo, advance _lastCaughtUpTimeMs to this fetch's time
- else if the fetched offset has reached the leader's leo as of the previous fetch, set _lastCaughtUpTimeMs to math.max(_lastCaughtUpTimeMs, lastFetchTimeMs)
- update logStartOffset and logEndOffset (logStartOffset is what the follower sent in its request; logEndOffset is the highest offset fetched this time)
A note on what _lastCaughtUpTimeMs means (this is my own reading). In kafka, "caught up" does not mean a follower must fully match the leader before it may update its catch-up time. Rather, _lastCaughtUpTimeMs records a time t such that the follower's current leo is at least the leader's leo as it stood at time t.
On the leader there is a sequence of leo values over fetch times:
(leo_1, fetch_1) -> (leo_2, fetch_2) -> ... -> (leo_x, fetch_x) -> (leo_{x+1}, fetch_{x+1}). If the offset a follower reaches at fetch_{x+1} satisfies offset >= leo_{x+1}, then naturally _lastCaughtUpTimeMs = fetch_{x+1}; failing that, if offset >= leo_x still holds, _lastCaughtUpTimeMs can at least be advanced to fetch_x. This is why each replica object remembers the leader's leo and the fetch time from the previous fetch. As a formula:
_lastCaughtUpTimeMs = max { fetch_t : leo(replica, now) >= leo(leader, fetch_t) }
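The update rule can be sketched as a small state machine (field names are ours, modeled on updateLogReadResult):

```python
# Sketch: a follower replica's catch-up bookkeeping on the leader side.
class ReplicaState:
    def __init__(self):
        self.last_caught_up_ms = 0
        self.last_fetch_leader_leo = 0
        self.last_fetch_time_ms = 0

    def on_fetch(self, fetched_offset, leader_leo, fetch_time_ms):
        if fetched_offset >= leader_leo:
            # caught up to the leader as of this fetch
            self.last_caught_up_ms = max(self.last_caught_up_ms,
                                         fetch_time_ms)
        elif fetched_offset >= self.last_fetch_leader_leo:
            # caught up to the leader as of the previous fetch
            self.last_caught_up_ms = max(self.last_caught_up_ms,
                                         self.last_fetch_time_ms)
        # remember the leader's leo and time for the next round
        self.last_fetch_leader_leo = leader_leo
        self.last_fetch_time_ms = fetch_time_ms
```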
After the replica's fetch state has been updated, the leader checks whether the isr information needs updating:
def maybeExpandIsr(replicaId: Int, logReadResult: LogReadResult): Boolean = {
inWriteLock(leaderIsrUpdateLock) {
// check if this replica needs to be added to the ISR
leaderReplicaIfLocal match {
case Some(leaderReplica) =>
val replica = getReplica(replicaId).get
val leaderHW = leaderReplica.highWatermark
val fetchOffset = logReadResult.info.fetchOffsetMetadata.messageOffset
if (!inSyncReplicas.contains(replica) &&
assignedReplicas.map(_.brokerId).contains(replicaId) &&
replica.logEndOffset.offsetDiff(leaderHW) >= 0 &&
leaderEpochStartOffsetOpt.exists(fetchOffset >= _)) {
val newInSyncReplicas = inSyncReplicas + replica
info(s"Expanding ISR from ${inSyncReplicas.map(_.brokerId).mkString(",")} " +
s"to ${newInSyncReplicas.map(_.brokerId).mkString(",")}")
// update ISR in ZK and cache
updateIsr(newInSyncReplicas)
replicaManager.isrExpandRate.mark()
}
// check if the HW of the partition can now be incremented
// since the replica may already be in the ISR and its LEO has just incremented
maybeIncrementLeaderHW(leaderReplica, logReadResult.fetchTimeMs)
case None => false // nothing to do if no longer leader
}
}
}
If the replica is not yet in the isr but its leo has reached the leader's highWatermark (replica.logEndOffset.offsetDiff(leaderHW) >= 0) and fetchOffset (= logReadResult.info.fetchOffsetMetadata.messageOffset) >= leaderEpochStartOffset, the replica is added to the isr and the new isr is written to zk. Finally the leader checks whether the hw can be incremented:
private def maybeIncrementLeaderHW(leaderReplica: Replica, curTime: Long = time.milliseconds): Boolean = {
val allLogEndOffsets = assignedReplicas.filter { replica =>
curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
}.map(_.logEndOffset)
//the hw is the minimum leo among the eligible replicas
val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering)
val oldHighWatermark = leaderReplica.highWatermark
// Ensure that the high watermark increases monotonically. We also update the high watermark when the new
// offset metadata is on a newer segment, which occurs whenever the log is rolled to a new segment.
if (oldHighWatermark.messageOffset < newHighWatermark.messageOffset ||
(oldHighWatermark.messageOffset == newHighWatermark.messageOffset && oldHighWatermark.onOlderSegment(newHighWatermark))) {
leaderReplica.highWatermark = newHighWatermark
debug(s"High watermark updated to $newHighWatermark")
true
} else {
def logEndOffsetString(r: Replica) = s"replica ${r.brokerId}: ${r.logEndOffset}"
debug(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old hw $oldHighWatermark. " +
s"All current LEOs are ${assignedReplicas.map(logEndOffsetString)}")
false
}
}
Note the extra condition used when collecting allLogEndOffsets:
val allLogEndOffsets = assignedReplicas.filter { replica =>
curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
}.map(_.logEndOffset)
This shows that the replicas eligible to influence the hw are not only the members of the isr, but also any replica satisfying curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs. This guards against the situation where the isr has shrunk to just the leader while the other followers are still catching up: if the hw were advanced using only the isr members' leo (at that point just the leader's own leo), the other followers could stay behind the hw forever and the isr would remain leader-only.
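Putting the pieces together, the hw advance in maybeIncrementLeaderHW can be sketched as follows (segment metadata and the monotonicity edge case around log rolls omitted):

```python
# Sketch: the new hw is the minimum leo among replicas that are either
# in the isr or have caught up within replica.lag.time.max.ms.
def maybe_new_hw(replicas, isr, now_ms, lag_max_ms, old_hw):
    eligible = [r for r in replicas
                if r["id"] in isr
                or now_ms - r["last_caught_up_ms"] <= lag_max_ms]
    new_hw = min(r["leo"] for r in eligible)
    # the hw only ever moves forward
    return new_hw if new_hw > old_hw else old_hw
```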