kafka server - replica synchronization mechanism

As we know, Kafka guarantees high availability of messages through its replica synchronization mechanism. Replicas in Kafka are defined per partition: each partition has one leader and several followers, and each follower keeps fetching messages from the leader so that it stays ready to take over as the new leader at any time. So which followers are eligible to become the leader? Kafka's ISR mechanism ensures that at any moment there are followers that can be elected leader. All of these details live inside Kafka's replica synchronization mechanism. Before introducing it, let's first take a close look at the Partition and Replica classes.

Related classes

Partition

Kafka's Partition class abstracts a physical topic partition. The kafka server on each broker holds a number of Partition objects, each caching information about that partition's replicas. Partition's member variables include:

  • topic
  • partitionId
  • allReplicasMap: Pool[Int, Replica]; all replicas of this partition (both those already assigned and those currently being assigned to it)
  • leaderIsrUpdateLock: new ReentrantReadWriteLock;
  • zkVersion
  • leaderEpoch
  • leaderEpochStartOffsetOpt: the start offset of the current leader epoch -> only the leader has this information
  • leaderReplicaIdOpt
  • inSyncReplicas
  • controllerEpoch
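Collecting these fields, a minimal skeleton might look like the sketch below. This is illustrative only, not the real kafka.cluster.Partition: kafka's Pool is approximated with a mutable Map and Replica is stubbed out.

```scala
// Simplified sketch of the Partition fields listed above (illustrative,
// not the real kafka.cluster.Partition).
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

case class Replica(brokerId: Int) // stand-in for kafka.cluster.Replica

class Partition(val topic: String, val partitionId: Int) {
  // all replicas assigned (or being assigned) to this partition, keyed by broker id
  val allReplicasMap = mutable.Map.empty[Int, Replica]
  // guards reads and writes of leader/ISR state
  val leaderIsrUpdateLock = new ReentrantReadWriteLock
  var zkVersion: Int = 0
  var leaderEpoch: Int = -1
  // start offset of the current leader epoch; populated only on the leader
  var leaderEpochStartOffsetOpt: Option[Long] = None
  var leaderReplicaIdOpt: Option[Int] = None
  var inSyncReplicas: Set[Replica] = Set.empty
  var controllerEpoch: Int = -1
}
```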

Replica

A Replica represents one copy of a partition; each Replica object corresponds to one broker hosting that copy. Replica's member variables include:

  • highWatermarkMetadata
  • logEndOffsetMetadata
  • _logStartOffset
  • lastFetchLeaderLogEndOffset
  • lastFetchTimeMs
  • _lastCaughtUpTimeMs

A server keeps all replicas of a partition: some are local (the broker itself), others are remote. Replica's constructor takes an Option[Log]; if the log is undefined, the replica is a remote one, otherwise it is local.
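A minimal sketch of that local/remote distinction (Log is stubbed out here; this is not the real class):

```scala
// Sketch of how the Option[Log] constructor argument distinguishes a local
// replica from a remote one (Log stands in for kafka.log.Log).
class Log // placeholder

class Replica(val brokerId: Int, val log: Option[Log] = None) {
  // a replica with a backing log lives on this broker; one without is remote
  def isLocal: Boolean = log.isDefined
}
```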

As we can see, Replica keeps a lot of offset information related to message fetching, mainly:

  • hw(highWaterMark)
  • leo(logEndOffset)
  • lso(logStartOffset)

Grouped by which replica owns them, the offsets fall into three categories:

  • leader
  • local replica
  • remote replica

So when is each of these offsets updated? Next we look at how a follower replica fetches messages from the leader and keeps itself in sync.
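Note the invariant these offsets always satisfy on a single replica: logStartOffset <= highWatermark <= logEndOffset. A hedged illustration (not Kafka code):

```scala
// On any replica the three offsets are ordered:
// logStartOffset <= highWatermark <= logEndOffset
case class OffsetState(logStartOffset: Long, highWatermark: Long, logEndOffset: Long) {
  require(logStartOffset <= highWatermark && highWatermark <= logEndOffset,
    s"offset invariant violated: $this")
}
```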

RPC between replicas and the controller

When a Kafka cluster starts up, the KafkaController component is started. It plays the role of the master in a distributed system: it drives leader election among a partition's replicas and announces the result to all replicas. When a broker is elected controller, it registers a BrokerChange listener on /brokers/ids in ZooKeeper; whenever that node's children change (a broker joins or goes offline), a designated callback is invoked. We covered the controller's details in earlier articles, so here we only look at the replica-related parts.

When a new broker starts, the controller's onBrokerStartup method is invoked. Its job is to:

  • Send broker metadata to all brokers, so that existing brokers can discover the new one
  • Trigger replica state changes. If the new broker hosts a replica of some partition, set that replica's state to online (tracked in the replica state machine); and if the controller context already holds the partition's leader information, propagate the replica and leader info to all relevant brokers
  • Trigger partition state changes: move every partition in the offline or new state to online, which involves leader election for those partitions, and then send the election result to the replicas and the leader
  • Check whether any reassignment in ZooKeeper involves the new broker, and if so execute the reassignment for those partitions.
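The four steps above can be sketched as follows. The method names are illustrative stand-ins, not the real KafkaController API; each step just records itself so the sequencing is visible.

```scala
// Hedged pseudocode of the four onBrokerStartup steps (illustrative names).
object ControllerSketch {
  val steps = scala.collection.mutable.ListBuffer.empty[String]

  def sendUpdateMetadataRequest(brokers: Set[Int]): Unit = steps += "updateMetadata"
  def setReplicasOnline(brokerIds: Set[Int]): Unit = steps += "replicasOnline"
  def triggerOnlinePartitionStateChange(): Unit = steps += "partitionsOnline" // includes leader election
  def maybeResumeReassignments(brokerIds: Set[Int]): Unit = steps += "reassignments"

  def onBrokerStartup(newBrokerIds: Set[Int], allBrokerIds: Set[Int]): Unit = {
    sendUpdateMetadataRequest(allBrokerIds) // 1. existing brokers discover the new ones
    setReplicasOnline(newBrokerIds)         // 2. replicas hosted on the new brokers go online
    triggerOnlinePartitionStateChange()     // 3. new/offline partitions go online (leader election)
    maybeResumeReassignments(newBrokerIds)  // 4. resume reassignments involving the new brokers
  }
}
```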

A new broker starting up is only one of the triggers for leader and ISR changes: any event that touches a replica may trigger partition re-election and ISR changes. In Kafka, the controller sends replica-related partition information through a ControllerBrokerRequestBatch object, which has the following member variables:

  • controllerContext: the controller's context
  • controllerId: the controller's id
  • leaderAndIsrRequestMap: mutable.Map.empty[Int, mutable.Map[TopicPartition, LeaderAndIsrRequest.PartitionState]]
  • stopReplicaRequestMap: mutable.Map.empty[Int, Seq[StopReplicaRequestInfo]]
  • updateMetadataRequestBrokerSet
  • updateMetadataRequestPartitionInfoMap: a map from each partition to its PartitionState

When sending a leaderAndIsrRequest:

  • leaderAndIsrRequestMap gains, for each brokerId, that broker's partitions and their PartitionState information
  • updateMetadataRequestBrokerSet gains the brokers whose metadata this request affects (at startup, the newly joined broker)
  • updateMetadataRequestPartitionInfoMap gains the partition and PartitionState information

Once the request parameters are built, the send method is called:

def sendRequestsToBrokers(controllerEpoch: Int) {
    try {
      val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerEpoch)

      val leaderAndIsrRequestVersion: Short =
        if (controller.config.interBrokerProtocolVersion >= KAFKA_1_0_IV0) 1
        else 0

      leaderAndIsrRequestMap.foreach { case (broker, leaderAndIsrPartitionStates) =>
        leaderAndIsrPartitionStates.foreach { case (topicPartition, state) =>
          val typeOfRequest =
            if (broker == state.basePartitionState.leader) "become-leader"
            else "become-follower"
          stateChangeLog.trace(s"Sending $typeOfRequest LeaderAndIsr request $state to broker $broker for partition $topicPartition")
        }
        val leaderIds = leaderAndIsrPartitionStates.map(_._2.basePartitionState.leader).toSet
        val leaders = controllerContext.liveOrShuttingDownBrokers.filter(b => leaderIds.contains(b.id)).map {
          _.node(controller.config.interBrokerListenerName)
        }
        val leaderAndIsrRequestBuilder = new LeaderAndIsrRequest.Builder(leaderAndIsrRequestVersion, controllerId,
          controllerEpoch, leaderAndIsrPartitionStates.asJava, leaders.asJava)
        controller.sendRequest(broker, ApiKeys.LEADER_AND_ISR, leaderAndIsrRequestBuilder,
          (r: AbstractResponse) => controller.eventManager.put(controller.LeaderAndIsrResponseReceived(r, broker)))
      }
      leaderAndIsrRequestMap.clear()

      updateMetadataRequestPartitionInfoMap.foreach { case (tp, partitionState) =>
        stateChangeLog.trace(s"Sending UpdateMetadata request $partitionState to brokers $updateMetadataRequestBrokerSet " +
          s"for partition $tp")
      }

      val partitionStates = Map.empty ++ updateMetadataRequestPartitionInfoMap
      val updateMetadataRequestVersion: Short =
        if (controller.config.interBrokerProtocolVersion >= KAFKA_1_0_IV0) 4
        else if (controller.config.interBrokerProtocolVersion >= KAFKA_0_10_2_IV0) 3
        else if (controller.config.interBrokerProtocolVersion >= KAFKA_0_10_0_IV1) 2
        else if (controller.config.interBrokerProtocolVersion >= KAFKA_0_9_0) 1
        else 0

      val updateMetadataRequest = {
        val liveBrokers = if (updateMetadataRequestVersion == 0) {
          // Version 0 of UpdateMetadataRequest only supports PLAINTEXT.
          controllerContext.liveOrShuttingDownBrokers.map { broker =>
            val securityProtocol = SecurityProtocol.PLAINTEXT
            val listenerName = ListenerName.forSecurityProtocol(securityProtocol)
            val node = broker.node(listenerName)
            val endPoints = Seq(new EndPoint(node.host, node.port, securityProtocol, listenerName))
            new UpdateMetadataRequest.Broker(broker.id, endPoints.asJava, broker.rack.orNull)
          }
        } else {
          controllerContext.liveOrShuttingDownBrokers.map { broker =>
            val endPoints = broker.endPoints.map { endPoint =>
              new UpdateMetadataRequest.EndPoint(endPoint.host, endPoint.port, endPoint.securityProtocol, endPoint.listenerName)
            }
            new UpdateMetadataRequest.Broker(broker.id, endPoints.asJava, broker.rack.orNull)
          }
        }
        new UpdateMetadataRequest.Builder(updateMetadataRequestVersion, controllerId, controllerEpoch, partitionStates.asJava,
          liveBrokers.asJava)
      }

      updateMetadataRequestBrokerSet.foreach { broker =>
        controller.sendRequest(broker, ApiKeys.UPDATE_METADATA, updateMetadataRequest, null)
      }
      updateMetadataRequestBrokerSet.clear()
      updateMetadataRequestPartitionInfoMap.clear()

      stopReplicaRequestMap.foreach { case (broker, replicaInfoList) =>
        val stopReplicaWithDelete = replicaInfoList.filter(_.deletePartition).map(_.replica).toSet
        val stopReplicaWithoutDelete = replicaInfoList.filterNot(_.deletePartition).map(_.replica).toSet
        debug(s"The stop replica request (delete = true) sent to broker $broker is ${stopReplicaWithDelete.mkString(",")}")
        debug(s"The stop replica request (delete = false) sent to broker $broker is ${stopReplicaWithoutDelete.mkString(",")}")

        val (replicasToGroup, replicasToNotGroup) = replicaInfoList.partition(r => !r.deletePartition && r.callback == null)

        // Send one StopReplicaRequest for all partitions that require neither delete nor callback. This potentially
        // changes the order in which the requests are sent for the same partitions, but that's OK.
        val stopReplicaRequest = new StopReplicaRequest.Builder(controllerId, controllerEpoch, false,
          replicasToGroup.map(_.replica.topicPartition).toSet.asJava)
        controller.sendRequest(broker, ApiKeys.STOP_REPLICA, stopReplicaRequest)

        replicasToNotGroup.foreach { r =>
          val stopReplicaRequest = new StopReplicaRequest.Builder(
              controllerId, controllerEpoch, r.deletePartition,
              Set(r.replica.topicPartition).asJava)
          controller.sendRequest(broker, ApiKeys.STOP_REPLICA, stopReplicaRequest, r.callback)
        }
      }
      stopReplicaRequestMap.clear()
    } catch {
      case e: Throwable =>
        if (leaderAndIsrRequestMap.nonEmpty) {
          error("Haven't been able to send leader and isr requests, current state of " +
              s"the map is $leaderAndIsrRequestMap. Exception message: $e")
        }
        if (updateMetadataRequestBrokerSet.nonEmpty) {
          error(s"Haven't been able to send metadata update requests to brokers $updateMetadataRequestBrokerSet, " +
                s"current state of the partition info is $updateMetadataRequestPartitionInfoMap. Exception message: $e")
        }
        if (stopReplicaRequestMap.nonEmpty) {
          error("Haven't been able to send stop replica requests, current state of " +
              s"the map is $stopReplicaRequestMap. Exception message: $e")
        }
        throw new IllegalStateException(e)
    }
  }

While reading this code, keep asking two questions: who is each request sent to, and what does it contain? First look at the leaderAndIsrRequest:

val leaderAndIsrRequestBuilder = new LeaderAndIsrRequest.Builder(leaderAndIsrRequestVersion, controllerId,
          controllerEpoch, leaderAndIsrPartitionStates.asJava, leaders.asJava)
controller.sendRequest(broker, ApiKeys.LEADER_AND_ISR, leaderAndIsrRequestBuilder,
          (r: AbstractResponse) => controller.eventManager.put(controller.LeaderAndIsrResponseReceived(r, broker)))

The target brokers of this request are the keys of leaderAndIsrRequestMap, i.e. the brokers hosting replicas whose state changed; the payload is the partition state information for those replicas.

Once the leader and ISR information has been delivered, the controller then sends the update-metadata request:

val updateMetadataRequest = {
        val liveBrokers = if (updateMetadataRequestVersion == 0) {
          // Version 0 of UpdateMetadataRequest only supports PLAINTEXT.
          controllerContext.liveOrShuttingDownBrokers.map { broker =>
            val securityProtocol = SecurityProtocol.PLAINTEXT
            val listenerName = ListenerName.forSecurityProtocol(securityProtocol)
            val node = broker.node(listenerName)
            val endPoints = Seq(new EndPoint(node.host, node.port, securityProtocol, listenerName))
            new UpdateMetadataRequest.Broker(broker.id, endPoints.asJava, broker.rack.orNull)
          }
        } else {
          controllerContext.liveOrShuttingDownBrokers.map { broker =>
            val endPoints = broker.endPoints.map { endPoint =>
              new UpdateMetadataRequest.EndPoint(endPoint.host, endPoint.port, endPoint.securityProtocol, endPoint.listenerName)
            }
            new UpdateMetadataRequest.Broker(broker.id, endPoints.asJava, broker.rack.orNull)
          }
        }
        new UpdateMetadataRequest.Builder(updateMetadataRequestVersion, controllerId, controllerEpoch, partitionStates.asJava,
          liveBrokers.asJava)
      }

      updateMetadataRequestBrokerSet.foreach { broker =>
        controller.sendRequest(broker, ApiKeys.UPDATE_METADATA, updateMetadataRequest, null)
      }

As we can see, the update-metadata request goes to all live brokers, and its payload is the latest cluster metadata: each broker's endpoints and rack, plus the partition states.

At this point the controller-side work is complete. It did two things:

  1. when an event affecting replica state occurs (such as a broker startup), update the controller context's cache
  2. send the leaderAndIsr and updateMetadata requests to the brokers

LeaderAndIsr API

The handler method for the leaderAndIsr API is:

def handleLeaderAndIsrRequest(request: RequestChannel.Request) {
    // ensureTopicExists is only for client facing requests
    // We can't have the ensureTopicExists check here since the controller sends it as an advisory to all brokers so they
    // stop serving data to clients for the topic being deleted
    val correlationId = request.header.correlationId
    val leaderAndIsrRequest = request.body[LeaderAndIsrRequest]

    def onLeadershipChange(updatedLeaders: Iterable[Partition], updatedFollowers: Iterable[Partition]) {
      // for each new leader or follower, call coordinator to handle consumer group migration.
      // this callback is invoked under the replica state change lock to ensure proper order of
      // leadership changes
      updatedLeaders.foreach { partition =>
        if (partition.topic == GROUP_METADATA_TOPIC_NAME)
          groupCoordinator.handleGroupImmigration(partition.partitionId)
        else if (partition.topic == TRANSACTION_STATE_TOPIC_NAME)
          txnCoordinator.handleTxnImmigration(partition.partitionId, partition.getLeaderEpoch)
      }

      updatedFollowers.foreach { partition =>
        if (partition.topic == GROUP_METADATA_TOPIC_NAME)
          groupCoordinator.handleGroupEmigration(partition.partitionId)
        else if (partition.topic == TRANSACTION_STATE_TOPIC_NAME)
          txnCoordinator.handleTxnEmigration(partition.partitionId, partition.getLeaderEpoch)
      }
    }

    if (isAuthorizedClusterAction(request)) {
      val response = replicaManager.becomeLeaderOrFollower(correlationId, leaderAndIsrRequest, onLeadershipChange)
      sendResponseExemptThrottle(request, response)
    } else {
      sendResponseMaybeThrottle(request, throttleTimeMs => leaderAndIsrRequest.getErrorResponse(throttleTimeMs,
        Errors.CLUSTER_AUTHORIZATION_FAILED.exception))
    }
  }

If the cluster-action authorization succeeds, replicaManager's becomeLeaderOrFollower method is executed:

def becomeLeaderOrFollower(correlationId: Int,
                            leaderAndIsrRequest: LeaderAndIsrRequest,
                            onLeadershipChange: (Iterable[Partition], Iterable[Partition]) => Unit): LeaderAndIsrResponse = {
   leaderAndIsrRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
     stateChangeLogger.trace(s"Received LeaderAndIsr request $stateInfo " +
       s"correlation id $correlationId from controller ${leaderAndIsrRequest.controllerId} " +
       s"epoch ${leaderAndIsrRequest.controllerEpoch} for partition $topicPartition")
   }
   replicaStateChangeLock synchronized {
      // ignore a request carrying a stale controller epoch
     if (leaderAndIsrRequest.controllerEpoch < controllerEpoch) {
       stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from controller ${leaderAndIsrRequest.controllerId} with " +
         s"correlation id $correlationId since its controller epoch ${leaderAndIsrRequest.controllerEpoch} is old. " +
         s"Latest known controller epoch is $controllerEpoch")
       leaderAndIsrRequest.getErrorResponse(0, Errors.STALE_CONTROLLER_EPOCH.exception)
     } else {
       val responseMap = new mutable.HashMap[TopicPartition, Errors]
       // update the cached controller info
       val controllerId = leaderAndIsrRequest.controllerId
       controllerEpoch = leaderAndIsrRequest.controllerEpoch

       // First check partition's leader epoch
       val partitionState = new mutable.HashMap[Partition, LeaderAndIsrRequest.PartitionState]()
       val newPartitions = new mutable.HashSet[Partition]

       leaderAndIsrRequest.partitionStates.asScala.foreach { case (topicPartition, stateInfo) =>
         // update the broker's cached partition list
         val partition = getPartition(topicPartition).getOrElse {
           val createdPartition = getOrCreatePartition(topicPartition)
           newPartitions.add(createdPartition)
           createdPartition
         }
         val currentLeaderEpoch = partition.getLeaderEpoch
         val requestLeaderEpoch = stateInfo.basePartitionState.leaderEpoch
         if (partition eq ReplicaManager.OfflinePartition) {
           stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from " +
             s"controller $controllerId with correlation id $correlationId " +
             s"epoch $controllerEpoch for partition $topicPartition as the local replica for the " +
             "partition is in an offline log directory")
           responseMap.put(topicPartition, Errors.KAFKA_STORAGE_ERROR)
         } else if (requestLeaderEpoch > currentLeaderEpoch) {
           // If the leader epoch is valid record the epoch of the controller that made the leadership decision.
           // This is useful while updating the isr to maintain the decision maker controller's epoch in the zookeeper path
           if(stateInfo.basePartitionState.replicas.contains(localBrokerId))
             partitionState.put(partition, stateInfo)
           else {
             stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from controller $controllerId with " +
               s"correlation id $correlationId epoch $controllerEpoch for partition $topicPartition as itself is not " +
               s"in assigned replica list ${stateInfo.basePartitionState.replicas.asScala.mkString(",")}")
             responseMap.put(topicPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION)
           }
         } else {
           // Otherwise record the error code in response
           stateChangeLogger.warn(s"Ignoring LeaderAndIsr request from " +
             s"controller $controllerId with correlation id $correlationId " +
             s"epoch $controllerEpoch for partition $topicPartition since its associated " +
             s"leader epoch $requestLeaderEpoch is not higher than the current " +
             s"leader epoch $currentLeaderEpoch")
           responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH)
         }
       }
       
       // partitions for which this broker is to become the leader
       val partitionsTobeLeader = partitionState.filter { case (_, stateInfo) =>
         stateInfo.basePartitionState.leader == localBrokerId
       }
       // partitions for which this broker is to become a follower
       val partitionsToBeFollower = partitionState -- partitionsTobeLeader.keys

       val partitionsBecomeLeader = if (partitionsTobeLeader.nonEmpty)
         makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, correlationId, responseMap)
       else
         Set.empty[Partition]
       val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
         makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap)
       else
         Set.empty[Partition]

       leaderAndIsrRequest.partitionStates.asScala.keys.foreach { topicPartition =>
         /*
          * If there is offline log directory, a Partition object may have been created by getOrCreatePartition()
          * before getOrCreateReplica() failed to create local replica due to KafkaStorageException.
          * In this case ReplicaManager.allPartitions will map this topic-partition to an empty Partition object.
          * we need to map this topic-partition to OfflinePartition instead.
          */
         if (localReplica(topicPartition).isEmpty && (allPartitions.get(topicPartition) ne ReplicaManager.OfflinePartition))
           allPartitions.put(topicPartition, ReplicaManager.OfflinePartition)
       }

       // we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
       // have been completely populated before starting the checkpointing there by avoiding weird race conditions
       if (!hwThreadInitialized) {
         startHighWaterMarksCheckPointThread()
         hwThreadInitialized = true
       }

       val futureReplicasAndInitialOffset = new mutable.HashMap[TopicPartition, InitialFetchState]
       for (partition <- newPartitions) {
         val topicPartition = partition.topicPartition
         if (logManager.getLog(topicPartition, isFuture = true).isDefined) {
           partition.localReplica.foreach { replica =>
             val leader = BrokerEndPoint(config.brokerId, "localhost", -1)

             // Add future replica to partition's map
             partition.getOrCreateReplica(Request.FutureLocalReplicaId, isNew = false)

             // pause cleaning for partitions that are being moved and start ReplicaAlterDirThread to move
             // replica from source dir to destination dir
             logManager.abortAndPauseCleaning(topicPartition)

             futureReplicasAndInitialOffset.put(topicPartition, InitialFetchState(leader,
               partition.getLeaderEpoch, replica.highWatermark.messageOffset))
           }
         }
       }
       replicaAlterLogDirsManager.addFetcherForPartitions(futureReplicasAndInitialOffset)

       replicaFetcherManager.shutdownIdleFetcherThreads()
       replicaAlterLogDirsManager.shutdownIdleFetcherThreads()
       onLeadershipChange(partitionsBecomeLeader, partitionsBecomeFollower)
       new LeaderAndIsrResponse(Errors.NONE, responseMap.asJava)
     }
   }
 }

In this method, the broker takes a different path for each partition depending on its role: if it is the partition's leader it runs makeLeaders, otherwise it runs makeFollowers. makeFollowers also starts the threads that fetch from the leader, which we will come back to. After that, replicaManager checks whether the HW checkpoint thread has been started, and starts it if not.

makeLeaders

Its implementation is:

private def makeLeaders(controllerId: Int,
                         epoch: Int,
                         partitionState: Map[Partition, LeaderAndIsrRequest.PartitionState],
                         correlationId: Int,
                         responseMap: mutable.Map[TopicPartition, Errors]): Set[Partition] = {
   partitionState.keys.foreach { partition =>
     stateChangeLogger.trace(s"Handling LeaderAndIsr request correlationId $correlationId from " +
       s"controller $controllerId epoch $epoch starting the become-leader transition for " +
       s"partition ${partition.topicPartition}")
   }

   for (partition <- partitionState.keys)
     responseMap.put(partition.topicPartition, Errors.NONE)

   val partitionsToMakeLeaders = mutable.Set[Partition]()

   try {
     // First stop fetchers for all the partitions
     replicaFetcherManager.removeFetcherForPartitions(partitionState.keySet.map(_.topicPartition))
     // Update the partition information to be the leader
     partitionState.foreach{ case (partition, partitionStateInfo) =>
       try {
          // invoke the partition's makeLeader method
         if (partition.makeLeader(controllerId, partitionStateInfo, correlationId)) {
           partitionsToMakeLeaders += partition
           stateChangeLogger.trace(s"Stopped fetchers as part of become-leader request from " +
             s"controller $controllerId epoch $epoch with correlation id $correlationId for partition ${partition.topicPartition} " +
             s"(last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch})")
         } else
           stateChangeLogger.info(s"Skipped the become-leader state change after marking its " +
             s"partition as leader with correlation id $correlationId from controller $controllerId epoch $epoch for " +
             s"partition ${partition.topicPartition} (last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) " +
             s"since it is already the leader for the partition.")
       } catch {
         case e: KafkaStorageException =>
           stateChangeLogger.error(s"Skipped the become-leader state change with " +
             s"correlation id $correlationId from controller $controllerId epoch $epoch for partition ${partition.topicPartition} " +
             s"(last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) since " +
             s"the replica for the partition is offline due to disk error $e")
           val dirOpt = getLogDir(partition.topicPartition)
           error(s"Error while making broker the leader for partition $partition in dir $dirOpt", e)
           responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
       }
     }

   } catch {
     case e: Throwable =>
       partitionState.keys.foreach { partition =>
         stateChangeLogger.error(s"Error while processing LeaderAndIsr request correlationId $correlationId received " +
           s"from controller $controllerId epoch $epoch for partition ${partition.topicPartition}", e)
       }
       // Re-throw the exception for it to be caught in KafkaApis
       throw e
   }

   partitionState.keys.foreach { partition =>
     stateChangeLogger.trace(s"Completed LeaderAndIsr request correlationId $correlationId from controller $controllerId " +
       s"epoch $epoch for the become-leader transition for partition ${partition.topicPartition}")
   }

   partitionsToMakeLeaders
 }

A partition's makeLeader is somewhat more involved than the makeFollower we will look at later:

def makeLeader(controllerId: Int, partitionStateInfo: LeaderAndIsrRequest.PartitionState, correlationId: Int): Boolean = {
   val (leaderHWIncremented, isNewLeader) = inWriteLock(leaderIsrUpdateLock) {
      // the replicas newly assigned to this partition
     val newAssignedReplicas = partitionStateInfo.basePartitionState.replicas.asScala.map(_.toInt)
     // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
     // to maintain the decision maker controller's epoch in the zookeeper path
     controllerEpoch = partitionStateInfo.basePartitionState.controllerEpoch
     // add replicas that are new
     val newInSyncReplicas = partitionStateInfo.basePartitionState.isr.asScala.map(r => getOrCreateReplica(r, partitionStateInfo.isNew)).toSet
     // remove assigned replicas that have been removed by the controller
     (assignedReplicas.map(_.brokerId) -- newAssignedReplicas).foreach(removeReplica)
      // update the ISR to the replicas carried in the request
     inSyncReplicas = newInSyncReplicas
     newAssignedReplicas.foreach(id => getOrCreateReplica(id, partitionStateInfo.isNew))

     val leaderReplica = localReplicaOrException
      // use this replica's LEO as the start offset of the new leader epoch
     val leaderEpochStartOffset = leaderReplica.logEndOffset.messageOffset
     info(s"$topicPartition starts at Leader Epoch ${partitionStateInfo.basePartitionState.leaderEpoch} from " +
       s"offset $leaderEpochStartOffset. Previous Leader Epoch was: $leaderEpoch")

     //We cache the leader epoch here, persisting it only if it's local (hence having a log dir)
     leaderEpoch = partitionStateInfo.basePartitionState.leaderEpoch
     leaderEpochStartOffsetOpt = Some(leaderEpochStartOffset)
     zkVersion = partitionStateInfo.basePartitionState.zkVersion

     // In the case of successive leader elections in a short time period, a follower may have
     // entries in its log from a later epoch than any entry in the new leader's log. In order
     // to ensure that these followers can truncate to the right offset, we must cache the new
     // leader epoch and the start offset since it should be larger than any epoch that a follower
     // would try to query.
      // here each leader epoch is cached together with its start offset
     leaderReplica.epochs.foreach { epochCache =>
       epochCache.assign(leaderEpoch, leaderEpochStartOffset)
     }

     val isNewLeader = !leaderReplicaIdOpt.contains(localBrokerId)
     val curLeaderLogEndOffset = leaderReplica.logEndOffset.messageOffset
     val curTimeMs = time.milliseconds
     // initialize lastCaughtUpTime of replicas as well as their lastFetchTimeMs and lastFetchLeaderLogEndOffset.
     (assignedReplicas - leaderReplica).foreach { replica =>
       val lastCaughtUpTimeMs = if (inSyncReplicas.contains(replica)) curTimeMs else 0L
        // refresh each replica's fetch state: the leader LEO recorded at each replica becomes
        // curLeaderLogEndOffset, lastFetchTimeMs becomes the current time, and lastCaughtUpTimeMs
        // becomes the current time for ISR members, 0 otherwise
       replica.resetLastCaughtUpTime(curLeaderLogEndOffset, curTimeMs, lastCaughtUpTimeMs)
     }

      // if this broker is newly becoming the partition's leader
     if (isNewLeader) {
       // construct the high watermark metadata for the new leader replica
       leaderReplica.convertHWToLocalOffsetMetadata()
       // mark local replica as the leader after converting hw
       leaderReplicaIdOpt = Some(localBrokerId)
       // reset log end offset for remote replicas
       assignedReplicas.filter(_.brokerId != localBrokerId).foreach(_.updateLogReadResult(LogReadResult.UnknownLogReadResult))
     }
     // we may need to increment high watermark since ISR could be down to 1
     (maybeIncrementLeaderHW(leaderReplica), isNewLeader)
   }
   // some delayed operations may be unblocked after HW changed
   if (leaderHWIncremented)
     tryCompleteDelayedRequests()
   isNewLeader
 }

The partition leader keeps track of every other replica's fetch progress, mainly through:

  • _lastCaughtUpTimeMs
  • lastFetchLeaderLogEndOffset
  • lastFetchTimeMs
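These timestamps are what the leader later uses to shrink the ISR: a follower whose _lastCaughtUpTimeMs is older than replica.lag.time.max.ms is considered out of sync. A hedged sketch of that rule (modeled loosely on Partition.getOutOfSyncReplicas; ReplicaLag is a stand-in type):

```scala
// Simplified sketch of the replica.lag.time.max.ms rule the leader applies
// using each follower's lastCaughtUpTimeMs.
case class ReplicaLag(brokerId: Int, lastCaughtUpTimeMs: Long)

def outOfSyncReplicas(followers: Set[ReplicaLag], nowMs: Long, maxLagMs: Long): Set[ReplicaLag] =
  followers.filter(r => nowMs - r.lastCaughtUpTimeMs > maxLagMs)
```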

makeFollowers

In makeFollowers, the broker marks itself as a follower of the partition and starts the fetch threads. The implementation is:

private def makeFollowers(controllerId: Int,
                           epoch: Int,
                           partitionStates: Map[Partition, LeaderAndIsrRequest.PartitionState],
                           correlationId: Int,
                           responseMap: mutable.Map[TopicPartition, Errors]) : Set[Partition] = {
   partitionStates.foreach { case (partition, partitionState) =>
     stateChangeLogger.trace(s"Handling LeaderAndIsr request correlationId $correlationId from controller $controllerId " +
       s"epoch $epoch starting the become-follower transition for partition ${partition.topicPartition} with leader " +
       s"${partitionState.basePartitionState.leader}")
   }

   for (partition <- partitionStates.keys)
     responseMap.put(partition.topicPartition, Errors.NONE)

   val partitionsToMakeFollower: mutable.Set[Partition] = mutable.Set()

   try {
     // TODO: Delete leaders from LeaderAndIsrRequest
     partitionStates.foreach { case (partition, partitionStateInfo) =>
       val newLeaderBrokerId = partitionStateInfo.basePartitionState.leader
       try {
         metadataCache.getAliveBrokers.find(_.id == newLeaderBrokerId) match {
           // Only change partition state when the leader is available
           case Some(_) =>
              // the leader is in the live broker list: call the partition's makeFollower method to
              // initialize follower state; it returns true/false, indicating whether partition
              // leadership actually changed
             if (partition.makeFollower(controllerId, partitionStateInfo, correlationId))
               partitionsToMakeFollower += partition
             else
               stateChangeLogger.info(s"Skipped the become-follower state change after marking its partition as " +
                 s"follower with correlation id $correlationId from controller $controllerId epoch $epoch " +
                 s"for partition ${partition.topicPartition} (last update " +
                 s"controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) " +
                 s"since the new leader $newLeaderBrokerId is the same as the old leader")
           case None =>
             // The leader broker should always be present in the metadata cache.
             // If not, we should record the error message and abort the transition process for this partition
             stateChangeLogger.error(s"Received LeaderAndIsrRequest with correlation id $correlationId from " +
               s"controller $controllerId epoch $epoch for partition ${partition.topicPartition} " +
               s"(last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) " +
               s"but cannot become follower since the new leader $newLeaderBrokerId is unavailable.")
             // Create the local replica even if the leader is unavailable. This is required to ensure that we include
             // the partition's high watermark in the checkpoint file (see KAFKA-1647)
             partition.getOrCreateReplica(localBrokerId, isNew = partitionStateInfo.isNew)
         }
       } catch {
         case e: KafkaStorageException =>
           stateChangeLogger.error(s"Skipped the become-follower state change with correlation id $correlationId from " +
             s"controller $controllerId epoch $epoch for partition ${partition.topicPartition} " +
             s"(last update controller epoch ${partitionStateInfo.basePartitionState.controllerEpoch}) with leader " +
             s"$newLeaderBrokerId since the replica for the partition is offline due to disk error $e")
           val dirOpt = getLogDir(partition.topicPartition)
           error(s"Error while making broker the follower for partition $partition with leader " +
             s"$newLeaderBrokerId in dir $dirOpt", e)
           responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
       }
     }
      // if partition leadership changed, stop the fetcher threads first
     replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(_.topicPartition))
     partitionsToMakeFollower.foreach { partition =>
       stateChangeLogger.trace(s"Stopped fetchers as part of become-follower request from controller $controllerId " +
         s"epoch $epoch with correlation id $correlationId for partition ${partition.topicPartition} with leader " +
         s"${partitionStates(partition).basePartitionState.leader}")
     }

     partitionsToMakeFollower.foreach { partition =>
       val topicPartitionOperationKey = new TopicPartitionOperationKey(partition.topicPartition)
       tryCompleteDelayedProduce(topicPartitionOperationKey)
       tryCompleteDelayedFetch(topicPartitionOperationKey)
     }

     partitionsToMakeFollower.foreach { partition =>
       stateChangeLogger.trace(s"Truncated logs and checkpointed recovery boundaries for partition " +
         s"${partition.topicPartition} as part of become-follower request with correlation id $correlationId from " +
         s"controller $controllerId epoch $epoch with leader ${partitionStates(partition).basePartitionState.leader}")
     }

     if (isShuttingDown.get()) {
       partitionsToMakeFollower.foreach { partition =>
         stateChangeLogger.trace(s"Skipped the adding-fetcher step of the become-follower state " +
           s"change with correlation id $correlationId from controller $controllerId epoch $epoch for " +
           s"partition ${partition.topicPartition} with leader ${partitionStates(partition).basePartitionState.leader} " +
           "since it is shutting down")
       }
     }
     else {
       // we do not need to check if the leader exists again since this has been done at the beginning of this process
       val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map { partition =>
         val leader = metadataCache.getAliveBrokers.find(_.id == partition.leaderReplicaIdOpt.get).get
           .brokerEndPoint(config.interBrokerListenerName)
         val fetchOffset = partition.localReplicaOrException.highWatermark.messageOffset
         partition.topicPartition -> InitialFetchState(leader, partition.getLeaderEpoch, fetchOffset)
       }.toMap
        // start fetcher threads towards the new leaders
       replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)

       partitionsToMakeFollower.foreach { partition =>
         stateChangeLogger.trace(s"Started fetcher to new leader as part of become-follower " +
           s"request from controller $controllerId epoch $epoch with correlation id $correlationId for " +
           s"partition ${partition.topicPartition} with leader ${partitionStates(partition).basePartitionState.leader}")
       }
     }
   } catch {
     case e: Throwable =>
       stateChangeLogger.error(s"Error while processing LeaderAndIsr request with correlationId $correlationId " +
         s"received from controller $controllerId epoch $epoch", e)
       // Re-throw the exception for it to be caught in KafkaApis
       throw e
   }

   partitionStates.keys.foreach { partition =>
     stateChangeLogger.trace(s"Completed LeaderAndIsr request correlationId $correlationId from controller $controllerId " +
       s"epoch $epoch for the become-follower transition for partition ${partition.topicPartition} with leader " +
       s"${partitionStates(partition).basePartitionState.leader}")
   }

   partitionsToMakeFollower
 }

The replica fetch process

When a replica receives a become-follower request, it starts a thread that fetches from the partition leader. On a given broker, the number of fetch threads towards the same target broker is bounded; the fetcher id for a partition is computed as follows:

private[server] def getFetcherId(topicPartition: TopicPartition): Int = {
   lock synchronized {
     Utils.abs(31 * topicPartition.topic.hashCode() + topicPartition.partition) % numFetchersPerBroker
   }
 }
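To see the effect of this hashing, here is a standalone sketch (not the real class): Utils.abs is replaced by the same bit-masking trick it uses internally, and numFetchersPerBroker = 3 is an arbitrary value standing in for the num.replica.fetchers config.

```scala
// A sketch of the fetcher-id computation above; 3 is an arbitrary choice.
val numFetchersPerBroker = 3

// Kafka's Utils.abs masks the sign bit, which (unlike math.abs) is also
// safe when the hash happens to be Int.MinValue.
def fetcherId(topic: String, partition: Int): Int =
  ((31 * topic.hashCode + partition) & 0x7fffffff) % numFetchersPerBroker
```

Because the mapping is deterministic, a given partition always lands on the same fetcher thread, while the number of fetcher threads per target broker stays bounded.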

All fetch threads are kept in the fetcherThreadMap of AbstractFetcherManager. The key of this map is

BrokerAndFetcherId(brokerAndInitialFetchOffset.leader, getFetcherId(topicPartition))

When a fetch is requested, an existing fetch thread towards the target broker is reused if present; otherwise a new thread is created:

def addAndStartFetcherThread(brokerAndFetcherId: BrokerAndFetcherId, brokerIdAndFetcherId: BrokerIdAndFetcherId): AbstractFetcherThread = {
  val fetcherThread = createFetcherThread(brokerAndFetcherId.fetcherId, brokerAndFetcherId.broker)
  fetcherThreadMap.put(brokerIdAndFetcherId, fetcherThread)
  fetcherThread.start()
  fetcherThread
}

What gets created here is a ReplicaFetcherThread. Note how the fetch state is initialized after the thread is constructed:

val currentState = partitionStates.stateValue(tp)
val updatedState = if (currentState != null && currentState.currentLeaderEpoch == initialFetchState.leaderEpoch) {
  currentState
} else {
  val initialFetchOffset = if (initialFetchState.offset < 0)
    fetchOffsetAndTruncate(tp, initialFetchState.leaderEpoch)
  else
    initialFetchState.offset
  PartitionFetchState(initialFetchOffset, initialFetchState.leaderEpoch, state = Truncating)
}
partitionStates.updateAndMoveToEnd(tp, updatedState)

The initialFetchState here carries the current replica's hw, as mentioned earlier. Note that if the fetcher thread's partitionStates already holds fetch state for this partition and the leaderEpoch has not changed, that state is left untouched. Otherwise, the partitionFetchState is set as follows:

  • if initialFetchState.offset >= 0, updatedState = PartitionFetchState(initialFetchOffset, initialFetchState.leaderEpoch, state = Truncating), putting the partition into the Truncating state
  • if initialFetchState.offset < 0, the initial fetch offset is resolved via fetchOffsetAndTruncate, i.e. from the current leo of this topicPartition
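This decision can be condensed into a small self-contained sketch; the types are simplified, and fetchOffsetAndTruncate is reduced to a leaderLeo value supplied by the caller.

```scala
// Simplified model of the initial PartitionFetchState decision; only the
// state names mirror the real code, everything else is schematic.
sealed trait FetchStateName
case object Truncating extends FetchStateName
case object Fetching extends FetchStateName

final case class FetchState(offset: Long, leaderEpoch: Int, state: FetchStateName)

def initialState(current: Option[FetchState],
                 initOffset: Long,
                 initEpoch: Int,
                 leaderLeo: Long): FetchState =
  current match {
    // same leader epoch: keep the existing fetch state untouched
    case Some(s) if s.leaderEpoch == initEpoch => s
    case _ =>
      // a negative offset means "unknown": fall back to the leader's leo
      val off = if (initOffset < 0) leaderLeo else initOffset
      FetchState(off, initEpoch, Truncating)
  }
```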

ReplicaFetcherThread

ReplicaFetcherThread extends AbstractFetcherThread, which in turn extends ShutdownableThread, a class that has come up many times in Kafka; its main loop simply calls doWork() repeatedly. The abstract class AbstractFetcherThread already implements the core fetch logic:

override def doWork() {
    maybeTruncate()
    maybeFetch()
  }

maybeTruncate was covered earlier; its job is to truncate each topicPartition that needs it down to the offset corresponding to its leader epoch. In this article we focus on maybeFetch, which comes down to a couple of key steps:

  • building the fetch request
  • processing the fetch response

Here we look at the two methods implemented by the subclass ReplicaFetcherThread:
protected def buildFetch(partitionMap: Map[TopicPartition, PartitionFetchState]): ResultWithPartitions[Option[FetchRequest.Builder]]
protected def processPartitionData(topicPartition: TopicPartition,
                                     fetchOffset: Long,
                                     partitionData: FetchData): Option[LogAppendInfo]

The first method, buildFetch, constructs the fetch request:

override def buildFetch(partitionMap: Map[TopicPartition, PartitionFetchState]): ResultWithPartitions[Option[FetchRequest.Builder]] = {
    val partitionsWithError = mutable.Set[TopicPartition]()

    val builder = fetchSessionHandler.newBuilder()
    partitionMap.foreach { case (topicPartition, fetchState) =>
      // We will not include a replica in the fetch request if it should be throttled.
      if (fetchState.isReadyForFetch && !shouldFollowerThrottle(quota, topicPartition)) {
        try {
          val logStartOffset = replicaMgr.localReplicaOrException(topicPartition).logStartOffset
          builder.add(topicPartition, new FetchRequest.PartitionData(
            fetchState.fetchOffset, logStartOffset, fetchSize, Optional.of(fetchState.currentLeaderEpoch)))
        } catch {
          case _: KafkaStorageException =>
            // The replica has already been marked offline due to log directory failure and the original failure should have already been logged.
            // This partition should be removed from ReplicaFetcherThread soon by ReplicaManager.handleLogDirFailure()
            partitionsWithError += topicPartition
        }
      }
    }

At the start of the method, a fetch-session builder is created and the fetchOffset and related fields are added per partition; every replica also reports its own logStartOffset, the offset at which its local log begins. Collecting lso information helps the leader with routine log cleanup. Finally builder.build produces the fetch request data.

Finally, the processPartitionData method:

override def processPartitionData(topicPartition: TopicPartition,
                                   fetchOffset: Long,
                                   partitionData: FetchData): Option[LogAppendInfo] = {
   val replica = replicaMgr.localReplicaOrException(topicPartition)
   val partition = replicaMgr.getPartition(topicPartition).get
   val records = toMemoryRecords(partitionData.records)

   maybeWarnIfOversizedRecords(records, topicPartition)

   if (fetchOffset != replica.logEndOffset.messageOffset)
     throw new IllegalStateException("Offset mismatch for partition %s: fetched offset = %d, log end offset = %d.".format(
       topicPartition, fetchOffset, replica.logEndOffset.messageOffset))

   if (isTraceEnabled)
     trace("Follower has replica log end offset %d for partition %s. Received %d messages and leader hw %d"
       .format(replica.logEndOffset.messageOffset, topicPartition, records.sizeInBytes, partitionData.highWatermark))

   // Append the leader's messages to the log
   val logAppendInfo = partition.appendRecordsToFollowerOrFutureReplica(records, isFuture = false)

   if (isTraceEnabled)
     trace("Follower has replica log end offset %d after appending %d bytes of messages for partition %s"
       .format(replica.logEndOffset.messageOffset, records.sizeInBytes, topicPartition))
   val followerHighWatermark = replica.logEndOffset.messageOffset.min(partitionData.highWatermark)
   val leaderLogStartOffset = partitionData.logStartOffset
   // for the follower replica, we do not need to keep
   // its segment base offset the physical position,
   // these values will be computed upon making the leader
   replica.highWatermark = new LogOffsetMetadata(followerHighWatermark)
   replica.maybeIncrementLogStartOffset(leaderLogStartOffset)
   if (isTraceEnabled)
     trace(s"Follower set replica high watermark for partition $topicPartition to $followerHighWatermark")

   // Traffic from both in-sync and out of sync replicas are accounted for in replication quota to ensure total replication
   // traffic doesn't exceed quota.
   if (quota.isThrottled(topicPartition))
     quota.record(records.sizeInBytes)
   replicaMgr.brokerTopicStats.updateReplicationBytesIn(records.sizeInBytes)

   logAppendInfo
 }

Processing the fetch response mainly consists of two steps:

  • appending the fetched records to the local log
  • updating the replica-side metadata, namely hw and lso; if the lso returned by the leader has advanced, the replica's lso is advanced as well and the log trimmed accordingly
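The metadata update in the second step reduces to two monotonic rules, sketched here with plain Longs standing in for LogOffsetMetadata:

```scala
// The follower clamps its hw to its own leo: it cannot expose offsets it
// has not replicated yet (mirrors replica.logEndOffset.min(leader hw)).
def followerHighWatermark(localLeo: Long, leaderHw: Long): Long =
  math.min(localLeo, leaderHw)

// The follower's lso only ever moves forward, following the leader's lso.
def followerLogStartOffset(currentLso: Long, leaderLso: Long): Long =
  math.max(currentLso, leaderLso)
```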

Handling the fetch request on the leader

The partition leader receives fetch requests from both consumers and followers; here we focus mainly on how follower requests are handled. On the kafka server, the API handler for fetch requests is the handleFetchRequest method:

...
if (fetchRequest.isFromFollower) {
      // The follower must have ClusterAction on ClusterResource in order to fetch partition data.
      if (authorize(request.session, ClusterAction, Resource.ClusterResource)) {
        fetchContext.foreachPartition { (topicPartition, data) =>
          if (!metadataCache.contains(topicPartition))
            erroneous += topicPartition -> errorResponse(Errors.UNKNOWN_TOPIC_OR_PARTITION)
          else
            interesting += (topicPartition -> data)
        }
      } else {
        fetchContext.foreachPartition { (part, _) =>
          erroneous += part -> errorResponse(Errors.TOPIC_AUTHORIZATION_FAILED)
        }
      }
    }
...
if (interesting.isEmpty)
      processResponseCallback(Seq.empty)
    else {
      // call the replica manager to fetch messages from the local replica
      replicaManager.fetchMessages(
        fetchRequest.maxWait.toLong,
        fetchRequest.replicaId,
        fetchRequest.minBytes,
        fetchRequest.maxBytes,
        versionId <= 2,
        interesting,
        replicationQuota(fetchRequest),
        processResponseCallback,
        fetchRequest.isolationLevel)
    }
...

Skipping the metrics, quota and response-construction details, let's look directly at how the replicaManager handles the fetch request:

def fetchMessages(timeout: Long,
                    replicaId: Int,
                    fetchMinBytes: Int,
                    fetchMaxBytes: Int,
                    hardMaxBytesLimit: Boolean,
                    fetchInfos: Seq[(TopicPartition, PartitionData)],
                    quota: ReplicaQuota = UnboundedQuota,
                    responseCallback: Seq[(TopicPartition, FetchPartitionData)] => Unit,
                    isolationLevel: IsolationLevel) {
    val isFromFollower = Request.isValidBrokerId(replicaId)
    val fetchOnlyFromLeader = replicaId != Request.DebuggingConsumerId && replicaId != Request.FutureLocalReplicaId

    val fetchIsolation = if (isFromFollower || replicaId == Request.FutureLocalReplicaId)
      FetchLogEnd
    else if (isolationLevel == IsolationLevel.READ_COMMITTED)
      FetchTxnCommitted
    else
      FetchHighWatermark


    def readFromLog(): Seq[(TopicPartition, LogReadResult)] = {
      val result = readFromLocalLog(
        replicaId = replicaId,
        fetchOnlyFromLeader = fetchOnlyFromLeader,
        fetchIsolation = fetchIsolation,
        fetchMaxBytes = fetchMaxBytes,
        hardMaxBytesLimit = hardMaxBytesLimit,
        readPartitionInfo = fetchInfos,
        quota = quota)
      if (isFromFollower) updateFollowerLogReadResults(replicaId, result)
      else result
    }

    val logReadResults = readFromLog()

    // check if this fetch request can be satisfied right away
    val logReadResultValues = logReadResults.map { case (_, v) => v }
    val bytesReadable = logReadResultValues.map(_.info.records.sizeInBytes).sum
    val errorReadingData = logReadResultValues.foldLeft(false) ((errorIncurred, readResult) =>
      errorIncurred || (readResult.error != Errors.NONE))

    // respond immediately if 1) fetch request does not want to wait
    //                        2) fetch request does not require any data
    //                        3) has enough data to respond
    //                        4) some error happens while reading data
    if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {
      val fetchPartitionData = logReadResults.map { case (tp, result) =>
        tp -> FetchPartitionData(result.error, result.highWatermark, result.leaderLogStartOffset, result.info.records,
          result.lastStableOffset, result.info.abortedTransactions)
      }
      responseCallback(fetchPartitionData)
    } else {
      // construct the fetch results from the read results
      val fetchPartitionStatus = logReadResults.map { case (topicPartition, result) =>
        val fetchInfo = fetchInfos.collectFirst {
          case (tp, v) if tp == topicPartition => v
        }.getOrElse(sys.error(s"Partition $topicPartition not found in fetchInfos"))
        (topicPartition, FetchPartitionStatus(result.info.fetchOffsetMetadata, fetchInfo))
      }
      val fetchMetadata = FetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit, fetchOnlyFromLeader,
        fetchIsolation, isFromFollower, replicaId, fetchPartitionStatus)
      val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, responseCallback)

      // create a list of (topic, partition) pairs to use as keys for this delayed fetch operation
      val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) }

      // try to complete the request immediately, otherwise put it into the purgatory;
      // this is because while the delayed fetch operation is being created, new requests
      // may arrive and hence make this operation completable.
      delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
    }
  }

In this method:

  • first, the fetch isolation level is determined
  • second, messages are read from the local log and returned
  • third, the leader-side offset bookkeeping for the follower is updated
  • fourth, if the fetch can be satisfied right away, a response is sent immediately; otherwise a delayedFetch object is constructed and watched for completion
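The four respond-immediately conditions of the last step can be captured in a single predicate; this is a sketch over plain values, not the real FetchMetadata:

```scala
// A fetch can be answered without entering the purgatory if any of the
// four conditions from fetchMessages holds.
def respondImmediately(timeoutMs: Long,
                       partitionCount: Int,
                       bytesReadable: Long,
                       fetchMinBytes: Int,
                       errorReadingData: Boolean): Boolean =
  timeoutMs <= 0 ||                   // the request does not want to wait
    partitionCount == 0 ||            // the request asked for no data
    bytesReadable >= fetchMinBytes || // enough data to respond already
    errorReadingData                  // fail fast on read errors
```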

There are three fetch isolation levels:

  • FetchLogEnd => read up to the log end (follower fetches)
  • FetchTxnCommitted => read up to the lastStableOffset (read-committed fetches)
  • FetchHighWatermark => read up to the hw (consumer fetches)

Going further, let's first look at how the log is read:

readFromLocalLog

def readFromLocalLog(replicaId: Int,
                       fetchOnlyFromLeader: Boolean,
                       fetchIsolation: FetchIsolation,
                       fetchMaxBytes: Int,
                       hardMaxBytesLimit: Boolean,
                       readPartitionInfo: Seq[(TopicPartition, PartitionData)],
                       quota: ReplicaQuota): Seq[(TopicPartition, LogReadResult)] = {

    def read(tp: TopicPartition, fetchInfo: PartitionData, limitBytes: Int, minOneMessage: Boolean): LogReadResult = {
      val offset = fetchInfo.fetchOffset
      val partitionFetchSize = fetchInfo.maxBytes
      val followerLogStartOffset = fetchInfo.logStartOffset

      brokerTopicStats.topicStats(tp.topic).totalFetchRequestRate.mark()
      brokerTopicStats.allTopicsStats.totalFetchRequestRate.mark()

      try {
        trace(s"Fetching log segment for partition $tp, offset $offset, partition fetch size $partitionFetchSize, " +
          s"remaining response limit $limitBytes" +
          (if (minOneMessage) s", ignoring response/partition size limits" else ""))

        val partition = getPartitionOrException(tp, expectLeader = fetchOnlyFromLeader)
        val adjustedMaxBytes = math.min(fetchInfo.maxBytes, limitBytes)
        val fetchTimeMs = time.milliseconds

        // Try the read first, this tells us whether we need all of adjustedFetchSize for this partition
        val readInfo = partition.readRecords(
          fetchOffset = fetchInfo.fetchOffset,
          currentLeaderEpoch = fetchInfo.currentLeaderEpoch,
          maxBytes = adjustedMaxBytes,
          fetchIsolation = fetchIsolation,
          fetchOnlyFromLeader = fetchOnlyFromLeader,
          minOneMessage = minOneMessage)

        val fetchDataInfo = if (shouldLeaderThrottle(quota, tp, replicaId)) {
          // If the partition is being throttled, simply return an empty set.
          FetchDataInfo(readInfo.fetchedData.fetchOffsetMetadata, MemoryRecords.EMPTY)
        } else if (!hardMaxBytesLimit && readInfo.fetchedData.firstEntryIncomplete) {
          // For FetchRequest version 3, we replace incomplete message sets with an empty one as consumers can make
          // progress in such cases and don't need to report a `RecordTooLargeException`
          FetchDataInfo(readInfo.fetchedData.fetchOffsetMetadata, MemoryRecords.EMPTY)
        } else {
          readInfo.fetchedData
        }

        LogReadResult(info = fetchDataInfo,
                      highWatermark = readInfo.highWatermark,
                      leaderLogStartOffset = readInfo.logStartOffset,
                      leaderLogEndOffset = readInfo.logEndOffset,
                      followerLogStartOffset = followerLogStartOffset,
                      fetchTimeMs = fetchTimeMs,
                      readSize = adjustedMaxBytes,
                      lastStableOffset = Some(readInfo.lastStableOffset),
                      exception = None)
      } catch {
        // NOTE: Failed fetch requests metric is not incremented for known exceptions since it
        // is supposed to indicate un-expected failure of a broker in handling a fetch request
        case e@ (_: UnknownTopicOrPartitionException |
                 _: NotLeaderForPartitionException |
                 _: UnknownLeaderEpochException |
                 _: FencedLeaderEpochException |
                 _: ReplicaNotAvailableException |
                 _: KafkaStorageException |
                 _: OffsetOutOfRangeException) =>
          LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY),
                        highWatermark = -1L,
                        leaderLogStartOffset = -1L,
                        leaderLogEndOffset = -1L,
                        followerLogStartOffset = -1L,
                        fetchTimeMs = -1L,
                        readSize = 0,
                        lastStableOffset = None,
                        exception = Some(e))
        case e: Throwable =>
          brokerTopicStats.topicStats(tp.topic).failedFetchRequestRate.mark()
          brokerTopicStats.allTopicsStats.failedFetchRequestRate.mark()
          error(s"Error processing fetch operation on partition $tp, offset $offset", e)
          LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY),
                        highWatermark = -1L,
                        leaderLogStartOffset = -1L,
                        leaderLogEndOffset = -1L,
                        followerLogStartOffset = -1L,
                        fetchTimeMs = -1L,
                        readSize = 0,
                        lastStableOffset = None,
                        exception = Some(e))
      }
    }

    var limitBytes = fetchMaxBytes
    val result = new mutable.ArrayBuffer[(TopicPartition, LogReadResult)]
    var minOneMessage = !hardMaxBytesLimit
    readPartitionInfo.foreach { case (tp, fetchInfo) =>
      val readResult = read(tp, fetchInfo, limitBytes, minOneMessage)
      val recordBatchSize = readResult.info.records.sizeInBytes
      // Once we read from a non-empty partition, we stop ignoring request and partition level size limits
      if (recordBatchSize > 0)
        minOneMessage = false
      limitBytes = math.max(0, limitBytes - recordBatchSize)
      result += (tp -> readResult)
    }
    result
  }

The actual work is done by the nested read method:

  1. call readRecords on the partition to fetch the log data
  2. if the partition is throttled, return an empty record set
  3. if the first batch is too large to be read completely and the fetch request version is 3 or above (so hardMaxBytesLimit is false), return an empty record set so that downstream clients do not fail with a RecordTooLargeException
  4. return the fetched log data
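The bookkeeping of the loop that drives these reads (a shrinking byte budget, and a minOneMessage flag cleared after the first non-empty read) can be simulated in isolation. The sketch below assumes hardMaxBytesLimit is false, and the input sizes stand in for what each per-partition read happened to return:

```scala
// Returns, after each simulated per-partition read, the remaining byte
// budget and whether minOneMessage is still set.
def readLoopState(sizes: Seq[Int], fetchMaxBytes: Int): Seq[(Int, Boolean)] = {
  var limitBytes = fetchMaxBytes
  var minOneMessage = true // !hardMaxBytesLimit in the real code
  sizes.map { recordBatchSize =>
    // once some partition returned data, size limits apply strictly
    if (recordBatchSize > 0) minOneMessage = false
    limitBytes = math.max(0, limitBytes - recordBatchSize)
    (limitBytes, minOneMessage)
  }
}
```

An empty first partition leaves minOneMessage set, so a later partition may still return one oversized batch; after the first non-empty read, the per-partition and request-level limits are enforced.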

Fetching the log data is delegated to the partition's readRecords method:

def readRecords(fetchOffset: Long,
                  currentLeaderEpoch: Optional[Integer],
                  maxBytes: Int,
                  fetchIsolation: FetchIsolation,
                  fetchOnlyFromLeader: Boolean,
                  minOneMessage: Boolean): LogReadInfo = inReadLock(leaderIsrUpdateLock) {
    // decide whether to only fetch from leader
    val localReplica = localReplicaWithEpochOrException(currentLeaderEpoch, fetchOnlyFromLeader)

    /* Read the LogOffsetMetadata prior to performing the read from the log.
     * We use the LogOffsetMetadata to determine if a particular replica is in-sync or not.
     * Using the log end offset after performing the read can lead to a race condition
     * where data gets appended to the log immediately after the replica has consumed from it
     * This can cause a replica to always be out of sync.
     */
    // fetchIsolation decides the maximum offset that can be fetched (see maxOffsetOpt below)
    val initialHighWatermark = localReplica.highWatermark.messageOffset
    val initialLogStartOffset = localReplica.logStartOffset
    val initialLogEndOffset = localReplica.logEndOffset.messageOffset
    val initialLastStableOffset = localReplica.lastStableOffset.messageOffset

    val maxOffsetOpt = fetchIsolation match {
      case FetchLogEnd => None
      case FetchHighWatermark => Some(initialHighWatermark)
      case FetchTxnCommitted => Some(initialLastStableOffset)
    }

    val fetchedData = localReplica.log match {
      case Some(log) =>
        // read the log data from local disk
        log.read(fetchOffset, maxBytes, maxOffsetOpt, minOneMessage,
          includeAbortedTxns = fetchIsolation == FetchTxnCommitted)

      case None =>
        error(s"Leader does not have a local log")
        FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY)
    }

    LogReadInfo(
      fetchedData = fetchedData,
      highWatermark = initialHighWatermark,
      logStartOffset = initialLogStartOffset,
      logEndOffset = initialLogEndOffset,
      lastStableOffset = initialLastStableOffset)
  }

That is all for readFromLocalLog.

updateFollowerLogReadResults

Notice that in the readFromLog method, readFromLocalLog is called first to read the local log, and then updateFollowerLogReadResults is called to update the fetch-state information for the fetching follower:

def readFromLog(): Seq[(TopicPartition, LogReadResult)] = {
     val result = readFromLocalLog(
       replicaId = replicaId,
       fetchOnlyFromLeader = fetchOnlyFromLeader,
       fetchIsolation = fetchIsolation,
       fetchMaxBytes = fetchMaxBytes,
       hardMaxBytesLimit = hardMaxBytesLimit,
       readPartitionInfo = fetchInfos,
       quota = quota)
     if (isFromFollower) updateFollowerLogReadResults(replicaId, result)
     else result
   }

Here is the updateFollowerLogReadResults method:

private def updateFollowerLogReadResults(replicaId: Int,
                                           readResults: Seq[(TopicPartition, LogReadResult)]): Seq[(TopicPartition, LogReadResult)] = {
    debug(s"Recording follower broker $replicaId log end offsets: $readResults")
    readResults.map { case (topicPartition, readResult) =>
      var updatedReadResult = readResult
      nonOfflinePartition(topicPartition) match {
        case Some(partition) =>
          partition.getReplica(replicaId) match {
            case Some(replica) =>
              partition.updateReplicaLogReadResult(replica, readResult)
            case None =>
              warn(s"Leader $localBrokerId failed to record follower $replicaId's position " +
                s"${readResult.info.fetchOffsetMetadata.messageOffset} since the replica is not recognized to be " +
                s"one of the assigned replicas ${partition.assignedReplicas.map(_.brokerId).mkString(",")} " +
                s"for partition $topicPartition. Empty records will be returned for this partition.")
              updatedReadResult = readResult.withEmptyFetchInfo
          }
        case None =>
          warn(s"While recording the replica LEO, the partition $topicPartition hasn't been created.")
      }
      topicPartition -> updatedReadResult
    }
  }

If the partition is still online, the replica corresponding to replicaId is looked up, and partition.updateReplicaLogReadResult(replica, readResult) is called to update that replica's fetch state (this is the per-follower fetch state kept at the leader):

def updateReplicaLogReadResult(replica: Replica, logReadResult: LogReadResult): Boolean = {
   val replicaId = replica.brokerId
   // No need to calculate low watermark if there is no delayed DeleteRecordsRequest
   val oldLeaderLW = if (replicaManager.delayedDeleteRecordsPurgatory.delayed > 0) lowWatermarkIfLeader else -1L
   replica.updateLogReadResult(logReadResult)
   val newLeaderLW = if (replicaManager.delayedDeleteRecordsPurgatory.delayed > 0) lowWatermarkIfLeader else -1L
   // check if the LW of the partition has incremented
   // since the replica's logStartOffset may have incremented
   val leaderLWIncremented = newLeaderLW > oldLeaderLW
   // check if we need to expand ISR to include this replica
   // if it is not in the ISR yet
   val leaderHWIncremented = maybeExpandIsr(replicaId, logReadResult)

   val result = leaderLWIncremented || leaderHWIncremented
   // some delayed operations may be unblocked after HW or LW changed
   if (result)
     tryCompleteDelayedRequests()

   debug(s"Recorded replica $replicaId log end offset (LEO) position ${logReadResult.info.fetchOffsetMetadata.messageOffset}.")
   result
 }

This does three main things:

  1. update the replica's fetch state
  2. check whether the low watermark has grown: since the replica's logStartOffset may have advanced, the lw may advance too
  3. check whether the high watermark has grown, which also covers deciding whether the ISR should be expanded

The replica update itself looks like this:

def updateLogReadResult(logReadResult: LogReadResult) {
    if (logReadResult.info.fetchOffsetMetadata.messageOffset >= logReadResult.leaderLogEndOffset)
      _lastCaughtUpTimeMs = math.max(_lastCaughtUpTimeMs, logReadResult.fetchTimeMs)
    else if (logReadResult.info.fetchOffsetMetadata.messageOffset >= lastFetchLeaderLogEndOffset)
      _lastCaughtUpTimeMs = math.max(_lastCaughtUpTimeMs, lastFetchTimeMs)

    logStartOffset = logReadResult.followerLogStartOffset
    logEndOffset = logReadResult.info.fetchOffsetMetadata
    lastFetchLeaderLogEndOffset = logReadResult.leaderLogEndOffset
    lastFetchTimeMs = logReadResult.fetchTimeMs
  }

  • if the fetched offset is at least the leader's leo, update _lastCaughtUpTimeMs to math.max(_lastCaughtUpTimeMs, fetchTimeMs)
  • otherwise, if the fetched offset is at least the leader's leo as of the previous fetch, update _lastCaughtUpTimeMs to math.max(_lastCaughtUpTimeMs, lastFetchTimeMs)
  • update logStartOffset and logEndOffset (logStartOffset is carried in the follower's request; logEndOffset is the offset position reached by this fetch)

A note on the meaning of _lastCaughtUpTimeMs — this is my own interpretation. In Kafka, a follower "catching up" does not only mean being fully caught up with the leader at this instant, as if only then it earned the right to update its catch-up time. Rather, _lastCaughtUpTimeMs records a point in time such that the follower's current leo is at least the leader's leo as of that time.

At the leader, the leo evolves as a sequence (fetch denotes the fetch time):

(leo_1, fetch_1) -> (leo_2, fetch_2) -> ... -> (leo_x, fetch_x) -> (leo_x+1, fetch_x+1). If the latest offset a follower fetches at time fetch_x+1 satisfies offset >= leo_x+1, then clearly _lastCaughtUpTimeMs = fetch_x+1; failing that, if it still satisfies offset >= leo_x, then _lastCaughtUpTimeMs may be updated to fetch_x. This is why each replica keeps the leader's leo at the previous fetch together with the previous fetch time. Expressed as a formula:

_lastCaughtUpTimeMs = max { fetch_t : leo_replica(now) >= leo_leader(fetch_t) }
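This update rule can be simulated with a few lines of Scala. The sketch mirrors updateLogReadResult with plain Longs; it is not the real Replica class:

```scala
// Per-follower state kept at the leader, as used by updateLogReadResult.
final case class CaughtUpState(lastCaughtUpTimeMs: Long,
                               lastFetchLeaderLeo: Long,
                               lastFetchTimeMs: Long)

def onFetch(s: CaughtUpState,
            fetchedOffset: Long,
            leaderLeo: Long,
            fetchTimeMs: Long): CaughtUpState = {
  val candidate =
    if (fetchedOffset >= leaderLeo) fetchTimeMs // caught up right now
    else if (fetchedOffset >= s.lastFetchLeaderLeo) s.lastFetchTimeMs // caught up as of the previous fetch
    else s.lastCaughtUpTimeMs // no progress on the catch-up time
  CaughtUpState(math.max(s.lastCaughtUpTimeMs, candidate), leaderLeo, fetchTimeMs)
}
```

Feeding it a couple of fetches shows how a follower that trails the leader's leo still advances its catch-up time via the previous fetch's snapshot.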

After the replica's fetch state is updated, the leader checks whether its ISR information needs updating:

def maybeExpandIsr(replicaId: Int, logReadResult: LogReadResult): Boolean = {
    inWriteLock(leaderIsrUpdateLock) {
      // check if this replica needs to be added to the ISR
      leaderReplicaIfLocal match {
        case Some(leaderReplica) =>
          val replica = getReplica(replicaId).get
          val leaderHW = leaderReplica.highWatermark
          val fetchOffset = logReadResult.info.fetchOffsetMetadata.messageOffset
          if (!inSyncReplicas.contains(replica) &&
             assignedReplicas.map(_.brokerId).contains(replicaId) &&
             replica.logEndOffset.offsetDiff(leaderHW) >= 0 &&
             leaderEpochStartOffsetOpt.exists(fetchOffset >= _)) {
            val newInSyncReplicas = inSyncReplicas + replica
            info(s"Expanding ISR from ${inSyncReplicas.map(_.brokerId).mkString(",")} " +
              s"to ${newInSyncReplicas.map(_.brokerId).mkString(",")}")
            // update ISR in ZK and cache
            updateIsr(newInSyncReplicas)
            replicaManager.isrExpandRate.mark()
          }
          // check if the HW of the partition can now be incremented
          // since the replica may already be in the ISR and its LEO has just incremented
          maybeIncrementLeaderHW(leaderReplica, logReadResult.fetchTimeMs)
        case None => false // nothing to do if no longer leader
      }
    }
  }
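Stripped of locking and zk access, the expansion condition in this method is a single boolean expression; the sketch below uses plain offsets in place of Replica objects:

```scala
// A follower qualifies for the ISR once it is an assigned replica, has
// reached the leader's hw, and has fetched past the start offset of the
// current leader epoch.
def qualifiesForIsr(inIsr: Boolean,
                    isAssigned: Boolean,
                    followerLeo: Long,
                    leaderHw: Long,
                    fetchOffset: Long,
                    leaderEpochStartOffset: Option[Long]): Boolean =
  !inIsr &&
    isAssigned &&
    followerLeo >= leaderHw &&
    leaderEpochStartOffset.exists(fetchOffset >= _)
```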

If the replica is an assigned replica not yet in the ISR, its leo has reached the leader's highWatermark, and its fetchOffset (= logReadResult.info.fetchOffsetMetadata.messageOffset) is at least leaderEpochStartOffset, the replica is added to the ISR and the new ISR is written to zk. Finally, the leader checks whether the hw can be incremented:

private def maybeIncrementLeaderHW(leaderReplica: Replica, curTime: Long = time.milliseconds): Boolean = {
   val allLogEndOffsets = assignedReplicas.filter { replica =>
     curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
   }.map(_.logEndOffset)
    // the hw is the minimum leo among the ISR members and the caught-up replicas filtered above
   val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering)
   val oldHighWatermark = leaderReplica.highWatermark

   // Ensure that the high watermark increases monotonically. We also update the high watermark when the new
   // offset metadata is on a newer segment, which occurs whenever the log is rolled to a new segment.
   if (oldHighWatermark.messageOffset < newHighWatermark.messageOffset ||
     (oldHighWatermark.messageOffset == newHighWatermark.messageOffset && oldHighWatermark.onOlderSegment(newHighWatermark))) {
     leaderReplica.highWatermark = newHighWatermark
     debug(s"High watermark updated to $newHighWatermark")
     true
   } else {
     def logEndOffsetString(r: Replica) = s"replica ${r.brokerId}: ${r.logEndOffset}"
     debug(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old hw $oldHighWatermark. " +
       s"All current LEOs are ${assignedReplicas.map(logEndOffsetString)}")
     false
   }
 }

When computing allLogEndOffsets, note the extra condition:

val allLogEndOffsets = assignedReplicas.filter { replica =>
    curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
  }.map(_.logEndOffset)

As we can see, the replicas eligible to influence the hw include not only the ISR members, but also any replica satisfying curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs. This guards against the case where the ISR contains only the leader while the other followers are still catching up: if the hw kept advancing with the leo of the ISR members (at that point, just the leader's leo), the other followers could stay behind the hw forever, and the ISR would remain leader-only.
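A sketch of the hw computation under this rule (plain Longs for the LEOs; replicaLagTimeMaxMs corresponds to the replica.lag.time.max.ms config):

```scala
// Each replica is summarized as (leo, lastCaughtUpTimeMs, inIsr). The new hw
// is the minimum leo over the ISR members plus every replica that caught up
// within replicaLagTimeMaxMs.
final case class ReplicaView(leo: Long, lastCaughtUpTimeMs: Long, inIsr: Boolean)

def newHighWatermark(replicas: Seq[ReplicaView],
                     curTimeMs: Long,
                     replicaLagTimeMaxMs: Long): Long =
  replicas
    .filter(r => curTimeMs - r.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || r.inIsr)
    .map(_.leo)
    .min
```

A recently caught-up follower that is not yet in the ISR still holds the hw back, which is exactly what keeps it from trailing a leader-only hw forever; a genuinely stale follower is ignored.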