kafka server - replicaManager

This is a draft.

replica manager

Initialization

When the Kafka server starts, it creates a ReplicaManager dedicated to managing the state of each partition's replicas. Let's look at this class. We will skip the constructor parameters and go straight to its member fields:

  • controllerEpoch: the epoch of the controller this broker currently sees
  • localBrokerId: the id of this broker
  • allPartitions: a cache of partitions
  • replicaStateChangeLock: the lock (an intrinsic lock) taken when changing replica state
  • replicaFetcherManager:
  • replicaAlterLogDirsManager

Both replicaFetcherManager and replicaAlterLogDirsManager extend AbstractFetcherManager, whose main job is to manage the replica fetcher threads. Internally it maintains a mapping from topicPartition, via a fetcher id, to an AbstractFetcherThread. replicaFetcherManager creates ReplicaFetcherThread instances, while ReplicaAlterLogDirsManager creates ReplicaAlterLogDirsThread instances. We will cover what these two threads do in detail later.
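To make the mapping concrete, here is a rough sketch of the idea (a hypothetical FetcherPool, not the actual AbstractFetcherManager API): partitions fetched from the same source broker are spread over a fixed pool of fetcher threads keyed by (broker id, fetcher id), and the hash-based fetcher id below is an assumption modeled on Kafka's getFetcherId.

import scala.collection.mutable

case class BrokerIdAndFetcherId(brokerId: Int, fetcherId: Int)

class FetcherPool(numFetchersPerBroker: Int) {
  private val fetcherThreadMap = mutable.Map.empty[BrokerIdAndFetcherId, Thread]

  // Hash the topic/partition into one of numFetchersPerBroker slots (assumption).
  private def fetcherId(topic: String, partitionId: Int): Int =
    math.abs(31 * topic.hashCode + partitionId) % numFetchersPerBroker

  // All partitions that map to the same (broker id, fetcher id) share one fetcher thread.
  def fetcherFor(sourceBrokerId: Int, topic: String, partitionId: Int): Thread = {
    val key = BrokerIdAndFetcherId(sourceBrokerId, fetcherId(topic, partitionId))
    fetcherThreadMap.getOrElseUpdate(key,
      new Thread(new Runnable { def run(): Unit = () /* fetch loop elided */ },
        s"fetcher-${key.brokerId}-${key.fetcherId}"))
  }
}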

  • highWatermarkCheckPointThreadStarted: marks whether the high-watermark checkpoint thread has been started; only the leader starts this thread
  • highWatermarkCheckpoints
  • isrChangeSet: the set of topicPartitions whose ISR changed recently
  • lastIsrChangeMs: the time of the last ISR change
  • lastIsrPropagationMs: the time the last ISR change was propagated
  • logDirFailureHandler

AbstractFetcherThread

AbstractFetcherThread is used to fetch data for multiple partitions from a single broker. Its constructor parameters are:

  • name
  • clientId
  • sourceBroker
  • fetchBackOffMs
  • isInterruptible

AbstractFetcherThread extends ShutdownableThread; name and isInterruptible are parameters inherited from the parent class, representing the thread name and whether the thread may be interrupted on shutdown. sourceBroker is the endpoint of the broker to fetch data from, and fetchBackOffMs is the back-off time between fetch attempts.

The member fields of AbstractFetcherThread include:

  • partitionStates: new PartitionStates[PartitionFetchState]
  • partitionMapLock: new ReentrantLock
  • partitionMapCond: partitionMapLock.newCondition()

PartitionStates internally holds a map from TopicPartition to a per-partition state:

public class PartitionStates<S> {

    private final LinkedHashMap<TopicPartition, S> map = new LinkedHashMap<>();
    private final Set<TopicPartition> partitionSetView = Collections.unmodifiableSet(map.keySet());
    ...
}

The map is a LinkedHashMap so that the keys (topicPartitions) can be iterated in order and an entry can be moved to the end of the map, which gives a round-robin effect. For now just keep the structure in mind; we will come back to how it is used later.
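To make the round-robin behavior concrete, here is a minimal sketch (an illustrative class, not Kafka's PartitionStates) of iterating a LinkedHashMap and moving the just-served key to the end:

import java.util.LinkedHashMap

class RoundRobinStates[S] {
  private val map = new LinkedHashMap[String, S]()

  // Removing and re-inserting a key pushes it to the end of the LinkedHashMap.
  def updateAndMoveToEnd(key: String, state: S): Unit = {
    map.remove(key)
    map.put(key, state)
  }

  // The head of the map is the next key to serve; after serving it, move it to the end.
  def nextToServe(): Option[(String, S)] = {
    val it = map.entrySet().iterator()
    if (!it.hasNext) None
    else {
      val e = it.next()
      val result = (e.getKey, e.getValue)
      updateAndMoveToEnd(result._1, result._2)
      Some(result)
    }
  }
}

In AbstractFetcherThread the value type of the map is PartitionFetchState, which represents the current fetch state of a partition: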

case class PartitionFetchState(fetchOffset: Long,
                               currentLeaderEpoch: Int,
                               delay: DelayedItem,
                               state: ReplicaState) {

  def isReadyForFetch: Boolean = state == Fetching && !isDelayed

  def isTruncating: Boolean = state == Truncating && !isDelayed

  //isDelayed indicates whether the DelayedItem has not yet expired (i.e. the partition is still backing off)
  def isDelayed: Boolean = delay.getDelay(TimeUnit.MILLISECONDS) > 0

  override def toString: String = {
    s"FetchState(fetchOffset=$fetchOffset" +
      s", currentLeaderEpoch=$currentLeaderEpoch" +
      s", state=$state" +
      s", delay=${delay.delayMs}ms" +
      s")"
  }
}

This class holds a partition's fetch offset and its state:

  • Truncating
  • Delayed
  • ReadyForFetch
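For reference, judging from the isReadyForFetch and isTruncating methods above, the ReplicaState field is a small two-valued ADT along the lines of the sketch below, with "Delayed" derived from the DelayedItem rather than being a separate value; this is a reconstruction, not a verbatim copy of the source:

sealed trait ReplicaState
case object Truncating extends ReplicaState
case object Fetching extends ReplicaState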

Because AbstractFetcherThread extends ShutdownableThread, it keeps executing its doWork method:

override def doWork() {
    maybeTruncate()
    maybeFetch()
  }

maybeTruncate decides whether the log needs to be truncated. Why would a log ever need truncation? Whenever a follower is elected leader, the other followers' logs must not contain more data than the new leader's, so part of their logs has to be cut off to keep replication consistent. How much should be cut? Kafka partitions have the notion of a leader epoch: a cache records at which offset each leader epoch started. When a new leader is elected, a replica that can read this cache knows exactly where to truncate.
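To make the leader-epoch idea concrete, here is a minimal sketch (an illustrative class, not Kafka's actual LeaderEpochFileCache) of a cache that records the start offset of each epoch and can answer "what is the end offset of epoch X":

import scala.collection.mutable.ListBuffer

class EpochStartOffsetCache {
  // (epoch, start offset) entries, assumed to be appended with strictly increasing epochs.
  private val entries = ListBuffer.empty[(Int, Long)]

  def assign(epoch: Int, startOffset: Long): Unit = entries += ((epoch, startOffset))

  // Returns (largest known epoch <= requestedEpoch, its end offset). The end offset of an
  // epoch is the start offset of the next epoch, or the log end offset for the latest epoch.
  def endOffsetFor(requestedEpoch: Int, logEndOffset: Long): Option[(Int, Long)] = {
    val known = entries.takeWhile { case (epoch, _) => epoch <= requestedEpoch }
    known.lastOption.map { case (epoch, _) =>
      val nextStart = entries.drop(known.size).headOption.map(_._2)
      (epoch, nextStart.getOrElse(logEndOffset))
    }
  }
}

With such a cache on both the leader and the followers, a follower knows exactly where to cut. The implementation of maybeTruncate is: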

private def maybeTruncate(): Unit = {
   val (partitionsWithEpochs, partitionsWithoutEpochs) = fetchTruncatingPartitions()
   if (partitionsWithEpochs.nonEmpty) {
     truncateToEpochEndOffsets(partitionsWithEpochs)
   }
   if (partitionsWithoutEpochs.nonEmpty) {
     truncateToHighWatermark(partitionsWithoutEpochs)
   }
 }
  1. First, split the partitions in partitionStates into those that have a leader epoch and those that do not.
  2. Partitions with a leader epoch are truncated according to the leader epoch end offset.
  3. Partitions without a leader epoch are truncated to the high watermark.

Truncation

Now let's look at the concrete implementation of truncation, starting with truncation by leader epoch:

private def truncateToEpochEndOffsets(latestEpochsForPartitions: Map[TopicPartition, EpochData]): Unit = {
    val endOffsets = fetchEpochEndOffsets(latestEpochsForPartitions)
    //Ensure we hold a lock during truncation.
    inLock(partitionMapLock) {
      //Check no leadership and no leader epoch changes happened whilst we were unlocked, fetching epochs
      val epochEndOffsets = endOffsets.filter { case (tp, _) =>
        val curPartitionState = partitionStates.stateValue(tp)
        val partitionEpochRequest = latestEpochsForPartitions.get(tp).getOrElse {
          throw new IllegalStateException(
            s"Leader replied with partition $tp not requested in OffsetsForLeaderEpoch request")
        }
        val leaderEpochInRequest = partitionEpochRequest.currentLeaderEpoch.get
        //First make sure the leader epoch we just requested matches the one in the cache (recall that partitionStates holds each partition's fetch state)
        curPartitionState != null && leaderEpochInRequest == curPartitionState.currentLeaderEpoch
      }

      val ResultWithPartitions(fetchOffsets, partitionsWithError) = maybeTruncateToEpochEndOffsets(epochEndOffsets)
      handlePartitionsWithErrors(partitionsWithError)
      updateFetchOffsetAndMaybeMarkTruncationComplete(fetchOffsets)
    }
  }
  1. First, fetch the latest end offset for each leader epoch (an abstract method implemented by subclasses).
  2. Take the lock so that leadership cannot change during truncation.
  3. Filter out partitions whose leader epoch read from partitionStates earlier no longer matches the one just fetched (i.e. the leader changed in between).
  4. Perform the truncation.
  5. Mark truncation as complete.

The method that performs the actual truncation is:

private def maybeTruncateToEpochEndOffsets(fetchedEpochs: Map[TopicPartition, EpochEndOffset]): ResultWithPartitions[Map[TopicPartition, OffsetTruncationState]] = {
    val fetchOffsets = mutable.HashMap.empty[TopicPartition, OffsetTruncationState]
    val partitionsWithError = mutable.HashSet.empty[TopicPartition]

    fetchedEpochs.foreach { case (tp, leaderEpochOffset) =>
      try {
        leaderEpochOffset.error match {
          case Errors.NONE =>
            val offsetTruncationState = getOffsetTruncationState(tp, leaderEpochOffset)
            truncate(tp, offsetTruncationState)
            fetchOffsets.put(tp, offsetTruncationState)

          case Errors.FENCED_LEADER_EPOCH =>
            onPartitionFenced(tp)

          case error =>
            info(s"Retrying leaderEpoch request for partition $tp as the leader reported an error: $error")
            partitionsWithError += tp
        }
      } catch {
        case e: KafkaStorageException =>
          info(s"Failed to truncate $tp", e)
          partitionsWithError += tp
      }
    }

    ResultWithPartitions(fetchOffsets, partitionsWithError)
  }
  1. The truncation logic depends on the leader's response for each partition.
  2. If the leader responded without error, i.e. leaderEpochOffset.error == NONE, then:
    • compute the offset to truncate to
    • call truncate to perform the truncation (an abstract method)
    • record the truncation result
  3. If the error is FENCED_LEADER_EPOCH, the requested epoch is lower than the epoch on the broker, i.e. this topicPartition's epoch is stale, so the partition is removed from partitionStates.

So what exactly is the offset we truncate to? Does the leader always return such a leader epoch offset, or are there other cases? Let's look at getOffsetTruncationState:

private def getOffsetTruncationState(tp: TopicPartition,
                                       leaderEpochOffset: EpochEndOffset): OffsetTruncationState = inLock(partitionMapLock) {
    //If the leader returned an undefined offset
    if (leaderEpochOffset.endOffset == UNDEFINED_EPOCH_OFFSET) {
      // truncate to initial offset which is the high watermark for follower replica. For
      // future replica, it is either high watermark of the future replica or current
      // replica's truncation offset (when the current replica truncates, it forces future
      // replica's partition state to 'truncating' and sets initial offset to its truncation offset)
      warn(s"Based on replica's leader epoch, leader replied with an unknown offset in $tp. " +
           s"The initial fetch offset ${partitionStates.stateValue(tp).fetchOffset} will be used for truncation.")
      OffsetTruncationState(partitionStates.stateValue(tp).fetchOffset, truncationCompleted = true)
    } else if (leaderEpochOffset.leaderEpoch == UNDEFINED_EPOCH) {
      // either leader or follower or both use inter-broker protocol version < KAFKA_2_0_IV0
      // (version 0 of OffsetForLeaderEpoch request/response)
      warn(s"Leader or replica is on protocol version where leader epoch is not considered in the OffsetsForLeaderEpoch response. " +
           s"The leader's offset ${leaderEpochOffset.endOffset} will be used for truncation in $tp.")
      OffsetTruncationState(min(leaderEpochOffset.endOffset, logEndOffset(tp)), truncationCompleted = true)
    } else {
      val replicaEndOffset = logEndOffset(tp)

      // get (leader epoch, end offset) pair that corresponds to the largest leader epoch
      // less than or equal to the requested epoch.
      endOffsetForEpoch(tp, leaderEpochOffset.leaderEpoch) match {
        case Some(OffsetAndEpoch(followerEndOffset, followerEpoch)) =>
          if (followerEpoch != leaderEpochOffset.leaderEpoch) {
            // the follower does not know about the epoch that leader replied with
            // we truncate to the end offset of the largest epoch that is smaller than the
            // epoch the leader replied with, and send another offset for leader epoch request
            val intermediateOffsetToTruncateTo = min(followerEndOffset, replicaEndOffset)
            info(s"Based on replica's leader epoch, leader replied with epoch ${leaderEpochOffset.leaderEpoch} " +
              s"unknown to the replica for $tp. " +
              s"Will truncate to $intermediateOffsetToTruncateTo and send another leader epoch request to the leader.")
            OffsetTruncationState(intermediateOffsetToTruncateTo, truncationCompleted = false)
          } else {
            val offsetToTruncateTo = min(followerEndOffset, leaderEpochOffset.endOffset)
            OffsetTruncationState(min(offsetToTruncateTo, replicaEndOffset), truncationCompleted = true)
          }
        case None =>
          // This can happen if the follower was not tracking leader epochs at that point (before the
          // upgrade, or if this broker is new). Since the leader replied with epoch <
          // requested epoch from follower, so should be safe to truncate to leader's
          // offset (this is the same behavior as post-KIP-101 and pre-KIP-279)
          warn(s"Based on replica's leader epoch, leader replied with epoch ${leaderEpochOffset.leaderEpoch} " +
            s"below any replica's tracked epochs for $tp. " +
            s"The leader's offset only ${leaderEpochOffset.endOffset} will be used for truncation.")
          OffsetTruncationState(min(leaderEpochOffset.endOffset, replicaEndOffset), truncationCompleted = true)
      }
    }
  }
  • If the endOffset in leaderEpochOffset is UNDEFINED_EPOCH_OFFSET, truncate to the replica's high watermark (the initial fetch offset). This typically happens when:
    • the leader is on an older message format (pre-0.11.0), or
    • the leader epoch the follower asked about is smaller than the earliest leader epoch the leader knows about.
  • If the leaderEpoch in leaderEpochOffset is UNDEFINED_EPOCH, truncate to the smaller of the returned endOffset and the replica's LEO.
  • Otherwise, look up the largest epoch the follower tracks that is less than or equal to the epoch the leader replied with. If that epoch differs from the leader's (the follower does not know the leader's epoch), truncate to that epoch's end offset (capped by the local LEO) and send another OffsetsForLeaderEpoch request; if it matches, truncate to the smallest of the follower's end offset for that epoch, the leader's end offset, and the local LEO. A concrete example follows below.
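Here is a walk-through of that last case with made-up numbers: the follower tracks epochs {0 → start offset 0, 2 → start offset 100} and has LEO 150, and the leader replies with (leaderEpoch = 1, endOffset = 120). The largest epoch the follower knows that is <= 1 is epoch 0, whose end offset is 100 (the start offset of epoch 2):

// Hypothetical numbers illustrating the intermediate-truncation branch of getOffsetTruncationState.
val followerLeo = 150L
val (leaderEpoch, leaderEndOffset) = (1, 120L)
// Largest tracked epoch <= leaderEpoch and its end offset, taken from the follower's epoch cache:
val (followerEpoch, followerEndOffset) = (0, 100L)

val (truncateTo, truncationCompleted) =
  if (followerEpoch != leaderEpoch)
    // The follower does not know epoch 1: truncate to 100 and ask the leader again.
    (math.min(followerEndOffset, followerLeo), false)
  else
    (math.min(math.min(followerEndOffset, leaderEndOffset), followerLeo), true)

assert(truncateTo == 100L && !truncationCompleted)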

Next, let's look at truncation to the high watermark:

private[server] def truncateToHighWatermark(partitions: Set[TopicPartition]): Unit = inLock(partitionMapLock) {
    val fetchOffsets = mutable.HashMap.empty[TopicPartition, OffsetTruncationState]
    val partitionsWithError = mutable.HashSet.empty[TopicPartition]

    for (tp <- partitions) {
      val partitionState = partitionStates.stateValue(tp)
      if (partitionState != null) {
        try {
          val highWatermark = partitionState.fetchOffset
          val truncationState = OffsetTruncationState(highWatermark, truncationCompleted = true)

          info(s"Truncating partition $tp to local high watermark $highWatermark")
          truncate(tp, truncationState)

          fetchOffsets.put(tp, truncationState)
        } catch {
          case e: KafkaStorageException =>
            info(s"Failed to truncate $tp", e)
            partitionsWithError += tp
        }
      }
    }

    handlePartitionsWithErrors(partitionsWithError)
    updateFetchOffsetAndMaybeMarkTruncationComplete(fetchOffsets)
  }

It first reads the high watermark from partitionState (the fetchOffset) and then truncates to it by calling truncate.

fetch

Let's go back to AbstractFetcherThread's doWork method and look at maybeFetch:

private def maybeFetch(): Unit = {
  val (fetchStates, fetchRequestOpt) = inLock(partitionMapLock) {
    val fetchStates = partitionStates.partitionStateMap.asScala
    //Build a fetch request
    val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(fetchStates)

    handlePartitionsWithErrors(partitionsWithError)

    if (fetchRequestOpt.isEmpty) {
      trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
      partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
    }

    (fetchStates, fetchRequestOpt)
  }

  //Perform the fetch
  fetchRequestOpt.foreach { fetchRequest =>
    processFetchRequest(fetchStates, fetchRequest)
  }
}

processFetchRequest is implemented as follows:

private def processFetchRequest(fetchStates: Map[TopicPartition, PartitionFetchState],
                                 fetchRequest: FetchRequest.Builder): Unit = {
   val partitionsWithError = mutable.Set[TopicPartition]()
   var responseData: Seq[(TopicPartition, FetchData)] = Seq.empty

   try {
     trace(s"Sending fetch request $fetchRequest")
      //This is an abstract method implemented by subclasses
     responseData = fetchFromLeader(fetchRequest)
   } catch {
     case t: Throwable =>
       if (isRunning) {
         warn(s"Error in response for fetch request $fetchRequest", t)
         inLock(partitionMapLock) {
           partitionsWithError ++= partitionStates.partitionSet.asScala
           // there is an error occurred while fetching partitions, sleep a while
           // note that `ReplicaFetcherThread.handlePartitionsWithError` will also introduce the same delay for every
           // partition with error effectively doubling the delay. It would be good to improve this.
           partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
         }
       }
   }
   fetcherStats.requestRate.mark()

   if (responseData.nonEmpty) {
     // process fetched data
     inLock(partitionMapLock) {
       responseData.foreach { case (topicPartition, partitionData) =>
         Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =>
           // It's possible that a partition is removed and re-added or truncated when there is a pending fetch request.
           // In this case, we only want to process the fetch response if the partition state is ready for fetch and
           // the current offset is the same as the offset requested.
           val fetchState = fetchStates(topicPartition)
            //Only process partitions whose fetch state did not change while the request was in flight
           if (fetchState.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
             partitionData.error match {
               case Errors.NONE =>
                 try {
                   // Once we hand off the partition data to the subclass, we can't mess with it any more in this thread
                    //Build the logAppendInfo
                   val logAppendInfoOpt = processPartitionData(topicPartition, currentFetchState.fetchOffset,
                     partitionData)

                   logAppendInfoOpt.foreach { logAppendInfo =>
                     val validBytes = logAppendInfo.validBytes
                     val nextOffset = if (validBytes > 0) logAppendInfo.lastOffset + 1 else currentFetchState.fetchOffset
                      //Compute the lag between the next fetch offset and the leader's HW
                     fetcherLagStats.getAndMaybePut(topicPartition).lag = Math.max(0L, partitionData.highWatermark - nextOffset)

                     // ReplicaDirAlterThread may have removed topicPartition from the partitionStates after processing the partition data
                     if (validBytes > 0 && partitionStates.contains(topicPartition)) {
                       // Update partitionStates only if there is no exception during processPartitionData
                        //Update the fetch state for the next fetch
                       val newFetchState = PartitionFetchState(nextOffset, fetchState.currentLeaderEpoch,
                         state = Fetching)
                        //Note: the partition is moved to the end of the map here (round-robin)
                       partitionStates.updateAndMoveToEnd(topicPartition, newFetchState)
                       fetcherStats.byteRate.mark(validBytes)
                     }
                   }
                 } catch {
                   case ime: CorruptRecordException =>
                     // we log the error and continue. This ensures two things
                     // 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread
                     //    down and cause other topic partition to also lag
                     // 2. If the message is corrupt due to a transient state in the log (truncation, partial writes
                     //    can cause this), we simply continue and should get fixed in the subsequent fetches
                     error(s"Found invalid messages during fetch for partition $topicPartition " +
                       s"offset ${currentFetchState.fetchOffset}", ime)
                     partitionsWithError += topicPartition
                   case e: KafkaStorageException =>
                     error(s"Error while processing data for partition $topicPartition", e)
                     partitionsWithError += topicPartition
                   case e: Throwable =>
                     throw new KafkaException(s"Error processing data for partition $topicPartition " +
                       s"offset ${currentFetchState.fetchOffset}", e)
                 }
                //A leader change may leave the fetch offset out of the valid range; the offset needs to be reset (handled via fetchOffsetAndTruncate, shown below)
               case Errors.OFFSET_OUT_OF_RANGE =>
                 if (!handleOutOfRangeError(topicPartition, currentFetchState))
                   partitionsWithError += topicPartition

               case Errors.UNKNOWN_LEADER_EPOCH =>
                 debug(s"Remote broker has a smaller leader epoch for partition $topicPartition than " +
                   s"this replica's current leader epoch of ${fetchState.currentLeaderEpoch}.")
                 partitionsWithError += topicPartition

                //The requested leader epoch is older than the one on the broker; remove the partition from partitionStates and wait for the new LeaderAndIsr state
               case Errors.FENCED_LEADER_EPOCH =>
                 onPartitionFenced(topicPartition)

               case Errors.NOT_LEADER_FOR_PARTITION =>
                 debug(s"Remote broker is not the leader for partition $topicPartition, which could indicate " +
                   "that the partition is being moved")
                 partitionsWithError += topicPartition

               case _ =>
                 error(s"Error for partition $topicPartition at offset ${currentFetchState.fetchOffset}",
                   partitionData.error.exception)
                 partitionsWithError += topicPartition
             }
           }
         }
       }
     }
   }

   if (partitionsWithError.nonEmpty) {
     debug(s"Handling errors for partitions $partitionsWithError")
     handlePartitionsWithErrors(partitionsWithError)
   }
 }
  1. Call the abstract method fetchFromLeader to send the fetch request to the leader and collect responseData.
  2. Only process response data for partitions whose fetch state did not change while the request was in flight (a partition may have been backed off from fetching for various reasons).
  3. If the partition's response error code is NONE:
    • call the abstract method processPartitionData to process the response data and return a logAppendInfo
    • update the partition's fetch state based on the logAppendInfo
    • move the partition's fetch state to the end of the partitionStates map to preserve the round-robin order
  4. If an exception is thrown during processing, add the partition to partitionsWithError.
  5. If the error code is OFFSET_OUT_OF_RANGE, replace the fetch offset with the offset returned by fetchOffsetAndTruncate (adding the partition to partitionsWithError if that fails):
protected def fetchOffsetAndTruncate(topicPartition: TopicPartition, currentLeaderEpoch: Int): Long = {
  val replicaEndOffset = logEndOffset(topicPartition)

  /**
   * Unclean leader election: A follower goes down, in the meanwhile the leader keeps appending messages. The follower comes back up
   * and before it has completely caught up with the leader's logs, all replicas in the ISR go down. The follower is now uncleanly
   * elected as the new leader, and it starts appending messages from the client. The old leader comes back up, becomes a follower
   * and it may discover that the current leader's end offset is behind its own end offset.
   *
   * In such a case, truncate the current follower's log to the current leader's end offset and continue fetching.
   *
   * There is a potential for a mismatch between the logs of the two replicas here. We don't fix this mismatch as of now.
   */
  //Imagine a follower goes down while the leader keeps accepting messages. Later all replicas in the ISR go down as well, then that follower comes back up and is (uncleanly) elected as the new leader. When the old leader comes back up and fetches from the new leader, it may find that the leader's LEO is lower than its own and get an out-of-range error. In that case the follower (the old leader) has to truncate its log before it can continue fetching.
  val leaderEndOffset = fetchLatestOffsetFromLeader(topicPartition, currentLeaderEpoch)
  if (leaderEndOffset < replicaEndOffset) {
    warn(s"Reset fetch offset for partition $topicPartition from $replicaEndOffset to current " +
      s"leader's latest offset $leaderEndOffset")
    truncate(topicPartition, new EpochEndOffset(Errors.NONE, UNDEFINED_EPOCH, leaderEndOffset))
    leaderEndOffset
  } else {
    /**
     * If the leader's log end offset is greater than the follower's log end offset, there are two possibilities:
     * 1. The follower could have been down for a long time and when it starts up, its end offset could be smaller than the leader's
     * start offset because the leader has deleted old logs (log.logEndOffset < leaderStartOffset).
     * 2. When unclean leader election occurs, it is possible that the old leader's high watermark is greater than
     * the new leader's log end offset. So when the old leader truncates its offset to its high watermark and starts
     * to fetch from the new leader, an OffsetOutOfRangeException will be thrown. After that some more messages are
     * produced to the new leader. While the old leader is trying to handle the OffsetOutOfRangeException and query
     * the log end offset of the new leader, the new leader's log end offset becomes higher than the follower's log end offset.
     *
     * In the first case, the follower's current log end offset is smaller than the leader's log start offset. So the
     * follower should truncate all its logs, roll out a new segment and start to fetch from the current leader's log
     * start offset.
     * In the second case, the follower should just keep the current log segments and retry the fetch. In the second
     * case, there will be some inconsistency of data between old and new leader. We are not solving it here.
     * If users want to have strong consistency guarantees, appropriate configurations needs to be set for both
     * brokers and producers.
     *
     * Putting the two cases together, the follower should fetch from the higher one of its replica log end offset
     * and the current leader's log start offset.
     */
  //The case handled here is that the new leader's LEO is higher than the old leader's LEO.
  //As noted above, the old leader's LEO should in theory be higher than the new leader's, which is why OFFSET_OUT_OF_RANGE can occur. But if the old leader was down for too long, its LEO may end up lower than the new leader's log start offset. Alternatively, if the old leader's HW was larger than the new leader's LEO, the old leader first truncates to its HW and then hits OFFSET_OUT_OF_RANGE; while it is busy handling the exception the new leader keeps accepting messages, so the new leader's LEO overtakes the old leader's.
  //In the first case, the old leader (now a follower) truncates its whole log and starts at the new leader's log start offset.
  //In the second case, the old leader does not truncate and simply keeps fetching; the data on the old and new leader may then diverge.
  //Putting the two cases together: on an out-of-range error, set the fetch offset to the larger of the follower's own LEO and the leader's log start offset.
    val leaderStartOffset = fetchEarliestOffsetFromLeader(topicPartition, currentLeaderEpoch)
    warn(s"Reset fetch offset for partition $topicPartition from $replicaEndOffset to current " +
      s"leader's start offset $leaderStartOffset")
    val offsetToFetch = Math.max(leaderStartOffset, replicaEndOffset)
    // Only truncate log when current leader's log start offset is greater than follower's log end offset.
    if (leaderStartOffset > replicaEndOffset)
      truncateFullyAndStartAt(topicPartition, leaderStartOffset)
    offsetToFetch
  }
}
  6. If the error code is UNKNOWN_LEADER_EPOCH, add the partition to partitionsWithError.
  7. If the error code is FENCED_LEADER_EPOCH, remove the partition from partitionStates and wait for the new LeaderAndIsr state.
  8. If the error code is NOT_LEADER_FOR_PARTITION, add the partition to partitionsWithError.
  9. Finally, handle the partitions collected in partitionsWithError by calling delayPartitions:
def delayPartitions(partitions: Iterable[TopicPartition], delay: Long) {
  partitionMapLock.lockInterruptibly()
  try {
    for (partition <- partitions) {
      Option(partitionStates.stateValue(partition)).foreach { currentFetchState =>
        if (!currentFetchState.isDelayed) {
          //Mark the partition's fetch as delayed; the delay passed in is fetchBackOffMs
          partitionStates.updateAndMoveToEnd(partition, PartitionFetchState(currentFetchState.fetchOffset,
            currentFetchState.currentLeaderEpoch, new DelayedItem(delay), currentFetchState.state))
        }
      }
    }
    partitionMapCond.signalAll()
  } finally partitionMapLock.unlock()
}
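For completeness, the DelayedItem used in PartitionFetchState is essentially a creation timestamp plus a delay; the sketch below captures the idea (an illustrative class, not the actual kafka.utils.DelayedItem), and isDelayed simply checks that the remaining delay is still positive:

import java.util.concurrent.TimeUnit

class SimpleDelayedItem(val delayMs: Long) {
  private val createdMs = System.currentTimeMillis()

  // Remaining delay in the requested unit; still delayed while this is > 0.
  def getDelay(unit: TimeUnit): Long =
    unit.convert(createdMs + delayMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS)
}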

ReplicaFetcherThread

Its parent class AbstractFetcherThread already implements most of the fetch logic, using partitionStates to track each partition's fetch state, while the details of fetching are left to subclasses. Here we only look at a few of the more important methods implemented by ReplicaFetcherThread.

processPartitionData

After the fetch thread has pulled down partitionData, it calls processPartitionData, which returns a logAppendInfo. Its implementation here is:

override def processPartitionData(topicPartition: TopicPartition,
                                   fetchOffset: Long,
                                   partitionData: FetchData): Option[LogAppendInfo] = {
   val replica = replicaMgr.localReplicaOrException(topicPartition)
   val partition = replicaMgr.getPartition(topicPartition).get
    //First materialize the fetched data as in-memory records
   val records = toMemoryRecords(partitionData.records)

   maybeWarnIfOversizedRecords(records, topicPartition)

    //The fetch must start exactly at the local log end offset
   if (fetchOffset != replica.logEndOffset.messageOffset)
     throw new IllegalStateException("Offset mismatch for partition %s: fetched offset = %d, log end offset = %d.".format(
       topicPartition, fetchOffset, replica.logEndOffset.messageOffset))

   if (isTraceEnabled)
     trace("Follower has replica log end offset %d for partition %s. Received %d messages and leader hw %d"
       .format(replica.logEndOffset.messageOffset, topicPartition, records.sizeInBytes, partitionData.highWatermark))

   // Append the leader's messages to the log
   val logAppendInfo = partition.appendRecordsToFollowerOrFutureReplica(records, isFuture = false)

   if (isTraceEnabled)
     trace("Follower has replica log end offset %d after appending %d bytes of messages for partition %s"
       .format(replica.logEndOffset.messageOffset, records.sizeInBytes, topicPartition))
    //The follower's HW is the smaller of its local LEO and the leader's HW from the response
   val followerHighWatermark = replica.logEndOffset.messageOffset.min(partitionData.highWatermark)
   val leaderLogStartOffset = partitionData.logStartOffset
   // for the follower replica, we do not need to keep
   // its segment base offset the physical position,
   // these values will be computed upon making the leader
   replica.highWatermark = new LogOffsetMetadata(followerHighWatermark)
   replica.maybeIncrementLogStartOffset(leaderLogStartOffset)
   if (isTraceEnabled)
     trace(s"Follower set replica high watermark for partition $topicPartition to $followerHighWatermark")

   // Traffic from both in-sync and out of sync replicas are accounted for in replication quota to ensure total replication
   // traffic doesn't exceed quota.
   if (quota.isThrottled(topicPartition))
     quota.record(records.sizeInBytes)
   replicaMgr.brokerTopicStats.updateReplicationBytesIn(records.sizeInBytes)

   logAppendInfo
 }
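One detail worth calling out from the code above: the follower's HW is capped by its own LEO, so a follower that is still behind does not blindly adopt the leader's HW. With made-up numbers:

// Hypothetical values: the follower has appended up to offset 120 while the leader reports HW 150.
val followerLeo = 120L
val leaderHighWatermark = 150L
val followerHighWatermark = math.min(followerLeo, leaderHighWatermark)
assert(followerHighWatermark == 120L)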

start up

As the name suggests, replicaManager manages all replicas of a partition. What exactly does that involve? ReplicaManager schedules three periodic tasks:

  • isr-expiration
  • isr-change-propagation
  • shutdown-idle-replica-alter-log-dirs-thread
scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
    scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)
    scheduler.schedule("shutdown-idle-replica-alter-log-dirs-thread", shutdownIdleReplicaAlterLogDirsThread _, period = 10000L, unit = TimeUnit.MILLISECONDS)

Let's look at these three tasks in turn.

isr-expiration

In Kafka, a partition's follower must not fall too far behind the leader: a replica that lags for more than config.replicaLagTimeMaxMs is removed from the ISR (the check runs every replicaLagTimeMaxMs / 2, so in the worst case removal happens about 1.5x that time after the replica stops keeping up). The isr-expiration task is what checks whether any replica has expired in this sense.

private def maybeShrinkIsr(): Unit = {
    trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
    nonOfflinePartitionsIterator.foreach(_.maybeShrinkIsr(config.replicaLagTimeMaxMs))
  }
def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
   val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
     leaderReplicaIfLocal match {
       case Some(leaderReplica) =>
         val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
         if(outOfSyncReplicas.nonEmpty) {
           val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
           assert(newInSyncReplicas.nonEmpty)
           info("Shrinking ISR from %s to %s".format(inSyncReplicas.map(_.brokerId).mkString(","),
             newInSyncReplicas.map(_.brokerId).mkString(",")))
           // update ISR in zk and in cache
           updateIsr(newInSyncReplicas)
           // we may need to increment high watermark since ISR could be down to 1

           replicaManager.isrShrinkRate.mark()
           maybeIncrementLeaderHW(leaderReplica)
         } else {
           false
         }

       case None => false // do nothing if no longer leader
     }
   }

   // some delayed operations may be unblocked after HW changed
   if (leaderHWIncremented)
     tryCompleteDelayedRequests()
 }

The isr-expiration task iterates over all topicPartitions and calls each partition's own maybeShrinkIsr method. A Partition internally keeps a map from brokerId to Replica. In maybeShrinkIsr, the partition first checks whether this broker is its leader; if not, it does nothing, otherwise:

  1. Get the out-of-sync replicas, i.e. the replicas that have not kept up with the leader.
  2. If outOfSyncReplicas is non-empty, remove them from the ISR.
  3. Update the ISR information in ZooKeeper.
  4. Check whether the high watermark can be advanced.

The logic of getOutOfSyncReplicas is:

def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
    /**
     * If the follower already has the same leo as the leader, it will not be considered as out-of-sync,
     * otherwise there are two cases that will be handled here -
     * 1. Stuck followers: If the leo of the replica hasn't been updated for maxLagMs ms,
     *                     the follower is stuck and should be removed from the ISR
     * 2. Slow followers: If the replica has not read up to the leo within the last maxLagMs ms,
     *                    then the follower is lagging and should be removed from the ISR
     * Both these cases are handled by checking the lastCaughtUpTimeMs which represents
     * the last time when the replica was fully caught up. If either of the above conditions
     * is violated, that replica is considered to be out of sync
     *
     **/
    val candidateReplicas = inSyncReplicas - leaderReplica

    val laggingReplicas = candidateReplicas.filter(r =>
    //The LEO differs from the leader's and the replica has not caught up with the leader's LEO for more than maxLagMs
      r.logEndOffset.messageOffset != leaderReplica.logEndOffset.messageOffset && (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
    if (laggingReplicas.nonEmpty)
      debug("Lagging replicas are %s".format(laggingReplicas.map(_.brokerId).mkString(",")))

    laggingReplicas
  }

lastCaughtUpTimeMs is the last time this replica's fetch reached (or passed) the leader's LEO, i.e. the last time it was fully caught up.
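A rough sketch of how such a timestamp could be maintained (illustrative names, not the exact Replica code): it only advances when a fetch reads up to the leader's LEO as of that fetch, which is what getOutOfSyncReplicas then compares against maxLagMs:

class FollowerState {
  @volatile private var lastCaughtUpTimeMs: Long = 0L

  // Called for every fetch from this follower, with the leader's LEO at the time of the fetch.
  def onFetch(fetchOffset: Long, leaderLogEndOffset: Long, fetchTimeMs: Long): Unit = {
    if (fetchOffset >= leaderLogEndOffset)
      lastCaughtUpTimeMs = math.max(lastCaughtUpTimeMs, fetchTimeMs)
  }

  // Mirrors the getOutOfSyncReplicas condition above: in sync if the LEOs match or the
  // follower was fully caught up within the last maxLagMs milliseconds.
  def isInSync(leaderLogEndOffset: Long, followerLogEndOffset: Long,
               nowMs: Long, maxLagMs: Long): Boolean =
    followerLogEndOffset == leaderLogEndOffset || (nowMs - lastCaughtUpTimeMs) <= maxLagMs
}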

After the out-of-sync replicas are found, the ISR information in ZooKeeper is updated:

private def updateIsr(newIsr: Set[Replica]) {
   val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.map(_.brokerId).toList, zkVersion)
    //Note: this uses the controller epoch from the last leader update (the controller sends its epoch whenever it finishes electing a partition leader)
   val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topicPartition, newLeaderAndIsr,
     controllerEpoch)

   if (updateSucceeded) {
      //After the update in ZooKeeper succeeds, the cached ISR information must be updated as well
     replicaManager.recordIsrChange(topicPartition)
     inSyncReplicas = newIsr
     zkVersion = newVersion
     trace("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
   } else {
     replicaManager.failedIsrUpdatesRate.mark()
     info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
   }
 }

Finally, because the ISR was updated, the partition also needs to check whether the high watermark should be advanced (the HW is the smallest LEO in the ISR).

private def maybeIncrementLeaderHW(leaderReplica: Replica, curTime: Long = time.milliseconds): Boolean = {
   val allLogEndOffsets = assignedReplicas.filter { replica =>
     curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
   }.map(_.logEndOffset)
   val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering)
   val oldHighWatermark = leaderReplica.highWatermark

   // Ensure that the high watermark increases monotonically. We also update the high watermark when the new
   // offset metadata is on a newer segment, which occurs whenever the log is rolled to a new segment.
   if (oldHighWatermark.messageOffset < newHighWatermark.messageOffset ||
     (oldHighWatermark.messageOffset == newHighWatermark.messageOffset && oldHighWatermark.onOlderSegment(newHighWatermark))) {
     leaderReplica.highWatermark = newHighWatermark
     debug(s"High watermark updated to $newHighWatermark")
     true
   } else {
     def logEndOffsetString(r: Replica) = s"replica ${r.brokerId}: ${r.logEndOffset}"
     debug(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old hw $oldHighWatermark. " +
       s"All current LEOs are ${assignedReplicas.map(logEndOffsetString)}")
     false
   }
 }

The HW generally changes in two situations:

  • the ISR changes
  • a replica's LEO changes

Why does the second point say "a replica's LEO" rather than "the LEO of a replica in the ISR"? When deciding whether to advance the HW, Kafka considers not only the ISR but also replicas that can still catch up. Imagine the ISR has shrunk down to just the leader, with every follower trailing behind. If the HW were advanced without waiting for those followers, the HW would simply be the leader's LEO, the followers' LEOs would keep lagging behind it, and they would struggle to ever rejoin the ISR (a follower is only added back once it has caught up to the leader's HW).
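A toy example of that rule with made-up numbers: the leader and one in-sync follower are at LEO 100, while a third replica has dropped out of the ISR but last caught up within replicaLagTimeMaxMs and sits at LEO 80. Because maybeIncrementLeaderHW still counts that replica, the HW stays at 80 until it catches up:

// Hypothetical LEOs; the HW is the minimum LEO over ISR members plus still-caught-up replicas.
val leos = Map("leader" -> 100L, "follower2" -> 100L, "follower3" -> 80L)
val isr = Set("leader", "follower2")
val caughtUpWithinMaxLag = Set("follower3")

val newHw = leos.collect { case (replica, leo) if isr(replica) || caughtUpWithinMaxLag(replica) => leo }.min
assert(newHw == 80L)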

isr-change-propagation

This task writes ISR-change events to the ZooKeeper path /isr_change_notification/isr_change_, and the controller then picks up and handles these events.

def maybePropagateIsrChanges() {
    val now = System.currentTimeMillis()
    isrChangeSet synchronized {
      if (isrChangeSet.nonEmpty &&
        (lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
          lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
        zkClient.propagateIsrChanges(isrChangeSet)
        isrChangeSet.clear()
        lastIsrPropagationMs.set(now)
      }
    }
  }

To avoid a flood of ISR-change events, the changed ISRs are propagated only when:

  • there has been no ISR change in the last 5 seconds, or
  • no ISR change has been propagated in the last 60 seconds.

shutdown-idle-replica-alter-log-dirs-thread