kafka server - replicaManager

This is a draft.

replica manager

Initialization

When the Kafka server starts, it creates a ReplicaManager dedicated to managing the state of each partition's replicas. Let's look at this class. We will skip the constructor parameters and go straight to its member fields:

  • controllerEpoch: the epoch of the controller this broker currently sees
  • localBrokerId: the id of this broker
  • allPartitions: a cache of partitions
  • replicaStateChangeLock: the lock (an intrinsic lock) taken when changing replica state
  • replicaFetcherManager:
  • replicaAlterLogDirsManager

Both replicaFetcherManager and replicaAlterLogDirsManager extend AbstractFetcherManager, whose main job is to manage the replica fetcher threads. Internally it maintains a mapping from topicPartition, via a fetcher id, to an AbstractFetcherThread. replicaFetcherManager creates ReplicaFetcherThread instances, while ReplicaAlterLogDirsManager creates ReplicaAlterLogDirsThread instances. We will cover what these two threads do in detail later.
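To make the mapping concrete, here is a rough sketch of the idea (a hypothetical FetcherPool, not the actual AbstractFetcherManager API): partitions fetched from the same source broker are spread over a fixed pool of fetcher threads keyed by (broker id, fetcher id), and the hash-based fetcher id below is an assumption modeled on Kafka's getFetcherId.

import scala.collection.mutable

case class BrokerIdAndFetcherId(brokerId: Int, fetcherId: Int)

class FetcherPool(numFetchersPerBroker: Int) {
  private val fetcherThreadMap = mutable.Map.empty[BrokerIdAndFetcherId, Thread]

  // Hash the topic/partition into one of numFetchersPerBroker slots (assumption).
  private def fetcherId(topic: String, partitionId: Int): Int =
    math.abs(31 * topic.hashCode + partitionId) % numFetchersPerBroker

  // All partitions that map to the same (broker id, fetcher id) share one fetcher thread.
  def fetcherFor(sourceBrokerId: Int, topic: String, partitionId: Int): Thread = {
    val key = BrokerIdAndFetcherId(sourceBrokerId, fetcherId(topic, partitionId))
    fetcherThreadMap.getOrElseUpdate(key,
      new Thread(new Runnable { def run(): Unit = () /* fetch loop elided */ },
        s"fetcher-${key.brokerId}-${key.fetcherId}"))
  }
}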

  • highWatermarkCheckPointThreadStarted: marks whether the high-watermark checkpoint thread has been started; only the leader starts this thread
  • highWatermarkCheckpoints
  • isrChangeSet: the set of topicPartitions whose ISR changed recently
  • lastIsrChangeMs: the time of the last ISR change
  • lastIsrPropagationMs: the time the last ISR change was propagated
  • logDirFailureHandler

AbstractFetcherThread

AbstractFetcherThread is used to fetch data for multiple partitions from a single broker. Its constructor parameters are:

  • name
  • clientId
  • sourceBroker
  • fetchBackOffMs
  • isInterruptible

AbstractFetcherThread extends ShutdownableThread; name and isInterruptible are parameters inherited from the parent class, representing the thread name and whether the thread may be interrupted on shutdown. sourceBroker is the endpoint of the broker to fetch data from, and fetchBackOffMs is the back-off time between fetch attempts.

The member fields of AbstractFetcherThread include:

  • partitionStates: new PartitionStates[PartitionFetchState]
  • partitionMapLock: new ReentrantLock
  • partitionMapCond: partitionMapLock.newCondition()

PartitionStates internally holds a map from TopicPartition to a per-partition state:

public class PartitionStates<S> {

    private final LinkedHashMap<TopicPartition, S> map = new LinkedHashMap<>();
    private final Set<TopicPartition> partitionSetView = Collections.unmodifiableSet(map.keySet());
    ...
}

The map is a LinkedHashMap so that the keys (topicPartitions) can be iterated in order and an entry can be moved to the end of the map, which gives a round-robin effect. For now just keep the structure in mind; we will come back to how it is used later.
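To make the round-robin behavior concrete, here is a minimal sketch (an illustrative class, not Kafka's PartitionStates) of iterating a LinkedHashMap and moving the just-served key to the end:

import java.util.LinkedHashMap

class RoundRobinStates[S] {
  private val map = new LinkedHashMap[String, S]()

  // Removing and re-inserting a key pushes it to the end of the LinkedHashMap.
  def updateAndMoveToEnd(key: String, state: S): Unit = {
    map.remove(key)
    map.put(key, state)
  }

  // The head of the map is the next key to serve; after serving it, move it to the end.
  def nextToServe(): Option[(String, S)] = {
    val it = map.entrySet().iterator()
    if (!it.hasNext) None
    else {
      val e = it.next()
      val result = (e.getKey, e.getValue)
      updateAndMoveToEnd(result._1, result._2)
      Some(result)
    }
  }
}

In AbstractFetcherThread the value type of the map is PartitionFetchState, which represents the current fetch state of a partition: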

case class PartitionFetchState(fetchOffset: Long,
                               currentLeaderEpoch: Int,
                               delay: DelayedItem,
                               state: ReplicaState) {

  def isReadyForFetch: Boolean = state == Fetching && !isDelayed

  def isTruncating: Boolean = state == Truncating && !isDelayed

  //isDelayed indicates whether the DelayedItem has not yet expired (i.e. the partition is still backing off)
  def isDelayed: Boolean = delay.getDelay(TimeUnit.MILLISECONDS) > 0

  override def toString: String = {
    s"FetchState(fetchOffset=$fetchOffset" +
      s", currentLeaderEpoch=$currentLeaderEpoch" +
      s", state=$state" +
      s", delay=${delay.delayMs}ms" +
      s")"
  }
}

This class holds a partition's fetch offset and its state:

  • Truncating
  • Delayed
  • ReadyForFetch
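For reference, judging from the isReadyForFetch and isTruncating methods above, the ReplicaState field is a small two-valued ADT along the lines of the sketch below, with "Delayed" derived from the DelayedItem rather than being a separate value; this is a reconstruction, not a verbatim copy of the source:

sealed trait ReplicaState
case object Truncating extends ReplicaState
case object Fetching extends ReplicaState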

Because AbstractFetcherThread extends ShutdownableThread, it keeps executing its doWork method:

override def doWork() {
    maybeTruncate()
    maybeFetch()
  }

maybeTruncate decides whether the log needs to be truncated. Why would a log ever need truncation? Whenever a follower is elected leader, the other followers' logs must not contain more data than the new leader's, so part of their logs has to be cut off to keep replication consistent. How much should be cut? Kafka partitions have the notion of a leader epoch: a cache records at which offset each leader epoch started. When a new leader is elected, a replica that can read this cache knows exactly where to truncate.
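To make the leader-epoch idea concrete, here is a minimal sketch (an illustrative class, not Kafka's actual LeaderEpochFileCache) of a cache that records the start offset of each epoch and can answer "what is the end offset of epoch X":

import scala.collection.mutable.ListBuffer

class EpochStartOffsetCache {
  // (epoch, start offset) entries, assumed to be appended with strictly increasing epochs.
  private val entries = ListBuffer.empty[(Int, Long)]

  def assign(epoch: Int, startOffset: Long): Unit = entries += ((epoch, startOffset))

  // Returns (largest known epoch <= requestedEpoch, its end offset). The end offset of an
  // epoch is the start offset of the next epoch, or the log end offset for the latest epoch.
  def endOffsetFor(requestedEpoch: Int, logEndOffset: Long): Option[(Int, Long)] = {
    val known = entries.takeWhile { case (epoch, _) => epoch <= requestedEpoch }
    known.lastOption.map { case (epoch, _) =>
      val nextStart = entries.drop(known.size).headOption.map(_._2)
      (epoch, nextStart.getOrElse(logEndOffset))
    }
  }
}

With such a cache on both the leader and the followers, a follower knows exactly where to cut. The implementation of maybeTruncate is: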

private def maybeTruncate(): Unit = {
   val (partitionsWithEpochs, partitionsWithoutEpochs) = fetchTruncatingPartitions()
   if (partitionsWithEpochs.nonEmpty) {
     truncateToEpochEndOffsets(partitionsWithEpochs)
   }
   if (partitionsWithoutEpochs.nonEmpty) {
     truncateToHighWatermark(partitionsWithoutEpochs)
   }
 }
  1. First, split the partitions in partitionStates into those that have a leader epoch and those that do not.
  2. Partitions with a leader epoch are truncated according to the leader epoch end offset.
  3. Partitions without a leader epoch are truncated to the high watermark.

Truncation

Now let's look at the concrete implementation of truncation, starting with truncation by leader epoch:

private def truncateToEpochEndOffsets(latestEpochsForPartitions: Map[TopicPartition, EpochData]): Unit = {
    val endOffsets = fetchEpochEndOffsets(latestEpochsForPartitions)
    //Ensure we hold a lock during truncation.
    inLock(partitionMapLock) {
      //Check no leadership and no leader epoch changes happened whilst we were unlocked, fetching epochs
      val epochEndOffsets = endOffsets.filter { case (tp, _) =>
        val curPartitionState = partitionStates.stateValue(tp)
        val partitionEpochRequest = latestEpochsForPartitions.get(tp).getOrElse {
          throw new IllegalStateException(
            s"Leader replied with partition $tp not requested in OffsetsForLeaderEpoch request")
        }
        val leaderEpochInRequest = partitionEpochRequest.currentLeaderEpoch.get
        //First make sure the leader epoch we just requested matches the one in the cache (recall that partitionStates holds each partition's fetch state)
        curPartitionState != null && leaderEpochInRequest == curPartitionState.currentLeaderEpoch
      }

      val ResultWithPartitions(fetchOffsets, partitionsWithError) = maybeTruncateToEpochEndOffsets(epochEndOffsets)
      handlePartitionsWithErrors(partitionsWithError)
      updateFetchOffsetAndMaybeMarkTruncationComplete(fetchOffsets)
    }
  }
  1. First, fetch the latest end offset for each leader epoch (an abstract method implemented by subclasses).
  2. Take the lock so that leadership cannot change during truncation.
  3. Filter out partitions whose leader epoch read from partitionStates earlier no longer matches the one just fetched (i.e. the leader changed in between).
  4. Perform the truncation.
  5. Mark truncation as complete.

The method that performs the actual truncation is:

private def maybeTruncateToEpochEndOffsets(fetchedEpochs: Map[TopicPartition, EpochEndOffset]): ResultWithPartitions[Map[TopicPartition, OffsetTruncationState]] = {
    val fetchOffsets = mutable.HashMap.empty[TopicPartition, OffsetTruncationState]
    val partitionsWithError = mutable.HashSet.empty[TopicPartition]

    fetchedEpochs.foreach { case (tp, leaderEpochOffset) =>
      try {
        leaderEpochOffset.error match {
          case Errors.NONE =>
            val offsetTruncationState = getOffsetTruncationState(tp, leaderEpochOffset)
            truncate(tp, offsetTruncationState)
            fetchOffsets.put(tp, offsetTruncationState)

          case Errors.FENCED_LEADER_EPOCH =>
            onPartitionFenced(tp)

          case error =>
            info(s"Retrying leaderEpoch request for partition $tp as the leader reported an error: $error")
            partitionsWithError += tp
        }
      } catch {
        case e: KafkaStorageException =>
          info(s"Failed to truncate $tp", e)
          partitionsWithError += tp
      }
    }

    ResultWithPartitions(fetchOffsets, partitionsWithError)
  }
  1. The truncation logic depends on the leader's response for each partition.
  2. If the leader responded without error, i.e. leaderEpochOffset.error == NONE, then:
    • compute the offset to truncate to
    • call truncate to perform the truncation (an abstract method)
    • record the truncation result
  3. If the error is FENCED_LEADER_EPOCH, the requested epoch is lower than the epoch on the broker, i.e. this topicPartition's epoch is stale, so the partition is removed from partitionStates.

So what exactly is the offset we truncate to? Does the leader always return such a leader epoch offset, or are there other cases? Let's look at getOffsetTruncationState:

private def getOffsetTruncationState(tp: TopicPartition,
                                       leaderEpochOffset: EpochEndOffset): OffsetTruncationState = inLock(partitionMapLock) {
    //If the leader returned an undefined offset
    if (leaderEpochOffset.endOffset == UNDEFINED_EPOCH_OFFSET) {
      // truncate to initial offset which is the high watermark for follower replica. For
      // future replica, it is either high watermark of the future replica or current
      // replica's truncation offset (when the current replica truncates, it forces future
      // replica's partition state to 'truncating' and sets initial offset to its truncation offset)
      warn(s"Based on replica's leader epoch, leader replied with an unknown offset in $tp. " +
           s"The initial fetch offset ${partitionStates.stateValue(tp).fetchOffset} will be used for truncation.")
      OffsetTruncationState(partitionStates.stateValue(tp).fetchOffset, truncationCompleted = true)
    } else if (leaderEpochOffset.leaderEpoch == UNDEFINED_EPOCH) {
      // either leader or follower or both use inter-broker protocol version < KAFKA_2_0_IV0
      // (version 0 of OffsetForLeaderEpoch request/response)
      warn(s"Leader or replica is on protocol version where leader epoch is not considered in the OffsetsForLeaderEpoch response. " +
           s"The leader's offset ${leaderEpochOffset.endOffset} will be used for truncation in $tp.")
      OffsetTruncationState(min(leaderEpochOffset.endOffset, logEndOffset(tp)), truncationCompleted = true)
    } else {
      val replicaEndOffset = logEndOffset(tp)

      // get (leader epoch, end offset) pair that corresponds to the largest leader epoch
      // less than or equal to the requested epoch.
      endOffsetForEpoch(tp, leaderEpochOffset.leaderEpoch) match {
        case Some(OffsetAndEpoch(followerEndOffset, followerEpoch)) =>
          if (followerEpoch != leaderEpochOffset.leaderEpoch) {
            // the follower does not know about the epoch that leader replied with
            // we truncate to the end offset of the largest epoch that is smaller than the
            // epoch the leader replied with, and send another offset for leader epoch request
            val intermediateOffsetToTruncateTo = min(followerEndOffset, replicaEndOffset)
            info(s"Based on replica's leader epoch, leader replied with epoch ${leaderEpochOffset.leaderEpoch} " +
              s"unknown to the replica for $tp. " +
              s"Will truncate to $intermediateOffsetToTruncateTo and send another leader epoch request to the leader.")
            OffsetTruncationState(intermediateOffsetToTruncateTo, truncationCompleted = false)
          } else {
            val offsetToTruncateTo = min(followerEndOffset, leaderEpochOffset.endOffset)
            OffsetTruncationState(min(offsetToTruncateTo, replicaEndOffset), truncationCompleted = true)
          }
        case None =>
          // This can happen if the follower was not tracking leader epochs at that point (before the
          // upgrade, or if this broker is new). Since the leader replied with epoch <
          // requested epoch from follower, so should be safe to truncate to leader's
          // offset (this is the same behavior as post-KIP-101 and pre-KIP-279)
          warn(s"Based on replica's leader epoch, leader replied with epoch ${leaderEpochOffset.leaderEpoch} " +
            s"below any replica's tracked epochs for $tp. " +
            s"The leader's offset only ${leaderEpochOffset.endOffset} will be used for truncation.")
          OffsetTruncationState(min(leaderEpochOffset.endOffset, replicaEndOffset), truncationCompleted = true)
      }
    }
  }
  • If the endOffset in leaderEpochOffset is UNDEFINED_EPOCH_OFFSET, truncate to the replica's high watermark (the initial fetch offset). This typically happens when:
    • the leader is on an older message format (pre-0.11.0), or
    • the leader epoch the follower asked about is smaller than the earliest leader epoch the leader knows about.
  • If the leaderEpoch in leaderEpochOffset is UNDEFINED_EPOCH, truncate to the smaller of the returned endOffset and the replica's LEO.
  • Otherwise, look up the largest epoch the follower tracks that is less than or equal to the epoch the leader replied with. If that epoch differs from the leader's (the follower does not know the leader's epoch), truncate to that epoch's end offset (capped by the local LEO) and send another OffsetsForLeaderEpoch request; if it matches, truncate to the smallest of the follower's end offset for that epoch, the leader's end offset, and the local LEO. A concrete example follows below.
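Here is a walk-through of that last case with made-up numbers: the follower tracks epochs {0 → start offset 0, 2 → start offset 100} and has LEO 150, and the leader replies with (leaderEpoch = 1, endOffset = 120). The largest epoch the follower knows that is <= 1 is epoch 0, whose end offset is 100 (the start offset of epoch 2):

// Hypothetical numbers illustrating the intermediate-truncation branch of getOffsetTruncationState.
val followerLeo = 150L
val (leaderEpoch, leaderEndOffset) = (1, 120L)
// Largest tracked epoch <= leaderEpoch and its end offset, taken from the follower's epoch cache:
val (followerEpoch, followerEndOffset) = (0, 100L)

val (truncateTo, truncationCompleted) =
  if (followerEpoch != leaderEpoch)
    // The follower does not know epoch 1: truncate to 100 and ask the leader again.
    (math.min(followerEndOffset, followerLeo), false)
  else
    (math.min(math.min(followerEndOffset, leaderEndOffset), followerLeo), true)

assert(truncateTo == 100L && !truncationCompleted)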

Next, let's look at truncation to the high watermark:

private[server] def truncateToHighWatermark(partitions: Set[TopicPartition]): Unit = inLock(partitionMapLock) {
    val fetchOffsets = mutable.HashMap.empty[TopicPartition, OffsetTruncationState]
    val partitionsWithError = mutable.HashSet.empty[TopicPartition]

    for (tp <- partitions) {
      val partitionState = partitionStates.stateValue(tp)
      if (partitionState != null) {
        try {
          val highWatermark = partitionState.fetchOffset
          val truncationState = OffsetTruncationState(highWatermark, truncationCompleted = true)

          info(s"Truncating partition $tp to local high watermark $highWatermark")
          truncate(tp, truncationState)

          fetchOffsets.put(tp, truncationState)
        } catch {
          case e: KafkaStorageException =>
            info(s"Failed to truncate $tp", e)
            partitionsWithError += tp
        }
      }
    }

    handlePartitionsWithErrors(partitionsWithError)
    updateFetchOffsetAndMaybeMarkTruncationComplete(fetchOffsets)
  }

It first reads the high watermark from partitionState (the fetchOffset) and then truncates to it by calling truncate.

fetch

Let's go back to AbstractFetcherThread's doWork method and look at maybeFetch:

private def maybeFetch(): Unit = {
  val (fetchStates, fetchRequestOpt) = inLock(partitionMapLock) {
    val fetchStates = partitionStates.partitionStateMap.asScala
    //Build a fetch request
    val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(fetchStates)

    handlePartitionsWithErrors(partitionsWithError)

    if (fetchRequestOpt.isEmpty) {
      trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
      partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
    }

    (fetchStates, fetchRequestOpt)
  }

  //Perform the fetch
  fetchRequestOpt.foreach { fetchRequest =>
    processFetchRequest(fetchStates, fetchRequest)
  }
}

processFetchRequest is implemented as follows:

private def processFetchRequest(fetchStates: Map[TopicPartition, PartitionFetchState],
                                 fetchRequest: FetchRequest.Builder): Unit = {
   val partitionsWithError = mutable.Set[TopicPartition]()
   var responseData: Seq[(TopicPartition, FetchData)] = Seq.empty

   try {
     trace(s"Sending fetch request $fetchRequest")
      //This is an abstract method implemented by subclasses
     responseData = fetchFromLeader(fetchRequest)
   } catch {
     case t: Throwable =>
       if (isRunning) {
         warn(s"Error in response for fetch request $fetchRequest", t)
         inLock(partitionMapLock) {
           partitionsWithError ++= partitionStates.partitionSet.asScala
           // there is an error occurred while fetching partitions, sleep a while
           // note that `ReplicaFetcherThread.handlePartitionsWithError` will also introduce the same delay for every
           // partition with error effectively doubling the delay. It would be good to improve this.
           partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
         }
       }
   }
   fetcherStats.requestRate.mark()

   if (responseData.nonEmpty) {
     // process fetched data
     inLock(partitionMapLock) {
       responseData.foreach { case (topicPartition, partitionData) =>
         Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =>
           // It's possible that a partition is removed and re-added or truncated when there is a pending fetch request.
           // In this case, we only want to process the fetch response if the partition state is ready for fetch and
           // the current offset is the same as the offset requested.
           val fetchState = fetchStates(topicPartition)
            //Only process partitions whose fetch state did not change while the request was in flight
           if (fetchState.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
             partitionData.error match {
               case Errors.NONE =>
                 try {
                   // Once we hand off the partition data to the subclass, we can't mess with it any more in this thread
                    //Build the logAppendInfo
                   val logAppendInfoOpt = processPartitionData(topicPartition, currentFetchState.fetchOffset,
                     partitionData)

                   logAppendInfoOpt.foreach { logAppendInfo =>
                     val validBytes = logAppendInfo.validBytes
                     val nextOffset = if (validBytes > 0) logAppendInfo.lastOffset + 1 else currentFetchState.fetchOffset
                      //Compute the lag between the next fetch offset and the leader's HW
                     fetcherLagStats.getAndMaybePut(topicPartition).lag = Math.max(0L, partitionData.highWatermark - nextOffset)

                     // ReplicaDirAlterThread may have removed topicPartition from the partitionStates after processing the partition data
                     if (validBytes > 0 && partitionStates.contains(topicPartition)) {
                       // Update partitionStates only if there is no exception during processPartitionData
                        //Update the fetch state for the next fetch
                       val newFetchState = PartitionFetchState(nextOffset, fetchState.currentLeaderEpoch,
                         state = Fetching)
                        //Note: the partition is moved to the end of the map here (round-robin)
                       partitionStates.updateAndMoveToEnd(topicPartition, newFetchState)
                       fetcherStats.byteRate.mark(validBytes)
                     }
                   }
                 } catch {
                   case ime: CorruptRecordException =>
                     // we log the error and continue. This ensures two things
                     // 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread
                     //    down and cause other topic partition to also lag
                     // 2. If the message is corrupt due to a transient state in the log (truncation, partial writes
                     //    can cause this), we simply continue and should get fixed in the subsequent fetches
                     error(s"Found invalid messages during fetch for partition $topicPartition " +
                       s"offset ${currentFetchState.fetchOffset}", ime)
                     partitionsWithError += topicPartition
                   case e: KafkaStorageException =>
                     error(s"Error while processing data for partition $topicPartition", e)
                     partitionsWithError += topicPartition
                   case e: Throwable =>
                     throw new KafkaException(s"Error processing data for partition $topicPartition " +
                       s"offset ${currentFetchState.fetchOffset}", e)
                 }
                //A leader change may leave the fetch offset out of the valid range; the offset needs to be reset (handled via fetchOffsetAndTruncate, shown below)
               case Errors.OFFSET_OUT_OF_RANGE =>
                 if (!handleOutOfRangeError(topicPartition, currentFetchState))
                   partitionsWithError += topicPartition

               case Errors.UNKNOWN_LEADER_EPOCH =>
                 debug(s"Remote broker has a smaller leader epoch for partition $topicPartition than " +
                   s"this replica's current leader epoch of ${fetchState.currentLeaderEpoch}.")
                 partitionsWithError += topicPartition

                //The requested leader epoch is older than the one on the broker; remove the partition from partitionStates and wait for the new LeaderAndIsr state
               case Errors.FENCED_LEADER_EPOCH =>
                 onPartitionFenced(topicPartition)

               case Errors.NOT_LEADER_FOR_PARTITION =>
                 debug(s"Remote broker is not the leader for partition $topicPartition, which could indicate " +
                   "that the partition is being moved")
                 partitionsWithError += topicPartition

               case _ =>
                 error(s"Error for partition $topicPartition at offset ${currentFetchState.fetchOffset}",
                   partitionData.error.exception)
                 partitionsWithError += topicPartition
             }
           }
         }
       }
     }
   }

   if (partitionsWithError.nonEmpty) {
     debug(s"Handling errors for partitions $partitionsWithError")
     handlePartitionsWithErrors(partitionsWithError)
   }
 }
  1. Call the abstract method fetchFromLeader to send the fetch request to the leader and collect responseData.
  2. Only process response data for partitions whose fetch state did not change while the request was in flight (a partition may have been backed off from fetching for various reasons).
  3. If the partition's response error code is NONE:
    • call the abstract method processPartitionData to process the response data and return a logAppendInfo
    • update the partition's fetch state based on the logAppendInfo
    • move the partition's fetch state to the end of the partitionStates map to preserve the round-robin order
  4. If an exception is thrown during processing, add the partition to partitionsWithError.
  5. If the error code is OFFSET_OUT_OF_RANGE, replace the fetch offset with the offset returned by fetchOffsetAndTruncate (adding the partition to partitionsWithError if that fails):
protected def fetchOffsetAndTruncate(topicPartition: TopicPartition, currentLeaderEpoch: Int): Long = {
  val replicaEndOffset = logEndOffset(topicPartition)

  /**
   * Unclean leader election: A follower goes down, in the meanwhile the leader keeps appending messages. The follower comes back up
   * and before it has completely caught up with the leader's logs, all replicas in the ISR go down. The follower is now uncleanly
   * elected as the new leader, and it starts appending messages from the client. The old leader comes back up, becomes a follower
   * and it may discover that the current leader's end offset is behind its own end offset.
   *
   * In such a case, truncate the current follower's log to the current leader's end offset and continue fetching.
   *
   * There is a potential for a mismatch between the logs of the two replicas here. We don't fix this mismatch as of now.
   */
  //Imagine a follower goes down while the leader keeps accepting messages. Later all replicas in the ISR go down as well, then that follower comes back up and is (uncleanly) elected as the new leader. When the old leader comes back up and fetches from the new leader, it may find that the leader's LEO is lower than its own and get an out-of-range error. In that case the follower (the old leader) has to truncate its log before it can continue fetching.
  val leaderEndOffset = fetchLatestOffsetFromLeader(topicPartition, currentLeaderEpoch)
  if (leaderEndOffset < replicaEndOffset) {
    warn(s"Reset fetch offset for partition $topicPartition from $replicaEndOffset to current " +
      s"leader's latest offset $leaderEndOffset")
    truncate(topicPartition, new EpochEndOffset(Errors.NONE, UNDEFINED_EPOCH, leaderEndOffset))
    leaderEndOffset
  } else {
    /**
     * If the leader's log end offset is greater than the follower's log end offset, there are two possibilities:
     * 1. The follower could have been down for a long time and when it starts up, its end offset could be smaller than the leader's
     * start offset because the leader has deleted old logs (log.logEndOffset < leaderStartOffset).
     * 2. When unclean leader election occurs, it is possible that the old leader's high watermark is greater than
     * the new leader's log end offset. So when the old leader truncates its offset to its high watermark and starts
     * to fetch from the new leader, an OffsetOutOfRangeException will be thrown. After that some more messages are
     * produced to the new leader. While the old leader is trying to handle the OffsetOutOfRangeException and query
     * the log end offset of the new leader, the new leader's log end offset becomes higher than the follower's log end offset.
     *
     * In the first case, the follower's current log end offset is smaller than the leader's log start offset. So the
     * follower should truncate all its logs, roll out a new segment and start to fetch from the current leader's log
     * start offset.
     * In the second case, the follower should just keep the current log segments and retry the fetch. In the second
     * case, there will be some inconsistency of data between old and new leader. We are not solving it here.
     * If users want to have strong consistency guarantees, appropriate configurations needs to be set for both
     * brokers and producers.
     *
     * Putting the two cases together, the follower should fetch from the higher one of its replica log end offset
     * and the current leader's log start offset.
     */
  //The case handled here is that the new leader's LEO is higher than the old leader's LEO.
  //As noted above, the old leader's LEO should in theory be higher than the new leader's, which is why OFFSET_OUT_OF_RANGE can occur. But if the old leader was down for too long, its LEO may end up lower than the new leader's log start offset. Alternatively, if the old leader's HW was larger than the new leader's LEO, the old leader first truncates to its HW and then hits OFFSET_OUT_OF_RANGE; while it is busy handling the exception the new leader keeps accepting messages, so the new leader's LEO overtakes the old leader's.
  //In the first case, the old leader (now a follower) truncates its whole log and starts at the new leader's log start offset.
  //In the second case, the old leader does not truncate and simply keeps fetching; the data on the old and new leader may then diverge.
  //Putting the two cases together: on an out-of-range error, set the fetch offset to the larger of the follower's own LEO and the leader's log start offset.
    val leaderStartOffset = fetchEarliestOffsetFromLeader(topicPartition, currentLeaderEpoch)
    warn(s"Reset fetch offset for partition $topicPartition from $replicaEndOffset to current " +
      s"leader's start offset $leaderStartOffset")
    val offsetToFetch = Math.max(leaderStartOffset, replicaEndOffset)
    // Only truncate log when current leader's log start offset is greater than follower's log end offset.
    if (leaderStartOffset > replicaEndOffset)
      truncateFullyAndStartAt(topicPartition, leaderStartOffset)
    offsetToFetch
  }
}
  6. If the error code is UNKNOWN_LEADER_EPOCH, add the partition to partitionsWithError.
  7. If the error code is FENCED_LEADER_EPOCH, remove the partition from partitionStates and wait for the new LeaderAndIsr state.
  8. If the error code is NOT_LEADER_FOR_PARTITION, add the partition to partitionsWithError.
  9. Finally, handle the partitions collected in partitionsWithError by calling delayPartitions:
def delayPartitions(partitions: Iterable[TopicPartition], delay: Long) {
  partitionMapLock.lockInterruptibly()
  try {
    for (partition <- partitions) {
      Option(partitionStates.stateValue(partition)).foreach { currentFetchState =>
        if (!currentFetchState.isDelayed) {
          //Mark the partition's fetch as delayed; the delay passed in is fetchBackOffMs
          partitionStates.updateAndMoveToEnd(partition, PartitionFetchState(currentFetchState.fetchOffset,
            currentFetchState.currentLeaderEpoch, new DelayedItem(delay), currentFetchState.state))
        }
      }
    }
    partitionMapCond.signalAll()
  } finally partitionMapLock.unlock()
}
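For completeness, the DelayedItem used in PartitionFetchState is essentially a creation timestamp plus a delay; the sketch below captures the idea (an illustrative class, not the actual kafka.utils.DelayedItem), and isDelayed simply checks that the remaining delay is still positive:

import java.util.concurrent.TimeUnit

class SimpleDelayedItem(val delayMs: Long) {
  private val createdMs = System.currentTimeMillis()

  // Remaining delay in the requested unit; still delayed while this is > 0.
  def getDelay(unit: TimeUnit): Long =
    unit.convert(createdMs + delayMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS)
}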

ReplicaFetcherThread

Its parent class AbstractFetcherThread already implements most of the fetch logic, using partitionStates to track each partition's fetch state, while the details of fetching are left to subclasses. Here we only look at a few of the more important methods implemented by ReplicaFetcherThread.

processPartitionData

After the fetch thread has pulled down partitionData, it calls processPartitionData, which returns a logAppendInfo. Its implementation here is:

override def processPartitionData(topicPartition: TopicPartition,
                                   fetchOffset: Long,
                                   partitionData: FetchData): Option[LogAppendInfo] = {
   val replica = replicaMgr.localReplicaOrException(topicPartition)
   val partition = replicaMgr.getPartition(topicPartition).get
    //First materialize the fetched data as in-memory records
   val records = toMemoryRecords(partitionData.records)

   maybeWarnIfOversizedRecords(records, topicPartition)

    //The fetch must start exactly at the local log end offset
   if (fetchOffset != replica.logEndOffset.messageOffset)
     throw new IllegalStateException("Offset mismatch for partition %s: fetched offset = %d, log end offset = %d.".format(
       topicPartition, fetchOffset, replica.logEndOffset.messageOffset))

   if (isTraceEnabled)
     trace("Follower has replica log end offset %d for partition %s. Received %d messages and leader hw %d"
       .format(replica.logEndOffset.messageOffset, topicPartition, records.sizeInBytes, partitionData.highWatermark))

   // Append the leader's messages to the log
   val logAppendInfo = partition.appendRecordsToFollowerOrFutureReplica(records, isFuture = false)

   if (isTraceEnabled)
     trace("Follower has replica log end offset %d after appending %d bytes of messages for partition %s"
       .format(replica.logEndOffset.messageOffset, records.sizeInBytes, topicPartition))
    //The follower's HW is the smaller of its local LEO and the leader's HW from the response
   val followerHighWatermark = replica.logEndOffset.messageOffset.min(partitionData.highWatermark)
   val leaderLogStartOffset = partitionData.logStartOffset
   // for the follower replica, we do not need to keep
   // its segment base offset the physical position,
   // these values will be computed upon making the leader
   replica.highWatermark = new LogOffsetMetadata(followerHighWatermark)
   replica.maybeIncrementLogStartOffset(leaderLogStartOffset)
   if (isTraceEnabled)
     trace(s"Follower set replica high watermark for partition $topicPartition to $followerHighWatermark")

   // Traffic from both in-sync and out of sync replicas are accounted for in replication quota to ensure total replication
   // traffic doesn't exceed quota.
   if (quota.isThrottled(topicPartition))
     quota.record(records.sizeInBytes)
   replicaMgr.brokerTopicStats.updateReplicationBytesIn(records.sizeInBytes)

   logAppendInfo
 }
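One detail worth calling out from the code above: the follower's HW is capped by its own LEO, so a follower that is still behind does not blindly adopt the leader's HW. With made-up numbers:

// Hypothetical values: the follower has appended up to offset 120 while the leader reports HW 150.
val followerLeo = 120L
val leaderHighWatermark = 150L
val followerHighWatermark = math.min(followerLeo, leaderHighWatermark)
assert(followerHighWatermark == 120L)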

start up

As the name suggests, replicaManager manages all replicas of a partition. What exactly does that involve? ReplicaManager schedules three periodic tasks:

  • isr-expiration
  • isr-change-propagation
  • shutdown-idle-replica-alter-log-dirs-thread
scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
    scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)
    scheduler.schedule("shutdown-idle-replica-alter-log-dirs-thread", shutdownIdleReplicaAlterLogDirsThread _, period = 10000L, unit = TimeUnit.MILLISECONDS)

Let's look at these three tasks in turn.

isr-expiration

In Kafka, a partition's follower must not fall too far behind the leader: a replica that lags for more than config.replicaLagTimeMaxMs is removed from the ISR (the check runs every replicaLagTimeMaxMs / 2, so in the worst case removal happens about 1.5x that time after the replica stops keeping up). The isr-expiration task is what checks whether any replica has expired in this sense.

private def maybeShrinkIsr(): Unit = {
    trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
    nonOfflinePartitionsIterator.foreach(_.maybeShrinkIsr(config.replicaLagTimeMaxMs))
  }
def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
   val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
     leaderReplicaIfLocal match {
       case Some(leaderReplica) =>
         val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
         if(outOfSyncReplicas.nonEmpty) {
           val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
           assert(newInSyncReplicas.nonEmpty)
           info("Shrinking ISR from %s to %s".format(inSyncReplicas.map(_.brokerId).mkString(","),
             newInSyncReplicas.map(_.brokerId).mkString(",")))
           // update ISR in zk and in cache
           updateIsr(newInSyncReplicas)
           // we may need to increment high watermark since ISR could be down to 1

           replicaManager.isrShrinkRate.mark()
           maybeIncrementLeaderHW(leaderReplica)
         } else {
           false
         }

       case None => false // do nothing if no longer leader
     }
   }

   // some delayed operations may be unblocked after HW changed
   if (leaderHWIncremented)
     tryCompleteDelayedRequests()
 }

The isr-expiration task iterates over all topicPartitions and calls each partition's own maybeShrinkIsr method. A Partition internally keeps a map from brokerId to Replica. In maybeShrinkIsr, the partition first checks whether this broker is its leader; if not, it does nothing, otherwise:

  1. Get the out-of-sync replicas, i.e. the replicas that have not kept up with the leader.
  2. If outOfSyncReplicas is non-empty, remove them from the ISR.
  3. Update the ISR information in ZooKeeper.
  4. Check whether the high watermark can be advanced.

The logic of getOutOfSyncReplicas is:

def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
    /**
     * If the follower already has the same leo as the leader, it will not be considered as out-of-sync,
     * otherwise there are two cases that will be handled here -
     * 1. Stuck followers: If the leo of the replica hasn't been updated for maxLagMs ms,
     *                     the follower is stuck and should be removed from the ISR
     * 2. Slow followers: If the replica has not read up to the leo within the last maxLagMs ms,
     *                    then the follower is lagging and should be removed from the ISR
     * Both these cases are handled by checking the lastCaughtUpTimeMs which represents
     * the last time when the replica was fully caught up. If either of the above conditions
     * is violated, that replica is considered to be out of sync
     *
     **/
    val candidateReplicas = inSyncReplicas - leaderReplica

    val laggingReplicas = candidateReplicas.filter(r =>
    //The LEO differs from the leader's and the replica has not caught up with the leader's LEO for more than maxLagMs
      r.logEndOffset.messageOffset != leaderReplica.logEndOffset.messageOffset && (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
    if (laggingReplicas.nonEmpty)
      debug("Lagging replicas are %s".format(laggingReplicas.map(_.brokerId).mkString(",")))

    laggingReplicas
  }

lastCaughtUpTimeMs is the last time this replica's fetch reached (or passed) the leader's LEO, i.e. the last time it was fully caught up.
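A rough sketch of how such a timestamp could be maintained (illustrative names, not the exact Replica code): it only advances when a fetch reads up to the leader's LEO as of that fetch, which is what getOutOfSyncReplicas then compares against maxLagMs:

class FollowerState {
  @volatile private var lastCaughtUpTimeMs: Long = 0L

  // Called for every fetch from this follower, with the leader's LEO at the time of the fetch.
  def onFetch(fetchOffset: Long, leaderLogEndOffset: Long, fetchTimeMs: Long): Unit = {
    if (fetchOffset >= leaderLogEndOffset)
      lastCaughtUpTimeMs = math.max(lastCaughtUpTimeMs, fetchTimeMs)
  }

  // Mirrors the getOutOfSyncReplicas condition above: in sync if the LEOs match or the
  // follower was fully caught up within the last maxLagMs milliseconds.
  def isInSync(leaderLogEndOffset: Long, followerLogEndOffset: Long,
               nowMs: Long, maxLagMs: Long): Boolean =
    followerLogEndOffset == leaderLogEndOffset || (nowMs - lastCaughtUpTimeMs) <= maxLagMs
}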

After the out-of-sync replicas are found, the ISR information in ZooKeeper is updated:

private def updateIsr(newIsr: Set[Replica]) {
   val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.map(_.brokerId).toList, zkVersion)
    //Note: this uses the controller epoch from the last leader update (the controller sends its epoch whenever it finishes electing a partition leader)
   val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topicPartition, newLeaderAndIsr,
     controllerEpoch)

   if (updateSucceeded) {
      //After the update in ZooKeeper succeeds, the cached ISR information must be updated as well
     replicaManager.recordIsrChange(topicPartition)
     inSyncReplicas = newIsr
     zkVersion = newVersion
     trace("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
   } else {
     replicaManager.failedIsrUpdatesRate.mark()
     info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
   }
 }

Finally, because the ISR was updated, the partition also needs to check whether the high watermark should be advanced (the HW is the smallest LEO in the ISR).

private def maybeIncrementLeaderHW(leaderReplica: Replica, curTime: Long = time.milliseconds): Boolean = {
   val allLogEndOffsets = assignedReplicas.filter { replica =>
     curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
   }.map(_.logEndOffset)
   val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering)
   val oldHighWatermark = leaderReplica.highWatermark

   // Ensure that the high watermark increases monotonically. We also update the high watermark when the new
   // offset metadata is on a newer segment, which occurs whenever the log is rolled to a new segment.
   if (oldHighWatermark.messageOffset < newHighWatermark.messageOffset ||
     (oldHighWatermark.messageOffset == newHighWatermark.messageOffset && oldHighWatermark.onOlderSegment(newHighWatermark))) {
     leaderReplica.highWatermark = newHighWatermark
     debug(s"High watermark updated to $newHighWatermark")
     true
   } else {
     def logEndOffsetString(r: Replica) = s"replica ${r.brokerId}: ${r.logEndOffset}"
     debug(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old hw $oldHighWatermark. " +
       s"All current LEOs are ${assignedReplicas.map(logEndOffsetString)}")
     false
   }
 }

The HW generally changes in two situations:

  • the ISR changes
  • a replica's LEO changes

Why does the second point say "a replica's LEO" rather than "the LEO of a replica in the ISR"? When deciding whether to advance the HW, Kafka considers not only the ISR but also replicas that can still catch up. Imagine the ISR has shrunk down to just the leader, with every follower trailing behind. If the HW were advanced without waiting for those followers, the HW would simply be the leader's LEO, the followers' LEOs would keep lagging behind it, and they would struggle to ever rejoin the ISR (a follower is only added back once it has caught up to the leader's HW).
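A toy example of that rule with made-up numbers: the leader and one in-sync follower are at LEO 100, while a third replica has dropped out of the ISR but last caught up within replicaLagTimeMaxMs and sits at LEO 80. Because maybeIncrementLeaderHW still counts that replica, the HW stays at 80 until it catches up:

// Hypothetical LEOs; the HW is the minimum LEO over ISR members plus still-caught-up replicas.
val leos = Map("leader" -> 100L, "follower2" -> 100L, "follower3" -> 80L)
val isr = Set("leader", "follower2")
val caughtUpWithinMaxLag = Set("follower3")

val newHw = leos.collect { case (replica, leo) if isr(replica) || caughtUpWithinMaxLag(replica) => leo }.min
assert(newHw == 80L)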

isr-change-propagation

This task writes ISR-change events to the ZooKeeper path /isr_change_notification/isr_change_, and the controller then picks up and handles these events.

def maybePropagateIsrChanges() {
    val now = System.currentTimeMillis()
    isrChangeSet synchronized {
      if (isrChangeSet.nonEmpty &&
        (lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
          lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
        zkClient.propagateIsrChanges(isrChangeSet)
        isrChangeSet.clear()
        lastIsrPropagationMs.set(now)
      }
    }
  }

To avoid a flood of ISR-change events, the changed ISRs are propagated only when:

  • there has been no ISR change in the last 5 seconds, or
  • no ISR change has been propagated in the last 60 seconds.

shutdown-idle-replica-alter-log-dirs-thread