This is a draft.
replica manager
Initialization
When the Kafka server starts up, it creates a ReplicaManager dedicated to managing the state of each partition's replicas. Let's look at this class. We will skip the constructor parameters and go straight to its member fields:
- controllerEpoch: the epoch of the controller as last seen by this broker
- localBrokerId: the id of this broker
- allPartitions: the cache of partitions hosted on this broker
- replicaStateChangeLock: the (intrinsic) lock taken when changing replica state
- replicaFetcherManager:
- replicaAlterLogDirsManager
replicaFetcherManager and replicaAlterLogDirsManager both extend AbstractFetcherManager, whose main job is to manage replica fetcher threads: internally it keeps a mapping from topic partitions (via a fetcher id) to AbstractFetcherThread instances. replicaFetcherManager creates ReplicaFetcherThread instances, while replicaAlterLogDirsManager creates ReplicaAlterLogDirsThread instances; we will look at what these two threads do in detail later (a small sketch of the partition-to-thread sharding follows this list).
- highWatermarkCheckPointThreadStarted: flags whether the high watermark checkpoint thread has been started; it is started the first time this broker becomes the leader or a follower of some partition
- highWatermarkCheckpoints
- isrChangeSet: the set of topic partitions whose ISR changed recently
- lastIsrChangeMs: the time of the last ISR change
- lastIsrPropagationMs: the time ISR changes were last propagated
- logDirFailureHandler
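AbstractFetcherManager does not create one fetcher thread per partition: partitions fetching from the same source broker are hashed onto a fixed pool of fetcher threads, and the (source broker, fetcher id) pair selects the thread. The sketch below only illustrates this sharding idea; the pool size and the hash formula are simplified assumptions, not the exact Kafka code.
object FetcherShardingSketch {
  // mirrors the idea of num.replica.fetchers; the value here is just an example
  val numFetchersPerBroker = 4

  // hash a topic partition to a fetcher id (formula chosen for illustration)
  def fetcherId(topic: String, partition: Int): Int =
    Math.floorMod(31 * topic.hashCode + partition, numFetchersPerBroker)

  def main(args: Array[String]): Unit = {
    // partitions of the same topic can end up on different ReplicaFetcherThread instances
    for (p <- 0 until 3)
      println(s"orders-$p -> fetcher id ${fetcherId("orders", p)}")
  }
}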
AbstractFetcherThread
AbstractFetcherThread fetches data for multiple partitions from a single broker. Its constructor parameters are:
- name
- clientId
- sourceBroker
- fetchBackOffMs
- isInterruptible
AbstractFetcherThread extends ShutdownableThread; name and isInterruptible are parameters inherited from the parent class, giving the thread's name and whether it may be interrupted on shutdown. sourceBroker is the endpoint of the broker to fetch from, and fetchBackOffMs is the back-off time between fetch attempts.
The member fields of AbstractFetcherThread include:
- partitionStates: new PartitionStates[PartitionFetchState]
- partitionMapLock: new ReentrantLock
- partitionMapCond: partitionMapLock.newCondition()
PartitionStates internally holds a map from TopicPartition to a state value:
public class PartitionStates<S> {
private final LinkedHashMap<TopicPartition, S> map = new LinkedHashMap<>();
private final Set<TopicPartition> partitionSetView = Collections.unmodifiableSet(map.keySet());
...
}
The map is a LinkedHashMap so that the keys (topic partitions) can be iterated in order and a key can be moved to the end of the map, which gives a round-robin effect (a small sketch of this rotation follows the state list below). For now just keep the structure in mind; we will see how it is used later. In AbstractFetcherThread the map's value type is PartitionFetchState, which describes the current fetch state of a partition:
case class PartitionFetchState(fetchOffset: Long,
currentLeaderEpoch: Int,
delay: DelayedItem,
state: ReplicaState) {
def isReadyForFetch: Boolean = state == Fetching && !isDelayed
def isTruncating: Boolean = state == Truncating && !isDelayed
// isDelayed is true while this DelayedItem has not yet expired
def isDelayed: Boolean = delay.getDelay(TimeUnit.MILLISECONDS) > 0
override def toString: String = {
s"FetchState(fetchOffset=$fetchOffset" +
s", currentLeaderEpoch=$currentLeaderEpoch" +
s", state=$state" +
s", delay=${delay.delayMs}ms" +
s")"
}
}
This class stores a partition's fetch offset together with its fetch state, which can be seen as one of:
- Truncating
- Delayed
- ReadyForFetch
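To make the rotation concrete, here is a small self-contained sketch (class and method names are illustrative stand-ins, not Kafka's PartitionStates): removing a key from a LinkedHashMap and re-inserting it moves that key to the tail of the iteration order, which is exactly the property updateAndMoveToEnd relies on for round-robin fetching.
import java.util.{LinkedHashMap => JLinkedHashMap}

final case class FetchStateSketch(fetchOffset: Long)

final class PartitionStatesSketch {
  // insertion-ordered map; remove + put moves the key to the tail
  private val map = new JLinkedHashMap[String, FetchStateSketch]()

  def updateAndMoveToEnd(tp: String, state: FetchStateSketch): Unit = {
    map.remove(tp)
    map.put(tp, state)
  }

  def partitionsInOrder(): java.util.List[String] =
    new java.util.ArrayList[String](map.keySet())
}

object PartitionStatesSketchDemo {
  def main(args: Array[String]): Unit = {
    val states = new PartitionStatesSketch
    Seq("t-0", "t-1", "t-2").foreach(tp => states.updateAndMoveToEnd(tp, FetchStateSketch(0L)))
    // pretend t-0 was just fetched: it rotates to the back, so t-1 is served first next time
    states.updateAndMoveToEnd("t-0", FetchStateSketch(100L))
    println(states.partitionsInOrder()) // [t-1, t-2, t-0]
  }
}
This is why processFetchRequest (shown later) calls updateAndMoveToEnd after successfully processing a partition's data: the partition that was just served goes to the back of the queue.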
Since AbstractFetcherThread extends ShutdownableThread, its doWork method is executed repeatedly:
override def doWork() {
maybeTruncate()
maybeFetch()
}
Here, maybeTruncate decides whether the log needs to be truncated. Why truncate at all? Whenever a follower is elected as the new leader, the other followers must not have more log than the leader, so part of their logs has to be cut off to keep replication consistent. How much should be cut? Kafka partitions maintain leader epochs: each replica records at which offset every leader epoch began. When a new leader appears, a follower that can obtain this epoch information can truncate its log to exactly the right point (a small sketch of this bookkeeping follows the list below). maybeTruncate is implemented as:
private def maybeTruncate(): Unit = {
val (partitionsWithEpochs, partitionsWithoutEpochs) = fetchTruncatingPartitions()
if (partitionsWithEpochs.nonEmpty) {
truncateToEpochEndOffsets(partitionsWithEpochs)
}
if (partitionsWithoutEpochs.nonEmpty) {
truncateToHighWatermark(partitionsWithoutEpochs)
}
}
- First, obtain from partitionStates the lists of truncating partitions that have a leader epoch and those that do not.
- Partitions with a leader epoch are truncated according to the leader epoch end offset.
- Partitions without a leader epoch are truncated to the high watermark.
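To make the leader-epoch bookkeeping concrete, the sketch below models the epoch to start-offset entries a replica records and how an OffsetsForLeaderEpoch request is answered: the end offset of a requested epoch is the start offset of the next higher epoch, or the log end offset if the requested epoch is still the newest one. The cache structure here is a simplification for illustration, not Kafka's LeaderEpochFileCache.
object LeaderEpochCacheSketch {
  // (epoch, startOffset) pairs in ascending epoch order, plus the current log end offset
  val epochStartOffsets: Seq[(Int, Long)] = Seq((0, 0L), (1, 120L), (3, 500L))
  val logEndOffset: Long = 730L

  // end offset for the largest locally-known epoch <= requestedEpoch,
  // i.e. the start offset of the next epoch (or the LEO for the latest epoch)
  def endOffsetFor(requestedEpoch: Int): Option[(Int, Long)] = {
    val known = epochStartOffsets.filter(_._1 <= requestedEpoch)
    known.lastOption.map { case (epoch, _) =>
      val nextStart = epochStartOffsets.find(_._1 > epoch).map(_._2).getOrElse(logEndOffset)
      (epoch, nextStart)
    }
  }

  def main(args: Array[String]): Unit = {
    println(endOffsetFor(1)) // Some((1,500)) -> a replica on epoch 1 would truncate to offset 500
    println(endOffsetFor(3)) // Some((3,730)) -> epoch 3 is current, so the end offset is the LEO
  }
}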
Truncation
Let's look at the actual truncation logic, starting with truncation based on the leader epoch:
private def truncateToEpochEndOffsets(latestEpochsForPartitions: Map[TopicPartition, EpochData]): Unit = {
val endOffsets = fetchEpochEndOffsets(latestEpochsForPartitions)
//Ensure we hold a lock during truncation.
inLock(partitionMapLock) {
//Check no leadership and no leader epoch changes happened whilst we were unlocked, fetching epochs
val epochEndOffsets = endOffsets.filter { case (tp, _) =>
val curPartitionState = partitionStates.stateValue(tp)
val partitionEpochRequest = latestEpochsForPartitions.get(tp).getOrElse {
throw new IllegalStateException(
s"Leader replied with partition $tp not requested in OffsetsForLeaderEpoch request")
}
val leaderEpochInRequest = partitionEpochRequest.currentLeaderEpoch.get
// ensure the leader epoch used in the request still matches the cached one (recall that partitionStates stores each partition's fetch state)
curPartitionState != null && leaderEpochInRequest == curPartitionState.currentLeaderEpoch
}
val ResultWithPartitions(fetchOffsets, partitionsWithError) = maybeTruncateToEpochEndOffsets(epochEndOffsets)
handlePartitionsWithErrors(partitionsWithError)
updateFetchOffsetAndMaybeMarkTruncationComplete(fetchOffsets)
}
}
- First, fetch the end offsets for the latest leader epochs (fetchEpochEndOffsets is an abstract method implemented by subclasses).
- Take the lock so that leadership cannot change while truncating.
- Filter out partitions whose leader epoch cached in partitionStates no longer matches the one used in the request (i.e. the leader changed while we were unlocked).
- Run the truncation logic.
- Update the fetch offsets and mark truncation complete where appropriate.
The truncation itself is performed by:
private def maybeTruncateToEpochEndOffsets(fetchedEpochs: Map[TopicPartition, EpochEndOffset]): ResultWithPartitions[Map[TopicPartition, OffsetTruncationState]] = {
val fetchOffsets = mutable.HashMap.empty[TopicPartition, OffsetTruncationState]
val partitionsWithError = mutable.HashSet.empty[TopicPartition]
fetchedEpochs.foreach { case (tp, leaderEpochOffset) =>
try {
leaderEpochOffset.error match {
case Errors.NONE =>
val offsetTruncationState = getOffsetTruncationState(tp, leaderEpochOffset)
truncate(tp, offsetTruncationState)
fetchOffsets.put(tp, offsetTruncationState)
case Errors.FENCED_LEADER_EPOCH =>
onPartitionFenced(tp)
case error =>
info(s"Retrying leaderEpoch request for partition $tp as the leader reported an error: $error")
partitionsWithError += tp
}
} catch {
case e: KafkaStorageException =>
info(s"Failed to truncate $tp", e)
partitionsWithError += tp
}
}
ResultWithPartitions(fetchOffsets, partitionsWithError)
}
- The truncation logic depends on the leader's response.
- If the leader reports no error, i.e. leaderEpochOffset.error == NONE, then
- compute the offset to truncate to,
- call truncate to perform the truncation (an abstract method),
- and record the truncation result.
- If the error is FENCED_LEADER_EPOCH, the requested epoch is lower than the epoch on the broker, i.e. this topic partition's epoch is stale, so the partition is removed from partitionStates.
So what exactly is the offset we truncate to? Does the leader always return such a leader epoch end offset, and are there other cases to handle? Let's look at getOffsetTruncationState:
private def getOffsetTruncationState(tp: TopicPartition,
leaderEpochOffset: EpochEndOffset): OffsetTruncationState = inLock(partitionMapLock) {
// the leader replied with an undefined offset
if (leaderEpochOffset.endOffset == UNDEFINED_EPOCH_OFFSET) {
// truncate to initial offset which is the high watermark for follower replica. For
// future replica, it is either high watermark of the future replica or current
// replica's truncation offset (when the current replica truncates, it forces future
// replica's partition state to 'truncating' and sets initial offset to its truncation offset)
warn(s"Based on replica's leader epoch, leader replied with an unknown offset in $tp. " +
s"The initial fetch offset ${partitionStates.stateValue(tp).fetchOffset} will be used for truncation.")
OffsetTruncationState(partitionStates.stateValue(tp).fetchOffset, truncationCompleted = true)
} else if (leaderEpochOffset.leaderEpoch == UNDEFINED_EPOCH) {
// either leader or follower or both use inter-broker protocol version < KAFKA_2_0_IV0
// (version 0 of OffsetForLeaderEpoch request/response)
warn(s"Leader or replica is on protocol version where leader epoch is not considered in the OffsetsForLeaderEpoch response. " +
s"The leader's offset ${leaderEpochOffset.endOffset} will be used for truncation in $tp.")
OffsetTruncationState(min(leaderEpochOffset.endOffset, logEndOffset(tp)), truncationCompleted = true)
} else {
val replicaEndOffset = logEndOffset(tp)
// get (leader epoch, end offset) pair that corresponds to the largest leader epoch
// less than or equal to the requested epoch.
endOffsetForEpoch(tp, leaderEpochOffset.leaderEpoch) match {
case Some(OffsetAndEpoch(followerEndOffset, followerEpoch)) =>
if (followerEpoch != leaderEpochOffset.leaderEpoch) {
// the follower does not know about the epoch that leader replied with
// we truncate to the end offset of the largest epoch that is smaller than the
// epoch the leader replied with, and send another offset for leader epoch request
val intermediateOffsetToTruncateTo = min(followerEndOffset, replicaEndOffset)
info(s"Based on replica's leader epoch, leader replied with epoch ${leaderEpochOffset.leaderEpoch} " +
s"unknown to the replica for $tp. " +
s"Will truncate to $intermediateOffsetToTruncateTo and send another leader epoch request to the leader.")
OffsetTruncationState(intermediateOffsetToTruncateTo, truncationCompleted = false)
} else {
val offsetToTruncateTo = min(followerEndOffset, leaderEpochOffset.endOffset)
OffsetTruncationState(min(offsetToTruncateTo, replicaEndOffset), truncationCompleted = true)
}
case None =>
// This can happen if the follower was not tracking leader epochs at that point (before the
// upgrade, or if this broker is new). Since the leader replied with epoch <
// requested epoch from follower, so should be safe to truncate to leader's
// offset (this is the same behavior as post-KIP-101 and pre-KIP-279)
warn(s"Based on replica's leader epoch, leader replied with epoch ${leaderEpochOffset.leaderEpoch} " +
s"below any replica's tracked epochs for $tp. " +
s"The leader's offset only ${leaderEpochOffset.endOffset} will be used for truncation.")
OffsetTruncationState(min(leaderEpochOffset.endOffset, replicaEndOffset), truncationCompleted = true)
}
}
}
- If the endOffset in the response is UNDEFINED_EPOCH_OFFSET, truncate to the partition's current fetch offset (which for a follower replica is initialized to its high watermark). This usually happens when
- the leader is on an old message format (pre-0.11.0), or
- the leader epoch the follower asked about is smaller than the earliest leader epoch the leader knows.
- If the leaderEpoch in the response is UNDEFINED_EPOCH, truncate to the smaller of the returned endOffset and the replica's LEO.
- Otherwise both fields are defined. If the follower knows the leader's epoch, it truncates to the minimum of its own end offset for that epoch, the leader's endOffset and its LEO, and marks truncation complete; if it only knows some smaller epoch, it truncates to that epoch's end offset (capped by its LEO) and sends another OffsetsForLeaderEpoch request before completing; if it tracks no epochs at all, it truncates to the smaller of the leader's endOffset and its LEO. The sketch below condenses these cases.
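Stripped of logging and locking, the decision above can be summarized by a small pure function (a condensed paraphrase for readability; the UNDEFINED markers and parameter names are simplified and this is not the actual Kafka code):
object TruncationDecisionSketch {
  val UndefinedEpoch = -1
  val UndefinedOffset = -1L

  final case class Decision(truncateTo: Long, completed: Boolean)

  // followerEndOffsetFor returns the (epoch, endOffset) the follower has locally
  // for the largest epoch <= the leader's epoch, if it tracks epochs at all
  def decide(leaderEpoch: Int,
             leaderEndOffset: Long,
             followerFetchOffset: Long,      // HW-based initial fetch offset
             followerLogEndOffset: Long,
             followerEndOffsetFor: Int => Option[(Int, Long)]): Decision = {
    if (leaderEndOffset == UndefinedOffset)
      Decision(followerFetchOffset, completed = true)               // case 1: leader gave no offset
    else if (leaderEpoch == UndefinedEpoch)
      Decision(math.min(leaderEndOffset, followerLogEndOffset), completed = true) // case 2: old protocol
    else followerEndOffsetFor(leaderEpoch) match {
      case Some((followerEpoch, followerEndOffset)) if followerEpoch == leaderEpoch =>
        Decision(math.min(math.min(followerEndOffset, leaderEndOffset), followerLogEndOffset), completed = true)
      case Some((_, followerEndOffset)) =>                          // follower does not know that epoch yet
        Decision(math.min(followerEndOffset, followerLogEndOffset), completed = false)
      case None =>                                                  // follower tracks no epochs at all
        Decision(math.min(leaderEndOffset, followerLogEndOffset), completed = true)
    }
  }

  def main(args: Array[String]): Unit = {
    // follower knows epoch 5 up to offset 300, local LEO 320; leader says epoch 5 ends at 310
    val d = decide(leaderEpoch = 5, leaderEndOffset = 310L, followerFetchOffset = 250L,
                   followerLogEndOffset = 320L, followerEndOffsetFor = _ => Some((5, 300L)))
    println(d) // Decision(300,true)
  }
}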
Next, truncation to the high watermark:
private[server] def truncateToHighWatermark(partitions: Set[TopicPartition]): Unit = inLock(partitionMapLock) {
val fetchOffsets = mutable.HashMap.empty[TopicPartition, OffsetTruncationState]
val partitionsWithError = mutable.HashSet.empty[TopicPartition]
for (tp <- partitions) {
val partitionState = partitionStates.stateValue(tp)
if (partitionState != null) {
try {
val highWatermark = partitionState.fetchOffset
val truncationState = OffsetTruncationState(highWatermark, truncationCompleted = true)
info(s"Truncating partition $tp to local high watermark $highWatermark")
truncate(tp, truncationState)
fetchOffsets.put(tp, truncationState)
} catch {
case e: KafkaStorageException =>
info(s"Failed to truncate $tp", e)
partitionsWithError += tp
}
}
}
handlePartitionsWithErrors(partitionsWithError)
updateFetchOffsetAndMaybeMarkTruncationComplete(fetchOffsets)
}
It reads the "high watermark" from the partitionState, i.e. its fetchOffset (for a follower replica the initial fetch offset is its high watermark, as the code comment above notes), and truncates to it by calling the truncate method.
fetch
Back in AbstractFetcherThread's doWork method, let's now look at maybeFetch:
private def maybeFetch(): Unit = {
val (fetchStates, fetchRequestOpt) = inLock(partitionMapLock) {
val fetchStates = partitionStates.partitionStateMap.asScala
// build a fetch request
val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(fetchStates)
handlePartitionsWithErrors(partitionsWithError)
if (fetchRequestOpt.isEmpty) {
trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
}
(fetchStates, fetchRequestOpt)
}
// send the fetch request and process the response
fetchRequestOpt.foreach { fetchRequest =>
processFetchRequest(fetchStates, fetchRequest)
}
}
processFetchRequest is implemented as follows:
private def processFetchRequest(fetchStates: Map[TopicPartition, PartitionFetchState],
fetchRequest: FetchRequest.Builder): Unit = {
val partitionsWithError = mutable.Set[TopicPartition]()
var responseData: Seq[(TopicPartition, FetchData)] = Seq.empty
try {
trace(s"Sending fetch request $fetchRequest")
// abstract method, implemented by subclasses
responseData = fetchFromLeader(fetchRequest)
} catch {
case t: Throwable =>
if (isRunning) {
warn(s"Error in response for fetch request $fetchRequest", t)
inLock(partitionMapLock) {
partitionsWithError ++= partitionStates.partitionSet.asScala
// there is an error occurred while fetching partitions, sleep a while
// note that `ReplicaFetcherThread.handlePartitionsWithError` will also introduce the same delay for every
// partition with error effectively doubling the delay. It would be good to improve this.
partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
}
}
}
fetcherStats.requestRate.mark()
if (responseData.nonEmpty) {
// process fetched data
inLock(partitionMapLock) {
responseData.foreach { case (topicPartition, partitionData) =>
Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =>
// It's possible that a partition is removed and re-added or truncated when there is a pending fetch request.
// In this case, we only want to process the fetch response if the partition state is ready for fetch and
// the current offset is the same as the offset requested.
val fetchState = fetchStates(topicPartition)
// only process partitions whose fetch state has not changed since the request was built
if (fetchState.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
partitionData.error match {
case Errors.NONE =>
try {
// Once we hand off the partition data to the subclass, we can't mess with it any more in this thread
// append the data and build the LogAppendInfo
val logAppendInfoOpt = processPartitionData(topicPartition, currentFetchState.fetchOffset,
partitionData)
logAppendInfoOpt.foreach { logAppendInfo =>
val validBytes = logAppendInfo.validBytes
val nextOffset = if (validBytes > 0) logAppendInfo.lastOffset + 1 else currentFetchState.fetchOffset
// compute the lag between the next fetch offset and the leader's high watermark
fetcherLagStats.getAndMaybePut(topicPartition).lag = Math.max(0L, partitionData.highWatermark - nextOffset)
// ReplicaDirAlterThread may have removed topicPartition from the partitionStates after processing the partition data
if (validBytes > 0 && partitionStates.contains(topicPartition)) {
// Update partitionStates only if there is no exception during processPartitionData
// update the fetch state for the next fetch
val newFetchState = PartitionFetchState(nextOffset, fetchState.currentLeaderEpoch,
state = Fetching)
// note: the partition is moved to the end of the map (round-robin)
partitionStates.updateAndMoveToEnd(topicPartition, newFetchState)
fetcherStats.byteRate.mark(validBytes)
}
}
} catch {
case ime: CorruptRecordException =>
// we log the error and continue. This ensures two things
// 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread
// down and cause other topic partition to also lag
// 2. If the message is corrupt due to a transient state in the log (truncation, partial writes
// can cause this), we simply continue and should get fixed in the subsequent fetches
error(s"Found invalid messages during fetch for partition $topicPartition " +
s"offset ${currentFetchState.fetchOffset}", ime)
partitionsWithError += topicPartition
case e: KafkaStorageException =>
error(s"Error while processing data for partition $topicPartition", e)
partitionsWithError += topicPartition
case e: Throwable =>
throw new KafkaException(s"Error processing data for partition $topicPartition " +
s"offset ${currentFetchState.fetchOffset}", e)
}
// a leader change can make the requested offset fall out of range; the fetch offset is re-derived (and the log possibly truncated) via fetchOffsetAndTruncate below
case Errors.OFFSET_OUT_OF_RANGE =>
if (!handleOutOfRangeError(topicPartition, currentFetchState))
partitionsWithError += topicPartition
case Errors.UNKNOWN_LEADER_EPOCH =>
debug(s"Remote broker has a smaller leader epoch for partition $topicPartition than " +
s"this replica's current leader epoch of ${fetchState.currentLeaderEpoch}.")
partitionsWithError += topicPartition
// the leader epoch in our request is older than the one on the leader; remove the partition from partitionStates and wait for a new LeaderAndIsr state
case Errors.FENCED_LEADER_EPOCH =>
onPartitionFenced(topicPartition)
case Errors.NOT_LEADER_FOR_PARTITION =>
debug(s"Remote broker is not the leader for partition $topicPartition, which could indicate " +
"that the partition is being moved")
partitionsWithError += topicPartition
case _ =>
error(s"Error for partition $topicPartition at offset ${currentFetchState.fetchOffset}",
partitionData.error.exception)
partitionsWithError += topicPartition
}
}
}
}
}
}
if (partitionsWithError.nonEmpty) {
debug(s"Handling errors for partitions $partitionsWithError")
handlePartitionsWithErrors(partitionsWithError)
}
}
- Call the abstract method fetchFromLeader to send the fetch request to the leader and collect responseData.
- Only process the response data for partitions whose fetch state did not change between building the request and receiving the response (a partition may back off from fetching for various reasons).
- If the partition's response error code is NONE, then
- call the abstract method processPartitionData to process the data and return a LogAppendInfo,
- update the partition's fetch state based on the LogAppendInfo,
- move the partition's fetch state to the end of the partitionStates map to keep the round-robin order,
- and if an exception is thrown while processing, add the partition to partitionsWithError.
- If the error code is OFFSET_OUT_OF_RANGE, the fetch offset is replaced with the offset returned by fetchOffsetAndTruncate (shown below); only if that handling fails is the partition added to partitionsWithError:
protected def fetchOffsetAndTruncate(topicPartition: TopicPartition, currentLeaderEpoch: Int): Long = {
val replicaEndOffset = logEndOffset(topicPartition)
/**
* Unclean leader election: A follower goes down, in the meanwhile the leader keeps appending messages. The follower comes back up
* and before it has completely caught up with the leader's logs, all replicas in the ISR go down. The follower is now uncleanly
* elected as the new leader, and it starts appending messages from the client. The old leader comes back up, becomes a follower
* and it may discover that the current leader's end offset is behind its own end offset.
*
* In such a case, truncate the current follower's log to the current leader's end offset and continue fetching.
*
* There is a potential for a mismatch between the logs of the two replicas here. We don't fix this mismatch as of now.
*/
// Imagine a follower goes down while the leader keeps accepting messages. Later all replicas in the ISR go down, the failed follower comes back and is (uncleanly) elected as the new leader. When the old leader comes back as a follower and fetches, it finds the new leader's LEO is lower than its own and gets an out-of-range error. The only way to keep fetching is to truncate this follower's log to the leader's LEO.
val leaderEndOffset = fetchLatestOffsetFromLeader(topicPartition, currentLeaderEpoch)
if (leaderEndOffset < replicaEndOffset) {
warn(s"Reset fetch offset for partition $topicPartition from $replicaEndOffset to current " +
s"leader's latest offset $leaderEndOffset")
truncate(topicPartition, new EpochEndOffset(Errors.NONE, UNDEFINED_EPOCH, leaderEndOffset))
leaderEndOffset
} else {
/**
* If the leader's log end offset is greater than the follower's log end offset, there are two possibilities:
* 1. The follower could have been down for a long time and when it starts up, its end offset could be smaller than the leader's
* start offset because the leader has deleted old logs (log.logEndOffset < leaderStartOffset).
* 2. When unclean leader election occurs, it is possible that the old leader's high watermark is greater than
* the new leader's log end offset. So when the old leader truncates its offset to its high watermark and starts
* to fetch from the new leader, an OffsetOutOfRangeException will be thrown. After that some more messages are
* produced to the new leader. While the old leader is trying to handle the OffsetOutOfRangeException and query
* the log end offset of the new leader, the new leader's log end offset becomes higher than the follower's log end offset.
*
* In the first case, the follower's current log end offset is smaller than the leader's log start offset. So the
* follower should truncate all its logs, roll out a new segment and start to fetch from the current leader's log
* start offset.
* In the second case, the follower should just keep the current log segments and retry the fetch. In the second
* case, there will be some inconsistency of data between old and new leader. We are not solving it here.
* If users want to have strong consistency guarantees, appropriate configurations needs to be set for both
* brokers and producers.
*
* Putting the two cases together, the follower should fetch from the higher one of its replica log end offset
* and the current leader's log start offset.
*/
// Here we handle the case where the leader's LEO is higher than this replica's (the old leader, now a follower).
// As noted above, the old leader's LEO is normally higher than the new leader's, which is why out-of-range can occur.
// But if the old leader was down for too long, its LEO may fall below even the new leader's log start offset.
// Alternatively, if the old leader's HW was larger than the new leader's LEO, it truncates to its HW and then gets
// an out-of-range error; while it is handling that error the new leader keeps accepting messages, so the new
// leader's LEO can move past the old leader's LEO.
// Case 1: the old leader (now a follower) truncates its whole log and restarts from the new leader's log start offset.
// Case 2: the old leader keeps its log and simply retries the fetch; the data of the two replicas may then diverge.
// Putting the two cases together: on an out-of-range error, set the fetch offset to the larger of the replica's own LEO and the leader's log start offset.
val leaderStartOffset = fetchEarliestOffsetFromLeader(topicPartition, currentLeaderEpoch)
warn(s"Reset fetch offset for partition $topicPartition from $replicaEndOffset to current " +
s"leader's start offset $leaderStartOffset")
val offsetToFetch = Math.max(leaderStartOffset, replicaEndOffset)
// Only truncate log when current leader's log start offset is greater than follower's log end offset.
if (leaderStartOffset > replicaEndOffset)
truncateFullyAndStartAt(topicPartition, leaderStartOffset)
offsetToFetch
}
}
- If the error code is UNKNOWN_LEADER_EPOCH, add the partition to partitionsWithError.
- If the error code is FENCED_LEADER_EPOCH, remove the partition from partitionStates and wait for a new LeaderAndIsr update.
- If the error code is NOT_LEADER_FOR_PARTITION, add the partition to partitionsWithError.
- Finally, handle the partitions collected in partitionsWithError by calling delayPartitions (a small sketch of the back-off effect follows the code):
def delayPartitions(partitions: Iterable[TopicPartition], delay: Long) {
partitionMapLock.lockInterruptibly()
try {
for (partition <- partitions) {
Option(partitionStates.stateValue(partition)).foreach { currentFetchState =>
if (!currentFetchState.isDelayed) {
// mark the partition as delayed so it backs off from fetching (the caller passes fetchBackOffMs as the delay)
partitionStates.updateAndMoveToEnd(partition, PartitionFetchState(currentFetchState.fetchOffset,
currentFetchState.currentLeaderEpoch, new DelayedItem(delay), currentFetchState.state))
}
}
}
partitionMapCond.signalAll()
} finally partitionMapLock.unlock()
}
ReplicaFetcherThread
Its parent class AbstractFetcherThread already implements most of the fetch logic, i.e. tracking each partition's fetch state in partitionStates, while the details of fetching are left to subclasses. Here we only look at a few of the more important methods implemented by ReplicaFetcherThread.
processPartitionData
After the fetch thread has pulled down the partitionData, it calls processPartitionData, which returns a LogAppendInfo. ReplicaFetcherThread implements it as:
override def processPartitionData(topicPartition: TopicPartition,
fetchOffset: Long,
partitionData: FetchData): Option[LogAppendInfo] = {
val replica = replicaMgr.localReplicaOrException(topicPartition)
val partition = replicaMgr.getPartition(topicPartition).get
// first read the fetched data into memory as MemoryRecords
val records = toMemoryRecords(partitionData.records)
maybeWarnIfOversizedRecords(records, topicPartition)
// the fetch must start exactly at the local log end offset
if (fetchOffset != replica.logEndOffset.messageOffset)
throw new IllegalStateException("Offset mismatch for partition %s: fetched offset = %d, log end offset = %d.".format(
topicPartition, fetchOffset, replica.logEndOffset.messageOffset))
if (isTraceEnabled)
trace("Follower has replica log end offset %d for partition %s. Received %d messages and leader hw %d"
.format(replica.logEndOffset.messageOffset, topicPartition, records.sizeInBytes, partitionData.highWatermark))
// Append the leader's messages to the log
val logAppendInfo = partition.appendRecordsToFollowerOrFutureReplica(records, isFuture = false)
if (isTraceEnabled)
trace("Follower has replica log end offset %d after appending %d bytes of messages for partition %s"
.format(replica.logEndOffset.messageOffset, records.sizeInBytes, topicPartition))
// the follower's HW is the minimum of its local LEO and the leader's HW from the response
val followerHighWatermark = replica.logEndOffset.messageOffset.min(partitionData.highWatermark)
val leaderLogStartOffset = partitionData.logStartOffset
// for the follower replica, we do not need to keep
// its segment base offset the physical position,
// these values will be computed upon making the leader
replica.highWatermark = new LogOffsetMetadata(followerHighWatermark)
replica.maybeIncrementLogStartOffset(leaderLogStartOffset)
if (isTraceEnabled)
trace(s"Follower set replica high watermark for partition $topicPartition to $followerHighWatermark")
// Traffic from both in-sync and out of sync replicas are accounted for in replication quota to ensure total replication
// traffic doesn't exceed quota.
if (quota.isThrottled(topicPartition))
quota.record(records.sizeInBytes)
replicaMgr.brokerTopicStats.updateReplicationBytesIn(records.sizeInBytes)
logAppendInfo
}
start up
As the name suggests, the ReplicaManager manages all the replicas of the partitions on this broker. What exactly does that involve? ReplicaManager schedules three periodic tasks:
- isr-expiration
- isr-change-propagation
- shutdown-idle-replica-alter-log-dirs-thread
scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)
scheduler.schedule("shutdown-idle-replica-alter-log-dirs-thread", shutdownIdleReplicaAlterLogDirsThread _, period = 10000L, unit = TimeUnit.MILLISECONDS)
Let's look at each of these three tasks.
isr-expiration
In Kafka, a partition's followers must not fall too far behind the leader: a follower that has not caught up for more than config.replicaLagTimeMaxMs is removed from the ISR. Because the check itself only runs every replicaLagTimeMaxMs / 2, in the worst case a lagging replica is detected after roughly 1.5 × replicaLagTimeMaxMs (e.g. with replica.lag.time.max.ms = 10 s, after about 15 s). The isr-expiration task performs this check:
private def maybeShrinkIsr(): Unit = {
trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
nonOfflinePartitionsIterator.foreach(_.maybeShrinkIsr(config.replicaLagTimeMaxMs))
}
def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
leaderReplicaIfLocal match {
case Some(leaderReplica) =>
val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
if(outOfSyncReplicas.nonEmpty) {
val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
assert(newInSyncReplicas.nonEmpty)
info("Shrinking ISR from %s to %s".format(inSyncReplicas.map(_.brokerId).mkString(","),
newInSyncReplicas.map(_.brokerId).mkString(",")))
// update ISR in zk and in cache
updateIsr(newInSyncReplicas)
// we may need to increment high watermark since ISR could be down to 1
replicaManager.isrShrinkRate.mark()
maybeIncrementLeaderHW(leaderReplica)
} else {
false
}
case None => false // do nothing if no longer leader
}
}
// some delayed operations may be unblocked after HW changed
if (leaderHWIncremented)
tryCompleteDelayedRequests()
}
The isr-expiration task iterates over all non-offline topic partitions and invokes each partition's own maybeShrinkIsr method. A Partition internally maintains a map from broker id to Replica. When maybeShrinkIsr runs, it first checks whether this broker is the partition's leader; if not, it does nothing, otherwise it:
- gets the out-of-sync replicas, i.e. the replicas that have failed to keep up;
- if outOfSyncReplicas is non-empty, removes them from the ISR;
- updates the ISR information in ZooKeeper;
- checks whether the high watermark can be advanced.
The logic of getOutOfSyncReplicas is:
def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
/**
* If the follower already has the same leo as the leader, it will not be considered as out-of-sync,
* otherwise there are two cases that will be handled here -
* 1. Stuck followers: If the leo of the replica hasn't been updated for maxLagMs ms,
* the follower is stuck and should be removed from the ISR
* 2. Slow followers: If the replica has not read up to the leo within the last maxLagMs ms,
* then the follower is lagging and should be removed from the ISR
* Both these cases are handled by checking the lastCaughtUpTimeMs which represents
* the last time when the replica was fully caught up. If either of the above conditions
* is violated, that replica is considered to be out of sync
*
**/
val candidateReplicas = inSyncReplicas - leaderReplica
val laggingReplicas = candidateReplicas.filter(r =>
// the LEO differs from the leader's and the replica has not caught up for more than maxLagMs
r.logEndOffset.messageOffset != leaderReplica.logEndOffset.messageOffset && (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
if (laggingReplicas.nonEmpty)
debug("Lagging replicas are %s".format(laggingReplicas.map(_.brokerId).mkString(",")))
laggingReplicas
}
lastCaughtUpTimeMs is the last time this replica's fetch reached the leader's log end offset at the time of the fetch, i.e. the last time it was fully caught up.
After the out-of-sync replicas are determined, the ISR information in ZooKeeper is updated:
private def updateIsr(newIsr: Set[Replica]) {
val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.map(_.brokerId).toList, zkVersion)
// note: this uses the controller epoch recorded when the leader was last updated (the controller sends its epoch whenever it elects a new leader for the partition)
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topicPartition, newLeaderAndIsr,
controllerEpoch)
if (updateSucceeded) {
// after the update succeeds in ZooKeeper, the cached ISR information must be updated as well
replicaManager.recordIsrChange(topicPartition)
inSyncReplicas = newIsr
zkVersion = newVersion
trace("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
} else {
replicaManager.failedIsrUpdatesRate.mark()
info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
}
}
Finally, because the ISR has changed, we also need to check whether the high watermark should be advanced (the HW is, roughly speaking, the smallest LEO among the in-sync replicas).
private def maybeIncrementLeaderHW(leaderReplica: Replica, curTime: Long = time.milliseconds): Boolean = {
val allLogEndOffsets = assignedReplicas.filter { replica =>
curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
}.map(_.logEndOffset)
val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering)
val oldHighWatermark = leaderReplica.highWatermark
// Ensure that the high watermark increases monotonically. We also update the high watermark when the new
// offset metadata is on a newer segment, which occurs whenever the log is rolled to a new segment.
if (oldHighWatermark.messageOffset < newHighWatermark.messageOffset ||
(oldHighWatermark.messageOffset == newHighWatermark.messageOffset && oldHighWatermark.onOlderSegment(newHighWatermark))) {
leaderReplica.highWatermark = newHighWatermark
debug(s"High watermark updated to $newHighWatermark")
true
} else {
def logEndOffsetString(r: Replica) = s"replica ${r.brokerId}: ${r.logEndOffset}"
debug(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old hw $oldHighWatermark. " +
s"All current LEOs are ${assignedReplicas.map(logEndOffsetString)}")
false
}
}
The HW generally changes in two situations:
- the ISR changes, or
- a replica's LEO changes.
Why does the second point say "a replica's LEO" rather than "the LEO of an ISR member"? When deciding whether to advance the HW, Kafka considers not only the ISR but also replicas that are still catching up. Imagine the ISR contains only the leader and all the followers are chasing it. If the HW were advanced without waiting for those followers, the HW would simply be the leader's LEO and every follower's LEO would stay behind it forever, so no follower could ever rejoin the ISR, because rejoining requires a follower's LEO to reach the leader's HW. The sketch below condenses the HW rule.
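A condensed sketch of this rule (names are illustrative; the real code operates on LogOffsetMetadata and additionally ensures the HW never moves backwards):
object HighWatermarkSketch {
  final case class ReplicaView(brokerId: Int, logEndOffset: Long, lastCaughtUpTimeMs: Long)

  // minimum LEO over the ISR plus any assigned replica that caught up within replicaLagTimeMaxMs
  def newHighWatermark(assigned: Seq[ReplicaView],
                       isr: Set[Int],
                       replicaLagTimeMaxMs: Long,
                       nowMs: Long): Long =
    assigned
      .filter(r => isr.contains(r.brokerId) || nowMs - r.lastCaughtUpTimeMs <= replicaLagTimeMaxMs)
      .map(_.logEndOffset)
      .min

  def main(args: Array[String]): Unit = {
    val now = 100000L
    val replicas = Seq(
      ReplicaView(1, logEndOffset = 250L, lastCaughtUpTimeMs = now),          // leader
      ReplicaView(2, logEndOffset = 200L, lastCaughtUpTimeMs = now - 3000L),  // catching up, not yet in the ISR
      ReplicaView(3, logEndOffset = 120L, lastCaughtUpTimeMs = now - 60000L)) // stuck, ignored
    // even though only broker 1 is in the ISR, the HW still waits for broker 2
    println(newHighWatermark(replicas, isr = Set(1), replicaLagTimeMaxMs = 10000L, nowMs = now)) // 200
  }
}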
isr-change-propagation
This task writes ISR-change events to the ZooKeeper path /isr_change_notification/isr_change_ (a sequential node), and the controller eventually processes these events.
def maybePropagateIsrChanges() {
val now = System.currentTimeMillis()
isrChangeSet synchronized {
if (isrChangeSet.nonEmpty &&
(lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
zkClient.propagateIsrChanges(isrChangeSet)
isrChangeSet.clear()
lastIsrPropagationMs.set(now)
}
}
}
To avoid flooding ZooKeeper and the controller with ISR-change events, the accumulated changes are only propagated when one of two conditions holds (condensed in the sketch after this list):
- the ISR has not changed in the last 5 seconds, or
- no ISR change has been propagated in the last 60 seconds.
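These two conditions correspond to the black-out and interval constants referenced in the code above (5000 ms and 60000 ms here). As a pure function the decision looks roughly like this (a paraphrase for illustration, not the actual ReplicaManager code):
object IsrPropagationSketch {
  val IsrChangePropagationBlackOutMs = 5000L
  val IsrChangePropagationIntervalMs = 60000L

  // send the accumulated ISR changes only if the ISR has been quiet for the black-out
  // window, or if nothing has been propagated for longer than the max interval
  def shouldPropagate(hasPendingChanges: Boolean,
                      lastIsrChangeMs: Long,
                      lastPropagationMs: Long,
                      nowMs: Long): Boolean =
    hasPendingChanges &&
      (lastIsrChangeMs + IsrChangePropagationBlackOutMs < nowMs ||
       lastPropagationMs + IsrChangePropagationIntervalMs < nowMs)

  def main(args: Array[String]): Unit = {
    val now = 1000000L
    println(shouldPropagate(true, lastIsrChangeMs = now - 1000L, lastPropagationMs = now - 30000L, nowMs = now)) // false: ISR still churning
    println(shouldPropagate(true, lastIsrChangeMs = now - 1000L, lastPropagationMs = now - 61000L, nowMs = now)) // true: interval exceeded
  }
}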