kafka server - LogCleaner


not completed ~

In some scenarios we set a key on each Kafka record, but for a given key only the message at the latest offset is useful; older messages with the same key are redundant, and if they are not removed they waste time during later recovery. For such topics Kafka provides a cleanup (compaction) policy that removes these per-key redundant messages.
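
As a rough mental model (a sketch of the semantics, not Kafka's implementation), compaction keeps, for each key, only the record at the highest offset:

// A record with an offset, a key and an optional value (None models a tombstone).
case class Record(offset: Long, key: String, value: Option[String])

def compact(records: Seq[Record]): Seq[Record] = {
  // the latest offset seen for each key, which is what the cleaner's offset map tracks
  val latestOffsetByKey = records.groupBy(_.key).map { case (k, rs) => k -> rs.map(_.offset).max }
  records.filter(r => r.offset == latestOffsetByKey(r.key))
}

// compact(Seq(Record(0, "k1", Some("a")), Record(1, "k2", Some("b")), Record(2, "k1", Some("c"))))
// keeps offsets 1 and 2; the record (k1, offset 0) is redundant and would be cleaned.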

When the Kafka server starts its LogManager, it launches several background tasks, one of which is cleaning. This clean job is carried out by LogCleaner, whose class comment describes its task as removing obsolete records from the logs. A record (key, offset) is obsolete if there exists a message with the same key at a higher offset offset' > offset. Based on this classification of messages, the segments of a log can be divided into the following parts:

  • clean section: the portion that has already been cleaned
  • dirty section: the portion that has not yet been cleaned
    • cleanable section: the part of the dirty section that is eligible for cleaning
    • uncleanable section: includes the active segment and cannot be cleaned

The sections below analyze LogCleaner's workflow from several angles.

When are logs cleaned?

When LogCleaner starts, it creates a number of cleaner threads according to its configuration, and these threads do the actual log cleaning. CleanerThread extends kafka.utils.ShutdownableThread, so it keeps executing its doWork() method until the thread is shut down. CleanerThread's doWork() method is:

override def doWork() {
  val cleaned = cleanFilthiestLog()
  if (!cleaned)
    pause(config.backOffMs, TimeUnit.MILLISECONDS)
}
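
For intuition only, here is a self-contained sketch of the same loop shape (not Kafka's ShutdownableThread; the 15-second backoff mirrors what I believe is the log.cleaner.backoff.ms default):

import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Keep trying to clean; sleep only when there was nothing to clean.
class CleanerLoopSketch(cleanOnce: () => Boolean, backOffMs: Long = 15000L) extends Thread {
  private val keepRunning = new AtomicBoolean(true)

  override def run(): Unit = {
    while (keepRunning.get()) {
      val cleaned = cleanOnce()                              // analogous to cleanFilthiestLog()
      if (!cleaned) TimeUnit.MILLISECONDS.sleep(backOffMs)   // analogous to pause(backOffMs, ...)
    }
  }

  def shutdown(): Unit = keepRunning.set(false)
}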

In doWork(), cleanFilthiestLog() is where the main cleaning logic lives. Before answering the three questions below, let's first walk through the main flow of cleanFilthiestLog():

/**
      * Cleans a log if there is a dirty log available
      * @return whether a log was cleaned
      */
    private def cleanFilthiestLog(): Boolean = {
      var currentLog: Option[Log] = None

      try {
        // grab the filthiest log to clean
        val cleaned = cleanerManager.grabFilthiestCompactedLog(time) match {
          case None =>
            false
          case Some(cleanable) =>
            // there's a log, clean it
            currentLog = Some(cleanable.log)
            // perform the clean
            cleanLog(cleanable)
            true
        }
        // fetch the logs eligible for deletion
        val deletable: Iterable[(TopicPartition, Log)] = cleanerManager.deletableLogs()
        try {
          deletable.foreach {
            case (topicPartition, log) =>
              try {
                currentLog = Some(log)
                // delete old segments
                log.deleteOldSegments()
              }
          }
        } finally  {
          cleanerManager.doneDeleting(deletable.map(_._1))
        }

        cleaned
      } catch {
        case e @ (_: ThreadShutdownException | _: ControlThrowable) => throw e
        case e: Exception =>
          if (currentLog.isEmpty) {
            throw new IllegalStateException("currentLog cannot be empty on an unexpected exception", e)
          }
          val erroneousLog = currentLog.get
          warn(s"Unexpected exception thrown when cleaning log $erroneousLog. Marking its partition (${erroneousLog.topicPartition}) as uncleanable", e)
          cleanerManager.markPartitionUncleanable(erroneousLog.dir.getParent, erroneousLog.topicPartition)

          false
      }
    }

As the code above shows, cleanFilthiestLog does two things: it cleans (compacts) one log and it deletes old segments of the deletable logs.

Which logs need to be cleaned?

The two methods to examine are cleanerManager.grabFilthiestCompactedLog(time) and cleanerManager.deletableLogs().

grabFilthiestCompactedLog(time)

/**
    * Choose the log to clean next and add it to the in-progress set. We recompute this
    * each time from the full set of logs to allow logs to be dynamically added to the pool of logs
    * the log manager maintains.
    */
  def grabFilthiestCompactedLog(time: Time): Option[LogToClean] = {
    inLock(lock) {
      val now = time.milliseconds
      this.timeOfLastRun = now
      val lastClean = allCleanerCheckpoints
      val dirtyLogs = logs.filter {
        case (_, log) => log.config.compact  // match logs that are marked as compacted
      }.filterNot {
        case (topicPartition, log) =>
          // skip any logs already in-progress and uncleanable partitions
          inProgress.contains(topicPartition) || isUncleanablePartition(log, topicPartition)
      }.map {
        case (topicPartition, log) => // create a LogToClean instance for each
          val (firstDirtyOffset, firstUncleanableDirtyOffset) = LogCleanerManager.cleanableOffsets(log, topicPartition,
            lastClean, now)
          LogToClean(topicPartition, log, firstDirtyOffset, firstUncleanableDirtyOffset)
      }.filter(ltc => ltc.totalBytes > 0) // skip any empty logs

      this.dirtiestLogCleanableRatio = if (dirtyLogs.nonEmpty) dirtyLogs.max.cleanableRatio else 0
      // and must meet the minimum threshold for dirty byte ratio
      val cleanableLogs = dirtyLogs.filter(ltc => ltc.cleanableRatio > ltc.log.config.minCleanableRatio)
      if(cleanableLogs.isEmpty) {
        None
      } else {
        val filthiest = cleanableLogs.max
        inProgress.put(filthiest.topicPartition, LogCleaningInProgress)
        Some(filthiest)
      }
    }
  }

Finding the log to clean is essentially the process of building cleanableLogs:

  • First, filter the logs list down to the logs whose config enables compact.
  • Skip any log (i.e. TopicPartition) whose cleaning is already in progress or whose partition has been marked uncleanable.

For each log that survives these filters, compute the range of dirty offsets:
/**
    * Returns the range of dirty offsets that can be cleaned.
    *
    * @param log the log
    * @param lastClean the map of checkpointed offsets
    * @param now the current time in milliseconds of the cleaning operation
    * @return the lower (inclusive) and upper (exclusive) offsets
    */
  def cleanableOffsets(log: Log, topicPartition: TopicPartition, lastClean: immutable.Map[TopicPartition, Long], now: Long): (Long, Long) = {

    // the checkpointed offset, ie., the first offset of the next dirty segment
    val lastCleanOffset: Option[Long] = lastClean.get(topicPartition)

    // If the log segments are abnormally truncated and hence the checkpointed offset is no longer valid;
    // reset to the log starting offset and log the error
    val logStartOffset = log.logSegments.head.baseOffset
    val firstDirtyOffset = {
      val offset = lastCleanOffset.getOrElse(logStartOffset)
      if (offset < logStartOffset) {
        // don't bother with the warning if compact and delete are enabled.
        if (!isCompactAndDelete(log))
          warn(s"Resetting first dirty offset of ${log.name} to log start offset $logStartOffset since the checkpointed offset $offset is invalid.")
        logStartOffset
      } else {
        offset
      }
    }

    val compactionLagMs = math.max(log.config.compactionLagMs, 0L)

    // find first segment that cannot be cleaned
    // neither the active segment, nor segments with any messages closer to the head of the log than the minimum compaction lag time
    // may be cleaned
    val firstUncleanableDirtyOffset: Long = Seq(

      // we do not clean beyond the first unstable offset
      log.firstUnstableOffset.map(_.messageOffset),

      // the active segment is always uncleanable
      Option(log.activeSegment.baseOffset),

      // the first segment whose largest message timestamp is within a minimum time lag from now
      if (compactionLagMs > 0) {
        // dirty log segments
        val dirtyNonActiveSegments = log.logSegments(firstDirtyOffset, log.activeSegment.baseOffset)
        dirtyNonActiveSegments.find { s =>
          val isUncleanable = s.largestTimestamp > now - compactionLagMs
          debug(s"Checking if log segment may be cleaned: log='${log.name}' segment.baseOffset=${s.baseOffset} segment.largestTimestamp=${s.largestTimestamp}; now - compactionLag=${now - compactionLagMs}; is uncleanable=$isUncleanable")
          isUncleanable
        }.map(_.baseOffset)
      } else None
    ).flatten.min

    debug(s"Finding range of cleanable offsets for log=${log.name} topicPartition=$topicPartition. Last clean offset=$lastCleanOffset now=$now => firstDirtyOffset=$firstDirtyOffset firstUncleanableOffset=$firstUncleanableDirtyOffset activeSegment.baseOffset=${log.activeSegment.baseOffset}")

    (firstDirtyOffset, firstUncleanableDirtyOffset)
  }

firstDirtyOffset is the checkpointed offset from the previous cleaning round; if it is smaller than the base offset of every segment (i.e. the checkpoint is no longer valid), it is reset to the log start offset. firstUncleanableDirtyOffset is the minimum of three candidates (a None candidate simply drops out of the Seq(...).flatten.min):

  • the log's first unstable offset
  • the base offset of the active segment
  • the base offset of the first segment whose largestTimestamp is greater than now - compactionLagMs, i.e. whose newest message is still within the compaction lag

A LogToClean object is then constructed from the log and these two dirty offsets.

private case class LogToClean(topicPartition: TopicPartition, log: Log, firstDirtyOffset: Long, uncleanableOffset: Long) extends Ordered[LogToClean]

When the LogToClean object is constructed, its cleanableRatio is computed:

val cleanBytes = log.logSegments(-1, firstDirtyOffset).map(_.size.toLong).sum
val (firstUncleanableOffset, cleanableBytes) = LogCleaner.calculateCleanableBytes(log, firstDirtyOffset, uncleanableOffset)
val totalBytes = cleanBytes + cleanableBytes
val cleanableRatio = cleanableBytes / totalBytes.toDouble

cleanableRatio is the fraction of cleanable bytes out of the total bytes considered (clean bytes plus cleanable bytes).

Finally, the LogToClean instances whose ratio exceeds their log's minCleanableRatio are compared, and the one with the largest cleanableRatio is returned.
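
As a quick worked example with made-up byte counts (assuming the usual default of 0.5 for min.cleanable.dirty.ratio):

// Hypothetical sizes: 600 MB already cleaned, 400 MB of cleanable dirty bytes.
val cleanBytes      = 600L * 1024 * 1024
val cleanableBytes  = 400L * 1024 * 1024
val cleanableRatio  = cleanableBytes / (cleanBytes + cleanableBytes).toDouble  // 0.4
val minCleanableRatio = 0.5                        // assumed min.cleanable.dirty.ratio default
val eligible = cleanableRatio > minCleanableRatio  // false: this log is skipped for now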

At this point, the log to be cleaned, together with its firstDirtyOffset and uncleanableOffset, has been identified.

deletableLogs()

How is a log cleaned?

The main call chain for cleaning a log is cleanLog(cleanable: LogToClean) -> clean(cleanable: LogToClean) -> doClean(cleanable: LogToClean, deleteHorizonMs: Long). These methods are analyzed one by one below.

cleanLog(cleanable: LogToClean)

private def cleanLog(cleanable: LogToClean): Unit = {
      // initialize endOffset before the cleaning starts
      var endOffset = cleanable.firstDirtyOffset
      try {
        // clean the log; returns the first uncleaned offset and the stats for this round
        val (nextDirtyOffset, cleanerStats) = cleaner.clean(cleanable)
        recordStats(cleaner.id, cleanable.log.name, cleanable.firstDirtyOffset, endOffset, cleanerStats)
        endOffset = nextDirtyOffset
      } catch {
        case _: LogCleaningAbortedException => // task can be aborted, let it go.
        case _: KafkaStorageException => // partition is already offline. let it go.
        case e: IOException =>
          val logDirectory = cleanable.log.dir.getParent
          val msg = s"Failed to clean up log for ${cleanable.topicPartition} in dir ${logDirectory} due to IOException"
          logDirFailureChannel.maybeAddOfflineLogDir(logDirectory, msg, e)
      } finally {
        cleanerManager.doneCleaning(cleanable.topicPartition, cleanable.log.dir.getParentFile, endOffset)
      }
    }

clean(cleanable: LogToClean)

/**
   * Clean the given log
   *
   * @param cleanable The log to be cleaned
   *
   * @return The first offset not cleaned and the statistics for this round of cleaning
   */
  private[log] def clean(cleanable: LogToClean): (Long, CleanerStats) = {
    // figure out the timestamp below which it is safe to remove delete tombstones
    // this position is defined to be a configurable time beneath the last modified time of the last clean segment
    val deleteHorizonMs =
      cleanable.log.logSegments(0, cleanable.firstDirtyOffset).lastOption match {
        case None => 0L
        case Some(seg) => seg.lastModified - cleanable.log.config.deleteRetentionMs
    }

    doClean(cleanable, deleteHorizonMs)
  }

This method mainly computes a timestamp called deleteHorizonMs. Kafka has a notion of tombstones (records that carry a key but no value, marking a deletion), and it uses a time-based rule to decide when tombstones may be safely discarded: as long as the lastModified of the segment being cleaned is greater than deleteHorizonMs, its tombstones are still retained; only once lastModified drops to or below deleteHorizonMs can they be removed. The retention condition expands to:

lastModified > deleteHorizonMs
=> lastModified > seg.lastModified - cleanable.log.config.deleteRetentionMs
=> seg.lastModified - lastModified < cleanable.log.config.deleteRetentionMs
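
As a made-up numeric example (assuming the common 24-hour default for delete.retention.ms):

// Hypothetical timestamps; delete.retention.ms is assumed to be its 24 h default.
val deleteRetentionMs    = 24L * 60 * 60 * 1000
val lastCleanSegModified = 1700000000000L                  // lastModified of the last clean segment
val deleteHorizonMs      = lastCleanSegModified - deleteRetentionMs
// A dirty segment modified only one hour earlier is still inside the retention window,
// so the tombstones it contains are retained during this round of cleaning:
val dirtySegModified = lastCleanSegModified - 60L * 60 * 1000
val retainDeletes    = dirtySegModified > deleteHorizonMs  // true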

We will come back to tombstones later.

With deleteHorizonMs computed, doClean is called.

doClean(cleanable: LogToClean, deleteHorizonMs: Long)

private[log] def doClean(cleanable: LogToClean, deleteHorizonMs: Long): (Long, CleanerStats) = {
    info("Beginning cleaning of log %s.".format(cleanable.log.name))

    val log = cleanable.log
    val stats = new CleanerStats()

    // build the offset map
    info("Building offset map for %s...".format(cleanable.log.name))
    val upperBoundOffset = cleanable.firstUncleanableOffset
    buildOffsetMap(log, cleanable.firstDirtyOffset, upperBoundOffset, offsetMap, stats)
    val endOffset = offsetMap.latestOffset + 1
    stats.indexDone()

    // determine the timestamp up to which the log will be cleaned
    // this is the lower of the last active segment and the compaction lag
    val cleanableHorizonMs = log.logSegments(0, cleanable.firstUncleanableOffset).lastOption.map(_.lastModified).getOrElse(0L)

    // group the segments and clean the groups
    info("Cleaning log %s (cleaning prior to %s, discarding tombstones prior to %s)...".format(log.name, new Date(cleanableHorizonMs), new Date(deleteHorizonMs)))
    for (group <- groupSegmentsBySize(log.logSegments(0, endOffset), log.config.segmentSize, log.config.maxIndexSize, cleanable.firstUncleanableOffset))
      cleanSegments(log, group, offsetMap, deleteHorizonMs, stats)

    // record buffer utilization
    stats.bufferUtilization = offsetMap.utilization

    stats.allDone()

    (endOffset, stats)
  }

doClean mainly does two things. The first is to build an offsetMap, which records, for every key in the cleanable dirty portion of the log, the latest offset at which that key appears; this map is what later determines which records in the segments are obsolete and should be cleaned away.
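
As a tiny stand-in for the offsetMap idea (the real OffsetMap hashes keys into a fixed-size buffer; this sketch just uses a mutable Map):

import scala.collection.mutable

// (key, offset) pairs as they would be scanned from the dirty section (made-up data)
val dirtyRecords = Seq(("k1", 10L), ("k2", 11L), ("k1", 12L))

val offsetMap = mutable.Map.empty[String, Long]
dirtyRecords.foreach { case (key, offset) => offsetMap(key) = offset }  // last write wins

// offsetMap is now Map(k1 -> 12, k2 -> 11); the record (k1, offset 10) will later be
// discarded as redundant when the segments are rewritten.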

buildOffsetMap
/**
  * Build a map of key_hash => offset for the keys in the cleanable dirty portion of the log to use in cleaning.
  * @param log The log to use
  * @param start The offset at which dirty messages begin
  * @param end The ending offset for the map that is being built
  * @param map The map in which to store the mappings
  * @param stats Collector for cleaning statistics
  */
 private[log] def buildOffsetMap(log: Log,
                                 start: Long,
                                 end: Long,
                                 map: OffsetMap,
                                 stats: CleanerStats) {
   map.clear()
   // the collection of all dirty segments
   val dirty = log.logSegments(start, end).toBuffer
   info("Building offset map for log %s for %d segments in offset range [%d, %d).".format(log.name, dirty.size, start, end))

   val abortedTransactions = log.collectAbortedTransactions(start, end)
   val transactionMetadata = CleanedTransactionMetadata(abortedTransactions)

   // Add all the cleanable dirty segments. We must take at least map.slots * load_factor,
   // but we may be able to fit more (if there is lots of duplication in the dirty section of the log)
   var full = false
   // iterate over the segments until the offset map is full
   for (segment <- dirty if !full) {
     checkDone(log.topicPartition)

     full = buildOffsetMapForSegment(log.topicPartition, segment, map, start, log.config.maxMessageSize,
       transactionMetadata, stats)
     if (full)
       debug("Offset map is full, %d segments fully mapped, segment with base offset %d is partially mapped".format(dirty.indexOf(segment), segment.baseOffset))
   }
   info("Offset map for log %s complete.".format(log.name))
 }

The key method here is buildOffsetMapForSegment:

buildOffsetMapForSegment
/**
  * Add the messages in the given segment to the offset map
  *
  * @param segment The segment to index
  * @param map The map in which to store the key=>offset mapping
  * @param stats Collector for cleaning statistics
  *
  * @return If the map was filled whilst loading from this segment
  */
 private def buildOffsetMapForSegment(topicPartition: TopicPartition,
                                      segment: LogSegment,
                                      map: OffsetMap,
                                      startOffset: Long,
                                      maxLogMessageSize: Int,
                                      transactionMetadata: CleanedTransactionMetadata,
                                      stats: CleanerStats): Boolean = {
   // locate the physical position of startOffset within this segment
   var position = segment.offsetIndex.lookup(startOffset).position
   // bound the number of entries the map may hold
   val maxDesiredMapSize = (map.slots * this.dupBufferLoadFactor).toInt
   while (position < segment.log.sizeInBytes) {
     checkDone(topicPartition)
     readBuffer.clear()
     try {
       // fill readBuffer starting at position
       segment.log.readInto(readBuffer, position)
     } catch {
       case e: Exception =>
         throw new KafkaException(s"Failed to read from segment $segment of partition $topicPartition " +
           "while loading offset map", e)
     }
     // wrap the bytes that were read as record batches
     val records = MemoryRecords.readableRecords(readBuffer)
     // throttle reads to keep disk I/O from getting too high
     throttler.maybeThrottle(records.sizeInBytes)

     val startPosition = position
     for (batch <- records.batches.asScala) {
       if (batch.isControlBatch) {
         transactionMetadata.onControlBatchRead(batch)
         stats.indexMessagesRead(1)
       } else {
         val isAborted = transactionMetadata.onBatchRead(batch)
         if (isAborted) {
           // If the batch is aborted, do not bother populating the offset map.
           // Note that abort markers are supported in v2 and above, which means count is defined.
           stats.indexMessagesRead(batch.countOrNull)
         } else {
           for (record <- batch.asScala) {
             if (record.hasKey && record.offset >= startOffset) {
               if (map.size < maxDesiredMapSize)
               // for the same key, the last offset seen wins
                 map.put(record.key, record.offset)
               else
                 return true
             }
             stats.indexMessagesRead(1)
           }
         }
       }

       if (batch.lastOffset >= startOffset)
         // update the latest offset covered by the map
         map.updateLatestOffset(batch.lastOffset)
     }
     val bytesRead = records.validBytes
     position += bytesRead
     stats.indexBytesRead(bytesRead)

     // if we didn't read even one complete message, our read buffer may be too small
     if(position == startPosition)
       growBuffersOrFail(segment.log, position, maxLogMessageSize, records)
   }
   restoreBuffers()
   false
 }

After the process above, every key encountered and the last offset at which it appears are in the offsetMap. The second thing doClean does is strip the obsolete keys out of the segments. To prevent a segment from becoming too small after cleaning, Kafka first groups the segments and then cleans each group as a unit.

groupSegmentsBySize

This method is fairly straightforward: it walks through the segments, adding each to the current group, and starts a new group whenever the group's accumulated size would exceed the configured limits.

/**
   * Group the segments in a log into groups totaling less than a given size. the size is enforced separately for the log data and the index data.
   * We collect a group of such segments together into a single
   * destination segment. This prevents segment sizes from shrinking too much.
   *
   * @param segments The log segments to group
   * @param maxSize the maximum size in bytes for the total of all log data in a group
   * @param maxIndexSize the maximum size in bytes for the total of all index data in a group
   *
   * @return A list of grouped segments
   */
  private[log] def groupSegmentsBySize(segments: Iterable[LogSegment], maxSize: Int, maxIndexSize: Int, firstUncleanableOffset: Long): List[Seq[LogSegment]] = {
    var grouped = List[List[LogSegment]]()
    var segs = segments.toList
    while(segs.nonEmpty) {
      var group = List(segs.head)
      var logSize = segs.head.size.toLong
      var indexSize = segs.head.offsetIndex.sizeInBytes.toLong
      var timeIndexSize = segs.head.timeIndex.sizeInBytes.toLong
      segs = segs.tail
      while(segs.nonEmpty &&
            logSize + segs.head.size <= maxSize &&
            indexSize + segs.head.offsetIndex.sizeInBytes <= maxIndexSize &&
            timeIndexSize + segs.head.timeIndex.sizeInBytes <= maxIndexSize &&
            lastOffsetForFirstSegment(segs, firstUncleanableOffset) - group.last.baseOffset <= Int.MaxValue) {
        group = segs.head :: group
        logSize += segs.head.size
        indexSize += segs.head.offsetIndex.sizeInBytes
        timeIndexSize += segs.head.timeIndex.sizeInBytes
        segs = segs.tail
      }
      grouped ::= group.reverse
    }
    grouped.reverse
  }
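
To make the grouping rule concrete, here is a simplified, hypothetical sketch that only looks at log size (the real method also bounds the index sizes and the offset span of a group):

// A simplified sketch of the grouping rule, keyed on log size only.
def groupBySize(sizes: List[Long], maxSize: Long): List[List[Long]] = {
  if (sizes.isEmpty) Nil
  else {
    var group = List(sizes.head)
    var total = sizes.head
    var rest = sizes.tail
    while (rest.nonEmpty && total + rest.head <= maxSize) {
      group = rest.head :: group
      total += rest.head
      rest = rest.tail
    }
    group.reverse :: groupBySize(rest, maxSize)
  }
}

// With made-up segment sizes of 400, 400, 300 and 200 (MB) and maxSize = 1024 (MB), this
// yields List(List(400, 400), List(300, 200)): the first group stops because adding the
// 300 MB segment would push it past 1024 MB.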

The core method is cleanSegments, which cleans all the segments of a group into a single replacement segment.

cleanSegments
/**
   * Clean a group of segments into a single replacement segment
   *
   * @param log The log being cleaned
   * @param segments The group of segments being cleaned
   * @param map The offset map to use for cleaning segments
   * @param deleteHorizonMs The time to retain delete tombstones
   * @param stats Collector for cleaning statistics
   */
  private[log] def cleanSegments(log: Log,
                                 segments: Seq[LogSegment],
                                 map: OffsetMap,
                                 deleteHorizonMs: Long,
                                 stats: CleanerStats) {
    // create a new segment with a suffix appended to the name of the log and indexes
    val cleaned = LogCleaner.createNewCleanedSegment(log, segments.head.baseOffset)

    try {
      // clean segments into the new destination segment
      val iter = segments.iterator
      var currentSegmentOpt: Option[LogSegment] = Some(iter.next())
      while (currentSegmentOpt.isDefined) {
        val currentSegment = currentSegmentOpt.get
        val nextSegmentOpt = if (iter.hasNext) Some(iter.next()) else None

        val startOffset = currentSegment.baseOffset
        val upperBoundOffset = nextSegmentOpt.map(_.baseOffset).getOrElse(map.latestOffset + 1)
        val abortedTransactions = log.collectAbortedTransactions(startOffset, upperBoundOffset)
        val transactionMetadata = CleanedTransactionMetadata(abortedTransactions, Some(cleaned.txnIndex))

        val retainDeletes = currentSegment.lastModified > deleteHorizonMs
        info(s"Cleaning segment $startOffset in log ${log.name} (largest timestamp ${new Date(currentSegment.largestTimestamp)}) " +
          s"into ${cleaned.baseOffset}, ${if(retainDeletes) "retaining" else "discarding"} deletes.")

        try {
          cleanInto(log.topicPartition, currentSegment.log, cleaned, map, retainDeletes, log.config.maxMessageSize,
            transactionMetadata, log.activeProducersWithLastSequence, stats)
        } catch {
          case e: LogSegmentOffsetOverflowException =>
            // Split the current segment. It's also safest to abort the current cleaning process, so that we retry from
            // scratch once the split is complete.
            info(s"Caught segment overflow error during cleaning: ${e.getMessage}")
            log.splitOverflowedSegment(currentSegment)
            throw new LogCleaningAbortedException()
        }
        currentSegmentOpt = nextSegmentOpt
      }

      cleaned.onBecomeInactiveSegment()
      // flush new segment to disk before swap
      cleaned.flush()

      // update the modification date to retain the last modified date of the original files
      val modified = segments.last.lastModified
      cleaned.lastModified = modified

      // swap in new segment
      info(s"Swapping in cleaned segment $cleaned for segment(s) $segments in log $log")
      log.replaceSegments(List(cleaned), segments)
    } catch {
      case e: LogCleaningAbortedException =>
        try cleaned.deleteIfExists()
        catch {
          case deleteException: Exception =>
            e.addSuppressed(deleteException)
        } finally throw e
    }
  }

cleaned is the newly created destination segment; each source segment in the group is iterated and its surviving messages are written into cleaned by cleanInto.

cleanInto
/**
 * Clean the given source log segment into the destination segment using the key=>offset mapping
 * provided
 *
 * @param topicPartition The topic and partition of the log segment to clean
 * @param sourceRecords The dirty log segment
 * @param dest The cleaned log segment
 * @param map The key=>offset mapping
 * @param retainDeletes Should delete tombstones be retained while cleaning this segment
 * @param maxLogMessageSize The maximum message size of the corresponding topic
 * @param stats Collector for cleaning statistics
 */
private[log] def cleanInto(topicPartition: TopicPartition,
                           sourceRecords: FileRecords,
                           dest: LogSegment,
                           map: OffsetMap,
                           retainDeletes: Boolean,
                           maxLogMessageSize: Int,
                           transactionMetadata: CleanedTransactionMetadata,
                           activeProducers: Map[Long, Int],
                           stats: CleanerStats) {
  val logCleanerFilter = new RecordFilter {
    var discardBatchRecords: Boolean = _

    override def checkBatchRetention(batch: RecordBatch): BatchRetention = {
      // we piggy-back on the tombstone retention logic to delay deletion of transaction markers.
      // note that we will never delete a marker until all the records from that transaction are removed.
      discardBatchRecords = shouldDiscardBatch(batch, transactionMetadata, retainTxnMarkers = retainDeletes)

      // check if the batch contains the last sequence number for the producer. if so, we cannot
      // remove the batch just yet or the producer may see an out of sequence error.
      if (batch.hasProducerId && activeProducers.get(batch.producerId).contains(batch.lastSequence))
        BatchRetention.RETAIN_EMPTY
      else if (discardBatchRecords)
        BatchRetention.DELETE
      else
        BatchRetention.DELETE_EMPTY
    }

    override def shouldRetainRecord(batch: RecordBatch, record: Record): Boolean = {
      if (discardBatchRecords)
        // The batch is only retained to preserve producer sequence information; the records can be removed
        false
      else
        Cleaner.this.shouldRetainRecord(map, retainDeletes, batch, record, stats)
    }
  }

  var position = 0
  while (position < sourceRecords.sizeInBytes) {
    checkDone(topicPartition)
    // read a chunk of messages and copy any that are to be retained to the write buffer to be written out
    readBuffer.clear()
    writeBuffer.clear()

    sourceRecords.readInto(readBuffer, position)
    val records = MemoryRecords.readableRecords(readBuffer)
    throttler.maybeThrottle(records.sizeInBytes)
    val result = records.filterTo(topicPartition, logCleanerFilter, writeBuffer, maxLogMessageSize, decompressionBufferSupplier)
    stats.readMessages(result.messagesRead, result.bytesRead)
    stats.recopyMessages(result.messagesRetained, result.bytesRetained)

    position += result.bytesRead

    // if any messages are to be retained, write them out
    val outputBuffer = result.outputBuffer
    if (outputBuffer.position() > 0) {
      outputBuffer.flip()
      val retained = MemoryRecords.readableRecords(outputBuffer)
      // it's OK not to hold the Log's lock in this case, because this segment is only accessed by other threads
      // after `Log.replaceSegments` (which acquires the lock) is called
      dest.append(largestOffset = result.maxOffset,
        largestTimestamp = result.maxTimestamp,
        shallowOffsetOfMaxTimestamp = result.shallowOffsetOfMaxTimestamp,
        records = retained)
      throttler.maybeThrottle(outputBuffer.limit())
    }

    // if we read bytes but didn't get even one complete batch, our I/O buffer is too small, grow it and try again
    // `result.bytesRead` contains bytes from `messagesRead` and any discarded batches.
    if (readBuffer.limit() > 0 && result.bytesRead == 0)
      growBuffersOrFail(sourceRecords, position, maxLogMessageSize, records)
  }
  restoreBuffers()
}

The main steps of cleanInto are:

  • Construct a RecordFilter which, for a batch and a record, returns a retention decision:
public enum BatchRetention {
          DELETE, // Delete the batch without inspecting records
          RETAIN_EMPTY, // Retain the batch even if it is empty
          DELETE_EMPTY  // Delete the batch if it is empty
      }
  • Starting from position = 0, read the segment's messages into readBuffer
  • Read the records back out of readBuffer, decide with the RecordFilter built above whether each should be cleaned away, and copy the records to be retained into writeBuffer
  • Call the destination segment's append method to write the retained messages from writeBuffer

The process above is fairly clear because some of Kafka's details and code have been filtered out. First, the construction of the retention filter RecordFilter:

val logCleanerFilter = new RecordFilter {
      var discardBatchRecords: Boolean = _

      override def checkBatchRetention(batch: RecordBatch): BatchRetention = {
        // we piggy-back on the tombstone retention logic to delay deletion of transaction markers.
        // note that we will never delete a marker until all the records from that transaction are removed.
        discardBatchRecords = shouldDiscardBatch(batch, transactionMetadata, retainTxnMarkers = retainDeletes)

        // check if the batch contains the last sequence number for the producer. if so, we cannot
        // remove the batch just yet or the producer may see an out of sequence error.
        if (batch.hasProducerId && activeProducers.get(batch.producerId).contains(batch.lastSequence))
          BatchRetention.RETAIN_EMPTY
        else if (discardBatchRecords)
          BatchRetention.DELETE
        else
          BatchRetention.DELETE_EMPTY
      }

      override def shouldRetainRecord(batch: RecordBatch, record: Record): Boolean = {
        if (discardBatchRecords)
          // The batch is only retained to preserve producer sequence information; the records can be removed
          false
        else
          Cleaner.this.shouldRetainRecord(map, retainDeletes, batch, record, stats)
      }
    }

The batch-level filtering logic involves Kafka's transaction implementation and is not covered for now; let's first look at how a single record is filtered:

private def shouldRetainRecord(map: kafka.log.OffsetMap,
                                 retainDeletes: Boolean,
                                 batch: RecordBatch,
                                 record: Record,
                                 stats: CleanerStats): Boolean = {
    // offsets newer than the map's latest offset pass through without further checks
    val pastLatestOffset = record.offset > map.latestOffset
    if (pastLatestOffset)
      return true

    if (record.hasKey) {
      val key = record.key
      // the last offset at which this key appeared in the dirty section
      val foundOffset = map.get(key)
      /* two cases in which we can get rid of a message:
       *   1) if there exists a message with the same key but higher offset
       *   2) if the message is a delete "tombstone" marker and enough time has passed
       */
      val redundant = foundOffset >= 0 && record.offset < foundOffset
      val obsoleteDelete = !retainDeletes && !record.hasValue
      !redundant && !obsoleteDelete
    } else {
      stats.invalidMessage()
      false
    }
  }

A record is retained only when both redundant and obsoleteDelete are false. redundant means the record's offset is lower than the latest offset recorded for its key, i.e. it has been superseded and can be removed. obsoleteDelete is false if and only if at least one of retainDeletes and record.hasValue is true: when hasValue is true the record is not a tombstone, so the tombstone-retention condition does not apply; otherwise the tombstone is kept only while retainDeletes is true. A record without a key is never retained and is counted as invalid.
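
The record-level rule can be restated as a tiny function (a sketch, not Kafka code), where offsetOfKeyInMap stands for the value returned by map.get(key):

def retain(offsetOfKeyInMap: Long, recordOffset: Long, hasValue: Boolean, retainDeletes: Boolean): Boolean = {
  val redundant      = offsetOfKeyInMap >= 0 && recordOffset < offsetOfKeyInMap  // a newer record with this key exists
  val obsoleteDelete = !retainDeletes && !hasValue                               // an expired tombstone
  !redundant && !obsoleteDelete
}

// retain(100, 50,  hasValue = true,  retainDeletes = true)   // false: superseded by offset 100
// retain(100, 100, hasValue = false, retainDeletes = false)  // false: tombstone past the delete horizon
// retain(100, 100, hasValue = false, retainDeletes = true)   // true: tombstone still retained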

Why clean logs?