kafka-server producer


kafka server - produce

When the broker receives a produce request, KafkaApis dispatches it by API key to handleProduceRequest:

case ApiKeys.PRODUCE => handleProduceRequest(request)

kafka api

  1. Authorize the request, handling three kinds of produce separately:
    • transactional produce requests
    • idempotent produce requests
    • all other produce requests
  2. Define the sendResponseCallback callback
  3. Define the processingStatsCallback callback
  4. Call replicaManager.appendRecords, passing both callbacks in (a simplified sketch of this wiring follows the list)
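
To make the control flow concrete, here is a minimal, self-contained sketch of how the response callback is threaded into appendRecords. All the types below are simplified stand-ins rather than the real Kafka classes:

object ProduceCallbackSketch {
  case class TopicPartition(topic: String, partition: Int)
  case class PartitionResponse(error: Short, baseOffset: Long)

  // stand-in for replicaManager.appendRecords: appends locally, then invokes the
  // callback once a result exists for every partition
  def appendRecords(entries: Map[TopicPartition, Array[Byte]], requiredAcks: Short)
                   (responseCallback: Map[TopicPartition, PartitionResponse] => Unit): Unit = {
    val results = entries.map { case (tp, _) => tp -> PartitionResponse(0.toShort, 0L) }
    responseCallback(results)   // in the real broker this may happen later, via a DelayedProduce
  }

  def handleProduceRequest(entries: Map[TopicPartition, Array[Byte]], requiredAcks: Short): Unit = {
    // the response callback is defined before the append and only fired when
    // every partition's (possibly delayed) result is known
    def sendResponseCallback(responses: Map[TopicPartition, PartitionResponse]): Unit =
      responses.foreach { case (tp, r) => println(s"$tp -> $r") }

    appendRecords(entries, requiredAcks)(sendResponseCallback)
  }
}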

replica manager

  1. Check that the request's acks value is valid (acks == 0 || acks == -1 || acks == 1); see the sketch after this list
  2. Append the records to the local log of each partition (via its own appendToLocalLog method)
  3. Run the recordConversionStatsCallback callback
  4. Decide whether a delayed produce is needed; if not, the request can be answered immediately, otherwise
  5. Create the delayed produce operation and register it to be watched
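
Step 1 amounts to the following check, shown here as a minimal sketch; in the source it is a small private helper on ReplicaManager, and any other acks value is answered with an INVALID_REQUIRED_ACKS error without touching the log:

def isValidRequiredAcks(requiredAcks: Short): Boolean =
  requiredAcks == -1 || requiredAcks == 1 || requiredAcks == 0

assert(isValidRequiredAcks(0) && isValidRequiredAcks(1) && isValidRequiredAcks(-1))
assert(!isValidRequiredAcks(2))   // e.g. acks=2 is rejected up front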

For each topicPartition, the corresponding Partition's appendRecordsToLeader method is called to write the records into its log:

partition.appendRecordsToLeader(records, isFromClient, requiredAcks)

partition

The partition writes the records into its own Log. Before writing, it first checks whether there are enough in-sync replicas: if the ISR size is below min.insync.replicas and acks == -1, the write is rejected with a NotEnoughReplicasException (a minimal sketch of this check follows the list below). Otherwise it calls appendAsLeader on its Log object. After the records have been written to the log, the partition:

  1. Unblocks any delayed fetch requests
  2. Tries to advance the high watermark (HW)
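
A simplified, self-contained version of that ISR check (the real one lives in Partition.appendRecordsToLeader and throws Kafka's NotEnoughReplicasException; a plain exception stands in for it here):

def ensureEnoughIsr(inSyncSize: Int, minIsr: Int, requiredAcks: Short): Unit =
  if (inSyncSize < minIsr && requiredAcks == -1)
    throw new IllegalStateException(
      s"NotEnoughReplicas: ISR size $inSyncSize is below min.insync.replicas=$minIsr for an acks=-1 produce")

// acks=0 and acks=1 writes are still accepted when the ISR has shrunk below
// minIsr; only acks=-1 producers are rejected before the append.
ensureEnoughIsr(1, 2, 1)   // ISR=1, minIsr=2, acks=1: accepted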

The Log append itself is a fairly involved process that touches many of the details of Kafka's log storage, so it is worth walking through the source.

log

  1. The first step of the append is to analyze and validate this batch of records; the result of the analysis is a LogAppendInfo object:
val appendInfo = analyzeAndValidateRecords(records, isFromClient = isFromClient)

The LogAppendInfo case class looks like this:

case class LogAppendInfo(var firstOffset: Option[Long],
                         var lastOffset: Long,
                         var maxTimestamp: Long,
                         var offsetOfMaxTimestamp: Long,
                         var logAppendTime: Long,
                         var logStartOffset: Long,
                         var recordConversionStats: RecordConversionStats,
                         sourceCodec: CompressionCodec,
                         targetCodec: CompressionCodec,
                         shallowCount: Int,
                         validBytes: Int,
                         offsetsMonotonic: Boolean,
                         lastOffsetOfFirstBatch: Long)
  • firstOffset: the first offset of the records -> the baseOffset of the first batch
  • lastOffset: the last offset of the records -> the lastOffset of the last batch
  • maxTimestamp: the largest timestamp among the records -> the maximum of the per-batch maxTimestamp values
  • offsetOfMaxTimestamp: the offset of the record carrying the largest timestamp -> the lastOffset of the batch that holds maxTimestamp
  • logAppendTime: the append time of the records -> still -1 at this point
  • logStartOffset: the log's startOffset at the time of the append
  • recordConversionStats: statistics about record conversion during the append
  • sourceCodec: the compression codec the producer used -> derived from batch.compressionType.id
  • targetCodec: the codec the log will actually store -> if the configured compression.type is "producer", the sourceCodec is kept; otherwise the configured codec is used
  • shallowCount: the number of shallow messages -> i.e. the number of batches
  • validBytes: the number of valid bytes -> the sum of the batch sizes; if any single batch is larger than max.message.bytes it is invalid and an exception is thrown
  • offsetsMonotonic: whether the offsets of these records increase monotonically
  • lastOffsetOfFirstBatch: the last offset of the first batch

The analysis is carried out by iterating over every batch in records.
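
A much-simplified model of that loop, showing how shallowCount, validBytes, offsetsMonotonic and maxTimestamp are accumulated. BatchView and everything else here are stand-ins rather than the real classes, and the real method additionally validates batch sizes against max.message.bytes, among other checks:

case class BatchView(baseOffset: Long, lastOffset: Long, maxTimestamp: Long, sizeInBytes: Int)

def analyze(batches: Seq[BatchView]): (Int, Int, Boolean, Long) = {
  var shallowCount = 0
  var validBytes = 0
  var monotonic = true
  var maxTimestamp = -1L
  var lastSeenOffset = -1L
  for (batch <- batches) {
    if (batch.lastOffset <= lastSeenOffset) monotonic = false  // offsets must keep increasing
    lastSeenOffset = batch.lastOffset
    shallowCount += 1                                          // one shallow message per batch
    validBytes += batch.sizeInBytes
    if (batch.maxTimestamp > maxTimestamp) maxTimestamp = batch.maxTimestamp
  }
  (shallowCount, validBytes, monotonic, maxTimestamp)
}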

  2. After obtaining the LogAppendInfo, any trailing invalid bytes are trimmed off. Then, when handling a produce request from a client, Kafka re-assigns offsets to this batch of records:
if (sourceCodec == NoCompressionCodec && targetCodec == NoCompressionCodec) {
      // check the magic value
      if (!records.hasMatchingMagic(magic))
        convertAndAssignOffsetsNonCompressed(records, offsetCounter, compactedTopic, time, now, timestampType,
          timestampDiffMaxMs, magic, partitionLeaderEpoch, isFromClient)
      else
        // Do in-place validation, offset assignment and maybe set timestamp
        assignOffsetsNonCompressed(records, offsetCounter, now, compactedTopic, timestampType, timestampDiffMaxMs,
          partitionLeaderEpoch, isFromClient, magic)
    } else {
      validateMessagesAndAssignOffsetsCompressed(records, offsetCounter, time, now, sourceCodec, targetCodec, compactedTopic,
        magic, timestampType, timestampDiffMaxMs, partitionLeaderEpoch, isFromClient)
    }

If the records are not compressed and their format version differs from the one configured on the broker (the format version is just the magic value), convertAndAssignOffsetsNonCompressed is called to convert them and assign offsets:

private def convertAndAssignOffsetsNonCompressed(records: MemoryRecords,
                                                   offsetCounter: LongRef,
                                                   compactedTopic: Boolean,
                                                   time: Time,
                                                   now: Long,
                                                   timestampType: TimestampType,
                                                   timestampDiffMaxMs: Long,
                                                   toMagicValue: Byte,
                                                   partitionLeaderEpoch: Int,
                                                   isFromClient: Boolean): ValidationAndOffsetAssignResult = {
    val startNanos = time.nanoseconds
    val sizeInBytesAfterConversion = AbstractRecords.estimateSizeInBytes(toMagicValue, offsetCounter.value,
      CompressionType.NONE, records.records)

    val (producerId, producerEpoch, sequence, isTransactional) = {
      val first = records.batches.asScala.head
      (first.producerId, first.producerEpoch, first.baseSequence, first.isTransactional)
    }

    // allocate a new buffer to hold the converted records
    val newBuffer = ByteBuffer.allocate(sizeInBytesAfterConversion)
    val builder = MemoryRecords.builder(newBuffer, toMagicValue, CompressionType.NONE, timestampType,
      offsetCounter.value, now, producerId, producerEpoch, sequence, isTransactional, partitionLeaderEpoch)

    for (batch <- records.batches.asScala) {
      validateBatch(batch, isFromClient, toMagicValue)

      for (record <- batch.asScala) {
        // validate that the record's format version matches the batch's, verify checksums, etc.
        validateRecord(batch, record, now, timestampType, timestampDiffMaxMs, compactedTopic)
        // assign each record the next offset, incrementing by one each time
        builder.appendWithOffset(offsetCounter.getAndIncrement(), record)
      }
    }

    val convertedRecords = builder.build()

    val info = builder.info
    val recordConversionStats = new RecordConversionStats(builder.uncompressedBytesWritten,
      builder.numRecords, time.nanoseconds - startNanos)
    ValidationAndOffsetAssignResult(
      validatedRecords = convertedRecords,
      maxTimestamp = info.maxTimestamp,
      shallowOffsetOfMaxTimestamp = info.shallowOffsetOfMaxTimestamp,
      messageSizeMaybeChanged = true,
      recordConversionStats = recordConversionStats)
  }

Because the records' magic value differs from the configured one, the messages have to be converted on the way in: a new buffer is allocated, each record is validated, and the converted record is appended via appendWithOffset:

private Long appendWithOffset(long offset, boolean isControlRecord, long timestamp, ByteBuffer key,
                                  ByteBuffer value, Header[] headers) {
        try {
            if (isControlRecord != isControlBatch)
                throw new IllegalArgumentException("Control records can only be appended to control batches");

            if (lastOffset != null && offset <= lastOffset)
                throw new IllegalArgumentException(String.format("Illegal offset %s following previous offset %s " +
                        "(Offsets must increase monotonically).", offset, lastOffset));

            if (timestamp < 0 && timestamp != RecordBatch.NO_TIMESTAMP)
                throw new IllegalArgumentException("Invalid negative timestamp " + timestamp);

            if (magic < RecordBatch.MAGIC_VALUE_V2 && headers != null && headers.length > 0)
                throw new IllegalArgumentException("Magic v" + magic + " does not support record headers");

            if (firstTimestamp == null)
                firstTimestamp = timestamp;

            if (magic > RecordBatch.MAGIC_VALUE_V1) {
                appendDefaultRecord(offset, timestamp, key, value, headers);
                return null;
            } else {
                return appendLegacyRecord(offset, timestamp, key, value);
            }
        } catch (IOException e) {
            throw new KafkaException("I/O exception when writing to the append stream, closing", e);
        }
    }

If the records' format version matches the configured one, the offsets and related fields are instead modified in place:

private def assignOffsetsNonCompressed(records: MemoryRecords,
                                         offsetCounter: LongRef,
                                         now: Long,
                                         compactedTopic: Boolean,
                                         timestampType: TimestampType,
                                         timestampDiffMaxMs: Long,
                                         partitionLeaderEpoch: Int,
                                         isFromClient: Boolean,
                                         magic: Byte): ValidationAndOffsetAssignResult = {
    var maxTimestamp = RecordBatch.NO_TIMESTAMP
    var offsetOfMaxTimestamp = -1L
    val initialOffset = offsetCounter.value

    for (batch <- records.batches.asScala) {
      // validate the batch first
      validateBatch(batch, isFromClient, magic)

      var maxBatchTimestamp = RecordBatch.NO_TIMESTAMP
      var offsetOfMaxBatchTimestamp = -1L

      for (record <- batch.asScala) {
        validateRecord(batch, record, now, timestampType, timestampDiffMaxMs, compactedTopic)
        val offset = offsetCounter.getAndIncrement()
        if (batch.magic > RecordBatch.MAGIC_VALUE_V0 && record.timestamp > maxBatchTimestamp) {
          maxBatchTimestamp = record.timestamp
          offsetOfMaxBatchTimestamp = offset
        }
      }

      // track the overall maxTimestamp and offsetOfMaxTimestamp
      if (batch.magic > RecordBatch.MAGIC_VALUE_V0 && maxBatchTimestamp > maxTimestamp) {
        maxTimestamp = maxBatchTimestamp
        offsetOfMaxTimestamp = offsetOfMaxBatchTimestamp
      }

      // only the batch's last offset is rewritten here (individual records do not get an explicit offset)
      batch.setLastOffset(offsetCounter.value - 1)

      if (batch.magic >= RecordBatch.MAGIC_VALUE_V2)
        batch.setPartitionLeaderEpoch(partitionLeaderEpoch)

      if (batch.magic > RecordBatch.MAGIC_VALUE_V0) {
        if (timestampType == TimestampType.LOG_APPEND_TIME)
          batch.setMaxTimestamp(TimestampType.LOG_APPEND_TIME, now)
        else
          batch.setMaxTimestamp(timestampType, maxBatchTimestamp)
      }
    }

    if (timestampType == TimestampType.LOG_APPEND_TIME) {
      maxTimestamp = now
      if (magic >= RecordBatch.MAGIC_VALUE_V2)
        offsetOfMaxTimestamp = offsetCounter.value - 1
      else
        offsetOfMaxTimestamp = initialOffset
    }

    ValidationAndOffsetAssignResult(
      validatedRecords = records,
      maxTimestamp = maxTimestamp,
      shallowOffsetOfMaxTimestamp = offsetOfMaxTimestamp,
      messageSizeMaybeChanged = false,
      recordConversionStats = RecordConversionStats.EMPTY)
  }
  3. After the offsets have been assigned, the resulting LogAppendInfo is validated once more (see the sketch after this list):
  • if the offsets are not monotonically increasing, an OffsetsOutOfOrderException is thrown
  • if firstOrLastOffsetOfFirstBatch < nextOffsetMetadata.messageOffset, i.e. the first offset (or the first batch's lastOffset) is below the current log end offset, an UnexpectedAppendOffsetException is thrown
  • if a batch's baseOffset is smaller than the baseOffset recorded for the corresponding partitionLeaderEpoch, the epochs that start "ahead" of it are cleared from the leader-epoch cache (their start offsets conflict with the incoming batch)
  • if validRecords.sizeInBytes > config.segmentSize, i.e. the records are larger than a whole segment, a RecordBatchTooLargeException is thrown
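
These checks, minus the leader-epoch cache cleanup, can be summarised in a small self-contained sketch (plain exceptions stand in for Kafka's exception types):

def validateAppendInfo(offsetsMonotonic: Boolean,
                       firstOrLastOffsetOfFirstBatch: Long,
                       logEndOffset: Long,
                       validBytes: Int,
                       segmentSize: Int): Unit = {
  if (!offsetsMonotonic)
    throw new IllegalStateException("OffsetsOutOfOrder: offsets in the append are out of order")
  if (firstOrLastOffsetOfFirstBatch < logEndOffset)
    throw new IllegalStateException(
      s"UnexpectedAppendOffset: first offset $firstOrLastOffsetOfFirstBatch is below the log end offset $logEndOffset")
  if (validBytes > segmentSize)
    throw new IllegalStateException(
      s"RecordBatchTooLarge: $validBytes bytes do not fit in a segment of segment.bytes=$segmentSize")
}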

The log then decides whether it needs to roll a new segment, and rolls one if so. The decision is driven by the following conditions:

  • the largest timestamp in the incoming messages is too far ahead of the timestamp of the segment's first message (falling back to the segment's creation time), i.e. the segment has been open longer than the maximum segment age, minus the roll jitter:
val reachedRollMs = timeWaitedForRoll(rollParams.now, rollParams.maxTimestampInMessages) > rollParams.maxSegmentMs - rollJitterMs
  • the segment's current size plus the size of the incoming messages exceeds the maximum segment size
  • the offset index or the time index is full
  • the largest offset in the messages minus the segment's baseOffset no longer fits in an Int, so it could not be stored as a relative offset. Put together:
size > rollParams.maxSegmentBytes - rollParams.messagesSize ||
      (size > 0 && reachedRollMs) ||
      offsetIndex.isFull || timeIndex.isFull || !canConvertToRelativeOffset(rollParams.maxOffsetInMessages)
  4. After rolling a new segment (or staying on the existing active segment), the records are written to it:
segment.append(largestOffset = appendInfo.lastOffset,
          largestTimestamp = appendInfo.maxTimestamp,
          shallowOffsetOfMaxTimestamp = appendInfo.offsetOfMaxTimestamp,
          records = validRecords)
  5. Update the offsets. These are the partition-level offsets, i.e. the LEO and the LSO; the LEO is updated to the largest appended offset + 1:
updateLogEndOffset(appendInfo.lastOffset + 1)
  6. If the flush condition is met, flush the messages to disk (a sketch of this check follows).
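
Whether a flush is needed boils down to comparing the number of messages appended since the recovery point against flush.messages (config.flushInterval). A minimal sketch of this check, assuming the behaviour of the version walked through here:

def maybeFlush(logEndOffset: Long, recoveryPoint: Long, flushInterval: Long)(flush: () => Unit): Unit = {
  // flush.messages defaults to an effectively infinite value, so Kafka normally
  // leaves flushing to the OS page cache and relies on replication for durability
  val unflushedMessages = logEndOffset - recoveryPoint
  if (unflushedMessages >= flushInterval) flush()
}

maybeFlush(logEndOffset = 105L, recoveryPoint = 100L, flushInterval = 5L)(() => println("flushing up to LEO"))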

This concludes the log append. Returning to the replica manager: after the append, ReplicaManager decides whether a delayed produce is needed:

private def delayedProduceRequestRequired(requiredAcks: Short,
                                            entriesPerPartition: Map[TopicPartition, MemoryRecords],
                                            localProduceResults: Map[TopicPartition, LogAppendResult]): Boolean = {
    requiredAcks == -1 &&
    entriesPerPartition.nonEmpty &&
    localProduceResults.values.count(_.exception.isDefined) < entriesPerPartition.size
  }

If acks == -1 and at least one partition was written successfully, a delayed produce is created to wait for the replicas' acknowledgements. DelayedProduce is a kind of DelayedOperation, and must implement two methods on top of it: onComplete and tryComplete. It is also a timed task: once it times out, forceComplete is executed. The ReplicaManager first registers it with the delayedProducePurgatory, with each topicPartition acting as a key, and every key watches this operation. Whenever one of these topicPartitions is fetched, tryCompleteWatched is called on its watchers, giving the DelayedProduce a chance to complete. If the operation is never completed that way, it simply waits for the timeout, after which forceComplete runs and in turn runs onComplete. A tiny model of this try-then-watch mechanism is sketched below, followed by DelayedProduce's implementations of tryComplete and onComplete.
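
Before looking at those implementations, here is a small self-contained model of the purgatory's try-then-watch behaviour (the real classes are DelayedOperationPurgatory and DelayedProduce; everything below is a simplified stand-in):

import scala.collection.mutable

trait DelayedOp {
  def tryComplete(): Boolean   // true if the operation can be completed right now
}

class PurgatorySketch[K] {
  private val watchers = mutable.Map.empty[K, List[DelayedOp]]

  // called by the replica manager after the local append: try once, otherwise
  // watch the operation under every key (one key per TopicPartition)
  def tryCompleteElseWatch(op: DelayedOp, keys: Seq[K]): Boolean =
    op.tryComplete() || {
      keys.foreach(k => watchers(k) = op :: watchers.getOrElse(k, Nil))
      false
    }

  // called when something happens on a key, e.g. a follower fetch advances the
  // high watermark of that partition: retry the watched operations
  def checkAndComplete(key: K): Unit =
    watchers.get(key).foreach(ops => watchers(key) = ops.filterNot(_.tryComplete()))
}

DelayedProduce's actual tryComplete then looks like this: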

override def tryComplete(): Boolean = {
    // check for each partition if it still has pending acks
    produceMetadata.produceStatus.foreach { case (topicPartition, status) =>
      trace(s"Checking produce satisfaction for $topicPartition, current status $status")
      // skip those partitions that have already been satisfied
      if (status.acksPending) {
        val (hasEnough, error) = replicaManager.getPartition(topicPartition) match {
          case Some(partition) =>
            if (partition eq ReplicaManager.OfflinePartition)
              (false, Errors.KAFKA_STORAGE_ERROR)
            else
              partition.checkEnoughReplicasReachOffset(status.requiredOffset)
          case None =>
            // Case A
            (false, Errors.UNKNOWN_TOPIC_OR_PARTITION)
        }
        // Case B.1 || B.2
        if (error != Errors.NONE || hasEnough) {
          status.acksPending = false
          status.responseStatus.error = error
        }
      }
    }

    // check if every partition has satisfied at least one of case A or B
    if (!produceMetadata.produceStatus.values.exists(_.acksPending))
      forceComplete()
    else
      false
  }

When the DelayedProduce is constructed, it is given a ProduceMetadata object that carries the results of the append that just happened:

case class ProduceMetadata(produceRequiredAcks: Short,
                           produceStatus: Map[TopicPartition, ProducePartitionStatus])

If a ProducePartitionStatus has no error, its acksPending flag is set to true, otherwise false (a sketch of this initialisation follows). When tryComplete runs, topicPartitions whose acksPending is false are skipped; for the rest, it checks whether the replicas have caught up.
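
A simplified sketch of that initialisation, which happens when the DelayedProduce is constructed (string error codes stand in for Kafka's Errors enum). Partitions that failed locally need no further acks, while pending partitions are pre-marked with a timeout error that is only cleared once enough replicas catch up:

case class PartitionStatusSketch(var acksPending: Boolean, var error: String)

def initAcksPending(statuses: Map[String, PartitionStatusSketch]): Unit =
  statuses.values.foreach { status =>
    if (status.error == "NONE") {
      status.acksPending = true
      status.error = "REQUEST_TIMED_OUT"   // overwritten with NONE in tryComplete if the ISR catches up
    } else {
      status.acksPending = false
    }
  }

The replica-side check, checkEnoughReplicasReachOffset, is implemented on the partition: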

//requiredOffset is the lastOffset of the records just appended + 1
def checkEnoughReplicasReachOffset(requiredOffset: Long): (Boolean, Errors) = {
    leaderReplicaIfLocal match {
      case Some(leaderReplica) =>
        // keep the current immutable replica list reference
        val curInSyncReplicas = inSyncReplicas

        if (isTraceEnabled) {
          def logEndOffsetString(r: Replica) = s"broker ${r.brokerId}: ${r.logEndOffset.messageOffset}"
          val (ackedReplicas, awaitingReplicas) = curInSyncReplicas.partition { replica =>
            replica.logEndOffset.messageOffset >= requiredOffset
          }
          trace(s"Progress awaiting ISR acks for offset $requiredOffset: acked: ${ackedReplicas.map(logEndOffsetString)}, " +
            s"awaiting ${awaitingReplicas.map(logEndOffsetString)}")
        }

        val minIsr = leaderReplica.log.get.config.minInSyncReplicas
        //if the HW is at least requiredOffset, every replica in the ISR has caught up with this write
        if (leaderReplica.highWatermark.messageOffset >= requiredOffset) {
          /*
           * The topic may be configured not to accept messages if there are not enough replicas in ISR
           * in this scenario the request was already appended locally and then added to the purgatory before the ISR was shrunk
           */
          if (minIsr <= curInSyncReplicas.size)
            (true, Errors.NONE)
          else
            //at append time the ISR was already checked against minIsr; the ISR has merely shrunk since then, and the records have in fact been replicated by the followers
            (true, Errors.NOT_ENOUGH_REPLICAS_AFTER_APPEND)
        } else
          (false, Errors.NONE)
      case None =>
        (false, Errors.NOT_LEADER_FOR_PARTITION)
    }
  }

If, after these checks, no partition is left with acksPending == true, forceComplete runs to finish things off; otherwise tryComplete returns false. DelayedProduce's onComplete is implemented as:

override def onComplete() {
    val responseStatus = produceMetadata.produceStatus.mapValues(status => status.responseStatus)
    responseCallback(responseStatus)
  }

If checkEnoughReplicasReachOffset returned true for every partition as described above, the operation is complete and the response is sent back to the client. If the replicas still have not finished fetching when the timeout expires, onExpiration is executed:

override def onExpiration() {
    produceMetadata.produceStatus.foreach { case (topicPartition, status) =>
      if (status.acksPending) {
        debug(s"Expiring produce request for partition $topicPartition with status $status")
        DelayedProduceMetrics.recordExpiration(topicPartition)
      }
    }
  }

If no replica acks need to be awaited, the response is likewise returned immediately:

if (delayedProduceRequestRequired(requiredAcks, entriesPerPartition, localProduceResults)) {
        // ...
      } else {
        // we can respond immediately
        val produceResponseStatus = produceStatus.mapValues(status => status.responseStatus)
        responseCallback(produceResponseStatus)
      }

With that, the produce path on the broker is complete.