kafka server - produce
case ApiKeys.PRODUCE => handleProduceRequest(request)
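For context, the produce path starts in KafkaApis.handle, which dispatches on the request's API key. A minimal sketch of that dispatch (only a few cases shown; error handling omitted):

def handle(request: RequestChannel.Request): Unit = {
  request.header.apiKey match {
    case ApiKeys.PRODUCE => handleProduceRequest(request)
    case ApiKeys.FETCH => handleFetchRequest(request)
    case ApiKeys.METADATA => handleTopicMetadataRequest(request)
    case _ => // the remaining ApiKeys are dispatched to their handlers analogously
  }
}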
kafka api
- Authorize the request, distinguishing between:
  - transactional produce requests
  - idempotent produce requests
  - all other produce requests
- Define the sendResponseCallback callback
- Define the processingStatsCallback callback
- Call replicaManager.appendRecords (a sketch of the handler follows this list)
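A condensed sketch of how handleProduceRequest is organized, assuming the 2.x-era KafkaApis this walkthrough follows (the authorization helpers are summarized as comments, and authorizedRequestInfo is an illustrative name for the partitions that survive authorization and validation):

def handleProduceRequest(request: RequestChannel.Request): Unit = {
  val produceRequest = request.body[ProduceRequest]

  // transactional produce: requires Write on the TransactionalId resource
  // idempotent produce:    requires IdempotentWrite on the Cluster resource
  // every partition:       requires Write on its Topic; unauthorized or nonexistent
  //                        partitions get an error response directly

  // invoked by the ReplicaManager once every partition has a result
  def sendResponseCallback(responseStatus: Map[TopicPartition, PartitionResponse]): Unit = {
    // merge with the pre-computed per-partition errors and send the response;
    // for acks == 0 the client expects no response body
  }

  // invoked with the per-partition record-conversion statistics
  def processingStatsCallback(stats: Map[TopicPartition, RecordConversionStats]): Unit = ()

  replicaManager.appendRecords(
    timeout = produceRequest.timeout.toLong,
    requiredAcks = produceRequest.acks,
    internalTopicsAllowed = false,                 // only the admin client may write internal topics
    isFromClient = true,
    entriesPerPartition = authorizedRequestInfo,   // illustrative: partitions that passed authorization
    responseCallback = sendResponseCallback,
    recordConversionStatsCallback = processingStatsCallback)
}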
replica manager
- Check that the request's acks value is valid (acks == 0 || acks == 1 || acks == -1)
- Write the records to the partitions (via its own appendToLocalLog method)
- Invoke the recordConversionStatsCallback callback
- Decide whether a delayed produce is needed; if not, the request can be answered immediately, otherwise
  - create a delayed produce operation and watch it in the purgatory (appendRecords is sketched below)
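A condensed sketch of ReplicaManager.appendRecords showing those steps (simplified from the 2.x source; the purgatory branch is only summarized here and shown in full further below):

def appendRecords(timeout: Long,
                  requiredAcks: Short,
                  internalTopicsAllowed: Boolean,
                  isFromClient: Boolean,
                  entriesPerPartition: Map[TopicPartition, MemoryRecords],
                  responseCallback: Map[TopicPartition, PartitionResponse] => Unit,
                  delayedProduceLock: Option[Lock] = None,
                  recordConversionStatsCallback: Map[TopicPartition, RecordConversionStats] => Unit = _ => ()): Unit = {
  if (isValidRequiredAcks(requiredAcks)) {               // acks must be 0, 1 or -1
    // write to the leader log of every partition in the request
    val localProduceResults = appendToLocalLog(internalTopicsAllowed, isFromClient, entriesPerPartition, requiredAcks)
    recordConversionStatsCallback(localProduceResults.mapValues(_.info.recordConversionStats))

    val produceStatus = localProduceResults.map { case (topicPartition, result) =>
      topicPartition -> ProducePartitionStatus(
        result.info.lastOffset + 1,                      // required offset for the replica acks
        new PartitionResponse(result.error))             // response status (simplified)
    }

    if (delayedProduceRequestRequired(requiredAcks, entriesPerPartition, localProduceResults)) {
      // acks == -1: build a DelayedProduce and park it in the purgatory (sketched further below)
    } else {
      // acks == 0 or 1: the local append is enough, respond immediately
      responseCallback(produceStatus.mapValues(_.responseStatus))
    }
  } else {
    // reject the request: every partition gets INVALID_REQUIRED_ACKS
    responseCallback(entriesPerPartition.map { case (topicPartition, _) =>
      topicPartition -> new PartitionResponse(Errors.INVALID_REQUIRED_ACKS)
    })
  }
}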
For each TopicPartition, appendToLocalLog calls the corresponding Partition's appendRecordsToLeader method to write the records:
partition.appendRecordsToLeader(records, isFromClient, requiredAcks)
partition
The Partition writes the records into its own log. Before writing, it first checks whether there are enough in-sync replicas: if the ISR size is below min.insync.replicas and acks == -1, the write is rejected with a NotEnoughReplicasException. Otherwise, it calls appendAsLeader on the Partition's Log object. After the records have been written to the log, it also
- tries to unblock delayed fetch requests waiting on this partition
- advances the high watermark (HW) if possible (see the sketch below)
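A condensed sketch of Partition.appendRecordsToLeader (simplified from the 2.x source; error messages and some bookkeeping omitted):

def appendRecordsToLeader(records: MemoryRecords, isFromClient: Boolean, requiredAcks: Int = 0): LogAppendInfo = {
  val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
    leaderReplicaIfLocal match {
      case Some(leaderReplica) =>
        val log = leaderReplica.log.get
        val minIsr = log.config.minInSyncReplicas
        // reject the write when acks == -1 and the ISR is already below min.insync.replicas
        if (inSyncReplicas.size < minIsr && requiredAcks == -1)
          throw new NotEnoughReplicasException(s"Not enough in-sync replicas for partition $topicPartition")

        val info = log.appendAsLeader(records, leaderEpoch = this.leaderEpoch, isFromClient)
        // new data may unblock delayed fetch requests waiting on this partition
        replicaManager.tryCompleteDelayedFetch(new TopicPartitionOperationKey(this.topic, this.partitionId))
        // the HW can advance immediately, e.g. when the leader is the only member of the ISR
        (info, maybeIncrementLeaderHW(leaderReplica))

      case None =>
        throw new NotLeaderForPartitionException(s"Leader not local for partition $topicPartition")
    }
  }
  // advancing the HW may in turn complete other delayed operations
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
  info
}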
The log append itself is a fairly involved process that touches many details of Kafka's log storage, so it is worth walking through the source.
log
- The first step of the append is to analyze and validate this batch of records; the result of that analysis is a LogAppendInfo object:
val appendInfo = analyzeAndValidateRecords(records, isFromClient = isFromClient)
The LogAppendInfo class looks like this:
case class LogAppendInfo(var firstOffset: Option[Long],
var lastOffset: Long,
var maxTimestamp: Long,
var offsetOfMaxTimestamp: Long,
var logAppendTime: Long,
var logStartOffset: Long,
var recordConversionStats: RecordConversionStats,
sourceCodec: CompressionCodec,
targetCodec: CompressionCodec,
shallowCount: Int,
validBytes: Int,
offsetsMonotonic: Boolean,
lastOffsetOfFirstBatch: Long)
- firstOffset: the first offset of this append -> the baseOffset of the first batch
- lastOffset: the last offset of this append -> the lastOffset of the last batch
- maxTimestamp: the largest timestamp in the records -> the maximum of each batch's maxTimestamp
- offsetOfMaxTimestamp: the offset of the record carrying the largest timestamp -> the lastOffset corresponding to maxTimestamp
- logAppendTime: the append time -> -1 at this point
- logStartOffset: the log's start offset at the time of the append
- recordConversionStats: statistics collected while converting records during the append
- sourceCodec: the compression codec specified by the producer -> derived from batch.compressionType.id
- targetCodec: the codec actually written to the log -> if the configured compression.type equals the producer's sourceCodec, sourceCodec is returned; otherwise the configured codec is used
- shallowCount: the number of shallow messages -> i.e. the number of batches
- validBytes: the number of valid bytes -> the sum of the batch sizes; if any batch exceeds max.message.bytes it is invalid and an exception is thrown
- offsetsMonotonic: whether the offsets of these records increase monotonically
- lastOffsetOfFirstBatch: the last offset of the first batch
The analysis is implemented by iterating over every batch in records, as sketched below.
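A condensed sketch of that loop (simplified from Log.analyzeAndValidateRecords; several batch-level checks are reduced to comments):

private def analyzeAndValidateRecords(records: MemoryRecords, isFromClient: Boolean): LogAppendInfo = {
  var shallowMessageCount = 0
  var validBytesCount = 0
  var firstOffset: Option[Long] = None
  var lastOffset = -1L
  var lastOffsetOfFirstBatch = -1L
  var maxTimestamp = RecordBatch.NO_TIMESTAMP
  var offsetOfMaxTimestamp = -1L
  var monotonic = true
  var sourceCodec: CompressionCodec = NoCompressionCodec

  for (batch <- records.batches.asScala) {
    if (firstOffset.isEmpty && batch.magic >= RecordBatch.MAGIC_VALUE_V2)
      firstOffset = Some(batch.baseOffset)         // only v2 batches carry a reliable baseOffset
    if (lastOffsetOfFirstBatch < 0)
      lastOffsetOfFirstBatch = batch.lastOffset

    if (lastOffset >= batch.lastOffset) monotonic = false   // offsets must keep increasing
    lastOffset = batch.lastOffset

    val batchSize = batch.sizeInBytes
    if (batchSize > config.maxMessageSize)         // an over-sized batch invalidates the whole append
      throw new RecordTooLargeException(s"Batch of $batchSize bytes exceeds max.message.bytes")
    batch.ensureValid()                            // CRC check

    if (batch.maxTimestamp > maxTimestamp) {
      maxTimestamp = batch.maxTimestamp
      offsetOfMaxTimestamp = lastOffset
    }
    shallowMessageCount += 1
    validBytesCount += batchSize
    // remember the producer-side compression codec of the batches
    if (batch.compressionType.id != CompressionType.NONE.id)
      sourceCodec = CompressionCodec.getCompressionCodec(batch.compressionType.id)
  }
  // the codec actually written to the log follows the topic's compression.type setting
  val targetCodec = BrokerCompressionCodec.getTargetCompressionCodec(config.compressionType, sourceCodec)
  LogAppendInfo(firstOffset, lastOffset, maxTimestamp, offsetOfMaxTimestamp, RecordBatch.NO_TIMESTAMP,
    logStartOffset, RecordConversionStats.EMPTY, sourceCodec, targetCodec,
    shallowMessageCount, validBytesCount, monotonic, lastOffsetOfFirstBatch)
}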
- After obtaining the LogAppendInfo, any trailing invalid bytes are trimmed off. When handling a client produce request, Kafka then reassigns offsets to this batch of records:
if (sourceCodec == NoCompressionCodec && targetCodec == NoCompressionCodec) {
  // check the magic value
  if (!records.hasMatchingMagic(magic))
    convertAndAssignOffsetsNonCompressed(records, offsetCounter, compactedTopic, time, now, timestampType,
      timestampDiffMaxMs, magic, partitionLeaderEpoch, isFromClient)
  else
    // Do in-place validation, offset assignment and maybe set timestamp
    assignOffsetsNonCompressed(records, offsetCounter, now, compactedTopic, timestampType, timestampDiffMaxMs,
      partitionLeaderEpoch, isFromClient, magic)
} else {
  validateMessagesAndAssignOffsetsCompressed(records, offsetCounter, time, now, sourceCodec, targetCodec, compactedTopic,
    magic, timestampType, timestampDiffMaxMs, partitionLeaderEpoch, isFromClient)
}
If the records are uncompressed and their format version (i.e. the magic value) differs from the broker's configured message format version, convertAndAssignOffsetsNonCompressed is called to convert them and assign offsets:
private def convertAndAssignOffsetsNonCompressed(records: MemoryRecords,
offsetCounter: LongRef,
compactedTopic: Boolean,
time: Time,
now: Long,
timestampType: TimestampType,
timestampDiffMaxMs: Long,
toMagicValue: Byte,
partitionLeaderEpoch: Int,
isFromClient: Boolean): ValidationAndOffsetAssignResult = {
val startNanos = time.nanoseconds
val sizeInBytesAfterConversion = AbstractRecords.estimateSizeInBytes(toMagicValue, offsetCounter.value,
CompressionType.NONE, records.records)
val (producerId, producerEpoch, sequence, isTransactional) = {
val first = records.batches.asScala.head
(first.producerId, first.producerEpoch, first.baseSequence, first.isTransactional)
}
// allocate a new buffer to hold the converted records
val newBuffer = ByteBuffer.allocate(sizeInBytesAfterConversion)
val builder = MemoryRecords.builder(newBuffer, toMagicValue, CompressionType.NONE, timestampType,
offsetCounter.value, now, producerId, producerEpoch, sequence, isTransactional, partitionLeaderEpoch)
for (batch <- records.batches.asScala) {
validateBatch(batch, isFromClient, toMagicValue)
for (record <- batch.asScala) {
// validate that the record's format version matches the batch's, verify the record checksum, etc.
validateRecord(batch, record, now, timestampType, timestampDiffMaxMs, compactedTopic)
// each record is appended with the next sequential offset
builder.appendWithOffset(offsetCounter.getAndIncrement(), record)
}
}
val convertedRecords = builder.build()
val info = builder.info
val recordConversionStats = new RecordConversionStats(builder.uncompressedBytesWritten,
builder.numRecords, time.nanoseconds - startNanos)
ValidationAndOffsetAssignResult(
validatedRecords = convertedRecords,
maxTimestamp = info.maxTimestamp,
shallowOffsetOfMaxTimestamp = info.shallowOffsetOfMaxTimestamp,
messageSizeMaybeChanged = true,
recordConversionStats = recordConversionStats)
}
Because the records' magic value differs from the configured one, the records have to be converted: a new buffer is allocated, each record is validated, and the converted record is appended with appendWithOffset:
private Long appendWithOffset(long offset, boolean isControlRecord, long timestamp, ByteBuffer key,
ByteBuffer value, Header[] headers) {
try {
if (isControlRecord != isControlBatch)
throw new IllegalArgumentException("Control records can only be appended to control batches");
if (lastOffset != null && offset <= lastOffset)
throw new IllegalArgumentException(String.format("Illegal offset %s following previous offset %s " +
"(Offsets must increase monotonically).", offset, lastOffset));
if (timestamp < 0 && timestamp != RecordBatch.NO_TIMESTAMP)
throw new IllegalArgumentException("Invalid negative timestamp " + timestamp);
if (magic < RecordBatch.MAGIC_VALUE_V2 && headers != null && headers.length > 0)
throw new IllegalArgumentException("Magic v" + magic + " does not support record headers");
if (firstTimestamp == null)
firstTimestamp = timestamp;
if (magic > RecordBatch.MAGIC_VALUE_V1) {
appendDefaultRecord(offset, timestamp, key, value, headers);
return null;
} else {
return appendLegacyRecord(offset, timestamp, key, value);
}
} catch (IOException e) {
throw new KafkaException("I/O exception when writing to the append stream, closing", e);
}
}
If the records' format version already matches the configured one, the offsets and related fields are simply modified in place:
private def assignOffsetsNonCompressed(records: MemoryRecords,
offsetCounter: LongRef,
now: Long,
compactedTopic: Boolean,
timestampType: TimestampType,
timestampDiffMaxMs: Long,
partitionLeaderEpoch: Int,
isFromClient: Boolean,
magic: Byte): ValidationAndOffsetAssignResult = {
var maxTimestamp = RecordBatch.NO_TIMESTAMP
var offsetOfMaxTimestamp = -1L
val initialOffset = offsetCounter.value
for (batch <- records.batches.asScala) {
// validate the batch first
validateBatch(batch, isFromClient, magic)
var maxBatchTimestamp = RecordBatch.NO_TIMESTAMP
var offsetOfMaxBatchTimestamp = -1L
for (record <- batch.asScala) {
validateRecord(batch, record, now, timestampType, timestampDiffMaxMs, compactedTopic)
val offset = offsetCounter.getAndIncrement()
if (batch.magic > RecordBatch.MAGIC_VALUE_V0 && record.timestamp > maxBatchTimestamp) {
maxBatchTimestamp = record.timestamp
offsetOfMaxBatchTimestamp = offset
}
}
// track the overall maxTimestamp and offsetOfMaxTimestamp
if (batch.magic > RecordBatch.MAGIC_VALUE_V0 && maxBatchTimestamp > maxTimestamp) {
maxTimestamp = maxBatchTimestamp
offsetOfMaxTimestamp = offsetOfMaxBatchTimestamp
}
// only the batch's lastOffset is rewritten here (records are not assigned offsets one by one)
batch.setLastOffset(offsetCounter.value - 1)
if (batch.magic >= RecordBatch.MAGIC_VALUE_V2)
batch.setPartitionLeaderEpoch(partitionLeaderEpoch)
if (batch.magic > RecordBatch.MAGIC_VALUE_V0) {
if (timestampType == TimestampType.LOG_APPEND_TIME)
batch.setMaxTimestamp(TimestampType.LOG_APPEND_TIME, now)
else
batch.setMaxTimestamp(timestampType, maxBatchTimestamp)
}
}
if (timestampType == TimestampType.LOG_APPEND_TIME) {
maxTimestamp = now
if (magic >= RecordBatch.MAGIC_VALUE_V2)
offsetOfMaxTimestamp = offsetCounter.value - 1
else
offsetOfMaxTimestamp = initialOffset
}
ValidationAndOffsetAssignResult(
validatedRecords = records,
maxTimestamp = maxTimestamp,
shallowOffsetOfMaxTimestamp = offsetOfMaxTimestamp,
messageSizeMaybeChanged = false,
recordConversionStats = RecordConversionStats.EMPTY)
}
- After the offsets have been assigned, the LogAppendInfo result is validated once more (the checks are sketched after this list):
- if the offsets in the LogAppendInfo are not monotonically increasing, an OffsetsOutOfOrderException is thrown
- if firstOrLastOffsetOfFirstBatch < nextOffsetMetadata.messageOffset, i.e. the offset of the first record (or the lastOffset of the first batch) is lower than the log's next expected offset, an UnexpectedAppendOffsetException is thrown
- if a batch's baseOffset is smaller than the start offset cached for the corresponding partitionLeaderEpoch, the conflicting entries in the leader-epoch cache have to be cleared, because those epochs' start offsets conflict with the newly appended batch
- if validRecords.sizeInBytes > config.segmentSize, i.e. the records are larger than an entire configured segment, a RecordBatchTooLargeException is thrown
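A minimal sketch of those checks as they appear in Log.append (exception messages and constructor arguments simplified; the epoch-cache handling is only summarized in a comment):

if (!appendInfo.offsetsMonotonic)
  throw new OffsetsOutOfOrderException(s"Out of order offsets found in append to $topicPartition")

if (appendInfo.firstOrLastOffsetOfFirstBatch < nextOffsetMetadata.messageOffset)
  throw new UnexpectedAppendOffsetException(
    s"Unexpected offset in append to $topicPartition",
    appendInfo.firstOffset.getOrElse(appendInfo.lastOffsetOfFirstBatch), appendInfo.lastOffset)

// leader-epoch cache: cached epoch entries whose start offset lies beyond an incoming
// batch's baseOffset are cleared so they do not conflict with the appended batches

if (validRecords.sizeInBytes > config.segmentSize)
  throw new RecordBatchTooLargeException(
    s"Record batch of ${validRecords.sizeInBytes} bytes exceeds segment.bytes (${config.segmentSize})")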
Next, the log decides whether the active segment needs to be rolled, and rolls it if so. The decision is driven by the following conditions:
- roll if the segment's largest timestamp and the incoming records' largest timestamp are too far apart, i.e. the segment has been open longer than segment.ms:
val reachedRollMs = timeWaitedForRoll(rollParams.now, rollParams.maxTimestampInMessages) > rollParams.maxSegmentMs - rollJitterMs
- roll if the segment's current size plus the incoming message size exceeds the maximum segment size
- roll if the offset index or the time index is full
- roll if the largest offset in the messages minus the segment's baseOffset no longer fits in an Int. Putting it all together:
size > rollParams.maxSegmentBytes - rollParams.messagesSize ||
(size > 0 && reachedRollMs) ||
offsetIndex.isFull || timeIndex.isFull || !canConvertToRelativeOffset(rollParams.maxOffsetInMessages)
- After rolling a new segment (or staying on the current activeSegment), the records are written into that segment (a condensed sketch of LogSegment.append follows this list):
segment.append(largestOffset = appendInfo.lastOffset,
largestTimestamp = appendInfo.maxTimestamp,
shallowOffsetOfMaxTimestamp = appendInfo.offsetOfMaxTimestamp,
records = validRecords)
- Update the offsets. This updates the partition-level offsets, i.e. the LEO and the LSO; the LEO is set to the largest appended offset + 1:
updateLogEndOffset(appendInfo.lastOffset + 1)
- If enough unflushed messages have accumulated (the flush.messages setting), the messages are flushed to disk.
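As referenced above, a condensed sketch of LogSegment.append (simplified from the 2.x source; the offset-overflow check and some bookkeeping are reduced to comments):

def append(largestOffset: Long,
           largestTimestamp: Long,
           shallowOffsetOfMaxTimestamp: Long,
           records: MemoryRecords): Unit = {
  if (records.sizeInBytes > 0) {
    val physicalPosition = log.sizeInBytes()     // current end of the .log file
    // (largestOffset - baseOffset must still fit in an Int, otherwise this segment cannot hold it)
    log.append(records)                          // append the batches to the FileRecords

    // remember the largest timestamp seen in this segment and the offset that carries it
    if (largestTimestamp > maxTimestampSoFar) {
      maxTimestampSoFar = largestTimestamp
      offsetOfMaxTimestampSoFar = shallowOffsetOfMaxTimestamp
    }
    // write a sparse index entry roughly every index.interval.bytes
    if (bytesSinceLastIndexEntry > indexIntervalBytes) {
      offsetIndex.append(largestOffset, physicalPosition)
      timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestampSoFar)
      bytesSinceLastIndexEntry = 0
    }
    bytesSinceLastIndexEntry += records.sizeInBytes
  }
}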
This completes the log append. Back in the ReplicaManager: once the local append is done, it decides whether a delayed produce is required:
private def delayedProduceRequestRequired(requiredAcks: Short,
entriesPerPartition: Map[TopicPartition, MemoryRecords],
localProduceResults: Map[TopicPartition, LogAppendResult]): Boolean = {
requiredAcks == -1 &&
entriesPerPartition.nonEmpty &&
localProduceResults.values.count(_.exception.isDefined) < entriesPerPartition.size
}
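When this returns true, appendRecords builds a DelayedProduce from the local results and registers it in the purgatory, using one key per TopicPartition (condensed from the 2.x source):

val produceMetadata = ProduceMetadata(requiredAcks, produceStatus)
val delayedProduce = new DelayedProduce(timeout, produceMetadata, this, responseCallback, delayedProduceLock)

// one watch key per partition: a follower fetch on any of them will try to complete the operation
val producerRequestKeys = entriesPerPartition.keys.map(new TopicPartitionOperationKey(_)).toSeq
delayedProducePurgatory.tryCompleteElseWatch(delayedProduce, producerRequestKeys)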
If acks == -1 and at least one partition was appended successfully, a delayed produce has to be created to wait for the replicas' acknowledgements. DelayedProduce is a kind of DelayedOperation, so it must implement two methods, onComplete and tryComplete. It is also a timed task: once it times out, forceComplete is executed. The ReplicaManager first registers it with the delayedProducePurgatory, where every TopicPartition of the request acts as a key and each key watches this operation. Whenever one of those TopicPartitions is fetched (by a follower), tryCompleteWatched is invoked to try to complete the DelayedProduce. If the operation is never completed that way, it simply waits for the timeout, after which forceComplete, and therefore onComplete, runs. Let's look at how DelayedProduce implements tryComplete and onComplete:
override def tryComplete(): Boolean = {
// check for each partition if it still has pending acks
produceMetadata.produceStatus.foreach { case (topicPartition, status) =>
trace(s"Checking produce satisfaction for $topicPartition, current status $status")
// skip those partitions that have already been satisfied
if (status.acksPending) {
val (hasEnough, error) = replicaManager.getPartition(topicPartition) match {
case Some(partition) =>
if (partition eq ReplicaManager.OfflinePartition)
(false, Errors.KAFKA_STORAGE_ERROR)
else
partition.checkEnoughReplicasReachOffset(status.requiredOffset)
case None =>
// Case A
(false, Errors.UNKNOWN_TOPIC_OR_PARTITION)
}
// Case B.1 || B.2
if (error != Errors.NONE || hasEnough) {
status.acksPending = false
status.responseStatus.error = error
}
}
}
// check if every partition has satisfied at least one of case A or B
if (!produceMetadata.produceStatus.values.exists(_.acksPending))
forceComplete()
else
false
}
When the DelayedProduce is constructed, a ProduceMetadata object is passed in, containing the result of the local produce:
case class ProduceMetadata(produceRequiredAcks: Short,
produceStatus: Map[TopicPartition, ProducePartitionStatus])
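Inside the DelayedProduce constructor, each partition's status is initialized from that result (condensed from the source): partitions that already failed locally need no further acks, while the others stay pending and are pre-marked with a timeout error that is overwritten once the acks arrive.

produceMetadata.produceStatus.foreach { case (topicPartition, status) =>
  if (status.responseStatus.error == Errors.NONE) {
    status.acksPending = true                              // wait for the ISR to catch up
    status.responseStatus.error = Errors.REQUEST_TIMED_OUT // default if the wait expires
  } else {
    status.acksPending = false                             // local append already failed
  }
}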
In other words, a partition whose ProducePartitionStatus carries no error has acksPending set to true, otherwise false. When tryComplete runs, TopicPartitions whose acksPending is false are skipped; for the others, checkEnoughReplicasReachOffset checks whether the replicas have caught up:
// requiredOffset is the lastOffset of the append plus one
def checkEnoughReplicasReachOffset(requiredOffset: Long): (Boolean, Errors) = {
leaderReplicaIfLocal match {
case Some(leaderReplica) =>
// keep the current immutable replica list reference
val curInSyncReplicas = inSyncReplicas
if (isTraceEnabled) {
def logEndOffsetString(r: Replica) = s"broker ${r.brokerId}: ${r.logEndOffset.messageOffset}"
val (ackedReplicas, awaitingReplicas) = curInSyncReplicas.partition { replica =>
replica.logEndOffset.messageOffset >= requiredOffset
}
trace(s"Progress awaiting ISR acks for offset $requiredOffset: acked: ${ackedReplicas.map(logEndOffsetString)}, " +
s"awaiting ${awaitingReplicas.map(logEndOffsetString)}")
}
val minIsr = leaderReplica.log.get.config.minInSyncReplicas
// if the HW has reached requiredOffset, every replica in the ISR has caught up
if (leaderReplica.highWatermark.messageOffset >= requiredOffset) {
/*
* The topic may be configured not to accept messages if there are not enough replicas in ISR
* in this scenario the request was already appended locally and then added to the purgatory before the ISR was shrunk
*/
if (minIsr <= curInSyncReplicas.size)
(true, Errors.NONE)
else
// the minIsr check already passed when the messages were appended; the ISR has merely shrunk since then, and the messages have in fact been fetched by the replicas
(true, Errors.NOT_ENOUGH_REPLICAS_AFTER_APPEND)
} else
(false, Errors.NONE)
case None =>
(false, Errors.NOT_LEADER_FOR_PARTITION)
}
}
If, after these checks, no partition still has acksPending == true, forceComplete is executed to finish up; otherwise tryComplete returns false. DelayedProduce's onComplete is implemented as:
override def onComplete() {
val responseStatus = produceMetadata.produceStatus.mapValues(status => status.responseStatus)
responseCallback(responseStatus)
}
If checkEnoughReplicasReachOffset eventually returns true for every partition as described above, the operation completes and the response is sent to the client. If the replicas still have not caught up after the timeout, the onExpiration method runs:
override def onExpiration() {
produceMetadata.produceStatus.foreach { case (topicPartition, status) =>
if (status.acksPending) {
debug(s"Expiring produce request for partition $topicPartition with status $status")
DelayedProduceMetrics.recordExpiration(topicPartition)
}
}
}
If no replica acks are required at all, the response is likewise returned immediately:
if (delayedProduceRequestRequired(requiredAcks, entriesPerPartition, localProduceResults)) {
// ...
} else {
// we can respond immediately
val produceResponseStatus = produceStatus.mapValues(status => status.responseStatus)
responseCallback(produceResponseStatus)
}
This concludes the produce path.