How Kafka Sends Messages
Overall Send Flow
The Kafka producer sends messages asynchronously and returns a Future that represents the eventual result. The caller may also supply a callback, which is invoked once the Kafka broker acknowledges the record. Although this looks simple from the API (a minimal usage sketch follows the list below), the full path is fairly involved:
- The producer passes the record through the configured interceptors.
- The serializers turn the record's key and value into byte arrays.
- The default or configured partitioner computes the topic partition when one is not specified.
- The RecordAccumulator appends the record to a producer batch using the configured compression algorithm. At this point the message still sits in memory and has not been sent to a Kafka broker; the RecordAccumulator groups messages in memory by topic and partition.
- The sender thread groups batches destined for the same (leader) broker into a single request and sends them. Only at this point is the message actually transmitted to Kafka.
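Seen from the application side, all of this hides behind a single call. A minimal usage sketch (the broker address and topic name are placeholders):

import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns immediately with a Future; the callback fires once
            // the broker acknowledges the record (or the send fails).
            Future<RecordMetadata> future = producer.send(
                    new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null)
                            exception.printStackTrace();
                        else
                            System.out.printf("sent to %s-%d@%d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                    });
        }
    }
}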
The flow above is the main path from the producer creating a message to the message arriving at the broker. The producer creates a message, serializes and compresses it, and appends it to the local record collector (RecordAccumulator). The Sender keeps polling the collector and, once certain conditions are met, sends the queued data to the partition's leader node. There are two such conditions:
- the accumulated message size reaches a threshold, or
- the messages have waited longer than a time threshold (both knobs are shown in the snippet after this list).
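Both thresholds are ordinary producer configs. A snippet showing the two knobs (values are illustrative; 16384 happens to be the default batch.size, while linger.ms defaults to 0):

// Size threshold: a batch becomes sendable once it reaches batch.size bytes.
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
// Time threshold: a non-full batch is still sent after waiting linger.ms milliseconds.
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);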
The producer creates one double-ended queue per partition to buffer client messages. Each element of the queue is a record batch (ProducerBatch); createdMs records the batch's creation time (the time its first message was appended) and topicPartition holds the corresponding partition metadata. After a message has been serialized it is written into the batch's recordsBuilder object. Once a batch in the queue reaches the size threshold, the Sender ships it to the partition's leader node; if a batch has waited longer than the time threshold, it is also sent to the leader, full or not.
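Conceptually the accumulator's layout looks like the following simplified sketch (not the actual Kafka classes; field names follow the description above):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.kafka.common.TopicPartition;

// Simplified model of the RecordAccumulator: one deque of batches per partition.
class AccumulatorSketch {
    static class BatchSketch {
        final long createdMs;                 // when the first record was appended
        final TopicPartition topicPartition;  // which partition the batch belongs to
        // in Kafka, serialized records are written into a MemoryRecordsBuilder here
        BatchSketch(TopicPartition tp, long now) {
            this.topicPartition = tp;
            this.createdMs = now;
        }
    }

    final ConcurrentMap<TopicPartition, Deque<BatchSketch>> batches = new ConcurrentHashMap<>();

    Deque<BatchSketch> getOrCreateDeque(TopicPartition tp) {
        return batches.computeIfAbsent(tp, k -> new ArrayDeque<>());
    }
}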
Code Walkthrough
The entry point is org.apache.kafka.clients.producer.KafkaProducer#send():
@Override
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
// intercept the record, which can be potentially modified; this method does not throw exceptions
ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
return doSend(interceptedRecord, callback);
}
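The interceptors invoked here implement org.apache.kafka.clients.producer.ProducerInterceptor and are registered through the interceptor.classes config. A hypothetical example, just to show where onSend() runs in the pipeline:

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Hypothetical interceptor that prefixes every value before serialization.
public class PrefixInterceptor implements ProducerInterceptor<String, String> {
    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // Runs on the send() path, before serialization and partitioning.
        return new ProducerRecord<>(record.topic(), record.partition(), record.timestamp(),
                record.key(), "prefix-" + record.value(), record.headers());
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // Runs when the broker acknowledges the record, or when the send fails.
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}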
The real work happens in org.apache.kafka.clients.producer.KafkaProducer#doSend:
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
TopicPartition tp = null;
try {
// Throw IllegalStateException if the producer has already been closed.
throwIfProducerClosed();
// first make sure the metadata for the topic is available
long nowMs = time.milliseconds();
ClusterAndWaitTime clusterAndWaitTime;
try {
// Make sure metadata for the target topic is available; throws KafkaException if the producer is closed while waiting.
clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), nowMs, maxBlockTimeMs);
} catch (KafkaException e) {
if (metadata.isClosed())
throw new KafkaException("Producer closed while send in progress", e);
throw e;
}
nowMs += clusterAndWaitTime.waitedOnMetadataMs;
long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
Cluster cluster = clusterAndWaitTime.cluster;
byte[] serializedKey;
try {
// Serialize the record key.
serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
" specified in key.serializer", cce);
}
byte[] serializedValue;
try {
// Serialize the record value.
serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
" specified in value.serializer", cce);
}
// Determine the target partition from the serialized key/value and the cluster metadata.
int partition = partition(record, serializedKey, serializedValue, cluster);
// Build the TopicPartition the record will be sent to.
tp = new TopicPartition(record.topic(), partition);
// Make the record headers read-only so they cannot be modified mid-send.
setReadOnly(record.headers());
Header[] headers = record.headers().toArray();
// Estimate the serialized size and check it is within limits; otherwise a RecordTooLargeException is thrown.
int serializedSize = AbstractRecords.estimateSizeInBytesUpperBound(apiVersions.maxUsableProduceMagic(),
compressionType, serializedKey, serializedValue, headers);
ensureValidRecordSize(serializedSize);
// If the record carries no timestamp, use the current time.
long timestamp = record.timestamp() == null ? nowMs : record.timestamp();
if (log.isTraceEnabled()) {
log.trace("Attempting to append record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
}
// producer callback will make sure to call both 'callback' and interceptor callback
// Wrap the user callback and the interceptors so both run on completion.
Callback interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
// If transactions are enabled, register the target partition with the transaction manager.
if (transactionManager != null) {
transactionManager.maybeAddPartition(tp);
}
// Append the record to the RecordAccumulator. If a new batch would be needed, the partition may be recomputed and the append retried.
RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey,
serializedValue, headers, interceptCallback, remainingWaitMs, true, nowMs);
// abortForNewBatch means the partitioner wants a fresh partition for the new batch (sticky partitioning): recompute the partition and retry the append once.
if (result.abortForNewBatch) {
int prevPartition = partition;
partitioner.onNewBatch(record.topic(), cluster, prevPartition);
partition = partition(record, serializedKey, serializedValue, cluster);
tp = new TopicPartition(record.topic(), partition);
if (log.isTraceEnabled()) {
log.trace("Retrying append due to new batch creation for topic {} partition {}. The old partition was {}", record.topic(), partition, prevPartition);
}
// producer callback will make sure to call both 'callback' and interceptor callback
interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
result = accumulator.append(tp, timestamp, serializedKey,
serializedValue, headers, interceptCallback, remainingWaitMs, false, nowMs);
}
// batchIsFull means the current batch is full; newBatchCreated means a new batch was created. Either way the sender thread should be woken up to transmit.
if (result.batchIsFull || result.newBatchCreated) {
log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
// Wake the sender thread so it can start sending.
this.sender.wakeup();
}
return result.future;
} catch (Exception e) {
// ....
}
}
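The partition(...) call above delegates to the configured Partitioner. For orientation, the default behavior is roughly the following (a paraphrased sketch, not the actual DefaultPartitioner source): keyed records are hashed with murmur2, while keyless records use a "sticky" partition that only changes when a batch completes, which is exactly why doSend retries the append after partitioner.onNewBatch(...).

// Paraphrased sketch of the default partitioning rule.
int choosePartition(byte[] keyBytes, int numPartitions, int stickyPartition) {
    if (keyBytes != null) {
        // keyed record: murmur2 hash of the serialized key, mapped onto a partition
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
    // keyless record: stick to one partition per batch to improve batching;
    // onNewBatch() picks a new sticky partition when the old batch fills up
    return stickyPartition;
}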
Appending the Message
accumulator.append()
public RecordAppendResult append(TopicPartition tp,
                                 long timestamp,
                                 byte[] key,
                                 byte[] value,
                                 Header[] headers,
                                 Callback callback,
                                 long maxTimeToBlock,
                                 boolean abortOnNewBatch,
                                 long nowMs) throws InterruptedException {
    // Track the number of threads doing appends, so batches are not lost when aborting incomplete batches.
    appendsInProgress.incrementAndGet();
    ByteBuffer buffer = null;
    if (headers == null) headers = Record.EMPTY_HEADERS;
    try {
        // Get or create the deque for this partition (each partition has exactly one deque).
        Deque<ProducerBatch> dq = getOrCreateDeque(tp);
        // Lock the deque and try to append to its last batch.
        synchronized (dq) {
            if (closed)
                throw new KafkaException("Producer closed while send in progress");
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
            if (appendResult != null)
                return appendResult;
        }
        // No batch with room left: a new one is needed. If the caller asked to abort instead
        // (abortOnNewBatch, used by the sticky partitioner), report back without allocating.
        if (abortOnNewBatch)
            return new RecordAppendResult(null, false, false, true);
        byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
        int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
        log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
        // Allocate a buffer of at least batch.size (default 16384 bytes, i.e. 16 KB).
        buffer = free.allocate(size, maxTimeToBlock);
        synchronized (dq) {
            if (closed)
                throw new KafkaException("Producer closed while send in progress");
            // Another thread may have created a batch while we were allocating; try the tail batch again first.
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
            if (appendResult != null) {
                return appendResult;
            }
            MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
            ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, nowMs);
            // First append for this partition (the tryAppend calls above found no room): write into the new batch.
            FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, headers, callback, nowMs));
            // Add the new batch to the tail of the deque.
            dq.addLast(batch);
            incomplete.add(batch);
            buffer = null;
            // Note: the deque counts as "full" when it holds more than one batch or the new batch itself is full.
            return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true, false);
        }
    } finally {
        if (buffer != null)
            free.deallocate(buffer);
        appendsInProgress.decrementAndGet();
    }
}
To append a message, we first fetch the deque owned by the partition and look at the last batch in it. If the deque holds no batch, or the last batch has reached the size threshold, a new batch is created and added at the tail. Batches created earlier fill up first; the batch at the tail always holds the newest messages, and appends always go to the tail. The record collector only buffers client messages; the Sender is still needed to move them to the partition leaders.
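The tryAppend helper used twice above is not shown; roughly (paraphrased from the same class, details vary by version), it peeks at the tail batch and returns null when there is no room, which is the caller's signal to allocate a new batch:

// Paraphrased sketch of RecordAccumulator#tryAppend.
private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers,
                                     Callback callback, Deque<ProducerBatch> deque, long nowMs) {
    ProducerBatch last = deque.peekLast();
    if (last != null) {
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, nowMs);
        if (future != null)
            return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false);
        // the tail batch has no room left: seal it so no further appends race into it
        last.closeForRecordAppends();
    }
    return null; // caller must allocate a new batch
}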
Sender: Sending the Messages
The Sender keeps polling the record accumulator and, when the send conditions are met, ships the queued data to the partition leader nodes.
It reads the accumulator to obtain the list of batches per leader node, finds the broker nodes that are ready and establishes connections to them, then sends each partition's batches to its leader. The core of the Sender:
void runOnce() {
// If a transaction manager is configured, run the transactional bookkeeping first.
if (transactionManager != null) {
try {
// Resolve any unresolved sequence numbers.
transactionManager.maybeResolveSequences();
// If the transaction manager is in a fatal state, abort batches if needed, poll, and bail out.
if (transactionManager.hasFatalError()) {
RuntimeException lastError = transactionManager.lastError();
if (lastError != null)
maybeAbortBatches(lastError);
client.poll(retryBackoffMs, time.milliseconds());
return;
}
// Request a new producerId if one is needed (sends an InitProducerId request).
// maybeSendAndPollTransactionalRequest() tries to send a pending transactional request and poll for the response; if it did, this iteration is done.
transactionManager.bumpIdempotentEpochAndResetIdIfNeeded();
if (maybeSendAndPollTransactionalRequest()) {
return;
}
} catch (AuthenticationException e) {
// This is already logged as error, but propagated here to perform any clean ups.
log.trace("Authentication exception while processing transactional request", e);
transactionManager.authenticationFailed(e);
}
}
// Send any accumulated producer data.
long currentTimeMs = time.milliseconds();
long pollTimeout = sendProducerData(currentTimeMs);
// Poll the network client and handle broker responses.
client.poll(pollTimeout, currentTimeMs);
}
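runOnce() itself is driven by Sender.run(), the body of the producer's I/O thread; a condensed sketch (the real method also keeps polling on close until pending records are drained):

public void run() {
    // main loop: keep calling runOnce() until close() flips the flag
    while (running) {
        runOnce();
    }
    // on shutdown the real implementation keeps polling until in-flight
    // and accumulated requests are drained, unless the close was forced
}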
The sendProducerData method
private long sendProducerData(long now) {
// Fetch the current cluster metadata.
Cluster cluster = metadata.fetch();
// Ask the accumulator which partitions have batches ready to send.
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
// If any partition's leader is unknown (leader election pending, or the topic may have expired),
// re-add those topics to the metadata and request an update, since there are messages waiting for them.
if (!result.unknownLeaderTopics.isEmpty()) {
for (String topic : result.unknownLeaderTopics)
this.metadata.add(topic, now);
log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
result.unknownLeaderTopics);
this.metadata.requestUpdate();
}
// Remove nodes that are not ready for I/O, and remember how soon the earliest of them will be.
Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
Node node = iter.next();
if (!this.client.ready(node, now)) {
iter.remove();
notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
}
}
// Drain the sendable batches per node, record them as in-flight, and, if message ordering must be guaranteed, mute the drained partitions so nothing more is sent to them in the meantime.
Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);
addToInflightBatches(batches);
if (guaranteeMessageOrder) {
// Mute all the partitions drained
for (List<ProducerBatch> batchList : batches.values()) {
for (ProducerBatch batch : batchList)
this.accumulator.mutePartition(batch.topicPartition);
}
}
accumulator.resetNextBatchExpiryTime();
// Collect batches that have expired, both in flight and still in the accumulator, and fail them with a TimeoutException.
List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
expiredBatches.addAll(expiredInflightBatches);
if (!expiredBatches.isEmpty())
log.trace("Expired {} batches in accumulator", expiredBatches.size());
for (ProducerBatch expiredBatch : expiredBatches) {
String errorMessage = "Expiring " + expiredBatch.recordCount + " record(s) for " + expiredBatch.topicPartition
+ ":" + (now - expiredBatch.createdMs) + " ms has passed since batch creation";
failBatch(expiredBatch, new TimeoutException(errorMessage), false);
if (transactionManager != null && expiredBatch.inRetry()) {
// Ensure no new batches are drained until the current in-flight batches are fully resolved.
transactionManager.markSequenceUnresolved(expiredBatch);
}
}
// Update the produce-request metrics.
sensors.updateProduceRequestMetrics(batches);
// Compute the next poll timeout; it drops to 0 when some node already has data ready.
long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
pollTimeout = Math.min(pollTimeout, this.accumulator.nextExpiryTimeMs() - now);
pollTimeout = Math.max(pollTimeout, 0);
if (!result.readyNodes.isEmpty()) {
log.trace("Nodes with data ready to send: {}", result.readyNodes);
pollTimeout = 0;
}
// Send the produce requests.
sendProduceRequests(batches, now);
return pollTimeout;
}
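The accumulator.ready(...) call at the top decides which leader nodes have sendable data. The per-deque test inside RecordAccumulator#ready is roughly the following (a paraphrased sketch):

// Paraphrased: when are a deque's batches sendable to their leader?
boolean full = deque.size() > 1 || batch.isFull();   // a completed batch is waiting
boolean expired = waitedTimeMs >= lingerMs;          // linger.ms has elapsed
boolean sendable = full || expired
        || exhausted            // threads are blocked waiting on the memory pool
        || closed               // the producer is closing and must flush
        || flushInProgress();   // flush() was called explicitly
if (sendable && !backingOff)
    readyNodes.add(leader);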
Sending the Produce Request
private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
if (batches.isEmpty())
return;
Map<TopicPartition, MemoryRecords> produceRecordsByPartition = new HashMap<>(batches.size());
final Map<TopicPartition, ProducerBatch> recordsByPartition = new HashMap<>(batches.size());
// Find the lowest magic (message format) version used across the batches; the request version must be one every batch supports.
byte minUsedMagic = apiVersions.maxUsableProduceMagic();
for (ProducerBatch batch : batches) {
if (batch.magic() < minUsedMagic)
minUsedMagic = batch.magic();
}
for (ProducerBatch batch : batches) {
TopicPartition tp = batch.topicPartition;
MemoryRecords records = batch.records();
if (!records.hasMatchingMagic(minUsedMagic))
records = batch.records().downConvert(minUsedMagic, 0, time).records();
produceRecordsByPartition.put(tp, records);
recordsByPartition.put(tp, batch);
}
String transactionalId = null;
if (transactionManager != null && transactionManager.isTransactional()) {
transactionalId = transactionManager.transactionalId();
}
// Build the produce request.
ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
produceRecordsByPartition, transactionalId);
RequestCompletionHandler callback = new RequestCompletionHandler() {
public void onComplete(ClientResponse response) {
// Process the produce response.
handleProduceResponse(response, recordsByPartition, time.milliseconds());
}
};
String nodeId = Integer.toString(destination);
// expectResponse is (acks != 0); 'callback' is the response handler registered above.
ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
requestTimeoutMs, callback);
// Hand the request to the network client.
client.send(clientRequest, now);
log.trace("Sent produce request to {}: {}", nodeId, requestBuilder);
}
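Note that expectResponse is acks != 0: with acks=0 the broker sends no response at all, which is why handleProduceResponse below has a dedicated branch for it. The behavior is chosen by the ordinary acks producer config:

// acks=0:   fire-and-forget, no broker response.
// acks=1:   the leader acknowledges after its local write.
// acks=all: the leader waits for all in-sync replicas (strongest guarantee).
props.put(ProducerConfig.ACKS_CONFIG, "all");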
Handling the Response
handleProduceResponse
private void handleProduceResponse(ClientResponse response, Map<TopicPartition, ProducerBatch> batches, long now) {
RequestHeader requestHeader = response.requestHeader();
long receivedTimeMs = response.receivedTimeMs();
int correlationId = requestHeader.correlationId();
// The connection was dropped: fail every batch in the request with NETWORK_EXCEPTION.
if (response.wasDisconnected()) {
log.trace("Cancelled request with header {} due to node {} being disconnected",
requestHeader, response.destination());
for (ProducerBatch batch : batches.values())
completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION), correlationId, now, 0L);
// The broker does not support this request version: fail with UNSUPPORTED_VERSION.
} else if (response.versionMismatch() != null) {
log.warn("Cancelled request {} due to a version mismatch with node {}",
response, response.destination(), response.versionMismatch());
for (ProducerBatch batch : batches.values())
completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.UNSUPPORTED_VERSION), correlationId, now, 0L);
} else {
log.trace("Received produce response from node {} with correlation id {}", response.destination(), correlationId);
// The response carries a body (acks != 0); successful sends land here.
// Complete each batch with its partition-level response.
if (response.hasResponse()) {
ProduceResponse produceResponse = (ProduceResponse) response.responseBody();
for (Map.Entry<TopicPartition, ProduceResponse.PartitionResponse> entry : produceResponse.responses().entrySet()) {
TopicPartition tp = entry.getKey();
ProduceResponse.PartitionResponse partResp = entry.getValue();
ProducerBatch batch = batches.get(tp);
completeBatch(batch, partResp, correlationId, now, receivedTimeMs + produceResponse.throttleTimeMs());
}
this.sensors.recordLatency(response.destination(), response.requestLatencyMs());
} else {
// No response body means acks == 0: just complete every batch as successful.
for (ProducerBatch batch : batches.values()) {
completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NONE), correlationId, now, 0L);
}
}
}
}
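completeBatch ultimately resolves each record's FutureRecordMetadata and fires the callbacks registered in doSend. A caller that wants synchronous semantics can therefore simply block on the Future returned by send() (a minimal sketch; the topic name is a placeholder):

try {
    RecordMetadata metadata = producer.send(new ProducerRecord<>("demo-topic", "k", "v"))
                                      .get(); // blocks until the batch is completed
    System.out.printf("acked at offset %d%n", metadata.offset());
} catch (ExecutionException e) {
    // a broker-side error from the PartitionResponse surfaces here as the cause
    e.getCause().printStackTrace();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}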