My personal GitHub: github.com/AmbitionTn/… . I hope the article below is helpful; if you find it interesting, feel free to follow, as the series is continuously updated while I keep learning.
The source code in this series is based on Kafka 3.3.
Previous article: Kafka Source Code Analysis 2: How the Producer Sends Messages to the Buffer
Overview
The previous article explained how KafkaProducer's send method writes a message into the buffer, and mentioned that a background Sender thread is responsible for the actual sending. This article looks at what the Sender thread actually does.
Sequence diagram of the Sender thread logic
The diagram below only outlines the main steps executed by Sender's run method; the steps inside the related classes are not covered in detail. For a deeper look, refer to the annotated source code below, or to the Kafka 3.3 source itself.
Sender fields
public class Sender implements Runnable {
    private final Logger log;
    /* the state of each node's connection */
    // KafkaClient: the client that performs the actual network I/O; Kafka implements it with NetworkClient
    private final KafkaClient client;
    /* the record accumulator that batches records */
    // the in-memory message buffer
    private final RecordAccumulator accumulator;
    /* the metadata for the client */
    // metadata: the Cluster, topics, partitions, and so on
    private final ProducerMetadata metadata;
    /* the flag indicating whether the producer should guarantee the message order on the broker or not. */
    // whether the producer should guarantee message ordering
    private final boolean guaranteeMessageOrder;
    /* the maximum request size to attempt to send to the server */
    // the maximum request size, in bytes, sent to a broker (1 MB by default)
    private final int maxRequestSize;
    /* the number of acknowledgements to request from the server */
    // acks: 0 = don't wait for the server; 1 = the leader has persisted the record; -1 = all in-sync followers have replicated it
    private final short acks;
    /* the number of times to retry a failed request before giving up */
    // how many times a failed send is retried
    private final int retries;
    /* the clock instance used for getting the time */
    private final Time time;
    /* true while the sender thread is still running */
    // marks whether the sender thread is still running
    private volatile boolean running;
    /* true when the caller wants to ignore all unsent/inflight messages and force close. */
    // whether to force close
    private volatile boolean forceClose;
    /* metrics */
    // records monitoring metrics
    private final SenderMetrics sensors;
    /* the max time to wait for the server to respond to the request */
    // maximum time to wait for the server to respond to a request
    private final int requestTimeoutMs;
    /* The max time to wait before retrying a request which has failed */
    // backoff between retries of a failed request
    private final long retryBackoffMs;
    /* current request API versions supported by the known brokers */
    // API versions supported by the known brokers
    private final ApiVersions apiVersions;
    /* all the state related to transactions, in particular the producer id, producer epoch, and sequence numbers */
    // the transaction manager
    private final TransactionManager transactionManager;
    // A per-partition queue of batches ordered by creation time for tracking the in-flight batches
    // in-flight batches: as the name suggests, per-partition batches that have been sent but not yet acknowledged
    private final Map<TopicPartition, List<ProducerBatch>> inFlightBatches;
}
- The fields above show that Sender holds a reference to a KafkaClient (NetworkClient). Sender does not perform network I/O itself; it is only a background thread that drains data from the RecordAccumulator buffer and hands it to NetworkClient for sending. The sketch below shows how this thread gets started.
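For context, KafkaProducer wraps the Sender in a daemon I/O thread inside its constructor. This is a simplified sketch, lightly trimmed from the Kafka 3.3 KafkaProducer constructor:

```java
// Inside the KafkaProducer constructor (simplified): the Sender is a Runnable,
// wrapped in a daemon KafkaThread so batching and sending happen off the caller's thread.
this.sender = newSender(logContext, kafkaClient, this.metadata);
String ioThreadName = NETWORK_THREAD_PREFIX + " | " + clientId;
this.ioThread = new KafkaThread(ioThreadName, this.sender, true); // true = daemon thread
this.ioThread.start();
```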
What the Sender thread's run method does
/**
 * The main run loop for the sender thread
 */
@Override
public void run() {
    log.debug("Starting Kafka producer I/O thread.");
    // main loop, runs until close is called
    /**
     * running is set to true in the Sender constructor.
     * Main loop: keeps going until running becomes false.
     */
    while (running) {
        try {
            runOnce();
        } catch (Exception e) {
            log.error("Uncaught error in kafka producer I/O thread: ", e);
        }
    }
    log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");
    // okay we stopped accepting requests but there may still be
    // requests in the transaction manager, accumulator or waiting for acknowledgment,
    // wait until these are completed.
    /**
     * We only reach this point once running is false.
     * As long as forceClose is false, runOnce() keeps being called until pending
     * transactional requests, undrained data in the accumulator, and requests
     * awaiting acknowledgment have all completed.
     */
    while (!forceClose && ((this.accumulator.hasUndrained() || this.client.inFlightRequestCount() > 0) || hasPendingTransactionalRequests())) {
        try {
            runOnce();
        } catch (Exception e) {
            log.error("Uncaught error in kafka producer I/O thread: ", e);
        }
    }
    // Abort the transaction if any commit or abort didn't go through the transaction manager's queue
    /**
     * If a commit or abort did not go through the transaction manager's queue,
     * abort the still-ongoing transaction here.
     */
    while (!forceClose && transactionManager != null && transactionManager.hasOngoingTransaction()) {
        if (!transactionManager.isCompleting()) {
            log.info("Aborting incomplete transaction due to shutdown");
            transactionManager.beginAbort();
        }
        try {
            runOnce();
        } catch (Exception e) {
            log.error("Uncaught error in kafka producer I/O thread: ", e);
        }
    }
    /**
     * On force close, all incomplete transactional requests and batches are failed,
     * and the threads waiting on their futures are woken up.
     */
    if (forceClose) {
        // We need to fail all the incomplete transactional requests and batches and wake up the threads waiting on
        // the futures.
        if (transactionManager != null) {
            log.debug("Aborting incomplete transactional requests due to forced shutdown");
            transactionManager.close();
        }
        log.debug("Aborting incomplete batches due to forced shutdown");
        this.accumulator.abortIncompleteBatches();
    }
    try {
        this.client.close();
    } catch (Exception e) {
        log.error("Failed to close network client", e);
    }
    log.debug("Shutdown of Kafka producer I/O thread has completed.");
}
As shown above, the loop keeps sending as long as running is true. Once running becomes false, the method decides in which cases sending should still be completed and in which cases it should be aborted (the close methods that flip these flags are shown after this list):
- If there are pending transactional requests, undrained data in the accumulator, or requests awaiting acknowledgment, and forceClose is false, runOnce() keeps being called until they are all completed.
- If a transaction is still ongoing and no commit or abort is in progress, it is aborted; under forceClose, all incomplete transactional requests and batches are failed and the waiting threads are woken up.
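For reference, the running and forceClose flags are flipped by Sender's close methods, lightly trimmed here from the same class: a graceful close goes through initiateClose(), while forceClose() additionally sets the forceClose flag.

```java
public void initiateClose() {
    // Close the accumulator first so no more appends are accepted after
    // the sender loop exits, then break the main loop and wake up the client.
    this.accumulator.close();
    this.running = false;
    this.wakeup();
}

public void forceClose() {
    // Skip waiting for unsent/in-flight messages: run() will fail them instead.
    this.forceClose = true;
    initiateClose();
}
```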
Details
What runOnce does
void runOnce() {
    /**
     * Check whether this is a transactional producer.
     */
    if (transactionManager != null) {
        try {
            transactionManager.maybeResolveSequences();
            // do not continue sending if the transaction manager is in a failed state
            // if the transaction manager has a fatal error, stop and don't send anything further
            if (transactionManager.hasFatalError()) {
                RuntimeException lastError = transactionManager.lastError();
                if (lastError != null)
                    maybeAbortBatches(lastError);
                client.poll(retryBackoffMs, time.milliseconds());
                return;
            }
            // Check whether we need a new producerId. If so, we will enqueue an InitProducerId
            // request which will be sent below
            // check whether a new producer id is needed; if so, one will be requested
            transactionManager.bumpIdempotentEpochAndResetIdIfNeeded();
            if (maybeSendAndPollTransactionalRequest()) {
                return;
            }
        } catch (AuthenticationException e) {
            // This is already logged as error, but propagated here to perform any clean ups.
            log.trace("Authentication exception while processing transactional request", e);
            transactionManager.authenticationFailed(e);
        }
    }
    long currentTimeMs = time.milliseconds();
    // send the data
    long pollTimeout = sendProducerData(currentTimeMs);
    client.poll(pollTimeout, currentTimeMs);
}
- TransactionManager manages transaction state and also provides the sequencing that makes producer sends idempotent. See the configuration sketch below.
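To see when transactionManager is non-null in practice: KafkaProducer creates it when idempotence or transactions are enabled. A minimal usage sketch, in which the bootstrap address, transactional id, and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);    // idempotent sends
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-tx"); // enables full transactions

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
            producer.commitTransaction(); // on failure, call producer.abortTransaction() instead
        }
    }
}
```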
How sendProducerData validates data and assembles requests
private long sendProducerData(long now) {
    /**
     * Step 1: fetch the cluster metadata.
     * On the very first pass the cache is empty, so no metadata is available yet,
     * and none of the code below can do anything useful, since it all depends on
     * this metadata.
     *
     * As seen later in runOnce(), client.poll(pollTimeout, currentTimeMs)
     * is the call that actually fetches the metadata.
     */
    Cluster cluster = metadata.fetch();
    // get the list of partitions with data ready to send
    /**
     * Step 2: determine which partitions have data ready to send, and resolve the
     * brokers hosting their leader partitions, i.e. which brokers we can send to.
     */
    RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
    // if there are any partitions whose leaders are not known yet, force metadata update
    /**
     * Step 3: metadata validation.
     * If any partition with data to send has no known leader, force a metadata update.
     */
    if (!result.unknownLeaderTopics.isEmpty()) {
        // The set of topics with unknown leader contains topics with leader election pending as well as
        // topics which may have expired. Add the topic again to metadata to ensure it is included
        // and request metadata update, since there are messages to send to the topic.
        for (String topic : result.unknownLeaderTopics)
            this.metadata.add(topic, now);
        log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
            result.unknownLeaderTopics);
        // this only sets needFullUpdate to true; it does not actually fetch the metadata
        this.metadata.requestUpdate();
    }
    // remove any nodes we aren't ready to send to
    Iterator<Node> iter = result.readyNodes.iterator();
    long notReadyTimeout = Long.MAX_VALUE;
    while (iter.hasNext()) {
        Node node = iter.next();
        /**
         * Step 4: check whether the network connection to the target node is established.
         */
        if (!this.client.ready(node, now)) {
            // Update just the readyTimeMs of the latency stats, so that it moves forward
            // every time the batch is ready (then the difference between readyTimeMs and
            // drainTimeMs would represent how long data is waiting for the node).
            // the connection is not ready yet
            this.accumulator.updateNodeLatencyStats(node.id(), now, false);
            // remove this broker from the set of nodes to send to
            iter.remove();
            notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
        } else {
            // Update both readyTimeMs and drainTimeMs, this would "reset" the node
            // latency.
            this.accumulator.updateNodeLatencyStats(node.id(), now, true);
        }
    }
    /**
     * Step 5: create produce requests.
     * There may be many partitions to send to, and different partitions may live
     * on the same node, e.g.:
     *   partition0 -> node 0
     *   partition1 -> node 0
     *   partition2 -> node 1
     *   partition3 -> node 1
     * To reduce network overhead, batches destined for the same node are grouped together.
     */
    // create produce requests
    Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);
    // add them to the inFlightBatches queue
    addToInflightBatches(batches);
    // preserve message ordering
    if (guaranteeMessageOrder) {
        // Mute all the partitions drained
        for (List<ProducerBatch> batchList : batches.values()) {
            for (ProducerBatch batch : batchList)
                this.accumulator.mutePartition(batch.topicPartition);
        }
    }
    /**
     * Step 6: handle expired data.
     */
    // reset the accumulator's next batch expiry time
    accumulator.resetNextBatchExpiryTime();
    // batches that have been in flight for too long
    List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
    // batches that have sat in the accumulator too long and are expiring
    List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
    // merge the expired in-flight batches into the expired list
    expiredBatches.addAll(expiredInflightBatches);
    // Reset the producer id if an expired batch has previously been sent to the broker. Also update the metrics
    // for expired batches. see the documentation of @TransactionState.resetIdempotentProducerId to understand why
    // we need to reset the producer id here.
    // if an expired batch was already sent to the broker, the producer id needs to be reset
    if (!expiredBatches.isEmpty())
        log.trace("Expired {} batches in accumulator", expiredBatches.size());
    for (ProducerBatch expiredBatch : expiredBatches) {
        String errorMessage = "Expiring " + expiredBatch.recordCount + " record(s) for " + expiredBatch.topicPartition
            + ":" + (now - expiredBatch.createdMs) + " ms has passed since batch creation";
        failBatch(expiredBatch, new TimeoutException(errorMessage), false);
        if (transactionManager != null && expiredBatch.inRetry()) {
            // This ensures that no new batches are drained until the current in flight batches are fully resolved.
            transactionManager.markSequenceUnresolved(expiredBatch);
        }
    }
    sensors.updateProduceRequestMetrics(batches);
    // If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
    // loop and try sending more data. Otherwise, the timeout will be the smaller value between next batch expiry
    // time, and the delay time for checking data availability. Note that the nodes may have data that isn't yet
    // sendable due to lingering, backing off, etc. This specifically does not include nodes with sendable data
    // that aren't ready to send since they would cause busy looping.
    long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
    pollTimeout = Math.min(pollTimeout, this.accumulator.nextExpiryTimeMs() - now);
    pollTimeout = Math.max(pollTimeout, 0);
    // pollTimeout says how long the next poll may block; if there is still ready data, it is sent immediately
    if (!result.readyNodes.isEmpty()) {
        log.trace("Nodes with data ready to send: {}", result.readyNodes);
        // if some partitions are already ready to be sent, the select time would be 0;
        // otherwise if some partition already has some data accumulated but not ready yet,
        // the select time will be the time difference between now and its linger expiry time;
        // otherwise the select time will be the time difference between now and the metadata expiry time;
        pollTimeout = 0;
    }
    /**
     * Step 7: pack the data destined for the same broker together and send it via NetworkClient.
     * Network resources are precious in a cluster; one request per partition would be
     * wasteful, so the batches are packed per broker.
     *
     * This also attaches the transactional id and calls client.send(clientRequest, now),
     * i.e. the actual send goes through NetworkClient.
     */
    sendProduceRequests(batches, now);
    return pollTimeout;
}
The code above shows that sendProducerData breaks down into seven steps:
- Step 1: fetch the cluster metadata. On the very first pass nothing has been fetched yet; this step only triggers the conditions for an update, and the actual fetch is performed by the final client.poll() call.
- Step 2: using the cluster metadata, determine which batches in the accumulator are ready to send.
- Step 3: validate the metadata of the ready data, e.g. check whether some partition still has no known leader.
- Step 4: check whether the NetworkClient has established a connection to each target node.
- Step 5: group the data by broker, since one broker may host many partitions (see the sketch after this list).
- Step 6: handle send timeouts and expired batches.
- Step 7: pack the grouped data and send it through NetworkClient, which relies on a Selector underneath to do the network I/O.
Note: client.send() only places the request in a pending queue; it can actually go out only once the network connection to the node is established.
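To make step 5 concrete, here is a small self-contained sketch of the grouping idea behind accumulator.drain(): ready batches are keyed by the leader node's id rather than by partition, so a single produce request per broker suffices. The Batch record and its fields are illustrative stand-ins, not Kafka's actual types:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DrainSketch {
    // Illustrative stand-in for ProducerBatch: a batch plus its leader's node id.
    record Batch(String topicPartition, int leaderNodeId) {}

    // Group ready batches by the broker hosting the partition leader, mirroring
    // the Map<Integer, List<ProducerBatch>> that drain() returns.
    static Map<Integer, List<Batch>> groupByLeaderNode(List<Batch> readyBatches) {
        Map<Integer, List<Batch>> byNode = new HashMap<>();
        for (Batch batch : readyBatches) {
            byNode.computeIfAbsent(batch.leaderNodeId(), id -> new ArrayList<>())
                  .add(batch);
        }
        return byNode; // one produce request can now be built per node
    }
}
```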
Summary
This article covered what the Kafka Sender thread does and how the messages in the accumulator are validated, grouped, and handed over to NetworkClient. A couple of design ideas are worth taking away:
- Consider lazy loading when designing a system. Kafka metadata is loaded lazily: on the first pass nothing is fetched; a flag is merely set to mark that a forced update is needed, and the actual load happens later when client.poll() runs (see the sketch after this list).
- While a background thread runs, a computed timeout controls how long each loop iteration waits, which avoids busy polling.
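As a closing illustration of the first point, here is a hypothetical minimal sketch of the "set a flag now, fetch later" pattern; it is not Kafka's actual Metadata class, just the idea it embodies:

```java
import java.util.Map;

class LazyMetadataSketch {
    private volatile boolean needFullUpdate = false;
    private volatile Map<String, Integer> leadersByTopic = Map.of(); // cached view, empty at first

    // Hot path: cheap and non-blocking, only marks that an update is needed.
    void requestUpdate() {
        needFullUpdate = true;
    }

    // Returns whatever is cached; may be empty on the first pass.
    Map<String, Integer> fetch() {
        return leadersByTopic;
    }

    // Called from the I/O loop (the analogue of client.poll()): does the real,
    // potentially expensive work only if someone asked for it.
    void maybeUpdate() {
        if (needFullUpdate) {
            leadersByTopic = loadFromCluster();
            needFullUpdate = false;
        }
    }

    private Map<String, Integer> loadFromCluster() {
        return Map.of("demo-topic", 0); // stand-in for a real metadata request
    }
}
```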