Preface
Key terms
- Broker: a physical node of the Kafka service. In this version a broker registers its own ID in ZooKeeper to achieve high availability; a single-machine setup has only one broker.
- Topic: a message topic; consumers obtain messages by subscribing to topics.
- Partition: a message partition. A topic can have multiple partitions, and each record is routed to one partition according to the partitioning rule; within a consumer group, a partition is consumed by only one consumer.
Source code
Sending a message with the producer involves the following classes:
- KafkaProducer: the main producer class; it coordinates the other components.
- RecordAccumulator: the message accumulator. A record being sent is handed to the accumulator, which stores it grouped by topic and partition. One network I/O per record would be wasteful, so records are collected temporarily and sent in batches once enough have accumulated or enough time has passed.
- Sender: pulls records from the RecordAccumulator, builds requests and submits them for sending.
- NetworkClient: mainly drives KafkaChannel to send and read data and handles the results of successful sends and reads; it is the core of the event loop.
RecordAccumulator (message accumulator)
Why look at RecordAccumulator first? It stores batches of records and covers how records are handled, and it has little coupling to the other classes, so its source can be analyzed on its own.
private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;
This is the main structure: a Deque of ProducerBatch instances per TopicPartition.
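For reference, the deque for a partition is looked up (and created on first use) by getOrCreateDeque, which is referenced in append below. A simplified sketch of that lookup, assuming the get-then-putIfAbsent pattern of the real class (which stores the deques in a copy-on-write map):

```java
// Simplified sketch: find the deque for a partition, creating it on first use.
private Deque<ProducerBatch> getOrCreateDeque(TopicPartition tp) {
    Deque<ProducerBatch> d = this.batches.get(tp);
    if (d != null)
        return d;
    d = new ArrayDeque<>();
    Deque<ProducerBatch> previous = this.batches.putIfAbsent(tp, d);
    if (previous == null)
        return d;        // we won the race, use the deque we just created
    else
        return previous; // another thread created one first, use theirs
}
```

The append method itself: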
public RecordAppendResult append(TopicPartition tp,
long timestamp,
byte[] key,
byte[] value,
Header[] headers,
Callback callback,
long maxTimeToBlock,
boolean abortOnNewBatch) throws InterruptedException {
appendsInProgress.incrementAndGet();
ByteBuffer buffer = null;
if (headers == null) headers = Record.EMPTY_HEADERS;
try {
//get or create the deque that holds records for this partition
Deque<ProducerBatch> dq = getOrCreateDeque(tp);
synchronized (dq) {
if (closed)
throw new KafkaException("Producer closed while send in progress");
RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
if (appendResult != null)
return appendResult;
}
//no existing batch accepted the record; if abortOnNewBatch is true, return without creating one
if (abortOnNewBatch) {
return new RecordAppendResult(null, false, false, true);
}
The first half of append: it looks up the Deque for the topic-partition (creating it if absent) and calls tryAppend to append the record to that Deque.
Note the abortOnNewBatch parameter: if tryAppend fails and this flag is true, the current append is aborted. So how can tryAppend fail?
private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers,
Callback callback, Deque<ProducerBatch> deque) {
ProducerBatch last = deque.peekLast();
if (last != null) {
FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, time.milliseconds());
if (future == null)
last.closeForRecordAppends();
else
return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false);
}
return null;
}
tryAppend takes the most recent ProducerBatch from the Deque. If there is none it returns null, i.e. it fails (it also fails when the last batch has no room, in which case that batch is closed for further appends). In other words, on the very first send, or once the previous batches have already been drained for sending, the append fails here.
byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
buffer = free.allocate(size, maxTimeToBlock);
synchronized (dq) {
if (closed)
throw new KafkaException("Producer closed while send in progress");
//try the append again now that we hold the lock
RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
if (appendResult != null) {
return appendResult;
}
//create a new batch
MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, time.milliseconds());
FutureRecordMetadata future = Objects.requireNonNull(batch.tryAppend(timestamp, key, value, headers,
callback, time.milliseconds()));
dq.addLast(batch);
incomplete.add(batch);
buffer = null;
return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true, false);
}
} finally {
if (buffer != null)
free.deallocate(buffer);
appendsInProgress.decrementAndGet();
}
If abortOnNewBatch is false, execution continues into the code below: a new ProducerBatch is created, the record is appended to it, and the batch is added to the deque.
++appendsInProgress is used here to indicate whether an append operation is currently in progress++
The accumulator's append only puts the record into a ProducerBatch; multiple batches form a Deque per partition.
public ReadyCheckResult ready(Cluster cluster, long nowMs) {
Set<Node> readyNodes = new HashSet<>();
long nextReadyCheckDelayMs = Long.MAX_VALUE;
Set<String> unknownLeaderTopics = new HashSet<>();
boolean exhausted = this.free.queued() > 0;
for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {
Deque<ProducerBatch> deque = entry.getValue();
synchronized (deque) {
ProducerBatch batch = deque.peekFirst();
if (batch != null) {
TopicPartition part = entry.getKey();
Node leader = cluster.leaderFor(part);
if (leader == null) {
//add to the set of topics whose leader is unknown
unknownLeaderTopics.add(part.topic());
} else if (!readyNodes.contains(leader) && !isMuted(part, nowMs)) {
long waitedTimeMs = batch.waitedTimeMs(nowMs);
boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
boolean full = deque.size() > 1 || batch.isFull();
boolean expired = waitedTimeMs >= timeToWaitMs;
//sendable if any of the following holds: the batch is full, it has waited long enough, the accumulator is out of memory (threads are blocked waiting for space), or the accumulator is closed / a flush is in progress
boolean sendable = full || expired || exhausted || closed || flushInProgress();
if (sendable && !backingOff) {
readyNodes.add(leader);
} else {
long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
//time until the next ready check
nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
}
}
}
}
}
return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}
What this code decides: whether data can be sent now; nodes that are ready to receive data are added to readyNodes. The logic:
- Iterate over each topic-partition and find the partition's leader node, because only the leader handles produce requests.
- Data can be sent once any of the following holds:
    - the batch is full
    - the batch has waited long enough (the linger time, or the retry backoff, has elapsed)
    - the accumulator is out of memory and threads are blocked waiting for space
    - the accumulator has been closed (or a flush is in progress)
- Return the set of ready nodes, the delay until the next check, and the topics whose leader is unknown.
The ready method decides whether the buffered records can be sent; the producer settings behind these conditions are sketched below.
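From the user's side, the "full" and "waited long enough" conditions map to the batch.size, linger.ms and retry.backoff.ms settings, and the "out of memory" condition to buffer.memory. A configuration sketch (the values are illustrative, not recommendations):

```java
// Producer settings behind the conditions checked in ready():
Properties props = new Properties();
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);                // batch.size: max bytes per ProducerBatch -> "full"
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);                     // linger.ms: how long a non-full batch may wait
props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);             // retry.backoff.ms: wait before re-sending a failed batch
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 32 * 1024 * 1024L);  // buffer.memory: total accumulator memory
```

The drain method: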
public Map<Integer, List<ProducerBatch>> drain(Cluster cluster, Set<Node> nodes, int maxSize, long now) {
if (nodes.isEmpty())
return Collections.emptyMap();
Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
for (Node node : nodes) {
//collect the batches destined for this node
List<ProducerBatch> ready = drainBatchesForOneNode(cluster, node, maxSize, now);
batches.put(node.id(), ready);
}
return batches;
}
Since partitions are spread across different nodes, drain groups the batches going to the same node together.
The rough logic of drainBatchesForOneNode: take the partitions whose leader is this node, look up each Deque in the map, apply a few checks (for example whether the batch would push the request over maxSize), and return the ProducerBatches that are ready, as sketched below.
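A simplified sketch of that logic, assuming only the size limit matters (the real method also handles retries, transactions and muted partitions):

```java
// Simplified sketch of drainBatchesForOneNode: collect at most maxSize bytes of
// sealed batches for partitions whose leader is this node.
private List<ProducerBatch> drainBatchesForOneNode(Cluster cluster, Node node, int maxSize, long now) {
    int size = 0;
    List<ProducerBatch> ready = new ArrayList<>();
    // all partitions whose leader is this node
    for (PartitionInfo part : cluster.partitionsForNode(node.id())) {
        TopicPartition tp = new TopicPartition(part.topic(), part.partition());
        Deque<ProducerBatch> deque = getDeque(tp);
        if (deque == null)
            continue;
        synchronized (deque) {
            ProducerBatch first = deque.peekFirst();
            if (first == null)
                continue;
            // stop if this batch would push the request over maxSize (unless the request is still empty)
            if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty())
                break;
            ProducerBatch batch = deque.pollFirst();
            batch.close();                        // seal the batch; no more appends
            size += batch.records().sizeInBytes();
            ready.add(batch);
        }
    }
    return ready;
}
```

The accumulator also supports aborting batches: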
void abortBatches(final RuntimeException reason) {
for (ProducerBatch batch : incomplete.copyAll()) {
Deque<ProducerBatch> dq = getDeque(batch.topicPartition);
synchronized (dq) {
batch.abortRecordAppends();
dq.remove(batch);
}
batch.abort(reason);
deallocate(batch);
}
}
Records already handed to the accumulator can still be aborted: each batch is removed from its deque and the onCompletion callback is triggered with the failure reason.
Accumulator summary: the accumulator is the staging area for outgoing records. Submitted records are stored in a map keyed by topic-partition; ready decides whether they can be sent, and if they can, drain groups them by destination node; if something goes wrong the batches can be aborted.
Partitioner
public interface Partitioner extends Configurable, Closeable {
//compute the partition for a record
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster);
public void close();
//called when a new batch is about to be created
default public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
}
}
Partitioner is the interface for computing partitions. Which partition a record goes to is up to us: we can implement this interface to define a custom partitioning rule, as in the sketch below, or use one of the built-in rules.
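For example, a custom partitioner that sends every record whose key starts with "audit" to partition 0 and hashes everything else might look like this (a minimal sketch; the class name and the routing rule are made up for illustration):

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical custom partitioner: "audit*" keys go to partition 0, everything else is hashed.
public class AuditAwarePartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key instanceof String && ((String) key).startsWith("audit"))
            return 0;
        if (keyBytes == null)
            return 0; // no key: a real implementation would pick something smarter
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }
}
```

It is enabled with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, AuditAwarePartitioner.class.getName()).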
RoundRobinPartitioner (round-robin partitioner)
private final ConcurrentMap<String, AtomicInteger> topicCounterMap = new ConcurrentHashMap<>();
private int nextValue(String topic) {
AtomicInteger counter = topicCounterMap.computeIfAbsent(topic, k -> new AtomicInteger(0));
return counter.getAndIncrement();
}
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
int numPartitions = partitions.size();
//per-topic counter, incremented on every call
int nextValue = nextValue(topic);
List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
if (!availablePartitions.isEmpty()) {
//prefer available partitions: pick one by taking the counter modulo their count
int part = Utils.toPositive(nextValue) % availablePartitions.size();
return availablePartitions.get(part).partition();
} else {
// no available partitions; fall back to all partitions, even currently unavailable ones
return Utils.toPositive(nextValue) % numPartitions;
}
}
The round-robin implementation is simple: a per-topic counter taken modulo the number of partitions. It is not the default, so it has to be selected explicitly, as shown below.
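Assuming a props Properties object like the one used to build the producer, selecting it looks like this:

```java
// Select the round-robin partitioner explicitly (DefaultPartitioner is used otherwise)
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,
          "org.apache.kafka.clients.producer.RoundRobinPartitioner");
```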
DefaultPartitioner (the default partitioner)
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
if (keyBytes == null) {
return stickyPartitionCache.partition(topic, cluster);
}
List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
int numPartitions = partitions.size();
return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}
public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
stickyPartitionCache.nextPartition(topic, cluster, prevPartition);
}
If keyBytes (the serialized key) is not null, the partition is chosen by hashing the key; otherwise the StickyPartitionCache is used.
++StickyPartitionCache keeps sending records to the same partition until that partition's batch in the accumulator fills up and is sent, and only then switches to another partition. At high send rates this reduces per-record partitioning work and produces fuller batches, and as the system keeps running the load across partitions still evens out++
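A simplified sketch of the sticky-cache idea (the real StickyPartitionCache also guards against concurrent switches; the class name here is just for illustration):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

// Keep returning the same partition for a topic until nextPartition() is asked to switch.
public class StickyPartitionCacheSketch {
    private final ConcurrentMap<String, Integer> indexCache = new ConcurrentHashMap<>();

    // used by partition(): return the cached partition, or pick one on first use
    public int partition(String topic, Cluster cluster) {
        Integer part = indexCache.get(topic);
        if (part != null)
            return part;
        return nextPartition(topic, cluster, -1);
    }

    // used by onNewBatch(): the current batch is full/sent, switch to another partition
    public int nextPartition(String topic, Cluster cluster, int prevPartition) {
        List<PartitionInfo> available = cluster.availablePartitionsForTopic(topic);
        List<PartitionInfo> candidates = available.isEmpty()
                ? cluster.partitionsForTopic(topic) : available;
        int newPart = prevPartition;
        // draw randomly until we land on a partition different from the previous one
        while (newPart == prevPartition && candidates.size() > 1)
            newPart = candidates.get(ThreadLocalRandom.current().nextInt(candidates.size())).partition();
        if (newPart == prevPartition)            // only one candidate partition exists
            newPart = candidates.get(0).partition();
        indexCache.put(topic, newPart);
        return newPart;
    }
}
```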
Summary: these are the built-in partitioning strategies. If none is specified explicitly, DefaultPartitioner is used: it hashes the key to compute the partition, and falls back to sticky partitioning when there is no key.
Sender (the message-sending thread)
Sender's job is to take the records buffered in the RecordAccumulator and send them.
public void run() {
log.debug("Starting Kafka producer I/O thread.");
// main loop, runs until close is called
while (running) {
try {
runOnce();
} catch (Exception e) {
log.error("Uncaught error in kafka producer I/O thread: ", e);
}
}
....
Sender is a Runnable; its run method is a while loop that keeps calling runOnce.
void runOnce() {
//the first part is related to the transaction manager
.....
if (transaction problem) {
maybeAbortBatches(lastError);
}
...
long currentTimeMs = time.milliseconds();
long pollTimeout = sendProducerData(currentTimeMs);
client.poll(pollTimeout, currentTimeMs);
}
This is pseudocode: the first half is transaction handling, and abortBatches is called if a problem occurs; the real work of sending records is done in sendProducerData.
The sendProducerData method goes through the following steps:
private long sendProducerData(long now) {
//read the cached metadata; the actual metadata fetch happens asynchronously
Cluster cluster = metadata.fetch();
// find the nodes that are ready to receive data
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
// if any partition's leader is still unknown, force a metadata update
if (!result.unknownLeaderTopics.isEmpty()) {
for (String topic : result.unknownLeaderTopics)
this.metadata.add(topic);
log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
result.unknownLeaderTopics);
this.metadata.requestUpdate();
}
This calls accumulator.ready, tying in with the analysis above; then a metadata update is requested for the partitions whose leader could not be found.
Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
Node node = iter.next();
if (!this.client.ready(node, now)) {
iter.remove();
notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
}
}
The second step checks connectivity for each ready node and removes the ones that are not actually ready.
Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);
addToInflightBatches(batches);
if (guaranteeMessageOrder) {
// Mute all the partitions drained
for (List<ProducerBatch> batchList : batches.values()) {
for (ProducerBatch batch : batchList)
this.accumulator.mutePartition(batch.topicPartition);
}
}
The third step calls drain to group the batches by node. If message ordering must be guaranteed, the drained partitions are muted so that only one batch per partition is in flight at a time.
The fourth step handles expired batches.
The fifth step calls sendProduceRequests(batches, now).
private void sendProduceRequests(Map<Integer, List<ProducerBatch>> collated, long now) {
for (Map.Entry<Integer, List<ProducerBatch>> entry : collated.entrySet())
sendProduceRequest(now, entry.getKey(), acks, requestTimeoutMs, entry.getValue());
}
private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
for (ProducerBatch batch : batches) {
TopicPartition tp = batch.topicPartition;
MemoryRecords records = batch.records();
ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
produceRecordsByPartition, transactionalId);
//completion callback
RequestCompletionHandler callback = response -> handleProduceResponse(response, recordsByPartition, time.milliseconds());
String nodeId = Integer.toString(destination);
ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
requestTimeoutMs, callback);
client.send(clientRequest, now);
}
}
This ends up in NetworkClient's send method, which ultimately calls KafkaChannel's setSend to stage the data for writing.
NetworkClient
The last line of Sender.runOnce is:
client.poll(pollTimeout, currentTimeMs);
It calls NetworkClient's poll method. We know that Selector also has a poll method, which selects on I/O events, performs the reads and writes and collects the results into sets; NetworkClient.poll is where those results are processed.
public List<ClientResponse> poll(long timeout, long now) {
ensureActive();
if (!abortedSends.isEmpty()) {
// sends aborted because of an unsupported-version exception or a disconnect are handled immediately, without waiting for Selector.poll
List<ClientResponse> responses = new ArrayList<>();
handleAbortedSends(responses);
completeResponses(responses);
return responses;
}
//compute the minimum time until the next metadata update
long metadataTimeout = metadataUpdater.maybeUpdate(now);
try {
this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));
} catch (IOException e) {
log.error("Unexpected error during I/O", e);
}
long updatedNow = this.time.milliseconds();
List<ClientResponse> responses = new ArrayList<>();
//handle sends that have completed
handleCompletedSends(responses, updatedNow);
//handle completed reads; the metadata update mentioned above is processed here
handleCompletedReceives(responses, updatedNow);
//handle disconnected nodes
handleDisconnections(responses, updatedNow);
//handle newly established connections
handleConnections();
handleInitiateApiVersionRequests(updatedNow);
handleTimedOutRequests(responses, updatedNow);
//invoke the callbacks of the completed responses
completeResponses(responses);
return responses;
}
As the comments show: the method first deals with metadata, then selector.poll handles the read/write events, and the rest of the method processes what those events produced. Since the producer client is mainly about sending, the read-handling logic is not very complex either.
@Override
public long maybeUpdate(long now) {
// time until the next metadata update: 0 if needUpdate has been set, otherwise the remaining time before the metadata expires
long timeToNextMetadataUpdate = metadata.timeToNextUpdate(now);
// if the previous metadata request has not been answered yet, wait up to defaultRequestTimeoutMs (30s by default)
long waitForMetadataFetch = hasFetchInProgress() ? defaultRequestTimeoutMs : 0;
long metadataTimeout = Math.max(timeToNextMetadataUpdate, waitForMetadataFetch);
//the update time has not arrived yet, so return how long we still need to wait
if (metadataTimeout > 0) {
return metadataTimeout;
}
//pick a relatively idle node to fetch the metadata from
Node node = leastLoadedNode(now);
if (node == null) {
log.debug("Give up sending metadata request since no node is available");
return reconnectBackoffMs;
}
//send the metadata request
return maybeUpdate(now, node);
}
This is where the client requests metadata (topics, partitions, nodes, and so on) from the cluster; leastLoadedNode picks a relatively idle node to send the request to.
The method ends up sending the request and returns the time to wait, which is used as the selector.poll timeout.
private void handleCompletedSends(List<ClientResponse> responses, long now) {
// if no response is expected then when the send is completed, return it
for (Send send : this.selector.completedSends()) {
InFlightRequest request = this.inFlightRequests.lastSent(send.destination());
if (!request.expectResponse) {
//remove the request from inFlightRequests
this.inFlightRequests.completeLastSent(send.destination());
responses.add(request.completed(null, now));
}
}
}
Handles requests whose send has completed: if no response is expected, they are removed from inFlightRequests right away.
private void completeResponses(List<ClientResponse> responses) {
for (ClientResponse response : responses) {
try {
response.onComplete();
} catch (Exception e) {
log.error("Uncaught error in request completion:", e);
}
}
}
This invokes the onComplete callback. For a produce request that callback is handleProduceResponse, which was set up in sendProduceRequests above.
onComplete mainly acknowledges the batch; in the end it calls onCompletion, the callback we write ourselves, which means the record has been delivered to the corresponding node. A typical callback is shown below.
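The user-supplied callback is just an implementation of the Callback interface; for example (the class name is made up for illustration):

```java
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

// A typical user callback: invoked once the broker has acknowledged (or failed)
// the batch containing this record.
public class LoggingCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null)
            System.err.println("send failed: " + exception.getMessage());
        else
            System.out.printf("sent to %s-%d at offset %d%n",
                    metadata.topic(), metadata.partition(), metadata.offset());
    }
}
```

It is passed as the second argument of producer.send(record, new LoggingCallback()).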
KafkaProducer
@Override
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
return doSend(interceptedRecord, callback);
}
We hand records over via send; this first runs the interceptors' onSend and then enters doSend.
//wait for the metadata to be updated
clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs);
//serialize the key and value
serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
//compute the partition
int partition = partition(record, serializedKey, serializedValue, cluster);
tp = new TopicPartition(record.topic(), partition);
Some of the logic has been trimmed; the first half consists of the steps above. You can see that the serializers we configured, and the partitioner (DefaultPartitioner if none was specified), are invoked right here.
private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long maxWaitMs) throws InterruptedException {
Cluster cluster = metadata.fetch();
metadata.add(topic);
//get the number of partitions of the topic
Integer partitionsCount = cluster.partitionCountForTopic(topic);
//if the partition count is known and either no partition was specified or the specified one is within range, return immediately
if (partitionsCount != null && (partition == null || partition < partitionsCount))
return new ClusterAndWaitTime(cluster, 0);
int version = metadata.requestUpdate();
//wake up the sender so it sends a metadata request to the remote server
sender.wakeup();
metadata.awaitUpdate(version, remainingWaitMs);
Here metadata.fetch() is called; stepping into that method shows it only reads a cache, because metadata requests are asynchronous. As described earlier, NetworkClient.poll is what actually sends the metadata request, so this code simply waits a while for another thread to update the metadata.
RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey,
serializedValue, headers, interceptCallback, remainingWaitMs, true);
//we get here when the deque has no batch to append to: the previous batch has been sent out, or this is the very first record
if (result.abortForNewBatch) {
int prevPartition = partition;
//time to switch partitions: this is the StickyPartitionCache logic
partitioner.onNewBatch(record.topic(), cluster, prevPartition);
partition = partition(record, serializedKey, serializedValue, cluster);
tp = new TopicPartition(record.topic(), partition);
if (log.isTraceEnabled()) {
log.trace("Retrying append due to new batch creation for topic {} partition {}. The old partition was {}", record.topic(), partition, prevPartition);
}
// producer callback will make sure to call both 'callback' and interceptor callback
interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
result = accumulator.append(tp, timestamp, serializedKey,
serializedValue, headers, interceptCallback, remainingWaitMs, false);
}
if (transactionManager != null && transactionManager.isTransactional())
transactionManager.maybeAddPartitionToTransaction(tp);
if (result.batchIsFull || result.newBatchCreated) {
log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
this.sender.wakeup();
}
return result.future;
Then accumulator.append is called. The first call fails because no batch exists yet; to accommodate the sticky-partition logic, a new partition is chosen and the append is retried, this time allowing a new batch to be created.
That is it for KafkaProducer; the rest is the Sender thread doing the work~
Summary
This article walked through the send path: starting from our call to send, a record is placed in the accumulator; the Sender thread keeps working, pulling records from the accumulator and handing them to KafkaChannel; Selector.poll then drives the read and write events. A minimal end-to-end usage sketch follows.
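To tie it together, a minimal producer usage sketch (the broker address and topic name are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5); // let the accumulator batch a little

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("demo-topic", "key", "value"); // placeholder topic
            // send() only hands the record to the RecordAccumulator; the Sender thread
            // does the actual network I/O, and then our callback fires.
            producer.send(record, (metadata, exception) -> {
                if (exception != null)
                    exception.printStackTrace();
                else
                    System.out.printf("acked: %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
            });
            producer.flush();
        }
    }
}
```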