My GitHub: github.com/AmbitionTn/… . I hope the articles below are helpful; if you find them interesting, feel free to follow. I will keep updating and learning.
The Kafka source code in this series is based on Kafka 3.3.
Previous article: Kafka Source Code Analysis 1 - Producer Initialization
Next article: Kafka Source Code Analysis 3 - What Does the Sender Thread Do
1. The Producer send process
Overall flow diagram
The producer's send method
public Future<RecordMetadata> send(ProducerRecord<K, V> record) {
return send(record, null);
}
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
// intercept the record, which can be potentially modified; this method does not throw exceptions
ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
return doSend(interceptedRecord, callback);
}
The producer's doSend method
/**
* Implementation of asynchronously send a record to a topic.
*/
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
// Append callback takes care of the following:
// - call interceptors and user callback on completion
// - remember partition that is calculated in RecordAccumulator.append
AppendCallbacks<K, V> appendCallbacks = new AppendCallbacks<K, V>(callback, this.interceptors, record);
try {
/**
* Step 1: Check the Sender state.
* Uses the Sender thread's running flag to determine whether the send thread
* has been closed; if it has, sending is no longer allowed and an exception is thrown.
*/
throwIfProducerClosed();
// first make sure the metadata for the topic is available
long nowMs = time.milliseconds();
ClusterAndWaitTime clusterAndWaitTime;
try {
/**
* Step 2: Fetch the cluster metadata.
* Blocks until the cluster metadata is available.
* maxBlockTimeMs is the maximum time to wait.
*/
clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), nowMs, maxBlockTimeMs);
} catch (KafkaException e) {
if (metadata.isClosed())
throw new KafkaException("Producer closed while send in progress", e);
throw e;
}
nowMs += clusterAndWaitTime.waitedOnMetadataMs;
// compute how much of the maximum block time is left
long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
Cluster cluster = clusterAndWaitTime.cluster;
byte[] serializedKey;
/**
* Step 3: Serialize the record key and value.
* keySerializer serializes the record key;
* valueSerializer serializes the record value.
*/
try {
serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
" specified in key.serializer", cce);
}
byte[] serializedValue;
try {
serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
" specified in value.serializer", cce);
}
// Try to calculate partition, but note that after this call it can be RecordMetadata.UNKNOWN_PARTITION,
// which means that the RecordAccumulator would pick a partition using built-in logic (which may
// take into account broker load, the amount of data produced to each partition, etc.).
/**
* Step 4: Determine the partition.
* 1. If the record already carries a partition, return it directly.
* 2. If a custom Partitioner implementation is configured, let it compute the partition.
* 3. If no partitioner is configured but the record has a key and partitionerIgnoreKeys != true,
*    the partition is derived from the key (a 32-bit hash of the key modulo the number of partitions).
* 4. Otherwise return -1, meaning no partition has been assigned yet; the accumulator's
*    built-in logic will pick one later based on partition = -1.
*/
int partition = partition(record, serializedKey, serializedValue, cluster);
setReadOnly(record.headers());
Header[] headers = record.headers().toArray();
/**
* Step 5: Estimate and validate the record size.
* Computes an upper bound of the serialized record size, taking key, value,
* headers and the compression type into account.
*/
int serializedSize = AbstractRecords.estimateSizeInBytesUpperBound(apiVersions.maxUsableProduceMagic(),
compressionType, serializedKey, serializedValue, headers);
// verify the record size does not exceed max.request.size (default 1 MB)
ensureValidRecordSize(serializedSize);
long timestamp = record.timestamp() == null ? nowMs : record.timestamp();
// A custom partitioner may take advantage on the onNewBatch callback.
// when a custom partitioner is configured, set abortOnNewBatch = true
boolean abortOnNewBatch = partitioner != null;
// Append the record to the accumulator. Note, that the actual partition may be
// calculated there and can be accessed via appendCallbacks.topicPartition.
/**
* Step 6: Append the record to the RecordAccumulator.
* The topic and partition determine which partition the record goes to,
* and the record is buffered in the RecordAccumulator.
*/
RecordAccumulator.RecordAppendResult result = accumulator.append(record.topic(), partition, timestamp, serializedKey,
serializedValue, headers, appendCallbacks, remainingWaitMs, abortOnNewBatch, nowMs, cluster);
assert appendCallbacks.getPartition() != RecordMetadata.UNKNOWN_PARTITION;
/**
* If abortForNewBatch is true, no in-progress batch existed on the first pass:
* the partitioner is notified via onNewBatch, the partition is recomputed,
* and append is called a second time (the "double call").
*/
if (result.abortForNewBatch) {
int prevPartition = partition;
onNewBatch(record.topic(), cluster, prevPartition);
partition = partition(record, serializedKey, serializedValue, cluster);
if (log.isTraceEnabled()) {
log.trace("Retrying append due to new batch creation for topic {} partition {}. The old partition was {}", record.topic(), partition, prevPartition);
}
// retry the append
result = accumulator.append(record.topic(), partition, timestamp, serializedKey,
serializedValue, headers, appendCallbacks, remainingWaitMs, false, nowMs, cluster);
}
// Add the partition to the transaction (if in progress) after it has been successfully
// appended to the accumulator. We cannot do it before because the partition may be
// unknown or the initially selected partition may be changed when the batch is closed
// (as indicated by `abortForNewBatch`). Note that the `Sender` will refuse to dequeue
// batches from the accumulator until they have been added to the transaction.
if (transactionManager != null) {
transactionManager.maybeAddPartition(appendCallbacks.topicPartition());
}
/**
* Step 7: Wake up the Sender thread if
* 1. result.batchIsFull - the current batch is full, or
* 2. result.newBatchCreated - a new batch was created.
*/
if (result.batchIsFull || result.newBatchCreated) {
log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), appendCallbacks.getPartition());
this.sender.wakeup();
}
return result.future;
// handling exceptions and record the errors;
// for API exceptions return them in the future,
// for other exceptions throw directly
} catch (ApiException e) {
// omitted ...
}
}
From the source above, doSend performs seven main steps:
- Step 1: Check the Sender state
  - The Sender thread's running flag tells whether the send thread has been closed; if it is closed, an exception is thrown immediately and nothing is sent.
- Step 2: Fetch the cluster metadata
  - The metadata contains the nodes, partitions, ISR and so on, so doSend must block until it is available before anything else can happen. How the metadata gets refreshed is covered in the next article; for now it is enough to know it can be obtained here.
- Step 3: Serialization
  - Serialize the record key and value.
- Step 4: Determine the partition
  - If the record already carries a partition, return it directly.
  - If a custom Partitioner implementation is configured, let it compute the partition.
  - If no partitioner is configured but the record has a key and partitionerIgnoreKeys != true, the partition is derived from the key (a 32-bit hash of the key modulo the number of partitions).
  - Otherwise return -1, meaning no partition has been assigned yet; the accumulator will later pick one based on partition = -1.
- Step 5: Validate the record size
  - Compute the serialized size of the record (key, value, headers and compression overhead) and check it against max.request.size, whose default is 1 MB.
- Step 6: Append the record to the RecordAccumulator
  - The topic and partition determine which partition the record goes to, and the record is buffered in the RecordAccumulator.
- Step 7: Wake up the Sender thread
  - result.batchIsFull: the batch is full, so it should be sent.
  - newBatchCreated: a new batch was created, so the previous one should be sent.
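Before going through each step in detail, here is a minimal caller-side sketch of the code that drives this whole flow (the bootstrap address and topic name are placeholders used only for illustration):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // placeholder broker address for this sketch
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: it only runs doSend() and buffers the record in
            // the RecordAccumulator; the Sender thread performs the actual network I/O.
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "value-1"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("sent to %s-%d@%d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        }
                    });
            producer.flush();
        }
    }
}
```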
2. The send process in detail
Which serializer implementations does Kafka provide?
The key and value of a record can be of many types, such as Double, String, Integer, or custom objects, but before a record can be sent to the broker both must be serialized into bytes. Kafka ships with a set of serializer implementations, and if they are not enough you can provide your own.
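The built-in implementations live in org.apache.kafka.common.serialization (StringSerializer, IntegerSerializer, DoubleSerializer, ByteArraySerializer, and so on). Writing your own only requires implementing the Serializer interface; the sketch below (the User type and the "id:name" encoding are made up for illustration) is what gets invoked in step 3 of doSend:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical value type used only for this sketch
class User {
    final long id;
    final String name;
    User(long id, String name) { this.id = id; this.name = name; }
}

// A minimal custom serializer; register it via the value.serializer property
class UserSerializer implements Serializer<User> {
    @Override
    public byte[] serialize(String topic, User data) {
        if (data == null) return null;
        // Encode the object as "id:name" in UTF-8; a real implementation would
        // more likely use JSON, Avro or Protobuf.
        return (data.id + ":" + data.name).getBytes(StandardCharsets.UTF_8);
    }
}
```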
How the partition is determined
private int partition(ProducerRecord<K, V> record, byte[] serializedKey, byte[] serializedValue, Cluster cluster) {
if (record.partition() != null)
return record.partition();
if (partitioner != null) {
int customPartition = partitioner.partition(
record.topic(), record.key(), serializedKey, record.value(), serializedValue, cluster);
if (customPartition < 0) {
throw new IllegalArgumentException(String.format(
"The partitioner generated an invalid partition number: %d. Partition number should always be non-negative.", customPartition));
}
return customPartition;
}
if (serializedKey != null && !partitionerIgnoreKeys) {
// hash the keyBytes to choose a partition
return BuiltInPartitioner.partitionForKey(serializedKey, cluster.partitionsForTopic(record.topic()).size());
} else {
return RecordMetadata.UNKNOWN_PARTITION;
}
}
- If the record already carries a partition, return it directly.
- If a custom Partitioner implementation is configured, let it compute the partition.
- If no partitioner is configured but the record has a key and partitionerIgnoreKeys != true, the partition is derived from the key (a 32-bit hash of the key modulo the number of partitions).
- Otherwise return -1 (UNKNOWN_PARTITION), meaning no partition has been assigned yet; the accumulator will later pick one based on partition = -1.
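For the key-based branch, BuiltInPartitioner.partitionForKey in Kafka 3.3 boils down to hashing the serialized key with murmur2 and taking the result modulo the partition count; conceptually it is equivalent to this sketch:

```java
import org.apache.kafka.common.utils.Utils;

public class KeyPartitionSketch {
    // Conceptual sketch of the key-based partition choice:
    // murmur2 produces a 32-bit hash, toPositive clears the sign bit,
    // and the modulo maps the hash onto [0, numPartitions).
    static int partitionForKey(byte[] serializedKey, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(serializedKey)) % numPartitions;
    }
}
```

Because this is a modulo over the current partition count, adding partitions to a topic later changes which partition a given key maps to.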
Validating the record size
private void ensureValidRecordSize(int size) {
if (size > maxRequestSize)
throw new RecordTooLargeException("The message is " + size +
" bytes when serialized which is larger than " + maxRequestSize + ", which is the value of the " +
ProducerConfig.MAX_REQUEST_SIZE_CONFIG + " configuration.");
if (size > totalMemorySize)
throw new RecordTooLargeException("The message is " + size +
" bytes when serialized which is larger than the total memory buffer you have configured with the " +
ProducerConfig.BUFFER_MEMORY_CONFIG +
" configuration.");
}
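Both limits are ordinary producer configs: a record larger than max.request.size or larger than the whole buffer.memory pool fails fast with a RecordTooLargeException before it ever reaches the accumulator. A small configuration sketch (the values are illustrative, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class SizeLimitConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        // max.request.size: the limit checked by ensureValidRecordSize, default 1 MB (1048576)
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 2 * 1024 * 1024);
        // buffer.memory: total size of the accumulator's buffer pool, default 32 MB
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64 * 1024 * 1024L);
        return props;
    }
}
```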
Appending the record to the RecordAccumulator
public RecordAppendResult append(String topic,
int partition,
long timestamp,
byte[] key,
byte[] value,
Header[] headers,
AppendCallbacks callbacks,
long maxTimeToBlock,
boolean abortOnNewBatch,
long nowMs,
Cluster cluster) throws InterruptedException {
/**
* Step 1: Look up the TopicInfo for this topic.
* On the first call there is no TopicInfo yet, so a new one is created and stored;
* subsequent calls for the same topic will find the existing one.
* The map behind this is a copy-on-write implementation.
*/
TopicInfo topicInfo = topicInfoMap.computeIfAbsent(topic, k -> new TopicInfo(logContext, k, batchSize));
// We keep track of the number of appending thread to make sure we do not miss batches in
// abortIncompleteBatches().
// increment the count of threads currently appending
appendsInProgress.incrementAndGet();
ByteBuffer buffer = null;
if (headers == null) headers = Record.EMPTY_HEADERS;
try {
// Loop to retry in case we encounter partitioner's race conditions.
/**
* Retry in a loop to resolve the partition, so races with the partitioner do not block us.
*/
while (true) {
// If the message doesn't have any partition affinity, so we pick a partition based on the broker
// availability and performance. Note, that here we peek current partition before we hold the
// deque lock, so we'll need to make sure that it's not changed while we were waiting for the
// deque lock.
final BuiltInPartitioner.StickyPartitionInfo partitionInfo;
final int effectivePartition;
/**
* Step 2: Resolve the effective partition.
* If partition == RecordMetadata.UNKNOWN_PARTITION, no partition was assigned,
* so the built-in sticky partitioner picks the current one;
* otherwise use the partition that was already given.
*/
if (partition == RecordMetadata.UNKNOWN_PARTITION) {
partitionInfo = topicInfo.builtInPartitioner.peekCurrentPartitionInfo(cluster);
effectivePartition = partitionInfo.partition();
} else {
partitionInfo = null;
effectivePartition = partition;
}
// record the effective partition on the callback
// Now that we know the effective partition, let the caller know.
setPartition(callbacks, effectivePartition);
// check if we have an in-progress batch
/**
* Step 3: Get the deque for this partition.
* Look up the Deque<ProducerBatch> for the partition in TopicInfo,
* creating a new deque if none exists yet.
*/
Deque<ProducerBatch> dq = topicInfo.batches.computeIfAbsent(effectivePartition, k -> new ArrayDeque<>());
/**
* Lock only the deque of the partition being written to (per-partition striped locking).
*/
synchronized (dq) {
// After taking the lock, validate that the partition hasn't changed and retry.
/**
* The partition was resolved before taking the lock, so double-check
* that it has not changed in the meantime; if it has, retry.
*/
if (partitionChanged(topic, topicInfo, partitionInfo, dq, nowMs, cluster))
continue;
/**
* Step 4: Try to append the record to an existing batch.
* If a batch with remaining space already exists, the record is added here;
* a null result means there was no in-progress batch (or no room left in it),
* a non-null result means the append succeeded.
*/
RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callbacks, dq, nowMs);
if (appendResult != null) {
// If queue has incomplete batches we disable switch (see comments in updatePartitionInfo).
// check whether the last batch in the queue is full
boolean enableSwitch = allBatchesFull(dq);
// decide whether to switch to another partition
topicInfo.builtInPartitioner.updatePartitionInfo(partitionInfo, appendResult.appendedBytes, cluster, enableSwitch);
return appendResult;
}
}
/**
* Step 5: If this is the first pass and a new batch would have to be created,
* return immediately so the caller can retry (abortOnNewBatch).
*/
// we don't have an in-progress record batch try to allocate a new batch
if (abortOnNewBatch) {
// Return a result that will cause another call to append.
return new RecordAppendResult(null, false, false, true, 0);
}
if (buffer == null) {
byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
log.trace("Allocating a new {} byte message buffer for topic {} partition {} with remaining timeout {}ms", size, topic, partition, maxTimeToBlock);
// This call may block if we exhausted buffer space.
buffer = free.allocate(size, maxTimeToBlock);
// Update the current time in case the buffer allocation blocked above.
// NOTE: getting time may be expensive, so calling it under a lock
// should be avoided.
nowMs = time.milliseconds();
}
/**
* Step 6: Reaching this point means either
* 1. no buffer had been allocated yet and one was just allocated above, or
* 2. the existing batch was full, so a new buffer/batch is needed.
*/
synchronized (dq) {
// After taking the lock, validate that the partition hasn't changed and retry.
if (partitionChanged(topic, topicInfo, partitionInfo, dq, nowMs, cluster))
continue;
/**
* Create a new batch and put the record into it.
*/
RecordAppendResult appendResult = appendNewBatch(topic, effectivePartition, dq, timestamp, key, value, headers, callbacks, buffer, nowMs);
// Set buffer to null, so that deallocate doesn't return it back to free pool, since it's used in the batch.
if (appendResult.newBatchCreated)
buffer = null;
// If queue has incomplete batches we disable switch (see comments in updatePartitionInfo).
boolean enableSwitch = allBatchesFull(dq);
topicInfo.builtInPartitioner.updatePartitionInfo(partitionInfo, appendResult.appendedBytes, cluster, enableSwitch);
return appendResult;
}
}
} finally {
free.deallocate(buffer);
appendsInProgress.decrementAndGet();
}
}
The append method of the RecordAccumulator above boils down to the following steps:
- Step 1: Look up the TopicInfo by topic
  - On the first call there is no TopicInfo yet, so a new one is created and stored.
  - Subsequent calls for the same topic will find the existing TopicInfo.
  - The lookup map is a copy-on-write implementation.
- Step 2: Resolve the effective partition
  - The effective partition is computed from the partition passed in.
  - If partition == RecordMetadata.UNKNOWN_PARTITION, no partition was assigned, so the built-in sticky partitioner picks one.
  - If a partition was already given, it is used as-is.
- Step 3: Get the deque for the partition
  - Look up the Deque<ProducerBatch> for the partition in TopicInfo.
  - Create a new deque if none exists yet.
- Step 4: Append the record to an existing batch in the RecordAccumulator
- Step 5: Check whether the partition needs to be switched
- Step 6: Decide whether a new buffer needs to be allocated
default V computeIfAbsent(K key,
Function<? super K, ? extends V> mappingFunction) {
Objects.requireNonNull(mappingFunction);
V v, newValue;
return ((v = get(key)) == null &&
(newValue = mappingFunction.apply(key)) != null &&
(v = putIfAbsent(key, newValue)) == null) ? newValue : v;
}
In the append code above, both the topic-to-TopicInfo lookup and the partition-to-Deque lookup go through this method. It takes a mapping function, and in the TopicInfo case the map underneath is a copy-on-write implementation.
How the copy-on-write map works
In TopicInfo, the batches field is a CopyOnWriteMap, a Map type defined by Kafka itself. It implements the ConcurrentMap interface and is thread-safe.
private static class TopicInfo {
public final ConcurrentMap<Integer /*partition*/, Deque<ProducerBatch>> batches = new CopyOnWriteMap<>();
public final BuiltInPartitioner builtInPartitioner;
public TopicInfo(LogContext logContext, String topic, int stickyBatchSize) {
builtInPartitioner = new BuiltInPartitioner(logContext, topic, stickyBatchSize);
}
}
CopyOnWriteMap is defined as follows:
/**
* 1. Separates reads from writes (copy-on-write).
* 2. Suited to read-heavy, write-light workloads.
*/
public class CopyOnWriteMap<K, V> implements ConcurrentMap<K, V> {
// multiple threads share this map; volatile makes a replaced map visible to all of them
private volatile Map<K, V> map;
/**
* Reads take no lock, so under high concurrency they are very fast and still thread-safe,
* because reads and writes are separated (copy-on-write).
*/
@Override
public V get(Object k) {
return map.get(k);
}
// some code omitted
/**
* 1) The whole method is synchronized, so writes are thread-safe.
* 2) The design separates reads from writes: each write works on a copy.
* 3) The map field is volatile, so once it is replaced, get() sees the new map.
*/
@Override
public synchronized V put(K k, V v) {
Map<K, V> copy = new HashMap<K, V>(this.map);
V prev = copy.put(k, v);
this.map = Collections.unmodifiableMap(copy);
return prev;
}
// some code omitted
}
- In the accumulator, the computeIfAbsent lookups therefore run on top of this copy-on-write map, as shown in the sketch below.
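A minimal sketch of the resulting usage pattern, assuming access to the internal org.apache.kafka.common.utils.CopyOnWriteMap class (the partition-to-deque shape mirrors TopicInfo.batches):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentMap;
import org.apache.kafka.common.utils.CopyOnWriteMap;

public class CopyOnWriteMapDemo {
    // Same shape as TopicInfo.batches: partition -> deque of batches
    private final ConcurrentMap<Integer, Deque<String>> batches = new CopyOnWriteMap<>();

    Deque<String> dequeFor(int partition) {
        // Hot path: get() inside computeIfAbsent is a plain volatile read of an
        // immutable snapshot, so readers never block.
        // Cold path: only the first caller for a given partition pays for the
        // synchronized putIfAbsent, which copies the map and republishes it.
        return batches.computeIfAbsent(partition, k -> new ArrayDeque<>());
    }
}
```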
Waking up the Sender thread
The Sender thread is woken up so that it can send the buffered data to the brokers; what the Sender thread actually does is covered in detail in the next article.
3. Summary
Studying the implementation of Kafka's send method yields several takeaways:
- Separating the business logic from the sending thread keeps the two decoupled.
- The Producer only creates the Metadata object at initialization; it does not actually fetch metadata from the brokers over the network until send is called. This is a lazy-loading approach.
- A general-purpose framework needs a serialization step: keys and values can be of any type, and, as in Kafka, they must be turned into byte arrays before they can be sent to the broker.
- A highly available, extensible system should leave room for custom implementations, such as Kafka's partitioner, which can be either the default or user-provided.
- For highly concurrent code, look at the read/write ratio: if reads dominate, a copy-on-write (read/write separation) design works well, as CopyOnWriteMap does by performing writes on a copy.