Kafka Source Code Analysis 2 - Producer Sends Messages to the Buffer


My personal GitHub: github.com/AmbitionTn/… . I hope the articles below are helpful; if you find them interesting, feel free to follow — I keep updating and keep learning.

The source code analyzed in this series is based on Kafka 3.3.

Previous article: Kafka Source Code Analysis 1 - Producer Initialization

Next article: Kafka Source Code Analysis 3 - What Does the Sender Thread Do

1. The Producer send flow

(Figure: overall flow of the Producer send path)

The producer's send method:

public Future<RecordMetadata> send(ProducerRecord<K, V> record) {
    return send(record, null);
}

public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
    // intercept the record, which can be potentially modified; this method does not throw exceptions
    ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
    return doSend(interceptedRecord, callback);
}

The producer's doSend method:

/**
 * Implementation of asynchronously send a record to a topic.
 */
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
    // Append callback takes care of the following:
    //  - call interceptors and user callback on completion
    //  - remember partition that is calculated in RecordAccumulator.append
    AppendCallbacks<K, V> appendCallbacks = new AppendCallbacks<K, V>(callback, this.interceptors, record);

    try {
        /**
         * Step 1: validate the sender state.
         * Checks the Sender thread's running flag to decide whether the send thread
         * has been closed; if it is closed, sending is no longer allowed and an
         * exception is thrown.
         */
        throwIfProducerClosed();
        // first make sure the metadata for the topic is available
        long nowMs = time.milliseconds();
        ClusterAndWaitTime clusterAndWaitTime;
        try {
            /**
             * Step 2: fetch cluster metadata.
             * Blocks until metadata for the topic is available,
             * waiting at most maxBlockTimeMs.
             */
            clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), nowMs, maxBlockTimeMs);
        } catch (KafkaException e) {
            if (metadata.isClosed())
                throw new KafkaException("Producer closed while send in progress", e);
            throw e;
        }
        nowMs += clusterAndWaitTime.waitedOnMetadataMs;
        // Compute how much of the maximum block time is left
        long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
        Cluster cluster = clusterAndWaitTime.cluster;
        byte[] serializedKey;
        /**
         * Step 3: serialize the record key and value.
         * keySerializer serializes the record key,
         * valueSerializer serializes the record value.
         */
        try {
            serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
        } catch (ClassCastException cce) {
            throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
                    " to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
                    " specified in key.serializer", cce);
        }
        byte[] serializedValue;
        try {
            serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
        } catch (ClassCastException cce) {
            throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
                    " to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
                    " specified in value.serializer", cce);
        }

        // Try to calculate partition, but note that after this call it can be RecordMetadata.UNKNOWN_PARTITION,
        // which means that the RecordAccumulator would pick a partition using built-in logic (which may
        // take into account broker load, the amount of data produced to each partition, etc.).
        /**
         * Step 4: determine the partition.
         * 1. If the record itself carries a partition, return it directly.
         * 2. If a custom Partitioner implementation is configured, let it compute the partition.
         * 3. If no partitioner is configured but the record has a key and partitionerIgnoreKeys != true,
         *    the partition is derived from the key (a 32-bit hash of the key modulo the partition count).
         * 4. Otherwise return -1 (UNKNOWN_PARTITION), meaning no partition was assigned yet;
         *    the accumulator will later pick a partition itself based on partition = -1.
         */
        int partition = partition(record, serializedKey, serializedValue, cluster);

        setReadOnly(record.headers());
        Header[] headers = record.headers().toArray();
        /**
         * Step 5: estimate and validate the message size.
         * Computes an upper bound of the serialized record size,
         * including key, value, headers and compression overhead.
         */
        int serializedSize = AbstractRecords.estimateSizeInBytesUpperBound(apiVersions.maxUsableProduceMagic(),
                compressionType, serializedKey, serializedValue, headers);
        // Validate that the message does not exceed max.request.size (default 1 MB)
        ensureValidRecordSize(serializedSize);
        long timestamp = record.timestamp() == null ? nowMs : record.timestamp();

        // A custom partitioner may take advantage on the onNewBatch callback.
        // When a custom partitioner is configured, set abortOnNewBatch = true
        boolean abortOnNewBatch = partitioner != null;

        // Append the record to the accumulator.  Note, that the actual partition may be
        // calculated there and can be accessed via appendCallbacks.topicPartition.
        /**
         * Step 6: append the message to the RecordAccumulator.
         * The concrete partition the record goes to is resolved from the topic and partition,
         * and the record is buffered inside the RecordAccumulator.
         */
        RecordAccumulator.RecordAppendResult result = accumulator.append(record.topic(), partition, timestamp, serializedKey,
                serializedValue, headers, appendCallbacks, remainingWaitMs, abortOnNewBatch, nowMs, cluster);
        assert appendCallbacks.getPartition() != RecordMetadata.UNKNOWN_PARTITION;

        /**
         * When abortForNewBatch is true, no in-progress batch existed on the first call and
         * batch creation was deferred; the partitioner is notified via onNewBatch and append
         * is invoked a second time (a "double call") to actually write the record.
         */
        if (result.abortForNewBatch) {
            int prevPartition = partition;
            onNewBatch(record.topic(), cluster, prevPartition);
            partition = partition(record, serializedKey, serializedValue, cluster);
            if (log.isTraceEnabled()) {
                log.trace("Retrying append due to new batch creation for topic {} partition {}. The old partition was {}", record.topic(), partition, prevPartition);
            }
            // Append again
            result = accumulator.append(record.topic(), partition, timestamp, serializedKey,
                    serializedValue, headers, appendCallbacks, remainingWaitMs, false, nowMs, cluster);
        }

        // Add the partition to the transaction (if in progress) after it has been successfully
        // appended to the accumulator. We cannot do it before because the partition may be
        // unknown or the initially selected partition may be changed when the batch is closed
        // (as indicated by `abortForNewBatch`). Note that the `Sender` will refuse to dequeue
        // batches from the accumulator until they have been added to the transaction.
        if (transactionManager != null) {
            transactionManager.maybeAddPartition(appendCallbacks.topicPartition());
        }

        /**
         * Step 7: wake up the sender thread when
         * 1. result.batchIsFull: the batch is full, or
         * 2. result.newBatchCreated: a new batch was created;
         * in either case the buffered data should now be sent.
         */
        if (result.batchIsFull || result.newBatchCreated) {
            log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), appendCallbacks.getPartition());
            this.sender.wakeup();
        }
        return result.future;
        // handling exceptions and record the errors;
        // for API exceptions return them in the future,
        // for other exceptions throw directly
    } catch (ApiException e) {
        // omitted ...
    }
}

From the source above we can see that doSend performs seven main steps:

  • Step 1: validate the sender state

    • The running flag of the Sender thread tells us whether the send thread has been closed; if it has, an exception is thrown immediately and no message may be sent
  • Step 2: fetch cluster metadata

    • The metadata contains node, partition, ISR and other information, so the producer must block until it has been obtained before anything else can happen; how and when metadata is refreshed is covered in the next article, for now it is enough to know it can be obtained here.
  • Step 3: serialization

    • Serialize the record key and value
  • Step 4: determine the partition

    • If the record itself carries a partition, return it directly
    • If a custom Partitioner implementation is configured, let it compute the partition
    • If no partitioner is configured but the record has a key and partitionerIgnoreKeys != true, the partition is derived from the key (a 32-bit hash of the key modulo the partition count)
    • Otherwise return -1 (UNKNOWN_PARTITION), meaning no partition was assigned; the accumulator will later pick one itself based on partition = -1
  • Step 5: validate the message size

    • Compute an upper bound of the serialized record size, including key, value and compression overhead, and check it against max.request.size (default 1 MB)
  • Step 6: append the message to the RecordAccumulator

    • Resolve which partition the record should go to from the topic and partition
    • Buffer the record inside the RecordAccumulator
  • Step 7: wake up the sender thread

    • result.batchIsFull: the batch is full, so it should be sent
    • newBatchCreated: a new batch was created, which also triggers a send
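
For orientation, here is a minimal usage sketch of the API whose internals the seven steps above implement. The broker address and topic name are made up for illustration; note that send() only hands the record to the RecordAccumulator, and the callback fires only after the Sender thread has actually written the batch to the broker:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address, for illustration only.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("demo-topic", "key-1", "value-1");
            // send() only appends the record to the RecordAccumulator and returns a Future;
            // the callback runs once the Sender thread has written the batch to the broker.
            producer.send(record, (metadata, exception) -> {
                if (exception != null)
                    exception.printStackTrace();
                else
                    System.out.printf("sent to %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
            });
        }
    }
}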

2. The send flow in detail

Which serializer implementations does Kafka provide?

The key and value of a record can be of many types, for example Double, String, Integer, or some custom object, but a record must be serialized before it can be sent to the broker. Kafka ships with a set of serializer implementations, and if none of them fits you can plug in a custom serializer of your own.

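The built-in implementations (StringSerializer, IntegerSerializer, DoubleSerializer, ByteArraySerializer and friends) live in org.apache.kafka.common.serialization. As a sketch of the custom route, the class below implements the Serializer interface for a made-up User type; both the POJO and the naive encoding are hypothetical, and real code would normally use JSON, Avro or Protobuf instead:

import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.serialization.Serializer;

// Hypothetical POJO, used only for this example.
class User {
    final long id;
    final String name;

    User(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

// A custom serializer: only serialize() has to be implemented,
// configure() and close() have default implementations in the interface.
public class UserSerializer implements Serializer<User> {
    @Override
    public byte[] serialize(String topic, User user) {
        if (user == null)
            return null;
        // Naive CSV-style encoding, purely for illustration.
        return (user.id + "," + user.name).getBytes(StandardCharsets.UTF_8);
    }
}

It is then registered through key.serializer / value.serializer exactly like the built-in classes.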

How is the partition determined?

private int partition(ProducerRecord<K, V> record, byte[] serializedKey, byte[] serializedValue, Cluster cluster) {
    if (record.partition() != null)
        return record.partition();

    if (partitioner != null) {
        int customPartition = partitioner.partition(
                record.topic(), record.key(), serializedKey, record.value(), serializedValue, cluster);
        if (customPartition < 0) {
            throw new IllegalArgumentException(String.format(
                    "The partitioner generated an invalid partition number: %d. Partition number should always be non-negative.", customPartition));
        }
        return customPartition;
    }

    if (serializedKey != null && !partitionerIgnoreKeys) {
        // hash the keyBytes to choose a partition
        return BuiltInPartitioner.partitionForKey(serializedKey, cluster.partitionsForTopic(record.topic()).size());
    } else {
        return RecordMetadata.UNKNOWN_PARTITION;
    }
}

  • If the record itself carries a partition, return it directly
  • If a custom Partitioner implementation is configured, let it compute the partition
  • If no partitioner is configured but the record has a key and partitionerIgnoreKeys != true, the partition is derived from the key: a 32-bit hash of the key modulo the partition count (see the sketch after this list)
  • Otherwise return -1 (UNKNOWN_PARTITION), meaning no partition was assigned; the accumulator will later pick one itself based on partition = -1
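
BuiltInPartitioner.partitionForKey, which the partition method calls in the key branch above, boils down to "murmur2 hash of the serialized key, forced positive, modulo the partition count". The sketch below reproduces that idea; treat it as an illustration of the behaviour rather than a verbatim copy of the source:

import org.apache.kafka.common.utils.Utils;

public class KeyPartitioningSketch {
    // Hash the serialized key with murmur2, strip the sign bit,
    // and map the result onto one of numPartitions partitions.
    public static int partitionForKey(byte[] serializedKey, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(serializedKey)) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "order-42".getBytes();
        // With 6 partitions, every record carrying this key lands on the same partition,
        // which is what gives keyed records their per-key ordering.
        System.out.println(partitionForKey(key, 6));
    }
}

Because the mapping depends only on the key bytes and the current partition count, adding partitions to a topic later changes where existing keys land.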

Validating the message size

private void ensureValidRecordSize(int size) {
    if (size > maxRequestSize)
        throw new RecordTooLargeException("The message is " + size +
                " bytes when serialized which is larger than " + maxRequestSize + ", which is the value of the " +
                ProducerConfig.MAX_REQUEST_SIZE_CONFIG + " configuration.");
    if (size > totalMemorySize)
        throw new RecordTooLargeException("The message is " + size +
                " bytes when serialized which is larger than the total memory buffer you have configured with the " +
                ProducerConfig.BUFFER_MEMORY_CONFIG +
                " configuration.");
}
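
Both limits checked here come straight from producer configuration: max.request.size (default 1048576 bytes, i.e. 1 MB) bounds a single serialized record, and buffer.memory (default 33554432 bytes, i.e. 32 MB) bounds the whole BufferPool behind the RecordAccumulator. A minimal sketch of raising them, with arbitrary example values:

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class SizeLimitConfigSketch {
    public static Properties producerProps() {
        Properties props = new Properties();
        // Upper bound for a single serialized record / request, checked by ensureValidRecordSize.
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 4 * 1024 * 1024);
        // Total memory available to the RecordAccumulator's BufferPool.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64 * 1024 * 1024L);
        return props;
    }
}

If a record exceeds either limit, ensureValidRecordSize throws RecordTooLargeException before anything is buffered.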

Appending the message to the RecordAccumulator

public RecordAppendResult append(String topic,
                                 int partition,
                                 long timestamp,
                                 byte[] key,
                                 byte[] value,
                                 Header[] headers,
                                 AppendCallbacks callbacks,
                                 long maxTimeToBlock,
                                 boolean abortOnNewBatch,
                                 long nowMs,
                                 Cluster cluster) throws InterruptedException {
    /**
     * Step 1: look up the TopicInfo for the topic.
     * On the first call for a topic there is no TopicInfo yet, so a new one is created and inserted;
     * subsequent calls for the same topic find the existing TopicInfo.
     * The underlying map is a copy-on-write structure.
     */
    TopicInfo topicInfo = topicInfoMap.computeIfAbsent(topic, k -> new TopicInfo(logContext, k, batchSize));

    // We keep track of the number of appending thread to make sure we do not miss batches in
    // abortIncompleteBatches().
    // Increment the count of threads currently appending
    appendsInProgress.incrementAndGet();
    ByteBuffer buffer = null;
    if (headers == null) headers = Record.EMPTY_HEADERS;
    try {
        // Loop to retry in case we encounter partitioner's race conditions.
        /**
         * Resolve the partition in a retry loop, to avoid blocking on
         * race conditions in the partitioner.
         */
        while (true) {
            // If the message doesn't have any partition affinity, so we pick a partition based on the broker
            // availability and performance.  Note, that here we peek current partition before we hold the
            // deque lock, so we'll need to make sure that it's not changed while we were waiting for the
            // deque lock.
            final BuiltInPartitioner.StickyPartitionInfo partitionInfo;
            final int effectivePartition;
            /**
             * Step 2: resolve the effective partition.
             * If partition == RecordMetadata.UNKNOWN_PARTITION no partition was assigned upstream,
             * so the built-in (sticky) partitioner picks one;
             * otherwise the partition that was handed in is used as-is.
             */
            if (partition == RecordMetadata.UNKNOWN_PARTITION) {
                partitionInfo = topicInfo.builtInPartitioner.peekCurrentPartitionInfo(cluster);
                effectivePartition = partitionInfo.partition();
            } else {
                partitionInfo = null;
                effectivePartition = partition;
            }
            // Record the effective partition on the callback
            // Now that we know the effective partition, let the caller know.
            setPartition(callbacks, effectivePartition);

            // check if we have an in-progress batch
            /**
             * Step 3: get the deque for the partition.
             * Looks up Deque<ProducerBatch> dq in the TopicInfo for this partition,
             * creating a new empty deque if none exists yet.
             */
            Deque<ProducerBatch> dq = topicInfo.batches.computeIfAbsent(effectivePartition, k -> new ArrayDeque<>());
            /**
             * Lock only the deque of the partition being touched: a segmented-lock style of locking.
             */
            synchronized (dq) {
                // After taking the lock, validate that the partition hasn't changed and retry.
                /**
                 * The partition was read before the lock was taken, so double-check here
                 * that it has not changed in the meantime.
                 */
                if (partitionChanged(topic, topicInfo, partitionInfo, dq, nowMs, cluster))
                    continue;
                /**
                 * Step 4: try to append the record to an existing batch.
                 * If a batch with buffer space already exists the append completes here;
                 * a null result means no batch (or no room) was available and a new one is needed,
                 * a non-null result means the append succeeded.
                 */
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callbacks, dq, nowMs);
                if (appendResult != null) {
                    // If queue has incomplete batches we disable switch (see comments in updatePartitionInfo).
                    // Check whether the last batch in the queue is full
                    boolean enableSwitch = allBatchesFull(dq);
                    // Decide whether the sticky partition should be switched
                    topicInfo.builtInPartitioner.updatePartitionInfo(partitionInfo, appendResult.appendedBytes, cluster, enableSwitch);
                    return appendResult;
                }
            }

            /**
             * Step 5: if this is the first pass (abortOnNewBatch == true) and a new batch
             * would be needed, return immediately so the caller can call append again.
             */
            // we don't have an in-progress record batch try to allocate a new batch
            if (abortOnNewBatch) {
                // Return a result that will cause another call to append.
                return new RecordAppendResult(null, false, false, true, 0);
            }
            
            if (buffer == null) {
                byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
                int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
                log.trace("Allocating a new {} byte message buffer for topic {} partition {} with remaining timeout {}ms", size, topic, partition, maxTimeToBlock);
                // This call may block if we exhausted buffer space.
                buffer = free.allocate(size, maxTimeToBlock);
                // Update the current time in case the buffer allocation blocked above.
                // NOTE: getting time may be expensive, so calling it under a lock
                // should be avoided.
                nowMs = time.milliseconds();
            }

            /**
             * Step 6: reaching this point means either
             * 1. no buffer had been allocated yet, so space was allocated just above, or
             * 2. the existing batch was full, so a new buffer is needed to append the record.
             */
            synchronized (dq) {
                // After taking the lock, validate that the partition hasn't changed and retry.
                if (partitionChanged(topic, topicInfo, partitionInfo, dq, nowMs, cluster))
                    continue;
                /**
                 * Create a new batch and put the record into it.
                 */
                RecordAppendResult appendResult = appendNewBatch(topic, effectivePartition, dq, timestamp, key, value, headers, callbacks, buffer, nowMs);
                // Set buffer to null, so that deallocate doesn't return it back to free pool, since it's used in the batch.
                if (appendResult.newBatchCreated)
                    buffer = null;
                // If queue has incomplete batches we disable switch (see comments in updatePartitionInfo).
                boolean enableSwitch = allBatchesFull(dq);
                topicInfo.builtInPartitioner.updatePartitionInfo(partitionInfo, appendResult.appendedBytes, cluster, enableSwitch);
                return appendResult;
            }
        }
    } finally {
        free.deallocate(buffer);
        appendsInProgress.decrementAndGet();
    }
}

From the append method of RecordAccumulator above, we can see its main steps:

  • Step 1: look up the TopicInfo for the topic

    • On the first call for a topic there is no TopicInfo yet, so a new one is created and inserted
    • On subsequent calls for the same topic the existing TopicInfo is found
    • The lookup is backed by a copy-on-write map
  • Step 2: resolve the effective partition

    • The real partition to use is computed from the partition argument
    • If partition == RecordMetadata.UNKNOWN_PARTITION, no partition was assigned upstream and the built-in partitioner picks one
    • If a partition was handed in, it is used as-is
  • Step 3: get the deque for the partition

    • Look up Deque<ProducerBatch> dq in the TopicInfo by partition
    • Create a new queue if none exists yet
  • Step 4: try to append the record to an existing batch under the deque lock (the peek-then-double-check pattern behind this is sketched after this list)

  • Step 5: decide whether the sticky partition needs to be switched, and on the first pass (abortOnNewBatch) return so that append can be called again

  • Step 6: if no existing batch could take the record, allocate a buffer and create a new batch
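
The skeleton that steps 2 through 4 rely on (peek the current sticky partition before taking the per-partition deque lock, re-validate it once the lock is held, and retry the whole loop if it changed in the meantime) is easy to lose inside the full method. The stand-alone sketch below shows only that skeleton; it is not Kafka code and every name in it is made up:

import java.util.ArrayDeque;
import java.util.Deque;

// Simplified illustration of append()'s locking pattern, not the real implementation.
class StickyAppendSketch {
    private volatile int stickyPartition = 0;             // stands in for the sticky partition info
    private final Deque<StringBuilder> batches = new ArrayDeque<>();

    void append(String record) {
        while (true) {
            int observed = stickyPartition;                // peek outside the lock
            synchronized (batches) {                       // per-partition "segment" lock
                if (observed != stickyPartition)           // double-check: the partition switched
                    continue;                              // while we waited for the lock -> retry
                if (batches.isEmpty())
                    batches.addLast(new StringBuilder());  // roughly "create a new batch"
                batches.peekLast().append(record);         // roughly "append to the last batch"
                return;
            }
        }
    }
}

Back in the real code, the map lookups in step 1 and step 3 both go through ConcurrentMap.computeIfAbsent: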

default V computeIfAbsent(K key,
        Function<? super K, ? extends V> mappingFunction) {
    Objects.requireNonNull(mappingFunction);
    V v, newValue;
    return ((v = get(key)) == null &&
            (newValue = mappingFunction.apply(key)) != null &&
            (v = putIfAbsent(key, newValue)) == null) ? newValue : v;
}

In the append code above, both the lookup-or-create of topicInfo by topic and the lookup-or-create of the Deque by partition go through this computeIfAbsent default method of ConcurrentMap. The method itself is written in a functional style; the thread safety comes from the copy-on-write map that backs its get and putIfAbsent calls.

How CopyOnWrite works

In TopicInfo the batches field is of type CopyOnWriteMap, a Map implementation defined by Kafka itself; it implements the ConcurrentMap interface and is thread safe.

private static class TopicInfo {
    public final ConcurrentMap<Integer /*partition*/, Deque<ProducerBatch>> batches = new CopyOnWriteMap<>();
    public final BuiltInPartitioner builtInPartitioner;

    public TopicInfo(LogContext logContext, String topic, int stickyBatchSize) {
        builtInPartitioner = new BuiltInPartitioner(logContext, topic, stickyBatchSize);
    }
}

CopyOnWriteMap is defined as follows:

/**
 * 1. Separates reads from writes (copy-on-write).
 * 2. Suited to read-heavy, write-light workloads.
 */
public class CopyOnWriteMap<K, V> implements ConcurrentMap<K, V> {
    // Multiple threads share this map; volatile makes a replaced map visible to all of them
    private volatile Map<K, V> map;
    
    /**
     * Reads take no lock, so read performance stays very high even under heavy
     * concurrency, and reads remain thread safe: this is the read/write separation.
     */
    @Override
    public V get(Object k) {
        return map.get(k);
    }
    // some code omitted
    
    /**
     * 1) The whole method is marked synchronized, so writes are thread safe.
     * 2) The design separates reads from writes: a write works on a copy of the map
     *    and then swaps the copy in.
     * 3) Because map is volatile, get() immediately sees the swapped-in map.
     */
    @Override
    public synchronized V put(K k, V v) {
        Map<K, V> copy = new HashMap<K, V>(this.map);
        V prev = copy.put(k, v);
        this.map = Collections.unmodifiableMap(copy);
        return prev;
    }
    // some code omitted
}
  • The computeIfAbsent calls in the RecordAccumulator are therefore backed by this copy-on-write map.
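
The put shown above is only half of the picture; the putIfAbsent that computeIfAbsent actually calls is among the omitted code. A sketch of how it plausibly looks, following the same copy-on-write pattern as the rest of the class (the whole method is synchronized, and the write itself is delegated to put):

@Override
public synchronized V putIfAbsent(K k, V v) {
    if (!containsKey(k))
        return put(k, v);   // key absent: put() copies the map, inserts, and publishes the copy
    else
        return get(k);      // key already present: return the existing value, nothing is copied
}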

Waking up the sender thread

Waking the sender thread is what gets the buffered data sent to the broker; what the sender thread actually does will be covered in detail in the next article.

3. Summary

Studying the implementation of Kafka's send method yields quite a few takeaways:

  1. Separating the definition of the business logic from the definition of the thread that runs it keeps the two decoupled.
  2. During initialization the Producer only creates the Metadata object; it does not actually fetch metadata from the broker over the network. The real fetch happens lazily, on the first send.
  3. A general-purpose framework needs a serialization step: keys and values can be of any type, and, as in Kafka, they must be converted to byte arrays before they can be sent to the broker.
  4. A highly available, extensible system should expose extension points, such as Kafka's partitioner: you can rely on the default implementation or plug in a custom one.
  5. For highly concurrent code, look at the read/write ratio: if reads dominate, separating reads from writes works well. CopyOnWrite does exactly that, with writes performed on a copy.