Reading the Kafka 2.4 Source Code: The Producer Client Flow


Preface

Key Terms

  • Broker: a physical node running the Kafka service. In this version each broker registers its own ID in ZooKeeper to support high availability; a single-machine setup has only one broker.
  • Topic: the message topic; consumers receive messages by subscribing to topics.
  • Partition: a message partition. A topic can have multiple partitions, messages are routed to a partition according to a partitioning rule, and a partition is consumed by at most one consumer (within a consumer group).

Source Code

(Figure: Producer architecture)

Sending messages with the producer involves the following classes (a minimal usage sketch follows the list):

  • KafkaProducer: the main producer class, which coordinates the other components
  • RecordAccumulator: the message accumulator. When a message is sent it is handed to the accumulator, which stores it grouped by topic and partition. Doing a network I/O for every single message would waste resources, so messages are collected until a certain count or time threshold is reached and then sent in batches
  • Sender: pulls messages from the RecordAccumulator, builds requests, and submits them for sending
  • NetworkClient: drives KafkaChannel to send and read data, handles messages that were successfully sent or read, and is the core of the event loop
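
To map these classes to the public API, here is a minimal usage sketch (the broker address localhost:9092 and the topic name demo-topic are placeholders):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() only hands the record to the RecordAccumulator; the Sender thread
            // performs the actual network I/O asynchronously
            RecordMetadata metadata =
                    producer.send(new ProducerRecord<>("demo-topic", "key", "value")).get();
            System.out.printf("sent to %s-%d@%d%n",
                    metadata.topic(), metadata.partition(), metadata.offset());
        }
    }
}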

RecordAccumulator: the Message Accumulator

Why start with RecordAccumulator? Because it stores batches of messages and covers most of the message handling logic, yet has few dependencies on the other classes, so its source can be analyzed on its own.

private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;

This is the core data structure: a Deque holds the ProducerBatch instances for each TopicPartition.
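
A rough sketch of how the per-partition Deque is looked up or created (simplified from the source; ProducerBatch and TopicPartition come from the Kafka client, ArrayDeque from java.util):

// Sketch of the lookup-or-create logic for the batches map; simplified from the source.
private Deque<ProducerBatch> getOrCreateDeque(TopicPartition tp) {
    Deque<ProducerBatch> d = this.batches.get(tp);
    if (d != null)
        return d;
    d = new ArrayDeque<>();
    // putIfAbsent keeps the deque that another thread may have registered first
    Deque<ProducerBatch> previous = this.batches.putIfAbsent(tp, d);
    return previous == null ? d : previous;
}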

    public RecordAppendResult append(TopicPartition tp,
                                     long timestamp,
                                     byte[] key,
                                     byte[] value,
                                     Header[] headers,
                                     Callback callback,
                                     long maxTimeToBlock,
                                     boolean abortOnNewBatch) throws InterruptedException {
        appendsInProgress.incrementAndGet();
        ByteBuffer buffer = null;
        if (headers == null) headers = Record.EMPTY_HEADERS;
        try {
            //get or create the deque that holds records for this partition
            Deque<ProducerBatch> dq = getOrCreateDeque(tp);
            synchronized (dq) {
                if (closed)
                    throw new KafkaException("Producer closed while send in progress");
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
                if (appendResult != null)
                    return appendResult;
            }

            //if appending to an existing batch failed and abortOnNewBatch is true, return and let the caller decide
            if (abortOnNewBatch) {
                return new RecordAppendResult(null, false, false, true);
            }
            

This is the first half of append: it looks up the Deque for the topic-partition (creating it if it does not exist), and inside the lock calls tryAppend to try appending the record to the last batch in that Deque.

Note the abortOnNewBatch parameter: if tryAppend fails and this flag is set to true, the current append is aborted and returns early. So how can tryAppend fail?

 private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers,
                                         Callback callback, Deque<ProducerBatch> deque) {
        ProducerBatch last = deque.peekLast();
        if (last != null) {
            FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, time.milliseconds());
            if (future == null)
                last.closeForRecordAppends();
            else
                return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false);
        }
        return null;
    }

tryAppend peeks at the most recent ProducerBatch in the Deque. If there is none (or the last batch has no room left for the record) it returns null to signal failure; in other words, on the very first send, or once the previous batches have all been drained for sending, the append fails here.

byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
            int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
            log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
            buffer = free.allocate(size, maxTimeToBlock);
            synchronized (dq) {
                if (closed)
                    throw new KafkaException("Producer closed while send in progress");

                //retry now that the lock is held again; another thread may have created a batch in the meantime
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
                if (appendResult != null) {
                    return appendResult;
                }

                //create a new batch
                MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
                ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, time.milliseconds());
                FutureRecordMetadata future = Objects.requireNonNull(batch.tryAppend(timestamp, key, value, headers,
                        callback, time.milliseconds()));

                dq.addLast(batch);
                incomplete.add(batch);
                
                buffer = null;
                return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true, false);
            }
        } finally {
            if (buffer != null)
                free.deallocate(buffer);
            appendsInProgress.decrementAndGet();
        }

If abortOnNewBatch is false, execution continues with the flow shown above: allocate a buffer, create a new ProducerBatch, append the message to it, and add the batch to the deque.

++Note that appendsInProgress is used to track whether an append operation is currently in progress: it is incremented on entry and decremented in the finally block.++

The accumulator's append only adds the message to a ProducerBatch; multiple batches are chained together in a Deque.

public ReadyCheckResult ready(Cluster cluster, long nowMs) {
        Set<Node> readyNodes = new HashSet<>();
        long nextReadyCheckDelayMs = Long.MAX_VALUE;
        Set<String> unknownLeaderTopics = new HashSet<>();

        boolean exhausted = this.free.queued() > 0;
        for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {
            Deque<ProducerBatch> deque = entry.getValue();
            synchronized (deque) {
                ProducerBatch batch = deque.peekFirst();
                if (batch != null) {
                    TopicPartition part = entry.getKey();
                    Node leader = cluster.leaderFor(part);
                    if (leader == null) {
                        //add to the unknown-leader set
                        unknownLeaderTopics.add(part.topic());
                    } else if (!readyNodes.contains(leader) && !isMuted(part, nowMs)) {
                        long waitedTimeMs = batch.waitedTimeMs(nowMs);
                        boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
                        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                        boolean full = deque.size() > 1 || batch.isFull();
                        boolean expired = waitedTimeMs >= timeToWaitMs;
                        //sendable when any of: the batch is full, it has waited long enough, the BufferPool is exhausted (threads are blocking for memory), the accumulator is closed, or a flush is in progress
                        boolean sendable = full || expired || exhausted || closed || flushInProgress();
                        if (sendable && !backingOff) {
                            readyNodes.add(leader);
                        } else {
                            long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                            //time until the next ready check
                            nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
                        }
                    }
                }
            }
        }
        return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
    }

What this code does: decide whether data can be sent now, and if so add the nodes that can receive data to readyNodes. The logic is as follows:

  • Iterate over every topic-partition and find the partition's leader node, because only the leader handles produce requests
  • Data can be sent once any of the following holds:
    • the batch is full
    • the message has waited long enough
    • the accumulator's memory (BufferPool) is exhausted
    • the accumulator has been closed (or a flush is in progress)
  • Return the set of ready nodes, the delay until the next check, and the topics whose leader is unknown

The ready method determines whether the buffered messages can be sent.
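
The full/expired/exhausted conditions above are driven by producer configuration. As a sketch, these are the standard producer settings involved (values illustrative, added to the Properties from the earlier usage example):

// batch.size: the target byte size of a ProducerBatch; reaching it makes the batch "full"
props.put("batch.size", 16384);
// linger.ms: the lingerMs used for the "waited long enough" (expired) check
props.put("linger.ms", 10);
// buffer.memory: bounds the BufferPool; when threads block waiting for memory, exhausted becomes true
props.put("buffer.memory", 33554432);
// retry.backoff.ms: the retryBackoffMs used for the backingOff check on retried batches
props.put("retry.backoff.ms", 100);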

public Map<Integer, List<ProducerBatch>> drain(Cluster cluster, Set<Node> nodes, int maxSize, long now) {
        if (nodes.isEmpty())
            return Collections.emptyMap();

        Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
        for (Node node : nodes) {
            //collect the batches destined for this node
            List<ProducerBatch> ready = drainBatchesForOneNode(cluster, node, maxSize, now);
            batches.put(node.id(), ready);
        }
        return batches;
    }

Because partitions are spread across different nodes, drain groups the messages that go to the same node together.

The rough logic of drainBatchesForOneNode: get the partitions hosted on the node, look up each partition's Deque in the map, apply a few checks (for example whether the request would exceed maxSize), and return the list of ProducerBatch instances that are ready to go.
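
The source of drainBatchesForOneNode is not shown here; a simplified sketch of the idea (omitting retry backoff, muted partitions, fairness, and transactional handling from the real method) could look roughly like this:

// Simplified sketch, not the exact source: take at most one batch from the head of each
// deque whose partition leader is this node, stopping once maxSize would be exceeded.
private List<ProducerBatch> drainBatchesForOneNode(Cluster cluster, Node node, int maxSize, long now) {
    int size = 0;
    List<ProducerBatch> ready = new ArrayList<>();
    for (PartitionInfo part : cluster.partitionsForNode(node.id())) {
        TopicPartition tp = new TopicPartition(part.topic(), part.partition());
        Deque<ProducerBatch> deque = getDeque(tp);
        if (deque == null)
            continue;
        synchronized (deque) {
            ProducerBatch first = deque.peekFirst();
            if (first == null)
                continue;
            // stop if adding this batch would make the request too large
            if (!ready.isEmpty() && size + first.estimatedSizeInBytes() > maxSize)
                break;
            ProducerBatch batch = deque.pollFirst();
            batch.close();                       // no more appends; ready for the wire
            size += batch.records().sizeInBytes();
            ready.add(batch);
        }
    }
    return ready;
}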

   void abortBatches(final RuntimeException reason) {
        for (ProducerBatch batch : incomplete.copyAll()) {
            Deque<ProducerBatch> dq = getDeque(batch.topicPartition);
            synchronized (dq) {
                batch.abortRecordAppends();
                dq.remove(batch);
            }
            batch.abort(reason);
            deallocate(batch);
        }
    }

Messages that have already been handed to the accumulator can still be aborted: the batch is removed from its deque and the onCompletion callback is triggered (with the abort exception).

Accumulator summary: the accumulator is the staging area for outgoing messages. Submitted messages are stored in a map keyed by topic-partition; ready decides whether they should be sent, drain groups the sendable batches by destination node, and if an error occurs the batches can be aborted.

Partitioner

public interface Partitioner extends Configurable, Closeable {

  
    //compute the partition for a record
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster);

   
    public void close();

    //notification that a new batch is about to be created for the topic
    default public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
    }
}

Partitioner is the interface for computing a record's partition. Which partition a message goes to is entirely up to us: we can implement this interface to define a custom partitioning rule, or use the default one. A custom implementation is sketched below.
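
For example, a hypothetical custom partitioner (the class name is made up here) that sends null-key records to partition 0 and hashes keyed records:

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Illustrative only: null-key records always go to partition 0, keyed records are hashed.
public class FixedNullKeyPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null)
            return 0;
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }
}

It would be registered with props.put("partitioner.class", FixedNullKeyPartitioner.class.getName()).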

RoundRobinPartitioner: Round-Robin Partitioning

 private final ConcurrentMap<String, AtomicInteger> topicCounterMap = new ConcurrentHashMap<>();

private int nextValue(String topic) {
        AtomicInteger counter = topicCounterMap.computeIfAbsent(topic, k -> new AtomicInteger(0));
        return counter.getAndIncrement();
    }


public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        //get and increment the per-topic counter
        int nextValue = nextValue(topic);
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (!availablePartitions.isEmpty()) {
            //prefer available partitions: pick one by taking the counter modulo their count
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // no available partitions; fall back to one of the (currently unavailable) partitions
            return Utils.toPositive(nextValue) % numPartitions;
        }
    }

The round-robin implementation is simple: take the per-topic counter modulo the number of partitions. For example, with 3 available partitions, counter values 0, 1, 2, 3, 4 map to partitions 0, 1, 2, 0, 1.

DefaultPartitioner: the Default Partitioner

public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        if (keyBytes == null) {
            return stickyPartitionCache.partition(topic, cluster);
        } 
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    
 
    public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
        stickyPartitionCache.nextPartition(topic, cluster, prevPartition);
    }

If keyBytes (the serialized key) is not null, the partition is computed by hashing the key; otherwise the StickyPartitionCache is used.

++StickyPartitionCache tries to keep sending messages to the same partition until that partition's batch in the accumulator fills up and is sent, and only then switches to another partition. At high send rates this reduces the time spent computing hashes, and as the system keeps running the messages still become evenly spread across partitions.++
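
A rough sketch of the sticky idea (this is not the real StickyPartitionCache source, just an illustration of the strategy): remember one partition per topic, and only switch when onNewBatch reports that the previous batch has been closed.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.common.Cluster;

// Illustrative sticky-partition logic, not the actual StickyPartitionCache source.
public class StickyCacheSketch {
    private final ConcurrentMap<String, Integer> stickyByTopic = new ConcurrentHashMap<>();

    // keep returning the same partition for a topic until nextPartition() changes it
    public int partition(String topic, Cluster cluster) {
        return stickyByTopic.computeIfAbsent(topic,
                t -> ThreadLocalRandom.current().nextInt(cluster.partitionsForTopic(t).size()));
    }

    // called from onNewBatch(): pick a different partition so load evens out over time
    public void nextPartition(String topic, Cluster cluster, int prevPartition) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        int next;
        do {
            next = ThreadLocalRandom.current().nextInt(numPartitions);
        } while (numPartitions > 1 && next == prevPartition);
        stickyByTopic.put(topic, next);
    }
}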

Summary: the above are the built-in partitioning strategies. If none is specified explicitly, DefaultPartitioner is used, computing the partition by hashing the key (and falling back to sticky partitioning when there is no key).

Sender: the Message-Sending Thread

Sender is responsible for taking the messages buffered in the RecordAccumulator and sending them out.

public void run() {
        log.debug("Starting Kafka producer I/O thread.");

        // main loop, runs until close is called
        while (running) {
            try {
                runOnce();
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            }
        }
....

Sender is a Runnable; its run method is a while loop that keeps executing runOnce.

void runOnce() {
//the first part is related to the transaction manager
.....
    if (transaction error) {
		maybeAbortBatches(lastError);
	}
...

long currentTimeMs = time.milliseconds();
        long pollTimeout = sendProducerData(currentTimeMs);
        client.poll(pollTimeout, currentTimeMs);
}

This is pseudocode. The first half deals with transaction handling: if something goes wrong, maybeAbortBatches is called. The real message sending is done by sendProducerData.

The sendProducerData method goes through the following steps:

private long sendProducerData(long now) {
        //fetch the cached cluster metadata (the actual metadata request is asynchronous)
        Cluster cluster = metadata.fetch();
        // find the nodes that have data ready to be sent
        RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

        // if the leader of any partition is not yet known, force a metadata update
        if (!result.unknownLeaderTopics.isEmpty()) {
            for (String topic : result.unknownLeaderTopics)
                this.metadata.add(topic);

            log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
                result.unknownLeaderTopics);
            this.metadata.requestUpdate();
        }

Step one calls accumulator.ready, tying back to the analysis above, and then requests a metadata update for any partitions whose leader could not be found.

Iterator<Node> iter = result.readyNodes.iterator();
        long notReadyTimeout = Long.MAX_VALUE;
        while (iter.hasNext()) {
            Node node = iter.next();
            if (!this.client.ready(node, now)) {
                iter.remove();
                notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
            }
        }

Step two tests connectivity for each ready node and removes the ones that do not qualify.

Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);
        addToInflightBatches(batches);
        if (guaranteeMessageOrder) {
            // Mute all the partitions drained
            for (List<ProducerBatch> batchList : batches.values()) {
                for (ProducerBatch batch : batchList)
                    this.accumulator.mutePartition(batch.topicPartition);
            }
        }

Step three calls drain to group the messages by node. If partition ordering is enabled (guaranteeMessageOrder), the drained partitions are muted so that each partition has at most one batch in flight at a time.
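
guaranteeMessageOrder comes from producer configuration; to the best of my knowledge it is enabled when max.in.flight.requests.per.connection is 1, so a setup like the following (values illustrative) turns on the mute-partition behavior:

// with at most one in-flight request per connection the producer can preserve
// per-partition ordering, so Sender mutes a partition while its batch is in flight
props.put("max.in.flight.requests.per.connection", 1);
props.put("retries", Integer.MAX_VALUE); // retries stay ordered because only one batch is out at a time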

Step four handles expired batches.

Step five calls sendProduceRequests(batches, now).

private void sendProduceRequests(Map<Integer, List<ProducerBatch>> collated, long now) {
        for (Map.Entry<Integer, List<ProducerBatch>> entry : collated.entrySet())
            sendProduceRequest(now, entry.getKey(), acks, requestTimeoutMs, entry.getValue());
    }

private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
  for (ProducerBatch batch : batches) {
            TopicPartition tp = batch.topicPartition;
            MemoryRecords records = batch.records();
	ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
                produceRecordsByPartition, transactionalId);
        //callback invoked when the response for this request arrives
        RequestCompletionHandler callback = response -> handleProduceResponse(response, recordsByPartition, time.milliseconds());

        String nodeId = Integer.toString(destination);
        ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
                requestTimeoutMs, callback);
        client.send(clientRequest, now);
}
}

This calls NetworkClient.send, which ultimately calls KafkaChannel.setSend to stage the request in the channel's send buffer.

NetworkClient

The last line of Sender.runOnce is the following:

client.poll(pollTimeout, currentTimeMs);

It calls NetworkClient.poll. We know that Selector also has a poll method, which selects on I/O events, performs the reads and writes, and collects the results into sets; NetworkClient.poll is then responsible for processing those events.

public List<ClientResponse> poll(long timeout, long now) {
        ensureActive();

        if (!abortedSends.isEmpty()) {
            // if sends were aborted because of an unsupported version or a disconnect, handle them immediately without waiting for Selector.poll
            List<ClientResponse> responses = new ArrayList<>();
            handleAbortedSends(responses);
            completeResponses(responses);
            return responses;
        }

        //compute the minimum time until the next metadata update
        long metadataTimeout = metadataUpdater.maybeUpdate(now);
        try {
            this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));
        } catch (IOException e) {
            log.error("Unexpected error during I/O", e);
        }

        long updatedNow = this.time.milliseconds();
        List<ClientResponse> responses = new ArrayList<>();
        //handle messages that have been sent successfully
        handleCompletedSends(responses, updatedNow);
        //handle messages that have been read successfully (the metadata update mentioned above is processed here)
        handleCompletedReceives(responses, updatedNow);
        //handle nodes whose connection was lost
        handleDisconnections(responses, updatedNow);
        //handle newly established connections
        handleConnections();
        handleInitiateApiVersionRequests(updatedNow);
        handleTimedOutRequests(responses, updatedNow);
        //invoke the callbacks attached to the completed responses
        completeResponses(responses);

        return responses;
    }

As the comments show, the method first takes care of metadata, then runs selector.poll to perform the read and write events, and finally processes the results of those events. Since the producer client is mainly about sending, the read-handling logic is not very complicated.

@Override
        public long maybeUpdate(long now) {
            // time until the next metadata update: 0 if needUpdate was set (update right away), otherwise the remaining time before the metadata expires
            long timeToNextMetadataUpdate = metadata.timeToNextUpdate(now);
            // if the previous metadata request has not been answered yet, wait up to defaultRequestTimeoutMs (30s)
            long waitForMetadataFetch = hasFetchInProgress() ? defaultRequestTimeoutMs : 0;

            long metadataTimeout = Math.max(timeToNextMetadataUpdate, waitForMetadataFetch);

            //the update time has not arrived yet, so return how long we still need to wait
            if (metadataTimeout > 0) {
                return metadataTimeout;
            }

            //pick a relatively idle node to fetch metadata from
            Node node = leastLoadedNode(now);
            if (node == null) {
                log.debug("Give up sending metadata request since no node is available");
                return reconnectBackoffMs;
            }

            //send the metadata request
            return maybeUpdate(now, node);
        }

The first task is requesting metadata from the server: topics, partitions, nodes, and so on. leastLoadedNode picks a relatively idle node to send the request to.

The method ultimately sends the request and computes the waiting time, which is used as the argument to selector.poll.

private void handleCompletedSends(List<ClientResponse> responses, long now) {
        // if no response is expected then when the send is completed, return it
        for (Send send : this.selector.completedSends()) {
            InFlightRequest request = this.inFlightRequests.lastSent(send.destination());
            if (!request.expectResponse) {
                //remove the request from inFlightRequests
                this.inFlightRequests.completeLastSent(send.destination());
                responses.add(request.completed(null, now));
            }
        }
    }

This handles requests whose send has completed; those that expect no response are removed from inFlightRequests.

private void completeResponses(List<ClientResponse> responses) {
        for (ClientResponse response : responses) {
            try {
                response.onComplete();
            } catch (Exception e) {
                log.error("Uncaught error in request completion:", e);
            }
        }
    }

The onComplete callback is invoked. For produce requests this is handleProduceResponse, which was wired up in sendProduceRequest as shown above.

onComplete is mainly about acknowledging the batch; it eventually calls onCompletion, the method we write ourselves, indicating that the message has been delivered to the corresponding node. A typical implementation is sketched below.
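
A typical onCompletion implementation passed to send() (producer is the KafkaProducer from the earlier usage sketch, and demo-topic is a placeholder):

// Invoked on the I/O thread once the broker response for the batch has been handled.
Callback callback = new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            // per completeResponses above, exceptions thrown here are caught and only logged
            exception.printStackTrace();
        } else {
            System.out.printf("acked: %s-%d offset=%d%n",
                    metadata.topic(), metadata.partition(), metadata.offset());
        }
    }
};

producer.send(new ProducerRecord<>("demo-topic", "key", "value"), callback);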

KafkaProducer

@Override
    public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
        ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
        return doSend(interceptedRecord, callback);
    }

We submit messages through send, which first triggers the interceptors' onSend and then enters doSend.

//wait for the metadata to be updated
clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs);
//serialize the key and value
serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
//compute the partition
            int partition = partition(record, serializedKey, serializedValue, cluster);
            tp = new TopicPartition(record.topic(), partition);

Part of the logic has been trimmed; the first half breaks down into the steps above. You can see that the serializers we configured, and the partitioner (the default one if none was specified), are invoked right here. A minimal custom serializer is sketched below.
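
The serializers invoked here are whatever was configured; a minimal custom value serializer (the class name is made up) might look like this, registered with props.put("value.serializer", UpperCaseStringSerializer.class.getName()):

import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.kafka.common.serialization.Serializer;

// Illustrative serializer: upper-cases the value and encodes it as UTF-8 bytes.
public class UpperCaseStringSerializer implements Serializer<String> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public byte[] serialize(String topic, String data) {
        return data == null ? null : data.toUpperCase().getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void close() { }
}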

 private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long maxWaitMs) throws InterruptedException {
        Cluster cluster = metadata.fetch();
        metadata.add(topic);

        //get the number of partitions for the topic
        Integer partitionsCount = cluster.partitionCountForTopic(topic);
        //if the partition count is known and either no partition was specified or the specified partition is within range, return immediately
        if (partitionsCount != null && (partition == null || partition < partitionsCount))
            return new ClusterAndWaitTime(cluster, 0);
	 int version = metadata.requestUpdate();
            //wake up the sender so it sends a metadata request to the remote server
            sender.wakeup();
	metadata.awaitUpdate(version, remainingWaitMs);	

metadata.fetch() is called here; stepping into that method shows it only returns a cached snapshot, because requesting metadata is asynchronous. As described earlier, NetworkClient.poll sends the metadata request, so this method just waits for a while until the other thread has updated the metadata.

 RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey,
                    serializedValue, headers, interceptCallback, remainingWaitMs, true);
            //if the Deque contains no appendable batch we end up here: the previous batch has already been sent out, or this is the first record for the partition
            if (result.abortForNewBatch) {
                int prevPartition = partition;
                //at this point a different partition is chosen (StickyPartitionCache logic)
                partitioner.onNewBatch(record.topic(), cluster, prevPartition);
                partition = partition(record, serializedKey, serializedValue, cluster);
                tp = new TopicPartition(record.topic(), partition);
                if (log.isTraceEnabled()) {
                    log.trace("Retrying append due to new batch creation for topic {} partition {}. The old partition was {}", record.topic(), partition, prevPartition);
                }
                // producer callback will make sure to call both 'callback' and interceptor callback
                interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);

                result = accumulator.append(tp, timestamp, serializedKey,
                    serializedValue, headers, interceptCallback, remainingWaitMs, false);
            }
            
            if (transactionManager != null && transactionManager.isTransactional())
                transactionManager.maybeAddPartitionToTransaction(tp);

            if (result.batchIsFull || result.newBatchCreated) {
                log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
                this.sender.wakeup();
            }
            return result.future;

Then accumulator.append is called. The first call (with abortOnNewBatch set to true) fails when no batch exists yet. To accommodate the sticky-partition logic, onNewBatch picks a new partition and append is called again, this time allowed to create a new batch.

That is the end of KafkaProducer; the rest of the work is done by the Sender thread.

Summary

This article walked through the message-sending flow: starting from our call to send, messages are handed to the accumulator; the Sender thread keeps working, pulling messages out of the accumulator and sending them via KafkaChannel, while Selector.poll handles the read and write events.