A Deep Dive into Kafka's Message Send Path


Kafka's Send Method

[Figure: the KafkaProducer send API]

Overall Send Flow

The Kafka producer sends messages asynchronously and returns a Future that represents the outcome of the send. In addition, the caller can supply a callback that is invoked when the Kafka broker acknowledges the record. This looks simple on the surface, but quite a lot happens along the way. The overall flow:

[Figure: overall flow of a producer send]

  1. The producer passes the record through the configured interceptors.
  2. The serializers convert the record's key and value into byte arrays.
  3. The default or configured partitioner computes the topic partition when none is specified.
  4. The RecordAccumulator appends the record to a producer batch, applying the configured compression algorithm.
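
Before diving into the internals, a minimal usage sketch may help make the Future-plus-callback contract concrete. The bootstrap address and topic name below are placeholders, not values from this walkthrough:

import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerQuickStart {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("demo-topic", "key", "value");
            // The send is asynchronous; the callback fires once the broker
            // acknowledges the record (or the send fails).
            Future<RecordMetadata> future = producer.send(record, (metadata, exception) -> {
                if (exception != null)
                    exception.printStackTrace();
                else
                    System.out.printf("sent to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
            });
            future.get(); // block only if the caller needs the result synchronously
        }
    }
}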

After step 4 the message is still sitting in memory; it has not been sent to a Kafka broker. The RecordAccumulator groups the buffered messages by topic and partition.

[Figure: messages grouped by topic partition inside the RecordAccumulator]

The sender thread groups batches destined for the same broker (the partition leader) into a single request and sends them together. Only at this point do the messages actually leave the client.

[Figure: the main path from Producer to Broker]

The figure above shows the main path from producing a message to its arrival at the broker. The producer creates the message, serializes and compresses it, and appends it to the local RecordAccumulator. The Sender keeps polling the accumulator and, once certain conditions are met, ships the queued data to the partition leader. The Sender sends data to the broker when either of two conditions holds (both thresholds are configurable, as the snippet after this list shows):

  • The accumulated batch size reaches the size threshold (batch.size)
  • A message has waited longer than the time threshold (linger.ms)
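
These two thresholds map directly to the batch.size and linger.ms producer settings; a brief tuning sketch follows, assuming a props object as in the usage sketch above (the values are illustrative, not recommendations):

// Size threshold: a batch becomes sendable once it reaches batch.size bytes (default 16384).
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);
// Time threshold: a batch that is not yet full is sent after lingering linger.ms milliseconds (default 0).
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);

Raising linger.ms trades a little latency for better batching and compression, while batch.size caps how large a single batch may grow.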

The producer creates one double-ended queue per partition to buffer client messages. Each element of the queue is a record batch (ProducerBatch): createdMs records the batch's creation time (the time its first message was appended), and topicPartition holds the corresponding partition's metadata. A produced message, once serialized, is first written into the batch's recordsBuilder object. As soon as a batch in the queue reaches the size threshold, the Sender ships it to the partition's leader node; a batch that has waited long enough is likewise sent to the leader even if it is not full.
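
A deliberately simplified model may make this structure easier to picture. The class and field names below mirror the concepts above, but this is a toy sketch, not the real Kafka implementation (oversize-record handling, futures, and memory pooling are all omitted):

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch: one deque of batches per partition; appends always target the tail batch.
class ToyBatch {
    final long createdMs;                      // when the first record was appended
    final List<byte[]> records = new ArrayList<>();
    int bytes = 0;

    ToyBatch(long now) { this.createdMs = now; }

    boolean tryAppend(byte[] payload, int sizeLimit) {
        if (bytes + payload.length > sizeLimit) return false; // batch is full
        records.add(payload);
        bytes += payload.length;
        return true;
    }
}

class ToyAccumulator {
    static final int BATCH_SIZE = 16 * 1024;   // mirrors batch.size
    final Map<String, Deque<ToyBatch>> batchesPerPartition = new ConcurrentHashMap<>();

    void append(String topicPartition, byte[] payload) {
        Deque<ToyBatch> dq = batchesPerPartition.computeIfAbsent(topicPartition, k -> new ArrayDeque<>());
        synchronized (dq) {
            ToyBatch last = dq.peekLast();
            if (last == null || !last.tryAppend(payload, BATCH_SIZE)) {
                // No batch yet, or the tail batch is full: open a new batch at the tail.
                ToyBatch fresh = new ToyBatch(System.currentTimeMillis());
                fresh.tryAppend(payload, BATCH_SIZE);
                dq.addLast(fresh);
            }
        }
    }
}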

Code Walkthrough

Kafka's send method

In org.apache.kafka.clients.producer.KafkaProducer#send():

@Override
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {

   // intercept the record, which can be potentially modified; this method does not throw exceptions
   ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);

   return doSend(interceptedRecord, callback);
}
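
The interceptors invoked here are ordinary implementations of the public ProducerInterceptor interface, registered through interceptor.classes. A minimal sketch (the class and header names are made-up examples):

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TimestampHeaderInterceptor implements ProducerInterceptor<String, String> {
    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // Runs on the send path before serialization; may return a modified record.
        record.headers().add("x-sent-at", Long.toString(System.currentTimeMillis()).getBytes());
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // Runs when the broker acknowledges the record, or when the send fails.
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}

It would be registered via props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, TimestampHeaderInterceptor.class.getName()).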

The actual send logic lives in org.apache.kafka.clients.producer.KafkaProducer#doSend:

private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
    TopicPartition tp = null;
    try {
        // Throw an IllegalStateException if the producer has already been closed.
        throwIfProducerClosed();
        // first make sure the metadata for the topic is available
        long nowMs = time.milliseconds();
        ClusterAndWaitTime clusterAndWaitTime;
        try {
            // Make sure metadata for the target topic is available; if the producer
            // is closed while waiting, a KafkaException is thrown.
            clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), nowMs, maxBlockTimeMs);
        } catch (KafkaException e) {
            if (metadata.isClosed())
                throw new KafkaException("Producer closed while send in progress", e);
            throw e;
        }

        nowMs += clusterAndWaitTime.waitedOnMetadataMs;
        long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);

        Cluster cluster = clusterAndWaitTime.cluster;
        byte[] serializedKey;
        try {
            // Serialize the record key.
            serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
        } catch (ClassCastException cce) {
            throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
                    " to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
                    " specified in key.serializer", cce);
        }
        byte[] serializedValue;
        try {
            // Serialize the record value.
            serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
        } catch (ClassCastException cce) {
            throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
                    " to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
                    " specified in value.serializer", cce);
        }
        // Determine the target partition from the serialized key/value and the cluster metadata.
        int partition = partition(record, serializedKey, serializedValue, cluster);
        // The TopicPartition the record will be sent to.
        tp = new TopicPartition(record.topic(), partition);

        // Mark the headers read-only so they cannot be modified while the send is in flight.
        setReadOnly(record.headers());
        Header[] headers = record.headers().toArray();
        // Estimate the serialized size and make sure it is within bounds;
        // otherwise a RecordTooLargeException is thrown.
        int serializedSize = AbstractRecords.estimateSizeInBytesUpperBound(apiVersions.maxUsableProduceMagic(),
                compressionType, serializedKey, serializedValue, headers);
        ensureValidRecordSize(serializedSize);
        // Use the current time as the timestamp if the record does not carry one.
        long timestamp = record.timestamp() == null ? nowMs : record.timestamp();
        if (log.isTraceEnabled()) {
            log.trace("Attempting to append record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
        }
        // producer callback will make sure to call both 'callback' and interceptor callback
        Callback interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
        // If transactions are enabled, register the target partition with the transaction manager.
        if (transactionManager != null) {
            transactionManager.maybeAddPartition(tp);
        }
        // Append the record to the RecordAccumulator. If the accumulator asks for a
        // new batch, the partition is recomputed and the append is retried below.
        RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey,
                serializedValue, headers, interceptCallback, remainingWaitMs, true, nowMs);

        // The append was aborted so that a new batch can be created: give the partitioner
        // a chance to choose a different partition (onNewBatch), then retry the append.
        if (result.abortForNewBatch) {
            int prevPartition = partition;
            partitioner.onNewBatch(record.topic(), cluster, prevPartition);
            partition = partition(record, serializedKey, serializedValue, cluster);

            tp = new TopicPartition(record.topic(), partition);
            if (log.isTraceEnabled()) {
                log.trace("Retrying append due to new batch creation for topic {} partition {}. The old partition was {}", record.topic(), partition, prevPartition);
            }
            // producer callback will make sure to call both 'callback' and interceptor callback
            interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);

            result = accumulator.append(tp, timestamp, serializedKey,
                    serializedValue, headers, interceptCallback, remainingWaitMs, false, nowMs);
        }

        // batchIsFull means the current batch is full; newBatchCreated means a new batch
        // was created. Either way, the sender thread has work to do and should be woken up.
        if (result.batchIsFull || result.newBatchCreated) {
            log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
            // Wake the sender thread so it can start sending.
            this.sender.wakeup();
        }
        return result.future;

    } catch (Exception e) {
        // ....
    }
}
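
The partition(...) call above delegates to the configured Partitioner unless the record carries an explicit partition. A custom strategy is just a small class implementing the public Partitioner interface; here is a hedged sketch that hashes keyed records the way the default partitioner does (the keyless fallback to partition 0 is a simplification, not the default behaviour):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null)
            return 0; // simplification; the real default spreads keyless records around
        // murmur2 hashing, as used for keyed records by the default partitioner.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}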

Appending the Message

accumulator.append() (note: this listing comes from a slightly older client version, so its parameter list differs from the call site in doSend above):

public RecordAppendResult append(TopicPartition tp,
                                 long timestamp,
                                 byte[] key,
                                 byte[] value,
                                 Header[] headers,
                                 Callback callback,
                                 long maxTimeToBlock) throws InterruptedException {
    // Track the number of appending threads so that batches are not lost
    // when incomplete batches are aborted.
    appendsInProgress.incrementAndGet();
    ByteBuffer buffer = null;
    if (headers == null) headers = Record.EMPTY_HEADERS;
    try {
        // Each partition maps to exactly one deque; get it, creating it if necessary.
        Deque<ProducerBatch> dq = getOrCreateDeque(tp);
        // Lock the deque and try to append to its last batch.
        synchronized (dq) {
            if (closed)
                throw new KafkaException("Producer closed while send in progress");
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
            if (appendResult != null)
                return appendResult;
        }

        byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
        int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
        log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
        // Allocate a buffer of at least batch.size (default 16384 bytes, i.e. 16 KB),
        // or larger if a single record exceeds that.
        buffer = free.allocate(size, maxTimeToBlock);
        synchronized (dq) {
            if (closed)
                throw new KafkaException("Producer closed while send in progress");

            // Retry the append: another thread may have created a usable batch
            // while this one was waiting on the buffer allocation.
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
            if (appendResult != null) {
                return appendResult;
            }

            MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
            ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, time.milliseconds());
            // Both tryAppend attempts above failed (e.g. this is the first record for
            // the partition), so append into the freshly created batch.
            FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, headers, callback, time.milliseconds()));
            // Add the new batch to the tail of the deque.
            dq.addLast(batch);
            incomplete.add(batch);

            buffer = null;
            // Note how the result is built: the batch counts as "full" when the deque
            // holds more than one batch, or when the batch itself has filled up.
            return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true);
        }
    } finally {
        if (buffer != null)
            free.deallocate(buffer);
        appendsInProgress.decrementAndGet();
    }
}

To append a message, the accumulator first obtains the queue that the partition belongs to, then looks at the last batch in that queue. If the queue holds no batch, or the last batch has reached its size threshold, a new batch is created and added at the tail. Batches created earlier fill up first; the batch at the tail holds the newest messages, and appends always go to the tail of the queue. Note also that tryAppend runs twice, once before and once after the buffer allocation: the deque lock is released while memory is allocated, so another thread may have created a usable batch in the meantime. The record accumulator merely buffers the client's messages; it takes the Sender to actually deliver them to the partition leader.

The Sender: Shipping the Messages

The Sender keeps polling the record accumulator and, once the conditions are met, sends the queued data to the partition leader.
It reads the accumulator to obtain the list of batches for each leader node, finds the broker nodes that are ready and establishes connections to them, and then sends each partition's batches to its leader. The core of the Sender:

void runOnce() {
    // If a transaction manager is configured, run the transactional bookkeeping first.
    if (transactionManager != null) {
        try {
            // Resolve any in-doubt sequence numbers.
            transactionManager.maybeResolveSequences();
            // If the transaction manager is in a fatal state, abort in-flight
            // batches if needed, poll once, and bail out.
            if (transactionManager.hasFatalError()) {
                RuntimeException lastError = transactionManager.lastError();
                if (lastError != null)
                    maybeAbortBatches(lastError);
                client.poll(retryBackoffMs, time.milliseconds());
                return;
            }

            // Check whether a new producerId is needed; if so, an InitProducerId
            // request will be sent. Then try to send a transactional request and
            // poll for its response; if one was sent, return.
            transactionManager.bumpIdempotentEpochAndResetIdIfNeeded();
            if (maybeSendAndPollTransactionalRequest()) {
                return;
            }
        } catch (AuthenticationException e) {
            // This is already logged as error, but propagated here to perform any clean ups.
            log.trace("Authentication exception while processing transactional request", e);
            transactionManager.authenticationFailed(e);
        }
    }

    // Send the produce data.
    long currentTimeMs = time.milliseconds();
    long pollTimeout = sendProducerData(currentTimeMs);
    // Poll the network client and process broker responses.
    client.poll(pollTimeout, currentTimeMs);
}
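
Most of runOnce() is transactional bookkeeping, which only applies when the producer was configured with a transactional.id. For context, a sketch of such a producer (the id, topic, and address are placeholders; imports as in the earlier usage sketch, plus org.apache.kafka.common.KafkaException):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-tx-id");      // activates the transaction manager
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (Producer<String, String> producer = new KafkaProducer<>(props)) {
    producer.initTransactions();            // obtains the producerId (InitProducerId request)
    producer.beginTransaction();
    try {
        producer.send(new ProducerRecord<>("demo-topic", "k", "v"));
        producer.commitTransaction();
    } catch (KafkaException e) {
        // Loosely corresponds to maybeAbortBatches above; for fatal errors such as
        // ProducerFencedException the producer should be closed instead.
        producer.abortTransaction();
    }
}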

The sendProducerData method

private long sendProducerData(long now) {
    // Fetch the current cluster metadata.
    Cluster cluster = metadata.fetch();
    // Work out which partitions have batches ready to send.
    RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

    // If any partition's leader is unknown, force a metadata update. The set of topics
    // with an unknown leader contains topics whose leader election is pending as well
    // as topics whose metadata may have expired.
    if (!result.unknownLeaderTopics.isEmpty()) {
        for (String topic : result.unknownLeaderTopics)
            this.metadata.add(topic, now);
        log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
            result.unknownLeaderTopics);
        this.metadata.requestUpdate();
    }

    // Remove nodes that are not ready yet, and compute how long the next poll may block.
    Iterator<Node> iter = result.readyNodes.iterator();
    long notReadyTimeout = Long.MAX_VALUE;
    while (iter.hasNext()) {
        Node node = iter.next();
        if (!this.client.ready(node, now)) {
            iter.remove();
            notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
        }
    }

    // Drain the batches to send from the RecordAccumulator, grouped by node, and track
    // them as in-flight. If message ordering must be guaranteed, mute the drained
    // partitions so that no further batches are drained for them.
    Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);
    addToInflightBatches(batches);
    if (guaranteeMessageOrder) {
        // Mute all the partitions drained
        for (List<ProducerBatch> batchList : batches.values()) {
            for (ProducerBatch batch : batchList)
                this.accumulator.mutePartition(batch.topicPartition);
        }
    }

    accumulator.resetNextBatchExpiryTime();
    // Collect expired batches, both in-flight and still sitting in the accumulator,
    // and fail them with a TimeoutException.
    List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
    List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
    expiredBatches.addAll(expiredInflightBatches);
    if (!expiredBatches.isEmpty())
        log.trace("Expired {} batches in accumulator", expiredBatches.size());

    for (ProducerBatch expiredBatch : expiredBatches) {
        String errorMessage = "Expiring " + expiredBatch.recordCount + " record(s) for " + expiredBatch.topicPartition
            + ":" + (now - expiredBatch.createdMs) + " ms has passed since batch creation";
        failBatch(expiredBatch, new TimeoutException(errorMessage), false);
        if (transactionManager != null && expiredBatch.inRetry()) {
            // This ensures that no new batches are drained until the current in-flight
            // batches are fully resolved.
            transactionManager.markSequenceUnresolved(expiredBatch);
        }
    }

    // Update produce-request metrics.
    sensors.updateProduceRequestMetrics(batches);
    // Compute the timeout for the next poll; if any node already has data ready
    // to send, poll with a zero timeout.
    long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
    pollTimeout = Math.min(pollTimeout, this.accumulator.nextExpiryTimeMs() - now);
    pollTimeout = Math.max(pollTimeout, 0);
    if (!result.readyNodes.isEmpty()) {
        log.trace("Nodes with data ready to send: {}", result.readyNodes);
        pollTimeout = 0;
    }

    // Send the produce requests.
    sendProduceRequests(batches, now);
    return pollTimeout;
}


Sending the Produce Request

private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
    if (batches.isEmpty())
        return;

    Map<TopicPartition, MemoryRecords> produceRecordsByPartition = new HashMap<>(batches.size());
    final Map<TopicPartition, ProducerBatch> recordsByPartition = new HashMap<>(batches.size());

    // Find the minimum magic (record-format) version across the batches so that
    // the request can be built at a version every batch supports.
    byte minUsedMagic = apiVersions.maxUsableProduceMagic();
    for (ProducerBatch batch : batches) {
        if (batch.magic() < minUsedMagic)
            minUsedMagic = batch.magic();
    }

    for (ProducerBatch batch : batches) {
        TopicPartition tp = batch.topicPartition;
        MemoryRecords records = batch.records();

        // Down-convert records written with a newer magic to the chosen version.
        if (!records.hasMatchingMagic(minUsedMagic))
            records = batch.records().downConvert(minUsedMagic, 0, time).records();
        produceRecordsByPartition.put(tp, records);
        recordsByPartition.put(tp, batch);
    }

    String transactionalId = null;
    if (transactionManager != null && transactionManager.isTransactional()) {
        transactionalId = transactionManager.transactionalId();
    }
    // Build the produce request.
    ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
            produceRecordsByPartition, transactionalId);
    RequestCompletionHandler callback = new RequestCompletionHandler() {
        public void onComplete(ClientResponse response) {
            // Process the produce response.
            handleProduceResponse(response, recordsByPartition, time.milliseconds());
        }
    };

    String nodeId = Integer.toString(destination);
    // The completion handler created above is attached here, so it runs when the
    // response arrives. Note that a response is only expected when acks != 0.
    ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
            requestTimeoutMs, callback);
    // Hand the request to the network client.
    client.send(clientRequest, now);
    log.trace("Sent produce request to {}: {}", nodeId, requestBuilder);
}
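
The acks argument here comes from the producer's acks setting, and as the code above shows, a response is only expected when acks != 0. For reference, appended to the same props object as the earlier sketches:

// acks=0  : fire-and-forget; the broker sends no response at all.
// acks=1  : the partition leader acknowledges after its local write.
// acks=all: the leader acknowledges only after all in-sync replicas have the record.
props.put(ProducerConfig.ACKS_CONFIG, "all");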

Handling the Response

handleProduceResponse

private void handleProduceResponse(ClientResponse response, Map<TopicPartition, ProducerBatch> batches, long now) {
    RequestHeader requestHeader = response.requestHeader();
    long receivedTimeMs = response.receivedTimeMs();
    int correlationId = requestHeader.correlationId();
    // The connection was dropped: fail every batch with NETWORK_EXCEPTION.
    if (response.wasDisconnected()) {
        log.trace("Cancelled request with header {} due to node {} being disconnected",
                requestHeader, response.destination());
        for (ProducerBatch batch : batches.values())
            completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION), correlationId, now, 0L);
    // The broker does not support the request version: fail with UNSUPPORTED_VERSION.
    } else if (response.versionMismatch() != null) {
        log.warn("Cancelled request {} due to a version mismatch with node {}",
                response, response.destination(), response.versionMismatch());
        for (ProducerBatch batch : batches.values())
            completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.UNSUPPORTED_VERSION), correlationId, now, 0L);
    } else {
        log.trace("Received produce response from node {} with correlation id {}", response.destination(), correlationId);
        // A response body is present (acks != 0): a successful send lands here, and
        // each batch is completed with its per-partition response.
        if (response.hasResponse()) {
            ProduceResponse produceResponse = (ProduceResponse) response.responseBody();
            for (Map.Entry<TopicPartition, ProduceResponse.PartitionResponse> entry : produceResponse.responses().entrySet()) {
                TopicPartition tp = entry.getKey();
                ProduceResponse.PartitionResponse partResp = entry.getValue();
                ProducerBatch batch = batches.get(tp);
                completeBatch(batch, partResp, correlationId, now, receivedTimeMs + produceResponse.throttleTimeMs());
            }
            this.sensors.recordLatency(response.destination(), response.requestLatencyMs());
        } else {
            // No response body means acks == 0: simply complete every batch without error.
            for (ProducerBatch batch : batches.values()) {
                completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NONE), correlationId, now, 0L);
            }
        }
    }
}
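
Whichever branch completeBatch takes, the outcome ultimately surfaces through the Future and callback returned from send(). A small sketch of consuming the result synchronously, reusing the producer and record from the usage sketch at the top:

import java.util.concurrent.ExecutionException;

try {
    RecordMetadata metadata = producer.send(record).get();
    // A normal completion carries the partition and offset assigned by the broker.
    System.out.printf("acked at %s-%d@%d%n", metadata.topic(), metadata.partition(), metadata.offset());
} catch (ExecutionException e) {
    // Errors from the paths above (e.g. NETWORK_EXCEPTION, or a TimeoutException
    // from batch expiry) arrive here as the cause of the ExecutionException.
    e.getCause().printStackTrace();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}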