RocketMQ: 7 Duplicate-Consumption Scenarios, Explained from the Source Code


Preface

Whenever MQ shows up in business development, our reflex is to think about avoiding duplicate consumption and keeping consumption idempotent.

Even a mature message middleware like RocketMQ cannot completely avoid duplicate consumption in real production environments, so this article walks through the situations in which RocketMQ may deliver a message more than once.


When Duplicate Consumption Can Happen


1. Producer Retries After a Failed Send

org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl#sendKernelImpl


In synchronous send mode, network jitter and similar issues can lead to a situation where the message is actually delivered to the broker, but the Producer does not receive the response in time and hits a timeout.

The Producer then assumes the send failed and retries; by default it retries 2 more times.

This is where the messageQueue selection strategy comes in: when picking a messageQueue for a retry, RocketMQ prefers a messageQueue on a broker different from the one used in the previous attempt, which raises the odds of the retry succeeding.

So the retry is very likely to succeed, which means the same message has now been sent twice, and the two copies have different msgIds, making duplicate consumption on the business side very easy to trigger.

Further reading: RocketMQ's messageQueue selection strategy when sending messages
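To make the effect concrete, here is a minimal producer sketch (the group, topic, and order id are made-up illustration values, not from the article). Since the retried copies end up with different msgIds, the usual remedy is to stamp each message with a stable business key via KEYS and deduplicate on that key on the consumer side.

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class RetryAwareProducer {
  public static void main(String[] args) throws Exception {
    DefaultMQProducer producer = new DefaultMQProducer("demo_producer_group");
    producer.setNamesrvAddr("127.0.0.1:9876");
    // Number of extra attempts after a failed/timed-out synchronous send; default is 2.
    producer.setRetryTimesWhenSendFailed(2);
    producer.start();

    // A stable business identifier, carried as the message KEYS.
    String bizOrderId = "ORDER-20230905-0001";
    Message msg = new Message("demo_topic", "TagA", bizOrderId, "order payload".getBytes());

    // If this send times out and is retried, the broker may end up storing two copies
    // with different msgIds, but both copies carry the same KEYS, so the consumer can
    // deduplicate on the business id instead of on msgId.
    SendResult result = producer.send(msg);
    System.out.println(result);

    producer.shutdown();
  }
}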


2. Consumer Throws an Exception During Consumption

// org.apache.rocketmq.client.impl.consumer.DefaultMQPushConsumerImpl#pullMessage
public void pullMessage(final PullRequest pullRequest) {

  // ......

  PullCallback pullCallback = new PullCallback() {
    @Override
    public void onSuccess(PullResult pullResult) {
      if (pullResult != null) {
        pullResult = DefaultMQPushConsumerImpl.this.pullAPIWrapper.processPullResult(pullRequest.getMessageQueue(), pullResult, subscriptionData);

        switch (pullResult.getPullStatus()) {
          case FOUND:
            // todo Messages pulled successfully
            long prevRequestOffset = pullRequest.getNextOffset();
            pullRequest.setNextOffset(pullResult.getNextBeginOffset());
            long pullRT = System.currentTimeMillis() - beginTimestamp;
            DefaultMQPushConsumerImpl.this.getConsumerStatsManager().incPullRT(pullRequest.getConsumerGroup(),
                                                                               pullRequest.getMessageQueue().getTopic(), pullRT);

            long firstMsgOffset = Long.MAX_VALUE;
            if (pullResult.getMsgFoundList() == null || pullResult.getMsgFoundList().isEmpty()) {
              DefaultMQPushConsumerImpl.this.executePullRequestImmediately(pullRequest);
            } else {
              firstMsgOffset = pullResult.getMsgFoundList().get(0).getQueueOffset();

              DefaultMQPushConsumerImpl.this.getConsumerStatsManager().incPullTPS(pullRequest.getConsumerGroup(),
                                                                                  pullRequest.getMessageQueue().getTopic(), pullResult.getMsgFoundList().size());

              boolean dispatchToConsume = processQueue.putMessage(pullResult.getMsgFoundList());
              
              // todo Submit the consume request and hand the messages to the Consumer
              DefaultMQPushConsumerImpl.this.consumeMessageService.submitConsumeRequest(
                pullResult.getMsgFoundList(),
                processQueue,
                pullRequest.getMessageQueue(),
                dispatchToConsume);

            // ......

            break;
          // ......
          default:
            break;
        }
      }
    }

    // ......
  };

  // ......
}

After the Consumer client successfully pulls messages, it submits a consume request, i.e. submitConsumeRequest

// org.apache.rocketmq.client.impl.consumer.ConsumeMessageConcurrentlyService.ConsumeRequest#run
class ConsumeRequest implements Runnable {

  // Messages waiting to be consumed
  private final List<MessageExt> msgs;

  public void run() {
    // ......

    ConsumeConcurrentlyStatus status = null;
    try {
      // .....

      // todo Call back into the Consumer's listener to run the actual business logic
      status = listener.consumeMessage(Collections.unmodifiableList(msgs), context);
    } catch (Throwable e) {
      log.warn(String.format("consumeMessage exception: %s Group: %s Msgs: %s MQ: %s",
                             RemotingHelper.exceptionSimpleDesc(e),
                             ConsumeMessageConcurrentlyService.this.consumerGroup,
                             msgs,
                             messageQueue), e);
      hasException = true;
    }

    // ...... (status is not changed here)

    if (null == status) {
      log.warn("consumeMessage return null, Group: {} Msgs: {} MQ: {}",
               ConsumeMessageConcurrentlyService.this.consumerGroup,
               msgs,
               messageQueue);
      // todo RECONSUME_LATER: consume again later
      status = ConsumeConcurrentlyStatus.RECONSUME_LATER;
    }

    // ......
    
    if (!processQueue.isDropped()) {
      // todo Process the consume result
      ConsumeMessageConcurrentlyService.this.processConsumeResult(status, context, this);
    } else {
      log.warn("processQueue is dropped without process consume result. messageQueue={}, msgs={}", messageQueue, msgs);
    }
  }
}

As the code above shows, when listener.consumeMessage throws an exception, status stays null, and when null == status it is reset to RECONSUME_LATER.
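For reference, the listener.consumeMessage being called back here is the business listener registered via registerMessageListener. A minimal sketch of such a listener (handleBusiness is a hypothetical business handler, and the group/topic names are illustrative) shows that throwing out of consumeMessage ends up having the same effect as explicitly returning RECONSUME_LATER:

import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyContext;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class DemoConsumer {
  public static void main(String[] args) throws Exception {
    DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("demo_consumer_group");
    consumer.setNamesrvAddr("127.0.0.1:9876");
    consumer.subscribe("demo_topic", "*");
    consumer.registerMessageListener(new MessageListenerConcurrently() {
      @Override
      public ConsumeConcurrentlyStatus consumeMessage(List<MessageExt> msgs,
                                                      ConsumeConcurrentlyContext context) {
        for (MessageExt msg : msgs) {
          try {
            handleBusiness(msg); // hypothetical business handler
          } catch (Exception e) {
            // Returning RECONSUME_LATER explicitly ends the same way as letting the
            // exception escape: status would stay null and be reset to RECONSUME_LATER.
            return ConsumeConcurrentlyStatus.RECONSUME_LATER;
          }
        }
        return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
      }
    });
    consumer.start();
  }

  private static void handleBusiness(MessageExt msg) {
    // business logic goes here
  }
}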

Then in processConsumeResult, where the consume result is handled 👇🏻

public void processConsumeResult(
  final ConsumeConcurrentlyStatus status,
  final ConsumeConcurrentlyContext context,
  final ConsumeRequest consumeRequest
) {

  // Defaults to Integer.MAX_VALUE
  int ackIndex = context.getAckIndex();

  if (consumeRequest.getMsgs().isEmpty())
    return;

  switch (status) {
    case CONSUME_SUCCESS: // consumed successfully
      if (ackIndex >= consumeRequest.getMsgs().size()) {
        ackIndex = consumeRequest.getMsgs().size() - 1;
      }
      int ok = ackIndex + 1;
      int failed = consumeRequest.getMsgs().size() - ok;
      this.getConsumerStatsManager().incConsumeOKTPS(consumerGroup, consumeRequest.getMessageQueue().getTopic(), ok);
      this.getConsumerStatsManager().incConsumeFailedTPS(consumerGroup, consumeRequest.getMessageQueue().getTopic(), failed);
      break;
    case RECONSUME_LATER: // consume again later
      // Reset ackIndex to -1
      ackIndex = -1;
      this.getConsumerStatsManager().incConsumeFailedTPS(consumerGroup, consumeRequest.getMessageQueue().getTopic(),
                                                         consumeRequest.getMsgs().size());
      break;
    default:
      break;
  }

  switch (this.defaultMQPushConsumer.getMessageModel()) {
    case BROADCASTING:
      for (int i = ackIndex + 1; i < consumeRequest.getMsgs().size(); i++) {
        MessageExt msg = consumeRequest.getMsgs().get(i);
        log.warn("BROADCASTING, the message consume failed, drop it, {}", msg.toString());
      }
      break;
    case CLUSTERING: // cluster mode
      List<MessageExt> msgBackFailed = new ArrayList<MessageExt>(consumeRequest.getMsgs().size());

      // todo On success, ackIndex = consumeRequest.getMsgs().size() - 1, so the for loop is skipped
      // On failure, ackIndex = -1, so ackIndex + 1 = 0 and the loop re-sends every message
      for (int i = ackIndex + 1; i < consumeRequest.getMsgs().size(); i++) {
        MessageExt msg = consumeRequest.getMsgs().get(i);
        // todo Send the message back so it will be consumed again later
        boolean result = this.sendMessageBack(msg, context);
        if (!result) {
          msg.setReconsumeTimes(msg.getReconsumeTimes() + 1);
          msgBackFailed.add(msg);
        }
      }

      if (!msgBackFailed.isEmpty()) {
        consumeRequest.getMsgs().removeAll(msgBackFailed);

        this.submitConsumeRequestLater(msgBackFailed, consumeRequest.getProcessQueue(), consumeRequest.getMessageQueue());
      }
      break;
    default:
      break;
  }

  // .....
}

As the source shows, when status is RECONSUME_LATER, ackIndex is set to -1.

In cluster mode, this means every message in the pulled batch is iterated over and sent back one by one, to be consumed again later.

The problem in this scenario is that messages are pulled in batches, and each batch may contain multiple messages. If an exception occurs while consuming the 2nd, 3rd, ... message, the whole batch is sent back and consumed again, so the messages that had already been consumed successfully get consumed a second time.

Now back to submitConsumeRequest mentioned earlier, shown below.

Although each pull can fetch multiple messages, submitConsumeRequest involves a key parameter: consumeBatchSize.

consumeBatchSize controls how many messages are consumed per ConsumeRequest and defaults to 1, i.e. by default only one message is consumed at a time, so the scenario described above 👆🏻 does not arise and no duplicate consumption occurs.

// org.apache.rocketmq.client.impl.consumer.ConsumeMessageConcurrentlyService#submitConsumeRequest
public void submitConsumeRequest(
  final List<MessageExt> msgs,
  final ProcessQueue processQueue,
  final MessageQueue messageQueue,
  final boolean dispatchToConsume) {
  // todo Defaults to 1
  final int consumeBatchSize = this.defaultMQPushConsumer.getConsumeMessageBatchMaxSize();
  if (msgs.size() <= consumeBatchSize) {
    ConsumeRequest consumeRequest = new ConsumeRequest(msgs, processQueue, messageQueue);
    try {
      this.consumeExecutor.submit(consumeRequest);
    } catch (RejectedExecutionException e) {
      this.submitConsumeRequestLater(consumeRequest);
    }
  } else {
    // consumeBatchSize < msgs.size()
    for (int total = 0; total < msgs.size(); ) {
      List<MessageExt> msgThis = new ArrayList<MessageExt>(consumeBatchSize);
      
      // Consume consumeBatchSize messages per request
      for (int i = 0; i < consumeBatchSize; i++, total++) {
        if (total < msgs.size()) {
          msgThis.add(msgs.get(total));
        } else {
          break;
        }
      }

      // By default msgThis contains only one element
      ConsumeRequest consumeRequest = new ConsumeRequest(msgThis, processQueue, messageQueue);
      try {
        this.consumeExecutor.submit(consumeRequest);
      } catch (RejectedExecutionException e) {
        for (; total < msgs.size(); total++) {
          msgThis.add(msgs.get(total));
        }

        this.submitConsumeRequestLater(consumeRequest);
      }
    }
  }
}
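In other words, consumeMessageBatchMaxSize is the knob that decides how much collateral re-consumption a single failure can cause. A small configuration sketch (the group name is illustrative; the defaults stated in the comments match the code above):

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

public class BatchSizeConfig {
  // Illustrative helper; the method and class names are not from the article.
  static DefaultMQPushConsumer newConsumer() {
    DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("demo_consumer_group");
    // consumeMessageBatchMaxSize defaults to 1: each ConsumeRequest carries a single
    // message, so a failure only causes that one message to be sent back.
    // Raising it trades this isolation for throughput: one failure sends the whole
    // batch back for re-consumption.
    consumer.setConsumeMessageBatchMaxSize(1);
    // pullBatchSize (default 32) is how many messages one pull fetches from the
    // broker; this is the "batch pull" mentioned above, a separate knob.
    consumer.setPullBatchSize(32);
    return consumer;
  }
}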



3. Consumer Fails to Commit Its Offset

If you are not familiar with RocketMQ's offset mechanism, read this first -> RocketMQ offset management mechanism

Continuing from above: still in processConsumeResult, after the messages have been processed, updateOffset is eventually called.

// org.apache.rocketmq.client.impl.consumer.ConsumeMessageConcurrentlyService#processConsumeResult
public void processConsumeResult(
  final ConsumeConcurrentlyStatus status,
  final ConsumeConcurrentlyContext context,
  final ConsumeRequest consumeRequest
) {
  // ......

  long offset = consumeRequest.getProcessQueue().removeMessage(consumeRequest.getMsgs());
  if (offset >= 0 && !consumeRequest.getProcessQueue().isDropped()) {
    // todo Update the offset
    this.defaultMQPushConsumerImpl.getOffsetStore().updateOffset(consumeRequest.getMessageQueue(), offset, true);
  }
}

Broadcast mode uses LocalFileOffsetStore, while cluster mode uses RemoteBrokerOffsetStore.


Let's take cluster mode as the example.

public class RemoteBrokerOffsetStore implements OffsetStore {

  // key: messageQueue, value: its current offset
  private ConcurrentMap<MessageQueue, AtomicLong> offsetTable =
    new ConcurrentHashMap<MessageQueue, AtomicLong>();

  @Override
  public void updateOffset(MessageQueue mq, long offset, boolean increaseOnly) {
    if (mq != null) {
      // todo Get the current offset for this mq from offsetTable
      AtomicLong offsetOld = this.offsetTable.get(mq);
      if (null == offsetOld) {
        // Not present yet, initialize it
        offsetOld = this.offsetTable.putIfAbsent(mq, new AtomicLong(offset));
      }

      if (null != offsetOld) {
        if (increaseOnly) {
          // todo Update the offset
          MixAll.compareAndIncreaseOnly(offsetOld, offset);
        } else {
          offsetOld.set(offset);
        }
      }
    }
  }
}

As you can see, the offset is ultimately updated only in the in-memory offsetTable.

When the Consumer starts, it schedules a task that syncs the in-memory offsetTable to the broker every 5s.

this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {

  @Override
  public void run() {
    try {
      // todo Sync the in-memory offsetTable to the broker every 5s
      MQClientInstance.this.persistAllConsumerOffset();
    } catch (Exception e) {
      log.error("ScheduledTask persistAllConsumerOffset exception", e);
    }
  }
}, 1000 * 10, this.clientConfig.getPersistConsumerOffsetInterval(), TimeUnit.MILLISECONDS);

And that opens the door to a problem!

After the Consumer finishes processing messages, the offset is not synced to the broker in real time but only via this scheduled task.

So if the service crashes before the latest offset has been synced to the broker, then after a restart consumption can only resume from the broker's older offset, which causes duplicate consumption.
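The 5s comes from persistConsumerOffsetInterval on the client config. Shortening it narrows the window of offsets that can be lost on a crash, but it never closes that window, so consumer-side idempotency is still the real fix. A sketch, assuming the default values (class and method names are illustrative):

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

public class OffsetPersistConfig {
  static DefaultMQPushConsumer newConsumer() {
    DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("demo_consumer_group");
    // Interval (ms) at which the in-memory offsetTable is synced to the broker
    // (cluster mode) or flushed to the local file (broadcast mode). Default 5000.
    // A smaller value shrinks, but cannot eliminate, the offsets lost on a crash,
    // so idempotent consumption is still required.
    consumer.setPersistConsumerOffsetInterval(5000);
    return consumer;
  }
}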


4. Broker Fails to Persist the Offset

Continuing from the Consumer syncing its offsets to the broker:

public class ConsumerOffsetManager extends ConfigManager {

  // todo key: topic@group, value: a map of <messageQueueId, offset>
  protected ConcurrentMap<String/* topic@group */, ConcurrentMap<Integer, Long>> offsetTable =
    new ConcurrentHashMap<String, ConcurrentMap<Integer, Long>>(512);
  
}

When the broker receives the Consumer's offset-sync request, it likewise only updates its local in-memory table.

It also schedules a task that runs every 5s to persist the in-memory offsetTable to a file.

// org.apache.rocketmq.broker.BrokerController#initialize
public boolean initialize() throws CloneNotSupportedException {
  
  // ......
  
  this.scheduledExecutorService.scheduleAtFixedRate(() -> {
    try {
      BrokerController.this.consumerOffsetManager.persist();
    } catch (Throwable e) {
      log.error("schedule persist consumerOffset error.", e);
    }
  }, 1000 * 10, this.brokerConfig.getFlushConsumerOffsetInterval(), TimeUnit.MILLISECONDS);
  
  // ......
  
}

The duplicate-consumption scenario here is just as clear: if the broker crashes, the latest offsets may not have been persisted yet, so up to 5s worth of offsets can be lost. After the broker restarts, the "latest" offset the Consumer reads from it is actually stale, which again leads to duplicate consumption.


5. Master-Slave Offset Sync Fails

As is well known, RocketMQ supports a master-slave deployment, and a master-slave setup necessarily involves data synchronization between the two nodes.

By default, a RocketMQ slave sends a sync request to the master every 10s, and the synced data includes offsets.

// org.apache.rocketmq.broker.BrokerController#handleSlaveSynchronize
private void handleSlaveSynchronize(BrokerRole role) {
  // If this node is a slave
  if (role == BrokerRole.SLAVE) {
    if (null != slaveSyncFuture) {
      slaveSyncFuture.cancel(false);
    }
    
    this.slaveSynchronize.setMasterAddr(null);
    slaveSyncFuture = this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        try {
          // todo Sync data from the master
          BrokerController.this.slaveSynchronize.syncAll();
        }
        catch (Throwable e) {
          log.error("ScheduledTask SlaveSynchronize syncAll error.", e);
        }
      }
    }, 1000 * 3, 1000 * 10, TimeUnit.MILLISECONDS);
  } else {
    //handle the slave synchronise
    if (null != slaveSyncFuture) {
      slaveSyncFuture.cancel(false);
    }
    this.slaveSynchronize.setMasterAddr(null);
  }
}


By the same logic, if the master goes down, the slave may be missing up to 10s of the latest offsets. If the slave is then promoted to master, the "latest" offset the Consumer fetches from it is stale, and duplicate consumption follows.


6. Consumer Rebalancing

In RocketMQ, a topic usually has multiple messageQueues and a ConsumerGroup usually has multiple Consumers. The process of distributing those messageQueues sensibly among the consumers of a ConsumerGroup is called rebalancing.


After a Consumer finishes consuming messages, it normally needs to update the offset:

public void processConsumeResult(
  final ConsumeConcurrentlyStatus status,
  final ConsumeConcurrentlyContext context,
  final ConsumeRequest consumeRequest
) {
  // ......
  
  long offset = consumeRequest.getProcessQueue().removeMessage(consumeRequest.getMsgs());
  
  // If offset >= 0 and the processQueue has not been dropped, update the offset
  if (offset >= 0 && !consumeRequest.getProcessQueue().isDropped()) {
    this.defaultMQPushConsumerImpl.getOffsetStore().updateOffset(consumeRequest.getMessageQueue(), offset, true);
  }
}

But while a rebalance is in progress, the Consumer is still consuming messages, so it can happen that **just as the Consumer finishes consuming and is about to update the offset, the processQueue has already been marked dropped; the latest offset is therefore never updated, and duplicate consumption can follow after the rebalance**.


7. Minimum Offset Commit

In RocketMQ, the Consumer pulls a batch of messages from the Broker and, by default, submits them one by one to a thread pool for consumption.

For example: the Consumer pulls 3 messages and submits them to the thread pool, where thread1 consumes msg1, thread2 consumes msg2, and thread3 consumes msg3.

thread3 is fast and finishes before thread1 and thread2; once done, it removes its message from the processQueue and calls updateOffset.

At this point, should the offset be updated to msg3's offset?

No. To make sure no message is lost, the offset that gets updated is still the one corresponding to msg1, i.e. the minimum offset is committed.


Putting the example together: under the minimum-offset mechanism, thread3's successful consumption commits msg1's offset. If the client restarts at that moment, the offset it fetches from the broker again is still msg1's, so msg3 ends up being consumed a second time.
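The minimum-offset behavior is easy to model with a simplified stand-in for ProcessQueue (this is only a sketch that mimics the idea of keeping in-flight messages sorted by queue offset; it is not the real ProcessQueue code):

import java.util.TreeMap;

public class MinOffsetSketch {
  public static void main(String[] args) {
    // queueOffset -> message payload, kept sorted like ProcessQueue's internal tree map
    TreeMap<Long, String> inFlight = new TreeMap<>();
    inFlight.put(100L, "msg1");
    inFlight.put(101L, "msg2");
    inFlight.put(102L, "msg3");

    // thread3 finishes msg3 first and removes it from the in-flight set
    inFlight.remove(102L);

    // The offset that may be committed is the smallest offset still in flight,
    // i.e. msg1's offset, not msg3's. (The real ProcessQueue returns the max
    // offset + 1 only once the in-flight set is empty.) If the client restarts
    // now, consumption resumes from 100 and msg3 is delivered again.
    long committableOffset = inFlight.isEmpty() ? 103L : inFlight.firstKey();
    System.out.println("commit offset = " + committableOffset); // prints 100
  }
}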


Summary

In this article we walked through 7 scenarios in which RocketMQ may consume a message more than once; keeping message handling idempotent is a long road.
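Since none of these windows can be fully closed on the RocketMQ side, the practical safety net is business-level idempotency in the consumer, keyed on a stable business identifier rather than on msgId (which differs between producer retries). A minimal sketch; the in-memory set merely stands in for a durable dedup store such as a database unique index or Redis, and all names here are illustrative:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.rocketmq.common.message.MessageExt;

public class IdempotentHandler {
  // In production this would be a durable store (DB unique index, Redis SETNX, ...);
  // an in-memory set is only enough for a single-instance demo.
  private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

  public void handle(MessageExt msg) {
    // Dedupe on the business id carried in KEYS, not on msgId: producer retries
    // produce different msgIds for what is logically the same message.
    String bizKey = msg.getKeys();
    if (bizKey != null && !processedKeys.add(bizKey)) {
      return; // duplicate: this business key was already handled, skip the side effects
    }
    doBusiness(msg); // hypothetical business logic
  }

  private void doBusiness(MessageExt msg) {
    // real side effects go here
  }
}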

I'm Code皮皮虾, and I'll keep learning and improving together with you! If you found this article useful, you can follow me on 掘金 so you won't miss future technical posts.