RocketMQ: Message Reliability Guarantees and Backlog Handling
I. End-to-End Zero Message Loss in RocketMQ
Message loss can occur at four points along the pipeline: producer send, Broker storage, master-slave replication, and consumer processing.
1. Producer Side: Eliminating Loss in the Send Phase
Core problem
After the producer sends a message, a network error, timeout, or client crash can keep the message from ever being written to the Broker, while the producer wrongly assumes the send succeeded.
Solution
(1) Ordinary messages: strongly consistent send configuration
```yaml
# application.yml - core producer settings
rocketmq:
  producer:
    send-message-timeout: 5000             # 5s timeout, avoids false failures from overly short timeouts
    retry-times-when-send-failed: 3        # retries for failed synchronous sends
    retry-times-when-send-async-failed: 3  # retries for failed asynchronous sends
    retry-next-server: true                # on failure, retry against another Broker (cluster setups)
```
```java
// Synchronous send - the most reliable send mode
public SendResult sendMessage(String topic, String message) {
    SendResult result = rocketMQTemplate.syncSend(topic, message);
    if (result.getSendStatus() != SendStatus.SEND_OK) {
        throw new RuntimeException("Message send failed: " + result.getSendStatus());
    }
    return result;
}
```
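The configuration above also sets retry-times-when-send-async-failed, so for completeness here is a minimal sketch of the asynchronous counterpart, assuming the same rocketMQTemplate field as above. asyncSend returns immediately, so failures only surface in the callback, and any fallback (logging, persisting for compensation) has to happen there rather than via an exception:

```java
// Asynchronous send - higher throughput; reliability depends on handling onException
public void sendMessageAsync(String topic, String message) {
    rocketMQTemplate.asyncSend(topic, message, new SendCallback() {
        @Override
        public void onSuccess(SendResult sendResult) {
            log.info("Async send ok: {}", sendResult.getMsgId());
        }

        @Override
        public void onException(Throwable e) {
            // Client-side retries (retry-times-when-send-async-failed) are already
            // exhausted by the time this fires - persist/compensate here
            log.error("Async send failed, needs compensation", e);
        }
    });
}
```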
(2) Transactional messages: atomicity of the local transaction and the send
When a business operation and a message send must be atomic (e.g. sending an order notification after a successful payment), use transactional messages:
```java
// 1. Transactional message producer
@Slf4j
@Service
public class OrderService {

    @Resource
    private RocketMQTemplate rocketMQTemplate;

    public void createOrder(OrderDTO order) {
        String transactionId = UUID.randomUUID().toString();
        Message<String> message = MessageBuilder.withPayload(JSON.toJSONString(order))
                .setHeader(RocketMQHeaders.TRANSACTION_ID, transactionId)
                .setHeader(RocketMQHeaders.KEYS, order.getOrderId())
                .build();
        // Send the transactional (half) message
        TransactionSendResult result = rocketMQTemplate.sendMessageInTransaction(
                "order-topic",
                message,
                order  // business argument passed through to the listener
        );
        if (result.getLocalTransactionState() != LocalTransactionState.COMMIT_MESSAGE) {
            throw new RuntimeException("Transactional message send failed");
        }
    }
}
```
```java
// 2. Transaction listener (the core: local transaction execution + check-back)
@Slf4j
@RocketMQTransactionListener
public class OrderTransactionListener implements RocketMQLocalTransactionListener {

    @Resource
    private OrderMapper orderMapper;
    @Resource
    private TransactionLogMapper transactionLogMapper;

    @Override
    public RocketMQLocalTransactionState executeLocalTransaction(Message msg, Object arg) {
        String transactionId = msg.getHeaders().get(RocketMQHeaders.TRANSACTION_ID).toString();
        OrderDTO order = (OrderDTO) arg;
        try {
            // 1. Record a transaction log entry (used by the check-back);
            //    named txLog to avoid shadowing the @Slf4j logger
            TransactionLog txLog = new TransactionLog();
            txLog.setTransactionId(transactionId);
            txLog.setBusinessKey(order.getOrderId());
            txLog.setStatus("EXECUTING");
            transactionLogMapper.insert(txLog);
            // 2. Execute the local transaction (e.g. create the order, deduct stock)
            orderMapper.insert(order);
            // 3. Mark the transaction log as successful
            txLog.setStatus("SUCCESS");
            transactionLogMapper.updateById(txLog);
            return RocketMQLocalTransactionState.COMMIT;
        } catch (Exception e) {
            log.error("Local transaction failed", e);
            // Return UNKNOWN so the Broker triggers a check-back
            return RocketMQLocalTransactionState.UNKNOWN;
        }
    }

    @Override
    public RocketMQLocalTransactionState checkLocalTransaction(Message msg) {
        String transactionId = msg.getHeaders().get(RocketMQHeaders.TRANSACTION_ID).toString();
        // Look up the transaction log
        TransactionLog txLog = transactionLogMapper.selectByTransactionId(transactionId);
        if (txLog == null) {
            return RocketMQLocalTransactionState.ROLLBACK;
        }
        if ("SUCCESS".equals(txLog.getStatus())) {
            return RocketMQLocalTransactionState.COMMIT;
        } else if ("FAILED".equals(txLog.getStatus())) {
            return RocketMQLocalTransactionState.ROLLBACK;
        } else {
            // Still executing - wait for the next check-back
            return RocketMQLocalTransactionState.UNKNOWN;
        }
    }
}
```
(3) Last-resort producer safety net: a send log plus scheduled compensation
For extreme scenarios (the whole Broker cluster down, network fully severed), a message send log must back the producer up:
```java
@Slf4j
@Component
public class ReliableMessageProducer {

    @Resource
    private RocketMQTemplate rocketMQTemplate;
    @Resource
    private MessageSendLogMapper logMapper;
    @Resource
    private HealthChecker healthChecker;

    /**
     * Reliable send: persist first, then send.
     */
    public void reliableSend(String topic, String message, String businessKey) {
        // 1. Save a send-log entry (status: pending)
        MessageSendLog sendLog = new MessageSendLog();
        sendLog.setMsgId(UUID.randomUUID().toString());
        sendLog.setBusinessKey(businessKey);
        sendLog.setTopic(topic);
        sendLog.setMessageBody(message);
        sendLog.setStatus(MessageSendStatus.PENDING);
        sendLog.setCreateTime(new Date());
        sendLog.setRetryCount(0);
        logMapper.insert(sendLog);
        // 2. Check MQ cluster health
        if (!healthChecker.isRocketMQAvailable()) {
            log.warn("RocketMQ cluster unavailable; message persisted for compensation: {}", businessKey);
            return;
        }
        // 3. Attempt the send
        trySend(sendLog);
    }

    /**
     * Attempt to send one message.
     */
    private void trySend(MessageSendLog sendLog) {
        try {
            SendResult result = rocketMQTemplate.syncSend(sendLog.getTopic(),
                    MessageBuilder.withPayload(sendLog.getMessageBody())
                            .setHeader(RocketMQHeaders.KEYS, sendLog.getBusinessKey())
                            .build());
            if (result.getSendStatus() == SendStatus.SEND_OK) {
                // Success: update the log entry
                sendLog.setStatus(MessageSendStatus.SUCCESS);
                sendLog.setSendTime(new Date());
                logMapper.updateById(sendLog);
                log.info("Message sent: {}", sendLog.getBusinessKey());
            } else {
                // Broker responded but did not fully confirm (e.g. FLUSH_DISK_TIMEOUT):
                // mark as failed so the compensation job retries it
                sendLog.setRetryCount(sendLog.getRetryCount() + 1);
                sendLog.setStatus(MessageSendStatus.FAILED);
                logMapper.updateById(sendLog);
            }
        } catch (Exception e) {
            log.error("Message send failed: {}", sendLog.getBusinessKey(), e);
            // Bump the retry count
            sendLog.setRetryCount(sendLog.getRetryCount() + 1);
            sendLog.setStatus(MessageSendStatus.FAILED);
            logMapper.updateById(sendLog);
        }
    }

    /**
     * Compensation job: runs every minute.
     */
    @Scheduled(fixedDelay = 60000)
    public void compensateFailedMessages() {
        log.info("Compensating failed messages...");
        // Query messages to compensate (PENDING or FAILED, retryCount < 3)
        List<MessageSendLog> failedLogs = logMapper.selectForCompensate(3, 30);
        for (MessageSendLog sendLog : failedLogs) {
            // Stop if MQ is (still) unavailable
            if (!healthChecker.isRocketMQAvailable()) {
                log.warn("MQ unavailable; compensation paused");
                break;
            }
            trySend(sendLog);
        }
    }
}
```
2. Broker Side: Eliminating Loss in the Storage/Replication Phase
Core problem
After the Broker receives a message, failed master-slave replication, delayed flushing, or a crash can lose it.
Solution
(1) Core Broker settings: synchronous flush + synchronous replication
```properties
# broker.conf - strong-reliability settings
# Flush policy: SYNC_FLUSH returns success only after the message hits disk
flushDiskType=SYNC_FLUSH
# Synchronous replication (return only after the slave has the data)
brokerRole=SYNC_MASTER
# CommitLog flush interval in ms (applies to ASYNC_FLUSH only; ignored under SYNC_FLUSH)
flushIntervalCommitLog=0
# Timeout for a synchronous flush, in milliseconds
syncFlushTimeout=5000
```
(2) DLedger cluster: Raft-based automatic leader election and data consistency
Building on the idea you raised - "return success only after more than half of the replicas have synced" - DLedger mode implements exactly this quorum mechanism:
```properties
# broker.conf (DLedger mode) - 3-node cluster
# Node 1 (broker-a)
enableDLegerCommitLog=true
dLegerGroup=dledger-group
dLegerPeers=n0-192.168.1.10:40911;n1-192.168.1.11:40911;n2-192.168.1.12:40911
dLegerSelfId=n0
# In DLedger mode the master role is elected by Raft, so brokerRole need not be set
flushDiskType=SYNC_FLUSH
# Node 2 (broker-b) - dLegerSelfId=n1
# Node 3 (broker-c) - dLegerSelfId=n2
```
Core advantages of DLedger (the quorum arithmetic is sketched after this list):
- Majority acknowledgement: a message must be written to more than half of the nodes (2 of 3) before success is returned
- Automatic leader election: when the master fails, a new master is elected with zero data loss
- Strong consistency: the Raft protocol keeps all nodes' data consistent
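To make the majority rule concrete, here is a small illustrative sketch of the quorum arithmetic - not DLedger source code, just the invariant it enforces:

```java
// Illustrative only: the arithmetic behind "majority acknowledgement"
public final class QuorumMath {
    /** Smallest number of nodes that forms a majority of `peers`. */
    static int quorum(int peers) {
        return peers / 2 + 1;                 // 3 nodes -> 2, 5 nodes -> 3
    }

    /** An append is durable once acks (leader included) reach the quorum. */
    static boolean committed(int acks, int peers) {
        return acks >= quorum(peers);
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));        // 2
        System.out.println(committed(2, 3));  // true  - survives one node loss
        System.out.println(committed(1, 3));  // false - the leader alone is not enough
    }
}
```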
3. Consumer Side: Eliminating Loss in the Consumption Phase
Core problem
After a consumer receives a message, a business exception or client crash can leave the message unprocessed even though it gets marked as consumed.
Solution
(1) Core consumer settings: manual acknowledgement + retry
```yaml
# application.yml - consumer settings
rocketmq:
  consumer:
    group: business-consumer-group
    # Max retries (beyond this the message goes to the dead-letter queue)
    max-reconsume-times: 3
    # Delay before retrying a failed queue, in ms (orderly consumption)
    suspend-current-queue-time-millis: 3000
    # Consumer thread pool
    consume-thread-min: 20
    consume-thread-max: 50
```
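A listener can also read how many times the Broker has already redelivered a message and react before max-reconsume-times runs out. A hedged fragment (the enclosing listener class, logger, and businessService are assumed, as in the full example below):

```java
// Hedged sketch: differentiated handling based on the broker-side retry count
@Override
public void onMessage(MessageExt message) {
    int retries = message.getReconsumeTimes();  // 0 on first delivery
    if (retries >= 2) {
        // Last delivery before the message would move to the dead-letter queue
        log.warn("Final retry for msgId={}, key={}", message.getMsgId(), message.getKeys());
    }
    // Throwing an exception here makes the framework report RECONSUME_LATER -
    // this is what "manual acknowledgement" means in push mode
    businessService.process(new String(message.getBody(), StandardCharsets.UTF_8));
}
```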
(2) Idempotent consumption (eliminating duplicate processing)
```java
@Slf4j
@Component
@RocketMQMessageListener(
        topic = "business-topic",
        consumerGroup = "business-consumer-group",
        consumeMode = ConsumeMode.CONCURRENTLY,
        messageModel = MessageModel.CLUSTERING
)
public class ReliableConsumer implements RocketMQListener<MessageExt> {

    @Resource
    private RedisTemplate<String, String> redisTemplate;
    @Resource
    private BusinessService businessService;
    @Resource
    private ConsumeLogMapper consumeLogMapper;

    @Override
    public void onMessage(MessageExt message) {
        String msgId = message.getMsgId();
        String businessKey = message.getKeys();
        String body = new String(message.getBody(), StandardCharsets.UTF_8);
        log.info("Message received: msgId={}, businessKey={}", msgId, businessKey);
        // 1. Distributed lock + idempotency check (Redis)
        String lockKey = "rocketmq:consume:" + businessKey;
        Boolean locked = redisTemplate.opsForValue()
                .setIfAbsent(lockKey, msgId, 1, TimeUnit.HOURS);
        if (locked == null || !locked) {
            // Already handled - check whether it is the same message
            String consumedMsgId = redisTemplate.opsForValue().get(lockKey);
            if (msgId.equals(consumedMsgId)) {
                log.info("Message already processed, skipping: {}", businessKey);
                return;
            } else {
                // A different message with the same business key: duplicate key, alert
                log.error("Duplicate business key: {}, new msg: {}, old msg: {}",
                        businessKey, msgId, consumedMsgId);
                throw new RuntimeException("Duplicate business key");
            }
        }
        // Declared before the try block so the catch block can still update it
        ConsumeLog consumeLog = new ConsumeLog();
        try {
            // 2. Record a consume-log row (database, for long-horizon idempotency)
            consumeLog.setMsgId(msgId);
            consumeLog.setBusinessKey(businessKey);
            consumeLog.setStatus(ConsumeStatus.PROCESSING);
            consumeLogMapper.insert(consumeLog);
            // 3. Run the business logic
            businessService.process(body);
            // 4. Update the consume log
            consumeLog.setStatus(ConsumeStatus.SUCCESS);
            consumeLog.setCompleteTime(new Date());
            consumeLogMapper.updateById(consumeLog);
            log.info("Message consumed: {}", businessKey);
        } catch (Exception e) {
            log.error("Message consumption failed: {}", businessKey, e);
            // Release the Redis lock so a retry can run
            redisTemplate.delete(lockKey);
            // Update the consume log
            consumeLog.setStatus(ConsumeStatus.FAILED);
            consumeLog.setErrorMsg(e.getMessage());
            consumeLogMapper.updateById(consumeLog);
            // Rethrow so RocketMQ schedules a retry
            throw new RuntimeException("Consumption failed", e);
        }
    }
}
```
(3) Dead-letter queue backstop: messages that exhaust their retries are not lost
```java
@Slf4j
@Component
@RocketMQMessageListener(
        topic = "%DLQ%business-consumer-group",  // dead-letter queue topic
        consumerGroup = "dlq-consumer-group"
)
public class DeadLetterConsumer implements RocketMQListener<MessageExt> {

    @Resource
    private DeadLetterMapper deadLetterMapper;
    @Resource
    private AlertService alertService;
    @Resource
    private RocketMQTemplate rocketMQTemplate;  // needed by retryDeadMessage below

    @Override
    public void onMessage(MessageExt message) {
        String msgId = message.getMsgId();
        String businessKey = message.getKeys();
        String body = new String(message.getBody(), StandardCharsets.UTF_8);
        // The original topic travels with retried/dead-lettered messages
        // in the RETRY_TOPIC property
        String originTopic = message.getProperty(MessageConst.PROPERTY_RETRY_TOPIC);
        log.error("Dead-letter message received: msgId={}, businessKey={}, topic={}",
                msgId, businessKey, originTopic);
        // 1. Persist the dead-letter message
        DeadLetter deadLetter = new DeadLetter();
        deadLetter.setMsgId(msgId);
        deadLetter.setBusinessKey(businessKey);
        deadLetter.setOriginTopic(originTopic);
        deadLetter.setMessageBody(body);
        deadLetter.setCreateTime(new Date());
        deadLetterMapper.insert(deadLetter);
        // 2. Raise an alert
        alertService.sendAlert(String.format("Dead letter produced: %s, key: %s", originTopic, businessKey));
        // 3. Decide by business type whether to retry automatically
        if (shouldAutoRetry(originTopic, businessKey)) {
            retryDeadMessage(deadLetter);
        }
    }

    private boolean shouldAutoRetry(String topic, String businessKey) {
        // Decide per business rules whether an automatic retry makes sense
        return true;
    }

    private void retryDeadMessage(DeadLetter deadLetter) {
        // Manually resend to the original topic
        rocketMQTemplate.syncSend(deadLetter.getOriginTopic(),
                deadLetter.getMessageBody());
    }
}
```
4. Extreme-Case Backstop: the Whole Broker Cluster Is Unavailable
For the scenario you raised - "the MQ cluster is down, write messages to Redis / the database" - build a tiered degradation strategy:
```java
@Slf4j
@Component
public class MessageReliabilityGuardian {

    @Resource
    private RocketMQTemplate rocketMQTemplate;
    @Resource
    private RedisTemplate<String, String> redisTemplate;
    @Resource
    private MessageBackupMapper backupMapper;
    @Resource
    private HealthChecker healthChecker;

    /**
     * Tiered degradation send.
     */
    public void sendWithDegradation(String topic, String message, String businessKey) {
        // Check MQ health
        MQHealthStatus status = healthChecker.checkMQHealth();
        switch (status.getLevel()) {
            case HEALTHY:
                // Normal path
                sendToMQ(topic, message, businessKey);
                break;
            case DEGRADED:
                // MQ is struggling: stage the message in Redis
                sendToRedis(topic, message, businessKey);
                break;
            case DOWN:
                // MQ is fully unavailable: fall back to the database
                sendToDatabase(topic, message, businessKey);
                break;
        }
    }

    private void sendToMQ(String topic, String message, String businessKey) {
        rocketMQTemplate.syncSend(topic,
                MessageBuilder.withPayload(message)
                        .setHeader(RocketMQHeaders.KEYS, businessKey)
                        .build());
    }

    /**
     * Redis staging (for brief outages). The staged entries live in one list,
     * which the recovery job drains in FIFO order.
     */
    private void sendToRedis(String topic, String message, String businessKey) {
        redisTemplate.opsForList().leftPush("mq:backup:queue",
                JSON.toJSONString(new BackupMessage(topic, message, businessKey)));
    }

    /**
     * Database backstop (for prolonged outages).
     */
    private void sendToDatabase(String topic, String message, String businessKey) {
        MessageBackup backup = new MessageBackup();
        backup.setMsgId(UUID.randomUUID().toString());
        backup.setBusinessKey(businessKey);
        backup.setTopic(topic);
        backup.setMessageBody(message);
        backup.setStatus(BackupStatus.PENDING);
        backup.setCreateTime(new Date());
        backupMapper.insert(backup);
    }

    /**
     * Recovery job: resend once MQ comes back.
     */
    @Scheduled(fixedDelay = 30000)  // every 30 seconds
    public void recoveryTask() {
        if (!healthChecker.isRocketMQAvailable()) {
            return;
        }
        // 1. Replay messages staged in Redis
        recoverFromRedis();
        // 2. Replay messages persisted in the database
        recoverFromDatabase();
    }

    private void recoverFromRedis() {
        while (true) {
            String json = redisTemplate.opsForList().rightPop("mq:backup:queue");
            if (json == null) {
                break;
            }
            BackupMessage backup = JSON.parseObject(json, BackupMessage.class);
            try {
                rocketMQTemplate.syncSend(backup.getTopic(), backup.getMessage());
                log.info("Recovered message from Redis: {}", backup.getBusinessKey());
            } catch (Exception e) {
                log.error("Failed to recover message from Redis: {}", backup.getBusinessKey(), e);
                // Push it back and stop until the next run
                redisTemplate.opsForList().leftPush("mq:backup:queue", json);
                break;
            }
        }
    }

    private void recoverFromDatabase() {
        List<MessageBackup> pending = backupMapper.selectByStatus(BackupStatus.PENDING);
        for (MessageBackup backup : pending) {
            try {
                rocketMQTemplate.syncSend(backup.getTopic(), backup.getMessageBody());
                backup.setStatus(BackupStatus.SUCCESS);
                backupMapper.updateById(backup);
            } catch (Exception e) {
                log.error("Failed to recover message from DB: {}", backup.getBusinessKey(), e);
                break;  // MQ is likely flapping; retry on the next run
            }
        }
    }
}
```
Zero-Loss Configuration Cheat Sheet

| Stage | Strategy | Key config / code | Effect |
|---|---|---|---|
| Producer | Sync send + retry | retry-times-when-send-failed: 3 | Automatic retry on network jitter |
| Producer | Transactional messages | sendMessageInTransaction | Atomicity of local transaction and send |
| Producer | Send log + compensation | @Scheduled + database | Last-resort backstop for extreme cases |
| Broker | Synchronous flush | flushDiskType=SYNC_FLUSH | No loss of in-memory data on crash |
| Broker | Synchronous replication | brokerRole=SYNC_MASTER | Return only after master-slave sync |
| Broker | DLedger cluster | dLegerPeers + majority quorum | Automatic failover on node failure |
| Consumer | Manual acknowledgement | Throw on failure | Failed consumption retried automatically |
| Consumer | Idempotent consumption | Redis distributed lock | No duplicate processing |
| Consumer | Dead-letter queue | %DLQ%consumer_group | Nothing lost after retries are exhausted |
II. Best Practices for Handling Message Backlogs
A backlog fundamentally means consumption is slower than production; different situations call for different tactics (a back-of-envelope sizing sketch follows).
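As a quick sanity check on severity, a drain-time estimate helps decide between the tactics below: the backlog divided by the net drain rate. The numbers here are purely illustrative:

```java
// Illustrative numbers only - plug in your own rates
long backlog     = 1_000_000;  // messages currently accumulated
long produceTps  = 2_000;      // incoming messages per second
long consumeTps  = 5_000;      // total consume rate after scaling out
// The net drain rate must be positive, otherwise the backlog keeps growing
long drainSeconds = backlog / (consumeTps - produceTps);  // ~333 s, about 5.5 minutes
```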
1. 堆积诊断:快速定位问题
# 1. 查看消费堆积情况
sh mqadmin consumerProgress -n 127.0.0.1:9876 -g consumer_group
# 输出示例:
# #Consumer Group #Topic #Broker #Diff #LastTime
# consumer_group topic_test broker-a 15000 2024-01-01 10:30:25
# consumer_group topic_test broker-b 12000 2024-01-01 10:30:25
# 2. 查看Topic队列分布
sh mqadmin topicStatus -n 127.0.0.1:9876 -t topic_test
# 3. 查看消费者连接情况
sh mqadmin consumerConnection -n 127.0.0.1:9876 -g consumer_group
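The same Diff figure can also be read programmatically, which is useful for dashboards or the HPA metric later in this article. A sketch against the tools-module admin client (the ConsumeStats package path shown is the 4.x one and differs in 5.x):

```java
import org.apache.rocketmq.common.admin.ConsumeStats;      // 4.x package path
import org.apache.rocketmq.tools.admin.DefaultMQAdminExt;

public class LagProbe {
    /** Total lag of one consumer group - the sum of the Diff column above. */
    public static long totalLag(String namesrvAddr, String consumerGroup) throws Exception {
        DefaultMQAdminExt admin = new DefaultMQAdminExt();
        admin.setNamesrvAddr(namesrvAddr);
        admin.start();
        try {
            ConsumeStats stats = admin.examineConsumeStats(consumerGroup);
            return stats.computeTotalDiff();
        } finally {
            admin.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(totalLag("127.0.0.1:9876", "consumer_group"));
    }
}
```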
2. Handling a Temporary Backlog (Enough Queues)
Scenario: consumer count < queue count, so scaling out resolves it quickly

```bash
# Step 1: confirm the queue count
# e.g. 16 queues but only 4 consumers: scale out to 16
# Step 2: scale the consumers (K8s example)
kubectl scale deployment consumer-deployment --replicas=16
```
Step 3: tune per-consumer throughput

```yaml
rocketmq:
  consumer:
    # Larger consume thread pool
    consume-thread-min: 40
    consume-thread-max: 80
    # Enable batch consumption
    consume-message-batch-max-size: 32
```
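For reference, batch consumption on the native push consumer looks like the sketch below; consumeMessageBatchMaxSize caps how many messages a single listener invocation receives. Topic and group names are placeholders:

```java
import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class BatchConsumerDemo {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("batch_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("business-topic", "*");
        // Up to 32 messages are handed to one listener call
        consumer.setConsumeMessageBatchMaxSize(32);
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            // Process the whole batch, e.g. one batched DB insert instead of 32 singles
            for (MessageExt msg : msgs) {
                // handle(msg);
            }
            // Note: RECONSUME_LATER here would redeliver the entire batch
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```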
3. Handling a Long-Term Backlog (Too Few Queues)
Scenario: consumer count ≥ queue count yet consumption still lags, so more queues are needed
```java
/**
 * Message transfer: move the backlog to a topic with more queues.
 * Create the new topic first: mqadmin updateTopic -t topic_new -r 16 -w 16
 */
@Slf4j
@Component
public class MessageTransferService {

    private DefaultLitePullConsumer consumer;
    private DefaultMQProducer producer;

    @PostConstruct
    public void init() throws Exception {
        // Build the pull consumer and producer once, not on every scheduled run
        consumer = new DefaultLitePullConsumer("transfer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("topic_old", "*");
        consumer.setPullBatchSize(100);   // pull up to 100 messages per poll
        consumer.setAutoCommit(false);    // commit offsets only after the resend succeeds
        consumer.start();

        producer = new DefaultMQProducer("transfer_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.start();
    }

    /**
     * Transfer loop: consume the old topic, resend to the new one.
     */
    @Scheduled(fixedDelay = 1000)  // every second
    public void transferMessages() {
        try {
            List<MessageExt> messages = consumer.poll(5000);
            if (messages.isEmpty()) {
                return;
            }
            // Re-wrap the batch for the new topic, keeping tags and keys
            List<Message> transferBatch = new ArrayList<>();
            for (MessageExt msg : messages) {
                transferBatch.add(new Message(
                        "topic_new",
                        msg.getTags(),
                        msg.getKeys(),
                        msg.getBody()
                ));
            }
            // Batch send (all messages in one batch must share the same topic)
            producer.send(transferBatch);
            // Commit offsets only after the batch landed in the new topic
            consumer.commitSync();
            log.info("Transferred {} messages", messages.size());
        } catch (Exception e) {
            log.error("Message transfer failed", e);
        }
    }

    @PreDestroy
    public void shutdown() {
        consumer.shutdown();
        producer.shutdown();
    }
}
```
4. Production-Grade Backlog Prevention
(1) Topic creation standard: pre-allocate enough queues

```bash
# Production topics: 32 queues is a sensible starting point
sh mqadmin updateTopic -n 127.0.0.1:9876 -c DefaultCluster \
  -t order_topic_prod -r 32 -w 32 \
  --order false --perm 6
```
(2) Consumer-side rate limiting
```java
// Note: @RocketMQMessageListener must annotate the Spring bean class itself,
// so the rate limiter lives directly on the listener rather than a wrapper class
@Slf4j
@Component
@RocketMQMessageListener(
        topic = "throttle_topic",
        consumerGroup = "throttle_group"
)
public class ThrottledConsumer implements RocketMQListener<MessageExt> {

    // Guava RateLimiter: at most 1000 permits/second across all consumer threads
    private final RateLimiter rateLimiter = RateLimiter.create(1000);

    @Override
    public void onMessage(MessageExt message) {
        // Block until a permit is available - this is the throttle
        rateLimiter.acquire();
        try {
            // Business processing
            process(message);
        } catch (Exception e) {
            log.error("Consumption failed", e);
            throw new RuntimeException(e);
        }
    }

    private void process(MessageExt message) {
        // business logic goes here
    }
}
```
(3) Monitoring and alerting

```text
# Prometheus + Grafana monitoring
# Backlog metric exposed by a RocketMQ exporter (the exact name varies by exporter version)
rocketmq_consumer_offset_diff{consumerGroup="order_group"} 10000
```
```yaml
# Prometheus alerting rule (routed through Alertmanager)
groups:
  - name: rocketmq_alerts
    rules:
      - alert: MessageAccumulationHigh
        expr: rocketmq_consumer_offset_diff > 5000
        for: 5m
        annotations:
          summary: "Severe message backlog: {{ $labels.consumerGroup }}"
```
(4) Elastic scaling (K8s HPA)

```yaml
# HorizontalPodAutoscaler - requires an external-metrics adapter serving the lag metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rocketmq-consumer
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: rocketmq_consumer_offset_diff
          selector:
            matchLabels:
              consumerGroup: order_group
        target:
          type: AverageValue
          averageValue: 1000
```
Backlog-Handling Decision Tree

```mermaid
graph TD
    A[Backlog detected] --> B{Enough queues?}
    B -->|yes| C[Scale out consumer instances]
    B -->|no| D[Create a new topic with more queues]
    C --> E[Tune consumer threads / batch consumption]
    D --> F[Write a message-transfer program]
    F --> G[Move the backlog to the new topic]
    G --> H[Scale out consumers on the new topic]
    E & H --> I[Is the backlog shrinking?]
    I -->|no| J[Profile the business logic]
    J --> K[Optimize DB / SQL / third-party calls]
    K --> I
    I -->|yes| L[Recovered]
```
III. Key Takeaways
1. Principles for zero message loss
- Producer: synchronous send + retry + transactional messages (critical flows) + send-log backstop
- Broker: synchronous flush + synchronous replication + DLedger cluster (3-node majority)
- Consumer: manual acknowledgement + idempotent consumption + dead-letter backstop
- Last line of defense: write to DB/Redis while MQ is down, auto-compensate after recovery
2. The three moves against a backlog
- Scale out: when consumer count ≤ queue count, add consumer instances
- Transfer: when queues are the bottleneck, move messages to a topic with more queues
- Optimize: batch consumption, async processing, rate limiting, index tuning
3. Must-have production configuration checklist

```yaml
# 1. Broker strong reliability
flushDiskType: SYNC_FLUSH
brokerRole: SYNC_MASTER
# 2. Producer retries
retry-times-when-send-failed: 3
send-message-timeout: 5000
# 3. Consumer retries and idempotency
max-reconsume-times: 3
consume-thread-max: 50
# 4. Monitoring and alerting
backlog alert threshold: 5000
dead-letter alerting: enabled
```