
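Class hierarchy of the fault-tolerance strategy

[Figure: class hierarchy diagram of the fault-tolerance strategy classes]

LatencyFaultTolerance: the latency fault-tolerance interface
LatencyFaultToleranceImpl: the implementation of that interface, which carries the actual fault-tolerance logic
MQFaultStrategy: the fault-tolerance strategy that RocketMQ provides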

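Source code analysis

MQFaultStrategy

MQFaultStrategy mainly maintains the following state:

The send latency observed for each Broker
The switch that enables send-latency fault tolerance
The mapping from latency level to unavailability duration

Basic definition of the MQFaultStrategy class: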

public class MQFaultStrategy {
    private final static InternalLogger log = ClientLogger.getLog();
	
    /**
     * Tracks the send latency of each Broker.
     * key: brokerName
     */
    private final LatencyFaultTolerance<String> latencyFaultTolerance = new LatencyFaultToleranceImpl();

    /**
     * Switch for send-latency fault tolerance
     */
    private boolean sendLatencyFaultEnable = false;

    /**
     * Latency level thresholds (ms)
     */
    private long[] latencyMax = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};

    /**
     * Unavailability duration (ms) for each latency level
     */
    private long[] notAvailableDuration = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};

    public long[] getNotAvailableDuration() {
        return notAvailableDuration;
    }

    public void setNotAvailableDuration(final long[] notAvailableDuration) {
        this.notAvailableDuration = notAvailableDuration;
    }

    public long[] getLatencyMax() {
        return latencyMax;
    }

    public void setLatencyMax(final long[] latencyMax) {
        this.latencyMax = latencyMax;
    }

    public boolean isSendLatencyFaultEnable() {
        return sendLatencyFaultEnable;
    }

    public void setSendLatencyFaultEnable(final boolean sendLatencyFaultEnable) {
        this.sendLatencyFaultEnable = sendLatencyFaultEnable;
    }
}

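The method below computes the unavailability duration for a given latency. It is a plain table lookup; the table's contents are listed at the end of this section: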

private long computeNotAvailableDuration(final long currentLatency) {
    for (int i = latencyMax.length - 1; i >= 0; i--) {
        if (currentLatency >= latencyMax[i])
            return this.notAvailableDuration[i];
    }

    return 0;
}
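
As a minimal standalone sketch of this lookup, the snippet below maps a few sample latencies to their unavailability duration. The class name NotAvailableDurationDemo and the sample values are hypothetical and not part of RocketMQ; the two arrays are copied from the defaults shown earlier.

```java
// Hypothetical demo class; only the two arrays come from MQFaultStrategy as shown above.
class NotAvailableDurationDemo {
    // Default thresholds and durations, both in milliseconds.
    static final long[] LATENCY_MAX = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};
    static final long[] NOT_AVAILABLE_DURATION = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};

    // Scan from the largest threshold down; the first threshold the latency reaches wins.
    static long computeNotAvailableDuration(long currentLatency) {
        for (int i = LATENCY_MAX.length - 1; i >= 0; i--) {
            if (currentLatency >= LATENCY_MAX[i]) {
                return NOT_AVAILABLE_DURATION[i];
            }
        }
        return 0L; // latency is below the smallest threshold
    }

    public static void main(String[] args) {
        long[] samples = {10L, 99L, 600L, 2500L, 30000L};
        for (long latency : samples) {
            System.out.println(latency + " ms -> " + computeNotAvailableDuration(latency) + " ms");
        }
    }
}
```

Scanning from the largest threshold downward means a latency is matched against the highest level it reaches, so 600 ms falls in the 550 ms level rather than the 50 ms one.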

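The method below updates the fault-tolerance record for a Broker. It takes the measured latency and an "isolation" flag. When isolation is true, the table lookup uses a fixed latency of 30000 ms (30 s) instead of the measured value, which with the default arrays yields an unavailability duration of 600000 ms:

public void updateFaultItem(final String brokerName, final long currentLatency, boolean isolation) {
    if (this.sendLatencyFaultEnable) {
        // With isolation on, look up the duration as if the latency were 30000 ms
        long duration = computeNotAvailableDuration(isolation ? 30000 : currentLatency);
        // Record the Broker's latency and how long it should be treated as unavailable
        this.latencyFaultTolerance.updateFaultItem(brokerName, currentLatency, duration);
    }
}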
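
To make the effect of that call concrete, here is a simplified sketch of a store that remembers, per Broker, when it may be used again. This is an assumption-laden stand-in, not RocketMQ's LatencyFaultToleranceImpl: the class SimpleLatencyFaultStore, its FaultItem record, and the injectable clock are all hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical, simplified stand-in for a latency fault-tolerance store; not RocketMQ's class.
class SimpleLatencyFaultStore {
    // One record per broker: the last observed latency and when the broker may be used again.
    private static final class FaultItem {
        final long currentLatency;
        final long availableAt;

        FaultItem(long currentLatency, long availableAt) {
            this.currentLatency = currentLatency;
            this.availableAt = availableAt;
        }
    }

    private final Map<String, FaultItem> faultItems = new ConcurrentHashMap<>();
    private final LongSupplier clock; // injectable so the behaviour can be checked without sleeping

    SimpleLatencyFaultStore(LongSupplier clock) {
        this.clock = clock;
    }

    // Record the latest latency and push the broker's "available again" time into the future.
    void updateFaultItem(String brokerName, long currentLatency, long notAvailableDuration) {
        long availableAt = clock.getAsLong() + notAvailableDuration;
        faultItems.put(brokerName, new FaultItem(currentLatency, availableAt));
    }

    // A broker with no record, or whose window has elapsed, counts as available.
    boolean isAvailable(String brokerName) {
        FaultItem item = faultItems.get(brokerName);
        return item == null || clock.getAsLong() >= item.availableAt;
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock in milliseconds
        SimpleLatencyFaultStore store = new SimpleLatencyFaultStore(() -> now[0]);

        store.updateFaultItem("broker-a", 600L, 30000L);
        System.out.println(store.isAvailable("broker-a")); // still inside the 30 s window
        now[0] = 30000L;
        System.out.println(store.isAvailable("broker-a")); // the window has elapsed
    }
}
```

Passing the clock in as a LongSupplier is purely a testing convenience in this sketch; a real client would read the system time.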

Core logic for selecting a message queue from a TopicPublishInfo:

    public MessageQueue selectOneMessageQueue(final TopicPublishInfo tpInfo, final String lastBrokerName) {
        // Check whether the fault-tolerance switch is on; it is false by default
        if (this.sendLatencyFaultEnable) {
            try {
                // Round-robin over the queues and return the first one whose Broker is currently available
                int index = tpInfo.getSendWhichQueue().incrementAndGet();
                for (int i = 0; i < tpInfo.getMessageQueueList().size(); i++) {
                    int pos = Math.abs(index++) % tpInfo.getMessageQueueList().size();
                    if (pos < 0)
                        pos = 0;
                    MessageQueue mq = tpInfo.getMessageQueueList().get(pos);
                    if (latencyFaultTolerance.isAvailable(mq.getBrokerName()))
                        return mq;
                }
                // Nothing was selected above, so pick a comparatively good Broker
                final String notBestBroker = latencyFaultTolerance.pickOneAtLeast();
                int writeQueueNums = tpInfo.getQueueIdByBroker(notBestBroker);
                if (writeQueueNums > 0) {
                    final MessageQueue mq = tpInfo.selectOneMessageQueue();
                    if (notBestBroker != null) {
                        mq.setBrokerName(notBestBroker);
                        mq.setQueueId(tpInfo.getSendWhichQueue().incrementAndGet() % writeQueueNums);
                    }
                    return mq;
                } else {
                    latencyFaultTolerance.remove(notBestBroker);
                }
            } catch (Exception e) {
                log.error("Error occurred when selecting message queue", e);
            }
            // Neither step above produced a queue, so fall back to the default round-robin selection
            return tpInfo.selectOneMessageQueue();
        }

        return tpInfo.selectOneMessageQueue(lastBrokerName);
    }

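From the source above, selectOneMessageQueue picks a queue in the following order when the fault-tolerance strategy is enabled:

Walk the queue list round-robin and return the first queue whose Broker is currently available (in the code shown, lastBrokerName is not consulted in this branch)
Otherwise, pick a next-best Broker via pickOneAtLeast
Otherwise, fall back to the default round-robin selection of a queue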
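
The first and last of these steps can be sketched with plain Java types. This is a simplified model under stated assumptions, not RocketMQ code: QueueSelectionSketch, its Queue type, and the availability predicate are hypothetical, and the middle "next-best Broker" step is left out to keep the example short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical, simplified model of queue selection; the types are not RocketMQ's.
class QueueSelectionSketch {
    static final class Queue {
        final String brokerName;
        final int queueId;

        Queue(String brokerName, int queueId) {
            this.brokerName = brokerName;
            this.queueId = queueId;
        }

        @Override
        public String toString() {
            return brokerName + "#" + queueId;
        }
    }

    private final AtomicInteger counter = new AtomicInteger();

    // Round-robin over the queues; prefer the first one whose broker passes the availability check.
    Queue select(List<Queue> queues, Predicate<String> brokerAvailable) {
        int index = counter.getAndIncrement();
        for (int i = 0; i < queues.size(); i++) {
            // floorMod keeps the position non-negative even after the counter overflows
            int pos = Math.floorMod(index + i, queues.size());
            Queue candidate = queues.get(pos);
            if (brokerAvailable.test(candidate.brokerName)) {
                return candidate;
            }
        }
        // Fallback: every broker is marked unavailable, so use plain round-robin anyway.
        return queues.get(Math.floorMod(index, queues.size()));
    }

    public static void main(String[] args) {
        List<Queue> queues = Arrays.asList(
                new Queue("broker-a", 0), new Queue("broker-a", 1),
                new Queue("broker-b", 0), new Queue("broker-b", 1));
        QueueSelectionSketch selector = new QueueSelectionSketch();

        // Pretend broker-a is currently isolated.
        Predicate<String> available = name -> !name.equals("broker-a");
        for (int i = 0; i < 4; i++) {
            System.out.println(selector.select(queues, available));
        }
    }
}
```

The fallback matters because a client that refuses to send when every Broker looks slow would turn a latency problem into an outage; trying some queue anyway is the safer default.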

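updateFaultItem records the latency for each Broker. If a send by the Producer takes too long, the Broker is treated as unavailable for a period N. The relationship between N and the send duration is shown in the table below (these are simply the latencyMax and notAvailableDuration arrays from the source above):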

`Producer` send duration	`Broker` unavailability duration
≥15000ms	600×1000ms
≥3000ms	180×1000ms
≥2000ms	120×1000ms
≥1000ms	60×1000ms
≥550ms	30×1000ms
≥100ms	0ms
≥50ms	0ms

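Retry mechanism

In a complex distributed system, network jitter, server crashes, and program exceptions are common, so sending or consuming a message can fail.

An MQ therefore has to provide a message retry mechanism. Without it, messages can be lost, with a significant impact on the system. The overall flow is shown in the figure below:

[Figure: overview of the message retry flow]

When a consumer's consumption logic fails, it can trigger a retry through the status it returns. Message retry only applies to clustering consumption; broadcasting consumption does not offer failure retry, and after a failure the consumer simply moves on to newer messages.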

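Retry mechanism source code analysis

Steps performed when sending a message:

Look up the routing information for the message
Select the MQ to send to
Invoke the send method
Wrap the send result and return it

Method signature: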


private SendResult sendDefaultImpl(
         Message msg,
         final CommunicationMode communicationMode,
         final SendCallback sendCallback,
         final long timeout
 ) throws MQClientException, RemotingException, MQBrokerException, InterruptedException

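The parameters and declared exceptions are:

msg: the message
communicationMode: the communication mode
sendCallback: the send callback
timeout: the send timeout
MQClientException: an error on the Client side when sending
RemotingException: an error in the remote request
MQBrokerException: an error on the Broker side
InterruptedException: the thread was interrupted

Now the source itself: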

 private SendResult sendDefaultImpl(
         Message msg,
         final CommunicationMode communicationMode,
         final SendCallback sendCallback,
         final long timeout
 ) throws MQClientException, RemotingException, MQBrokerException, InterruptedException {
     // Make sure the Producer is in the running state
     this.makeSureStateOK();
     // Validate the message
     Validators.checkMessage(msg, this.defaultMQProducer);
     // Invocation ID, used to tag messages that belong to the same send call
     final long invokeID = random.nextLong();
     long beginTimestampFirst = System.currentTimeMillis();
     long beginTimestampPrev = beginTimestampFirst;
     long endTimestamp = beginTimestampFirst;
     // Look up the Topic's routing information
     TopicPublishInfo topicPublishInfo = this.tryToFindTopicPublishInfo(msg.getTopic());
     if (topicPublishInfo != null && topicPublishInfo.ok()) {
         boolean callTimeout = false;
         MessageQueue mq = null;     // the MQ the message will be sent to
         Exception exception = null;
         SendResult sendResult = null;   // result of the most recent send
         // Total number of attempts: 1 + retries for SYNC, otherwise 1
         int timesTotal = communicationMode == CommunicationMode.SYNC ? 1 + this.defaultMQProducer.getRetryTimesWhenSendFailed() : 1;
         int times = 0;
         String[] brokersSent = new String[timesTotal];  // Broker name chosen for each attempt
         // Send in a loop until an attempt succeeds or the attempts run out
         for (; times < timesTotal; times++) {

     // ...... remaining logic omitted
 }

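this.defaultMQProducer.getRetryTimesWhenSendFailed() returns 2 by default. By the calculation above, a synchronous send is attempted at most 3 times (1 initial attempt plus 2 retries), while any other communication mode, such as an asynchronous send, gets timesTotal = 1, so this loop tries it only once.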
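
The attempt counting can be isolated in a small standalone sketch. It mirrors only the timesTotal formula and the "loop until success" shape of the method; SendRetrySketch, its Mode enum, and the sendOnce callback are hypothetical names, and real concerns such as timeouts and Broker selection are deliberately left out.

```java
import java.util.concurrent.Callable;

// Hypothetical, simplified retry loop; it mirrors only the attempt counting described above.
class SendRetrySketch {
    enum Mode { SYNC, ASYNC, ONEWAY }

    // Total attempts: one initial try plus the configured retries for SYNC, a single try otherwise.
    static int timesTotal(Mode mode, int retryTimesWhenSendFailed) {
        return mode == Mode.SYNC ? 1 + retryTimesWhenSendFailed : 1;
    }

    // Call sendOnce until it succeeds or the attempts run out; rethrow the last failure.
    static <T> T sendWithRetry(Mode mode, int retryTimesWhenSendFailed, Callable<T> sendOnce) throws Exception {
        int total = timesTotal(mode, retryTimesWhenSendFailed);
        Exception last = null;
        for (int times = 0; times < total; times++) {
            try {
                return sendOnce.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        if (last == null) {
            throw new IllegalStateException("no attempt was made");
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Simulate a send that fails twice and succeeds on the third attempt.
        String result = sendWithRetry(Mode.SYNC, 2, () -> {
            attempts[0]++;
            if (attempts[0] < 3) {
                throw new IllegalStateException("simulated send failure");
            }
            return "ok";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```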


DefaultMQProducer lets you set the maximum number of retries for a failed send, and a timeout can be passed to send as well:

public void setRetryTimesWhenSendFailed(int retryTimesWhenSendFailed) {
    this.retryTimesWhenSendFailed = retryTimesWhenSendFailed;
}

@Override
public SendResult send(Collection<Message> msgs,
    long timeout) throws MQClientException, RemotingException, MQBrokerException, InterruptedException {
    return this.defaultMQProducerImpl.send(batch(msgs), timeout);
}

For example, to have the Producer retry up to 5 times with a 5 s send timeout:

DefaultMQProducer producer = new DefaultMQProducer("producer");
producer.setRetryTimesWhenSendFailed(5);
producer.send(msg, 5000L);
