9. Dubbo Source Code Series: Cluster Fault-Tolerance Strategies

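Learning leaves no room for laziness; even heaven and earth, the sun and the moon keep busier than we do. It's the weekend, and if you still have some energy left while relaxing, why not study some Dubbo source code with me? In this article we'll look at Dubbo's cluster fault-tolerance modes.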

1. Recap

    private <T> Invoker<T> doRefer(Cluster cluster, Registry registry, Class<T> type, URL url) {
        ... ...  
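        // 1. Build the router chain
        directory.buildRouterChain(subscribeUrl);
        // 2. Subscribe to provider addresses (plus configurators and routers)
        directory.subscribe(subscribeUrl.addParameter(CATEGORY_KEY,
                PROVIDERS_CATEGORY + "," + CONFIGURATORS_CATEGORY + "," + ROUTERS_CATEGORY));

        // 3. Wrap the directory in an invoker that carries the fault-tolerance strategy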
        Invoker invoker = cluster.join(directory);
        return invoker;
    }

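As mentioned in the earlier article on RegistryProtocol's doRefer, step 3 wraps the invoker with a cluster fault-tolerance strategy. When we design a program, we have to think about how failures are handled as well as the normal logic, and the cluster fault-tolerance strategy is the failure-handling mechanism Dubbo provides. For example, if a provider machine goes down, should the call fail immediately or be retried? If it is retried, how many times, and how? After reading through Dubbo's fault-tolerance strategies, you should have answers to these questions.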

2. Overview of cluster fault-tolerance modes

Dubbo provides the following fault-tolerance modes:

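Failover Cluster
- Retries on failure; the number of retries is set with retries=x. Typically used for idempotent read operations.
Failfast Cluster
- Fails fast: an error is reported immediately when an exception occurs. Typically used for non-idempotent write operations.
Failback Cluster
- Recovers automatically from failure: failed requests are recorded in the background and resent on a timer. Typically used for message notifications.
Forking Cluster
- Calls several providers in parallel and returns as soon as one succeeds. Typically used for read operations with strict latency requirements, at the cost of extra service resources.
Broadcast Cluster
- Broadcasts the call to all providers one by one and reports an error if any of them fails. Typically used to tell every provider to refresh local resources such as caches or logs.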

3. Implementation

@SPI(FailoverCluster.NAME)
public interface Cluster {
    @Adaptive
    <T> Invoker<T> join(Directory<T> directory) throws RpcException;
}

Looking at Cluster, it is also an SPI extension point, and its default implementation is FailoverCluster, that is, retry on failure.

3.1. FailoverCluster

FailoverCluster is implemented as follows; its doJoin method simply creates a FailoverClusterInvoker.

public class FailoverCluster extends AbstractCluster {
    public final static String NAME = "failover";

    @Override
    public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
        return new FailoverClusterInvoker<>(directory);
    }
}

The doInvoke method of FailoverClusterInvoker looks like this:

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        // Total attempts = configured retries (default 2) + the initial call
        int len = getUrl().getMethodParameter(methodName, RETRIES_KEY, DEFAULT_RETRIES) + 1;
        ... ...
        // Call up to len times
        for (int i = 0; i < len; i++) {
            // Re-check state before each retry
            if (i > 0) {
                // Make sure the consumer has not been destroyed
                checkWhetherDestroyed();
                // Fetch the provider list again
                copyInvokers = list(invocation);
                // Validate the refreshed list
                checkInvokers(copyInvokers, invocation);
            }

            // Let the load balancer pick the invoker that makes this remote call
            Invoker<T> invoker = select(loadbalance, invocation, copyInvokers, invoked);
            invoked.add(invoker);
            RpcContext.getContext().setInvokers((List) invoked);
            try {
                // Make the remote call
                Result result = invoker.invoke(invocation);
                if (le != null && logger.isWarnEnabled()) {
                    ... ...
                }
                return result;
            } catch (RpcException e) {
                if (e.isBiz()) { // biz exception.
                    throw e;
                }
                le = e;
            } catch (Throwable e) {
                le = new RpcException(e.getMessage(), e);
            } finally {
                providers.add(invoker.getUrl().getAddress());
            }
        }
        // All attempts failed: the full source throws an RpcException built from le here (omitted)
    }

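The method first reads the retry count configured on the URL (retries, default 2), so a call is attempted at most three times. On each attempt the load balancer selects an invoker, which then makes the remote call. A business exception is rethrown immediately, while any other failure is recorded and the loop moves on to the next attempt.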
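The retry loop can be illustrated outside Dubbo with a minimal sketch. Everything here is hypothetical and simplified: FailoverSketch is not a Dubbo class, each Supplier stands in for one remote provider, and plain round-robin stands in for the load balancer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical, simplified failover loop; these are not Dubbo's classes. */
final class FailoverSketch {

    /** Tries up to retries + 1 times and returns the first successful result. */
    static <T> T invoke(List<Supplier<T>> providers, int retries) {
        if (providers.isEmpty()) {
            throw new IllegalStateException("no provider available");
        }
        int attempts = Math.max(retries, 0) + 1; // the initial call plus the retries
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            // Round-robin stands in for the load balancer's select(...)
            Supplier<T> provider = providers.get(i % providers.size());
            try {
                return provider.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next attempt
            }
        }
        throw new IllegalStateException("all " + attempts + " attempts failed", last);
    }

    public static void main(String[] args) {
        List<Supplier<String>> providers = new ArrayList<>();
        providers.add(() -> { throw new RuntimeException("provider A is down"); });
        providers.add(() -> "hello from B");
        System.out.println(invoke(providers, 2)); // the second attempt reaches provider B
    }
}
```

Unlike the real invoker, this sketch does not distinguish business exceptions from transport failures; it retries on any RuntimeException.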

3.2. FailfastCluster

The doInvoke method of FailfastClusterInvoker looks like this:

    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        try {
            return invoker.invoke(invocation);
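        } catch (Throwable e) {
            if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.
                throw (RpcException) e;
            }
            // Any other failure is wrapped and thrown at once, with no retry
            // (the error code and detailed message of the full source are omitted here)
            throw new RpcException(e.getMessage(), e);
        }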
    }

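The implementation is simple: it selects one invoker, makes a single call, and does not retry; a failure such as a business RpcException is thrown straight back to the caller.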

3.3. FailbackCluster

The doInvoke method of FailbackClusterInvoker looks like this:


    @Override
    protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        Invoker<T> invoker = null;
        try {
            checkInvokers(invokers, invocation);
            invoker = select(loadbalance, invocation, invokers, null);
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            // On failure, hand the call over to the retry timer
            addFailed(loadbalance, invocation, invokers, invoker);
            return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
        }
    }
    
    private void addFailed(LoadBalance loadbalance, Invocation invocation, List<Invoker<T>> invokers, Invoker<T> lastInvoker) {
        // Lazily create the hashed wheel timer (double-checked locking)
        if (failTimer == null) {
            synchronized (this) {
                if (failTimer == null) {
                    failTimer = new HashedWheelTimer(
                            new NamedThreadFactory("failback-cluster-timer", true),
                            1,
                            TimeUnit.SECONDS, 32, failbackTasks);
                }
            }
        }
        // Create the retry task and schedule it
        RetryTimerTask retryTimerTask = new RetryTimerTask(loadbalance, invocation, invokers, lastInvoker, retries, RETRY_FAILED_PERIOD);
        try {
            failTimer.newTimeout(retryTimerTask, RETRY_FAILED_PERIOD, TimeUnit.SECONDS);
        } catch (Throwable e) {
            // A scheduling failure is swallowed here (the full source only logs it)
        }
    }

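When a call throws an exception, the invocation is wrapped in a RetryTimerTask and added to a hashed wheel timer, which invokes it again after RETRY_FAILED_PERIOD seconds. Meanwhile the caller immediately receives an empty result.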
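The same "return now, retry later" idea can be sketched with the JDK alone. This is only an illustration under stated assumptions: FailbackSketch is a hypothetical class, a ScheduledExecutorService stands in for Dubbo's HashedWheelTimer, and a Supplier stands in for the remote call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical failback sketch; a ScheduledExecutorService stands in for the wheel timer. */
final class FailbackSketch implements AutoCloseable {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "failback-sketch-timer");
        t.setDaemon(true);
        return t;
    });
    private final int maxRetries;
    private final long delayMillis;

    FailbackSketch(int maxRetries, long delayMillis) {
        this.maxRetries = maxRetries;
        this.delayMillis = delayMillis;
    }

    /** Calls once; on failure returns null to the caller and retries in the background. */
    <T> T invoke(Supplier<T> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            scheduleRetry(call, 1);
            return null; // the caller gets an empty result right away
        }
    }

    private <T> void scheduleRetry(Supplier<T> call, int attempt) {
        if (attempt > maxRetries) {
            return; // give up once the retry budget is used up
        }
        timer.schedule(() -> {
            try {
                call.get();
            } catch (RuntimeException e) {
                scheduleRetry(call, attempt + 1); // still failing: schedule the next retry
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        CountDownLatch delivered = new CountDownLatch(1);
        try (FailbackSketch failback = new FailbackSketch(3, 50)) {
            String first = failback.invoke(() -> {
                if (calls.incrementAndGet() == 1) throw new RuntimeException("provider down");
                delivered.countDown();
                return "notified";
            });
            System.out.println("immediate result: " + first);
            delivered.await(2, TimeUnit.SECONDS);
            System.out.println("calls made: " + calls.get());
        }
    }
}
```

Because the result of a background retry never reaches the original caller, this pattern only suits calls whose return value does not matter, such as notifications.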

3.4. ForkingCluster

The doInvoke method of ForkingClusterInvoker looks like this:

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        try {
            // Read the configuration parameters
            final List<Invoker<T>> selected;
            final int forks = getUrl().getParameter(FORKS_KEY, DEFAULT_FORKS);
            final int timeout = getUrl().getParameter(TIMEOUT_KEY, DEFAULT_TIMEOUT);
            // Build the list of invokers to call in parallel
            if (forks <= 0 || forks >= invokers.size()) {
                selected = invokers;
            } else {
                selected = new ArrayList<>();
                for (int i = 0; i < forks; i++) {
                    Invoker<T> invoker = select(loadbalance, invocation, invokers, selected);
                    if (!selected.contains(invoker)) {
                        //Avoid add the same invoker several times.
                        selected.add(invoker);
                    }
                }
            }
            // Use a thread pool to call the selected invokers in parallel
            RpcContext.getContext().setInvokers((List) selected);
            final AtomicInteger count = new AtomicInteger();
            final BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
            for (final Invoker<T> invoker : selected) {
                executor.execute(() -> {
                    try {
                        Result result = invoker.invoke(invocation);
                        ref.offer(result);
                    } catch (Throwable e) {
                        // Count the failed calls; only when all have failed is the error published
                        int value = count.incrementAndGet();
                        if (value >= selected.size()) {
                            ref.offer(e);
                        }
                    }
                });
            }
            try {
                // Take the first result off the queue and return it
                Object ret = ref.poll(timeout, TimeUnit.MILLISECONDS);
                ... ...
                return (Result) ret;
            } catch (InterruptedException e) {
                throw new RpcException("Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e);
            }
        } finally {
            // clear attachments which is binding to current thread.
            RpcContext.getContext().clearAttachments();
        }
    }

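The configured forks value determines how many invokers are called. If forks is not positive, or is not smaller than the number of available invokers, all of them are called. Otherwise the load balancer is asked forks times and duplicate picks are skipped, so fewer than forks invokers may actually be called, which is worth keeping in mind when setting it. A BlockingQueue collects the results of the concurrent calls. I found this approach rather novel and, to be honest, had not used it before: calling the queue's poll method with the same timeout as the RPC achieves the same effect as future.get.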
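The queue-based "first result wins" technique can be tried in isolation with a small sketch. The names are hypothetical (ForkingSketch is not a Dubbo class), Suppliers stand in for providers, and a cached thread pool created per call stands in for Dubbo's shared executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical forking sketch: call every provider in parallel, return the first result. */
final class ForkingSketch {

    static <T> T invoke(List<Supplier<T>> providers, long timeoutMillis) {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            AtomicInteger failures = new AtomicInteger();
            BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
            for (Supplier<T> provider : providers) {
                executor.execute(() -> {
                    try {
                        // The queue rejects null, so a null result is counted as a failure below
                        ref.offer(provider.get());
                    } catch (Throwable e) {
                        // Publish the error only when every provider has failed
                        if (failures.incrementAndGet() >= providers.size()) {
                            ref.offer(e);
                        }
                    }
                });
            }
            // poll(timeout) plays the role of future.get(timeout)
            Object first = ref.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (first == null) {
                throw new IllegalStateException("no result within " + timeoutMillis + " ms");
            }
            if (first instanceof Throwable) {
                throw new IllegalStateException("all providers failed", (Throwable) first);
            }
            @SuppressWarnings("unchecked")
            T result = (T) first;
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a result", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Supplier<String>> providers = new ArrayList<>();
        providers.add(() -> { throw new RuntimeException("provider A is down"); });
        providers.add(() -> "answer from B");
        System.out.println(invoke(providers, 1000));
    }
}
```

A single queue carries both successes and the final failure, so the waiting thread needs only one blocking call regardless of how many providers were forked.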

3.5. BroadcastCluster

The doInvoke method of BroadcastClusterInvoker looks like this:

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        RpcContext.getContext().setInvokers((List) invokers);
        RpcException exception = null;
        Result result = null;
        for (Invoker<T> invoker : invokers) {
            try {
                result = invoker.invoke(invocation);
            } catch (RpcException e) {
                exception = e;
                logger.warn(e.getMessage(), e);
            } catch (Throwable e) {
                exception = new RpcException(e.getMessage(), e);
                logger.warn(e.getMessage(), e);
            }
        }
        if (exception != null) {
            throw exception;
        }
        return result;
    }

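Broadcast mode skips load balancing and simply calls every provider in turn. A failure does not stop the loop: the last exception is remembered and thrown once all providers have been called.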
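A minimal sketch of this "call everyone, fail at the end" loop, again with hypothetical names (BroadcastSketch, Suppliers standing in for providers) rather than Dubbo's own types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical broadcast sketch: every provider is called, errors surface at the end. */
final class BroadcastSketch {

    static <T> T invoke(List<Supplier<T>> providers) {
        RuntimeException lastFailure = null;
        T result = null;
        for (Supplier<T> provider : providers) {
            try {
                result = provider.get(); // the last successful result wins
            } catch (RuntimeException e) {
                lastFailure = e; // keep going so the remaining providers are still called
            }
        }
        if (lastFailure != null) {
            throw lastFailure;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Supplier<String>> providers = new ArrayList<>();
        providers.add(() -> "cache refreshed on A");
        providers.add(() -> "cache refreshed on B");
        System.out.println(invoke(providers));
    }
}
```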

3.6. Custom fault-tolerance strategy

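Of course, if none of the modes above fits your business scenario, you can define your own, and doing so is simple. Define a MyCluster that implements the Cluster interface; in its join method, create a MyClusterInvoker, and implement that invoker's doInvoke method. Because the Cluster is chosen through SPI, you also need to add the corresponding SPI declaration file under your project's resources/META-INF directory (Dubbo reads extension declarations from META-INF/dubbo/, in a file named after the interface). I won't go into the details here.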
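The shape of such an extension can be sketched without the Dubbo dependency. All types below are simplified, hypothetical stand-ins (SketchInvoker, SketchDirectory, SketchCluster, MyClusterSketch, MyClusterInvokerSketch); in a real project the classes would implement Dubbo's own Cluster interface and extend its cluster invoker base class instead, and the strategy shown (try the providers in list order) is just an example.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Dubbo's Invoker, Directory and Cluster (assumptions of this sketch)
interface SketchInvoker<T> { T invoke(String request); }
interface SketchDirectory<T> { List<SketchInvoker<T>> list(); }
interface SketchCluster { <T> SketchInvoker<T> join(SketchDirectory<T> directory); }

/** The custom cluster only decides which cluster invoker wraps the directory. */
final class MyClusterSketch implements SketchCluster {
    @Override
    public <T> SketchInvoker<T> join(SketchDirectory<T> directory) {
        return new MyClusterInvokerSketch<>(directory);
    }
}

/** The custom strategy lives in the invoker: try each provider once, in list order. */
final class MyClusterInvokerSketch<T> implements SketchInvoker<T> {
    private final SketchDirectory<T> directory;

    MyClusterInvokerSketch(SketchDirectory<T> directory) {
        this.directory = directory;
    }

    @Override
    public T invoke(String request) {
        RuntimeException last = null;
        // Ask the directory on every call so provider changes are picked up
        for (SketchInvoker<T> invoker : directory.list()) {
            try {
                return invoker.invoke(request);
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw new IllegalStateException("no provider could handle: " + request, last);
    }

    public static void main(String[] args) {
        List<SketchInvoker<String>> providers = new ArrayList<>();
        providers.add(request -> { throw new RuntimeException("provider down"); });
        providers.add(request -> "echo: " + request);
        SketchDirectory<String> directory = () -> providers;
        SketchInvoker<String> invoker = new MyClusterSketch().join(directory);
        System.out.println(invoker.invoke("hello"));
    }
}
```

The split mirrors the built-in modes: the cluster class is a tiny factory, and all of the failure-handling policy sits in the invoker it returns.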

4. Summary

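This article analyzed how Dubbo implements its fault-tolerance modes. The code is not complicated, but I think the idea is well worth borrowing: when designing an architecture, we should consider failure scenarios as well as the normal business logic. Adding a unified layer for handling failures greatly improves both the extensibility of the architecture and the readability of the code.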