Dubbo Cluster Fault Tolerance

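In a distributed environment, services are usually deployed as clusters, so a caller faces two questions: which of the many provider instances should it call, and what should happen when a call fails? Dubbo addresses both. It provides a cluster fault-tolerance mechanism that handles failed calls on the caller's behalf. Dubbo defines the Cluster interface and a corresponding Invoker for each implementation. The Cluster implementations are FailoverCluster, FailfastCluster, FailsafeCluster, FailbackCluster, ForkingCluster, and BroadcastCluster, and each is paired with an XxxClusterInvoker class.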

FailoverCluster

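This is Dubbo's default cluster fault-tolerance strategy. When a call fails, the consumer retries; if the call still fails after the configured number of retries, an exception is thrown. The default retry count is 2, so a call is attempted at most 3 times. Too many retries can add significant latency.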

public class FailoverCluster implements Cluster {

    public final static String NAME = "failover";

    @Override
    public <T> Invoker<T> join(Directory<T> directory) throws RpcException {
        return new FailoverClusterInvoker<T>(directory);
    }
}

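FailoverCluster's job is to merge multiple Invokers into a single Invoker, that is, to create a FailoverClusterInvoker. The actual fault-tolerance logic lives in the corresponding Invoker. A service call goes through AbstractClusterInvoker#invoke, which eventually calls the template method doInvoke and leaves the concrete invocation logic to the subclass.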
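The relationship between join, invoke, and doInvoke can be illustrated with a small stand-alone sketch. All types here (SimpleInvoker, SimpleClusterInvoker, FirstOnlyInvoker, TemplateSketch) are hypothetical simplifications for illustration, not Dubbo's real interfaces.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for Dubbo's Invoker; not the real interface.
interface SimpleInvoker {
    String invoke(String request);
}

// Plays the role of AbstractClusterInvoker: invoke() fixes the common steps
// and delegates the fault-tolerance policy to the template method doInvoke().
abstract class SimpleClusterInvoker implements SimpleInvoker {
    private final List<SimpleInvoker> providers;

    SimpleClusterInvoker(List<SimpleInvoker> providers) {
        this.providers = providers;
    }

    @Override
    public final String invoke(String request) {
        // Common pre-check shared by every strategy
        if (providers.isEmpty()) {
            throw new IllegalStateException("no provider available");
        }
        return doInvoke(request, providers);
    }

    // Each strategy supplies its own invocation logic here
    protected abstract String doInvoke(String request, List<SimpleInvoker> providers);
}

// A trivial strategy used only for the demo: always call the first provider.
final class FirstOnlyInvoker extends SimpleClusterInvoker {
    FirstOnlyInvoker(List<SimpleInvoker> providers) {
        super(providers);
    }

    @Override
    protected String doInvoke(String request, List<SimpleInvoker> providers) {
        return providers.get(0).invoke(request);
    }
}

final class TemplateSketch {
    // Plays the role of Cluster#join: many invokers in, one invoker out.
    static SimpleInvoker join(List<SimpleInvoker> providers) {
        return new FirstOnlyInvoker(providers);
    }

    public static void main(String[] args) {
        SimpleInvoker a = req -> "A:" + req;
        SimpleInvoker b = req -> "B:" + req;
        SimpleInvoker cluster = join(Arrays.asList(a, b));
        System.out.println(cluster.invoke("hello"));
    }
}
```

The caller only ever sees a single invoker; swapping the strategy means returning a different subclass from join.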

FailoverClusterInvoker#doInvoke

public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
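    // These invokers come from the service directory and have already been filtered by routing
    List<Invoker<T>> copyinvokers = invokers;
    // Check that the invoker list is not empty; throw an exception if it is
    checkInvokers(copyinvokers, invocation);
    // Get the name of the method being called
    String methodName = RpcUtils.getMethodName(invocation);
    // Get the retry count (default 2). Retries do not include the first call, hence the + 1
    int len = getUrl().getMethodParameter(methodName, Constants.RETRIES_KEY, Constants.DEFAULT_RETRIES) + 1;
    // If the configured value is invalid, fall back to a single attempt
    if (len <= 0) {
        len = 1;
    }
    // Holds the last exception, reported if every attempt fails
    RpcException le = null;
    // Invokers that have already been called
    List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyinvokers.size());
    Set<String> providers = new HashSet<String>(len);
    // Retry on failure (Dubbo's default cluster fault-tolerance strategy)
    for (int i = 0; i < len; i++) {
        if (i > 0) {
            // Before a retry, re-check state and re-list the invokers,
            // because the directory and routing results may have changed
            checkWhetherDestroyed();
            // Fetch the invoker list again
            copyinvokers = list(invocation);
            // Check again that the invoker list is not empty
            checkInvokers(copyinvokers, invocation);
        }
        // Pick the invoker to call via load balancing
        Invoker<T> invoker = select(loadbalance, invocation, copyinvokers, invoked);
        invoked.add(invoker);
        RpcContext.getContext().setInvokers((List) invoked);
        try {
            // Invoke. Return immediately on success; on failure, loop to re-list, re-select and call again.
            // select() receives the already-invoked list, so if load balancing picks a provider that
            // has already failed it selects again; a retry prefers a provider that has not been tried yet.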
            Result result = invoker.invoke(invocation);
            if (le != null && logger.isWarnEnabled()) {
                logger.warn("Although retry the method " + invocation.getMethodName()
                        + " in the service " + getInterface().getName()
                        + " was successful by the provider " + invoker.getUrl().getAddress()
                        + ", but there have been failed providers " + providers
                        + " (" + providers.size() + "/" + copyinvokers.size()
                        + ") from the registry " + directory.getUrl().getAddress()
                        + " on the consumer " + NetUtils.getLocalHost()
                        + " using the dubbo version " + Version.getVersion() + ". Last error is: "
                        + le.getMessage(), le);
            }
            return result;
        } catch (RpcException e) {
            if (e.isBiz()) {
                throw e;
            }
            le = e;
        } catch (Throwable e) {
            le = new RpcException(e.getMessage(), e);
        } finally {
            providers.add(invoker.getUrl().getAddress());
        }
    }
    // Every attempt failed: throw an exception
    throw new RpcException(le != null ? le.getCode() : 0, "Failed to invoke the method "
            + invocation.getMethodName() + " in the service " + getInterface().getName()
            + ". Tried " + len + " times of the providers " + providers
            + " (" + providers.size() + "/" + copyinvokers.size()
            + ") from the registry " + directory.getUrl().getAddress()
            + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version "
            + Version.getVersion() + ". Last error is: "
            + (le != null ? le.getMessage() : ""), le != null && le.getCause() != null ? le.getCause() : le);
}
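Stripped of logging and Dubbo types, the retry loop above reduces to a few steps. The following is a minimal sketch under stated assumptions: Provider, the first-untried select stand-in for load balancing, and the exception types are all hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class FailoverSketch {
    // Hypothetical provider: a name plus a handler that may throw.
    static final class Provider {
        final String name;
        final Function<String, String> handler;

        Provider(String name, Function<String, String> handler) {
            this.name = name;
            this.handler = handler;
        }
    }

    static String invoke(List<Provider> providers, String request, int retries) {
        if (providers.isEmpty()) {
            throw new IllegalArgumentException("no provider available");
        }
        // Retries exclude the first call, so total attempts = retries + 1 (at least 1)
        int attempts = Math.max(retries + 1, 1);
        List<Provider> tried = new ArrayList<>();
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            Provider p = select(providers, tried);
            tried.add(p);
            try {
                return p.handler.apply(request); // success: return immediately
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next attempt
            }
        }
        throw new IllegalStateException("Tried " + attempts + " times, last error: "
                + (last == null ? "" : last.getMessage()), last);
    }

    // Stand-in for load balancing: the first provider not yet tried,
    // falling back to the first provider when all have been tried.
    static Provider select(List<Provider> providers, List<Provider> tried) {
        for (Provider p : providers) {
            if (!tried.contains(p)) {
                return p;
            }
        }
        return providers.get(0);
    }

    public static void main(String[] args) {
        List<Provider> providers = Arrays.asList(
                new Provider("A", req -> { throw new RuntimeException("A is down"); }),
                new Provider("B", req -> "B handled " + req));
        System.out.println(invoke(providers, "hello", 2));
    }
}
```

With two providers and retries set to 2, a call that fails everywhere is attempted three times: A, B, then A again once every provider has been tried.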

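FailfastCluster

Fail fast: only one call is made, and an exception is thrown immediately if it fails.

As before, the call ends up in FailfastClusterInvoker#doInvoke. The steps leading up to it are the same, so from here on they are skipped and only doInvoke is analyzed.

FailfastClusterInvoker#doInvoke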

public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    // Check that the invoker list is not empty; throw an exception if it is
    checkInvokers(invokers, invocation);
    // Run load balancing
    Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
    try {
        // Invoke
        return invoker.invoke(invocation);
    } catch (Throwable e) {
        if (e instanceof RpcException && ((RpcException) e).isBiz()) {
            throw (RpcException) e;
        }
        throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0, "Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName() + " select from all providers " + invokers + " for service " + getInterface().getName() + " method " + invocation.getMethodName() + " on consumer " + NetUtils.getLocalHost() + " use dubbo version " + Version.getVersion() + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e.getCause() != null ? e.getCause() : e);
    }
}

The logic is very simple: if the call fails, an exception is thrown right away, so no further analysis is needed.

FailsafeCluster

Fail safe: a failed call does not throw an exception. The failure is only logged and an empty result is returned.

FailsafeClusterInvoker#doInvoke

public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    try {
        // Check that the invoker list is not empty; throw an exception if it is
        checkInvokers(invokers, invocation);
        // Run load balancing
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        // Invoke
        return invoker.invoke(invocation);
    } catch (Throwable e) {
        // Log the exception from the failed call
        logger.error("Failsafe ignore exception: " + e.getMessage(), e);
        // Return an empty result
        return new RpcResult();
    }
}
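Failfast and failsafe differ only in what happens inside the catch block. The sketch below puts the two side by side; the class name FailfastVsFailsafe, the use of Supplier as the remote call, and returning null in place of an empty RpcResult are assumptions made for illustration.

```java
import java.util.function.Supplier;

final class FailfastVsFailsafe {
    // Failfast: one attempt, and the failure is propagated to the caller.
    static String failfast(Supplier<String> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            throw new IllegalStateException("Failfast invoke failed: " + e.getMessage(), e);
        }
    }

    // Failsafe: one attempt, and the failure is logged and swallowed.
    // null stands in for the empty result that the listing above returns.
    static String failsafe(Supplier<String> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            System.err.println("Failsafe ignore exception: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        Supplier<String> broken = () -> { throw new RuntimeException("provider down"); };

        System.out.println("failsafe returned: " + failsafe(broken));
        try {
            failfast(broken);
        } catch (IllegalStateException e) {
            System.out.println("failfast threw: " + e.getMessage());
        }
    }
}
```

Because failsafe hides the error, the caller has to be prepared for an empty result and cannot tell it apart from a failure without checking the logs.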

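FailbackCluster

When a call fails, the failure is logged and an empty result is returned first. The failed call is then added to a scheduled task that retries it every 5 seconds.

FailbackClusterInvoker#doInvoke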

protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    try {
        // Check that the invoker list is not empty; throw an exception if it is
        checkInvokers(invokers, invocation);
        // Run load balancing
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        // Invoke
        return invoker.invoke(invocation);
    } catch (Throwable e) {
        // Log the exception from the failed call
        logger.error("Failback to invoke method " + invocation.getMethodName() + ", wait for retry in background. Ignored exception: "
                + e.getMessage() + ", ", e);
        // Register the failed call with the scheduled task, which runs every 5 seconds
        addFailed(invocation, this);
        // Return an empty result for now
        return new RpcResult(); 
    }
}

FailbackClusterInvoker#addFailed

private volatile ScheduledFuture<?> retryFuture;
private final ScheduledExecutorService scheduledExecutorService = Executors.newScheduledThreadPool(2,
            new NamedInternalThreadFactory("failback-cluster-timer", true));
private final ConcurrentMap<Invocation, AbstractClusterInvoker<?>> failed = new ConcurrentHashMap<Invocation, AbstractClusterInvoker<?>>();


private void addFailed(Invocation invocation, AbstractClusterInvoker<?> router) {
    if (retryFuture == null) {
        synchronized (this) {
            if (retryFuture == null) {
                // Create the scheduled task lazily; it runs every 5 seconds
                retryFuture = scheduledExecutorService.scheduleWithFixedDelay(new Runnable() {

                    @Override
                    public void run() {
                        try {
                            retryFailed();
                        } catch (Throwable t) { // Defensive fault tolerance
                            logger.error("Unexpected error occur at collect statistic", t);
                        }
                    }
                }, RETRY_FAILED_PERIOD, RETRY_FAILED_PERIOD, TimeUnit.MILLISECONDS);
            }
        }
    }
    // Record the invocation and its invoker in the failed map
    failed.put(invocation, router);
}

void retryFailed() {
    if (failed.size() == 0) {
        return;
    }
    // Iterate over a snapshot of the failed map
    for (Map.Entry<Invocation, AbstractClusterInvoker<?>> entry : new HashMap<Invocation, AbstractClusterInvoker<?>>(
            failed).entrySet()) {
        Invocation invocation = entry.getKey();
        Invoker<?> invoker = entry.getValue();
        try {
            // Invoke again
            invoker.invoke(invocation);
            // On success, remove the entry from the failed map
            failed.remove(invocation);
        } catch (Throwable e) {
            logger.error("Failed retry to invoke method " + invocation.getMethodName() + ", waiting again.", e);
        }
    }
}
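The failback pattern (return an empty result now, retry in the background) can be tried out in isolation. The sketch below mirrors the structure of addFailed and retryFailed, but it is an illustration only: FailbackSketch, the String key, the Supplier-based call, and the configurable period are hypothetical, and null stands in for the empty result.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class FailbackSketch {
    private final Map<String, Supplier<String>> failed = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "failback-sketch-timer");
        t.setDaemon(true); // do not keep the JVM alive just for retries
        return t;
    });
    private final long periodMillis;
    private volatile ScheduledFuture<?> retryFuture;

    FailbackSketch(long periodMillis) {
        this.periodMillis = periodMillis;
    }

    // Try once; on failure register the call for background retry and return null.
    String invoke(String key, Supplier<String> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            addFailed(key, call);
            return null;
        }
    }

    private void addFailed(String key, Supplier<String> call) {
        if (retryFuture == null) {
            synchronized (this) {
                if (retryFuture == null) {
                    // Start the periodic retry task lazily, on the first failure
                    retryFuture = timer.scheduleWithFixedDelay(
                            this::retryFailed, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
                }
            }
        }
        failed.put(key, call);
    }

    // Retry every recorded call; remove the ones that now succeed.
    void retryFailed() {
        for (Map.Entry<String, Supplier<String>> entry : new HashMap<>(failed).entrySet()) {
            try {
                entry.getValue().get();
                failed.remove(entry.getKey());
            } catch (RuntimeException e) {
                // Still failing: keep the entry for the next round
            }
        }
    }

    int pending() {
        return failed.size();
    }

    void shutdown() {
        timer.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        FailbackSketch failback = new FailbackSketch(100);
        long recoverAt = System.currentTimeMillis() + 250;
        Supplier<String> flaky = () -> {
            if (System.currentTimeMillis() < recoverAt) {
                throw new RuntimeException("provider down");
            }
            return "ok";
        };
        System.out.println("first call returned: " + failback.invoke("notify", flaky));
        Thread.sleep(600); // give the background task time to retry
        System.out.println("pending after retries: " + failback.pending());
        failback.shutdown();
    }
}
```

Since the original caller has already received an empty result, this pattern only suits calls whose return value is not needed, such as notifications.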

ForkingCluster

Calls multiple providers in parallel and returns as soon as one of them succeeds.

ForkingClusterInvoker#doInvoke

// Thread pool used for the parallel calls
private final ExecutorService executor = Executors.newCachedThreadPool(
            new NamedInternalThreadFactory("forking-cluster-timer", true));

public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    try {
        // Check that the invoker list is not empty; throw an exception if it is
        checkInvokers(invokers, invocation);
        final List<Invoker<T>> selected;
        // Read the forks parameter (default 2)
        final int forks = getUrl().getParameter(Constants.FORKS_KEY, Constants.DEFAULT_FORKS);
        // Read the timeout parameter (default 1000 ms)
        final int timeout = getUrl().getParameter(Constants.TIMEOUT_KEY, Constants.DEFAULT_TIMEOUT);
        // If forks is not positive, or is at least the number of invokers, call all of them
        if (forks <= 0 || forks >= invokers.size()) {
            selected = invokers;
        } else {
            selected = new ArrayList<Invoker<T>>();
            // Use load balancing to pick up to forks invokers
            for (int i = 0; i < forks; i++) {
                // Load balancing
                Invoker<T> invoker = select(loadbalance, invocation, invokers, selected);
                // Skip duplicates
                if (!selected.contains(invoker)) {
                    selected.add(invoker);
                }
            }
        }
        RpcContext.getContext().setInvokers((List) selected);
        final AtomicInteger count = new AtomicInteger();
        // Create the blocking queue that collects the outcome
        final BlockingQueue<Object> ref = new LinkedBlockingQueue<Object>();
        for (final Invoker<T> invoker : selected) {
            // Run each selected invoker on the thread pool
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    try {
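                        // Invoke
                        Result result = invoker.invoke(invocation);
                        // Put the result on the blocking queue
                        ref.offer(result);
                    } catch (Throwable e) {
                        // Count the failure. Only when the count reaches the number of selected invokers,
                        // that is, when every call has failed, is the exception put on the queue.
                        // A single success is enough for the whole call to succeed, so earlier failures are ignored.
                        // The caller waits on the queue for up to timeout milliseconds; a successful result
                        // is queued as soon as it arrives and is returned directly.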
                        int value = count.incrementAndGet();
                        if (value >= selected.size()) {
                            ref.offer(e);
                        }
                    }
                }
            });
        }
        try {
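            // Possible outcomes:
            // 1. A call succeeded: the result is taken from the queue
            // 2. Every call failed: the queue element is a Throwable
            // 3. Nothing arrived within the timeout: poll returns null, which is returned as-is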
            Object ret = ref.poll(timeout, TimeUnit.MILLISECONDS);
            if (ret instanceof Throwable) {
                Throwable e = (Throwable) ret;
                throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0, "Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e.getCause() != null ? e.getCause() : e);
            }
            return (Result) ret;
        } catch (InterruptedException e) {
            throw new RpcException("Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e);
        }
    } finally {
        RpcContext.getContext().clearAttachments();
    }
}
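The "first success wins, report failure only when everyone failed" idea can be isolated from Dubbo in a short sketch. ForkingSketch and its Supplier-based calls are hypothetical. Unlike the listing above, this sketch throws on timeout instead of returning null, which is a deliberate simplification.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ForkingSketch {
    static String invoke(List<Supplier<String>> calls, long timeoutMillis) {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
            AtomicInteger failures = new AtomicInteger();
            for (Supplier<String> call : calls) {
                executor.execute(() -> {
                    try {
                        ref.offer(call.get()); // any success is queued right away
                    } catch (Throwable e) {
                        // Only the last failure is queued, once every call has failed
                        if (failures.incrementAndGet() >= calls.size()) {
                            ref.offer(e);
                        }
                    }
                });
            }
            // Wait for the first queued element, up to the timeout
            Object ret = ref.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (ret == null) {
                // Sketch choice: treat a timeout as an error
                throw new IllegalStateException("no result within " + timeoutMillis + " ms");
            }
            if (ret instanceof Throwable) {
                throw new IllegalStateException("all forked calls failed", (Throwable) ret);
            }
            return (String) ret;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for forked calls", e);
        } finally {
            executor.shutdownNow(); // the sketch owns its pool, so release it
        }
    }

    public static void main(String[] args) {
        List<Supplier<String>> calls = Arrays.asList(
                () -> { throw new RuntimeException("provider A down"); },
                () -> "answer from B");
        System.out.println(invoke(calls, 1000));
    }
}
```

The price of the lower latency is extra load: every forked call consumes provider resources even though only one result is used.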

BroadcastCluster

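As the name suggests, this strategy broadcasts the call to every provider. If any single provider fails, an exception is thrown, even when most of the others succeeded, loosely like a transaction. Note: the proportion of failed providers can be configured, for example @Reference(cluster = "broadcast", parameters = {"broadcast.fail.percent", "20"}). With this setting, once 20 percent of the providers in the cluster have failed, the remaining providers are no longer called and the call fails. The doInvoke listing in this section does not contain that threshold logic; it always calls every provider.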

BroadcastClusterInvoker#doInvoke

public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    // Check that the invoker list is not empty; throw an exception if it is
    checkInvokers(invokers, invocation);
    RpcContext.getContext().setInvokers((List) invokers);
    // Holds the most recent exception
    RpcException exception = null;
    Result result = null;
    // Call every provider
    for (Invoker<T> invoker : invokers) {
        try {
            result = invoker.invoke(invocation);
        } catch (RpcException e) {
            exception = e;
            logger.warn(e.getMessage(), e);
        } catch (Throwable e) {
            exception = new RpcException(e.getMessage(), e);
            logger.warn(e.getMessage(), e);
        }
    }
    // If any call failed, throw the last recorded exception
    if (exception != null) {
        throw exception;
    }
    return result;
}
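To make the failure-percentage idea concrete, here is a minimal sketch of a broadcast loop with an early-stop threshold. It is one plausible reading of the behavior described in the text, not code taken from Dubbo: BroadcastSketch, the Supplier-based calls, and the rounding rule for the threshold are all assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

final class BroadcastSketch {
    // failPercent = 100 means "call everyone, then fail if anyone failed".
    static String invoke(List<Supplier<String>> calls, int failPercent) {
        if (calls.isEmpty()) {
            throw new IllegalArgumentException("no provider available");
        }
        // Assumed rounding: number of failures that stops the broadcast, at least 1
        int threshold = Math.max(1, (int) Math.ceil(calls.size() * failPercent / 100.0));
        int failures = 0;
        RuntimeException last = null;
        String result = null;
        for (Supplier<String> call : calls) {
            try {
                result = call.get();
            } catch (RuntimeException e) {
                last = e;
                if (++failures >= threshold) {
                    break; // enough providers failed: skip the remaining ones
                }
            }
        }
        if (last != null) {
            throw last; // any failure makes the whole broadcast fail
        }
        return result; // result of the last provider called
    }

    public static void main(String[] args) {
        List<Supplier<String>> calls = Arrays.asList(
                () -> "A refreshed",
                () -> { throw new RuntimeException("B is down"); },
                () -> "C refreshed");
        try {
            invoke(calls, 100);
        } catch (RuntimeException e) {
            System.out.println("broadcast failed: " + e.getMessage());
        }
    }
}
```

With the threshold at 100 percent the loop behaves like the listing above: every provider is called, and the last exception is thrown afterwards if any call failed.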