Dubbo Service Cluster Fault Tolerance Principles (Important)


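Preface

Cluster fault tolerance is one of the key techniques in distributed service governance.

What is cluster fault tolerance?

In a distributed cluster environment, a service is usually deployed on many provider instances. When some of those providers become unavailable for whatever reason, how does the consumer pick a provider that still works? This is where cluster fault tolerance comes in: it decides what the consumer does when a call to a provider fails, for example failing over automatically to another provider.

Studying Dubbo's cluster fault tolerance deepens our understanding of the technique in general and of distributed systems as a whole. This article analyzes how the strategies used in Dubbo work.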

Dubbo ships several cluster fault tolerance strategies. This article walks through six of them, one by one.

Cluster fault tolerance strategies

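First: failover

This is Dubbo's default cluster fault tolerance strategy. As the name suggests, failover means that when a provider turns out to be unavailable, the consumer tries another provider in the cluster. By default Dubbo retries 2 times, so a call makes at most 3 attempts. The source code (FailoverClusterInvoker) is as follows: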

public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        List<Invoker<T>> copyInvokers = invokers;
        checkInvokers(copyInvokers, invocation);
        String methodName = RpcUtils.getMethodName(invocation);
        // Read the retry count (default 2); with the initial call that is 3 attempts in total
        int len = getUrl().getMethodParameter(methodName, RETRIES_KEY, DEFAULT_RETRIES) + 1;
        if (len <= 0) {
            len = 1;
        }
        // retry loop.
        RpcException le = null; // last exception.
        List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyInvokers.size()); // invoked invokers.
        Set<String> providers = new HashSet<String>(len);
        // Invoke, retrying on another provider after each failure
        for (int i = 0; i < len; i++) {
            //Reselect before retry to avoid a change of candidate `invokers`.
            //NOTE: if `invokers` changed, then `invoked` also lose accuracy.
            if (i > 0) {
                checkWhetherDestroyed();
                copyInvokers = list(invocation);
                // check again
                checkInvokers(copyInvokers, invocation);
            }
            Invoker<T> invoker = select(loadbalance, invocation, copyInvokers, invoked);
            invoked.add(invoker);
            RpcContext.getContext().setInvokers((List) invoked);
            try {
                Result result = invoker.invoke(invocation);
                if (le != null && logger.isWarnEnabled()) {
                    logger.warn("Although retry the method " + methodName
                            + " in the service " + getInterface().getName()
                            + " was successful by the provider " + invoker.getUrl().getAddress()
                            + ", but there have been failed providers " + providers
                            + " (" + providers.size() + "/" + copyInvokers.size()
                            + ") from the registry " + directory.getUrl().getAddress()
                            + " on the consumer " + NetUtils.getLocalHost()
                            + " using the dubbo version " + Version.getVersion() + ". Last error is: "
                            + le.getMessage(), le);
                }
                return result;
            } catch (RpcException e) {
                if (e.isBiz()) { // biz exception.
                    throw e;
                }
                le = e;
            } catch (Throwable e) {
                le = new RpcException(e.getMessage(), e);
            } finally {
                providers.add(invoker.getUrl().getAddress());
            }
        }
        throw new RpcException(le.getCode(), "Failed to invoke the method "
                + methodName + " in the service " + getInterface().getName()
                + ". Tried " + len + " times of the providers " + providers
                + " (" + providers.size() + "/" + copyInvokers.size()
                + ") from the registry " + directory.getUrl().getAddress()
                + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version "
                + Version.getVersion() + ". Last error is: "
                + le.getMessage(), le.getCause() != null ? le.getCause() : le);
    }
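
The retry loop above can be boiled down to a small, dependency-free sketch. This is not Dubbo's implementation: FailoverSketch is a hypothetical class, a Supplier stands in for an Invoker, and picking provider i % size is an assumed stand-in for the real load balancer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

final class FailoverSketch {

    /**
     * Calls providers until one succeeds, making at most retries + 1 attempts.
     * Picking provider (i % size) is a simple stand-in for a real load balancer.
     */
    static <T> T invoke(List<Supplier<T>> providers, int retries) {
        if (providers.isEmpty()) {
            throw new IllegalStateException("No provider available");
        }
        int attempts = Math.max(retries + 1, 1); // initial call + retries
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            Supplier<T> provider = providers.get(i % providers.size());
            try {
                return provider.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next provider
            }
        }
        throw new IllegalStateException("Tried " + attempts + " times, all failed", last);
    }

    public static void main(String[] args) {
        List<Supplier<String>> providers = Arrays.<Supplier<String>>asList(
                () -> { throw new IllegalStateException("provider A is down"); },
                () -> "hello from provider B");
        // The first attempt fails on A, the retry succeeds on B.
        System.out.println(invoke(providers, 2));
    }
}
```

Unlike the real code, the sketch retries on every RuntimeException. Dubbo rethrows business exceptions immediately, as the isBiz() check in the listing shows.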

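Second: failfast

This is the fail-fast strategy: the call is made only once, and a failure is reported immediately. It is typically used for non-idempotent operations, such as adding a new record, where a blind retry could apply the same change twice.

The source code is as follows: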

@Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        try {
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.
                throw (RpcException) e;
            }
            // The call failed: throw right away, no retry
            throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,
                    "Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName()
                            + " select from all providers " + invokers + " for service " + getInterface().getName()
                            + " method " + invocation.getMethodName() + " on consumer " + NetUtils.getLocalHost()
                            + " use dubbo version " + Version.getVersion()
                            + ", but no luck to perform the invocation. Last error is: " + e.getMessage(),
                    e.getCause() != null ? e.getCause() : e);
        }
    }
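
To see why a single attempt matters for non-idempotent operations, here is a minimal sketch. FailfastSketch and its createOrder provider are hypothetical names for illustration only: the provider performs its write and then loses the reply, which is exactly the case where a retry would duplicate the write.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class FailfastSketch {

    /** Makes exactly one attempt and rethrows any failure immediately. */
    static <T> T invokeFailfast(Supplier<T> provider) {
        try {
            return provider.get();
        } catch (RuntimeException e) {
            throw new IllegalStateException("Failfast invoke failed: " + e.getMessage(), e);
        }
    }

    /** Returns how many orders exist after one failfast call to a flaky, non-idempotent provider. */
    static int ordersCreatedByOneCall() {
        AtomicInteger orders = new AtomicInteger();
        // Hypothetical provider: the write succeeds, but the reply is lost (for example a timeout).
        Supplier<String> createOrder = () -> {
            orders.incrementAndGet();
            throw new IllegalStateException("timeout waiting for reply");
        };
        try {
            invokeFailfast(createOrder);
        } catch (IllegalStateException expected) {
            // The caller sees the error right away and decides what to do next.
        }
        return orders.get();
    }

    public static void main(String[] args) {
        // One attempt means one write; a retrying strategy would have created more orders.
        System.out.println(ordersCreatedByOneCall());
    }
}
```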

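Third: failsafe

This is the fail-safe strategy: on failure it neither throws nor retries. The exception is ignored, a log entry is written, and an empty result is returned.

The source code is as follows: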

@Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        try {
            checkInvokers(invokers, invocation);
            Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            // Ignore the exception, log it, and return an empty result
            logger.error("Failsafe ignore exception: " + e.getMessage(), e);
            return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
        }
    }
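
The same idea fits in a few lines of plain Java. This is a sketch under assumptions rather than Dubbo code: FailsafeSketch is a hypothetical class, java.util.logging replaces Dubbo's logger, and null plays the role of the empty result.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

final class FailsafeSketch {
    private static final Logger LOG = Logger.getLogger(FailsafeSketch.class.getName());

    /** Returns the provider's result, or null if the call fails; the failure is only logged. */
    static <T> T invoke(Supplier<T> provider) {
        try {
            return provider.get();
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, "Failsafe ignore exception: " + e.getMessage(), e);
            return null; // empty result instead of an exception
        }
    }

    public static void main(String[] args) {
        // Hypothetical audit call whose failure must not break the main flow.
        String result = invoke(() -> { throw new IllegalStateException("audit service is down"); });
        System.out.println(result);
    }
}
```

Because the caller cannot tell an empty result from a swallowed failure, this style suits calls whose outcome the caller does not depend on, such as writing an audit log.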

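Fourth: broadcast

This strategy broadcasts the call to every provider, one by one. If any provider fails, the whole call fails, but only after all providers have been invoked. A typical use case is refreshing local caches: when a piece of data is updated on one server, the change can be pushed to every other server in the cluster.

The source code is as follows: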

public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        RpcContext.getContext().setInvokers((List) invokers);
        RpcException exception = null;
        Result result = null;
        // Invoke every provider in turn
        for (Invoker<T> invoker : invokers) {
            try {
                result = invoker.invoke(invocation);
            } catch (RpcException e) {
                exception = e;
                logger.warn(e.getMessage(), e);
            } catch (Throwable e) {
                exception = new RpcException(e.getMessage(), e);
                logger.warn(e.getMessage(), e);
            }
        }
        // If any provider failed, report the error
        if (exception != null) {
            throw exception;
        }
        return result;
    }
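
One detail of the loop above is easy to miss: a failure does not stop the broadcast, and the error surfaces only after the last provider has been called. A minimal sketch, with the hypothetical class BroadcastSketch and Runnable standing in for an Invoker:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class BroadcastSketch {

    /** Calls every provider in turn; if any failed, rethrows the last failure after the loop. */
    static void invoke(List<Runnable> providers) {
        RuntimeException failure = null;
        for (Runnable provider : providers) {
            try {
                provider.run();
            } catch (RuntimeException e) {
                failure = e; // keep going so the remaining providers are still called
            }
        }
        if (failure != null) {
            throw failure;
        }
    }

    /** Demo: three cache nodes, the second one fails. Returns how many nodes were called. */
    static int demoCalls() {
        AtomicInteger called = new AtomicInteger();
        List<Runnable> nodes = Arrays.<Runnable>asList(
                () -> called.incrementAndGet(),
                () -> { called.incrementAndGet(); throw new IllegalStateException("node 2 unreachable"); },
                () -> called.incrementAndGet());
        try {
            invoke(nodes);
        } catch (IllegalStateException e) {
            // Reported only after node 3 has also been called.
        }
        return called.get();
    }

    public static void main(String[] args) {
        System.out.println(demoCalls());
    }
}
```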

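Fifth: failback

This is the fail-back (automatic recovery) strategy: when a call fails, the request is recorded and retried periodically in the background, while the exception is ignored and an empty result is returned to the caller.

The source code is as follows: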

@Override
    protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        Invoker<T> invoker = null;
        try {
            checkInvokers(invokers, invocation);
            invoker = select(loadbalance, invocation, invokers, null);
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            logger.error("Failback to invoke method " + invocation.getMethodName() + ", wait for retry in background. Ignored exception: "
                    + e.getMessage() + ", ", e);
            // Add the failed invocation to the timing-wheel queue for a later retry
            addFailed(loadbalance, invocation, invokers, invoker);
            return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
        }
    }
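
The listing hands the failed call to addFailed, which queues it on a timing wheel. As a rough sketch of the same record-and-retry-later idea, the example below swaps in a ScheduledExecutorService. FailbackSketch, its delay, and its retry limit are all assumptions made for illustration, not Dubbo's values.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class FailbackSketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long retryDelayMillis;
    private final int maxRetries;

    FailbackSketch(long retryDelayMillis, int maxRetries) {
        this.retryDelayMillis = retryDelayMillis;
        this.maxRetries = maxRetries;
    }

    /** Runs the call once; on failure, swallows the error and schedules background retries. */
    void invoke(Runnable call) {
        try {
            call.run();
        } catch (RuntimeException e) {
            scheduleRetry(call, maxRetries);
        }
    }

    private void scheduleRetry(Runnable call, int remaining) {
        if (remaining <= 0) {
            return; // give up silently; the caller has already been answered
        }
        scheduler.schedule(() -> {
            try {
                call.run();
            } catch (RuntimeException e) {
                scheduleRetry(call, remaining - 1);
            }
        }, retryDelayMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        scheduler.shutdownNow();
    }

    /** Demo: a notification that fails twice, then succeeds. Returns the total attempts made. */
    static int demoAttempts() {
        AtomicInteger attempts = new AtomicInteger();
        CountDownLatch delivered = new CountDownLatch(1);
        Runnable sendNotice = () -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IllegalStateException("provider unavailable");
            }
            delivered.countDown();
        };
        FailbackSketch failback = new FailbackSketch(20, 3);
        failback.invoke(sendNotice); // returns immediately even though the first attempt failed
        try {
            delivered.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            failback.shutdown();
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println(demoAttempts());
    }
}
```

Since the caller gets its answer before the retries run, this style fits operations that must eventually happen but whose result the caller does not need, such as message notifications.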

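Sixth: forking

This strategy calls several providers in parallel. The call succeeds as soon as one of them succeeds, and fails only if all of them fail. The maximum number of parallel calls can be set with forks="2".

The source code (excerpt) is as follows: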

// Loop over the selected providers and invoke each one in parallel on a thread pool
for (final Invoker<T> invoker : selected) {
                executor.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            Result result = invoker.invoke(invocation);
                            ref.offer(result);
                        } catch (Throwable e) {
                            int value = count.incrementAndGet();
                            // The failure count has reached the number of selected providers: every call failed
                            if (value >= selected.size()) {
                                ref.offer(e);
                            }
                        }
                    }
                });
            }
try {
                Object ret = ref.poll(timeout, TimeUnit.MILLISECONDS);
                // A Throwable in the queue means every provider failed, so the call fails
                if (ret instanceof Throwable) {
                    Throwable e = (Throwable) ret;
                    throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0, "Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e.getCause() != null ? e.getCause() : e);
                }
                // Otherwise return the successful result
                return (Result) ret;
            } catch (InterruptedException e) {
                throw new RpcException("Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e);
            }
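
Since the excerpt above omits its surrounding setup, here is a self-contained sketch of the same queue-based pattern. ForkingSketch is a hypothetical class, Callable stands in for an Invoker, and the fixed thread pool sized to the provider list is an assumption; results must be non-null because a BlockingQueue rejects null.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ForkingSketch {

    /** Calls all providers in parallel and returns the first successful (non-null) result. */
    static <T> T invoke(List<Callable<T>> providers, long timeoutMillis) {
        if (providers.isEmpty()) {
            throw new IllegalStateException("No provider available");
        }
        ExecutorService executor = Executors.newFixedThreadPool(providers.size());
        BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
        AtomicInteger failures = new AtomicInteger();
        try {
            for (Callable<T> provider : providers) {
                executor.execute(() -> {
                    try {
                        ref.offer(provider.call()); // first success to arrive wins
                    } catch (Throwable e) {
                        // Publish a failure only once every provider has failed.
                        if (failures.incrementAndGet() >= providers.size()) {
                            ref.offer(e);
                        }
                    }
                });
            }
            Object ret = ref.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (ret == null) {
                throw new IllegalStateException("Forking invoke timed out");
            }
            if (ret instanceof Throwable) {
                throw new IllegalStateException("All forked calls failed", (Throwable) ret);
            }
            @SuppressWarnings("unchecked")
            T result = (T) ret;
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for forked calls", e);
        } finally {
            executor.shutdownNow(); // stop any calls that are still running
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> providers = Arrays.<Callable<String>>asList(
                () -> { throw new IllegalStateException("provider A is down"); },
                () -> "hello from provider B");
        System.out.println(invoke(providers, 1000));
    }
}
```

Parallel calls trade extra provider load for lower latency, so this style tends to suit read operations more than writes.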

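Summary

This article analyzed how six of Dubbo's cluster fault tolerance mechanisms work. With these in mind, it becomes much easier to study how other middleware, such as Redis, handles cluster fault tolerance.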