An Analysis of Dubbo's Cluster Fault-Tolerance Strategies

Article link: https://blog.csdn.net/J_bean/article/details/135942181

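1 Overview of Dubbo's Cluster Fault-Tolerance Strategies

Dubbo's cluster fault-tolerance strategy (cluster strategy) applies on the service consumer side and is configured there. When a consumer's call to a provider fails, Dubbo offers several cluster fault-tolerance strategies for handling the failure. The default is the "retry on failure" strategy, Failover Cluster. The sections below walk through these strategies one by one.

When a consumer makes a remote call, the invoke method of AbstractClusterInvoker is executed. It delegates to the doInvoke method of the concrete strategy class, which applies the fault-tolerance strategy and issues the final remote call. The source code is shown below.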

// The invoke method of AbstractClusterInvoker
@Override
public Result invoke(final Invocation invocation) throws RpcException {
    checkWhetherDestroyed();

    InvocationProfilerUtils.enterDetailProfiler(invocation, () -> "Router route.");

    // Get the list of invokers (the service providers)
    List<Invoker<T>> invokers = list(invocation);
    InvocationProfilerUtils.releaseDetailProfiler(invocation);

    checkInvokers(invokers, invocation);

    // Get the load-balancing strategy configured on the consumer; the default is random
    LoadBalance loadbalance = initLoadBalance(invokers, invocation);
    RpcUtils.attachInvocationIdIfAsync(getUrl(), invocation);

    InvocationProfilerUtils.enterDetailProfiler(invocation, () -> "Cluster " + this.getClass().getName() + " invoke.");
    try {
        // Perform the call: the configured cluster fault-tolerance strategy runs here
        // and uses the load-balancing strategy to pick a provider to invoke
        return doInvoke(invocation, invokers, loadbalance);
    } finally {
        InvocationProfilerUtils.releaseDetailProfiler(invocation);
    }
}
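
The split between invoke and doInvoke is a template method: the base class performs the shared steps, and each strategy supplies only doInvoke. The sketch below models that shape without any Dubbo dependency. Providers are represented as plain Supplier objects, and ClusterInvokerSketch and FirstProviderInvoker are hypothetical names invented for illustration, not Dubbo classes.

```java
import java.util.List;
import java.util.function.Supplier;

// Dubbo-free sketch of the template-method split between invoke and doInvoke.
// Providers are modeled as Suppliers; every name here is hypothetical.
abstract class ClusterInvokerSketch<T> {
    private final List<Supplier<T>> directory;

    ClusterInvokerSketch(List<Supplier<T>> directory) {
        this.directory = directory;
    }

    // Shared steps: list the providers, check them, then delegate to the strategy.
    final T invoke() {
        List<Supplier<T>> providers = directory;
        if (providers.isEmpty()) {
            throw new IllegalStateException("no provider available");
        }
        return doInvoke(providers);
    }

    // Each fault-tolerance strategy supplies its own doInvoke.
    protected abstract T doInvoke(List<Supplier<T>> providers);
}

// The simplest possible strategy: call the first provider exactly once.
final class FirstProviderInvoker<T> extends ClusterInvokerSketch<T> {
    FirstProviderInvoker(List<Supplier<T>> directory) {
        super(directory);
    }

    @Override
    protected T doInvoke(List<Supplier<T>> providers) {
        return providers.get(0).get();
    }
}
```

Because invoke is final, a strategy cannot skip the provider check; it can only change what happens once a non-empty provider list is available.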

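2 Source Code Analysis of Dubbo's Cluster Fault-Tolerance Strategies

This section introduces the main cluster fault-tolerance strategies and walks through their source code.

2.1 Failover Cluster (retry on failure)

2.1.1 Overview

Key point: when a consumer's call to a provider fails, the consumer automatically switches to another provider and retries.
Applicable scenarios: read operations, or idempotent write operations.
Caveat: retries increase latency.
Usage: set the retries parameter. For example, retries="3" means up to 3 retries (not counting the first call), so at most 4 calls in total. The retry count can be set at the interface level or at the method level. If it is not set explicitly, the default is 2 retries. When the failover strategy is in use, retries must be set to 0 for non-idempotent write operations to prevent problems such as duplicated data.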

<dubbo:reference ... cluster="failover" retries="3"/>
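
The relationship between the retries value and the total number of calls is simple arithmetic, sketched below. This models only the calculation (first call plus retries, with a floor of one call); RetryMath and maxInvocations are hypothetical names, not Dubbo APIs.

```java
// Simplified model of how a retries setting maps to the maximum number of calls.
// RetryMath and maxInvocations are hypothetical names, not Dubbo APIs.
final class RetryMath {
    private RetryMath() {
    }

    // Total attempts = first call + retries, and never fewer than one call.
    static int maxInvocations(int retries) {
        int attempts = retries + 1;
        return attempts <= 0 ? 1 : attempts;
    }

    public static void main(String[] args) {
        System.out.println(maxInvocations(3)); // retries="3"
        System.out.println(maxInvocations(0)); // retries="0", recommended for non-idempotent writes
    }
}
```

With retries set to 0 the call is made exactly once, which is why that value is the safe choice for non-idempotent writes.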

2.1.2 Source Code Analysis

FailoverCluster creates the implementation class FailoverClusterInvoker, whose main method is doInvoke(). Both are shown below.

public class FailoverCluster extends AbstractCluster {

    public final static String NAME = "failover";

    @Override
    public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
        return new FailoverClusterInvoker<>(directory);
    }

}


// The doInvoke() method of FailoverClusterInvoker
public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    // All service providers
    List<Invoker<T>> copyInvokers = invokers;
    String methodName = RpcUtils.getMethodName(invocation);

    // Compute the maximum number of calls (retries + 1)
    int len = calculateInvokeTimes(methodName);
    // retry loop.
    RpcException le = null;
    List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyInvokers.size()); // invoked invokers.
    Set<String> providers = new HashSet<String>(len);
    for (int i = 0; i < len; i++) {
        //Reselect before retry to avoid a change of candidate `invokers`.
        //NOTE: if `invokers` changed, then `invoked` also lose accuracy.
        if (i > 0) {
            checkWhetherDestroyed();
            // On a retry, fetch the list of providers again
            copyInvokers = list(invocation);
            // check again
            checkInvokers(copyInvokers, invocation);
        }

        // Select a provider according to the load-balancing strategy
        Invoker<T> invoker = select(loadbalance, invocation, copyInvokers, invoked);
        invoked.add(invoker);
        RpcContext.getServiceContext().setInvokers((List) invoked);
        boolean success = false;
        try {
            // Issue the remote call
            Result result = invokeWithContext(invoker, invocation);
            if (le != null && logger.isWarnEnabled()) {
                logger.warn(CLUSTER_FAILED_MULTIPLE_RETRIES,"failed to retry do invoke","","Although retry the method " + methodName
                    + " in the service " + getInterface().getName()
                    + " was successful by the provider " + invoker.getUrl().getAddress()
                    + ", but there have been failed providers " + providers
                    + " (" + providers.size() + "/" + copyInvokers.size()
                    + ") from the registry " + directory.getUrl().getAddress()
                    + " on the consumer " + NetUtils.getLocalHost()
                    + " using the dubbo version " + Version.getVersion() + ". Last error is: "
                    + le.getMessage(),le);
            }
            success = true;
            return result;
        } catch (RpcException e) {
            // Do not retry a business exception (BIZ_EXCEPTION)
            if (e.isBiz()) {
                throw e;
            }
            le = e;
        } catch (Throwable e) {
            le = new RpcException(e.getMessage(), e);
        } finally {
            if (!success) {
                providers.add(invoker.getUrl().getAddress());
            }
        }
    }
    throw new RpcException(le.getCode(), "Failed to invoke the method "
        + methodName + " in the service " + getInterface().getName()
        + ". Tried " + len + " times of the providers " + providers
        + " (" + providers.size() + "/" + copyInvokers.size()
        + ") from the registry " + directory.getUrl().getAddress()
        + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version "
        + Version.getVersion() + ". Last error is: "
        + le.getMessage(), le.getCause() != null ? le.getCause() : le);
}

// Compute the maximum number of calls
private int calculateInvokeTimes(String methodName) {
    int len = getUrl().getMethodParameter(methodName, RETRIES_KEY, DEFAULT_RETRIES) + 1;
    RpcContext rpcContext = RpcContext.getClientAttachment();
    Object retry = rpcContext.getObjectAttachment(RETRIES_KEY);
    if (retry instanceof Number) {
        len = ((Number) retry).intValue() + 1;
        rpcContext.removeAttachment(RETRIES_KEY);
    }
    if (len <= 0) {
        len = 1;
    }

    return len;
}

// Issue the remote call
protected Result invokeWithContext(Invoker<T> invoker, Invocation invocation) {
    Invoker<T> originInvoker = setContext(invoker);
    Result result;
    try {
        if (ProfilerSwitch.isEnableSimpleProfiler()) {
            InvocationProfilerUtils.enterProfiler(invocation, "Invoker invoke. Target Address: " + invoker.getUrl().getAddress());
        }
        setRemote(invoker, invocation);
        result = invoker.invoke(invocation);
    } finally {
        clearContext(originInvoker);
        InvocationProfilerUtils.releaseSimpleProfiler(invocation);
    }
    return result;
}
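
The retry loop can be reduced to a few lines once Dubbo's context handling, logging, and business-exception check are set aside. The following is a minimal Dubbo-free sketch of the same idea, not the library's implementation: providers are plain Supplier objects, the selection helper is a stand-in for load balancing that simply prefers providers not yet tried, and FailoverSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Simplified, Dubbo-free model of a failover loop. Each "provider" is a
// Supplier that either returns a value or throws. All names are hypothetical.
final class FailoverSketch {
    private FailoverSketch() {
    }

    static <T> T invoke(List<Supplier<T>> providers, int retries) {
        if (providers.isEmpty()) {
            throw new IllegalArgumentException("no providers available");
        }
        int maxAttempts = Math.max(retries + 1, 1); // first call plus retries
        List<Supplier<T>> tried = new ArrayList<>();
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            Supplier<T> provider = select(providers, tried, i);
            tried.add(provider);
            try {
                return provider.get(); // success ends the loop immediately
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next attempt
            }
        }
        throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
    }

    // Stand-in for load balancing: prefer a provider that has not been tried yet,
    // and fall back to cycling through the list once every provider has been tried.
    private static <T> Supplier<T> select(List<Supplier<T>> providers,
                                          List<Supplier<T>> tried, int attempt) {
        for (Supplier<T> candidate : providers) {
            if (!tried.contains(candidate)) {
                return candidate;
            }
        }
        return providers.get(attempt % providers.size());
    }
}
```

As in the listing above, the last failure is kept so that it can be attached to the exception thrown once every attempt has been used up.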

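2.2 Failfast Cluster (fail fast)

2.2.1 Overview

Key point: when the consumer's call to a provider fails, an exception is thrown immediately; the call is made only once.
Applicable scenarios: typically non-idempotent write operations.
Usage: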

<dubbo:reference ... cluster="failfast"/>

2.2.2 Source Code Analysis

The implementation class is FailfastClusterInvoker, and its main method is doInvoke(), shown below.

public class FailfastCluster extends AbstractCluster {

    public final static String NAME = "failfast";

    @Override
    public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
        return new FailfastClusterInvoker<>(directory);
    }

}


public class FailfastClusterInvoker<T> extends AbstractClusterInvoker<T> {

    public FailfastClusterInvoker(Directory<T> directory) {
        super(directory);
    }

    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        // Select a provider
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        try {
            // Issue the remote call
            return invokeWithContext(invoker, invocation);
        } catch (Throwable e) {
            if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.
                throw (RpcException) e;
            }
            throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,
                "Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName()
                    + " for service " + getInterface().getName()
                    + " method " + RpcUtils.getMethodName(invocation) + " on consumer " + NetUtils.getLocalHost()
                    + " use dubbo version " + Version.getVersion()
                    + ", but no luck to perform the invocation. Last error is: " + e.getMessage(),
                e.getCause() != null ? e.getCause() : e);
        }
    }
}

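2.3 Failsafe Cluster (fail safe)

2.3.1 Overview

Key point: when the consumer's call to a provider fails, the error is ignored and an empty result is returned directly.
Applicable scenarios: typically operations such as writing audit logs.
Usage: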

<dubbo:reference ... cluster="failsafe"/>

2.3.2 Source Code Analysis

The implementation class is FailsafeClusterInvoker, and its main method is doInvoke(), shown below.

public class FailsafeCluster extends AbstractCluster {

    public final static String NAME = "failsafe";

    @Override
    public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
        return new FailsafeClusterInvoker<>(directory);
    }

}

public class FailsafeClusterInvoker<T> extends AbstractClusterInvoker<T> {
    private static final ErrorTypeAwareLogger logger = LoggerFactory.getErrorTypeAwareLogger(FailsafeClusterInvoker.class);

    public FailsafeClusterInvoker(Directory<T> directory) {
        super(directory);
    }

    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        try {
            Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
            return invokeWithContext(invoker, invocation);
        } catch (Throwable e) {
            logger.error(CLUSTER_ERROR_RESPONSE,"Failsafe for provider exception","","Failsafe ignore exception: " + e.getMessage(),e);
            return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
        }
    }
}
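
The failsafe behaviour amounts to "log the failure and return an empty result". A minimal Dubbo-free sketch follows, under two assumptions: the empty result is modeled with Optional rather than Dubbo's AsyncRpcResult, and only RuntimeException is caught, whereas the listing above catches Throwable. FailsafeSketch is a hypothetical name.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Dubbo-free sketch of failsafe: a failure is logged and swallowed, and the
// caller receives an empty result. FailsafeSketch is a hypothetical name.
final class FailsafeSketch {
    private FailsafeSketch() {
    }

    static <T> Optional<T> invoke(Supplier<T> provider) {
        try {
            return Optional.ofNullable(provider.get());
        } catch (RuntimeException e) {
            // Mirror the "ignore" branch: record the error, then return empty.
            System.err.println("Failsafe ignore exception: " + e.getMessage());
            return Optional.empty();
        }
    }
}
```

The caller cannot tell a failed call from one that legitimately returned nothing, which is why this strategy suits side-effect operations such as audit logging, where the result is not needed.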

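2.4 Failback Cluster (automatic recovery on failure)

2.4.1 Overview

Key point: when the consumer's call to a provider fails, the failed request is recorded and re-sent on a timer according to a fixed policy.
Applicable scenarios: message notification operations.
Usage: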

<dubbo:reference ... cluster="failback" retries="3" failbacktasks="100"/>

failbacktasks - the upper limit on the number of pending retry tasks

2.4.2 Source Code Analysis

The implementation class is FailbackClusterInvoker, and its main method is doInvoke(), shown below.

public class FailbackCluster extends AbstractCluster {

    public final static String NAME = "failback";

    @Override
    public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
        return new FailbackClusterInvoker<>(directory);
    }

}


// The doInvoke method of FailbackClusterInvoker
@Override
protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    Invoker<T> invoker = null;
    URL consumerUrl = RpcContext.getServiceContext().getConsumerUrl();
    try {
        invoker = select(loadbalance, invocation, invokers, null);
        // Asynchronous call method must be used here, because failback will retry in the background.
        // Then the serviceContext will be cleared after the call is completed.
        return invokeWithContextAsync(invoker, invocation, consumerUrl);
    } catch (Throwable e) {
        logger.error(CLUSTER_FAILED_INVOKE_SERVICE,"Failback to invoke method and start to retries",
            "","Failback to invoke method " + RpcUtils.getMethodName(invocation) +
                ", wait for retry in background. Ignored exception: "
            + e.getMessage() + ", ",e);
        if (retries > 0) {
            // Add the failed request to the timer for a scheduled retry
            addFailed(loadbalance, invocation, invokers, invoker, consumerUrl);
        }
        return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
    }
}

Adding the failed request to the scheduled retry task:

    private void addFailed(LoadBalance loadbalance, Invocation invocation, List<Invoker<T>> invokers, Invoker<T> lastInvoker, URL consumerUrl) {
        // Lazily create the timer
        if (failTimer == null) {
            synchronized (this) {
                if (failTimer == null) {
                    failTimer = new HashedWheelTimer(
                        new NamedThreadFactory("failback-cluster-timer", true),
                        1,
                        TimeUnit.SECONDS, 32, failbackTasks);
                }
            }
        }

        // Create a timer task
        RetryTimerTask retryTimerTask = new RetryTimerTask(loadbalance, invocation, invokers, lastInvoker, retries, RETRY_FAILED_PERIOD, consumerUrl);
        try {
            // Run the timer task after a delay of RETRY_FAILED_PERIOD (5s)
            failTimer.newTimeout(retryTimerTask, RETRY_FAILED_PERIOD, TimeUnit.SECONDS);
        } catch (Throwable e) {
            logger.error(CLUSTER_TIMER_RETRY_FAILED,"add newTimeout exception","","Failback background works error, invocation->" + invocation + ", exception: " + e.getMessage(),e);
        }
    }

Running the timer task:

    // The run method of RetryTimerTask
    @Override
    public void run(Timeout timeout) {
        try {
            logger.info("Attempt to retry to invoke method " + RpcUtils.getMethodName(invocation) +
                    ". The total will retry " + retries + " times, the current is the " + retriedTimes + " retry");
            // Select an invoker (a service provider)
            Invoker<T> retryInvoker = select(loadbalance, invocation, invokers, Collections.singletonList(lastInvoker));
            lastInvoker = retryInvoker;
            // Perform the call
            invokeWithContextAsync(retryInvoker, invocation, consumerUrl);
        } catch (Throwable e) {
            logger.error(CLUSTER_FAILED_INVOKE_SERVICE,"Failed retry to invoke method","","Failed retry to invoke method " + RpcUtils.getMethodName(invocation) + ", waiting again.",e);
            if ((++retriedTimes) >= retries) {
                // The retry limit has been reached; give up
                logger.error(CLUSTER_FAILED_INVOKE_SERVICE,"Failed retry to invoke method and retry times exceed threshold","","Failed retry times exceed threshold (" + retries + "), We have to abandon, invocation->" + invocation,e);
            } else {
                // Schedule another retry
                rePut(timeout);
            }
        }
    }

    private void rePut(Timeout timeout) {
        if (timeout == null) {
            return;
        }

        Timer timer = timeout.timer();
        if (timer.isStop() || timeout.isCancelled()) {
            return;
        }

        timer.newTimeout(timeout.task(), tick, TimeUnit.SECONDS);
    }
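
The failback flow (fail, return at once, retry in the background up to a limit) can be sketched without Dubbo as follows. This is an illustration under stated assumptions rather than the library's code: a ScheduledExecutorService stands in for HashedWheelTimer, the request is a plain Runnable, the retry period is passed in milliseconds, and FailbackSketch is a hypothetical name. The give-up condition mirrors the `(++retriedTimes) >= retries` check in the listing above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Dubbo-free sketch of failback: a failed call returns immediately and is
// retried in the background. A ScheduledExecutorService stands in for
// HashedWheelTimer; all names are hypothetical.
final class FailbackSketch {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "failback-sketch-timer");
            t.setDaemon(true); // do not keep the JVM alive for pending retries
            return t;
        });
    private final int retries;
    private final long periodMillis;

    FailbackSketch(int retries, long periodMillis) {
        this.retries = retries;
        this.periodMillis = periodMillis;
    }

    // Runs the task once; on failure, returns to the caller and retries later.
    void invoke(Runnable task) {
        try {
            task.run();
        } catch (RuntimeException e) {
            if (retries > 0) {
                schedule(task, 0);
            }
        }
    }

    private void schedule(Runnable task, int retriedTimes) {
        timer.schedule(() -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                int next = retriedTimes + 1;
                if (next < retries) {
                    schedule(task, next); // put the task back for another round
                }
                // Otherwise the retry limit is reached and the task is abandoned.
            }
        }, periodMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        timer.shutdownNow();
    }
}
```

The caller never sees the outcome of a background retry, so this pattern only fits operations such as notifications, where eventual delivery matters more than an immediate answer.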

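2.5 Forking Cluster (parallel calls)

2.5.1 Overview

Key point: when the consumer calls the interface, it invokes several providers in parallel and returns as soon as any one of them succeeds.
Applicable scenarios: read operations with strict real-time requirements.
Caveat: consumes more service resources.
Usage: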

<dubbo:reference ... cluster="forking" forks="3" />

forks - the maximum number of parallel calls

2.5.2 Source Code Analysis

The implementation class is ForkingClusterInvoker, and its main method is doInvoke(), shown below.

public class ForkingCluster extends AbstractCluster {

    public final static String NAME = "forking";

    @Override
    public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
        return new ForkingClusterInvoker<>(directory);
    }

}


public class ForkingClusterInvoker<T> extends AbstractClusterInvoker<T> {

    /**
     * Use {@link NamedInternalThreadFactory} to produce {@link org.apache.dubbo.common.threadlocal.InternalThread}
     * which with the use of {@link org.apache.dubbo.common.threadlocal.InternalThreadLocal} in {@link RpcContext}.
     */
    private final ExecutorService executor;

    public ForkingClusterInvoker(Directory<T> directory) {
        super(directory);
        executor = directory.getUrl().getOrDefaultFrameworkModel().getBeanFactory()
            .getBean(FrameworkExecutorRepository.class).getSharedExecutor();
    }

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        try {
            final List<Invoker<T>> selected;
            final int forks = getUrl().getParameter(FORKS_KEY, DEFAULT_FORKS);
            final int timeout = getUrl().getParameter(TIMEOUT_KEY, DEFAULT_TIMEOUT);
            if (forks <= 0 || forks >= invokers.size()) {
                selected = invokers;
            } else {
                selected = new ArrayList<>(forks);
                while (selected.size() < forks) {
                    Invoker<T> invoker = select(loadbalance, invocation, invokers, selected);
                    if (!selected.contains(invoker)) {
                        //Avoid add the same invoker several times.
                        selected.add(invoker);
                    }
                }
            }
            RpcContext.getServiceContext().setInvokers((List) selected);
            final AtomicInteger count = new AtomicInteger();
            final BlockingQueue<Object> ref = new LinkedBlockingQueue<>(1);
            
            // Invoke the selected providers concurrently
            selected.forEach(invoker -> {
                URL consumerUrl = RpcContext.getServiceContext().getConsumerUrl();
                CompletableFuture.<Object>supplyAsync(() -> {
                    if (ref.size() > 0) {
                        return null;
                    }
                    return invokeWithContextAsync(invoker, invocation, consumerUrl);
                }, executor).whenComplete((v, t) -> {
                    if (t == null) {
                        // The call succeeded
                        ref.offer(v);
                    } else {
                        int value = count.incrementAndGet();
                        if (value >= selected.size()) {
                            // Every call failed
                            ref.offer(t);
                        }
                    }
                });
            });
            try {
                // Fetch the result. As soon as one call succeeds, ref holds a result, so this returns immediately.
                Object ret = ref.poll(timeout, TimeUnit.MILLISECONDS);
                if (ret instanceof Throwable) {
                    Throwable e = ret instanceof CompletionException ? ((CompletionException) ret).getCause() : (Throwable) ret;
                    throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : RpcException.UNKNOWN_EXCEPTION,
                        "Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. " +
                            "Last error is: " + e.getMessage(), e.getCause() != null ? e.getCause() : e);
                }
                return (Result) ret;
            } catch (InterruptedException e) {
                throw new RpcException("Failed to forking invoke provider " + selected + ", " +
                    "but no luck to perform the invocation. Last error is: " + e.getMessage(), e);
            }
        } finally {
            // clear attachments which is binding to current thread.
            RpcContext.getClientAttachment().clearAttachments();
        }
    }
}
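
The "first success wins, fail only when every fork fails" rule can be modeled in a few lines. The sketch below is Dubbo-free and makes one deliberate substitution: instead of the single-slot BlockingQueue used in the listing above, it completes a shared CompletableFuture, which expresses the same rule. Providers are plain Supplier objects and ForkingSketch is a hypothetical name.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Dubbo-free sketch of forking: call every provider in parallel and return the
// first successful result. All names are hypothetical.
final class ForkingSketch {
    private ForkingSketch() {
    }

    static <T> T invoke(List<Supplier<T>> providers, ExecutorService executor, long timeoutMillis) {
        if (providers.isEmpty()) {
            throw new IllegalArgumentException("no providers available");
        }
        CompletableFuture<T> first = new CompletableFuture<>();
        AtomicInteger failures = new AtomicInteger();
        for (Supplier<T> provider : providers) {
            CompletableFuture.supplyAsync(provider, executor).whenComplete((value, error) -> {
                if (error == null) {
                    first.complete(value); // only the first completion takes effect
                } else if (failures.incrementAndGet() >= providers.size()) {
                    first.completeExceptionally(error); // every fork has failed
                }
            });
        }
        try {
            return first.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for forks", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("forking invocation failed", e);
        }
    }
}
```

Slow forks keep running after the first success, which is the resource cost noted in the caveat for this strategy.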

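2.6 Broadcast Cluster (broadcast calls)

2.6.1 Overview

Key point: when the consumer calls the interface, it invokes every provider one after another; if any single call fails, the whole call is treated as failed.
Applicable scenarios: notifying all providers to update local resources such as caches or logs.
Usage: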

<dubbo:reference ... cluster="broadcast" />

2.6.2 Source Code Analysis

The implementation class is BroadcastClusterInvoker, and its main method is doInvoke(), shown below.

public class BroadcastCluster extends AbstractCluster {

    @Override
    public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
        return new BroadcastClusterInvoker<>(directory);
    }

}

// The doInvoke method of BroadcastClusterInvoker
public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    RpcContext.getServiceContext().setInvokers((List) invokers);
    RpcException exception = null;
    Result result = null;
    URL url = getUrl();

    int broadcastFailPercent = url.getParameter(BROADCAST_FAIL_PERCENT_KEY, MAX_BROADCAST_FAIL_PERCENT);

    if (broadcastFailPercent < MIN_BROADCAST_FAIL_PERCENT || broadcastFailPercent > MAX_BROADCAST_FAIL_PERCENT) {
        logger.info(String.format("The value corresponding to the broadcast.fail.percent parameter must be between 0 and 100. " +
                "The current setting is %s, which is reset to 100.", broadcastFailPercent));
        broadcastFailPercent = MAX_BROADCAST_FAIL_PERCENT;
    }

    int failThresholdIndex = invokers.size() * broadcastFailPercent / MAX_BROADCAST_FAIL_PERCENT;
    int failIndex = 0;
    
    // Invoke all providers one by one
    for (int i = 0, invokersSize = invokers.size(); i < invokersSize; i++) {
        Invoker<T> invoker = invokers.get(i);
        RpcContext.RestoreContext restoreContext = new RpcContext.RestoreContext();
        try {
            RpcInvocation subInvocation = new RpcInvocation(invocation.getTargetServiceUniqueName(),
                invocation.getServiceModel(), invocation.getMethodName(), invocation.getServiceName(), invocation.getProtocolServiceKey(),
                invocation.getParameterTypes(), invocation.getArguments(), invocation.copyObjectAttachments(),
                invocation.getInvoker(), Collections.synchronizedMap(new HashMap<>(invocation.getAttributes())),
                invocation instanceof RpcInvocation ? ((RpcInvocation) invocation).getInvokeMode() : null);
            result = invokeWithContext(invoker, subInvocation);
            if (null != result && result.hasException()) {
                Throwable resultException = result.getException();
                if (null != resultException) {
                    exception = getRpcException(result.getException());
                    logger.warn(CLUSTER_ERROR_RESPONSE,"provider return error response","",exception.getMessage(),exception);
                    failIndex++;
                    if (failIndex == failThresholdIndex) {
                        break;
                    }
                }
            }
        } catch (Throwable e) {
            exception = getRpcException(e);
            logger.warn(CLUSTER_ERROR_RESPONSE,"provider return error response","",exception.getMessage(),exception);
            failIndex++;
            if (failIndex == failThresholdIndex) {
                break;
            }
        } finally {
            if (i != invokersSize - 1) {
                restoreContext.restore();
            }
        }
    }

    // If any call raised an exception, throw the last one recorded
    if (exception != null) {
        if (failIndex == failThresholdIndex) {
            if (logger.isDebugEnabled()) {
                logger.debug(
                    String.format("The number of BroadcastCluster call failures has reached the threshold %s", failThresholdIndex));

            }
        } else {
            if (logger.isDebugEnabled()) {
                logger.debug(String.format("The number of BroadcastCluster call failures has not reached the threshold %s, fail size is %s",
                    failThresholdIndex, failIndex));
            }
        }
        throw exception;
    }

    return result;
}
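
The early-exit threshold in the loop above is computed with integer arithmetic from the provider count and the broadcast.fail.percent value. The sketch below reproduces only that calculation, including the reset to 100 for out-of-range values; BroadcastThreshold and failThreshold are hypothetical names, and the bounds 0 and 100 are taken from the log message in the listing.

```java
// Sketch of the fail-threshold arithmetic used by the broadcast loop.
// BroadcastThreshold and failThreshold are hypothetical names, not Dubbo APIs.
final class BroadcastThreshold {
    private BroadcastThreshold() {
    }

    // Number of failures at which the broadcast loop stops early.
    static int failThreshold(int invokerCount, int failPercent) {
        if (failPercent < 0 || failPercent > 100) {
            failPercent = 100; // out-of-range values are reset to 100
        }
        return invokerCount * failPercent / 100; // integer division rounds down
    }

    public static void main(String[] args) {
        System.out.println(failThreshold(10, 20));  // 10 providers at 20%
        System.out.println(failThreshold(10, 100)); // default percent
    }
}
```

With the default of 100 the threshold equals the provider count, so the loop stops early only if every provider fails. Because the loop compares with `==` after incrementing the failure counter, a threshold that rounds down to 0 never triggers an early exit.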

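3 Custom Cluster Fault-Tolerance Strategies

Writing a custom cluster fault-tolerance strategy is a good way to understand the built-in strategies and to use them flexibly.

3.1 Custom ClusterInvoker

Create a subclass of AbstractClusterInvoker and override its doInvoke method. For example: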

public class MyClusterInvoker<T> extends AbstractClusterInvoker<T> {

    public MyClusterInvoker(Directory<T> directory) {
        super(directory);
    }

    @Override
    protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        try {
            return invokeWithContext(invoker, invocation);
        } catch (Throwable e) {
            if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.
                throw (RpcException) e;
            }

            // ...

            throw new RpcException(e.getMessage(), e);
        }
    }
}

3.2 Custom Cluster Class

Create a subclass of AbstractCluster and override its doJoin method. For example:

public class MyCluster extends AbstractCluster {
    @Override
    protected <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
        return new MyClusterInvoker<>(directory);
    }
}

3.3 Configuration and Usage

Under the resources directory, add a META-INF/dubbo directory, then add a file named org.apache.dubbo.rpc.cluster.Cluster inside it, and register the custom Cluster class in that file.

mycluster=org.apache.dubbo.rpc.cluster.support.MyCluster

Then specify the custom cluster fault-tolerance strategy when referencing the interface.

<dubbo:reference ... cluster="mycluster"/>