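1 Overview of Dubbo's Cluster Fault-Tolerance Strategies
Dubbo's cluster fault-tolerance strategy (cluster strategy) applies to the service consumer and is configured on the consumer side. When a consumer's call to a service provider fails, Dubbo offers several cluster fault-tolerance strategies for handling the failure. The default is the retry-on-failure strategy, Failover Cluster. The strategies are examined one by one below.
When a service consumer makes a remote call, the invoke method of AbstractClusterInvoker is executed. It calls the doInvoke method of the concrete strategy class, which applies the fault-tolerance strategy and issues the final remote call. The source code is shown below.
// The invoke method of AbstractClusterInvoker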
@Override
public Result invoke(final Invocation invocation) throws RpcException {
checkWhetherDestroyed();
InvocationProfilerUtils.enterDetailProfiler(invocation, () -> "Router route.");
// Get the invoker list (the list of service providers)
List<Invoker<T>> invokers = list(invocation);
InvocationProfilerUtils.releaseDetailProfiler(invocation);
checkInvokers(invokers, invocation);
// Get the load-balancing strategy configured on the consumer; the default is random
LoadBalance loadbalance = initLoadBalance(invokers, invocation);
RpcUtils.attachInvocationIdIfAsync(getUrl(), invocation);
InvocationProfilerUtils.enterDetailProfiler(invocation, () -> "Cluster " + this.getClass().getName() + " invoke.");
try {
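// Perform the call: doInvoke is implemented by the cluster fault-tolerance strategy configured
// on the consumer, which selects a provider via the load-balancing strategy and invokes it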
return doInvoke(invocation, invokers, loadbalance);
} finally {
InvocationProfilerUtils.releaseDetailProfiler(invocation);
}
}
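The division of labour in invoke can be pictured as a template method: the base class prepares the provider list and leaves the failure handling to doInvoke. The following is a minimal plain-Java sketch of that idea only. ClusterTemplateSketch and its Supplier-based "providers" are hypothetical stand-ins rather than Dubbo types, and routing and load balancing are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for AbstractClusterInvoker; each Supplier plays the role of one provider.
abstract class ClusterTemplateSketch<T> {
    private final List<Supplier<T>> providers;

    ClusterTemplateSketch(List<Supplier<T>> providers) {
        this.providers = providers;
    }

    // Fixed skeleton: validate the provider list, then hand over to the strategy.
    final T invoke() {
        if (providers.isEmpty()) {
            throw new IllegalStateException("No provider available");
        }
        return doInvoke(providers);
    }

    // Each fault-tolerance strategy supplies its own failure handling here.
    protected abstract T doInvoke(List<Supplier<T>> providers);

    public static void main(String[] args) {
        List<Supplier<String>> providers =
                Arrays.asList(() -> "from provider A", () -> "from provider B");
        ClusterTemplateSketch<String> firstOnly = new ClusterTemplateSketch<String>(providers) {
            @Override
            protected String doInvoke(List<Supplier<String>> list) {
                return list.get(0).get(); // trivial strategy: always call the first provider
            }
        };
        System.out.println(firstOnly.invoke());
    }
}
```

Keeping invoke final in the sketch mirrors the point made above: every strategy shares the same preparation steps and differs only in doInvoke.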
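2 Source Code Analysis of Dubbo's Cluster Fault-Tolerance Strategies
The main cluster fault-tolerance strategies are introduced below, together with a walkthrough of their source code.
2.1 Failover Cluster (retry on failure)
2.1.1 Overview
- Key point: when a consumer's call to a provider fails, the consumer automatically switches to another provider server and retries.
- Applicable scenarios: read operations, or idempotent write operations.
- Caveat: retries increase latency.
- Usage: set the retries parameter. For example, retries="3" means 3 retries (not counting the first call), so at most 4 calls are made. The retry count can also be set at the interface level or the method level. When it is not set explicitly, the default is 2 retries. If the failover strategy is used for a non-idempotent write operation, retries must be set to 0 to prevent problems such as duplicated data.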
<dubbo:reference ... cluster="failover" retries="3"/>
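The arithmetic behind retries can be made concrete with a small sketch. It is plain Java rather than Dubbo code: InvokeTimesSketch and maxInvokeTimes are hypothetical names, and the default of 2 is the value stated above.

```java
// Hypothetical helper, not part of Dubbo: it only mirrors the "retries + 1" rule described above.
final class InvokeTimesSketch {
    static final int DEFAULT_RETRIES = 2; // default retry count stated in the text

    // Returns the maximum number of calls for a retries setting (null means "not configured").
    static int maxInvokeTimes(Integer retries) {
        int len = (retries == null ? DEFAULT_RETRIES : retries) + 1;
        return len <= 0 ? 1 : len; // the first call is always made
    }

    public static void main(String[] args) {
        System.out.println(maxInvokeTimes(3));    // retries="3"
        System.out.println(maxInvokeTimes(null)); // retries not set
        System.out.println(maxInvokeTimes(0));    // retries="0", as advised for non-idempotent writes
    }
}
```

With retries="0" the result is a single call, which is why that setting is the safe choice for non-idempotent writes.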
2.1.2 Source Code Analysis
The implementation class is FailoverClusterInvoker, and its key method is doInvoke(), shown below.
public class FailoverCluster extends AbstractCluster {
public final static String NAME = "failover";
@Override
public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
return new FailoverClusterInvoker<>(directory);
}
}
// The doInvoke() method of FailoverClusterInvoker
public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
// All service providers
List<Invoker<T>> copyInvokers = invokers;
String methodName = RpcUtils.getMethodName(invocation);
// Compute the maximum number of calls (retries + 1)
int len = calculateInvokeTimes(methodName);
// retry loop.
RpcException le = null;
List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyInvokers.size()); // invoked invokers.
Set<String> providers = new HashSet<String>(len);
for (int i = 0; i < len; i++) {
//Reselect before retry to avoid a change of candidate `invokers`.
//NOTE: if `invokers` changed, then `invoked` also lose accuracy.
if (i > 0) {
checkWhetherDestroyed();
// On a retry, fetch the full provider list again
copyInvokers = list(invocation);
// check again
checkInvokers(copyInvokers, invocation);
}
// Select a provider according to the load-balancing strategy
Invoker<T> invoker = select(loadbalance, invocation, copyInvokers, invoked);
invoked.add(invoker);
RpcContext.getServiceContext().setInvokers((List) invoked);
boolean success = false;
try {
// Issue the remote call
Result result = invokeWithContext(invoker, invocation);
if (le != null && logger.isWarnEnabled()) {
logger.warn(CLUSTER_FAILED_MULTIPLE_RETRIES,"failed to retry do invoke","","Although retry the method " + methodName
+ " in the service " + getInterface().getName()
+ " was successful by the provider " + invoker.getUrl().getAddress()
+ ", but there have been failed providers " + providers
+ " (" + providers.size() + "/" + copyInvokers.size()
+ ") from the registry " + directory.getUrl().getAddress()
+ " on the consumer " + NetUtils.getLocalHost()
+ " using the dubbo version " + Version.getVersion() + ". Last error is: "
+ le.getMessage(),le);
}
success = true;
return result;
} catch (RpcException e) {
// Business exceptions (BIZ_EXCEPTION) are not retried
if (e.isBiz()) {
throw e;
}
le = e;
} catch (Throwable e) {
le = new RpcException(e.getMessage(), e);
} finally {
if (!success) {
providers.add(invoker.getUrl().getAddress());
}
}
}
throw new RpcException(le.getCode(), "Failed to invoke the method "
+ methodName + " in the service " + getInterface().getName()
+ ". Tried " + len + " times of the providers " + providers
+ " (" + providers.size() + "/" + copyInvokers.size()
+ ") from the registry " + directory.getUrl().getAddress()
+ " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version "
+ Version.getVersion() + ". Last error is: "
+ le.getMessage(), le.getCause() != null ? le.getCause() : le);
}
// Compute the maximum number of calls
private int calculateInvokeTimes(String methodName) {
int len = getUrl().getMethodParameter(methodName, RETRIES_KEY, DEFAULT_RETRIES) + 1;
RpcContext rpcContext = RpcContext.getClientAttachment();
Object retry = rpcContext.getObjectAttachment(RETRIES_KEY);
if (retry instanceof Number) {
len = ((Number) retry).intValue() + 1;
rpcContext.removeAttachment(RETRIES_KEY);
}
if (len <= 0) {
len = 1;
}
return len;
}
// Issue the remote call
protected Result invokeWithContext(Invoker<T> invoker, Invocation invocation) {
Invoker<T> originInvoker = setContext(invoker);
Result result;
try {
if (ProfilerSwitch.isEnableSimpleProfiler()) {
InvocationProfilerUtils.enterProfiler(invocation, "Invoker invoke. Target Address: " + invoker.getUrl().getAddress());
}
setRemote(invoker, invocation);
result = invoker.invoke(invocation);
} finally {
clearContext(originInvoker);
InvocationProfilerUtils.releaseSimpleProfiler(invocation);
}
return result;
}
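Stripped of logging, context handling and load balancing, the retry loop above reduces to a few lines. The following plain-Java sketch illustrates that control flow under simplifying assumptions: FailoverSketch is a hypothetical name, each Supplier stands in for one provider, and round-robin selection replaces Dubbo's select method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of the failover loop; not Dubbo code.
final class FailoverSketch {
    // Makes at most retries + 1 calls, moving on to another provider after each failure.
    static <T> T invoke(List<Supplier<T>> providers, int retries) {
        if (providers.isEmpty()) {
            throw new IllegalStateException("No provider available");
        }
        int maxCalls = Math.max(retries + 1, 1);
        RuntimeException last = null;
        for (int i = 0; i < maxCalls; i++) {
            // Round-robin is a stand-in for the load balancer's choice.
            Supplier<T> provider = providers.get(i % providers.size());
            try {
                return provider.get(); // first success ends the loop
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next provider
            }
        }
        throw new IllegalStateException(
                "Tried " + maxCalls + " times. Last error is: " + last.getMessage(), last);
    }

    public static void main(String[] args) {
        List<Supplier<String>> providers = new ArrayList<>();
        providers.add(() -> { throw new IllegalStateException("provider A is down"); });
        providers.add(() -> "hello from provider B");
        System.out.println(invoke(providers, 2));
    }
}
```

Unlike the real implementation, the sketch does not distinguish business exceptions, which Dubbo rethrows immediately without retrying.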
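2.2 Failfast Cluster (fail fast)
2.2.1 Overview
- Key point: when the consumer's call to a provider fails, an exception is thrown immediately; that is, only one call is made.
- Applicable scenarios: typically used for non-idempotent write operations.
- Usage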
<dubbo:reference ... cluster="failfast"/>
2.2.2 Source Code Analysis
The implementation class is FailfastClusterInvoker, and its key method is doInvoke(), shown below.
public class FailfastCluster extends AbstractCluster {
public final static String NAME = "failfast";
@Override
public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
return new FailfastClusterInvoker<>(directory);
}
}
public class FailfastClusterInvoker<T> extends AbstractClusterInvoker<T> {
public FailfastClusterInvoker(Directory<T> directory) {
super(directory);
}
@Override
public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
// Select a provider
Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
try {
// Perform the remote call
return invokeWithContext(invoker, invocation);
} catch (Throwable e) {
if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.
throw (RpcException) e;
}
throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,
"Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName()
+ " for service " + getInterface().getName()
+ " method " + RpcUtils.getMethodName(invocation) + " on consumer " + NetUtils.getLocalHost()
+ " use dubbo version " + Version.getVersion()
+ ", but no luck to perform the invocation. Last error is: " + e.getMessage(),
e.getCause() != null ? e.getCause() : e);
}
}
}
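2.3 Failsafe Cluster (fail safe)
2.3.1 Overview
- Key point: when the consumer's call to a provider fails, the error is ignored and an empty result is returned directly.
- Applicable scenarios: typically used for operations such as writing audit logs.
- Usage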
<dubbo:reference ... cluster="failsafe"/>
2.3.2 Source Code Analysis
The implementation class is FailsafeClusterInvoker, and its key method is doInvoke(), shown below.
public class FailsafeCluster extends AbstractCluster {
public final static String NAME = "failsafe";
@Override
public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
return new FailsafeClusterInvoker<>(directory);
}
}
public class FailsafeClusterInvoker<T> extends AbstractClusterInvoker<T> {
private static final ErrorTypeAwareLogger logger = LoggerFactory.getErrorTypeAwareLogger(FailsafeClusterInvoker.class);
public FailsafeClusterInvoker(Directory<T> directory) {
super(directory);
}
@Override
public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
try {
Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
return invokeWithContext(invoker, invocation);
} catch (Throwable e) {
logger.error(CLUSTER_ERROR_RESPONSE,"Failsafe for provider exception","","Failsafe ignore exception: " + e.getMessage(),e);
return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
}
}
}
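Failfast and Failsafe both make exactly one call and differ only in what happens when it fails. The contrast can be sketched in plain Java as follows; SingleCallSketch and its methods are hypothetical names, a Supplier stands in for the selected provider, and null stands in for Dubbo's empty result.

```java
import java.util.function.Supplier;

// Hypothetical side-by-side illustration of the two single-call strategies; not Dubbo code.
final class SingleCallSketch {
    // Failfast: one call, and any failure is rethrown to the caller immediately.
    static <T> T failfast(Supplier<T> provider) {
        try {
            return provider.get();
        } catch (RuntimeException e) {
            throw new IllegalStateException("Failfast invoke failed: " + e.getMessage(), e);
        }
    }

    // Failsafe: one call, and a failure is logged and replaced by an empty (null) result.
    static <T> T failsafe(Supplier<T> provider) {
        try {
            return provider.get();
        } catch (RuntimeException e) {
            System.err.println("Failsafe ignore exception: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        Supplier<String> broken = () -> { throw new IllegalStateException("provider down"); };
        System.out.println("failsafe result: " + failsafe(broken));
        try {
            failfast(broken);
        } catch (IllegalStateException e) {
            System.out.println("failfast threw: " + e.getMessage());
        }
    }
}
```

Because Failsafe hides the failure from the caller, it only suits calls whose result does not matter, such as the audit logging mentioned above.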
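2.4 Failback Cluster (automatic recovery after failure)
2.4.1 Overview
- Key point: when the consumer's call to a provider fails, the failed request is recorded and re-sent on a timer according to a retry policy.
- Applicable scenarios: message notification operations.
- Usage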
<dubbo:reference ... cluster="failback" retries="3" failbacktasks="100"/>
failbacktasks - the upper limit on the number of pending retry tasks
2.4.2 Source Code Analysis
The implementation class is FailbackClusterInvoker, and its key method is doInvoke(), shown below.
public class FailbackCluster extends AbstractCluster {
public final static String NAME = "failback";
@Override
public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
return new FailbackClusterInvoker<>(directory);
}
}
// The doInvoke method of FailbackClusterInvoker
@Override
protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
Invoker<T> invoker = null;
URL consumerUrl = RpcContext.getServiceContext().getConsumerUrl();
try {
invoker = select(loadbalance, invocation, invokers, null);
// Asynchronous call method must be used here, because failback will retry in the background.
// Then the serviceContext will be cleared after the call is completed.
return invokeWithContextAsync(invoker, invocation, consumerUrl);
} catch (Throwable e) {
logger.error(CLUSTER_FAILED_INVOKE_SERVICE,"Failback to invoke method and start to retries",
"","Failback to invoke method " + RpcUtils.getMethodName(invocation) +
", wait for retry in background. Ignored exception: "
+ e.getMessage() + ", ",e);
if (retries > 0) {
// Add the failed request to the scheduled retry tasks
addFailed(loadbalance, invocation, invokers, invoker, consumerUrl);
}
return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
}
}
The failed request is added to the scheduled retry tasks:
private void addFailed(LoadBalance loadbalance, Invocation invocation, List<Invoker<T>> invokers, Invoker<T> lastInvoker, URL consumerUrl) {
// Create the timer lazily (double-checked locking)
if (failTimer == null) {
synchronized (this) {
if (failTimer == null) {
failTimer = new HashedWheelTimer(
new NamedThreadFactory("failback-cluster-timer", true),
1,
TimeUnit.SECONDS, 32, failbackTasks);
}
}
}
// Create a timer task
RetryTimerTask retryTimerTask = new RetryTimerTask(loadbalance, invocation, invokers, lastInvoker, retries, RETRY_FAILED_PERIOD, consumerUrl);
try {
// Run the timer task after a delay of RETRY_FAILED_PERIOD (5s)
failTimer.newTimeout(retryTimerTask, RETRY_FAILED_PERIOD, TimeUnit.SECONDS);
} catch (Throwable e) {
logger.error(CLUSTER_TIMER_RETRY_FAILED,"add newTimeout exception","","Failback background works error, invocation->" + invocation + ", exception: " + e.getMessage(),e);
}
}
The timer task is then executed:
// The run method of RetryTimerTask
@Override
public void run(Timeout timeout) {
try {
logger.info("Attempt to retry to invoke method " + RpcUtils.getMethodName(invocation) +
". The total will retry " + retries + " times, the current is the " + retriedTimes + " retry");
// Select an invoker (service provider)
Invoker<T> retryInvoker = select(loadbalance, invocation, invokers, Collections.singletonList(lastInvoker));
lastInvoker = retryInvoker;
// Perform the call
invokeWithContextAsync(retryInvoker, invocation, consumerUrl);
} catch (Throwable e) {
logger.error(CLUSTER_FAILED_INVOKE_SERVICE,"Failed retry to invoke method","","Failed retry to invoke method " + RpcUtils.getMethodName(invocation) + ", waiting again.",e);
if ((++retriedTimes) >= retries) {
// Retry limit reached
logger.error(CLUSTER_FAILED_INVOKE_SERVICE,"Failed retry to invoke method and retry times exceed threshold","","Failed retry times exceed threshold (" + retries + "), We have to abandon, invocation->" + invocation,e);
} else {
// Retry again
rePut(timeout);
}
}
}
private void rePut(Timeout timeout) {
if (timeout == null) {
return;
}
Timer timer = timeout.timer();
if (timer.isStop() || timeout.isCancelled()) {
return;
}
timer.newTimeout(timeout.task(), tick, TimeUnit.SECONDS);
}
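The same "return at once, retry in the background" behaviour can be imitated with the JDK alone. The sketch below is illustrative, not Dubbo's implementation: FailbackSketch is a hypothetical name, a ScheduledExecutorService replaces HashedWheelTimer, a Supplier stands in for the provider, and the pending-task limit (failbacktasks) is not modelled.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration of failback; not Dubbo code.
final class FailbackSketch {
    private final ScheduledExecutorService timer;
    private final int retries;
    private final long periodMillis;

    FailbackSketch(int retries, long periodMillis) {
        this.retries = retries;
        this.periodMillis = periodMillis;
        this.timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "failback-sketch-timer");
            t.setDaemon(true); // background retries must not keep the JVM alive
            return t;
        });
    }

    // Calls once; on failure returns an empty (null) result and retries in the background.
    <T> T invoke(Supplier<T> provider) {
        try {
            return provider.get();
        } catch (RuntimeException e) {
            if (retries > 0) {
                scheduleRetry(provider, 0);
            }
            return null;
        }
    }

    // Schedules one retry; a failed retry reschedules itself until the retry limit is reached.
    private void scheduleRetry(Supplier<?> provider, int retriedTimes) {
        timer.schedule(() -> {
            try {
                provider.get();
            } catch (RuntimeException e) {
                if (retriedTimes + 1 < retries) {
                    scheduleRetry(provider, retriedTimes + 1);
                }
                // Otherwise the retry limit is reached and the request is abandoned.
            }
        }, periodMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        timer.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        FailbackSketch sketch = new FailbackSketch(3, 100);
        String first = sketch.invoke(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("provider down");
            }
            return "ok";
        });
        System.out.println("immediate result: " + first);
        Thread.sleep(500); // give the background retries time to run
        System.out.println("total calls: " + calls.get());
        sketch.shutdown();
    }
}
```

As in the original, the caller never sees the outcome of a background retry, which is why the strategy fits fire-and-forget notifications.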
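2.5 Forking Cluster (parallel calls)
2.5.1 Overview
- Key point: when the consumer calls the interface, it invokes several service providers in parallel and returns as soon as any one of them succeeds.
- Applicable scenarios: read operations with strict real-time requirements.
- Caveat: consumes more service resources.
- Usage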
<dubbo:reference ... cluster="forking" forks="3" />
forks - the maximum number of parallel calls
2.5.2 Source Code Analysis
The implementation class is ForkingClusterInvoker, and its key method is doInvoke(), shown below.
public class ForkingCluster extends AbstractCluster {
public final static String NAME = "forking";
@Override
public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
return new ForkingClusterInvoker<>(directory);
}
}
public class ForkingClusterInvoker<T> extends AbstractClusterInvoker<T> {
/**
* Use {@link NamedInternalThreadFactory} to produce {@link org.apache.dubbo.common.threadlocal.InternalThread}
* which with the use of {@link org.apache.dubbo.common.threadlocal.InternalThreadLocal} in {@link RpcContext}.
*/
private final ExecutorService executor;
public ForkingClusterInvoker(Directory<T> directory) {
super(directory);
executor = directory.getUrl().getOrDefaultFrameworkModel().getBeanFactory()
.getBean(FrameworkExecutorRepository.class).getSharedExecutor();
}
@Override
@SuppressWarnings({"unchecked", "rawtypes"})
public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
try {
final List<Invoker<T>> selected;
final int forks = getUrl().getParameter(FORKS_KEY, DEFAULT_FORKS);
final int timeout = getUrl().getParameter(TIMEOUT_KEY, DEFAULT_TIMEOUT);
if (forks <= 0 || forks >= invokers.size()) {
selected = invokers;
} else {
selected = new ArrayList<>(forks);
while (selected.size() < forks) {
Invoker<T> invoker = select(loadbalance, invocation, invokers, selected);
if (!selected.contains(invoker)) {
//Avoid add the same invoker several times.
selected.add(invoker);
}
}
}
RpcContext.getServiceContext().setInvokers((List) selected);
final AtomicInteger count = new AtomicInteger();
final BlockingQueue<Object> ref = new LinkedBlockingQueue<>(1);
// Invoke concurrently
selected.forEach(invoker -> {
URL consumerUrl = RpcContext.getServiceContext().getConsumerUrl();
CompletableFuture.<Object>supplyAsync(() -> {
if (ref.size() > 0) {
return null;
}
return invokeWithContextAsync(invoker, invocation, consumerUrl);
}, executor).whenComplete((v, t) -> {
if (t == null) {
// Call succeeded
ref.offer(v);
} else {
int value = count.incrementAndGet();
if (value >= selected.size()) {
// All calls failed
ref.offer(t);
}
}
});
});
try {
// Fetch the result. As soon as one call succeeds, ref holds a result, so poll returns right away.
Object ret = ref.poll(timeout, TimeUnit.MILLISECONDS);
if (ret instanceof Throwable) {
Throwable e = ret instanceof CompletionException ? ((CompletionException) ret).getCause() : (Throwable) ret;
throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : RpcException.UNKNOWN_EXCEPTION,
"Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. " +
"Last error is: " + e.getMessage(), e.getCause() != null ? e.getCause() : e);
}
return (Result) ret;
} catch (InterruptedException e) {
throw new RpcException("Failed to forking invoke provider " + selected + ", " +
"but no luck to perform the invocation. Last error is: " + e.getMessage(), e);
}
} finally {
// clear attachments which is binding to current thread.
RpcContext.getClientAttachment().clearAttachments();
}
}
}
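The "first success wins" mechanism, a bounded queue of capacity 1 fed by parallel calls, can be reproduced with the JDK alone. The sketch below is a simplified illustration: ForkingSketch is a hypothetical name, Suppliers stand in for providers, the common pool replaces Dubbo's shared executor, results are assumed to be non-null, and a timeout raises an exception instead of yielding a null result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration of forking; not Dubbo code. Provider results must be non-null.
final class ForkingSketch {
    @SuppressWarnings("unchecked")
    static <T> T invoke(List<Supplier<T>> providers, long timeoutMillis) {
        BlockingQueue<Object> ref = new LinkedBlockingQueue<>(1); // holds only the first outcome
        AtomicInteger failures = new AtomicInteger();
        for (Supplier<T> provider : providers) {
            CompletableFuture.supplyAsync(provider).whenComplete((value, error) -> {
                if (error == null) {
                    ref.offer(value); // a success; ignored if another outcome is already queued
                } else if (failures.incrementAndGet() >= providers.size()) {
                    ref.offer(error); // only the last failure is reported, once all have failed
                }
            });
        }
        Object first;
        try {
            first = ref.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a forked call", e);
        }
        if (first == null) {
            throw new IllegalStateException("No forked call finished within " + timeoutMillis + " ms");
        }
        if (first instanceof Throwable) {
            Throwable t = (Throwable) first;
            Throwable cause = (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
            throw new IllegalStateException("All forked calls failed. Last error is: " + cause.getMessage(), cause);
        }
        return (T) first;
    }

    public static void main(String[] args) {
        List<Supplier<String>> providers = new ArrayList<>();
        providers.add(() -> { throw new IllegalStateException("provider A is down"); });
        providers.add(() -> "hello from provider B");
        System.out.println(invoke(providers, 1000));
    }
}
```

The queue capacity of 1 is what makes later outcomes harmless: once the first result is queued, every further offer is simply rejected.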
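2.6 Broadcast Cluster (broadcast calls)
2.6.1 Overview
- Key point: when the consumer calls the interface, it invokes every service provider in turn; if any one of the calls fails, the whole call is considered failed.
- Applicable scenarios: notifying all providers to update local resources such as caches or logs.
- Usage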
<dubbo:reference ... cluster="broadcast" />
2.6.2 Source Code Analysis
The implementation class is BroadcastClusterInvoker, and its key method is doInvoke(), shown below.
public class BroadcastCluster extends AbstractCluster {
@Override
public <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
return new BroadcastClusterInvoker<>(directory);
}
}
// The doInvoke method of BroadcastClusterInvoker
public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
RpcContext.getServiceContext().setInvokers((List) invokers);
RpcException exception = null;
Result result = null;
URL url = getUrl();
int broadcastFailPercent = url.getParameter(BROADCAST_FAIL_PERCENT_KEY, MAX_BROADCAST_FAIL_PERCENT);
if (broadcastFailPercent < MIN_BROADCAST_FAIL_PERCENT || broadcastFailPercent > MAX_BROADCAST_FAIL_PERCENT) {
logger.info(String.format("The value corresponding to the broadcast.fail.percent parameter must be between 0 and 100. " +
"The current setting is %s, which is reset to 100.", broadcastFailPercent));
broadcastFailPercent = MAX_BROADCAST_FAIL_PERCENT;
}
int failThresholdIndex = invokers.size() * broadcastFailPercent / MAX_BROADCAST_FAIL_PERCENT;
int failIndex = 0;
// Invoke every provider in turn
for (int i = 0, invokersSize = invokers.size(); i < invokersSize; i++) {
Invoker<T> invoker = invokers.get(i);
RpcContext.RestoreContext restoreContext = new RpcContext.RestoreContext();
try {
RpcInvocation subInvocation = new RpcInvocation(invocation.getTargetServiceUniqueName(),
invocation.getServiceModel(), invocation.getMethodName(), invocation.getServiceName(), invocation.getProtocolServiceKey(),
invocation.getParameterTypes(), invocation.getArguments(), invocation.copyObjectAttachments(),
invocation.getInvoker(), Collections.synchronizedMap(new HashMap<>(invocation.getAttributes())),
invocation instanceof RpcInvocation ? ((RpcInvocation) invocation).getInvokeMode() : null);
result = invokeWithContext(invoker, subInvocation);
if (null != result && result.hasException()) {
Throwable resultException = result.getException();
if (null != resultException) {
exception = getRpcException(result.getException());
logger.warn(CLUSTER_ERROR_RESPONSE,"provider return error response","",exception.getMessage(),exception);
failIndex++;
if (failIndex == failThresholdIndex) {
break;
}
}
}
} catch (Throwable e) {
exception = getRpcException(e);
logger.warn(CLUSTER_ERROR_RESPONSE,"provider return error response","",exception.getMessage(),exception);
failIndex++;
if (failIndex == failThresholdIndex) {
break;
}
} finally {
if (i != invokersSize - 1) {
restoreContext.restore();
}
}
}
// If any call failed, throw the exception
if (exception != null) {
if (failIndex == failThresholdIndex) {
if (logger.isDebugEnabled()) {
logger.debug(
String.format("The number of BroadcastCluster call failures has reached the threshold %s", failThresholdIndex));
}
} else {
if (logger.isDebugEnabled()) {
logger.debug(String.format("The number of BroadcastCluster call failures has not reached the threshold %s, fail size is %s",
failThresholdIndex, failIndex));
}
}
throw exception;
}
return result;
}
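The effect of the fail-percent threshold in the loop above is easiest to see with concrete numbers. The sketch below reproduces only the threshold arithmetic and the early exit in plain Java. BroadcastSketch and its method names are hypothetical, Suppliers stand in for providers, and context handling is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of the broadcast loop and its failure threshold; not Dubbo code.
final class BroadcastSketch {
    // Number of failures at which the loop stops early; out-of-range percentages fall back to 100.
    static int failThreshold(int providerCount, int failPercent) {
        if (failPercent < 0 || failPercent > 100) {
            failPercent = 100;
        }
        return providerCount * failPercent / 100; // integer division, as in the listing above
    }

    // Calls every provider in turn; throws the last failure if any call failed.
    static <T> T invoke(List<Supplier<T>> providers, int failPercent) {
        int threshold = failThreshold(providers.size(), failPercent);
        int failed = 0;
        RuntimeException last = null;
        T result = null;
        for (Supplier<T> provider : providers) {
            try {
                result = provider.get();
            } catch (RuntimeException e) {
                last = e;
                failed++;
                if (failed == threshold) {
                    break; // enough providers have failed, stop broadcasting
                }
            }
        }
        if (last != null) {
            throw last;
        }
        return result; // result of the last provider called
    }

    public static void main(String[] args) {
        System.out.println(failThreshold(10, 20));
        List<Supplier<String>> providers = new ArrayList<>();
        providers.add(() -> "cache refreshed on A");
        providers.add(() -> "cache refreshed on B");
        System.out.println(invoke(providers, 100));
    }
}
```

With the default of 100 percent the threshold equals the provider count, so every provider is called before the exception is thrown; a lower percentage only makes the loop give up sooner.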
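3 Custom Cluster Fault-Tolerance Strategies
Writing a custom cluster fault-tolerance strategy is a good way to understand the built-in strategies better and to use them more flexibly.
3.1 Custom ClusterInvoker
Create a subclass of AbstractClusterInvoker and override the doInvoke method. For example: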
public class MyClusterInvoker<T> extends AbstractClusterInvoker<T> {
public MyClusterInvoker(Directory<T> directory) {
super(directory);
}
@Override
protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
try {
return invokeWithContext(invoker, invocation);
} catch (Throwable e) {
if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.
throw (RpcException) e;
}
// ...
throw new RpcException(e.getMessage(), e);
}
}
}
3.2 Custom Cluster Class
Create a subclass of AbstractCluster and override the doJoin method. For example:
public class MyCluster extends AbstractCluster {
@Override
protected <T> AbstractClusterInvoker<T> doJoin(Directory<T> directory) throws RpcException {
return new MyClusterInvoker<>(directory);
}
}
3.3 Configuration and Usage
Under the resources directory, add a META-INF/dubbo directory, and in it add a file named org.apache.dubbo.rpc.cluster.Cluster. Register the custom Cluster class in that file:
mycluster=org.apache.dubbo.rpc.cluster.support.MyCluster
Then specify the custom cluster fault-tolerance strategy when consuming the interface.
<dubbo:reference ... cluster="mycluster"/>
Note: the cluster fault-tolerance configuration that ships with Dubbo is shown below.