ALTS: GRPCLB LoadBalancer had an error #7643
Comments
1.34.0_error.txt is a debug log of gRPC. |
There is an earlier error in the log. Is that related, or is it also seen with 1.33.1?
|
@dapengzhang0 any idea what changes in grpclb could have caused this? |
@apolcyn |
+1, I think this seems likely related to the ID bound token. Can we check what target name (https://github.com/grpc/grpc/blob/e6e6be4b0b12562e38083181aee27438d8258b5a/src/proto/grpc/gcp/handshaker.proto#L104) is passed to the ALTS handshake service? |
I haven't seen this issue with gRPC Java prior to 1.34. I just checked that gRPC 1.33.1 doesn't have this issue. So commits after v1.33.1 most likely caused this. |
I don't think the ID bound token rollout in grpclb touches gRPC code. It is all backend enablement. Let me check with Jianing, who is responsible for the ID bound token. |
Although ID bound token rollout itself doesn't touch grpc code, if I'm correct it does newly require the client to pass ALTS target names and RPC authority headers on the LB channel properly, which might have changed at the client |
Confirmed with Jianing: the rollout only changes the ESF config, and the fallback code is still there.
Based on this, it looks like some changes in gRPC code caused the failure. |
This should have been a release blocker for v1.34 since it seems to be a serious regression. gRPCLB doesn't provide any call credentials to the server. If the server now requires it, that would definitely break gRPC. But why would older versions succeed? That's the important question. Looking at the logs, it appears that the gRPCLB request includes a JWT. That's surprising. I wonder if the swap to ChannelCredentials somehow added it. |
This is a regression caused by |
@veblush, can you try testing with this patch and see if it resolves the issue?
diff --git a/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java b/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java
index 2edb32fa2..97088059a 100644
--- a/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java
+++ b/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java
@@ -153,6 +153,7 @@ final class ManagedChannelImpl extends ManagedChannel implements
   private final NameResolver.Args nameResolverArgs;
   private final AutoConfiguredLoadBalancerFactory loadBalancerFactory;
   private final ClientTransportFactory transportFactory;
+  private final ClientTransportFactory oobTransportFactory;
   private final RestrictedScheduledExecutor scheduledExecutor;
   private final Executor executor;
   private final ObjectPool<? extends Executor> executorPool;
@@ -593,6 +594,8 @@ final class ManagedChannelImpl extends ManagedChannel implements
     this.executor = checkNotNull(executorPool.getObject(), "executor");
     this.transportFactory = new CallCredentialsApplyingTransportFactory(
         clientTransportFactory, builder.callCredentials, this.executor);
+    this.oobTransportFactory = new CallCredentialsApplyingTransportFactory(
+        clientTransportFactory, null, this.executor);
     this.scheduledExecutor =
         new RestrictedScheduledExecutor(transportFactory.getScheduledExecutorService());
     maxTraceEvents = builder.maxTraceEvents;
@@ -1517,7 +1520,7 @@ final class ManagedChannelImpl extends ManagedChannel implements
       final InternalSubchannel internalSubchannel = new InternalSubchannel(
           Collections.singletonList(addressGroup),
-          authority, userAgent, backoffPolicyProvider, transportFactory,
+          authority, userAgent, backoffPolicyProvider, oobTransportFactory,
           transportFactory.getScheduledExecutorService(), stopwatchSupplier, syncContext,
           // All callback methods are run from syncContext
           new ManagedOobChannelCallback(),
|
Created b/175066772 to follow up on this internally. |
I spoke with @veblush, and the service was not DP-only, so fallback to CFE should have occurred. I think that if RPCs had continued, the fallback might have taken place. The initial RPCs seem to fail because the picker fails immediately when the LB RPC fails and there are no backends. Changing the logic to try fallback backends first should be possible, but it also seems hard to define and implement all the edge cases. @apolcyn, does it seem appropriate that some RPCs failed, assuming later ones would have succeeded? It seems like we may need to change this error behavior. |
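To make the two behaviors discussed in the comment above concrete, here is a minimal toy sketch. It is not grpclb code: the class, the `Outcome` enum, and both methods are hypothetical, and they model only the moment right after the LB RPC has failed. `failFast` mirrors the behavior described (no balancer-provided backends means the pick fails), while `fallbackFirst` mirrors the proposed alternative of trying fallback backends before failing.

```java
import java.util.List;

/** Toy model of the pick decision after an LB RPC failure. All names are hypothetical. */
public class FallbackPickSketch {
  enum Outcome { USE_BACKEND, USE_FALLBACK, FAIL }

  /** Described behavior: with no balancer-provided backends, the pick fails at once. */
  static Outcome failFast(List<String> lbBackends, List<String> fallbackBackends) {
    return lbBackends.isEmpty() ? Outcome.FAIL : Outcome.USE_BACKEND;
  }

  /** Proposed alternative: try fallback backends before failing the pick. */
  static Outcome fallbackFirst(List<String> lbBackends, List<String> fallbackBackends) {
    if (!lbBackends.isEmpty()) {
      return Outcome.USE_BACKEND;
    }
    return fallbackBackends.isEmpty() ? Outcome.FAIL : Outcome.USE_FALLBACK;
  }

  public static void main(String[] args) {
    List<String> none = List.of();
    List<String> fallback = List.of("fallback-backend"); // placeholder address
    System.out.println("failFast:      " + failFast(none, fallback));
    System.out.println("fallbackFirst: " + fallbackFirst(none, fallback));
  }
}
```

The sketch deliberately ignores timers, buffering, and reconnects, which are the edge cases the comment says are hard to pin down.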
This is interesting and I cannot reproduce this anymore on the same machine. Luckily I could run the exact same 1.34.0-SNAPSHOT binary which was used for the error case above and it somehow ran successfully over DirectPath. |
1.34.0.log: Success log from the same binary that failed previously
This might be caused by a combination of server and client issues. Since I can no longer reproduce this with the same client, whatever made the client fail to connect to the GCS backend via DirectPath appears to have been addressed. |
@veblush, the v1.34.1 log still includes the authorization header in the gRPCLB request. So either there was a mistake when testing or my patch doesn't address the problem. |
The addition of CompositeChannelCredentials allowed CallCredentials to be passed to the ManagedChannel itself. But the implementation was buggy and used the call creds for out-of-band channels as well, which is inappropriate since they have a different authority. This also fixes a bug where resolving OOB channels would have CallCreds duplicated; that wasn't noticed or important because we don't use CallCreds in OOB channels. Fixes grpc#7643
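The idea behind the fix described above can be sketched with a small toy model. This is not gRPC code: every type below is a hypothetical stand-in, and the token string is a placeholder. It only illustrates that the main channel's factory is wrapped with the channel-level call credentials, while the out-of-band factory is wrapped with `null` so that no credentials are attached for a different authority.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-factory call credentials. Every type here is hypothetical. */
public class OobCallCredsSketch {
  /** Stand-in for a transport factory; returns the headers a new stream would carry. */
  interface TransportFactory {
    List<String> headersForNewStream();
  }

  /** Stand-in for the raw transport, which adds no credentials of its own. */
  static final class PlainTransportFactory implements TransportFactory {
    @Override
    public List<String> headersForNewStream() {
      return new ArrayList<>();
    }
  }

  /** Stand-in for a wrapper that applies channel-level call credentials, if any. */
  static final class CredsApplyingFactory implements TransportFactory {
    private final TransportFactory delegate;
    private final String channelCallCreds; // null means "no channel-level credentials"

    CredsApplyingFactory(TransportFactory delegate, String channelCallCreds) {
      this.delegate = delegate;
      this.channelCallCreds = channelCallCreds;
    }

    @Override
    public List<String> headersForNewStream() {
      List<String> headers = delegate.headersForNewStream();
      if (channelCallCreds != null) {
        headers.add("authorization: " + channelCallCreds);
      }
      return headers;
    }
  }

  public static void main(String[] args) {
    TransportFactory raw = new PlainTransportFactory();
    // Main channel: the channel-level call credentials are applied.
    TransportFactory mainFactory = new CredsApplyingFactory(raw, "Bearer <token>");
    // Out-of-band channel: same raw transport, but no call credentials.
    TransportFactory oobFactory = new CredsApplyingFactory(raw, null);
    System.out.println("main: " + mainFactory.headersForNewStream());
    System.out.println("oob:  " + oobFactory.headersForNewStream());
  }
}
```

In this model the out-of-band factory shares the raw transport with the main channel but never sees the channel's credentials.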
When running the GCS benchmark over DirectPath with gRPC 1.34.0-SNAPSHOT, it failed with the error below. It had been working with gRPC 1.33.1.
What version of gRPC-Java are you using?
gRPC 1.34.0-SNAPSHOT
What is your environment?
Linux Debian/10
What did you expect to see?
Connecting to the backend using ALTS.
What did you see instead?
GRPCLB Error
Steps to reproduce the bug