fix: update retry strategy for mutation calls to handle aborted transactions #1279
e2b7413 to 0e7eb52
0e7eb52 to 01a0196
google/cloud/spanner_v1/_helpers.py
Outdated
Retry a specified function with different logic based on the type of exception raised.

If the exception is of type google.api_core.exceptions.Aborted,
apply an alternate retry strategy that relies on the provided deadline value instead of a fixed number of retries.
For all other exceptions, retry the function up to a specified number of times.
This API is not really logical. I would suggest splitting this into two separate functions:
- Keep the current retry function as-is.
- Add a new function, _retry_on_aborted_exception, that handles that specific case.
In the current form, the API is quite 'magical' and hard to understand. For example, what does this function do if you call it with Aborted as one of the allowed exceptions? Will it use the specific logic for Aborted in all cases, or only if you have also supplied a deadline? What is the meaning of retry_count if you use it to retry Aborted errors? And so on.
If the exception is of type Aborted, it will activate the custom retry strategy. However, this only occurs if the user has listed this exception in the allowed_exceptions map and provided a deadline value. If either condition is missing, the exception will not be retried. For the batch API use case, we will specifically allow this exception to be retried.
I meant that the _retry function and the _retry_on_aborted_exception function should be completely separated. I don't really see any advantage in combining them, as the actual code that can be shared is minimal, and the API surface of this function is not logical.
E.g. if you have defined Aborted as a retryable exception but forget to supply a deadline, then all of a sudden it is not retryable. Also, deadline is only used if you add Aborted as a possible retryable error, and is otherwise ignored if you only supply other error codes. The same goes for retry_count; it is only used for non-Aborted errors. The fact that there are many combinations of input arguments that don't make any sense is an indication that the function itself should be split.
Thanks for the clarification. I've implemented the new retry logic as suggested, separating the _retry and _retry_on_aborted_exception functions. This makes the logic clearer, as combining them led to combinations of parameters that didn't make sense. Now, the retry logic for non-Aborted and Aborted exceptions is more distinct and easier to manage.
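A minimal sketch of what a dedicated, deadline-based retry for aborted transactions could look like. This is an illustration of the idea discussed in this thread, not the code in _helpers.py: the Aborted class is a local stand-in for google.api_core.exceptions.Aborted, the name retry_on_aborted and the base_delay/max_delay parameters are hypothetical, and treating deadline as an absolute time.monotonic() timestamp with exponential backoff is an assumption.

```python
import time


class Aborted(Exception):
    """Hypothetical stand-in for google.api_core.exceptions.Aborted."""


def retry_on_aborted(func, deadline, base_delay=0.01, max_delay=1.0):
    """Call func until it succeeds or the deadline passes.

    deadline is an absolute time.monotonic() timestamp (an assumption of
    this sketch). Only Aborted is retried; any other exception propagates.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return func()
        except Aborted:
            # Exponential backoff, capped at max_delay (illustrative policy).
            delay = min(base_delay * 2 ** (attempts - 1), max_delay)
            if time.monotonic() + delay > deadline:
                raise  # Sleeping would overshoot the deadline: give up.
            time.sleep(delay)


# Usage: a callable that is aborted twice, then succeeds.
calls = {"n": 0}


def flaky_commit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise Aborted("transaction aborted")
    return "committed"


result = retry_on_aborted(flaky_commit, deadline=time.monotonic() + 5)
```

Note that there is no retry count and no map of allowed exceptions here, so the signature has no parameter combinations that are silently ignored.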
google/cloud/spanner_v1/_helpers.py
Outdated
- while retries <= retry_count:
+ while True:
This breaks existing use cases that rely on this function to stop retrying after N retries.
I believe the check for retries < retry_count is already in place for generic retries. This ensures that the while loop terminates and an exception is raised once the retry count is exceeded. So, in my opinion, this logic should work correctly for generic retries as well.
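The argument above can be illustrated with a simplified sketch. It is not the library's _retry function: the name retry_with_count and its parameters are hypothetical. It only shows that a `while True` loop still stops after N retries as long as the retry-count check inside the exception handler is kept.

```python
import time


def retry_with_count(func, retry_count=5, delay=0.0):
    """Retry func up to retry_count times; re-raise once retries run out."""
    retries = 0
    while True:
        try:
            return func()
        except Exception:
            if retries >= retry_count:
                raise  # Retry budget exhausted: this is what ends the loop.
            retries += 1
            time.sleep(delay)


attempts = []


def always_fails():
    attempts.append(1)
    raise RuntimeError("boom")


try:
    retry_with_count(always_fails, retry_count=3)
except RuntimeError:
    total_calls = len(attempts)  # One initial call plus three retries.
```

If the inner check were dropped or bypassed for some exception type, the loop would no longer be bounded, which is the risk the reviewer is pointing at.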
google/cloud/spanner_v1/batch.py
Outdated
@@ -293,7 +305,9 @@ def group(self):
         self._mutation_groups.append(mutation_group)
         return MutationGroup(self._session, mutation_group.mutations)

-    def batch_write(self, request_options=None, exclude_txn_from_change_streams=False):
+    def batch_write(
batch_write is a bit different. I don't think we should include it in this PR, as it is a non-atomic, streaming operation that probably needs different error handling than 'just retry if it fails with an aborted error'.
Understood. In that case, we can bypass the retry behavior for this operation.
Please also remove the **kwargs addition from this PR. It would be confusing to add it here, as it is not relevant to the actual change in this PR.
google/cloud/spanner_v1/batch.py
Outdated
def no_op_handler(exc):
    # No-op (does nothing)
    pass
Can we remove this and just pass in a lambda where a no-op handler is needed (if it is needed at all after we separate the normal retry function from the aborted retry function)?
Yes, we can use a no-op lambda for this.
Removed the redundant code as this is no longer required with the new implementation.
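For reference, the lambda form suggested in this thread could look like the sketch below. The dispatch helper and the handlers map are hypothetical and exist only to give the handler somewhere to be called from; they are not part of the Spanner client.

```python
def dispatch(exc, handlers):
    """Hypothetical helper: run the handler registered for exc's class, if any."""
    handler = handlers.get(type(exc))
    if handler is not None:
        handler(exc)
        return True
    return False


# Named no-op handler, as in the diff under review.
def no_op_handler(exc):
    pass


# Equivalent inline form: a lambda that accepts the exception and ignores it.
handlers = {ValueError: lambda exc: None}

was_handled = dispatch(ValueError("ignored"), handlers)
was_unhandled = dispatch(KeyError("missing"), handlers)
```

The lambda keeps the no-op at the call site, so there is no extra module-level function to maintain.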
tests/unit/test_batch.py
Outdated
@@ -618,6 +651,36 @@ def __init__(self, database=None, name=TestBatch.SESSION_NAME):
     def session_id(self):
         return self.name

     def run_in_transaction(self, fnc):
Why is this here? It does not look like a test method.
Yes, it's not needed here. This change is carried over from my previous PR, where we call the run_in_transaction method instead.
Removed in subsequent commits.
google/cloud/spanner_v1/_helpers.py
Outdated
    raise

delay = _get_retry_delay(cause, attempts)
print(now, delay, deadline)
nit: remove
I extracted these methods to make them more generic, allowing other clients to reuse the logic instead of it being tightly coupled with the session object.
I meant: Remove the print(...) line. We should not print debug info in non-test code (and normally also not in test code).
Noted. Removed from subsequent commits.
googleapis#1281) Source-Link: googleapis/synthtool@de3def6 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:a1c5112b81d645f5bbc4d4bbc99d7dcb5089a52216c0e3fb1203a0eeabadd7d5 Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
8f6be72 to 687e0e3
…an up redundant code
687e0e3 to a6e25a3
google/cloud/spanner_v1/_helpers.py
Outdated
def _retry_on_aborted_exception(
    func,
    deadline,
    allowed_exceptions=None,
I think that we can simplify this further and just remove allowed_exceptions from this function. It should only retry aborted exceptions.
google/cloud/spanner_v1/_helpers.py
Outdated
except Exception as exc:
    try:
        retry_result = _retry(func=func, allowed_exceptions=allowed_exceptions)
        if retry_result is not None:
            return retry_result
        else:
            raise exc
    except Aborted:
        continue
I think that we should remove this part entirely. I know that the previous implementation of Batch retried this specific RST_STREAM error, but that was just a copy-paste from other methods. That error is not relevant for this type of operation.
google/cloud/spanner_v1/_helpers.py
Outdated
@@ -473,6 +505,7 @@ def _retry(
     Args:
         func: The function to be retried.
         retry_count: The maximum number of times to retry the function.
         deadline: This will be used in case of Aborted transactions.
nit: remove, this is not relevant anymore
Recognize GRAPH and pipe syntax queries as valid queries in dbapi.
…leapis#1273) * chore: Add Custom OpenTelemetry Exporter in for Service Metrics * Updated copyright dates to 2025 --------- Co-authored-by: rahul2393 <[email protected]>
…d_exception handler
Updating retry strategy for mutation calls to handle aborted transactions
This PR updates the retry strategy for mutation calls to handle aborted transactions more effectively. Previously, the retry mechanism didn't handle certain edge cases for aborted transactions, leading to failures. The updated strategy ensures retries in these scenarios to improve the robustness of the mutation operations.
Test Results:
All unit tests related to the retry logic and mutation operations pass successfully.
Fixes #1133