
fix: update retry strategy for mutation calls to handle aborted transactions #1279


Merged
merged 10 commits into from
Jan 9, 2025

Conversation

aakashanandg
Contributor

@aakashanandg aakashanandg commented Jan 2, 2025

Updating retry strategy for mutation calls to handle aborted transactions

This PR updates the retry strategy for mutation calls to handle aborted transactions more effectively. Previously, the retry mechanism didn't handle certain edge cases for aborted transactions, leading to failures. The updated strategy ensures retries in these scenarios to improve the robustness of the mutation operations.

Test Results:

All unit tests related to the retry logic and mutation operations pass successfully.

Fixes #1133

@aakashanandg aakashanandg requested review from a team as code owners January 2, 2025 12:59
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Jan 2, 2025
@aakashanandg aakashanandg requested a review from olavloite January 2, 2025 12:59
@product-auto-label product-auto-label bot added the api: spanner Issues related to the googleapis/python-spanner API. label Jan 2, 2025
@aakashanandg aakashanandg force-pushed the batch-retry-strategy branch 5 times, most recently from e2b7413 to 0e7eb52 Compare January 2, 2025 13:53
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jan 2, 2025
Comment on lines 476 to 480
Retry a specified function with different logic based on the type of exception raised.

If the exception is of type google.api_core.exceptions.Aborted,
apply an alternate retry strategy that relies on the provided deadline value instead of a fixed number of retries.
For all other exceptions, retry the function up to a specified number of times.
Contributor

This API is not really logical. I would suggest splitting this into two separate functions:

  1. Keep the current retry function as-is.
  2. Add a new function _retry_on_aborted_exception that handles that specific case.

In the current form, the API is quite 'magical' and hard to understand. For example, how does this function behave if you call it with Aborted as one of the allowed exceptions? Will it use the specific logic for Aborted in all cases, or only if you have also supplied a deadline? What is the meaning of retry_count if you use it to retry Aborted errors? etc...

Contributor Author

@aakashanandg aakashanandg Jan 6, 2025

If the exception is of type Aborted, it will activate the custom retry strategy. However, this will only occur if the user has listed this exception in the allowed_exceptions map and provided a deadline value. If either condition is missing, the exception will not be retried. For the batch API use case, we will specifically allow this exception to be retried.

Contributor

I meant that the _retry function and the _retry_on_aborted_exception should be completely separated. I don't really see any advantage of combining them, as the actual code that can be shared is minimal, and the API surface of this function is not logical.

E.g. if you have defined Aborted as a retryable exception but forget to supply a deadline, then all of a sudden it is not retryable. Also, deadline is only used if you add Aborted as a possible retryable error, and is ignored if you only supply other error codes. The same goes for retry_count; it is only used for non-Aborted errors. The fact that there are many combinations of input arguments that don't make any sense is an indication that the function itself should be split.

Contributor Author

Thanks for the clarification. I've implemented the new retry logic as suggested, separating the _retry and _retry_on_aborted_exception functions. This ensures clearer logic, as combining them led to confusing combinations of parameters that didn't make sense. Now, the retry logic for non-Aborted and Aborted exceptions is more distinct and easier to manage.
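
For readers following the thread, here is a minimal sketch of what a dedicated, deadline-based helper for aborted transactions might look like once it is separated from the count-based retry. It is not the code merged in this PR: the name retry_on_aborted, the backoff parameters, and the injectable sleep argument are assumptions made for illustration, and the Aborted class is a local stand-in for google.api_core.exceptions.Aborted so the sketch runs without dependencies.

```python
import time


class Aborted(Exception):
    """Local stand-in for google.api_core.exceptions.Aborted (assumption)."""


def retry_on_aborted(func, deadline, initial_delay=0.01, max_delay=1.0,
                     sleep=time.sleep):
    """Call func until it succeeds or `deadline` seconds have elapsed.

    Only Aborted is retried; any other exception propagates immediately.
    There is no retry_count here: the deadline is the only budget.
    """
    start = time.monotonic()
    delay = initial_delay
    while True:
        try:
            return func()
        except Aborted:
            # Give up if waiting once more would take us past the deadline.
            if time.monotonic() - start + delay > deadline:
                raise
            sleep(delay)
            # Hypothetical exponential backoff, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

With this shape, every argument has a single meaning: there is no combination in which deadline is silently ignored or Aborted is silently not retried, which is the concern raised in the review.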

Comment on lines 484 to 494
-    while retries <= retry_count:
+    while True:
Contributor

This breaks existing use cases that rely on this function to stop retrying after N retries.

Contributor Author

I believe the check for retries < retry_count is already in place for generic retries. This ensures that the while loop terminates early and an exception is raised once the retry count is exceeded. So, in my opinion, this logic should work correctly for generic retries as well.
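
The point under discussion can be illustrated with a small sketch: a `while True` loop still stops after N retries as long as the count check inside the loop body re-raises. This is a hypothetical helper written for this illustration (the name bounded_retry and the retryable parameter are assumptions), not the function from the PR.

```python
def bounded_retry(func, retry_count=5, retryable=(Exception,)):
    """Call func, retrying at most `retry_count` times on `retryable` errors."""
    retries = 0
    while True:
        try:
            return func()
        except retryable:
            # The bound lives inside the loop: once the retry budget is
            # spent, re-raise the current exception and leave the loop.
            if retries >= retry_count:
                raise
            retries += 1
```

If that inner check were skipped on some code path, the loop would no longer be bounded, which is the regression the reviewer is warning about.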

@@ -293,7 +305,9 @@ def group(self):
         self._mutation_groups.append(mutation_group)
         return MutationGroup(self._session, mutation_group.mutations)

-    def batch_write(self, request_options=None, exclude_txn_from_change_streams=False):
+    def batch_write(
Contributor

batch_write is a bit different. I don't think we should include it in this PR, as it is a non-atomic streaming operation that probably needs different error handling than 'just retry if it fails with an aborted error'.

Contributor Author

Understood. In that case, we can bypass the retry behavior for this operation.

Contributor

Please also remove the **kwargs addition again from this PR. It would just be confusing if that is added in this PR, when it is not relevant to the actual change in this PR.

Comment on lines 398 to 400
def no_op_handler(exc):
    # No-op (does nothing)
    pass
Contributor

Can we remove this and just pass in a lambda where a no-op handler is needed (if it is needed at all after we separate the normal retry function from the aborted retry function)?

Contributor Author

Yes, we can use a no-op lambda for this.
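
As a sketch of the suggestion, a no-op lambda can be passed inline wherever a handler is required, so no named no_op_handler needs to exist. The helper below is hypothetical (retry_with_handlers and its parameters are assumptions made for this example); it only assumes, as described earlier in the thread, that allowed_exceptions is a map from exception type to a handler.

```python
def retry_with_handlers(func, allowed_exceptions, retry_count=2):
    """Retry func; allowed_exceptions maps exception type -> handler.

    The matching handler is called with the exception before each retry.
    """
    for attempt in range(retry_count + 1):
        try:
            return func()
        except tuple(allowed_exceptions) as exc:
            if attempt == retry_count:
                raise
            # Find the handler registered for this exception's type.
            handler = next(
                h for t, h in allowed_exceptions.items() if isinstance(exc, t)
            )
            handler(exc)


# Usage: no named no-op function, just an inline lambda.
# retry_with_handlers(do_work, {ConnectionError: lambda exc: None})
```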

Contributor Author

Removed the redundant code as this is no longer required with the new implementation.

@@ -618,6 +651,36 @@ def __init__(self, database=None, name=TestBatch.SESSION_NAME):
    def session_id(self):
        return self.name

    def run_in_transaction(self, fnc):
Contributor

Why is this here? It does not look like a test method.

Contributor Author

Yes, it's not needed here. This change is carried forward from my previous PR where we call the run_in_transaction method instead.

Contributor Author

Removed in subsequent commits.

raise

delay = _get_retry_delay(cause, attempts)
print(now, delay, deadline)
Contributor

nit: remove

Contributor Author

I extracted these methods to make them more generic, allowing other clients to reuse the logic instead of it being tightly coupled with the session object.

Contributor

I meant: Remove the print(...) line. We should not print debug info in non-test code (and normally also not in test code).

Contributor Author

Noted. Removed from subsequent commits.

googleapis#1281)

Source-Link: googleapis/synthtool@de3def6
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:a1c5112b81d645f5bbc4d4bbc99d7dcb5089a52216c0e3fb1203a0eeabadd7d5

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
@aakashanandg aakashanandg force-pushed the batch-retry-strategy branch 2 times, most recently from 8f6be72 to 687e0e3 Compare January 6, 2025 18:38
def _retry_on_aborted_exception(
    func,
    deadline,
    allowed_exceptions=None,
Contributor

I think that we can simplify this further and just remove allowed_exceptions from this function. It should only retry aborted exceptions.

Comment on lines 484 to 492
except Exception as exc:
    try:
        retry_result = _retry(func=func, allowed_exceptions=allowed_exceptions)
        if retry_result is not None:
            return retry_result
        else:
            raise exc
    except Aborted:
        continue
Contributor

I think that we should remove this part entirely. I know that the previous implementation of Batch retried this specific RST_STREAM error, but that was just a copy-paste from other methods. That error is not relevant for this type of operation.

@@ -473,6 +505,7 @@ def _retry(
Args:
    func: The function to be retried.
    retry_count: The maximum number of times to retry the function.
    deadline: This will be used in case of Aborted transactions.
Contributor

nit: remove, this is not relevant anymore

olavloite and others added 3 commits January 8, 2025 18:52
Recognize GRAPH and pipe syntax queries as valid queries
in dbapi.
…leapis#1273)

* chore: Add Custom OpenTelemetry Exporter in for Service Metrics

* Updated copyright dates to 2025

---------

Co-authored-by: rahul2393 <[email protected]>
@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Jan 8, 2025
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Jan 8, 2025
@olavloite olavloite merged commit 0887eb4 into googleapis:main Jan 9, 2025
10 of 12 checks passed
Labels
api: spanner Issues related to the googleapis/python-spanner API. size: l Pull request size is large.
Development

Successfully merging this pull request may close these issues.

database.batch does not retry aborted transactions
4 participants