fix: update retry strategy for mutation calls to handle aborted transactions #1279

aakashanandg · 2025-01-02T12:59:11Z

Updating retry strategy for mutation calls to handle aborted transactions

This PR updates the retry strategy for mutation calls to handle aborted transactions more effectively. Previously, the retry mechanism didn't handle certain edge cases for aborted transactions, leading to failures. The updated strategy ensures retries in these scenarios to improve the robustness of the mutation operations.

Test Results:

All unit tests related to the retry logic and mutation operations pass successfully.

Fixes #1133

olavloite · 2025-01-06T10:35:35Z

google/cloud/spanner_v1/_helpers.py

+    Retry a specified function with different logic based on the type of exception raised.
+
+    If the exception is of type google.api_core.exceptions.Aborted,
+    apply an alternate retry strategy that relies on the provided deadline value instead of a fixed number of retries.
+    For all other exceptions, retry the function up to a specified number of times.


This API is not really logical. I would suggest splitting this into two separate functions:

Keep the current retry function as-is.

Add a new function _retry_on_aborted_exception that handles that specific case.

In its current form, the API is quite 'magical' and hard to understand. For example, what is the behavior of this function if you call it with Aborted as one of the allowed exceptions? Will it use the specific logic for Aborted in all cases, or only if you have also supplied a deadline? What is the meaning of retry_count if you use it to retry Aborted errors? And so on.

If the exception is of type Aborted, it will activate the custom retry strategy. However, this will only occur if the user has listed this exception in the allowed_exceptions map and provided a deadline value. If either condition is missing, the exception will not be retried. For the batch API use case, we will specifically allow this exception to be retried.

I meant that the _retry function and the _retry_on_aborted_exception should be completely separated. I don't really see any advantage of combining them, as the actual code that can be shared is minimal, and the API surface of this function is not logical.

E.g. if you have defined Aborted as a retryable exception but forget to supply a deadline, then all of a sudden it is not retryable. Also, deadline is only used if you add Aborted as a possible retryable error, and is ignored if you only supply other error codes. The same goes for retry_count; it is only used for non-Aborted errors. The fact that many combinations of input arguments don't make any sense is an indication that the function itself should be split.

Thanks for the clarification. I've implemented the new retry logic as suggested, separating the _retry and _retry_on_aborted_exception functions. This ensures clearer logic, as combining them led to confusing combinations of parameters that didn't make sense. Now, the retry logic for non-Aborted and Aborted exceptions is more distinct and easier to manage.
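For readers following the thread, here is a minimal sketch of what a standalone, deadline-based retry for Aborted errors could look like. It is not the code merged in this PR: the name `retry_on_aborted`, the `base_delay` and `max_delay` parameters, and the backoff formula are assumptions for illustration, and the local `Aborted` class is only a stand-in for `google.api_core.exceptions.Aborted` so the snippet runs without dependencies.

```python
import random
import time


class Aborted(Exception):
    """Stand-in for google.api_core.exceptions.Aborted (assumption, keeps the sketch dependency-free)."""


def retry_on_aborted(func, deadline, base_delay=0.01, max_delay=1.0):
    """Call func until it succeeds, retrying only Aborted for up to `deadline` seconds.

    Hypothetical helper: there is no retry count here, only a time budget.
    Any exception other than Aborted propagates immediately.
    """
    start = time.monotonic()
    attempts = 0
    while True:
        try:
            return func()
        except Aborted:
            attempts += 1
            # Exponential backoff with jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempts) * random.random()
            if time.monotonic() - start + delay > deadline:
                # Sleeping would exceed the time budget: surface the last Aborted.
                raise
            time.sleep(delay)
```

With this shape there is no parameter combination that silently disables retries: `deadline` is required, and `retry_count` and `allowed_exceptions` do not exist on this function at all.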

olavloite · 2025-01-06T10:35:56Z

google/cloud/spanner_v1/_helpers.py

-    while retries <= retry_count:
+    while True:


This breaks existing use cases that rely on this function to stop retrying after N retries.

I believe the check for retries < retry_count is already in place for generic retries. This ensures that the while loop terminates early and an exception is raised once the retry count is exceeded. So, in my opinion, this logic should work correctly for generic retries as well.
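Whether a `while True` loop still stops after N retries depends entirely on the check inside the loop body. The sketch below illustrates that point under assumptions: the function name `retry`, its defaults, and the doubling delay are hypothetical and are not copied from `_helpers.py`.

```python
import time


def retry(func, retry_count=5, delay=0.0, allowed_exceptions=None):
    """Call func, retrying at most `retry_count` times (hypothetical sketch).

    allowed_exceptions is a collection of exception types that may be
    retried; None means every exception is retryable.
    """
    retries = 0
    while True:  # unbounded here; the bound is enforced in the except block
        try:
            return func()
        except Exception as exc:
            retryable = allowed_exceptions is None or type(exc) in allowed_exceptions
            if not retryable or retries >= retry_count:
                # Not retryable, or the retry budget is used up: re-raise.
                raise
            time.sleep(delay)
            delay *= 2
            retries += 1
```

In this form `func` runs at most `retry_count + 1` times. If the inner check were ever removed or bypassed for some exception type, nothing else would stop the loop, which is the risk of dropping the bound from the `while` condition.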

olavloite · 2025-01-06T10:41:25Z

google/cloud/spanner_v1/batch.py

@@ -293,7 +305,9 @@ def group(self):
        self._mutation_groups.append(mutation_group)
        return MutationGroup(self._session, mutation_group.mutations)

-    def batch_write(self, request_options=None, exclude_txn_from_change_streams=False):
+    def batch_write(


batch_write is a bit different. I don't think we should include it in this PR, as it is a non-atomic, streaming operation that probably needs different error handling than 'just retry if it fails with an aborted error'.

Understood. In that case, we can bypass the retry behavior for this operation.

Please also remove the **kwargs addition again from this PR. It would just be confusing if that is added in this PR, when it is not relevant to the actual change in this PR.

olavloite · 2025-01-06T10:42:02Z

google/cloud/spanner_v1/batch.py

+def no_op_handler(exc):
+    # No-op (does nothing)
+    pass


Can we remove this and just pass in a lambda where a no-op handler is needed (if it is needed at all after we separate the normal retry function from the aborted retry function)?

Yes, we can use a no-op lambda for this.

Removed the redundant code as this is no longer required with the new implementation.
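As a small illustration of the suggestion, a no-op handler can be passed inline as `lambda exc: None`, or avoided altogether by making the handler optional. The helper `call_and_notify` and its `on_error` parameter below are hypothetical names used only for this sketch.

```python
def call_and_notify(func, on_error=None):
    """Call func; if it raises, pass the exception to on_error (when given) and re-raise."""
    try:
        return func()
    except Exception as exc:
        if on_error is not None:
            on_error(exc)
        raise


def fail():
    raise RuntimeError("boom")


seen = []
# Three ways to supply the handler: none at all, an inline no-op, a real one.
for handler in (None, lambda exc: None, seen.append):
    try:
        call_and_notify(fail, on_error=handler)
    except RuntimeError:
        pass
```

Only the last handler records anything; the first two behave identically, so a module-level `no_op_handler` function adds nothing over either of them.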

olavloite · 2025-01-06T10:42:45Z

tests/unit/test_batch.py

@@ -618,6 +651,36 @@ def __init__(self, database=None, name=TestBatch.SESSION_NAME):
    def session_id(self):
        return self.name

+    def run_in_transaction(self, fnc):


Why is this here? It does not look like a test method.

Yes, it's not needed here. This change is carried forward from my previous PR where we call the run_in_transaction method instead.

Removed in subsequent commits.

olavloite · 2025-01-06T10:43:57Z

google/cloud/spanner_v1/_helpers.py

+        raise
+
+    delay = _get_retry_delay(cause, attempts)
+    print(now, delay, deadline)


nit: remove

I extracted these methods to make them more generic, allowing other clients to reuse the logic instead of it being tightly coupled with the session object.

I meant: Remove the print(...) line. We should not print debug info in non-test code (and normally also not in test code).

Noted. Removed from subsequent commits.

olavloite · 2025-01-08T12:21:16Z

google/cloud/spanner_v1/_helpers.py

+def _retry_on_aborted_exception(
+    func,
+    deadline,
+    allowed_exceptions=None,


I think that we can simplify this further and just remove allowed_exceptions from this function. It should only retry aborted exceptions.

olavloite · 2025-01-08T12:23:35Z

google/cloud/spanner_v1/_helpers.py

+        except Exception as exc:
+            try:
+                retry_result = _retry(func=func, allowed_exceptions=allowed_exceptions)
+                if retry_result is not None:
+                    return retry_result
+                else:
+                    raise exc
+            except Aborted:
+                continue


I think that we should remove this part entirely. I know that the previous implementation of Batch retried this specific RST_STREAM error, but that was just a copy-paste from other methods. That error is not relevant for this type of operation.

olavloite · 2025-01-08T12:23:52Z

google/cloud/spanner_v1/_helpers.py

@@ -473,6 +505,7 @@ def _retry(
    Args:
        func: The function to be retried.
        retry_count: The maximum number of times to retry the function.
+        deadline: This will be used in case of Aborted transactions.


nit: remove, this is not relevant anymore

olavloite · 2025-01-08T12:25:18Z

google/cloud/spanner_v1/batch.py

@@ -293,7 +305,9 @@ def group(self):
        self._mutation_groups.append(mutation_group)
        return MutationGroup(self._session, mutation_group.mutations)

-    def batch_write(self, request_options=None, exclude_txn_from_change_streams=False):
+    def batch_write(


Please also remove the **kwargs addition again from this PR. It would just be confusing if that is added in this PR, when it is not relevant to the actual change in this PR.

aakashanandg requested review from a team as code owners January 2, 2025 12:59

product-auto-label bot added the size: m Pull request size is medium. label Jan 2, 2025

aakashanandg requested a review from olavloite January 2, 2025 12:59

blunderbuss-gcf bot assigned pratickchokhani Jan 2, 2025

product-auto-label bot added the api: spanner Issues related to the googleapis/python-spanner API. label Jan 2, 2025

This was referenced Jan 2, 2025

fix: update retry strategy for mutation calls to handle aborted transactions #1270

Closed

database.batch does not retry aborted transactions #1133

Closed

aakashanandg force-pushed the batch-retry-strategy branch 5 times, most recently from e2b7413 to 0e7eb52 Compare January 2, 2025 13:53

product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jan 2, 2025

fix: update retry strategy for mutation calls to handle aborted transactions

01a0196

aakashanandg force-pushed the batch-retry-strategy branch from 0e7eb52 to 01a0196 Compare January 2, 2025 18:33

test: add mock server test for aborted batch

2a9b805

olavloite reviewed Jan 6, 2025

aakashanandg force-pushed the batch-retry-strategy branch 2 times, most recently from 8f6be72 to 687e0e3 Compare January 6, 2025 18:38

fix:Refactoring existing retry logic for aborted transactions and clean up redundant code

a6e25a3

aakashanandg force-pushed the batch-retry-strategy branch from 687e0e3 to a6e25a3 Compare January 6, 2025 18:39

aakashanandg added 2 commits January 7, 2025 00:17

Merge branch 'main' into batch-retry-strategy

198c7df

fix: fixed linting errors

1032b8b

olavloite reviewed Jan 8, 2025

olavloite and others added 3 commits January 8, 2025 18:52

feat: support GRAPH and pipe syntax in dbapi (googleapis#1285)

d4e7d9c

Recognize GRAPH and pipe syntax queries as valid queries in dbapi.

chore: Add Custom OpenTelemetry Exporter in for Service Metrics (googleapis#1273)

fa8ae71

* chore: Add Custom OpenTelemetry Exporter in for Service Metrics * Updated copyright dates to 2025 --------- Co-authored-by: rahul2393 <[email protected]>

fix: removing retry logic for RST_STREAM errors from _retry_on_aborted_exception handler

d5c4975

product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Jan 8, 2025

Merge branch 'main' into batch-retry-strategy

7f3088c

product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Jan 8, 2025

olavloite approved these changes Jan 9, 2025

olavloite merged commit 0887eb4 into googleapis:main Jan 9, 2025
10 of 12 checks passed

release-please bot mentioned this pull request Jan 9, 2025

chore(main): release 3.52.0 #1258

Merged
