Garbage collector behavior on invalid ownerReferences for existing uids across namespaces and across kinds is non-deterministic #65200
The discussion in #63386 is unclear to me. While that issue is 1.11 milestone material, is the content of this issue a lower priority, deferred to the next milestone and perhaps cherry-picked later? Given the delay, I'd like to assume it is not critical-urgent, but rather than assume... @kubernetes/sig-api-machinery-bugs, can folks comment?
This is longstanding and not a 1.11 blocker.
I work with K8s 1.13 and observed that GC works when
If required, I can provide a simple test example.
It does not work reliably. If you set up a relationship like that and then restart the controller manager, the dependent will usually be deleted (because the uid-indexed owner cache is not populated yet).
I mean that in my case the dependent is deleted without restarting the controller manager. It is quite reproducible.
That is the expected/desired behavior, but it is racy, so sometimes the cross-namespace relationship does not result in deletion until a restart. Because it is racy, environmental factors like load and performance (of both the API server and the controller manager) can affect the results.
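To make the race concrete, here is a simplified Go model of the decision; it is not the actual GarbageCollector code. The types `ownerRef` and `object`, the function `shouldDeleteDependent`, and the "namespace/kind/name" lookup key are all hypothetical; one map stands in for the uid-indexed owner cache and the other for a live API lookup in the dependent's namespace. It shows why the same cross-namespace ownerReference leads to no deletion when the owner's uid is already cached, and to deletion when the cache is cold (for example, right after a restart).

```go
package main

import "fmt"

// ownerRef is a simplified, hypothetical stand-in for an ownerReference.
type ownerRef struct {
	Kind string
	Name string
	UID  string
}

// object is a hypothetical, minimal model of an API object.
type object struct {
	Namespace string
	Kind      string
	Name      string
	UID       string
}

// shouldDeleteDependent models the racy decision discussed above.
// uidCache stands in for the uid-indexed owner cache; live stands in for an
// API lookup keyed by "namespace/kind/name".
func shouldDeleteDependent(dep object, ref ownerRef, uidCache map[string]object, live map[string]object) bool {
	// Cache hit: the owner is treated as existing, whatever its namespace.
	if _, ok := uidCache[ref.UID]; ok {
		return false
	}
	// Cache miss: the owner is looked up in the dependent's own namespace.
	// An owner living in another namespace is not found there, so the
	// dependent is considered dangling and gets deleted.
	owner, ok := live[dep.Namespace+"/"+ref.Kind+"/"+ref.Name]
	return !ok || owner.UID != ref.UID
}

func main() {
	owner := object{Namespace: "a", Kind: "ConfigMap", Name: "parent", UID: "uid-1"}
	dep := object{Namespace: "b", Kind: "ConfigMap", Name: "child", UID: "uid-2"}
	ref := ownerRef{Kind: "ConfigMap", Name: "parent", UID: "uid-1"}
	live := map[string]object{"a/ConfigMap/parent": owner}

	warm := map[string]object{owner.UID: owner} // owner already observed
	cold := map[string]object{}                 // e.g. right after a restart

	fmt.Println("warm cache, delete dependent:", shouldDeleteDependent(dep, ref, warm, live))
	fmt.Println("cold cache, delete dependent:", shouldDeleteDependent(dep, ref, cold, live))
}
```

Running it prints `false` for the warm cache and `true` for the cold one: the outcome depends only on what happens to be cached, which is the non-determinism this issue describes.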
I see, thanks.
xref #72192
In GarbageCollector#attemptToDeleteItem:
I think we should continue with the other dependents when performing the unblocking (instead of breaking out of the loop).
I'm working on a fix for this for the 1.20 timeframe.
👋🏽 Hey @liggitt! I'm from the Bug Triage team. This issue has not been updated for a while, so I'd like to check on its status. The code freeze starts November 12th (about 3 weeks from now), and while there is still plenty of time, we want to ensure that each PR has a chance to be merged on time. As the issue is tagged for 1.20, is it still planned for this release?
Yes, work on this is in progress for 1.20.
Fixed in 1.20 by #92743.
A k8s bug (kubernetes/kubernetes#65200) may cause the k8s garbage collector to delete resources it should not when users manually copy an operator-managed secret to another namespace. To avoid that situation, this commit ensures no ownerRef is set on a subset of managed secrets that users are likely to copy around:
- the elastic user password secret
- elasticsearch public transport certs
- elasticsearch, kibana, enterprise search, apm server public http certs

Existing ownerReferences set by earlier ECK versions will be removed when reconciled. Since these secrets no longer have an ownerRef, they are not automatically deleted when the Elasticsearch resource is deleted. To work around that, the secret reconciliation logic adds an additional set of labels to the reconciled secrets that don't have an ownerRef specified. These labels reference the "soft" owner ("soft" as in handled through custom code rather than through the k8s built-in garbage collection logic). Once a controller receives a deletion event for the resource it manages, it automatically removes the soft-owned secrets. This is done on a best-effort basis. Secrets will remain orphaned if:
- the operator is not running when the owner is deleted
- an error happens while deleting the soft-owned secrets
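A minimal Go sketch of the soft-owner cleanup described in this commit message, assuming plain in-memory structs instead of a Kubernetes client. The label keys (`example.com/soft-owner-*`), the types, and `softOwnedBy` are hypothetical and are not ECK's actual names. One design choice worth noting: the sketch only matches secrets in the owner's own namespace, since a user's copy in another namespace would carry the same labels and presumably should not be deleted along with the owner.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical label keys; the real operator's label names may differ.
const (
	softOwnerKindLabel      = "example.com/soft-owner-kind"
	softOwnerNamespaceLabel = "example.com/soft-owner-namespace"
	softOwnerNameLabel      = "example.com/soft-owner-name"
)

// ownerID identifies the soft owner (e.g. an Elasticsearch resource).
type ownerID struct {
	Kind, Namespace, Name string
}

// secret is a minimal stand-in for a Kubernetes Secret.
type secret struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// softOwnerLabels returns the labels added to a secret that has no ownerRef.
func softOwnerLabels(o ownerID) map[string]string {
	return map[string]string{
		softOwnerKindLabel:      o.Kind,
		softOwnerNamespaceLabel: o.Namespace,
		softOwnerNameLabel:      o.Name,
	}
}

// softOwnedBy returns "namespace/name" of the secrets to delete when o is
// deleted: same namespace as the owner and all soft-owner labels matching.
func softOwnedBy(o ownerID, secrets []secret) []string {
	want := softOwnerLabels(o)
	var out []string
	for _, s := range secrets {
		if s.Namespace != o.Namespace {
			continue // copies in other namespaces are left alone
		}
		match := true
		for k, v := range want {
			if s.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, s.Namespace+"/"+s.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	es := ownerID{Kind: "Elasticsearch", Namespace: "prod", Name: "quickstart"}
	secrets := []secret{
		{Namespace: "prod", Name: "quickstart-es-elastic-user", Labels: softOwnerLabels(es)},
		{Namespace: "dev", Name: "quickstart-es-elastic-user", Labels: softOwnerLabels(es)}, // user copy
		{Namespace: "prod", Name: "unrelated"},
	}
	fmt.Println(softOwnedBy(es, secrets))
}
```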
…#3992) (#4008)
* Don't set an ownerRef on secrets users are susceptible to copy around
* Error out the reconciliation on garbage collection errors
* Improvements from PR review
* Label soft-owned secrets with the owner Kind
* Reuse Kind constants instead of hardcoded string
* Fix existing unit tests
* Garbage collect orphan soft-owned secrets on operator startup
* Fix linter warnings
* Add unit tests
* Cover soft owned secrets gc in e2e tests
* Fix linter warning (typo)
* Improvements from PR review
* Trigger reconciliations on soft-owned secrets events
* Fix linter stuff
If an object with a given uid is already in the garbage collector uid map, a child object created with an ownerReference pointing to that uid is not treated as having a non-existent parent, even if:
Original description follows:
Forked from #63386 (comment).
The garbage collector should work in three cases: (a) a cluster-scoped owner with namespaced dependents, (b) a namespaced owner with namespaced dependents in the same namespace, and (c) a cluster-scoped owner with cluster-scoped dependents.
The garbage collector should NOT work in the other two cases: (d) a namespaced owner with cluster-scoped dependents, and (e) an owner and dependents in different namespaces. Today, GC sometimes works in these two cases. It's a bug for two reasons:
We can add extra checks in the GC controller to make it never work for cases (d) and (e).
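The extra checks could boil down to a predicate like the following Go sketch, which encodes the valid and invalid owner/dependent combinations listed above. The `location` type and `validOwnerRelationship` are hypothetical and consider only scope and namespace; a real check would also need to resolve whether a referenced kind is namespaced or cluster-scoped.

```go
package main

import "fmt"

// scope says whether an object is cluster-scoped or namespaced.
type scope int

const (
	clusterScoped scope = iota
	namespaced
)

// location is a hypothetical summary of where an object lives.
type location struct {
	Scope     scope
	Namespace string // empty for cluster-scoped objects
}

// validOwnerRelationship reports whether owner may own dependent.
func validOwnerRelationship(owner, dependent location) bool {
	if owner.Scope == clusterScoped {
		// Cluster-scoped owners may own namespaced or cluster-scoped dependents.
		return true
	}
	if dependent.Scope == clusterScoped {
		// A namespaced owner must never own a cluster-scoped dependent.
		return false
	}
	// Both namespaced: valid only within the same namespace.
	return owner.Namespace == dependent.Namespace
}

func main() {
	cluster := location{Scope: clusterScoped}
	nsA := location{Scope: namespaced, Namespace: "a"}
	nsB := location{Scope: namespaced, Namespace: "b"}

	fmt.Println(validOwnerRelationship(cluster, nsA))     // true
	fmt.Println(validOwnerRelationship(nsA, nsA))         // true
	fmt.Println(validOwnerRelationship(cluster, cluster)) // true
	fmt.Println(validOwnerRelationship(nsA, cluster))     // false
	fmt.Println(validOwnerRelationship(nsA, nsB))         // false
}
```

With such a predicate, an invalid reference could be treated as pointing at a non-existent owner every time, instead of depending on what the uid cache happens to contain.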
cc @lavalamp @liggitt @deads2k