feat(manager) block synchronizer on cache readiness #5483
base: main
Conversation
Obtain the manager cache during synchronizer setup and store it in the synchronizer, and block synchronizer start on cache sync.

When the manager gains leadership, it starts populating the cache and starts runnables (like the synchronizer). It waits for the cache to sync (i.e. for it to read all entities present when it started) before starting certain tasks, but not generic runnables. The synchronizer indirectly accesses the cache contents: controllers populate the store from a filtered cache. Depending on whether sync has completed, the store may have an incomplete view of resources in the cluster and may delete Kong configuration for as-yet-uncached resources. This is bad and can lead to catastrophic wiping of proxy config associated with resources that are actually still present. Blocking synchronizer start on cache sync ensures this does not happen.
Remove the naive time-based synchronizer delay. This was a hack to avoid configuration deletion at startup without knowledge of the manager cache state. It is unnecessary now that we have manager cache state.
Although most of the behavior here works as expected, I'm seeing something I can't explain: after the sync timeout, the synchronizer still runs a sync (deleting config) during shutdown. I don't know exactly what's causing that, but it needs to not happen. My best guess is that the manager cache marks itself "finished" after timing out, allowing the sync to proceed. This can hopefully be addressed by finding the appropriate context and also checking whether it has expired.
Codecov Report

Attention: additional details and impacted files:

@@            Coverage Diff             @@
##             main    #5483      +/-   ##
==========================================
- Coverage    69.7%    51.3%    -18.5%
==========================================
  Files         176      176
  Lines       22726    22731        +5
==========================================
- Hits        15861    11671     -4190
- Misses       5928     9984     +4056
- Partials      937     1076      +139

☔ View full report in Codecov by Sentry.
There's some further weirdness where our manager apparently only considers some caches relevant. https://github.com/kubernetes-sigs/controller-runtime/blob/v0.17.0/pkg/cache/internal/informers.go#L244-L261 gets all the per-resource cache wait functions. In our case, for reasons I cannot determine, this only includes four kinds, and I'm not sure why we wait on these and not the others. Offhand I'm not sure what would cause IngressClass to show up but not Ingress, given their controller definitions: neither is a dynamic controller that waits for the type to appear before starting, so both should initialize immediately at start.

The upshot is that simply disabling the permission for Ingress (or any other resource not in that list) initially does nothing, and the manager cache will happily report that it has synced, though it will still eventually time out and exit everything. I'm unsure whether that would matter in practice for the issue that prompted this work (where there were presumably no permissions at all), but it could potentially cause issues where, for example, the IngressClass cache syncs before the Ingress one does, because there are more of the latter.
What this PR does / why we need it:
Removes the time-based startup delay in favor of obtaining the manager cache and calling its wait function during synchronizer start. In situations where the cache has not fully synced, the dataplane synchronizer will block until it has. If the cache never syncs within its grace period, a cache sync timeout will terminate the controller, preventing it from ever syncing config.
This better addresses situations where the controller may incorrectly delete configuration upon gaining leadership because it cannot yet see all resources on the cluster. Configuration previously added by another controller replica should remain intact.
The previous naive time-based delay addressed this for some cases, but relied on an arbitrary "good enough" wait time. AFAIK this time did work in practice during normal startup. However, it did not work if the cache sync got stuck, such as when the controller role was broken or when API resources were not defined.
Which issue this PR fixes:
Fix #2315
Special notes for your reviewer:
I could have sworn that we had no means of getting the manager cache, that it was reserved for the manager's internal use. I then saw a GetCache() call in one of the controllers and re-checked, and apparently it's (maybe always?) been accessible despite living in an internal.go.

Manual testing supports the notion that this works in problem scenarios. I deployed a test image to a normal install and confirmed successful start and config population. While watching the admin API for configuration changes, I deleted the Ingress permissions from the controller role, scaled to 0, and scaled back to 1. The new replica started and blocked without syncing configuration before eventually hitting a cache sync timeout and dying.
PR Readiness Checklist:
Complete these before marking the PR as ready to review:

- CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR