storage/cacher: dispatchEvents use progressRequester #124754

Open
wants to merge 1 commit into master from upstream-cacher-dispatchevents-progress-requester
Conversation

p0lyn0mial
Contributor

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 8, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 8, 2024
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 8, 2024
// TODO(p0lyn0mial): adapt the following logic once
// https://github.com/kubernetes/kubernetes/pull/124612 merges
progressRequesterCleanUpOnceFn := func() { /*no-op*/ }
if utilfeature.DefaultFeatureGate.Enabled(features.ConsistentListFromCache) && utilfeature.DefaultFeatureGate.Enabled(features.WatchList) {
Contributor Author

maybe we should actually gate the progressRequester only when the etcd version "matches"?

Member

Yes - we should gate on DefaultFeatureSupportsChecker once that merges.
The problem there is that it may not yet be initialized... and we need to handle that case too, so it's a bit more tricky (because we are not able to differentiate between not-initialized and not-supported, really).
@serathius - FYI

On a related note, we should probably reject streaming list requests if progress-requester is not supported.
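
Roughly what I have in mind (just a sketch - the checker's name and its Supports() API are placeholders until #124612 merges):

```go
// Sketch only: in addition to the feature gates, gate on runtime etcd support.
// etcdFeatureSupport and requestWatchProgressFeature are placeholders for
// whatever #124612 ends up exposing; note that Supports() returning false would
// cover both "not supported" and "not probed yet", which is the ambiguity above.
progressRequesterCleanUpOnceFn := func() { /*no-op*/ }
if utilfeature.DefaultFeatureGate.Enabled(features.ConsistentListFromCache) &&
	utilfeature.DefaultFeatureGate.Enabled(features.WatchList) &&
	etcdFeatureSupport.Supports(requestWatchProgressFeature) {
	progressRequester.Add()
	progressRequesterCleanUpOnceFn = sync.OnceFunc(progressRequester.Remove)
}
```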

Contributor Author

because we are not able to differentiate between not-initialized and not-supports really

why? It is not supported when the version of etcd doesn't match.

On a related note, we should probably reject streaming list requests if progress-requester is not supported.

yeah, ideally we could add an etcdVersionChecker to

func ValidateListOptions(options *internalversion.ListOptions, isWatchListFeatureEnabled bool) field.ErrorList {

could the etcdVersionChecker gate the server readiness until it initialises?
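
e.g. something like this hypothetical extension (the extra supportsWatchProgress parameter is invented here, just to illustrate the idea):

```go
// Hypothetical sketch: reject streaming list requests when etcd cannot deliver
// watch progress notifications. supportsWatchProgress is an invented parameter;
// the real plumbing would depend on whichever version/feature checker we end up with.
func ValidateListOptions(options *internalversion.ListOptions, isWatchListFeatureEnabled, supportsWatchProgress bool) field.ErrorList {
	allErrs := field.ErrorList{}
	if isWatchListFeatureEnabled && options.SendInitialEvents != nil && !supportsWatchProgress {
		allErrs = append(allErrs, field.Forbidden(
			field.NewPath("sendInitialEvents"),
			"streaming lists require watch progress notifications, which the backing etcd does not provide"))
	}
	// ...the existing validation would follow unchanged.
	return allErrs
}
```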

Member

why? It is not supported when the version of etcd doesn't match.

etcd can be started after kube-apiserver, so we default to false and let initialization switch it

could the etcdVersionChecker gate the server readiness until it initialises?

no - because we don't really know when it is fully initialized...

Contributor Author

etcd can be started after kube-apiserver, so we default to false and let initialization switch it

In that case the server won't be ready until newETCD3Check turns green. We could create something similar for the version checker.
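
e.g. a rough sketch of such a readiness check (versionChecker and its Initialized() method are made up here):

```go
// Rough sketch: a readyz-style check, similar in spirit to the etcd readiness
// check, that stays red until the hypothetical version checker has completed
// its first probe of etcd.
versionReadyCheck := healthz.NamedCheck("etcd-version-checker", func(_ *http.Request) error {
	if !versionChecker.Initialized() {
		return fmt.Errorf("etcd version check has not completed yet")
	}
	return nil
})
// ...and plug versionReadyCheck into the server's readyz checks alongside the etcd check.
```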

@@ -2470,6 +2470,72 @@ func TestWatchStreamSeparation(t *testing.T) {
}
}

func TestDispatchEventsUseProgressRequester(t *testing.T) {
Contributor Author

The test works but can start flaking on "timing" issues. I like it because it tests the entire watch request.

require.NoError(t, err, "failed to create watch: %v")
testCheckNoEvents(t, w)
w.Stop()
storeWatchProgressCounterValueAfterFirstWatch := backingStorage.getRequestWatchProgressCounter()
Contributor Author

maybe before stopping the watch we should loop until the counter is > 0? That could deflake the test.
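
e.g. something along these lines (a sketch using the test's existing backingStorage counter and k8s.io/apimachinery/pkg/util/wait):

```go
// Sketch of the deflaking idea: before stopping the watch, poll until at least
// one progress request has been observed, instead of relying on timing.
if err := wait.PollImmediate(10*time.Millisecond, wait.ForeverTestTimeout, func() (bool, error) {
	return backingStorage.getRequestWatchProgressCounter() > 0, nil
}); err != nil {
	t.Fatalf("timed out waiting for the first RequestWatchProgress call: %v", err)
}
w.Stop()
```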

@p0lyn0mial
Contributor Author

/assign @wojtek-t

// https://github.com/kubernetes/kubernetes/pull/124612 merges
progressRequesterCleanUpOnceFn := func() { /*no-op*/ }
if utilfeature.DefaultFeatureGate.Enabled(features.ConsistentListFromCache) && utilfeature.DefaultFeatureGate.Enabled(features.WatchList) {
progressRequester.Add()
Contributor Author

I have just also realised that we need to slightly change the progressRequester so that it is able to send periodic progress updates. This will unblock watchers initialised from the global RV against resources that haven't received any changes/updates.

I can try changing the progressRequester for that purpose.
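
The shape of the change I have in mind is roughly this (an illustrative sketch only, with made-up names rather than the current progressRequester API):

```go
// Illustrative sketch: keep asking etcd for watch progress at a fixed interval
// while anyone is waiting (Add()/Remove() tracked by a counter), so watchers
// started from a global RV on quiet resources get unblocked even without writes.
type periodicProgressRequester struct {
	requestWatchProgress func(ctx context.Context) error
	waiting              func() int // consumers added via Add() and not yet Remove()d
}

func (pr *periodicProgressRequester) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if pr.waiting() == 0 {
				continue
			}
			// Best-effort: an error here just means the next tick retries.
			_ = pr.requestWatchProgress(ctx)
		}
	}
}
```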

Contributor Author

The question is what should we do when etcd was started with the progress notification flag?

Contributor Author

actually, no, this will be handled by

c.watchCache.waitingUntilFresh.Add()

progressRequester.Add()
progressRequesterCleanUpOnceFn = sync.OnceFunc(progressRequester.Remove)
}

Member

So with all the discussion on the other PRs, I think it will actually be much easier to proceed with a different approach (the one that I thought about originally).

Basically, instead of even requesting bookmarks, ensure that we initialize lastProcessedResourceVersion correctly from the beginning.
So the flow we want to achieve is that lastProcessedResourceVersion will actually be set when the first List call is done.

Now, synchronizing that correctly in an arbitrary way is a bit tricky, but we have a pretty simple path to achieve it.

What you need to do is a very simple thing, just add this here:

lastProcessedResourceVersion := uint64(0)
wait.PollUntil(10*time.Millisecond, func() (bool, error) {
  if rv := c.watchCache.GetResourceVersion(); rv != 0 {
    lastProcessedResourceVersion = rv
    return true, nil
  }
  return false, nil
}, c.stopCh)

That solves the whole problem - you don't need any other changes in this PR.

Contributor Author

ok, i like the idea, it will be detached from the progressRequester. thanks!

@p0lyn0mial p0lyn0mial force-pushed the upstream-cacher-dispatchevents-progress-requester branch 3 times, most recently from fc80b5e to 064fe38 Compare May 14, 2024 10:19
@fedebongio
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels May 14, 2024
// The cache must wait until this first sync is completed to be deemed ready.
// Since we cannot send a bookmark when the lastProcessedResourceVersion is 0,
// we poll aggressively for the first RV before entering the dispatch loop.
if err := c.ready.wait(wait.ContextForChannel(c.stopCh)); err != nil {
Member

Why do we need it? I suggest removing it.

Contributor Author

We have an extremely tight loop below.

Before hammering the CPU, we should check if the cacher has been synchronised so that we can read the current value of the resource version.

Member

the calls in that function (GetResourceVersion) are pretty cheap, so I wouldn't unnecessarily complicate it

Contributor Author

What is your concern here? Does calling a well-known function really make this code more complicated? :)
There is no point in calling getResourceVersion before the cacher synchronises.

Member

Not-readiness can happen later too; if the cache unsynchronizes later, it can block again.
But primarily - complexity is my concern. This code is already super complicated and we need to find ways of making it simpler.

Calling getResourceVersion() is really cheap and I'm not bothered by calling it every 10ms until it initializes - the cost of doing that is negligible compared to the initialization itself anyway.

Contributor Author

Not-readiness can happen later too, if the cache unsynchronizes later it can block again.

we call it only once and then we enter the for loop from which we never exit (unless the stopCh is closed)

Contributor Author

But primarily - complexity is my concern. This code is already super complicated and we need to find ways for making it simpler.

OK, pushed, PTAL.

@@ -641,6 +641,12 @@ func (w *watchCache) Resync() error {
return nil
}

func (w *watchCache) GetResourceVersion() uint64 {
Member

nit: make it a private function

}
return false, nil
}); err != nil {
return /*since it can only happen when the stopCh is closed*/
Member

nit:

// Given the function above never returns an error, this can happen only when stopCh is closed

@p0lyn0mial p0lyn0mial force-pushed the upstream-cacher-dispatchevents-progress-requester branch from 064fe38 to 783cf82 Compare May 20, 2024 10:59
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 20, 2024
@p0lyn0mial p0lyn0mial force-pushed the upstream-cacher-dispatchevents-progress-requester branch from 783cf82 to 33f81ee Compare May 20, 2024 12:18
@wojtek-t
Member

/kind feature

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 20, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a2c5781dceeeb6ff2dfb380c19bc8908d7fbf891

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 20, 2024
@k8s-ci-robot
Contributor

@p0lyn0mial: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-e2e-kind-ipv6
Commit: 33f81ee
Required: true
Rerun command: /test pull-kubernetes-e2e-kind-ipv6

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@p0lyn0mial
Contributor Author

/test pull-kubernetes-e2e-kind-ipv6
