New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass event created timestamp correctly to cache #124297
base: master
Are you sure you want to change the base?
Conversation
`CreatedAt` timestamp of `ContainerEventResponse` should be passed as nanoseconds to `time.Unix()`.
Please note that we're already in Test Freeze for the Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Apr 12 13:49:11 UTC 2024. |
Welcome @hshiina! |
Hi @hshiina. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/triage accepted |
@@ -264,7 +264,7 @@ func (e *EventedPLEG) processCRIEvents(containerEventsResponseCh chan *runtimeap | |||
} | |||
shouldSendPLEGEvent = true | |||
} else { | |||
if e.cache.Set(podID, status, err, time.Unix(event.GetCreatedAt(), 0)) { | |||
if e.cache.Set(podID, status, err, time.Unix(0, event.GetCreatedAt())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hshiina great catch! Thank you!
/lgtm
/assign @mrunalp @derekwaynecarr @dchen1107 |
/test pull-kubernetes-e2e-kind-evented-pleg |
cc @harche |
failure looks unrelated... retesting /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e |
The failure seems to be expected as described in this bug
|
There are some cases where a pod worker is woken up without a cache update by the PLEG such as a pod termination. Then, the worker gets stuck in `cache.GetNewerThan()` till the global cache timestamp is updated by the PLEG. In order to unblock the stuck worker as early as the Generic PLEG, this fix makes the Evented PLEG update the global cache as frequently as the Generic PLEG.
New changes are detected. LGTM label has been removed. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hshiina The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test pull-kubernetes-e2e-kind-evented-pleg |
/cc @pacoxu |
A failure in pull-kubernetes-node-e2e-containerd seems to be caused because a regular container and a postStart hook start asynchronously.
Because of #124668, kubernetes/test/e2e_node/container_lifecycle_test.go Lines 255 to 259 in 8e5b26b
|
A failure in pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e looks the same issue described in #124743.
|
/test pull-kubernetes-node-e2e-containerd |
/test pull-kubernetes-e2e-kind-beta-features |
/test pull-kubernetes-e2e-kind-evented-pleg |
1 similar comment
/test pull-kubernetes-e2e-kind-evented-pleg |
@hshiina: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
One of failures in pull-kubernetes-e2e-kind-evented-pleg looks the same issue as #123087:
I guess the bind error in the log was caused because two containers were created.
I don’t think it is enough to wait for the cache to be updated once at |
I posted another PR (#124953) to try solving the remaining problem. I think it's still worth merging this PR at first in order to improve the Evented PLEG even though there is still an unsolved problem. |
/test pull-kubernetes-e2e-kind-alpha-beta-features |
What type of PR is this?
/kind bug
What this PR does / why we need it:
CreatedAt
timestamp ofContainerEventResponse
should be passed as nanoseconds totime.Unix()
.This fix may mitigate #123087. I guess this issue happens as following:
Usually, a container created event is delivered while a pod worker is creating a pod at SyncPod() and the
created
status is cached by the PLEG. AfterSyncPod()
, the pod worker waits at GetNewerThan() until a next event (container started) happens because the cached container status is older thanlastSyncTime
. However, due to the bug in this PR, themodified
time of the cached status is so much newer because a nanoseconds value is set as seconds. Then, the pod worker gets acreated
container status from the cache and goes intoSyncPod()
soon. Then, the worker creates another container.By fixing the timestamp, a pod worker is blocked at
GetNewerThan()
as expected. However, it is blocked much longer than the Generic PLEG in cases where the worker is woken up without a cache update by the PLEG such as a pod termination (issue #124704). In order to unblock the stuck worker as early as the Generic PLEG, this fix also makes the Evented PLEG update the global cache as frequently as the Generic PLEG.Which issue(s) this PR fixes:
Fixes #124704
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: