Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pod phase transition is slower when EventedPLEG is enabled #124704

Open
hshiina opened this issue May 6, 2024 · 7 comments · May be fixed by #124297 or #125018
Open

Pod phase transition is slower when EventedPLEG is enabled #124704

hshiina opened this issue May 6, 2024 · 7 comments · May be fixed by #124297 or #125018
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@hshiina
Copy link

hshiina commented May 6, 2024

What happened?

This issue presupposes that PR #124297 is merged. I believe this fix is necessary for the Evented PLEG.

As I tried running a pod on my laptop with a locally built kubernetes with applying #124297, a pod phase transition is slower especially at deletion when the EventedPLEG is enabled.

EventedPLEG is enabled:

$ kubectl create -f simple-pod.yaml; kubectl get pod simple-pod -w & kubectl wait --for=jsonpath='{.status.phase}'=Running pod simple-pod; sleep 1; kubectl delete pod simple-pod; kill $!
pod/simple-pod created
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	3s
pod/simple-pod condition met
simple-pod   1/1 	Terminating     	0      	4s
pod "simple-pod" deleted
simple-pod   0/1 	Terminating     	0      	13s
simple-pod   0/1 	Terminating     	0      	18s
simple-pod   0/1 	Terminating     	0      	18s
simple-pod   0/1 	Terminating     	0      	18s

EventedPLEG is disabled:

$ kubectl create -f simple-pod.yaml; kubectl get pod simple-pod -w & kubectl wait --for=jsonpath='{.status.phase}'=Running pod simple-pod; sleep 1; kubectl delete pod simple-pod; kill $!
pod/simple-pod created
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s

As described here, a pod worker is blocked at cache.GetNewerThan() when it is woken up by an update without a cache update in a PLEG:

default:
// wait until we see the next refresh from the PLEG via the cache (max 2s)
// TODO: this adds ~1s of latency on all transitions from sync to terminating
// to terminated, and on all termination retries (including evictions). We should
// improve latency by making the pleg continuous and by allowing pod status
// changes to be refreshed when key events happen (killPod, sync->terminating).
// Improving this latency also reduces the possibility that a terminated
// container's status is garbage collected before we have a chance to update the
// API server (thus losing the exit code).
status, err = p.podCache.GetNewerThan(update.Options.Pod.UID, lastSyncTime)

The worker is unblocked when another event is delivered or the PLEG calls cache.UpdateTime(). At the latter case, while the genericPLEG calls cache.UpdateTime() every one second along with Relist(), the EventPLEG calls cache.UpdateTime() every five seconds. Because of this difference, the Evented PLEG spends more time to get pods into another phase.

// The frequency with which global timestamp of the cache is to
// is to be updated periodically. If pod workers get stuck at cache.GetNewerThan
// call, after this period it will be unblocked.
const globalCacheUpdatePeriod = 5 * time.Second

Even if there is a cache update, a worker can be blocked when the cache is updated by an asynchronous event before the worker finishes SyncPod(). For instance, when a runtime starts a container, the PLEG gets an event and caches the container status(running). If this event is received after a pod worker finishes SyncPod() to start the container, the worker gets the new status at GetNewerThan() soon and runs SyncPod() again to update the pod phase to running. However, if the event arrives before the worker finishes SyncPod() to start the container, the worker is blocked at GetNewerThan() because the cached status is older than lastSyncTime.

What did you expect to happen?

The pod phase transition should be as fast as GenericPLEG. It would be better to set globalCacheUpdatePeriod to 1 second.

How can we reproduce it (as minimally and precisely as possible)?

Build kubernetes locally with applying PR #124297 and run with enabling EventedPLEG feature gate.

Use this simple-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
	run: simple-pod
  name: simple-pod
spec:
  containers:
  - command:
	- sh
	- -c
	- trap "exit 0" SIGTERM; while true; do sleep 1; done
	image: busybox
	name: simple-container

Run the command:

$ kubectl create -f simple-pod.yaml; kubectl get pod simple-pod -w & kubectl wait --for=jsonpath='{.status.phase}'=Running pod simple-pod; sleep 1; kubectl delete pod simple-pod; kill $!

This is the result I tried the command ten times:

$ for i in `seq 10`; do kubectl create -f simple-pod.yaml; kubectl get pod simple-pod -w & kubectl wait --for=jsonpath='{.status.phase}'=Running pod simple-pod; sleep 1; kubectl delete pod simple-pod; kill $!; done
pod/simple-pod created
[1] 17875
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	12s
simple-pod   0/1 	Terminating     	0      	17s
simple-pod   0/1 	Terminating     	0      	17s
simple-pod   0/1 	Terminating     	0      	17s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 18248
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	6s
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 18583
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	6s
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 19057
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
simple-pod   1/1 	Terminating     	0      	6s
pod "simple-pod" deleted
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 19386
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	6s
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 19718
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	6s
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 20052
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	6s
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 20586
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	6s
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 20914
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	6s
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 21294
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	5s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	6s
simple-pod   0/1 	Terminating     	0      	15s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s
simple-pod   0/1 	Terminating     	0      	20s

This is the result when EventedPLEG is disabled:

$ for i in `seq 10`; do kubectl create -f simple-pod.yaml; kubectl get pod simple-pod -w & kubectl wait --for=jsonpath='{.status.phase}'=Running pod simple-pod; sleep 1; kubectl delete pod simple-pod; kill $!; done
pod/simple-pod created
[1] 34236
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 34506
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 34820
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
pod/simple-pod condition met
simple-pod   1/1 	Running         	0      	2s
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 35084
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 35373
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 35640
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 35900
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 36173
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 36430
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
pod/simple-pod created
[1]+  Exit 1              	kubectl get pod simple-pod -w
[1] 36694
NAME     	READY   STATUS          	RESTARTS   AGE
simple-pod   0/1 	ContainerCreating   0      	0s
simple-pod   1/1 	Running         	0      	2s
pod/simple-pod condition met
pod "simple-pod" deleted
simple-pod   1/1 	Terminating     	0      	3s
simple-pod   0/1 	Terminating     	0      	4s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s
simple-pod   0/1 	Terminating     	0      	5s

Anything else we need to know?

No response

Kubernetes version

master + PR #124297

$ kubectl version
# paste output here
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.0-alpha.0.582+8ec29a96e91e40
WARNING: version difference between client (1.28) and server (1.31) exceeds the supported minor version skew of +/-1

Cloud provider

None

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@hshiina hshiina added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 6, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hshiina
Copy link
Author

hshiina commented May 6, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2024
@saschagrunert
Copy link
Member

cc @harche @sairameshv

@sairameshv sairameshv linked a pull request May 8, 2024 that will close this issue
@hshiina
Copy link
Author

hshiina commented May 8, 2024

/assign

@harche
Copy link
Contributor

harche commented May 8, 2024

/cc @pacoxu

@pacoxu pacoxu added this to Triage in SIG Node Bugs May 14, 2024
@pacoxu
Copy link
Member

pacoxu commented May 20, 2024

We are working a fix of EvenetedPLEG bug in #122778. Could you also test with both #122778 and #124297?

@hshiina
Copy link
Author

hshiina commented May 21, 2024

Sure. I created anther PR(#125018) for testing both PRs.

@haircommander haircommander moved this from Triage to Triaged in SIG Node Bugs May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.
Projects
5 participants