
kubelet should always get latest secret/configmap resource in pod add event #124701

Open
V0idk opened this issue May 6, 2024 · 17 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@V0idk

V0idk commented May 6, 2024

What happened?

The sequence is as follows:

  1. A secret is updated manually.
  2. The kubelet observes the update through its watch and refreshes its watch cache.
  3. A pod is started and the volume backed by that secret is populated from the cache.

Probabilistically, however, especially under heavy load, the third step runs before the second. As a result, the old secret content is mounted when the pod starts.


https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/

What did you expect to happen?


Solution: the kubelet should always fetch the latest secret/configmap resource on a pod add event instead of using the cache.

How can we reproduce it (as minimally and precisely as possible)?

See "What happened?" above.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@V0idk V0idk added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 6, 2024
@V0idk
Author

V0idk commented May 6, 2024

/sig area/kubelet

@k8s-ci-robot
Contributor

@V0idk: The label(s) sig/area/kubelet cannot be applied, because the repository doesn't have them.

In response to this:

/sig area/kubelet

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@V0idk V0idk changed the title from "secret/configmap should always get latest resource in pod add event" to "kubelet should always get latest secret/configmap resource in pod add event" May 6, 2024
@vaibhav2107
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2024
@pacoxu pacoxu added this to Triage in SIG Node Bugs May 14, 2024
@haircommander
Contributor

/assign @harche

@haircommander
Contributor

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label May 22, 2024
@liggitt
Member

liggitt commented May 22, 2024

hrmm... when a new pod is registered, the cache manager implementation invalidates the secret / configmap. I expected the watch one to do the same, but it apparently does not.

@wojtek-t, this means that updating a secret and deployment does not guarantee the newly started pods use the new secret content. That seems... bad.
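For context, here is a minimal, self-contained sketch of the invalidate-on-register behavior described above. This is illustrative only and not the kubelet's actual types; staleOnRegisterCache, RegisterPod, and Get are hypothetical stand-ins for the cache-based manager's object store:

    package main

    import (
        "fmt"
        "sync"
    )

    // staleOnRegisterCache mimics the idea behind the cache-based manager:
    // registering a pod drops any cached copy of the referenced object, so the
    // next Get has to fetch a fresh one.
    type staleOnRegisterCache struct {
        mu   sync.Mutex
        data map[string]string // "namespace/name" -> cached object payload
    }

    // RegisterPod invalidates the cached entry for the referenced object.
    func (c *staleOnRegisterCache) RegisterPod(key string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        delete(c.data, key)
    }

    // Get returns the cached value, calling fetch again if the entry was
    // invalidated by RegisterPod.
    func (c *staleOnRegisterCache) Get(key string, fetch func() string) string {
        c.mu.Lock()
        defer c.mu.Unlock()
        if v, ok := c.data[key]; ok {
            return v
        }
        v := fetch()
        c.data[key] = v
        return v
    }

    func main() {
        c := &staleOnRegisterCache{data: map[string]string{"ns/app-secret": "old value"}}
        c.RegisterPod("ns/app-secret") // registering the pod drops the possibly stale entry
        fmt.Println(c.Get("ns/app-secret", func() string { return "fresh value from the apiserver" }))
    }

The watch-based manager, by contrast, keeps serving whatever its per-object watch has observed so far, which is where the ordering gap comes from.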

@harche
Contributor

harche commented May 23, 2024

This is where kubelet sets the strategy for fetching the secrets -

switch kubeCfg.ConfigMapAndSecretChangeDetectionStrategy {
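For readers following along, here is a tiny self-contained sketch of how that switch maps each strategy to a different manager. This is not the actual kubelet code; managerFor, newSimpleManager, newCachingManager, and newWatchingManager are hypothetical stand-ins, while the ResourceChangeDetectionStrategy constants come from k8s.io/kubelet/config/v1beta1:

    package main

    import (
        "fmt"

        kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
    )

    type manager string

    // Hypothetical stand-ins for the constructors in pkg/kubelet/secret and
    // pkg/kubelet/configmap (simple, caching, and watch-based managers).
    func newSimpleManager() manager   { return "Get: fetch from the apiserver on every access" }
    func newCachingManager() manager  { return "Cache: fetch through a TTL cache" }
    func newWatchingManager() manager { return "Watch: serve from a per-object watch cache" }

    // managerFor mirrors the shape of the switch referenced above.
    func managerFor(s kubeletconfigv1beta1.ResourceChangeDetectionStrategy) (manager, error) {
        switch s {
        case kubeletconfigv1beta1.GetChangeDetectionStrategy:
            return newSimpleManager(), nil
        case kubeletconfigv1beta1.TTLCacheChangeDetectionStrategy:
            return newCachingManager(), nil
        case kubeletconfigv1beta1.WatchChangeDetectionStrategy:
            return newWatchingManager(), nil
        default:
            return "", fmt.Errorf("unknown configmap and secret manager mode: %v", s)
        }
    }

    func main() {
        m, err := managerFor(kubeletconfigv1beta1.WatchChangeDetectionStrategy)
        if err != nil {
            panic(err)
        }
        fmt.Println(m)
    }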

@harche
Contributor

harche commented May 23, 2024

	// GetChangeDetectionStrategy is a mode in which kubelet fetches
	// necessary objects directly from apiserver.
	GetChangeDetectionStrategy ResourceChangeDetectionStrategy = "Get"
	// TTLCacheChangeDetectionStrategy is a mode in which kubelet uses
	// ttl cache for object directly fetched from apiserver.
	TTLCacheChangeDetectionStrategy ResourceChangeDetectionStrategy = "Cache"
	// WatchChangeDetectionStrategy is a mode in which kubelet uses
	// watches to observe changes to objects that are in its interest.
	WatchChangeDetectionStrategy ResourceChangeDetectionStrategy = "Watch"

and we default to WatchChangeDetectionStrategy here

obj.ConfigMapAndSecretChangeDetectionStrategy = kubeletconfigv1beta1.WatchChangeDetectionStrategy

Under resource contention, WatchChangeDetectionStrategy may not detect changes to the objects of interest and fetch the updated secrets fast enough. However, IMO, this strategy works great under normal operational circumstances: it avoids unnecessary trips to the API server and hence is a bit more efficient.

Hence, instead of changing the default from WatchChangeDetectionStrategy to GetChangeDetectionStrategy for everyone (moving away from the existing, efficient default secret-fetching strategy and surprising users with higher resource usage on their API server), we could modify our documentation to add a footnote, something like:

configMapAndSecretChangeDetectionStrategy ([ResourceChangeDetectionStrategy](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-ResourceChangeDetectionStrategy))
configMapAndSecretChangeDetectionStrategy is a mode in which ConfigMap and Secret managers are running. Valid values include:

Get: kubelet fetches necessary objects directly from the API server;
Cache: kubelet uses TTL cache for object fetched from the API server;
Watch: kubelet uses watches to observe changes to objects that are in its interest.

Default: "Watch"

Note: for cluster nodes that constantly experience high resource consumption, GetChangeDetectionStrategy is the recommended strategy.

@liggitt @rphillips @haircommander @V0idk What do you think?

@liggitt
Member

liggitt commented May 23, 2024

"this works well until it doesn't under load" is a really unactionable thing to document... I think there has to be a "happens before" ordering people can reason about and depend on ... if I update a secret, then create a new pod, and my updated secret doesn't get used, that's ~unusable in my opinion. I didn't know we gave up that ordering with the watch strategy.

@V0idk
Author

V0idk commented May 24, 2024

My idea is:

  1. Keep the watch strategy to reduce apiserver pressure when the kubelet hot-updates volumes.
  2. When creating a pod, however, query the apiserver to obtain the latest data.

This ensures the ordering without losing too much performance.
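A minimal sketch of that hybrid, assuming a client-go Clientset is available; getSecretOnPodAdd is a hypothetical helper, not an existing kubelet function. The watch-based cache would still serve subsequent volume updates, but the initial read on pod add goes straight to the apiserver:

    package podadd

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // getSecretOnPodAdd reads the Secret directly from the apiserver instead of
    // the kubelet's watch cache, so a write committed before the pod was created
    // cannot be missed on the initial mount.
    func getSecretOnPodAdd(ctx context.Context, client kubernetes.Interface, namespace, name string) (*corev1.Secret, error) {
        return client.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
    }

With this shape, the extra apiserver traffic is proportional to pod creations rather than to every secret read.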

@pisto

pisto commented May 27, 2024

I believe this issue is particularly evident in kubernetes/website#42359 (comment) (immutable: true on the configmap/secret resource exacerbates it). IMO a unified solution that ensures reasonable assumptions about time ordering is essential. In other words, I expect the surefire process for getting pods with up-to-date values to be:

  • modify a secret/configmap
  • roll out the workloads that use it.

If that were assured, it would become a matter of synchronization in whatever instrument is used to create/update resources on Kubernetes (Helm hooks, Argo sync waves, ...). Otherwise, automated rollouts can get stuck: the new pod may not get up-to-date values for e.g. environment variables, and since container crashes may not trigger container recreation, the invalid value stays and the rollout cannot advance.

@liggitt
Member

liggitt commented May 27, 2024

immutable: true on the configmap/secret resource exacerbates this issue

to be fair, changing a resource marked immutable is not how that feature is expected to be used

@wojtek-t
Member

Catching up....

It has been years since we did that, and I clearly missed the invalidation aspect. I agree this is a bug, but at least not a regression.

The question is how/when we should fix it. Given that we're using a reflector, which is "eventually consistent" by definition and starts at an arbitrary point in time anyway, forcing it to start from a consistent list now would hurt scalability in many places, so I would be afraid of breaking things.

Especially since we have a solution around the corner: streaming LISTs will actually allow us to address this cleanly:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/3157-watch-list/README.md

With streaming lists, we can change the reflector to always start from a consistent list, because it will be served from the watch cache. That is what would effectively address this.

So what I would do is:

  • change the code now to force the reflector to restart [we can do that now]
  • wait until streaming lists land [at that point we will also change the reflector to always start with a consistent streaming list], which will fix the issue
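As background for the consistency point above, here is a small client-go sketch (not the kubelet manager code): a list request with ResourceVersion "0" may be served from the apiserver watch cache and can be stale, whereas an empty ResourceVersion forces a consistent (quorum) read. The streaming-list KEP linked above is what would let a restarted reflector get that consistent view cheaply from the watch cache:

    package listsketch

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // consistentList leaves ResourceVersion empty, so the apiserver performs a
    // quorum read and returns the most recent data.
    func consistentList(ctx context.Context, client kubernetes.Interface, namespace string) (*corev1.SecretList, error) {
        return client.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{ResourceVersion: ""})
    }

    // cachedList asks for data at any resource version, which the apiserver may
    // answer from its watch cache, i.e. it is only eventually consistent.
    func cachedList(ctx context.Context, client kubernetes.Interface, namespace string) (*corev1.SecretList, error) {
        return client.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{ResourceVersion: "0"})
    }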

@pisto

pisto commented May 28, 2024

Can you clarify the "eventually consistent" bit? We have observed a rather prolonged period (at least one day, but we suspect much longer) during which a kubelet was not updating its cached data for a secret.

@liggitt
Member

liggitt commented May 28, 2024

I agree this is a bug, but at least not a regression.

It seems like it is... before the watch strategy, there was a happens-before relationship between secret/configmap writes and pod creation.

@wojtek-t
Member

wojtek-t commented May 29, 2024

It seems like it is... before the watch strategy, there was a happens-before relationship between secret/configmap writes and pod creation.

Well, technically true, but we enabled that in 1.14 (so over 5y ago) and we didn't realize until now (and unless I'm missing something we haven't heard any complaints about it). So it's not something that we found because someone upgraded their cluster recently from a version that behaved differently.

f2a0c03

@haircommander haircommander moved this from Triage to Triaged in SIG Node Bugs May 29, 2024
@haircommander
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 29, 2024