Failing test due to pv expecting a "topology.gke.io/zone" label that is not set in OSS kubernetes nodes on GCE #124721
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/sig storage
@jbtk You need to install the CSI driver as part of the K8s cluster installation. Please ping @mattcary @BenTheElder.
/remove-sig storage
+1 How are you bringing up your cluster? IIUC, k/k cluster up will not bring up a cloud-provider-specific cluster; you need to use the tooling in cloud-provider-gcp.
no, kube-up.sh has been using the external cloud provider by default for a while now, but it might have gaps. It seems likely that the gap is here: while @dims / @aojea had it using the out-of-tree CCM in preparation for removing the provider, the CSI driver may not have been running versus the in-tree volume? In which case we should add CSI install to the in-tree kube-up for now, and long term the autoscaling tests may move to the GCP repo depending on which are broadly relevant versus very GCP specific. We can see what is used from the logs at
For background, see #102701. It seems there we decided that the pdcsi driver should not be installed in k/k and we should only use cloud-provider-gcp --- that's why I thought there was no cloud provider support. Anyway, I'm sure there's been a lot of evolution around this. If we want to revisit that PR, or a variation of it, I'd be happy to help.
Thanks, I do think we need to revisit this. How are other PV tests running on GCE right now? If this test depends on GCP-specific behavior, then yeah, we should just move all of that over to cloud-provider-gcp and not install it here; but if we have Kubernetes tests that cover in-tree logic where we'd ... want a storage class / PV with these behaviors, we should install a current implementation. Even if the tests sometimes depend on GCP-specific things, we need to split them between:
1. tests that are really testing GCP-specific behavior, and
2. tests that cover logic in this repo and merely need some cloud implementation to exercise it.
What I can't quite tell from the above is which category these tests fall into. If it's (1), then we should just port the tests over and make sure we're running the pdcsi driver over there. If it's (2) (or there are other tests reasonably depending on the pdcsi driver, which would also be (2)), then we should revisit installing pdcsi in cluster/ here. As much as we want to deprecate cluster/, we should not regress on coverage until we have it replaced.
Most PV tests use a csi hostpath or similar cloud-provider-independent volume, or the rancher default storage class that's now in kind. This gets most use cases, but not everything, because for example it can't move between nodes.

For the second part of your comment, I think it's (2). We should revisit installing pdcsi in k/k. My long-winded rationale for that is: my understanding at the time of #102701 was that we were trying to move to a world where k/k e2e tests would run on kind, and anything that couldn't run on kind would have to run on a cloud provider. I think this was meant to be the same thing that you're saying, except instead of saying "GCP-specific" we said "can't run on kind". Those two notions may be about the same when talking about storage tests, but I think it's quite different for the autoscaler tests.

I suspect the autoscaler tests fall into a donut hole where they need more than kind, but they're not really GCP specific -- they don't care about certain GCE PD features, they just want a generic-looking cloud volume. My thinking is that either we should draw a clear line and say "only kind for k/k e2e" (and say that the autoscaler tests have to move to cloud-provider-gcp), or, if we're going to support running k/k e2e on GCE, we need that GCE cluster-up to install the pd csi driver, even if that leaks cloud-provider-gcp a little into k/k. At least we'd have a clear rationale for the leakage (testing cloud features) even if we don't have a well-defined generic "cloud features" framework. I think it would be too strong to force the autoscaler tests into a cloud provider, because the tests are probably generic in practice, and it's a good idea for this test code to live in k/k rather than be sequestered in one cloud provider's repo.
Ok - some clarification:
Note that when I wrote a test that installed a driver and used a preprovisioned PV, it worked because all the needed labels were there (I was planning to add it, but then the removal of the GCE provider happened, so I will likely add it to the GCP-specific repo). P.S. I do not think it is possible to rewrite the storage-related autoscaling tests in a provider-agnostic way, so to run them we will need a provider-specific repo. Fixing/moving the tests is something I am working on.
Here I'm specifically referencing "GCP-specific" as in, "this test only covers things that are relevant to GCP" as opposed to "this test ensures coverage for code in the kubernetes/kubernetes repo that we wouldn't otherwise have". While we continue to expand what can be tested in a completely general way and/or with kind specifically, there are plenty of cases where we're NOT going to do that with kind, such as node tests, scalability / performance testing, etc. We should be deciding to put tests in this repo or cloud-provider-gcp based on whether they're actually testing the cloud provider or whether the cloud provider is a detail of testing Kubernetes core (and in the latter case, within reason, we'll continue to generalize / decouple these). For the autoscaler tests, I'm not actually sure they belong here, because my understanding is the implementation we're testing is out of tree; if there's some in-this-repo logic we're testing, then we should fix them here.
In general yes, and we should aim to make them portable, but we shouldn't let portability block coverage. But for these specific tests: are they ever going to be tested on another provider? Do they cover logic contained in this repo? If not, we should move them. I think these are really testing features of cluster-autoscaler without code in this repo, so the cluster-autoscaler or cloud-provider-gcp repos are a better target. Separately, I think installing the GCE PD CSI driver is reasonable and not really different from installing the cloud-provider-gcp CCM, and I would approve a new PR to include it in kube-up. Tests that are testing that driver, versus testing e.g. CSI migration for that driver, should be written elsewhere though, and the tests should not import GCP code.
Thanks, it sounds like we should move this test. We can support doing this in the cloud-provider-gcp repo or possibly the cluster autoscaler repo.
I think I agree with everything you say, Ben. There is code in k/k/pkg/volume that I don't think can be tested on kind --- i.e. PVC handling and the same volume being used by different nodes in different ways. The only way in practice we know how to exercise those tests is to run on a cloud provider (GCE, AWS, etc). I think that means in order to continue to test k8s core features, we need some cloud provider support in k/k, and so we need to install the pd csi driver for GCE clusters. I'll look into reviving #102701.
Yes, at least not currently ... maybe if someone invests in a sophisticated local volume driver ... but we should not block on that. We were in a similar state for load balancers, but we're starting to cover those more with https://sigs.k8s.io/cloud-provider-kind; maybe someday we'll write a more advanced local-PV to go with ...
Thank you!
For moving tests into the cluster autoscaler repo - is there any way to run tests in other repos (support for kubetest2, using the testing libraries, etc.)? These tests make extensive use of the e2e testing framework. I know that cloud-provider-gcp needed some work to make its tests runnable, and it depends on the k/k repo even though k/k does not officially support such dependencies. Are there any other components that live outside the main k/k repo and use the e2e testing framework/tooling from k/k that I could look into as an example? P.S. Forgive my basic questions, I am very new to Kubernetes and might not have enough context here.
I feel that this issue has drifted from the initial problem to the question of where the tests should live and how they should be written. While I understand that the latter question is indeed important (and may require some wider discussion), I would like to reiterate the original issue. The question for SIG Storage would be: should this be fixed, and how? /sig storage
@jbtk --- the pd csi driver ensures that the topology label is set correctly. The csi translation layer depends on a pd csi driver being installed; the behavior you see is working as intended. The solution is to install the pd csi driver as part of GCE cluster creation, which I'll look into. Since you're probably just concerned with getting your tests to pass, we may be able to do a quick fix. Can you point me to how the test brings up a cluster (e.g. is it kubetest or kubetest2)?
This seems to be the documentation: https://kubernetes.io/blog/2022/09/26/storage-in-tree-to-csi-migration-status-update-1.25/#what-does-it-mean-for-the-storage-driver-csi-migration-to-go-ga, for better or worse.
I don't think preprovisioning is sufficient -- the csi driver is needed to attach & mount a PV. I'm also not sure the preprovisioning is a "more standard way" --- in our experience cloud users usually dynamically provision a PV. It's certainly much easier for testing.
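To make the dynamic-provisioning path concrete, a StorageClass backed by the GCE PD CSI driver might look like the following. This is a minimal illustrative sketch: `pd.csi.storage.gke.io` is the driver's standard provisioner name, while the class name and parameter values here are assumptions, not taken from this thread.

```yaml
# Illustrative StorageClass for dynamic provisioning via the GCE PD CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-csi-standard        # hypothetical name
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard            # assumed disk type
volumeBindingMode: WaitForFirstConsumer  # bind after the pod is scheduled
```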
Ok, thanks. Adding it to kube-up will be the nicest way, since AFAICT you're just running straight from the e2e ginkgo and not doing any setup on your own.
Which jobs are failing?
autoscaling e2e test: Kubernetes e2e suite.[It] [sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pod requesting volume is pending [Feature:ClusterSizeAutoscalingScaleUp]
Which tests are failing?
Kubernetes e2e suite.[It] [sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pod requesting volume is pending [Feature:ClusterSizeAutoscalingScaleUp]
Since when has it been failing?
Not sure; it has been failing for the whole time visible in testgrid.
Testgrid link
https://testgrid.k8s.io/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling
Reason for failure (if possible)
The scheduler rewrites the PV requirements from in-tree to CSI, requiring a label "topology.gke.io/zone" that is not set on nodes running in OSS Kubernetes (started with the kube-up script).
Command used to start the cluster:
kubetest2 gce -v 2 --repo-root ~/src/k8s.io/kubernetes --gcp-project --legacy-mode --build --up --env=ENABLE_CUSTOM_METRICS=true --env=KUBE_ENABLE_CLUSTER_AUTOSCALER=true --env=KUBE_AUTOSCALER_MIN_NODES=3 --env=KUBE_AUTOSCALER_MAX_NODES=6 --env=KUBE_AUTOSCALER_ENABLE_SCALE_DOWN=true --env=KUBE_ADMISSION_CONTROL=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,Priority --env=ENABLE_POD_PRIORITY=true
The problematic code seems to be here: https://github.com/kubernetes/csi-translation-lib/blob/master/plugins/gce_pd.go#L257
I see what the problem is, but it is not clear to me what the correct behavior should be. It seems that in GKE this label is actually set on the node.
The labels that I see on the node of my cluster:
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=e2-standard-2
beta.kubernetes.io/os=linux
cloud.google.com/metadata-proxy-ready=true
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-b
kubernetes.io/arch=amd64
kubernetes.io/hostname=kt2-1b77b5e4-87ae-minion-group-5pgr
kubernetes.io/os=linux
node.kubernetes.io/instance-type=e2-standard-2
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-b
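Given the labels above, a minimal sketch of label matching (not the real scheduler's node-affinity code) shows why the translated PV cannot be placed on this node: the required key `topology.gke.io/zone` is simply absent, even though the node is in the right zone under the generic key.

```go
package main

import "fmt"

// nodeMatches reports whether a node's labels satisfy a single required
// key/value pair, the way a node-affinity term over a label would.
func nodeMatches(nodeLabels map[string]string, key, value string) bool {
	v, ok := nodeLabels[key]
	return ok && v == value
}

func main() {
	// Subset of the OSS kube-up node labels listed in this issue.
	ossNode := map[string]string{
		"topology.kubernetes.io/region": "us-central1",
		"topology.kubernetes.io/zone":   "us-central1-b",
	}
	// The GKE-specific key is missing, so the translated requirement fails...
	fmt.Println(nodeMatches(ossNode, "topology.gke.io/zone", "us-central1-b"))
	// ...while the generic in-tree key would have matched.
	fmt.Println(nodeMatches(ossNode, "topology.kubernetes.io/zone", "us-central1-b"))
}
```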
What the test is doing:
Anything else we need to know?
No response
Relevant SIG(s)
/sig-storage