
Inconsistent POD status reporting #107713

Open
rdavyd opened this issue Jan 24, 2022 · 24 comments · May be fixed by #124766
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@rdavyd

rdavyd commented Jan 24, 2022

What happened?

I run a service mesh sidecar (in my case Istio) in pods that are created and controlled by Jobs. The "main" container shuts down the Istio sidecar via an API call to 127.0.0.1 and then exits with the actual application exit code. The issue is that when the "main" container finishes with an error, the pod status often displays Completed when queried via kubectl get pod.

NAME                            READY   STATUS      RESTARTS   AGE
job-istio-proxy-test--1-zdlc6   0/2     Completed   0          47s
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:11Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:26Z"
    message: 'containers with unready status: [somejob istio-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:26Z"
    message: 'containers with unready status: [somejob istio-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:10Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://ccf54140202e07f2bd37151dee171fc744158ff81c4a58426bad0518f8dd6c6d
    image: docker.io/istio/proxyv2:1.12.2
    imageID: docker.io/istio/proxyv2@sha256:f26717efc7f6e0fe928760dd353ed004ea35444f5aa6d41341a003e7610cd26f
    lastState: {}
    name: istio-proxy
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://ccf54140202e07f2bd37151dee171fc744158ff81c4a58426bad0518f8dd6c6d
        exitCode: 0
        finishedAt: "2022-01-24T10:04:25Z"
        reason: Completed
        startedAt: "2022-01-24T10:04:13Z"
  - containerID: containerd://1f515cb65e4c3a8f206ae0cbbe19720fb4e734361ec6740156f53e1f5e002278
    image: docker.io/amouat/network-utils:latest
    imageID: docker.io/amouat/network-utils@sha256:c4da08f9dac831b8f83ffc63f4a7f327754e20aeac1e9ae68d7727ccc25b8172
    lastState: {}
    name: somejob
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://1f515cb65e4c3a8f206ae0cbbe19720fb4e734361ec6740156f53e1f5e002278
        exitCode: 1
        finishedAt: "2022-01-24T10:04:30Z"
        reason: Error
        startedAt: "2022-01-24T10:04:13Z"
  hostIP: 10.10.140.140
  initContainerStatuses:
  - containerID: containerd://a2c5c43f2730d7d16892b2197d438c87e1c25a9fd322e639e6a2b9702c881c0a
    image: docker.io/istio/proxyv2:1.12.2
    imageID: docker.io/istio/proxyv2@sha256:f26717efc7f6e0fe928760dd353ed004ea35444f5aa6d41341a003e7610cd26f
    lastState: {}
    name: istio-validation
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://a2c5c43f2730d7d16892b2197d438c87e1c25a9fd322e639e6a2b9702c881c0a
        exitCode: 0
        finishedAt: "2022-01-24T10:04:11Z"
        reason: Completed
        startedAt: "2022-01-24T10:04:11Z"
  phase: Failed
  podIP: 10.10.177.125
  podIPs:
  - ip: 10.10.177.125
  qosClass: Burstable
  startTime: "2022-01-24T10:04:10Z"

What did you expect to happen?

It should return status Error when one of the containers in the pod fails.
I believe this is because the pod STATUS field is calculated incorrectly (it takes the terminated reason of the last container processed in the pod.Status.ContainerStatuses array):

} else if container.State.Terminated != nil && container.State.Terminated.Reason != "" {
    reason = container.State.Terminated.Reason

The workaround for this situation is to give the actual application container a name that sorts first alphabetically (e.g. starting with "abc") and the sidecar a name that sorts last (e.g. starting with "xyz").
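
For illustration, a simplified sketch of why this happens (loosely based on printPod in pkg/printers/internalversion/printers.go, with waiting states, signals and pod.Status.Reason handling elided; not the exact code):

package example

import corev1 "k8s.io/api/core/v1"

// printedStatus sketches how kubectl derives the STATUS column: a single
// reason string is overwritten while walking the container statuses, so
// one container's "Completed" can mask another container's "Error",
// regardless of the pod phase reported by the API server.
func printedStatus(pod *corev1.Pod) string {
    reason := string(pod.Status.Phase)
    for i := len(pod.Status.ContainerStatuses) - 1; i >= 0; i-- {
        if t := pod.Status.ContainerStatuses[i].State.Terminated; t != nil && t.Reason != "" {
            reason = t.Reason
        }
    }
    return reason
}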

How can we reproduce it (as minimally and precisely as possible)?

Test job

apiVersion: batch/v1
kind: Job
metadata:
  name: job-istio-proxy-test
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600
  template:
    metadata:
      labels:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: somejob
        image: amouat/network-utils:latest
        command:
        - /bin/bash
        - -c
        - |
          # Wait for sidecar to be ready
          until curl -fsSI -o /dev/null http://localhost:15021/healthz/ready; do echo \"Waiting for Sidecar...\"; sleep 2; done; echo "Sidecar available. Running the command..."
          # Simulate some useful job
          sleep 10
          # Simulate job failure
          false
          # Shutdown sidecar and return job exit code
          ret=$(echo $?); echo "Command completed. Terminating sidecar..."; curl -fsSI -o /dev/null -X POST http://localhost:15000/quitquitquit; sleep 5; exit $ret
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
      restartPolicy: Never
      securityContext:
        runAsUser: 65000
        runAsGroup: 65000

Anything else we need to know?

Istio version used - 1.12.2

Kubernetes version

1.22.5

Cloud provider

On premise

OS version

Ubuntu 20.04.3
5.4.0-86-generic #97-Ubuntu SMP Fri Sep 17 19:19:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

kubespray 1.18.0

Container runtime (CRI) and version (if applicable)

containerd 1.5.9

Related plugins (CNI, CSI, ...) and versions (if applicable)

@rdavyd rdavyd added the kind/bug Categorizes issue or PR as related to a bug. label Jan 24, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2022
@rdavyd
Author

rdavyd commented Jan 24, 2022

/sig api-machinery
/sig cli

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cli Categorizes an issue or PR as relevant to SIG CLI. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 24, 2022
@rdavyd
Author

rdavyd commented Jan 24, 2022

Refers to istio/istio#11659

@liggitt liggitt added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 25, 2022
@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Jan 25, 2022
@ehashman
Member

/triage accepted
/help

This is a display issue in the CLI.

@k8s-ci-robot
Contributor

@ehashman:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

This is a display issue in the CLI.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2022
@ehashman ehashman moved this from Triage to Needs Information in SIG Node Bugs Jan 26, 2022
@ehashman ehashman moved this from Needs Information to Triaged in SIG Node Bugs Jan 26, 2022
@Chalmiller

Hi @ehashman, do you think this is a good issue to start contributing with? If so, I'd like to assign myself to it.

@kkkkun
Member

kkkkun commented Jan 30, 2022

This may be caused by the terminated reason being empty in the error case. I fixed this similarly to how init containers are handled:

for i := range pod.Status.InitContainerStatuses {

/assign

@kkkkun
Member

kkkkun commented Feb 7, 2022

I think pod status will be 'Completed' only when container.State.Terminated.ExitCode == 0 && len(container.State.Terminated.Reason) != 0. Otherwise, it should be Error.

https://github.com/kkkkun/kubernetes/blob/f44a6791e8d072cfbba2b77528bf6ff5b4336ffc/pkg/printers/internalversion/printers.go#L814
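
A minimal reading of that rule, sketched in Go (illustrative only; not the code from the linked branch):

package example

import corev1 "k8s.io/api/core/v1"

// terminatedDisplayReason treats a terminated container as "Completed"
// only if it exited 0 and reported a non-empty reason; anything else
// surfaces as "Error".
func terminatedDisplayReason(t *corev1.ContainerStateTerminated) string {
    if t == nil {
        return ""
    }
    if t.ExitCode == 0 && t.Reason != "" {
        return "Completed"
    }
    return "Error"
}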

@rdavyd
Author

rdavyd commented Feb 7, 2022

I think pod status will be 'Completed' only when container.State.Terminated.ExitCode == 0 && len(container.State.Terminated.Reason) != 0. Otherwise, it should be Error.

https://github.com/kkkkun/kubernetes/blob/f44a6791e8d072cfbba2b77528bf6ff5b4336ffc/pkg/printers/internalversion/printers.go#L814

I presume the error happens because the pod status is evaluated from the status of the last container processed in the array. But the pod status should be Error if any of the containers returns a non-zero exit code.

@kkkkun
Member

kkkkun commented Feb 7, 2022

I think pod status will be 'Completed' only when container.State.Terminated.ExitCode == 0 && len(container.State.Terminated.Reason) != 0. Otherwise, it should be Error.
https://github.com/kkkkun/kubernetes/blob/f44a6791e8d072cfbba2b77528bf6ff5b4336ffc/pkg/printers/internalversion/printers.go#L814

But the pod status should be Error if any of the containers returns a non-zero exit code.

Yes, I agree. So I added hasError; if hasError is true, the pod status is reset to Error.
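
For reference, an illustrative sketch of that approach (the exact code in the PR may differ, and hasError is just the name used above):

package example

import corev1 "k8s.io/api/core/v1"

// adjustPodReason runs after the existing per-container loop has computed
// the displayed reason: if any container terminated with a non-zero exit
// code, report Error instead of Completed.
func adjustPodReason(pod *corev1.Pod, reason string) string {
    hasError := false
    for _, cs := range pod.Status.ContainerStatuses {
        if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
            hasError = true
            break
        }
    }
    if hasError && reason == "Completed" {
        return "Error"
    }
    return reason
}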

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2022
@rdavyd
Author

rdavyd commented May 9, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2022
@kkkkun
Member

kkkkun commented Aug 5, 2022

I could not reproduce this, so the PR may not fix this case.

Could you please paste the pod output from kubectl get po xxx -o yaml? @rdavyd

@rdavyd
Author

rdavyd commented Aug 8, 2022

@kkkkun Sure

apiVersion: batch/v1
kind: Job
metadata:
  name: wrong-pod-status
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600
  template:
    spec:
      containers:
      - name: task-main
        image: busybox:latest
        command:
        - /bin/sh
        - -c
        - |
          # Simulate some useful job
          sleep 1
          # Simulate job failure
          false
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
      - name: sidecar
        image: busybox:latest
        command:
        - /bin/sh
        - -c
        - |
          # Simulate sidecar work
          sleep 2
          # Sidecar successful exit
          true
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
      restartPolicy: Never
      securityContext:
        runAsUser: 65000
        runAsGroup: 65000

The pod status should display Error, but it shows Completed. If you rename the container task-main to just main, the pod status is reported correctly.

@kkkkun
Member

kkkkun commented Aug 8, 2022

What is the phase of the pod status from kube-apiserver? Is it returned correctly by kube-apiserver?
I want to get more info from kubectl get po xxx -o yaml, such as:

status:
  conditions:
  containerStatuses:
  - containerID: docker://xx
    image: xx
    imageID: xx
    lastState: {}
    name: test-container
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2022-08-08T02:49:31Z"
  phase: Running
  podIP: xx.xx.xx.xx
  qosClass: Guaranteed
  startTime: "2022-08-08T02:49:29Z"

@rdavyd
Author

rdavyd commented Aug 8, 2022

Short POD status

$ kubectl get pod wrong-pod-status-smz7x
NAME                     READY   STATUS      RESTARTS   AGE
wrong-pod-status-smz7x   0/2     Completed   0          4m
$

Extended status

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-08-08T07:57:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-08-08T07:57:12Z"
    message: 'containers with unready status: [task-main sidecar]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-08-08T07:57:12Z"
    message: 'containers with unready status: [task-main sidecar]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-08-08T07:57:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://10a04e3623c42ddb81c9d40acd56e2dfee7dca5727262a68ac21bbb526992f89
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:ef320ff10026a50cf5f0213d35537ce0041ac1d96e9b7800bafd8bc9eff6c693
    lastState: {}
    name: sidecar
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://10a04e3623c42ddb81c9d40acd56e2dfee7dca5727262a68ac21bbb526992f89
        exitCode: 0
        finishedAt: "2022-08-08T07:57:12Z"
        reason: Completed
        startedAt: "2022-08-08T07:57:10Z"
  - containerID: containerd://13974698f28f35eff7378b6c73cbca32b78f9e23b7477c85beca2b50fdea35b2
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:ef320ff10026a50cf5f0213d35537ce0041ac1d96e9b7800bafd8bc9eff6c693
    lastState: {}
    name: task-main
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://13974698f28f35eff7378b6c73cbca32b78f9e23b7477c85beca2b50fdea35b2
        exitCode: 1
        finishedAt: "2022-08-08T07:57:11Z"
        reason: Error
        startedAt: "2022-08-08T07:57:10Z"
  hostIP: 10.10.140.142
  phase: Failed
  podIP: 10.10.176.223
  podIPs:
  - ip: 10.10.176.223
  qosClass: Burstable
  startTime: "2022-08-08T07:57:09Z"

EDIT: As of k8s 1.23.7

@rdavyd
Author

rdavyd commented Oct 4, 2022

@kkkkun Hi, any news?

@kkkkun
Member

kkkkun commented Oct 6, 2022

@kkkkun Hi, any news?
The PR is waiting for review: #107865

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@kkkkun
Member

kkkkun commented Feb 8, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 8, 2024
@kkkkun
Member

kkkkun commented Feb 20, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 20, 2024
@Monokaix
Member

Any progress here?

@hshiina

hshiina commented May 3, 2024

If you can enable the SidecarContainers feature gate, which is relatively new (1.28: alpha, 1.29: beta), you may be able to avoid this issue by configuring the sidecar as an init container with restartPolicy: Always.
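
For illustration, a sketch of that pod shape using the k8s.io/api/core/v1 types (assumes the SidecarContainers feature gate is enabled and client libraries from Kubernetes 1.28 or newer; the container names and images are just placeholders taken from this issue):

package example

import corev1 "k8s.io/api/core/v1"

// sidecarJobPodSpec sketches a Job pod spec where the sidecar runs as a
// restartable init container, so only the main container's result drives
// the pod's terminal phase and the sidecar is shut down automatically
// once the main container exits.
func sidecarJobPodSpec() corev1.PodSpec {
    restartAlways := corev1.ContainerRestartPolicyAlways
    return corev1.PodSpec{
        RestartPolicy: corev1.RestartPolicyNever,
        InitContainers: []corev1.Container{{
            Name:          "istio-proxy",
            Image:         "istio/proxyv2:1.12.2",
            RestartPolicy: &restartAlways, // native sidecar
        }},
        Containers: []corev1.Container{{
            Name:  "somejob",
            Image: "amouat/network-utils:latest",
        }},
    }
}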

@rdavyd
Author

rdavyd commented May 7, 2024

If you can enable the SidecarContainers feature gate, which is relatively new (1.28: alpha, 1.29: beta), you may be able to avoid this issue by configuring the sidecar as an init container with restartPolicy: Always.

True, but it is a workaround. It still has to be supported by multiple projects like Istio.

Projects
Status: Backlog
9 participants