
Inconsistent POD status reporting #107713

Open
rdavyd opened this issue Jan 24, 2022 · 24 comments · May be fixed by #124766
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@rdavyd

rdavyd commented Jan 24, 2022

What happened?

I run a service mesh sidecar (in my case Istio) in pods that are created and controlled by Jobs. The "main" container shuts down the Istio sidecar via an API call to 127.0.0.1 and then exits with the actual application exit code. The issue is that when the "main" container finishes with an error, the pod status often displays Completed when queried via kubectl get pod.

NAME                            READY   STATUS      RESTARTS   AGE
job-istio-proxy-test--1-zdlc6   0/2     Completed   0          47s
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:11Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:26Z"
    message: 'containers with unready status: [somejob istio-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:26Z"
    message: 'containers with unready status: [somejob istio-proxy]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-01-24T10:04:10Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://ccf54140202e07f2bd37151dee171fc744158ff81c4a58426bad0518f8dd6c6d
    image: docker.io/istio/proxyv2:1.12.2
    imageID: docker.io/istio/proxyv2@sha256:f26717efc7f6e0fe928760dd353ed004ea35444f5aa6d41341a003e7610cd26f
    lastState: {}
    name: istio-proxy
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://ccf54140202e07f2bd37151dee171fc744158ff81c4a58426bad0518f8dd6c6d
        exitCode: 0
        finishedAt: "2022-01-24T10:04:25Z"
        reason: Completed
        startedAt: "2022-01-24T10:04:13Z"
  - containerID: containerd://1f515cb65e4c3a8f206ae0cbbe19720fb4e734361ec6740156f53e1f5e002278
    image: docker.io/amouat/network-utils:latest
    imageID: docker.io/amouat/network-utils@sha256:c4da08f9dac831b8f83ffc63f4a7f327754e20aeac1e9ae68d7727ccc25b8172
    lastState: {}
    name: somejob
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://1f515cb65e4c3a8f206ae0cbbe19720fb4e734361ec6740156f53e1f5e002278
        exitCode: 1
        finishedAt: "2022-01-24T10:04:30Z"
        reason: Error
        startedAt: "2022-01-24T10:04:13Z"
  hostIP: 10.10.140.140
  initContainerStatuses:
  - containerID: containerd://a2c5c43f2730d7d16892b2197d438c87e1c25a9fd322e639e6a2b9702c881c0a
    image: docker.io/istio/proxyv2:1.12.2
    imageID: docker.io/istio/proxyv2@sha256:f26717efc7f6e0fe928760dd353ed004ea35444f5aa6d41341a003e7610cd26f
    lastState: {}
    name: istio-validation
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://a2c5c43f2730d7d16892b2197d438c87e1c25a9fd322e639e6a2b9702c881c0a
        exitCode: 0
        finishedAt: "2022-01-24T10:04:11Z"
        reason: Completed
        startedAt: "2022-01-24T10:04:11Z"
  phase: Failed
  podIP: 10.10.177.125
  podIPs:
  - ip: 10.10.177.125
  qosClass: Burstable
  startTime: "2022-01-24T10:04:10Z"

What did you expect to happen?

It should return status Error when one of the containers in the pod fails.
I believe this is because the pod STATUS field is calculated incorrectly (it takes the terminated reason of the last container processed in the pod.Status.ContainerStatuses array):

} else if container.State.Terminated != nil && container.State.Terminated.Reason != "" {
    reason = container.State.Terminated.Reason

The workaround for this situation is to give the actual application container a name that sorts first alphabetically (e.g. starting with "abc") and the sidecar a name that sorts last (e.g. starting with "xyz").
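
For illustration, a simplified sketch of why this happens (loosely based on printPod in pkg/printers/internalversion/printers.go, with waiting states, signals and pod.Status.Reason handling elided; not the exact code):

package example

import corev1 "k8s.io/api/core/v1"

// printedStatus sketches how kubectl derives the STATUS column: a single
// reason string is overwritten while walking the container statuses, so
// one container's "Completed" can mask another container's "Error",
// regardless of the pod phase reported by the API server.
func printedStatus(pod *corev1.Pod) string {
    reason := string(pod.Status.Phase)
    for i := len(pod.Status.ContainerStatuses) - 1; i >= 0; i-- {
        if t := pod.Status.ContainerStatuses[i].State.Terminated; t != nil && t.Reason != "" {
            reason = t.Reason
        }
    }
    return reason
}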

How can we reproduce it (as minimally and precisely as possible)?

Test job

apiVersion: batch/v1
kind: Job
metadata:
  name: job-istio-proxy-test
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600
  template:
    metadata:
      labels:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: somejob
        image: amouat/network-utils:latest
        command:
        - /bin/bash
        - -c
        - |
          # Wait for sidecar to be ready
          until curl -fsSI -o /dev/null http://localhost:15021/healthz/ready; do echo \"Waiting for Sidecar...\"; sleep 2; done; echo "Sidecar available. Running the command..."
          # Simulate some useful job
          sleep 10
          # Simulate job failure
          false
          # Shutdown sidecar and return job exit code
          ret=$(echo $?); echo "Command completed. Terminating sidecar..."; curl -fsSI -o /dev/null -X POST http://localhost:15000/quitquitquit; sleep 5; exit $ret
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
      restartPolicy: Never
      securityContext:
        runAsUser: 65000
        runAsGroup: 65000

Anything else we need to know?

Istio version used - 1.12.2

Kubernetes version

1.22.5

Cloud provider

On premise

OS version

Ubuntu 20.04.3
5.4.0-86-generic #97-Ubuntu SMP Fri Sep 17 19:19:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

kubespray 1.18.0

Container runtime (CRI) and version (if applicable)

containerd 1.5.9

Related plugins (CNI, CSI, ...) and versions (if applicable)

@rdavyd rdavyd added the kind/bug Categorizes issue or PR as related to a bug. label Jan 24, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2022
@rdavyd
Author

rdavyd commented Jan 24, 2022

/sig api-machinery
/sig cli

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cli Categorizes an issue or PR as relevant to SIG CLI. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 24, 2022
@rdavyd
Author

rdavyd commented Jan 24, 2022

Refers to istio/istio#11659

@liggitt liggitt added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 25, 2022
@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Jan 25, 2022
@ehashman
Member

/triage accepted
/help

This is a display issue in the CLI.

@k8s-ci-robot
Contributor

@ehashman:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

This is a display issue in the CLI.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2022
@ehashman ehashman moved this from Triage to Needs Information in SIG Node Bugs Jan 26, 2022
@ehashman ehashman moved this from Needs Information to Triaged in SIG Node Bugs Jan 26, 2022
@Chalmiller

Hi @ehashman, do you think this is a good issue to start contributing with? If so, I'd like to assign myself to it.

@kkkkun
Member

kkkkun commented Jan 30, 2022

This may be caused by the terminated reason being empty in the error case. I fixed this similarly to how init containers are handled:

for i := range pod.Status.InitContainerStatuses {

/assign

@kkkkun
Member

kkkkun commented Feb 7, 2022

I think pod status will be 'Completed' only when container.State.Terminated.ExitCode == 0 && len(container.State.Terminated.Reason) != 0. Otherwise, it should be Error.

https://github.com/kkkkun/kubernetes/blob/f44a6791e8d072cfbba2b77528bf6ff5b4336ffc/pkg/printers/internalversion/printers.go#L814
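
A minimal reading of that rule, sketched in Go (illustrative only; not the code from the linked branch):

package example

import corev1 "k8s.io/api/core/v1"

// terminatedDisplayReason treats a terminated container as "Completed"
// only if it exited 0 and reported a non-empty reason; anything else
// surfaces as "Error".
func terminatedDisplayReason(t *corev1.ContainerStateTerminated) string {
    if t == nil {
        return ""
    }
    if t.ExitCode == 0 && t.Reason != "" {
        return "Completed"
    }
    return "Error"
}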

@rdavyd
Author

rdavyd commented Feb 7, 2022

I think pod status will be 'Completed' only when container.State.Terminated.ExitCode == 0 && len(container.State.Terminated.Reason) != 0. Otherwise, it should be Error.

https://github.com/kkkkun/kubernetes/blob/f44a6791e8d072cfbba2b77528bf6ff5b4336ffc/pkg/printers/internalversion/printers.go#L814

I presume the error happens because the pod status is evaluated from the status of the last container processed in the array. But the pod status should be Error if any of the containers returns a non-zero exit code.

@kkkkun
Member

kkkkun commented Feb 7, 2022

I think pod status will be 'Completed' only when container.State.Terminated.ExitCode == 0 && len(container.State.Terminated.Reason) != 0. Otherwise, it should be Error.
https://github.com/kkkkun/kubernetes/blob/f44a6791e8d072cfbba2b77528bf6ff5b4336ffc/pkg/printers/internalversion/printers.go#L814

But the pod status should be Error if any of the containers returns a non-zero exit code.

Yes, I agree. So I added hasError; if hasError is true, the pod status is reset to Error.
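
For reference, an illustrative sketch of that approach (the exact code in the PR may differ, and hasError is just the name used above):

package example

import corev1 "k8s.io/api/core/v1"

// adjustPodReason runs after the existing per-container loop has computed
// the displayed reason: if any container terminated with a non-zero exit
// code, report Error instead of Completed.
func adjustPodReason(pod *corev1.Pod, reason string) string {
    hasError := false
    for _, cs := range pod.Status.ContainerStatuses {
        if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
            hasError = true
            break
        }
    }
    if hasError && reason == "Completed" {
        return "Error"
    }
    return reason
}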

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2022
@rdavyd
Author

rdavyd commented May 9, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2022
@kkkkun
Member

kkkkun commented Aug 5, 2022

I could not reproduce this, so the PR may not fix this case.

Could you please paste the pod output from kubectl get po xxx -o yaml? @rdavyd

@rdavyd
Author

rdavyd commented Aug 8, 2022

@kkkkun Sure

apiVersion: batch/v1
kind: Job
metadata:
  name: wrong-pod-status
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600
  template:
    spec:
      containers:
      - name: task-main
        image: busybox:latest
        command:
        - /bin/sh
        - -c
        - |
          # Simulate some useful job
          sleep 1
          # Simulate job failure
          false
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
      - name: sidecar
        image: busybox:latest
        command:
        - /bin/sh
        - -c
        - |
          # Simulate sidecar work
          sleep 2
          # Sidecar successful exit
          true
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
      restartPolicy: Never
      securityContext:
        runAsUser: 65000
        runAsGroup: 65000

The pod status should display Error, but it shows Completed. If you rename the container task-main to just main, the pod status is reported correctly.

@kkkkun
Member

kkkkun commented Aug 8, 2022

What is the phase of the pod status from kube-apiserver? Is it returned correctly by kube-apiserver?
I want to get more info from kubectl get po xxx -o yaml, such as:

status:
  conditions:
  containerStatuses:
  - containerID: docker://xx
    image: xx
    imageID: xx
    lastState: {}
    name: test-container
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2022-08-08T02:49:31Z"
  phase: Running
  podIP: xx.xx.xx.xx
  qosClass: Guaranteed
  startTime: "2022-08-08T02:49:29Z"

@rdavyd
Author

rdavyd commented Aug 8, 2022

Short POD status

$ kubectl get pod wrong-pod-status-smz7x
NAME                     READY   STATUS      RESTARTS   AGE
wrong-pod-status-smz7x   0/2     Completed   0          4m
$

Extended status

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-08-08T07:57:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-08-08T07:57:12Z"
    message: 'containers with unready status: [task-main sidecar]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-08-08T07:57:12Z"
    message: 'containers with unready status: [task-main sidecar]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-08-08T07:57:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://10a04e3623c42ddb81c9d40acd56e2dfee7dca5727262a68ac21bbb526992f89
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:ef320ff10026a50cf5f0213d35537ce0041ac1d96e9b7800bafd8bc9eff6c693
    lastState: {}
    name: sidecar
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://10a04e3623c42ddb81c9d40acd56e2dfee7dca5727262a68ac21bbb526992f89
        exitCode: 0
        finishedAt: "2022-08-08T07:57:12Z"
        reason: Completed
        startedAt: "2022-08-08T07:57:10Z"
  - containerID: containerd://13974698f28f35eff7378b6c73cbca32b78f9e23b7477c85beca2b50fdea35b2
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:ef320ff10026a50cf5f0213d35537ce0041ac1d96e9b7800bafd8bc9eff6c693
    lastState: {}
    name: task-main
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://13974698f28f35eff7378b6c73cbca32b78f9e23b7477c85beca2b50fdea35b2
        exitCode: 1
        finishedAt: "2022-08-08T07:57:11Z"
        reason: Error
        startedAt: "2022-08-08T07:57:10Z"
  hostIP: 10.10.140.142
  phase: Failed
  podIP: 10.10.176.223
  podIPs:
  - ip: 10.10.176.223
  qosClass: Burstable
  startTime: "2022-08-08T07:57:09Z"

EDIT: As of k8s 1.23.7

@rdavyd
Author

rdavyd commented Oct 4, 2022

@kkkkun Hi, any news?

@kkkkun
Member

kkkkun commented Oct 6, 2022

@kkkkun Hi, any news?
The PR is waiting for review: #107865

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@kkkkun
Member

kkkkun commented Feb 8, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 8, 2024
@kkkkun
Member

kkkkun commented Feb 20, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 20, 2024
@Monokaix
Member

Any progress here?

@hshiina

hshiina commented May 3, 2024

If you can enable the SidecarContainers feature gate, which is relatively new (1.28: alpha, 1.29: beta), you may be able to avoid this issue by configuring the sidecar as an init container with restartPolicy: Always.
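
For illustration, a sketch of that pod shape using the k8s.io/api/core/v1 types (assumes the SidecarContainers feature gate is enabled and client libraries from Kubernetes 1.28 or newer; the container names and images are just placeholders taken from this issue):

package example

import corev1 "k8s.io/api/core/v1"

// sidecarJobPodSpec sketches a Job pod spec where the sidecar runs as a
// restartable init container, so only the main container's result drives
// the pod's terminal phase and the sidecar is shut down automatically
// once the main container exits.
func sidecarJobPodSpec() corev1.PodSpec {
    restartAlways := corev1.ContainerRestartPolicyAlways
    return corev1.PodSpec{
        RestartPolicy: corev1.RestartPolicyNever,
        InitContainers: []corev1.Container{{
            Name:          "istio-proxy",
            Image:         "istio/proxyv2:1.12.2",
            RestartPolicy: &restartAlways, // native sidecar
        }},
        Containers: []corev1.Container{{
            Name:  "somejob",
            Image: "amouat/network-utils:latest",
        }},
    }
}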

@rdavyd
Author

rdavyd commented May 7, 2024

If you can enable the SidecarContainers feature gate, which is relatively new (1.28: alpha, 1.29: beta), you may be able to avoid this issue by configuring the sidecar as an init container with restartPolicy: Always.

True, but it is a workaround. It still has to be supported by multiple projects like Istio.

Projects
Status: Backlog
9 participants