System Administration & Network Administration
kubernetes nvidia gpu containerd
Updated Fri, 30 Sep 2022 05:45:22 GMT

Pod is stuck in PodInitializing status when an initContainer is OOMKilled


I have the following on-prem Kubernetes environment:

  • OS: Red Hat Enterprise Linux release 8.6 (Ootpa)
  • Kubernetes: 1.23.7 (single-node, built with kubeadm)
  • NVIDIA driver: 515.65.01
  • nvidia-container-toolkit: 1.10.0-1.x86_64 (rpm)
  • containerd: v1.6.2
  • nvcr.io/nvidia/k8s-device-plugin:v0.12.2

I run the following Pod on my server; only app2 (initContainer2) uses the GPU. A rough sketch of the manifest follows this list.

initContainer1: app1

initContainer2: app2 (Uses GPU)

container1: app3
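
The sketch below uses placeholder image names, commands, and memory limit rather than my actual values; the relevant point is that only app2 requests a GPU through the nvidia.com/gpu resource exposed by the device plugin.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  initContainers:
    - name: app1
      image: registry.example.com/app1:latest      # placeholder image
      command: ["/bin/sh", "-c", "./setup.sh"]     # placeholder command
    - name: app2                                   # the init container that uses the GPU
      image: registry.example.com/app2:latest      # placeholder image
      command: ["/bin/sh", "-c", "./gpu-job.sh"]   # placeholder command
      resources:
        limits:
          memory: 2Gi                              # placeholder limit that app2 exceeds
          nvidia.com/gpu: 1                        # GPU resource from the NVIDIA device plugin
  containers:
    - name: app3
      image: registry.example.com/app3:latest      # placeholder image
      command: ["/bin/sh", "-c", "./main-app.sh"]  # placeholder command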

When app2 uses too much RAM and is OOMKilled, the Pod should end up in the OOMKilled status, but in my environment it gets stuck in the PodInitializing status.

NAMESPACE     NAME       READY   STATUS            RESTARTS       AGE     IP               NODE      NOMINATED NODE   READINESS GATES
default       gpu-pod    0/1     PodInitializing   0              83m     xxx.xxx.xxx.xxx   xxxxx   <none>           <none>

The result of kubectl describe pod is as follows:

Init Containers:
  app1:
    ...
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 30 Aug 2022 10:50:38 +0900
      Finished:     Tue, 30 Aug 2022 10:50:44 +0900
      ...
  app2:
    ...
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Tue, 30 Aug 2022 10:50:45 +0900
      Finished:     Tue, 30 Aug 2022 10:50:48 +0900
      ...
Containers:
  app3:
    ...
    State:          Waiting
      Reason:       PodInitializing
      ...
    ...

This problem never happens when I replace app2 with another container that doesn't use the GPU, or when I run app2 as a regular container (not an init container) of the Pod. In both cases, the status is properly reported as OOMKilled.

Is this a bug? If so, are there any workarounds?




Solution

The workflow, according to the documentation, is as follows:

Init containers are exactly like regular containers, except:

  • Init containers always run to completion.
  • Each init container must complete successfully before the next one starts.

If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds. However, if the Pod has a restartPolicy of Never, and an init container fails during startup of that Pod, Kubernetes treats the overall Pod as failed.
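
As a concrete illustration of that last rule, here is a minimal stand-alone example (using busybox, not the questioner's actual manifest): with restartPolicy: Never, an init container that exits non-zero makes the whole Pod Failed, whereas with the default restartPolicy: Always the kubelet keeps restarting the init container (Init:Error, then Init:CrashLoopBackOff).

apiVersion: v1
kind: Pod
metadata:
  name: init-fail-demo                       # illustrative name
spec:
  restartPolicy: Never                       # Never: a failing init container marks the Pod as Failed
                                             # Always/OnFailure: the kubelet restarts the init container
  initContainers:
    - name: always-fails
      image: busybox:1.36
      command: ["/bin/sh", "-c", "exit 1"]   # simulated init-container failure
  containers:
    - name: main
      image: busybox:1.36
      command: ["/bin/sh", "-c", "sleep 3600"]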

So, as far as I know, the Pod is in the correct state.





Comments (3)

  • +0 – Thank you for the answer. The status of the Pod when app2 is OOMKilled depends on which container is used as app2. What causes this difference? Exit codes when OOMKilled? — Aug 30, 2022 at 04:26  
  • +0 – Please check github.com/kubernetes/kubernetes/pull/104650/files. It should mark the init container as failed, but I see the return status is 0; it might be how the init container's script is written or how it is used. — Aug 30, 2022 at 05:01
  • +0 – It seems app2's non-init process was OOMKilled and the init process (the container's main process) didn't recognize that; see the sketch after these comments. Thanks! — Aug 30, 2022 at 08:15
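
Following up on that last comment, here is a hedged sketch of how the reported state can arise and a possible workaround; the script names and images are placeholders, not the actual app2. If app2's entrypoint is a shell wrapper that runs the GPU workload as a child process and then continues, the cgroup OOM killer may kill only the child; the wrapper then exits 0, so the container ends up with Reason OOMKilled but Exit Code 0, which matches the describe output above. Running the workload with exec, so that it becomes the container's main process, should let the OOM kill surface as exit code 137, which the kubelet treats as an init-container failure.

# Hypothetical before/after entries for app2 in spec.initContainers
# (image and script names are placeholders).

# Before: the wrapper shell stays as the container's main process. If the
# child GPU job is OOM killed, the wrapper can still finish and exit 0, so
# the container reports Exit Code 0 even though the reason is OOMKilled.
- name: app2
  image: registry.example.com/app2:latest
  command: ["/bin/sh", "-c", "./gpu-job.sh; echo init done"]

# After: exec replaces the shell with the GPU job, so an OOM kill (SIGKILL)
# surfaces as the container's exit code 137, which the kubelet treats as an
# init-container failure.
- name: app2
  image: registry.example.com/app2:latest
  command: ["/bin/sh", "-c", "exec ./gpu-job.sh"]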