I have the following on-prem Kubernetes environment:
And I run the following Pod on my server. Only app2 (initContainer2) uses the GPU.

```
initContainer1: app1
initContainer2: app2 (uses GPU)
container1:     app3
```
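For reference, a minimal Pod manifest matching this layout might look like the sketch below. The images, the memory limit, and the `nvidia.com/gpu` resource key are assumptions; the actual manifest was not shown.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  initContainers:
    - name: app1
      image: app1:latest        # placeholder image
    - name: app2
      image: app2:latest        # placeholder image
      resources:
        limits:
          memory: "1Gi"         # placeholder; app2 exceeds this and is OOM-killed
          nvidia.com/gpu: 1     # only app2 requests the GPU
  containers:
    - name: app3
      image: app3:latest        # placeholder image
```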
When app2 uses too much RAM and is OOM-killed, the Pod should end up in the OOMKilled status, but in my environment it is stuck in the PodInitializing status.
```
NAMESPACE   NAME      READY   STATUS            RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
default     gpu-pod   0/1     PodInitializing   0          83m   xxx.xxx.xxx.xxx   xxxxx   <none>           <none>
```
The output of `kubectl describe pod` is as follows:
```
Init Containers:
  app1:
    ...
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 30 Aug 2022 10:50:38 +0900
      Finished:     Tue, 30 Aug 2022 10:50:44 +0900
    ...
  app2:
    ...
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Tue, 30 Aug 2022 10:50:45 +0900
      Finished:     Tue, 30 Aug 2022 10:50:48 +0900
    ...
Containers:
  app3:
    ...
    State:          Waiting
      Reason:       PodInitializing
    ...
...
```
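If you want to detect this condition programmatically rather than by eyeballing `kubectl describe`, the same information is available in the JSON from `kubectl get pod gpu-pod -o json` under `status.initContainerStatuses`. Below is a small Python sketch; the sample data mirrors the describe output above (field names follow the Kubernetes Pod API, but the values are illustrative):

```python
import json

# Sample status fragment as it would appear in `kubectl get pod -o json`,
# mirroring the describe output above (values are illustrative).
pod_json = json.dumps({
    "status": {
        "phase": "Pending",
        "initContainerStatuses": [
            {"name": "app1",
             "state": {"terminated": {"reason": "Completed", "exitCode": 0}}},
            {"name": "app2",
             "state": {"terminated": {"reason": "OOMKilled", "exitCode": 0}}},
        ],
    }
})

def oom_killed_init_containers(pod: dict) -> list:
    """Return the names of init containers whose state is Terminated/OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        terminated = status.get("state", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names

print(oom_killed_init_containers(json.loads(pod_json)))  # ['app2']
```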
This problem never happens when I replace app2 with another container that doesn't use the GPU, or when I run app2 as a regular container (not an init container) of the Pod. In both cases, the status is properly reported as OOMKilled.
Is this a bug? If so, are there any workarounds?
So, according to the documentation, the workflow is as follows.
Init containers are exactly like regular containers, except:
If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds. However, if the Pod has a restartPolicy of Never, and an init container fails during startup of that Pod, Kubernetes treats the overall Pod as failed.
So, as far as I know, the Pod is in the correct state.
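The documented behavior quoted above can be sketched as a tiny decision function. This is a simplification for illustration only, not the kubelet's actual code:

```python
def next_action(init_container_failed: bool, restart_policy: str) -> str:
    """Simplified sketch of the documented kubelet behavior when an init
    container fails (not the real kubelet logic)."""
    if not init_container_failed:
        return "continue"            # proceed to the next (init) container
    if restart_policy == "Never":
        return "mark pod Failed"     # the overall Pod is treated as failed
    return "restart init container"  # retried until it succeeds

print(next_action(True, "Never"))    # mark pod Failed
print(next_action(True, "Always"))   # restart init container
print(next_action(False, "Never"))   # continue
```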