System Administration & Network Administration
kubernetes nvidia containerd
Updated Fri, 26 Aug 2022 13:17:11 GMT

Containerd fails to start after NVIDIA config


I've followed this official tutorial to give a bare-metal k8s cluster GPU access. However, I received errors while doing so.

Kubernetes 1.21, containerd 1.4.11, Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64).

The NVIDIA driver (version 495, headless) is preinstalled on the host OS.

After pasting the following config into /etc/containerd/config.toml and restarting the service, containerd fails to start with exit code 1 (the restart commands are shown after the config below).

containerd config.toml

(Full systemd log here; the relevant failure lines are quoted in the answer below.)

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"
# Kubernetes doesn't use containerd restart manager.
disabled_plugins = ["restart"]
# NVIDIA CONFIG START HERE
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
# NVIDIA CONFIG ENDS HERE
[debug]
  level = ""
[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
[plugins.linux]
  shim = "/usr/bin/containerd-shim"
  runtime = "/usr/bin/runc"

I can confirm that the NVIDIA driver detects the GPU (an NVIDIA GTX 750 Ti): running nvidia-smi gives the following output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   34C    P8     1W /  38W |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Edit: the modified config.toml that got it to work.
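
The working file isn't preserved inline, but based on the fixes discussed in the answer and comments below (an empty disabled_plugins list and a fully qualified v2 plugin URI for the linux runtime section), it plausibly looked like this; treat it as a reconstruction rather than the author's verbatim file:

version = 2
# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"
# nothing disabled; the old "restart" entry is invalid under the v2 schema
disabled_plugins = []

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
  # old-style [plugins.linux] renamed to its qualified URI
  [plugins."io.containerd.runtime.v1.linux"]
    shim = "/usr/bin/containerd-shim"
    runtime = "/usr/bin/runc"

[debug]
  level = ""

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216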




Solution

As best I can tell, it's this:

Dec 02 03:15:36 k8s-node0 containerd[2179737]: containerd: invalid disabled plugin URI "restart" expect io.containerd.x.vx

Dec 02 03:15:36 k8s-node0 systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE

So if you know that the restart-ish plugin is in fact enabled, you'll need to track down its new URI syntax. But I'd actually recommend just commenting out that stanza, or going with disabled_plugins = [], since the containerd ansible role we use doesn't mention anything about "restart" and does use the = [] flavor.
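
Concretely, the recommended change is a one-liner near the top of the file (sketch):

# Kubernetes doesn't use containerd's restart manager.
disabled_plugins = []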


Tangentially, you may want to restrict your journalctl invocation in the future to just containerd.service, since otherwise it emits a lot of distracting text: journalctl -u containerd.service. You can even restrict it to just the last few lines, which sometimes helps further: journalctl -u containerd.service --lines=250.
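
For reference, the variants side by side (all standard journalctl flags):

# only this unit's log, instead of the whole journal
journalctl -u containerd.service
# just the most recent entries
journalctl -u containerd.service --lines=250
# or follow it live in one terminal while restarting the service in another
journalctl -u containerd.service -f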





Comments (4)

  • +0 – Thanks for the extensive reply. I've tried setting disabled_plugins to an empty list; it gave me a different error: containerd: invalid plugin key URI "linux" expect io.containerd.x.vx. I've attached the complete containerd config.toml in the original post. If you could have a look, that would be great. — Dec 02, 2021 at 13:15
  • +0 – Yes, it seems to be the same problem; linux as an unqualified name is evidently the old style, so what you'll likely want is [plugins."io.containerd.runtime.v1.linux"], just like the [plugins] members at the top of the file and as shown in the template I linked to. — Dec 02, 2021 at 16:51
  • +0 – Thanks for the help; I can now boot up containerd with the integrated config based on the NVIDIA docs. For future reference, I've updated my original post with the working config.toml (see the verification sketch after these comments). — Dec 03, 2021 at 18:22
  • +0 – I'm glad to hear it, and I'm always glad when it's something simple. Good luck on your journey running GPUs in k8s! Please consider putting the config inline in your question, since linking to external sites runs the risk of them being 404 for future generations. — Dec 04, 2021 at 21:07
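
For anyone landing here later: after applying the fixed config.toml, a quick verification pass might look like this (standard systemd commands; a sketch):

sudo systemctl restart containerd
systemctl is-active containerd                # should print "active"
journalctl -u containerd.service --lines=20   # should show a clean startup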