System Administration & Network Administration
kubernetes kubeadm containerd
Updated Mon, 06 Jun 2022 03:48:14 GMT

Offline installation of kubernetes fails when using containerd as a CRI


I had to build a bare-metal Kubernetes cluster with no Internet connection for some reason.

As dockershim was deprecated, I decided to use containerd as a CRI, but the offline installation with kubeadm failed while executing kubeadm init due to timeout.

    Unfortunately, an error has occurred:
            timed out waiting for the condition
    This error is likely caused by:
            - The kubelet is not running
            - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
    If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
            - 'systemctl status kubelet'
            - 'journalctl -xeu kubelet'

And I can see a lot of error logs as a result of journalctl -u kubelet -f:

11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.473188    9299 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://133.117.20.57:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/rhel8?timeout=10s": dial tcp 133.117.20.57:6443: connect: connection refused
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.533555    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:25 rhel8 kubelet[9299]: I1124 16:25:25.588986    9299 kubelet_node_status.go:71] "Attempting to register node" node="rhel8"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.589379    9299 kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://133.117.20.57:6443/api/v1/nodes\": dial tcp 133.117.20.57:6443: connect: connection refused" node="rhel8"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.634625    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.735613    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.835815    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:25 rhel8 kubelet[9299]: E1124 16:25:25.936552    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.036989    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.137464    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.238594    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.338704    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:26 rhel8 kubelet[9299]: E1124 16:25:26.394465    9299 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rhel8.16ba6aab63e58bd8", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"rhel8", UID:"rhel8", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"rhel8"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc05f9812b2b227d8, ext:5706873656, loc:(*time.Location)(0x55a228f25680)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc05f9812b2b227d8, ext:5706873656, loc:(*time.Location)(0x55a228f25680)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://133.117.20.57:6443/api/v1/namespaces/default/events": dial tcp 133.117.20.57:6443: connect: connection refused'(may retry after sleeping)
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.143503    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.244526    9299 kubelet.go:2407] "Error getting node" err="node \"rhel8\" not found"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.302890    9299 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.2\": failed to pull image \"k8s.gcr.io/pause:3.2\": failed to pull and unpack image \"k8s.gcr.io/pause:3.2\": failed to resolve reference \"k8s.gcr.io/pause:3.2\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.2\": dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:39732->[::1]:53: read: connection refused"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.302949    9299 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.2\": failed to pull image \"k8s.gcr.io/pause:3.2\": failed to pull and unpack image \"k8s.gcr.io/pause:3.2\": failed to resolve reference \"k8s.gcr.io/pause:3.2\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.2\": dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:39732->[::1]:53: read: connection refused" pod="kube-system/kube-scheduler-rhel8"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.302989    9299 kuberuntime_manager.go:815] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.2\": failed to pull image \"k8s.gcr.io/pause:3.2\": failed to pull and unpack image \"k8s.gcr.io/pause:3.2\": failed to resolve reference \"k8s.gcr.io/pause:3.2\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.2\": dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:39732->[::1]:53: read: connection refused" pod="kube-system/kube-scheduler-rhel8"
11 24 16:25:27 rhel8 kubelet[9299]: E1124 16:25:27.303080    9299 pod_workers.go:765] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-rhel8_kube-system(e5616b23d0312e4995fcb768f04aabbb)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-scheduler-rhel8_kube-system(e5616b23d0312e4995fcb768f04aabbb)\\\": rpc error: code = Unknown desc = failed to get sandbox image \\\"k8s.gcr.io/pause:3.2\\\": failed to pull image \\\"k8s.gcr.io/pause:3.2\\\": failed to pull and unpack image \\\"k8s.gcr.io/pause:3.2\\\": failed to resolve reference \\\"k8s.gcr.io/pause:3.2\\\": failed to do request: Head \\\"https://k8s.gcr.io/v2/pause/manifests/3.2\\\": dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:39732->[::1]:53: read: connection refused\"" pod="kube-system/kube-scheduler-rhel8" podUID=e5616b23d0312e4995fcb768f04aabbb

When I do the same thing with the Internet connection, the installation succeeds. And when using docker instead of containerd, the installation is successfully done even if there's no Internet connection.




Solution

It was caused by the containerd which has the setting to pull its sandbox_image from k8s.gcr.io by default, even though there's no Internet connection.

This setting is specified around line 57 of /etc/containerd/config.toml file.

   [plugins."io.containerd.grpc.v1.cri"]
     <snip>
     sandbox_image = "k8s.gcr.io/pause:3.2"

My current k8s cluster is v1.22.1, so it uses pause:3.5 rather than pause:3.2. By changing this to the existing image (k8s.gcr.io/pause:3.5 this time), I successfully build my kubernetes cluster without the Internet connection.