Unix & Linux
linux mount chroot unshare
Updated Fri, 09 Sep 2022 20:25:05 GMT

Why does unshare with chroot not isolate /dev like /proc?


I am following "Container from scratch" by Kevin Boone.

I have an Alpine mini root filesystem under /mnt/container/.

I am a little puzzled about how the mount works with chroot and unshare involved.

Without unshare, if we run

chroot /mnt/container /bin/sh -l

we get a container (of sorts) with its "/" (root) at the host machine's /mnt/container.

Inside the container, if we run the following commands:

mount -t proc proc /proc >& /dev/null
mount -t devtmpfs dev /dev/ >& /dev/null

we see that we have mounted the host system's /proc and /dev: ps -ef shows the processes running on the host, and a file created in /dev also appears on the host. This is expected because there is still no namespace isolation.

To create the namespace isolation, we run:

unshare -mpfu chroot /mnt/container /bin/sh -l

and then inside the container we run

mount -t proc proc /proc >& /dev/null
mount -t devtmpfs dev /dev/ >& /dev/null

This time ps -ef shows only the two processes that are inside the container. My understanding (correct me if I am wrong) is that mount -t proc proc /proc >& /dev/null did not mount the host system's /proc but mounted a new filesystem of type proc at /proc, hence the isolation.

But, and this is the question, /dev inside the container is still the same /dev as on the host. I can still create files inside /dev and they show up on the host machine.

Why is /dev not isolated like /proc?




Solution

devtmpfs isn't namespaced (see its shmem-based context initialisation), and it's also not intended for use inside user namespaces (see "Is devtmpfs special with respect to namespaces? a permissions problem").
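
One way to observe this from a shell is to compare the device number of whatever is mounted at /dev on the host and in the container. /proc/self/mountinfo lists it as major:minor in the third field, and two mounts that report the same pair are views of the same filesystem instance. The following is only a minimal sketch: mount_sb_id is a hypothetical helper name, and its optional second argument (a mountinfo file to read) is an assumption added so the function can be tried on saved output.

```shell
#!/bin/sh
# Hypothetical helper: print "major:minor fstype" for the mount at a given
# mount point, read from a mountinfo-style file (default: /proc/self/mountinfo).
mount_sb_id() {
    awk -v mp="$1" '
        $5 == mp {
            # Fields 7.. are optional; the filesystem type follows the "-" separator.
            for (i = 7; i <= NF; i++)
                if ($i == "-") { id = $3; fs = $(i + 1); break }
        }
        # If several mounts share the mount point, the last line listed is reported.
        END { if (id != "") print id, fs }
    ' "${2:-/proc/self/mountinfo}"
}
```

Running mount_sb_id /dev and mount_sb_id /proc on the host and again inside the container should, with the setup from the question, report the same pair for /dev in both places and a different pair for /proc.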

There have been attempts to change this, for example this 2014 patch series submitted by Seth Forshee. But the kernel maintainers, Greg KH in particular, are of the opinion that sharing a devtmpfs instance between the host and user namespaces, even a namespace-aware instance, isn't useful:

Splitting a namespaced devtmpfs from loopdevfs discussion might be sensible. However, in defense of a namespaced devtmpfs I'd say that for userspace to, at every container startup, bind-mount in devices from the global devtmpfs into a private tmpfs (for systemd's sake it can't just be on the container rootfs), seems like something worth avoiding.

I think having to pick and choose what device nodes you want in a container is a good thing. Becides, you would have to do the same thing in the kernel anyway, what's wrong with userspace making the decision here, especially as it knows exactly what it wants to do much more so than the kernel ever can.

Basically, if you need /dev inside your user namespace, you should populate it manually.
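
As a rough illustration of populating /dev by hand, the sketch below mounts a private tmpfs and creates a handful of common character devices with mknod. The names populate_dev and run and the DRY_RUN switch are hypothetical, introduced only for this example. The list of devices is an assumption about what a minimal shell needs, and the sketch assumes real root as in the question's unshare -mpfu invocation. Inside an unprivileged user namespace mknod cannot create device nodes, and bind-mounting individual nodes from the host is the usual alternative.

```shell
#!/bin/sh
# Hypothetical sketch: build a private /dev instead of mounting devtmpfs.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

populate_dev() {
    dev=${1:-/dev}
    # A private tmpfs: files created here do not appear in the host's /dev.
    run mount -t tmpfs -o mode=755,nosuid tmpfs "$dev"
    # Each line is: name major minor (standard Linux character device numbers).
    while read -r name major minor; do
        run mknod -m 666 "$dev/$name" c "$major" "$minor"
    done <<EOF
null 1 3
zero 1 5
full 1 7
random 1 8
urandom 1 9
tty 5 0
EOF
}
```

Inside the container shell started with unshare -mpfu chroot /mnt/container /bin/sh -l, calling populate_dev /dev would take the place of the mount -t devtmpfs dev /dev/ step. Running it with DRY_RUN=1 first shows what would be executed.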

The /proc behaviour is explained in "How does /proc interact with PID namespaces?".