I'm trying to understand containers and stumbled upon a trick apparently found by the LXC developers (see this runc PR): you can call pivot_root(".", "."), which avoids the need for a directory to put the old root into. However, this makes the mount namespace behave strangely:
unshare --user --map-root-user --mount bash -c "
mount --bind containerfs bindmountpoint
cd bindmountpoint
pivot_root . .
# this is fine:
ls -l /
# this is not fine:
ls -l /..
"
The parent of ., accessed via ./.. or /.. or /proc/<any>/cwd/.., points to the root of the root mount namespace (I haven't tried nesting yet)! It points neither to the parent of containerfs nor to the parent of bindmountpoint, but really to the root of the root/outer mount namespace.
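One way to inspect this state is to list the mounts that the kernel reports at / for the current process. The sketch below is only a diagnostic aid: the helper name mounts_at_root is made up for illustration, and it assumes procfs is reachable at /proc from inside the new root. It filters /proc/self/mountinfo on its fifth field, which is the mount point as seen from the process's root.

```shell
# Hypothetical helper (name chosen for this sketch): print every mount
# whose mount point, relative to this process's root, is exactly "/".
# Assumes procfs is mounted at /proc.
mounts_at_root() {
    # Field 5 of /proc/self/mountinfo is the mount point.
    awk '$5 == "/"' /proc/self/mountinfo
}
```

If this prints more than one entry after pivot_root . ., that would be consistent with the old root still being mounted on top of the new one.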
Similarly, when I try nsenter --user --preserve-credentials --mount --target=<pid>, the new process has its CWD placed at the root of the root mount namespace.
None of this happens when I pivot_root(".", "oldroot"). The behaviour also disappears when unmounting the old root, either via a file descriptor, umount -l / or umount -l /proc/1/cwd.
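The workaround described above can be sketched as a small shell function. This is a minimal sketch, not a definitive recipe: the name pivot_and_detach is hypothetical, its argument is assumed to already be a mount point (for example the bind mount from the snippet above), and the new root is assumed to contain a umount binary reachable via PATH. It is meant to run inside an unshared mount namespace.

```shell
# Hypothetical helper: pivot into an existing mount point without an
# "oldroot" directory, then lazily detach the old root.
# Assumes $1 is already a mount point and that we are inside an
# unshared mount namespace with sufficient privileges.
pivot_and_detach() {
    cd "$1" || return 1          # the new root becomes the CWD
    pivot_root . . || return 1   # no separate put_old directory needed
    umount -l / || return 1      # detach the old root, as reported above
    cd /                         # re-resolve the CWD inside the new root
}
```

Inside the unshare call it would take the place of the cd and pivot_root lines, for example pivot_and_detach bindmountpoint.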
I have also tried this sequence of syscalls from a custom C program, doing everything in a single process, since the documentation of pivot_root only gives guarantees about the current process. The behaviour is the same as with the multi-process steps using CLI tools shown above.
Tested on a 5.3 kernel.
What is going on when I run pivot_root(".", ".")?