buildFHSUserEnv is meant primarily for running 3rd-party software
which is difficult to patch for NixOS. Such software is often built to
run from /opt. Currently, running such a software from FHS environment
is difficult for two reasons:
1. If the 3rd-party software is put into the Nix store via a simple
derivation (with e.g. installPhase = "dpkg-deb -x $src $out"), the
content of /opt directory of that derivation does not appear in the
FHSEnv even if the derivation is specified in targetPkgs. This is
why we change env.nix.
2. If using buildFHSUserEnvChroot and the host system has the /opt
directory, it always gets bind-mounted to the FHSEnv even if some
targetPkgs contain /opt (NB buildFHSUserEnvBubblewrap does not have
this problem). If that directory is not accessible for non-root
users (which is what docker's containerd does with /opt :-(), the
user running the FHSEnv cannot use it.
With the change in chrootenv.c, /opt is not bind-mounted to the
container, but instead created as user-modifiable symlink to
/host/opt (see the init attribute in
build-fhs-userenv/default.nix). If needed, the user can remove this
symlink and create an empty /opt directory which is under his/her
control.
If run as root we were leaking mounts to the parent namespace,
which lead to an error when removing the temporary mountroot.
To fix this we remount the whole tree as private as soon as we created
the new mountenamespace.
To avoid symlink loops to /host in nested chrootenvs we need to remove
one level of indirection. This is also what's generally expected of
/host contents.
* Remove unused argument from pivot_root;
* Factor out tmpdir creation into a separate function;
* Remove unused fstype from bind mount;
* Use unlink instead of a treewalk to remove empty temporary directory.
The problem with stacking chrootenv before was that CLONE_NEWUSER cannot
be used when a child uses chroot. So instead of that we use pivot_root
which replaces root in the whole namespace. This requires our new root
to be an actual fs so we mount tmpfs.
glibc 2.27 (and possibly other versions) can't handle an `nopenfd` value larger than 2^19 in `ntfw`, which is problematic if you've set the maximum number of fds per process to a value higher than that.
Changes:
* doesn't handle root user separately
* doesn't chdir("/") which makes using it seamless
* only bind mounts, doesn't symlink (i.e. files)
Incidentally, fixes#33106.
It's about two times shorter than the previous version, and much
easier to read/follow through. It uses GLib quite heavily, along with
RAII (available in GCC/Clang).