nixpkgs/nixos/tests/systemd-confinement.nix

185 lines
6.8 KiB
Nix
Raw Normal View History

import ./make-test-python.nix {
name = "systemd-confinement";
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
2022-03-20 23:15:30 +00:00
nodes.machine = { pkgs, lib, ... }: let
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
testServer = pkgs.writeScript "testserver.sh" ''
#!${pkgs.runtimeShell}
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
export PATH=${lib.escapeShellArg "${pkgs.coreutils}/bin"}
${lib.escapeShellArg pkgs.runtimeShell} 2>&1
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
echo "exit-status:$?"
'';
testClient = pkgs.writeScriptBin "chroot-exec" ''
#!${pkgs.runtimeShell} -e
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
output="$(echo "$@" | nc -NU "/run/test$(< /teststep).sock")"
ret="$(echo "$output" | sed -nre '$s/^exit-status:([0-9]+)$/\1/p')"
echo "$output" | head -n -1
exit "''${ret:-1}"
'';
nixos/systemd-confinement: Allow shipped unit file In issue #157787 @martined wrote: Trying to use confinement on packages providing their systemd units with systemd.packages, for example mpd, fails with the following error: system-units> ln: failed to create symbolic link '/nix/store/...-system-units/mpd.service': File exists This is because systemd-confinement and mpd both provide a mpd.service file through systemd.packages. (mpd got updated that way recently to use upstream's service file) To address this, we now place the unit file containing the bind-mounted paths of the Nix closure into a drop-in directory instead of using the name of a unit file directly. This does come with the implication that the options set in the drop-in directory won't apply if the main unit file is missing. In practice however this should not happen for two reasons: * The systemd-confinement module already sets additional options via systemd.services and thus we should get a main unit file * In the unlikely event that we don't get a main unit file regardless of the previous point, the unit would be a no-op even if the options of the drop-in directory would apply Another thing to consider is the order in which those options are merged, since systemd loads the files from the drop-in directory in alphabetical order. So given that we have confinement.conf and overrides.conf, the confinement options are loaded before the NixOS overrides. Since we're only setting the BindReadOnlyPaths option, the order isn't that important since all those paths are merged anyway and we still don't lose the ability to reset the option since overrides.conf comes afterwards. Fixes: https://github.com/NixOS/nixpkgs/issues/157787 Signed-off-by: aszlig <aszlig@nix.build>
2022-02-02 12:12:49 +00:00
mkTestStep = num: {
testScript,
config ? {},
serviceName ? "test${toString num}",
}: {
systemd.sockets.${serviceName} = {
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
description = "Socket for Test Service ${toString num}";
wantedBy = [ "sockets.target" ];
socketConfig.ListenStream = "/run/test${toString num}.sock";
socketConfig.Accept = true;
};
nixos/systemd-confinement: Allow shipped unit file In issue #157787 @martined wrote: Trying to use confinement on packages providing their systemd units with systemd.packages, for example mpd, fails with the following error: system-units> ln: failed to create symbolic link '/nix/store/...-system-units/mpd.service': File exists This is because systemd-confinement and mpd both provide a mpd.service file through systemd.packages. (mpd got updated that way recently to use upstream's service file) To address this, we now place the unit file containing the bind-mounted paths of the Nix closure into a drop-in directory instead of using the name of a unit file directly. This does come with the implication that the options set in the drop-in directory won't apply if the main unit file is missing. In practice however this should not happen for two reasons: * The systemd-confinement module already sets additional options via systemd.services and thus we should get a main unit file * In the unlikely event that we don't get a main unit file regardless of the previous point, the unit would be a no-op even if the options of the drop-in directory would apply Another thing to consider is the order in which those options are merged, since systemd loads the files from the drop-in directory in alphabetical order. So given that we have confinement.conf and overrides.conf, the confinement options are loaded before the NixOS overrides. Since we're only setting the BindReadOnlyPaths option, the order isn't that important since all those paths are merged anyway and we still don't lose the ability to reset the option since overrides.conf comes afterwards. Fixes: https://github.com/NixOS/nixpkgs/issues/157787 Signed-off-by: aszlig <aszlig@nix.build>
2022-02-02 12:12:49 +00:00
systemd.services."${serviceName}@" = {
description = "Confined Test Service ${toString num}";
confinement = (config.confinement or {}) // { enable = true; };
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
serviceConfig = (config.serviceConfig or {}) // {
ExecStart = testServer;
StandardInput = "socket";
};
} // removeAttrs config [ "confinement" "serviceConfig" ];
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
__testSteps = lib.mkOrder num (''
machine.succeed("echo ${toString num} > /teststep")
'' + testScript);
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
};
in {
imports = lib.imap1 mkTestStep [
{ config.confinement.mode = "chroot-only";
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
testScript = ''
with subtest("chroot-only confinement"):
systemd: 247.6 -> 249.4 This updates systemd to version v249.4 from version v247.6. Besides the many new features that can be found in the upstream repository they also introduced a bunch of cleanup which ended up requiring a few more patches on our side. a) 0022-core-Handle-lookup-paths-being-symlinks.patch: The way symlinked units were handled was changed in such that the last name of a unit file within one of the unit directories (/run/systemd/system, /etc/systemd/system, ...) is used as the name for the unit. Unfortunately that code didn't take into account that the unit directories themselves could already be symlinks and thus caused all our units to be recognized slightly different. There is an upstream PR for this new patch: https://github.com/systemd/systemd/pull/20479 b) The way the APIVFS is setup has been changed in such a way that we now always have /run. This required a few changes to the confinement tests which did assert that they didn't exist. Instead of adding another patch we can just adopt the upstream behavior. An empty /run doesn't seem harmful. As part of this work I refactored the confinement test just a little bit to allow better debugging of test failures. Previously it would just fail at some point and it wasn't obvious which of the many commands failed or what the unexpected string was. This should now be more obvious. c) Again related to the confinement tests the way a file was tested for being accessible was optimized. Previously systemd would in some situations open a file twice during that check. This was reduced to one operation but required the procfs to be mounted in a units namespace. An upstream bug was filed and fixed. We are now carrying the essential patch to fix that issue until it is backported to a new release (likely only version 250). The good part about this story is that upstream systemd now has a test case that looks very similar to one of our confinement tests. Hopefully that will lead to less friction in the long run. https://github.com/systemd/systemd/issues/20514 https://github.com/systemd/systemd/pull/20515 d) Previously we could grep for dlopen( somewhat reliably but now upstream started using a wrapper around dlopen that is most of the time used with linebreaks. This makes using grep not ergonomic anymore. With this bump we are grepping for anything that looks like a dynamic library name (in contrast to a dlopen(3) call) and replace those instead. That seems more robust. Time will tell if this holds. I tried using coccinelle to patch all those call sites using its tooling but unfornately it does stumble upon the _cleanup_ annotations that are very common in the systemd code. e) We now have some machinery for libbpf support in our systemd build. That being said it doesn't actually work as generating some skeletons doesn't work just yet. It fails with the below error message and is disabled by default (in both minimal and the regular build). > FAILED: src/core/bpf/socket_bind/socket-bind.skel.h > /build/source/tools/build-bpf-skel.py --clang_exec /nix/store/x1bi2mkapk1m0zq2g02nr018qyjkdn7a-clang-wrapper-12.0.1/bin/clang --llvm_strip_exec /nix/store/zm0kqan9qc77x219yihmmisi9g3sg8ns-llvm-12.0.1/bin/llvm-strip --bpftool_exec /nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool --arch x86_64 ../src/core/bpf/socket_bind/socket-bind.bpf.c src/core/bpf/socket_bind/socket-bind.skel.h > libbpf: elf: socket_bind_bpf is not a valid eBPF object file > Error: failed to open BPF object file: BPF object format invalid > Traceback (most recent call last): > File "/build/source/tools/build-bpf-skel.py", line 128, in <module> > bpf_build(args) > File "/build/source/tools/build-bpf-skel.py", line 92, in bpf_build > gen_bpf_skeleton(bpftool_exec=args.bpftool_exec, > File "/build/source/tools/build-bpf-skel.py", line 63, in gen_bpf_skeleton > skel = subprocess.check_output(bpftool_args, universal_newlines=True) > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 424, in check_output > return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 528, in run > raise CalledProcessError(retcode, process.args, > subprocess.CalledProcessError: Command '['/nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool', 'g', 's', '../src/core/bpf/socket_bind/socket-bind.bpf.o']' returned non-zero exit status 255. > [102/1457] Compiling C object src/journal/libjournal-core.a.p/journald-server.c.oapture output)put)ut) > ninja: build stopped: subcommand failed. f) We do now have support for TPM2 based disk encryption in our systemd build. The actual bits and pieces to make use of that are missing but there are various ongoing efforts in that direction. There is also the story about systemd in our initrd to enable this being used for root volumes. None of this will yet work out of the box but we can start improving on that front. g) FIDO2 support was added systemd and consequently we can now use that. Just with TPM2 there hasn't been any integration work with NixOS and instead this just adds that capability to work on that. Co-Authored-By: Jörg Thalheim <joerg@thalheim.io>
2021-08-30 13:10:54 +00:00
paths = machine.succeed('chroot-exec ls -1 / | paste -sd,').strip()
assert_eq(paths, "bin,nix,run")
uid = machine.succeed('chroot-exec id -u').strip()
assert_eq(uid, "0")
machine.succeed("chroot-exec chown 65534 /bin")
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
'';
}
{ testScript = ''
with subtest("full confinement with APIVFS"):
systemd: 247.6 -> 249.4 This updates systemd to version v249.4 from version v247.6. Besides the many new features that can be found in the upstream repository they also introduced a bunch of cleanup which ended up requiring a few more patches on our side. a) 0022-core-Handle-lookup-paths-being-symlinks.patch: The way symlinked units were handled was changed in such that the last name of a unit file within one of the unit directories (/run/systemd/system, /etc/systemd/system, ...) is used as the name for the unit. Unfortunately that code didn't take into account that the unit directories themselves could already be symlinks and thus caused all our units to be recognized slightly different. There is an upstream PR for this new patch: https://github.com/systemd/systemd/pull/20479 b) The way the APIVFS is setup has been changed in such a way that we now always have /run. This required a few changes to the confinement tests which did assert that they didn't exist. Instead of adding another patch we can just adopt the upstream behavior. An empty /run doesn't seem harmful. As part of this work I refactored the confinement test just a little bit to allow better debugging of test failures. Previously it would just fail at some point and it wasn't obvious which of the many commands failed or what the unexpected string was. This should now be more obvious. c) Again related to the confinement tests the way a file was tested for being accessible was optimized. Previously systemd would in some situations open a file twice during that check. This was reduced to one operation but required the procfs to be mounted in a units namespace. An upstream bug was filed and fixed. We are now carrying the essential patch to fix that issue until it is backported to a new release (likely only version 250). The good part about this story is that upstream systemd now has a test case that looks very similar to one of our confinement tests. Hopefully that will lead to less friction in the long run. https://github.com/systemd/systemd/issues/20514 https://github.com/systemd/systemd/pull/20515 d) Previously we could grep for dlopen( somewhat reliably but now upstream started using a wrapper around dlopen that is most of the time used with linebreaks. This makes using grep not ergonomic anymore. With this bump we are grepping for anything that looks like a dynamic library name (in contrast to a dlopen(3) call) and replace those instead. That seems more robust. Time will tell if this holds. I tried using coccinelle to patch all those call sites using its tooling but unfornately it does stumble upon the _cleanup_ annotations that are very common in the systemd code. e) We now have some machinery for libbpf support in our systemd build. That being said it doesn't actually work as generating some skeletons doesn't work just yet. It fails with the below error message and is disabled by default (in both minimal and the regular build). > FAILED: src/core/bpf/socket_bind/socket-bind.skel.h > /build/source/tools/build-bpf-skel.py --clang_exec /nix/store/x1bi2mkapk1m0zq2g02nr018qyjkdn7a-clang-wrapper-12.0.1/bin/clang --llvm_strip_exec /nix/store/zm0kqan9qc77x219yihmmisi9g3sg8ns-llvm-12.0.1/bin/llvm-strip --bpftool_exec /nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool --arch x86_64 ../src/core/bpf/socket_bind/socket-bind.bpf.c src/core/bpf/socket_bind/socket-bind.skel.h > libbpf: elf: socket_bind_bpf is not a valid eBPF object file > Error: failed to open BPF object file: BPF object format invalid > Traceback (most recent call last): > File "/build/source/tools/build-bpf-skel.py", line 128, in <module> > bpf_build(args) > File "/build/source/tools/build-bpf-skel.py", line 92, in bpf_build > gen_bpf_skeleton(bpftool_exec=args.bpftool_exec, > File "/build/source/tools/build-bpf-skel.py", line 63, in gen_bpf_skeleton > skel = subprocess.check_output(bpftool_args, universal_newlines=True) > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 424, in check_output > return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 528, in run > raise CalledProcessError(retcode, process.args, > subprocess.CalledProcessError: Command '['/nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool', 'g', 's', '../src/core/bpf/socket_bind/socket-bind.bpf.o']' returned non-zero exit status 255. > [102/1457] Compiling C object src/journal/libjournal-core.a.p/journald-server.c.oapture output)put)ut) > ninja: build stopped: subcommand failed. f) We do now have support for TPM2 based disk encryption in our systemd build. The actual bits and pieces to make use of that are missing but there are various ongoing efforts in that direction. There is also the story about systemd in our initrd to enable this being used for root volumes. None of this will yet work out of the box but we can start improving on that front. g) FIDO2 support was added systemd and consequently we can now use that. Just with TPM2 there hasn't been any integration work with NixOS and instead this just adds that capability to work on that. Co-Authored-By: Jörg Thalheim <joerg@thalheim.io>
2021-08-30 13:10:54 +00:00
machine.fail("chroot-exec ls -l /etc")
machine.fail("chroot-exec chown 65534 /bin")
assert_eq(machine.succeed('chroot-exec id -u').strip(), "0")
machine.succeed("chroot-exec chown 0 /bin")
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
'';
}
{ config.serviceConfig.BindReadOnlyPaths = [ "/etc" ];
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
testScript = ''
with subtest("check existence of bind-mounted /etc"):
systemd: 247.6 -> 249.4 This updates systemd to version v249.4 from version v247.6. Besides the many new features that can be found in the upstream repository they also introduced a bunch of cleanup which ended up requiring a few more patches on our side. a) 0022-core-Handle-lookup-paths-being-symlinks.patch: The way symlinked units were handled was changed in such that the last name of a unit file within one of the unit directories (/run/systemd/system, /etc/systemd/system, ...) is used as the name for the unit. Unfortunately that code didn't take into account that the unit directories themselves could already be symlinks and thus caused all our units to be recognized slightly different. There is an upstream PR for this new patch: https://github.com/systemd/systemd/pull/20479 b) The way the APIVFS is setup has been changed in such a way that we now always have /run. This required a few changes to the confinement tests which did assert that they didn't exist. Instead of adding another patch we can just adopt the upstream behavior. An empty /run doesn't seem harmful. As part of this work I refactored the confinement test just a little bit to allow better debugging of test failures. Previously it would just fail at some point and it wasn't obvious which of the many commands failed or what the unexpected string was. This should now be more obvious. c) Again related to the confinement tests the way a file was tested for being accessible was optimized. Previously systemd would in some situations open a file twice during that check. This was reduced to one operation but required the procfs to be mounted in a units namespace. An upstream bug was filed and fixed. We are now carrying the essential patch to fix that issue until it is backported to a new release (likely only version 250). The good part about this story is that upstream systemd now has a test case that looks very similar to one of our confinement tests. Hopefully that will lead to less friction in the long run. https://github.com/systemd/systemd/issues/20514 https://github.com/systemd/systemd/pull/20515 d) Previously we could grep for dlopen( somewhat reliably but now upstream started using a wrapper around dlopen that is most of the time used with linebreaks. This makes using grep not ergonomic anymore. With this bump we are grepping for anything that looks like a dynamic library name (in contrast to a dlopen(3) call) and replace those instead. That seems more robust. Time will tell if this holds. I tried using coccinelle to patch all those call sites using its tooling but unfornately it does stumble upon the _cleanup_ annotations that are very common in the systemd code. e) We now have some machinery for libbpf support in our systemd build. That being said it doesn't actually work as generating some skeletons doesn't work just yet. It fails with the below error message and is disabled by default (in both minimal and the regular build). > FAILED: src/core/bpf/socket_bind/socket-bind.skel.h > /build/source/tools/build-bpf-skel.py --clang_exec /nix/store/x1bi2mkapk1m0zq2g02nr018qyjkdn7a-clang-wrapper-12.0.1/bin/clang --llvm_strip_exec /nix/store/zm0kqan9qc77x219yihmmisi9g3sg8ns-llvm-12.0.1/bin/llvm-strip --bpftool_exec /nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool --arch x86_64 ../src/core/bpf/socket_bind/socket-bind.bpf.c src/core/bpf/socket_bind/socket-bind.skel.h > libbpf: elf: socket_bind_bpf is not a valid eBPF object file > Error: failed to open BPF object file: BPF object format invalid > Traceback (most recent call last): > File "/build/source/tools/build-bpf-skel.py", line 128, in <module> > bpf_build(args) > File "/build/source/tools/build-bpf-skel.py", line 92, in bpf_build > gen_bpf_skeleton(bpftool_exec=args.bpftool_exec, > File "/build/source/tools/build-bpf-skel.py", line 63, in gen_bpf_skeleton > skel = subprocess.check_output(bpftool_args, universal_newlines=True) > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 424, in check_output > return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 528, in run > raise CalledProcessError(retcode, process.args, > subprocess.CalledProcessError: Command '['/nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool', 'g', 's', '../src/core/bpf/socket_bind/socket-bind.bpf.o']' returned non-zero exit status 255. > [102/1457] Compiling C object src/journal/libjournal-core.a.p/journald-server.c.oapture output)put)ut) > ninja: build stopped: subcommand failed. f) We do now have support for TPM2 based disk encryption in our systemd build. The actual bits and pieces to make use of that are missing but there are various ongoing efforts in that direction. There is also the story about systemd in our initrd to enable this being used for root volumes. None of this will yet work out of the box but we can start improving on that front. g) FIDO2 support was added systemd and consequently we can now use that. Just with TPM2 there hasn't been any integration work with NixOS and instead this just adds that capability to work on that. Co-Authored-By: Jörg Thalheim <joerg@thalheim.io>
2021-08-30 13:10:54 +00:00
passwd = machine.succeed('chroot-exec cat /etc/passwd').strip()
assert len(passwd) > 0, "/etc/passwd must not be empty"
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
'';
}
{ config.serviceConfig.User = "chroot-testuser";
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
config.serviceConfig.Group = "chroot-testgroup";
testScript = ''
with subtest("check if User/Group really runs as non-root"):
machine.succeed("chroot-exec ls -l /dev")
systemd: 247.6 -> 249.4 This updates systemd to version v249.4 from version v247.6. Besides the many new features that can be found in the upstream repository they also introduced a bunch of cleanup which ended up requiring a few more patches on our side. a) 0022-core-Handle-lookup-paths-being-symlinks.patch: The way symlinked units were handled was changed in such that the last name of a unit file within one of the unit directories (/run/systemd/system, /etc/systemd/system, ...) is used as the name for the unit. Unfortunately that code didn't take into account that the unit directories themselves could already be symlinks and thus caused all our units to be recognized slightly different. There is an upstream PR for this new patch: https://github.com/systemd/systemd/pull/20479 b) The way the APIVFS is setup has been changed in such a way that we now always have /run. This required a few changes to the confinement tests which did assert that they didn't exist. Instead of adding another patch we can just adopt the upstream behavior. An empty /run doesn't seem harmful. As part of this work I refactored the confinement test just a little bit to allow better debugging of test failures. Previously it would just fail at some point and it wasn't obvious which of the many commands failed or what the unexpected string was. This should now be more obvious. c) Again related to the confinement tests the way a file was tested for being accessible was optimized. Previously systemd would in some situations open a file twice during that check. This was reduced to one operation but required the procfs to be mounted in a units namespace. An upstream bug was filed and fixed. We are now carrying the essential patch to fix that issue until it is backported to a new release (likely only version 250). The good part about this story is that upstream systemd now has a test case that looks very similar to one of our confinement tests. Hopefully that will lead to less friction in the long run. https://github.com/systemd/systemd/issues/20514 https://github.com/systemd/systemd/pull/20515 d) Previously we could grep for dlopen( somewhat reliably but now upstream started using a wrapper around dlopen that is most of the time used with linebreaks. This makes using grep not ergonomic anymore. With this bump we are grepping for anything that looks like a dynamic library name (in contrast to a dlopen(3) call) and replace those instead. That seems more robust. Time will tell if this holds. I tried using coccinelle to patch all those call sites using its tooling but unfornately it does stumble upon the _cleanup_ annotations that are very common in the systemd code. e) We now have some machinery for libbpf support in our systemd build. That being said it doesn't actually work as generating some skeletons doesn't work just yet. It fails with the below error message and is disabled by default (in both minimal and the regular build). > FAILED: src/core/bpf/socket_bind/socket-bind.skel.h > /build/source/tools/build-bpf-skel.py --clang_exec /nix/store/x1bi2mkapk1m0zq2g02nr018qyjkdn7a-clang-wrapper-12.0.1/bin/clang --llvm_strip_exec /nix/store/zm0kqan9qc77x219yihmmisi9g3sg8ns-llvm-12.0.1/bin/llvm-strip --bpftool_exec /nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool --arch x86_64 ../src/core/bpf/socket_bind/socket-bind.bpf.c src/core/bpf/socket_bind/socket-bind.skel.h > libbpf: elf: socket_bind_bpf is not a valid eBPF object file > Error: failed to open BPF object file: BPF object format invalid > Traceback (most recent call last): > File "/build/source/tools/build-bpf-skel.py", line 128, in <module> > bpf_build(args) > File "/build/source/tools/build-bpf-skel.py", line 92, in bpf_build > gen_bpf_skeleton(bpftool_exec=args.bpftool_exec, > File "/build/source/tools/build-bpf-skel.py", line 63, in gen_bpf_skeleton > skel = subprocess.check_output(bpftool_args, universal_newlines=True) > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 424, in check_output > return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 528, in run > raise CalledProcessError(retcode, process.args, > subprocess.CalledProcessError: Command '['/nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool', 'g', 's', '../src/core/bpf/socket_bind/socket-bind.bpf.o']' returned non-zero exit status 255. > [102/1457] Compiling C object src/journal/libjournal-core.a.p/journald-server.c.oapture output)put)ut) > ninja: build stopped: subcommand failed. f) We do now have support for TPM2 based disk encryption in our systemd build. The actual bits and pieces to make use of that are missing but there are various ongoing efforts in that direction. There is also the story about systemd in our initrd to enable this being used for root volumes. None of this will yet work out of the box but we can start improving on that front. g) FIDO2 support was added systemd and consequently we can now use that. Just with TPM2 there hasn't been any integration work with NixOS and instead this just adds that capability to work on that. Co-Authored-By: Jörg Thalheim <joerg@thalheim.io>
2021-08-30 13:10:54 +00:00
uid = machine.succeed('chroot-exec id -u').strip()
assert uid != "0", "UID of chroot-testuser shouldn't be 0"
machine.fail("chroot-exec touch /bin/test")
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
'';
}
(let
symlink = pkgs.runCommand "symlink" {
target = pkgs.writeText "symlink-target" "got me\n";
} "ln -s \"$target\" \"$out\"";
in {
config.confinement.packages = lib.singleton symlink;
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
testScript = ''
with subtest("check if symlinks are properly bind-mounted"):
machine.fail("chroot-exec test -e /etc")
systemd: 247.6 -> 249.4 This updates systemd to version v249.4 from version v247.6. Besides the many new features that can be found in the upstream repository they also introduced a bunch of cleanup which ended up requiring a few more patches on our side. a) 0022-core-Handle-lookup-paths-being-symlinks.patch: The way symlinked units were handled was changed in such that the last name of a unit file within one of the unit directories (/run/systemd/system, /etc/systemd/system, ...) is used as the name for the unit. Unfortunately that code didn't take into account that the unit directories themselves could already be symlinks and thus caused all our units to be recognized slightly different. There is an upstream PR for this new patch: https://github.com/systemd/systemd/pull/20479 b) The way the APIVFS is setup has been changed in such a way that we now always have /run. This required a few changes to the confinement tests which did assert that they didn't exist. Instead of adding another patch we can just adopt the upstream behavior. An empty /run doesn't seem harmful. As part of this work I refactored the confinement test just a little bit to allow better debugging of test failures. Previously it would just fail at some point and it wasn't obvious which of the many commands failed or what the unexpected string was. This should now be more obvious. c) Again related to the confinement tests the way a file was tested for being accessible was optimized. Previously systemd would in some situations open a file twice during that check. This was reduced to one operation but required the procfs to be mounted in a units namespace. An upstream bug was filed and fixed. We are now carrying the essential patch to fix that issue until it is backported to a new release (likely only version 250). The good part about this story is that upstream systemd now has a test case that looks very similar to one of our confinement tests. Hopefully that will lead to less friction in the long run. https://github.com/systemd/systemd/issues/20514 https://github.com/systemd/systemd/pull/20515 d) Previously we could grep for dlopen( somewhat reliably but now upstream started using a wrapper around dlopen that is most of the time used with linebreaks. This makes using grep not ergonomic anymore. With this bump we are grepping for anything that looks like a dynamic library name (in contrast to a dlopen(3) call) and replace those instead. That seems more robust. Time will tell if this holds. I tried using coccinelle to patch all those call sites using its tooling but unfornately it does stumble upon the _cleanup_ annotations that are very common in the systemd code. e) We now have some machinery for libbpf support in our systemd build. That being said it doesn't actually work as generating some skeletons doesn't work just yet. It fails with the below error message and is disabled by default (in both minimal and the regular build). > FAILED: src/core/bpf/socket_bind/socket-bind.skel.h > /build/source/tools/build-bpf-skel.py --clang_exec /nix/store/x1bi2mkapk1m0zq2g02nr018qyjkdn7a-clang-wrapper-12.0.1/bin/clang --llvm_strip_exec /nix/store/zm0kqan9qc77x219yihmmisi9g3sg8ns-llvm-12.0.1/bin/llvm-strip --bpftool_exec /nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool --arch x86_64 ../src/core/bpf/socket_bind/socket-bind.bpf.c src/core/bpf/socket_bind/socket-bind.skel.h > libbpf: elf: socket_bind_bpf is not a valid eBPF object file > Error: failed to open BPF object file: BPF object format invalid > Traceback (most recent call last): > File "/build/source/tools/build-bpf-skel.py", line 128, in <module> > bpf_build(args) > File "/build/source/tools/build-bpf-skel.py", line 92, in bpf_build > gen_bpf_skeleton(bpftool_exec=args.bpftool_exec, > File "/build/source/tools/build-bpf-skel.py", line 63, in gen_bpf_skeleton > skel = subprocess.check_output(bpftool_args, universal_newlines=True) > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 424, in check_output > return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 528, in run > raise CalledProcessError(retcode, process.args, > subprocess.CalledProcessError: Command '['/nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool', 'g', 's', '../src/core/bpf/socket_bind/socket-bind.bpf.o']' returned non-zero exit status 255. > [102/1457] Compiling C object src/journal/libjournal-core.a.p/journald-server.c.oapture output)put)ut) > ninja: build stopped: subcommand failed. f) We do now have support for TPM2 based disk encryption in our systemd build. The actual bits and pieces to make use of that are missing but there are various ongoing efforts in that direction. There is also the story about systemd in our initrd to enable this being used for root volumes. None of this will yet work out of the box but we can start improving on that front. g) FIDO2 support was added systemd and consequently we can now use that. Just with TPM2 there hasn't been any integration work with NixOS and instead this just adds that capability to work on that. Co-Authored-By: Jörg Thalheim <joerg@thalheim.io>
2021-08-30 13:10:54 +00:00
text = machine.succeed('chroot-exec cat ${symlink}').strip()
assert_eq(text, "got me")
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
'';
})
{ config.serviceConfig.User = "chroot-testuser";
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
config.serviceConfig.Group = "chroot-testgroup";
config.serviceConfig.StateDirectory = "testme";
testScript = ''
with subtest("check if StateDirectory works"):
machine.succeed("chroot-exec touch /tmp/canary")
machine.succeed('chroot-exec "echo works > /var/lib/testme/foo"')
machine.succeed('test "$(< /var/lib/testme/foo)" = works')
machine.succeed("test ! -e /tmp/canary")
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
'';
}
{ testScript = ''
with subtest("check if /bin/sh works"):
machine.succeed(
"chroot-exec test -e /bin/sh",
'test "$(chroot-exec \'/bin/sh -c "echo bar"\')" = bar',
)
'';
}
{ config.confinement.binSh = null;
testScript = ''
with subtest("check if suppressing /bin/sh works"):
machine.succeed("chroot-exec test ! -e /bin/sh")
machine.succeed('test "$(chroot-exec \'/bin/sh -c "echo foo"\')" != foo')
'';
}
{ config.confinement.binSh = "${pkgs.hello}/bin/hello";
testScript = ''
with subtest("check if we can set /bin/sh to something different"):
machine.succeed("chroot-exec test -e /bin/sh")
machine.succeed('test "$(chroot-exec /bin/sh -g foo)" = foo')
'';
}
{ config.environment.FOOBAR = pkgs.writeText "foobar" "eek\n";
testScript = ''
with subtest("check if only Exec* dependencies are included"):
machine.succeed('test "$(chroot-exec \'cat "$FOOBAR"\')" != eek')
'';
}
{ config.environment.FOOBAR = pkgs.writeText "foobar" "eek\n";
config.confinement.fullUnit = true;
testScript = ''
with subtest("check if all unit dependencies are included"):
machine.succeed('test "$(chroot-exec \'cat "$FOOBAR"\')" = eek')
'';
}
nixos/systemd-confinement: Allow shipped unit file In issue #157787 @martined wrote: Trying to use confinement on packages providing their systemd units with systemd.packages, for example mpd, fails with the following error: system-units> ln: failed to create symbolic link '/nix/store/...-system-units/mpd.service': File exists This is because systemd-confinement and mpd both provide a mpd.service file through systemd.packages. (mpd got updated that way recently to use upstream's service file) To address this, we now place the unit file containing the bind-mounted paths of the Nix closure into a drop-in directory instead of using the name of a unit file directly. This does come with the implication that the options set in the drop-in directory won't apply if the main unit file is missing. In practice however this should not happen for two reasons: * The systemd-confinement module already sets additional options via systemd.services and thus we should get a main unit file * In the unlikely event that we don't get a main unit file regardless of the previous point, the unit would be a no-op even if the options of the drop-in directory would apply Another thing to consider is the order in which those options are merged, since systemd loads the files from the drop-in directory in alphabetical order. So given that we have confinement.conf and overrides.conf, the confinement options are loaded before the NixOS overrides. Since we're only setting the BindReadOnlyPaths option, the order isn't that important since all those paths are merged anyway and we still don't lose the ability to reset the option since overrides.conf comes afterwards. Fixes: https://github.com/NixOS/nixpkgs/issues/157787 Signed-off-by: aszlig <aszlig@nix.build>
2022-02-02 12:12:49 +00:00
{ serviceName = "shipped-unitfile";
config.confinement.mode = "chroot-only";
testScript = ''
with subtest("check if shipped unit file still works"):
machine.succeed(
'chroot-exec \'kill -9 $$ 2>&1 || :\' | '
'grep -q "Too many levels of symbolic links"'
)
'';
}
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
];
options.__testSteps = lib.mkOption {
type = lib.types.lines;
description = "All of the test steps combined as a single script.";
};
config.environment.systemPackages = lib.singleton testClient;
nixos/systemd-confinement: Allow shipped unit file In issue #157787 @martined wrote: Trying to use confinement on packages providing their systemd units with systemd.packages, for example mpd, fails with the following error: system-units> ln: failed to create symbolic link '/nix/store/...-system-units/mpd.service': File exists This is because systemd-confinement and mpd both provide a mpd.service file through systemd.packages. (mpd got updated that way recently to use upstream's service file) To address this, we now place the unit file containing the bind-mounted paths of the Nix closure into a drop-in directory instead of using the name of a unit file directly. This does come with the implication that the options set in the drop-in directory won't apply if the main unit file is missing. In practice however this should not happen for two reasons: * The systemd-confinement module already sets additional options via systemd.services and thus we should get a main unit file * In the unlikely event that we don't get a main unit file regardless of the previous point, the unit would be a no-op even if the options of the drop-in directory would apply Another thing to consider is the order in which those options are merged, since systemd loads the files from the drop-in directory in alphabetical order. So given that we have confinement.conf and overrides.conf, the confinement options are loaded before the NixOS overrides. Since we're only setting the BindReadOnlyPaths option, the order isn't that important since all those paths are merged anyway and we still don't lose the ability to reset the option since overrides.conf comes afterwards. Fixes: https://github.com/NixOS/nixpkgs/issues/157787 Signed-off-by: aszlig <aszlig@nix.build>
2022-02-02 12:12:49 +00:00
config.systemd.packages = lib.singleton (pkgs.writeTextFile {
name = "shipped-unitfile";
destination = "/etc/systemd/system/shipped-unitfile@.service";
text = ''
[Service]
SystemCallFilter=~kill
SystemCallErrorNumber=ELOOP
'';
});
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
config.users.groups.chroot-testgroup = {};
config.users.users.chroot-testuser = {
isSystemUser = true;
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
description = "Chroot Test User";
group = "chroot-testgroup";
};
};
testScript = { nodes, ... }: ''
systemd: 247.6 -> 249.4 This updates systemd to version v249.4 from version v247.6. Besides the many new features that can be found in the upstream repository they also introduced a bunch of cleanup which ended up requiring a few more patches on our side. a) 0022-core-Handle-lookup-paths-being-symlinks.patch: The way symlinked units were handled was changed in such that the last name of a unit file within one of the unit directories (/run/systemd/system, /etc/systemd/system, ...) is used as the name for the unit. Unfortunately that code didn't take into account that the unit directories themselves could already be symlinks and thus caused all our units to be recognized slightly different. There is an upstream PR for this new patch: https://github.com/systemd/systemd/pull/20479 b) The way the APIVFS is setup has been changed in such a way that we now always have /run. This required a few changes to the confinement tests which did assert that they didn't exist. Instead of adding another patch we can just adopt the upstream behavior. An empty /run doesn't seem harmful. As part of this work I refactored the confinement test just a little bit to allow better debugging of test failures. Previously it would just fail at some point and it wasn't obvious which of the many commands failed or what the unexpected string was. This should now be more obvious. c) Again related to the confinement tests the way a file was tested for being accessible was optimized. Previously systemd would in some situations open a file twice during that check. This was reduced to one operation but required the procfs to be mounted in a units namespace. An upstream bug was filed and fixed. We are now carrying the essential patch to fix that issue until it is backported to a new release (likely only version 250). The good part about this story is that upstream systemd now has a test case that looks very similar to one of our confinement tests. Hopefully that will lead to less friction in the long run. https://github.com/systemd/systemd/issues/20514 https://github.com/systemd/systemd/pull/20515 d) Previously we could grep for dlopen( somewhat reliably but now upstream started using a wrapper around dlopen that is most of the time used with linebreaks. This makes using grep not ergonomic anymore. With this bump we are grepping for anything that looks like a dynamic library name (in contrast to a dlopen(3) call) and replace those instead. That seems more robust. Time will tell if this holds. I tried using coccinelle to patch all those call sites using its tooling but unfornately it does stumble upon the _cleanup_ annotations that are very common in the systemd code. e) We now have some machinery for libbpf support in our systemd build. That being said it doesn't actually work as generating some skeletons doesn't work just yet. It fails with the below error message and is disabled by default (in both minimal and the regular build). > FAILED: src/core/bpf/socket_bind/socket-bind.skel.h > /build/source/tools/build-bpf-skel.py --clang_exec /nix/store/x1bi2mkapk1m0zq2g02nr018qyjkdn7a-clang-wrapper-12.0.1/bin/clang --llvm_strip_exec /nix/store/zm0kqan9qc77x219yihmmisi9g3sg8ns-llvm-12.0.1/bin/llvm-strip --bpftool_exec /nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool --arch x86_64 ../src/core/bpf/socket_bind/socket-bind.bpf.c src/core/bpf/socket_bind/socket-bind.skel.h > libbpf: elf: socket_bind_bpf is not a valid eBPF object file > Error: failed to open BPF object file: BPF object format invalid > Traceback (most recent call last): > File "/build/source/tools/build-bpf-skel.py", line 128, in <module> > bpf_build(args) > File "/build/source/tools/build-bpf-skel.py", line 92, in bpf_build > gen_bpf_skeleton(bpftool_exec=args.bpftool_exec, > File "/build/source/tools/build-bpf-skel.py", line 63, in gen_bpf_skeleton > skel = subprocess.check_output(bpftool_args, universal_newlines=True) > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 424, in check_output > return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, > File "/nix/store/81lwy2hfqj4c1943b1x8a0qsivjhdhw9-python3-3.9.6/lib/python3.9/subprocess.py", line 528, in run > raise CalledProcessError(retcode, process.args, > subprocess.CalledProcessError: Command '['/nix/store/l6dg8jlbh8qnqa58mshh3d8r6999dk0p-bpftools-5.13.11/bin/bpftool', 'g', 's', '../src/core/bpf/socket_bind/socket-bind.bpf.o']' returned non-zero exit status 255. > [102/1457] Compiling C object src/journal/libjournal-core.a.p/journald-server.c.oapture output)put)ut) > ninja: build stopped: subcommand failed. f) We do now have support for TPM2 based disk encryption in our systemd build. The actual bits and pieces to make use of that are missing but there are various ongoing efforts in that direction. There is also the story about systemd in our initrd to enable this being used for root volumes. None of this will yet work out of the box but we can start improving on that front. g) FIDO2 support was added systemd and consequently we can now use that. Just with TPM2 there hasn't been any integration work with NixOS and instead this just adds that capability to work on that. Co-Authored-By: Jörg Thalheim <joerg@thalheim.io>
2021-08-30 13:10:54 +00:00
def assert_eq(a, b):
assert a == b, f"{a} != {b}"
machine.wait_for_unit("multi-user.target")
'' + nodes.machine.config.__testSteps;
nixos: Add 'chroot' options to systemd.services Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot. This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past. However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born. Now if you want to chroot a systemd service, all you need to do is: { systemd.services.yourservice = { description = "My Shiny Service"; wantedBy = [ "multi-user.target" ]; chroot.enable = true; serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice"; }; } If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately. In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something". Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line. However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function). Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled. The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup. Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure. [1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62 [2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124 [3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even. [4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix Signed-off-by: aszlig <aszlig@nix.build>
2019-03-10 11:21:55 +00:00
}