[bug#75810] Locked mounts

Message ID 87y0xbivsh.fsf_-_@gnu.org
State New
Headers
Series [bug#75810] Locked mounts |

Commit Message

Ludovic Courtès March 11, 2025, 12:41 p.m. UTC
  Hi Reepca,

Reepca Russelstein <reepca@russelstein.xyz> skribis:

> I still think it would be a good idea to call unshare to create an extra
> user and mount namespace just before executing the builder in the
> unprivileged case, just to be sure that the mount-locking behavior is
> triggered in a way that is documented.

For some reason, it’s not working as advertised: mounts are seemingly
not locked together and umount(2) on one of them returns EPERM (instead
of EINVAL).  I suspect chroot, pivot_root, & co. spoil it all.

Attached is a patch and test case.

To be sure, I wrote a minimal C program: umount returns EINVAL as
expected.  However, when compiling it with -DWITH_CHROOT, unshare(2)
fails with EPERM because “the caller's root directory does not match the
root directory of the mount namespace in which it resides” (quoting
unshare(2)).

So I now get the idea of “locked mounts” but I’m at loss as to how this
is supposed to interact with chroots.

Thoughts?

Ludo’.
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>

#include <assert.h>
#include <error.h>
#include <errno.h>

#undef NDEBUG

static char child_stack[8192];

static int sync_pipe[2];

static int
child (void *arg)
{
  close (sync_pipe[1]);
  char str[123];

  read (sync_pipe[0], str, sizeof str);
  assert (strcmp (str, "go!\n") == 0);
  close (sync_pipe[0]);

  printf ("child process: %d\n", getpid ());
  mkdir ("/tmp/t", 0700);
  errno = 0;
  mount ("none", "/tmp/t", "tmpfs", 0, NULL);
  assert_perror (errno);

#ifdef WITH_CHROOT
  chroot ("/tmp");
  assert_perror (errno);
#endif
  unshare (CLONE_NEWNS | CLONE_NEWUSER);
  assert_perror (errno);
#ifdef WITH_CHROOT
  umount ("/t");
#else
  umount ("/tmp/t");
#endif

  int err = errno;
  printf ("umount errno: %s\n", strerror (err));
  assert (err == EINVAL);
  return EXIT_FAILURE;
}

static void
initialize_namespace (pid_t pid)
{
  close (sync_pipe[0]);

  char name[1024];
  FILE *file;

  sprintf (name, "/proc/%d/uid_map", pid);
  file = fopen (name, "w");
  fprintf (file, "42 %d 1", getuid ());
  fclose (file);

  sprintf (name, "/proc/%d/setgroups", pid);
  file = fopen (name, "w");
  fprintf (file, "deny");
  fclose (file);

  sprintf (name, "/proc/%d/gid_map", pid);
  file = fopen (name, "w");
  fprintf (file, "42 %d 1", getgid ());
  fclose (file);

  errno = 0;
  write (sync_pipe[1], "go!\n", 5);
  assert_perror (errno);
  close (sync_pipe[1]);
}


int
main ()
{
  errno = 0;
  pipe2 (sync_pipe, O_CLOEXEC);
  assert_perror (errno);

  int pid = clone (child,
		   (char *) (((intptr_t) child_stack + sizeof child_stack - sizeof (void *)) & ~0xfULL),
		   CLONE_NEWNS | CLONE_NEWUSER | SIGCHLD, NULL);
  assert_perror (errno);
  initialize_namespace (pid);

  return EXIT_SUCCESS;
}

/*
  unshare -mrf sh -c 'mount -t tmpfs none /tmp; exec unshare -mr strace -e umount2 umount /tmp'
*/
  

Comments

Reepca Russelstein March 11, 2025, 9:54 p.m. UTC | #1
Ludovic Courtès <ludo@gnu.org> writes:

>> I still think it would be a good idea to call unshare to create an extra
>> user and mount namespace just before executing the builder in the
>> unprivileged case, just to be sure that the mount-locking behavior is
>> triggered in a way that is documented.
>
> For some reason, it’s not working as advertised: mounts are seemingly
> not locked together and umount(2) on one of them returns EPERM (instead
> of EINVAL).  I suspect chroot, pivot_root, & co. spoil it all.

What this shows is that we're not currently testing the mount-locking
because the builder lacks the necessary capability in its user namespace
(this capability was removed from the effective set when the builder was
exec'ed).  That is, the kernel doesn't get as far as checking whether
the mount is locked because the caller wouldn't have the permission to
unmount it even if it weren't locked.  One way to test this would be to
use setns (perhaps via container-excursion) to enter the namespaces of
the builder, which will cause you to start out with a full set of
effective capabilities in its user namespace, then try umount.  Or a
test could be done within the daemon shortly prior to exec.

> Attached is a patch and test case.

[...]

> diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc
> index 057a15ccd0..6a6a960a1c 100644
> --- a/nix/libstore/build.cc
> +++ b/nix/libstore/build.cc
> @@ -2244,6 +2244,13 @@ void DerivationGoal::runChild()
>  	    /* Remount root as read-only.  */
>              if (mount("/", "/", 0, MS_BIND | MS_REMOUNT | MS_RDONLY, 0) == -1)
>                  throw SysError(format("read-only remount of build root '%1%' failed") % chrootRootDir);
> +
> +	    if (getuid() != 0) {
> +		/* Create a new mount namespace to "lock" previous mounts.
> +		   See mount_namespaces(7).  */
> +		if (unshare(CLONE_NEWNS | CLONE_NEWUSER) == -1)
> +		    throw SysError(format("creating new user and mount namespaces"));
> +	    }
>          }
>  #endif

Note that we still need to initialize /proc/self/uid_map and friends for
the new user namespace, if I understand correctly.  My reading of
user_namespaces(7) is that it should be possible to do this from within
the new user namespace.

> To be sure, I wrote a minimal C program: umount returns EINVAL as
> expected.  However, when compiling it with -DWITH_CHROOT, unshare(2)
> fails with EPERM because “the caller's root directory does not match the
> root directory of the mount namespace in which it resides” (quoting
> unshare(2)).
>
> So I now get the idea of “locked mounts” but I’m at loss as to how this
> is supposed to interact with chroots.

I hadn't heard of that restriction on unshare and clone.  Thinking about
it, I suppose the reason could be that merely creating a user namespace
grants CAP_SYS_CHROOT, and because the current root directory is a
per-process property whose setting isn't limited by any namespace, it
would be possible to undo a chroot someone had tried to set as a
restriction on the current process by just calling chroot("/").  But if
we use pivot_root in conjunction with it, like we do in the daemon, it
should work.

- reepca
  
Ludovic Courtès March 12, 2025, 10:27 p.m. UTC | #2
Hi,

Reepca Russelstein <reepca@russelstein.xyz> skribis:

>> For some reason, it’s not working as advertised: mounts are seemingly
>> not locked together and umount(2) on one of them returns EPERM (instead
>> of EINVAL).  I suspect chroot, pivot_root, & co. spoil it all.
>
> What this shows is that we're not currently testing the mount-locking
> because the builder lacks the necessary capability in its user namespace
> (this capability was removed from the effective set when the builder was
> exec'ed).  That is, the kernel doesn't get as far as checking whether
> the mount is locked because the caller wouldn't have the permission to
> unmount it even if it weren't locked.

Oh my.  These clumsy semantics just can’t fit in my head.

> One way to test this would be to use setns (perhaps via
> container-excursion) to enter the namespaces of the builder, which
> will cause you to start out with a full set of effective capabilities
> in its user namespace, then try umount.  Or a test could be done
> within the daemon shortly prior to exec.

I tried running a build that sleeps and then joining its namespaces but
failed:

--8<---------------cut here---------------start------------->8---
$ pgrep -fa builder
16091 guile --no-auto-compile -L /home/ludo/src/guix/test-tmp/store/ngrj4gl9lrbmbklcsbgcrq622n9nf0jw-module-import -C /home/ludo/src/guix/test-tmp/store/cskis66zjnhk28h11lbaxkd3j9lyzz6a-module-import-compiled /home/ludo/src/guix/test-tmp/store/akndbdx1lmnigf8bi29dr0vd3c8dbdrg-attempt-to-unmount-input-builder
$ nsenter -m -u -i -n -p -U  -t 16091 
nsenter: reassociate to namespace 'ns/ipc' failed: Operation not permitted
$ guix container exec 16091 /bin/sh
guix container: error: setns: 7 0: Operation not permitted
guix container: error: process exited with status 1
--8<---------------cut here---------------end--------------->8---

If you can think of ways to do that, I’m all ears.  :-)

You can try from the ‘wip-rootless-daemon’ at
<https://codeberg.org/civodul/guix> and apply the patch I sent earlier.

(Incidentally, I don’t think we could write an automated test for that;
in theory we could use (guix scripts processes) to determine the PID of
the build process but that would be too brittle, especially when running
“make check -j123” where it could pick the wrong guix-daemon process.)

> Note that we still need to initialize /proc/self/uid_map and friends for
> the new user namespace, if I understand correctly.

Yes, I left that for later (it triggers a test failure anyway, so we
won’t forget.)

> My reading of user_namespaces(7) is that it should be possible to do
> this from within the new user namespace.

Oh, interesting.

> I hadn't heard of that restriction on unshare and clone.  Thinking about
> it, I suppose the reason could be that merely creating a user namespace
> grants CAP_SYS_CHROOT, and because the current root directory is a
> per-process property whose setting isn't limited by any namespace, it
> would be possible to undo a chroot someone had tried to set as a
> restriction on the current process by just calling chroot("/").  But if
> we use pivot_root in conjunction with it, like we do in the daemon, it
> should work.

Interesting.  Thanks for your guidance, as always!

Ludo’.
  
Reepca Russelstein March 13, 2025, 6:04 a.m. UTC | #3
Ludovic Courtès <ludo@gnu.org> writes:

> I tried running a build that sleeps and then joining its namespaces but
> failed:
>
> $ pgrep -fa builder
> 16091 guile --no-auto-compile -L /home/ludo/src/guix/test-tmp/store/ngrj4gl9lrbmbklcsbgcrq622n9nf0jw-module-import -C /home/ludo/src/guix/test-tmp/store/cskis66zjnhk28h11lbaxkd3j9lyzz6a-module-import-compiled /home/ludo/src/guix/test-tmp/store/akndbdx1lmnigf8bi29dr0vd3c8dbdrg-attempt-to-unmount-input-builder
> $ nsenter -m -u -i -n -p -U  -t 16091 
> nsenter: reassociate to namespace 'ns/ipc' failed: Operation not permitted
> $ guix container exec 16091 /bin/sh
> guix container: error: setns: 7 0: Operation not permitted
> guix container: error: process exited with status 1

I failed to take into account that the setns sequence needs to start by
joining the user namespace that owns all the other namespaces, so as to
gain the necessary capabilities for joining them.  But after unshare
creates the second user namespace, that first user namespace no longer
has a process in it; it only still exists due to indirect references.
Without a process to reference it by, we can't join it.  Trying to join
the user namespace of the builder instead joins the inner user
namespace, then tries to use the acquired credentials to join the
namespaces owned by the outer user namespace, which naturally fails.

> If you can think of ways to do that, I’m all ears.  :-)

It looks like the only easy way to test this - aside from something like
scripted gdb playthroughs - might legitimately be to include a test
inside the daemon itself.

> You can try from the ‘wip-rootless-daemon’ at
> <https://codeberg.org/civodul/guix> and apply the patch I sent earlier.
>
> (Incidentally, I don’t think we could write an automated test for that;
> in theory we could use (guix scripts processes) to determine the PID of
> the build process but that would be too brittle, especially when running
> “make check -j123” where it could pick the wrong guix-daemon process.)

If we included a randomly-generated string in the command line or
environment of something in the build environment, and then went looking
through /proc for the matching process, it could work.  But it's moot if
we put this particular test inside the daemon.

- reepca
  
Ludovic Courtès March 13, 2025, 11:51 p.m. UTC | #4
Hey,

Reepca Russelstein <reepca@russelstein.xyz> skribis:

> I failed to take into account that the setns sequence needs to start by
> joining the user namespace that owns all the other namespaces, so as to
> gain the necessary capabilities for joining them.  But after unshare
> creates the second user namespace, that first user namespace no longer
> has a process in it; it only still exists due to indirect references.
> Without a process to reference it by, we can't join it.  Trying to join
> the user namespace of the builder instead joins the inner user
> namespace, then tries to use the acquired credentials to join the
> namespaces owned by the outer user namespace, which naturally fails.

I see; brilliant.

>> If you can think of ways to do that, I’m all ears.  :-)
>
> It looks like the only easy way to test this - aside from something like
> scripted gdb playthroughs - might legitimately be to include a test
> inside the daemon itself.

Yes, that makes sense.

I added a ‘umount’ call, checking that we get EINVAL, and confirmed that
this check fails if we comment out the ‘unshare’ call.

I pushed the updated branch to Codeberg.  There are test failures in the
‘debian-install’ test that I now need to investigate before I send v5,
notably CAP_SYS_CHOWN not working (?) when attempting to create root’s
profile:

--8<---------------cut here---------------start------------->8---
guix install: [1;31merror: [0mdirectory `/var/guix/profiles/per-user/root' is not owned by you
[1;36mhint: [0mPlease change the owner of `/var/guix/profiles/per-user/root' to user
"root".

ls: cannot access '/root/.guix-profile': No such file or directory
sh: 1: /root/.guix-profile/bin/hello: not found
[…]
PASS: marionette works
PASS: /etc/os-release
PASS: mount host file store
PASS: screenshot before
PASS: install fake dependencies
PASS: run install script
PASS: create user account
PASS: guix describe
PASS: hello not already built
PASS: guix build hello
PASS: hello indeed built
/gnu/store/59qdz41chhifidaq79iiiyx70m7lmyrp-debian-install-builder:1: FAIL guix install hello
/gnu/store/59qdz41chhifidaq79iiiyx70m7lmyrp-debian-install-builder:1: FAIL user profile created
/gnu/store/59qdz41chhifidaq79iiiyx70m7lmyrp-debian-install-builder:1: FAIL hello
PASS: guix install hello, unprivileged user
PASS: user hello
PASS: unprivileged user profile created
/gnu/store/59qdz41chhifidaq79iiiyx70m7lmyrp-debian-install-builder:1: FAIL store is read-only
PASS: screenshot after
# of expected passes      15
# of unexpected failures  4
--8<---------------cut here---------------end--------------->8---

Ludo’.
  

Patch

diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc
index 057a15ccd0..6a6a960a1c 100644
--- a/nix/libstore/build.cc
+++ b/nix/libstore/build.cc
@@ -2244,6 +2244,13 @@  void DerivationGoal::runChild()
 	    /* Remount root as read-only.  */
             if (mount("/", "/", 0, MS_BIND | MS_REMOUNT | MS_RDONLY, 0) == -1)
                 throw SysError(format("read-only remount of build root '%1%' failed") % chrootRootDir);
+
+	    if (getuid() != 0) {
+		/* Create a new mount namespace to "lock" previous mounts.
+		   See mount_namespaces(7).  */
+		if (unshare(CLONE_NEWNS | CLONE_NEWUSER) == -1)
+		    throw SysError(format("creating new user and mount namespaces"));
+	    }
         }
 #endif
 
diff --git a/tests/store.scm b/tests/store.scm
index c22739afe6..9da9345dd0 100644
--- a/tests/store.scm
+++ b/tests/store.scm
@@ -37,6 +37,8 @@  (define-module (test-store)
   #:use-module (guix gexp)
   #:use-module (gnu packages)
   #:use-module (gnu packages bootstrap)
+  #:use-module ((gnu packages make-bootstrap)
+                #:select (%guile-static-stripped))
   #:use-module (ice-9 match)
   #:use-module (ice-9 regex)
   #:use-module (rnrs bytevectors)
@@ -59,6 +61,8 @@  (define %shell
 
 (test-begin "store")
 
+(test-skip 25)
+
 (test-assert "open-connection with file:// URI"
   (let ((store (open-connection (string-append "file://"
                                                (%daemon-socket-uri)))))
@@ -455,7 +459,7 @@  (define %shell
          (drv
           (run-with-store %store
             (gexp->derivation
-             "attempt-to-remount-input-read-write"
+             "attempt-to-write-to-input"
              (with-imported-modules (source-module-closure
                                      '((guix build syscalls)))
                #~(begin
@@ -496,6 +500,58 @@  (define %shell
       (build-derivations %store (list drv))
       #f)))
 
+(let ((guile (with-external-store external-store
+               (and external-store
+                    (run-with-store external-store
+                      (mlet %store-monad ((drv (lower-object %guile-static-stripped)))
+                        (mbegin %store-monad
+                          (built-derivations (list drv))
+                          (return (derivation->output-path (pk 'GDRV drv))))))))))
+
+  (unless (and guile (unprivileged-user-namespace-supported?))
+    (test-skip 1))
+  (test-equal "input mount is locked"
+    EINVAL
+    ;; Check that mounts within the build environment are "locked" together and
+    ;; cannot be separated from within the build environment namespace--see
+    ;; mount_namespaces(7).
+    ;;
+    ;; Since guile-bootstrap@2.0 lacks 'umount', resort to the hack below to
+    ;; get a statically-linked Guile with 'umount'.
+    (let* ((guile (computed-file "guile-with-umount"
+                                 ;; The #:guile-for-build argument must be a
+                                 ;; derivation, hence this silly thing.
+                                 #~(symlink #$(local-file guile "guile-with-umount"
+                                                          #:recursive? #t)
+                                            #$output)
+                                 #:guile %bootstrap-guile))
+           (drv
+            (run-with-store %store
+              (mlet %store-monad ((guile (lower-object guile)))
+                (gexp->derivation
+                 "attempt-to-unmount-input"
+                 (with-imported-modules (source-module-closure
+                                         '((guix build syscalls)))
+                   #~(begin
+                       (use-modules (guix build syscalls))
+
+                       (let ((input #$(plain-file "input-that-might-be-unmounted"
+                                                  (random-text))))
+                         (catch 'system-error
+                           (lambda ()
+                             ;; umount(2) returns EINVAL when the target is locked.
+                             (umount input))
+                           (lambda args
+                             (call-with-output-file #$output
+                               (lambda (port)
+                                 (write (system-error-errno args) port))))))))
+                 #:guile-for-build guile)))))
+      (build-derivations %store (list (pk 'UMDRV drv)))
+      (call-with-input-file (derivation->output-path drv)
+        read))))
+
+(test-skip 100)
+
 (unless (unprivileged-user-namespace-supported?)
   (test-skip 1))
 (test-assert "build root cannot be made world-readable"