[bug#75810,v6,00/16] Rootless guix-daemon

Message ID 87sen55a86.fsf@gnu.org
State New
Headers

Commit Message

Ludovic Courtès March 22, 2025, 3:57 p.m. UTC
  Hi Reepca,

Reepca Russelstein <reepca@russelstein.xyz> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>>> While ensuring that what actually gets execve'd is in the store suffices
>>> to eliminate the vulnerability, it may be "conceptually purer" to
>>> require that the links pointing to it are all in the store as well.  For
>>> example, while a builder that is a symlink pointing to /proc/self/exe
>>> wouldn't be able to modify the daemon binary, it's still a piece of
>>> basically "undefined behavior" as far as the build environment is
>>> concerned, which could be closed up.  But that can come later just as
>>> well.
>>
>> Yes.  But in practice, “normal” symlinks (i.e., not /proc/self/exe) will
>> lead ‘canonPath’ to throw if one component is outside the store, since
>> ‘canonPath’ operates within the chroot.
>
> Unless the component actually exists and is outside of the store.  If we
> just rely on canonPath throwing an exception to be safe, then if there
> ever arose a situation where a non-symlink executable existed outside of
> the store, it would still be possible to convince the daemon to execute
> it.

[...]

> I mention this because I see that patch 07/16 of v7 has left out the
> isInStore check, and I think it should remain.

Hmm right (I was very much assuming that /proc/self/exe was the only
non-store executable, but better be safe than sorry).  Re-adding this:
> While researching container escape vulnerabilities, I recently came
> across CAP_DAC_READ_SEARCH and open_by_handle_at, which is a system call
> so insanely powerful it is outright banned in all but the root user
> namespace.  Or at least, it was.  10 months ago, in commit
> 620c266f394932e5decc4b34683a75dfc59dc2f4 of
> https://github.com/torvalds/linux, the requirements were relaxed so
> that, in certain cases, processes in non-root user namespaces could use
> open_by_handle_at.

The way ‘open_by_handle_at’ is documented (“half” of ‘openat’) does not
make it immediately obvious to me what makes it “powerful”.  I see the
risk of a confused deputy problem though because of the ‘mount_id’
argument in addition to ‘handle’.  Is that what you have in mind?

> The consequences of this for same-user containers are not clear to me
> yet, as I haven't studied the kernel source enough to know what exactly
> that commit message means by "privileges over the filesystem" or
> "privileges over a subtree".  I also haven't been able to test this
> behavior yet, because my kernel is actually too old (I do my rebases and
> upgrades rather less regularly than is recommended).  I'll try to look
> into this more once I update my system (and man-pages!), but figured I
> should mention it, because aside from that, and the aforementioned
> isInStore check, I can't think of any remaining concerns.

Alright.  I’ll send v8 with the change above.

Thanks again!

Ludo’.
  

Comments

Reepca Russelstein March 22, 2025, 9:48 p.m. UTC | #1
Ludovic Courtès <ludo@gnu.org> writes:

>> While researching container escape vulnerabilities, I recently came
>> across CAP_DAC_READ_SEARCH and open_by_handle_at, which is a system call
>> so insanely powerful it is outright banned in all but the root user
>> namespace.  Or at least, it was.  10 months ago, in commit
>> 620c266f394932e5decc4b34683a75dfc59dc2f4 of
>> https://github.com/torvalds/linux, the requirements were relaxed so
>> that, in certain cases, processes in non-root user namespaces could use
>> open_by_handle_at.
>
> The way ‘open_by_handle_at’ is documented (“half” of ‘openat’) does not
> make it immediately obvious to me what makes it “powerful”.  I see the
> risk of a confused deputy problem though because of the ‘mount_id’
> argument in addition to ‘handle’.  Is that what you have in mind?

The handle is a purely user-space sequence of bytes, and is not
namespaced whatsoever.  In other words, the first "half" (that is,
name_to_handle_at) is completely optional, as long as you have a good
idea of what sort of handle values to try.  This means that, if a
process has this capability in the root user namespace, they can
potentially access every file of any filesystem that has at least one
file visible to them.  Note that "filesystem" here is not the same thing
as "mount point", so this means that if you have a bind mount from the
root filesystem in the container (or the root filesystem itself in the
container is on the out-of-container root filesystem), a process in the
container but with CAP_DAC_READ_SEARCH in the root user namespace could
access *every file on the real root filesystem*.  This is how an exploit
for Docker named "shocker" worked
(http://stealth.openwall.net/xSports/shocker.c), caused by Docker
leaving CAP_DAC_READ_SEARCH available by default in privileged
containers.

I of course hope that the kernel's relaxing of the rules to also allow
open_by_handle_at in some situations in non-root user namespaces has
been carefully thought through to not open any holes like this, but it
would be good to keep an eye on it regardless.

- reepca
  
Ludovic Courtès March 23, 2025, 2:30 p.m. UTC | #2
Hello,

Reepca Russelstein <reepca@russelstein.xyz> skribis:

> The handle is a purely user-space sequence of bytes, and is not
> namespaced whatsoever.  In other words, the first "half" (that is,
> name_to_handle_at) is completely optional, as long as you have a good
> idea of what sort of handle values to try.  This means that, if a
> process has this capability in the root user namespace, they can
> potentially access every file of any filesystem that has at least one
> file visible to them.  Note that "filesystem" here is not the same thing
> as "mount point", so this means that if you have a bind mount from the
> root filesystem in the container (or the root filesystem itself in the
> container is on the out-of-container root filesystem), a process in the
> container but with CAP_DAC_READ_SEARCH in the root user namespace could
> access *every file on the real root filesystem*.  This is how an exploit
> for Docker named "shocker" worked
> (http://stealth.openwall.net/xSports/shocker.c), caused by Docker
> leaving CAP_DAC_READ_SEARCH available by default in privileged
> containers.

Ouch. I think the conceptual quagmire stemming from the accumulation of
features retrofitted on the otherwise simpler 1970 Unix model doesn’t
help: mount points, file systems, namespaces,
shared/locked/private/slave subtrees, “capabilities”, ACLs, etc.
It’s intractable.

> I of course hope that the kernel's relaxing of the rules to also allow
> open_by_handle_at in some situations in non-root user namespaces has
> been carefully thought through to not open any holes like this, but it
> would be good to keep an eye on it regardless.

Indeed. :-/

Ludo’.

PS: Just sent v8.
  

Patch

diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc
index 1733322316..d0fcc99854 100644
--- a/nix/libstore/build.cc
+++ b/nix/libstore/build.cc
@@ -2390,6 +2390,9 @@  void DerivationGoal::runChild()
 	       within the chroot.  */
 	    builderBasename = baseNameOf(drv.builder);
 	    drv.builder = canonPath(drv.builder, true);
+
+	    if (!isInStore(drv.builder))
+		throw Error(format("derivation builder '%1%' is outside the store") % drv.builder);
 	}
 
         /* Fill in the arguments. */