[bug#68677,0/6] Service for "virtual build machines"

Message ID cover.1706027375.git.ludo@gnu.org

Message

Ludovic Courtès Jan. 23, 2024, 4:46 p.m. UTC
Hello Guix!

Lots of talk about reproducibility and how wonderful Guix is, but
as soon as you try to build packages from v1.0.0, released less
than 5 years ago, you hit a “time trap” in Python, in OpenSSL, or
some other ugly build failure—assuming you managed to fetch source
code in the first place¹.

This patch series defines a long-overdue
‘virtual-build-machine-service-type’: a service to run a virtual
machine available for offloading.  My main goal here is to
allow users to build stuff at a past date without having to
change their system clock.  It can also be used to control other
aspects usually not under control: the CPU model, the Linux kernel.

The series includes changes to <virtual-machine> that are not
actually used but can be useful; they come from a previous iteration
that didn’t pan out.

One limitation I’d like to address is the fact that the SSH and
secrets ports are exposed locally, as is already the case with
childhurds (any local user could inject secrets into the VM if
they connect at the right moment when it boots).  Future work
includes switching to AF_VSOCK sockets—see vsock(7).

Some of the code is shared with childhurds.  I don’t know if
we could factorize things further.

Thoughts?

Ludo’.

¹ This blog post by Simon explains the kind of problem one hits
  when traveling to the not-so-distant past:
  https://simon.tournier.info/posts/2023-12-21-repro-paper.html

Ludovic Courtès (6):
  services: secret-service: Make the endpoint configurable.
  vm: Add ‘date’ field to <virtual-machine>.
  vm: Export <virtual-machine> accessors.
  vm: Add ‘cpu-count’ field to <virtual-machine>.
  marionette: Add #:peek? to ‘wait-for-tcp-port?’.
  services: Add ‘virtual-build-machine’ service.

 doc/guix.texi                   | 139 ++++++-
 gnu/build/marionette.scm        |  32 +-
 gnu/build/secret-service.scm    |  62 ++--
 gnu/services/virtualization.scm | 640 ++++++++++++++++++++++++--------
 gnu/system/image.scm            |   1 +
 gnu/system/vm.scm               | 115 +++++-
 gnu/tests/virtualization.scm    | 176 +++++++--
 7 files changed, 933 insertions(+), 232 deletions(-)


base-commit: 299ce524c9f725549ab5548197cc88b085bba2f4

Comments

Simon Tournier Jan. 25, 2024, 2:18 p.m. UTC | #1
Hi Ludo,

On mar., 23 janv. 2024 at 17:46, Ludovic Courtès <ludo@gnu.org> wrote:

> Lots of talk about reproducibility and how wonderful Guix is, but
> as soon as you try to build packages from v1.0.0, released less
> than 5 years ago, you hit a “time trap” in Python, in OpenSSL, or
> some other ugly build failure—assuming you managed to fetch source
> code in the first place¹.

Cool!  Workarounds for “time trap” of the current past.

Note that today is the past of the future. ;-) In other words, the same
workarounds will help detect today, and thus fix, the “time traps” that
would arise in the future.

Not to mention the year 2038 bug. :-)



> This patch series defines a long-overdue
> ‘virtual-build-machine-service-type’: a service to run a virtual
> machine available for offloading.  My main goal here is to
> allow users to build stuff at a past date without having to
> change their system clock.  It can also be used to control other
> aspects usually not under control: the CPU model, the Linux kernel.

Yes, controlling the CPU model and the Linux kernel is worthwhile:

 + CPU model because we already have examples of failures (Python 3.7
   packaged in Guix v1.0.0, some BLAS libraries, etc.);

 + Linux kernel because its stability is one of the strong assumptions we
   are making for reproducibility.


Cheers,
simon

Ludovic Courtès Jan. 29, 2024, 11:25 a.m. UTC | #2
Simon Tournier <zimon.toutoune@gmail.com> skribis:

> Yes, controlling the CPU model and the Linux kernel is worthwhile:
>
>  + CPU model because we already have examples of failures (Python 3.7
>    packaged in Guix v1.0.0, some BLAS libraries, etc.);

Yes!  And I think we should maintain a catalog of these problems (build
processes influenced by date, hardware, or kernel version).

Our horizon should be to somehow ensure such packages are always built
in the right environment, automatically, whether or not it involves
using a VM.

Ludo’.

Ludovic Courtès Feb. 5, 2024, 1:37 p.m. UTC | #3
Hello there!

Ludovic Courtès <ludo@gnu.org> skribis:

> This patch series defines a long-overdue
> ‘virtual-build-machine-service-type’: a service to run a virtual
> machine available for offloading.  My main goal here is to
> allow users to build stuff at a past date without having to
> change their system clock.  It can also be used to control other
> aspects usually not under control: the CPU model, the Linux kernel.

Any comments on this patch series?

  https://issues.guix.gnu.org/68677

I’d like to go ahead and apply it by the end of the week if there are no
objections.

(I realize all the files being touched here are in a limbo in terms of
team coverage.  We should fix that!)

Ludo’.

Suhail Feb. 5, 2024, 3:45 p.m. UTC | #4
Ludovic Courtès <ludo@gnu.org> writes:

> Any comments on this patch series?

I don't have comments regarding the code, but I do have a couple of
questions and a comment.  Please excuse my limited understanding of GNU
Shepherd and Guix System.  None of the questions/comments below are
deal-breakers in my opinion.

1. The documentation references GNU Shepherd.  Is GNU Shepherd a hard
   requirement in order to use the facilities provided by the patch
   series?  Would it be possible to use, say, Systemd on a foreign
   distribution?  If so, could examples of those be documented in the
   appropriate place as well?

2. The code sets the default date to be 2020-01-01; does this date have
   any significance?  It might help for the code to have a comment
   explaining whether this value is completely arbitrary or whether it
   has some significance.  On a related note, it might help for the
   documentation to note dates that are less likely to work (in case
   values before a certain time aren't expected to be well supported).

Additionally, I'm not sure if this belongs in the manual or in the
cookbook (or elsewhere), but it would be helpful to have some small, but
complete, examples.  The documentation in the patch series mentions two
situations (time traps, and CPU microarchitecture optimizations) and for
each it would be helpful to have a self-contained full working example
referenced.  For the "time trap" use-case, perhaps one of the
submissions from the Ten Years Reproducibility Challenge could be used.

Ludovic Courtès Feb. 7, 2024, 5:33 p.m. UTC | #5
Hi Suhail,

Suhail <suhail@bayesians.ca> skribis:

> 1. The documentation references GNU Shepherd.  Is GNU Shepherd a hard
>    requirement in order to use the facilities provided by the patch
>    series?  Would it be possible to use, say, Systemd on a foreign
>    distribution?  If so, could examples of those be documented in the
>    appropriate place as well?

What this patch adds is a service one can use on Guix System.  Someone
who adds this service to their Guix System config can then run ‘herd
start build-vm’ to enable offloading to the virtual build machine.
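
As a rough illustration, adding the service to a Guix System
configuration might look like the fragment below.  This is a sketch,
not taken from the patches: it assumes the service type is exported by
(gnu services virtualization), the module this series modifies, and it
elides the rest of the ‘operating-system’ declaration.

```scheme
;; Sketch only: assumes ‘virtual-build-machine-service-type’ is exported
;; by (gnu services virtualization), the module touched by this series.
(use-modules (gnu)
             (gnu services virtualization))

(operating-system
  ;; … host name, bootloader, file systems, and so on, elided …
  (services (append (list (service virtual-build-machine-service-type))
                    %base-services)))
```

With such a configuration in place, ‘herd start build-vm’ is the
command mentioned above for starting the VM when it is needed.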

It’s possible to do something similar on a distro other than Guix System
but this patch series won’t help with that.  On another distro, one
would need to create a VM image and then manually start QEMU with the
right flags and set up offloading to that VM.  Nothing insurmountable,
but it’s quite tedious.

> 2. The code sets the default date to be 2020-01-01; does this date have
>    any significance?  It might help for the code to have a comment
>    explaining whether this value is completely arbitrary or whether it
>    has some significance.  On a related note, it might help for the
>    documentation to note dates that are less likely to work (in case
>    values before a certain time aren't expected to be well supported).

I picked a date in the past because I figured this would be the most
common use case at first: being able to rebuild things “in the past”
(the manual says that the default date is “in the past”).  Apart from
that, it has no significance.  I’ll add a comment as you suggest.
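
For illustration, overriding the default date could look like the
sketch below.  The names are assumptions: it supposes the configuration
record is called ‘virtual-build-machine’, that its ‘date’ field takes a
SRFI-19 date object, and that ‘cpu’ and ‘cpu-count’ fields exist under
those names.  The values are arbitrary examples.

```scheme
(use-modules (gnu services virtualization)
             (srfi srfi-19))            ;for ‘make-date’

;; Hypothetical settings: record and field names are assumptions.
(service virtual-build-machine-service-type
         (virtual-build-machine
          ;; SRFI-19 argument order:
          ;;   (make-date nanosecond second minute hour
          ;;              day month year zone-offset)
          ;; Here: 2019-05-01 00:00:00 UTC, the date of v1.0.0.
          (date (make-date 0 0 0 0 1 5 2019 0))
          (cpu "Westmere")              ;a QEMU CPU model name
          (cpu-count 4)))
```

Whether a given date actually lets a build succeed still depends on
what is being built.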

The manual cannot really say which date “won’t work” because (1) it
depends on what one is building, and (2) we simply don’t know in most
cases.

> Additionally, I'm not sure if this belongs in the manual or in the
> cookbook (or elsewhere), but it would be helpful to have some small, but
> complete, examples.  The documentation in the patch series mentions two
> situations (time traps, and CPU microarchitecture optimizations) and for
> each it would be helpful to have a self-contained full working example
> referenced.  For the "time trap" use-case, perhaps one of the
> submissions from the Ten Years Reproducibility Challenge could be used.

Yes, I agree we need complete examples (maybe not in the manual, rather
as blog posts and/or Cookbook entries I’d say).

Thanks for chiming in!

Ludo’.

Ludovic Courtès Feb. 10, 2024, 10:35 p.m. UTC | #6
Ludovic Courtès <ludo@gnu.org> skribis:

>   services: secret-service: Make the endpoint configurable.
>   vm: Add ‘date’ field to <virtual-machine>.
>   vm: Export <virtual-machine> accessors.
>   vm: Add ‘cpu-count’ field to <virtual-machine>.
>   marionette: Add #:peek? to ‘wait-for-tcp-port?’.
>   services: Add ‘virtual-build-machine’ service.

Pushed as 9edbb2d7a40c9da7583a1046e39b87633459f656 with an extra comment
explaining how the default date was chosen.

Ludo’.

Simon Tournier Feb. 14, 2024, 3:15 p.m. UTC | #7
Hi,

Thanks for your feedback.

On lun., 05 févr. 2024 at 15:45, Suhail via Guix-patches via <guix-patches@gnu.org> wrote:

> 1. The documentation references GNU Shepherd.  Is GNU Shepherd a hard
>    requirement in order to use the facilities provided by the patch
>    series?  Would it be possible to use, say, Systemd on a foreign
>    distribution?  If so, could examples of those be documented in the
>    appropriate place as well?

From my understanding, for now, it is for Guix System, and therefore
relies on Shepherd.  It might be possible to use the ’vm’ on foreign
distros, but some details must then be configured by hand, whereas on
Guix System they are set up automatically by the “extended service”.
More or less. :-)


> 2. The code sets the default date to be 2020-01-01; does this date have
>    any significance?  It might help for the code to have a comment
>    explaining whether this value is completely arbitrary or whether it
>    has some significance.  On a related note, it might help for the
>    documentation to note dates that are less likely to work (in case
>    values before a certain time aren't expected to be well supported).

For this date, nothing specific I guess.  The oldest commit that one can
reach using “guix time-machine” is from May 2019.

As an aside, it is hard to maintain a list of dates that “work”, because
nothing is written in stone and the passing of time cannot be frozen.

For instance, 6 months ago, a jump of ~4 years was just working [1].
And now, it is broken [2].  In a way, Guix provides features that make
real-world experiments possible that were simply impossible before.  So
things are fluctuating, on the way toward more robustness.

That said, based on my experience playing with “guix time-machine”, my
rule of thumb is: 2-3 years old is fine most of the time.  Older than 3
years is… fingers crossed.


1: https://simon.tournier.info/posts/2023-06-23-hackathon-repro.html
2: https://issues.guix.gnu.org/69058


> Additionally, I'm not sure if this belongs in the manual or in the
> cookbook (or elsewhere), but it would be helpful to have some small, but
> complete, examples.  The documentation in the patch series mentions two
> situations (time traps, and CPU microarchitecture optimizations) and for
> each it would be helpful to have a self-contained full working example
> referenced.  For the "time trap" use-case, perhaps one of the
> submissions from the Ten Years Reproducibility Challenge could be used.

The issue with time-trap is documented in the manual, see:

           Due to ‘guix time-machine’ relying on the “inferiors” mechanism
        (*note Inferiors::), the oldest commit it can travel to is commit
        ‘6298c3ff’ (“v1.0.0”), dated May 1st, 2019, which is the first
        release that included the inferiors mechanism.  An error is returned
        when attempting to navigate to older commits.

             Note: Although it should technically be possible to travel to such
             an old commit, the ease to do so will largely depend on the
             availability of binary substitutes.  When traveling to a distant
             past, some packages may not easily build from source anymore.  One
             such example are old versions of Python 2 which had time bombs in
             its test suite, in the form of expiring SSL certificates.  This
             particular problem can be worked around by setting the hardware
             clock to a value in the past before attempting the build.

        https://guix.gnu.org/manual/devel/en/guix.html#Invoking-guix-time_002dmachine


However, it seems hard to me to maintain a list of all the known
time traps.  For now, we are not rebuilding the past, so most time
traps go unnoticed.

As for CPU microarchitecture issues, I know of only two: Python [3] and
OpenBLAS [4].

All in all, we are in the infancy of this work and any help is
welcome. :-)

Cheers,
simon


3: Try “guix time-machine --commit=v1.0.0 -- describe”

4: Investigating a reproducibility failure
Konrad Hinsen <konrad.hinsen@fastmail.net>
Tue, 01 Feb 2022 15:05:40 +0100
id:m1a6fahebv.fsf@fastmail.net
https://lists.gnu.org/archive/html/guix-devel/2022-02
https://yhetil.org/guix/m1a6fahebv.fsf@fastmail.net

Follow-up:
Re: Investigating a reproducibility failure
zimoun <zimon.toutoune@gmail.com>
Wed, 02 Feb 2022 21:35:06 +0100
id:871r0l9fd1.fsf@gmail.com
https://lists.gnu.org/archive/html/guix-devel/2022-02
https://yhetil.org/guix/871r0l9fd1.fsf@gmail.com