diff mbox series

[bug#50072,v4,1/4] guix hash: Extract file hashing procedures.

Message ID 20220104200643.43374-2-maximedevos@telenet.be
State Accepted
Headers show
Series Add upstream updater for git-fetch origins | expand

Checks

Context Check Description
cbaines/applying patch fail View Laminar job
cbaines/issue success View issue
cbaines/applying patch fail View Laminar job
cbaines/issue success View issue
cbaines/applying patch fail View Laminar job
cbaines/issue success View issue

Commit Message

M Jan. 4, 2022, 8:06 p.m. UTC
From: Sarah Morgensen <iskarian@mgsn.dev>

* guix/scripts/hash.scm (guix-hash)[vcs-file?] (nar-hash, default-hash):
  Extract hashing logic to...
* guix/hash.scm (vcs-file?, file-hash*): ... these new procedures in this
  new file.

Modified-by: Maxime Devos <maximedevos@telenet.be>
---
 Makefile.am           |  1 +
 guix/hash.scm         | 68 +++++++++++++++++++++++++++++++++++++++++++
 guix/scripts/hash.scm | 22 +++-----------
 3 files changed, 73 insertions(+), 18 deletions(-)
 create mode 100644 guix/hash.scm

Comments

Simon Tournier Jan. 4, 2022, 10:22 p.m. UTC | #1
Hi Maxime,

Thanks!  All LGTM and I have two naive remarks.


On Tue, 04 Jan 2022 at 20:06, Maxime Devos <maximedevos@telenet.be> wrote:

> diff --git a/guix/hash.scm b/guix/hash.scm

[...]

> +(define-module (guix hash)
> +  #:use-module (gcrypt hash)
> +  #:use-module (guix serialization)
> +  #:use-module (srfi srfi-1)
> +  #:use-module (srfi srfi-11)
> +  #:export (vcs-file?
> +            file-hash*))
> +
> +(define (vcs-file? file stat)
> +  "Returns true if FILE is a version control system file."
> +  (case (stat:type stat)
> +    ((directory)
> +     (member (basename file) '(".bzr" ".git" ".hg" ".svn" "CVS")))
> +    ((regular)
> +     ;; Git sub-modules have a '.git' file that is a regular text file.
> +     (string=? (basename file) ".git"))
> +    (else
> +     #f)))

1) Why ’vcs-file?’ requires to be exported?  Is it used elsewhere?


> +(define* (file-hash* file #:key
> +                     (algorithm (hash-algorithm sha256))
> +                     (recursive? 'auto)

2) ’auto’ is confusing…

> +                     (select? (negate vcs-file?)))
> +  "Compute the hash of FILE with ALGORITHM.  If RECURSIVE? is #true or 'auto',
> +recurse into subdirectories of FILE, computing the combined hash (nar hash) of

…here I understand that ’auto’ means #true…

> +Keep in mind that the hash of a regular file depends on RECURSIVE?:
> +if the recursive hash is desired, it must be set to #true.  Otherwise, it must
> +be set to #false or 'auto'. In most situations, the non-recursive hash is desired
> +for regular files."

…but there it is the contrary. :-)  To me, #true/#false or #t/#f are
meaningful, especially when…

> +  (if (or (eq? recursive? #true)
> +          (and (eq? recursive? 'auto)

…the symbol ’auto’ is only used here.  IIRC all the series. :-)


(I know Ricardo is for instance in favor of #true/#false compared to
#t/#f.  I have an opinion but I would like to avoid another
bikeshed. ;-))


Cheers,
simon
M Jan. 5, 2022, 10:07 a.m. UTC | #2
zimoun schreef op di 04-01-2022 om 23:22 [+0100]:
> 2) ’auto’ is confusing…
> 
> > +                     (select? (negate vcs-file?)))
> > +  "Compute the hash of FILE with ALGORITHM.  If RECURSIVE? is #true or 'auto',
> > +recurse into subdirectories of FILE, computing the combined hash (nar hash) of
> 
> …here I understand that ’auto’ means #true…

Precisely, in the sense 'auto' means #true in that 'auto' recurses.
But sometimes #true / auto compute a different hash ...

> > +Keep in mind that the hash of a regular file depends on RECURSIVE?:
> > +if the recursive hash is desired, it must be set to #true.  Otherwise, it must
> > +be set to #false or 'auto'. In most situations, the non-recursive hash is desired
> > +for regular files."
> 
> …but there it is the contrary. :-)

No, when #:recursive? is 'auto' and the file is a directory, it
recurses. When it is 'auto' and the file is a regular file, then
it also recurses, albeit in a trivial way (because regular files don't
contain other files).

This comment explains that the 'recursive hash' (nar hash) and 'regular
hash' of a regular file are different, that usually you want the
regular hash for regular files, and implies that '#:recursive? auto'
usually does the right thing.

But if you really want the recursive hash for regular files, then you
can still compute that by setting #:recursive? #true.

>   To me, #true/#false or #t/#f are
> meaningful, especially when…
> 
> > +  (if (or (eq? recursive? #true)
> > +          (and (eq? recursive? 'auto)
> 
> …the symbol ’auto’ is only used here.  IIRC all the series. :-)

In ‘[PATCH v4 3/4] refresh: Support non-tarball sources.’, there's

> +                (let ((hash (file-hash* output)))

There, #:recursive? is 'auto'.

Greetings,
Maxime
M Jan. 5, 2022, 10:09 a.m. UTC | #3
zimoun schreef op di 04-01-2022 om 23:22 [+0100]:
> > +(define (vcs-file? file stat)
> > +  "Returns true if FILE is a version control system file."
> > +  (case (stat:type stat)
> > +    ((directory)
> > +     (member (basename file) '(".bzr" ".git" ".hg" ".svn" "CVS")))
> > +    ((regular)
> > +     ;; Git sub-modules have a '.git' file that is a regular text
> > file.
> > +     (string=? (basename file) ".git"))
> > +    (else
> > +     #f)))
> 
> 1) Why ’vcs-file?’ requires to be exported?  Is it used elsewhere?

It is used in (guix scripts hash):

         (select? (if (assq-ref opts 'exclude-vcs?)
                      (negate vcs-file?)
                      (const #t)))

Greetings,
Maxime.
Simon Tournier Jan. 5, 2022, 11:48 a.m. UTC | #4
Hi Maxime,

On Wed, 05 Jan 2022 at 11:07, Maxime Devos <maximedevos@telenet.be> wrote:

> Precisely, in the sense 'auto' means #true in that 'auto' recurses.
> But sometimes #true / auto compute a different hash ...

[...]

> No, when #:recursive? is 'auto' and the file is a directory, it
> recurses. When it is 'auto' and the file is a regular file, then
> it also recurses, albeit in a trivial way (because regular files don't
> contain other files).
>
> This comment explains that the 'recursive hash' (nar hash) and 'regular
> hash' of a regular file are different, that usually you want the
> regular hash for regular files, and implies that '#:recursive? auto'
> usually does the right thing.
>
> But if you really want the recursive hash for regular files, then you
> can still compute that by setting #:recursive? #true.

Thanks for explaining.

Hm, my confusion is probably the same as #51307 [1].

1: <https://issues.guix.gnu.org/51307#12>

Well, I think ’#:recursive?’ is confusing, and ’auto’ too because it is
not POLA for a plumbing function, IMHO.  Anyway. It is v4 and it is
ready to merge. :-)


I just propose to replace ’#:recursive?’ by ’#:nar-serializer?’ and a
docstring along these lines,

--8<---------------cut here---------------start------------->8---
  "Compute the hash of FILE with ALGORITHM.  If NAR-SERIALIZER? is
  #true, compute the combined hash (NAR hash) of FILE for which (SELECT?
  FILE STAT) returns true.

  If NAR-SERIALIZER? is #false, compute the regular hash using the
  default serializer.  It is meant to be used for a regular file.

  If NAR-SERIALIZER? is 'auto', when FILE is a directory, compute the
  combined hash (NAR hash).  When FILE is a regular file, compute the
  regular hash using the default serializer.  The option ’auto’ is meant
  to apply by default the expected hash computation.

  Symbolic links are not dereferenced unless NAR-SERIALIZER? is false.

  This procedure must only be used under controlled circumstances; the
  detection of symbolic links in FILE is racy.
--8<---------------cut here---------------end--------------->8---

WDYT?



>> > +  (if (or (eq? recursive? #true)
>> > +          (and (eq? recursive? 'auto)
>> 
>> …the symbol ’auto’ is only used here.  IIRC all the series. :-)
>
> In ‘[PATCH v4 3/4] refresh: Support non-tarball sources.’, there's
>
>> +                (let ((hash (file-hash* output)))
>
> There, #:recursive? is 'auto'.

Naive questions: Is it mandatory?  Or can be explicitly set?

(I have nothing against, just to me ’auto’ is somehow ambiguous and «In
the face of ambiguity, refuse the temptation to guess» as ’pyhon3 -c
'import this'’ says ;-))


Cheers,
simon
M Jan. 5, 2022, 12:10 p.m. UTC | #5
zimoun schreef op wo 05-01-2022 om 12:48 [+0100]:
> > > > +  (if (or (eq? recursive? #true)
> > > > +          (and (eq? recursive? 'auto)
> > > 
> > > …the symbol ’auto’ is only used here.  IIRC all the series. :-)
> > 
> > In ‘[PATCH v4 3/4] refresh: Support non-tarball sources.’, there's
> > 
> > > +                (let ((hash (file-hash* output)))
> > 
> > There, #:recursive? is 'auto'.
> 
> Naive questions: Is it mandatory?  Or can be explicitly set?
> 
> (I have nothing against, just to me ’auto’ is somehow ambiguous and
> «In
> the face of ambiguity, refuse the temptation to guess» as ’pyhon3 -c
> 'import this'’ says ;-))

'auto' is indeed a little ambigious, so I adjusted most calls to
file-hash* to set #:recursive? #true/#false appropriately in v3.
But in this particular case (guix/scripts/refresh.scm), it not known in
advance, so some guesswork is necessary.

Anyway, these calls to file-hash* are bothering me a little: can't
we just record the hash in the 'upstream-source' record or ask the
daemon for the hash of a store item (*) or something?

(*) Maybe query-path-hash works or maybe there are problems.
    Also, would be nice if there was a variant of query-path-hash
    that works on non-sha256 (in principle guix supports other hashes,
    though currently they are unused). Or maybe query-path-hash is
    works differently.

That would complicate this patch series more, so I'd prefer to delay
that for a future patch series.

Greetings,
Maxime.
M Jan. 5, 2022, 12:27 p.m. UTC | #6
zimoun schreef op wo 05-01-2022 om 12:48 [+0100]:
> Well, I think ’#:recursive?’ is confusing, and ’auto’ too because it is
> not POLA for a plumbing function, IMHO.  [...]

Principle of least authority, or principle of least astonishment?
I presume the latter.

> Anyway. It is v4 and it is ready to merge. :-)

I vote for a purple bikeshed! But your orange bikeshed would also keep
the bikes out of the rain.

> I just propose to replace ’#:recursive?’ by ’#:nar-serializer?’ and a
> docstring along these lines,
> 
> --8<---------------cut here---------------start------------->8---
>   "Compute the hash of FILE with ALGORITHM.  If NAR-SERIALIZER? is
>   #true, compute the combined hash (NAR hash) of FILE for which (SELECT?
>   FILE STAT) returns true.
> 
>   If NAR-SERIALIZER? is #false, compute the regular hash using the
>   default serializer.  It is meant to be used for a regular file.
> 
>   If NAR-SERIALIZER? is 'auto', when FILE is a directory, compute the
>   combined hash (NAR hash).  When FILE is a regular file, compute the
>   regular hash using the default serializer.  The option ’auto’ is meant
>   to apply by default the expected hash computation.
> 
>   Symbolic links are not dereferenced unless NAR-SERIALIZER? is false.
> 
>   This procedure must only be used under controlled circumstances; the
>   detection of symbolic links in FILE is racy.
> --8<---------------cut here---------------end--------------->8---
> 
> WDYT?

The nar hash / regular hash difference seems a very low-level detail to
me, that most (all?) users don't need to be bothered about. Except
maybe if FILE denotes an executable regular file, but file-hash* is
currently only used on tarballs/zip files/git checkouts, which aren't
executable files unless weirdness or some kind of attack is happening.

I think that, the ‘least astonishing’ thing to do here, is computing
the hash that would go into the 'hash' / 'sha256' field of 'origin'
objects by default, and not the nar hash for regular files that's
almost never used.

Greetings,
Maxime.
Simon Tournier Jan. 5, 2022, 12:58 p.m. UTC | #7
Hi Maxime,

On Wed, 05 Jan 2022 at 12:27, Maxime Devos <maximedevos@telenet.be> wrote:
> zimoun schreef op wo 05-01-2022 om 12:48 [+0100]:
>> Well, I think ’#:recursive?’ is confusing, and ’auto’ too because it is
>> not POLA for a plumbing function, IMHO.  [...]
>
> Principle of least authority, or principle of least astonishment?
> I presume the latter.

Latter. :-)

>> Anyway. It is v4 and it is ready to merge. :-)
>
> I vote for a purple bikeshed! But your orange bikeshed would also keep
> the bikes out of the rain.

:-)

>> --8<---------------cut here---------------start------------->8---
>>   "Compute the hash of FILE with ALGORITHM.  If NAR-SERIALIZER? is
>>   #true, compute the combined hash (NAR hash) of FILE for which (SELECT?
>>   FILE STAT) returns true.
>> 
>>   If NAR-SERIALIZER? is #false, compute the regular hash using the
>>   default serializer.  It is meant to be used for a regular file.
>> 
>>   If NAR-SERIALIZER? is 'auto', when FILE is a directory, compute the
>>   combined hash (NAR hash).  When FILE is a regular file, compute the
>>   regular hash using the default serializer.  The option ’auto’ is meant
>>   to apply by default the expected hash computation.
>> 
>>   Symbolic links are not dereferenced unless NAR-SERIALIZER? is false.
>> 
>>   This procedure must only be used under controlled circumstances; the
>>   detection of symbolic links in FILE is racy.
>> --8<---------------cut here---------------end--------------->8---

> The nar hash / regular hash difference seems a very low-level detail to
> me, that most (all?) users don't need to be bothered about. Except
> maybe if FILE denotes an executable regular file, but file-hash* is
> currently only used on tarballs/zip files/git checkouts, which aren't
> executable files unless weirdness or some kind of attack is happening.
>
> I think that, the ‘least astonishing’ thing to do here, is computing
> the hash that would go into the 'hash' / 'sha256' field of 'origin'
> objects by default, and not the nar hash for regular files that's
> almost never used.

I do not understand what you mean here.  ’file-hash*’ is a low-level
detail, no?  Whatever. :-)

Well, I am sorry if my 3 naive comments are not convenient.  Just, to be
sure, I am proposing:

 1) It is v4 and ready, I guess.  About ’auto’, I could have waken up
 earlier. :-) And it can be still improved later as you are saying in
 the other answer.  So, we are done, right?

 2) From my point of view, ’#:recursive?’ needs to be adapted in
 agreement with the discussion [1], quoting Ludo:

        Thinking more about it, I think confusion stems from the term
        “recursive” (inherited from Nix) because, as you write, it
        doesn’t necessarily have to do with recursion and directory
        traversal.

        Instead, it has to do with the serialization method.

        1: <http://issues.guix.gnu.org/issue/51307>

   And I do not have a strong opinion.  Just a naive remark.

 3) Whatever the keyword for the current v4 ’#:recursive?’ is picked, I
  still find the current docstring wording unclear.  In fact, reading
  the code is more helpful. :-) I am just proposing a reword which
  appears to me clearer than the current v4 one.  Maybe, I am missing
  the obvious.  Or maybe this proposed rewording is not clearer. :-)

WDYT?

Cheers,
simon
M Jan. 5, 2022, 2:06 p.m. UTC | #8
zimoun schreef op wo 05-01-2022 om 13:58 [+0100]:
> [...]
> > > --8<---------------cut here---------------start------------->8---
> > >   "Compute the hash of FILE with ALGORITHM.  If NAR-SERIALIZER? is
> > >   #true, compute the combined hash (NAR hash) of FILE for which (SELECT?
> > >   FILE STAT) returns true.
> > > 
> > >   If NAR-SERIALIZER? is #false, compute the regular hash using the
> > >   default serializer.  It is meant to be used for a regular file.
> > > 
> > >   If NAR-SERIALIZER? is 'auto', when FILE is a directory, compute the
> > >   combined hash (NAR hash).  When FILE is a regular file, compute the
> > >   regular hash using the default serializer.  The option ’auto’ is meant
> > >   to apply by default the expected hash computation.
> > > 
> > >   Symbolic links are not dereferenced unless NAR-SERIALIZER? is false.
> > > 
> > >   This procedure must only be used under controlled circumstances; the
> > >   detection of symbolic links in FILE is racy.
> > > --8<---------------cut here---------------end--------------->8---
> 
> > The nar hash / regular hash difference seems a very low-level detail to
> > me, that most (all?) users don't need to be bothered about. Except
> > maybe if FILE denotes an executable regular file, but file-hash* is
> > currently only used on tarballs/zip files/git checkouts, which aren't
> > executable files unless weirdness or some kind of attack is happening.
> > 
> > I think that, the ‘least astonishing’ thing to do here, is computing
> > the hash that would go into the 'hash' / 'sha256' field of 'origin'
> > objects by default, and not the nar hash for regular files that's
> > almost never used.
> 
> I do not understand what you mean here.  ’file-hash*’ is a low-level
> detail, no?  Whatever. :-)

I don't see what it matters if 'file-hash*' is classified as low-level
or high-level.  But what I do care about, is how easy to use file-hash*
is.

A low-level argument like #:nar-hash? #true/#false would make file-
hash* much more complicated: this patch series uses file-hash* to
compute the hash for 'origin' records, and the documentation of
'origin' doesn't mention 'nar' anywhere and if I search for 'nar hash'
in the manual, I find zero results.

Instead, file-hash* talks about directories, regular files, recursion
and claims that the default value of #:recursive? usually does the
right thing, so I don't have to look up any complicated terminology
to figure out how to use file-hash* to compute hashes for 'origin'
records.

And in the rare situation where file-hash* doesn't do the right thing,
the documentation tells me I can set #:recursive? #true/#false.
 
> Just, to be sure, I am proposing:
> 
>  1) It is v4 and ready, I guess.  About ’auto’, I could have waken up
>  earlier. :-) And it can be still improved later as you are saying in
>  the other answer.  So, we are done, right?

I think so, yes, except for a docstring change I'll send as a v5.
I'm also out of bikeshed paint.
Anway, keep in mind that I'm not a committer.

>  2) From my point of view, ’#:recursive?’ needs to be adapted in
>  agreement with the discussion [1], quoting Ludo:
> 
>         Thinking more about it, I think confusion stems from the term
>         “recursive” (inherited from Nix) because, as you write, it
>         doesn’t necessarily have to do with recursion and directory
>         traversal.
> 
>         Instead, it has to do with the serialization method.
> 
>         1: <http://issues.guix.gnu.org/issue/51307>
> 
>    And I do not have a strong opinion.  Just a naive remark.

I don't think the arguments for (guix scripts hash) apply directly
to (guix hash) -- (guix scripts hash) supports multiple serialisers:

 * none (regular in (guix hash) terminology)
 * git
 * nar
 * swh

so something like -S nar makes a lot of sense there. But (guix hash)
is only for computing the hash of something that would become a store
item after interning, more specifically is is currently only used for
computing the hash that would go into an (origin ...) object
(though I suppose it could be extended to support git/swh/... if
someone wants do that).

Possibly some name like
#:treat-it-as-a-directory-or-an-executable-file-or-a-symlink-and-
compute-the-alternative-hash-even-if-it-is-regular?
would be clearer and technically more accurate than #:recursive?, but
that's a bit of a mouthful.

>  3) Whatever the keyword for the current v4 ’#:recursive?’ is picked, I
>   still find the current docstring wording unclear.  In fact, reading
>   the code is more helpful. :-) I am just proposing a reword which
>   appears to me clearer than the current v4 one.  Maybe, I am missing
>   the obvious.  Or maybe this proposed rewording is not clearer. :-)

I've reworded it a bit; it falsely claimed that the nar hash was always
computed when recursive? is 'auto' (even if FILE is a regular file). It
also mentions executable files and SELECT? now.

Greetings,
Maxime.
Simon Tournier Jan. 5, 2022, 3:08 p.m. UTC | #9
Hi Maxime,

On Wed, 05 Jan 2022 at 14:06, Maxime Devos <maximedevos@telenet.be> wrote:

> A low-level argument like #:nar-hash? #true/#false would make file-
> hash* much more complicated: this patch series uses file-hash* to
> compute the hash for 'origin' records, and the documentation of
> 'origin' doesn't mention 'nar' anywhere and if I search for 'nar hash'
> in the manual, I find zero results.

I agree, it was my point #1. :-)

> Instead, file-hash* talks about directories, regular files, recursion
> and claims that the default value of #:recursive? usually does the
> right thing, so I don't have to look up any complicated terminology
> to figure out how to use file-hash* to compute hashes for 'origin'
> records.

I also agree, it was my point #3. :-)

> And in the rare situation where file-hash* doesn't do the right thing,
> the documentation tells me I can set #:recursive? #true/#false.

Yes.


>> Just, to be sure, I am proposing:
>> 
>>  1) It is v4 and ready, I guess.  About ’auto’, I could have waken up
>>  earlier. :-) And it can be still improved later as you are saying in
>>  the other answer.  So, we are done, right?
>
> I think so, yes, except for a docstring change I'll send as a v5.
> I'm also out of bikeshed paint.
> Anway, keep in mind that I'm not a committer.

I am not either.  If I had this power, I would have already pushed your
v4 with the docstring reword. :-)


>>  2) From my point of view, ’#:recursive?’ needs to be adapted in
>>  agreement with the discussion [1], quoting Ludo:

[...]

>>    And I do not have a strong opinion.  Just a naive remark.

[...]

> Possibly some name like
> #:treat-it-as-a-directory-or-an-executable-file-or-a-symlink-and-
> compute-the-alternative-hash-even-if-it-is-regular?
> would be clearer and technically more accurate than #:recursive?, but
> that's a bit of a mouthful.

I trust you, I do not have a strong opinion.  I was just a naive remark.


>>  3) Whatever the keyword for the current v4 ’#:recursive?’ is picked, I
>>   still find the current docstring wording unclear.  In fact, reading
>>   the code is more helpful. :-) I am just proposing a reword which
>>   appears to me clearer than the current v4 one.  Maybe, I am missing
>>   the obvious.  Or maybe this proposed rewording is not clearer. :-)
>
> I've reworded it a bit; it falsely claimed that the nar hash was always
> computed when recursive? is 'auto' (even if FILE is a regular file). It
> also mentions executable files and SELECT? now.

Thank you for your patient work.


Cheers,
simon
M Jan. 5, 2022, 3:54 p.m. UTC | #10
zimoun schreef op wo 05-01-2022 om 16:08 [+0100]:
> [...]
> Thank you for your patient work.

And thank you for double-checking things!
I'll drop a message on #guix that all appears to be ready for merging.

Greetings,
Maxime.
Ludovic Courtès Jan. 6, 2022, 10:06 a.m. UTC | #11
Hi,

Maxime Devos <maximedevos@telenet.be> skribis:

> 'auto' is indeed a little ambigious, so I adjusted most calls to
> file-hash* to set #:recursive? #true/#false appropriately in v3.
> But in this particular case (guix/scripts/refresh.scm), it not known in
> advance, so some guesswork is necessary.

We could move guesswork at the call site.  No big deal IMO, though.

> Anyway, these calls to file-hash* are bothering me a little: can't
> we just record the hash in the 'upstream-source' record or ask the
> daemon for the hash of a store item (*) or something?

<upstream-source> represents available source that has usually not been
downloaded yet (that’s what happens when running ‘guix refresh’ without
‘-u’), so it cannot contain the hash.

> That would complicate this patch series more, so I'd prefer to delay
> that for a future patch series.

Yes, this series LGTM as-is, but let’s keep these improvements in mind.

Ludo’.
Ludovic Courtès Jan. 6, 2022, 10:13 a.m. UTC | #12
Maxime Devos <maximedevos@telenet.be> skribis:

> zimoun schreef op wo 05-01-2022 om 13:58 [+0100]:

[...]

>>  2) From my point of view, ’#:recursive?’ needs to be adapted in
>>  agreement with the discussion [1], quoting Ludo:
>> 
>>         Thinking more about it, I think confusion stems from the term
>>         “recursive” (inherited from Nix) because, as you write, it
>>         doesn’t necessarily have to do with recursion and directory
>>         traversal.
>> 
>>         Instead, it has to do with the serialization method.
>> 
>>         1: <http://issues.guix.gnu.org/issue/51307>
>> 
>>    And I do not have a strong opinion.  Just a naive remark.
>
> I don't think the arguments for (guix scripts hash) apply directly
> to (guix hash) -- (guix scripts hash) supports multiple serialisers:
>
>  * none (regular in (guix hash) terminology)
>  * git
>  * nar
>  * swh

I think IWBN eventually for ‘file-hash*’ to have a #:serializer
argument.  (guix scripts hash) would then become a thin layer above
(guix hash).

That #:serializer would have a default value, probably “none”, but no
“auto-detection” as this amount of guessing would make for a fragile
interface IMO.

(Perhaps ‘file-hash’ is too generic-sounding; for a ‘source-code-hash’
or similarly-named procedure, some defaults and guesswork would be more
obvious.)

Thanks,
Ludo’.
M Jan. 6, 2022, 10:32 a.m. UTC | #13
Ludovic Courtès schreef op do 06-01-2022 om 11:13 [+0100]:
> That #:serializer would have a default value, probably “none”, but no
> “auto-detection” as this amount of guessing would make for a fragile
> interface IMO.

The auto-detection could be put into a separate procedure

(define (guess-nar-or-none-serializer file) ; to be renamed
  (if directory?
      'nar
      'none))

so that 'file-hash*' is not fragile, but auto-detection can still
happen if explicitely asked for:

  (file-hash* file #:serializer (guess-nar-or-serializer file))

WDYT?
Simon Tournier Jan. 6, 2022, 11:19 a.m. UTC | #14
Hi,

On Thu, 06 Jan 2022 at 11:13, Ludovic Courtès <ludo@gnu.org> wrote:

> I think IWBN eventually for ‘file-hash*’ to have a #:serializer
> argument.  (guix scripts hash) would then become a thin layer above
> (guix hash).
>
> That #:serializer would have a default value, probably “none”, but no
> “auto-detection” as this amount of guessing would make for a fragile
> interface IMO.

It was my idea behind when I proposed to rename #:recursive? to
#:nar-serializer?. ;-) The name #:nar-serializer? was somehow an
intermediary step since the patch was already ready and, as Maxime
explained, the change for #:serializer was implying another round for
adjusting and v4/v5 was already enough effort put in. :-)

Anyway, the patch is ready (probably already pushed \o/ :-)).  Nothing
prevent us (me?) to propose later a patch for this adjustment. ;-)

Cheers,
simon
diff mbox series

Patch

diff --git a/Makefile.am b/Makefile.am
index 8c5682a1c6..bc3d0087d0 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -99,6 +99,7 @@  MODULES =					\
   guix/extracting-download.scm			\
   guix/git-download.scm				\
   guix/hg-download.scm				\
+  guix/hash.scm					\
   guix/swh.scm					\
   guix/monads.scm				\
   guix/monad-repl.scm				\
diff --git a/guix/hash.scm b/guix/hash.scm
new file mode 100644
index 0000000000..19cbc41ad1
--- /dev/null
+++ b/guix/hash.scm
@@ -0,0 +1,68 @@ 
+;;; GNU Guix --- Functional package management for GNU
+;;; Copyright © 2021 Sarah Morgensen <iskarian@mgsn.dev>
+;;; Copyright © 2022 Maxime Devos <maximedevos@telenet.be>
+;;;
+;;; This file is part of GNU Guix.
+;;;
+;;; GNU Guix is free software; you can redistribute it and/or modify it
+;;; under the terms of the GNU General Public License as published by
+;;; the Free Software Foundation; either version 3 of the License, or (at
+;;; your option) any later version.
+;;;
+;;; GNU Guix is distributed in the hope that it will be useful, but
+;;; WITHOUT ANY WARRANTY; without even the implied warranty of
+;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;;; GNU General Public License for more details.
+;;;
+;;; You should have received a copy of the GNU General Public License
+;;; along with GNU Guix.  If not, see <http://www.gnu.org/licenses/>.
+
+(define-module (guix hash)
+  #:use-module (gcrypt hash)
+  #:use-module (guix serialization)
+  #:use-module (srfi srfi-1)
+  #:use-module (srfi srfi-11)
+  #:export (vcs-file?
+            file-hash*))
+
+(define (vcs-file? file stat)
+  "Returns true if FILE is a version control system file."
+  (case (stat:type stat)
+    ((directory)
+     (member (basename file) '(".bzr" ".git" ".hg" ".svn" "CVS")))
+    ((regular)
+     ;; Git sub-modules have a '.git' file that is a regular text file.
+     (string=? (basename file) ".git"))
+    (else
+     #f)))
+
+(define* (file-hash* file #:key
+                     (algorithm (hash-algorithm sha256))
+                     (recursive? 'auto)
+                     (select? (negate vcs-file?)))
+  "Compute the hash of FILE with ALGORITHM.  If RECURSIVE? is #true or 'auto',
+recurse into subdirectories of FILE, computing the combined hash (nar hash) of
+all files for which (SELECT?  FILE STAT) returns true.
+
+Symbolic links are not dereferenced unless RECURSIVE? is false.
+
+This procedure must only be used under controlled circumstances;
+the detection of symbolic links in FILE is racy.
+
+Keep in mind that the hash of a regular file depends on RECURSIVE?:
+if the recursive hash is desired, it must be set to #true.  Otherwise, it must
+be set to #false or 'auto'. In most situations, the non-recursive hash is desired
+for regular files."
+  (if (or (eq? recursive? #true)
+          (and (eq? recursive? 'auto)
+               ;; Don't change this to (eq? 'directory ...), because otherwise
+               ;; if 'file' denotes a symbolic link, the 'file-hash' below
+               ;; would dereference it -- dereferencing symbolic links would
+               ;; open an avoidable can of potential worms.
+               (not (eq? 'regular (stat:type (lstat file))))))
+      (let-values (((port get-hash)
+                    (open-hash-port algorithm)))
+        (write-file file port #:select? select?)
+        (force-output port)
+        (get-hash))
+      (file-hash algorithm file)))
diff --git a/guix/scripts/hash.scm b/guix/scripts/hash.scm
index d73e3d13dd..28d587b944 100644
--- a/guix/scripts/hash.scm
+++ b/guix/scripts/hash.scm
@@ -4,6 +4,7 @@ 
 ;;; Copyright © 2016 Jan Nieuwenhuizen <janneke@gnu.org>
 ;;; Copyright © 2018 Tim Gesthuizen <tim.gesthuizen@yahoo.de>
 ;;; Copyright © 2021 Simon Tournier <zimon.toutoune@gmail.com>
+;;; Copyright © 2021 Sarah Morgensen <iskarian@mgsn.dev>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -24,6 +25,7 @@ 
   #:use-module (gcrypt hash)
   #:use-module (guix serialization)
   #:use-module (guix ui)
+  #:use-module (guix hash)
   #:use-module (guix scripts)
   #:use-module (guix base16)
   #:use-module (guix base32)
@@ -46,20 +48,14 @@ 
 (define* (nar-hash file #:optional
                    (algorithm (assoc-ref %default-options 'hash-algorithm))
                    select?)
-  (let-values (((port get-hash)
-                (open-hash-port algorithm)))
-    (write-file file port #:select? select?)
-    (force-output port)
-    (get-hash)))
+  (file-hash* file #:algorithm algorithm #:select? select? #:recursive? #true))
 
 (define* (default-hash file #:optional
                        (algorithm (assoc-ref %default-options 'hash-algorithm))
                        select?)
   (match file
     ("-" (port-hash algorithm (current-input-port)))
-    (_
-     (call-with-input-file file
-       (cute port-hash algorithm <>)))))
+    (_ (file-hash* file #:algorithm algorithm #:recursive? #false))))
 
 (define* (git-hash file #:optional
                        (algorithm (assoc-ref %default-options 'hash-algorithm))
@@ -181,16 +177,6 @@  use '--serializer' instead~%"))
     (parse-command-line args %options (list %default-options)
                         #:build-options? #f))
 
-  (define (vcs-file? file stat)
-    (case (stat:type stat)
-      ((directory)
-       (member (basename file) '(".bzr" ".git" ".hg" ".svn" "CVS")))
-      ((regular)
-       ;; Git sub-modules have a '.git' file that is a regular text file.
-       (string=? (basename file) ".git"))
-      (else
-       #f)))
-
   (let* ((opts (parse-options))
          (args (filter-map (match-lambda
                             (('argument . value)