[bug#53818,0/3] Add Repology updater

Message ID	cover.1644147246.git.public@yoctocell.xyz
Headers	From: Xinglu Chen <public@yoctocell.xyz> To: 53818@debbugs.gnu.org Subject: [bug#53818] [PATCH 0/3] Add Repology updater Date: Sun, 06 Feb 2022 12:50:27 +0100
Series	Add Repology updater
  [bug#53818,0/3] Add Repology updater
  [bug#53818,1/3] git-download: Export <git-reference>.
  [bug#53818,2/3] import: Add 'repology' updater.
  [bug#53818,3/3] gnu: xorg-server-xwayland: Set 'repology-name' property.

Xinglu Chen Feb. 6, 2022, 11:50 a.m. UTC

Hi,

This patchset adds a new updater, which scans Repology[1] for updates.
It should technically support all packages in Guix!  :-)

The data on Repology isn’t as detailed as that of language-specific
repos, e.g., PyPI, so the updater doesn’t support things like ‘input
changes’.  If the source URL doesn’t contain the version verbatim[2], it
won’t be able to reconstruct the URL of the updated version, meaning that
‘guix refresh -u’ won’t work.

Because of the way ‘%updaters’ in (guix upstream) works, the Repology
updater is the first or second updater that is used (since it
technically works on every package), but because of the limitations I
mentioned above, the result might not always be the best.  The Repology
updater is mostly useful for things that don’t already have an updater,
e.g., ‘maven-dependency-tree’.  Would it make sense to hard-code the
‘%updaters’ variable and put the Repology updater last in the list?

[1]: <https://repology.org>
[2]: e.g., the version is “1.0.0” but the URL is
     “https://example.org/1_0_0.tar.gz”
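
To illustrate the limitation described in [2], here is a minimal shell
sketch of the reconstruction step.  It assumes the updater simply
substitutes the new version for the old one wherever it appears verbatim
in the source URL; 'rewrite_url' is a hypothetical helper written for
this example, not code from the patch.

```shell
# Hypothetical helper, not taken from the patch: print URL with every
# verbatim occurrence of OLD replaced by NEW.  Return 1 when OLD is
# empty or does not appear in URL, i.e. when no new URL can be derived.
rewrite_url () {
    ru_rest=$1 ru_old=$2 ru_new=$3 ru_out=
    [ -n "$ru_old" ] || return 1
    case $ru_rest in
        *"$ru_old"*) ;;
        *) return 1 ;;
    esac
    while :; do
        case $ru_rest in
            *"$ru_old"*)
                # Copy the text before the match, then the new version.
                ru_out=$ru_out${ru_rest%%"$ru_old"*}$ru_new
                ru_rest=${ru_rest#*"$ru_old"} ;;
            *) break ;;
        esac
    done
    printf '%s\n' "$ru_out$ru_rest"
}

rewrite_url "https://example.org/foo-1.0.0.tar.gz" 1.0.0 1.1.0
rewrite_url "https://example.org/1_0_0.tar.gz" 1.0.0 1.1.0 \
    || echo "version not found verbatim in URL"
```

The second call fails because “1.0.0” never appears in the URL, which is
the case where ‘guix refresh -u’ would have nothing to rewrite.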

Xinglu Chen (3):
  git-download: Export <git-reference>.
  import: Add 'repology' updater.
  gnu: xorg-server-xwayland: Set 'repology-name' property.

 Makefile.am               |   3 +
 doc/guix.texi             |   7 ++
 gnu/packages/xorg.scm     |   2 +
 guix/git-download.scm     |   3 +-
 guix/import/repology.scm  | 226 ++++++++++++++++++++++++++++++++++++++
 tests/import-repology.scm | 145 ++++++++++++++++++++++++
 6 files changed, 385 insertions(+), 1 deletion(-)
 create mode 100644 guix/import/repology.scm
 create mode 100644 tests/import-repology.scm


base-commit: 7c9ad54b0616112c7eea6dd02379616aef206490

M Feb. 6, 2022, 12:41 p.m. UTC | #1

Xinglu Chen wrote on Sun 06-02-2022 at 12:50 [+0100]:
> Because of the way ‘%updaters’ in (guix upstream) works, the Repology
> updater is the first or second updater that is used (since it
> technically works on every package), but because of the limitations I
> mentioned above, the result might not always be the best.  The Repology
> updater is mostly useful for things that don’t already have an updater,
> e.g., ‘maven-dependency-tree’.  Would it make sense to hard-code the
> ‘%updaters’ variable and put the Repology last in the list?

I would prefer not to hardcode %updaters and keep the current
discovery mechanism, such that people can experiment with updaters
outside a git checkout of guix and in channels.

FWIW it would be useful to have the same mechanism for importers.

However, it might be a good idea to do some _postprocessing_ on
the discovered list of updaters, e.g. they could be sorted on
'genericity' with 'stable-sort' (*):

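(use-modules (guix upstream))   ;for 'upstream-updater-name'

(define (genericity updater)
  ;; Larger values mean "more generic"; such updaters are tried later.
  (let ((name (symbol->string (upstream-updater-name updater))))
    (cond ((string-prefix? "generic-" name) 1)
          ((string=? name "repology") 2)  ;assuming it is named 'repology'
          (else 0))))

(define (less x y)
  ;; Strict comparison, so that 'stable-sort' keeps ties in place.
  (< (genericity x) (genericity y)))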

(*) stable-sort and not sort, to preserve alphabetical ordering
for updaters with the same genericity.
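
The effect of a stable sort can be tried out with sort(1), whose '-s'
flag (available in GNU and BSD sort) keeps lines with equal keys in
their input order.  In this sketch the numeric first field plays the
role of 'genericity'; the updater names and 'sort_by_genericity' are
only illustrative.

```shell
# Sort "GENERICITY NAME" lines on the numeric key only; '-s' disables
# the last-resort comparison, so ties keep their input order.
sort_by_genericity () {
    sort -s -n -k1,1
}

# The input is in alphabetical order of updater name.
printf '%s\n' '0 cran' '1 generic-git' '1 generic-html' '0 gnu' '2 repology' \
    | sort_by_genericity
```

Specific updaters come out first, still in alphabetical order among
themselves, with the most generic one last.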

Greetings,
Maxime.

Xinglu Chen Feb. 6, 2022, 3:17 p.m. UTC | #2

Maxime wrote on Sunday, 6 February 2022 at 13:41 +01:

> Xinglu Chen wrote on Sun 06-02-2022 at 12:50 [+0100]:
>> Because of the way ‘%updaters’ in (guix upstream) works, the Repology
>> updater is the first or second updater that is used (since it
>> technically works on every package), but because of the limitations I
>> mentioned above, the result might not always be the best.  The Repology
>> updater is mostly useful for things that don’t already have an updater,
>> e.g., ‘maven-dependency-tree’.  Would it make sense to hard-code the
>> ‘%updaters’ variable and put the Repology last in the list?
>
> I would prefer not to hardcode %updaters and keep the current
> discovery mechanism, such that people can experiment with updaters
> outside a git checkout of guix and in channels.

Good point.

> FWIW it would be useful to have the same mechanism for importers.
>
> However, it might be a good idea to do some _postprocessing_ on
> the discovered list of updaters, e.g. they could be sorted on
> 'genericity' with 'stable-sort' (*):
>
> (define (genericity x)
>   (cond ((it is "generic-SOMETHING") 1)
>         ((it is repology) 2)
>         (#true 0)))
>
> (define (less x y)
>   (<= (genericity x) (genericity y)))
>
> (*) stable-sort and not sort, to preserve alphabetical ordering
> for updaters with the same genericity.

That looks like a good idea.

Ludovic Courtès Feb. 8, 2022, 10:59 p.m. UTC | #3

Hi!

Xinglu Chen <public@yoctocell.xyz> writes:

> This patchset adds a new updater, which scans Repology[1] for updates.
> It should technically support all packages in Guix!  :-)

I wouldn’t want to spoil the party, but I’m only mildly enthusiastic.

Repology implements the same functionality as our updaters, so
repology.org is effectively “service as a software substitute” (SaaSS).

My preference would be to keep our existing updaters rather than
effectively ditch them and delegate the work to Repology.  It’s tempting
to think we can have both, but I’m not sure this would last long.

WDYT?

Ludo’.

Xinglu Chen Feb. 9, 2022, 12:52 p.m. UTC | #4

Hi,

Ludovic wrote on Tuesday, 8 February 2022 at 23:59 +01:

> Hi!
>
> Xinglu Chen <public@yoctocell.xyz> writes:
>
>> This patchset adds a new updater, which scans Repology[1] for updates.
>> It should technically support all packages in Guix!  :-)
>
> I wouldn’t want to spoil the party, but I’m only mildly enthusiastic.
>
> Repology implements the same functionality as our updaters, so
> repology.org is effectively “service as a software substitute”
> (SaaSS).

Right, but it tracks a lot more repositories than our updaters do,
so why not take advantage of that?

> My preference would be to keep our existing updaters rather than
> effectively ditch them and delegate the work to Repology.  It’s tempting
> to think we can have both, but I’m not sure this would last long.

The point of the Repology updater is to act as a fallback if none of
the other updaters can update a package, e.g., ‘maven-dependency-tree’.
I already mentioned that language-specific updaters usually provide more
accurate and detailed information, so they should be used when possible;
we aren’t losing anything here.

Nicolas Goaziou Feb. 9, 2022, 2:29 p.m. UTC | #5

Hello,

Xinglu Chen <public@yoctocell.xyz> writes:

> The point of the Repology updater is to act as a fallback if none of
> the other updaters can update a package, e.g., ‘maven-dependency-tree’.
> I already mentioned that language-specific updaters usually provide more
> accurate and detailed information, so they should be used when possible;
> we aren’t losing anything here.

One issue is that such an updater will introduce frequent false
positives. It is common for Repology to get the latest release wrong,
because some distribution is doing fancy versioning, or because
different distributions disagree about what is upstream.

I don't think we can rely on Repology's "newest" status. The updater may
need to provide its own version comparison tool, because Repology's tool
and Guix versioning do not play nice, in particular when using
`git-version'.

Regards,

Xinglu Chen Feb. 10, 2022, 6:17 p.m. UTC | #6

Nicolas wrote on Wednesday, 9 February 2022 at 15:29 +01:

> Hello,
>
> Xinglu Chen <public@yoctocell.xyz> writes:
>
>> The point of the Repology updater is to act as a fallback if none of
>> the other updaters can update a package, e.g., ‘maven-dependency-tree’.
>> I already mentioned that language-specific updaters usually provide more
>> accurate and detailed information, so they should be used when possible;
>> we aren’t losing anything here.
>
> One issue is that such an updater will introduce frequent false
> positives. It is common for Repology to get the latest release wrong,
> because some distribution is doing fancy versioning, or because
> different distributions disagree about what is upstream.

Yeah, I have noticed that it sometimes thinks that a version like
“20080323” is newer than something like “0.1.2-0.a1b2b3d”, even though
it might not necessarily be true.  This seems to be the case for a lot
of Common Lisp packages which usually don’t have any proper releases.

> I don't think we can rely on Repology's "newest" status. The updater may
> need to provide its own version comparison tool, because Repology's tool
> and Guix versioning do not play nice, in particular when using
> `git-version'.

In my testing, the “newest” status does a pretty good job (besides the
problem I mentioned above).

Some other “bad” updates I found[*] are listed below (excluding Common Lisp
packages).

--8<---------------cut here---------------start------------->8---
guile-ac-d-bus would be upgraded from 1.0.0-beta.0 to 1.0.0-beta0
sic would be upgraded from 1.2 to 1.2+20210506_058547e
tla2tools would be upgraded from 1.7.1-0.6932e19 to 20140313
quickjs would be upgraded from 2021-03-27 to 2021.03.27
stow would be upgraded from 2.3.1 to 2.3.1+5.32
cube would be upgraded from 4.3.5 to 2005.08.29
python-ratelimiter would be upgraded from 1.2.0 to 1.2.0.post0
gr-osmosdr would be upgraded from 0.2.3-0.a100eb0 to 0.2.3.20210128
countdown would be upgraded from 1.0.0 to 20150606
http-parser would be upgraded from 2.9.4-1.ec8b5ee to 2.9.4.20201223
xlsx2csv would be upgraded from 0.7.4 to 20200427211949
keynav would be upgraded from 0.20110708.0 to 20150730+4ae486d
--8<---------------cut here---------------end--------------->8---

It seems like most of these could be solved by checking if the version
scheme changed from semver to calver.  I think that’s a pretty good
result considering how many packages we have.
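
As a rough sketch of such a check (not part of the patch):
'version_scheme' and 'scheme_changed' are hypothetical helpers, and the
glob patterns are only a guess at what counts as a date-based version.
It flags entries such as cube or countdown, but pairs like quickjs or
stow stay within one scheme and would need further rules.

```shell
# Classify a version string as "calver" (starts with a plausible date),
# "semver" (loosely: dotted numbers), or "unknown".
version_scheme () {
    case $1 in
        # 20150606, 20200427211949, 20150730+4ae486d
        [12][0-9][0-9][0-9][0-1][0-9][0-3][0-9]*) echo calver ;;
        # 2005.08.29, 2021-03-27
        [12][0-9][0-9][0-9][.-][0-1][0-9][.-][0-3][0-9]*) echo calver ;;
        # 4.3.5, 1.2, 0.20110708.0
        [0-9]*.[0-9]*) echo semver ;;
        *) echo unknown ;;
    esac
}

# Succeed when OLD and NEW do not follow the same scheme.
scheme_changed () {
    [ "$(version_scheme "$1")" != "$(version_scheme "$2")" ]
}

scheme_changed 4.3.5 2005.08.29 && echo "cube: scheme changed, skip update"
scheme_changed 2.3.1 2.3.1+5.32 || echo "stow: same scheme, not caught"
```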

[*] Until I ran into <https://issues.guix.gnu.org/53923>

Nicolas Goaziou Feb. 10, 2022, 7:30 p.m. UTC | #7

Hello,

Xinglu Chen <public@yoctocell.xyz> writes:

> Yeah, I have noticed that it sometimes thinks that a version like
> “20080323” is newer than something like “0.1.2-0.a1b2b3d”, even though
> it might not necessarily be true.  This seems to be the case for a lot
> of Common Lisp packages which usually don’t have any proper releases.

[...]

> In my testing, the “newest” status does a pretty good job (besides the
> problem I mentioned above)
>
> Some other “bad” updates I found[*] are listed below (excluding Common Lisp
> packages).
>
> --8<---------------cut here---------------start------------->8---
> guile-ac-d-bus would be upgraded from 1.0.0-beta.0 to 1.0.0-beta0
> sic would be upgraded from 1.2 to 1.2+20210506_058547e
> tla2tools would be upgraded from 1.7.1-0.6932e19 to 20140313
> quickjs would be upgraded from 2021-03-27 to 2021.03.27
> stow would be upgraded from 2.3.1 to 2.3.1+5.32
> cube would be upgraded from 4.3.5 to 2005.08.29
> python-ratelimiter would be upgraded from 1.2.0 to 1.2.0.post0
> gr-osmosdr would be upgraded from 0.2.3-0.a100eb0 to 0.2.3.20210128
> countdown would be upgraded from 1.0.0 to 20150606
> http-parser would be upgraded from 2.9.4-1.ec8b5ee to 2.9.4.20201223
> xlsx2csv would be upgraded from 0.7.4 to 20200427211949
> keynav would be upgraded from 0.20110708.0 to 20150730+4ae486d
> --8<---------------cut here---------------end--------------->8---
>
> It seems like most of these could be solved by checking if the version
> scheme changed from semver to calver.  I think that’s a pretty good
> result considering how many packages we have.

I think this would not cut it.

As I wrote, almost any package using `git-version' is going to create
a version mismatch. This is because we consider

  (git-version "X.Y" revision commit)

to be greater than "X.Y" whereas Repology either ignores the version or
considers it to be a pre-release before "X.Y". See, e.g., the "emacs:circe"
project, or "joycond". This is, I think, the most prominent category of
comparison failures.
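
To make the mismatch concrete, here is a small shell sketch.  It assumes
the VERSION-REVISION.COMMIT layout visible in versions quoted earlier in
this thread (e.g., "0.2.3-0.a100eb0"), with the commit cut to seven
characters; 'git_version' and 'base_version' are hypothetical shell
helpers, and the commit ID below is made up.

```shell
# Mimic the layout of a `git-version' string: VERSION-REVISION.COMMIT,
# with COMMIT cut to its first seven characters (assumed layout).
git_version () {
    printf '%s-%s.%s\n' "$1" "$2" "$(printf '%s' "$3" | cut -c1-7)"
}

# Strip a trailing "-REVISION.COMMIT" suffix, if any, so that only the
# upstream part is compared with what Repology reports.
base_version () {
    printf '%s\n' "${1%-[0-9]*.[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]}"
}

v=$(git_version 0.2.3 0 a100eb0123456789abcdef0123456789abcdef01)
echo "$v"            # the Guix-side version string
base_version "$v"    # the part an updater could compare on its own
base_version 1.2     # plain versions are left untouched
```

Comparing on the stripped version is only one possible approach; it
would still need to decide whether an upstream "X.Y" counts as newer
than a Guix snapshot based on "X.Y".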

Also, there are versions that are plain wrong, e.g., "emacs:csv-mode",
and that disqualify the correct, up-to-date version. There are also version
disagreements in, e.g., "colobot", or upstream disagreements, e.g.,
"emacs:scala-mode".

See also "emacs:geiser-racket", "python:folium" or "higan" for other
projects with versioning issues.

Regards,

Ludovic Courtès Feb. 10, 2022, 8:49 p.m. UTC | #8

Hi,

Xinglu Chen <public@yoctocell.xyz> writes:

> Ludovic wrote on Tuesday, 8 February 2022 at 23:59 +01:

[...]

>> Repology implements the same functionality as our updaters, so
>> repology.org is effectively “service as a software substitute”
>> (SaaSS).
>
> Right, but it tracks a lot more repositories than our updaters do,
> so why not take advantage of that?

True, but this is kinda self-reinforcing: it’ll sure keep tracking more
if we stop maintaining our own code (IIRC, Repology was started after
‘guix refresh’ and I believe it’s maintained by one person.)

>> My preference would be to keep our existing updaters rather than
>> effectively ditch them and delegate the work to Repology.  It’s tempting
>> to think we can have both, but I’m not sure this would last long.
>
> The point of the Repology updater is to act as a fallback if none of
> the other updaters can update a package, e.g., ‘maven-dependency-tree’.
> I already mentioned that language-specific updaters usually provide more
> accurate and detailed information, so they should be used when possible;
> we aren’t losing anything here.

Hmm yes, could be.  OTOH, like Nicolas writes, we would probably need
some filtering or post-processing to reduce false-positives, right?

Do you have examples where our updaters perform poorly and where
Repology does a better job?  I wonder if there are lessons to be drawn
and bugs to be fixed.

Thanks,
Ludo’.

Nicolas Goaziou Feb. 14, 2022, 10:40 a.m. UTC | #9

Hello,

Ludovic Courtès <ludo@gnu.org> writes:

> Do you have examples where our updaters perform poorly and where
> Repology does a better job?  I wonder if there are lessons to be drawn
> and bugs to be fixed.

As a data point, I'm sorry to say that our updaters are useless to me.

I watch over more than one thousand packages. I would have a hard time
expressing what are those packages to the updater, besides writing and
keeping up-to-date a huge manifest file. Assuming I could manage this,
fetching all version information would take considerable time, and,
since many packages are from GitHub, the party would stop early anyway
with GitHub refusing to proceed and requesting some token I don't have.

OTOH, using Repology API, I get the information I want in about ten
seconds. Sure, I need to eyeball through the results, filtering false
positives (around 4% in my case), but it still is a practical solution.

IMO, to be useful, updaters may need to rely on an external service,
which may, or may not, belong to the Guix ecosystem. They also need
a good UI.

I don't want to sound too negative, though. And current updaters are
certainly good enough when watching over a couple of packages, which
might be the most common use-case.

Cheers,

M Feb. 14, 2022, 4:07 p.m. UTC | #10

Nicolas Goaziou wrote on Mon 14-02-2022 at 11:40 [+0100]:
> [...]. Assuming I could manage this,
> fetching all version information would take considerable time, and,
> since many packages are from GitHub, the party would stop early anyway
> with GitHub refusing to proceed and requesting some token I don't have.
> 
> OTOH, using Repology API, I get the information I want in about ten
> seconds. Sure, I need to eyeball through the results, filtering false
> positives (around 4% in my case), but it still is a practical solution.
> 
> IMO, to be useful, updaters may need to rely on an external service,
> which may, or may not, belong to the Guix ecosystem. They also need
> a good UI.

To avoid exceeding API limits and reduce network traffic, I suggest the
following change:

  Cache HTTP responses, using http-fetch/cached instead of
  http-fetch.  When something is in the cache and not expired,
  this avoids some network traffic and does not bring us closer
  to the API limits.

  When it is expired (and in the cache), then at least
  http-fetch/cached makes a conditional request with
  If-Modified-Since, which GitHub does not count against the rate
  limit, assuming a ‘304 Not Modified’ response!

That does not address all your concerns but it should help I think.
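
Outside of Guix, the same idea can be sketched with curl, whose
'--time-cond FILE' option sends 'If-Modified-Since' based on FILE's
modification time.  This is an analogy only, not how http-fetch/cached
is implemented; 'fetch_cached', the CURL override, and the URL are
hypothetical names for this example.

```shell
# Hypothetical helper: fetch URL into CACHE, making the request
# conditional when a cached copy already exists.
fetch_cached () {
    fc_url=$1 fc_cache=$2
    if [ -f "$fc_cache" ]; then
        # Send 'If-Modified-Since' using the cached file's mtime.
        set -- --time-cond "$fc_cache"
    else
        set --
    fi
    # '--remote-time' stamps the file with the server's modification time.
    # CURL can be overridden, e.g. CURL=echo for a dry run.
    ${CURL:-curl} --silent --fail --remote-time "$@" \
        --output "$fc_cache" "$fc_url"
}

# Dry run: show the curl arguments without touching the network.
CURL=echo fetch_cached https://example.org/releases.json /tmp/releases.json
```

A real cache would also need an expiry check before deciding whether to
contact the server at all, which this sketch leaves out.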

Greetings,
Maxime.

Ludovic Courtès Feb. 14, 2022, 4:58 p.m. UTC | #11

Hi Nicolas,

Nicolas Goaziou <mail@nicolasgoaziou.fr> writes:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Do you have examples where our updaters perform poorly and where
>> Repology does a better job?  I wonder if there are lessons to be drawn
>> and bugs to be fixed.
>
> As a data point, I'm sorry to say that our updaters are useless to me.
>
> I watch over more than one thousand packages. I would have a hard time
> expressing what are those packages to the updater, besides writing and
> keeping up-to-date a huge manifest file. Assuming I could manage this,
> fetching all version information would take considerable time, and,
> since many packages are from GitHub, the party would stop early anyway
> with GitHub refusing to proceed and requesting some token I don't have.
>
> OTOH, using Repology API, I get the information I want in about ten
> seconds. Sure, I need to eyeball through the results, filtering false
> positives (around 4% in my case), but it still is a practical solution.

(I’m confused because my understanding of what you first wrote was that
Repology had too many false positives to be useful.)

You wrote about your feelings and that’s insightful, but can we focus on
specific examples where updaters are not helpful so we can better
understand and improve the situation?

> IMO, to be useful, updaters may need to rely on an external service,
> which may, or may not, belong to the Guix ecosystem.

All the updaters rely on an external service.  Relying on a centralized
SaaSS is different, though.

> They also need a good UI.

Do you have examples of what’s wrong on the UI side?

To me, the main shortcoming is that ‘guix refresh’ doesn’t tell you that
if you update X, you may also need to update Y and Z.  That info is not
always available, but it is available in repos such as PyPI and ELPA.

Thanks,
Ludo’.

Nicolas Goaziou Feb. 14, 2022, 6:42 p.m. UTC | #12

Hello,

Ludovic Courtès <ludo@gnu.org> writes:

> (I’m confused because my understanding of what you first wrote was that
> Repology had too many false positives to be useful.)

Repology is okay for my use-case because I've gotten accustomed to its
quirks. I wouldn't recommend it as a fall-back solution for Guix in its
current form, tho, for the reason above. Does that make sense?

> You wrote about your feelings and that’s insightful, but can we focus on
> specific examples where updaters are not helpful so we can better
> understand and improve the situation?

I wrote about the following facts:
- it is difficult to specify a large number of packages,
- when you have specified a large number of packages, the processing is
  slow,
- checking GitHub fails for me.

I don't see any feelings in there.

>> IMO, to be useful, updaters may need to rely on an external service,
>> which may, or may not, belong to the Guix ecosystem.
>
> All the updaters rely on an external service.  Relying on a centralized
> SaaSS is different, though.

Fair enough. I meant an external centralized service.

> Do you have examples of what’s wrong on the UI side?

It has no Emacs interface. Nuff said. ;)

Again, I don't know how to efficiently specify many packages, e.g., all
Emacs packages, or all games. Also, reading through a massive output in
the terminal is not very user friendly, IMO.

> To me, the main shortcoming is that ‘guix refresh’ doesn’t tell you that
> if you update X, you may also need to update Y and Z.  That info is not
> always available, but it is available in repos such as PyPI and ELPA.

I don't think solving this is realistic. Dependencies are sometimes very
loose.

Regards,

Ludovic Courtès Feb. 15, 2022, 9:57 a.m. UTC | #13

Hi!

Nicolas Goaziou <mail@nicolasgoaziou.fr> writes:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> (I’m confused because my understanding of what you first wrote was that
>> Repology had too many false positives to be useful.)
>
> Repology is okay for my use-case because I've gotten accustomed to its
> quirks. I wouldn't recommend it as a fall-back solution for Guix in its
> current form, tho, for the reason above. Does that make sense?

It sure does, thanks for explaining.

> I wrote about the following facts:
> - it is difficult to specify a large number of packages,
> - when you have specified a large number of packages, the processing is
>   slow,
> - checking GitHub fails for me.

Alright, I had missed that.

Regarding “specifying many packages”, do examples like these work for
you:

  • guix refresh -t elpa

  • guix refresh $(guix package -A ^emacs- | cut -f1)

  • guix refresh -r emacs-emms

  • guix refresh -s non-core -t generic-git

  • guix refresh -m packages-i-care-about.scm

If not, what kind of selection mechanism could help?  ‘-s’ currently
accepts only two values, but we could augment it.

Regarding slow processing, it very much depends on the updater.  For
example, on a warm cache, ‘guix refresh -t gnu’ is relatively fast
thanks to caching:

--8<---------------cut here---------------start------------->8---
$ time guix refresh -t gnu
gnu/packages/wget.scm:48:13: wget would be upgraded from 1.21.1 to 1.21.2
gnu/packages/tls.scm:86:13: libtasn1 would be upgraded from 4.17.0 to 4.18.0

[...]

real	0m38.314s
user	0m38.981s
sys	0m0.164s
--8<---------------cut here---------------end--------------->8---

It could be that some updaters do many HTTP round trips without any
caching, which slows things down.

[...]

>> Do you have examples of what’s wrong on the UI side?
>
> It has no Emacs interface. Nuff said. ;)

True!  :-)

I realize this is going off-topic, but let’s see if we can improve the
existing infrastructure to make it more convenient.

Thanks,
Ludo’.

Nicolas Goaziou Feb. 16, 2022, 12:43 p.m. UTC | #14

Hello,

Ludovic Courtès <ludo@gnu.org> writes:

> Regarding “specifying many packages”, do examples like these work for
> you:
>
>   • guix refresh -t elpa

I don't find it very useful in practice. As a user, the packages I'm
interested in probably rely on more than one updater. I'm not even
supposed to know what updater relates to a given package.

I actually only use this when I know a GNU ELPA package is outdated
already, and I want it to compute the hash for me:

  ./pre-inst-env guix refresh -t elpa -u emacs-foo

>   • guix refresh $(guix package -A ^emacs- | cut -f1)

This one is interesting. This illustrates that the UI is, from my point
of view, a bit lacking. It would be a nice improvement to add a regexp
mechanism built-in, like in "guix search".

In any case, this fails after reporting status of around 50 packages,
with this time:

  real	0m41,881s
  user	0m12,155s
  sys	0m0,726s

Assuming I don't get the "rate limit exceeded" error, at this rate, it
would take more than 15 minutes to check all the packages in
"emacs-xyz.scm". This is a bit long.

I don't see how this could reasonably be made faster without relying on
an external centralized service doing the checks regularly (e.g., once
a day) before the user actually requests them.

>   • guix refresh -r emacs-emms

It also fails with the "rate limit exceeded". While this sounds
theoretically nice, I wouldn't know how to make use of it yet.

>   • guix refresh -s non-core -t generic-git

See above about "-t elpa".

>   • guix refresh -m packages-i-care-about.scm

Yes, obviously, this is nice, too. However, it doesn't scale if you
need to specify 1000+ packages.

> If not, what kind of selection mechanism could help?  ‘-s’ currently
> accepts only two values, but we could augment it.

Besides regexp matching, it may be useful to filter packages per module,
or source file name. Package categories is a bit awkward, tho, and
probably not satisfying.

> I realize this is going off-topic, but let’s see if we can improve the
> existing infrastructure to make it more convenient.

Is it really off-topic?

Anyway, all of this is only one data point, and, as a reminder,
I certainly don't want to disparage either Xinglu Chen's work, or
current "guix refresh" functionality.

HTH,

Ludovic Courtès Feb. 17, 2022, 10:35 a.m. UTC | #15

Hi,

Nicolas Goaziou <mail@nicolasgoaziou.fr> writes:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Regarding “specifying many packages”, do examples like these work for
>> you:
>>
>>   • guix refresh -t elpa
>
> I don't find it very useful in practice. As a user, the packages I'm
> interested in probably rely on more than one updater. I'm not even
> supposed to know what updater relates to a given package.

Right, that’s more for packagers than for users.

As a user, what works better is:

  guix refresh -r $(guix package -I |cut -f1) -s non-core

… or simply ‘--with-latest’, if I’m not interested in updating package
definitions.

>>   • guix refresh $(guix package -A ^emacs- | cut -f1)
>
> This one is interesting. This illustrates that the UI is, from my point
> of view, a bit lacking. It would be a nice improvement to add a regexp
> mechanism built-in, like in "guix search".

Makes sense, we can do that.

> In any case, this fails after reporting status of around 50 packages,
> with this time:
>
>   real	0m41,881s
>   user	0m12,155s
>   sys	0m0,726s

How does it fail?  If it’s the GitHub rate limit, then there’s only one
answer: you have to provide a token.

> Assuming I don't get the "rate limit exceeded" error, at this rate, it
> would take more than 15 minutes to check all the packages in
> "emacs-xyz.scm". This is a bit long.

> I don't see how this could reasonably be made faster without relying on
> an external centralized service doing the checks regularly (e.g., once
> a day) before the user actually requests them.

Maybe you’re right, but before jumping to the conclusion, we have to
investigate a bit.  Like I wrote, the ‘gnu’ updater for instance fetches
a single file that remains in cache afterwards—the cost is constant.

We should identify updaters that have linear cost and check what can be
done.  ‘github’, ‘generic-html’, and ‘generic-git’ are of that kind.

Now, the command I gave above looks at 1,134 packages, so is it even
something you want to do as a packager?

>>   • guix refresh -r emacs-emms
>
> It also fails with the "rate limit exceeded". While this sounds
> theoretically nice, I wouldn't know how to make use of it yet.
>
>>   • guix refresh -s non-core -t generic-git
>
> See above about "-t elpa".
>
>>   • guix refresh -m packages-i-care-about.scm
>
> Yes, obviously, this is nice, too. However, it doesn't scale if you
> need to specify 1000+ packages.

You can use ‘fold-packages’ and have three lines that return a manifest
of 10K packages if you want it.

Honestly, since I mostly rely on others these days :-), I’m no longer
sure what the packager’s workflow is.  Also, the level of coupling
varies greatly between, say, a C/C++ package and a set of
Python/Emacs/Rust packages.

I find that ‘guix refresh’ works fine for loosely-coupled C/C++ packages
where often you’d want to upgrade packages individually.

But for Python and Emacs packages, what do we want?  Do packagers always
want to check 1K+ packages at once?  Or are there other patterns?

>> If not, what kind of selection mechanism could help?  ‘-s’ currently
>> accepts only two values, but we could augment it.
>
> Besides regexp matching, it may be useful to filter packages per module,
> or source file name. Package categories is a bit awkward, tho, and
> probably not satisfying.

We can add options to make it more convenient, but it’s already
possible:

  guix refresh $(guix package -A | grep emacs-xyz.scm | cut -f1)

>> I realize this is going off-topic, but let’s see if we can improve the
>> existing infrastructure to make it more convenient.
>
> Is it really off-topic?
>
> Anyway, all of this is only one data point, and, as a reminder,
> I certainly don't want to disparage either Xinglu Chen's work, or
> current "guix refresh" functionality.

Yup, same here!

I think we have nice infrastructure but you raise important
shortcomings.  What Xinglu Chen did might in fact be one way to address
it, and there may also be purely UI issues that we could address.

Thanks,
Ludo’.

Simon Tournier Feb. 17, 2022, 11:17 a.m. UTC | #16

Hi,

On Thu, 17 Feb 2022 at 11:35, Ludovic Courtès <ludo@gnu.org> wrote:

>>>   • guix refresh $(guix package -A ^emacs- | cut -f1)
>>
>> This one is interesting. This illustrates that the UI is, from my point
>> of view, a bit lacking. It would be a nice improvement to add a regexp
>> mechanism built-in, like in "guix search".
>
> Makes sense, we can do that.

I agree the UI is not nice.  At the command line, I never read the
complete output of “guix package -A”; I always pipe it through “cut
-f1”.  I think the complete display is only useful for third-party
tools; the only one I have in mind is emacs-guix.  So, are we
maintaining this CLI only for backward compatibility, when we could
change both?

Something more useful as output would be:

   name version synopsis

Whatever. :-)

Even the internal etc/completion/bash/guix has to pipe:

--8<---------------cut here---------------start------------->8---
_guix_complete_available_package ()
{
    local prefix="$1"
    if [ -z "$_guix_available_packages" ]
    then
	# Cache the complete list because it rarely changes and makes
	# completion much faster.
	_guix_available_packages="$(${COMP_WORDS[0]} package -A 2> /dev/null \
                                    | cut -f1)"
    fi
    COMPREPLY+=($(compgen -W "$_guix_available_packages" -- "$prefix"))
}
--8<---------------cut here---------------end--------------->8---

Last, I am not convinced that “guix search” would help here,
because:

  1. the output has to be piped through recsel,
  2. it is much slower than “package -A” [1].

1: <https://issues.guix.gnu.org/39258#119>

>>>   • guix refresh -m packages-i-care-about.scm
>>
>> Yes, obviously, this is nice, too. However, it doesn't scale if you
>> need to specify 1000+ packages.

[...]

>> In any case, this fails after reporting status of around 50 packages,
>> with this time:
>>
>>   real	0m41,881s
>>   user	0m12,155s
>>   sys	0m0,726s
>
> How does it fail?  If it’s the GitHub rate limit, then there’s only one
> answer: you have to provide a token.

Let's mimic a collection of 1000+ packages I care about.  Consider this
manifest for packages using r-build-system only…

--8<---------------cut here---------------start------------->8---
(use-modules (guix packages)
             (gnu packages)
             (guix build-system r))

(packages->manifest
 (fold-packages (lambda (package result)
                  (if (eq? (package-build-system package) r-build-system)
                      (cons package result)
                      result))
                '()))
--8<---------------cut here---------------end--------------->8---

…it hits the issue of the GitHub token…

--8<---------------cut here---------------start------------->8---
gnu/packages/bioconductor.scm:6034:13: 1.66.0 is already the latest version of r-plgem
gnu/packages/bioconductor.scm:6011:13: 1.22.0 is already the latest version of r-rots
gnu/packages/bioconductor.scm:12614:2: warning: 'bioconductor' updater failed to determine available releases for r-fourcseq
Backtrace:
          13 (primitive-load "/home/simon/.config/guix/current/bin/guix")

[...]

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Error downloading release information through the GitHub
API. This may be fixed by using an access token and setting the environment
variable GUIX_GITHUB_TOKEN, for instance one procured from
https://github.com/settings/tokens

real	10m27.306s
user	4m14.077s
sys	0m12.467s
--8<---------------cut here---------------end--------------->8---

…when most R packages come from CRAN or Bioconductor archives.

Basically, ~5000 packages come from GitHub, which represents ~25% of
the total.  Therefore, one needs to be really lucky to update many
packages and not hit the GitHub rate limit.

Yes, large collections of packages cannot be updated easily.  Somehow,
it is an issue from upstream and it is hard to fix… except by
duplicating upstream or providing a token. :-)

Well, using the external centralized Repology service is a first step to
update at scale, no?  A second step could be to have this feature
included in the Data Service; but first we have other fish to fry,
IMHO. :-)

>> Assuming I don't get the "rate limit exceeded" error, at this rate, it
>> would take more than 15 minutes to check all the packages in
>> "emacs-xyz.scm". This is a bit long.
>>
>> I don't see how this could reasonably be made faster without relying on
>> an external centralized service doing the checks regularly (e.g., once
>> a day) before the user actually requests them.
>
> Maybe you’re right, but before jumping to the conclusion, we have to
> investigate a bit.  Like I wrote, the ‘gnu’ updater for instance fetches
> a single file that remains in cache afterwards—the cost is constant.

Repology acts as this “external centralized service”, no?  On one hand,
it is a practical solution; especially by being fast enough.  On the
other hand, it serves a few false positives (say 4%, to fix ideas).

Nicolas, considering the complexity of packages and their origins, do
you think it would be possible to do better (fast and accurate) than
Repology at scale?

>>>   • guix refresh -m packages-i-care-about.scm
>>
>> Yes, obviously, this is a nice, too. However, it doesn't scale if you
>> need to specify 1000+ packages.
>
> You can use ‘fold-packages’ and have three lines that return a manifest
> of 10K packages if you want it.

Yes, see example above.

>>> If not, what kind of selection mechanism could help?  ‘-s’ currently
>>> accepts only two values, but we could augment it.
>>
>> Besides regexp matching, it may be useful to filter packages per module,
>> or source file name. Package categories is a bit awkward, tho, and
>> probably not satisfying.
>
> We can add options to make it more convenient, but it’s already
> possible:

Since these features are advanced, why not keep the CLI simple and
instead rely on manifest files for complex filtering?

>>> I realize this is going off-topic, but let’s see if we can improve the
>>> existing infrastructure to make it more convenient.

[...]

> I think we have nice infrastructure but you raise important
> shortcomings.  What Xinglu Chen did might in fact be one way to address
> it, and there may also be purely UI issues that we could address.

All the points raised here are important but appear to me orthogonal
to the patch series. :-)

Cheers,
simon

Nicolas Goaziou Feb. 18, 2022, 10:28 a.m. UTC | #17

Hello,

zimoun <zimon.toutoune@gmail.com> writes:

> On Thu, 17 Feb 2022 at 11:35, Ludovic Courtès <ludo@gnu.org> wrote:

>> How does it fail?  If it’s the GitHub rate limit, then there’s only one
>> answer: you have to provide a token.

IIUC, I have to register on GitHub to create this token. This is a bit
sad as a prerequisite to use one core feature of Guix.

> Let's mimic a collection of 1000+ packages I care about.  Consider this
> manifest for packages using r-build-system only…
>
> --8<---------------cut here---------------start------------->8---
> (use-modules (guix packages)
>              (gnu packages)
>              (guix build-system r))
>
> (packages->manifest
>  (fold-packages (lambda (package result)
>                   (if (eq? (package-build-system package) r-build-system)
>                       (cons package result)
>                       result))
>                 '()))
> --8<---------------cut here---------------end--------------->8---

I have to learn about fold-packages.

> Nicolas, considering the complexity of packages and their origins, do
> you think it would be possible to do better (fast and accurate) than
> Repology at scale?

It's not about doing better, but doing differently. We do not need all
of Repology's features.

As far as the updater part is concerned, Repology pokes at various
package repositories, which are usually not upstream, extracts package
versions, applies some heuristics to normalize and compare them, then
decides what is the newest version and which repositories provide
outdated packages. This has two obvious shortcomings:

1. the version number usually doesn't come from an official source, so
   it may be wrong—e.g., our emacs-csv-mode is "outdated" because Funtoo
   1.4 chose a non-existing higher version number for the same package.

2. version comparison does not understand every local versioning
   scheme—e.g., our emacs-fold-dwim package is currently at
   "1.2-0.c46f4bb", which is, in Guix parlance, after "1.2", yet
   Repology thinks this is actually older than "1.2".

Therefore, I think a (theoretical) centralized Guix-centric version
checker could be fast: it would only poke at what our packages consider
to be upstream, and accurate, since it would know about our versioning
rules. Basically it could boil down to calling current "guix refresh" on
every package daily, and serializing the results.
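
The “serializing the results” step could be sketched roughly as below.
This is an illustration under assumptions, not an existing tool: the
helper name ‘refresh_to_tsv’ is hypothetical, and it only recognizes two
message shapes printed by "guix refresh", namely "VERSION is already the
latest version of NAME" (as in the log earlier in this thread) and "NAME
would be upgraded from OLD to NEW".  Any other line, such as a warning,
is dropped.

```shell
# Hypothetical helper, not part of Guix: turn "guix refresh" messages
# read on stdin into tab-separated records "NAME<TAB>CURRENT<TAB>LATEST".
refresh_to_tsv ()
{
    awk '
        / is already the latest version of / {
            # FILE:LINE:COL: VERSION is already the latest version of NAME
            printf "%s\t%s\t%s\n", $NF, $(NF - 7), $(NF - 7)
            next
        }
        / would be upgraded from / {
            # FILE:LINE:COL: NAME would be upgraded from OLD to NEW
            printf "%s\t%s\t%s\n", $(NF - 7), $(NF - 2), $NF
        }'
}
```

A daily job could then run something like the following, redirecting
stderr in case the messages are printed there, and publish the result:

   guix refresh -m manifest.scm 2>&1 | refresh_to_tsv > status.tsv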

>>>>   • guix refresh -m packages-i-care-about.scm
>>>
>>> Yes, obviously, this is a nice, too. However, it doesn't scale if you
>>> need to specify 1000+ packages.
>>
>> You can use ‘fold-packages’ and have three lines that return a manifest
>> of 10K packages if you want it.
>
> Yes, see example above.

Point taken.

Regards,

Ludovic Courtès March 3, 2022, 9:28 p.m. UTC | #18

Hi!

Nicolas Goaziou <mail@nicolasgoaziou.fr> skribis:

> zimoun <zimon.toutoune@gmail.com> writes:
>
>> On Thu, 17 Feb 2022 at 11:35, Ludovic Courtès <ludo@gnu.org> wrote:
>
>
>>> How does it fail?  If it’s the GitHub rate limit, then there’s only one
>>> answer: you have to provide a token.
>
> IIUC, I have to register on GitHub to create this token. This is a bit
> sad as a prerequisite to use one core feature of Guix.

I have some good news!

  https://issues.guix.gnu.org/54241

Granted, it’s not a revolution, but it should fix one of the main
annoyances of ‘guix refresh’.

Regarding the “sad prerequisite”, the ‘github’ updater predates the
‘generic-git’ updater by several years, during which it was the only way
to get data for matching packages.

Actually I wonder if it’s still useful to keep.  In theory it can
provide more accurate data than the ‘generic-git’ updater; not sure if
this is the case in practice.

Ludo’.
