mbox series

[bug#54241,0/4] 'github' importer gracefully handles rate limiting

Message ID 20220303211326.19884-1-ludo@gnu.org
Headers show
Series 'github' importer gracefully handles rate limiting | expand

Message

Ludovic Courtès March 3, 2022, 9:13 p.m. UTC
Hi Guix!

These patches address a famous complaint about “the GitHub problem”
that affects ‘guix refresh’¹, shown here in its naked awfulness:

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix refresh
gnu/packages/zig.scm:32:13: zig would be upgraded from 0.9.0 to 0.9.1

[...]

In guix/scripts/refresh.scm:
   578:14  5 (_ _)
In srfi/srfi-1.scm:
    634:9  4 (for-each #<procedure 7fe85c9a8e00 at guix/scripts/refresh.scm:578:24 (t-916fdc98f4be2f1-1d48)> _)
In guix/scripts/refresh.scm:
    378:2  3 (check-for-package-update #<package redshift-wayland@1.12-1.7da875d gnu/packages/xdisorg.scm:1425 7fe85879e790> (#<<upstream…>) …)
In guix/import/github.scm:
   232:12  2 (latest-release _)
In ice-9/boot-9.scm:
  1685:16  1 (raise-exception _ #:continuable? _)
  1685:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Error downloading release information through the GitHub
API. This may be fixed by using an access token and setting the environment
variable GUIX_GITHUB_TOKEN, for instance one procured from
https://github.com/settings/tokens
--8<---------------cut here---------------end--------------->8---

With this change, ‘guix refresh’ warns you when the GitHub rate limit
is reached, but it keeps going, falling back to the ‘generic-git’
updater if it’s among the applicable updaters:

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix refresh -t github,generic-git

[...]

guix refresh: warning: GitHub rate limit exceeded; disallowing requests for 1477 seconds
hint: You can waive the rate limit by setting the `GUIX_GITHUB_TOKEN' environment variable to a
token obtained from `https://github.com/settings/tokens' with your GitHub account.

Alternatively, you can wait until your rate limit is reset, or use the `generic-git' updater
instead.

gnu/packages/zile.scm:113:14: warning: no tags were found for zile-on-guile
gnu/packages/zig.scm:32:13: zig would be upgraded from 0.9.0 to 0.9.1
gnu/packages/xorg.scm:2830:7: warning: no valid tags found for xf86-video-freedreno
gnu/packages/xml.scm:2132:13: java-kxml2 would be upgraded from 2.4.2 to 2.5.0
--8<---------------cut here---------------end--------------->8---

The GitHub updater becomes functional again once the rate limit has
been reset.

The code to deal with rate limiting is similar to that in (guix swh).

Thoughts?

Thanks,
Ludo’.

¹ https://issues.guix.gnu.org/53818#50

Ludovic Courtès (4):
  http-client: Add response headers to '&http-get-error'.
  import: github: Gracefully handle rate limit exhaustion.
  http-client: Correctly handle redirects when #:keep-alive? #t.
  import: github: Reuse HTTP connection for the /tags URL fallback.

 .dir-locals.el         |   1 +
 guix/http-client.scm   |  39 +++++++++-----
 guix/import/github.scm | 119 +++++++++++++++++++++++++++++------------
 3 files changed, 112 insertions(+), 47 deletions(-)


base-commit: be84fb701bf7a36a0eb50147ccbb988aa3f41209

Comments

M March 4, 2022, 12:07 p.m. UTC | #1
Ludovic Courtès schreef op do 03-03-2022 om 22:13 [+0100]:
> With this change, ‘guix refresh’ warns you when the GitHub rate limit
> is reached, but it keeps going, falling back to the ‘generic-git’
> updater if it’s among the applicable updaters:
> [...]

WDYT of avoiding the rate limit by caching, using 'http-fetch/cached'?
GitHub does not count requests setting If-Modified-Since against the
rate limit (assuming the answer hasn't changed).

Greetings,
Maxime.
Ludovic Courtès March 4, 2022, 8:45 p.m. UTC | #2
Hi,

Maxime Devos <maximedevos@telenet.be> skribis:

> Ludovic Courtès schreef op do 03-03-2022 om 22:13 [+0100]:
>> With this change, ‘guix refresh’ warns you when the GitHub rate limit
>> is reached, but it keeps going, falling back to the ‘generic-git’
>> updater if it’s among the applicable updaters:
>> [...]
>
> WDYT of avoiding the rate limit by caching, using 'http-fetch/cached'?
> GitHub does not count requests setting If-Modified-Since against the
> rate limit (assuming the answer hasn't changed).

My concern is that we’d end up caching one or two little files in
~/.cache for each candidate package, and (rate limit aside) the overhead
of dealing with the cache might outweigh the benefits.  I’d rather use
‘http-fetch/cached’ for bigger files, like in (guix cve).

WDYT?

My goal here was to ensure the ‘github’ updater doesn’t get in the way
of those who don’t want to specify a token.

Thanks,
Ludo’.
M March 5, 2022, 9:44 a.m. UTC | #3
Ludovic Courtès schreef op vr 04-03-2022 om 21:45 [+0100]:
> My concern is that we’d end up caching one or two little files in
> ~/.cache for each candidate package, and (rate limit aside) the overhead
> of dealing with the cache might outweigh the benefits.  I’d rather use
> ‘http-fetch/cached’ for bigger files, like in (guix cve).
> 
> WDYT?

If the overhead of caching little files is a concern, then perhaps a
SQLite (or GDBM) database could be used instead of the filesystem-based
cache?  The number of packages in Guix was about 150 000 IIRC, if we
assume something around the magnitude of 200 bytes per package, then
we end up with about 29 MiB for the entirity of Guix.  And there might
be some opportunities for compression, reducing this number.

Something like this could be left for later though.

Greetings,
Maxime.
Ludovic Courtès March 5, 2022, 9:58 p.m. UTC | #4
Hi,

Maxime Devos <maximedevos@telenet.be> skribis:

> Ludovic Courtès schreef op vr 04-03-2022 om 21:45 [+0100]:
>> My concern is that we’d end up caching one or two little files in
>> ~/.cache for each candidate package, and (rate limit aside) the overhead
>> of dealing with the cache might outweigh the benefits.  I’d rather use
>> ‘http-fetch/cached’ for bigger files, like in (guix cve).
>> 
>> WDYT?
>
> If the overhead of caching little files is a concern, then perhaps a
> SQLite (or GDBM) database could be used instead of the filesystem-based
> cache?  The number of packages in Guix was about 150 000 IIRC, if we
> assume something around the magnitude of 200 bytes per package, then
> we end up with about 29 MiB for the entirity of Guix.  And there might
> be some opportunities for compression, reducing this number.

I think this would be going overboard in terms of complexity :-), and it
wouldn’t radically change the run-time overhead (you still potentially
have to do an HTTP round trip with ‘If-Modified-Since’, you’re just
saving a few hundred bytes on the response in the best case.)

> Something like this could be left for later though.

Yup!

Ludo’.
M March 5, 2022, 10:04 p.m. UTC | #5
Ludovic Courtès schreef op za 05-03-2022 om 22:58 [+0100]:
> [...] and it wouldn’t radically change the run-time overhead (you still
> potentially have to do an HTTP round trip with ‘If-Modified-Since’,
> you’re just saving a few hundred bytes on the response in the best case.)

IIUC, when the TTL hasn't been exceeded, then the file from the file
system is served without contacting the remote server at all.  So in
the best case, you only ‘round-trip’ to the disk instead of the HTTP
server.  So I think there's some potential benefits to be had here.

That assumes a sufficiently large TTL though.

Greetings,
Maxime.
M March 5, 2022, 10:11 p.m. UTC | #6
Ludovic Courtès schreef op za 05-03-2022 om 22:58 [+0100]:
> I think this would be going overboard in terms of complexity :-)

There's some complexity here, but assuming a sufficient amount of
tests, I believe it would be worth it if it allows side-stepping the
rate limit to some degree.  And the extra complexity would mostly
disappear if the overhead of tiny files was accepted (*).

There are also some other benefits, e.g. a kind of ‘download
resumption’ but for linters, reducing network traffic after retrying
"guix lint" on a lossy network (or because the terminal tab was closed
to early, etc.).

All stuff that can be left for later though!

Greetings,
Maxime.

(*) Assuming 150 000 packages and 1 KiB per package (this would be
file-system dependent!), I end up with 150 MiB.  That's a bit on the
large size though ...
Ludovic Courtès March 6, 2022, 5:21 p.m. UTC | #7
Maxime Devos <maximedevos@telenet.be> skribis:

> Ludovic Courtès schreef op za 05-03-2022 om 22:58 [+0100]:
>> I think this would be going overboard in terms of complexity :-)
>
> There's some complexity here, but assuming a sufficient amount of
> tests, I believe it would be worth it if it allows side-stepping the
> rate limit to some degree.

What should also be taken into account is the usefulness of the ‘github’
updater—investment should be proportionate.  I suspect it’s much less
useful now that we have the ‘generic-git’ updater.  Maybe, maybe it
gives slightly more accurate data in some cases, maybe it can be
slightly faster, but that’s not entirely clear to me.