Message ID: 20220303211326.19884-1-ludo@gnu.org
Series: 'github' importer gracefully handles rate limiting
Ludovic Courtès wrote on Thu 03-03-2022 at 22:13 [+0100]:
> With this change, ‘guix refresh’ warns you when the GitHub rate limit
> is reached, but it keeps going, falling back to the ‘generic-git’
> updater if it’s among the applicable updaters:
> [...]

WDYT of avoiding the rate limit by caching, using 'http-fetch/cached'?
GitHub does not count requests setting If-Modified-Since against the
rate limit (assuming the answer hasn't changed).

Greetings,
Maxime.
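The If-Modified-Since mechanism Maxime describes can be sketched independently of Guix. The Python sketch below is illustrative only (it is not ‘http-fetch/cached’; the function names and cache layout are made up for this example): the client stores the Last-Modified value from a previous response, replays it on the next request, and treats a 304 reply as confirmation that the cached copy is still current.

```python
def conditional_headers(cache_entry):
    """Build request headers, replaying a cached Last-Modified value
    as If-Modified-Since when one is available."""
    headers = {"User-Agent": "example-updater"}
    if cache_entry and cache_entry.get("last-modified"):
        headers["If-Modified-Since"] = cache_entry["last-modified"]
    return headers

def interpret_response(status, body, cache_entry):
    """Decide which body to use: 304 means the server confirmed the
    cached copy is current; anything else uses the fresh body."""
    if status == 304 and cache_entry:
        return cache_entry["body"]
    return body

entry = {"last-modified": "Thu, 03 Mar 2022 21:13:26 GMT",
         "body": "cached tags"}
print(conditional_headers(entry))
print(interpret_response(304, None, entry))  # -> cached tags
```

Per Maxime's observation above, a 304 reply produced this way is the case GitHub does not count against the rate limit.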
Hi,

Maxime Devos <maximedevos@telenet.be> wrote:

> Ludovic Courtès wrote on Thu 03-03-2022 at 22:13 [+0100]:
>> With this change, ‘guix refresh’ warns you when the GitHub rate limit
>> is reached, but it keeps going, falling back to the ‘generic-git’
>> updater if it’s among the applicable updaters:
>> [...]
>
> WDYT of avoiding the rate limit by caching, using 'http-fetch/cached'?
> GitHub does not count requests setting If-Modified-Since against the
> rate limit (assuming the answer hasn't changed).

My concern is that we’d end up caching one or two little files in
~/.cache for each candidate package, and (rate limit aside) the
overhead of dealing with the cache might outweigh the benefits.  I’d
rather use ‘http-fetch/cached’ for bigger files, like in (guix cve).

WDYT?

My goal here was to ensure the ‘github’ updater doesn’t get in the way
of those who don’t want to specify a token.

Thanks,
Ludo’.
Ludovic Courtès wrote on Fri 04-03-2022 at 21:45 [+0100]:
> My concern is that we’d end up caching one or two little files in
> ~/.cache for each candidate package, and (rate limit aside) the
> overhead of dealing with the cache might outweigh the benefits.  I’d
> rather use ‘http-fetch/cached’ for bigger files, like in (guix cve).
>
> WDYT?

If the overhead of caching little files is a concern, then perhaps a
SQLite (or GDBM) database could be used instead of the filesystem-based
cache?  The number of packages in Guix was about 150,000 IIRC; if we
assume something on the order of 200 bytes per package, we end up with
about 29 MiB for the entirety of Guix.  And there might be some
opportunities for compression, reducing this number.

Something like this could be left for later though.

Greetings,
Maxime.
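A single-file database along the lines Maxime suggests could look roughly like this. The sketch below uses Python's sqlite3 and zlib purely for illustration; the table layout, function names, and the choice of compression are all assumptions for this example, not anything that exists in Guix.

```python
import sqlite3
import zlib

def open_cache(path=":memory:"):
    """Open (or create) a single-file key/value cache."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS cache"
               " (key TEXT PRIMARY KEY, body BLOB)")
    return db

def cache_put(db, key, text):
    """Store a response body, compressed; small JSON replies compress well."""
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)",
               (key, zlib.compress(text.encode("utf-8"))))
    db.commit()

def cache_get(db, key):
    """Return the stored body, or None when the key is absent."""
    row = db.execute("SELECT body FROM cache WHERE key = ?",
                     (key,)).fetchone()
    return zlib.decompress(row[0]).decode("utf-8") if row else None

db = open_cache()
cache_put(db, "guix/releases", '{"tag_name": "v1.4.0"}')
print(cache_get(db, "guix/releases"))  # -> {"tag_name": "v1.4.0"}
```

One file on disk instead of one or two per package sidesteps the per-file overhead Ludovic mentions, at the cost of the extra moving part discussed below.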
Hi,

Maxime Devos <maximedevos@telenet.be> wrote:

> Ludovic Courtès wrote on Fri 04-03-2022 at 21:45 [+0100]:
>> My concern is that we’d end up caching one or two little files in
>> ~/.cache for each candidate package, and (rate limit aside) the
>> overhead of dealing with the cache might outweigh the benefits.  I’d
>> rather use ‘http-fetch/cached’ for bigger files, like in (guix cve).
>>
>> WDYT?
>
> If the overhead of caching little files is a concern, then perhaps a
> SQLite (or GDBM) database could be used instead of the filesystem-based
> cache?  The number of packages in Guix was about 150,000 IIRC; if we
> assume something on the order of 200 bytes per package, we end up with
> about 29 MiB for the entirety of Guix.  And there might be some
> opportunities for compression, reducing this number.

I think this would be going overboard in terms of complexity :-), and
it wouldn’t radically change the run-time overhead (you still
potentially have to do an HTTP round trip with ‘If-Modified-Since’;
you’re just saving a few hundred bytes on the response in the best
case).

> Something like this could be left for later though.

Yup!

Ludo’.
Ludovic Courtès wrote on Sat 05-03-2022 at 22:58 [+0100]:
> [...] and it wouldn’t radically change the run-time overhead (you
> still potentially have to do an HTTP round trip with
> ‘If-Modified-Since’; you’re just saving a few hundred bytes on the
> response in the best case).

IIUC, when the TTL hasn't been exceeded, the file from the file system
is served without contacting the remote server at all.  So in the best
case, you only ‘round-trip’ to the disk instead of to the HTTP server.
I therefore think there are some potential benefits to be had here.
That assumes a sufficiently large TTL though.

Greetings,
Maxime.
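Maxime's point about the TTL can be made concrete with a small sketch (illustrative Python, not the actual ‘http-fetch/cached’; the names are invented for this example): as long as a cache entry is younger than the TTL, the lookup never calls the fetch procedure, so in that window there is no HTTP round trip at all, conditional or otherwise.

```python
import time

def fetch_cached(key, ttl, cache, fetch):
    """Return (value, source).  Within `ttl` seconds of the last fetch
    the cached value is returned without calling fetch() at all;
    afterwards fetch() is called and the cache entry refreshed."""
    entry = cache.get(key)
    now = time.time()
    if entry and now - entry["time"] < ttl:
        return entry["value"], "cache"   # disk round trip only
    value = fetch()                      # network round trip
    cache[key] = {"value": value, "time": now}
    return value, "network"

cache = {}
calls = []
def fake_fetch():
    calls.append(1)
    return "response"

print(fetch_cached("k", 3600, cache, fake_fetch))  # -> ('response', 'network')
print(fetch_cached("k", 3600, cache, fake_fetch))  # -> ('response', 'cache')
```

After the two lookups, `fake_fetch` has run only once; the second lookup never left the process, which is the best case Maxime describes.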
Ludovic Courtès wrote on Sat 05-03-2022 at 22:58 [+0100]:
> I think this would be going overboard in terms of complexity :-)
There's some complexity here, but assuming a sufficient number of
tests, I believe it would be worth it if it allows side-stepping the
rate limit to some degree.  And the extra complexity would mostly
disappear if the overhead of tiny files were accepted (*).
There are also some other benefits, e.g. a kind of ‘download
resumption’ but for linters, reducing network traffic after retrying
"guix lint" on a lossy network (or because the terminal tab was closed
too early, etc.).
All stuff that can be left for later though!
Greetings,
Maxime.
(*) Assuming 150,000 packages and 1 KiB per package (this would be
file-system dependent!), I end up with 150 MiB.  That's a bit on the
large side though ...
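For what it's worth, both back-of-the-envelope figures in this thread check out:

```python
# Checking the two size estimates from this thread.
MiB = 1024 * 1024
packages = 150_000

print(round(packages * 200 / MiB, 1))   # 200 B/package -> 28.6, i.e. "about 29 MiB"
print(round(packages * 1024 / MiB, 1))  # 1 KiB/package -> 146.5, i.e. roughly 150 MiB
```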
Maxime Devos <maximedevos@telenet.be> wrote:

> Ludovic Courtès wrote on Sat 05-03-2022 at 22:58 [+0100]:
>> I think this would be going overboard in terms of complexity :-)
>
> There's some complexity here, but assuming a sufficient number of
> tests, I believe it would be worth it if it allows side-stepping the
> rate limit to some degree.

What should also be taken into account is the usefulness of the
‘github’ updater: investment should be proportionate.  I suspect it’s
much less useful now that we have the ‘generic-git’ updater.  Maybe it
gives slightly more accurate data in some cases, maybe it can be
slightly faster, but that’s not entirely clear to me.