[bug#33801] import: github: Support source URIs that redirect to GitHub

Message ID cu7mup04rfp.fsf@systemreboot.net
State Accepted
Headers show
Series [bug#33801] import: github: Support source URIs that redirect to GitHub | expand

Checks

Context Check Description
cbaines/applying patch fail Apply failed

Commit Message

Arun Isaac Dec. 20, 2018, 6:56 a.m. UTC
> Do you know how many packages fall into that category?

With this patch, we have a problem estimating the coverage using `guix
refresh -L'. Now, to estimate coverage, we need to make HTTP requests
for every single source tarball in Guix to determine if it redirects to
GitHub. This is an enormous number of HTTP requests! When I ran `guix
refresh -L', it took a very long time to finish coverage estimation. So,
I cancelled the command. Any better way to handle this?

>> +(define (follow-redirects-to-github uri)
>> +  "Follow redirects of URI until a GitHub URI is found. Return that GitHub
>> +URI. If no GitHub URI is found, return #f."
>
> Perhaps add the yt-dl.org example as a comment here.

I added a reference to the youtube-dl package in the comments. I also
added a few more comments in other places.

>> +  (define (follow-redirect uri)
>> +    (receive (response body) (http-get uri #:streaming? #t)
>
> Add: (close-port body).

I switched to using (http-head uri) instead of (http-get uri
#:streaming? #t). So, (close-port body) should no longer be required.

I also modified follow-redirects-to-github to avoid following redirects
on mirror and file URIs.

Please find attached a new patch.

Comments

Ludovic Courtès Dec. 20, 2018, 10:55 a.m. UTC | #1
Hi,

Arun Isaac <arunisaac@systemreboot.net> skribis:

>> Do you know how many packages fall into that category?
>
> With this patch, we have a problem estimating the coverage using `guix
> refresh -L'. Now, to estimate coverage, we need to make HTTP requests
> for every single source tarball in Guix to determine if it redirects to
> GitHub. This is an enormous number of HTTP requests! When I ran `guix
> refresh -L', it took a very long time to finish coverage estimation. So,
> I cancelled the command. Any better way to handle this?

Ouch, that’s a problem indeed.  We could maintain a cache of redirects,
although that wouldn’t be a real solution.

Hmm, now that I think about it, shouldn’t we store the github.com URL
directly for these packages?  We could add a new lint check along the
lines of the ‘check-mirror-url’ procedure that would allow us to find
out which URLs need to be changed.  If we took that route, the changes
you made to the importer would no longer be necessary.  :-/

WDYT?

Thanks,
Ludo’.
Arun Isaac Dec. 20, 2018, 11:20 a.m. UTC | #2
> Hmm, now that I think about it, shouldn’t we store the github.com URL
> directly for these packages?  We could add a new lint check along the
> lines of the ‘check-mirror-url’ procedure that would allow us to find
> out which URLs need to be changed.  If we took that route, the changes
> you made to the importer would no longer be necessary.  :-/

My changes only took a small amount of effort. I don't have much of an
issue with them being made obsolete. Shall I proceed with creating a
lint check along the lines of what you proposed?
Ludovic Courtès Dec. 20, 2018, 11:22 a.m. UTC | #3
Arun Isaac <arunisaac@systemreboot.net> skribis:

>> Hmm, now that I think about it, shouldn’t we store the github.com URL
>> directly for these packages?  We could add a new lint check along the
>> lines of the ‘check-mirror-url’ procedure that would allow us to find
>> out which URLs need to be changed.  If we took that route, the changes
>> you made to the importer would no longer be necessary.  :-/
>
> My changes only took a small amount of effort. I don't have much of an
> issue with them being made obsolete. Shall I proceed with creating a
> lint check along the lines of what you proposed?

Yes, if that sounds good to you, please do!

Ludo’.
Arun Isaac Dec. 20, 2018, 1:07 p.m. UTC | #4
> We could maintain a cache of redirects, although that wouldn’t be a
> real solution.
>
> Hmm, now that I think about it, shouldn’t we store the github.com URL
> directly for these packages?  We could add a new lint check along the
> lines of the ‘check-mirror-url’ procedure that would allow us to find
> out which URLs need to be changed.  If we took that route, the changes
> you made to the importer would no longer be necessary.  :-/

On the other hand, if we constrain our updater predicates to not perform
network operations, we would be restricting what we can achieve with
updaters. For example, suppose in the future we have a cgit-updater to
update packages hosted using cgit. The only way to detect that would be
using some kind of network operation. The way our updaters currently
work, we will only ever be able to properly handle centralized platforms
like GitHub, PyPI, etc. And, that's a pity.

Another solution, though somewhat inelegant and tedious, would be to add
an `updater' field to the package specification so that the packager can
explicitly specify what updater to use.

WDYT?
Ludovic Courtès Dec. 20, 2018, 4:28 p.m. UTC | #5
Arun Isaac <arunisaac@systemreboot.net> skribis:

>> We could maintain a cache of redirects, although that wouldn’t be a
>> real solution.
>>
>> Hmm, now that I think about it, shouldn’t we store the github.com URL
>> directly for these packages?  We could add a new lint check along the
>> lines of the ‘check-mirror-url’ procedure that would allow us to find
>> out which URLs need to be changed.  If we took that route, the changes
>> you made to the importer would no longer be necessary.  :-/
>
> On the other hand, if we constrain our updater predicates to not perform
> network operations, we would be restricting what we can achieve with
> updaters. For example, suppose in the future we have a cgit-updater to
> update packages hosted using cgit. The only way to detect that would be
> using some kind of network operation. The way our updaters currently
> work, we will only ever be able to properly handle centralized platforms
> like GitHub, PyPI, etc. And, that's a pity.

We can use heuristics in many cases to determine whether an updater
applies to a package; most of the time, that’s only based on the URL,
and it might also work for many cgit instances—those that have “/cgit”
in their URL.

Then again, I expect that a growing number of packages will be obtained
through ‘git-fetch’, and the updater for such a thing is trivial
(interestingly it’s not yet implemented though :-)).

So I feel like at present we don’t have strong evidence that that the
simple URL-based approach would not scale.

WDYT?

> Another solution, though somewhat inelegant and tedious, would be to add
> an `updater' field to the package specification so that the packager can
> explicitly specify what updater to use.

That could be useful to handle corner cases, but it should remain the
exception.

Thanks,
Ludo’.
Arun Isaac Dec. 20, 2018, 4:48 p.m. UTC | #6
> We can use heuristics in many cases to determine whether an updater
> applies to a package; most of the time, that’s only based on the URL,
> and it might also work for many cgit instances—those that have “/cgit”
> in their URL.
>
> Then again, I expect that a growing number of packages will be obtained
> through ‘git-fetch’, and the updater for such a thing is trivial
> (interestingly it’s not yet implemented though :-)).
>
> So I feel like at present we don’t have strong evidence that that the
> simple URL-based approach would not scale.

Agreed! Better to not complicate things beyond what is necessary at this
point. I am not confident that good heuristics exist. But, if we do use
git-fetch for the majority of our packages, like you said, updating will
be trivial.

I will proceed with the lint check as discussed earlier.
Eric Bavier Dec. 21, 2018, 12:12 a.m. UTC | #7
On Thu, 20 Dec 2018 17:28:42 +0100
Ludovic Courtès <ludo@gnu.org> wrote:

> Then again, I expect that a growing number of packages will be obtained
> through ‘git-fetch’, and the updater for such a thing is trivial
> (interestingly it’s not yet implemented though :-)).

See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=33809

`~Eric

Patch

From 7fa1daaf44720fa31813e4f07a2c49a2540a0526 Mon Sep 17 00:00:00 2001
From: Arun Isaac <arunisaac@systemreboot.net>
Date: Wed, 19 Dec 2018 15:59:52 +0530
Subject: [PATCH] import: github: Support source URIs that redirect to GitHub.

* guix/import/github.scm (follow-redirects-to-github): New function.
(updated-github-url)[updated-url]: For source URIs on other domains, replace
all instances of the old version with the new version.
(latest-release)[origin-github-uri]: If necessary, follow redirects to find
the GitHub URI.
---
 guix/import/github.scm | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/guix/import/github.scm b/guix/import/github.scm
index af9f56e1d..8db7db305 100644
--- a/guix/import/github.scm
+++ b/guix/import/github.scm
@@ -1,6 +1,7 @@ 
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2016 Ben Woodcroft <donttrustben@gmail.com>
 ;;; Copyright © 2017, 2018 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2018 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -19,6 +20,8 @@ 
 
 (define-module (guix import github)
   #:use-module (ice-9 match)
+  #:use-module (ice-9 receive)
+  #:use-module (ice-9 regex)
   #:use-module (srfi srfi-1)
   #:use-module (srfi srfi-26)
   #:use-module (srfi srfi-34)
@@ -29,6 +32,8 @@ 
   #:use-module (guix packages)
   #:use-module (guix upstream)
   #:use-module (guix http-client)
+  #:use-module (web client)
+  #:use-module (web response)
   #:use-module (web uri)
   #:export (%github-updater))
 
@@ -39,12 +44,30 @@  false if none is recognized"
         (list ".tar.gz" ".tar.bz2" ".tar.xz" ".zip" ".tar"
               ".tgz" ".tbz" ".love")))
 
+(define (follow-redirects-to-github uri)
+  "Follow redirects of URI until a GitHub URI is found. Return that GitHub
+URI. If no GitHub URI is found, return #f."
+  (define (follow-redirect uri)
+    (receive (response body) (http-head uri)
+      (case (response-code response)
+        ((301 302)
+         (uri->string (assoc-ref (response-headers response) 'location)))
+        (else #f))))
+
+  (cond
+   ((string-prefix? "https://github.com/" uri) uri)
+   ((string-prefix? "http" uri)
+    (and=> (follow-redirect uri) follow-redirects-to-github))
+   ;; Do not attempt to follow redirects on URIs other than http and https
+   ;; (such as mirror, file)
+   (else #f)))
+
 (define (updated-github-url old-package new-version)
   ;; Return a url for the OLD-PACKAGE with NEW-VERSION.  If no source url in
   ;; the OLD-PACKAGE is a GitHub url, then return false.
 
   (define (updated-url url)
-    (if (string-prefix? "https://github.com/" url)
+    (if (follow-redirects-to-github url)
         (let ((ext     (or (find-extension url) ""))
               (name    (package-name old-package))
               (version (package-version old-package))
@@ -83,7 +106,14 @@  false if none is recognized"
                             url)
             (string-append "/releases/download/" repo "-" version "/" repo "-"
                            version ext))
-           (#t #f))) ; Some URLs are not recognised.
+           ;; As a last resort, attempt to replace all instances of the old
+           ;; version with the new version. This is necessary to handle URIs
+           ;; hosted on other domains that redirect to GitHub (for an example,
+           ;; see the youtube-dl package). We do not know the internal
+           ;; structure of these URIs and cannot handle them more
+           ;; intelligently.
+           (else (regexp-substitute/global
+                  #f version url 'pre new-version 'post))))
         #f))
 
   (let ((source-url (and=> (package-source old-package) origin-uri))
@@ -210,11 +240,14 @@  https://github.com/settings/tokens"))
 (define (latest-release pkg)
   "Return an <upstream-source> for the latest release of PKG."
   (define (origin-github-uri origin)
+    ;; We follow redirects to GitHub because the origin URI might appear to be
+    ;; hosted on some other domain but just redirects to GitHub. For example,
+    ;; see the youtube-dl package.
     (match (origin-uri origin)
       ((? string? url)
-       url)                                       ;surely a github.com URL
+       (follow-redirects-to-github url))
       ((urls ...)
-       (find (cut string-contains <> "github.com") urls))))
+       (find follow-redirects-to-github urls))))
 
   (let* ((source-uri (origin-github-uri (package-source pkg)))
          (name (package-name pkg))
-- 
2.19.2