Message ID | f588bb38b4b9fdaff29dd8af8c62aa3c55902f7c.1697818202.git.ludo@gnu.org |
---|---|
State | New |
Headers | show |
Series | [bug#66650] git: Shell out to ‘git gc’ when necessary. | expand |
Ludovic Courtès <ludo@gnu.org> writes: > Fixes <https://issues.guix.gnu.org/65720>. > > This fixes a bug whereby libgit2-managed checkouts would keep growing as > we fetch. > > * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New > procedures. > (update-cached-checkout): Use it. > --- > guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++--- > 1 file changed, 36 insertions(+), 3 deletions(-) > > Hi! > > This is a radical fix/workaround for the unbounded Git checkout growth > problem, shelling out to ‘git gc’ when it’s likely needed (“too many” > pack files around). > > I thought we might be able to implement a ‘git gc’ approximation using > the libgit2 “packbuilder” interface, but I haven’t got around to doing > it: <https://libgit2.org/libgit2/#HEAD/search/pack>. > > Once again, shelling out is not my favorite option, but it’s a bug we > should fix sooner rather than later, hence this compromise. > > Thoughts? This sounds good to me, the data service has this problem as well of cached checkouts that grow to be too large and this sounds like it'll address it.
Hello, Christopher Baines <mail@cbaines.net> skribis: > Ludovic Courtès <ludo@gnu.org> writes: > >> Fixes <https://issues.guix.gnu.org/65720>. >> >> This fixes a bug whereby libgit2-managed checkouts would keep growing as >> we fetch. [...] > This sounds good to me, the data service has this problem as well of > cached checkouts that grow to be too large and this sounds like it'll > address it. Thanks for your input, Chris. Any other comments? I’d like to push the patch within a few days if there are no objections. https://issues.guix.gnu.org/66650 Ludo’.
Hi, On Tue, 14 Nov 2023 at 10:19, Ludovic Courtès <ludo@gnu.org> wrote: > Any other comments? I’d like to push the patch within a few days if > there are no objections. As mentioned in [1], >> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New >> procedures. >> (update-cached-checkout): Use it. >> --- >> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++--- >> 1 file changed, 36 insertions(+), 3 deletions(-) LGTM. Just two colors for the bikeshed. :-) >> + (when (> (packs-in-git-repository directory) 25) Why 25? And not 10 or 50 or 100? >> (define* (update-cached-checkout url >> #:key >> (ref '()) >> @@ -515,6 +545,9 @@ (define* (update-cached-checkout url >> seconds seconds >> nanoseconds nanoseconds)))) >> >> + ;; Run 'git gc' if needed. >> + (maybe-run-git-gc cache-directory) Why not trigger it by “guix gc”? Well, I expect “guix gc” to take some time and I choose when. However, I want “guix pull” or “guix time-machine” to be as fast as possible and here some extra time is added, and I cannot control exactly when. Cheers, simon 1: bug#65720: [PATCH] git: Shell out to ‘git gc’ when necessary. Simon Tournier <zimon.toutoune@gmail.com> Mon, 23 Oct 2023 12:08:07 +0200 id:87il6xlkhk.fsf@gmail.com https://issues.guix.gnu.org/65720 https://issues.guix.gnu.org/msgid/87il6xlkhk.fsf@gmail.com https://yhetil.org/guix/87il6xlkhk.fsf@gmail.com
Hi, Simon Tournier <zimon.toutoune@gmail.com> skribis: >>> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New >>> procedures. >>> (update-cached-checkout): Use it. >>> --- >>> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++--- >>> 1 file changed, 36 insertions(+), 3 deletions(-) > > LGTM. Thanks! > Just two colors for the bikeshed. :-) > > >>> + (when (> (packs-in-git-repository directory) 25) > > Why 25? And not 10 or 50 or 100? Totally arbitrary. :-) I sampled the checkouts I had on my laptop and that seems like a reasonable heuristic. In particular, it seems that Git-managed checkouts never have this many packs; only libgit2-managed checkouts do, precisely because libgit2 doesn’t repack/GC. >>> + ;; Run 'git gc' if needed. >>> + (maybe-run-git-gc cache-directory) > > Why not trigger it by “guix gc”? Because so far the idea is that ~/.cache/guix/checkouts is automatically managed without user intervention; it’s really a cache in that sense. > Well, I expect “guix gc” to take some time and I choose when. However, > I want “guix pull” or “guix time-machine” to be as fast as possible and > here some extra time is added, and I cannot control exactly when. Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow path; so for example, it’s not called on a ‘time-machine’ cache hit, but only on a cache miss, which is already expensive anyway. Does that make sense? Thanks, Ludo’.
Hi, On Thu, 16 Nov 2023 at 13:12, Ludovic Courtès <ludo@gnu.org> wrote: > > Well, I expect “guix gc” to take some time and I choose when. However, > > I want “guix pull” or “guix time-machine” to be as fast as possible and > > here some extra time is added, and I cannot control exactly when. > > Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow > path; so for example, it’s not called on a ‘time-machine’ cache hit, but > only on a cache miss, which is already expensive anyway. What you mean as "only called on the slow path" is each time 'update-cached-checkout' is called, right? So, somehow when 'maybe-run-git-gc' is called appears to me "unpredictable". But anyway. :-) Let move it elsewhere if I am really annoyed. Cheers, simon
Hi, Simon Tournier <zimon.toutoune@gmail.com> skribis: > On Thu, 16 Nov 2023 at 13:12, Ludovic Courtès <ludo@gnu.org> wrote: > >> > Well, I expect “guix gc” to take some time and I choose when. However, >> > I want “guix pull” or “guix time-machine” to be as fast as possible and >> > here some extra time is added, and I cannot control exactly when. >> >> Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow >> path; so for example, it’s not called on a ‘time-machine’ cache hit, but >> only on a cache miss, which is already expensive anyway. > > What you mean as "only called on the slow path" is each time > 'update-cached-checkout' is called, right? Yes, which usually indicates we’re on a cache miss (for example a cache miss of ‘guix time-machine’) and thus are going to do potentially more work (updating a Git repo, building things, etc.). That’s why I think it’s on the “slow path” and shouldn’t make much of a difference. More importantly, unless I’m mistaken, it’s rarely going to fire. > So, somehow when 'maybe-run-git-gc' is called appears to me > "unpredictable". But anyway. :-) Sure, but the way I see it, that’s the nature of caches. > Let move it elsewhere if I am really annoyed. :-/ Ludo’.
Hi Ludo, Thanks for explaining. On Wed, 22 Nov 2023 at 12:17, Ludovic Courtès <ludo@gnu.org> wrote: > it’s rarely going to fire. [...] >> Let move it elsewhere if I am really annoyed. > > :-/ Sorry, I poorly worded my last comment. :-) Somehow I was expressing: my view probably falls into the “Premature optimization is the root of all evil” category. Other said, I have no objection and I will revisit the issue when I will be on fire, if I am, or annoyed for real. Cheers, simon PS: Aside this patch: >> So, somehow when 'maybe-run-git-gc' is called appears to me >> "unpredictable". But anyway. :-) > > Sure, but the way I see it, that’s the nature of caches. What makes cache unpredictable is their current state. However, this does not imply that *all* the actions modifying from one state to another must also be triggered in unpredictable moment. For instance, I choose when I wash family’s clothes and the wash-machine does not start by itself when the unpredictable stack of family’s dirty clothes is enough. Because, maybe today it’s rainy so drying is difficult and tomorrow will be sunny so it will be a better moment. :-) For me, “guix gc” should be the driver for cleaning all the various Guix caches. Anyway. :-D
Hi, Simon Tournier <zimon.toutoune@gmail.com> skribis: > Somehow I was expressing: my view probably falls into the “Premature > optimization is the root of all evil” category. Other said, I have no > objection and I will revisit the issue when I will be on fire, if I am, > or annoyed for real. Alright! Pushed as b150c546b04c9ebb09de9f2c39789221054f5eea. Let’s see how it behaves and if there are problems we had overlooked… Ludo’.
diff --git a/guix/git.scm b/guix/git.scm index b7182305cf..d704b62333 100644 --- a/guix/git.scm +++ b/guix/git.scm @@ -1,6 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright © 2017, 2020 Mathieu Othacehe <m.othacehe@gmail.com> -;;; Copyright © 2018-2022 Ludovic Courtès <ludo@gnu.org> +;;; Copyright © 2018-2023 Ludovic Courtès <ludo@gnu.org> ;;; Copyright © 2021 Kyle Meyer <kyle@kyleam.com> ;;; Copyright © 2021 Marius Bakke <marius@gnu.org> ;;; Copyright © 2022 Maxime Devos <maximedevos@telenet.be> @@ -29,15 +29,16 @@ (define-module (guix git) #:use-module (guix cache) #:use-module (gcrypt hash) #:use-module ((guix build utils) - #:select (mkdir-p delete-file-recursively)) + #:select (mkdir-p delete-file-recursively invoke/quiet)) #:use-module (guix store) #:use-module (guix utils) #:use-module (guix records) #:use-module (guix gexp) #:autoload (guix git-download) (git-reference-url git-reference-commit git-reference-recursive?) + #:autoload (guix config) (%git) #:use-module (guix sets) - #:use-module ((guix diagnostics) #:select (leave warning)) + #:use-module ((guix diagnostics) #:select (leave warning info)) #:use-module (guix progress) #:autoload (guix swh) (swh-download commit-id?) #:use-module (rnrs bytevectors) @@ -428,6 +429,35 @@ (define (delete-checkout directory) (rename-file directory trashed) (delete-file-recursively trashed))) +(define (packs-in-git-repository directory) + "Return the number of pack files under DIRECTORY, a Git checkout." + (catch 'system-error + (lambda () + (let ((directory (opendir (in-vicinity directory ".git/objects/pack")))) + (let loop ((count 0)) + (match (readdir directory) + ((? eof-object?) + (closedir directory) + count) + (str + (loop (if (string-suffix? ".pack" str) + (+ 1 count) + count))))))) + (const 0))) + +(define (maybe-run-git-gc directory) + "Run 'git gc' in DIRECTORY if needed." + ;; XXX: As of libgit2 1.3.x (used by Guile-Git), there's no support for GC. + ;; Each time a checkout is pulled, a new pack is created, which eventually + ;; takes up a lot of space (lots of small, poorly-compressed packs). As a + ;; workaround, shell out to 'git gc' when the number of packs in a + ;; repository has become "too large", potentially wasting a lot of space. + ;; See <https://issues.guix.gnu.org/65720>. + (when (> (packs-in-git-repository directory) 25) + (info (G_ "compressing cached Git repository at '~a'...~%") + directory) + (invoke/quiet %git "-C" directory "gc"))) + (define* (update-cached-checkout url #:key (ref '()) @@ -515,6 +545,9 @@ (define* (update-cached-checkout url seconds seconds nanoseconds nanoseconds)))) + ;; Run 'git gc' if needed. + (maybe-run-git-gc cache-directory) + ;; When CACHE-DIRECTORY is a sub-directory of the default cache ;; directory, remove expired checkouts that are next to it. (let ((parent (dirname cache-directory)))