diff mbox series

[bug#66650] git: Shell out to ‘git gc’ when necessary.

Message ID f588bb38b4b9fdaff29dd8af8c62aa3c55902f7c.1697818202.git.ludo@gnu.org
State New
Headers show
Series [bug#66650] git: Shell out to ‘git gc’ when necessary. | expand

Commit Message

Ludovic Courtès Oct. 20, 2023, 4:15 p.m. UTC
Fixes <https://issues.guix.gnu.org/65720>.

This fixes a bug whereby libgit2-managed checkouts would keep growing as
we fetch.

* guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
procedures.
(update-cached-checkout): Use it.
---
 guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 3 deletions(-)

Hi!

This is a radical fix/workaround for the unbounded Git checkout growth
problem, shelling out to ‘git gc’ when it’s likely needed (“too many”
pack files around).

I thought we might be able to implement a ‘git gc’ approximation using
the libgit2 “packbuilder” interface, but I haven’t got around to doing
it: <https://libgit2.org/libgit2/#HEAD/search/pack>.

Once again, shelling out is not my favorite option, but it’s a bug we
should fix sooner rather than later, hence this compromise.

Thoughts?

Ludo’.


base-commit: 6b0a32196982a0a2f4dbb59d35e55833a5545ac6

Comments

Christopher Baines Oct. 30, 2023, 12:02 p.m. UTC | #1
Ludovic Courtès <ludo@gnu.org> writes:

> Fixes <https://issues.guix.gnu.org/65720>.
>
> This fixes a bug whereby libgit2-managed checkouts would keep growing as
> we fetch.
>
> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
> procedures.
> (update-cached-checkout): Use it.
> ---
>  guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
>  1 file changed, 36 insertions(+), 3 deletions(-)
>
> Hi!
>
> This is a radical fix/workaround for the unbounded Git checkout growth
> problem, shelling out to ‘git gc’ when it’s likely needed (“too many”
> pack files around).
>
> I thought we might be able to implement a ‘git gc’ approximation using
> the libgit2 “packbuilder” interface, but I haven’t got around to doing
> it: <https://libgit2.org/libgit2/#HEAD/search/pack>.
>
> Once again, shelling out is not my favorite option, but it’s a bug we
> should fix sooner rather than later, hence this compromise.
>
> Thoughts?

This sounds good to me, the data service has this problem as well of
cached checkouts that grow to be too large and this sounds like it'll
address it.
Ludovic Courtès Nov. 14, 2023, 9:19 a.m. UTC | #2
Hello,

Christopher Baines <mail@cbaines.net> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Fixes <https://issues.guix.gnu.org/65720>.
>>
>> This fixes a bug whereby libgit2-managed checkouts would keep growing as
>> we fetch.

[...]

> This sounds good to me, the data service has this problem as well of
> cached checkouts that grow to be too large and this sounds like it'll
> address it.

Thanks for your input, Chris.

Any other comments?  I’d like to push the patch within a few days if
there are no objections.

  https://issues.guix.gnu.org/66650

Ludo’.
Simon Tournier Nov. 14, 2023, 9:32 a.m. UTC | #3
Hi,

On Tue, 14 Nov 2023 at 10:19, Ludovic Courtès <ludo@gnu.org> wrote:

> Any other comments?  I’d like to push the patch within a few days if
> there are no objections.

As mentioned in [1],

>> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
>> procedures.
>> (update-cached-checkout): Use it.
>> ---
>>  guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 36 insertions(+), 3 deletions(-)

LGTM.  Just two colors for the bikeshed. :-)


>> +  (when (> (packs-in-git-repository directory) 25)

Why 25?  And not 10 or 50 or 100?


>>  (define* (update-cached-checkout url
>>                                   #:key
>>                                   (ref '())
>> @@ -515,6 +545,9 @@ (define* (update-cached-checkout url
>>                     seconds seconds
>>                     nanoseconds nanoseconds))))
>>  
>> +       ;; Run 'git gc' if needed.
>> +       (maybe-run-git-gc cache-directory)

Why not trigger it by “guix gc”?

Well, I expect “guix gc” to take some time and I choose when.  However,
I want “guix pull” or “guix time-machine” to be as fast as possible and
here some extra time is added, and I cannot control exactly when.


Cheers,
simon


1: bug#65720: [PATCH] git: Shell out to ‘git gc’ when necessary.
Simon Tournier <zimon.toutoune@gmail.com>
Mon, 23 Oct 2023 12:08:07 +0200
id:87il6xlkhk.fsf@gmail.com
https://issues.guix.gnu.org/65720
https://issues.guix.gnu.org/msgid/87il6xlkhk.fsf@gmail.com
https://yhetil.org/guix/87il6xlkhk.fsf@gmail.com
Ludovic Courtès Nov. 16, 2023, 12:12 p.m. UTC | #4
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

>>> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
>>> procedures.
>>> (update-cached-checkout): Use it.
>>> ---
>>>  guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
>>>  1 file changed, 36 insertions(+), 3 deletions(-)
>
> LGTM.

Thanks!

> Just two colors for the bikeshed. :-)
>
>
>>> +  (when (> (packs-in-git-repository directory) 25)
>
> Why 25?  And not 10 or 50 or 100?

Totally arbitrary.  :-)  I sampled the checkouts I had on my laptop and
that seems like a reasonable heuristic.  In particular, it seems that
Git-managed checkouts never have this many packs; only libgit2-managed
checkouts do, precisely because libgit2 doesn’t repack/GC.

>>> +       ;; Run 'git gc' if needed.
>>> +       (maybe-run-git-gc cache-directory)
>
> Why not trigger it by “guix gc”?

Because so far the idea is that ~/.cache/guix/checkouts is automatically
managed without user intervention; it’s really a cache in that sense.

> Well, I expect “guix gc” to take some time and I choose when.  However,
> I want “guix pull” or “guix time-machine” to be as fast as possible and
> here some extra time is added, and I cannot control exactly when.

Yes, I see.  The thing is ‘maybe-run-git-gc’ is only called on the slow
path; so for example, it’s not called on a ‘time-machine’ cache hit, but
only on a cache miss, which is already expensive anyway.

Does that make sense?

Thanks,
Ludo’.
Simon Tournier Nov. 16, 2023, 1:24 p.m. UTC | #5
Hi,

On Thu, 16 Nov 2023 at 13:12, Ludovic Courtès <ludo@gnu.org> wrote:

> > Well, I expect “guix gc” to take some time and I choose when.  However,
> > I want “guix pull” or “guix time-machine” to be as fast as possible and
> > here some extra time is added, and I cannot control exactly when.
>
> Yes, I see.  The thing is ‘maybe-run-git-gc’ is only called on the slow
> path; so for example, it’s not called on a ‘time-machine’ cache hit, but
> only on a cache miss, which is already expensive anyway.

What you mean as "only called on the slow path" is each time
'update-cached-checkout' is called, right?  So, somehow when
'maybe-run-git-gc' is called appears to me "unpredictable".  But
anyway. :-)

Let move it elsewhere if I am really annoyed.

Cheers,
simon
Ludovic Courtès Nov. 22, 2023, 11:17 a.m. UTC | #6
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> On Thu, 16 Nov 2023 at 13:12, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> > Well, I expect “guix gc” to take some time and I choose when.  However,
>> > I want “guix pull” or “guix time-machine” to be as fast as possible and
>> > here some extra time is added, and I cannot control exactly when.
>>
>> Yes, I see.  The thing is ‘maybe-run-git-gc’ is only called on the slow
>> path; so for example, it’s not called on a ‘time-machine’ cache hit, but
>> only on a cache miss, which is already expensive anyway.
>
> What you mean as "only called on the slow path" is each time
> 'update-cached-checkout' is called, right?

Yes, which usually indicates we’re on a cache miss (for example a cache
miss of ‘guix time-machine’) and thus are going to do potentially more
work (updating a Git repo, building things, etc.).  That’s why I think
it’s on the “slow path” and shouldn’t make much of a difference.  More
importantly, unless I’m mistaken, it’s rarely going to fire.

> So, somehow when 'maybe-run-git-gc' is called appears to me
> "unpredictable".  But anyway. :-)

Sure, but the way I see it, that’s the nature of caches.

> Let move it elsewhere if I am really annoyed.

:-/

Ludo’.
Simon Tournier Nov. 22, 2023, 11:57 a.m. UTC | #7
Hi Ludo,

Thanks for explaining.

On Wed, 22 Nov 2023 at 12:17, Ludovic Courtès <ludo@gnu.org> wrote:

>                                   it’s rarely going to fire.

[...]

>> Let move it elsewhere if I am really annoyed.
>
> :-/

Sorry, I poorly worded my last comment. :-)

Somehow I was expressing: my view probably falls into the “Premature
optimization is the root of all evil” category.  Other said, I have no
objection and I will revisit the issue when I will be on fire, if I am,
or annoyed for real.

Cheers,
simon

PS:

Aside this patch:

>> So, somehow when 'maybe-run-git-gc' is called appears to me
>> "unpredictable".  But anyway. :-)
>
> Sure, but the way I see it, that’s the nature of caches.

What makes cache unpredictable is their current state.  However, this
does not imply that *all* the actions modifying from one state to
another must also be triggered in unpredictable moment.

For instance, I choose when I wash family’s clothes and the wash-machine
does not start by itself when the unpredictable stack of family’s dirty
clothes is enough.  Because, maybe today it’s rainy so drying is
difficult and tomorrow will be sunny so it will be a better moment. :-)

For me, “guix gc” should be the driver for cleaning all the various Guix
caches.  Anyway. :-D
Ludovic Courtès Nov. 22, 2023, 4 p.m. UTC | #8
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> Somehow I was expressing: my view probably falls into the “Premature
> optimization is the root of all evil” category.  Other said, I have no
> objection and I will revisit the issue when I will be on fire, if I am,
> or annoyed for real.

Alright!

Pushed as b150c546b04c9ebb09de9f2c39789221054f5eea.

Let’s see how it behaves and if there are problems we had overlooked…

Ludo’.
diff mbox series

Patch

diff --git a/guix/git.scm b/guix/git.scm
index b7182305cf..d704b62333 100644
--- a/guix/git.scm
+++ b/guix/git.scm
@@ -1,6 +1,6 @@ 
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2017, 2020 Mathieu Othacehe <m.othacehe@gmail.com>
-;;; Copyright © 2018-2022 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2018-2023 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2021 Kyle Meyer <kyle@kyleam.com>
 ;;; Copyright © 2021 Marius Bakke <marius@gnu.org>
 ;;; Copyright © 2022 Maxime Devos <maximedevos@telenet.be>
@@ -29,15 +29,16 @@  (define-module (guix git)
   #:use-module (guix cache)
   #:use-module (gcrypt hash)
   #:use-module ((guix build utils)
-                #:select (mkdir-p delete-file-recursively))
+                #:select (mkdir-p delete-file-recursively invoke/quiet))
   #:use-module (guix store)
   #:use-module (guix utils)
   #:use-module (guix records)
   #:use-module (guix gexp)
   #:autoload   (guix git-download)
   (git-reference-url git-reference-commit git-reference-recursive?)
+  #:autoload   (guix config) (%git)
   #:use-module (guix sets)
-  #:use-module ((guix diagnostics) #:select (leave warning))
+  #:use-module ((guix diagnostics) #:select (leave warning info))
   #:use-module (guix progress)
   #:autoload   (guix swh) (swh-download commit-id?)
   #:use-module (rnrs bytevectors)
@@ -428,6 +429,35 @@  (define (delete-checkout directory)
     (rename-file directory trashed)
     (delete-file-recursively trashed)))
 
+(define (packs-in-git-repository directory)
+  "Return the number of pack files under DIRECTORY, a Git checkout."
+  (catch 'system-error
+    (lambda ()
+      (let ((directory (opendir (in-vicinity directory ".git/objects/pack"))))
+        (let loop ((count 0))
+          (match (readdir directory)
+            ((? eof-object?)
+             (closedir directory)
+             count)
+            (str
+             (loop (if (string-suffix? ".pack" str)
+                       (+ 1 count)
+                       count)))))))
+    (const 0)))
+
+(define (maybe-run-git-gc directory)
+  "Run 'git gc' in DIRECTORY if needed."
+  ;; XXX: As of libgit2 1.3.x (used by Guile-Git), there's no support for GC.
+  ;; Each time a checkout is pulled, a new pack is created, which eventually
+  ;; takes up a lot of space (lots of small, poorly-compressed packs).  As a
+  ;; workaround, shell out to 'git gc' when the number of packs in a
+  ;; repository has become "too large", potentially wasting a lot of space.
+  ;; See <https://issues.guix.gnu.org/65720>.
+  (when (> (packs-in-git-repository directory) 25)
+    (info (G_ "compressing cached Git repository at '~a'...~%")
+          directory)
+    (invoke/quiet %git "-C" directory "gc")))
+
 (define* (update-cached-checkout url
                                  #:key
                                  (ref '())
@@ -515,6 +545,9 @@  (define* (update-cached-checkout url
                    seconds seconds
                    nanoseconds nanoseconds))))
 
+       ;; Run 'git gc' if needed.
+       (maybe-run-git-gc cache-directory)
+
        ;; When CACHE-DIRECTORY is a sub-directory of the default cache
        ;; directory, remove expired checkouts that are next to it.
        (let ((parent (dirname cache-directory)))