mbox

[bug#39258,v4,0/3] Faster cache generation (similar as v3)

Message ID 20200503150154.26532-1-zimon.toutoune@gmail.com
Headers show

Message

Simon Tournier May 3, 2020, 3:01 p.m. UTC
Dear,

The aim of this version v4 is to keep the same searching performances as the previous version v3 but to drastically reduce the generation of the cache.  On my laptop, the overhead is now 4 seconds; compared to more than 20 seconds for v2 and v3.

--8<---------------cut here---------------start------------->8---
# default
time guix build /gnu/store/0nfpp82mqglpwvl1nbfpaphw5db2ivcp-guix-package-cache.drv --check
# v4
time guix build /gnu/store/y78gfh1n7m3kyrj8wsqj25qc2cbc1a4d-guix-package-cache.drv --check
--8<---------------cut here---------------end--------------->8---

|      | default  | v4        |
|------+----------+-----------|
| real | 0m6.012s | 0m10.244s |
| user | 0m0.541s | 0m0.542s  |
| sys  | 0m0.033s | 0m0.032s  |


In the version v3, the cache is built using 'cons' and 'fold-packages' (wrapper to 'fold-module-public-variables').  The version v4 modifies -- by adding other information -- the function 'generate-package-cache' which uses 'vhash' and 'fold-module-public-variables*'.

Therefore the cache '/lib/guix/package.cache' contains more information.  (The v4 structure of 'package.cache' is a quick draft, so details should be discussed and an interesting move should to have a structured (binary and all strings) S-exp; because it should become an entry point to export the packages list to JSON.  WDYT?)


Now, we are comparing apples to apples and the cost to compute BM25 (v2) is not free at all.  Remember that BM25 is the state-of-the-art of information retrieval (relevance ranking) and it is delegated to Xapian (v2).  I do not know if there is perfomance bottleneck between Guix, Guile-Xapian and Xapian itself but for sure the computation of BM25 is not free.  More about that soon.

To be clear about BM25 and caching, what I have in mind is:
  1. "guix search --build-index" optionally done by the user if they wants for example the BM25 ranking.
  2. Use BM25 metrics to detect poor package meta-data (synopsis and description); if it worth why not add another checker to "guix lint".

However, ranking is another story and I am not convinced yet if BM25 fits Guix needs or not.



* Details
~~~~~~~~~

The pacthes applies against the commit a357849f5b (and it is not yet rebased).

--8<---------------cut here---------------start------------->8---
time ./pre-env-inst guix pull --branch=search-v4 --url=$PWD -p /tmp/v4
--8<---------------cut here---------------end--------------->8---


Similar test than the previous benchmark (cold cache).

--8<---------------cut here---------------start------------->8---
time ./pre-env-inst /tmp/v4/bin/guix search crypto library \
     | recsel -P name | grep libb2
name: libb2

real    0m0.784s
user    0m0.810s
sys     0m0.037s
--8<---------------cut here---------------end--------------->8---

And the option '--load-path' turns off the cache and it fallbacks to the usual 'fold-package'.

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env /tmp/v4/bin/guix search -L /tmp/my-pkgs crypto library \
     | recsel -C -p name | grep libb2
name: libb2

real    0m2.446s
user    0m1.872s
sys     0m0.187s
--8<---------------cut here---------------end--------------->8---



* Still draft
~~~~~~~~~~~~~

 1. The name of 'fold-packages*' should be misleading since it does not return "true" packages.

--8<---------------cut here---------------start------------->8---
(define get-hello (p r)
  (if (string=? (package-name p) "hello")
      p
      r))
(define no-cache   (fold-packages  get-hello '()))
(define from-cache (fold-packages* get-hello '()))

(equal? no-cache from-cache)
;;; #f
--8<---------------cut here---------------end--------------->8---

    Another name for the procedure is welcome if it is an issue.

 2. The function 'package->recutils' in 'guix/ui.scm' is modified but it is not the better.

--8<---------------cut here---------------start------------->8---
          (match (package-supported-systems p)
            (('cache supported-systems)
             (string-join supported-systems))
            (_
             (string-join (package-transitive-supported-systems p)))))
--8<---------------cut here---------------end--------------->8---

    However it avoids to duplicate code; as it is done in version v3.


 3. Deprecated packages are displayed (bug in v3 too).

 4. Impolite '@@' is used to access the private license construction.

 5. Commit messages are incomplete, copyright header too, etc..



* Next?
~~~~~~~

IMHO, simply caching improves the current situation:

 - a bit of extra time at pull time (less than 5s on my machine)
 + speed up at search time (2x faster)
 * maintainable code?

Is it in the right direction?
Could you advise for a more compliant code?
Could you test on your machines to have another point of comparison?



Best regards,
simon


zimoun (3):
  DRAFT packages: Add fields to packages cache.
  DRAFT packages: Add new procedure 'fold-packages*'.
  DRAFT guix package: Use cache in 'find-packages-by-description'.

 gnu/packages.scm         | 98 ++++++++++++++++++++++++++++++++++++++--
 guix/scripts/package.scm |  2 +-
 guix/ui.scm              | 29 +++++++-----
 tests/packages.scm       | 31 +++++++++++++
 4 files changed, 143 insertions(+), 17 deletions(-)

Comments

Ludovic Courtès May 3, 2020, 4:43 p.m. UTC | #1
Hello!

zimoun <zimon.toutoune@gmail.com> skribis:

> The aim of this version v4 is to keep the same searching performances as the previous version v3 but to drastically reduce the generation of the cache.  On my laptop, the overhead is now 4 seconds; compared to more than 20 seconds for v2 and v3.
>
> # default
> time guix build /gnu/store/0nfpp82mqglpwvl1nbfpaphw5db2ivcp-guix-package-cache.drv --check
> # v4
> time guix build /gnu/store/y78gfh1n7m3kyrj8wsqj25qc2cbc1a4d-guix-package-cache.drv --check
>
> |      | default  | v4        |
> |------+----------+-----------|
> | real | 0m6.012s | 0m10.244s |
> | user | 0m0.541s | 0m0.542s  |
> | sys  | 0m0.033s | 0m0.032s  |

Not bad!

> In the version v3, the cache is built using 'cons' and 'fold-packages' (wrapper to 'fold-module-public-variables').  The version v4 modifies -- by adding other information -- the function 'generate-package-cache' which uses 'vhash' and 'fold-module-public-variables*'.
>
> Therefore the cache '/lib/guix/package.cache' contains more
> information.

This breaks the binary interface, so we’ll have to analyze the impact of
such a change and devise a strategy.

> (The v4 structure of 'package.cache' is a quick draft, so details
> should be discussed and an interesting move should to have a
> structured (binary and all strings) S-exp; because it should become an
> entry point to export the packages list to JSON.  WDYT?)

It’s on purpose that this cache is an object file: it just needs to be
mmap’d, and that’s it.  It’s the cheapest possible way to do it.
Parsing sexps would be more costly, and since we’re talking about
startup time, this is sensitive.

> Now, we are comparing apples to apples and the cost to compute BM25 (v2) is not free at all.  Remember that BM25 is the state-of-the-art of information retrieval (relevance ranking) and it is delegated to Xapian (v2).  I do not know if there is perfomance bottleneck between Guix, Guile-Xapian and Xapian itself but for sure the computation of BM25 is not free.  More about that soon.
>
> To be clear about BM25 and caching, what I have in mind is:
>   1. "guix search --build-index" optionally done by the user if they wants for example the BM25 ranking.

Something that must be done explicitly doesn’t seem great to me.  As a
user, I’d rather not think about search indexes and all.  But I don’t
know, maybe if it happened automatically on the first ‘guix search’
invocation that’d be fine.

>   2. Use BM25 metrics to detect poor package meta-data (synopsis and description); if it worth why not add another checker to "guix lint".

That’d be interesting!

>  1. The name of 'fold-packages*' should be misleading since it does not return "true" packages.

Did you see ‘fold-available-packages’?  It seems you could extend it
instead of introducing ‘fold-packages*’, no?

>  2. The function 'package->recutils' in 'guix/ui.scm' is modified but it is not the better.
>
>           (match (package-supported-systems p)
>             (('cache supported-systems)
>              (string-join supported-systems))
>             (_
>              (string-join (package-transitive-supported-systems p)))))
>
>     However it avoids to duplicate code; as it is done in version v3.

I made suggestions to Arun’s v3 about the API here.  Essentially, I
think I proposed having a procedure that takes the list of fields as
keyword parameters, and ‘package->recutils’ would just delegate to that.


>  3. Deprecated packages are displayed (bug in v3 too).
>
>  4. Impolite '@@' is used to access the private license construction.

(guix licenses) could provide a ‘string->license’ procedure.

Stopping here for now because I’m sorta drowning in patch review.  :-)

Thanks for exploring this design space, we’re making progress!

Ludo’.
Simon Tournier May 3, 2020, 6:10 p.m. UTC | #2
Hi Ludo,

On Sun, 3 May 2020 at 18:43, Ludovic Courtès <ludo@gnu.org> wrote:

> > Therefore the cache '/lib/guix/package.cache' contains more
> > information.
>
> This breaks the binary interface, so we’ll have to analyze the impact of
> such a change and devise a strategy.

Interface between what and what?

Because from my understanding, this file is only used by only one
guix.  What do I miss?


Note that I have read your comment in v3 2/3 but I did not understand it. Sorry.

--8<---------------cut here---------------start------------->8---
I realize the other cache also has that problem, but it would be nice to
add a version tag to the cache.  Basically emit something like:

  (package-metadata-cache (version 0) VECTOR …)

instead of just:

  (VECTOR …)
--8<---------------cut here---------------end--------------->8---



> > (The v4 structure of 'package.cache' is a quick draft, so details
> > should be discussed and an interesting move should to have a
> > structured (binary and all strings) S-exp; because it should become an
> > entry point to export the packages list to JSON.  WDYT?)
>
> It’s on purpose that this cache is an object file: it just needs to be
> mmap’d, and that’s it.  It’s the cheapest possible way to do it.
> Parsing sexps would be more costly, and since we’re talking about
> startup time, this is sensitive.

I agree and I have badly worded or I misunderstand something.
For example, 'supported-systems' is saved as a list of strings,
whereas 'license' is expanded as 3 strings without be packed in a list
of strings.  From my point of view, it is inconsistent and I do not
know what is the best (readibility, startup time, etc.).


> > To be clear about BM25 and caching, what I have in mind is:
> >   1. "guix search --build-index" optionally done by the user if they wants for example the BM25 ranking.
>
> Something that must be done explicitly doesn’t seem great to me.  As a
> user, I’d rather not think about search indexes and all.  But I don’t
> know, maybe if it happened automatically on the first ‘guix search’
> invocation that’d be fine.

I do not think it is an option to build the BM25 the first time "guix
search" is called.  Back-to-envelop estimation, it needs ~25 seconds
to Xapian* to do so.

From my point of view, two options:
 a) "guix pull" does this extra ~25 seconds (compared to 10 seconds to
build the v4 cache)
 b) the user manually build the index (I agree it is awkward!)

Well, the first question is to evaluate if it is worth -- I am using
the v2 version based on Xapian to have an idea.  Please if you have
suggestions about query (terms an user could type) and results
(packages an user could expect), there are welcome.


*Xapian: I do not think we could do better but I have not checked yet
if there is a bottleneck Guix, Guile-Xapian and Xapian.


> >  1. The name of 'fold-packages*' should be misleading since it does not return "true" packages.
>
> Did you see ‘fold-available-packages’?  It seems you could extend it
> instead of introducing ‘fold-packages*’, no?

Yes and no.

 a) 'fold-available-packages' requires to modify the 'lambda' in
'find-package-by-description',
 b) 'fold-package*' returning a 'package' is less tweaks, IMHO.

Well, I agree that on the long term, what 'fold-package*' does could
be done by 'fold-available-packages' with the adequate 'proc'.

Thank you for the suggestion; even if once re-read correctly v3 2/3
you already mentioned it. :-)


> >  2. The function 'package->recutils' in 'guix/ui.scm' is modified but it is not the better.
> >
> >           (match (package-supported-systems p)
> >             (('cache supported-systems)
> >              (string-join supported-systems))
> >             (_
> >              (string-join (package-transitive-supported-systems p)))))
> >
> >     However it avoids to duplicate code; as it is done in version v3.
>
> I made suggestions to Arun’s v3 about the API here.  Essentially, I
> think I proposed having a procedure that takes the list of fields as
> keyword parameters, and ‘package->recutils’ would just delegate to that.

Yes, it was already your suggestion in v3 3/3.  Do you suggest to
refactor 'package->recutils'? For example,

--8<---------------cut here---------------start------------->8---
(define* (package->recutils name version
                            ... all-the-other-fields ...
                            port #:optional (width (%text-width))
                            #:key
                            (hyperlinks? (supports-hyperlinks? port))
                            (extra-fields '()))
--8<---------------cut here---------------end--------------->8---



> >  4. Impolite '@@' is used to access the private license construction.
>
> (guix licenses) could provide a ‘string->license’ procedure.

Well, do you suggest:

    (define (string->license name) (license name #f #f))

? Skipping 'uri' and 'comment'?  Naive question: what is the purpose
of these 2 fields?  Because there are not exposed at the CLI level,
AFAIK, and I do not think an user evaluate '(license-uri pkg)' in a
script.

Well, I think that the hyperlink feature could be used to display the
license URI too.  WDYT?



> Stopping here for now because I’m sorta drowning in patch review.  :-)

Thank you for all the comments.


> Thanks for exploring this design space, we’re making progress!

My pleasure. Scheme is designed to explore. ;-)


Cheers,
simon
Ludovic Courtès May 3, 2020, 7:49 p.m. UTC | #3
Hi,

zimoun <zimon.toutoune@gmail.com> skribis:

> On Sun, 3 May 2020 at 18:43, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> > Therefore the cache '/lib/guix/package.cache' contains more
>> > information.
>>
>> This breaks the binary interface, so we’ll have to analyze the impact of
>> such a change and devise a strategy.
>
> Interface between what and what?

Guix revision N creates a cache that will be read by revision N+1, upon
‘guix pull’ completion.

> Note that I have read your comment in v3 2/3 but I did not understand it. Sorry.
>
> I realize the other cache also has that problem, but it would be nice to
> add a version tag to the cache.  Basically emit something like:
>
>   (package-metadata-cache (version 0) VECTOR …)
>
> instead of just:
>
>   (VECTOR …)

Yes, it would be better.

> For example, 'supported-systems' is saved as a list of strings,
> whereas 'license' is expanded as 3 strings without be packed in a list
> of strings.  From my point of view, it is inconsistent and I do not
> know what is the best (readibility, startup time, etc.).

I guess both ‘license’ and ‘supported-systems’ should be list of
strings.  It doesn’t really have an impact on startup time (I thought
you were suggesting storing the cache as an sexp instead of an object
file.)

>> Something that must be done explicitly doesn’t seem great to me.  As a
>> user, I’d rather not think about search indexes and all.  But I don’t
>> know, maybe if it happened automatically on the first ‘guix search’
>> invocation that’d be fine.
>
> I do not think it is an option to build the BM25 the first time "guix
> search" is called.  Back-to-envelop estimation, it needs ~25 seconds
> to Xapian* to do so.
>
> From my point of view, two options:
>  a) "guix pull" does this extra ~25 seconds (compared to 10 seconds to
> build the v4 cache)
>  b) the user manually build the index (I agree it is awkward!)
>
> Well, the first question is to evaluate if it is worth -- I am using
> the v2 version based on Xapian to have an idea.  Please if you have
> suggestions about query (terms an user could type) and results
> (packages an user could expect), there are welcome.

Yeah, dunno.  Maybe an option would be to create the index in such a way
that it is substitutable.


[...]

> Yes, it was already your suggestion in v3 3/3.  Do you suggest to
> refactor 'package->recutils'? For example,
>
> (define* (package->recutils name version
>                             ... all-the-other-fields ...
>                             port #:optional (width (%text-width))
>                             #:key
>                             (hyperlinks? (supports-hyperlinks? port))
>                             (extra-fields '()))

Yes.

>> >  4. Impolite '@@' is used to access the private license construction.
>>
>> (guix licenses) could provide a ‘string->license’ procedure.
>
> Well, do you suggest:
>
>     (define (string->license name) (license name #f #f))

No; rather, it would look up the license in a dictionary and return the
corresponding object or #f if it’s not a known license.

Thanks,
Ludo’.