Message ID | 20200503150154.26532-1-zimon.toutoune@gmail.com |
---|---|
Headers | show |
Hello! zimoun <zimon.toutoune@gmail.com> skribis: > The aim of this version v4 is to keep the same searching performances as the previous version v3 but to drastically reduce the generation of the cache. On my laptop, the overhead is now 4 seconds; compared to more than 20 seconds for v2 and v3. > > # default > time guix build /gnu/store/0nfpp82mqglpwvl1nbfpaphw5db2ivcp-guix-package-cache.drv --check > # v4 > time guix build /gnu/store/y78gfh1n7m3kyrj8wsqj25qc2cbc1a4d-guix-package-cache.drv --check > > | | default | v4 | > |------+----------+-----------| > | real | 0m6.012s | 0m10.244s | > | user | 0m0.541s | 0m0.542s | > | sys | 0m0.033s | 0m0.032s | Not bad! > In the version v3, the cache is built using 'cons' and 'fold-packages' (wrapper to 'fold-module-public-variables'). The version v4 modifies -- by adding other information -- the function 'generate-package-cache' which uses 'vhash' and 'fold-module-public-variables*'. > > Therefore the cache '/lib/guix/package.cache' contains more > information. This breaks the binary interface, so we’ll have to analyze the impact of such a change and devise a strategy. > (The v4 structure of 'package.cache' is a quick draft, so details > should be discussed and an interesting move should to have a > structured (binary and all strings) S-exp; because it should become an > entry point to export the packages list to JSON. WDYT?) It’s on purpose that this cache is an object file: it just needs to be mmap’d, and that’s it. It’s the cheapest possible way to do it. Parsing sexps would be more costly, and since we’re talking about startup time, this is sensitive. > Now, we are comparing apples to apples and the cost to compute BM25 (v2) is not free at all. Remember that BM25 is the state-of-the-art of information retrieval (relevance ranking) and it is delegated to Xapian (v2). I do not know if there is perfomance bottleneck between Guix, Guile-Xapian and Xapian itself but for sure the computation of BM25 is not free. More about that soon. > > To be clear about BM25 and caching, what I have in mind is: > 1. "guix search --build-index" optionally done by the user if they wants for example the BM25 ranking. Something that must be done explicitly doesn’t seem great to me. As a user, I’d rather not think about search indexes and all. But I don’t know, maybe if it happened automatically on the first ‘guix search’ invocation that’d be fine. > 2. Use BM25 metrics to detect poor package meta-data (synopsis and description); if it worth why not add another checker to "guix lint". That’d be interesting! > 1. The name of 'fold-packages*' should be misleading since it does not return "true" packages. Did you see ‘fold-available-packages’? It seems you could extend it instead of introducing ‘fold-packages*’, no? > 2. The function 'package->recutils' in 'guix/ui.scm' is modified but it is not the better. > > (match (package-supported-systems p) > (('cache supported-systems) > (string-join supported-systems)) > (_ > (string-join (package-transitive-supported-systems p))))) > > However it avoids to duplicate code; as it is done in version v3. I made suggestions to Arun’s v3 about the API here. Essentially, I think I proposed having a procedure that takes the list of fields as keyword parameters, and ‘package->recutils’ would just delegate to that. > 3. Deprecated packages are displayed (bug in v3 too). > > 4. Impolite '@@' is used to access the private license construction. (guix licenses) could provide a ‘string->license’ procedure. Stopping here for now because I’m sorta drowning in patch review. :-) Thanks for exploring this design space, we’re making progress! Ludo’.
Hi Ludo, On Sun, 3 May 2020 at 18:43, Ludovic Courtès <ludo@gnu.org> wrote: > > Therefore the cache '/lib/guix/package.cache' contains more > > information. > > This breaks the binary interface, so we’ll have to analyze the impact of > such a change and devise a strategy. Interface between what and what? Because from my understanding, this file is only used by only one guix. What do I miss? Note that I have read your comment in v3 2/3 but I did not understand it. Sorry. --8<---------------cut here---------------start------------->8--- I realize the other cache also has that problem, but it would be nice to add a version tag to the cache. Basically emit something like: (package-metadata-cache (version 0) VECTOR …) instead of just: (VECTOR …) --8<---------------cut here---------------end--------------->8--- > > (The v4 structure of 'package.cache' is a quick draft, so details > > should be discussed and an interesting move should to have a > > structured (binary and all strings) S-exp; because it should become an > > entry point to export the packages list to JSON. WDYT?) > > It’s on purpose that this cache is an object file: it just needs to be > mmap’d, and that’s it. It’s the cheapest possible way to do it. > Parsing sexps would be more costly, and since we’re talking about > startup time, this is sensitive. I agree and I have badly worded or I misunderstand something. For example, 'supported-systems' is saved as a list of strings, whereas 'license' is expanded as 3 strings without be packed in a list of strings. From my point of view, it is inconsistent and I do not know what is the best (readibility, startup time, etc.). > > To be clear about BM25 and caching, what I have in mind is: > > 1. "guix search --build-index" optionally done by the user if they wants for example the BM25 ranking. > > Something that must be done explicitly doesn’t seem great to me. As a > user, I’d rather not think about search indexes and all. But I don’t > know, maybe if it happened automatically on the first ‘guix search’ > invocation that’d be fine. I do not think it is an option to build the BM25 the first time "guix search" is called. Back-to-envelop estimation, it needs ~25 seconds to Xapian* to do so. From my point of view, two options: a) "guix pull" does this extra ~25 seconds (compared to 10 seconds to build the v4 cache) b) the user manually build the index (I agree it is awkward!) Well, the first question is to evaluate if it is worth -- I am using the v2 version based on Xapian to have an idea. Please if you have suggestions about query (terms an user could type) and results (packages an user could expect), there are welcome. *Xapian: I do not think we could do better but I have not checked yet if there is a bottleneck Guix, Guile-Xapian and Xapian. > > 1. The name of 'fold-packages*' should be misleading since it does not return "true" packages. > > Did you see ‘fold-available-packages’? It seems you could extend it > instead of introducing ‘fold-packages*’, no? Yes and no. a) 'fold-available-packages' requires to modify the 'lambda' in 'find-package-by-description', b) 'fold-package*' returning a 'package' is less tweaks, IMHO. Well, I agree that on the long term, what 'fold-package*' does could be done by 'fold-available-packages' with the adequate 'proc'. Thank you for the suggestion; even if once re-read correctly v3 2/3 you already mentioned it. :-) > > 2. The function 'package->recutils' in 'guix/ui.scm' is modified but it is not the better. > > > > (match (package-supported-systems p) > > (('cache supported-systems) > > (string-join supported-systems)) > > (_ > > (string-join (package-transitive-supported-systems p))))) > > > > However it avoids to duplicate code; as it is done in version v3. > > I made suggestions to Arun’s v3 about the API here. Essentially, I > think I proposed having a procedure that takes the list of fields as > keyword parameters, and ‘package->recutils’ would just delegate to that. Yes, it was already your suggestion in v3 3/3. Do you suggest to refactor 'package->recutils'? For example, --8<---------------cut here---------------start------------->8--- (define* (package->recutils name version ... all-the-other-fields ... port #:optional (width (%text-width)) #:key (hyperlinks? (supports-hyperlinks? port)) (extra-fields '())) --8<---------------cut here---------------end--------------->8--- > > 4. Impolite '@@' is used to access the private license construction. > > (guix licenses) could provide a ‘string->license’ procedure. Well, do you suggest: (define (string->license name) (license name #f #f)) ? Skipping 'uri' and 'comment'? Naive question: what is the purpose of these 2 fields? Because there are not exposed at the CLI level, AFAIK, and I do not think an user evaluate '(license-uri pkg)' in a script. Well, I think that the hyperlink feature could be used to display the license URI too. WDYT? > Stopping here for now because I’m sorta drowning in patch review. :-) Thank you for all the comments. > Thanks for exploring this design space, we’re making progress! My pleasure. Scheme is designed to explore. ;-) Cheers, simon
Hi, zimoun <zimon.toutoune@gmail.com> skribis: > On Sun, 3 May 2020 at 18:43, Ludovic Courtès <ludo@gnu.org> wrote: > >> > Therefore the cache '/lib/guix/package.cache' contains more >> > information. >> >> This breaks the binary interface, so we’ll have to analyze the impact of >> such a change and devise a strategy. > > Interface between what and what? Guix revision N creates a cache that will be read by revision N+1, upon ‘guix pull’ completion. > Note that I have read your comment in v3 2/3 but I did not understand it. Sorry. > > I realize the other cache also has that problem, but it would be nice to > add a version tag to the cache. Basically emit something like: > > (package-metadata-cache (version 0) VECTOR …) > > instead of just: > > (VECTOR …) Yes, it would be better. > For example, 'supported-systems' is saved as a list of strings, > whereas 'license' is expanded as 3 strings without be packed in a list > of strings. From my point of view, it is inconsistent and I do not > know what is the best (readibility, startup time, etc.). I guess both ‘license’ and ‘supported-systems’ should be list of strings. It doesn’t really have an impact on startup time (I thought you were suggesting storing the cache as an sexp instead of an object file.) >> Something that must be done explicitly doesn’t seem great to me. As a >> user, I’d rather not think about search indexes and all. But I don’t >> know, maybe if it happened automatically on the first ‘guix search’ >> invocation that’d be fine. > > I do not think it is an option to build the BM25 the first time "guix > search" is called. Back-to-envelop estimation, it needs ~25 seconds > to Xapian* to do so. > > From my point of view, two options: > a) "guix pull" does this extra ~25 seconds (compared to 10 seconds to > build the v4 cache) > b) the user manually build the index (I agree it is awkward!) > > Well, the first question is to evaluate if it is worth -- I am using > the v2 version based on Xapian to have an idea. Please if you have > suggestions about query (terms an user could type) and results > (packages an user could expect), there are welcome. Yeah, dunno. Maybe an option would be to create the index in such a way that it is substitutable. [...] > Yes, it was already your suggestion in v3 3/3. Do you suggest to > refactor 'package->recutils'? For example, > > (define* (package->recutils name version > ... all-the-other-fields ... > port #:optional (width (%text-width)) > #:key > (hyperlinks? (supports-hyperlinks? port)) > (extra-fields '())) Yes. >> > 4. Impolite '@@' is used to access the private license construction. >> >> (guix licenses) could provide a ‘string->license’ procedure. > > Well, do you suggest: > > (define (string->license name) (license name #f #f)) No; rather, it would look up the license in a dictionary and return the corresponding object or #f if it’s not a known license. Thanks, Ludo’.