Message ID | cu7pnfaar36.fsf@systemreboot.net |
---|---|
State | Work in progress |
Headers | show |
Series | [bug#39258] Faster guix search using an sqlite cache | expand |
Context | Check | Description |
---|---|---|
cbaines/comparison | success | View comparision |
cbaines/git branch | success | View Git branch |
cbaines/applying patch | fail | View Laminar job |
Hi Arun, Thank you for the patch! Cool! :-) I have not tested it yet. Sorry. On Thu, 23 Jan 2020 at 20:53, Arun Isaac <arunisaac@systemreboot.net> wrote: > At the moment, I am having some difficulty populating the sqlite > database. generate-package-cache populates the database correctly when > invoked from a normal guile REPL using geiser, but fails to do so when > run by the guix daemon during guix pull. [...] > On examining package-cache.sqlite, I find that no records have been > written. And, there is a lingering journal file that shouldn't be > there. For some reason, populating the sqlite database does not work > with guix pull. sqlite probably crashes and leaves the journal file. Hum? weird... Is it possible that a module is loaded when Guile repl is used and not with Guix pull? What about "guix repl"? > If I try to populate the database with each package record being > inserted in its own transaction, at least some of the insertions You mean 'commit' the database after each insertion, right? > work. But the journal file still lingers. My unverified guess is that > everything except the last transaction was successful. And this does not happen with the repl, right? > Any ideas what's going on? I have no idea. Weird. What about adding 'last-insert-row-id' from 'guix/store/database.scm'? I mean without really understanding and just grepping 'sqlite-finalize' spots that 'last-insert-row-id' seems often around. :-) > Also, inserting each package in its own transaction is ridiculously slow > and so that is out of the question. See https://www.sqlite.org/faq.html#q19 Agree that it is not an option. :-) Otherwise, 'list->string' is defined twice. And the first one is not necessary, I guess. The docstring of 'cache-lookup' is not coherent anymore. :-) Cheers, simon
> Hum? weird... > Is it possible that a module is loaded when Guile repl is used and not > with Guix pull? It could be. But I don't know how to confirm this theory. > What about "guix repl"? I just tried 'guix repl'. It worked correctly, just like guile repl. >> If I try to populate the database with each package record being >> inserted in its own transaction, at least some of the insertions > > You mean 'commit' the database after each insertion, right? Yes, that is what I mean. >> work. But the journal file still lingers. My unverified guess is that >> everything except the last transaction was successful. > > And this does not happen with the repl, right? No, this does not happen with the repl. >> Any ideas what's going on? > > I have no idea. > Weird. Do you know of any way sqlite can create an error log to report what's going on? That might really help debug this issue. > What about adding 'last-insert-row-id' from 'guix/store/database.scm'? > I mean without really understanding and just grepping > 'sqlite-finalize' spots that 'last-insert-row-id' seems often around. > :-) I tried this just now, but still the journal lingers. > Otherwise, 'list->string' is defined twice. And the first one is not > necessary, I guess. Ah, thanks for catching this! > The docstring of 'cache-lookup' is not coherent anymore. :-) Yes, I haven't gotten around to fixing up all those yet. I thought I'll get the code working first.
Hi, On Thu, 30 Jan 2020 at 14:49, Arun Isaac <arunisaac@systemreboot.net> wrote: > >> Any ideas what's going on? [...] > Do you know of any way sqlite can create an error log to report what's > going on? That might really help debug this issue. Danny told me something like: (catch (sqlite-error I have not tried yet. > > The docstring of 'cache-lookup' is not coherent anymore. :-) > > Yes, I haven't gotten around to fixing up all those yet. I thought I'll > get the code working first. Yes, I imagine. Just to notice. :-) All the best, simon
> Danny told me something like: > > (catch (sqlite-error > > I have not tried yet. Thank you, this was useful. I was able to catch and report the error. I also found the log file for the guix-package-cache profile hook. It says (repl-version 0 0) Generating package cache for '/gnu/store/b6f9b5qbcn4r932whrr6m15rdimbgrhs-profile'... (exception sqlite-error (value sqlite-open) (value 14) (value "Unable to open the database file")) This could be a permission error, or something to do with the existence or lack thereof of certain directories (such as /var) in the chroot of the build daemon. I'm still figuring it out. I'm also in half a mind to get some guile xapian bindings ready so we can just do that instead of messing with sqlite here. But, let's see. :-P
Hi, On Sun, 2 Feb 2020 at 22:16, Arun Isaac <arunisaac@systemreboot.net> wrote: > Thank you, this was useful. I was able to catch and report the error. I Where have you reported the error? > also found the log file for the guix-package-cache profile hook. It says > > (repl-version 0 0) > Generating package cache for '/gnu/store/b6f9b5qbcn4r932whrr6m15rdimbgrhs-profile'... > (exception sqlite-error (value sqlite-open) (value 14) (value "Unable to open the database file")) > > This could be a permission error, or something to do with the existence > or lack thereof of certain directories (such as /var) in the chroot of > the build daemon. I'm still figuring it out. Hum? And this should explain why it is working with the REPL and not with the CLI, right? > I'm also in half a mind to get some guile xapian bindings ready so we > can just do that instead of messing with sqlite here. But, let's > see. :-P Cool! Let me know if you push something somewhere. Cheers, simon
Hi, I have finally got xapian working for package search. Some comments follow. * Speed improvement Despite search-package-index in gnu/packages.scm taking only around 1.5ms, I see an overall speedup in `guix search` of only a factor of 2 -- from around 2s to around 1s. I wonder what else in `guix search` is taking up so much time. * Currently indexing only the package descriptions In this patchset, I have only indexed the package descriptions. In the next version of this patchset, I will index all other terms as specified in %package-metrics of guix/ui.scm. * Should I add guile-xapian as a propagated input to guix in gnu/packages/package-management.scm? * Drop regexp search support In this patchset, I have retained the older regexp search support. But, I think we should drop it and only have xapian search. In cases where the search index is not authoritative, we can build an in-memory xapian search index on the fly and use it to search. This will slow down the search, but will ensure our search results are consistent and do not depend on the authoritativeness of the search index. * Commit messages Except for patch 1, I am not sure what prefixes (build-self, gnu, etc.) to use in the first line of the commit message. Some advice there would be helpful. Regards, Arun. Arun Isaac (4): gnu: Add guile-xapian. build-self: Add guile-xapian to Guix dependencies. gnu: Generate xapian package search index. gnu: Use xapian index for package search. build-aux/build-self.scm | 11 ++++++++ gnu/packages.scm | 44 ++++++++++++++++++++++++++++- gnu/packages/guile-xyz.scm | 50 ++++++++++++++++++++++++++++++++- guix/channels.scm | 34 ++++++++++++++++++++++- guix/scripts/package.scm | 57 ++++++++++++++++++++++---------------- guix/self.scm | 7 ++++- 6 files changed, 175 insertions(+), 28 deletions(-)
Hi, Here is the second iteration of my Xapian Guix package search patchset. I have found the reason the earlier patchset did not show significant speedup. It turns out that most of the time is spent in printing and texinfo rendering of the search results. So, in this patchset, I pre-render the search results while building the Xapian index and stuff them into the Xapian database itself. Therefore, during `guix search`, I just pull out the pre-rendered search results and print it on the screen. This is much faster. See comparison below. --8<---------------cut here---------------start------------->8--- With a warm cache, $ time guix search inkscape real 0m1.787s user 0m1.745s sys 0m0.111s --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- $ time /tmp/test/bin/guix search inkscape real 0m0.199s user 0m0.182s sys 0m0.024s --8<---------------cut here---------------end--------------->8--- If most of the speedup comes from pre-rendering the results, it might seem that the Xapian search is not so useful. We might as well have stuffed the pre-rendered search results into the existing package cache generated by generate-package-cache, or so it might seem. But, there are the following arguments in favor of Xapian. - The package cache would grow in size, and lookup would be slowed down because we need to load the entire cache into memory. Xapian, on the other hand, need only look up the specific packages that match the search query. - Xapian can provide superior search results due to it stemming and language models. - Xapian can provide spelling correction and query expansion -- that is, suggest search terms to improve search results. Note that I haven't implemented this yet and is out of scope in this patchset. * Simplify our package search results Why not use a simpler package search results format like Arch Linux or Debian does? We could just display the package name, version and synopsis like so. inkscape 0.92.4 Vector graphics editor inklingreader 0.8 Wacom Inkling sketch format conversion and manipulation Why do we need the entire recutils format? If the user is interested, they can always use `guix package --show` to get the full recutils formatted info. Having shorter search results will make everything even faster and much more readable. WDYT? * How to test this patchset To get guile-xapian, run a `guix pull`, if you haven't already. Then in your Guix source directory, drop into an environment with guix dependencies and guile-xapian. $ guix environment guix --ad-hoc guile-xapian Apply patches and build. $ git am v2-0000-cover-letter.patch v2-0002-gnu-Generate-Xapian-package-search-index.patch v2-0001-build-self-Add-guile-xapian-to-Guix-dependencies.patch v2-0003-gnu-Use-Xapian-index-for-package-search.patch $ make Run a test guix pull. $ ./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test where xapian is the name of the branch you committed the patches to. Then, run the guix search in /tmp/test. $ /tmp/test/bin/guix search game * Comments Pierre Neidhardt <mail@ambrevar.xyz> writes: >> +(define (search-package-index profile querystring) > > Maybe `query-string'? Done in this patchset. >> + (define (regexp? str) >> + (string-any >> + (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$) >> + str)) >> + >> + (if (and (current-profile) >> + (not (any regexp? patterns))) > > I would not put characters like ".", "$", or "+" here, lest we mistake a > Xapian pattern for a regexp. > > As you said, I don't think both are compatible without ambiguity > anyways, so we should probably drop regexp (or at least toggle them with > a command line argument). I agree. zimoun <zimon.toutoune@gmail.com> writes: > In the commit message, I would capitalize Xapian. Done in this patchset. >> +(define (generate-package-search-index directory) >> + "Generate under DIRECTORY a xapian index of all the available packages." > > Xapian with capital. Done in this patchset. > Is (make-stem "en") for the locale? I still have English hard-coded. I haven't yet figured out how to detect the locale and stem accordingly. But, there is a larger problem. Since we cannot anticipate what locale the user will run guix search with, should we build the Xapian index for all locales? That is, should we index not only the English versions of the packages but also all other translations as well? > package-search-index and package-cache-file could be refactored > because they share all the same code. Yes, they could be. However, I'll postpone to the next iteration of the patchset. > I do not know what is the convention for the bindings. > But there is 'fold-packages' so I would be inclined to 'fold-msets' or > something in this flavour. Well, everywhere else in guile we have such things as vhash-fold, string-fold, hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are folding over a single mset (match-set). So, mset should be in the singular. > And more importantly, 'make as-derivations' to avoid a "guix pull" breakage, > Ah do not forget to adapt some tests. Will do this once we have consensus about the other features of this patchset. > b. The xapian relevance should truncated Done in this patchset. > Xapian does not return the package 'emacs' itself as the first. And worse, > it is not returned at all. In this patchset, since we're indexing the package name as well, emacs is returned but it is still far from the beginning. > I propose the value of 4294967295 for pagesize. In this patchset, I pass (database-document-count db) as the #:maximum-items keyword argument to enquire-mset. This is the upstream recommended way to get all search results. I hadn't done this earlier since I hadn't yet wrapped database-document-count in guile-xapian. >> In this patchset, I have only indexed the package descriptions. In the next >> version of this patchset, I will index all other terms as specified in >> %package-metrics of guix/ui.scm. > > Yes, it appears to me a detail that should be easy to fix. I mean, it > does not seems blocking. Done in this patchset. Ludovic Courtès <ludo@gnu.org> writes: > Note that ‘guix search’ time is largely dominated by I/O. Yes, `guix search` is I/O intensive. That is why I expect Xapian to do better since it only needs to access matching packages not all packages. Also, the Xapian index is fast at all times. It is not very dependent on a warm filesystem cache. > On my laptop, > I get (first measurement is cold cache, second one is warm cache): > > --8<---------------cut here---------------start------------->8--- > $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' > $ time guix search foo >/dev/null > > real 0m2.631s > user 0m1.134s > sys 0m0.124s > $ time guix search foo >/dev/null > > real 0m0.836s > user 0m1.027s > sys 0m0.053s > --8<---------------cut here---------------end--------------->8--- > > It’s hard to do better on the warm cache case because at this level, > there may be other things to optimize having little to do with searching > itself. > > Note that this is on an SSD; the cold-cache case must be worse on NFS or > on a spinning disk, and there we could gain a lot. My laptop is quite old with a particularly slow HDD. Hence my motivation to improve guix search performance! > I think we should weigh the pros and cons on all these aspects: speed, > complexity and maintenance cost, search result quality, search features, > etc. I agree. > PS: I have not yet looked at the whole series as I’m just coming back to > the keyboard. :-) Welcome back! :-) Arun Isaac (3): build-self: Add guile-xapian to Guix dependencies. gnu: Generate Xapian package search index. gnu: Use Xapian index for package search. build-aux/build-self.scm | 11 +++++++ gnu/packages.scm | 62 +++++++++++++++++++++++++++++++++++++++- guix/channels.scm | 34 +++++++++++++++++++++- guix/scripts/package.scm | 7 +++-- guix/self.scm | 7 ++++- guix/ui.scm | 37 ++++++++++++++++++++++++ 6 files changed, 153 insertions(+), 5 deletions(-)
Hi everyone, This is v3 of my attempt to make guix search faster. In this version, I have abandoned use of xapian. Instead I build a cache of the metadata of all packages in a profile hook. Then, I use that cache to search and display search results. This way, package guile modules are not loaded during guix search. Speedup is around 2x. Both measurements below are with a warm cache. --8<---------------cut here---------------start------------->8--- $ time guix search inkscape real 0m1.722s user 0m1.776s sys 0m0.097s --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- $ time /tmp/test/bin/guix search inkscape real 0m0.749s user 0m0.770s sys 0m0.020s --8<---------------cut here---------------end--------------->8--- This patchset does not affect the search API nor does it improve the relevance of search results. If there is interest in this approach, I'll complete this patchset properly. But, in the long run, I do think we should aim to get xapian or the like for guix search. WDYT? Unfortunately, generate-package-metadata-cache takes 43 seconds to build the cache on my relatively slow computer. Performance should be better on other people's machines. Meanwhile, it would still be useful if someone built patchset v2 on their machine and reported the time it took to build the xapian index. * How to test this patchset Apply patches and build as usual. Do a guix pull into a temporary profile. $ ./pre-inst-env guix pull --url=$PWD --branch=the-name-of-the-branch-you-applied-patches-to -p /tmp/test Then, run guix search from the built profile $ /tmp/test/bin/guix search inkscape Thanks! Arun Isaac (3): guix: Generate package metadata cache. guix: Search package metadata cache. guix: Use package metadata cache for package search. gnu/packages.scm | 88 +++++++++++++++++++++++++- guix/channels.scm | 34 +++++++++- guix/packages.scm | 32 ++++++++++ guix/scripts/package.scm | 5 +- guix/ui.scm | 132 ++++++++++++++++++++++++++++++++++++--- 5 files changed, 277 insertions(+), 14 deletions(-)
Hi, Thank you Arun for the patches and all the work. Sorryfor the delay. TLDR: 1) around 25 seconds added to "guix pull"... but I am more than often waiting around 10 minutes when pulling. 2) the speedup is clear: more than 2x. The question is the tradeoff between: the slowdown of pull vs the speedup of search. What is acceptable? Here let benchmark 3 versions of Guix: - default is a357849f5b - v2 rebased on default and based on Xapian - v3 rebased on default too and based on "custom" index and let compare the time of "guix pull" and then "guix search". Because v2 uses Xapian, the accuracy is different and so the list of outputs is different depending on the query; the impact on the performance seems minimal. Let discuss elsewhere about accuracy and BM25 and let focus on performance for now. * guix pull ----------- The idea is: measure if computing the new index is expensive or not, compared to all of what "guix pull" computes. ** Reference ------------ Maybe, I should have misconfigured something or my laptop is really not powerful at all, but here some numbers. (Note: /proc/cpuinfo says 4 times Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz and /sys/block/sda/queue/rotational says 0 which is SSD.) --8<---------------cut here---------------start------------->8--- $ guix describe Generation 8 Apr 25 2020 09:00:01 (current) guix f84b036 repository URL: https://git.savannah.gnu.org/git/guix.git branch: master commit: f84b0363053e5479464f6ce6ded45f80360d90fc --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- $ time guix pull -C ~/.config/guix/default-channels.scm Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'... Building from this channel: guix https://git.savannah.gnu.org/git/guix.git 8cf6d15 downloading from https://ci.guix.gnu.org/nar/gzip/xgakzpfs3rz57m666hsk1v3d3zcy7wgn-config.scm ... config.scm [...] building fonts directory... building directory of Info manuals... building database for manual pages... building profile with 1 package... building /gnu/store/kq1zlj5rxz8wrxc3ha8vck2wv2iakfnb-inferior-script.scm.drv... building package cache... building profile with 1 package... New in this revision: 2 new packages: cl-osicat, sbcl-osicat real 13m37.997s user 1m38.129s sys 0m0.856s --8<---------------cut here---------------end--------------->8--- And because "guix search" is used say 10 times more than "guix pull", an increase of 10% of "guix pull" will ease the experience of the user if "guix search" is faster, IMHO. Therefore, because "guix pull" takes around 13 minutes, the extra cost to index all the packages can be roughly 1min30s (at most). Then, if I pull back from 8cf6d15 to '--commit=a357849f5b' then it takes: real 2m13.693s user 1m37.418s sys 0m0.666s so in this case 10% means around 7s. But after 1 minute waiting, the command feels too long to me and personally I am already waiting so I do not mind much if it would take 2m13s or 3m00s. Well, it is hard to draw a clear line about what could be accepted as the time of indexing because the time of pulling is already highly variable. What is the average of "guix pull"? It could be really interresting to probe the users. They could report: - guix describe - time guix pull whatever which channels are up. Just to have an idea about what should be the acceptable extra time added by indexing. For sure it depends on the hardware but it would provide an idea and help to see if the extra time is worth or not. WDYT? ** Let's compare the index time ------------------------------- Let pull for the 3 cases and populate the store by all the necessary items. Could be looooonng! (20minutes) For example, for the version 2 of patches -- living in my branch 'search-v2' using a worktree. --8<---------------cut here---------------start------------->8--- time ./pre-inst-env guix pull -p /tmp/v2 \ --url=$PWD --branch=search-v2 \ -C ~/.config/guix/default-channels.scm --8<---------------cut here---------------end--------------->8--- and then let spot the index file for each version: --8<---------------cut here---------------start------------->8--- # ls -l /tmp/default/lib/guix /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache readlink /tmp/v2/lib/guix/package-search.index /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index/lib/guix/package-search.index readlink /tmp/v3/lib/guix/package-metadata.cache /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache/lib/guix/package-metadata.cache --8<---------------cut here---------------end--------------->8--- Well, let remove the profiles and garbage collect the index files: --8<---------------cut here---------------start------------->8--- rm /tmp/default /tmp/v{2,3}* guix gc -D \ /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \ /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \ /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache --8<---------------cut here---------------end--------------->8--- And then re-run "guix pull". We are now comparing apple to apple, I guess. | time | default | v2 | v3 | |------+-----------+-----------+-----------| | real | 1m11.899s | 1m30.806s | 1m34.341s | | user | 1m23.845s | 1m24.160s | 1m24.233s | | sys | 0m0.570s | 0m0.563s | 0m0.529s | Therefore less than extra 20s and 25s for v2 and v3. All the question is an extra 25s compared to which time of "guix pull": - more than 13m: adding 25s is acceptable - less than 2m: adding 25s is questionable Usually, my feeling about "guix pull" is... I am waiting! Therefore, I will not see this extra 25s because it is masked by all the other work "guix pull" is doing. * guix search ------------- Let compare cold (sudo echo 3 > /proc/sys/vm/drop_caches) and warm cache. For example for the query 'inkscape'. | time | default | v2 | v3 | |------+----------+----------+----------| | real | 0m1.842s | 0m0.331s | 0m0.437s | | user | 0m1.270s | 0m0.179s | 0m0.336s | | sys | 0m0.142s | 0m0.047s | 0m0.052s | |------+----------+----------+----------| | real | 0m0.898s | 0m0.132s | 0m0.292s | | user | 0m1.069s | 0m0.168s | 0m0.353s | | sys | 0m0.072s | 0m0.008s | 0m0.019s | Therefore the speedup is at least 3. | cache | default-vs-v2 | default-vs-v3 | |-------+---------------+---------------| | cold | 5.6 | 4.2 | | warm | 6.8 | 3.1 | Another query: --8<---------------cut here---------------start------------->8--- time guix search crypto library | recsel -P name | grep libb2 --8<---------------cut here---------------end--------------->8--- | time | default | v2 | v3 | |------+----------+----------+----------| | real | 0m2.216s | 0m1.109s | 0m0.689s | | user | 0m1.655s | 0m1.309s | 0m0.683s | | sys | 0m0.193s | 0m0.073s | 0m0.035s | |------+----------+----------+----------| | real | 0m1.197s | 0m0.490s | 0m0.491s | | user | 0m1.448s | 0m0.819s | 0m0.625s | | sys | 0m0.089s | 0m0.034s | 0m0.039s | | cache | default-vs-v2 | default-vs-v3 | |-------+---------------+---------------| | cold | 2.0 | 3.2 | | warm | 2.4 | 2.4 | Before going further, especially about any other more sophisticated inverted index (BM25), it appears to me important to fix what is "cost" on "guix pull" that the users are ready to pay. Because somehow the inverted index has to be computed. And without an inverted index, it seems difficult to improve the accurary. One solution should be: let compute the inverted index in the background with a low priority. If the index is not done yet when "guix search" is called, then fallback to the current default behaviour. WDYT? Cheers, simon
Hi Simon, Thanks for taking the time to benchmark this, this is very insightful! > Usually, my feeling about "guix pull" is... I am waiting! Therefore, > I will not see this extra 25s because it is masked by all the other > work "guix pull" is doing. I agree and this is a very good point in my opinion. While I don't expect nor do I need "guix pull" to complete immediately, this is not true of "guix search". As Simon suggested, maybe we can wrap a benchmark script together, post it on the mailing list and ask member to report their results. Maybe a few dozen results would give us a better idea of the numbers we are dealing with. Cheers!
Hi, zimoun <zimon.toutoune@gmail.com> skribis: > 1) around 25 seconds added to "guix pull"... but I am more than often > waiting around 10 minutes when pulling. > 2) the speedup is clear: more than 2x. Nice! It does seem like Arun’s v3 (or maybe even v2) would work nicely. > The question is the tradeoff between: the slowdown of pull vs the > speedup of search. What is acceptable? That’s only one criterion among others. I hear the argument that 25s is “nothing” compared to the rest, but it’s really a tradeoff. Like, if I spent a day optimizing ‘guix pull’ and managed to save 25s, I would find it nice. :-) > $ time guix pull -C ~/.config/guix/default-channels.scm It also depends on what’s in that file, of course. > Then, if I pull back from 8cf6d15 to '--commit=a357849f5b' then it takes: > > real 2m13.693s > user 1m37.418s > sys 0m0.666s For me: --8<---------------cut here---------------start------------->8--- $ guix describe Generacio 139 Apr 13 2020 21:50:08 (nuna) guix bad368b repository URL: https://git.savannah.gnu.org/git/guix.git branch: master commit: bad368b0d794689f3a8a11b58f1ea4987938682e $ time guix pull -p /tmp/test --commit=bad368b0d794689f3a8a11b58f1ea4987938682e Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'... Building from this channel: guix https://git.savannah.gnu.org/git/guix.git bad368b [...] real 0m57.916s user 1m1.017s sys 0m0.609s --8<---------------cut here---------------end--------------->8--- (On a 2.6 GHz i7 though.) > Well, let remove the profiles and garbage collect the index files: > > rm /tmp/default /tmp/v{2,3}* > guix gc -D \ > /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \ > /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \ > /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache Could you do, for v2 and v3: time guix build /gnu/store/…-package-metadata-cache.drv --check ? That we’ll give us the exact cost of that part. It’ll be interesting especially in the Xapian case, which we expected to be higher. Thanks for the insightful benchmarks! Ludo’.
Hi Ludo, On Sun, 26 Apr 2020 at 17:49, Ludovic Courtès <ludo@gnu.org> wrote: > It does seem like Arun’s v3 (or maybe even v2) would work nicely. The v3 is more interesting because it does not change the relevance scoring and does not add other dependency. However v2 is interesting to easily test BM25 which is another relevance scoring... work in progress. :-) > > The question is the tradeoff between: the slowdown of pull vs the > > speedup of search. What is acceptable? > > That’s only one criterion among others. I hear the argument that 25s is > “nothing” compared to the rest, but it’s really a tradeoff. Like, if I > spent a day optimizing ‘guix pull’ and managed to save 25s, I would find > it nice. :-) And I expect that the middle-term roadmap would even decrease more the computations of derivations. ;-) > > $ time guix pull -C ~/.config/guix/default-channels.scm > > It also depends on what’s in that file, of course. Contains only one line: %default-channels See my wishlist ;-) https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00393.html me: 2m13.693s you: 0m57.916s As we already discussed elsewhere, it is hard to "test" 'guix pull'. Does it make sense to measure "guix pull"? As Chris (Marusich) did for CDN. > > Well, let remove the profiles and garbage collect the index files: > > > > rm /tmp/default /tmp/v{2,3}* > > guix gc -D \ > > /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \ > > /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \ > > /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache > > Could you do, for v2 and v3: > > time guix build /gnu/store/…-package-metadata-cache.drv --check Newbie me! :-) Two points: 1. It may not be reproducible... I am checking. 2. The time seems similar (v2=26s and v3=29s) considering the time to start Guile and so on. --8<---------------cut here---------------start------------->8--- guix gc --list-live | grep metadata time /tmp/v3/bin/guix build /gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv --check The following profile hook will be built: /gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv building package cache... (repl-version 0 1 1) Generating package metadata cache for '/gnu/store/95mi525syinh08jmcd3q7a7a8mr1sykb-profile'... (values (value "/gnu/store/zhp7wv87vr6iis0fa3ff925i5r04i08q-guix-package-metadata-cache/lib/guix/package-metadata.cache")) guix build: error: derivation `/gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv' may not be deterministic: output `/gnu/store/zhp7wv87vr6iis0fa3ff925i5r04i08q-guix-package-metadata-cache' differs real 0m29.788s user 0m0.535s sys 0m0.025s --8<---------------cut here---------------end--------------->8--- > That we’ll give us the exact cost of that part. It’ll be interesting > especially in the Xapian case, which we expected to be higher. --8<---------------cut here---------------start------------->8--- time /tmp/v2/bin/guix build /gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv --check The following profile hook will be built: /gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv running profile hook of type 'package-search-index'... (repl-version 0 1 1) Generating package search index for '/gnu/store/wiinj9nrb45wlf2cgbgkjl9chxz9cb9b-profile'... (values (value "/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index/lib/guix/package-search.index")) guix build: error: derivation `/gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv' may not be deterministic: output `/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index' differs real 0m26.552s user 0m0.626s sys 0m0.046s --8<---------------cut here---------------end--------------->8--- It is not higher. Why should it be? Considering aside the issue of reproducibility -- which should be one! -- well, should be possible to download the index file as any other substitute? Cheers, simon
zimoun <zimon.toutoune@gmail.com> skribis: >> Could you do, for v2 and v3: >> >> time guix build /gnu/store/…-package-metadata-cache.drv --check > > Newbie me! :-) > > Two points: > > 1. It may not be reproducible... I am checking. > 2. The time seems similar (v2=26s and v3=29s) considering the time > to start Guile and so on. Good, so it means that’s not Xapian taking time here. Thanks again! Ludo’.
Hi Ludo, On Sun, 26 Apr 2020 at 17:49, Ludovic Courtès <ludo@gnu.org> wrote: > That’s only one criterion among others. I hear the argument that 25s is > “nothing” compared to the rest, but it’s really a tradeoff. Like, if I > spent a day optimizing ‘guix pull’ and managed to save 25s, I would find > it nice. :-) I am not sure to understand all what "guix pull" does. Does "guix pull" compile all the scheme files under 'gnu/'? Probably only recompiles the "new" files? I do not know if it makes sense, but I just note this difference: 1. Search without compiling of all files under 'gnu/packages/' 2. Compile all the files under 'gnu/packages/' then search 3. Search with only the file gnu/packages/emacs-xyz.scm not compiled (all the other files are compiled) 4. Compile the file above and then search 3b and 4b with gnu/packages/cobol.scm which is smaller than emacs-xyz.scm. Results: 1) 1m43.312s 2) 0m1.301s (but 9m51.801s compiling) 3) 0m6.526s 4) 0m1.389s (1m8.670s compiling) 3b) 0m0.921s 4b) 0m0.924s (0m1.884s compiling) Therefore, an option to reduce the time when pulling should to relax the "compilation" for 'gnu/packages/' and 'gnu/services'; something less optimized since the packages and services "just" need to be transformed into bytecode to improve IO when reading them. Perhaps I miss a point... And maybe, it is similar than what Andy Wingo is proposing in [1]. [1] https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00444.html Cheers, simon --8<---------------cut here---------------start------------->8--- find gnu/packages -name "*.scm" -type f -exec touch {} \; time ./pre-inst-env guix search gmsh | recsel -C -p name ;;; note: source file /home/simon/src/guix/wk/tmp/gnu/packages/abduco.scm ;;; newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/abduco.go [...] ;;; note: source file /home/simon/src/guix/wk/tmp/gnu/packages/zwave.scm ;;; newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/zwave.go name: gmsh real 1m43.312s user 2m19.318s --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- find gnu/packages -name "*.scm" -type f -exec touch {} \; time make -j4 && time ./pre-inst-env guix search gmsh | recsel -C -p name make all-recursive make[1]: Entering directory '/home/simon/src/guix/wk/tmp' Making all in po/guix make[2]: Entering directory '/home/simon/src/guix/wk/tmp/po/guix' make[2]: Leaving directory '/home/simon/src/guix/wk/tmp/po/guix' Making all in po/packages make[2]: Entering directory '/home/simon/src/guix/wk/tmp/po/packages' make[2]: Leaving directory '/home/simon/src/guix/wk/tmp/po/packages' make[2]: Entering directory '/home/simon/src/guix/wk/tmp' Compiling Scheme modules... [ 0%] LOAD gnu/packages/abduco.scm ;;; note: source file ./gnu/packages/abduco.scm ;;; newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/abduco.go [...] [100%] GUILEC gnu/packages/zwave.go make[2]: Leaving directory '/home/simon/src/guix/wk/tmp' make[1]: Leaving directory '/home/simon/src/guix/wk/tmp' real 9m51.801s user 29m18.938s sys 0m5.822s name: gmsh real 0m1.301s user 0m1.266s sys 0m0.101s --8<---------------cut here---------------end--------------->8---
Dear, The aim of this version v4 is to keep the same searching performances as the previous version v3 but to drastically reduce the generation of the cache. On my laptop, the overhead is now 4 seconds; compared to more than 20 seconds for v2 and v3. --8<---------------cut here---------------start------------->8--- # default time guix build /gnu/store/0nfpp82mqglpwvl1nbfpaphw5db2ivcp-guix-package-cache.drv --check # v4 time guix build /gnu/store/y78gfh1n7m3kyrj8wsqj25qc2cbc1a4d-guix-package-cache.drv --check --8<---------------cut here---------------end--------------->8--- | | default | v4 | |------+----------+-----------| | real | 0m6.012s | 0m10.244s | | user | 0m0.541s | 0m0.542s | | sys | 0m0.033s | 0m0.032s | In the version v3, the cache is built using 'cons' and 'fold-packages' (wrapper to 'fold-module-public-variables'). The version v4 modifies -- by adding other information -- the function 'generate-package-cache' which uses 'vhash' and 'fold-module-public-variables*'. Therefore the cache '/lib/guix/package.cache' contains more information. (The v4 structure of 'package.cache' is a quick draft, so details should be discussed and an interesting move should to have a structured (binary and all strings) S-exp; because it should become an entry point to export the packages list to JSON. WDYT?) Now, we are comparing apples to apples and the cost to compute BM25 (v2) is not free at all. Remember that BM25 is the state-of-the-art of information retrieval (relevance ranking) and it is delegated to Xapian (v2). I do not know if there is perfomance bottleneck between Guix, Guile-Xapian and Xapian itself but for sure the computation of BM25 is not free. More about that soon. To be clear about BM25 and caching, what I have in mind is: 1. "guix search --build-index" optionally done by the user if they wants for example the BM25 ranking. 2. Use BM25 metrics to detect poor package meta-data (synopsis and description); if it worth why not add another checker to "guix lint". However, ranking is another story and I am not convinced yet if BM25 fits Guix needs or not. * Details ~~~~~~~~~ The pacthes applies against the commit a357849f5b (and it is not yet rebased). --8<---------------cut here---------------start------------->8--- time ./pre-env-inst guix pull --branch=search-v4 --url=$PWD -p /tmp/v4 --8<---------------cut here---------------end--------------->8--- Similar test than the previous benchmark (cold cache). --8<---------------cut here---------------start------------->8--- time ./pre-env-inst /tmp/v4/bin/guix search crypto library \ | recsel -P name | grep libb2 name: libb2 real 0m0.784s user 0m0.810s sys 0m0.037s --8<---------------cut here---------------end--------------->8--- And the option '--load-path' turns off the cache and it fallbacks to the usual 'fold-package'. --8<---------------cut here---------------start------------->8--- time ./pre-inst-env /tmp/v4/bin/guix search -L /tmp/my-pkgs crypto library \ | recsel -C -p name | grep libb2 name: libb2 real 0m2.446s user 0m1.872s sys 0m0.187s --8<---------------cut here---------------end--------------->8--- * Still draft ~~~~~~~~~~~~~ 1. The name of 'fold-packages*' should be misleading since it does not return "true" packages. --8<---------------cut here---------------start------------->8--- (define get-hello (p r) (if (string=? (package-name p) "hello") p r)) (define no-cache (fold-packages get-hello '())) (define from-cache (fold-packages* get-hello '())) (equal? no-cache from-cache) ;;; #f --8<---------------cut here---------------end--------------->8--- Another name for the procedure is welcome if it is an issue. 2. The function 'package->recutils' in 'guix/ui.scm' is modified but it is not the better. --8<---------------cut here---------------start------------->8--- (match (package-supported-systems p) (('cache supported-systems) (string-join supported-systems)) (_ (string-join (package-transitive-supported-systems p))))) --8<---------------cut here---------------end--------------->8--- However it avoids to duplicate code; as it is done in version v3. 3. Deprecated packages are displayed (bug in v3 too). 4. Impolite '@@' is used to access the private license construction. 5. Commit messages are incomplete, copyright header too, etc.. * Next? ~~~~~~~ IMHO, simply caching improves the current situation: - a bit of extra time at pull time (less than 5s on my machine) + speed up at search time (2x faster) * maintainable code? Is it in the right direction? Could you advise for a more compliant code? Could you test on your machines to have another point of comparison? Best regards, simon zimoun (3): DRAFT packages: Add fields to packages cache. DRAFT packages: Add new procedure 'fold-packages*'. DRAFT guix package: Use cache in 'find-packages-by-description'. gnu/packages.scm | 98 ++++++++++++++++++++++++++++++++++++++-- guix/scripts/package.scm | 2 +- guix/ui.scm | 29 +++++++----- tests/packages.scm | 31 +++++++++++++ 4 files changed, 143 insertions(+), 17 deletions(-)
Hi, Sorry for the long delay in replying to this thread. I think Ludo is right in that we can improve guix search performance with only simple code improvements rather than including xapian or improving our existing cache. Here are a few patches on those lines. In `relevance`, we set our score to 0 if any of the regexps don't match. Then, we might as well not match the remaining regexps. Patch 1 does this early cut off optimization. Often our search strings are only literal strings. So, we can save some time by using string-contains instead of invoking the regexp engine. Patch 2 does this. In addition, guile's string-contains uses a naive O(n^2) string search algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code mentions this. If implemented, the KMP algorithm could speed up guix search further. [1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm Patch 3 and 4 are minor improvements. Here's a rough performance comparison. --8<---------------cut here---------------start------------->8--- time ./pre-inst-env guix search game real 0m2.261s user 0m2.351s sys 0m0.104s --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- time guix search game real 0m2.661s user 0m2.843s sys 0m0.080s --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- time ./pre-inst-env guix search strategy game real 0m1.613s user 0m1.635s sys 0m0.096s --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- time guix search strategy game real 0m2.520s user 0m2.583s sys 0m0.112s --8<---------------cut here---------------end--------------->8--- Arun Isaac (4): ui: Cut off search early if any regexp does not match. ui: Use string matching with literal search strings. ui: Do not translate package synopsis a second time. ui: Use package-description-string. guix/scripts/package.scm | 12 +++++-- guix/ui.scm | 68 ++++++++++++++++++++++++---------------- 2 files changed, 50 insertions(+), 30 deletions(-)
Dear, > > Often our search strings are only literal strings. So, we can save some time > > by using string-contains instead of invoking the regexp engine. Patch 2 does > > this. In addition, guile's string-contains uses a naive O(n^2) string search > > algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In > > fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code > > mentions this. If implemented, the KMP algorithm could speed up guix search > > further. > > > > [1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm It could improve. Well, I will try to do some back-to-envelop computations because I am not convinced that the mean value of 'n' (length of description, isn't) is large enough to really see an improvement for the end-user; the visible bottleneck is I/O. All the best, simon ps; To be honest, I thought this kind of algorithm was the default. :-)
On Mon, Jun 01, 2020 at 12:11:52PM +0200, zimoun wrote: > Dear, > > > > Often our search strings are only literal strings. So, we can save some time > > > by using string-contains instead of invoking the regexp engine. Patch 2 does > > > this. In addition, guile's string-contains uses a naive O(n^2) string search > > > algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In > > > fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code > > > mentions this. If implemented, the KMP algorithm could speed up guix search > > > further. > > > > > > [1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm > > It could improve. > Well, I will try to do some back-to-envelop computations because I am > not convinced that the mean value of 'n' (length of description, > isn't) is large enough to really see an improvement for the end-user; > the visible bottleneck is I/O. > > All the best, > simon > > ps; > To be honest, I thought this kind of algorithm was the default. :-) I also recommend taking a look at the Boyer Moore string search implementation in (guix build grafts). It would be great to generalize it and make it accessible to other parts of Guix.
> I also recommend taking a look at the Boyer Moore string search > implementation in (guix build grafts). Nice, I didn't know Guix had an implementation of Boyer Moore. I'll take a look at it. At the very least, I need something similar for guile-email. But, the current implementation of guile's string-contains is in C. So, I assume a KMP or Boyer Moore implementation of string-contains should also be in C.
Arun Isaac <arunisaac@systemreboot.net> skribis: >> I also recommend taking a look at the Boyer Moore string search >> implementation in (guix build grafts). > > Nice, I didn't know Guix had an implementation of Boyer Moore. I'll take > a look at it. At the very least, I need something similar for > guile-email. > > But, the current implementation of guile's string-contains is in C. So, > I assume a KMP or Boyer Moore implementation of string-contains should > also be in C. Not necessarily. But it’d be great to have it in Guile proper, for sure! Ludo’.
Hi, This is an attempt to improve the performance of "guix search". It is still half baked but it allows to discuss further the idea about expanding the current '/lib/guix/package.cache' and avoids to forget an IRL discussion. ;-) Let start by what needs to be improved: the part when cache is not authoritative. It is slower than the current approach because the package is read twice, i.e., the module is indeed loaded twice, once by 'fold-available-packages' via 'fold-module-public-variables*' and then again by 'find-packages-by-description' via 'read-package-from'. The issue is to have a common interface for both cases (cache and no-cache). More thoughts are required. ;-) Then, using the cache is slower than expected. Therefore, something is maybe twisted -- quick implementation before holidays ;-) -- with the use of 'fold-avaibale-packages' as proposed by Ludo [1]. Note that instead another 'fold-packages' (say 'fold-packages*') using the new cache should be used. As it is done with v4 and the performances were as expected: <http://issues.guix.gnu.org/39258#89> 1: <http://issues.guix.gnu.org/39258#93> From my understanding, the issue that 'package-relevance' accepts a 'package' (and then all the chain until displaying) and 'fold-avaibale-packages' does not return a package. Well, I do not know; especially where to put something similiar to 'read-package-from'. To test, after applying the patches, the command is: ./pre-inst-env guix pull --allow-downgrades --disable-authentication \ --url=$(pwd) --branch=search-v6 -p /tmp/new Let compare only for cold cache and time this cache building (Guix 7db8fd6): sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix build --check $(guix gc --derivers $(readlink -f ~/.config/guix/current/lib/guix/package.cache)) real 0m28,848s user 0m1,481s sys 0m0,252s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix build --check $(guix gc --derivers $(readlink -f /tmp/new/lib/guix/package.cache)) real 0m40,279s user 0m1,582s sys 0m0,232s It seems longer but compared to the time of "guix pull" completion, it seems acceptable. However, maybe it could become an issue when running a lot of "guix time-machine"... Well, hard trade-off. ;-) Let compare for some queries: --8<---------------cut here---------------start------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search game | recsel -C -P name | wc -l 371 real 0m7,561s user 0m3,525s sys 0m0,391s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search game | recsel -C -P name | wc -l 371 real 0m9,814s user 0m3,240s sys 0m0,363s --8<---------------cut here---------------end--------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search strategy game | recsel -C -P name | wc -l 16 real 0m8,565s user 0m2,803s sys 0m0,430s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search strategy game | recsel -C -P name | wc -l 16 real 0m9,679s user 0m2,370s sys 0m0,334s --8<---------------cut here---------------start------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search strategy game caesar | recsel -C -P name | wc -l 0 real 0m8,307s user 0m2,388s sys 0m0,366s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search strategy game caesar | recsel -C -P name | wc -l 0 real 0m3,626s user 0m0,948s sys 0m0,101s --8<---------------cut here---------------end--------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search game strategy the | recsel -C -P name | wc -l 4 real 0m8,776s user 0m2,903s sys 0m0,454s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search game strategy the | recsel -C -P name | wc -l 4 real 0m9,495s user 0m2,546s sys 0m0,313s --8<---------------cut here---------------start------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search the game strategy | recsel -C -P name | wc -l 4 real 0m8,502s user 0m2,534s sys 0m0,388s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search the game strategy | recsel -C -P name | wc -l 4 real 0m9,508s user 0m2,254s sys 0m0,363s --8<---------------cut here---------------end--------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search crypto library | recsel -C -P name | grep libb2 libb2 real 0m8,744s user 0m2,875s sys 0m0,374s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search crypto library | recsel -C -P name | grep libb2 libb2 real 0m9,229s user 0m2,448s sys 0m0,397s --8<---------------cut here---------------start------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search cuirass integration | recsel -C -P name | wc -l 1 real 0m8,132s user 0m2,343s sys 0m0,407s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search cuirass integration | recsel -C -P name | wc -l 1 real 0m8,940s user 0m2,036s sys 0m0,369s --8<---------------cut here---------------end--------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search cuirass | recsel -C -P name | wc -l 2 real 0m8,240s user 0m2,461s sys 0m0,367s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search cuirass | recsel -C -P name | wc -l 2 real 0m8,863s user 0m2,019s sys 0m0,377s --8<---------------cut here---------------start------------->8--- sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time guix search cuirass integration foo | recsel -C -P name | wc -l 0 real 0m8,258s user 0m2,418s sys 0m0,521s sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search cuirass integration foo | recsel -C -P name | wc -l 0 real 0m3,358s user 0m0,867s sys 0m0,139s --8<---------------cut here---------------end--------------->8--- This last example suggests that 'read-package-from' is the slowdown. (On a side note, maybe I am doing wrong, but there is no improvement by the recent introduction of 'cut' for multi-terms as the query "the game strategy" and "game strategy the". Another story. :-)) When cache is not authoritative, it is worse, as expected: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' time /tmp/new/bin/guix search -L /tmp/my-pkgs cuirass integration foo | recsel -C -P name | wc -l 0 real 0m12,503s user 0m7,807s sys 0m0,529s and note that currently the performances of "guix search" is the same for both cases (authoritative and not authoritative); i.e., see previous timing. Last, two minor remarks about previous comments. 1. Ludo commented: > Therefore the cache '/lib/guix/package.cache' contains more > information. This breaks the binary interface, so we’ll have to analyze the impact of such a change and devise a strategy. <http://issues.guix.gnu.org/39258#93> and after some checking, this should be fine, IIUC. The '--news' is ok because of '#:allow-other-keys'. And other parts are modified accordingly. Guix revision N creates a cache that Guix revision N+1 will read but it should not be an issue; see 'inferior-available-packages'. 2. And Ludo wrote: I realize the other cache also has that problem, but it would be nice to add a version tag to the cache. Basically emit something like: (package-metadata-cache (version 0) VECTOR …) instead of just: (VECTOR …) <http://issues.guix.gnu.org/39258#93> which is, after discussions, not necessary. Versioning does not make sense here because the cache is read by the Guix which generates it. Therefore, specify a version is extraneous here. Comments are welcome for this work-in-progress. ;-) Cheers, simon zimoun (2): DRAFT packages: Add fields to packages cache. DRAFT scripts: package: Use cache in 'find-packages-by-description'. gnu/packages.scm | 52 +++++++++++++++++++++++++++------------- guix/scripts/package.scm | 46 +++++++++++++++++++++++++---------- 2 files changed, 70 insertions(+), 28 deletions(-) base-commit: 4196087f3d6fc254db5b4c47658e5679c835516f
Hi Ludo, On Fri, 23 Jul 2021 at 17:43, Ludovic Courtès <ludo@gnu.org> wrote: > > From my understanding, the issue that 'package-relevance' accepts a 'package' > > (and then all the chain until displaying) and 'fold-avaibale-packages' does > > not return a package. Well, I do not know; especially where to put something > > similiar to 'read-package-from'. > > Yeah that’s annoying. Perhaps we need <proto-package> or > <package-metadata>. With some trickery we could have record type > inheritance or something, maybe. Dunno. > > It would be good if we could arrange so that ‘fold-available-packages’ > doesn’t allocate anything though. It does not allocate, the allocation is done by 'find-packages-by-description'. Well, I think v4/v3 [0] is the direction to follow. Therefore, I would revisit this version and try to address two of Ludo's comments [1] and the other ones in v3. BTW, I have not investigated from where the slowness comes: - allocation - garbage collection - '(module-ref (resolve-interface module) symbol)' because I have been a bit disappointed by the performance of this v6. 0: <http://issues.guix.gnu.org/39258#89> 1: <http://issues.guix.gnu.org/39258#93> > > Let compare only for cold cache and time this cache building (Guix 7db8fd6): > > > > sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' > > time guix build --check $(guix gc --derivers $(readlink -f ~/.config/guix/current/lib/guix/package.cache)) > > > > real 0m28,848s > > user 0m1,481s > > sys 0m0,252s > > > > sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' > > time guix build --check $(guix gc --derivers $(readlink -f /tmp/new/lib/guix/package.cache)) > > > > real 0m40,279s > > user 0m1,582s > > sys 0m0,232s > > > > It seems longer but compared to the time of "guix pull" completion, it seems > > acceptable. > > Both the initial timing and the target are waaay too much. :-/ Yes, but that's another story. :-) We cannot fix all in the same time, IMHO. Here the point is to speedup "guix search" and to accept to pay a little more at "guix pull" time; then we could optimize the cache generation. Considering the overall time of "guix pull", this extra time appears to me acceptable---if "guix search" is faster. :-) > On my i7 laptop I have: > > --8<---------------cut here---------------start------------->8--- > $ time ./pre-inst-env guile -c '(use-modules (gnu packages)) (generate-package-cache "/tmp/t.cache")' > > real 0m20.738s > user 0m44.413s > sys 0m0.341s > --8<---------------cut here---------------end--------------->8--- > > It’s CPU-bound; we should probably start by optimizing that. > > In Guile 3.0.7 there was a change that improved this noticeably: > > https://git.savannah.gnu.org/cgit/guile.git/commit/?id=05614f792bfabbc33798863edd0bb67c744e9299 > > We should prolly look for similar optimization opportunities in the > assembler… Yes, but this kind of optimization will not provide a faster "guix search" but a faster "guix pull". --8<---------------cut here---------------start------------->8--- $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' $ time guile -c '(use-modules (gnu packages)) (generate-package-cache "/tmp/t1.cache")' real 0m15,728s user 0m23,940s sys 0m0,826s $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' $ time guile -c '(load-compiled "/tmp/t1.cache/lib/guix/package.cache")' real 0m1,026s user 0m0,258s sys 0m0,051s $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' $ time ./pre-inst-env guile -c '(use-modules (gnu packages)) (generate-package-cache "/tmp/t2.cache")' real 0m35,570s user 3m12,951s sys 0m3,807s $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' $ time guile -c '(load-compiled "/tmp/t2.cache/lib/guix/package.cache")' real 0m1,055s user 0m0,283s sys 0m0,055s --8<---------------cut here---------------end--------------->8--- (And if we speak about performance, raw loading the cache takes ~1s but "guix package -A >/dev/null" takes ~8s. It is a big gap for parsing the CLI and sorting. Worse, "guix --version >/dev/null" takes ~2s on cold cache. We should probably start by reduce this overhead.) > > Let compare for some queries: > > > > sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' > > time guix search game | recsel -C -P name | wc -l > > 371 > > > > real 0m7,561s > > user 0m3,525s > > sys 0m0,391s > > I think you should run: > > time guix search game > /dev/null > > otherwise Bash’s built-in ‘time’ shows the wall-clock time of the whole > pipeline, and the processing time of ‘recsel’ is probably not negligible > here. First, I am confused... If the formatter (recsel) is not negligible, then it should be dropped and replaced by an internal (fast) formatter. Well, I mean that as an end-user I am interested by the time of the whole pipeline. Second, on my machine the time is somehow negligible*. ;-) On cold cache, 10 runs using the pipe or using the redirection, keeping the max and the min for each: --8<---------------cut here---------------start------------->8--- real 0m9,598s user 0m3,961s sys 0m0,415s real 0m8,744s user 0m3,772s sys 0m0,431s --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- real 0m8,755s user 0m3,869s sys 0m0,540s real 0m8,390s user 0m3,767s sys 0m0,416s --8<---------------cut here---------------end--------------->8--- *negligible: better said, it does not give the bad impression. Even if it is roughly 5% of difference. Cheers, simon
From d1305351a90a84eb75e4769284d5e06927eade3e Mon Sep 17 00:00:00 2001 From: Arun Isaac <arunisaac@systemreboot.net> Date: Tue, 21 Jan 2020 20:45:43 +0530 Subject: [PATCH] fast search --- build-aux/build-self.scm | 5 + gnu/packages.scm | 207 +++++++++++++++++++++++---------------- 2 files changed, 128 insertions(+), 84 deletions(-) diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm index fc13032b73..c123ad3b11 100644 --- a/build-aux/build-self.scm +++ b/build-aux/build-self.scm @@ -264,6 +264,9 @@ interface (FFI) of Guile.") (define fake-git (scheme-file "git.scm" #~(define-module (git)))) + (define fake-sqlite3 + (scheme-file "sqlite3.scm" #~(define-module (sqlite3)))) + (with-imported-modules `(((guix config) => ,(make-config.scm)) @@ -278,6 +281,8 @@ interface (FFI) of Guile.") ;; (git) to placate it. ((git) => ,fake-git) + ((sqlite3) => ,fake-sqlite3) + ,@(source-module-closure `((guix store) (guix self) (guix derivations) diff --git a/gnu/packages.scm b/gnu/packages.scm index d22c992bb1..4e2c52e62d 100644 --- a/gnu/packages.scm +++ b/gnu/packages.scm @@ -43,6 +43,7 @@ #:use-module (srfi srfi-34) #:use-module (srfi srfi-35) #:use-module (srfi srfi-39) + #:use-module (sqlite3) #:export (search-patch search-patches search-auxiliary-file @@ -204,10 +205,8 @@ PROC is called along these lines: PROC can use #:allow-other-keys to ignore the bits it's not interested in. When a package cache is available, this procedure does not actually load any package module." - (define cache - (load-package-cache (current-profile))) - - (if (and cache (cache-is-authoritative?)) + (if (and (cache-is-authoritative?) + (current-profile)) (vhash-fold (lambda (name vector result) (match vector (#(name version module symbol outputs @@ -220,7 +219,7 @@ package module." #:supported? supported? #:deprecated? deprecated?)))) init - cache) + (cache-lookup (current-profile))) (fold-packages (lambda (package result) (proc (package-name package) (package-version package) @@ -252,31 +251,7 @@ is guaranteed to never traverse the same package twice." (define %package-cache-file ;; Location of the package cache. - "/lib/guix/package.cache") - -(define load-package-cache - (mlambda (profile) - "Attempt to load the package cache. On success return a vhash keyed by -package names. Return #f on failure." - (match profile - (#f #f) - (profile - (catch 'system-error - (lambda () - (define lst - (load-compiled (string-append profile %package-cache-file))) - (fold (lambda (item vhash) - (match item - (#(name version module symbol outputs - supported? deprecated? - file line column) - (vhash-cons name item vhash)))) - vlist-null - lst)) - (lambda args - (if (= ENOENT (system-error-errno args)) - #f - (apply throw args)))))))) + "/lib/guix/package-cache.sqlite") (define find-packages-by-name/direct ;bypass the cache (let ((packages (delay @@ -297,25 +272,57 @@ decreasing version order." matching) matching))))) -(define (cache-lookup cache name) +(define* (cache-lookup profile #:optional name) "Lookup package NAME in CACHE. Return a list sorted in increasing version order." (define (package-version<? v1 v2) (version>? (vector-ref v2 1) (vector-ref v1 1))) - (sort (vhash-fold* cons '() name cache) - package-version<?)) + (define (int->boolean n) + (case n + ((0) #f) + ((1) #t))) + + (define (string->list str) + (call-with-input-string str read)) + + (define select-statement + (string-append + "SELECT name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn from packages" + (if name " WHERE name = :name" ""))) + + (define cache-file + (string-append profile %package-cache-file)) + + (let* ((db (sqlite-open cache-file SQLITE_OPEN_READONLY)) + (statement (sqlite-prepare db select-statement))) + (when name + (sqlite-bind-arguments statement #:name name)) + (let ((result (sqlite-fold (lambda (v result) + (match v + (#(name version module symbol outputs supported superseded file line column) + (cons + (vector name + version + (string->list module) + (string->symbol symbol) + (string->list outputs) + (int->boolean supported) + (int->boolean superseded) + (list file line column)) + result)))) + '() statement))) + (sqlite-finalize statement) + (sqlite-close db) + (sort result package-version<?)))) (define* (find-packages-by-name name #:optional version) "Return the list of packages with the given NAME. If VERSION is not #f, then only return packages whose version is prefixed by VERSION, sorted in decreasing version order." - (define cache - (load-package-cache (current-profile))) - - (if (and (cache-is-authoritative?) cache) - (match (cache-lookup cache name) - (#f #f) + (if (and (cache-is-authoritative?) + (current-profile)) + (match (cache-lookup (current-profile) name) ((#(_ versions modules symbols _ _ _ _ _ _) ...) (fold (lambda (version* module symbol result) (if (or (not version) @@ -331,12 +338,9 @@ decreasing version order." (define* (find-package-locations name #:optional version) "Return a list of version/location pairs corresponding to each package matching NAME and VERSION." - (define cache - (load-package-cache (current-profile))) - - (if (and cache (cache-is-authoritative?)) - (match (cache-lookup cache name) - (#f '()) + (if (and (cache-is-authoritative?) + (current-profile)) + (match (cache-lookup (current-profile) name) ((#(name versions modules symbols outputs supported? deprecated? files lines columns) ...) @@ -372,6 +376,9 @@ VERSION." ;; Prevent Guile 3 from inlining this procedure so we can mock it in tests. (set! find-best-packages-by-name find-best-packages-by-name) +(define (list->string x) + (call-with-output-string (cut write x <>))) + (define (generate-package-cache directory) "Generate under DIRECTORY a cache of all the available packages. @@ -381,49 +388,81 @@ reducing the memory footprint." (define cache-file (string-append directory %package-cache-file)) - (define (expand-cache module symbol variable result+seen) + (define schema + "CREATE TABLE packages (name text, +version text, +module text, +symbol text, +outputs text, +supported int, +superseded int, +locationFile text, +locationLine int, +locationColumn int); +CREATE VIRTUAL TABLE packageSearch USING fts5(name, searchText);") + + (define insert-statement + "INSERT INTO packages(name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn) +VALUES(:name, :version, :module, :symbol, :outputs, :supported, :superseded, :locationfile, :locationline, :locationcolumn)") + + (define insert-package-search-statement + "INSERT INTO packageSearch(name, searchText) VALUES(:name, :searchtext)") + + (define (boolean->int x) + (if x 1 0)) + + (define (list->string x) + (call-with-output-string (cut write x <>))) + + (define (insert-package db module symbol variable seen) (match (false-if-exception (variable-ref variable)) ((? package? package) - (match result+seen - ((result . seen) - (if (or (vhash-assq package seen) - (hidden-package? package)) - (cons result seen) - (cons (cons `#(,(package-name package) - ,(package-version package) - ,(module-name module) - ,symbol - ,(package-outputs package) - ,(->bool (supported-package? package)) - ,(->bool (package-superseded package)) - ,@(let ((loc (package-location package))) - (if loc - `(,(location-file loc) - ,(location-line loc) - ,(location-column loc)) - '(#f #f #f)))) - result) - (vhash-consq package #t seen)))))) - (_ - result+seen))) - - (define exp - (first - (fold-module-public-variables* expand-cache - (cons '() vlist-null) - (all-modules (%package-module-path) - #:warn - warn-about-load-error)))) + (cond + ((or (vhash-assq package seen) + (hidden-package? package)) + seen) + (else + (let ((statement (sqlite-prepare db insert-statement))) + (sqlite-bind-arguments statement + #:name (package-name package) + #:version (package-version package) + #:module (list->string (module-name module)) + #:symbol (symbol->string symbol) + #:outputs (list->string (package-outputs package)) + #:supported (boolean->int (supported-package? package)) + #:superseded (boolean->int (package-superseded package)) + #:locationfile (cond + ((package-location package) => location-file) + (else #f)) + #:locationline (cond + ((package-location package) => location-line) + (else #f)) + #:locationcolumn (cond + ((package-location package) => location-column) + (else #f))) + (sqlite-fold cons '() statement) + (sqlite-finalize statement)) + (let ((statement (sqlite-prepare db insert-package-search-statement))) + (sqlite-bind-arguments statement + #:name (package-name package) + #:searchtext (package-description package)) + (sqlite-fold cons '() statement) + (sqlite-finalize statement)) + (vhash-consq package #t seen)))) + (_ seen))) (mkdir-p (dirname cache-file)) - (call-with-output-file cache-file - (lambda (port) - ;; Store the cache as a '.go' file. This makes loading fast and reduces - ;; heap usage since some of the static data is directly mmapped. - (put-bytevector port - (compile `'(,@exp) - #:to 'bytecode - #:opts '(#:to-file? #t))))) + (let ((db (sqlite-open cache-file))) + (sqlite-exec db schema) + (sqlite-exec db "BEGIN") + (fold-module-public-variables* (cut insert-package db <> <> <> <>) + vlist-null + (all-modules (%package-module-path) + #:warn + warn-about-load-error)) + (sqlite-exec db "COMMIT;") + (sqlite-close db)) + cache-file) -- 2.23.0