[bug#39258,0/4] Xapian for Guix package search

Message ID 20200227204150.30985-1-arunisaac@systemreboot.net

Message

Arun Isaac Feb. 27, 2020, 8:41 p.m. UTC
Hi,

I have finally got xapian working for package search. Some comments follow.

* Speed improvement

Despite search-package-index in gnu/packages.scm taking only around 1.5ms, I
see an overall speedup in `guix search` of only a factor of 2 -- from around
2s to around 1s. I wonder what else in `guix search` is taking up so much
time.

* Currently indexing only the package descriptions

In this patchset, I have only indexed the package descriptions. In the next
version of this patchset, I will index all other terms as specified in
%package-metrics of guix/ui.scm.

* Should I add guile-xapian as a propagated input to guix in
  gnu/packages/package-management.scm?

* Drop regexp search support

In this patchset, I have retained the older regexp search support. But, I
think we should drop it and only have xapian search. In cases where the search
index is not authoritative, we can build an in-memory xapian search index on
the fly and use it to search. This will slow down the search, but will ensure
our search results are consistent and do not depend on the authoritativeness
of the search index.

* Commit messages

Except for patch 1, I am not sure what prefixes (build-self, gnu, etc.) to use
in the first line of the commit message. Some advice there would be helpful.

Regards,
Arun.

Arun Isaac (4):
  gnu: Add guile-xapian.
  build-self: Add guile-xapian to Guix dependencies.
  gnu: Generate xapian package search index.
  gnu: Use xapian index for package search.

 build-aux/build-self.scm   | 11 ++++++++
 gnu/packages.scm           | 44 ++++++++++++++++++++++++++++-
 gnu/packages/guile-xyz.scm | 50 ++++++++++++++++++++++++++++++++-
 guix/channels.scm          | 34 ++++++++++++++++++++++-
 guix/scripts/package.scm   | 57 ++++++++++++++++++++++----------------
 guix/self.scm              |  7 ++++-
 6 files changed, 175 insertions(+), 28 deletions(-)

Comments

Simon Tournier Feb. 28, 2020, 12:36 p.m. UTC | #1
Hi Arun,

Really cool! Thank you!


On Thu, 27 Feb 2020 at 21:42, Arun Isaac <arunisaac@systemreboot.net> wrote:

> * Speed improvement
>
> Despite search-package-index in gnu/packages.scm taking only around 1.5ms, I
> see an overall speedup in `guix search` of only a factor of 2 -- from around
> 2s to around 1s. I wonder what else in `guix search` is taking up so much
> time.

Interesting... maybe a hidden 'fold-packages'?
Well, I have not yet looked into your code.


> * Currently indexing only the package descriptions
>
> In this patchset, I have only indexed the package descriptions. In the next
> version of this patchset, I will index all other terms as specified in
> %package-metrics of guix/ui.scm.

Yes, that looks to me like a detail that should be easy to fix. I mean, it
does not seem blocking.


> * Should I add guile-xapian as a propagated input to guix in
>   gnu/packages/package-management.scm?

IMHO, yes.
I mean, I guess. :-)


> * Drop regexp search support
>
> In this patchset, I have retained the older regexp search support. But, I
> think we should drop it and only have xapian search. In cases where the search
> index is not authoritative, we can build an in-memory xapian search index on
> the fly and use it to search. This will slow down the search, but will ensure
> our search results are consistent and do not depend on the authoritativeness
> of the search index.

I understand why you have turned off the regexp support. It is not
necessary at the first experimentation to see if it is worth the
addition or not.
So, before investigating how some better regexp could be used with
Xapian, let's start by benchmarking Xapian vs plain 'fold-packages'.


> * Commit messages
>
> Except for patch 1, I am not sure what prefixes (build-self, gnu, etc.) to use
> in the first line of the commit message. Some advice there would be helpful.

I cannot help. )-:


All the best,
simon
Simon Tournier Feb. 28, 2020, 12:39 p.m. UTC | #2
Hi Pierre,

On Fri, 28 Feb 2020 at 09:13, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> Besides this issue, how do you test it?  I guess we first need to install
> a bunch of packages with `pre-inst-env guix ...` then do a `pre-inst-env search`?

It is not searching in the installed packages but in all the packages.
So, to test it, you need to run "./pre-inst-env guix pull -p" or something
like that to populate the Xapian index database. Then "./pre-inst-env
guix search" will look things up in it.
I mean, it is how I understand it should work. I have not yet looked
into the code.


Cheers,
simon
Pierre Neidhardt Feb. 28, 2020, 12:49 p.m. UTC | #3
zimoun <zimon.toutoune@gmail.com> writes:

> Hi Pierre,
>
> On Fri, 28 Feb 2020 at 09:13, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>
>> Besides this issue, how do you test it?  I guess we first need to install
>> a bunch of packages with `pre-inst-env guix ...` then do a `pre-inst-env search`?
>
> It is not searching in the installed packages but in all the packages.
> So, to test it, you need to run "./pre-inst-env guix pull -p" or something
> like that to populate the Xapian index database. Then "./pre-inst-env
> guix search" will look things up in it.
> I mean, it is how I understand it should work. I have not yet looked
> into the code.

What I meant by "install a bunch of packages" is "guix pull -p", as
you said.  The Xapian cache is populated by a hook of guix pull, if I
understood correctly.
Arun Isaac Feb. 28, 2020, 3:36 p.m. UTC | #4
> I can't build your patch though:
>
> ice-9/eval.scm:293:34: no code for module (xapian xapian)

Sorry, I forgot to mention this in my patch cover letter. The above
error is happening because of the new guile-xapian dependency. It's a
little tricky to get right at the moment. Here goes.

Drop into a guix development environment.

$ guix environment guix

Commit patch 1 (the patch that adds guile-xapian) alone, and build.

$ git am 0001-gnu-Add-guile-xapian.patch
$ make

Then, drop into an environment where guile-xapian is available.

$ ./pre-inst-env guix environment guix --ad-hoc guile-xapian

Apply the other 3 patches and build.

$ git am 0002-build-self-Add-guile-xapian-to-Guix-dependencies.patch 0003-gnu-Generate-xapian-package-search-index.patch 0004-gnu-Use-xapian-index-for-package-search.patch
$ make

Now, the build should have completed successfully. Let's do a test guix
pull to actually test the new guix search.

$ ./pre-inst-env guix pull -p /tmp/test

Then, run the guix search in /tmp/test.

$ /tmp/test/bin/guix search game

That's it! :-)

This whole process will be simpler if the guile-xapian package is pushed
to master and guile-xapian added as an input to the guix package in
gnu/packages/package-management.scm. But, for now...
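
A quick way to confirm that the second environment really exposes the
module, before applying patches 2 to 4, could look like the sketch below.
The function name have_guile_module is hypothetical; the only assumption
is that a "guile" executable is on PATH and that it exits with a non-zero
status when a module cannot be loaded, as in the "no code for module"
error quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if Guile can load the given module.
# $1 is a module name in Scheme syntax, for example "(xapian xapian)".
have_guile_module() {
    # use-modules fails with "no code for module ..." when the module
    # is not on Guile's load path; that makes guile exit non-zero.
    guile -c "(use-modules $1)" >/dev/null 2>&1
}

# Example use inside the development environment:
#   have_guile_module "(xapian xapian)" || echo "guile-xapian is not visible"
```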
Arun Isaac Feb. 28, 2020, 4:04 p.m. UTC | #5
> $ ./pre-inst-env guix pull -p /tmp/test

One mistake. This command should be

./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test

where xapian is the name of the branch you committed the patches to.

Also, I acknowledge the corrections you both suggested. I will
incorporate them in v2 of the patchset.
Arun Isaac Feb. 29, 2020, 8:25 a.m. UTC | #6
> This whole process will be simpler if the guile-xapian package is pushed
> to master and guile-xapian added as an input to the guix package in
> gnu/packages/package-management.scm. But, for now...

Shall I push patch 1 (add guile-xapian) alone to master?
Simon Tournier March 2, 2020, 6:27 p.m. UTC | #7
Hi Arun,

On Sat, 29 Feb 2020 at 09:25, Arun Isaac <arunisaac@systemreboot.net> wrote:

> Shall I push patch 1 (add guile-xapian) alone to master?

Yes, it seems a good idea and it will ease the process for building
and then benchmarking the "guix search" via Xapian.


All the best,
simon
Simon Tournier March 2, 2020, 6:37 p.m. UTC | #8
Hi Arun,

Do you have some benchmark in mind?


On Fri, 28 Feb 2020 at 17:05, Arun Isaac <arunisaac@systemreboot.net> wrote:

> ./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test

We need to benchmark the new "guix pull" on different machines. Well,
it is nothing compared to the derivation computations. :-)
And more importantly, run 'make as-derivation' to avoid a "guix pull" breakage.

Then, on cold caches, benchmark the new "guix search" for a couple of queries.

There is not much inspiration in tests/. :-)
Ah, do not forget to adapt some tests.


All the best,
simon
Simon Tournier March 2, 2020, 7:13 p.m. UTC | #9
Hi,

After a quick benchmark:

 a. It is faster. Between x2 and x3. Really?
 b. The xapian relevance should be truncated and examined in more detail.

--8<---------------cut here---------------start------------->8---
time guix search emacs | recsel -p name,relevance | head -n18
name: emacs
relevance: 33

name: emacs-with-editor
relevance: 19

name: emacs-restart-emacs
relevance: 19

name: emacs-epkg
relevance: 18

name: guile-emacs
relevance: 17

name: emacs-xwidgets
relevance: 17


real    0m1.530s
user    0m1.827s
sys     0m0.074s
--8<---------------cut here---------------end--------------->8---


--8<---------------cut here---------------start------------->8---
time /tmp/test/bin/guix search emacs | recsel -p name,relevance | head -n18
name: emacs-helm-pass
relevance: 5.0774748262821685

name: emacs-spark
relevance: 4.898640632723127

name: emacs-evil-smartparens
relevance: 4.898640632723127

name: emacs-howm
relevance: 4.8638448958830685

name: emacs-el-mock
relevance: 4.8638448958830685

name: emacs-strace-mode
relevance: 4.693676055650271


real    0m0.440s
user    0m0.482s
sys     0m0.058s
--8<---------------cut here---------------end--------------->8---


Here for example, Xapian does not return the package 'emacs' itself as
the first result. And worse, it is not returned at all.
That said, I do not know if it is really faster since:

--8<---------------cut here---------------start------------->8---
guix search emacs | recsel -C -P name | wc -l
829
--8<---------------cut here---------------end--------------->8---

and

--8<---------------cut here---------------start------------->8---
/tmp/test/bin/guix search emacs | recsel -C -P name | wc -l
10
--8<---------------cut here---------------end--------------->8---

Maybe I am making a mistake.


Well, thank you Arun for the Xapian bindings which will improve the
searching experience. :-)
And now it needs some polishing.


All the best,
simon
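
For display purposes, the long relevance values shown above could be
shortened with a filter along these lines. This is only a sketch of
post-processing the recutils output, not a change to how "guix search"
computes or prints relevance; the function name round_relevance is
hypothetical, and it rounds to two decimals rather than strictly
truncating.

```shell
#!/bin/sh
# Hypothetical filter: rewrite "relevance:" fields with two decimals
# and pass every other line through unchanged.
round_relevance() {
    # LC_ALL=C keeps "." as the decimal separator whatever the locale.
    LC_ALL=C awk '$1 == "relevance:" { printf "relevance: %.2f\n", $2; next }
                  { print }'
}

# Example use:
#   /tmp/test/bin/guix search emacs | recsel -p name,relevance | round_relevance
```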
Simon Tournier March 3, 2020, 8:04 p.m. UTC | #10
Hi,

On Mon, 2 Mar 2020 at 20:13, zimoun <zimon.toutoune@gmail.com> wrote:

> --8<---------------cut here---------------start------------->8---
> /tmp/test/bin/guix search emacs | recsel -C -P name | wc -l
> 10
> --8<---------------cut here---------------end--------------->8---
>
> Maybe I am making a mistake.

I think this issue is fixed when changing the 'pagesize' value.

Well, with '(pagesize 4294967295)' and using the same commit
(c1febbbf94), I get:

--8<---------------cut here---------------start------------->8---
guix time-machine --commit=c1febbbf94 -- guix search games \
  | recsel -C -p name | wc -l
247

./pre-inst-env guix search games | recsel -C -p name | wc -l
236
--8<---------------cut here---------------end--------------->8---

(I modified the patches in order to pull once to generate the index at
commit c1febbbf94 and then do some stuff.)


Note that the old "guix search" does not output blender whereas Xapian
does, even though the term 'games' is not in its description but 'game' is.
I am comparing the two lists, i.e., the output of "guix search games |
recsel -C -P name | sort" for each version, to see which names are in one
list and not in the other.

But before going further, let's polish the patches a bit so that they are
easier to test without the double environment etc.
Also, I am using a good old HDD, so some SSD comparison would be welcome.


All the best,
simon
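
The list comparison described above can be sketched with comm. The
function name only_in_one is hypothetical, and it assumes each list was
first saved to a file with one package name per line, for instance with
"guix search games | recsel -C -P name > old.txt".

```shell
#!/bin/sh
# Hypothetical helper: print the names present in only one of two lists.
# $1 and $2 are files containing one package name per line.
only_in_one() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # comm needs both inputs sorted with the same collation.
    LC_ALL=C sort -u "$1" > "$old_sorted"
    LC_ALL=C sort -u "$2" > "$new_sorted"
    # -3 suppresses lines common to both files. Names only in $1 are
    # printed flush left, names only in $2 are indented by one tab.
    LC_ALL=C comm -3 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Example use:
#   only_in_one old.txt new.txt
```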
Ludovic Courtès March 5, 2020, 4:46 p.m. UTC | #11
Hello Arun,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * Speed improvement
>
> Despite search-package-index in gnu/packages.scm taking only around 1.5ms, I
> see an overall speedup in `guix search` of only a factor of 2 -- from around
> 2s to around 1s. I wonder what else in `guix search` is taking up so much
> time.

Note that ‘guix search’ time is largely dominated by I/O.  On my laptop,
I get (first measurement is cold cache, second one is warm cache):

--8<---------------cut here---------------start------------->8---
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time guix search foo >/dev/null

real    0m2.631s
user    0m1.134s
sys     0m0.124s
$ time guix search foo >/dev/null

real    0m0.836s
user    0m1.027s
sys     0m0.053s
--8<---------------cut here---------------end--------------->8---

It’s hard to do better on the warm cache case because at this level,
there may be other things to optimize having little to do with searching
itself.

Note that this is on an SSD; the cold-cache case must be worse on NFS or
on a spinning disk, and there we could gain a lot.

I think we should weigh the pros and cons on all these aspects: speed,
complexity and maintenance cost, search result quality, search features,
etc.

Thanks,
Ludo’.

PS: I have not yet looked at the whole series as I’m just coming back to
    the keyboard.  :-)