mbox series

[bug#39258,v2,0/3] Xapian for Guix package search

Message ID 20200307133116.11443-1-arunisaac@systemreboot.net
Headers show
Series Xapian for Guix package search | expand

Message

Arun Isaac March 7, 2020, 1:31 p.m. UTC
Hi,

Here is the second iteration of my Xapian Guix package search patchset. I have
found the reason the earlier patchset did not show significant speedup. It
turns out that most of the time is spent in printing and texinfo rendering of
the search results. So, in this patchset, I pre-render the search results
while building the Xapian index and stuff them into the Xapian database
itself. Therefore, during `guix search`, I just pull out the pre-rendered
search results and print it on the screen. This is much faster. See comparison
below.

--8<---------------cut here---------------start------------->8---
With a warm cache,
$ time guix search inkscape

real	0m1.787s
user	0m1.745s
sys	0m0.111s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time /tmp/test/bin/guix search inkscape

real	0m0.199s
user	0m0.182s
sys	0m0.024s
--8<---------------cut here---------------end--------------->8---

If most of the speedup comes from pre-rendering the results, it might seem
that the Xapian search is not so useful. We might as well have stuffed the
pre-rendered search results into the existing package cache generated by
generate-package-cache, or so it might seem. But, there are the following
arguments in favor of Xapian.

- The package cache would grow in size, and lookup would be slowed down
  because we need to load the entire cache into memory. Xapian, on the other
  hand, need only look up the specific packages that match the search query.
- Xapian can provide superior search results due to it stemming and language
  models.
- Xapian can provide spelling correction and query expansion -- that is,
  suggest search terms to improve search results. Note that I haven't
  implemented this yet and is out of scope in this patchset.

* Simplify our package search results

Why not use a simpler package search results format like Arch Linux or Debian
does? We could just display the package name, version and synopsis like so.

inkscape 0.92.4
    Vector graphics editor
inklingreader 0.8
    Wacom Inkling sketch format conversion and manipulation

Why do we need the entire recutils format? If the user is interested, they can
always use `guix package --show` to get the full recutils formatted
info. Having shorter search results will make everything even faster and much
more readable. WDYT?

* How to test this patchset

To get guile-xapian, run a `guix pull`, if you haven't already. Then in your
Guix source directory, drop into an environment with guix dependencies and
guile-xapian.

$ guix environment guix --ad-hoc guile-xapian

Apply patches and build.

$ git am v2-0000-cover-letter.patch v2-0002-gnu-Generate-Xapian-package-search-index.patch v2-0001-build-self-Add-guile-xapian-to-Guix-dependencies.patch v2-0003-gnu-Use-Xapian-index-for-package-search.patch
$ make

Run a test guix pull.

$ ./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test

where xapian is the name of the branch you committed the patches to.

Then, run the guix search in /tmp/test.

$ /tmp/test/bin/guix search game

* Comments

Pierre Neidhardt <mail@ambrevar.xyz> writes:

>> +(define (search-package-index profile querystring)
>
> Maybe `query-string'?

Done in this patchset.

>> +  (define (regexp? str)
>> +    (string-any
>> +     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
>> +     str))
>> +
>> +  (if (and (current-profile)
>> +           (not (any regexp? patterns)))
>
> I would not put characters like ".", "$", or "+" here, lest we mistake a
> Xapian pattern for a regexp.
>
> As you said, I don't think both are compatible without ambiguity
> anyways, so we should probably drop regexp (or at least toggle them with
> a command line argument).

I agree.

zimoun <zimon.toutoune@gmail.com> writes:

> In the commit message, I would capitalize Xapian.

Done in this patchset.

>> +(define (generate-package-search-index directory)
>> +  "Generate under DIRECTORY a xapian index of all the available packages."
>
> Xapian with capital.

Done in this patchset.

> Is (make-stem "en") for the locale?

I still have English hard-coded. I haven't yet figured out how to detect the
locale and stem accordingly. But, there is a larger problem. Since we cannot
anticipate what locale the user will run guix search with, should we build the
Xapian index for all locales? That is, should we index not only the English
versions of the packages but also all other translations as well?

> package-search-index and package-cache-file could be refactored
> because they share all the same code.

Yes, they could be. However, I'll postpone to the next iteration of the
patchset.

> I do not know what is the convention for the bindings.
> But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> something in this flavour.

Well, everywhere else in guile we have such things as vhash-fold, string-fold,
hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are
folding over a single mset (match-set). So, mset should be in the singular.

> And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> Ah do not forget to adapt some tests.

Will do this once we have consensus about the other features of this patchset.

>  b. The xapian relevance should truncated

Done in this patchset.

> Xapian does not return the package 'emacs' itself as the first. And worse,
> it is not returned at all.

In this patchset, since we're indexing the package name as well, emacs is
returned but it is still far from the beginning.

> I propose the value of 4294967295 for pagesize.

In this patchset, I pass (database-document-count db) as the #:maximum-items
keyword argument to enquire-mset. This is the upstream recommended way to get
all search results. I hadn't done this earlier since I hadn't yet wrapped
database-document-count in guile-xapian.

>> In this patchset, I have only indexed the package descriptions. In the next
>> version of this patchset, I will index all other terms as specified in
>> %package-metrics of guix/ui.scm.
>
> Yes, it appears to me a detail that should be easy to fix. I mean, it
> does not seems blocking.

Done in this patchset.

Ludovic Courtès <ludo@gnu.org> writes:

> Note that ‘guix search’ time is largely dominated by I/O.

Yes, `guix search` is I/O intensive. That is why I expect Xapian to do better
since it only needs to access matching packages not all packages. Also, the
Xapian index is fast at all times. It is not very dependent on a warm
filesystem cache.

> On my laptop,
> I get (first measurement is cold cache, second one is warm cache):
>
> --8<---------------cut here---------------start------------->8---
> $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> $ time guix search foo >/dev/null
>
> real    0m2.631s
> user    0m1.134s
> sys     0m0.124s
> $ time guix search foo >/dev/null
>
> real    0m0.836s
> user    0m1.027s
> sys     0m0.053s
> --8<---------------cut here---------------end--------------->8---
>
> It’s hard to do better on the warm cache case because at this level,
> there may be other things to optimize having little to do with searching
> itself.
>
> Note that this is on an SSD; the cold-cache case must be worse on NFS or
> on a spinning disk, and there we could gain a lot.

My laptop is quite old with a particularly slow HDD. Hence my motivation to
improve guix search performance!

> I think we should weigh the pros and cons on all these aspects: speed,
> complexity and maintenance cost, search result quality, search features,
> etc.

I agree.

> PS: I have not yet looked at the whole series as I’m just coming back to
>     the keyboard.  :-)

Welcome back! :-)

Arun Isaac (3):
  build-self: Add guile-xapian to Guix dependencies.
  gnu: Generate Xapian package search index.
  gnu: Use Xapian index for package search.

 build-aux/build-self.scm | 11 +++++++
 gnu/packages.scm         | 62 +++++++++++++++++++++++++++++++++++++++-
 guix/channels.scm        | 34 +++++++++++++++++++++-
 guix/scripts/package.scm |  7 +++--
 guix/self.scm            |  7 ++++-
 guix/ui.scm              | 37 ++++++++++++++++++++++++
 6 files changed, 153 insertions(+), 5 deletions(-)

Comments

Ludovic Courtès March 7, 2020, 8:33 p.m. UTC | #1
Hello,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> Here is the second iteration of my Xapian Guix package search patchset. I have
> found the reason the earlier patchset did not show significant speedup. It
> turns out that most of the time is spent in printing and texinfo rendering of
> the search results. So, in this patchset, I pre-render the search results
> while building the Xapian index and stuff them into the Xapian database
> itself. Therefore, during `guix search`, I just pull out the pre-rendered
> search results and print it on the screen. This is much faster. See comparison
> below.
>
> With a warm cache,
> $ time guix search inkscape
>
> real	0m1.787s
> user	0m1.745s
> sys	0m0.111s
>
> $ time /tmp/test/bin/guix search inkscape
>
> real	0m0.199s
> user	0m0.182s
> sys	0m0.024s

Nice!

In general, pre-rendering doesn’t seem practical to me: the output of
‘guix search’ is locale-dependent (it speaks the user’s language) and
adjusts to the terminal width (well, this is temporarily broken on
Guile 3.0.0, but see ‘%text-width’ in (guix ui)).

Also, if the 12K+ descriptions need to be rendered at the time the user
runs ‘guix pull’, the experience may not be great, because it could take
a bit of time.

WDYT?

> Why not use a simpler package search results format like Arch Linux or Debian
> does? We could just display the package name, version and synopsis like so.
>
> inkscape 0.92.4
>     Vector graphics editor
> inklingreader 0.8
>     Wacom Inkling sketch format conversion and manipulation
>
> Why do we need the entire recutils format? If the user is interested, they can
> always use `guix package --show` to get the full recutils formatted
> info. Having shorter search results will make everything even faster and much
> more readable. WDYT?

What I like about the recutils format in this context is that it’s both
human- and machine-readable.  The examples in the manual show how it can
be useful to select the information displayed or to refine the search
(info "(guix) Invoking guix package").

Also: I’d recommend tackling one thing at a time.  :-)

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Note that ‘guix search’ time is largely dominated by I/O.
>
> Yes, `guix search` is I/O intensive. That is why I expect Xapian to do better
> since it only needs to access matching packages not all packages. Also, the
> Xapian index is fast at all times. It is not very dependent on a warm
> filesystem cache.

Yes, indeed.

>> On my laptop,
>> I get (first measurement is cold cache, second one is warm cache):
>>
>> --8<---------------cut here---------------start------------->8---
>> $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
>> $ time guix search foo >/dev/null
>>
>> real    0m2.631s
>> user    0m1.134s
>> sys     0m0.124s
>> $ time guix search foo >/dev/null
>>
>> real    0m0.836s
>> user    0m1.027s
>> sys     0m0.053s
>> --8<---------------cut here---------------end--------------->8---
>>
>> It’s hard to do better on the warm cache case because at this level,
>> there may be other things to optimize having little to do with searching
>> itself.
>>
>> Note that this is on an SSD; the cold-cache case must be worse on NFS or
>> on a spinning disk, and there we could gain a lot.
>
> My laptop is quite old with a particularly slow HDD. Hence my motivation to
> improve guix search performance!

Were you able to measure the cost of rendering specifically?

Here’s what I see when I turn ‘package->recutils’ into a no-op:

--8<---------------cut here---------------start------------->8---
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time ./pre-inst-env guix search foo 

real	0m1.617s
user	0m0.812s
sys	0m0.094s
$ time ./pre-inst-env guix search foo 

real	0m0.595s
user	0m0.747s
sys	0m0.043s
--8<---------------cut here---------------end--------------->8---

To compare with:

--8<---------------cut here---------------start------------->8---
$ time ./pre-inst-env guix search foo >/dev/null

real	0m0.829s
user	0m1.026s
sys	0m0.046s
--8<---------------cut here---------------end--------------->8---

I think we should look at a profile of ‘package->recutils’, there’s
probably room for improvement there.

Thoughts?

Ludo’.
Arun Isaac March 8, 2020, 9:01 a.m. UTC | #2
>> It turns out that most of the time is spent in printing and texinfo
>> rendering of the search results.

Also, when we put all package metadata into the Xapian index, we don't
have to look up any of the package variables in (gnu packages *) during
`guix search` time. This also contributes substantially to the speedup.

> In general, pre-rendering doesn’t seem practical to me: the output of
> ‘guix search’ is locale-dependent (it speaks the user’s language) and

Note that we already need to index package synopses and descriptions in
all languages. I still haven't implemented this, though.

> adjusts to the terminal width (well, this is temporarily broken on
> Guile 3.0.0, but see ‘%text-width’ in (guix ui)).

This could be accomplished even with pre-rendering. Xapian provides
"slots" to store arbitrary strings with a document. Instead of storing
the pre-rendered document as a whole, we could store pre-rendered fields
in separate slots. Then, during `guix search` time, we can assemble the
result from these pre-rendered fields.

> Also, if the 12K+ descriptions need to be rendered at the time the user
> runs ‘guix pull’, the experience may not be great, because it could take
> a bit of time.

This is a problem, but I would see it as a necessary "compilation"
step. :-P In fact, this whole patchset speeds up `guix search` by doing
part of the work of `guix search` ahead of time. So, some such cost is
unavoidable.

> What I like about the recutils format in this context is that it’s both
> human- and machine-readable.  The examples in the manual show how it can
> be useful to select the information displayed or to refine the search
> (info "(guix) Invoking guix package").

Xapian's query language is much more natural (as in natural language)
than the regexp based techniques we need to use with recutils. I have
hardly ever used the regexp based search and I suspect many others
haven't either. Also, refining the search query should be easier to do
with Xapian. We could even use Xapian's query expansion feature to
suggest improved queries to the user.

That said, if we want the recutils format, we can still keep it in a
simplified form like so.

name: inkscape
version: 0.92.4
synopsis: Vector graphics editor

name: inklingreader
version: 0.8
synopsis: Wacom Inkling skecth format conversion and manipulation

> Also: I’d recommend tackling one thing at a time.  :-)

I totally agree, but I'm tempted to say that pre-rendering would be a
lot cheaper with the simplified form of search results. :-)

> Were you able to measure the cost of rendering specifically?

generate-package-search-index takes around 50 seconds. If I modify
generate-package-search-index to not pre-render but simply store the
package description alone, it takes around 20 seconds. That gives us a
rough idea of the cost of pre-rendering.

> I think we should look at a profile of ‘package->recutils’, there’s
> probably room for improvement there.

On quick inspection, most of the time in package->recutils is spent in
texinfo rendering the description. Unless we use the simplified search
results format as discussed above, we cannot avoid it.
Ludovic Courtès March 8, 2020, 11:33 a.m. UTC | #3
Hi,

Arun Isaac <arunisaac@systemreboot.net> skribis:

>>> It turns out that most of the time is spent in printing and texinfo
>>> rendering of the search results.
>
> Also, when we put all package metadata into the Xapian index, we don't
> have to look up any of the package variables in (gnu packages *) during
> `guix search` time. This also contributes substantially to the speedup.

Yup.

>> In general, pre-rendering doesn’t seem practical to me: the output of
>> ‘guix search’ is locale-dependent (it speaks the user’s language) and
>
> Note that we already need to index package synopses and descriptions in
> all languages. I still haven't implemented this, though.

Oh, right.  Tricky!

>> adjusts to the terminal width (well, this is temporarily broken on
>> Guile 3.0.0, but see ‘%text-width’ in (guix ui)).
>
> This could be accomplished even with pre-rendering. Xapian provides
> "slots" to store arbitrary strings with a document. Instead of storing
> the pre-rendered document as a whole, we could store pre-rendered fields
> in separate slots. Then, during `guix search` time, we can assemble the
> result from these pre-rendered fields.

I’m not sure I understand.  The index wouldn’t store pre-rendered
strings for every possible terminal width, right?

>> Also, if the 12K+ descriptions need to be rendered at the time the user
>> runs ‘guix pull’, the experience may not be great, because it could take
>> a bit of time.
>
> This is a problem, but I would see it as a necessary "compilation"
> step. :-P In fact, this whole patchset speeds up `guix search` by doing
> part of the work of `guix search` ahead of time. So, some such cost is
> unavoidable.

Yeah.  I think we need to take the whole user experience into account,
not just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
fairly common operation.  Conversely, ‘guix search’ takes roughly
between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
(in the sense that when you’re searching for software, you’ll probably
have to spend more than a couple of seconds to find what you’re looking
for.)

>> What I like about the recutils format in this context is that it’s both
>> human- and machine-readable.  The examples in the manual show how it can
>> be useful to select the information displayed or to refine the search
>> (info "(guix) Invoking guix package").
>
> Xapian's query language is much more natural (as in natural language)
> than the regexp based techniques we need to use with recutils. I have
> hardly ever used the regexp based search and I suspect many others
> haven't either. Also, refining the search query should be easier to do
> with Xapian. We could even use Xapian's query expansion feature to
> suggest improved queries to the user.

I’m not sufficiently familiar with Xapian’s query language.  The
examples I had in mind were:

  guix search malloc | recsel -p name,version,relevance
  guix search | recsel -p name -e 'license ~ "LGPL 3"'
  guix search crypto library | \
    recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis

It’s not so much about regexps than it is about selecting individual
fields.

>> Were you able to measure the cost of rendering specifically?
>
> generate-package-search-index takes around 50 seconds. If I modify
> generate-package-search-index to not pre-render but simply store the
> package description alone, it takes around 20 seconds. That gives us a
> rough idea of the cost of pre-rendering.

To me, adding 20–50 seconds on ‘guix pull’ would be undesirable.  :-/

>> I think we should look at a profile of ‘package->recutils’, there’s
>> probably room for improvement there.
>
> On quick inspection, most of the time in package->recutils is spent in
> texinfo rendering the description. Unless we use the simplified search
> results format as discussed above, we cannot avoid it.

What I meant was that we could use (statprof) to see whether/how Texinfo
rendering/parsing can be optimized.

Thanks,
Ludo’.
Simon Tournier March 8, 2020, 8:27 p.m. UTC | #4
Hi Arun,

Thank you for that.
I will be probably far from keyboard for a couple of days for medical
reasons* and I will not be able to look into this second set patches
soon.

Cheers,
simon

*nothing to worry, even if I am currently typing that in an hospital,
just the bad consequence of sports. ;-)
Arun Isaac March 8, 2020, 8:27 p.m. UTC | #5
>> This could be accomplished even with pre-rendering. Xapian provides
>> "slots" to store arbitrary strings with a document. Instead of storing
>> the pre-rendered document as a whole, we could store pre-rendered fields
>> in separate slots. Then, during `guix search` time, we can assemble the
>> result from these pre-rendered fields.
>
> I’m not sure I understand.  The index wouldn’t store pre-rendered
> strings for every possible terminal width, right?

No, it wouldn't. It would store a partially pre-rendered string, that is
without fill-paragraph. We run fill-paragraph at `guix search` time to
complete the rendering.

> I think we need to take the whole user experience into account, not
> just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
> fairly common operation.  Conversely, ‘guix search’ takes roughly
> between 0.5 and 2 seconds and is an uncommon operation on a “slow
> path” (in the sense that when you’re searching for software, you’ll
> probably have to spend more than a couple of seconds to find what
> you’re looking for.)

I agree we can't compromise too much on `guix pull` performance.

> To me, adding 20–50 seconds on ‘guix pull’ would be undesirable.  :-/

Maybe I'm missing something here. guix pull takes around 40 minutes on
my machine. In comparison to that, is another 20-50 seconds (roughly 1
minute) a big deal? How much time would it be acceptable to spend on
building the Xapian index?

Also, is it possible to somehow provide substitutes for the Xapian index
so that the user does not have to actually build it locally during `guix
pull` time?

> I’m not sufficiently familiar with Xapian’s query language.  The
> examples I had in mind were:
> It’s not so much about regexps than it is about selecting individual
> fields.

I have totally not tested this, but I imagine that equivalent Xapian
queries might look something like:

>   guix search | recsel -p name -e 'license ~ "LGPL 3"'

guix search license:LGPL3

>   guix search crypto library | \
>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis

guix search crypto library AND (NOT ghc) AND (NOT perl) AND (NOT python)
AND (NOT ruby)

> What I meant was that we could use (statprof) to see whether/how Texinfo
> rendering/parsing can be optimized.

Oh, ok. I'll try this if we decide not to pre-render.
Arun Isaac March 8, 2020, 8:40 p.m. UTC | #6
zimoun <zimon.toutoune@gmail.com> writes:

> I will be probably far from keyboard for a couple of days for medical
> reasons* and I will not be able to look into this second set patches
> soon.

No problem, take care! :-)

> *nothing to worry, even if I am currently typing that in an hospital,
> just the bad consequence of sports. ;-)
Pierre Neidhardt March 9, 2020, 7:42 a.m. UTC | #7
Arun Isaac <arunisaac@systemreboot.net> writes:

>> I’m not sufficiently familiar with Xapian’s query language.  The
>> examples I had in mind were:
>> It’s not so much about regexps than it is about selecting individual
>> fields.
>
> I have totally not tested this, but I imagine that equivalent Xapian
> queries might look something like:
>
>>   guix search | recsel -p name -e 'license ~ "LGPL 3"'
>
> guix search license:LGPL3
>
>>   guix search crypto library | \
>>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis
>
> guix search crypto library AND (NOT ghc) AND (NOT perl) AND (NOT python)
> AND (NOT ruby)

Indeed, if you look at the notmuch-search-terms man page, you'll see
that you can select fields.
In my opinion, the recsel format is fully superseded by Xapian.
Pierre Neidhardt March 9, 2020, 7:50 a.m. UTC | #8
Ludovic Courtès <ludo@gnu.org> writes:

> Yeah.  I think we need to take the whole user experience into account,
> not just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
> fairly common operation.  Conversely, ‘guix search’ takes roughly
> between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
> (in the sense that when you’re searching for software, you’ll probably
> have to spend more than a couple of seconds to find what you’re looking
> for.)

I think I disagree with "guix search" being an uncommon operation and a
slow path.

- The slowness of `guix search' (and the awkwardness of recutils) is
  maybe what makes it uncommon: users refrain from using it because it's
  too impractical.

- Searches are typically refined, i.e. you run a search multiple times
  by precising the terms, so in that sense I believe `guix search` is a
  very common operation.  Or should be.

Anyways, one of the key issues here is the inherent limitation of the
shell interface that does not allow us to directly and contextually
process the output of a command (at least not without rerunning it).

This issue can only be tackled with a GUI: there the user would be able
to interactively act with the result of the search, without having to
re-run the search.

Concretely, the GUI search would only return the package name, version
and synopses.  No need for the Texinfo / recutils juggling.

Then the user would select the packages of interest to display more
details.  This allows us to query the full details just-in-time.



Back to the topic: I believe that Xapian is a huge win both for the
shell and the future GUI :)
Ludovic Courtès March 9, 2020, 10:28 a.m. UTC | #9
Hello,

Pierre Neidhardt <mail@ambrevar.xyz> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Yeah.  I think we need to take the whole user experience into account,
>> not just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
>> fairly common operation.  Conversely, ‘guix search’ takes roughly
>> between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
>> (in the sense that when you’re searching for software, you’ll probably
>> have to spend more than a couple of seconds to find what you’re looking
>> for.)
>
> I think I disagree with "guix search" being an uncommon operation and a
> slow path.

(Not “and” but “on” a slow path.)

> - The slowness of `guix search' (and the awkwardness of recutils) is
>   maybe what makes it uncommon: users refrain from using it because it's
>   too impractical.

I think “slowness” and “awkwardness” are overstatements.  I’m not saying
this is perfect, but to me it’s not bad.  (Of course I’m biased :-), but
I’ve used other similar tools and this one looks rather good compared to
what I’ve used.)

> - Searches are typically refined, i.e. you run a search multiple times
>   by precising the terms, so in that sense I believe `guix search` is a
>   very common operation.  Or should be.
>
> Anyways, one of the key issues here is the inherent limitation of the
> shell interface that does not allow us to directly and contextually
> process the output of a command (at least not without rerunning it).

I agree, but ‘guix search’ is a shell command, so we have to adapt to
that context.

> Concretely, the GUI search would only return the package name, version
> and synopses.  No need for the Texinfo / recutils juggling.
>
> Then the user would select the packages of interest to display more
> details.  This allows us to query the full details just-in-time.

Note that Emacs-Guix does that, although it doesn’t use the search
facility of (guix ui) with relevance metrics.

> Back to the topic: I believe that Xapian is a huge win both for the
> shell and the future GUI :)

It could be, but we need to consider all the aspects of the story,
including the maintenance cost and overhead moved to ‘guix pull’.  So
it’s not so much about “beliefs” at this point, but rather about
demonstrating what can be done, and I’m glad Arun is exploring that
space!

Ludo’.
Ludovic Courtès March 9, 2020, 10:35 a.m. UTC | #10
Hello!

Arun Isaac <arunisaac@systemreboot.net> skribis:

>>> This could be accomplished even with pre-rendering. Xapian provides
>>> "slots" to store arbitrary strings with a document. Instead of storing
>>> the pre-rendered document as a whole, we could store pre-rendered fields
>>> in separate slots. Then, during `guix search` time, we can assemble the
>>> result from these pre-rendered fields.
>>
>> I’m not sure I understand.  The index wouldn’t store pre-rendered
>> strings for every possible terminal width, right?
>
> No, it wouldn't. It would store a partially pre-rendered string, that is
> without fill-paragraph. We run fill-paragraph at `guix search` time to
> complete the rendering.

Note that Texinfo rendering doesn’t use (@ (guix ui) fill-paragraph).
It has its own paragraph-filling code.  We cannot use ‘fill-paragraph’
after Texinfo rendering anyway, since Texinfo knows where things can be
filled and where they cannot—e.g., @example.

>> I think we need to take the whole user experience into account, not
>> just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
>> fairly common operation.  Conversely, ‘guix search’ takes roughly
>> between 0.5 and 2 seconds and is an uncommon operation on a “slow
>> path” (in the sense that when you’re searching for software, you’ll
>> probably have to spend more than a couple of seconds to find what
>> you’re looking for.)
>
> I agree we can't compromise too much on `guix pull` performance.
>
>> To me, adding 20–50 seconds on ‘guix pull’ would be undesirable.  :-/
>
> Maybe I'm missing something here. guix pull takes around 40 minutes on
> my machine. In comparison to that, is another 20-50 seconds (roughly 1
> minute) a big deal? How much time would it be acceptable to spend on
> building the Xapian index?

On my laptop, in the best case, when all the substitutes are available
(not uncommon), it takes 2 minutes.  Sometimes, when some substitutes
are missing, it takes 15 minutes.

So of course, the 20–50 seconds matter only in the best case.  But they
matter primarily because that index build may not be substitutable: it’s
possibly unique to each profile (see below).  That means we know we’re
often going to pay for it.

> Also, is it possible to somehow provide substitutes for the Xapian index
> so that the user does not have to actually build it locally during `guix
> pull` time?

We could provide a substitute for users who use only the official 'guix
channel.  However, as soon as users combine multiple channels, they’ll
have to build the index locally.

>> I’m not sufficiently familiar with Xapian’s query language.  The
>> examples I had in mind were:
>> It’s not so much about regexps than it is about selecting individual
>> fields.
>
> I have totally not tested this, but I imagine that equivalent Xapian
> queries might look something like:
>
>>   guix search | recsel -p name -e 'license ~ "LGPL 3"'
>
> guix search license:LGPL3

Nice.

>>   guix search crypto library | \
>>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis
>
> guix search crypto library AND (NOT ghc) AND (NOT perl) AND (NOT python)
> AND (NOT ruby)

This one is not quite equivalent I guess, but yeah.  :-)

>> What I meant was that we could use (statprof) to see whether/how Texinfo
>> rendering/parsing can be optimized.
>
> Oh, ok. I'll try this if we decide not to pre-render.

It’d be beneficial anyways.

Thank you!

Ludo’.
Simon Tournier March 9, 2020, 12:28 p.m. UTC | #11
Hi,

On Sat, 7 Mar 2020 at 14:31, Arun Isaac <arunisaac@systemreboot.net> wrote:

> --8<---------------cut here---------------start------------->8---
> With a warm cache,
> $ time guix search inkscape
>
> real    0m1.787s
> user    0m1.745s
> sys     0m0.111s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time /tmp/test/bin/guix search inkscape
>
> real    0m0.199s
> user    0m0.182s
> sys     0m0.024s
> --8<---------------cut here---------------end--------------->8---

IMHO, it is interesting to compare the list of results and the order
of the both query; as i did with Emacs.
Speed is one thing, the initial motivation. But accuracy is maybe more
important.


> - The package cache would grow in size, and lookup would be slowed down
>   because we need to load the entire cache into memory. Xapian, on the other
>   hand, need only look up the specific packages that match the search query.

I agree that 'fold-packages' could become soon a bottleneck.

IMHO, 'mset-fold' should be a drop-in replacement of 'fold-package' in
the search function.


> - Xapian can provide superior search results due to it stemming and language
>   models.
> - Xapian can provide spelling correction and query expansion -- that is,
>   suggest search terms to improve search results. Note that I haven't
>   implemented this yet and is out of scope in this patchset.

I agree too that Xapian should improve the user experience when searching.


> * Simplify our package search results
>
> Why not use a simpler package search results format like Arch Linux or Debian
> does? We could just display the package name, version and synopsis like so.
>
> inkscape 0.92.4
>     Vector graphics editor
> inklingreader 0.8
>     Wacom Inkling sketch format conversion and manipulation2
>
> Why do we need the entire recutils format? If the user is interested, they can
> always use `guix package --show` to get the full recutils formatted
> info. Having shorter search results will make everything even faster and much
> more readable. WDYT?

I disagree.

What I proposed some time ago was to have different flavour of the
ouput of search; as e.g., 'git log --pretty=oneline' etc..
For example by default, it should be what you suggest. Then "guix
search --format=full" should output the current. And we could imagine
mimick the Git log strategy: "guix search --format="%name
%version\n%license" etc.

WDYT?



> > Is (make-stem "en") for the locale?
>
> I still have English hard-coded. I haven't yet figured out how to detect the
> locale and stem accordingly. But, there is a larger problem. Since we cannot
> anticipate what locale the user will run guix search with, should we build the
> Xapian index for all locales? That is, should we index not only the English
> versions of the packages but also all other translations as well?

I understand. Let consider that for the next round.


> > package-search-index and package-cache-file could be refactored
> > because they share all the same code.
>
> Yes, they could be. However, I'll postpone to the next iteration of the
> patchset.

Ok.


> > I do not know what is the convention for the bindings.
> > But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> > something in this flavour.
>
> Well, everywhere else in guile we have such things as vhash-fold, string-fold,
> hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are
> folding over a single mset (match-set). So, mset should be in the singular.

I understand.


> > And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> > Ah do not forget to adapt some tests.
>
> Will do this once we have consensus about the other features of this patchset.

And we should test that on different machines and states.



> > Xapian does not return the package 'emacs' itself as the first. And worse,
> > it is not returned at all.
>
> In this patchset, since we're indexing the package name as well, emacs is
> returned but it is still far from the beginning.

This is an issue.

IMHO, it is because of the BM25 score. It is too rough and some weight
should be applied. But that another story.
The fix is:
 a- provide a scoring function to Xapian as the doc explains
 b- adapt 'fold-package' to 'mset-fold' in
'find-packages-by-description' and implement our version of BM25 then
use it in 'relevance'


> > I propose the value of 4294967295 for pagesize.
>
> In this patchset, I pass (database-document-count db) as the #:maximum-items
> keyword argument to enquire-mset. This is the upstream recommended way to get
> all search results. I hadn't done this earlier since I hadn't yet wrapped
> database-document-count in guile-xapian.

Cool!



> My laptop is quite old with a particularly slow HDD. Hence my motivation to
> improve guix search performance!

I agree.
But performance is not all. Accuracy counts more! :-)


> > I think we should weigh the pros and cons on all these aspects: speed,
> > complexity and maintenance cost, search result quality, search features,
> > etc.
>
> I agree.

I agree too.
We should write a benchmark. For example, using Emacs as query or more
complex we could think of.


All the best,
simon
Simon Tournier March 9, 2020, 12:34 p.m. UTC | #12
Hi,

On Sat, 7 Mar 2020 at 21:33, Ludovic Courtès <ludo@gnu.org> wrote:
> Arun Isaac <arunisaac@systemreboot.net> skribis:

> > Why not use a simpler package search results format like Arch Linux or Debian
> > does? We could just display the package name, version and synopsis like so.
> >
> > inkscape 0.92.4
> >     Vector graphics editor
> > inklingreader 0.8
> >     Wacom Inkling sketch format conversion and manipulation
> >
> > Why do we need the entire recutils format? If the user is interested, they can
> > always use `guix package --show` to get the full recutils formatted
> > info. Having shorter search results will make everything even faster and much
> > more readable. WDYT?
>
> What I like about the recutils format in this context is that it’s both
> human- and machine-readable.  The examples in the manual show how it can
> be useful to select the information displayed or to refine the search
> (info "(guix) Invoking guix package").
>
> Also: I’d recommend tackling one thing at a time.  :-)

I agree with Ludo.

And IMHO, we should add "guix search --format=<options>" mimicking how
"git log" works.
By default, displays as Arun proposes. Using '--format=full" as it is
done now by default.
And we could imagine "--format=%name \t %version \n %description" etc.



> I think we should look at a profile of ‘package->recutils’, there’s
> probably room for improvement there.

Interesting. Note that speed was the initial motivation but accuracy
is another important one. As we discussed earlier when I showed an
example with TF-IDF. And Xapian implemets the state-of-art (BM25) for
scoring.


All the best,
simon
Simon Tournier March 9, 2020, 12:40 p.m. UTC | #13
On Sun, 8 Mar 2020 at 10:02, Arun Isaac <arunisaac@systemreboot.net> wrote:

> >> It turns out that most of the time is spent in printing and texinfo
> >> rendering of the search results.
>
> Also, when we put all package metadata into the Xapian index, we don't
> have to look up any of the package variables in (gnu packages *) during
> `guix search` time. This also contributes substantially to the speedup.

Yes, magic power of inverted index. ;-)



> > Also, if the 12K+ descriptions need to be rendered at the time the user
> > runs ‘guix pull’, the experience may not be great, because it could take
> > a bit of time.
>
> This is a problem, but I would see it as a necessary "compilation"
> step. :-P In fact, this whole patchset speeds up `guix search` by doing
> part of the work of `guix search` ahead of time. So, some such cost is
> unavoidable.

Currently "guix pull" is rather long on my machine. I would accept a
couple of seconds more (even minutes).
So this compilation step could be done at the "guix pull" time.
Or even we could imagine something indexing in the background.


> > What I like about the recutils format in this context is that it’s both
> > human- and machine-readable.  The examples in the manual show how it can
> > be useful to select the information displayed or to refine the search
> > (info "(guix) Invoking guix package").
>
> Xapian's query language is much more natural (as in natural language)
> than the regexp based techniques we need to use with recutils. I have
> hardly ever used the regexp based search and I suspect many others
> haven't either. Also, refining the search query should be easier to do
> with Xapian. We could even use Xapian's query expansion feature to
> suggest improved queries to the user.
>
> That said, if we want the recutils format, we can still keep it in a
> simplified form like so.
>
> name: inkscape
> version: 0.92.4
> synopsis: Vector graphics editor
>
> name: inklingreader
> version: 0.8
> synopsis: Wacom Inkling skecth format conversion and manipulation
>
> > Also: I’d recommend tackling one thing at a time.  :-)
>
> I totally agree, but I'm tempted to say that pre-rendering would be a
> lot cheaper with the simplified form of search results. :-)

IMHO, we "just" need to propose different outputs mimicking "git log
--format". Soemthing like "guix search --format=".

What do you think?



All the best,
simon
Simon Tournier March 9, 2020, 12:47 p.m. UTC | #14
On Sun, 8 Mar 2020 at 12:33, Ludovic Courtès <ludo@gnu.org> wrote:
> Arun Isaac <arunisaac@systemreboot.net> skribis:

> > This is a problem, but I would see it as a necessary "compilation"
> > step. :-P In fact, this whole patchset speeds up `guix search` by doing
> > part of the work of `guix search` ahead of time. So, some such cost is
> > unavoidable.
>
> Yeah.  I think we need to take the whole user experience into account,
> not just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
> fairly common operation.  Conversely, ‘guix search’ takes roughly
> between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
> (in the sense that when you’re searching for software, you’ll probably
> have to spend more than a couple of seconds to find what you’re looking
> for.)

We could imagine something doing the job of indexing in the
background; using the daemon or whatever.


> >> What I like about the recutils format in this context is that it’s both
> >> human- and machine-readable.  The examples in the manual show how it can
> >> be useful to select the information displayed or to refine the search
> >> (info "(guix) Invoking guix package").
> >
> > Xapian's query language is much more natural (as in natural language)
> > than the regexp based techniques we need to use with recutils. I have
> > hardly ever used the regexp based search and I suspect many others
> > haven't either. Also, refining the search query should be easier to do
> > with Xapian. We could even use Xapian's query expansion feature to
> > suggest improved queries to the user.
>
> I’m not sufficiently familiar with Xapian’s query language.  The
> examples I had in mind were:
>
>   guix search malloc | recsel -p name,version,relevance
>   guix search | recsel -p name -e 'license ~ "LGPL 3"'
>   guix search crypto library | \
>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis

I think these examples are good ones to benchmark the different approaches.
Because the speed is one thing, the accuracy is another one.

Let cut the "slow path" by providing a better experience when searching. ;-)


> It’s not so much about regexps than it is about selecting individual
> fields.

The regexp should be provided directly to "guix search" actually and
'recsel' is only a "filter" allowing to deal differently with the
fields.



> To me, adding 20–50 seconds on ‘guix pull’ would be undesirable.  :-/

Ok, at least it is clear. :-)
And computing in the background?


All the best,
simon
Simon Tournier March 9, 2020, 12:50 p.m. UTC | #15
On Mon, 9 Mar 2020 at 08:42, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>
> Arun Isaac <arunisaac@systemreboot.net> writes:
>
> >> I’m not sufficiently familiar with Xapian’s query language.  The
> >> examples I had in mind were:
> >> It’s not so much about regexps than it is about selecting individual
> >> fields.
> >
> > I have totally not tested this, but I imagine that equivalent Xapian
> > queries might look something like:
> >
> >>   guix search | recsel -p name -e 'license ~ "LGPL 3"'
> >
> > guix search license:LGPL3
> >
> >>   guix search crypto library | \
> >>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis
> >
> > guix search crypto library AND (NOT ghc) AND (NOT perl) AND (NOT python)
> > AND (NOT ruby)
>
> Indeed, if you look at the notmuch-search-terms man page, you'll see
> that you can select fields.
> In my opinion, the recsel format is fully superseded by Xapian.

No!
Because implementing the "fields" using Xapian is not done and it is
not as straightforward as it seems.
For sure, Xapian could do a lot of thing. But we should move one step
after one step.

Let first focus on speed and accuracy. For example, the fact that
"guix search emacs" does not returns first the package 'emacs' using
Xapian is really an issue.

Cheers,
simon
Simon Tournier March 9, 2020, 12:53 p.m. UTC | #16
On Mon, 9 Mar 2020 at 08:50, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> Back to the topic: I believe that Xapian is a huge win both for the
> shell and the future GUI :)

I agree.
The big win is to test the strategy of the inverted index strategy and
in the same time the state-of-art of scoring (relevance).
Simon Tournier March 9, 2020, 1:03 p.m. UTC | #17
On Mon, 9 Mar 2020 at 11:29, Ludovic Courtès <ludo@gnu.org> wrote:

> > Back to the topic: I believe that Xapian is a huge win both for the
> > shell and the future GUI :)
>
> It could be, but we need to consider all the aspects of the story,
> including the maintenance cost and overhead moved to ‘guix pull’.  So
> it’s not so much about “beliefs” at this point, but rather about
> demonstrating what can be done, and I’m glad Arun is exploring that
> space!

I agree.
What is currently tested with Xapian is:
 1- speeding up (or not) using an inverted index
 2- the accuracy using the state-of-art of information retrieval (BM25)

About 1- I do not have a strong opinion; even if I find "guix search"
terribly slow as I mentioned earlier (one year ago ;-)).

About 2- as I mentioned earlier, the 'relevance' function could be
improved. Currently, the score is computed only considering the
package itself and not the other packages (the words they use, their
number etc.). BM25 is the state-of-art using what I tried to explained
some time ago when I showed for example TF-IDF. The question is so
what the best move to improve the accuracy. And the improvement
necessarily uses a global index (of terms, at least). But on the other
hand, the improvement should not pay off because it would add
complexity and burden, more than the improvement itself.

Without testing, we cannot say. Thank you Arun for pushing forward.


All the best,
simon
Arun Isaac March 10, 2020, 2:17 p.m. UTC | #18
> Note that Texinfo rendering doesn’t use (@ (guix ui) fill-paragraph).
> It has its own paragraph-filling code.  We cannot use ‘fill-paragraph’
> after Texinfo rendering anyway, since Texinfo knows where things can be
> filled and where they cannot—e.g., @example.

True, I did not think of this.

> We could provide a substitute for users who use only the official 'guix
> channel.  However, as soon as users combine multiple channels, they’ll
> have to build the index locally.

We could build a separate Xapian database for each channel. Xapian does
support searching across multiple databases at once and will handle
merging the results together appropriately. If I understand correctly,
this means we can provide substitutes for at least the official guix
channel and let the user build the index locally for other channels. Is
that correct?

Also, could someone please build the patchset v2 on their machine and
measure the time taken by generate-package-search-index? My laptop,
particularly my HDD is slow even as far as HDDs go. So, my figure of
20-50 seconds may not be representative.

Thanks,
Arun.
Simon Tournier March 10, 2020, 2:33 p.m. UTC | #19
Hi Arun,

On Tue, 10 Mar 2020 at 15:18, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > We could provide a substitute for users who use only the official 'guix
> > channel.  However, as soon as users combine multiple channels, they’ll
> > have to build the index locally.
>
> We could build a separate Xapian database for each channel. Xapian does
> support searching across multiple databases at once and will handle
> merging the results together appropriately. If I understand correctly,
> this means we can provide substitutes for at least the official guix
> channel and let the user build the index locally for other channels. Is
> that correct?

To complement your words, you could also imagine index all the history
as any other channels. It needs some thoughts but it seems a path that
I would to go.


> Also, could someone please build the patchset v2 on their machine and
> measure the time taken by generate-package-search-index? My laptop,
> particularly my HDD is slow even as far as HDDs go. So, my figure of
> 20-50 seconds may not be representative.

I will do when I will be fully back. :-)


All the best,
simon
Ludovic Courtès March 11, 2020, 1:50 p.m. UTC | #20
Hello!

Arun Isaac <arunisaac@systemreboot.net> skribis:

>> We could provide a substitute for users who use only the official 'guix
>> channel.  However, as soon as users combine multiple channels, they’ll
>> have to build the index locally.
>
> We could build a separate Xapian database for each channel. Xapian does
> support searching across multiple databases at once and will handle
> merging the results together appropriately.

Nice!

> If I understand correctly, this means we can provide substitutes for
> at least the official guix channel and let the user build the index
> locally for other channels. Is that correct?

I’m afraid not, or at least not trivially.

Currently, profile hooks such as ‘%channel-profile-hooks’, receive a
complete profile—in this case, the composition of all the channels the
user chose.

So if we want to achieve what you propose, we’d need to find another way
to hook database generation.


BTW, there’s also the problem of modules added dynamically with
$GUIX_PACKAGE_PATH or ‘-L’.  With the proposed scheme, it seems that
they could no longer be searched.  Is that correct?

(Conversely the package cache is optional: it’s only used when it’s
considered authoritative, see (gnu packages).  The API and behavior are
exactly the same whether or not the package cache is used.)

Thanks,
Ludo’.
Arun Isaac March 13, 2020, 5:37 a.m. UTC | #21
> Currently, profile hooks such as ‘%channel-profile-hooks’, receive a
> complete profile—in this case, the composition of all the channels the
> user chose.
>
> So if we want to achieve what you propose, we’d need to find another way
> to hook database generation.

Hmmm. Tough luck, I suppose. Do you have suggestions for anywhere else
to hook database generation?

> BTW, there’s also the problem of modules added dynamically with
> $GUIX_PACKAGE_PATH or ‘-L’.  With the proposed scheme, it seems that
> they could no longer be searched.  Is that correct?

Unfortunately, that is correct. To address this, we discussed retaining
the current search implementation along with the new xapian
implementation. But, that changes the search query behaviour and
adds a lot of complexity. I'll think of some other way out.

> (Conversely the package cache is optional: it’s only used when it’s
> considered authoritative, see (gnu packages).  The API and behavior are
> exactly the same whether or not the package cache is used.)

Thanks,
Arun
Ludovic Courtès March 15, 2020, 8:40 p.m. UTC | #22
Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> skribis:

>> Currently, profile hooks such as ‘%channel-profile-hooks’, receive a
>> complete profile—in this case, the composition of all the channels the
>> user chose.
>>
>> So if we want to achieve what you propose, we’d need to find another way
>> to hook database generation.
>
> Hmmm. Tough luck, I suppose. Do you have suggestions for anywhere else
> to hook database generation?

For the core database (packages that come with Guix), (guix self) could
take care of it.

>> BTW, there’s also the problem of modules added dynamically with
>> $GUIX_PACKAGE_PATH or ‘-L’.  With the proposed scheme, it seems that
>> they could no longer be searched.  Is that correct?
>
> Unfortunately, that is correct. To address this, we discussed retaining
> the current search implementation along with the new xapian
> implementation. But, that changes the search query behaviour and
> adds a lot of complexity. I'll think of some other way out.

Yeah, I think we’d want to have roughly a single implementation.

I wonder if the relevant metrics that Xapian implements, like zimoun
mentioned, could be available directly in Scheme in a way that allows us
to compute them at run time when the pre-built cache is unavailable.  Or
would that be necessarily too slow?

If so, perhaps a slightly less fancy metric could work with better
performance?

Thanks,
Ludo’.