diff mbox series

[bug#39258,2/4] ui: Use string matching with literal search strings.

Message ID 20200601000030.7443-3-arunisaac@systemreboot.net
State Work in progress
Headers show
Series Optimize guix search | expand

Checks

Context Check Description
cbaines/comparison success View comparision
cbaines/git branch success View Git branch
cbaines/applying patch success View Laminar job
cbaines/comparison success View comparision
cbaines/git branch success View Git branch
cbaines/applying patch success View Laminar job

Commit Message

Arun Isaac June 1, 2020, midnight UTC
* guix/scripts/package.scm (process-query): Make search query a regexp only if
it is not a literal search string.
* guix/ui.scm (relevance): Use string matching with literal search strings and
regexp matching with regexp search strings.
---
 guix/scripts/package.scm | 12 +++++++---
 guix/ui.scm              | 50 +++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 22 deletions(-)

Comments

Ludovic Courtès June 9, 2020, 8:33 a.m. UTC | #1
Arun Isaac <arunisaac@systemreboot.net> skribis:

> * guix/scripts/package.scm (process-query): Make search query a regexp only if
> it is not a literal search string.
> * guix/ui.scm (relevance): Use string matching with literal search strings and
> regexp matching with regexp search strings.

How does this affect performance?

I would expect the regexp engine in libc to do something similar
internally, so I wonder if the extra work in Scheme pays off.

Ludo’.
Simon Tournier June 9, 2020, 9:55 a.m. UTC | #2
On Tue, 9 Jun 2020 at 10:34, Ludovic Courtès <ludo@gnu.org> wrote:
> Arun Isaac <arunisaac@systemreboot.net> skribis:
>
> > * guix/scripts/package.scm (process-query): Make search query a regexp only if
> > it is not a literal search string.
> > * guix/ui.scm (relevance): Use string matching with literal search strings and
> > regexp matching with regexp search strings.
>
> How does this affect performance?

On my machine, it changes nothing.

Even, I have applied the patches of the serie one by one to see the
effect on timing and I do not see an improvement.  Below an email that
I started but never completed. :-)
However, it seems to be The Right Thing to do. :-)

All the best,
simon

--

Here a quick benchmark.  Because once reading the code, I was not
convinced by the improvement. :-)
About the cut-off, the optimization should be hard to see because the
bottleneck is elsewhere. And I was doubtful about the string literal
but who knows. :-)

And to compare apple to apple, the patch set is rebased onto
a357849f5b as all the others.

Warm the cache is done by "guix search foo".


* Cut-off [PATCH 1/4]

The first patch: cut off i.e., finer implementation of '(map
regexp->score regexps)'.

** Query: crypto library

The query used is:

   guix search crypto library | recsel -P name | grep libb2

| cache | default  | v5       |
|-------+----------+----------|
| cold  | 0m2.083s | 0m2.292s |
| warm  | 0m1.404s | 0m1.470s |

And for another data point on the same query, see [1]:

| time | default  |
|------+----------|
| real | 0m2.216s | cold
|------+----------|
| real | 0m1.197s | warm

[1] http://issues.guix.gnu.org/issue/39258#78


** Query: strategy game

Using the query:

   guix search strategy game | recsel -P name | grep julius

| cache | default  | v5       |
|-------+----------+----------|
| cold  | 0m2.006s | 0m2.165s |
| warm  | 0m1.253s | 0m1.081s |


* String literal [PATCH 2/4] (+cut-off)

| cache | strategy game | crypto library |
|-------+---------------+----------------|
| cold  | 0m2.110s      | 0m2.246s       |
| warm  | 0m1.058s      | 0m1.217s       |
Arun Isaac June 13, 2020, 12:37 p.m. UTC | #3
>> * guix/scripts/package.scm (process-query): Make search query a regexp only if
>> it is not a literal search string.
>> * guix/ui.scm (relevance): Use string matching with literal search strings and
>> regexp matching with regexp search strings.
>
> How does this affect performance?

See my results from earlier.

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix search game

real	0m2.261s
user	0m2.351s
sys	0m0.104s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
time guix search game

real	0m2.661s
user	0m2.843s
sys	0m0.080s
--8<---------------cut here---------------end--------------->8---

> I would expect the regexp engine in libc to do something similar
> internally, so I wonder if the extra work in Scheme pays off.

I agree it would better to do this optimization at the regexp engine, if
it doesn't do it already.

So, shall I push the remaining patches (patches 1, 3, 4) after applying
the change you suggested for patch 1 (use of if versus cond)?
Simon Tournier June 13, 2020, 1:36 p.m. UTC | #4
Dear Arun,

On Sat, 13 Jun 2020 at 18:07, Arun Isaac <arunisaac@systemreboot.net> wrote:

>> How does this affect performance?
>
> See my results from earlier.
>
> --8<---------------cut here---------------start------------->8---
> time ./pre-inst-env guix search game
>
> real	0m2.261s
> user	0m2.351s
> sys	0m0.104s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> time guix search game
>
> real	0m2.661s
> user	0m2.843s
> sys	0m0.080s
> --8<---------------cut here---------------end--------------->8---

I confirm that it changes nothing.  See [1].

1: http://issues.guix.gnu.org/39258#112


On the other hand, on my machine I do not see any timing improvement with
this patch set.  Could you check that you are comparing apple to apple?


All the best,
simon
Arun Isaac June 13, 2020, 5:21 p.m. UTC | #5
> I confirm that it changes nothing.  See [1].
>
> 1: http://issues.guix.gnu.org/39258#112

Yes, I did read your earlier mail. And, I tried again, this time with
patch 1 alone. It certainly makes a difference on my machine. It is
clear from the code logic that it should make a difference on your
machine as well, at least for longer queries. But, somehow it isn't and
I do not understand why. :-(

Here are more fresh results. Could you try for longer queries like
"strategy game caesar" and without the output being piped to recsel,
grep, etc.? For simplicity, let's talk only about warm cache results.

|----------------------------------+--------+-------|
| query                            | before | after |
|----------------------------------+--------+-------|
| guix search strategy game        |   2.58 |  1.96 |
| guix search strategy game caesar |   2.95 |  1.76 |
|----------------------------------+--------+-------|
Ludovic Courtès June 13, 2020, 7:32 p.m. UTC | #6
Hi,

Arun Isaac <arunisaac@systemreboot.net> skribis:

>>> * guix/scripts/package.scm (process-query): Make search query a regexp only if
>>> it is not a literal search string.
>>> * guix/ui.scm (relevance): Use string matching with literal search strings and
>>> regexp matching with regexp search strings.
>>
>> How does this affect performance?

(To be clear, I’m referring specifically to this patch.)

> See my results from earlier.
>
> time ./pre-inst-env guix search game
>
> real	0m2.261s
> user	0m2.351s
> sys	0m0.104s
>
> time guix search game
>
> real	0m2.661s
> user	0m2.843s
> sys	0m0.080s
>
>> I would expect the regexp engine in libc to do something similar
>> internally, so I wonder if the extra work in Scheme pays off.
>
> I agree it would better to do this optimization at the regexp engine, if
> it doesn't do it already.

Yeah.  I feel like we shouldn’t have to do this, so I’d lean towards
excluding this patch from the series.  It’s too early to be confident
about it, but it might be something as discussed in
<https://lists.gnu.org/archive/html/guile-user/2020-06/msg00038.html>,
i.e., a problem to solve at the Guile level.

> So, shall I push the remaining patches (patches 1, 3, 4) after applying
> the change you suggested for patch 1 (use of if versus cond)?

Yes, definitely!

Thank you,
Ludo’.
Simon Tournier June 14, 2020, 7:14 p.m. UTC | #7
Dear Arun,

Here, I am speaking about only the first patch: the cut-off.

TL;DR:
 1. I was wrong about the bottleneck.
 2. The queries were not the good ones to see a clear effect
    -- on my machine.
    

On Sat, 13 Jun 2020 at 22:51, Arun Isaac <arunisaac@systemreboot.net> wrote:

> Yes, I did read your earlier mail. And, I tried again, this time with
> patch 1 alone. It certainly makes a difference on my machine. It is
> clear from the code logic that it should make a difference on your
> machine as well, at least for longer queries. But, somehow it isn't and
> I do not understand why. :-(

Well, I spent some hours* to do some stats (Student's t-test).  Roughly
speaking, on my machine, the standard deviation error (stddev) hides the
point -- depending on the query -- and that's why I am not always seeing
the improvement, I guess.

*ah all my Sunday in fact. ;-)


I compared different conditions for the query "game strategy":

 - cold    vs warm
 - xterm   vs shell in Emacs (my config vs -q)
 - no pipe vs pipe

And I run 10 times in a row each experiment.  The conclusion is: in
average -- on my machine -- the cut-off improves.  But sometimes
considering only 3 repeats in a row, the improvement is not obvious (on
the mean); because the both tails of distribution overlap a bit on my
machine and so it is kind of bad luck.  And it is ``worse'' depending
against which commit your patch is rebased: a357849 (old) vs e782756.

The t-test captures this variation, even with only 3 repeats, but I have
not done in my previous email and only compared the visible mean.  Sorry
about that.

Moreover, printing increases the stddev, so the results are more
fluctuating inside Emacs vs xterm and piping helps in this case.

Piping does not change the final result -- hopefully. :-)  It adds an
extra time but in average it is the same.

About cold vs warm cache, I notice that the improvement is not the same
(in average).  Considering the raw time, there is a difference about 10%
(with "good" confidence); it could be worth to understand why.


Well, considering that, I did other stats with other queries and the
conclusion for my machine is that *the patch improves* on average by
reducing the timing for typical usages.  Which is really cool! :-)


I definitively have wrong about the bottleneck and this one could be
one.  One way to have an idea is to use "statprof" but it is hard for me
to read the results (I believe Guile master have a fix improving the
'anon #addr', but do not really know more).

--8<---------------cut here---------------start------------->8---
$ /tmp/v5-1/bin/guix repl
scheme@(guix-user)> ,use(guix scripts search)
scheme@(guix-user)> ,pr (guix-search "game" "strategy")
%     cumulative   self             
time   seconds     seconds  procedure
 17.81      0.29      0.27  anon #xe40178
 12.33      0.20      0.18  ice-9/boot-9.scm:2201:0:%load-announce
 12.33      0.18      0.18  anon #xe3c770
  5.48      0.08      0.08  ice-9/boot-9.scm:1396:0:symbol-append
  4.11      1.57      0.06  guix/memoization.scm:100:0
  4.11      0.06      0.06  ice-9/popen.scm:145:0:reap-pipes
  2.74      0.55      0.04  guix/ui.scm:1511:12
  2.74      0.33      0.04  ice-9/regex.scm:170:0:fold-matches
  2.74      0.04      0.04  ice-9/boot-9.scm:3540:0:autoload-done-or-in-progress?
  2.74      0.04      0.04  texinfo/string-utils.scm:98:5
  2.74      0.04      0.04  ice-9/vlist.scm:539:0:vhash-assq
  1.37     69.81      0.02  ice-9/threads.scm:388:4
[...]
---
Sample count: 73
Total time: 1.490955132 seconds (0.387756476 seconds in GC)
--8<---------------cut here---------------end--------------->8---

To compare with the default:

--8<---------------cut here---------------start------------->8---
time   seconds     seconds  procedure
 24.47      0.49      0.46  anon #x1d89178
 21.28      0.40      0.40  anon #x1d85770
  9.57      0.20      0.18  ice-9/boot-9.scm:2201:0:%load-announce
  3.19      4.71      0.06  ice-9/boot-9.scm:1673:4:with-exception-handler
  3.19      1.64      0.06  guix/memoization.scm:100:0
  3.19      0.06      0.06  ice-9/boot-9.scm:3540:0:autoload-done-or-in-progress?
  3.19      0.06      0.06  anon #x1d84c78
  3.19      0.06      0.06  ice-9/popen.scm:145:0:reap-pipes
  2.13      1.01      0.04  guix/ui.scm:1511:12
  2.13      0.08      0.04  ice-9/boot-9.scm:1396:0:symbol-append
  2.13      0.04      0.04  anon #x1d83248
  1.06      0.30      0.02  anon #x7f057e6c90e8
[...]
--8<---------------cut here---------------end--------------->8---

So clearly the patch has an effect!  If someone knows what is:

 - ice-9/boot-9.scm:2201:0:%load-announce
 - ice-9/boot-9.scm:1396:0:symbol-append
 
and from where they could come from, it could help. :-)

Well, I am interested to know which part is the Regex Engine and the
string search. :-) Linking to the discussion about KMP and others.


> Here are more fresh results. Could you try for longer queries like
> "strategy game caesar" and without the output being piped to recsel,
> grep, etc.? For simplicity, let's talk only about warm cache results.
>
> |----------------------------------+--------+-------|
> | query                            | before | after |
> |----------------------------------+--------+-------|
> | guix search strategy game        |   2.58 |  1.96 |
> | guix search strategy game caesar |   2.95 |  1.76 |
> |----------------------------------+--------+-------|

At first, I was confused why one more terms returns faster.  This is
because the query "caesar" returns only one package so the query
"strategy game caesar" cuts off all the packages when searching the
terms "game" and then "strategy".  I mean

   guix search julius

should be as long as

   guix search strategy game caesar

It is; in average on my machine.

And secondly, I was confused because the timing of the query "caesar
strategy game" is almost the same (2.8% +/- 2.5% with 99.0% of
confidence; 10 repeats).  Well, it is because in one case the term
"caesar" is applied to 15 packages and in another case the terms
"strategy" and "game" are applied to 1 package.  Adding some stddev
error and not enough repeats (nor good stats), the confusion is complete
and my conclusion is wrong.


That's said, the effect of the cut-off is clear (on my machine even with
on shot) with the queries:

  - game strategy the
  - the game strategy


Thank you,
simon
Arun Isaac June 15, 2020, 8:18 p.m. UTC | #8
>> So, shall I push the remaining patches (patches 1, 3, 4) after applying
>> the change you suggested for patch 1 (use of if versus cond)?
>
> Yes, definitely!

Done!

>>>> * guix/scripts/package.scm (process-query): Make search query a regexp only if
>>>> it is not a literal search string.
>>>> * guix/ui.scm (relevance): Use string matching with literal search strings and
>>>> regexp matching with regexp search strings.
>>>
>>> How does this affect performance?
>
> (To be clear, I’m referring specifically to this patch.)

Oh, I misunderstood. Here are the results specifically comparing patch 2
against the latest master (that includes the patches 1, 3 and 4 I just
pushed). All readings are on a warm cache.

|----------------------------------+--------+-------|
| query                            | before | after |
|----------------------------------+--------+-------|
| guix search strategy game        |    2.1 |   1.7 |
| guix search strategy game caesar |    1.8 |   1.5 |
|----------------------------------+--------+-------|
diff mbox series

Patch

diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
index 1246147798..1b637f7802 100644
--- a/guix/scripts/package.scm
+++ b/guix/scripts/package.scm
@@ -675,6 +675,11 @@  doesn't need it."
 (define (process-query opts)
   "Process any query specified by OPTS.  Return #t when a query was actually
 processed, #f otherwise."
+  (define (regexp-pattern? str)
+    (string-any
+     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
+     str))
+
   (let* ((profiles (delete-duplicates
                     (match (filter-map (match-lambda
                                          (('profile . p) p)
@@ -781,11 +786,12 @@  processed, #f otherwise."
 
       (('search _)
        (let* ((patterns (filter-map (match-lambda
-                                      (('query 'search rx) rx)
+                                      (('query 'search (? regexp-pattern? rx))
+                                       (make-regexp* rx regexp/icase))
+                                      (('query 'search pattern) pattern)
                                       (_                   #f))
                                     opts))
-              (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
-              (matches  (find-packages-by-description regexps)))
+              (matches  (find-packages-by-description patterns)))
          (leave-on-EPIPE
           (display-search-results matches (current-output-port)))
          #t))
diff --git a/guix/ui.scm b/guix/ui.scm
index 4a22358963..56754dba83 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -1489,41 +1489,53 @@  HYPERLINKS? is true, emit hyperlink escape sequences when appropriate."
 ;;; Searching.
 ;;;
 
-(define (relevance obj regexps metrics)
+(define (relevance obj patterns metrics)
   "Compute a \"relevance score\" for OBJ as a function of its number of
-matches of REGEXPS and accordingly to METRICS.  METRICS is list of
+matches of PATTERNS and accordingly to METRICS.  METRICS is list of
 field/weight pairs, where FIELD is a procedure that returns a string or list
 of strings describing OBJ, and WEIGHT is a positive integer denoting the
 weight of this field in the final score.
 
-A score of zero means that OBJ does not match any of REGEXPS.  The higher the
-score, the more relevant OBJ is to REGEXPS."
-  (define (score regexp str)
-    (fold-matches regexp str 0
-                  (lambda (m score)
-                    (+ score
-                       (if (string=? (match:substring m) str)
-                           5             ;exact match
-                           1)))))
-
-  (define (regexp->score regexp)
-    (let ((score-regexp (lambda (str) (score regexp str))))
+A score of zero means that OBJ does not match any of PATTERNS.  The higher the
+score, the more relevant OBJ is to PATTERNS."
+  (define (score pattern str)
+    (match pattern
+      ((? string? pattern)
+       (cond
+        ((string=? str pattern) 5)
+        (else
+         (let loop ((score 0) (start 0))
+           (cond
+            ((string-contains-ci str pattern start)
+             => (lambda (index)
+                  (loop (+ score 1) (+ index (string-length pattern)))))
+            (else score))))))
+      ((? regexp? regexp)
+       (fold-matches regexp str 0
+                     (lambda (m score)
+                       (+ score
+                          (if (string=? (match:substring m) str)
+                              5             ;exact match
+                              1)))))))
+
+  (define (pattern->score pattern)
+    (let ((score-pattern (lambda (str) (score pattern str))))
       (fold (lambda (metric relevance)
               (match metric
                 ((field . weight)
                  (match (field obj)
                    (#f  relevance)
                    ((? string? str)
-                    (+ relevance (* (score-regexp str) weight)))
+                    (+ relevance (* (score-pattern str) weight)))
                    ((lst ...)
-                    (+ relevance (* weight (apply + (map score-regexp lst)))))))))
+                    (+ relevance (* weight (apply + (map score-pattern lst)))))))))
             0 metrics)))
 
-  (let loop ((regexps regexps)
+  (let loop ((patterns patterns)
              (total-score 0))
-    (match regexps
+    (match patterns
       ((head . tail)
-       (let ((score (regexp->score head)))
+       (let ((score (pattern->score head)))
          ;; Return zero if one of PATTERNS doesn't match.
          (cond
           ((zero? score) 0)