diff mbox series

[bug#39258] Faster guix search using an sqlite cache

Message ID cu7v9oka341.fsf@systemreboot.net
State Work in progress
Headers show
Series [bug#39258] Faster guix search using an sqlite cache | expand

Checks

Context Check Description
cbaines/comparison success View comparision
cbaines/git branch success View Git branch
cbaines/applying patch fail View Laminar job

Commit Message

Arun Isaac Feb. 6, 2020, 1:58 a.m. UTC
>> Thank you, this was useful. I was able to catch and report the error. I
>
> Where have you reported the error?

I reported the error to the derivation log. For example, if the
derivation for the guix-package-cache derivation is
/gnu/store/cyf2h3frcjxm147dii5qic8d6kpm39nq-guix-package-cache.drv, the
log file will be at
/var/log/guix/drvs/cy/f2h3frcjxm147dii5qic8d6kpm39nq-guix-package-cache.drv.bz2. Notice
that the directory name under drvs is the first two letters of the hash,
and the file name under that directory is the remaining letters.

Also please find attached a dump of my code so far.

>> This could be a permission error, or something to do with the existence
>> or lack thereof of certain directories (such as /var) in the chroot of
>> the build daemon. I'm still figuring it out.
>
> Hum? And this should explain why it is working with the REPL and not
> with the CLI, right?

This could expalin it, but I am not sure if this is the correct
explanation.

>> I'm also in half a mind to get some guile xapian bindings ready so we
>> can just do that instead of messing with sqlite here. But, let's
>> see. :-P
>
> Cool!
> Let me know if you push something somewhere.

Sure, will let you know.

Comments

Ludovic Courtès Feb. 11, 2020, 4:29 p.m. UTC | #1
Hello Arun!

Arun Isaac <arunisaac@systemreboot.net> skribis:

> From 4c883fcff1f44339b28df6ccdb2b10c906439e3d Mon Sep 17 00:00:00 2001
> From: Arun Isaac <arunisaac@systemreboot.net>
> Date: Tue, 21 Jan 2020 20:45:43 +0530
> Subject: [PATCH] fast search

[...]

> --- a/gnu/packages.scm
> +++ b/gnu/packages.scm
> @@ -43,6 +43,7 @@
>    #:use-module (srfi srfi-34)
>    #:use-module (srfi srfi-35)
>    #:use-module (srfi srfi-39)
> +  #:use-module (sqlite3)
>    #:export (search-patch
>              search-patches
>              search-auxiliary-file
> @@ -204,10 +205,8 @@ PROC is called along these lines:
>  PROC can use #:allow-other-keys to ignore the bits it's not interested in.
>  When a package cache is available, this procedure does not actually load any
>  package module."
> -  (define cache
> -    (load-package-cache (current-profile)))
> -
> -  (if (and cache (cache-is-authoritative?))
> +  (if (and (cache-is-authoritative?)
> +           (current-profile))
>        (vhash-fold (lambda (name vector result)
>                      (match vector
>                        (#(name version module symbol outputs
> @@ -220,7 +219,7 @@ package module."
>                               #:supported? supported?
>                               #:deprecated? deprecated?))))
>                    init
> -                  cache)
> +                  (cache-lookup (current-profile)))
>        (fold-packages (lambda (package result)
>                         (proc (package-name package)
>                               (package-version package)
> @@ -252,31 +251,7 @@ is guaranteed to never traverse the same package twice."
>  
>  (define %package-cache-file
>    ;; Location of the package cache.
> -  "/lib/guix/package.cache")
> -
> -(define load-package-cache

[...]

> +(define* (cache-lookup profile #:optional name)
>    "Lookup package NAME in CACHE.  Return a list sorted in increasing version
>  order."
>    (define (package-version<? v1 v2)
>      (version>? (vector-ref v2 1) (vector-ref v1 1)))
>  
> -  (sort (vhash-fold* cons '() name cache)
> -        package-version<?))
> +  (define (int->boolean n)
> +    (case n
> +      ((0) #f)
> +      ((1) #t)))
> +
> +  (define (string->list str)
> +    (call-with-input-string str read))
> +
> +  (define select-statement
> +    (string-append
> +     "SELECT name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn from packages"
> +     (if name " WHERE name = :name" "")))

I would rather keep the current package cache as-is instead of inserting
sqlite in here.  I don’t expect it to bring much compared
performance-wise to the current simple cache (especially if we look at
load time), and it does increase complexity quite a bit.

However, using sqlite for keyword search as you initially proposed on
guix-devel does sound like a great idea to me.

WDYT?

Thanks,
Ludo’.
Simon Tournier Feb. 11, 2020, 6:21 p.m. UTC | #2
Hi Ludo,

On Tue, 11 Feb 2020 at 17:29, Ludovic Courtès <ludo@gnu.org> wrote:

> I would rather keep the current package cache as-is instead of inserting
> sqlite in here.  I don’t expect it to bring much compared
> performance-wise to the current simple cache (especially if we look at
> load time), and it does increase complexity quite a bit.

Complexity is about taste. ;-)
About performance, the idea was to first implement something with
sqlite and then see if it makes the difference. I mean I have
understood that.

> However, using sqlite for keyword search as you initially proposed on
> guix-devel does sound like a great idea to me.

If I understand correctly, you are proposing 2 caches, right?
Or are you proposing an inverted index (VHash/VList table) based on
trigrams to speed up the lookup?

Cheers,
simon
Ludovic Courtès Feb. 11, 2020, 6:39 p.m. UTC | #3
zimoun <zimon.toutoune@gmail.com> skribis:

> On Tue, 11 Feb 2020 at 17:29, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> I would rather keep the current package cache as-is instead of inserting
>> sqlite in here.  I don’t expect it to bring much compared
>> performance-wise to the current simple cache (especially if we look at
>> load time), and it does increase complexity quite a bit.
>
> Complexity is about taste. ;-)

It’s measurable to some extent (lines of code, cyclomatic complexity,
etc.)

> About performance, the idea was to first implement something with
> sqlite and then see if it makes the difference. I mean I have
> understood that.

Yes.  But keep in mind that this package cache is used exclusively for
package lookups by name.  Namely, the goal is to speed package lookup in
operations like “guix install foo” (mapping “foo” to the right <package>
in the right module without walking through all the modules) and “guix
package -A” (which is what the shell completion hooks use).

Currently “guix package -A” runs in .5s on my laptop, and I suspect it’s
going to be hard to do better just by touching the cache.

>> However, using sqlite for keyword search as you initially proposed on
>> guix-devel does sound like a great idea to me.
>
> If I understand correctly, you are proposing 2 caches, right?
> Or are you proposing an inverted index (VHash/VList table) based on
> trigrams to speed up the lookup?

Arun started the discussion on guix-devel with the idea of an inverted
index, and I thought this would become a second index (possibly
implemented using SQLite).  Perhaps I misunderstood the discussion all
along though, let me know!  :-)

Thanks,
Ludo’.
Arun Isaac Feb. 11, 2020, 7:07 p.m. UTC | #4
>>> I would rather keep the current package cache as-is instead of inserting
>>> sqlite in here.  I don’t expect it to bring much compared
>>> performance-wise to the current simple cache (especially if we look at
>>> load time), and it does increase complexity quite a bit.
>>
>> Complexity is about taste. ;-)
>
> It’s measurable to some extent (lines of code, cyclomatic complexity,
> etc.)

I agree with Ludo here. I think it does increase the complexity, and
probably unnecessarily so.

> Arun started the discussion on guix-devel with the idea of an inverted
> index, and I thought this would become a second index (possibly
> implemented using SQLite).  Perhaps I misunderstood the discussion all
> along though, let me know!  :-)

No, you didn't misunderstand. That's where it began. But, while
implementing it, I thought I might as well replace the existing cache.

Also, I've started working on guile-xapian bindings. With that, it seems
simpler to keep the current package cache and add a xapian index only to
speed up package search.
Simon Tournier Feb. 11, 2020, 8:13 p.m. UTC | #5
Hi Ludo,

On Tue, 11 Feb 2020 at 19:39, Ludovic Courtès <ludo@gnu.org> wrote:

> > About performance, the idea was to first implement something with
> > sqlite and then see if it makes the difference. I mean I have
> > understood that.
>
> Yes.  But keep in mind that this package cache is used exclusively for
> package lookups by name.  Namely, the goal is to speed package lookup in

I agree that some confusion happens here. And this cache cannot be improved.


> >> However, using sqlite for keyword search as you initially proposed on
> >> guix-devel does sound like a great idea to me.
> >
> > If I understand correctly, you are proposing 2 caches, right?
> > Or are you proposing an inverted index (VHash/VList table) based on
> > trigrams to speed up the lookup?
>
> Arun started the discussion on guix-devel with the idea of an inverted
> index, and I thought this would become a second index (possibly
> implemented using SQLite).  Perhaps I misunderstood the discussion all
> along though, let me know!  :-)

Well, your suggestion is very welcome and 2 caches are required: one
for the lookup by name, as it is already (and does a good job);
another one for "guix search" speeds up, SQLite or whatever (based on
inverted index or whatever).


Thanks,
simon
Simon Tournier Feb. 11, 2020, 8:20 p.m. UTC | #6
Hi Arun,

On Tue, 11 Feb 2020 at 20:07, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > Arun started the discussion on guix-devel with the idea of an inverted
> > index, and I thought this would become a second index (possibly
> > implemented using SQLite).  Perhaps I misunderstood the discussion all
> > along though, let me know!  :-)
>
> No, you didn't misunderstand. That's where it began. But, while
> implementing it, I thought I might as well replace the existing cache.

An inverted index backed by Guile as you did on guix-devel or backed
by SQLite seems a good improvement for "guix search".

> Also, I've started working on guile-xapian bindings. With that, it seems
> simpler to keep the current package cache and add a xapian index only to
> speed up package search.

Xapian would be cool!
And an SQLite based index seems easier to index locally the packages
and their history. The Guix Data Service is already doing such thing
but AFAIK using PostgreSQL and some magic. :-)

http://data.guix.gnu.org/repository/1/branch/master/package/git


All the best,
simon
Arun Isaac Feb. 15, 2020, 2:50 p.m. UTC | #7
> Also, I've started working on guile-xapian bindings. With that, it seems
> simpler to keep the current package cache and add a xapian index only to
> speed up package search.

I have published the first version of guile-xapian! Feedback is
welcome. :-)

https://git.systemreboot.net/guile-xapian/about/

I will now move on to building a Xapian index for Guix's package search.
diff mbox series

Patch

From 4c883fcff1f44339b28df6ccdb2b10c906439e3d Mon Sep 17 00:00:00 2001
From: Arun Isaac <arunisaac@systemreboot.net>
Date: Tue, 21 Jan 2020 20:45:43 +0530
Subject: [PATCH] fast search

---
 build-aux/build-self.scm |   5 +
 gnu/packages.scm         | 234 +++++++++++++++++++++++++--------------
 2 files changed, 155 insertions(+), 84 deletions(-)

diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm
index fc13032b73..c123ad3b11 100644
--- a/build-aux/build-self.scm
+++ b/build-aux/build-self.scm
@@ -264,6 +264,9 @@  interface (FFI) of Guile.")
   (define fake-git
     (scheme-file "git.scm" #~(define-module (git))))
 
+  (define fake-sqlite3
+    (scheme-file "sqlite3.scm" #~(define-module (sqlite3))))
+
   (with-imported-modules `(((guix config)
                             => ,(make-config.scm))
 
@@ -278,6 +281,8 @@  interface (FFI) of Guile.")
                            ;; (git) to placate it.
                            ((git) => ,fake-git)
 
+                           ((sqlite3) => ,fake-sqlite3)
+
                            ,@(source-module-closure `((guix store)
                                                       (guix self)
                                                       (guix derivations)
diff --git a/gnu/packages.scm b/gnu/packages.scm
index d22c992bb1..0ae5b84284 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -43,6 +43,7 @@ 
   #:use-module (srfi srfi-34)
   #:use-module (srfi srfi-35)
   #:use-module (srfi srfi-39)
+  #:use-module (sqlite3)
   #:export (search-patch
             search-patches
             search-auxiliary-file
@@ -204,10 +205,8 @@  PROC is called along these lines:
 PROC can use #:allow-other-keys to ignore the bits it's not interested in.
 When a package cache is available, this procedure does not actually load any
 package module."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and cache (cache-is-authoritative?))
+  (if (and (cache-is-authoritative?)
+           (current-profile))
       (vhash-fold (lambda (name vector result)
                     (match vector
                       (#(name version module symbol outputs
@@ -220,7 +219,7 @@  package module."
                              #:supported? supported?
                              #:deprecated? deprecated?))))
                   init
-                  cache)
+                  (cache-lookup (current-profile)))
       (fold-packages (lambda (package result)
                        (proc (package-name package)
                              (package-version package)
@@ -252,31 +251,7 @@  is guaranteed to never traverse the same package twice."
 
 (define %package-cache-file
   ;; Location of the package cache.
-  "/lib/guix/package.cache")
-
-(define load-package-cache
-  (mlambda (profile)
-    "Attempt to load the package cache.  On success return a vhash keyed by
-package names.  Return #f on failure."
-    (match profile
-      (#f #f)
-      (profile
-       (catch 'system-error
-         (lambda ()
-           (define lst
-             (load-compiled (string-append profile %package-cache-file)))
-           (fold (lambda (item vhash)
-                   (match item
-                     (#(name version module symbol outputs
-                             supported? deprecated?
-                             file line column)
-                      (vhash-cons name item vhash))))
-                 vlist-null
-                 lst))
-         (lambda args
-           (if (= ENOENT (system-error-errno args))
-               #f
-               (apply throw args))))))))
+  "/lib/guix/package-cache.sqlite")
 
 (define find-packages-by-name/direct              ;bypass the cache
   (let ((packages (delay
@@ -297,25 +272,57 @@  decreasing version order."
                     matching)
             matching)))))
 
-(define (cache-lookup cache name)
+(define* (cache-lookup profile #:optional name)
   "Lookup package NAME in CACHE.  Return a list sorted in increasing version
 order."
   (define (package-version<? v1 v2)
     (version>? (vector-ref v2 1) (vector-ref v1 1)))
 
-  (sort (vhash-fold* cons '() name cache)
-        package-version<?))
+  (define (int->boolean n)
+    (case n
+      ((0) #f)
+      ((1) #t)))
+
+  (define (string->list str)
+    (call-with-input-string str read))
+
+  (define select-statement
+    (string-append
+     "SELECT name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn from packages"
+     (if name " WHERE name = :name" "")))
+
+  (define cache-file
+    (string-append profile %package-cache-file))
+
+  (let* ((db (sqlite-open cache-file SQLITE_OPEN_READONLY))
+         (statement (sqlite-prepare db select-statement)))
+    (when name
+      (sqlite-bind-arguments statement #:name name))
+    (let ((result (sqlite-fold (lambda (v result)
+                                 (match v
+                                   (#(name version module symbol outputs supported superseded file line column)
+                                    (cons
+                                     (vector name
+                                             version
+                                             (string->list module)
+                                             (string->symbol symbol)
+                                             (string->list outputs)
+                                             (int->boolean supported)
+                                             (int->boolean superseded)
+                                             (list file line column))
+                                     result))))
+                               '() statement)))
+      (sqlite-finalize statement)
+      (sqlite-close db)
+      (sort result package-version<?))))
 
 (define* (find-packages-by-name name #:optional version)
   "Return the list of packages with the given NAME.  If VERSION is not #f,
 then only return packages whose version is prefixed by VERSION, sorted in
 decreasing version order."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and (cache-is-authoritative?) cache)
-      (match (cache-lookup cache name)
-        (#f #f)
+  (if (and (cache-is-authoritative?)
+           (current-profile))
+      (match (cache-lookup (current-profile) name)
         ((#(_ versions modules symbols _ _ _ _ _ _) ...)
          (fold (lambda (version* module symbol result)
                  (if (or (not version)
@@ -331,12 +338,9 @@  decreasing version order."
 (define* (find-package-locations name #:optional version)
   "Return a list of version/location pairs corresponding to each package
 matching NAME and VERSION."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and cache (cache-is-authoritative?))
-      (match (cache-lookup cache name)
-        (#f '())
+  (if (and (cache-is-authoritative?)
+           (current-profile))
+      (match (cache-lookup (current-profile) name)
         ((#(name versions modules symbols outputs
                  supported? deprecated?
                  files lines columns) ...)
@@ -372,6 +376,33 @@  VERSION."
 ;; Prevent Guile 3 from inlining this procedure so we can mock it in tests.
 (set! find-best-packages-by-name find-best-packages-by-name)
 
+;; (generate-package-cache "/tmp/test")
+
+;; XXX: missing in guile-sqlite3@0.1.0
+(define SQLITE_BUSY 5)
+
+(define (call-with-transaction db proc)
+  "Start a transaction with DB (make as many attempts as necessary) and run
+PROC.  If PROC exits abnormally, abort the transaction, otherwise commit the
+transaction after it finishes."
+  (catch 'sqlite-error
+    (lambda ()
+      ;; We use begin immediate here so that if we need to retry, we
+      ;; figure that out immediately rather than because some SQLITE_BUSY
+      ;; exception gets thrown partway through PROC - in which case the
+      ;; part already executed (which may contain side-effects!) would be
+      ;; executed again for every retry.
+      (sqlite-exec db "begin immediate;")
+      (let ((result (proc)))
+        (sqlite-exec db "commit;")
+        result))
+    (lambda (key who error description)
+      (if (= error SQLITE_BUSY)
+          (call-with-transaction db proc)
+          (begin
+            (sqlite-exec db "rollback;")
+            (throw 'sqlite-error who error description))))))
+
 (define (generate-package-cache directory)
   "Generate under DIRECTORY a cache of all the available packages.
 
@@ -381,49 +412,84 @@  reducing the memory footprint."
   (define cache-file
     (string-append directory %package-cache-file))
 
-  (define (expand-cache module symbol variable result+seen)
+  (define schema
+    "CREATE TABLE packages (name text,
+version text,
+module text,
+symbol text,
+outputs text,
+supported int,
+superseded int,
+locationFile text,
+locationLine int,
+locationColumn int);
+CREATE VIRTUAL TABLE packageSearch USING fts5(name, searchText);")
+
+  (define insert-statement
+    "INSERT INTO packages(name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn)
+VALUES(:name, :version, :module, :symbol, :outputs, :supported, :superseded, :locationfile, :locationline, :locationcolumn)")
+
+  (define insert-package-search-statement
+    "INSERT INTO packageSearch(name, searchText) VALUES(:name, :searchtext)")
+
+  (define (boolean->int x)
+    (if x 1 0))
+
+  (define (list->string x)
+    (call-with-output-string (cut write x <>)))
+
+  (define (insert-package db module symbol variable seen)
     (match (false-if-exception (variable-ref variable))
       ((? package? package)
-       (match result+seen
-         ((result . seen)
-          (if (or (vhash-assq package seen)
-                  (hidden-package? package))
-              (cons result seen)
-              (cons (cons `#(,(package-name package)
-                             ,(package-version package)
-                             ,(module-name module)
-                             ,symbol
-                             ,(package-outputs package)
-                             ,(->bool (supported-package? package))
-                             ,(->bool (package-superseded package))
-                             ,@(let ((loc (package-location package)))
-                                 (if loc
-                                     `(,(location-file loc)
-                                       ,(location-line loc)
-                                       ,(location-column loc))
-                                     '(#f #f #f))))
-                          result)
-                    (vhash-consq package #t seen))))))
-      (_
-       result+seen)))
-
-  (define exp
-    (first
-     (fold-module-public-variables* expand-cache
-                                    (cons '() vlist-null)
-                                    (all-modules (%package-module-path)
-                                                 #:warn
-                                                 warn-about-load-error))))
+       (cond
+        ((or (vhash-assq package seen)
+             (hidden-package? package))
+         seen)
+        (else
+         (let ((statement (sqlite-prepare db insert-statement)))
+           (sqlite-bind-arguments statement
+                                  #:name (package-name package)
+                                  #:version (package-version package)
+                                  #:module (list->string (module-name module))
+                                  #:symbol (symbol->string symbol)
+                                  #:outputs (list->string (package-outputs package))
+                                  #:supported (boolean->int (supported-package? package))
+                                  #:superseded (boolean->int (package-superseded package))
+                                  #:locationfile (cond
+                                                  ((package-location package) => location-file)
+                                                  (else #f))
+                                  #:locationline (cond
+                                                  ((package-location package) => location-line)
+                                                  (else #f))
+                                  #:locationcolumn (cond
+                                                    ((package-location package) => location-column)
+                                                    (else #f)))
+           (sqlite-fold cons '() statement)
+           (sqlite-finalize statement))
+         (let ((statement (sqlite-prepare db insert-package-search-statement)))
+           (sqlite-bind-arguments statement
+                                  #:name (package-name package)
+                                  #:searchtext (package-description package))
+           (sqlite-fold cons '() statement)
+           (sqlite-finalize statement))
+         (vhash-consq package #t seen))))
+      (_ seen)))
 
   (mkdir-p (dirname cache-file))
-  (call-with-output-file cache-file
-    (lambda (port)
-      ;; Store the cache as a '.go' file.  This makes loading fast and reduces
-      ;; heap usage since some of the static data is directly mmapped.
-      (put-bytevector port
-                      (compile `'(,@exp)
-                               #:to 'bytecode
-                               #:opts '(#:to-file? #t)))))
+  (let ((tmp (string-append (dirname cache-file) "/tmp")))
+    (mkdir-p tmp)
+    (setenv "SQLITE_TMPDIR" tmp))
+  (let ((db (sqlite-open cache-file)))
+    (sqlite-exec db schema)
+    (call-with-transaction db
+        (lambda ()
+          (fold-module-public-variables* (cut insert-package db <> <> <> <>)
+                                         vlist-null
+                                         (all-modules (%package-module-path)
+                                                      #:warn
+                                                      warn-about-load-error))))
+    (sqlite-close db))
+
   cache-file)
 
 
-- 
2.23.0