From patchwork Fri Sep 13 07:02:25 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: aurtzy X-Patchwork-Id: 67903 Return-Path: X-Original-To: patchwork@mira.cbaines.net Delivered-To: patchwork@mira.cbaines.net Received: by mira.cbaines.net (Postfix, from userid 113) id 9055027BBEA; Fri, 13 Sep 2024 08:04:30 +0100 (BST) X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on mira.cbaines.net X-Spam-Level: X-Spam-Status: No, score=-6.6 required=5.0 tests=BAYES_00,DKIM_ADSP_CUSTOM_MED, DKIM_SIGNED,DKIM_VALID,FREEMAIL_FROM,MAILING_LIST_MULTI, RCVD_IN_VALIDITY_CERTIFIED,RCVD_IN_VALIDITY_RPBL,RCVD_IN_VALIDITY_SAFE, SPF_HELO_PASS,URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.6 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) by mira.cbaines.net (Postfix) with ESMTPS id 82ACB27BBE2 for ; Fri, 13 Sep 2024 08:04:29 +0100 (BST) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1sp0Le-0002W9-V0; Fri, 13 Sep 2024 03:04:07 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sp0LX-0002Vp-Bo for guix-patches@gnu.org; Fri, 13 Sep 2024 03:04:01 -0400 Received: from debbugs.gnu.org ([2001:470:142:5::43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1sp0LW-0006Rg-AO; Fri, 13 Sep 2024 03:03:58 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=debbugs.gnu.org; s=debbugs-gnu-org; h=MIME-Version:Date:From:To:Subject; bh=s8q6RMG7mBPp0xPIrEGPqxMFvj+I+gDDY2ZFoB3zFxc=; b=ZxQNbQ3vjx0D4imXjSiIddjoUlQIoBuH9Nage8AoxdE3cMtX8pCweHEhDvZbNq9YE4E+aIgd/5/C0VIpO1CI0IFZv65pwHahaWQc2KUeEh+NoX0uFAbQkXuq8ZmCV6UB/30RKzXT67wzTUMbQGxJUOwUR+az55PcIfDp6XsymWI0CBx2o+RbD+ZmcNGjODZ2eE1SGIRtIvRq8/qyWwvDAGbSClN8k9ZJHbVi5wSkkWXuL+NaQqgDYe3D/kugDldPW6JWonvBuPhsAm+MDWG+sorkO+oE+wppb6DGIq24yBXVS1v7yeHi2xU4DlKba2n5TKfPKe8+PIULTmbhhwo7GA==; Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1sp0La-00015K-OI; Fri, 13 Sep 2024 03:04:02 -0400 X-Loop: help-debbugs@gnu.org Subject: [bug#73220] [PATCH] ui: Add more nuance to relevance scoring. Resent-From: aurtzy Original-Sender: "Debbugs-submit" Resent-CC: guix@cbaines.net, dev@jpoiret.xyz, ludo@gnu.org, othacehe@gnu.org, zimon.toutoune@gmail.com, me@tobias.gr, guix-patches@gnu.org Resent-Date: Fri, 13 Sep 2024 07:04:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 73220 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: patch To: 73220@debbugs.gnu.org Cc: aurtzy , Christopher Baines , Josselin Poiret , Ludovic =?utf-8?q?Court=C3=A8s?= , Mathieu Othacehe , Simon Tournier , Tobias Geerinckx-Rice X-Debbugs-Original-To: guix-patches@gnu.org X-Debbugs-Original-Xcc: Christopher Baines , Josselin Poiret , Ludovic =?utf-8?q?Court=C3=A8s?= , Mathieu Othacehe , Simon Tournier , Tobias Geerinckx-Rice Received: via spool by submit@debbugs.gnu.org id=B.17262109844079 (code B ref -1); Fri, 13 Sep 2024 07:04:02 +0000 Received: (at submit) by debbugs.gnu.org; 13 Sep 2024 07:03:04 +0000 Received: from localhost ([127.0.0.1]:42334 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1sp0Kd-00013d-4U for submit@debbugs.gnu.org; Fri, 13 Sep 2024 03:03:04 -0400 Received: from lists.gnu.org ([209.51.188.17]:37922) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1sp0Ka-00013E-9N for submit@debbugs.gnu.org; Fri, 13 Sep 2024 03:03:01 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sp0KR-0002Sc-8S for guix-patches@gnu.org; Fri, 13 Sep 2024 03:02:51 -0400 Received: from mail-qk1-x734.google.com ([2607:f8b0:4864:20::734]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1sp0KN-0006Nm-2o for guix-patches@gnu.org; Fri, 13 Sep 2024 03:02:51 -0400 Received: by mail-qk1-x734.google.com with SMTP id af79cd13be357-7a9a23fc16fso177824485a.2 for ; Fri, 13 Sep 2024 00:02:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1726210965; x=1726815765; darn=gnu.org; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=s8q6RMG7mBPp0xPIrEGPqxMFvj+I+gDDY2ZFoB3zFxc=; b=TndsZxoTB8DWCAtcz4hG4OJ+5pj2Ic8WGeqhSvGzHBnMjSkM+6xGYH3p2khgpvMmAH 9Zp5R7VvJUgLWwJCWltiMJTFEdb0Cw53wh+PKmP5tEAwxe5AsFhm4ErQJR4lPjmXn30r oYjLutxBPpm9Oo3+XRry/wt2ihiWmSiHD6t7uurHHV/xnD3giklxGVkEe/JDDE2ZYHqx 0xiFBlBQCm+TWfBcooRfy3BIeYKliRO4m9uwocIwAMbWvNA0pElTEkDM8b50O/7C7wFm d7LNXOffyaAWQ8VtZWHwnGAqrPJrhotpnIQ9SGwvjJwvwzI5UL3VFU2PENMSkjAAp24N CO+Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726210965; x=1726815765; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=s8q6RMG7mBPp0xPIrEGPqxMFvj+I+gDDY2ZFoB3zFxc=; b=RKMRXLIGWGhqdGwCvGLr2IINngQER3Iya5/uDd6jFOZOv1gub44ibI7IxTN0CS8oc7 ePetZabppU8cXGb5xYqlB+uGrLkSqZ/a2wYkxEVUQOl51oxK5mXAw/bbFNRs23Sy1QQP npHARTzc7JpHr2P1L7ix4bq0FAgoilRjtF/MFxD/kaU6xmpcU+V0+N9ssn8aSv7jy3bb glF+WJ2EbUZeQv6wcmzTjlMB/P0HdO15wLjlnTzSRddEMPhTCYVO9APbCv0/43OBtmji YXu9l7MF3MoocbujLuVcgK4yjWj4d/xzEixBE+o7KUwyvb48L+t+cmqtb9658wkByRJn hw0g== X-Gm-Message-State: AOJu0Yzji2WxdmNM/ZLkT5iw1tFzCB3UJ8AjdUY67g/ZO64FPUbt1fDn UXyp7Cpb+kAr8SSZb+Gusa0xl6V+58MH008L3hW5Hf3M5UsvLNIP+Z0XHw== X-Google-Smtp-Source: AGHT+IGkMfwdsBaXeRfbeCfpu3+fwBuOOvlQXNZiBT1V8WQkjK4yrTwTr+Sh9eBLqL2pZ6z8dS+bsw== X-Received: by 2002:a05:620a:298e:b0:7a9:b250:d57a with SMTP id af79cd13be357-7a9e5eea278mr989105085a.1.1726210964967; Fri, 13 Sep 2024 00:02:44 -0700 (PDT) Received: from localhost.localdomain ([2600:4808:a053:7600::e413]) by smtp.gmail.com with ESMTPSA id 6a1803df08f44-6c53474d632sm62716226d6.89.2024.09.13.00.02.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 13 Sep 2024 00:02:44 -0700 (PDT) From: aurtzy Date: Fri, 13 Sep 2024 03:02:25 -0400 Message-ID: X-Mailer: git-send-email 2.46.0 MIME-Version: 1.0 Received-SPF: pass client-ip=2607:f8b0:4864:20::734; envelope-from=aurtzy@gmail.com; helo=mail-qk1-x734.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: guix-patches@gnu.org List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-patches-bounces+patchwork=mira.cbaines.net@gnu.org Sender: guix-patches-bounces+patchwork=mira.cbaines.net@gnu.org X-getmail-retrieved-from-mailbox: Patches Fixes . * guix/ui.scm (char-set:word-border): New variable. (relevance): Update docstring. [whole-word-score, exact-match-score]: New variables. [score]: Score whole words such that matching a whole word will always put an object higher than another if the other does not match any whole words. Exact matches are given similar treatment. Score matches slightly higher than the baseline if they have one word boundary, with the assumption that they are more likely to be part of compound words rather than simply substrings. Only count a maximum of one scored match per field to limit putting too much weight on terms that happen to be very common. [score][string-ref-border?]: New procedure. Change-Id: I8e3d7a20bf296485355d1c191fe3fee5ef6490c8 --- Hello! This is an attempt to improve guix's search functionality for cases like the linked issue. Elaborating on some parts of my implementation: I opted to switch to counting a maximum of one match per field, which helps with cases where a common subword matches /many/ times in packages with longer descriptions, pushing more relevant packages down. In multi-term searches, the unique terms - which are naturally rarer - also contribute to a larger percentage of the score as a result of these changes. Having matches with only one word boundary be scored as 2 instead of 1 was done with the reasoning that a term is more likely to be part of a compound word name (and thus more relevant) if it is a prefix or suffix; for example, "gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser. In an effort to minimize regressions in scoring, I've compiled a set of test cases and their expected results, which - if useful - might also be usable in future work: | Keyword(s) with poor | Expectations | | results before | | |-----------------------+-----------------------------------------------| | dig | ~bind~ near top. | | rsh | ~inetutils~ near top. | | c | C language related. | | c compiler | Compiler-related C stuff. | | r | R language related. | | tor | Tor related; ~torbrowser~ somewhere near top. | | gcc | ~gcc-toolchain~ near top. | |-----------------------+-----------------------------------------------| | Keyword(s) with mixed | | | results before | | |-----------------------+-----------------------------------------------| | gl | GL related. | | sh | Shell-related. | |-----------------------+-----------------------------------------------| | Keywords(s) with good | | | results before | | |-----------------------+-----------------------------------------------| | gcc toolchain | ~gcc-toolchain~ near top. | | python | ~python~ at top. | | python language | ~python~ at top. | | python minimal | ~python{,2}-minimal~ and friends near top. | | sync files | File synchronization related. | | sdl2 | ~sdl2~ at top. | However, some of these cases might be a bit too abstract, so I'm not sure how sufficient this testing is. Note that I only did minimal testing with =guix system search= and =guix home search= which - while seemingly fine - could be more rigorous (am I forgetting any other commands?). Going over the results of these changes on the test cases: There were notable improvements searching: - =rsh=: ~inetutils~ now shows up at the top when searching =rsh=, with another relevant (but previously buried) ~emacs-tramp~ at second place. - =c=: Searches for =c= return results related to the language now, whereas before it was a lot of unrelated packages that simply had the most =c= characters. - =dig=: While not the first result, ~bind~ is now displayed as tied for 3rd in relevance score, showing up within 10 packages. - =r=: Previously in a similar situation as C. Now ~r~ shows up at the top, with other R-related packages under it. - =gl=: The =gl= test case's results are slightly improved. Before, there were some non-relevant packages with the =gl= substring near the top, which is no longer the case. - =sh=: As a common subword, searching =sh= led to a mix of relevant and less relevant results at the top. A good majority are now shell-related. - =tor=: ~tor~ shows up on the top in both cases, but before with lots of non-relevant packages under it; the previously buried ~torbrowser~ now accompanies other more relevant results near the top. - =gcc=: ~gcc-toolchain~ is now a top result, compared to ~gccgo~ at the top before (and even ~gdc-toolchain~ also being higher; upstream name being "gcc" seems to have caused that). There are slight regressions with searching: - =sync files=: The new algorithm has a few less relevant results at the top compared to before, but otherwise seems like a shuffling of the old results. - =sdl2=: ~sdl2~'s top rank is overtaken by two libraries. If I didn't mention a test case from the table, it's probably because results were at least consistent or better (and I think I've written too much to read already). Closing this message on an unrelated note for future work: I stumbled on an interesting idea while looking for test cases which suggested reducing the score of a programming library when its language is not included in search terms. It's out of scope for the current issue, but I thought I'd mention it anyways for potential further improvements. Cheers, aurtzy guix/ui.scm | 54 ++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 45 insertions(+), 9 deletions(-) base-commit: b6d5a7f5836739dab884b49a64ca354794dd845f diff --git a/guix/ui.scm b/guix/ui.scm index 966f0611f6..420f1f7501 100644 --- a/guix/ui.scm +++ b/guix/ui.scm @@ -19,6 +19,7 @@ ;;; Copyright © 2018 Steve Sprang ;;; Copyright © 2022 Taiju HIGASHI ;;; Copyright © 2022 Liliana Marie Prikler +;;; Copyright © 2024 aurtzy ;;; ;;; This file is part of GNU Guix. ;;; @@ -1678,22 +1679,57 @@ (define* (package->recutils p port #:optional (width (terminal-columns)) ;;; Searching. ;;; +(define char-set:word-border (char-set-union char-set:digit + char-set:punctuation + char-set:symbol + char-set:whitespace)) + (define (relevance obj regexps metrics) - "Compute a \"relevance score\" for OBJ as a function of its number of -matches of REGEXPS and accordingly to METRICS. METRICS is list of -field/weight pairs, where FIELD is a procedure that returns a string or list -of strings describing OBJ, and WEIGHT is a positive integer denoting the -weight of this field in the final score. + "Compute a \"relevance score\" for OBJ as a function of its matches of REGEXPS and +accordingly to METRICS. METRICS is list of field/weight pairs, where FIELD is a +procedure that returns a string or list of strings describing OBJ, and WEIGHT is a +positive integer denoting the weight of this field in the final score. A score of zero means that OBJ does not match any of REGEXPS. The higher the score, the more relevant OBJ is to REGEXPS." + ;; Ensure that objects with whole word matches always score greater than (or equal + ;; to) objects that only match substrings. + (define whole-word-score (apply + (map (match-lambda + ((_ . weight) weight)) + metrics))) + (define exact-match-score (* whole-word-score 2)) + (define (score regexp str) + (define (string-ref-border? k) + (if (<= 0 k (1- (string-length str))) + (char-set-contains? char-set:word-border (string-ref str k)) + #t)) + (fold-matches regexp str 0 (lambda (m score) - (+ score - (if (string=? (match:substring m) str) - 5 ;exact match - 1))))) + (cond + ((string=? (match:substring m) str) + exact-match-score) + ((>= score whole-word-score) + ;; No need to compute further if score is already max + ;; possible score + score) + (else + (let ((start-border? + (string-ref-border? (1- (match:start m)))) + (end-border? + (string-ref-border? (match:end m)))) + (max score + (cond + ((and start-border? end-border?) + whole-word-score) + ((or start-border? end-border?) + ;; If the match only has one border, it could still be + ;; part of a compound word, and thus be more likely to + ;; be relevant than if it was just a substring. + 2) + (else + 1))))))))) (define (regexp->score regexp) (let ((score-regexp (lambda (str) (score regexp str))))