From: Nicolas Graves
Date: Sat, 25 Mar 2023 16:32:18 +0100
Subject: [bug#62443] [PATCH 1/3] gnu: Add sentencepiece.
To: 62443@debbugs.gnu.org
Cc: ngraves@ngraves.fr
Message-Id: <20230325153220.26027-1-ngraves@ngraves.fr>
In-Reply-To: <875yaoc1nj.fsf@ngraves.fr>
X-Patchwork-Id: 48674

* gnu/packages/machine-learning.scm (sentencepiece): New variable.
---
 gnu/packages/machine-learning.scm | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 37d4ef78ad..f6996af77b 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -583,6 +583,33 @@ (define openfst-for-vosk
        '("--enable-shared" "--enable-far" "--enable-ngram-fsts"
          "--enable-lookahead-fsts" "--with-pic" "--disable-bin")))))
 
+(define-public sentencepiece
+  (package
+    (name "sentencepiece")
+    (version "0.1.97")
+    (source
+     (origin
+       (method git-fetch)
+       (uri (git-reference
+             (url "https://github.com/google/sentencepiece")
+             (commit (string-append "v" version))))
+       (file-name (git-file-name name version))
+       (sha256
+        (base32 "1kzfkp2pk0vabyw3wmkh16h11chzq63mzc20ddhsag5fp6s91ajg"))))
+    (build-system cmake-build-system)
+    (arguments '(#:tests? #f))
+    (native-inputs (list gperftools))
+    (home-page "https://github.com/google/sentencepiece")
+    (synopsis "Unsupervised tokenizer for Neural Network-based text generation")
+    (description "SentencePiece is an unsupervised text tokenizer and
+detokenizer mainly for Neural Network-based text generation systems where the
+vocabulary size is predetermined prior to the neural model training.
+SentencePiece implements subword units (e.g., byte-pair-encoding
+(BPE) and unigram language model) with the extension of direct training from
+raw sentences. SentencePiece allows us to make a purely end-to-end system
+that does not depend on language-specific pre/postprocessing.")
+    (license license:asl2.0)))
+
 (define-public shogun
   (package
     (name "shogun")
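
Once the patch is applied to a Guix checkout, the new variable can be sanity-checked from a REPL. The following is a minimal sketch, assuming it is evaluated under `./pre-inst-env guix repl` in that checkout (this invocation is an assumption, not part of the patch). It uses only the record accessors from `(guix packages)`, and the values noted in the comments are simply the fields declared in the package definition.

```scheme
;; Assumes the patch has been applied, so that `sentencepiece' is exported
;; by (gnu packages machine-learning).
(use-modules (guix packages)
             (gnu packages machine-learning))

(package-name sentencepiece)       ; "sentencepiece", the `name' field
(package-version sentencepiece)    ; "0.1.97", the `version' field
(package-home-page sentencepiece)  ; the upstream GitHub URL from `home-page'
```

If the module fails to load or the variable is unbound, the hunk most likely did not apply cleanly to `gnu/packages/machine-learning.scm`.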