From patchwork Fri Aug 12 05:07:51 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Maxim Cournoyer
X-Patchwork-Id: 41570
Subject: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
References: <20220812050543.3923-1-maxim.cournoyer@gmail.com>
In-Reply-To: <20220812050543.3923-1-maxim.cournoyer@gmail.com>
To: 57151@debbugs.gnu.org
Cc: Maxim Cournoyer
From: Maxim Cournoyer
Date: Fri, 12 Aug 2022 01:07:51 -0400
Message-Id: <20220812050752.3980-1-maxim.cournoyer@gmail.com>
X-Mailer: git-send-email 2.36.1

* gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
---
 gnu/packages/ocr.scm | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e28bd17668..e2c9f561cc 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -29,6 +29,7 @@ (define-module (gnu packages ocr)
   #:use-module (guix gexp)
   #:use-module (guix git-download)
   #:use-module (guix build-system cmake)
+  #:use-module (guix build-system copy)
   #:use-module (guix build-system gnu)
   #:use-module (guix build-system python)
   #:use-module (gnu packages)
@@ -74,6 +75,32 @@ (define-public ocrad
 it produces text in 8-bit or UTF-8 formats.")
     (license license:gpl3+)))
 
+(define-public tesseract-ocr-tessdata-fast
+  (package
+    (name "tesseract-ocr-tessdata-fast")
+    (version "4.1.0")
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://github.com/tesseract-ocr/tessdata_fast")
+                    (commit version)))
+              (file-name (git-file-name name version))
+              (sha256
+               (base32
+                "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+    (build-system copy-build-system)
+    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+                     #:phases #~(modify-phases %standard-phases
+                                  (add-after 'unpack 'delete-broken-links
+                                    (lambda _
+                                      (delete-file "configs")
+                                      (delete-file "pdf.ttf"))))))
+    (home-page "https://github.com/tesseract-ocr/tessdata_fast")
+    (synopsis "Fast integer versions of trained LSTM models")
+    (description "This repository contains fast integer versions of trained
+models for the Tesseract OCR Engine.")
+    (license license:asl2.0)))
+
 (define-public tesseract-ocr
   (package
     (name "tesseract-ocr")

From patchwork Fri Aug 12 05:07:52 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Maxim Cournoyer
X-Patchwork-Id: 41569
Subject: [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful.
To: 57151@debbugs.gnu.org
Cc: Maxim Cournoyer
From: Maxim Cournoyer
Date: Fri, 12 Aug 2022 01:07:52 -0400
Message-Id: <20220812050752.3980-2-maxim.cournoyer@gmail.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220812050752.3980-1-maxim.cournoyer@gmail.com>
References: <20220812050752.3980-1-maxim.cournoyer@gmail.com>

* gnu/packages/ocr.scm (tesseract-ocr)
[phases]{adjust-TESSDATA_PREFIX-macro}: New phase.
{install-minimal-tessdata}: New phase.
[native-inputs]: Add tesseract-ocr-tessdata-fast.
[native-search-paths]: New field.
[description]: Mention how to add support for more languages.
---
 gnu/packages/ocr.scm | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e2c9f561cc..21d257ef24 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -132,6 +132,15 @@ (define-public tesseract-ocr
               (substitute* "configure.ac"
                 (("AC_SUBST\\(\\[XML_CATALOG_FILES])")
                  ""))))
+          (add-after 'unpack 'adjust-TESSDATA_PREFIX-macro
+            (lambda _
+              ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
+              ;; specific search-path than '/share' can be specified.  The
+              ;; build system uses CPPFLAGS for itself, so we can't simply set
+              ;; a make flag.
+              (substitute* "Makefile.am"
+                (("-DTESSDATA_PREFIX='\"@datadir@\"'")
+                 "-DTESSDATA_PREFIX='\"@datadir@/tesseract-ocr\"'"))))
           (add-after 'build 'build-training
             (lambda* (#:key parallel-build? #:allow-other-keys)
               (define n (if parallel-build? (number->string
@@ -140,7 +149,18 @@ (define n (if parallel-build? (number->string
               (invoke "make" "-j" n "training")))
           (add-after 'install 'install-training
             (lambda _
-              (invoke "make" "training-install"))))))
+              (invoke "make" "training-install")))
+          (add-after 'install 'install-minimal-tessdata
+            ;; tesseract-ocr cannot be used without its trained models data;
+            ;; install the English language as a minimal base which can be
+            ;; extended via TESSDATA_PREFIX.
+            (lambda* (#:key native-inputs inputs #:allow-other-keys)
+              (define eng.traineddata
+                "/share/tesseract-ocr/tessdata/eng.traineddata")
+              (install-file (search-input-file (or native-inputs inputs)
+                                               eng.traineddata)
+                            (dirname (string-append #$output
+                                                    eng.traineddata))))))))
     (native-inputs
      (list asciidoc
            autoconf
@@ -152,13 +172,18 @@ (define n (if parallel-build? (number->string
            libtool
            libxml2                      ;for XML_CATALOG_FILES
            libxslt
-           pkg-config))
+           pkg-config
+           tesseract-ocr-tessdata-fast))
     (inputs
      (list cairo
            icu4c
            leptonica
            pango
            python-wrapper))
+    (native-search-paths (list (search-path-specification
+                                (variable "TESSDATA_PREFIX")
+                                (files (list "share/tesseract-ocr/tessdata"))
+                                (separator #f)))) ;single value
     (home-page "https://github.com/tesseract-ocr/tesseract")
     (synopsis "Optical character recognition engine")
     (description
@@ -166,7 +191,9 @@ (define n (if parallel-build? (number->string
 high accuracy.  It supports many languages, output text formatting, hOCR
 positional information and page layout analysis.  Several image formats are
 supported through the Leptonica library.  It can also detect whether text is
-monospaced or proportional.")
+monospaced or proportional.  Support for the English language is included by
+default.  To add support for more languages, the
+@code{tesseract-ocr-tessdata-fast} package should be installed.")
     (license license:asl2.0)))
 
 (define-public gimagereader
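
Usage note (not part of the patch): the description above says that more
languages become available by installing tesseract-ocr-tessdata-fast next to
tesseract-ocr.  A minimal sketch of what that could look like, assuming both
patches are applied; the file name manifest.scm is only illustrative:

```scheme
;; manifest.scm (hypothetical file name)
;; Put the OCR engine and the extra trained models in one profile.  The
;; native-search-paths field added in patch 2/2 is then expected to point
;; TESSDATA_PREFIX at the profile's share/tesseract-ocr/tessdata directory.
(specifications->manifest
 (list "tesseract-ocr"
       "tesseract-ocr-tessdata-fast"))
```

Running "guix shell -m manifest.scm -- tesseract --list-langs" should then
list the models found under TESSDATA_PREFIX.  With tesseract-ocr alone, only
the English model installed by the install-minimal-tessdata phase would be
expected.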