[bug#57151,1/2] gnu: Add tesseract-ocr-tessdata-fast.

Message ID 20220812050752.3980-1-maxim.cournoyer@gmail.com
State Accepted
Series *** Add trained data models for Tesseract OCR ***

Checks

Context                 Check    Description
cbaines/comparison      success  View comparison
cbaines/git-branch      success  View Git branch
cbaines/applying patch  success  View Laminar job
cbaines/issue           success  View issue

Commit Message

Maxim Cournoyer Aug. 12, 2022, 5:07 a.m. UTC
* gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
---
 gnu/packages/ocr.scm | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Comments

Simon South Aug. 12, 2022, 11:27 a.m. UTC | #1
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.

Maxim,

Would it not be better to generate a separate package for each of the
languages and scripts this data covers, as is done by Debian for
instance?  The entire dataset is about a gigabyte in size and supports
more than a hundred languages yet I imagine most people would be using
only one or two.

This would mean tesseract-ocr could simply propagate the
"tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
specific file, and would establish a convention that would be necessary
for packaging the "best" dataset as well, if that's desired.

(Thanks for working on this; it's been on my to-do list for a while as
well.)
Maxim Cournoyer Aug. 12, 2022, 12:52 p.m. UTC | #2
Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
>
> Maxim,
>
> Would it not be better to generate a separate package for each of the
> languages and scripts this data covers, as is done by Debian for
> instance?  The entire dataset is about a gigabyte in size and supports
> more than a hundred languages yet I imagine most people would be using
> only one or two.
>
> This would mean tesseract-ocr could simply propagate the
> "tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
> specific file, and would establish a convention that would be necessary
> for packaging the "best" dataset as well, if that's desired.

That's a good idea!  I think we could have both, like Debian also has a
'tesseract-ocr-all' package for all the languages/scripts.  Which means
the individual variants could be added in at a later time by those
interested, eh :-).

A procedure returning a language-specific package variant would make
sense for that.

Thanks,

Maxim
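
For illustration only, the procedure suggested above might look roughly like the sketch below. It is not part of this patch. The procedure name, the synopsis text, and the assumption that each language's model sits at the top of the checkout as LANG.traineddata are hypothetical, and the fragment is meant to live in gnu/packages/ocr.scm next to tesseract-ocr-tessdata-fast.

```scheme
;; Hypothetical sketch, not part of the patch: the procedure name and the
;; per-language install plan are assumptions made for illustration.
(define (tessdata-fast-for-language lang)
  "Return a variant of TESSERACT-OCR-TESSDATA-FAST that installs only the
trained data file for LANG, a language code such as \"eng\"."
  (package
    (inherit tesseract-ocr-tessdata-fast)
    (name (string-append "tesseract-ocr-tessdata-fast-" lang))
    (arguments
     ;; Replacing the arguments also drops the 'delete-broken-links phase,
     ;; on the assumption that a single regular file is all that gets copied.
     (list #:install-plan
           ;; The trailing slash on the target installs the file beneath it.
           #~'((#$(string-append lang ".traineddata")
                "share/tesseract-ocr/tessdata/"))))
    (synopsis (string-append "Fast integer Tesseract model for '" lang "'"))))

;; Example use, matching the package name proposed earlier in the thread.
(define-public tesseract-ocr-tessdata-fast-eng
  (tessdata-fast-for-language "eng"))
```

Script models, if they are kept in a subdirectory of the repository, would need a different source path in the install plan.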
Maxim Cournoyer Aug. 12, 2022, 8:08 p.m. UTC | #3
Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>> Which means the individual variants could be added in at a later time
>> by those interested, eh :-).
>
> Subtext noted.
>
> One last thing, in case you weren't already aware: Issue 47536 was
> opened a while ago regarding the missing tessdata package, so you may
> want to link it to your own issue 57151 and/or close it once your
> changes are committed:
>
> https://issues.guix.gnu.org/47536

Thanks for pointing that out to me.  Pushed as ff0600c5ef.  I'll now close
the issue linked above.

Thanks!

Closing.

Maxim
Patch

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e28bd17668..e2c9f561cc 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -29,6 +29,7 @@  (define-module (gnu packages ocr)
   #:use-module (guix gexp)
   #:use-module (guix git-download)
   #:use-module (guix build-system cmake)
+  #:use-module (guix build-system copy)
   #:use-module (guix build-system gnu)
   #:use-module (guix build-system python)
   #:use-module (gnu packages)
@@ -74,6 +75,32 @@  (define-public ocrad
 it produces text in 8-bit or UTF-8 formats.")
     (license license:gpl3+)))
 
+(define-public tesseract-ocr-tessdata-fast
+  (package
+    (name "tesseract-ocr-tessdata-fast")
+    (version "4.1.0")
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://github.com/tesseract-ocr/tessdata_fast")
+                    (commit version)))
+              (file-name (git-file-name name version))
+              (sha256
+               (base32
+                "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+    (build-system copy-build-system)
+    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+                     #:phases #~(modify-phases %standard-phases
+                                  (add-after 'unpack 'delete-broken-links
+                                    (lambda _
+                                      (delete-file "configs")
+                                      (delete-file "pdf.ttf"))))))
+    (home-page "https://github.com/tesseract-ocr/tessdata_fast")
+    (synopsis "Fast integer versions of trained LSTM models")
+    (description "This repository contains fast integer versions of trained
+models for the Tesseract OCR Engine.")
+    (license license:asl2.0)))
+
 (define-public tesseract-ocr
   (package
     (name "tesseract-ocr")