Message ID | 20220812050752.3980-1-maxim.cournoyer@gmail.com |
---|---|
State | Accepted |
Series | *** Add trained data models for Tesseract OCR *** |
Context | Check | Description |
---|---|---|
cbaines/comparison | success | View comparison |
cbaines/git-branch | success | View Git branch |
cbaines/applying patch | success | View Laminar job |
cbaines/issue | success | View issue |
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
Maxim,
Would it not be better to generate a separate package for each of the
languages and scripts this data covers, as is done by Debian for
instance? The entire dataset is about a gigabyte in size and supports
more than a hundred languages yet I imagine most people would be using
only one or two.
This would mean tesseract-ocr could simply propagate the
"tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
specific file, and would establish a convention that would be necessary
for packaging the "best" dataset as well, if that's desired.
(Thanks for working on this; it's been on my to-do list for a while as
well.)
Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
>
> Maxim,
>
> Would it not be better to generate a separate package for each of the
> languages and scripts this data covers, as is done by Debian for
> instance? The entire dataset is about a gigabyte in size and supports
> more than a hundred languages yet I imagine most people would be using
> only one or two.
>
> This would mean tesseract-ocr could simply propagate the
> "tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
> specific file, and would establish a convention that would be necessary
> for packaging the "best" dataset as well, if that's desired.

That's a good idea! I think we could have both, like Debian also has a
'tesseract-ocr-all' package for all the languages/scripts.

Which means the individual variants could be added in at a later time
by those interested, eh :-). A procedure returning a language-specific
package variant would make sense for that.

Thanks,

Maxim
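For illustration, the "procedure returning a language-specific package variant" mentioned above might look roughly like the sketch below. It is not part of the patch. The procedure name `tesseract-ocr-tessdata-fast-for` is hypothetical, and the sketch assumes that the tessdata_fast checkout keeps each language model as `LANG.traineddata` at its top level and that the `delete-broken-links` phase is unnecessary when only one regular file is installed.

```scheme
;; Sketch only: assumes Guix is on the load path and that the
;; tesseract-ocr-tessdata-fast package from this thread's patch exists.
(use-modules (guix packages)
             (guix gexp)
             (gnu packages ocr))

;; Hypothetical helper; its name and the file layout it relies on are
;; assumptions, not something the patch defines.
(define (tesseract-ocr-tessdata-fast-for lang)
  "Return a variant of TESSERACT-OCR-TESSDATA-FAST that installs only the
trained data for LANG, a language code such as \"eng\"."
  (package
    (inherit tesseract-ocr-tessdata-fast)
    (name (string-append "tesseract-ocr-tessdata-fast-" lang))
    (arguments
     ;; Install a single file rather than the whole checkout.  This
     ;; replaces the inherited arguments, so the 'delete-broken-links
     ;; phase is dropped (assumed not to be needed here).
     (list #:install-plan
           #~(list (list #$(string-append lang ".traineddata")
                         "share/tesseract-ocr/tessdata/"))))
    (synopsis (string-append
               "Fast integer version of the trained LSTM model for " lang))))

;; Example variant matching the package name suggested in the thread.
(define-public tesseract-ocr-tessdata-fast-eng
  (tesseract-ocr-tessdata-fast-for "eng"))
```

Because the variant inherits the same `source`, building it would still fetch the full repository; only the installed output would shrink to the one model file.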
Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>> Which means the individual variants could be added in at a later time
>> by those interested, eh :-).
>
> Subtext noted.
>
> One last thing, in case you weren't already aware: Issue 47536 was
> opened a while ago regarding the missing tessdata package, so you may
> want to link it to your own issue 57151 and/or close it once your
> changes are committed:
>
>   https://issues.guix.gnu.org/47536

Thanks for pointing that to me.

Pushed as ff0600c5ef. I'll now close the issue linked above.

Thanks!

Closing.

Maxim
diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e28bd17668..e2c9f561cc 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -29,6 +29,7 @@ (define-module (gnu packages ocr)
   #:use-module (guix gexp)
   #:use-module (guix git-download)
   #:use-module (guix build-system cmake)
+  #:use-module (guix build-system copy)
   #:use-module (guix build-system gnu)
   #:use-module (guix build-system python)
   #:use-module (gnu packages)
@@ -74,6 +75,32 @@ (define-public ocrad
 it produces text in 8-bit or UTF-8 formats.")
     (license license:gpl3+)))
 
+(define-public tesseract-ocr-tessdata-fast
+  (package
+    (name "tesseract-ocr-tessdata-fast")
+    (version "4.1.0")
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://github.com/tesseract-ocr/tessdata_fast")
+                    (commit version)))
+              (file-name (git-file-name name version))
+              (sha256
+               (base32
+                "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+    (build-system copy-build-system)
+    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+                     #:phases #~(modify-phases %standard-phases
+                                  (add-after 'unpack 'delete-broken-links
+                                    (lambda _
+                                      (delete-file "configs")
+                                      (delete-file "pdf.ttf"))))))
+    (home-page "https://github.com/tesseract-ocr/tessdata_fast")
+    (synopsis "Fast integer versions of trained LSTM models")
+    (description "This repository contains fast integer versions of trained
+models for the Tesseract OCR Engine.")
+    (license license:asl2.0)))
+
 (define-public tesseract-ocr
   (package
     (name "tesseract-ocr")