[bug#54635,v2,5/5] gnu: Add wfmash.

Message ID 20220331072849.22417-5-arunisaac@systemreboot.net
State New
Headers
Series [bug#54635,v2,1/5] gnu: gsl: Force bootstrap when cross-compiling to riscv64-linux. |

Commit Message

Arun Isaac March 31, 2022, 7:28 a.m. UTC
  * gnu/packages/bioinformatics.scm (wfmash): New variable.
---
 gnu/packages/bioinformatics.scm | 51 +++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)
  

Comments

Maxime Devos March 31, 2022, 11:34 a.m. UTC | #1
Arun Isaac schreef op do 31-03-2022 om 12:58 [+0530]:
> +                       (("-march=native") ""))

This is also wrong for x86 systems because it makes the build non-
reproducible.  Also, has upstream been informed about some of the
compiler flags being architecture-specific?

Greetings,
Maxime.
  
Efraim Flashner March 31, 2022, 12:18 p.m. UTC | #2
On Thu, Mar 31, 2022 at 01:34:29PM +0200, Maxime Devos wrote:
> Arun Isaac schreef op do 31-03-2022 om 12:58 [+0530]:
> > +                       (("-march=native") ""))
> 
> This is also wrong for x86 systems because it makes the build non-
> reproducible.  Also, has upstream been informed about some of the
> compiler flags being architecture-specific?

I'm pretty sure upstream is aware of it, and the -mcx16 flag. That whole
phase doesn't need to be non-x86_64 only, upstream prefers it that way
to get fater results but IMO it would be fine to move it into a snippet.
  
Maxime Devos March 31, 2022, 1:07 p.m. UTC | #3
Efraim Flashner schreef op do 31-03-2022 om 15:18 [+0300]:
> > Arun Isaac schreef op do 31-03-2022 om 12:58 [+0530]:
> > > +                       (("-march=native") ""))
> > 
> > This is also wrong for x86 systems because it makes the build non-
> > reproducible.  Also, has upstream been informed about some of the
> > compiler flags being architecture-specific?
> 
> I'm pretty sure upstream is aware of it, and the -mcx16 flag. That
> whole phase doesn't need to be non-x86_64 only, upstream prefers it
> that way to get fater results

wfmash could be written to detect CPU features at runtime and there is
also --tune.  Also, upstream preferring march=native does not make the
build reproducible.

> but IMO it would be fine to move it into a snippet.

It does not have to be in a snippet, it just needs to be reproducible
(so no march=native, whether on x86 or not).

Upstream seems to be aware of the non-x86
(https://github.com/ekg/wfmash/issues/125) but they do not seem to be
aware of the problems with march=native.

Greetings,
Maxime
  
Efraim Flashner March 31, 2022, 1:09 p.m. UTC | #4
On Thu, Mar 31, 2022 at 03:07:49PM +0200, Maxime Devos wrote:
> Efraim Flashner schreef op do 31-03-2022 om 15:18 [+0300]:
> > > Arun Isaac schreef op do 31-03-2022 om 12:58 [+0530]:
> > > > +                       (("-march=native") ""))
> > > 
> > > This is also wrong for x86 systems because it makes the build non-
> > > reproducible.  Also, has upstream been informed about some of the
> > > compiler flags being architecture-specific?
> > 
> > I'm pretty sure upstream is aware of it, and the -mcx16 flag. That
> > whole phase doesn't need to be non-x86_64 only, upstream prefers it
> > that way to get fater results
> 
> wfmash could be written to detect CPU features at runtime and there is
> also --tune.  Also, upstream preferring march=native does not make the
> build reproducible.
> 
> > but IMO it would be fine to move it into a snippet.
> 
> It does not have to be in a snippet, it just needs to be reproducible
> (so no march=native, whether on x86 or not).
> 
> Upstream seems to be aware of the non-x86
> (https://github.com/ekg/wfmash/issues/125) but they do not seem to be
> aware of the problems with march=native.

I will let them know. I also added a patch which runs a test suite based
on the github action.
  
Efraim Flashner March 31, 2022, 1:31 p.m. UTC | #5
On Thu, Mar 31, 2022 at 03:07:49PM +0200, Maxime Devos wrote:
> Efraim Flashner schreef op do 31-03-2022 om 15:18 [+0300]:
> > > Arun Isaac schreef op do 31-03-2022 om 12:58 [+0530]:
> > > > +                       (("-march=native") ""))
> > > 
> > > This is also wrong for x86 systems because it makes the build non-
> > > reproducible.  Also, has upstream been informed about some of the
> > > compiler flags being architecture-specific?
> > 
> > I'm pretty sure upstream is aware of it, and the -mcx16 flag. That
> > whole phase doesn't need to be non-x86_64 only, upstream prefers it
> > that way to get fater results
> 
> wfmash could be written to detect CPU features at runtime and there is
> also --tune.  Also, upstream preferring march=native does not make the
> build reproducible.
> 
> > but IMO it would be fine to move it into a snippet.
> 
> It does not have to be in a snippet, it just needs to be reproducible
> (so no march=native, whether on x86 or not).

I suppose not, but the -mcx16 should be in the snippet, since it adds
compiler flags which only work on some architectures. And while we're at
it we can do the -march=native one too.

> Upstream seems to be aware of the non-x86
> (https://github.com/ekg/wfmash/issues/125) but they do not seem to be
> aware of the problems with march=native.
> 

I've pushed these patches with a few tweaks and an additional patch to
run a test suite based on the github workflow in the repository. The
entire check phase takes ~2 minutes on my pinebook pro, so aarch64
doesn't seem to need to skip some tests like riscv64 does.
  
Arun Isaac April 1, 2022, 7:40 a.m. UTC | #6
Thanks for reviewing and merging this patchset!

>> wfmash could be written to detect CPU features at runtime and there is
>> also --tune.

wfmash's speed is one of the critical features it brings to the
table. We might have rendered the package more or less unusable by
disabling CPU optimizations.

I could try talking to upstream about detecting and using CPU features
at runtime. But, I don't know too much about the topic. Any quick links
I can share with them?

> I've pushed these patches with a few tweaks and an additional patch to
> run a test suite based on the github workflow in the repository. The
> entire check phase takes ~2 minutes on my pinebook pro, so aarch64
> doesn't seem to need to skip some tests like riscv64 does.

I think it is better if upstream provides us a `make check' target to
run these tests. I have asked them:
https://github.com/ekg/wfmash/issues/130 Hard-coding their CI tests
makes the package fragile. In the future, upstream could change the
tests they run.
  
Maxime Devos April 1, 2022, 9:39 a.m. UTC | #7
Arun Isaac schreef op vr 01-04-2022 om 13:10 [+0530]:
> wfmash's speed is one of the critical features it brings to the
> table. We might have rendered the package more or less unusable by
> disabling CPU optimizations.
> 
> I could try talking to upstream about detecting and using CPU features
> at runtime. But, I don't know too much about the topic. Any quick links
> I can share with them?

For the concept, maybe
<https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/>
, which is about combining high performance with reproducibility using
the (Guix-specific) --tune package transformation.  To implement run-
time detection, the following could be useful:

* Guix article:
<https://hpc.guix.info/blog/2018/01/pre-built-binaries-vs-performance/>
* LWN article: <https://lwn.net/Articles/691932/>.

Greetings,
Maxime.
  
Arun Isaac April 4, 2022, 6:01 p.m. UTC | #8
Hi Maxime,

> For the concept, maybe
> <https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/>
> , which is about combining high performance with reproducibility using
> the (Guix-specific) --tune package transformation.  To implement run-
> time detection, the following could be useful:
>
> * Guix article:
> <https://hpc.guix.info/blog/2018/01/pre-built-binaries-vs-performance/>
> * LWN article: <https://lwn.net/Articles/691932/>.

This is helpful. I will read and pass it on to wfmash upstream.

Thanks,
Arun
  

Patch

diff --git a/gnu/packages/bioinformatics.scm b/gnu/packages/bioinformatics.scm
index 2f6f3efb66..500ba4ac60 100644
--- a/gnu/packages/bioinformatics.scm
+++ b/gnu/packages/bioinformatics.scm
@@ -16107,3 +16107,54 @@  (define-public ccwl
 @acronym{EDSL, Embedded Domain Specific Language} in the Scheme programming
 language.")
     (license license:gpl3+)))
+
+(define-public wfmash
+  (package
+    (name "wfmash")
+    (version "0.8.1")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (string-append "https://github.com/ekg/wfmash/releases/download/v"
+                           version "/wfmash-v" version ".tar.gz"))
+       (sha256
+        (base32
+         "031cm1arpfckvihb28vlk69mirpnmlag81zcscfba1bac58wvr7c"))
+       (modules '((guix build utils)))
+       (snippet
+        '(begin
+           ;; Unbundle atomic-queue.
+           (delete-file-recursively "src/common/atomic_queue")
+           (substitute* "src/align/include/computeAlignments.hpp"
+             (("\"common/atomic_queue/atomic_queue.h\"")
+              "<atomic_queue/atomic_queue.h>"))))))
+    (build-system cmake-build-system)
+    (arguments
+     `(#:tests? #f                      ; no tests
+       #:phases
+       (modify-phases %standard-phases
+         ;; Remove compiler flags and checks specific to x86 when not
+         ;; targeting it.
+         ,@(if (target-x86-64?)
+               '()
+               '((add-after 'unpack 'remove-x86-specific-compile-flags
+                   (lambda _
+                     (substitute* (find-files "." "CMakeLists\\.txt")
+                       (("-mcx16") "")
+                       (("-march=native") ""))
+                     (substitute* "src/common/dset64.hpp"
+                       (("!__x86_64__") "0")))))))))
+    (inputs
+     (list atomic-queue
+           gsl
+           htslib
+           jemalloc
+           zlib))
+    (synopsis "Base-accurate DNA sequence aligner")
+    (description "@code{wfmash} is a DNA sequence read mapper based on mash
+distances and the wavefront alignment algorithm.  It is a fork of MashMap that
+implements base-level alignment via the wflign tiled wavefront global
+alignment algorithm.  It completes MashMap with a high-performance alignment
+module capable of computing base-level alignments for very large sequences.")
+    (home-page "https://github.com/ekg/wfmash")
+    (license license:expat)))