[bug#73664] gnu: libtorrent-rasterbar: Work around hang in test_ssl.

Message ID b318e317dab97a658b90669a49edc79247716c87.1728231109.git.~@wolfsden.cz
State New
Headers
Series [bug#73664] gnu: libtorrent-rasterbar: Work around hang in test_ssl. |

Commit Message

Tomas Volf Oct. 6, 2024, 4:11 p.m. UTC
test_ssl does sometimes hang (at least when executed under faketime).  It is
somewhat unlikely to happen, and (on my machine) required a build with
--rounds=32 to reproduce it.

The workaround is to set somewhat lower timeout of 240s (expected test
duration * 5 rounded up to whole minutes) and retry few times on failure.  In
this way, --rounds=64 finished successfully (on my machine).

At the same time remove the timeout from the other tests, since it is not
necessary (they do not hang), and one of them runs for ~270s (almost half the
original timeout), so it could pose a problem on slow/overloaded machine.

* gnu/packages/bittorrent.scm
(libtorrent-rasterbar)[arguments]<#:phases>['check]: Remote test timeout for
most tests.  Lower the timeout for test_ssl.  Retry test_ssl on failure.

Change-Id: I535c72fec24658a4b2151d2e8794319055c9a278
---
 gnu/packages/bittorrent.scm | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)
  

Comments

Maxim Cournoyer Dec. 17, 2024, 1:20 a.m. UTC | #1
Hi Tomas,

Tomas Volf <~@wolfsden.cz> writes:

> test_ssl does sometimes hang (at least when executed under faketime).  It is
> somewhat unlikely to happen, and (on my machine) required a build with
> --rounds=32 to reproduce it.

It'd be nice if upstream was made aware of this problem.  Perhaps they
could come up with a fix for good.

> The workaround is to set somewhat lower timeout of 240s (expected test
> duration * 5 rounded up to whole minutes) and retry few times on failure.  In
> this way, --rounds=64 finished successfully (on my machine).
>
> At the same time remove the timeout from the other tests, since it is not
> necessary (they do not hang), and one of them runs for ~270s (almost half the
> original timeout), so it could pose a problem on slow/overloaded machine.

This means the tests may take up to 20 minutes, which is a bit too much
to my taste.

[...]

> -                  ;; Note: The test_ssl test times out in the ci.
> -                  ;; Temporarily disable it until that is resolved.
> -                  ;; (invoke "faketime" "2022-10-24"
> -                  ;;         "ctest"
> -                  ;;         "-R" "^test_ssl$"
> -                  ;;         "-j" jobs
> -                  ;;         "--timeout" timeout
> -                  ;;         "--output-on-failure")
> -                  )))))))
> +                  (invoke "faketime" "2022-10-24"
> +                          "ctest"
> +                          "-R" "^test_ssl$"
> +                          "-j" jobs
> +                          ;; test_ssl sometimes hangs (at least when run under
> +                          ;; faketime), therefore set a time limit and retry
> +                          ;; few times on failure.
> +                          "--timeout" "240"
> +                          "--repeat" "until-pass:5"
> +                          "--output-on-failure"))))))))

I think that a test sometimes hang is a good reason to leave it
disabled, report it to upstream, and reference the issue.  The test can
be re-enabled when the issue is resolved and part of a new release.

So in concrete terms, what I'd rather see here is a report of the
problems (requirement on faketime + propension to hang) to upstream, the
and an updated comment cross-referencing it (with the test kept
commented-out/disabled in the mean time).

Does that make sense?
  
Maxim Cournoyer Dec. 17, 2024, 5:08 a.m. UTC | #2
Hello!

Tomas Volf <~@wolfsden.cz> writes:

> test_ssl does sometimes hang (at least when executed under faketime).  It is
> somewhat unlikely to happen, and (on my machine) required a build with
> --rounds=32 to reproduce it.

Also worth adding on top of my previous reply, when trying this out, I
got a failure:

--8<---------------cut here---------------start------------->8---
[...]

MALICIOUS PEER TEST: valid-certificate valid-SNI-hash invalid-bittorrent-hash  port: 35161
set_password_callback
use_certificate_file "../ssl/peer_certificate.pem"
use_private_key_file "../ssl/peer_private_key.pem"
use_tmp_dh_file "../ssl/dhparams.pem"
connecting 127.0.0.1:35161
SNI: e300afcc0aa67a459ec14862a4d0bf930060167a
SSL handshake
bittorrent handshake
00:00:00.010: ses1: [log]  *** peer SSL handshake done [ ip: 127.0.0.1:44976 ec: certificate verify failed (SSL routines) socket: SSL/TCP ]
read bittorrent handshake
00:00:00.010: ses1: [peer_error]  -  peer [ 127.0.0.1:44976 client: Unknown ] peer error [ssl_handshake] [asio.ssl]: certificate verify failed (SSL routines)
--- peer_errors: 6 ssl_disconnects: 6
failed to read bittorrent handshake: sslv3 alert bad certificate (SSL routines)


0% tests passed, 1 tests failed out of 1

Total Test time (real) = 405.71 sec

The following tests FAILED:
	 75 - test_ssl (Failed)
Errors while running CTest
error: in phase 'check': uncaught exception:
%exception #<&invoke-error program: "faketime" arguments: ("2022-10-24" "ctest" "-R" "^test_ssl$" "-j" "1" "--timeout" "240" "--repeat" "until-pass:5" "--output-on-failure") exit-status: 8 term-signal: #f stop-signal: #f> 
phase `check' failed after 1154.0 seconds
command "faketime" "2022-10-24" "ctest" "-R" "^test_ssl$" "-j" "1" "--timeout" "240" "--repeat" "until-pass:5" "--output-on-failure" failed with status 8
build process 18 exited with status 256
builder for `/gnu/store/hkji5nzsa32jngg7kii9bg9ch9kdvs84-libtorrent-rasterbar-2.0.10.drv' failed with exit code 1
build of /gnu/store/hkji5nzsa32jngg7kii9bg9ch9kdvs84-libtorrent-rasterbar-2.0.10.drv failed
View build log at '/var/log/guix/drvs/hk/ji5nzsa32jngg7kii9bg9ch9kdvs84-libtorrent-rasterbar-2.0.10.drv'.
--8<---------------cut here---------------end--------------->8---
  
Tomas Volf Jan. 28, 2025, 8:01 p.m. UTC | #3
Hi Maxim,

thank you very much for the review and apologies for the late response,
I have kept postponing it.

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

> Hi Tomas,
>
> Tomas Volf <~@wolfsden.cz> writes:
>
>> test_ssl does sometimes hang (at least when executed under faketime).  It is
>> somewhat unlikely to happen, and (on my machine) required a build with
>> --rounds=32 to reproduce it.
>
> It'd be nice if upstream was made aware of this problem.  Perhaps they
> could come up with a fix for good.

My experience with reporting bugs to this specific upstream is not the
best, and I admit I expect the answer to "it hangs under faketime" would
be "do not run it under faketime".

>
>> The workaround is to set somewhat lower timeout of 240s (expected test
>> duration * 5 rounded up to whole minutes) and retry few times on failure.  In
>> this way, --rounds=64 finished successfully (on my machine).
>>
>> At the same time remove the timeout from the other tests, since it is not
>> necessary (they do not hang), and one of them runs for ~270s (almost half the
>> original timeout), so it could pose a problem on slow/overloaded machine.
>
> This means the tests may take up to 20 minutes, which is a bit too much
> to my taste.

During validation of the v2, the run-time of the check phase is:

--8<---------------cut here---------------start------------->8---
phase `check' succeeded after 811.0 seconds
--8<---------------cut here---------------end--------------->8---

That seems fine?  We have software already that takes similar time (or
even longer) to build.

>
> [...]
>
>> -                  ;; Note: The test_ssl test times out in the ci.
>> -                  ;; Temporarily disable it until that is resolved.
>> -                  ;; (invoke "faketime" "2022-10-24"
>> -                  ;;         "ctest"
>> -                  ;;         "-R" "^test_ssl$"
>> -                  ;;         "-j" jobs
>> -                  ;;         "--timeout" timeout
>> -                  ;;         "--output-on-failure")
>> -                  )))))))
>> +                  (invoke "faketime" "2022-10-24"
>> +                          "ctest"
>> +                          "-R" "^test_ssl$"
>> +                          "-j" jobs
>> +                          ;; test_ssl sometimes hangs (at least when run under
>> +                          ;; faketime), therefore set a time limit and retry
>> +                          ;; few times on failure.
>> +                          "--timeout" "240"
>> +                          "--repeat" "until-pass:5"
>> +                          "--output-on-failure"))))))))
>
> I think that a test sometimes hang is a good reason to leave it
> disabled, report it to upstream, and reference the issue.  The test can
> be re-enabled when the issue is resolved and part of a new release.
>
> So in concrete terms, what I'd rather see here is a report of the
> problems (requirement on faketime + propension to hang) to upstream, the
> and an updated comment cross-referencing it (with the test kept
> commented-out/disabled in the mean time).
>
> Does that make sense?

As mentioned above, I do not expect upstream to care about this specific
issue (faketime).  However what I was successful in was convincing
upstream to vastly increase the lifetime of the bundled SSL certificates
used for testing.  That means that the new version (2.0.11) will run the
tests fine until ~end of the year 2297.  Honestly I consider that to be
an acceptable deadline to find a better long term solution (for example
we could explore the use of time namespace for build environment).

So I have submitted v2 which updates to the new version and removes all
special casing for test_ssl, since it should not be required for more
than quarter of a millennium.

I hope this approach is acceptable.

Have a nice day,
Tomas
  
Maxim Cournoyer Jan. 29, 2025, 2 a.m. UTC | #4
Hi Tomas,

Tomas Volf <~@wolfsden.cz> writes:

> Hi Maxim,
>
> thank you very much for the review and apologies for the late response,
> I have kept postponing it.
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hi Tomas,
>>
>> Tomas Volf <~@wolfsden.cz> writes:
>>
>>> test_ssl does sometimes hang (at least when executed under faketime).  It is
>>> somewhat unlikely to happen, and (on my machine) required a build with
>>> --rounds=32 to reproduce it.
>>
>> It'd be nice if upstream was made aware of this problem.  Perhaps they
>> could come up with a fix for good.
>
> My experience with reporting bugs to this specific upstream is not the
> best, and I admit I expect the answer to "it hangs under faketime" would
> be "do not run it under faketime".

Some upstream takes patience to work with, but I still think it's useful
to put out a report of this problem, would it only be for other
downstreams trying to find a solution.

In this case, I'd tell them the goal is to have the tests work reliably
in time, even in 2 or 5 years, and this is not possible currently
because of expiry on the certs they use in the test suite (right?).

Possible options:

1. Remove expiry on certs used.
2. Use faketime, as attempted here; if this is chosen we need to figure
out why it hangs.

Obviously 1 is the easier/better option (if it's possible).

>>
>>> The workaround is to set somewhat lower timeout of 240s (expected test
>>> duration * 5 rounded up to whole minutes) and retry few times on failure.  In
>>> this way, --rounds=64 finished successfully (on my machine).
>>>
>>> At the same time remove the timeout from the other tests, since it is not
>>> necessary (they do not hang), and one of them runs for ~270s (almost half the
>>> original timeout), so it could pose a problem on slow/overloaded machine.
>>
>> This means the tests may take up to 20 minutes, which is a bit too much
>> to my taste.
>
> During validation of the v2, the run-time of the check phase is:
>
> phase `check' succeeded after 811.0 seconds
>
> That seems fine?  We have software already that takes similar time (or
> even longer) to build.

That's still too long to my taste, but if it succeeds reliably at least,
then so be it.

>> I think that a test sometimes hang is a good reason to leave it
>> disabled, report it to upstream, and reference the issue.  The test can
>> be re-enabled when the issue is resolved and part of a new release.
>>
>> So in concrete terms, what I'd rather see here is a report of the
>> problems (requirement on faketime + propension to hang) to upstream, the
>> and an updated comment cross-referencing it (with the test kept
>> commented-out/disabled in the mean time).
>>
>> Does that make sense?
>
> As mentioned above, I do not expect upstream to care about this specific
> issue (faketime).  However what I was successful in was convincing
> upstream to vastly increase the lifetime of the bundled SSL certificates
> used for testing.  That means that the new version (2.0.11) will run the
> tests fine until ~end of the year 2297.  Honestly I consider that to be
> an acceptable deadline to find a better long term solution (for example
> we could explore the use of time namespace for build environment).

That's good!  That's practically the #1 option I hinted at earlier
(didn't know they already did that).

> So I have submitted v2 which updates to the new version and removes all
> special casing for test_ssl, since it should not be required for more
> than quarter of a millennium.
>
> I hope this approach is acceptable.

Great!  Thanks for engaging with upstream and sending a v2.  I'll apply
it shortly if everything looks good.
  
Maxim Cournoyer Jan. 29, 2025, 1:19 p.m. UTC | #5
Hi Tomas,

Finally pushed.  Thanks for the good work!
  

Patch

diff --git a/gnu/packages/bittorrent.scm b/gnu/packages/bittorrent.scm
index 2b38c7cb65..1a0735d928 100644
--- a/gnu/packages/bittorrent.scm
+++ b/gnu/packages/bittorrent.scm
@@ -452,7 +452,6 @@  (define-public libtorrent-rasterbar
                      (exclude-regex (string-append "^("
                                                    (string-join disabled-tests "|")
                                                    ")$"))
-                     (timeout "600")
                      (jobs (if parallel-tests?
                                (number->string (parallel-job-count))
                                "1")))
@@ -460,7 +459,6 @@  (define-public libtorrent-rasterbar
                   (invoke "ctest"
                           "-E" exclude-regex
                           "-j" jobs
-                          "--timeout" timeout
                           "--output-on-failure")
                   ;; test_ssl relies on bundled TLS certificates with a fixed
                   ;; expiry date.  To ensure succesful builds in the future,
@@ -470,16 +468,16 @@  (define-public libtorrent-rasterbar
                   ;; test_fast_extension, test_privacy and test_resolve_links
                   ;; to hang, even with FAKETIME_ONLY_CMDS.  Not sure why.  So
                   ;; execute only test_ssl under faketime.
-                  ;;
-                  ;; Note: The test_ssl test times out in the ci.
-                  ;; Temporarily disable it until that is resolved.
-                  ;; (invoke "faketime" "2022-10-24"
-                  ;;         "ctest"
-                  ;;         "-R" "^test_ssl$"
-                  ;;         "-j" jobs
-                  ;;         "--timeout" timeout
-                  ;;         "--output-on-failure")
-                  )))))))
+                  (invoke "faketime" "2022-10-24"
+                          "ctest"
+                          "-R" "^test_ssl$"
+                          "-j" jobs
+                          ;; test_ssl sometimes hangs (at least when run under
+                          ;; faketime), therefore set a time limit and retry
+                          ;; few times on failure.
+                          "--timeout" "240"
+                          "--repeat" "until-pass:5"
+                          "--output-on-failure"))))))))
     (inputs (list boost openssl))
     (native-inputs
      (list libfaketime