
[bug#50040,0/2] publish: Always render nar/narinfo during baking.

Message ID 20210813102800.805-1-othacehe@gnu.org

Message

Mathieu Othacehe Aug. 13, 2021, 10:28 a.m. UTC
Hello,

This patchset does two things:

* It ensures that narinfo creation always happens in a dedicated thread, to
  avoid blocking the main publish thread and causing the connection timeout
  errors that some users experienced.

* It removes the cache bypass support and always renders temporary narinfo and
  nar files during baking. This way, the first user asking for a substitute
  no longer gets a 404 error. It also removes the need for Cuirass to pre-bake
  nar files. Finally, it will make it much easier to use a publish server with
  caching in a CI context.

The downsides are that the publish server code becomes harder to understand,
and that it consumes more server resources since the nar/narinfo file
creation is duplicated.

In the long run, spawning new threads constantly won't scale. I think it would
be nice to rewrite the publish server using Fibers or an HTTP server framework.

Thanks,

Mathieu

Mathieu Othacehe (2):
  publish: Defer narinfo string creation to the http-write.
  publish: Remove cache bypass support.

 doc/guix.texi            |  15 -----
 gnu/services/base.scm    |  11 +---
 guix/scripts/publish.scm | 136 +++++++++++++++++++++++----------------
 tests/publish.scm        |  88 +++----------------------
 4 files changed, 93 insertions(+), 157 deletions(-)

Comments

Mathieu Othacehe Aug. 22, 2021, 7:33 a.m. UTC | #1
Hey,

I increased the cache bypass threshold on the Berlin publish server from 50
to 500 MiB to check whether this patchset would help. It turns out there are
still some Cuirass errors.

Running this command:

--8<---------------cut here---------------start------------->8---
sudo strace -f -s 10000 -e write \
     -p 124867 -p 124868 -p 124869 -p 124870 -p 124871 -p 124872 \
     -p 124873 -p 124874 -p 124875 -p 124876 -p 124877 -p 124878 \
     -p 124879 -p 124880 -p 124881 -p 124882 -p 124883 2>&1 \
  | grep narinfo
--8<---------------cut here---------------end--------------->8---

(the traced PIDs are those of the Nginx worker processes)

shows the following errors:

--8<---------------cut here---------------start------------->8---
[pid 124867] write(4, "2021/08/21 22:17:40 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /s03ydmlv5p7v206hvj5z1d4hn9nzsv2s.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/s03ydmlv5p7v206hvj5z1d4hn9nzsv2s.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124869] write(4, "2021/08/21 22:17:40 [error] 124869#0: *16550621 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.175, server: ci.guix.gnu.org, request: \"GET /kfzipc676in0ff2im8zprbsgv9idrc9j.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/kfzipc676in0ff2im8zprbsgv9idrc9j.narinfo\", host: \"141.80.167.131\"\n", 355 <unfinished ...>
[pid 124867] write(4, "2021/08/21 22:17:41 [error] 124867#0: *16550771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.158, server: ci.guix.gnu.org, request: \"GET /0mpspvh9h208vhrd3wxhw1165rz8mslb.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/0mpspvh9h208vhrd3wxhw1165rz8mslb.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:41 [error] 124867#0: *16550774 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.163, server: ci.guix.gnu.org, request: \"GET /4hy6nms1lzgbdrwslw5dlilvvjlrz95g.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/4hy6nms1lzgbdrwslw5dlilvvjlrz95g.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550628 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.173, server: ci.guix.gnu.org, request: \"GET /y9adqfkz8h3if17q3w801wfr0r91xdds.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/y9adqfkz8h3if17q3w801wfr0r91xdds.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550471 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.178, server: ci.guix.gnu.org, request: \"GET /d6vvg1rsqya8cnhi2mzqqvin55qax3xy.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/d6vvg1rsqya8cnhi2mzqqvin55qax3xy.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /h3bd6psd2ijj5d9a8q46w2ldqsi135j0.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/h3bd6psd2ijj5d9a8q46w2ldqsi135j0.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550780 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.184, server: ci.guix.gnu.org, request: \"GET /8jvzhavq2zkychyjhs0ywnz2mpgp73qz.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/8jvzhavq2zkychyjhs0ywnz2mpgp73qz.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124868] write(4, "2021/08/21 22:17:43 [error] 124868#0: *16550782 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ci.guix.gnu.org, request: \"GET /3wl6j74i18wgcyqwrvfjjcyag4w3ip3d.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/3wl6j74i18wgcyqwrvfjjcyag4w3ip3d.narinfo\", host: \"127.0.0.1\"\n", 345) = 345
[pid 124868] write(4, "2021/08/21 22:17:43 [error] 124868#0: *16550784 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.161, server: ci.guix.gnu.org, request: \"GET /lf8lczjd7k55x2byql0rn5qpgibvl6kh.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/lf8lczjd7k55x2byql0rn5qpgibvl6kh.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:43 [error] 124867#0: *16550786 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.177, server: ci.guix.gnu.org, request: \"GET /fvddmn01gx8amxlh6kvv8j7pcm6dxddn.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/fvddmn01gx8amxlh6kvv8j7pcm6dxddn.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550788 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.183, server: ci.guix.gnu.org, request: \"GET /wcl4mlxpnn20q2f2mgc2hj7w21jf03iz.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/wcl4mlxpnn20q2f2mgc2hj7w21jf03iz.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550471 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.178, server: ci.guix.gnu.org, request: \"GET /hakhzv36jl50mas5am4pwk89p428wnrl.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/hakhzv36jl50mas5am4pwk89p428wnrl.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /64fsv1vjf1mxkv6wv2wh3vkffzrgalyq.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/64fsv1vjf1mxkv6wv2wh3vkffzrgalyq.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550796 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.159, server: ci.guix.gnu.org, request: \"GET /skbvrqdl64lph6k1ba5gay31spdbkm04.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/skbvrqdl64lph6k1ba5gay31spdbkm04.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550799 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.164, server: ci.guix.gnu.org, request: \"GET /b9lrrh5zj7wjhgs93m1aaz0chbp7awzk.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/b9lrrh5zj7wjhgs93m1aaz0chbp7awzk.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550802 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ci.guix.gnu.org, request: \"GET /3sqsphfshnkpxk91zi38hh11ifq144n3.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/3sqsphfshnkpxk91zi38hh11ifq144n3.narinfo\", host: \"127.0.0.1\"\n", 345) = 345
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550804 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.185, server: ci.guix.gnu.org, request: \"GET /f8ld4yvmsp5z1850gppy6sv0g59yamh8.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/f8ld4yvmsp5z1850gppy6sv0g59yamh8.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
--8<---------------cut here---------------end--------------->8---

I think this is caused by the following lines in the Berlin Nginx
configuration:

--8<---------------cut here---------------start------------->8---
;; Do not tolerate slowness of hydra.gnu.org when fetching
;; narinfos: better return 504 quickly than wait forever.
"proxy_connect_timeout 2s;"
"proxy_read_timeout 2s;"
"proxy_send_timeout 2s;"
--8<---------------cut here---------------end--------------->8---

The narinfos that are not yet in the cache are computed in the main thread,
slowing down the publish process. Under pressure, computing those narinfos
can exceed the 2s timeout, resulting in the failures above.

Our best course of action here could be to merge the proposed patches,
extend the Nginx timeouts to 10s, and see whether all requests can then be
handled quickly enough.
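Concretely, extending the timeouts would amount to something like the
following change to that configuration snippet (a sketch; the exact value
is still to be settled):

--8<---------------cut here---------------start------------->8---
;; Leave 'guix publish' enough time to compute a narinfo under
;; load, while still failing reasonably fast.
"proxy_connect_timeout 10s;"
"proxy_read_timeout 10s;"
"proxy_send_timeout 10s;"
--8<---------------cut here---------------end--------------->8---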

Thanks,

Mathieu
Ludovic Courtès Aug. 30, 2021, 10:33 p.m. UTC | #2
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I think this is caused by the following lines in the Berlin Nginx
> configuration:
>
> ;; Do not tolerate slowness of hydra.gnu.org when fetching
> ;; narinfos: better return 504 quickly than wait forever.
> "proxy_connect_timeout 2s;"
> "proxy_read_timeout 2s;"
> "proxy_send_timeout 2s;"
>
> The narinfos that are not yet in the cache are computed in the main thread,
> slowing down the publish process. Under pressure, computing those narinfos
> can exceed the 2s timeout, resulting in the failures above.
>
> Our best course of action here could be to merge the proposed patches,
> extend the Nginx timeouts to 10s, and see whether all requests can then be
> handled quickly enough.

Yeah, increasing these timeouts a little bit (to 5–7s? making it too
long could lead to annoyances) sounds reasonable.

Ludo’.
Ludovic Courtès Sept. 6, 2021, 1:54 p.m. UTC | #3
Hi Mathieu,

An additional data point: while stracing ‘guix publish’ on berlin just
now, I saw read(2) calls taking several seconds (!) to complete:

--8<---------------cut here---------------start------------->8---
14:22:07 openat(AT_FDCWD, "/var/cache/guix/publish/hashes/wcd9bci97i2zi52ay4vddvr1jx9w50kz", O_RDONLY) = 33 <0.000032>
14:22:07 fstat(33, {st_mode=S_IFREG|0666, st_size=60, ...}) = 0 <0.000007>
14:22:07 lseek(33, 0, SEEK_CUR)         = 0 <0.000006>
14:22:07 read(33, "/gnu/store/wcd9bci97i2zi52ay4vdd"..., 4096) = 60 <2.540400>
14:22:10 read(33, "", 4096)             = 0 <0.000014>
14:22:10 close(33)                      = 0 <0.000008>
--8<---------------cut here---------------end--------------->8---

The load is indeed high right now:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ uptime
 14:24:18  up 265 days 21:09,  1 user,  load average: 10.61, 8.42, 8.00
--8<---------------cut here---------------end--------------->8---

There are a few ongoing builds, which could partly explain the high load
(‘guix-daemon’ shows up at the top in iotop):

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo guix processes|grep -C3 LockHeld
SessionPID: 43868
ClientPID: 43515
ClientCommand: /gnu/store/6l9rix46ydxyldf74dvpgr60rf5ily0c-guile-3.0.7/bin/guile --no-auto-compile /gnu/store/7rqj61dy660k0nw97w7657zbxnbdvg1g-update-guix.gnu.org
LockHeld: /gnu/store/srim97jxlxgyiday0j3hbq2z0w03iy3k-guix-web-site.lock
ChildPID: 49648
ChildCommand: : /gnu/store/p9di786py35m4kxszf466wcla6jvq96h-guix-1.3.0-3.50dfbbf/libexec/guix/guile \ /gnu/store/p9di786py35m4kxszf466wcla6jvq96h-guix-1.3.0-3.50dfbbf/bin/guix substitute --query

--
SessionPID: 61372
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/xdvzddg84by15s1c17k543b39c03jp8i-isl-0.23.drv.lock

SessionPID: 61373
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/2wm7b9g07nxxhrnfcggw6ibdry65fpwb-xfce4-panel-4.16.3-builder.lock

SessionPID: 61374
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/1jmidz4i70w3lczn99wgn19dksb0pzhv-pkg-config-0.29.2-builder.lock

SessionPID: 61375
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/4j7s86xp7w5jgm84lr18szln1zpw36zd-static-binaries-tarball-0.drv.lock
--8<---------------cut here---------------end--------------->8---

The ‘mumi worker’ process also seems to be doing a lot of I/O and using
quite a bit of CPU (presumably allocating a lot).  Its ‘perf’ profile is
like this:

--8<---------------cut here---------------start------------->8---
   2.62%  .mumi-real  libguile-3.0.so.1.1.1  [.] scm_ilength
   1.20%  .mumi-real  libguile-3.0.so.1.1.1  [.] scm_getc
   1.12%  .mumi-real  libgc.so.1.4.3         [.] GC_add_to_black_list_normal.part.4
   1.06%  .mumi-real  libgc.so.1.4.3         [.] GC_build_fl
   0.98%  .mumi-real  libgc.so.1.4.3         [.] GC_reclaim_clear
   0.97%  .mumi-real  libguile-3.0.so.1.1.1  [.] get_callee_vcode
   0.91%  .mumi-real  libgc.so.1.4.3         [.] GC_generic_malloc_many
   0.90%  .mumi-real  libguile-3.0.so.1.1.1  [.] peek_byte_or_eof
   0.78%  .mumi-real  [JIT] tid 62601        [.] 0x00007f886964804d
--8<---------------cut here---------------end--------------->8---

Food for thought…

Ludo’.