Message ID: 20210813102800.805-1-othacehe@gnu.org
Series: publish: Always render nar/narinfo during backing.
Hey,

I increased the bypass cache value on the Berlin publish server from 50
to 500 MiB to confirm that this patch will help. Turns out, there are
still some Cuirass errors. Running this command:

--8<---------------cut here---------------start------------->8---
sudo strace -f -s 10000 -e write \
  -p 124867 -p 124868 -p 124869 -p 124870 -p 124871 -p 124872 \
  -p 124873 -p 124874 -p 124875 -p 124876 -p 124877 -p 124878 \
  -p 124879 -p 124880 -p 124881 -p 124882 -p 124883 2>&1 | grep narinfo
--8<---------------cut here---------------end--------------->8---

(the processes are all the Nginx workers) shows these errors:

--8<---------------cut here---------------start------------->8---
[pid 124867] write(4, "2021/08/21 22:17:40 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /s03ydmlv5p7v206hvj5z1d4hn9nzsv2s.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/s03ydmlv5p7v206hvj5z1d4hn9nzsv2s.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124869] write(4, "2021/08/21 22:17:40 [error] 124869#0: *16550621 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.175, server: ci.guix.gnu.org, request: \"GET /kfzipc676in0ff2im8zprbsgv9idrc9j.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/kfzipc676in0ff2im8zprbsgv9idrc9j.narinfo\", host: \"141.80.167.131\"\n", 355 <unfinished ...>
[pid 124867] write(4, "2021/08/21 22:17:41 [error] 124867#0: *16550771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.158, server: ci.guix.gnu.org, request: \"GET /0mpspvh9h208vhrd3wxhw1165rz8mslb.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/0mpspvh9h208vhrd3wxhw1165rz8mslb.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:41 [error] 124867#0: *16550774 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.163, server: ci.guix.gnu.org, request: \"GET /4hy6nms1lzgbdrwslw5dlilvvjlrz95g.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/4hy6nms1lzgbdrwslw5dlilvvjlrz95g.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550628 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.173, server: ci.guix.gnu.org, request: \"GET /y9adqfkz8h3if17q3w801wfr0r91xdds.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/y9adqfkz8h3if17q3w801wfr0r91xdds.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550471 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.178, server: ci.guix.gnu.org, request: \"GET /d6vvg1rsqya8cnhi2mzqqvin55qax3xy.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/d6vvg1rsqya8cnhi2mzqqvin55qax3xy.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /h3bd6psd2ijj5d9a8q46w2ldqsi135j0.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/h3bd6psd2ijj5d9a8q46w2ldqsi135j0.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550780 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.184, server: ci.guix.gnu.org, request: \"GET /8jvzhavq2zkychyjhs0ywnz2mpgp73qz.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/8jvzhavq2zkychyjhs0ywnz2mpgp73qz.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124868] write(4, "2021/08/21 22:17:43 [error] 124868#0: *16550782 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ci.guix.gnu.org, request: \"GET /3wl6j74i18wgcyqwrvfjjcyag4w3ip3d.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/3wl6j74i18wgcyqwrvfjjcyag4w3ip3d.narinfo\", host: \"127.0.0.1\"\n", 345) = 345
[pid 124868] write(4, "2021/08/21 22:17:43 [error] 124868#0: *16550784 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.161, server: ci.guix.gnu.org, request: \"GET /lf8lczjd7k55x2byql0rn5qpgibvl6kh.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/lf8lczjd7k55x2byql0rn5qpgibvl6kh.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:43 [error] 124867#0: *16550786 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.177, server: ci.guix.gnu.org, request: \"GET /fvddmn01gx8amxlh6kvv8j7pcm6dxddn.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/fvddmn01gx8amxlh6kvv8j7pcm6dxddn.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550788 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.183, server: ci.guix.gnu.org, request: \"GET /wcl4mlxpnn20q2f2mgc2hj7w21jf03iz.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/wcl4mlxpnn20q2f2mgc2hj7w21jf03iz.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550471 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.178, server: ci.guix.gnu.org, request: \"GET /hakhzv36jl50mas5am4pwk89p428wnrl.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/hakhzv36jl50mas5am4pwk89p428wnrl.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /64fsv1vjf1mxkv6wv2wh3vkffzrgalyq.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/64fsv1vjf1mxkv6wv2wh3vkffzrgalyq.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550796 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.159, server: ci.guix.gnu.org, request: \"GET /skbvrqdl64lph6k1ba5gay31spdbkm04.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/skbvrqdl64lph6k1ba5gay31spdbkm04.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550799 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.164, server: ci.guix.gnu.org, request: \"GET /b9lrrh5zj7wjhgs93m1aaz0chbp7awzk.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/b9lrrh5zj7wjhgs93m1aaz0chbp7awzk.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550802 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ci.guix.gnu.org, request: \"GET /3sqsphfshnkpxk91zi38hh11ifq144n3.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/3sqsphfshnkpxk91zi38hh11ifq144n3.narinfo\", host: \"127.0.0.1\"\n", 345) = 345
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550804 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.185, server: ci.guix.gnu.org, request: \"GET /f8ld4yvmsp5z1850gppy6sv0g59yamh8.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/f8ld4yvmsp5z1850gppy6sv0g59yamh8.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
--8<---------------cut here---------------end--------------->8---

I think this is caused by the following lines in the Berlin Nginx
configuration:
--8<---------------cut here---------------start------------->8---
;; Do not tolerate slowness of hydra.gnu.org when fetching
;; narinfos: better return 504 quickly than wait forever.
"proxy_connect_timeout 2s;"
"proxy_read_timeout 2s;"
"proxy_send_timeout 2s;"
--8<---------------cut here---------------end--------------->8---

The narinfos that are not yet in the cache are computed in the main
thread, slowing down the publish process. Under pressure, computing
those narinfos can exceed the 2s timeout, resulting in the failures
above.

Our best course of action here could be to merge the proposed patches,
extend the Nginx timeout to 10s, and see whether all the requests can
then be handled quickly enough.

Thanks,

Mathieu
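Concretely, the bump discussed above might look like this in the
configuration fragment just quoted (a sketch only: the 10s value is the
one floated in this thread and the comment is adapted accordingly):

--8<---------------cut here---------------start------------->8---
;; Give 'guix publish' some slack when it has to compute a narinfo
;; on the fly, but still return 504 eventually rather than wait forever.
"proxy_connect_timeout 10s;"
"proxy_read_timeout 10s;"
"proxy_send_timeout 10s;"
--8<---------------cut here---------------end--------------->8---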
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I think this is caused by the following lines in the Berlin Nginx
> configuration:
>
> ;; Do not tolerate slowness of hydra.gnu.org when fetching
> ;; narinfos: better return 504 quickly than wait forever.
> "proxy_connect_timeout 2s;"
> "proxy_read_timeout 2s;"
> "proxy_send_timeout 2s;"
>
> The narinfos that are not in cache yet are computed in the main thread,
> slowing down the publish process. Under pressure, computing those
> narinfo can exceed the 2s timeout resulting in the failures above.
>
> Our best course of action here could be to merge the proposed patches,
> extend the Nginx timeout to 10s and see if all the requests can be
> handled quickly enough.

Yeah, increasing these timeouts a little bit (to 5–7s? making it too
long could lead to annoyances) sounds reasonable.

Ludo’.
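Whether a given timeout is sufficient can be checked from the outside by
timing narinfo requests against the publish server. Here is a minimal
sketch using only the Python standard library (the helper name is made
up for illustration; any substitute-server URL and store hash can be
passed in):

```python
import time
import urllib.request


def narinfo_latency(base_url, store_hash, timeout=10):
    """Time a narinfo fetch; return elapsed seconds, or None on failure."""
    url = f"{base_url}/{store_hash}.narinfo"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            pass  # we only care how long the response took
    except OSError:
        return None  # covers DNS errors, timeouts, and HTTP error codes
    return time.monotonic() - start
```

A latency that regularly exceeds Nginx's ‘proxy_read_timeout’ for
not-yet-cached narinfos would reproduce the 504s shown above.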
Hi Mathieu,

An additional data point: while stracing ‘guix publish’ on berlin just
now, I saw read(2) calls taking several seconds (!) to complete:

--8<---------------cut here---------------start------------->8---
14:22:07 openat(AT_FDCWD, "/var/cache/guix/publish/hashes/wcd9bci97i2zi52ay4vddvr1jx9w50kz", O_RDONLY) = 33 <0.000032>
14:22:07 fstat(33, {st_mode=S_IFREG|0666, st_size=60, ...}) = 0 <0.000007>
14:22:07 lseek(33, 0, SEEK_CUR) = 0 <0.000006>
14:22:07 read(33, "/gnu/store/wcd9bci97i2zi52ay4vdd"..., 4096) = 60 <2.540400>
14:22:10 read(33, "", 4096) = 0 <0.000014>
14:22:10 close(33) = 0 <0.000008>
--8<---------------cut here---------------end--------------->8---

The load is indeed high right now:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ uptime
 14:24:18 up 265 days 21:09, 1 user, load average: 10.61, 8.42, 8.00
--8<---------------cut here---------------end--------------->8---

There are a few ongoing builds, which could partly explain the high
load (‘guix-daemon’ shows up at the top in iotop):

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo guix processes | grep -C3 LockHeld
SessionPID: 43868
ClientPID: 43515
ClientCommand: /gnu/store/6l9rix46ydxyldf74dvpgr60rf5ily0c-guile-3.0.7/bin/guile --no-auto-compile /gnu/store/7rqj61dy660k0nw97w7657zbxnbdvg1g-update-guix.gnu.org
LockHeld: /gnu/store/srim97jxlxgyiday0j3hbq2z0w03iy3k-guix-web-site.lock
ChildPID: 49648
ChildCommand:  : /gnu/store/p9di786py35m4kxszf466wcla6jvq96h-guix-1.3.0-3.50dfbbf/libexec/guix/guile \
  /gnu/store/p9di786py35m4kxszf466wcla6jvq96h-guix-1.3.0-3.50dfbbf/bin/guix substitute --query
--
SessionPID: 61372
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/xdvzddg84by15s1c17k543b39c03jp8i-isl-0.23.drv.lock
SessionPID: 61373
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/2wm7b9g07nxxhrnfcggw6ibdry65fpwb-xfce4-panel-4.16.3-builder.lock
SessionPID: 61374
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/1jmidz4i70w3lczn99wgn19dksb0pzhv-pkg-config-0.29.2-builder.lock
SessionPID: 61375
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/4j7s86xp7w5jgm84lr18szln1zpw36zd-static-binaries-tarball-0.drv.lock
--8<---------------cut here---------------end--------------->8---

The ‘mumi worker’ process also seems to be doing a lot of I/O and
using quite a bit of CPU (presumably allocating a lot). Its ‘perf’
profile looks like this:

--8<---------------cut here---------------start------------->8---
  2.62%  .mumi-real  libguile-3.0.so.1.1.1  [.] scm_ilength
  1.20%  .mumi-real  libguile-3.0.so.1.1.1  [.] scm_getc
  1.12%  .mumi-real  libgc.so.1.4.3         [.] GC_add_to_black_list_normal.part.4
  1.06%  .mumi-real  libgc.so.1.4.3         [.] GC_build_fl
  0.98%  .mumi-real  libgc.so.1.4.3         [.] GC_reclaim_clear
  0.97%  .mumi-real  libguile-3.0.so.1.1.1  [.] get_callee_vcode
  0.91%  .mumi-real  libgc.so.1.4.3         [.] GC_generic_malloc_many
  0.90%  .mumi-real  libguile-3.0.so.1.1.1  [.] peek_byte_or_eof
  0.78%  .mumi-real  [JIT] tid 62601        [.] 0x00007f886964804d
--8<---------------cut here---------------end--------------->8---

Food for thought…

Ludo’.
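For scale, a 60-byte read like the one in the trace above normally
completes in microseconds, so the 2.5s figure points at I/O pressure on
the machine rather than at the file itself. Such reads can be
spot-checked without strace; a rough sketch (the helper name is
hypothetical; pass it any cache file, e.g. an entry under
/var/cache/guix/publish/hashes):

```python
import time


def timed_read(path):
    """Read a file in full and return (contents, seconds elapsed)."""
    start = time.monotonic()
    with open(path, "rb") as port:
        data = port.read()
    return data, time.monotonic() - start
```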