Message ID | 20190524134238.22802-1-ludo@gnu.org |
---|---|
State | Accepted |
Headers | show |
Series | [bug#35880,1/7] lzlib: Add 'make-lzip-input-port/compressed'. | expand |
Context | Check | Description |
---|---|---|
cbaines/applying patch | fail | Apply failed |
As an Lzip enthusiast, I have some questions ;) I see you are using make-lzip-input-port/compressed in a subsequent patch, but this does not map how it's done for gzip et al., the latter being invoked via it's system command "gzip -c ...". Why did you decide to do it differently for lzip? Much of the code induced by make-lzip-input-port/compressed seems to repeat the lzread! / lzwrite business, maybe there is a way to factor some of it? > +(define (lzwrite! encoder source source-offset source-count > + target target-offset target-count) > + "Write up to SOURCE-COUNT bytes from SOURCE to ENCODER, and read up to > +TARGET-COUNT bytes into TARGET at TARGET-OFFSET. Return two values: the > +number of bytes read from SOURCE, and the number of bytes written to TARGET." > + (define read > + (if (< 0 (lz-compress-write-size encoder)) > + (match (lz-compress-write encoder source source-offset source-count) > + (0 (lz-compress-finish encoder) 0) > + (n n)) > + 0)) > + > + (let loop () > + (match (lz-compress-read encoder target target-offset target-count) > + (0 (loop)) > + (written (values read written))))) Why looping on 0? If there is no byte to read, wouldn't this loop indefinitely?
Hi! Pierre Neidhardt <mail@ambrevar.xyz> skribis: > As an Lzip enthusiast, I have some questions ;) > > I see you are using make-lzip-input-port/compressed in a subsequent > patch, but this does not map how it's done for gzip et al., the latter > being invoked via it's system command "gzip -c ...". Why did you decide > to do it differently for lzip? > > Much of the code induced by make-lzip-input-port/compressed seems to > repeat the lzread! / lzwrite business, maybe there is a way to factor > some of it? Actually, ‘make-lzip-input-port/compressed’ exists solely so we can have ‘compressed-port’ for lzip, which in turn allows us to write tests. It uses (guix lzlib) instead of invoking the ‘lzip’ command because we can. :-) Invoking commands is not as nice, because it’s more expensive, requires us to spawn an additional process when the input is not a file port (e.g., it’s a string port), and it’s forking is not possible in multi-threaded programs like ‘guix publish’. >> +(define (lzwrite! encoder source source-offset source-count >> + target target-offset target-count) >> + "Write up to SOURCE-COUNT bytes from SOURCE to ENCODER, and read up to >> +TARGET-COUNT bytes into TARGET at TARGET-OFFSET. Return two values: the >> +number of bytes read from SOURCE, and the number of bytes written to TARGET." >> + (define read >> + (if (< 0 (lz-compress-write-size encoder)) >> + (match (lz-compress-write encoder source source-offset source-count) >> + (0 (lz-compress-finish encoder) 0) >> + (n n)) >> + 0)) >> + >> + (let loop () >> + (match (lz-compress-read encoder target target-offset target-count) >> + (0 (loop)) >> + (written (values read written))))) > > Why looping on 0? If there is no byte to read, wouldn't this loop indefinitely? Hmm, good point. The idea is that ‘lzwrite!’ should return 0 only on end-of-file, but then the loop should include reading more from SOURCE. I’ll follow up on this one. Thanks! Ludo’.
Ludovic Courtès <ludo@gnu.org> skribis: > Pierre Neidhardt <mail@ambrevar.xyz> skribis: [...] >>> +(define (lzwrite! encoder source source-offset source-count >>> + target target-offset target-count) >>> + "Write up to SOURCE-COUNT bytes from SOURCE to ENCODER, and read up to >>> +TARGET-COUNT bytes into TARGET at TARGET-OFFSET. Return two values: the >>> +number of bytes read from SOURCE, and the number of bytes written to TARGET." >>> + (define read >>> + (if (< 0 (lz-compress-write-size encoder)) >>> + (match (lz-compress-write encoder source source-offset source-count) >>> + (0 (lz-compress-finish encoder) 0) >>> + (n n)) >>> + 0)) >>> + >>> + (let loop () >>> + (match (lz-compress-read encoder target target-offset target-count) >>> + (0 (loop)) >>> + (written (values read written))))) >> >> Why looping on 0? If there is no byte to read, wouldn't this loop indefinitely? > > Hmm, good point. The idea is that ‘lzwrite!’ should return 0 only on > end-of-file, but then the loop should include reading more from SOURCE. > I’ll follow up on this one. I noticed that ‘lz-compress-read’ is documented to return a “strictly positive integer”, so I’m changing it to this: --8<---------------cut here---------------start------------->8--- (define (lzwrite! encoder source source-offset source-count target target-offset target-count) "Write up to SOURCE-COUNT bytes from SOURCE to ENCODER, and read up to TARGET-COUNT bytes into TARGET at TARGET-OFFSET. Return two values: the number of bytes read from SOURCE, and the non-zero number of bytes written to TARGET." (define read (if (< 0 (lz-compress-write-size encoder)) (match (lz-compress-write encoder source source-offset source-count) (0 (lz-compress-finish encoder) 0) (n n)) 0)) (define written ;; Note: 'lz-compress-read' promises to return a non-zero integer. (lz-compress-read encoder target target-offset target-count)) (values read written)) --8<---------------cut here---------------end--------------->8--- Let me know what you think! Ludo’.
Ludovic Courtès <ludo@gnu.org> writes: > I noticed that ‘lz-compress-read’ is documented to return a “strictly > positive integer”, so I’m changing it to this: The docstring (that I wrote) says "non-negative positive integer." "positive" is a typo (sorry about that), it should read "non-negative integer" since the return value can be zero. In general, lzlib's "reads" and "writes" don't give any guarantee about the size of bytes that are actually processed. You need to loop over the calls until some condition is met, see the "finish(ed)" functions. Here in particular, it's not clear that lz-compress-read is going to read all the bytes in the encoder buffer. Maybe that's OK for this particular functions if we don't expect TARGET-COUNT to be reached. See http://www.nongnu.org/lzip/manual/lzlib_manual.html#Buffering That said, if the encoder buffer is not empty, I think lz-compress-read should always return something >0. Yup, lzlib is very low-level! :p
Hi! Pierre Neidhardt <mail@ambrevar.xyz> skribis: > That said, if the encoder buffer is not empty, I think lz-compress-read > should always return something >0. Yes, probably. The docstring for ‘lz-compress-read’ says: "Read up to COUNT bytes from the encoder stream, storing the results in LZFILE-BV. Return the number of uncompressed bytes written, a strictly positive integer." ^~~~~~~~~~~~~~~~~ However, the lzlib manual doesn’t say that for ‘LZ_compress_read’ (info "(lzlib) Compression functions"). But that’s OK: the ‘read!’ method in ‘make-lzip-input-port/compressed’ can just call ‘lzwrite!’ again with more data when that happens, so I’ve done that. And I pushed the whole thing! :-) I think it’d be good to let people play with it in their personal setups. Next up: multi-compression support in ‘guix publish’ (possibly?) so we can smoothly transition on our build farms. Thanks! Ludo’.
Ludovic Courtès <ludo@gnu.org> writes: >> That said, if the encoder buffer is not empty, I think lz-compress-read >> should always return something >0. > > Yes, probably. The docstring for ‘lz-compress-read’ says: Oops, I read the docstring of lz-DEcompress-read. My bad. > "Read up to COUNT bytes from the encoder stream, storing the results in LZFILE-BV. > Return the number of uncompressed bytes written, a strictly positive integer." > ^~~~~~~~~~~~~~~~~ Bigger oops! This comes from a copy-paste of the gzip docstring which I forgot to update properly (I did for the decompression functions, but not for the compression functions). The docstrings should be fixed. > But that’s OK: the ‘read!’ method in ‘make-lzip-input-port/compressed’ > can just call ‘lzwrite!’ again with more data when that happens, so I’ve > done that. This could work, but I've had some headaches on such assumptions before. Tests are very necessary here to validate those assumptions ;) The thing is that we are not using lzlib as it is meant to be used (i.e. with the finish* functions) because of the functional approach of the binary ports which don't really play well with the procedural approach of the C library. > And I pushed the whole thing! :-) Hurray! Can't wait to say lz-compressed archives coming to Guix! :)
Pierre Neidhardt <mail@ambrevar.xyz> skribis: > Ludovic Courtès <ludo@gnu.org> writes: > >>> That said, if the encoder buffer is not empty, I think lz-compress-read >>> should always return something >0. >> >> Yes, probably. The docstring for ‘lz-compress-read’ says: > > Oops, I read the docstring of lz-DEcompress-read. My bad. > >> "Read up to COUNT bytes from the encoder stream, storing the results in LZFILE-BV. >> Return the number of uncompressed bytes written, a strictly positive integer." >> ^~~~~~~~~~~~~~~~~ > > Bigger oops! This comes from a copy-paste of the gzip docstring which I > forgot to update properly (I did for the decompression functions, but > not for the compression functions). The docstrings should be fixed. I fixed this one in e13354a7ca5a0d5e28e02c4cfce6fecb1ab770e4. >> But that’s OK: the ‘read!’ method in ‘make-lzip-input-port/compressed’ >> can just call ‘lzwrite!’ again with more data when that happens, so I’ve >> done that. > > This could work, but I've had some headaches on such assumptions > before. Tests are very necessary here to validate those assumptions ;) Definitely! > The thing is that we are not using lzlib as it is meant to be used > (i.e. with the finish* functions) because of the functional approach of > the binary ports which don't really play well with the procedural > approach of the C library. I think we’re using it the way it’s meant to be used, roughly along the lines of the examples of its manual (info "(lzlib) Examples"). (I/O ports are not very “functional”.) >> And I pushed the whole thing! :-) > > Hurray! Can't wait to say lz-compressed archives coming to Guix! :) I’ve updated the ‘guix’ package so people can start using ‘guix publish -C lzip’ and fetch substitute from there. Thanks for making it possible! Ludo’.
Ludovic Courtès <ludo@gnu.org> writes: >> Bigger oops! This comes from a copy-paste of the gzip docstring which I >> forgot to update properly (I did for the decompression functions, but >> not for the compression functions). The docstrings should be fixed. > > I fixed this one in e13354a7ca5a0d5e28e02c4cfce6fecb1ab770e4. Thanks! >> The thing is that we are not using lzlib as it is meant to be used >> (i.e. with the finish* functions) because of the functional approach of >> the binary ports which don't really play well with the procedural >> approach of the C library. > > I think we’re using it the way it’s meant to be used, roughly along the > lines of the examples of its manual (info "(lzlib) Examples"). > > (I/O ports are not very “functional”.) In the sense that we define "read!" and "write!" functions which don't allow us to call the "finish" functions properly. So maybe we are following the doc too "roughly" :p This has multiple implications, e.g. it impedes support for multiple members, (which is OK as long as we don't accept archives produced by a third-party) and more importantly it lifts the guarantee that the library is going to work as per the documentation. >>> And I pushed the whole thing! :-) >> >> Hurray! Can't wait to say lz-compressed archives coming to Guix! :) > > I’ve updated the ‘guix’ package so people can start using > ‘guix publish -C lzip’ and fetch substitute from there. > > Thanks for making it possible! And thanks for doing most of the work! :D
Pierre Neidhardt <mail@ambrevar.xyz> skribis: > Ludovic Courtès <ludo@gnu.org> writes: [...] >>> The thing is that we are not using lzlib as it is meant to be used >>> (i.e. with the finish* functions) because of the functional approach of >>> the binary ports which don't really play well with the procedural >>> approach of the C library. >> >> I think we’re using it the way it’s meant to be used, roughly along the >> lines of the examples of its manual (info "(lzlib) Examples"). >> >> (I/O ports are not very “functional”.) > > In the sense that we define "read!" and "write!" functions which don't > allow us to call the "finish" functions properly. > > So maybe we are following the doc too "roughly" :p > > This has multiple implications, e.g. it impedes support for multiple > members, (which is OK as long as we don't accept archives produced by a > third-party) and more importantly it lifts the guarantee that the > library is going to work as per the documentation. I’m not sure I follow. I think ‘make-lzip-input-port/compressed’ corresponds to Example 2 in the manual (info "(lzlib) Examples"), ‘make-lzip-output-port’ corresponds to Example 1, and ‘make-lzip-input-port’ corresponds to Example 4 (with the exception that ‘lzread!’ doesn’t call ‘lz-decompress-finished?’, but it has other means to tell whether we’re done processing input.) What do you think is not done as documented? Thanks, Ludo’.
Ludovic Courtès <ludo@gnu.org> writes: > I’m not sure I follow. I think ‘make-lzip-input-port/compressed’ > corresponds to Example 2 in the manual (info "(lzlib) Examples"), > ‘make-lzip-output-port’ corresponds to Example 1, and > ‘make-lzip-input-port’ corresponds to Example 4 (with the exception that > ‘lzread!’ doesn’t call ‘lz-decompress-finished?’, but it has other means > to tell whether we’re done processing input.) Example 4 is: 1) LZ_decompress_open 2) go to step 5 if LZ_decompress_write_size returns 0 3) LZ_decompress_write 4) if no more data to write, call LZ_decompress_finish 5) LZ_decompress_read 5a) optionally, if LZ_decompress_member_finished returns 1, read final values for member with LZ_decompress_data_crc, etc. 6) go back to step 2 until LZ_decompress_finished returns 1 7) LZ_decompress_close In `lzread!', we don't call lz-decompress-finished? nor do we loop on lz-decompress-finished. This only works for decompression of single-member archive, but the documentation does not say that. --8<---------------cut here---------------start------------->8--- (match (get-bytevector-n port (lz-decompress-write-size decoder)) ((? eof-object? eof) eof) (bv (lz-decompress-write decoder bv))) --8<---------------cut here---------------end--------------->8--- In the above if lz-decompress-write-size returns 0, we won't be reading anything (infinite loop?). While I understand this should not happen in practice, the documentation of the library does not give such guarantees. Antonio told me that explicitly. --8<---------------cut here---------------start------------->8--- (match (lz-decompress-read decoder bv start (- count read)) (0 (if (eof-object? (feed-decoder! decoder)) read (loop read start))) --8<---------------cut here---------------end--------------->8--- I'm not sure I understand the above: if we read nothing, then we try again? This might loop forever. What do you think?
Hello, Pierre Neidhardt <mail@ambrevar.xyz> skribis: > Ludovic Courtès <ludo@gnu.org> writes: > >> I’m not sure I follow. I think ‘make-lzip-input-port/compressed’ >> corresponds to Example 2 in the manual (info "(lzlib) Examples"), >> ‘make-lzip-output-port’ corresponds to Example 1, and >> ‘make-lzip-input-port’ corresponds to Example 4 (with the exception that >> ‘lzread!’ doesn’t call ‘lz-decompress-finished?’, but it has other means >> to tell whether we’re done processing input.) > > Example 4 is: > > 1) LZ_decompress_open > 2) go to step 5 if LZ_decompress_write_size returns 0 > 3) LZ_decompress_write > 4) if no more data to write, call LZ_decompress_finish > 5) LZ_decompress_read > 5a) optionally, if LZ_decompress_member_finished returns 1, read > final values for member with LZ_decompress_data_crc, etc. > 6) go back to step 2 until LZ_decompress_finished returns 1 > 7) LZ_decompress_close > > In `lzread!', we don't call lz-decompress-finished? nor do we loop on > lz-decompress-finished. Indeed, we’re missing a call to ‘lz-decompress-finish’, so lzlib could in theory think there’s still data coming, and so fail to produce more output, possibly leading to an infinite loop. I think the ‘lzread!’ loop should look like this (the tests still pass with this): --8<---------------cut here---------------start------------->8--- (let loop ((read 0) (start start)) (cond ((< read count) (match (lz-decompress-read decoder bv start (- count read)) (0 (cond ((lz-decompress-finished? decoder) read) ((eof-object? (feed-decoder! decoder)) (lz-decompress-finish decoder) (loop read start)) (else ;read again (loop read start)))) (n (loop (+ read n) (+ start n))))) (else read))) --8<---------------cut here---------------end--------------->8--- That way, I believe all cases are correctly handled. WDYT? > This only works for decompression of single-member archive, but the > documentation does not say that. > > (match (get-bytevector-n port (lz-decompress-write-size decoder)) > ((? eof-object? eof) eof) > (bv (lz-decompress-write decoder bv))) > > > In the above if lz-decompress-write-size returns 0, we won't be reading > anything (infinite loop?). We’re calling ‘feed-decoder!’ iff ‘lz-decompress-read’ returned 0; when that happens ‘lz-decompress-write-size’ must return a strictly positive number. Otherwise there’s an inconsistency. > (match (lz-decompress-read decoder bv start (- count read)) > (0 (if (eof-object? (feed-decoder! decoder)) > read > (loop read start))) > > I'm not sure I understand the above: if we read nothing, then we try > again? No: if we read *something*, we try again; if we read nothing, we return. Thanks for your careful review, much appreciated! Ludo’.
Ludovic Courtès <ludo@gnu.org> writes: > I think the ‘lzread!’ loop should look like this (the tests still pass > with this): > > --8<---------------cut here---------------start------------->8--- > (let loop ((read 0) > (start start)) > (cond ((< read count) > (match (lz-decompress-read decoder bv start (- count read)) > (0 (cond ((lz-decompress-finished? decoder) > read) > ((eof-object? (feed-decoder! decoder)) > (lz-decompress-finish decoder) > (loop read start)) > (else ;read again > (loop read start)))) > (n (loop (+ read n) (+ start n))))) > (else > read))) > --8<---------------cut here---------------end--------------->8--- Looks good to me! >> (match (lz-decompress-read decoder bv start (- count read)) >> (0 (if (eof-object? (feed-decoder! decoder)) >> read >> (loop read start))) >> >> I'm not sure I understand the above: if we read nothing, then we try >> again? > > No: if we read *something*, we try again; if we read nothing, we return. If we read nothing _and_ it is not an EOF (it can be an empty vector), then we loop indefinitely, no? > Thanks for your careful review, much appreciated! You are welcome, thanks for your invaluable work!
Hi! Pierre Neidhardt <mail@ambrevar.xyz> skribis: > Ludovic Courtès <ludo@gnu.org> writes: > >> I think the ‘lzread!’ loop should look like this (the tests still pass >> with this): >> >> --8<---------------cut here---------------start------------->8--- >> (let loop ((read 0) >> (start start)) >> (cond ((< read count) >> (match (lz-decompress-read decoder bv start (- count read)) >> (0 (cond ((lz-decompress-finished? decoder) >> read) >> ((eof-object? (feed-decoder! decoder)) >> (lz-decompress-finish decoder) >> (loop read start)) >> (else ;read again >> (loop read start)))) >> (n (loop (+ read n) (+ start n))))) >> (else >> read))) >> --8<---------------cut here---------------end--------------->8--- > > Looks good to me! OK, committed! >>> (match (lz-decompress-read decoder bv start (- count read)) >>> (0 (if (eof-object? (feed-decoder! decoder)) >>> read >>> (loop read start))) >>> >>> I'm not sure I understand the above: if we read nothing, then we try >>> again? >> >> No: if we read *something*, we try again; if we read nothing, we return. > > If we read nothing _and_ it is not an EOF (it can be an empty vector), > then we loop indefinitely, no? ‘feed-decoder!’ cannot return an empty bytevector because ‘lz-decompress-write-size’ necessarily returns a strictly positive integer at this point. (Imperative programming is hard! :-)) Thanks, Ludo’.
Ludovic Courtès <ludo@gnu.org> writes: > ‘feed-decoder!’ cannot return an empty bytevector because > ‘lz-decompress-write-size’ necessarily returns a strictly positive > integer at this point. I'm not sure that's true: if the buffer is full and the next lz-decompress-read does not read anything, then the buffer will still be full and lz-decompress-write-size will return 0. The specs don't guarantee that lz-decompress-read will always read something. But that's the only assumption we are making I believe, and it's fair :)
Pierre Neidhardt <mail@ambrevar.xyz> skribis: > Ludovic Courtès <ludo@gnu.org> writes: > >> ‘feed-decoder!’ cannot return an empty bytevector because >> ‘lz-decompress-write-size’ necessarily returns a strictly positive >> integer at this point. > > I'm not sure that's true: if the buffer is full and the next > lz-decompress-read does not read anything, then the buffer will still be > full and lz-decompress-write-size will return 0. Hmm I don’t think that’s a reasonable scenario, otherwise it would mean that you’re in a deadlock anyway (decoder buffer is already full yet it does not contain enough data to actually decode anything.) Dunno, I’m rather confident here but we’ll see if that causes any troubles. Thanks, Ludo’.
diff --git a/guix/lzlib.scm b/guix/lzlib.scm index a6dac46049..48927c6262 100644 --- a/guix/lzlib.scm +++ b/guix/lzlib.scm @@ -1,5 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright © 2019 Pierre Neidhardt <mail@ambrevar.xyz> +;;; Copyright © 2019 Ludovic Courtès <ludo@gnu.org> ;;; ;;; This file is part of GNU Guix. ;;; @@ -23,9 +24,11 @@ #:use-module (ice-9 match) #:use-module (system foreign) #:use-module (guix config) + #:use-module (srfi srfi-11) #:export (lzlib-available? make-lzip-input-port make-lzip-output-port + make-lzip-input-port/compressed call-with-lzip-input-port call-with-lzip-output-port %default-member-length-limit @@ -515,6 +518,23 @@ the end-of-stream has been reached." (loop rd))) read)) +(define (lzwrite! encoder source source-offset source-count + target target-offset target-count) + "Write up to SOURCE-COUNT bytes from SOURCE to ENCODER, and read up to +TARGET-COUNT bytes into TARGET at TARGET-OFFSET. Return two values: the +number of bytes read from SOURCE, and the number of bytes written to TARGET." + (define read + (if (< 0 (lz-compress-write-size encoder)) + (match (lz-compress-write encoder source source-offset source-count) + (0 (lz-compress-finish encoder) 0) + (n n)) + 0)) + + (let loop () + (match (lz-compress-read encoder target target-offset target-count) + (0 (loop)) + (written (values read written))))) + (define* (lzwrite encoder bv lz-port #:optional (start 0) (count (bytevector-length bv))) "Write up to COUNT bytes from BV at offset START into LZ-PORT. Return @@ -597,6 +617,48 @@ port is closed." (lz-compress-close encoder) (close-port port)))) +(define* (make-lzip-input-port/compressed port + #:key + (level %default-compression-level)) + "Return an input port that compresses data read from PORT, with the given LEVEL. +PORT is automatically closed when the resulting port is closed." + (define encoder (apply lz-compress-open + (car (assoc-ref %compression-levels level)))) + + (define input-buffer (make-bytevector 8192)) + (define input-len 0) + (define input-offset 0) + + (define input-eof? #f) + + (define (read! bv start count) + (cond + (input-eof? + (lz-compress-read encoder bv start count)) + ((= input-offset input-len) + (match (get-bytevector-n! port input-buffer 0 + (bytevector-length input-buffer)) + ((? eof-object?) + (set! input-eof? #t) + (lz-compress-finish encoder)) + (count + (set! input-offset 0) + (set! input-len count))) + (read! bv start count)) + (else + (let-values (((read written) + (lzwrite! encoder + input-buffer input-offset + (- input-len input-offset) + bv start count))) + (set! input-offset (+ input-offset read)) + written)))) + + (make-custom-binary-input-port "lzip-input/compressed" + read! #f #f + (lambda () + (close-port port)))) + (define* (call-with-lzip-input-port port proc) "Call PROC with a port that wraps PORT and decompresses data read from it. PORT is closed upon completion." diff --git a/guix/tests.scm b/guix/tests.scm index 35ebf8464d..66d60e964e 100644 --- a/guix/tests.scm +++ b/guix/tests.scm @@ -33,6 +33,7 @@ #:use-module (web uri) #:export (open-connection-for-tests with-external-store + %seed random-text random-bytevector file=? diff --git a/tests/lzlib.scm b/tests/lzlib.scm index cf53a9417d..543622bb45 100644 --- a/tests/lzlib.scm +++ b/tests/lzlib.scm @@ -108,4 +108,14 @@ (test-assert* "Bytevector of size relative to Lzip internal buffers (1MiB+1)" (compress-and-decompress (random-bytevector (1+ (* 1024 1024))))) +(test-assert "make-lzip-input-port/compressed" + (let* ((len (pk 'len (+ 10 (random 4000 %seed)))) + (data (random-bytevector len)) + (compressed (make-lzip-input-port/compressed + (open-bytevector-input-port data))) + (result (call-with-lzip-input-port compressed + get-bytevector-all))) + (pk (bytevector-length result) (bytevector-length data)) + (bytevector=? result data))) + (test-end)