[bug#34223] Fixing timestamps in archives.

Message ID e13090c1-dabb-da63-cc62-3975f2697527@yahoo.de
State Accepted

Checks

Context: cbaines/applying patch
Check: success
Description: Successfully applied

Commit Message

Tim Gesthuizen Jan. 27, 2019, 5:58 p.m. UTC
Hi Ludo,

As discussed before, I have looked into the problem of timestamps in
zip files.
I looked at the way this is solved for jar files in ant-build-system
and thought that it could be done more elegantly.
I therefore wrote a simple frontend for libarchive in C that repacks
archives, sets their timestamps to zero, and disables compression, as
ant-build-system does.
Creative as I am, I called the program repack.
Attached you will find a git repository with the history of the repack
program.
The attached patches add repack to Guix and use it for pwsafe and the
ant-build-system.

This is a work-in-progress version: if you like these changes, I will
work on the missing details so that we can merge them; otherwise, now
would be a good point to stop development.
I can still fall back to the approach that ant-build-system currently
uses for pwsafe.

It would be nice if you could find the time to review everything and
tell me what you think about the patches.
The changes trigger a lot of rebuilds so it will take some time to build
dependencies of the required programs.

The repack tarball contains a bare git repository with the program.
Because I have not yet found a place to host the repository, you need
to unpack it somewhere, create an archive of the source code, and change
the Guix definition of repack to use that source.

Tim.

Comments

Ludovic Courtès Feb. 16, 2019, 10:35 p.m. UTC | #1
Hi Tim,

Sorry for the delay!

Tim Gesthuizen <tim.gesthuizen@yahoo.de> wrote:

> as discussed before I have looked into the problems of timestamps in the
> zip files.
> I looked at the way this is solved in ant-build-system with jar files
> and thought that this could be done in a more elegant way.
> Because of this I wrote a simple frontend for LibArchive in C that
> repacks archives and sets their timestamps to zero and disables
> compression as it is done in the ant-build-system.
> Creative as I am the program is called repack.
> You find a git repository attached with the history of the repack program.
> The attached patches add repack to Guix and use it for pwsafe and the
> ant-build-system.

Nice work!  It’s great that libarchive doesn’t need to actually extract
the zip file to operate on it.

Overall I think the approach of factorizing archive-timestamp-resetting
in one place and using it everywhere (‘ant-build-system’ and all) is the
right thing to do.

However, I’m not sure whether we should introduce a new program for this
purpose.  I believe ‘strip-nondeterminism’¹ (in Perl) by fellow
Reproducible Builds hackers also addresses this problem, so it may be
wiser to use it.

But really, since (guix build utils) already implements a significant
subset of ‘strip-nondeterminism’, it would be even better if we could
avoid shelling out to a C or Perl program.

I played a bit with this idea and, as an example, the attached file
allows you to traverse the list of entries in a zip file (it uses
‘guile-bytestructures’).  Specifically, you can get the list of file
names in a zip file by running:

  (call-with-input-file "something.zip"
    (lambda (port)
      (fold-entries cons '() port)))

Resetting timestamps should be just as simple.

How about taking this route?

Thanks,
Ludo’.

¹ https://salsa.debian.org/reproducible-builds/strip-nondeterminism

(define-module (guix zip)
  #:use-module (rnrs bytevectors)
  #:use-module (rnrs io ports)
  #:use-module (bytestructures guile)
  #:use-module (ice-9 match)
  #:export (fold-entries))

(define <file-header>
  ;; File header, see
  ;; <https://en.wikipedia.org/wiki/Zip_(file_format)#File_headers>.
  (bs:struct #t                                   ;packed
             `((signature ,uint32le)
               (version-needed ,uint16le)
               (flags ,uint16le)
               (compression ,uint16le)
               (modification-time ,uint16le)
               (modification-date ,uint16le)
               (crc32 ,uint32le)
               (compressed-size ,uint32le)
               (uncompressed-size ,uint32le)
               (file-name-length ,uint16le)
               (extra-field-length ,uint16le))))

(define-bytestructure-accessors <file-header>
  file-header-unwrap file-header-ref set-file-header!)

(define (fold-entries proc seed port)
  "Fold PROC over all the entries in the zip file at PORT."
  (let loop ((result seed))
    (match (get-bytevector-n port (bytestructure-descriptor-size
                                   <file-header>))
      ((? bytevector? bv)
       (match (file-header-ref bv signature)
         (#x04034b50                              ;local file header
          (let* ((len  (file-header-ref bv file-name-length))
                 (name (utf8->string (get-bytevector-n port len))))
            (set-port-position! port
                                (+ (file-header-ref bv extra-field-length)
                                   (file-header-ref bv compressed-size)
                                   (port-position port)))
            (loop (proc name result))))
         (#x02014b50                               ;central directory record
          result)
         (#x06054b50                          ;end of central directory record
          result)))
      ((? eof-object?)
       result))))
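To illustrate what resetting timestamps in place could look like, here is a minimal sketch. It is written in Python rather than Guile, purely for brevity, and is not part of the attachment above. It walks the same records as ‘fold-entries’ and overwrites the time and date fields in both the local file headers and the central directory records. The name reset_timestamps is hypothetical, the earliest valid DOS date (1980-01-01) stands in for "zero", and the sketch assumes an archive without ZIP64 records, data descriptors, or leading data.

```python
import io
import struct
import zipfile

LOCAL_SIG = 0x04034B50    # local file header
CENTRAL_SIG = 0x02014B50  # central directory record
DOS_TIME = 0              # 00:00:00
DOS_DATE = (0 << 9) | (1 << 5) | 1  # 1980-01-01, the earliest DOS date


def reset_timestamps(data: bytes) -> bytes:
    """Return a copy of the zip archive DATA with every timestamp reset.

    Assumption: no ZIP64, no data descriptors, no data before the first
    local file header (the same simplifications as the traversal above).
    """
    buf = bytearray(data)
    pos = 0
    while pos + 4 <= len(buf):
        (sig,) = struct.unpack_from("<I", buf, pos)
        if sig == LOCAL_SIG:
            # Time and date live at offsets 10 and 12 of the 30-byte header.
            struct.pack_into("<HH", buf, pos + 10, DOS_TIME, DOS_DATE)
            (csize,) = struct.unpack_from("<I", buf, pos + 18)
            name_len, extra_len = struct.unpack_from("<HH", buf, pos + 26)
            pos += 30 + name_len + extra_len + csize
        elif sig == CENTRAL_SIG:
            # Time and date live at offsets 12 and 14 of the 46-byte record.
            struct.pack_into("<HH", buf, pos + 12, DOS_TIME, DOS_DATE)
            name_len, extra_len, comment_len = struct.unpack_from(
                "<HHH", buf, pos + 28)
            pos += 46 + name_len + extra_len + comment_len
        else:
            break  # end of central directory record, or unexpected data
    return bytes(buf)
```

Because only fixed-width fields are overwritten, the archive keeps its size and all offsets stay valid. Member data is left as it is, so compressed members stay compressed.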
Julien Lepiller Feb. 17, 2019, 7:42 a.m. UTC | #2
On 16 February 2019 23:35:50 GMT+01:00, "Ludovic Courtès" <ludo@gnu.org> wrote:
>Hi Tim,
>
>Sorry for the delay!
>
>Tim Gesthuizen <tim.gesthuizen@yahoo.de> wrote:
>
>> as discussed before I have looked into the problems of timestamps in the
>> zip files.
>> I looked at the way this is solved in ant-build-system with jar files
>> and thought that this could be done in a more elegant way.
>> Because of this I wrote a simple frontend for LibArchive in C that
>> repacks archives and sets their timestamps to zero and disables
>> compression as it is done in the ant-build-system.
>> Creative as I am the program is called repack.
>> You find a git repository attached with the history of the repack program.
>> The attached patches add repack to Guix and use it for pwsafe and the
>> ant-build-system.
>
>Nice work!  It’s great that libarchive doesn’t need to actually extract
>the zip file to operate on it.
>
>Overall I think the approach of factorizing archive-timestamp-resetting
>in one place and using it everywhere (‘ant-build-system’ and all) is the
>right thing to do.
>
>However, I’m not sure whether we should introduce a new program for this
>purpose.  I believe ‘strip-nondeterminism’¹ (in Perl) by fellow
>Reproducible Builds hackers also addresses this problem, so it may be
>wiser to use it.
>
>But really, since (guix build utils) already implements a significant
>subset of ‘strip-nondeterminism’, it would be even better if we could
>avoid shelling out to a C or Perl program.
>
>I played a bit with this idea and, as an example, the attached file
>allows you to traverse the list of entries in a zip file (it uses
>‘guile-bytestructures’).  Specifically, you can get the list of file
>names in a zip file by running:
>
>  (call-with-input-file "something.zip"
>    (lambda (port)
>      (fold-entries cons '() port)))
>
>Resetting timestamps should be just as simple.
>
>How about taking this route?
>
>Thanks,
>Ludo’.
>
>¹ https://salsa.debian.org/reproducible-builds/strip-nondeterminism

One of the reasons we extract jar files is to remove compression: compressed content might contain store references that would then be hidden, so grafting, for instance, wouldn't work.
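This effect can be checked with a small experiment. The sketch below, in Python for illustration only, writes the same member once stored and once deflated, then searches the raw archive bytes for a store path. The store path and the member name are made up for the example.

```python
import io
import zipfile

# Hypothetical store path, used only for this illustration.
reference = b"/gnu/store/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-example-1.0/bin/tool"
payload = (b"exec " + reference + b' "$@"\n') * 20


def make_zip(compression: int) -> bytes:
    """Return a zip archive holding PAYLOAD, written with COMPRESSION."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as archive:
        archive.writestr("bin/launcher", payload)
    return buf.getvalue()


stored = make_zip(zipfile.ZIP_STORED)
deflated = make_zip(zipfile.ZIP_DEFLATED)

# A byte-level scan only sees the reference when the member is stored.
reference_visible_when_stored = reference in stored
reference_visible_when_deflated = reference in deflated
print(reference_visible_when_stored, reference_visible_when_deflated)
```

A tool that scans or rewrites store paths at the byte level can therefore only act on the stored variant.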
Tim Gesthuizen Feb. 18, 2019, 8:07 p.m. UTC | #3
Hi Ludo,

> Sorry for the delay!

No problem! I have very little time anyway.

> Nice work!  It’s great that libarchive doesn’t need to actually extract
> the zip file to operate on it.
>
> Overall I think the approach of factorizing archive-timestamp-resetting
> in one place and using it everywhere (‘ant-build-system’ and all) is the
> right thing to do.
>
> However, I’m not sure whether we should introduce a new program for this
> purpose.  I believe ‘strip-nondeterminism’¹ (in Perl) by fellow
> Reproducible Builds hackers also addresses this problem, so it may be
> wiser to use it.

I also think so. If there is already another program that does the job
we should probably use it.

> But really, since (guix build utils) already implements a significant
> subset of ‘strip-nondeterminism’, it would be even better if we could
> avoid shelling out to a C or Perl program.
>
> I played a bit with this idea and, as an example, the attached file
> allows you to traverse the list of entries in a zip file (it uses
> ‘guile-bytestructures’).  Specifically, you can get the list of file
> names in a zip file by running:
>
>   (call-with-input-file "something.zip"
>     (lambda (port)
>       (fold-entries cons '() port)))
>
> Resetting timestamps should be just as simple.
>
> How about taking this route?

I also thought about taking this route.
There are some problems with it though:

- As Julien pointed out, the archive contents need to be uncompressed.
  This makes the problem much more complex and keeps us from writing
  a partial ZIP parser that replaces the timestamps in place.
- While it would be quite elegant to implement the parser in Scheme,
  it would be redundant. After all, we are developing a package
  manager, so we should use it.
  This approach would be more attractive if there were a Guile
  library for this.
  If we go with Scheme, the best solution would be to create a proper
  library for handling archives.
- Maintaining a ZIP parser in Guix is a burden we should not take on.
- We would need to take care of a lot of details (ZIP64, and probably
  more exotic extensions).

I would be fine with writing our own parser in Scheme, but I would like
to point out that everywhere else in Guix we use external tools for
handling archives (AFAIK).

I am not quite sure which option would be best, so I am open to other
opinions on this.
Maybe you could restate your position, taking the compression problem
into consideration.

Tim.

Ludovic Courtès Feb. 18, 2019, 10:24 p.m. UTC | #4
Hello,

Tim Gesthuizen <tim.gesthuizen@yahoo.de> wrote:

>> I played a bit with this idea and, as an example, the attached file
>> allows you to traverse the list of entries in a zip file (it uses
>> ‘guile-bytestructures’).  Specifically, you can get the list of file
>> names in a zip file by running:
>>
>>   (call-with-input-file "something.zip"
>>     (lambda (port)
>>       (fold-entries cons '() port)))
>>
>> Resetting timestamps should be just as simple.
>>
>> How about taking this route?
>
> I also thought about taking this route.
> There are some problems with it though:
>
> - As Julien pointed out, the archive contents need to be uncompressed.
>   This makes the problem much more complex and keeps us from writing
>   a partial ZIP parser that replaces the timestamps in place.

True, I had overlooked that.  In that case, we should definitely unpack
and repack using the ‘zip’ package (I wasn’t suggesting to write a
complete ‘zip’ implementation; I do think it would be valuable in the
long term, but it’s a project for another time, no question here.)

In that case though, it probably doesn’t buy us much to use libarchive
in a separate C program, WDYT?  Should we just stick to the current
approach that invokes ‘unzip’ and ‘zip’?

Thanks,
Ludo’.
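For reference, the unpack-and-repack step discussed in this thread can be sketched as follows. This is a Python illustration only, not the code used by ‘ant-build-system’ and not the repack program. The function name repack is reused here merely as a label. The fixed date of 1980-01-01 (the earliest a zip entry can hold) and the sorting of members by name are assumptions of this sketch.

```python
import io
import zipfile

FIXED_DATE = (1980, 1, 1, 0, 0, 0)  # earliest timestamp a zip entry can hold


def repack(data: bytes) -> bytes:
    """Rewrite the zip archive DATA with stored members and fixed timestamps."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
         zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        # Assumption of this sketch: sort members by name so that the
        # output does not depend on the order of the input archive.
        for name in sorted(src.namelist()):
            info = zipfile.ZipInfo(name, date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_STORED
            # Carry over permission bits and similar attributes.
            info.external_attr = src.getinfo(name).external_attr
            dst.writestr(info, src.read(name))
    return out.getvalue()
```

Two archives with the same members but different timestamps, member order, or compression then repack to the same bytes, and member contents end up uncompressed in the output.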

Patch

From 3df6e33f52ac2906ec98cc9b74ef93d9cbb22108 Mon Sep 17 00:00:00 2001
From: Tim Gesthuizen <tim.gesthuizen@yahoo.de>
Date: Sat, 19 Jan 2019 17:13:45 +0100
Subject: [PATCH 11/11] gnu: ant: Use repack for repacking archives

* gnu/packages/java.scm (ant):
[native-inputs]: Use repack in favour of zip and unzip.
---
 gnu/packages/java.scm | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/gnu/packages/java.scm b/gnu/packages/java.scm
index ad61bf294..0002c1f83 100644
--- a/gnu/packages/java.scm
+++ b/gnu/packages/java.scm
@@ -1887,8 +1887,7 @@  new Date();"))
                 "1k28mka0m3isy9yr8gz84kz1f3f879rwaxrd44vdn9xbfwvwk86n"))))
     (native-inputs
      `(("jdk" ,icedtea-7 "jdk")
-       ("zip" ,zip)
-       ("unzip" ,unzip)))))
+       ("repack" ,repack)))))
 
 (define-public ant-apache-bcel
   (package
-- 
2.20.1