[bug#33899,0/5] Distributing substitutes over IPFS

Message ID: 20181228231205.8068-1-ludo@gnu.org

Message
Ludovic Courtès Dec. 28, 2018, 11:12 p.m. UTC
Hello Guix!

Here is a first draft adding support for distributing and retrieving
substitutes over IPFS.  It builds on discussions at the Reproducible
Builds Summit with Héctor Sanjuan of IPFS, lewo of Nix, and Pierre
Neidhardt, and on the work Florian Paul Schmidt posted on guix-devel
last month.

The IPFS daemon exposes an HTTP API, and the (guix ipfs) module
provides bindings to a subset of that API.  This module also implements
a custom “directory” format to store directory trees in IPFS (IPFS
already provides “UnixFS” and “tar”, but they store too many or too few
file attributes).
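
Each API endpoint is an HTTP request against the daemon (listening on
localhost port 5001 by default) that returns JSON.  Here is a minimal
sketch of the kind of wrapper this involves, assuming guile-json is
available; the names are illustrative, not necessarily those of
(guix ipfs):

(use-modules (web client)          ;http-post
             (rnrs bytevectors)    ;utf8->string
             (json))               ;guile-json: json-string->scm

(define %ipfs-api-base "http://127.0.0.1:5001/api/v0")

(define (ipfs-version)
  "Query the daemon's /version endpoint and return the parsed JSON."
  (call-with-values
      (lambda ()
        (http-post (string-append %ipfs-api-base "/version")))
    (lambda (response body)
      (json-string->scm (if (bytevector? body)
                            (utf8->string body)
                            body)))))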

‘guix publish’ and ‘guix substitute’ use (guix ipfs) to
store and retrieve store items.  Complete directory trees are stored in
IPFS “as is”, rather than as compressed archives (nars).  This allows for
deduplication in IPFS.  ‘guix publish’ adds a new “IPFS” field in
narinfos and ‘guix substitute’ can then query those objects over IPFS.
So the idea is that you still get narinfos over HTTP(S), and then you
have the option of downloading substitutes over IPFS.
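
For illustration, a narinfo served by ‘guix publish’ would then look
along these lines (all values elided; the point is just where the new
field sits next to the existing nar URL):

StorePath: /gnu/store/…-hello-2.10
URL: nar/gzip/…-hello-2.10
Compression: gzip
NarHash: sha256:…
NarSize: …
References: …
IPFS: …

A client that cannot reach an IPFS daemon simply keeps using the URL
field, so the new field is purely additive.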

I’ve pushed these patches in ‘wip-ipfs-substitutes’.  This is rough on the
edges and probably buggy, but the adventurous among us might want to give
it a spin.  :-)

Thanks,
Ludo’.

Ludovic Courtès (5):
  Add (guix json).
  tests: 'file=?' now recurses on directories.
  Add (guix ipfs).
  publish: Add IPFS support.
  DRAFT substitute: Add IPFS support.

 Makefile.am                 |   3 +
 doc/guix.texi               |  33 +++++
 guix/ipfs.scm               | 250 ++++++++++++++++++++++++++++++++++++
 guix/json.scm               |  63 +++++++++
 guix/scripts/publish.scm    |  67 +++++++---
 guix/scripts/substitute.scm | 106 ++++++++-------
 guix/swh.scm                |  35 +----
 guix/tests.scm              |  26 +++-
 tests/ipfs.scm              |  55 ++++++++
 9 files changed, 535 insertions(+), 103 deletions(-)
 create mode 100644 guix/ipfs.scm
 create mode 100644 guix/json.scm
 create mode 100644 tests/ipfs.scm

Comments

Alex Griffin May 13, 2019, 6:51 p.m. UTC | #1
Do I understand correctly that the only reason you don't just store nar
files is deduplication?  Reading [this page][1] suggests to me that you
might be overthinking it.  IPFS already uses a content-driven chunking
algorithm that might provide good enough deduplication on its own.  It
also looks like you can plug in your own chunker, so a future
improvement could be a custom chunker that splits nar files at the file
boundaries within them.

[1]: https://github.com/ipfs/archives
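
(For reference, go-ipfs already exposes the chunker choice when adding
data: e.g. ‘ipfs add --chunker=rabin-16384-32768-65536 foo.nar’ selects
a Rabin chunker with the given minimum, average, and maximum block
sizes.  A nar-aware chunker would presumably be one more variant of
that option.)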
Pierre Neidhardt July 1, 2019, 9:36 p.m. UTC | #2
Hi!

(Re-sending to debbugs, sorry for the double email :p)

A little update/recap after many months! :)

I talked with Héctor and some other people from IPFS, and I reviewed
Ludo's patch, so I now have a somewhat better understanding of the
current state of affairs.

- We could store the substitutes as tarballs on IPFS, but this has
  some possible downsides:

  - We would need to use IPFS' tar chunker to deduplicate the content of
    the tarball.  But the tar chunker is not well maintained currently,
    and it's not clear whether it's reproducible at the moment, so it
    would need some more work.

  - Tarballs might incur some performance cost.  Nix attempted
    something similar in the past, and it may have caused a significant
    performance penalty, although this remains to be confirmed.
    Lewo?

- Ludo's patch stores all files on IPFS individually.  This way we don't
  need to touch the tar chunker, so it's less work :)
  This raises some other issues however:

  - Extra metadata: IPFS stores files as UnixFSv1 objects, which do not
    include the executable bit.

    - Right now we store an s-exp manifest with a list of files and a
      list of executable bits.  But maybe we don't have to roll our own.

    - UnixFSv1 has some metadata field, but Héctor and Alex did not
      recommend using it (not sure why though).

    - We could use UnixFSv2 but it's not released yet and it's unclear when
      it's going to be released.  So we can't really count on it right now.

    - IPLD: As Héctor suggested in the previous email, we could leverage
      IPLD and generate a JSON object that references the files with
      their paths together with an "executable?" property (see the
      sketch after this list).  A problem would arise if this IPLD
      object grew beyond the 2 MB block-size limit, because then we
      would have to shard it (something that UnixFS would do
      automatically for us).

  - Flat storage vs. tree storage: Right now we are storing the files
    separately, but this has some shortcomings: we need multiple "get"
    requests instead of just one, and IPFS does not "know" that those
    files are related.  (We lose the web view of the tree, etc.)
    Storing them as a tree could be better.  I don't know whether that
    would work with the "IPLD manifest" suggested above.  Héctor?

  - Pinning: Pinning all files separately incurs an overhead.  It's
    enough to pin just the IPLD object, since pinning propagates
    recursively.  When adding a tree, this is no problem, since pinning
    is only done once.

  - IPFS endpoint calls: Instead of adding each file individually, it's
    possible to add them all in one go.  Can we add all files at once
    while using flat storage, i.e., without adding them all under a
    common root folder?
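
To make the IPLD option above concrete, the manifest could be a JSON
object along these lines (a hypothetical shape, not a settled format;
the {"/": …} form is how DAG-JSON represents a link to another object):

{
  "entries": [
    { "path": "bin/hello",
      "executable?": true,
      "content": { "/": "Qm… CID of bin/hello …" } },
    { "path": "share/man/man1/hello.1.gz",
      "executable?": false,
      "content": { "/": "Qm… CID of the man page …" } }
  ]
}

Pinning the object holding this JSON would then recursively pin all the
files it links to.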

To sum up, here is what remains to be done on the current patch:

- Add all files in one go, without pinning them (see the sketch after
  this list).
- Store them as a file tree?  Can we still use the IPLD object to
  reference the files in the tree?  Otherwise, use the "raw-leaves"
  option to avoid wrapping small files in UnixFS blocks.
- Remove the Scheme manifest if the IPLD object can do the job.
- Generate the IPLD object and pin it.
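
As a starting point for the first item, here is a rough sketch of a
single "add" call carrying several files with pinning disabled,
assuming the stock HTTP API on port 5001 (for brevity it reads files as
text, which a real implementation must not do):

(use-modules (web client) (rnrs bytevectors) (ice-9 textual-ports))

(define (ipfs-add-files/no-pin files)
  "Upload FILES in a single multipart request to /api/v0/add,
passing 'pin=false' and 'raw-leaves=true'."
  (let* ((boundary "guix-ipfs-3db19cf7")
         (part (lambda (file)
                 (string-append
                  "--" boundary "\r\n"
                  "Content-Disposition: form-data; name=\"file\"; filename=\""
                  file "\"\r\n"
                  "Content-Type: application/octet-stream\r\n\r\n"
                  (call-with-input-file file get-string-all)
                  "\r\n")))
         (body (string-append (string-concatenate (map part files))
                              "--" boundary "--\r\n")))
    (http-post "http://127.0.0.1:5001/api/v0/add?pin=false&raw-leaves=true"
               #:body (string->utf8 body)
               #:headers
               `((content-type . (multipart/form-data
                                  (boundary . ,boundary)))))))

The daemon answers with one JSON line per added file giving its CID.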

Any corrections?
Thoughts?

Cheers!
Pierre Neidhardt July 6, 2019, 8:44 a.m. UTC | #3
Link to the Nix integration discussion:
https://github.com/NixOS/nix/issues/859.
Molly Mackinlay July 12, 2019, 8:02 p.m. UTC | #4
Thanks for the update Pierre! Also adding Alex, Jessica, Eric and Andrew
from the package managers discussions at IPFS Camp as FYI.

Generating the IPLD manifest with the metadata and the tree of files
should also be fine AFAIK.  I'm sure Hector and Eric can expand more on
how to compose them, but the data storage format shouldn't make a big
difference for the IPLD manifest.

On Mon, Jul 1, 2019 at 2:36 PM Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> [...]
Alex Potsides July 15, 2019, 9:20 a.m. UTC | #5
The reason not to use the UnixFSv1 metadata field was that it's in the
spec <https://github.com/ipfs/specs/tree/master/unixfs#data-format> but
it has never really been implemented.  As it stands in v1, you'd have
to add explicit metadata types to the spec (executable, owner?, group?,
etc.) because protobufs need to know about everything ahead of time,
and each implementation would then have to be updated to support them.
This is all possible and not a technical blocker, but since most effort
is centred around UnixFSv2, the timescales might not fit with people's
requirements.

The more pragmatic approach Hector suggested was to wrap a CID that
resolves to the UnixFSv1 file in a JSON object that you could use to store
application-specific metadata - something similar to the UnixFSv1.5 section
<https://github.com/ipfs/camp/blob/master/DEEP_DIVES/package-managers/README.md#unixfs-v15>
in our notes from the Package Managers deep dive we did at camp.
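
Concretely, such a wrapper can stay tiny, e.g. (illustrative only):

{
  "executable": true,
  "file": { "/": "Qm… CID of the UnixFSv1 file …" }
}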

a.

On Fri, Jul 12, 2019 at 9:03 PM Molly Mackinlay <molly@protocol.ai> wrote:

> [...]
Ludovic Courtès July 15, 2019, 9:24 a.m. UTC | #6
Hello Héctor!  :-)

Hector Sanjuan <code@hector.link> skribis:

> On Friday, July 12, 2019 10:15 PM, Ludovic Courtès <ludo@gnu.org> wrote:

[...]

>> > -   Pinning: Pinning all files separately incurs an overhead. It's
>> >     enough to just pin the IPLD object since it propagates recursively.
>> >     When adding a tree, then it's no problem since pinning is only done once.
>> >
>>
>> Where’s the overhead exactly?
>
> There are reasons why we are proposing to create a single DAG with an
> IPLD object at the root.  Pinning has a big overhead because it
> involves locking, reading, parsing, and writing an internal pin-DAG.
> This is especially relevant when the pinset is very large.
>
> Doing multiple GET requests also has overhead, like being unable to
> use a single bitswap session (which, when downloading something new,
> means a big overhead since every request has to find providers).
>
> And it's not just the web view; it's the ability to natively
> walk/traverse all the objects related to a given root, which also
> makes it possible to compare multiple trees and to be more efficient
> at some things ("pin update", for example).  Your original idea was
> to create a manifest with references to different parts.  I'm just
> asking you to create that manifest in a format where those references
> are understood not only by you, the file creator, but by IPFS and any
> tool that can read IPLD, by making it an IPLD object (which is just
> JSON).

OK, I see.  Put that way, creating a DAG with an IPLD object as its
root seems pretty compelling.

Thanks for clarifying!

Ludo’.
Pierre Neidhardt July 15, 2019, 10:10 a.m. UTC | #7
Héctor mentioned a possible issue with the IPLD manifest growing too
big (if a package contains too many files), that is, beyond 2 MB.
In that case we would need to implement some form of sharding.

Héctor, do you confirm?  Any idea on how to tackle this elegantly?
Hector Sanjuan July 15, 2019, 10:21 a.m. UTC | #8
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, July 15, 2019 12:10 PM, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> Héctor mentioned a possible issue with the IPLD manifest growing too
> big (if a package contains too many files), that is, beyond 2 MB.
> In that case we would need to implement some form of sharding.
>
> Héctor, do you confirm? Any idea on how to tackle this elegantly?
>

Creating the DAG node the way I proposed (referencing a single root)
should be OK.  Unless you put a very large number of executable files
in that list, it should stay well within the 2 MB limit.
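
(Back-of-the-envelope: an entry holding a file name, a CID, and an
executable flag is on the order of 150 bytes of JSON, so a 2 MB block
fits roughly ten thousand entries; few store items have that many
executable files.)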


--
Hector
Tony Olagbaiye June 6, 2021, 5:54 p.m. UTC | #9
Hi,

Has this task stagnated? What's the news?

Thanks,
ix :)