
Commit 2bef8cf ("README: add CLI section")

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
1 parent: cad5034

1 file changed: README.md (58 additions, 20 deletions)

Go implementation of offset-based native UnixFS proofs.

**Note:** this is a side-project and should not be considered production-ready. It isn't optimized nor audited in any way.

## Table of contents
- [About the project](#about)
- [Assumptions of the UnixFS DAG file](#does-this-library-assume-any-particular-setup-of-the-unixfs-dag-for-the-file)
- [Proof format](#proof-format)
- [Use-case analysis and security](#use-case-analysis-and-security)
- [Proof sizes and benchmark](#proof-sizes-and-benchmark)
- [CLI](#cli)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)

## About
This library allows generating and verifying proofs for UnixFS file DAGs.

The verifier knows the _Cid_ of a UnixFS DAG and the size of the underlying represented file.
With this information, the verifier asks the prover to generate a proof that it stores the block at a specified offset between _[0, max-file-size]_.

The proof is a sub-DAG which contains all the necessary blocks to assert that:
- The provided block is part of the DAG with the expected Cid root.
- The provided block of data is at the specified offset in the file.

The primary motivation for this kind of library is to allow challenges at random-sampled offsets of the original file, giving a probabilistic guarantee that the prover is storing the data.

Consider the following UnixFS DAG file with a fanout factor of 3:
![image](https://user-images.githubusercontent.com/6136245/139512869-5135649f-dc34-4ef1-9862-5c47860ec581.png)

Considering a verifier is asking a prover to provide a proof that it contains the corresponding block at the _file level offset_ X, the prover generates the subdag inside the green zone:
- Round nodes are internal DAG nodes that are somewhat small-ish and don't contain file data.
- Square nodes contain chunks of the original file data.
- The indigo-colored nodes are the nodes necessary for the proof to verify that the target block (red) is at the specified offset.

For more details about this proof, see the _Proof sizes and benchmark_ section.
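
To make the offset-to-block mapping concrete, here is a minimal Go sketch of how a client could descend a dag-pb UnixFS DAG towards the leaf covering a given offset, using the per-child blocksizes stored in each internal node. This is not this library's API; the `leafForOffset` helper and the choice of go-ipld-format/go-unixfs packages are illustrative assumptions:

```go
package sketch

import (
	"context"

	"github.com/ipfs/go-cid"
	format "github.com/ipfs/go-ipld-format"
	"github.com/ipfs/go-merkledag"
	"github.com/ipfs/go-unixfs"
)

// leafForOffset walks from the root towards the leaf whose byte range
// contains offset (assumed < total file size), subtracting the sizes of
// the siblings it skips at each level.
func leafForOffset(ctx context.Context, ds format.DAGService, root cid.Cid, offset uint64) (cid.Cid, error) {
	cur := root
	for {
		nd, err := ds.Get(ctx, cur)
		if err != nil {
			return cid.Undef, err
		}
		pb, ok := nd.(*merkledag.ProtoNode)
		if !ok {
			return cur, nil // raw leaf: no further descent possible
		}
		fsn, err := unixfs.FSNodeFromBytes(pb.Data())
		if err != nil {
			return cid.Undef, err
		}
		if fsn.NumChildren() == 0 {
			return cur, nil // dag-pb leaf carrying the data itself
		}
		// Pick the child whose cumulative blocksize range covers the offset.
		for i := 0; i < fsn.NumChildren(); i++ {
			size := fsn.BlockSize(i)
			if offset < size {
				cur = pb.Links()[i].Cid
				break
			}
			offset -= size
		}
	}
}
```

Those per-level blocksizes are exactly why the intermediate (indigo) nodes must be part of the proof: the verifier replays this descent and checks that the claimed offset is consistent at every level.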

## Does this library assume any particular setup of the UnixFS DAG for the file?
No, this library works with any DAG layout, so it doesn't have any particular assumptions.
The DAG can have different layouts (e.g., balanced, trickle, etc.), chunking (e.g., fixed size, etc.), or other particular DAG builder configurations.

This minimal level of assumptions means the challenger only needs to know the _Cid_ and the file size to request and verify a proof.
There's an inherent tradeoff between assumptions and possible optimizations of the proof; see the _Proof sizes and benchmark_ section.

## Proof format
To avoid inventing any new proof standard or format, the proof is a byte array corresponding to a CAR file of all the blocks that are part of the proof.
This decision was mainly to avoid the friction of defining a new format or standard.

The order of blocks in the CAR file should be considered undefined, despite the current implementation producing a BFS order.
Defining a particular order could speed up proof verification, so that's a possible future change.
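
Since the proof is just a CAR byte stream, standard CAR tooling can inspect it. A minimal sketch, assuming a CARv1 proof and using `github.com/ipld/go-car` (for inspection only; not necessarily what this library uses internally), that lists the blocks carried in a proof:

```go
package sketch

import (
	"fmt"
	"io"
	"os"

	car "github.com/ipld/go-car"
)

// listProofBlocks prints the root Cids and every block carried in a proof.
func listProofBlocks(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	cr, err := car.NewCarReader(f)
	if err != nil {
		return err
	}
	fmt.Println("roots:", cr.Header.Roots)

	for {
		blk, err := cr.Next()
		if err == io.EOF {
			return nil // end of the CAR stream
		}
		if err != nil {
			return err
		}
		fmt.Printf("%s (%d bytes)\n", blk.Cid(), len(blk.RawData()))
	}
}
```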

## Use-case analysis and security
The primary motivation is to support a random-sampling-based challenge system between a prover and a verifier.
Given a file with size _MaxSize_, the verifier can ask the prover to generate proofs for blocks at random-sampled offsets in _[0, MaxSize]_.

The security of this schema is similar to other random-sampling schemas:
- If the prover doesn't have the block, it won't be able to generate the proof.
- If the offset is random-sampled in the _[0, MaxSize]_ range, it can't be guessed by the prover without storing the whole file.

If a bad prover stores only a fraction _p_ of the leaves (e.g., 50%):
- A single challenge gives the prover a probability `p` (e.g., 50%) of success.
- If the challenger asks for N (e.g., 5) proofs, the probability of generating all correct proofs is `p^N` (e.g., ~3%) at the cost of a proof size of ~`SingleProofSize*N`.

Despite the above, if the prover deletes only 1 byte of the data, it would still pass challenges with a ~high chance. Still, the file could be considered corrupted, since a single missing byte is usually enough to make the file unusable.

One possible mitigation is inspired by work from Mustafa et al. for data-availability schemas (see [here](https://ethresear.ch/t/simulating-a-fraud-proof-blockchain/5024)).
If an erasure-code schema is applied to the data, the prover is forced to drop a significant amount of data to make the file unrecoverable. For example, if the erasure code has a 2x leverage, the prover would have to drop at least 50% of the file to make it unrecoverable. As shown before, dropping 50% of the data means a ~3% chance of success if asked for 5 proofs. This means that if the file is in an unrecoverable state, with 5 proofs we should detect it at least ~97% of the time.
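
The arithmetic behind those numbers, as a tiny runnable example (the 50% and 5-challenge values are just the ones used above):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	p := 0.5 // fraction of leaves a bad prover keeps after dropping 50%
	n := 5.0 // number of independent random-offset challenges

	pass := math.Pow(p, n) // probability of answering all challenges correctly
	fmt.Printf("prover passes all %.0f challenges: %.1f%%\n", n, pass*100)   // ~3.1%
	fmt.Printf("verifier detects the missing data: %.1f%%\n", (1-pass)*100) // ~96.9%
}
```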

Notice that if the prover is missing internal nodes of the UnixFS DAG, the impact is much higher than missing leaves (underlying data), since for a random offset the probability of hitting an internal node is much bigger than hitting any particular leaf (e.g., if the root Cid block is missing, all challenges will fail). This means the leaf-only analysis above is a conservative bound: the prover's real chance of passing the challenges is at most the one computed for leaves.

## Proof sizes and benchmark
The proof size is directly related to how many assumptions we have about the underlying DAG structure. The current implementation of this library doesn't assume anything about the DAG structure, so it isn't optimized for proof size.
The biggest weight in the proofs comes from leaf blocks, which are usually heavy (~100s of KB); depending on where an offset lands in the DAG structure, a proof could contain multiple data blocks.

If we could at least bake in an assumption of fixed-size chunks, we could generate mostly minimal and constant-sized proofs, since we could probably avoid all untargeted leaves and only include the targeted one. Maybe the library can be extended in the future to bake in assumptions like this and generate smaller proofs.

The cost of generating the proofs should be _O(1)_. I'll probably add some benchmarks soon, but realistically speaking the cost is mainly tied to how fast lookups can be done in the `DAGService`, which depends mostly on the source of the data, not the algorithm.
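
A benchmark could take the following shape; a minimal sketch (in a `_test.go` file) assuming go-merkledag's in-memory test `DAGService` and go-unixfs's importer to build a synthetic file, with `Prove` as a hypothetical placeholder for this library's entrypoint, not its actual API:

```go
package sketch

import (
	"bytes"
	"context"
	"math/rand"
	"testing"

	"github.com/ipfs/go-cid"
	chunker "github.com/ipfs/go-ipfs-chunker"
	format "github.com/ipfs/go-ipld-format"
	dstest "github.com/ipfs/go-merkledag/test"
	"github.com/ipfs/go-unixfs/importer"
)

// Prove is a hypothetical stand-in for the library's proof-generation
// entrypoint; the real signature may differ.
func Prove(ctx context.Context, ds format.DAGService, root cid.Cid, offset uint64) ([]byte, error) {
	return nil, nil // placeholder
}

func BenchmarkProve(b *testing.B) {
	ctx := context.Background()
	ds := dstest.Mock() // in-memory DAGService

	// Build a ~4 MiB random file as a UnixFS DAG with 256 KiB chunks.
	data := make([]byte, 4<<20)
	rand.Read(data)
	root, err := importer.BuildDagFromReader(ds, chunker.NewSizeSplitter(bytes.NewReader(data), 256<<10))
	if err != nil {
		b.Fatal(err)
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		offset := uint64(i*4096) % uint64(len(data)) // spread challenges across the file
		if _, err := Prove(ctx, ds, root.Cid(), offset); err != nil {
			b.Fatal(err)
		}
	}
}
```

With an in-memory `DAGService` this mostly measures the algorithm itself; pointing the same benchmark at a networked block source would instead measure lookup latency, which, as noted above, is the dominant cost in practice.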

## CLI
A simple CLI, `ufsproof`, is provided to easily generate and verify proofs. It can be installed by running `make install`.

To generate a proof, run `ufsproof prove [cid] [offset]`, which prints to stdout the proof for the block of the given Cid at the provided offset.
For example:
- `ufsproof prove QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 1300`: assumes that the Cid is stored in an IPFS API at `/ip4/127.0.0.1/tcp/5001`.
- `ufsproof prove QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 1300 > proof.car`: stores the proof in a file.
- `ufsproof prove --car-file mydag.car QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 1300`: uses a CAR file instead of an IPFS API.

To verify a proof, run `ufsproof verify [cid] [offset] [proof-path:(optional, by default stdin)]`.
For example:
- `ufsproof verify QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 1300 proof.car`

Closing the loop:
```
$ ufsproof prove QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 1300 | ufsproof verify QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 1300
The proof is valid
$ ufsproof prove QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 10 | ufsproof verify QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 50000000
The proof is NOT valid
```
Remember that, as mentioned in the _Proof sizes and benchmark_ section, a single proof block can cover multiple data blocks, so it's possible for a proof to be valid at some offsets greater than the proved one.
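
For instance, something like the following may still verify, assuming both offsets fall inside the same leaf block (this depends on the file's chunking; the offsets here are only illustrative):
```
$ ufsproof prove QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 1300 | ufsproof verify QmUavJLgtkQy6wW2j1J1A5cAP6UQt3XLQjsArsU2ZYmgSo 1400
The proof is valid
```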

## Roadmap
Possible ideas for the near future:
- [ ] Allow direct leaf Cid proofs (non-offset based); a bit off-topic for this lib, and not sure it's entirely useful.
- [ ] Benchmarks; might be fun, but nothing entirely necessary for now.
- [ ] Allow strict-mode proof validation; maybe it makes sense to fail faster in some cases, nbd.
- [ ] CLI for validation from a DealID in the Filecoin network; maybe fun, but `Labels` are unverified.
- [ ] Baking in assumptions for shorter proofs.
- [ ] godocs

This is a side-project made for fun, so a priori this is a hand-wavy roadmap.
