This is a brilliant article. It talks about applications that combine the user sovereignty and offline-friendliness/robustness of local data (e.g., MS Office, Adobe Creative Suite) with the effortless collaboration potential and redundancy of cloud apps (like Google Docs and Trello). A lot of it is relevant to Holochain app designers too.
Love this comprehensive article! Would be super interesting to have a review of Holochain's current capabilities, graded according to the criteria the fine folks at Ink and Switch have developed.
There was recently a condensed version of the topic (written by some of the same people I believe) here:
The more I learn about CRDTs, the more I want to understand how that system fits into Holochain. There are some new entrants in the community, like @PekkaNikander, who might want to take part in that discussion.
CRDTs are a great and deep set of tools. Back in 2015, when I studied them, they were certainly among the most promising approaches to offline-supporting collaborative editing. As the article states, OT (Operational Transformation) is apparently better for collaborative editing, while state-based CRDTs are more akin to git and better preserve the history. There is also a rich literature on how to compact CRDT history, thereby saving storage, but AFAIK nobody has worked out how to combine such compaction with a cryptographically preserved history. (A nice Ph.D. topic, perhaps?)
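To make the state-based flavour concrete, here is a minimal G-Counter sketch (plain Python, no particular library implied); the point is that merge is commutative, associative, and idempotent, so replicas can sync in any order:

```python
# Minimal state-based CRDT (a G-Counter): each replica increments only its own
# slot, and merge takes the element-wise maximum of the two states.

def increment(state: dict, replica_id: str) -> dict:
    new = dict(state)
    new[replica_id] = new.get(replica_id, 0) + 1
    return new

def merge(a: dict, b: dict) -> dict:
    # Commutative, associative, and idempotent, so sync order doesn't matter.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def value(state: dict) -> int:
    return sum(state.values())

alice = increment(increment({}, "alice"), "alice")   # Alice counted twice offline
bob   = increment({}, "bob")                         # Bob counted once offline

assert merge(alice, bob) == merge(bob, alice)
assert value(merge(alice, bob)) == 3
```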
Today GUN seems to be a popular community project, based on CRDTs.
When it comes to data, I'm definitely very old school.
I still have my own multi-homed and UPS-backed server in my cellar, with enough disk space for all of my data, including all history and then some. It has been there for some 30 years, with the HW upgraded every 10 years or so. (Soon it will be time to do that again.)
But that is also a waste of resources. It would be great if I could share the unused disk space (and CPU) with the community. Is there anything that implements deduplication with community-salted convergent encryption, combined with distributed backup (duplication) using erasure codes?
I'm happy to see some conversation around this topic; local-first was an idea I fell in love with at my previous job, where we used Slack and GitLab heavily. Any outages would bring our work to a halt for half a day or more. Sure, you could still write code and check it into the repo, but you couldn't have any conversations about it. Now, granted, having conversations when you're offline can only get you so far, but local-first + P2P networking is where it gets really exciting.
@erlend_sh here's my stab at a grading of Holochain against Ink and Switch's criteria:
- It should be fast. Work in progress. Local operations (writes, function calls, state changes, lookups of cached data) are super fast, while network operations still need optimisation.
- It should work across multiple devices. Yup, absolutely.
- It should work without a network. Yes again. Currently the most reliable path for networking is sim2h, which requires a network connection (although you could easily run a local sim2h instance for your LAN). This is just a testbed, though, and will gradually be replaced by the real thing.
- It should support collaboration. Nothing bars collaboration, except that app-level CRDT libraries don't exist yet.
- It should support data access for all time. As with Ink and Switch's examples, with Holochain the DNA = access to the data and knowledge of how to interpret it. Open data formats are left as an exercise for the app developer.
- It should be secure and private by default. A few notes:
- Users' source chains stay on their devices, and may contain private (non-published) entries.
- All DHT communications (both gossip and direct messages) are E2EE, with TLS termination on the communicating nodes.
- Holochain's public/private membranes are pretty granular: if you want something secret, keep it on your source chain, don't publish it, and only share it via direct messaging. If you want something private (that is, for a group's eyes only), your only read-control measure is the DHT, and it's an all-or-nothing thing: all information on a DHT is available to all of its members. You can, however, create as many private DHTs for as many groups as you like.
- Still missing: encryption and decryption functions in the SDK. Currently, if you wanted to share a private message publicly, so people could download it while you're offline, you'd have to do the work in the client portion of the app.
- Also missing: patterns and practices for sharing private data "in the open". Encryption libs only cover the means, not the user experience and best practices. What does it mean to publish an encrypted message to a public data store that is designed to live for as long as people participate in it? How does the future of quantum computing affect the present-day choices that people make about the data they publish?
- It should give the user full ownership and control of their data. Absolutely; each user is in charge of writing their own data to their own source chain and publishing it onto the DHT, and their public-key signatures are proofs of authorship. There's one caveat, though: deleting data. I don't know of any distributed system (which includes both P2P and client/server) that allows an agent to force anyone else to delete her data: not her peer Bob, not the server that stores her data, not another user Charlie who happened to take a screenshot of her nasty tweet about her employer. This is a social and legal problem, not a technological one.
This sounds interesting; I guess, because one of the consequences of a CRDT is that all of its operations are commutative, you could create checkpoints and dispose of the state changes that led to them (as long as you kept track of which state changes you'd just disposed of). Is that right? I'm still learning about category theory, but my understanding is that if it can commute, it can compose. So if each state transition in a CRDT is a function that takes existing state and outputs a new state, a checkpoint is just a function composed of other functions, and can be composed with other checkpoints or individual state transitions! Lovely.
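A minimal sketch of that composition idea (plain Python, hypothetical names, nothing Holochain-specific):

```python
from functools import reduce

# Each state transition is a pure function from the old state to a new state.
def add_item(item):
    return lambda state: state | {item}

def compose(*transitions):
    """Fold a sequence of transitions into a single 'checkpoint' function."""
    return lambda state: reduce(lambda s, t: t(s), transitions, state)

checkpoint = compose(add_item("a"), add_item("b"), add_item("c"))

# Applying the checkpoint is equivalent to replaying the individual steps,
# so the steps themselves can be discarded once the checkpoint is kept.
assert checkpoint(set()) == {"a", "b", "c"}
```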
Tell me more about this:
What's the definition of "cryptographically preserved history", and what is the nature of the problems?
I've heard of GUN (I met the founder Mark in San Francisco this summer; really interesting and passionate guy) and it does seem to be enjoying some good real-world success. What I like about Mark is that he just does stuff; he doesn't wait for perfection.
This feels like a really rich sentence and I'd love to see it unpacked: the use cases you're envisioning, etc. I haven't heard of a project like that, although some things come close, and it certainly could be built on top of an existing platform like IPFS.
Well, my understanding of local-first "work without a network" is more like being able to work offline on (almost) everything and in (almost) the same way as when online. This is what I call offline editing in parallel; see the other discussion thread.
Firstly, I've mostly worked with state-based CRDTs, so what I say here may not be true for OT. That said, a state-based CRDT is pretty much like a blockchain with all forks preserved, and with well-defined semantics for how to interpret the forks in the history, but without any hash chains or crypto.
Now, from that point of view, you are right. Since the operations are commutative and associative, you can in theory combine a sequence of operations, thereby creating more complex operations. However, whether such more complex operations can actually be represented with less memory depends on the details of the CRDT. That's where the literature kicks in. Unfortunately I'm not an expert there; I just know it exists.
Well, by "cryptographically preserved history" I simply mean something like a hash chain (with or without PoW) applied to a CRDT-like history, that is, a (fully or partially) ordered sequence of events where you cannot change the history. Consider git: if you amend a commit, you create a new branch of history, with quite different commit fingerprints. In that sense, if you rewrite git history, you branch the history all the way from the point of the edit.
Now, if you want to compact such a branching CRDT history, AFAIK nobody has worked out a solution that is both fully secure and saves a lot of space. Maybe that is impossible. Or maybe I'm just ignorant. I think some of the so-called light blockchain protocols have good related ideas, but AFAIK they are only probabilistically secure.
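To make the "quite different commit fingerprints" point concrete, here is a toy hash chain (Python with hashlib; this is not git's actual object format): amending one entry changes every fingerprint downstream of it.

```python
import hashlib

def commit_hash(parent_hash: str, content: str) -> str:
    # Each fingerprint covers its parent, so the whole history is tamper-evident.
    return hashlib.sha256((parent_hash + content).encode()).hexdigest()

def build_chain(commits):
    hashes, parent = [], ""
    for content in commits:
        parent = commit_hash(parent, content)
        hashes.append(parent)
    return hashes

original = build_chain(["a", "b", "c", "d"])
amended  = build_chain(["a", "B (amended)", "c", "d"])

assert original[0] == amended[0]      # history before the edit is still shared
assert all(o != a for o, a in zip(original[1:], amended[1:]))  # everything after diverges
```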
(I'm saving the "really rich sentence" for another post.)
Continuing the discussion of offline-friendliness from another thread:
This is exactly what I'm picturing too, and why I see Holochain as quite able to support offline mode. The source chain is not particularly interesting; it's just a sequential log of one person's state changes, which may or may not be inputs to CRDTs. For many apps, the source chain will be completely irrelevant; it'll be the content of the entries that's important. And I think that, for entries that contain operation-based state changes (in other words, CmRDTs, where ordering matters), you wouldn't be interested so much in the ordering of one agent's entries; you'd be interested in the ordering of an individual entry in relation to the other entries that relate to a document, which will be floating out there in the DHT somewhere. The important thing is that you can reference them in your entry so a partial causal ordering of operations can be reconstructed.
So to support offline mode as the Ink & Switch folks talk about, all you need to do is accumulate a bunch of state changes on your source chain, then publish them to the DHT when you come back online. As you say, there's a high likelihood that you've created a parallel branch that will need to get merged by the app or its users (or stand on its own, as with Git or FedWiki), but as you also say, that's not actually a problem.
I keep reading things about compaction and saving space. I don't think any of us have talked much about how to do that in concrete terms, and I think it'll become important when the rubber hits the road. All the Holochain DHT does is accumulate immutable facts for an application (be they documents, username registrations, transactions, CRDT states/operations, etc.) and expect the application code to reduce them into something meaningful. So the Holochain DHT has the same storage problems as Git and blockchains, although one advantage is that storage is sharded, so nobody has to store the full set.
I think IPFS is close, but I haven't studied it well enough. Anyway, let's unpack.
Deduplication is the idea that you can identify when any two files are identical and, if so, instead of having a number of separate copies, treat those copies as instances of the same file.
More details
The usual way of implementing deduplication is to take a good, long fingerprint (or two) of the file (or disk block) and compare it to the fingerprints of all other files. If the fingerprints are identical, there is a very high probability that the files are identical. With a long enough fingerprint, you can make the collision probability so low that a collision happens only once in a billion universes.
However, a better practice is to use two distinct hash functions. That adds a reasonable level of protection against collision attacks: even if someone invents a collision algorithm against one of the hash functions, the probability that a single algorithm works simultaneously against both functions is quite low.
key = hash_1(value) | hash_2(value), where | is concatenation.
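For instance (a sketch with SHA-256 and BLAKE2b standing in for hash_1 and hash_2):

```python
import hashlib

def dedup_key(value: bytes) -> bytes:
    # Two independent fingerprints, concatenated: a successful collision attack
    # would have to break both hash functions on the same input at once.
    return hashlib.sha256(value).digest() + hashlib.blake2b(value).digest()

store = {}  # dedup_key -> value: one physical copy per distinct content

def put(value: bytes) -> bytes:
    key = dedup_key(value)
    store.setdefault(key, value)   # identical content lands in the same slot
    return key

k1 = put(b"holiday-video-bytes")
k2 = put(b"holiday-video-bytes")
assert k1 == k2 and len(store) == 1   # two writes, one stored copy
```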
Now, if you want to both deduplicate and encrypt your files, there is the problem of how to recognise when two encrypted files are identical. One approach to tackle that is convergent encryption. The basic idea is that you use a hash of the content as the encryption key. Everyone who has the same value will generate the same encryption key.
More details
The rationale is that if you already have the value, you don't need the key. Hence, the hash of the value is as good a key as any.
key_enc = hash_enc(value)
value_enc = encrypt(key_enc, value)
key_dedup = hash_dedup(value)

You store <key_dedup, value_enc> into your deduplicated storage, using key_dedup as the index. You keep <key_dedup, key_enc> as a secret. With that you can first fetch value_enc using key_dedup and then reveal the original value with decrypt(key_enc, value_enc).
You can deduplicate the encrypted files, leading to nice orthogonality.
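A sketch of that flow, with SHA-256 for the key derivations and AES-GCM from the pyca/cryptography package standing in for encrypt/decrypt (my choices, not prescribed by the scheme); the nonce can be derived deterministically because the key is already content-derived:

```python
import hashlib
# Assumption: the pyca/cryptography package is installed; any AEAD cipher would do.
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(value: bytes):
    key_enc   = hashlib.sha256(b"enc|" + value).digest()     # hash_enc(value)
    key_dedup = hashlib.sha256(b"dedup|" + value).digest()   # hash_dedup(value)
    # Deterministic nonce is safe here: this key only ever encrypts this value.
    nonce     = hashlib.sha256(b"nonce|" + key_enc).digest()[:12]
    value_enc = AESGCM(key_enc).encrypt(nonce, value, None)
    return key_dedup, value_enc, key_enc

def convergent_decrypt(value_enc: bytes, key_enc: bytes) -> bytes:
    nonce = hashlib.sha256(b"nonce|" + key_enc).digest()[:12]
    return AESGCM(key_enc).decrypt(nonce, value_enc, None)

key_dedup, value_enc, key_enc = convergent_encrypt(b"same file, same ciphertext")
# Anyone holding the same plaintext derives the same key_dedup and value_enc,
# so the storage layer can deduplicate without ever seeing the plaintext.
assert convergent_decrypt(value_enc, key_enc) == b"same file, same ciphertext"
```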
A problem with this is that if your storage is distributed, anyone who has a given file can see when others are looking up that file. Sometimes that is not wanted. That's why we need salt.
More details
Instead of having plain convergent encryption, where key_enc = hash(value), you salt it: key_enc = hash(salt, value), where salt is a relatively short semi-secret bitstring.
Community-salting simply means that you agree on the salt with your local, trusted community. Deduplication works within the community, but it does not leak information across communities that use different salts.
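In code, the only change to the previous sketch is mixing the community salt into every derivation (hypothetical helper names):

```python
import hashlib

def salted_keys(salt: bytes, value: bytes):
    key_enc   = hashlib.sha256(salt + b"|enc|" + value).digest()
    key_dedup = hashlib.sha256(salt + b"|dedup|" + value).digest()
    return key_enc, key_dedup

# Same file, two communities with different salts: neither the dedup index nor
# the ciphertext matches across the salt boundary, so nothing leaks between them.
community_a = salted_keys(b"salt-of-community-A", b"shared file")
community_b = salted_keys(b"salt-of-community-B", b"shared file")
assert community_a != community_b
```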
Finally, using erasure codes to back up files is based on the idea that you compute a (non-cryptographic) checksum over a number of files in such a way that you can regenerate any one of the files from the checksum and the other files. Hence, if any of the original files is destroyed but the checksum and the other files are available, you can easily restore the destroyed file from the rest.
More details
In practice, this is generalised so that you have a number of checksums over a large number of files, e.g. 10 checksums over a set of 1000 files, in such a way that if 1-10 files are destroyed, you can still restore them. Or 50 checksums for 100 files, or even 100 checksums for 100 files. In the last case, your duplication factor is 2, but you can randomly lose (almost) half of the storage and still restore all the information!
In practice, with suitable erasure codes you can keep your duplication factor quite low, e.g. well below 1.5, and still be able to recover any file with a reasonable effort. Or alternatively you can gear up your duplication factor to some relatively high number, such as 4, and be able to restore all data from a quarter of all the storage.
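The simplest possible instance of this idea is a single XOR parity block over N equal-sized blocks, giving a duplication factor of (N+1)/N and tolerating the loss of any one block; real systems would use something like Reed-Solomon, but the recovery principle is the same:

```python
from functools import reduce

def xor_blocks(blocks):
    # Bitwise XOR of equal-length blocks.
    return reduce(lambda x, y: bytes(a ^ b for a, b in zip(x, y)), blocks)

files = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # four equal-sized blocks
parity = xor_blocks(files)                      # the single "checksum" block

lost_index = 2                                  # pretend one block is destroyed
survivors = [f for i, f in enumerate(files) if i != lost_index]

# The lost block is the XOR of the parity block and all surviving blocks.
recovered = xor_blocks(survivors + [parity])
assert recovered == files[lost_index]
```

With four data blocks and one parity block, the duplication factor here is 1.25.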
So, what I am envisioning with that sentence is a
- distributed storage system, where
- all files are encrypted, but where
- the system stores only one logical copy of a file (within a community), and where
- one can restore the files even when only a fraction of the storage is online.
Maybe IPFS implements that all. I donât know.
Now, from the Holochain point of view, what is of course needed next is an incentive system that compensates people for sharing their storage. Only with such an incentive can I be reasonably sure that others are keeping the shares (erasure codes) that back up my files.
Cool, thanks for unpacking. I think I understand it all now. Deduplication I was familiar with from IPFS and Holochain, and now I understand convergent encryption and community salting!
Convergent encryption = using an encryption key deterministically derived from the plaintext, which, however, allows snoopers to figure out what data you're interested in.
Community salting = "re-duplicating" content per trusted group, so only those privy to the salt know who else is interested in the data.
This could be quite important to Holochain, as all data is signed by its author (thereby revealing who published what) and peers could keep track of all "get" requests (thereby revealing who wanted to read what).
There are some interesting properties of Holochain that may make group privacy even easier:
- Each application has its own isolated DHT and E2EE network, partitioned by the hash of the application's binary code + metadata (name, description, UUID, arbitrary keys/values).
- It's trivial to fork an application, creating a separate space with the same logic (data types, validation rules, and state-manipulation functions) as the original. All you need to do is pass a new UUID or arbitrary keys/values into the config for that app.
And as for erasure codes, I guess Reed-Solomon is an example of that? It could be valuable for apps that just store big chunks of data. For apps that have specific validation rules on data, it'd be tricky: as soon as you break an entry into chunks and RS-encode them, your validation routine has to wait until all chunks are committed to the journal before it can decode, combine, and check them. Not impossible, though.
Validation routines
Incidentally, these validation routines are deterministic/pure functions that receive an entry and its context (which can include the entire journal preceding it) and make a boolean decision about whether it's correct or not. They get used in two places: just before an agent commits the entry to their journal, and just before a validator is asked to hold an entry. The consequences are different in each case: at commit time, a failure just rejects the entry and prevents the author from committing data that breaks the rules; at hold time, the validator can choose to gossip a "warrant" flagging the author as a bad actor who has likely modified their executable.
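Just to illustrate the shape of such a callback (pseudo-Python with made-up entry fields, not the actual HDK signature):

```python
def validate_entry(entry: dict, context: dict) -> bool:
    """Pure function: the same entry and context always yield the same verdict."""
    # Example rule for a hypothetical chat app: messages must be short and
    # must come from an agent already known in the journal context.
    return (
        len(entry.get("text", "")) <= 280
        and entry.get("author") in context.get("known_agents", [])
    )

# Commit time: the author refuses to write an invalid entry to their own journal.
# Hold time: a validator receiving an invalid entry can gossip a warrant instead.
entry = {"text": "hello", "author": "alice"}
assert validate_entry(entry, {"known_agents": ["alice", "bob"]})
assert not validate_entry(entry, {"known_agents": []})
```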
We're assuming that it's incentive enough just to be able to use the app. As others are asking you to store their data, you're asking them to store yours. If you can get by with only 2× redundancy (either by expecting the DHT to hold two copies, or by breaking the entries into N chunks and using erasure-code parameters that introduce 2/N redundancy into those chunks; sorry, I'm not familiar with the right terms), then that's not all that painful. Because both data and nodes are expected to be distributed statistically evenly around the DHT, as long as you put some sort of constraint on chunk size, nobody should get stuck with some massive 2 GB holiday video file.
I can see places where this assumption would hold up and places where it wouldn't. High-trust environments (teams) should have no problem accepting this burden or setting up dedicated nodes that replicate the full DHT, whereas low-trust environments could see abuse.
I wonder if proof-of-storage might help. It depends on the application… if it's just a blob storage app (backups, for instance), you could reward people who can prove that they're storing others' blobs by granting them permission to store more of their own stuff.
My understanding is that it doesnât, but you could easily build this on top.
So I'm really curious: what's your particular interest in these sorts of applications?
Yes. But I'm far from an expert in erasure codes.
Yes, that seems to be the case. But my intuition says that studying validation and shared redundancy (à la erasure codes) in detail might be very interesting. Not that I'd have the time and energy for that, though…
In general, the freeloader problem is real, having a direct (though typically non-linear) relationship with the community size and an inverse relationship with trust. That is, the bigger the community and the lower the average level of trust, the more there will be freeloading and fraud.
Reputation systems help there, but reputation systems that rely on human reports have their own set of problems. I surmise that fully automated, proof-based reputation systems work better. But I don't remember having seen any comparative studies.
I concur. I believe that with proof-of-storage and probabilistic probing you can build an automated, proof-based reputation system. What is still missing there, though, is a proper punishment for misusing your reputation.
That is, one attack against (automated) reputation systems is to first gain a high reputation, allowing others to trust you, and then in one shot misuse it and disappear. From the game-theoretic point of view, that is simply the classical last-turn defection in the iterated Prisoner's Dilemma. To counter that, misusing reputation should be mechanically bound to a punishment that exceeds the benefit of defection. Bonds are the most often used mechanism, but an apparently equally effective mechanism is to bind the assets to the identifier for a period of time, thereby preventing the defector from reaping the benefits and disappearing in one shot. (This resembles, but is not identical to, how Bitcoin block rewards become spendable only after a 100-block maturity period.)
Well, my primary interest is very simple:
As I said, I'm very old school here. My whole digital history (a few terabytes of data) is on my own server in our cellar. That is also basically my only backup. Sure, I have ZFS, so I can easily recover from any single disk going bad (and have done so).
But what if our house burns down? My data will be gone. While not a catastrophe, that would not be nice. (I also keep my most important data in a bank vault, but manually taking the backups there every now and then is cumbersome.)
Backing up the data to the existing cloud is not really an option, since storing that many terabytes is not exactly free. (E.g., Google Nearline storage costs some $10/TB/month, while a 10 TB disk costs about $300, i.e. three months of cloud storage.)
Hence, I would like to share my excess disk (~5 TB at this point) in exchange for having erasure codes over my data stored in the community. Since the redundancy factor needed for backups is typically around 1.1, and less than 1.5 anyway, that would be a very cost-efficient and convenient solution for me, provided that I could trust it to work for at least 10 years, so that it would be worth the effort of setting it up.
Great to hear more about your context @PekkaNikander; I too would love some sort of P2P cloud backup that "Just Works". Re: freeloading, I can picture something whereby, in order to store data, you have to prove that you're storing a roughly equal amount of data by producing a fairly fresh set of storage proofs for all the data you've committed to. Would love to get into implementation details, but now is probably not the time.
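A toy version of the kind of storage proof I am imagining (hypothetical protocol, Python with hashlib only): the challenger precomputes a batch of probes before handing the data off, and each fresh nonce forces the prover to actually read the stored blob.

```python
import hashlib, os

def storage_proof(nonce: bytes, blob: bytes) -> str:
    # The prover must hold the full blob to recompute this.
    return hashlib.sha256(nonce + blob).hexdigest()

blob = b"the data Bob promised to keep for Alice"

# Before offloading the blob, Alice precomputes a batch of one-time probes,
# so she can later check on Bob without keeping the blob around herself.
probes = []
for _ in range(3):
    nonce = os.urandom(16)
    probes.append((nonce, storage_proof(nonce, blob)))

# Later: Alice sends a fresh nonce; Bob answers from his stored copy.
nonce, expected = probes.pop()
bobs_answer = storage_proof(nonce, blob)   # computed on Bob's side
assert bobs_answer == expected             # evidence that Bob still holds the blob
```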
But the tricky thing is that you'd have to rebuild the DHT's algorithm in application-land, so I'm wondering whether the DHT already does (or could be made to) get you close enough. The DHT's routing algorithm is fairly sloppy (it's called "rrDHT", and at least one of the "r"s stands for "relaxed"), which means it can accommodate a bit of variance in the amount of data people are willing to store. Here's how it works (a toy sketch of the overlap check follows the list):
- Everyone has an advertised "indexing arc", a range of addresses that they've committed to holding data for.
- Alice advertises her indexing arc to her peers. Wherever her peers' indexing arcs overlap with hers, both they and she expect each other to be holding the same entries. If there's a mismatch, they push missing entries to each other and expect to get a response affirming that they're now holding those entries.
- You can have as large or as small an indexing arc as you like, based on your device's storage restrictions and your responsibility within the group.
- Peers gossip with each other about uptime, storage arcs, new peers, reliability re: holding data, etc. The ultimate goal is to make cooperation more advantageous than rule-breaking by making everyone's DHT behaviour visible and allowing people to sanction cheaters.
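Here is the toy overlap sketch mentioned above (hypothetical representation: addresses as integers on a ring of size 2^32):

```python
RING = 2**32  # hypothetical circular address space

def in_arc(addr: int, start: int, length: int) -> bool:
    """True if addr falls within an arc starting at `start` and covering
    `length` addresses clockwise around the ring."""
    return (addr - start) % RING < length

def overlapping_entries(entries, my_arc, peer_arc):
    """Entries both peers claim responsibility for; each expects the other to hold them."""
    return [e for e in entries if in_arc(e, *my_arc) and in_arc(e, *peer_arc)]

entries = [10, 5_000, 3_000_000_000]
alice_arc = (0, 10_000)              # (start, length): a small commitment
bob_arc   = (4_000, 3_000_000_000)   # a much larger commitment

# Alice and Bob compare notes only where their arcs overlap; a mismatch there
# is what triggers pushing missing entries (or, eventually, a warrant).
assert overlapping_entries(entries, alice_arc, bob_arc) == [5_000]
```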
I'm glossing over some details; I'll let you dig into the draft paper and construct your own understanding of it. (Note: because it's a draft, please don't share it far and wide.) But here are the two conclusions I draw:
- It should be possible for honest peers (defined as peers who see some value in the app and don't want to break the rules) to enforce sanctions against peers who claim that they're holding lots of data but actually aren't.
- That only covers claims about indexing; it doesn't cover freeloading by keeping your indexing arc low under false pretenses. I don't know how to translate that into the needs of a particular app, that is, enforcing a give/take ratio. One size does not fit all, so it'd have to be an app-level algorithm or an app-defined setting that the DHT obeys.