This is a brilliant article. Talks about applications that combine the user sovereignty and offline-friendliness/robustness of local data (e.g., MS Office, Adobe Creative Suite) with the effortless collaboration potential and redundancy of cloud apps (like Google Docs and Trello). A lot of things are relevant to Holochain app designers too.
Love this comprehensive article! Would be super interesting to have a review of HoloChain’s current capabilities, graded according to the criteria the fine folks at Ink and Switch have developed.
There was recently a condensed version of the topic (written by some of the same people I believe) here:
The more I learn about CRDTs the more I want to understand how that system fits into HoloChain. There’s some new entrants in the community like @PekkaNikander that might want to partake in that discussion.
CRDTs are a great and deep set of tools. Back in 2015, when I studied them, they were certainly among the most promising approach to offline-supporting collaborative editing. As the article states, OT (Operational Transformations) are apparently better for collaborative editing, while state based CRDTs are more akin to git and better preserve the history. There is also a rich literature on how to compact CRDT history, thereby preserving storage, but AFAIK nobody has worked out how to combine such compaction with cryptographically preserved history. (A nice Ph.D. topic, perhaps?)
Today GUN seems to be a popular community project, based on CRDTs.
What comes to data first, I’m definitely very old school.
I still have my own multi-homed and UPSed server in my cellar, with enough of disk space for all of my data, including all history and then some. It has been there for some 30 years, with the HW upgraded every 10 years or so. (Soon time to do that again.)
But that is also waste of resources. It would be great if I could share the unused disk space (and CPU) with the community. Anything that implements deduplication with community-salted convergent encryption, combined with distributed back-up (duplication) with erasure codes?
I’m happy to see some conversation around this topic — local-first was an idea I fell in love with at my previous job, where we used Slack and GitLab heavily. Any outages would bring our work to a halt for half a day or more. Sure, you could still write code and check it into the repop, but you couldn’t have any conversations about it. Now, granted, having conversations when you’re offline can only get you so far, but local-first + P2P networking is where it gets really exciting.
@erlend_sh here’s my stab at a grading of Holochain against Ink and Switch’s criteria:
- It should be fast. Work in progress. Local operations (writes, function calls, state changes, lookups of cached data) are super fast, while network operations still need optimisation.
- It should work across multiple devices. Yup, absolutely.
- It should work without a network. Yes again. Currently the most reliable path for networking is sim2h, which requires a network connection (although you could easily run a local sim2h instance for your LAN). This is just a testbed, though, and will gradually be replaced by the real thing.
- It should support collaboration. Nothing that bars collaboration, with the exception of app-level CRDT libraries not existing yet.
- It should support data access for all time. As with Ink and Switch’s examples, with Holochain the DNA = access to the data and knowledge of how to interpret it. Open data formats are left as an exercise to the app developer.
It should be secure and private by default. A few notes:
- Users’ source chains stay on their devices, and may contain private (non-published) entries.
- All DHT communications — both gossip and direct messages — are E2EE, with TLS termination on the communicating nodes.
- Holochain’s public/private membranes are pretty granular: If you want something secret, keep it on your source chain, don’t publish it, and only share it via direct messaging. If you want something private (that is, for a group’s eyes only), your only read-control measure is the DHT, and it’s an all-or-nothing thing: all information on a DHT is available to all of its members. You can, however, create as many private DHTs for as many groups as you like.
- Still missing: encryption and decryption functions in the SDK. Currently if you wanted to share a private message publicly, so people could download it while you’re offline, you’d have to do the work in the client portion of the app.
- Also missing: patterns and practices for sharing private data ‘in the open’. Encryption libs only cover the means, not the user experience and best practices. What does it mean to publish an encrypted message to a public data store that is designed to live for as long as people participate in it? How does the future of quantum computing affect the present-day choices that people make about the data they publish?
- It should give the user full ownership and control of their data. Absolutely; each user is in charge of writing their own data on their own source chain and publishing it onto the DHT, and their PK signatures are proofs of authorship. There’s one caveat though: deleting data. I don’t know of any distributed system (which includes both P2P and client/server) that allows an agent to force anyone else to delete her data. Not her peer Bob, not the server that stores her data, not another user Charlie who happened to take a screenshot of her nasty tweet about her employer. This is a social-legal problem, not a technological one.
This sounds interesting — I guess because one of the consequence of a CRDT is that all of its operations are commutative, you could create checkpoints and dispose of the state changes that led to it (as long as you kept track of which state changes you’d just disposed of). Is that right? I’m still learning about category theory, but my understanding is that if it can commute, it can compose. So if each state transition in a CRDT is a function that takes existing state and outputs a new state, a checkpoint is just a function composed of other functions, and can be composed with other checkpoints or individual state transitions! Lovely.
Tell me more about this:
What’s the definition of ‘cryptographically preserved history’ and the nature of the problems?
I’ve heard of GUN — met the founder Mark in San Francisco this summer; really interesting and passionate guy — and it does seem to be enjoying some good real-world success. What I like about Mark is he just does stuff; he doesn’t wait for perfection.
This feels like a really rich sentence and I’d love to see it unpacked — use cases you’re envisioning, etc. I haven’t heard of a project like that, although some things come close, and it certainly could be built on top of some existing platform like IPFS.
Well, my understanding of local-first “work without a network” is more like being able to work offline on (almost) everything and (almost) the same way as when online. This is what I call off-line editing in the parallel; see the other discussion thread.
Firstly, I’ve mostly worked with state-based CRDTs, so what I say here may not be true for OT. That said, a state-based CRDT is pretty much like a blockchain with all forks preserved, and with well-defined semantics of how to interpret the forks in the history, but without any hash chains or crypto.
Now, from that point of view, you are right. Since the operations are commutative and associative, you can in theory combine a sequence of operations, thereby creating more complex operations. However, whether such more complex operations can actually be represented with less memory depends on the details of the CRDT. That’s where the literature kicks in. Unfortunately I’m not an expert there. I just know it does exist.
Well, with ‘cryptographically preserved history’ I simply mean something like a hash chain (with or without PoW) applied on a CRDT-like history. That is, a (fully or partially) ordered sequence of events where you cannot change the history. Consider git. If you amend a commit, you create a new branch of history, with quite different commit fingerprints. In that sense, if you rewrite git history, you branch the history all the way from the point of edit.
Now, if you want to compact such a branching CRDT history, AFAIK nobody has worked out a solution that is both fully secure and saves a lot of space. Maybe that is impossible. Or maybe I’m just ignorant. I think some of the so-called light blockchain protocols have good related ideas, but AFAIK they are only probabilistically secure.
(I’m saving the ‘really rich sentence’ for another post. )
Continuing the discussion of offline-friendliness from another thread:
This is exactly what I’m picturing too, and why I see Holochain as quite able to support offline mode. The source chain is not particularly interesting — it’s just a sequential log of one person’s state changes, which may be inputs to CRDTs or may not. For many apps, the source chain will be completely irrelevant; it’ll be the content of the entries that’s important. And I think that, for entries that contain operation-based state changes (in other words, CvRDTs, where ordering is important), you wouldn’t be interested in the ordering of one agent’s entries so much — you’d be interested in the ordering of an individual entry in relation to other entries that relate to a document, which will be floating out there in the DHT somewhere. The important thing is that you can reference them in your entry so a partial causal ordering of operations can be reconstructed.
So to support offline mode as the Ink & Switch folks talk about, all you need to do is accumulate a bunch of state changes on your source chain, then publish them to the DHT when you get offline. As you say, there’s a high likelihood that you’ve created a parallel branch that will need to get merged by the app or its users (or stand on its own, as with Git or FedWiki), but as you also say, that’s not actually a problem.
I keep reading things about compaction and saving space. I don’t think any of us have talked much about how to do that in concrete terms, and I think it’ll become important when the rubber hits the road. All the Holochain DHT does is accumulate immutable facts for an application (be they documents, username registrations, transactions, CRDT states/operations, etc) and expect the application code to reduce them into something meaningful. So the Holochain DHT has the same storage problems as Git and blockchain, although one advantage is that storage is sharded up so nobody has to store the full set.
I think IPFS is close, but I haven’t studied well enough. Anyway, let’s unpack.
Deduplication is the idea that you can identify whenever any two files are identical, and if so, instead of having a number of separate copies, you consider those copies as instances of the same file.
The usual way of implementing deduplication is that you take a good long fingerprint (or two) from the file (or disk block). Then you compare that to the fingerprints of all other files. If the fingerprints are identical, you have a very high probability that the files are identical. If you have a long enough fingerprint, you can make the collision probability so low that it happens only once in billion universes.
However, a better practice is to have two distinct hash functions. That additionally adds a reasonable level of protection against collision attacks, since even in the case that someone would invent a collision algorithm against one of the hash functions, the probability that one could invent a collision algorithm that works simultaneously against both of the functions is quite low.
key = hash_1(value) | hash_2(value), where
| is concatenation.
Now, if you want both deduplicate and encrypt your files, there is the problem of how to recognise when two encrypted files are identical. One approach to tackle that is convergent encryption. The basic idea is that you use a hash of the content as the encryption key. Everyone having the same value will generate the same encryption key.
The rationale is that if you already have the value, you don’t need the key. Hence, the hash of the value is as good key as any.
key_enc = hash_enc(value) value_enc = encrypt(key_enc, value) key_dedup = hash_dedup(value)
<key_dedup, value_enc> into your deduplicated storage, using
key_dedup as the index. You keep
<key_dedup, key_enc> as a secret.
With that you can first fetch
key_dedup and then reveal the original value with
You can deduplicate the encrypted files, leading to nice orthogonality.
A problem with this is that if your storage is distributed, anyone that has a given file can see whenever others are looking up the file. Sometimes that is not wanted. That’s why we need salt.
Instead of having plain convergent encryption, where
key_enc = hash(value), you salt it:
key_enc = hash(salt, value), where
salt is a relatively short semi-secret bitstring.
Community-salting simply means that you agree on the salt with your local, trusted community. Deduplication works within the community, but it does not leak information across communities that use different salts.
Finally, using erasure codes to backup files is based on the idea that you compute a (non-cryptographic) checksum over a number of files in such a way that you can regenerate any of the files from the checksum and the other files. Hence, if any of the original files is destroyed but the checksum and other files are available, you can easily restore the destroyed file from the rest.
In practice, this is generalised so that you have a number of checksums over a large number of files, e.g. 10 checksums over a set of 1000 files, in a such a way that if 1–10 files are destroyed, you can still restore them. Or 50 checksums for 100 files, or even 100 checksums for 100 files. In the last case, your duplication factor is 2 but you can randomly lose (almost) half of the storage and you can still restore all the information!
In practice, with suitable erasure codes you can keep your duplication factor quite low, e.g. well below 1.5, and still be able to recover any file with a reasonable effort. Or alternatively you can gear up your duplication factor to some relatively high number, such as 4, and be able to restore all data from a quarter of all the storage.
So, what I am envisioning with that sentence is a
- distributed storage system, where
- all files are encrypted, but where
- the system stores only one logical copy of a file (within a community), and where
- one can restore the files even when only a fraction of the storage is online.
Maybe IPFS implements that all. I don’t know.
Now, from the holochain point of view, what — of course — is then next needed there is an incentive system that compensates people for sharing their storage. Only with such incentive I can be reasonable sure that others are keeping their shares (erasure codes) that backup my files.
Cool, thanks for unpacking. I think I understand it all now. Deduplication I was familiar with from IPFS and Holochain, and now I understand convergent encryption and community salting!
Convergent encryption = using an encryption key deterministically derived from the plaintext, but that allows snoopers to figure out what data you’re interested in.
Community salting = ‘re-duplicating’ content per trusted group so only those privy to the salt know who else is interested in the data.
This could be quite important to Holochain, as all data is signed by its author (thereby revealing who published what) and peers could keep track of all ‘get’ requests (thereby revealing who wanted to read what).
There are some interesting properties of Holochain that may make group privacy even easier:
- Each application has its own isolated DHT and E2EE network, partitioned by the hash of the application’s binary code + metadata (name, description, UUID, arbitrary keys/values).
- It’s trivial to fork an application, creating a separate space with the same logic (data types, validation rules, and state-manipulation functions) as the original. All you need to do is pass a new UUID or arbitrary keys/values into the config for that app.
And as for erasure codes, I guess Reed-Solomon is an example of that? It could be valuable for apps that just store big chunks of data. For apps that have specific validation rules on data, it’d be tricky — as soon as you break an entry into chunks and RS-encode them, your validation routine has to wait until all chunks are committed to the journal before it can decode, combine, and check them. Not impossible though.
Incidentally, these validation routines are deterministic/pure functions that receive an entry and its context (which can include the entire journal preceding it) and make a boolean decision about whether it’s correct or not. They get used two places: just before an agent commits the entry to their journal, and just before a validator is asked to hold an entry. The consequences are different in each cases: at commit time, it just rejects the entry and prevents the author from committing data that breaks the rules; whereas at hold time, the validator can choose to gossip a ‘warrant’ flagging the author as a bad actor who has likely modified their executable.
We’re assuming that it’s incentive enough just to be able to use the app. As others are asking you to store their data, you’re asking them to store yours. If you can get by with only 2× redundancy (either by expecting the DHT to hold two copies, or by breaking the entries into N chunks and using erasure code params that introduce 2 / N redundancy into those chunks — sorry; I’m not familiar with the right terms) then that’s not all that painful. Because both data and nodes are expected to be distributed statistically evenly around the DHT, as long as you put some sort of constraint on chunk size, nobody should get stuck with some massive 2GB holiday video file.
I can see places where this assumption would hold up and places where it wouldn’t. High-trust environments (teams) should have no problem accepting this burden or setting up dedicated nodes that replicate the full DHT, whereas low-trust environments could see abuse.
I wonder if proof-of-storage might help. Depends on the application… if it’s just a blob storage app (backups, for instance) you could reward people who can prove that they’re storing others’ blobs by granting them permission to store more of their own stuff.
My understanding is that it doesn’t, but you could easily build this on top.
So I’m really curious: what’s your particular interest in these sorts of applications?
Yes. But I’m far from an an expert in erasure codes.
Yes, that seems to be the case. But my intuition says that studying validation and shared redundancy (ala erasure codes) in detail might be very interesting. Not that I’d have time and energy for that, though…
In general, the freeloader problem is real, having a direct (though typically non-linear) relationship with the community size and an inverse relationship with trust. That is, the bigger the community and the lower the average level of trust, the more there will be freeloading and fraud.
Reputation systems help there, but those reputation systems that rely on human reports have their own set of problems. I surmise that fully automated proof-based reputation systems work better. But I don’t remember having seen any comparative studies.
I concur. I believe that with proof-of-storage and probabilistic probing you can build an automated, proof-based reputation system. What is missing there though still is the proper punishment for misusing your reputation.
That is, one attack against (automated) reputation systems is to first gain a high reputation, allowing others to trust you, and then on one shot misuse it and disappear. From the game theoretic point of view, that is simply the classical last turn defection in the iterated Prisoner’s dilemma. To counter that, misusing reputation should be mechanically bound to a punishment that exceeds the benefit from a defection. Bonds are the most often used mechanism, but an apparently equally effective mechanism is to bind the assets to the identifier for a period of time, thereby preventing the defector from reaping the benefits and disappearing in one shot. (This resembles to but is not identical with how in Bitcoin the block rewards are paid only after a 100 rounds cooldown period.)
Well, my primary interest is very simple:
As I said, I’m very old school here. My whole digital history (a few terabytes of data) is on my own server in our cellar. That is also basically my only backup. Sure, I have ZFS so that I can (and have) easily recover any single disk going bad.
But what if our house burns? My data will be gone. While not a catastrophe, not so nice. (My most important data I have also in the bank vault, but manually taking the backups there every now and then is cumbersome.)
Backing up the data to the existing could is not really an option, since storing that many terabytes is not exactly free. (e.g. Google Nearline storage costs some $10/TB/month, while a 10 TB disk costs about $300 or three months Cloud storage.)
Hence, I would like to share my excess disk (~5 TB at this point) for having some erasure codes over my data stored in the community. Since the redundancy factor needed for backups is typically 1.1, or less than 1.5 anyway, that would be a very cost efficient and convenient solution for me. Provided that I could trust it to work for at least 10 years, so that it would be worth the effort of putting up.
Great to hear more about your context @PekkaNikander — I too would love some sort of P2P cloud backup that ‘Just Works’. Re: freeloading, I can picture something whereby, in order to store data, you have to prove that you’re storing a roughly equal amount of data by producing a fairly fresh set of storage proofs for all the data you’ve committed to. Would love to get into implementation details, but now is probably not the time
But the tricky thing is that you’d have to rebuild the DHT’s algorithm in application-land, so I’m wondering about whether the DHT already does (or could be made to) get you close enough. The DHT’s routing algorithm is fairly sloppy (it’s called ‘rrDHT’, and at least one of the 'r’s stands for ‘relaxed’) which means it can accommodate a bit of variance in the amount of data people are willing to store. Here’s how it works:
- Everyone has an advertised ‘indexing arc’, a range of addresses that they’ve committed to holding data for.
- Alice advertises her indexing arc to her peers. Wherever her peers’ indexing arcs overlap with hers, both they and she expect each other to be holding the same entries. If there’s a mismatch, they push missing entries to each other and expect to get a response affirming that they’re now holding those entries.
- You can have as large or as small of an indexing arc as you like, based on your device’s storage restrictions and your responsibility within the group.
- Peers gossip with each other about uptime, storage arcs, new peers, reliability re: holding data, etc. The ultimate goal is to make cooperation more advantageous than rule-breaking by making everyone’s DHT behaviour visible and allowing people to sanction cheaters.
I’m glossing over some details; I’ll let you dig into the draft paper and construct your own understanding of it. (Note: because it’s a draft, please don’t share far and wide.) But the two conclusions I draw:
- It should be possible for honest peers (as defined by peers who see some value in the app and don’t want to break the rules) can enforce sanctions against peers who claim that they’re holding lots of data but actually aren’t.
- That only covers claims about indexing; it doesn’t cover freeloading via keeping your indexing arc low under false pretenses. I don’t know how to translate that into the needs of a particular app; that is, enforcing a give/take ratio. One size does not fit all, so it’d have to be an app-level algorithm or an app-defined setting that the DHT obeys.