Holochain Forum

How to deal with accumulation of obsolete data over time in a world with finite resources

If my limited understanding of Holochain is right, during the entire lifetime of a hApp:

  • No data ever gets deleted/erased from the DHT and all agents’ source chains. Holochain marks entries as deleted or updated but never actually removes an entry in a local chain.
  • Over time, we get more and more entries, distributed and replicated over an ever-growing network of source chains and the DHT, likely involving more and more agents.

This brings me to question the wisdom of using hardware resources (space on the hard drive) and ultimately natural resources (to build the new hard drives that agents will need in order to store the ever-increasing amount of data) for data that has become stale and that we don’t want to keep any trace of (old photo/video albums, past chat conversations, etc.). This is especially concerning because such data might represent a good proportion of the total data I’m creating, and some of it may live on IoT devices with limited storage capabilities.

I heard from @guillemcordoba in the devcamp that there will be a migration mechanism to move entries from one version of a hApp to another, but I wonder if it’s possible to have it somehow filter which parts of an agent’s source chain to keep.
More generally, I would like to know if there’s any way to « clean up » one’s source chain of stale data (which uses space on my hard drive and may push me to waste natural resources by purchasing a bigger hard drive or an IoT device with greater storage capacity).

Apologies if any of this is unclear or based on a misunderstanding. I would appreciate any resource that clarifies the subject matter.

[EDIT] After some searching I found some posts already covering the topic:


Hi Nicolas, very relevant question, one I’ve had in mind since I understood Holo and HoloPorts. RedGrid works with IoT devices and we are incorporating HoloPorts in our solution to offload storage, but that will eventually fill up, especially since Holo will host many hApps.

However, when does data become obsolete, and who decides? Perhaps this is clearer in an agent-centric world, since the hApp developer can decide; or maybe it is the community that forms around a hApp. In the world we have now, the Wayback Machine is really useful. The value of HoloFuel and the popularity of hosting may be related, but there could also be a ‘retainer’ for hosting stale data.

To your posts you can add How to keep drafts / working copies / session variables?, which includes a ‘garbage collection’ idea.


Putting aside the reality of how Holochain works, here would be my expectations as a user.

I would say that unless the hApp design requires otherwise, agents should have (at least some) authority to decide the persistence duration of their data. I would probably find a default of one year relevant for most of my chat data, but would expect a longer default for my photography library or travel maps data.
Also, I would want « the system » to be smart enough to recognize which of my old data I keep coming back to and should keep longer, and which I may need to be asked about before complete removal. In some cases I’d be happy to receive a notification like « There’s a bunch of data you haven’t accessed in a while, do you want to mark it as obsolete / permanently delete it? ». I think having data first soft-deleted and only later completely removed would help prevent accidental loss of data.
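To make the expectation above concrete, here is a minimal sketch in plain Rust of such a retention policy: entries idle past a threshold become soft-delete candidates (to be confirmed by the user), and soft-deleted entries are only purged after a grace period. All type and field names are hypothetical, invented for illustration; this is not a Holochain API.

```rust
// Illustrative sketch (not Holochain code): a per-entry retention policy
// with a soft-delete grace period before permanent removal.

#[derive(Debug)]
struct StoredEntry {
    id: String,
    last_accessed_days_ago: u32,
    // None = live; Some(d) = soft-deleted d days ago
    soft_deleted_days_ago: Option<u32>,
}

struct RetentionPolicy {
    max_idle_days: u32, // propose soft deletion after this much inactivity
    grace_days: u32,    // purge this long after soft deletion
}

impl RetentionPolicy {
    /// Returns (soft-delete candidates, entries to purge permanently).
    fn sweep<'a>(&self, entries: &'a [StoredEntry]) -> (Vec<&'a str>, Vec<&'a str>) {
        let mut soft = Vec::new();
        let mut purge = Vec::new();
        for e in entries {
            match e.soft_deleted_days_ago {
                Some(d) if d >= self.grace_days => purge.push(e.id.as_str()),
                Some(_) => {} // still within the grace period; keep for now
                None if e.last_accessed_days_ago >= self.max_idle_days => {
                    soft.push(e.id.as_str()) // candidate: ask the user first
                }
                None => {} // recently accessed; keep
            }
        }
        (soft, purge)
    }
}

fn main() {
    // One year of inactivity before asking, 30 days of grace after soft delete.
    let policy = RetentionPolicy { max_idle_days: 365, grace_days: 30 };
    let entries = vec![
        StoredEntry { id: "recent_photo".into(), last_accessed_days_ago: 10, soft_deleted_days_ago: None },
        StoredEntry { id: "old_chat".into(), last_accessed_days_ago: 400, soft_deleted_days_ago: None },
        StoredEntry { id: "trashed".into(), last_accessed_days_ago: 500, soft_deleted_days_ago: Some(45) },
    ];
    let (soft, purge) = policy.sweep(&entries);
    assert_eq!(soft, vec!["old_chat"]);
    assert_eq!(purge, vec!["trashed"]);
    println!("soft-delete candidates: {:?}, purge: {:?}", soft, purge);
}
```

The two-phase design is what guards against accidental loss: nothing is ever removed in one step.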

I consider the subject matter of critical importance for the future of hApps or agent-centric applications in general and would love to see more discussion on how Holochain & DHT-based systems in general can deal with the issue.

PS: Thanks for your link @mikeg; unfortunately it seems I’m not able to edit my topics after some time or a certain number of replies. Probably a Discourse design feature.

Hi @nicolas. I like that you’re digging into some of the difficult design problems that Holochain brings up. To the ideas you linked to above, I’d add one that Arthur suggested to me: throwing away your old source chain and migrating to a new one. For apps where all data is on one single DHT, this approach has advantages over DHT migration, because a user can decide to throw away their old data whenever they like. (Migration, on the other hand, forces everyone to upgrade to a new DHT at roughly the same time.)

One thing that it doesn’t help with, though, is that the DHT still stores your old source chain data (well, the public bits anyway), so on a global scale it still means more hard drives manufactured and discarded over time.

The ‘Throwaway DHT’ idea could be combined with the ‘Throwaway Source Chain’ idea in an interesting way: each throwaway DHT is ‘owned’ by one person, and all other participants are just there because they’re interested in that person’s data. This could be used for social media feeds, for instance — when I follow someone, I join their DHT, get access to their data, and help replicate it. If they kill their DHT and create a new one, it still exists among all the followers, until eventually the data gets so old that nobody’s interested in it and they all leave. Automatic garbage collection! This is almost exactly how Dat (especially the Cabal chat app) and Secure Scuttlebutt work. You subscribe to the feeds you’re interested in, and help host them for friends-of-friends.
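The garbage-collection dynamic described above can be sketched in a few lines of plain Rust: a per-person DHT stays alive only as long as at least one interested peer keeps replicating it. The names here are invented for illustration; this is not Dat, Scuttlebutt, or Holochain code.

```rust
// Illustrative sketch: a personal "throwaway DHT" whose data survives only
// while followers replicate it. When the last replicator leaves, the data
// is effectively garbage-collected.

use std::collections::HashSet;

struct PersonalDht {
    owner: String,
    replicators: HashSet<String>, // followers currently hosting the data
}

impl PersonalDht {
    fn new(owner: &str) -> Self {
        Self { owner: owner.into(), replicators: HashSet::new() }
    }

    /// A follower joins the owner's DHT and helps replicate it.
    fn follow(&mut self, peer: &str) {
        self.replicators.insert(peer.into());
    }

    /// A follower loses interest and stops replicating.
    fn unfollow(&mut self, peer: &str) {
        self.replicators.remove(peer);
    }

    /// The data persists only while someone still replicates it.
    fn is_alive(&self) -> bool {
        !self.replicators.is_empty()
    }
}

fn main() {
    let mut feed = PersonalDht::new("alice");
    feed.follow("bob");
    feed.follow("carol");
    assert!(feed.is_alive());

    // Even if the owner abandons the DHT, it persists among followers...
    feed.unfollow("bob");
    assert!(feed.is_alive());

    // ...until the last one leaves: automatic garbage collection.
    feed.unfollow("carol");
    assert!(!feed.is_alive());
    println!("feed for {} collected", feed.owner);
}
```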

Re: how migration works, you can pretty much do anything you like. Here’s what it looks like from an app developer’s POV:

  1. DNA creator publishes an update.
  2. Agents decide they want to use the update.
  3. Agents terminate their source chain with a migration record pointing to the new DNA (this serves as a record that those agents have updated, so don’t expect any new data on this chain).
  4. In the new DNA, a migration record points to the old DNA as a signal that this is indeed the agent’s new presence in the app.
  5. A migration routine bridges to the agent’s old DNA instance and does whatever it likes with the source chain — leave it all behind (and anyone who still wants historical data can keep running the old DNA), copy everything to the new source chain, skip things that are deleted, etc.
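A rough sketch of step 5 in plain Rust may help: the migration routine walks the old source chain and copies forward everything that isn’t marked deleted, prefixed with a migration record pointing back at the old DNA (step 4). The record types here are invented for illustration and are not the actual HDK API.

```rust
// Illustrative sketch (not the HDK): a source-chain migration that skips
// deleted entries, per steps 3-5 above.

use std::collections::HashSet;

#[derive(Debug, Clone)]
enum ChainRecord {
    Entry(String),
    Deleted(String),                    // marks an earlier entry as deleted
    CloseChain { new_dna_hash: String }, // step 3: terminates the old chain
    OpenChain { prev_dna_hash: String }, // step 4: heads the new chain
}

/// Step 5: copy everything to the new chain except deleted entries.
fn migrate(old_chain: &[ChainRecord], old_dna: &str) -> Vec<ChainRecord> {
    // Collect which entries were marked deleted on the old chain.
    let deleted: HashSet<&String> = old_chain
        .iter()
        .filter_map(|r| match r {
            ChainRecord::Deleted(e) => Some(e),
            _ => None,
        })
        .collect();

    // The new chain opens with a record pointing back at the old DNA.
    let mut new_chain = vec![ChainRecord::OpenChain {
        prev_dna_hash: old_dna.to_string(),
    }];
    for r in old_chain {
        if let ChainRecord::Entry(e) = r {
            if !deleted.contains(e) {
                new_chain.push(ChainRecord::Entry(e.clone()));
            }
        }
    }
    new_chain
}

fn main() {
    let mut old = vec![
        ChainRecord::Entry("photo1".into()),
        ChainRecord::Entry("old_chat".into()),
        ChainRecord::Deleted("old_chat".into()),
    ];
    // Step 3: terminate the old chain with a migration record.
    old.push(ChainRecord::CloseChain { new_dna_hash: "dna-v2".into() });

    let new_chain = migrate(&old, "dna-v1");
    assert_eq!(new_chain.len(), 2); // OpenChain + photo1; old_chat skipped
    println!("{:?}", new_chain);
}
```

Anything else (keeping everything, or leaving it all behind) is just a different filter in the same loop.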

You could do something very similar with the ‘Throwaway Source Chain’ idea but keep it on the same DHT.

You’re probably recognising that each one of these techniques would be appropriate for some apps but not for others.

On top of any one of these patterns, you could probably build something that meets reasonable user expectations.
