Why is everything inside Elemental Chat being sorted in the client?

In this commit, for example, I can see that the sorting of messages is done inside the client.

What if we have a chat with something like 50k+ messages? Does this sort happen at every new signal? And does this imply that pagination is not a feature of RSM yet?

see if you can do what you need with the granularity of Path rather than pagination

path trees represent a more general approach than 'page'

see this conversation for an example - https://forum.holochain.org/t/how-to-structure-implement-a-sorted-linked-list-on-the-dht/4478/12


Hi @thedavidmeister, thanks for your answer. I tried to access the link, but it gives me a 404.

Oops! Looks like I'd have to make that conversation public.
Give me a min.


Quoting David,

So you can sort-of have your DNA programmed such that messages follow a time-ordered BST (Binary Search Tree) branch (not the whole tree, just a branch, or you may call it a linked-list; https://appliedgo.net/media/balancedtree/BinTreeShapes.png).

So for a simple scenario, the full-parsed-path may look like this:
public_posts.post_xxx(the-entry-hash-of-the-post-you're-interested-in-reading).messages.msg_xxx(the-entry-hash-of-the-latest-message; if you follow the hash, you'll find the message "Thank you so much; issue resolved").msg_xxx(the-previous-post; follow to discover "Try rebooting the system")… and so on…

It works great for messages, 'cause you're mostly interested in the latest messages and don't have to follow a long trail just to read the latest ones. Linking a new post is trivial; only takes a few calls to create_link() and delete_link(). Moreover, it mimics the order in which messenger clients have to call get_previous_message() when the user keeps scrolling up to reach and read some arbitrarily old message.

However, I generally don't like using the pure-path method in which the full-traversed-path may look like:
public-posts.Windows-10-broke-again.messages.Thank-you-so-much.Try-rebooting.

I prefer the linked-list workaround hack instead.

Anyways, that's all I have to add…


blogs pretty commonly create urls like posts/2020/01/06/my-blog-post, and a similar thing happens for the media files they save on the file system, to prevent the scenario where you end up with 50k images in a single directory

Path tackles two problems:

  • 'hot spots' where the base of all links in a collection falls unfairly on a few agents
  • more efficient lookups than fetching all data in a collection

the first is solved by just doing something with hashes but the second needs something domain specific

examples:

  • dates like 2020.01.07
  • longitude/latitude
  • usernames
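
Taking the date example, here is a minimal write-side sketch of what that granularity could look like with the hdk hash_path module (API names approximate for hdk 0.0.100-alpha.1; the "messages" path scheme, tag, and function name are hypothetical, not from the thread):

```rust
use hdk::prelude::*;

/// Hypothetical: link a new message under a day-grained path like
/// "messages.2020.01.07" so no single base holds the whole collection.
pub fn index_message_by_day(
    msg_hash: EntryHash,
    year: u32,
    month: u32,
    day: u32,
) -> ExternResult<HeaderHash> {
    // Build the path for this day; ensure() creates intermediate nodes too.
    let path = Path::from(format!("messages.{}.{:02}.{:02}", year, month, day));
    path.ensure()?;
    // Link the day node to the actual message entry.
    create_link(path.hash()?, msg_hash, LinkTag::new("message"))
}
```

Each day becomes its own link base, so the collection is spread across many neighbourhoods instead of one.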

the limitation of pagination is that you need to page over some field

https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/guides/working-with-timelines

e.g. twitter uses since_id

so the question for me is: after you solve the hot spot problem, which implies data is being spread across many agents, how would 'pages' be created


This post is still 404'ing for me. Can you guys open it up? I'd love to take a look.

which post?

This one:

https://forum.holochain.org/t/how-to-structure-implement-a-sorted-linked-list-on-the-dht/4478/12

aaah, that is because it was a private message that i thought was a public forum thread haha

Haha, np, just grant me a cap token to view the thread :wink:

it's not transferrable :wink:

ask @The-A-Man for what could be shared publicly

done. (would do all talks publicly from now on… :-))

Sorting the data ad-hoc at the client is a terrible idea; it may work in the short term, but beyond that, for any real-world use-case, I'd highly recommend against it…

Also, @kristofer has beautifully explained something along those lines here at:

what is 'the client' where there are no servers?

someone needs to do the sorting. if you foist off heavy sorting 'to the DHT' then someone is going to be asking you to do the same at some point for them.

try to create some bounds on how much data you're sorting at once by building granularity into large sets of data - https://docs.rs/hdk/0.0.100-alpha.1/hdk/hash_path/index.html

even better, try to include rough ordering into the granularity itself
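
A minimal read-side sketch under the same assumptions (hdk 0.0.100-alpha.1-era API; hypothetical path scheme continued from above): fetching one day's bucket bounds the amount of data sorted at once, and the date components themselves carry the rough ordering.

```rust
use hdk::prelude::*;

/// Hypothetical: fetch only one day's worth of messages instead of the
/// whole collection. The day path bounds how much we sort at once, and
/// the lexical ordering of date components gives a rough time ordering.
pub fn messages_for_day(year: u32, month: u32, day: u32) -> ExternResult<Vec<Link>> {
    let path = Path::from(format!("messages.{}.{:02}.{:02}", year, month, day));
    // Only the links hanging off this one day node are fetched.
    let links = get_links(path.hash()?, None)?;
    Ok(links.into_inner())
}
```

To enumerate buckets rather than guess them, the hash_path module also exposes the children of a path node, so a UI can walk down from year to month to day.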

"The client" in the sense that the UI consumes the DNA, so it is in essence a client to the DNA. Of course, there are no clients and servers, but what better name to give the UI (a PWA, SPA, Electron app, a Windows binary, or simply a CLI interface) than a client?

what is the main concern re: sorting in the UI vs. the wasm?

At https://docs.rs/hdk/0.0.100-alpha.1/hdk/hash_path/index.html,

  • Holochain neighbourhoods (who needs to hold the data) center around the content address so the poor nodes closest to "chat-messages" will be forced to hold all messages (DHT hotspots)

Needs correction.
It should rather say,

"the poor nodes closest to 'chat-messages' will be forced to hold the links to all messages"

They won't be storing all messages, obviously! If that were the case, we'd be calling it a "fatspot".
It is a hotspot in the sense that almost all users trying to read messages will have to do so through the "all-messages" entry; i.e., all users would, all day long, be calling get-links on that one single entry (the "anchor" entry that simply says "all-messages").

Correct me if I'm wrong though (in which case it would mean that I've been wrong all this time…) :-(

[Indeed, one can argue that holding the links (i.e., the hashes plus some Holochain link-related data) to all messages is kinda the same as holding all the entries themselves, especially from a storage perspective, and I agree… It is indeed a hell of a bulk load; but it's not the same as holding all the entries themselves.]


Will be waiting till next noon… My technical doubts regarding your tree vs. linked-list approach are all that remain to be sorted… Would love to finally have that clarified too.

Ideally the tree structure embeds some domain specific granularity

Examples of granularity include:

  • Orders of magnitude

try to find a way to create buckets of granularity

(That way) we never need to fetch all messages because we can start as deeply down the tree as is appropriate
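
For the "orders of magnitude" idea specifically, here is a hedged sketch (plain Rust; all names hypothetical) of deriving a bounded, roughly ordered bucket name from a value:

```rust
/// Hypothetical helper: map a price in cents to an order-of-magnitude
/// bucket, e.g. 7_350 -> "1000-9999". Every price lands in exactly one
/// bucket, and buckets sort roughly by magnitude.
fn magnitude_bucket(price_cents: u64) -> String {
    if price_cents == 0 {
        return "0".to_string();
    }
    // Number of decimal digits minus one = the order of magnitude.
    let exp = (price_cents as f64).log10().floor() as u32;
    let low = 10u64.pow(exp);
    let high = low * 10 - 1;
    format!("{}-{}", low, high)
}
// A full path could then look like "cars-for-dollars.1000-9999".
```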

I get it. Makes perfect sense.

Imagine a clutter-twitter app, where people post tweets, and users should be able to see the latest tweets. It has no concept of Profiles or whatever. Everyone kinda posts in a universal feed.

Given a tree-based ("Path"-based, in HDK's terminology) solution, users who want to read last hour's posts can simply retrieve the entry for "1/3/2021: 23.00 to 24.00", do a get-all-links on it, and get all the messages posted in that hour (assuming the granularity of partitioning/pagination is 1 hour).

But the problem is that not all use-cases that end-developers ((h)app developers) have exhibit that sort of granularity; they simply don't fit in that domain. Thus, the tree-style solution fails them (as it once failed me)…


Imagine that fresh Ford Model 1 cars are on sale, and that buyers have to purchase them in exchange for US Dollars. And, for this specific use-case, imagine that no single manufacturer is churning out those cars; rather, people who have a 3D-printer are printing those cars (at home) and offering them for sale at an arbitrary price that they think their hard work is worth; this transaction has to take place on a Holochain app (Yippee!!!).

Here's the catch: the user has absolutely no idea what a Ford Model 1 should be worth.

One can see that a tree-based solution won't be of any help here… For example, you may store an "all-cars-for-dollars" entry, and link it to "all-cars-for-0-to-100-dollars", "all-cars-for-100-to-200-dollars", "all-cars-for-500-to-600-dollars" and so on… Further down, one may have other subtree-roots that say "all-cars-for-100-to-150-dollars" and so on…

Now what can you, as a rational user, query this (h)app for so as to strike the best deal? Nothing! You're absolutely hopeless with such an architecture…

You can make a guess and search for the subtree-root (i.e., an entry) that says "all-cars-for-0-to-100", in which case you'd just be right on the money (meaning very lucky). But what if those 3D printers cost a billion dollars each? Well, then you'd forever be searching in the wrong places! Plus doing a get-all-links on the rootest-root isn't an option ('cause what would be the point then; I mean, that way we'd be back to the same problem we started with!)


Imagine another architecture for organizing such diverse prices on the DHT.

There's an entry that says "best-deal", to which is linked the Ford Model 1 that's up for sale for the least amount of money. That car (or rather, the entry of that car) further links to the next-best deal, and so on…

The user gets() the "best-deal" entry, gets its link (tagged "next"), and guess what? He's found the cheapest Ford Model 1!
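
A sketch of that traversal, with the usual caveats (hdk 0.0.100-alpha.1-era API; the "best-deal" anchor and "next" tag are this thread's hypothetical scheme, not a library feature):

```rust
use hdk::prelude::*;

/// Hypothetical: walk the "best-deal" linked list, cheapest first,
/// collecting up to `limit` car entry hashes.
pub fn cheapest_cars(limit: usize) -> ExternResult<Vec<EntryHash>> {
    let anchor = Path::from("best-deal");
    let mut current = anchor.hash()?;
    let mut found = Vec::new();
    while found.len() < limit {
        // Each node holds at most one "next" link to the next-best deal.
        let links = get_links(current, Some(LinkTag::new("next")))?.into_inner();
        match links.into_iter().next() {
            Some(link) => {
                found.push(link.target.clone());
                current = link.target;
            }
            None => break, // end of the list
        }
    }
    Ok(found)
}
```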

[Replace Ford Model 1 with, let's say, Starbucks coffee-coupons, and US Dollars with Amazon-coupons, and you've got the Vril's token exchange system. Just as a side-note…]

[Note that for this use-case, new sellers might have to be placed anywhere within the list… However, for the use case of messages, this is a great approach; in fact, I'd say even better than the HDK::Path-based approach. The "Latest-Message" entry would, upon the arrival (i.e., the posting) of a new message, simply delete the previous link that it held and create a new one that points to this newer message, while the new message would now (and forever from then on) hold the link to the last latest message (the previous latest message, in other words). This way, a user can read the latest message, the one posted before it, the one posted before that one, and so on… Trust me, nobody reads the messages posted on "Tuesday, July 21st 2019"; they simply scroll up and up until they reach what they had been looking for. And my experience with distributed databases has taught me to always "store data the way it would be retrieved", i.e., in a time-sorted "latest-first" manner, in this example.]
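
A sketch of that relinking, under the same assumptions (approximate hdk 0.0.100-alpha.1 API; the "latest-message" anchor, tags, and field names are hypothetical or version-dependent):

```rust
use hdk::prelude::*;

/// Hypothetical: make `new_msg` the new head of the "latest-first" list.
pub fn push_latest(new_msg: EntryHash) -> ExternResult<()> {
    let head = Path::from("latest-message");
    head.ensure()?;
    let head_hash = head.hash()?;
    let existing = get_links(head_hash.clone(), Some(LinkTag::new("latest")))?.into_inner();
    if let Some(prev) = existing.into_iter().next() {
        // The new message points back at the previous latest message...
        create_link(new_msg.clone(), prev.target, LinkTag::new("prev"))?;
        // ...and the old head link is marked deleted (not purged!).
        delete_link(prev.create_link_hash)?;
    }
    // The anchor now points at the newest message only.
    create_link(head_hash, new_msg, LinkTag::new("latest"))?;
    Ok(())
}
```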


One doubt I have is whether deleting and creating (or simply updating, as updates are nothing but sequential deletes and creates) the link on the "latest-message" entry would be an efficient thing to do, especially from a storage perspective. Because there would always be some remnant leftovers (such as the link-headers, etc.)… But, from a storage-optimization perspective, your Tree solution also ends up creating a great many "roots" and "sub-roots", such as "all-cars-for-0-to-100-dollars", which do not hold any valuable information in and of themselves… Plus, you've already assured me that it's best to ignore such silly storage-costs (How exactly does holochain support micro-payments?).

As for the hotspot thing, then yeah, I would mostly be hitting the "latest-messages-between-the-a-man-and-david" entry (substitute the string with a well-formatted chat-room entry that stores the hashes of the participants and not their names; you get the idea…) because that's what I care about (i.e., I care about the latest messages). But note that it always has just one link. Plus it (or rather, the neighbourhood of that entry) would only be receiving get() requests from me and you; not that hot a spot. Even if the whole world were in that room, the same argument would apply to, let's say, a famous person's tweet/message when everyone wants to read it at the same time (i.e., it would incur a huge traffic influx on that neighbourhood, enough to make their devices' wifi glow red hot).


Interesting points, although I think the linked-list approach has too many issues to be really viable.

So, starting with write costs: every action (create_link, create_entry, delete_entry…) creates a number of DhtOps (1 DhtOp = 1 thing that has to change in the DHT). Doing too many of these is bad not so much because of storage, but because of gossip and processing. You need to be gossiping at all times with your neighborhood to guarantee that all DhtOps stay online, and recruit new nodes if that doesn't look likely. Those new nodes are going to need to validate it… Quite a bit of background processing. Here is how many DhtOps every action creates:

  • create_entry: 3
  • update_entry: 5
  • delete_entry: 3
  • create_link: 3

So yes, be careful when deleting things. Also, they are not really deleted and cleaned away from the DHT; they are kept there for integrity and only "marked deleted". So you actually gain no storage benefit.

Also, nodes or entries cannot really "react" to a DhtOp being validated and held in the DHT by themselves; it has to be one of the agents who actively takes an action after that happens. You'd have problems otherwise… Which of the 25 nodes holding the entry executes the action? How do validation rules work in that sense? What about partitions…?

About the querying of the linked list… How is it going to maintain order? Let's say that the cheapest one is $50, and I have to commit one at $100: do I need to go over the whole list just to find the one I need to link from? What if just then the $50 one is deleted? What if two people commit two different $50 entries at the same time, or while being in different partitions?

While interesting and maybe useful in specific cases, in general I think this kind of approach does not really play well with the CRDT behaviour in Holochain, in which the data structure doesn't depend on the "order of arrival" of events.

Another thing is that the UI may very well obscure the details of querying for paths. I think it's already been 3 projects in which we've done time indexing via trees of paths, and I think only one of them had an explicit time-range filter; the others were just infinite-scrolling UX patterns like the ones you are describing.
