How do I avoid duplication of entities when updating them, and retrieve only the latest instances?

I have this:

    struct MyEntity {
        name: String,
        age: u64,
    }

Then I do this in js tests:

let { Ok: addr1 } = await alice.call("my_app", "create_my_entity", { ....});

//1
let res1 = await alice.call("my_app", "update_my_entity", {addr: addr1, age: 33});
let { Ok: addr2 } = res1;

//2
let res2 = await alice.call("my_app", "update_my_entity", {addr: addr2, age: 23});
let { Ok: addr3 } = res2;

//3
let res3 = await alice.call("my_app", "update_my_entity", {addr: addr3, age: 11});

Now this

let myEntities = await alice.call("my_app", "get_all_entities", {  });

will return a vector containing the same MyEntity once for each time I've updated it (3 updates) plus 1 instance for the creation. The only difference is that the Addresses will be different.

If I had, say, 5 different MyEntity instances and updated them several times, the function would return 5 * times_each_one_was_updated results. That isn't what I need; I'd need only 5.

"get_all_entities" is implemented as `hdk::query("my_entity".into(), 0, 0)`

Q: How can I get it to return only the latest instance/version of each MyEntity?
In other words, after I update an Entity, it returns a new address and a new, updated record gets created, right? How can I retrieve only the updated, latest record of each individual Entity?

https://developer.holochain.org/api/latest/hdk/api/fn.get_entry.html

In my experience, that's exactly what get_entry does: it returns the latest updated version of the entry.

You’d need to share the handler you’ve used for the get_all_entities zome function

There's no get_entry(...) call in my code.

Use hdk::get_entry rather than hdk::query

Note: hdk::query only queries an agent’s local source chain, and as @marcus implies, it will get all entries with the same entry type, regardless of whether they were initial entries or updates. hdk::get_entry will query your source chain and the DHT for only one entry by its hash — if the hash you give is an old entry, Holochain will automatically follow the chain of updates until it gets to the newest entry. That is, if you:

  1. Store an entry
  2. Update it twice
  3. Call a theoretical get_my_entity zome function that uses hdk::get_entry three times, once with addr1, once with addr2, and once with addr3

it would always return the newest entry.
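For illustration, that theoretical get_my_entity could be as small as this (a sketch against the legacy holochain-rust HDK; the function name is the hypothetical one from the list above):

    fn get_my_entity(addr: Address) -> ZomeApiResult<Option<Entry>> {
        // Look the entry up by its hash. If addr points at an old revision,
        // get_entry follows the chain of updates and returns the newest
        // version; it returns None if the entry has been deleted.
        hdk::get_entry(&addr)
    }

Calling it with addr1, addr2, or addr3 would hand back the same, newest entry each time.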

So how do you implement a get_all_entities function? It depends; do you want to get all entities from the user’s source chain or from all users across the DHT?

I don’t need one entry. I need all entries.

I need a list of "my_entities". All of them. I don't need a single entity. How will "get_entry(…)" help me? Show me.

I’ve shown that.

@alx I think we need more clarifying context to determine what these functions are trying to do. One question, and one clarification:

I need a list of "my_entities". All of them.

From across the DHT or just one agent’s chain?

So how do you implement a get_all_entities function?

I’ve shown that.

Sorry, lack of clarity on my part. That was meant as a hypothetical question. What I should have said was "Given how hdk::query and hdk::get_entry work, how would we reimplement get_all_entities to work the way you expect? Well, it depends…"

I don't know which. I call get_all_entities() as a zome function from the JS tests. I guess I'll need only the agent's own entries for now, to make sure everything works via the JS test. How do they differ? @pauldaoust

I know the end result I need:

Suppose I had a list of articles for a blog, with Article being the Entity.

What I want is a list of those Articles to show on the main page of the blog. That's it.
I don't need multiple versions of each Article, only the latest version of each.

One agent’s chain will typically only hold their own actions in a network. (The exception is a countersigned transaction, which reflects another agent’s participation too.) The DHT holds copies of all agents’ public chain entries and is a global view of the entire community of participants.

So it depends on what you want: when a user's UI calls get_all_entities(), does the user expect everyone's blog articles or just their own? I suspect that this decision is important for ensuring that your JS test is testing something relevant to a business goal.

Here’s how querying a local source chain differs from getting an entry from the DHT:

| hdk::query() | hdk::get_entry() |
| --- | --- |
| Retrieves my own data only | Retrieves my data and everyone else's |
| Can retrieve multiple entries at once (data is local, so lookups can scan the entire data set quickly) | Can only retrieve one entry at a time (data is remote in unstructured hash space; lookups can only be done if you know the hash of the entry) |
| Retrieves deleted and obsolete entries | Traverses revision history to the newest version of the entry (by default; behaviour can be configured to retrieve deleted/obsolete entries too) |
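In code, the contrast looks roughly like this (a sketch against the legacy holochain-rust HDK; some_address is just a placeholder for a hash you already know):

    // Local source chain only: one Address per "my_entity" commit I have
    // ever made, including obsolete revisions.
    let my_commits = hdk::query("my_entity".into(), 0, 0)?;

    // Local chain or DHT: exactly one entry, looked up by hash and resolved
    // to its newest version (None if it has been deleted).
    let newest = hdk::get_entry(&some_address)?;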

Now, you can do something very similar to hdk::query() on the DHT. First, a bit of background: if the DHT is just an unstructured, unqueryable hash space spread across many machines, you need to know the exact location of any entry in order to retrieve it. So how the heck do you do anything useful? You can link data together. Here’s a writeup on how links on the DHT work and why they’re necessary: https://hackmd.io/ZEOwR3aIQN-971lGYWo-zw?view

One thing the article doesn't talk about is link 'tags', arbitrary bits of content attached to links that let you do ad-hoc queries. The query language is pretty primitive right now — just exact matches and regexes — but it allows you to filter results or prefetch important bits of an entry without having to do 1+n lookups (which could get expensive on a DHT).

My suspicion is that, in most cases, you'll want to show your own articles and other people's articles with the same function. So perhaps your 'get all' function would have this signature:

fn get_all_blog_posts(author_address: Address) -> ZomeApiResult<Vec<BlogPost>>

and would work something like this (see the sketch after the list):

  1. Get all the links of type author_to_blog_post that are attached to the author’s address.
  2. Map over all those links, calling get_entry to get the blog content. Two notes:
    • This will by default get only the newest copy, regardless of whether it’s your article in your source chain or someone else’s article on the DHT.
    • This is an expensive 1+n query, so if all you’re generating in the UI is a list, you might instead want to consider including everything you need for a summary in the link tag. Something like:
      {"title":"How to feed a bison","publish_date":"2019-08-27T15:36:00"}
      
      One thing I don’t like about hdk::update_entry() is that its abstraction isn’t very clean — in the case of links, those links don’t get updated when the target entry gets updated. They still point to the old entry, which is fine if you’re calling get_entry() which follows the update chain, but not if you’re just relying on the link tag to still hold relevant information. If you want the link tag content to stay fresh, it’s up to you to delete the old link and create a new one on entry update.
  3. Return the blog posts (or just their summaries).
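Here's a sketch of how that could look, assuming the legacy holochain-rust HDK with a link type named author_to_blog_post, a BlogPost entry type that derives DefaultJson, and LinkMatch in scope (older HDK releases took Option<String> filters in get_links instead):

    fn get_all_blog_posts(author_address: Address) -> ZomeApiResult<Vec<BlogPost>> {
        // 1. All author_to_blog_post links hanging off the author's address.
        let links = hdk::get_links(
            &author_address,
            LinkMatch::Exactly("author_to_blog_post"),
            LinkMatch::Any,
        )?;

        // 2. Resolve each link target. get_entry follows the update chain,
        //    so this yields the newest revision of every post, whether it's
        //    mine on my source chain or someone else's on the DHT.
        let posts = links
            .addresses()
            .iter()
            .filter_map(|addr| match hdk::get_entry(addr) {
                Ok(Some(Entry::App(_, value))) => BlogPost::try_from(value).ok(),
                _ => None,
            })
            .collect();

        // 3. Return the blog posts.
        Ok(posts)
    }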

Suppose there was only a single author on the blog: me. No other authors could post there, and they couldn't sign up for the blog either. How would I retrieve the list of all the articles?

@pauldaoust

There are two ways you could do this. There’s an advantage to querying your own source chain because you know all the data you want is there, so it’s fast. But querying doesn’t automatically respect updated/deleted status, so you’d have to:

  1. Call hdk::query() to get the addresses of all the entries you want.
  2. Map over each result, calling hdk::get_entry() on each.
  3. hdk::get_entry() gives back an empty result for deleted items and follows the update chain for modified items, so filter out all duplicates and empties.

I’ll try to.

After filtering out the empty items, all duplicated items will have (a) identical values in their fields and (b) different Addresses/HashStrings,

right?

How will I find out which item/Address is the most recent one?

They should have identical hash strings, because get_entry will have followed the update chain to the newest version. Oh hey! I have an idea for streamlining this even further:

  1. When you're mapping over the addresses to turn them into entries, check the address of the entry you just received against the address you were looking for. If they're different, that means the original address is for an old version, and you can just return an empty result, like get_entry does for a deleted entry. Here's an example of a closure you could use in a mapping function:
    |addr| {
        // Follow the update chain to get the newest version of this entry.
        let latest_entry_result = hdk::get_entry(&addr);
        match latest_entry_result {
            Ok(Some(entry)) => match hdk::entry_address(&entry) {
                // The address of the entry we received matches the address of
                // the entry we asked for, so it's already the newest version. Return it.
                Ok(latest_addr) if latest_addr == addr => Some(entry),
                // The addresses don't match; that means the entry has been
                // updated. Don't include it, because the newer version would be a duplicate.
                _ => None,
            },
            // Either a None (entry has been deleted) or an error. Return None.
            _ => None,
        }
    }

(Note, I haven't actually tried to compile this code, so there are probably syntax errors.)
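For completeness, here is roughly how that closure could slot into get_all_entities, again as an untested sketch against the legacy holochain-rust HDK:

    fn get_all_entities() -> ZomeApiResult<Vec<Entry>> {
        // 1. Addresses of every "my_entity" commit on the local source chain.
        let addresses = hdk::query("my_entity".into(), 0, 0)?;

        // 2 & 3. Resolve each address, keeping an entry only when the address
        // we asked for is still the newest one; deleted entries and old
        // revisions both fall out as None.
        let entities = addresses
            .into_iter()
            .filter_map(|addr| match hdk::get_entry(&addr) {
                Ok(Some(entry)) => match hdk::entry_address(&entry) {
                    Ok(latest_addr) if latest_addr == addr => Some(entry),
                    _ => None,
                },
                _ => None,
            })
            .collect();

        Ok(entities)
    }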

By hash strings, do you mean their addresses? No, all the addresses were different. But all other fields were identical; they had the latest updated values.

Hm, do you mean after you incorporated that mapping function I gave? It should be:

  • different addresses = the entry address received from query is an old one; return an empty value
  • same addresses = the entry address received from query is the newest one; return the value received from get_entry