Problem that may occur when creating a link between two entries that have only just been created in the DHT

Tl;dr: An error is returned when an agent tries to make a link between two entries while that agent doesn't hold the dependency yet. The error is below. Is it possible to have a retry mechanism for create_link when the dependency hash is not yet held but can be assumed to be held at some point in the future?

{type: "internal_error", date: "Source chain error: InvalidCommit error: The dependency AnyDhtHash(uhCEk...) is not held"

Hello! As we were testing Kizuna’s group messaging over the internet, we found an interesting bug that I thought would be nice to share here as well. At the end of this post, there is a question/request that I think may be worth pondering.

Note:

  • In our manual test over the internet, we used an AWS EC2 instance (ironic but it will do for now :sweat_smile:) for the proxy server, but that is not shown in the diagram so as not to overcomplicate it.
  • Charlie’s part is omitted since it is identical to what Bob is doing here.
  • Number of agents in the network: 3

Each zome function in the diagram is simplified here as well to focus on the matter at hand. However, if anyone wishes to look deeper into the architecture of each zome call, you can PM me :laughing:

  • send_group_message() → sends a message (text/file/media/etc) to a group
  • read_group_message() → marks a message as read by the caller by creating a link between the message and the caller’s AgentPubKey, and lets the other members of the group know that you read the message too (see the sketch right after this list)
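
To make the read path concrete, here is a rough sketch of the shape read_group_message() could take. This is not Kizuna’s actual code — get_group_members() and the two-argument signature are made up for illustration, and it follows the HDK of that era, where create_link() takes a base, a target and a tag:

fn read_group_message(message_hash: EntryHash, group_hash: EntryHash) -> ExternResult<HeaderHash> {
  // assuming `use hdk::prelude::*;`
  // Mark the message as read: link message -> reader.
  let me: EntryHash = agent_info()?.agent_latest_pubkey.into();
  let link_hash = create_link(message_hash, me, LinkTag::new("read"))?;

  // Let the other group members know that we read it.
  let members = get_group_members(group_hash)?; // illustrative helper
  remote_signal("read", members)?;

  Ok(link_hash)
}

It is the create_link() on the message EntryHash that fails when that hash is not held yet.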

Apologies in advance if the diagram is small! I tried my best to make it larger :sweat_smile: (please zoom in if it is too small)

I can confirm that this error is coming from the zome function read_group_message(), and the only API call that seems able to return this error is create_link(). I also confirmed that the missing dependency is the EntryHash of the group message, which is used as the base of the link.

I am not 100% certain as to why this is happening, but my hunch is that since the DHT is eventually consistent, when Bob tried to make a link between the message EntryHash and his AgentPubKey, he could not find the EntryHash because it was still in the process of being published to the DHT (to be more precise, Alice was still telling Bob and Charlie to hold the GroupMessage entry for her). However, Bob could already make the create_link() call between the GroupMessage EntryHash and his AgentPubKey because he had already received the hash of the GroupMessage through remote_signal() from Alice beforehand!

Even though this was a bug on our end, I am honestly really amazed at how distributed networks behave, and it shows that the various components of Holochain are really working!

Now, given that other Holochain projects may potentially do something similar, I have a few questions/suggestions:

  • What does "the dependency is not held" really mean?
  • Is it possible to have a retry mechanism for create_link when the dependency hash is not yet held but can be assumed to be held at some point in the future?

Related issue: create_link fails in read_group_message when bob calls it right away after receiving the message from alice through remote_signal · Issue #54 · hc-institute-japan/kizuna · GitHub

ping @guillemcordoba @pauldaoust @thedavidmeister @Connoropolous


After some more testing, I found that even when testing locally on the same computer with 3 agents, if Alice sends a large file (13 MB in this case), Bob’s and Charlie’s consoles will still emit the error stated above. In send_group_message, we have two create_entry() calls if the payload is a file (one for the metadata of the file and another for the actual bytes of the file). I’m now starting to wonder whether remote_signal is being emitted even before the create_entry() calls have fully finished, which would cause the same problem as above?
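
For context, the file branch of send_group_message() does roughly this — a sketch only; FileMetadata, FileBytes and commit_file_payload() are not Kizuna’s actual types or names (entry_defs and error handling elided):

#[hdk_entry(id = "file_metadata")]
struct FileMetadata { file_name: String, size: usize }

#[hdk_entry(id = "file_bytes")]
struct FileBytes { bytes: Vec<u8> }

fn commit_file_payload(metadata: FileMetadata, bytes: FileBytes) -> ExternResult<EntryHash> {
  // Two commits when the payload is a file: one for the metadata...
  create_entry(&metadata)?;
  // ...and one for the actual bytes (13 MB in the failing test), which is what
  // makes the publish slow enough to lose the race against remote_signal.
  create_entry(&bytes)?;
  hash_entry(&metadata)
}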

This is really interesting; it shows the level of depth that you are reaching in Kizuna, props for that.

I don’t know the full answer to this. What I do know is that create_link actually does a get on both the base and target entries to be able to execute the validation function. I think you are on the right track, and what is happening is that retrieve_entry is trying to fetch an entry that hasn’t been published to the DHT yet.

The call zome workflow works like this:

  1. The conductor receives a call zome function request for a cell.
  2. The conductor creates a context (a copy of the state of that cell) for that zome call.
  3. The conductor executes the call → this is where the happ code gets executed.
  4. After the call, it checks the context to see whether new elements were added (these are not yet committed to the source chain).
  5. If there are new elements, it validates them → this is where it’s trying to do the retrieve_entry.
  6. If there is any error, it returns the error, discards the context and doesn’t commit anything to the source chain.
  7. If there isn’t any error, it returns the result coming from the zome function.

Maybe the best option is to do it like this:

fn read_message(message_hash: EntryHash, group_hash: EntryHash) -> ExternResult<()> {
  let participants = get_participants(group_hash)?;

  // This executes immediately, without waiting for the zome fn to return!
  remote_signal("read", participants)?;

  // Poll until the message can actually be retrieved from the DHT.
  let mut entry = get(message_hash.clone(), GetOptions::latest())?;
  while entry.is_none() {
    entry = get(message_hash.clone(), GetOptions::latest())?;
  }

  let my_pub_key: EntryHash = agent_info()?.agent_latest_pubkey.into();
  create_link(message_hash, my_pub_key, LinkTag::new("read"))?;

  Ok(())
}

This way you have both immediate feedback and the wait until the message arrives before creating the link. Note that here it won’t be useful for the UI to wait for the promise of this call, since it could potentially block for many seconds.

Note that the “while get” could get resource intensive until we have sleep working.
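
In the meantime you can at least bound the loop — a minimal sketch, assuming you are okay with giving up after a fixed number of attempts (the cap is an arbitrary choice):

// Sketch: bound the busy-wait with a retry cap so a base that never shows up
// doesn't spin forever.
fn get_with_retries(hash: EntryHash, max_attempts: u32) -> ExternResult<Element> {
  for _ in 0..max_attempts {
    if let Some(element) = get(hash.clone(), GetOptions::latest())? {
      return Ok(element);
    }
  }
  Err(WasmError::Guest(format!(
    "dependency {:?} still not held after {} attempts",
    hash, max_attempts
  )))
}

Then read_message would call get_with_retries(message_hash.clone(), 30)? instead of looping forever.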


Ahhhhh you are awesomely awesome!!! @guillemcordoba

Okay, so after some more digging following your ultra-helpful insights, I found that Holochain was also returning this error:

[1] Jun 16 23:31:32.670 ERROR holochain::core::workflow::sys_validation_workflow: msg="Direct validation failed" element=Element { signed_header: SignedHeaderHashed { header: HoloHashed(CreateLink(CreateLink { author: AgentPubKey(uhCAkKn38_bPKJAYc8fcBuYNnAzym8qoE4p0Lj-LgKtxQJNmaco02), timestamp: Timestamp(2021-06-16T14:31:32.666455Z), header_seq: 16, prev_header: HeaderHash(uhCkktV7rcv8GQi6GYHIwOvPCQx0Z1Q-bjFuX1aI6Penhg6HNTGGN), base_address: EntryHash(uhCEkbgwqyg2qEJ1jG2ZvmMeE5Dr4eJQpqsrKRq_bedwwNxQE7tpj), target_address: EntryHash(uhCEkKn38_bPKJAYc8fcBuYNnAzym8qoE4p0Lj-LgKtxQJNmaco02), zome_id: ZomeId(5), tag: LinkTag([114, 101, 97, 100]) })), signature: [147, 43, 84, 118, 50, 147, 170, 77, 62, 6, 147, 212, 46, 140, 247, 175, 225, 158, 232, 163, 50, 135, 128, 179, 125, 219, 253, 156, 236, 190, 71, 160, 27, 222, 228, 205, 9, 38, 129, 108, 64, 218, 85, 36, 144, 38, 63, 216, 42, 209, 210, 250, 100, 94, 109, 100, 72, 204, 104, 152, 15, 61, 57, 4] }, entry: NotApplicable }

I searched for "Direct validation failed" in crates/holochain/core and found that this is coming from here, which is when the validation is running, as you said here.

And given that the conductor is returning this error as I mentioned in the OP, I searched for any place within sys_validate_element_inner() that returns NotHoldingDep(AnyDhtHash) and found a relevant one in check_and_hold(), which is called from check_and_hold_any_store_entry(), which is in turn called from register_add_link()!!
In check_and_hold(), retrieve() really does end up calling get() when I dug deeper. Given this whole block of code, it seems that the base address (in my case the GroupMessage EntryHash that was just committed to the DHT, which is the argument to check_and_hold_any_store_entry() and is then passed on to check_and_hold()) is used to retrieve the Element from the DHT, which is then not found, since Alice only just committed the GroupMessage element to the DHT!!

Unless I am utterly missing some important points, I think I was able to confirm and fully understand the problem.

Aaaaaaand, thanks to your suggestion, I have implemented this here and it definitely works, which strengthens the validity of the diagnosis of the problem!

Wow, I think you just helped me learn how to diagnose problems in a hApp at a much deeper level. This was a really fun process!! Thank you soooo much.

Lastly, am I on the right track in thinking that when sleep comes in, we can have some sort of exponential back-off retry mechanism on the application code (guest) level for get to avoid being resource-intensive?
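
Something like this is what I have in mind — a sketch only: sleep is hypothetical, it doesn’t exist as an HDK host function yet, and the initial delay and cap are arbitrary:

// Hypothetical: `sleep` is not a real HDK host function yet; this is only the
// shape an exponential back-off retry could take once something like it exists.
fn get_with_backoff(hash: EntryHash, max_attempts: u32) -> ExternResult<Element> {
  let mut delay_ms: u64 = 100; // arbitrary initial delay
  for attempt in 0..max_attempts {
    if let Some(element) = get(hash.clone(), GetOptions::latest())? {
      return Ok(element);
    }
    if attempt + 1 < max_attempts {
      sleep(std::time::Duration::from_millis(delay_ms))?; // hypothetical host fn
      delay_ms *= 2; // back off exponentially
    }
  }
  Err(WasmError::Guest(format!(
    "{:?} still not held after {} attempts",
    hash, max_attempts
  )))
}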


Exactly to everything you said. This is fun!
