Holochain Forum

Avoiding Networking Issue For Multiple Nodes On 1 Device

related: How to run multiple agents and let them interact locally using `hc`?

While trying to run multiple nodes on my one computer to test networking during development, I hit an issue that kept me stuck for a long time. I was having trouble with hc sandbox, so I wanted to use the holochain binary directly. I'm writing this up to save my own time, and the time of others attempting the same thing.

The NOT WORKING configuration

If I want to run Holochain via the holochain binary, I run:
RUST_LOG=info holochain -c conductor-config.yml

Here conductor-config.yml is a file in the same directory I'm running the command from, and it looks like this:

environment_path: databases
use_dangerous_test_keystore: false
signing_service_uri: ~
encryption_service_uri: ~
decryption_service_uri: ~
dpki: ~
keystore_path: ~
passphrase_service: ~
admin_interfaces:
  - driver:
      type: websocket
      port: 1234
network:
  bootstrap_service: https://bootstrap-staging.holo.host
  transport_pool:
    - type: proxy
      sub_transport:
        type: quic
      proxy_config:
        type: local_proxy_server
  network_type: quic_bootstrap

Next I install the Happ and activate an “app” websocket port on this conductor.

# Install the Happ
hc sandbox call --running=1234 install-app-bundle --app-id=main-app back/workdir/acorn.happ

# ACTIVATE THE WEBSOCKET PORT
hc sandbox call --running=1234 add-app-ws 8888

Now let’s say I have a 2-conductor-config.yml which will act as a configuration for a second node.

environment_path: databases2
use_dangerous_test_keystore: false
signing_service_uri: ~
encryption_service_uri: ~
decryption_service_uri: ~
dpki: ~
keystore_path: ~
passphrase_service: ~
admin_interfaces:
  - driver:
      type: websocket
      port: 1235
network:
  bootstrap_service: https://bootstrap-staging.holo.host
  transport_pool:
    - type: proxy
      sub_transport:
        type: quic
      proxy_config:
        type: local_proxy_server
  network_type: quic_bootstrap

Note that I changed environment_path to databases2 to keep the two conductors' on-disk storage separate, and admin_interfaces.driver.port to 1235 to use a different networking port than the first. I thought this would be enough (but was wrong).

We would start this conductor with the RUST_LOG=info holochain -c 2-conductor-config.yml command.

Then, as with the first, we would run these to install the Happ and activate an “app” websocket port on this conductor. Note that this time we target admin port 1235, and pick a different app websocket port (8889), since 8888 is already taken by the first conductor:

# Install the Happ
hc sandbox call --running=1235 install-app-bundle --app-id=main-app back/workdir/acorn.happ

# ACTIVATE THE WEBSOCKET PORT
hc sandbox call --running=1235 add-app-ws 8889

Now we should have two identical DNAs running as different agents. They should be able to discover each other via the bootstrap service https://bootstrap-staging.holo.host, and since they're on the same local network, each should be able to handle incoming connections and talk to the other.

When I tried this, and I tried multiple different network configurations, I always got this terrible result:

WARN gossip_loop: kitsune_p2p::spawn::actor::gossip: msg="gossip loop error" e=KitsuneError(KitsuneError(Other(OtherError("refusing outgoing connection to node with same cert"))))
Apr 26 18:14:33.136  WARN gossip_loop:process_next_gossip: kitsune_p2p_proxy::tx2: refusing outgoing connection to node with same cert

This was flooding through my logs every 25 milliseconds or so.

I couldn’t make sense of the error, and tried posting it to the Discord chat space I am in, but no one recognized the error.

After some hours spread over two days, I started to mentally process the issue, as one little thing started to stand out in the Rust logs…

INFO kitsune_p2p_transport_quic::tx2: bound local endpoint (quic) local_cert=Cert(gkQUat..FMJG1c) url=kitsune-quic://10.0.0.140:58280

The “local_cert=Cert(gkQUat…FMJG1c)”.
This seemed related to the “node with same cert” message.
Then I realized that node 2 had a similar text:

INFO kitsune_p2p_transport_quic::tx2: bound local endpoint (quic) local_cert=Cert(gkQUat..FMJG1c) url=kitsune-quic://10.0.0.140:55921

In the flood of logs I had not been able to see this at first, and even when I did see it, I thought it was fine because the kitsune-quic addresses were different.
Now I realized the “local_cert=Cert(gkQUat…FMJG1c)” was the same.
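That comparison can be scripted. Here's a minimal shell sketch using the two log lines above verbatim; the sed pattern is my assumption about the exact log format:

```shell
# Extract the local_cert value from each conductor's log line and compare them.
log1='INFO kitsune_p2p_transport_quic::tx2: bound local endpoint (quic) local_cert=Cert(gkQUat..FMJG1c) url=kitsune-quic://10.0.0.140:58280'
log2='INFO kitsune_p2p_transport_quic::tx2: bound local endpoint (quic) local_cert=Cert(gkQUat..FMJG1c) url=kitsune-quic://10.0.0.140:55921'

# Pull out whatever sits inside local_cert=Cert(...)
cert1=$(printf '%s\n' "$log1" | sed -n 's/.*local_cert=Cert(\([^)]*\)).*/\1/p')
cert2=$(printf '%s\n' "$log2" | sed -n 's/.*local_cert=Cert(\([^)]*\)).*/\1/p')

if [ "$cert1" = "$cert2" ]; then
  echo "SAME cert on both nodes: $cert1"   # this is the bug described here
else
  echo "distinct certs - networking should be fine"
fi
```

The ports differ but the cert doesn't, which is exactly what the gossip loop was complaining about.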

I wondered how the two nodes were ending up with the same “local_cert”. Where was the “local_cert” coming from?

I realized it must be the place where Holochain stores its secret values and most private information: the “lair” :p. There really is something called “lair”.

It turned out the two nodes were sharing the SAME “lair keystore”, and the “local_cert” that “kitsune” (the networking module) uses for encryption was being pulled from that keystore.

In particular, the second holochain process would start without spawning a new lair keystore; instead it was content to connect to the one already running, even though another holochain process was already connected to it. Gotcha!

The WORKING configuration

So, how did we fix it?

In conductor-config.yml, we changed:
from:

keystore_path: ~

to:

keystore_path: keystore

and in 2-conductor-config.yml, we changed:
from:

keystore_path: ~

to:

keystore_path: keystore2

Now we have defined a separate keystore for each of the two conductors. The value is a path, and a folder will actually be created at that path, e.g. “keystore” and “keystore2”.

After restarting the conductors, this immediately solved the problem, and networking suddenly worked without issue!

Lesson

When you start holochain, you SHOULD always see this message (which is confusing, because it looks like an error):

Apr 27 18:02:40.677 ERROR lair_keystore_client: error=IpcClientConnectError("/Users/x/code/rsm/acorn/keystore/socket", Os { code: 61, kind: ConnectionRefused, message: "Connection refused" }) file="/Users/x/.cargo/registry/src/github.com-1ecc6299db9ec823/lair_keystore_client-0.0.1-alpha.12/src/lib.rs" line=42

This is a GOOD thing, because it means that holochain, after failing to connect to a “lair”, WILL spawn/start one. And there SHOULD be exactly one lair keystore per holochain process.
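This connect-or-spawn behavior can be illustrated with a quick check. A sketch, assuming the socket file sits directly under the configured keystore path (as in the log above):

```shell
# Check whether a lair socket already exists for a given keystore path.
# If it does, a starting holochain will connect to it instead of spawning
# a new lair -- which is how the two conductors ended up sharing a cert.
keystore_dir="keystore"   # the value of keystore_path in the conductor config

if [ -S "$keystore_dir/socket" ]; then
  msg="existing lair socket found - holochain will connect to it"
else
  msg="no socket - holochain will spawn its own lair keystore"
fi
echo "$msg"
```

With distinct keystore_path values, each conductor looks for (and creates) its own socket, so each gets its own lair and its own cert.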

e.g. on Mac, if I open Activity Monitor and search for “lair”, there should be the same number of lair processes as holochain processes.
If there are two holochain processes, then there should be two lair processes.
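The same check can be scripted in a terminal. A sketch, assuming the lair binary is named lair-keystore (adjust the names if yours differ):

```shell
# Count running holochain and lair processes; they should match 1:1.
holochain_count=$(pgrep -x holochain | wc -l | tr -d ' ')
lair_count=$(pgrep -x lair-keystore | wc -l | tr -d ' ')

echo "holochain processes: $holochain_count, lair processes: $lair_count"
if [ "$holochain_count" != "$lair_count" ]; then
  echo "WARNING: counts differ - conductors may be sharing a lair keystore"
fi
```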


Thanks @Connoropolous ! This is super helpful and exactly what I’m looking for right now!

Good good 🙂 you're welcome. Glad it will serve