yes, i oversimplified a little bit - the main goals of witnesses in a DAG, as i understand them, are:
- overlay a credibly neutral and reliable total ordering on top of partially ordered data (i.e. the DAG)
- prevent attackers from hiding data and releasing it later, which could force the DAG to re-order after the fact and enable e.g. double spends - the defence is to force data through the hashing done by the public witnesses before it ‘counts’, since hashing de facto requires the data to exist and be known
in holochain every agent also has their own authorities for the headers of their source chain, the ‘agent authorities’
the agent authorities act as witnesses for the agent’s source chain, but they only hold the headers and validate the ordering and linearity of the chain, not all the data
if an agent cares about something sensitive to ordering and forking, e.g. a ledger, they can query the agent authorities to ensure that the chain head being presented to them by the counterparty matches the network’s (witnessed) opinion of the head of that chain
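to make that concrete, here’s a rough rust sketch of the kind of check an agent authority (or anyone who fetched the witnessed headers) could run - the `Header` struct, its fields and the toy hash are made up for illustration, not holochain’s actual types:

```rust
// a rough sketch, not holochain's real types: an authority holds only headers
// and checks that they form a single unforked, strictly ordered chain.

#[derive(Clone, Debug, PartialEq)]
struct Header {
    seq: u64,               // position in the author's source chain
    prev_hash: Option<u64>, // hash of the previous header (None for genesis)
    entry_hash: u64,        // hash of the entry this header commits to
}

// toy stand-in for a real cryptographic hash of a header
fn header_hash(h: &Header) -> u64 {
    h.seq
        .wrapping_mul(31)
        .wrapping_add(h.entry_hash)
        .wrapping_add(h.prev_hash.unwrap_or(0).wrapping_mul(17))
}

// true if the headers form one unbroken, unforked, linear chain
fn is_linear(chain: &[Header]) -> bool {
    chain.windows(2).all(|w| {
        w[1].seq == w[0].seq + 1 && w[1].prev_hash == Some(header_hash(&w[0]))
    })
}

// the counterparty claims `claimed_head` is their latest header; the witnessed
// copy of the chain lets us check that claim
fn head_matches(witnessed_chain: &[Header], claimed_head: &Header) -> bool {
    witnessed_chain.last() == Some(claimed_head)
}

fn main() {
    let genesis = Header { seq: 0, prev_hash: None, entry_hash: 1 };
    let next = Header { seq: 1, prev_hash: Some(header_hash(&genesis)), entry_hash: 2 };
    let chain = vec![genesis, next.clone()];
    assert!(is_linear(&chain));
    assert!(head_matches(&chain, &next));
    println!("chain is linear and the presented head is the witnessed head");
}
```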
most of the issues i know of re: witnesses in a DAG come back to that ‘credibly neutral’ requirement - which is where we start talking about PoW or PoS etc., and there are hybrid models that mix blockchain and DAG for exactly this reason, so that the blockchain witnesses the DAG
so yes, there is an algorithm that determines the ‘distance’ between data and agents, and it’s ‘random’ in that it is based on comparing hashes and pubkeys
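a minimal sketch of what that ‘distance’ can look like - holochain’s real metric differs in detail, the point is just that it’s a pure function of the data hash and the pubkey, so anyone can recompute it and nobody gets to choose where data lands:

```rust
// a toy 'distance' between a data hash and an agent pubkey: both are projected
// into a shared location space and compared on a ring. the projection and
// metric here are illustrative only.

// toy projection of a hash or pubkey into a 32-bit location space
// (real implementations derive this from a cryptographic hash)
fn location(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

// distance on a ring: the shorter way around between two locations
fn ring_distance(a: u32, b: u32) -> u32 {
    let d = a.wrapping_sub(b);
    d.min(b.wrapping_sub(a))
}

fn main() {
    let data_hash = b"some entry hash";
    let alice_pubkey = b"alice pubkey";
    let bob_pubkey = b"bob pubkey";

    // whoever is 'closer' to the data is a more likely authority for it,
    // and anyone can recompute these numbers from public information
    let d_alice = ring_distance(location(data_hash), location(alice_pubkey));
    let d_bob = ring_distance(location(data_hash), location(bob_pubkey));
    println!("distance to alice: {d_alice}, distance to bob: {d_bob}");
}
```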
the subtlety in ‘randomly selecting an authority’ is: who is doing the selecting? for example, there can always be an eclipse attack on a DHT, where absolutely everyone you ever interact with lies about the state of things, and from that point you’ll never achieve any further security (that problem also applies to bitcoin, ethereum, etc.)
what we want to do is set up a situation where ‘the good guys’ have a huge advantage over ‘the bad guys’ - e.g. as long as an agent can reach at least one honest authority, they can detect bad behaviour from potentially many bad authorities and resolve the conflicts (e.g. by doing their own validation)
the first thing there is that validation should be as deterministic as possible, so that for a given validation a true/false is reliably available from the input data alone, without any further network interactions - this allows all agents to decide for themselves who they think is honest once they have some data (where the integrity of the data is guaranteed by its hash)
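as a sketch, ‘deterministic validation’ means something shaped like this - a pure function of the data (plus any dependencies resolved by hash and passed in), no network calls, no clocks, no randomness, so every agent computes the same verdict; the types are illustrative:

```rust
// illustrative only: validation as a pure function. no I/O, no clocks, no
// randomness - the verdict depends only on the inputs, so every honest agent
// that sees the same data reaches the same true/false.

struct Transfer {
    amount: u64,
    sender_balance_before: u64, // would be resolved from hashed dependencies
}

fn validate_transfer(t: &Transfer) -> bool {
    t.amount > 0 && t.amount <= t.sender_balance_before
}

fn main() {
    assert!(validate_transfer(&Transfer { amount: 5, sender_balance_before: 10 }));
    assert!(!validate_transfer(&Transfer { amount: 20, sender_balance_before: 10 }));
    println!("same inputs, same verdict, for every agent");
}
```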
even then we need to know who to talk to about some item of data, and don’t want to simply trust the author
in holochain each agent opts in to some percentage of the data on the network, e.g. 10%, which is their ‘arc’ - this means they will validate and hold the 10% of data ‘closest’ to themselves, and they let everybody else on the network know their arc at the same time as they broadcast their network location to allow incoming connections
so if alice broadcasts a ‘10% arc’ to bob, bob can calculate whether some data hashes to within her arc and should therefore be held by alice - if so, she is on the list of agents bob can get that data from. alice has no control over which data falls in her arc, only the size of the arc; her pubkey vs. the data hash determines what lands in it, so an honest alice has no choice but to validate and hold everything assigned to her and respond to queries about it
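here’s a rough sketch of the check bob can do against alice’s advertised arc - modelling the arc as a fraction of a 32-bit ring centred on her pubkey’s location is a simplification, and the toy `location`/`ring_distance` helpers from the earlier sketch are repeated so it stands alone:

```rust
// a simplified arc check: alice's arc is modelled as a fraction of a 32-bit
// ring centred on her pubkey's location. bob only needs her pubkey, her
// advertised arc size and the data hash to decide whether she should hold it.

fn location(bytes: &[u8]) -> u32 {
    // toy projection into a 32-bit location space; real code hashes properly
    bytes
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

fn ring_distance(a: u32, b: u32) -> u32 {
    let d = a.wrapping_sub(b);
    d.min(b.wrapping_sub(a))
}

// arc_fraction is e.g. 0.10 for a '10% arc'
fn should_hold(agent_pubkey: &[u8], data_hash: &[u8], arc_fraction: f64) -> bool {
    let half_width = (arc_fraction * u32::MAX as f64 / 2.0) as u32;
    ring_distance(location(agent_pubkey), location(data_hash)) <= half_width
}

fn main() {
    let alice_pubkey = b"alice pubkey";
    let entry_hash = b"some entry hash";
    if should_hold(alice_pubkey, entry_hash, 0.10) {
        println!("bob expects alice to hold (and have validated) this entry");
    } else {
        println!("this entry is outside alice's arc - ask someone closer");
    }
}
```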
pubkeys are, by their nature, randomly spread across the space of all hashes, so a healthy network will have many arcs ready to receive any new data hash, organically spreading data between peers according to their ability to handle it (similar to seeders and leechers in bittorrent)
we’re expecting that small DHTs start with everyone at a 100% arc, and that arcs start to shrink as the DHT becomes more heavily populated and holds more data (agents set their arc per-DHT)
bob chooses for himself a random set of N agents that he believes should have the data he is looking for, based on the agents he is aware of and the hash of that data - he queries them all in parallel, and each will respond with either the data or better candidates for bob to query (thus giving bob a better view of who is in the network) - once bob receives M responses (where M could be a bit less than N) he can move forward with cross-referencing and using them - bigger M and N are more conservative/defensive but imply more network overhead, which is also a tradeoff that can be made per-app
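and a sketch of that fan-out pattern - N, M and the agents’ behaviour are made up, and the ‘network’ is simulated with threads and a channel, but the shape is: query in parallel, proceed once M answers arrive, then cross-reference what came back:

```rust
// fan-out query sketch: ask N candidates in parallel, proceed once M have
// answered, then cross-reference the answers. the 'network' is simulated with
// threads and a channel; the agents' behaviour is hard-coded for illustration.

use std::sync::mpsc;
use std::thread;

#[derive(Clone, Debug, PartialEq)]
enum Response {
    Data(String),          // the agent held (and had validated) the data
    Redirect(Vec<String>), // the agent suggests closer candidates to try
}

fn main() {
    let n = 4; // how many agents bob queries
    let m = 3; // how many answers bob waits for before proceeding

    // simulated agents: what each would answer if queried
    let agents = vec![
        Response::Data("entry-bytes".into()),
        Response::Data("entry-bytes".into()),
        Response::Redirect(vec!["carol".into()]),
        Response::Data("entry-bytes".into()),
    ];

    let (tx, rx) = mpsc::channel();
    for agent in agents.into_iter().take(n) {
        let tx = tx.clone();
        thread::spawn(move || {
            // in reality this is a network round trip; here it's immediate
            let _ = tx.send(agent);
        });
    }
    drop(tx); // so the receiver ends once all spawned senders are done

    // take the first m responses that arrive
    let responses: Vec<Response> = rx.into_iter().take(m).collect();

    // cross-reference: do all agents that returned data agree?
    let data: Vec<&String> = responses
        .iter()
        .filter_map(|r| match r {
            Response::Data(d) => Some(d),
            Response::Redirect(_) => None,
        })
        .collect();
    let consistent = data.windows(2).all(|w| w[0] == w[1]);
    println!("got {} responses, data consistent: {}", responses.len(), consistent);
}
```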