Reproducible network/core testing for concurrent/looping stuff

thedavidmeister · August 21, 2019, 2:40am

at the moment it is really tough to get the testing working nicely for app spec and more complex zomes when core has networking plugged in

we have to choose between example-based deterministic testing (e.g. calling .process() or tick() or whatever manually in unit tests for a specific component), which is fragile and doesn’t hit edge cases easily, OR we just throw loops at hachiko and wait for something to break. the second option does surface breakages, but it’s then not clear how we reproduce/debug a specific breakage locally, or how we disambiguate two or more existing bugs with similar symptoms

it would be great to have something quick-check-y that can “randomly” generate process/tick calls for us and, if/when it breaks something, spits out the seed used by the PRNG so we can then reproduce that specific failure for debugging and to build up our library of examples
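
to make the idea concrete, here is a rough sketch in python (not the core codebase's language). `Agent`, `run_once` and `check` are hypothetical names made up for illustration, and python's built-in Mersenne Twister stands in for whatever seeded PRNG we would actually pick. the point is only that the whole schedule is derived from one seed, and that the failure message carries that seed:

```python
import random


class Agent:
    """Hypothetical stand-in for an agent whose internal loop is switched off."""

    def __init__(self, name):
        self.name = name
        self.inbox = []
        self.log = []

    def process(self):
        # Consume at most one pending message per call.
        if self.inbox:
            self.log.append(self.inbox.pop(0))


def run_once(seed, steps=10):
    """Drive alice and bob with a schedule derived only from `seed`."""
    rng = random.Random(seed)  # Mersenne Twister, standing in for any seeded PRNG
    alice, bob = Agent("alice"), Agent("bob")
    bob.inbox.extend(["hello", "world"])
    schedule = []
    for _ in range(steps):
        agent = rng.choice([alice, bob])
        agent.process()
        schedule.append(agent.name)
    return schedule, bob.log


def check(seed):
    """Return None on success, or a message that includes the seed on failure."""
    schedule, log = run_once(seed)
    if log != ["hello", "world"]:
        return f"seed={seed} schedule={schedule} log={log}"
    return None
```

in this toy, a run fails when bob is scheduled fewer than two times within the step budget, and re-running `run_once` with the reported seed replays exactly the same schedule.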

wanted to open this up a bit to discussion and to track progress, is there prior art we should be checking out? etc.

note that this conversation is orthogonal to how we debug/log failures when we do find them, it’s more about how we can automatically throw lots of different things at our code without ending up with flaky CI, etc.

thedavidmeister · August 21, 2019, 2:43am

CC: @maackle @zippy @lucksus

thedavidmeister · August 21, 2019, 2:50am

i was thinking as a “light touch” approach, we simply expose the inner loops somehow, defuse them, and use a high quality seeded PRNG like xoroshiro128+ to randomly call alice.process() and bob.process(), with a console log of the seed at the top of each test (or of the whole test suite?) and some way to optionally plug a seed back into the PRNG for local debugging
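
a minimal sketch of the seed plumbing, in python for brevity. the `TEST_SEED` variable name and the helper functions are assumptions for illustration only, and `random.Random` is a Mersenne Twister rather than xoroshiro128+:

```python
import os
import random


def resolve_seed(env=os.environ, var="TEST_SEED"):
    """Use the seed from the environment if present, otherwise draw a fresh one."""
    raw = env.get(var)
    if raw is not None:
        return int(raw)
    # OS entropy is only used to pick the seed, never inside the test itself.
    return random.SystemRandom().randrange(2**64)


def make_rng(seed):
    """Every random decision in the test should come from this one generator."""
    return random.Random(seed)


def interleaving(seed, steps):
    """Seeded choice of which agent gets its process() call at each step."""
    rng = make_rng(seed)
    return [rng.choice(["alice", "bob"]) for _ in range(steps)]


seed = resolve_seed()
print(f"test seed: {seed}")  # logged at the top so a failing run can be replayed
```

with something like this, a failing CI run could be replayed locally by exporting the logged value as `TEST_SEED` before running the same test.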

maackle · August 21, 2019, 6:15am

I love this approach. We’ve talked about it out-of-band, pre-forum, so I already have a clear idea of what it entails, but it would be good to eventually write it up as a proposal to make it easier to convey.

It occurs to me that one important part of this is the ability to run each subsequent tick() only after all messages have been passed around as a result of processing the previous tick. If we can’t take time out of the equation and make these side effects synchronous, we may still need some kind of consistency model like hachiko that will let us know when all effects have died down. It will still be a huge improvement because we’ll have deterministic test cases – if they actually halt.
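
As a rough illustration of "tick only after all effects have died down", here is a Python sketch. It assumes message delivery can be made synchronous (a handler appends straight into the recipient's inbox), which may not hold for the real network layer. `Node` and `settle` are hypothetical names, not existing APIs.

```python
from collections import deque


class Node:
    """Hypothetical agent: an inbox plus a handler that may emit new messages."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.received = []

    def tick(self, network):
        # Process one message; return True if any work was done.
        if not self.inbox:
            return False
        sender, body = self.inbox.popleft()
        self.received.append((sender, body))
        # Toy rule: a "ping" triggers a "pong" back to the sender.
        if body == "ping":
            network[sender].inbox.append((self.name, "pong"))
        return True


def settle(network, max_rounds=1000):
    """Tick every node in a fixed order until a full round does no work."""
    for rounds in range(max_rounds):
        # Build a list so every node is ticked even after one reports work.
        did_work = [node.tick(network) for node in network.values()]
        if not any(did_work):
            return rounds
    raise RuntimeError(f"no quiescence after {max_rounds} rounds")
```

The test driver would call `settle` between scheduled actions, so each new action starts from a state where every inbox is empty.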

thedavidmeister · August 21, 2019, 7:13am

indeed… although even in this case we can “timeout” after a certain number of ticks rather than milliseconds

in the unlikely case that e.g. 1000 ticks legitimately don’t result in any forward progress for some scenario we can up the timeout and document the edge case
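
a tick-based timeout could look roughly like this (python sketch; `run_until` and `TickTimeout` are made-up names, and the default of 1000 is just the number from the example above):

```python
class TickTimeout(Exception):
    """Raised when the goal is not reached within the tick budget."""


def run_until(step, done, max_ticks=1000):
    """Call step() until done() is true; give up after max_ticks calls.

    Returns the number of ticks that were needed, so a test can also
    assert on how much work a scenario took.
    """
    for tick in range(max_ticks):
        if done():
            return tick
        step()
    # The final step may have been the one that reached the goal.
    if done():
        return max_ticks
    raise TickTimeout(f"goal not reached within {max_ticks} ticks")
```

because the budget counts ticks and not wall-clock time, the same seed times out (or doesn't) identically on a slow CI box and a fast laptop.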

maackle · August 21, 2019, 3:36pm

I don’t get that. If side effects have a clock time delay, then we have to wait in clock time to make sure we see them. Also, ticking is only idempotent when the inbox is empty. In the general case, what if messages get put into inboxes in the wrong order?

If we seize control of the consumption of messages, then we have to ensure that the production of messages is timeless and deterministic as well. It might already be that way, but it might not.

If not, we might need to take this same approach for the emission of messages. The orchestrator could tell actors not only when to consume, but also when to produce messages. In which case we need control over outboxes as well as inboxes. Also, we’d need to be sure that messages get put in outboxes in the right order. Does this make sense?
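
To sketch what controlling both sides might look like: in the Python toy below each actor has an inbox and an outbox, and the orchestrator alone decides when an actor consumes and when it produces. Every name here (`Actor`, `consume`, `produce`, `orchestrate`, `make_world`) is hypothetical, and the ping/pong rule is only a placeholder for real message handling.

```python
import random
from collections import deque


class Actor:
    """Hypothetical actor with separate inbox and outbox queues."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.outbox = deque()
        self.handled = []

    def consume(self):
        # Handle one inbound message; replies go to the outbox, not the wire.
        if self.inbox:
            sender, body = self.inbox.popleft()
            self.handled.append(body)
            if body == "ping":
                self.outbox.append((sender, "pong"))

    def produce(self, actors):
        # Move one outbound message into its recipient's inbox.
        if self.outbox:
            recipient, body = self.outbox.popleft()
            actors[recipient].inbox.append((self.name, body))


def orchestrate(actors, seed, steps):
    """Pick (actor, operation) pairs from a seeded PRNG and record the trace."""
    rng = random.Random(seed)
    names = sorted(actors)  # fixed order so the seed alone decides the schedule
    trace = []
    for _ in range(steps):
        name = rng.choice(names)
        op = rng.choice(["consume", "produce"])
        if op == "consume":
            actors[name].consume()
        else:
            actors[name].produce(actors)
        trace.append((name, op))
    return trace


def make_world():
    """Two actors, with one ping from alice to bob waiting to be sent."""
    actors = {"alice": Actor("alice"), "bob": Actor("bob")}
    actors["alice"].outbox.append(("bob", "ping"))
    return actors
```

Since each queue is FIFO and nothing moves unless the orchestrator says so, two runs with the same seed produce the same trace and the same per-actor message order.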

pythagorean · August 21, 2019, 9:44pm

This conversation may be going over my head at the moment as far as the technical details of what you are trying to do, but isn’t an eventually consistent model non-deterministic by its nature? So the only way you can really know someone else has received a message is if they send an acknowledgement and you receive that, and until you do the sender should be uncertain of its delivery. The only reason to timeout is if the transaction itself is time critical for its completion and acknowledgement. Again, I may be interrupting and if so I apologize, please let me know what I can do to help test things and to make multiuser scenarios work again for my apps as well.

thedavidmeister · August 22, 2019, 4:06am

@maackle side effects should be synchronous for a given tick that hits an action right? moving to the next action before completing the previous one seems problematic for exactly this reason

@pythagorean yeah for testing we’re doing things that you can’t do in a live network, like having a “global view” on the internals of agents, and turning off the internal processing loops of agents so we can step through them in a deterministic way

this conversation is about “scanning” lots of different possible ways things can be non-deterministic with a seeded PRNG so when something breaks we can reproduce that one exact ordering of actions being processed across agents
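
as a toy illustration of the scanning idea (python, all names hypothetical): try a range of seeds, keep the ones whose ordering trips a check, and rely on the seed to regenerate that exact ordering later. the "bob three times in a row" rule is just a placeholder for some real order-dependent bug.

```python
import random


def schedule(seed, steps=6):
    """The ordering of process calls across agents, derived only from the seed."""
    rng = random.Random(seed)
    return [rng.choice(["alice", "bob"]) for _ in range(steps)]


def is_bad(order):
    """Placeholder bug: breaks when bob is processed three times in a row."""
    return "bob,bob,bob" in ",".join(order)


def scan(seeds):
    """Map each failing seed to the ordering that triggered the failure."""
    failures = {}
    for seed in seeds:
        order = schedule(seed)
        if is_bad(order):
            failures[seed] = order
    return failures
```

any seed that `scan` reports can be fed back into `schedule` to get the same ordering again, which is what makes a single failure debuggable in isolation.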

thedavidmeister · September 22, 2019, 3:24pm

this is happening with the new process macros in network tests