New "stacked average" pattern for cost-effective average calculation in frequently updated entries

Hello community, as you know, data mining is a complicated subject when talking about Holochain. We do not want to make an excessive number of requests, and we want every app to be able to run on any basic device. That’s why I’m looking for a way to make the computation of averages fast and efficient.

That’s when the “stacked average” (just came up with that name, suggestions are open) pattern comes in. Consider the following scenario:

We have a variable inside an entry of which we want to know the average value, let’s say:

struct TennisMatch {
    duration: u64
}

We could always get all of the tennis matches ever played and calculate the average every time we need it, or we could do this:

struct TennisMatch {
    duration: u64
}
struct TennisMatchHistory {
    average_duration: u64
}

Great! But how are we going to update the average_duration after every match? Well, we can add a helper variable that takes care of that:

struct TennisMatch {
    duration: u64
}
struct TennisMatchHistory {
    average_duration: u64,
    number_of_games: u32
}

And after every match, we could do this:

// Fold the new match into the running average.
// number_of_games is a u32, so cast it to u64 for the duration arithmetic.
let games = previous_history.number_of_games as u64;
let new_history = TennisMatchHistory {
    average_duration: (previous_history.average_duration * games + new_match.duration) / (games + 1),
    number_of_games: previous_history.number_of_games + 1,
};

update_history(new_history);

This way, the latest history entry will always have an updated average. What do you think? Is this pattern viable?
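In case anyone wants to try the arithmetic outside of a zome, here is a minimal, self-contained sketch in plain Rust. The fold_match helper and the sample durations are made up for illustration and are not part of any Holochain API; the sketch also recomputes the exact average from every match so the two values can be compared (since average_duration is an integer, repeated integer division can make the stacked value drift slightly from the exact mean).

struct TennisMatch {
    duration: u64,
}

struct TennisMatchHistory {
    average_duration: u64,
    number_of_games: u32,
}

// Fold one new match into the running ("stacked") average.
fn fold_match(history: &TennisMatchHistory, new_match: &TennisMatch) -> TennisMatchHistory {
    let games = history.number_of_games as u64;
    TennisMatchHistory {
        average_duration: (history.average_duration * games + new_match.duration) / (games + 1),
        number_of_games: history.number_of_games + 1,
    }
}

fn main() {
    let matches = vec![
        TennisMatch { duration: 95 },
        TennisMatch { duration: 180 },
        TennisMatch { duration: 122 },
        TennisMatch { duration: 240 },
    ];

    // Build the history incrementally, one match at a time.
    let mut history = TennisMatchHistory { average_duration: 0, number_of_games: 0 };
    for m in &matches {
        history = fold_match(&history, m);
    }

    // Exact average recomputed from every match, for comparison.
    let exact: u64 = matches.iter().map(|m| m.duration).sum::<u64>() / matches.len() as u64;

    println!("stacked average: {}", history.average_duration); // 159
    println!("exact average:   {}", exact);                    // 159
}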


I see this question was posed in April of 2020… so did you try your approach? What were your findings?

I do have a note or two on the design of this approach. It’s worth considering the “open-closed” principle here, where resources should be open to extension but closed to modification.

What if one match is rain-delayed and its duration skews your results? You would likely want to store a “precipitationAmount” value to weight the duration, but that would be difficult to retrofit into your original calculation. You would then have to modify the average_duration calculation in order to incorporate this new value.

Any change or addition to the metrics you collect, or to the way they are calculated, could violate the open-closed principle. Each new metric would need to be added to the TennisMatchHistory object, would likely not have the same number of observations, and would thus be incomparable to your original computed average. You could perhaps patch that with a one-time “catch-up” procedure that backfills the new property, but that’s disposable code.
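To make that concern concrete, here is a rough sketch of one way a history could stay open to extension: store a per-metric running sum and observation count instead of a single pre-computed average, so a metric added later simply starts with its own count. The MetricHistory struct and add_observation helper are purely illustrative, not something from your post or from any Holochain API.

use std::collections::HashMap;

#[derive(Default)]
struct MetricHistory {
    // metric name -> (sum of observations, number of observations)
    totals: HashMap<String, (u64, u32)>,
}

impl MetricHistory {
    // Record one observation for a named metric; an unseen metric starts at
    // zero, so adding "precipitation_amount" later does not touch existing code.
    fn add_observation(&mut self, metric: &str, value: u64) {
        let entry = self.totals.entry(metric.to_string()).or_insert((0, 0));
        entry.0 += value;
        entry.1 += 1;
    }

    // Average for one metric, computed from that metric's own observation count.
    fn average(&self, metric: &str) -> Option<u64> {
        self.totals.get(metric).and_then(|(sum, count)| {
            if *count == 0 { None } else { Some(*sum / u64::from(*count)) }
        })
    }
}

fn main() {
    let mut history = MetricHistory::default();
    history.add_observation("duration", 95);
    history.add_observation("duration", 180);
    // A metric introduced later simply has fewer observations, tracked separately.
    history.add_observation("precipitation_amount", 3);

    println!("{:?}", history.average("duration"));             // Some(137)
    println!("{:?}", history.average("precipitation_amount")); // Some(3)
}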

Additionally, wouldn’t this mean computing new summary statistics after every match completes? Perhaps this is necessary overhead, but it’s a dependency in your match that is only descriptive in nature. The match doesn’t care about its average, but in this pattern, it needs to dispatch its values to a global struct. If that struct breaks or fails to compile, your match fails unnecessarily.

Hopefully the above demonstrates some of the potential risks of your approach. This is by no means an indictment though. Perhaps there really is a performance gain that outweighs any requirements changes.

As a final note… I do not have the best grasp of performance optimization on graph databases like Holochain. Maybe this is a non-trivial question. I’ll leave the performance discussion to another poster.