Continuous Aggregates on summary data updated second by second

Say I have data going back several years.

I would like to SUM one column over the entire data set. With hundreds of millions of rows, that takes about ~20 seconds.
New data arrives every second, so I need the updated SUM every time new data lands. Summing the whole set each time is wasteful, but I can sum just the new data within a window of time and add it to the already calculated sum of the historical data. That refresh window would have to be at most a few seconds; I need to be as close to true realtime as possible.

All fine and dandy. But here's the problem: the incoming data is not always a new INSERT. Sometimes it's an UPDATE, and worse, an UPDATE to any historical data point. It could hit the very first row in the set or the most recent row. This happens often.

I would like to think continuous aggregates can do a lookup on the timestamp of the UPDATE, figure out which time bucket the changed row falls into, and merely recompute the SUM for that bucket. The calculation would then be

(HistoricalDataSum - OldUpdatedBucketSum) + NewUpdatedBucketSum

So instead of having to recompute the whole historical data set because some row in the middle got changed, only a partial computation needs to happen. That would let me avoid the ~20 second query time (at the current data volume) on the whole set and meet my true realtime needs.
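
To make that concrete, here is a rough sketch of the bucket-level recompute I am imagining, using a hypothetical hypertable `measurements(ts, value)` and a hypothetical per-day summary table `daily_sums(bucket, bucket_sum)`:

```sql
-- When an UPDATE lands in the bucket containing '2021-06-01 12:34:56',
-- only that one bucket's sum would need to be recomputed (hypothetical names):
UPDATE daily_sums d
SET    bucket_sum = (SELECT sum(m.value)
                     FROM   measurements m
                     WHERE  m.ts >= d.bucket
                     AND    m.ts <  d.bucket + INTERVAL '1 day')
WHERE  d.bucket = time_bucket('1 day', TIMESTAMPTZ '2021-06-01 12:34:56');

-- The grand total is then just the sum of the per-bucket sums, i.e.
-- (HistoricalDataSum - OldUpdatedBucketSum) + NewUpdatedBucketSum:
SELECT sum(bucket_sum) FROM daily_sums;
```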

  • Note: One might suggest not doing updates and only inserting new data. That is the worst outcome. Remember when I said hundreds of millions of rows? Duplicate data arrives in throngs and would quickly explode the size of the set. Even if I wanted to pay for the space, the aggregation query on just one bucket, now with a dedupe subquery to boot, would take longer than rerunning the query on the whole deduplicated set.

Hi!

I'd recommend that you use a continuous aggregate with a shorter bucket, say a day, and then re-aggregate the data over the whole time range at query time. This lets you use a real-time view and still get the correct answer with a minimal amount of work: summing, say, ~1000 values, one per day for a few years, is quite a small overhead, and the real-time part only has to cover the last day or hour or so, and will simply include the most recent data by default.
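
Something like the following minimal sketch, assuming a hypothetical hypertable `measurements(ts timestamptz, value double precision)`:

```sql
-- Daily continuous aggregate with real-time aggregation enabled, so the
-- not-yet-materialized recent data is merged in at query time.
CREATE MATERIALIZED VIEW daily_sums
WITH (timescaledb.continuous, timescaledb.materialized_only = false) AS
SELECT time_bucket('1 day', ts) AS bucket,
       sum(value)               AS bucket_sum
FROM   measurements
GROUP BY bucket;

-- Keep the materialized buckets up to date in the background.
SELECT add_continuous_aggregate_policy('daily_sums',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

-- At query time, re-aggregate ~one row per day instead of the raw rows.
SELECT sum(bucket_sum) FROM daily_sums;
```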

In terms of updates, it will also not have to recompute the entire set; instead, it will only recompute the day (or so) that was updated.
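
If you need a correction to historical data to show up in the materialized part right away, rather than waiting for the policy to come around, you can also refresh just the affected window by hand (the view name and dates here are just placeholders):

```sql
-- Re-materialize only the buckets in the given window.
CALL refresh_continuous_aggregate('daily_sums', '2021-06-01', '2021-06-02');
```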

I would also recommend using things like stats_agg and other hyperfunctions so that you can use rollup for the re-aggregation. That's mostly if you're ever doing things other than sum; with sum it's easy, with other aggregates it gets a bit more complex.
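
As a sketch of what that looks like (this assumes the timescaledb_toolkit extension is installed; the names are the same hypothetical ones as above):

```sql
-- Store a stats_agg per day instead of a plain sum.
CREATE MATERIALIZED VIEW daily_stats
WITH (timescaledb.continuous, timescaledb.materialized_only = false) AS
SELECT time_bucket('1 day', ts) AS bucket,
       stats_agg(value)         AS stats
FROM   measurements
GROUP BY bucket;

-- rollup() combines the per-day aggregates; accessors extract the final
-- values, which generalizes beyond sum.
SELECT sum(rollup(stats)),
       average(rollup(stats)),
       stddev(rollup(stats))
FROM   daily_stats;
```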

Finally, I noticed you said there was a lot of duplicate data causing updates. Why are you updating with duplicates? Are you using ON CONFLICT DO UPDATE? Are you making sure there's a WHERE clause in the update so that you only actually perform the update if the data changes? I'd definitely recommend doing that if you can.
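
Something along these lines, assuming a hypothetical unique key on (ts, series_id):

```sql
INSERT INTO measurements (ts, series_id, value)
VALUES ('2021-06-01 12:34:56', 42, 10.5)
ON CONFLICT (ts, series_id) DO UPDATE
SET   value = EXCLUDED.value
-- Only perform the UPDATE if the value actually changed; a no-op UPDATE
-- would still rewrite the row and invalidate the bucket.
WHERE measurements.value IS DISTINCT FROM EXCLUDED.value;
```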

If this doesn't work, you can do something different from caggs and use some sort of trigger-based system to do this kind of work. For something like sum that should work just fine, though it will slow down your inserts/updates/deletes. It all depends on what you're looking to do, but I do think caggs will work quite nicely for you, especially the new version that's coming out in 2.7.
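
A very rough sketch of that trigger-based approach, for a plain running sum (all names hypothetical):

```sql
-- Single-row table holding the running total, seeded from the existing data.
CREATE TABLE running_total (total double precision NOT NULL DEFAULT 0);
INSERT INTO running_total (total)
SELECT coalesce(sum(value), 0) FROM measurements;

CREATE OR REPLACE FUNCTION maintain_running_total() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE running_total SET total = total + NEW.value;
    ELSIF TG_OP = 'UPDATE' THEN
        UPDATE running_total SET total = total - OLD.value + NEW.value;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE running_total SET total = total - OLD.value;
    END IF;
    RETURN NULL;  -- AFTER row trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_running_total
AFTER INSERT OR UPDATE OR DELETE ON measurements
FOR EACH ROW EXECUTE FUNCTION maintain_running_total();
```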

I hope what I suggested was clear, but if not, let me know. Also, maybe you could provide a bit of sample data / a schema and we could turn this into a blog post? I think this is a great question and something other folks could learn from.
