Strategies for Optimizing Continuous Aggregates with Frequent Historical Data Inserts

Timsgmlr · May 8, 2024, 2:33pm

Hello TimescaleDB Community,

I’m managing a database with hundreds of time series, each having millions of points. Right now the table includes time, seriesId, and value columns, with continuous aggregates over 5 minutes, 15 minutes, and 1 hour.

We frequently add new series with years of historical data, which invalidates and triggers re-aggregation of basically our whole continuous aggregates due to the shared chunks with other series, even though the other series didn’t change at all. This impacts performance significantly, because most or all of the data has to be re-aggregated, even though only a very small percentage of the total data has changed.

Current Setup:

Schema: time, seriesId, value
Continuous Aggregates: 5, 15 minutes, and 1 hour
Issue: Adding historical data invalidates all related aggregates

We have considered restructuring our database to create a separate table for each series. However, with series being frequently added and removed through our frontend, this approach would require dynamic table creation and deletion, which might complicate our backend processes.

I am looking for advice on:

Best practices or alternative designs in TimescaleDB to manage such data without extensive aggregate invalidation.
Is separating series into individual tables advisable, or are there more efficient partitioning strategies?
Any particular TimescaleDB configurations that could help optimize this setup?

Thanks for any insights or recommendations!

jonatasdp · May 8, 2024, 5:51pm

that’s a great question @Timsgmlr!

Look to promscale design: 1 table per metric. Each seriesId should have it’s own table. I created a small dynamic creation as an example.

github.com

jonatas/sql-snippets/blob/2f886d561776723041c0e731788f970e7c891d44/hypertable_model.sql#L32-L40


      
          CREATE TRIGGER feed_child_table_trigger
          INSTEAD OF INSERT ON mother FOR EACH ROW
          EXECUTE PROCEDURE feed_child_table();
          
          INSERT INTO mother (table_name, time, identifier, value) VALUES
          ('a', '2021-08-26 10:09:00.01'::timestamp, 'id1', 10.1),
          ('a', '2021-08-26 10:09:00.08'::timestamp, 'id2', 10.0),
          ('b', '2021-08-26 10:09:00.23'::timestamp, 'id3', 10.2),
          ('b', '2021-08-26 10:09:00.40'::timestamp, 'id4', 10.3);

The main advantages are that you get independent policies for every series. Until you reach thousands of series, it will work. I created a test in the past to load 100k hypertables: sql-snippets/poc_100k_hypertables.sql at master · jonatas/sql-snippets · GitHub

It may help you to benchmark.

I think the main point is that your series are not even. I see a lot of advantages in this model. If your series have other types of attributes that could qualify and cluster the periodicity, you may find a better layout nesting some of those series in the same hypertable.

I spoke at a conference about this topic, all pros and cons too: https://www.youtube.com/watch?v=qd6w10Z2YYc