Question about ingesting old data & continuous aggregates

Imagine there is a TimescaleDB continuous aggregate with a refresh policy whose start offset is 1 week, and at some point there is a network problem somewhere. The problem takes 2 weeks to get solved, and once it is solved, the 2 weeks of data are received all at once.

Now we are going to have 1 week of data that falls outside the continuous aggregate's refresh window. So what is going to happen when someone reads from it, taking into account that real-time aggregation is enabled?

  • Is real-time aggregation going to fill both the last time bucket and this hole in the continuous aggregate, or is the hole going to be present in the output data?

  • Is this hole going to be there forever, until a manual refresh is done?

  • Would it be a good idea to have two refresh policies: one that runs frequently but only checks a small amount of data, and another that runs very infrequently and refreshes all or a big part of the dataset?

Thanks for your time

Welcome, @H25E !

You’ll need to manually refresh continuous aggregates. If you have a lot of backfill happening, I’d encourage you to create a function that wraps the backfill inserts and then refreshes the backfilled part.

Refreshing a large part of the aggregate will cost more computation cycles and need more IO. So, if you’re looking for a more performant solution, just create a backfill routine that inserts the data, gets the time range of the backfilled data, and then runs refresh_continuous_aggregate over that period.
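A minimal sketch of such a routine, assuming a hypothetical conditions hypertable with an hourly continuous aggregate named conditions_hourly (the table, column and date values are made up):

-- 1. Insert the late data into the raw hypertable (here taken from a staging table).
INSERT INTO conditions (time, device_id, temperature)
SELECT time, device_id, temperature FROM conditions_staging;

-- 2. Find the time range that was backfilled.
SELECT min(time) AS backfill_start, max(time) AS backfill_end FROM conditions_staging;

-- 3. Refresh only that window of the continuous aggregate. refresh_continuous_aggregate
--    is a procedure and cannot run inside an explicit transaction block, so it is
--    called as its own statement.
CALL refresh_continuous_aggregate('conditions_hourly', '2023-01-01', '2023-01-15');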

Thanks for your answer @jonatasdp !

One thing I still don’t have clear is whether the holes left in a continuous aggregate will be filled or not by real-time aggregation when the aggregate is queried. The absence of an answer makes me think they won’t, but I prefer to be sure.

Being more specific, I’m having a problem building a 1-minute continuous aggregate on top of a 1 Hz hypertable. The end offset is one minute, and if I set the start offset to something small like half an hour, anything that gets a little bit delayed is going to fall outside the aggregate. If I set it to something bigger like a few hours, the refresh job will need to run over tens of thousands of rows (for example, 3 h = 10,800 s => 10,800 rows) and I don’t know whether this is going to freeze my DB a little every time the job runs.
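For reference, the policy I’m describing would look roughly like this (the cagg name sensor_1min is made up):

SELECT add_continuous_aggregate_policy('sensor_1min',
  start_offset      => INTERVAL '3 hours',   -- how far back each run re-checks
  end_offset        => INTERVAL '1 minute',  -- leave the newest minute out
  schedule_interval => INTERVAL '1 minute');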

What’s a sane number of rows to check every time a refresh job runs? I know it will depend on a lot of factors, but at least an approximate idea…

Does a refresh job refresh all the rows between the start offset and the end offset, or does it only calculate the missing data?

One thing I still don’t have clear is whether the holes left in a continuous aggregate will be filled or not by real-time aggregation when the aggregate is queried.

Sorry if I was not clear in the previous answer. Real-time aggregation will not magically process a hole in the middle of the data.

You can take a look at the query that defines the underlying cagg by using \d+ <cagg_name>, and you will see something like this:

SELECT _materialized_hypertable_8.bucket,
    _materialized_hypertable_8.hourly_agg
   FROM _timescaledb_internal._materialized_hypertable_8
  WHERE _materialized_hypertable_8.bucket < COALESCE(_timescaledb_internal.to_timestamp_without_timezone(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp without time zone)
UNION ALL
 SELECT time_bucket('01:00:00'::interval, weather_metrics."time") AS bucket,
    stats_agg(weather_metrics.temp_c) AS hourly_agg
   FROM weather_metrics
  WHERE weather_metrics.city_name = 'New York'::text AND weather_metrics."time" >= COALESCE(_timescaledb_internal.to_timestamp_without_timezone(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp without time zone)
  GROUP BY (time_bucket('01:00:00'::interval, weather_metrics."time"));

Note the use of _timescaledb_internal.cagg_watermark(8), which is the control value that tracks the last materialized slice. It is what decides whether the query reads everything from the materialization or also computes the still-unmaterialized fragment on the fly.
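If you want to check where that boundary currently sits, you can query the watermark directly (the 8 comes from the _materialized_hypertable_8 name above; the id will be different for your cagg):

SELECT _timescaledb_internal.to_timestamp_without_timezone(
         _timescaledb_internal.cagg_watermark(8)) AS watermark;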

If I set it to something bigger like a few hours, the refresh job will need to run over tens of thousands of rows (for example, 3 h = 10,800 s => 10,800 rows) and I don’t know whether this is going to freeze my DB a little every time the job runs.

It will depend on the efficiency of the query and on the available hardware.

What’s a sane number of rows to check every time a refresh job runs?

I think you can estimate it based on the number of rows your query needs in order to stay consistent. Say you’re calculating an average with a time_bucket of one minute: you don’t need more than that minute of raw data to validate each bucket.

Does a refresh job refresh all the rows between the start offset and the end offset, or does it only calculate the missing data?

A refresh really does refresh the data: it recalculates the buckets in the window and overwrites the previous values.

Is there any smart way to backfill and refresh aggregates WITHOUT losing data from the aggregate in case one is using retention policies?

example:

  • hypertable retention → 1 year
  • aggregate hourly values, retention → forever (roughly the setup sketched after this list)
  • now the user backfills data older than 1 year → this data is backfilled into the hypertable (and will be deleted whenever the retention job runs again)
  • how to refresh the aggregate?
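For concreteness, the retention part of that setup would look roughly like this (table and cagg names are made up):

-- Raw hypertable keeps only 1 year of data.
SELECT add_retention_policy('conditions', INTERVAL '1 year');
-- The hourly cagg 'conditions_hourly' gets no retention policy,
-- so whatever has been materialized into it is kept forever.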

@dberardo-com it’s not possible. You’d need to write the data directly into the hypertable underlying the continuous aggregate. Related issue: Backfilling data to continuous aggregate · Issue #3546 · timescale/timescaledb · GitHub

Yes, I can write the data into the hypertable, that’s not an issue; the problem is how to refresh the aggregate… I feel like I have to include a cumbersome procedure to update the aggregate in a manual fashion.

Maybe I should define a TRIGGER on the aggregate that fires every time an INSERT command is performed on it. That way I could recompute the new rows in the aggregate and UPDATE the respective time buckets. But this might be suboptimal to run for every new row…

So there is no smarter way to achieve this, then?