I am trying to use `time_bucket_gapfill()` to fill gaps in missing sensor data, either inside a continuous aggregate or on top of one. Let me build some context:
My raw data could look as follows, note the gaps (no sample at 09:00:50, and nothing during the whole 09:02 minute), from sensor `a`:

| sensor_id | timestamp | value |
|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 0 |
| a | 2023-02-15 09:00:10.000000 | 10 |
| a | 2023-02-15 09:00:20.000000 | 20 |
| a | 2023-02-15 09:00:30.000000 | 30 |
| a | 2023-02-15 09:00:40.000000 | 40 |
| a | 2023-02-15 09:01:00.000000 | 100 |
| … | … | … |
| a | 2023-02-15 09:01:50.000000 | 150 |
| a | 2023-02-15 09:03:00.000000 | 300 |
| … | … | … |
| a | 2023-02-15 09:03:50.000000 | 350 |
| a | 2023-02-15 09:04:00.000000 | 400 |
| … | … | … |
| a | 2023-02-15 09:04:50.000000 | 450 |

It's basically an ever-increasing index value that is reported every 10 seconds; the index never becomes a smaller value. For simplicity's sake, in this example it just increases by 10 for every datapoint, with the minute value prepended (so the values in minute 3 run from 300 to 350).
On this data I have a few continuous aggregates running to aggregate the data over different time resolutions, e.g. in a minute-by-minute or day-by-day fashion.

Continuing with minute-by-minute as the example, I want to find the `min()` and `max()` values for each minute, i.e. with the following continuous aggregate:
```sql
CREATE MATERIALIZED VIEW data_aggregation_minute_min_max
WITH (timescaledb.continuous, timescaledb.materialized_only = false) AS
SELECT sensor_id,
       time_bucket(INTERVAL '1 minute', timestamp) AS timestamp,
       min(timestamp) AS first_data_received,
       max(timestamp) AS last_data_received,
       min(value) AS min_value,
       max(value) AS max_value
FROM raw_data
GROUP BY sensor_id, time_bucket(INTERVAL '1 minute', timestamp)
WITH DATA;

SELECT add_continuous_aggregate_policy('data_aggregation_minute_min_max',
    start_offset => INTERVAL '6 days',
    end_offset => INTERVAL '2 minutes',
    schedule_interval => INTERVAL '1 minute');
```
This will give me a table that looks like this with the example raw data:

| sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value |
|---|---|---|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 |
| a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 |
| a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 |
| a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 |
Now I want to calculate the amount the index increased between each `time_bucket`. I need the `max()` values to calculate the difference between two rows of this cagg in order to arrive at the amount the index increased over time, i.e. with the following query:

```sql
SELECT sensor_id,
       timestamp,
       CASE
           WHEN lag(max_value, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp) IS NULL
                OR lag(last_data_received, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp)
                   < first_data_received - INTERVAL '19 seconds'
               THEN max_value - min_value
           ELSE max_value - lag(max_value, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp)
       END AS value_diff
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a';
```
This would give me the following result:

| sensor_id | timestamp | value_diff |
|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 40 |
| a | 2023-02-15 09:01:00.000000 | 50 |
| a | 2023-02-15 09:03:00.000000 | 50 |
| a | 2023-02-15 09:04:00.000000 | 100 |
If I only used the `max()` values to arrive at these results, a gap in the data would shift an increase of the index into the wrong time period: in the given raw data, the sensor was not reporting during the 09:02 minute. So if I used the `lag()` function on only the `max_value` column, it would put the increase of the index that happened in the non-reported time window into the 09:03:00 `time_bucket`, resulting in a `value_diff` of 200 for the 09:03:00 bucket, which is wrong and shows up as a 'peak' in that data point.
Hence why I am using the reported `min()` value whenever the gap between the `last_data_received` of the previous entry and the `first_data_received` of the current one is too big. Still, I want to take the diff between two `max()` values as often as I can, or else the increase of the index during the jump between two buckets is lost: if I always used `max() - min()`, the `value_diff` for the 09:04:00 bucket would report an increase of 50 instead of the 100 that I want.
Now, I have been experimenting with `interpolate()` to arrive at a more "complete" diff table/query, but I haven't managed to get the result that I want, which would be a table that looks as follows with the raw data and cagg given above:
| sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value | value_diff |
|---|---|---|---|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 | 40 |
| a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 | 50 |
| a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 | 50 |
| a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 | 100 |
This would allow me to easily detect gaps in the raw data / cagg and treat them further in Java, for instance with interpolation. It's not perfect, but it would let me circumvent some problems and at least interpolate where data is missing.
An ideal table would look as follows, but I don't think it's possible to achieve in the database:

| sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value | value_diff | inter_bucket_diff |
|---|---|---|---|---|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 | 40 | 60 |
| a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 | 50 | null |
| a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 | 50 | 50 |
| a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 | 100 | null |
And then I would have these as a cagg too, so I can query the tables with minimal effort from Java and differentiate between actual values and interpolated values.
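The closest sketch I have of the `value_diff` part of that table is to apply `time_bucket_gapfill()` on top of the cagg at query time (not inside it) and diff the gapfilled series with `lag()`, so that buckets without data come out as NULL diffs I could post-process in Java. Whether this behaves well, and whether it could ever be turned into a cagg itself, is exactly what I am unsure about (view and column names are from my example above):

```sql
-- Sketch: gapfill the minute buckets of the cagg, then diff with lag().
-- Buckets with no data yield NULL max_value, so their diff (and the diff
-- of the bucket right after a gap) comes out as NULL.
SELECT sensor_id,
       bucket_timestamp,
       max_value - lag(max_value) OVER (PARTITION BY sensor_id
                                        ORDER BY bucket_timestamp) AS value_diff
FROM (
    SELECT sensor_id,
           time_bucket_gapfill('1 minute', timestamp) AS bucket_timestamp,
           max(max_value) AS max_value
    FROM data_aggregation_minute_min_max
    WHERE sensor_id = 'a'
      AND timestamp >= '2023-02-15 09:00:00'::timestamp
      AND timestamp <  '2023-02-15 09:05:00'::timestamp
    GROUP BY sensor_id, bucket_timestamp
) AS filled;
```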
The trouble that I am having is that I do not know how to integrate `time_bucket_gapfill()` into a cagg, nor do I yet know how to use `interpolate()` in order to get the values that I gave in my example.
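My understanding from the docs is that `interpolate()` wraps an aggregate inside a gapfill query, so over the cagg it would look something like the following (untested, just how I read the docs; the column names are from my example):

```sql
-- Sketch: linearly interpolate max_value for the empty 09:02 bucket,
-- using the neighbouring buckets' values as endpoints.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS bucket_timestamp,
       interpolate(max(max_value)) AS max_value_filled
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a'
  AND timestamp >= '2023-02-15 09:00:00'::timestamp
  AND timestamp <  '2023-02-15 09:05:00'::timestamp
GROUP BY sensor_id, bucket_timestamp
ORDER BY bucket_timestamp;
```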
I have managed to use `time_bucket_gapfill()` in a query like this one, for instance:

```sql
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS bucket_timestamp,
       max(value)
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp > '2023-02-15 08:55:00.000000'::timestamp
  AND timestamp < '2023-02-15 09:05:00.000000'::timestamp
GROUP BY sensor_id, bucket_timestamp
ORDER BY bucket_timestamp DESC;
```
But it's not giving me the results that I am looking for, since it also produces redundant rows for every empty bucket between the timestamps required in the `WHERE` clause. As far as I can tell this is the only way to use `time_bucket_gapfill()` currently, but maybe I am wrong.
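One thing I noticed in the docs is that `time_bucket_gapfill()` also takes optional `start` and `finish` arguments, so the filled range can be narrowed independently of the `WHERE` clause. I am not sure this fully solves my problem, but it would at least bound the redundant rows:

```sql
-- Only fill buckets between start and finish, even though the WHERE
-- clause scans a wider range of raw data.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp,
                           start  => '2023-02-15 09:00:00'::timestamp,
                           finish => '2023-02-15 09:05:00'::timestamp) AS bucket_timestamp,
       max(value)
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp > '2023-02-15 08:55:00.000000'::timestamp
  AND timestamp < '2023-02-15 09:05:00.000000'::timestamp
GROUP BY sensor_id, bucket_timestamp
ORDER BY bucket_timestamp DESC;
```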
I hope that with the example given above the problem I am trying to solve is clear, and someone can help me out, or even tell me that what I am trying to solve is not possible with the tools that I am looking at.