I am trying to use `time_bucket_gapfill()` to fill gaps in missing sensor data, either inside a continuous aggregate or on top of one. Let me build some context:
My raw data could look as follows, note the gaps (no sample at 09:00:50, and nothing during the whole 09:02 minute), from sensor `a`:

| sensor_id | timestamp | value |
|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 0 |
| a | 2023-02-15 09:00:10.000000 | 10 |
| a | 2023-02-15 09:00:20.000000 | 20 |
| a | 2023-02-15 09:00:30.000000 | 30 |
| a | 2023-02-15 09:00:40.000000 | 40 |
| a | 2023-02-15 09:01:00.000000 | 100 |
| … | … | … |
| a | 2023-02-15 09:01:50.000000 | 150 |
| a | 2023-02-15 09:03:00.000000 | 300 |
| … | … | … |
| a | 2023-02-15 09:03:50.000000 | 350 |
| a | 2023-02-15 09:04:00.000000 | 400 |
| … | … | … |
| a | 2023-02-15 09:04:50.000000 | 450 |

It's basically an ever-increasing index value that is reported every 10 seconds; the index never becomes a smaller value. For simplicity's sake, in this example it just increases by 10 for every datapoint, with the minute value prepended (so the values in minute 3 run from 300 to 350).
On this data I have a few continuous aggregates running to aggregate the data over different time resolutions, e.g. in a minute-by-minute or day-by-day fashion.

Continuing with minute-by-minute as the example, I want to find the `min()` and `max()` values for each minute, i.e. with the following continuous aggregate:
```sql
CREATE MATERIALIZED VIEW data_aggregation_minute_min_max
WITH (timescaledb.continuous, timescaledb.materialized_only = false) AS
SELECT sensor_id,
       time_bucket(INTERVAL '1 minute', timestamp) AS timestamp,
       min(timestamp) AS first_data_received,
       max(timestamp) AS last_data_received,
       min(value) AS min_value,
       max(value) AS max_value
FROM raw_data
GROUP BY sensor_id, time_bucket(INTERVAL '1 minute', timestamp)
WITH DATA;

SELECT add_continuous_aggregate_policy('data_aggregation_minute_min_max',
    start_offset => INTERVAL '6 days',
    end_offset => INTERVAL '2 minutes',
    schedule_interval => INTERVAL '1 minute');
```
This will give me a table that looks like this with the example raw data:

| sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value |
|---|---|---|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 |
| a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 |
| a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 |
| a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 |
Now I want to calculate the amount the index increased between each `time_bucket`. I need the `max()` values to calculate the difference between two rows of this cagg in order to arrive at the amount the index increased over time, i.e. with the following query:

```sql
SELECT sensor_id,
       timestamp,
       CASE
           WHEN lag(max_value, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp) IS NULL
                OR lag(last_data_received, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp)
                   < first_data_received - INTERVAL '19 seconds'
               THEN max_value - min_value
           ELSE max_value - lag(max_value, 1) OVER (PARTITION BY sensor_id ORDER BY timestamp)
       END AS value_diff
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a';
```
This would give me the following result:

| sensor_id | timestamp | value_diff |
|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 40 |
| a | 2023-02-15 09:01:00.000000 | 50 |
| a | 2023-02-15 09:03:00.000000 | 50 |
| a | 2023-02-15 09:04:00.000000 | 100 |
If I only used the `max()` values to arrive at these results, a gap in the data would shift an increase of the index into the wrong time period: in the given raw data, the sensor was not reporting during the 09:02 minute. So if I used the `lag()` function on only the `max_value` column, it would put the increase of the index that happened in the non-reported time window into the 09:03:00 `time_bucket`, resulting in a `value_diff` of 200 for the 09:03:00 bucket, which is wrong and shows up as a 'peak' in that data point.
Hence why I am using the reported `min()` value whenever the gap between the `last_data_received` of the previous entry and the `first_data_received` of the current one is too big. Still, I want to take the diff between two `max()` values as often as I can, or else the increase of the index during the jump between two buckets is lost: if I always used `max() - min()`, the `value_diff` for the 09:04:00 bucket would report an increase of 50 instead of the 100 that I want.
Now, I have been experimenting with `interpolate()` to arrive at a more "complete" diff table/query, but I haven't managed to get the result that I want, which would be a table that looks as follows with the raw data and cagg given above:
| sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value | value_diff |
|---|---|---|---|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 | 40 |
| a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 | 50 |
| a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 | 50 |
| a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 | 100 |
This would allow me to easily detect gaps in the raw data / cagg and treat them further in Java, for instance with interpolation. It's not perfect, but it would let me circumvent some problems and at least interpolate where data is missing.
An ideal table would look as follows, but I don't think it's possible to achieve in the database:

| sensor_id | timestamp | first_data_received | last_data_received | min_value | max_value | value_diff | inter_bucket_diff |
|---|---|---|---|---|---|---|---|
| a | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:00.000000 | 2023-02-15 09:00:40.000000 | 0 | 40 | 40 | 60 |
| a | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:00.000000 | 2023-02-15 09:01:50.000000 | 100 | 150 | 50 | null |
| a | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:00.000000 | 2023-02-15 09:03:50.000000 | 300 | 350 | 50 | 50 |
| a | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:00.000000 | 2023-02-15 09:04:50.000000 | 400 | 450 | 100 | null |
And then I would have these as a cagg too, so I can query the tables with minimal effort from Java and differentiate between actual values and interpolated values.
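The closest sketch I have of the `value_diff` part of that table is to apply `time_bucket_gapfill()` on top of the cagg at query time (not inside it) and diff the gapfilled series with `lag()`, so that buckets without data come out as NULL diffs I could post-process in Java. Whether this behaves well, and whether it could ever be turned into a cagg itself, is exactly what I am unsure about (view and column names are from my example above):

```sql
-- Sketch: gapfill the minute buckets of the cagg, then diff with lag().
-- Buckets with no data yield NULL max_value, so their diff (and the diff
-- of the bucket right after a gap) comes out as NULL.
SELECT sensor_id,
       bucket_timestamp,
       max_value - lag(max_value) OVER (PARTITION BY sensor_id
                                        ORDER BY bucket_timestamp) AS value_diff
FROM (
    SELECT sensor_id,
           time_bucket_gapfill('1 minute', timestamp) AS bucket_timestamp,
           max(max_value) AS max_value
    FROM data_aggregation_minute_min_max
    WHERE sensor_id = 'a'
      AND timestamp >= '2023-02-15 09:00:00'::timestamp
      AND timestamp <  '2023-02-15 09:05:00'::timestamp
    GROUP BY sensor_id, bucket_timestamp
) AS filled;
```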
The trouble that I am having is that I do not know how to integrate `time_bucket_gapfill()` into a cagg, nor do I yet know how to use `interpolate()` in order to get the values that I gave in my example.
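My understanding from the docs is that `interpolate()` wraps an aggregate inside a gapfill query, so over the cagg it would look something like the following (untested, just how I read the docs; the column names are from my example):

```sql
-- Sketch: linearly interpolate max_value for the empty 09:02 bucket,
-- using the neighbouring buckets' values as endpoints.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS bucket_timestamp,
       interpolate(max(max_value)) AS max_value_filled
FROM data_aggregation_minute_min_max
WHERE sensor_id = 'a'
  AND timestamp >= '2023-02-15 09:00:00'::timestamp
  AND timestamp <  '2023-02-15 09:05:00'::timestamp
GROUP BY sensor_id, bucket_timestamp
ORDER BY bucket_timestamp;
```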
I have managed to use `time_bucket_gapfill()` in a query like this one, for instance:

```sql
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp) AS bucket_timestamp,
       max(value)
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp > '2023-02-15 08:55:00.000000'::timestamp
  AND timestamp < '2023-02-15 09:05:00.000000'::timestamp
GROUP BY sensor_id, bucket_timestamp
ORDER BY bucket_timestamp DESC;
```
But it's not giving me the results that I am looking for, since it also produces redundant rows for every empty bucket between the timestamps required in the `WHERE` clause. As far as I can tell this is the only way to use `time_bucket_gapfill()` currently, but maybe I am wrong.
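One thing I noticed in the docs is that `time_bucket_gapfill()` also takes optional `start` and `finish` arguments, so the filled range can be narrowed independently of the `WHERE` clause. I am not sure this fully solves my problem, but it would at least bound the redundant rows:

```sql
-- Only fill buckets between start and finish, even though the WHERE
-- clause scans a wider range of raw data.
SELECT sensor_id,
       time_bucket_gapfill('1 minute', timestamp,
                           start  => '2023-02-15 09:00:00'::timestamp,
                           finish => '2023-02-15 09:05:00'::timestamp) AS bucket_timestamp,
       max(value)
FROM raw_data
WHERE sensor_id = 'a'
  AND timestamp > '2023-02-15 08:55:00.000000'::timestamp
  AND timestamp < '2023-02-15 09:05:00.000000'::timestamp
GROUP BY sensor_id, bucket_timestamp
ORDER BY bucket_timestamp DESC;
```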
I hope that with the example given above the problem I am trying to solve is clear, and someone can help me out, or even tell me that what I am trying to solve is not possible with the tools that I am looking at.