Finding delta between events that are outside of the `time_bucket` window

I have a hypertable which holds a list of events.
Each event represents an AWS instance changing price.
For a given window of time I want to calculate the total cost of all the running instances within that window.

Let me give you an example. Say we have a table called instance_events:

| created_at | cost_per_minute ($) |
|------------|---------------------|
|   08:04:00 |               0 |
|   08:14:00 |              10 |
|   08:17:00 |              15 |
|   08:19:00 |               0 |
|   08:22:00 |              10 |

If we calculate the window from 08:15:00 -> 08:20:00:

  • At 08:15:00 -> 08:17:00 the instance costs $10/min, so the cost here is $10 * 2 mins = $20
  • At 08:17:00 -> 08:19:00 the instance now costs $15/min, so the cost here is $15 * 2 mins = $30
  • At 08:19:00 -> 08:20:00 the instance costs $0/min, so the cost here is $0 * 1 min = $0

We can then sum these values: $20 + $30 + $0 = $50.
So from 08:15:00 -> 08:20:00 the total money spent is $50.

I can write a long, ugly query in plain PostgreSQL that does the job (here against a test table wat_test, where t is the minute and the cost is stored per hour):

Pure Postgres Query

```sql
WITH
-- boundary row at the window start: the most recent cost at or before t = 15,
-- i.e. the event that sits outside the window
first_event AS (SELECT 15 AS t, COALESCE((SELECT cost_per_hour FROM wat_test WHERE t <= 15 ORDER BY t DESC LIMIT 1), 0) AS cost_per_hour),
-- boundary row at the window end; its cost is never multiplied (LEAD is NULL there)
last_event AS (SELECT 20 AS t, COALESCE((SELECT cost_per_hour FROM wat_test WHERE t = 20), 0) AS cost_per_hour),
within_window AS (SELECT * FROM wat_test WHERE t >= 15 AND t < 20),
-- UNION (not UNION ALL) so a real event exactly at t = 15 isn't counted twice
all_events AS (SELECT * FROM first_event UNION SELECT * FROM within_window UNION SELECT * FROM last_event),
-- pair each event with the next one to get the interval length in minutes
-- (the window's ORDER BY makes a separate ordered_events CTE unnecessary)
with_diff AS (SELECT *, LEAD(t) OVER (ORDER BY t) AS t_2, LEAD(t) OVER (ORDER BY t) - t AS diff FROM all_events)

SELECT *, cost_per_hour / 60.0 * diff AS abs_cost FROM with_diff; -- 60.0, not 60, to avoid integer division
```

I was wondering if there is any way of doing this using TimescaleDB.
I know you can bucket things by time, but in order to calculate the first cost accurately you need to know the cost from before the window starts (in the example we needed to know that the cost at 08:14:00 was $10, even though it's outside our window of 08:15:00 -> 08:20:00).

I've tried using time_bucket_gapfill() and locf() to set up the window and pull the last item out of the previous window, but I think locf() is only applied after the aggregation has happened. I also don't know how to express the per-window cost calculation as an aggregate, which I have to do since I'm using time_bucket_gapfill().

I have a vague feeling I need to transform my table from a sparse to a dense format. Currently I log a row any time the cost_per_minute changes; I need to somehow transform that into a table where every minute there is a row containing the current cost, and then I could simply sum the cost over the window.
However, I've really got no clue where to start with that. :sweat_smile:

I'm new to TimescaleDB, and I've really gone and thrown myself in the deep end, so any and all suggestions are greatly appreciated.
Thanks!

Hello @chrishop, and welcome to our community!

As I can see, you're already very familiar with Common Table Expressions, aka CTEs, so just go one step further: bucket the data with the default time_bucket function and then use a CTE to access the bucketed data with a window function.
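A minimal sketch of that idea, assuming a hypothetical instance_events(time, cost_per_hour) hypertable (the names are not from this thread): bucket the raw events, then use LAG() to carry each bucket's closing cost into the next one, which is exactly the 'value from before the window' that was missing.

```sql
WITH bucketed AS (
    -- last() is TimescaleDB's "value at the latest time" aggregate
    SELECT time_bucket('5 minutes', time) AS bucket,
           last(cost_per_hour, time)      AS closing_cost
    FROM instance_events
    GROUP BY bucket
)
SELECT bucket,
       LAG(closing_cost) OVER (ORDER BY bucket) AS cost_entering_bucket,
       closing_cost
FROM bucketed
ORDER BY bucket;
```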

You can also use time-series pipelines to get access to the deltas directly through timevectors, which would be another, similar approach.
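For instance, something roughly like the following, assuming the timescaledb_toolkit extension is installed (a hedged sketch: exact namespacing varies by Toolkit version, and older releases keep these functions in the toolkit_experimental schema):

```sql
-- delta() emits the change between consecutive points;
-- unnest() expands the resulting timevector back into rows
SELECT time, value AS cost_change
FROM unnest((
    SELECT timevector(time, cost_per_hour) -> sort() -> delta()
    FROM instance_events
));
```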

> I have a vague feeling I need to transform my table from a sparse to a dense format. Currently I log a row any time the cost_per_minute changes; I need to somehow transform that into a table where every minute there is a row containing the current cost, and then I could simply sum the cost over the window.

Continuous aggregates can help to address this.
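As a rough sketch (again using the hypothetical instance_events table from above), a continuous aggregate in a recent TimescaleDB version could materialize the last known cost per minute:

```sql
CREATE MATERIALIZED VIEW instance_cost_per_minute
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', time)    AS minute,
       last(cost_per_hour, time) / 60.0 AS cost_per_minute
FROM instance_events
GROUP BY minute;
```

One caveat: time_bucket_gapfill() and locf() aren't allowed inside a continuous aggregate, so minutes with no events produce no rows; carrying the last value forward still has to happen at query time.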

Thanks for the response, here’s what I ended up coming up with:

```sql
WITH dense_table AS (
    SELECT
        time_bucket_gapfill('1 minute', time) AS one_min_interval,
        locf(avg(cost_per_hour) / 60) AS cost_per_minute
    FROM ts_instance_event_mock
    WHERE time >= '2023-09-01 08:00:00+09'::timestamptz
      AND time <  '2023-09-01 08:20:00+09'::timestamptz
    GROUP BY one_min_interval
    ORDER BY one_min_interval
)
SELECT time_bucket('5 minute', one_min_interval) AS five_mins,
       sum(cost_per_minute)
FROM dense_table
GROUP BY five_mins
ORDER BY five_mins;
```

I just need to group it by instance_id to do it for multiple instances, and I'm pretty sure I can use PARTITION BY in that situation.
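For completeness, a hedged sketch of what that per-instance variant might look like (the instance_id column is assumed): time_bucket_gapfill() accepts extra GROUP BY columns, and locf() then fills gaps within each instance's series.

```sql
WITH dense_table AS (
    SELECT
        time_bucket_gapfill('1 minute', time) AS one_min_interval,
        instance_id,
        locf(avg(cost_per_hour) / 60) AS cost_per_minute
    FROM ts_instance_event_mock
    WHERE time >= '2023-09-01 08:00:00+09'::timestamptz
      AND time <  '2023-09-01 08:20:00+09'::timestamptz
    GROUP BY one_min_interval, instance_id
)
SELECT time_bucket('5 minute', one_min_interval) AS five_mins,
       instance_id,
       sum(cost_per_minute) AS total_cost
FROM dense_table
GROUP BY five_mins, instance_id
ORDER BY five_mins, instance_id;
```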

Those time-series pipelines look very interesting. I might explore those some more, thank you for pointing them out to me :slight_smile:
