Continuous aggregate that stores min/max as separate rows

I am trying to create a continuous aggregate that stores the min/max of my data in 1-second buckets.
My problem is that I need to retain the original datetime and store the min/max values as separate rows, not in different columns.
I came up with something like this:

SELECT time_bucket('1 second', datetime) AS datetime, min(measuring_result) AS measuring_result
    FROM measurements  -- table name assumed; not shown in the original post
    WHERE device_uuid = 'b90a7f6e-78ea-4d8e-a9c7-d998374b70c1'
    GROUP BY time_bucket('1 second', datetime)
UNION ALL
SELECT time_bucket('1 second', datetime) AS datetime, max(measuring_result) AS measuring_result
    FROM measurements
    WHERE device_uuid = 'b90a7f6e-78ea-4d8e-a9c7-d998374b70c1'
    GROUP BY time_bucket('1 second', datetime)
ORDER BY datetime ASC;

But the problem is that the datetime in the output is the bucket value.
I need the output to contain the original datetime of the min/max row, not the one truncated by the time_bucket. Any idea how to achieve that?

Also, it seems that you cannot use UNION in a materialized view.

This is a request we’ve seen a few times. Unfortunately, we don’t have a specific min/max function that also returns the timestamp.

This should accomplish the same thing for now, even if it’s a bit less efficient than a dedicated function would be. The first() and last() functions select a value sorted by another column. Most of our examples sort on time, but you can sort on something else. I’d strongly encourage you to be on TimescaleDB 2.7 if you take this approach, because the performance improvements, particularly the finalized nature of the materialized data, will make for a much better experience!

CREATE MATERIALIZED VIEW measuring_minmax_1s  -- view name assumed
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 second', datetime) AS datetime,
   first(measuring_result, measuring_result) AS min_measuring_result,
   first(datetime, measuring_result) AS min_measuring_result_time,
   last(measuring_result, measuring_result) AS max_measuring_result,
   last(datetime, measuring_result) AS max_measuring_result_time
FROM measurements  -- table name assumed
GROUP BY 1;
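Assuming the aggregate above were created as a continuous aggregate view (the name measuring_minmax_1s here is just a placeholder, not from the thread), reading it back would look something like:

```sql
-- measuring_minmax_1s is a placeholder name for the continuous aggregate
SELECT datetime,
       min_measuring_result, min_measuring_result_time,
       max_measuring_result, max_measuring_result_time
FROM   measuring_minmax_1s
WHERE  datetime > now() - INTERVAL '5 minutes'
ORDER  BY datetime;
```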

Let us know if that helps as a path forward.

I’d encourage you to create an issue for the TimescaleDB Toolkit to implement something like this. The team can give feedback and determine if it’s best accomplished through the extension or elsewhere.


Thanks! That helps “a little bit”, but I really wish I could change that to create two rows. I even came up with something like this:

SELECT * FROM
	json_to_recordset((
		SELECT ('[' || string_agg(t.row, ',') || ']')::json FROM
			(SELECT	json_build_object(
						'device_uuid',      first(device_uuid, datetime),
						'measuring_result', first(measuring_result, measuring_result),
						'datetime',         first(datetime, measuring_result))::TEXT || ',' ||
					json_build_object(
						'device_uuid',      last(device_uuid, datetime),
						'measuring_result', last(measuring_result, measuring_result),
						'datetime',         last(datetime, measuring_result))::TEXT
				AS row
			FROM measurements  -- table name assumed
			GROUP BY time_bucket('1 second', datetime), device_uuid) AS t
	))
	AS x(device_uuid uuid, measuring_result float, datetime TIMESTAMP)

A hack, but unfortunately it doesn’t work because subqueries are not supported by continuous aggregates :/. Is there any other way around that?
The solution using first/last is fine for me, but then I need two queries.
Worst case, I will create two continuous aggregates, one with min and one with max, but that is not ideal.
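As a sketch of that worst case (all view names are placeholders): two single-purpose aggregates can still be combined into min/max rows with UNION at query time, since the restriction only applies inside the continuous aggregate definition itself:

```sql
-- cagg_min and cagg_max are placeholder names for two CAGGs, each
-- keeping the original timestamp of its extreme via first()/last()
SELECT bucket, measuring_result, original_datetime, 'min' AS kind
FROM   cagg_min
UNION ALL
SELECT bucket, measuring_result, original_datetime, 'max' AS kind
FROM   cagg_max
ORDER  BY bucket, kind;
```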

CREATE MATERIALIZED VIEW measuring_minmax_rows  -- view name assumed
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 second', datetime) AS bucket,
	unnest(ARRAY[
		first(measuring_result, measuring_result),
		last(measuring_result, measuring_result)
	]) AS measuring_result,
	unnest(ARRAY[
		first(datetime, measuring_result),
		last(datetime, measuring_result)
	]) AS datetime,
	unnest(ARRAY[
		first(device_uuid, measuring_result),
		last(device_uuid, measuring_result)
	]) AS device_uuid
 FROM measurements  -- table name assumed
 WHERE device_uuid = 'ec1c1fb7-4d4a-48ae-b3fb-20e3a7dfc67b'
 GROUP BY time_bucket('1 second', datetime), device_uuid

This doesn’t work either… it ends up with: ERROR: invalid continuous aggregate query DETAIL: CTEs, subqueries and set-returning functions are not supported by continuous aggregates.
I suspect it’s the set-returning part, which probably means this just isn’t currently possible?


Hmmm, I could use the unnest solution when selecting data from that view in my program.
That could work and would probably be the best solution.
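A sketch of that idea (view and table names are placeholders): keep the continuous aggregate itself free of set-returning functions by storing plain arrays, and move the unnest into the query that reads the view:

```sql
CREATE MATERIALIZED VIEW measuring_minmax_arrays  -- placeholder name
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 second', datetime) AS bucket,
       device_uuid,
       ARRAY[first(measuring_result, measuring_result),
             last(measuring_result, measuring_result)] AS measuring_result,
       ARRAY[first(datetime, measuring_result),
             last(datetime, measuring_result)]         AS datetime
FROM   measurements                                    -- placeholder table
GROUP  BY 1, 2;

-- unnest only at query time, outside the CAGG definition;
-- parallel unnest of equal-length arrays pairs the elements up
SELECT bucket, device_uuid,
       unnest(measuring_result) AS measuring_result,
       unnest(datetime)         AS datetime
FROM   measuring_minmax_arrays;
```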

Interesting, you’re actually approaching what was going to be my second potential solution: essentially storing an array and unnesting it at query time. I just didn’t have time yesterday to work through it some more (I was going to try to do it without first/last in some way).

Based on your two examples, I just wanted to add some clarifications (which you’ve probably already figured out!):

Sort order when using first/last for multiple columns
If you’re going to use first/last to return multiple values for the (hopefully) same point, just make sure you’re using the same sorting column each time. I noticed that in your first response, the device_uuid call was sorted by the time column instead.

It’s also worth noting that if multiple measuring_result values qualify as “high” or “low”, you really need to sort by a secondary column. That’s not documented yet, but it is possible by providing a column set for the second parameter (e.g. first(device_uuid,(measuring_result,time))). Since you want to get other columns that coincide with your high/low value, you need to make sure you’re breaking ties with a consistent ordering; otherwise, there’s no 100% guarantee that the datetime or device_uuid actually matches the value you received.

You’ll need to test this, and hopefully we’ll be able to provide a more tailored solution in the future, but you need to be aware of it. Therefore, in your second example, you’ll need to have the same multi-column ordering for each function call.
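To make that concrete, here is a sketch (table name is a placeholder) where every first()/last() call breaks ties on the same composite key, so all returned columns come from the same row:

```sql
SELECT time_bucket('1 second', datetime)                     AS bucket,
       -- every call sorts by (measuring_result, datetime), so ties
       -- on measuring_result are broken identically across columns
       first(measuring_result, (measuring_result, datetime)) AS min_value,
       first(datetime,         (measuring_result, datetime)) AS min_time,
       first(device_uuid,      (measuring_result, datetime)) AS min_device,
       last(measuring_result,  (measuring_result, datetime)) AS max_value,
       last(datetime,          (measuring_result, datetime)) AS max_time,
       last(device_uuid,       (measuring_result, datetime)) AS max_device
FROM   measurements  -- placeholder table name
GROUP  BY 1;
```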

CAGG definition filtering
In your second response (using the arrays) I noticed that you filtered on device_uuid. Typically, you wouldn’t filter the CAGG to a specific device because it then limits that CAGG data to only ever be from that one device. If that’s your purpose, great! Just wanted to check. :slight_smile:

Querying the data
Obviously, realize that if your goal is to select all of the “high” data together, unnesting will create multiple rows, one for “high” and one for “low”. If you have no identifier in the array, you might have to select things based on array position instead, e.g. SELECT bucket, symbol, measuring_result[1], device_uuid[1], datetime[1] FROM test to get all of the “low” data together.
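If you’d rather get labeled rows than index into positions, one option (a sketch, assuming an array-storing view named measuring_minmax_arrays with the low value first and the high value second) is unnest ... WITH ORDINALITY:

```sql
-- ordinality 1 = "low" entry, 2 = "high" entry, matching the
-- ARRAY[first(...), last(...)] construction order
SELECT c.bucket,
       v.measuring_result,
       CASE v.ord WHEN 1 THEN 'low' ELSE 'high' END AS kind
FROM   measuring_minmax_arrays AS c
CROSS  JOIN LATERAL unnest(c.measuring_result)
       WITH ORDINALITY AS v(measuring_result, ord);
```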

Lots of food for thought there, but I’d be interested to know how you move forward and how it goes.

And again, your experience would make for great input to the toolkit team if you ever feel like creating an issue to describe the need! :wink:

Thanks for this comprehensive answer! I appreciate it.

I tested first(device_uuid,(measuring_result,time)); the view creates without a problem, and I can also refresh its data. But when I try to select anything from it I get an error: ERROR: input of anonymous composite types is not implemented. Any idea why that happens? Probably something not yet implemented; I am using TimescaleDB 2.6.0 with PostgreSQL 14. Although, I am not that worried about picking the wrong time when there are multiple rows with the min/max value; this should be pretty rare in my use case.

The filtering by device_uuid was only for testing, so I don’t get huge outputs from a query; since it didn’t work anyway, I didn’t bother removing it.

I have already looked into the Toolkit; it looks pretty great, especially since I am a big fan of Rust! Do you have a roadmap/plans for adding a Windows target to it? Unfortunately, in some deployments I am bound to it…


Ah! Yes, I was only testing with TimescaleDB 2.7, which I think is a requirement for the multi-column sorting. We completely redid how data is materialized in continuous aggregates in 2.7, and that’s why it worked for me. In 2.6 and lower, we have to store the ordering information for each column, and there’s a chance something didn’t get typed correctly in storage (especially since sorting on two columns is somewhat undocumented).

If you can upgrade to 2.7, you’ll see some big improvements as we discussed in the blog post I linked to.