How to store long time events?

anthony · July 25, 2022, 9:46am

Hello everyone,

I’m trying to use TimescaleDb to store long time events (events started for example 2 days ago and ended 1 hour ago).
There are two types of events (current events without end date and historical events with en date).
Events (or chunks) should be deleted after one month (based on the event’s end date).

My questions are :

1/ Is there a way to store events without end date and to keep them in a specific chunk that will never be deleted. Indeed, for current events I don’t want them to be deleted even if the event started more than 1 month ago.

2/ When I want to request data, I want to see only events that were active during the selected date range. For example, retrieve all events that were active between '2022/07/24 00:00:00" and “2022/07/24 23:59:59”. By active, I mean event that potentially :

starts before date range and ends after date range
starts before date range and ends between date range
starts between date range and ends after date range
starts between date range and ends between date range.

I found the overlaps postgres function but the problem is that Timescale still have to scan all chunks to find matching events. Indeed, as the end date is potentially between “start date range” and “now”, it means the more I want to have and old date range, the more time the request will take. For now, the only solution I found to limit the number of chunks to scan is to add a where clause where end_date > start_date_range.

Thanks for your time, hope you will give me some good advices

Anthony

James · August 1, 2022, 5:01pm

Hi Anthony, these are really good questions.

Firstly (as I think you’ve realised over the course of your investigations) Timescale doesn’t partition using a start and end time - we store Timeseries using a single column to select a chunk. This causes issues for people trying to do what you’re doing because that second time can trigger a full scan if you’re not careful when it’s “just another column”.

If you’ve got the concept of a “maximum duration” for an event then you can break the start/end events into two events with an ID and a single time (which is great for hypertables). Any metadata would either be attached to the events or in another table. That way you can select from the table with a duration window (padded by the max duration on either side) and be 100% sure you’re getting the start and the end (if it exists). I’ve done this in the past and it works well, but it’s unwieldily as you need to have a function which creates the SQL and then executes it to make sure you get constraint exclusion working.

Happy to delve into this more if it sounds like what you’re after?

Apart from that I’d also recommend using start time as your partitioning column.

anthony · August 2, 2022, 8:33am

Hello James,

Thank you for your response and your interest on my topic.

I really like the solution you proposed because it is very flexible. Events without end date are juste stored in the first hypertable (partitioned by start time) and when the event ends we just have to add a new row in the second hypertable (partitioned by end time), right ?

I have two questions regarding your solution and the partitioning column :

1/ regarding the 4 cases I posted to define what is an active event, I think your solution well covers 3 cases but not the first one (starts before and ends after date range). Indeed, if we want to retrieve events that started before and ended after the date range, we will have to scan all chunks from “start_table” that are before the end date and all chunks from “end_table” that are after the start date. Do you see ? In terms of performance, if I want to have one year of data retention with a one day chunk, I will have to scan 365 chunks for every requests… Or I misunderstood something…

2/ You suggest to use the start date as the partitioning column. But, how can I handle data retention policy with start date ? I mean, I have 30 days of data retention, an event has started 2 months ago and is still active. I don’t want to remove it because it still relevant for the user. And if I don’t want to remove it, I can’t use the drop_chunk function, so I will have to delete row by row.

Thanks again for you time !

Anthony