Query perf when getting items within a time range along with further n items beyond the range

ianf · March 29, 2022, 7:27am

Hi, I was hoping to get some help with some slow query performance - I have to support a query to select items within a time range, as well as a further n items beyond the time range (which end to take depends on the sort order)

Right now I am solving this by doing a union of 2 queries - for example:

(SELECT event_time,event_data FROM test_events
WHERE series_id = 1
AND event_time >= ‘20200101T00:00:01Z’
AND event_time < ‘20200101T00:00:05Z’
ORDER by event_time)
UNION ALL
(SELECT event_time,event_data FROM test_events
WHERE series_id = 1
AND event_time >= ‘20200101T00:00:05Z’
ORDER by event_time
LIMIT 3); – could be any number, usually 1-10

But my performance is poor and gets even worse when there are many millions of rows.
(my only index comes from my primary key definition:
CONSTRAINT vector_events_pkey PRIMARY KEY (series_id, event_time)
giving me a btree index as well as a uniqueness constraint needed for upsert )

I have a (non timescale db fiddle, showing schema, indexes and an example query here: DB Fiddle - SQL Database Playground )

Taken alone, the first part of the above query performs pretty well, the second part is not so good. So I have a couple of questions

Is there a better way of writing a query like this? Perhaps a timescale extension that can be used? I see that both parts of the union are doing their own scan of the index. Intuitively, I wish that the index scan of the bounded query could just continue a bit further to take the extra rows, but I don’t know if / how that can be done. I have heard of cursors, but never used them I wonder if that might be an approach to take?
Maybe there is something smarter to be done with the sorting?
I generally see in explain analyze that querying a hypertable with no upper bound on time such as
SELECT * FROM some_hypertable WHERE event_time > @lowerBound ORDER by event_time LIMIT 5
Seems to do a parallel index scan on all chunks after the lower bound - for our queries, the LIMIT is low, and the number of chunks very high, so this is quite costly - is there some way of giving timescale a hint to scan chunks sequentially? I think that in almost all cases, we should reach our result limit within 1-2 chunks of the start timestamp.

I really hope that this can be done better as this query is really letting down what’s otherwise a pretty good solution.
btw I’m using PostgreSQL 13 and Timescale 2.5 but am working to upgrade to postgres 14 and timescale 2.6. If any more info is needed I’ll be glad to provide it.

LorraineP · March 29, 2022, 12:36pm

Hello @ianf

It would be worth you sharing the EXPLAIN for the slow queries, you can format it nicely at https://explain.depesz.com/ There’s a discussion on query filtering here, too, for you to just check through SQL query filtering best practices

Once we have an EXPLAIN then I’ll see if I can get someone else to cast

ianf · March 29, 2022, 1:13pm

Hi, thank you I hadn’t used that tool before, it makes things a bit clearer to follow than the output from pgadmin.

Here’s the output for a test timescale database, containing 30 time series, 6 months of data with 1 measurement per second per timeseries.
In real scenarios we will have a lot more timeseries, although the data rate may not always be so high.

A query that using fairly recent data:
https://explain.depesz.com/s/IZt7

Queries further back in time tend to suffer more:
https://explain.depesz.com/s/IUP6

We have a 1 day chunk size, a compression policy on all events older than 1 month, and a retention policy at 3 yrs.

LorraineP · March 29, 2022, 4:38pm

It’s a great tool, thanks to an active and awesome PostgreSQL community member, I don’t think they are here in our Forum but on PostgreSQL Slack they’re also username depesz

I’ve also modified your membership to a higher level so that you don’t get blocked again … meanwhile I’ll see if we can get some

ryanbooz · March 29, 2022, 8:32pm

@ianf - It would be great if you could explain a little more about what you are expecting from performance. The explain plans you provided (thank you!) show what I would generally expect from this query.

Overall execution is ~40ms
Although the planner thinks it has to look at all of the chunks, almost all of them say (never executed) in the plan and you’ll see that there is no “(actual…)” execution numbers.

Maybe this isn’t a good sample?

Kay · March 29, 2022, 10:32pm

I’m interested in the same basic question. In my case, the database stores samples whenever there’s a change in value. Some series change every few seconds, while others are more like operator “set points” , so they don’t change for a long time.
When preparing a plot of data from some start to some end time, we don’t just need to SELECT .. WHERE time BETWEEN start AND end. In addition, we need that last sample at-or-before the start time, because that value is still valid at the start time. For setpoint type of data, that last sample at-or-before the start time may actually be many weeks old and there’s no new data since, which is fine, just means we need to get it.
The best way to fetch is seems to be
SELECT .. WHERE time <= start ORDER BY time DESC LIMIT 1, like your example of getting N=1 before the start, and I wonder if that’s really the best way.

ianf · April 4, 2022, 9:55am

Thank you for the reply, raising this example led me to check out what’s really happening and embarrassingly I realized that when I wrote out my simplified example, I had accidentally fixed a couple of issues in the query generated by our app code…

The app code query looked like this:

EXPLAIN ANALYZE (SELECT event_time,event_data0 FROM vector_events
WHERE vector_stream_id = 19
AND event_time >= '20211001T01:00:01Z'
AND event_time < '20211004T01:30:01Z'
ORDER BY event_time)
UNION -- Should have been UNION ALL (no need to consider filtering duplicates in the UNION)
(SELECT event_time,event_data0 FROM vector_events
WHERE vector_stream_id = 19
AND event_time >= '20211004T01:30:01Z'
ORDER BY event_time
LIMIT 10)
ORDER BY event_time; -- This sort is not needed

In the above, the final order by was doing an unnecessary sort, and the UNION should have been a UNION ALL, since we dont care about filtering duplicate events.

So that was a really useful find, and the analysis tool really helped me to focus on the real problems in my query rather than getting distracted by all those apparent scans, which were never really being executed.

However, now that the simple case is solved, I still have a similar kind of problem when dealing with aggregates, in particular the second part of the UNION query which is a kind of open ended query with greater than filter on time and some LIMIT.

Here’s an example of one such open ended query for aggregates:

SELECT time_bucket (INTERVAL '1 second', ev.event_time) AS time,
  ARRAY[CAST(AVG(ev.event_data0) AS real), MIN(ev.event_data0), MAX(ev.event_data0), COUNT(ev.event_data0)]
FROM vector_events ev
WHERE vector_stream_id = 15 AND ev.event_time >= time_bucket (interval '1 second', '2021-11-14T00:00:00Z')
GROUP BY time
ORDER BY time AS
LIMIT 90;

I’m trying to get the avg,min,max and count for the aggregate
In cases like this, all chunks are getting evaluated, which is demonstrably slow, and involves writing to temp files.
https://explain.depesz.com/s/SPEZ
That’s 1.3GB of disk read and 2.5GB of disk write. I know my data set has only 1 event per second, so this query just has to get 90 raw events, however it seems to be going through everything.

sven · May 2, 2022, 12:15am

From the EXPLAIN output you can see it is doing a parallel HashAggregate, unlike GroupAggregate HashAggregate does not return ordered output so a Sort node is put on top. Can you try rerunning the query with set enable_hashagg to false; turning off parallelism might also lead to the desired result and switch to a plan with GroupAggregate (set max_parallel_workers_per_gather to 0;).

ianf · May 31, 2022, 8:18am

Thank you for the suggestion, I haven’t had a chance to check it out yet, but I will do - it sounds like just what I was hoping for - I am a little nervous about any side effects of such a server level (?) change it might affect other queries, but as mentioned we probably don’t run more than about 10 - 20 different queries in the whole app - and this one is a big performance killer.

In the meantime I made a quick fix to my problem by assuming that those further n items are almost always within the next 3 days of the upper bound of my query.
So a query which was
‘SELECT all aggregated events between 1st May and 5th May AND the next 10 events after that’
becomes
SELECT all aggregated events between 1st May and 5th May AND the next 10 events after that, but before 8th May’
Which is a lot faster, but will occasionally not find all the events when the time series is sparse.
So I’m not entirely happy about it, but it seems to be good enough for end users.