Parallel queries slower than sequential queries

Hello,

I am currently looking into TimescaleDB to figure out whether we can use it in production. My experience with it is very limited.

My simple test setup looks like this. I have one (big) table that contains data like this:
timestamp, value a, value b, value c.

I created a hypertable partitioned on the timestamp and added indexes on all value columns, since I want to filter on them. My main use case is aggregation with time bucketing.

When I run queries sequentially, one query at a time, everything is fine and the DB is responsive. However, when I run the same queries from multiple threads, completing all of them takes much longer (a factor of two to three) than running them single-threaded one after another.

This is very surprising to me. Is this just how TimescaleDB operates, or am I doing something wrong? There are no foreign key relations, and there were no write operations during querying. It is running on a single PostgreSQL instance. I queried it from a Java 8 program using HikariCP as the connection pool.

Any advice would be highly appreciated. Thanks for reading. Cheers!

Hi @Lagoon, have you checked whether any locks are taken during the parallel run? Any chance you could share an example of your query and the data schema?

Hi @jonatasdp

As far as I can see, a lot of AccessShareLocks and a few ExclusiveLocks are taken during querying, but I am not sure I am checking this correctly. If you have a link on how to monitor this properly, I would gladly check again.
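For what it’s worth, this is roughly the query I used to look at the locks (adapted from the usual pg_locks examples, so it may well be missing something):

-- Show the locks held by sessions in this database while the test is running
SELECT a.pid,
       l.locktype,
       l.mode,
       l.granted,
       a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE a.datname = current_database()
ORDER BY a.pid, l.mode;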

My table looks like this:

CREATE TABLE public.parameters (
    id text NOT NULL,
    generation_time timestamp with time zone NOT NULL,
    recording_time timestamp with time zone NOT NULL,
    reception_time timestamp with time zone NOT NULL,
    raw_value double precision NOT NULL,
    eng_value double precision NOT NULL,
    channel text DEFAULT 'Channel1'::text,
    raw_value_type smallint DEFAULT 15,
    raw_data jsonb DEFAULT '{}'::jsonb
);


ALTER TABLE public.parameters OWNER TO lagoon;
ALTER TABLE ONLY public.parameters
    ADD CONSTRAINT parameters_pkey PRIMARY KEY (id, generation_time);
CREATE INDEX parameters_channel_idx ON public.parameters USING btree (channel);
CREATE INDEX parameters_generation_time_idx ON public.parameters USING btree (generation_time DESC);
CREATE INDEX parameters_raw_value_type_idx ON public.parameters USING btree (raw_value_type);
CREATE INDEX parameters_reception_time_idx ON public.parameters USING btree (reception_time);
CREATE INDEX parameters_recording_time_idx ON public.parameters USING btree (recording_time);
CREATE TRIGGER ts_insert_blocker BEFORE INSERT ON public.parameters FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker();
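
The hypertable itself was created on generation_time with something like the following (I left the chunk interval at the default, so that part is from memory):

-- Default chunk interval; no chunk_time_interval was specified explicitly
SELECT create_hypertable('public.parameters', 'generation_time');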

And this is the query that I run for different ids:

"SELECT time_bucket('1 day', generation_time) AS day, " +
                    "       min(raw_value) AS min, " +
                    "       max(raw_value) AS max, " +
                    "       avg(raw_value) AS avg, " +
                    "       count(*) AS count " +
                    "FROM public.parameters " +
                    "WHERE generation_time >= ? AND generation_time < ? AND id = ? " +
                    "GROUP BY day " +
                    "ORDER BY day;";

The data takes up roughly 110GB while the indexes add another 170GB.

When I run the queries sequentially, each query takes around 500 ms (hot). My naive expectation was that if I ran those queries in parallel, all of them would be answered within about 500 ms. However, 5 queries from 5 threads look like this:

Results for ID: test1
Thread for id test1 finished in 536 ms

Results for ID: test2
Thread for id test2 finished in 2373 ms

Results for ID: test3
Thread for id test3 finished in 2435 ms

Results for ID: test4
Thread for id test4 finished in 2475 ms

Results for ID: test5
Thread for id test5 finished in 2481 ms

So while the first query is as fast as usual, all the others take about 5 times longer. In total the operation takes about as long as if the queries had been run sequentially.

Thanks again for reading. Any ideas?
Cheers

Hi @Lagoon, thanks for the details. Yes, that does seem quite slow somehow.

One idea for further debugging would be to run the queries on threads as before, but capture the output of EXPLAIN ANALYZE for each of them and compare it with the single-threaded version.

Can you share what time range you are querying? Is it very large? Are the chunks compressed? Just thinking out loud, my guess is that it could be a lock around DecompressChunk that allows only one decompression of each chunk at a time.

Thank you for your suggestions, @jonatasdp.

select * from timescaledb_information.compressed_chunk_stats;

Returns 0 rows, so I guess the chunks are not compressed?

The time ranges are very long indeed. The test data covers several hundred parameters at a resolution of 1 minute over two years, and the query aggregates over 1 year. To be very performant you would probably want some sort of pre-aggregation, but I wanted to understand how flexible you can be with ad-hoc queries in TimescaleDB.

I tried EXPLAIN ANALYZE, but to me the output looks very similar in both cases. The most obvious difference is that for the first query it states:

    Workers Planned: 6
    Workers Launched: 5

while for the following queries it states

    Workers Planned: 6
    Workers Launched: 0

Essentially the output looks like this (truncated; the full plan contains a lot more Parallel Bitmap Heap Scans, one per chunk):
First query:

Finalize GroupAggregate  (cost=574270.40..574315.59 rows=55 width=40) (actual time=526.233..556.650 rows=365 loops=1)
  Group Key: (time_bucket('1 day'::interval, _hyper_1_87_chunk.generation_time))
  ->  Gather Merge  (cost=574270.40..574310.64 rows=330 width=64) (actual time=526.221..556.478 rows=395 loops=1)
        Workers Planned: 6
        Workers Launched: 5
        ->  Sort  (cost=573270.30..573270.43 rows=55 width=64) (actual time=504.171..504.203 rows=66 loops=6)
              Sort Key: (time_bucket('1 day'::interval, _hyper_1_87_chunk.generation_time))
              Sort Method: quicksort  Memory: 34kB
              Worker 0:  Sort Method: quicksort  Memory: 34kB
              Worker 1:  Sort Method: quicksort  Memory: 34kB
              Worker 2:  Sort Method: quicksort  Memory: 34kB
              Worker 3:  Sort Method: quicksort  Memory: 34kB
              Worker 4:  Sort Method: quicksort  Memory: 33kB
              ->  Partial HashAggregate  (cost=573268.02..573268.71 rows=55 width=64) (actual time=504.090..504.138 rows=66 loops=6)
                    Group Key: time_bucket('1 day'::interval, _hyper_1_87_chunk.generation_time)
                    ->  Result  (cost=369.38..572160.71 rows=88585 width=16) (actual time=7.267..484.004 rows=87600 loops=6)
                          ->  Parallel Append  (cost=369.38..571053.39 rows=88585 width=16) (actual time=7.261..473.124 rows=87600 loops=6)
                                ->  Parallel Bitmap Heap Scan on _hyper_1_87_chunk  (cost=369.51..11050.84 rows=3323 width=16) (actual time=7.904..66.623 rows=10080 loops=1)
                                      Recheck Cond: ((id = 'test1'::text) AND (generation_time >= '2022-01-01 00:00:00+00'::timestamp with time zone) AND (generation_time < '2023-01-01 00:00:00+00'::timestamp with time zone))
                                      ->  Bitmap Index Scan on "87_87_parameters_pkey"  (cost=0.00..366.93 rows=10302 width=0) (actual time=6.483..6.483 rows=10080 loops=1)
                                            Index Cond: ((id = 'test1'::text) AND (generation_time >= '2022-01-01 00:00:00+00'::timestamp with time zone) AND (generation_time < '2023-01-01 00:00:00+00'::timestamp with time zone))

Second Query:

Finalize GroupAggregate  (cost=585315.80..585361.00 rows=55 width=40) (actual time=2649.451..2657.856 rows=365 loops=1)
  Group Key: (time_bucket('1 day'::interval, _hyper_1_82_chunk.generation_time))
  ->  Gather Merge  (cost=585315.80..585356.05 rows=330 width=64) (actual time=2649.433..2657.709 rows=365 loops=1)
        Workers Planned: 6
        Workers Launched: 0
        ->  Sort  (cost=584315.70..584315.84 rows=55 width=64) (actual time=2643.755..2643.827 rows=365 loops=1)
              Sort Key: (time_bucket('1 day'::interval, _hyper_1_82_chunk.generation_time))
              Sort Method: quicksort  Memory: 76kB
              ->  Partial HashAggregate  (cost=584313.43..584314.11 rows=55 width=64) (actual time=2643.519..2643.688 rows=365 loops=1)
                    Group Key: time_bucket('1 day'::interval, _hyper_1_82_chunk.generation_time)
                    ->  Result  (cost=369.42..583183.34 rows=90407 width=16) (actual time=3.884..2526.014 rows=525602 loops=1)
                          ->  Parallel Append  (cost=369.42..582053.25 rows=90407 width=16) (actual time=3.878..2462.689 rows=525602 loops=1)
                                ->  Parallel Bitmap Heap Scan on _hyper_1_82_chunk  (cost=506.02..14868.58 rows=4551 width=16) (actual time=6.262..50.874 rows=10080 loops=1)
                                      Recheck Cond: ((id = 'test5'::text) AND (generation_time >= '2022-01-01 00:00:00+00'::timestamp with time zone) AND (generation_time < '2023-01-01 00:00:00+00'::timestamp with time zone))
                                      Heap Blocks: exact=10080
                                      ->  Bitmap Index Scan on "82_82_parameters_pkey"  (cost=0.00..502.50 rows=14107 width=0) (actual time=4.495..4.495 rows=10080 loops=1)
                                            Index Cond: ((id = 'test5'::text) AND (generation_time >= '2022-01-01 00:00:00+00'::timestamp with time zone) AND (generation_time < '2023-01-01 00:00:00+00'::timestamp with time zone))

Other things I noticed: the first Gather Merge returns 395 rows while the second returns 365 (as it should, I guess). And the second query shows ‘Heap Blocks: exact=’ after each Parallel Bitmap Heap Scan, which the first does not.

Can you spot something? Could this be storage related?

Cheers

Hi @Lagoon! Thanks for providing such detailed information about your setup and the performance behavior you’re observing. Let’s dive into what’s happening here.

Looking at your EXPLAIN ANALYZE output, there’s something particularly interesting: the difference in worker behavior between your first and subsequent queries:

First query:

Workers Planned: 6
Workers Launched: 5

Subsequent queries:

Workers Planned: 6
Workers Launched: 0

This is a key insight into why you’re seeing the performance difference. The first query effectively utilizes parallel workers, while the subsequent queries aren’t launching any workers at all, which explains the roughly 5x slowdown you’re observing. Parallel workers come from an instance-wide pool (bounded by max_parallel_workers and max_worker_processes); when concurrent sessions have already claimed all of them, later queries still produce a parallel plan, but the Gather node gets zero workers and the leader process has to scan everything by itself.

A few thoughts and recommendations:

  1. Your table structure looks good, but with your data size (110GB of data plus 170GB of indexes), parallel query execution is crucial for performance. Let’s check a few PostgreSQL parallel query settings (see the sketch after this list for how they could be raised):
SHOW max_parallel_workers_per_gather;
SHOW max_parallel_workers;
SHOW max_worker_processes;
  2. I’d strongly recommend implementing continuous aggregates for your use case with long time ranges (2 years of minute-resolution data). They’re perfect for this scenario:
CREATE MATERIALIZED VIEW daily_parameter_stats
WITH (timescaledb.continuous) AS
SELECT 
    time_bucket('1 day', generation_time) AS day,
    id,
    min(raw_value) AS min,
    max(raw_value) AS max,
    avg(raw_value) AS avg,
    count(*) AS count
FROM parameters
GROUP BY day, id;
  3. Given your data volume, compression could help significantly. You might want to consider enabling it:
ALTER TABLE parameters SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'id'
);
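
If the settings from point 1 turn out to be the bottleneck, a rough sketch of raising them could look like this. The numbers are purely illustrative and should be sized to the machine’s core count rather than copied as-is:

-- Illustrative values only; size these to the number of CPU cores available
ALTER SYSTEM SET max_worker_processes = 32;            -- hard cap on background workers, requires a restart
ALTER SYSTEM SET max_parallel_workers = 24;            -- pool of parallel workers shared by all sessions
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;  -- workers a single Gather / Gather Merge may request
SELECT pg_reload_conf();                               -- applies the last two; max_worker_processes needs a restart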

Could you run those PostgreSQL settings queries? That would help us understand if configuration limits might be affecting your parallel query execution.

Also, what’s your current maintenance_work_mem setting? With your index size being larger than your data, we might need to tune this for better concurrent performance.
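
You can check it the same way as the parallel settings above:

SHOW maintenance_work_mem;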