Questions on query performance against finalized CAGGs

martin · September 21, 2022, 3:27pm

I have been experimenting with the new finalized continuous aggregates and, after some initial teething problems, I think I’ve figured out how to use them efficiently (both in terms of storage and, more importantly, query performance). I’d like to share my observations and check my understanding is correct around the mechanism that has led to the improved query performance.

Some background. Our raw time series hypertable looks a little like this:

CREATE TABLE raw_table (
  event_time timestamptz NOT NULL,
  detector_id int NOT NULL,
  class_id smallint NOT NULL,
  some_bool boolean NOT NULL
);

And our existing, pre-2.7 CAGG, was built something like this:

CREATE MATERIALIZED VIEW one_hour_agg WITH (timescaledb.continuous) AS
(
	SELECT
		time_bucket('1 hour', event_time) AS bucket,
		detector_id, 
		class_id,
		SUM(CASE WHEN some_bool THEN 1 ELSE 0 END) AS true_count, -- previously could not use FILTER statements
		SUM(CASE WHEN some_bool THEN 0 ELSE 1 END) AS false_count
	FROM raw_table
	GROUP BY bucket, detector_id, class_id
);

After the release of 2.8.0, I recreated a near identical aggregate to the one above, the only difference being that it would implicitly be a new finalized aggregate. I was hoping to see reduced storage space and increased query performance. I did see reduced storage space but the query performance was initially quite a bit worse (often over 4x worse)!

A typical query might looks something like this:

SELECT 
	time_bucket('1 week', bucket) day_bucket, -- re-aggregrate to larger bucket
	detector_id,
	class_id,
	SUM(true_count) true_count,
	SUM(false_count) false_count
FROM one_hour_agg
WHERE detector_id IN (SOME LIST OF detector_ids)
AND bucket >= '2022-03-01'
AND bucket < '2022-09-01'
GROUP BY day_bucket, detector_id, class_id

I tried experimenting a little with the chunk size of the materialization hypertables without seeing much improvement.

But then I tried adding a explicit ORDER BY to the CAGG:

CREATE MATERIALIZED VIEW ordered_one_hour_agg
WITH (timescaledb.continuous) AS
(
	SELECT
		time_bucket('1 hour', event_time) AS bucket,
        detector_id, 
		class_id,
		COUNT(*) FILTER (WHERE some_bool) true_count, -- FILTERs, yaaaay! :tada:
		COUNT(*) FILTER (WHERE NOT some_bool) false_count
		FROM raw_table
	GROUP BY detector_id, bucket, class_id
	ORDER BY detector_id, bucket, class_id
)

And now the queries become super efficient, often 4x faster than the original, and 16x faster than the new unordered finalized CAGG.

Here are some EXPLAINs:
Original: https://explain.depesz.com/s/PScI
New: https://explain.depesz.com/s/ojhb
Ordered New: https://explain.depesz.com/s/G738

The main thing I’m taking from the above execution plans is that by having the CAGG ordered allows much less data to be read from disk (which appears to have been the main bottleneck during execution), presumably because the proportion of relevant data read from each page is much higher when it’s ordered.

Could anyone shine any light on if that is indeed the reason for the improved query performance? If that does turn out to be so, I think it might be worthwhile emphasising this in the documentation somewhere as, whilst it makes total sense in hindsight, it took a fair bit of trial and error to stumble upon!

Questions on query performance against finalized CAGGs

View original discussion in #general on Slack