Increase Your Storage Savings With TimescaleDB 2.6: Introducing Compression for Continuous Aggregates

Increase Your Storage Savings With TimescaleDB 2.6: Introducing Compression for Continuous Aggregates

Yesterday, we announced that we raised $110 million in our Series C. 🐯🦄🚀 Today, we keep celebrating the big news Timescale-style: with an #AlwaysBeLaunching spirit! We're excited to announce the release of TimescaleDB 2.6, a version that comes with new features highly requested by our community - most notably compression for continuous aggregates. 🔥 TimescaleDB 2.6 also includes the experimental support for timezones in continuous aggregates.

We love building in public. We are firm believers in the value of user feedback, as there’s no better way to improve your product than hearing from those who use it daily. We read all the thoughts and comments you share in our Community Slack, and we pay close attention to your feature requests in GitHub. When features get upvoted, it helps us prioritize what to work on next.

With today’s release, we are proud to bring you a top requested feature: the support for compression in continuous aggregates. Originally, we envisioned that compression would mostly be necessary over raw data, as continuous aggregates by themselves help downsample datasets considerably. But TimescaleDB users operate at such a scale that they requested compression also for their continuous aggregates to save even more disk space.

Screenshot of the GitHub issue that originated the support for compression in continuous aggregates
The issue that originated the support for compression in continuous aggregates

Continuous aggregates speed up aggregate queries over large volumes. You can think of them as a more powerful version of PostgreSQL materialized views, as they allow you to materialize your data while your view gets automatically and incrementally refreshed in the background. But continuous aggregates are also very useful for something else: downsampling. Indeed, another property of continuous aggregates is that you can keep them around even when the data from the underlying hypertable has been dropped. This allows you to reduce the granularity of your data once it reaches a certain age, liberating space while still enabling long-term analytics.

The ability to compress continuous aggregates takes this one step further. Starting with TimescaleDB 2.6, you can apply TimescaleDB’s native columnar compression to your continuous aggregates, freeing even more disk space. And by combining compression policies (which automatically compress data after a certain period of time) with data retention policies, you can automatically set up a downsampling strategy for your older data.

Figure describing the downsampling process through continuous aggregates and compression.
Figure describing the downsampling process through continuous aggregates and compression. Notice the relationship between the refresh policy, compression policy, and data retention policy.

TimescaleDB 2.6 also comes with another highly requested feature by the community: you're now able to create continuous aggregates with monthly buckets and/or timezones using time_bucket_ng. (Note that this is an experimental feature.)

time_bucket_ng is the “new generation” of our time_bucket hyperfunction, used for bucketing and analyzing data for arbitrary time intervals in SQL. You can think of time_bucket as a more powerful version of the PostgreSQL date_trunc function, allowing for arbitrary time intervals rather than the standard day, minute, hour provided by date_trunc. But time_bucket doesn’t yet support buckets by months, years, or timezones; time_bucket_ng expands the capabilities of time_bucket by including these features in our experimental schema. (We first introduced this feature thanks to an issue from the community.)

We’re eager to hear how this works for you. Our main goal with experimental features is to get as much input as possible - please, if you see something that needs improvements, tell us in GitHub!

Screenshot of the GitHub issue that originated the support for timezones in continuous aggregates.
The issue that originated the support for timezones in continuous aggregates.

​​TimescaleDB 2.6 is available today. If you are already a TimescaleDB user, check out our docs for instructions on how to upgrade. If you are using Timescale Cloud, upgrades are automatic, and no further action is required from your side (you can also start a free 30-day trial, no credit card required).

If you are new to Timescale and you want to learn more about continuous aggregates, compression, and time_bucket_ng, keep reading!

Once you’re using TimescaleDB, join our community. You can ask us questions in our Community Slack or in our brand new Community Forum, a better home for long-form discussions. We’ll be more than happy to solve any doubts you may have about compression and continuous aggregates, TimescaleDB 2.6, or any other topic. And if you share our mission of helping developers worldwide, we are hiring broadly across many roles!

Before moving on, a huge thank you to the team of engineers (and the entire team of reviewers and testers) that made this release possible. This includes the community members who helped us prioritize, develop, and test these features. As we’ve said before, none of this would be possible without your requests, your upvotes, and your feedback. Please, keep ‘em coming!

The power of continuous aggregates

Continuous aggregates in TimescaleDB are a more powerful version of PostgreSQL materialized views. For example, continuous aggregates are incrementally updated with a built-in refresh policy, so they stay up-to-date as new data is added. When querying a continuous aggregate, the query engine will combine the data that is already pre-computed in the materialized view with the newest raw data in the underlying hypertable. In other words, you will always see up-to-date results.

TimescaleDB uses internal invalidation records to determine which data has been changed or added in the underlying hypertable since the last refresh; when you refresh your continuous aggregate, only the new or updated data is computed. This means that TimescaleDB doesn’t need to look at your whole table every time you do a refresh, saving tons of computation resources, speeding up the incremental maintenance of continuous aggregate, and allowing you to get faster results in your aggregate queries.

This also implies that you can also use continuous aggregates for downsampling. Differently than with materialized views, you are able to retain the continuous aggregate even after the original raw data has been dropped. To reduce the granularity of your time-series data, you can simply define a continuous aggregate and delete your original data once it gets old - we will be demonstrating this process later in this blog post.

Compression in TimescaleDB

Together with continuous aggregates, compression is another oft-used feature of TimescaleDB. By defining a compression policy, TimescaleDB allows you to compress all data stored in a specific hypertable that is older than a specific period of time - let’s say, 7 days old. In this example, you would keep your last week’s worth of data uncompressed, which would allow you to write data into your database at very high rates, giving you optimal query performance for your shallow-and-wide queries as well. Once your data gets old enough (after 7 days), the compression policy would kick in, automatically compressing your data.

TimescaleDB is able to give you high compression rates by deploying best-in-class compression algorithms along with a novel hybrid row/columnar storage design. Once a chunk inside your hypertable becomes old enough, TimescaleDB compresses it by storing the columns into an array. In other words: your more recent chunks (in the previous example, all chunks newer than 7 days) will be stored in TimescaleDB as relational, row-based partitions. Once compressed, your chunks will be a columnar store.

A chunk before compression vs after compression.
An example chunk before compression vs after compression. After compression, the data previously stored in multiple rows is now stored as a single row, with the columns being stored as an array.

✨ If you want to dive deeper into how compression works in TimescaleDB, check out the following videos:

How to use continuous aggregates with compression for downsampling

As we introduced earlier, continuous aggregates can help you reduce the granularity of your dataset, with the ultimate goal of saving some disk space.

We explained previously how, in TimescaleDB, you are able to retain the continuous aggregate even after the original raw data has been dropped. So in order to considerably reduce your data size (and thus your storage costs) automatically and without losing the ability to do long term analytics on your older data, you can:

In this section, we’ll walk you through an example, using financial data from Alpha Vantage. (If you want to also load this dataset into your TimescaleDB instance, we have a step-by-step guide published in our docs.)

We will be using the following schema:

CREATE TABLE public.stocks_intraday (
	"time" timestamptz NOT NULL,
	symbol text NULL,
	price_open float8 NULL,
	price_high float8 NULL,
	price_low float8 NULL,
	price_close float8 NULL,
	trading_volume int4 null
	);
    
SELECT create_hypertable ('public.stocks_intraday', 'time');

Create a continuous aggregate and set up continuous aggregate policy

To start, let’s create a continuous aggregate using TimescaleDB’s time_bucket() function. We will record averages for the price at open, high, low, and close for each company symbol over a given day.

We also will set up a continuous aggregate policy, which will update and refresh data from the last four days. This policy will run once every hour:

CREATE MATERIALIZED VIEW stock_intraday_daily
WITH (timescaledb.continuous, timescaledb.materialized_only = true) AS
SELECT 
time_bucket( interval '1 day', "time") AS bucket,
AVG(price_high) AS high,
AVG(price_open) AS open,
AVG(price_close) AS close,
AVG(price_low) AS low,
symbol 
FROM stocks_intraday si 
GROUP BY bucket, symbol;

Enable and set up compression on your continuous aggregate

Now that we have defined the continuous aggregate together with a refresh policy for it, we can enable compression on this continuous aggregate, also setting up our compression policy. This compression policy will automatically compress all chunks older than 7 days:

ALTER MATERIALIZED VIEW stock_intraday_daily SET (timescaledb.compress = true);

-- Set up compression policy
SELECT add_compression_policy('stock_intraday_daily', INTERVAL '7 days');

Lastly, it is important to notice that updating data within a compressed chunk is not supported yet in TimescaleDB. This is relevant for correctly configuring compression policies in continuous aggregates: since refresh policies require chunks to be updated, we have to make sure that our recent chunks remain uncompressed. I.e., make sure you define your time intervals so that your continuous aggregate gets refreshed at a later date than when your compression policy is set.

How compression affects storage

So we officially have a compressed continuous aggregate, but you may be wondering: how much of a difference does compression make on this continuous aggregate?

In order to find out, let’s check out the stats on our compressed continuous aggregate. To do that, we first need to find the internal name for the materialized hypertable. We can do it by looking at our compressed tables:

SELECT * FROM timescaledb_information.compression_settings;

Then, we can use that to run the hypertable_compression_stats() command. (Note: The name of your materialized hypertable will most likely be different than the one shown below.)

SELECT * FROM hypertable_compression_stats('_timescaledb_internal._materialized_hypertable_3');

Results:
—---------------------------------|--------
total_chunks                   | 7
number_compressed_chunks       | 6
before_compression_table_bytes | 7471104
before_compression_index_bytes | 2031616
before_compression_toast_bytes | 49152
before_compression_total_bytes | 9551872
after_compression_table_bytes  | 401408
after_compression_index_bytes  | 98304
after_compression_toast_bytes  | 3178496
after_compression_total_bytes  | 3678208
Node_name

For this continuous aggregate, we got a compression rate of over 61%:

Figure showing the size of the continuous aggregate before and after compression (9.6 MB vs 3.7 MB).
Size of the continuous aggregate before and after compression (9.6 MB vs 3.7 MB)

A 61% compression rate would imply a very nice boost in your storage savings. However, if you’re used to compression in TimescaleDB, you may be wondering: why only 61%? Why cannot I get compression rates over 90% as I commonly see with my hypertables?

The reason behind this is that continuous aggregates store partial representation of the aggregates in bytea format, which is compressed using dictionary-based compression algorithms. So continuous aggregates cannot take advantage of other compression algorithms - at least, not yet. We’re experimenting with some concepts that may significantly increase the compression rates for continuous aggregates in the future. Stay tuned!

Set up a data retention policy on your raw data

Through the continuous aggregate we created earlier, we could effectively downsample our data. Our original dataset had one datapoint per minute; this is perfect for real-time monitoring, but a bit too heavy for the purpose of long-term analysis. In our continuous aggregate, we’re aggregating the data into 1 h buckets, which is a more manageable granularity if you’re planning to store this data long-term. So let’s save up some additional disk space by dropping the data in the underlying hypertable while keeping the continuous aggregate.

To do so, we will define a retention policy that automatically deletes our raw data once it reaches a certain age. This age is completely up to you and your use case (the retention policy below will delete the data older than a month, for example). As a reminder, this policy refers only to the data in your original hypertable - you will still keep the data materialized in the continuous aggregate.

SELECT add_retention_policy('stocks_intraday', INTERVAL '1 month');

You asked, and we delivered: introducing timezones for continuous aggregates

As we mentioned at the beginning of this post, TimescaleDB 2.6 introduces not only compression for continuous aggregates but also the possibility of using timezones in continuous aggregates, another highly requested feature by our community. Starting with TimescaleDB 2.6, you're able to create continuous aggregates with monthly buckets and/or timezones using time_bucket_ng.

This is an experimental feature: please, test it and send us your feedback! The more feedback we get, the faster we will make this feature ready for production. 🔥

time_bucket_ng is the “new generation” of our time_bucket hyperfunction, used for bucketing and analyzing data for arbitrary time intervals in SQL. You can think of time_bucket as a more powerful version of the PostgreSQL date_trunc function, allowing for arbitrary time intervals rather than the standard day, minute, hour provided by date_trunc.

But time_bucket doesn’t support time buckets by months, years, or timezones. time_bucket_ng expands time_bucket by including these features, giving users maximum flexibility in their queries. (We also first introduced time_bucket_ng thanks to an issue from the community!)

time_bucket_ng is still an experimental feature, being that it’s still being developed under our experimental schema. In Timescale, we like to bring constant value to our users by moving fast, but we also value stability - as we like to say, we move fast without breaking things. Through our experimental schema, we’re allowed to release experimental features that, even if they’re still not ready for production, are ready to be widely tested. As we keep mentioning through this post, we love to get as much feedback from the community as possible: releasing features under experimental helps us build robust features, as by the time we “graduate” these features out of the experimental schema, we are sure that everything is stable.

Starting with TimescaleDB 2.6, you can now use time buckets of months and years plus specify timezones in your continuous aggregates. For example, the continuous aggregate below tracks the temperature in Honolulu over monthly intervals. (Yes, we are dreaming of warmth 🏝 in this team!)

CREATE TABLE conditions(
  day timestamptz NOT NULL,
  city text NOT NULL,
  temperature INT NOT NULL);

SELECT create_hypertable(
  'conditions', 'day',
  chunk_time_interval => INTERVAL '1 day'
);

INSERT INTO conditions (day, city, temperature) VALUES
  ('2021-06-14 00:00:00 HST', 'Honolulu', 26),
  ('2021-06-15 00:00:00 HST', 'Honolulu', 22),
  ('2021-06-16 00:00:00 HST', 'Honolulu', 24),
  ('2021-06-17 00:00:00 HST', 'Honolulu', 24),
  ('2021-06-18 00:00:00 HST', 'Honolulu', 27),
  ('2021-06-19 00:00:00 HST', 'Honolulu', 28),
  ('2021-06-20 00:00:00 HST', 'Honolulu', 30),
  ('2021-06-21 00:00:00 HST', 'Honolulu', 31),
  ('2021-06-22 00:00:00 HST', 'Honolulu', 34),
  ('2021-06-23 00:00:00 HST', 'Honolulu', 34),
  ('2021-06-24 00:00:00 HST', 'Honolulu', 34),
  ('2021-06-25 00:00:00 HST', 'Honolulu', 32),
  ('2021-06-26 00:00:00 HST', 'Honolulu', 32),
  ('2021-06-27 00:00:00 HST', 'Honolulu', 31);

CREATE MATERIALIZED VIEW conditions_summary
WITH (timescaledb.continuous) AS
SELECT city,
   timescaledb_experimental.time_bucket_ng('1 month', day, 'Pacific/Honolulu') AS bucket,
   MIN(temperature),
   MAX(temperature)
FROM conditions
GROUP BY city, bucket;

-- to_char() is used because timestamptz is displayed in the sesison timezone by default
-- alternatively you can use SET TIME ZONE 'Pacific/Honolulu';
SELECT city, to_char(bucket at time zone 'HST', 'YYYY-MM-DD HH24:MI:SS') as month, min, max
FROM conditions_summary
ORDER by month, city;

   city   |        month        | min | max
----------+---------------------+-----+-----
 Honolulu | 2021-06-01 00:00:00 |  22 |  34
(1 row)

We’re eager to know how this feature is working for you so we can improve it. Please, reach out to us through GitHub, our Community Slack, or the Timescale Community Forum.

And thank you again for your invaluable support!

Get started

TimescaleDB 2.6 is already available for Timescale Cloud and self-managed TimescaleDB:

  • If you are a Timescale Cloud user, you will be automatically upgraded to TimescaleDB 2.6 during your next maintenance window. No action is required from your side. You can also create a free Timescale Cloud account to get a free 30-day trial, with no credit card required.
  • If you are using TimescaleDB in your own instances, check out our docs for instructions on how to upgrade.
  • If you are using Managed Service for TimescaleDB, TimescaleDB 2.6 will be available for you in the upcoming weeks.

Once you’re using TimescaleDB, connect with us! You can find us in our Community Slack and the Timescale Community Forum. We’ll be more than happy to answer any question on continuous aggregates, compression, TimescaleDB, PostgreSQL, or anything in between.

And if you want to help us build and improve features like compression, continuous aggregates, and time_bucket, we are hiring broadly across many roles! Join our global, fully remote team. 🌎

Ingest and query in milliseconds, even at terabyte scale.
This post was a collaboration between
12 min read
General
Contributors

Related posts