TimescaleDB 1.6: Data retention policies for continuous aggregates
New release allows you to save resources and storage space while maintaining continuous aggregates
We are excited to announce the availability of TimescaleDB 1.6 today, which includes significant updates to continuous aggregates, among other enhancements.
In particular, this release introduces the ability to use a data retention policy with continuous aggregates (a feature used to speed up queries that aggregate over time). Data retention policies are useful when you want to analyze large datasets without accumulating additional storage costs.
In this post, we’ll briefly explain continuous aggregates - and the role they play in time-series data - and then dive into the specifics of the enhancements TimescaleDB 1.6 brings. But first, we want to thank TimescaleDB community members @optijon, @acarrea42, @ChristopherZellermann, and @SimonDelamare for contributing to this release. Your feedback is always appreciated!
Remind me about continuous aggregates?
At a high level, TimescaleDB continuous aggregates let users speed up queries that aggregate over time, which is particularly useful for use cases such as powering real-time dashboards and deriving value from IoT sensor data.
With TimescaleDB’s continuous aggregates, the view is refreshed automatically in the background as new data is added or old data is modified. This feature is unique compared to other solutions in the market because it properly tracks out-of-order data. The continuous aggregate will automatically recompute on out-of-order data without adding any maintenance burden to your database (i.e., it will not slow down your inserts).
We introduced continuous aggregates last spring with the release of TimescaleDB 1.3 and have made several improvements since. If you are interested in an in-depth look at how continuous aggregates work, see Continuous aggregates: faster queries with automatically maintained materialized views.
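For context, here is a minimal sketch of how a continuous aggregate is defined using the TimescaleDB 1.x syntax. The table and column names (`conditions`, `cpu_usage`, `conditions_hourly`) are hypothetical, chosen only for illustration:

```sql
-- Hypothetical per-second metrics hypertable.
CREATE TABLE conditions (
  time       TIMESTAMPTZ NOT NULL,
  device_id  INTEGER,
  cpu_usage  DOUBLE PRECISION
);
SELECT create_hypertable('conditions', 'time');

-- In TimescaleDB 1.x, a continuous aggregate is a view created
-- WITH (timescaledb.continuous); a background worker keeps the
-- materialized results up to date as new data arrives.
CREATE VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT device_id,
       time_bucket(INTERVAL '1 hour', time) AS bucket,
       avg(cpu_usage) AS avg_cpu
FROM conditions
GROUP BY device_id, bucket;
```

Queries against `conditions_hourly` then read precomputed hourly averages instead of scanning the raw per-second rows.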
Option to turn off data invalidation
This release includes the capability to change the limitations for data invalidation, so you can drop raw data without impacting your continuous aggregate. By turning invalidation off, you will have the option to use a data retention policy (more on this below).
In case you aren’t familiar, data invalidation is the process that tracks changes to historical data. When raw data that has already been materialized is inserted, updated, or deleted, the affected regions of the continuous aggregate are marked invalid so that the view stays consistent with the underlying data.
In other words, if you have a lot of data marked as invalid, the background worker has to go back and recompute those regions. This process takes a lot longer, consumes more resources, and can slow down refreshes of your continuous aggregate.
With TimescaleDB 1.6, the user has the option to turn data invalidation off for data older than a certain age. When data invalidation is turned off, the view is decoupled from the underlying data, allowing those data points to exist as they are without changes to the raw data invalidating the view. As a result, refresh jobs will complete faster since they only deal with new data, and you will have the option to implement a data retention policy.
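Concretely, this is controlled per continuous aggregate via the `timescaledb.ignore_invalidation_older_than` setting introduced in 1.6. A sketch, assuming a continuous aggregate named `conditions_hourly` (a hypothetical name for illustration):

```sql
-- Stop tracking invalidations for data older than 30 days:
-- changes to raw rows older than this threshold will no longer
-- trigger recomputation of the materialized view.
ALTER VIEW conditions_hourly
  SET (timescaledb.ignore_invalidation_older_than = '30 days');
```

Pick a threshold older than any modifications you expect to make; beyond it, the view is effectively frozen and the raw rows become safe to drop.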
Use a data retention policy to remove underlying data
When data reaches a certain age, you are able to remove it manually or via a data retention policy. Essentially, you have the flexibility to control the amount of data stored and allow data to “age out” after it has served its purpose.
However, prior to TimescaleDB 1.6, drop_chunks wasn’t optimized for use with continuous aggregates. If you called this function, the continuous aggregate would drop all data associated with any chunks dropped from the raw hypertable. As a result, users could not purge any underlying data once a continuous aggregate was established.
For example, say you are collecting metrics on a per-second basis for real-time monitoring. You monitor CPU consumption across 300 compute instances, producing 31,536,000 data points per year, per instance. At some point, you’d like to roll this up into 5 minute / 1 hour / 24 hour averages for analytics purposes. You’d use a continuous aggregate to do this, but you would need to keep storing all the underlying data - which would create added storage expenses and hinder downsampling (i.e. the act of applying an aggregate function to roll up a very granular dataset into a more coarse-grained set of data to enable analytics).
Now, here’s the good news! With this release, you can create a continuous aggregate and remove the underlying data using a data retention policy without impacting the materialized view. You achieve this by turning off data invalidation. This enables you to leverage downsampling by removing underlying data even after it’s rolled up.
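With invalidation turned off for older data, the raw chunks can be dropped while keeping the aggregate. A sketch using the `cascade_to_materializations` argument added to drop_chunks in 1.6 (the hypertable name `conditions` is hypothetical):

```sql
-- Drop raw chunks older than 90 days from the hypertable while
-- keeping the already-materialized aggregate data intact.
SELECT drop_chunks(
  older_than => INTERVAL '90 days',
  table_name => 'conditions',
  cascade_to_materializations => FALSE
);
```

This can be run manually or scheduled, so coarse-grained rollups remain queryable long after the per-second raw data has aged out.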
Essentially, this will allow users to save space on storage since they can choose to leverage data retention when using continuous aggregates.
What if you need to retain all your old data?
If you can’t leverage data retention policies because you need to retain all your raw data, remember that TimescaleDB native compression can also be used to substantially reduce storage costs. (And in fact, continuous aggregates and data retention policies can be combined with native compression when you want both!)
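A brief sketch of enabling native compression (available since TimescaleDB 1.5) on the same hypothetical `conditions` hypertable:

```sql
-- Enable compression, segmenting compressed data by device so
-- per-device queries stay efficient.
ALTER TABLE conditions SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);

-- Automatically compress chunks once they are older than 7 days.
SELECT add_compress_chunks_policy('conditions', INTERVAL '7 days');
```

Here the raw data is retained in full but stored in a compressed columnar form, which is why this pairs well with (or substitutes for) a retention policy.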
You can learn more about compression by reading “Building columnar compression in a row-oriented database” and by visiting the docs pages.
To recap, TimescaleDB 1.6 allows for a more efficient storage strategy around downsampling and more efficient use of resources. Additionally, you will see faster build and refresh times, which leads to less worker contention. (View the Release Notes here.)
For more information on getting started with continuous aggregates, check out our docs pages. We will also be publishing a blog on downsampling next week and will let you know when that’s live.
If you have any questions along the way, we are always available via our community Slack channel.
[COMING SOON: As a final note, we plan to move automated data retention policies and automated hypertable data reordering on disk to the Timescale Community version with the release of TimescaleDB 1.7. We will share more information in conjunction with that release.]