TimescaleDB 2.0 Is Now Generally Available

TimescaleDB 2.0 is now generally and immediately available on our hosted services and for download 🔥. This milestone release includes new features and capabilities, including multi-node support for petabyte-scale workloads, substantial enhancements to continuous aggregates, and more. Learn what’s new and how to get started.

Recently, we announced the features and functionality in TimescaleDB 2.0, a multi-node, petabyte-scale, completely free relational database for time series. I’m pleased to inform you that as of today, TimescaleDB 2.0 is now generally available as a software download suitable for running in your own infrastructure, as well as via our managed cloud services.

TimescaleDB 2.0 includes the following new features:

  • Multi-node deployment: for horizontal scaling and rapid ingest of the most demanding time-series workloads (only available in the self-hosted version).
  • Updated, more permissive licensing: making all of our previous enterprise features free and granting more rights to users.
  • Substantial improvements to continuous aggregates: improving APIs and giving users greater control over the process.
  • User-defined actions: users can now define custom behaviors inside the database and schedule them using our job scheduling system.
  • New and improved informational views: covering hypertables, chunks, policies, and job scheduling.

TimescaleDB 2.0 is available immediately as a software download for your own infrastructure and via our managed cloud services.

For more information, we’ve created a list of resources to help you get started.

Read on for insight into the new features and changes in TimescaleDB.

What’s new in TimescaleDB 2.0

The full list of changes in TimescaleDB 2.0 covers everything from function signatures to major features, like continuous aggregates and compression:

  • The Timescale license has been simplified and made more permissive. All the functionality offered in TimescaleDB is now free to use (including all previously “Enterprise” features).
  • Distributed hypertables have been introduced as a new feature that scales hypertables across multiple nodes and to petabyte-sized datasets. We took great care in making the user experience similar to regular hypertables in order to minimize the need to learn new functionality.
  • Continuous aggregates have been overhauled, splitting the policy and automation from the core functionality while also aligning the interface with materialized views. Making automation optional gives more control over the materialization process to users and also makes it easier to support this feature in a multi-node configuration in the future.
  • Support for user-defined actions. It is now possible to schedule background jobs that execute a user-defined action (UDA). This gives more flexibility to users that want to implement custom policies or automate other tasks.
  • Information views, size utilities, and method signatures have been overhauled to be more consistent and intuitive. We’ve also improved the transparency of compressed hypertables so that common operations (like setting a tablespace or moving a chunk) work as expected.

We’ve published a detailed migration guide and are always available in our community Slack to help with your migration and answer any questions you have along the way.

Distributed hypertables

The flagship feature of TimescaleDB 2.0 is distributed hypertables (or “multi-node TimescaleDB”), enabling users to run petabyte-scale workloads across multiple physical TimescaleDB instances, called data nodes. The initial version of this functionality focuses on fast ingest speed for many clients and improved performance for common aggregate queries (accomplished by pushing down work to data nodes).

Once multi-node TimescaleDB is set up, creating a distributed hypertable is as simple as creating a regular hypertable:

-- Create a distributed hypertable partitioned on time and hostname
SELECT create_distributed_hypertable('conditions', 'time', 'hostname');

-- Insert some data
INSERT INTO conditions VALUES ('2020-12-14 13:45', 1, '1.2.3.4');
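
For context, the snippet above assumes that multi-node is already set up and that the conditions table exists. A minimal sketch of that setup might look like the following; the data node names, hosts, and table schema here are illustrative and not part of the original example:

-- On the access node: register the data nodes (names and hosts are placeholders)
SELECT add_data_node('dn1', host => 'dn1.example.com');
SELECT add_data_node('dn2', host => 'dn2.example.com');
SELECT add_data_node('dn3', host => 'dn3.example.com');

-- A plausible schema for the conditions table used above
CREATE TABLE conditions (
  time        TIMESTAMPTZ NOT NULL,
  temperature DOUBLE PRECISION,
  hostname    TEXT NOT NULL
);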

The distributed hypertable then spreads data across the data nodes according to the “hostname” key, and the data is further partitioned by time on each data node:

[Figure: A multi-dimensional distributed hypertable covering one access node (AN) and three data nodes (DN1-DN3). The “space” dimension (e.g., hostname in this image) determines which data node a chunk is placed on.]

However, this is only the first step of the journey to build out multi-node capabilities for TimescaleDB. Over the next few months, we will add more functionality to distributed hypertables, including JOIN optimizations, data rebalancing, distributed object management (e.g., keeping roles, UDFs, and other objects consistent across nodes), and more.

Major changes to continuous aggregates

Continuous aggregates were introduced in May 2019, as part of the 1.3.0 release of TimescaleDB, and quickly became a popular feature with our users. A continuous aggregate allows a user to downsample data by precomputing time-bucketed aggregates (for instance, hourly average temperature) for faster queries and a potential reduction in disk space usage. The feature is similar to regular materialized views, but allows continuous and efficient updates as new data becomes available. The ideal user experience is a continuous aggregate that is always up-to-date with the underlying source data.

However, this ideal user experience is complicated by:

  • Large data volumes and continuous writes: When there are large amounts of data, computing the aggregations might take a long time, and the system might struggle to catch up with continuous writes of new data and backfill. In such cases, it might be better to prioritize aggregating new data (in recent time intervals) over historical data.
  • Retention policies: A user should be able to have a retention policy on the source data, while keeping the aggregated data. But how does one distinguish between deletes due to retention and regular deletes in the data set? What if you want to drop the head of the aggregation (latest data) and re-import it, instead of dropping data in the tail (which is the typical retention use case)? What happens when data is backfilled into a time range that was dropped due to retention? What happens when a user restores the source data after dropping it? Is it important that the aggregate is refreshed prior to dropping data, to ensure the aggregate is up-to-date with the source data before it is gone? These questions highlight use cases that weren’t handled well with continuous aggregates previously.

Prior to the TimescaleDB 2.0 release, continuous aggregates always materialized data from the beginning of the source data set moving forward in time. To avoid tying up resources for prolonged periods (in the case of large amounts of data), the background materializer worked in increments.

However, this design had two main drawbacks:

(1) The user often had to wait for the materializer to “catch up” to recent time intervals. The result was often that the user didn’t see the data they expected, especially in recent time intervals that are often the most important ones for use cases such as monitoring.

(2) If there was backfill into older time regions, the materializer would start over from the point of backfill and then proceed to re-materialize all the data from that point onward. In the worst case, that could also lead to bad interactions with the incremental approach, as the materializer could (in theory) get stuck in a loop of re-materializing the same time region over and over again.

Adding data retention to the mix makes automated materialization an even more challenging problem. When a retention policy drops old data in the underlying source hypertable, the materializer would see this as a change in the data and proceed to re-materialize the dropped region. But that would mean also deleting the aggregate in the dropped time interval, which isn’t always what the user desires.

Prior to TimescaleDB 2.0, our “solution” was to simply prevent retention policies from working on hypertables that had associated continuous aggregates. This also meant that existing retention policies on a hypertable would start to fail as soon as a user created a continuous aggregate on that hypertable.

To re-enable retention policies, the user had to explicitly configure each continuous aggregate on the hypertable to make the materializer ignore the time region covered by retention, but that also meant that the ignored regions were no longer tracked and could no longer be reliably materialized again. While side effects like these are sometimes hard to avoid, it was clear that the trade-offs involved were hard to understand.

The lesson we learned here is that it is often very difficult to build a one-size-fits-all solution for automation, since it is difficult, if not impossible, to know the intentions of all users.

Questions that arise include: Should we prioritize historical or new data? How important is it to be up-to-date with backfill? Should we prioritize backfill in recent regions over older regions? Do we risk starving certain areas of the continuous aggregate that will never be up-to-date?

While it is possible to generalize some of the answers to these questions, we also know that the answer varies with use case. In the end, only the user knows what makes sense for their unique scenario.

Our answer for TimescaleDB 2.0 was to build new APIs for manual refresh of continuous aggregates that give users more control and ability to tell the system what regions of data to prioritize.

Users can create a continuous aggregate policy that defines a sliding refresh window that lets the system know which regions to keep up-to-date and which ones to ignore. With the new APIs, we are therefore no longer left guessing what makes sense for users or a particular use case, since refresh intervals are explicitly given as input.

As illustrated by the example below, we have also made automation optional.

CREATE MATERIALIZED VIEW conditions_summary_hourly
WITH (timescaledb.continuous) AS
SELECT device,
       time_bucket(INTERVAL '1 hour', time) AS bucket,
       AVG(temperature),
       MAX(temperature),
       MIN(temperature)
FROM conditions
GROUP BY device, bucket
WITH NO DATA;

-- Manually refresh the last day's worth of data
CALL refresh_continuous_aggregate('conditions_summary_hourly', now() - INTERVAL '1 day', now());

-- Add a continuous aggregation policy to keep the aggregate up-to-date
SELECT add_continuous_aggregate_policy('conditions_summary_hourly',
    start_offset => INTERVAL '1 month',
    end_offset => INTERVAL '1 h',
    schedule_interval => INTERVAL '5 min');

Users can further decide to rely on our built-in automation policies, write their own user-defined actions, or schedule refreshes from outside the database using their own orchestration. Or they can simply run manual refreshes on demand.
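
For instance, after backfilling data into an older region, a user can refresh just that window on demand (the dates below are illustrative):

-- Refresh only the backfilled region of the aggregate
CALL refresh_continuous_aggregate('conditions_summary_hourly', '2020-11-01', '2020-11-02');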

More control with user-defined actions

Since its 1.0.0 release, TimescaleDB has included the ability to run background automation tasks, a.k.a. policies, that implement automatic retention, compression, data reordering, and continuous aggregation. These policies are simple by design and cover the most common use cases for each of these features. As we learned from continuous aggregates, however, it can be difficult to implement a one-size-fits-all policy that works across a wide range of use cases. There are situations where more flexibility to customize policies is necessary.

In TimescaleDB 2.0, we’ve opened up our background automation framework to enable custom background jobs that we call user-defined actions. This allows users to run custom functions and procedures on a schedule within the database. For instance, you might want a single policy that works on many hypertables at the same time or one that combines continuous aggregation and retention in one policy to better control otherwise complicated interactions between refreshing and dropping data.

Below is an example of the former, i.e., a retention policy that applies to all hypertables in a database:

CREATE OR REPLACE PROCEDURE generic_retention (job_id int, config jsonb)
LANGUAGE PLPGSQL
AS $$
DECLARE
  drop_after interval;
BEGIN
  SELECT jsonb_object_field_text (config, 'drop_after')::interval INTO STRICT drop_after;

  IF drop_after IS NULL THEN
    RAISE EXCEPTION 'Config must have drop_after';
  END IF;

  -- Drop chunks older than drop_after across all hypertables in this database
  PERFORM drop_chunks(format('%I.%I', hypertable_schema, hypertable_name), older_than => drop_after)
    FROM timescaledb_information.hypertables;
END
$$;


-- Register the action to run daily, dropping data older than 12 months
SELECT add_job('generic_retention', '1d', config => '{"drop_after":"12 months"}');

More examples of user-defined actions can be found in our documentation.
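
To verify that the action was registered (assuming the add_job call above succeeded), the job can be inspected through the jobs view and triggered manually with run_job:

-- Look up the job created for the custom action
SELECT job_id, schedule_interval, config
  FROM timescaledb_information.jobs
 WHERE proc_name = 'generic_retention';

-- Run it immediately instead of waiting for the scheduler
-- (replace 1000 with the job_id returned by the query above)
CALL run_job(1000);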

Other changes to views, size utilities, and method signatures

The TimescaleDB-specific information views have been updated to provide more information and improved consistency. They have also been redesigned to work with distributed hypertables. Most of the dynamically computed information has moved into functions (e.g., table size information or things that need cross-node communication in a multi-node configuration).
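
For example, the 2.0 views can be queried directly; the specific columns selected here are just a sample:

-- Hypertables and their chunks
SELECT hypertable_name, num_chunks FROM timescaledb_information.hypertables;
SELECT chunk_name, range_start, range_end FROM timescaledb_information.chunks;

-- Background jobs (policies and user-defined actions) and their run statistics
SELECT job_id, proc_name, schedule_interval FROM timescaledb_information.jobs;
SELECT job_id, last_run_status, total_runs FROM timescaledb_information.job_stats;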

The size utility functions have been renamed and improved (e.g., hypertable_size). We’ve split them up into basic functions (returning only a single aggregated size value) and detailed functions (that break things down in multiple columns of information). Like the information views, size utility functions should work across both regular and distributed hypertables.
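
As a sketch of the new naming, using the conditions hypertable from earlier:

-- Single aggregated size (in bytes) for the whole hypertable
SELECT hypertable_size('conditions');

-- Detailed breakdown into table, index, and toast bytes
SELECT * FROM hypertable_detailed_size('conditions');

-- Per-chunk breakdown
SELECT * FROM chunks_detailed_size('conditions');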

We’ve also made improvements to compressed hypertables. For instance, several operations, like setting tablespace or moving chunks, now work on a hypertable that has compression enabled and will also apply to compressed data chunks. Planner statistics are also retained after compressing a chunk, so that the planner can make better decisions about how to execute queries.
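
For instance, moving a chunk on a hypertable with compression enabled might look like this (the chunk and tablespace names below are illustrative):

-- Enable compression on the hypertable
ALTER TABLE conditions SET (timescaledb.compress);
SELECT add_compression_policy('conditions', INTERVAL '7 days');

-- Move a chunk, including its compressed data, to another tablespace
SELECT move_chunk(
  chunk => '_timescaledb_internal._hyper_1_4_chunk',
  destination_tablespace => 'history_tablespace',
  index_destination_tablespace => 'history_tablespace');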

For more information on all of these changes, and how they affect your plans to upgrade to TimescaleDB 2.0, please read our migration guide.

Looking to the future

The two-year journey to TimescaleDB 2.0 has taught us much about the myriad of use cases for time-series databases, the ways in which we prioritize feature development, and the manner in which we implement our product. But we’ve always maintained our singular focus on you, our customers, who have embraced TimescaleDB in ways that are humbling and fill us with pride.

In the year ahead, you can expect us to maintain our focus on what you need from a relational database for time series. Features we have planned include:

  • PostgreSQL 13
  • Improvements to compression (transparent updates)
  • Improvements to multi-node and distributed hypertables
  • First-class multi-node experience
  • First-class support for common analytical functions on time-series data
  • And much more

We’d love to hear from you about what you want to see us improve and build. Join our community Slack and let us know. In addition, you can sign up for our Release Notes Newsletter, a no-frills email update sent whenever we make changes to TimescaleDB.

Get started with TimescaleDB 2.0

If this post has piqued your curiosity, TimescaleDB 2.0 is available immediately as a software download for your own infrastructure and via our managed cloud services.

For more information, we’ve created a list of resources to help you get started.
