3 Ways to Plot Incremental PostgreSQL Materialized Views

From my experience as a developer for sensor equipment and data logging, I have found that TimescaleDB is a great tool for a large variety of purposes. From tracking metrics and statistics for long-term data analysis to providing fresh data for anomaly detection and alerting, it offers a wide range of features that make our lives easier, such as compression, retention, and, most notably for this blog, its enhanced version of materialized views—downsampling through continuous aggregates.

However, these features need to be used properly to give the best results, and there is one I have often struggled to integrate smoothly: continuous aggregates—specifically, how to plot them.

Time-series plots and automatically updated materialized views: The problem

The problem is not setting up the aggregates, which are automatically updated versions of PostgreSQL materialized views. Actually, TimescaleDB makes that almost as easy as setting up any other view, and once it’s set up, there’s nothing to worry about. You get live data with automatic updates, compression, and retention—all out of the box!

What often adds a bit of complexity, though, is putting them to use in applications that cover a broad range of time intervals, namely plotting data in tools like Grafana.

Editor’s Note: Check out our blog post for more about time-series plots.

See, if you have a plot that shows you CPU usage of a server rack over time, you’ll probably look at this graph not only over a long time range of months or years to monitor average use but also zoom in to shorter ranges of a day or two to check for unusual spikes or similar symptoms. The same holds true for many other kinds of data.

And this is where things get tricky.

You could plot everything from raw data, but that will keep the server busy re-aggregating every time you look at long time ranges, slowing down the refresh rate, etc. On the other hand, using a continuous aggregate alleviates that problem, but now your graph can’t show you the finer details anymore because you only have access to reduced data. Optimally, you want to switch between raw data and one (or more) continuous aggregates based on the requested time range.

Luckily I’ve come across a few solutions to this problem, including one powerful trick that could work out of the box for a lot of your data!

Solution 1: UNIONs

If you only have one or two graphs and you don’t think you’ll be adding more to them, one fairly easy method is to slightly tweak the query used in the plotting tool.

It’s quick and reasonably easy to do but can become a bit of a tangle if you have multiple aggregate tables, different plots, or a lot of hypertables.

The general idea here is as follows: instead of getting data from a raw table or aggregate, tell TimescaleDB to select data from both, but use a clever exclusion in the WHERE clause that will switch it from one table to the other for you!

A basic example looks like this:

SELECT
    time_bucket(INTERVAL '$__interval', time) AS time,
    avg(cpu_usage) AS cpu_usage_avg
FROM cpu_usages
WHERE $__timeFilter(time)
  AND (INTERVAL '$__interval') < INTERVAL '10m'
GROUP BY 1

UNION ALL

SELECT
    time_bucket(INTERVAL '$__interval', time) AS time,
    avg(cpu_usage_avg) AS cpu_usage_avg
FROM cpu_usages_10m
WHERE $__timeFilter(time)
  AND (INTERVAL '$__interval') >= INTERVAL '10m'
GROUP BY 1


See the little trick there with the WHERE [...] (INTERVAL '$__interval') [...] clause?

This WHERE clause is static and only depends on the $__interval parameter, which is Grafana's placeholder for the downsampling interval it requests. This means the query planner knows, right from the start, which of the two queries can return any data at all. It then scans only one of the tables for you, avoiding unnecessary aggregation work while still giving you fine-grained data when you zoom in.

You can, of course, keep stacking these UNIONs together if you have more aggregate tables. Just make sure that the WHERE clause is truly exclusive, especially when adding more aggregate ranges. If you get an overlap, you might end up with rather strange-looking, mixed-together data!
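For instance, a three-tier setup with an additional hourly aggregate (a hypothetical cpu_usages_1h table here) could look like the sketch below. Note that the interval bounds meet exactly, so any requested interval matches exactly one branch:

```sql
-- Tier 1: raw data for intervals under 10 minutes
SELECT
    time_bucket(INTERVAL '$__interval', time) AS time,
    avg(cpu_usage) AS cpu_usage_avg
FROM cpu_usages
WHERE $__timeFilter(time)
  AND (INTERVAL '$__interval') < INTERVAL '10m'
GROUP BY 1

UNION ALL

-- Tier 2: 10-minute aggregate for intervals in [10m, 1h)
SELECT
    time_bucket(INTERVAL '$__interval', time) AS time,
    avg(cpu_usage_avg) AS cpu_usage_avg
FROM cpu_usages_10m
WHERE $__timeFilter(time)
  AND (INTERVAL '$__interval') >= INTERVAL '10m'
  AND (INTERVAL '$__interval') < INTERVAL '1h'
GROUP BY 1

UNION ALL

-- Tier 3: hourly aggregate for intervals of 1h and up
SELECT
    time_bucket(INTERVAL '$__interval', time) AS time,
    avg(cpu_usage_avg) AS cpu_usage_avg
FROM cpu_usages_1h
WHERE $__timeFilter(time)
  AND (INTERVAL '$__interval') >= INTERVAL '1h'
GROUP BY 1
```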

Another downside of this is the maintenance effort added to the graphs. Instead of writing one query, you now have to write two or more (one SELECT for each aggregate), although you can place the UNION into a CTE and then do more filtering and downsampling in a single main clause. Plus, changing the aggregates or adding another aggregate requires updating all the graphs.

Overall, this solution was the first and easiest I’ve tried, but it’s always a little tricky to do, and it’s easy to forget what aggregate tables you have available, etc.

Solution 2: Inline SQL functions

To encapsulate the tricky part of the query, the UNION of the various tables, it’d be nice to put it into a VIEW. Those, however, can’t take input parameters, so we can’t easily implement the exclusion WHERE clause. Except, with functions, we can.

PostgreSQL lets you define functions that can return values, rows, or even entire tables. Most functions will materialize this data before handing it off, which would not work for our use case as we still need to apply filters for time and other vital constraints, without which the query can’t be optimized—with one exception.

There's a smart trick here: PostgreSQL can inline certain pure SQL functions into the calling query, essentially letting them act as VIEWs, but with input parameters.

Once inlined, the query planner can then push further filters and WHERE clauses to those queries, like filters on the time column. And this is perfect!


We can now centrally define the same UNION SELECT structure as we used above, just inside an inlineable SQL function. All we need to do then is pass the INTERVAL as a parameter to the function, and it'll correctly apply the exclusion clause, selecting the correct table.

This means that when we add another aggregate or want to modify the aggregate selection, we have a central location to do so, and all queries using the function are automatically updated!

The example will look very similar to the code we have above, just wrapped into a function statement now:

CREATE FUNCTION cpu_avg_autoselect(data_interval INTERVAL)
RETURNS TABLE(time TIMESTAMPTZ, cpu_usage_avg DOUBLE PRECISION)
LANGUAGE SQL
STABLE
AS $BODY$
SELECT
    time_bucket(data_interval, time) AS time,
    avg(cpu_usage) AS cpu_usage_avg
FROM cpu_usages
WHERE data_interval < INTERVAL '10m'
GROUP BY 1

UNION ALL

SELECT
    time_bucket(data_interval, time) AS time,
    avg(cpu_usage_avg) AS cpu_usage_avg
FROM cpu_usages_10m
WHERE data_interval >= INTERVAL '10m'
GROUP BY 1
$BODY$;

See how we can use the data_interval parameter inside? This is vital for our table selection, and something a VIEW can’t do.
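With the function in place, a Grafana panel query shrinks down to something like this sketch. The time filter lives outside the function and, thanks to inlining, still gets pushed down to whichever table is chosen:

```sql
SELECT time, cpu_usage_avg
FROM cpu_avg_autoselect(INTERVAL '$__interval')
WHERE $__timeFilter(time)
ORDER BY 1
```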

Since the body is a single classic SQL query (our SELECT/UNION), PostgreSQL can inline this function just fine; the underlying tables should even show up in the query plan of an EXPLAIN.
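One quick way to verify that the inlining happened (a sketch; the exact plan output depends on your setup and PostgreSQL version):

```sql
-- If inlining worked, the plan shows scans on cpu_usages or
-- cpu_usages_10m directly, rather than a "Function Scan" node
-- over cpu_avg_autoselect.
EXPLAIN
SELECT *
FROM cpu_avg_autoselect(INTERVAL '1h')
WHERE time > NOW() - INTERVAL '1d';
```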

This solution can already cover a large number of use cases and solves the fundamental problem; however, there is one last step up from here to automate aggregate selection even further and learn more about just how powerful PostgreSQL can be.

Solution 3: PL/pgSQL

The method described in the previous solution already makes use of PostgreSQL’s powerful functions, but if we want it to be inlined (and properly pull in the vital WHERE filter clauses!), we can’t really do much more than with a VIEW.

There is, however, a way to unlock even more potential automation and features.

One example of this that was particularly enticing for me was fully automatic aggregate detection.

Say you have a large variety of tables, as generated by, e.g., a Telegraf-to-PostgreSQL plugin. Some of these tables might have aggregates, some might have none, and some might have aggregates with differing interval lengths.

Having a single function to which you can pass the name of the base hypertable you want to select from, the sample interval you’re hoping to get, and a few other parameters, and having it do all the work of finding the right table for you, could save you a lot of time and even be packaged as an easily reusable extension.

And PostgreSQL can actually do that, thanks to their PL/pgSQL function language!

I won’t go into the syntax here as it is a fairly broad topic, but you can read the documentation.
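For a taste of what it looks like, here is a minimal, hypothetical PL/pgSQL function that builds and runs a query dynamically—the one thing plain SQL functions can't do:

```sql
-- Hypothetical example: count the rows of any table given by name.
CREATE FUNCTION row_count(tbl TEXT)
RETURNS BIGINT
LANGUAGE plpgsql
STABLE
AS $BODY$
DECLARE
    result BIGINT;
BEGIN
    -- format() with %I quotes the identifier safely before EXECUTE
    EXECUTE format('SELECT count(*) FROM %I', tbl) INTO result;
    RETURN result;
END;
$BODY$;
```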

After a good round of tinkering, I’ve arrived at a function with a rather simple syntax that can do all of the above. An example of the CPU average fetching that I mentioned would look like the following:

SELECT *
FROM ts_autoagg('cpu', INTERVAL '10m', $aggs$
  [
    {
      "with_columns":["value"],
      "aggregate":"avg(value) AS value"
    },
    {
      "with_columns":["avg_value"],
      "aggregate":"avg(avg_value) AS value"
    },
    {
      "with_columns":["value_stats_agg"],
      "aggregate":"average(rollup(value_stats_agg)) AS value"
    }
  ]
$aggs$,
'tags',
$$
  WHERE time BETWEEN NOW()-INTERVAL'2h' AND NOW()
  AND tags@>'{"metric":"usage_idle","cpu":"cpu-total"}'
$$) AS (time TIMESTAMP, tags JSONB, value DOUBLE PRECISION);

ts_autoagg is the function I wrote, and it will return a SETOF RECORD. This is fancy-speak for “a table,” except without predetermined columns. We need to supply a fixed column layout in the AS () clause at the very end of the statement instead, which must match up with the data we’re returning, but that’s about the only tricky thing to remember.

The arguments for it are as follows:

  • The root hypertable name
  • The interval we want to time_bucket our data with
  • A JSON array that contains downsampling configs. Each object in the array represents one possible downsampling method, and one of them is chosen based on the with_columns field. This allows us to easily specify how to grab the right data from each kind of table.
  • A list of GROUP BY clauses
  • Additional query arguments such as JOINs and the WHERE clause

And that’s it! With this information, the function will find the right table to select from and deliver downsampled data. It’ll always choose a table whose downsampling interval is smaller than our data interval and assumes the raw hypertable has an interval of “0”, i.e., no downsampling.

In case our downsampling functions don’t work on the root table, however, the function will then default to the next-highest available interval to at least return some data. This makes it incredibly easy to write even more complex downsampling setups with multiple different aggregate tables, returning different types of data (e.g., min/max for zoomed-in data but a percentile for larger ranges) and quickly adding new continuous aggregates on the fly without having to rewrite existing queries.

The SQL for the functions needed for this can be found in a pull request to the TimescaleDB Extras repository.

And as their source file isn’t that long, I highly recommend giving it a short read!

Editor’s Note: Thanks to David Bailey for being an active contributor and making TimescaleDB even better! 🙏 Check out some of his previous work using Timescale and Grafana.

The details

For those interested in how this function works and who would like to dig in a bit more about its technical details, I’ll go through the rough functionality here.

I can highly recommend learning to use PostgreSQL’s functions, as they are very rich and capable and can provide a huge bonus to workflows thanks to how close they can operate to the data! And, since TimescaleDB is layered nicely on top of Postgres, we can use them with almost no caveats.

The selector function can generally be split up into three different steps, which are as follows:

  1. Find a list of all aggregate tables for the hypertable in question.
  2. Select a fitting table for your downsampling.
  3. Perform your aggregation based on that choice.

I’m happy to say that all three steps can be done with simple SQL and PL/pgSQL functions.

The first step—finding out what tables are available—required a few tricks with TimescaleDB internal VIEWs, but is doable with just two or three SELECTs. PostgreSQL’s information_schema.view_column_usage actually provides details about VIEWs and what tables they’re built on, and this also contains details for continuous aggregates.

Using a recursive query on said view_column_usage table and given a starting table—our hypertable—a single query can find all aggregates/views built on our root table. Having said that, I still need to read up on recursive queries a bit more to fully understand what’s going on—thank you to Chris Engelbert (Timescale’s developer advocate) for the help with this part!
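The shape of that recursive lookup is roughly the following sketch (the actual function in the repository handles schemas and edge cases more carefully):

```sql
-- Find all views built (directly or indirectly) on the 'cpu' hypertable
WITH RECURSIVE caggs AS (
    -- Views that reference the root table directly
    SELECT view_schema, view_name
    FROM information_schema.view_column_usage
    WHERE table_name = 'cpu'
  UNION
    -- Views built on views we already found
    SELECT vcu.view_schema, vcu.view_name
    FROM information_schema.view_column_usage vcu
    JOIN caggs ON vcu.table_schema = caggs.view_schema
              AND vcu.table_name  = caggs.view_name
)
SELECT DISTINCT view_schema, view_name FROM caggs;
```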

I’ve encapsulated this functionality into its own function, get_hypertable_caggs().

Given a table name, it will return table_name, table_schema pairs of view tables.

We now need to be able to filter these views based on their downsampling intervals, as well as what columns they have. Say we require a value_avg column; we need to take this into account.

You can easily find the intervals within the _timescaledb_catalog.continuous_agg table. The interval is saved as bucket_width, and is quite easy to fetch.

And we can grab the list of columns from information_schema.columns. I’ve wrapped these functions up as get_cagg_interval() as well as get_jsonb_table_columns(), which we can simply call on the list of tables we got from our last function.
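Sketched out, those two lookups could look roughly like this (assuming the catalog layout described above; the internal catalog may vary between TimescaleDB versions, and the real helpers live in the repository):

```sql
-- Downsampling interval of a continuous aggregate view
SELECT bucket_width
FROM _timescaledb_catalog.continuous_agg
WHERE user_view_schema = 'public'
  AND user_view_name   = 'cpu_usages_10m';

-- Column names of a table or view, collected as a JSONB array
SELECT jsonb_agg(column_name)
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name   = 'cpu_usages_10m';
```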

A worthy note: Functions that return a single variable can easily be used in SELECT clauses and will be applied to each table row individually. It’s like a lateral join but much easier and readable!

From here on, we only need to apply some basic filtering. We want to consider only the tables whose interval is smaller than our wanted downsampling interval, and we only want tables with column combinations that we can use in our aggregate options. The latter can be expressed elegantly with the JSONB @> containment operator, which checks whether one JSONB array contains all elements of another.

We then order by downsampling interval and aggregate option priority and use LIMIT 1 to grab the best-fitting option for our use case… And voilà!
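Put together, the selection boils down to something like this simplified sketch (hypothetical names: candidate_caggs stands for the table list from the previous steps, agg_options for the parsed JSON downsampling configs):

```sql
SELECT c.table_schema, c.table_name, o.aggregate
FROM candidate_caggs c
JOIN agg_options o
  -- keep only tables that have every column this option needs
  ON get_jsonb_table_columns(c.table_schema, c.table_name)
     @> o.with_columns
WHERE c.agg_interval <= wanted_interval
ORDER BY c.agg_interval DESC,  -- coarsest table still fine enough
         o.priority            -- then by aggregate option priority
LIMIT 1;
```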

All this functionality is wrapped up in the aggregate_choice function. By the way, everything up to this point is still inlineable SQL—see how far we can go with these functions?

The last part is actually executing the query. This can no longer be done with just a SQL statement, as those don’t allow dynamic table choices. But this is where PL/pgSQL comes into play.

We can programmatically first fetch a matching downsampling choice with the above functions, then construct our own SQL query dynamically based on those parameters. We can then use the EXECUTE statement to run this query.

Doing so is risky, as we’re inserting potentially user-supplied data into a raw SQL query string, so it’s encouraged to make proper use of format(), as well as EXECUTE ... USING, both of which protect against SQL injection.

However, for this function, the point is moot as we are already intentionally letting the user supply raw SQL query code for the filter and group by clauses. Since our function can be restricted to read-only table access, this should not be a problem.
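A condensed illustration of that pattern, as it might appear inside the PL/pgSQL body (variable names are hypothetical):

```sql
-- %I safely quotes identifiers, %s splices in raw SQL fragments,
-- and USING binds the interval as a proper parameter ($1).
query := format(
    'SELECT time_bucket($1, time) AS time, %s FROM %I.%I %s GROUP BY 1',
    agg_expression,          -- e.g. 'avg(value) AS value'
    chosen_schema,
    chosen_table,
    filter_clause);          -- caller-supplied WHERE/JOIN text
RETURN QUERY EXECUTE query USING data_interval;
```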

The code for this final downsampling step is in the ts_autoagg function. It’s surprisingly small, thanks to the encapsulation of functionality into the other mentioned functions, which I think is vital for function readability—especially for SQL, where chained common table expressions (CTEs) can sometimes get a bit out of hand, at least for me.

Summing It Up

Timescale’s continuous aggregates are a very powerful tool. However, combining them with the flexibility and strength of PostgreSQL’s SQL and PL/pgSQL functions can give you even more from them, making them far more seamless to integrate and use in a wide range of applications.

In this article, I’ve successfully shown that you can fully automate the selection process for aggregates, reducing the workload of your database without lowering the quality of the data or giving your database engineers or end-users a headache. ;)
