How to Write Better Queries for Time-Series Data Analysis With Custom SQL Functions

Why the Right Tools Matter When Analyzing Time-Series Data

SQL is the lingua franca for analytics. As data proliferates, we need to find new ways to store, explore, and analyze it. We believe SQL is the best language for data analysis. We’ve championed the benefits of SQL for several years, even when many were swapping it for custom domain-specific languages. Full SQL support was one of the key reasons we chose to build TimescaleDB on top of PostgreSQL, the most loved database among developers, rather than creating a custom query language. And we were right—SQL is making a comeback (although it never really went away) and has become the universal language for data analysis, with many NoSQL databases adding SQL interfaces to keep up.

In addition, most developers are familiar with SQL, along with most data scientists, data analysts, and other professionals who work with data. Whether you've taken classes at university, done an online course, or attended a boot camp, chances are you've learned a bit of SQL along the way. So you and your fellow developers already know it, making it easier for teams to onboard new members and quickly extract value from the data. With a proprietary language, learning the language is in itself a barrier to using the data—you’ll have to ask another team to write the queries or rely on a separate data lake.

Time-series data is ubiquitous. At Timescale, our mission is to serve developers worldwide and enable them to build exceptional data-driven products that measure everything that matters: software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.

And time-series data comes at you fast, sometimes generating millions of data points per second. Because of the sheer volume and rate of information, time-series data can be complex to query and analyze, even in SQL.

TimescaleDB hyperfunctions make it easier to manipulate and analyze time-series datasets with fewer lines of SQL code. Hyperfunctions are purpose-built for the most common and difficult time-series and analytical queries developers write today in SQL. Using hyperfunctions makes you more productive when querying time-series data, which means you can spend less time creating reports, dashboards, and visualizations involving time series, and spend more time acting on the insights that your work unearths!

Handling Time-Series Data: Meet Hyperfunctions

There are over 70 different TimescaleDB hyperfunctions ready to use today. Here are some of the most popular ones and how they can help you handle your time-series data:

  • Time-based analysis: time_bucket() makes time-based analysis simpler and easier by enabling you to analyze data over arbitrary time intervals using succinct queries.
  • first() and last() allow you to get the value of one column as ordered by another (2x faster in TimescaleDB 2.7!).
  • Time-weighted averages: time_weight() and related hyperfunctions for working with time-weighted averages offer a more elegant way to get an unbiased average when working with irregularly sampled data.
  • Function pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL (currently experimental).
  • Percentile approximation brings percentile analysis to more workflows, enabling you to understand the distribution of your data efficiently (e.g., the 10th percentile, the median or 50th percentile, the 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets. When used with continuous aggregates, you can compute percentiles over any time range of your dataset in near real-time and use them for baselining and normalizing incoming data.
  • Frequency analysis: freq_agg() and related frequency-analysis hyperfunctions find the most common elements in a set of vastly more varied values far more efficiently than a brute-force calculation.
  • histogram() shows the distribution of your data and can offer a better understanding of its segments than an average alone (see the sketch after this list; more on histograms).
  • Downsampling: ASAP smoothing smooths datasets to highlight the most important features when graphed. Largest Triangle Three Buckets downsampling, or lttb(), reduces the number of elements in a dataset while retaining important features when graphed. See how to apply our downsampling hyperfunctions in Grafana.
  • Memory efficient COUNT DISTINCTs: HyperLogLog is a probabilistic cardinality estimator that uses significantly less memory than the equivalent COUNT DISTINCT query. It is ideal for use in a continuous aggregate for large datasets.
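
For instance, the built-in histogram() aggregate takes a value, the low and high ends of a range, and a bucket count, and returns an array of bucket counts per group. Here is a minimal sketch against a hypothetical metrics table with device_id and temperature columns:

-- Bucket temperatures between 0 and 40 into 8 buckets per device.
-- The returned array includes two extra buckets for values below 0 and above 40.
SELECT device_id,
  histogram(temperature, 0, 40, 8)
FROM metrics
GROUP BY device_id;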

We created new SQL functions for each of these time-series analysis and manipulation capabilities. This contrasts with other efforts to improve the developer experience by introducing new SQL syntax. While introducing new syntax with new keywords and constructs may have been easier from an implementation perspective, we made the deliberate decision not to do so since we believe it leads to a worse experience for the end-user.

New SQL syntax means existing drivers, libraries, and tools may no longer work. That can leave developers with more problems than solutions as their favorite tools, libraries, or drivers may not support the new syntax or require time-consuming modifications. On the other hand, new SQL functions mean that your query will run in every visualization tool, database admin tool, or data analysis tool.

We have the freedom to create custom functions, aggregates, and procedures that help developers better understand and work with their data, and ensure all their drivers and interfaces still work as expected!

We will now dive into each hyperfunction category that we mentioned and give examples of when, why, and how to use them, plus resources to continue your learning.

Solved: How to Query Arbitrary Time Intervals With time_bucket()

When using PostgreSQL, the date_trunc function can be useful when you want to aggregate information over an interval of time. date_trunc truncates a TIMESTAMP or an INTERVAL value based on a specified date part (e.g., hour, week, or month) and returns the truncated timestamp or interval. For example, date_trunc can aggregate by one second, one hour, one day, or one week. However, you often want to see aggregates by the time intervals that matter most to your use case, which may be intervals like 30 seconds, 5 minutes, or 12 hours. This can get pretty complicated in SQL. Just look at the query below, which analyzes taxi ride activity in five-minute intervals:
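
For instance, bucketing the same taxi data by calendar day with date_trunc is straightforward:

SELECT date_trunc('day', pickup_datetime) AS day,
  COUNT(*)
FROM rides
GROUP BY day
ORDER BY day;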

Regular PostgreSQL: Taxi rides taken every five minutes

SELECT
  EXTRACT(hour from pickup_datetime) as hours,
  trunc(EXTRACT(minute from pickup_datetime) / 5)*5 AS five_mins,
  COUNT(*)
FROM rides
WHERE pickup_datetime < '2016-01-02 00:00'
GROUP BY hours, five_mins;

The time_bucket() hyperfunction makes it easy to query your data in whatever time interval is most relevant to your analysis use case. time_bucket() enables you to aggregate data by arbitrary time intervals (e.g., 10 seconds, 5 minutes, 6 hours, etc.), and gives you flexible groupings and offsets, instead of just second, minute, hour, and so on.

In addition to allowing more flexible time-series queries, time_bucket() also allows you to write these queries in a simpler way. Just look at how much simpler the query from the example above is to write and understand when using the time_bucket() hyperfunction:

TimescaleDB hyperfunctions: Taxi rides taken every five minutes

-- How many rides took place every 5 minutes for the first day of 2016?
SELECT time_bucket('5 minute', pickup_datetime) AS five_min, count(*)
FROM rides
WHERE pickup_datetime < '2016-01-02 00:00'
GROUP BY five_min
ORDER BY five_min;
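
time_bucket() also accepts an optional offset, which shifts the bucket boundaries. For example, here is a sketch that starts each hourly bucket at half past the hour instead of on the hour:

-- Hourly ride counts, with buckets shifted to start at :30
SELECT time_bucket('1 hour', pickup_datetime, INTERVAL '30 minutes') AS shifted_hour,
  count(*)
FROM rides
WHERE pickup_datetime < '2016-01-02 00:00'
GROUP BY shifted_hour
ORDER BY shifted_hour;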

If you’d like even more flexibility when aggregating your data, newer versions of the time_bucket() hyperfunction also support bucketing by months and years, in addition to second, minute, hour, day, and week intervals. This allows you to easily do monthly cohort analysis or other multiple-month-based reports in SQL.

SELECT time_bucket('3 month', date '2021-08-01');
 time_bucket
----------------
 2021-07-01
(1 row)

time_bucket() also features custom time-zone support, which enables you to write queries like the one below, which buckets data in the Europe/Moscow time zone:

-- note that timestamptz is displayed differently depending on the session parameters
SET TIME ZONE 'Europe/Moscow';

SELECT time_bucket('1 month', timestamptz '2001-02-03 12:34:56 MSK', timezone => 'Europe/Moscow');
     time_bucket
------------------------
 2001-02-01 00:00:00+03

Missing data, or gaps, are a common occurrence when capturing hundreds or thousands of time-series readings per second or minute. Gaps can appear because of irregular sampling intervals or because you experienced an outage of some sort.

The time_bucket_gapfill() hyperfunction enables you to create additional rows of data in any gaps, ensuring that the returned rows are in chronological order and contiguous. To learn more about gappy data, read our blog Mind the Gap: Using SQL Functions for Time-Series Analysis.

Here’s an example of time_bucket_gapfill() in action, where we find the daily average temperature for a certain device and use the locf() function to carry the last observation forward in case there are gaps in our data:

SELECT
  time_bucket_gapfill('1 day', time, now() - INTERVAL '1 week', now()) AS day,
  device_id,
  avg(temperature) AS value,
  locf(avg(temperature))
FROM metrics
WHERE time > now() - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;

           day          | device_id | value | locf
------------------------+-----------+-------+------
 2019-01-10 01:00:00+01 |         1 |       |
 2019-01-11 01:00:00+01 |         1 |   5.0 |  5.0
 2019-01-12 01:00:00+01 |         1 |       |  5.0
 2019-01-13 01:00:00+01 |         1 |   7.0 |  7.0
 2019-01-14 01:00:00+01 |         1 |       |  7.0
 2019-01-15 01:00:00+01 |         1 |   8.0 |  8.0
 2019-01-16 01:00:00+01 |         1 |   9.0 |  9.0
(7 rows)

The last-observation-carried-forward function, locf(), allows you to carry forward the last seen value in an aggregation group. It can only be used in an aggregation query with time_bucket_gapfill().
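
If carrying the last value forward doesn't suit your data, the related interpolate() function fills gaps by linearly interpolating between the points on either side instead. A minimal sketch, reusing the query above:

SELECT
  time_bucket_gapfill('1 day', time, now() - INTERVAL '1 week', now()) AS day,
  device_id,
  avg(temperature) AS value,
  interpolate(avg(temperature))
FROM metrics
WHERE time > now() - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;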

To learn more about using the time_bucket family of hyperfunctions, read the docs, and get started with our tutorial, which uses time_bucket() to analyze a real-world IoT dataset.

Simpler Time-Weighted Averages

If you don't have regularly sampled data, writing a query that returns a representative average over a period of time can be complex and time-consuming. Irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in the following cases:

  • Industrial IoT, where teams “compress” data by only sending points when the value changes.
  • Remote sensing, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations.
  • Trigger-based systems, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered).

Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data.

To illustrate the value of a hyperfunction to find time-weighted averages, consider the following example of a simple table modeling freezer temperature:

CREATE TABLE freezer_temps (
	freezer_id int,
	ts timestamptz,
	temperature float);

And some irregularly sampled time-series data representing the freezer temperature:

INSERT INTO freezer_temps VALUES 
( 1, '2020-01-01 00:00:00+00', 4.0), 
( 1, '2020-01-01 00:05:00+00', 5.5), 
( 1, '2020-01-01 00:10:00+00', 3.0), 
( 1, '2020-01-01 00:15:00+00', 4.0), 
( 1, '2020-01-01 00:20:00+00', 3.5), 
( 1, '2020-01-01 00:25:00+00', 8.0), 
( 1, '2020-01-01 00:30:00+00', 9.0), 
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0), 
( 1, '2020-01-01 00:32:00+00', 15.0), 
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5), 
( 1, '2020-01-01 00:33:30+00', 17.0), 
( 1, '2020-01-01 00:34:00+00', 15.5), 
( 1, '2020-01-01 00:34:30+00', 14.0), 
( 1, '2020-01-01 00:35:00+00', 12.5), 
( 1, '2020-01-01 00:35:30+00', 11.0), 
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);

Calculating the time-weighted average temperature of the freezer using regular SQL functions would look something like this:

Time-weighted averages using regular SQL

WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id,
    avg(temperature), -- the regular average
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average 
FROM nextstep
GROUP BY freezer_id;

But with the TimescaleDB time_weight() hyperfunction, we reduce this tedious-to-write and confusing-to-read query to a much simpler five-line query:

SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;

freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
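
Because time_weight() is a regular aggregate under the hood, you can also combine it with time_bucket() to get a time-weighted average per interval. A sketch against the same table:

-- Time-weighted average temperature per 10-minute bucket
SELECT freezer_id,
	time_bucket('10 minutes', ts) AS bucket,
	average(time_weight('Linear', ts, temperature)) AS time_weighted_average
FROM freezer_temps
GROUP BY freezer_id, bucket
ORDER BY bucket;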

To learn more about using time-weighted average hyperfunctions, read the docs and see our explainer blog post: What time-weighted averages are and why you should care.

Better Data Summaries Using Percentile Approximation

Many developers choose to use averages and other summary statistics more frequently than percentiles because they are significantly “cheaper” to calculate over large time-series datasets, both in computational resources and time.

As we were designing hyperfunctions, we thought about how we could capture the benefits of percentiles (e.g., robustness to outliers, better correspondence with real-world impacts) while avoiding some of the pitfalls of calculating exact percentiles.

TimescaleDB’s percentile approximation hyperfunctions enable you to understand your data distribution efficiently (e.g., the 10th percentile, the median or 50th percentile, the 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets.

With relatively large datasets, you can often accept some accuracy trade-offs to avoid running into issues of high memory footprint and network costs while enabling percentiles to be computed more efficiently in parallel and used on streaming data. (In this post, you can learn more about the design decisions and trade-offs made in TimescaleDB’s percentile approximation hyperfunctions design.)

TimescaleDB has a whole family of percentile approximation hyperfunctions. The simplest way to call them is to use the percentile_agg aggregate along with the approx_percentile accessor. For example, here’s how we might calculate the 10th, 50th (median), and 90th percentiles of the response time of a particular API:

SELECT 
    approx_percentile(0.1, percentile_agg(response_time)) as p10, 
    approx_percentile(0.5, percentile_agg(response_time)) as p50, 
    approx_percentile(0.9, percentile_agg(response_time)) as p90 
FROM responses;
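
percentile_agg() is a convenient default built on the UddSketch algorithm. If you want to control the size/accuracy trade-off yourself, the toolkit also exposes the underlying uddsketch aggregate directly; here is a sketch with illustrative parameters:

-- uddsketch(size, max_error, value): larger sketches give tighter error bounds
SELECT approx_percentile(0.9, uddsketch(200, 0.001, response_time)) AS p90
FROM responses;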

Hyperfunctions for percentile approximation can also be used in TimescaleDB's continuous aggregates, which make aggregate queries on very large datasets run faster. Continuous aggregates continuously and incrementally store the results of an aggregation query in the background. So, when you run the query, only the changed data needs to be computed, not the entire dataset.

That is a huge advantage compared to exact percentiles because you can now do things like baselining and alerting on longer periods without recalculating from scratch every time!

For example, here’s how you can use continuous aggregates to identify recent outliers and investigate potential problems. First, we create a one-hour aggregation from the hypertable responses:

CREATE TABLE responses(
	ts timestamptz, 
	response_time DOUBLE PRECISION);
SELECT create_hypertable('responses', 'ts');
CREATE MATERIALIZED VIEW responses_1h_agg
WITH (timescaledb.continuous)
AS SELECT 
    time_bucket('1 hour'::interval, ts) as bucket,
    percentile_agg(response_time)
FROM responses
GROUP BY time_bucket('1 hour'::interval, ts);

To find outliers, we can look for data points from the last 30 seconds that exceed the 99th percentile of the previous hour:

SELECT * FROM responses 
WHERE ts >= now()-'30s'::interval
AND response_time > (
	SELECT approx_percentile(0.99, percentile_agg)
	FROM responses_1h_agg
	WHERE bucket = time_bucket('1 hour'::interval, now()-'1 hour'::interval)
);
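
And because the stored sketches can be combined with rollup(), you can baseline over longer windows straight from the hourly aggregates, without rescanning the raw data. For example, here is a sketch that computes the 99th percentile across the last day:

SELECT approx_percentile(0.99, rollup(percentile_agg)) AS p99_last_day
FROM responses_1h_agg
WHERE bucket > now() - '1 day'::interval;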

To learn more about using percentile approximation hyperfunctions, read the docs, try our tutorial using real-world NFL data, and see our explainer blog post on why percentile approximation is more useful than averages.

first() and last()

Another common problem is finding the first or last values for multiple time series. That often occurs in IoT scenarios, where you want to monitor devices in different locations, but each device sends back data at different times (as devices can go offline, experience connectivity issues, batch transmit data, or simply have different sampling rates).

The last() hyperfunction allows you to get the value of one column as ordered by another. For example, last(temperature, time) returns the latest temperature value based on time within an aggregate group.

This makes it easy to write queries that, for example, find the last recorded temperature at multiple locations, even though each location might sample and record data at a different rate:

SELECT location, last(temperature, time)
  FROM conditions
  GROUP BY location;

Similarly, the first() hyperfunction allows you to get the value of one column as ordered by another. first(temperature, time) returns the earliest temperature value based on time within an aggregate group:

SELECT device_id, first(temp, time)
FROM metrics
GROUP BY device_id;

first() and last() can also be used in more complex queries, such as finding the latest value within a specific time interval. In the example below, we find the last temperature recorded for each device in each five-minute interval over the past day:

SELECT device_id, time_bucket('5 minutes', time) AS interval,
  last(temp, time)
FROM metrics
WHERE time > now() - INTERVAL '1 day'
GROUP BY device_id, interval
ORDER BY interval DESC;

In TimescaleDB 2.7, we made improvements that make queries with the first() and last() hyperfunctions up to twice as fast, with near-constant memory usage.

Note: The last() and first() functions do not use indexes but perform a sequential scan through their groups. They are primarily intended for ordered selection within a GROUP BY aggregate, not as an alternative to an ORDER BY time DESC LIMIT 1 clause for finding the latest value (which does use indexes).
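
For instance, if you only need the single most recent row for one series, a plain ORDER BY query can use the index on the time column directly. A minimal sketch against the conditions table above ('office' is an illustrative value):

SELECT * FROM conditions
WHERE location = 'office'
ORDER BY time DESC
LIMIT 1;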

To learn more, see the docs for first() and last().

More Memory Efficient COUNT DISTINCT Queries

Calculating the exact number of distinct values in a large dataset with high cardinality requires lots of computational resources, which can impact the query performance and experience of your database's concurrent users.

To solve this issue, TimescaleDB provides hyperfunctions to calculate approximate COUNT DISTINCTs. Approximate count distincts do not calculate the exact cardinality of a dataset but rather estimate the number of unique values in order to reduce compute time. We use HyperLogLog, a probabilistic cardinality estimator that uses significantly less memory than the equivalent COUNT DISTINCT query.

hyperloglog() is an approximation aggregate for COUNT DISTINCT queries, and the distinct_count() accessor function gets the number of distinct values from a HyperLogLog object, as illustrated in the example below, which efficiently estimates the number of unique NFTs and collections in a hypothetical NFT marketplace:

SELECT
  distinct_count(hyperloglog(32768, asset_id)) AS nft_count,
  distinct_count(hyperloglog(32768, collection_id)) AS collection_count
FROM nft_sales
WHERE payment_symbol = 'ETH' AND time > NOW() - INTERVAL '3 months';

You can also use the stderror() function to estimate the relative standard error of the HyperLogLog compared to running COUNT DISTINCT directly.
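
Since HyperLogLog sketches can also be combined with rollup(), they fit naturally into continuous aggregates. A sketch, assuming nft_sales is a hypertable with a time column:

CREATE MATERIALIZED VIEW nft_sales_1d
WITH (timescaledb.continuous)
AS SELECT
    time_bucket('1 day'::interval, time) AS bucket,
    hyperloglog(32768, asset_id) AS assets_hll
FROM nft_sales
GROUP BY time_bucket('1 day'::interval, time);

-- Estimate distinct assets over the last month by combining the daily sketches
SELECT distinct_count(rollup(assets_hll)) AS nft_count
FROM nft_sales_1d
WHERE bucket > now() - INTERVAL '1 month';
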
To learn more about the approximate COUNT DISTINCT hyperfunctions, read the docs.

Enhanced Query Readability and Maintenance With Function Pipelines

Note: In the spirit of moving fast and not breaking things, the hyperfunctions in this section are released as experimental—please play around with them but don’t use them in production.

At Timescale, we’re huge fans of SQL. But as we’ve seen in many examples above, SQL can get quite unwieldy for certain kinds of analytical and time-series queries. Enter TimescaleDB Function Pipelines.

TimescaleDB Function Pipelines radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from functional programming and popular tools like Python’s Pandas and PromQL. In short, they improve your coding productivity, making your SQL code easier for others to comprehend and maintain.

Inspired by functional programming languages, Function Pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.

And the best part: we built Function Pipelines in a fully PostgreSQL-compliant way! We did not change any SQL syntax, meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.

To understand the power of TimescaleDB Function Pipelines, consider the following PostgreSQL query, which computes a per-device "volatility" (the sum of absolute changes in a value) over the last day.

Regular PostgreSQL query:

SELECT device_id, 
	sum(abs_delta) as volatility
FROM (
	SELECT device_id, 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts))
        	as abs_delta 
	FROM measurements
	WHERE ts >= now() - '1 day'::interval) calc_delta
GROUP BY device_id;
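
For comparison, here is the same calculation expressed as a pipeline. This is a sketch based on the experimental toolkit; the timevector() constructor and the pipeline elements live in an experimental schema, so the exact qualified names may vary with your toolkit version.

TimescaleDB Function Pipelines (experimental):

SELECT device_id,
	timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;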

Supercharge Your Productivity With Hyperfunctions Today

Get started today: TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale, the easiest way to get started with TimescaleDB. Get started with a free Timescale trial—no credit card required. Or download for free with TimescaleDB self-managed.

If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, start our tutorial, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).

Learn more: If you’d like to learn more about TimescaleDB hyperfunctions and how to use them for your use case, read our How-To Guide and the hyperfunctions documentation.

Can’t find the function you need? Open an issue on our GitHub project or contact us on Slack or via the Timescale Community Forum. We love to work with our users to simplify SQL!
