How to Write Better Queries for Time-Series Data Analysis With Custom SQL Functions
Why the Right Tools Matter When Analyzing Time-Series Data
SQL is the lingua franca for analytics. As data proliferates, we need to find new ways to store, explore, and analyze it. We believe SQL is the best language for data analysis. We’ve championed the benefits of SQL for several years, even when many were swapping it for custom domain-specific languages. Full SQL support was one of the key reasons we chose to build TimescaleDB on top of PostgreSQL, the most loved database among developers, rather than creating a custom query language. And we were right—SQL is making a comeback (although it never really went away) and has become the universal language for data analysis, with many NoSQL databases adding SQL interfaces to keep up.
In addition, most developers are familiar with SQL, along with most data scientists, data analysts, and other professionals who work with data. Whether you've taken classes at university, done an online course, or attended a boot camp, chances are that you probably have learned a bit of SQL along the way. So you and your fellow developers already know it, making it easier for teams to onboard new members and quickly extract value from the data. With a proprietary language, learning the language is in itself a barrier to using the data—you’ll have to ask another team to write the queries or rely on a separate data lake.
Time-series data is ubiquitous. At Timescale, our mission is to serve developers worldwide and enable them to build exceptional data-driven products that measure everything that matters: software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.
And time-series data comes at you fast, sometimes generating millions of data points per second. Because of the sheer volume and rate of information, time-series data can be complex to query and analyze, even in SQL.
TimescaleDB hyperfunctions make it easier to manipulate and analyze time-series datasets with fewer lines of SQL code. Hyperfunctions are purpose-built for the most common and difficult time-series and analytical queries developers write today in SQL. Using hyperfunctions makes you more productive when querying time-series data, which means you can spend less time creating reports, dashboards, and visualizations involving time-series, and spend more time acting on the insights that your work unearths!
There are over 70 different TimescaleDB hyperfunctions ready to use today. Here are some of the most popular ones and how they can help you:
- Time-based analysis:
time_bucket_ng()make time-based analysis simpler and easier by enabling you to analyze data over arbitrary time intervals using succinct queries.
last()allow you to get the value of one column as ordered by another (2x faster in TimescaleDB 2.7!).
- Time-weighted averages:
time_weight()and related hyperfunctions for working with time-weighted averages offer a more elegant way to get an unbiased average when working with irregularly sampled data.
- Function pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL (currently experimental).
- Percentile approximation brings percentile analysis to more workflows, enabling you to understand the distribution of your data efficiently (e.g., 10th percentile, mean, or 50th percentile, 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets. When used with continuous aggregates, you can compute percentiles over any time range of your dataset in near real-time and use them for baselining and normalizing incoming data.
- Frequency analysis:
Freq_agg()and related frequency analysis hyperfunctions more efficiently find the most common elements out of a set of vastly more varied values vs. brute force calculation.
- Histogram shows the data distribution and can offer a better understanding of the segments compared to an average (more on histograms).
- Downsampling: ASAP smoothing smooths datasets to highlight the most important features when graphed. Largest Triangle Three Buckets Downsampling or
lttb()reduces the number of elements in a dataset while retaining important features when graphed. See how to apply our downsampling hyperfunctions in Grafana.
- Memory efficient COUNT DISTINCTs: HyperLogLog is a probabilistic cardinality estimator that uses significantly less memory than the equivalent COUNT DISTINCT query. It is ideal for use in a continuous aggregate for large datasets.
We created new SQL functions for each of these time-series analysis and manipulation capabilities. This contrasts with other efforts to improve the developer experience by introducing new SQL syntax. While introducing new syntax with new keywords and constructs may have been easier from an implementation perspective, we made the deliberate decision not to do so since we believe it leads to a worse experience for the end-user.
New SQL syntax means existing drivers, libraries, and tools may no longer work. That can leave developers with more problems than solutions as their favorite tools, libraries, or drivers may not support the new syntax or require time-consuming modifications. On the other hand, new SQL functions mean that your query will run in every visualization tool, database admin tool, or data analysis tool.
We have the freedom to create custom functions, aggregates, and procedures that help developers better understand and work with their data, and ensure all their drivers and interfaces still work as expected!
We will now dive into each hyperfunction category that we mentioned and give examples of when, why, and how to use them, plus resources to continue your learning.
TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale Cloud, the easiest way to get started with TimescaleDB. Get started with a free Timescale Cloud trial—no credit card required. Or download for free with TimescaleDB self-managed.
If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, start our tutorial, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).
Solved: How to Query Arbitrary Time-Intervals With date_trunc
When using PostgreSQL, the
date_trunc function can be useful when you want to aggregate information over an interval of time.
date_trunc truncates a
TIMESTAMP or an
INTERVAL value based on a specified date part (e.g., hour, week, or month) and returns the truncated timestamp or interval. For example,
date_trunc can aggregate by one second, one hour, one day, or one week. However, you often want to see aggregates by the time intervals that matter most to your use case, which may be intervals like 30 seconds, 5 minutes, 12 hours, etc. This can get pretty complicated in SQL, just look at the query below which analyzes taxi ride activity in five-minute time intervals:
Regular PostgreSQL: Taxi rides taken every five minutes
SELECT EXTRACT(hour from pickup_datetime) as hours, trunc(EXTRACT(minute from pickup_datetime) / 5)*5 AS five_mins, COUNT(*) FROM rides WHERE pickup_datetime < '2016-01-02 00:00' GROUP BY hours, five_mins;
time_bucket() hyperfunction makes it easy to query your data in whatever time interval is most relevant to your analysis use case.
time_bucket() enables you to aggregate data by arbitrary time intervals (e.g., 10 seconds, 5 minutes, 6 hours, etc.), and gives you flexible groupings and offsets, instead of just second, minute, hour, and so on.
In addition to allowing more flexible time-series queries,
time_bucket() also allows you to write these queries in a simpler way. Just look much simpler the query from the example above is to write and understand when using the
TimescaleDB hyperfunctions: Taxi rides taken every five minutes
-- How many rides took place every 5 minutes for the first day of 2016? SELECT time_bucket('5 minute', pickup_datetime) AS five_min, count(*) FROM rides WHERE pickup_datetime < '2016-01-02 00:00' GROUP BY five_min ORDER BY five_min;
If you’d like even more flexibility when aggregating your data, you can test out the experimental hyperfunction
time_bucket_ng(), which is an updated version of the original
time_bucket_ng() enables you to bucket your data by years and months, in addition to second, minute, and hour time intervals. This allows you to easily do monthly cohort analysis or other multiple month-based reports in SQL.
SELECT timescaledb_experimental.time_bucket_ng('3 month', date '2021-08-01'); time_bucket_ng ---------------- 2021-07-01 (1 row)
time_bucket_ng() also features custom timezone support, which enables you to write queries like the one below, which illustrates using it to bucket data in the Europe/Moscow region:
-- note that timestamptz is displayed differently depending on the session parameters SET TIME ZONE 'Europe/Moscow'; SELECT timescaledb_experimental.time_bucket_ng('1 month', timestamptz '2001-02-03 12:34:56 MSK', timezone => 'Europe/Moscow'); time_bucket_ng ------------------------ 2001-02-01 00:00:00+03
Missing data or gaps is a common occurrence when capturing hundreds or thousands of time-series readings per second or minute. This can happen due to irregular sampling intervals, or you have experienced an outage of some sort.
time_bucket_gapfill() hyperfunction enables you to create additional rows of data in any gaps, ensuring that the returned rows are in chronological order and contiguous. To learn more about gappy data, read our blog Mind the Gap: Using SQL Functions for Time-Series Analysis.
Here’s an example of
time_bucket_gapfill() in action, where we find the daily average temperature for a certain device and use the
locf() function to carry the last observation forward in the case we have gaps in our data:
SELECT time_bucket_gapfill('1 day', time, now() - INTERVAL '1 week', now()) AS day, device_id, avg(temperature) AS value, locf(avg(temperature)) FROM metrics WHERE time > now () - INTERVAL '1 week' GROUP BY day, device_id ORDER BY day; day | device_id | value | locf ------------------------+-----------+-------+------ 2019-01-10 01:00:00+01 | 1 | | 2019-01-11 01:00:00+01 | 1 | 5.0 | 5.0 2019-01-12 01:00:00+01 | 1 | | 5.0 2019-01-13 01:00:00+01 | 1 | 7.0 | 7.0 2019-01-14 01:00:00+01 | 1 | | 7.0 2019-01-15 01:00:00+01 | 1 | 8.0 | 8.0 2019-01-16 01:00:00+01 | 1 | 9.0 | 9.0 (7 rows)
The last observation carried forward or
locf() function allows you to carry forward the last seen value in an aggregation group. You can only use it in an aggregation query with
Simpler Time-Weighted Averages
If you’re in a situation where you don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in the following cases:
- Industrial IoT, where teams “compress” data by only sending points when the value changes.
- Remote sensing, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations.
- Trigger-based systems, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered).
Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data.
To illustrate the value of a hyperfunction to find time-weighted averages, consider the following example of a simple table modeling freezer temperature:
CREATE TABLE freezer_temps ( freezer_id int, ts timestamptz, temperature float);
And some irregularly sampled time-series data representing the freezer temperature:
INSERT INTO freezer_temps VALUES ( 1, '2020-01-01 00:00:00+00', 4.0), ( 1, '2020-01-01 00:05:00+00', 5.5), ( 1, '2020-01-01 00:10:00+00', 3.0), ( 1, '2020-01-01 00:15:00+00', 4.0), ( 1, '2020-01-01 00:20:00+00', 3.5), ( 1, '2020-01-01 00:25:00+00', 8.0), ( 1, '2020-01-01 00:30:00+00', 9.0), ( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened! ( 1, '2020-01-01 00:31:30+00', 11.0), ( 1, '2020-01-01 00:32:00+00', 15.0), ( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed ( 1, '2020-01-01 00:33:00+00', 18.5), ( 1, '2020-01-01 00:33:30+00', 17.0), ( 1, '2020-01-01 00:34:00+00', 15.5), ( 1, '2020-01-01 00:34:30+00', 14.0), ( 1, '2020-01-01 00:35:00+00', 12.5), ( 1, '2020-01-01 00:35:30+00', 11.0), ( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized ( 1, '2020-01-01 00:40:00+00', 7.0), ( 1, '2020-01-01 00:45:00+00', 5.0);
Calculating the time-weighted average temperature of the freezer using regular SQL functions would look something like this:
Time-weighted averages using regular SQL
WITH setup AS ( SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, extract('epoch' FROM ts) as ts_e, extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, * FROM freezer_temps), nextstep AS ( SELECT CASE WHEN prev_temp is NULL THEN NULL ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, * FROM setup) SELECT freezer_id, avg(temperature), -- the regular average sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average
But, with the TimescaleDB
time_weight() hyperfunction, we reduce this potentially tedious to write and confusing to read query to a much simpler five-line query:
SELECT freezer_id, avg(temperature), average(time_weight('Linear', ts, temperature)) as time_weighted_average FROM freezer_temps GROUP BY freezer_id; freezer_id | avg | time_weighted_average ------------+-------+----------------------- 1 | 10.2 | 6.636111111111111
To learn more about using time-weighted average hyperfunctions, read the docs and see our explainer blog post: What time-weighted averages are and why you should care.
Better Data Summaries Using Percentile Approximation
Many developers choose to use averages and other summary statistics more frequently than percentiles because they are significantly “cheaper” to calculate over large time-series datasets, both in computational resources and time.
As we were designing hyperfunctions, we thought about how we could capture the benefits of percentiles (e.g., robustness to outliers, better correspondence with real-world impacts) while avoiding some of the pitfalls of calculating exact percentiles.
TimescaleDB’s percentile approximation hyperfunctions enable you to understand your data distribution efficiently (e.g., 10th percentile, mean, or 50th percentile, 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets.
With relatively large datasets, you can often accept some accuracy trade-offs to avoid running into issues of high memory footprint and network costs while enabling percentiles to be computed more efficiently in parallel and used on streaming data. (In this post, you can learn more about the design decisions and trade-offs made in TimescaleDB’s percentile approximation hyperfunctions design.)
TimescaleDB has a whole family of percentile approximation hyperfunctions. The simplest way to call them is to use the percentile_agg aggregate along with the approx_percentile accessor. For example, here’s how we might calculate the 10th, 50th (mean), and 90th percentiles of the response time of a particular API:
SELECT approx_percentile(0.1, percentile_agg(response_time)) as p10, approx_percentile(0.5, percentile_agg(response_time)) as p50, approx_percentile(0.9, percentile_agg(response_time)) as p90 FROM responses;
Hyperfunctions for percentile approximation can also be used in TimescaleDB´s continuous aggregates which make aggregate queries on very large datasets run faster. Continuous aggregates continuously and incrementally store the results of an aggregation query in the background. So, when you run the query, only the changed data needs to be computed, not the entire dataset.
That is a huge advantage compared to exact percentiles because you can now do things like baselining and alerting on longer periods without recalculating from scratch every time!
For example, here’s how you can use continuous aggregates to identify recent outliers and investigate potential problems. First, we create a one-hour aggregation from the hypertable
CREATE TABLE responses( ts timestamptz, response_time DOUBLE PRECISION); SELECT create_hypertable('responses', 'ts');
CREATE MATERIALIZED VIEW responses_1h_agg WITH (timescaledb.continuous) AS SELECT time_bucket('1 hour'::interval, ts) as bucket, percentile_agg(response_time) FROM responses GROUP BY time_bucket('1 hour'::interval, ts);
To find outliers, we can find the data in the last 30 seconds greater than the 99th percentile:
SELECT * FROM responses WHERE ts >= now()-'30s'::interval AND response_time > ( SELECT approx_percentile(0.99, percentile_agg) FROM responses_1h_agg WHERE bucket = time_bucket('1 hour'::interval, now()-'1 hour'::interval) );
To learn more about using percentile approximation hyperfunctions, read the docs, try our tutorial using real-world NFL data and see our explainer blog post on why percentile approximation is more useful than averages.
first( )and last( )
Another common problem is finding the first or last values for multiple time series. That often occurs in IoT scenarios, where you want to monitor devices in different locations, but each device sends back data at different times (as devices can go offline, experience connectivity issues, batch transmit data, or simply have different sampling rates).
last hyperfunction allows you to get the value of one column as ordered by another. For example,
last(temperature, time) returns the latest temperature value based on time within an aggregate group.
This way, you can write queries more easily which, for example, will find the last recorded temperature at multiple locations, as each location might have different rates of data being sampled and recorded:
SELECT location, last(temperature, time) FROM conditions GROUP BY location;
first hyperfunction also allows you to get the value of one column as ordered by another.
first(temperature, time) returns the earliest temperature value based on time within an aggregate group:
SELECT device_id, first(temp, time) FROM metrics GROUP BY device_id;
last() can also be used in more complex queries, such as finding the latest value within a specific time interval. In the example below, we find the last temperature recorded for each device in five minutes throughout the past day:
SELECT device_id, time_bucket('5 minutes', time) AS interval, last(temp, time) FROM metrics WHERE time > now () - INTERVAL '1 day' GROUP BY device_id, interval ORDER BY interval DESC;
In TimescaleDB 2.7, we’ve made improvements to make queries with the
last() hyperfunctions up to twice as fast and make memory usage near constant.
More Memory Efficient COUNT DISTINCT Queries
Calculating the exact number of distinct values in a large dataset with high cardinality requires lots of computational resources, which can impact the query performance and experience of your database's concurrent users.
To solve this issue, TimescaleDB provides hyperfunctions to calculate approximate COUNT DISTINCTs. Approximate count distincts do not calculate the exact cardinality of a dataset, but rather estimate the number of unique values, in order to improve compute time. We use HyperLogLog, a probabilistic cardinality estimator that uses significantly less memory than the equivalent
COUNT DISTINCT query.
Hyperloglog() is an approximation object for
COUNT DISTINCT queries. And the
distinct_count() accessor function gets the number of distinct values from a HyperLogLog object, as illustrated in the example below, which efficiently estimates the number of unique NFTs and collections in a hypothetical NFT marketplace:
SELECT distinct_count(hyperloglog(32768, asset_id)) AS nft_count, distinct_count(hyperloglog(32768, collection_id)) AS collection_count FROM nft_sales WHERE payment_symbol = 'ETH' AND time > NOW()-INTERVAL '3 months'
You can also use the
std_error() function to estimate the relative standard error of the HyperLogLog compared to running
COUNT DISTINCT directly.
To learn more about the approximate
COUNT DISTINCT hyperfunctions, read the docs.
Enhanced Query Readability and Maintenance With Function Pipelines
At Timescale, we’re huge fans of SQL. But as we’ve seen in many examples above, SQL can get quite unwieldy for certain kinds of analytical and time-series queries. Enter TimescaleDB Function Pipelines.
TimescaleDB Function Pipelines radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from functional programming and popular tools like Python’s Pandas and PromQL. In short, they improve your coding productivity, making your SQL code easier for others to comprehend and maintain.
Inspired by functional programming languages, Function Pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.
And the best part: we built Function Pipelines in a fully PostgreSQL-compliant way! We did not change any SQL syntax, meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.
To understand the power of TimescaleDB Function Pipelines, consider the following PostgreSQL query.
Regular PostgreSQL query:
SELECT device_id, sum(abs_delta) as volatility FROM ( SELECT device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta FROM measurements WHERE ts >= now() - '1 day'::interval) calc_delta GROUP BY device_id;
And a much more readable query using TimescaleDB Function Pipelines:
SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM measurements WHERE ts >= now() - '1 day'::interval GROUP BY device_id;
It is now much clearer what this query is doing, performing the following sets of tasks:
- Getting the last day’s data from the measurements table, grouped by
- Sorting the data by the time column
- Calculating the delta (or change) between values
- Taking the absolute value of the delta
- And then taking the sum of the result of the previous steps
To learn more about using TimescaleDB Function Pipelines, read the docs and see our explainer blog post: Function Pipelines: Building Functional Programming Into PostgreSQL Using Custom Operators.
Easier Frequency Analysis With Frequency Aggregates
When working with large datasets, frequency analysis is often computationally expensive, as it requires computing over the entire dataset to get an accurate result. Instead of counting all elements of a dataset, you might prefer to estimate the frequency of the most common elements of a set.
The frequency aggregate hyperfunction or
freq_agg() enables you to track the approximate frequency within a column of all values with a minimum frequency.
This example creates a frequency aggregate over a field
ZIP (representing ZIP codes) in a HomeSales table. This aggregate tracks any
ZIP value occurring in at least 5 % of rows, allowing you to keep track of common ZIP codes efficiently:
CREATE toolkit_experimental.freq_agg(0.05, ZIP) FROM HomeSales;
You can also use the frequency aggregate hyperfunction to estimate the maximum and minimum frequency of a particular value in your dataset. For example, we can find the most frequent rounded square roots of integers 1-100 using the frequency aggregate hyperfunction as follows:
SELECT value, min_freq, max_freq FROM toolkit_experimental.into_values( (SELECT toolkit_experimental.freq_agg(0.15, ceiling(sqrt(v))::int) FROM generate_series(1,100) v), 0::int ); value | min_freq | max_freq -------+----------+---------- 10 | 0.19 | 0.24 9 | 0.17 | 0.2 8 | 0.15 | 0.16 7 | 0.13 | 0.13 6 | 0.11 | 0.11 5 | 0.09 | 0.09 4 | 0.07 | 0.07
Freq_agg is implemented under the hood using the Space-Saving algorithm by Metwally, Agrawal, and El Abbadi in their paper Efficient Computation of Frequent and Top-k Elements in Data Streams.
Moreover, we often want to know the top N values of a dataset. To do this, we can use the
topn_agg hyperfunction, which estimates the top values present in a column.
Here’s an example of using
topn_agg to find the top five most common rounded square roots from 100,000 random numbers in the range (1,1000):
SELECT toolkit_experimental.topn( toolkit_experimental.topn_agg(5, ceiling(sqrt(random() * 1000))::int), 0::int) FROM generate_series(1,100000); topn ------ 31 30 29 28 27
To learn more about using TimescaleDB hyperfunctions for frequency analysis, read the docs and see this explainer in the Timescale Forum which covers Frequency Aggregate, TopN Aggregate, State Aggregate, and Gauge Aggregate.
Supercharge Your Productivity With Hyperfunctions Today
Get started today: TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale Cloud, the easiest way to get started with TimescaleDB. Get started with a free Timescale Cloud trial—no credit card required. Or download for free with TimescaleDB self-managed.
If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, start our tutorial, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).
Send us your feedback on experimental features: to reinforce our commitment to moving fast and not breaking things, we’re releasing function pipelines as an experimental feature—and we would love to hear your opinion! You can open an issue or join a discussion thread on GitHub (and if you like what you see, GitHub ⭐ are always welcome and appreciated!).