*Written by **Dylan Paulus***

You may have heard that "data is the new oil." By itself, data is unrefined and not valuable, but given processing and refinement, it becomes precious. We gain insights into our products, applications, and customers by exploring our data. PostgreSQL exposes aggregate functions that give us the tools to transform and process our data to provide meaning.

In this article, we'll take a look at how to use SQL aggregate functions, the pitfalls, and how Timescale gives us advanced tooling to aggregate time-series data.

PostgreSQL aggregate functions allow us to pull meaning from all the data we store in our database. Aggregate functions take in a list of data (a bunch of rows) to produce a single, meaningful output.

The best way to visualize aggregate functions is to work through an example. Let's look at the `avg()`, or average, function. It tells us our dataset's __arithmetic mean__.

Let's say we have a table of products in a hypothetical store:

```
-- create
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    price DECIMAL NOT NULL
);
```

```
-- insert
INSERT INTO products (name, price) VALUES ('pen', 2.50);
INSERT INTO products (name, price) VALUES ('paper', 1.25);
INSERT INTO products (name, price) VALUES ('hammer', 6.76);
INSERT INTO products (name, price) VALUES ('blanket', 12.45);
INSERT INTO products (name, price) VALUES ('chair', 59.99);
```

We can write a query using `avg()` to find the average price of all our products:

```
SELECT avg(price) FROM products;
```
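As a quick sanity check, the same figure can be reproduced by hand. The sketch below assumes the five rows inserted above; the column aliases `manual_avg` and `builtin_avg` are only illustrative names.

```sql
-- Reproduce avg() by hand: total of all prices divided by the number of rows.
-- This matches avg() here because price is declared NOT NULL.
SELECT
    sum(price) / count(*) AS manual_avg,
    avg(price)            AS builtin_avg
FROM products;
```

With the sample data, both columns should work out to 16.59 (82.95 / 5). PostgreSQL may display extra trailing zeros for `numeric` results.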

Of course, `avg()` is only one of a large list of __aggregate functions provided by PostgreSQL__. A few of the most used aggregate functions include:

- `SUM()`: adds up all the input values
- `MAX()`: finds the largest of the input values
- `MIN()`: finds the smallest of the input values
- `COUNT()`: counts the number of input rows (not to be confused with `SUM()`!)
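To make the list concrete, here is a small sketch that runs all four functions against the `products` table from the earlier example. The column aliases are illustrative, and the values in the comments assume only the five sample rows are present.

```sql
SELECT
    sum(price) AS total_price,     -- 82.95
    max(price) AS highest_price,   -- 59.99
    min(price) AS lowest_price,    -- 1.25
    count(*)   AS product_count    -- 5
FROM products;
```

Because every column in the `SELECT` list is an aggregate, the whole table collapses into a single result row.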

One of the biggest sources of frustration around aggregates is mixing aggregate functions with plain column data in the same query. Building on our previous product table, let's include a `category` column.

```
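-- Recreate the table with a category column (drops the earlier version first)
DROP TABLE IF EXISTS products;
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    price DECIMAL NOT NULL,
    category TEXT
);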
```

```
INSERT INTO products (name, price, category) VALUES ('pen', 2.50, 'office');
INSERT INTO products (name, price, category) VALUES ('paper', 1.25, 'office');
INSERT INTO products (name, price, category) VALUES ('hammer', 6.76, 'tools');
INSERT INTO products (name, price, category) VALUES ('blanket', 12.45, 'home');
INSERT INTO products (name, price, category) VALUES ('chair', 59.99, 'home');
```

Now, when finding the average price of all the products, suppose we want to include the `category` column in the result, like so:

```
SELECT avg(price), category FROM products;
```

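Run the SQL command and boom! PostgreSQL gives us an error: `column "products.category" must appear in the GROUP BY clause or be used in an aggregate function`.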

This is because the `price` column gets reduced or "smushed down" into a single value, so `category` loses meaning when we find the average price of *all* products. In this error, PostgreSQL is letting us know that we need to 1) use `category` inside an aggregate function or 2) group the average price by category. Since finding the average of a string value is impossible, our best bet is option 2. We can use __SQL's GROUP BY__ to group the results by `category`, finding the average price per category.

```
SELECT avg(price), category FROM products GROUP BY category;
```
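A grouped query can carry several aggregates at once. The following sketch extends the query above with a row count and a rounded average; the aliases and the `ORDER BY` are additions for readability, and the commented results assume only the five sample rows.

```sql
SELECT
    category,
    count(*)             AS product_count,
    round(avg(price), 2) AS avg_price
FROM products
GROUP BY category
ORDER BY category;
-- home   | 2 | 36.22
-- office | 2 |  1.88
-- tools  | 1 |  6.76
```

Every non-aggregated column in the `SELECT` list (here, only `category`) has to appear in the `GROUP BY` clause.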

Taking advantage of PostgreSQL's `GROUP BY`, we can start to see the power of aggregate functions; in this example, we have insight into the average cost of products in a given category.

You have probably run into the __WHERE clause__ when filtering queries, but there is another, less commonly used way to filter results: the `HAVING` clause. Though they appear to behave similarly, `WHERE` and __HAVING clauses__ have distinct effects on aggregate functions. Let's take a look at both.

Both `HAVING` and `WHERE` will filter the result set by some condition. If we don't want to include the average price of `home` items, we could write the query using either SQL clause:

```
-- where
SELECT avg(price), category FROM products WHERE category != 'home' GROUP BY category;
```

```
-- having
SELECT avg(price), category FROM products GROUP BY category HAVING category != 'home';
```

Though it's a slightly different syntax, the result is the same.

Now suppose that, instead of filtering by `category`, we want only the categories whose average price is over $2. Easy enough; let's modify both queries.

```
-- where
SELECT avg(price), category FROM products WHERE avg(price) > 2.0 GROUP BY category;
```

```
-- having
SELECT avg(price), category FROM products GROUP BY category HAVING avg(price) > 2.0;
```

Run these two queries separately, and you'll find a problem. The query using `WHERE` fails (PostgreSQL reports that aggregate functions are not allowed in `WHERE`), but the query using `HAVING` succeeds. What gives? The main distinction between `WHERE` and `HAVING` is that the `WHERE` filter is applied *before* aggregation takes place, whereas `HAVING` filters get applied *after* aggregation takes place. Since our example filters the result set using an aggregate function, `avg(price) > 2.0`, we can only filter after aggregation occurs, by using `HAVING`.
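The two clauses can also be combined in one query, which makes the ordering easy to see. This is a minimal sketch against the same `products` table; the thresholds are arbitrary values chosen for illustration.

```sql
SELECT category, avg(price) AS avg_price
FROM products
WHERE price > 2.0          -- row filter: applied before aggregation
GROUP BY category
HAVING avg(price) > 5.0    -- group filter: applied after aggregation
ORDER BY category;
```

With the sample data, `WHERE` first drops the paper row (1.25), so the office average becomes 2.50. `HAVING` then removes the office group, leaving only `home` and `tools`.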

The `FILTER` clause adds another way to limit the data that aggregate functions operate on. Unlike `WHERE` or `HAVING`, which filter the result for the entire query, `FILTER` only applies to the aggregate function it is attached to. This means we can use multiple aggregate functions in a single query, each with its own condition. First, let's look at an example of querying for products with a single `FILTER` clause:

```
SELECT
    avg(price) FILTER (WHERE category = 'home') AS avg_home_prices
FROM products;
```

Using `FILTER`, we can include multiple aggregate functions in a query with different filtering conditions.

```
SELECT
    avg(price) FILTER (WHERE category = 'home') AS avg_home_prices,
    sum(price) FILTER (WHERE category = 'office') AS sum_office_prices,
    count(*) FILTER (WHERE category = 'tools') AS total_tools
FROM products;
```
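For comparison, the same result can be written without `FILTER` by wrapping each input in a `CASE` expression. This is an alternative sketch rather than something the original query requires; it relies on `avg()`, `sum()`, and `count(expression)` skipping `NULL` inputs.

```sql
SELECT
    -- Rows outside the category yield NULL, which the aggregates ignore.
    avg(CASE WHEN category = 'home' THEN price END)   AS avg_home_prices,
    sum(CASE WHEN category = 'office' THEN price END) AS sum_office_prices,
    count(CASE WHEN category = 'tools' THEN 1 END)    AS total_tools
FROM products;
```

With the sample data, both versions should return 36.22, 3.75, and 1. The `FILTER` form is usually easier to read because the condition sits next to the aggregate it applies to.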

On the surface, aggregate functions look similar to standard functions, but there is a critical difference between the two. Standard functions work on one row at a time, whereas aggregate functions work across a set of rows. For example, a standard function like `CEIL()` rounds a value up to the nearest integer *per row*. An aggregate function like `SUM()` takes in a column's values from many rows and produces a single result.
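The difference shows up in the shape of the result. The sketch below assumes the `products` table from earlier; the aliases are illustrative.

```sql
-- Standard function: evaluated once per row, so one output row per product.
SELECT name, price, ceil(price) AS rounded_up
FROM products;

-- Aggregate function: all rows collapse into a single output row.
SELECT sum(price) AS total_price
FROM products;
```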

Aggregation has three main components: an internal state, a state transition function, and a final function. PostgreSQL loops through all the rows, and the `state transition function` is called on each new row to update the `internal value` (the state). Once all the rows have been looped through, the `final function` is called with the internal value to produce a final result.

Let's take, for example, the `AVG()` aggregate function with our `products` table.

- The initial state is `(0, 0)` for `total price = 0` and `count = 0`
- The `state transition function` is called for each row in the table
  - For `AVG()`, the row's price is added to the total price, and the count is incremented by one
  - `(total price + row price, count + 1)`
- Finally, the `final function` calculates the average from the `internal state`
  - `total price / count`

The same process is followed for all aggregate functions.

The separation of `state transition function` and `final function` optimizes aggregate functions by keeping state transition functions small and deferring the heavy processing until all the rows have been looped through.
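The steps above can be sketched as a user-defined aggregate with PostgreSQL's `CREATE AGGREGATE`. This is an illustration of the mechanism, not how the built-in `avg()` is implemented internally. The names `my_avg`, `my_avg_sfunc`, and `my_avg_final` are hypothetical, and the state is modeled as a two-element `numeric[]` array holding the total and the count.

```sql
-- State transition function: called once per row.
-- acc[1] is the running total, acc[2] is the running count.
-- STRICT means rows with a NULL price are skipped.
CREATE FUNCTION my_avg_sfunc(acc numeric[], price numeric)
RETURNS numeric[]
LANGUAGE sql IMMUTABLE STRICT
AS $$ SELECT ARRAY[acc[1] + price, acc[2] + 1] $$;

-- Final function: called once, after all rows have been processed.
CREATE FUNCTION my_avg_final(acc numeric[])
RETURNS numeric
LANGUAGE sql IMMUTABLE
AS $$ SELECT CASE WHEN acc[2] = 0 THEN NULL ELSE acc[1] / acc[2] END $$;

-- Tie the pieces together; INITCOND is the initial state (0, 0).
CREATE AGGREGATE my_avg(numeric) (
    SFUNC     = my_avg_sfunc,
    STYPE     = numeric[],
    FINALFUNC = my_avg_final,
    INITCOND  = '{0,0}'
);

SELECT my_avg(price) FROM products;
```

Against the sample `products` data, `my_avg(price)` should agree with `avg(price)`.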

TimescaleDB expands on aggregation functions over hypertables using __hyperfunction aggregates__. Hyperfunction aggregates allow us to analyze time-series data. Some hyperfunction aggregates are provided out of the box, but others require the __timescaledb_toolkit__ extension to be installed.
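On a self-managed database, enabling the extensions might look like the following. This assumes the extension packages are already installed on the server; `CREATE EXTENSION` only enables them in the current database.

```sql
-- Assumption: both extensions are available on the server.
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;
```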

Similarly to PostgreSQL aggregate functions, hyperfunction aggregates split the work into separate steps. The aggregate itself (such as `stats_agg`) runs the `state transition function` and returns the internal state, an accessor (such as `average`) plays the role of the `final function`, and __rollup functions__ combine existing aggregates into a larger one. By combining different aggregations, accessors, and rollup functions, we can create powerful insights into our data. Each of these operations is separated to provide a more functional programming approach to data aggregation. For example, to use a hyperfunction aggregate, we first create the aggregation (with an aggregation function like `stats_agg`), and then we pass the aggregation result to an accessor (like `average`).

To get a practical look at how this works, let's look at an example using `stats_agg`, `average`, and `time_bucket` to find an average.

First, create a `conditions` table with data:

```
CREATE TABLE conditions (
    time TIMESTAMPTZ NOT NULL,
    location TEXT NOT NULL,
    device TEXT NOT NULL,
    temperature DOUBLE PRECISION NULL,
    humidity DOUBLE PRECISION NULL
);
```

```
SELECT create_hypertable('conditions', by_range('time'));
```

```
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW(), 'home', 'omega', 72.3);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW() + interval '1 day', 'home', 'omega', 55);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW() + interval '2 day', 'home', 'omega', 65);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW() + interval '2 day', 'home', 'alpha', 82);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW(), 'home', 'alpha', 83);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW(), 'home', 'alpha', 83);
INSERT INTO conditions (time, location, device, temperature) VALUES (NOW() + interval '25 minutes', 'home', 'alpha', 90);
```

We want to find the average temperature by day. First, we need to group the time-series data into buckets of one-day intervals. Then, by using `stats_agg()` to create an aggregate, we can pass that into `average()` to calculate the average temperature per day.

```
SELECT
    time_bucket('1 day'::interval, time),
    average(stats_agg(temperature))
FROM conditions
GROUP BY 1;
```

By combining different aggregates, accessors, and __rollup functions__ provided by Timescale, we can gain even more power over our time-series data.
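As a sketch of what that combination can look like, the query below builds one `stats_agg` summary per day and then merges the daily summaries with `rollup()` before reading values out with accessors. It assumes the `timescaledb_toolkit` extension is enabled and the `conditions` table from above; the CTE name `daily` and the column aliases are illustrative.

```sql
-- Step 1: build one stats_agg summary per one-day bucket.
WITH daily AS (
    SELECT
        time_bucket('1 day'::interval, time) AS bucket,
        stats_agg(temperature) AS stats
    FROM conditions
    GROUP BY 1
)
-- Step 2: merge the daily summaries with rollup(), then apply accessors.
SELECT
    average(rollup(stats))  AS overall_avg,
    num_vals(rollup(stats)) AS reading_count
FROM daily;
```

Because the daily summaries hold intermediate state rather than finished averages, rolling them up gives the average over all readings, not an average of the daily averages.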

PostgreSQL's aggregate functions are powerful tools for extracting meaningful insights from datasets, aiding in data-driven decision-making. But why stop there? Timescale takes these capabilities to the next level with hyperfunctions that easily give insights into your time-series data.

You can look at __Timescale's documentation on hyperfunction aggregates__ to learn even more about them. In a separate blog post, we also explain how __PostgreSQL aggregation influenced the design of our hyperfunctions__. Additionally, to learn more about aggregate functions and possible options, look at the __official PostgreSQL documentation__.

If you want to try aggregate functions and experiment with the extremely powerful hyperfunction aggregates, __create a free Timescale account__ to get started today!