Select the Most Recent Record (of Many Items) With PostgreSQL

Select the Most Recent Record (of Many Items) With PostgreSQL

Time-series data is ubiquitous in almost every application today. One of the most frequent queries applications make on time-series data is to find the most recent value for a given device or item.

In this blog post, we'll explore five methods for accessing the most recent value in PostgreSQL. Each option has its advantages and disadvantages, which we'll discuss as we go.

Note: Throughout this post, references to a "device" or "truck" are simply placeholders to whatever your application is storing time-series data for, whether it be an air quality sensor, airplane, car, website visits, or something else. As you read, focus on the concept of each option, rather than the specific data we're using as an example.

The problem

Knowing how to query the most recent timestamp and data for a device in large, time-series datasets, is often a challenge for many application developers. We study the data, determine the appropriate schema, and create the indexes that should make the queries return quickly.

When the queries aren't as fast as we expect, it's easy to be confused because indexes in PostgreSQL are supposed to help your queries return quickly - correct?

In most cases, the answer to that is emphatically "true". With the appropriate index, PostgreSQL is normally very efficient at retrieving data for your query. There are always nuances that we don't have time to get into in this post (don't create too many indexes, make sure statistics are kept up-to-date, etc.), but generally speaking, the right index will dramatically improve the query performance of a SQL database, PostgreSQL included.

Quick aside:

Before we dive into how to efficiently find specific records in a large time-series database using indexes, I want to make sure we're talking about the same thing. For the duration of this post, all references to indexes specifically mean a B-tree index. These are the most common index supported by all major OLTP databases and they are very good at locating specific rows of data across tables large and small. PostgreSQL actually supports many different index types that can help for various types of queries and data (including timestamp-centric data), but from here on out, we're only talking about B-tree indexes.

The impact of indexes

In our TimescaleDB Slack community channel and in other developer forums such as StackOverflow (example), developers often wonder why a query for the latest value is slow in PostgreSQL even when it seems like the correct index exists to make the query "fast"?

The answer to that lies in how the PostgreSQL query planner works. It doesn't always use the index exactly how you might expect as we'll discuss below. In order to demonstrate how PostgreSQL might use an index on a large time-series table, let's set the stage with a set of fictitious data.

For these example queries, let's pretend that our application is tracking a trucking fleet, with sensors that report data a few times a minute as long as the truck has a cell connection. Sometimes the truck loses signal which causes data to be sent a few hours or days later. Although the app would certainly be more involved and have a more complex schema for tracking both time-series and business-related data, let's focus on two of the tables.

Truck


This table tracks every truck that is part of the fleet. Even for a very large company, this table will typically contain only a few tens of thousands of rows.

truck_id make model weight_class date_acquired active_status
1 Ford Single Sleeper S 2018-03-14 true
2 Tesla Double Sleeper XL 2019-02-18 false

For the queries below, we'll pretend that this table has ~10,000 trucks, most of which are currently active and recording data a few times a minute.

Truck Reading


The reading hypertable stores all data that is delivered from every truck over time. Data typically comes in a few times a minute in time order, although older data can arrive when trucks lose their connection to cell service or transmitters break down. For this example, we'll show a wide-format table schema, and only a few columns of data to keep things simple. Many IoT applications store many types of data points for each set of readings.

ts truck_id milage fuel latitude longitude
2021-11-30 16:39:46 1 49.8 29 40.626 83.139
2021-11-30 16:39:46 2 33.0 371 40.056 78.978
2021-11-30 16:39:46 3 54.5 403 42.732 83.756

When you create a TimescaleDB hypertable, an index on the timestamp column will be created automatically unless you specifically tell the create_hypertable() function not to. For the truck_reading table, the default index should look similar to:

CREATE INDEX ix_ts ON truck_reading (ts DESC);

This index (or at least a composite index that uses the time column first) is necessary for even the most basic queries where time is involved and is strongly recommended for hypertable chunk management. Queries involving time alone like MIN(ts) or MAX(ts) can easily be satisfied from this index.

However, if we wanted to know the minimum or maximum readings for a specific truck, PostgreSQL would have no path to quickly find that information. Consider the following query that searches for the most recent readings of a specific truck:

SELECT * FROM truck_reading WHERE truck_id = 1234 ORDER BY ts DESC LIMIT 1;

If the truck_reading table only had the default timestamp index ( ix_ts above), PostgreSQL has no efficient method to get the most recent row of data for this specific truck. Instead, it has to start reading the index from the beginning (the most recent timestamp is first based on the index order) and check each row to see if it contains 1234 as the truck_id.

If this truck had reported recently, PostgreSQL would only have to read a few thousand rows at most and the query would still be "fast". If the truck hadn't recorded data in a few hours or days, PostgreSQL might have to read hundreds of thousands, or millions of rows of data, before it finds a row where truck_id = 1234.

To demonstrate this, we created a sample dataset with ~20 million rows of data (1 week for 10,000 trucks) and then deleted the most recent 12 hours for truck_id = 1234.

In the EXPLAIN output below, we can see that PostgreSQL had to scan the entire index and FILTER out more than 1.53 million rows that did not match the `truck_id` we were searching for. Even more alarming is the amount of data PostgreSQL had to process to correctly retrieve the one row of data we were asking for - ~184MB of data! (23168 buffers x 8kb per buffer)

QUERY PLAN                                                                  
------------------------------------------------------------------
Limit  (cost=0.44..289.59 rows=1 width=52) (actual time=189.343..189.344 rows=1 loops=1)                                                       
  Buffers: shared hit=23168                                                 
  ->  Index Scan using ix_ts on truck_reading  (cost=0.44..627742.58 rows=2171 width=52) (actual time=189.341..189.341 rows=1 loops=1)|
        Filter: (truck_id = 1234)                                           
        Rows Removed by Filter: 1532532                                     
        Buffers: shared hit=23168                                           
Planning:                                                                   
  Buffers: shared hit=5                                                     
Planning Time: 0.116 ms                                                     
Execution Time: 189.364 ms 

If your application has to do that much work for each query, it will quickly become bottlenecked on the simplest of queries as the data grows.

Therefore, it's essential that we have the correct index(es) for the typical query patterns of our application.

In this example (and in many real-life applications), we should at least create one other index that includes both truck_id and ts. This will allow queries about a specific truck based on time to be searched much more efficiently. An example index would look like this:

CREATE INDEX ix_truck_id_ts ON truck_reading (truck_id, ts DESC);

With this index created, PostgreSQL can find the most recent record for a specific truck very quickly, whether it reported data a few seconds or a few weeks ago.

With the same dataset as above, the same query which returns a datapoint for truck_id = 1234 from 12 hours ago reads only 40kb of data! That's ~4600x less data that had to be read because we created the appropriate index, not to mention the sub-millisecond execution time! That's bananas!

QUERY PLAN                                                                                                                                               
-----------------------------------------------------
Limit  (cost=0.56..1.68 rows=1 width=52) (actual time=0.015..0.015 rows=1 loops=1)                                                                
  Buffers: shared hit=5                                                    
  ->  Index Scan using ix_truck_id_ts on truck_reading  (cost=0.56..2425.55 rows=2171 width=52) (actual time=0.014..0.014 rows=1 loops=1)
        Index Cond: (truck_id = 1234)                                       
        Buffers: shared hit=5                                               
Planning:                                                                   
  Buffers: shared hit=5                                                     
Planning Time: 0.117 ms                                                     
Execution Time: 0.028 ms
                                                            

To be clear, both queries did use an index to search for the row. The difference is in how the indexes were used to find the data we wanted.

The first query had to FILTER the tuple because only the timestamp was part of the index. Filtering takes place after the tuple is read from disk, which means a lot more work takes place just trying to find the correct data.

In contrast, the second query used both parts of the index ( truck_id and ts) as part of the Index Condition. This means that only the rows that match the constraint are read from disk. In this case, that's a very small number and the query is much faster!

Unfortunately, even with both of these targeted indexes, there are a few common time-series SQL queries that won't perform as well as most developers expect them to.

Let's talk about why that is.

Open-ended queries:

Open-ended queries look for unique data points (first, last, most recent) without specifying a specific time-range or device constraint ( truck_id in our example). These types of queries leave the planner with few options, so it assumes that it will have to scan through the entire index at planning time. That might not be true, but PostgreSQL can't really know before it executes the query and starts looking for the data.

This is especially difficult when tables are partitioned because the actual indexes are stored independently with each table partition. Therefore, there is no global index for the entire table that identifies if a specific truck_id (in our case) exists in a partition. Once again, when the PostgreSQL planner doesn't have enough information during the planning phase, it assumes that each partition will need to be queried, typically causing increased planning time.

Consider a query like the following, which asks for the earliest reading for a specific truck_id:

SELECT * FROM truck_reading WHERE truck_id=1234 ORDER BY ts LIMIT 1;

With the two indexes we have in place ( (ts DESC) and (truck_id, ts DESC)), it feels like this should be a fast query. But because the hypertable is partitioned on time, the planner initially assumes that it will have to scan each chunk. If you have a lot of partitions, the planning time will take longer.

If the truck_reading table is actively receiving new data, the execution of the query will still be "fast" because the answer will probably be found in the first chunk and returned quickly. But if truck_id=1234 has never reported any data or has been offline for weeks, PostgreSQL will have to both plan and then scan the index of every chunk. The query will use the composite index on each partition to quickly determine there are no records for the truck, but it still has to take the time to plan and execute the query.

Instead, we want to avoid doing unnecessary work whenever possible and avoid the potential for this query anti-pattern.

High-cardinality queries:

Many queries can also be negatively impacted by increasing cardinality, becoming slower as data volumes grow and more individual items are tracked. Options 1-4 below are good examples of queries that perform well on small to medium-size datasets, but often become slower as volumes and cardinality increase.

These queries attempt to "step" through the time-series table by truck_id, taking advantage of the indexes on the hypertable. However, as more items need to be queried, the iteration often becomes slower because the index is too big to efficiently fit in memory, causing PostgreSQL to frequently swap data to and from disk.

Understanding that these two types of queries may not perform as well under every circumstance, let's examine five different methods for getting the most recent record for each item in your time-series table. In most circumstances, at least one of these options will work well for your data.

Development != Production

One quick word of warning as we jump into the SQL examples below.

It's always good to remember that your development database is unlikely to have the same volume, cardinality, and transaction throughput as your production database. Any one of the example queries we show below might perform really well on a smaller, less active database, only to perform more poorly than expected in production.

It's always best to test in an environment that is as similar to production as possible. How to do that is beyond the scope of this post, but a few options could be:

  • Use one-click database forking with your Timescale Cloud instance to easily make a copy of production for testing and learning. Using data as close to production as possible is usually preferred!
  • Backup and restore your production database to an approved location and anonymize the data, keeping similar cardinality and row statistics. Always ANALYZE the table after any data changes.
  • Consider reusing your schema and generating lots of high-volume, high-cardinality sample data with generate_series() (possibly using some of the ideas from our series about generating more realistic sample data inside of PostgreSQL).

Whatever method you choose, always remember that a database with 1 million rows of time-series data for 100 items will act much differently from a database with 10 billion rows of time-series data for 10,000 items reporting every few seconds.

Now that we've discussed how indexes help us find the data and reviewed some of the query patterns that can be slower than usual, it's time to write some SQL and talk about when it might be appropriate to use each option.

Option 1: Naive GROUP BY

SQL is a powerful language. Unfortunately, every database that allows queries to be written in SQL often has slightly different functions for doing similar work, or simply doesn't support SQL standards that would otherwise allow for efficient "last point" queries like we've been discussing.

However, in nearly every database where SQL is a supported query language, you can run this query to get the most recent time that a truck recorded data. In most cases, this will not perform well on large datasets because the GROUP BY clause prevents the indexes from being used.

SELECT max(time) FROM truck_reading GROUP BY truck_id;

Because the indexes won't be used in PostgreSQL, this approach is not recommended for high-volume/high-cardinality datasets. But, it will get the result you expect, even if it's not efficient.

If you have a query like this, consider how one of the other options listed below might better fit your query pattern.

Option 2: LATERAL JOIN

One of the easiest pieces of advice to give for any PostgreSQL database developer is to learn how to use LATERAL JOINs. In some other database engines (like SQL Server) these are called APPLY commands, but they do essentially the same thing - run the inner query for every row produced by the outer query. Because it is a JOIN, the inner query can utilize values from the outer query. (While this is similar to a correlated subquery, it's not the same thing.)

LATERAL JOINs are a great option when you, as the developer or administrator, know approximately how many rows the outer query will return. For a few hundred or a few thousand rows, this pattern is likely to return your "recent" record very quickly as long as the correct index is in place.

SELECT * FROM trucks t 
INNER JOIN LATERAL (
	SELECT * FROM truck_reading 
	WHERE truck_id = t.truck_id
	ORDER BY ts DESC 
	LIMIT 1
) l ON TRUE
ORDER BY t.truck_id DESC;

The convenient thing about a LATERAL JOIN query is that additional filtering can be applied to the outer query to identify specific items to retrieve data for. In most cases, the relational business data ( trucks) will be a smaller table with faster query times. Paging can also be applied to the smaller table more efficiently (ie. OFFSET 500 LIMIT 100) which further reduces the total work that the inner query needs to perform.

Unfortunately, one downside of a LATERAL JOIN query is that it can be susceptible to the high cardinality issue we discussed above in at least two ways.

First, if the outer query returns many more items than the inner table has data for, this query will loop over the inner table doing more work than necessary. For example, if the truck table had 10,000 entries for trucks but only 1,000 of them had ever reported readings, the query would loop over the inner query 10x more than it needed to.

Second, even if the cardinality of the inner and outer query generally match, if that cardinality is high or the table on the inner query is very large, a LATERAL JOIN query will slow down over time as memory or I/O become a limiting factor. At some point, you may need to consider Option 5 below as a final solution.

Option 3: TimescaleDB SkipScan

Disclaimer: this method only works when the TimescaleDB extension is installed. If you aren’t using it yet, you can find more information in our documentation page.

LATERAL JOINs are a great tool to have on hand when working with iterative queries. However, as we just discussed, they're not always the best choice when iterating the items of the outer query would cause the inner query to be executed often, looking for data that doesn't exist.

This is when it can be advantageous to use the reading table itself to get the distinct items and related data. In particular, this is helpful when we want to query trucks that have reported data within a period of time, for example, the last 24 hours. While we could add a filter to the inner query above ( WHERE ts > now() - INTERVAL '24 hours'), we'd still have to iterate over every truck_id, some of which might not have reported data in the last 24 hours.

Because we already created the ix_truck_id_ts index above that is ordered by truck_id and ts DESC, a common approach that many PostgreSQL developers try is to use a DISTINCT ON query with PostgreSQL.

SELECT DISTINCT ON (truck_id) * FROM truck_reading WHERE ts > now() - INTERVAL '24 hours' ORDER BY truck_id, ts DESC;

If you try this without TimescaleDB installed, it won't perform well - even though we have an index that appears to have the data ordered correctly and easy to "jump" through! This is because, as of PostgreSQL 14, there is no feature within the query execution phase that can "walk" the index to find each unique instance of a particular key. Instead, PostgreSQL essentially reads all of the data, groups it by the ON columns, and then filters out all but the first row (based on order).

However, with the TimescaleDB extension installed (version 2.3 or greater), the DISTINCT ON query will work much more efficiently as long as the correct index exists and is ordered the same as the query. This is because the TimescaleDB extension adds a new query node called "SkipScan" which will start scanning the index with the next key value as soon as another one is found, in order. One of the best parts of (SkipScan) is that it works on any PostgreSQL table with a B-tree index. It doesn't have to be a TimescaleDB hypertable!

There are a few nuances to how the index is used, all of which is outlined in the blog post linked above.

Option 4: Loose Index Scan

If you don't (or can't) install the TimescaleDB extension, there is still a way to query the truck_reading table to efficiently return the timestamp of the most recent reading for each truck_id.

On the PostgreSQL Wiki there is a page dedicated to the Loose Index Scan. It demonstrates a way to use recursive CTE queries to essentially do what the TimescaleDB (SkipScan) node does. It's not nearly as straightforward to write and is more difficult to return multiple rows (it's not the same as a DISTINCT query), but it does provide a way to more efficiently use the index to retrieve one row for each item.

The biggest drawback with this approach is that it's much more difficult to return multiple columns of data with the recursive CTE (and in most cases it's simply impossible to return multiple rows). So while some developers refer to this as a Skip Scan query, it doesn't easily allow you to retrieve all of the row data for a high-volume table like the (SkipScan) query node that TimescaleDB provides.

/*
 * Loose index scan via https://wiki.postgresql.org/wiki/Loose_indexscan
 */
WITH RECURSIVE t AS (
   SELECT min(ts) AS time FROM truck_reading
   UNION ALL
   SELECT (SELECT min(ts) FROM truck_reading WHERE ts > t.ts)
   FROM t WHERE t.ts IS NOT NULL
   )
SELECT ts FROM t WHERE ts IS NOT NULL
UNION ALL
SELECT null WHERE EXISTS(SELECT 1 FROM truck_reading WHERE ts IS NULL);

Option 5: Logging table and trigger

Sometimes, particularly with large, high-cardinality datasets, the above options aren't efficient enough for day-to-day operations. Querying for the last reading of all items, or the devices that haven't reported a value in the last 24 hours, will not meet your expectations as data volume and cardinality grows.

In this case, a better option might be to maintain a table that stores the last readings for each device as it's inserted into the raw time-series table so that your application can query a much smaller dataset for the most recent values. To track and update the logging table, we'll create a database trigger on the raw data (hyper)table.

"Wait a minute! Did you just say we're going to create a database trigger? Doesn't everyone say you should never use them?"

It's true. Triggers often get a bad rap in the SQL world, and honestly, that can often be justified. Used properly and with the correct implementation, database triggers can be tremendously useful and have minimal impact on SELECT performance. Insert and Update performance will be impacted because each transaction has to do more work. The performance hit may or may not impact your application, so testing is essential.

The SQL below provides a minimal example of how you could implement this kind of logging. There are a lot of considerations on how to best implement this option for your specific application. Thoroughly test any new processes you ad to the data processing in your database.

In short, the example script below:

  • creates a table to store the most recent data. If you only want to store the most recent timestamp of each truck's readings, this could easily just insert values into a new field on the truck table
  • ALTER's the FILLFACTOR of the table to 90% because it will be UPDATE heavy
  • creates a trigger function that will insert a row if it doesn't exist for a truck or update the values if a row for that truck already has an entry in the table ( ON CONFLICT)
  • enables the trigger on the data hypertable

The key to this approach is to only track what is necessary, reducing the amount of work PostgreSQL has to do as part of the overall transaction that is ingesting raw data.  If your application updates values for 100,000 devices every second (and you're tracking 50 columns of data), a different trigger approach might be necessary. If this is the kind of data volume you see regularly, we assume that you have an experienced PostgreSQL DBA on your team to help manage and maintain your application database - and help you decide if the logging table approach will work with the available server resources.

/*
 * The logging table alternative. The PRIMARY KEY will create an
*  index on the truck_id column to make querying for specific trucks more efficient
 */
CREATE TABLE truck_log ( 
	truck_id int PRIMARY KEY REFERENCES trucks (truck_id),
	last_time timestamp,
	milage int,
	fuel int,
	latitude float8,
	longitude float8
);

/*
* Because the table will mostly be UPDATE heavy, a slightly reduced
* FILLFACTOR can alleviate maintenance contention and reduce
* page bloat on the table.
*/
ALTER TABLE truck_log SET (fillfactor=90);

/*
 * This is the trigger function which will be executed for each row
*  of an INSERT or UPDATE. Again, YMMV, so test and adjust appropriately
 */
CREATE OR REPLACE FUNCTION create_truck_trigger_fn()
  RETURNS TRIGGER LANGUAGE PLPGSQL AS
$BODY$
BEGIN
  INSERT INTO truck_log VALUES (NEW.truck_id, NEW.time, NEW.milage, NEW.fuel, NEW.latitude, NEW.longitude) ON CONFLICT (truck_id) DO UPDATE SET    last_time=NEW.time,
   milage=NEW.milage,
   fuel=NEW.fuel,
   latitude=NEW.latitude,
   longitude=NEW.longitude;
  RETURN NEW;
END
$BODY$;

/*
*  With the trigger function created, actually assign it to the truck_reading
*  table so that it will execute for each row
*/ 
CREATE TRIGGER create_truck_trigger
  BEFORE INSERT OR UPDATE ON truck_reading
  FOR EACH ROW EXECUTE PROCEDURE create_truck_trigger_fn();

With these pieces in place, the new table will start receiving new rows of data and updating the last values as data is ingested. Querying this table will be much more efficient than searching through hundreds of millions of rows.

Review the options

Requires matching index Impacted by higher cardinality Insert performance may be impacted
Option 1: GROUP BY X
Option 2: LATERAL JOIN X X
Option 3: TimescaleDB SkipScan X X X (if index needs to be added)
Option 4: Recursive CTE X X X (if index needs to be added)
Option 5: Logging table X

Conclusion

Whatever approach you take, hopefully one of these options will help you take the next step to improve the performance of your application.

If you are keen to explore TimescaleDB or want to discuss how other developers have figured out how to quickly find the most recent value for devices in large time-series databases, join the Timescale community! Engage in conversation and ask questions in our lively Slack community or be a pioneer of our new community forum, a home for long-form discussions that will last. You can also read our documentation, or tweet your projects and questions to us @TimescaleDB. The Timescale team and users are ready to help and learn!


If you haven't tried TimescaleDB yet, the easiest way to get started is using a fully-managed database on Timescale Cloud. Sign up here - it’s 100% free for 30-days, no credit card required.

🌟 Try using SkipScan with your TimescaleDB data on Timescale Cloud

Sign up for free

The open-source relational database for time-series and analytics.
Try Timescale for free
This post was written by
17 min read
PostgreSQL
Contributors

Related posts