A question was recently asked on Twitter about finding the most recent set of values based on two columns. While one of the fastest ways to do this in TimescaleDB can often be to use SkipScan, facilitated by DISTINCT ON(), the current implementation only allows skipping over one indexed column, not two.
As an example, consider a simple table that tracks (at least) these columns for financial data.
CREATE TABLE ticks ( ts timestamptz NOT NULL, supply_rate float4 NULL, borrow_rate float4 NULL, manual float4 null, exchange TEXT NOT NULL, symbol TEXT NOT NULL );
With data populated, it might be tempting to create a corresponding index and then query the DISTINCT of both columns for the timestamp like this:
-- Create supporting index CREATE INDEX test_ss ON ticks (exchange, symbol, ts DESC); SELECT DISTINCT ON (exchange, symbol) * FROM ticks ORDER BY exchange, symbol, ts DESC;
Unfortunately, this DISTINCT query will chose a regular Index Scan instead of the TimescaleDB SkipScan implementation. Even though there is an index that appears to support a distinct query on multiple columns, SkipScan will only consider the first column of the index (as of TimescaleDB 2.7).
Depending on the cardinality of your data, there are a few options for querying this data using SkipScan and both columns.
First, let’s create the indexes that we want TimescaleDB to use regardless of which approach we take to query the data.
CREATE INDEX exchange_ss ON ticks (exchange, ts DESC); CREATE INDEX symbol_ss ON ticks (symbol, exchange, ts desc);
With the indexes created, let’s look at each option.
In both examples below, we’re pulling the values for
symbol out of the time-series hypertable rather than a supporting dimension table. This could be simplified if there was a small, already “distinct” table that tracks exchanges for instance, so YMMV depending on your schema design.
In this example, we can use a DISTINCT ON query to get the list of exchanges that exist in the
ticks table and then “loop” (LATERAL) over the inner DISTINCT ON query using the value of the exchange. This SQL is just a little bit longer to write, but works well as long as the proper indexes exist.
SELECT * FROM (SELECT DISTINCT ON (exchange) exchange FROM ticks ORDER BY exchange, ts DESC) a, LATERAL (SELECT DISTINCT ON (symbol) symbol, ts, supply_rate, borrow_rate, manual FROM ticks WHERE exchange = a.exchange ORDER BY symbol, ts DESC) b;
This approach does hide some SQL in a function to clean up the day-to-day querying. Notice that we don’t have to use a LATERAL query here because the function is run once per exchange and returns multiple rows, which creates an implicit
CROSS JOIN so that many rows are produced for each exchange.
-- first create the function that will return all rows for a specific -- exchange/symbol combination. CREATE OR REPLACE FUNCTION symbol_array(TEXT) RETURNS TABLE (ts timestamptz, supply_rate float4, borrow_rate float4, manual float4, symbol text) AS $$ select distinct on (symbol) ts, supply_rate, borrow_rate, manual, symbol from ticks where exchange=$1 ORDER BY symbol, ts DESC; $$ LANGUAGE SQL -- Now query the function with a DISTINCT on exchange WITH s1 AS ( SELECT DISTINCT (exchange) exchange FROM ticks ORDER BY exchange, ts DESC ) SELECT exchange, (symbol_array(exchange)).* FROM s1;
In this example, we first get a list of exchanges from the data table using SkipScan (in case there are many of them!) and then use that value to query the function we created which uses skip scan and filters the data to a specific exchange using the second index
symbol_ss using SkipScan further.
The key is that for SkipScan to be selected,
DISTINCT ON() can only contain one column and the
ORDER BY columns must batch an existing index and index order.