Implementing ASOF Joins in PostgreSQL and Timescale
What Is an
ASOF (or "as of") join is a type of join operation used when analyzing two sets of time-series data. It essentially matches each record from one table with the nearest—but not necessarily equal—value from another table based on a chosen column. Oracle supports this out of the box using a non-standard SQL syntax, but unfortunately, PostgreSQL does not provide a built-in
The chosen column needs to have some concept of range for the
ASOF operation to work. You may think of it as being the "closest value," but not exceeding the comparison. It works for string (alphabetical), integer (ordinal), float (decimal), and any other data type that has an idea of ORDER. Because timestamps are near and dear to our hearts at Timescale, we will demonstrate with time and date columns.
Performing this operation in PostgreSQL takes a bit of effort. This article aims to delve deeper into
ASOF-style joins and how to implement similar functionality in PostgreSQL by subselecting data or other join types.
ASOF joins are a powerful tool when dealing with time-series data. In simple terms, an ASOF join will, for each row in the left table, find a corresponding single row in the right table where the key value is less than or equal to the key in the left table.
This is a common operation when dealing with financial data, sensor readings, or other types of time-series data where readings might not align perfectly by timestamp.
For a simple example, consider the real-world question, "What was the temperature yesterday at this time?" It is very unlikely that a temperature reading was taken yesterday at exactly the millisecond that the question is asked today. What we really want is "What was the temperature taken yesterday up to today's time stamp?"
This simple example becomes a lot more complex when we start comparing temperatures day over day, week over week, etc.
ASOF Joins in Timescale
Even though PostgreSQL does not directly support
ASOF joins, you can achieve similar functionality using a combination of SQL operations. Here's a simplified step-by-step guide:
Step 1: Prepare your data
Ensure your data is in the correct format for the
ASOF join. You'll need a timestamp or other monotonically increasing column to use as a key for the join.
Suppose you have two tables,
asks, each containing a timestamp column, and you want to join them by instrument and the nearest timestamp.
CREATE TABLE bids ( instrument text, ts TIMESTAMPTZ, value NUMERIC ); -- CREATE INDEX bids_instrument_ts_idx ON bids (instrument, ts DESC); CREATE INDEX bids_ts_idx ON bids (ts); -- CREATE TABLE asks ( instrument text, ts TIMESTAMPTZ, value NUMERIC ); CREATE INDEX asks_instrument_ts_idx ON asks (instrument, ts DESC); CREATE INDEX asks_ts_idx ON asks (ts); --
Normally you'd make both these tables into hypertables with the
create_hypertable function (because you're a super educated Timescale user), but in this case, we aren't going to, as we won't be inserting much data (and we also have some Timescale magic to show off 🪄).
Step 2: Insert some test data
Next, we'll create data for four instruments,
AAA, BBB, NCD, and
INSERT INTO bids (instrument, ts, value) SELECT -- random 1 of 4 instruments (array['AAA', 'BBB', 'NZD', 'USD'])[floor(random() * 4 + 1)], -- timestamp of last month plus some seconds now() - interval '1 month' + g.s, -- random value random()* 100 +1 FROM ( -- 2.5M seconds in a month SELECT ((random() * 2592000 + 1)::text || ' s')::interval s FROM generate_series(1,3000000)) g; INSERT INTO asks (instrument, ts, value) SELECT -- random 1 of 4 instruments (array['AAA', 'BBB', 'NZD', 'USD'])[floor(random() * 4 + 1)], -- timestamp of last month plus some seconds now() - interval '1 month' + g.s, -- random value random()* 100 +1 FROM ( -- 2.5M seconds in a month SELECT ((random() * 2592000 + 1)::text || ' s')::interval s FROM generate_series(1,2000000)) g;
Step 3: Query the data using a sub-select
To mimic the behavior of an
ASOF join, use a
SUBSELECT join operation along with conditions to match rows based on your criteria. This will run the sub-query once per row returned from the target table. We need to use the
DISTINCT clause to limit the number of rows returned to one.
This will work in vanilla Postgres, but when we are using Timescale (even though we aren't using hypertables yet), we get the benefits of a Skip Scan, which will supercharge the query (for more information on this check our docs or blog post about how Skip Scan can give you an 8,000x speed-up).
SELECT bids.ts timebid, bids.value bid, (SELECT DISTINCT ON (asks.instrument) value ask FROM asks WHERE asks.instrument = bids.instrument AND asks.ts <= bids.ts ORDER BY instrument, ts DESC) ask FROM bids WHERE bids.ts > now() - interval '1 week'
QUERY PLAN ------------------------------------------------------------------------- Index Scan using bids_ts_idx on public.bids (cost=0.43..188132.58 rows=62180 width=56) (actual time=0.067..1700.957 rows=57303 loops=1) Output: bids.instrument, bids.ts, bids.value, (SubPlan 1) Index Cond: (bids.ts > (now() - '7 days'::interval)) SubPlan 1 -> Unique (cost=0.43..2.71 rows=5 width=24) (actual time=0.027..0.029 rows=1 loops=57303) Output: asks.value, asks.instrument, asks.ts -> Custom Scan (SkipScan) on public.asks (cost=0.43..2.71 rows=5 width=24) (actual time=0.027..0.027 rows=1 loops=57303) Output: asks.value, asks.instrument, asks.ts -> Index Scan using asks_instrument_ts_idx on public.asks (cost=0.43..15996.56 rows=143152 width=24) (actual time=0.027..0.027 rows=1 loops=57303) Output: asks.value, asks.instrument, asks.ts Index Cond: ((asks.instrument = bids.instrument) AND (asks.ts <= bids.ts)) Planning Time: 1.231 ms Execution Time: 1703.821 ms
While PostgreSQL does not have an
ASOF keyword, it does offer the flexibility and functionality to perform similar operations. When you're using Timescale, things only get better with the enhancements like Skip Scan.