Problem with left join of two tables with different sizes

Hello,
I am quite new to SQL and especially TimescaleDB, but I am already enjoying it and can see its power.
Anyway, I have a problem with one query.

I have two tables which I want to join. Each table has a timestamp and a machine ID as its primary key.
One table (position) has data at a very high frequency (every millisecond) and the other (events) is event-based.

If I left join the event-based table onto the high-frequency table, I get incorrect results.
Query:

select time as ts, x as hf_data, changes as event_data from position
left join events on position.machine_id = events.machine_id and position.time = events.time
where position.machine_id ='xyxy' and position.time between '...' and '...'
	and events.machine_id ='xyxy' and events.time between '...' and '...'
order by ts

I expect an output as follows (every millisecond):

ts                           hf_data   event_data
2022-01-25 14:06:34.505+00   302.0     1265.0
2022-01-25 14:06:34.506+00   404.0     Null
2022-01-25 14:06:34.507+00   500.0     Null
2022-01-25 14:06:34.508+00   350.0     Null
…                            …         …
2022-01-25 14:06:34.545+00   299.0     1264.0

But instead I only get output for the timestamps where both tables have values:

ts                           hf_data   event_data
2022-01-25 14:06:34.505+00   302.0     1265.0
2022-01-25 14:06:34.545+00   299.0     1264.0
2022-01-25 14:06:34.665+00   541.0     1263.0
2022-01-25 14:06:34.697+00   786.0     1264.0
2022-01-25 14:06:34.729+00   771.0     1263.0

So the output is as if I had done an inner join, even though I wrote a left join.
Additionally, I tried the query with a lateral join, which works but is obviously very slow.
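For reference, the lateral variant I tried was roughly along these lines (reconstructed, so the details may differ slightly):

select p.time as ts, p.x as hf_data, e.changes as event_data
from position p
left join lateral (
    select changes
    from events
    where events.machine_id = p.machine_id
      and events.time = p.time
    limit 1
) e on true
where p.machine_id = 'xyxy' and p.time between '...' and '...'
order by ts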

Any help is very much appreciated. Thanks
Marco

:wave: @Vaalsermarco -

My first guess is that it’s the second set of predicates in your WHERE clause. Because you’re joining to events on both machine_id and time, you should only have to filter on the first (outer) table. When you filter on columns from the left-joined events table in the WHERE clause, the rows where events contributed only NULLs fail those predicates and are discarded, which effectively turns your LEFT JOIN into an inner join.

select time as ts, x as hf_data, changes as event_data from position
left join events on position.machine_id = events.machine_id and position.time = events.time
where position.machine_id ='xyxy' and position.time between '...' and '...'
order by ts

Does that yield the desired result?
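If it does, but you still want the planner to be able to use the (machine_id, time) index on events, one variation worth trying (a sketch, not something I’ve tested against your data) is to move those predicates into the ON clause instead of the WHERE clause. Predicates in the ON clause of a LEFT JOIN only limit which events rows may match; position rows without a match are still kept with NULLs, so the outer-join semantics are preserved:

select position.time as ts, position.x as hf_data, events.changes as event_data from position
left join events on position.machine_id = events.machine_id and position.time = events.time
    and events.machine_id = 'xyxy' and events.time between '...' and '...'
where position.machine_id = 'xyxy' and position.time between '...' and '...'
order by ts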

Hello @ryanbooz
thanks for your answer. Your guess was right, and with that change I get the correct result.
The drawback is that it takes more time, because the index on the events table (which is on machine_id and time) is no longer used. I thought it would be smart to access both tables via their indexes.
I looked at the query plan in detail, and for my initial query only a plain (inner) join is executed rather than a left join, which exactly explains my results.
Explain for original query:

Merge Join  (cost=0.55..80666.97 rows=2697607 width=16)
  Merge Cond: (position."time" = event_data."time")
  ->  Custom Scan (ChunkAppend) on position  (cost=0.21..4154.85 rows=19735000 width=23)
        Order: position."time"
        ->  Custom Scan (DecompressChunk) on _hyper_24_200_chunk  (cost=0.21..4154.85 rows=19735000 width=23)
              Filter: (("time" >= '2022-01-07 08:56:17.881141+00'::timestamp with time zone) AND ("time" <= '2022-01-07 09:05:18.566141+00'::timestamp with time zone))
              ->  Index Scan Backward using compress_hyper_25_226_chunk__compressed_hypertable_25_machine_t on compress_hyper_25_226_chunk  (cost=0.42..4154.85 rows=19735 width=99)
                    Index Cond: (machine_id = 'xyxy'::text)
                    Filter: ((_ts_meta_max_1 >= '2022-01-07 08:56:17.881141+00'::timestamp with time zone) AND (_ts_meta_min_1 <= '2022-01-07 09:05:18.566141+00'::timestamp with time zone))
  ->  Custom Scan (ChunkAppend) on event_data  (cost=0.34..23.55 rows=70000 width=23)
        Order: event_data."time"
        ->  Custom Scan (DecompressChunk) on _hyper_2_104_chunk  (cost=0.34..23.55 rows=70000 width=23)
              Filter: (("time" >= '2022-01-07 08:56:17.881141+00'::timestamp with time zone) AND ("time" <= '2022-01-07 09:05:18.566141+00'::timestamp with time zone))
              ->  Index Scan Backward using compress_hyper_18_133_chunk__compressed_hypertable_18_machine_t on compress_hyper_18_133_chunk  (cost=0.27..23.55 rows=70 width=99)
                    Index Cond: (machine_id = 'xyxy'::text)
                    Filter: ((_ts_meta_max_1 >= '2022-01-07 08:56:17.881141+00'::timestamp with time zone) AND (_ts_meta_min_1 <= '2022-01-07 09:05:18.566141+00'::timestamp with time zone))

Explain for your proposal (working fine but slower):

Merge Left Join  (cost=302395.15..2017681.15 rows=94608935 width=16)
  Merge Cond: (position."time" = _hyper_2_101_chunk."time")
  Join Filter: (position.machine_id = _hyper_2_101_chunk.machine_id)
  ->  Custom Scan (ChunkAppend) on position  (cost=0.21..4154.85 rows=19735000 width=23)
        Order: position."time"
        ->  Custom Scan (DecompressChunk) on _hyper_24_200_chunk  (cost=0.21..4154.85 rows=19735000 width=23)
              Filter: (("time" >= '2022-01-07 08:56:17.881141+00'::timestamp with time zone) AND ("time" <= '2022-01-07 09:05:18.566141+00'::timestamp with time zone))
              ->  Index Scan Backward using compress_hyper_25_226_chunk__compressed_hypertable_25_machine_t on compress_hyper_25_226_chunk  (cost=0.42..4154.85 rows=19735 width=99)
                    Index Cond: (machine_id = 'xyxy'::text)
                    Filter: ((_ts_meta_max_1 >= '2022-01-07 08:56:17.881141+00'::timestamp with time zone) AND (_ts_meta_min_1 <= '2022-01-07 09:05:18.566141+00'::timestamp with time zone))
  ->  Materialize  (cost=302394.94..314669.94 rows=2455000 width=23)
        ->  Sort  (cost=302394.94..308532.44 rows=2455000 width=23)
              Sort Key: _hyper_2_101_chunk."time"
              ->  Append  (cost=0.03..12340.69 rows=2455000 width=23)
                    ->  Custom Scan (DecompressChunk) on _hyper_2_101_chunk  (cost=0.03..5.43 rows=194000 width=23)
                          ->  Seq Scan on compress_hyper_18_124_chunk  (cost=0.00..5.43 rows=194 width=99)
                                Filter: (machine_id = 'xyxy'::text)
                    ->  Custom Scan (DecompressChunk) on _hyper_2_102_chunk  (cost=0.03..5.80 rows=224000 width=23)
                          ->  Seq Scan on compress_hyper_18_127_chunk  (cost=0.00..5.80 rows=224 width=99)
                                Filter: (machine_id = 'xyxy'::text)
                    ->  Custom Scan (DecompressChunk) on _hyper_2_103_chunk  (cost=0.03..3.21 rows=97000 width=23)
                          ->  Seq Scan on compress_hyper_18_130_chunk  (cost=0.00..3.21 rows=97 width=99)
                                Filter: (machine_id = 'xyxy'::text)
                    ->  Custom Scan (DecompressChunk) on _hyper_2_104_chunk  (cost=0.03..13.40 rows=512000 width=23)
                          ->  Seq Scan on compress_hyper_18_133_chunk  (cost=0.00..13.40 rows=512 width=99)
                                Filter: (machine_id = 'xyxy'::text)
                    ->  Custom Scan (DecompressChunk) on _hyper_2_105_chunk  (cost=0.03..17.41 rows=673000 width=23)
                          ->  Seq Scan on compress_hyper_18_138_chunk  (cost=0.00..17.41 rows=673 width=99)
                                Filter: (machine_id = 'xyxy'::text)
                    ->  Custom Scan (DecompressChunk) on _hyper_2_106_chunk  (cost=0.03..11.03 rows=402000 width=23)
                          ->  Seq Scan on compress_hyper_18_140_chunk  (cost=0.00..11.03 rows=402 width=99)
                                Filter: (machine_id = 'xyxy'::text)
                    ->  Custom Scan (DecompressChunk) on _hyper_2_107_chunk  (cost=0.03..9.41 rows=353000 width=23)
                          ->  Seq Scan on compress_hyper_18_142_chunk  (cost=0.00..9.41 rows=353 width=99)
                                Filter: (machine_id = 'xyxy'::text)

Do you have an explanation for that? Is there a way I can make the query faster?
Thank you very much!

There are a few things going on here that are slowing you down. It appears that both of these tables are set up as hypertables and have compression enabled, so making sure they are set up efficiently for your query patterns is a good place to start.

Could you please share your table schemas and compression settings? It might be easiest to share the SQL that you used to create the tables and enable compression, showing your compress_segmentby and compress_orderby columns.
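If it’s easier, you can also pull the current compression configuration from the informational view (assuming you’re on TimescaleDB 2.x; the table names in the filter are taken from your query and may need adjusting):

-- Shows the segmentby / orderby configuration for each compressed hypertable
SELECT * FROM timescaledb_information.compression_settings
WHERE hypertable_name IN ('positions', 'events');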

Hello @ryanbooz,
here is the SQL for the HF (positions) table:

-- Drop table

-- DROP TABLE xxx.positions;

CREATE TABLE xxx.positions (
	"time" timestamptz NOT NULL,
	machine_id text NOT NULL,
	x float4 NOT NULL,
	y float4 NOT NULL,
	z float4 NOT NULL,
	CONSTRAINT positions_machine_id_fkey FOREIGN KEY (machine_id) REFERENCES xxx.machine_instance(id)
);
CREATE INDEX idx_positions ON xxx.positions USING btree (machine_id, "time" DESC);

-- Table Triggers

create trigger ts_insert_blocker before
insert
    on
    xxx.positions for each row execute function _timescaledb_internal.insert_blocker();

alter table xxx.positions set(timescaledb.compress, timescaledb.compress_segmentby = 'machine_id');

and the SQL for the event data:

-- Drop table

-- DROP TABLE xxx.events;

CREATE TABLE xxx.events (
	"time" timestamptz NOT NULL,
	machine_id text NOT NULL,
	changes float4 NULL,

	CONSTRAINT events_machine_id_fkey FOREIGN KEY (machine_id) REFERENCES xxx.machine_instance(id)
);
CREATE UNIQUE INDEX idx_events ON xxx.events USING btree (machine_id, "time");

-- Table Triggers

create trigger ts_insert_blocker before
insert
    on
    xxx.events for each row execute function _timescaledb_internal.insert_blocker();

alter table xxx.events set(timescaledb.compress, timescaledb.compress_segmentby = 'machine_id');

Do you need anything else?
By the way, the tables in my DB have different names and more columns. I changed the names and removed some data to make it more understandable and anonymous. I hope that is ok.
Thanks,
Marco

@Vaalsermarco ,

My apologies, but I didn’t realize I was only seeing estimates in your EXPLAIN. I just took a quick look and I should have caught that.

So, let me ask this instead. If you query just the events table for this same time range and run EXPLAIN ANALYZE on it, can you post the results? That might give me an idea. Something like:

explain analyze
select time as ts, changes as event_data from events
where machine_id = 'xyxy'
and time >= '2022-01-07 08:56:17.881141+00'::timestamptz
and time <= '2022-01-07 09:05:18.566141+00'::timestamptz;

(that might need some adjustment, I’m trying to copy it from two different posts)

Sorry @ryanbooz for my delayed reply.
This is the result for your request:

Custom Scan (DecompressChunk) on _hyper_2_104_chunk  (cost=0.23..15.96 rows=70000 width=12) (actual time=0.213..0.858 rows=2466 loops=1)
  Filter: (("time" >= '2022-01-07 08:56:17.881141+00'::timestamp with time zone) AND ("time" <= '2022-01-07 09:05:18.566141+00'::timestamp with time zone))
  Rows Removed by Filter: 1534
  ->  Seq Scan on compress_hyper_18_133_chunk  (cost=0.00..15.96 rows=70 width=84) (actual time=0.083..0.097 rows=4 loops=1)
        Filter: ((_ts_meta_max_1 >= '2022-01-07 08:56:17.881141+00'::timestamp with time zone) AND (_ts_meta_min_1 <= '2022-01-07 09:05:18.566141+00'::timestamp with time zone) AND (machine_id = 'xyxy'::text))
        Rows Removed by Filter: 508
Planning Time: 0.474 ms
Execution Time: 0.964 ms

Hope that helps.
I am not quite sure if I fully understood your last comment.
Thank you very much for your support
Marco

@ryanbooz hey, sorry to ask again, but do you have an update on my issue? Thanks :slight_smile:

Hey Marco,

My apologies for not seeing your reply a couple of weeks ago. I’ve been traveling and it must have slipped through my email updates.

Do you have a script that I can reuse to set up the same test data you’re using? (The examples you keep providing look like generated test data, but maybe that’s not what you’re actually using.) If not, I’ll try to create something ASAP.

Also, it appears that what you’re providing is the output of a schema dump or diff, because I wouldn’t expect you to have written that “create trigger” setup yourself, and I’m concerned you may be missing some indexes on the compressed chunks that I would expect to see. If I could see the exact steps you ran (including the SELECT create_hypertable... call and how you actually compressed the chunks, via a policy or manually), that would be most helpful for reproducing this.
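For example, I would expect the setup steps to look roughly like this (just a sketch on my side; the compression interval is an assumption, not something you posted):

SELECT create_hypertable('xxx.positions', 'time');
SELECT create_hypertable('xxx.events', 'time');

ALTER TABLE xxx.positions SET (timescaledb.compress, timescaledb.compress_segmentby = 'machine_id');
ALTER TABLE xxx.events SET (timescaledb.compress, timescaledb.compress_segmentby = 'machine_id');

-- compression applied either via a policy ...
SELECT add_compression_policy('xxx.positions', INTERVAL '7 days');
-- ... or manually, chunk by chunk
SELECT compress_chunk(c) FROM show_chunks('xxx.positions', older_than => INTERVAL '7 days') c;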

This is one area where the forum can get a bit long if I’m not communicating correctly and asking for the right thing.

Thanks,
Ryan

Hey @ryanbooz,
I am not quite sure why, but the query is now much faster.
So basically I am using the approach from your first comment, which was to filter only on the first (outer) table, and the query now runs in an acceptable time. I believe my colleague optimized some settings of the DB, but I can’t really tell which change yielded the improvement.
What I also see is that the “order by” clause leads to a “merge left join”, but if I execute the query without the “order by” I get a “hash left join”. The second one is faster.
Anyway, thanks for your help! I’ve learned a lot again.
Best regards,
Vaalsermarco
