Extremely bad compression ratio

In brief:

My compression policy is somehow increasing the size of my data by over 100% instead of compressing it. The data is stock tick data: I currently have ~1.7 billion rows and record about 40-60 million rows per day.

Schema:

CREATE TABLE public.trades (
    ev text NULL,
    sym text NOT NULL,
    x integer NULL,
    i text NOT NULL,
    z integer NULL,
    p numeric NULL,
    s numeric NULL,
    c integer[] NULL,
    t timestamp with time zone NOT NULL,
    q bigint NOT NULL,
    isrth boolean NULL,
    PRIMARY KEY (sym, i, t, q)
);

stock_system=# SELECT pg_size_pretty(before_compression_total_bytes) AS total
FROM chunk_compression_stats('trades')
WHERE compression_status = 'Compressed';
 total
--------
 77 GB
 45 GB
 47 GB
 54 GB
 69 GB
 58 GB
(6 rows)

stock_system=# SELECT pg_size_pretty(after_compression_total_bytes) AS total
FROM chunk_compression_stats('trades')
WHERE compression_status = 'Compressed';
 total
--------
 170 GB
 101 GB
 106 GB
 122 GB
 155 GB
 131 GB
(6 rows)
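
For reference, the two can be combined into a single query over the same chunk_compression_stats() output, which shows the per-chunk ratio directly:

SELECT chunk_name,
       pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after,
       round(after_compression_total_bytes::numeric
             / before_compression_total_bytes, 2)     AS ratio
FROM chunk_compression_stats('trades')
WHERE compression_status = 'Compressed';

Anything above 1.00 in the ratio column means the chunk actually grew after compression.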

TimescaleDB version: 2.16
PostgreSQL version: 16
OS: Ubuntu 22.04
Platform: Self hosted
Install method: APT
Environment: Production

This seems to be my compression policy.

-[ RECORD 1 ]-----+--------------------------------------------------
job_id            | 1003
application_name  | Compression Policy [1003]
schedule_interval | 12:00:00
max_runtime       | 00:00:00
max_retries       | -1
retry_period      | 01:00:00
proc_schema       | _timescaledb_functions
proc_name         | policy_compression
owner             | postgres
scheduled         | t
fixed_schedule    | f
config            | {"hypertable_id": 1, "compress_after": "5 days"}
next_start        | 2024-01-28 02:37:31.905273+00
initial_start     |
hypertable_schema | public
hypertable_name   | trades
check_schema      | _timescaledb_functions
check_name        | policy_compression_check

I also found this in the schema. I'm not sure if the number of segmentby columns is the problem.

Can you please double-check your segmentby configuration? Segmentby works well when it groups several rows into a batch. If your segmentby columns end up being nearly unique per row, you are effectively compressing row by row, which is not efficient.
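
You can inspect the current settings with something like this (a sketch against the timescaledb_information.compression_settings view; column names may differ slightly between versions):

SELECT attname, segmentby_column_index, orderby_column_index
FROM timescaledb_information.compression_settings
WHERE hypertable_name = 'trades';

Every column with a non-null segmentby_column_index is part of the segmentby key; if that combination is close to unique per row, each compressed batch degenerates to a single row.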

Also, I see you have i as text and c as integer[], which are great candidates to become TOASTed by PostgreSQL if rows reach the TOAST limits.

Can you double-check whether the data is being TOASTed and the compression stats are just reflecting that?
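
One way to check, sketched here assuming your hypertable id is 1 (as the job config above suggests), is to look at the TOAST relations of the chunks directly:

SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = '_timescaledb_internal'
  AND c.relname LIKE '\_hyper\_1\_%'
  AND c.reltoastrelid <> 0
ORDER BY pg_relation_size(c.reltoastrelid) DESC;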

Can you double-check whether the data is being TOASTed and the compression stats are just reflecting that?

I'm not entirely sure I understand, but I think you mean the size increase might be illusory rather than real?

I noticed the issue initially via the du command: compressed, the data was 800 GB on disk; after I decompressed, it's at around 400 GB. I don't think these values are big enough to get TOASTed. The i column is just numbers stored as text (the upstream data provider sends it that way, and given the ingest rate I didn't want to deal with conversion), so it shouldn't be anywhere near the limit.

Most values should be close between ticks on the same stock, but unique otherwise. I think segmenting by sym fixes that?
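
If so, here's a minimal sketch of what I'd try (assuming the standard ALTER TABLE ... SET compression options; with ~1.7 billion rows the recompression step will take a long time):

-- decompress existing chunks first (changing the settings may require this)
SELECT decompress_chunk(c, if_compressed => true) FROM show_chunks('trades') c;

ALTER TABLE trades SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sym',
    timescaledb.compress_orderby   = 't, q'
);

-- recompress so the new segmentby takes effect
SELECT compress_chunk(c, if_not_compressed => true) FROM show_chunks('trades') c;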