Drop chunk based on column data

Hi,

timescaledb version : 1.7 +
postgresql version: 11 +

We need to drop device_id =‘1’ chunk , how to perform it please suggest.
below table details

CREATE TABLE test.data (
timestamp timestamptz NOT NULL,
device_id character varying(50) NOT NULL,
name text NOT NULL,
value double precision,
source text
);

SELECT create_hypertable(‘test.data’, ‘timestamp’,‘device_id’,16, chunk_time_interval => interval ‘1 day’);

insert into test.data values(‘2021-01-01 00:00:00’,‘1’,‘A’,10,‘DATA1’);
insert into test.data values(‘2021-01-01 00:00:00’,‘2’,‘A’,10,‘DATA1’);
insert into test.data values(‘2021-01-01 00:00:00’,‘3’,‘A’,10,‘DATA2’);
insert into test.data values(‘2021-01-01 00:00:00’,‘4’,‘A’,10,‘DATA2’);
insert into test.data values(‘2021-01-01 00:00:00’,‘5’,‘A’,10,‘DATA3’);
insert into test.data values(‘2021-01-01 00:00:00’,‘6’,‘A’,10,‘DATA3’);

insert into test.data values(‘2021-01-01 00:00:00’,‘1’,‘B’,10,‘DATA1’);
insert into test.data values(‘2021-01-01 00:00:00’,‘2’,‘B’,10,‘DATA1’);
insert into test.data values(‘2021-01-01 00:00:00’,‘3’,‘B’,10,‘DATA2’);
insert into test.data values(‘2021-01-01 00:00:00’,‘4’,‘C’,10,‘DATA2’);
insert into test.data values(‘2021-01-01 00:00:00’,‘5’,‘D’,10,‘DATA3’);
insert into test.data values(‘2021-01-01 00:00:00’,‘6’,‘C’,10,‘DATA3’);

Data is available up to today.

Each device_id have 10000 records for everyday.

Drop chunk is working based on time , i want to drop chunk based column (like device_id)

Hello everyone,
I want To drop chunks by column other than time interval

Unfortunately, drop chunks is only supported using time intervals as the basis, so at this point, we’re not able to meet your requirement. In this case, if it’s a mandatory requirement, then you’d need to implement separate hypertables I believe.

There are some GitHub issues related should you want to add your voice to those https://github.com/timescale/timescaledb/issues/563 and https://github.com/timescale/timescaledb/issues/545

1 Like
SELECT tableoid::regclass as chunk, datetime_col__cast_as_interval, dimension_location_column 
FROM <hypertable> GROUP BY tableoid, datetime_col__cast_as_interval,dimension_location_column;
DROP TABLE <chunk>;
1 Like

Hey @DanM - Welcome to the forum!

Just to be clear, unless you’ve written a specific hashing algorithm to split every distinct value out to it’s own chunk (not advised, honestly), realize that chunks contain many values of the space partition. Given the OP command:

SELECT create_hypertable(‘test.data’, ‘timestamp’,‘device_id’,16, chunk_time_interval => interval ‘1 day’);

The space partitioning column is device_id. This does not imply that Timescale will create one partition per device_id. Only that it will now be considered as part of the hashing algorithm to try and more evenly distribute data across chunks. For example, if you had a dataset with 100 different devices, TimescaleDB might only create 3-4 partitions that each contain ~25 devices for a given chunk_time_interval.

All of that is to say, it’s essentially impossible to drop a chunk based on anything other than time and not impact a much wider surface area of your data.

HTH

1 Like

Thank you for your answer.

My desired space partitioning is a combination of TennantID and GDPR number of months retention policy: T1-12/T1-24/T1-36/T2-12/T2-24.

So far I only tested with a space column for GDPR as NUMERIC values: 12/24/36 and saw that it keeps 3 distinct chunks based on it ( I saw indeed hash collision if the column is INT between values 12 and 36). And I managed to drop individual chunks based on Date & GDPR without any issues.

If I test the hash collision ahead on the space partition column values and I stay bellow 10-12 distinct values, @ryanbooz do you see any problem in dropping them individually using such strategy ?

BR,
Dan

I personally can’t say without more thought to the default hashing function. It’s been a while since I’ve dug in and I really can’t remember what to expect for something like this. My sense is that at some point, you’re going to end up with overlap.

One option that I think is safer, in the long run, would be to enable compression and use the TennantID as a segmentby column assuming there’s enough density per chunk to do so. This will group all data for each tenant in the same row and then you could create your own stored procedure to DELETE data from the underlying compressed hypertable by segment. You will have to do a little extra work to figure out what the compressed hypertable is called and check range_begin/range_end on the chunk to make sure you only delete what you want, but I think it simplifies things in the long run and doesn’t rely on hashing that might give you unexpected results later. Bloat is greatly reduced because compression combines many rows into one and since most of the data is not in TOAST, there’s very little to delete, so it’s fast.

With an approach like this, you save space, probably increase query efficiency on older data, and you can easily remove data from a user based on GDPR settings.