Chunking strategy

Hi,
Could you please help me choose the right chunking strategy for my conditions?
My initial conditions:
We have 10k devices sending data at a non-uniform rate of 2 to 2k records per second (let’s take 10 records per second as an average).
Saved data does not change over time (no update or delete operations).
We need to keep this data forever (no TTL for this data).
Read requests are very simple (no joins or groupings); the specific device, time period, sort order, and row limit are always specified:

select *
  from devices_signals
where device_id = $1
    and dtime between $2 and $3
order by dtime desc
limit 300

I am assuming this structure and these settings at the moment:

create table devices_signals (
  device_id int2 not null,
  dtime timestamp not null,
  a double precision,
  b double precision,
  c double precision,
  d double precision,
  e double precision,
  f double precision,
  g double precision
);

ALTER TABLE devices_signals SET (autovacuum_enabled = false);

SELECT create_hypertable('devices_signals', by_range('device_id', 1), create_default_indexes => False);
SELECT add_dimension('devices_signals', by_range('dtime', interval '1 week'));

ALTER TABLE devices_signals
  SET (timescaledb.compress,
       timescaledb.compress_orderby = 'dtime DESC',
       timescaledb.compress_chunk_time_interval = '1 hour');

CREATE INDEX devices_signals_dtime_brin_idx ON devices_signals USING BRIN(dtime);

Several dilemmas confuse me.
The system is expected to be read-heavy, and since we always know the device_id when reading, it seems appropriate not to mix data from different devices in one chunk. But from a writing point of view, it would be more economical to write everything into one chunk.
My questions about the best organization of tables and partitioning for these requirements:

  1. Should I use 10k tables or 1 table for all devices?
    With 1 table it is easy to insert data from the backend application, but then PG + TS has to decide which chunk is right for each row. With 10k tables I would need 10k inserts (one per table) at each moment, which doesn’t look scalable or fast.
  2. Would it maybe be better to use 1 chunk for all devices and use timescaledb.compress_segmentby (see the Timescale Docs)?
    Will TS seek quickly to the right segment in the chunk and use it? If so, that works only for compressed chunks, right? What should we do then while the data is not yet compressed and sits in the same chunk for different devices?
  3. After I loaded data from my sample data file and compressed the chunks, I noticed the BTree index takes 0 bytes in
    SELECT *
    FROM chunk_compression_stats('devices_signals');
    Do I understand correctly that once a chunk is compressed, the index is no longer needed?
  4. When I created the table, added a BRIN index, and loaded the data, the BRIN index was not used for the select, but a BTree index works fine. Why?
  5. Why does TS create a BTree index by default instead of BRIN?
  6. Am I right that, since we have no update or delete operations, I can disable autovacuum for the table?

Hi @igkg, for most user requirements you’ll be able to design the system around how you’re going to query it, and think about ingestion second, since Timescale can handle very good throughput by default. My favorite reference is ingesting 220k metrics per second on my Raspberry Pi using a single table with 10 metrics.

A few questions:

  1. How will the device data be collected? Can you control the batch size? (A small batching sketch follows this list.)
  2. Can you predict the most important queries for this system?
  3. What’s the acceptable response time for the user?
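
A minimal sketch of what controlling the batch size could look like, assuming the backend can buffer incoming records (the values below are made up): one multi-row INSERT per flush instead of one INSERT per record.

-- Hypothetical batched flush: buffer N records in the application,
-- then write them in a single statement.
INSERT INTO devices_signals (device_id, dtime, a, b, c, d, e, f, g)
VALUES
  (1, '2024-01-01 00:00:00', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7),
  (2, '2024-01-01 00:00:00', 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7),
  (1, '2024-01-01 00:00:01', 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8);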

For ingestion, I’d also suggest you think about one table per device type: your devices can have different frequencies and different data volumes, and you can use different lifecycles and methods depending on the device type. It also makes it very clear which data types compress well; a mix of data types can sometimes keep compression from working as well.
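
For illustration only, the idea could look like the sketch below; the table names, the second device type, and the chunk intervals are made up, not a recommendation for this exact workload.

-- Hypothetical per-device-type hypertables; column layout copied from the original table.
CREATE TABLE type_a_signals (LIKE devices_signals);
CREATE TABLE type_b_signals (LIKE devices_signals);

-- Each type gets its own hypertable and a chunk interval matching its frequency and volume.
SELECT create_hypertable('type_a_signals', by_range('dtime', interval '1 week'));
SELECT create_hypertable('type_b_signals', by_range('dtime', interval '1 day'));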

Would it maybe be better to use 1 chunk for all devices and use timescaledb.compress_segmentby

Yes. If you’re going to compress, segmenting by device_id will group all the signals from a device into one compressed batch.
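
In terms of the settings from your post, that roughly means adding compress_segmentby to the compression options you already have:

-- Same compression settings as in the original post, plus segmentby on device_id.
ALTER TABLE devices_signals
  SET (timescaledb.compress,
       timescaledb.compress_segmentby = 'device_id',
       timescaledb.compress_orderby = 'dtime DESC');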

Check a few best practices for query optimization.
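
One common pattern for the query you showed (a sketch only; measure against your data before dropping anything) is a composite BTree that matches the WHERE clause and ORDER BY:

-- Index shaped for: device_id equality + dtime range, ordered by dtime DESC.
CREATE INDEX devices_signals_device_dtime_idx
  ON devices_signals (device_id, dtime DESC);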

About autovacuum: yes, you can disable it, but I wouldn’t worry about it, since you’re partitioning anyway and every chunk keeps its own statistics. The nice part is that with retention policies the chunk is truncated, so in the end there’s no vacuum work to be done.
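
For reference only, since your requirements keep data forever: this is the kind of retention policy being referred to. It drops whole chunks rather than deleting rows, so there is nothing for vacuum to clean up. The 5-year interval below is just a placeholder.

-- Placeholder interval; drops entire chunks older than the given age.
SELECT add_retention_policy('devices_signals', INTERVAL '5 years');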