Choosing the right `chunk_time_interval` value for TimescaleDB Hypertables

In TimescaleDB, one of the primary configuration settings for a hypertable is the `chunk_time_interval` value. A TimescaleDB hypertable is an abstraction that manages PostgreSQL table partitioning based on a time dimension and, optionally, space dimensions. Because time is the primary component of time-series data, chunks (partitions) are created based on the time interval specified for each hypertable.

TimescaleDB determines which chunk new data should be inserted into based on the time dimension value in each row of data. If that value falls outside the range_start and range_end of every existing chunk, a new chunk is created. It doesn’t matter when the data is inserted (for example, in real time or via a large bulk COPY): it is the row’s timestamp that determines which chunk it is inserted into, or whether a new chunk needs to be created to store it.
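As a rough illustration (the `conditions` hypertable and its columns here are hypothetical), you can see each chunk's boundaries through the `timescaledb_information.chunks` view, and it is the row's timestamp alone that decides where an insert lands:

```sql
-- List existing chunk boundaries for a hypertable
-- ('conditions' is a made-up example name):
SELECT chunk_schema, chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'conditions'
ORDER BY range_start;

-- A row whose timestamp falls outside every existing [range_start, range_end)
-- causes a new chunk to be created, regardless of when the INSERT itself runs.
INSERT INTO conditions (time, device_id, temperature)
VALUES ('2020-01-01 00:00:00+00', 'dev-1', 21.5);
```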

As noted in the Hypertable Best Practices documentation, PostgreSQL (and therefore TimescaleDB) tends to perform best when the most recent, most queried data is able to reside in the memory cache. General PostgreSQL server tuning guidance suggests that active data fit within about 25% of the configured server memory. Pay special attention to the detail that this includes all active tables and hypertables.

With all of that said, here are some additional things to consider as you set (or update) the chunk_time_interval for a hypertable.

Default setting

By default, a hypertable has a 7-day chunk_time_interval, which may or may not be appropriate for your workload. If you have never explicitly set this value, your hypertable creates chunks covering 7-day intervals.
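If the default does not suit your data, you can set the interval explicitly when you create the hypertable. A minimal sketch, assuming a hypothetical `conditions` table:

```sql
CREATE TABLE conditions (
    time        TIMESTAMPTZ      NOT NULL,
    device_id   TEXT             NOT NULL,
    temperature DOUBLE PRECISION
);

-- Convert it to a hypertable with 1-day chunks instead of the 7-day default:
SELECT create_hypertable('conditions', 'time',
                         chunk_time_interval => INTERVAL '1 day');
```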

Changing the chunk time on existing hypertables

As noted in the documentation for set_chunk_time_interval(), modifying the interval on existing hypertables does not modify existing chunks. Only newly created chunks will have an updated chunk interval.
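A minimal example, again assuming a hypertable named `conditions`:

```sql
-- Only chunks created after this call use the 24-hour interval;
-- chunks that already exist keep the interval they were created with.
SELECT set_chunk_time_interval('conditions', INTERVAL '24 hours');
```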

This also means that chunk intervals are retained in the TimescaleDB catalog unless the chunks themselves are dropped. Simply DELETEing data from a hypertable will not change the chunk interval of previously created chunks. If you want to reimport data into a hypertable with an updated chunk interval, the existing chunks must be dropped; TRUNCATEing the hypertable will also work. However, all data is lost for any dropped chunks or truncated tables, so use these options with caution.

Alternatively, you could create a new hypertable with the desired (updated) chunk_time_interval, insert the data into the new hypertable, drop the original, and then rename your new hypertable.
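A hedged sketch of that migration path (table names are illustrative; verify row counts before dropping anything, and ideally do this during a maintenance window):

```sql
-- 1. Create a new table with the same shape and the desired chunk interval:
CREATE TABLE conditions_new (LIKE conditions INCLUDING ALL);
SELECT create_hypertable('conditions_new', 'time',
                         chunk_time_interval => INTERVAL '1 day');

-- 2. Copy the data; chunks in the new hypertable use the new interval:
INSERT INTO conditions_new SELECT * FROM conditions;

-- 3. Swap the tables once the copy has been verified:
DROP TABLE conditions;
ALTER TABLE conditions_new RENAME TO conditions;
```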

Key Takeaway: Chunk times are stored in the TimescaleDB catalog once a chunk is created. Deleting data will not change a pre-existing chunk interval. Dropping chunks or truncating the hypertable are currently the only methods for changing the size of existing chunks on a hypertable. ALWAYS be cautious and mindful of any action that deletes data!

Many TimescaleDB features depend on chunks

If you use TimescaleDB compression, continuous aggregates, or data retention, all of these features depend on the range_start and range_end of a chunk. Therefore, there is a balancing act between the amount of data each chunk contains (its time interval) and the number of chunks that will fit into the 25% memory guidance.

For example, if you wanted to perform compression or data retention on data that is older than 1 month, all actions are taken on entire chunks. If a chunk’s entire interval (range_start/range_end) doesn’t fall within the specified interval (1 month in this example), the chunk will not be compressed or dropped, even if some data in the chunk is older than 1 month.
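As an illustration (the hypertable name is hypothetical), both the manual call and the policies operate on whole chunks only:

```sql
-- Drops only chunks whose entire range is older than 1 month; a chunk that
-- merely contains *some* month-old rows is left untouched.
SELECT drop_chunks('conditions', older_than => INTERVAL '1 month');

-- The same whole-chunk rule applies to the automated policies
-- (compression must already be enabled on the table for the second one):
SELECT add_retention_policy('conditions', drop_after => INTERVAL '1 month');
SELECT add_compression_policy('conditions', compress_after => INTERVAL '1 month');
```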

Likewise, continuous aggregates store data in a materialized hypertable. Therefore, all continuous aggregates have a default chunk_time_interval that is 10 times the chunk_time_interval of the hypertable they are built on. If your hypertable has a 7-day chunk_time_interval, any continuous aggregate based on that hypertable will default to a 70-day chunk_time_interval. If you then set up a data retention policy on the continuous aggregate, realize that all data in a materialized hypertable chunk will have to be older than the specified interval before that chunk is dropped.
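If that 10x default does not work for your retention or refresh goals, the materialized hypertable’s interval can be changed like any other hypertable’s. A hedged sketch (the view name is made up; the materialized hypertable name must be looked up first):

```sql
-- Find the materialized hypertable behind a continuous aggregate:
SELECT materialization_hypertable_schema, materialization_hypertable_name
FROM timescaledb_information.continuous_aggregates
WHERE view_name = 'conditions_daily';

-- Then adjust its chunk interval, substituting the name returned above:
SELECT set_chunk_time_interval(
    '_timescaledb_internal._materialized_hypertable_2',
    INTERVAL '30 days');
```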

Key Takeaway: chunk_time_interval impacts the way other features interact with your hypertables and materialized hypertables (continuous aggregates).

Smaller chunks are not always the right answer

Depending on your use case, smaller chunks may negatively impact your use of TimescaleDB (and PostgreSQL) in a few ways. Consider these as you design your schema and determine the best chunk_time_interval for your application and data.

Query performance

The more chunks a table has, the more (potential) planning work the PostgreSQL query planner has to account for. If you create chunks that are very small (1 hour) and retain a year’s worth of data, the hypertable will have nearly 9,000 chunks. If you filter a query with a relative, time-based predicate like now() - INTERVAL '1 week', the query planner still needs to consider all of the chunks, and planning time increases. Although TimescaleDB will be able to exclude nearly a year of chunks during execution, planning cannot use STABLE or VOLATILE predicates to filter out chunks during the planning stage.

If you believe you really need small chunk intervals, strongly consider finding ways to filter your time-series data with constified parameters, not STABLE filters like now() - INTERVAL...
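For example (table name hypothetical), the first query below gives the planner a constant it can use to exclude chunks at plan time, while the second relies on the STABLE function now() and defers exclusion to execution:

```sql
-- Constant ("constified") bound: chunks can be excluded during planning.
SELECT * FROM conditions
WHERE time > '2022-06-01 00:00:00+00';

-- STABLE expression: the planner still considers every chunk;
-- exclusion only happens at execution time.
SELECT * FROM conditions
WHERE time > now() - INTERVAL '1 week';

-- Compare the "Planning Time" reported for each form:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM conditions
WHERE time > now() - INTERVAL '1 week';
```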

Key Takeaway: Lots of smaller chunks impact the PostgreSQL/TimescaleDB planning phase of query execution. Weigh the benefits of smaller chunks (able to compress or drop chunks more quickly) with the number of chunks the planner may have to consider when planning a query.

You may see suboptimal compression

Depending on how you set up TimescaleDB compression (with or without a segmentby option), smaller chunks may not have very many rows per chunk, per segmentby column. Compression works best when there are at least 100 rows per segment, per chunk, so creating chunks that are too small may mean you see less benefit from compression.
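One way to sanity-check this is to count rows per prospective segment per time interval before settling on a chunk size. A rough sketch, assuming a hypothetical `conditions` table segmented by `device_id`:

```sql
-- Enable compression, segmenting by device_id:
ALTER TABLE conditions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);

-- Rows per device per 1-day bucket: if this is routinely far below ~100,
-- a 1-day chunk_time_interval is probably too small to compress well.
SELECT time_bucket('1 day', time) AS bucket,
       device_id,
       count(*) AS rows_per_segment
FROM conditions
GROUP BY bucket, device_id
ORDER BY rows_per_segment
LIMIT 20;
```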

Key Takeaway: If you use compression with smaller chunks, consider how many rows per segment exist in each chunk, and adjust the chunk_time_interval until you generally expect at least 100 rows of data per segment.


Thanks for sharing, really useful information :nerd_face:

Smaller chunks are not always the right answer

It’s so true. Query performance can drop dramatically when a hypertable has too many chunks. While working with TimescaleDB I figured out one rule: if I have to make chunk_time_interval less than 1 day, I need to think about a hardware upgrade (more RAM) or even migrating to distributed hypertables with several data nodes.


That’s great feedback. There are definitely workloads where a chunk_time_interval of 1 hour might honestly make sense (observability data with strict data retention policies, for instance), but I really like how you have that “measuring stick,” based on your experience and common workloads, that causes you to stop and re-evaluate your settings and design. :+1:


Something I’m wondering about is what qualifies as “active” data. Say I have a single hypertable that covers 2 months of 1-second data, with the table constantly updated with the most recent 1-second data. Queries against that table usually stick to the last 1-12 hours. Does this mean we should consider the last 12 hours of data “active”?


Wow! That is a surprising rule of thumb. We currently have our chunk_time_interval at 1 hour, with each chunk accounting for roughly 8 GiB uncompressed. Are there easy ways to measure performance at a different chunk size?


Yes. :nerd_face:

When you have chunk_time_interval = 1 day, one year is 365 chunks. When you have chunk_time_interval = 1 hour, one year is 8,760 chunks. The query planner has to check all chunks and their time intervals to select which ones will be used in the query.
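A quick way to check how many chunks the planner is dealing with (the view is part of TimescaleDB; only the hypertable names are your own):

```sql
-- Current chunk count per hypertable:
SELECT hypertable_name, count(*) AS num_chunks
FROM timescaledb_information.chunks
GROUP BY hypertable_name
ORDER BY num_chunks DESC;
```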

This is not a critical factor; TimescaleDB works fine with 10 thousand chunks. We just decided for ourselves that dividing by day is optimal for us. In most cases, our queries cover at least 1 day of data, so if we used 1-hour chunks, each query would have to combine 24 or more chunks.


That makes a lot of sense. We chose our chunk size based on query patterns, in that most queries hitting the 1s table cover less than an hour and only need to fetch a single chunk (CAggs handle anything more). I really appreciate the explanation.


Hi @ryanbooz,
EDIT: I realised afterwards that this won’t quite work, because you wrote that continuous aggregates have a default chunk size of 10 times the size of the base hypertable. However, the general idea might still work, so I will leave it here as a question and an idea generator for others. I will play with the idea when I get a decent chance.

I was thinking about another question I saw somewhere whilst reading this. The question pertained to rolling up older small chunks into larger chunks. Whilst I don’t have that issue at the moment, I got to thinking about it because it seemed like a fun problem. I was wondering if this kind of problem was something that might be solvable with continuous aggregates (Hereafter referred to simply as CA). For example, say I set up the following:

  • I have hourly chunks which I compress regularly (Say after 2 hours because in this example we have silly amounts of data coming in).
  • I create a CA with a time bucket equal to the capture rate of our data, set to cover everything older than 2 days, back to the start of time (which we all know is the Unix epoch, even though my employer forces me to use Windows - boo!)
  • The CA is calculated every day (Which I assume sets the chunk size, because I cannot find anything to define the chunk size for CA, and I haven’t gotten around to playing with them yet)
  • I then drop the hypertable chunks older than 5 days, and I do not cascade to the CA

It seems to me that if the refresh rate for the CA sets the chunk size (or if there is an option to set that), then this could be a way to gather up loads of small chunks and put them into a single larger chunk, and it would work as a roll-up of hourly data into daily data due to how the CA is set up.
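If anyone wants to poke at the idea, a very rough sketch might look like the following (all names and intervals are made up, and per the EDIT above the materialized hypertable’s own chunk interval would still need to be adjusted separately). Note that the refresh window should stay inside the retained raw data, otherwise a refresh over dropped chunks would clear that part of the aggregate:

```sql
-- Roll raw 1-second data up into a continuous aggregate
-- (bucket equal to the capture rate, per the idea above):
CREATE MATERIALIZED VIEW readings_rollup
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 second', time) AS bucket,
       device_id,
       avg(value) AS value
FROM readings
GROUP BY bucket, device_id
WITH NO DATA;

-- Materialize once a day, leaving the most recent 2 days alone and
-- never reaching back past the raw-data retention cutoff below:
SELECT add_continuous_aggregate_policy('readings_rollup',
    start_offset      => INTERVAL '5 days',
    end_offset        => INTERVAL '2 days',
    schedule_interval => INTERVAL '1 day');

-- Drop raw chunks older than 5 days; the aggregate keeps its copy:
SELECT add_retention_policy('readings', drop_after => INTERVAL '5 days');
```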

It’s kind of interesting to me, because our cloud provider (where I have my dev/test setup) only likes us to use NTFS, which might cause issues with chunk count eventually. Well, I could always use tablespaces, but why bother if I don’t need to and can just use a few functions inside TimescaleDB :slight_smile: I’m still negotiating to let you guys look after the hosting and make my life even easier.


:point_right: For anyone looking at how to choose your `chunk_time_interval`, Ryan just published a blog post on the topic: Timescale Cloud Tips: Testing Your Chunk Size

You’ll find lots of good info there! Please, tell us if you have any questions.


Hello all, this is Jayeeta. I am completely new to TimescaleDB and trying to fit some financial data into it. The data is at daily frequency (one row per day), and I have almost 25-30 years of it. I put the data in a hypertable with the default chunk interval. Now, when I fetch the data for processing, data retrieval is 3-4 times slower than a normal PostgreSQL table. Additionally, I was trying to insert the data after processing into another hypertable (again with the default chunk interval), and the insertion is super slow (1050 rows/sec). I am using Python, the database is Timescale Cloud (the free 30-day trial), and I am using SQLAlchemy for batch inserts. Any help on how I should decide the chunk intervals? And how should I make the insertions faster? Sorry for the long post, any guidance would be helpful. Thanks in advance.

Hello @jayeetamukherjee, it seems like every chunk would have only 7 records if you’re tracking a single stock. Please try to use the 25%-of-memory rule that @carlota mentioned in the last comment. Depending on the cardinality, like how many assets you’re observing, you could probably use a chunk per year or even per few years. If you later start loading more granular, real-time data, you can also change the chunk interval, and it will apply to newly created chunks.

If you need more insight into what’s going on, try running EXPLAIN ANALYZE on your query to see why it’s so slow. Maybe you need to restrict the time interval with a WHERE clause so the query doesn’t touch all the chunks.

For querying, I’d start the investigation with the following (see the sketch after this list):

  1. Check what the chunk size is.
  2. Check how many chunks you have.
  3. Check how many records are in each chunk.
  4. Use EXPLAIN ANALYZE to dig into why reads are so slow.
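A rough sketch of those checks (substitute your own hypertable name for the hypothetical `prices`):

```sql
-- 1 & 2: chunk boundaries and how many chunks exist:
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'prices'
ORDER BY range_start;

-- 3: on-disk size per chunk, plus an approximate total row count:
SELECT * FROM chunks_detailed_size('prices');
SELECT approximate_row_count('prices');

-- 4: restrict the time range and inspect the plan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM prices
WHERE time >= '2015-01-01' AND time < '2016-01-01';
```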

For inserts, you should probably be careful about how far you are from the server (network latency) and how you’re doing the inserts, and try to prioritize batched inserts.

Also, take a look at this post, which is all about insert performance.