Timescale Cloud Tips: Testing Your Chunk Size
It’s time for another installment of Timescale Cloud Tips, a series of blog posts in which we aim to answer our customers’ most common questions.
Last week, we shared some advice on maximizing your data ingestion rates, and today, we’re discussing a Timescale classic: chunk size. Many users don’t realize that the size of the data partitions within a table—we call them chunks—is an essential parameter that can affect the performance of their TimescaleDB databases and impact how some features work. So first, we’ll explain why this is something you should care about and then show you how to pick your optimal chunk size.
Why Does Chunk Size Matter?
One of the best things about hypertables is that you, the user, interact with a hypertable just like a regular PostgreSQL table; we put a lot of effort into this abstraction, as we think this is the best experience for users. You shouldn’t constantly be thinking about this internal partitioning; we believe this is something that your database should be doing for you, and that’s what TimescaleDB does.
Because TimescaleDB creates chunks based on time, a hypertable has a setting that determines the earliest and latest timestamp stored in each chunk. We call this setting
chunk_time_interval and set it to
7 days unless you configure it otherwise.
Default settings like this are helpful so that you can start using a hypertable quickly. However, every use case is different, and default settings are useful to get generally good performance. But, if you want to fine-tune TimescaleDB to perform well for your application, you may want to tweak this setting manually.
The first reason why this matters is that TimescaleDB uses chunks (partitioning) to write and read data quickly. If your chunks are too big (i.e., you store too much data per partition), the benefit of partitioning diminishes because you have more data in the partition than the PostgreSQL cache available to manage it, similar to having one large regular PostgreSQL table. But if you have too many chunks (i.e., you store only a little data per partition), you may overwhelm the query planner or create extra overhead in other management areas.
The second reason why this setting matters is because many TimescaleDB features are affected by the chunk size, including our top features: compression, continuous aggregates, and data retention policies. All these features consider the chunk interval (e.g., seven days of data) or chunk constraints (earliest timestamp and latest timestamp).
For instance, data is compressed by chunks, not entire hypertables or individual rows. Retention policies delete chunks while continuous aggregates (which store their data in special hypertables as well) automatically set their chunk size based on the parent hypertable setting. Therefore, these features are affected by how you configure your chunk size within the context of your hypertable.
With that background, let's consider a few specifics when modifying the
chunk_time_interval of your hypertable.
Understand your priorities
TimescaleDB is a powerful time-series database with lots of 💯 features that help you ingest and query massive amounts of time-series data. Most of these features depend on each other to work best, and your priorities (or expectations) might not fully align with the features’ design, specifically how
chunk_time_interval impacts how they work.
Do you want to optimize for the best compression ratio? Larger chunks that store more rows per device will probably be better, generally speaking.
Do you want to drop raw data as quickly as possible after materializing data in a continuous aggregate? Smaller chunks that more closely align with the
time_bucket interval of the continuous aggregate definition are a good starting place.
Do you want to ensure that your server resources are using hypertables efficiently? You'll probably have to find a compromise of
chunk_size that aligns with other priorities and still ensures active chunks from all hypertables reside in about 25 percent of your PostgreSQL memory allocation.
Know the characteristics of your data
There is admittedly a process to nail down what interval setting is best for a specific hypertable based on your data, application, and query patterns. It's easiest to modify
chunk_time_interval if you have some level of understanding of the data that you have.
If you have an existing database that stores data similar to what will be inserted into the hypertable, look for how many rows you track each hour (or day). Within that same period, estimate the density of readings per device. And finally, if possible, import a representative set of data into a hypertable with the default setting of '7 days' and then look at how big each chunk is using the
If the chunks only take a fraction of your available memory, drop the hypertable and try again with '14 days' or '30 days' and evaluate again. If the chunks are very large and one or two chunks would take up a lot of your available memory, try again with '1 day' and see how it goes.
If you don't yet have example data, try to do some rough estimates using the same criteria.
- How many devices will you be tracking?
- How often will the devices be recording readings (sampling frequency)?
- How many rows will that total over an hour or a day?
It's often very helpful to create sample data using PostgreSQL to help you "see" what your final data might look like in a hypertable, how big chunks are, etc. See our series (part 1, part 2, part 3) on using
generate_series() to create lots of sample data for easy testing!
Finding Your Optimal Chunk Size
With information about your data and sample chunk size in hand, let's walk through an example of how to merge all of this into a "sweet spot" setting for a sample dataset, talking about some of the compromises you may have to make when working with all of the TimescaleDB features.
Our hypothetical server and dataset are outlined in the following table. For our purposes, assume that each row of data records 10 columns of data.
Server Resources and Sample Dataset
Application Data Expectations
With a clear understanding of our server, dataset, and expectations for how we want to manage the data, let's look at some optimal settings for
chunk_time_interval for this example dataset.
Active chunks of all hypertables in ~25 % of available memory
Our database server has 16 GB of memory available, and we want our active chunks to be around 4 GB total. The average size of a seven-day chunk is approximately 1 GB currently, and we know that our application typically queries for the last 14 days. Other TimescaleDB features aside, it seems that ‘7 days' is a good starting point for
chunk_time_interval because more than 14 days of data could reside in 4 GB of memory. We might even consider increasing
chunk_time_interval to '10 days' or even '14 days' if other features below would benefit from having more data per chunk, but we probably wouldn't go smaller.
Compression is one of the TimescaleDB features that is impacted by both the size of the chunk (e.g., how many rows per device are stored in the chunk) and the chunk constraints based on time, something we track as
range_end in TimescaleDB.
First, when you're looking to maximize the compression ratio, having more rows per device will almost always provide better compression overall. TimescaleDB compression works by grouping up to 1,000 rows of data at a time and performing various types of compression against the groups of column data. If your chunks are too small and you're not saving many values per chunk, per device, your compression ratio will probably be smaller.
In our case, each chunk should have approximately 10,000 rows per device, more than enough to get decent compression. This also shows that we could easily have '14 day' or '30 day' chunks and still not have an overwhelming amount of data in each chunk, particularly after compression.
The second consideration is how quickly data in chunks will be compressed after our '1 month' setting, a value that's calculated as
now()-'1 month'::interval' each time the policy runs. Chunks will not be compressed until all of the data in the chunk is older than the policy interval specifies.
Since we want data to be compressed after one month, the value of
now() - '1 month'::interval' will be compared to the
range_end of a chunk. If
range_end is older than the policy interval, the chunk will be compressed. If it's not, TimescaleDB knows that some rows in the chunk are still newer than the compression policy specifies, and it will not compress that chunk. If the policy doesn't run for another day (or another week), you'll have data older than '30 days', which is still uncompressed.
If your requirements allow that kind of "flex" in data compression, then a
chunk_time_interval of '7 days' (or '14 days') will allow you to pack the chunks more densely with device readings while still compressing data close to the 30-day window. It will almost always be compressed a little later than 30 days (unless all of the stars align), but we'll still have most of our oldest data in a compressed chunk.
With data retention, TimescaleDB drops entire chunks. It doesn't
DELETE data. Therefore, data retention, like compression, relies on the
range_end of a chunk to determine when to drop a chunk.
Again, the chances of the chunk
range_end and the timestamp that's calculated by the policy (
now()-'3 months'::interval') aligning perfectly is unlikely to happen at exactly three months, but the chunk will be dropped within a few days of that marker in most cases. If dropping data at exactly three months is critical, you'll need a smaller chunk size, but that will probably impact the efficiency of compression.
Data retention expectation is typically the one that needs to compromise because of how the other TimescaleDB features interact with the hypertable and chunk size. Lowering the
chunk_time_interval to '1 day' so that chunks are dropped as close to the
3 month mark as possible will reduce the effectiveness of compression efficiency to a minimum and impact how continuous aggregates are sized as well.
If you're focused on precise data retention as an end goal but not using compression, you're not using the full power of TimescaleDB and should re-evaluate your data management expectations.
Finally, our application will have three continuous aggregates that use a
time_bucket of 30 minutes, one hour, and one day, each with a different interval of time that we want to drop the aggregated data using a data retention policy. This is where the hypertable
chunk_time_interval impacts continuous aggregates the most.
If we keep the
chunk_time_interval set to '7 days', any continuous aggregate we create will automatically use the setting of the underlying hypertable to set the
chunk_time_interval of the materialized hypertable. By default, TimescaleDB sets this to 10x the chunk size of the underlying hypertable.
With our example data, that means that all three continuous aggregates will have a
chunk_time_interval of '70 days'. This is probably a reasonable setting from a data density perspective (our aggregates are reducing the total number of rows in each successive continuous aggregate), but that '70 day' interval will impact our ability to drop data from the aggregate in the timeframe we specified.
The '30 minute' continuous aggregate can have data drop after seven days. Unfortunately, since the
chunk_time_interval of the continuous aggregate is 70 days, in this case, that data won't actually be dropped for another two months later than I was expecting. The same holds for the '1 hour' aggregate. If we try to drop data from the continuous aggregate after '1 month', the most recent chunk still has another month or more until it's eligible to be dropped based on the chunk
This is the most important aspect of the hypertable
chunk_time_interval that impacts continuous aggregates. It's worth considering how any change in chunk size of the hypertable will necessitate a change to the
chunk_time_interval of the continuous aggregate, at least based on your data requirements.
Putting It All Together
With our requirements in hand and having examined how `chunk_time_interval` would impact various TimescaleDB features, here is where we would probably start the settings for this hypertable. Recognize that this is an example and each dataset is specific, so use the information above, a sample of your data, and features like Timescale Cloud database forking to quickly iterate changes in your settings to see if they will have the intended impact.
For this table, we'd recommend starting at '14 days' for this application. Our reasoning for this includes:
- We can store more rows of data per device in each chunk, which should keep compression ratios high.
- The total size of an uncompressed chunk still fits easily in memory for fast access and modification.
- Data retention will lag a little further behind our specified windows, but because those older chunks will be compressed, their storage footprint will be minimal.
Based on our data retention settings for our continuous aggregates, we'll set the
chunk_time_interval of each slightly different to accommodate data retention better.
Modify this setting immediately after creating the aggregate, but before any data is materialized through a refresh policy (or manually). For example:
-- Create the continuous aggregate WITH NO DATA so that we can -- modify the chunk_time_interval before any data is added CREATE MATERIALIZED VIEW data_view_30min WITH (timescaledb.continuous) AS SELECT time_bucket('30 minutes', time) as bucket, … WITH NO DATA; SELECT set_chunk_time_interval('data_view_30min', INTERVAL '14 days'); -- Start the refresh policy after setting the new chunk time -- to start populating the continuous aggregate SELECT add_continuous_aggregate_policy('data_view_30min', start_offset => INTERVAL '1 day', end_offset => INTERVAL '1 min', schedule_interval => INTERVAL '10 min');
These changes are made specifically to accommodate data retention on continuous aggregates. You should verify that queries perform as expected.
chunk_time_interval parameter of your hypertables can dramatically impact the efficiency and performance of many key TimescaleDB features.
Spending time understanding your data, server resources, and application data requirements will go a long way in helping you fine-tune this setting. Also, remember to take advantage of forks for testing in Timescale Cloud—you can create a Timescale Cloud account here and use it for free for 30 days.
If you have any further questions, the Timescale Community Forum is always open. Don’t hesitate to ask!