Compress inactive chunks immediately

Rmwin32 · February 19, 2022, 11:17am

Hi everyone,
I am currently using Timescale and am generally very impressed. One are where I would appreciate a little help is with what the exact options of some of the functions around the management of hyper tables actually do.

I want to compress my data as soon, or within hours, of rolling into a new chunk. I have some inputs which capture a good amount of data very regularly. I set the chunk interval to 1 day and the compression to 1 day. This leaves the current chunk and yesterdays chunk uncompressed. I also tried setting the compress option to just 4 hours, but that didn’t compress yesterdays chunk at 4am UTC as I would have expected.

I would really appreciate if the documentation could be updated with little explanations like this. I can also offer to help update the docs, as and when I learn valuable things, seeing as they are also on GitHub. Sadly I cannot use Slack from work, so forums it is

Thanks to anyone who can point me in the right direction.

LorraineP · February 21, 2022, 10:11am

Thanks for the question @Rmwin32 and the observation about docs. Glad that we now have the forum to help those who like you aren’t able to use Slack due to restrictions… It’s likely that a substantial number of people were in that predicament.

I’ll tag this post with ‘docs-request’ and meanwhile will try to get you a steer on this issue

PS would also be awesome if you want to contribute to docs, the team would be very happy to see that!

ryanbooz · February 21, 2022, 3:16pm

Welcome to the forums @Rmwin32!

Most policies don’t provide a specific parameter for setting the schedule interval when they are created and default to running once a day. One outlier is creating refresh policies for Continuous Aggregates which provide the schedule_interval option on creation.

However, all policies create jobs that you can view and modify parameters for using the informational views and alter_job function.

In your case there are two things to consider. First, how frequently the compression job runs (schedule_interval), and second how far back in time the chunk’s range_end date needs to be before the chunk will be compressed (the compress_after option of the policy).

In your case, you could set the schedule_interval for the compression job to run every hour (or 4 hours, however “quickly” you want the chunk to attempt to compress after a new chunk has been created), but leave the compress_after set to 1-day. With this setup, the job will run once an hour and look for any uncompressed chunks with a range_end timestamp that is older than now()-'1 day'::interval.

Basic Steps (adjust for your job details)

1. Find the “job_id” of the Compression Job

SELECT * FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression' AND hypertable_name='{replace with hypertable}';

2. Alter the job to run once an hour

SELECT alter_job({replace_with_job_id},schedule_interval=>'1 hour'::interval);

If you want to change configuration that’s specific to this job, look at the config field in the timescaledb_information.jobs view and add that JSON to the alter_job request.

Compression considerations

I’d be remiss if I didn’t briefly remind you that compressing chunks as quickly as possible has two ramifications that you should consider given what you know of your application and query patterns.

First, you cannot UPDATE or DELETE data in a compressed chunk. If your application ever sends older data that would modify those chunks, you’ll have to decompress them first, at least as of Feb. 2022 (it’s a limitation we continue to explore solutions to!) If that’s not how your application works, you’re clear here.

Second, realize that when you compress a chunk, the indexes will not be the same as on uncompressed chunks because the data is transformed into columnar data. Therefore, it’s particularly good for querying specific columns of data (like aggregating a value over a long period of time) rather than something like select * from... over compressed chunks. If you request all of the columns, TimescaleDB has to recompose all of the individual columns back into full rows which takes more work and will not perform as well.

All of that is just to give context and explain some of the nuance of how and when it’s best to compress your chunks, considering your application insert and query patterns.

Let us now if you have other questions!

Rmwin32 · February 21, 2022, 4:58pm

HI Ryan,

Many thanks for the quick and thorough reply. I have updated our test system with some config changes that should help allow me to test that everything works for our usage pattern. I also discovered that a few of the compression jobs could overlap. While this shouldn’t cause any issues, if there is other load on the system then it could, potentially, lead to contention for disks so I spread them out a little more.

I wasn’t aware of all the options, and am just learning my way around TimescaleDB. It seems a great system, and works just as I would expect. For reference, we use it to monitor both itself/Postgres and also a decent sized SQL Server estate. Therefore we will never go back and re-insert/update old data. We also rarely go back more than 3 hours, except to compare to baselines for our systems. We do capture a heck of a lot of data though, we created a custom Telegraf input plugin, because the standard one wasn’t nearly nice enough for what we wanted. As such, to keep the monitoring server down a bit in size, I just wanted to compress the data after about 4-5 hours. This seems to work well, and for a single table just gave us 5 GB back.

Lastly, I wondered if you could pass on our teams thanks to Sven Klemm for his effort in getting a Postgres output plugin for Telegraf rolling? It’s been very valuable to us and I’m happy to see that its evolved version is about to become a standard plugin. That should help a bunch of people.

Many thanks again to you all at TimescaleDB for all your efforts and your quick help with my query.