Data Retention Job Failed

TimescaleDB is running in an HA (Patroni + etcd) environment. The database consists of 8-10 hypertables. Every hypertable has a data retention policy that deletes data older than 4 months. But for the past few days, one of the retention policy jobs has been failing.
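The policies were created roughly like this (the hypertable name below is a placeholder):

```sql
-- Drop data older than 4 months; runs as a scheduled background job.
SELECT add_retention_policy('my_hypertable', INTERVAL '4 months');
```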

The solutions I tried:

  • I set max_runtime for the job to 10 minutes and then 15 minutes, and I also disabled max_runtime by setting it to 0 seconds. It failed with every option I tried. I also tried running the job manually, but it runs continuously and takes a lock on the table.
  • Since it wasn't completing successfully either scheduled or run manually, I decided to drop the chunks from the hypertable that hold records older than 4 months (120 days). But when I ran the retention job afterwards, I hit the same issue: the query stayed in a running state. The commands I used are sketched after this list.
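A rough sketch of those attempts (1036 is the job id from this thread; the hypertable name is a placeholder):

```sql
-- Raise the runtime limit for the job.
SELECT alter_job(1036, max_runtime => INTERVAL '15 minutes');
-- Setting it to 0 removes the limit.
SELECT alter_job(1036, max_runtime => INTERVAL '0');

-- Run the job immediately in the current session.
CALL run_job(1036);

-- Drop the old chunks by hand instead of waiting for the policy.
SELECT drop_chunks('my_hypertable', older_than => INTERVAL '4 months');
```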

What could be the reason for this issue, and what would be a suitable solution?

Hello @aviral.singh21, can you share what you have in the server logs?

Also, you can probably check the error messages in the job_errors view. Please share anything relevant from these sources.
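On versions that ship the view, that check looks like this (1036 is the job id from this thread):

```sql
-- Recent failures recorded for the job, newest first.
SELECT job_id, pid, start_time, finish_time, sqlerrcode, err_message
  FROM timescaledb_information.job_errors
 WHERE job_id = 1036
 ORDER BY start_time DESC;
```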

Hi @jonatasdp, thanks for replying.

I already tried using timescaledb_information.job_errors, but it returns an error that relation "timescaledb_information.job_errors" does not exist.
The server logs also contain only these two lines:
WARNING: terminating background worker "Retention Policy [1036]" due to timeout
LOG: background worker "Retention Policy [1036]" (PID 38071) exited with exit code 1

Also, I forgot to mention the system information, so here it is:
OS: Ubuntu 22.04.1
PostgreSQL Version: 13.8
TimescaleDB Version: 2.7.1
Installation Method: Docker

Hi @aviral.singh21, thanks for sharing more details. Since you're not on the latest version, the job_errors view is probably not available.

As it's a timeout, try using alter_job to set max_runtime. That allows the same job to run for longer; maybe it can help. I just found a related issue here.

Hi @jonatasdp, I already mentioned this in my issue:

Solution 1: I set max_runtime for the job to 10 minutes and then 15 minutes. It failed each time.

Solution 2: Then I disabled max_runtime by setting it to 0 seconds. In job_stats, the job status continuously showed as running; I monitored this for more than 2 hours. So I changed max_runtime back to the default value, i.e. 5 minutes. Even after resetting it to the default, job_status in timescaledb_information.job_stats still showed running (the status query is sketched after this list).

Solution 3: I also tried running the job manually with CALL run_job(1036);, but it never returned (the cursor just sat there).

Solution 4: I also tried dropping this retention policy and creating a new one for the same hypertable, but it still failed.
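For reference, this is roughly what I was running (the hypertable name is a placeholder):

```sql
-- Check the job's current status and failure counts.
SELECT job_id, job_status, last_run_status, last_run_duration, total_failures
  FROM timescaledb_information.job_stats
 WHERE job_id = 1036;

-- Drop and recreate the retention policy.
SELECT remove_retention_policy('my_hypertable');
SELECT add_retention_policy('my_hypertable', INTERVAL '4 months');
```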

My apologies for missing info in the previous message.

Can you check whether the chunk has any anomaly, like a larger size than normal, or a longer time interval? Maybe it's a memory error while allocating memory before dropping the chunk, or some file corruption that prevents truncating the chunk or a related operation.
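A quick way to inspect that (a sketch; the hypertable name is a placeholder):

```sql
-- Time range covered by each chunk.
SELECT chunk_name, range_start, range_end
  FROM timescaledb_information.chunks
 WHERE hypertable_name = 'my_hypertable'
 ORDER BY range_start;

-- On-disk size of each chunk.
SELECT * FROM chunks_detailed_size('my_hypertable');
```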

Hi @jonatasdp, the maximum size of any chunk in the hypertable is 12 MB. There are two chunks older than the specified interval (4 months), and their sizes are 10 MB and 4 MB. The schedule interval for all jobs is set to 1 day.

Since I am not able to find any solution, I am thinking of upgrading to the Docker image with PG14 and TS 2.10.1 (timescale/timescaledb-ha:pg14.7-ts2.10.1-latest). Will that help?

I can't guarantee it, but I'd give it a try too.

Then can you please help me with any other solution? I was facing this type of issue in a multi-node environment as well; back then I also couldn't find any concrete reason or solution for it. So I migrated the database to the HA environment, but 20 days after the migration the issue appeared again.

Hi @jonatasdp. I don't know how, but after 19 days of the retention policy failing continuously, it just started showing success on its own. It has been running successfully for the past three days. No server restart, no version upgrade, and no configuration change was made.
Well, it is good news, but the question remains: what made it fail, and how did it recover on its own? What was blocking it?
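For next time, one way to catch a blocker while the job hangs is to look for sessions waiting on locks (a minimal sketch; plain PostgreSQL, no TimescaleDB-specific views needed):

```sql
-- Sessions that are currently blocked, and who is blocking them.
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
  FROM pg_stat_activity
 WHERE cardinality(pg_blocking_pids(pid)) > 0;
```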

Have you checked anything in the server logs? Maybe out of memory? You can also check the timescaledb_information.job_errors view for any errors.

I checked the logs; nothing related to the retention policy was mentioned. And the job_errors relation does not exist.

You're probably using an old version where it's not available yet. I'd try upgrading and check whether the error persists.

Yes, I am trying to upgrade the services. This issue, TimescaleDB Upgrade in Docker, is about that upgrade: I was doing an in-place upgrade by changing the image version in the yml file, and I am facing an issue upgrading Postgres.
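For what it's worth, swapping the image tag only covers part of the upgrade. The extension itself has to be updated explicitly, as the first command in a fresh session:

```sql
-- Run as the first statement after connecting to each database.
ALTER EXTENSION timescaledb UPDATE;
```

A PostgreSQL major-version jump (13 to 14) additionally needs pg_upgrade or a dump/restore; a PG13 data directory won't start under a PG14 image on its own.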