Retry period of job is longer than defined

Hi! After changing the retry period for a job, the actual retry interval is longer than the one defined.

Assume the retry period is 5 seconds. In practice the interval is 50 seconds. If the retry period is 10 seconds, the actual interval could be 30 seconds. I don’t understand the dependency.

Below is the example:


CREATE OR REPLACE PROCEDURE test.run_job_test_raise_exception(job_id int, config jsonb)
LANGUAGE plpgsql
AS $$
BEGIN
	-- Deliberately fail with a division-by-zero error so the job is retried.
	PERFORM 1 / 0;
END;
$$;

SELECT api_bet.add_job(
	proc => 'test.run_job_test_raise_exception',
	schedule_interval => INTERVAL '1 day',
	initial_start => TIMESTAMP '2023-04-21 07:00:00');

SELECT api_bet.alter_job(
	job_id => (
			SELECT job_id FROM
			WHERE proc_schema = 'test' AND proc_name = 'run_job_test_raise_exception'),
	next_start => clock_timestamp() + INTERVAL '5 seconds',
	retry_period => INTERVAL '5 seconds');


job_id|application_name                          |schedule_interval|max_runtime|max_retries|retry_period|proc_schema          |proc_name                   
  1090|User-Defined Action [1090]                |            1 day|   00:00:00|         -1|    00:00:05|test                 |run_job_test_raise_exception


job_id|last_run_started_at          |last_successful_finish       |last_run_status|job_status|last_run_duration|next_start
  1090|2023-04-21 16:19:55.915 +0300|                    -infinity|Failed         |Scheduled |  00:00:00.015897|2023-04-21 16:20:53.587 +0300


job_id|proc_schema|proc_name                   |pid    |start_time                   |finish_time                  |sqlerrcode|err_message
  1090|test       |run_job_test_raise_exception|3817418|2023-04-21 16:19:05.626 +0300|2023-04-21 16:19:05.641 +0300|22012     |division by zero
  1090|test       |run_job_test_raise_exception|3817480|2023-04-21 16:19:55.915 +0300|2023-04-21 16:19:55.931 +0300|22012     |division by zero
  1090|test       |run_job_test_raise_exception|3846428|2023-04-21 16:20:53.590 +0300|2023-04-21 16:20:53.606 +0300|22012     |division by zero
  1090|test       |run_job_test_raise_exception|3847002|2023-04-21 16:21:59.622 +0300|2023-04-21 16:21:59.637 +0300|22012     |division by zero
  1090|test       |run_job_test_raise_exception|3847744|2023-04-21 16:23:15.653 +0300|2023-04-21 16:23:15.668 +0300|22012     |division by zero

Hello @LuxCore!

Admittedly the situation is confusing as this area is not documented, so that’s something we need to update.
In the meantime I hope this reply can help clarify some things.

So when a job run results in a runtime failure, as in the example above, the retry_period parameter is taken into account, and the next start after a failed execution is calculated as follows:

next_start = finish_time + consecutive_failures * retry_period, plus some jitter (± 13%) to avoid “thundering herds”

However, as we don’t want to put off the next_start indefinitely or produce timestamps so large they end up out of range, we cap this at 5 * schedule_interval. We also do not consider more than 20 consecutive failures, so if the number of consecutive failures is higher, we multiply by 20 instead.
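To make the rule above concrete, here is a minimal Python sketch of the backoff with jitter left out. The function name `retry_delay` and the constant names are made up for illustration, and the sketch assumes the 5 * schedule_interval cap applies to the delay that is added to finish_time. The actual logic lives in TimescaleDB's C source and may differ in detail.

```python
from datetime import timedelta

# Both limits are taken from the reply above; the names are hypothetical.
MAX_FAILURES_COUNTED = 20   # failures beyond this are not counted
MAX_INTERVALS = 5           # delay is capped at 5 * schedule_interval


def retry_delay(consecutive_failures: int,
                retry_period: timedelta,
                schedule_interval: timedelta) -> timedelta:
    """Approximate delay added to finish_time after a failed run (jitter omitted)."""
    multiplier = min(consecutive_failures, MAX_FAILURES_COUNTED)
    delay = multiplier * retry_period
    # Assumption: the cap limits the delay itself, not the absolute timestamp.
    return min(delay, MAX_INTERVALS * schedule_interval)


if __name__ == "__main__":
    # Values from the question: retry_period = 5 seconds, schedule_interval = 1 day.
    for failures in (1, 2, 10, 20, 30):
        delay = retry_delay(failures, timedelta(seconds=5), timedelta(days=1))
        print(failures, delay.total_seconds())
```

With a 5-second retry period and a 1-day schedule interval, the delay in this sketch reaches 100 seconds at 20 consecutive failures and stays there. The real scheduler then applies roughly ±13% jitter on top.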

So for the given example, with a schedule_interval of 1 day and a retry period of 5 seconds, the gap between successive start_times would grow by about 5 seconds with each additional consecutive failure. After 20 consecutive failures the gap stops growing (20 * 5 seconds = 100 seconds), and any remaining differences are smaller and only due to jitter.
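As a cross-check against the log in the question, the following sketch computes the gaps between successive start times and expresses them in retry periods. The helper name `gaps_in_seconds` is hypothetical; the timestamps are copied from the error log above, with the time zone offset dropped.

```python
from datetime import datetime

# Start times copied from the job error log in the question.
START_TIMES = [
    "2023-04-21 16:19:05.626",
    "2023-04-21 16:19:55.915",
    "2023-04-21 16:20:53.590",
    "2023-04-21 16:21:59.622",
    "2023-04-21 16:23:15.653",
]
RETRY_PERIOD_SECONDS = 5.0  # retry_period set via alter_job in the question


def gaps_in_seconds(stamps):
    """Return the seconds elapsed between each pair of successive timestamps."""
    parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    for gap in gaps_in_seconds(START_TIMES):
        periods = gap / RETRY_PERIOD_SECONDS
        print(f"gap {gap:.3f} s, about {periods:.1f} retry periods")
```

The gaps come out at roughly 50, 58, 66 and 76 seconds, that is about 10 to 15 retry periods. This would be consistent with a job that had already failed that many times in a row, allowing for jitter, although the log alone does not prove the exact failure count.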

Additionally, if a job runs on a fixed schedule and the next start calculated as above would fall after the next scheduled execution, the job is executed at the next scheduled slot instead, not after it.
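The fixed-schedule rule amounts to taking the earlier of two timestamps. Here is a small Python sketch under that reading; the function name and the example dates are invented for illustration and are not taken from the TimescaleDB source.

```python
from datetime import datetime, timedelta


def next_start_fixed_schedule(finish_time: datetime,
                              delay: timedelta,
                              next_scheduled_slot: datetime) -> datetime:
    """Pick the earlier of the retry time and the next scheduled slot."""
    retry_start = finish_time + delay
    return min(retry_start, next_scheduled_slot)


if __name__ == "__main__":
    # Hypothetical daily job with a slot at 07:00 that fails 30 seconds earlier.
    finish = datetime(2023, 4, 22, 6, 59, 30)
    slot = datetime(2023, 4, 22, 7, 0, 0)
    # A 100-second retry delay would overshoot the slot, so the slot wins.
    print(next_start_fixed_schedule(finish, timedelta(seconds=100), slot))
    # A 10-second retry delay lands before the slot, so the retry time wins.
    print(next_start_fixed_schedule(finish, timedelta(seconds=10), slot))
```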

(In case you are interested, this is the relevant part of the source: timescaledb/job_stat.c at main · timescale/timescaledb · GitHub)