Retry period of job is longer than defined

LuxCore · April 21, 2023, 1:31pm

Hi! After changing the retry period of a job, the actual delay between retries is longer than the value I defined.

Assume the retry period is 5 seconds. In practice the job is retried after about 50 seconds. If the retry period is 10 seconds, the actual delay can be 30 seconds. I don't understand the dependency.

Below is an example:

UDA

CREATE OR REPLACE PROCEDURE test.run_job_test_raise_exception(job_id int, config jsonb)
LANGUAGE plpgsql
AS $$
BEGIN
	-- Fail on purpose (division by zero) so that every run triggers a retry.
	PERFORM 1 / 0;
END;
$$;

SELECT api_bet.add_job(
	proc => 'test.run_job_test_raise_exception',
	schedule_interval => INTERVAL '1 day',
	initial_start => TIMESTAMP '2023-04-21 07:00:00');

SELECT api_bet.alter_job(
	job_id => (
		SELECT job_id FROM timescaledb_information.jobs
		WHERE proc_schema = 'test' AND proc_name = 'run_job_test_raise_exception'
	),
	next_start => clock_timestamp() + INTERVAL '5 seconds',
	retry_period => INTERVAL '5 seconds');

jobs

job_id|application_name                          |schedule_interval|max_runtime|max_retries|retry_period|proc_schema          |proc_name                   
------+------------------------------------------+-----------------+-----------+-----------+------------+---------------------+----------------------------
  1090|User-Defined Action [1090]                |            1 day|   00:00:00|         -1|    00:00:05|test                 |run_job_test_raise_exception

job_stats

job_id|last_run_started_at          |last_successful_finish       |last_run_status|job_status|last_run_duration|next_start
------+-----------------------------+-----------------------------+---------------+----------+-----------------+----------
  1090|2023-04-21 16:19:55.915 +0300|                    -infinity|Failed         |Scheduled |  00:00:00.015897|2023-04-21 16:20:53.587 +0300

job_errors

job_id|proc_schema|proc_name                   |pid    |start_time                   |finish_time                  |sqlerrcode|err_message
------+-----------+----------------------------+-------+-----------------------------+-----------------------------+----------+-----------
  1090|test       |run_job_test_raise_exception|3817418|2023-04-21 16:19:05.626 +0300|2023-04-21 16:19:05.641 +0300|22012     |division by zero
  1090|test       |run_job_test_raise_exception|3817480|2023-04-21 16:19:55.915 +0300|2023-04-21 16:19:55.931 +0300|22012     |division by zero
  1090|test       |run_job_test_raise_exception|3846428|2023-04-21 16:20:53.590 +0300|2023-04-21 16:20:53.606 +0300|22012     |division by zero
  1090|test       |run_job_test_raise_exception|3847002|2023-04-21 16:21:59.622 +0300|2023-04-21 16:21:59.637 +0300|22012     |division by zero
  1090|test       |run_job_test_raise_exception|3847744|2023-04-21 16:23:15.653 +0300|2023-04-21 16:23:15.668 +0300|22012     |division by zero

konskov · April 28, 2023, 1:58pm

Hello @LuxCore!

Admittedly the situation is confusing as this area is not documented, so that’s something we need to update.
In the meantime I hope this reply can help clarify some things.

So when a job run results in a runtime failure, as in the example above, the retry_period parameter is taken into account, and the next start after a failed execution is calculated as follows:

next_start = finish_time + consecutive_failures * retry_period, plus some jitter (± 13%) to avoid “thundering herds”

However, we don't want to put off the next_start indefinitely or produce timestamps so large that they end up out of range, so we cap the delay at 5 * schedule_interval. We also do not consider more than 20 consecutive failures: if the number of consecutive failures is higher, we multiply retry_period by 20.
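As a rough illustration of the rule described above, here is a minimal Python sketch of the nominal delay before jitter. It is not the actual implementation (which is C code in job_stat.c); the function and constant names are hypothetical, and the order in which the cap and the jitter are applied, as well as the assumption that jitter scales the delay proportionally, are guesses made only for illustration.

```python
from datetime import timedelta

MAX_CONSECUTIVE_FAILURES = 20  # failures beyond this are not counted (per the reply)
MAX_INTERVALS = 5              # delay is capped at 5 * schedule_interval (per the reply)
JITTER = 0.13                  # +/- 13% jitter (per the reply)


def approx_retry_delay(consecutive_failures, retry_period, schedule_interval):
    """Nominal delay added to finish_time, before jitter (hypothetical helper)."""
    failures = min(consecutive_failures, MAX_CONSECUTIVE_FAILURES)
    delay = failures * retry_period
    return min(delay, MAX_INTERVALS * schedule_interval)


def jitter_bounds(delay):
    """Smallest and largest delay, ASSUMING jitter scales the delay by +/- 13%."""
    return delay * (1 - JITTER), delay * (1 + JITTER)


if __name__ == "__main__":
    retry = timedelta(seconds=5)
    schedule = timedelta(days=1)
    for n in (1, 2, 10, 20, 25):
        # The nominal delay grows with each failure and stops growing at 20.
        print(n, approx_retry_delay(n, retry, schedule))
```

With a retry_period of 5 seconds and a schedule_interval of 1 day, this sketch gives a nominal delay of 50 seconds after 10 consecutive failures and 100 seconds from the 20th failure onward; the 5 * schedule_interval cap only matters when retry_period is large relative to the schedule.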

So for the given example, with a schedule_interval of 1 day and a retry_period of 5 seconds, the delay before each retry would grow by about 5 seconds with every consecutive failure. After 20 consecutive failures it would stop growing and stay at around 100 seconds (20 * 5 seconds), with differences between retries due only to jitter.
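The job_errors rows posted in the question can be checked against this. The sketch below computes the gap between each run's finish_time and the next run's start_time; the timestamps are copied from that output (the +0300 offset is dropped because it is the same on every row), and the helper name is made up for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (start_time, finish_time) pairs copied from the job_errors output in the question
RUNS = [
    ("2023-04-21 16:19:05.626", "2023-04-21 16:19:05.641"),
    ("2023-04-21 16:19:55.915", "2023-04-21 16:19:55.931"),
    ("2023-04-21 16:20:53.590", "2023-04-21 16:20:53.606"),
    ("2023-04-21 16:21:59.622", "2023-04-21 16:21:59.637"),
    ("2023-04-21 16:23:15.653", "2023-04-21 16:23:15.668"),
]


def gaps_seconds(runs):
    """Seconds between each run's finish_time and the next run's start_time."""
    parsed = [(datetime.strptime(s, FMT), datetime.strptime(f, FMT)) for s, f in runs]
    return [
        round((next_start - prev_finish).total_seconds(), 3)
        for (_, prev_finish), (next_start, _) in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    print(gaps_seconds(RUNS))
```

The gaps come out at roughly 50, 58, 66 and 76 seconds, so each one is many multiples of the 5-second retry_period and each is longer than the one before. That is what one would expect from a delay that is multiplied by the number of consecutive failures, although jitter means the steps are not exact.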

Additionally, if a job runs on a fixed schedule and the next start calculated as above falls after the next scheduled execution, the job is executed again at the next scheduled slot rather than later.
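The fixed-schedule rule amounts to taking the earlier of two timestamps. A minimal sketch, with a hypothetical helper name and made-up example times (only the 07:00:00 slot comes from the initial_start in the question):

```python
from datetime import datetime, timedelta


def clamp_to_next_slot(retry_start, next_scheduled_slot):
    """Return the earlier of the backoff-based start and the next scheduled slot."""
    return min(retry_start, next_scheduled_slot)


if __name__ == "__main__":
    # Hypothetical failure shortly before the daily 07:00:00 slot.
    finish_time = datetime(2023, 4, 22, 6, 59, 0)
    retry_start = finish_time + timedelta(seconds=100)  # nominal backoff delay
    next_slot = datetime(2023, 4, 22, 7, 0, 0)
    # The backoff would land at 07:00:40, so the scheduled slot wins.
    print(clamp_to_next_slot(retry_start, next_slot))
```

In this example the backoff alone would postpone the job to 07:00:40, past the scheduled slot, so the job starts at 07:00:00 instead.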

(In case you are interested, the relevant part of the source is job_stat.c on the main branch of the timescale/timescaledb repository on GitHub.)