ERROR: portal snapshots (1) did not account for all active snapshots (2)

I have the following procedure:

CREATE OR REPLACE PROCEDURE delete_events(job_id int, config jsonb)
LANGUAGE plpgsql
AS
$$
DECLARE
  l_rec record;
  decompress_sql text;
  policy_id integer;
  delete_sql text;
  run_compression_sql text;
BEGIN
  FOR l_rec IN SELECT table_schema, TABLE_NAME
               FROM information_schema.tables
               WHERE TABLE_NAME = 'log_win'
                 AND table_schema LIKE '%org_%'
  LOOP
    EXECUTE format('SELECT job_id 
                FROM timescaledb_information.jobs 
                WHERE proc_name = ''policy_compression'' 
                  AND hypertable_schema = %L  
                  AND hypertable_name = ''log_win''', l_rec.table_schema)
    INTO policy_id;

    decompress_sql := format('SELECT public.decompress_chunk(c, true) 
                              FROM public.show_chunks(''%I.log_win'') c', l_rec.table_schema);
    EXECUTE decompress_sql;

    delete_sql := format('WITH summary AS (
        SELECT time, device_id, ROW_NUMBER() OVER (PARTITION BY device_id
                                                ORDER BY time DESC) AS rank
        FROM %I.log_win
        JOIN %I.device USING (device_id)
    )
    DELETE FROM %I.log_win 
    USING summary
    WHERE summary.device_id = log_win.device_id 
      AND summary.rank = 2000 
      AND log_win.time < summary.time', l_rec.table_schema, l_rec.table_schema, l_rec.table_schema);
    EXECUTE delete_sql;

    run_compression_sql := format('CALL public.run_job(%L)', policy_id);
    EXECUTE run_compression_sql;
  END LOOP;
END
$$;

This procedure runs through all my schemas, decompresses the log_win table, trims it, and then calls a compression policy job at the end that compresses the log_win tables again.
When I run this procedure I get the following error:

ERROR:  portal snapshots (1) did not account for all active snapshots (2)
CONTEXT:  PL/pgSQL function _timescaledb_internal.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 45 at COMMIT
SQL statement "CALL _timescaledb_internal.policy_compression_execute(
    job_id, htid, lag_value::INTERVAL,
    maxchunks, verbose_log, recompress_enabled
  )"
PL/pgSQL function _timescaledb_internal.policy_compression(integer,jsonb) line 51 at CALL
SQL statement "CALL public.run_job('1085')"
PL/pgSQL function delete_events(integer,jsonb) line 37 at EXECUTE
SQL state: XX000

I have no clue what it means, but the records are never trimmed… All I know is that it has something to do with the last step:

run_compression_sql := format('CALL public.run_job(%L)', policy_id);
EXECUTE run_compression_sql;

When I remove it, everything works fine. Also, it is only job_id 1085 that fails; it can run jobs for other schemas on identical tables without any issues.

How do I solve this?

Hey @YBenarab, thanks for the follow-up.

I’ve honestly never seen that error either, so it will take some time to see what I can find out. However, I wanted to provide two bits of feedback in case they help in the near term.

Consider using data retention and continuous aggregates instead of deleting based on row count
For your benefit, I wanted to provide some feedback on the current solution you’re working towards. It feels like you’re fighting against the design intent of some TimescaleDB features, specifically around how time-series data is normally managed. Having to decompress each table to remove rows based on a total count, just to save space, points to an architectural design challenge. While you’re absolutely welcome to do what you need for your application (we certainly provide the tools to decompress and recompress chunks), the functionality isn’t really being used as intended.

As a time-series database, most of our tooling to manage database size and scope is built around time, not row count. Chunks have time boundaries, which makes it much easier to reason about data retention policies (data older than X) than about row-count policies.
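
To make that concrete, here is a minimal sketch of a time-based retention policy; the schema name and the 30-day window are placeholders, and it assumes you can express how much raw history you need as a time interval:

    -- Placeholder schema and interval: drop chunks whose data is entirely
    -- older than the window, instead of trimming by row count.
    SELECT public.add_retention_policy('org_123.log_win', drop_after => INTERVAL '30 days');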

Essentially, doing all of this work each time the UDA runs is a lot of effort that doesn’t really use the TimescaleDB feature set as designed (like traditional data retention), and you’re likely to run into other usability issues as the database grows.

I wonder if there’s a solution to your underlying need that could use continuous aggregates to store those unique messages per device and then drop older chunks over time.
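
If that direction is interesting, a rough sketch of the idea might look like the following; the view name, bucket width, and aggregated columns are my assumptions, since all I know about log_win is that it has time and device_id columns:

    -- Rough sketch only: summarize per-device activity into a continuous
    -- aggregate, then let a retention policy (like the one above) drop the
    -- raw log_win chunks once they age out.
    CREATE MATERIALIZED VIEW org_123.log_win_daily
    WITH (timescaledb.continuous) AS
    SELECT device_id,
           time_bucket(INTERVAL '1 day', time) AS bucket,
           count(*) AS events
    FROM org_123.log_win
    GROUP BY device_id, bucket;

The summarized history stays queryable in the aggregate even after the raw chunks are dropped, which is usually the trade-off people are after with this pattern.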

An alternative to re-compress chunks
If you need to stick with this approach for your application, you could try changing the last part to use compress_chunk directly, similar to how you used decompress_chunk at the beginning of the stored procedure. Although calling the compression job should do essentially the same thing, my guess is that the policy engine is still finishing the compression on one of the other tables and you’re getting caught up in that.

So, change the end to something like:

    run_compression_sql := format('SELECT public.compress_chunk(c, true)
                                   FROM public.show_chunks(''%I.log_win'', older_than => ''{insert interval here}'') c', l_rec.table_schema);
    EXECUTE run_compression_sql;

(Verify I copied the parameter and table names correctly.)

Note that this will make the script run more slowly because the compression happens synchronously in the loop, rather than kicking off the policy job and moving on to the next table.
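
If it would help to confirm that the chunks actually end up compressed again after the procedure runs, something like this (schema name is a placeholder again) shows the per-chunk compression status:

    -- Placeholder schema: list compression status for each chunk of log_win.
    SELECT chunk_schema, chunk_name, compression_status
    FROM public.chunk_compression_stats('org_123.log_win');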