Chunk not found TimescaleDB

jamessalzman · July 21, 2022, 6:18pm

Hi All,

I ran into an issue where I get this error:

timescale=# select * from time_test;
ERROR:  [dn1]: chunk id 1 not found

I was wondering if there is any way to recover from this issue?

As a disclaimer - I have been doing automation related to moving/copying chunks between data nodes.

I am not sure if this is related, but a few chunks seem to have bad metadata on the access node, for example:

 hypertable_schema | hypertable_name |     chunk_schema      |       chunk_name        | desired_num_replicas | num_replicas | replica_nodes | non_replica_nodes
-------------------+-----------------+-----------------------+-------------------------+----------------------+--------------+---------------+-------------------
 public            | time_test       | _timescaledb_internal | _dist_hyper_1_1_chunk   |                    2 |            2 | {dn2,dn3}     | {dn1,dn4}
 public            | time_test       | _timescaledb_internal | _dist_hyper_1_2_chunk   |                    2 |            2 | {dn1,dn3}     | {dn2,dn4}

Here you can see that dn1 holds a replica of _dist_hyper_1_2_chunk. If I try to copy that chunk from dn1 to another node, I get this:

timescale=# CALL timescaledb_experimental.copy_chunk('_timescaledb_internal._dist_hyper_1_2_chunk', 'dn1', 'dn4');
ERROR:  [dn1]: relation "_timescaledb_internal._dist_hyper_1_2_chunk" does not exist
DETAIL:  Chunk copy operation id: ts_copy_432_2.

So if I am not mistaken, the access node thinks this chunk is on dn1, but in fact it is not.
Is there any way to correct the metadata on the access node for this chunk? I am not sure which entry in the _timescaledb_catalog.chunk_copy_operation table corresponds to it, so I am not sure whether I can use timescaledb_experimental.cleanup_copy_chunk_operation() here.

Thanks!

jfj · July 22, 2022, 6:58am

Hi @jamessalzman, thanks for posting!
Could you post more details on the automation, such as the commands you were using prior to this issue?
move_chunk may be more suitable, depending on what you are trying to achieve.

dmitry · July 22, 2022, 7:37am

Hey James,

I was trying to think how this could happen. First of all, could you please show the content of the _timescaledb_catalog.chunk_copy_operation table? This table is used internally to keep the state of each copy/move chunk operation; we are interested here in the non-completed copy operations.

If a copy operation fails, we expect cleanup_copy_chunk_operation() to always be executed before any other copy command is run on the same chunk. For ease of use, you can specify your own copy operation ID in the copy/move chunk command and later pass it to the cleanup function.

Thank you

jamessalzman · July 22, 2022, 4:50pm

Hi @jfj,

I was just using the copy_chunk command as an example to show that there is bad metadata on the access node. The access node thinks dn1 has a copy of the chunk when it does not, so the command fails.
I am using both move_chunk and copy_chunk in my automation.

jamessalzman · July 22, 2022, 10:00pm

Hi @dmitry,

Here is where I think the issues happened: the rows whose completed_stage is init.

timescale=# select * from _timescaledb_catalog.chunk_copy_operation where completed_stage != 'complete' order by time_start;
  operation_id   | backend_pid |        completed_stage        |          time_start           | chunk_id | compress_chunk_name | source_node_name | dest_node_name | delete_on_source_node
-----------------+-------------+-------------------------------+-------------------------------+----------+---------------------+------------------+----------------+-----------------------
 ts_copy_174_11  |       29045 | sync_start                    | 2022-07-21 00:56:19.721556-04 |       11 |                     | dn3              | dn2            | t
 ts_copy_428_2   |       29721 | create_empty_compressed_chunk | 2022-07-21 01:48:55.187411-04 |        2 |                     | dn1              | dn2            | f
 ts_copy_429_2   |       29721 | init                          | 2022-07-21 01:49:25.136378-04 |        2 |                     | dn1              | dn2            | t
 ts_copy_430_2   |       29721 | create_empty_compressed_chunk | 2022-07-21 01:49:53.583007-04 |        2 |                     | dn1              | dn2            | t
 ts_copy_431_2   |        6918 | init                          | 2022-07-21 14:13:47.310046-04 |        2 |                     | dn1              | dn2            | f
 ts_copy_432_2   |        6918 | create_empty_compressed_chunk | 2022-07-21 14:13:57.817626-04 |        2 |                     | dn1              | dn4            | f
 ts_copy_439_8   |       15256 | create_empty_compressed_chunk | 2022-07-22 01:10:15.007062-04 |        8 |                     | dn2              | dn3            | f
 ts_copy_440_9   |       15256 | create_empty_compressed_chunk | 2022-07-22 01:10:15.038383-04 |        9 |                     | dn2              | dn3            | f
 ts_copy_445_17  |       15256 | create_empty_compressed_chunk | 2022-07-22 01:10:26.32409-04  |       17 |                     | dn4              | dn2            | f
 ts_copy_446_19  |       15256 | create_empty_compressed_chunk | 2022-07-22 01:10:26.354859-04 |       19 |                     | dn4              | dn2            | f

I continued with my automation, probably without noticing that the error had occurred. Can you provide an example of supplying the operation ID to the function? I did not see that in the documentation. It would be helpful so that I can automatically call cleanup_<copy/move>_chunk_operation() on any failure.

Thank you

Nikhil · July 25, 2022, 7:04am

Hi @jamessalzman, the documentation provides an example of how to call the cleanup procedure.

dmitry · July 25, 2022, 8:50am

Hi @jamessalzman,

It looks like there are some operations in the list that were not completed. The problem with those operations is that they can consume vital resources (such as replication slots), so they need to be cleaned up.

I would assume that the reason for the failure could be that the system ran out of replication slots or background workers, so checking the PostgreSQL log would be a good idea here. Unfortunately, not all errors are returned to the user.

You are right that the documentation does not cover the copy/move chunk operation ID. We haven’t updated it yet, since this functionality was introduced recently.

You can use it by providing an additional argument to the function:

copy_chunk(operation_id => 'unique_id')
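For automation, the idea could be sketched roughly as follows. This is only an illustration, not TimescaleDB's own tooling: the helper names, the ID naming scheme, and the `execute` callable (assumed to run one statement on the access node) are all hypothetical. The first three arguments are passed positionally, as in the copy_chunk call earlier in this thread, and the operation ID is passed as the named argument shown above.

```python
from typing import Callable

SCHEMA = "timescaledb_experimental"


def quote_literal(value: str) -> str:
    """Render a Python string as a SQL string literal (single quotes doubled)."""
    return "'" + value.replace("'", "''") + "'"


def make_operation_id(run: int, chunk_id: int) -> str:
    """Hypothetical naming scheme: one predictable ID per automation run and chunk."""
    return f"auto_copy_{run}_{chunk_id}"


def copy_chunk_sql(chunk: str, source: str, dest: str, op_id: str) -> str:
    """Build a copy_chunk call that passes our own operation ID as a named argument."""
    args = ", ".join(quote_literal(v) for v in (chunk, source, dest))
    return f"CALL {SCHEMA}.copy_chunk({args}, operation_id => {quote_literal(op_id)});"


def cleanup_sql(op_id: str) -> str:
    """Build the cleanup call for a copy operation identified by op_id."""
    return f"CALL {SCHEMA}.cleanup_copy_chunk_operation({quote_literal(op_id)});"


def copy_with_cleanup(
    execute: Callable[[str], None],  # assumption: runs one SQL statement on the access node
    chunk: str,
    source: str,
    dest: str,
    op_id: str,
) -> None:
    """Run a chunk copy; if it fails, clean up under the same ID and re-raise."""
    try:
        execute(copy_chunk_sql(chunk, source, dest, op_id))
    except Exception:
        # The copy failed: clean up before anything else touches this chunk,
        # then surface the original error. If the cleanup call itself fails,
        # its error propagates instead.
        execute(cleanup_sql(op_id))
        raise
```

Because the ID is chosen by the caller, the automation never has to look it up in _timescaledb_catalog.chunk_copy_operation after a failure. The same pattern would presumably apply to move_chunk, which is not covered in this sketch.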

jamessalzman · July 25, 2022, 12:57pm

I did implement this in my automation and I have not encountered this issue again, thanks!