Frequent Replication Lag Issue

Hello,

In my production setup I have been facing replication lag on one or both of the replica nodes roughly every 10 days. I know this is a common issue in HA database setups; it used to occur about once a month, but this time it has been occurring much more frequently.

Installation Method: Docker (timescale/timescaledb-ha:pg13.7-ts-2.7.1-latest)
OS: Ubuntu 22.04.1 (inside docker container)
PostgreSQL Version: 13.8
TimescaleDB Version: 2.7.1
Setup: 3-node Patroni HA cluster (PostgreSQL + TimescaleDB) with ETCD and PgPool, deployed on Docker Swarm.
Patroni Version: 2.1.4
ETCD Version: 3.5.0

One solution I tried: when I saw the "leader failed to send out heartbeat on time" log in ETCD, I thought the default heartbeat-interval and election-timeout values were causing the problem, so I changed heartbeat-interval to 500ms and election-timeout to 3000ms. But I still kept getting the same "leader failed to send out heartbeat on time" warnings.
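Since the warning itself says the leader is likely overloaded by a slow disk, it may be worth pulling etcd's own disk-latency metrics rather than tuning the timeouts further. Below is a minimal sketch (the metrics URL is an assumption based on the service name in the logs; adjust it to however etcd is reachable inside the swarm network) that dumps the WAL fsync and backend commit latency histograms:

# Sketch: dump etcd's disk-latency histograms to check whether slow fsync is
# behind the "leader failed to send out heartbeat on time" warnings.
# The URL below is an assumption; adjust host/port for your environment.
import urllib.request

ETCD_METRICS_URL = "http://etcd1:2379/metrics"

WATCHED = (
    "etcd_disk_wal_fsync_duration_seconds",       # WAL fsync latency histogram
    "etcd_disk_backend_commit_duration_seconds",  # backend commit latency histogram
)

with urllib.request.urlopen(ETCD_METRICS_URL, timeout=5) as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith(WATCHED):
            print(line)

If the higher latency buckets keep growing over time, the heartbeat warnings are a symptom of slow storage, and raising heartbeat-interval and election-timeout will not make them go away.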

Could this be an issue in TimescaleDB? Earlier, whenever I faced replication lag in the Patroni cluster, the timeline used to increase, but this time the timeline did not change when the replication lag occurred; the only thing logged in the TimescaleDB logs was the configuration reload.
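To check whether a timeline switch really happened, I compare the timeline each node reports; a rough sketch of that check (hostnames and credentials are placeholders, not my actual values):

# Sketch: print the timeline ID each node is currently on, to confirm whether
# a failover (timeline switch) actually took place.
# Hostnames and credentials are placeholders -- adjust to the real environment.
import psycopg2  # assumes psycopg2/psycopg2-binary is installed

NODES = ["timescale-1", "timescale-2", "timescale-3"]  # placeholder hostnames

for host in NODES:
    with psycopg2.connect(host=host, port=5432, user="postgres",
                          dbname="postgres") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT timeline_id, redo_lsn FROM pg_control_checkpoint()")
            timeline_id, redo_lsn = cur.fetchone()
            print(f"{host}: timeline={timeline_id} redo_lsn={redo_lsn}")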

Please help me identify the root cause behind this frequent replication lag.

I am sharing the logs from each service involved in the database HA setup, ordered by the time at which the unusual entries occurred.

Node2(Leader): ETCD Log::

{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"634888d6d9dce83c","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.187007ms"}
{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"1ae8e1aa55b8dfeb","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.348601ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.565Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.271107301s","rtt":"1.50669ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.614Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.270118824s","rtt":"4.186608ms"}
{"level":"warn","ts":"2023-09-18T20:38:40.566Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"5.424953406s","rtt":"1.706893ms"}

Node1(Replica): TimescaleDB Log::

2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=127.0.0.1 port=53422
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-3] postgres@[unknown],app=[unknown] [00000] LOG: disconnection: session time: 0:00:00.003 user=postgres database= host=127.0.0.1 port=53422
server signaled
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-9] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-10] @,app= [00000] LOG: parameter "primary_conninfo" changed to "user=postgres passfile=/home/postgres/.pgpass.patroni host=timescale-2 port=5432 sslmode=prefer application_name=timescaledb-1 gssencmode=prefer channel_binding=prefer"
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-11] @,app= [00000] LOG: parameter "primary_slot_name" changed to "timescaledb_1"
2023-09-18 20:41:53 UTC [3147952]: [6508b611.3008b0-1] @,app= [XX000] FATAL: could not start WAL streaming: ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:41:55 UTC [3145823]: [6508af31.30005f-3] tsdb@tsdb,app=[unknown] [00000] LOG: disconnection: session time: 0:29:21.936 user=tsdb database=tsdb host=...** port=37952
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=56024
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-2] tsdb@tsdb,app=[unknown] [00000] LOG: connection authorized: user=tsdb database=tsdb

Node1(Replica): PATRONI Log::

2023-09-18 20:42:00,449 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:10,465 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:20,469 WARNING: Master (timescaledb-2) is still alive

Node2(Leader): TimescaleDB Log::

2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=33918
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-3] postgres@[unknown],app=timescaledb-1 [42704] ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-4] postgres@[unknown],app=timescaledb-1 [42704] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.006 user=postgres database= host=...** port=33918

Node1(Replica): ETCD Log::

{"level":"warn","ts":"2023-09-18T21:35:32.488Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"4.388946076s","expected-duration":"1s"}
{"level":"warn","ts":"2023-09-18T22:35:31.521Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"3.444079168s","expected-duration":"1s"}
{"level":"info","ts":"2023-09-19T00:09:15.101Z","caller":"etcdserver/server.go:1368","msg":"triggering snapshot","local-member-id":"634888d6d9dce83c","local-member-applied-index":43900439,"local-member-snapshot-index":43800438,"local-member-snapshot-count":100000}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":43900439}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":43895439}
{"level":"info","ts":"2023-09-19T00:09:28.269Z","caller":"fileutil/purge.go:77","msg":"purged","path":"data.etcd/member/snap/0000000000000044-0000000002963cf2.snap"}

Node2(Leader): PATRONI Log::

2023-09-18 21:39:48,328 ERROR: Request to server http://etcd1:2379 failed: MaxRetryError('HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))')
2023-09-18 21:39:48,328 ERROR:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 211, in _do_http_request
    response = request_executor(method, base_uri + path, **kwargs)
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 566, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 659, in touch_member
    return self._client.set(self.member_path, data, None if permanent else self._ttl)
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 721, in set
    return self.write(key, value, ttl=ttl)
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 256, in api_execute
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 230, in _do_http_request
    raise etcd.EtcdException('{0} {1} request failed'.format(method, path))
etcd.EtcdException: PUT /v2/keys/service/timescaledb/members/timescaledb-2 request failed
2023-09-18 21:40:05,153 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:15,004 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:25,015 ERROR: Failed to drop replication slot 'timescaledb_3'

Node1(Replica): TimescaleDB Log::

2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-1] @,app= [00000] LOG: started streaming WAL from primary at 529/9D000000 on timeline 50
2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000032000005290000009D has already been removed

Node2(Leader): TimescaleDB Log::

2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=35148
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-3] postgres@[unknown],app=timescaledb-1 [58P01] ERROR: requested WAL segment 00000032000005290000009D has already been removed
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-4] postgres@[unknown],app=timescaledb-1 [58P01] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.009 user=postgres database= host=...* port=35148

--
Thanks & Regards

Please, can anyone help me with this issue? It is urgent; it is happening in production and I am not able to identify the root cause.


Thanks

Hi Aviral, sorry for not being able to help. Network issues can cause a lot of replication issues. Are you monitoring the network? Have you tried to measure whether there is a heavy workload on it?

Hi @jonatasdp,
This is a very common issue in HA database setups, and I have been facing it for a long time. Whenever a replication issue occurred, my first step was to reinitialise the standby cluster member and then check with the network admin for any network issue. But the network admin is 100% sure that there is no issue on the network side.
I also did not experience any heavy workload.

I'd recommend setting up telemetry on the machines involved so you can identify the issue. It would at least shed some light in the near future.
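As a very rough sketch of what that telemetry could start with (the connection settings below are placeholders), a small script that polls pg_stat_replication on the current leader once a minute and records the byte lag per replica would already show when and how quickly the lag builds up:

# Sketch: poll pg_stat_replication on the current leader every minute and log
# the replay lag in bytes per replica. Connection settings are placeholders.
import time
import psycopg2

LAG_SQL = """
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
"""

while True:
    with psycopg2.connect(host="timescale-2", port=5432, user="postgres",
                          dbname="postgres") as conn:
        with conn.cursor() as cur:
            cur.execute(LAG_SQL)
            for name, state, lag_bytes in cur.fetchall():
                print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {name} "
                      f"state={state} replay_lag_bytes={lag_bytes}")
    time.sleep(60)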

Hi @jonatasdp,

I have noticed that since this replication lag issue started, the configuration files are being reloaded after a SIGHUP signal is received at around 19:00 UTC almost every day.

When this happens, replication lag occurs. Sometimes the lag grows to 1-2 GB and then returns to normal within 3-4 hours, but sometimes it continues indefinitely and I have to reinitialise the standby cluster members.
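Before reinitialising, a quick check on the leader shows whether it has already discarded the WAL the standby's slot still needs (PostgreSQL 13 exposes this as wal_status in pg_replication_slots). A rough sketch, with placeholder connection settings:

# Sketch: inspect the physical replication slots on the leader. On PostgreSQL 13
# wal_status tells you whether the WAL a slot still needs has been removed
# ('lost' means the standby can no longer catch up via streaming and must be
# reinitialised). Connection settings are placeholders.
import psycopg2

SLOT_SQL = """
SELECT slot_name, active, restart_lsn, wal_status, safe_wal_size
FROM pg_replication_slots
WHERE slot_type = 'physical'
"""

with psycopg2.connect(host="timescale-2", port=5432, user="postgres",
                      dbname="postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(SLOT_SQL)
        for row in cur.fetchall():
            print(row)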

I have asked the network team to check for any kind of network latency and the storage team to check the disk I/O reports, but there is no conclusion for this issue; everything seems normal.

Logs:
2023-10-10 19:10:30 UTC [1010451]: [6525a1a6.f6b13-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=127.0.0.1 port=50918
2023-10-10 19:10:30 UTC [1010451]: [6525a1a6.f6b13-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres
2023-10-10 19:10:30 UTC [1010451]: [6525a1a6.f6b13-3] postgres@[unknown],app=[unknown] [00000] LOG: disconnection: session time: 0:00:00.004 user=postgres database= host=127.0.0.1 port=50918
server signaled
2023-10-10 19:10:30 UTC [871464]: [6523c6a5.d4c28-33] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-10-10 19:10:30 UTC [871464]: [6523c6a5.d4c28-34] @,app= [00000] LOG: parameter "primary_conninfo" changed to "user=postgres passfile=/####/########/#######.####### host=#########-####-# port=5432 sslmode=###### application_name=###########-# gssencmode=###### channel_binding=######"
2023-10-10 19:10:30 UTC [871464]: [6523c6a5.d4c28-35] @,app= [00000] LOG: parameter "primary_slot_name" changed to "timescaledb_1"
2023-10-10 19:10:30 UTC [1010455]: [6525a1a6.f6b17-1] @,app= [00000] LOG: started streaming WAL from primary at 599/5F000000 on timeline 50
server signaled
2023-10-10 19:10:35 UTC [871464]: [6523c6a5.d4c28-36] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-10-10 19:10:35 UTC [871464]: [6523c6a5.d4c28-37] @,app= [00000] LOG: parameter "primary_conninfo" removed from configuration file, reset to default
2023-10-10 19:10:35 UTC [871464]: [6523c6a5.d4c28-38] @,app= [00000] LOG: parameter "primary_slot_name" removed from configuration file, reset to default
2023-10-10 19:10:35 UTC [871466]: [6523c6a5.d4c2a-15] @,app= [00000] LOG: WAL receiver process shutdown requested
2023-10-10 19:10:35 UTC [1010455]: [6525a1a6.f6b17-2] @,app= [57P01] FATAL: terminating walreceiver process due to administrator command
2023-10-10 19:10:35 UTC [871466]: [6523c6a5.d4c2a-16] @,app= [00000] LOG: invalid resource manager ID 143 at 599/60134470

This only occurs on the standby nodes; there is nothing unusual or related to this in the leader node's logs.

Hi Aviral, I'm not an expert on replicas; just trying to help as much as I can.

Is there any difference in the machines' hardware setup? Maybe the replica is a smaller machine and is struggling to keep up with the same I/O?