Frequent Replication Lag Issue

Hello,

In my production setup I have been facing replication lag on one or both of the replica nodes roughly every 10 days. I know this is a common issue in HA database setups; it used to occur about once a month, but this time it has been occurring much more frequently.

Installation Method: Docker (timescale/timescaledb-ha:pg13.7-ts-2.7.1-latest)
OS: Ubuntu 22.04.1 (inside docker container)
PostgreSQL Version: 13.8
TimescaleDB Version: 2.7.1
Setup: 3-node Patroni HA cluster (PostgreSQL + TimescaleDB) with ETCD and PgPool, deployed on Docker Swarm.
Patroni Version: 2.1.4
ETCD Version: 3.5.0

One solution I tried: when I saw the "leader failed to send out heartbeat on time" log in ETCD, I thought the default heartbeat-interval and election-timeout values were causing the problem, so I changed heartbeat-interval to 500ms and election-timeout to 3000ms. But I still kept getting the same "leader failed to send out heartbeat on time" warnings.
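Since the warning itself says the leader is likely overloaded by a slow disk, it may be worth pulling etcd's own disk-latency metrics rather than tuning the timeouts further. Below is a minimal sketch (the metrics URL is an assumption based on the service name in the logs; adjust it to however etcd is reachable inside the swarm network) that dumps the WAL fsync and backend commit latency histograms:

# Sketch: dump etcd's disk-latency histograms to check whether slow fsync is
# behind the "leader failed to send out heartbeat on time" warnings.
# The URL below is an assumption; adjust host/port for your environment.
import urllib.request

ETCD_METRICS_URL = "http://etcd1:2379/metrics"

WATCHED = (
    "etcd_disk_wal_fsync_duration_seconds",       # WAL fsync latency histogram
    "etcd_disk_backend_commit_duration_seconds",  # backend commit latency histogram
)

with urllib.request.urlopen(ETCD_METRICS_URL, timeout=5) as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith(WATCHED):
            print(line)

If the higher latency buckets keep growing over time, the heartbeat warnings are a symptom of slow storage, and raising heartbeat-interval and election-timeout will not make them go away.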

Could this be an issue in TimescaleDB? Earlier, whenever I faced replication lag in the Patroni cluster, the timeline used to increase, but this time the timeline did not change when the replication lag occurred; the only thing logged in the TimescaleDB logs was the configuration reload.
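To check whether a timeline switch really happened, I compare the timeline each node reports; a rough sketch of that check (hostnames and credentials are placeholders, not my actual values):

# Sketch: print the timeline ID each node is currently on, to confirm whether
# a failover (timeline switch) actually took place.
# Hostnames and credentials are placeholders -- adjust to the real environment.
import psycopg2  # assumes psycopg2/psycopg2-binary is installed

NODES = ["timescale-1", "timescale-2", "timescale-3"]  # placeholder hostnames

for host in NODES:
    with psycopg2.connect(host=host, port=5432, user="postgres",
                          dbname="postgres") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT timeline_id, redo_lsn FROM pg_control_checkpoint()")
            timeline_id, redo_lsn = cur.fetchone()
            print(f"{host}: timeline={timeline_id} redo_lsn={redo_lsn}")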

Please help me identify the root cause behind this frequent replication lag.

I am sharing the logs from each service involved in the database HA setup, ordered by the time at which the unusual entries occurred.

Node2(Leader): ETCD Log::

{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"634888d6d9dce83c","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.187007ms"}
{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"1ae8e1aa55b8dfeb","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.348601ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.565Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.271107301s","rtt":"1.50669ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.614Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.270118824s","rtt":"4.186608ms"}
{"level":"warn","ts":"2023-09-18T20:38:40.566Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"5.424953406s","rtt":"1.706893ms"}

Node1(Replica): TimescaleDB Log::

2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=127.0.0.1 port=53422
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-3] postgres@[unknown],app=[unknown] [00000] LOG: disconnection: session time: 0:00:00.003 user=postgres database= host=127.0.0.1 port=53422
server signaled
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-9] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-10] @,app= [00000] LOG: parameter "primary_conninfo" changed to "user=postgres passfile=/home/postgres/.pgpass.patroni host=timescale-2 port=5432 sslmode=prefer application_name=timescaledb-1 gssencmode=prefer channel_binding=prefer"
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-11] @,app= [00000] LOG: parameter "primary_slot_name" changed to "timescaledb_1"
2023-09-18 20:41:53 UTC [3147952]: [6508b611.3008b0-1] @,app= [XX000] FATAL: could not start WAL streaming: ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:41:55 UTC [3145823]: [6508af31.30005f-3] tsdb@tsdb,app=[unknown] [00000] LOG: disconnection: session time: 0:29:21.936 user=tsdb database=tsdb host=...** port=37952
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=56024
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-2] tsdb@tsdb,app=[unknown] [00000] LOG: connection authorized: user=tsdb database=tsdb

Node1(Replica): PATRONI Log::

2023-09-18 20:42:00,449 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:10,465 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:20,469 WARNING: Master (timescaledb-2) is still alive

Node2(Leader): TimescaleDB Log::

2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=33918
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-3] postgres@[unknown],app=timescaledb-1 [42704] ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-4] postgres@[unknown],app=timescaledb-1 [42704] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.006 user=postgres database= host=...** port=33918

Node1(Replica): ETCD Log::

{"level":"warn","ts":"2023-09-18T21:35:32.488Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"4.388946076s","expected-duration":"1s"}
{"level":"warn","ts":"2023-09-18T22:35:31.521Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"3.444079168s","expected-duration":"1s"}
{"level":"info","ts":"2023-09-19T00:09:15.101Z","caller":"etcdserver/server.go:1368","msg":"triggering snapshot","local-member-id":"634888d6d9dce83c","local-member-applied-index":43900439,"local-member-snapshot-index":43800438,"local-member-snapshot-count":100000}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":43900439}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":43895439}
{"level":"info","ts":"2023-09-19T00:09:28.269Z","caller":"fileutil/purge.go:77","msg":"purged","path":"data.etcd/member/snap/0000000000000044-0000000002963cf2.snap"}

Node2(Leader): PATRONI Log::

2023-09-18 21:39:48,328 ERROR: Request to server http://etcd1:2379 failed: MaxRetryError('HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))')
2023-09-18 21:39:48,328 ERROR:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 211, in _do_http_request
    response = request_executor(method, base_uri + path, **kwargs)
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 566, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 659, in touch_member
    return self._client.set(self.member_path, data, None if permanent else self._ttl)
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 721, in set
    return self.write(key, value, ttl=ttl)
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 256, in api_execute
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 230, in _do_http_request
    raise etcd.EtcdException('{0} {1} request failed'.format(method, path))
etcd.EtcdException: PUT /v2/keys/service/timescaledb/members/timescaledb-2 request failed
2023-09-18 21:40:05,153 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:15,004 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:25,015 ERROR: Failed to drop replication slot 'timescaledb_3'

Node1(Replica): TimescaleDB Log::

2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-1] @,app= [00000] LOG: started streaming WAL from primary at 529/9D000000 on timeline 50
2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000032000005290000009D has already been removed

Node2(Leader): TimescaleDB Log::

2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=35148
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-3] postgres@[unknown],app=timescaledb-1 [58P01] ERROR: requested WAL segment 00000032000005290000009D has already been removed
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-4] postgres@[unknown],app=timescaledb-1 [58P01] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.009 user=postgres database= host=...* port=35148

--
Thanks & Regards

Please, can anyone help me with this issue? It is urgent; it is happening in production and I am not able to identify the root cause.


Thanks

Hi Aviral, sorry for not being able to help. Network issues can cause a lot of replication issues. Are you monitoring the network? Have you tried to measure whether there is a heavy workload on it?

Hi @jonatasdp,
This is a very common issue in HA database setups, and I have been facing it for a long time. Whenever a replication issue occurred, my first step was to reinitialise the standby cluster member and then check with the network admin for any network issue. But the network admin is 100% sure that there is no issue on the network side.
I also did not experience any heavy workload.

I'd recommend setting up telemetry on the machines involved so you can identify the issue. It would at least shed some light in the near future.
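As a very rough sketch of what that telemetry could start with (the connection settings below are placeholders), a small script that polls pg_stat_replication on the current leader once a minute and records the byte lag per replica would already show when and how quickly the lag builds up:

# Sketch: poll pg_stat_replication on the current leader every minute and log
# the replay lag in bytes per replica. Connection settings are placeholders.
import time
import psycopg2

LAG_SQL = """
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
"""

while True:
    with psycopg2.connect(host="timescale-2", port=5432, user="postgres",
                          dbname="postgres") as conn:
        with conn.cursor() as cur:
            cur.execute(LAG_SQL)
            for name, state, lag_bytes in cur.fetchall():
                print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {name} "
                      f"state={state} replay_lag_bytes={lag_bytes}")
    time.sleep(60)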

Hi @jonatasdp,

I have noticed that since this replication lag issue started, the configuration files are being reloaded after a SIGHUP signal is received at around 19:00 UTC almost every day.

When this happens, replication lag occurs. Sometimes the lag grows to 1-2 GB and then returns to normal within 3-4 hours, but sometimes it continues indefinitely and I have to reinitialise the standby cluster members.
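Before reinitialising, a quick check on the leader shows whether it has already discarded the WAL the standby's slot still needs (PostgreSQL 13 exposes this as wal_status in pg_replication_slots). A rough sketch, with placeholder connection settings:

# Sketch: inspect the physical replication slots on the leader. On PostgreSQL 13
# wal_status tells you whether the WAL a slot still needs has been removed
# ('lost' means the standby can no longer catch up via streaming and must be
# reinitialised). Connection settings are placeholders.
import psycopg2

SLOT_SQL = """
SELECT slot_name, active, restart_lsn, wal_status, safe_wal_size
FROM pg_replication_slots
WHERE slot_type = 'physical'
"""

with psycopg2.connect(host="timescale-2", port=5432, user="postgres",
                      dbname="postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(SLOT_SQL)
        for row in cur.fetchall():
            print(row)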

I have asked the network team to check for any kind of network latency and the storage team to check the disk I/O reports, but there is no conclusion for this issue; everything seems normal.

Logs:
2023-10-10 19:10:30 UTC [1010451]: [6525a1a6.f6b13-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=127.0.0.1 port=50918
2023-10-10 19:10:30 UTC [1010451]: [6525a1a6.f6b13-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres
2023-10-10 19:10:30 UTC [1010451]: [6525a1a6.f6b13-3] postgres@[unknown],app=[unknown] [00000] LOG: disconnection: session time: 0:00:00.004 user=postgres database= host=127.0.0.1 port=50918
server signaled
2023-10-10 19:10:30 UTC [871464]: [6523c6a5.d4c28-33] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-10-10 19:10:30 UTC [871464]: [6523c6a5.d4c28-34] @,app= [00000] LOG: parameter "primary_conninfo" changed to "user=postgres passfile=/####/########/#######.####### host=#########-####-# port=5432 sslmode=###### application_name=###########-# gssencmode=###### channel_binding=######"
2023-10-10 19:10:30 UTC [871464]: [6523c6a5.d4c28-35] @,app= [00000] LOG: parameter "primary_slot_name" changed to "timescaledb_1"
2023-10-10 19:10:30 UTC [1010455]: [6525a1a6.f6b17-1] @,app= [00000] LOG: started streaming WAL from primary at 599/5F000000 on timeline 50
server signaled
2023-10-10 19:10:35 UTC [871464]: [6523c6a5.d4c28-36] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-10-10 19:10:35 UTC [871464]: [6523c6a5.d4c28-37] @,app= [00000] LOG: parameter "primary_conninfo" removed from configuration file, reset to default
2023-10-10 19:10:35 UTC [871464]: [6523c6a5.d4c28-38] @,app= [00000] LOG: parameter "primary_slot_name" removed from configuration file, reset to default
2023-10-10 19:10:35 UTC [871466]: [6523c6a5.d4c2a-15] @,app= [00000] LOG: WAL receiver process shutdown requested
2023-10-10 19:10:35 UTC [1010455]: [6525a1a6.f6b17-2] @,app= [57P01] FATAL: terminating walreceiver process due to administrator command
2023-10-10 19:10:35 UTC [871466]: [6523c6a5.d4c2a-16] @,app= [00000] LOG: invalid resource manager ID 143 at 599/60134470

This only occurs on the standby nodes; there is nothing unusual or related to this in the leader node's logs.

Hi Aviral, I'm not an expert on replicas; just trying to help as much as I can.

Is there any difference in the machines' hardware setup? Maybe the replica is a smaller machine and is struggling to keep up with the same I/O?