Can Patroni provide high availability for TimescaleDB deployed on multiple nodes?

I have built five TimescaleDB nodes following the official steps. Node1 is the access node, and Node2 and Node3 are data nodes. Node4 and Node5 are the standbys of Node2 and Node3 respectively, replicating via asynchronous streaming replication.
I tried stopping the service on Node2 and then running pg_promote on Node4, but it didn’t work: I still can’t select or insert data through Node1.
One possible solution I saw on the official website is Patroni, which can apparently fail over automatically: when Node2 goes down, Node4 takes its place. But I’m not sure whether the access node (Node1) can write data to Node4 at that point, or whether this meets my needs. Is there a better high-availability solution?

I’m also hoping for a setup similar to the one described in this blog post:
High Availability for Your Production Environments: Introducing Database Replication in Timescale

I haven’t used Patroni for HA on the data nodes; I’ve been using the native replication (which has its own set of limitations). I have, however, used it for HA on the access node.

If you simply use Postgres streaming replication, as you noted, you have to manually go in and use pg_promote to make the other node the primary. This is the problem Patroni solves: it will manage the replication and promotion for you. The catch with Patroni is that this is all it does; it is not a gateway/load balancer. This means that in the event of a failover Patroni will properly perform the pg_promote for you, but your clients will have no idea this happened and will remain connected to the failed node.
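For context, this is roughly the manual step Patroni automates (a sketch assuming Postgres 12+, where pg_promote() is available, and that Node4 is the standby being promoted):

```sql
-- Run on the standby (e.g. Node4) after the primary (Node2) has failed.
-- Patroni performs the equivalent of this promotion automatically, along
-- with the leader election and replication reconfiguration around it.
SELECT pg_promote(wait => true, wait_seconds => 60);

-- Should now return false, i.e. the node is no longer in recovery.
SELECT pg_is_in_recovery();
```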

To solve this problem you need a load balancer like HAProxy. Your clients then connect to HAProxy, which routes the connection to whichever node Patroni says is the master. HAProxy polls an endpoint in Patroni’s REST API to determine which node is currently the master. In the event of a failover, HAProxy detects from Patroni that a new node is master and reroutes connections there. Finally, this rerouting can still cause connection issues for clients, so clients should be written to automatically retry the connection in the event of a disconnect.

As for how this would work for the data nodes:

  • You would set up Patroni to manage replication for the data nodes and their standbys.
  • You would set up an HAProxy instance and configure it to route traffic based on information from Patroni (you can use the /master endpoint). Since you are routing connections to multiple different nodes through the same host, you will need to assign a different port to each data node.
  • When you add your data nodes in Timescale, you will use the IP/hostname of the HAProxy server and the port assigned to the specific data node you are adding (see the sketch below).
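On the access node that would look something like this (a sketch; ‘haproxy.internal’, the node names, and the ports are placeholders for whatever you map in HAProxy):

```sql
-- Run on the access node (Node1). The host and ports are placeholders:
-- point each data node at the HAProxy frontend port that routes to that
-- data node's current Patroni primary.
SELECT add_data_node('dn1', host => 'haproxy.internal', port => 5433);
SELECT add_data_node('dn2', host => 'haproxy.internal', port => 5434);
```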

The Timescale access node will retry connections to data nodes, so if there is a glitch during failover it should resolve automatically.

Note that for Patroni you will also need to deploy a distributed configuration store (DCS) such as etcd or ZooKeeper.

Hi troy
First of all, thank you very much for your reply, which effectively answered my question.
I have not tried the native replication scheme, and I would like to know what its limitations are.
In my scenario, data is written to the database continuously over a period of time, and I want to lose as little of it as possible. With native replication, when a data node goes down, can data continue to be written to another data node, or do you need to manually delete the failed data node for the process to continue normally?

There are some limitations with native replication, the main one being that if a data node goes down the system becomes unusable; it will stop accepting reads and writes. Technically, I believe there are scenarios where reads can still happen, if the ‘lead’ chunk is not on the failed data node (‘lead’ chunk being an internal notion of which chunk accepts writes versus which ones are replicas). It doesn’t yet have the capability to fall back to the replica chunks for reads.
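For reference, native replication is what you get when a distributed hypertable is created with a replication factor greater than one, roughly like this (table and column names are made up):

```sql
-- Illustrative only: chunks of this distributed hypertable are written to
-- 2 data nodes, so each chunk has one 'lead' copy and one replica.
CREATE TABLE conditions (time timestamptz NOT NULL, device int, temp float);
SELECT create_distributed_hypertable('conditions', 'time',
                                     replication_factor => 2);
```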

If you want to return the cluster to service you need to remove the failed data node from the cluster. The consequence is that to bring that data node back into the cluster you must first remove all the data from it. I believe it was in 2.9.x that they introduced the alter_data_node function, which you can use to mark a data node unavailable instead of deleting it in order to return the cluster to service. The advantage is that you don’t need to completely wipe the data on the failed data node to return it to service. The problem is that this is still a manual operation that an admin must perform.
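If I’m remembering the functions right, the two manual options look roughly like this (the node name is a placeholder):

```sql
-- Pre-2.9 approach: drop the failed data node so the cluster resumes
-- service; the node must be wiped before it can be re-added.
SELECT delete_data_node('dn1', force => true);

-- 2.9+ approach: mark the node unavailable instead of deleting it, then
-- mark it available again once it has been repaired and caught up.
SELECT alter_data_node('dn1', available => false);
SELECT alter_data_node('dn1', available => true);
```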

While I’m using native replication for my project, I wouldn’t recommend it in general unless you are willing to invest a lot of time in understanding how it works and building the tooling to maintain it. I would even say this about multi-node in general. Postgres scales very well vertically, so you may be better off using a couple of larger servers with traditional streaming replication for HA.

All that said, multi-node does improve both read and write performance, and I’m looking forward to seeing more work on the feature.

Thank you very much for your enthusiastic and timely response. :+1: