High Availability for Your Production Environments: Introducing Database Replication in Timescale Cloud
Earlier this year, we kicked off the Year of the Tiger 🐯🦄 by announcing our $110M Series C funding to build the future of data for developers worldwide. This new funding helps us accelerate the delivery of features that help our customers build best-in-class data-driven applications. This week, we’re excited to continue this momentum with #AlwaysBeLaunching (#ABL): MOAR Edition, a week full of exciting new features that make Timescale Cloud even MOAR worry-free, scalable, and flexible!
To start, we’re releasing early access to a highly requested feature: database replication in Timescale Cloud. And throughout the rest of this #ABL Cloud Week, we’ll release features that give you MOAR collaboration, visibility, performance, and regions in Timescale Cloud.
One of our guiding principles as a company is that boring is awesome and that your database should be boring: we believe that you should be able to focus on your applications rather than on the infrastructure on which they’re running. Replication is a feature commonly available in most databases, and today we’re surfacing that functionality in an easy-to-use experience within Timescale Cloud. As with all things Timescale, we build on top of the proven functionality available in PostgreSQL, the foundation of TimescaleDB.
Database replication in Timescale Cloud is as easy as pressing a button. By enabling a replica, you will increase the availability of your data, ensuring that you will experience only a few seconds of downtime if your database fails. But that’s not all. In Timescale Cloud, enabling replicas can also improve performance, as you can easily direct your heavy read queries to the replica, which frees up resources in your main database for higher ingest rates, more advanced continuous aggregates, or additional read queries.
The entire database replication process (including creating a replica, checking out its status, retrieving its URL, and deleting your replica) can be done easily and transparently from the Console UI. And if your database happens to fail – for example, because the underlying AWS server instance becomes unavailable – all the recovery processes are done automatically. Timescale Cloud will shift responsibilities from the former failed primary to the replica, which will then be elevated to start accepting write operations as a new primary, without any action needed from you. Your operations will just keep on running while Timescale Cloud does this heavy lifting under the hood.
Read on to learn more about replicas in Timescale Cloud, and how you can use database replication to increase data availability and liberate load in your Timescale Cloud database.
If you’re new to Timescale Cloud, try it for free (100% free for 30 days, no credit card required). Once you’re up and running, join our community of 8,000+ developers passionate about TimescaleDB, PostgreSQL, time-series data, and all its applications! You can find us in the Timescale Community Forum and in our Community Slack.
To cap off this edition of #AlwaysBeLaunching, we’re also hosting Timescale’s second Community Day! Tune in to Timescale Community Day on March 31 for talks (and demos!) about everything data and databases.
Finally, a huge “thank you” to the teams of engineers and designers that made all the features we’re releasing during the #AlwaysBeLaunching a reality! 🙏
Replicas in Timescale Cloud
Your database should be worry-free: you shouldn't have to think about it, especially if there’s a problem. As any database operator knows, failures occur. But a modern database platform should automatically recover and re-establish service as soon as issues arise. Our commitment is to build a dependable, highly available database that you can count on. Boring is awesome when the platform prevents you from getting paged at 3 a.m.
Many architectural aspects of Timescale Cloud are intentionally aligned with this high availability goal. For example, Timescale Cloud’s decoupled compute and storage is not only great for price optimization, but also for high availability. In the face of failures, Timescale Cloud automatically spins up a new compute node and reconnects it to the existing decoupled database storage, which itself is independently replicated for high availability and durability. Indeed, even without a replica enabled, this cloud-native architecture can provide a full recovery for many types of failures within 30-60 seconds, with more severe physical server failures often taking no more than several minutes of downtime to recover your database.
Further, incremental backups are taken continuously on Timescale Cloud for all your services (and stored separately across multiple cloud availability zones), allowing your database to be restored to any point-in-time from the past week or more. And Timescale Cloud continuously smoke tests and validates all backups to ensure they are ready to go at a moment’s notice.
However, many customers run even their most critical services on Timescale Cloud, where they need almost zero downtime—even in the case of unexpected and severe hardware failures. Indeed, Timescale Cloud powers many customer-facing applications, where downtime comes with important consequences for the business: application dashboards stop responding, assembly lines are no longer monitored, IoT sensors can no longer push measurements, critical business data can be lost, and more. Database replication provides that extra layer of availability (and assurance) these customers need.
Apart from decreasing downtime, Timescale Cloud replicas have another advantage: they can also help you ease the load on your primary database. If you are operating a service subjected to heavy analytical read queries (e.g., if you’re using tools like Tableau or populating complex Grafana dashboards), you can send such read queries to the replica instead of to your primary database, liberating its capacity for writes and improving performance. This makes your replica useful even in the absence of failure.
Leveraging this functionality is as easy as using a separate database connection string. Timescale Cloud makes two service URLs available: one for your (current) primary, which the platform transparently reassigns to the replica if that replica takes over, and a second one that maps to your read-only replica.
If you are already a Timescale Cloud user, you can immediately set up a database replica in your new and existing services. Enabling your first replica is as simple as this:
- Select the service you’d like to replicate.
- Under “Operations,” select “Replication” on the left menu.
- To enable your replica, click on “Add a replica.”
That’s it! 🔥
If a service has a replica enabled, it will show under Operations -> Replication. (Take into account that the replica won’t show in your Services screen as a separate database service, as it is not an independent service.)
If you want to direct some of your read queries to the replica, go to your service's “Overview” page. Under “Connection info,” you will see a drop-down menu allowing you to choose between “Primary” and “Replica.” To connect to your replica, you can simply select “Replica” and use the corresponding service URL.
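As a minimal sketch of how an application might use the two service URLs, the Python snippet below sends read-only workloads that tolerate slightly stale data to the replica and everything else to the primary. The URLs, the `ServiceUrls` class, and the `url_for` helper are hypothetical names used only for illustration; copy the real connection strings from “Connection info” in the console.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceUrls:
    """The two connection strings shown under "Connection info" (placeholders here)."""
    primary: str
    replica: str


def url_for(urls: ServiceUrls, read_only: bool, tolerates_lag: bool = True) -> str:
    """Pick a service URL for a workload.

    Only read-only work that can accept slightly stale data goes to the
    replica; writes, and reads that must see the latest commit, use the primary.
    """
    if read_only and tolerates_lag:
        return urls.replica
    return urls.primary


# Hypothetical URLs, not real Timescale Cloud hostnames.
URLS = ServiceUrls(
    primary="postgres://tsdbadmin@primary.example.com:5432/tsdb",
    replica="postgres://tsdbadmin@replica.example.com:5432/tsdb",
)

dashboard_url = url_for(URLS, read_only=True)   # e.g. Grafana panels
ingest_url = url_for(URLS, read_only=False)     # inserts and updates
```

Because the primary service URL follows whichever node currently holds the primary role, an application wired this way should not need to change `ingest_url` after a failover.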
We’re releasing database replication with an early access label, meaning that this feature is still in active development. We will add new functionality to replicas in the very near future: at the end of this post, you will find a detailed list of everything we’re actively working on regarding replicas.
You’ll hear from us again soon!
How Replicas Work
Replicas are duplicates of your main database, which in this context is called “primary.” When you enable a replica, it will stay up-to-date as new data is added, updated, or deleted from your primary database. This is a major difference between a replica and a fork, which Timescale Cloud also supports: a fork is a snapshot of your database in a particular moment in time, but once created, it is independent of your primary. After forking, the data in your forked service won’t reflect the changes in the primary.
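To make the replica-versus-fork distinction concrete, here is a toy, in-memory Python model. `ToyService` is a hypothetical stand-in rather than a real client, and it copies changes to the replica immediately for simplicity, whereas real replicas follow the primary with a small delay.

```python
class ToyService:
    """In-memory stand-in for a database service (not a real client)."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])
        self._replicas = []

    def add_replica(self):
        """Create a replica that starts with the current data and keeps following."""
        replica = ToyService(self.rows)
        self._replicas.append(replica)
        return replica

    def fork(self):
        """Create an independent copy of the data as it is right now."""
        return ToyService(self.rows)

    def insert(self, row):
        """Write to this service and propagate the change to its replicas."""
        self.rows.append(row)
        for replica in self._replicas:
            replica.rows.append(row)


primary = ToyService(["reading-1"])
replica = primary.add_replica()
fork = primary.fork()

primary.insert("reading-2")  # written after the replica and fork were created
```

After the insert, `replica.rows` contains both readings, while `fork.rows` still holds only the data that existed when the fork was taken.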
PostgreSQL (and thus TimescaleDB) offers several methods for replication. However, setting up replication for a self-hosted database is a difficult task that includes many steps: choosing which options suit you best, spinning up a new server, adjusting and tuning configuration files, and building a full infrastructure that monitors the health of your primary and automatically fails over to a replica when needed. And all of this must be done while avoiding “split-brain” scenarios, in which two separate services both believe they are primaries, leading to data inconsistency issues.
Timescale Cloud automates the process for you: we do the hard work so you don’t have to.
As we’ve seen in the previous section, adding a replica to your service is extremely simple. But as a more technical deep dive for the interested reader, the following paragraphs cover three design choices we’ve taken for replicas: (i) the choice of asynchronous commits, (ii) their ability to act as hot standbys, and (iii) their use of streaming replication.
Timescale Cloud replicas are asynchronous
The primary database will commit a transaction as soon as it is applied to its local database, at which point it responds to the requesting client with success. In particular, it does not wait until the transaction is replicated and remotely committed by the replica (as would be the case with synchronous replicas). Instead, the transaction is replicated asynchronously as the primary ships its Write-Ahead Log (WAL) to the replica, hence the “quasi-real-time” synchronicity between primary and replica.
We chose this design pattern for two important reasons. The first one, perhaps surprisingly, is high availability: with a single synchronous replica, the database service would stop accepting new writes if the replica fails, even if the primary remains available. The second is performance, as database writes are both lower latency (no round trip to the replica before responding) and can achieve higher throughput. And that’s important for time-series use cases, where ingest rates are often quite high and can be bursty as well.
In the future, when we add support for multiple replicas, we plan to introduce the ability to configure quorum synchronous replication, where a transaction is committed once written to at least some replicas, but not necessarily all. This addresses one tradeoff with asynchronous replication, where a primary failure may lead to the loss of a few of the latest transactions that have yet to be streamed to any replicas.
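The difference between asynchronous, fully synchronous, and quorum synchronous commits comes down to how many replica acknowledgments the primary waits for. The sketch below models only that rule; `commit_acknowledged` and the replica names are hypothetical. (PostgreSQL itself can express quorum commit through the `ANY n (...)` form of its `synchronous_standby_names` setting, but how Timescale Cloud will expose this is not stated here.)

```python
def commit_acknowledged(acks, replicas, required_acks):
    """Return True once enough distinct replicas have confirmed the write.

    required_acks = 0              -> asynchronous (never wait for a replica)
    required_acks = len(replicas)  -> fully synchronous (wait for all)
    anything in between            -> quorum synchronous
    """
    if not 0 <= required_acks <= len(replicas):
        raise ValueError("required_acks must be between 0 and len(replicas)")
    confirmed = set(acks) & set(replicas)  # ignore acks from unknown nodes
    return len(confirmed) >= required_acks


replicas = ["replica-a", "replica-b"]
acks = ["replica-a"]  # replica-b is down or slow

async_ok = commit_acknowledged(acks, replicas, required_acks=0)
quorum_ok = commit_acknowledged(acks, replicas, required_acks=1)
all_ok = commit_acknowledged(acks, replicas, required_acks=2)
```

With one of two replicas unavailable, the asynchronous and quorum settings still let the write complete, while the fully synchronous setting blocks it. This is the availability concern described above for a single synchronous replica.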
Timescale Cloud replicas act like hot standbys
A warm standby is a replica (standby) that is ready to take over operations as soon as the primary fails, as opposed to a “cold standby,” which might take a while to restore before it can begin processing requests. This is closely related to the high availability mentioned earlier. The process of the standby/replica becoming the primary is called failover, which is covered in more detail below. Since Timescale Cloud replicas are also read replicas – i.e., they can also be used for read-only queries – they are considered hot standbys instead of just warm standbys.
As we will talk about in later sections, allowing you to read from your replicas gives you the option to direct some of your read-only workloads to your replica, freeing capacity in your primary. This means that in Timescale Cloud, you will not only have a replica ready to take over at any moment if the primary happens to fail, but you can also get value from it beyond availability, even if there’s no failure.
Timescale Cloud replicas use streaming replication
Streaming replication helps ensure there is little chance of data loss during a failover event. Streaming replication refers to how the database’s Write-Ahead Log – which records all transactions on the primary – is shipped from the primary to the replica. One common approach for shipping this WAL is the aptly named “log shipping.” Typically, log shipping is performed on a file-by-file basis, i.e., one WAL segment of 16MB at a time. So, these files aren’t shipped until they reach 16MB or hit a timeout.
The implication of log shipping, however, is that if a failure occurs, any unshipped WAL is lost. Instead of file-based log shipping, Timescale Cloud uses streaming replication to minimize potential loss. This means that individual records in the WAL are streamed to the replica as soon as they are written by the primary, rather than waiting to ship as an entire segment. This method limits the potential data loss to the gap between a transaction committing and the corresponding WAL generation.
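A rough way to compare the two approaches is to ask how much WAL exists only on the primary at the moment it fails. The sketch below is a simplified model under stated assumptions: it ignores the log-shipping timeout, and the 8 KiB streaming lag is a made-up figure, not a measured one. The function names are hypothetical.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # the 16MB segment size mentioned in the text


def unshipped_with_log_shipping(wal_bytes_written, segment_bytes=WAL_SEGMENT_BYTES):
    """WAL sitting in the current, not-yet-full segment (timeout ignored)."""
    return wal_bytes_written % segment_bytes


def unshipped_with_streaming(wal_bytes_written, wal_bytes_received):
    """WAL written on the primary but not yet received by the replica."""
    if wal_bytes_received > wal_bytes_written:
        raise ValueError("replica cannot be ahead of the primary")
    return wal_bytes_written - wal_bytes_received


MIB = 1024 * 1024
written = 40 * MIB              # the primary has produced 40 MiB of WAL
received = written - 8 * 1024   # hypothetical: the replica trails by 8 KiB

at_risk_file_based = unshipped_with_log_shipping(written)         # 8 MiB
at_risk_streaming = unshipped_with_streaming(written, received)   # 8 KiB
```

In this example, two full segments have shipped under file-based log shipping and the remaining 8 MiB would be lost, whereas streaming leaves only the few records still in flight at risk.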
Enabling Replicas for High Availability
Even without a replica enabled, Timescale Cloud has a range of automated backup and restore mechanisms that protect your data in case of failure. For example, the most common type of failure in a managed database service is a compute node failure; in Timescale Cloud, it often takes only tens of seconds to recover from such a failure, as we are able to spin up a new compute node and reattach your storage to it. In the much rarer case in which a failure affects your (replicated) storage, Timescale Cloud automatically restores your data from backup, at a rate of roughly 10 GB per minute.
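To put the restore rate in perspective, the back-of-the-envelope estimate below applies the “roughly 10 GB per minute” figure to a few database sizes. The sizes are hypothetical, and real restore times will vary; the helper only divides size by rate.

```python
RESTORE_GB_PER_MINUTE = 10.0  # "roughly 10 GB per minute", taken from the text


def estimated_restore_minutes(database_gb, rate_gb_per_minute=RESTORE_GB_PER_MINUTE):
    """Approximate minutes needed to restore a database of the given size."""
    if database_gb < 0 or rate_gb_per_minute <= 0:
        raise ValueError("size must be >= 0 and rate must be > 0")
    return database_gb / rate_gb_per_minute


# Hypothetical database sizes, for illustration only.
for size_gb in (50, 500, 2000):
    minutes = estimated_restore_minutes(size_gb)
    print(f"{size_gb} GB -> about {minutes:.0f} minutes")
```

At that rate, a 500 GB database would need on the order of 50 minutes to restore from backup, which is the kind of downtime a replica is meant to avoid.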
For some use cases, the potential level of downtime associated with this recovery process is completely acceptable; for those customer-facing applications that require minimal downtime, replicas will provide the extra layer of availability they need.
The recovery process through replicas is summarized in the figures below. In a normal operating state, the application is connected to the primary and optionally to its replica to scale read queries. Timescale Cloud manages these connections through load balancers, defining the role for each node automatically.
The next figure illustrates a failover scenario. If the primary database fails, the platform automatically updates the roles, “promoting” the replica to the primary role, with the primary load balancer redirecting traffic to the new primary. When the failed node either recovers or a new node is spun up, it assumes the replica role. The promoted node remains the primary, streaming WAL to its new replica.
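The role changes in a failover can be sketched as a small state transition. This is an illustrative model only, not Timescale Cloud's implementation; the `Node` class and `fail_over` function are hypothetical, and health detection and load-balancer updates are left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Node:
    name: str
    role: str            # "primary" or "replica"
    healthy: bool = True


def fail_over(primary: Node, replica: Node) -> Tuple[Node, Node]:
    """Return (primary, replica) after handling a possible primary failure."""
    if primary.healthy:
        return primary, replica       # nothing to do
    if not replica.healthy:
        raise RuntimeError("no healthy replica available to promote")
    replica.role = "primary"          # promotion: starts accepting writes
    primary.role = "replica"          # role it takes once recovered or replaced
    return replica, primary


node_a = Node("node-a", "primary")
node_b = Node("node-b", "replica")

node_a.healthy = False                # simulate the primary failing
current_primary, current_replica = fail_over(node_a, node_b)
```

After the call, `node-b` holds the primary role and `node-a` is marked to rejoin as the replica, mirroring the sequence described above.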
On top of increasing the availability of your database in case of failure, replicas also essentially eliminate the downtime associated with upgrades, including database, image, or node maintenance upgrades. Without a replica, these upgrades usually imply 30 to 60 seconds of downtime. With a replica, this is reduced to about a second (just the time to fail over). When the upgrade process starts, your system switches over to the replica, which becomes the primary. Once the upgrade is completed on the former primary (now the replica), the system switches back so it can subsequently upgrade the other node. (On occasion, the replica is upgraded first, in which case only one failover is necessary.)
Read Replicas: Enabling Replicas for Load Reduction
Timescale Cloud's replicas act as “hot standbys” and thus also double as read replicas: when replicas are enabled, your read queries can be sent to the replica instead of the primary.
The main advantages of read replicas are related to scalability. By allowing your replica to handle all of the read load, your primary instance only has to handle writes or other maintenance tasks that generate writes, such as TimescaleDB’s continuous aggregates. Using read replicas would result in higher throughput for writes and faster execution times on analytical reads, plus a less strained primary instance.
For example, read replicas can be particularly useful if you have many Grafana dashboards connecting to your service. Since visualizations don’t need perfectly real-time data – that is, using data that’s a few seconds old is often more than acceptable – the replica can be used to power these dashboards without consuming resources on the primary. Plus, with this setup, data analysts can work with up-to-date production data without worrying about accidentally impacting the database operations, such as with more ad-hoc data science queries.
Another benefit of using read replicas is to limit the number of applications with write access to your data. Since the entire replica database is read-only, any connection, even those with roles that would have write privileges in the primary, cannot write data to the replica. This can serve to easily isolate applications that should have read/write access from those that only need read access, which is always a good security practice. Database roles should certainly also be used to ensure “least privilege,” but a bit of redundancy and “defense in depth” doesn’t hurt.
Today, we’re releasing database replication under an “early access” label, meaning that this feature is still in active development. We’ll be continuing to develop capabilities around database replication in Timescale Cloud, including:
- Replicas in different availability zones within the same region (coming soon - stay tuned!)
- Multiple replicas per database service
- Greater flexibility around synchronous vs. asynchronous replicas
- Replicas in different AWS regions
- Replicas in multi-node database services
So keep an eye out – MOAR replication options coming soon!
Timescale Cloud’s new database replication provides you with increased high availability and fault-tolerance for your important database services. In addition, it allows you to scale your read workloads and better isolate your primary database for writes. Check out our documentation for more information on how to use this database replication in Timescale Cloud.
Replicas are immediately available for Timescale Cloud users. If you want to try Timescale Cloud, you can create a free account to get started: it’s 100% free for 30 days, no credit card required. And if you have any questions, you can find us in our Community Slack and in the Community Forum.
And if this sounds like the type of technical challenge you enjoy working on: we’re hiring. Fully remote and globally distributed.
So let’s kick off the first Cloud Week 2022 with MOAR availability. And stay tuned – many more exciting Timescale Cloud capabilities to come!