How High Availability Works in Our Cloud Database

When our customers first get interested in Timescale and we mention the importance of high availability, we often hear something along these lines:

“Now that our workload is growing, hosting our own database is starting to consume too much time. We’re ready to move to a hosted service so we can free up some time for our team. But, we’re unsure about the idea of giving our data to somebody else. What would happen if there’s a failure? How do I know our data is protected?”

Hosted databases in the cloud are the future, but you should still understand how your hosted database works under the hood to ensure it meets your high availability needs. Database failures used to keep database administrators up at night. That shouldn’t be the case anymore, with hosted database services taking that load off the DBA’s shoulders by keeping databases up and running. Still, vendors could be more transparent: how they keep the service working seamlessly is not the kind of information that belongs in a black box!

In this blog post, we’re throwing the black box out the window to explain how data availability works in Timescale within its cloud-native architecture. Even if you are not a Timescale user (yet), this read may give you a glimpse of how we’ve designed the platform and built a high-availability cloud database using AWS infrastructure.

What Is High Availability?

For a DBA, “availability” means how often you can interact with your database as expected. If your database is available, you can perform normal operations on it, and your end users (and your business) remain unaffected by any database issues.

In this context, the term “high availability” (HA) is often used to describe a system in which you can expect minimal downtime. The exact level of downtime you can expect in an HA system depends on your vendor; there’s no universally accepted definition of high availability, although it often varies between a few seconds and a few minutes.

Companies often describe their service availability in terms of RTO (Recovery Time Objective) and RPO (Recovery Point Objective). These are fancy terms that actually mean very simple concepts:

  • The RTO is how long it will take for your service to recover in case of a failure, usually considering the worst-case scenario.
  • The RPO is how much data you could lose if a recovery takes place. (The short sketch below makes both terms concrete.)
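
To make these two targets concrete, here is a toy Python sketch with made-up numbers (both the targets and the measured incident are hypothetical, purely for illustration, not Timescale figures):

```python
# Hypothetical targets for a service (illustration only).
RTO_SECONDS = 60   # the service must be back within one minute of a failure
RPO_SECONDS = 0    # no committed data may be lost

# Hypothetical measurements from a single incident.
measured_recovery_seconds = 8    # e.g., a failed compute node was replaced in ~8 seconds
measured_data_loss_seconds = 0   # e.g., the storage survived, so no writes were lost

print("RTO met:", measured_recovery_seconds <= RTO_SECONDS)
print("RPO met:", measured_data_loss_seconds <= RPO_SECONDS)
```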

If you are a DBA, needless to say, you care a lot about these two things. If you’ve managed a database on-premise before, or even in a private cloud, you know that failures do happen. By entrusting your operations to a hosted database provider, you want to know how much downtime you may experience. And you want to rest assured that no data loss will occur.

We’ll eventually answer how this works in Timescale specifically, but before we get there, let’s spend a few minutes talking about what the infrastructure of a hosted database looks like, focusing on AWS (which is where Timescale is hosted). Having a mental model of how the underlying infrastructure works will help you better understand which types of failures may happen and what will be done to put your service back up and running again.

What Causes Downtime in a Hosted Database (and Why Choose a Hosted Database at All)

If we look a bit deeper, when we talk about high availability in a database, we are really talking about two different elements:

  • The first element is your base operations: how reliable the system itself is. This includes how often your infrastructure experiences outages that cause a significant disruption to your normal operations. In a hosted database service, this largely depends on the reliability of the underlying physical hardware (for example, the AWS infrastructure).
  • The second piece is disaster recovery: how quickly the system can resume normal operations when a major problem occurs. Disaster recovery doesn’t depend so much on the underlying infrastructure as on how the system itself is engineered (e.g., how we’ve designed Timescale).

Historically, on-premise was the only option. Databases were always hosted by the companies themselves in their own data centers and operated by their own engineers—this is still the case for many companies.

For these self-hosted teams, base operations are a crucial element of keeping their database up, and it’s a task that requires specialized skills in hardware management and database administration. For example, they would be in charge of evaluating different compute and storage server options, purchasing them, and setting them up; the database would also need to be installed and properly configured (and eventually updated); the system would need to be operated and maintained; a set of operational rules would need to be put in place to determine what to do when storage gets corrupted or compute fails; backups would need to be maintained and tested…

When self-hosting their database, engineering teams need to be prepared for any sort of event that may cause a hardware outage, including beavers

This is a lot of work. Managing their own database is certainly possible for some engineering teams—but others may prefer to focus all their efforts on building their application instead of spending them on database maintenance and operations. These teams often choose to use a hosted database service like Timescale.

Timescale, like many hosted database services, runs in AWS. This means that AWS handles the management and reliability of the underlying hardware—and they’re very good at it. By choosing a database hosted in AWS, we can forget about the physical maintenance of our infrastructure. This delegates the first element of availability, related to the maintenance of base operations, to AWS.

To understand what this actually means in the case of Timescale, it’s worth doing a quick overview of the AWS components that are actually being used to host the database:

  • The compute piece of the hardware is covered by EC2 instances. In Timescale, the compute runs the PostgreSQL server, handles the connections to the database, and holds the local memory. (EC2 stands for Elastic Compute Cloud.)
  • The database storage piece is covered by EBS volumes. This is where the data actually lives—your disk storage and file system. (EBS stands for Elastic Block Store.)
  • The backups and WAL are stored in S3 as the long-term storage element. These don’t have to be accessed regularly as part of daily database operations and thus benefit from being stored in S3, which is a bit slower to access than EBS but incredibly reliable. (S3 stands for Simple Storage Service.)
AWS components as the Timescale infrastructure
Editor’s note: S3 is also commonly used as long-term storage for cold data in cloud-native environments due to its low cost, high reliability, and ease of management and migration. We’re soon going to support this directly for our Timescale customers. Stay tuned!
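
To make the mapping above concrete, here is a minimal boto3 sketch that inspects each of the three pieces. The instance ID, volume ID, and bucket name are hypothetical placeholders; Timescale manages these resources internally and doesn’t expose them this way.

```python
import boto3

# Hypothetical identifiers, purely for illustration.
INSTANCE_ID = "i-0123456789abcdef0"        # EC2: compute (PostgreSQL server, connections, memory)
VOLUME_ID = "vol-0123456789abcdef0"        # EBS: disk storage and file system
BACKUP_BUCKET = "example-wal-and-backups"  # S3: backups and WAL archive

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

# Compute: the EC2 instance that runs the database server.
instance = ec2.describe_instances(InstanceIds=[INSTANCE_ID])["Reservations"][0]["Instances"][0]
print("compute state:", instance["State"]["Name"])

# Storage: the EBS volume that holds the data files.
volume = ec2.describe_volumes(VolumeIds=[VOLUME_ID])["Volumes"][0]
print("storage state:", volume["State"], "| size (GiB):", volume["Size"])

# Long-term storage: a few of the backup/WAL objects kept in S3.
for obj in s3.list_objects_v2(Bucket=BACKUP_BUCKET, MaxKeys=5).get("Contents", []):
    print("backup object:", obj["Key"])
```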

Timescale was designed as a cloud-native platform from the start. We rely on AWS for our underlying hardware infrastructure and have built automated detection and recovery for scenarios when a piece of hardware fails, such as an EC2 instance. Though our underlying availability can only be as good as the hardware it is built upon, we can do some engineering magic on top of this hardware to cover those situations, minimizing the impact on our users—we discuss this magic later in this post.

An important consequence of our cloud-native approach is that the compute and storage pieces are not tied together in Timescale, unlike in a traditional server setup. This allows us to offer some nice benefits to our users. For example, as the end user of Timescale, you’re able to scale your compute and storage up and down independently, which is very convenient and cost-efficient. But having a decoupled compute and storage architecture has benefits beyond cost-efficiency: as we’re about to see, it also increases availability.

How Timescale Handles Compute Failures

We said before that AWS does a great job in keeping their infrastructure up and running. But how good?

In the figure below, you can see the availability levels that AWS defines for each of the components that make up a Timescale service (EC2, EBS, S3). These availability levels are all very high, but they are not 100%. Hardware failures will happen sometimes, even to AWS.

Infrastructure components in AWS (Source: EC2, EBS, S3)
Editor’s note: Availability is usually expressed as a percentage indicating the service uptime in any given year. In other words, a system with an availability of 99.99% could be unavailable for up to 52.6 minutes per year, or 8.64 seconds per day. For long-term storage like S3, durability is a more accurate parameter—defined as the probability that an object will remain intact and accessible in any given year.
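
The arithmetic behind those percentages is simple; here is a small Python sketch that converts an availability figure into the downtime it allows:

```python
# Convert an availability percentage into the downtime it allows.
def downtime_allowed(availability_percent: float) -> dict:
    unavailable_fraction = 1 - availability_percent / 100
    return {
        "minutes_per_year": round(unavailable_fraction * 365 * 24 * 60, 1),
        "seconds_per_day": round(unavailable_fraction * 24 * 60 * 60, 2),
    }

print(downtime_allowed(99.99))   # {'minutes_per_year': 52.6, 'seconds_per_day': 8.64}
print(downtime_allowed(99.999))  # {'minutes_per_year': 5.3, 'seconds_per_day': 0.86}
```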

These numbers are relevant for assuring high availability in Timescale. If you look at the numbers above, you’ll see that EC2 (the compute piece) fails significantly more often than the storage. Statistically speaking, roughly 9 out of 10 hardware failures you experience in a hosted service will be compute failures.

So what happens to your Timescale service if the underlying EC2 instance that’s hosting your database compute fails?

This is when the decoupled compute-storage architecture of Timescale comes in extremely handy. In a traditional database setup on-premise, you would always need to do a recovery from backup, even in the case of a compute failure—and as we’ll see later in the post, recovering from backups can be a lengthy process. This means that even a compute failure would cause significant downtime to your end users.

But since the compute and storage nodes are decoupled in Timescale, if the compute fails, we can automatically spin up a new compute node, attaching your undamaged storage unit to it. This recovery process takes only seconds in the majority of cases, and it’s done without any action needed from you. The only thing you will notice will be a reset of your database connections.
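
Timescale automates all of this internally (using Kubernetes, as noted below), so the following is only an illustration of why decoupled storage makes this recovery so fast. It’s a hedged boto3 sketch of the underlying idea—detach the healthy volume, launch fresh compute, reattach—with hypothetical instance, volume, and AMI IDs; it is not Timescale’s actual recovery code.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical identifiers, purely for illustration.
FAILED_INSTANCE_ID = "i-0aaaaaaaaaaaaaaaa"
DATA_VOLUME_ID = "vol-0bbbbbbbbbbbbbbbb"
REPLACEMENT_AMI = "ami-0ccccccccccccccc0"

# 1. Detach the undamaged data volume from the failed compute node.
ec2.detach_volume(VolumeId=DATA_VOLUME_ID, InstanceId=FAILED_INSTANCE_ID, Force=True)
ec2.get_waiter("volume_available").wait(VolumeIds=[DATA_VOLUME_ID])

# 2. Launch a replacement compute node.
new_instance = ec2.run_instances(
    ImageId=REPLACEMENT_AMI, InstanceType="m5.large", MinCount=1, MaxCount=1
)["Instances"][0]
ec2.get_waiter("instance_running").wait(InstanceIds=[new_instance["InstanceId"]])

# 3. Attach the same storage to the new node. The database restarts against its
#    existing data files, so no restore from backup is needed.
ec2.attach_volume(
    VolumeId=DATA_VOLUME_ID, InstanceId=new_instance["InstanceId"], Device="/dev/sdf"
)
```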

Thanks to Timescale’s cloud-native architecture with decoupled compute and storage, in case of a compute failure, the platform can automatically attach your healthy storage to a new compute unit, quickly fixing the issue. This process is called “rapid recovery” in Timescale.
Editor’s note: Timescale uses Kubernetes to automate many of its infrastructure management tasks, including the recovery process described in this section. But this deserves its own blog post—stay tuned for content on how we use Kubernetes for the daily operations in Timescale.

What If There's a Failure Affecting Your Storage?

As we saw earlier, AWS is very good at managing hardware. Failures affecting the storage side of things (EBS in the case of Timescale) are way less common—and yet they happen from time to time.

How will your managed database service handle recovery in this case?

Reducing downtime as much as possible: Replication

A first failover scenario involves the use of replicas.

In Timescale, users can enable a replica in one click when they create their service (or anytime after the fact). This replica will stay in sync with the primary database at all times, containing the exact same information and configuration.

If something occurs that makes the data stored in the primary database unavailable, the platform will automatically switch all operations to the replica, which contains an up-to-date copy of your data. This process takes only a few seconds (<10s), which is the only downtime that your end-users will experience. Often the only thing noticeable is a reset of connections to the database.

This will effectively fix the problem for you and your end-users.
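
Under the hood, a Timescale replica is kept in sync via standard PostgreSQL streaming replication (the same WAL streaming mentioned below). Purely as an illustration, here’s a hedged Python/psycopg2 sketch of how you could watch that replication from the primary yourself; the connection string is a hypothetical placeholder.

```python
import psycopg2

# Hypothetical connection string, for illustration only.
conn = psycopg2.connect("postgresql://tsdbadmin:password@primary.example.com:5432/tsdb")

with conn, conn.cursor() as cur:
    # On the primary, pg_stat_replication lists each connected replica and how far
    # behind it is in replaying the streamed WAL.
    cur.execute("""
        SELECT application_name,
               state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication;
    """)
    for name, state, lag_bytes in cur.fetchall():
        print(f"replica={name} state={state} lag={lag_bytes} bytes")
```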

In a normal operating state, the application is connected to the primary and optionally to its replica. The load balancer handles the connection and defines the role for each node.
When the primary database fails, the platform updates the roles, “promoting” the replica to the primary role, with the primary load balancer redirecting traffic to the new primary. In the meantime, the system begins the recovery of the failed node.
When the failed node recovers or a new node is created, it assumes the replica role. The previously promoted node remains the primary, streaming WAL (write-ahead log) to its replica.

After the failover process has been completed, Timescale will proceed to repair the damaged node, which will eventually become the new replica.

We always recommend that our users enable replication for mission-critical workloads, as it significantly increases the availability of their service. If your system requires uptime guarantees, replicas are the option for you.

Also, replicas in Timescale are automatically created in a different Availability Zone (AZ) than your primary database for extra peace of mind. AWS hosts its infrastructure in different regions across the globe (e.g., us-east-1). For extra resilience, each region is divided into multiple availability zones, which remain isolated from each other (e.g., power may go down in one AZ without affecting the others within the same region). Having your replica and your primary database hosted in different AZs gives you extra redundancy in case an entire AZ goes down.

The last resort: Backups

In Timescale, replicas are strongly recommended—but not enabled by default (as they increase the cost of your service). But we (of course) ensure data protection for all our services, not only those with a replica. If you don’t have a replica enabled and there’s a failure affecting your storage, your good old friend, the backup, will come to the rescue.

If you’ve ever dealt with databases on-premise or in your own cloud, you are already familiar with backups. By backing up your database at regular intervals, you can restore from the latest backup if there’s a failure affecting your database, which essentially means getting your data into a new database.

Backups are the historical way of dealing with database failures, but recovery from backups can be a rather slow process that is limited by the quality and frequency of the latest backups. If the latest backup was two days ago, then the last two days of data might be lost!

Even though this problem is mostly solved today by tools like pgBackRest, configuring the backup strategy, testing backups, and automating the recovery is a time-intensive process… And it can be rather stressful.

In a database with cloud-native infrastructure like Timescale, backups are our safety net but not our only resource. As we explained earlier, having a cloud-native infrastructure allows us to fix compute failures without touching our backups—and for mission-critical applications, we always recommend enabling replication, so you can be protected against the potentially longer downtime caused by more severe failures.

But not all workloads are mission-critical. Perhaps you have certain services which are powering internal dashboards, machine learning models, or hosting historic data that you use to build weekly reports—for systems like these, you may decide that having a little downtime may not be critical, and you may choose not to enable a replica. If some of these services experience a failure affecting the storage, how does Timescale recover your data?

First, Timescale keeps up-to-date backups of all services:

  • Full backups are taken weekly. This process is done automatically by the platform (you don’t need to do anything manually) and for all services.
  • Incremental backups are taken daily. These backups record the changes made since the last full backup.
  • On top of these full and daily backups, Timescale keeps WAL (write-ahead log) files of any changes made to the database. This WAL can be replayed in the event of a failure to reproduce any transactions not captured by the last daily backup, e.g., to replay the changes made to your database during the last few hours.

By combining these three elements, we can do a point-in-time recovery—we can recover a database to any point in time, and you won’t experience any data loss.
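
As a mental model of how those three elements combine (a toy sketch, not Timescale’s implementation): pick the newest full backup before your target time, layer the most recent incremental backup on top, and replay WAL from the last backup up to the exact instant you want.

```python
from datetime import datetime

# Toy example data: weekly full backups, daily incremental backups, and a recovery target.
full_backups = [datetime(2023, 5, 1), datetime(2023, 5, 8)]
incremental_backups = [datetime(2023, 5, 9), datetime(2023, 5, 10)]
target = datetime(2023, 5, 10, 14, 30)  # recover the database to this exact instant

# Newest full backup taken before the target.
base = max(b for b in full_backups if b <= target)
# Newest incremental taken after that full backup (it records changes since the full backup).
incremental = max((b for b in incremental_backups if base < b <= target), default=None)
# WAL replay covers whatever happened after the last backup, up to the exact target.
wal_replay_from = incremental or base

print("restore full backup taken on", base.date())
print("apply incremental backup from", incremental.date() if incremental else "none")
print("replay WAL from", wal_replay_from, "up to", target)
```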

Diagram illustrating the process of backup and recovery in Timescale. For more information, check out our docs

The figure above illustrates the process of recovery from backup. How long does this take? As we’ve mentioned before, this is the longest recovery process—you will experience substantially more downtime than if you have replication enabled. The exact amount of downtime, however, will depend on multiple factors, including how up-to-date your backups are, how much data you have in general, and your compute size (how much CPU/memory you have available).

High Availability in Timescale: A Summary

We hope you now have a better view of what the infrastructure behind a hosted database service really looks like and of the different strategies one can follow to achieve as much availability as possible.

In the particular case of Timescale, the platform protects your data against failure automatically, with very low RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for all instances. For mission-critical workloads with high-availability requirements, Timescale also offers replicas—which ensure near-zero downtime and near-zero data loss if the database fails. Click here to learn how to enable a replica in your Timescale service.

If you still haven’t tried Timescale, you can create an account here. You will get free access to the platform for 30 days, no credit card required.
