How I Learned to Stop Worrying and Love PostgreSQL on Kubernetes: Continuous Backup/Restore Validation on Timescale

Your database needs reliable backups. A data loss event can occur at any time: a developer may drop a table by mistake, even replicated storage drives may fail and start producing read errors, software bugs may cause silent data corruption, applications may perform incorrect modifications, and so on.

You may hope this will never happen to you, but hope is not a strategy. The key to preventing data loss, and to recovering from events like these, is to perform regular backups.

As the ancient proverb goes: “There are two kinds of people. Those who do backups and those who will do backups.”

Relational databases like PostgreSQL support continuous archiving: in addition to a base image of your data directory, the database continuously pushes its write-ahead log (WAL) to backup storage.
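
As a concrete illustration, continuous archiving is typically enabled with a couple of settings in postgresql.conf; this is a minimal sketch, and the pgBackRest stanza name is a placeholder:

```ini
# Minimal sketch of continuous WAL archiving in postgresql.conf
# (the "main" stanza name is illustrative).
wal_level = replica        # the default since PostgreSQL 10; required for archiving
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
```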

Some of the most popular open-source tools for PostgreSQL center around performing backups and restores, including pgBackRest, Barman, wal-g, and more, which underscores the importance of doing so. Or, at the very least, that backup/restore is top of mind for many developers. And, because TimescaleDB is built on PostgreSQL (specifically, it’s implemented as a regular extension to PostgreSQL), all your favorite PostgreSQL tools work perfectly well with TimescaleDB.

Most of the PostgreSQL tools mentioned above are described not as backup tools but as disaster recovery tools, because when disaster strikes, you are not really interested in your backup but in the outcome of restoring it. And sometimes you need to work really hard to recover from what would otherwise be a disaster: one of my favorite talks, by long-time PostgreSQL contributor Dimitri Fontaine, describes the process of recovering data from a PostgreSQL instance whose backup couldn't be restored when it was needed. It's a fascinating story, and even with the help of world-class experts, the outcome was near-certain data loss. That is why, if you just do regular backups, you are not done yet! You'd better have a strategy for testing them by doing regular restores.

We thought about how to apply this lesson to Timescale, our cloud-native platform for TimescaleDB. A core tenet of Timescale is to provide a worry-free experience, especially around keeping your data safe and secure. Behind the scenes, among other technologies, Timescale relies on encrypted Amazon Elastic Block Store (EBS) volumes and PostgreSQL continuous archiving. But it's not enough to just take backups and hope for the best; we also believe in actively testing restore functionality frequently, and especially testing in production. Toward that end, we employ various strategies for actively and automatically testing backups, so that we can give our users greater peace of mind about the safety and reliability of their data.

Read on for more about our backup and restore strategy for Timescale and how we automatically test your backups to ensure they can be restored when you need them.

If you’re new to TimescaleDB, create a free Timescale account to get started with a fully managed Timescale service (free for 30 days, no credit card required).

Once you’re using TimescaleDB, or if you’re already up and running, join the Timescale community to share your feedback, ask questions about time-series data (and databases in general), and more.

And, for those who share our mission and passion for hard problems and want to join our fully remote, global team, we’re hiring broadly across many roles (including the Timescale team 🔥).

How we back up every single Timescale instance

Before explaining how we perform restore tests at scale, let’s briefly describe how we run databases on Timescale.

We refer to a TimescaleDB instance available to our customers as a TimescaleDB service. (Fun terminology fact: in PostgreSQL, this database instance is referred to as a PostgreSQL "cluster," since one can traditionally run multiple logical databases within the same PostgreSQL process; this should not be confused with a "cluster" consisting of a primary database and its replicas. So let's just refer to these as "databases" or "instances" for now.)

A TimescaleDB service is constructed from several Kubernetes components, such as pods and containers running the database software, persistent volumes holding the data, Kubernetes services, and endpoints that direct clients to the pod.

We run TimescaleDB instances in containers orchestrated by Kubernetes. We have implemented a custom TimescaleDB operator to manage a large fleet of TimescaleDB services, configuring and provisioning them automatically.

The TimescaleDB operator, like other Kubernetes operators, provides a custom resource definition (CRD) that describes a TimescaleDB deployment. The operator converts the YAML manifests defined by the TimescaleDB CRD into running TimescaleDB services and manages the lifecycle of the resulting services.
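
The actual CRD is internal to Timescale, but a manifest in this style might look roughly like the following hypothetical sketch (the API group, kind, and all field names are invented for illustration):

```yaml
# Hypothetical TimescaleDB custom resource; field names are illustrative only.
apiVersion: timescaledb.example.com/v1
kind: TimescaleDB
metadata:
  name: service-abc123
  namespace: customer-services
spec:
  postgresVersion: "14"
  timescaledbVersion: "2.9.1"
  resources:
    cpu: "2"
    memory: 8Gi
  volumeSize: 500Gi
  backup:
    s3Bucket: example-backup-bucket
    schedule: "0 3 * * 0"   # e.g., a weekly full backup
```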

TimescaleDB pods take advantage of sidecars, running several containers alongside the database. One of the sidecars runs pgBackRest, a popular PostgreSQL backup tool, and provides an API to launch backups, both on demand and periodically via Kubernetes cron jobs. In addition, the database container continuously archives changes in the form of WAL segments, storing them on Amazon S3 in the same location as the backups.
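
The exact configuration is specific to our environment, but a pgBackRest setup along these lines would point both backups and archived WAL at the same S3 location (the bucket, paths, and stanza name below are placeholders):

```ini
# pgbackrest.conf — illustrative S3 repository shared by backups and WAL archiving.
[global]
repo1-type=s3
repo1-s3-bucket=example-backups
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com
repo1-path=/service-abc123

[main]
pg1-path=/var/lib/postgresql/data
```

A Kubernetes cron job (or an on-demand call to the sidecar's API) can then run, for example, "pgbackrest --stanza=main --type=full backup" for a periodic full backup, or the same command with --type=incr for incrementals; the split between full and incremental backups here is illustrative.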

How we orchestrate backups using Kubernetes and AWS S3 in Timescale

In addition to the TimescaleDB operator, there is another microservice, known as "the deployer," whose task is to deploy TimescaleDB instances. The deployer defines TimescaleDB resources based on users' choices and actions in the cloud console's UI and creates TimescaleDB manifests, letting the operator pick them up and provision running TimescaleDB services.

How a TimescaleDB service is deployed via Kubernetes on Timescale

The deployer also watches for changes in the Kubernetes objects that are part of the resulting TimescaleDB service, as well as in the manifest itself. It detects when the target service is fully provisioned or when there are changes to be made to the running service (e.g., to provision more compute resources or to upgrade to a new minor version of TimescaleDB). Finally, it also marks the service as deleted upon receiving a delete event for the manifest.

Restore all backups!

So the deployer and the operator work together to deploy and manage a TimescaleDB service in Kubernetes, including not only the container running PostgreSQL and TimescaleDB but also the sidecar containers running pgBackRest and other tooling.

Sometimes, the solution to one problem is a by-product of working on another. As we built Timescale, there were several features we were able to implement easily by adding the ability to clone a running service, producing a new one with identical data. That process is similar to spawning a replica of the original database, except that at some point the replica is "detached" from the former primary and goes its separate way.

We’ve recently added the ability to continuously validate backups through frequent smoke testing using a similar approach.

A restore test produces a new service with the data from an existing backup, relying on PostgreSQL point-in-time recovery (PITR). When a new test service is launched, it restores the base backup from Amazon S3 and replays all pending WAL files until it reaches a pre-defined point in time, at which point it detaches into a stand-alone instance.
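
Conceptually, the restore step is equivalent to a point-in-time pgBackRest restore such as the following sketch (the stanza name and target timestamp are placeholders):

```sh
# Restore the base backup from S3 and replay WAL up to a predefined point in time,
# then promote the instance to stand-alone operation.
pgbackrest --stanza=main --type=time \
    --target="2023-03-01 12:00:00+00" --target-action=promote restore
```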

Under the hood, we use Patroni, a well-known template for PostgreSQL high availability, to replace the regular PostgreSQL bootstrap sequence with a custom one that restores a backup from Amazon S3.

A Patroni feature called "custom bootstrap" allows defining arbitrary initialization steps instead of relying on initdb, the default PostgreSQL bootstrap command. Our custom bootstrap script calls pgBackRest, pointing it at the backup of the instance we are testing. (Side note: my colleague Feike Steenbergen and I were among the initial developers of Patroni earlier in our careers, so we're quite familiar with how to incorporate it into complex workflows like this one.)
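
In Patroni's configuration, a custom bootstrap method is declared roughly as follows; this is a sketch, and the method name, script path, and recovery target are illustrative rather than our exact setup:

```yaml
# Sketch of a Patroni "custom bootstrap" that restores from pgBackRest instead of running initdb.
bootstrap:
  method: restore_test
  restore_test:
    command: /scripts/pgbackrest_bootstrap.sh   # hypothetical wrapper around "pgbackrest restore"
    keep_existing_recovery_conf: false
    recovery_conf:
      restore_command: pgbackrest --stanza=main archive-get %f "%p"
      recovery_target_time: "2023-03-01 12:00:00+00"
      recovery_target_action: promote
```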

How Timescale conducts restore tests to validate PostgreSQL backups

Once we have verified that the backup can be restored without errors, we determine whether we have the right data. We check two properties of the restored backup: recentness and consistency. Since the outcome of the restore is a regular TimescaleDB instance, those checks simply run SQL queries against the resulting database.

Obviously, we have no visibility into users' data to verify that the restored backup is up to date. So to check for recentness, we inject a special row containing the timestamp of the beginning of the restore test into a dedicated bookkeeping table in the target service. (This table is not accessible or visible to users.) The test configures PostgreSQL point-in-time recovery (PITR), setting the recovery_target_time parameter to match that timestamp. When the instance's restore is complete, the scripts that Patroni runs at the post-bootstrap stage verify whether the row is there.
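
In SQL terms, the check boils down to something like the following; the schema, table, and column names here are hypothetical, not our actual internal names:

```sql
-- On the service being tested, at the start of the restore test:
INSERT INTO _timescale_bookkeeping.restore_test_marker (test_started_at)
VALUES (now());

-- On the restored instance (recovered up to recovery_target_time),
-- the post-bootstrap check verifies that the marker row arrived:
SELECT EXISTS (
    SELECT 1
    FROM _timescale_bookkeeping.restore_test_marker
    WHERE test_started_at = '2023-03-01 12:00:00+00'
) AS backup_is_recent;
```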

As a final safeguard, we check for consistency by verifying that the restored database is internally consistent. In this context, a restored backup is consistent if it produces the same results for a set of queries as the original service it is based on produced at the point in time when the backup was made.

The easiest way to check for consistency is to read every object in the target database and watch for errors. If the original instance produced no errors for a particular query at the time the backup was made, the restore of that backup should produce no errors either. We use pg_dump, PostgreSQL's built-in tool for producing SQL dumps.

Typically, pg_dump reads every row in the target database and writes its SQL representation to the dump file. Since we are not interested in the dump itself, we redirect the output to /dev/null to save disk space and improve performance, and we pass the "-s" flag to produce a schema-only dump that doesn't touch data rows: there is no need to read every data row when we are only interested in checking the system catalogs for consistency.
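
The consistency pass is essentially a one-liner along these lines (the connection string is a placeholder):

```sh
# Schema-only dump with the output discarded; errors or a non-zero exit code
# indicate that the restored system catalogs are not internally consistent.
pg_dump --schema-only "$RESTORED_SERVICE_URI" > /dev/null
```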

The deployer is responsible for scheduling the tests over the whole fleet. It employs an elegant hack – our favorite type of hack! – to do so by relying on certain Patroni behavior:

  • Patroni modifies Kubernetes endpoints to point PostgreSQL clients to the primary database instance. Patroni updates the list of addresses in the endpoint, as well as its annotations. As a result, every endpoint is touched regularly, as Patroni ensures the primary holds the leader lock for every instance.
  • The deployer installs a Kubernetes informer on the endpoints of running instances, which calls a custom callback every time an endpoint is created, updated, or deleted (see the Go sketch after this list).
  • The OnUpdate path lets the deployer evaluate, for every running instance, whether a restore test is necessary.
  • The restore test instance's endpoint triggers OnUpdate events of its own. We use them to check the restore test status and finish the test once it is done.
  • The deployer records each observed restore test status in a hypertable in the deployer database, together with the status change timestamp.
  • The deployer hypertable is used to limit the number of in-progress tests and provide useful statistics about the tests for our monitoring.
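
For illustration, here is a minimal sketch of that informer wiring in Go using client-go. It is not our actual deployer code, and maybeScheduleRestoreTest is a hypothetical placeholder for the scheduling logic:

```go
// Minimal sketch of an Endpoints informer with an update callback (client-go).
// Not Timescale's actual deployer code; maybeScheduleRestoreTest is a placeholder.
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

// maybeScheduleRestoreTest would decide whether the service behind this
// endpoint is due for a restore test and record the decision.
func maybeScheduleRestoreTest(ep *corev1.Endpoints) {
	_ = ep // placeholder
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()

	endpointsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Patroni touches every primary's endpoint regularly while refreshing
			// the leader lock, so this fires periodically for each running instance.
			maybeScheduleRestoreTest(newObj.(*corev1.Endpoints))
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // run until the process is stopped
}
```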

Summary

Timescale is designed to provide a worry-free experience and a trustworthy environment for your critical data. We believe that developers should never have to worry about the reliability of their database, and that they should have complete confidence their data will never be lost.

Backups provide a way to archive and store data so that it can be recovered in the future. But backups themselves need to be tested frequently, just like every other part of your infrastructure.

We’ve detailed how we’ve designed a system to frequently test Timescale backups, periodically scheduling recovery operations and validating that they’ve worked properly. This helps ensure that your backups can be successfully restored, giving you more peace of mind about the safety of your data in the case of a data loss event in your running service.

Of course, backups are only one part of a broader strategy for ensuring reliability. Among other things, Timescale's use of Kubernetes has allowed us to provide a decoupled compute and storage solution for more reliable and cost-effective fault tolerance.

All writes to WAL and data volumes are replicated to multiple physical storage disks for higher durability and availability. Even if a TimescaleDB instance fails (including from hardware failures), Kubernetes can spin up a new container that reconnects to the online storage volumes within tens of seconds, without ever needing to take the slower path of recovering from the backups in S3. But a deeper dive into this "instant recovery" approach for all services is a topic for a future post.

So, at Timescale, we modify that ancient proverb like so: “There are three kinds of database developers. Those who do backups, those who will do backups, and those who use Timescale and don’t have to think about them.”

If you’re new to TimescaleDB, create a free Timescale account to get started with a fully managed Timescale service (free for 30 days, no credit card required).

Once you’re using TimescaleDB, or if you’re already up and running, join the Timescale community to share your feedback, ask questions about time-series data (and databases in general), and more.

And, if you enjoy working on hard problems like testing automatic PostgreSQL backups in Kubernetes, share our mission, and want to join our fully remote, global team, we’re hiring broadly across many roles.
