Making PostgreSQL Backups 100x Faster via EBS Snapshots and pgBackRest
If you have experience running PostgreSQL in a production environment, you know that maintaining database backups is a daunting task. In the event of a catastrophic failure, data corruption, or other form of data loss, the ability to quickly restore from these backups will be vital for minimizing downtime. If you’re managing a database, maintaining your backups and getting your recovery strategy in order is probably the first check on your checklist.
Perhaps this has already given you one headache or two because creating and restoring backups for large PostgreSQL databases can be a very slow process.
The most widely used external tool for backup operations in PostgreSQL is pgBackRest, which is very powerful and reliable. But pgBackRest can also be very time-consuming, especially for databases well over 1 TB.
The problem is exacerbated when restoring backups from production databases that continue to ingest data, thus creating more WAL (write-ahead log) that must be applied. In this case, a full backup and restore can take hours or even days, which can be a nightmare in production databases.
When operating our platform (Timescale, a cloud database platform built on PostgreSQL), we struggled with this very thing. At Timescale, we pride ourselves in making PostgreSQL faster and more scalable for large volumes of time-series data—therefore, our customers’ databases are often large (many TBs). At first, we were completely basing our backup and restore operations in pgBackRest, and we were experiencing some pain:
- Creating full backups was very slow. This was a problem, for example, when our customers were trying to upgrade their PostgreSQL major version within our platform, as we took a fresh, full backup after upgrade in case there was a failure shortly after. Upgrades are already stressful, and adding a very slow backup experience was not helping.
- Restoring from backups was also too slow, both restoring from the backups themselves and replaying any WAL that had accrued since the last backup. (In Timescale, we automatically take full and incremental backups of all our customers’ databases.)
In this blog post, we’re sharing how we solved this problem by combining pgBackRest with EBS snapshots. Timescale runs in AWS, so we had the advantage of cloud-native infrastructure. If you're running PostgreSQL in AWS, you can perhaps benefit from a similar approach.
After introducing EBS snapshots, our backup creation and restore process got 100x faster. This significantly improved the experience for our customers and made things much easier for our team.
Quick Introduction to Database Backups in PostgreSQL (And Why We Used pgBackRest)
If you asked 100 engineers if they thought backups were important for production databases, they would all say "yes"—but if you then took those same 100 engineers and gave them a grade on their backups, most wouldn’t hit a pass mark.
We all collectively understand the need for backups, but it’s still hard to create an effective backup strategy, implement it, run it, and test that it’s working appropriately.
In PostgreSQL specifically, there are two ways to implement backups: logical database dumps, which contain the SQL commands needed to recreate (not restore) your database from scratch, and physical backups, which capture the files that store your database state.
Physical backups are usually paired with a mechanism to store the constant stream of write-ahead logs (WALs), which describe all data mutations on the system. A physical backup can then be restored to get PostgreSQL to the exact same state as it was when that backup was taken, and the WAL files rolled forward to get to a specific point in time, maybe just before someone (accidentally?) dropped all your data or your disk ate itself.
Logical backups are useful to recreate databases (potentially on other architectures), but maintaining physical backups is imperative for any production workload where uptime is valued. Physical backups are exact: they can be restored quickly and provide point-in-time recovery. In the rest of this article, we’ll discuss physical backups.
How are physical backups usually created in PostgreSQL?
- The first option is using the pg_basebackup command.
pg_basebackupcopies the data directory and optionally includes the WAL files, but it doesn’t support incremental backups and has limited parallelization capabilities. The whole process is very manual, too. If you’re using
pg_basebackup, you’ll instantly get the files you need to bootstrap a new database in a tarball or directory, but not much else.
- Tools like pgBackRest were designed to overcome the limitations of
pg_basebackup. pgBackRest allows for full and incremental backups, multi-threaded operations, and point-in-time recovery. It ensures data integrity by validating checksums during the backup process, supports different types of storage, and much more. In other words, pgBackRest is a robust and feature-rich tool, making it our choice for PostgreSQL backup operations.
The Problem With pgBackRest
But pgBackrest is not perfect: it reads and backs up files, causing an additional load on your system. This can cause performance bottlenecks that can complicate your backup and restore strategy, especially if you’re dealing with large databases.
Even though pgBackRest offers incremental backups and parallelization, it often gets slow when executing full backups over large data volumes or on an I/O-saturated system.
While you can sometimes rely on differential or incremental backups to minimize data (like we do in Timescale), there are situations in which creating full backups is unavoidable. Backups could also be taken on standby, but at the end of the day, you’re limited by how fast you can get data off your volumes.
We shared earlier the example of full database upgrades, but we're also talking about any other kind of migration, integrity checks, archival operations, etc. In Timescale, some of our most popular platform features (like forks, high-availability replicas, and read replicas) imply a data restore from a full backup.
Having a long-running full backup operation in your production database is not only inconvenient, it can also conflict with other high-priority DB tasks, affecting your overall performance. This was problematic for us.
The slowness of pgBackRest was also problematic when it was time to restore from these backups. It’s very good at CPU parallelization, but when you’re trying to write terabytes of data as fast as possible, I/O will be the bottleneck. When it comes to recovery time objective or RTO, every minute counts. In case of major failure, you want to get that database up as soon as possible.
Using EBS Snapshots to Speed Up the Creation of Backups
To speed up the process of creating fresh full backups, we decided to replace standard pgBackRest full backups with on-demand EBS snapshots.
Our platform runs in AWS, which comes with some advantages. Using snapshots is a much more cloud-native approach to the problem of backups compared to what’s been traditionally used in PostgreSQL management.
EBS snapshots create a point-in-time copy of a particular database: this snapshot can be restored, effectively making it a backup. The key is that taking a snapshot is significantly faster than the traditional approach with pgBackRest: in our case, our p90 snapshot time decreased by over 100x. This gap gets wider the larger your database is!
How did we implement this? Basically, we did a one-to-one replacement of pgBackRest. Instead of waiting for the pgBackRest fresh full backup to complete, we now take a snapshot. We still wait for the backup to complete, but the process is significantly faster via snapshots. This way, we get the quick snapshot but also the full data copy and checksumming for datafile integrity, which pgBackRest performs.
If a user experiences a failure shortly after an upgrade, we have a fresh backup—the snapshot—that we can quickly restore (we’ll cover how we handle restores next). We still take a fresh full backup using pgBackRest (yay for redundancy), but the key difference is that this happens after the upgrade process has been fully completed.
If a failure has happened, the service is available to our customer quickly: we don’t have to force them to wait for the lengthy pgBackRest process to finish before being able to use their service again.
The trade-offs for adopting this approach were minimal. The only downside to consider is that, by taking snapshots, we now have redundant backups (both snapshots and full backups), so we incur additional storage costs. But what we’ve gained (both in terms of customer satisfaction and our own peace of mind) is worth the price.
Combining EBS Snapshots and pgBackRest for Quick Data Restore: Taking Partial Snapshots, Replaying WAL
Solving the first problem we encountered with pgBackRest (i.e., slow creation of full backups) was relatively simple. We knew exactly when we needed an EBS snapshot to be created, as this process is always tied to a very specific workflow (e.g., performing a major version upgrade).
But we also wanted to explore using EBS snapshots to improve our data restore functionality. As we mentioned earlier, some popular features in the Timescale platform rely heavily on restores, including creating forks, high-availability replicas, and read replicas, all of which imply a data restore from a full backup.
This use case posed a slightly different and more difficult challenge since to restore from a full backup, such a backup needs to exist first, reflecting the latest state of the service.
To implement this, the first option we explored was taking an EBS snapshot when the user clicked “Create” a fork, read replica, or high-availability replica, to then restore from that snapshot. However, this process was still too slow for the end user. To get the performance we wanted, we had to think a bit beyond the naive approach and determine a way to take semi-regular snapshots across our fleet.
Fortunately, we already had a backup strategy for pgBackRest in place that we chose to mirror. Now, all Timescale services have EBS snapshots taken daily. For redundancy reasons and to verify file checksums, we still take our standard pgBackRest partial backups, but we don’t depend on them.
Once the strategy is solved, restoring data from an EBS snapshot mirrors a restore from pgBackRest very closely. We simply chose the corresponding EBS snapshot we wanted to restore—in the cases mentioned above, always the most recent—and then replayed any WAL that has accumulated since that restore point. Here, it is important to note that we still rely on pgBackRest to do our WAL management. pgBackRest works great for us here; nothing gets close in terms of parallel WAL streaming.
This EBS snapshotting and pgBackRest approach has given us great results so far. Using snapshots for restores has helped improve our product experience, also providing our customers with an even higher level of reliability. Keeping pgBackRest in parallel has given us peace of mind that we still have a traditional backup approach that validates our data as well as snapshots.
We’re continually improving our strategy though, for example, by being smarter about when we snapshot—e.g., by looking at the accumulated WAL since the last snapshot to determine if we need to snapshot certain services more frequently. This practice helps improve restore times by reducing the amount of WAL that would need to be replayed, which is often the bottleneck in this process.
On Snapshot Prewarming
One important trade-off with this EBS snapshot approach is the balance between deployment time and initial performance. One limitation of a snapshot restore is that not all blocks are necessarily prewarmed and may need to be fetched from S3 the first time they are used, which is a slow process.
To give props to pgBackRest restore, it does not have this issue. For our platform features, our trade-off was between getting the user a running read replica (or fork or high-availability replica) as quickly as possible or making sure it was as performant as possible.
After some back and forth, we decided on our current approach on prewarming: we’re reading as much as we can for five minutes, prioritizing the most recently modified files first. The idea here is that we will warm the data the user is actively engaging with first. After five minutes, we then hand the process off to PostgreSQL to continue reading the rest of the volume at a slower pace until it is complete. For the initial warming, we use a custom goroutine that reads concurrently from files.
Backing It Up
We are not completely replacing our pgBackRest backup infrastructure with EBS snapshots anytime soon: it is hard to give up on the effectiveness and reliability of pgBackRest.
But by combining EBS snapshots with pgBackRest across our infrastructure, we’ve been able to mitigate its performance problem significantly, speeding up our backup creation and restore process. This allows us to build a better product, providing a better experience to our customers.
If you’re experiencing the same pains we were experiencing with pgBackRest, think about experimenting with something similar! It may cost you a little extra money, but it can be very much worth it.
We still have work to do on our end: we will continue to iterate on the ideal snapshotting strategy across the fleet to minimize deployment times as much as possible. We are also looking at smarter ways to prewarm the snapshots and more applications for snapshots in general.
If any of these problems interest you, check out our open engineering roles (we’re hiring!). And if you are a PostgreSQL user yourself, sign up for a free Timescale trial and experience the result of EBS snapshots in action.