Database Backups and Disaster Recovery in PostgreSQL: Your Questions, Answered
When we ask our community about the elementary challenges they face with their PostgreSQL production databases, we often hear about three pain points: query speed, optimizing large tables, and managing database backups.
To help you handle the first two, we’ve written about how to improve your PostgreSQL performance through partitioning and fine-tuning your database, and we’ve also discussed tactics on how to reduce your database size to better manage large tables. Now, it’s time to talk about database backups!
In this article, we’ll answer some of the most frequently asked questions we get on the topic of database backup and recovery in PostgreSQL, including how we handle things in the Timescale platform.
Why Are PostgreSQL Database Backups Important?
Let’s start from the beginning: why is everyone so worried about database backups?
When we discuss backup and recovery, we’re referring to a set of processes and protocols established to safeguard your data from loss or corruption and restore it to a usable state:
- Backups involve creating copies of your data at regular intervals, copies that encapsulate your PostgreSQL database state at a specific point in time.
- Recovery, on the other hand, is the process of restoring data from these backups. If both things are taken care of (i.e., you always have up-to-date backups and a good recovery strategy in place), your PostgreSQL database will be resilient against failure, and you’ll be protected against data loss.
Effective backup management is not only about creating copies of data but also ensuring those copies are healthy, accurate, and up-to-date. To define a good backup strategy for your production PostgreSQL database, you need to consider a few aspects, including how frequently you will back up your database, where these backups will be stored, and how often you will audit them.
But your job isn’t finished once you get up-to-date and healthy database backups: you must also establish an effective disaster recovery protocol. No matter how careful you are, it’s a fact of database management that, when running a production database, failures will happen sooner or later—whether outages cause them, failed upgrades, corrupted hardware, or human error—you name it.
Your disaster recovery plan must encompass all the steps needed to restore data as quickly as possible after an incident, ensuring that your database is not just backed up but also recoverable in a timely and efficient manner.
What Is the Difference Between a Physical Backup and a Logical Backup in PostgreSQL?
In PostgreSQL, there are two main types of database backups: physical backups and logical backups.
- Physical backups capture the database's state at a specific point in time. They involve copying the actual PostgreSQL database data at the file system level.
- Logical backups involve exporting specific database objects or the entire database into a human-readable SQL file format. A logical backup contains SQL statements to recreate the database objects and insert data.
Logical backups can be highly granular, allowing for the backup of specific database objects like tables, schemas, or databases. They are also highly portable and can be used across different database systems or versions, making them popular for migrating small to medium databases. This is your common pg_dump/pg_restore.
But a main drawback of logical backups is speed. For large databases, the process of restoring from a logical backup is too slow to be useful as a sole disaster recovery mechanism (or migration mechanism, for that matter). Restoring from physical backups is faster than restoring from logical backups, and it’s exact; when putting together a disaster recovery strategy, you’ll be dealing with physical backups.
Physical Backups in PostgreSQL
Let’s explore some essential concepts around physical backups and how they can help you recover your database in case of failure.
File system backups
First, to be technically exact, physical backups are referred to as file system backups in PostgreSQL. As mentioned earlier, this refers to the process of directly copying the directories and files that PostgreSQL uses to store its data, resulting in a complete representation of the database at a specific moment in time.
Maintaining file system backups is an essential piece of every disaster recovery strategy and imperative in production databases. But putting together a solid disaster recovery plan requires other techniques beyond simply taking file system backups (or physical backups) regularly, especially if you’re dealing with large production databases.
First, taking physical backups of very large databases can be a rather slow and resource-intensive process that can conflict with other high-priority database tasks, affecting your overall performance. Second, physical backups are not enough to ensure consistency in case of failure, as they only reflect the database state at the time they were taken. To restore a database in case of failure, you’ll need another mechanism to be able to restore all the transactions that occurred between the moment the last backup was taken and the failure.
WAL and continuous archiving
That mechanism is WAL. WAL stands for Write-Ahead Logging, and it is a protocol that improves the reliability, consistency, and durability of a PostgreSQL database by logging changes before they are written to the actual database files.
WAL is key for assuring atomicity and durability in PostgreSQL transactions. By writing changes to a log before they're committed to the database, WAL ensures that either all the changes related to a transaction are made or none at all.
WAL is also essential for disaster recovery since, in the event of a failure, the WAL files can be replayed to bring the database back to a consistent state. The process of regularly saving and storing these WAL records in a secondary storage location, ensuring that they are preserved over the long term, is usually referred to as continuous archiving.
Keeping WAL records and a recent, healthy physical database backup ensures that your PostgreSQL database can be successfully restored in case of failure. The physical backup will get PostgreSQL to the same state as it was when the backup was taken, which hopefully was not so long ago, and the WAL files will be rolled forward right before things start failing.
You might be wondering why it’s necessary to keep up-to-date backups if WAL can be replayed. The answer is speed. Replaying WAL during a recovery process is time-consuming, especially when dealing with large datasets with complex transactions. Backups provide a snapshot of the database at a specific point in time, enabling quick restoration up to that point.
In the optimal recovery scenario, a recent backup is restored (e.g., from the previous day), and then WALs recorded post-backup are replayed to update the database to its most recent state. You don’t want to rely on WAL to reproduce two weeks’ worth of transactions.
What is point-in-time recovery (PITR) in PostgreSQL?
Lastly, let’s define the concept of PITR. Point-in-time recovery refers to the ability to restore a PostgreSQL database to any specific point in time due to direct user input. For example, if I perform an upgrade and, for whatever reason, decide to revert the change, I could choose to recover the database from any day before.
Behind the scenes, PITR in PostgreSQL is often anchored in WAL. By integrating a backup with the sequential replay of WAL, PostgreSQL can be restored to an exact moment.
Which Tools Can I Use to Manage Physical Backups in PostgreSQL?
Within the PostgreSQL ecosystem, there are multiple tools that help with the creation of physical backups, two of the most popular being pg_basebackup and pgBackRest.
pg_basebackup is the native tool offered by PostgreSQL for taking physical backups. It’s straightforward and reliable. It allows you to efficiently copy the data directory and include the WAL files to ensure a consistent and complete backup.
pg_basebackup comes with important limitations. Taking full backups of a large database can be a lengthy and resource-intensive process. A good workaround to mitigate this is to combine full backups with incremental backups, i.e., frequently copying the data that has changed since the last full backup (e.g., once a day) and creating full backups less frequently (e.g., once a week). However, incremental backups are not supported in pg_basebackup.
Pg_basebackup also has limited parallelization capabilities, which can slow down the creation of full backups even further. Additionally, the process is highly manual, requiring developers to closely monitor and manage the backup operations.
To address the constraints of pg_basebackup, tools like pgBackRest were built by the PostgreSQL community. pgBackRest introduces several important improvements:
- It supports both full and incremental backups.
- It introduces multi-threaded operations, accelerating the backup process for larger databases.
- It validates checksums during the backup process to ensure data integrity, offering an additional layer of security.
- It supports various storage solutions, offering flexibility in how and where backups are stored.
As we’ll explain in more detail in a later section, to manage our own backup and restore process in Timescale, we use pgBackRest, although we’ve implemented some hacks to speed up the process of executing full backups (pgBackRest can still be quite slow for creating backups in large databases).
Logical Backups in PostgreSQL
Now, let’s briefly cover logical backups. As we mentioned previously, logical backups involve exporting data into a human-readable format, such as SQL statements. This type of backup is generally more flexible and portable, making it handy to reproduce a database in another architecture (i.e., for migrations)—although recovering from a logical backup is quite a slow process, making them practical for migrating small to medium PostgreSQL production databases alone.
The most common way to create logical backups and restore from them is by using pg_dump/pg_restore:
- pg_dump is used to create logical backups of a PostgreSQL database. It generates a script file or other formats that contain SQL statements needed to reconstruct the database to the state it was at the backup time. You can use pg_dump to back up an entire database or individual tables, schemas, or other database objects.
- pg_restore is used to restore databases from backups created by pg_dump. Just as pg_dump offers granularity in creating backups, pg_restore allows for selective restoration of specific database objects, providing flexibility in the recovery process. While it is typically used with backups created by pg_dump, pg_restore is compatible with other SQL-compliant database systems, enhancing its utility as a migration tool.
When Should I Use Logical Backups, and When Should I Use Physical Backups in PostgreSQL?
Logical backups via pg_dump/pg_restore are mostly useful for creating testing databases or for database migrations. In terms of migrations, if you’re operating a production database, we only recommend going the pg_dump/pg_restore route if your database is small (<100GB).
Migrating larger and more complex databases via pg_dump/pg_restore might take too much time, during which your production database would be offline. There are other migration strategies you can follow to avoid this downtime, like the dual-write and backfill method.
Physical backups are mostly used for disaster recovery and for data archiving. If you are operating a production database, you’ll want to maintain up-to-date physical backups and WAL to recover your database when failure occurs. If your industry also requires you to keep copies of your data for a certain period of time due to regulations, physical backups will be the way to go.
In production applications, you’ll most likely use a combination of both logical and physical backups. For disaster recovery, physical backups will be your foundational line of defense, but logical backups can serve as additional assurance (redundancy is a good thing). For migrating large databases, you’ll most likely use a staged approach, combining logical backups with other tactics, and so on.
What About Replicas in PostgreSQL?
We're not focusing specifically on replicas and high availability in this article, but these concepts are related to the topic of backups and disaster recovery, so let’s cover them.
Replicas are continuously updated mirrors of the primary database, capturing every transaction and modification almost instantaneously. They're not the same as backups, but their usefulness in disaster recovery is indisputable: in the event of a failure, replicas can be promoted to serve as the primary database, ensuring minimal downtime while the damaged database is being restored.
Building a high-availability replica and failover mechanism would generally involve taking care of the following steps:
- The primary database should be configured to allow connections from replicas.
- Physical backups of the primary should be regularly created, e.g., using pgBackRest.
- WAL capturing all changes made to the database should be shipped to the replica, for example, via streaming replication. Replication can be synchronous, where each transaction is confirmed only when both primary and replica have received it, or asynchronous, where transactions are confirmed without waiting for the replica.
- Configurations for automatic failover should be established to promote a replica to become the primary database in case of a failure.
- Tools and scripts should be used to monitor replication lag and ensure the replica is up-to-date.
This setup can be considerably complex to maintain. Most providers of managed PostgreSQL databases, including Timescale, offer fully managed replicas as one of their services, making it much easier to run highly available databases.
Database Backups and Disaster Recovery: How Do We Do It?
The Timescale platform allows our customers to create fully managed PostgreSQL and TimescaleDB databases, meaning that we take care of the backup and disaster recovery process for them. Let’s run through how the platform handles all things backups, replication, upgrades, and restores.
How do backups work in Timescale?
Backups in Timescale are fully automated. Using pgBackRest under the hood, Timescale automatically creates one full backup every week and incremental backups every day. Timescale also keeps WAL files of any changes made to the database. This WAL can be replayed in the event of a failure to reproduce any transactions not captured by the last daily backup, e.g., to replay the changes made to your database during the last few hours. The two most recent full backups and WAL are stored in S3 volumes.
On top of the full and incremental backups taken by pgBackRest, Timescale also takes EBS snapshots daily. EBS snapshots create copies of the storage volume that can be restored, effectively making it a backup, and they are significantly faster than taking full backups via pgBackRest (about 100x faster).
By taking EBS snapshots daily (on top of the weekly full backups by pgBackRest), we introduce an extra layer of redundancy, ensuring that we always have a fresh snapshot that we can quickly restore if the customer experiences a critical failure that requires recovery from a full backup.
If I’m a Timescale customer, do I have to keep my own backups?
No. You can do it if you wish, but this would be (extra) redundant.
Disaster recovery in Timescale: What happens if my database fails?
Timescale is built on AWS with decoupled compute and storage, something that makes the platform especially resilient against failures. We can distinguish two types of failures that are dealt with differently by Timescale: compute and storage failures.
How Timescale handles compute failures
Compute failures are more frequent than storage failures, as they can be caused by things like unoptimized queries or other issues that result in a maxed-out CPU. To improve uptime for the customer, Timescale has developed a methodology that makes the platform recover extremely quickly from compute failures: we call this technique rapid recovery.
Since the compute and storage nodes are decoupled in Timescale, if the compute node fails, Timescale automatically spins up a new compute node, attaching the undamaged storage unit to it. Any WAL that was in memory then replays. How long this recovery process takes mostly depends on how much WAL needs replaying, but it is usually completed in less than thirty seconds. Under the hood, this entire process is automated via Kubernetes.
How Timescale handles storage failures
Storage failures are much less common than compute failures, but when they happen, they’re more severe. As we’ll cover in the next section, having a high-availability replica can be a life-saver in this circumstance; while your storage is being restored, instead of experiencing downtime, your replica will automatically take over.
To automatically restore your damaged storage, Timescale makes use of the backups it has on storage, reproducing WAL since the last incremental backup. The figure below illustrates the process:
How do replicas work in Timescale, and how do they help with recovery?
In Timescale, you can create two types of replicas:
- Read replicas are useful for read scaling. They’re used to liberate load from your primary database in read-heavy applications, for example, if you’re powering a BI tool or doing frequent reporting. Read replicas are read-only, and you can create as many as you need.
- High-availability replicas are exact, up-to-date copies of your database that automatically take over operations if your primary becomes unavailable.
Let's dig deeper into high-availability replicas since they’re closer to the topic of backup and recovery discussed in this article.
We’ve been talking about the importance of backups and disaster recovery, but there’s a related concept that’s important also to consider: the concept of high availability. In broad terms, a “highly available” database describes a database that’s able to stay running without significant interruption (perhaps no more than a few seconds) even in case of failure.
As we’ve mentioned previously, the process of recovering a large database from backup might take a while, even when you’ve done everything right—which is why it’s handy to have a replica running. Instead of waiting for the backup and restore process to finish, when your primary database fails, your connection will automatically failover to the replica, saving your own users any major downtime.
Failover also helps remove downtime for common operations which would normally cause a service to reset, like some upgrades. In these cases, Timescale makes changes to each node sequentially so that there is always a node available.
And speaking of upgrades…
How are upgrades handled in Timescale?
In Timescale, you’re running PostgreSQL databases with the TimescaleDB extension enabled. Therefore, during your Timescale experience, you’ll most likely experience three different types of upgrades:
These refer to upgrades between TimescaleDB versions, e.g., from TimescaleDB 2.11 to TimescaleDB 2.12. You don’t have to worry about these: they’re backward compatible, they require no downtime, and they will happen automatically during your maintenance window. Your Timescale services always run the latest available TimescaleDB version, so you can enjoy all the new features we ship.
PostgreSQL minor version upgrades
We always run the latest available minor version of PostgreSQL in Timescale as well, mostly for security reasons: these minor updates may contain security patches, data corruption problems, and fixes to frequent bugs.
These upgrades are also automatically handled by the platform during your maintenance window, and they are also backward compatible. However, they require a service restart, which could cause some downtime (30 seconds to a few minutes) if you do not have a replica. We will alert you ahead of time about these, so you can set your maintenance window to a low traffic time (e.g., middle of the night) to minimize consequences.
PostgreSQL major version upgrades
These refer to upgrading, for example, from PostgreSQL 15 to 16. These upgrades are different and more serious since they’re often not backward compatible. We cannot run these upgrades for you, as this might cause issues on your application. Besides, the downtime associated with upgrading major versions of PostgreSQL can be more severe (e.g., 20 minutes), and unfortunately, high-availability replicas can’t help you avoid downtime in this particular case.
Major PostgreSQL upgrades are always a significant lift. Timescale has some tools that will make the smoother; for example, you can initiate the upgrade process in a particular database by clicking a button in the UI, and before doing so, you can test your upgrade in a copy of your database to make sure nothing will break and have an accurate idea of how much downtime the upgrade will require. Read this article for more information.
Can I do PITR in Timescale, i.e., restore my database to a previous state at my own will?
Yes you can! All Timescale services allow PITR to any point in the last 3 days. If you're using our Enterprise plan, this timespan expands up to 14 days.
Having a solid backup and recovery strategy is top of mind for every PostgreSQL user. We hope this introductory article answers some of your questions; if you’d like to see more articles diving deeper into this topic, tell us on Twitter/X.
If you prefer not to worry about maintaining your backups and taking care of recovering your database when things fail, try Timescale, our managed PostgreSQL platform. It takes care of all things backups so you can focus on what matters (building and running your application) while experiencing the performance boost of TimescaleDB. You can start a free trial here, no credit card required.