Database Backups and Disaster Recovery in PostgreSQL: Your Questions, Answered

Database Backups and Disaster Recovery in PostgreSQL: Your Questions, Answered

When we ask our community about the elementary challenges they face with their PostgreSQL production databases, we often hear about three pain points: query speed, optimizing large tables, and managing database backups. We’ve covered the first two topics in articles about partitioning and fine-tuning your database. We’ve also discussed how to reduce your database size to better manage large tables.

In this guide, we’ll answer some of the most frequently asked questions about database backup and recovery in PostgreSQL. We’ll also discuss how we handle things in the Timescale platform.

Why Are PostgreSQL Database Backups Important?

When we discuss backup and recovery, we’re referring to a set of processes and protocols established to safeguard your data from loss or corruption and restore it to a usable state:

  • Backups involve creating copies of your data at regular intervals, copies that encapsulate the state of your PostgreSQL database at a specific point in time. 
  • Recovery, on the other hand, is the process of restoring data from these backups. If both things are taken care of (i.e., you always have up-to-date backups and a good recovery strategy in place), your PostgreSQL database will be resilient against failure, and you’ll be protected against data loss.  

Effective backup management is not only about creating copies of data. It’s also about ensuring those copies are healthy, accurate, and up-to-date. 

To define a good backup strategy for your production PostgreSQL database, you need to consider several aspects. This includes how frequently you will back up your database, where these backups will be stored, and how often you will audit them. 

But your job isn’t finished once you get up-to-date and healthy database backups. You must also establish an effective disaster recovery protocol. No matter how careful you are, it’s a fact of database management that failures will happen sooner or later. They can be caused by outages, failed upgrades, corrupted hardware, or human error—you name it.

Your disaster recovery plan must encompass all the steps to restore data as quickly as possible after an incident. This ensures that your database is not just backed up but also recoverable in a timely and efficient manner.

What Is the Difference Between a Physical Backup and a Logical Backup in PostgreSQL? 

In PostgreSQL, there are two main types of database backups: physical backups and logical backups. 

  • Physical backups capture the database's state at a specific point in time. They involve copying the actual PostgreSQL database data at the file system level. 
  • Logical backups involve exporting specific database objects or the entire database into a human-readable SQL file format. A logical backup contains SQL statements to recreate the database objects and insert data.

Logical backups can be highly granular, allowing for the backup of specific database objects like tables, schemas, or databases. They are also portable and can be used across different database systems or versions, making them popular for migrating small to medium databases. This is your common pg_dump/pg_restore.  

But a main drawback of logical backups is speed. For large databases, the process of restoring from a logical backup is too slow to be useful as a sole disaster recovery mechanism (or migration mechanism, for that matter). Restoring from physical backups is faster than restoring from logical backups, and it’s exact. When putting together a disaster recovery strategy, you’ll be dealing with physical backups.

A Guide to Physical Backups in PostgreSQL

Let’s explore some essential concepts around physical backups and how they can help you recover your database in case of failure. 

File system backups 

Physical backups are referred to as file system backups in PostgreSQL. This refers to the process of directly copying the directories and files that PostgreSQL uses to store its data, resulting in a complete representation of the database at a specific moment in time. 

Maintaining file system backups is an essential piece of every disaster recovery strategy and imperative in production databases. But putting together a solid disaster recovery plan requires other techniques beyond simply taking “physical” file system backups regularly. That’s especially true if you’re dealing with large production databases. 

Taking physical backups of very large databases can be a rather slow and resource-intensive process that conflicts with other high-priority database tasks, affecting your overall performance. Physical backups are not enough to ensure consistency in case of failure, as they only reflect the database state at the time they were taken. To restore a database in case of failure, you’ll need another mechanism to be able to restore all the transactions that occurred between the moment the last backup was taken and the failure. 

WAL and continuous archiving 

WAL stands for Write-Ahead Logging. It’s a protocol that improves the reliability, consistency, and durability of a PostgreSQL database by logging changes before they are written to the actual database files.

WAL is key for assuring atomicity and durability in PostgreSQL transactions. By writing changes to a log before they're committed to the database, WAL ensures that either all the changes related to a transaction are made or none at all. 

WAL is also essential for disaster recovery since, in the event of a failure, the WAL files can be replayed to bring the database back to a consistent state. The process of regularly saving and storing these WAL records in a secondary storage location, ensuring that they are preserved over the long term, is usually referred to as continuous archiving. 

Keeping WAL records and a recent, healthy physical database backup ensures that your PostgreSQL database can be successfully restored in case of failure. The physical backup will get PostgreSQL to the same state as it was when the backup was taken, which hopefully was not so long ago, and the WAL files will be rolled forward right before things start failing. 

You might be wondering why it’s necessary to keep up-to-date backups if WAL can be replayed. The answer is speed. Replaying WAL during a recovery process is time-consuming, especially when dealing with large datasets with complex transactions. Backups provide a snapshot of the database at a specific point in time, enabling quick restoration up to that point. 

In the optimal recovery scenario, you restore a recent backup (e.g., from the previous day. hen, you replay a WAL recorded post-backup to update the database to its most recent state. You don’t want to rely on WAL to reproduce two weeks’ worth of transactions.

What is point-in-time recovery (PITR) in PostgreSQL?

Point-in-time recovery refers to restoring a PostgreSQL database to any specific point in time due to direct user input. For example, if I perform an upgrade and, for whatever reason, decide to revert the change, I could choose to recover the database from any day before. 

Behind the scenes, PITR in PostgreSQL is often anchored in WAL. By integrating a backup with the sequential replay of WAL, PostgreSQL can be restored to an exact moment.

A Guide to PostgreSQL Physical Backup Tools  

There are multiple tools that help with the creation of physical backups, two of the most popular being pg_basebackup and pgBackRest

pg_basebackup

pg_basebackup is the native tool offered by PostgreSQL for taking physical backups. It’s straightforward and reliable. It allows you to efficiently copy the data directory and include the WAL files to ensure a consistent and complete backup.

pg_basebackup has important limitations. Taking full backups of a large database can be a lengthy and resource-intensive process. A good workaround to mitigate this is to combine full backups with incremental backups. For example, frequently copying the data that has changed since the last full backup (e.g., once a day) and creating full backups less frequently (e.g., once a week). However, incremental backups are not supported in pg_basebackup

pg_basebackup also has limited parallelization capabilities, which can further slow down the creation of full backups. The process is mostly manual, requiring developers to closely monitor and manage the backup operations.

pgBackRest

To address the constraints of pg_basebackup, the PostgreSQL community built tools like pgBackRest. pgBackRest introduces several important improvements:

  • It supports both full and incremental backups. 
  • It introduces multi-threaded operations, accelerating the backup process for larger databases. 
  • It validates checksums during the backup process to ensure data integrity, offering an additional layer of security.
  • It supports various storage solutions, offering flexibility in how and where backups are stored. 

We use pgBackRest to manage our own backup and restore process in Timescale, although we’ve implemented some hacks to speed up the full backup process (pgBackRest can still be quite slow for creating backups in large databases).

A Guide to Logical Backups in PostgreSQL 

Logical backups involve exporting data into a human-readable format, such as SQL statements. This type of backup is generally more flexible and portable, making it handy to reproduce a database in another architecture (i.e., for migrations).  However, recovering from a logical backup is quite a slow process. That makes them practical only for migrating small to medium PostgreSQL production databases. 

pg_dump/pg_restore 

The most common way to create logical backups and restore from them is by using pg_dump/pg_restore:

  • pg_dump creates logical backups of a PostgreSQL database. It generates a script file or other formats that contain SQL statements needed to reconstruct the database to the state it was at the backup time. You can use pg_dump to back up an entire database or individual tables, schemas, or other database objects.
  • pg_restore restores databases from backups created by pg_dump. Just as pg_dump offers granularity in creating backups, pg_restore allows for selective restoration of specific database objects, providing flexibility in the recovery process. While it is typically used with backups created by pg_dump, pg_restore is compatible with other SQL-compliant database systems, enhancing its utility as a migration tool.

When Should I Use Logical Backups, and When Should I Use Physical Backups in PostgreSQL? 

Logical backups via pg_dump/pg_restore are mostly useful for creating testing databases or for database migrations. In terms of migrations, if you’re operating a production database, we only recommend going the pg_dump/pg_restore route if your database is small (<100 GB).

Migrating larger and more complex databases via pg_dump/pg_restore might take your production database offline for too long. Other migration strategies, like the dual-write and backfill method, can avoid this downtime. 

Physical backups are mostly used for disaster recovery and data archiving. If you’re operating a production database, you’ll want to maintain up-to-date physical backups and WAL to recover your database when failure occurs. If your industry requires you to keep copies of your data for a certain period of time due to regulations, physical backups will be the way to go. 

In production applications, you’ll most likely use a combination of logical and physical backups. For disaster recovery, physical backups will be your foundational line of defense, but logical backups can serve as additional assurance (redundancy is a good thing). For migrating large databases, you’ll most likely use a staged approach, combining logical backups with other tactics, and so on.

What About Replicas in PostgreSQL? 

Replicas are continuously updated mirrors of the primary database, capturing every transaction and modification almost instantaneously. They're not the same as backups, but their usefulness in disaster recovery is indisputable. In the event of a failure, you can promote replicas to serve as the primary database, ensuring minimal downtime while you restore the damaged database. Building a high-availability replica and failover mechanism generally involves the following steps: 

  • The primary database should be configured to allow connections from replicas. 
  • Physical backups of the primary should be regularly created, e.g., using pgBackRest
  • WAL capturing all changes made to the database should be shipped to the replica, for example, via streaming replication. Replication can be synchronous, where each transaction is confirmed only when both primary and replica have received it, or asynchronous, where transactions are confirmed without waiting for the replica.
  • Configurations for automatic failover should be established to promote a replica to become the primary database in case of a failure.
  • Tools and scripts should be used to monitor replication lag and ensure the replica is up-to-date.

This setup can be considerably complex to maintain. Most providers of managed PostgreSQL databases, including Timescale, offer fully managed replicas as one of their services. This simplifies running highly available databases.

A Guide to Database Backups and Disaster Recovery with Timescale? 

The Timescale platform allows our customers to create fully managed PostgreSQL and TimescaleDB databases. That means we take care of the backup and disaster recovery process for them. Let’s run through how the platform handles backups, replication, upgrades, and restores.

How do backups work in Timescale? 

Backups in Timescale are fully automated. Using pgBackRest under the hood, Timescale automatically creates one full backup every week and incremental backups every day. 

Timescale also keeps WAL files of any changes made to the database. This WAL can be replayed in the event of a failure to reproduce any transactions not captured by the last daily backup. For example, it can replay the changes made to your database during the last few hours. Timescale stores the two most recent full backups and WAL in S3 volumes. 

On top of the full and incremental backups taken by pgBackRest, Timescale also takes EBS snapshots daily. EBS snapshots create copies of the storage volume that can be restored, effectively making it a backup. They are significantly faster than taking full backups via pgBackRest (about 100x faster). 

By taking EBS snapshots daily (on top of the weekly full backups by pgBackRest), we introduce an extra layer of redundancy, ensuring that we always have a fresh snapshot that we can quickly restore if the customer experiences a critical failure that requires recovery from a full backup. 

Disaster recovery in Timescale: What happens if my database fails? 

Timescale is built on AWS with decoupled compute and storage, something that makes the platform especially resilient against failures. There are two classes of failures that Timescale handles distinctly: compute and storage failures.

How Timescale handles compute failures 

Compute failures are more frequent than storage failures, as they can be caused by things like unoptimized queries or other issues that result in a maxed-out CPU. To improve uptime for the customer, Timescale has developed a methodology that makes the platform recover extremely quickly from compute failures. We call this technique rapid recovery

Timescale decouples the compute and storage nodes. So, if the compute node fails, Timescale automatically spins up a new compute node, attaching the undamaged storage unit to it. Any WAL that was in memory then replays. 

The length of this recovery process depends on how much WAL needs replaying. Typically, it completes in less than thirty seconds. Under the hood, this entire process is automated via Kubernetes. 

How Timescale handles storage failures 

Storage failures are much less common than compute failures, but when they happen, they’re more severe. Having a high-availability replica can be a lifesaver in this circumstance; while your storage is being restored, instead of experiencing downtime, your replica will automatically take over. 

To automatically restore your damaged storage, Timescale makes use of the backups it has on storage, reproducing WAL since the last incremental backup. The figure below illustrates the process:

Recovery from backup in Timescale
Recovery from backup in Timescale 

How do replicas work in Timescale, and how do they help with recovery? 

In Timescale, you can create two types of replicas:

  • Read replicas are useful for read scaling. They’re used to liberate load from your primary database in read-heavy applications, for example, if you’re powering a BI tool or doing frequent reporting. Read replicas are read-only, and you can create as many as you need.
  • High-availability replicas are exact, up-to-date copies of your database that automatically take over operations if your primary becomes unavailable. 

We’ve been talking about the importance of backups and disaster recovery. There’s a related concept that’s also important to consider: the concept of high availability. In broad terms, a “highly available” database describes a database that’s able to stay running without significant interruption (perhaps no more than a few seconds) even in case of failure. 

The process of recovering a large database from backup might take a while, even when you’ve done everything right. That’s why it’s handy to have a replica running. Instead of waiting for the backup and restore process to finish, when your primary database fails, your connection will automatically failover to the replica. That saves your own users any major downtime.

Failover also helps remove downtime for common operations that would normally cause a service to reset, like upgrades. In these cases, Timescale makes changes to each node sequentially so that there is always a node available. And speaking of upgrades… 

How are upgrades handled in Timescale? 

In Timescale, you’re running PostgreSQL databases with the TimescaleDB extension enabled. Therefore, during your Timescale experience, you’ll most likely experience three different types of upgrades: 

TimescaleDB upgrades

These refer to upgrades between TimescaleDB versions, e.g., from TimescaleDB 2.11 to TimescaleDB 2.12. You don’t have to worry about these. They’re backward compatible, they require no downtime, and they will happen automatically during your maintenance window. Your Timescale services always run the latest available TimescaleDB version, so you can enjoy all the new features we ship. 

PostgreSQL minor version upgrades

We always run the latest available minor version of PostgreSQL in Timescale as well, mostly for security reasons. These minor updates may contain security patches, data corruption problems, and fixes to frequent bugs. 

Timescale automatically handles these upgrades during your maintenance window, and they are also backward compatible. However, they require a service restart, which could cause some downtime (30 seconds to a few minutes) if you don’t have a replica. We alert you ahead of time about these, so you can set your maintenance window to a low traffic time (e.g., middle of the night) to minimize consequences. 

PostgreSQL major version upgrades 

These refer to upgrading, for example, from PostgreSQL 15 to 16. These upgrades are different and more serious since they’re often not backward compatible.

 We can’t run these upgrades for you, as this might cause issues on your application. Besides, the downtime associated with upgrading major versions of PostgreSQL can be more severe (e.g., 20 minutes). Unfortunately, in this particular case, high-availability replicas can’t help you avoid downtime.

Major PostgreSQL upgrades are always a significant lift. Timescale has some tools that will make the transition smoother. For example, you can initiate the upgrade process in a particular database with a click of a button Before doing so, you can test your upgrade in a copy of your database to make sure nothing will break and have an accurate idea of how much downtime the upgrade will require. Read this article for more information. 

Can I do PITR in Timescale, i.e., restore my database to a previous state at my own will? 

Yes, you can! All Timescale services allow PITR to any point in the last three days. If you're using our Enterprise plan, this timespan expands up to 14 days.

Stress-Free PostgreSQL Backups

Having a solid backup and recovery strategy is top of mind for every PostgreSQL user. We hope this introductory article answers some of your questions; if you’d like to see more articles diving deeper into this topic, tell us on Twitter/X.

If you prefer not to worry about maintaining your backups and taking care of recovering your database when things fail, try Timescale, our managed PostgreSQL platform. It takes care of all things backups so you can focus on what matters (building and running your application) while experiencing the performance boost of TimescaleDB. You can start a free trial here (no credit card required). 

Ingest and query in milliseconds, even at terabyte scale.
This post was written by
13 min read
PostgreSQL
Contributors

Related posts