Migrating a Terabyte-Scale PostgreSQL Database to Timescale With (Almost) Zero Downtime

Migrating a Terabyte-Scale PostgreSQL Database to Timescale With (Almost) Zero Downtime

(Almost) Zero Downtime Migrations (What, Why, and a Disclaimer)

Zero downtime database migrations are the unicorn of database management: they’re legendary, but deep down, we know they don’t exist. Regardless of which vendor you’re migrating to, you will always have to reconnect to your target database after the migration, incurring a (preferably minimal) downtime period. 

With that disclaimer out of the way, database migrations also don’t have to be a stressful, gloomy process. They can be effective, secure, and straightforward. Today, we’re introducing live migrations, a battle-tested, (almost) zero downtime migration strategy for Timescale that will enable you to migrate your terabyte-scale PostgreSQL database effortlessly and safely—because the only thing better than a unicorn is a workhorse.

And what’s best: in one fell swoop and using a few simple commands, you can migrate your data to Timescale and enable one of our most popular features—hypertables—in your target database so that automatic partitioning is already running smoothly once your data comes in.

Live migrations is the result of our extensive experience assisting our customers with the nerve-racking process of database migrations. It aims to simplify the transition process to Timescale, even if you're working with large PostgreSQL databases.

In this article, we’ll go over some of the traditional PostgreSQL database migration challenges and explain why Postgres’ logical replication feature wasn’t enough before fully delving into our new, almost zero downtime migration strategy.

💡
You can find the step-by-step instructions for performing the live migrations described in this article in our docs.

The Challenges of Migrating Large Databases (A Time-Series Example)

Let’s face it. Nobody wants to migrate their production database. But when your current PostgreSQL database no longer meets your application’s performance needs—even after some careful performance tuning—it’s time to move your data elsewhere.

Data migrations can be challenging, especially when you have terabytes of data. The more data you have to move, the longer the migration will take. At the scale of terabytes, it could take days or weeks to transfer the data, which only compounds the issue because you can't take your application offline for that long (if at all).

Plus, if you have time-series data coming at you fast and furiously, things only get more complicated. Most of our customers manage some kind of timestamped data (with all its specificities), so we’re particularly aware of these challenges.

For example, picture an IoT application using PostgreSQL as its primary database for storing sensor data and the management platform’s operations. This includes device registration, sensor readings, alerts, and device configurations. 

This IoT management platform performs 10,000 inserts per second, 4,000 updates per second, and 2,000 deletes per second. Additionally, it serves approximately 20,000 queries per second due to different API calls happening on the front end based on the actions of end users. The existing database size is 2.7 TB, a substantial amount of data to migrate.

Migrating 2.7 TB of data will take approximately 24-48 hours, depending on the network bandwidth, compute resources allocated in the source and target, and the machine where the migration occurs. As you can easily see, the main problem with this database migration is that downtime is not affordable, meaning the source database must remain uninterrupted until the migration is complete and the user is ready to make the switch. 

Additionally, the insert, update, and delete rate ranges from 10,000 to 2,000 per second—a significant amount of traffic and data operations occurring at the source. All these real-time operations must be replicated on the target end to ensure a smooth migration.

With all these moving parts, a thoughtful data migration demands minimal downtime, avoiding missed sensor readings and an unavailable IoT management platform for end users. The ultimate goal is to ensure the service remains accessible and the migration runs seamlessly.

Cue in live migrations: Traditionally, performing a migration while the data and operations are happening at the source would be concerning and sensitive. But, with our newest migration strategy (we have two others, pg_dump and pg_restore, and dual writes and backfill), the process becomes seamless because we do the heavy lifting for you.

Why Doesn't Postgres’ Built-In Logical Replication Suffice?

At this point, you’re probably wondering: “Wouldn’t it be simpler to use PostgreSQL’s native logical replication instead of creating a completely new migration strategy?” The short answer is “no.” 

PostgreSQL offers logical replication, a feature introduced in version 11, which aids in smooth migrations with minimal downtime. This feature operates on a PUBLISH/SUBSCRIBE model, with the source database publishing changes and subscribing to the target database.

Timescale’s automatic partitioning on hypertables uses PostgreSQL’s child tables and inheritance. Each child table (we call them chunks) is assigned a time range and only contains data from that range.

Normally, when you insert into the parent table, rows are written directly into the parent table, not the underlying child table. This is why TimescaleDB extends the planner to control re-route inserts into a hypertable to the right chunk. Unfortunately, Postgres' native logical replication applies changes directly to the parent table, bypassing TimescaleDB's logic to route changes to the correct chunk. 

The Timescale Engineering team is working hard to address this with upstream Postgres, but it is still a work in progress and will take time to be generally available. In fact, this is one of the main reasons we developed live migrations: we didn’t want our customers to wait for a speedy, minimal downtime migration from PostgreSQL to Timescale, and our two existing migration strategies didn’t suit all of our customers use cases and requirements. 

How Are We Solving the PostgreSQL to Timescale Migration Puzzle?

To create live migrations, we have developed a migration workflow for Timescale on top of pg_dump/pg_restore (for schema) and Postgres logical decoding (for live data). Let’s explore some of the basic concepts involved:

📖
What is logical decoding?
Logical decoding turns complex database changes into a simple format that's easy to understand without worrying about the database’s internal storage format. PostgreSQL achieves this by transforming the write-ahead log (WAL) into a more user-friendly format, like a series of data entries or SQL statements. 

What is write-ahead logging (WAL)?
You probably know that Postgres' transactions are ACID. The D stands for "durability" and ensures that once a transaction has been committed, its changes persist, even in the face of failures. PostgreSQL manages durability through the WAL, which is an append-only, persistent data structure that logs all modifications made to the database. By recording these changes to a file, the database ensures that each modification remains durable and recoverable.

💡 Tip: If you’re looking for more information on how to keep your data safe, be sure to check our article on PostgreSQL database backups.

Logical decoding facilitates the subscription to real-time database changes. These changes are streamed to subscribers in accessible formats, such as JSON. Subscribers can then convert these changes to SQL statements and apply them to the target database.

For our Timescale migration, we opted to leverage pgcopydb, a rising open-source tool in the Postgres community, rather than building our own logical decoding solution from the ground up.

While pgcopydb can do the complete migration in one go, we decided to use it only for historical data migration and live data replication. We wanted finer control over the schema migration process to enable features like hypertables (a Timescale feature that automatically partitions your data).

After the migration, you can still enable hypertables. However, converting non-empty plain tables to hypertables can lock the table for a significant amount of time, preventing any modifications to the table until the conversion is complete. This means that no new data can be written into the table while the conversion is in progress. The duration of the lock increases with the size of the table.

The Live Migrations Process

As mentioned, live migrations leverages logical decoding using pgcopydb to ensure changes made to the source database are applied to the target database. 

For example, when you use pg_dump to take a snapshot of the data in the source database, all changes made in the source after the pg_dump command are lost. With live migrations, these changes are stored and later replayed into the target database with the help of pgcopydb. 

How live migrations ensures the safe migration of terabytes of data from PostgreSQL to Timescale
How live migrations ensures the safe migration of terabytes of data from PostgreSQL to Timescale

Transactional consistency ensures that all operations within a transaction are either completed successfully or not at all. If a transaction is interrupted, such as due to a system crash or power outage, any changes made during that transaction are not saved, and the system returns to its original pre-transaction state. 

Thus, transactional consistency refers to the guarantee that the state of the target database will reflect the state of the source database from a specific point in time. When migrating data, it is important that the source and target databases remain consistent with each other. This means the following:

Completeness: If a transaction was committed on the source database and affected multiple rows or tables, those changes should be entirely replicated in the target database.

Order preservation: The order of transactions in the source should be preserved in the target. If transaction A is completed in the source before transaction B, the same order should be reflected in the target.

Transactional consistency is crucial for data integrity, preventing data anomalies and ensuring the migrated database remains a true and reliable reflection of the source database.

To maintain transactional consistency, the live migrations process involves moving the schema first, followed by enabling hypertables. Hypertables are Timescale's method of partitioning tables for optimal performance. If you want to learn more about hypertables and Postgres partitioning, check out this article

Once the database schema has been migrated and enabled with hypertable settings for the desired plain tables, logical decoding is initiated to subscribe to all real-time operations happening at the source. These operations are stored in an intermediate storage running pgcopydb until the data backfilling is complete. 

After initiating logical decoding, the data backfilling process begins. Once the data backfilling is successful, the real-time transactions stored in the intermediate storage are applied to the target database, keeping the source and target in sync in real time without the need for intermediate storage.

The time it takes for the source and target to be fully synchronized can vary, depending on the ingest load at the source. When the target catches up with the source, you can observe in the pgcopydb logs that the target is a few seconds to minutes behind the source. At this point, you can perform data integrity and verification checks. 

Once you are confident with the move, switch your applications to write to the target database. This will achieve near-zero downtime but may cause transactional inconsistency between the lag the migration is trying to catch up with and the latest changes made by your application to the target.

If the application's changes on the target do not depend on the latest few minutes that the migration is trying to catch up with, you can proceed with the migration by making the target the primary and discontinuing writes to the source.

However, if you prioritize transactional consistency, you will need to stop writes to the source and wait until the lag between the source and target reaches zero. Once the lag is eliminated, you can resume writes to the target. As writes to the source have been halted during the switch, pgcopydb will catch up with the few seconds to minutes of delay and indicate that the migration is in sync. At this stage, you can safely conclude the migration process since there will be no further changes to replicate from the source to the target.

Performance and reliability

The performance and reliability of pgcopydb are improving day by day. We have contributed fixes to improve its INSERT throughput and reliability in live data replication mode, as well as fixing memory leaks. Other performance improvements, like using COPY and pipeline mode, are currently underway.

For the historical data copying, pgcopydb uses the standard Postgres COPY protocol to stream the data from source to target. Pgcopydb not only supports parallel copying of multiple table data at once but also supports same table concurrency by splitting a table into multiple parts to accelerate the data movement and reduce the duration of the migration window.

Migration is inherently sequential: any changes to the source database must be buffered until the historical data transfer is completed. While typical logical replication tools like Postgres native replication and pglogical retain the WAL segments on the source database until the changes are consumed, this can result in excessive storage consumption and potential database failures. 

The creators of pgcopydb took a unique approach and chose to buffer the live changes into the volume attached to the computer where it runs. This grants users increased flexibility, allowing them to opt for larger storage capacities or even resort to cloud-based storage solutions like AWS S3.

Migrating a Large Postgres Workload to Timescale

Now that we’ve explained our live migration solution in detail, let’s see how you can actually move a large PostgreSQL workload to Timescale without batting an eye. 

Let’s use our earlier IoT example—remember the 10,000 inserts per second? Suppose we're storing them in a readings table ingesting data from various sensors, which, over time, is bound to grow because the developer wants to keep that data for a while. Without appropriate partitioning, this table’s insertions and queries could slow down. 

So, to tackle this issue, let’s migrate the data to Timescale. By doing so, we gain not only the advantages of automatic partitioning but also the perks of a managed service.

Migration prerequisites

Before diving in, ensure the following:

  1. For optimal performance and to minimize network latency, run the live-migration Docker image on a machine geographically near your source and target. For example, if your databases are hosted in the us-east-1 region, consider provisioning an EC2 instance in the same region. Ensure the instance is equipped with a minimum of 4 CPUs and 4 GB RAM for a smooth operation.
  2. The amount of disk required for live-migration to operate is a function of the time required to transfer historical data and the rate of DML operations executed on the source database. Let’s illustrate this with an example: with an insert rate of 10 K/s, pgcopydb writes about 6.5 MB/s to storage. This equates to around 23 GB every hour or 550 GB daily. If the transfer of historical data spans two days, the disk should have a minimum capacity of 1,100 GB.
  3. The source database must have “wal_level” set to “logical.” You can check by running the following command on the psql prompt connected to the source database.
psql "postgres://<user:password>@<source host>:<source port>" -c "SHOW wal_level";

 wal_level
-----------
 logical
  1. The Postgres user must have a REPLICATION attribute set on their role to create a replication slot in the source database. You can check this by running the following command on the psql prompt connected to the source database.
psql "postgres://<user:password>@<source host>:<source port>" -c "\du <user>";

 List of roles
 Role name |                   Attributes                    |                         Member of
-----------+-------------------------------------------------+-----------------------------------------------------------
 tsdbadmin | Create role, Create DB, Replication, Bypass RLS | {pg_read_all_stats,pg_stat_scan_tables,pg_signal_backend}

1. Set environment variables

To simplify the migration process, it's a good practice to set up your source and target database connection strings as environment variables. This way, you don't have to re-enter them every time you use a tool, reducing the risk of errors.

On most systems, you can set an environment variable using the command line:


export SOURCE=postgres://<user:password>@<source host>:<source port>
export TARGET=postgres://<user:password>@<target host>:<target port>

2. Install the live-migration Docker image

Install Docker runtime on the EC2 instance you're using for the migration.

docker run --rm -dit --name live-migration \
  -e PGCOPYDB_SOURCE_PGURI=$SOURCE \
  -e PGCOPYDB_TARGET_PGURI=$TARGET \
  -v ~/live-migration:/opt/timescale/ts_cdc \
timescale/live-migration:latest

The command will take a snapshot of your source database and migrate the schema to the target database. After migrating the schema, it will prompt you to create hypertables in the target database.

Ideally, tables that contain time-series data should be converted to hypertables. You need to run create_hypertable() for each table you want to convert to a hypertable in the target database. For more information, see the hypertable docs.

Once you have finished creating hypertables, you need to signal "continue" to proceed. You can do it by pressing the c key.

3. Wait for ANALYZE to complete

Next, the live-migration image will migrate the existing data in the source database to the target database and start streaming live transactions (live replay) received on the source side to the target. During this process, it will display the lag between the source and target databases regarding WAL offset size.

Watch Source DB - Target DB = > 126 MB

When the lag difference between the source and target databases is less than 30 megabytes, the Docker image will kick off ANALYZE on the target database. This updates statistics in the target database, which is necessary for optimal querying performance in the target database. Wait for ANALYZE to complete.

You can now start the validation step (4) to compare the data between the source and target databases. 

Important: Application downtime begins here in order to address the lag between the source and target databases and conduct data integrity checks. We recommend performing thorough data integrity checks before taking your application offline to minimize application downtime. During the downtime, allow the lag between the source and target to catch up, and then perform a final data integrity check and promote Timescale as the primary. This will significantly reduce the downtime.

4. Validate the migration

Once the lag between the databases is below 30 MB, and you're ready to take your applications offline, stop all applications writing to the source database. You’ve entered the downtime phase, which will last until you complete the validation step. As mentioned earlier, be sure to go through the validation step before entering the downtime phase to minimize overall downtime.

Stopping writes to the source database allows the live migration process to finish replicating the data to the target database, which will become evident once the replication lag falls to 0 MB.

Once the replication lag is 0, wait a few minutes and then provide the signal to proceed by pressing key c.

[WATCH] Source DB - Target DB => 0MB. Press "c" (and ENTER) to proceed
Syncing last LSN in Source DB to Target DB ...

And you’re done!

The live-migration image will continue the remaining work, which includes migrating sequences and cleaning up resources. You should see the following message if all the mentioned steps were successful.

Migration successfully completed

Validate the data in the target database and use it as the new primary

Now that you have migrated all the data, the contents of both databases should be the same. How exactly this should best be validated is dependent on your application. You could compare the number of rows or an aggregate of columns to validate that the target database matches the source.

Note: Application downtime stops here, as the lag between source and target is zero, and the data integrity checks are complete. 

Once you are confident with the data validation, the final step is to configure your applications to use the target database and promote Timescale as your primary database.

The future of live migrations

As you’ve probably noticed by now, we’ve put a lot of effort into our live migrations solution, but that doesn’t mean the work is done yet.

Currently, our migration workflow using pgcopydb only supports migrations from PostgreSQL to Timescale. However, we are in the process of enabling Timescale to Timescale live migrations as well. We are also actively working with the pgcopydb community to further improve the performance and reliability aspects of the tool.

In addition, we are enhancing the pre-migration and post-migration checks to streamline the end-to-end migration process. This will simplify the start of the migration process and make it easier to verify and switch the primary database to Timescale.

All of these enhancements are part of our overarching goal for this solution: we know how precious your data is and want to ensure a seamless and worry-free transition.

If you are looking to migrate your data from self-hosted TimescaleDB to Timescale, a fully managed cloud platform that makes PostgreSQL faster and fiercer, reach out! We’ll be happy to assist you. For feedback on live migrations, hit us up on Slack (#migrations channel) and be a part of our community. We'd love to hear from you!

🗣️
What our users have to say about our migration process:

"Migration has been a very fun process. My colleagues barely had to touch Timescale while I was away. It just ran, and it worked great. I never had to check anything. It was working in production without any maintenance whatsoever."

Björn Olafur Johannsson, Software Engineer, AVO

Next Steps

Live migrations is just the beginning of easing the migration process to Timescale. We are actively iterating on our solutions to ensure they are simple and effective, enabling data migrations (including heavy workloads) with just a few clicks and providing best-in-class performance and reliability for future migrations.

If you would like to move your data using live migrations (or one of our other migration strategies), please follow the Timescale documentation:

To try Timescale, sign up for a free trial (no credit card required). Timescale includes our core database, TimescaleDB, but enhances it with features designed for higher performance, faster queries, and cost savings.

(This post was originally published in October 2023 and updated in January 2024.)

Ingest and query in milliseconds, even at terabyte scale.
This post was written by
14 min read
Announcements & Releases
Contributors

Related posts