Migrating a Terabyte-Scale PostgreSQL Database to Timescale With (Almost) Zero Downtime


(Almost) Zero Downtime Migrations (What, Why, and a Disclaimer)

Zero downtime database migrations are the unicorn of database management: they’re legendary, but deep down, we know they don’t exist. Regardless of which vendor you’re migrating to, you will always have to reconnect to your target database after the migration, incurring a (preferably minimal) downtime period. 

With that disclaimer out of the way, database migrations also don’t have to be a stressful, gloomy process. They can be effective, secure, and straightforward. Today, we’re introducing live migrations, a battle-tested, (almost) zero downtime migration strategy for Timescale that will enable you to migrate your terabyte-scale PostgreSQL database effortlessly and safely—because the only thing better than a unicorn is a workhorse.

And the best part: in one fell swoop and using a few simple commands, you can migrate your data to Timescale and enable one of our most popular features—hypertables—in your target database so that automatic partitioning is already running smoothly once your data comes in.

Live migrations is the result of our extensive experience assisting our customers with the nerve-racking process of database migrations. It aims to simplify the transition process to Timescale, even if you're working with large PostgreSQL databases.

In this article, we’ll go over some of the traditional PostgreSQL database migration challenges and explain why Postgres’ logical replication feature wasn’t enough before fully delving into our new, almost zero downtime migration strategy. This is the first announcement of Cloud Week, the first of three consecutive launch weeks here at Timescale—stay tuned!

Ready to get started? You can find the step-by-step instructions for performing the live migrations described in this article in our docs.

The Challenges of Migrating Large Databases (A Time-Series Example)

Let’s face it. Nobody wants to migrate their production database. But when your current PostgreSQL database no longer meets your application’s performance needs—even after some careful performance tuning—it’s time to move your data elsewhere.

Data migrations can be challenging, especially when you have terabytes of data. The more data you have to move, the longer the migration will take. At the scale of terabytes, it could take days or weeks to transfer the data, which only compounds the issue because you can't take your application offline for that long (if at all).

Plus, if you have time-series data coming at you fast and furiously, things only get more complicated. Most of our customers manage some kind of timestamped data (with all its specificities), so we’re particularly aware of these challenges.

For example, picture an IoT application using PostgreSQL as its primary database for storing sensor data and the management platform’s operations. This includes device registration, sensor readings, alerts, and device configurations. 

This IoT management platform performs 10,000 inserts per second, 4,000 updates per second, and 2,000 deletes per second. Additionally, it serves approximately 20,000 queries per second due to different API calls happening on the front end based on the actions of end users. The existing database size is 2.7 TB, a substantial amount of data to migrate.

Migrating 2.7 TB of data will take approximately 24-48 hours, depending on the network bandwidth, compute resources allocated in the source and target, and the machine where the migration occurs. As you can easily see, the main problem with this database migration is that downtime is not affordable, meaning the source database must remain uninterrupted until the migration is complete and the user is ready to make the switch. 
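To see where estimates like "24-48 hours" come from, here is a back-of-the-envelope transfer-time calculation. The throughput figure is an illustrative assumption, not a measurement; real migrations vary with network bandwidth, compute, and parallelism.

```python
# Rough transfer-time estimate for a bulk data copy.
# The sustained throughput (MB/s) is a hypothetical, illustrative number.
def transfer_hours(db_size_gb: float, effective_mb_per_s: float) -> float:
    """Hours needed to copy db_size_gb at a sustained throughput."""
    seconds = (db_size_gb * 1024) / effective_mb_per_s
    return seconds / 3600

# 2.7 TB at a sustained ~32 MB/s works out to roughly a day;
# halve the throughput and you land near the two-day mark.
print(transfer_hours(2.7 * 1024, 32))
```

Doubling the sustained throughput halves the window, which is why running the migration machine close to both databases (covered later in this article) matters so much.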

Additionally, insert, update, and delete operations range from 2,000 to 10,000 per second, a significant amount of traffic and data operations occurring at the source. All these real-time operations must be replicated on the target end to ensure a smooth migration.

With all these moving parts, a thoughtful data migration demands minimal downtime, avoiding missed sensor readings and an unavailable IoT management platform for end users. The ultimate goal is to ensure the service remains accessible and the migration runs seamlessly.

Enter live migrations. Traditionally, performing a migration while data and operations keep flowing into the source would be a delicate, sensitive affair. But with our newest migration strategy (we have two others: pg_dump and pg_restore, and dual writes and backfill), the process becomes seamless because we do the heavy lifting for you.

Why Doesn't Postgres’ Built-In Logical Replication Suffice?

At this point, you’re probably wondering: “Wouldn’t it be simpler to use PostgreSQL’s native logical replication instead of creating a completely new migration strategy?” The short answer is “no.” 

PostgreSQL offers logical replication, a feature introduced in version 10, which aids in smooth migrations with minimal downtime. This feature operates on a PUBLISH/SUBSCRIBE model: the source database publishes its changes, and the target database subscribes to them.

Timescale’s automatic partitioning on hypertables uses PostgreSQL’s child tables and inheritance. Each child table (we call them chunks) is assigned a time range and only contains data from that range.

Normally, when you insert into the parent table, rows are written directly into the parent table, not the underlying child table. This is why TimescaleDB extends the planner to re-route inserts on a hypertable to the right chunk. Unfortunately, Postgres' native logical replication applies changes directly to the parent table, bypassing TimescaleDB's logic to route changes to the correct chunk.

The Timescale Engineering team is working hard to address this with upstream Postgres, but it is still a work in progress and will take time to be generally available. In fact, this is one of the main reasons we developed live migrations: we didn’t want our customers to wait for a speedy, minimal downtime migration from PostgreSQL to Timescale, and our two existing migration strategies didn’t suit all of our customers’ use cases and requirements. 

How Are We Solving the PostgreSQL to Timescale Migration Puzzle?

To create live migrations, we have developed a migration workflow for Timescale on top of pg_dump/pg_restore (for schema) and Postgres logical decoding (for live data). Let’s explore some of the basic concepts involved:

What is logical decoding?
Logical decoding turns complex database changes into a simple format that's easy to understand without worrying about the database’s internal storage format. PostgreSQL achieves this by transforming the write-ahead log (WAL) into a more user-friendly format, like a series of data entries or SQL statements. 

What is write-ahead logging (WAL)?
You probably know that Postgres' transactions are ACID. The D stands for "durability" and ensures that once a transaction has been committed, its changes persist, even in the face of failures. PostgreSQL manages durability through the WAL, which is an append-only, persistent data structure that logs all modifications made to the database. By recording these changes to a file, the database ensures that each modification remains durable and recoverable.

💡 Tip: If you’re looking for more information on how to keep your data safe, be sure to check our article on PostgreSQL database backups.

Logical decoding facilitates the subscription to real-time database changes. These changes are streamed to subscribers in accessible formats, such as JSON. Subscribers can then convert these changes to SQL statements and apply them to the target database.
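To make the "subscribers convert changes to SQL" step concrete, here is a minimal sketch of turning a wal2json-style JSON change into an INSERT statement. It handles inserts only and skips identifier quoting and proper value escaping; the table and column names are invented for illustration.

```python
import json

def change_to_sql(change: dict) -> str:
    """Render one wal2json-style 'insert' change as an INSERT statement.
    Simplified sketch: no quoting/escaping, inserts only."""
    cols = ", ".join(change["columnnames"])
    vals = ", ".join(repr(v) for v in change["columnvalues"])
    return f'INSERT INTO {change["schema"]}.{change["table"]} ({cols}) VALUES ({vals});'

# A message shaped like wal2json's output (sample data is made up).
msg = json.loads('''{"change": [{"kind": "insert", "schema": "public",
  "table": "readings", "columnnames": ["time", "sensor_id", "value"],
  "columnvalues": ["2023-09-25 10:00:00", 42, 21.5]}]}''')

for c in msg["change"]:
    if c["kind"] == "insert":
        print(change_to_sql(c))
```

In practice, a tool like pgcopydb does this transformation (plus updates, deletes, quoting, and transaction boundaries) for you; the sketch only shows the general shape of the decode-then-apply idea.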

For our Timescale migration, we opted to leverage pgcopydb, a rising open-source tool in the Postgres community, rather than building our own logical decoding solution from the ground up.

While pgcopydb can do the complete migration in one go, we decided to use it only for historical data migration and live data replication. We wanted finer control over the schema migration process to enable features like hypertables (a Timescale feature that automatically partitions your data).

After the migration, you can still enable hypertables. However, converting non-empty plain tables to hypertables can lock the table for a significant amount of time, preventing any modifications to the table until the conversion is complete. This means that no new data can be written into the table while the conversion is in progress. The duration of the lock increases with the size of the table.

The Live Migrations Process

As mentioned, live migrations leverages logical decoding using pgcopydb to ensure changes made to the source database are applied to the target database. 

For example, when you use pg_dump to take a snapshot of the data in the source database, all changes made in the source after the pg_dump command are lost. With live migrations, these changes are stored and later replayed into the target database with the help of pgcopydb. 

How live migrations ensures the safe migration of terabytes of data from PostgreSQL to Timescale

Transactional consistency ensures that all operations within a transaction are either completed successfully or not at all. If a transaction is interrupted, such as due to a system crash or power outage, any changes made during that transaction are not saved, and the system returns to its original pre-transaction state. 

Thus, transactional consistency refers to the guarantee that the state of the target database will reflect the state of the source database from a specific point in time. When migrating data, it is important that the source and target databases remain consistent with each other. This means the following:

Completeness: If a transaction was committed on the source database and affected multiple rows or tables, those changes should be entirely replicated in the target database.

Order preservation: The order of transactions in the source should be preserved in the target. If transaction A is completed in the source before transaction B, the same order should be reflected in the target.

Transactional consistency is crucial for data integrity, preventing data anomalies and ensuring the migrated database remains a true and reliable reflection of the source database.
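The two properties above (completeness and order preservation) can be sketched in a few lines: buffered transactions are replayed in commit order, and each one is applied all-or-nothing. This is an illustrative sketch using sqlite3 as a stand-in target; the transaction list is made up.

```python
import sqlite3

# Buffered transactions, ordered by commit position on the source.
buffered_txns = [
    ["INSERT INTO readings VALUES (1, 21.5)",
     "INSERT INTO readings VALUES (2, 22.0)"],
    ["UPDATE readings SET value = 23.0 WHERE id = 1"],
]

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE readings (id INTEGER, value REAL)")

for txn in buffered_txns:   # order preservation: replay in commit order
    with target:            # completeness: commit all statements or none
        for stmt in txn:
            target.execute(stmt)

print(target.execute("SELECT value FROM readings WHERE id = 1").fetchone()[0])
```

If any statement in a transaction fails, the `with` block rolls the whole transaction back, so the target never exposes a partially applied source transaction.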

To maintain transactional consistency, the live migrations process involves moving the schema first, followed by enabling hypertables. Hypertables are Timescale's method of partitioning tables for optimal performance. If you want to learn more about hypertables and Postgres partitioning, check out this article.

Once the database schema has been migrated and hypertables have been enabled for the desired plain tables, logical decoding is initiated to subscribe to all real-time operations happening at the source. These operations are stored by pgcopydb in intermediate storage until the data backfilling is complete.

After initiating logical decoding, the data backfilling process begins. Once the data backfilling is successful, the real-time transactions stored in the intermediate storage are applied to the target database, keeping the source and target in sync in real time without the need for intermediate storage.

The time it takes for the source and target to be fully synchronized can vary, depending on the ingest load at the source. When the target catches up with the source, you can observe in the pgcopydb logs that the target is a few seconds to minutes behind the source. At this point, you can perform data integrity and verification checks. 

Once you are confident with the move, switch your applications to write to the target database. This achieves near-zero downtime, but it may cause transactional inconsistency between the backlog of source changes the migration is still catching up on and the latest changes your application makes on the target.

If the application's changes on the target do not depend on the latest few minutes that the migration is trying to catch up with, you can proceed with the migration by making the target the primary and discontinuing writes to the source.

However, if you prioritize transactional consistency, you will need to stop writes to the source and wait until the lag between the source and target reaches zero. Once the lag is eliminated, you can resume writes to the target. As writes to the source have been halted during the switch, pgcopydb will catch up with the few seconds to minutes of delay and indicate that the migration is in sync. At this stage, you can safely conclude the migration process since there will be no further changes to replicate from the source to the target.

Performance and reliability

The performance and reliability of pgcopydb are improving day by day. We have contributed fixes to improve its INSERT throughput and reliability in live data replication mode, as well as fixes for memory leaks. Other performance improvements, like using COPY and pipeline mode, are currently underway.

For the historical data copying, pgcopydb uses the standard Postgres COPY protocol to stream the data from source to target. Pgcopydb not only supports parallel copying of multiple table data at once but also supports same table concurrency by splitting a table into multiple parts to accelerate the data movement and reduce the duration of the migration window.

Migration is inherently sequential: any changes to the source database must be buffered until the historical data transfer is completed. While typical logical replication tools like Postgres native replication and pglogical retain the WAL segments on the source database until the changes are consumed, this can result in excessive storage consumption and potential database failures. 

The creators of pgcopydb took a unique approach and chose to buffer the live changes into the volume attached to the computer where it runs. This grants users increased flexibility, allowing them to opt for larger storage capacities or even resort to cloud-based storage solutions like AWS S3.

Migrating a Large Postgres Workload to Timescale

Now that we’ve explained our live migration solution in detail, let’s see how you can actually move a large PostgreSQL workload to Timescale without batting an eye. 

Let’s use our earlier IoT example—remember the 10,000 inserts per second? Suppose we're storing them in a readings table ingesting data from various sensors, which, over time, is bound to grow because the developer wants to keep that data for a while. Without appropriate partitioning, this table’s insertions and queries could slow down.

So, to tackle this issue, let’s migrate the data to Timescale. By doing so, we gain not only the advantages of automatic partitioning but also the perks of a managed service.

Before diving in, ensure the following:

  1. For optimal performance, run pgcopydb on a machine geographically near both your source and target to minimize network latency. For example, if your databases are hosted in the us-east-1 region, consider provisioning an EC2 instance in the same region. Ensure the instance is equipped with a minimum of 4 CPUs and 4 GB RAM for smooth operation.
  2. The amount of disk required for pgcopydb to operate is a function of the time required to transfer historical data and the rate of DML operations executed on the source database. To illustrate, with an insert rate of 10 K/s, pgcopydb writes about 6.5 MB/s to storage. This equates to around 23 GB every hour or 550 GB daily. If the transfer of historical data spans two days, the disk should have a minimum capacity of 1,100 GB.
  3. Install pg_dump, psql, and pgcopydb on the computer you're using for the migration. If you are using a Debian-based distro, run the following to install the tools:
sudo apt-get install postgresql-client pgcopydb
  4. The source database must have “wal_level” set to “logical.” You can check by running the following command on the psql prompt connected to the source database.
psql "postgres://<user:password>@<source host>:<source port>" -c "SHOW wal_level";

  5. The Postgres user must have a REPLICATION attribute set on their role to create a replication slot in the source database. You can check this by running the following command on the psql prompt connected to the source database.
psql "postgres://<user:password>@<source host>:<source port>" -c "\du <user>";

 List of roles
 Role name |                   Attributes                    |                         Member of
 tsdbadmin | Create role, Create DB, Replication, Bypass RLS | {pg_read_all_stats,pg_stat_scan_tables,pg_signal_backend}
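The disk-sizing rule of thumb from point 2 above boils down to simple arithmetic: buffer write rate times the expected backfill duration. A quick sketch (the 6.5 MB/s figure comes from the text; the headroom factor is a suggested safety margin, not a pgcopydb requirement):

```python
def buffer_disk_gb(write_mb_per_s: float, backfill_hours: float,
                   headroom: float = 1.0) -> float:
    """Disk (GB) needed to buffer decoded changes while historical data copies.
    write_mb_per_s: observed pgcopydb buffer write rate (e.g., ~6.5 MB/s at
    10 K inserts/s, per the text). headroom scales the estimate up."""
    return write_mb_per_s * 3600 * backfill_hours * headroom / 1024

# Two days of backfill at ~6.5 MB/s needs roughly 1.1 TB of buffer space.
print(round(buffer_disk_gb(6.5, 48)))  # -> 1097
```

In practice you would measure your own write rate during a trial run and add comfortable headroom, since running out of buffer disk mid-migration forces a restart.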

To simplify the migration process, it's a good practice to set up your source, target database connection strings, and pgcopydb working directory as environment variables. This way, you don't have to re-enter them every time you use a tool, reducing the risk of errors.

On most systems, you can set an environment variable using the command line:

export SOURCE=postgres://<user:password>@<source host>:<source port>
export TARGET=postgres://<user:password>@<target host>:<target port>

export PGCOPYDB_DIR=/path/to/large/volume/mount

PGCOPYDB_DIR is the directory where pgcopydb retains its operational data, including the buffered live replication data captured during the change data capture phase.

1. Begin logical decoding

Kickstart the migration by running pgcopydb in follow mode. It does the following:

  1. It creates a replication slot in the source database, which represents a stream of changes that can be replayed to a client in the order they were made on the origin server. The changes emitted by the slot are buffered into the disk.
  2. It creates and exports a snapshot in the source database to migrate the data that exists in the database (historical data) prior to the creation of the replication slot. The exported snapshot will be available in the $PGCOPYDB_DIR/snapshot file.

After creating the replication slot and snapshot, it will automatically start buffering changes happening in the source into intermediate files under the $PGCOPYDB_DIR/cdc directory. The pgcopydb follow command only takes care of the live data replication part; the schema and historical data still have to be migrated separately, which we discuss in the next sections.

pgcopydb follow \
  --dir "$PGCOPYDB_DIR" \
  --source "$SOURCE" \
  --target "$TARGET" \
  --fail-fast \
  --plugin wal2json > pgcopydb_follow.log 2>&1 &

The above command will be active during most of the migration process. You can run it in a separate terminal instance or start it in the background. The > pgcopydb_follow.log 2>&1 & suffix redirects all messages to the pgcopydb_follow.log file and runs the command in the background; this is optional but recommended. The pgcopydb follow command outputs many messages, and if they are not redirected, using the terminal becomes cumbersome due to the constant stream of output.

Important: To replicate DELETE and UPDATE operations, the source table must have REPLICA IDENTITY set. This assists logical decoding in identifying the modified rows. Although it defaults to using the table's PRIMARY KEY as REPLICA IDENTITY, if no PRIMARY KEY is available, you'll need to manually set REPLICA IDENTITY to enable replication of UPDATE and DELETE operations.

2. Migrate schema

Use the pg_dump command to dump only the schema without any data.

pg_dump -d "$SOURCE" \
  --format=plain \
  --quote-all-identifiers \
  --no-tablespaces \
  --no-owner \
  --no-privileges \
  --schema-only \
  --file=dump.sql \
  --snapshot=$(cat ${PGCOPYDB_DIR}/snapshot)

Apply the schema dump into the TARGET Database:

psql "$TARGET" -v ON_ERROR_STOP=1 --echo-errors \
	-f dump.sql

This is the ideal point to convert regular tables into hypertables. In simple terms, you might want to convert the tables that contain time-series data to leverage Timescale’s automatic time-based partitioning. For instance, you can set up the readings table as a hypertable by running the following command.

psql "$TARGET" -c "SELECT create_hypertable('readings', 'time', chunk_time_interval=>'1 week'::interval);"

You may want to repeat this for all the tables that can benefit from Timescale’s automatic partitioning.

Note: The conversion of the normal table into a hypertable has the downside of locking the table until the conversion completes. The larger the table, the longer it is going to remain locked, and it doesn’t allow any modification to the table during that time.

3. Migrate historical data

The following command migrates the historical data from source to target using the efficient PostgreSQL COPY command. It can not only copy multiple tables in parallel but also split a single large table into multiple smaller parts and copy those in parallel (see --split-tables-larger-than).

pgcopydb copy table-data \
  --dir "$PGCOPYDB_DIR" \
  --source "$SOURCE" \
  --target "$TARGET" \
  --snapshot $(cat ${PGCOPYDB_DIR}/snapshot) \
  --split-tables-larger-than 10G \
  --table-jobs 8

By using the snapshot created in step 1, it guarantees that only the data that existed in the database prior to the creation of the replication slot is copied.

Depending on the historical data size, the command takes anywhere from a few hours to several days. While it's copying, pgcopydb follow, which we started in step 1, will keep track of any new changes made in the source database and save them to disk. 

4. Apply the replication changes

After the completion of historical data migration, instruct pgcopydb to start applying the buffered changes stored in the file, followed by live data replication.

pgcopydb stream sentinel set apply --source "$SOURCE"

The replication progress can be monitored by querying the pg_stat_replication view:

psql "$SOURCE" \
  -f - <<'EOF'
SELECT write_lag, flush_lag, replay_lag
FROM pg_stat_replication
WHERE application_name='pgcopydb' AND state='streaming';
EOF
It shows three different values:

  1. write_lag: the time elapsed between flushing recent WAL on the SOURCE and receiving notification that pgcopydb has written it to disk.
  2. flush_lag: the time elapsed between flushing recent WAL on the SOURCE and receiving notification that pgcopydb has written and flushed it to disk.
  3. replay_lag: the time elapsed between flushing recent WAL on the SOURCE and receiving notification that pgcopydb has written and flushed it to disk and applied it on the TARGET.

We are most interested in replay_lag because it shows the replication lag, as a time duration, between the source and the target. In an ideal scenario, replay_lag should come down gradually over time and eventually oscillate between a few seconds and a few minutes.

If you notice the replay_lag is getting longer, this might mean:

  1. The target database is down or unreachable.
  2. The target database is struggling to keep up with the source due to limited resources.
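If you script the monitoring, a simple trend check on sampled replay_lag values can distinguish "catching up" from "falling behind." This is an illustrative sketch; the window size and sample values are arbitrary, and fetching the samples from pg_stat_replication is not shown.

```python
def lag_is_growing(samples: list[float], window: int = 5) -> bool:
    """True if the most recent `window` lag samples (seconds) are strictly rising."""
    recent = samples[-window:]
    return len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:]))

healthy = [120, 90, 60, 45, 30, 20, 12]   # lag shrinking: target catching up
worrying = [30, 45, 70, 110, 180]         # lag rising: investigate

print(lag_is_growing(healthy), lag_is_growing(worrying))
```

A steadily rising trend over several samples is a stronger signal than any single reading, since replay_lag naturally fluctuates with the ingest load on the source.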

5. Prepare to switchover

If all goes well, and once replay_lag comes down to a few seconds, it is time to update the target's table statistics to get better query performance after the switchover.

psql $TARGET -c "ANALYZE;"

Upon completion of ANALYZE, it is recommended to inspect the target database by executing a few queries used in your application to verify data integrity, even if the target is slightly behind (a few seconds to minutes) the source. Performing these checks prior to the switchover would avoid last-minute surprises after promoting the target database as the primary. 

Once the preliminary inspections on the target are positive, it is time to stop the write traffic to the source database so that the pending changes can be applied to make both the source and target identical. With no new data being ingested into the source, the replication slot should not emit any more changes, and replay_lag should decrease considerably faster.

Let's instruct pgcopydb to stop once it's done applying the changes that have been generated up to this point:

pgcopydb stream sentinel set endpos --current --source "$SOURCE"

Postgres logical decoding doesn’t replicate sequences. The following command copies the latest values of all sequences from the source to the target.

pgcopydb copy sequences --dir "$PGCOPYDB_DIR" \
  --source "$SOURCE" \
  --target "$TARGET" \
  --resume
Wait for the pgcopydb follow process from step 1 to finish its execution. The process runs until the end position is reached. If you started pgcopydb follow in the background, you can bring it to the foreground with the fg command.

A successful execution of the pgcopydb follow command should have logs stating that the end position has been reached and that the process is done.

07:58:28.859 119 INFO   Transform reached end position 0/2ECDB58 at 0/2ECDB58
07:58:29.108 120 INFO   Replay reached end position 0/2ECDB58 at 0/2ECDB58
07:58:29.168 120 INFO   Replayed up to replay_lsn 0/1AB61F8, stopping
07:58:29.473 8 INFO   Subprocesses for prefetch, transform, and catchup have now all exited
07:58:29.534 8 INFO   Current sentinel replay_lsn is 0/2ECDB58, endpos is 0/2ECDB58
07:58:29.534 8 INFO   Current endpos 0/2ECDB58 has been reached at 0/2ECDB58
07:58:29.534 8 INFO   Follow mode is now done, reached replay_lsn 0/2ECDB58 with endpos 0/2ECDB58

If the command output was redirected to a file, the messages won't be shown in the terminal even if you bring the process to the foreground. In this case, inspect the log file.

Now that all data has been moved, the contents of both databases should be the same. How exactly this should best be validated is dependent on the application. One option is to compare the number of rows in the source and target tables, although this reads all data in the table, which may have an impact on the production workload. 
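The row-count comparison mentioned above is easy to script. This sketch shows only the comparison logic; in practice you would fill the two dictionaries by running SELECT count(*) per table against $SOURCE and $TARGET (via psql or a database driver, not shown), and the table names and counts here are made up.

```python
def diff_counts(source: dict[str, int], target: dict[str, int]) -> dict:
    """Tables whose row counts differ between source and target."""
    tables = source.keys() | target.keys()
    return {t: (source.get(t), target.get(t))
            for t in tables if source.get(t) != target.get(t)}

source_counts = {"readings": 1_000_000, "devices": 5_000, "alerts": 120}
target_counts = {"readings": 1_000_000, "devices": 4_998, "alerts": 120}

print(diff_counts(source_counts, target_counts))  # -> {'devices': (5000, 4998)}
```

For very large tables where full counts are too expensive, comparing counts over a recent time range, or checksumming a sample of rows, keeps the load on the production workload manageable.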

The final step is to switch to the target database as your primary by changing the application’s connection string to point to the target instance. 

As the last step, clean up the pgcopydb artifacts:

pgcopydb stream cleanup --source "$SOURCE" --target "$TARGET"

The future of live migrations

As you’ve probably noticed by now, we’ve put a lot of effort into our live migrations solution, but that doesn’t mean the work is done yet.

Currently, our migration workflow using pgcopydb only supports migrations from PostgreSQL to Timescale. However, we are in the process of enabling Timescale to Timescale live migrations as well. We are also actively working with the pgcopydb community to further improve the performance and reliability aspects of the tool.

In addition, we are enhancing the pre-migration and post-migration checks to streamline the end-to-end migration process. This will simplify the start of the migration process and make it easier to verify and switch the primary database to Timescale.

All of these enhancements are part of our overarching goal for this solution: we know how precious your data is and want to ensure a seamless and worry-free transition.

If you are looking to migrate your data from self-hosted TimescaleDB to Timescale, a fully managed cloud platform that makes PostgreSQL faster and fiercer, reach out! We’ll be happy to assist you. For feedback on live migrations, hit us up on Slack (#migrations channel) and be a part of our community. We'd love to hear from you!

Next steps

Live migrations is just the beginning of easing the migration process to Timescale. We are actively iterating on our solutions to ensure they are simple and effective, enabling data migrations (including heavy workloads) with just a few clicks and providing best-in-class performance and reliability for future migrations.

If you would like to move your data using live migrations (or one of our other migration strategies), please follow the Timescale documentation.

To try Timescale, sign up for a free trial (no credit card required). Timescale includes our core database, TimescaleDB, but enhances it with features designed for higher performance, faster queries, and cost savings.
