Jan 17, 2025
Posted by Umair Shahid
As industries like IoT, finance, and healthcare increasingly generate tremendous amounts of data, scalable and efficient cloud storage solutions have become essential. Developers often debate when to use PostgreSQL vs. specialized solutions due to scalability concerns, but it’s possible to engineer PostgreSQL to handle these challenging workloads efficiently and at a petabyte scale—whether it’s events, time series, real-time analytics, or vector data.
Another common concern is PostgreSQL cloud costs. The trade-off between management convenience and cost control is a real consideration for developers. Enter tiered storage: Timescale's tiered storage architecture offers a practical approach to managing large datasets in PostgreSQL by moving older, less frequently accessed data to S3. This allows you to save on storage costs while keeping all your data queryable, without compromising query speed, even for infrequently accessed data.
In this post, we will explore how to get the most out of this architecture—balancing the high-performance tier for recent, frequently accessed data with the low-cost tier for older, less critical data, while weighing any trade-offs so you can scale your PostgreSQL database with confidence.
Timescale's tiered storage system is built to handle the explosive growth of time-series data and other demanding workloads, such as vector data or real-time analytics. Its architecture splits data into high-performance storage (frequently accessed) and low-cost storage (infrequently accessed) tiers, allowing you to optimize both PostgreSQL costs and performance.
The high-performance tier stores your recent, frequently accessed data on high-performance disks. It ensures low-latency, high-throughput reads and writes, making it ideal for workloads that require real-time access. Recent and frequently accessed data remains performant thanks to PostgreSQL's optimized indexing and caching and Timescale's time-partitioned hypertables.
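For context, the examples later in this post assume a hypertable along these lines (the conditions table and its columns are illustrative):

CREATE TABLE conditions (
    time_column TIMESTAMPTZ NOT NULL,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION
);
-- Convert the plain table into a time-partitioned hypertable
SELECT create_hypertable('conditions', 'time_column');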
The low-cost tier is designed for older, less frequently accessed data. Backing it with object storage like AWS S3 significantly reduces storage costs while the data remains an integral part of your database. It comes with trade-offs, though: old data remains fully queryable, but you may experience slower response times compared to the high-performance tier. This is one aspect we have recently improved, optimizing the query performance for tiered data by 400x so you can run complex queries across both tiers.
The tiered storage system offers virtually unlimited data storage, enabling you to handle databases that grow into petabytes without skyrocketing costs. As your time-series data grows, older data can be automatically moved to the low-cost tier without requiring database restructuring or external ETL (extract-transform-load) processes.
Storing data in S3-like object stores is significantly cheaper than keeping all data on high-performance disks. For instance, Timescale's low-cost storage layer offers a flat price of $0.021 per GB/month, cheaper than Amazon S3’s typical rate. When your dataset grows, this cost-saving benefit compounds.
Note that, unlike the high-performance tier, pricing in the low-cost tier is calculated per uncompressed gigabyte.
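For example, keeping 10 TB of uncompressed data in the low-cost tier would cost roughly 10,240 GB × $0.021 ≈ $215 per month.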
Regardless of where your data is stored, Timescale’s architecture makes querying seamless. SQL queries are designed to access both tiers without needing special tools or database management changes. Timescale's query planner automatically fetches data from the relevant tier based on your time-range query, maintaining a unified interface.
Please note that to make querying tiered data seamless, Timescale provides a GUC (grand unified configuration) setting, timescaledb.enable_tiered_reads, which needs to be set to true. It is false by default, which means that out of the box, Timescale does not touch tiered data. More in the docs: Querying tiered data.
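For example, you can enable tiered reads for your current session, or persist the setting at the database level (tsdb below stands in for your database name):

SET timescaledb.enable_tiered_reads = true;
-- Persist the setting so new sessions also read tiered data
-- (replace tsdb with your database name)
ALTER DATABASE tsdb SET timescaledb.enable_tiered_reads = true;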
Understanding the trade-offs between high-performance storage and object storage like S3 is essential for managing performance in a tiered storage environment.
PostgreSQL is optimized for real-time queries with sub-millisecond response times. Due to the nature of object storage, query latency increases for old data stored in S3. However, this trade-off is acceptable for many analytical workloads or infrequently accessed data. Timescale ensures that queries spanning both tiers are handled efficiently, but the speed of the high-performance tier cannot be matched by S3.
A very useful GUC provided by Timescale is timescaledb.enable_tiered_reads. If it is set to false, queries only fetch data from the high-performance tier and ignore data in low-cost storage, as if it didn't exist. If your use case can tolerate ignoring data in the low-cost (and lower-performance) tier, setting this GUC to false is one way of guaranteeing query speeds.
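For instance, a latency-sensitive dashboard session that only needs recent data could opt out of tiered reads for its own connection:

SET timescaledb.enable_tiered_reads = false;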
The cost difference between high-performance storage and low-cost S3 storage is significant. As a result, storing large datasets that may not be frequently accessed in the high-performance tier is costly. Tiering older data to S3 can reduce this burden, lowering your overall operational costs.
PostgreSQL allows real-time inserts, updates, and deletes, which is critical for most OLTP and real-time analytics applications. Timescale’s high-performance tier also allows these operations over compressed data. S3’s immutable nature, however, limits flexibility. Data tiered to the low-cost S3 is best for archival or append-only datasets where modifications are not expected.
When querying data that spans both tiers, TimescaleDB uses its query planner to seamlessly fetch data from both tiers. However, query performance can degrade if queries are not optimized. Here's how you can improve this:
Queries that span both tiers can be optimized by retrieving partial results instead of scanning the entire dataset. Using LIMIT along with ORDER BY reduces the overhead by fetching only the required number of rows.
SELECT * FROM conditions
WHERE time_column BETWEEN '2024-01-01' AND '2024-06-01'
ORDER BY time_column DESC
LIMIT 100;
Chunk exclusion is one of Timescale’s most powerful features for improving query performance. It ensures that the database only scans the necessary chunks of data and avoids querying irrelevant chunks, particularly in the low-cost tier.
Timescale automatically partitions data into chunks based on time intervals. Each chunk stores a specific range of time, so by using a WHERE clause on the time_column, you can help Timescale exclude chunks that don't match the query, avoiding unnecessary data scans.
Let's say we want to fetch temperature data only for January 2024. Without chunk exclusion, the database might scan all data (including irrelevant historical data) if it’s not optimized properly.
SELECT
time_column, temperature
FROM
conditions
WHERE
time_column BETWEEN '2024-01-01' AND '2024-01-31';
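To verify that chunk exclusion is kicking in, you can inspect the plan with standard PostgreSQL tooling; only the chunks overlapping the requested time range should appear in the output:

EXPLAIN (ANALYZE, BUFFERS)
SELECT time_column, temperature
FROM conditions
WHERE time_column BETWEEN '2024-01-01' AND '2024-01-31';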
You can combine the time_bucket function with a specific time range to both aggregate data and enable chunk exclusion.
SELECT
time_bucket('1 hour', time_column) AS bucket,
avg(temperature) AS avg_temp
FROM
conditions
WHERE
time_column BETWEEN '2024-03-01' AND '2024-03-31'
GROUP BY
bucket;
One of the key trade-offs of using S3 as low-cost storage in Timescale is immutability—once data is moved to low-cost storage, it cannot be modified directly. While this immutability ensures data stability and durability, it can become a challenge when you need to update historical data for backfilling or corrections.
This section will discuss strategies for handling immutability, including chunk rehydration, backfilling historical data, and using staging updates in the high-performance tier.
When historical data needs to be updated (e.g., for correcting errors or adding missing data), Timescale provides a mechanism called chunk rehydration. This process allows you to temporarily bring old data back into the high-performance tier, where it can be modified, and then retier it by moving it back to low-cost storage.
First, identify the chunk that holds the data you want to modify:

SELECT chunk_name FROM timescaledb_information.chunks WHERE hypertable_name = 'conditions';

Then, use the untier_chunk function to bring the compressed old data from S3 back into high-performance storage.

-- Bring the chunk back from low-cost storage to high-performance storage for modification
CALL untier_chunk('_timescaledb_internal._hyper_1_1_chunk');
-- Perform your data updates or inserts
UPDATE conditions
SET temperature = 22.5
WHERE time_column BETWEEN '2024-01-01' AND '2024-01-15';
-- After modifications, the chunk will be recompressed and sent back to low-cost storage
CALL retier_chunk('_timescaledb_internal._hyper_1_1_chunk');
The untier_chunk function temporarily moves the chunk back to the high-performance tier for updates. The retier_chunk function recompresses the chunk and moves it back to low-cost storage.

If you need to perform more frequent updates on old data or want to avoid repeated rehydration of chunks, another effective strategy is to stage updates in the high-performance tier before moving the data to low-cost storage. This approach is particularly useful when backfilling or modifying a large amount of historical data.
-- Extend the tiering window so the data you need to update stays in high-performance storage
-- (if a tiering policy already exists, remove it before adding one with a new interval)
SELECT remove_tiering_policy('conditions');
SELECT add_tiering_policy('conditions', INTERVAL '12 months');
-- Perform any updates or backfilling on recent data
UPDATE conditions
SET temperature = temperature + 1.0
WHERE time_column BETWEEN '2024-01-01' AND '2024-01-31';
-- Once updates are complete, restore the shorter tiering window so the data moves back to low-cost storage
SELECT remove_tiering_policy('conditions');
SELECT add_tiering_policy('conditions', INTERVAL '6 months');
In cases where immutability conflicts with data maintenance needs, there are a few additional strategies you can apply to balance the benefits of low-cost storage with the ability to modify data.
Instead of moving data back and forth between high-performance and low-cost tiers, you can create temporary high-performance-only tables to stage data updates before merging them with historical data.
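A minimal sketch of this staging pattern, assuming the conditions hypertable from earlier and a target time range that still lives in the high-performance tier (tiered chunks cannot accept writes); the conditions_staging table is illustrative:

-- Stage corrections in a regular, high-performance-only PostgreSQL table
CREATE TABLE conditions_staging (LIKE conditions INCLUDING DEFAULTS);

INSERT INTO conditions_staging (time_column, temperature, humidity)
VALUES ('2024-01-10', 21.7, 44.0);

-- Once the corrections are finalized, merge them into the hypertable and clean up
INSERT INTO conditions SELECT * FROM conditions_staging;
DROP TABLE conditions_staging;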
If modifications are minimal and mostly involve adding new information rather than altering existing records, consider designing your data model to favor append-only operations. This aligns well with the immutability of low-cost storage, as you won’t need to modify the existing data.
-- Append new temperature readings as new rows rather than modifying existing ones
INSERT INTO conditions (time_column, temperature, humidity)
VALUES ('2024-01-15', 22.5, 45.2);
One of the most important aspects of leveraging tiered storage is setting well-thought-out data retention and tiering policies. Timescale allows you to move older data to low-cost storage after a specified period using simple SQL commands. For instance, you can automate the movement of all data older than six months to low-cost storage:
SELECT add_tiering_policy('your_hypertable', INTERVAL '6 months');
This policy automatically shifts the data after six months, ensuring the high-performance tier remains optimized for recent queries. Implementing intelligent policies helps prevent overburdening the high-performance tier and keeps performance high.
Not all older data may need to move immediately to low-cost storage. For workloads that intermittently access historical data for compliance or analysis, you may want to fine-tune tiering policies based on query patterns. Understanding which data is critical for real-time analysis helps you optimize storage costs while maintaining efficient performance.
For example, financial institutions often need to access last year’s data for audits or compliance, but the data doesn't need to reside in the high-performance tier at all times. Query monitoring tools within Timescale can give you insights into data access frequency, which helps you make informed decisions on when to move data to the low-cost tier.
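One way to ground these decisions is to check which chunks cover which time ranges, using the same informational view from the rehydration example (assuming the conditions hypertable):

SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'conditions'
ORDER BY range_start;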
Compression can substantially reduce the storage footprint for high-performance data. Timescale's native hybrid row-columnar storage engine reduces disk space usage by 90% or more, which lowers storage costs even for the high-performance tier. Data in the high-performance tier can be kept compressed for longer while still ensuring fast query response times, allowing more efficient use of the more expensive PostgreSQL storage. This is especially useful for time-series data, which often exhibits high redundancy.
Example:
ALTER TABLE your_hypertable SET (timescaledb.compress = true);
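Enabling compression only marks the hypertable as compressible; in practice, you would typically pair it with an automatic compression policy (the seven-day window below is illustrative):

SELECT add_compression_policy('your_hypertable', INTERVAL '7 days');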
While Timescale enables transparent access to data in the low-cost tier, it’s important to consider S3's characteristics:
Once data is tiered to low-cost storage, it cannot be modified. This is a fundamental trade-off of using object stores like S3. Plan accordingly and ensure that data you no longer expect to modify goes into low-cost storage.
Accessing old data from S3 involves additional latency compared to querying high-performance tier’s recent data. Timescale mitigates this with efficient query planning, but it's worth noting that certain queries, especially those that span both tiers, may see increased response times. Design your applications to tolerate slightly higher latency for historical queries.
If you need to update historical data that resides in low-cost storage, Timescale provides a mechanism to untier specific data chunks, allowing you to bring them back to high-performance storage temporarily. This can be useful for cases where backfilling data is necessary.
Example:
CALL untier_chunk('_timescaledb_internal._hyper_1_1_chunk');
When querying across both recent and old data, ensure your queries are optimized to avoid unnecessary scans across the entire dataset. Timescale’s chunk exclusion feature automatically excludes unnecessary data chunks from being queried. Additionally, using `time_bucket` and other SQL functions effectively can further optimize performance.
Timescale’s replication mechanisms ensure that data, whether in the high-performance tier or the low-cost tier, is stored securely across multiple nodes. If you’re using read replicas or forks for testing, ensure that you are using the low-cost tier efficiently across all instances. Because Timescale only charges once for data stored in the low-cost tier, replicated data remains cost-efficient.
Timescale’s tiered storage unlocks both the performance of PostgreSQL and the cost savings of cloud-based object storage like S3, making it an ideal solution for time-series data and challenging workloads at scale. By effectively managing tiering policies, leveraging compression, and understanding the trade-offs between high-performance and low-cost data storage, you can optimize your Timescale deployment to handle growing datasets with ease while keeping cloud costs in check.
Incorporating these best practices will help you strike the right balance between performance and cost, ensuring that your time-series workloads remain efficient and scalable for the long term. For further reading, see this article on Scaling PostgreSQL to Petabyte Scale using Timescale's tiered storage architecture.
Tiered storage is available only in Timescale’s PostgreSQL cloud platform, Timescale Cloud. Start for free here (no credit card required, free for 30 days).