Scaling PostgreSQL for Cheap: Introducing Tiered Storage in Timescale

Scaling PostgreSQL for Cheap: Introducing Tiered Storage in Timescale

Products like Amazon RDS for PostgreSQL are fine for smaller deployments, but scaling PostgreSQL is a different story. Once the project grows to many terabytes, these managed databases get slow and expensive, making data management much more complex. Performance suffers once tables reach billions of rows, and while there are ways to improve it, the data will carry on growing. Without proper data management, developers can only watch their disks (and bills) increase in size.  

But not anymore. Today, we’re happy to announce the General Availability of Tiered Storage, a multi-tiered storage architecture engineered to enable infinite, low-cost scalability for your time series and analytical databases in the Timescale platform. You can now store your older, infrequently accessed data in a low-cost storage tier while still being able to access it—without ever sacrificing performance for your frequently accessed data.

This is what happens when you insert data into a Timescale time-series database with our new multi-tiered storage backend:

  1. Your most recent data will be written into a high-performance storage tier optimized for fast queries and high ingests.
  2. Once you don’t access that data frequently, you can automatically tier it to a lower-cost object storage tier by setting up a tiering policy. The data in the low-cost storage tier remains fully queryable within your database, and there’s no limit to the amount of data you can store—up to hundreds of TBs or more. Our low-cost storage tier has a flat price of $0.021 per GB/month for data—cheaper than Amazon S3. 
Timescale now has a Tiered Storage backend, combining two storage tiers to take advantage of both fast query performance and affordable scalability
Timescale now has a Tiered Storage backend, combining two storage tiers to take advantage of both fast query performance and affordable scalability

Developers need a cheap way to scale their large PostgreSQL databases in AWS without compromising performance. While object stores are amazingly scalable and affordable, they are not the fastest, and developers also need to get millisecond query responses for their applications. 

However, once data becomes older and rarely accessed, real-time performance is often not as essential. Developers still need to be able to access this historical data for ad hoc queries, data analysis, or regulatory compliance, but they can assume some latency for these types of queries. Now, what developers want is the ability to store this historical data as affordably and efficiently as possible. 

This new Tiered Storage architecture frees developers from choosing between storage costs and performance trade-offs for real-time applications. By keeping their recent and frequently accessed data in the high-performance tier, they’ll leverage the millisecond query speed and ingestion capabilities of Timescale, and by tiering their older data, they can keep as many TBs as they need in their PostgreSQL databases for less. 

PostgreSQL Data Management With Tiered Storage: What You Need to Know 

Timescale’s Tiered Storage architecture leverages the flexibility of PostgreSQL and hypertables for effective data management. When creating a hypertable, it can now seamlessly stretch across both storage tiers; when you run a query, Timescale seamlessly figures out what storage tiers to access to generate the response, boosting performance for your most recent data and lowering storage costs for older data. There’s no extra charge per query or data read and no hidden fees, enabling straightforward, cost-effective data management.

This storage architecture also eliminates any storage limitations in Timescale services: since our low-cost storage tier is infinite, you can store as many TBs as you’d like. For example, we’re leveraging Tiered Storage internally to store the huge Timescale database powering our Insights product. 

This Insights database is constantly collecting and analyzing query statistics from across our customer fleet, and it’s exceeded 350 TB today and is growing quickly. From those 350 TB, 250 TB are tiered to low-cost storage.

Let’s do the math: 

  • We store 5 TB in our high-performance storage tier after compression. Of course, we have compression enabled, and we’re getting compression ratios of 20x—meaning that what was originally 100 TB of Postgres data now fits in a 5 TB disk thanks to compression (!) 
  •  The remaining 250 TB of data are stored in the low-cost storage tier. This tiering happens automatically once data hits a certain age, which currently is several weeks old.

Our customers with large deployments are also already making use of Tiered Storage: 

"We perform a lot of analysis on market data, and the sheer volume of data we need to store makes a normal disk-based database solution unfeasible (it's just too expensive). Timescale’s Tiered Storage seamlessly allows us to move large volumes of data to the object storage layer. This is a great solution to store large volumes of historical data and perform post-analysis. Without this, we'd be forced to develop a solution in-house."
Chief Technology Officer at a Proprietary Digital Assets Trading Company

Simplicity is really the key feature of Tiered Storage. To move your data from the high-performance tier to the low-cost object tier, all you need to do is use our simple API to define policies to tier your data as it ages on a particular hypertable. Data tiering policies work on a chunk basis (the policy will tier complete chunks once they reach the age defined by the policy). There’s no need for any ETL (extract-transform-load) processes or infrastructure changes. 

Editor's Note: Timescale hypertables are PostgreSQL tables automatically partitioned by time. Those partitions are called chunks. Chunks are the unit of user-defined policies in Timescale: for example, when you define a data retention policy, you’ll be deleting entire partitions (chunks), and when you move data between storage tiers, you’ll be moving entire chunks. This is a very convenient way to manage data from a resource-utilization and developer experience perspective.

For example, this policy would move all data older than one month to object storage in your events hypertable: 

SELECT add_tiering_policy('events', INTERVAL '1 month');

That’s all you need! Everything else happens automatically, including intelligent query planning that only executes any SQL queries on the appropriate tier. 

To drop data currently stored in the low-cost storage tier, you can define a data retention policy so data is deleted automatically after a certain period of time  (e.g., after five years). You can also manually delete specific chunks

Also, if you want to “move back” a particular chunk from the low-cost storage tier to the high-performance tier (for example, if you need to backfill or update data), you can easily “untier” it.

-- Untier a particular chunk
CALL untier_chunk('_hyper_1_1_chunk'); 

You can track how much data has been tiered (and how much it will cost at the end of the month) in the Timescale console: 

The Overview screen in the Timescale UI will show you how much data you have in the low-cost storage layer and an estimate of how much you’ll pay for it 

And speaking of billing estimations… 

How much can I save with Tiered Storage? 

Our high-performance storage tier has an effective price of $0.177 per GB/month of data after compression (considering the expected compression rate we see across our fleet).  This is now joined by a low-cost storage layer with a flat fee of $0.021 per GB/month of data, with the same price across all cloud regions.

When tiering data, you will only pay for the data you store, not for the queries executed or the amount of data scanned: this is truly a flat price. Our goal with this pricing structure was to provide a transparent, unambiguous, and simple way to calculate the total data storage cost, making it easier to manage your data.

As a quick example, let’s say you have a hypertable with 5.2 TB of storage, with recent hypertable chunks and other Postgres tables taking up approximately 200 GB and about 5 TB of hypertable data older than a month. You do not access or query this older data frequently, meaning you don’t need it for the daily operations of your application. Still, you’d like to keep it accessible in your database for ad hoc queries or compliance requirements (we see many such cases among our customers). 

As a cost-effective data management strategy, you can tier all chunks older than a month to the low-cost tier and decrease the cost of storing that 5 TB of data from around $478/month to around $105/month, a 78% decrease in your storage bill. (For this estimation, we assume you enabled compression for your hypertable and consider the median overall compression in all Timescale services). 

Moving your infrequently accessed data to the low-cost storage tier will save you significant money, making your storage bill very affordable

The savings will increase alongside your data: when moving multiple terabytes to this low-cost tier, your storage bill will decrease from thousands of dollars to a few hundred. The following reference table illustrates how affordable our low-cost storage tier actually is.

You get the point! 

The savings’ multiplication effect of Tiered Storage 

There is one more thing that makes Tiered Storage even more amazing: when you keep data in the low-cost storage tier, you pay for this data only once, independently of if you have a high-availability replica or read replicas running in your service.

The same applies to forks. In Timescale, you can create copies of your primary database (we call them forks) by clicking one button from the UI, for example, for running tests or creating dev environments. When creating one (or more) forks, you won’t be billed for data shared with the primary in the low-cost storage. If you decide to tier more data that’s not in the primary, you will pay to store it in the low-cost tier, but you will still benefit from substantial savings by moving it from the high-performance tier of the fork to the cheaper tier.

To make this crystal clear, let’s roll out an example. Imagine that you have a primary service with 6.5 TB of data and that you also have set: 

  • A high-availability replica to significantly reduce the risk of downtime and data loss due to failures
  • A read replica to serve your read queries and allow the primary to dedicate itself fully to writes
  • A fork of that service for development and testing purposes 

From a billing perspective, if you kept the 6.5 TB of data in your primary service in the high-performance storage tier, you would see reflected [6.5 TB x 4] in your storage bill to account for the two replicas, the fork, and the primary service—26 TB in total. Assuming our median compression rate, this would be costly: around $4,602 /month.

But what happens if your application needs to actively access only the most recent 500 GB of data? You could tier 6 TB to low-cost storage and only keep 500 GB on your high-performance storage tier. And since you only pay for the data in your low-cost storage tier once, this is how your new storage bill would look like:

  • [500 GB x 4] = 2 TB in high-performance storage (instead of 26 TB) 
  • [6 TB x 1] in the low-cost storage tier 

The above storage bill would come up to approximately $480 /month: you would save over $4,000/month! This is the savings' multiplication effect of Tiered Storage. Especially if you’re running replicas or forks, it is a great idea to take advantage of the low-cost storage tier—the savings you’ll see in your overall bill will be very significant. 

Our Tiered Storage Journey 

We released an early version of our Tiered Storage functionality in the Timescale platform as Early Access in March. After many improvements, it has now reached General Availability. Since its first release, we have worked hard to make Tiered Storage stable, reliable, and faster. 

We also thought long and hard about our pricing model, iterating consistently over multiple ways to price our low-cost storage tier. Finally, we leaned towards the simplest: a flat rate over the data volume pre-compression*. We’ve already mentioned how we valued simple and predictable pricing, but it was also important for us to provide a price per GB/month as low as possible. Our price landed at $0.021—as a comparison, storing files on Amazon S3 costs $0.023 per GB/month

Personally, I (Yannis) must admit that after leading teams building cloud-native solutions for more than a decade, I still have to go back and occasionally recheck multiple rate tables on various cloud pricing pages, especially looking for extra fees, every time that I want to compute the total costs for our services accurately.

At Timescale, we believe that you should not have to build a complicated spreadsheet to be able to run a database service or make informed choices about storage layers.

Our commitment to making developers’ lives easier led us to the flat rate of $0.021 per GB/month—no guesswork, hidden costs, or charges per query or data reads. 

* Data volume pre-compression means the data volume before applying Timescale compression. For example, due to Timescale compression, a 500 GB volume in the high-performance storage tier might end up needing only 50 GB of disk space once compressed. If you decided to tier this data to low-cost storage, your bill would be calculated over the original 500 GB volume, as in [500 GB * $0.021] per month.

How Tiered Storage Works: Behind the Scenes 

All data inserted into Timescale is initially written into our high-performance storage layer. Using faster disks for your most recent data will bring top insert and query performance for your most recent values, a usage pattern that fits the needs of data-intensive applications.

On the contrary, the low-cost storage tier is an object store built on Amazon S3. This object store is much more than an external bucket to archive your data, though: it’s an integral part of your database. When you move data to this object storage tier, your database will remain fully aware of all the semantics and metadata, and you can keep querying as usual with standard SQL (albeit with slower performance). 

Behind the scenes, we’re storing the data in a compressed columnar format (specifically, Apache Parquet). When data is tiered, chunks stored in Timescale’s native internal database format (typically in our native columnar compression) are asynchronously converted to Parquet format and stored in the underlying S3 object. We’ve built several mechanisms to ensure that the tiered chunks are durably stored in the low-cost storage tier before transactionally removing them from the high-performance tier. 

When you run your SQL query, it will transparently pull data from the high-performance storage tier, the object storage tier, or both, as required. When we say transparent, we mean transparent: Timescale supports arbitrarily complex queries across its standard and tiered data, including complex predicates, JOINs, CTEs, windowing, hyperfunctions, and more.

In the example query below (with an EXPLAIN clause), you can see how the query plan includes a Foreign Scan when the database is accessing data from the object storage tier. In this example, devices and sites are standard Postgres tables (residing only in high-performance storage), while metrics is a hypertable that stretches across both storage tiers.  When executing this query against the metrics hypertable, three of its chunks were read from standard storage, and five chunks were read from object storage.

EXPLAIN 
SELECT time_bucket('1 day', ts) as day,
        max(value) as max_reading, 
        device_id  	
    FROM metrics 
    JOIN devices ON metrics.device_id = devices.id 
    JOIN sites ON devices.site_id = sites.id
WHERE sites.name = 'DC-1b'
GROUP BY day, device_id
ORDER BY day;


QUERY PLAN                                                      
----------------------------------------------------------
GroupAggregate
    Group Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
    -> Sort
        Sort Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
        -> Hash Join
            Hash Cond: (_hyper_5666_706386_chunk.device_id = devices.id)
            -> Append
    -> Seq Scan on _hyper_5666_706386_chunk
                -> Seq Scan on _hyper_5666_706387_chunk
                -> Seq Scan on _hyper_5666_706388_chunk
                -> Foreign Scan on osm_chunk_3334
            -> Hash
                -> Hash Join
                    Hash Cond: (devices.site_id = sites.id)
                    -> Seq Scan on devices
                    -> Hash
                        -> Seq Scan on sites
                           Filter: (name = 'DC-1b'::text)

In the above query plan, the Foreign Scan on osm_chunk_3334 corresponds to fetching data from the bottomless object storage tier. 

To avoid processing chunks falling outside the query’s time window, we perform chunk exclusion to only touch the chunks that are needed to satisfy the query. The database stores various forms of metadata to build a “map” of row groups and columnar offsets within object storage.

Furthermore, when a query is run, it is further selective about the data it reads. If your query only touches a range of rows and a few columns, only that subset of the data is read from the S3 object behind the low-cost storage tier. 

In the above, for example, only the device_id and value columns are read from the object storage tier.  If an additional  time-based WHERE filter had been included, the database would first use its metadata (stored in high-performance storage and cached in memory) to further reduce which Parquet files and row groups need to be read in order to execute the query.  All with the goal of reducing query latency even when accessing this bottomless storage transparently through PostgreSQL.

Give Tiered Storage a Try 

Storage decisions around historical data don’t have to be costly. In Timescale, you now have access to a low-cost, infinite storage tier with no pricing gotchas that allows you to scale your database storage for an affordable price without compromising performance for your application. 

If you’re wondering if this is a good solution for your use case, check out our Docs for more information and use case fits. If you’re looking for a more practical perspective, you can also experiment with Tiered Storage yourself for free in the Timescale platform via our free trial (no credit card required)

Ingest and query in milliseconds, even at terabyte scale.
This post was a collaboration between
13 min read
Announcements & Releases
Contributors

Related posts