Expanding the Boundaries of PostgreSQL: Announcing a Bottomless, Consumption-Based Object Storage Layer Built on Amazon S3

We are excited to announce the initial launch, in private beta, of our new consumption-based, low-cost object storage layer in Timescale Cloud. This new capability expands the boundaries of traditional databases, allowing you to transparently tier your data across disk and Amazon S3 while accessing it as if it all lived in one single continuous PostgreSQL table. This means that you can now store an infinite amount of data in Timescale Cloud, paying only for what you store. Bottomless cloud storage for time series, events, and analytics is just one piece of our vision to empower you with exceptional data infrastructure so that you can build the next wave of computing.

We started Timescale five years ago with a mission: to help developers build the next wave of computing through applications that leverage time-series and real-time analytical data. This mission led us to build TimescaleDB, a time-series database that gives PostgreSQL the performance boost it needs to handle relentless streams of time-series data at scale.

On top of scalability and performance, another key concern for developers managing data at scale is cost efficiency. Time-series data is often collected at high frequency or across long time horizons. This scale is often a fundamental part of applications: it’s storing metrics about all IoT devices in a fleet, all the events in a gaming application, or tick data about many financial instruments. But this data adds up over time, often leading to difficult trade-offs about which data to store and for how long.

To address this problem, we’ve developed several database features at Timescale aimed at making it easier for developers to manage their time-series data—like native columnar compression, downsampling, data retention policies, and user-defined actions. And indeed, these offer massive savings in practice. Compression alone can shrink data by up to 95 percent, making Timescale much more cost-effective than vanilla options like Amazon RDS for PostgreSQL.

Today we’re excited to announce how we’re extending this vision to a cloud-native future and building Timescale Cloud to supercharge PostgreSQL for time series, events, and analytics at greater scale and lower cost.

Timescale Cloud now offers consumption-based, low-cost object storage built on Amazon S3.

This new storage layer gives you, the developer, more tools to build applications that scale more efficiently while reducing costs. Leveraging a cost-efficient storage layer like Amazon S3 removes the need to pre-allocate—and pay for—an upper bound of your storage. When you tier data on Timescale Cloud, you will only pay for what you actually store while retaining the flexibility to keep a limitless amount of data, and without being charged extra per query.

This consumption-based pricing is not only transparent but an order of magnitude cheaper than our standard disk-based storage. And what’s more, you can access this affordable object storage layer seamlessly from your Timescale Cloud database, meaning no need to create a custom pipeline to archive and reload data. All you’ll need is a single SQL command to automatically tier data based on its age, as suited to your application’s needs:

-- Create a tiering policy for data older than two weeks
SELECT add_tiering_policy('metrics', INTERVAL '2 weeks');

But why stop at cost efficiency? At Timescale, we strive to create a seamless developer experience for every feature we release. That means doing the heavy technical lifting under the covers while you continue interacting with your data in the simplest way possible.

When applied to cost-saving object storage in Timescale Cloud, this means that even when data is tiered, you can continue to query it from within the database via standard SQL, just like you do in TimescaleDB and PostgreSQL. Predicates, filters, JOINs, CTEs, windowing, and hyperfunctions all work! Reading data directly from tiered object storage only adds a few tens of milliseconds of latency—and this cost goes away for larger scans.

We’ve natively architected Timescale Cloud databases to support tables (hypertables) that can transparently stretch across multiple storage layers. The object store is thus an integral part of your cloud database rather than just an archive.

Here’s an example of the EXPLAIN plan for a query that fetches data from disk and object storage (notice the Foreign Scan):

EXPLAIN
SELECT time_bucket('1 day', ts) AS day,
       max(value) AS max_reading,
       device_id
FROM metrics
JOIN devices ON metrics.device_id = devices.id
JOIN sites ON devices.site_id = sites.id
WHERE sites.name = 'DC-1b'
GROUP BY day, device_id
ORDER BY day;


QUERY PLAN                                                      
----------------------------------------------------------
GroupAggregate
    Group Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
    -> Sort
        Sort Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
        -> Hash Join
            Hash Cond: (_hyper_5666_706386_chunk.device_id = devices.id)
            -> Append
                -> Seq Scan on _hyper_5666_706386_chunk
                -> Seq Scan on _hyper_5666_706387_chunk
                -> Seq Scan on _hyper_5666_706388_chunk
                -> Foreign Scan on osm_chunk_3334
            -> Hash
                -> Hash Join
                    Hash Cond: (devices.site_id = sites.id)
                    -> Seq Scan on devices
                    -> Hash
                        -> Seq Scan on sites
                           Filter: (name = 'DC-1b'::text)

The ability to keep your regular and tiered data both accessible via SQL helps you avoid the silos and application-level patchwork that come from operating a separate data warehouse or data lake. It will also help you escape the operational work and extra costs of integrating yet another tool into your data architecture.

Starting today, tiering your data to object storage is available for testing in private beta for all Timescale Cloud users. Sign up for Timescale Cloud and navigate to the Operations screen, pictured below, to request access. Timescale Cloud is free for 30 days, no credit card required.

You can request access to our private beta via the Timescale Cloud UI.

But, this is just the beginning. We plan to further improve object storage in Timescale Cloud to not only serve as bottomless, cost-efficient storage but also as a shared storage layer that makes it dramatically easier for developers to share data across their entire fleet of databases.

This is a huge step forward in our vision to build a data infrastructure that extends beyond the boundaries of a traditional database: combining the flexibility of a serverless platform with all the performance, stability, and transparency of PostgreSQL that developers know and love. Not just a managed database in the cloud, but a true “database cloud” to help developers build the next wave of computing.

So yes, we are just getting started.

✨ A huge “thank you” to the team of Timescale engineers that made this feature possible, with special mention to Gayathri Ayyappan, Sam Gichohi, Vineetha Kamath, and Ildar Musin.

To learn more about Timescale Cloud’s new data tiering functionality, how it redefines traditional cloud databases, and how it can help you build scalable applications more cost-efficiently, keep reading.

Bottomless Storage for PostgreSQL

Having native access to a cloud-native object store means you can now store an infinite amount of data, paying only for what you store. You no longer have to manually archive data to Amazon S3 to save on storage costs, nor import this data into a data warehouse or other tools for historical data analysis. Timescale’s new data tiering feature moves the data transparently to the object store and keeps it available to the Timescale Cloud database at all times.

To enable this new functionality in PostgreSQL, we built new database internal capabilities and external subsystems. Data chunks (segments of data related by time) that comprise a tiered hypertable now stretch across standard storage and object storage. We also optimized our data format for each layer: block storage starts in uncompressed row-based format and can be converted to Timescale’s native compressed columnar format.
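The conversion to Timescale’s compressed columnar format is driven by TimescaleDB’s standard compression API. As a sketch against this post’s example `metrics` hypertable (the segment-by and order-by choices here are illustrative, not prescriptive):

```sql
-- Enable native columnar compression on the metrics hypertable,
-- segmenting by device and ordering by time within each segment
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'ts DESC'
);

-- Automatically compress chunks once they are older than seven days
SELECT add_compression_policy('metrics', INTERVAL '7 days');
```

Segmenting by `device_id` keeps each device’s readings together, which typically improves both compression ratios and per-device query performance.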

On top of that, all object storage is in a compressed columnar format well-suited for Amazon S3 (more specifically, Apache Parquet). This allows developers more options to take advantage of the best data storage type during different stages of their data life cycle.

Once a data tiering policy is enabled, chunks stored in our native internal database format are asynchronously migrated into Parquet format and stored in S3 based on their age (although they remain fully accessible throughout the tiering process). A single SQL query will pull data from the disk storage, object storage, or both as needed, but we implemented various query optimizations to limit what needs to be read from S3 to resolve the query.

We perform “chunk exclusion” to avoid processing chunks falling outside the query’s time window. Further, the database doesn’t need to read the entire object from S3, even for selected chunks, as it stores various metadata to build a “map” of row groups and columnar offsets within the object. The result? It minimizes the amount of data to be processed, even within a single S3 object that has to be fetched to answer queries properly.
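For example (a sketch reusing this post’s `metrics` schema), a query constrained to a recent time window lets the planner exclude older chunks up front, whether they sit on disk or in S3:

```sql
-- Only chunks whose time range overlaps the last day are touched;
-- older chunks, including those tiered to S3, are excluded before
-- any data is read
SELECT device_id, max(value) AS max_reading
FROM metrics
WHERE ts > now() - INTERVAL '1 day'
GROUP BY device_id;
```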

Cost-Effective Scalability

Timescale’s new object storage layer doesn’t just give PostgreSQL bottomless storage but also gives you, the developer, more tools to build applications that scale cost-efficiently.

By leveraging Amazon S3, you no longer have to pre-allocate (and pay for) an upper bound of your storage. While Timescale Cloud already offers disk auto-scaling, your allocation is still “bumped up” between predefined levels: from 50 GB to 75 GB to 100 GB, from 5 TB to 6 TB to 7 TB, etc. Our new object storage layer scales effortlessly with your data, and you only pay for what you store.

These storage savings can be meaningful: an order of magnitude cheaper than employing standard disk-based storage like EBS.

So why isn’t this standard for all databases? Because Timescale focuses on analytical and time-series data, we can perform these transparent optimizations at the larger chunk level rather than the much smaller database page level. This lets us make the most of S3, which is optimized, in both price and performance, for larger objects. The same approach wouldn’t be practical with traditional page-based strategies for database storage.

It’s Still Just PostgreSQL, But Better

As we say, Timescale supercharges PostgreSQL for time series and analytics. But it’s always been important for us to maintain the full PostgreSQL experience, which developers trust and love. This is why we built TimescaleDB as an “extension” of PostgreSQL (although that “extension” has certainly gotten bigger and bigger over the years!).

In our books, a smooth developer experience means that developers can continue interacting with all their data as if it’s a standard table—we do the heavy technical lifting under the covers. It should be invisible, and the more our improvements fade into the background, the better.

Developers don’t need to know that hypertables are actually heavily partitioned data tables (often with thousands of partitions); they just treat them like standard tables. Developers don’t see Timescale’s real-time aggregations combining incrementally pre-aggregated data with the latest raw table data to provide them with up-to-date results every time. They are meant to “just work.”
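A real-time aggregation of this kind is defined once and then just works. As a hedged sketch using TimescaleDB’s continuous aggregate API and this post’s example `metrics` schema (the view name `daily_max` is illustrative):

```sql
-- Incrementally maintained daily maximum per device; queries against
-- the view combine materialized buckets with the latest raw data
CREATE MATERIALIZED VIEW daily_max
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) AS day,
       device_id,
       max(value) AS max_reading
FROM metrics
GROUP BY day, device_id;
```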

We titled our 2017 launch post “When Boring is Awesome: Building a Scalable Time-Series Database on PostgreSQL.” We still strive to make Timescale seem “boring” to developers—simple, fast, scalable, reliable, and cost-effective so that developers can focus their precious time and minds on building applications.

This focus on the developer experience similarly motivated our design of transparent data tiering. When data is tiered, you can continue to query tiered data from within the database via standard SQL—predicates and filters, JOINs, CTEs, windowing, and hyperfunctions all just work.
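As one hedged illustration (again against this post’s example schema), a window query needs no special handling for tiered chunks:

```sql
-- A 7-row moving average per device; the query is written exactly as
-- it would be for an ordinary PostgreSQL table, regardless of where
-- the underlying chunks live
SELECT ts, device_id, value,
       avg(value) OVER (PARTITION BY device_id
                        ORDER BY ts
                        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS smoothed
FROM metrics
ORDER BY device_id, ts;
```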

And what’s more, your SQL query will pull relevant data from wherever it is located: disk storage, object storage, or both, as needed, without you having to specify anything in the query.

Here’s what it would look like working with relational and time-series data in Timescale, including tiered data. This example shows the use of sensor data, as you might have for IoT, building management, manufacturing, or the like. After creating tables and hypertables for sites, devices, and metrics, respectively, you use a single command add_tiering_policy to set up a policy that automatically tiers data older than two weeks to low-cost object storage.

-- Create relational metadata tables, including GPS coordinates
-- and FK constraints that place devices at specific sites
CREATE TABLE sites (id integer primary key, name text, location geography(point));
CREATE TABLE devices (id integer primary key, site_id integer references sites (id), description text);

-- Create a Timescale hypertable
CREATE TABLE metrics (ts timestamp, device_id integer, value float);
SELECT create_hypertable('metrics', 'ts');

-- Create a tiering policy for data older than two weeks
SELECT add_tiering_policy('metrics', INTERVAL '2 weeks');

Now, after you’ve inserted data into your metrics hypertable (and device and site information into your relational tables), you can query it as usual. Reading tiered data only adds an extra latency of around tens of milliseconds, and this latency cost may even go away for larger scans.

The following SQL query returns the maximum value recorded per device, per day, for a specific site—a fairly standard monitoring use case. Rather than showing the data results (which will just look normal!), we’ll show the output of EXPLAIN.

This command allows developers to see the actual query plan that will be executed by the database, which in this case includes a Foreign Scan when the database is accessing data from S3. (With our demo data, three chunks remain in standard storage, while five chunks are tiered onto S3.)

EXPLAIN
SELECT time_bucket('1 day', ts) AS day,
       max(value) AS max_reading,
       device_id
FROM metrics
JOIN devices ON metrics.device_id = devices.id
JOIN sites ON devices.site_id = sites.id
WHERE sites.name = 'DC-1b'
GROUP BY day, device_id
ORDER BY day;

QUERY PLAN                                                      
----------------------------------------------------------
GroupAggregate
    Group Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
    -> Sort
        Sort Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
        -> Hash Join
            Hash Cond: (_hyper_5666_706386_chunk.device_id = devices.id)
            -> Append
                -> Seq Scan on _hyper_5666_706386_chunk
                -> Seq Scan on _hyper_5666_706387_chunk
                -> Seq Scan on _hyper_5666_706388_chunk
                -> Foreign Scan on osm_chunk_3334
            -> Hash
                -> Hash Join
                    Hash Cond: (devices.site_id = sites.id)
                    -> Seq Scan on devices
                    -> Hash
                        -> Seq Scan on sites
                           Filter: (name = 'DC-1b'::text)

Replace Your Siloed Database and Data Warehouse

Timescale’s new data tiering functionality expands the boundaries of a traditional cloud database to incorporate features typically attributed to data warehouses or data lakes.

The ability to tier data to Amazon S3 within Timescale Cloud saves you the manual work of building and integrating a custom system or operating a separate data store (e.g., Snowflake) to archive historical data. Instead of setting up, maintaining, and operating a separate system alongside your production database (plus a separate ETL process), you can simply work with a Timescale hypertable that serves your entire data lifecycle, with data distributed across different storage layers.

As we’ve illustrated, you can query regular and tiered data seamlessly from this table and also JOIN it to the rest of your tables, avoiding silos without adding more complexity to your data stack. This not only simplifies operations but also billing: unlike regular data warehousing systems (which typically charge per query, making it very difficult to forecast the final cost), in Timescale Cloud you’ll pay only for what you store, keeping your pricing transparent at all times.

Request Access to Data Tiering Today

If you’re already using Timescale Cloud, you can test data tiering today by requesting access to our private beta. We welcome your feedback to improve the product and better serve the needs of developers.

To start testing data tiering in Timescale Cloud today:

  • Sign up for Timescale Cloud. The first 30 days are completely free (no credit card required).
  • Log in to the Timescale Cloud UI. In your Service screen, navigate to Operations > Data Tiering. Click on the “Request Access” button, and we’ll be in touch soon with the next steps.

Bottomless cloud storage for time series, events, and analytics is just one piece of our vision to empower you, the developer, with exceptional data infrastructure so that you can build the next wave of computing.

We plan to further improve object storage in Timescale Cloud to not only serve as bottomless, cost-efficient storage but also as a shared storage layer that makes it dramatically easier for developers to share data across their entire fleet of databases.

If that sounds interesting to you, please request access to the private beta and let us know!

We’re just getting started!
