Timescale Logo

Best Practices for Time-Series Data Modeling: Single or Multiple Partitioned Table(s) a.k.a. Hypertables

Written by Chris Engelbert

Collecting time-related information, or time-series data, creates massive amounts of data to manage and model. Storing it will require one or more Timescale hypertables, which are very similar to PostgreSQL partitioned tables, and many decisions on database schema design.

While we discussed the table layout in the Narrow, Medium, or Wide Table Layout best practices article, this blog post will address whether you should use a single table to store all data versus using multiple tables, as well as their respective pros and cons.

The Basics: Hypertables and Time-Series Data Modeling

Timescale hypertables work like regular PostgreSQL tables but offer optimized performance and user experience for time-series data. With hypertables, data is stored in chunks, which work similarly to PostgreSQL’s partitioned tables but support multiple dimensions and other features.

Timescale is built upon a relational database model, which means it supports numerous schemas and data modeling choices or ways in which data can be organized and laid out. Understanding the database design choices early on is crucial to find the best combination.

I started using Timescale long before joining the company, initially using it to store IoT metrics at my own startups. We went through a few different iterations of designs, and migrations between those were everything but fun. Due to that personal experience, one of my biggest goals is to prevent others from suffering through the same.

Using Our Relational Database Experience

Being built on PostgreSQL, we understand we have many ways to store data, including in partitioned tables or just separate ones. A common pattern in the relational world is to divide data into separate tables based on their content, also often referred to as domain or entity. That means that data belonging to a set of A’s is stored in a different table than a set of B’s, representing different domains.

That leaves us questioning how the concepts of time-series data and relational domains fit together. Unfortunately, there is no easy answer. 

Squeezing vs. splitting data

Our primary options are “squeezing” all data into a single table, which could have hundreds of columns (basically defining our domain around the idea of “it’s all just metrics”), or splitting data into multiple tables with fewer columns. The latter choice may slice tables in many ways, such as by metric type (temperature is different from humidity, stock symbol A is different from symbol B), customer, data type, and others, or combinations of the previous.

Both possibilities have their own set of advantages and disadvantages, which can be split into four commonly seen topics:

  • Ease of use

  • Multi-tenancy / Privacy-related requirements (General Data Protection Regulation or GDPR / California Consumer Privacy Act or CCPA / others)

  • Schema migration or upgrading / Future-proofness

  • Tooling support

I did not choose the above order by accident: the sequential importance of these questions may influence your options further down the line.

Single vs. Multiple Table Designs: Pros and Cons

As said before, both design choices have pros and cons, and it’s vital to understand them before making a final data modeling decision. Given the previous set of topics, let’s start by answering a few questions.

Ease of use

First and foremost, how important is the ease of use, meaning, are you and your team up to the challenging tasks that need to be solved down the road? Potential “complications” could involve generation patterns for table names or ensuring that similar tables are all upgraded to the same schema level.

Multi-tenancy

Next up, are you required to provide a harder level of multi-tenancy, such as storing different customers not just by segregating them using a customer ID but are required to store them in different tables, schemas, or even databases? Is your company bound by regulations (e.g., GDPR or CCPA) where users and customers may have the right to be forgotten? With time-series data being (normally) append-only, removing parts of the data (this specific user’s data) may be tricky.

Schema changes

Then we have the question of whether you expect the data schema to change frequently. A large discussion around a future-proof design for hypertables can be found in the Narrow, Medium, or Wide Table Model write-up. However, the higher the number of tables, the more they need to be upgraded or migrated in the future, adding additional complexity.

Additional support

Finally, how important is support by additional tools and frameworks, such as ORM (Object Relational Mapping) solutions? While I personally don’t think ORM frameworks are a great fit for time-series data (especially when using aggregations), a lot of folks out there make extensive use of them, so talking about them has its merits.

Anyhow, now that we answered those questions, let’s dig into the design choices in greater detail.

Single Table Designs for Time-Series Data Modeling

Storing all data into a single table may initially feel irresponsible from a relational database point of view. Depending on how I slice my domain model, though, it could be a perfectly valid option. The design choice makes sense if I consider all stored data (metrics, events, IoT data, stock prices, etc.) as a single domain, namely time series.

The strengths

Single tables make a few things super simple. 

Querying

First of all, and probably obvious to most, is querying. Everything is in the same table, and the queries select certain columns and add additional filters or where clauses to select the data. That is as basic as it can get with SQL. That said, querying data is super simple, not just easy.

Upgrading schema

Upgrading the table’s schema is equally simple. Adding or removing columns implies a single command, and all data is upgraded at the same time. If you have multiple similar tables, you may end up in a situation where some tables are upgraded while others are simply forgotten—no real migration window is needed.

Single tables can easily support multiple different values, either through multiple columns (wide table layout), a JSONB column that supports a wide range of data values (narrow table layout), or through columns based on a value’s potential data type (medium table layout). Those three options have pros and cons, though.

tsdb=> \d
      Column    |           Type           | Collation | Nullable |      Default---------------+--------------------------+-----------+----------+-------------------
 created        | timestamp with time zone |           | not null | now()
 point_id       | uuid                     |           | not null | gen_random_uuid()
 device_id      | uuid                     |           | not null |
 temp           | double precision         |           |          |
 hum            | double precision         |           |          |
 co2            | integer                  |           |          |
 wind_speed     | integer                  |           |          |
 wind_direction | integer                  |           |          |

|                  created | point_id | device_id | temp |  hum | co2 |
 wind_speed | wind_direction |
| 2022-01-01 00:00:00.0+00 |      123 |        10 | 24.7 | 57.1 | 271 |
       NULL |           NULL |

Tool and framework integration

Last but not least, single tables play very nicely with tools like ORM frameworks. It is easy to connect a specific set of ORM entities to the hypertable, representing either a full record or the result of an aggregation (which may need a native query instead of an automatically generated one).

The challenges

But as with everything in this world, this choice has a large downside. Since time-series data is designed around the idea of being append-only (meaning that mutating existing records only happens occasionally), it is hard to delete data. Deleting data based on specific requirements is even harder, such as a user or customer asking to have all their data deleted.

In that situation, we’d have to crawl our way through potentially years of data, removing records from all over the place. That’s not only a burden on the WAL (Write-Ahead Log) to keep track of the changes, but it also creates loads and loads of I/O, reading, and writing.

The same is true if we try to store collected and calculated sets of data in the same table. With many systems often having to backfill data (for example, from devices that lost their internet connection for a while and were collecting data locally), calculated values may have to be recalculated. That means that the already stored data must be invalidated (which may mean deleted) and reinserted.

Finally, if your company provides different tiers of data retention, good luck implementing this on a single table. It is the previous two issues, but on a constant, more than ugly, basis.

Multiple Table Designs for Time-Series Data Modeling

Now that we know about the pros and cons of single table design, what are the differences when we aim for multiple table designs instead?

The Strengths

Querying

While querying is still simple, querying multiple sets of data simultaneously may be slightly more complicated, involving JOINs and UNIONs to merge data from the different tables. Requesting multiple sets of data at the same time is often done for efficiency reasons, requiring fewer database round trips and minimizing response time. Apart from that, there isn’t a massive difference in ease of use, except for table names, but we’ll come back to that in a second.

Dropping and Removing Data

One of the major benefits of having multiple tables, especially when sliced by the customer, user, or whatever meaningful multi-tenancy divider for your use case, is the option to quickly react to GDPR- or CCPA-related requests to destroy and remove any customer-related data. In this case, it is as easy as finding all the client’s tables and dropping them. Removing them from backups is a different story, though. 😌

The same is true with calculated and collected data. Separating those tables makes it much easier to throw away and recalculate all or parts of the data when late information arrives.

Data Retention

Also similar is the previously mentioned data retention. Many companies storing huge amounts of data on behalf of their customers provide different data retention policies based on how much the customer is willing to pay. Having tables sliced by customers makes it easy to set customer-specific retention policies and even change them when a customer upgrades to a higher tier. If this is something you need, the multi-table design is it.

The challenges

However, just as with single tables, multiples have drawbacks too.

Besides the already slightly more complicated elements around querying, which are not necessarily a disadvantage, having many tables requires planning a table name schema. The more dimensions we bring into the game (by customer, metric type, etc.), the more complicated our naming schema needs to be. That said, we may end up with table names such as <<customer_name>>__<<metric_type>>. While this doesn’t sound too bad, it can get ugly fast. We’ve all been there before. 😅

tsdb=> \dt *.cust_*
                    List of relations
 Schema |            Name            | Type  |   Owner
--------+----------------------------+-------+-----------
 public | cust_mycompany_co2         | table | tsdbadmin
 public | cust_mycompany_humidity    | table | tsdbadmin
 public | cust_mycompany_temperature | table | tsdbadmin
 public | cust_timescale_co2         | table | tsdbadmin
 public | cust_timescale_humidity    | table | tsdbadmin
 public | cust_timescale_temperature | table | tsdbadmin
(6 rows)

Tool and framework integration

Tools, such as an ORM framework, may make things even more complicated. Many of those tools are not designed to support arbitrary, runtime-generated table names, making it very complicated to integrate those with this concept. Using different PostgreSQL database schemas per customer and lowering the number of dimensions may help.

Upgrading and migrating tables

There is one more complexity: upgrading and migrating tables. Due to the multiple table design, we may end up with many similar tables segregated by the additional dimensions chosen. When upgrading the table schema, we need to ensure that all those tables eventually end up in a consistent state.

However, many automatic schema migration tools do not easily support that kind of multi-table migration. This forces us to fall back on writing migration scripts, trying to find all tables matching a specific naming schema, and ensuring that all are upgraded the same way. If we miss a table, we’ll figure it out eventually, but probably when it’s too late.

The TL;DR

Now that you’ve laid it all out and answered these questions, you can look at the requirements and see where your use case fits.

Some hard requirements may make your choice obvious, while some “nice-to-have” elements may still influence the final decision and hint at what may become a harder requirement in the near or far future.

Single Table Design

Multiple Table Design

Ease of Use

Easy

Somewhat easy

Multi-Tenancy / Privacy Regulations

Hard

Easy

Future-Proofness

Easy

Somewhat hard

Tooling Support

Easy

Hard

While the single table design is very easy to get started with, it may be a hard sell if you need to comply with regulations. However, the multiple table design is certainly more complicated to manage and use correctly.

How to Choose the Best Table for Your Data Modeling

There’s never a one-size-fits-all answer, or as consultants love to put it: it depends.

Unlike the design choices around the table layout, it’s too complex to make a real recommendation. You can only try to follow the suggested process of answering the questions above and looking at the answers to see which points represent hard requirements, which could become hard requirements, and which are simply nice to have. That way, you’ll likely find the answer to match your use case.

The mix-and-match approach

Also, remember that you may have different use cases with diverging requirements, so one use case may end up being perfectly fine running as a single table design, while the other one(s) may need multiple tables.

Plus, you have the chance to mix and match the benefits of both solutions. It was kind of hinted at in the text already, but it is possible to use a simplified multiple table design (for example, per metric type) and separate the customer dimension into a PostgreSQL database schema, with each schema representing one customer.

Similarly, it is possible to use the schema to separate customers and store all metrics/events/data for that specific customer in a single hypertable. There are plenty of options, only limited by your imagination.

Whatever you end up with, try to be as future-proof as possible. Try to imagine what the future will hold, and if you’re unsure whether a somewhat harder requirement will become a must-have, it may be worth considering it as a hard requirement now just to be on the safe side.

Next Steps

If you want to start designing your hypertable database schema as soon as possible, ensuring you get the best performance and user experience for your time-series data while achieving incredible compression ratios, check out Timescale.

If you’re looking to test it locally or run it on-prem, then we’ve got you covered: have a look at our documentation.

Timescale Logo

Subscribe to the Timescale Newsletter

By submitting, I acknowledge Timescale’s Privacy Policy
2024 © Timescale Inc. All rights reserved.