Best Practices for Time-Series Data Modeling: Single or Multiple Partitioned Table(s) a.k.a. Hypertables
Collecting time-related information, or time-series data, creates massive amounts of data to manage and model. Storing it will require one or more Timescale hypertables, which are very similar to PostgreSQL partitioned tables.
Timescale hypertables work like regular PostgreSQL tables but offer optimized performance and user experience for time-series data. With hypertables, data is stored in chunks, which work similarly to PostgreSQL’s partitioned tables but support multiple dimensions and other features. While we discussed the table layout in the Narrow, Medium, or Wide Table Layout best practices article, this time, we addressed whether you should use a single table to store all data versus using multiple tables, as well as their respective pros and cons.
Timescale is built upon a relational database model, which means it supports numerous data modeling choices or ways in which data can be organized and laid out. Understanding the database design choices early on is crucial to find the best combination.
I started using Timescale long before joining the company, initially using it to store IoT metrics at my own startups. We went through a few different iterations of designs, and migrations between those were everything but fun. Due to that personal experience, one of my biggest goals is to prevent others from suffering through the same.
Time-Series Data Modeling: Using our Relational Database Experience
As mentioned, Timescale uses a relational data model at its core. Being built on PostgreSQL, we understand we have many ways to store data, including in partitioned tables or just separate ones. A common pattern in the relational world is to divide data into separate tables based on their content, also often referred to as domain or entity. That means that data belonging to a set of A’s is stored in a different table than a set of B’s, representing different domains.
That leaves us questioning how the concepts of time-series data and relational domains fit together. Unfortunately, there is no easy answer. Our primary options are “squeezing” all data into a single table, which could have hundreds of columns (basically defining our domain around the idea of “it’s all just metrics”), or splitting data into multiple tables with fewer columns. The latter choice may slice tables in many ways, such as by metric type (temperature is different from humidity, stock symbol A is different from symbol B), customer, data type, and others, or combinations of the previous.
Both possibilities have their own set of advantages and disadvantages, which can be split into four commonly seen topics:
- Ease of use
- Multi-tenancy / Privacy-related requirements (General Data Protection Regulation or GDPR / California Consumer Privacy Act or CCPA / others)
- Schema migration or upgrading / Future-proofness
- Tooling support
I did not choose the above order by accident: the sequential importance of these questions may influence your options further down the line.
Pros and Cons
As said before, both design choices have pros and cons, and it’s vital to understand them before making a final data modeling decision. Given the previous set of topics, let’s start by answering a few questions.
First and foremost, how important is the ease of use, meaning, are you and your team up to the challenging tasks that need to be solved down the road? Potential “complications” could involve generation patterns for table names or ensuring that similar tables are all upgraded to the same schema level.
Next up, are you required to provide a harder level of multi-tenancy, such as storing different customers not just by segregating them using a customer ID but are required to store them in different tables, schemas, or even databases? Is your company bound by regulations (e.g., GDPR or CCPA) where users and customers may have the right to be forgotten? With time-series data being (normally) append-only, removing parts of the data (this specific user’s data) may be tricky.
Then we have the question of whether you expect the data schema to change frequently. A large discussion around a future-proof design for hypertables can be found in the Narrow, Medium, or Wide Table Model best practices write-up. However, the higher the number of tables, the more they need to be upgraded or migrated in the future, adding additional complexity.
Finally, how important is support by additional tools and frameworks, such as ORM (Object Relational Mapping) solutions? While I personally don’t think ORM frameworks are a great fit for time-series data (especially when using aggregations), a lot of folks out there make extensive use of them, so talking about them has its merits.
Anyhow, now that we answered those questions, let’s dig into the design choices in greater detail.
Single Table Designs
Storing all data into a single table may initially feel irresponsible from a relational database point of view. Depending on how I slice my domain model, though, it could be a perfectly valid option. The design choice makes sense if I consider all stored data (metrics, events, IoT data, stock prices, etc.) as a single domain, namely time series.
Single tables make a few things super simple. First of all, and probably obvious to most, is querying. Everything is in the same table, and the queries select certain columns and add additional filters or where clauses to select the data. That is as basic as it can get with SQL. That said, querying data is super simple, not just easy.
Upgrading the table’s schema is equally simple. Adding or removing columns implies a single command, and all data is upgraded at the same time. If you have multiple similar tables, you may end up in a situation where some tables are upgraded while others are simply forgotten—no real migration window is needed.
Single tables can easily support multiple different values, either through multiple columns (wide table layout), a JSONB column that supports a wide range of data values (narrow table layout), or through columns based on a value’s potential data type (medium table layout). Those three options have pros and cons, though.
tsdb=> \d Column | Type | Collation | Nullable | Default —---------------+--------------------------+-----------+----------+------------------- created | timestamp with time zone | | not null | now() point_id | uuid | | not null | gen_random_uuid() device_id | uuid | | not null | temp | double precision | | | hum | double precision | | | co2 | integer | | | wind_speed | integer | | | wind_direction | integer | | | | created | point_id | device_id | temp | hum | co2 | wind_speed | wind_direction | | 2022-01-01 00:00:00.0+00 | 123 | 10 | 24.7 | 57.1 | 271 | NULL | NULL |
Last but not least, single tables play very nicely with tools like ORM frameworks. It is easy to connect a specific set of ORM entities to the hypertable, representing either a full record or the result of an aggregation (which may need a native query instead of an automatically generated one).
But as with everything in this world, this choice has a large downside. Since time-series data is designed around the idea of being append-only (meaning that mutating existing records only happens occasionally), it is hard to delete data. Deleting data based on specific requirements is even harder, such as a user or customer asking to have all their data deleted.
In that situation, we’d have to crawl our way through potentially years of data, removing records from all over the place. That’s not only a burden on the WAL (Write-Ahead Log) to keep track of the changes, but it also creates loads and loads of I/O, reading and writing.
The same is true if we try to store collected and calculated sets of data in the same table. With many systems often having to backfill data (for example, from devices that lost their internet connection for a while and were collecting data locally), calculated values may have to be recalculated. That means that the already stored data must be invalidated (which may mean deleted) and reinserted.
Finally, if your company provides different tiers of data retention, good luck implementing this on a single table. It is the previous two issues, but on a constant, more than ugly, basis.
Multiple Table Designs
Now that we know about the pros and cons of single table design, what are the differences when we aim for multiple table designs instead?
While querying is still simple, querying multiple sets of data simultaneously may be slightly more complicated, involving
UNIONs to merge data from the different tables. Requesting multiple sets of data at the same time is often done for efficiency reasons, requiring fewer database round trips and minimizing response time. Apart from that, there isn’t a massive difference in ease of use, except for table names, but we’ll come back to that in a second.
One of the major benefits of having multiple tables, especially when sliced by the customer, user, or whatever meaningful multi-tenancy divider for your use case, is the option to quickly react to GDPR- or CCPA-related requests to destroy and remove any customer-related data. In this case, it is as easy as finding all the client’s tables and dropping them. Removing them from backups is a different story, though. 😌
The same is true with calculated and collected data. Separating those tables makes it much easier to throw away and recalculate all or parts of the data when late information arrives.
Also similar is the previously mentioned data retention. Many companies storing huge amounts of data on behalf of their customers provide different data retention policies based on how much the customer is willing to pay. Having tables sliced by customers makes it easy to set customer-specific retention policies and even change them when a customer upgrades to a higher tier. If this is something you need, the multi-table design is it.
However, just as with single tables, multiples have drawbacks too.
Besides the already slightly more complicated elements around querying, which are not necessarily a disadvantage, having many tables requires planning a table name schema. The more dimensions we bring into the game (by customer, metric type, etc.), the more complicated our naming schema needs to be. That said, we may end up with table names such as
<<customer_name>>__<<metric_type>>. While this doesn’t sound too bad, it can get ugly fast. We’ve all been there before. 😅
tsdb=> \dt *.cust_* List of relations Schema | Name | Type | Owner --------+----------------------------+-------+----------- public | cust_mycompany_co2 | table | tsdbadmin public | cust_mycompany_humidity | table | tsdbadmin public | cust_mycompany_temperature | table | tsdbadmin public | cust_timescale_co2 | table | tsdbadmin public | cust_timescale_humidity | table | tsdbadmin public | cust_timescale_temperature | table | tsdbadmin (6 rows)
Tools, such as an ORM framework, may make things even more complicated. Many of those tools are not designed to support arbitrary, runtime-generated table names, making it very complicated to integrate those with this concept. Using different PostgreSQL database schemas per customer and lowering the number of dimensions may help.
There is one more complexity: upgrading and migrating tables. Due to the multiple table design, we may end up with many similar tables segregated by the additional dimensions chosen. When upgrading the table schema, we need to ensure that all those tables eventually end up in a consistent state.
However, many automatic schema migration tools do not easily support that kind of multi-table migration. This forces us to fall back on writing migration scripts, trying to find all tables matching a specific naming schema, and ensuring that all are upgraded the same way. If we miss a table, we’ll figure it out eventually, but probably when it’s too late.
Now that you’ve laid it all out and answered these questions, you can look at the requirements and see where your use case fits.
Some hard requirements may make your choice obvious, while some “nice-to-have” elements may still influence the final decision and hint at what may become a harder requirement in the near or far future.
|Single Table Design||Multiple Table Design|
|Ease of Use||Easy||Somewhat easy|
While the single table design is very easy to get started with, it may be a hard sell if you need to comply with regulations. However, the multiple table design is certainly more complicated to manage and use correctly.
So What's the Best One for Me?
There’s never a one-size-fits-all answer, or as consultants love to put it: it depends.
Unlike the design choices around the table layout, it’s too complex to make a real recommendation. You can only try to follow the suggested process of answering the questions above and looking at the answers to see which points represent hard requirements, which could become hard requirements, and which are simply nice to have. That way, you’ll likely find the answer to match your use case.
Also, remember that you may have different use cases with diverging requirements, so one use case may end up being perfectly fine running as a single table design, while the other one(s) may need multiple tables.
Plus, you have the chance to mix and match the benefits of both solutions. It was kind of hinted at in the text already, but it is possible to use a simplified multiple table design (for example, per metric type) and separate the customer dimension into a PostgreSQL database schema, with each schema representing one customer.
Similarly, it is possible to use the schema to separate customers and store all metrics/events/data for that specific customer in a single hypertable. There are plenty of options, only limited by your imagination.
Whatever you end up with, try to be as future-proof as possible. Try to imagine what the future will hold, and if you’re unsure whether a somewhat harder requirement will become a must-have, it may be worth considering it as a hard requirement now just to be on the safe side.
If you want to start designing your hypertable database schema as soon as possible, ensuring you get the best performance and user experience for your time-series data while achieving incredible compression ratios, check out Timescale Cloud.
If you’re looking to test it locally or run it on-prem, then we’ve got you covered: have a look at our documentation.