Best Practices for Time-Series Data Modeling: Narrow, Medium or Wide Table Layout
As time-series applications are designed to store massive amounts of time-related information, data modeling the underlying data storage when you have one or more Timescale hypertables, is essential. The data model itself describes the logical representation of the information in the shape of a relational database table.
Since Timescale is built upon a relational database model, there are a few ways you can design our hypertables. It is important to understand them early in the process of laying out the data storage to be prepared for the near and far future.
Before I joined Timescale, I had my own startup, clevabit, and we went through a few iterations of storage layout, eventually opting for a narrow/JSON table design for quite a while before moving to a medium design. What I learned, though, is that every single migration of existing data hurts really hard—an experience you definitely don’t need to go through.
Generally, three major design options are available: narrow, wide, and (for lack of a better term) medium. In this blog post, we look at the options and explain why and how you would use them.
Time-Series Data Modeling: What Are the Different Table Layouts?
Let’s start with the narrow table layout as part of your data modeling for time series. A common characteristic of a narrow table is that there are few actual data or value columns, often just one. That does not mean the table can’t have more columns (e.g., you could assign a row to a specific time series, device, or metric). Here is an example of a narrow table:
tsdb=> \d Column | Type | Collation | Nullable | Default —---------------+--------------------------+-----------+----------+------------------- created | timestamp with time zone | | not null | now() point_id | uuid | | not null | gen_random_uuid() device_id | uuid | | not null | metric | text | | not null | value | double precision | | not null | | created | point_id | device_id | metric | value | | 2022-01-01 00:00:00.0+00 | 123 | 10 | temp | 24.7 |
As you can see, the row has a single column that holds the value: in this case, a decimal value. This example also uses one column for the metric name (in this example, “temp” for temperature). It could also be anything stored as a decimal value. If you want to use other data type values, you could create a different table to store a text string or whatever other type. Alternatively, you often see JSONB as the data type for the value column. That way, you can store any valid JSON value in the column.
tsdb=> \d Column | Type | Collation | Nullable | Default —---------------+--------------------------+-----------+----------+------------------- created | timestamp with time zone | | not null | now() point_id | uuid | | not null | gen_random_uuid() device_id | uuid | | not null | metric | text | | not null | value | jsonb | | not null |
The wide table layout is the exact opposite of the narrow layout, meaning your data modeling work will end up with a time-series table with a lot of columns. Most often, it’s one column per available metric. Depending on the number of metrics you have, wide tables can get—as the name indicates—very wide, very quickly. It is common to see 200 or more columns in a single table. A simple example of a wide table could be something like this:
tsdb=> \d Column | Type | Collation | Nullable | Default —---------------+--------------------------+-----------+----------+------------------- created | timestamp with time zone | | not null | now() point_id | uuid | | not null | gen_random_uuid() device_id | uuid | | not null | temp | double precision | | | hum | double precision | | | co2 | integer | | | wind_speed | integer | | | wind_direction | integer | | | | created | point_id | device_id | temp | hum | co2 | wind_speed | wind_direction | | 2022-01-01 00:00:00.0+00 | 123 | 10 | 24.7 | 57.1 | 271 | NULL | NULL |
Based on your use case, the value columns may be nullable or not. That mostly depends on how data is delivered. For example, if data is provided at the same time and can be stored in one row, or if data becomes available at different times, and only one or more columns are used each time. In the above example, the three values for temperature, humidity, and CO2 are provided together, while wind speed and wind direction would be in a separate data packet, hence the NULL values.
Finally, the medium table layout. As the name suggests, it sits between narrow and wide models. The basic design principle is to create one column for each necessary data type. That said, in the wide example, we had two different data types, an integer and a decimal (let’s assume float8). Therefore, we’d end up with two different value columns, like this:
tsdb=> \d Column | Type | Collation | Nullable | Default —---------------+--------------------------+-----------+----------+------------------- created | timestamp with time zone | | not null | now() point_id | uuid | | not null | gen_random_uuid() device_id | uuid | | not null | metric | text | | not null | int_value | integer | | | float_value | double precision | | | | point_id | device_id | metric | int_value | float_value | | 123 | 10 | temp | NULL | 24.7 |
If you need more data types, you can just add more columns. This is a very convenient setup if you know all the expected data types, which is fairly easy compared to “knowing all possible metrics in advance.” However, it requires you to know which metric type would be found in which specific data type column.
Pros and Cons
Modeling data according to the three different layouts comes with its pros and cons. Let’s get into the details of each one.
Narrow table layout
The narrow table layout is great if you have a low cardinality of metrics and you know that up front, along with all their data types. If you don’t know this, you can remove the metric name as a column, and create one hypertable for each metric type, somehow mangling the metric into the table name, like this:
Table: temperature | point_id | device_id | value | | 123 | 10 | 24.7 | Table: humidity | point_id | device_id | value | | 123 | 10 | 57.1 |
In this example, I created independent tables for the temperature and humidity metrics. That way, you can think about the data type of every metric independently, and the table design becomes much easier.
Extending the layout with additional metrics is easy: create an additional hypertable matching the metric’s data type, and you’re done. This, however, can also be a disadvantage. If you have a high number of different metrics, the number of individual tables could become a hassle to manage. While PostgreSQL doesn’t have an issue with many tables, we humans often do. However, you could use a tool like Promscale to help automatically manage the relational data model for you.
Additional complexity may come in if you want to split those tables further down, for example, having a temperature table per customer (isolation, also known as multi-tenancy). You could handle this by mangling the name (
<<customer_id>>_temperature) or using a lookup table that maps the customer and metric to a generated table name. It could get tedious, though.
Another way to mitigate complexity is using JSONB as the value column’s data type. This allows you to unify all metrics per customer into a single table, adding back the metric column. Remember, you still need to understand what data type the value will be when queried since JSONB might force you to cast the resulting value into the data type your application expects (for example, JSON doesn’t use real integers, only numbers according to IEEE 754).
Last but not least, querying many different metrics at the same time can mean you need to use a lot of JOIN or UNIONs, especially when every metric has its own table. While it doesn’t necessarily slow down the query, it can be tedious to write or generate the queries. Plus, every JOIN operation increases the input/output (I/O) load on the system when preparing the result. All this becomes more likely when the cardinality of metrics keeps increasing.
Medium table layout
The medium table layout, on the other hand, has one row with one column per needed data type. That means you only need one column defining what kind of data is stored in that specific row. This may need an additional lookup table, not only defining the metrics but also which data type column needs to be queried for each. For a low, static number of metrics, you could handle this in the application layer.
This layout provides most of the benefits of the narrow design, being easy to extend with additional metrics, as long as we don’t want to introduce a new data type. But even then, you only need to add a new column to the table layout. It depends on the existing number of rows in your hypertable as to how heavy an operation like that is.
The medium layout is a very practical design choice if you are sure about the different data types that can occur in the future but are not yet certain of all potential metrics. This is a common situation in IoT use cases, where you add new types of metrics frequently. While the number of metrics increases, and you have to add them to your lookup mechanism (metric to data type column), the table's design doesn’t change, and you can keep adding new metrics without touching the layout.
Multi-tenancy is easily achieved by creating one hypertable per customer and storing all corresponding data in that particular table. If you need to query data for multiple customers, you can JOIN and UNION ALL the data together and get a single result set.
It is also straightforward to query multiple metrics simultaneously since all data resides in the same table. If you want to retrieve data from multiple different data types, though, you may end up using CASE WHEN clauses and fill result columns with NULLs when no value is available for a specific metric but other queried ones. This would need to be handled in the application layer.
Wide table layout
Finally, while the wide table layout seems a common choice for beginners, it is probably the most complicated to manage and extend over time.
The wide table is a great choice if you know all potential metrics up front. This is often the case with systems that have been in production for a long time and data that will be migrated over. If new columns need to be added only at a low rate, it may be a viable option, but adding a column may be a heavy operation.
Besides that, supporting multi-tenancy is simple, adding a new wide table per customer. Remember, when adding a new column, you may have to add it to all of the customer’s tables too. This can be a tedious process.
Querying data is simple: we don’t have to cast data as they’re already stored in their native data types, and there is no need to look up the column to query since there is one column per metric. Also, querying a subset of the columns is very efficient, especially when chunks are compressed into columnar representation.
Querying data from multiple customers is quick. Just JOIN and UNION the different intermediate queries together, and you’re done.
The biggest issue with the wide table is, as already mentioned, adding more metric columns. Keeping track of things in systems that are constantly growing with new metrics can be challenging.
Also, while dropping a column because you don’t need the metric anymore is an immediate operation (the column will just be hidden from the query engine), it doesn’t immediately remove the content from disk. The rows will be rewritten whenever that row is updated, which will most likely never happen in time-series data, leaving the dangling data around forever. Adding* NULLABLE* columns is simple, the same as
DEFAULT values. Non-updated rows will be evaluated at query time.
✨Editor’s Note: Read how the analytics company Nodecolytics makes their queries up to six times less costly by using wide tables in TimescaleDB while ingesting and recording millions of events per day.
All that said, there are quite a few bits to consider when making a decision. While it is possible to migrate from one layout to another, the process can be quite tedious, especially with a growing number of records being stored.
Therefore, the following table gives you a quick overview of the most important elements of the previous section.
|Narrow Table Layout||Medium Table Layout||Wide Table Layout|
|Ease of Use||Easy||Easy||Somewhat easy|
|Cost of Up-Front Design||Low||Somewhat low||Somewhat high|
|Extensibility||Easy||Somewhat easy||Somewhat complicated|
While the wide table format is a common sight with early designs, it is not necessarily the best choice in the long run. A good layout choice is based on your use case and may change from hypertable to hypertable.
Hypertable Compression (Compressed Chunks)
Since we just talked about compression, let’s have a quick look at how TimescaleDB’s compression plays with the different table layouts. Generally speaking, all layouts are fully supported when using compression. However, some of them have other cons, especially when you want to add metrics later on.
This is not a major issue anymore, given that compressed hypertables can be extended with additional
NULLABLE DEFAULT NULL columns for quite a while now.
The easiest is the narrow table layout. Since no columns are to be expected at a later point in time (given you decided early on if you want to use the metric column or table name setup), no issues are to be expected with compression. You can also expect a really good compression ratio since the records are very uniform and, therefore, compress extremely well.
The medium table layout also plays very nicely with compression, given you have considered all potential data types, and it is unlikely that you will have to add columns in the future. But even if you have to, the column will be
NULLABLE as there can’t be any previous metrics using the newly added column. As mentioned above, adding a column to a non-compressed chunk may need some heavier chunk rewriting, though.
Last but not least, the wide table layout is the most complicated regarding compression since it may be the most likely candidate to include additional columns at a regular cadence. Still, as with the medium layout, new columns usually are
NULLABLE and, therefore, fully supported by compressed chunks. Again, just remember potential rewrite operations on non-compressed chunks.
Unexpected tidbit: due to how the compressed chunks are designed (columnar-like storage), query speed often increases because of the lower number of I/O operations. This is still true, even though decompression may add a bit of overhead.
As a final note on compression and the above layout options, columns with NULL compress really well. An often-seen worry for many people is that the number of NULL values in medium or wide layouts creates a lot of unnecessary bloat in the hypertable. But due to the way compressed chunks are designed, NULL values compress to almost nothing.
Which to Choose?
Your use case is going to influence your choice of table design.
While in the wild, the medium table layout is probably the most uncommon one, it is often an excellent trade-off between an easy-to-grasp design, future extensibility, and a quick start. Some lookup mechanism is necessary to understand what metric will be available in what data type column.
If you know the exact metrics you want to store, the wide table layout may be a great choice. It provides a quick setup, great query speeds (especially with compressed enabled), and common queries, basically what your typical relational database query looks like. Finally, no lookup tables are necessary.
If you want the most straightforward design choice and the highest isolation of different types of metrics, the narrow table layout is yours. Given that there are no changes in the table’s layout to be expected, it is the easiest to design since you probably know exactly what the data type for a specific metric (or set of metrics sharing the same data type) looks like. Adding a hypertable for a new metric is as simple as every other before. With a high cardinality of metrics, a lookup mechanism may be necessary to understand what table to query (especially with name mangling).
There is no “one-size-fits-all” layout, and to put it into a common consultant answer: “It depends.” Look at your specific use case and decide on the pros and cons of the different layout aspects. Maybe you have different use cases with different benefits using different layouts. You may, for example, want to store IoT metrics from sensors differently from your infrastructure’s observability data.
Whatever you do, just remember one thing, it is always possible to switch from one layout to another, but migrating isn’t simple. Never. If you’re unsure about the future, consider one of the simpler extensible layout options.
If you want to start designing your hypertable today and improve your performance and user experience for time-series data while achieving impressive compression ratios, try self-hosting TimescaleDB for free.
For an unparalleled cloud-native PostgreSQL for time series, events, and analytics, sign up for Timescale Cloud. Spin up your first instance in minutes with a free 30-day trial; no credit card required.