Time-Series Database: An Explainer

Time-Series Database: An Explainer

People create, capture, and consume more data than ever before. And a big portion of this data volume is timestamped, a.k.a. time-series data.

So it’s not surprising that the time-series database (TSDB) category has seen a lot of growth in popularity in the past five years.

In this post, we’ll introduce you to time-series databases and how they can work for you:

What Is a Time-Series Database?

A time series database is a type of database specifically designed for handling time-stamped or time-series data. Time-series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time.

This could include server metrics, application performance monitoring data, network data, sensor data, and more. Whether you are recording the temperature in your garden, the price of a stock, or monitoring your application’s usage data, you are dealing with time-series data.

Since time-series data is time-centric, recent, and normally append-only, a time-series database (TSDB) leverages these foundational characteristics to store time-series data more simply and efficiently than general databases.

Time-series databases vs traditional databases

You might ask: Why can’t I just use a “normal” (i.e., non-time-series) database?

Well, you can, and some people do. But relational databases like PostgreSQL get too slow for time-series data once tables start getting big.

Time-series data accumulates very quickly, and relational databases are not designed to handle that scale (at least not in an automated way). Traditionally, relational databases fare poorly with vast datasets, and NoSQL databases are hailed as the best performers at scale.

But the truth is that, with some engineering fine tuning, PostgreSQL can be turned into a time-series database—actually performing much better than other specialized solutions (as we've shown in benchmarks versus InfluxDB, Cassandra, and MongoDB).

Time-series databases—whether relational or NoSQL-based—introduce efficiencies that are only possible when you treat the time element as a first-class citizen.

These efficiencies allow them to offer massive scale, from performance improvements, including higher ingest rates and faster queries at scale (although some support more queries than others) to better data compression.

Is My Data Time-Series Data?

Even if you don't often refer to your data as time-series data, they might be. A few examples of specific time-series data types:

  • Sensor data
  • Transaction data / Financial transactions / Customer transactions / Order history
  • Operational analytics / Application data
  • Fleet data /Logistics
  • Metrics data
  • Tick data / Fintech data / Trading data
  • Event data
  • Vector data
  • Weather data
  • Insurance data
  • Call records

If you want to know if your data is time-series data, this is the litmus test: does your data have some kind of timestamp or time element related to it, even if that may not be its main dimension? If the answer is “yes,” you’re dealing with time-series data.

Why Businesses Depend on Time-Series Databases

There are several reasons for this trend:

  • The increasing number of IoT devices and applications has increased the need for databases that can store time-series data efficiently. IoT often generates high-granularity and high-volume datasets that general-purpose databases can’t handle very well.
  • Most companies plan to invest more in data and analytics. As organizations seek to gain insights from both their historical and real-time data, the demand for time-series databases has increased.
  • The rise of observability and the growing importance of monitoring your application at all times have also contributed to the popularity of time-series databases since this kind of data is also time-series data.

How does a time-series database help?

A time-series database helps you worry less about your database infrastructure, so you can spend more time building your application or gaining insight from the data.

You can make it simpler or faster (or both) to ingest and query data by optimizing a time-series database and implementing shortcuts. Here are some examples.

Improved data ingestion performance

Time-series data is often generated at a high granularity. Going back to the previous finance example, there might be hundreds of data points generated every second or even more (depending on how many symbols you monitor). High volumes of data can be challenging to ingest efficiently. Time-series databases are equipped with internal optimizations like auto-partitioning and indexing, allowing you to scale up your ingestion rate.

Simplified querying

When you query time-series data, you are often interested in how the data have been changing in the past five minutes, five days, five weeks, etc. You are also more likely to want to perform time-based calculations like time-based aggregations (time bucketing), moving averages, time-weighted averages, percentiles, and so on. A time-series database has specialized features to simplify and speed up the calculation of typical time-series queries.

time

symbol

price

2023-02-06 15:05:26

PFE

44.19

2023-02-06 15:05:25

WMT

141.27

2023-02-06 15:05:24

KO

59.67

2023-02-06 15:05:24

TSLA

194.08

2023-02-06 15:05:24

SNAP

10.88

Store real-time and historical data in one place

The value of time-series data comes from the fact that you can compare the most current state of your system with past situations. A time-series database has the tools and scale needed to store both historical (archives) and real-time data in one data store, making it easy for you to keep an eye on what’s happening right now while also having the ability to learn from the past. This gives you the power to simplify your data infrastructure and seamlessly analyze all your data in one place.

Automated data management

Time-series databases can also automate your time-based data management tasks. For example, you might want to get rid of all data that is older than one year to save disk space and because you don’t need old data anymore. Or you still need to keep old data around, but maybe it’s not as handy as more recent data, you can set up your database to automatically compress old data to save on storage costs.

There might be other valuable automations in a time-series database. For example, in TimescaleDB you can use continuous aggregates to incrementally add data to a predefined materialized view—improving query performance and developer productivity.

Top Time-Series Databases

According to DB-Engines, here are some of the top time-series databases:

Tips for Choosing a Time-Series Database

Once your applications start storing time-series data, you still have to pick a TSDB that best fits your data model, write/read pattern, and developer skill sets.

When evaluating database options, consider these factors:

  • Scalability. You must ensure that the database can scale vertically (adding more resources to a database node) and horizontally (adding more database nodes to your system) while remaining performant and reliable.
  • Maintainability. Consider the time and effort it will take to maintain it long-term—backups, replicas, data retention, archiving, automation, and so on. Think about the maintenance jobs you want to do and see if the database has the tools and features to help you.
  • Reliability. You might notice that many companies in this space are developing brand-new technologies from scratch. As time-series data quickly becomes the basis of business decisions and forecasts, you need to be sure that it will be available when you need it.
  • Query language. There are also quite a few time-series databases on the market with their own custom languages (e.g., Flux by InfluxDB). This can be a turn-off for many developers because they’d need to learn a new language just to use the database. And even though some of them try to look like SQL, they are not real SQL.

Time-Series Database Comparison: InfluxDB vs TimescaleDB

Time-series databases may provide general features as well as time-series features (e.g., TimescaleDB) or focus on providing time-series features at the cost of supporting more general workloads (e.g., InfluxDB).

Time-series data doesn’t live in a silo, so whatever the approach, you will need a way to relate general and time-series data.

In the following section, you will see examples of time-series features you can expect from a time-series database. To show you both ends of the spectrum, you’ll see what the features look like in two totally different time-series databases: InfluxDB (with its own “Flux” query language) and TimescaleDB (which extends PostgreSQL and offers full-SQL support).‌‌

📝 Editor's note: This article was written before InfluxDB 3.0. Read this more recent article about InfluxDB.

Time-based aggregation

Time-based aggregation is a must-have feature for time-series databases. Without it, you won’t be able to create hourly, daily, weekly, and monthly time aggregations (and many permutations in between) that are key for analyzing time-series data.

Let’s see what time-based aggregate queries look like in the case of InfluxDB and TimescaleDB:

InfluxDB

from(bucket: "crypto")
  |> range(start: 0)
  |> aggregateWindow(every: 1d, fn: count, createEmpty: false)
  |> sort(columns: ["_time", ], desc:true)
  |> limit(n:10)

Using InfluxDB’s query language, Flux, you must define a time range for the query. You can use 0 as the start parameter to work around this—in case you don’t want to define a specific time range. Then you can use the aggregateWindow() function to create arbitrary time “buckets”.

TimescaleDB

SELECT
  time_bucket('1 day', time) AS bucket,
  count(*) row_count
FROM crypto_ticks
GROUP BY bucket
ORDER BY bucket DESC
LIMIT 10;

bucket         	|row_count|
-------------------+---------+
2022-09-09 02:00:00|	52277|
2022-09-08 02:00:00|   128657|
2022-09-07 02:00:00|   134849|
2022-09-06 02:00:00|   132837|
2022-09-05 02:00:00|   126254|

In TimescaleDB you can use the time_bucket() function to create arbitrary time buckets. Besides, all the other available PostgreSQL functions like count(*) works the same way as in regular PostgreSQL.

When working with time-series data, you will have to create a lot of time-based aggregations—so make sure that the database you choose provides a simple and intuitive interface to create time buckets.

Time-series data modeling

Time-series data, as mentioned, is always connected to time. For this reason, if you use a relational database to store time series, it makes sense to put an index on the time column. You can also create indexes that include the time and other columns you frequently filter.

Indexes in relational databases are essential to improve performance. Some time-series databases provide this functionality by default, for example, when you create a hypertable in TimescaleDB, a time-series database. Let’s see an example data model for a typical financial application in PostgreSQL (the same would apply in TimescaleDB):

CREATE TABLE stocks_real_time (
    time TIMESTAMPTZ NOT NULL,
    symbol TEXT NOT NULL,
    price DOUBLE PRECISION NULL
);

This is a simplified version of our schema in our getting started guide. This example has one TIMESTAMPTZ column called time. If you use a relational database, it might make sense to put an index on this column to speed up queries where you filter by time (e.g., get me all the data from the past hour).

The second column in this example is a symbol. This stores the company’s stock ticker symbol. You might consider creating an index that includes both time and symbol columns to improve the performance of queries where you filter by both time and symbol (e.g., get me all data for “TSLA” from the past hour).

Finally, the last column in this example is price, which stores the price of the given stock symbol at the given time.

Some databases only allow you to store timestamped data in your database. In contrast, others will enable you to store non-timestamped data right next to your timestamped data in the same database. This can be an important feature of the database if you want to simplify your data infrastructure and use only one database to store all of your data.

Automatic downsampling

Time-series data is often ingested at a very high resolution (e.g., thousands of data points per second). To make it easier to analyze time series, users often downsample their data (e.g., they convert thousands of data points per second to only one). This technique doesn’t only save storage costs because you need to store lower-resolution data, but it also makes it easier to create visualizations and recognize trends in the data.

Downsampling is often done repeatedly and continuously, which means that if, for example, you insert multiple new rows every second, the database rolls up the incoming data into larger buckets automatically. Instead of aggregating the raw data yourself, the database takes care of it automatically and in real time.

Let’s see how InfluxDB and TimescaleDB handle downsampling with an OHLC example.

InfluxDB

close=from(bucket: "crypto")
  |> range(start: -30d)
  |> group(columns:["symbol"])
  |> filter(fn: (r) => r["_measurement"] == "ohlc")
  |> window(every: 1h)
  |> reduce(fn: (r, accumulator) => ({
  
      indexLow:
        if (r._field=="low") then 
          accumulator.indexLow+1 
        else
        accumulator.indexLow,
      indexOpen: 
      if (r._field=="open") then 
       accumulator.indexOpen+1 
       else 
       accumulator.indexOpen,
        open: 
      if (r._field=="open") then 
        if (accumulator.indexOpen==0) then 
          r._value 
        else 
          accumulator.open
      else
        accumulator.open  
    ,
    
    
      high:
       if (r._field=="high") then  
          if(r._value>accumulator.high ) then
            r._value
          else
            accumulator.high 
      else 
        accumulator.high
   ,
    low: 
      if (r._field=="low") then

          if(r._value<accumulator.low or accumulator.indexLow==0.0) then
            r._value
          else
           accumulator.low 
      else 
        accumulator.low,

             close: 
       if (r._field=="close") then 
          r._value 
      else 
        accumulator.close,
             volume: 
        if (r._field=="volume") then
          r._value+accumulator.volume 
          else
           accumulator.volume
             }),
    identity: {indexLow:0,indexOpen:0,open: 0.0,high: 0.0,low: 0.0,close: 0.0,volume: 0.0})
    |> drop(columns: ["indexOpen","indexLow"])
  |> group(columns:["pair"])
    |> yield(name: "candle")
        

InfluxDB’s Flux provides a convenient way to write simple queries, but if you want to create somewhat more complex queries, like creating OHLC aggregates from raw financial tick data, the final query can become quite long as you can see.

TimescaleDB

CREATE MATERIALIZED VIEW hourly_buckets
WITH (timescaledb.continuous)
AS
SELECT
  time_bucket('1 hour', time) AS bucket,
  symbol,
  first(price, time) AS open,
  max(price) AS high,
  min(price) AS low,
  last(price, time) AS close
FROM crypto_ticks
GROUP BY bucket, symbol;



SELECT * FROM hourly_buckets;
bucket         	|symbol  |open   |high   |low	|close  |
-------------------+--------+-------+-------+-------+-------+
2022-02-08 22:00:00|ADA/USD |  1.166|   1.17|  1.157|  1.168|
2022-02-08 22:00:00|ATOM/USD|  30.44|  30.63|   30.3|  30.51|
2022-02-08 22:00:00|AVAX/USD|  87.85|   88.0|  86.72|  87.06|
2022-02-08 22:00:00|BNB/USD |  413.5|  416.5|  410.3|  410.3|
2022-02-08 22:00:00|BTC/USD |44192.4|44354.0|43938.6|44185.2|

‌If you are familiar with PostgreSQL syntax, you can see that the TimescaleDB method is very similar to a PostgreSQL materialized view. However, the mechanism under the hood is different to provide a better developer experience for time-series data by automatically storing pre-aggregated buckets over time, maintaining aggregations when raw data changes, and even returning real-time data.

Querying recent data

You might want to build visual dashboards to display time-series trends or even close to real-time data. For creating trend charts, you can use the previously mentioned downsampling method. But for real-time data, you probably want to see more granular and recent data, e.g., all data points from the past five minutes. Let’s see how you can make this simple request in InfluxDB and TimescaleDB.

InfluxDB‌

from(bucket: "crypto")
  |> range(start: -5m)
  |> filter(fn: (r) => r.symbol == "BTC/USD")
  |> sort(columns: ["_time", ], desc:true)
  |> limit(n:5)
  |> keep(columns: ["_time", "_value"])

In Flux, you can specify a time range that is relative to now with start: -5m, which will return all data for the “BTC/USD” symbol from the past five minutes.

TimescaleDB

SELECT
  time,
  price
FROM crypto_ticks
WHERE
  "time" > NOW() - INTERVAL '5 minutes' AND
  symbol = 'BTC/USD'
ORDER BY time DESC
LIMIT 5;

time           	|price  |
-------------------+-------+
2022-09-12 15:24:07|22346.7|
2022-09-12 15:24:03|22346.3|
2022-09-12 15:23:50|22346.7|
2022-09-12 15:23:45|22355.9|
2022-09-12 15:23:40|22358.1|

In the TimescaleDB example, you can see a familiar SQL example (if you already know SQL) with a symbol filter and a relative time filter in the WHERE clause using the NOW() PostgreSQL function.

Under the hood, the way this query gets executed is different from regular PostgreSQL, though: when you insert time-series data into the database, TimescaleDB auto-partitions your table based on the time column.

Then, when you make a query containing a time filter, like in this example, TimescaleDB can exclude whole chunks from scanning which makes querying recent data lightning fast, even if you have billions of rows stored in the database.

Long-range analytical queries

What if you are also interested in analyzing longer time frames, e.g. all data from the past year? Maybe you want to see what was the highest price of a certain stock or crypto symbol in the past year.

InfluxDB

from(bucket: "crypto")
  |> range(start: -1y)
  |> group(columns: ["code"])
  |> max()
  |> group()
  |> sort(columns: ["_value"], desc: true)

This example shows that Flux executes your query in the same order as you describe it.

TimescaleDB

SELECT
  symbol,
  MAX(price) AS max_price
FROM crypto_ticks
WHERE
  "time" >= NOW() - INTERVAL '1 year'
GROUP BY symbol
ORDER BY max_price DESC;

symbol   |max_price |
---------+----------+
BTC/USD  |   48210.1|
WBTC/USD |  48169.56|
ETH/USD  |   3579.38|
BNB/USD  | 	460.0|
SOL/USD  |	143.55|

Analytical queries like this, with a larger time window as the filter, are not typical time-series queries, but you might want to run these from time to time.

TimescaleDB provides two features that significantly speed up these queries: native compression, which saves space and converts your data into a columnar form, and continuous aggregates, which automatically maintain materialized aggregate data that can be retained separately from raw readings.

Together, these features can have a dramatic effect on the performance of your application.

JOINing time-series data with other business data

Sometimes we only talk about time-series data without mentioning all the other data that real-world projects have in their data infrastructure. But the reality is that time-series data is always connected to non-time-series (business) data.

If you plan to analyze your time-series data and business data together, the database you choose needs to be able to JOIN them and work with them quickly and simply. In the following examples, you can see how to JOIN two tables in InfluxDB and TimescaleDB.

InfluxDB


crypto_assets = from(bucket: "crypto-assets")
    |> range(start: -1mo)
    |> filter(fn: (r) => r._measurement == "assets" and r._field == "symbol")

crypto_ticks = from(bucket: "crypto-ticks")
    |> range(start: -1mo)
    |> filter(fn: (r) => r._measurement == "ticks" and r._field == "price")

join(
    tables: {assets:crypto_assets, ticks:crypto_ticks},
    on: [symbol, ],
)

The big difference between InfluxDB and TimescaleDB in this regard is that InfluxDB can only store timestamped data, while TimescaleDB can store timestamped and non-timestamped data right next to each other. Thus, in InfluxDB you can only join time-series data with other time-series data but not relational data.

TimescaleDB


SELECT  crypto_assets.name,  bucket,  close,  high,  low,  open
FROM one_day_candle
INNER JOIN crypto_assets ON crypto_assets.symbol = one_day_candle.symbol
WHERE
  bucket > NOW() - INTERVAL '1 month' AND
  one_day_candle.symbol = 'BTC/USD'
ORDER BY bucket;

name   	|bucket         	|close  |high   |low	|open   |
-----------+-------------------+-------+-------+-------+-------+
Bitcoin USD|2022-08-13 02:00:00|24460.6|24889.5|24312.3|24402.2|
Bitcoin USD|2022-08-14 02:00:00|24312.4|25034.2|24160.4|24455.2|
Bitcoin USD|2022-08-15 02:00:00|24092.8|25210.9|23798.7|24316.2|
Bitcoin USD|2022-08-16 02:00:00|23867.7|24247.5|23692.0|24103.2|
Bitcoin USD|2022-08-17 02:00:00|23340.1|24430.1|23184.4|23857.3|

In TimescaleDB, you can use PostgreSQL’s JOIN to connect any two tables in the same database, enabling you to store your non-time-series data next to your time-series data. Without this feature, you might have a harder time bringing your data together from multiple sources.

Fun fact: One of the reasons TimescaleDB was created was that the founders struggled to find a database that could do easy JOINs for time-series data.

Data retention and compression

While seeing how thousands of companies handle time-series data, we found that time-series data becomes less valuable over time. This means users often want to archive or even remove older data after a certain time to save on storage costs.

InfluxDB

In InfluxDB you can change the data retention settings on a per-bucket basis on the UI


In InfluxDB you can change the data retention settings on a per-bucket basis on the UI.

In older versions of InfluxDB, you could also add data retention policies this way:

CREATE RETENTION POLICY "one_year" ON "crypto_ticks" DURATION 1y REPLICATION 1 DEFAULT

TimescaleDB

--Compression:
ALTER TABLE example SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = symbol
);
SELECT add_compression_policy('crypto_ticks', INTERVAL '2 weeks');

--Data retention:
SELECT add_retention_policy('crypto_ticks', INTERVAL '1 year');

With TimescaleDB, you can set up both a compression policy (to save on storage needs but keep the data available for querying) and a data retention policy (which gets rid of the data after the defined time period). Without essential tooling around data compression and data retention in the database, you’d need to implement and maintain these automations manually.

Time-Series Database Resources

If you are interested in how other developers are using TimescaleDB for time series and analytics, check out our developer stories:

If you want to keep reading about TimescaleDB, these articles will help you learn more:

FAQs

What are the benefits of time-series databases?

Time series databases are designed to efficiently store and query large volumes of time-series data, making them ideal for high-frequency data like IoT sensor data, server metrics, or financial data. With their optimized storage structures and compression algorithms, TSDBs offer faster write and query performance for time-based queries compared to traditional databases. They also often provide flexible schemas, allowing for easy addition or alteration of data fields. Furthermore, TSDBs can handle high data ingest rates and are built to scale horizontally, ensuring they can manage increased data loads effectively. Their specialized querying capabilities also make it simpler and more efficient to perform complex time-based analytics.

How do time-series databases handle large volumes of data?

Time-series databases handle large volumes of data by using optimized storage structures, compression algorithms, and indexing strategies. These elements allow for efficient storage and quick retrieval of time-stamped data. Additionally, many time-series databases are designed to scale horizontally, meaning they can distribute data across multiple servers to manage increased data loads effectively.

What are the challenges in working with time-series data?

Working with time-series data presents several challenges. These include handling the high volume and velocity of data, as time-series data is often generated continuously and in large quantities. Storing and querying this data efficiently can be difficult. Additionally, managing the real-time nature of the data, ensuring data quality, dealing with seasonality and time-dependent trends, and developing accurate forecasting models can also be challenging.

How do time-series databases work with real-time data?

Time-series databases are designed to handle real-time data effectively. They can ingest high volumes of data rapidly and continuously, often providing near real-time insights. Their optimized storage and indexing strategies allow for fast data writes and quick retrieval of recent data. Some also offer stream processing or real-time analytics capabilities, enabling users to analyze and respond to trends in the data as it arrives.

Get Started With Timescale

If you're running your PostgreSQL database in your own hardware, you can simply add the TimescaleDB extension. If you prefer to try Timescale in AWS, create a free account on our platform. It only takes a couple seconds, no credit card required.

📝 Are you migrating off InfluxDB? We can help. Check out Outflux, our migration tool, and reach out to us for expert migration advice. Our Support team has helped many others migrate from InfluxDB to Timescale successfully!

Ingest and query in milliseconds, even at terabyte scale.
This post was written by
16 min read
General
Contributors

Related posts