How to Choose the Best Time-Series Database for Your Project

How to Choose the Best Time-Series Database for Your Project

The time-series database market is extensive. As time-series databases answer a challenge that more companies increasingly recognize—measuring how systems change—this shouldn’t be surprising.

According to db-engines, time series has been the fastest-growing database category in the past two years in terms of popularity. Such growth has introduced a new problem for developers looking for the best time-series database for their use case: how to choose the best among so many options?

Some time-series databases are brand new, created from scratch, while others rely on decades-old battle-proven technologies (e.g., TimescaleDB is built on PostgreSQL and extends its capabilities). There are time-series databases with custom query languages, databases that support SQL-like queries, and yet others that support full SQL. All of these databases try to provide solutions to ingest, store, and query time-series data efficiently.

While reading real-world stories from developers about why they chose a specific database is helpful, this article gives you actionable guidance about features and expectations when considering any time-series database technology. To choose a suitable database for your project, you should better understand the value provided by time-series databases.

Let’s discuss why you should consider a specialized database for your time-series workloads instead of a general one.

Why Do You Need a Specialized Database for Time-Series Data?

Time-series data is a unique data category because the focus is on time. The time field is a primary data field in time-series databases. Each row includes a time entry, and analyzing data along this dimension enables us to identify how systems change… over time.

Furthermore, the database usually records new data as a new entry. Typically, if an existing row changes, you insert a new one and don’t update historical data.

Time-series data is also relentless. To effectively analyze time-series data, you must collect data at speed and scale. It can pile up quickly if you ingest hundreds, thousands, or even more data points per second.

Ultimately, you need a database that can handle the scale, has excellent usability and is reliable. A general database is unlikely to meet all three criteria for actual time-series workloads. You need a purpose-built time-series database.

Time-Series Data Does Not Live In a Silo

Time-series data is always connected to other types of business data. For example, a finance data application might be monitoring stock prices (time-series data), but it might also record general information about the businesses – for example, contact info and industry – which those stocks represent (business data). When researching databases, you should consider how easy it will be to work with time-series and business data together in the same context (and how the query language will support doing that).

Time-series data is increasingly more prevalent not just in finance, manufacturing, and Web3 but also in other industries as people realize there is untapped value in recording everything that matters as time-series data. In fact, many people argue that most data has some sort of temporal aspect which can be stored. Check out real-world examples in our Developer Q&A series!

What Makes a Time-Series Database the Best?

Saying that something is the best is subjective. This is especially true for time-series databases. Best in performance? Developer productivity? Usability? There are many ways a database can be the best, depending on your preferred optimization points. In this section, we outline the most critical aspects of a time-series database: scalability, maintainability, reliability, and query language.

Scalability

Time-series data is relentless and comes at you in huge quantities. The database you choose needs to have proven scalability and be able to store a lot of time-series data. This could mean gigabytes, terabytes, or even petabytes of data—depending on the scale of your project. You must ensure that the database can scale vertically (adding more resources to a database node) and horizontally (adding more database nodes to your system) while remaining performant and reliable.

Maintainability

When choosing a database, you should consider the time and effort it will take to maintain it long-term—backups, replicas, data retention, archiving, automation, and so on. Think about the maintenance jobs you want to do and see if the database has the tools and features to help you. Writing scripts, cron jobs, and other custom maintenance solutions can be a burden in the long run.

For this reason, many developers opt for cloud-hosted versions of their time-series database, letting them focus on their data.

Reliability

The time-series database market is relatively new. You might notice that many companies in this space are developing brand-new technologies from scratch. This might not be a problem if you are willing to try brand new software and risk long-term reliability, but if you’re looking for a tried and tested technology, you might want to look at databases and storage engines that have been in production use for more than just a few years. As time-series data quickly becomes the basis of business decisions you need to be sure that it will be available when you need it.

Query language

Would you prefer SQL, or is SQL-like good enough? Do you prefer one specific SQL syntax (e.g., PostgreSQL) over others (e.g., MySQL)? There are also quite a few time-series databases on the market with their own custom languages (e.g., Flux by InfluxDB). This can be a turn-off for many developers because they’d need to learn a new language just to use the database. And even though some of them try to look like SQL, they are not real SQL.

On the other hand, a database with full-SQL support makes it much easier to get started because all the tools that integrate with SQL will work with your new database. Also, you and your team won’t need to learn a new language to query your time-series data.

Time-Series Database Features

While the aspects mentioned above can give you an idea of the best time-series database for your application, let’s cover specific database features that you might require to work efficiently with time-series data.

A general database doesn’t usually provide you with time-series-specific features, often not even having the tools to manually get the same outcomes. A general database is expected to handle a lot of different workloads well; its main goal is to be usable by a very wide range of use cases. Time-series databases may provide general features as well as time-series features (e.g., TimescaleDB), or focus on providing time-series features at the cost of supporting more general workloads (e.g., InfluxDB). As mentioned above, time-series data doesn’t live in a silo, so whatever the approach, you will need a way to relate general and time-series data.

In the following section, you will see examples of time-series features you can expect from a time-series database. To show you both ends of the spectrum, you’ll see what the features look like in two totally different time-series databases: InfluxDB (with its own “Flux” query language) and TimescaleDB (which extends PostgreSQL and offers full-SQL support).‌‌

Time-based aggregation

Time-based aggregation is a must-have feature for time-series databases. Without it, you won’t be able to create hourly, daily, weekly, and monthly time aggregations (and many permutations in between) that are key for analyzing time-series data. Let’s see what time-based aggregate queries look like in the case of InfluxDB and TimescaleDB:

InfluxDB

from(bucket: "crypto")
  |> range(start: 0)
  |> aggregateWindow(every: 1d, fn: count, createEmpty: false)
  |> sort(columns: ["_time", ], desc:true)
  |> limit(n:10)

Using InfluxDB’s query language, Flux, you must define a time range for the query. You can use 0 as the start parameter to work around this - in case you don’t want to define a specific time range. Then you can use the aggregateWindow() function to create arbitrary time “buckets”.

TimescaleDB

SELECT
  time_bucket('1 day', time) AS bucket,
  count(*) row_count
FROM crypto_ticks
GROUP BY bucket
ORDER BY bucket DESC
LIMIT 10;

bucket         	|row_count|
-------------------+---------+
2022-09-09 02:00:00|	52277|
2022-09-08 02:00:00|   128657|
2022-09-07 02:00:00|   134849|
2022-09-06 02:00:00|   132837|
2022-09-05 02:00:00|   126254|

In TimescaleDB you can use the time_bucket() function to create arbitrary time buckets. Besides, all the other available PostgreSQL functions like count(*) works the same way as in regular PostgreSQL.

When working with time-series data, you will have to create a lot of time-based aggregations—so make sure that the database you choose provides a simple and intuitive interface to create time buckets.

Automatic downsampling

Time-series data is often ingested at a very high resolution (e.g., thousands of data points per second). To make it easier to analyze time-series, users often downsample their data (e.g., they convert thousands of data points per second to only one). This technique doesn’t only save storage costs, because you need to store lower resolution data, but it also makes it easier to create visualizations and recognize trends in the data.

Downsampling is often done repeatedly and continuously, which means that if, for example, you insert multiple new rows every second, the database rolls up the incoming data into larger buckets automatically. Instead of aggregating the raw data yourself, the database takes care of it automatically and in real-time. Let’s see how InfluxDB and TimescaleDB handle downsampling with an OHLC example.

InfluxDB

close=from(bucket: "crypto")
  |> range(start: -30d)
  |> group(columns:["symbol"])
  |> filter(fn: (r) => r["_measurement"] == "ohlc")
  |> window(every: 1h)
  |> reduce(fn: (r, accumulator) => ({
  
      indexLow:
        if (r._field=="low") then 
          accumulator.indexLow+1 
        else
        accumulator.indexLow,
      indexOpen: 
      if (r._field=="open") then 
       accumulator.indexOpen+1 
       else 
       accumulator.indexOpen,
        open: 
      if (r._field=="open") then 
        if (accumulator.indexOpen==0) then 
          r._value 
        else 
          accumulator.open
      else
        accumulator.open  
    ,
    
    
      high:
       if (r._field=="high") then  
          if(r._value>accumulator.high ) then
            r._value
          else
            accumulator.high 
      else 
        accumulator.high
   ,
    low: 
      if (r._field=="low") then

          if(r._value<accumulator.low or accumulator.indexLow==0.0) then
            r._value
          else
           accumulator.low 
      else 
        accumulator.low,

             close: 
       if (r._field=="close") then 
          r._value 
      else 
        accumulator.close,
             volume: 
        if (r._field=="volume") then
          r._value+accumulator.volume 
          else
           accumulator.volume
        
  
    }),
    identity: {indexLow:0,indexOpen:0,open: 0.0,high: 0.0,low: 0.0,close: 0.0,volume: 0.0})
    |> drop(columns: ["indexOpen","indexLow"])
  |> group(columns:["pair"])
    |> yield(name: "candle")

InfluxDB’s Flux provides a convenient way to write simple queries, but if you want to create somewhat more complex queries, like creating OHLC aggregates from raw financial tick data, the final query can become quite long as you can see.

TimescaleDB

CREATE MATERIALIZED VIEW hourly_buckets
WITH (timescaledb.continuous)
AS
SELECT
  time_bucket('1 hour', time) AS bucket,
  symbol,
  first(price, time) AS open,
  max(price) AS high,
  min(price) AS low,
  last(price, time) AS close
FROM crypto_ticks
GROUP BY bucket, symbol;
SELECT * FROM hourly_buckets;
bucket         	|symbol  |open   |high   |low	|close  |
-------------------+--------+-------+-------+-------+-------+
2022-02-08 22:00:00|ADA/USD |  1.166|   1.17|  1.157|  1.168|
2022-02-08 22:00:00|ATOM/USD|  30.44|  30.63|   30.3|  30.51|
2022-02-08 22:00:00|AVAX/USD|  87.85|   88.0|  86.72|  87.06|
2022-02-08 22:00:00|BNB/USD |  413.5|  416.5|  410.3|  410.3|
2022-02-08 22:00:00|BTC/USD |44192.4|44354.0|43938.6|44185.2|

‌If you are familiar with PostgreSQL syntax, you can see that the TimescaleDB method is very similar to a PostgreSQL materialized view. However, the mechanism under the hood is different to provide a better developer experience for time-series data by automatically storing pre-aggregated buckets over time, maintaining aggregations when raw data changes, and even returning real-time data.

Querying recent data

You might want to build visual dashboards to display time-series trends or even close to real-time data. For creating trend charts, you can use the previously mentioned downsampling method. But for real-time data you probably want to see more granular and recent data, e.g. all data points from the past five minutes. Let’s see how you can make this simple request in InfluxDB and TimescaleDB.

InfluxDB

from(bucket: "crypto")
  |> range(start: -5m)
  |> filter(fn: (r) => r.symbol == "BTC/USD")
  |> sort(columns: ["_time", ], desc:true)
  |> limit(n:5)
  |> keep(columns: ["_time", "_value"])

In Flux, you can specify a time range that is relative to now with start: -5m which will return all data for the “BTC/USD” symbol from the past five minutes.

TimescaleDB

SELECT
  time,
  price
FROM crypto_ticks
WHERE
  "time" > NOW() - INTERVAL '5 minutes' AND
  symbol = 'BTC/USD'
ORDER BY time DESC
LIMIT 5;

time           	|price  |
-------------------+-------+
2022-09-12 15:24:07|22346.7|
2022-09-12 15:24:03|22346.3|
2022-09-12 15:23:50|22346.7|
2022-09-12 15:23:45|22355.9|
2022-09-12 15:23:40|22358.1|

In the TimescaleDB example, you can see a familiar SQL example (if you already know SQL) with a symbol filter and a relative time filter in the WHERE clause using the NOW() PostgreSQL function. Under the hood, the way this query gets executed is different from regular PostgreSQL though: when you insert time-series data into the database, TimescaleDB auto-partitions your table based on the time column. Then, when you make a query containing a time filter, like in this example, TimescaleDB can exclude whole chunks from scanning which makes querying recent data lightning fast, even if you have billions of rows stored in the database.

Long-range analytical queries

What if you are also interested in analyzing longer time frames, e.g. all data from the past year? Maybe you want to see what was the highest price of a certain stock or crypto symbol in the past year.

InfluxDB

from(bucket: "crypto")
  |> range(start: -1y)
  |> group(columns: ["code"])
  |> max()
  |> group()
  |> sort(columns: ["_value"], desc: true)

This example shows that Flux executes your query in the same order as you describe it.

TimescaleDB

SELECT
  symbol,
  MAX(price) AS max_price
FROM crypto_ticks
WHERE
  "time" >= NOW() - INTERVAL '1 year'
GROUP BY symbol
ORDER BY max_price DESC;
symbol   |max_price |
---------+----------+
BTC/USD  |   48210.1|
WBTC/USD |  48169.56|
ETH/USD  |   3579.38|
BNB/USD  | 	460.0|
SOL/USD  |	143.55|

Analytical queries like this, with a larger time window as the filter, are not typical time-series queries, but you might want to run these from time to time. TimescaleDB provides two features that significantly speed up these queries: native compression, which saves space and converts your data into a columnar form, and continuous aggregates, which automatically maintain materialized aggregate data that can be retained separately from raw readings. Together, these features can have a dramatic effect on the performance of your application.

JOINing time-series data with other business data

Sometimes we only talk about time-series data without mentioning all the other data that real-world projects have in their data infrastructure. But the reality is that time-series data is always connected to non-time-series (business) data. If you plan to analyze your time-series data and business data together, the database you choose needs to be able to JOIN them and work with them quickly and simply. In the following examples, you can see how to JOIN two tables in InfluxDB and TimescaleDB.

InfluxDB

crypto_assets = from(bucket: "crypto-assets")
    |> range(start: -1mo)
    |> filter(fn: (r) => r._measurement == "assets" and r._field == "symbol")

crypto_ticks = from(bucket: "crypto-ticks")
    |> range(start: -1mo)
    |> filter(fn: (r) => r._measurement == "ticks" and r._field == "price")

join(
    tables: {assets:crypto_assets, ticks:crypto_ticks},
    on: [symbol, ],
)

The big difference between InfluxDB and TimescaleDB in this regard is that InfluxDB can only store timestamped data, while TimescaleDB can store timestamped and non-timestamped data right next to each other. Thus, in InfluxDB you can only join time-series data with other time-series data but not relational data.

TimescaleDB

SELECT  crypto_assets.name,  bucket,  close,  high,  low,  open
FROM one_day_candle
INNER JOIN crypto_assets ON crypto_assets.symbol = one_day_candle.symbol
WHERE
  bucket > NOW() - INTERVAL '1 month' AND
  one_day_candle.symbol = 'BTC/USD'
ORDER BY bucket;

name   	|bucket         	|close  |high   |low	|open   |
-----------+-------------------+-------+-------+-------+-------+
Bitcoin USD|2022-08-13 02:00:00|24460.6|24889.5|24312.3|24402.2|
Bitcoin USD|2022-08-14 02:00:00|24312.4|25034.2|24160.4|24455.2|
Bitcoin USD|2022-08-15 02:00:00|24092.8|25210.9|23798.7|24316.2|
Bitcoin USD|2022-08-16 02:00:00|23867.7|24247.5|23692.0|24103.2|
Bitcoin USD|2022-08-17 02:00:00|23340.1|24430.1|23184.4|23857.3|

In TimescaleDB, you can use PostgreSQL’s JOIN to connect any two tables in the same database, enabling you to store your non-time-series data next to your time-series data. Without this feature, you might have a harder time bringing your data together from multiple sources.

Fun fact: One of the reasons TimescaleDB was created was that the founders struggled to find a database that could do easy JOINs for time-series data.

Data retention and compression

While seeing how thousands of companies handle time-series data, we found that time-series data becomes less valuable over time. This means users often want to archive or even remove older data after a certain time to save on storage costs.

InfluxDB

influxdb data retention
In InfluxDB you can change the data retention settings on a per bucket basis on the UI.

In older versions of InfluxDB you could also add data retention policies this way:

CREATE RETENTION POLICY "one_year" ON "crypto_ticks" DURATION 1y REPLICATION 1 DEFAULT

TimescaleDB

--Compression:
ALTER TABLE example SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = symbol
);
SELECT add_compression_policy('crypto_ticks', INTERVAL '2 weeks');

--Data retention:
SELECT add_retention_policy('crypto_ticks', INTERVAL '1 year');

With TimescaleDB, you can set up both a compression policy (to save on storage needs but keep the data available for querying) and data retention policy (which gets rid of the data after the defined time period). Without essential tooling around data compression and data retention in the database, you’d need to implement and maintain these automations manually.

Learn More!

The value that time-series data provides is unquestionable. It’s like watching the whole movie instead of just the last frame. You do need a proper database that can play that movie for you though. I hope this article gave you the tools you need to choose the time-series database that best fits your team and your project.

Check out our docs for hands-on time-series tutorials:


If you are interested in TimescaleDB,  you can try it for free (no credit card needed) or check out our GitHub, community Slack or Forum and join discussions with 9000+ TimescaleDB users.

The open-source relational database for time-series and analytics.
Try Timescale for free
This post was written by
12 min read
General
Contributors

Related posts

TimescaleDB - Timeseries database for PostgreSQL

Explore TimescaleDB

Learn more about how TimescaleDB works, compare versions, and get technical guidance and tutorials.

Go to docs Go to products