# Storing IoT data: 8 reasons why you should use PostgreSQL

IoT applications generate torrents of data. PostgreSQL is powerful but scales poorly. With TimescaleDB you can scale Postgres by 20x and harness its power for IoT.

It’s no secret that the so-called “Internet of Things” isn’t about the Things, but the data. And it is a lot of data. Every day, more of the physical world — manufacturing operations, food production systems, trains we commute on — is connected to the Internet and automated, creating more and more streams of sensor data.

Multiply the millions of things by the amount of data per device, and you get an exponentially growing torrent of information being used to make better business decisions, provide better end user experiences, and produce more while wasting less.

Most engineering teams (including ours, early on in our company history) working on these initiatives end up storing all of this data in multiple databases: metadata in a relational database, time-series data in a NoSQL store. Yet each of these databases operate differently; running a polyglot database architecture adds operational and application complexity that’s unnecessary.

You don’t need to do this.

In this post, we’ll show you how to keep all of your IoT relational and time-series data together in PostgreSQL (yes, even at scale), and how that leads to simpler operations, more useful contextualized data, and greater ease of use. We’ll also highlight other awesome PostgreSQL features relevant to IoT, like query power, flexible data types, geospatial support, and a rich ecosystem.

By the end, if you’re convinced that PostgreSQL + TimescaleDB is the ideal database for your IoT project, you can find out more about TimescaleDB in our documentation page.

## Scaling capabilities (with TimescaleDB)

“Will PostgreSQL scale to support my IoT data rates?”

Scale is the number one concern that developers have about PostgreSQL. And that’s completely valid. IoT backends need to support high data ingest rates, while writes on PostgreSQL slow to a crawl as your dataset grows:

But it turns out that for time-series data, if your database is architected the right way, you can scale PostgreSQL to hundreds of thousands of inserts per second, at billions of rows, even on a single node with a modest amount of RAM.

This insight is what led us to develop TimescaleDB, a new open-source time-series database launched earlier this year. TimescaleDB achieves significant performance improvements by automatically partitioning data across time and space behind the scenes, while presenting the illusion of a single, continuous table (called a “hypertable”) across all partitions to the user. Most importantly, TimescaleDB does this while appearing like a normal PostgreSQL instance (in fact, packaged as a PostgreSQL extension, not a fork) to the developer and the outside world.

We benchmarked TimescaleDB against PostgreSQL by loading 1 billion rows, with 10 metrics per row, into a single instance of each.

Here are the results:

In short, TimescaleDB is more than 20x faster than vanilla PostgreSQL when inserting data at scale. The average insert rate for TimescaleDB in this benchmark is 111k rows (1.11M metrics) / second, within a relatively narrow band (as displayed above), while PostgreSQL insert performance drops to 5k rows (50k metrics) / second. But most importantly, TimescaleDB enables you to scale to 1 billion rows with no real impact to insert performance.

But what about data retention? Because time-series data volumes grow quickly, at some point you may want to delete old time-series data that you no longer need. And PostgreSQL does not handle massive deletions well. This is another problem that our time/space partitioning solves: when you need to eliminate old data, instead of deleting rows (and undergoing expensive vacuuming operations), TimescaleDB just drops the partition. And of course, this is all tucked away behind a data retention API.

## A simplified stack: one IoT database vs two

We just saw how TimescaleDB scales PostgreSQL to large workloads, like that within IoT. Now let’s see what it lets us do.

Being able to store time-series data alongside relational data is incredibly powerful. For one, it leads to a simpler stack. Instead of two databases (NoSQL for sensor data, relational for sensor metadata), with all kinds of glue code in between, not to mention the operational headaches of having two databases… you only need one database:

For example, let’s look at backups. Running two different databases means two different backup processes (and longer downtimes). Also, if the systems are backed up independently, then you run the risk of data integrity issues. Imagine this scenario: your time-series data has a backup from the last hour, but your relational data was backed up yesterday; so you now have time-series data referring to new devices whose metadata is lost.

Prioritizing simplicity leads to other advantages: A simpler stack is even useful at the edge, where running one database is hard enough (due to resource constraints), let alone two. Only needing a single database makes IoT prototyping easier. It also helps reduce the complexity of large sprawling IoT projects. As we recently heard from the CTO of a Unicorn IoT startup: “Aggressively consolidating our tech stack lets us sharpen our engineering team’s focus.”

## More context for your IoT data

A single database for time-series data and metadata does more than just simplify your stack: it also lets you add context to your sensor data. Because what is point of collecting data if you don’t know what you’re measuring?

Here’s an example: One company we are working with is developing an IoT application to monitor manufacturing processes. This application involves close examination of assembly line data to catch defects and ensure a high quality product. Most of the assembly line data is time-series in nature: {machine_id, timestamp, measurement}. Yet that data is meaningless by itself without all the other operational metadata: the machine settings at that time, the line information, the shift information, etc. They have experimented with denormalizing their data, but that added an unreasonable amount of data bloat on each measurement.

With TimescaleDB and PostgreSQL, they can keep their metadata normalized, and add the necessary context to their time-series data at query time (via a SQL JOIN). For example, via a query like this:

-- Show me the average temperature by machine, for machines of a
-- given type on a particular line, by 5 minute intervals
SELECT time_bucket('5 minutes', measurements.time) as five_min,
machine_id, avg(temperature)
FROM measurements
JOIN machines ON measurements.machine_id = machines.id
JOIN lines ON machines.line_id = lines.id
WHERE machines.type = 'extruder' AND lines.name = 'production'
AND measurements.time > now() - interval '36 hours'
GROUP BY five_min, machine_id
ORDER BY five_min, machine_id;

[Show the average temperature by machine, for machines of a given type on a particular line, by 5 minute intervals.]

## Power and ease of use

Let’s take a step back and keep in mind the value of an IoT project: to collect, analyze, and act on data to improve utility/efficiencies, reduce downtime/waste, and provide better products and services. In other words, you need more than a data store: you also need an easy way to wring actionable insights out of your data.

This is where SQL comes in. While it’s been quite fashionable in the past several years to denounce SQL and praise NoSQL, the truth is that SQL is quite powerful and is starting to make a comeback (which is one reason why “NoSQL” is now getting “backronymed” to “Not only SQL”).

SQL includes quite a few useful features: e.g., multiple complex WHERE predicates (backed by secondary indexes); multiple aggregations and orderings; window functions, libraries of mathematical and statistical functions; and more.

-- Plot the change in temperature for ceiling sensors
-- on linear and logarithmic scales, by 10 second intervals
SELECT ten_second,
temperature / lead(temperature) OVER data AS temperature_rise_linear,
log(temperature / lead(temperature) OVER data) AS temperature_rise_log
FROM (
SELECT time_bucket('10 seconds', time) as ten_second,
last(temperature, time) as temperature
FROM measurements JOIN sensors ON measurements.sensor_id = sensors.id
WHERE sensors.type = 'ceiling' AND measurements.time >= '2017-06-01'
GROUP BY 1
ORDER BY 1
) sub window data AS (ORDER BY ten_second asc);

[Plot the change in temperature for ceiling sensors on linear and logarithmic scales, by 10 second intervals.]

TimescaleDB augments SQL by adding new functions necessary for time-series analysis, e.g., as shown with time_bucket and last in the query above.

SQL has another advantage: people across your organization already know it. You won’t need to train your engineers to learn a new specialized query language (or hire new ones), and non-technical users won’t need to rely as heavily on engineering (and engineering release cycles) to answer questions of the data. In other words, by leveraging SQL you can democratize your time-series data, letting more people across your organization access it.

## Broad datatype support (including JSON)

JSON in a relational database? Eat your heart out, MongoDB.

When you start building your IoT product, you may not know what data you will care about, nor have a specific data schema in mind. Or later on, you may need very specific data structures (e.g., arrays).

Fortunately, PostgreSQL supports a broad spectrum of datatypes. It allows for semi-structured data (via JSON / JSONB support), but also a variety of other data types, including many numeric types, geometric types, arrays, range types, and date/time types.

-- Storing and querying JSON data in PostgreSQL
FROM measurements
ORDER BY time DESC, sensor_id LIMIT 50;

time     | sensor_id | type |             readings
---------------+-----------+------+----------------------------------
1499789565000 |       330 |    1 | {"occupancy": 0, "lights": 1}
1499789565000 |       440 |    2 | {"temperature": 74.0, "humidity": 0.81}
1499789565000 |       441 |    2 | {"temperature": 72.0, "humidity": 0.78}
1499789560000 |       330 |    1 | {"occupancy": 1, "lights": 1}
1499789560000 |       440 |    2 | {"temperature": 73.9, "humidity": 0.81}
1499789560000 |       441 |    2 | {"temperature": 72.1, "humidity": 0.79}
1499789555000 |       330 |    1 | {"occupancy": 1, "lights": 1}
1499789555000 |       440 |    2 | {"temperature": 73.9, "humidity": 0.80}
1499789555000 |       441 |    2 | {"temperature": 72.1, "humidity": 0.78}

[Some of our IoT customers store sensor data in JSONB, which allows indexing.]

## Geospatial support for your IoT data

Often there is a geospatial component in the data from connected devices — physical things exist in a particular space, after all. In particular, geospatial information is important when the things are moving. One IoT use case we often see is asset tracking: e.g., tracking vehicles for fleet management, optimizing routes, reducing spoilage, etc.

TimescaleDB is packaged as a PostgreSQL extension, which means you can run many other PostgreSQL extensions in conjunction with it. One powerful extension is PostGIS, which adds a wealth of geospatial support (including new data types, functions, etc.) to PostgreSQL. By combining PostGIS with TimescaleDB, you can combine your geospatial and time-series data, in effect creating a scalable spatiotemporal database.

## Plenty of integration opportunities

Time-series databases don’t operate in a vacuum. They need connectors, for example, to data buses like Kafka, stream processing engines like Spark, or BI tools like Tableau. Most time-series databases are relatively new, and there hasn’t been enough time for an ecosystem to develop around them. On the other hand, PostgreSQL has been around for over 20+ years, and the community has built an expansive ecosystem around it.

Fortunately, TimescaleDB looks just like PostgreSQL. In fact, customers often ask, “Does Timescale work with X?” And as a quick Google search typically shows, there usually already exists a PostgreSQL connector for X, which will work with TimescaleDB out of the box.

## Proven reliability

To be annoyingly obvious: the database you choose for IoT needs to be reliable (and not wake you up at 3AM). Unlike websites or mobile apps, IoT applications from the beginning are typically deployed in high-value scenarios. If you are using a database to monitor your factory line, then that database cannot go down.

This is where we rest on the shoulders of giants. In the past 20+ years, PostgreSQL has been battle tested in a variety of mission critical applications across a variety of industries. There is also another ecosystem of administrative tools that make reliability easier to achieve: streaming replication, hot standby, and more. TimescaleDB inherits this same level of reliability and ecosystem.

There is a lot of noise in the IoT world right now, and it can be hard to decide what database to use for your IoT project. But sometimes the best option is the boring option: the database that just works. That’s PostgreSQL, which now (thanks to TimescaleDB) finally scales to handle IoT workloads and optimizes SQL for time-series data.

If you’re building a new IoT project or currently wrestling with a complex IoT stack, choose PostgreSQL and download TimescaleDB.

If you need help with anything, feel free to join our community on Slack.

The open-source relational database for time-series and analytics.
This post was written by