How clevabit Builds a Data Pipeline to Turn Agricultural IoT Data Into Insights for Farmers Everywhere
This is an installment of our “Community Member Spotlight” series, where we invite our customers to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.
In this edition, Christoph Engelbert, clevabit co-founder, joins us to share the work they’re doing to power sustainable and transparent farming and how they’re giving farmers, mills, and consumers the metrics they need to make decisions.
clevabit GmbH was founded in April 2019, and, with our platform, we offer otherwise unseen insights into animal farming, barn environmental levels, and how all those metrics influence animal health and farm operations.
Given the broad range of metrics we collect and correlate, clevabit’s customers range from farmers and hatcheries to feed mills – and, eventually, we’ll surface intelligence to allow customers to make informed decisions when they buy their meat or eggs.
For each group, the problems are vastly different:
- Farmers and hatcheries mainly focus on ways to reduce costs, without sacrificing quality or animal health (e.g., optimizing feeding, reducing medication needs, or speeding up the fattening process).
- Feed mills try to make their delivery routes as efficient as possible.
- Customers want healthy food, and in the best case, to see the “process chain” all the way from the farm to their home.
About the Team
While we only launched the company a little over a year ago, we’ve had four main people working on the clevabit hardware and software for almost 3 years. Each one brought an important bit of skill to get the basics running:
- Hubertus Wasmer, the initiator and visionary, brought years of experience with supply chain management.
- Dennis Borgmann and Frederik Grote, who’ve run their own hardware development company for more than ten years, brought knowledge of hardware engineering.
- And finally me, Christoph Engelbert. I’ve had years of hands-on experience in distributed software engineering and Developer Relations, and I’ve written most of the firmware and backend systems.
Now, the team’s roughly 15-20 people - with a bit on and off, just like in any good relationship ;-) - and most of our growth is in software engineering, mobile app development, and data science. I can build a massively scalable system, but data analytics is definitely not my strength.
About the Project
We store many different kinds of metrics in our platform, including air pollution, ammonia levels, silo fill levels, and daily water consumption. In addition, we capture less structured data, like the time, quantity, and type of medications specific animals receive in a certain barn.
In general, the more information we have around a specific farm, barn, and animal, the better we can run correlations - like predicting how changes in air values affect medication needs, or how to adjust water and food consumption to optimize fattening by just one day.
I previously worked for Hazelcast (a system for large distributed data calculation and caching), and, while I hadn’t dealt with time-series data specifically, I knew our IoT data required a time-series database. We’re handling somewhat large amounts of data, with each device generating around 8-10k data points per day. All of them vary in types and style, but they’re all time-based metrics.
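To make that concrete, here's a minimal sketch of what such a setup can look like in TimescaleDB. The column names match the query shown later in this post; the exact types and constraints are my assumptions, not clevabit's actual schema:

```sql
-- Hypothetical, simplified version of a metrics table.
-- Column names follow the query below; types are assumptions.
CREATE TABLE metrics (
    created       TIMESTAMPTZ NOT NULL,
    device_id     UUID        NOT NULL,
    value_type_id UUID        NOT NULL,
    data          JSONB       NOT NULL
);

-- Turn the plain table into a TimescaleDB hypertable,
-- automatically partitioned (chunked) by the time column.
SELECT create_hypertable('metrics', 'created');
```

From there, inserts and queries work like on any regular Postgres table, while TimescaleDB handles the time-based chunking behind the scenes.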
For example, broiler chicken fattening is handled in rounds of about 20-30 days each, making it interesting to compare the current run with previous ones: did a recent change in food lead to more efficiency (meaning faster / better growth)? In addition, benchmarking your run against anonymized customers’ runs is valuable for general comparison.
Database-wise, I’m a major fan of PostgreSQL – but, for a time-series database, I didn’t really have anything in mind when I started looking into options. We looked at InfluxDB as a possible choice; however, the promise of TimescaleDB, and the fact that it’s built on top of Postgres, sold it to me.
Our reliance on a time-series database like TimescaleDB is almost 100%. Queries, especially aggregations, need to be super fast when we’re showing multi-day graphs or run comparisons between different runs.
A very common request from our customers is to create a “barn-card” – a collection of daily aggregate values for metrics like temperature, ammonia levels, relative food or water usage. This “barn-card” holds all the essential elements for a specific fattening round (i.e., proves your animals were healthy while fattening, including how many animals died, per day).
We aggregate a set of different value types per device, or for all devices of a customer or farm. Our common queries aren’t super fancy nor complex, but the number of data points - and specifically the grouping afterward - took quite a bit of time before we started using TimescaleDB.
Here’s an example of one of our queries (using `time_bucket`). The query returns the temperature data for a specific customer, in 1-hour intervals, aggregated by device:

```sql
SELECT time_bucket('1h', m.created) AS time,
       d.serial_number,
       avg(m.value) AS value
FROM (
    SELECT dp.device_id,
           dp.data ->> 'board_serial' AS serial_number
    FROM device_provisioning dp,
         parent_hierarchy(dp.device_id, 'device') p
    WHERE p.parent = '14f8db2d-b594-4b07-8351-9ac987c19081'
      AND p.parent_type = 'customer'
) AS d,
LATERAL (
    SELECT m.created,
           m.device_id,
           metric_extract_float(m.data) AS value
    FROM metrics m
    WHERE m.device_id = d.device_id
      AND m.created >= now() - interval '30d'
      AND m.value_type_id = '7fec9d76-8e4b-48e7-9c85-cf5cf865c4eb'
    ORDER BY 2
) AS m
GROUP BY 1, 2
ORDER BY 1, 2
```
I don’t really have measurements on speed before and after we started using TimescaleDB – except for one time when I forgot to re-enable our hypertable after restoring. Our queries were suddenly in the range of tens of seconds, which was surprising. After investigating, I figured out the issue, re-enabled TimescaleDB, and query times fell back to sub-second.
Not a scientific experiment, but hey, I know TimescaleDB helps :-)
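For reference, TimescaleDB ships helper functions for exactly this backup/restore situation; a sketch of the flow looks roughly like this (the dump details are placeholders):

```sql
-- Before restoring a dump into a database with the
-- timescaledb extension installed:
SELECT timescaledb_pre_restore();

-- ...restore the dump here, e.g. via pg_restore or psql...

-- Afterwards, re-enable the extension's normal operation
-- so hypertables behave (and perform) as expected again:
SELECT timescaledb_post_restore();
```

Skipping the post-restore step leaves the extension in restoring mode, which matches the slow-query symptom described above.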
Current Deployment and Future Plans
Our backend system is mostly written in Go or Kotlin. It’s more of an accident than intention, but it’s been a good trade-off between time to value and functionality. At the moment, we’re running TimescaleDB on Azure’s hosted PostgreSQL service – but we’ve started to look into running our own database, specifically the multi-node solution TimescaleDB is working on. All for the data!
I’m a strong believer in “use the right tool for the right job.”
We deploy services into a Kubernetes cluster (Azure Kubernetes Service), communicate via HTTP calls or through RabbitMQ, and our metrics pipeline is built on top of Azure Event Hubs’ Kafka facade.
We send alerts when we see issues in food and water supply, or when certain environmental measures rise to dangerous levels (e.g., high ammonia or carbon dioxide levels). We execute these calculations on our real-time pipeline, but for the baselines, I’m looking into TimescaleDB’s continuous aggregates right now. I haven’t tried native compression yet, since Azure only offers the open-source version, but I expect even better results in terms of query speed.
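As a sketch of what such a baseline could look like as a continuous aggregate (the view name, bucket size, and aggregated metric are my assumptions, and the exact syntax varies between TimescaleDB versions – older releases use `CREATE VIEW ... WITH (timescaledb.continuous)` instead of a materialized view):

```sql
-- Hypothetical daily baseline, maintained automatically
-- by TimescaleDB as new metrics arrive.
CREATE MATERIALIZED VIEW metrics_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1d', created) AS day,
       device_id,
       value_type_id,
       avg(metric_extract_float(data)) AS avg_value
FROM metrics
GROUP BY 1, 2, 3;
```

Queries against the view then read precomputed daily aggregates instead of re-grouping millions of raw data points on every request.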
I hope that in the near future I can also recommend using a TimescaleDB cluster, but right now I’m still playing with it myself. Early stage :-)
Getting Started Advice and Resources
I’d certainly recommend the TimescaleDB documentation. Get familiar with the architecture and the concepts on how data is stored and queried.
Furthermore, I recommend the PostgreSQL documentation and its query syntax reference. I have to admit, I’ve never had an issue with the TimescaleDB-specific part of a query - either the query was already messed up (a `Hash Join`, or similar), or it was blazing fast. Understanding the query optimizer and the `EXPLAIN` output (especially with `VERBOSE` and `ANALYZE` activated) is a must!
Lastly, try to make your queries readable. A good way to do that is to extract common parts into functions; just make sure you use SQL as the language and mark the function as immutable.
- If you do this, the Query Optimizer actually understands the query inside the function, inlines it, and optimizes the overall query altogether.
- The parent_hierarchy function in the query example I included above is one of these inlined functions.
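As an illustration, here's roughly what such an inlinable helper can look like. This is a hypothetical implementation of the metric_extract_float function used in the query above – the JSON key is an assumption, not clevabit's actual schema:

```sql
-- Hypothetical implementation of the helper used in the query above.
-- LANGUAGE sql + IMMUTABLE allows the planner to inline the body
-- into the calling query and optimize it as a whole.
CREATE OR REPLACE FUNCTION metric_extract_float(data JSONB)
RETURNS DOUBLE PRECISION
LANGUAGE sql IMMUTABLE AS $$
    SELECT (data ->> 'value')::DOUBLE PRECISION
$$;
```

A PL/pgSQL function, by contrast, is opaque to the planner; sticking to plain SQL functions is what makes this readability trick free in terms of performance.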
One of my recommendations is to prototype – and prototype hard.
For clevabit, we did two prototypes, each with a different approach to storing data. We experimented with not only how to store data in general (since we don’t know what values will come in the future), but also how to make a generic metrics store that can be aggregated inside the database.
And overall, what’s worked best over the last couple of years is event-based, asynchronous service communication. Kafka (or Azure Event Hubs in our case) is a great solution, since you can upgrade or restart services without any issue. If you need to scale out a specific calculation, no problem: you just add more calculators to the consumer group.
Editor’s Note: If you’re interested in (or already!) using Kafka, check out our Create a Data Pipeline with TimescaleDB, Kafka, and Confluent blog post to learn how to ingest your data into TimescaleDB (includes step-by-step instructions for connecting to Kafka, mapping data, and more).
Building a scalable system isn’t an easy task (it’s not something you just do). In general, we’ve built a heavily microservices-based system, so I can scale basically every part of it independently.
If you’re considering a similar approach, remember that this comes at a cost, and microservices are not a silver bullet. They’re great for scalability, but every transaction between services consists of a network transmission.
We’d like to thank Christoph and the clevabit team for sharing their story with the community and for their commitment to making farming as transparent as possible. Their work to surface farm-specific metrics to individuals and farmers is yet another example of the amazing power data has to help us understand the world around us.
We’re always keen to share community projects, either via our blog or DataPub, our monthly virtual meetup for open data enthusiasts (everyone’s welcome, so we hope to see you at a future session!).
Do you have a story or project you’d like to share? Reach out on Slack (@lacey butler), and we’ll go from there :).