Build an application monitoring stack with TimescaleDB, Telegraf & Grafana
Match the flexibility and scale of your application with a stack that works for you.
The world of systems design has become a powerful yet complex place. With the advancement of microservice architectures, enterprise applications have become more fault tolerant, easier to scale, and capable of delivering a constantly improving experience to end users, thanks to development teams' ability to rapidly iterate and innovate.
However, this presents operations teams with some very complex challenges when it comes to monitoring the health and performance of these applications. For example:
- How do we implement a monitoring solution that matches the power and flexibility of the application we are deploying?
- How can we keep an eye on all application layers and collect all the data we need to monitor the key metrics that ensure end users are having an optimal experience?
The answer to those questions is simpler than you might think: we must look to an application monitoring stack that matches the flexibility and scale of the application we are monitoring.
This monitoring stack can use best-of-breed components to instrument the needed data collection, and to store the data in such a way that it can be accessed quickly and can be used in both real-time and historical contexts. The end result gives us the ability to present the data based on the needs of the teams maintaining the application.
In this blog, we’ll discuss such a stack built from the following components:
- TimescaleDB for data storage
- Telegraf for data collection
- Grafana for visualization and alerting
Storing the data in TimescaleDB
Now that you know what this application monitoring stack will look like, the first order of business is to make sure you have a TimescaleDB instance running (whether on premise or an instance running in Managed Service for TimescaleDB). TimescaleDB is the heart of the application monitoring stack, and is where the data from your application will land.
If you are brand new to Timescale, follow the instructions here to get TimescaleDB going locally or in the cloud.
The nature of the data we are collecting is unique; as you can see from the sample below, it is time-series data:
In the sample, we are capturing CPU metrics on a minute-by-minute basis. This calls for a database technology that is purpose-built to handle this type of data ingestion (large volume, high velocity) and that meets the requirements for querying this type of data. TimescaleDB is designed to manage exactly this kind of ingestion and these complex queries.
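As a sketch of what this looks like in TimescaleDB, the metrics can land in an ordinary PostgreSQL table that is then converted into a hypertable (the `cpu_metrics` table and its columns here are illustrative, not the exact schema Telegraf creates):

```sql
-- Illustrative schema for minute-by-minute CPU metrics
-- (table and column names are hypothetical)
CREATE TABLE cpu_metrics (
    time         TIMESTAMPTZ NOT NULL,
    host         TEXT        NOT NULL,
    usage_user   DOUBLE PRECISION,
    usage_system DOUBLE PRECISION
);

-- Convert it into a TimescaleDB hypertable, partitioned on the time column
SELECT create_hypertable('cpu_metrics', 'time');
```

From the application's point of view this is still a regular PostgreSQL table; TimescaleDB handles the time-based partitioning behind the scenes.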
In our application monitoring use case, we are going to use the data in two ways:
- We need to be able to query the data in such a way that we can build real-time dashboards, helping us understand what is happening “now”.
- We need to be able to store and query the data in a historical context; that is to say, we need to be able to understand past behavior and start to predict what will happen in the future (allowing us to prepare and/or budget for additional resources, and make the needed application-level adjustments).
Being able to serve these two use cases simultaneously is key since it represents the core of our application performance monitoring stack, and is at the core of the value provided by TimescaleDB.
Another large benefit of using a technology like TimescaleDB is that it sits on top of PostgreSQL, making it simple to get the data out of the database. Whether you're using a tool like Grafana to build a dashboard (which we will discuss next) or want to run some ad hoc queries to understand trends in storage usage, the fact that you will be using standard SQL reduces the learning curve and shortens the time to value of the entire stack.
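For instance, both use cases can be served with plain SQL using TimescaleDB's time_bucket function (again assuming the hypothetical `cpu_metrics` table):

```sql
-- Real-time view: per-minute averages over the last 5 minutes
SELECT time_bucket('1 minute', time) AS minute,
       host,
       avg(usage_user) AS avg_user_cpu
FROM cpu_metrics
WHERE time > now() - INTERVAL '5 minutes'
GROUP BY minute, host
ORDER BY minute DESC;

-- Historical view: daily averages over the last 30 days
SELECT time_bucket('1 day', time) AS day,
       host,
       avg(usage_user) AS avg_user_cpu
FROM cpu_metrics
WHERE time > now() - INTERVAL '30 days'
GROUP BY day, host
ORDER BY day;
```

The only difference between the "now" query and the trend query is the bucket size and the time range, which is exactly the flexibility we want from the storage layer.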
Instrumenting data collection with Telegraf
The next step is figuring out how we are going to collect data; in this case, we are going to use Telegraf and deploy it to all elements of our application.
Telegraf gives us the facility to collect the information we need to properly monitor the health and performance of our application. We can gather basics like CPU, memory, and network metrics (for example, in a Kubernetes environment Telegraf will operate at the pod level and report pod-based statistics, if you have chosen this type of deployment). We can also instrument data collection for metrics that help us evaluate things like database performance and application response times.
The Telegraf agent is lightweight and simple to install, and will solve the first part of our problem: data collection.
Using the link here, you will find step-by-step instructions for deploying Telegraf. (Note: this version of Telegraf includes output plugin support for writing back to PostgreSQL and TimescaleDB.)
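As a sketch, the relevant pieces of a telegraf.conf might look like the following. The connection string is a placeholder, and exact option names for the PostgreSQL output plugin can vary by Telegraf version, so treat this as illustrative rather than copy-paste ready:

```toml
# Collect basic system metrics
[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]

[[inputs.net]]

# Write the collected metrics to TimescaleDB via the PostgreSQL output plugin
[[outputs.postgresql]]
  connection = "host=localhost user=postgres password=secret dbname=metrics"
```

With this in place, the agent collects the input metrics on its configured interval and writes them straight into your TimescaleDB instance.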
Configuring visualization & alerting with Grafana
Finally, let's talk about visualization: how are we going to present the data we are collecting and use it to monitor what is happening across the application?
In this case, we are going to use Grafana to help us understand the data in real time, while giving us the ability to set and trigger alarms when something is out of specification. (To set up Grafana visualization with TimescaleDB, follow these instructions.)
As an example, we may choose to collect CPU data from the nodes in our application cluster, and we will want to make sure we monitor this in real time and set alarm thresholds to notify us of a potential issue:
We are capturing the real time data from the machine, monitoring both System and User CPU usage, and setting an alarm threshold when we reach 80% utilization. This is an example of data being evaluated and shown to the user in real-time. This particular dashboard is updated every 5 seconds.
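Behind a panel like this typically sits a plain SQL query that uses Grafana's time macros to respect the dashboard's selected time range (the `cpu_metrics` table is hypothetical):

```sql
-- Grafana panel query: System vs. User CPU, bucketed to match the refresh rate
SELECT time_bucket('5 seconds', time) AS "time",
       avg(usage_system) AS "System",
       avg(usage_user)   AS "User"
FROM cpu_metrics
WHERE $__timeFilter(time)
GROUP BY 1
ORDER BY 1;
```

The `$__timeFilter(time)` macro is expanded by Grafana's PostgreSQL data source into a WHERE clause matching the dashboard's time picker, so the same query drives both the live view and any zoomed-out range.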
In contrast, we also need a higher-level view of the same world. In the case below, we are looking at the same set of CPU metrics, but across a broader period of time (12 hours):
Again, being able to look at the larger picture to spot trends is another use case we need to account for in this solution. The ability to pull back from the granular level and view historical data is key to managing our application and its performance.
In this case, we are looking for spikes within a particular window of our day (12 hours). However, it is worth noting that because TimescaleDB provides long-term data storage, we can pull back to daily, weekly, or even monthly views of this data, with an eye on proactively spotting trends in our application's performance.
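For these long-range views, one option worth knowing about is TimescaleDB's continuous aggregates, which precompute rollups so that monthly dashboards don't have to scan raw per-minute data. A minimal sketch, assuming the hypothetical `cpu_metrics` table and TimescaleDB 2.x syntax (the feature's syntax differs in earlier versions):

```sql
-- Precompute daily CPU rollups so long-range dashboards stay fast
CREATE MATERIALIZED VIEW cpu_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day,
       host,
       avg(usage_user) AS avg_user_cpu
FROM cpu_metrics
GROUP BY day, host;
```

A weekly or monthly Grafana panel can then query `cpu_daily` instead of the raw hypertable.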
Once this data is integrated into Grafana, we can define alert rules (e.g. “Average CPU usage greater than 80 percent for 5 minutes”). Once an alert is triggered, Grafana can dispatch a notification. (Instructions for setting up alerting in Grafana can be found here.)
When it comes to implementing an application performance monitoring stack, flexibility, along with best-of-breed components that are purpose-built to carry out their part of the job, is the key. Being able to understand the real-time “situation” AND quickly and easily access historical trends is required.
Our suggested stack gives you the flexibility to collect the data you are interested in, store it in a database that serves the two key use cases (real-time and historical), choose how you want to display the data, and ensure the data is collected and queried in a performant manner. All of this makes certain that you have access to the data you need to effectively manage your application.
Have questions about setting up this application monitoring stack? Reach out to us on our community Slack channel.