How We Built Alert Rules, Runbooks, and Dashboards to Observe Our Observability Tool
In this era of cloud-native systems, the focus on observability is a given: observability tools ensure your systems perform correctly and deliver a satisfactory experience to your end-users. In the absence of good observability, your end-users will be the first to notify you about a problem—a tiny dart piercing through every respectable developer’s heart.
Needless to say, your users shouldn’t do your monitoring and alerting for you. You should aim to detect and fix problems before your users notice them, not after. When an issue arises, you need to find its root cause. And in a distributed architecture, the best way to do so is by interrogating your telemetry data with arbitrary questions and getting answers in real time. You need real observability.
Observing the Observer
We all know that observability tools are critical for any system that aims to deliver an always-available and reliable service. But what happens if your observability tool stops working? If your system then has an issue, you’d only notice it when a user of that system notifies you—meaning you’d be back to square one, running blind again.
As an observability practitioner, you’ll find many observability tools and tons of resources on configuring them to collect, store, query, visualize, and alert on telemetry data from the systems you monitor.
But who observes the observer?
We must treat observability tools as highly available and reliable systems. They, too, have to be monitored to ensure correct behavior. But it is surprisingly hard to find information on how to observe your own observability tool effectively. For example, we’ve looked for guidance online on how to monitor Prometheus itself and have not been able to find any.
This seems like a missing piece (more like a missing pillar) in our observability journey.
On the Promscale team, we decided to prioritize this. As a result, we’ve built a set of alerting rules, runbooks, and dashboards that help Promscale users track the performance of their own Promscale instances while providing guidance on how to fix common issues. In this blog post, we tell you how we did it.
We relied on open-source components to build this set of tools, combined with our own experience assisting Promscale users. So even if you’re not a Promscale user, we hope this blog post can give you ideas on how to build your own “observing the observer” setup.
To understand our reasoning behind the alerts, runbooks, and dashboards we ended up creating, we’ll have to first explain in more detail how Promscale works.
Promscale is a unified observability backend for metrics and traces built on PostgreSQL and TimescaleDB. It has a simple architecture with only two components: the Promscale Connector and the Promscale Database.
The Promscale Connector is a stateless service that provides “plug and play” connectivity between some of the most common observability tools and protocols and the Promscale Database. These are some of its functions:
- Ingestion of Prometheus metrics and OpenTelemetry traces
- PromQL query support
- APIs for seamless integration with the Jaeger and Grafana user interfaces to visualize distributed traces
- PromQL alerting and recording rules
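As an illustration of that “plug and play” connectivity, forwarding metrics from an existing Prometheus server to the Connector only takes a `remote_write` entry in its configuration. A minimal sketch, assuming the Connector is reachable at the hostname `promscale-connector` (adjust for your deployment; 9201 is the default port):

```yaml
# prometheus.yml — forward scraped samples to the Promscale Connector.
# The hostname "promscale-connector" is an assumption for this example.
remote_write:
  - url: "http://promscale-connector:9201/write"
remote_read:
  - url: "http://promscale-connector:9201/read"
    read_recent: true
```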
The second component is the Promscale Database. The Promscale Database is PostgreSQL with observability superpowers, including:
- An optimized schema for observability data, including metrics and traces
- All the time-series analytical functions and the performance improvements provided by TimescaleDB
- Database management functions, aggregates to speed up PromQL queries, and SQL query experience enhancements for observability data
Tools that speak SQL can connect directly to the Promscale Database, while other common open-source tools such as Prometheus, Jaeger, and Grafana can integrate through the Promscale Connector.
This simple architecture makes troubleshooting easier. Promscale’s PostgreSQL foundation also helps—we’re talking about a very mature piece of software with extensive documentation and accumulated knowledge around its configuration, tuning, and troubleshooting.
Still, we knew that we could accelerate the production-readiness process by providing extra guidance to our users through an extensive set of alerts and runbooks created by the engineering team building the product.
Common Performance Bottlenecks
From our conversations with users, we learned that when tracking Promscale’s performance, there are three processes to pay particular attention to: data ingest, PromQL query execution, and background maintenance jobs.

Data ingest

In Promscale, metrics and traces follow different ingest paths. Let's cover them separately.
When metrics are ingested, they are transformed into the Promscale metric schema. This schema stores the series’ metadata and data points in separate tables. Each data point includes the associated series’ identifier. Metric labels (both keys and values) are also stored in a different table: only the IDs that reference the values are stored in the series table.
To avoid running queries to retrieve those IDs from the database when new data points are inserted for existing series, the Promscale Connector keeps a cache with all that information, including all labels for a series ID, as well as the corresponding IDs of all the keys and values that have already been seen. As new series are ingested, the cache is automatically updated.
If the cache size is not enough to hold all the series information, the Promscale Connector will automatically increase the cache size up to a configurable limit. If that limit is hit and new series are ingested, Promscale will start evicting the oldest series from the cache. After eviction, the Promscale Connector will have to query the database to retrieve the series if it is ingested again.
If the cache is too small to contain all the “active” series (i.e., series that are being ingested regularly), the system enters a loop where a series is loaded into the cache, evicted, and reloaded at the next iteration. This makes the cache ineffective and increases the query load on the database, which can negatively affect performance, usually translating into increased metric ingest latency.
When trace data is ingested, spans are translated into the Promscale span schema, which has individual tables for spans and their associated events and links. Resource, span, event, and link attributes are stored in a separate tag table, and their tag IDs are referenced in the span, event, and link tables.
Tracing uses multiple caches, but the most relevant one for ingest is the tag cache because of its potential cardinality and size. The tag cache is where the resource, span, link, and event attribute names and values are cached. As new spans, links, or events are ingested, any new attribute names or values are inserted into the database and cached together with their IDs in the tag cache.
When existing attribute names or values are found, their corresponding IDs in the database are retrieved from the cache. This cache behaves in a similar way to the metric series cache. It automatically grows to hold more attribute names and values up to a certain limit. At this point, new attribute names and values cause older attribute names and values present in the cache to be evicted. If the cache is too small to hold all the active attributes, the constant cache evictions and subsequent database queries will cause performance degradation.
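As a sketch of how this churn can be detected, an alerting rule over the eviction counter that Promscale exposes (`promscale_cache_evictions_total`) might look like the following. The threshold and windows are illustrative, not recommendations:

```yaml
groups:
  - name: promscale-cache-churn-example
    rules:
      - alert: PromscaleCacheChurnExample
        # A sustained eviction rate suggests the cache cannot hold all
        # active series/attributes; the threshold here is illustrative.
        expr: rate(promscale_cache_evictions_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Promscale cache is evicting active entries ({{ $labels.type }})"
```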
Data reads with PromQL
Promscale is built on PostgreSQL, but it reuses parts of the Prometheus PromQL query evaluation code, giving it 100% PromQL compliance.
To process PromQL queries, the most straightforward approach for Promscale would be to retrieve all the matching metric series data points (Prometheus calls them samples) from the database and let the PromQL query evaluation code process them. But for queries that return a lot of data points to be processed, this would require a lot of memory and CPU, which may lead to long query executions and possibly even failures.
In Promscale, PromQL queries may be used not only by dashboards but also by alerting and recording rules, so it is essential to ensure that they complete successfully and run fast.
To speed up query execution and make the process more efficient, the Promscale Connector parses the PromQL query and translates it into “query pushdowns.” That means it runs parts of the PromQL query directly inside the database via SQL, leveraging TimescaleDB’s time series capabilities. The Promscale extension also provides additional functions to help map a higher percentage of the PromQL queries to SQL.
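For instance, in a recording rule like the sketch below (the metric name is purely illustrative), the `sum by`/`rate` portion is the kind of computation that can be pushed down and evaluated inside the database via SQL instead of materializing every sample in the Connector:

```yaml
groups:
  - name: example-recording-rules
    rules:
      # Hypothetical metric name; rate + aggregation are good candidates
      # for evaluation inside the Promscale Database as a pushdown.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```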
Compression and data retention

The Promscale Database automatically handles compression and retention policies. It does so by regularly running background jobs. By default, two background jobs run every 30 minutes.

Under a high data ingest volume, the database may fall behind on compression and retention, with background jobs taking a very long time to complete and disk usage increasing.
The most common ways to resolve this are the following:
- Reduce the ingest rate by filtering unneeded series
- Configure additional background jobs if your compute has more CPUs available
- Increase the amount of compute resources allocated to the database
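For the second option, the Promscale Database exposes a SQL function to change the number of maintenance jobs. A sketch, assuming the `config_maintenance_jobs` function signature from recent Promscale versions:

```sql
-- Increase the number of maintenance background jobs from the default of 2
-- to 4, keeping the 30-minute schedule. Run against the Promscale Database;
-- the function name and named parameters are assumptions for this example.
SELECT config_maintenance_jobs(number_jobs => 4,
                               new_schedule_interval => INTERVAL '30 minutes');
```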
Alerts, Runbooks, and Dashboards to Fix These Issues
Once we identified the most likely potential problems that users could find, we started building our set of out-of-the-box alerts, runbooks, and dashboards to help our users ensure everything was working smoothly—from ingesting data to running PromQL queries and running maintenance tasks for data compression and retention.
When creating the alerting rules, we followed these design principles:
- Alerting rules should be symptom-based (e.g., “ingest latency increasing,” which alerts you on actual performance degradation that users will experience vs. “high CPU consumption,” which may or may not negatively impact the experience). The metrics used to trigger those alerts should explain the cause.
- Alerts should be actionable: they should help you fix the issue immediately. For this reason, we decided to create runbooks for each one of the alerts.
- Things should stay simple: we should avoid having too many alerting rules, which lead to alert fatigue.
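To make the first principle concrete, here is a sketch of what a symptom-based ingest-latency rule over the `promscale_ingest_duration_seconds_bucket` histogram could look like. The quantile, threshold, and windows are illustrative, not the values we ship:

```yaml
groups:
  - name: promscale-ingest-example
    rules:
      - alert: PromscaleIngestHighLatencyExample
        # Symptom-based: fires on the p90 ingest latency that users
        # actually experience, not on a cause such as CPU consumption.
        expr: |
          histogram_quantile(0.9,
            rate(promscale_ingest_duration_seconds_bucket[5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow metric or trace ingest into Promscale"
```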
The resulting alerts live in this YAML file. If you browse the code, you’ll see that we grouped the alerting rules into several categories, aligned with the areas most prone to causing performance bottlenecks. To help you visualize everything, we also built a Grafana dashboard with several panels associated with these alerts.

These categories are presented in the sections below.
Promscale down (1 alert)
This alert checks if a Promscale instance is running. The runbook associated with this alert lives here.
Ingest (4 alerts)
This set of alerts checks for high latency or error rates in the ingest of telemetry data. They use the promscale_ingest_duration_seconds_bucket metrics, which are available for metrics and traces via the type label. The former also include a code label that you can use to identify requests that returned an error.
The runbooks related to these alerts live here and here.
Query (4 alerts)
These alerts check for high latency or error rates in PromQL queries. In a similar vein to the ingest metrics, they use the promscale_query_duration_seconds_bucket metrics with the same labels as the ingest ones.
The runbooks associated with these alerts live here and here.
Cache (2 alerts)
These alerts check if the caches are large enough to avoid evictions of active items. This is monitored via the promscale_cache_evictions_total metrics, which also have a type label to separately track issues associated with the metric and trace caches.
The runbooks related to these alerts live here and here.
Database connection (2 alerts)
These alerts check for high latency or error rate in the connection between the Promscale Connector and the Promscale Database. They leverage several database metrics:
The runbooks associated with these alerts live here and here.
Promscale database (4 alerts)
Lastly, these alerts look at potential issues with the Promscale database regarding health checks, compression, and retention jobs. They monitor the following metrics:
The runbooks associated with these alerts live here, here, here, and here.
All these alerts are based on the metrics exposed by Promscale’s Prometheus-compliant /metrics endpoint, which runs on port 9201 by default.
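If your Prometheus server does not already scrape Promscale, a minimal scrape job for that endpoint looks like this (the target hostname is an assumption for your deployment):

```yaml
scrape_configs:
  - job_name: "promscale"
    static_configs:
      # Promscale serves Prometheus-format metrics on port 9201 by default;
      # replace the hostname with your Connector's address.
      - targets: ["promscale-connector:9201"]
```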
In total, we built 17 alerts and 13 runbooks.
Observing the Observer: How to Get Started
This set of tools is freely available with the latest Promscale release. And all the information on how to start using it lives in our documentation:
- Learn how to configure alerting rules in Promscale, using this YAML configuration file to set up this particular set of “observing the observer” alerts.
- All the runbooks associated with the alerts can be found in this GitHub repo.
- Lastly, import the Grafana dashboard into your own instance.
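Once you have downloaded the alerts file, loading it into Prometheus is a matter of referencing it from `rule_files` in your configuration (the filename below is simply whatever you saved the file as):

```yaml
# prometheus.yml
rule_files:
  - "promscale-alerts.yaml"  # hypothetical local filename
```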
If you are not using Promscale yet, you can install it here (it’s 100% free) or get started now with Promscale on Timescale Cloud (free 30-day trial, no credit card required), which offers up to 94% cost savings compared to managed Prometheus.
And if you are using Kubernetes, an even more convenient option is to install Promscale using tobs, a tool that allows you to install a complete observability stack in a few minutes. This set of alerting rules and dashboards has been directly integrated with tobs, so if you use tobs to deploy Promscale, they will be automatically deployed as well.