Promscale 0.4: Drawing Inspiration from Cortex to Rewrite Support for Prometheus High-Availability
Now with better support for Prometheus high-availability, support for multi-tenancy, improved user permissions (using role-based access control), and more.
Last October, we released an early version of Promscale, an open-source analytical platform and long-term store for Prometheus data. Promscale grew out of a need to have something to store metrics and handle both simple dashboarding queries as well as answer more complex questions. For example, complex questions that involve other datasets and analytical queries, for use cases like capacity planning, predictive analytics, root-cause analysis, auditing, and reporting.
Because of our architectural decision to build on top of TimescaleDB (and, thus, on PostgreSQL), Promscale is the only Prometheus long-term store that offers full SQL and full PromQL support, providing the ability to create any dashboard and ask any question to get a more complete understanding of the systems they are monitoring. (This includes the ability to include external datasets and query across them using joins.)
Despite being a relatively young project, Promscale users include two of the largest cloud providers in North America and Europe, one of the largest gaming companies in the world, many smaller startups, and a range of many other use cases and industries (including Timescale, where this project is also being developed - “dogfooding” FTW).
Today, with Promscale 0.4, we’re introducing several new capabilities:
- Better support for Prometheus high-availability (more below)
- Support for multi-tenancy
- More control over user permissions (using role-based access control)
- And more (please see release notes)
With this release, Promscale is now even more robust and suited for simple and enterprise-grade deployments (and everything in between). In particular, our improved support for Prometheus high-availability will enable scalable deployments that are resilient to failures and outages while multi-tenancy provides better isolation of data belonging to different parts of your organization.
In the rest of this post, we’ll go deeper into how we’ve redesigned (and subsequently improved) the way Promscale ingests data from Prometheus servers running as high-availability (HA) replicas.
To get started with Promscale today:
- Install Promscale via Helm Charts, Docker, and others. See GitHub for more information (GitHub ⭐️ welcome and appreciated! 🙏)
- Check out our Getting Started with Promscale tutorial for more on how Promscale works, installation instructions, sample PromQL and SQL queries, and more.
- Watch our Promscale 101 YouTube playlist for step-by-step demos and best practice.
- Reach out in Timescale Slack (#promscale) to join our community.
If you’re an existing user, simply deploy the latest version of the Promscale binary to upgrade to Promscale 0.4.
In general, deploying Prometheus high-availability replicas is critical for robust production systems, since they protect against a crash of any one server. Promscale has supported ingesting this kind of data since our first release – but our original method was based on database locks, which led to complex deployments, had problems with scalability and coupling, and was less resilient to certain kinds of failures.
Our new system, which takes inspiration from Cortex, solves these issues and makes Promscale both easier to use and more robust.
To the average reader, this may seem strange: why discuss how a feature implementation of another Prometheus project inspired our own?
We believe that this type of knowledge sharing illustrates one of the most powerful parts of the Prometheus ecosystem: due to its open-source nature, and the strong collaborative culture within the Prometheus community, projects often take inspiration from one another, together improving the ecosystem as a whole.
In this case, we designed our approach using some of the same basic principles as Cortex; conversely, aspects of our implementation may be useful for Cortex or other long-term storage systems for Prometheus metrics to use as inspiration to improve their own projects. This virtuous cycle improves all projects in the ecosystem.
Read on to learn more about: how Prometheus HA replication works; the challenges it poses for long-term storage systems; our original solution (and the problems we encountered); and how we built a new system to address those issues and deliver a more robust, resilient, and consequently more useful, observability data platform.
We’ll conclude with a brief overview of common issues beyond HA that are critical to think through and address in any production observability data system.
How Prometheus HA works
Any production system that relies on Prometheus data for operations should deploy Prometheus with HA. When things go wrong, and your cluster is under stress, it’s vital that your observability systems keep running, so that you can diagnose and fix the problem.
As a result, Prometheus users often deploy Prometheus as High Availability (HA) clusters to protect against any single Prometheus server crashing.
These clusters have multiple identical Prometheus servers that run as replicas on different machines/containers. These servers scrape the same targets, and thus have very similar data (scrapes happen at slightly different times, which may lead to minor differences), so if one Prometheus server goes down, the other one(s) will have the same data.
Thus, you can think of a Prometheus HA cluster as a group of similar Prometheus servers that all scrape the same targets. Also, you can consider Prometheus replicas as the individual Prometheus servers within that group.
The problem we - and all long-term storage systems - must solve
Since different replicas within the same cluster contain essentially the same data, many users prefer to store long-term data from only one of the replicas to save on storage costs.
However, in order to maintain high-availability properties, the long-term storage system must “deduplicate” data. This “deduplication” must also be resilient to the failure of any single Prometheus replica. Yet, because Prometheus expects data that is mostly regular (i.e., occuring at equally-spaced times), you also don’t want to switch between the replicas too often.
Therefore, the basic solution elects a single replica to write data, and switches to another replica only if the leader replica dies.
The flow of data then looks something like:
Our original solution (Using locks)
Since the implementation challenge described above sounds a lot like leader election, where we choose one replica out of many to be the “leader,” our original implementation reused PostgreSQL to implement leader-election using exclusive advisory locks. (Again, Promscale is built on top of TimescaleDB and PostgreSQL.)
This is how our original solution worked:
- We coupled every Prometheus server with a unique Promscale connector.
- We asked all of the Promscale connectors to try to grab an exclusive database lock.
- Whichever Promscale connector “won” (and was able to grab the lock) became the leader.
- The Prometheus server that was paired with the Promscale leader (as determined in the prior step) became the Prometheus leader and had its data written to the database.
Flaws in our original design
But, as users started deploying Promscale, we discovered two main problems with our original implementation:
- Tight coupling between Prometheus and Promscale
- Challenges handling delayed data
Tight coupling between Prometheus and Promscale
The first problem was the one-to-one mapping (i.e., coupling) between Prometheus and the Promscale connector. This made it impossible to independently scale the Prometheus layer from the Promscale layer and thus created scaling issues, hot-spots, and inefficient resource utilization.
In terms of scaling, this meant we could no longer provide multiple Promscale connectors to handle ingestion from one Prometheus server, necessitating large Promscale machines for handling load from large Prometheus servers.
Having only one leader Promscale connector be able to write data at any one time also created hot-spots and wasted resources: The Promscale connector that was the leader did most of the work (actually writing data to the database) while the non-leaders did almost nothing.
This coupling also made deployments more complex, as Prometheus and Promscale - two seemingly independent services - had to be deployed in a one-to-one mapping. This was especially complex in the Kubernetes ecosystem, where users had to inject Promscale as a side-car into the Prometheus pod.
Challenges handling delayed data
The second problem was processing delays, since the “who is the leader?” decision depended on wall-clock time and not data-time. If one of the Prometheus servers had an intermittent failure and lost its leader status, then came back up and tried to re-send data, that data would be considered “late” and be thrown away.
For example, if replica A was the leader and crashed, replica B took over after some period of time. Now, imagine replica A comes back online and proceeds to send data from the period between A crashing and B taking over (e.g., if replica A didn’t finish sending all of the data it collected before it crashed, and then attempted to finish sending after coming back online). We couldn’t use that data to “fill in the gap” because B is now the leader, and all data from A had to be discarded.
An improved solution (Using labels and leases)
Our requirements for better solution
Based on what we saw with our initial design, we wanted an improved system that:
- Allowed multiple Promscale connectors to process write traffic at the same time to avoid hot-spots and wasted resources.
- Allowed Promscale servers to be scaled and sized independently from Prometheus connectors.
- Spread work evenly across Promscale connectors. In other words, we wanted to be able to put a load-balancer between the two systems.
- Base leadership decisions on data-time and not wall-clock-time, allowing data to be ingested from the correct leader (for the time-period) even if it arrives late.
This led to the realization that what we really wanted was a system that allowed multiple Promscale connectors to write data coming from a single “leader” Prometheus server at the same time. In other words, we couldn’t elect a leader Promscale connector corresponding to a Prometheus server, rather we had to elect a leader Prometheus server directly and allow multiple Promscale connectors to process the data coming from that leader.
Using immutable leases to elect a single leader Prometheus replica
As we outlined above, our original leader election system allowed only a single Promscale connector to obtain a lock and write data. But, this method could not be used to pick an identifier for a Prometheus leader. We needed to change the system to elect a single Prometheus leader replica – and share this information across multiple Promscale connectors.
To do this, we use immutable leases written as records into a database table, stored within the same TimescaleDB instance as the Prometheus metric data itself.
Here is how that works: For any given time period, we insert a lease record that maps the time period to a unique replica leader (e.g., replica A). This mapping is immutable: once it is created, no other replica can create another entry for the same time period in our lease record table. However, the time period is limited in time (e.g., 60 seconds). But as long as the most recent leader is still alive, it can keep expanding that time range forward in time (i.e., in 60 second increments).
For example: if replica A becomes the leader for 00:00-00:60, at 00:30, it can extend its lease to be 00:00-00:90 (i.e., an additional 60 seconds, the maximum time range.). (Technically, the replica does not create and extend its lease directly, but rather a Promscale connector creates and extends the lease on the replica’s behalf.)
If the leader dies (i.e., stops actively expanding the lease), another replica can become the leader for a subsequent time period. To continue with the above example, if replica A dies at 00:75, replica C can create a new lease for 00:91-00:151, since replica A’s lead doesn’t expire until 00:90. In this example, we lose 15 seconds of data (data from 00:75-:00:90) – and users can mitigate any potential loss by setting a conservative lease timeout period, so only a few scrapes are lost. (This seems like a fair compromise to us. Also, to our knowledge, this kind of data loss happens with every single Prometheus long-term storage system that uses leader-election vs. retaining data from all replicas.)
The existence of records in the database allows multiple Promscale connectors to get a consistent view of the leases in the database. If multiple Promscale connectors try to create a lease at the same time (on behalf of different replicas), the database makes sure that there is only one winner, which is exposed consistently to all the other connectors.
We actually maintain two tables: one access-optimized table that contains the current lease, and a second one that contains a log of all the leases taken for previous periods of time. We maintain this log to enable users to determine whether late-arriving data belonged to the leader, as well as for auditing or debugging purposes.
An important detail here is that the lease is based on the time in the data, not wall-clock time. This makes the system safe for Prometheus servers that experience different latencies writing to the database, or that fall behind because of slow processing. In particular, if a slowdown or intermittent failure in one of the Prometheus servers causes “late” data, we use the lease log table to figure out which replica was the leader at the time corresponding to the “late” data. From there, we determine if the “late” data should be inserted, or dropped because it came from a non-leader.
Figuring out which Prometheus replica sent the data
Once we are able to pick a Prometheus replica leader, we still have a problem: how does a Promscale connector that can get data from multiple replicas know which data came from the leader (i.e., to save that data), and which came from the non-leaders (i.e., to discard that data)? To solve this problem, we took inspiration from Cortex’s idea to use external labels to signal the cluster and replica that’s associated with the data being sent from Prometheus.
More concretely: since external labels are part of the HTTP request sent via Prometheus remote_write, we can now distribute requests to several Promscale connectors (e.g., via a load balancer) instead of having to maintain a 1:1 mapping – without losing the knowledge of which request came from which replica.
Our new architecture
This new approach allows us to distribute write load among several Promscale connectors, so that our architecture now looks like this:
This new architecture has several advantages:
- Multiple Promscale connectors can process write traffic concurrently, increasing write throughput.
- Promscale connectors can be sized and scaled independently of Prometheus servers.
- Traffic from Prometheus servers can be fairly distributed to Promscale connectors with a load balancer.
- Delayed data is now correctly handled, by saving it if it came from the “leader” replica for that data-time, or by discarding it otherwise.
Edge-case: handling slowdowns
While developing leader election systems many interesting edge cases show up.
We thought we’d share one such edge-case we found surprising and didn’t expect: What if the “leader” Prometheus server sends data that is older than the data being sent by a non-leader? In this (slow) tortoise and (fast) hare scenario what should we do?
We saw two options:
- Switch leader to the hare, and possibly introduce a lot of leader changes if this scenario occurs frequently.
- Stay with the tortoise and assume it will catch up over time.
An additional confounding factor: how do we tell the difference between the tortoise being slow and just not working?
In thinking through this scenario, we realized that Prometheus already has methods to adjust to slowdowns. For example, Prometheus could increase the number of queues it uses to send data via remote write, or it could slow down its scrape loop (and does this by default). It seemed like doing our own adaptation (option 1 above) ran a very real risk of creating chaos.
Thus, we opted for option 2, trusting Prometheus to do the adaptation. We stay with the existing leader, unless we don’t receive new data from it in 30 seconds (of wall-clock time) and there is another Prometheus server sending data newer than the existing lease.
Beyond high-availability Prometheus: a production-ready system for storing and collecting observability data
Here we described how we - Promscale - evolved our approach for supporting Prometheus high-availability. But of course, truly resilient systems need much more.
For example, truly resilient systems need to support high-availability at every step of the ingest pipeline, and to support backups and disaster recovery (e.g., and recovery from the ever-present “fat finger”).
Promscale supports these requirements in various ways.
For Promscale, the ingest pipeline consists of 3 components – Prometheus server, Promscale connector, and TimescaleDB – each of which supports high-availability. We have already discussed the Prometheus high-availability setup. Promscale is a stateless service and so can safely be run as a replicated service behind a load balancer. TimescaleDB itself can be configured with multiple types of high-availability deployments which allow the user to make tradeoffs in read performance appropriate for their system.
Aside from high-availability throughout the ingest pipeline, a resilient system should also have a good way of performing remote backups. TimescaleDB supports continuous streaming backups and point-in-time recovery. This protects you against data-center disasters as well as operational mistakes and fat-finger errors (and who hasn’t done /that/).
Proper permissions and limited roles is another way to protect your data. That’s why Promscale implements a role-based access control (RBAC) permission system at the database level. This allows you to grant separate permissions for reading data, writing data, and administering the database out of the box. Using PostgreSQL’s advanced permission systems, finer control is also possible.
In summary, we’re excited about our new approach for ingesting data from Prometheus servers running in high-availability mode. We highlighted the importance of building systems where independent layers are loosely coupled and can scale independently – as well as why observability solutions must carefully consider the difference between wall-clock and data times (something we learned the hard way).
We hope that the way we’ve drawn inspiration from other projects - specifically Cortex in this instance - serves as inspiration for others who wish to improve the Prometheus ecosystem. Promscale is 100% open-source, and we’re happy to work with anyone who’d like to adapt pieces of our design for their own projects (and continue the virtuous cycle 🔥).
We’ll continue to invest in future improvements, both in how we support Prometheus high-availability and the overall Promscale platform. We also welcome any feedback or direct contributions from community members.
Get started and join the Promscale community
To get started with Promscale:
Install Promscale today via Helm Charts, Docker, and others. See GitHub for more information (GitHub ⭐️ welcome and appreciated! 🙏.)
Check out our Getting Started with Promscale tutorial for more on how Promscale works, installation instructions, sample PromQL and SQL queries, and more.
Watch Promscale 101 YouTube playlist for step-by-step demos and best practice.
To get involved with the Promscale community: