A Deep Dive Into the Four Types of Prometheus Metrics

A Deep Dive Into the Four Types of Prometheus Metrics

Metrics measure performance, consumption, productivity, and many other software properties over time. They allow engineers to monitor the evolution of a series of measurements (like CPU or memory usage, requests duration, latencies, and so on) via alerts and dashboards. Metrics have a long history in the world of IT monitoring and are widely used by engineers together with logs and traces to detect when systems don’t perform as expected.

In its most basic form, a metric data point is made of:

  • A metric name
  • The timestamp when the data point was collected
  • A measurement represented by a numeric value

In the last ten years, as systems have become more and more complex, the concept of dimensional metrics, that is, metrics that also include a set of tags or labels (i.e., the dimensions) to provide additional context, emerged. Monitoring systems that support dimensional metrics allow engineers to easily aggregate and analyze a metric across multiple components and dimensions by querying for a specific metric name and filtering and grouping by label.

For modern dynamic systems made up of many components, Prometheus, a Cloud Native Computing Foundation (CNCF) project, has become the most popular open-source monitoring software and effectively the industry standard for metrics monitoring. Prometheus defines a metric exposition format and a remote write protocol that the community and many vendors have adopted to expose and collect metrics becoming a de facto standard. OpenMetrics is another CNCF project that builds upon the Prometheus exposition format to offer a vendor-agnostic, standardized model for the collection of metrics that aims to be part of the Internet Engineering Task Force (IEFT).

More recently, another CNCF project, OpenTelemetry, has emerged with the goal of providing a new standard that unifies the collection of metrics, traces, and logs, enabling easier instrumentation and correlation across telemetry signals.

With a few different options to pick from, you may be wondering which standard is best for you. To help you answer this question, we have prepared a three-part blog post series in which we will be diving deep into the metric standards hosted by the CNCF. In this first post, we will cover Prometheus metrics; in the next one, we will review OpenTelemetry metrics; and in the final blog post, we will directly compare both formats—providing some recommendations for better interoperability.

Our hope is that after reading these blog posts, you will understand the differences between each standard, so you can decide which one would best address your current (and future) needs.

Prometheus Metrics

First things first. There are four types of metrics collected by Prometheus as part of its exposition format:

  • Counters
  • Gauges
  • Histograms
  • Summaries

Prometheus uses a pull model to collect these metrics; that is, Prometheus scrapes HTTP endpoints that expose metrics. Those endpoints can be natively exposed by the component being monitored or exposed via one of the hundreds of Prometheus exporters built by the community. Prometheus provides client libraries in different programming languages that you can use to instrument your code.

The pull model works great when monitoring a Kubernetes cluster, thanks to service discovery and shared network access within the cluster, but it’s harder to use to monitor a dynamic fleet of virtual machines, AWS Fargate containers or Lambda functions with Prometheus. Why? It’s difficult to identify the metrics endpoints to be scraped, and access to those endpoints may be limited by network security policies. To solve some of those problems, the community released the Prometheus Agent Mode at the end of 2021, which only collects metrics and sends them to a monitoring backend using the remote write protocol.

Prometheus can scrape metrics in both the Prometheus exposition and the OpenMetrics formats. In both cases, metrics are exposed via HTTP using a simple text-based format (more commonly used and widely supported) or a more efficient and robust protocol buffer format. One big advantage of the text format is that it is human-readable, which means you can open it in your browser or use a tool like curl to retrieve the current set of exposed metrics.

Prometheus uses a very simple metric model with four metric types that are only supported in the client libraries. All the metric types are represented in the exposition format using one or a combination of a single underlying data type. This data type includes a metric name, a set of labels, and a float value. The timestamp is added by the monitoring backend (Prometheus, for example) or an agent when they scrape the metrics.

Each unique combination of a metric name and set of labels defines a series while each timestamp and float value defines a sample (i.e., a data point) within a series.

Some conventions are used to represent the different metric types.

A very useful feature of the Prometheus exposition format is the ability to associate metadata to metrics to define their type and provide a description. For example, Prometheus makes that information available and Grafana uses it to display additional context to the user that helps them select the right metric and apply the right PromQL functions:

Screenshot showing the metrics browser in Grafana.
Metrics browser in Grafana displaying a list of Prometheus metrics and showing additional context about them.

Example of a metric exposed using the Prometheus exposition format:

# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total counter
http_requests_total{api="add_product"} 4633433

# HELP is used to provide a description for the metric and # TYPE a type for the metric

Now, let's get into more detail about each of the Prometheus metrics in the exposition format.

Counters

Counter metrics are used for measurements that only increase. Therefore they are always cumulative—their value can only go up. The only exception is when the counter is restarted, in which case its value is reset to zero.

The actual value of a counter is not typically very useful on its own. A counter value is often used to compute the delta between two timestamps or the rate of change over time.

For example, a typical use case for counters is measuring API calls, which is a measurement that will always increase:

http_requests_total{api="add_product"} 4633433

The metric name is http_requests_total, it has one label named api with a value of add_product and the counter’s value is 4633433. This means that the add_product API has been called 4,633,433 times since the last service start or counter reset. By convention, counter metrics are usually suffixed with _total.

The absolute number does not give us much information, but when used with PromQL’s rate function (or a similar function in another monitoring backend), it helps us understand the requests per second that API is receiving. The PromQL query below calculates the average requests per second over the last five minutes:

rate(http_requests_total{api="add_product"}[5m])

To calculate the absolute change over a time period, we would use a delta function which in PromQL is called increase():

increase(http_requests_total{api="add_product"}[5m])

This would return the total number of requests made in the last five minutes, and it would be the same as multiplying the per second rate by the number of seconds in the interval (five minutes in our case):

rate(http_requests_total{api="add_product"}[5m]) * 5 * 60

Other examples where you would want to use a counter metric would be to measure the number of orders in an e-commerce site, the number of bytes sent and received over a network interface or the number of errors in an application. If it is a metric that will always go up, use a counter.

Below is an example of how to create and increase a counter metric using the Prometheus client library for Python:

from prometheus_client import Counter
api_requests_counter = Counter(
                        'http_requests_total',
                        'Total number of http api requests',
                        ['api']
                       )
api_requests_counter.labels(api='add_product').inc()

Note that since counters can be reset to zero, you want to make sure that the backend you use to store and query your metrics will support that scenario and still provide accurate results in case of a counter restart. Prometheus and PromQL-compliant Prometheus remote storage systems like Promscale handle counter restarts correctly.

Gauges

Gauge metrics are used for measurements that can arbitrarily increase or decrease. This is the metric type you are likely more familiar with since the actual value with no additional processing is meaningful and they are often used. For example, metrics to measure temperature, CPU, and memory usage, or the size of a queue are gauges.

For example, to measure the memory usage in a host, we could use a gauge metric like:

node_memory_used_bytes{hostname="host1.domain.com"} 943348382

The metric above indicates that the memory used in node host1.domain.com at the time of the measurement is around 900 megabytes. The value of the metric is meaningful without any additional calculation because it tells us how much memory is being consumed on that node.

Unlike when using counters, rate and delta functions don’t make sense with gauges. However, functions that compute the average, maximum, minimum, or percentiles for a specific series are often used with gauges. In Prometheus, the names of those functions are avg_over_time, max_over_time, min_over_time, and quantile_over_time. To compute the average of memory used on host1.domain.com in the last ten minutes, you could do this:

avg_over_time(node_memory_used_bytes{hostname="host1.domain.com"}[10m])

To create a gauge metric using the Prometheus client library for Python you would do something like this:

from prometheus_client import Gauge
memory_used = Gauge(
                'node_memory_used_bytes',
                'Total memory used in the node in bytes',
                ['hostname']
              )
memory_used.labels(hostname='host1.domain.com').set(943348382)

Histograms

Histogram metrics are useful to represent a distribution of measurements. They are often used to measure request duration or response size.

Histograms divide the entire range of measurements into a set of intervals—named buckets—and count how many measurements fall into each bucket.

A histogram metric includes a few items:

  1. A counter with the total number of measurements. The metric name uses the _count suffix.
  2. A counter with the sum of the values of all measurements. The metric name uses the _sum suffix.
  3. The histogram buckets are exposed as counters using the metric name with a  _bucket suffix and a le label indicating the bucket upper inclusive bound. Buckets in Prometheus are inclusive, that is a bucket with an upper bound of N (i.e.,  le label) includes all data points with a value less than or equal to N.

For example, the summary metric to measure the response time of the instance of the add_product API endpoint running on host1.domain.com could be represented as:

# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total counter
http_request_duration_seconds_sum{api="add_product" instance="host1.domain.com"} 8953.332
http_request_duration_seconds_count{api="add_product" instance="host1.domain.com"} 27892
http_request_duration_seconds_bucket{api="add_product" instance="host1.domain.com" le="0"}
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.01"} 0
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.025"} 8
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.05"} 1672
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.1"} 8954
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.25"} 14251
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.5"} 24101
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="1"} 26351
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="2.5"} 27534
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="5"} 27814
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="10"} 27881
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="25"} 27890
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="+Inf"} 27892

The example above includes the sum, the count, and 12 buckets. The sum and count can be used to compute the average of a measurement over time. In PromQL, the average duration for the last five minutes will be computed as follows:

rate(http_request_duration_seconds_sum{api="add_product", instance="host1.domain.com"}[5m]) / rate(http_request_duration_seconds_count{api="add_product", instance="host1.domain.com"}[5m])

It can also be used to compute averages across series. The following PromQL query would compute the average request duration in the last five minutes across all APIs and instances:

sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

With histograms, you can compute percentiles at query time for individual series as well as across series. In PromQL, we would use the histogram_quantile function. Prometheus uses quantiles instead of percentiles. They are essentially the same thing but quantiles are represented on a scale of 0 to 1 while percentiles are represented on a scale of 0 to 100. To compute the 99th percentile (0.99 quantile) of response time for the add_product API running on host1.domain.com, you would use the following query:

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com"}[5m]))

One big advantage of histograms is that they can be aggregated. The following query returns the 99th percentile of response time across all APIs and instances:

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

In cloud-native environments, where there are typically many instances of the same component running, the ability to aggregate data across instances is key.

Histograms have three main drawbacks:

  1. First, buckets must be predefined, requiring some upfront design. If your buckets are not well defined, you may not be able to compute the percentiles you need or would consume unnecessary resources. For example, if you have an API that always takes more than one second, having buckets with an upper bound ( le label) smaller than one second would be useless and just consume compute and storage resources on your monitoring backend. On the other hand, if 99.9 % of your API requests take less than 50 milliseconds, having an initial bucket with an upper bound of 100 milliseconds will not allow you to accurately measure the performance of the API.
  2. Second, they provide approximate percentiles, not accurate percentiles. This is usually fine as long as your buckets are designed to provide results with reasonable accuracy.
  3. And third, since percentiles need to be calculated server-side, they can be very expensive to compute when there is a lot of data to be processed. One way to mitigate this in Prometheus is to use recording rules to precompute the required percentiles.

The following example shows how you can create a histogram metric with custom buckets using the Prometheus client library for Python:

from prometheus_client import Histogram
api_request_duration = Histogram(
                        name='http_request_duration_seconds',
                        documentation='Api requests response time in seconds',
                        labelnames=['api', 'instance'],
                        buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25 )
                       )
api_request_duration.labels(
    api='add_product',
    instance='host1.domain.com'
).observe(0.3672)

Summaries

Like histograms, summary metrics are useful to measure request duration and response sizes.

A summary metric includes these items:

  • A counter with the total number of measurements. The metric name uses the _count suffix.
  • A counter with the sum of the values of all measurements. The metric name uses the _sum suffix. Optionally, a number of quantiles of measurements exposed as a gauge using the metric name with a quantile label. Since you don’t want those quantiles to be measured from the entire time an application has been running, Prometheus client libraries use streamed quantiles that are computed over a sliding time window (which is usually configurable).

For example, the summary metric to measure the response time of the instance of the add_product API endpoint running on host1.domain.com could be represented as:

http_request_duration_seconds_sum{api="add_product" instance="host1.domain.com"} 8953.332
http_request_duration_seconds_count{api="add_product" instance="host1.domain.com"} 27892
http_request_duration_seconds{api="add_product" instance="host1.domain.com" quantile="0"}
http_request_duration_seconds{api="add_product" instance="host1.domain.com" quantile="0.5"} 0.232227334
http_request_duration_seconds{api="add_product" instance="host1.domain.com" quantile="0.90"} 0.821139321
http_request_duration_seconds{api="add_product" instance="host1.domain.com" quantile="0.95"} 1.528948804
http_request_duration_seconds{api="add_product" instance="host1.domain.com" quantile="0.99"} 2.829188272
http_request_duration_seconds{api="add_product" instance="host1.domain.com" quantile="1"} 34.283829292

This example above includes the sum and count as well as five quantiles. Quantile 0 is equivalent to the minimum value and quantile 1 is equivalent to the maximum value. Quantile 0.5 is the median and quantiles 0.90, 0.95, and 0.99 correspond to the 90th, 95th, and 99th percentile of the response time for the add_product API endpoint running on host1.domain.com.

Like histograms, summaries include sum and count that can be used to compute the average of a measurement over time and across time series.

Summaries provide more accurate quantiles than histograms but those quantiles have three main drawbacks:

  1. First, computing the quantiles is expensive on the client-side. This is because the client library must keep a sorted list of data points overtime to make this calculation. The implementation in the Prometheus client libraries uses techniques that limit the number of data points that must be kept and sorted, which reduces accuracy in exchange for an increase in efficiency. Note that not all Prometheus client libraries support quantiles in summary metrics. For example, the Python library does not have support for it.
  2. Second, the quantiles you want to query must be predefined by the client. Only the quantiles for which there is a metric already provided can be returned by queries. There is no way to calculate other quantiles at query time. Adding a new quantile requires modifying the code and the metric will be available from that time forward.
  3. And third and most important, it’s impossible to aggregate summaries across multiple series, making them useless for most use cases in dynamic modern systems where you are interested in the view across all instances of a given component. Therefore, imagine that in our example the add_product API endpoint was running on ten hosts sitting behind a load balancer. There is no aggregation function that we could use to compute the 99th percentile of the response time of the add_product API endpoint across all requests regardless of which host they hit. We could only see the 99th percentile for each individual host. Same thing if instead of the 99th percentile of the response time for the add_product API endpoint we wanted to get the 99th percentile of the response time across all API requests regardless of which endpoint they hit.

The code below creates a summary metric using the Prometheus client library for Python:

from prometheus_client import Summary
api_request_duration = Summary(
                        'http_request_duration_seconds',
                        'Api requests response time in seconds',
                        ['api', 'instance']
                       )
api_request_duration.labels(api='add_product', instance='host1.domain.com').observe(0.3672)

The code above does not define any quantile and would only produce sum and count metrics. The Prometheus client library for Python does not have support for quantiles in summary metrics.

Histograms or Summaries, What Should I Use?

In most cases, histograms are preferred since they are more flexible and allow for aggregated percentiles.

Summaries are useful in cases where percentiles are not needed and averages are enough, or when very accurate percentiles are required. For example, in the case of contractual obligations for the performance of a critical system.

The table below summarizes the pros and cons of histograms and summaries.

Table comparing different properties of histograms vs. summaries in Prometheus.
Table comparing different properties of histograms vs. summaries in Prometheus.

Conclusion

In the first part of this blog post series on metrics, we’ve reviewed the four types of Prometheus metrics: counters, gauges, histograms, and summaries. In the next part of the series, we will dissect OpenTelemetry metrics.

Looking for a long-term store for your Prometheus metrics? Check out Promscale, the observability backend built on PostgreSQL and TimescaleDB. It seamlessly integrates with Prometheus, with 100% PromQL compliance, multitenancy, and OpenMetrics exemplars support.

  • Promscale is an open-source project, and you can use it completely for free. For install instructions in Kubernetes, Docker, or virtual machine, check out our docs.
  • If you have questions, join the #promscale channel in the Timescale Community Slack. You will be able to directly interact with the team building Promscale and with other developers interested in observability. We’re +4,100 and counting in that channel!
This post was written by
13 min read
Observability
Contributors

Related posts

TimescaleDB - Timeseries database for PostgreSQL

Explore TimescaleDB

Learn more about how TimescaleDB works, compare versions, and get technical guidance and tutorials.

Go to docs Go to products