Understanding percentile_cont() and percentile_disc() in PostgreSQL

Start supercharging your PostgreSQL today.

PostgreSQL has two functions to calculate the percentile for a list of values at any percentage: percentile_cont() and percentile_disc(). These two functions work similarly, but they differ in how they produce the final result. Both are used with ordered-set aggregates returned by the WITHIN GROUP clause.

The percentile_disc() function returns a value from the input set that is the closest to the percentile requested. The value returned will actually exist in the set.

Here's the percentile_disc() syntax:

SELECT
	percentile_disc(<fraction double precision>) WITHIN GROUP (<sort_expression>)
FROM <table>;

The percentile_cont() function returns an interpolated value between multiple values based on the distribution. It is more accurate, but it may return a fractional value between two values in the input set.

percentile_cont() syntax:

SELECT
	percentile_cont(<fraction double precision>) WITHIN GROUP (<sort_expression>)
FROM <table>;

Examples

Calculating the median
Calculating multiple percentiles
Calculating a series of percentiles

For the following examples, we will use this set of weather data stored in a table called city_data:

day	city	temperature	precipitation
2021-09-04	Miami	68.36	0.00
2021-09-05	Miami	72.50	0.00
2021-09-01	Miami	65.30	0.28
2021-09-02	Miami	64.40	0.79
2021-09-03	Miami	68.18	0.47
2021-09-04	Atlanta	67.28	0.00
2021-09-05	Atlanta	68.72	0.00
2021-09-01	Atlanta	63.14	0.20
2021-09-02	Atlanta	62.60	0.59
2021-09-03	Atlanta	62.60	0.39

Calculating the Median

The median is also known as the 50th percentile. You can calculate it from the dataset with the following query:

SELECT
	percentile_disc(0.5) WITHIN GROUP (
	ORDER BY temperature)
FROM city_data;

The result is:

percentile_disc

65.30

Because the query used percentile_disc(), the result is a value that exists in the dataset. If you want to find the true median, it is not a value in this data, and you have to use percentile_cont(). Here is the query:

SELECT
	percentile_cont(0.5) WITHIN GROUP (
	ORDER BY temperature)
FROM city_data;

And the result:

percentile_cont

66.28999999999999

But since there are two cities, you might want to calculate the median temperature of each by adding a GROUP BY clause. Here is that query:

SELECT
	city,
	percentile_cont(0.5) WITHIN GROUP (
ORDER BY
	temperature)
FROM
	city_data
GROUP BY
	city;

The results is:

city	percentile_cont
Atlanta	63.14
Miami	68.18

Calculating Multiple Percentiles

For this example, we are going to use a database table called conditions that contains these values:

time	device_id	temperature	humidity
2016-11-15 07:00:00	weather-pro-000001	32.4	49.8
2016-11-15 07:00:00	weather-pro-000002	39.800000000000004	50.2
2016-11-15 07:00:00	weather-pro-000003	36.800000000000004	49.8
2016-11-15 07:00:00	weather-pro-000004	71.8	50.1
2016-11-15 07:00:00	weather-pro-000005	71.8	49.9
2016-11-15 07:00:00	weather-pro-000006	37	49.8

Let’s say that we want to calculate various percentiles for the humidity for each device. Here is an example query:

SELECT
	device_id,
	percentile_cont(0.25) WITHIN GROUP(
ORDER BY
	humidity) AS percentile_25,
	percentile_cont(0.50) WITHIN GROUP(
ORDER BY
	humidity) AS percentile_50,
	percentile_cont(0.75) WITHIN GROUP(
ORDER BY
	humidity) AS percentile_75,
	percentile_cont(0.95) WITHIN GROUP(
ORDER BY
	humidity) AS percentile_95
FROM
	conditions
GROUP BY
	device_id ;

Here is a part of the result:

device_id	percentile_25	percentile_50	percentile_75	percentile_95
weather-pro-000000	49.29999999999999	50.500000000000036	53.10000000000007	54.9000000000001
weather-pro-000001	49.09999999999999	50.00000000000003	51.60000000000005	55.6
weather-pro-000002	52.500000000000036	53.60000000000005	54.00000000000006	54.500000000000064
weather-pro-000003	51.100000000000016	51.90000000000003	52.90000000000004	53.800000000000054
weather-pro-000004	48.60000000000001	49.20000000000002	49.60000000000002	50.400000000000034

Calculating a Series of Percentiles

For this example, we are going back to our original city_data dataset because this query can take a long time to run on a big dataset. We are going to use the generate_series() to create every single whole percentage and then use those values in percentile_cont. Here is the query:

SELECT
	city,
	percentile,
	percentile_cont(p) WITHIN GROUP (
ORDER BY
	temperature)
FROM
	city_data,
	generate_series(0.01, 1, 0.01) AS percentile
GROUP BY
	city, percentile;

Here is a selection of the results since the query generates 200 rows of them:

city	percentile	percentile_cont
Atlanta	0.25	62.6
Atlanta	0.26	62.6216
Atlanta	0.27	62.6432
Atlanta	0.28	62.6648
Atlanta	0.29	62.6864
Atlanta	0.30	62.708

Why Use the Timescale approx_percentile() Function Instead of PostgreSQL Percentile Functions?

Calculating the percentile over large datasets, like time-series data in a Timesscale database, can involve a lot of expensive calculations. It can increase the memory footprint of the database, result in higher network costs, and make streaming data unfeasible. The aggregates are also not partializable or parallelizable.

Many times you don’t need this type of accuracy, and approximate percentile calculations will be close enough. This is why Timescale introduced the approx_percentile() hyperfunction. The approx_percentile() function implements the UDDSketch algorithm that uses a modified histogram to approximate the shape of a distribution. This allows for calculating a “good enough” percentile without needing to use all the data or ordering it before it returns the result.

approx_percentile() syntax:

approx_percentile(
    percentile DOUBLE PRECISION,
    sketch  uddsketch
) RETURNS DOUBLE PRECISION

The second parameter is the sketch to perform the approx_percentile on and is usually returned from a percentile_agg() call. Here is an example query:

SELECT
    approx_percentile(0.01, percentile_agg(data))
FROM generate_series(0, 100) data;

Result:

approx_percentile

0.999

Next Steps

To learn more about how to use percentile_cont() and percentile_disc() in PostgreSQL, you can see the PostgreSQL documentation.

To find out more about Timescale’s approx_percentile() function, you can read more about it in the Times cale documentation.

For examples of how to use these functions in your queries, see these sections of the Timescale documentation: