Published at Mar 25, 2024

Understanding percentile_cont() and percentile_disc() in PostgreSQL


PostgreSQL has two functions that calculate the percentile of a set of values at any fraction between 0 and 1: percentile_cont() and percentile_disc(). The two work similarly but differ in how they produce the final result. Both are ordered-set aggregate functions, so they are called with a WITHIN GROUP (ORDER BY ...) clause that defines how the input values are sorted.

The percentile_disc() function returns the first value in the sorted input whose position in the ordering equals or exceeds the requested fraction. The value returned always exists in the set.

Here's the percentile_disc() syntax:

SELECT percentile_disc(<fraction double precision>) WITHIN GROUP (ORDER BY <sort_expression>) FROM <table>;
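To make the selection rule concrete, here is a minimal Python sketch of the same idea. It is an illustration, not PostgreSQL's implementation, and the function name and sample values are chosen only for this example.

```python
import math

def percentile_disc(fraction, values):
    """Return the first sorted value whose cumulative position >= fraction."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    if not ordered:
        return None  # an aggregate over no rows yields NULL in SQL
    # Smallest 1-based rank k with k / n >= fraction (rank 1 when fraction is 0).
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]

sample = [40, 10, 30, 20]             # hypothetical values
print(percentile_disc(0.5, sample))   # 20
print(percentile_disc(0.51, sample))  # 30
```

The result is always one of the inputs, which is why a small change in the fraction can make the answer jump from one value to the next.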

The percentile_cont() function treats the sorted input as a continuous distribution. When the requested percentile falls between two values, it returns a linear interpolation of them, so the result may be a fractional value that does not appear in the input set.

percentile_cont() syntax:

SELECT percentile_cont(<fraction double precision>) WITHIN GROUP (ORDER BY <sort_expression>) FROM <table>;
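The interpolation can be sketched in a few lines of Python. This is an illustration under the assumption of plain linear interpolation between neighboring sorted values, not PostgreSQL's source code; the function name and sample values are hypothetical.

```python
import math

def percentile_cont(fraction, values):
    """Linearly interpolate the value at the given fraction of the sorted input."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    if not ordered:
        return None  # an aggregate over no rows yields NULL in SQL
    # Fractional 0-based position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])

sample = [40, 10, 30, 20]              # hypothetical values
print(percentile_cont(0.5, sample))    # 25.0
print(percentile_cont(0.25, sample))   # 17.5
```

Neither 25.0 nor 17.5 is in the input, which is the visible difference from percentile_disc().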

Examples

Calculating the median
Calculating multiple percentiles
Calculating a series of percentiles

For the following examples, we will use this set of weather data stored in a table called city_data:

day	city	temperature	precipitation
2021-09-04	Miami	68.36	0.00
2021-09-05	Miami	72.50	0.00
2021-09-01	Miami	65.30	0.28
2021-09-02	Miami	64.40	0.79
2021-09-03	Miami	68.18	0.47
2021-09-04	Atlanta	67.28	0.00
2021-09-05	Atlanta	68.72	0.00
2021-09-01	Atlanta	63.14	0.20
2021-09-02	Atlanta	62.60	0.59
2021-09-03	Atlanta	62.60	0.39

Calculating the Median

The median is also known as the 50th percentile. You can calculate it from the dataset with the following query:

SELECT percentile_disc(0.5) WITHIN GROUP ( ORDER BY temperature) FROM city_data;

The result is:

percentile_disc

65.30

Because the query used percentile_disc(), the result is a value that exists in the dataset. With an even number of rows, however, the true median falls between the two middle values (65.30 and 67.28) and is not itself in the data. To find it, use percentile_cont(). Here is the query:

SELECT percentile_cont(0.5) WITHIN GROUP ( ORDER BY temperature) FROM city_data;

And the result:

percentile_cont

66.28999999999999
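As a cross-check outside the database, Python's standard statistics module offers two median functions that, for an even number of values, line up with the two results: median_low() picks the lower of the two middle values (like percentile_disc(0.5)), and median() averages them (like percentile_cont(0.5)). The sketch below copies the ten temperatures from the city_data table.

```python
import statistics

# The ten temperature readings from the city_data table.
temperatures = [68.36, 72.50, 65.30, 64.40, 68.18,
                67.28, 68.72, 63.14, 62.60, 62.60]

discrete_median = statistics.median_low(temperatures)   # lower middle value
continuous_median = statistics.median(temperatures)     # mean of both middle values

print(discrete_median)              # 65.3
print(round(continuous_median, 2))  # 66.29
```

The long decimal tail in the SQL output is ordinary double precision rounding; the exact midpoint of 65.30 and 67.28 is 66.29.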

But since there are two cities, you might want to calculate the median temperature of each by adding a GROUP BY clause. Here is that query:

SELECT city, percentile_cont(0.5) WITHIN GROUP ( ORDER BY temperature) FROM city_data GROUP BY city;

The result is:

city	percentile_cont
Atlanta	63.14
Miami	68.18
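The grouping step can be mimicked in Python by bucketing the rows by city before taking each median. This is only a sketch of what GROUP BY does conceptually, using the (city, temperature) pairs from the city_data table; the variable names are illustrative.

```python
import statistics
from collections import defaultdict

# (city, temperature) pairs copied from the city_data table.
rows = [
    ("Miami", 68.36), ("Miami", 72.50), ("Miami", 65.30),
    ("Miami", 64.40), ("Miami", 68.18),
    ("Atlanta", 67.28), ("Atlanta", 68.72), ("Atlanta", 63.14),
    ("Atlanta", 62.60), ("Atlanta", 62.60),
]

# Collect temperatures per city, then compute one median per group.
by_city = defaultdict(list)
for city, temperature in rows:
    by_city[city].append(temperature)

medians = {city: statistics.median(temps) for city, temps in sorted(by_city.items())}
print(medians)  # {'Atlanta': 63.14, 'Miami': 68.18}
```

Each city has five readings, so the median is simply the middle value and no interpolation is needed.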

Calculating Multiple Percentiles

For this example, we are going to use a database table called conditions that contains these values:

time	device_id	temperature	humidity
2016-11-15 07:00:00	weather-pro-000001	32.4	49.8
2016-11-15 07:00:00	weather-pro-000002	39.800000000000004	50.2
2016-11-15 07:00:00	weather-pro-000003	36.800000000000004	49.8
2016-11-15 07:00:00	weather-pro-000004	71.8	50.1
2016-11-15 07:00:00	weather-pro-000005	71.8	49.9
2016-11-15 07:00:00	weather-pro-000006	37	49.8

Let’s say that we want to calculate various percentiles for the humidity for each device. Here is an example query:

SELECT device_id,
    percentile_cont(0.25) WITHIN GROUP (ORDER BY humidity) AS percentile_25,
    percentile_cont(0.50) WITHIN GROUP (ORDER BY humidity) AS percentile_50,
    percentile_cont(0.75) WITHIN GROUP (ORDER BY humidity) AS percentile_75,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY humidity) AS percentile_95
FROM conditions
GROUP BY device_id;

Here is a part of the result:

device_id	percentile_25	percentile_50	percentile_75	percentile_95
weather-pro-000000	49.29999999999999	50.500000000000036	53.10000000000007	54.9000000000001
weather-pro-000001	49.09999999999999	50.00000000000003	51.60000000000005	55.6
weather-pro-000002	52.500000000000036	53.60000000000005	54.00000000000006	54.500000000000064
weather-pro-000003	51.100000000000016	51.90000000000003	52.90000000000004	53.800000000000054
weather-pro-000004	48.60000000000001	49.20000000000002	49.60000000000002	50.400000000000034
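For a quick feel of several percentiles computed from one group, Python's statistics.quantiles() with method="inclusive" uses the same linear interpolation between sorted values as percentile_cont(). The sketch below is illustrative only: it treats the six humidity readings from the sample rows as a single group, whereas the query above groups by device over the full table, and it covers just the 25th, 50th, and 75th percentiles.

```python
import statistics

# Humidity readings from the six sample rows, treated as one group for illustration.
humidity = [49.8, 50.2, 49.8, 50.1, 49.9, 49.8]

# n=4 asks for the three quartile cut points: 25th, 50th, and 75th percentiles.
p25, p50, p75 = statistics.quantiles(humidity, n=4, method="inclusive")

print(round(p25, 2), round(p50, 2), round(p75, 2))  # 49.8 49.85 50.05
```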

Calculating a Series of Percentiles

For this example, we are going back to our original city_data dataset because this query can take a long time to run on a big dataset. We are going to use generate_series() to produce every whole percentage from 1% to 100% as a fraction (0.01 through 1.00) and then pass those values to percentile_cont(). Here is the query:

SELECT city, percentile, percentile_cont(percentile) WITHIN GROUP (ORDER BY temperature) FROM city_data, generate_series(0.01, 1, 0.01) AS percentile GROUP BY city, percentile;

The query generates 200 rows (100 percentiles for each of the two cities), so here is a selection of the results:

city	percentile	percentile_cont
Atlanta	0.25	62.6
Atlanta	0.26	62.6216
Atlanta	0.27	62.6432
Atlanta	0.28	62.6648
Atlanta	0.29	62.6864
Atlanta	0.30	62.708
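The selected Atlanta rows can be reproduced by hand with linear interpolation. The Python sketch below assumes that percentile_cont() interpolates linearly between neighboring sorted values; the function is an illustration written for this example, not PostgreSQL's code.

```python
import math

def percentile_cont(fraction, values):
    """Linearly interpolate the value at the given fraction of the sorted input."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # fractional 0-based position
    lower = math.floor(position)
    upper = math.ceil(position)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])

# Atlanta temperatures from the city_data table.
atlanta = [67.28, 68.72, 63.14, 62.60, 62.60]

# Fractions 0.25 through 0.30, matching the rows shown in the selection.
series = {step / 100: round(percentile_cont(step / 100, atlanta), 4)
          for step in range(25, 31)}
for fraction, value in series.items():
    print(f"{fraction:.2f}", value)
```

With five values, the fraction 0.25 lands exactly on the second sorted value (62.60), and each additional 0.01 moves 4% of the way toward the third value (63.14), which adds 0.0216 per step.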

Why Use the Timescale approx_percentile() Function Instead of PostgreSQL Percentile Functions?

Calculating the percentile over large datasets, like time-series data in a Timescale database, can involve a lot of expensive calculations. It can increase the memory footprint of the database, result in higher network costs, and make streaming data unfeasible. The aggregates are also not partializable or parallelizable.

Many times you don’t need this level of accuracy, and approximate percentile calculations will be close enough. This is why Timescale introduced the approx_percentile() hyperfunction. The approx_percentile() function relies on the UDDSketch algorithm, which uses a modified histogram to approximate the shape of a distribution. This allows for calculating a “good enough” percentile without having to keep every value or sort the data before returning the result.
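To give a feel for how a histogram can stand in for the full sorted data, here is a simplified Python sketch that counts values into logarithmically spaced buckets and answers percentile queries from the counts alone. It is not UDDSketch and not Timescale's implementation; the function names, the alpha parameter, and the restriction to positive values are assumptions made for this illustration.

```python
import math
from collections import Counter

def build_sketch(values, alpha=0.001):
    """Count positive values into log-spaced buckets (relative error about alpha)."""
    gamma = (1 + alpha) / (1 - alpha)
    buckets = Counter()
    for value in values:
        if value <= 0:
            raise ValueError("this simplified sketch only handles positive values")
        # Bucket i covers the range (gamma**(i - 1), gamma**i].
        buckets[math.ceil(math.log(value) / math.log(gamma))] += 1
    return gamma, buckets

def approx_percentile(fraction, sketch):
    """Estimate a percentile from bucket counts, without the original values."""
    gamma, buckets = sketch
    total = sum(buckets.values())
    if total == 0:
        return None
    target = fraction * (total - 1)  # 0-based rank of the requested percentile
    seen = 0
    for index in sorted(buckets):
        seen += buckets[index]
        if seen > target:
            # Representative value for the bucket that contains the target rank.
            return 2 * gamma ** index / (gamma + 1)

sketch = build_sketch(range(1, 1001))
estimate = approx_percentile(0.5, sketch)
print(abs(estimate - 500) / 500 < 0.01)  # True: within 1% of the exact value
```

Only the bucket counts are stored, so memory depends on the range of the data rather than the number of rows, and two such sketches over the same buckets could be combined by adding their counts.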

approx_percentile() syntax:

approx_percentile( percentile DOUBLE PRECISION, sketch uddsketch ) RETURNS DOUBLE PRECISION

The second parameter is the sketch to query, which is usually the result of a percentile_agg() call. Here is an example query:

SELECT approx_percentile(0.01, percentile_agg(data)) FROM generate_series(0, 100) data;

Result:

approx_percentile

0.999

Next Steps

To learn more about how to use percentile_cont() and percentile_disc() in PostgreSQL, you can see the PostgreSQL documentation.

To find out more about Timescale’s approx_percentile() function, see the Timescale documentation.


PostgreSQL Percentile Functions FAQ

Q: What is the difference between percentile_cont() and percentile_disc() functions in PostgreSQL?

A: The percentile_cont() function returns a value linearly interpolated between adjacent input values, so the result may not exist in the original dataset. In contrast, percentile_disc() returns an actual value from the dataset: the first one in the sort order whose position equals or exceeds the requested fraction. Both functions are used with WITHIN GROUP to specify the ordering of values.

Q: How do I calculate the median (50th percentile) of a dataset in PostgreSQL?

A: You can calculate the median using either percentile function with 0.5 as the parameter. For example, SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY temperature) FROM city_data; will return the interpolated median value, while percentile_disc(0.5) would return an actual value from the dataset.

Q: Can I calculate multiple percentiles in a single query?

A: Yes, you can calculate multiple percentiles in one query by using multiple percentile function calls. For example: SELECT device_id, percentile_cont(0.25) WITHIN GROUP (ORDER BY humidity) AS p25, percentile_cont(0.5) WITHIN GROUP (ORDER BY humidity) AS p50, percentile_cont(0.75) WITHIN GROUP (ORDER BY humidity) AS p75 FROM conditions GROUP BY device_id; will return the 25th, 50th, and 75th percentiles for each device.

Q: How can I calculate percentiles by group in PostgreSQL?

A: You can calculate percentiles for each group by adding a GROUP BY clause to your query. For example, SELECT city, percentile_cont(0.5) WITHIN GROUP (ORDER BY temperature) FROM city_data GROUP BY city; will return the median temperature for each city in the dataset.

Q: What is the approx_percentile() function in Timescale, and when should I use it?

A: The approx_percentile() function is a Timescale hyperfunction that approximates percentiles using the UDDSketch algorithm, which is much more efficient for large datasets. You should use it when exact precision isn't required, but performance is important, especially with time-series data, as it requires less memory and computational resources than PostgreSQL's native percentile functions.