Best Practices for Query Performance on PostgreSQL
In a previous blog post, we shared for the first time how we are building a distributed time-series database on PostgreSQL by relying on chunking instead of sharding. To that end, we also introduced distributed hypertables and showed how they could scale almost linearly for inserts up to a nine-node cluster (one access node and eight data nodes), with an insert rate of well over 12 million metrics a second.
In this post, we take a look at how we optimize queries on distributed hypertables in our multi-node architecture and share some query benchmarks. We've seen significant performance benefits with distributed hypertables over regular hypertables for both full and partial aggregations.
But first, let's look at the main causes of performance degradation on PostgreSQL.
What Leads to Performance Problems?
Performance issues can significantly impact the efficiency and functionality of a PostgreSQL database. Several factors can contribute to these problems. Here are some of the main causes:
Inefficient Queries: One of the most common causes of performance problems is inefficient queries. Poorly optimized or complex queries can create bottlenecks and slow down database operations. Regularly reviewing and optimizing queries can help mitigate this issue.
Insufficient indexes: A lack of appropriate indexes can lead to slower query execution as the database must scan entire tables to find relevant data. Implementing indexes on frequently queried columns can significantly improve query performance.
Improper PostgreSQL data types: Using inappropriate data types can also impact performance. For example, using a text data type when an integer would suffice leads to unnecessary storage usage, potential casting overhead, and slower query speeds.
Changing data volume: The query planner creates execution plans (how to run a query) based on statistics of source data. Increasing or decreasing the data volume can influence these stats and yield different execution plans, resulting in slow queries that were once fine. Analyzing the source table can help optimize the query for a different volume.
High volume of transactions: A high volume of transactions can strain system resources and lead to performance degradation. Implementing efficient transaction handling strategies can help manage this load.
Insufficient hardware resources: Hardware limitations, such as CPU, memory, or storage capacity, can hinder database performance. Upgrading hardware resources or optimizing resource allocation can often resolve these issues.
Lock contention: Using locks on tables or rows can slow down a system with parallel queries, basically synchronizing access to the data and creating waiting queues for requests. Locks should be used very carefully and only as a last result.
Lack of maintenance: Routine maintenance tasks like vacuuming, reindexing, and updating statistics are crucial for maintaining optimal performance. Neglecting these tasks can lead to performance degradation over time.
By addressing these common causes of performance issues, you can optimize your PostgreSQL database for better performance and responsiveness.
Let's now jump straight into our benchmarks.
Query Performance Benchmarks
To show the query performance of distributed hypertables, we use the Time-Series Benchmark Suite (TSBS) with the IT monitoring use case (DevOps). All of the benchmark results are produced on nodes that ran m5.2xlarge AWS instances.
For the IT monitoring use case, we have a distributed hypertable with timestamped CPU metrics that cover multiple hosts being monitored. The distributed hypertable is partitioned along two dimensions: time and hostname, where the hostname determines which data node a chunk is placed on.
In other words, each data node stores data for the same time intervals but covers a specific subset of hosts, as shown below.
Let’s first look at the query performance of computing the average hourly CPU usage per host.
With eight data nodes, queries complete almost 8x faster than with one data node on a distributed hypertable. We include one data node as a point of reference to show the slight overhead of distributed hypertables due to the extra network communication and processing. Even with that slight overhead, we see that queries complete about 7x faster on a distributed hypertable with eight data nodes than on a regular (one node) hypertable.
The reason this query performs well is because it can push most of the work down to the data nodes. Since the query is grouping based on hostname, each data node can compute a full aggregate independently and concurrently (more on that later).
But what if we query for something less ideal for this setup? For instance, let’s look at finding the max CPU usage per hour across a set of hosts. Such a query groups only on time, and data nodes can therefore only compute partial aggregates. To get the final result, the access node has to combine and finalize the partial results from the data nodes.
Even in this case, eight data nodes are over 3x faster than a single data node. More notably, the overhead between hypertables and distributed hypertables is much smaller here than in the prior benchmark, bringing the two roughly on par. This is because the number of rows returned is much smaller for this query than the previous one, where we also grouped by hostname.
How We Optimize Queries on Distributed Hypertables (With an Example)
The key to unlocking the full query performance of distributed hypertables boils down to three main tactics:
- Limiting the amount of work
- Optimally distributing and pushing down work to data nodes
- Keeping data nodes busy
We’ll now discuss how Timescale incorporates each of these tactics.
Limiting the amount of work
One of the first things that the access node does with a query is to determine which chunks are involved in the query and, by extension, which data nodes to talk to. For instance, let’s say we are executing the following query, which is similar to the one in the first benchmark test:
SELECT time_bucket(time, ‘1 hour’) as hour, hostname, avg(cpu) FROM measurements WHERE hostname IN (‘host001’, ‘host002’) AND time > NOW() - interval ‘24 hours’ GROUP BY hour, hostname;
This query tells us that we only want to work with the set of chunks whose time intervals overlap with the past 24 hours and include, e.g., hostnames
Even though the access node doesn’t store any data locally, it has global knowledge of all chunks across the distributed hypertable, allowing it to perform chunk exclusion. Each chunk in the resulting set has a list of data nodes that store a copy of it, allowing the access node to turn the set of chunks into a set of data nodes to be queried, as shown in the figure below.
Note that chunks may have replica copies on multiple data nodes for fault-tolerance purposes; in that case, the access node has a choice of which data nodes to query for optimal performance.
Different strategies for assigning chunks are possible, but the default strategy is always to use the designated “default” data node for a chunk.
At the end of this planning stage, the access node has a list of data nodes to query, each with a disjoint subset of chunks. From this, it synthesizes a SQL query to send to each data node based on the original query:
SELECT time_bucket(time, ‘1 hour’) as hour, hostname, avg(cpu) FROM measurements WHERE _timescaledb_internal.chunks_in(measurement, ARRAY[1, 2]) hostname IN (‘host001’, ‘host002’) AND time > NOW() - interval ‘24 hours’ GROUP BY hour, hostname;
Note that the access node explicitly tells the data node which chunks to query via the
chunks_in function in the
WHERE clause. This function serves two purposes: first, it obviates the need for running chunk exclusion again on the data node, and second, it avoids returning duplicate data from replica chunks that also exist on other data nodes.
Thus, at the end of this planning stage, the access node knows exactly which data nodes to talk to and which chunks to query on those nodes.
Pushing Down Work to Data Nodes
When generating the SQL query statement to send to a particular data node, the access node needs to decide which parts of the original query it can safely push down, i.e., execute on the data nodes, and which parts need to execute locally.
In the worst case (i.e., no push down), the access node has to fetch the raw data from each data node and process it locally, as shown below:
Clearly, fetching the raw data is to be avoided if possible, as it involves transferring a lot of data and puts a heavy processing burden on the access node. Instead, push-downs have the potential to improve the situation tremendously by (1) moving processing to the data nodes and (2) reducing the amount of transferred data since the output is often smaller than the input, especially in the case of aggregations and limits.
Typical things the planner can consider to push down include:
- Functions (and general expressions)
- Sorting (ORDER BY)
- Aggregates and GROUP BYs
While a number of things determine the ability to push down various parts of the query, we will focus on the ability to push down the computation of GROUP BY aggregates (e.g., calculating the average CPU usage per host and hour).
Going back to our example query, the first thing that determines the level of aggregate push-down is whether the GROUP BY clause covers all of the partitioning dimensions. Because the GROUP BY clause includes both time and hostname (both of which are partitioning dimensions), we know that it is safe to push down the aggregation fully.
This is because no data for a time-hostname group on a data node can exist on any other data node. Thus, with full aggregate push-down, the query conceptually looks as follows:
Note, however, that we aren’t actually grouping on time and hostname; we are grouping on hour (time bucket) and hostname. Fortunately, since we use hostname as our primary dimension along which we assign chunks to data nodes, the planner can think of this as a single-dimensional table partitioned only on hostname.
This allows it to do time bucketing independently on each data node safely. There’s a caveat here, though: repartitioning. If the table has been repartitioned, data might not align along data node boundaries as we require, so full aggregation might not be possible (more on that below).
In summary, full aggregation is possible if any of the following cases hold:
- The grouping clause includes all partitioning keys
- The grouping clause includes only the “space” partitioning key, and the time restriction includes no repartitioning event
- The query is restricted to only one data node
Fortunately, the TimescaleDB planner is smart enough to detect and handle each case for optimal query performance.
Partial aggregation is necessary when the data for a computed group is not located on a single data node [for example, when we group only by hour (time)]. In this case, each data node computes a partial aggregate, and the access node then finalizes the result for each hour bucket.
Conceptually, this looks as follows:
While PostgreSQL fully supports partial aggregations on a local node (e.g., for parallel query execution), there is unfortunately no general way to express a partial aggregation in standard SQL, which is necessary to tell a data node that it should compute a partial aggregate.
In the case of
avg(), this partial aggregate state would consist of a sum and a count that can be used to produce the final average. But, obviously, this state is different depending on the aggregate function. Fortunately, TimescaleDB has a function for computing the partial aggregate state that is also used by continuous aggregates.
The SQL sent to a data node would then look something like this:
SELECT time_bucket(time, ‘1 hour’) as hour, _timescaledb_internal.partialize_agg(avg(cpu)) FROM measurements WHERE _timescaledb_internal.chunks_in(measurement, ARRAY[1, 2]) hostname IN (‘host001’, ‘host002’) AND time > NOW() - interval ‘24 hours’ GROUP BY hour;
Note the addition of the
partialize_agg function around the aggregate. This function tells the data node to compute and return the partial state for the
avg() aggregate so that the access node can compute the final aggregate from all the data nodes' partial results.
How repartitioning affects push-down
It is easy to expand the capacity of a distributed hypertable by adding additional data nodes. But to use the extra data nodes, existing distributed hypertables might require repartitioning to, e.g., increase the number of space partitions.
However, since the new partitioning configuration only affects new chunks, the planner has to be careful to ensure that queries on data nodes still produce the correct results. To illustrate this situation, let’s look at what happens when we add a new data node to expand the capacity of a distributed hypertable:
The figure shows that, during the third time interval, an extra data node was added so that the fourth time interval now includes four chunks instead of three. Now, imagine that the highlighted area shows the chunks covered by a query’s time and hostname restrictions.
We find that this includes overlapping chunks from two distinct partitioning configurations, i.e., data for a particular hostname might exist on more than one data node. This will prohibit full aggregations on data nodes, as it would otherwise produce the wrong result.
Fortunately, TimescaleDB's planner can dynamically detect overlapping chunks and revert to the appropriate partial aggregation plan when necessary. Users can, therefore, freely add data nodes and repartition their data to achieve elasticity without worrying about the correctness of query results.
While this leads to slightly worse query plans in some cases, these repartitioning events are rare and often quickly move out of the query or retention window. There’s also the possibility of rewriting old chunks in the new partitioning scheme, although no automation for such repartitioning of old data currently exists.
Keeping data nodes busy
To minimize latency and maximize resource utilization, it is crucial that the access node keeps feeding data nodes with work. Unfortunately, the default behavior of PostgreSQL is to execute so-called Append plans serially, i.e., the access node starts with getting all tuples from the first data node, then moves on to the second, and so forth.
This is obviously bad for performance and should be avoided. PostgreSQL 11 introduced parallel append plans, but they require launching separate worker processes on the access node, which is a lot of overhead when most of the work anyhow happens on data nodes. Further, parallel workers introduce other challenges related to read consistency and coordination of two-phase commit across multiple connections to the same data node.
Instead of using parallel append, TimescaleDB introduces asynchronous append execution that allows the access node to asynchronously initiate the query work on each data node while fetching the results as they become ready.
Basically, like the event-driven paradigm, the idea is to eliminate as much idle time as possible across both the access node and all data nodes. This provides great performance improvements for push-down aggregates since most of the work happens simultaneously on data nodes.
Best Practices for Query Performance
Here are some general practices that can further improve your query performance:
- Understand common performance bottlenecks: know the usual suspects of performance issues, such as inefficient queries, missing indexes, or hardware limitations.
- Utilize chunk exclusion: use TimescaleDB's chunk exclusion to optimize the selection of relevant data chunks for faster query processing.
- Implement full aggregate push-down: maximize the push-down of aggregation to data nodes when the
GROUP BYclause covers all partitioning dimensions.
- Employ partial aggregation for complex queries: use partial aggregations when data for a computed group is spread across multiple data nodes.
- Monitor repartition effects on push-down: be aware of how repartitioning, especially with added data nodes, affects the optimization of push-down operations.
- Asynchronously execute appends: adopt asynchronous append execution to minimize idle time and initiate work on each data node concurrently.
- Continuously update and educate: regularly update your knowledge, tools, and PostgreSQL version to benefit from performance improvements and stay informed on best practices.
If you want to learn more about PostgreSQL performance issues, including how to identify and solve them, plus how to optimize your query performance, read the following resources:
- Using pg_stat_statements to Optimize Queries
- PostgreSQL + TimescaleDB: 1,000x Faster Queries, 90 % Data Compression, and Much More
- 13 Tips to Improve PostgreSQL Insert Performance
To try distributed hypertables, you can self-host TimescaleDB or sign up for a free 30-day trial of Timescale, our cloud solution, which includes our core database but supercharges it with features designed for higher performance, faster queries, and less spending.