When to Consider Postgres Partitioning
PostgreSQL is now the most popular relational database on the planet. Its rich feature set and open-source nature make it a popular choice for businesses of all sizes. But as your data grows, you may encounter certain challenges, especially when dealing with large tables. Struggling with query performance, maintenance overhead, or managing data retention, anyone? 🖐️
One popular strategy to tackle these issues is table partitioning. In this article, we'll explore when and why you should consider partitioning your large PostgreSQL tables, sharing some best practices and common pitfalls in case you decide to move forward with partitioning.
What Is Table Partitioning?
Table partitioning is a database design technique that divides a large table into smaller, more manageable pieces known as partitions. Each partition is a sub-table that holds a subset of the data, with each row existing in exactly one partition. Partitions can have their own indexes, but any unique indexes (including primary keys) must include the column that was used to partition the table.
This technique can offer significant advantages when managing large tables. By breaking down a large table into smaller partitions, you can benefit from improved query performance, optimized index sizes, and more efficient data maintenance operations, keeping your large-scale databases more agile and responsive overall. But partitioning is not a one-size-fits-all solution: sometimes, partitioning won't benefit you. It could even harm your performance, as we’ll see later on.
In terms of the types of partitioning you could implement, PostgreSQL supports three partitioning strategies:
- Range partitioning: this strategy is ideal for time-series data or incrementing sequences (such as a BIGINT primary key), where you partition data based on a range of values (e.g., by day, or by blocks of keys).
- List partitioning: you would use list partitioning when you want to partition data based on specific values in a column (e.g., country or department).
- Hash partitioning: hash partitioning is suitable when there is no clear partitioning key, as it distributes data evenly across partitions based on a hash function.
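To make this concrete, here's a minimal sketch of how each strategy is declared; all table, column, and partition names are illustrative:

```sql
-- Range partitioning: one partition per day of ingest time
CREATE TABLE metrics (
    ingested_at timestamptz NOT NULL,
    sensor_id   bigint      NOT NULL,
    value       double precision
) PARTITION BY RANGE (ingested_at);

CREATE TABLE metrics_2024_01_01 PARTITION OF metrics
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');

-- List partitioning: one partition per country code
CREATE TABLE customers (
    country text NOT NULL,
    name    text
) PARTITION BY LIST (country);

CREATE TABLE customers_us PARTITION OF customers
    FOR VALUES IN ('US');

-- Hash partitioning: spread rows evenly across four partitions
CREATE TABLE events (
    id bigint NOT NULL
) PARTITION BY HASH (id);

CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
```

Note that the parent table holds no data itself; every row is routed to one of its partitions at insert time.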
One important characteristic of partitioning in PostgreSQL (independent of which strategy you follow) is that it does not support “global indexes”: there is no way to create an index across partitions. Instead, when querying across partitions, PostgreSQL will evaluate the WHERE clause, and if there are constraints on the partition column, it will use those to exclude partitions that don’t hold data relevant to the query.
Consider the following example: Imagine we took a table storing sensor metrics and partitioned it by the column representing the ingest time. The table could now be broken down into multiple partitions, each of which could store the metrics ingested in a single day.
Once the table is partitioned by day, if we query the table for a range of days, PostgreSQL will exclude the partitions outside that range—meaning that only the relevant data is scanned. Compare this to querying the original table, where we would have to scan all the data (or one large index). One can clearly see how partitioning could help you reduce CPU usage, disk activity, and query time.
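You can see this pruning in action with EXPLAIN. As a sketch (the table and partition names here are hypothetical, assuming a metrics table partitioned by day on an ingested_at column):

```sql
-- Only the partitions overlapping the requested range appear in the plan
EXPLAIN SELECT avg(value)
FROM metrics
WHERE ingested_at >= '2024-01-03'
  AND ingested_at <  '2024-01-05';

-- The plan would scan only metrics_2024_01_03 and metrics_2024_01_04;
-- every other daily partition is pruned at planning time.
```

If the plan lists every partition instead, that's a sign your WHERE clause doesn't constrain the partition column.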
It’s probably clear to you at this point that if you were to implement partitioning in a production setup, you'd need automation in place to create and maintain partitions, especially if the partitioned table expects to receive data continuously. There are multiple ways to do this, the most common being the following:
- You can write SQL procedures/functions to check if the necessary partitions for upcoming days/weeks/months exist, using cron (on Unix-like systems) or Task Scheduler (on Windows) to run the script regularly.
- You can use a PostgreSQL extension dedicated to this, the most popular being pg_partman (which will probably need to be paired with a scheduler).
- You can use Timescale, whose hypertables have built-in automatic partitioning (reducing the creation and maintenance of partitions to a single command).
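As a minimal sketch of the first approach, here is a hypothetical function that pre-creates tomorrow's partition, assuming a metrics table partitioned by day on an ingested_at column (all names are illustrative):

```sql
-- Creates tomorrow's daily partition if it doesn't exist yet
CREATE OR REPLACE FUNCTION create_tomorrow_partition() RETURNS void AS $$
DECLARE
    d date := current_date + 1;
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS metrics_%s PARTITION OF metrics
         FOR VALUES FROM (%L) TO (%L)',
        to_char(d, 'YYYY_MM_DD'), d, d + 1
    );
END;
$$ LANGUAGE plpgsql;

-- Run it daily, e.g. from cron:
--   psql -d mydb -c "SELECT create_tomorrow_partition();"
```

In production you'd likely create partitions several days ahead so a missed cron run doesn't block ingestion.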
Determining which methodology to follow for table partitioning deserves its own article—but this is something you can think about later. For now, let’s help you figure out if partitioning would be beneficial for you.
When Should You Consider Partitioning?
As we said above, partitioning can be very powerful, but it is certainly not a good idea for every single use case. Contrary to what people often think, the decision to partition a PostgreSQL table is not strictly based on an absolute table size but rather on various factors that interact with the table's size. It's essential to evaluate your database's characteristics and requirements before implementing partitioning.
Generally speaking, you should start thinking about partitioning if you identify with one (or more) of the following:
- You have large tables. As we said above, the size of your tables is not the only thing that determines if you may see benefits from partitioning; that said, if you have large tables (anywhere from tens of millions to billions of rows), you would probably benefit from partitioning.
- Your ingestion rate is very high. Even if the current table size isn't massive, a high data ingestion rate can indicate that the table will grow significantly in the near future. It might be beneficial to implement a partitioning strategy preemptively, managing this growth before it starts affecting your performance and maintenance operations.
- You’re beginning to notice query performance degradation. Partitioning may also be beneficial if your queries are starting to slow down, especially those that should only touch a subset of your data. This could be true even when your tables are smaller due to the complexity of the data and queries. For example, partitioning can significantly enhance query performance when your daily queries include searches based on a specific range or criteria. Let's say you're dealing with time-series data: partitioning by date can help you quickly retrieve records within a particular time frame without scanning the entire table.
- You’re dealing with maintenance overhead. As a table grows, maintenance operations like VACUUM, ANALYZE, and indexing can take longer and might start impacting your operational efficiency. Partitioning can simplify these operations because you can maintain smaller partitions independently, reducing the impact on your database's overall performance.
- You’re managing data retention policies. If your dataset has built-in obsolescence, where older data is periodically purged, partitioning can make these operations much more efficient. Dropping an old partition is much faster and less resource-intensive than deleting rows.
- You want to use less memory. If you want to operate with limited memory, you might benefit from partitioning, as smaller indexes and data chunks fit better in memory and improve cache hit rates. In most cases, this will also improve performance.
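To illustrate the retention point: with a hypothetical metrics table partitioned by day, enforcing a 90-day policy becomes a quick metadata operation instead of a row-by-row delete (partition names are illustrative):

```sql
-- Without partitioning: touches every matching row, generates dead
-- tuples, and leaves cleanup work for VACUUM afterwards.
DELETE FROM metrics WHERE ingested_at < now() - interval '90 days';

-- With partitioning: dropping a whole partition is nearly instant.
DROP TABLE metrics_2024_01_01;

-- Or detach it first if you want to archive it before dropping:
ALTER TABLE metrics DETACH PARTITION metrics_2024_01_01;
```

Detaching keeps the data around as a standalone table, which is handy for archiving to cheaper storage before deletion.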
When Not to Use Partitioning
On the contrary, there are situations where introducing partitioning may be counterproductive, and you may want to look at other optimizations instead, such as tuning your indexes or queries:
- Your tables are small, and you’re not ingesting at high rates. If your tables are not large and are unlikely to grow significantly, partitioning would add unnecessary complexity without delivering noticeable benefits.
- You’re just starting with PostgreSQL. Implementing and maintaining partitioned tables introduces a certain level of operational and architectural complexity: managing such tables would become more challenging, so make sure you feel ready to go there—or use a hosted service like Timescale to manage this for you.
- You have uniform data access patterns. If your queries usually access data uniformly across the table rather than focusing on specific subsets (like recent data, ranges of data, specific categories, and so on), then your query performance could actually become worse.
- Frequent full scans. If most of your queries don’t use your partition key in the WHERE clause, you’ll scan every partition every time. This will be slow and will get slower the more partitions you have.
Best Practices for Data Partitioning
If, after reading this article, you’ve decided to give partitioning a go, make sure to follow this advice—it will help you set up things right from the start:
- Choose the right partition size. This is one of the first design questions that will come to your mind when implementing partitioning: what’s your ideal partition size? The answer is it depends—you should aim for a balance between too large and too small. While PostgreSQL can handle a high number of partitions, having too many will increase planning time and could negatively affect query time. At the same time, if your partitions are too large, then you won’t be able to use ranges to exclude data, and the effectiveness of partition pruning will be minimized.
- Keep your partition size consistent. Aim to keep partitions relatively uniform in size, ensuring that maintenance tasks and query performance are consistent across partitions.
- Choose the right partitioning key. Opt for a key that aligns with your query patterns. For instance, if most of your queries filter by date, a timestamp or date column would be an ideal partitioning key.
- Create partitions in advance. Ensure that partitions for upcoming periods (e.g., future days or months) are created ahead of time so there's no interruption in data ingestion. While you can use a default partition to catch orphaned records, in practice, this introduces a maintenance burden and does not perform well.
- Take advantage of data retention policies to maintain old partitions. For example, if you're partitioning by time and data has a limited useful life, schedule regular tasks to drop or archive old partitions.
- Optimize your queries. If you’re especially interested in optimizing query performance, make sure to analyze and understand the query execution plan to validate that only necessary partitions are being scanned.
- Properly place partitions across different storage mediums. If you're using tablespaces to place partitions on different storage backends (e.g., EBS or S3), ensure that frequently accessed partitions are on faster storage, while older or less accessed partitions can sit on slower, cheaper storage. Timescale makes this very easy via data tiering.
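For instance, partitions can be created on, or later moved between, tablespaces; the tablespace and partition names below are hypothetical:

```sql
-- Create a new partition directly on fast storage
CREATE TABLE metrics_2024_06_01 PARTITION OF metrics
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02')
    TABLESPACE fast_ssd;

-- Later, move it to cheaper storage once it's rarely queried
ALTER TABLE metrics_2024_06_01 SET TABLESPACE cold_hdd;
```

Note that SET TABLESPACE physically rewrites the partition, so schedule such moves during quiet periods.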
Pitfalls to Avoid
Similarly, make sure to avoid these common mistakes:
- Over-partitioning. It's tempting to create a high number of small partitions, but as we mentioned earlier, this won’t work well—you’ll get into query planning and management challenges.
- Inefficient indexing. Avoid creating unnecessary indexes on your partitions. Only index the columns that are frequently filtered or joined on.
- Unoptimized query patterns. Queries spanning multiple partitions or not using the partition key in the WHERE clause might suffer in performance. Ensure that the majority of your queries are optimized for the partitioning scheme.
- Running out of partitions. If you insert data with no partition to live in, it will either be rejected or stored in a DEFAULT partition. Ensure you either pre-create partitions at a quiet time (as creating them will briefly lock the table) or use an extension that creates new partitions on the fly.
- Ignoring disk usage (partitions need extra space). If you're creating many partitions, especially on different tablespaces or disks, monitor disk usage to avoid out-of-space issues.
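To illustrate the running-out-of-partitions pitfall (table and partition names are illustrative, assuming a metrics table partitioned by day):

```sql
-- With no matching partition, the insert fails loudly:
--   ERROR:  no partition of relation "metrics" found for row
INSERT INTO metrics VALUES ('2030-01-01', 42, 1.0);

-- A DEFAULT partition catches such rows instead...
CREATE TABLE metrics_default PARTITION OF metrics DEFAULT;

-- ...but creating future partitions then requires checking the default
-- partition for conflicting rows, which adds locking and maintenance work.
```

A loud failure is often preferable: it tells you immediately that your partition-creation automation fell behind.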
Partitioning PostgreSQL tables can be a great ally in your production database, especially if your tables are starting to get big, your ingestion is growing, or your query performance is starting to degrade. Still, partitioning is not a one-size-fits-all solution: consider if it’s the right fit for your use case before deciding to implement it.
If you’ve decided to partition your large PostgreSQL tables by time, check out Timescale, which completely automates the creation and maintenance of partitions: you will interact with your tables as if they were regular PostgreSQL tables, while Timescale handles partitioning under the hood. You will get the extra performance—but not the extra work.