How to scale PostgreSQL 10 using table inheritance and declarative partitioning
PostgreSQL 10 is now out in the wild (edit: 10.1 was released right after we published). This is exciting for many reasons. PG10 introduces quite a few substantial new features: logical replication, full text search support for JSON / JSONB, cross-column statistics, and better support for parallel queries. Overall, PG10 is a significant milestone in the steady development of a venerable open-source database.
One of the more exciting new features in PG10 is declarative partitioning. This essentially makes it easier to scale PostgreSQL to “big data” volumes.
Making PostgreSQL scalable is critical to its future success. Developers, over the years, have complained that PostgreSQL is too difficult to scale. Consequently popular opinion shifted towards “NoSQL” solutions that promised scalability, and they were adopted enthusiastically. Yet these NoSQL options often sacrificed reliability, performance, and ease of use (which is one reason why developers lately have been returning to SQL).
With declarative partitioning, PG10 lays the groundwork to compete with these other technologies. One can think of declarative partitioning as the first step for making all that PostgreSQL has to offer — rock-solid reliability, powerful complex queries, a vast ecosystem of tools and connectors — finally available for more big data applications. And as the developers of a new time-series database built on top of PostgreSQL, we are hugely excited by the direction that Postgres is pursuing.
To be clear, partitioning in general is not new in PostgreSQL 10. For years, many have utilized table inheritance for scaling PostgreSQL tables. Declarative partitioning is just an additional, intriguing option.
This post examines these two options for partitioning in PostgreSQL.
In particular, this post covers:
- Why PostgreSQL doesn’t naturally scale well (why we need to partition our data in the first place)
- How partitioning via table inheritance works (the primary option pre- PostgreSQL 10)
- How declarative partitioning in PostgreSQL 10 works
- Limitations of the new declarative partitioning
Our experience has been that developers come to PostgreSQL with a wide variety of expertise, so this post starts with some fundamentals and then delves deeper into the details. Of course, depending on your own level of expertise, feel free to skip ahead to the first section relevant to you.
(Note: Declarative partitioning is only the first step in a new direction for PostgreSQL partitioning. Future improvements will build upon this foundation. This post, however, will only consider the PG10 state-of-the-art, as opposed to future plans.)
Why doesn’t PostgreSQL naturally scale well?
It comes down to the performance gap between RAM and disk.
In most relational databases, a table is stored as an unordered collection of fixed-size pages of data (e.g., 8KB pages in PostgreSQL), on top of which the system builds data structures (e.g., heaps) to index the data. With an index, a query can quickly find a row with a specified ID (e.g., bank account number) without scanning or “walking” the entire table in some sorted order.
But if the data is sufficiently large that we can’t fit all (similarly fixed-size) pages of our index in memory, then updating a random part of the tree can involve significant disk I/O as we read pages from disk into memory, modify in memory, and then write back out to disk (when evicted to make room for other pages). And a relational database like PostgreSQL keeps an index (or other data structure, such as a B-tree) for each table index, in order for values in that index to be found efficiently. So, the problem compounds as you index more columns.
The cost of swapping in and out of memory can be seen in the following performance graph from PostgreSQL, where insert throughput plunges with table size and increases in variance (depending on whether requests hit in memory or require (potentially multiple) fetches from disk).
Insert throughput as a function of table size for PostgreSQL 9.6.2, running with 10 workers on an Azure standard DS4 v2 (8 core) machine with SSD-based (premium LRS) storage. Clients insert individual rows into the database (each of which has 12 columns: a timestamp, an indexed randomly-chosen primary id, and 10 additional numerical metrics). The PostgreSQL rate starts over 15K inserts/second, but then begins to drop significantly after 50M rows and experiences very high variance (including periods of only 100s of inserts/sec).
(A benchmark showing vanilla PostgreSQL’s insert performance out to 1 billion rows, with 10K row batch inserts can be seen here.)
The solution: Partitioning
Table partitioning helps to avoid this expensive swap-to-disk problem. By splitting a large table into smaller partitions, the resulting sub-tables’ indexes (if done correctly) will be kept in memory during insert.
This results in several benefits, including:
- Improved inserts and queries from being able to fit all indexes in memory.
- Improved queries from being able to use constraint exclusion.
- Having a smaller dataset (per partition) may enable the database to use sequential scans efficiently.
- Maintenance operations like
REINDEXrun faster because of smaller indexes.
- Deleting entire partitions is much faster, as it only requires a
DROP TABLEand avoids expensive vacuuming operations.
The exact point at which partitioning should be considered instead of a large table depends on the workload and machine resources. For example, the graph above shows PostgreSQL’s performance degrading after 50M rows for the specified workload and machine specifications. Adding more memory is kicking the proverbial can down the road. In general, if you are writing a lot of data to a PostgreSQL table, at some point you’ll need partitioning.
Partitioning via table inheritance
Before PostgreSQL 10, table partitioning was only possible via table inheritance. Even with PG10, table inheritance is still a viable option for partitioning in situations where declarative partitioning falls short.
To partition using table inheritance, users first create a parent table, and then create a child table for each partition, using the
INHERITS clause of the
CREATE TABLE statement (which creates an inheritance link). When querying a parent table, data from both the parent and child tables is returned.
More specifically, to implement table inheritance users need to :
- Create the “parent” table, from which all of the partitions will inherit.
- Create several “child” tables (each representing a partition of the data) that each inherit from the parent.
- Add constraints to the partition tables to define the row values in each partition.
- Create indexes on any parent and child tables individually. (Indexes do not propagate from the parent tables to child tables).
- Write a suitable trigger function to the master table so that inserts into the parent table redirect into the appropriate partition table.
- Create a trigger that calls the trigger function.
- Remember to redefine the trigger function when the set of child tables changes.
Table inheritance for a long time has given developers a viable way to partition unwieldy tables in PostgreSQL. It also provides useful flexibility: for example, child tables are permitted to have extra columns not present in the parent table and data can be divided in a manner of the user’s choosing.
But, as the above steps indicate, table inheritance requires a non-trivial amount of manual work. And while scripts can be used to automate some steps, those would still require upkeep and increase engineering overhead.
Table inheritance also comes with some limitations:
- Data consistency depends on the
CHECKconstraints specified on each child table.
COPYcommands do not automatically propagate data to other child tables in the inheritance hierarchy, but instead rely on triggers, resulting in slower inserts.
- Substantial manual work is required to create and maintain child sub-tables.
- Given 3, re-partitioning is non-trivial (e.g., when adding disks, making schema changes).
- Ensuring uniqueness across partitioned tables is not supported.
- Indexes, constraints, and many table maintenance commands need to be applied to child tables explicitly. This greatly complicates management tasks.
To overcome these limitations and the manual work required, declarative partitioning was introduced in PostgreSQL 10.
PostgreSQL 10 declarative partitioning solves issues 1 and 2 above. It also simplifies issue 3, but significant manual work and limitations still remain.
Declarative partitioning in PostgreSQL 10
Declarative table partitioning reduces the amount of work required to partition data in PostgreSQL. It achieves this by providing an improved syntax to create partitioned tables and their child table partitions.
Currently, declarative partitioning supports
RANGE— partition the table into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, device_id.
LIST— partition the table by explicitly listing which key values appear in each partition, and are limited to a single column. For example, device_type.
How does it work?
To start, we create a partitioned table:
- Create a partitioned table using the
PARTITION BYclause, which includes the partitioning method (
RANGEin this example) and the list of column(s) to use as the partition key (examples are directly from the PostgreSQL 10 documentation):
CREATE TABLE measurement ( city_id int not null, logdate date not null, peaktemp int, unitsales int ) PARTITION BY RANGE (logdate);
- After creating the partitioned table, partitions can be manually created:
CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES FROM (‘2006–02–01’) TO (‘2006–03–01’) CREATE TABLE measurement_y2006m03 PARTITION OF measurement FOR VALUES FROM (‘2006–03–01’) TO (‘2006–04–01’) CREATE TABLE measurement_y2006m04 PARTITION OF measurement FOR VALUES FROM (‘2006–04–01’) TO (‘2006–05–01’) ... CREATE TABLE measurement_y2007m11 PARTITION OF measurement FOR VALUES FROM (‘2007–11–01’) TO (‘2007–12–01’)
- And then indexes can be created:
CREATE INDEX ON measurement_y2006m02 (logdate); CREATE INDEX ON measurement_y2006m03 (logdate); CREATE INDEX ON measurement_y2006m04 (logdate); ... CREATE INDEX ON measurement_y2007m11 (logdate);
The partitioned table functions similarly to the parent or “master” table in table inheritance but is much improved, serving as a much more powerful query interface to the underlying child table partitions. For instance, commands such as
COPY now propagate to child tables via execution on the partitioned table. Additionally, users can insert data into underlying child tables via the partitioned table, since tuples (i.e., rows) now automatically route to the right partition on
INSERT and no longer depend on triggers, as is the case with table inheritance.
As can be seen, declarative partitioning drastically simplifies the 8-step process of table inheritance. The 3-step process is less manual and more straightforward for the user. The automation of inserts and additional propagation of table commands via the partitioned table are also big wins.
Declarative partitioning is just the first step in a new approach to partitioning for PostgreSQL. As far as we understand, additional constraints on partitioning — such as the fact that the parent table itself cannot hold data — introduced by declarative partitioning (as opposed to table inheritance) will allow PostgreSQL to build in more powerful query optimization and other features in the future. In other words, in PG10 declarative partitioning is mostly about an improved interface and syntax for defining partitions, but future versions should be able to use this structure to implement more advanced features and optimizations.
Limitations of declarative partitioning
Undoubtedly, declarative partitioning is a significant step toward the automation of partitioning in PostgreSQL. That said, there are still several limitations.
Manual partition management
Declarative partitioning makes partition management easier, but still requires a fair amount of manual work.
For example, child tables for the data need to exist before the data is inserted. This can become a management challenge for when data evolves. For example, if partitioning by time, it would be nice if the system would auto-create new partitions as new data comes in. Currently this has to be done manually by the database administrator. At the same-time, pre-creating many partitions may negatively affect performance.
Some of other the other limitations requiring manual management are listed in the documentation (section 22.214.171.124), for example:
- Cannot create indexes on all partitions automatically. Indexes still need to be manually created on each partition.
- Updates that would move a row from one partition to another will fail.
- Row triggers must be defined on individual partitions.
These limitations mean that when creating tables, a user still needs to manually assign tablespaces and create all indexes and constraints, as appropriate. Similarly, when performing changes to the parent table, all changes to the indexes and constraints need to be manually applied to every partition. Also, many commands like
REINDEX need to be applied to individual child tables.
To summarize, the management of individual partitions in PG10 is still not yet done exclusively through the partitioned table interface. These additional operational requirements have the potential to create data management bottlenecks, especially as more and more partitions are created, and as changes to the data model occur.
Multi-dimensional partitioning is cumbersome
In the case of multi-dimensional partitioning — important for scalability and additional query improvements — sub-partitions must be manually created on existing partitions after the first partition is created. As an example, partitioning on one
LIST (e.g., device) and one
RANGE (e.g., time) dimension first requires creating all
LIST partitions and then, in each of those child tables, creating corresponding
RANGE sub-partitions, or vice versa. This creates a tree hierarchy of tables where the leafs are the effective sub-partitions that hold tuples.
-- Create device partitions CREATE TABLE conditions_p1 PARTITION OF conditions FOR VALUES FROM (MINVALUE) TO ('g') PARTITION BY RANGE (time); CREATE TABLE conditions_p2 PARTITION OF conditions FOR VALUES FROM ('g') TO ('n') PARTITION BY RANGE (time); CREATE TABLE conditions_p3 PARTITION OF conditions FOR VALUES FROM ('n') TO ('t') PARTITION BY RANGE (time); CREATE TABLE conditions_p4 PARTITION OF conditions FOR VALUES FROM ('t') TO (MAXVALUE) PARTITION BY RANGE (time); -- Create time partitions for the first week in each device partition CREATE TABLE conditions_p1_y2017m10w01 PARTITION OF conditions_p1 FOR VALUES FROM ('2017-10-01') TO ('2017-10-07'); CREATE TABLE conditions_p2_y2017m10w01 PARTITION OF conditions_p2 FOR VALUES FROM ('2017-10-01') TO ('2017-10-07'); CREATE TABLE conditions_p3_y2017m10w01 PARTITION OF conditions_p3 FOR VALUES FROM ('2017-10-01') TO ('2017-10-07'); CREATE TABLE conditions_p4_y2017m10w01 PARTITION OF conditions_p4 FOR VALUES FROM ('2017-10-01') TO ('2017-10-07'); -- Create time-device index on each leaf partition CREATE INDEX ON conditions_p1_y2017m10w01 (time); CREATE INDEX ON conditions_p2_y2017m10w01 (time); CREATE INDEX ON conditions_p3_y2017m10w01 (time); CREATE INDEX ON conditions_p4_y2017m10w01 (time); INSERT INTO conditions VALUES ('2017-10-03 10:23:54+01', 73.4, 40.7, 'sensor3');
This multi-step process and the additional dimensions / sub-levels of partitions further complicate things and unnecessarily large inheritance trees can negatively impact performance and scalability as more tables need to be processed for both inserts and queries.
Unsupported data-integrity features
Some data integrity features commonly used in Postgres are not supported in PG 10, as described in documentation (same section 126.96.36.199):
- Cannot create primary keys on partitions: meaning that foreign keys referencing partitioned tables are not supported, nor are foreign key references from a partitioned table to another table.
- No support for enforcing uniqueness (or an exclusion constraint) across an entire partitioned table. Unique or exclusion constraints can only be created on individual partitions. Thus,
ON CONFLICTclauses are also unsupported (but is apparently slated for PG11).
We have seen that many IoT and time-series applications make use of upserts (
ON CONFLICT clauses) so this may be a significant hurdle for these application domains. The lack of primary keys may also be an issue for users using ORMs that assume the existence of primary keys.
Declarative partitioning is an important step in the drive to make PostgreSQL scale for large data sets. In PG10, it introduces a simpler syntax for defining partitions and enables tuple routing. Both of these steps make setting up partitioning easier and will be beneficial for many. However, from what we see, the most exciting results from this ongoing work are query optimizations yet to come (slated for PG11 and later). Significant opportunities also remain in improving management and ease of use.
But parallel to the continued core development of PostgreSQL (and thanks to the fact that Postgres is so extensible), the state of partition management and ease of use can be accelerated by extensions, such as TimescaleDB.
These extensions allow PostgreSQL to be used effectively for more diverse workloads than traditional OLTP, such as the exponentially growing amount of time-series data in our world.
If you’d like to learn more about how PG10 declarative partitioning works for time-series data (and how our partitioning approach differs), please check out this recent post: Problems with PostgreSQL 10 for time-series data.