Database partitioning is a technique used to divide a large database into smaller, more manageable pieces called partitions. These partitions contain their own subset of data and can be overlapping or non-overlapping. Each partition can be thought of as a separate database, but they are still part of the same logical database.
The purpose of database partitioning is to improve the database's performance, scalability, and availability. By dividing the database into smaller pieces, managing and processing large amounts of data becomes simpler. Additionally, database partitioning allows for parallel processing of queries and reduces contention for shared resources such as CPU and input/output (I/O).
One effective way to partition large time-series or time-series-like workloads in PostgreSQL is using Timescale (a cloud database that works and feels like Postgres under the hood) to create a hypertable. Hypertables work like regular PostgreSQL tables but automatically partition your data by time and, optionally, by space. In Timescale jargon, we call the data partitions within a hypertable “chunks.”
To learn more about this advantageous database partitioning feature, see our Docs, or keep reading for a primer on data partitioning.
You may need to partition your database for several reasons, including the following:
Improved performance: Partitioning a database can significantly improve the performance of queries and transactions. By dividing a large database into smaller partitions, we reduce the amount of data each query needs to process and lower contention for shared resources such as CPU and input/output (I/O), which speeds up queries.
Availability: By dividing the data into multiple smaller partitions, we can create redundant copies (replicas) of the data and distribute these partitions across multiple systems. This helps ensure the data remains available even if one partition or server fails. Still, availability is best seen as a goal of data partitioning rather than an automatic byproduct: it only follows when partitioning is combined with replication.
Scalability: Partitioning a database can make it easier to scale the database as the size and complexity of the data increases. By dividing the data into smaller partitions, we can add more servers or storage devices to handle the increased workload.
Manageability: Partitioning a database can make it easier to manage. We can simplify backups, maintenance, and other management tasks by dividing the data into smaller partitions.
There are several ways to partition a database, including horizontal partitioning, vertical partitioning, and hybrid partitioning. Horizontal partitioning divides the data by rows, vertical partitioning divides it by columns, and hybrid partitioning combines the two.
Database partitioning can be implemented at various levels of the database architecture, including at the application and database levels. The specific partitioning approach used depends on the needs of the application and the characteristics of the data being stored.
In a distributed database, partitions are used to split the stored data and assign a smaller fraction of the whole database to the nodes of a cluster. Each of the nodes stores only a part of the dataset.
In most distributed databases, the terms partitioning and sharding are used as synonyms. Sharding data and distributing it across several systems allows the database to use more resources to store and process the dataset than a single computer can provide.
In these systems, partitions are also used together with replication. This means that one partition is assigned to more than one node of the distributed system. This leads to better availability of the data. If one of the nodes fails, the data can still be accessed from another system.
For example, you can partition customer data using horizontal range partitioning in a cluster with four nodes: A, B, C, and D. The customers with customer IDs between 0 and 1,000 are stored on system A. In addition, this partition is replicated in system B. The customer data with customer IDs from 1,001 to 2,000 are stored on system C, and the same partition is also replicated in system D.
If the customer with ID 50 has to be accessed, system A or B has to be contacted to load the data from the correct partition. If one of these systems is unavailable (e.g., due to a crash), you can still access the data from the remaining system.
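The lookup logic in this example can be sketched in a few lines of Python. The node names and ID ranges are the ones from the example above; the function and variable names are hypothetical, for illustration only:

```python
# Sketch of the range-partitioning example above: each partition is a
# customer-ID range mapped to the nodes (primary + replica) that hold it.

PARTITION_MAP = [
    # (low ID, high ID, nodes holding this partition)
    (0, 1000, ["A", "B"]),
    (1001, 2000, ["C", "D"]),
]

def nodes_for_customer(customer_id, failed_nodes=()):
    """Return the live nodes that can serve this customer ID."""
    for low, high, nodes in PARTITION_MAP:
        if low <= customer_id <= high:
            live = [n for n in nodes if n not in failed_nodes]
            if not live:
                raise RuntimeError("partition unavailable")
            return live
    raise KeyError(f"no partition covers customer ID {customer_id}")

# Customer 50 lives on systems A and B; if A has crashed, B still serves it.
print(nodes_for_customer(50))                      # ['A', 'B']
print(nodes_for_customer(50, failed_nodes={"A"}))  # ['B']
```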
In Timescale, we use both space and time partitioning to improve data distribution.
By using horizontal partitioning, complete rows of a table are assigned to the partitions. So, each partition contains the same attributes but fewer tuples (in relational databases, like PostgreSQL and Timescale, a tuple is one record, i.e., one row) than the whole dataset. Usually, this type of partitioning is non-overlapping. It means that one tuple belongs to exactly one partition.
You can perform the assignment using several strategies, such as list, range, or hash partitioning (discussed below).
For example, when horizontal partitioning is used to partition customer data using range partitioning, partition A contains the tuple of customers with the customer IDs 0-1,000, whereas partition B contains the tuples for customers 1,001-2,000.
In list partitioning, the data is divided into partitions based on a predefined list of values for a specific column in the table. Each partition contains rows that match a particular value in the list. For example, a table of customers might be partitioned based on the state where they reside, with each partition containing rows for customers in a specific state.
In range partitioning, the data is divided into partitions based on a range of values for a particular column in the table. Each partition contains rows that fall within a specific range of values. For example, a table of sales transactions might be partitioned based on the date of the transaction, with each partition containing rows for a specific range of dates.
In hash partitioning, the data is divided into partitions based on a hash function applied to a specific column in the table. The hash function generates a value that is used to assign each row to a specific partition. Hash partitioning is useful when there is no obvious range or list to partition on.
This is a technique for database partitioning that combines multiple partitioning methods to create more complex partitions. In composite partitioning, a table is partitioned using two or more partitioning methods.
For example, a table might be first partitioned using range partitioning based on a date column. Then, each partition might be further divided using list partitioning based on the state where the customer resides. This would result in a composite partitioning scheme that uses both range and list partitioning.
Composite partitioning can be useful when a single partitioning method is insufficient to create an even data distribution. By combining multiple partitioning methods, composite partitioning can provide more flexibility and allow for more complex partitioning schemes.
However, composite partitioning can also be more complex to implement and manage than simpler partitioning methods and may require more resources. As with any partitioning method, the choice to use composite partitioning depends on the specific requirements of the database and the application.
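A composite scheme like the one described above (range on a date, then list on the state) can be sketched as the composition of the two routing rules. The labels and column names are hypothetical:

```python
def composite_partition(row):
    """Composite partitioning sketch: range on the date (by year),
    then list on the customer's state within each range partition."""
    year = row["date"][:4]                       # range step
    region = {"NY": "east", "CA": "west"}[row["state"]]  # list step
    return f"p_{year}_{region}"

print(composite_partition({"date": "2023-05-17", "state": "CA"}))  # p_2023_west
```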
This simple technique for database partitioning evenly distributes data across a set of partitions in a round-robin fashion. In round-robin partitioning, each new row or record is assigned to the next available partition in a cyclic manner.
For example, suppose we have three partitions and want to partition a table of sales transactions using round-robin partitioning. The first row would be assigned to the first partition, the second row would be assigned to the second partition, and the third row would be assigned to the third partition. The fourth row would then be assigned to the first partition again, and so on.
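The dealing-out behavior described above can be sketched with a simple cycle over the partitions (partition names are illustrative):

```python
import itertools

# Round-robin sketch: rows are dealt to partitions cyclically,
# independent of their content.
partitions = {"p0": [], "p1": [], "p2": []}
cycle = itertools.cycle(partitions)

def insert(row):
    target = next(cycle)
    partitions[target].append(row)
    return target

for txn_id in range(1, 7):
    insert({"txn_id": txn_id})

# Each partition ends up with an equal share of the six rows...
print([len(p) for p in partitions.values()])  # [2, 2, 2]
# ...but locating txn_id 4 later requires searching every partition.
```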
Round-robin partitioning can be useful when there is no clear key or attribute to use for partitioning or when a more complex partitioning scheme is unnecessary. However, it may not provide the best performance for query processing: because rows are assigned without regard to their content, it cannot be determined which partitions contain the data needed for a particular query, so most queries must process all partitions.
Going back to our previous example: if you use range partitioning and store customers 0-1,000 in partition A and customers 1,001-2,000 in partition B, a query for customer ID 50 only needs to access partition A. The partitioning is precise and deterministic.
However, if you assign the customers via round-robin partitioning and make the same query, you don’t know in which partition that particular record is stored.
Overall, round-robin partitioning is a straightforward technique for database partitioning that can be useful in certain situations but may not be the best choice for all applications.
By using vertical partitioning, the attributes of a tuple are split and assigned to different partitions. Each partition contains the same number of tuples but a different set of attributes.
In most cases, one attribute (often the primary key) is part of all partitions. This attribute is used to reconstruct the tuple when it is read. The attributes that belong to each partition are often directly specified by their name when the partitions are created.
For example, suppose the customer entity consists of the attributes customer id, firstname, lastname, and address. The attributes customer id, firstname, and lastname are assigned to partition one, while the attributes customer id and address are assigned to partition two.
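A minimal Python sketch of this vertical split, assuming hypothetical customer attributes (customer id, firstname, lastname, address) and keeping the primary key in both partitions so the full tuple can be reconstructed on read:

```python
def vertical_split(row):
    """Split one tuple's attributes across two vertical partitions."""
    p1 = {k: row[k] for k in ("customer_id", "firstname", "lastname")}
    p2 = {k: row[k] for k in ("customer_id", "address")}
    return p1, p2

def reconstruct(p1, p2):
    """Rebuild the full tuple by joining on the shared primary key."""
    assert p1["customer_id"] == p2["customer_id"]
    return {**p1, **p2}

row = {"customer_id": 7, "firstname": "Ada", "lastname": "Lovelace",
       "address": "12 St James's Square"}
p1, p2 = vertical_split(row)
print(reconstruct(p1, p2) == row)  # True
```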
You can use vertical partitioning to store different attribute partitions on separate storage volumes. This allows storing less frequently accessed attributes on slower and more cost-effective volumes, while more frequently accessed or modified attributes can be stored on faster and more expensive volumes. Another application is to assign different permissions to these partitions to restrict access to certain attributes.
Hybrid partitioning combines horizontal and vertical partitioning: tuples are assigned to different partitions using horizontal partitioning, and the attributes of those tuples are then split across partitions using vertical partitioning. As a result, each partition contains fewer attributes and fewer tuples than the whole dataset.
Although such a partitioning scheme is more complex to manage, it allows the creation of small partitions and the separate handling of certain attributes (e.g., storing them on different volumes; see vertical partitioning).
In the example with the customers and range partitioning, the customer tuples are first assigned to a partition based on the customer ID. Then, the attributes of the tuples are partitioned: the attributes customer id, firstname, and lastname of customers with customer IDs 0-1,000 are stored in partition A, while the attributes customer id and address of the same customers are stored in partition B.
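Putting the two steps together, a hybrid routing sketch might look like this (partition labels and attribute names are hypothetical, following the customer example):

```python
def hybrid_partition(row):
    """Route one customer tuple under a hybrid scheme:
    range partitioning on the customer ID (horizontal step),
    then a split of the attributes (vertical step)."""
    low_range = row["customer_id"] <= 1000
    name_part = ("A" if low_range else "C",
                 {k: row[k] for k in ("customer_id", "firstname", "lastname")})
    addr_part = ("B" if low_range else "D",
                 {k: row[k] for k in ("customer_id", "address")})
    return name_part, addr_part

names, addr = hybrid_partition(
    {"customer_id": 50, "firstname": "Ada", "lastname": "Lovelace",
     "address": "12 St James's Square"})
print(names[0], addr[0])  # A B
```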
If you’re handling large time-series or time-series-like workloads, you know why partitioning techniques are crucial to handling massive data volumes without breaking a sweat.
Built on PostgreSQL and optimized for enhanced performance and scalability, Timescale elevates Postgres partitioning strategies with the following advantages:
Improved query performance: Timescale's partitioning strategies are designed to optimize queries on time-series data, allowing for faster query processing and improved query performance. By partitioning data by time interval, Timescale executes queries more efficiently by limiting the amount of data that needs to be scanned.
Automatic creation of partitions: By setting up a hypertable and a chunk interval (partitions in Timescale hypertables are called chunks, remember?), Timescale will create partitions automatically as soon as you insert the first data.
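Conceptually, each chunk covers one time interval, and a row's timestamp determines which chunk receives it. Here is a small Python sketch of that time-bucketing idea; the one-day interval, epoch, and function name are illustrative assumptions, not Timescale's internal implementation:

```python
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=1)  # illustrative chunk interval
EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

def chunk_for(ts):
    """Map a timestamp to the time range of the chunk that would hold it."""
    n = (ts - EPOCH) // CHUNK_INTERVAL
    start = EPOCH + n * CHUNK_INTERVAL
    return start, start + CHUNK_INTERVAL

start, end = chunk_for(datetime(2023, 5, 17, 14, 30, tzinfo=timezone.utc))
print(start.date(), end.date())  # 2023-05-17 2023-05-18
```

Because every row maps deterministically to an interval, a new chunk only needs to be created the first time data arrives for that interval, which is why no manual partition management is required.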
Data retention and deletion: Timescale's partitioning strategies allow for easy management of time-series data retention and deletion. For example, data can be partitioned by time interval, older partitions can be compressed to save storage costs, or partitions can be dropped when they are no longer needed. This ensures that only relevant data is retained, reducing storage costs and improving query performance.
Data tiering: By setting up a data tiering policy, old and rarely accessed data can be automatically moved to S3. This type of data partitioning allows distinguishing between new and frequently accessed data and historical data, which is accessed infrequently. Storing infrequently accessed data on S3 is a cost-effective solution.
Overall, Timescale's partitioning strategies are optimized for time-series and time-series-like data, providing software engineers with several advantages, including improved query performance, scalability, and data retention and deletion. These features make Timescale a popular choice for time-series data storage and analysis in a wide range of applications.