Written by Umair Shahid
Fundamentally, a database index is a strategically designed data structure that enhances the speed of data retrieval activities in a database table. You can liken it to the table of contents in a book, which allows you to quickly locate a specific topic without the need to comb through the entire book.
In the same vein, indexes in PostgreSQL function like a navigation system for the database engine. They facilitate the swift and efficient location and extraction of particular rows, eliminating the necessity for a full-table scan. In more straightforward terms, an index acts as a signpost to data within a table.
Leveraging indexes in PostgreSQL provides several key advantages:
Rapid data access: Indexes are instrumental in drastically slashing the time needed to retrieve data, particularly from large tables. Without indexes, a complete table scan would be required, which can be quite time-consuming.
Boosted query efficiency: Queries that include conditions in the
WHERE clause or require table joins see marked improvements in performance with indexing. Such queries leverage indexes to quickly pinpoint rows that fulfill the set criteria.
Reduced disk I/O: As indexes hold a subset of the table's data, disk I/O operations are significantly lessened. This not only speeds up query execution but also lightens the load on the storage system.
Data integrity maintenance: Unique indexes act as safeguards against duplicate values within specific columns, thereby maintaining data integrity by ensuring no two rows have identical values in the designated columns.
Despite the benefits, there are some potential pitfalls to using indexes in PostgreSQL:
Increased storage requirement: The most obvious downside of utilizing indexes is the extra storage space they require. The precise amount depends on the size of the table and the number of indexed columns. Usually, it's a small fraction of the total table size. However, for large datasets, adding multiple indexes can lead to a significant increase in storage usage.
Slower write operations: Every time a row is inserted, updated, or deleted, the index must be updated too. As a result, write operations may become slower. It's crucial to balance read and write operations when considering index usage. If your application relies heavily on write operations, the benefits of faster reads should be carefully weighed against the cost of slower writes.
HOT updates: PostgreSQL employs a mechanism called Multi-Version Concurrency Control (MVCC) for updates. However, indexes can lead to "HOT" (Heap-Only Tuples) updates. Instead of allowing direct in-place updates, each update operation effectively results in a new row version, generating a new entry in each associated index. This leads to increased I/O activity and the addition of dead rows in the database.
Despite PostgreSQL indexes offering remarkable benefits in query performance, it's crucial to judiciously balance these advantages against potential drawbacks, especially in situations where storage efficiency and write performance are key considerations.
Next, let's explore the various types of indexes offered by PostgreSQL.
The B-tree (Balanced Tree) index stands as the default and most widely employed index type within PostgreSQL. When an indexed column participates in a comparison employing any of these operators: <, <=, =, >=, >, the query planner will consider employing a B-tree index.
B-tree indexes can also be effectively utilized with operators like BETWEEN and IN. Furthermore, an IS NULL or IS NOT NULL condition on an indexed column can be combined with a B-tree index.
By default, B-tree indexes organize their entries in ascending order, with null last. If you wish to gain more insights into index ordering and its potential advantages in specific scenarios, you can refer to this page.
CREATE INDEX index_product_id ON products (product_id);
In this example, a B-tree index is created on the product_id column of the products table. This index enhances the speed of queries that seek specific product_id values or product_id ranges.
Hash indexes are ideal for equality-based lookups, but they don't support range queries or sorting. They work well with data types like integers and are usually faster than B-tree indexes for equality checks.
CREATE INDEX index_product_id ON products USING HASH (product_id);
A multicolumn index is defined on more than one column of a table.
It is presumed that the product_id column is heavily utilized when retrieving data from the products table. Therefore, we are giving precedence to the product_id column over the product_name column.
CREATE INDEX index_product_id_name ON products (product_id, product_name);
When deciding whether to create a single-column index or a multicolumn index, it's essential to consider the column or columns that are frequently used in a query's WHERE clause as filter conditions.
Queries in which the above index (index_product_id_name) will be used are as follows:
Select * from products where product_id = 1 and product_name = 'a'; (index_product_id_name is used)
Select * from products where product_id = 1;
(index_product_id_name is used)
Select * from products where product_name = 'a';
(index_product_id_name is not used)
A partial index is an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate of the partial index). The index contains entries only for those table rows that satisfy the predicate.
CREATE INDEX index_product_id ON products(product_id) where product_available = ‘true’;
One common use case for partial indexes is to filter out rows that are irrelevant for most queries. For example, suppose you have a table of products with a column called product_available, which can be ‘true' or ‘false’. Most queries will only need to access the available products, so you can create a partial index on the availability column, where available = 'true’.
Covering index allows a user to perform an index-only scan if the select list in the query matches the columns that are included in the index. Additional columns can be specified using the INCLUDE keyword
CREATE INDEX index_product_id_name_status ON products (product_id, product_name) include (status);
When the user runs the following query:
EXPLAIN ANALYZE SELECT product_id,product_name,status from products
product_id < 10;
All the columns specified in the SELECT clause will be retrieved directly from the index pages. In theory, this can significantly reduce the amount of I/O (input/output) your query requires to access information.
Traditionally, I/O operations represent a significant bottleneck in database systems, so by avoiding access to the heap (table data) and minimizing the need for multiple I/O operations, PostgreSQL can enhance query performance.
However, it's important to exercise caution when implementing covering indexes. Each column added to the index still takes up space on disk, and there is an associated cost for maintaining the index, especially when it comes to row updates.
BRIN indexes are designed for large tables with sorted data, such as time-series data. They divide the table into logical blocks, where each block contains a range of values. Instead of storing individual index entries for each row, the index stores the minimum and maximum values within each block, making them smaller in size compared to other index types.
Index size: 541 MB
CREATE INDEX btree_example_index ON logs(log_date);
Index size: 544 kB
CREATE INDEX brin_example_index ON logs USING BRIN(log_date);
We've created both a B-TREE and a BRIN index on the log_date column. This demonstrates that BRIN indexes are indeed highly space-efficient and can provide significant advantages when dealing with large tables characterized by sequential workloads.
For complex data types like arrays or geometric shapes, use GiST, GIN, or SP-GIST indexes.
To learn more about the key server parameters that influence index usage, read this article on optimizing database indexes to fine-tune your PostgreSQL performance. Still have questions about indexes in Timescale? We answer them in this blog post.