Efficient data retrieval is essential for achieving peak PostgreSQL performance for applications, particularly when dealing with vast datasets. Databases offer a robust solution in the form of indexing, a mechanism that accelerates the retrieval of specific rows. In this article, we will explore PostgreSQL indexing, starting with a general introduction to the topic and building up to the essential tips for optimizing performance.
At its core, a PostgreSQL index serves as a data structure designed to speed up data retrieval operations within a database table. Think of it as the table of contents in a book. Instead of painstakingly searching through the entire book to pinpoint a particular topic, you can swiftly navigate to the relevant page using the table of contents.
Similarly, in PostgreSQL, indexes act like a GPS for the database engine, enabling it to efficiently locate and retrieve specific rows without having to scan the entire table. To put it simply, an index is a pointer to data within a table.
These database indexes are standalone data structures, and they reside on disk alongside data tables and other related catalog objects. Sometimes, when all the data required for a query is located in an index, Postgres can employ an INDEX ONLY scan and never have to touch the table at all.
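As an illustration (the orders table and column names here are hypothetical), a query that references only indexed columns can often be answered without reading the table itself:

```sql
-- Hypothetical table and index, for illustration only.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Because the query only references customer_id, the planner can use
-- an Index Only Scan, provided the visibility map is reasonably up to
-- date (e.g., after a recent VACUUM).
EXPLAIN SELECT customer_id FROM orders WHERE customer_id = 42;
```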
Common PostgreSQL Index Types
Postgres supports many index types, from the default B-tree indexes to more exotic types like hash, GIN, GiST, or BRIN. They are well documented in the PostgreSQL docs, but here’s a quick summary:
B-Tree indexes (default index type in PostgreSQL)
CREATE INDEX index_product_id ON products (product_id);
Default index type in PostgreSQL.
Supports <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL.
Organizes entries in ascending order.
Hash indexes
CREATE INDEX index_product_id ON products USING HASH (product_id);
Ideal for equality checks, especially for integers.
Doesn't support range queries or sorting.
Multicolumn indexes
CREATE INDEX index_product_id_name ON products (product_id, product_name);
Multicolumn index defined on multiple table columns.
Optimize based on columns frequently used in WHERE clauses.
Partial indexes
CREATE INDEX index_product_id ON products (product_id) WHERE product_available = true;
Built over a subset of a table, defined by a conditional expression.
Useful for filtering out frequently unqueried rows.
Covering indexes
CREATE INDEX index_product_id_name_status ON products (product_id, product_name) INCLUDE (status);
Allows index-only scans when the select list matches index columns.
Additional columns specified with the INCLUDE keyword.
Block Range index (BRIN)
CREATE INDEX brin_example_index ON logs USING BRIN(log_date);
Designed for large, sorted tables like time-series data.
Space-efficient, storing min and max values within logical blocks.
The Pros of Indexing in PostgreSQL
The utilization of indexes gives PostgreSQL (and other relational databases) some important benefits:
Accelerated data retrieval. Indexes play a pivotal role in significantly reducing the time required to fetch data, particularly when dealing with large-sized tables. In the absence of indexes, the database would need to perform a full table scan, which can be excessively sluggish. For example, in databases that don't use indexes, such as certain object stores, data retrieval is often based on the object's key. While this can be quick for direct lookups, it's inefficient for complex queries or partial matches, which PostgreSQL handles gracefully with indexes.
Enhanced query performance. PostgreSQL query planner/optimizer uses indexes to derive the most efficient way to execute a query. Queries that involve conditions specified in the WHERE clause or require joins between tables experience substantial performance improvements with the use of indexes. These queries can tap into indexes to swiftly identify rows that meet the specified criteria.
Minimized disk I/O. Indexes store a subset of the table's data, resulting in a reduction of disk I/O operations. This not only expedites query execution but also reduces the burden on the storage system.
Data integrity enforcement. Unique indexes serve as guardians against duplicate values within specific columns. They play a pivotal role in upholding data integrity by ensuring that no two rows possess identical values in the designated columns. Foreign key constraints, another aspect of data integrity, are more efficient with indexes on the referenced columns, making operations like cascading updates or deletes faster.
The Cons of PostgreSQL Indexes
From our experience, everything related to databases has drawbacks if overused, including indexes. Well-designed indexes will give you optimal performance, but over-indexing is a common problem in large PostgreSQL deployments that can cause you multiple problems.
An apparent one is the consumption of extra storage space: every index you add to a table requires its own separate storage allocation. At a fundamental level, an index comprises the values of the indexed columns and a reference (or pointer) to the actual row in the table.
For smaller datasets, the space used by indexes typically constitutes a small percentage of the total table size, but as the size of your PostgreSQL tables grows, so does the size of its associated indexes. This can become significant when multiple columns are indexed and when the indexed columns contain textual or string data. Unlike fixed-size data types like integers or dates, strings can vary in length, meaning that the index has to accommodate not only the string values but also metadata indicating the length of each string, thereby increasing the space overhead.
In Timescale, all hypertables are automatically partitioned by time. When a large table is partitioned, each partition is a table unto itself; if indexes are applied to the partitioned table, they are effectively applied to each partition separately. So, if you have hundreds or thousands of partitions, each with its own index, the cumulative storage used by these indexes can rise quickly.
An inefficient use of indexes can also slow down your writes. When you introduce an index, it has to be updated every time a row is inserted, updated, or deleted. PostgreSQL has an optimization called HOT (heap-only tuples) which can make in-place row modifications by avoiding index changes when certain columns are updated. However, if a table has many indexes or if updates frequently touch indexed columns or columns used in expression indexes, the ability to perform these efficient hot updates diminishes.
This also leads to more I/O operations due to necessary index adjustments since each time data is added, deleted, or updated in a table or a partition, its associated indexes need to be updated too. This can increase table bloat from dead tuples, which can degrade performance over time. Additionally, maintaining numerous indexes can make database backups and restores more time-consuming.
Indexes are a crucial aspect of your PostgreSQL database, but it’s essential to build a practice of monitoring and optimization to ensure they’re working in your favor. Keep reading for tips on improving your use of indexes, starting with identifying the parameters you should be aware of in PostgreSQL due to their influence on index usage.
Server Parameters That Influence Index Usage
Certain server parameters offer a degree of control over how the query planner makes decisions about index usage, and they are instrumental in optimizing query performance. Let’s cover them:
work_mem and shared_buffers
We already discussed these parameters in the previous episode of this series. work_mem dictates the amount of memory allocated to each database operation, such as sorting and hashing—its sizing directly impacts whether operations remain in memory or spill over to disk. If work_mem is set appropriately, it can efficiently accommodate bitmap scans and sorts in memory, reducing the need for PostgreSQL to rely heavily on indexes. If the allocated memory is too small, the system might have to use disk-based temporary storage, affecting performance.
shared_buffers represents a cache of the database's disk pages in the main memory and plays a critical role in how frequently the database needs to access the underlying disk. A well-sized shared_buffers ensures that frequently accessed table and index pages are kept in fast, main memory, drastically reducing I/O operations and improving overall query response times.
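As a rough sketch of what this looks like in postgresql.conf (the values here are illustrative placeholders, not tuned recommendations):

```ini
# postgresql.conf -- illustrative values only
shared_buffers = 4GB    # often sized around 25% of total RAM
work_mem = 64MB         # per sort/hash operation, so total usage can multiply
```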
Make sure to check out Part II for more detailed configuration recommendations for both these parameters.
effective_cache_size
effective_cache_size provides the query planner with an estimate of how much memory is available for caching data and indexes. Although it doesn't allocate memory by itself, it influences the planner's decisions regarding the execution of queries.
This setting is essential for query planning, as it guides the planner in selecting the most efficient plan based on the expected cost of different strategies. For example, if effective_cache_size is set to a high value, the planner assumes that much of the needed data is likely to be in the OS's cache, making it more likely to favor index scans over sequential scans.
The optimal value will vary based on your setup, but a common starting point is setting it to about 50-70 percent of your machine's total RAM.
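For example, on a machine with 16 GB of RAM, that guideline might translate into something like the following (illustrative value; this setting only informs the planner, so it can be applied with a reload):

```sql
-- Roughly 60 percent of 16 GB of RAM; adjust for your setup.
ALTER SYSTEM SET effective_cache_size = '10GB';
SELECT pg_reload_conf();  -- apply without a restart
```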
random_page_cost and seq_page_cost
random_page_cost and seq_page_cost are configuration parameters that help the PostgreSQL query planner estimate the relative costs of random versus sequential page reads. By default, seq_page_cost is set to 1 (one), indicating the base cost of a sequential read, while random_page_cost is set to 4 (four), suggesting that random access is considered four times more expensive than sequential access.
Generally, index lookups tend to involve random disk I/O because of the nature of traversing index trees, whereas full table scans are more inclined to be sequential, although this can vary with frequent updates or deletions. Adjusting the random_page_cost downward makes the planner more favorable towards index scans by narrowing the perceived cost gap between random and sequential page reads.
It's worth noting that the default value of 4 is more aligned with the behavior of traditional spinning disks. However, if your PostgreSQL instance runs on an SSD, you could see benefits by adjusting random_page_cost to around 1.1, reflecting the significantly reduced random access penalty of SSDs.
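A sketch of that adjustment (assuming SSD storage; verify the effect on your own query plans before and after changing it):

```sql
-- Narrow the perceived gap between random and sequential reads on SSDs.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Confirm the active value:
SHOW random_page_cost;
```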
cpu_index_tuple_cost
This parameter indicates the relative cost of the CPU operations required to process each index entry during an index scan.
In theory, adjusting cpu_index_tuple_cost allows you to influence the query planner's choices regarding whether to utilize an index scan or a sequential scan for a particular query. Lower values would make index scans more appealing from a cost perspective, while higher values would lead the planner to favor sequential scans.
In practice, it is not recommended to alter the default value of cpu_index_tuple_cost, which is 0.005 in recent versions of PostgreSQL, at least not without thorough testing. The potential performance gains from tweaking it are usually marginal compared to the risks, e.g., suboptimal query plans and unexpected performance regressions.
How Can I Know Whether my PostgreSQL Indexes Are Working Efficiently?
The million-dollar question!
Determining whether your PostgreSQL indexes need optimization requires ongoing monitoring via a multi-faceted approach. Indexes are not a "set and forget" in PostgreSQL; as your workload evolves, they will require maintenance to ensure your database keeps running with optimal performance.
Thankfully, PostgreSQL comes with great tools and techniques you can use to monitor your indexes. Let’s cover some of the most rewarding strategies you can follow to identify if your indexes might not be functioning as intended.
Finding unused indexes via pg_stat_user_indexes
As we mentioned at the beginning of this post, over-indexing is a frequent issue in many large PostgreSQL deployments. Once your database starts growing, it's essential to identify and remove unused or underutilized indexes to remain efficient, as too many indexes can lead to several problems:
Every time a row is added, deleted, or modified, all indexes on that table need to be updated. With many indexes, this write overhead can become substantial, slowing down INSERT, UPDATE, and DELETE operations.
Indexes consume disk space. Excessive or unused indexes can waste valuable storage, especially on large tables.
Indexes need to be periodically vacuumed and analyzed, and more indexes lead to longer maintenance windows.
If you want to spot potentially unused or underutilized indexes, pg_stat_user_indexes is your friend. You can look at the idx_scan column, which counts the number of index scans initiated on each index. Here's an example query that could help identify unused or rarely used indexes:
SELECT
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS index_scan_count
FROM
    pg_stat_user_indexes
WHERE
    idx_scan < 50 -- Choose a threshold that makes sense for your application.
ORDER BY
    idx_scan ASC;
In the example above, we're looking for indexes with fewer than 50 scans. That threshold is arbitrary; adjust it based on your application's usage patterns and the time elapsed since the last statistics reset. Also, if you have daily or weekly batch jobs, make sure you're not prematurely dropping indexes that those less frequent tasks rely on.
Monitoring shared_blks_read and blk_read_time to identify if a new index may be needed
On the contrary, sometimes you may be lacking an index—this can happen as your workloads and query patterns evolve over time.
The pg_stat_statements view is once again your friend: it can also help you to identify slow queries that may benefit from additional indexing via the metrics shared_blks_read and blk_read_time.
shared_blks_read indicates the number of shared blocks that had to be read from disk because they weren't found in the shared buffer cache; a high value for this metric suggests that the system is reading a significant amount of data directly from the disk.
blk_read_time represents the total time spent reading these blocks, measured in milliseconds. A high value indicates that not only is the system reading a lot of blocks from the disk, but it's also spending a significant amount of time doing so.
If you notice higher values of shared_blks_read and blk_read_time associated with an important query, it signals that PostgreSQL is fetching data from disk rather than using indexes efficiently. Creating an index may significantly improve query performance.
Once again, let’s use a practical example to bring this home. Suppose you run a report every day that lists all customers who made a purchase in the last 24 hours. Over time, as the number of customers and purchases grows, you notice that the query is taking longer and longer to run. To diagnose the issue, you turn to pg_stat_statements. The report query looks like this:
SELECT * FROM purchases WHERE purchase_date > NOW() - '1 day'::INTERVAL;
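To pull the I/O metrics for that statement, you could query pg_stat_statements along these lines (the extension must be installed via CREATE EXTENSION pg_stat_statements, and blk_read_time is only populated when track_io_timing is on; note that this column was renamed in PostgreSQL 17):

```sql
SELECT
    query,
    calls,
    shared_blks_read,
    blk_read_time
FROM pg_stat_statements
WHERE query ILIKE '%purchases%'
ORDER BY blk_read_time DESC
LIMIT 5;
```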
Within the context of your use case, the high shared_blks_read value suggests that PostgreSQL is fetching a substantial amount of data directly from the disk when executing this query. Furthermore, the blk_read_time of 300 milliseconds indicates a significant delay in retrieving this data for a query that should be performing much better. (Once again, don’t take these values as generally applicable; you’ll have to determine your own normal ranges by monitoring over time.)
Given this insight, you decide to investigate the indexes on the purchases table. You discover that there's no index on the purchase_date column. By adding an appropriate index, you might be able to significantly reduce the shared_blks_read and blk_read_time values, leading to a faster query execution:
CREATE INDEX idx_purchase_date ON purchases(purchase_date);
After creating the index and letting it populate, you could revisit pg_stat_statements to confirm the improvement.
Using EXPLAIN ANALYZE and looking for Seq Scan
We’re mentioning EXPLAIN ANALYZE a lot in this series, and for good reason: it’s an amazing tool. By prefixing a query with EXPLAIN ANALYZE, you can run it and examine the execution plan generated by the planner. It provides details on which indexes are being used, how data is retrieved, and which query node might be spending excessive time. For the specific case of optimizing indexes, if you see PostgreSQL doing sequential scans (Seq Scan) on large tables where you expect index scans, that's another sign that your indexes might not be used effectively.
Imagine you have a customers table with a column email that contains a million records. You frequently run queries to look up customers by their email addresses, so you've set up an index on the email column to optimize these lookups. However, over time, you've noticed that the query performance seems to be deteriorating. To investigate this, you decide to examine the query plan using EXPLAIN ANALYZE:
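The check itself might look like this (the email literal is just a placeholder for the address being looked up):

```sql
EXPLAIN ANALYZE
SELECT * FROM customers WHERE email = '[email protected]';
```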
Seq Scan on customers  (cost=0.00..20450.00 rows=1 width=1168) (actual time=500.032..600.021 rows=1 loops=1)
  Filter: (email = '[email protected]'::text)
  Rows Removed by Filter: 999999
Total runtime: 600.040 ms
The output shows that PostgreSQL has chosen a Seq Scan over the customers table, even though there's an index on the email column. This is surprising because you'd expect an index scan to be more efficient for such a specific lookup.
The output also gives you the time taken (Total runtime: 600.040 ms) and highlights that nearly all rows in the table were removed by the filter (Rows Removed by Filter: 999999), indicating that a full scan was indeed done.
This information tells you that the index on the email column isn't being used efficiently. Potential reasons might include:
There's some misconfiguration that makes the planner believe sequential scans are cheaper. The parameters we’ve covered earlier, such as random_page_cost and seq_page_cost, may not be set accurately according to your resources. As we’ve mentioned earlier, a common reason is that random_page_cost is set too high, which causes the planner to avoid index scans even when they would be faster.
The index is damaged or missing. This is a rare scenario, but indexes might get corrupted due to disk issues, crashes, or other issues. Furthermore, if an expected index has been inadvertently dropped or never created in the first place, PostgreSQL would have no choice but to revert to a sequential scan.
The statistics used by the query planner are out of date. PostgreSQL relies on statistics about the distribution of data in tables and indexes to generate efficient query plans. As data changes over time (due to additions, updates, or deletions), these statistics can become outdated, and without accurate statistics, the query planner might make suboptimal choices, such as selecting a sequential scan over an index scan. Running the ANALYZE command helps the system gather fresh statistics.
It’s worth expanding on this last point a bit in the next section.
Regularly running VACUUM and ANALYZE
Routine maintenance tasks, like VACUUM and ANALYZE, play a crucial role in the optimal functioning of indexes in a PostgreSQL database. The ANALYZE operation deserves special attention. When you run ANALYZE, the system examines the tables and updates the statistics related to the distribution of values within the tables.
These statistics are vital because the query planner relies on them to devise the most efficient strategy to execute a query, which often includes deciding whether to use an index or not. Without updated statistics, the planner might make suboptimal choices, as we mentioned previously.
Implementing a regular routine for these maintenance tasks (e.g., weekly) can be highly beneficial. Such a schedule ensures that the database is consistently rid of dead rows—obsolete data no longer accessible due to updates or deletions—and is equipped with up-to-date statistics.
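As a sketch, reusing the purchases table from the earlier example (in practice, autovacuum handles much of this automatically, but manual runs are useful after bulk changes):

```sql
-- Reclaim dead rows and refresh planner statistics in one pass:
VACUUM (ANALYZE, VERBOSE) purchases;

-- Or refresh statistics alone, which is much cheaper:
ANALYZE purchases;
```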
Indexing Best Practices: Summary
Nice! Now you have the tools to step up your indexing game as you scale. Before we wrap up, let’s run through a quick summary of indexing best practices. These are simple, but if you follow them, your indexes will keep performing optimally:
Only index what you need. Don't create indexes on every column. Assess your query patterns and create indexes that align with your most common queries. Over-indexing can negatively impact write operations, increase storage requirements, and introduce maintenance overhead.
Avoid indexes on low-cardinality columns. Indexing uniformly distributed columns with few distinct values (say, gender or a boolean flag) is usually an anti-pattern, as the random I/O generated by using the index will still visit most pages and will be slower than sequentially scanning the table. Sometimes, using partial indexes (which have a WHERE clause), multicolumn indexes, or rewriting your queries can solve this problem.
Regularly review and optimize indexes. As your application's data and usage patterns evolve, the effectiveness of existing indexes may diminish. Regular reviews and optimizations can help align indexes with current query requirements.
Test and benchmark. Before implementing new indexes, assess their impact on query performance using tools like EXPLAIN ANALYZE to verify whether the planner utilizes the index.
Update statistics after adding or removing an index. After adding or removing an index, update statistics using the ANALYZE command on the affected table. Accurate statistics about the distribution of values in indexed columns are vital for the query optimizer to make better decisions. Failing to update statistics can result in suboptimal query plans and decreased performance.
Rebuild indexes concurrently. The standard REINDEX command requires an exclusive table lock, which can disrupt database operations. In contrast, REINDEX CONCURRENTLY allows the table to remain accessible during the index rebuilding process, although it may take longer to complete. Setting a higher value of maintenance_work_mem can substantially speed up the index creation process, as we mentioned in our previous article.
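Putting those two tips together, a concurrent rebuild of the index from the earlier example might look like this (REINDEX ... CONCURRENTLY requires PostgreSQL 12 or later; the memory value is illustrative):

```sql
-- Give this session more memory for the rebuild:
SET maintenance_work_mem = '1GB';

-- Rebuild without taking an exclusive lock on the table:
REINDEX INDEX CONCURRENTLY idx_purchase_date;
```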
Special Considerations for Partitioned Tables and Hypertables
Lastly, when working with partitioned tables or with Timescale’s hypertables (which abstract partitioning, taking care of it automatically), it’s worth adding some extra considerations:
In PostgreSQL, while you can create indexes on the parent partitioned table (which apply to all partitions), you can also create indexes on individual partitions, known as local indexes. In Timescale, creating an index with the timescaledb.transaction_per_chunk option builds the index on each chunk in its own transaction (instead of using a single transaction for the entire hypertable), which reduces locking on large databases.
When you create a hypertable, Timescale will automatically create indexes on the time column.
In the Timescale Console, you can inspect your database indexes via the Explorer view.
You can also use composite indexes (an index that includes two or more table columns) in vanilla PostgreSQL and hypertables. This is useful if your queries often filter on the time column combined with one or more other columns. For example, you may want to keep the index on time to quickly filter for a given time range and add another index on another column (e.g., device_id).
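For instance, on a hypothetical conditions hypertable partitioned on time, a composite index supporting per-device time-range queries could look like this:

```sql
-- Filter by device first, then scan a time range in descending order.
CREATE INDEX idx_conditions_device_time
    ON conditions (device_id, time DESC);
```

Placing device_id before time lets the index serve queries that select one device and then a time range, a common access pattern for time-series data.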
To define an index as a UNIQUE or PRIMARY KEY index in Timescale, the index must include the time column and the partitioning column (if you are using one).
When you create a hypertable in Timescale, set the datatype for the time column as timestamptz and not timestamp.
In partitioned tables, it is extra important to pay attention to over-indexing, as every index you add introduces extra overhead during data ingestion.
If you have sparse data (i.e., columns that are often NULL), you can add a WHERE column IS NOT NULL clause to the index. This prevents NULL rows from being indexed, which can lead to a more compact and efficient index.
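A sketch of such a partial index, using hypothetical table and column names:

```sql
-- Only rows with a non-NULL battery_level enter the index,
-- keeping it smaller for sparse data.
CREATE INDEX idx_conditions_battery
    ON conditions (battery_level)
    WHERE battery_level IS NOT NULL;
```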
One of the benefits of partitioned tables is the ability to quickly drop partitions (for example, old data in a time-based partitioning setup). When a partition is dropped, its associated indexes are also dropped, which is a much faster operation than deleting rows and subsequently cleaning up associated index entries. This operation is even faster in Timescale due to its data retention policies and continuous aggregates.
Indexing strategies should consider how data in partitioned tables is accessed. For example, if older partitions (e.g., in a time-based partitioning setup) are rarely accessed, it might not be worth maintaining certain indexes on them, especially if they consume significant space.
Efficient PostgreSQL indexing is a potent ally for fine-tuning data retrieval within your applications. Indexing is a balancing act, but creating the appropriate indexes and maintaining them will significantly enhance the overall performance of your PostgreSQL database.