Teaching Postgres New Tricks: SIMD Vectorization for Faster Analytical Queries
After more than a year in the works, we’re proud to announce that the latest release of TimescaleDB (TimescaleDB 2.12) has added a vectorized query pipeline that makes Single Instruction, Multiple Data (SIMD) vectorization on our hybrid row columnar storage a reality for PostgreSQL. Our goal is to make common analytics queries an order of magnitude faster, making the world’s most loved database even better.
We’ve already built a mechanism for transforming your PostgreSQL tables into hybrid row columnar stores with our native columnar compression. When you compress data you get the immediate benefit of significantly reducing storage size, and you get the secondary benefit of spending less CPU time waiting for disk reads. But there is another avenue for optimization that comes from columnar storage, and we are now focused on unlocking its potential to set analytical queries on fire.
(Here's a sneak preview so you can see what we're talking about.)
If you thought you had to turn to a specialized “analytics” columnar database to serve your queries, think again. In this article, we walk you through how we’ve supercharged PostgreSQL with vectorization, or to be more precise, implemented a vectorized query execution pipeline that lets us transparently unlock the power of SIMD, so you can start on Postgres, scale with Postgres, and stay with Postgres—even for your analytical workloads.
From Postgres Scaling Issues to Vectorization
The decision to implement vectorized query execution in TimescaleDB comes from a long line of initiatives aimed at improving PostgreSQL’s experience and scalability. Before we get into the technical details, let’s start by discussing where developers reach the limits of Postgres and how vectorization can help.
You love Postgres (doesn’t everyone?) and chose it to power your new application because using a rock-solid, widely-used database with an incredibly diverse ecosystem that supports full SQL just makes sense.
Things are going really well: development is easy, and the application launches. You might be working with IoT devices, sensors, event data, or financial instruments—but whatever the use case, as time moves on, data starts piling up. All of a sudden, some of the queries that power your application mysteriously begin to get slower. Panic starts to set in. 😱
Fast forward a few weeks or months, and something is off. You’re spending money on adding additional resources to the database and burning precious developer time trying to work out what’s broken. It doesn’t feel like anything is wrong on the application side, and tuning PostgreSQL hasn’t helped. Before you know it, someone has proposed splitting part of the workload into a different (perhaps “purpose-built”) database.
Complexity and tech debt rocket as the size of your tech stack balloons, your team has to learn a new database (which comes with its own set of challenges), and your application now has to deal with data from multiple siloed systems.
Teaching Postgres New Tricks to Make This Journey Smoother
This is the painful end-state that Timescale wants to help avoid, allowing developers to scale and stay with PostgreSQL. Over the years, TimescaleDB has made PostgreSQL better with many features to help you scale smoothly, like hypertables with automatic partitioning, native columnar compression, improved materialized views, query planner improvements, and much more. If it holds you back in PostgreSQL, we want to tackle it.
Which brings us to today’s announcement…
For the past year we’ve been investigating how to extend PostgreSQL to unlock techniques used by specialized analytics databases custom-built for OnLine Analytical Processing (OLAP), even while retaining ACID transactions, full support for mutable data, and compatibility with the rest of the wonderful ecosystem. We don’t have the luxury of building a database from the ground up for raw performance (with all the trade-offs that typically entails), but we think where we have ended up offers a unique balance of performance, flexibility, and stability.
You can teach an old elephant new tricks and sometimes get an order of magnitude speedup when you do!
Columnar Storage in PostgreSQL
Before we launch into the vectorization and SIMD deep dive, we need to set the scene by explaining the other feature that makes it possible: our compressed columnar storage.
By default, PostgreSQL stores and processes data in a way that is optimized for operating on data record by record (or row) as it’s inserted. The on-disk data files are organized by row, and queries use a row-based iterator to process that data. Paired with a B-tree index, a row-based layout is great for transactional workloads, which are more concerned with quickly ingesting and operating on individual records.
Databases that optimize for raw analytical performance take the opposite approach to PostgreSQL—they make some architectural trade-offs to organize writes with multiple values from one column grouped on disk. When a read happens, a column-based iterator is used, which means only the columns that are needed are read.
Column-organized, or columnar, storage performs poorly when an individual record is targeted or when all columns are requested, but performs amazingly well for the aggregate or single-column queries that are common in analytics or used to power dashboards.
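To make the two layouts concrete, here is a minimal Python sketch (purely illustrative—the column names and values are hypothetical, not TimescaleDB internals) contrasting a row layout with a columnar layout for temperature readings:

```python
# Row layout: each record's fields are stored together,
# ideal for inserting or fetching a whole record at once.
rows = [
    {"ts": 1, "device": "a", "temp": 20.1},
    {"ts": 2, "device": "b", "temp": 20.3},
    {"ts": 3, "device": "a", "temp": 20.2},
]

# Columnar layout: all values of one column are stored together,
# so a query that only touches `temp` never reads `ts` or `device`.
columns = {
    "ts": [1, 2, 3],
    "device": ["a", "b", "a"],
    "temp": [20.1, 20.3, 20.2],
}

# An aggregate over one column scans a single contiguous array.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
```

The aggregate touches one array instead of walking every record, which is exactly the access pattern the diagram below illustrates.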
To clarify things, the following diagram shows how a row store and a column store would logically lay out data from devices measuring temperature.
Row vs. Columnar Storage: Why Not Both?
Traditionally, you had to choose between a database that supported a row-based format optimized for transactional workloads or one that supported a column-based format targeted at analytical ones. But what we saw over and over again with our customers is that, with the same dataset, they actually wanted to be able to perform transactional-style operations on recent data and analytical operations on historical data.
Timescale is built on Postgres, so we can store data using Postgres’ native row format effortlessly. We have also built out the ability to organize data by columns through our native columnar compression (check out this recent deep dive into the technical details). You can keep recent data in a row format and convert it to columnar format as it ages.
Both formats can be queried together seamlessly, the conversion is handled automatically in the background, and we can still support transactions and modifications on our older data (albeit less performantly).
When you’re working with columnar data, the benefit for analytical queries is immense. In tests using the Time-Series Benchmark Suite, some aggregate queries over columnar storage came in 5x, 10x, and in some cases even up to 166x faster than over row-based storage (due to lower I/O requirements and metadata caching), while taking 95% less space to store (due to our columnar compression).
But can we make this faster? Read on!
Vectorization and SIMD—Oh My!
Now that we have data in a columnar format, we have a new world of optimization to explore, starting with vectorization and SIMD. Modern CPUs are amazing feats of engineering, supporting SIMD instruction sets that can process multiple data points with a single instruction, executing faster while also making much better use of memory bandwidth and caches. (The exact number of values they can process depends on the register size of the CPU and the data size; with a 128-bit register, each vector could hold 4 x 32-bit values, for a theoretical 4x speedup.)
A regular (or scalar) CPU instruction receives two values and performs an operation on them, returning a single result. A vectorized SIMD CPU instruction processes two same-sized vectors (a.k.a. arrays) of values simultaneously, executing the same operation across both vectors to create an output vector in a single step. The magic is that the SIMD instruction takes the same amount of time as its scalar equivalent, even though it’s doing more work.
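Conceptually—modeled here in Python rather than real CPU intrinsics—the scalar and SIMD versions of the same addition look like this:

```python
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Scalar execution: one add instruction per element, four instructions total.
scalar_result = []
for x, y in zip(a, b):
    scalar_result.append(x + y)

# SIMD execution: a single vector instruction adds all four lanes at once
# (e.g., 4 x 32-bit values in a 128-bit register). We model the whole-vector
# operation as one function call.
def simd_add(va, vb):
    return [x + y for x, y in zip(va, vb)]  # one "instruction" over all lanes

vector_result = simd_add(a, b)
```

Both produce the same output vector; the difference is that the hardware executes the SIMD version in roughly the time of one scalar add.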
Implementing vectorized query execution on top of our compressed columnar storage has been a significant focus for Timescale over the last year. It quickly became evident that implementing a vectorized query pipeline is one of the most exciting areas for optimization we can tackle—with performance increases by an order of magnitude on the table.
Timescale’s Vectorized Query Execution Pipeline
As of version 2.12, TimescaleDB supports a growing number of vectorized operations over compressed data, with many more coming in 2.13 and beyond. When we were starting, one of the biggest challenges was integrating the built-in PostgreSQL operators, which process data in row-based tuples, with our new vectorized pipeline, which kicks in when a batch is decompressed and completes when the batch is aggregated.
This becomes very clear when we look at an aggregate query. To vectorize aggregation, the aggregation step needs to happen inside our vectorization pipeline (and not at a higher level, where PostgreSQL would normally handle it).
However, because a single query could be returning data from both an uncompressed and a compressed chunk (our abstraction for partitioning tables) at the same time, we also need to return the same type of data in both cases (even though no vectorization takes place for the uncompressed data). We did this by changing both plans' output to PostgreSQL Partial Aggregate nodes (which were actually developed for parallel aggregation) rather than raw tuples. PostgreSQL already knows how to deal with partial aggregates, so this gives us a common interface to work with that allows early aggregation.
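The idea behind partial aggregation can be sketched as follows (a deliberate simplification—real PostgreSQL partial aggregate states are richer than a plain tuple, and the function names here are hypothetical):

```python
# Each chunk -- compressed or not -- produces a partial state instead of raw rows.
def partial_sum(chunk_values):
    # Partial state for sum(): (running_sum, row_count)
    return (sum(chunk_values), len(chunk_values))

# The finalize step combines partial states from all chunks into one result.
def finalize(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

uncompressed_chunk = [1.0, 2.0]      # aggregated row by row
compressed_chunk = [3.0, 4.0, 5.0]   # aggregated by the vectorized path

partials = [partial_sum(uncompressed_chunk), partial_sum(compressed_chunk)]
total, count = finalize(partials)
```

Because every chunk emits the same partial-state shape, the plan above the chunks doesn't need to care which execution path produced it—that is the common interface the Partial Aggregate nodes give us.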
The following diagram contains a query plan for an aggregation query and shows how an uncompressed chunk, a compressed chunk with vectorization disabled, and a compressed chunk with vectorization enabled all flow up to the same PostgreSQL Append node.
But doing this had an amazing side effect: we could now do early aggregation for uncompressed chunks too! In fact, when we committed this in TimescaleDB 2.12, we saw a consistent 10-15% speedup across all aggregation queries that operated on hypertables, even before we got to implementing vectorization (interestingly, a large part of this improvement comes from working with smaller datasets when aggregating; for example, smaller hash tables).
Now that we could keep PostgreSQL happy when aggregating by using Partial Aggregates, we turned our attention to the start of the pipeline. We knew we needed to convert the compressed representation into an in-memory format that each of our vectorization stages could use.
We chose to update our decompression node to read compressed data and output data in the Apache Arrow format, allowing us to quickly and transparently perform SIMD operations at each stage of the execution pipeline.
So, now that we have a vectorization pipeline, we need to find operations to vectorize that can benefit from SIMD. Let’s start with an example: consider a typical dashboard query that shows some aggregated metrics on a table with all data older than one hour compressed:
```sql
SELECT time_bucket(INTERVAL '5 minute', timestamp) AS bucket,
       metric,
       sum(value)
FROM metrics
WHERE metric = 1
  AND timestamp > now() - INTERVAL '1 day'
GROUP BY bucket, metric;
```
Among the many things this query does are four crucial, computationally expensive stages that can be vectorized to use SIMD:
- Decompressing the compressed data into the in-memory Apache Arrow format
- Checking two filters in the WHERE clause, one for metric and one for time
- Computing one expression using the time_bucket function
- Performing aggregation using the SUM aggregate function
All of these stages benefit from vectorization in a slightly different way; let’s dig into each of them.
We know that compression is a good thing, and when we decompress data, the CPU overhead incurred is almost always offset by the I/O savings from reading a smaller amount of data from disk. But what if we could use our CPU more efficiently to decompress data faster? In TimescaleDB 2.12, we answered that question with a 3x decompression speedup when using SIMD over vectorized batches where the algorithms support it.
While we raised our decompression throughput ceiling to 1 GB/second/core, there is more work to be done. Some parts of our modified Gorilla compression algorithm for floating-point values, as well as some custom algorithms for compressing small and repeated numbers (see this blog post for more algorithm details), don’t allow full use of SIMD because of the way they lay out compressed data, with internal references or complex flow control blocking us from unlocking more performance.
Looking to the future, we have identified some new algorithms designed with SIMD in mind, which can go an order of magnitude faster, so watch this space. 👀
On top of the speed benefits, vectorized decompression is where we convert our on-disk compression format into our in-memory Apache Arrow format that the rest of our vectorization pipeline consumes.
The next stage of query processing is applying compute-time filters from WHERE clauses. In an ideal analytical query, most of the data that doesn't match the query filters is not even read from storage. Unneeded columns are skipped, metadata is consulted to exclude entire columnar batches, and conditions are satisfied using indexes.
However, the real world is not ideal, and for many queries, not all conditions can be optimized like this. For example, when a filter (e.g., a WHERE clause on a time range) partially overlaps a compressed batch, then some of the batch (but not all of it) has to be used to calculate the result.
In this case, vectorized filters can provide another large performance boost. As Apache Arrow vectors stream out of our decompression node, we can use SIMD to check each filter condition very efficiently by comparing the stream to a vector of constants. Using the example from above (namely, WHERE metric = 1 AND timestamp > now() - INTERVAL '1 day'), we would compare the metric column against the value 1 and the timestamp column against now() - INTERVAL '1 day'.
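A vectorized filter essentially produces a validity mask per batch, sketched here in Python (real implementations compare whole SIMD registers at a time and emit bitmaps; the batch values are made up for illustration):

```python
# One decompressed batch of the `metric` column, plus the constant to test.
metric_batch = [1, 3, 1, 1, 2, 1]
constant = 1

# Vectorized comparison: one logical operation over the whole batch,
# producing a boolean mask instead of branching per row.
mask = [v == constant for v in metric_batch]

# Later pipeline stages only touch rows where the mask is set; a second
# filter (e.g., on timestamp) would AND its own mask into this one.
matching = [v for v, keep in zip(metric_batch, mask) if keep]
```

Branch-free mask generation is what lets the CPU chew through a batch without mispredicted branches per row.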
This optimization should be released in TimescaleDB 2.13, with early benchmark results against some real-world data showing up to a 50% speedup on common queries.
But that’s not all vectorized filters can provide! Previously, even for compressed queries, all data was read from disk before filters were applied (a holdover from the read-the-whole-row behavior that PostgreSQL employs by default).
Now that we are living in a columnar world, we can optimize this using a technique called “lazy column reads,” which reads the required columns for a batch early, in the order they are defined in the WHERE clause. If any filter eliminates the batch, it is discarded with no further I/O incurred. For queries with filters that remove a large number of full batches of records, this can result in an additional 25-50% speedup.
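Lazy column reads can be sketched like this (the batch/filter structures here are hypothetical, and "discard when no row matches" is a simplification of the real batch-elimination logic):

```python
# Each filter names a column and a predicate; a column is fetched only
# when its filter is evaluated, in WHERE-clause order.
def batch_matches(read_column, filters):
    for column_name, predicate in filters:
        values = read_column(column_name)        # I/O happens here, lazily
        if not any(predicate(v) for v in values):
            return False                         # batch discarded: later columns never read
    return True

reads = []  # track which columns actually incurred I/O
def read_column(name):
    reads.append(name)
    batch = {"metric": [2, 3], "value": [0.5, 0.7]}
    return batch[name]

# The `metric` filter fails for the whole batch, so `value` is never read.
keep = batch_matches(read_column, [("metric", lambda v: v == 1),
                                   ("value", lambda v: v > 0)])
```

When a batch fails an early filter, the remaining columns are simply never decompressed or read—that skipped I/O is where the extra speedup comes from.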
Another important part of vectorized query pipelines is computing various expressions (projections) of columns that might be present in the query. In the simplest cases, this allows the use of the common vector CPU instructions for addition or multiplication, increasing the throughput.
More complex operations can benefit from handcrafted SIMD code, such as converting string case, validating UTF-8, or even parsing JSON. More importantly, in some cases, the vectorized computation of expressions is a prerequisite for vectorizing subsequent stages of the pipeline, such as grouping. For example, in the dashboard query we presented at the beginning of this section, the grouping is on time_bucket, so the result of this function must have a columnar in-memory representation to allow us to vectorize the grouping itself.
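For instance, bucketing a column of epoch timestamps into 5-minute buckets is pure element-wise arithmetic that maps cleanly onto SIMD; a Python sketch (not TimescaleDB's actual time_bucket implementation, which also handles time zones and offsets):

```python
BUCKET_SECONDS = 300  # 5-minute buckets

# Vectorized projection: the same arithmetic applied to every element,
# producing a new column the grouping stage can consume directly.
def time_bucket_vector(epoch_seconds):
    return [ts - ts % BUCKET_SECONDS for ts in epoch_seconds]

buckets = time_bucket_vector([0, 299, 300, 601])
```

Because the output is itself a columnar vector, a vectorized GROUP BY can hash or compare whole batches of bucket values at once.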
We haven’t made a start on vectorizing expressions yet because aggregations will have a more immediate impact on analytics queries—but fear not, we will get to them!
Finally, the computation of most aggregate functions can also be vectorized to take advantage of SIMD. To demonstrate that this can work inside PostgreSQL as a partial aggregate, we built a high-throughput summation function that uses SIMD when working on columnar/compressed data, targeting the basic use case of SELECT sum(value) FROM readings_compressed (we currently support filters on the segment_by column). Without further optimization, we saw a 3x speedup on compressed data.
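The core trick in a SIMD summation is accumulating into multiple independent lanes and combining them at the end, which breaks the serial dependency of a naive loop. A Python sketch of the idea (illustrative only; real code uses actual vector registers and handles floating-point ordering carefully):

```python
LANES = 4  # e.g., 4 x 32-bit lanes in a 128-bit register

def vectorized_sum(values):
    # Independent per-lane accumulators let the CPU add LANES values per step.
    acc = [0.0] * LANES
    n = len(values) - len(values) % LANES
    for i in range(0, n, LANES):
        for lane in range(LANES):        # one SIMD add in real code
            acc[lane] += values[i + lane]
    # Horizontal sum of the lanes, plus a scalar pass over the tail.
    return sum(acc) + sum(values[n:])

total = vectorized_sum([1.0] * 10)
```

The per-lane partial sums are also exactly the shape of state that plugs into the Partial Aggregate interface described earlier.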
Obviously, SUM is only one of the large set of aggregate functions that PostgreSQL provides (and TimescaleDB extends with hyperfunctions). So, in forthcoming versions, we will optimize our approach to aggregates and deliver vectorized functions with the eventual goal of supporting the full set of built-in and hyperfunction aggregates.
Adding It All Up
We’ve been showing you a lot of speedups, but how do they stack up in the real world?
We ran two simple queries: one that uses the vectorized SUM aggregate, and one that makes use of vectorized filters (unfortunately, these can’t be combined at the moment). Both queries were run on the same data (about 30 million rows) four times, to show the gains from row-based storage, columnar storage (without a segment_by in this case), vectorized decompression, and finally the last vectorized stage (aggregation or filter, depending on the query).
We think the numbers can speak for themselves here 🔥.
Nothing gets us more excited at Timescale than finding smart solutions to hard problems which let people get more out of PostgreSQL. Since the first code for our vectorization pipeline hit Git, our internal Slack channels have been full of developer discussion about the optimizations and possibilities that vectorization on top of our columnar compression unlocks.
Looking forward, we project that we can achieve order-of-magnitude performance improvements on some queries, and we’ve only started scratching the surface of what’s possible.
It’s an amazing time to be using Postgres.
Create a free Timescale account to get started quickly with vectorization in TimescaleDB today.