Database Scaling: PostgreSQL Caching Explained

Follow us, friends, as we take a journey backward in time. We're going back to 1990, when soft rock was cool, and fanny packs were still okay. But we're not going there to enjoy the music and hang out at the mall. We’re going there to talk about database scaling and PostgreSQL caching. We’re going there because that was the last time PostgreSQL made simple sense—at least when it comes to resource management.

It was a time when the network was slower than a hard drive, the hard drives were slower than memory, and memory was slower than CPU. Back then, there was no such thing as a file system cache, hard drive cache, or operating system cache. Stuff like the Linux kernel was just a gleam in Linux Torvalds eye.

Why are we going there, you might ask? To be honest, because your poor author is a bit lazy. And because it's less likely that you'll be overwhelmed by the description we're about to give you.

PostgreSQL implemented a strategy to speed up access to data in those special years of clarity and simplicity. The basic idea was simple. Memory is faster than disk, so why not keep some of the most used stuff in memory to speed up retrieval? (Cue the sinister laughter from the future.)

This improvement has proven far more effective and valuable than the original authors probably envisioned. As PostgreSQL has matured over the years, the shared memory system matured with it. This most basic idea, what we commonly know as caching, continues to be very useful—in fact, the second most useful thing in PostgreSQL besides the working memory for each query. However, it is becoming less and less accurate over time, and other factors are becoming more prominent.

We are going to start at the beginning and then introduce the pesky truth—much in the same way that it blindsided the developers of PostgreSQL caching. As we go along, I can hear the advanced users of PostgreSQL. They are saying things like "except when," "unless," and "on the such-and-such platform." Yes, yes. We may or may not get around to your favorite exception to the rule. If we don't, apologies in advance for skipping it in the name of clarity and simplicity. This is not the last article we will ever write (in fact, there are already two more planned in the series, so stay tuned!). Please share your thoughts in the Timescale Forum blog channel and we'll try to get there in the next few go-arounds.

The good news is that I hope to introduce you to the concepts in a digestible format and pace. The bad news is that caching is a huge problem domain, and it will take a while to introduce you to all those concepts if you want to learn more about database scaling. Keep reading, and the information will get more useful and accurate over time.

Scaling Your Database: A Trip Down Shared Memory Lane

Back to 1990. There were basically two problems to solve to have a practical design for shared memory. The first one is that PostgreSQL is a multi-process system by design, so things happening in parallel processes can (and do) affect each other. The other is that the overhead of managing the memory system can't take more time than it would have just to retrieve the data anyway.

The good news for the first problem is that we already had a similar problem in the form of file system access. To solve that problem, we used an in-memory locking table.

Access to the file system is doled out by the postmaster process (the main one that creates all the other processes). Any other PostgreSQL process that wants to access a file has to ask the postmaster "pretty please" first. These locks are maintained in memory associated with the postmaster process. There isn't much reason to maintain them elsewhere because if the main process dies, the database will be unlocked. In other words, all the other processes working on files will be closed, and all locks released.

These requirements are suspiciously close to the same requirements for shared memory access. For the file system, we call this "locking" or, for memory, "latching." For the Postgres shared memory system, we call them "pins." Pins are very similar to locks but much simpler. All we have to care about is reading or writing to memory. So there are only two types of pins.

Now that we have the cooperation system down to two actions and a bit of memory, the next issue to solve is finding what you want when you need it. This is a simple matter of a memory scan. In PostgreSQL, the files on disk that store the actual table data are managed already with a page and leaf descriptor.

These descriptors are simply an indicator of the location of a row within a database file. The format of a page is described in the manual.

Curiously, in that description it also says:

> All the details can be found in src/include/storage/bufpage.h.

Which is a reference to the shared memory code. It turns out that every disk write operation is handled in memory first. The ctid (page and leaf location of the eventual location in the data files for the table) is assigned before the data is written to disk.

That allows the pages in memory to be "looked up" using the same description the file system uses, even if the data hasn't yet been written to the file system. Clever, eh?

💡

We could go off on a tangent here about how the journaling system works and why a page could be in the journal (known as the write-ahead log) and memory but not written to the data file system yet. That is a topic for another day. Suffice it to say for now that durability is guaranteed by writing to the journal, so this is fine and dandy. In a future article, we'll also talk about how this buffering of writes acts as a backup for the journal. Again, that's getting ahead of ourselves.

Accessing shared memory

Each connection to the database is handled by a process that the PostgreSQL developers affectionately call a “backend.” This process is responsible for interpreting a query and providing the result. In some cases, it can retrieve that result from the shared memory held by the postmaster process. To access shared memory, we have to ask if the buffer system in the postmaster keeps a copy. The postmaster responds with one of two options:

No, these aren’t the pages you’re looking for.
Yes, and this is what it might look like.

"Might" in this case, because we are now beginning to see the effects of the first issue mentioned above. No, don't look back there; we'll repeat it here. The issue is that the processes are also affecting each other. A buffer may change based on any process still in flight acting on it. So, if we want to know that the buffer is valid, we have to read it while we "pin" it.

The semantics of this are much the same as the file system. Any number of processes may access the buffers for reading purposes. The postmaster simply keeps a running list of these processes. When any process comes along with a write operation and makes a change to the buffer, all of the ones that were reading it get a notice that the contents changed in flight. It is up to each "backend" (process handling connections to the user) to reread the buffer and validate that the data continues to be "interesting." That is, the row still matches the criteria of the query.

Since the data in shared memory is managed in pages, not rows, the particular row that a query was accessing may or may not have actually changed at all. It may have just had the misfortune of being in roughly the same place as something else that changed. It may have changed, but none of the columns that are a part of the query criteria were affected. It may have changed those columns, but in a way that still matches. Or the row may now no longer be a part of what the query was searching for. This is all up to the parallel processes handling the user query to decide.

Assuming that the data has made it through this gauntlet, it can be returned to the caller. We can reasonably assume that the row looks exactly like what would have been returned had we looked it up in the file system instead of memory. Despite having to work with a form of locking and lookup, we also presume that this was cheaper than spinning up a disk and finding and reading the data.

PostgreSQL’s Shared Memory: The Design Principles

Now that we know how the basic process of accessing shared data works, let's have a few words about why it was originally designed this way. PostgreSQL is an MVCC (multi-version concurrency control) system—another topic beyond this article's scope to explain. For the moment, we'll condense this to the point of libel. INSERT, DROP, TRUNCATE and SELECT are cheap. UPDATE, DELETE and MERGE are expensive. This is largely due to the tracking system for row updates. And yes, DELETE is considered an UPDATE for tracking purposes.

💡

PostgreSQL actually doesn’t UPDATE or DELETE rows. For tracking purposes, it maintains a copy of every version of a row that existed. On UPDATE, it creates a new row, makes the changes in the new row, and marks the row as current for any future transactions. For DELETE, it just marks the existing row as no longer, well, uh, existing. This is called a tombstone record. It allows all transactions in flight (and future transactions) to know that the row is dead, creating yet another cleanup problem, which we’ll (hopefully) talk about in future articles.

The caching system follows the same coding paradigm to have the same performance characteristics. It is possible in an alternate universe that there is a cheaper solution for caching that provides "better" concurrency. That being said, the overall system is as fast as the slowest part. If the design of the caching system wildly diverged from the file system, the total response to the caller would suffer at the worst-performing points of both systems.

Also, this system is tightly integrated into the PostgreSQL query planner. A secondary system (of nearly any kind) would likely introduce inefficiencies that are far greater than any benefits would be likely to cover.

And lastly, the system acts not only as a way to return the most-sought data to the caller but also as a change buffer for data modifying queries. Multiple changes may be made to the same row before being written to the file system. The background writer (responsible for making the journal changes permanent in the data files) is intelligent enough to save the final condition of the row to disk. This added efficiency alone pays for a lot of the complexity of caching.

Cache eviction

There are a few outstanding things to consider in the design of PostgreSQL caching. The first is that, eventually, something has to come along and evict the pages out of the cache. It can't just grow indefinitely. That something is called the background writer. We know from the design above that PostgreSQL keeps track of which processes find the data interesting. When that list of processes is at zero entries, and the data hasn't been accessed in a while, the background writer will mark the block as "reusable" by putting it in a list called the "free space map." (Yes, this is much the same as the autovacuum process for the file system).

In the future, the memory space will be overwritten by some other (presumably more active) data. It's the circle of life. Buffer eviction is a garbage collection process with no understanding of what queries are in flight or why the data was put into the buffer. It just comes along on a timer and kicks things out that haven't been active in a while.

Forced eviction considered evil

Also, the backend processes we have already mentioned may decide that they need a lot of memory to do some huge operation and request the postmaster commit everything currently staged to disk. This is called a buffer flush, and it is immediate. It is miserable for performance to get into a position where a buffer flush is necessary. All concurrent processes will halt until the flush is completed and verified. <== A horrible statement in a concurrent database.

The postmaster may decide to flush the buffer cache to respond to some backend process. This is just as horrible as the previous paragraph for the same reasons.

Hey, I Was Using That

PostgreSQL is paying attention (by the autovacuum process) to which data blocks are being accessed in the file system. If these accesses reach a threshold, PostgreSQL will read the block from the disk and stick it back in the cache because it seems to answer many questions. This process is blind to the queries that access the data, the eviction process, and anything else, for that matter. It's the old late-night kung-fu flick version of sticking the commercials in there anywhere. There is no rhyme or reason to where these blocks will end up in the buffer memory space, but they seem interesting, so in you go.

In fact, the blocks in memory are effectively unordered. Because of the process of eviction and restoration, no spatial order is guaranteed (or even implied.) This means that a "cache lookup" is effectively reading the page and leaf location for each block every time the cache is accessed. The postmaster holds a list of pages in the cache in order—much like a hash index—with the memory location of each block. As the size of the cache increases, additional lookups in the "cache index" are implied.

Database Scaling: PostgreSQL Caching Explained