Mar 14, 2024
Delta-delta encoding, Simple-8b, XOR-based compression, and more - These algorithms aren't magic, but combined they can save over 90% of storage costs and speed up queries. Here’s how they work.
Computing is based on a simple concept: the binary representation of information. And as computing infrastructure has gotten cheaper and more powerful, we have asked it to represent more and more of our information, in the form of data we collect (which is often time-series data).
But computing is not free. The more efficiently we can represent that information, the more we can save on storage, compute, and bandwidth. Enter compression: “the process of encoding information using fewer bits than the original representation.” (source)
Compression has played an important role in computing for several decades. As a concept, compression is even older: “Morse code, invented in 1838, is the earliest instance of data compression in that the most common letters in the English language such as “e” and “t” are given shorter Morse codes.” (source)
In this post, we set out to demystify compression. To do this, we explain how several lossless time-series compression algorithms work, and how you can apply them to your own projects.
We also explain how we implement them in TimescaleDB, the first open-source relational database to use these time-series compression algorithms, and achieve 90%+ storage efficiencies. [1]
[1] We use the term “we” throughout this article to represent the engineers who developed this capability: Josh Lockerman, Matvey Arye, Gayathri Ayyapan, Sven Klemm, and David Kohn.
We all collect time-series data. Whether we are measuring our IT systems, web/mobile analytics, product usage, user behavior, sensor/device data, business revenue, etc., time-series data flows through our data pipelines and applications, and enables us to better understand our systems, applications, operations in real-time.
One of the challenges of time-series data is storage footprint. In order to analyze data over time, we insert new data (i.e., instead of updating existing data) on every measurement. Some time-series workloads also have high insert rates (e.g., IT monitoring, IoT sensor data). As a result, time-series datasets often scale well into the terabytes and more.
In order to achieve high-performance while maintaining resource efficiency, we first identified several best-in-class time-series compression algorithms, and then implemented them in TimescaleDB.
These algorithms are quite powerful. According to our users, these algorithms have helped them achieve 90%+ lossless compression rates. This translates into 90%+ storage cost savings, which can mean thousands of dollars (and in some cases, tens of thousands of dollars) of savings per year. These algorithms also lead to compute performance improvements: as more data fits in less space, fewer disk pages need to be read to answer queries. [2]
[2] A 10TB disk volume in the cloud is more than $12,000 per year itself (at $0.10/GB/month for AWS EBS storage), and additional HA replicas and backups can grow this number by another 2-3x. Achieving 95% storage can save you over $10K-$25K per year in storage costs alone ($12K/10TB * 10TB/machine * 2 machines [one master and one replica] * 95% savings = $22.8K).
First of all, they’re not magic, but clever computer science techniques. Here are the set of compression algorithms we'll explain, grouped by data type:
Integer compression:
Floating point compression:
Data-agnostic compression:
Delta-encoding (also known as Delta compression) reduces the amount of information required to represent a data object, by only storing the difference (or delta) between that object and one or more reference objects. These algorithms work best where there is a lot of redundant information: for example, in versioned file systems (e.g., this is how Dropbox efficiently syncs your files).
Applying delta-encoding to time-series data makes a lot of sense: we can use fewer bytes to represent a data point by only storing the delta from the previous data point. (In fact, given enough coffee and time, we would argue that versioned file systems themselves are time-series datasets, but we’ll save that discussion for another time.)
For example, imagine we were collecting a dataset that collected CPU, free memory, temperature, and humidity over time (time stored as an integer value, e.g., # seconds since UNIX epoch).
Under a naive approach, we would store each data point with its raw values:
| time | cpu | mem_free_bytes | temperature | humidity |
|---------------------|-----|----------------|-------------|----------|
| 2020-04-01 10:00:00 | 82 | 1,073,741,824 | 80 | 25 |
| 2020-04-01 10:05:00 | 98 | 858,993,459 | 81 | 25 |
| 2020-04-01 10:05:00 | 98 | 858,904,583 | 81 | 25 |
With delta-encoding, we would only store how much each value changed from the previous data point, resulting in smaller values to store:
| time | cpu | mem_free_bytes | temperature | humidity |
|---------------------|-----|----------------|-------------|----------|
| 2020-04-01 10:00:00 | 82 | 1,073,741,824 | 80 | 25 |
| 5 seconds | 16 | -214,748,365 | 1 | 0 |
| 5 seconds | 0 | -88,876 | 0 | 0 |
Now, after the first row, we are able to represent subsequent rows with less information.
Applying Delta-encoding to time-series data takes advantage of the fact that most time-series datasets are not random, but instead represent something that is slowly changing over time. The storage savings over millions of rows can be pretty substantial, especially when the value doesn’t change at all.
Additional reading: Delta Compression Techniques
Delta-of-delta encoding (also known as “delta-delta encoding”), takes delta encoding one step further: it applies delta-encoding a second time over delta-encoded data. With time-series datasets where data collection happens at regular intervals, we can apply delta-of-delta encoding to the time column, effectively only needing to store a series of 0’s.
Applied to our example dataset, we now get this:
| time | cpu | mem_free_bytes | temperature | humidity |
|---------------------|-----|----------------|-------------|----------|
| 2020-04-01 10:00:00 | 82 | 1,073,741,824 | 80 | 25 |
| 5 seconds | 16 | -214,748,365 | 1 | 0 |
| 0 | 0 | -88,876 | 0 | 0 |
In this example, delta-of-delta further compresses “5 seconds” down to “0” for every entry in the time column after the second row. (Note that we need two entries in our table before we can calculate the delta-delta, because we need two delta’s to compare.)
This compresses a full timestamp (8 bytes = 64 bits) down to just a single bit (64x compression). (In practice we can do even better by also applying Simple-8b + RLE. More below.)
In other words, delta-encoding stores the first derivative of the dataset, while delta-of-delta encoding stores the second derivative of the dataset.
With Delta (and delta-of-delta) encoding, we’ve reduced the number of digits we needed to store. Yet we still need an efficient way to store these smaller integers. Here’s an example that illustrates why: Say, in our previous example, we still use a standard integer datatype (which takes 64 bits on a 64-bit computer) to represent the value of “0” when delta-delta encoded. Thus, even though we are storing “0”, we are still taking up 64 bits – we haven’t actually saved anything.
Enter Simple-8b, one of the simplest and smallest methods of storing variable length integers.
In Simple-8b, the set of integers is stored as a series of fixed-size blocks. For each block of integers, each integer is represented in the minimal bit-length needed to represent the largest integer in that block. The first bits of each block denotes that minimum bit-length for the block.
This technique has the advantage of only needing to store the length once for a given block, instead of once for each number. Also, since the blocks are of a fixed-size, we can infer the number of integers in each block from the size of the integers being stored.
As an example, say we were storing the changing temperature over time, applied delta encoding, and ended up needing to store this set of numbers:
| temperature (deltas) |
|----------------------|
| 1 |
| 10 |
| 11 |
| 13 |
| 9 |
| 100 |
| 22 |
| 11
In other words, our data looks like this:
1, 10, 11, 13, 9, 100, 22, 11
With a block size of 10 digits, we could store this set of numbers in a Simple-8b-like scheme as two blocks, one storing 5 2-digit numbers, and a second storing 3 3-digit numbers.
{2: [01, 10, 11, 13, 09]} {3: [100, 022, 011]}
As you can see, both blocks store about 10-digits worth of data, even though some of the numbers have to be padded with a leading ‘0’. (Note in our example, the second block only stores 9 digits, because 10 is not evenly divisible by 3).
Simple-8b works very similarly, except it uses binary numbers instead of decimal ones, and generally uses 64-bit blocks. In general, the larger number length, the fewer number of numbers that can be stored in each block:
Additional reading: Decoding billions of integers per second through vectorization includes an even more detailed description of how Simple-8b works.
Simple-8b compresses numbers very well, using approximately the minimal number of digits for each number. However, in certain cases we can do even better.
We can do better in cases where we end up with a large number of repeats of the same value. This can happen if the value does not change very often, or (if you recall from the delta and delta-delta section) if an earlier transformation removes the changes.
As one example, consider our delta transformation of the time and humidity values from earlier. Here, the time column value repeats with “5”, and the humidity column with “0”:
| time | cpu | mem_free_bytes | temperature | humidity |
|---------------------|-----|----------------|-------------|----------|
| 2020-04-01 10:00:00 | 82 | 1,073,741,824 | 80 | 25 |
| 5 seconds | 16 | -214,748,365 | 1 | 0 |
| 5 seconds | 0 | -88,876 | 0 | 0 |
To see how we can do better in representing these repeats of the same value, let’s actually use a less simplified example. Say we were still storing the changing temperature over time, we again applied delta-encoding, but now ended up with this set of numbers:
| temperature (deltas) |
|----------------------|
| 11 |
| 12 |
| 12 |
| 12 |
| 12 |
| 12 |
| 12 |
| 1 |
| 12 |
| 12 |
| 12 |
| 12 |
In other words, our data now looks like this:
11, 12, 12, 12, 12, 12, 12, 1, 12, 12, 12, 12
For values such as these, we do not need to store each instance of the value, but merely how long the run, or number of repeats, is. We could store this set of numbers as {run; value}
pairs like this:
{1; 11}, {6; 12}, {1; 1}, {4; 12}
This technique only takes 11 digits of storage ([1, 1, 1, 6, 1, 2, 1, 1, 4, 1, 2]
), as opposed to the approximately 23 digits that an optimal series of variable-length integers would require ([11, 12, 12, 12, 12, 12, 12, 1, 12, 12, 12, 12]
).
This is Run-length encoding (RLE), which is one of the classic compression algorithms (along with Dictionary compression, discussed later). If you see compression with seemingly absurd ratios -- e.g., fewer than 1 bit per value -- run-length-encoding (or a similar technique) is probably being used. Think about a time-series with a billion contiguous 0’s, or even a document with a million identically repeated strings: both run-length-encode quite well.
RLE is used as a building block in many more advanced algorithms: e.g., Simple-8b RLE, an algorithm that combines both techniques.
In practice, in TimescaleDB we implement a variant of Simple-8b RLE, where we detect runs on-the-fly, and run-length-encode if it would be beneficial. In this variant, we use different sizes than standard Simple-8b, in order to handle 64-bit values and RLE. More information (and the source code) for the variant we built into TimescaleDB can be found on GitHub.
Gorilla, an in-memory time-series database developed at Facebook (and research paper published in 2015), introduced two compression techniques that improve on delta-encoding. The first is delta-of-delta encoding for timestamps, which we already covered above. Here we cover the second, XOR-based compression, which is something that typically applies to floats. (Note: Developers will often refer to “Gorilla compression”, which is generally at least one, if not both, of these techniques.)
Floating point numbers are generally more difficult to compress than integers. Unlike fixed-length integers which often have a fair number of leading 0s, floating point numbers often use all of their available bits, especially if they are converted from decimal numbers, which can’t be represented precisely in binary. [3]
[3] Decimal is in base-10, while floats are in base-2 (binary). This means that decimal numbers can be built out of a series of base-10 fractions, like a/10 + b/100 + c/1000 ... etc, while floats are built out of base-2 fractions like a/2 + b/4 + c/8 ... etc. The numbers representable as sums of these different fractions don't completely overlap, so decimal numbers are rounded to the nearest value in binary. This causes some numbers that are simple to represent in base-10, like 93.9, to get represented in a float as much more complicated numbers, which gets represented as approximately 93.90000152587890625 in binary.
Furthermore, techniques like delta-encoding don’t work well for floats, as they do not reduce the number of bits sufficiently. For example, in our example above, if we stored CPU as a double, the delta remains a number with many significant digits:
| time | cpu |
|-------|--------------|
| t1 | 0.8204859587 |
| t2 | 0.9813528043 |
| DELTA | 0.1608668456 |
Due to these challenges, most floating-point compression algorithms tend to be either complex and slow, or lossy (e.g., by truncating significant digits). (It’s important when evaluating compression algorithms to distinguish between lossless and lossy compression: for example, in the above example, if we truncate the cpu float values to two significant digits, the delta of 0.16 is already much smaller, but we have lost information.)
One of the few simple and fast lossless floating-point compression algorithms is the XOR-based one that the Gorilla paper applies, with very good results:
“We addressed this by repurposing an existing XOR based floating point compression scheme to work in a streaming manner that allows us to compress time series to an average of 1.37 bytes per point, a 12x reduction in size.” (source)
In this algorithm, successive floating point numbers are XORed together, which means that only the different bits are stored. (Quick primer on how a binary XOR operation works.)
The first data point is stored with no compression. Subsequent data points are represented using their XOR’ed values, encoded using a bit packing scheme covered in detail in the paper (and also neatly diagrammed in this blog post).
According to Facebook, with this compression algorithm, over 50% of floating point values (all doubles) were compressed to a single bit, ~30% to 26.6 bits, and the remainder to 39.6 bits. (Reminder, a double is 64 bits).
Additional reading: Gorilla: A Fast, Scalable, In-Memory Time Series Database
One of the earliest lossless compression algorithms, Dictionary compression (in particular, LZ-based compression) is the ancestor of many compression schemes used today, including LZW (used in GIF) and DEFLATE (used in PNG, gzip).
(As a general concept, dictionary compression can also be found in areas outside of computer science: e.g., in the field of medical coding.)
Instead of storing values directly, Dictionary compression works by making a list of the possible values that can appear, and then just storing an index into a dictionary containing the unique values. This technique is quite versatile, can be used regardless of data type, and works especially well when we have a limited set of values that repeat frequently.
For example, say we had an additional column in our time-series dataset storing city location for each measurement:
| City |
|---------------|
| New York |
| San Francisco |
| San Francisco |
| Los Angeles |
| ⋮ |
Instead of storing all the City names directly, we could instead store a dictionary, such as {0: “New York”, 1: “San Francisco”, 2: “Los Angeles”, ...}
and just store the indices [0, 1, 1, 2, ...]
in the column.
For a dataset with a lot of repetition, like the one above, this can offer significant savings. In the above dataset, each city name is on average 11 bytes in length, while the indices are never going to be more than 4 bytes long, reducing space usage nearly 3x. (In TimescaleDB, we compress the list of indices even further using the Simple8b+RLE scheme described earlier, making the storage cost even smaller.)
Like all compression schemes, Dictionary compression isn’t always a win: in a dataset with very few repeated values, the dictionary will be the same size as the original data, making the list of indices into the dictionary pure overhead.
However, this can be managed: in TimescaleDB, we detect this case, and then fall back to not using a dictionary in those scenarios.
TimescaleDB is an open-source time-series database, engineered on PostgreSQL, that employs all of these best-in-class compression algorithms to enable much greater storage efficiency for our users (over 90% efficiency, as mentioned earlier).
TimescaleDB deploys different compression algorithms, depending on the data type:
In particular, we extended classic XOR-based compression and Simple-8b so that we could decompress data in reverse order. This enables us to speed up queries that use backwards scans, which are common in time-series query workloads. (For super technical details, please see our compression PR.)
We have found this type-specific compression quite powerful: In addition to higher compressibility, some of the techniques like XOR-based compression and Delta-of-delta can be up to 40x faster than LZ-based compression during decoding, leading to faster queries.
In addition, if we have data for which none of the aforementioned schemes work, we store the data directly in a specialized array. Doing this provides two notable benefits.
For more information on how we built this capability into TimescaleDB, please read this article and this section of our docs.
If you’d like to test out TimescaleDB and see compression in action, you can create a free Timescale account. And if you have any further questions, please join our community on Slack.