Analyzing the Bitcoin Blockchain: Looking Behind the Hype With PostgreSQL
Spoiler alert: blockchain data is fundamentally time-series data.
There are many definitions for blockchain, but here's a simple one: blockchain is a decentralized and distributed database. Looking further into it, the time dimension of the database is an essential element of every blockchain. Without it, blockchains could not operate: transactions and blocks could not be added. So, it’s not just that each entry in the database has an associated time value—that time value is a fundamental component of how the network functions. Data associated with blockchains is inherently time-series data.
At Timescale, we're always looking for ways to showcase the expanding reach of time-series data. Using our PostgreSQL and TimescaleDB superpowers, we are able to generate insight into the data behind everything in the world. With time-series data, we’re not only able to see how things are, but also how things have changed and are in the process of changing. Measuring change is an essential component of better understanding the world.
For today’s blog post, our real-world dataset is the Bitcoin blockchain. It was an obvious choice of measurable change quickly unfolding right before our eyes and impacting the world around us, in ways both good and bad. The blockchain community is trying to appease or work around challenging tensions with new tools and protocols. As we monitored the Bitcoin network from a data perspective, our findings reflected these tensions, too. We hope our insights will help inform community efforts to generate new paths for innovative and more environmentally friendly projects.
Our data-driven approach led us to the following conclusions:
- Transaction fees on the Bitcoin blockchain will drastically increase long term. As the block reward is halved every four years, transaction fees will increasingly make up a larger share of their revenue. End-users will end up paying significantly more over time for the same transaction.
- The future of Bitcoin is about off-chain and batch transactions at scale. Assuming that adoption will increase, applications built on top of Bitcoin will need to handle the network’s scalability issue, whether by facilitating off-chain transactions or mitigating transaction fees for users through batching transactions. Batching transactions at scale will grow the utility of the Bitcoin blockchain and also limit the per-transaction environmental impact.
How did we reach these conclusions?
- We ingested approximately five years of past Bitcoin on-chain transactions into TimescaleDB (from January 2017 until May 2022).
- We analyzed 285,000 blocks and 540 million transactions (the equivalent of 210 GB of data) with SQL to find interesting insights.
- Finally, we published a tutorial that takes you step-by-step on how to get access, ingest, and analyze blockchain data with TimescaleDB.
If you want to learn how to do your own research, keep reading. We’ll show you how you can use PostgreSQL to unlock a new world of blockchain data applications. To learn how to do this yourself, sign up for Timescale (it’s free; no credit card required!) and dig deeper with our tutorial. If you have any questions, join our 8,000+ member Slack community, and get help from the Timescale Team and users.
Ready to better understand blockchain data? Let’s dive right in!
Bitcoin Blockchain Analysis: Transaction Feels Will Be High
Bitcoin is a proof of work blockchain. Miners are responsible for keeping the network up and running. They are incentivized by receiving a “block reward” after each mined block. Currently, the block reward is 6.25 BTC. Additionally, miners also receive transaction fees from the network users after each transaction.
The long-term problem is that block rewards get halved about every four years. This results in fewer new Bitcoin currency entering the network after every halving, which is why Bitcoin is considered deflationary by some. Over time, miners will also be less rewarded.
Let’s look at a chart that explains why this is a long-term risk for the network’s existence:
The chart analyzes the last six months of average miner revenue. In the current system, if we take away the block rewards from the total miner revenue, miners end up with less than 3 % (green line) of their current total revenue levels. This portion is the transaction fees users pay.
The other 97 % (yellow line) of their revenue comes from block rewards, which will be cut in half every four years. This means that the miners’ revenue will be mainly made up of transaction fees in the long term—assuming the fees will provide enough incentive for the miners to keep the network running.
Transaction fees will make up a larger share of the miners’ revenue with every halving. So, these fees will need to increase drastically for miner revenues to remain the same. This is a hot discussion topic in the community, and there are multiple scenarios describing what will happen to solve this problem. One possible scenario is transaction fees will have to increase at a pace that ensures profit for miners. But high fees could discourage network usage.
At the same time, we can’t discount the substantial environmental impact of Bitcoin. In the United States, it is estimated that Bitcoin mining is responsible for producing over 40 billion pounds of carbon emissions. Each halving results in consequently doubling the environmental impact for the same amount of Bitcoin.
If Bitcoin is to gain true mainstream utility, the incentive models of both miners and environmental concerns must be aligned. This is where off-chain and batch transactions can be useful.
Bitcoin Blockchain Analysis: The Future of Bitcoin is Off-Chain and Batch Transactions at Scale
Transaction fees and volume are correlated. If the volume increases, fees need to increase too. The issue is that Bitcoin, the network itself, is not scalable without changing the protocol. Why? There’s a maximum amount of transactions that the network can process at any given time. For example, the record transaction volume in a day was 335,000 transactions in the past year. But on an average day, you see fewer transactions. For example, in April 2022, the daily transaction volume was between 210,000 and 290,000.
Increased transaction volume causes fees to increase, too. Based on the previous analysis, such a spike will be necessary once block rewards don’t provide enough incentive for miners. The question then becomes: how can Bitcoin remain affordable for users without sacrificing miner profits, thus killing the network?
An often-cited opinion in the community is that other supporting solutions (like Layer 2 protocols) will enable scalability—not necessarily changes in the blockchain protocol itself. These scaling solutions are often use case-based (e.g., Lightning Network that facilitates off-chain instant transactions) or comprise centralized solutions that allow transaction batching (e.g., exchanges do this to decrease or even eliminate transaction fees for their users).
Using these approaches could enable the Bitcoin network to process more transactions than it currently can. But it’s important to understand that most of these transactions, which eventually settle on the main network, will need to happen on Layer 2 and centralized batching solutions.
The transaction fees can be raised drastically with this approach, with end-users only paying a small portion of the fees. The result could be that miners are adequately incentivized even without substantial block rewards, and users interact with the blockchain through Layer 2 and batching solutions.
And, by increasing the utility of each miner’s work, the Bitcoin industry can amortize the environmental impact of Bitcoin across several orders of magnitude more end-user transactions. For Bitcoin enthusiasts and skeptics alike, only proper insight and oversight will guide a positive future for both. For that, you need access to the proper data and the right tools with which to analyze it.
DYOR: Do Your Own Research (With the Right Tools)
We came to these conclusions based on very limited amounts and types of data that only contain Bitcoin on-chain transactions. We didn’t include other important aspects of the blockchain, such as input and output addresses, miner information, and detailed block data.
Analyzing more data could’ve changed the interpretation of some of the analyses in this article. We encourage you to do your research and come to your own conclusions. To help you with this, follow our tutorial on how to ingest and analyze blockchain data.
Monitoring blockchain activities and analyzing data is not necessarily difficult if you use the right tools. Let’s now see how PostgreSQL and TimescaleDB can be the perfect combination if you want to ingest, store, and analyze blockchain activity to build data applications.
Efficient Storage and Querying of On-Chain Data in PostgreSQL
To explore the world of Bitcoin transactions, you need to ensure your database infrastructure is reliable, scalable, and maintainable—especially if you’re building a data application on top of this data.
Using PostgreSQL to store blockchain data gives you a reliable foundation that can keep up with the relentless nature of blockchains—which are always up and running. Add TimescaleDB on top of that, and you also get scalability to store the whole Bitcoin network or any other blockchains in the same database. TimescaleDB is petabyte scale, while most blockchains are less than 500 GB in size, so you have a long way to go.
Furthermore, TimescaleDB also makes it simple to maintain and analyze your database with continuous aggregates, automation policies, or hyperfunctions. Later in the article, you’ll see how these features relate to blockchain data.
You need a schema to store anything in a relational database like TimescaleDB. Let's see a simple yet effective schema that you can start to store Bitcoin transactions.
Database Schema for Bitcoin Transactions
The dataset collected for this blog post uses Blockchair’s data dumps. Those data files, just like the data on the actual blockchain, have more data fields than what this article covers. But still, this example schema provides several data fields (e.g., transaction weight, output, fee, etc.) that can get you started and help you generate insights.
Nonetheless, this schema has a detailed
JSONB column that stores all additional information if you decide to analyze the data fields not covered in the blog post.
Data field descriptions:
CREATE TABLE transactions ( time TIMESTAMPTZ, block_id INT, hash TEXT, size INT, weight INT, is_coinbase BOOLEAN, output_total BIGINT, output_total_usd DOUBLE PRECISION, fee BIGINT, fee_usd DOUBLE PRECISION, details JSONB ); SELECT create_hypertable('transactions', 'time');
This schema vastly simplifies analyzing the dataset but still leaves room for future changes in case you find the data fields in
details useful; they are still available through the
create_hypertable function transforms the regular PostgreSQL table into a TimescaleDB hypertable to partition the table based on the timestamp column, thus optimizing the querying and ingestion of time-series data—while behaving just like a regular PostgreSQL table.
Optimize Storage and Speed Up Time-Series Queries
To make your database “blockchain-scale,” you need to apply some optimization strategies in PostgreSQL and TimescaleDB to enjoy a satisfying experience while maintaining and querying the data.
Using a hypertable to store transactions lays the groundwork for simple and scalable time-series storage. TimescaleDB, being a PostgreSQL database, lets you use all the PostgreSQL optimization strategies (e.g., indexes) along with TimescaleDB features (like continuous aggregates) to get the best of both tools.
In sum, you can achieve the best experience in your database if you use the things that make PostgreSQL exceptional in the relational database world and TimescaleDB exceptional in the time-series world.
Let’s start with indexes.
When you create a hypertable, TimescaleDB automatically adds a B-tree index on the timestamp column of the table. This already improves most of the queries where you filter by the time column.
For example, this is what you can do to find the most recent transactions from the past two weeks in which the transaction fee was higher than the transaction value:
SELECT time, hash, fee*0.00000001 as fee_btc, output_total*0.00000001 as value_btc, output_total_usd as value_usd FROM transactions WHERE "time" > NOW() - INTERVAL '2 weeks' ORDER BY fee_btc DESC
Just by using TimescaleDB, this query performs immediately better than with regular PostgreSQL. Under the hood, TimescaleDB speeds up this query by using the index and excluding chunks that are outside the defined time period. You can also decide to add more indexes based on the kind of queries you plan to write.
For example, if you plan to build a data application that analyzes individual Bitcoin transactions, you can achieve better query performance by adding an index to the
hash column. Each transaction on the blockchain has a
hash value, a 64-character long string that acts as the transaction’s unique identifier. If your application users search for a specific transaction to find out more about it, this is the column they should use to find the transaction in the table.
SELECT * FROM transactions WHERE hash = '94fb53b99dd2cb7ccf26745c908eebd7c01cb8c19fb84819b796beb7750b0cb8'
You can speed up this query by adding a
HASH INDEX to the
CREATE INDEX hash_idx ON public.transactions USING HASH (hash)
If you store more than one hundred million rows of transactions and want to filter by the
hash column, this index is essential. It has the power to turn a query that runs for 10 or more seconds into a query that finishes in less than a second.
You can also ensure that you don’t accidentally insert duplicate records by adding a
UNIQUE INDEX on the
CREATE UNIQUE INDEX time_hash_idx ON public.transactions (time, hash)
This index can make some of your queries faster, including those with both the
Continuous aggregates are materialized views for time-series data. They were designed with time-series data in mind and provide an efficient way to refresh and query the values stored. They even store real-time data. After setting up proper indexes, this is the most important optimization that you can do to make your queries lightning fast.
When it comes to time-series data, users usually aggregate data based on time, creating so-called “time buckets.” Time buckets enable you to analyze how things change over time. In the case of the Bitcoin blockchain, it allows you to see how transaction fees, transaction sizes, block weights, etc., change over time.
These aggregated time-series queries are often recurring queries done by a dashboard or some user-facing application. Hence, it’s important to make these queries as fast as possible to provide a smooth user experience. This is precisely what TimescaleDB’s continuous aggregates will do.
You can set up continuous aggregates to pre-aggregate and materialize information within the database and go from seconds or even minutes to sub-seconds in query execution time.
Even when analyzing a large blockchain database with more than 150 million rows, you can achieve great performance and sub-second query execution speed just by using indexes and continuous aggregates.
Let’s see examples of continuous aggregate setups to store and serve blockchain data.
Continuous Aggregates for Blockchain Analytics
Continuous aggregates set up properly can speed up your queries and make those queries much easier to write. For this blog post, let’s choose three aspects of the data as foundations for the continuous aggregate definitions:
- Miners (revenue)
Blockchains are designed to record information without ever updating historical data. When it comes to Bitcoin, the blockchain is just a digital public ledger containing information about transactions. Simply put, a transaction tells you who sent what amount to whom, plus the technical details.
Each transaction has at least one input and output field that indicate which address(es) were used to send BTC to which address(es). You can have multiple “senders” and “receivers” in the same transaction.
Transactions are not mined separately; they are mined (added to the blockchain) in groups. These groups are called blocks.
A blockchain is made up of blocks. These blocks are important from a data analytics perspective for many reasons. Miners get their revenue based on the block they add to the blockchain. Each block has a unique identifier called
block_id. Before SegWit, a block had a size limit of 1 MB. After SegWit, a new measure was added to the network called weight unit, and the new limit for a block became four million weight units.
Miners are essential for any blockchains. They keep the network up and running. Miners are incentivized with BTC to maintain and validate the order of blocks on the blockchain. In the case of Bitcoin, miners get their revenue from two sources: block rewards and transaction fees. The block reward is currently 6.25 BTC, and it gets halved every 210,000 blocks mined (about every four years). Blockchain users pay the transaction fee every time they submit a transaction.
With continuous aggregates and automation policies, you can simplify your work going forward:
- You save a ton of resources by materializing (saving the results to disk) the time-based aggregations you often use, which means that the database doesn’t need to recompute the information at query-time.
- The views get maintained automatically due to the refresh policies, so you can just focus on analyzing the data.
- Real-time aggregations become simple because continuous aggregates combine the query results from the materialized table and the underlying hypertable by default (saving you from doing a `UNION` in all of your queries).
- You can compress or delete data from the continuous aggregates on schedule—or delete the underlying raw data in the hypertable and keep the downsampled data in the continuous aggregates.
After setting up indexes and continuous aggregates as optimizations, you can focus on analyzing the data and generating insights while having a fast and simple querying experience.
Now that you know which features we used to analyze our data using PostgreSQL and TimescaleDB, let’s look at the thought process behind it.
Use SQL to Generate Blockchain Insights
We came to the conclusions you read at the beginning of the article by asking the following questions:
- Is there any connection between the number of transactions and the transaction fees?
- What percentage of the average miner's revenue comes from fees vs. block rewards?
Each question has a matching SQL query and a chart that answers it with a brief explanation.
Is there any connection between the number of transactions and the transaction fees?
When it comes to blockchains, a major concern for users is the transaction fees. At the end of the day, if a blockchain is too expensive to use, not many people are willing to use it. Let’s see if there’s any correlation between the number of Bitcoin transactions and the fees. The time range for this analysis is the month of April 2022.
SELECT bucket AS "time", tx_count as "tx volume", average(stats_fee_sat) as fees FROM one_day WHERE bucket >= '2022-04-01' AND bucket < '2022-05-01' ORDER BY 1
On this chart, the green line indicates the average transaction volume over time, and the yellow line indicates the average fee per transaction over time.
You can notice two things: there seem to be days with higher transaction volumes (e.g., between April 4-7 and April 11-15), and there is a high correlation between fees and volumes, while not as prevalent in the first “period” of higher volumes. This type of insight can help you decide whether you should submit a transaction now or wait a couple of days, so fees decrease.
What percentage of the average miner's revenue comes from fees vs. block rewards?
Miners are incentivized to keep the network up and running because they earn fees and rewards after mining each block. But let’s look at how much of their revenue comes from these two sources.
WITH coinbase AS ( SELECT block_id, output_total AS coinbase_tx FROM transactions WHERE is_coinbase IS TRUE and time > NOW() - INTERVAL '6 months' ) SELECT bucket as "time", avg(block_fee_sat)*0.00000001 AS "fees", FIRST((c.coinbase_tx - block_fee_sat), bucket)*0.00000001 AS "reward" FROM block_one_week b INNER JOIN coinbase c ON c.block_id = b.block_id GROUP BY bucket ORDER BY 1;
This chart analyzes the last six months of average miner revenue. The left axis is a percentage value that indicates the portion of total revenue from transaction fees (green) and block rewards (yellow). It might be surprising for some, but most miner revenue comes from block rewards (6.25BTC, at the moment). The fees portion never went above 3 % in the last three months.
This kind of analysis can start discussions around the long-term fading of block rewards and how on-chain fees will need to go up in the future to incentivize the miners and sustain the network. (Note that the left axis is logarithmic-scale, so it’s easier to see the green “fees” portion.)
Wrapping It Up
In this blog post, you learned how we monitored the Bitcoin network to reach findings that may affect the future of the blockchain. We hope that you got inspired to do your own research with PostgreSQL and TimescaleDB. The easiest way to get started is to sign up for a free Timescale account and complete the tutorial that walks you through the process of sourcing, ingesting, storing, and analyzing blockchain data.
In case you have questions that you want to ask the community or Timescale team members, feel free to join the Timescale Forum and Slack Community where we’d be delighted to help you out!
Take the State of PostgreSQL 2022 survey! It is open until June 30 at 11:59 p.m. PT.