How to Reduce Query Cost With a Wide Table Layout in TimescaleDB
This is an installment of our “Community Member Spotlight” series, where we invite our customers to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.
In this edition, Florian Herrengt, co-founder of Nocodelytics, shares how he tracks, records, and monitors millions of user interactions per day to build dazzling analytics dashboards for website building and hosting platform Webflow. And the best part? The team is making queries up to six times less costly by using a wide table layout.
About the Company
Nocodelytics is an analytics product built specifically for the website building and hosting company Webflow. We are processing millions of events from various websites and turning them into insightful dashboards.
We’re a close-knit team of three based in London. We are two co-founders, Sarwech and myself (Florian), and the team recently expanded when Alex joined us as a frontend developer. You can find us on the Webflow Marketplace.
As our company is growing fast, we found that a quick, reliable database is vital for our company to grow and thrive.
About the Project
Our goal is to build the number one analytics platform for Webflow.
Like other analytics tools, we provide users with a tracking script that they add to their Webflow site. However, because of the nature of Webflow’s audience, we have to do things quite differently from other analytics tools—which presents challenges that no other analytics tool faces.
First, one of the things that we do that adds complexity is that we automatically track every event a user does. Whether it’s clicking on a button or link, interacting with a form, or viewing a page, we need to be able to track all of this information with minimal impact on accuracy.
We also track events tied to the Webflow Content Management System (CMS) and other third-party tools like Jetboost, Wized, Memberstack, and Outseta, which we automatically integrate with and track.
So, we tap into the power of the CMS and the Webflow ecosystem to record how users interact. We then output these interactions into valuable insights for our analytics customers. This means we need to be able to record and ingest millions of events into our database per day without it crashing down. Some of our customers will get publicity and see huge spikes in traffic, so we need to be able to handle this too.
Second, we provide our customers with a simple-to-use and customizable dashboard. This allows them to create metrics that go deep and answer almost any question (What are the most popular jobs on my site? How many signups came from Google? Which contact button is most effective?).
To do this, we’re building a metric creator that is simple to use on the frontend but complex on the backend, with some advanced SQL being done to return the right answers depending on the question asked. So it’s important that we have the right tool to help us with this.
Third, when our customers view their dashboard and look at the metrics, even a few seconds’ wait can cause frustration. As our customers can have several metrics on the dashboard at any time—some with fairly complex queries—there’s a lot of pressure on our database to read the data, crunch the numbers, and return the result quickly.
On top of that, we also allow our customers to share and embed their dashboard onto a site, which means the number of users viewing the metrics goes up, and the number of read requests can increase at any time.
Choosing (and Using!) TimescaleDB
First, let’s talk about the previous setup we had and what problems this resulted in.
Like many other companies, Nocodelytics started with PostgreSQL. In the beginning, it worked. But the size of the database grew very, very fast. Eventually, with millions of rows, our dashboards became sluggish. Queries for customers with a lot of traffic would take several minutes or even time-out.
As we needed a solution as quickly as possible, I had three things in mind when looking for an alternative to PostgreSQL:
- It had to be quick to learn.
- The change needed to involve a minimal amount of code.
- The migration path had to be simple.
My first choice was ClickHouse, which seems to have better performance than Timescale for our use case—but keep reading as there's more to it.
Not everything was great about ClickHouse: It does a lot, which can get confusing, and I’d rather stick with PostgreSQL, which I’ve used for years and know works.
Amazon Athena was another good option. It's serverless and queries compressed data directly from S3 (which Timescale is now offering in private beta too). It did have some weird limitations (e.g., a maximum of 20 concurrent queries, no way to update data, and dynamic partition columns must always be included in queries), which I found out the hard way. At that point, I was worried about the next issue I’d find, and I lost confidence in the product.
Finally, InfluxDB. I spent a few hours with it, but it’s too different from what we already had. The migration would take forever.
Also, I must stress that I had never heard about those tools before. I either worked on large projects with big money, where we used Redshift/BigQuery or specialized, small-scale projects, where the smallest PostgreSQL instance would be enough.
I was about to use ClickHouse before I came across Timescale by chance while browsing databases.
It’s just PostgreSQL
There you have it. The best feature of TimescaleDB: it's all PostgreSQL, always has been. All your tools, all the existing libraries, and your code already work with it. I’m using TimescaleDB because it’s the same as PostgreSQL but magically faster.
Whatever technology is behind TimescaleDB, it’s truly impressive. Since the Webflow Conf, we have been inserting more than a million rows per day (without optimizations) in our tiny 8 GB memory instance. Sometimes, we have 3 K/IOPS. PostgreSQL would struggle. It’s like pulling an elastic until it snaps—but it never does, and we barely scratched the surface of what it can do. Also, the community is really nice.
“I’m using TimescaleDB because it’s the same as PostgreSQL but magically faster"
So, in sum, Timescale was a drop-in replacement that solved most of our issues. I installed the extension, created a hypertable, and everything became magically fast.
But as I was reading the Timescale documentation, I realized it could be faster. A lot faster.
Relational vs. Wide Table Layout
When you first learn about relational databases, you learn how to normalize your data with multiple tables and foreign key references. That’s a good, flexible way to store your data. However, it can be an issue when dealing with a large amount of data.
That’s where the wide table layout becomes useful.
The idea is to trade storage and schema flexibility for query performance by reducing the number of
JOINs. But this doesn’t stop you from combining both. You can still add foreign keys to a wide table.
You will end up using more storage, but you can mitigate it with TimescaleDB’s compression.
Show time: Setting Up the Schema
Let’s create the above schema with relationships and insert dummy data:
-- Sequence and defined type CREATE SEQUENCE IF NOT EXISTS events_id_seq; CREATE SEQUENCE IF NOT EXISTS countries_id_seq; CREATE SEQUENCE IF NOT EXISTS browsers_id_seq; CREATE SEQUENCE IF NOT EXISTS devices_id_seq; -- Table Definition CREATE TABLE "public"."countries" ( "id" int4 NOT NULL DEFAULT nextval('countries_id_seq'::regclass), "name" varchar, PRIMARY KEY ("id") ); CREATE TABLE "public"."browsers" ( "id" int4 NOT NULL DEFAULT nextval('browsers_id_seq'::regclass), "name" varchar, PRIMARY KEY ("id") ); CREATE TABLE "public"."devices" ( "id" int4 NOT NULL DEFAULT nextval('devices_id_seq'::regclass), "name" varchar, PRIMARY KEY ("id") ); CREATE TABLE "public"."events" ( "id" int4 NOT NULL DEFAULT nextval('events_id_seq'::regclass), "name" varchar, "value" int, "country_id" int, "browser_id" int, "device_id" int, PRIMARY KEY ("id") ); create index events_country_id on events(country_id); create index events_browser_id on events(browser_id); create index events_device_id on events(device_id); create index countries_name on countries(name); create index browsers_name on browsers(name); create index devices_name on devices(name);
Then create our new wide table:
create table events_wide as select events.id as id, events.name as name, events.value as value, countries.name as country, browsers.name as browser, devices.name as device from events join countries on events.country_id = countries.id join browsers on events.browser_id = browsers.id join devices on events.device_id = devices.id create index events_wide_country on events_wide(country); create index events_wide_browser on events_wide(browser); create index events_wide_device on events_wide(device);
Neat. But was it worth it? Well, yes. It would be a lot less interesting to read otherwise. Now that we have our wide table, let’s have a look at the query cost.
-- cost=12406.82 explain select devices.name, count(devices.name) from events join countries on events.country_id = countries.id join browsers on events.browser_id = browsers.id join devices on events.device_id = devices.id where browsers.name = 'Firefox' and countries.name = 'United Kingdom' group by devices.name order by count desc; -- cost=2030.21 explain select device, count(device) from events_wide where browser = 'Firefox' and country = 'United Kingdom' group by device order by count desc;
This is a significant improvement. The same query is six times less costly. For a dashboard with dozens of metrics, it makes a massive difference.
You can find the full SQL here.
Timescale is packed with amazing features we want to start using. Things like
I didn't dive into it yet, but the Timescale Toolkit seems to have a lot of valuable functionality, such as approximate count distincts or function pipelines, which we can’t wait to try out.
We also want to see how continuous aggregates could help us relieve some pressure on the database.
Our goal is to keep growing and scaling the number of events we store. We will soon leverage tablespaces and data tiering to save on storage space. We’re keen to further optimize and use TimescaleDB to help as we grow towards handling billions of rows!
June 2023 update: We’re now dealing with more than 500 GB of data, and those wide tables just aren’t efficient anymore.
So, we’ve gone ahead and separated the table again. We’re executing a count query first, retrieving the ids, then running another query for the labels. Essentially, it’s a two-query process.
TimescaleDB is row-based and our wide table is heavy on strings. As a result, we’re hitting I/O limits. This wasn’t a problem before because we’re using a fast SSD and had fewer rows per site, but now with the data volume, it’s a different story.
In retrospect, choosing the wide table structure at that time was good. It accelerated our development pace significantly. We centralized all the events, simplifying our queries for quite some time. Plus, it enabled us to compress all our data without effort. Looking back, it was a beneficial strategy for that stage of our project.
We’d like to thank Florian and the folks at Nocodelytics for sharing their story on tracking millions of user events while reducing their query cost using TimescaleDB. Stay tuned for an upcoming dev-to-dev conversation between Florian and Timescale’s developer advocate, Chris Englebert, where they will expand on these topics.
We’re always keen to feature new community projects and stories on our blog. If you have a story or project you’d like to share, reach out on Slack (@Ana Tavares), and we’ll go from there.