Train timings database - should I use Timescale?

Hello! I’m involved in designing a new data warehouse to support analysis and applications using both historic and (near-)real-time data on train timings. I am a big fan of Postgres; however, in the past I have struggled to maintain good performance once data volumes grow beyond about a billion rows (even with partitioning). My ideal system would support:

  • cost-effective and performant scaling to several years’ worth of data (say, 10-20bn rows in a “table”)
  • queries over long time ranges (“how late were trains on line X over the last 5 years”)
  • (this is where my previous databases have always struggled most) performant merges between separate time-series on a close, but not exactly equal, time key (“find all occasions where train A was at location X within 5 minutes of train B being at location Y”)
  • inserts of usually around 100 rows/second, but with the ability to support higher rates (for backfill, and to future-proof against new data sources)
  • window functions
  • updates to recent data (say, up to a day old)
  • (eventually) concurrent usage of around 100 users, possibly through read replicas

Do you think Timescale (likely Timescale Cloud on AWS) would be a good fit for this use case? My other options at the moment are Snowflake or Redshift - however my only direct experience is with plain Postgres, rather than any of the newer “analytics-led” databases, so I don’t know what will work best.

Thank you :slight_smile:


:wave: @sam,

Thanks for joining the Timescale Forum and asking about the fit for your application. I’m a Developer Advocate for the company, but I have a long history with time-series data applications from before coming to Timescale (and that’s one of the reasons I wanted to join).

The scale that you’re describing is well within the bounds of a single-instance TimescaleDB database. We regularly see insert rates of tens and hundreds of thousands of rows/second without breaking a sweat and TBs of data in a table. So to that end, I’m confident you’ll be able to insert and retain historical data to fit your application needs.
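
For reference, getting started is just a regular Postgres table that you then convert into a hypertable. Here’s a quick sketch - the train_times name, columns, and chunk column are placeholders I’ve made up for illustration, not anything from your design:

-- Hypothetical schema for illustration only; adjust to your data.
CREATE TABLE train_times (
    time_a     timestamptz NOT NULL,  -- primary event time (used for chunking)
    train_id   text        NOT NULL,
    location   text        NOT NULL,
    delay_secs integer
);

-- Convert the table into a hypertable, chunked on time_a.
SELECT create_hypertable('train_times', 'time_a');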

More directed to your bullet points, these three features will have the most impact on your day-to-day application development and will require some planning as you step in, but I’m confident you’ll be rewarded for the effort. :slightly_smiling_face:

Compression: TimescaleDB offers native columnar compression that is applied at the partition level. So as data ages, you can set an automated policy to compress the data which saves many of our clients 90% or more in storage. Obviously, that reduces costs, but because we transform this data into columnar format, many historical queries that aggregate specific columns over long time-ranges end up being faster too!
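
As a minimal sketch of what that looks like in practice (using the placeholder train_times table from above; the segmentby column is just an assumption):

-- Enable native columnar compression, segmenting compressed batches by train.
ALTER TABLE train_times SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'train_id'
);

-- Automatically compress chunks once all of their data is older than 7 days.
SELECT add_compression_policy('train_times', INTERVAL '7 days');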

Continuous Aggregates: These are essentially materialized views that stay up-to-date on a schedule you choose, and they only update/aggregate data that’s impacted by changes to the raw data. So they’re very efficient to maintain, they can retain data on a different schedule from your raw data (i.e. store aggregated rollups for months or years after the raw data has been dropped!), and many of our newer aggregate functions allow you to roll up a smaller aggregate (say, one hour) to higher levels (one day) without maintaining multiple views. And because these views aggregate the raw data, you can serve that data to more people, faster, because the aggregates are already computed (and kept up-to-date!).

This is a total game changer for time-series data and something many of us maintained ourselves for years because there was no other solution. :tada:
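
To make that concrete, here’s a hedged sketch of an hourly rollup of delays over the placeholder train_times table (names and intervals are just examples):

-- Hourly average delay per train, maintained incrementally by TimescaleDB.
CREATE MATERIALIZED VIEW train_delay_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time_a) AS bucket,
       train_id,
       avg(delay_secs) AS avg_delay_secs
FROM train_times
GROUP BY bucket, train_id
WITH NO DATA;  -- created empty; the policy below (or a manual refresh) fills it in

-- Keep the most recent two days of buckets refreshed every hour.
SELECT add_continuous_aggregate_policy('train_delay_hourly',
    start_offset      => INTERVAL '2 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');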

Automated data retention: Like compression and continuous aggregates, data retention is applied with a policy of your choosing, and then you can stop thinking about it. TimescaleDB will regularly check for chunks that are older than your configured policy and drop the chunk for you.
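
For example (again a sketch against the placeholder table, with an interval picked arbitrarily):

-- Drop raw chunks once all of their data is older than 2 years.
-- Continuous aggregates built on the table keep their rollups independently.
SELECT add_retention_policy('train_times', INTERVAL '2 years');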

Timescale Cloud: I’ll also add that Timescale Cloud is currently working (as of September 2022) on a number of other storage-related features that aren’t public yet, which will further reduce cost and let us serve a wider array of use cases. :wink:

The last comment I wanted to add was in relation to your relative-time query. Without knowing your schema and data patterns, it feels a bit difficult to say, “sure, Timescale will magically work there!”. That said, I’m confident that many of our query planner improvements will make searching and merging time-based comparisons easier, including work we’re currently exploring around join operators like “ASOF”, something that Postgres doesn’t support natively but that we can provide as part of the extension at some point.
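
In the meantime, one common way to express that kind of nearest-match query in plain Postgres (and therefore TimescaleDB) is a LATERAL join with a bounded time window. This is only a sketch against the placeholder train_times table, not how a future “ASOF” operator would necessarily look:

-- For each sighting of train 'A' at location 'X', find the closest sighting
-- of train 'B' at location 'Y' within +/- 5 minutes (if any).
SELECT a.time_a AS time_at_x,
       b.time_a AS time_at_y
FROM train_times a
CROSS JOIN LATERAL (
    SELECT t.time_a
    FROM train_times t
    WHERE t.train_id = 'B'
      AND t.location = 'Y'
      AND t.time_a BETWEEN a.time_a - INTERVAL '5 minutes'
                       AND a.time_a + INTERVAL '5 minutes'
    ORDER BY abs(extract(epoch FROM t.time_a - a.time_a))
    LIMIT 1
) b
WHERE a.train_id = 'A'
  AND a.location = 'X';

The bounded BETWEEN predicate is what keeps the inner lookup from scanning the whole table for each row, and an index along the lines of (train_id, location, time_a) should help it considerably.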

Finally, if you weren’t aware, when you sign up for a 30-day trial of Timescale Cloud, the Timescale support team is ready to assist in any way they can - including migration plans, schema design, etc. - to make sure you’re successful with our hosted solution.

Feel free to ask more questions. We’re here to help. :slight_smile:


Hi Ryan, thanks very much for your reply :slight_smile:

Another common (and related) query that I would like to be able to run is where I have two closely correlated time fields in a table, time_a and time_b. If time_a is used as the “primary” time key for the purposes of partitioning/chunking, but time_b is always within (say) 1 hour of time_a, can a query which filters on time_b leverage the chunking on time_a? Again, this is a query pattern which I’ve always struggled to optimise in Postgres - intuitively, I want the query planner to know about the correlation between these columns and take advantage of it without my intervention. Does TimescaleDB offer anything beyond regular Postgres along these lines?

Great to know there’s support on hand if/when we decide to opt for the hosted solution!

Sam

At the current time, no. Postgres partitioning (and thus currently Timescale too) doesn’t offer global indexes that might make this more feasible. If you don’t filter by the partitioning key, the query planner can’t prune chunks, so to get the most benefit you’d always have to filter on time_a at some level to reduce the number of chunks that are queried.

From your description, it sounds like you could always add something like:

... WHERE time_a > {timestamp} - '1 hour'::interval
      AND time_a < {timestamp} + '1 hour'::interval
      AND time_b = {timestamp}

But to get chunk exclusion and the best performance, you need to filter on time_a somehow.
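
Concretely, a range filter on time_b plus a widened band on time_a is usually enough. A hedged sketch, with literal timestamps standing in for your parameters and a hypothetical events table that has both columns:

-- Filter on time_b, but widen the time_a filter by the known 1-hour skew
-- so the planner can still exclude chunks on the partitioning column.
SELECT *
FROM events
WHERE time_b >= '2022-09-01 08:00+00'
  AND time_b <  '2022-09-01 09:00+00'
  AND time_a >= timestamptz '2022-09-01 08:00+00' - INTERVAL '1 hour'
  AND time_a <  timestamptz '2022-09-01 09:00+00' + INTERVAL '1 hour';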

Does that answer the question?

Thanks Ryan - yes, that answers my question perfectly. I’m sure I’ll be back in touch if/when we get going!