Using TimescaleDB for Armada Job Metrics

Hello,

I’m one of the developers of Armada (https://armadaproject.io/), which is a high-throughput job scheduler for Kubernetes. We’re investigating TimescaleDB to see if it will meet our needs for storing job-level metrics (CPU, memory, IO, etc.).

Because Armada is high-throughput, one of the real challenges here is making sure the metrics store scales. In terms of our requirements:

  • At any one time we might have 300K jobs running.
  • Every minute each job produces 10 distinct metrics, all at the same timestamp.
  • Average job runtime is 30 minutes, so an average of 30 rows per job. Some jobs may run much longer and have up to 100K data points; others may fail immediately, leaving just a single data point.
  • Jobs have around 10 tags associated with them (queue, jobset, dag, stage, etc.); we need to be able to query and aggregate by tag.
  • We’re looking to store the data in a database that can either retrieve the timeseries for a given job or retrieve an aggregate over a time range, with latencies appropriate for displaying in a UI (i.e. most queries should be sub-second, and nothing should really take more than about 10 seconds).
  • We want to be able to store a year’s worth of data at full resolution.

In terms of the numbers:

  • 300K rows a minute is around 5K rows a second.
  • Storing a year’s worth of history means roughly 157 billion rows.
  • 157 billion rows of 10 doubles is around 12.5TB. Tags and the primary key will increase this, but on the other hand the data should compress quite well, as metrics data is usually highly compressible (e.g. because many jobs peg the CPU at 100% for long periods).
  • The cardinality of the tags is huge.

What I’ve Done So Far:

I’ve performed the following tests on a single node with 32 physical cores and 256GB memory running Postgres 15. Schema looks as follows:

CREATE TABLE job_metrics (
  time TIMESTAMPTZ NOT NULL,
  jobId TEXT NOT NULL, -- these are uuids.
  tags JSONB NOT NULL,
  cpu DOUBLE PRECISION NULL,
  memory DOUBLE PRECISION NULL,
  diskRead DOUBLE PRECISION NULL,
  diskWrite DOUBLE PRECISION NULL,
  networkReceive DOUBLE PRECISION NULL,
  networkSend DOUBLE PRECISION NULL,
  gpu DOUBLE PRECISION NULL,
  nfsRead DOUBLE PRECISION NULL,
  nfsWrite DOUBLE PRECISION NULL
);
CREATE INDEX ix_jobid_time ON job_metrics (jobId, time DESC);
CREATE INDEX idxgin ON job_metrics USING GIN (tags);
SELECT create_hypertable('job_metrics', 'time', chunk_time_interval => INTERVAL '1 day');

Note that I’m storing the tags at the row level as a JSONB column. This has a serious penalty in terms of the amount of data written, but my initial attempt to normalise these out into a separate tags table led to performance problems when joining back to create continuous aggregates.
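
For reference, the normalised layout I tried looked roughly like this (a sketch; the job_tags name is illustrative):

CREATE TABLE job_tags (
  jobId TEXT PRIMARY KEY, -- one row per job
  tags JSONB NOT NULL
);
-- job_metrics then drops its tags column, and the continuous aggregate
-- has to join back through jobId to recover the tags, which is where
-- the performance problems appeared.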

I’ve then created some continuous aggregates:

CREATE MATERIALIZED VIEW one_min_aggregates
WITH (timescaledb.continuous) AS
SELECT tags#>'{queue}' AS queue, tags#>'{jobset}' AS jobset, tags#>'{stage}' AS stage,
   time_bucket(INTERVAL '1 minute', time) AS bucket_1min,
   SUM(cpu) AS cpu,
   SUM(memory) AS memory,
   SUM(diskRead) AS diskRead,
   SUM(diskWrite) AS diskWrite,
   SUM(networkReceive) AS networkReceive,
   SUM(networkSend) AS networkSend,
   SUM(gpu) AS gpu,
   SUM(nfsRead) AS nfsRead,
   SUM(nfsWrite) AS nfsWrite
FROM job_metrics
GROUP BY tags#>'{queue}', tags#>'{jobset}', tags#>'{stage}', bucket_1min;
CREATE INDEX ix_one_min_aggregates ON one_min_aggregates (bucket_1min, queue, jobset, stage);

Using this one-minute aggregate, I can create other aggregates (5min, 15min, 1hr, 4hr, 1day), each derived from the previous one.
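
For illustration, the 5-minute rollup derived from the 1-minute aggregate looks roughly like this (only two metric columns shown; this relies on hierarchical continuous aggregates, available from TimescaleDB 2.9):

CREATE MATERIALIZED VIEW five_min_aggregates
WITH (timescaledb.continuous) AS
SELECT queue, jobset, stage,
   time_bucket(INTERVAL '5 minutes', bucket_1min) AS bucket_5min,
   SUM(cpu) AS cpu, -- sums of sums are safe; averages would need SUM and COUNT carried separately
   SUM(memory) AS memory
FROM one_min_aggregates
GROUP BY queue, jobset, stage, bucket_5min;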

So far my testing has shown the following:

  • Write performance is good: I can write around 70-80K rows/sec sustained, which is an order of magnitude more than we need.
  • Looking up the timeseries for a given job is fast (tens of milliseconds), as expected, since it uses a Postgres index. Filtering this to a time range is also fast.
  • Creating the aggregations has been challenging. The one-minute aggregation above takes 17 seconds to run for a 5-minute window and almost 2 hours for a 1-day window. This is borderline acceptable: it’s probably fast enough for now, but it doesn’t give us much headroom if data volumes increase or we need more aggregations.
  • On multiple occasions I seem to have been able to hose the database by running a big aggregation, e.g. the 1-minute aggregation for a whole day. In those cases the DBAs have had to come in and kill the offending query.

Based on the above, I wonder if the community has any advice for me. My specific questions would be:

  • Am I right in thinking that Timescale would be a good fit for my use case? Specifically, can I get the performance I want with the data volume I have?
  • Is there a way of making the continuous aggregations more performant? One thing I have noticed is that the continuous aggregation seems to run single-threaded, even though the query plan itself is (in theory) multithreaded. I wondered if we could increase performance by getting the aggregation to run in parallel.
  • Is there a better way of storing my tags than a JSONB on every row? The size of the tags is a lot bigger than the size of the data, so it’s a bit sad that I’m doing this, but when I tried to normalise them out, continuous aggregation performance suffered; I presume because (for a one-day window) it’s trying to join 413 million rows to 18 million rows, and that’s never going to be quick.
  • So far I’ve only been able to get 10 days’ worth of data into the DB, as even ingesting at 70K rows a second it takes a long time to fill up. Do you think I’ll see any issues I haven’t already encountered at higher data volumes? I’m hoping not, as the hypertable is partitioned by day.
  • My continuous aggregate above is just one of many we may want to produce (e.g. that aggregate is just totals; we may want different aggregates for max, min, avg, etc.). Given that a single CA is causing performance issues, is it reasonable to want to produce so many aggregates?

Sorry for such a long post; any views (even partial ones) would be much appreciated!

Many thanks,

Chris


Hi! You can certainly use Timescale for this! Take a look at the tsbs project if you want to put a heavy workload on it and check how it behaves. I remember I did a benchmark on my Raspberry Pi and got 20K rows per second throughput. We have a lot of users with huge workloads and it seems to work.

About your continuous aggregates, can you share more details about where they slow down? Maybe we can help you optimize them.

I’m AFK at the airport, so I won’t have a complete answer for everything, but I hope to revisit this later in the week and give you a fuller answer. I’d also encourage you to take a look at Promscale; even though we discontinued the project, the metrics model is really inspiring and can probably give you some insights into a DB schema that scales.

Hey Chris!

Before @jonatasdp is back, I’d like to give a few thoughts, as well as ask some questions to make sure we’re on the same page.

  • You said write performance is a no-brainer, so let’s leave it out of the picture.
  • Tens of milliseconds may still be slow. There could be optimization potential, like trying different chunk sizes.
  • This is something I’m not sure I understand. Continuous Aggregates are incrementally updated. That said, the initial setup time should be irrelevant, since they don’t have to be updated all-or-nothing like normal Materialized Views. Maybe you can elaborate on your concern.
  • That sounds like the query ran out of memory or had to page a lot. What are the machine’s specs? How many rows do you expect? What type of query brought it to its knees?

To the other bullet points:

  • Whether Timescale is a good fit mostly depends on how and what you query. Do you query based on a time range? Then yes. Do you always query everything (I mean all rows, with no time limitation)? Maybe not. However, as long as you’re happy with the query speed it may work. With the number of partitions growing and everything being queried, you may put pressure on the query planner though, meaning it may become slower over time. That is only true for the non-time-range query type.
  • See above; I’m not sure what you want to achieve with faster generation of CAGGs, since that’s not how you use them.
  • For the layout, you may want to look at this best-practices blog post: Best Practices for Time-Series Modeling. It may also be interesting to see if there’s a chance of deduplication, but that depends on the type of tags being stored.
  • How do you ingest data? Do you use bulk insert with the COPY command? Do you use batches (with prepared statements), or do you do single, independent inserts? Depending on that, you may have optimization potential. Also, using multiple connections/threads from the same client can increase speed.
  • I’d always recommend creating CAGGs with the WITH NO DATA option and letting the underlying refresh policy (that you want to set up) handle the aggregation in batches. Less stress on CPU, RAM and transactions. A minimal sketch of that pattern follows below.
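
A minimal sketch, using a cut-down version of your 1-minute aggregate (the view name and offsets are illustrative):

CREATE MATERIALIZED VIEW one_min_totals
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '1 minute', time) AS bucket_1min,
   SUM(cpu) AS cpu
FROM job_metrics
GROUP BY bucket_1min
WITH NO DATA; -- create it empty; no up-front materialization

-- A background policy then materializes it in small, regular batches:
SELECT add_continuous_aggregate_policy('one_min_totals',
  start_offset => INTERVAL '1 hour',
  end_offset => INTERVAL '1 minute',
  schedule_interval => INTERVAL '1 minute');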

Hello,

Thanks very much for the reply. I think the issues we currently have are:

  • The minutely continuous aggregation I gave above takes about 2 hours to regenerate for a whole day. This is perilously close to being a dealbreaker, partly because any long-running task like that is going to add operational complexity, and partly because it doesn’t leave us much performance in the locker if volumes increase or we need more metrics. I realize that on the happy path we should never need to aggregate a whole day of metrics in one go, but I’d be more comfortable if we had more performance to spare here. The obvious solution is to parallelise the aggregation, as the calculation does seem to be parallelisable; however, right now Postgres just seems to max out a single thread.
  • We’re currently storing the labels denormalised on every time point, which means we’re storing 150GB of data a day, most of which is labels. Ideally we’d normalise the labels out; however, when I tried the naïve approach it made the 1-minute aggregation even slower. I’ve noticed there’s prior art for storing label ids as a list, which I’m going to have a play with next and which might help here.
  • Compression is less useful than I first thought, because a) it seems to take a very long time (I’m trying to compress a day chunk right now and it’s been running for over 2 hours) and b) it removes the index on the labels. This makes the 150GB/day problem more of an issue, because if we could compress aggressively, the duplication of labels should hopefully compress out. The compression configuration I’d expect to use is sketched below.
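
Roughly this (the segmentby/orderby choices are my assumptions about what fits our access pattern, not something I’ve validated yet):

ALTER TABLE job_metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'jobId', -- keep each job's series together
  timescaledb.compress_orderby = 'time DESC'
);
-- Compress chunks once they're old enough that they no longer change:
SELECT add_compression_policy('job_metrics', INTERVAL '7 days');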

I think ideally:

  • I work out a way of making aggregation much quicker, or I show that it’s feasible for us to increase volumes/number of metrics/number of aggregations without making aggregation so slow that it becomes unusable.
  • We work out a way of reducing the storage requirements, either by normalizing out the tags or by finding out how compression can work for us.

Regarding Promscale: I have had a look at that, and indeed the data model is very clever. What worries me slightly is that the Promscale design doc states their approach to labels is efficient only because, although total cardinality in the system is huge, the cardinality for any given metric is much smaller. In our case this doesn’t apply, because all metrics are associated with the same entity (a job), and so the cardinalities of all metrics are the same (i.e. huge).
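
The Promscale-style layout I’m planning to experiment with would look roughly like this (a sketch; all names here are my own invention):

-- Each distinct key/value label pair gets a numeric id:
CREATE TABLE labels (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  key TEXT NOT NULL,
  value TEXT NOT NULL,
  UNIQUE (key, value)
);
-- Metric rows then carry an array of label ids instead of the full JSONB:
CREATE TABLE job_metrics_normalised (
  time TIMESTAMPTZ NOT NULL,
  jobId TEXT NOT NULL,
  label_ids BIGINT[] NOT NULL, -- e.g. {queue_id, jobset_id, stage_id, ...}
  cpu DOUBLE PRECISION NULL
  -- remaining metric columns as in job_metrics
);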

  • Not every aggregation can be parallelized, since the outcome of a multi-step aggregation may not be equal to the single-shot aggregation. This is true for aggregations like average, but it may also be true for aggregations involving lag or similar functions. There is no plan (as of right now) to implement specific paths to increase speed in those few cases. As mentioned before, CAGGs are designed as incrementally updated MVs, not all-or-nothing updates. They are also designed to be updated automatically, in the background, in small batches.
  • When you want to join data, I’d always recommend a multi-step query: query the time-series aggregation first (in a CTE) and then join the external data in the second step (see the sketch after this list).
  • Compression is the same story as with CAGGs: it is designed to run automatically, in the background, on older data that is no longer changing.
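
As a sketch of that two-step pattern, assuming a normalised job_tags table like the one you described earlier (names illustrative):

-- Step 1: aggregate on the hypertable alone; step 2: join the small tags table.
WITH per_job AS (
  SELECT jobId,
     time_bucket(INTERVAL '1 minute', time) AS bucket,
     SUM(cpu) AS cpu
  FROM job_metrics
  WHERE time >= now() - INTERVAL '1 hour'
  GROUP BY jobId, bucket
)
SELECT t.tags#>>'{queue}' AS queue, p.bucket, SUM(p.cpu) AS cpu
FROM per_job p
JOIN job_tags t ON t.jobId = p.jobId
GROUP BY 1, 2;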

Please talk a bit about the chunk sizes and how you’ve configured Timescale.

Hi @noctarius

Thanks for your reply, it’s much appreciated. Let me try and answer some of your questions as best I can:

Whether Timescale is a good fit mostly depends on how and what you query.

Our use cases here are generally a pretty good fit for Timescale, at least in theory. The main ones are:

  • Query all the data for a single job (between 0 and 10K rows, but an average of 30) with an optional filter on time. Timescale handles this without issue.
  • Show aggregate usage for a given label or combination of labels (user, queue, jobset, dag, etc.) over some time range. The time range could be anything from minutes up to a year. CAs seem like a good fit here, firstly because we are aggregating up job-level data and secondly because if someone wants an aggregation over a year, we are fine to decrease the resolution accordingly. My testing has shown that performance here is pretty good once the aggregation has been made.

How do you ingest data?

We’re using the Postgres batch protocol with prepared statements, batches of 10K, and 8 parallel writers. Using this, a sustained write speed of 75K rows a second seems achievable, which in all honesty is pretty good. I think we could squeeze out a bit more performance if we used the COPY protocol, but I don’t see this as a big issue right now.
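
If we ever do need more, I’d expect to move to something like COPY, e.g. via psql (the file name and format here are placeholders):

\copy job_metrics (time, jobId, tags, cpu, memory, diskRead, diskWrite, networkReceive, networkSend, gpu, nfsRead, nfsWrite) FROM 'metrics.csv' WITH (FORMAT csv)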

This is something I’m not sure I understand. Continuous Aggregates are incrementally updated.

Apologies, I haven’t been very clear here. In general I agree with you that the continuous aggregation can be done faster than real time, and thus it’s a valid approach. My concerns here are really around:

  • If we need to regenerate a continuous aggregation for any reason (we’ve had an outage, we’ve ingested some bogus data), this could take a long time.
  • When you’re running a job scheduler, you’re very much at the mercy of the users as to how many jobs they decide to run, what labels they choose to provide, and what metrics they want. The current speed of the continuous aggregations worries me slightly: if any of those variables change, we might find ourselves in a position where we can’t aggregate fast enough.

As for the information you requested about our setup:

  • A single node with 32 physical cores and 256GB memory running Postgres 15.
  • Partition size is 1 day, which is just over 400 million rows.
  • We ran timescaledb-tune beforehand. The parameters it came up with were:
timescaledb.max_background_workers = 16
max_worker_processes = 51
max_parallel_workers = 32
default_statistics_target = 500
max_locks_per_transaction = 512
autovacuum_max_workers = 10
autovacuum_naptime = 10

Hi @noctarius

Some answers inline:

Please talk a bit about the chunk sizes and how you’ve configured Timescale.

Tests have been performed on a single node with 32 physical cores and 256GB memory running Postgres 15. Partition size is 1 day (roughly 400 million rows). We ran timescaledb-tune before we started, which came up with:

timescaledb.max_background_workers = 16
max_worker_processes = 51
max_parallel_workers = 32
default_statistics_target = 500
max_locks_per_transaction = 512
autovacuum_max_workers = 10
autovacuum_naptime = 10

This is something I’m not sure I understand. Continuous Aggregates are incrementally updated. That said, the initial setup time should be irrelevant since they don’t have to be updated all-or-nothing like normal Materialized Views. Maybe you can elaborate on your concern.

Yes, apologies, I’ve not been very clear here. In the main you’re right: the CA I’m doing is faster than realtime, so if we aggregate as we go along, we should be good. The concerns I have are twofold:

  • There will be times (when we’ve had an outage or ingested some bogus data) when we’ll need to go back and refresh aggregations. In this case, aggregation speed can be an issue.
  • We’re a bit at the mercy of our users as to how many jobs they submit, with which labels, and which metrics they desire. If any of these three axes increase, I’m a little worried that we’ll end up in a position where we can’t aggregate fast enough.

Do you query based on a time range? Then yes. Do you always query everything (I mean all rows, with no time limitation)? Maybe not.

I think at a high level our use case is pretty textbook. Essentially the query patterns are:

  • Query for all the data points for a given job, with an optional time filter. This typically returns 30 rows via an indexed search.
  • Query for aggregated metrics grouped by one or more labels and with a time filter. The time filter could be anything from 1 minute to 1 year, but decreased resolution is perfectly acceptable as the time period increases. This seems a good fit for CAs, which aggregate the same information over different time buckets.

How do you ingest data?

We use batches of prepared statements, which I think works pretty well (we see a sustained 75K rows/sec). I think we could squeeze out a bit more performance using the COPY protocol, but it hasn’t got to the stage where I think we need that yet.

That sounds like the query ran out of memory or had to page a lot.

I suspect what happened here was that I’d refreshed a CA where the underlying CA hadn’t been materialized yet, and thus something that should have been relatively quick ended up performing an aggregation over billions of rows.

I suspect what happened here was that I’d refreshed a CA where the underlying CA hadn’t been materialized yet, and thus something that should have been relatively quick ended up performing an aggregation over billions of rows.

You can use continuous aggregates with materialized_only and it will avoid real-time computing, and maybe even reuse other time frames to build the result for you. Say you want the data for the last 10 days: it can use one row from the weekly view and 3 rows from the daily view.
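
For example, switching your existing aggregate to materialized-only would look like this:

ALTER MATERIALIZED VIEW one_min_aggregates
  SET (timescaledb.materialized_only = true);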

I’m a bit lost after all the conversation here! Thanks @noctarius for jumping in!

Hey Chris, sorry had the tab open and forgot to answer :sweat_smile:

Yes and no. As long as you have at least the start and end time of a job in the query, it’ll handle it gracefully. The general rule for TimescaleDB is that you always want a time range in the WHERE clause for early chunk exclusion (removing unnecessary chunks from the execution plan). That is the “trick” behind the speed of TimescaleDB.
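
For example, a per-job query shaped like this lets the planner exclude every chunk outside the window up front (the job id and range are placeholders):

SELECT time, cpu, memory
FROM job_metrics
WHERE jobId = '<job-uuid>' -- placeholder
  AND time >= now() - INTERVAL '1 day';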

As @jonatasdp already wrote, you can create CAGGs without the realtime part. That just means the CAGG will not automatically query not-yet-materialized data (losing the realtime feeling). Still, you don’t have to fully rematerialize a CAGG; you can use refresh_continuous_aggregate (Timescale Docs) to rematerialize only a certain timeframe. The CAGG policies can automate that process and help by automatically materializing everything new since the policy’s last run.
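
For instance, a targeted backfill after an outage could look like this (the window is illustrative):

CALL refresh_continuous_aggregate('one_min_aggregates',
  '2023-03-01 00:00:00+00', '2023-03-02 00:00:00+00');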

As for the information you requested about our setup:

400 million rows may be on the higher end for a chunk size. How large is the chunk in bytes? You can get the chunk size using chunks_detailed_size (Timescale Docs). But since you have 256GB RAM, it’ll probably be fine :slight_smile:
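
Something like:

SELECT chunk_name, pg_size_pretty(total_bytes) AS total_size
FROM chunks_detailed_size('job_metrics')
ORDER BY chunk_name;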

@jonatasdp sooooooorry :slight_smile:


@noctarius thank you for all the support and insights here!