How can I reduce the disk usage of Promscale metrics?

This distils a useful discussion from Slack… View in #promscale on Slack

nmaludy @nmaludy: Any advice to reduce the disk usage of Promscale metrics?
I’ve enabled compression and verified that things are being compressed, but I keep running out of disk space. I turned off metrics, labels, tags, etc. that I don’t need. Promscale seems to use an order of magnitude (or more) more disk space than Prometheus.

I just had to dump my metrics database again since Postgres filled up my disk.

I’m monitoring 8 nodes and 235 metrics at a 10 s scrape interval on a 20 GB partition, using a TimescaleDB/Promscale chunk interval of 8 hours, and I’m running out of space after about a week.

This time I’m going to try and keep a closer watch on Prometheus vs Postgres disk usage. While that is happening, any insight into how to decrease disk usage? Or, are my expectations unrealistic?

Ramon @Ramon: Probably it’s the recent uncompressed chunks that are causing you to run out of space.

A few ideas (I would implement them in this order):

  1. Make sure you run the latest version of Promscale. We’ve made some improvements to compression in version 0.9.0 that will help reduce disk usage.
  2. How long do you need to retain the data? You could look into reducing data retention.
  3. Reduce the chunk interval to 4h or 2h. The tradeoff is query performance, since more chunks need to be scanned. (See the SQL sketch after this list for both the retention and chunk-interval settings.)
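
For ideas 2 and 3, the settings are exposed through the Promscale SQL API. A minimal sketch, assuming the function names documented for Promscale around 0.9/0.10 (verify against your version); the 30-day value and the metric name are only placeholder examples:

-- Shorten the default retention period for all metrics (placeholder: 30 days).
SELECT prom_api.set_default_retention_period(30 * INTERVAL '1 day');

-- Shrink the default chunk interval for newly created chunks, e.g. to 4 hours.
SELECT prom_api.set_default_chunk_interval(INTERVAL '4 hours');

-- Per-metric overrides are also available (placeholder metric name).
SELECT prom_api.set_metric_retention_period('node_cpu_seconds_total', 30 * INTERVAL '1 day');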

nmaludy @nmaludy: OK, I’ll give these a shot.
Ideally I’d like to keep data for 90 days or more (which will obviously need more disk space), but I’m trying to make sure it works with a small retention setting first. After upgrading to 0.9.0, should I purge my metrics and start over?

Ramon @Ramon: No, the improvements should apply immediately.

nmaludy @nmaludy: FYI this is after about 8 hours of running:

# du -hsx pgsql/ prometheus/
1011M   pgsql/
70M     prometheus/

update after about 20 hours:

# du -hsx pgsql/ prometheus/
2.3G    pgsql/
85M     prometheus/

Ramon @Ramon: @nmaludy can you confirm you are running version 0.9.0?

Could you run this query:

select is_compressed,count(*) from timescaledb_information.chunks where hypertable_schema='prom_data' group by is_compressed;

And also this one:

SELECT 
     avg(chunk_interval) as avg_chunk_interval_new_chunk,
     sum(total_size_bytes-coalesce(after_compression_bytes,0)) 
         AS not_yet_compressed_bytes,
     avg(total_interval-compressed_interval) 
         AS avg_not_yet_compressed_interval,     
     avg(total_chunks-compressed_chunks) AS avg_not_yet_compressed_chunks,
     sum(total_chunks-compressed_chunks) AS total_not_yet_compressed_chunks,
     count(*) total_metrics, 
     sum(coalesce(after_compression_bytes,0)) 
        AS compressed_bytes,
     avg(compressed_interval) as avg_compressed_interval,
     sum(total_size_bytes-coalesce(after_compression_bytes,0)) / 
       sum(extract(epoch from (total_interval-compressed_interval)))
         AS not_yet_compressed_bytes_per_sec,
     sum(coalesce(after_compression_bytes,0)) / 
       greatest(sum(extract(epoch from (compressed_interval))),1)
         AS compressed_bytes_per_sec,
     sum(compressed_chunks) as total_compressed_chunks,
     sum(total_size_bytes) as total_prom_data_size,
     pg_database_size(current_database()) as total_db_size
FROM  prom_info.metric;

and paste the output here?

nmaludy @nmaludy: ah you’re right, looks like my upgrade did not succeed, investigating

metrics=# select is_compressed,count(*) from timescaledb_information.chunks where hypertable_schema='prom_data' group by is_compressed;
 is_compressed | count 
---------------+-------
 f             |  1206
(1 row)

confirmed 0.9.0 now

and here is the output of the larger query:
 avg_chunk_interval_new_chunk     | 02:00:02.498161
 not_yet_compressed_bytes         | 2356043776
 avg_not_yet_compressed_interval  | 23:29:20.141279
 avg_not_yet_compressed_chunks    | 5.1319148936170213
 total_not_yet_compressed_chunks  | 1206
 total_metrics                    | 235
 compressed_bytes                 | 0
 avg_compressed_interval          | 00:00:00
 not_yet_compressed_bytes_per_sec | 118.56316751707429
 compressed_bytes_per_sec         | 0
 total_compressed_chunks          | 0
 total_prom_data_size             | 2356043776
 total_db_size                    | 2430833199
(1 row)

looks like compression isn’t working properly?

max_worker_processes = 2

It was set to 2 instead of 3 (number of databases + 2). I changed it to 3 and chunks started compressing.
That helped a little bit:

# du -hsx prometheus/ pgsql/
120M    prometheus/
1003M   pgsql/
# SELECT is_compressed,count(*) FROM timescaledb_information.chunks WHERE hypertable_schema='prom_data' GROUP BY is_compressed;
 is_compressed | count 
---------------+-------
 f             |   303
 t             |   941
(2 rows)
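
For reference, both worker settings can be checked and changed from SQL; a minimal sketch using standard PostgreSQL/TimescaleDB parameters (the "number of databases + 2" sizing is the rule of thumb mentioned above):

-- How many background workers PostgreSQL allows, and how many TimescaleDB may use.
SHOW max_worker_processes;
SHOW timescaledb.max_background_workers;

-- Raise the limit; max_worker_processes only takes effect after a server restart.
ALTER SYSTEM SET max_worker_processes = 3;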

FYI here is my full postgresql.conf
here is an updated result of the larger query:


 avg_chunk_interval_new_chunk     | 02:00:02.498161
 not_yet_compressed_bytes         | 642170880
 avg_not_yet_compressed_interval  | 06:53:18.036991
 avg_not_yet_compressed_chunks    | 1.3574468085106383
 total_not_yet_compressed_chunks  | 319
 total_metrics                    | 235
 compressed_bytes                 | 65732608
 avg_compressed_interval          | 16:57:50.604256
 not_yet_compressed_bytes_per_sec | 110.19590153569092
 compressed_bytes_per_sec         | 4.580161420336446
 total_compressed_chunks          | 941
 total_prom_data_size             | 707903488
 total_db_size                    | 791007791
(1 row)

Ramon @Ramon: Thanks. It looks like compression is working now. The significant difference between Promscale and Prometheus is caused by the recent data not yet being compressed. With compression working, data size should grow very slowly over time. It would be great if you could report disk usage tomorrow to confirm that it’s very similar to what you have at the moment.
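
For that check, a trimmed-down version of the larger query is enough; this sketch reuses only the prom_info.metric size columns already shown above:

-- Watch compressed vs. not-yet-compressed bytes over time.
SELECT sum(coalesce(after_compression_bytes, 0))                    AS compressed_bytes,
       sum(total_size_bytes - coalesce(after_compression_bytes, 0)) AS not_yet_compressed_bytes,
       pg_size_pretty(pg_database_size(current_database()))         AS total_db_size
FROM prom_info.metric;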

Re: the RPM package. We are going to upload it to a repo in the next few days, so you will not have to download it from GitHub and will be able to upgrade more easily.

Also, would you mind sharing the output of curl http://<promscale-host>:9201/metrics (if you run it on the same box where Promscale runs, that would be curl http://localhost:9201/metrics)? Those are performance metrics about Promscale in Prometheus format. Thanks.

nmaludy @nmaludy:
Yeah, I’ll keep an eye on it and let you know!

super excited we figured out what the issue was!

Also great to hear about the RPM on packagecloud; that will be a game changer for me from an installation automation perspective.


We developed a metrics stack for monitoring our production servers (Promscale 0.6.0). After 3 weeks we ran out of disk space (100 GB). We added more disk space to buy more time for research. After doing some tests with 0.10.0, I noticed roughly 1000 times higher disk usage compared to a Prometheus-only stack. With our production and development systems (~1000 metrics, 10 servers), the Postgres processes are writing 250 GB/h after 2 hours of uptime. This would kill the 2 TB SSD in my development PC within one year (TBW 1.3 PB).

Compression seems to be running correctly; the is_compressed chunk counts are: false, 2231; true, 89299.

I built an example Docker stack with Prometheus, TimescaleDB, Promscale, Node_Exporter, Grafana, and 4 WildFly servers to reproduce the problem. The stack uses only default settings, with no further configuration of TimescaleDB: https://github.com/DanielWebelsiep/test-prometheus-timescaledb.

After 2 hours, the Postgres processes have written 40 GB to disk (iotop -a -o -P) and the TimescaleDB volume has grown to 6 GB. In the same time, Prometheus writes only 200 MB, with its volume at almost the same size.

After 2 hours of uptime, TimescaleDB starts compressing chunks.

I don’t know if the high write I/O is related to missing configuration in TimescaleDB or Promscale. Perhaps 1000 metrics (and hypertables) with a small number of servers is not the best scenario for TimescaleDB.
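
A quick way to see how many hypertables and chunks those ~1000 metrics translate into is the same timescaledb_information.chunks view queried earlier; a minimal sketch:

-- Count hypertables and chunks per schema, including how many chunks are still uncompressed.
SELECT hypertable_schema,
       count(DISTINCT hypertable_name)            AS hypertables,
       count(*)                                   AS chunks,
       count(*) FILTER (WHERE NOT is_compressed)  AS uncompressed_chunks
FROM timescaledb_information.chunks
GROUP BY hypertable_schema;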
