Your Questions Answered (YQA): March Edition
Check out this month's roundup of community questions, from ways to scale your reads and writes to handling large-scale geospatial datasets.
We’ve previously published “Overheard at Office Hours” posts to recap the questions community members ask at our open forum with product team members, but we get questions on Slack, via social, and at events.
Enter “Your Questions Answered,” where we’ll share - and answer - some of the most interesting, generally applicable, or frequently asked questions from the latest Office Hours sessions, our Slack community, and more.
Have a question you’d like answered?
- Sign up for an upcoming Office Hours session (Tuesdays @ 9am PT / 12pm ET / 5pm GMT)
- Join our Slack channel (our engineers are active in all channels)
- Start a conversation on Twitter (@timescaleDB)
Without further ado, let’s dive in.
How do I insert data into TimescaleDB with AWS Lambda?
We have a full blog post to help you out with this :). Check out “Using AWS Lambda with Timescale Cloud” - it uses IoT data as an example, but the same steps and best practices apply to monitoring, geospatial, or any other type of data you may be storing in TimescaleDB.
If you run into issues or have questions, reach out on Slack at any time.
What limitations should I consider with continuous aggregates?
Using continuous aggregates is a great way to ensure fast and efficient query results; they remove the need to perform complex operations at query time and pre-calculate the needed result set in the background (if you’re new to continuous aggregates, see our tutorial).
If you’re going to build and update large numbers of continuous aggregates, some things that you want to monitor and keep in mind:
- Make sure that you have ample compute resources and that you monitor your job schedule, as your background jobs will consume CPU, memory, and background workers to perform updates.
- Avoid worker contention - via setting up enough background workers and spreading out your jobs - to ensure up-to-date continuous aggregates.
Two questions to ask:
1. Are my jobs sufficiently spread out, so I’m not fighting for background workers at the same time? To check if your jobs are too closely spaced: select * from timescaledb_information.continuous_aggregate_stats, which will return something like the following:
SELECT job_id, total_runs, total_failures, total_successes FROM timescaledb_information.job_stats WHERE hypertable_name = 'test_table'; job_id | total_runs | total_failures | total_successes --------+------------+----------------+----------------- 1001 | 1 | 0 | 1 1004 | 1 | 0 | 1 (2 rows)
Pay particular attention to next_scheduled_run and ensure that the jobs will attempt to run in the same timeframe. If you need to alter your job schedule, please see this section of our documentation.
2. Do I have enough background workers to support the number of continuous aggregate updates I’m running? To check, run
SHOW ALL and look at how many background workers (parallel) and TimescaleDB background workers you’ve set.
What are some
chunk_size best practices?
Chunk_size is based on the time intervals that make sense for your data and workloads. The key factor in choosing the time intervals you use to chunk your data is that the chunk (including indexes) in the most recent interval fits into memory. While it’s always safer to make your chunks smaller rather than too large, making them too small can lead to too many chunks - which may cause latency in some queries.
As such, we typically recommend setting your time intervals so that chunk(s) are no more than 25% of main memory.
Important tip: Make sure that you’re planning for single chunks from all active hypertables fit into 25% of main memory, rather than 25% per hypertable.
Does TimescaleDB perform well on geospatial and temporal queries on large datasets? I have ~1B rows of time-series data in Cassandra that I’d like to migrate to Timescale & PostGIS.
We have quite a few customers using TimescaleDB and PostGIS, including Sentinel Marine Solutions (a maritime fleet monitoring provider). The Sentinel team collects real-time location, remaining range and state of charge, travel distance, fuel levels, and much more for all of their customers’ boats (for reference, they expect to surpass 5TB of data over time).
The success story dives into their architecture, and we’ve created multiple tutorials to demonstrate how to combine TimescaleDB and PostGIS, including:
- Timescale Example Repo (GitHub): collection of our example datasets, integrations, and tutorials, including the MTA geospatial example (ingest, store, and query real-time bus location data).
- IoT Monitoring and Time Series Analysis (YouTube): We demo'd this at AWS to show how to use features like
gap_fillto monitor all buses running in NYC and display real-time results in Grafana.
- Time Series Benchmark Suite (GitHub): see the IoT query example, which simulates GPS truck location tracking and sensor monitoring data (average miles per gallon, fuel tank levels, etc.), and benchmarks performance across various databases.
- Hello Timescale Tutorial (Timescale Docs): use this tutorial to get familiar with TimescaleDB functions (jump to the GIS section to dive into more complex GIS queries).
How can I prepare to scale my reads and writes (I’m new to TimescaleDB)?
In vanilla Postgres, writes are typically scaled vertically, so you want to make sure you have plenty of memory, CPU, fast disk, and fast networking. Postgres typically handles read scaling via read-replicas for horizontal scaling.
- Check out our Engineering team’s “High availability and scalable reads in PostgreSQL” blog post to learn more about using Postgres’ streaming replication to improve read scale and performance.
On top of basic Postgres best practices, with TimescaleDB, we leverage in-memory and size chunks to ingest data through memory. From a user perspective, TimescaleDB exposes what look like singular tables, called hypertables, that are actually an abstraction or a virtual view of many individual tables holding the data, called chunks.
- See our Getting Started Configuration documentation to learn more about how to use
timescaledb_tuneto optimize your memory, CPU, and other settings for better performance.
- See our “Scaling out TimescaleDB” tutorial to walk through how to configure your primary and replication databases, add recovery settings, run diagnostics, and more.
Schema and table design also play a role in scale and performance, either leading to significant performance improvements (correct schema design) or significant performance degradation (wrong schema design). “Correct schema design” varies by use case and workload, so we recommend reading our Schema Management documentation for more detailed information and best practices.
Finally, a tuned database means faster write and read performance. Tuning your TimescaleDB database is essential to leveraging your existing hardware and we built a tool to help you do just that.
If you’d like to learn more about how to increase performance of your reads or writes, reach out to the team on Slack at any time (we have a wealth of knowledge on staff and are happy to help).
I’m currently in the public multi-node beta testing multi-node for AWS. I’m running into width limitations – where can I find information about workarounds or when new code is checking in?
Firstly, thank you for trying and giving feedback on our beta!
Over the next month or so, we’ll be getting much more aggressive about rounding up, replying to, and triagging any multi-node GitHub issues to ensure we’re solving problems for our users.
That concludes this month’s installment of Your Questions Answered, but we’re always here to help - whether you’re just getting started with Timescale or a seasoned user with a very specific use case.