Currently I am working on a task to store timeseries data for which we have choosen timescale. Previously I was storing the data into timestamp UTC and was calculating the continous aggregates for every day. But with this approach there will be data inconsistency issues as if my user is in 1st December 1 AM UTC + 5:30 timezone then that data will be stored at 30th November ( as -5:30 will make it us to go one day before).
To find solutions I recently came across Database continuous aggregation with time zone · Issue #1248 · timescale/timescaledb · GitHub for which
- one solution could be create timezone specific continous aggregates such as
time_bucket('1 day', "time", 'Asia/kolkata') AS day,
sum(value) AS total,
avg(value) AS average
GROUP BY day, user_id;
but with this approach we will have to create continous aggregate for every timezone which is not a ideal way as a user can change its timezone any time .
- second solution can be to create continous aggregates of 30 mins each UTC and then convert it to timezone on fly.
Please suggest a better approach . Thank you.
There are a few unknowns in your question as you have stated it.
When I have done this sort of work, the solution has always been the second approach you offered. This is because if you need aggregate data across a significant set of users it is way easier to do this on 30 minute aggregates than it is to try to recalculate a bunch of timezones.
This being said, if you have a particular case where you need per timezone aggregates across users, that may be different. To be honest, though, I am having trouble thinking of such cases.
So I would certainly stick with 30 minute UTC-based approaches first, and only consider tweaking if that proves insufficient for some reason.
Thank you Chris for your insights. I have few questions
If we do 30 mins aggregate will it not cause data inconsistency problem where UTC offset is 15 mins? say UTC + 5:45 hrs
If we do 30 mins aggregate then do we have to store the time in timestamp or timestampz ?
Also does timescale currently have support for dealing with different timezones together? I read about integer time concept Timescale Documentation | Time and continuous aggregates , not sure if that can solve ?
Please suggest. Thank you
Yeah ,you lose resolution for 15 minute offsets, so if those are super-important to you, then you need to do them in 15 minute buckets instead.
On the timestamp or timestamptz, I think there is a significant tradeoff here. I think you want to avoid mixing datatypes here. Usually I tend to store timestamp and timezone separately, but because math can be different in the two, it makes it harder to keep the math clear in your head if you have both datatypes in use.
On the third, I don’t understand your question, could you restate what kind of support in dealing with different timezones together you are looking for?
Hi Chris, Thank you for the explanation.
- Currently I am working on a fitness tracking backend, where I am tracking real time steps.
The requirements that I am currently working on is to show daily, weekly and monthly trends.
So for example user can be in any timezone (or can change timezone as per their wish) and in each case the aggregates should be according to the timezone.
for this currently we are thinking to go for 30 min aggregates (excluding 15min UTC timezone for now) and then based on the start and end time we can calculate daily, weekly and monthly trends from the 30 min aggregate .
Can you please suggest if the approach looks good to you.
- Secondly, currently we are using spring jdbc for insertion and retrieval from timescale for which the response time is 500ms on average, which is not ideal. Can you suggest the steps to improve performance for using timescale. Below are the config details
a) hypertable count : 2
b) hypertable chunk size : 1 week
c) index in both hypertables on time and 2 columns (total columns are 4 in both)
Thanks in advance, Looking forward to hearing from you
So thinking more about this, I think that you may want to think through your analytics first. These are per-user aggregates, right? And so the question is whether time of day vs 24 hour period is most important.
Then there are the corner cases: someone living near a timezone border who regularly walks across the border and back. This is not a corner case maybe you feel like you need to handle but it may be worth thinking about when we think about the need to cross timezones.
In other words, the question is “what does it mean to your analytics when someone crosses a timezone?” I think that would probably steer my design decisions.
My own thinking is that it might make more sense to aggregate by hour (localtime) and timezone per user. Most users will spend all their time in a timezone in a given week, and even those that change are unlikely to do so within the hour. That would cut your number of buckets significantly. And if time of day (regardless of timezone) is important, that would make larger analytics easier. But it makes it harder to address “how many steps did people do in a given hour UTC time” since you have half-hour and quarter-hour increments that would overlap. I am not sure that’s a very helpful metric though but it is worth considering.
I terms of performance, it would be helpful to see query plans (preferably with timing, aka
EXPLAIN ANALYZE and (possibly anonymized) queries.