How to shape sample data with PostgreSQL generate_series() and SQL
In the lifecycle of any application, developers are often asked to create proof-of-concept features, test newly released functionality, and visualize data analysis. In many cases, the available test data is stale, not representative of normal usage, or simply doesn't exist for the feature being implemented. In situations like this, knowing how to quickly create sample time-series data with native PostgreSQL and SQL functions is a valuable skill to draw upon!
In this three-part series on generating sample time-series data, we demonstrate how to use the built-in PostgreSQL function,
generate_series(), to more easily create large sets of data to help test various workloads, database features, or just to create fun samples.
In part 1 and part 2 of the series, we reviewed how
generate_series() works, how to join multiple series using a CROSS JOIN to create large datasets quickly, and finally how to create and use custom PostgreSQL functions as part of the query to generate more realistic values for your dataset. If you haven't used
generate_series() much before, we recommend first reading the other two posts. The first one is an intro to the generate_series() function, and the second one shows how to generate more realistic data.
With those skills in hand, you can quickly and easily generate tens of millions of rows of realistic-looking data. Even still, there's one more problem that we hinted at in part 2 - all of our data, regardless of how it's formatted or constrained, is still based on the
random() function. This means that over thousands or millions of samples, every device we create data for will likely have the same MAX() and MIN() value, and the distribution of random values over millions of rows for each device generally means that all devices will have similar average values.
This third post demonstrates a few methods for influencing how to create data that mimics a desired shape or trend. Do you need to simulate time-series values that cycle over time? What about demonstrating a counter value that resets every so often to test the counter_agg hyperfunction? Are you trying to create new dashboards that display sales data over time, influenced for different months of the year when sales would ebb and flow?
Below we'll cover all of these examples to provide you with the final building blocks to create awesome sample data for all of your testing and exploration needs. Remember, however, that these examples are just the beginning. Keep playing. Tweak the formulas or add different relational data to influence the values that get generated so that it meets your use case.
Time-series data often has patterns. Weather temperatures and rainfall measurements change in a (mostly) predictable way throughout the year. Vibration measurements from an IoT device connected to an air conditioning system usually increase in the summer and decrease in the winter. Manufacturing data that measures the total units produced per hour (and the percentage of defective units) usually follow a pattern based on shift schedules and seasonal demand.
If you want to demonstrate this kind of data without having access to the production dataset, how would you go about it using
generate_series()? SQL functions ended up being pretty handy when we discussed different methods for creating realistic-looking data in part 2. Do you think they might help here? 😉
Two options to easily return the row number
Remember, for our purposes we're specifically talking about creating sample time-series data. Every row increases along the time axis, and if we use the multiplication formula from part 1, we can determine how many rows our sample data query will generate. Using built-in SQL functions, we can quickly start manipulating data values that change with the cycle of time. 💥
There are many reasons why it can be helpful to know the ordinal position of each row number in a query result. That's why standard SQL dialects have some variation of the
row_number() over() window function. This simple, yet powerful, window function allows us to return the row number of a result set, and can utilize the ORDER BY and PARTITION keywords to further determine the row values.
SELECT ts, row_number() over(order by time) AS rownum FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') ts; ts |rownum| -----------------------------+------+ 2022-01-01 00:00:00.000 -0500| 1| 2022-01-02 00:00:00.000 -0500| 2| 2022-01-03 00:00:00.000 -0500| 3| 2022-01-04 00:00:00.000 -0500| 4| 2022-01-05 00:00:00.000 -0500| 5|
In a normal query, this can be useful for tasks like paging data in a web API when there is a need to consistently return values based on a common partition.
There's one problem though.
row_number() over() requires PostgreSQL (and any other SQL database) to process the query results twice to add the values correctly. Therefore, it's very useful, but also very expensive as datasets grow.
Fortunately, PostgreSQL helps us once again for our specific use case of generating sample time-series data.
Through this series of blog posts on generating sample time-series data, we've discussed that
generate_series() is a Set Returning Function (SRF). Like the results from a table, set data can be JOINed and queried. Additionally, PostgreSQL provides the
WITH ORDINALITY clause that can be applied to any SRF to generate an additional, incrementing BIGINT column. The best part? It doesn't require a second pass through the data in order to generate this value!
SELECT ts AS time, rownum FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts,rownum); time |rownum| -----------------------------+------+ 2022-01-01 00:00:00.000 -0500| 1| 2022-01-02 00:00:00.000 -0500| 2| 2022-01-03 00:00:00.000 -0500| 3| 2022-01-04 00:00:00.000 -0500| 4| 2022-01-05 00:00:00.000 -0500| 5|
Because it serves our purpose and is more efficient, the remainder of this post will use
WITH ORDINALITY. However, remember that you can accomplish the same results using
row_number() over() if that's more comfortable for you.
Harnessing the row value
With increasing timestamps and an increasing integer on every row, we can begin to use other functions to create interesting data.
Remember from the previous blog posts that calling a function as part of your query executes the function for each row and returns the value. Just like a regular column, however, we don't have to actually emit that column in the final query results. Instead, the function value for that row can be used in calculating values in other columns.
As an example, let's modify the previous query. Instead of displaying the row number, let's multiply the value by 2. That is, the function value is treated as an input to a multiplication formula.
SELECT ts AS time, 2 * rownum AS rownum_by_two FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts,rownum); time |rownum_by_two| -----------------------------+------+ 2022-01-01 00:00:00.000 -0500| 2| 2022-01-02 00:00:00.000 -0500| 4| 2022-01-03 00:00:00.000 -0500| 6| 2022-01-04 00:00:00.000 -0500| 8| 2022-01-05 00:00:00.000 -0500| 10|
Easy enough, right? What else can we do with the row number value?
Counters with reset
Many time-series datasets record values that reset over time, often referred to as counters. The odometer on a car is an example. If you drive far enough, it will "roll over" to zero again and start counting upward. The same is true for many utilities, like water and electric meters, that track consumption. Eventually, the total digits will increment to the point where the counter resets and starts from zero again.
To simulate this with time-series data, we can use the incrementing row number and after some period of time, reset the count and start over using the modulus operator (%).
– This example resets the counter every 10 rows WITH counter_rows AS ( SELECT ts, CASE WHEN rownum % 10 = 0 THEN 10 ELSE rownum % 10 END AS row_counter FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum) ) SELECT ts, row_counter FROM counter_rows; ts |row_counter| -----------------------------+-----------+ 2022-01-07 13:17:46.427 -0500| 1| 2022-01-07 13:17:47.427 -0500| 2| 2022-01-07 13:17:48.427 -0500| 3| 2022-01-07 13:17:49.427 -0500| 4| 2022-01-07 13:17:50.427 -0500| 5| 2022-01-07 13:17:51.427 -0500| 6| 2022-01-07 13:17:52.427 -0500| 7| 2022-01-07 13:17:53.427 -0500| 8| 2022-01-07 13:17:54.427 -0500| 9| 2022-01-07 13:17:55.427 -0500| 10| 2022-01-07 13:17:56.427 -0500| 1| … | …
By putting the CASE statement inside of the CTE, the counter data can be selected more easily to test other functions. For instance, to see how the
delta() hyperfunctions work, we can use
time_bucket() to group our 1-second readings into 1-minute buckets.
WITH counter_rows AS ( SELECT ts, CASE WHEN rownum % 10 = 0 THEN 10 ELSE rownum % 10 END AS row_counter FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum) ) SELECT time_bucket('1 minute', ts) bucket, delta(counter_agg(ts,row_counter)), rate(counter_agg(ts, row_counter)) FROM counter_rows GROUP BY bucket ORDER BY bucket; bucket |delta|rate| -----------------------------+-----+----+ 2022-01-07 13:25:00.000 -0500| 33.0| 1.0| 2022-01-07 13:26:00.000 -0500| 59.0| 1.0| 2022-01-07 13:27:00.000 -0500| 59.0| 1.0| 2022-01-07 13:28:00.000 -0500| 59.0| 1.0| 2022-01-07 13:29:00.000 -0500| 59.0| 1.0| 2022-01-07 13:30:00.000 -0500| 26.0| 1.0|
time_bucket() outputs the starting time of the bucket, which based on our date math for
generate_series() produces four complete buckets of 1-minute aggregated data, and two partial buckets - one for the minute we are currently in, and a second bucket for the partial 5 minutes ago. We can see that the delta correctly calculates the difference between the last and first readings of each bucket, and the rate of change (the increment between each reading) correctly displays a unit of one.
What are some other ways we can use these PostgreSQL functions to generate different shapes of data to help you explore other features of SQL and TimescaleDB quickly?
Increasing trend over time
With the knowledge of how to create an ordinal value for each row of data produced by
generate_series(), we can explore other ways of generating useful time-series data. Because the row number value will always increase, we can easily produce a random dataset that always increases over time but has some variability to it. Consider this a very rough representation of daily website traffic over the span of two years.
SELECT ts, (10 + 10 * random()) * rownum as value FROM generate_series ( '2020-01-01'::date , '2021-12-31'::date , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
In reality this chart isn't very realistic or representative. Any website that gains and loses viewers upwards of 50% per day probably isn't going to have great long-term success. Don't worry, we can do better with this example after we learn about another method for creating shaped data using sine waves.
Simple cycles (sine wave)
Using the built-in
cos() PostgreSQL functions, we can generate data useful for graphing and testing functions that need a predictable data trend. This is particularly useful for testing TimescaleDB downsampling hyperfunctions like lttb or asap. These functions can take tens of thousands (or millions) of data points and return a smaller, but still accurately representative dataset for graphing.
We'll start with a basic example that produces one row per day, for 30 days. For each row number value, we'll get the sine value that can be used to graph a wave.
–- subtract 1 from the row number for wave to start -- at zero radians and produce a more representative chart SELECT ts, cos(rownum-1) as value FROM generate_series('2021-01-01','2021-01-30',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum); ts |value | -----------------------------+--------------------+ 2021-01-01 00:00:00.000 -0500| 1.0| 2021-01-02 00:00:00.000 -0500| 0.5403023058681398| 2021-01-03 00:00:00.000 -0500| -0.4161468365471424| 2021-01-04 00:00:00.000 -0500| -0.9899924966004454| 2021-01-05 00:00:00.000 -0500| -0.6536436208636119| 2021-01-06 00:00:00.000 -0500| 0.28366218546322625| … | …
Unfortunately, the graph of this SINE wave doesn't look all that appealing. For one month of daily data points, we only have ~6 distinct data points from peak to peak of each wave.
The reason our sine wave is so jagged is that sine and cosine values are measured in radians (based on 𝞹), not degrees. A complete cycle (peak-to-peak) on a sine wave happens from zero to 2*𝞹 (~6.28…). Therefore, every ~6 rows of data will produce a complete period in the wave - unless we find a way to modify that value.
To take control over the sine/cosine values, we need to think about how to modify the data based on the date range and interval (how many rows) and what we want the wave to look like.
This means we need to take a quick trip back to math class to talk about radians.
Math class flashback!
Step back with me for a minute to primary school and your favorite math subject - Algebra 2 (or Trigonometry as the case may be). How many hours did you spend working with graph paper (or graphing calculators) determining the amplitude, period, and shift of a sine or cosine graph?
If you reach even further into your memory, you might remember this formula which allows you to modify the various aspects of a wave.
There's a lot here, I know. Let's primarily focus on the two numbers that matter most for our current use case:
X = the "number of radians", which is the row number in our dataset
B = a value to multiply the row number by, to decrease the "radian" value for each row
(A, C, and D change the height and placement of the wave, but to start, we want to elongate each period and provide more "points" on the line to graph.)
Let's start with a small dataset example, generating cosine data for three months of daily timestamps with no modifications.
SELECT ts, cos(rownum) as value FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum); ts |value | -----------------------------+---------------------+ 2021-01-01 00:00:00.000 -0500| 0.5403023058681398| 2021-01-02 00:00:00.000 -0500| -0.4161468365471424| 2021-01-03 00:00:00.000 -0500| -0.9899924966004454| 2021-01-04 00:00:00.000 -0500| -0.6536436208636119| 2021-01-05 00:00:00.000 -0500| 0.28366218546322625| 2021-01-06 00:00:00.000 -0500| 0.960170286650366| 2021-01-07 00:00:00.000 -0500| 0.7539022543433046| 2021-01-08 00:00:00.000 -0500| -0.14550003380861354| 2021-01-09 00:00:00.000 -0500| -0.9111302618846769| … | … 2021-03-29 00:00:00.000 -0400| 0.9993732836951247| 2021-03-30 00:00:00.000 -0400| 0.5101770449416689| 2021-03-31 00:00:00.000 -0400| -0.4480736161291701|
In this example, we see ~14 peaks in our wave because there are 90 points of data and without modification, the wave will have a period (peak-to-peak) every ~6.28 points. To lengthen the cycle, we need to perform some simple division.
[cycle modifying value] = 6.28/[total interval (rows) per cycle]
Using the same 3 months of generated daily values, let's see how to modify the data to lengthen the period of the wave.
One cycle per month (30 days)
If we want our daily data to cycle every 30 days, multiply our row number value by 6.28/30.
6.28/30 = .209 (the row number radians modifier)
SELECT ts, cos(rownum * 6.28/30) as value FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
One cycle per quarter (90 days)
6.28/90 = .07 (this is our radians modifier)
SELECT ts, cos(rownum * 6.28/90) as value FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
To modify the overall length of the period, you need to modify the row number value based on the total number of rows in the result and the granularity of the timestamp.
Here are some example values that you can use to modify the wave period based on the interval used with
|generate_series() interval||Desired period length||Divide 6.28 by…|
Modifying the wave amplitude and shift
Another tweak we can make to our wave data is to change the amplitude (difference between the min and max peaks) and, as necessary, shift the wave up or down on the Y-axis.
To do this, multiply the cosine value by the value that maximum value you want the wave to have. For example, we can multiply the monthly cycle data by 10, which changes the overall minimum and maximum values of the data.
SELECT ts, 10 * cos(rownum * 6.28/30) as value FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
Notice that the min/max values are now from -10 to 10.
We can take it one step further by adding a value to the output which will shift the final values up or down on the Y-axis. In this example, we modified the previous query by adding 10 to the value of each row which results in values from 0 to 20.
SELECT ts, 10 + 10 * cos(rownum * 6.28/30) as value FROM generate_series('2021-01-01','2021-03-31',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
Why spend so much time showing you how to generate and manipulate sine or cosine wave data, especially when we rarely see repeatable data this smooth in real life?
One of the main advantages of using consistent, predictable data like this in testing is that you can easily tell if your application, charting tools, and query are working as expected. Once you begin adding in unpredictable, real-life data, it can be difficult to determine if the data, query, or application are producing unexpected results. Quickly generating known data with a specific pattern can help rule out errors with the query, at least.
The second advantage of using a known dataset is that it can be used to shape and influence the results of other queries. Earlier in this post, we demonstrated a very simplistic example of increasing website traffic by multiplying the row number and a random value. Let's look at how we can join both datasets to create a better shape for the sample website traffic data.
Better website traffic samples
One of the key takeaways from this series of posts is that
generate_series() returns a set of data that can be JOINed and manipulated like data from a regular table. Therefore, we can join together our rough "website traffic" data and our sine wave to produce a smoother, more realistic set of data to experiment with. SQL for the win!
Overall this is one of the more complex examples we've presented, utilizing multiple common table expressions (CTE) to break the various sets into separate tables that we can query and join. However, this also means that you can independently modify the time range and other values to change the data that is generated from this query for your own experimentation.
-- This is the generate series data -- with a "short" date to join with later WITH daily_series AS ( SELECT ts, date(ts) AS day, rownum FROM generate_series ( '2020-01-01' , '2021-12-31' , '1 day'::interval) WITH ORDINALITY AS t(ts, rownum) ), -- This selects the time, "day", and a -- random value that represents our daily website visits daily_value AS ( SELECT ts, day, rownum, random() AS val FROM daily_series ORDER BY day ), -- This cosine wave dataset has the same "day" values which allow -- it to be joined to the daily_value easily. The wave value is used to modify -- the "website" value by some percentage to smooth it out -- in the shape of the wave. daily_wave AS ( SELECT day, -- 6.28 radians divided by 180 days (rows) to get -- one peak every 6 months (twice a year) 1 + .2 * cos(rownum * 6.28/180) as p_mod FROM daily_series day ) -- (500 + 20 * val) = 500-520 visits per day before modification -- p_mod = an adjusted cosine value that raises or lowers our data each day -- row_number = a big incremental value for each row to quickly increase "visits" each day SELECT dv.ts, (500 + 20 * val) * p_mod * rownum as value FROM daily_value dv INNER JOIN daily_wave dw ON dv.DAY=dw.DAY order by ts;
Without much effort, we are able to generate a time-series dataset, use two different SQL functions, and join multiple sets together to create fun, graphical data. In this example, our traffic peaks twice a year (every ~180 days) during July and late December.
But we don't have to stop there. We can carry our website traffic example one step further by applying just a little more control over how much the data increases or decreases during certain periods.
Once again, relational data to the rescue!
Influence the pattern with relational data
As a final example, let's consider one other type of data that we can include in our queries that influence the final generated values - relational data. Although we've been using data that was created using
generate_series() to produce some fun and interesting sample datasets, we can just as easily JOIN to other data in our database to further manipulate the final result.
There are many ways you could JOIN to and use additional data depending on your use case and the type of time-series data you're trying to mimic. For example:
- IoT data from weather sensors: store the typical weekly temperature highs/lows in a database table and use those values as input to the
random_between()function we created in post 2
- Stock data analysis: store the dates for quarterly disclosures and a hypothetical factor that will influence the impact on stock price moving forward
- Sales or website traffic: store the monthly or weekly change observed in a typical sales cycle. Does traffic or sales increase a quarter-end? What about during the end-of-year holiday season?
To demonstrate this, we'll use the fictitious website traffic data from earlier in this post. Specifically, we've decided that we want to see a spike in traffic during June and December.
First, we create a regular PostgreSQL table to store the numerical month (1-12) and a float value which will be used to modify our generated data (up or down). This will allow us to tweak the overall shape for a given month.
CREATE TABLE overrides ( m_val INT NOT NULL, p_inc FLOAT4 NOT null ); INSERT INTO overrides(m_val, p_inc) VALUES (1,.1.04), – 4% residual increase from December (2,1), (3,1), (4,1), (5,1), (6,1.10),-- June increase of 10% (7,1), (8,1), (9,1), (10,1), (11,1.08), -- 8% early shoppers sales/traffic growth (12,1.18); -- 18% holiday increase
Using this simple dataset, let's first join it to the "simplistic" query that had randomly growing data over time.
WITH daily_series AS ( -- a random value that increases over time based on the row number SELECT ts, date_part('month',ts) AS m_val, (10 + 10*random()) * rownum as value FROM generate_series ( '2020-01-01'::date , '2021-12-31'::date , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum) ) -- join to the `overrides` table to get the 'p_inc' value -- for the month of the current row SELECT ts, value * p_inc AS value FROM daily_series ds INNER JOIN overrides o ON ds.m_val=o.m_val ORDER BY ts;
Joining to the
overrides table based on the month of each data point, we are able to multiply the percentage increase (
p_inc) value and the fake website traffic value to influence the trend of our data during specific time periods.
Combining everything we've learned and taking this example one step further, we can enhance the cosine data query with the same monthly override values to tweak our fake, cyclical time-series data that represents growing website traffic with a more realistic shape.
-- This is the generate series data -- with a "short" date to join with later WITH daily_series AS ( SELECT ts, date(ts) AS day, rownum FROM generate_series ( '2020-01-01' , '2021-12-31' , '1 day'::interval) WITH ORDINALITY AS t(ts, rownum) ), -- This selects the time, "day", and a -- random value that represents our daily website visits -- 'm_val' will be used to join with the 'overrides' table daily_value AS ( SELECT ts, day, date_part('month',ts) as m_val, rownum, random() AS val FROM daily_series ORDER BY day ), -- This cosine wave dataset has the same "day" values which allow -- it to be joined to the daily_value easily. The wave value is used to modify -- the "website" value by some percentage to smooth it out -- in the shape of the wave. daily_wave AS ( SELECT day, -- 6.28 radians divided by 180 days (rows) to get -- one peak every 6 months (twice a year) 1 + .2 * cos(rownum * 6.28/180) as p_mod FROM daily_series day ) -- (500 + 20 * val) = 500-520 visits per day before modification -- p_mod = an adjusted cosine value that raises or lowers our data each day -- row_number = a big incremental value for each row to quickly increase "visits" each day -- p_inc = a monthly adjustment value taken from the 'overrides' table SELECT dv.ts, (500 + 20 * val) * p_mod * rownum * p_inc as value FROM daily_value dv INNER JOIN daily_wave dw ON dv.DAY=dw.DAY inner join overrides o on dv.m_val=o.m_val order by ts;
Wrapping it up
In this 3rd and final blog post of our series about generating sample time-series datasets, we demonstrated how to add shape and trend into your sample time-series data (e.g., increasing web traffic over time and quarterly sales cycles) using built-in SQL functions and relational data. With a little bit of math mixed in, we learned how to manipulate the pattern of generated data, which is particularly useful for visualizing time-series data and learning analytical PostgreSQL or TimescaleDB functions.
To see some of these examples in action, watch my video on creating realistic sample data:
If you have questions about using generate_series() or have any questions about TimescaleDB, please join our community Slack channel, where you'll find an active community and a handful of the Timescale team most days.
If you want to try creating larger sets of sample time-series data using generate_series() and see how the exciting features of TimescaleDB work, sign up for a free 30-day trial or install and manage it on your instances. (You can also learn more by following one of our many tutorials.)
🔥Create large realistic time-series datasets using generate_series() and TimescaleDB
Sign up for free