Written by Anber Arif
Time-series data is collected at specific intervals across various industries. It’s essential for tracking changes over time, identifying patterns, and making predictions. For instance, financial markets analyze daily stock prices, meteorologists measure hourly temperatures, and businesses track monthly sales figures. This data helps observe trends and seasonal patterns, which is crucial for strategic decision-making.
A critical aspect of analyzing time-series data is understanding its stationarity since many statistical methods and models assume stationarity. Non-stationary data can lead to inaccurate results, while stationary analysis simplifies complexity and enhances interpretability.
In this article, we will cover the basics of stationary time-series analysis, providing a comprehensive overview of the key concepts and techniques for analyzing such data effectively. We will also share stationary analysis examples: you'll learn how to handle time-series data efficiently and apply stationary analysis to gain meaningful insights, ultimately enhancing your ability to make data-driven decisions.
A stationary time series is a type of time series whose statistical properties do not change over time. This means that the process's statistical behavior remains constant, regardless of when the observations were recorded.
Intuitively, this means that if we were to take a snapshot of the time series at any point in time, the statistical properties of that snapshot would be similar to those of any other snapshot taken at a different point in time. For example, the mean and variance of the time series would be identical, and the correlation between observations would depend only on the time difference between them, not on the absolute time at which they were recorded.
Technically, a stationary time series is defined as a stochastic process whose joint distribution is shift-invariant. This means that the joint distribution of the process remains the same, regardless of how much we shift the time axis. In other words, the statistical properties of the process do not change over time.
We can look for specific characteristics in the data to determine whether a time series is stationary or non-stationary. For example, a stationary time series will have a constant mean, variance, and autocorrelation over time. On the other hand, non-stationary time series may exhibit trends, seasonality, or changing variance.
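To see these characteristics numerically, here is a quick illustrative sketch in Python with NumPy (using simulated data, not data from this article) that compares summary statistics across two halves of a stationary series and a random walk:

```python
import numpy as np

rng = np.random.default_rng(7)
noise = rng.normal(size=1000)   # stationary: constant mean and variance
walk = np.cumsum(noise)         # non-stationary: variance grows over time

# A stationary series has similar statistics in any window
print("noise std, 1st vs 2nd half:", noise[:500].std(), noise[500:].std())
# A random walk does not: its spread depends on which window you look at
print("walk  std, 1st vs 2nd half:", walk[:500].std(), walk[500:].std())
```

The white noise keeps roughly the same spread in both halves, while the random walk's statistics vary with the window, which is exactly the kind of behavior the visual checks above are probing for.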
Based on these characteristics, we can examine a collection of example time-series plots, labeled (a) through (i), and identify which are stationary and which are non-stationary.
Series (d), (h), and (i) are non-stationary due to their obvious seasonality. Seasonality refers to a pattern that repeats at regular intervals over time. In these series, we can see a clear pattern of peaks and troughs that repeat every year or every few years.
Series (a), (c), (e), (f), and (i) are also non-stationary due to their trends and changing levels. A trend is a long-term increase or decrease in the mean of the time series, while changing levels refer to shifts in the mean that occur suddenly and persist over time. In these series, we can see that the mean is not constant over time but instead increases or decreases systematically.
Increasing variance also rules out series (i). Variance refers to the spread of data around the mean. In this series, we can see that the variance is increasing over time, indicating that the data is becoming more spread out and less predictable.
This leaves only series (b) and (g) as potential stationary series. At first glance, series (g) might appear to be non-stationary due to its strong cycles. However, these cycles are aperiodic, meaning that they do not repeat at regular intervals. Instead, they are caused by the lynx population becoming too large for the available food supply, leading to a decrease in breeding and a subsequent decrease in the population. Once the food supply regenerates, the population can grow again, leading to another cycle. Because the timing of these cycles is not predictable in the long term, the series is considered stationary.
Stationarity can be defined in various ways, each with its own set of criteria. Let’s explore the common types of stationarity:
Strong stationarity aligns with the initial definition provided above: the statistical properties of the process remain constant over time. In essence, strong stationarity requires that the entire probability distribution of the process remains unchanged under time shifts. This means that the distribution's shape, spread, and location remain the same regardless of when the observations were recorded.
Consider a hypothetical time-series dataset representing daily temperatures in a region. If the dataset exhibits strong stationarity, it implies that not only does the average temperature remain constant over time, but also the entire distribution of temperatures remains unchanged. For instance, regardless of whether we observe the temperatures in January or July, the distribution of temperatures, including the frequency of cold and hot days, remains the same.
In first-order stationarity, the key criterion is that the average of the time series is shift-invariant. This means that regardless of when the observations were recorded, the mean of the process remains constant over time. In other words, the central tendency of the data does not change as we shift along the time axis.
Let’s consider a time-series dataset representing the daily closing prices of a particular stock over a year. If the dataset exhibits a first-order stationarity, it implies that the average closing price of the stock remains constant throughout the year, irrespective of the specific date or time period being considered. For instance, if the average closing price over the entire year is $100 per share, this average remains consistent regardless of whether we analyze the data in January or December. This stability in the average closing price indicates first-order stationarity.
Weak stationarity extends the principles of first-order stationarity by adding another criterion: the cross-covariance between different time points is also stationary. This means that not only does the average of the time series remain constant over time, but also the relationship between different observations remains stable.
Let's continue with the example of a stock's daily closing prices. In a weakly stationary time series, not only does the average closing price remain constant over time, but also the relationship between the closing prices of different days remains stable. For instance, if the closing price of the stock on one day is highly correlated with the closing price of the next day, this relationship remains consistent over time, regardless of the specific dates.
In second-order stationarity, not only does the average of the time series remain constant over time, but the variance of the process should also remain constant. This means that in addition to a shift-invariant mean, the spread or variability of the data points around the mean should also remain consistent over time.
For instance, if the average closing price of the stock over a certain period is $100 per share, and the variance or standard deviation of the closing prices around this average is $5, then second-order stationarity implies that the variance remains constant over time. This means that even with fluctuations in the stock prices from day to day, the degree of variability or dispersion of the closing prices around the mean remains consistent.
The concept of stationarity encompasses the stability of statistical properties with respect to time shifts. However, the interpretation of stationarity can vary depending on the analysis or forecasting goals. A more lenient definition may tolerate minor fluctuations in statistical properties, providing flexibility for analyzing noisy or irregular data. Conversely, a stricter definition demands consistent stability in critical metrics such as mean and variance, which is vital for accurate forecasting or hypothesis testing. Therefore, the choice of the stationarity definition depends on analysis objectives and data characteristics.
Let’s explore the significance of stationarity in time-series analysis:
Time-series data spans time: Time-series data is inherently temporal, capturing observations over a sequence of time intervals. For meaningful analysis, the statistical properties of the data must remain consistent over time. Such consistency allows analysts to draw reliable conclusions about the underlying processes driving the data and make informed decisions based on these insights.
Simplicity in statistical analysis: When the statistical properties of a time series vary with time, statistical analysis becomes challenging and complex. Inconsistencies in properties such as mean, variance, and autocorrelation can complicate modeling efforts and hinder interpretation. Stationarity simplifies the analysis by providing a stable framework where these properties remain constant over time.
Essential for projections and models: Assuming some level of stationarity is crucial, particularly for projections and modeling purposes. By assuming stationarity, we can handle the noise in the data more effectively, leading to more reliable forecasts and model outcomes.
Understanding whether a time series is stationary is crucial for accurate modeling and forecasting. In this section, we will explore various methods used to determine the stationarity of a time series.
A unit root test is a statistical method used to determine whether a time series is non-stationary due to the presence of a unit root. This concept is rooted in the stochastic properties of the time series and its characteristic equation.
The unit root test evaluates whether the value 1 is a root of the characteristic equation derived from the stochastic process governing the time series. In simpler terms, this involves checking if the series can be described by an equation where the presence of a unit root (value 1) implies non-stationarity.
Mathematically, this means that the series can be represented by a first-order autoregressive model:

yt = ρyt−1 + ϵt

where yt is the value of the time series at time t, yt−1 is the value of the time series at time t−1, ρ is the autoregressive coefficient, and ϵt is a white noise error term. If ρ = 1, the series has a unit root.
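As an illustration, here is a minimal Python/NumPy simulation (a sketch with simulated data, not part of the original article) contrasting an autoregressive series with |ρ| < 1 against one with ρ = 1:

```python
import numpy as np

def ar1(rho, n=1000, seed=0):
    """Simulate y_t = rho * y_{t-1} + eps_t with standard-normal noise."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = rho * y[t - 1] + rng.normal()
    return y

mean_reverting = ar1(0.5)  # |rho| < 1: stationary, reverts toward zero
unit_root = ar1(1.0)       # rho = 1: random walk, shocks persist forever
```

With ρ = 0.5 the series hovers around its mean with bounded spread; with ρ = 1 each shock is carried forward indefinitely, so the series wanders with ever-growing variance.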
Non-stationarity: If the test indicates that 1 is a root of the characteristic equation, it confirms that the time series is non-stationary. This non-stationarity is characterized by changing statistical properties over time, such as mean and variance.
No deterministic trend: Even though the series is non-stationary, it does not imply a deterministic trend. Instead, the series follows a random walk, which can drift over time without a predictable pattern.
Fundamentally chaotic: In simple terms, a time series with a unit root is fundamentally chaotic. This means that shocks to the system have permanent effects, causing the series to drift away from its initial value rather than reverting to a long-term mean. This behavior makes it challenging to model and forecast, as past values do not provide a stable basis for prediction.
The most commonly used unit root tests are as follows:
The Augmented Dickey-Fuller (ADF) is a widely used statistical test for detecting the presence of a unit root in a time series, which helps determine whether the series is non-stationary. The test extends the simpler Dickey-Fuller test by including lagged differences of the series to account for higher-order autocorrelation, making it more robust and reliable for practical applications.
The ADF test works by estimating the following regression equation:

ΔYt = α + βt + γYt−1 + δ1ΔYt−1 + … + δpΔYt−p + ϵt

where:
ΔYt is the first difference of the time series Yt (i.e., Yt − Yt−1).
t is a time trend (optional).
Yt−1 is the lagged level of the time series.
α is a constant.
β is the coefficient on a time trend.
γ is the coefficient of the lagged level of the time series.
p is the number of lagged differences included in the model.
δi are coefficients of the lagged differences.
ϵt is the error term.
The null hypothesis of the ADF test is that the time series has a unit root (i.e., γ=0), implying non-stationarity. The alternative hypothesis is that the series is stationary (γ<0).
The outcome of the ADF test is a test statistic that tells us about the likelihood of a unit root being present in the time series.
Test statistic: The ADF test statistic is calculated as:

ADF statistic = γ̂ / SE(γ̂)

where γ̂ is the estimated coefficient of the lagged level term Yt−1 and SE(γ̂) is its standard error.
Here's the big picture of interpreting this number:
More negative values: The more negative the ADF statistic, the more confident we can be that the series does not have a unit root and is, therefore, stationary. This is because large negative values indicate that the estimated γ is significantly less than zero, suggesting that the time series reverts to a mean and does not exhibit a random walk.
Critical values: The computed ADF statistic is compared against critical values at different significance levels (e.g., 1 %, 5 %, and 10 %). If the ADF statistic is less than the critical value, we reject the null hypothesis and conclude that the series is stationary.
For example, if the ADF statistic is -3.5 and the critical value at the 5 % significance level is -3.0, we reject the null hypothesis of a unit root because -3.5 is more negative than -3.0. This indicates strong evidence that the time series is stationary.
The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test assesses whether a time series is trend-stationary, meaning it is stationary once any deterministic trend is removed. It provides a positive test for trend stationarity by checking if the series reverts to a mean after accounting for trends.
The KPSS test involves decomposing the time series Yt into three parts:
Yt = βt + rt + ϵt
where:
βt is a deterministic trend.
rt is a random walk.
ϵt is a stationary error term.
The test statistic for the KPSS test is based on the residuals from the regression of Yt on an intercept and the deterministic trend βt. The test statistic is given by:

KPSS = Σt St² / (T²σ̂²)

where:
St is the partial sum of the residuals up to time t.
T is the number of observations.
σ̂² is an estimate of the long-run variance of the residuals.
The null hypothesis of the KPSS test is that the time series is stationary around a deterministic trend. The alternative hypothesis is that the series is a non-stationary unit root process. If the test statistic is greater than the critical value, the null hypothesis is rejected, indicating that the series is not stationary and may have a unit root.
The critical values for the KPSS test are essential for determining whether to reject the null hypothesis of stationarity. The following table provides KPSS test critical values for different significance levels (10 %, 5 %, and 1 %) for two scenarios: Test A (Intercept Only) and Test B (Linear Trend).
| Significance Level | Test A (Intercept Only) | Test B (Linear Trend) |
|---|---|---|
| 0.10 | 0.347 | 0.119 |
| 0.05 | 0.463 | 0.146 |
| 0.01 | 0.739 | 0.216 |
To interpret these values, compute the KPSS statistic for your data and compare it to the relevant critical value. If the KPSS statistic exceeds the critical value, reject the null hypothesis of stationarity. For instance, a KPSS statistic of 0.5 compared to the 5 % significance level critical value of 0.463 in Test A would reject level stationarity, while a statistic of 0.12 compared to 0.146 in Test B would not lead to rejecting trend stationarity.
When time-series data is not stationary, transforming it can help stabilize its properties for analysis. Common techniques include:
Differencing: Calculate the difference between consecutive observations to remove trends and stabilize the mean. For instance, ΔYt = Yt − Yt−1.
Log transformation: Apply a logarithm to stabilize the variance. This is especially useful for data with exponential growth. For example, log(Yt).
Seasonal differencing: Remove seasonal effects by differencing the series at a lag equal to the seasonal period, such as Δ12Yt = Yt − Yt−12 for monthly data with annual seasonality.
Decomposition: Separate the time series into trend, seasonal, and residual components, where the residual (noise) component is typically stationary.
Once we have transformed the data into a stationary series, we can analyze its stable properties, such as the mean, variance, and autocorrelation function. Using these properties, we can build a time-series model and make predictions.
Finally, we can connect the results to the original series by reversing the transformation. For example, if we have taken the natural logarithm of the data, we can exponentiate the predicted values to obtain predictions for the original series.
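For example, a log transform followed by first differencing can be inverted exactly by cumulatively summing the differences and then exponentiating. This minimal NumPy sketch (simulated data, not from the article) demonstrates the round trip:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(100)
y = np.exp(0.03 * t + rng.normal(scale=0.05, size=100))  # exponential growth

log_y = np.log(y)    # log transform stabilizes multiplicative variance
d = np.diff(log_y)   # first difference removes the trend (log growth rates)

# Reverse the transformation: cumulative-sum the differences, then exponentiate
log_back = np.concatenate(([log_y[0]], log_y[0] + np.cumsum(d)))
y_back = np.exp(log_back)   # reproduces the original series exactly
```

The same inversion logic applies to model predictions: forecast the stationary differenced series, integrate (sum) the forecasts, then undo the log transform.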
In short, the transformation process is: transform the series to make it stationary, model the stationary series, then reverse the transformation to map predictions back to the original scale.
In time-series analysis, transformations are often applied to the original data to stabilize the variance, make the data stationary, or remove trends and seasonality. Here are three common transformations:
The random walk model is a non-stationary time-series model where the value at any point in time is the previous value plus a random shock. Mathematically, it can be expressed as:
Yt = Yt−1 + ϵt
where ϵt is a white noise error term.
However, in many cases, the time series may exhibit a trend where the value of the series increases or decreases over time. To account for this trend, we can introduce a drift term to the random walk model, resulting in the random walk model with drift:
Yt = c + Yt−1 + ϵt
where c is the drift term, representing the average change in the value of Yt between consecutive time periods. If c is positive, then the average change is an increase in the value of Yt, and the time series will tend to drift upwards over time. On the other hand, if c is negative, then the average change is a decrease in the value of Yt, and the time series will tend to drift downwards over time.
Transforming a random walk model into a stationary series typically involves differencing. Taking the first difference removes the dependence on the previous value, leaving only the constant drift plus noise, which is stationary:

ΔYt = Yt − Yt−1 = c + ϵt

where ΔYt is the first difference of Yt.
In this case, pulling back the transformation involves summing the differenced values to obtain the original series. If the differenced series is modeled, the predictions must be integrated (summed) to convert them back to the original scale.
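A short NumPy sketch (illustrative, with simulated data) shows that differencing a random walk with drift yields a stationary series whose mean is the drift c, and that summing the differences recovers the original series:

```python
import numpy as np

rng = np.random.default_rng(1)
c = 0.5                     # drift term
eps = rng.normal(size=200)
y = np.cumsum(c + eps)      # random walk with drift: Y_t = c + Y_{t-1} + eps_t

dy = np.diff(y)             # first difference: c + eps_t, which is stationary
print("mean of differences (close to the drift c):", dy.mean())

# "Pull back" the transformation: integrate (sum) the differences
y_back = np.concatenate(([y[0]], y[0] + np.cumsum(dy)))  # reproduces y exactly
```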
Second-order differencing involves taking the difference of the first-order differences. This technique is used when the time series shows a quadratic trend. Mathematically, it is expressed as:
Δ²Yt = Δ(Yt − Yt−1) = (Yt − Yt−1) − (Yt−1 − Yt−2) = Yt − 2Yt−1 + Yt−2
Transforming a series using second-order differencing helps remove polynomial trends, making it stationary. Pulling back this transformation involves summing the differences twice. If predictions are made on the second-order differenced series, they must first be integrated to obtain first-order differences and then summed again to return to the original scale.
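For instance, second-order differencing reduces a pure quadratic trend to a constant, as this small NumPy sketch shows:

```python
import numpy as np

t = np.arange(50, dtype=float)
y = 2.0 + 3.0 * t + 0.5 * t**2   # quadratic trend

d1 = np.diff(y)        # still trending: equals 2.5 + t
d2 = np.diff(y, n=2)   # constant: second differencing removes the quadratic trend
print(d2[:5])          # every value is 1.0 (twice the quadratic coefficient)
```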
Seasonal differencing is used to remove seasonal effects from a time series. This involves differencing the series at a lag equal to the seasonal period. For example, for monthly data with annual seasonality:
ΔsYt = Yt − Yt−s
where s is the seasonal period (e.g., 12 for monthly data with yearly seasonality).
Seasonal differencing removes seasonal patterns, resulting in a stationary series. Pulling back the transformation involves reversing the seasonal differencing by adding the differenced values to the series lagged by the seasonal period. If the transformed series is used for modeling and predictions, the predicted values must be adjusted by adding back the seasonal component.
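A minimal NumPy sketch (simulated monthly data with a 12-month cycle) shows that seasonal differencing removes an exactly repeating pattern:

```python
import numpy as np

t = np.arange(48)                          # four years of monthly observations
y = 10 + 2 * np.sin(2 * np.pi * t / 12)    # pure annual seasonality, period 12

ds = y[12:] - y[:-12]   # seasonal difference at lag s = 12
# The pattern repeats exactly every 12 steps, so the seasonal difference is ~0
print(np.abs(ds).max())
```

In real data the seasonal component is never this exact, so the seasonally differenced series keeps the trend and noise while the repeating pattern drops out.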
You can use several powerful time-series packages to perform stationary analysis:
TSstudio for R: TSstudio is an R package that provides a user-friendly interface for time-series analysis. It simplifies the process of exploring and understanding time-series data through intuitive functions.
Statsmodels for Python: This is a robust library for statistical modeling in Python, including tools for time series analysis. It provides capabilities for performing various tests, decompositions, and forecasting.
Django-TimescaleDB: When working with time-series data in a PostgreSQL environment, TimescaleDB is highly recommended for its performance and scalability. For Python integration, the django-timescaledb extension seamlessly integrates TimescaleDB with the Django framework, while the psycopg2 or psycopg3 libraries provide efficient low-level access to PostgreSQL databases. If in doubt, the libraries recommended for vanilla PostgreSQL remain fully compatible with TimescaleDB for managing and analyzing time-series data.
Let's take the example of the USgas dataset, which ships with TSstudio and represents the monthly consumption of natural gas (in billion cubic feet) in the US since January 2000.
Install the TSstudio package (if you haven’t yet installed it) and load the library as shown below:
install.packages("TSstudio")
library(TSstudio)
Now you can load the series and display its information as shown below:
# Load the USgas dataset
data("USgas")
ts_info(USgas)
Here’s how to decompose the series:
ts_decompose(USgas)
Trend: In this case, the trend is clearly upward, indicating that natural gas consumption in the US has been increasing over time.
Seasonal: The seasonal component of the time series represents the repeating patterns that occur at regular intervals.
In this case, the seasonal component shows a clear pattern of higher natural gas consumption in the winter months and lower consumption in the summer months. This is likely due to the increased demand for heating in the winter and cooling in the summer.
Random (Noise): The noise component shows some variability, but it is relatively small compared to the trend and seasonal components.
When it comes to the noise component of a time series, we can analyze its stationarity to better understand the underlying patterns and trends.
Let’s perform a second-order stationarity analysis to examine the stability of the noise component in our time-series data:
# Load the forecast package
library(forecast)
# Fit an ARIMA model to the USgas time series data
model <- auto.arima(USgas)
# Extract the residuals from the model
residuals <- resid(model)
# Perform the Ljung-Box test on the residuals for autocorrelation
lb_test_autocorr <- Box.test(residuals, type = "Ljung-Box")
# Test for constant mean using a t-test
mean_test <- t.test(residuals, mu = 0)
# Test for constant variance using a chi-squared test on the squared residuals
# (a rough heuristic, since chisq.test treats the values as counts)
variance_chi_squared <- chisq.test(residuals^2)
# Print the test results (cat() is used so "\n" prints as a newline)
cat("Autocorrelation (Ljung-Box test):\n")
print(lb_test_autocorr)
cat("\nMean test:\n")
print(mean_test)
cat("\nVariance test:\n")
print(variance_chi_squared)
This code fits an ARIMA model to the USgas time-series data using the auto.arima() function from the forecast package in R. It then extracts the residuals from the model and performs three tests to check for stationarity: the Ljung-Box test for autocorrelation, a t-test for a constant mean, and a chi-squared test for constant variance.
The analysis of the residuals indicates no significant autocorrelation, as shown by a p-value of 0.8227 from the Ljung-Box test. Since this p-value exceeds the typical significance level of 0.05, we do not reject the null hypothesis of no autocorrelation in the residuals.
Secondly, the one-sample t-test conducted to assess the mean of the residuals indicates that there is no statistically significant deviation from zero. With a p-value of 0.3528, we fail to reject the null hypothesis that the mean of the residuals is equal to zero. This suggests that, on average, the residuals are centered around zero, indicating that the ARIMA model is capturing the underlying trend in the data effectively. However, the chi-squared test suggests a significant lack of constant variance (p < 2.2e-16), potentially indicating a violation of second-order stationarity.
Here’s an example of a basic model using trend and seasonality with a confidence interval for the residuals and a prediction interval for the model in R:
# Load the forecast package
library(forecast)
# Load the USgas dataset
data(USgas)
# USgas is already a monthly ts object; recreate it explicitly with the
# correct start date (January 2000)
USgas_ts <- ts(USgas, start = c(2000, 1), frequency = 12)
# Fit a seasonal ARIMA model with trend and seasonality
model <- auto.arima(USgas_ts, seasonal = TRUE)
# Print the summary of the model
summary(model)
# Plot the original time series and the fitted values
plot(USgas_ts, main = "US Gas Consumption")
lines(fitted(model), col = "red")
# Calculate the residuals
residuals <- resid(model)
# Calculate the average and variance of the residuals
mean_residuals <- mean(residuals)
var_residuals <- var(residuals)
# Print the average and variance of the residuals
cat("Mean of residuals:", mean_residuals, "\n")
cat("Variance of residuals:", var_residuals, "\n")
# Calculate the standard deviation of the residuals
sd_residuals <- sd(residuals)
# Calculate the upper and lower bounds of the 95% confidence interval
upper_bound <- mean_residuals + 1.96 * sd_residuals
lower_bound <- mean_residuals - 1.96 * sd_residuals
# Plot the original time series, the fitted values, and the 95% confidence interval
plot(USgas_ts, main = "US Gas Consumption")
lines(fitted(model), col = "red")
# Draw the interval around the fitted values (not the raw observations)
lines(fitted(model) + upper_bound, col = "blue", lty = 2)
lines(fitted(model) + lower_bound, col = "blue", lty = 2)
# Shade the area between the upper and lower bounds of the 95% confidence interval
polygon(c(time(USgas_ts), rev(time(USgas_ts))),
        c(fitted(model) + upper_bound, rev(fitted(model) + lower_bound)),
        col = "lightblue", border = NA)
Editor’s Note: Learn more about time-series analysis using R.
In this code, we first load the USgas dataset and convert it to a time-series object. We then fit a seasonal ARIMA model with trend and seasonality using the auto.arima() function. The summary() function provides details about the fitted model.
Next, we plot the original time series and the fitted values obtained from the ARIMA model. We then calculate the residuals of the model and compute their mean and variance to assess the model’s performance.
We also calculate the standard deviation of the residuals to determine the upper and lower bounds of the 95 % confidence interval for the residuals. These bounds help us visualize the variability of the residuals around the mean.
Finally, we plot the original time series, the fitted values, and the 95 % confidence interval. We shade the area between the upper and lower bounds of the confidence interval to illustrate the variability and uncertainty around the model’s predictions.
Stationary analysis is a critical technique in time-series analysis. Understanding and identifying stationarity in your data provides a foundation for more advanced modeling and forecasting, as it allows you to make reliable assumptions about the behavior of the data over time.
By ensuring the statistical properties of a time series remain constant, you can more effectively interpret and predict future values, even in seemingly chaotic datasets. With this primer, you have a starting point for learning stationary analysis and how to apply it to your time-series data.
Ready to unlock the full potential of your time-series data? Try TimescaleDB, the PostgreSQL database optimized for scaling time-series data. To get started and uncover its capabilities today, sign up for a free trial (recommended) or simply install it if you prefer to self-host.