Written by Anber Arif
Predicting future values is fundamental when working with time-series data, which records observations over time. Autoregressive (AR) models are among the most foundational tools for this, using past data points to forecast future outcomes. These models are essential for analysts working with time-series data in areas like finance, economics, and forecasting, as they provide a first step towards more advanced predictive approaches.
Understanding autoregressive models is critical because they form the basis for more sophisticated techniques such as ARIMA (autoregressive integrated moving average), which adds differencing and moving-average terms. Mastering AR models gives analysts a strong foundation for real-world time-series problems.
In this article, we’ll explore what autoregressive models are, demonstrate how they work through examples, and discuss their limitations.
An autoregressive (AR) model is a statistical model used in time-series analysis that leverages past data points to predict future values. Specifically, it builds a linear function in which the future value of a time series is expressed as a linear combination of previous observations. By doing so, the model attempts to capture the dependencies between the current and past data points.
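In general notation, and as a sketch rather than a formula taken from a specific library, an AR model of order p can be written as below. The order p and the noise term ε(t) are symbols introduced here for illustration; the coefficients C_{i} follow the notation used in the rest of this article.

```latex
X(t) = C_{0} + \sum_{i=1}^{p} C_{i}\, X(t-i) + \varepsilon(t)
```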
Let’s consider a simple time series with 100 entries (starting at t=0 and ending at t=99). The goal of an AR model is to predict the value at the next time step, t=100, by using previous data points.
An autoregressive model makes this prediction from a fixed number of previous data points. For instance, if we use the three most recent entries (a maximum lag of 3), the model predicts the value at t=100 from the data points at t=97, t=98, and t=99.
Time (t) | Value (X(t)) |
97 | 12.5 |
98 | 13.1 |
99 | 13.8 |
100 (Predicted) | ? |
The autoregressive model for X(100) can be written as:
X(100) = C_{0} + C_{1}X(99) + C_{2}X(98) + C_{3}X(97)
where:
X(100) represents the value at time step t=100.
C_{0}, C_{1}, C_{2}, C_{3} are coefficients that the model will learn from the data.
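To make the arithmetic concrete, here is a minimal sketch that plugs the three table values into the formula. The coefficient values are hypothetical and chosen only for illustration; a fitted model would learn its own.

```python
# Hypothetical coefficients, chosen only to illustrate the arithmetic;
# a real model would learn them from the data.
c0, c1, c2, c3 = 0.5, 0.6, 0.3, 0.1

# Observed values from the table above.
x97, x98, x99 = 12.5, 13.1, 13.8

# X(100) = C0 + C1*X(99) + C2*X(98) + C3*X(97)
x100 = c0 + c1 * x99 + c2 * x98 + c3 * x97
print(f"Predicted X(100): {x100:.2f}")  # Predicted X(100): 13.96
```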
The coefficients are determined by a multiple linear regression over the historical data, in which each observed value X(t) is regressed on its three predecessors: X(t) = C_{0}+C_{1}X(t−1)+C_{2}X(t−2)+C_{3}X(t−3)
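The regression step can be sketched with plain NumPy. The helper name fit_ar, the synthetic series, and the "true" coefficients below are assumptions made for illustration; this is one simple way to set up the least-squares problem, not the method used by any particular library.

```python
import numpy as np

def fit_ar(series, order):
    """Fit X(t) = C0 + C1*X(t-1) + ... + Cp*X(t-p) by ordinary least squares."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    # Column k holds X(t-k) for every target X(t), t = order .. n-1.
    lagged = [series[order - k : n - k] for k in range(1, order + 1)]
    design = np.column_stack([np.ones(n - order)] + lagged)
    target = series[order:]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs  # [C0, C1, ..., Cp]

# Synthetic series generated from known (hypothetical) coefficients.
rng = np.random.default_rng(0)
true_coeffs = np.array([1.0, 0.5, -0.2, 0.1])
x = np.zeros(5000)
for t in range(3, len(x)):
    # x[t-3:t][::-1] is [X(t-1), X(t-2), X(t-3)]
    x[t] = true_coeffs[0] + true_coeffs[1:] @ x[t - 3 : t][::-1] + rng.normal()

estimated = fit_ar(x, order=3)
print(np.round(estimated, 2))
```

With enough data, the estimated coefficients should land close to the values used to generate the series.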
The autoregressive model is a common tool, and many statistical libraries in Python and R provide built-in functionality to implement AR models.
Python: The statsmodels library provides the AutoReg class, which simplifies fitting AR models. It allows for the specification of lag orders, the inclusion of exogenous variables, and other customizable options for accurate forecasting.
R: In R, the ar.ols function from the stats package helps fit AR models using ordinary least squares (OLS). It’s a robust tool for modeling time-series data and allows flexibility in model order selection.
Autoregressive models are not limited to predicting just the next data point; they can also project further into the future. For example, after predicting X(100), the model can use this predicted value, along with other past values, to forecast X(101) and continue this process for subsequent points. This technique is known as recursive forecasting.
To illustrate:
First, predict X(100) using X(99), X(98), and X(97)
Then, predict X(101) using X(100) (which was predicted), X(99), and X(98)
However, there’s a catch. Since the model uses predicted values for future predictions, compounding errors occur. The further into the future you predict, the more the errors accumulate, causing predictions to become less accurate over time. For example, the prediction for X(101) is based on X(100), which may already have some prediction error, leading to an amplified error for X(101).
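A minimal sketch of recursive forecasting is shown here. The function name recursive_forecast and the coefficient values are hypothetical; the three observations are the values at t=97, t=98, and t=99 from the earlier table.

```python
def recursive_forecast(history, coeffs, steps):
    """Forecast `steps` values ahead, feeding each prediction back in as an input.

    coeffs is [C0, C1, ..., Cp]; history must hold at least p observations,
    oldest first.
    """
    order = len(coeffs) - 1
    window = list(history[-order:])  # most recent p values, oldest first
    forecasts = []
    for _ in range(steps):
        # C1 multiplies the newest value, Cp the oldest value in the window.
        prediction = coeffs[0] + sum(
            c * x for c, x in zip(coeffs[1:], reversed(window))
        )
        forecasts.append(prediction)
        window = window[1:] + [prediction]  # slide the window forward
    return forecasts

# Hypothetical coefficients; the observations are X(97), X(98), X(99).
forecasts = recursive_forecast([12.5, 13.1, 13.8], [0.5, 0.6, 0.3, 0.1], steps=2)
print([round(f, 3) for f in forecasts])
```

The second forecast already depends on the first one, which is where prediction errors start to feed into later steps.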
Lag correlation refers to the correlation between a time series value at a specific time t, denoted as X(t), and a previous series value at a lagged time t-k, where k is the lag. For instance, X(t) and X(t-1) are lagged by 1, and their correlation is called lag 1 correlation.
The lag correlation can tell us how well past values of the time series can predict future values. High lag correlation (close to 1) suggests that an autoregressive model might be appropriate because there’s a significant relationship between the current and previous data points.
We can visualize lag correlation by plotting X(t) vs. X(t-1) for the time series. Here’s the code to generate such a plot and compute the correlation coefficient:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generating synthetic time-series data
np.random.seed(42)
time_series = np.cumsum(np.random.randn(100)) # Generating random walk
# Creating a DataFrame with X(t) and X(t-1)
df = pd.DataFrame({'X_t': time_series[1:], 'X_t-1': time_series[:-1]})
# Scatter plot X(t) vs X(t-1)
plt.scatter(df['X_t-1'], df['X_t'])
plt.title("Lag 1 Correlation: X(t) vs X(t-1)")
plt.xlabel("X(t-1)")
plt.ylabel("X(t)")
plt.grid(True)
plt.show()
# Calculating the correlation coefficient for Lag 1
lag_1_correlation = df['X_t'].corr(df['X_t-1'])
print(f"Lag 1 Correlation Coefficient: {lag_1_correlation:.4f}")
The chart shows a strong positive correlation between X(t) and X(t-1), with the data points clustering along a clear upward trend line. This high correlation coefficient of 0.9807 indicates that past values strongly predict future values, making an autoregressive (AR) model a good fit for this data. Similar to linear regression, where two correlated variables suggest predictive potential, this correlation shows that X(t) depends heavily on X(t-1), which justifies using an AR(1) model for forecasting.
While we initially focus on lag 1 (i.e., X(t) vs. X(t-1)), we can look at other lags too, such as X(t) vs. X(t-2) (lag 2), X(t) vs. X(t-3) (lag 3), and so on. Lag correlation helps identify the memory of the series and how many past data points are useful in predicting the future.
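The same calculation extends to any lag. The sketch below assumes a small helper, lag_correlation (a name introduced here), and regenerates the same kind of synthetic random walk used in the earlier example.

```python
import numpy as np

def lag_correlation(series, k):
    """Pearson correlation between X(t) and X(t-k), for k >= 1."""
    series = np.asarray(series, dtype=float)
    return np.corrcoef(series[k:], series[:-k])[0, 1]

# Same kind of synthetic random walk as in the example above.
np.random.seed(42)
time_series = np.cumsum(np.random.randn(100))

for k in range(1, 6):
    print(f"lag {k}: {lag_correlation(time_series, k):.4f}")
```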
A lag correlation chart shows the correlation coefficients at various lags. Significant spikes in the chart suggest which lags to consider in the model. Let’s plot the lag correlation chart and observe which lags have strong correlations.
from statsmodels.graphics.tsaplots import plot_acf
# Plotting autocorrelation (lag correlation)
fig, ax = plt.subplots(figsize=(10, 6))
plot_acf(time_series, lags=20, ax=ax)  # draw on our axes instead of a new figure
ax.set_title('Lag Correlation Chart')
ax.grid(True)
plt.show()
This lag correlation chart (also called an autocorrelation plot) shows how the time-series values correlate with their past values (lags). The first few lags, particularly lag 1, exhibit a strong positive correlation, with the bars extending above the confidence interval (shaded area). Lag 3 and lag 6 also show significant correlations. These high correlations at specific lags suggest that an autoregressive model including these lags would be appropriate for predicting future values in the time series.
The results from the lag correlation chart directly influence the structure of the autoregressive model. The above chart indicates strong correlations at lag 1, lag 3, and lag 6, suggesting that we should incorporate these lags into the autoregressive model.
An autoregressive model that includes up to six previous time points is called an order 6 AR model. The model can be written as:
X(t) = C_{0}+C_{1}X(t−1)+C_{2}X(t−2)+C_{3}X(t−3)+C_{4}X(t−4)+C_{5}X(t−5)+C_{6}X(t−6)
We expect the coefficients C1, C3, and C6 to be relatively large since lag 1, lag 3, and lag 6 were identified as having strong correlations. Coefficients corresponding to other lags may be smaller or close to zero.
Let’s fit this model to our synthetic time-series data and see what the coefficients look like:
from statsmodels.tsa.ar_model import AutoReg
# Training an AR model with order 6 (lags 1 through 6)
model = AutoReg(time_series, lags=6).fit()
# Printing the coefficients
print("AR Model Coefficients:")
print(model.params)
# Predicting the next value (t=100, the first step after the observed t=0..99)
next_value = model.predict(start=len(time_series), end=len(time_series))[0]
print(f"Predicted value for t=100: {next_value:.4f}")
As anticipated, C1 is large and positive, indicating a strong influence from the previous time point X(t-1). C3 is positive but small, suggesting a weaker impact from lag 3 than expected. Notably, C6 is negative, indicating that increases in X(t-6) are associated with decreases in X(t), which is contrary to our expectations based on the correlation analysis.
Let’s dive into some examples of autoregressive models and see how we can use past data to predict future values.
We will work through a temperature time-series forecasting example using autoregression models based on the daily minimum temperatures dataset. This dataset contains daily temperature observations recorded in Melbourne, Australia, from 1981 to 1990.
First, load the dataset using pandas and plot the time series to visualize the data.
from pandas import read_csv
from matplotlib import pyplot
# Loading dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv'
series = read_csv(url, header=0, index_col=0, parse_dates=True)
# Plotting time series
series.plot()
pyplot.show()
Next, we split the data into training and testing sets. We train on all but the last seven days and hold those seven days out for testing.
# Splitting data into train and test sets (last 7 days held out)
X = series.values.ravel()  # flatten the single-column DataFrame to a 1-D array
train, test = X[:len(X)-7], X[len(X)-7:]
Now, we apply an autoregression model to the training data. The AutoReg class from statsmodels allows us to specify the number of lags to consider. In this case, we use 29 lags, which means the model will predict the next day’s temperature based on the previous 29 days.
# Fitting autoregression model
from statsmodels.tsa.ar_model import AutoReg
# Training the autoregression model
model = AutoReg(train, lags=29)
model_fit = model.fit()
print("Coefficients:", model_fit.params)
Once the model is trained, we make predictions on the test data and evaluate performance using root mean squared error (RMSE).
from sklearn.metrics import mean_squared_error
from math import sqrt
# Making predictions
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# Plotting results
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
The plot shows a comparison between the predicted values (in red) and the actual values (in blue) from an autoregressive (AR) model.
The red line (predictions) follows the trend of the blue line (actual data), but there is some deviation.
Notable discrepancies occur around time steps 1 and 5, where the actual values spike higher than the predictions.
Let’s use autocorrelation to investigate the correlation between temperature data at different lags.
from pandas.plotting import autocorrelation_plot
# Autocorrelation plot
autocorrelation_plot(series)
pyplot.show()
The plot displays lag values on the x-axis and correlation coefficients on the y-axis (ranging from -1 to 1). The solid and dashed horizontal lines mark the 95% and 99% confidence bands, respectively. Correlations that fall outside these bands, in either direction, are statistically significant, indicating a real relationship between past and future values. Values inside the bands are not distinguishable from zero at that confidence level. This visualization helps determine the most relevant lag values by focusing on correlations beyond the confidence thresholds.
The autocorrelation plot also reveals distinct seasonal patterns, especially at key lags:
Six-month seasonality: A significant negative correlation is seen at a lag of around 180 days. Temperatures six months apart tend to move in opposite directions (summer versus winter), which is the half-period signature of an annual cycle rather than a separate semi-annual one.
Twelve-month seasonality: A stronger correlation appears at approximately 365 days, showing a clear annual cycle. This indicates a strong correlation between temperatures measured one year apart, affirming the yearly periodicity in the data.
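To see how an annual cycle shapes the autocorrelation at these two lags, here is a small check on a synthetic sine wave with noise. It is a stylized stand-in, not the Melbourne data; the period, noise level, and seed are assumptions. On such a series the correlation is strongly negative near half a period (lag 182) and strongly positive near a full period (lag 365).

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(10 * 365)
# Stylized stand-in for daily temperatures: an annual sine wave plus noise.
synthetic = np.sin(2 * np.pi * days / 365) + rng.normal(scale=0.5, size=days.size)

def lag_correlation(series, k):
    """Pearson correlation between X(t) and X(t-k), for k >= 1."""
    return np.corrcoef(series[k:], series[:-k])[0, 1]

half_year = lag_correlation(synthetic, 182)
full_year = lag_correlation(synthetic, 365)
print(f"lag 182: {half_year:.2f}, lag 365: {full_year:.2f}")
```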
The Google Stock dataset contains 105 data points representing the closing stock price of Google shares from February 7, 2005, to July 7, 2005. We will use this data to determine the appropriate order for an autoregressive (AR) model. Below is a plot of the stock prices over time to analyze trends and behaviors in the dataset visually.
Consecutive values in the dataset appear to follow one another fairly closely, indicating that an autoregressive model may be well-suited for this data. To confirm this, we can examine a plot of partial autocorrelations to help identify significant lag values and determine the optimal order for the model.
The plot reveals a significant spike at lag 1, with diminishing spikes at higher lags. This pattern suggests that an AR(1) model could effectively model the data.
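Because the stock data itself is not reproduced here, the sketch below computes partial autocorrelations on a synthetic AR(1) series instead. It uses one regression-based definition: the lag-k partial autocorrelation is taken as the last coefficient of an AR(k) least-squares fit. The helper name pacf_ols and the coefficient 0.7 are assumptions; statsmodels also offers pacf and plot_pacf for the same purpose.

```python
import numpy as np

def pacf_ols(series, max_lag):
    """Partial autocorrelations for lags 1..max_lag.

    The lag-k value is taken as the last coefficient of an AR(k) model
    fitted by ordinary least squares.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    values = []
    for k in range(1, max_lag + 1):
        lagged = [series[k - j : n - j] for j in range(1, k + 1)]
        design = np.column_stack([np.ones(n - k)] + lagged)
        coeffs, *_ = np.linalg.lstsq(design, series[k:], rcond=None)
        values.append(coeffs[-1])  # coefficient on X(t-k)
    return np.array(values)

# Synthetic AR(1) series with a hypothetical coefficient of 0.7.
rng = np.random.default_rng(2)
x = np.zeros(3000)
for t in range(1, len(x)):
    x[t] = 0.7 * x[t - 1] + rng.normal()

partial = pacf_ols(x, max_lag=5)
print(np.round(partial, 2))
```

For an AR(1) process, only the first value should stand out, which mirrors the single spike at lag 1 described above.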
To help determine significant values in the autocorrelation plot, approximate bounds can also be constructed (as given by the red lines in the plot above). These significance bounds are typically represented as ±z_{1−α/2}/√n, where n is the number of observations. When values fall outside these bounds, they suggest the presence of an autoregressive process. This method allows us to visually assess whether certain lags show significant correlation, indicating potential autoregressive behavior in the data.
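As a quick worked example, the bounds for the 105 observations in this dataset can be computed with the standard library alone. The 5% significance level is an assumed, conventional choice rather than something stated for the original plot.

```python
from math import sqrt
from statistics import NormalDist

n = 105        # number of observations in the stock series described above
alpha = 0.05   # significance level (an assumed, conventional choice)

z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, about 1.96
bound = z / sqrt(n)
print(f"Approximate bounds: +/-{bound:.3f}")  # Approximate bounds: +/-0.191
```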
Next, we create a lag-1 price variable and examine the scatter plot of the current price against this lag-1 variable.
The scatter plot reveals a moderate linear relationship, indicating that the first-order autoregression model:
X(t) = C_{0} + C_{1}X(t−1)
could effectively capture the dynamics of the stock price.
Autoregressive models are widely used in time-series analysis but have certain limitations. Let’s explore the conceptual and computational limitations that can affect their accuracy and applicability.
Only incorporates data from within the system: Autoregression models rely solely on the historical data within the system to make predictions. This limitation means that the model only considers the past values of the time series data to forecast future values. In the context of predicting stock prices, this limitation is particularly significant. Stock prices are influenced by a multitude of factors, including economic indicators, geopolitical events, company performance, and market sentiment, among others.
However, an autoregressive model only uses past stock prices to make predictions, ignoring these external factors. This limitation simplifies the model-building process but compromises its accuracy. For instance, a sudden change in interest rates or a global pandemic can significantly impact stock prices, but an autoregressive model would not be able to capture these effects.
Assumes the future can be known from the past: Autoregression models are based on the assumption that the patterns and trends observed in the past will continue into the future. This assumption is rooted in the concept of temporal dependence, which suggests that the future behavior of a time series can be predicted based on its past behavior.
However, this assumption can be seen as a philosophical limitation. In reality, the future is inherently uncertain, and many events are unpredictable. The assumption that the future can be known from the past oversimplifies the complexity of real-world systems.
In practice, this limitation can be more or less applicable depending on the specific scenario. For instance, in certain domains like weather forecasting or traffic prediction, the assumption of temporal dependence may hold relatively well. However, in domains like finance or economics, where human behavior and external factors play a significant role, this assumption can be more uncertain.
It's essential to consider whether this philosophical assumption applies to your specific data and problem domain. If the data is subject to sudden changes or external influences, an autoregressive model may not be the most suitable choice.
Autoregression models are prone to compounding error, which refers to the accumulation of errors over time. When an autoregressive model is used to make predictions, it relies on its previous predictions to make subsequent ones. This means that any errors in the initial predictions will be propagated and amplified over time, leading to a rapid decline in accuracy.
In particular, when an autoregressive model is pushed into the future beyond its order, all predictions are based on synthetic data. This means that the model is essentially generating new data based on its own predictions rather than relying on actual historical data. As a result, the accuracy of the predictions rapidly deteriorates.
For example, consider an autoregressive model of order 5, which uses the past five values to make each prediction. If we forecast 10 steps into the future, only the first prediction is based entirely on actual historical data. Predictions two through five mix observed values with earlier predictions, and from the sixth step onward every input is itself a prediction.
This limitation makes autoregressive models more suitable for short-range prediction tasks, where the goal is to predict the next few values in a time series. For longer-range predictions, alternative models that can incorporate external information or account for uncertainty in the data may be more suitable.
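The growth of forecast uncertainty can be quantified in the simplest case. For an AR(1) model X(t) = C0 + phi·X(t-1) + noise with known coefficients, the variance of the h-step-ahead forecast error is sigma2·(1 + phi² + ... + phi^(2(h-1))). The sketch below evaluates that sum; the values of phi and sigma2 are hypothetical, and estimation error in the coefficients is ignored.

```python
def forecast_error_variance(phi, sigma2, horizon):
    """Variance of the h-step-ahead forecast error of an AR(1) model
    X(t) = C0 + phi * X(t-1) + noise, assuming phi and sigma2 are known."""
    return sigma2 * sum(phi ** (2 * j) for j in range(horizon))

phi, sigma2 = 0.9, 1.0  # hypothetical coefficient and noise variance
for h in (1, 2, 5, 10):
    print(f"h={h:2d}: variance = {forecast_error_variance(phi, sigma2, h):.3f}")
```

The variance rises with every additional step and, for a stationary series (|phi| < 1), levels off at sigma2 / (1 - phi²), the overall variance of the series. At that point the forecast carries no more information than the long-run mean.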
Autoregressive models can also suffer from overfitting issues, particularly when you use high-order models. Overfitting occurs when a model is too complex and fits the noise in the training data rather than the underlying patterns. As a result, the model performs well on the training data but poorly on new, unseen data.
In the context of autoregression, overfitting can occur when a high-order model is used to capture complex patterns in the data. While adding more lags to the model to capture more complex patterns may seem intuitive, this can lead to overfitting. Higher-order models are more complex to fit, and the risk of overfitting to the existing data increases.
For example, consider an autoregressive model of order 10, which uses the past 10 values to make predictions. While this model may capture complex patterns in the data, it may also fit the noise in the training data, leading to poor performance on new data.
To avoid overfitting, it's essential to balance the model order in autoregression models carefully. This balance involves selecting an order high enough to capture the data's underlying patterns but low enough to prevent overfitting, ensuring the model generalizes well to new data.
Some strategies for avoiding overfitting in autoregression models include:
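Cross-validation: this involves splitting the data into training and testing sets and evaluating the model's performance on the testing set. For time series, the split should preserve temporal order, so the model is always tested on observations that come after its training data.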
Regularization: this involves adding a penalty term to the model's objective function to discourage overfitting.
Model selection: this involves selecting a model order that balances the trade-off between bias and variance.
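As one possible illustration of model selection, the sketch below scores candidate orders with the Akaike information criterion (AIC), using a common least-squares form: n·log(RSS/n) + 2·(number of parameters). The helper name ar_aic, the synthetic AR(2) series, and the maximum order of 8 are all assumptions made for this example.

```python
import numpy as np

def ar_aic(series, order, max_order):
    """AIC of an OLS-fitted AR(order) model.

    The first max_order points are held back for every candidate so that
    all orders are scored on the same targets.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    target = series[max_order:]
    lagged = [series[max_order - k : n - k] for k in range(1, order + 1)]
    design = np.column_stack([np.ones(len(target))] + lagged)
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    rss = np.sum((target - design @ coeffs) ** 2)
    n_params = order + 1  # lag coefficients plus the intercept
    return len(target) * np.log(rss / len(target)) + 2 * n_params

# Synthetic AR(2) series with hypothetical coefficients 0.6 and -0.5.
rng = np.random.default_rng(3)
x = np.zeros(3000)
for t in range(2, len(x)):
    x[t] = 0.6 * x[t - 1] - 0.5 * x[t - 2] + rng.normal()

max_order = 8
scores = {p: ar_aic(x, p, max_order) for p in range(1, max_order + 1)}
best_order = min(scores, key=scores.get)
print("Order with lowest AIC:", best_order)
```

The penalty term grows with the number of lags, so extra lags are kept only when they reduce the residual error enough to justify the added complexity.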
Autoregressive models are powerful tools for making predictions based on historical data. They effectively capture seasonal patterns and consistent behaviors within a time series. However, they have notable limitations: they fail to account for external influences and cannot model non-linear relationships in the data.