Start supercharging your PostgreSQL today.
Written by Junaid Ahmed
Understanding time-series dynamics is increasingly crucial for data-driven decisions across various industries. Thinking about data over time makes companies agile and competitive, as they can derive insights from a particular point in time to delve into what happened and, more importantly, why it happened. With time-series analysis and forecasting, companies can improve operational performance and gain significant competitive advantages.
From a developer’s perspective, you can use a wide range of tools and programming languages to effectively perform time-series analysis and forecasting. Each offers advantages and features, but Python and TimescaleDB play a leading role in my toolbelt for time-series analysis and forecasting.
Python has been widely used due to its simplicity and versatility. The framework comes with extensive libraries for data manipulation and modeling, such as pandas, NumPy, and scikit-learn. It also offers forecasting libraries—like Statsmodels and Prophet—that are suited to time-series forecasting, making advanced statistical modeling and prediction approaches easier.
On the other side, we have TimescaleDB, a database developed to handle this type of data efficiently. Its innovative architecture eases the process of data gathering, managing, and analyzing time-series data.
Jointly, Python and TimescaleDB enable organizations to unearth hidden value within time-series data and make predictions. During this tutorial, we'll explore different phases of time-series analysis, from data pre-processing to model assessment. We will also use state-of-the-art hyperfunctions, which are excellent at mining meaning from timestamped data, to make accurate forecasts.
You can find all of the Python and TimescaleDB scripts available on GitHub.
In this tutorial, you will learn how to analyze and forecast stock market trends using historical stock data over five years. The primary objective is to make a model that accomplishes two goals at a time: it should be predictive and use data from the past as a reference. We will explore trends to understand where the stock market may go next, which will be done by:
Predict short-term movements: Some traders use this information to decide if stock prices will increase or decrease in the short run and make decisions accordingly.
Understand volatility: Visualizing the data and identifying volatility can help illustrate risk management functions.
Identify long-term trends: Using forecasting algorithms, we can predict 10-year trends to build adaptive investment strategies that will offer better chances of growth in the long term.
To achieve these goals, we will employ various data analysis techniques and machine learning algorithms, including:
Data indigestion: We will use TimescaleDB for data storage and retrieval because it is tailored to handle time-series data.
Exploratory data analysis: We will use Python libraries such as Matplotlib, Seaborn, and Plotly to identify trends in the data.
Machine Learning models: We will explore various models, such as ARIMA, LSTM NNs, and regression analysis, to determine the most precise one.
Feature engineering: Using the hyperfunctions from the ML models and TimescaleDB, we will extract the most important feature that can help predict stock prices.
Model evaluation: The forecasting model's performance will be measured using the evaluation metric. These metrics include RMSE (Root Mean Square Error), MAE (Mean Absolute Error), and R-square.
The dataset, meticulously curated to aid in this analysis, includes the following attributes for each trading day:
Date: The trading day, formatted as yy-mm-dd, providing a temporal anchor for each data point.
Open: The price of the stock at the market opens, offering insight into the starting valuation for the day.
High: The highest price reached during the trading day, indicating the peak market valuation.
Low: The lowest price reached during the trading day, highlighting the day's trough in market valuation.
Close: The closing price of the stock, which is critical for understanding the final market valuation for the day.
Volume: The total number of shares traded during the day, reflecting market activity and liquidity.
Name: The stock's ticker symbol, which serves as a unique identifier for each company.
If you feel you need to take a step back, check our Guide to Time-Series Analysis in Python.
Before ingesting data, ensure your TimescaleDB instance is up and running. You can install TimescaleDB on various operating systems and cloud platforms. Refer to these official TimescaleDB documentation for installation instructions.
Once TimescaleDB is installed, create your first instance and connect to your PostgreSQL database. Create a database and then a table that matches the schema of your CSV file. For time-series data, you should convert this table into a hypertable using the TimescaleDB function create_hypertable (). I have already created the hypertable. You can use the following code to create a hypertable for the stock data you have:
CREATE TABLE stock_data (
date DATE,
open FLOAT,
high FLOAT,
low FLOAT,
close FLOAT,
volume BIGINT,
name TEXT
);
SELECT create_hypertable('stock_data', 'date');
Once we have created the hypertable, we can load the data from the CVS file to the hypertable using the copy command.
Install these libraries if you haven’t installed them yet using the following code chunk:
pip install psycopg2
pip install pandas
pip install matplotlib
pip install plotly
Once we have installed the required libraries, we can establish the connection between our database and notebook and import the data. Now, you can establish the connection using the following code:
con = psycopg2.connect(
host='your_host',
port='your_port',
database='your_database',
user='your_username',
password='your_password'
)
Replace the database credentials, user name, and password to connect. The next step is to create a cursor object that will help you interact with the database and execute the queries.
# Create a cursor
cur = conn.cursor()
Define the SQL query, adjusting for the 'stock_data' table and the desired columns:
query = sql.SQL(SELECT * FROM stock_data LIMIT 10)
# Execute the query
cursor.execute(query)
After that, you can fetch all the rows using the following command:
# Fetch the results
records = cur.fetchall()
It's a good practice to close the cursor and connection when you're done:
# Close the cursor and connection
cur.close()
conn.close()
Print the records if you want to see the output directly in your Python script:
for record in records:
print(record)
Another way to get the most recent stock data is to use the Yahoo finance package. To extract the data and display it in a suitable format, we have used the following code:
import yfinance as yf
import pandas as pd
from datetime import datetime, timedelta
# Define the list of stock tickers and the time period
tickers = [
'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'FB', 'NVDA', 'NFLX', 'INTC', 'AMD',
'IBM', 'ORCL', 'CSCO', 'ADBE', 'PYPL', 'QCOM', 'TXN', 'AVGO', 'CRM', 'SAP'
]
start_date = '2019-05-17'
end_date = '2024-05-17' # Use today's date
The code chunk defines the Yahoo finance module for data retrieval and pandas for data manipulation. Then, we define a list of stock tickers that include the various top-performing stocks in the world, like Apple and Tesla. As we are looking for data from the last five years, we define the starting and ending dates.
# Function to download and format data for a single stock
def download_and_format_stock_data(ticker, start_date, end_date, retries=3):
for i in range(retries):
try:
stock_data = yf.download(ticker, start=start_date, end=end_date)
if not stock_data.empty:
stock_data.reset_index(inplace=True)
stock_data['Name'] = ticker
stock_data = stock_data[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Name']]
stock_data.columns = ['date', 'open', 'high', 'low', 'close', 'volume', 'name']
return stock_data
except Exception as e:
print(f"Attempt {i+1} failed for {ticker}: {e}")
return pd.DataFrame()
# Download and format data for all tickers
all_stocks_data = pd.concat([download_and_format_stock_data(ticker, start_date, end_date) for ticker in tickers])
# Display the first few rows of the combined data
print(all_stocks_data.head())
# Save to CSV
file_path = 'formatted_stock_data2.csv'
all_stocks_data.to_csv(file_path, index=False)
print(f"Data saved to {file_path}")
It downloads and reformates the data before formatting it into a DataFrame with the columns date, open, high, low, close, volume, and stock name. The method then tries to download the data up to three times before proceeding.
Next, it goes down the for loop over a list of stock tickers, downloads and converts them into a DataFrame for each ticker, concatenates them into a single one, prints a few first rows, and saves them as CSV.
The output of the code is shown below:
Now, let’s analyze the data to uncover hidden patterns.
Through EDA, we will move down into the depths of historical stock price movements to appreciate volatility, trends, seasonality, and anomalies that might not be visible at a glance. The plots given below showcase two different aspects of the stock market based on the dataset:
Highest Average Trading Volume: This is the first plot that shows the traded volume over time for the company with the highest turnover volume in the dataset. The chart shows some very busy days, which is a sign of increased investor activity or maybe cases when the stock was particularly moved by the market.
Highest Average Volatility: The second plot displays the volatility (using a 30-day rolling standard deviation of close prices as a proxy) for the company with the highest average volatility. This company experiences more pronounced price fluctuations, suggesting higher risk and potentially higher returns for investors.
Let’s create an interactive plot using Plotly to show the trading in a specific time zone, with a zoom-in and zoom-out option.
Import the necessary libraries for the plot.
import plotly.graph_objects as go
import pandas as pd
Filter the DataFrame for the year 2024 and the company with the highest average trading volume:
max_volume_df_2024 = max_volume_df[(max_volume_df['date'] >= pd.Timestamp('2024-01-01')) & (max_volume_df['date'] <= pd.Timestamp('2024-5-16'))]
Create an interactive plot using Plotly:
fig = go.Figure()
Add a bar chart for trading volume:
fig.add_trace(go.Bar(x=max_volume_df_2024['date'], y=max_volume_df_2024['volume'], marker_color='purple'))
Update the layout for a better visual appearance:
fig.update_layout(
title=f'Trading Volume for {max_volume_company["name"]} in 2024',
xaxis_title='Date',
yaxis_title='Volume',
template='plotly_white',
xaxis_rangeslider_visible=True # Enable the range slider for zooming
)
Now show the figure using the figure function:
fig.show()
The output of the plot shows the trading volume for the company TSLA in the year 2024 with various time zones:
The interactive plot illustrates significant fluctuations in trading volume throughout the year. These fluctuations can point to investor reactions to market events, earnings announcements, or other news directly impacting the company. Peaks in trading volume highlight periods of heightened market interest or response to external factors.
Let's delve into some basic statistical analysis using Python and TimescaleDB's powerful capabilities. Our objective is to conduct statistical computations on our dataset. This could involve calculating standard deviation, mean, or any other statistical functions pertinent to our analysis.
Let’s start by doing some descriptive analysis of our dataset. First, let’s find the summary of the dataset in Python.
# Print summary statistics
print(data.describe())
The output looks like this:
The stock data summary indicates 23,921 data points with an average open price of $190.49 and a mean volume of 31.1 million shares. The standard deviation for prices and volume is high, reflecting significant variability. Prices range from a minimum of $12.07 to a maximum of $1426.00.
The median open price is $138.68, and the 75th percentile is $229.34. The highest trading volume recorded is 914 million shares. This summary provides insights into the stocks' overall price range and trading activity over the specified period.
# Get the row count
row_count = len(data)
# Print formatted output
print("approximate_row_count")
print("----------------------")
print(row_count)
Timescale’s hyperfunctions help you efficiently analyze time-series data using fewer lines of code. For example, by utilizing TimescaleDB’s approximate_row_count() function, we can calculate the total row count for the dataset. The code to get the row count is:
ANALYZE stock_data;
SELECT * FROM approximate_row_count('stock_data');
The expected output is:
Now let’s explore comprehensive statistical analyses of the dataset.
Here is the code that we have used to calculate the statistics for the open column:
import pandas as pd
import numpy as np
# Assuming stock_data is already defined and contains the dataset
# Convert 'date' column to datetime if it's not already in datetime format
data['date'] = pd.to_datetime(data['date'])
# Set 'date' as the index
data.set_index('date', inplace=True)
# Group data by daily intervals and calculate statistics for the 'open' price
daily_summary = data['open'].resample('D').agg(
stats1D=lambda x: (np.mean(x), np.std(x), x.skew()) if not x.empty else np.nan
)
# Remove rows with NaN values (if any)
daily_summary = daily_summary.dropna()
# Print the result
print(daily_summary)
Using TimescaleDB's stats_agg() hyperfunction, we effectively generate statistical metrics such as mean, standard deviation, and more for a single variable over defined periods, allowing for a meaningful study of time-series patterns. Here is the sample code that you can use:
WITH t AS (
SELECT
time_bucket('1 day'::interval, "date") AS dt,
stats_agg(open) AS stats1D
FROM stock_data
WHERE Name = 'AAL'
GROUP BY time_bucket('1 day'::interval, "date")
)
SELECT
average(stats1D) AS avg_open,
stddev(stats1D) AS stddev_open,
skewness(stats1D) AS skewness_open
FROM t;
Output
The output of the above calculations from the Python and TimescaleDB is expected as:
These values can be used as input to the forecasting models. For example, a trading strategy might involve buying when the mean is low relative to historical data and volatility is decreasing.
Now, let’s find out the correlation between the top two performing stocks in the previous years.
After executing the code, we found a correlation between the two best-performing stocks ADBE and AVGO. The correlation results for Adobe (ADBE) and Broadcom (AVGO) from February 2023 to May 2024 show extremely high positive correlations, consistently above 0.999. This indicates that the price movements of these two stocks are highly synchronized.
Such strong correlations suggest that both stocks react similarly to market conditions, which can help forecast their future prices and manage risk in a portfolio. Year-on-year comparisons reveal slight increases in correlation, indicating growing synchronization over time. This consistently high correlation aids investors in making informed decisions based on the predictable relationship between these stocks.
Now that we have uncovered the trends through visualizations and statistical analysis, let’s build and make forecasts about the stock prices.
Forecasting stock prices involves predicting future values based on historical data, market trends, and other influencing factors. Suitable forecasting models for stock prices often include time series analysis and machine learning algorithms. Here are a few popular models:
ARIMA (Autoregressive Integrated Moving Average): A classical time-series forecasting model that captures different aspects of the data like trend and seasonality through its components.
LSTM (Long Short-Term Memory Networks): A type of recurrent neural network (RNN) capable of learning order dependence in sequence prediction problems, suitable for the volatile nature of stock prices.
Prophet: Developed by Facebook, Prophet is designed for forecasting with daily observations that display patterns on different time scales. It's robust to missing data and shifts in the trend and typically handles outliers well. Check this tutorial on time-series forecasting with Prophet to learn more.
XGBoost: Although not a traditional forecasting model, XGBoost can be used for time series prediction by creating lag features and using its powerful gradient-boosting framework to predict future stock prices.
Here, we'll focus on building a simple ARIMA model for stock price forecasting. The process involves data preparation, model fitting, and parameter tuning. From the correlation results and volatility analysis, we have selected the AVGO stock for building this model as it shows the best-performing and most volatile stock in the last year.
The following code presents a typical workflow for time-series analysis, including data preprocessing, stationarity testing, and ARIMA model fitting:
# Convert the date column to datetime and set it as the index
data['date'] = pd.to_datetime(data['date'])
data.set_index('date', inplace=True)
# Filter for a specific stock,
avgo_df = data[data['name'] == 'AVGO']['close'].asfreq('B')
# Check for stationarity
result = adfuller(avgo_df.dropna())
print('ADF Statistic:', result[0])
print('p-value:', result[1])
# If the series is not stationary, differencing might be necessary
# Fit the ARIMA model
model = ARIMA(avgo_df, order=(5,1,0)) # Example order, needs optimization
model_fit = model.fit()
# Summary of the model
print(model_fit.summary())
The output of the code snippet is as follows:
Given the ADF test results and the p-value, the time-series data appears non-stationary, necessitating additional preprocessing processes, such as differencing before fitting an ARIMA model. Furthermore, the diagnostic tests shed light on the quality of the ARIMA model fit, as well as possible issues such as autocorrelation and heteroskedasticity.
Let’s do some differencing on the ARIMA model:
# Perform differencing
differenced_series = avgo_df.diff().dropna() # Compute the difference and remove NaN values
# Check for stationarity after differencing
result_diff = adfuller(differenced_series)
print('ADF Statistic after differencing:', result_diff[0])
print('p-value after differencing:', result_diff[1])
The output of the code snippet shows:
The ADF test statistic after differencing is significantly negative, and the p-value is zero, indicating strong evidence against the null hypothesis of non-stationarity. This suggests that the differenced series is stationary, which is suitable for modeling with an ARIMA model. Now, we can proceed to model fitting and forecasting.
The summary of the mode after fitting the new results is as follows:
The coefficients and diagnostic tests indicate that the fitted model adequately captures the dynamics of the time series.
Standardized residuals: The standardized residuals represent the difference between the forecasted and actual stock prices, standardized by their estimated standard deviation. These residuals help assess the goodness of fit of the model, with smaller residuals indicating a better fit.
Histogram and estimated density: The histogram provides a visual representation of the distribution of errors. In contrast, the estimated density plot (Kernel Density Estimation - KDE) smoothes out the histogram to estimate the underlying probability density function. The notation N(0,1) indicates a standard normal distribution, where the mean is zero and the variance is one. A bell-shaped curve centered at zero suggests that the errors are normally distributed.
Normal Q-Q plot: The Normal Q-Q plot compares the standardized residuals against theoretical quantiles of a standard normal distribution. If the points on the plot fall along a straight line, it suggests that the residuals are normally distributed. Deviations from the straight line indicate departures from normality, such as skewness or heavy-tailed distributions.
Correlogram: The correlogram, also known as the autocorrelation plot, displays the autocorrelation coefficients of the residuals at different lags. Significant autocorrelation at certain lags indicates a pattern or structure in the residuals that the model has not captured. This can help identify any remaining temporal dependencies in the data that the model should account for.
Evaluating the performance of forecasting models is crucial to understand their accuracy and reliability. Let’s evaluate the performance of the models using two metrics: mean absolute error and root mean square error. Mean Absolute Error: Measures the average magnitude of the errors in a set of predictions without considering their direction. The code for this metric is:
from sklearn.metrics import mean_absolute_error
# Calculate MAE
mae_diff = mean_absolute_error(differenced_series[1:], train_predictions_diff)
print("MAE:", mae_diff)
Root Mean Square Error: Measures the square root of the average of the squares of the errors. It gives a higher weight to larger errors. The code for the metric is:
from sklearn.metrics import mean_squared_error
# Make predictions on the training data
train_predictions_diff = model_fit_diff.predict(start=differenced_series.index[1], end=differenced_series.index[-1])
# Align the indices of the differenced series and predictions
train_predictions_diff = train_predictions_diff[differenced_series.index[1]:]
# Calculate RMSE
rmse_diff = np.sqrt(mean_squared_error(differenced_series[1:], train_predictions_diff))
print("RMSE:", rmse_diff)
The result of these metrics are:
Metric | Value |
RMSE (Root Mean Squared Error) | 14.122 |
MAE (Mean Absolute Error) | 9.2202 |
Lower values for RMSE and MAE indicate better performance, suggesting that the ARIMA model's forecasts are relatively close to the actual values.
Another way to analyze the stock data is to use the TimescaleDB finance database to get real-time data about open, close, high, and low. Here’s how you can retrieve that data from TimescaleDB in Python:
import psycopg2
import pandas as pd
# Database credentials
db_url = "postgres://tsdbadmin:[email protected]:33159/tsdb?sslmode=require"
# Connect to the database
conn = psycopg2.connect(db_url)
# Create a cursor
cur = conn.cursor()
# Define the query to fetch data for the past 3 months
query = """
SELECT bucket, open, high, low, close
FROM one_min_candle
WHERE symbol = 'BTC/USD'
AND bucket >= NOW() - INTERVAL '1 months'
ORDER BY bucket DESC;
"""
# Execute the query
cur.execute(query)
# Fetch the results
rows = cur.fetchall()
# Convert the results to a pandas DataFrame
stock_data = pd.DataFrame(rows, columns=['bucket', 'open', 'high', 'low', 'close'])
# Print the DataFrame
print(stock_data)
# Optionally, save the DataFrame to a CSV file
stock_data.to_csv('stock_data.csv', index=False)
# Close the cursor and connection
cur.close()
conn.close()
Here is the output that we got by running this code:
Using this data, we can analyze the data using TimescaleDB’s continuous aggregates and Python time-series models.
In today’s data-driven world, the importance of time-series analysis cannot be underestimated.
TimescaleDB, with its specialized time-series functionalities and hyperfunctions, simplifies the data preparation process by developing fast data storage and recovery without any preprocessing directly into a database.
Through the stock price data example, we have explored the TimescaleDB statistical aggregate functions, using Python to explore trends and model building using ARIMA. I highly recommend delving deeper into the field of time-series analysis by exploring advanced techniques. These include deep learning models, anomaly detection algorithms, and multivariate analysis to enhance the accuracy of the dataset.
Additionally, with the help of a myriad of resources, from the TimescaleDB documentation to Python time-series analysis libraries and interactive platforms, you can work on the edge of time-series analysis. This will enable innovative breakthroughs and aid decision-making processes.