Hacking NFL data with PostgreSQL, TimescaleDB, and SQL
Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics – and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience - all with PostgreSQL, standard SQL, and freely available extensions.
Time-series data is everywhere, including, much to our surprise, the world of professional sports. At Timescale, we're always looking for fun ways to showcase the expanding reach of time-series data. Stock, cryptocurrency, IoT, and infrastructure metrics data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for #timeseries or #TimescaleDB, and you're sure to find questions about high-frequency trading or massive scale observability data with tools like Prometheus.
You can imagine our excitement, then, when we happened upon the NFL Big Data Bowl, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models.
Did the NFL really give access to 18+ million rows of detailed play data from every regular season NFL game?
For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials).
Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.
So, at first glance, football game metrics might not immediately jump out as anything special.
But then the NFL did something pretty ambitious and amazing.
All NFL players are equipped with RFID chips that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize. The NFL started tracking and storing data for every player on the field, for every play, for every game.
As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.
This isn’t ordinary data. This is time-series data. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.
Time-series data comes at you fast, sometimes generating millions of data points per second (read more about time-series data). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a multi-node, petabyte-scale, completely free relational database for time-series.
We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season.
Read on for more information about the NFL’s dataset and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.
If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days. The instructions later in this post will take you through how to ingest the data and start using it for analysis.
If you’re new to time-series data or just have some questions you’d like to ask about the dataset, join our public Slack community, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.
The NFL dataset
Over the last few years, the NFL and Kaggle have collaborated on the NFL Big Data Bowl. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.
Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about last year’s winners.)
(Keep watching the NFL website for more information on the next Big Data Bowl.)
Accessing the data
For the purposes of this blog post and accompanying tutorial, we will use the sample data provided by the NFL. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. Contest participants in the next season of the contest will have access to new weekly game data.
This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, other data science specific languages (like Python or R), or additional toolsets.
If you want to follow along - or recreate! - the queries we go through below, follow our tutorial to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any Timescale Cloud instance.)
Let's start exploring!
We've provided the steps needed to ingest the dataset into TimescaleDB in the accompanying tutorial, so we won’t go into that here.
The NFL dataset includes the following data:
- Games: all relevant data about each game of the regular season, including date, teams, time, and location
- Players: information on each player, including what team they play for and their originating college
- Plays: a wealth of data about each pass play in the game. Helpful fields include the down, description of the play that happened, line of scrimmage, and total offensive yardage, among other details.
- Week [1-17]: for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of a route that was taken. (In our tutorial, this data is imported into the
trackingtable and totals almost 20 million rows of time-series data.)
In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see our TimescaleDB vs. InfluxDB comparison for reference).
Because TimescaleDB is PostgreSQL with time-series super powers, it supports JOINS, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we’re able to combine the NFL’s play-by-play data along with weather data for each stadium.
Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!
The power of SQL
Year after year, we see SQL listed as one of the most popular languages among developers on the StackOverflow survey. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.
SQL, it often feels, is only useful for getting and storing data in applications and that we need to leave the "heavy lifting" of analysis to more mature tools.
Not so! SQL can data munge with the best of them! Let's look at a first, quick example.
Average yards per position, per game
For this first example, we'll query the
tracking table (the player movement data from all 17 weeks of games) and join to the
game table to determine the number of yards per player position, per game.
The results give you a quick overview of how many yards different positions ran throughout each game. You could use this later to compare specific players to see how they compared, more or less yards, to that total.
WITH total_position_yards AS ( SELECT sum(dis) position_yards, POSITION, gameid FROM tracking t GROUP BY POSITION, gameid) SELECT avg(position_yards), position, game_date FROM game g INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid WHERE POSITION IN ('QB','RB','WR','TE') GROUP BY game_date, POSITION;
Number of plays by offensive player
As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those that have been sitting on the sideline for most of the season. Players with more playing time are often able to contribute to the outcome of the game.
This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.
WITH snap_events AS ( -- Create a table that filters the play events to show only snap plays -- and display the players team information SELECT DISTINCT player_id, t.event, t.gameid, t.playid, CASE WHEN t.team = 'away' THEN g.visitor_team WHEN t.team = 'home' THEN g.home_team ELSE NULL END AS team_name FROM tracking t LEFT JOIN game g ON t.gameid = g.game_id WHERE t.event IN ('snap_direct','ball_snap') ) -- Count these events & filter results to only display data when the player was -- on the offensive SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name FROM snap_events a LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid LEFT JOIN player pl ON a.player_id = pl.player_id WHERE a.team_name = p.possessionteam GROUP BY a.player_id, pl.display_name, a.team_name ORDER BY play_count DESC;
If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others.
Quarterbacks, however, are the most “important” players on the field, and tend to play more than others. However, by omitting quarterbacks, we can get a deeper insight into players across all other positions.
WITH snap_events AS ( -- Create a table that filters the play events to show only snap plays -- and display the players team information SELECT DISTINCT player_id, t.event, t.gameid, t.playid, CASE WHEN t.team = 'away' THEN g.visitor_team WHEN t.team = 'home' THEN g.home_team ELSE NULL END AS team_name FROM tracking t LEFT JOIN game g ON t.gameid = g.game_id WHERE t.event IN ('snap_direct','ball_snap') ) -- Count these events & filter results to only display data when the player was -- on the offensive SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name, pl."position" FROM snap_events a LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid LEFT JOIN player pl ON a.player_id = pl.player_id WHERE a.team_name = p.possessionteam AND pl."position" != 'QB' GROUP BY a.player_id, pl.display_name, a.team_name, pl."position" ORDER BY play_count DESC;
So, now we can see the non-quarterbacks who are on offense the most in a season:
Sack percentage by quarterback on passing plays
We can start to go a little deeper by extracting specific data from the
tracking table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.
Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (shotgun formation) or maybe if sacks occur more often in a specific quarter of the game (maybe the fourth quarter because the offensive line is more tired, or the team tends to be behind late in games and must pass more often).
Queries like this can quickly show you quarterbacks that are more likely to get sacked, particularly when they play a strong defensive team.
To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that we approached the tracking data by layering on Common Table Expressions so that each query could build upon previous results.
First, we select the distinct list of all plays, for each quarterback (
qb_plays). The reason we do a
SELECT DISTINCT… is because the tracking table holds multiple entries for each player, for each play. We just need one row for each play, for each quarterback.
With this result, we can then count the number of total plays per quarterback (
total_qb_plays), the total number of games each quarterback played (
qb_games) and then finally the number of pass plays the quarterback was a part of that resulted in a sack (
With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.
WITH qb_plays AS ( SELECT DISTINCT ON (POSITION, playid, gameid) POSITION, playid, player_id, gameid FROM tracking t WHERE POSITION = 'QB' ), total_qb_plays AS ( SELECT count(*) play_count, player_id FROM qb_plays GROUP BY player_id ), qb_games AS ( SELECT count(DISTINCT gameid) game_count, player_id FROM qb_plays GROUP BY player_id ), sacks AS ( SELECT count(*) sack_count, player_id FROM play p INNER JOIN qb_plays ON p.gameid = qb_plays.gameid AND p.playid = qb_plays.playid WHERE p.passresult = 'S' GROUP BY player_id ) SELECT play_count, game_count, sack_count, (sack_count/play_count::float)*100 sack_percentage, display_name FROM total_qb_plays tqp INNER JOIN qb_games qg ON tqp.player_id = qg.player_id LEFT JOIN sacks s ON s.player_id = qg.player_id INNER JOIN player ON tqp.player_id = player.player_id ORDER BY sack_count DESC NULLS last;
If you're an ardent football fan, the results from 2018 probably don't surprise you.
Of course, there are a few quarterbacks that always seem to have a way of avoiding a sack.
Now, let’s try some more “advanced” queries and analyses.
Faster insights with PostgreSQL and TimescaleDB
So far, the queries we've shown are interesting and help provide insights to various players throughout the season – but if you were looking closely, they're all regular SQL statements.
Examining a season of NFL tracking data isn't like typical time-series data, however. Most of the queries we want to perform need to examine all 20 million rows in some way.
This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.
Faster queries with TimescaleDB continuous aggregates
We noticed that we often needed to build queries that started with the
tracking table, filtering data by specific players, positions, and games. Part of the reason is that the
play table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the
tracking table to identify the players who were involved in any given play.
The first example query we demonstrated - “average yards per position, per game” - is a good example of this. The query begins by summing all yards, by position, for each game.
This means that every row in
tracking has to be read and aggregated before we can do any other analysis. Scanning those 20 million rows is pretty boring, repetitive, and slow work – especially compared to the analysis we want to do!
On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).
Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our continuous aggregates docs.)
CREATE MATERIALIZED VIEW player_yards_by_game_ WITH (timescaledb.continuous) AS SELECT player_id, position, gameid, time_bucket(INTERVAL '1 day', "time") AS bucket, SUM(dis) AS yards FROM tracking t GROUP BY player_id, position, gameid, bucket;
After running this query and creating a continuous aggregate, we can modify that first query just slightly, using this as our basis table.
WITH total_position_yards AS ( SELECT sum(yards) position_yards, POSITION, gameid FROM player_yards_by_game t GROUP BY POSITION, gameid) SELECT avg(position_yards), position, game_date FROM game g INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid WHERE POSITION IN ('QB','RB','WR','TE') GROUP BY game_date, POSITION ORDER BY game_date, position;
We get the same result, but now the query runs in 100ms - 800x faster!
Advanced SQL data analysis with TimescaleDB hyperfunctions
Finally, the more we dug into the data, the more and more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.
It is for this kind of analysis that we built TimescaleDB hyperfunctions, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.
Grouping data into percentiles
The NFL dataset is a great use case for percentiles. Being able to quickly find players that perform better or worse than some cohort is really powerful.
As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the median total yards traveled by position for each game.
WITH sum_yards AS ( --Add position to the table to allow for grouping by it later SELECT a.player_id, display_name, SUM(yards) AS yards, p.position, gameid FROM player_yards_by_game a LEFT JOIN player p ON a.player_id = p.player_id GROUP BY a.player_id, display_name, p.position, gameid ) --Find the mean and median for each position type SELECT position, mean(percentile_agg(yards)) AS mean_yards, approx_percentile(0.5, percentile_agg(yards)) AS median_yards FROM sum_yards WHERE POSITION IS NOT null GROUP BY position ORDER BY mean_yards DESC;
Finding extreme outliers
Finally, we can build upon this percentile query to find players at each position that run more than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players that are able to travel the field consistently throughout a game – and make plays!
WITH sum_yards AS ( --Add position to the table to allow for grouping by it later SELECT a.player_id, display_name, SUM(yards) AS yards, p.position FROM player_yards_by_game a LEFT JOIN player p ON a.player_id = p.player_id GROUP BY a.player_id, display_name, p.position ), position_percentile AS ( SELECT POSITION, approx_percentile(0.95, percentile_agg(yards)) AS p95 FROM sum_yards GROUP BY position ) SELECT a.POSITION, a.display_name, yards, p95 FROM sum_yards a LEFT JOIN position_percentile pp ON a.POSITION = pp.position WHERE yards >= p95 AND a.POSITION IN ('WR','FS','QB','TE') ORDER BY position;
Where can the data take you?
As you’ve seen in this example, time-series data is everywhere. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.
We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspired you to dig into the data yourself.
To get started with the NFL data:
- Spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days.
- Follow our complete tutorial for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from the dataset.
If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, join our public Slack community. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world – and we’ll be happy to help you.
🙏 We’d like to thank the NFL for making this data available, and the millions of passionate fans around the world who make the NFL such an exciting game to watch.
And, Geaux Saints 🏈!