NFL Predictions
I am a sports fan. While I am a triathlete who loves his sport, I enjoy watching many others, like tennis, F1, soccer, and football. When you are among friends and family who are as passionate about sports as I am, it is inevitable to call winners for any match you watch or plan to attend. It is like betting, but without the downside of losing money. You gain respect if you are consistent at calling winners, and a reputation if you constantly call losers. I would definitely like to be among the respected, so I decided to improve my chances of being the one who says: I told you so. With the difference that I would have something to support my claim rather than just calling it.

When I first started thinking about this idea, I decided to get statistical data for each team: yards per game, scores, field goals, and so on. The best model I came up with uses the game scores plus additional information, such as which team is visiting and the parameters for the Elo rating calculation.

The data used to build the models was collected from two sources: the NFL site, for the offensive and defensive stats of all teams, and pro-football-reference.com, for historical match results. That site arranges the data in a way that is quite easy to crawl for all match results since 1920, when the league was founded. The spreads were collected from aussportsbetting.com, with updates taken from oddsshark.com. The training sample grows after each game is played.

The final model uses the Elo rating as the primary attribute to predict a match result. The rating starts at the same value for every team; after each game is completed, the rating is updated and becomes part of the sample data. For the other models, the training sample initially covered the seasons from 1990 to 2009. Predictions are made for the entire 2010 season; once that season is completed, the 2010 data is added to the training sample. The same process is repeated up to 2016.
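The per-game Elo update described above can be sketched as follows. This is a minimal, generic Elo implementation, not the author's tuned version; the K-factor and home-field bonus here are illustrative placeholder values.

```python
# Minimal sketch of an Elo rating update after one game. The K-factor and
# home-field bonus are illustrative assumptions, not the tuned parameters
# used by the actual model.

def expected_score(rating_a, rating_b, home_bonus=0.0):
    """Probability that team A beats team B under the Elo model."""
    return 1.0 / (1.0 + 10 ** (-(rating_a + home_bonus - rating_b) / 400))

def update_elo(rating_a, rating_b, a_won, k=20, home_bonus=65):
    """Return updated ratings after a game where team A played at home."""
    exp_a = expected_score(rating_a, rating_b, home_bonus)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - exp_a)
    return rating_a + delta, rating_b - delta

# Both teams start at the same baseline rating; a home win moves points
# from the loser to the winner, keeping the total rating constant.
home, away = update_elo(1500, 1500, a_won=True)
```

Because the same delta is added to one team and subtracted from the other, total rating is conserved across the league, which is what makes the rating comparable across seasons.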
The initial data from NFL.com and pro-football-reference.com was pulled using Python crawlers. From the official league website, stats from 1990 through week 9 of 2016 were gathered; game scores and results for the same period were collected the same way. Since week 11, an automatic process has fetched all new data from the official NFL site with the same Python crawler, and the Las Vegas spreads are now updated from oddsshark.com.

To start my analysis, I looked at the data to find which variables seemed to carry relevant information about team performance. Correlation measurements and distributions gave me an idea of which approach to take; I will make some of those visualizations available later on the blog. The NFL stats provide different measurements of each team's performance, and for the initial models I tried different combinations of that statistical data to train different models. I created an ensemble model that combined the results of ridge, lasso, and elastic net regressions. The best model, however, takes only the Elo rating and a variable identifying whether the team is visiting or playing at home to predict a winner. The Elo rating can be computed after each game is played, folding the team's performance into its value. I decided to go with an ensemble because the algorithms provide different types of output and the accuracy of any single model varies greatly. The idea of implementing the Elo rating came up while I was selecting the benchmarks for my model.

For the ensemble model, the process is as follows. For each match in a given season, the training data includes the performance data of the teams up to the season before the one being predicted. For example, to predict the game between the Colts and the Cardinals in week 13 of 2013, the training data included performance data from 1990 to 2012. Then the performance of the 2013 season, from week 1 to week 12, is fitted to the model.
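A sketch of the ridge/lasso/elastic-net ensemble described above, using scikit-learn. The feature matrix and target here are synthetic stand-ins for the real team-performance data, and the regularization strengths are arbitrary defaults, not the values the post's models actually used.

```python
# Sketch of the regression ensemble: each model scores both teams, the
# scores are averaged, and the higher mean score picks the winner.
# X_train / y_train are synthetic stand-ins for real team stats.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))   # e.g. per-team offensive/defensive stats
y_train = X_train @ rng.normal(size=6) + rng.normal(scale=0.1, size=200)
X_game = rng.normal(size=(2, 6))      # rows: home team, away team

models = [Ridge(alpha=1.0), Lasso(alpha=0.01), ElasticNet(alpha=0.01)]
for m in models:
    m.fit(X_train, y_train)

# Average the three per-team scores and call the winner.
scores = np.mean([m.predict(X_game) for m in models], axis=0)
predicted_winner = "home" if scores[0] > scores[1] else "away"
```

For the expanding-window evaluation, the same fit/predict loop would simply be re-run season by season, appending each completed season to `X_train` before predicting the next one.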
In the ensemble, each model provides a score for each team, and a winner is selected by taking the average score. The model is tested on each game between 2010 and 2016, a period in which more than 1,700 games were played. In addition, three benchmarks were selected to compare the results against other models: FiveThirtyEight, Fox Sports, and the odds spreads. Each benchmark takes a different approach to selecting winners, and the models I built get similar results when calling winners.

Results and adjusted predictions for upcoming games are updated automatically in four sections of this post. Weekly Forecast contains the predicted and actual winner for each game since 2010, including the 2016 playoffs. Since the Elo rating is computed, the Team Ranking section shows the teams ordered by their probability of winning the Super Bowl; the probabilities are obtained by running 10,000 Monte Carlo simulations. The third section compares the model to the benchmarks, presenting the accumulated average recall of winners per week for the 2016 season. Finally, the Elo rating for each team is presented using Tableau.

All models were implemented in Python 3.4, using the scikit-learn package to run the ML algorithms. The Elo rating is also computed in Python with custom code.
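The Monte Carlo step can be sketched as below. This is a hypothetical reduction of the real simulation: the bracket is collapsed to a four-team single-elimination tournament with placeholder teams and made-up Elo ratings, just to show how 10,000 simulated runs turn ratings into championship probabilities.

```python
# Hypothetical sketch of the Monte Carlo simulation: play out the bracket
# 10,000 times, sampling each game from the Elo win probability, and count
# how often each team ends up champion. Teams and ratings are placeholders.

import random
from collections import Counter

def win_prob(elo_a, elo_b):
    """Elo probability that team A beats team B (home field ignored here)."""
    return 1.0 / (1.0 + 10 ** (-(elo_a - elo_b) / 400))

def play(team_a, team_b, ratings):
    """Sample a single game outcome from the Elo win probability."""
    if random.random() < win_prob(ratings[team_a], ratings[team_b]):
        return team_a
    return team_b

def simulate(ratings, n_runs=10_000):
    """Estimate each team's championship probability over n_runs brackets."""
    champions = Counter()
    teams = list(ratings)
    for _ in range(n_runs):
        finalist_1 = play(teams[0], teams[1], ratings)
        finalist_2 = play(teams[2], teams[3], ratings)
        champions[play(finalist_1, finalist_2, ratings)] += 1
    return {t: champions[t] / n_runs for t in teams}

random.seed(42)
ratings = {"NE": 1650, "DAL": 1600, "KC": 1580, "ATL": 1590}
probs = simulate(ratings)
```

The estimated probabilities sum to one by construction, and a larger run count simply tightens the estimate around the true bracket odds.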