NBA Betting Model - Beating the House

  $150 billion per year.

According to legalsportsbetting.com, that’s a conservative projection for the monetary size of the betting industry for 2021. To put that into context, that is higher than the market capitalization of Starbucks. Marketwatch.com estimates that by 2025, sports betting alone will generate $8 billion in yearly revenue. That is like buying over 420 Bitcoin per day for an entire year (as of the 4/23/2021 BTC price).

The prevalence of online betting platforms and apps, like FanDuel and DraftKings, makes it extremely easy and fun to bet on any given game or player. The increased interconnectivity among sports leagues and sports media outlets also increases exposure to the betting landscape. For instance, ESPN recently aired a “BetCast” stream for a primetime Philadelphia 76ers vs Brooklyn Nets game, featuring graphics and statistics on various prop bets and odds throughout. Betting was once taboo on sports talk shows; now a viewer can rarely get through a 30-minute show without hearing the odds for today’s or future games.

Thus, when thinking through relevant basketball topics to analyze in R, NBA betting models seemed like the perfect topic. I was hesitant, though, based on the abundance of variability within a basketball game. Over the course of an 82-game season, heavily favored teams can lose any given game, and unpredictable factors like injuries or foul trouble can disrupt the expectation of who will win. Given the myriad of confounding variables, I was almost ready to give up.

My curiosity about the concept luckily got the best of me. I wondered if there were any patterns amongst teams who win games, versus how much they tend to win by, versus how many total points are scored. Looking at all NBA game-by-game team stats from the 2014-2018 seasons, I was able to recognize some patterns via logistic and linear regression models.

Nevertheless, understanding what leads teams to win games, by how much, and with how many points scored is not very useful outside the context of individual matchups. So, using some Excel magic, I built a dashboard that leverages these logistic and linear regression formulas to predict the winner, spread, and total points scored in any given matchup based on each team’s average season stats. I tested this dashboard on all 59 NBA games from 4/11/21-4/17/21, comparing the model’s outputs to DraftKings’ projected numbers in order to make game-by-game picks. For example, if DraftKings projected the Milwaukee Bucks to defeat the Oklahoma City Thunder by 6 points but my model projected they would win by 10, then I would pick the Bucks -6. Using this method, the model correctly predicted the spread with 59% accuracy and the over/under with exactly 50% accuracy, for an overall 54% rate.
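The pick rule described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the Excel dashboard itself; the function names and the convention of expressing both margins from the favorite's point of view are assumptions made for the example.

```python
def pick_spread(favorite, underdog, book_margin, model_margin):
    """Pick a side against the spread.

    book_margin:  points the sportsbook expects the favorite to win by.
    model_margin: points the model expects the favorite to win by
                  (negative if the model expects the underdog to win).
    """
    if model_margin > book_margin:
        # Model thinks the favorite covers the spread.
        return f"{favorite} -{book_margin:g}"
    if model_margin < book_margin:
        # Model thinks the game will be closer than the book does.
        return f"{underdog} +{book_margin:g}"
    return "no bet"


def pick_total(book_total, model_total):
    """Pick the over or under by comparing projected total points."""
    if model_total > book_total:
        return "over"
    if model_total < book_total:
        return "under"
    return "no bet"


# The Bucks vs Thunder example from the text.
print(pick_spread("Bucks", "Thunder", 6, 10))  # Bucks -6
```

The same comparison, applied to projected total points, gives the over/under pick.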

This rate may not seem very high until you think about it in context. A bettor wagering $10 on every game would have made $100 that week alone. If a similar winning trend held for every week of the season, this could balloon to $3000 or more. The ultimate goal of any bettor is to be above $0, and maybe (just maybe) this model could consistently beat the house.
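A quick way to sanity-check figures like these is to compute profit from the number of bets, the win rate, the stake, and the price of each bet. The sketch below assumes two bets per game (one spread, one total) and treats the odds as a parameter. The -110 price is a common sportsbook convention assumed here for illustration; it is not taken from the lines used in the test week.

```python
def payout_per_unit(american_odds):
    """Profit per $1 staked on a winning bet at the given American odds."""
    if american_odds > 0:
        return american_odds / 100
    return 100 / -american_odds


def expected_profit(n_bets, win_rate, stake, american_odds):
    """Total profit if win_rate of n_bets win, all at the same stake and odds."""
    wins = n_bets * win_rate
    losses = n_bets - wins
    return wins * stake * payout_per_unit(american_odds) - losses * stake


def break_even_rate(american_odds):
    """Win rate at which a bettor neither makes nor loses money."""
    return 1 / (1 + payout_per_unit(american_odds))


n_bets = 59 * 2  # assumed: one spread bet and one over/under bet per game
print(round(expected_profit(n_bets, 0.54, 10, +100), 2))  # even-money payouts
print(round(expected_profit(n_bets, 0.54, 10, -110), 2))  # assumed -110 pricing
print(round(break_even_rate(-110), 4))  # 0.5238
```

Under these assumptions, a 54% hit rate is profitable at even money and still clears the break-even rate of about 52.4% at -110, though by a thinner margin.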

Looking at the basketball betting space as a whole, Daily Fantasy Sports lineups are also increasingly popular, so I also built an optimization model that maximizes a lineup’s average fantasy points per game while spending as much of the salary cap as possible without going over it. With both the team betting and fantasy components, any user could use this dashboard for their betting entertainment needs. But how do the models really work?




Making the Model(s)

Team Stats Models

When starting this project, finding and cleaning the right data for the models' inputs and outputs was very difficult.

First, a descriptive dataset was needed that would be large enough to glean legitimate trends. Obtaining a large enough sample of game-by-game data that painted a full picture of the game was rather tricky (for example, some datasets did not track pertinent variables like rebounds, turnovers, and assists). This type of data was easier to find for samples covering a few days or weeks, and it was not until I found a dataset on Kaggle with all games from 2014-2018 that I had enough to work with. With close to 10,000 games in it, this would be a great set of training data for mining and modeling.

Next, it took many iterations to understand which statistics should and should not be included in the models, and in what form. Ignoring this step would have led to models that were either extremely overfitted or not replicable for future matchups. Think about predicting spread, for instance. One would probably think that the home and away teams’ average points per game would be important predictors in any model of a team’s expected margin of victory. This is a valid thought; however, these statistics would dominate the other important variables in the dataset to the point that the model would do little more than calculate the difference between the home and away teams’ average points scored per game. If it were that easy to predict the outcome of a game, then betting the spread would be extremely easy. Any sports fan would likely recognize that this method ignores how one team’s offense will play against another’s defense and vice-versa, as well as how a team’s overall style of play changes the outcome of a game against different opponents.

Based on this understanding of the data, total points were excluded from all models as an independent variable so that predictors more representative of a team’s style of play (such as True Shooting % and Offensive Rating) could be weighed more fairly. In turn, because the model is not overfitted, it can allow for variations in how a team plays against different opponents throughout the season. Eliminating duplicative variables also improves the model’s flexibility: for example, inputting only each team’s Offensive Rating (since a Defensive Rating is defined as the opposing team’s Offensive Rating), or only Offensive Rebound % (since it is just a percentage of total rebounds).

Even with a stronger understanding of the training data, the models would not be very useful without good data from which to predict future outcomes. Individual game data was used to train the models, but it is very difficult to find or scrape nightly box scores for each team. Even with such data available, it would likely be a mistake to predict a team’s next game from its previous game alone (a team could be extraordinarily hot or cold in a single game, or its performance could be unrepresentative of its average output due to the performance of, or an injury to, a star player). With both of these thoughts in mind, I used teams’ average statistics over the course of the season from basketball-reference.com, as I was interested in investigating whether a team’s average style of play was relatively predictive of its game-by-game performance as a whole (regardless of injuries). In future versions of this betting tool, an API could be used to pull nightly team data and then compute a team’s averages for the last 5 games, 10 games, etc. (based on user preference) to compare predictions from season-long versus recent team performance.
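The "last 5 games, 10 games" idea amounts to averaging a stat over a trailing window instead of the whole season. A minimal sketch, assuming per-game values are already available as a list ordered from oldest to newest (the function name and the sample numbers are hypothetical):

```python
def recent_average(game_values, n=None):
    """Average a stat over the last n games, or over all games if n is None."""
    if not game_values:
        raise ValueError("no games played yet")
    if n is not None and n <= 0:
        raise ValueError("n must be a positive number of games")
    # Slicing with [-n:] simply returns every game if fewer than n were played.
    window = game_values if n is None else game_values[-n:]
    return sum(window) / len(window)


# Hypothetical per-game True Shooting % for one team, oldest game first.
ts_pct = [0.55, 0.58, 0.61, 0.52, 0.57, 0.60, 0.63, 0.59]

season_avg = recent_average(ts_pct)      # all games so far
last_5_avg = recent_average(ts_pct, 5)   # most recent five games only
print(round(season_avg, 3), round(last_5_avg, 3))
```

Either average could then be fed into the same regression formulas, which would let a user compare picks based on season-long form against picks based on recent form.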

Finally, when initially validating the models, the change in style of play between 2014-2018 and the 2021 season became an important consideration. My model’s projected margin of victory and total points scored both tended to be consistently lower than DraftKings’ projections and actual games’ box scores. Digging deeper into the data, I noticed that the average points scored per game in the 2020-21 season is approximately 8% higher than in the training data from 2014-18. This is likely due to the increased number of 3-pointers and layups attempted and the fall in midrange shots taken over this time period ("Take that for data!"). Therefore, I weighted both predictions in my dashboard by that 8% increase. Although there may have been a better way to do this, I noticed immediate improvements in projections as compared to actual game outcomes. In future versions of the dashboard, I would hope to include training data from 2019-21 rather than weight the results.
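One simple reading of "weighting by that 8% increase" is to multiply each raw prediction by the ratio of current to training-era scoring. The sketch below shows that arithmetic only; the function names and the raw prediction values are hypothetical, and the dashboard may apply the weight differently.

```python
def scoring_factor(current_avg_points, training_avg_points):
    """Ratio of current league scoring to scoring in the training data."""
    return current_avg_points / training_avg_points


def adjust(prediction, factor):
    """Scale a raw model prediction by the scoring factor."""
    return prediction * factor


factor = 1.08  # the roughly 8% increase described in the text

raw_margin = 5.0    # hypothetical raw projected margin of victory
raw_total = 205.0   # hypothetical raw projected total points

print(round(adjust(raw_margin, factor), 1))  # 5.4
print(round(adjust(raw_total, factor), 1))   # 221.4
```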

DFS Lineup Optimization Model

I opted for a simpler setup for the lineup optimization model overall. In this case, average fantasy points per game would likely be a strong predictor of upcoming performance, since a player performing unusually better or worse than expectation is extremely random in nature. Furthermore, average fantasy points per game is already a linear equation in itself, so applying a linear regression model to a player’s average statistics would give exactly the same result as calculating their average fantasy points per game directly. Another assumption is that a player’s individual performance will vary minimally regardless of opponent, especially among the higher volume players that fantasy bettors would likely be picking. Using Excel’s Solver plug-in, which maximizes a chosen “score” or metric subject to cost or other thresholds and limits, the highest projected lineup can be chosen from the pool of players in any contest. Similar to the team stats’ predictive betting models, I also hope to build the option to view data from the last 5 games, 10 games, etc. so the user can factor in undervalued or overvalued players based on recent performance.
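The same optimization can be sketched outside of Excel. The version below is not the Solver setup; it swaps in a brute-force search over every possible lineup, which is only practical for a small player pool. The player names, salaries, point averages, cap, and roster size are all hypothetical, and real contests add position requirements that are ignored here.

```python
from itertools import combinations


def best_lineup(players, salary_cap, roster_size):
    """Return (lineup, points) with the most average fantasy points under the cap.

    players: list of (name, salary, avg_fantasy_points) tuples.
    Returns (None, -inf) if no lineup fits under the cap.
    """
    best, best_points = None, float("-inf")
    for lineup in combinations(players, roster_size):
        salary = sum(p[1] for p in lineup)
        if salary > salary_cap:
            continue  # over the cap, so not a legal lineup
        points = sum(p[2] for p in lineup)
        if points > best_points:
            best, best_points = lineup, points
    return best, best_points


# Hypothetical player pool: (name, salary, average fantasy points per game).
pool = [
    ("Player A", 10000, 50.0),
    ("Player B", 9000, 46.0),
    ("Player C", 7000, 38.0),
    ("Player D", 6000, 30.0),
    ("Player E", 4000, 24.0),
    ("Player F", 3000, 15.0),
]

lineup, points = best_lineup(pool, salary_cap=20000, roster_size=3)
print([p[0] for p in lineup], points)  # ['Player B', 'Player C', 'Player E'] 108.0
```

In this toy pool the best lineup leaves out the single highest scorer, because two mid-priced players plus a cheap one fit under the cap with more total points.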

Takeaways and Outputs

Win % Model:

Looking at the odds of one team beating another, I wanted to investigate which factors were predictive in a matchup independent of the margin of victory. It is important to note that these odds are based on the likelihood that one team would beat another if their average season stats held relatively true in any given matchup. Using a logistic regression model with normalized input data, these were the most important variables:

·       Home True Shooting % & Away True Shooting %          

·       Home Turnover % & Away Turnover %

·       Home Offensive Rebound % & Away Offensive Rebound %

·       Home Total Fouls & Away Total Fouls    

·       Home Free Throw Rate  & Away Free Throw Rate     

·       Home Assists & Away Assists

·       Home 3-Point Shot %

According to this model, teams tend to win games when they have a high True Shooting % (which factors free throw shooting into a team’s overall shooting percentage), get to the free throw line at a high rate while minimizing fouls defensively, maximize possessions while minimizing turnovers, and have strong ball movement. This is also consistent with my March Madness predictive model: maximize possessions and take high quality shots. Another interesting point is that the home team’s statistics tended to carry a stronger weight than the same statistics for the away team. This may suggest that home court advantage matters (this would be interesting to analyze against the 2020 bubble games, which had no crowd at all). The fact that only Home 3-Point Shot % was significant in the formula may further suggest that the home crowd can slightly elevate a team’s shooting from deep. The same home/away weight difference can be seen across all 3 models.
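For readers less familiar with logistic regression, the sketch below shows how a fitted model of this kind turns normalized team stats into a win probability. The feature names mirror a few variables from the list above, but every coefficient, mean, and standard deviation is a made-up placeholder, not a value from the actual model.

```python
import math


def standardize(value, mean, std):
    """Normalize a raw stat to a z-score."""
    return (value - mean) / std


def win_probability(features, coefficients, intercept):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = intercept + sum(coefficients[k] * features[k] for k in coefficients)
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients; the home weight is larger, echoing the text.
coefficients = {"home_ts_pct": 0.9, "away_ts_pct": -0.7, "home_tov_pct": -0.4}
intercept = 0.2

# Hypothetical normalized inputs (league mean and std are placeholders).
features = {
    "home_ts_pct": standardize(0.59, 0.56, 0.02),
    "away_ts_pct": standardize(0.55, 0.56, 0.02),
    "home_tov_pct": standardize(0.13, 0.13, 0.01),
}

print(round(win_probability(features, coefficients, intercept), 3))
```

A probability above 0.5 would correspond to picking the home team to win.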

Spread/Margin of Victory Model:

The main differences between this model and the Win % model are that it uses linear rather than logistic regression and that team differences in various statistics are significant. I chose to format the teams’ statistics as differences where possible (rather than using the actual numbers alone), since the model predicts the difference in score between two teams. The variables below were found to be significant:

·       Home vs Away Net Rating & Home Offensive Rating

·       Field Goals Made Difference

·       Free Throws Attempted Difference         

·       Three Pointers Attempted Difference

·       Block Difference

·       Turnover Difference

·       Field Goals Attempted Difference

·       Offensive Rebounds Difference

·       Assists Difference

·       Home True Shooting % & Away True Shooting %

·       Home Field Goal %        

·       Home Effective Field Goal % & Away Effective Field Goal %

Although this list is relatively similar to the previous model’s, it is interesting that some variables were significant in predicting whether a team will win yet not significant in predicting how much it wins by (and vice-versa). Given the nature of betting on spreads, where bettors are rarely wagering on who wins outright, it makes sense that the significant statistics vary slightly. It is also important to note that a team’s odds of beating another team may not, on their own, be a strong predictor of the spread.
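To make the "statistics as differences" idea concrete, here is a minimal sketch of how home-minus-away features could be built and fed into a linear formula. The stat names, team averages, coefficients, and intercept are all hypothetical and chosen only so the arithmetic is easy to follow.

```python
def difference_features(home, away, stats):
    """Build home-minus-away difference features for the given stats."""
    return {f"{s}_diff": home[s] - away[s] for s in stats}


def predict_margin(features, coefficients, intercept):
    """Linear regression: intercept plus the weighted sum of the features."""
    return intercept + sum(coefficients[k] * features[k] for k in coefficients)


# Hypothetical season averages for two teams.
home = {"ast": 25, "tov": 13}
away = {"ast": 22, "tov": 15}

features = difference_features(home, away, ["ast", "tov"])
# features == {"ast_diff": 3, "tov_diff": -2}

# Hypothetical coefficients: more assists help, more turnovers hurt.
coefficients = {"ast_diff": 0.5, "tov_diff": -1.0}
margin = predict_margin(features, coefficients, intercept=2.0)
print(margin)  # 5.5, i.e. home team projected to win by 5.5
```

A positive result is a projected home win by that many points, which is the number that would be compared against the sportsbook's spread.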

Total Points Model:

              Finally, these are the significant predictors of home + away team points:

·       Home Offensive Rating & Home/Away Net Rating  

·       Home Assists & Away Assists

·       Home Fouls & Away Fouls          

·       Home Steals & Away Steals        

·       Home Turnover % & Away Turnover %  

·       Home True Shooting % & Away True Shooting %          

·       Home Blocks & Away Blocks      

·       Home Free Throw Rate & Away Free Throw Rate  

·       Home Effective Field Goal % & Away Effective Field Goal %                    

·       Home Offensive Rebound % & Away Offensive Rebound %

·       Free Throws Attempted Difference         

·       Turnovers Difference

·       Field Goals Made Difference

The most interesting takeaway is that most statistics, like assists, were weighted separately for the home and away team, whereas only Free Throws Attempted, Turnovers, and Field Goals Made were significant as differences. This may indicate that one team’s points scored is relatively independent of the other team’s scoring and defensive effort. That is, a team’s points scored is theoretically more dependent on its ability to make both open and closely defended shots throughout the game, regardless of the defense (this would be interesting to analyze further via NBA shot tracking statistics).

Conclusion

In summary, this has been an extremely interesting concept to model due to its complexity, relevance, and user interactivity. My goal in this project was to create an extremely interactive dashboard in which users can see relevant scores based on any given matchup. In order to fully optimize the user’s experience, these are my ideal updates to make before the 2021-22 NBA season begins:

·       Automate the data pull through an API (this eliminates the manual copy and paste step and allows averages of the last 5 games, 10 games, etc. to be leveraged for making bets)

·       Build the dashboard in Tableau and publish it on Tableau Public so that it is more interactive, the data can hopefully update more quickly, and more team metrics can be shown when hovering over a team or its statistics

·       Expand the tool to other prop bets and futures odds for ROY, MVP, etc. (please comment below if you have any ideas for other bets!)

·       Leverage the Win % model to predict a team’s record, who will win in the NBA playoffs, and maybe even March Madness (although college data would obviously be more relevant as input data)

I made this tool not with the goal of making money but instead to understand and unlock trends in the betting arena. I hope anyone who reads this blog can make money using my tool. If enough people use it, maybe we can even beat the house! Just kidding Vegas, please don’t report me for saying that.

