Making the Perfect March Madness Bracket – An (Impossible) Tradition Unlike Any Other

Maybe this will be the year. I have watched a lot of college basketball, and I have a pretty good feel for which teams may be legitimate contenders versus the frauds. Maybe, just maybe, I can pick the perfect bracket this time around. Or at least make the best bracket in my pool and win some money. Maybe I have finally figured out something that others have not…

This is often the mindset of millions of college basketball fans come March (myself included). Every year, once the final version of my bracket (or top 5-10 versions) has been submitted, a very small piece of me truly believes that I have built one of the best brackets in the country. Yet regardless of my confidence year in and year out, by the end my bracket’s results often look like I flipped a coin on every game. With a 1 in 120 billion chance of crafting a perfect bracket according to ncaa.com, you may be better off doing just that (I once tested this using my grandmother, who barely understands the game of basketball; she picked each game based on which school’s name she liked the most and correctly predicted Duke winning it all in 2015).

I often wonder whether vast sets of data can engender better March Madness predictions. Part of the fun of the tournament is all of its variability. Your favorite team could get hot offensively, make 70% of their 3-pointers, and upset a heavily favored opponent in any given game. Or make an improbable half-court shot to win. Within each 40+ minute game, seemingly anything can transpire, especially in a high-intensity environment with emotions running high. As data analysts and sports reporters have seen time and again, this is what makes bracket predicting tricky - even if you have vast pertinent information about the teams involved.

With that being said, this college basketball model is NOT meant to make a perfect bracket. It is not even necessarily meant to predict the champion or Final Four teams with 100% accuracy. Instead, this model is an experiment into whether historical seasons of data can unlock patterns in the success of tournament teams. It is also an experiment into whether applying these data-driven patterns to a team’s season stats can produce a more accurate bracket than one made using solely the “eye test”.

Building the Model

There were a number of data decisions to make when aggregating and summarizing teams’ season data. Combining data from both www.sports-reference.com and basketball.realgm.com, I compiled historical data for every tournament-eligible team from 2002-03 to 2018-19. I wanted to analyze various types of data points as well, from box score numbers like Points per Game, to more advanced numbers like Offensive and Defensive Rating, to opponent-based numbers like Strength of Schedule.

The next step was deciding on a dependent variable to gauge tournament success. The main difficulty was choosing one that would give the model enough data to learn from (only 1 of the 68 tournament teams wins it all each year, so patterns seen in champions alone from 2002-18 seemed like too small a sample size). I decided to measure whether a team makes the Elite 8, which provides a larger sample of “success”. Since my model(s) would ultimately produce the probability that any given team makes the Elite 8, I could extrapolate slightly from these results and treat the teams most likely to make the Elite 8 as also the most likely to win the entire tournament.
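As a rough sketch of what this labeling step looks like in practice (the field names and mini-dataset below are illustrative, not my actual schema, and the sketch is in Python rather than the R used for the real model):

```python
# Hypothetical mini-dataset: one row per tournament team-season.
teams = [
    {"team": "Kansas", "season": "2017-18", "finish": "Final Four"},
    {"team": "Duke", "season": "2017-18", "finish": "Elite Eight"},
    {"team": "Loyola Chicago", "season": "2017-18", "finish": "Final Four"},
    {"team": "Cincinnati", "season": "2017-18", "finish": "Round of 32"},
]

# Tournament finishes that count as "reaching the Elite 8 or beyond".
ELITE8_OR_BETTER = {"Elite Eight", "Final Four", "Championship Game", "Champion"}

def label_elite8(rows):
    """Attach the binary dependent variable: 1 if the team reached the Elite 8."""
    return [dict(r, elite8=int(r["finish"] in ELITE8_OR_BETTER)) for r in rows]

labeled = label_elite8(teams)
```

Every team-season then carries a 0/1 outcome, so roughly 8 of 68 teams per year contribute positive examples instead of just the single champion.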

The final (and arguably most important) preprocessing step involved normalizing and standardizing the data. To get a truer interpretation of which metrics correlate most with tournament success, the data first needed to be averaged/standardized rather than left as season totals, because totals would favor teams who played more games/minutes in the regular season. To minimize this, I standardized all box score variables to per minute played and used statistics in percentage form wherever applicable (such as Free Throws Made per Minute as well as Free Throw Percentage). However, this did not eliminate the chance of the model overrating some variables over others based on their magnitudes. For example, without further normalizing, Points per Minute would naturally be a much larger number than Steals per Minute, since points are scored far more often than steals occur. Therefore, I normalized these variables in R (you can view my source code and methodology here) so that every numerical variable fell between 0 and 1, with the maximum value per statistic assigned a “1” and the minimum a “0”.
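The two transformations (per-minute standardization, then min-max scaling to [0, 1]) can be sketched as small helper functions. This is a simplified Python illustration of the idea; the actual preprocessing was done in R, and the numbers here are made up:

```python
def per_minute(stats, minutes):
    """Convert season totals to per-minute rates so heavy-minute teams aren't favored."""
    return {k: v / minutes for k, v in stats.items()}

def min_max_normalize(values):
    """Rescale one statistic's column to [0, 1]: the max maps to 1, the min to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate column with no spread
    return [(v - lo) / (hi - lo) for v in values]

# Example: a hypothetical points-per-minute column across three teams
ppm = [2.1, 1.8, 2.4]
normalized = min_max_normalize(ppm)
```

After this step, Points per Minute and Steals per Minute live on the same 0-to-1 scale, so neither dominates purely by magnitude.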

From here, I built 2 logistic regression models and a large tree model, then conducted K-Nearest Neighbor and discriminant analysis to predict the teams most likely to make the Elite 8 and beyond. To test these models in the context of March Madness, I built one bracket based solely on the logistic models’ outputs and another utilizing the large tree, K-Nearest Neighbor, and discriminant analysis outputs in tandem. The logistic regression model boasts 100% sensitivity in correctly selecting Elite 8 teams from 2002-18. Here are some of the stats that my models picked out as the most relevant to tournament success:

  • Defensive Rating
  • e-Difference (the difference between a team’s offensive and defensive rating)
  • Possessions per Minute
  • Three Pointers Attempted per Minute
  • Fouls Committed per Minute
  • Offensive Rebounds per Minute
  • Steals per Minute
  • Points Scored per Minute
  • Win %
  • Strength of Schedule
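To make the logistic-regression step concrete, here is a minimal from-scratch sketch in Python: plain gradient descent on two made-up, already-normalized features. The data, features, and hyperparameters are toy assumptions for illustration only, not the real two-model setup:

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression (bias + one weight per feature) by gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi          # gradient of log-loss w.r.t. the linear score
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict_proba(w, b, x):
    """Model's probability that a team with normalized stats x reaches the Elite 8."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy normalized features: [win %, strength of schedule], both already in [0, 1]
X = [[0.9, 0.8], [0.8, 0.9], [0.4, 0.3], [0.3, 0.2]]
y = [1, 1, 0, 0]  # 1 = reached the Elite 8
w, b = fit_logistic(X, y)
```

The fitted weights then turn any team's normalized season line into an Elite 8 probability, which is what the bracket-building step consumes.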

              The model prefers teams who are active and efficient defensively (posting low defensive ratings and forcing turnovers without fouling … maybe defense does win championships?), maximize their production and possessions offensively (by shooting threes, controlling the offensive boards, and scoring a lot of points per minute), and win against high-quality opponents (likely weighting teams who play in strong conferences higher). Here are the teams that the model liked the most to win the would-have-been tournament in 2019-20 (based on Bracketology experts' predicted seeding) as well as the 2020-21 tournament predictions:

2019-20:

Elite 8:

Midwest: (1) Kansas over (3) Michigan State

East: (4) Maryland over (6) Penn State

South: (1) Baylor over (10) Indiana

West: (6) West Virginia over (7) Michigan

Final 4:

(1) Kansas over (4) Maryland

(6) West Virginia over (1) Baylor

Championship:

(1) Kansas over (6) West Virginia

 

Takeaways:

  • The Midwest would have been the strongest region, with 2 teams in the model’s top 3
  • A lot of Big 10 and Big 12 teams in the Elite 8 (all 8 teams!)
  • A variety of seeds were projected to make the Elite 8

_______________________________________________________________________________

2020-21:

Model Picks:


Elite 8:

West: (1) Gonzaga (34% champion probability) over (2) Iowa (18%)

South: (1) Baylor (62%) over (2) Ohio State (12%)

East: (1) Michigan (76%) over (2) Alabama (21%)

Midwest: (1) Illinois (39%) over (2) Houston (19%) 

Final 4:

(1) Baylor (62%) over (1) Illinois (39%)

(1) Michigan (76%) over (1) Gonzaga (34%)

Championship:

(1) Michigan (76%) over (1) Baylor (62%)

Takeaways:

  • The East is the strongest region, with 2 of the model’s top 5 teams
  • A lot of Big 10 teams (4 of the Elite 8)
  • All 1’s and 2’s in the Elite 8, and all 1’s in the Final 4 (maybe the NET rankings that the NCAA uses to rank teams throughout the season are relatively accurate)

It has been very interesting to analyze the trends in successful tournament teams’ stats that held true from 2002-18. This is probably the model’s greatest value add. However, I am skeptical that it will be an extremely accurate predictor of March Madness games by itself.

For one, utilizing season-long statistics does not account for teams’ trends/success streaks at various points of the season, instead combining all stats into a yearlong average. Thus, the model will not account for teams getting hot at the end of the year (I have always thought that end-of-season performance would be one of the strongest predictors of tournament success). Further, it does not account for injuries or COVID-19 related issues, and the change in play from different players’ performances (or absences) cannot be factored into a team’s tournament odds. For example, Michigan has been an amazing team this year by all accounts, yet the injury to their star player, Isaiah Livers, could severely hurt their tournament chances. I am not sure how to account for player statuses in a future predictive model, but I believe this would be a major predictor. My hope, however, is that a bracket built entirely by my predictive model will perform better than one picked using minimal statistics. Maybe this will be the true proof that analytics beat the eye test alone (but do not bank on one tournament proving that on its own). Future edits will be made to this post after the tournament ends, in which I will analyze my results. May the best bracket win!

