Making the Perfect March Madness Bracket – An (Impossible) Tradition Unlike Any Other
Maybe this will be the year. I have watched a lot of college basketball, and I have a pretty good feel for which teams may be legitimate contenders versus the frauds. Maybe, just maybe, I can pick the perfect bracket this time around. Or at least make the best bracket in my pool and win some money. Maybe I have finally figured out something that others have not…
This is often the mindset of millions of college basketball fans come March (myself included). Every year, once the final version of my bracket (or top 5-10 versions) has been submitted, a very small piece of me truly believes that I have built one of the best brackets in the country. However, regardless of my confidence year in and year out, by the end my bracket’s results often look like I flipped a coin on every game. With a 1 in 120 billion chance of crafting a perfect bracket according to ncaa.com, you may be better off doing just that (I once did this using my grandmother, who barely understands the game of basketball, as a test subject; she picked each game based on which school’s name she liked the most and correctly predicted Duke winning it all in 2015).
I often wonder whether vast sets of data can produce better March Madness predictions. Part of the fun of the tournament is all of its variability. Your favorite team could get hot offensively, make 70% of their 3-pointers, and upset their heavily favored opponent in any given game. Or they could hit an improbable half-court shot to win. Within each 40+ minute game, anything can seemingly transpire, especially in a high-intensity environment with emotions running high. As data analysts and sports reporters have seen time and again, this is what makes bracket predicting tricky - even if you have vast pertinent information about the teams involved.
With that being said, this college basketball model is NOT meant to make a perfect bracket. It is not even necessarily meant to predict the champion or Final Four teams with 100% accuracy. Instead, this model is an experiment into whether historical seasons of data can unlock patterns regarding the success of tournament teams. It is also an experiment into whether utilizing these data-driven patterns (applied to a team’s season stats) can produce a more accurate bracket than one made using solely the “eye test”.
Building the Model
There
were a number of data decisions I had to make when aggregating and summarizing
teams’ season data. Combining data from both www.sports-reference.com and basketball.realgm.com, I compiled historical data
for every tournament-eligible team from 2002-03 to 2018-19. I wanted to analyze
various types of data points as well, from box score numbers like Points per
Game, to more advanced numbers like Offensive and Defensive Rating, to opponent-based
numbers like Strength of Schedule.
The next step was deciding what dependent variable to utilize to gauge tournament success. The main difficulty here was finding a dependent variable that would give the model enough data to learn from (only 1 of the 68 teams in the tournament wins it each year, so using patterns seen in these champions alone from 2002-18 seemed like too small of a sample size). I decided to measure whether teams make the Elite 8 in order to provide a larger sample size of “success”. Since I knew that my model(s) would eventually produce the probability that any given team makes the Elite 8, I could extrapolate slightly from these results and rank the teams most likely to make the Elite 8 as also the most likely to win the entire tournament.
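For illustration, here is a minimal R sketch of how that binary Elite 8 outcome could be attached to each team-season row. The file and column names are hypothetical assumptions, not my actual dataset:

```r
library(dplyr)

# Hypothetical inputs: one row per team-season with its stats, plus a table
# listing every historical Elite 8 appearance (team + season).
team_seasons <- read.csv("team_seasons_2003_2019.csv")
elite8_teams <- read.csv("elite8_history.csv")

# Flag each team-season as a "success" (1) if that team reached the Elite 8
# that year, otherwise 0. This is the dependent variable the models learn from.
team_seasons <- team_seasons %>%
  mutate(made_elite8 = as.integer(
    paste(team, season) %in% paste(elite8_teams$team, elite8_teams$season)
  ))
```

Defining success this way yields roughly eight positive cases per tournament from 2002-03 through 2018-19, rather than a single champion per year.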
The final (and arguably most important) preprocessing step involved normalizing and standardizing the data. In order to get a truer interpretation of which metrics correlate most with tournament success, the data first needed to be averaged/standardized rather than left as season totals, because totals would favor teams who played more games/minutes overall in the regular season. To minimize this, I standardized all box score variables to per minute played and also used statistics in percentage formats wherever applicable (such as Free Throws Made per Minute as well as Free Throw Percentage). However, this did not eliminate the chance of the model overrating some variables over others based on their magnitudes. For example, without any further normalizing, Points per Minute would naturally be a much higher number than Steals per Minute, as points are scored much more often than steals occur. Therefore, I normalized these variables in R (you can view my source code and methodology here) such that all numerical variables fell between 0 and 1, with the maximum value of each statistic mapped to a “1” and the minimum to a “0”.
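Here is a rough R sketch of that two-step process, per-minute standardization followed by 0-to-1 min-max scaling. The column names are illustrative assumptions rather than the exact variables in my dataset:

```r
library(dplyr)

# Min-max scaling: the largest value of a statistic becomes 1, the smallest 0.
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

team_seasons <- team_seasons %>%
  # Step 1: convert season totals to per-minute rates so teams that played
  # more games/minutes are not favored.
  mutate(
    points_per_min = total_points / total_minutes,
    steals_per_min = total_steals / total_minutes
  ) %>%
  # Step 2: rescale every numeric predictor to the same 0-1 range so large-
  # magnitude stats (like points) do not overshadow small ones (like steals).
  mutate(across(c(points_per_min, steals_per_min, ft_pct, win_pct, sos), min_max))
```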
From here, I built 2 logistic regression models and a large tree model, then conducted K-Nearest Neighbors and discriminant analysis to predict the teams most likely to make the Elite 8 and beyond. To test these models in the context of March Madness, I built one bracket based solely on the logistic models’ outputs and another utilizing the large tree, K-Nearest Neighbors, and discriminant analysis outputs in tandem. The logistic regression model boasts 100% sensitivity in correctly selecting Elite 8 teams from 2002-18 (a simplified sketch of the fit appears after the list below). Here are some of the stats that my models picked out as the most relevant to tournament success:
· Defensive Rating
· e-Difference (the difference between a team’s offensive and defensive rating)
· Possessions per Minute
· Three Pointers Attempted per Minute
· Fouls Committed per Minute
· Offensive Rebounds per Minute
· Steals per Minute
· Points Scored per Minute
· Win %
· Strength of Schedule
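As referenced above, here is a simplified R sketch of what one of the logistic regression fits and its probability rankings might look like, using the variables listed above. The exact formula, column names, and tuning are assumptions for illustration, not my full methodology:

```r
# Fit a logistic regression predicting the Elite 8 indicator from the
# normalized season stats the models found most relevant.
elite8_fit <- glm(
  made_elite8 ~ def_rating + e_difference + poss_per_min + threes_att_per_min +
    fouls_per_min + oreb_per_min + steals_per_min + points_per_min +
    win_pct + sos,
  data = team_seasons,
  family = binomial()
)

# Predicted probability that each team-season makes the Elite 8.
team_seasons$elite8_prob <- predict(elite8_fit, type = "response")

# Rank a field by predicted probability; under the extrapolation described
# earlier, the top of this list doubles as the model's championship pick.
head(team_seasons[order(-team_seasons$elite8_prob),
                  c("team", "season", "elite8_prob")], 10)
```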
The model prefers teams who are active and efficient defensively (they post low defensive ratings and force turnovers without fouling … maybe defense does win championships?), maximize their production and possessions offensively (by shooting threes, controlling the offensive boards, and scoring a lot of points per minute), and win against high-quality opponents (likely weighting teams who play in strong conferences higher). Here are the teams that the model liked the most to win the would-have-been tournament in 2019-20 (based on Bracketology experts' predicted seeding) as well as the 2020-21 tournament predictions:
2019-20:
Elite 8:
Midwest: (1) Kansas over (3) Michigan State
East: (4) Maryland over (6) Penn State
South: (1) Baylor over (10) Indiana
West: (6) West Virginia over (7) Michigan
Final 4:
(1) Kansas over (4) Maryland
(6) West Virginia over (1) Baylor
Championship:
(1) Kansas over (6) West Virginia
Takeaways:
- The Midwest would have been the strongest quadrant with 2 teams in the model’s top 3
- A lot of Big 10 and Big 12 teams in the Elite 8 (all 8 teams!)
- A variety of seeds were projected to make Elite 8
2020-21:
Model Picks:
Elite 8:
West: (1) Gonzaga (34% champion probability) over (2) Iowa (18%)
South: (1) Baylor (62%) over (2) Ohio State (12%)
East: (1) Michigan (76%) over (2) Alabama (21%)
Midwest: (1) Illinois (39%) over (2) Houston (19%)
Final 4:
(1) Baylor (62%) over (1) Illinois (39%)
(1) Michigan (76%) over (1) Gonzaga (34%)
Championship:
(1) Michigan (76%) over (1) Baylor (62%)
Takeaways:
- East is the strongest region with 2 of the top 5 teams according to the model
- A lot of Big 10 teams (4 of Elite 8)
- All 1's and 2's in the Elite 8, and all 1's in the Final 4 (maybe the NET rankings that the NCAA uses to rank teams throughout the season are relatively accurate)
It has been very interesting to analyze the trends in successful tournament teams’ stats that have held true from 2002 to 2018. This is probably the model’s greatest value add. However, I have some skepticism that this will be an extremely accurate predictor for March Madness games by itself.
For one, utilizing season-long statistics does not account for teams’ trends or hot streaks at various points of the season, since all stats are combined into a yearlong average. Thus, the model will not account for teams getting hot at the end of the year (I have always thought that end-of-season performance would be one of the strongest predictors of tournament success). Further, it does not account for injuries or COVID-19 related issues, so a change in play caused by different players’ performances (or absences) cannot be factored into a team’s tournament odds. For example, Michigan has been an amazing team this year by all accounts, yet the injury to their star player, Isaiah Livers, could severely hurt their tournament chances. I am not sure how to account for player statuses in a future predictive model, but I believe that this would be a major predictor.

My hope, however, is that a bracket built entirely by my predictive model will perform better than one picked using minimal statistics. Maybe this will be proof that analytics beat the eye test alone (though do not bank on one tournament proving that on its own). I will edit this post after the tournament ends to analyze my results. May the best bracket win!