30680
Summary
The objective of our team was to find the five best coaches of the last 100 years for three
different college sports. Our team decided to look at men’s basketball, football, and baseball. We
wanted to be able to definitively determine team skill from the games played, and then use a machine-
learning algorithm to calculate the correct coach skills for each team in a given year. We created a
networks-based model to calculate team skill from historical game data. For basketball, we were able
to obtain the final scores of every single Div. I game played from 1939 to 2013. The range for football
was even larger: 1869 to 2013. The data for baseball was sparse. We were only able to obtain the final
scores of tournament games from 1947 to 2013. A digraph was created for each year in each sport.
Nodes represented teams, and edges represented a game played between two teams. The arrowhead
pointed towards the losing team.
We calculated the skill of each team in each graph using a right-hand eigenvector centrality measure.
Eigenvector centrality calculates the relative ‘importance’ of a node based on both the number of
connections and the importance of the nodes that it is connected to. In this way, teams that beat good
teams will be ranked higher than teams that beat mediocre teams. The eigenvector centrality rankings
for most years were well correlated with tournament performance and poll-based rankings.
We assumed that the relationship between coach skill (Cs), player skill (Ps), and team skill (Ts)
was this: Cs · Ps = Ts. Our team then created a function to describe the probability that a given score
difference would occur based on player skill and coach skill. Our rationale was this: if two teams have
unevenly matched players, coach skill will likely not influence the outcome of the game. However, if
two teams have evenly matched players, coach skill will manifest itself in player substitutions, time-
outs, etc. and will determine the team that won and the score difference. We multiplied the probabilities
of all edges in the network together to find the probability that the observed network would occur with any
given player skill and coach skill matrix. Our team was able to determine player skill as a function of
team skill and coach skill, eliminating the need to optimize two unknown matrices. The top five
coaches in each year were noted, and each coach's all-time score was calculated by dividing the number
of times that coach ranked in the yearly top five by the number of years that coach had been active. The top five
coaches in the last century are: Basketball: John Wooden (0.28), Lute Olson (0.26), Jim
Boeheim (0.24), Gregg Marshall (0.23), and Jamie Dixon (0.21). Football: Glenn Scobey Warner
(0.24), Bobby Bowden (0.23), Jim Grobe (0.18), Bob Stoops (0.17), and Bill Peterson (0.16).
Baseball: Mark Marquess (0.27), Augie Garrido (0.24), Tom Chandler (0.22), Richard Jones
(0.19), and Bill Walkenbach (0.16).
Who’s the Best Coach?
By: MCM Team #30680
As football and basketball legends like Nick Saban and Mike Krzyzewski fight hard on the court to
keep their reputations high, those of us at home would like to ask: are these reputations justified?
Does being well-known necessarily make a coach better? Is the best coach, contrary to popular
belief, someone who is less well-known than some of their colleagues? We are going to set out to
find the best coach of the last one hundred years in the college sports of basketball, football, and
baseball. We will utilize some cool mathematics during our journey to make our rankings less biased
than other rankings like the AP Poll or USA Today.
We will start by ranking, for every year in the last century, the college sports teams that play in
the top college division. We do this in a really interesting way that is quite simple to understand.
Let’s say you have two teams that are playing a game against one another. On a piece of paper, draw
two circles representing the two teams. Then, draw an arrow from one circle to the other. The
arrowhead will point to the team that lost the game. Now, we have created a very simple drawing
that can represent this game. Now imagine that this process is repeated for all 5000 games played
in the 2012-2013 basketball season. The resulting drawing will look something like this:
From this drawing, we can see exactly which teams beat which teams. From this, we can use various
graph metrics to calculate the best teams. In the drawing above, the best teams have the biggest
circles drawn.
So what does the best coach have to do with any of this? Well, we think that coaches that lead the
highest ranked teams must be doing something right. However, we aren’t just going to give the
highest ranked teams the distinction of having the best coaches. The highest ranked teams will on
average have better coaches, sure, but there’s no reason that the #2 team won’t have a better coach
than the #1 team.
We think that it is pretty safe to say that, if the players of one team are much worse than the
players of one of their opponents, even the best coaching won’t help the team pull off a win.
However, when two teams are evenly matched and they face each other, we think that coaching
actually has a big impact on who wins the game.
Think about it this way: If the players of two teams are evenly matched, then the game is probably
going to be close all the way from the start to the end. The coaches of either team will be the ones
determining who wins the game through things like the timing of player substitutions and time-outs,
as well as effectively raising team morale and calling good plays. Therefore, when two teams that
are evenly matched play each other (we determine ‘evenly matched’ based upon the ranking system
described above), the team that wins will have a better coach than the team that loses.
If we calculate this for any year in a college sport, we will get a list of the top coaches. We can
do this up to 100 years in the past and see which coaches show up highest on the list. These are
then determined, through the power of mathematics, to be the best coaches in said sport for the
last 100 years.
Everything you just read wasn’t just an exercise in thought. We have actually gone through this
entire process, just as it was described to you (it is actually a little more complicated than
this). We were able to come up with the best coaches of the last 100 years for the college sports
of men’s basketball, football, and baseball. These coaches are:
Rank  Basketball      Football             Baseball
1     John Wooden     Glenn Scobey Warner  Mark Marquess
2     Lute Olson      Bobby Bowden         Augie Garrido
3     Jim Boeheim     Jim Grobe            Tom Chandler
4     Gregg Marshall  Bob Stoops           Richard Jones
5     Jamie Dixon     Bill Peterson        Bill Walkenbach
Thanks so much for reading! Agree with our results? Disagree? Feel free to send us an email at:
[email protected]. We’re always happy to hear from our readers.
A Networks and Machine Learning Approach to
Determine the Best College Coaches of the
20th-21st Centuries
Team #30680
February 10, 2014
Team #30680 Page 2 of 18
Contents
1 Problem Statement
2 Planned Approach
3 Assumptions
7 Ranking Coaches
7.1 Top Coaches of the Last 100 Years
9 Conclusions
1 Problem Statement
College sports coaches often achieve widespread recognition. Coaches like Nick
Saban in football and Mike Krzyzewski in basketball repeatedly lead their
schools to national championships. Because coaches influence both the per-
formance and reputation of the teams they lead, a question of great concern to
universities, players, and fans alike is: Who is the best coach in a given sport?
Sports Illustrated, a magazine for sports enthusiasts, has asked us to find the
best all-time college coaches for the previous century. We are tasked with creating
a model that can be applied in general across both genders and all possible
sports at the college level. The solution proposed in this paper offers
insight into this problem and objectively determines the top five coaches of
all time in the sports of baseball, men’s basketball, and football.
2 Planned Approach
Our objective is to rank the top 5 coaches in each of 3 different college-level
sports. We need to determine which metrics reflect most accurately the ranking
of coaches within the last 100 years. To determine the most effective ranking
system, we will proceed as follows:
3. Develop a means by which to decouple the effect of the coach from the
team performance.
4. Create a model that, given the player and coach skills for every team, can
predict the probability of the occurrence of a specific network of a) wins
and losses and b) the point margin with which a win or loss occurred.
3 Assumptions
Due to limited data about the coaching habits of all coaches at all teams over
the last century in various collegiate sports, we use the following assumptions to
complete our model. These simplifying assumptions will be used in our report
and can be replaced with more reliable data when it becomes available.
• The skills of teams are constant throughout any given year (e.g., no players
are injured in the middle of a season). This assumption will allow us to
compare a team’s games from any point in the season to any other point in
the season. In reality, changing player skills throughout the season make
it more difficult to determine the effect of the coach on a game.
• Winning k games against a good team improves team skill more than
winning k games against an average team. This assumption is intuitive
and allows us to use the eigenvector centrality metric as a measure of total
team skill.
• The skill of a team is a function of the skill of the players and the skill
of the coach. We assume that the skill of a coach is multiplicative over
the skill of the players. That is: Ts = Cs · Ps where Ts is the skill of
the team, Cs is the skill of the coach, and Ps is a measure of the skill of
the players. Making coach skill multiplicative over player skill assumes
that the coach has the same effect on each player. This assumption is
important because it simplifies the relationship between player and coach
skill to a point where we can easily optimize coach skill vectors.
• The effect of coach skill is only large when the difference between player
skill is small. For example, if team A has the best players in the conference
and team B has the worst, it is likely that even the best coach would not
be able to, in the short run, bring about wins over team A. However,
if two teams are similarly matched in players, a more-skilled coach will
make advantageous plays that lead to his/her team winning more often
than not.
• When player skills between two teams are similarly matched, coach skill
is the only factor that determines the team that wins and the margin by
which they win. By making this assumption, we do not have to account
for any other factors.
found a number of different websites, each with a portion of the requisite data.
For each of these websites, we created a customized program to scrape the data
from the relevant webpages. Once we gathered all the data from our sources, we
processed it to standardize the formatting. We then aimed to merge the data
gathered from each source into a useable format. For example, we gathered
basketball game results from one source, and data identifying team coaches
from another. To merge them and show the game data for a specific coach, we
attempted to match on common fields (e.g., “Team Name”). Often, however, the
data from each source did not match exactly (e.g., “Florida State” vs “Florida
St.”). In these situations, we had to manually create a matching table that
would allow our program to merge the data sources.
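The matching-table merge described above can be sketched as follows. This is a minimal illustration rather than the scraping and merging program itself: the alias table `TEAM_ALIASES`, the helper names, and the sample records are hypothetical, and only the “Florida State” vs “Florida St.” mismatch comes from the text.

```python
# Hypothetical alias table: maps a source-specific spelling to one canonical name.
TEAM_ALIASES = {
    "Florida St.": "Florida State",
}

def canonical_name(name):
    """Return the canonical team name, falling back to the stripped input."""
    cleaned = name.strip()
    return TEAM_ALIASES.get(cleaned, cleaned)

def merge_games_with_coaches(games, coaches):
    """Attach coach names to game records by matching on (year, canonical team name)."""
    coach_lookup = {
        (row["year"], canonical_name(row["team"])): row["coach"] for row in coaches
    }
    merged = []
    for game in games:
        year = game["year"]
        merged.append({
            **game,
            "winner_coach": coach_lookup.get((year, canonical_name(game["winner"]))),
            "loser_coach": coach_lookup.get((year, canonical_name(game["loser"]))),
        })
    return merged

# Illustrative records only; these are not rows from the scraped data.
games = [{"year": 2000, "winner": "Florida St.", "loser": "Team X"}]
coaches = [
    {"year": 2000, "team": "Florida State", "coach": "Coach A"},
    {"year": 2000, "team": "Team X", "coach": "Coach B"},
]
merged = merge_games_with_coaches(games, coaches)
```

A team with no entry in the lookup receives `None` as its coach, which makes unmatched names easy to find and add to the alias table.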
Although we are seeking to identify the best college coach for each sport
of interest for the last century, it should be noted that many current college
sports did not exist a century ago. The National Collegiate Athletic Association
(NCAA), the current managing body for nearly all college athletics, was only
officially established in 1906 and the first NCAA national championship took
place in 1921, 7 years short of a century ago. Although some college sports
were independently managed before being brought into the NCAA, it is often
difficult to gather accurate data for this time.
ranges from 1949 to the present, and was merged with coach data for the same
time period.
Figure 1: A complete network for the 2009-2010 NCAA Div. I basketball season.
Each node represents a team, and each edge represents a game between the two
teams. Note that, since teams play other teams in their conference most often,
many teams cluster into one of the 32 NCAA Div. I conferences.
weaknesses to this metric. The most prominent of these weaknesses arises from
the fact that, since not every team plays every other team over the course of
the season, some teams will naturally play more difficult teams while others
will play less difficult teams. This is exaggerated by the fact that many college
sports are arranged into conferences, with some conferences containing mostly
highly-ranked teams and others containing mostly low-ranked teams. Therefore,
win/loss percentage often exaggerates the skill of teams in weaker conferences
while failing to highlight teams in more difficult conferences.
another node in the graph - also not particularly relevant in our graphs because
distance between nodes is not related to team skills.
Ax = λx (2)
If we place the restriction that the ranking of each node must be positive,
we find that there is a unique solution for the eigenvector x, where the nth
component of x represents the ranking of node n. There are multiple different
methods of calculating x; most of them are iterative methods that converge on a
final value of x after numerous iterations. One interesting and intuitive method
of calculating the eigenvector x is highlighted below. It has been shown that
the eigenvector x is proportional to the row sums of a matrix S formed by the
following equation [6, 9]:
eigenvector centrality matrix both describe the number of walks of all lengths
weighted inversely by the length of the walk. This explanation is an intuitive
way to describe the eigenvector centrality metric. We utilized NetworkX (a
Python library) to calculate the eigenvector centrality measure for our sports
game networks.
We can apply eigenvector centrality in the context of this problem because
it takes into account both the number of wins and losses and whether those
wins and losses were against “good” or “bad” teams. If we have the following
graph: A → B → C and know that C is a good team, it follows that A is also
a good team because they beat a team who then went on to beat C. This is
an example of the kind of interaction that the metric of eigenvector centrality
takes into account. Calculating this metric over the entire yearly graph, we can
create a list of teams ranked by eigenvector centrality that is quite accurate.
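As a rough illustration of this idea (not the NetworkX routine used for the results in this paper), the sketch below runs a power iteration on a tiny, made-up set of games. The function name `team_centrality`, the sample games, and the choice to iterate on A + I so that the iteration converges are assumptions of this sketch. Each team's score ends up proportional to the summed scores of the teams it beat.

```python
def team_centrality(games, tol=1e-10, max_iter=1000):
    """Right-hand eigenvector centrality on a winner -> loser digraph.

    games: iterable of (winner, loser) pairs.
    Iterates x <- x + A x, where A[i][j] counts wins of team i over team j,
    and rescales after every step so that the scores sum to 1.
    """
    teams = sorted({team for game in games for team in game})
    beaten = {team: [] for team in teams}
    for winner, loser in games:
        beaten[winner].append(loser)

    x = {team: 1.0 / len(teams) for team in teams}
    for _ in range(max_iter):
        new = {t: x[t] + sum(x[loser] for loser in beaten[t]) for t in teams}
        total = sum(new.values())
        new = {t: value / total for t, value in new.items()}
        if max(abs(new[t] - x[t]) for t in teams) < tol:
            return new
        x = new
    raise RuntimeError("power iteration did not converge")

# Made-up season: A beats everyone, but D upsets A once.
games = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
         ("B", "D"), ("C", "D"), ("D", "A")]
scores = team_centrality(games)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy season A ranks first. D has only one win, but because that win is over A, D ends up level with B, which has two wins over weaker teams.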
Below is a table of top ranks from eigenvector centrality compared to the AP
and USA Today polls for a random sample of our data, the 2009-2010 NCAA
Division I Men’s Basketball season. It shows that eigenvector centrality creates
an accurate ranking of college basketball teams. The italicized entries are ones
that appear in the top ten of both eigenvector centrality ranking and one of the
AP and USA Today polls.
As seen in the table above, six out of the top ten teams as determined by
eigenvector centrality are also found on the top ten rankings list of popular polls
such as AP and USA Today. We can see that the metric we have created using
a networks-based model produces results that agree with commonly
accepted rankings. Our team-ranking model has a clear, easy-to-understand
basis in networks-based centrality measures and gives reasonably accurate
results. It should be noted that we chose this approach to ranking teams over
a much simpler approach such as simply gathering the AP rankings for vari-
ous reasons, one of which is that there are not reliable sources of college sport
ranking data that cover the entire history of the sports we are interested in.
Therefore, by calculating the rankings ourselves, we can analyze a wider range
of historical data.
Below is a graph that visualizes the eigenvector centrality values for all
games played in the 2010-2011 NCAA Division I Men’s Football tournament.
Larger and darker nodes represent teams that have high eigenvector centrality
values, while smaller and lighter nodes represent teams that have low eigenvector
centrality values. The large nodes therefore represent the best teams in the
2010-2011 season.
Figure 2: A complete network for the 2012-2013 NCAA Div. I Men’s Basketball
season. The size and darkness of each node represent its relative eigenvector
centrality value. Again, note the clustering of teams into NCAA conferences.
Ts = Cs · Ps, (4)
as Cs of any particular team could be thought of as a multiplier on the player
skill Ps, which results in team skill Ts.
Although the relationship between these factors may be more complex in
real life, this relationship gives us reasonable results and works well with our
model.
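Equation 4 can be inverted to recover player skill once team skill and a candidate coach skill are known, which is what allows the optimization to run over coach skills alone. A minimal sketch, with a hypothetical function name, made-up values, and the added assumption that coach skill is strictly positive:

```python
def player_skill(team_skill, coach_skill):
    """Invert Ts = Cs * Ps to obtain Ps (assumes a strictly positive coach skill)."""
    if coach_skill <= 0:
        raise ValueError("coach skill must be positive")
    return team_skill / coach_skill

# Made-up numbers: a team skill of 0.9 with a coach multiplier of 1.2.
p = player_skill(0.9, 1.2)
```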
[Plot omitted: probability (vertical axis) versus margin of win (horizontal axis).]
Figure 3: A has a high chance of winning when its players are more skilled.
Because the player skills are very imbalanced, the coach skill will likely not
change the outcome of the game. Even if B has an excellent coach, the effect of
the coach’s skill will not be enough to make B’s win likely.
Case two: Player skills approximately equal: If the player skills of the
two teams are approximately evenly matched, the coach skill has a much higher
likelihood of impacting the outcome of the game. When the player skills are
similar for both teams, the Gaussian curve looks like the one shown in Figure 4.
In this situation, the coach has a much greater influence on the outcome of the
game: crucial calls of time-outs, player substitutions, and strategies can make
or break an otherwise evenly matched game. Therefore, if the coach skills are
unequal, so that the Gaussian curve is shifted even slightly, one team will have
a higher chance of winning (even if the margin of win will likely be small).
[Plot omitted: probability (vertical axis) versus margin of win (horizontal axis).]
Figure 4: Neither A nor B is more likely to win when player skills are the
same (if player skill is the only factor considered).
With the assumptions regarding the effect of coach skill given a difference
in player skills, we can say that the effect of a coach can be expressed as:
$$(C_A - C_B) \cdot \frac{1}{1 + \alpha\,|P_A - P_B|} \tag{5}$$
Where CA is the coach skill of team A, CB is the coach skill of team B, PA is
the player skill of team A, PB is the player skill of team B, and α is some scalar
constant. With this expression, the difference in coach skills is diminished if the
difference in player skills is large, and coach effect is fully present when players
have equal skill.
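The damping behaviour of Expression 5 is easy to check numerically. The sketch below is illustrative only: the function name and the skill values are made up, and α = 1 is a placeholder rather than a value taken from the model.

```python
def coach_effect(c_a, c_b, p_a, p_b, alpha=1.0):
    """Expression 5: coach skill difference damped by the player skill gap."""
    return (c_a - c_b) / (1.0 + alpha * abs(p_a - p_b))

even = coach_effect(1.2, 1.0, 0.5, 0.5)      # equal players: full coach difference
lopsided = coach_effect(1.2, 1.0, 0.9, 0.1)  # large player gap: damped difference
```

With equal player skills the denominator is 1 and the full coach difference of 0.2 survives. With a player gap of 0.8 the same coach difference is divided by 1.8.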
$$\prod_{\text{all games}} K \cdot e^{-\frac{1}{E}\left(C \cdot \text{player effect} + D \cdot \text{coach effect} - \text{margin}\right)^{2}}. \tag{7}$$
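Because a product of many small probabilities underflows quickly in floating point, one way to evaluate Equation 7 is through its negative logarithm, which turns the product into a sum. The sketch below is an illustration under stated assumptions, not the code behind the results in this paper. It assumes the player effect is the plain difference P_A − P_B, recovers player skill from Ts = Cs · Ps, and uses placeholder values for the constants C, D, E, K and α (named `c_weight`, `d_weight`, `e_scale`, `k_norm` and `alpha` here).

```python
import math

def negative_log_likelihood(games, team_skill, coach_skill,
                            c_weight=1.0, d_weight=1.0, e_scale=1.0,
                            k_norm=1.0, alpha=1.0):
    """Negative log of Equation 7 for a list of (team_a, team_b, margin) games.

    margin is team_a's score minus team_b's score.
    Assumptions of this sketch: player effect = P_A - P_B, with P = T / C_s.
    """
    cost = 0.0
    for a, b, margin in games:
        p_a = team_skill[a] / coach_skill[a]
        p_b = team_skill[b] / coach_skill[b]
        player_eff = p_a - p_b
        coach_eff = (coach_skill[a] - coach_skill[b]) / (1.0 + alpha * abs(p_a - p_b))
        residual = c_weight * player_eff + d_weight * coach_eff - margin
        # -log(K * exp(-residual^2 / E)) = residual^2 / E - log(K)
        cost += residual ** 2 / e_scale - math.log(k_norm)
    return cost

# Made-up example: two teams of equal team skill that drew level (margin 0).
team_skill = {"A": 1.0, "B": 1.0}
games = [("A", "B", 0.0)]
cost_equal = negative_log_likelihood(games, team_skill, {"A": 1.0, "B": 1.0})
cost_unequal = negative_log_likelihood(games, team_skill, {"A": 2.0, "B": 1.0})
```

For this single made-up game, equal coach skills explain the zero margin exactly, so they receive the lower cost.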
The table above shows the results of running Powell’s method until the
probability function shown in Equation 6 is optimized, for three widely separated
arbitrary years. We have chosen to show the top three coaches per year for the
purposes of conciseness. We will additionally highlight the performance of our
top three outstanding coaches.
John Wooden - UCLA: John Wooden built one of the ‘greatest dynasties
in all of sports at UCLA’, winning 10 NCAA Division I Basketball tournaments
and leading an unmatched streak of seven tournaments in a row from 1967 to
1973 [1]. He won 88 straight games during one stretch.
Jim Boeheim - Syracuse: Boeheim has led Syracuse to the NCAA Tour-
nament 28 of the 37 years that he has been coaching the team [3]. He is second
only to Mike Krzyzewski of Duke in total wins. He consistently performs even
when his players vary - he is the only head coach in NCAA history to lead a
school to four Final Four appearances in four separate decades.
Roy Williams - North Carolina: Williams is currently the head of the
basketball program at North Carolina where he is sixth all-time in the NCAA
for winning percentage [5]. He performs impressively no matter who his players
are - he is one of two coaches in history to have led two different teams to the
Final Four at least three times each.
7 Ranking Coaches
Knowing that we are only concerned with finding the top five coaches per sport,
we decided to only consider the five highest-ranked coaches for each year. To
calculate the overall ranking of a coach over all possible years, we considered
the number of years coached and the frequency with which the coach appeared in the
yearly top five list. That is:
$$C_v = \frac{N_a}{N_c} \tag{9}$$
Where Cv is the overall value assigned to a certain coach, Na is the number
of times a coach appears in yearly top five coach lists, and Nc is the number of
years that the coach has been active. This method of measuring overall coach
skill is especially strong because we can account for instances where coaches
change teams.
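Equation 9 amounts to a count and a division per coach. A minimal sketch, assuming the yearly top-five lists and the years of activity are already available as dictionaries (the function name, coach names, and numbers below are made up):

```python
from collections import Counter

def overall_coach_values(yearly_top_five, years_active):
    """Equation 9: C_v = N_a / N_c for every coach with at least one active year."""
    appearances = Counter(
        coach for top_five in yearly_top_five.values() for coach in top_five
    )
    return {
        coach: appearances[coach] / n_years
        for coach, n_years in years_active.items()
        if n_years > 0
    }

# Made-up inputs: top-five lists for two seasons, shortened to two names each.
yearly_top_five = {1990: ["Coach A", "Coach B"], 1991: ["Coach A", "Coach C"]}
years_active = {"Coach A": 4, "Coach B": 2, "Coach C": 10}
values = overall_coach_values(yearly_top_five, years_active)
```

Keying the counts by coach rather than by team is what lets a coach who changes schools keep a single record.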
decrease slightly with this modification. When we ran and analyzed the results,
we found that the coach skill value did in fact decrease by approximately 1%, as
we expected. However, the Alabama coach maintained his ranking of top coach
for the season.
The second change that we incorporated was to switch the results of the
same game (Alabama 67, Providence 60) to a win for Providence (Providence
67, Alabama 60). We expect this will have a greater negative influence on
the skill value of the Alabama coach, and when we ran the analysis we found
that, indeed, the Alabama coach skill value decreased by approximately 4%.
Although a relatively minor difference, the second-ranked coach originally had
a skill value only very slightly behind the Alabama coach, and the 4% loss in
fact placed the second-ranked coach in the first ranking position.
From this analysis we can see that our model follows our predictions accu-
rately, and that removing factors that add positively to the skill ranking of a
coach is detrimental to their skill value. Although the changes we made were
minor, there is often a lot of competition for the first-place ranking of a coach,
and due to the limited number of games played per season, a change to this
data can have an influence on the final ranking. Although these results indicate
that error in our data can affect the final ranking, the analysis also shows that
our model responds predictably to a variation in the input.
8.2 Strengths
The main strength of our approach is that it is able to separate coach proficiency
from team proficiency by calculating probabilities that the historical game data
occur given coach skills. This allows us to more accurately gauge the skills
of a coach without factoring in the skills of his/her players. Furthermore, our
approach is flexible as many relationships can be modified. For example, if a
study shows that there is a better function to describe the relationship between
coach skill, player skill, and team skill, it can easily be used in our model. Our
model is also able to compare the relative effectiveness of coaches from all time
periods, as long as the average margin of victory is similar across time periods.
8.3 Weaknesses
The main weakness of our model pertains to computational efficiency, as our
computers could not always adequately calculate all necessary values in our
model. For example, the computer could not find the eigenvector centrality
values on a small percentage of the graphs, as the Von Mises iteration failed
to converge. Furthermore, sometimes Powell’s method of minimizing our cost
function yielded high costs relative to other years because the initialized array
of coach skills was close to a local minimum. This could be solved by running
Powell’s method from several randomly initialized coach skills arrays, but this
increases computational time. In fact, the overall results of our model could
likely be improved significantly given more time to run the iterative optimization
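The multi-start remedy mentioned in this section can be sketched as follows. To keep the example self-contained, a simple coordinate search stands in for Powell's method, and the one-dimensional cost function with two minima is invented for illustration; neither is taken from the model in this paper.

```python
import random

def local_search(cost, x0, step=0.5, tol=1e-6):
    """Derivative-free coordinate search; a simple stand-in for Powell's method."""
    x = list(x0)
    best = cost(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                trial_cost = cost(trial)
                if trial_cost < best:
                    x, best, improved = trial, trial_cost, True
        if not improved:
            step /= 2.0  # no better neighbour at this scale: refine the step
    return x, best

def multi_start(cost, dim, n_starts=20, low=-3.0, high=3.0, seed=0):
    """Run local_search from several random starting points and keep the best result."""
    rng = random.Random(seed)
    best_x, best_cost = None, float("inf")
    for _ in range(n_starts):
        x0 = [rng.uniform(low, high) for _ in range(dim)]
        x, c = local_search(cost, x0)
        if c < best_cost:
            best_x, best_cost = x, c
    return best_x, best_cost

def two_well(x):
    """Invented test cost: local minimum near x = 1, lower minimum near x = -1."""
    return (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]

single_x, single_cost = local_search(two_well, [2.0])
best_x, best_cost = multi_start(two_well, dim=1)
```

A single run started at x = 2 settles in the shallower minimum on the positive side, while the multi-start run finds the lower minimum on the negative side, at the price of one local search per starting point.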
9 Conclusions
In this report, we have analyzed nearly 300 years of data in order to determine
the most accurate and unbiased ranking of all-time best college sports coaches.
By constructing comprehensive networks with edges representing each and
every game played in the last century of the college sports that we analyzed,
we were able to create a comprehensive metric of team skill using the concept
of eigenvector centrality. By considering win/loss margins, we were able to
identify patterns that enabled us to separate the team skill measure into its
two components - player skill and coach skill. We then created a probability
function based on player skill and coach skill to determine the likelihood of an
edge in our network occurring. By multiplying this probability across all edges,
we were able to determine the probability of the entire graph occurring given
team skill and coach skill vectors. Using an iterative, multivariable, machine
learning algorithm, we maximized this probability function for coach skill for
each season and for each sport. Using data that mapped the name of a coach
to his/her team for each season, we were able to combine the results of each
individual season and analyze the skill of each individual coach over their entire
coaching history. From this data, we selected the top 5 coaches from every sport
to feature as our all-time best coaches of the century.
References
[1] John Wooden. Retrieved from: http://msn.foxsports.com/collegebasketball/story/John-Wooden-dies-UCLA-coach-99-060410, 2010.
[2] ShrpSports. Retrieved from: http://www.shrpsports.com/, 2011.