Football Analytics
Honors Thesis
2 Tutorials
2.1 Introduction to Markov Chains
2.2 Introduction to Hidden Markov Models
2.2.1 Viterbi Algorithm
3 Procedure
3.1 Main Idea
3.1.1 The States and State Transition Matrix
3.1.2 Emission Probability Distributions
3.2 Parameter Selection
3.3 Testing the Model
4 Results
4.1 What is Success?
4.2 Choosing Bins
4.3 Testing the Chosen Bins on 2017
Abstract
Hidden Markov models allow for analysis of the behavior of an unobservable Markov chain whose states are known but whose sequence is not. Instead of observing the state sequence directly, the best available option is to observe a related observation event whose outcome suggests a given state in a non-deterministic way. This project examines the utility of hidden Markov models in football analytics.
2 Tutorials
2.1 Introduction to Markov Chains
Suppose you are a meteorologist, studying weather behavior in a specific region.
On a given day, one of three possible weather scenarios will occur: sunny, cloudy, or
rainy. In your analysis, you determine that the probability of the weather scenario
for one day depends completely on the weather of the previous day (i.e. as shown
in Figure 1, given today is sunny, the probability that tomorrow is a cloudy day is
fixed with probability .3, a rainy day with probability .1, and another sunny day
with probability .6). As a meteorologist, it is your job to determine the probability
of rain in three days. If the weather is cloudy today, then what is the probability
that it will rain in two days? This is a motivating example for the use of Markov
chains.
Markov chains are a quintessential model in stochastic probability. A Markov
chain is a sequence among a set of discrete states where transitions between states
follow the Markov property. The Markov property requires that the probability of
transition to any state depends only on the current state and not on any previous
states.
Let’s return to the weather illustration. Set the transition probabilities as in
Figure 1.
Figure 1: a visual representation of an example of a Markov chain among the states Sunny, Cloudy, and Rainy
$$P = \begin{bmatrix} .6 & .3 & .1 \\ .2 & .3 & .5 \\ .4 & .1 & .5 \end{bmatrix}$$
In this matrix, the first row captures the probabilities of transitioning away
from sunny, the second captures the transition away from cloudy, and the third
away from rainy. As all these probabilities apply to a one-step transition, this
matrix is often referred to as the one-step transition matrix. Now, what about
a two-step transition matrix?
As previously mentioned, a two-step transition probability is the sum, over all intermediate states x, of the probability of transitioning from state i to x and then from x to the final state j. Notice that this probability is captured in the square of the one-step transition matrix, where each entry is the inner product of the transitions away from state i with the matched transitions into state j. Hence the two-step transition matrix is as follows.
$$P^2 = \begin{bmatrix} .6 & .3 & .1 \\ .2 & .3 & .5 \\ .4 & .1 & .5 \end{bmatrix} \cdot \begin{bmatrix} .6 & .3 & .1 \\ .2 & .3 & .5 \\ .4 & .1 & .5 \end{bmatrix} = \begin{bmatrix} .46 & .28 & .26 \\ .38 & .2 & .42 \\ .46 & .2 & .34 \end{bmatrix}$$
Then, if the weather is cloudy today, the probability that it will rain in two
days is .42.
By extension, the $n$-step transition matrix is simply $P^n$.
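The two-step computation above can be checked with a short sketch in plain Python (no libraries assumed); the matrix and state ordering are those of Figure 1.

```python
# Sketch: verifying the two-step transition matrix P^2 for the
# weather example with plain Python lists.

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# One-step transition matrix; rows/columns ordered sunny, cloudy, rainy
P = [[0.6, 0.3, 0.1],
     [0.2, 0.3, 0.5],
     [0.4, 0.1, 0.5]]

P2 = mat_mul(P, P)

# Entry (cloudy, rainy): probability of rain in two days given cloudy today
print(round(P2[1][2], 2))  # 0.42
```

The $n$-step matrix is `mat_mul` applied $n-1$ times (or, with NumPy available, `numpy.linalg.matrix_power(P, n)`).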
Suppose, in a five-day week, there were events on Tuesday, Wednesday and Friday,
and no such events on Monday and Thursday, as shown in the table below.
The question remains: based on the given weather patterns and conditional
probabilities for sporting events at the park, what is the most likely state sequence?
First, what was the weather on Monday? Well, suppose each weather outcome is equally likely. That is, the initial distribution is $\pi = [\,1/3 \;\; 1/3 \;\; 1/3\,]$. Surely, the best guess of the weather pattern is affected by the fact that no events were played that day, but we can take care of that in a minute.
Denote $a_{ij}$ as the probability of transitioning from state $i$ to state $j$, and $b_j(k)$ as the probability of observing scenario $k$ when in state $j$. Also, denote $S_\ell$ as the state on the $\ell$th step and $O_\ell$ as the observation on the $\ell$th step.
The desired state sequence is $s_1, \cdots, s_n$ such that $P(S_1 \cdots S_n = s_1 \cdots s_n \mid O_1 \cdots O_n)$ is maximized. To that end, define
$$\delta_t(i) = \max_{q_1, \cdots, q_{t-1}} P(q_1 q_2 \cdots q_t = i,\ O_1 \cdots O_t \mid \lambda)$$
where $\lambda$ denotes the fixed model parameters: the state transition matrix and the observation probability distributions.
Notice that, by induction and given the Markov property,
Initialization:
$$\delta_1(i) = \pi_i\, b_i(O_1), \qquad 1 \le i \le N$$
$$\Psi_1(i) = 0$$
Recursion:
$$\delta_t(j) = \max_{1 \le i \le N}\left(\delta_{t-1}(i)\, a_{ij}\right) b_j(O_t), \qquad 2 \le t \le T,\ 1 \le j \le N$$
$$\Psi_t(j) = \operatorname*{arg\,max}_{1 \le i \le N}\left(\delta_{t-1}(i)\, a_{ij}\right), \qquad 2 \le t \le T,\ 1 \le j \le N$$
Termination:
$$P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \operatorname*{arg\,max}_{1 \le i \le N} \delta_T(i)$$
Path Backtracking:
$$q_t^* = \Psi_{t+1}(q_{t+1}^*), \qquad t = T-1, T-2, \cdots, 1$$
$\{q_1^*, q_2^*, \cdots, q_T^*\}$ is the most likely state sequence.
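The recursion above can be written as a short routine. The sketch below is an illustrative Python reimplementation, not the thesis's MATLAB code; states and observations are represented by plain indices.

```python
# A minimal Viterbi sketch following the recursion above, using
# plain Python lists.

def viterbi(pi, A, B, obs):
    """Return the most likely state sequence q* for an observation sequence.

    pi  : initial distribution, pi[i] = P(S_1 = i)
    A   : transition matrix, A[i][j] = P(S_{t+1} = j | S_t = i)
    B   : emission matrix, B[j][k] = P(O_t = k | S_t = j)
    obs : observation indices O_1, ..., O_T
    """
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]

    # Initialization: delta_1(i) = pi_i * b_i(O_1), Psi_1(i) = 0
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]

    # Recursion: delta_t(j) = max_i(delta_{t-1}(i) * a_ij) * b_j(O_t),
    # with Psi_t(j) recording the argmax for backtracking
    for t in range(1, T):
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best
            delta[t][j] = delta[t - 1][best] * A[best][j] * B[j][obs[t]]

    # Termination and path backtracking
    q = [0] * T
    q[-1] = max(range(N), key=lambda i: delta[-1][i])
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1][q[t + 1]]
    return q
```

For long sequences, a log-space version of the same recursion avoids floating-point underflow; raw products are kept here for clarity.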
Initialization:
$$\delta_1(1) = \pi_1 b_1(O_1) = \frac{1}{3} \cdot \frac{1}{5} = \frac{1}{15}$$
$$\delta_1(2) = \pi_2 b_2(O_1) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$$
$$\delta_1(3) = \pi_3 b_3(O_1) = \frac{1}{3} \cdot \frac{4}{5} = \frac{4}{15}$$
Recursion, step 1:
$$\delta_2(1) = \left(\max_i \left(\delta_1(i)\, a_{i1}\right)\right) b_1(O_2) = \max\left\{\frac{1}{15} \cdot \frac{3}{5},\ \frac{1}{6} \cdot \frac{1}{5},\ \frac{4}{15} \cdot \frac{2}{5}\right\} \cdot \frac{4}{5} = \frac{8}{75} \cdot \frac{4}{5} = \frac{32}{375}$$
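The arithmetic in the worked example can be verified exactly with Python's `fractions` module; the emission values below are those of the example ($b(\text{no event}) = [1/5,\ 1/2,\ 4/5]$ for sunny, cloudy, rainy, and $b_1(\text{event}) = 4/5$).

```python
# Numeric check of the worked Viterbi example above.

from fractions import Fraction as F

pi = [F(1, 3)] * 3
b_O1 = [F(1, 5), F(1, 2), F(4, 5)]    # emission probs for O_1 (no event)
a_col1 = [F(3, 5), F(1, 5), F(2, 5)]  # a_11, a_21, a_31
b1_O2 = F(4, 5)                       # b_1(O_2), O_2 = event

# Initialization: delta_1(i) = pi_i * b_i(O_1)
delta1 = [pi[i] * b_O1[i] for i in range(3)]
assert delta1 == [F(1, 15), F(1, 6), F(4, 15)]

# Recursion, step 1: delta_2(1)
delta2_1 = max(delta1[i] * a_col1[i] for i in range(3)) * b1_O2
print(delta2_1)  # 32/375
```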
3 Procedure
3.1 Main Idea
As previously stated, hidden Markov models are useful when a user is interested
in an unobservable system whose transition behavior resembles a Markov chain.
While the system itself is unobservable, however, there exists an observable emission that provides insight into which state occurred, albeit unobserved, behind the scenes.
The unobservable instance, the “weather pattern” in the previous example, is the event of a win or a loss. We consider the observable instance, the “amusement park attendance” in the previous example, to be the predicted game statistics.
This structure leverages the idea that game statistics can hint at game outcomes.
The Viterbi algorithm can then be run on the stat lines from the games a team has played in a season to find the best-guess state sequence, that is, the best-guess sequence of wins and losses, and thereby deduce the most likely record for the season.
While it would be ideal to map each individual stat line to its own probability, that level of granularity is not feasible when using historical data. Instead, the emission distributions were constructed using bins. Generally, suppose $n$ different statistical factors were used in each stat line, with stat $i$ containing $s_i$ bins. Then, there are $s_1 \cdots s_n$ total “mega-bins,” or total inputs, in the distributions. It is helpful to think of these “mega-bins” as entries in an $n$-dimensional matrix, with dimensions $s_1 \times \cdots \times s_n$. Then, each game
would result in the incrementing of one entry of this matrix by one. It is important
to note that the distributions were constructed by counting the incidence of each
statistical mega-bin in the historical data, rather than the sum of the incidences
of each individual bin. The chosen method is preferred because it maintains the
dependence among the different statistical factors. With this structure, it remains
to traverse all games in the historical sample and count the number of occurrences
of each bin when a win occurs (or a loss for the loss distribution), then divide all
bin counts by the total number of wins (or losses).
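The mega-bin counting procedure can be sketched for a hypothetical two-stat case. The bin edges below are borrowed from the first row of Table 2 and the sample games from the first rows of Table 1; everything else is illustrative, not the thesis's actual pipeline.

```python
# Sketch of the "mega-bin" emission-distribution construction for a
# hypothetical two-stat example (passing yards, rushing yards).

from bisect import bisect_right

pass_edges = [175, 225]   # 3 passing bins: <175, 175-225, >225
rush_edges = [100, 150]   # 3 rushing bins: <100, 100-150, >150

# (passing, rushing, won?) for a handful of sample games
games = [(236, 65, True), (254, 30, True), (153, 58, True),
         (189, 93, False), (201, 57, False), (145, 72, False)]

counts = [[0] * 3 for _ in range(3)]  # s1 x s2 matrix of win counts
wins = 0
for p, r, won in games:
    if won:
        # Increment the single mega-bin this stat line falls into
        counts[bisect_right(pass_edges, p)][bisect_right(rush_edges, r)] += 1
        wins += 1

# Divide every mega-bin count by the total number of wins
win_dist = [[c / wins for c in row] for row in counts]
```

Counting joint (passing, rushing) occurrences, rather than each stat separately, is what preserves the dependence among the statistical factors; the loss distribution is built the same way over losses.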
Once this bin counting method is established, a new question arises: what is
the best way to normalize for differing bin sizes? Using normal histogram practice,
this normalization involves dividing each bin’s count by its n-dimensional volume,
and then dividing every bin by the total sum of all bin counts. A complication arises for the tail bins of a statistical factor, that is, bins associated with the event that a stat is not bounded on one side. In this case, it is safe to create a de facto bound for the purposes of making sense of the volume of that bin.
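The normalization can be illustrated in one dimension. The edges below use the passing delimiters from Table 2's second row with the de-facto caps of 0 and 550 discussed in section 4.2; the counts are made up for illustration.

```python
# Sketch of normalizing unequal bin sizes in one dimension: divide each
# count by the bin's width (its 1-D "volume"), then by the total count,
# so the result is a density.

edges = [0, 140, 175, 235, 550]  # de-facto bounds close the tail bins
counts = [4, 10, 12, 6]          # illustrative bin counts

widths = [edges[i + 1] - edges[i] for i in range(len(counts))]
total = sum(counts)
density = [c / w / total for c, w in zip(counts, widths)]

# The density integrates to 1 over the binned range
print(round(sum(d * w for d, w in zip(density, widths)), 10))  # 1.0
```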
3.3 Testing the Model
The purpose of constructing the model is to produce a most-likely state sequence, that is, a win-loss sequence, given a certain sequence of game statistics. We built a MATLAB routine that takes in the transition matrix and the emission probability distributions for wins and losses (the processes for which are explained in the previous subsection), plus the game statistics of a team of choice, and returns the most likely state sequence using the Viterbi algorithm. For variety, we ran the test on many different teams to see how the model performed. Below is an example of the results from one team's season.
Table 1: Model Results for Team X Season X
pown rown pald rald todiff fddiff pw pl states pred actual
236 65 316 64 3 -4 0.0104 0.0020 7 w w
254 30 263 83 2 -5 0.0104 0.0020 8 w w
153 58 306 105 3 -5 0.0180 0.0167 11 w w
262 104 339 78 2 4 0.0190 0.0027 11 w w
255 96 214 59 1 6 0.0011 0.0000 11 w w
189 93 239 101 0 3 0.0051 0.0195 4 l l
201 57 403 158 0 -4 0.0009 0.0095 3 l l
259 78 311 97 1 4 0.0104 0.0020 9 w l
284 47 388 128 0 -1 0.0035 0.0108 4 l l
145 72 290 135 1 -8 0.0072 0.0177 3 l w
224 82 308 94 -1 -1 0.0018 0.0083 2 l l
231 87 264 140 1 8 0.0017 0.0006 2 l l
292 85 315 85 -1 4 0.0023 0.0039 2 l w
248 34 411 161 -3 -7 0.0000 0.0029 2 l l
353 93 348 40 -2 3 0.0006 0.0064 2 l l
250 124 323 183 4 2 0.0048 0.0023 9 w w
The first six columns show the six statistical data values for each game that
were used in the model. The next two columns show the probabilities of the stat
lines falling into the specific bin, conditioning on a win or loss. These probabilities
were extracted from the empirically constructed emission probability distributions
that were discussed in section 3.1.2. The Viterbi algorithm then takes into account the distribution values and the transition probabilities from the current state to decide the new state, shown in the ninth column. As mentioned, entering a state greater than or equal to seven implies a prediction of a win, while any state less than or equal to six implies a prediction of a loss, as shown in the tenth column. In the eleventh column, the actual game result is displayed for comparison.
4 Results
4.1 What is Success?
In order to understand the efficacy of the model, it is first important to establish
metrics that illustrate it. First and foremost is the notion of the overall predictive
success rate. This success rate is determined by running the model on every team’s
season in a year and computing the proportion of games that the model predicted
correctly.
While the success rate is the most useful metric for determining the efficacy
of the model, it is also important to consider the efficacy of the model compared
to the historical data. More specifically, it must be examined whether it is any better to use the hidden Markov model with the underlying transition probabilities than to simply choose the game winner by comparing the empirically constructed conditional probabilities of a win or loss (i.e. the seventh and eighth columns in Table 1). How often did the Viterbi algorithm
even “go against the grain” by choosing the opposite outcome of the strategy of
simply choosing based on the distributions? If and when this occurred, what was
the success rate of the model? With this consideration, two new metrics can
be established, the chain override rate and chain override success rate, aimed at
answering those two critical questions.
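The two metrics can be sketched as a short computation. The `pw`/`pl` values below are taken from rows of Table 1, while the model predictions and outcomes are illustrative; `override_metrics` is a hypothetical helper, not a function from the thesis code.

```python
# Sketch of the chain override rate and chain override success rate.
# "pdf_pick" is what the emission distributions alone would choose
# ('w' whenever P(bin | win) > P(bin | loss)).

def override_metrics(pw, pl, model_pred, actual):
    pdf_pick = ['w' if a > b else 'l' for a, b in zip(pw, pl)]
    overrides = [i for i, p in enumerate(model_pred) if p != pdf_pick[i]]
    override_rate = len(overrides) / len(model_pred)
    hits = sum(1 for i in overrides if model_pred[i] == actual[i])
    override_success = hits / len(overrides) if overrides else 0.0
    return override_rate, override_success

pw = [0.0104, 0.0180, 0.0051, 0.0104]
pl = [0.0020, 0.0167, 0.0195, 0.0020]
model_pred = ['w', 'w', 'l', 'l']   # the chain overrides the pdfs on game 4
actual = ['w', 'w', 'l', 'l']

print(override_metrics(pw, pl, model_pred, actual))  # (0.25, 1.0)
```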
three passing / three rushing, two rushing / three passing). The “best” bins were a
judgement call based on how they performed in the three factors mentioned above.
The four are given below.
Table 2: Model Results for the “Best” Bin Performances from Each Category over
2016
pass rush todiff fddiff opsr cosr cor w_0s l_0s
[175 225] [100 150] [1 2] [6 20] .76 .76 25 437 456
[140 175 235] [65 90 130] [1 3] [6 20] .76 .71 28 1754 1801
[125 275] [80 90 130] [1 4] [6 20] .76 .73 30 885 856
[140 175 275] [75 150] [1 3] [6 15] .75 .73 30 904 919
The passing metric is automatically capped at 0 and 550 as lower and upper bounds. If a game goes over 550 or under 0 (believe it or not, that has actually happened), it is included in the highest or lowest bin, respectively.
In the same way, the rushing metric is automatically capped at 0 and 225. The
turnover and first down differential metrics were treated slightly differently. The
vectors displayed in Table 2 denote the inner and outer delimiters of a metric that
is assumed to be symmetric. In particular, [a b] means that the lower and upper bounds are −b and b, with delimiters at −a and a. Once again, both passing and
both rushing metrics are assumed to follow the same bins.
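The symmetric shorthand can be expanded mechanically. Only the two-element form [a b] is described in the text; the multi-delimiter generalization below is an assumption for illustration, and `symmetric_edges` is a hypothetical helper name.

```python
# Sketch of expanding the symmetric [a b] shorthand from Table 2 into
# explicit bin edges: bounds at -b and b, delimiters at -a and a.

def symmetric_edges(spec):
    """[a b] -> [-b, -a, a, b]; assumed to generalize to [a1 ... ak b]."""
    *inner, bound = spec
    return [-bound] + [-a for a in reversed(inner)] + inner + [bound]

print(symmetric_edges([1, 2]))   # [-2, -1, 1, 2]
print(symmetric_edges([6, 20]))  # [-20, -6, 6, 20]
```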
It is important to note that it was considered fair to include 2016 data in the
construction of the pdf. One result of this is that the number of zeros in the win
and loss pdfs decreased modestly, which is healthy for the model. Additionally, this
data increase could be the reason why two of the four metrics were able to surpass
.80 in overall predictive success rate. While this is possibly the case, it is unlikely
because these bins still performed well in 2017 without the inclusion of 2016 data.
The chain override success rate achieved disappointing results, possibly suggesting
that there may not be a high correlation between results across a time series for
this metric. Nonetheless, the mean remained strong in 2017 which further validates
that the model is more successful than simply using the pdfs alone. Based on this
output, the out-of-sample results are encouraging and speak to the efficacy of the
model as a whole.
points than many other team sports. A more subtle effect of the 16-game, week-by-week season is the lack of long streaks. Because the chain relies on the presence of streaks in order to occasionally override the pdfs, the scarcity of long streaks in an NFL season possibly decreases the efficacy of the model.
Overall, it can be concluded that hidden Markov models are useful for modeling
a football season. With some future work, the model can be further improved into a respectable predictive device.