This work arises from a thesis proposal launched by the company Deltatre, whose
objective is to extract key performance indicators of a team or a player through a
Markov chain model, starting from the data collected on a football competition.
Each match is modeled as a Markov chain with suitable states and transitions.
Markov chain theory is used to create the model, while other mathematical
tools such as the chi-square distribution and confidence intervals are used to check
the goodness of the results. The creation of statistics is inspired by the expected
threat theory introduced by Karun Singh.
The data preprocessing begins with understanding the data available and those
useful for the analysis: the most important are the ball position and the player and
the team in possession of the ball, but other information is also used, such as the
type of event and the phase of the match.
Then the states defining the model are chosen, consisting of field areas plus two
additional states, the goal and the lost ball; the number of field states depends
on the subdivisions along both sides of the field, creating an m × n grid which can
be represented using blurred and defined heatmaps.
Once the states are defined, a transition matrix, or rather its estimate, is needed
to complete the Markov chain model; this is done by computing the frequencies
matrix, whose entries count how many times the ball makes a transition from one
state to another, and then normalizing it to obtain a probabilities matrix, with all
row sums equal to 1.
The transition matrices can be computed taking data from a single match or from
the entire tournament; in the latter case the matrix is also used to compute the
stationary distribution of the Markov chain.
But these results come from estimates, so it is critical to explore their robustness
through the Goodman method, obtaining heatmaps of absolute and relative
confidence interval amplitudes.
In the last chapter the idea of Karun Singh is applied to the model, obtaining
the expected threat related to the tournament edition.
It is then used to find the dominance of teams during the match, which can be
done in different ways and gives a dynamic and clear picture of the behaviour of
the match.
Some possibilities are to group or to weight different values of the expected threat
with respect to established criteria, for example subdividing the match into minutes
or actions and summing the expected threat over these parts.
The idea of dominance is also applied to players, considering their expected threat
both during a match and during the tournament.
Then, to make player performances fairer to compare, two new improvements are
introduced: a normalization for minutes played, so that those who play more
minutes or matches are not advantaged over the others, and a contribution of
players to the expected threat, computing their gains not just based on where they
touch the ball, but also on where they pass or steal it, thus rewarding also less
offensive players.
Introduction
About company
Born in 1986 from an idea of Giampiero Rinaudo and Luca Marini, Deltatre is the
world’s leading sports and entertainment technology provider, offering graphics,
data, OTT and live broadcast solutions.
It counts more than a thousand employees in offices spread across 19 cities around
the world and it has received more than 200 awards in its history.
It works mostly with international clients doing business in several sports,
supporting them in all the key steps of a process.
Taking a match as an example, its contribution starts with finding useful statistics
and information to introduce the teams or the players before the start.
It continues during the event or its break, e.g. showing highlights or real-time
analysis of the action just taken.
It finishes with the data collection and analysis after the match, providing
statistics and insights on it.
Contents
Introduction 1
About company . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Aim of the work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Theoretical prerequisites 4
1.1 Markov chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Formal definition . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.4 Assumptions and other properties . . . . . . . . . . . . . . . 5
1.1.5 Embedded Markov chain . . . . . . . . . . . . . . . . . . . . 6
1.2 Probabilities robustness . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Chi-square distribution . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 Quesenberry and Hurst method . . . . . . . . . . . . . . . . 9
1.2.4 Goodman method . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Expected threat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Results 13
2 Preprocessing 13
2.1 Data understanding . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Marks file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Position extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Single match . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Field subdivision . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Single team in a match . . . . . . . . . . . . . . . . . . . . . 17
2.4 Results representation . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Single team in all matches of a tournament . . . . . . . . . 20
2.4.2 Grid refining . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Markov chain 23
3.1 Transition matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Frequencies matrix . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 Probabilities matrix . . . . . . . . . . . . . . . . . . . . . . 24
3.1.3 Total tournament edition . . . . . . . . . . . . . . . . . . . 24
3.2 Stationary distribution . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Application of robustness to probabilities matrix . . . . . . . . . . 27
3.4 Application of robustness to stationary distribution . . . . . . . . . 30
4 Application of expected threat to probabilities matrix 31
4.1 Comparison between expected threat and lost probabilities . . . . . 34
4.2 Match dominance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.1 Both teams in a match . . . . . . . . . . . . . . . . . . . . . 35
4.2.2 Subdivision by actions . . . . . . . . . . . . . . . . . . . . . 40
4.3 Player dominance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Total tournament edition . . . . . . . . . . . . . . . . . . . 42
4.3.2 Normalization for minutes played . . . . . . . . . . . . . . . 43
4.3.3 Expected threat contribution . . . . . . . . . . . . . . . . . 45
A Code 48
A.1 Event extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
A.2 Event validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
A.3 Function is_goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
A.4 Function is_consequential . . . . . . . . . . . . . . . . . . . . . . . 49
A.5 States assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.6 Function find_states . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.7 If condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.8 Function field_statistics . . . . . . . . . . . . . . . . . . . . . . . . 50
A.9 Function plot_statistics_on_field . . . . . . . . . . . . . . . . . . . 51
A.10 All tournament matches . . . . . . . . . . . . . . . . . . . . . . . . 52
A.11 Transition matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A.12 Function prob_origin_matrix . . . . . . . . . . . . . . . . . . . . . 53
A.13 Total tournament edition . . . . . . . . . . . . . . . . . . . . . . . 54
A.14 Stationary distribution . . . . . . . . . . . . . . . . . . . . . . . . . 54
A.15 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 55
A.16 Matrix of confidence intervals amplitude . . . . . . . . . . . . . . . 55
A.17 Matrix of relative amplitudes . . . . . . . . . . . . . . . . . . . . . 56
A.18 Absolute and relative confidence intervals amplitude . . . . . . . . 56
A.19 Function exp_thr . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A.20 Function find_states_match . . . . . . . . . . . . . . . . . . . . . . 57
A.21 If condition (event by event) . . . . . . . . . . . . . . . . . . . . . . 57
A.22 Function field_statistics_match . . . . . . . . . . . . . . . . . . . . 58
A.23 xT computing: initial idea . . . . . . . . . . . . . . . . . . . . . . . 59
A.24 xT computing: weighted sum idea . . . . . . . . . . . . . . . . . . 59
A.25 xT computing: time segments idea . . . . . . . . . . . . . . . . . . 60
A.26 Function act_dur . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
A.27 Cumulative xT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.28 Expected threat player . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.29 Role extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.30 Bar chart match . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.31 Expected threat player tournament . . . . . . . . . . . . . . . . . . 63
A.32 Bar chart tournament . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.33 Normalization by minutes on the match . . . . . . . . . . . . . . . 65
A.34 Normalization by minutes on the tournament . . . . . . . . . . . . 66
A.35 Function gained_exp_thr_player . . . . . . . . . . . . . . . . . . . 68
A.36 Expected threat contribution . . . . . . . . . . . . . . . . . . . . . 69
B Images 70
Chapter 1
Theoretical prerequisites
so the probability of moving to the next state depends only on the present state
and not on the previous history.
It holds if conditional probabilities are well defined, that is
P(X1 = x1 , . . . , Xn = xn ) > 0.
P is the transition matrix of the process (previously evaluated at the entry (xn, x),
that is the transition from state xn to state x), which together with a state space
S and an initial state i0, or a distribution π0 across the state space, uniquely
identifies a Markov chain. [Wike]
1.1.2 States
The state space of the chain S is a finite set including the possible values of Xi .
Given two states i and j, j is reachable from i (i → j) if ∃ m ≥ 0 s.t.
p(m)(i, j) > 0, while they are called communicating if i → j and j → i.
If all states of the space are mutually reachable, the Markov chain is irreducible.
Given the finite cardinality of the state space, if the Markov chain is irreducible
and aperiodic, the Perron–Frobenius theorem states that there is a unique stationary
distribution π and "P^k converges to a rank-one matrix in which each row is the
stationary distribution π, that is

lim_{k→∞} P^k = 1π,
1.1.3 Transitions
The changes of state of the system are called transitions and the probabilities
associated with them are called transition probabilities.
If the state space is finite, they can be included in the transition matrix, such
that pij = P(Xn+1 = j | Xn = i).
"Since each row of P sums to one and all elements are non-negative, P is a right
stochastic matrix". [Wike]
Usually in real cases the transition matrix is not given, because it is really
difficult to know the exact transition probabilities from one state to another.
It is then necessary to estimate them, which is a relatively straightforward
process if the sequence of states can be observed for each individual transition.
In general, denoting by nij the number of observations in state i at t − 1 and in
state j at t, the estimated probability of going from i to j is

pij = nij / Ni,

that is nothing but the proportion of observations that started in state i and ended
in state j over all those that started in state i. [Jon05]
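As a sketch, this estimation can be reproduced in a few lines of Python (the function name and the toy state sequence are illustrative, not the thesis code):

```python
import numpy as np

def estimate_transition_matrix(states, n_states):
    """Estimate p_ij = n_ij / N_i from an observed sequence of states."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(states[:-1], states[1:]):  # consecutive pairs are transitions
        counts[i, j] += 1
    N = counts.sum(axis=1, keepdims=True)      # N_i: observations leaving state i
    # rows of states never left stay at zero instead of dividing by zero
    probs = np.divide(counts, N, out=np.zeros_like(counts), where=N > 0)
    return counts, probs

counts, P = estimate_transition_matrix([0, 1, 0, 1, 2, 0, 1], n_states=3)
```

Every visited row of P sums to 1, as required for a right stochastic matrix.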
• time-homogeneous: P(Xn+1 = x | Xn = y) = P(Xn = x | Xn−1 = y) = pyx ∀n,
so the probability of the transition is independent of n.
"If the Markov chain is time-homogeneous, then the transition matrix P is the
same after each step, so the m-step transition probability can be computed as the
m-th power of the transition matrix, P m ". [Wike]
Precisely in this case, taking an instant n,

P(Xn+m = x | Xn = y) = (P^m)yx.
The trajectory of a Markov chain is the set of realizations of the stochastic process;
given the initial distribution it is possible to compute the probability of obtaining
a certain trajectory i1, i2, . . . , in, that is

P(X1 = i1, . . . , Xn = in) = π0(i1) pi1i2 · · · pin−1in.
The one-step transition probability matrix of the EMC, T, has entries tij representing
the conditional probabilities of transitioning from state i into state j.
"One way to find these conditional probabilities is computing
( qij
P if i ̸= j
tij = k̸=i qik
0 otherwise.
T = I − (diag(Q))−1 Q,
where I is the identity matrix, and diag(Q) is the diagonal matrix formed by
selecting the main diagonal from the matrix Q and setting all other elements to
zero". [Wike]
To compute the stationary probability distribution vector, one should find φ
such that

φT = φ,   φi > 0 ∀i ∈ S,   ∥φ∥1 = 1,

and then

π = −φ(diag(Q))⁻¹ / ∥φ(diag(Q))⁻¹∥1.
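These formulas can be checked numerically; the generator matrix Q below is a small hypothetical example (rows summing to zero), not data from the thesis:

```python
import numpy as np

# Hypothetical generator matrix of a continuous-time chain (rows sum to 0)
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 2.0,  2.0, -4.0]])

# Jump matrix of the embedded chain: T = I - diag(Q)^{-1} Q
T = np.eye(len(Q)) - np.linalg.inv(np.diag(np.diag(Q))) @ Q

# phi: stationary distribution of the EMC, i.e. left eigenvector of T
# associated with eigenvalue 1, normalized to sum 1
w, v = np.linalg.eig(T.T)
phi = np.real(v[:, np.argmin(np.abs(w - 1))])
phi = phi / phi.sum()

# Stationary distribution of the original chain:
# pi = -phi diag(Q)^{-1} / || phi diag(Q)^{-1} ||_1
unnorm = -phi @ np.linalg.inv(np.diag(np.diag(Q)))
pi = unnorm / np.abs(unnorm).sum()
```

As a sanity check, T has zero diagonal and unit row sums, and pi annihilates the generator, pi Q = 0.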
The formal definition is that if Z1, . . . , Zk are independent, standard normal
random variables, then the sum of their squares is distributed according to the
chi-square distribution with k degrees of freedom, usually denoted as

Q = Σ_{i=1}^{k} Zi² ∼ χ²(k) or χ²k.
This distribution has a positive integer parameter k which specifies the number
of degrees of freedom (that is the number of random variables Zi being summed).
[Moo74]
"The chi-square distribution is a special case of the gamma distribution and
it is one of the most widely used probability distributions in inferential statistics,
notably in hypothesis testing and in construction of confidence intervals". [Moo74]
"A chi-square test is a statistical hypothesis test that is valid to perform when
the test statistic is chi-squared distributed under the null hypothesis, specifically
Pearson’s chi-squared test and variants thereof.
Pearson’s chi-squared test is used to determine whether there is a statistically
significant difference between the expected frequencies and the observed frequencies
in one or more categories of a contingency table". [Wikc]
Suppose that n observations in a random sample from a population are classified
into k mutually exclusive classes with respective observed numbers xi, for i =
1, 2, . . . , k, and a null hypothesis gives the probability pi that an observation falls
into the i-th class.
So we have the expected numbers mi = n · pi ∀i, where

Σ_{i=1}^{k} pi = 1,   Σ_{i=1}^{k} mi = Σ_{i=1}^{k} n · pi = n.
Pearson in [Pea00] proposed that, assuming the correctness of the null hypothesis,
as n → ∞ the limiting distribution of the quantity below is χ²-distributed:

X² = Σ_{i=1}^{k} (xi − mi)²/mi = Σ_{i=1}^{k} xi²/mi − n.
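Both forms of the statistic can be verified on a small hypothetical sample (the counts and probabilities below are made up for illustration):

```python
import numpy as np

x = np.array([18, 22, 60])       # observed counts x_i over k = 3 classes
p = np.array([0.2, 0.2, 0.6])    # class probabilities under the null hypothesis
n = x.sum()                      # n = 100 observations
m = n * p                        # expected counts m_i = n * p_i

X2 = ((x - m) ** 2 / m).sum()    # first form of Pearson's statistic
X2_alt = (x ** 2 / m).sum() - n  # equivalent second form
```

Both expressions give the same value (here 0.4), to be compared against a χ² quantile with k − 1 degrees of freedom.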
The number γ, whose typical value is close to but never greater than 1, is sometimes
given in the form 1 − α (or as a percentage 100 · (1 − α)%), where α is a
small positive number, often 0.05." [Dek05]
So taking the 95% confidence interval as an example, a confidence interval can
be interpreted in two different ways:
• in terms of a long-run frequency in repeated samples: "Were this procedure
to be repeated on numerous samples, the proportion of calculated 95%
confidence intervals that encompassed the true value of the population
parameter would tend toward 95%." [CD74]
The parameter A is the upper α × 100-th percentile point of the chi-square
distribution with k − 1 degrees of freedom, while Ni is the sample size of state i
and nij is the observed cell frequency of the transition from state i to state j.
These confidence limits πij−, πij+ are derived simply as the two solutions of the
following quadratic equation in πij:

(nij/Ni − πij)² = A πij(1 − πij)/Ni,   i, j = 1, 2, ..., k.
The article does not explain the provenance of the formula, but it is clearly
related to the use of a chi-square distribution for the binomial test; "for large
samples the binomial distribution is well approximated by convenient continuous
distributions as the normal or precisely the chi-square and these are used as the
basis for alternative tests that are much quicker to compute, such as Pearson's
chi-square test". [Wika]
So, given that a binomial distribution can be asymptotically approximated with a
normal distribution, it holds that

Z = (nij − πij Ni) / √(Ni πij(1 − πij)),   Z ∼ N(0, 1),

χ² = (nij − πij Ni)² / (Ni πij(1 − πij)).
Then, doing some manipulations, the quadratic equation with χ² in place of A is
obtained:

χ² = (nij − πij Ni)² / (Ni πij(1 − πij))  ⟹  χ² πij(1 − πij) = Ni² (nij/Ni − πij)² / Ni

⟹  χ² πij(1 − πij) / Ni = (nij/Ni − πij)².
Since the solutions πij−, πij+ have to be related to the 1 − α confidence interval,
χ² must be substituted by A as the upper α × 100-th percentile point.
When Ni → ∞ the fraction of cases included in the confidence interval will
be at least 1 − α.
But there are some differences with respect to different values of k: for k = 2
that fraction is exactly 1 − α (and it coincides with that of the usual large-sample
confidence interval for the parameter of a binomial distribution), but for k > 2
the fraction will be greater than 1 − α, so it becomes just a lower bound, not the
exact value.
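A minimal sketch of these limits, solving the quadratic equation directly (the function name and the sample numbers are hypothetical; scipy is assumed available for the chi-square percentile):

```python
import numpy as np
from scipy.stats import chi2

def qh_interval(n_ij, N_i, k, alpha=0.05):
    """Solutions in pi of (n_ij/N_i - pi)^2 = A pi (1 - pi) / N_i,
    with A the upper alpha point of the chi-square with k - 1 df."""
    A = chi2.ppf(1 - alpha, k - 1)
    p_hat = n_ij / N_i
    # expand to a pi^2 + b pi + c = 0
    a = 1 + A / N_i
    b = -(2 * p_hat + A / N_i)
    c = p_hat ** 2
    disc = np.sqrt(b * b - 4 * a * c)
    return (-b - disc) / (2 * a), (-b + disc) / (2 * a)

low, high = qh_interval(n_ij=30, N_i=100, k=20)  # interval around 0.3
```

The discriminant is always non-negative here, so the two real solutions always exist and bracket the point estimate nij/Ni.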
• reward individual player actions
• operate on event-level data
• reward actions independent of the end outcome of the possession
• reward moving the ball not just into positions with high goal probability,
but also into "threatening" positions that can in turn lead to dangerous
positions with high likelihood.
After making these modelling assumptions, he wanted to "assign a threat
value to every location on the pitch" [Sin19]; it is not a new idea in football
analytics, but the novelty lies in the way it is done.
Field areas are identified by a pair (x, y), where x is related to the long side of
the field and y to the short one; so it is a sort of pair of discrete Cartesian
coordinates with the origin of the system in the bottom left corner.
Karun selected for every zone (x, y) four attributes:
• move probability mx,y : when a player has possession in zone (x, y), it is
how often he opts to move (i.e. pass or dribble) the ball as next action.
• shoot probability sx,y : when a player has possession in zone (x, y), it is
how often he opts to shoot as next action.
• move transition matrix Tx,y : in the cases where the player moves from
zone (x, y), it is the probability that he moves to each of the other zones
(z, w).
It is quite different from the usual transition matrices, which have starting
states on rows and ending states on columns: here the starting state (x, y) is
fixed outside the matrix, whose rows index the possible values of z and whose
columns those of w.
So the entries sum to 1 not row by row, as usual, but over the whole matrix.
• goal probability gx,y : when the player shoots from zone (x, y), it is the
probability that the shot turns into a goal.
Then from a position (x, y), the first possibility is to shoot and score, with
probability sx,y × gx,y; the second one is to move the ball following the move
transition matrix and then consider the threat from the new position (z, w).
This reasoning is encoded in the iterative formula

xTx,y = (sx,y × gx,y) + mx,y × Σ_{z=1}^{m} Σ_{w=1}^{n} T(x,y)→(z,w) xTz,w,

where m and n are the dimensions of the long and short side of the field respectively,
while xTx,y is precisely the expected threat of the zone (x, y), the new KPI
including all possible sources of threat.
It is clear that computing the expected threat by directly solving the iterative
formula seems quite hard due to the presence of expected threat values on both
sides of the equation; so Karun took advantage of the iterative formula itself
to find a practical, neat workaround.
The initial condition is
xTx,y = 0 ∀x = 1, . . . , m, y = 1, . . . , n
so at the first iteration the expected threat is nothing but the expected goals
xGx,y, another KPI largely used in football analytics.
Applying this formula iteratively, Karun "found 4-5 iterations to be sufficient
for reasonable convergence, though this may vary based on your dataset."[Sin19]
Of course the more steps are made, the more the combinations of possible actions
grow; moreover, another interpretation of the expected threat after n iterations
is the probability of scoring within the next n actions, due to the iterative nature
of the formula.
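The iteration can be sketched as follows; the per-zone probabilities below are random placeholders, since the real values are estimated from tournament data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3                                   # the 6 x 3 grid used in the thesis

shoot = rng.uniform(0.0, 0.1, (m, n))         # s_{x,y} (placeholder values)
goal = rng.uniform(0.0, 0.3, (m, n))          # g_{x,y}
move = 1.0 - shoot                            # m_{x,y} = 1 - s_{x,y} here
T = rng.uniform(size=(m, n, m, n))            # T_{(x,y)->(z,w)}
T /= T.sum(axis=(2, 3), keepdims=True)        # all entries for a start zone sum to 1

xT = np.zeros((m, n))                         # initial condition: xT = 0 everywhere
for _ in range(5):                            # 4-5 iterations suffice per Singh
    xT = shoot * goal + move * np.einsum('xyzw,zw->xy', T, xT)
```

At the first iteration xT reduces to shoot * goal, i.e. the expected-goals term of the formula; later iterations add the threat reachable through ball moves.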
Results
Chapter 2
Preprocessing
1. the outermost level includes just the year of the competition; the analysis
will regard just one of the three years of data available.
2. for every year there is the list of all matches played, temporally ordered and
marked by progressive numbers.
3. finally for every match there is a set of files including all information about
it.
The analysis will mostly use the Marks file, introduced in the next paragraph,
which deserves a more detailed description, while the content of the other files
can be summarized as follows:
• MatchInfo: it includes other background information about the match, not
so related to the game, for example the geometry of the field as well as
weather conditions (wind speed, temperature, humidity, etc.).
• Phases: it regards the two halves of the match, for each of which there is
information such as the effective duration of the half, that of the injury time
and LeftTeamID.
• Tags: it is a list of strings (usually one) which briefly describes the event,
using expressions such as BallTouch, Goal, ThrowIn, Substitution; for the
analysis the first two types of events are the most relevant.
but discretizing the field into different regions and assigning coordinates to the
respective area.
In this way it is possible to lay the groundwork for a model that represents
field areas as states of a dynamic process.
The simplest solution is to create a uniform grid and apply it to the field,
obtaining m × n areas, where m is the number of subdivisions along the longest
direction and n that along the shortest one; then every pair of coordinates
is assigned to its area (x corresponds to the longest direction, y to the shortest
one).
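A possible sketch of this assignment (numbering convention and function name are illustrative, not the thesis implementation):

```python
def field_area(x, y, m=6, n=3):
    """Map ball coordinates (x, y) in [-1, 1] x [-1, 1] to a cell of the
    m x n grid, numbered 1..m*n; out-of-field coordinates (throw-ins,
    corners) are clipped to the nearest area."""
    x = min(max(x, -1.0), 1.0)
    y = min(max(y, -1.0), 1.0)
    col = min(int((x + 1) / 2 * m), m - 1)  # 0..m-1 along the long side
    row = min(int((y + 1) / 2 * n), n - 1)  # 0..n-1 along the short side
    return row * m + col + 1
```

With the default 6 × 3 grid, field_area(-1, -1) is area 1 and field_area(1, 1) is area 18; a slightly out-of-field position such as (1.3, 0.2) falls in the nearest border area.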
A problem that may emerge from this grid subdivision is the so-called "curse of
dimensionality": as m and n grow to create a more refined grid, the dimension
of the problem also grows as m × n, not linearly, causing computational
inefficiencies.
Another problem is the robustness of the results, which depends on how frequently
transitions between field areas are observed and which decreases as the problem
dimension increases (see paragraph 3.3).
To avoid these problems the initial grid is 6 × 3, even though a more refined one
would surely be better graphically (see paragraph 2.4.2).
In addition to the field areas, other states are needed to record goals scored
and lost balls; but the latter is necessary only when the analysis is focused on
one team, while if both teams are considered the state "lost ball" for team A
coincides with ball possession in a certain field area for team B.
Moreover, the data cleaning process showed that some ball positions were
registered out of the field, e.g. before a corner or a throw-in, so a sort of frame
around the field was created to include these events, assigning them to the
nearest field area.
The list B.3 includes examples of these cases, where the position on X, Y or both
does not stay within the range [−1, 1].
The idea is shown in the following image and coded in the function field_statistics,
which will be analyzed later.
Figure 2.1: Areas distinguished by corner, behind football goal and lateral (both
orange and green)
Another aspect to pay attention to is the field perspective of the two teams: if a
field area is seen by team A as an upper-right one, for team B it is a lower-left
one, and after half-time the situation is reversed; therefore it is essential to
standardize these changes of point of view.
The strategy was to take as reference system the offense from left to right
or from top to bottom (respectively if in the field representation the long side is
horizontal or vertical). Then the ball coordinates of the team whose offense is
consistent with the reference system are left unchanged, while the ball coordinates
of the other team are mapped using the function f : (x, y) → (−x, −y), that is
just their reflection with respect to the origin of the Cartesian reference system.
For example, assume that the analysis is about team A and that in the first
half it is attacking from the right to the left of the field; then f is applied to the
ball coordinates of the first half, but not to those of the second half, because after
the break the teams change sides, so team A will attack consistently with the
reference system.
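This standardization can be condensed into a one-line helper (the function is illustrative):

```python
def standardize(x, y, attacks_left_to_right):
    """Keep coordinates of the team attacking along the reference
    direction; reflect the others through the origin, f:(x,y)->(-x,-y)."""
    return (x, y) if attacks_left_to_right else (-x, -y)

# Team A attacking right-to-left in the first half: its events are reflected
standardize(0.5, -0.2, attacks_left_to_right=False)  # -> (-0.5, 0.2)
```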
• goal: to find goal events the function is_goal is used (see A.3), which simply
checks whether the string 'Goal' appears in the column 'Tags' of the event.
After this event the state will naturally be lost, because the ball passes to
the other team (except very rare cases where the goal scored by team A is
the last event of the first half and the second-half kick-off is taken by team
A itself); so goal is an almost surely non-recursive state.
• lost: after selecting the events related to the team, to find when the ball
is lost the idea is to check indices with the function is_consequential (see
A.4): if the difference between an index and the following one is greater
than one, it means that in between there are events related to the other
team, so the ball was lost.
A key point is that a single lost state could correspond to many events of the
other team, so there is a state goal or position, then a state lost, and again a
state goal or, most probably, position; for this reason lost is a non-recursive
state.
• position (X,Y): in this case the values of the coordinates X and Y are
simply shown.
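The gap-detection idea behind is_consequential can be sketched as follows (a simplified reimplementation, not the appendix code):

```python
def lost_ball_positions(event_indices):
    """Given the row indices of one team's events in the full event list,
    return the positions after which the ball was lost: a jump greater
    than 1 means events of the other team occurred in between."""
    return [pos for pos, (a, b) in enumerate(zip(event_indices, event_indices[1:]))
            if b - a > 1]

# Team A owns events 0, 1, 2, loses the ball, regains it at event 7
lost_ball_positions([0, 1, 2, 7, 8])  # -> [2]
```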
A way to represent the connection between states is a graph, useful also to better
understand the transition process that will be analyzed in the next chapter.
Figure 2.2: To avoid too much confusion not all areas are drawn: there are just
the connections between state 1 and the areas drawn, plus self-loops on "field
states", because goal and lost are not recursive states.
Then, to assign these states, the piece of code A.5 is used, where team_A is a
data-frame containing all events related to team A.
It is really important to notice that the order in which the if conditions are
written in the for loop matters: indeed both goal and lost events usually carry
their position, so if the position condition were written first, it could create
misunderstandings.
Similarly goal should precede lost because in the case of an own goal the ball is
not in possession, so if they were inverted the state would be lost, losing the fact
that a goal was scored.
This framework is included in a function find_states (see A.6), which takes
as input the json file with all the information and the team to be analysed, and
provides as output a list of states and an array left_team indicating the team
playing from left to right, which in practice is one team for one half and, after
half-time, the other.
The results are then used to find the field area of each state in the list through the
function field_statistics (see A.8), which takes as input the number of subdivisions
of the long and short field sides, the outputs states and left_team of the previous
function, and again the team, recalling that the focus is actually on a single team
in the match.
The outputs are instead two numpy arrays:
• field_zone: it contains the field areas of the states provided as input, so its
length is the same as that of the list states and the possible values are the
m × n areas inside the game field plus two dummy areas, the lost one,
marked by m × n + 1, and the goal one, marked by m × n + 2.
The matching between states and field areas is immediate in the case of lost
←→ m × n + 1 and goal ←→ m × n + 2, while it is not so easy with position states.
The first step is to understand whether the team considered is attacking from left
to right or not, done through the condition (see A.7), where after the else the
reflection function is applied to the position.
Then there are two for loops which run through all the field zones searching for
the one corresponding to the position provided; it is important to notice that in
the case of contour areas, controls are added using if conditions to check whether
the position falls in the frame around the field.
It is interesting to compare the performance of two teams, understanding their
different ways of playing during the tournament.
Figure 2.3: The game of team A is more concentrated in the classic construction
area of the game, number 7, while that of team B has a more distributed
maneuver in the central areas of the field
2.4.2 Grid refining
Until now the field has been subdivided into 6 × 3 areas to avoid data sparsity
problems, which may emerge when considering just a single match or not many
matches, e.g. in the case of a team playing just the initial part of the tournament
before being eliminated.
A possible refinement of the grid is obtained by duplicating the subdivisions in
both dimensions, resulting in an almost fourfold increase of the states (from 20
to 74); then the field_zone and count vectors are recomputed changing the
parameters m and n of field_statistics, and the function plot_statistics_on_field
shows the results (of a single match on the left, of the tournament on the right).
Figure 2.4: The left plot has some strange behaviours on the side lanes and it is
much more irregular than the right one because there are far fewer observations,
less than 10 in many areas, giving rise to the problems mentioned above, while
with more data the graphical result is better than with the 6 × 3 grid.
Chapter 3
Markov chain
So for example during the tournament the players of team A passed the ball
from area 8 to area 5 a total of 88 times, lost the ball from area 2 61 times, and
scored a goal from area 13 3 times.
This matrix is also useful to verify the successful cleaning of the data: it can be
noticed that the transitions from the lost state to itself and from the goal state
to itself are both zero, due to the fact that they are non-recursive states, as well
as the transition from lost to goal.
Finally, the sum of goals scored by team A during the competition is visible in
the cell (19, 18), the transition from goal to lost, rather than calculating the sum
of the values in the last column, including the transitions from all states to goal
(clearly the two are equal).
Figure 3.1: For example the probability of passing from field area 17 to 14, that
is a back pass on the right lane, is almost 15%, while that of scoring a goal from
area 13 is less than 1%.
Figure 3.2: It is interesting to notice that the tournament plot is almost
symmetric, with a negligible dominance on the left lane, while for team A this
dominance is more pronounced, as can be seen from the darker colour of areas
3 and 12 with respect to 5 and 14 respectively. Moreover, seeing the lighter colour
of area 6 and the darker one of area 11, team B seems to slightly prefer a more
advanced position on the right side.
3.2 Stationary distribution
The analysis of the entire tournament has led to the study of a new tool that
could be useful in the continuation of the work: the stationary distribution.
Until now the charts showed how many times the ball passed through the different
field areas, while here the focus is on finding the stationary probabilities of being
in each field area at a random instant of the match.
This aim can be attained in three different ways:
• limit distribution: based on the fact that the chain is ergodic because it is
finite, irreducible (every pair of states is mutually reachable) and aperiodic
(the chain period is equal to 1), so the limit distribution is also the
stationary one.
In the piece of code A.14, after a common first part importing the transition
matrix, note that the limit method starts from a vector in which all states
are equiprobable and iterates the transition matrix k times; theoretically k
tends to infinity, but in practice about ten iterations suffice to reach
convergence.
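The limit method described above can be sketched as follows; the two-state transition matrix is a hypothetical example, not taken from the tournament data.

```python
import numpy as np

def stationary_limit(P, k=10):
    """Approximate the stationary distribution of a row-stochastic
    transition matrix P by iterating from a uniform distribution."""
    dim = P.shape[0]
    pi = np.repeat(1.0 / dim, dim)  # all states equiprobable
    for _ in range(k):
        pi = pi @ P                 # one step of the chain
    return pi

# hypothetical two-state ergodic chain (illustrative numbers only)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_limit(P, k=50))  # converges to [5/6, 1/6]
```

For an ergodic chain the result no longer changes after a handful of iterations, which matches the observation above that about ten iterations are enough.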
The comparison between the results can again be done through the function
A.9, obtaining
Limit time
Elapsed time : 0.001977 seconds .
Analytic time
Elapsed time : 0.006979 seconds .
Count time
Elapsed time : 0.031848 seconds .
The real differences between these methods lie in the reliability and
conditioning of the results: the iterative and analytic ones are subject to the
propagation of any errors present in the transition matrix, whose poor
conditioning cannot be excluded given its construction from empirical data.
By its nature, the empirical method is therefore the most stable, since it does
not depend on the transition matrix but only on the vector of counts over the
various field areas.
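The analytic and empirical estimates compared above can be sketched as follows; this is a minimal version assuming a row-stochastic transition matrix, with illustrative numbers.

```python
import numpy as np

def stationary_analytic(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)       # right eigenvectors of P^T
    i = np.argmin(np.abs(vals - 1.0))     # eigenvalue closest to 1
    pi = np.real(vecs[:, i])
    return pi / pi.sum()

def stationary_count(counts):
    """Empirical estimate: relative time spent in each state."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_analytic(P))         # ~[5/6, 1/6]
print(stationary_count([500, 100]))   # [5/6, 1/6] exactly
```

The empirical version never touches the transition matrix, which is why its estimate is insensitive to the matrix's conditioning.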
A possible explanation could be that high-frequency transitions are also the
most likely ones, so a large absolute amplitude of the intervals turns out to
be small when evaluated in relative terms.
A new matrix rel_ampl_matrix is then created in A.17, whose entries equal the
ratio between the entries of amplitude_confint_matrix and prob_trans_matrix.
Moreover, an if condition is added to check the initial hypothesis that the
values in 'counts' are greater than or equal to 5, because where it fails the
robustness measure for those entries is not valid.
The maximum element max_elem is saved and the invalid entries are set to
2 × max_elem by hand, to mark the difference with respect to "valid" values.
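A minimal sketch of this construction, with hypothetical inputs: the confidence bounds are assumed to be given as two matrices, and the names mirror those used in the text.

```python
import numpy as np

def relative_amplitudes(conf_lower, conf_upper, prob_trans_matrix,
                        counts, min_count=5):
    """Relative confidence-interval amplitude per transition; entries
    whose counts fall below min_count are marked with a sentinel of
    2 * max_elem, as described in the text."""
    amplitude = conf_upper - conf_lower
    with np.errstate(divide="ignore", invalid="ignore"):
        rel = np.where(prob_trans_matrix > 0,
                       amplitude / prob_trans_matrix, 0.0)
    max_elem = rel[counts >= min_count].max()  # maximum over valid cells
    rel[counts < min_count] = 2 * max_elem     # flag unreliable cells
    return rel

# illustrative 1x2 example: the second cell has too few observations
lower = np.array([[0.10, 0.30]])
upper = np.array([[0.20, 0.40]])
prob = np.array([[0.50, 0.50]])
counts = np.array([[10, 3]])
print(relative_amplitudes(lower, upper, prob, counts))
```

The sentinel value makes the invalid cells stand out on the heatmap instead of silently mixing them with robust estimates.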
Now the heatmap is much closer to what was expected, with low values near the
diagonal and high values in the upper-right and lower-left corners.
Row and column 18 are almost white, which could seem inconsistent with the
other states near the edges of the matrix; it should be remembered, however,
that state m × n + 1 corresponds to lost, which has a high frequency of
transitions from and to all other states.
To better understand the behaviour of the relative amplitudes in the area near
the diagonal, a slightly more detailed heatmap can be created:
Figure 3.3: The pattern of the absolute amplitudes of the confidence intervals
is very similar to that of the stationary distribution, suggesting that they
are directly proportional.
The relative amplitude instead reveals that the less visited areas have the
highest values, especially area 16, in front of the opponent's goal.
Its relative amplitude is more than 15% of the stationary distribution value,
while the others are below 10%.
Chapter 4
The aim is to replicate the idea of Karun Singh presented in section 1.3,
exploiting what was previously found.
The starting point is quite different, because here the probabilities of moving
and of shooting were not distinguished from each other, only the passage from
one state to another; however, the goal probabilities and the transition matrix
can easily be derived.
In particular, the goal probabilities are the last column of the matrix, since
it contains all the probabilities of passing from a field zone to the state
goal.
The function exp_thr (A.19) is used for this purpose: it finds the expected
threat and saves it in an Excel file. The input n represents the number of
iterations, i.e. how many times the matrix-vector product between the
transition matrix and the expected threat vector is computed.
The reasoning behind this simplification is that at iteration 0, the initial
condition, the question is "What is the probability of scoring a goal with 0
more passes?", that is, shooting directly from the area where the ball is;
clearly the answer is given by the last column of the matrix.
In general, at step k the question is the same with k passes, but it can be
reduced to "What is the probability of scoring a goal with one more pass from
step k − 1?"; this explains the recursive formula
xT_(x,y) = Σ_z Σ_w T_((x,y)→(z,w)) · xT_(z,w)
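A sketch of this iteration, assuming, as described above, that the goal probabilities occupy the last column of the transition matrix; the toy matrix is illustrative only.

```python
import numpy as np

def exp_thr_sketch(T, n):
    """Expected-threat iteration: the initial condition is the last
    column of T (direct goal probabilities); each iteration applies T
    once more, allowing one extra pass before the shot."""
    xT = T[:, -1].copy()   # iteration 0: shoot from the current area
    for _ in range(n):
        xT = T @ xT        # one more pass
    return xT

# toy chain: two field areas plus a goal state (invented numbers)
T = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.5, 0.1],
              [0.0, 1.0, 0.0]])
print(exp_thr_sketch(T, 0))  # the last column of T
print(exp_thr_sketch(T, 2))
```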
It is interesting to compare the expected threat found at the second iteration
with the goal probabilities, the initial condition, to appreciate the gain of
information about the goal: from its mere realization to its construction in
the last and usually most decisive steps.
A last consideration concerns the normalization of the expected threat with
respect to how many times the ball passes over each area, so that more
frequented areas, measured with the count vector, are given more importance,
giving a better idea of where most dangerous situations come from.
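This normalization could look like the following sketch, with invented numbers; the idea is only to weight each area's expected threat by its share of visits.

```python
import numpy as np

xT = np.array([0.01, 0.05, 0.20])       # hypothetical per-area xT
count = np.array([300, 150, 20])        # hypothetical visit counts
weighted_xT = xT * count / count.sum()  # frequent areas weigh more
print(weighted_xT)
```

Note how the rarely visited third area, despite having the highest raw xT, no longer dominates the picture after weighting.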
4.1 Comparison between expected threat and lost probabilities
It is worth spending a few lines on the relationship between these two factors.
The expectation is that they are directly proportional, because when a team
plays near the opposing area the probability of losing the ball is clearly
higher than near its own area; this is confirmed by the following chart, where
each point corresponds to a field area and its coordinates are the expected
threat and the lost probability.
Figure 4.1: The increasing trend is shown by the regression line of these points.
There is a clear distinction between "safe" areas, characterized by low lost
probabilities but also low expected threat, and areas that are more or less
"inviting" depending on whether they lie above or below the line. To understand
which areas these are, the vectors should be analyzed; their representation on
the pitch is shown in the following charts.
So the worst areas are the three with lost probabilities between 0.3 and 0.4
but expected threat lower than 0.001, namely numbers 0, 1 and 2.
The other group of "bad" areas is composed of those with lost probabilities
between about 0.13 and 0.2 but expected threat lower than 0.0005, namely
numbers 3, 4, 5, 6 and 8, while field area 7 lies at the beginning of the line.
The last not-so-inviting area corresponds to the isolated point with a
probability of losing the ball above 0.6, namely number 16.
Regarding the "good" areas, there is a group of three points between 0.1 and
0.2 of lost probability and around 0.001 of expected threat, including areas
9, 10 and 11. The remaining five points above the line are the best areas in
which to play, because their expected threat values are high relative to the
probability of losing the ball; they are numbers 12, 13, 14, 15 and 17.
Secondly, the team changes event by event, so in the new function
field_statistics_match (A.22) the condition A.7 becomes A.21.
Finally, the output count is omitted, because it does not make much sense to
record how many times the ball is in a certain area without distinguishing
between teams.
These functions provide the inputs of the process, which is applied to the
expected threat found in the previous section following the piece of code A.23.
The idea is to iterate over the field_zone items, taking the expected threat of
each field area with a positive sign if the team playing in it is the first
team touching the ball (not very relevant here, it only matters graphically),
and a negative sign otherwise.
The representation of this result is not very significant; looking carefully,
one can notice that the team represented at the top seems to dominate the
other, but this is not enough.
Another idea is to compute the match expected threat as a weighted sum of the
expected threat in the current field area and in the previous ones (preceded by
plus or minus following the same criterion as before).
In this way a flattening of the chart due to equal values of expected threat is
avoided.
In A.24 the first for cycle is a kind of initialization of the match expected
threat, while the second includes the weighted sum.
It is designed to halve the importance of the expected threat the further an
observation is from the current one, while keeping the sum of the weights equal
to 1; the choice was to consider the current and the three previous
observations, so a 4-tuple of weights satisfying these properties is
[8/15, 4/15, 2/15, 1/15].
This is clearly not the unique possibility: the properties are also satisfied
by n-tuples such as [24/45, 12/45, 6/45, 3/45], or [27/40, 9/40, 3/40, 1/40] if
the importance is divided by three instead of halved, or even [4/7, 2/7, 1/7]
if only the two previous observations are considered; after a qualitative
comparison among some of them, the choice was [8/15, 4/15, 2/15, 1/15].
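The weighted sum with the chosen 4-tuple can be sketched as follows; the name signed_xt is illustrative and stands for the sequence of signed expected-threat values.

```python
import numpy as np

def smooth_match_xt(signed_xt, weights=(8/15, 4/15, 2/15, 1/15)):
    """Weighted sum of the current and the three previous signed xT
    values; the weights halve with distance and sum to 1."""
    x = np.asarray(signed_xt, dtype=float)
    w = np.array(weights)
    smoothed = x.copy()                  # first observations kept as-is
    for i in range(len(w) - 1, len(x)):
        # window: current value first, then the previous ones
        smoothed[i] = np.dot(w, x[i::-1][:len(w)])
    return smoothed

print(smooth_match_xt([0.0, 0.0, 0.0, 15.0]))
```

Because the weights sum to 1, a constant signal is left unchanged: only genuine variations between consecutive observations are smoothed out.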
The chart is slightly more understandable than the initial one, but it is still
difficult to get a clear idea about certain moments of the game.
A solution could be to group the values of expected threat with respect to
their timestamp (it will be done by k minutes) in order to obtain a clearer
chart.
In A.25 the time range of the match is divided into time_segments, over which
the external for cycle iterates, while the internal cycle ranges over the
entire time_stamp vector, with an if condition ensuring that only the values in
the i-th time segment are taken.
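A sketch of this grouping, assuming timestamps already converted to minutes; the numbers are illustrative.

```python
import numpy as np

def group_by_minutes(time_stamp, xt_values, k=2):
    """Sum signed xT values falling in consecutive k-minute segments;
    the number of segments is the match duration over k, rounded up."""
    time_stamp = np.asarray(time_stamp, dtype=float)
    xt_values = np.asarray(xt_values, dtype=float)
    n_segments = int(np.ceil(time_stamp.max() / k))
    grouped = np.zeros(n_segments)
    for i in range(n_segments):
        # keep only the observations inside the i-th segment
        in_segment = (time_stamp >= i * k) & (time_stamp < (i + 1) * k)
        grouped[i] = xt_values[in_segment].sum()
    return grouped

print(group_by_minutes([0.5, 1.5, 2.5, 3.5], [1, 2, 3, -4], k=2))
```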
Figure 4.2: The ranges on the x-axis indicate the parts into which the match is
split, which are nothing but the match duration in minutes divided by the
parameter k and rounded up.
The graphic representation of the case k = 2 is much better than before; now
there is a clear dominance of team A with respect to team B, and the moments in
which one team prevails over the other are more clearly outlined.
A further improvement is to distinguish the teams not just with the horizontal
line but by colouring the areas created by their expected threat, to draw a
dashed line at half-time, and to set k = 1 so that the numbers on the x-axis
correspond to minutes of the match.
Figure 4.3: The result of the match was 2-0 for team A.
Finally, it is possible to add vertical lines at the minutes of the goals
scored, which can be useful to better understand the course of the match: a
team can reach the goal thanks to its dominance in the match, or it can gain
dominance thanks to the goal itself.
4.2.2 Subdivision by actions
Another approach to the match analysis is to consider the actions played in
turn by the teams, grouping consecutive events of the same team into a single
action.
The aim is again to measure the dangerousness of the teams during the match,
but without dividing it by minutes and "mixing" different attacks; now the
strategy is to compute the expected threat of an action as the sum of the
expected threat of every event composing it.
This seems a good trade-off between possession and verticalization, because a
long action in areas away from the goal, or a dangerous but short action, does
not represent real dominance in the match.
The function act_dur (see A.26) is built to obtain a vector counting the
number of events in each action; the first for cycle includes if conditions to
discriminate between consecutive events of the same team, in which the action
continues, and events of different teams, in which a new action is initialized.
The other outputs are the average action lengths mean_action_A and
mean_action_B, so a split of the count_action vector is needed.
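The counting logic described for act_dur can be sketched as follows, in a simplified version that works on a plain list of team IDs:

```python
def count_action_events(team_vector):
    """Length (in events) of each action: consecutive events of the
    same team belong to one action, a change of team starts a new one."""
    count_action = [1]
    for i in range(1, len(team_vector)):
        if team_vector[i] == team_vector[i - 1]:
            count_action[-1] += 1        # action continues
        else:
            count_action.append(1)       # new action, possession changes
    return count_action

# hypothetical event-by-event team IDs
print(count_action_events(["A", "A", "B", "B", "B", "A"]))  # [2, 3, 1]
```

The per-team averages then follow by splitting this vector according to which team played each action.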
A not very significant result is shown in the following chart, similar to the
first attempt in paragraph 4.2.1.
A solution is again to aggregate information: it is possible to create a
cumulative vector of expected threat for each team, to bring out measures like
the expected threat created during the entire match or even just during certain
situations.
The piece of code A.27 can be added to the previous function act_dur, while the
charts show two different ways of representing these statistics:
4.3 Player dominance
Analyzing the match it could be interesting to go into more detail about the
contribution of each player, not just stopping at team level.
41
The initial idea is to compute the expected threat produced by each player,
similarly to how it was done for actions: when a player is the protagonist of
an event, the expected threat of the field area where the event takes place is
added to his total.
The implementation consists of a function exp_thr_player (see A.28) which, by
sweeping the player vector containing the player protagonist of every event in
the match, creates three lists: the players themselves and their relative
expected threat and team.
The function role_extraction is built to present results in a chart without
using the player ID to identify each player, assigning roles followed by
sequence numbers according to their position in the player_list vector
(see A.29).
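A sketch of this role-labelling step, with hypothetical player IDs and roles:

```python
def role_labels(player_list, players_id, role):
    """Replace player IDs with role labels plus a sequence number,
    e.g. the third midfielder in player_list becomes 'MF3'."""
    short = {"Goalkeeper": "GK", "Defender": "DF",
             "Midfielder": "MF", "Forward": "FW"}
    counters = {"GK": 0, "DF": 0, "MF": 0, "FW": 0}
    labels = []
    for pid in player_list:
        tag = short[role[players_id.index(pid)]]  # look up this player's role
        counters[tag] += 1
        labels.append(tag + str(counters[tag]))
    return labels

print(role_labels([11, 7, 5],
                  players_id=[5, 7, 11],
                  role=["Midfielder", "Defender", "Goalkeeper"]))
# ['GK1', 'DF1', 'MF1']
```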
The piece of code A.30 is written to create the following bar chart:
Figure 4.4: The third midfielder is the player with the highest expected threat
during the match.
There is no longer a distinction between teams, because this analysis is
carried out separately for each of them.
Figure 4.5: The player of team A with the highest expected threat during the
tournament is again the third midfielder.
Figure 4.6: It is interesting to see how this normalization gives importance to
the new entries of team A, who probably had a good impact in the last minutes
of the match.
However, they enter when the other players are more tired, so there might be a
small bias.
This analysis can be extended to the entire competition, similarly to before.
But while normalizing on a single match makes relevant only the impact of the
new entries (a few players), on the tournament it highlights the players not
always deployed on the field, who are many more and show more variability.
In the piece of code A.34 the structure of the external for cycles comes from
the previous paragraph, while the internal cycles computing the vector
min_on_field have the same structure as A.33.
Figure 4.7: These results can be considered the most consistent so far
regarding player performances during the competition.
They confirm that the midfielders, in particular MF5 and MF3, were the most
threatening in the team.
The bar chart has a structure similar to that of figure 4.4, but now values can
be negative.
Figure 4.8: With respect to the previous approach, goalkeepers and defenders
acquire great importance in the contribution to the match.
Appendix A
Code
def is_goal(x):
    if 'Goal' in x['Tags']:
        return True
    else:
        return False
def is_consequential(x):
    if not np.isnan(x['next_index']):
        if int(x['next_index']) - int(x['index_col']) > 1:
            return False
        else:
            return True
    else:
        return True
states = []
for i, row in team_A.iterrows():
    if row['is_goal']:
        states.append('goal')
        states.append('lost')
    elif not row['is_ball_in_possession']:
        states.append('lost')
    else:
        states.append([row['X'], row['Y']])
[...]
[...]
        states.append('goal')
        states.append('lost')
    elif not row['is_ball_in_possession']:
        states.append('lost')
    else:
        states.append([row['X'], row['Y']])
A.7 If condition
[...]
else:
for h in range(n):
    if -1 + j*x < states[i][0] <= -1 + (j + 1)*x and \
            -1 + h*y < states[i][1] <= -1 + (h + 1)*y:
        field_zone[i] = j*n + h
        count[j*n + h] += 1
    if j == 0:
        if -1 - eps < states[i][0] <= -1:
            field_zone[i] = j*n + h
            count[j*n + h] += 1
    elif j == m - 1:
        if 1 <= states[i][0] < 1 + eps:
            field_zone[i] = j*n + h
            count[j*n + h] += 1
    if h == 0:
        if -1 - eps < states[i][1] <= -1:
            field_zone[i] = j*n + h
            count[j*n + h] += 1
    elif h == n - 1:
        if 1 <= states[i][1] < 1 + eps:
            field_zone[i] = j*n + h
            count[j*n + h] += 1
else:
    for j in range(m):
        for h in range(n):
            if -1 + j*x < -states[i][0] <= -1 + (j + 1)*x and \
                    -1 + h*y < -states[i][1] <= -1 + (h + 1)*y:
                field_zone[i] = j*n + h
                count[j*n + h] += 1
            if j == 0:
                if -1 - eps < -states[i][0] <= -1:
                    field_zone[i] = j*n + h
                    count[j*n + h] += 1
            elif j == m - 1:
                if 1 <= -states[i][0] < 1 + eps:
                    field_zone[i] = j*n + h
                    count[j*n + h] += 1
            if h == 0:
                if -1 - eps < -states[i][1] <= -1:
                    field_zone[i] = j*n + h
                    count[j*n + h] += 1
            elif h == n - 1:
                if 1 <= -states[i][1] < 1 + eps:
                    field_zone[i] = j*n + h
                    count[j*n + h] += 1
"""
Plot of statistics extracted by field statistics functions
: param m : areas of the long side of the field
: param n : areas of the short side of the field
: param count : vector returned by field_statistics
with count of events for each field zone
: param team : ID of the team for the plot title
: return : plots
"""
# create and fill a grid to plot frequencies in
# different positions
grid = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        grid[i, j] = count[i*n + j]
plt.xticks(ticks=np.arange(m))
plt.yticks(ticks=np.arange(n))
# blurred heatmap: keep the handle for the colorbar
hm = plt.imshow(grid, cmap='Blues', interpolation='spline16')
if team is not None:
    plt.title(team)
plt.colorbar(hm)
plt.savefig('Blurred%s' % team)  # save before show(), which clears the figure
plt.show()
centers = np.zeros((m*n, 2))
for i in range(m):
    for j in range(n):
        centers[i*n + j, 0] = i
        centers[i*n + j, 1] = j
plt.savefig('DefinedSquare%s' % team)
plt.show()
path_list = glob.glob(path + "/**/*%s*.Marks.json" % team,
                      recursive=True)
for match_path in path_list:
    [states, left_team] = find_states(match_path, team_id)
coordinates.close()
return tm
            % team, startcol=0, startrow=0)
coordinates.close()
return ptm
team_list = [...]
team_code = [...]
team_dict = dict(zip(team_list, team_code))
field_zone = []
count = np.zeros(m*n + 2)
path = os.path.dirname(r'C:[...]')
path_list = glob.glob(path + "/**/*%s*.Marks.json" % team,
                      recursive=True)
import_transition_matrix = pd.read_excel('C:[...]')
transition_matrix = import_transition_matrix.iloc[0:, 0:]
dim = np.shape(transition_matrix)[1]
# limit method
iteration_matrix = np.linalg.matrix_power(transition_matrix, k)
init_distr = np.repeat(1/dim, dim)
stationary_limit = np.matmul(iteration_matrix, init_distr)
# analytic method
S, U = eig(transition_matrix)
count = pd.DataFrame.to_numpy(count.iloc[1:, 1:])
stationary_count = count / np.sum(count)
return region
region_v = multinomial_proportions_confint(transition_matrix[i, :])
region[i, :, :] = region_v
amplitude_confint_matrix[i, j] = region[i, j, 1] - region[i, j, 0]
rel_ampl_matrix[i, j] = amplitude_confint_matrix[i, j] / prob_trans_matrix[i, j]
amplitude_confint_count = np.zeros(dim)
for i in range(dim):
    amplitude_confint_count[i] = region_stat[0, i, 1] - region_stat[0, i, 0]
plot_statistics_on_field(6, 3, amplitude_confint_count,
                         'Stationary Distribution')
rel_ampl_confint_count = np.zeros(dim)
for i in range(dim):
    rel_ampl_confint_count[i] = amplitude_confint_count[i] / stationary_count[i]
plot_statistics_on_field(6, 3, rel_ampl_confint_count,
                         'Stationary Distribution')
def exp_thr(n):
    import_transition_matrix = pd.read_csv('C:[...]')
    coordinates.close()
    return xT, count
[...]
states = []
team = []
left_team = []
for i, row in balltouches.iterrows():
    if row['is_goal']:
        states.append('goal')
        team.append(row['Team'])
        left_team.append(row['LeftTeam'])
    else:
        states.append([row['X'], row['Y']])
        team.append(row['Team'])
        left_team.append(row['LeftTeam'])
if int(team[i]) == int(left_team[i]):
    [...]
else:
    field_zone[i] = j*n + h
    count[j*n + h] += 1
if j == 0:
    if -1 - eps < states[i][0] <= -1:
        field_zone[i] = j*n + h
        count[j*n + h] += 1
elif j == m - 1:
    if 1 <= states[i][0] < 1 + eps:
        field_zone[i] = j*n + h
        count[j*n + h] += 1
if h == 0:
    if -1 - eps < states[i][1] <= -1:
        field_zone[i] = j*n + h
        count[j*n + h] += 1
elif h == n - 1:
    if 1 <= states[i][1] < 1 + eps:
        field_zone[i] = j*n + h
        count[j*n + h] += 1
else:
    for j in range(m):
        for h in range(n):
            field_zone[i] = j*n + h
            count[j*n + h] += 1
            if j == 0:
                if -1 - eps < -states[i][0] <= -1:
                    field_zone[i] = j*n + h
                    count[j*n + h] += 1
            elif j == m - 1:
                if 1 <= -states[i][0] < 1 + eps:
                    field_zone[i] = j*n + h
                    count[j*n + h] += 1
            if h == 0:
                if -1 - eps < -states[i][1] <= -1:
                    field_zone[i] = j*n + h
                    count[j*n + h] += 1
            elif h == n - 1:
                if 1 <= -states[i][1] < 1 + eps:
                    field_zone[i] = j*n + h
                    count[j*n + h] += 1
return field_zone
# initial idea
xT = np.zeros(len(field_zone))
for i in range(1, len(field_zone)):
    if team[i] == team[0]:
        xT[i] = tixT[int(field_zone[i])]
    else:
        xT[i] = -tixT[int(field_zone[i])]
if team[i] == team[0]:
    xT[i] = tixT[int(field_zone[i])]
else:
    xT[i] = -tixT[int(field_zone[i])]
for i in range(3, len(field_zone)):
    if team[i] == team[0]:
else:
    count_action[j] += 1
    xT_action[j] += tixT[int(field_zone[i])]
elif vect_team[i-1] != vect_team[0] and vect_team[i] != vect_team[0]:
    count_action[j] += 1
    xT_action[j] -= tixT[int(field_zone[i])]
team_A_action = []
team_B_action = []
for i in range(len(np_count_action)):
    if vect_team[i] == team[0]:
        team_A_action.append(np_count_action[i])
    else:
        team_B_action.append(np_count_action[i])
mean_action_A = np.mean(team_A_action)
mean_action_B = np.mean(team_B_action)
A.27 Cumulative xT
cum_xT_action_A = []
cum_xT_action_B = []
j = 0
k = 0
for i in range(len(np_xT_action)):
    if np_xT_action[i] > 0:
        if i == 0 or i == 1:
            cum_xT_action_A.append(np_xT_action[i])
            j += 1
        elif i > 1:
            j += 1
    elif np_xT_action[i] < 0:
        if i == 0 or i == 1:
            cum_xT_action_B.append(np_xT_action[i])
            k += 1
        elif i > 1:
            k += 1
role_list = []
count_g = 1
count_d = 1
count_m = 1
count_f = 1
for i in range(len(player_list)):
    j = players_id.index(player_list[i])
    if role[j] == 'Goalkeeper':
        role_list.append('GK' + str(count_g))
        count_g += 1
    if role[j] == 'Defender':
        role_list.append('DF' + str(count_d))
        count_d += 1
    if role[j] == 'Midfielder':
        role_list.append('MF' + str(count_m))
        count_m += 1
    if role[j] == 'Forward':
        role_list.append('FW' + str(count_f))
        count_f += 1
path_list_lineups = glob.glob(path + "/**/*%s*.Lineups.json" % team,
                              recursive=True)
[states, team, player, left_team, time_stamp,
 first_enj, second_enj] = find_states_match(
    path_list_marks[i], path_list_phases[i])
field_zone = field_statistics_match(m, n, states, team, left_team)
same_team = []
same_team_player = []
same_team_field_zone = []
for j in range(len(team)):
    if team[j] == team_id:
        same_team.append(team[j])
        same_team_player.append(player[j])
        same_team_field_zone.append(field_zone[j])
if i > 0:
    [states, team, player, left_team, time_stamp,
     first_enj, second_enj] = find_states_match(
        path_list_marks[i], path_list_phases[i])
    field_zone = field_statistics_match(m, n, states, team, left_team)
    same_team = []
    same_team_player = []
    same_team_field_zone = []
    for j in range(len(team)):
        if team[j] == team_id:
            same_team.append(team[j])
            same_team_player.append(player[j])
            same_team_field_zone.append(field_zone[j])
xT_player.append(0)
xT_player[-1] += xT_player_i[k]
min_on_field[i] = match_dur
A.34 Normalization by minutes on the tournament
path_list_lineups = glob.glob(path + "/**/*%s*.Lineups.json" % team,
                              recursive=True)
field_zone = field_statistics_match(m, n, states, team, left_team)
same_team = []
same_team_player = []
same_team_field_zone = []
for j in range(len(team)):
    if team[j] == team_id:
        same_team.append(team[j])
        same_team_player.append(player[j])
        same_team_field_zone.append(field_zone[j])
elif not np.any(player_left == player_list[h]) and \
        np.any(player_entered == player_list[h]):
    field_zone = field_statistics_match(m, n, states, team, left_team)
    same_team = []
    same_team_player = []
    same_team_field_zone = []
    for j in range(len(team)):
        if team[j] == team_id:
            same_team.append(team[j])
            same_team_player.append(player[j])
            same_team_field_zone.append(field_zone[j])
    [xT_player_i, player_list_i, team_list_i] = exp_thr_player(
        same_team, same_team_player, same_team_field_zone, tixT)
match_dur = 90 + int(first_enj) + int(second_enj)
min_on_field_i = []
for h in range(len(player_list_i)):
    j = np.where(player_left == player_list_i[h])
    min_on_field_i.append(int(time_stamp_subst[j[1]]) // (1000*60))
player_list = []
xT_player = []
team_list = []
for i in range(len(player) - 1):
    exist_count = player_list.count(player[i])
    if exist_count > 0:
        j = player_list.index(player[i])
        if field_zone[i] != m*n + 1 and field_zone[i + 1] != m*n + 1:
    else:
        player_list.append(player[i])
        team_list.append(team[i])
        xT_player.append(0)
        if field_zone[i] != m*n + 1 and field_zone[i + 1] != m*n + 1:
player_list.append(player[i])
team_list.append(team[i])
xT_player.append(0)
if field_zone[i] != m*n + 1 and field_zone[i + 1] != m*n + 1:
    xT_player[-1] += diff_xT[int(field_zone[i]), int(field_zone[i + 1])]
else:
    print('goal')
return xT_player, player_list, team_list, diff_xT
Appendix B
Images
Figure B.2: Ball coordinates