LINFO2275 Exam Questions 4
May 2023
1.1 PCA
The idea is to project the observations on the axis carrying the maximum variance of the projected data
— Among all possible axes passing through the centroid of the cloud of points
— So that this axis carries maximal information about the data
The total variance of a cloud of points can be computed as follows:

\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - g)^T (x_i - g)
where g is the centroid of the cloud of points, g = \frac{1}{n} \sum_{i=1}^{n} x_i. The operator of orthogonal projection \pi onto an axis with unit direction vector v is \pi = v v^T, with v^T v = 1. We thus find the variance of the data projected on axis v:
\sigma_v^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\pi (x_i - g))^T (\pi (x_i - g))
= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - g)^T v v^T (x_i - g)
= v^T \left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - g)(x_i - g)^T \right] v
= v^T \Sigma v
where \Sigma is the variance-covariance matrix of the data, which is symmetric positive semi-definite. We want to maximize the projected variance:

\max_v \, (v^T \Sigma v) \quad \text{subject to} \quad v^T v = 1
This is a standard constrained optimization problem with an equality constraint, which can be solved with a Lagrange function:

L = v^T \Sigma v + \lambda (1 - v^T v), \qquad \partial_v L = 0
Setting the gradient to zero gives 2\Sigma v - 2\lambda v = 0, that is, the eigenvalue/eigenvector problem \Sigma v = \lambda v. Since \Sigma is symmetric positive semi-definite, the eigenvalues are all non-negative and the eigenvectors can be chosen orthogonal. The variance of the projected data is equal to
\sigma_v^2 = v^T \Sigma v = \lambda v^T v = \lambda
The solution is thus to project the data onto the associated eigenvectors, starting with the one carrying the maximum variance (largest eigenvalue).
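As a quick illustration, here is a minimal NumPy sketch of this procedure (the function and variable names are ours, not from the course):

```python
import numpy as np

def pca(X, m):
    """Project the rows of X (n samples x p features) onto the m
    eigenvectors of the covariance matrix carrying the most variance."""
    g = X.mean(axis=0)                   # centroid of the cloud of points
    Xc = X - g                           # center the data
    Sigma = Xc.T @ Xc / (len(X) - 1)     # variance-covariance matrix
    lam, V = np.linalg.eigh(Sigma)       # eigh: ascending eigenvalues of a symmetric matrix
    order = np.argsort(lam)[::-1][:m]    # keep the m largest eigenvalues
    return Xc @ V[:, order], lam[order]  # projected coordinates and their variances

X = np.random.randn(100, 5)
Z, variances = pca(X, 2)                 # scores on the first two principal axes
```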
1.2 Feature Selection
Allows to alleviate the effect of the curse of dimensionality. It also enhances the generalization capability
of the model and speeds up the learning process.
Maximum relevance selection: compute the associations between the target and the features, and keep only the most significant ones. The associations can be computed through mutual information or statistical tests.
Minimal redundancy selection: compute the associations between the features themselves, and discard the ones that are strongly correlated with each other.
Stepwise regression: a greedy algorithm that adds (or deletes) the best (worst) feature at each step. The main issue is knowing when to stop, often when the likelihood ratio is no longer significant.
Some algorithms, like decision trees or bagging, perform feature selection on their own.
We can compute the total sample variance (with g as the general centroid) (see slide 49):
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - g)^T (x_i - g) = \frac{1}{n} \sum_{k=1}^{q} SS(k)
We can decompose the sum of squares (or inertia) for each class k into a within and a between sum of squares (see slide 50):
SS(k) = \sum_{i \in C(k)} (x_i - g)^T (x_i - g) = \sum_{i \in C(k)} \| x_i - g(k) \|^2 + n(k) \| g(k) - g \|^2
The first term is the within variance while the second is the between variance. Now, we can write the total variance as a within and a between contribution (see slide 51):
\sigma^2 = \frac{1}{n} \sum_{k=1}^{q} \sum_{i \in C(k)} \| x_i - g(k) \|^2 + \frac{1}{n} \sum_{k=1}^{q} n(k) \| g(k) - g \|^2 = \sigma_{(w)}^2 + \sigma_{(b)}^2
We can use the projection operator \pi = v v^T with v^T v = 1 to find the axis maximizing the ratio of between to total projected variance.
— For the within variance of the projected data (see slides 53, 54):
\sigma_{v(w)}^2 = \frac{1}{n} \sum_{k=1}^{q} \sum_{i \in C(k)} (\pi x_i - \pi g(k))^T (\pi x_i - \pi g(k)) = v^T W v
— For the between variance after the projection (see slide 55):
\sigma_{v(b)}^2 = v^T B v
We are looking for the axis that discriminates the classes as much as possible (maximizing the between/total ratio):
\max_v \left( \frac{\sigma_{v(b)}^2}{\sigma_v^2} \right) = \max_v \left( \frac{v^T B v}{v^T \Sigma v} \right)

where \Sigma is the total variance-covariance matrix. We set the derivative with respect to v equal to zero:

\partial_v \left( \frac{v^T B v}{v^T \Sigma v} \right) = 0

We thus have 2 (v^T \Sigma v) B v - 2 (v^T B v) \Sigma v = 0. Thus, B v = \frac{v^T B v}{v^T \Sigma v} \Sigma v, with

0 < \lambda = \frac{v^T B v}{v^T \Sigma v} < 1

From the previous equation, we obtain the eigensystem problem \Sigma^{-1} B v = \lambda v, where there are at most (q - 1) non-zero eigenvalues.
→ We select the largest non-zero eigenvalues and their eigenvectors. The normalized eigenvectors correspond to the direction vectors of the discriminant axes. Then, we project the data on these axes, providing the coordinates or discriminant scores: z_i = v^T x_i.
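A minimal NumPy sketch of this eigensystem, assuming the total covariance matrix is invertible (the function and variable names are illustrative):

```python
import numpy as np

def discriminant_axes(X, y, m):
    """Solve Sigma^{-1} B v = lambda v and return the m dominant
    discriminant axes (at most q-1 non-zero eigenvalues for q classes)."""
    g = X.mean(axis=0)
    n = len(X)
    Sigma = (X - g).T @ (X - g) / n          # total variance-covariance matrix
    B = np.zeros_like(Sigma)                 # between-class covariance
    for k in np.unique(y):
        Xk = X[y == k]
        gk = Xk.mean(axis=0)
        B += (len(Xk) / n) * np.outer(gk - g, gk - g)
    lam, V = np.linalg.eig(np.linalg.inv(Sigma) @ B)
    order = np.argsort(lam.real)[::-1]       # largest eigenvalues first
    return V[:, order[:m]].real              # direction vectors of the axes

X = np.random.randn(90, 4)
y = np.repeat([0, 1, 2], 30)
V = discriminant_axes(X, y, 2)
Z = X @ V                                    # discriminant scores z_i = V^T x_i
```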
2.2 Bayesian classification through a Gaussian mixture
We compute the a posteriori probabilities through Bayes' formula:

P(y = w_i \mid z) = \frac{P(w_i) P(z \mid y = w_i)}{\sum_{k=1}^{q} P(w_k) P(z \mid y = w_k)}
Each observation z is then assigned to the class k for which the posterior probability is maximal.
Maximum relevance selection: compute the associations between the target and the features, and keep only the most significant ones. The associations can be computed through mutual information or statistical tests.
Minimal redundancy selection: compute the associations between the features themselves, and discard the ones that are strongly correlated with each other.
Stepwise regression: a greedy algorithm that adds (or deletes) the best (worst) feature at each step. The main issue is knowing when to stop, often when the likelihood ratio is no longer significant.
Some algorithms, like decision trees or bagging, perform feature selection on their own.
3.1 CCA
Canonical correlation analysis (CCA) is a method for analyzing the relationships between two sets of features (and data sets), X and Y, measured on the same samples.
These data are realizations of two random vectors x and y
This method computes a linear combination of the random vector x and a linear combination of the random
vector y
— It thus computes two scores zx and zy (the linear combinations) which are maximally correlated
— It therefore defines score spaces that explain as much as possible the relationships between the two sets
of features
It is used in multi-view learning (different sources of data), e.g., athletes' characteristics and their results in athletics competitions.
We define the centroids

g_x = E[x], \qquad g_y = E[y] \qquad (1)
As well as the linear combinations for the two sets (which can also be viewed as projections)

z_x = u_x^T x, \qquad z_y = u_y^T y \qquad (2)
We have to maximize the covariance between the two random variables z_x and z_y. For the variance of z_x:

\mathrm{var}(z_x) = E[\tilde{z}_x^2] = E[u_x^T (x - g_x) \, u_x^T (x - g_x)] = u_x^T E[(x - g_x)(x - g_x)^T] u_x \qquad (7)
We thus have to maximize the covariance between the two sets of variables. Taking the derivative of the Lagrange function with respect to u_x and u_y, we set the result equal to zero:

\partial_{u_x} L = 0, \qquad \partial_{u_y} L = 0 \qquad (14)
We obtain

\Sigma_{xy} u_y - \lambda_x \Sigma_{xx} u_x = 0, \qquad \Sigma_{yx} u_x - \lambda_y \Sigma_{yy} u_y = 0 \qquad (15)
Which leads to the following eigenvalue/eigenvector problems (\lambda = \lambda_x \lambda_y):

\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} u_x = \lambda u_x, \qquad \Sigma_{yy}^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy} u_y = \lambda u_y \qquad (16)
up to a scaling factor
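A small NumPy sketch of the eigensystem (16) for the first pair of canonical weights (the helper names are ours):

```python
import numpy as np

def cca_first_pair(X, Y):
    """Solve the eigensystem (16) for the first pair of canonical weights."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(X)
    Sxx, Syy = Xc.T @ Xc / n, Yc.T @ Yc / n
    Sxy = Xc.T @ Yc / n
    # Sxx^{-1} Sxy Syy^{-1} Syx u_x = lambda u_x
    Mx = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    lam, U = np.linalg.eig(Mx)
    ux = U[:, np.argmax(lam.real)].real       # weights for the x view
    uy = np.linalg.solve(Syy, Sxy.T) @ ux     # weights for the y view, up to scaling
    return ux, uy, np.max(lam.real)           # lambda = lambda_x * lambda_y

X, Y = np.random.randn(200, 4), np.random.randn(200, 3)
ux, uy, lam = cca_first_pair(X, Y)
zx, zy = X @ ux, Y @ uy                       # maximally correlated scores
```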
Maximum relevance selection: compute the associations between the target and the features, and keep only the most significant ones. The associations can be computed through mutual information or statistical tests.
Minimal redundancy selection: compute the associations between the features themselves, and discard the ones that are strongly correlated with each other.
Stepwise regression: a greedy algorithm that adds (or deletes) the best (worst) feature at each step. The main issue is knowing when to stop, often when the likelihood ratio is no longer significant.
Some algorithms, like decision trees or bagging, perform feature selection on their own.
4 Multiple correspondence analysis + feature selection
Describe and derive in detail the method allowing to perform a multiple correspondence analysis. Interpret the results, e.g., what is the interpretation of the eigenvalues? How can we obtain the coordinates of the data in the principal components system? Finally, describe some methods for performing feature selection (and not extraction).
Let us seek the linear combinations that maximally correlate the features, as quantified by the sum of covariances:

\sum_{i=1}^{p} \sum_{j=1}^{p} \mathrm{cov}(y_i, y_j) = \sum_{i=1}^{p} \sum_{j=1}^{p} E[y_i y_j] = E\left[ \sum_{i=1}^{p} \sum_{j=1}^{p} y_i y_j \right] = E\left[ \left( \sum_{i=1}^{p} y_i \right)^2 \right]
Together with the constraint that the sum of the variances remains constant (equal to p):

\sum_{i=1}^{p} \mathrm{cov}(y_i, y_i) = \sum_{i=1}^{p} \mathrm{var}(y_i) = p
TO BE COMPLETED
Maximum relevance selection: compute the associations between the target and the features, and keep only the most significant ones. The associations can be computed through mutual information or statistical tests.
Minimal redundancy selection: compute the associations between the features themselves, and discard the ones that are strongly correlated with each other.
Stepwise regression: a greedy algorithm that adds (or deletes) the best (worst) feature at each step. The main issue is knowing when to stop, often when the likelihood ratio is no longer significant.
Some algorithms, like decision trees or bagging, perform feature selection on their own.
inner products. Then, describe how a data matrix can be obtained from an inner product
matrix. Describe the links with principal components analysis. What is finally the procedure
for drawing the data from their distances ?
We want to invert the relation in order to find K as a function of D. This problem does not have a unique solution, which is why we impose that K is centered:

K e = 0, \qquad e^T K = 0^T
Finally, the inner product matrix can be computed from the squared distance matrix through

K = -\frac{1}{2} H D H, \qquad H = I - \frac{e e^T}{n}, \qquad H x = x - \mathrm{mean}(x) e
When H is applied to a vector, it centers the vector by subtracting the mean from each of its elements. Thus, multiplying D by H from the left and from the right centers the matrix: its row and column sums both become equal to 0.
-\frac{1}{2} H D H = -\frac{1}{2} H \left( \mathrm{diag}(K) e^T + e \, \mathrm{diag}(K)^T - 2 K \right) H
= -\frac{1}{2} \left( H \, \mathrm{diag}(K) e^T H + H e \, \mathrm{diag}(K)^T H - 2 H K H \right)
= -\frac{1}{2} (0 + 0 - 2 H K H)
= K
2. Once K has been found, represent X in a Euclidean space preserving exactly the inner products and
thus the distances
From standard matrix theory, we know any symmetric matrix admits a spectral decomposition
K = U \Lambda U^T = (U \Lambda^{1/2})(U \Lambda^{1/2})^T
Where U contains the eigenvectors of K on its columns and \Lambda is a diagonal matrix containing the eigenvalues on its diagonal. We therefore define

X = U \Lambda^{1/2}, \qquad K = X X^T
For the data matrix to be well-defined, all the eigenvalues in \Lambda have to be non-negative reals. Therefore, K must be positive semi-definite. In that case, K is a valid inner product matrix and X is the associated data matrix, which is centered.
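The whole procedure (classical multidimensional scaling) fits in a few lines of NumPy; this is a sketch, assuming D contains squared Euclidean distances:

```python
import numpy as np

def classical_mds(D, m):
    """Recover m-dimensional coordinates from a matrix of squared
    Euclidean distances D, preserving the inner products."""
    n = len(D)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - ee^T/n
    K = -0.5 * H @ D @ H                     # inner product matrix
    lam, U = np.linalg.eigh(K)               # ascending eigenvalues
    lam, U = lam[::-1][:m], U[:, ::-1][:, :m]
    return U * np.sqrt(np.maximum(lam, 0))   # X = U Lambda^{1/2}

# squared distances between three points on a line: 0, 1, 3
P = np.array([[0.0], [1.0], [3.0]])
D = (P - P.T) ** 2
X = classical_mds(D, 1)                      # coordinates recovered up to sign/shift
```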
It means we want to choose the optimal state at each level k. The optimal cost when starting from the initial state is D^* = \min_{s_0} \{ D^*(s_0) \}, and the optimal cost when starting from some intermediate state s_k is:

D^*(s_k) = \min_{(s_{k+1}, \ldots, s_N)} \sum_{i=k+1}^{N} d(s_i \mid s_{i-1})
This allows us to get the recurrence formulas:

D^*(s_N) = 0
D^*(s_k = i) = \min_{s_{k+1}} \left[ d(s_{k+1} \mid s_k = i) + D^*(s_{k+1}) \right]
D^* = \min_{s_0} \{ D^*(s_0) \}
6.2 Edit-distance
Computes the minimal number of insertions, deletions and substitutions (number of edit operations) needed to transform one string into another. We have two strings x and y, where x_i and y_i denote the characters at index i in x and y. The length of x is |x| and, in general, |x| \neq |y|. The substring of x beginning at character i and ending at character j is denoted x_i^j.
We now read the characters of x one by one in order to construct y by using 3 operations
1. Insertion of a character in y without taking any character from x
2. Deletion of a character from x without concatenating it into y
3. Substitution of a character from y by one of x
x_i^{|x|} means we have read the first i - 1 characters of x, i.e. x has been cut from its first i - 1 characters, which have been consumed to construct y. In the same way, y_0^j means that the first j characters of y have been transcribed. We progressively read the characters of x in order to build y.
It corresponds to a process with levels (steps) and states, so we can apply dynamic programming, where each state is characterized by a couple (i, j):

— One state corresponds to (x_i^{|x|}, y_0^j)
— The level k corresponds to i + j = k (constant)
For the 3 operations:

Insertion with respect to x: (x_i^{|x|}, y_0^{j-1}) \to (x_i^{|x|}, y_0^j)
Deletion with respect to x: (x_{i-1}^{|x|}, y_0^j) \to (x_i^{|x|}, y_0^j)
Substitution with respect to x: (x_{i-1}^{|x|}, y_0^{j-1}) \to (x_i^{|x|}, y_0^j)
The first two operations jump from level k to level k+1, while the third one jumps directly to level k+2. This situation can be represented in a 2D table where one level is represented by the constant (i + j), one state is represented by (i, j), and one operation corresponds to a valid transition in this table.
With

\delta_{ij} = 1 \ \text{if} \ x_i \neq y_j, \qquad \delta_{ij} = 0 \ \text{if} \ x_i = y_j
Then, the dynamic programming formula can be applied to this problem:

D^*(x_0^{|x|}, y_0^0) = 0

D^*(x_i^{|x|}, y_0^j) = \min \begin{cases} D^*(x_i^{|x|}, y_0^{j-1}) + 1 \\ D^*(x_{i-1}^{|x|}, y_0^j) + 1 \\ D^*(x_{i-1}^{|x|}, y_0^{j-1}) + \delta_{ij} \end{cases}

Finally, \mathrm{dist}(x, y) = D^*(x_{|x|}^{|x|}, y_0^{|y|}).
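A direct Python implementation of this dynamic programming table (a sketch; D[i][j] plays the role of D^*(x_i^{|x|}, y_0^j)):

```python
def edit_distance(x, y):
    """D[i][j] is the minimal number of operations to transform the
    first i characters of x into the first j characters of y."""
    I, J = len(x), len(y)
    D = [[0] * (J + 1) for _ in range(I + 1)]
    for i in range(1, I + 1):
        D[i][0] = i                              # i deletions
    for j in range(1, J + 1):
        D[0][j] = j                              # j insertions
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            delta = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,       # insertion
                          D[i - 1][j] + 1,       # deletion
                          D[i - 1][j - 1] + delta)  # substitution
    return D[I][J]

assert edit_distance("kitten", "sitting") == 3
```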
6.3 Don’t forget to train to compute on example
It means we want to choose the optimal state at each level k. The optimal cost when starting from the initial state is D^* = \min_{s_0} \{ D^*(s_0) \}, and the optimal cost when starting from some intermediate state s_k is:

D^*(s_k) = \min_{(s_{k+1}, \ldots, s_N)} \sum_{i=k+1}^{N} d(s_i \mid s_{i-1}), \qquad D^*(s_N) = 0
To do so, we unfold the network in time, creating levels, where each level corresponds to a "time step" (or transition). If n is the number of nodes when computing the least cost from any node to node 0, we cannot have more than n-1 steps on the least-cost path; otherwise we would visit the same node at least twice.
We transform the initial problem into a dynamic programming problem with n levels and n nodes (a Directed Acyclic Graph, DAG) and we then use the backward recurrence formula. We define an n × n table D^*(i, k) where i is the index of the node (0 to n-1) located at level k (0 to n-1). Element (i, k) of this table represents D^*(s_k = i). This table contains the minimal cost for reaching destination node 0 from a node of index i at level k, and at each iteration we examine which node j is the most interesting to visit.
D^*(0, n-1) = 0 \quad (you reached destination node 0 at the last level n-1)
D^*(i, n-1) = \infty, \ i \neq 0 \quad (those nodes are forbidden; you are required to reach node 0)
D^*(i, k) = \min_{j \in \mathrm{Succ}(i)} \{ c_{ij} + D^*(j, k+1) \}
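A sketch of this backward recurrence in Python; the cost matrix is a made-up example, and zero-cost self-loops are used so that paths shorter than n-1 steps can simply wait on a node (harmless with non-negative costs):

```python
import numpy as np

def shortest_path_to_0(C):
    """Backward recurrence on the unfolded network: D[i, k] is the minimal
    cost to reach node 0 from node i at level k; C[i, j] is the edge cost
    (np.inf if there is no edge)."""
    n = len(C)
    D = np.full((n, n), np.inf)
    D[0, n - 1] = 0.0                          # destination reached at last level
    for k in range(n - 2, -1, -1):             # backward over levels
        for i in range(n):
            D[i, k] = min(C[i, j] + D[j, k + 1] for j in range(n))
    return D[:, 0]                             # least cost from each node at level 0

inf = np.inf
C = np.array([[0.0, inf, inf],                 # zero-cost self-loops on the diagonal
              [1.0, 0.0, 5.0],
              [inf, 2.0, 0.0]])
print(shortest_path_to_0(C))                   # [0. 1. 3.]
```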
It means we want to choose the optimal state at each level k. The optimal cost when starting from the initial state is D^* = \min_{s_0} \{ D^*(s_0) \}, and the optimal cost when starting from some intermediate state s_k is:

D^*(s_k) = \min_{(s_{k+1}, \ldots, s_N)} \sum_{i=k+1}^{N} d(s_i \mid s_{i-1}), \qquad D^*(s_N) = 0
frames, or by defining a time alignment that allows for warping. In order to obtain a meaningful alignment, we have to add the following constraints:
Monotonicity: i_k \geq i_{k-1}, \quad j_k \geq j_{k-1}
Continuity: i_k - i_{k-1} \leq 1, \quad j_k - j_{k-1} \leq 1
Boundary conditions: i_1 = 1, \ j_1 = 1, \quad i_K = I, \ j_K = J
The problem is then solved by dynamic programming, considering only the valid transitions, using these recurrence relations:

g(1, 1) = d(r_1^n, o_1)

g(i, j) = \min \begin{cases} g(i-1, j) + d(r_i^n, o_j) \\ g(i-1, j-1) + 2 d(r_i^n, o_j) \\ g(i, j-1) + d(r_i^n, o_j) \end{cases}

D(R^n, O) = \frac{1}{I + J} g(I, J)
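A minimal Python sketch of these recurrences, with the Euclidean distance as local distance d (the example sequences are made up):

```python
import numpy as np

def dtw(r, o):
    """Symmetric DTW with the recurrences above; r and o are sequences
    of feature vectors (2-D arrays)."""
    I, J = len(r), len(o)
    g = np.full((I + 1, J + 1), np.inf)
    g[1, 1] = np.linalg.norm(r[0] - o[0])              # g(1,1) = d(r_1, o_1)
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if i == 1 and j == 1:
                continue
            d = np.linalg.norm(r[i - 1] - o[j - 1])
            g[i, j] = min(g[i - 1, j] + d,             # horizontal step
                          g[i - 1, j - 1] + 2 * d,     # diagonal step, weighted 2
                          g[i, j - 1] + d)             # vertical step
    return g[I, J] / (I + J)                           # D(R, O) = g(I, J)/(I+J)

a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0], [3.0]])
print(dtw(a, b))
```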
9.1 Explain the general principles behind the Markov decision processes
We have a set of states S = {1, 2, ..., n} where st = k means that the process is in state k at time t. Each
state has a set of admissible actions U (k) (or control actions, decisions) it can undertake. Let a ∈ U (k) be
an available action in state k. Each action only depends on the current state and is independent of the past
states (independent in time).
Each action a = u(s_t) at time t and in state s_t has a bounded cost 0 \leq c(u(s_t) \mid s_t) < \infty. To jump to state s_{t+1} = k', we have a probability mass P(s_{t+1} = k' \mid s_t = k, u(s_t) = a) = p(k' \mid k, a) that only depends on the current state. If the problem is deterministic, the distribution reduces to a Kronecker delta.
We suppose that the goal state is reachable from each state and that costs are non-negative, so that the network of states does not contain a negative cycle.
The goal state is said to be absorbing because once it is reached, the process stays there with no cost: p(d \mid d, i) = 1 and c(d \mid d) = 0.
The optimal expected cost can be computed thanks to recurrence relations. If a_t = u(k_t) and u(k_t) \in U(k_t):
"∞ #!
X
∗
V (k0 ) = min Es1 ,s2 ,... c(u(st )|st )|s0 = k0 , π
(a0 ,a1 ,...)
t=0
"∞ #
X X
V ∗ (k0 ) = min P (s1 = k1 , s2 = k2 , ...|s0 = k0 , π) × c(u(kt )|kt )
(a0 ,a1 ,...)
k1 ,k2 ,... t=0
X
V ∗ (k0 ) = min P (s1 = k1 |s0 = k0 , u(k0 ) = a0 )
a0
k1
X ∞
X
× c(a0 |k0 ) + min P (s2 = k2 , s3 = k3 , ...|s1 = k1 , π) c(u(kt )|kt )
(a1 ,a2 ,...)
k1 ,k2 ,... t=1
The part of the equation with \min_{(a_1, a_2, \ldots)} is equal to V^*(k_1):
V^*(k_0) = \min_{a_0 \in U(k_0)} \left( \sum_{k_1} p(k_1 \mid k_0, a_0) \left[ c(a_0 \mid k_0) + V^*(k_1) \right] \right)
These are Bellman's equations of optimality. The optimal action for state k is the action that verifies this equation. It yields the following value iteration algorithm (see the sketch below). We know that E_{x,y}[f(x, y)] = E_x[E_y[f(x, y) \mid x]], which allows us to compute the expected cost of a fixed policy. After simplifying with u(k_t) = a_t:

V_\pi(k_0) = c(a_0 \mid k_0) + \sum_{k_1 = 1}^{n} p(k_1 \mid k_0, a_0) V_\pi(k_1)
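A sketch of the value iteration algorithm mentioned above, for a small cost-based MDP with an absorbing goal state (the transition and cost arrays are illustrative):

```python
import numpy as np

def value_iteration(p, c, n_iter=200):
    """p[a][k, k2] = transition probability to k2 under action a in state k;
    c[a][k] = cost of action a in state k. Minimizes expected total cost."""
    n = p[0].shape[0]
    V = np.zeros(n)
    for _ in range(n_iter):
        # Q[a, k] = c(a|k) + sum_k2 p(k2|k,a) V(k2)
        Q = np.array([c[a] + p[a] @ V for a in range(len(p))])
        V = Q.min(axis=0)                     # V*(k) = min_a Q(k, a)
    return V, Q.argmin(axis=0)                # optimal cost and greedy policy

# two states; state 1 is the absorbing goal with zero cost,
# action 1 reaches it faster but costs more per step
p = [np.array([[0.9, 0.1], [0.0, 1.0]]),      # action 0
     np.array([[0.5, 0.5], [0.0, 1.0]])]      # action 1
c = [np.array([1.0, 0.0]), np.array([2.0, 0.0])]
V, policy = value_iteration(p, c)             # V = [4, 0], policy = [1, ...]
```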
9.3 Reinforcement Learning
The goal of reinforcement learning is to learn by doing. We introduce for this the Q-value: the expected cost when choosing action a in state k and then relying on policy \pi:

Q_\pi(k, a) = c(a \mid k) + \sum_{k'} p(k' \mid k, a) V_\pi(k')
Consequently:

V^*(k) = \min_{a \in U(k)} Q^*(k, a)
The Bellman optimality conditions in terms of Q-values become

Q^*(k, a) = c(a \mid k) + \sum_{k'} p(k' \mid k, a) \min_{a' \in U(k')} Q^*(k', a')
We now want to use stochastic approximation to adjust the Q-values when trying some action in some state, by observing the cost and going to the next state. This is related to Q-learning, whose update rule is

\hat{Q}(k, a) \leftarrow \hat{Q}(k, a) + \alpha(t) \left[ c(a \mid k) + \min_{a' \in U(k')} \hat{Q}(k', a') - \hat{Q}(k, a) \right]

where \alpha(t) should decrease gradually according to the theory of stochastic approximation. There should as well be an exploration mechanism to explore the space stochastically.
Sometimes, function approximations are needed when the number of different possible states becomes too
large. The update rule then becomes some sort of gradient descent algorithm.
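A tabular Q-learning sketch under these assumptions (epsilon-greedy exploration, decreasing step size; the toy environment is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning(sample_step, n_states, n_actions, goal, n_episodes=500):
    """Tabular Q-learning for a cost-minimization MDP: sample_step(k, a)
    returns (cost, next_state)."""
    Q = np.zeros((n_states, n_actions))
    for ep in range(1, n_episodes + 1):
        k = rng.integers(n_states)
        alpha = 1.0 / ep                           # decreasing learning rate
        while k != goal:
            if rng.random() < 0.1:                 # exploration
                a = rng.integers(n_actions)
            else:                                  # exploitation
                a = Q[k].argmin()
            cost, k2 = sample_step(k, a)
            target = cost + Q[k2].min()            # c + min_a' Q(k', a')
            Q[k, a] += alpha * (target - Q[k, a])  # stochastic approximation step
            k = k2
    return Q

def sample_step(k, a):
    """Toy chain: action 1 moves toward the goal more reliably but costs more."""
    p_forward = 0.9 if a == 1 else 0.5
    cost = 1.0 + a
    k2 = min(k + 1, 2) if rng.random() < p_forward else k
    return cost, k2

Q = q_learning(sample_step, n_states=3, n_actions=2, goal=2)
```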
10.1 Explain the general principles behind the Markov decision processes
We have a set of states S = {1, 2, ..., n} where st = k means that the process is in state k at time t. Each
state has a set of admissible actions U (k) (or control actions, decisions) it can undertake. Let a ∈ U (k) be
an available action in state k. Each action only depends on the current state and is independent of the past
states (independent in time).
Each action a = u(s_t) at time t and in state s_t has a bounded cost 0 \leq c(u(s_t) \mid s_t) < \infty. To jump to state s_{t+1} = k', we have a probability mass P(s_{t+1} = k' \mid s_t = k, u(s_t) = a) = p(k' \mid k, a) that only depends on the current state. If the problem is deterministic, the distribution reduces to a Kronecker delta.
We suppose that the goal state is reachable from each state and that costs are non-negative, so that the network of states does not contain a negative cycle.
The goal state is said to be absorbing because once it is reached, the process stays there with no cost: p(d \mid d, i) = 1 and c(d \mid d) = 0. We can obtain Bellman's equations of optimality; the optimal action for state k is the action that verifies this equation:
V^*(k) = \min_{a \in U(k)} \left( c(a \mid k) + \sum_{k'=1}^{n} p(k' \mid k, a) V^*(k') \right)
10.3 Reinforcement Learning
The goal of reinforcement learning is to learn by doing. We introduce for this the Q-value: the expected cost when choosing action a in state k and then relying on policy \pi:

Q_\pi(k, a) = c(a \mid k) + \sum_{k'} p(k' \mid k, a) V_\pi(k')
Consequently:

V^*(k) = \min_{a \in U(k)} Q^*(k, a)
The Bellman optimality conditions in terms of Q-values become

Q^*(k, a) = c(a \mid k) + \sum_{k'} p(k' \mid k, a) \min_{a' \in U(k')} Q^*(k', a')
We now want to use stochastic approximation to adjust the Q-values when trying some action in some state, by observing the cost and going to the next state. This is related to Q-learning, whose update rule is

\hat{Q}(k, a) \leftarrow \hat{Q}(k, a) + \alpha(t) \left[ c(a \mid k) + \min_{a' \in U(k')} \hat{Q}(k', a') - \hat{Q}(k, a) \right]

where \alpha(t) should decrease gradually according to the theory of stochastic approximation. There should as well be an exploration mechanism to explore the space stochastically.
Sometimes, function approximations are needed when the number of different possible states becomes too
large. The update rule then becomes some sort of gradient descent algorithm.
Each document j is represented by a vector of word frequencies f_{ij}. The vectors have a dimension of n_w, the total number of words. Hence, each document is represented by

d_j \triangleq \begin{pmatrix} f_{1j} \\ f_{2j} \\ \vdots \\ f_{n_w j} \end{pmatrix}
This is the bag-of-words representation in the word space. The order of the words is not taken into account, and the vector is usually very large and sparse. If we have n_d documents, the term-document matrix becomes:

D \triangleq \begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1 n_d} \\ f_{21} & f_{22} & \cdots & f_{2 n_d} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n_w 1} & f_{n_w 2} & \cdots & f_{n_w n_d} \end{pmatrix}
A query is also represented by a vector where each element is 1 if the word is present in the query and 0 if not:

q \triangleq (0, 1, 0, \ldots, 1)^T
The purpose of the model is to retrieve document d_i based on query q by defining a notion of similarity between the query and the document. The similarity can be defined by the Euclidean distance, but it does not work well because queries do not contain many words. It is much more often defined by the cosine of the angle between the two vectors, the cosine similarity:

\mathrm{sim}(q, d_i) \triangleq \cos(q, d_i) = \frac{q^T d_i}{\|q\| \, \|d_i\|}
We can then redefine q where each word is weighted using the IDF, the weight of the presence of the word. We then compute the cosine as before in order to rank the documents:

q \triangleq \begin{pmatrix} 0 \\ \vdots \\ -\log_2(P(w_i)) \\ \vdots \end{pmatrix}
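A small NumPy sketch of the whole retrieval pipeline (term-document matrix, IDF-weighted query, cosine ranking); the numbers are made up:

```python
import numpy as np

def cosine_scores(D, q):
    """Rank documents (columns of the term-document matrix D) by cosine
    similarity with query vector q."""
    norms = np.linalg.norm(D, axis=0) * np.linalg.norm(q)
    return (q @ D) / norms

# 4 words x 3 documents, raw term frequencies
D = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0],
              [0, 0, 2]], dtype=float)
q = np.array([1.0, 0.0, 1.0, 0.0])        # binary query on words 1 and 3
idf = -np.log2((D > 0).mean(axis=1))      # -log2 P(w_i): rarer word, larger weight
scores = cosine_scores(D, q * idf)        # IDF-weighted query
print(np.argsort(scores)[::-1])           # documents ranked by similarity
```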
One solution is to use sub-space projection methods like Singular Value Decomposition (SVD) or Factor Analysis. Here, we use SVD to reduce the rank of the term-document matrix D from n to m, with m < n. It allows reducing the dimensionality of the space by clustering the words that are semantically similar (used in the same documents). It shows as well which documents are semantically similar to one another (containing the same words). It is similar to a double PCA where you consider the terms as observations and the documents as variables, AND the documents as observations and the terms as variables. This allows us to build a concept space.
The matrix D admits an SVD, D = U \Sigma V^T, with \Sigma a diagonal matrix where \sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_n > 0 when D is of full rank. We are interested in the best matrix of rank m approximating D. To do so, we take \tilde{\Sigma}, the matrix \Sigma where we set to zero every \sigma_i for m + 1 \leq i \leq n. Finally, we compute \tilde{D} = U \tilde{\Sigma} V^T. Then, to compare the query with each document, we compute the similarity with \tilde{d}_i. This works because the SVD allows working in a latent space representing concepts. The main problem is to know how many dimensions we want to keep (the value of m).
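A sketch of this truncated SVD with NumPy (D and the query are toy examples):

```python
import numpy as np

def lsa(D, m):
    """Rank-m approximation of the term-document matrix through a
    truncated SVD; returns D_tilde = U Sigma_tilde V^T."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    s_tilde = s.copy()
    s_tilde[m:] = 0.0                        # null the smallest singular values
    return U @ np.diag(s_tilde) @ Vt

D = np.array([[2, 0, 1, 0],
              [0, 3, 1, 1],
              [1, 1, 0, 2],
              [0, 0, 2, 1]], dtype=float)
D_tilde = lsa(D, m=2)
q = np.array([1.0, 0.0, 1.0, 0.0])
# compare the query against the smoothed documents d_tilde_i
scores = (q @ D_tilde) / (np.linalg.norm(D_tilde, axis=0) * np.linalg.norm(q))
```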
When introducing a query to a vector space model, we obtain a ranking of each document compared with the query.
We then perform query expansion, which means expanding the query based on some relevance feedback from the user and the most relevant documents from the vector model. Each document is represented by a binary vector where each index is equal to 1 if the associated word is present in the document and 0 if not:

d_i = (0, 1, 0, \ldots, 1)^T
We define the probability of observing a document d = x given that this document is relevant (R = 1) for
user uk as P (d = x|R = 1, uk ). However, during the document retrieval phase, we are mainly interested in
P (R = 1|d = x, uk ). The larger this value, the more likely the document x is relevant. This is why this value
has to be calculated for each document in the database.
\lambda = \frac{P(R = 1 \mid d = x, u_k)}{P(R = 0 \mid d = x, u_k)} = \frac{P(d = x \mid R = 1, u_k)}{P(d = x \mid R = 0, u_k)} \times \frac{P(R = 1 \mid u_k)}{P(R = 0 \mid u_k)} = \frac{\prod_{n=1}^{n_w} P(d_n = x_n \mid R = 1, u_k)}{\prod_{n=1}^{n_w} P(d_n = x_n \mid R = 0, u_k)} \times \frac{P(R = 1 \mid u_k)}{P(R = 0 \mid u_k)}
Finally, we can summarize \lambda as a naive Bayes classifier:

\lambda \propto \frac{\prod_{n=1}^{n_w} P(d_n = x_n \mid R = 1, u_k)}{\prod_{n=1}^{n_w} P(d_n = x_n \mid R = 0, u_k)}
Here, P (dn = xn | R = 1, uk ) and P (dn = xn | R = 0, uk ) are the likelihoods estimated by the frequencies.
They can be estimated by the proportion of documents containing the word wn among relevant and irrelevant
documents.
13 PageRank Model
Explain in detail the basic PageRank model for scoring web pages. Describe also its different
extensions (personalization vector, etc). Then describe the «HITS» scoring model and its
bibliometric interpretation
13.1 PageRank model
Corresponds to a measure of "prestige" in a directed graph. The objective is to exploit the link structure between documents to extract information without looking at their content. We view the document repository as a graph where nodes are documents and edges are directed links between them.
Each webpage i has an associated score x_i proportional to the weighted average score of the pages pointing to page i. Let w_{ij} be the weight associated with the link between pages i and j, usually 1 if the two pages are linked and 0 if not, and w_{j\bullet} the out-degree of page j. All those weights are stored in the matrix W; since the graph is directed, this matrix is not necessarily symmetric.
x_i \propto \sum_{j=1}^{n} \frac{w_{ji} x_j}{w_{j\bullet}}, \qquad w_{j\bullet} = \sum_{i=1}^{n} w_{ji}
Thus, a highly scored page is a page which is pointed to by many highly scored pages. We consider a page as important if it is pointed to by many important pages that have few outgoing links. To find those values, the scores are updated iteratively until convergence. Then, all the pages are ranked according to their score.
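A power iteration sketch of this model in NumPy, already including the teleportation term of the Google matrix discussed below (dangling rows are replaced by uniform jumps; the adjacency matrix is a toy example):

```python
import numpy as np

def pagerank(W, alpha=0.85, n_iter=100):
    """Power iteration on the Google matrix G = alpha P + (1-alpha) ee^T/n,
    where P is the row-normalized weight matrix W."""
    n = len(W)
    deg = W.sum(axis=1, keepdims=True)         # out-degrees w_j.
    P = np.divide(W, deg, out=np.full_like(W, 1.0 / n), where=deg > 0)
    G = alpha * P + (1 - alpha) / n            # dangling rows made uniform above
    x = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        x = G.T @ x                            # x <- G^T x
    return x / x.sum()

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)      # node 3 is dangling
print(pagerank(W))                             # scores sum to 1
```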
13.2 Extensions
13.2.1 Google Matrix
Some pages have no outgoing links; in this case, the matrix P is not stochastic anymore because the corresponding rows do not sum to 1. One solution is to allow jumping to any node of the graph at random:
G = \alpha P + (1 - \alpha) \frac{e e^T}{n}
Where G is the Google matrix, e is a vector full of ones, and 0 < \alpha < 1. G is no longer sparse, but it is now a regular stochastic matrix. With a personalization vector v, the uniform teleportation is replaced by

G = \alpha P + (1 - \alpha) e v^T
13.3 HITS

The objective of this algorithm is to find good hubs and authorities from the results of the search engines. Each page is thus assigned a hub score and an authority score (x_i^h, x_i^a). A page's authority score is proportional to the sum of the hub scores of the pages pointing to it:

x_j^a = \eta \sum_{i=1}^{n} w_{ij} x_i^h
And a page's hub score is itself proportional to the sum of the authority scores of the pages it links to:

x_i^h = \mu \sum_{j=1}^{n} w_{ij} x_j^a

In matrix form, x^a = \eta W^T x^h and x^h = \mu W x^a, and thus

x^a = \eta \mu \, W^T W x^a, \qquad x^h = \mu \eta \, W W^T x^h

To obtain the scores, an equivalent method is to take the dominant eigenvectors of W^T W and W W^T, which give the vectors of authority scores and hub scores.
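A minimal HITS sketch; normalizing at each step plays the role of the factors eta and mu (the adjacency matrix is a toy example):

```python
import numpy as np

def hits(W, n_iter=100):
    """Iterate x_a ~ W^T x_h and x_h ~ W x_a, normalizing at each step;
    equivalent to the dominant eigenvectors of W^T W and W W^T."""
    n = len(W)
    xa, xh = np.ones(n), np.ones(n)
    for _ in range(n_iter):
        xa = W.T @ xh
        xa /= np.linalg.norm(xa)       # normalization (role of eta)
        xh = W @ xa
        xh /= np.linalg.norm(xh)       # normalization (role of mu)
    return xa, xh

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1]], dtype=float)
authorities, hubs = hits(W)
```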
13.3.1 Bibliometrics
The HITS algorithm can be used in bibliometrics for co-citations (when two documents are both cited by the same third document) and co-references (when two documents both reference the same document). In this model, the co-reference matrix is closely related to the hub matrix, and the co-citation matrix is closely related to the authority matrix.
14.1 PageRank
Corresponds to a measure of "prestige" in a directed graph. The objective is to exploit the link structure between documents to extract information without looking at their content. We view the document repository as a graph where nodes are documents and edges are directed links between them.
Each webpage i has an associated score x_i proportional to the weighted average score of the pages pointing to page i. Let w_{ij} be the weight associated with the link between pages i and j, usually 1 if the two pages are linked and 0 if not, and w_{j\bullet} the out-degree of page j. All those weights are stored in the matrix W; since the graph is directed, this matrix is not necessarily symmetric.
x_i \propto \sum_{j=1}^{n} \frac{w_{ji} x_j}{w_{j\bullet}}, \qquad w_{j\bullet} = \sum_{i=1}^{n} w_{ji}
Thus, a highly scored page is a page which is pointed to by many highly scored pages. We consider a page as important if it is pointed to by many important pages that have few outgoing links. To find those values, the scores are updated iteratively until convergence. Then, all the pages are ranked according to their score.
14.2 Random Walk
It can be nicely interpreted in terms of random surfing. We define the probability of following the link from page j to page i as:

P(\mathrm{page}(k+1) = i \mid \mathrm{page}(k) = j) = \frac{w_{ji}}{w_{j\bullet}}
We can then rewrite the equation as:

x_i(k+1) = P(\mathrm{page}(k+1) = i) = \sum_{j=1}^{n} P(\mathrm{page}(k+1) = i \mid \mathrm{page}(k) = j) \, x_j(k) = \sum_{j=1}^{n} \frac{w_{ji}}{w_{j\bullet}} x_j(k)
This gives us a Markov model of random surfing through the web, which is the same equation as before when ignoring the iteration index k.
If p_{ij} is the (i, j)th element of the transition probability matrix P, the expression can be rewritten as x_i(k+1) = \sum_{j=1}^{n} p_{ji} x_j(k). In matrix form, we have x(k+1) = P^T x(k). The stationary distribution is given by x(k+1) = x(k) = x and thus:

x = P^T x
x_i is thus viewed as the probability of being at page i. The solution to these equations is the stationary distribution of the random surf, which is the probability of finding the surfer on page i in the long run. The PageRank scores can then be obtained by computing the left eigenvector of P corresponding to eigenvalue 1. If the graph is undirected, the scores are simply proportional to the degrees of the nodes.
14.3 Extensions
14.3.1 Google Matrix
Some pages have no outgoing links; in this case, the matrix P is not stochastic anymore because the corresponding rows do not sum to 1. One solution is to allow jumping to any node of the graph at random:
G = \alpha P + (1 - \alpha) \frac{e e^T}{n}
Where G is the Google matrix, e is a vector full of ones, and 0 < \alpha < 1. G is no longer sparse, but it is now a regular stochastic matrix. With a personalization vector v, the uniform teleportation is replaced by

G = \alpha P + (1 - \alpha) e v^T
15 Collaborative Recommendation
Describe in detail the basic model of collaborative recommendation, namely, the model based on k nearest neighbors. Also describe in detail the ItemRank model of collaborative recommendation (also called "random walk with restart"), inspired by PageRank. Finally, how do we assess a collaborative recommendation system?
We have individuals (i) and items (j). Each individual is represented by a profile vector v_i in the item space, where v_{ij} = 1 if i bought item j and 0 otherwise. We also compute the individuals-items frequency matrix W, containing elements w_{ij} that indicate whether item j was purchased by i, or how many times (two variants).
We first compute a similarity between two individuals i and j based on their profile vectors. There exist many possibilities, like the cosine similarity, or a matching coefficient such as

sim(i, j) = a/(a + b + c)

where a is the number of items purchased by both individuals and b, c are the numbers of items purchased by only one of the two.
From those similarities, one can compute the k nearest neighbors of each individual and compute some predicted value for the items. The first items proposed to this individual are the ones with the highest predicted value.
The predicted value of item i for individual p is computed as the number of times item i has been purchased by p's neighbors, weighted by the similarities between the individuals:
\mathrm{pred}(p, i) = \frac{\sum_{q \in N(p)} \mathrm{sim}(p, q) \, w_{qi}}{\sum_{q \in N(p)} \mathrm{sim}(p, q)}
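A sketch of this neighborhood-based prediction with cosine similarities (the purchase matrix is a toy example; a small constant avoids division by zero):

```python
import numpy as np

def predict(W, p, k=2):
    """Predicted values of every item for individual p, using cosine
    similarities between profile vectors and the k nearest neighbors."""
    norms = np.linalg.norm(W, axis=1)
    sims = (W @ W[p]) / (norms * norms[p] + 1e-12)
    sims[p] = -np.inf                          # exclude p itself
    neigh = np.argsort(sims)[::-1][:k]         # k nearest neighbors N(p)
    s = sims[neigh]
    return (s @ W[neigh]) / s.sum()            # similarity-weighted average

# individuals x items purchase matrix
W = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0]], dtype=float)
scores = predict(W, p=0)
recommend = np.argsort(scores)[::-1]           # highest predicted value first
```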
15.2 ItemRank
Extension of PageRank that can be used for collaborative recommendation, based on the random walk with restart idea. It assumes that a bipartite user-item graph has already been built. The score vector is updated as

x(t+1) = \alpha P^T x(t) + (1 - \alpha) v_i

where v_i contains 1/n_i for every item user i bought and 0 for the rest, with n_i the number of bought items, and P the transition probability matrix derived from the graph.
The random walker thus has a probability 1 - \alpha of being teleported to an item node bought by user i. The steady-state solution provides the similarity score associated with each item for user i.
It is also important to look at the "surprise" (unexpected) with some recommendations, to not always
recommend the popular ones. There must be some diversity in the recommended items.
16.1 Latent Class Model
This model is used for collaborative filtering. We have x, a random variable representing individuals (m in total), and y, representing the items (n in total). There exists a latent, unobserved class z representing classes of interests of the users and classes of items (l in total). The variables x and y are assumed conditionally independent given z. The number of latent classes l is given a priori.
In this case, we are interested in the posterior probability distribution of the items j for some individual i, P(y = j \mid x = i), from which the most probable items will be recommended to the user. P(x = x, y = y, z = k) is often simplified to P(x, y, k) and represents the discrete probability mass function; x and y were defined earlier and i, j, k are simple (instance) variables.
The parameters of the model are defined as the probability masses of the discrete random variables:
— P(k) = P(z = k), the class prior probabilities
— P(x|k) = P(x = x \mid z = k), the within-class observation probabilities for users
— P(y|k) = P(y = y \mid z = k), the within-class observation probabilities for items
The posterior probability of an arbitrary item y for a user x can be computed from:

P(y \mid x) = \frac{\sum_{k=1}^{l} P(x \mid k) P(y \mid k) P(k)}{\sum_{k'=1}^{l} P(x \mid k') P(k')}
Variable z is considered as an unobserved hidden latent variable. It appears in the complete likelihood as a
random variable, not yet known. The parameters are estimated by maximum likelihood but this function is
hard to maximize.
16.2 EM Algorithm
This algorithm is used to estimate the parameters of the latent class model. First, we need to compute the log-likelihood of the data, l. The EM algorithm then iterates two steps, increasing the likelihood each time, until convergence:
1. Expectation step: estimates the value of z for each (user, item) pair, so it performs a clustering. It computes the expectation of l given the current value of the parameters \hat{\Theta} and the observations (in vectors x and y): E[l \mid x, y; \hat{\Theta}].
2. Maximization step: provides re-estimation formulas for the parameters when class memberships are known. It maximizes the expectation of the log-likelihood in terms of the parameters, assuming the hidden variables are fixed.
After convergence, the predictions are then given by P̂ (y|x).
It is also important to look at the "surprise" (unexpectedness) of some recommendations, so as not to always recommend the most popular items. There must be some diversity in the recommended items.
17 Collaborative recommendation + non-negative matrix factorization
Describe in detail the basic model of collaborative recommendation, namely, the model based on k nearest neighbors. Also explain in detail the non-negative matrix factorization techniques for collaborative recommendation based on ratings. Moreover, how can we assess a collaborative recommendation system?
We have individuals (i) and items (j). Each individual is represented by a profile vector v_i in the item space, where v_{ij} = 1 if i bought item j and 0 otherwise. We also compute the individuals-items frequency matrix W, containing elements w_{ij} that indicate whether item j was purchased by i, or how many times (two variants).
We first compute a similarity between two individuals i and j based on their profile vectors. There exist many possibilities, like the cosine similarity, or a matching coefficient such as

sim(i, j) = a/(a + b + c)

where a is the number of items purchased by both individuals and b, c are the numbers of items purchased by only one of the two.
From those similarities, one can compute the k nearest neighbors of each individual and compute some predicted value for the items. The first items proposed to this individual are the ones with the highest predicted value.
The predicted value of item i for individual p is computed as the number of times item i has been purchased by p's neighbors, weighted by the similarities between the individuals:
\mathrm{pred}(p, i) = \frac{\sum_{q \in N(p)} \mathrm{sim}(p, q) \, w_{qi}}{\sum_{q \in N(p)} \mathrm{sim}(p, q)}
To compute U and V in the factorization W \approx U V^T, we can use the alternating least squares algorithm to try to reconstruct the ratings W:

\min_U \| W - U V^T \|_F^2 \quad \text{subject to} \quad U \geq 0

and

\min_V \| W - U V^T \|_F^2 \quad \text{subject to} \quad V \geq 0
However, this works only when W does not contain too many missing values (when it is not sparse): the null elements bias the predictions. We then need to avoid the missing values in the objective function, and minimize:

\sum_{(i,j) \in \epsilon} (w_{ij} - u_i^T v_j)^2

where \epsilon is the set of observed ratings.
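A sketch of alternating least squares restricted to the observed entries; note that this version adds a small ridge term for numerical stability and omits the non-negativity constraint (one could clip the factors after each update to recover it):

```python
import numpy as np

def als(W, mask, m=2, lam=0.1, n_iter=50):
    """Alternating least squares on observed entries only: minimizes
    sum over observed (i,j) of (w_ij - u_i^T v_j)^2, plus a small ridge."""
    rng = np.random.default_rng(0)
    n_u, n_i = W.shape
    U, V = rng.random((n_u, m)), rng.random((n_i, m))
    for _ in range(n_iter):
        for i in range(n_u):                   # update each user factor u_i
            J = mask[i]
            A = V[J].T @ V[J] + lam * np.eye(m)
            U[i] = np.linalg.solve(A, V[J].T @ W[i, J])
        for j in range(n_i):                   # update each item factor v_j
            I = mask[:, j]
            A = U[I].T @ U[I] + lam * np.eye(m)
            V[j] = np.linalg.solve(A, U[I].T @ W[I, j])
    return U, V

W = np.array([[5, 4, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [0, 1, 5, 4]], float)
mask = W > 0                                   # observed ratings only
U, V = als(W, mask)
pred = U @ V.T                                 # fills in the missing ratings
```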
It is also important to look at the "surprise" (unexpectedness) of some recommendations, so as not to always recommend the most popular items. There must be some diversity in the recommended items.
18 Reputation Model
Describe a simple reputation model, its parameters and its assumptions. How do we estimate its parameters? Moreover, how can we assess a collaborative recommendation system?
The quality x_{ki} of the item i sent by the provider k is assumed to be normally distributed and centered on the quality of the provider:

x_{ki} = q_k + \epsilon_{ki}^x, \qquad \epsilon_{ki}^x \sim N(0, \sigma_k^x)
Each provider is characterized by two features: his internal score q and his stability in providing a constant quality, \sigma^x. The consumer l who ordered item i will rate the transaction based on the inspection of its quality x_{ki}. The consumer is characterized by three parameters: his reactivity with respect to the quality of the provided item, a; his bias, b; and his stability in providing constant ratings for a fixed observed quality, \sigma^y.
The rating y_{kli} provided by consumer l for transaction i with provider k is modeled by the linear regression

y_{kli} = a_l x_{ki} + b_l + \epsilon_{li}^y, \qquad \epsilon_{l}^y \sim N(0, \sigma_l^y)
By setting a_l = 1 and assuming the provider always provides the same quality level, we can simplify the model to

y_{kli} = q_k + b_l + \epsilon_{li}
In this case, the model becomes

\hat{q}_k \leftarrow \frac{\sum_{l=1}^{n_c} \frac{1}{\hat{\sigma}_l^2} \sum_{i \in \{k \to l\}} (y_{kli} - \hat{b}_l)}{\sum_{l'=1}^{n_c} \frac{n_{kl'}}{\hat{\sigma}_{l'}^2}}, \quad \text{for all } k

\hat{b}_l \leftarrow \frac{\sum_{k=1}^{n_p} \sum_{i \in \{k \to l\}} (y_{kli} - \hat{q}_k)}{\sum_{k'=1}^{n_p} n_{k'l}}, \quad \text{for all } l \text{ (and then center the } \hat{b}_l\text{)}

\hat{\sigma}_l^2 \leftarrow \frac{\sum_{k=1}^{n_p} \sum_{i \in \{k \to l\}} (y_{kli} - \hat{q}_k - \hat{b}_l)^2}{\sum_{k'=1}^{n_p} n_{k'l}}, \quad \text{for all } l

with n_{kl} = \sum_{i \in \{k \to l\}} 1.
19 Markov Chains + Absorbing
Describe in detail what a Markov chain is and derive its evolution equation, as well as how to compute the absorbing probabilities (probability of being absorbed in an absorbing state) in the case of absorbing Markov chains. Also discuss some applications.
Matrix P^2 is the two-step transition probabilities matrix; more generally, [P^\tau]_{ij} = P(s_{t+\tau} = j \mid s_t = i). If x(t) is the column vector containing the probability distribution of finding the process in each state of the Markov chain at time step t:

x_i(t) = P(s_t = i) = \sum_{k=1}^{n} P(s_t = i, s_{t-1} = k) = \sum_{k=1}^{n} P(s_t = i \mid s_{t-1} = k) P(s_{t-1} = k) = \sum_{k=1}^{n} p_{ki} x_k(t-1)

and, in matrix form, x(t) = P^T x(t-1).
k=1
This is the time evolution equation of the Markov chain providing the vector of probabilities of being in
each state a long time x(t) given the initial probabilities x(0). When starting from state i, x(0) = ei and
xj (t|s0 = i) = xj|i (t) = (ei )T P t ej , this corresponds to the probability of observing the process in state j at
time t when starting from state i at time t = 0
The transition matrix of an absorbing Markov chain can be put in a canonical form with a block Q of transitions between transient states. Since Q is sub-stochastic, Q^t \to 0 when t \to \infty. The matrix N = (I - Q)^{-1} is called the fundamental matrix of the absorbing Markov chain. If i, j are transient states, entry i, j of N is:
n_{ij} = e_i^T N e_j = e_i^T \left( \sum_{t=0}^{\infty} Q^t \right) e_j = \sum_{t=0}^{\infty} x_{j|i}(t)
Each nij contains the expected number of passages or visits through transient state j, when starting from
state i. The expected number of visits before being absorbed when starting from each state is n = N e.
We can also compute the probability of being absorbed by state j given that we started in state i
b_{ij} = \sum_{t=0}^{\infty} \sum_{k=1}^{n_{tr}} x_{k|i}(t) \, r_{kj} = \sum_{k=1}^{n_{tr}} n_{ik} r_{kj} = [N R]_{ij}
where the sum over k is taken on the set of transient states only. All those probabilities can be gathered into the matrix B = N R.
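A short NumPy sketch computing N, n and B for a toy absorbing chain in canonical form:

```python
import numpy as np

# canonical form: states 0,1 transient; states 2,3 absorbing
P = np.array([[0.5, 0.2, 0.2, 0.1],
              [0.1, 0.5, 0.1, 0.3],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
Q = P[:2, :2]                        # transient -> transient block
R = P[:2, 2:]                        # transient -> absorbing block
N = np.linalg.inv(np.eye(2) - Q)     # fundamental matrix N = (I - Q)^{-1}
n = N @ np.ones(2)                   # expected visits before absorption
B = N @ R                            # absorption probabilities (rows sum to 1)
print(n)
print(B)
```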
P(s_{t+2} = j \mid s_t = i):

P(s_{t+2} = j \mid s_t = i) = \sum_{k=1}^{n} P(s_{t+2} = j, s_{t+1} = k \mid s_t = i)
= \sum_{k=1}^{n} P(s_{t+2} = j \mid s_t = i, s_{t+1} = k) \, P(s_{t+1} = k \mid s_t = i)
= \sum_{k=1}^{n} P(s_{t+2} = j \mid s_{t+1} = k) \, P(s_{t+1} = k \mid s_t = i)
= \sum_{k=1}^{n} p_{kj} p_{ik} = [P^2]_{ij}
Matrix P^2 is the two-step transition probabilities matrix; more generally, [P^\tau]_{ij} = P(s_{t+\tau} = j \mid s_t = i). If x(t) is the column vector containing the probability distribution of finding the process in each state of the Markov chain at time step t:

x_i(t) = P(s_t = i) = \sum_{k=1}^{n} P(s_t = i, s_{t-1} = k) = \sum_{k=1}^{n} P(s_t = i \mid s_{t-1} = k) P(s_{t-1} = k) = \sum_{k=1}^{n} p_{ki} x_k(t-1)

and, in matrix form, x(t) = P^T x(t-1).
This is the time evolution equation of the Markov chain, providing the vector of probabilities of being in each state at time t, x(t), given the initial probabilities x(0). When starting from state i, x(0) = e_i and x_j(t \mid s_0 = i) = x_{j|i}(t) = (e_i)^T P^t e_j; this corresponds to the probability of observing the process in state j at time t when starting from state i at time t = 0.
The transition matrix can be put in a canonical form with blocks Q (transient to transient) and R (transient to absorbing) corresponding to transient and absorbing states. Both are sub-stochastic: their row sums are \leq 1 and at least one row sum is < 1. P^t can then be computed blockwise.
Since Q is sub-stochastic, Q^t \to 0 when t \to \infty. The matrix N = (I - Q)^{-1} is called the fundamental matrix of the absorbing Markov chain. If i, j are transient states, entry i, j of N is:
n_{ij} = e_i^T N e_j = e_i^T \left( \sum_{t=0}^{\infty} Q^t \right) e_j = \sum_{t=0}^{\infty} x_{j|i}(t)
Each nij contains the expected number of passages or visits through transient state j, when starting from
state i. The expected number of visits before being absorbed when starting from each state is n = N e.
Matrix P^2 is the two-step transition probabilities matrix; more generally, [P^\tau]_{ij} = P(s_{t+\tau} = j \mid s_t = i). If x(t) is the column vector containing the probability distribution of finding the process in each state of the Markov chain at time step t:

x_i(t) = P(s_t = i) = \sum_{k=1}^{n} P(s_t = i, s_{t-1} = k) = \sum_{k=1}^{n} P(s_t = i \mid s_{t-1} = k) P(s_{t-1} = k) = \sum_{k=1}^{n} p_{ki} x_k(t-1)

x(t) = P^T x(t-1)
This is the time evolution equation of the Markov chain, providing the vector of probabilities of being in each state at time t, x(t), given the initial probabilities x(0). When starting from state i, x(0) = e_i and x_j(t \mid s_0 = i) = x_{j|i}(t) = (e_i)^T P^t e_j; this corresponds to the probability of observing the process in state j at time t when starting from state i at time t = 0.
It provides the expected profit from a customer before he leaves the company, knowing that m_i = 0 for the absorbing state.
s(t) are the random variables for the hidden states, taking their values from 1 to n; there are no observations associated with the states themselves. x(t) are the random variables for the possible observations, whose discrete values are denoted o_i. \Pi = \{\pi_i\} are the initial state probabilities P(s(1) = i). P = \{p_{ij}\} are the state transition probabilities P(s(t+1) = j \mid s(t) = i). B = \{b_i(o_k)\} are the observation or emission probabilities within the states, P(o_k \mid s_t = i). Finally, \theta = \{\Pi, P, B\}.
22.2 Likelihood Computation
We want to compute P(x \mid \theta). To do so, we define the forward and backward procedures. With the backward variables \beta:

\beta_i(T) = 1, \qquad P(x \mid \theta) = \sum_{i=1}^{n} \pi_i b_i(x_1) \beta_i(1)
s(t) are the random variables for the hidden states, taking their values from 1 to n; there are no observations associated with the states themselves. x(t) are the random variables for the possible observations, whose discrete values are denoted o_i. \Pi = \{\pi_i\} are the initial state probabilities P(s(1) = i). P = \{p_{ij}\} are the state transition probabilities P(s(t+1) = j \mid s(t) = i). B = \{b_i(o_k)\} are the observation or emission probabilities within the states, P(o_k \mid s_t = i). Finally, \theta = \{\Pi, P, B\}.
23.2 Optimal state sequence
To find the most probable state at time t given the observations, one can maximize \gamma_i(t). To find the state sequence that best explains the observations, we can use the Viterbi algorithm, which is a dynamic programming algorithm:

\delta_j(1) = \pi_j b_j(x_1)
\delta_j(t+1) = \max_i \{ \delta_i(t) \, p_{ij} \, b_j(x_{t+1}) \}
\psi_j(t+1) = \arg\max_i \{ \delta_i(t) \, p_{ij} \, b_j(x_{t+1}) \}
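A compact Viterbi sketch in NumPy, following these recurrences and backtracking through \psi (the model parameters are a toy example):

```python
import numpy as np

def viterbi(pi, P, B, obs):
    """Most probable state sequence for observation indices `obs`,
    with initial probabilities pi, transitions P and emissions B."""
    T, n = len(obs), len(pi)
    delta = np.zeros((T, n))
    psi = np.zeros((T, n), dtype=int)
    delta[0] = pi * B[:, obs[0]]                 # delta_j(1) = pi_j b_j(x_1)
    for t in range(1, T):
        trans = delta[t - 1][:, None] * P        # delta_i(t) p_ij
        psi[t] = trans.argmax(axis=0)            # best predecessor for each j
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                # backtrack through psi
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])           # b_i(o_k)
print(viterbi(pi, P, B, obs=[0, 0, 1, 1]))
```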
where \hat{\pi}_i = \gamma_i(1) is the probability of starting from state i. Now we can compute the new estimates of the model parameters:

\hat{p}_{ij} = \frac{\text{expected number of transitions from } i \text{ to } j}{\text{expected number of transitions out of state } i} = \frac{\sum_{t=1}^{T-1} \gamma_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}

\hat{b}_i(o_k) = \frac{\text{expected number of emissions of } o_k \text{ in state } i}{\text{total number of emissions in state } i} = \frac{\sum_{t : x_t = o_k} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}
Then, the computation of the forward and backward variables \alpha and \beta and of the parameter estimates for \Pi, P, B is iterated until convergence. At each iteration, the likelihood increases.