
LINFO2275 : Exam Questions

May 2023

1 PCA + Feature Selection


Describe and derive in detail the method allowing to obtain the principal axes of a principal
components analysis (one technique based on the empirical data and one technique based on
the mathematical definitions). Interpret the results, e.g., what is the interpretation of the
eigenvalues? How can we obtain the coordinates of the data in the principal components
system? Finally, describe some methods for performing feature selection (and not extraction).

1.1 PCA
The idea is to project the observations on the axis carrying the maximum variance of the projected data:
— among all possible axes passing through the centroid of the cloud of points,
— so that this axis carries maximal information on the data.
The total variance of a cloud of points can be decomposed as follows:

σ² = 1/(n−1) ∑_{i=1}^{n} (x_i − g)^T (x_i − g)

where g is the centroid of the cloud of points: g = (1/n) ∑_{i=1}^{n} x_i. The operator of orthogonal projection π on an axis having unit direction vector v is π = vv^T, with v^T v = 1. We thus find the variance of the data projected on axis v:
on axis v :
σ_v² = 1/(n−1) ∑_{i=1}^{n} (π(x_i − g))^T (π(x_i − g))
     = 1/(n−1) ∑_{i=1}^{n} (x_i − g)^T v v^T (x_i − g)
     = v^T [ 1/(n−1) ∑_{i=1}^{n} (x_i − g)(x_i − g)^T ] v
     = v^T Σ v
where Σ is the variance-covariance matrix of the data, which is symmetric positive semi-definite. We want to maximize the projected variance

max_v (v^T Σ v) subject to v^T v = 1

This is a standard constrained optimization problem with an equality constraint that can be solved with a Lagrange function:

L = v^T Σ v + λ(1 − v^T v), ∂_v L = 0

This gives an eigenvalue/eigenvector problem Σv = λv. Since Σ is symmetric positive semi-definite, the eigenvalues are all non-negative and the eigenvectors can be chosen orthogonal. The variance of the projected data is equal to

σ_v² = v^T Σ v = v^T λ v = λ

so each eigenvalue is exactly the variance carried by its principal axis. The solution is to project the data onto the associated eigenvectors, starting with the one of maximal variance; the coordinates of the data in the principal components system are obtained as z_i = V^T (x_i − g), where the columns of V are the retained eigenvectors.
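As an illustration (not part of the original notes), here is a minimal NumPy sketch of this empirical procedure: center the data, eigendecompose the sample covariance matrix, and project. Variable names are my own.

import numpy as np

def pca(X, n_components=2):
    # Center the data around the centroid g
    g = X.mean(axis=0)
    Xc = X - g
    # Sample variance-covariance matrix Sigma (normalized by n - 1)
    Sigma = np.cov(Xc, rowvar=False)
    # Eigendecomposition; eigh handles symmetric matrices and returns
    # eigenvalues in ascending order, so we re-sort them descending
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs[:, :n_components]   # principal axes (unit direction vectors)
    Z = Xc @ V                      # coordinates of the data on these axes
    return Z, V, eigvals            # eigvals = variances of the projections

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, V, variances = pca(X)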

1.2 Feature Selection
Feature selection alleviates the effect of the curse of dimensionality. It also enhances the generalization capability of the model and speeds up the learning process.

Maximum relevance selection : compute the associations between the target and the features and keep only the most significant ones. The associations can be measured through mutual information or statistical tests.

Minimal redundancy selection : compute the associations between the features themselves and discard the ones that are strongly correlated with each other.

Stepwise regression : a greedy algorithm that adds the best (or deletes the worst) feature at each step. The main issue is knowing when to stop, often when the likelihood-ratio test is no longer significant.

L1 regularization : add an L1 penalty to the objective function (the log-likelihood, for example); it drives the weights of irrelevant features to zero.

Some algorithms, like decision trees or bagging, perform feature selection on their own.
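A hedged sketch of the first two filters above, using plain correlation as the association measure (an assumption for illustration; mutual information or statistical tests would be plugged in the same way):

import numpy as np

def select_features(X, y, n_keep=3, redundancy_threshold=0.9):
    n_features = X.shape[1]
    # Maximum relevance: |correlation| between each feature and the target
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_features)])
    selected = []
    for j in np.argsort(relevance)[::-1]:   # most relevant first
        # Minimal redundancy: skip features strongly correlated
        # with a feature that was already kept
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < redundancy_threshold
               for k in selected):
            selected.append(j)
        if len(selected) == n_keep:
            break
    return selected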

2 Discriminant Analysis + Feature Selection


Derive and explain the decomposition of the variance into within- and between-cluster variance,
as well as the technique allowing to obtain the factorial axes of a discriminant analysis. Interpret
these results: how can we interpret the computed eigenvalues? In addition, explain the intuition
behind the classification algorithm that is used in the context of a discriminant analysis after
having projected the data on the discriminant axes (a mixture of Gaussians). Finally, describe
some methods for performing feature selection (and not extraction).

2.1 Discriminant Analysis


Discriminant analysis (two usages) :
- a feature extraction method (like PCA) taking a categorical dependent variable into account
- a classification method using a Gaussian mixture generative model
It only applies to numerical features (an extension to categorical variables is possible).
Feature extraction : seek the axis maximizing the ratio between the between-class and the total variance
of the projected data.

We can compute the total sample variance (with g as the general centroid) (see slide 49):

σ² = (1/n) ∑_{i=1}^{n} (x_i − g)^T (x_i − g) = (1/n) ∑_{k=1}^{q} SS(k)

We can decompose the sum of squares (or inertia) for each class k into a within and a between sum of squares (see slide 50):

SS(k) = ∑_{i∈C(k)} (x_i − g)^T (x_i − g) = ∑_{i∈C(k)} ∥x_i − g(k)∥² + n(k) ∥g(k) − g∥²

The first term is the within variance while the second is the between variance. Now, we can write the total variance as a within and a between contribution (see slide 51):

σ² = (1/n) [ ∑_{k=1}^{q} ∑_{i∈C(k)} ∥x_i − g(k)∥² + ∑_{k=1}^{q} n(k) ∥g(k) − g∥² ] = σ²_{(w)} + σ²_{(b)}

We can use the projection operator π = vv^T with v^T v = 1 to find the axis maximizing the between-on-total projected variance ratio.
— For the within variance of the projected data (see slides 53, 54):

σ²_{v(w)} = (1/n) ∑_{k=1}^{q} ∑_{i∈C(k)} (πx_i − πg(k))^T (πx_i − πg(k)) = v^T W v

— For the between variance after the projection (see slide 55):

σ²_{v(b)} = v^T B v
We are looking for the axis that discriminates the classes as much as possible (maximizing the between/total ratio):

max_v (σ²_{v(b)} / σ_v²) = max_v (v^T B v / v^T T v)

where T = W + B is the total variance-covariance matrix. Setting the gradient to zero,

∂_v (v^T B v / v^T T v) = 0

we have 2 (v^T T v) B v − 2 (v^T B v) T v = 0, and therefore B v = λ T v with

λ = (v^T B v) / (v^T T v), with 0 < λ < 1

From the previous equation, we obtain the eigensystem problem T^{−1} B v = λ v, where there are at most (q − 1) non-zero eigenvalues.
→ We select the largest non-zero eigenvalues/eigenvectors. The normalized eigenvectors correspond to the direction vectors of the discriminant axes. Then, we project the data on these axes, providing the coordinates or discriminant scores: z_i = v^T x_i. Each eigenvalue is the proportion of between-class variance on the corresponding axis, i.e., how well that axis separates the classes.
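As a sketch (my own illustration, assuming NumPy/SciPy), the discriminant axes can be obtained by solving the generalized eigenproblem B v = λ T v directly:

import numpy as np
from scipy.linalg import eigh

def discriminant_axes(X, y, n_axes=1):
    n = X.shape[0]
    g = X.mean(axis=0)
    T = (X - g).T @ (X - g) / n                # total covariance matrix
    B = np.zeros_like(T)                       # between-class covariance
    for k in np.unique(y):
        Xk = X[y == k]
        gk = Xk.mean(axis=0)
        B += (len(Xk) / n) * np.outer(gk - g, gk - g)
    # Generalized symmetric eigenproblem B v = lambda T v;
    # eigh returns the eigenvalues in ascending order
    lam, V = eigh(B, T)
    order = np.argsort(lam)[::-1][:n_axes]
    return lam[order], V[:, order]             # lam in (0, 1): separability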

2.2 Bayesian classification through a gaussian mixture
We compute the a posteriori probability through Bayes' formula

P(y = w_i | z) = P(w_i) p(z | y = w_i) / ∑_{k=1}^{q} P(w_k) p(z | y = w_k)

where each class-conditional density p(z | y = w_k) is a Gaussian estimated on the projected data, so that the overall model is a mixture of Gaussians. Each observation z is then assigned to the class k for which the posterior probability is maximal.

2.3 Feature Selection


Feature selection alleviates the effect of the curse of dimensionality. It also enhances the generalization capability of the model and speeds up the learning process.

Maximum relevance selection : compute the associations between the target and the features and keep only the most significant ones. The associations can be measured through mutual information or statistical tests.

Minimal redundancy selection : compute the associations between the features themselves and discard the ones that are strongly correlated with each other.

Stepwise regression : a greedy algorithm that adds the best (or deletes the worst) feature at each step. The main issue is knowing when to stop, often when the likelihood-ratio test is no longer significant.

L1 regularization : add an L1 penalty to the objective function (the log-likelihood, for example); it drives the weights of irrelevant features to zero.

Some algorithms, like decision trees or bagging, perform feature selection on their own.

3 CCA + feature selection


Describe and derive in detail the method allowing to perform a canonical correlation
analysis. Interpret the results, e.g., what is the interpretation of the eigenvalues? How can we
obtain the coordinates of the data in the principal components system? Finally, describe some
methods for performing feature selection (and not extraction).

3.1 CCA
Canonical correlation analysis (CCA) is a method allowing to analyze the relationships between two sets of features (and data sets) X and Y, measured on the same samples. These data are realizations of two random vectors x and y. The method computes a linear combination of the random vector x and a linear combination of the random vector y:
— it thus computes two scores z_x and z_y (the linear combinations) which are maximally correlated;
— it therefore defines score spaces that explain as much as possible the relationships between the two sets of features.
It is used in multi-view learning (different sources of data), e.g., athletes and their results in athletics competitions.
We define the centroids

g_x = E[x], g_y = E[y]  (1)

as well as the linear combinations for the two sets (which can also be viewed as projections)

z_x = u_x^T x, z_y = u_y^T y  (2)
We have to maximize the covariance between the two random variables z_x and z_y

max_{u_x, u_y} (cov(z_x, z_y))  (3)

subject to, e.g., unit-variance constraints

var(z_x) = 1, var(z_y) = 1  (4)

Let us define the centered variables

z̃_x = u_x^T (x − g_x), z̃_y = u_y^T (y − g_y)  (5)
We have to compute the covariance

cov(z_x, z_y) = E[z̃_x z̃_y]
             = E[u_x^T (x − g_x) u_y^T (y − g_y)]
             = E[u_x^T (x − g_x)(y − g_y)^T u_y]
             = u_x^T E[(x − g_x)(y − g_y)^T] u_y  (6)

where E[(x − g_x)(y − g_y)^T] is the covariance matrix between the two sets of variables.


Finally, let us compute the variances

var(z_x) = E[z̃_x²] = E[u_x^T (x − g_x) u_x^T (x − g_x)] = u_x^T E[(x − g_x)(x − g_x)^T] u_x  (7)

var(z_y) = u_y^T E[(y − g_y)(y − g_y)^T] u_y  (8)


We define the following variance-covariance matrices

S_xy = E[(x − g_x)(y − g_y)^T], S_xx = E[(x − g_x)(x − g_x)^T], S_yy = E[(y − g_y)(y − g_y)^T]  (9)

Take care: S_xy is not square.


Their estimates based on empirical data (the sample variance-covariance matrices) are

Σ_xy = 1/(n−1) ∑_{k=1}^{n} (x_k − g_x)(y_k − g_y)^T
Σ_xx = 1/(n−1) ∑_{k=1}^{n} (x_k − g_x)(x_k − g_x)^T
Σ_yy = 1/(n−1) ∑_{k=1}^{n} (y_k − g_y)(y_k − g_y)^T  (10)

where n is the number of observations.

We thus have to maximize the covariance between the two sets of variables

max_{u_x, u_y} (u_x^T Σ_xy u_y)  (11)

subject to the equality constraints

u_x^T Σ_xx u_x = 1, u_y^T Σ_yy u_y = 1  (12)

We therefore define the Lagrange function

L = u_x^T Σ_xy u_y + (λ_x/2)(1 − u_x^T Σ_xx u_x) + (λ_y/2)(1 − u_y^T Σ_yy u_y)  (13)

and set the gradients equal to zero

∂_{u_x} L = 0, ∂_{u_y} L = 0  (14)

We obtain

Σ_xy u_y − λ_x Σ_xx u_x = 0
Σ_yx u_x − λ_y Σ_yy u_y = 0  (15)

which leads to the following eigenvalue/eigenvector problems (with λ = λ_x λ_y):

Σ_xx^{−1} Σ_xy Σ_yy^{−1} Σ_yx u_x = λ u_x
Σ_yy^{−1} Σ_yx Σ_xx^{−1} Σ_xy u_y = λ u_y  (16)

Now, from the previous formula (15):
— by pre-multiplying the first equation by u_x^T,
— by pre-multiplying the second equation by u_y^T,
— we obtain that cov(z_x, z_y) = λ_x = λ_y,
— and thus λ = λ_x λ_y = cov²(z_x, z_y) ≥ 0.
From the optimality conditions, we easily obtain

u_x = (1/λ_x) Σ_xx^{−1} Σ_xy u_y
u_y = (1/λ_y) Σ_yy^{−1} Σ_yx u_x  (17)

so that each direction vector can be obtained from the other. We thus only have to solve one eigensystem.
If u_{x1} is the first eigenvector (corresponding to the largest eigenvalue) associated with the first set of features x, the score, or coordinate on the first axis, of some feature vector x_k will be the projection

z̃_{x1} = u_{x1}^T (x_k − g_x)  (18)

up to a scaling factor.
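A minimal NumPy sketch of the eigensystem (16) and the back-substitution (17) for the first canonical pair (illustration only; names are my assumptions):

import numpy as np

def cca_first_pair(X, Y):
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    # Eigensystem (16): Sxx^{-1} Sxy Syy^{-1} Syx u_x = lambda u_x
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    lam, U = np.linalg.eig(M)
    best = np.argmax(np.real(lam))
    ux = np.real(U[:, best])
    # Back-substitution (17): u_y follows from u_x, up to scaling
    uy = np.linalg.solve(Syy, Sxy.T @ ux)
    # Enforce the unit-variance constraints (12)
    ux /= np.sqrt(ux @ Sxx @ ux)
    uy /= np.sqrt(uy @ Syy @ uy)
    return ux, uy, np.real(lam[best])   # lam = squared canonical correlation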

3.2 Feature Selection


Feature selection alleviates the effect of the curse of dimensionality. It also enhances the generalization capability of the model and speeds up the learning process.

Maximum relevance selection : compute the associations between the target and the features and keep only the most significant ones. The associations can be measured through mutual information or statistical tests.

Minimal redundancy selection : compute the associations between the features themselves and discard the ones that are strongly correlated with each other.

Stepwise regression : a greedy algorithm that adds the best (or deletes the worst) feature at each step. The main issue is knowing when to stop, often when the likelihood-ratio test is no longer significant.

L1 regularization : add an L1 penalty to the objective function (the log-likelihood, for example); it drives the weights of irrelevant features to zero.

Some algorithms, like decision trees or bagging, perform feature selection on their own.

4 Multiple correspondence analysis + feature selection
Describe and derive in detail the method allowing to perform a multiple correspondence
analysis. Interpret the results, e.g., what is the interpretation of the eigenvalues? How can we
obtain the coordinates of the data in the principal components system? Finally, describe some
methods for performing feature selection (and not extraction).

4.1 Multiple Correspondence Analysis


To each categorical feature x_i, we associate a binary indicator vector x_i whose dimension is the number of possible values (categories) of the variable. The goal is to find the linear combinations of the elements of the random indicator vectors, y_i = u_i^T x_i, which maximize the sum of the associations between the categorical features. This corresponds to the axes maximally preserving the relationships between the variables. Here, the features must not be centered, but the covariances should be.

Let us seek the linear combinations that correlate maximally the features, as quantified by the sum of covariances

∑_{i=1}^{p} ∑_{j=1}^{p} cov(y_i, y_j) = ∑_{i=1}^{p} ∑_{j=1}^{p} E[y_i y_j] = E[ ∑_{i=1}^{p} ∑_{j=1}^{p} y_i y_j ] = E[ ( ∑_{i=1}^{p} y_i )² ]

together with the constraint that the sum of the variances remains constant (equal to p):

∑_{i=1}^{p} cov(y_i, y_i) = ∑_{i=1}^{p} var(y_i) = p

TO BE COMPLETED

4.2 Feature Selection


Feature selection alleviates the effect of the curse of dimensionality. It also enhances the generalization capability of the model and speeds up the learning process.

Maximum relevance selection : compute the associations between the target and the features and keep only the most significant ones. The associations can be measured through mutual information or statistical tests.

Minimal redundancy selection : compute the associations between the features themselves and discard the ones that are strongly correlated with each other.

Stepwise regression : a greedy algorithm that adds the best (or deletes the worst) feature at each step. The main issue is knowing when to stop, often when the likelihood-ratio test is no longer significant.

L1 regularization : add an L1 penalty to the objective function (the log-likelihood, for example); it drives the weights of irrelevant features to zero.

Some algorithms, like decision trees or bagging, perform feature selection on their own.

5 Classical multidimensional scaling


Derive and explain the technique of classical «multidimensional scaling». Prove the formula
allowing to compute the inner products from the distances, and then the distances from the
inner products. Then, describe how a data matrix can be obtained from an inner product
matrix. Describe the links with principal components analysis. What is finally the procedure
for drawing the data from their distances?

5.1 Classical MDS


A technique that allows to represent n individuals in a Euclidean space based on their pairwise distances. Suppose we are given an n × n matrix containing the squared Euclidean distances or similarities (it is possible to go from one form to the other):

d²_ij = [D]_ij, s_ij = [S]_ij, d_ij = (s_ii − 2 s_ij + s_jj)^{1/2}

We also define the matrix K containing the inner products between some feature vectors x_i; K is also called the Gram or kernel matrix:

k_ij = [K]_ij = x_i^T x_j, K = X X^T
Then, MDS proceeds in 2 steps:
1. Given a valid squared distance matrix D, compute the inner product matrix K.
It is easy to compute X → K → D; the goal of MDS is to go the other way, D → K → X. If the distances are Euclidean, it means there exists an embedding space containing n feature vectors x_i in which

d²_ij = ∥x_i − x_j∥² = ∥x_i∥² + ∥x_j∥² − 2 x_i^T x_j = k_ii + k_jj − 2 k_ij

or, in matrix form, D = diag(K) e^T + e diag(K)^T − 2K.

We want to invert this relation in order to find K as a function of D. The problem does not have a unique solution, which is why we impose that K is centered:

K e = 0, e^T K = 0^T
Finally, the inner product matrix can be computed from the squared distance matrix through

K = −(1/2) H D H, with H = I − (e e^T)/n

When H is applied to a vector, it centers it: H x = x − mean(x) e, i.e., the mean is subtracted from each element. Thus, multiplying D by H from the left and from the right centers the matrix, and its row and column sums both become equal to 0.
−(1/2) H D H = −(1/2) H (diag(K) e^T + e diag(K)^T − 2K) H
             = −(1/2) (H diag(K) e^T H + H e diag(K)^T H − 2 H K H)
             = −(1/2) (0 + 0 − 2 H K H)
             = K

We used here the fact that K is doubly centered (so H K H = K) and that H e = 0.

2. Once K has been found, represent X in a Euclidean space preserving exactly the inner products and thus the distances.
From standard matrix theory, we know any symmetric matrix admits a spectral decomposition

K = U Λ U^T = (U Λ^{1/2})(U Λ^{1/2})^T

where U contains the eigenvectors of K on its columns and Λ is a diagonal matrix containing the eigenvalues on its diagonal. We therefore define

X = U Λ^{1/2}, so that K = X X^T

For the data matrix to be well-defined, all the eigenvalues in Λ have to be non-negative reals; therefore, K must be positive semi-definite. In that case, K is a valid inner product matrix and X is the associated data matrix, which is centered.
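A short NumPy sketch of the two MDS steps (double centering, then spectral decomposition); clipping tiny negative eigenvalues is a numerical safeguard I added:

import numpy as np

def classical_mds(D_sq, n_dims=2):
    # D_sq is the n x n matrix of *squared* Euclidean distances
    n = D_sq.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H
    K = -0.5 * H @ D_sq @ H                  # inner products: K = -1/2 H D H
    lam, U = np.linalg.eigh(K)               # spectral decomposition of K
    order = np.argsort(lam)[::-1][:n_dims]
    lam_top = np.clip(lam[order], 0.0, None) # guard tiny negative eigenvalues
    return U[:, order] * np.sqrt(lam_top)    # X = U Lambda^{1/2}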

5.2 Links with PCA


The eigenvectors of K correspond to the coordinates of the individuals on the PCA axes, up to a scaling factor: the feature vectors are expressed in the principal components coordinate system. Indeed, let u_1 be the dominant eigenvector of X^T X obtained by performing PCA on X, i.e., the solution of X^T X u_1 = λ_1 u_1. By multiplying this eigenequation from the left by X:

X X^T (X u_1) = λ_1 (X u_1)

so X u_1 is an eigenvector of X X^T = K. Moreover, X u_1 is the projection of the feature vectors x_i on the first principal axis u_1. This indicates that the dominant eigenvector of K provides the first coordinate of the points in the principal components coordinate system, up to a scaling factor. Therefore, X = U Λ^{1/2} is expressed in the principal components coordinate system.

6 Dynamic programming recurrence (edit-distance)


Derive and explain the dynamic programming recurrence formula and apply it to the
problem of comparing two sequences of symbols (edit distance). As an exercise, solve a concrete
example comparing two short sequences (provided at the exam).

6.1 Dynamic Programming Recurrence formula


Suppose we have a lattice with N levels. We want to reach level N starting from the first level with minimum cost, i.e., using the shortest path. Let s_k be the variable containing the state at level k; d(s_k = j | s_{k−1} = i) is the local cost associated with the decision to jump to state s_k = j at level k, given that we were in state s_{k−1} = i at level k − 1. The total cost of a path (s_0, s_1, ..., s_N) is

D(s_0, s_1, ..., s_N) = ∑_{i=1}^{N} d(s_i | s_{i−1})

and the optimal cost when starting from state s_0 is

D*(s_0) = min_{(s_1,...,s_N)} [D(s_0, s_1, ..., s_N)] = min_{(s_1,...,s_N)} [ ∑_{i=1}^{N} d(s_i | s_{i−1}) ]

It means we want to choose the optimal state at each level k. The optimal cost when starting from the initial state is D* = min_{s_0} {D*(s_0)}, and the optimal cost when starting from some intermediate state s_k is

D*(s_k) = min_{(s_{k+1},...,s_N)} [ ∑_{i=k+1}^{N} d(s_i | s_{i−1}) ]

This allows us to get the recurrence formulas:

D*(s_N) = 0
D*(s_k = i) = min_{s_{k+1}} [d(s_{k+1} | s_k = i) + D*(s_{k+1})]
D* = min_{s_0} {D*(s_0)}

6.2 Edit-distance
Computes the minimal number of insertions, deletions and substitutions (number of edit operations) needed to transform one string into another. We have 2 strings x and y, where x_i (resp. y_j) is the character at index i in x (resp. j in y). The length of x is |x| and, generally, |x| ≠ |y|. The substring of x beginning at character i and ending at character j is denoted x_i^j.

We now read the characters of x one by one in order to construct y by using 3 operations:
1. insertion of a character in y without taking any character from x;
2. deletion of a character from x without concatenating it into y;
3. substitution of a character of y by one of x.
The notation x_i^{|x|} means we have read the first i − 1 characters of x, i.e., x has been cut from its first i − 1 characters, which were taken to construct y. In the same way, y_0^j means that the first j characters of y have been transcribed. We progressively read the characters of x in order to build y.

It corresponds to a process with levels (steps) and states, solved by applying dynamic programming where each state is characterized by a couple (i, j):
— one state corresponds to (x_i^{|x|}, y_0^j)
— level k corresponds to i + j = const = k
For the 3 operations:
— insertion with respect to x: (x_i^{|x|}, y_0^{j−1}) → (x_i^{|x|}, y_0^j)
— deletion with respect to x: (x_{i−1}^{|x|}, y_0^j) → (x_i^{|x|}, y_0^j)
— substitution with respect to x: (x_{i−1}^{|x|}, y_0^{j−1}) → (x_i^{|x|}, y_0^j)
The first 2 operations jump from level k to level k + 1; the 3rd one jumps directly to level k + 2.

This situation can be represented in a 2D table where one level is represented by the constant i + j, one state is represented by (i, j), and one operation corresponds to a valid transition in this table.

A cost can be associated with each operation:

δ_ins(y_j) = 1, δ_del(x_i) = 1, δ_sub(x_i, y_j) = δ_ij

with δ_ij = 1 if x_i ≠ y_j and δ_ij = 0 if x_i = y_j.
Then, the dynamic programming formula can be applied to this problem:

D*(x_0^{|x|}, y_0^0) = 0

D*(x_i^{|x|}, y_0^j) = min { D*(x_i^{|x|}, y_0^{j−1}) + 1,  D*(x_{i−1}^{|x|}, y_0^j) + 1,  D*(x_{i−1}^{|x|}, y_0^{j−1}) + δ_ij }

Finally, dist(x, y) = D*(x_{|x|}^{|x|}, y_0^{|y|}).
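A direct transcription of this recurrence into a worked example (the standard Levenshtein dynamic programming table):

def edit_distance(x, y):
    m, n = len(x), len(y)
    # D[i][j] = minimal cost to transform the first i characters of x
    # into the first j characters of y
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                              # i deletions
    for j in range(n + 1):
        D[0][j] = j                              # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,           # insertion
                          D[i - 1][j] + 1,           # deletion
                          D[i - 1][j - 1] + delta)   # substitution
    return D[m][n]

assert edit_distance("kitten", "sitting") == 3   # 2 substitutions, 1 insertion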

6.3 Don’t forget to practice computing an example by hand

7 Dynamic programming recurrence (Bellman-Ford)


Derive and explain the dynamic programming recurrence formula and apply it to the
problem of computing the shortest-path distance between two nodes of a weighted directed
graph (Bellman-Ford algorithm). As an exercise, solve a concrete example on a small
graph (provided at the exam).

7.1 Dynamic Programming Recurrence formula


Suppose we have a lattice with N levels. We want to reach level N starting from the first level with minimum cost, i.e., using the shortest path. Let s_k be the variable containing the state at level k; d(s_k = j | s_{k−1} = i) is the local cost associated with the decision to jump to state s_k = j at level k, given that we were in state s_{k−1} = i at level k − 1. The total cost of a path (s_0, s_1, ..., s_N) is

D(s_0, s_1, ..., s_N) = ∑_{i=1}^{N} d(s_i | s_{i−1})

and the optimal cost when starting from state s_0 is

D*(s_0) = min_{(s_1,...,s_N)} [D(s_0, s_1, ..., s_N)] = min_{(s_1,...,s_N)} [ ∑_{i=1}^{N} d(s_i | s_{i−1}) ]

It means we want to choose the optimal state at each level k. The optimal cost when starting from the initial state is D* = min_{s_0} {D*(s_0)}, and the optimal cost when starting from some intermediate state s_k is

D*(s_k) = min_{(s_{k+1},...,s_N)} [ ∑_{i=k+1}^{N} d(s_i | s_{i−1}) ]

This allows us to get the recurrence formulas:

D*(s_N) = 0
D*(s_k = i) = min_{s_{k+1}} [d(s_{k+1} | s_k = i) + D*(s_{k+1})]
D* = min_{s_0} {D*(s_0)}

7.2 Bellmanford algorithm


Suppose we have a weighted, directed and strongly connected network or graph, and we want to compute the minimum cost for reaching node 0 from all other nodes in the network. Denote by c_ij = d(j|i) the cost of going from node i to node j. For i ≠ j:
— c_ij > 0 is the cost of the link;
— c_ij = ∞ if there is no link between the 2 nodes: an infinite cost means that this transition will never be chosen;
— c_0j = ∞ for all j ≠ 0: once we reach the final node 0 we are at destination and we stay there; such a node is called absorbing.
On the other hand, for i = j (when we have a loop):
— c_ii = ∞ for all i ≠ 0: we do not allow looping in a node;
— c_00 = 0: we stay within the destination state without additional cost.

To do so, we unfold the network in time, creating levels, where each level corresponds to a "time step" (or transition). If n is the number of nodes, then when computing the least cost from any node to node 0 we cannot have more than n − 1 steps on the least-cost path; otherwise we would visit the same node at least twice.

We transform the initial problem into a dynamic programming problem with n levels and n nodes (a directed acyclic graph, DAG) and we then use the backward recurrence formula. We define an n × n table D*(i, k), where i is the index of the node (0 to n − 1) located at level k (0 to n − 1). Element (i, k) of this table represents D*(s_k = i): it contains the minimal cost for reaching destination node 0 from node i at level k, and at each iteration we examine which successor node j is the most interesting to visit:

D*(0, n − 1) = 0 (you reached destination node 0 at the last level n − 1)
D*(i, n − 1) = ∞, i ≠ 0 (those states are forbidden, you are required to reach node 0)
D*(i, k) = min_{j∈Succ(i)} {c_ij + D*(j, k + 1)}
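A sketch of this backward recurrence on a small assumed cost matrix (infinite entries encode missing links, as above):

import math

def least_costs_to_node0(costs):
    n = len(costs)
    INF = math.inf
    # D[k][i] = minimal cost to reach node 0 from node i at level k
    D = [[INF] * n for _ in range(n)]
    D[n - 1][0] = 0.0                    # destination reached at last level
    for k in range(n - 2, -1, -1):       # backward recurrence over levels
        for i in range(n):
            D[k][i] = min(costs[i][j] + D[k + 1][j] for j in range(n))
    return D[0]

INF = math.inf
costs = [[0,   INF, INF],    # c00 = 0: node 0 is absorbing
         [1,   INF, 2  ],    # cii = inf elsewhere: no self-loops
         [5,   1,   INF]]
print(least_costs_to_node0(costs))   # [0.0, 1.0, 2.0]: node 2 goes via node 1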

7.3 Training example available on slide 50

8 Dynamic programming recurrence (dynamic time warping)


Derive and explain the dynamic programming recurrence formula and explain its application
to the problem of «dynamic time warping» in speech recognition

8.1 Dynamic Programming Recurrence formula


Suppose we have a lattice with N levels. We want to reach level N starting from the first level with minimum cost, i.e., using the shortest path. Let s_k be the variable containing the state at level k; d(s_k = j | s_{k−1} = i) is the local cost associated with the decision to jump to state s_k = j at level k, given that we were in state s_{k−1} = i at level k − 1. The total cost of a path (s_0, s_1, ..., s_N) is

D(s_0, s_1, ..., s_N) = ∑_{i=1}^{N} d(s_i | s_{i−1})

and the optimal cost when starting from state s_0 is

D*(s_0) = min_{(s_1,...,s_N)} [D(s_0, s_1, ..., s_N)] = min_{(s_1,...,s_N)} [ ∑_{i=1}^{N} d(s_i | s_{i−1}) ]

It means we want to choose the optimal state at each level k. The optimal cost when starting from the initial state is D* = min_{s_0} {D*(s_0)}, and the optimal cost when starting from some intermediate state s_k is

D*(s_k) = min_{(s_{k+1},...,s_N)} [ ∑_{i=k+1}^{N} d(s_i | s_{i−1}) ]

This allows us to get the recurrence formulas:

D*(s_N) = 0
D*(s_k = i) = min_{s_{k+1}} [d(s_{k+1} | s_k = i) + D*(s_{k+1})]
D* = min_{s_0} {D*(s_0)}

8.2 Dynamic Time Warping (DTW)


Often used in word recognition, by matching the pronunciation to the nearest template in a database. However, the timing with which the word is pronounced can differ greatly: we have to account for distortions and warping of the signal. We thus align the 2 signals by defining a distance d(i, j) between the 2 frames and a time alignment that allows for warping. In order to obtain a meaningful alignment, we have to add constraints:

Monotonicity: i_k ≥ i_{k−1}, j_k ≥ j_{k−1}
Continuity: i_k − i_{k−1} ≤ 1, j_k − j_{k−1} ≤ 1
Boundary conditions: i_1 = 1, j_1 = 1 and i_K = I, j_K = J

The problem is then solved by dynamic programming, considering only the valid transitions, using these recurrence relations:

g(1, 1) = d(r_1^n, o_1)

g(i, j) = min { g(i − 1, j) + d(r_i^n, o_j),  g(i − 1, j − 1) + 2 d(r_i^n, o_j),  g(i, j − 1) + d(r_i^n, o_j) }

D(R^n, O) = g(I, J) / (I + J)

where r_i^n is the i-th frame of reference template R^n and o_j the j-th frame of the observed signal.
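A sketch of this DTW recurrence for 1-D signals, assuming the simple local distance d(i, j) = |r_i − o_j| (an illustrative choice):

import numpy as np

def dtw(r, o):
    I, J = len(r), len(o)
    d = np.abs(np.subtract.outer(r, o))      # d[i-1, j-1] = |r_i - o_j|
    g = np.full((I + 1, J + 1), np.inf)      # cumulative costs, 1-indexed
    g[1, 1] = d[0, 0]
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if i == 1 and j == 1:
                continue
            g[i, j] = min(g[i - 1, j] + d[i - 1, j - 1],
                          g[i - 1, j - 1] + 2 * d[i - 1, j - 1],
                          g[i, j - 1] + d[i - 1, j - 1])
    return g[I, J] / (I + J)                 # normalized DTW distance

print(dtw(np.array([0., 1., 2., 1.]), np.array([0., 1., 1., 2., 1.])))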

9 Markov Decision Process + Reinforcement Learning


Explain the general principles behind «Markov decision processes» and derive in detail the
expressions allowing to compute the optimal policy (value-iteration technique). Briefly discuss the
links with dynamic programming, the Q-value, reinforcement learning and Q-learning (without
the proofs this time).

9.1 Explain the general principles behind the Markov decision processes
We have a set of states S = {1, 2, ..., n}, where s_t = k means that the process is in state k at time t. Each state k has a set of admissible actions U(k) (or control actions, decisions) that can be undertaken; let a ∈ U(k) be an available action in state k. Each action only depends on the current state and is independent of the past states (and of time).

Each action a = u(s_t), taken at time t in state s_t, has a bounded cost 0 ≤ c(u(s_t)|s_t) < ∞. The jump to state s_{t+1} = k′ is governed by a probability mass P(s_{t+1} = k′ | s_t = k, u(s_t) = a) = p(k′|k, a) that only depends on the current state. If the problem is deterministic, the distribution reduces to a Kronecker delta.

We suppose that the goal state is reachable from each state and that the costs are non-negative, so that the network of states has no negative cycle.

9.2 Optimal Policy


The goal of the MDP is to find the optimal policy π = {u(1), u(2), ..., u(n)} that gives the optimal action for each state, so as to reach a goal state s = d starting from the initial state s_0 = k_0 at t = 0 while minimizing the total expected cost V_π over the sequence of random variables (s_1, s_2, ...) until the goal state is reached:

V_π(s_0 = k_0) = E_{s_1,s_2,...} [ ∑_{t=0}^{∞} c(u(s_t)|s_t) | s_0 = k_0, π ]

The goal state is said to be absorbing because, once it is reached, the process stays there with no cost: p(d|d, a) = 1 and c(a|d) = 0.

The optimal policy π* is given by

V*(k_0) = min_π {V_π(k_0)}

The optimal expected cost can be computed thanks to recurrence relations. With a_t = u(k_t) and u(k_t) ∈ U(k_t):

V*(k_0) = min_{(a_0,a_1,...)} ( E_{s_1,s_2,...} [ ∑_{t=0}^{∞} c(u(s_t)|s_t) | s_0 = k_0, π ] )

V*(k_0) = min_{(a_0,a_1,...)} [ ∑_{k_1,k_2,...} P(s_1 = k_1, s_2 = k_2, ... | s_0 = k_0, π) × ( ∑_{t=0}^{∞} c(u(k_t)|k_t) ) ]

Using the Markov property, we factorize the first equation:

V*(k_0) = min_{(a_0,a_1,...)} [ ∑_{k_1,k_2,...} P(s_2 = k_2, s_3 = k_3, ... | s_1 = k_1, π) P(s_1 = k_1 | s_0 = k_0, u(k_0) = a_0) × ( c(a_0|k_0) + ∑_{t=1}^{∞} c(u(k_t)|k_t) ) ]

V*(k_0) = min_{a_0} ∑_{k_1} P(s_1 = k_1 | s_0 = k_0, u(k_0) = a_0) × [ c(a_0|k_0) + min_{(a_1,a_2,...)} ∑_{k_2,k_3,...} P(s_2 = k_2, s_3 = k_3, ... | s_1 = k_1, π) ∑_{t=1}^{∞} c(u(k_t)|k_t) ]

The part of the equation with min_{(a_1,a_2,...)} is equal to V*(k_1), so that

V*(k_0) = min_{a_0∈U(k_0)} ( ∑_{k_1} p(k_1|k_0, a_0) [c(a_0|k_0) + V*(k_1)] )

And thus we obtain

V*(k) = min_{a∈U(k)} ( c(a|k) + ∑_{k′=1}^{n} p(k′|k, a) V*(k′) )

which are Bellman's equations of optimality. The optimal action for state k is the action achieving this minimum. This yields the value-iteration algorithm: initialize V̂(k) arbitrarily (e.g., to 0) and repeatedly sweep over all the states, applying the update above, until convergence. After convergence, the best action in state k is the action a achieving the minimum in the same expression.

As an aside, we know that E_{x,y}[f(x, y)] = E_x[E_y[f(x, y)|x]], which allows us to compute, after simplifying with u(k_t) = a_t:

V_π(k_0) = c(a_0|k_0) + ∑_{k_1=1}^{n} p(k_1|k_0, a_0) V_π(k_1)

which lets us compute the optimal value

V*(k_0) = min_π V_π(k_0)

This is how we get the Bellman conditions.
9.3 Reinforcement Learning
The goal of reinforcement learning is to learn by doing. We introduce for this the Q-value: the expected cost when choosing action a in state k and then relying on policy π:

Q_π(k, a) = c(a|k) + ∑_{k′} p(k′|k, a) V_π(k′)

Consequently:

V*(k) = min_{a∈U(k)} Q*(k, a)

The Bellman optimality conditions in terms of Q-values become

Q*(k, a) = c(a|k) + ∑_{k′} p(k′|k, a) min_{a′∈U(k′)} Q*(k′, a′)

This allows us to derive the corresponding value-iteration algorithm on the Q-values.

We now want to use stochastic approximation to adjust the Q-values when trying some action in some state, by observing the cost and the next state. This is Q-learning; its standard update rule (stated here from the general theory) is

Q̂(k, a) ← Q̂(k, a) + α(t) [ c(a|k) + min_{a′∈U(k′)} Q̂(k′, a′) − Q̂(k, a) ]

where k′ is the observed next state and α(t) is a learning rate that should decrease gradually according to the theory of stochastic approximation. There should as well be an exploration mechanism to explore the state space stochastically.

Sometimes, function approximation is needed when the number of different possible states becomes too large; the update rule then becomes some sort of gradient descent algorithm.
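A tabular Q-learning sketch consistent with the cost-minimization setting above; the environment function `step` and the ε-greedy exploration schedule are stand-in assumptions:

import random

def q_learning(n_states, n_actions, step, episodes=500, eps=0.1):
    # Q[k][a]; state 0 is assumed to be the absorbing goal
    Q = [[0.0] * n_actions for _ in range(n_states)]
    t = 0
    for _ in range(episodes):
        k = random.randrange(1, n_states)
        while k != 0:
            t += 1
            alpha = 1.0 / t ** 0.6                 # decreasing learning rate
            if random.random() < eps:              # exploration...
                a = random.randrange(n_actions)
            else:                                  # ...or exploitation
                a = min(range(n_actions), key=lambda b: Q[k][b])
            cost, k_next = step(k, a)              # observe cost, next state
            target = cost + min(Q[k_next])
            Q[k][a] += alpha * (target - Q[k][a])  # stochastic approximation
            k = k_next
    return Q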

10 Markov Decision Processes without derivation


Discuss in detail (with the equations) the links between Markov decision problems, Q-learning,
and reinforcement learning, without providing the whole derivation of the value iteration.

10.1 Explain the general principles behind the Markov decision processes
We have a set of states S = {1, 2, ..., n}, where s_t = k means that the process is in state k at time t. Each state k has a set of admissible actions U(k) (or control actions, decisions) that can be undertaken; let a ∈ U(k) be an available action in state k. Each action only depends on the current state and is independent of the past states (and of time).

Each action a = u(s_t), taken at time t in state s_t, has a bounded cost 0 ≤ c(u(s_t)|s_t) < ∞. The jump to state s_{t+1} = k′ is governed by a probability mass P(s_{t+1} = k′ | s_t = k, u(s_t) = a) = p(k′|k, a) that only depends on the current state. If the problem is deterministic, the distribution reduces to a Kronecker delta.

We suppose that the goal state is reachable from each state and that the costs are non-negative, so that the network of states has no negative cycle.

10.2 Optimal Policy


The goal of the MDP is to find the optimal policy π = {u(1), u(2), ..., u(n)} that gives the optimal action for each state, so as to reach a goal state s = d starting from the initial state s_0 = k_0 at t = 0 while minimizing the total expected cost V_π over the sequence of random variables (s_1, s_2, ...) until the goal state is reached:

V_π(s_0 = k_0) = E_{s_1,s_2,...} [ ∑_{t=0}^{∞} c(u(s_t)|s_t) | s_0 = k_0, π ]

The goal state is said to be absorbing because, once it is reached, the process stays there with no cost: p(d|d, a) = 1 and c(a|d) = 0. We can obtain Bellman's equations of optimality; the optimal action for state k is the action that achieves the minimum in

V*(k) = min_{a∈U(k)} ( c(a|k) + ∑_{k′=1}^{n} p(k′|k, a) V*(k′) )

This yields the value-iteration algorithm: repeatedly sweep over all the states, applying this update, until convergence. After convergence, the best action in each state is the argmin of the same expression.

10.3 Reinforcement Learning
The goal of reinforcement learning is to learn by doing. We introduce for this the Q-value: the expected cost when choosing action a in state k and then relying on policy π:

Q_π(k, a) = c(a|k) + ∑_{k′} p(k′|k, a) V_π(k′)

Consequently:

V*(k) = min_{a∈U(k)} Q*(k, a)

The Bellman optimality conditions in terms of Q-values become

Q*(k, a) = c(a|k) + ∑_{k′} p(k′|k, a) min_{a′∈U(k′)} Q*(k′, a′)

This allows us to derive the corresponding value-iteration algorithm on the Q-values.

We now want to use stochastic approximation to adjust the Q-values when trying some action in some state, by observing the cost and the next state. This is Q-learning; its standard update rule (stated here from the general theory) is

Q̂(k, a) ← Q̂(k, a) + α(t) [ c(a|k) + min_{a′∈U(k′)} Q̂(k′, a′) − Q̂(k, a) ]

where k′ is the observed next state and α(t) is a learning rate that should decrease gradually according to the theory of stochastic approximation. There should as well be an exploration mechanism to explore the state space stochastically.

Sometimes, function approximation is needed when the number of different possible states becomes too large; the update rule then becomes some sort of gradient descent algorithm.

11 Information Retrieval - Vector Space Model


Explain the basic vector model of «information retrieval», as well as two of its extensions
(term reweighting and the «latent semantic model»). Moreover, how can we validate/assess
an information retrieval system (precision, recall, F-measure)?

11.1 Vector Space Model


Each document, query and user profile is represented by a vector in the space of words. A word is represented by an index in each vector, and the value associated with each index is the frequency of the word in the analyzed document, query, etc. Document j is represented by d_j, and f_ij is the frequency of word i in document j. The vectors have dimension n_w, the total number of words. Hence, each document is represented by

d_j ≜ [f_{1j}, f_{2j}, ..., f_{n_w j}]^T

This is the bag-of-words representation in the word space: the order of the words is not taken into account, and the vector is usually very large and sparse. If we have n_d documents, the term-document matrix becomes

D ≜ [f_ij], an n_w × n_d matrix whose column j contains the word frequencies of document j

A query is also represented by a vector, where each element is 1 if the corresponding word is present in the query and 0 otherwise:

q ≜ [0, 1, 0, ..., 1]^T

The purpose of the model is to retrieve documents d_i based on the query q, by defining a notion of similarity between the query and the documents. The similarity could be defined by the Euclidean distance, but this does not work well because queries contain few words. It is much more often defined by the cosine of the angle between the 2 vectors, the cosine similarity:

sim(q, d_i) ≜ cos(q, d_i) = q^T d_i / (∥q∥ ∥d_i∥)

11.2 Term Weighting


It is based on the idea that 2 words do not have the same importance or weight: we want to take into account the discriminative power of each word. For example, if a word is present in every document, it carries no useful information. We define P(w_i), the a priori probability that word w_i appears in a document of the corpus, estimated as the number of documents in which w_i appears divided by the total number of documents. To measure the importance of a word, we take its inverse document frequency (IDF): idf_i = −log₂(P(w_i)).

We can then redefine q so that each present word is weighted using its IDF, and compute the cosine as before in order to rank the documents:

q ≜ [0, ..., −log₂(P(w_i)), ..., −log₂(P(w_{n_w}))]^T
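A small sketch combining the IDF weighting with the cosine ranking on a toy term-document matrix (data invented for illustration):

import numpy as np

D = np.array([[2., 0., 1.],     # frequencies of word 1 in the 3 documents
              [0., 1., 1.],
              [1., 1., 1.]])    # word 3 appears in every document
q = np.array([1., 0., 1.])      # words 1 and 3 are present in the query

P_w = (D > 0).mean(axis=1)      # P(w_i): document frequency of each word
q_weighted = q * -np.log2(P_w)  # IDF weighting; word 3 gets weight 0

# Cosine similarity between the weighted query and every document
sims = (q_weighted @ D) / (np.linalg.norm(q_weighted) * np.linalg.norm(D, axis=0))
ranking = np.argsort(sims)[::-1]   # documents by decreasing similarity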

11.3 Latent Semantic Models


These models try to capture some semantic information: retrieving documents carrying information related to the query words even when they do not contain those words themselves (query: "baby" ⇒ also retrieves "newborn" documents). We consider words that are often used together: two words are semantically related when they often co-occur in the same documents.

One solution is to use sub-space projection methods like singular value decomposition (SVD) or factor analysis. Here, we use the SVD to reduce the rank of the term-document matrix D from n to m, with m < n. This reduces the dimensionality of the space by clustering the words that are semantically similar (used in the same documents); it shows as well which documents are semantically similar to one another (containing the same words). It is similar to a double PCA where you consider the terms as observations and the documents as variables AND the documents as observations and the terms as variables. This allows us to build a concept space.

If D is of full rank, its SVD is D = U Σ V^T, with Σ a diagonal matrix where σ_1 ≥ σ_2 ≥ ... ≥ σ_n > 0. We are interested in the best matrix of rank m approximating D. To do so, we take Σ̃, the matrix Σ in which every σ_i with m + 1 ≤ i ≤ n is set to zero, and compute D̃ = U Σ̃ V^T. Then, to compare the query with each document, we compute the similarity with the reconstructed documents d̃_i. This works because the SVD allows to work in a latent space representing concepts. The main problem is to know how many dimensions to keep (the value of m).
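A minimal NumPy sketch of the rank-m reduction via SVD:

import numpy as np

def lsa_approximation(D, m):
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    s[m:] = 0.0                        # zero out sigma_i for i > m
    return U @ np.diag(s) @ Vt         # D_tilde, best rank-m approximation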

11.4 Assessment of document retrieval systems


— Precision : estimates the percentage of relevant documents in the set of all retrieved documents. It indicates the relevance of the retrieved documents.
— Recall : estimates the percentage of retrieved documents in the set of all relevant documents. It indicates to which extent the relevant documents were retrieved.
— F-measure : there is a trade-off between precision and recall, which is why the F-measure was introduced: it takes both into account.

F = 2 (precision × recall) / (precision + recall)

12 Information Retrieval Probabilistic Model


Explain the probabilistic model of «information retrieval» and discuss the underlying
assumptions. Moreover, how can we validate/assess an information retrieval system (precision,
recall, F-measure)?

12.1 Probabilistic Model


This type of model relies on statistical models: each user profile is represented by such a model. A document can then be relevant to a user: R = 1 if relevant, R = 0 if not. Everything is based on relevance feedback from the user, or on the ranking produced by a vector space model. We only discuss the binary independence retrieval model here.

When introducing a query to a vector space model, we obtain a ranking of the documents with respect to the query.

We then perform query expansion, meaning that the query is expanded using relevance feedback from the user and the most relevant documents of the vector model. Each document is represented by a binary vector where each element is 1 if the associated word is present in the document and 0 otherwise:

d_i = [0, 1, 0, ..., 1]^T
We define the probability of observing a document d = x, given that this document is relevant (R = 1) for user u_k, as P(d = x | R = 1, u_k). However, during the document retrieval phase, we are mainly interested in P(R = 1 | d = x, u_k): the larger this value, the more likely document x is relevant. This value has to be calculated for each document in the database.

Hence, instead of computing P(R = 1 | d = x, u_k) itself, we only need to compute the odds:

λ = P(R = 1 | d = x, u_k) / P(R = 0 | d = x, u_k) = P(R = 1 | d = x, u_k) / (1 − P(R = 1 | d = x, u_k))
The odds provide the same ranking: the larger λ, the more likely the document is relevant. Using Bayes' law, we can simplify it further, denoting by d_n the nth element of vector d and assuming the words are conditionally independent given relevance (the naive Bayes assumption):

λ = P(d = x | R = 1, u_k) / P(d = x | R = 0, u_k) × P(R = 1 | u_k) / P(R = 0 | u_k)
  = [ ∏_{n=1}^{n_w} P(d_n = x_n | R = 1, u_k) / ∏_{n=1}^{n_w} P(d_n = x_n | R = 0, u_k) ] × P(R = 1 | u_k) / P(R = 0 | u_k)

Finally, since the prior odds do not depend on the document, we can summarize λ as a naive Bayes classifier:

λ ∝ ∏_{n=1}^{n_w} P(d_n = x_n | R = 1, u_k) / ∏_{n=1}^{n_w} P(d_n = x_n | R = 0, u_k)

Here, P(d_n = x_n | R = 1, u_k) and P(d_n = x_n | R = 0, u_k) are the likelihoods, estimated by frequencies: the proportion of documents containing word w_n among the relevant and the irrelevant documents, respectively.
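A sketch of the odds computation under the binary independence assumption; the Laplace smoothing is an addition of mine to avoid zero probabilities:

import numpy as np

def relevance_odds(doc, relevant, irrelevant):
    # doc: binary word-presence vector; relevant/irrelevant: binary
    # document-by-word matrices built from the user's feedback
    p1 = (relevant.sum(axis=0) + 1) / (len(relevant) + 2)      # P(d_n=1|R=1)
    p0 = (irrelevant.sum(axis=0) + 1) / (len(irrelevant) + 2)  # P(d_n=1|R=0)
    like1 = np.where(doc == 1, p1, 1 - p1)
    like0 = np.where(doc == 1, p0, 1 - p0)
    return np.prod(like1 / like0)      # lambda, up to the prior odds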

12.2 Assessment of document retrieval systems


— Precision : estimates the percentage of relevant documents in the set of all retrieved documents. It indicates the relevance of the retrieved documents.
— Recall : estimates the percentage of retrieved documents in the set of all relevant documents. It indicates to which extent the relevant documents were retrieved.
— F-measure : there is a trade-off between precision and recall, which is why the F-measure was introduced: it takes both into account.

F = 2 (precision × recall) / (precision + recall)

13 PageRank Model
Explain in detail the basic PageRank model for scoring web pages. Describe also its different
extensions (personalization vector, etc.). Then describe the «HITS» scoring model and its
bibliometric interpretation.

13.1 PageRank model
Corresponds to a measure of "prestige" in a directed graph. The objective is to exploit the link structure between documents to extract information, without looking at their content. We see the document repository as a graph where nodes are documents and edges are directed links between them.

Each webpage i has an associated score x_i, proportional to the weighted average score of the pages pointing to page i. Let w_ij be the weight associated with the link between pages i and j (usually 1 if page i links to page j and 0 otherwise), and let w_{j•} be the out-degree of page j. All these weights are stored in the matrix W; since the graph is directed, this matrix is not necessarily symmetric.

x_i ∝ ∑_{j=1}^{n} (w_ji / w_{j•}) x_j, with w_{j•} = ∑_{i=1}^{n} w_ji

Thus, a highly scored page is a page pointed to by many highly scored pages: a page is considered important if it is pointed to by many important pages which themselves have few outgoing links. To find those values, the scores are updated iteratively until convergence; then, all the pages are ranked according to their score.

13.2 Extensions
13.2.1 Google Matrix
Some pages have no links with other pages, in this case, the P matrix is not stochastic anymore because
the rows do not sum to 1 anymore. One solution is to jump to any node of the graph randomly.

eeT
G = αP + (1 − α)
n
Where G is the Google Matrix and e is a vector full of ones, and 0 < α < 1. G is not more sparse but is
normal

13.2.2 Personalization vector


Used to favor some pages in a natural way, for advertising for example. Rather than using e e^T/n in the Google matrix, we use e v^T, with v a probability vector called the personalization vector, given by the user. It contains the a priori probability of jumping to any page by teleportation. The matrix thus becomes:

G = αP + (1 − α) e v^T
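A power-iteration sketch of PageRank with the Google-matrix teleportation (NumPy; the small example graph is invented):

import numpy as np

def pagerank(W, alpha=0.85, n_iter=100):
    n = W.shape[0]
    out_degree = W.sum(axis=1, keepdims=True)
    out_degree[out_degree == 0] = 1.0        # dangling rows stay all-zero
    P = W / out_degree                       # row-stochastic transition matrix
    x = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        # x <- G^T x with G = alpha P + (1 - alpha) e e^T / n
        x = alpha * P.T @ x + (1 - alpha) / n
        x /= x.sum()                         # re-normalize (dangling nodes)
    return x

W = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [1., 0., 0.]])
print(pagerank(W))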

13.3 HITS Algorithm


It is based on the concept of 2 categories of web pages that are strongly connected: the hub and the authority pages. Hubs are heavily linked to authorities: a good hub points to many good authorities and has few incoming links. Authorities do not link to other authorities but are pointed to by many hubs; the main authorities on a topic are often in competition with one another.

The objective of this algorithm is to find the good hubs and authorities from the results of the search engines. Each page i is thus assigned a hub and an authority score (x_i^h, x_i^a). A page's authority score is proportional to the sum of the hub scores of the pages pointing to it:

x_j^a = η ∑_{i=1}^{n} w_ij x_i^h
A page's hub score is itself proportional to the sum of the authority scores of the pages it points to:

x_j^h = μ ∑_{i=1}^{n} w_ji x_i^a

In matrix form this becomes

x^a = η W^T x^h
x^h = μ W x^a

and thus

x^a = ημ W^T W x^a
x^h = μη W W^T x^h

To obtain the scores, an equivalent method is to take the dominant eigenvectors of W^T W and W W^T, giving the vector of authority scores and the vector of hub scores respectively.
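A sketch obtaining both score vectors by alternating power iterations, which converge to the dominant eigenvectors of W^T W and W W^T:

import numpy as np

def hits(W, n_iter=100):
    n = W.shape[0]
    xh = np.full(n, 1.0 / n)        # hub scores
    for _ in range(n_iter):
        xa = W.T @ xh               # authority <- hubs pointing to the page
        xa /= np.linalg.norm(xa)
        xh = W @ xa                 # hub <- authorities the page points to
        xh /= np.linalg.norm(xh)
    return xa, xh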

13.3.1 Bibliometrics
The HITS algorithm can be used in bibliometrics for co-citations (when 2 documents are both cited by the same 3rd document) and co-references (when 2 documents both reference the same document). In this model, the co-reference matrix is closely related to the hub matrix W W^T, and the co-citation matrix to the authority matrix W^T W.

14 PageRank Model + RandomWalk


Explain in detail the PageRank model for scoring web pages, as well as its random walk
interpretation. Describe also its different extensions (personalization vector, etc.).

14.1 PageRank
Corresponds to a measure of "prestige" in a directed graph. The objective is to exploit the link structure between documents to extract information, without looking at their content. We see the document repository as a graph where nodes are documents and edges are directed links between them.

Each webpage i has an associated score x_i, proportional to the weighted average score of the pages pointing to page i. Let w_ij be the weight associated with the link between pages i and j (usually 1 if page i links to page j and 0 otherwise), and let w_{j•} be the out-degree of page j. All these weights are stored in the matrix W; since the graph is directed, this matrix is not necessarily symmetric.

x_i ∝ ∑_{j=1}^{n} (w_ji / w_{j•}) x_j, with w_{j•} = ∑_{i=1}^{n} w_ji

Thus, a highly scored page is a page pointed to by many highly scored pages: a page is considered important if it is pointed to by many important pages which themselves have few outgoing links. To find those values, the scores are updated iteratively until convergence; then, all the pages are ranked according to their score.

14.2 RandomWalk
It can be nicely interpreted in terms of random surfing. We define the probability of following the link from page j to page i as:

P(page(k + 1) = i | page(k) = j) = w_ji / w_{j•}

We can then rewrite the equation as:

x_i(k + 1) = P(page(k + 1) = i)
           = ∑_{j=1}^{n} P(page(k + 1) = i | page(k) = j) x_j(k)
           = ∑_{j=1}^{n} (w_ji / w_{j•}) x_j(k)

This gives us a Markov model of a random surf through the web, which is the same equation as before when ignoring the iteration index k.

If p_ji is the (j, i)th element of the transition probability matrix P, the expression can be rewritten as x_i(k + 1) = ∑_{j=1}^{n} p_ji x_j(k). In matrix form, we have x(k + 1) = P^T x(k). The stationary distribution is given by x(k + 1) = x(k) = x, and thus

x = P^T x

x_i is thus viewed as the probability of being at page i: the solution of these equations is the stationary distribution of the random surf, i.e., the probability of finding the surfer on page i in the long-term behavior. The PageRank scores can then be obtained by computing the left eigenvector of P corresponding to eigenvalue 1. If the graph is undirected, the scores are simply proportional to the degrees of the nodes.

14.3 Extensions
14.3.1 Google Matrix
Some pages have no outgoing links; in this case, the matrix P is not stochastic anymore because the corresponding rows do not sum to 1. One solution is to allow jumping to any node of the graph at random (teleportation):

G = αP + (1 − α) (e e^T)/n

where G is the Google matrix, e is a vector full of ones, and 0 < α < 1. G is no longer sparse, but it is stochastic.

14.3.2 Personalization vector


Used to favor some pages in a natural way, for advertising for example. Rather than using e e^T/n in the Google matrix, we use e v^T, with v a probability vector called the personalization vector, given by the user. It contains the a priori probability of jumping to any page by teleportation. The matrix thus becomes:

G = αP + (1 − α) e v^T

15 Collaborative Recommendation
Describe in detail the basic model of collaborative recommendation, namely, the model based
on k nearest neighbors. Also describe in detail the ItemRank model of collaborative recom-
mendation (also called "random walk with restart"), inspired by PageRank. Finally, how do we
assess a collaborative recommendation system?

15.1 Collaborative Recommendation


The most popular algorithm is based on a bipartite graph: an individual-based (or user-based) k nearest neighbors algorithm.

We have individuals (i) and items (j). Each individual is represented by a profile vector v_i in the item space, where v_ij = 1 if i bought item j and 0 otherwise. We also compute the individuals-items frequency matrix W, whose element w_ij indicates whether item j was purchased by individual i, or how many times (2 variants).

We first compute a similarity between 2 individuals i and j based on their profile vectors. There exist many possibilities, like the cosine similarity, or a Jaccard-type index

sim(i, j) = a/(a + b + c)

where a is the number of items bought by both individuals, and b and c are the numbers of items bought by only one of them.

From those similarities, one can compute the k nearest neighbors of each individual and compute a predicted value for each item; the items first proposed to this individual are the ones with the highest predicted values.

The predicted value of item i for individual p is computed as the number of times item i has been purchased by p's neighbors, weighted by the similarities between the individuals:

pred(p, i) = ∑_{q∈N(p)} sim(p, q) w_qi / ∑_{q∈N(p)} sim(p, q)
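A user-based k-NN sketch with cosine similarities (NumPy; the toy data are invented for illustration):

import numpy as np

def predicted_value(W, p, item, k=2):
    # Cosine similarity between individual p and all other individuals
    norms = np.linalg.norm(W, axis=1)
    sims = (W @ W[p]) / (norms * norms[p] + 1e-12)
    sims[p] = -np.inf                          # exclude p itself
    neighbors = np.argsort(sims)[::-1][:k]     # the k nearest neighbors
    s = sims[neighbors]
    return (s @ W[neighbors, item]) / s.sum()  # weighted purchase average

W = np.array([[1., 0., 1., 1.],                # individuals x items
              [1., 0., 1., 0.],
              [0., 1., 0., 1.]])
print(predicted_value(W, p=1, item=3))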

15.2 ItemRank
An extension of PageRank that can be used for collaborative recommendation, based on the random-walk-with-restart idea. It assumes that a bipartite graph has already been built.

We assume a random walker on the graph:

x_i(k + 1) = α P^T x_i(k) + (1 − α) v_i

where v_i contains 1/n_i for every item user i bought and 0 for the rest (n_i being the number of bought items), and P is the transition probability matrix derived from the graph.

The random walker has a probability 1 − α of being teleported to a bought-item node. The steady-state solution provides the similarity score associated with each item. For user i:

x_i = (1 − α)(I − αP^T)^{−1} v_i
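A direct sketch of the closed-form steady state (NumPy):

import numpy as np

def itemrank(P, bought_items, alpha=0.85):
    n = P.shape[0]
    v = np.zeros(n)
    v[bought_items] = 1.0 / len(bought_items)  # restart distribution v_i
    # Steady state: x = (1 - alpha) (I - alpha P^T)^{-1} v
    x = np.linalg.solve(np.eye(n) - alpha * P.T, (1 - alpha) * v)
    return x                                   # similarity score of each item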

15.3 Performance Evaluation


A solution is to delete some links to bought items; these form the test set, and the algorithm then has to predict the held-out items. Cross-validation is usually used, with the precision and recall metrics.

It is also important to look at the "surprise" (unexpectedness) of the recommendations, so as not to always recommend the popular items: there must be some diversity in the recommended items.

16 Latent Class Model and EM Algorithm


Describe, but without detailing the EM algorithm applied to the complete likelihood function,
the collaborative recommendation system based on latent classes. Derive the update
formulas from a heuristic perspective. Moreover, how can we assess a collaborative recommendation
system?

16.1 Latent Class Model
This model is used for collaborative filtering. We have a random variable x representing the individuals (m in total) and y representing the items (n in total). There exists a latent, unobserved class variable z, representing classes of interest of the users and classes of items (l classes in total). The variables x and y are assumed conditionally independent given z. The number of latent classes l is given a priori.

We are interested in the posterior probability distribution of the items j for some individual i, P(y = j | x = i), from which the most probable items will be recommended to the user. P(x = x, y = y, z = k) is often abbreviated P(x, y, k) and represents the discrete probability mass function; x and y were defined earlier and i, j, k are simple values.

The joint distribution users-items is

P(x, y) = ∑_{k=1}^{l} P(x, y, z = k)
        = ∑_{k=1}^{l} P(x, y | z = k) P(z = k)
        = ∑_{k=1}^{l} P(x | z = k) P(y | z = k) P(z = k)

The parameters of the model are defined as the probability masses of the discrete random variables:
— P(k) = P(z = k), the class prior probabilities
— P(x|k) = P(x = x | z = k), the within-class distribution over users
— P(y|k) = P(y = y | z = k), the within-class distribution over items
The posterior probability of an arbitrary item y for a user x can be computed from:

P(y|x) = ∑_{k=1}^{l} P(x|k) P(y|k) P(k) / ∑_{k′=1}^{l} P(x|k′) P(k′)

Variable z is considered an unobserved, hidden latent variable: it appears in the complete likelihood as a random variable, not yet known. The parameters are estimated by maximum likelihood, but this function is hard to maximize directly.

16.2 EM Algorithm
This algorithm is used to estimate the parameters of the latent class model. First, we need to write down the log-likelihood of the data, l. The EM algorithm then iterates 2 steps, increasing the likelihood each time, until convergence:
1. Expectation step : estimates the value of z for each (user, item) pair ⇒ it performs a soft clustering. It computes the expectation of l given the current value of the parameters Θ̂ and the observations (in vectors x and y): E[l | x, y; Θ̂].
2. Maximization step : provides re-estimation formulas for the parameters when the class memberships are known. It maximizes the expectation of the log-likelihood in terms of the parameters, assuming the hidden variables are fixed.
After convergence, the predictions are then given by P̂(y|x).
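A heuristic EM sketch for this latent class model (NumPy); the soft-count re-estimation formulas follow the usual EM recipe and are my reconstruction, not the course's derivation:

import numpy as np

def latent_class_em(pairs, m, n, l, n_iter=50, seed=0):
    # pairs: observed (user, item) interactions; m users, n items, l classes
    rng = np.random.default_rng(seed)
    Pk = np.full(l, 1.0 / l)                 # P(k)
    Px = rng.dirichlet(np.ones(m), size=l)   # P(x|k), one row per class
    Py = rng.dirichlet(np.ones(n), size=l)   # P(y|k)
    users = np.array([u for u, _ in pairs])
    items = np.array([i for _, i in pairs])
    for _ in range(n_iter):
        # E-step: responsibilities P(z = k | x, y) for every observation
        resp = Pk[:, None] * Px[:, users] * Py[:, items]
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: re-estimate the probability masses from the soft counts
        Pk = resp.sum(axis=1) / len(pairs)
        for k in range(l):
            Px[k] = np.bincount(users, weights=resp[k], minlength=m)
            Py[k] = np.bincount(items, weights=resp[k], minlength=n)
            Px[k] /= Px[k].sum()
            Py[k] /= Py[k].sum()
    return Pk, Px, Py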

16.3 Performance Evaluation


A solution is to delete some links to bought items. They form the test set. The algorithm thus has to
predict these items. Cross-validation is usually used with the precision and recall metrics.

It is also important to look at the "surprise" (unexpected) with some recommendations, to not always
recommend the popular ones. There must be some diversity in the recommended items.

17 Collaborative recommendation + non-negative matrix factorization

Describe in detail the basic model of collaborative recommendation, namely, the model based
on k nearest neighbors. Also explain in detail the non-negative matrix factorization techniques
for collaborative recommendation based on ratings. Moreover, how can we assess a collaborative
recommendation system?

17.1 Collaborative Recommendation


The most popular algorithm is based on a bipartite graph: an individual-based (or user-based) k nearest neighbors algorithm.

We have individuals (i) and items (j). Each individual is represented by a profile vector v_i in the item space, where v_ij = 1 if i bought item j and 0 otherwise. We also compute the individuals-items frequency matrix W, whose element w_ij indicates whether item j was purchased by individual i, or how many times (2 variants).

We first compute a similarity between 2 individuals i and j based on their profile vectors. There exist many possibilities, like the cosine similarity, or a Jaccard-type index

sim(i, j) = a/(a + b + c)

where a is the number of items bought by both individuals, and b and c are the numbers of items bought by only one of them.

From those similarities, one can compute the k nearest neighbors of each individual and compute a predicted value for each item; the items first proposed to this individual are the ones with the highest predicted values.

The predicted value of item i for individual p is computed as the number of times item i has been purchased by p's neighbors, weighted by the similarities between the individuals:

pred(p, i) = ∑_{q∈N(p)} sim(p, q) w_qi / ∑_{q∈N(p)} sim(p, q)

17.2 Non-negative matrix factorization

Let W ≥ 0 be the users-items matrix. The elements of the matrix (here containing ratings) are approximated by the inner product of two latent feature vectors, where u_i is the latent feature vector associated to user i, characterizing his tastes, and v_j the vector associated to item j, characterizing its attributes. Both are constrained to be non-negative. Let U and V contain the latent vectors on their rows ; we then have W ≈ UV^T.
To compute U and V, we can use the alternating least squares algorithm to try to reconstruct the ratings W, alternating between the two constrained problems

\min_U \|W - UV^T\|_F^2 \text{ subject to } U \ge 0, \quad \text{and} \quad \min_V \|W - UV^T\|_F^2 \text{ subject to } V \ge 0

However, this works only when W does not contain too many missing values (when it is not sparse) : the null elements are biasing the predictions. We then need to avoid the missing values in the objective function, and we then minimize

\sum_{(i,j) \in \epsilon} (w_{ij} - u_i^T v_j)^2

where \epsilon is the set of observed (user, item) ratings.
This objective is often optimized by performing a constrained (projected) stochastic gradient descent.
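A possible sketch of such a projected stochastic gradient descent on the observed entries only ; the max(·, 0) projection enforces the non-negativity constraints, and all names and hyper-parameter values are illustrative.

import numpy as np

def sgd_mf(ratings, n_users, n_items, r=10, lr=0.01, n_epochs=50, seed=0):
    """Matrix factorization sketch on observed entries, by projected SGD.

    ratings: list of observed (i, j, w_ij) triplets (hypothetical format).
    """
    rng = np.random.default_rng(seed)
    U = rng.random((n_users, r)) * 0.1
    V = rng.random((n_items, r)) * 0.1
    for _ in range(n_epochs):
        rng.shuffle(ratings)
        for i, j, w in ratings:
            err = w - U[i] @ V[j]             # residual on the observed entry
            # Simultaneous gradient step, then projection onto the non-negative orthant
            U[i], V[j] = (np.maximum(U[i] + lr * err * V[j], 0.0),
                          np.maximum(V[j] + lr * err * U[i], 0.0))
    return U, V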
17.3 Performance Evaluation

A solution is to delete some links to bought items ; they form the test set. The algorithm then has to predict these held-out items. Cross-validation is usually used, together with the precision and recall metrics.

It is also important to look at the "surprise" (unexpectedness) of some recommendations, so as not to always recommend the popular items : there must be some diversity in the recommended items.
18 Reputation Model
Describe a simple reputation model, its parameters and its assumptions. How do we estimate its parameters ? Moreover, how can we assess a collaborative recommendation system ?

18.1 Simple Reputation Model
Suppose we have n_x providers, each having some internal latent quality of service q, who send items with latent quality x. We also have n_y customers receiving items or services through customer/provider transactions. These customers provide a rating y on the received item, which is observed and standardized to have zero mean and unit variance.
The quality x_{ki} of the item i sent by provider k is assumed to be normally distributed and centered on the quality of the provider :

x_{ki} = q_k + \epsilon^x_{ki}, \quad \epsilon^x_{ki} \sim N(0, \sigma^x_k)

Each provider is thus characterized by two features : his internal score q_k and his stability in providing a constant quality, \sigma^x_k. The consumer l who ordered item i rates the transaction based on the inspection of its quality x_{ki}. His rating depends on three parameters : his reactivity a_l with respect to the quality of the provided item, his bias b_l, and his stability in providing constant ratings for a fixed observed quality, \sigma^y_l.
The rating y_{kli} provided by consumer l for transaction i with provider k is modeled by the linear regression

y_{kli} = a_l x_{ki} + b_l + \epsilon^y_{li}, \quad \epsilon^y_{li} \sim N(0, \sigma^y_l)

By setting a_l = 1 and assuming the provider always delivers the same quality level, we can simplify the model into

y_{kli} = q_k + b_l + \epsilon_{li}
In this case, the parameters can be estimated by iterating the following update formulas until convergence :

\hat{q}_k \leftarrow \frac{\sum_{l=1}^{n_c} \frac{1}{\hat{\sigma}_l^2} \left[ \sum_{i \in \{k \to l\}} \left( y_{kli} - \hat{b}_l \right) \right]}{\sum_{l'=1}^{n_c} \frac{n_{kl'}}{\hat{\sigma}_{l'}^2}}, \text{ for all } k

\hat{b}_l \leftarrow \frac{\sum_{k=1}^{n_p} \sum_{i \in \{k \to l\}} (y_{kli} - \hat{q}_k)}{\sum_{k'=1}^{n_p} n_{k'l}}, \text{ for all } l, \text{ and then center the } \hat{b}_l

\hat{\sigma}_l^2 \leftarrow \frac{\sum_{k=1}^{n_p} \sum_{i \in \{k \to l\}} \left( y_{kli} - \hat{q}_k - \hat{b}_l \right)^2}{\sum_{k'=1}^{n_p} n_{k'l}}, \text{ for all } l

with n_{kl} = \sum_{i \in \{k \to l\}} 1, the number of transactions between provider k and consumer l.
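A minimal sketch of these alternating updates, assuming the data come as (provider, consumer, rating) triplets ; all names are illustrative.

import numpy as np

def estimate_reputation(transactions, n_providers, n_consumers, n_iter=50):
    """Iterative estimation sketch for the simplified model y_kli = q_k + b_l + eps."""
    q = np.zeros(n_providers)          # provider scores q_k
    b = np.zeros(n_consumers)          # consumer biases b_l
    var = np.ones(n_consumers)         # consumer rating variances sigma_l^2
    for _ in range(n_iter):
        # q_k: precision-weighted average of bias-corrected ratings
        num = np.zeros(n_providers); den = np.zeros(n_providers)
        for k, l, y in transactions:
            num[k] += (y - b[l]) / var[l]
            den[k] += 1.0 / var[l]
        q = num / np.maximum(den, 1e-12)
        # b_l: average residual of consumer l, then centered
        num = np.zeros(n_consumers); cnt = np.zeros(n_consumers)
        for k, l, y in transactions:
            num[l] += y - q[k]
            cnt[l] += 1.0
        b = num / np.maximum(cnt, 1.0)
        b -= b.mean()
        # sigma_l^2: variance of consumer l's residuals
        ss = np.zeros(n_consumers)
        for k, l, y in transactions:
            ss[l] += (y - q[k] - b[l]) ** 2
        var = np.maximum(ss / np.maximum(cnt, 1.0), 1e-6)
    return q, b, var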
19 Markov Chains + Absorbing
Describe in detail what is a Markov chain and derive its evolution equation, as well as how to compute the absorbing probabilities (probability of being absorbed in an absorbing state) in the case of absorbing Markov chains. Also discuss some applications

19.1 Markov Chains
We have a finite set of states S = {1, 2, ..., n} ; s_t = k means that the process is in state k at time t. Markov chains are sequential, discrete-time and discrete-state stochastic processes. We have a matrix P, called the one-step transition probabilities matrix, whose rows sum to 1, with p_{ij} = P(s_{t+1} = j | s_t = i), and we assume that jumping to a state only depends on the current state (Markov property). We also assume that those transitions are homogeneous, i.e., independent of time. Let's compute P(s_{t+2} = j | s_t = i) :

P(s_{t+2} = j | s_t = i) = \sum_{k=1}^n P(s_{t+2} = j, s_{t+1} = k | s_t = i)
= \sum_{k=1}^n P(s_{t+2} = j | s_t = i, s_{t+1} = k) \, P(s_{t+1} = k | s_t = i)
= \sum_{k=1}^n P(s_{t+2} = j | s_{t+1} = k) \, P(s_{t+1} = k | s_t = i)
= \sum_{k=1}^n p_{kj} \, p_{ik}
= [P^2]_{ij}
Matrix P^2 is the 2-step transition probabilities matrix ; more generally, [P^\tau]_{ij} = P(s_{t+\tau} = j | s_t = i). If x(t) is the column vector containing the probability distribution of finding the process in each state of the Markov chain at time step t :

x_i(t) = P(s_t = i)
= \sum_{k=1}^n P(s_t = i, s_{t-1} = k)
= \sum_{k=1}^n P(s_t = i | s_{t-1} = k) \, P(s_{t-1} = k)
= \sum_{k=1}^n p_{ki} \, x_k(t-1)

and thus, in matrix form, x(t) = P^T x(t-1).

This is the time evolution equation of the Markov chain, providing the probability of being in each state at any time t, x(t), given the initial probabilities x(0). When starting from state i, x(0) = e_i and x_j(t | s_0 = i) = x_{j|i}(t) = (e_i)^T P^t e_j ; this corresponds to the probability of observing the process in state j at time t when starting from state i at time t = 0.
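A minimal numerical illustration of the evolution equation x(t) = P^T x(t-1), with a hypothetical 3-state transition matrix :

import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.0, 1.0]])   # third state is absorbing (p_ii = 1)

x = np.array([1.0, 0.0, 0.0])     # x(0) = e_1: start in the first state
for t in range(50):
    x = P.T @ x                   # propagate the distribution one step
print(x)                          # probability of being in each state at t = 50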
19.2 Absorbing Markov Chains

A state i is called absorbing if it is impossible to leave it : p_{ii} = 1. An absorbing Markov chain is a chain containing absorbing states ; the other states are called transient (TR). The transition matrix can be put in canonical form

P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}

where Q is the transition matrix between transient states and R is the transition matrix from transient to absorbing states. Both are sub-stochastic : their row sums are ≤ 1 and at least one row sum is < 1. P^t can then be computed as

P^t = \begin{pmatrix} Q^t & \left( \sum_{s=0}^{t-1} Q^s \right) R \\ 0 & I \end{pmatrix}

Since Q is sub-stochastic, Q^t → 0 when t → ∞. The matrix N = (I - Q)^{-1} = \sum_{t=0}^{\infty} Q^t is called the fundamental
matrix of the absorbing Markov chain. If i, j are transient states, entry i, j of N is :

n_{ij} = e_i^T N e_j
= e_i^T \left( \sum_{t=0}^{\infty} Q^t \right) e_j
= \sum_{t=0}^{\infty} x_{j|i}(t)

Each n_{ij} contains the expected number of passages (visits) through transient state j, when starting from state i. The expected number of visits before being absorbed, when starting from each state, is n = N e.
We can also compute the probability of being absorbed by state j given that we started in state i :

b_{ij} = \sum_{t=0}^{\infty} \sum_{k=1}^{n_{tr}} x_{k|i}(t) \, r_{kj}
= \sum_{k=1}^{n_{tr}} n_{ik} \, r_{kj}
= [NR]_{ij}

where the sum over k is taken on the set of transient states only. All those probabilities can be gathered into the matrix B = NR.
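A minimal numerical sketch of these quantities, for a hypothetical chain with two transient and two absorbing states :

import numpy as np

Q = np.array([[0.5, 0.3],        # transient -> transient
              [0.2, 0.4]])
R = np.array([[0.2, 0.0],        # transient -> absorbing
              [0.1, 0.3]])

N = np.linalg.inv(np.eye(2) - Q) # fundamental matrix N = (I - Q)^{-1}
n = N @ np.ones(2)               # expected number of visits before absorption
B = N @ R                        # absorption probabilities B = NR
print(B.sum(axis=1))             # each row sums to 1: absorption is certain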
20 Markov Chains + Expected number of visits

Describe in detail what is a Markov chain and derive its evolution equation, as well as how to compute the expected number of visits to each state in the case of absorbing Markov chains. Also discuss some applications

20.1 Markov Chains
We have a finite set of states S = {1, 2, ..., n} ; s_t = k means that the process is in state k at time t. Markov chains are sequential, discrete-time and discrete-state stochastic processes. We have a matrix P, called the one-step transition probabilities matrix, whose rows sum to 1, with p_{ij} = P(s_{t+1} = j | s_t = i), and we assume that jumping to a state only depends on the current state (Markov property). We also assume that those transitions are homogeneous, i.e., independent of time. Let's compute P(s_{t+2} = j | s_t = i) :

P(s_{t+2} = j | s_t = i) = \sum_{k=1}^n P(s_{t+2} = j, s_{t+1} = k | s_t = i)
= \sum_{k=1}^n P(s_{t+2} = j | s_t = i, s_{t+1} = k) \, P(s_{t+1} = k | s_t = i)
= \sum_{k=1}^n P(s_{t+2} = j | s_{t+1} = k) \, P(s_{t+1} = k | s_t = i)
= \sum_{k=1}^n p_{kj} \, p_{ik}
= [P^2]_{ij}
Matrix P^2 is the 2-step transition probabilities matrix ; more generally, [P^\tau]_{ij} = P(s_{t+\tau} = j | s_t = i). If x(t) is the column vector containing the probability distribution of finding the process in each state of the Markov chain at time step t :

x_i(t) = P(s_t = i)
= \sum_{k=1}^n P(s_t = i, s_{t-1} = k)
= \sum_{k=1}^n P(s_t = i | s_{t-1} = k) \, P(s_{t-1} = k)
= \sum_{k=1}^n p_{ki} \, x_k(t-1)

and thus, in matrix form, x(t) = P^T x(t-1).

This is the time evolution equation of the Markov chain, providing the probability of being in each state at any time t, x(t), given the initial probabilities x(0). When starting from state i, x(0) = e_i and x_j(t | s_0 = i) = x_{j|i}(t) = (e_i)^T P^t e_j ; this corresponds to the probability of observing the process in state j at time t when starting from state i at time t = 0.
20.2 Absorbing Markov Chains

A state i is called absorbing if it is impossible to leave it : p_{ii} = 1. An absorbing Markov chain is a chain containing absorbing states ; the other states are called transient (TR). The transition matrix can be put in canonical form

P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}

where Q is the transition matrix between transient states and R is the transition matrix from transient to absorbing states. Both are sub-stochastic : their row sums are ≤ 1 and at least one row sum is < 1. P^t can then be computed as

P^t = \begin{pmatrix} Q^t & \left( \sum_{s=0}^{t-1} Q^s \right) R \\ 0 & I \end{pmatrix}

Since Q is sub-stochastic, Q^t → 0 when t → ∞. The matrix N = (I - Q)^{-1} = \sum_{t=0}^{\infty} Q^t is called the fundamental matrix of the absorbing Markov chain. If i, j are transient states, entry i, j of N is :
n_{ij} = e_i^T N e_j
= e_i^T \left( \sum_{t=0}^{\infty} Q^t \right) e_j
= \sum_{t=0}^{\infty} x_{j|i}(t)

Each n_{ij} contains the expected number of passages (visits) through transient state j, when starting from state i. The expected number of visits before being absorbed, when starting from each state, is n = N e.
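A small sketch checking the expected number of visits n = N e by simulation, on a hypothetical chain with two transient states (0, 1) and one absorbing state (2) :

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.4, 0.4],
              [0.0, 0.0, 1.0]])
Q = P[:2, :2]
N = np.linalg.inv(np.eye(2) - Q)
print(N @ np.ones(2))            # expected visits before absorption (here 3.75 from state 0)

rng = np.random.default_rng(0)
visits = []
for _ in range(10_000):          # simulate trajectories starting in state 0
    s, count = 0, 0
    while s != 2:                # count visits to transient states until absorption
        count += 1
        s = rng.choice(3, p=P[s])
    visits.append(count)
print(np.mean(visits))           # should approach (N e)[0]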
21 Markov Chains + Lifetime Value

Describe in detail what is a Markov chain and derive its evolution equation, as well as how the «lifetime value» of a customer can be computed according to this model (marketing application)

21.1 Markov Chains
We have a finite set of states S = {1, 2, ..., n} ; s_t = k means that the process is in state k at time t. Markov chains are sequential, discrete-time and discrete-state stochastic processes. We have a matrix P, called the one-step transition probabilities matrix, whose rows sum to 1, with p_{ij} = P(s_{t+1} = j | s_t = i), and we assume that jumping to a state only depends on the current state (Markov property). We also assume that those transitions are homogeneous, i.e., independent of time. Let's compute P(s_{t+2} = j | s_t = i) :

P(s_{t+2} = j | s_t = i) = \sum_{k=1}^n P(s_{t+2} = j, s_{t+1} = k | s_t = i)
= \sum_{k=1}^n P(s_{t+2} = j | s_t = i, s_{t+1} = k) \, P(s_{t+1} = k | s_t = i)
= \sum_{k=1}^n P(s_{t+2} = j | s_{t+1} = k) \, P(s_{t+1} = k | s_t = i)
= \sum_{k=1}^n p_{kj} \, p_{ik}
= [P^2]_{ij}
Matrix P^2 is the 2-step transition probabilities matrix ; more generally, [P^\tau]_{ij} = P(s_{t+\tau} = j | s_t = i). If x(t) is the column vector containing the probability distribution of finding the process in each state of the Markov chain at time step t :

x_i(t) = P(s_t = i)
= \sum_{k=1}^n P(s_t = i, s_{t-1} = k)
= \sum_{k=1}^n P(s_t = i | s_{t-1} = k) \, P(s_{t-1} = k)
= \sum_{k=1}^n p_{ki} \, x_k(t-1)

and thus, in matrix form, x(t) = P^T x(t-1).

This is the time evolution equation of the Markov chain, providing the probability of being in each state at any time t, x(t), given the initial probabilities x(0). When starting from state i, x(0) = e_i and x_j(t | s_0 = i) = x_{j|i}(t) = (e_i)^T P^t e_j ; this corresponds to the probability of observing the process in state j at time t when starting from state i at time t = 0.
21.2 Marketing Application

Suppose we have a model with n customer clusters/segments, based on some metric like RFM (Recency, Frequency, Monetary value). Each cluster becomes a state of a Markov chain. The nth cluster corresponds to lost clients ; it is absorbing and generates no benefit. Each month, we observe the movements from cluster to cluster, and transition probabilities are estimated by counting the observed frequencies of jumping from one state to another. This provides the entries of the transition probabilities matrix. There is an average profit m_i associated with each customer in state i (it can be negative) and a discounting factor 0 < γ < 1. We want to compute the expected profit on an infinite time horizon :
m = \sum_{t=0}^{\infty} \gamma^t \sum_{i=1}^n x_i(t) \, m_i
= \sum_{t=0}^{\infty} \gamma^t \, m^T (P^T)^t \, x(0)
= \sum_{t=0}^{\infty} \gamma^t \, x^T(0) \, P^t \, m
= x^T(0) \, (I - \gamma P)^{-1} \, m
This provides the expected profit generated by a customer before he leaves the company, knowing that m_i = 0 for the absorbing state.
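A minimal numerical sketch of this lifetime value computation, with hypothetical transition probabilities and profits :

import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])   # last segment = lost clients (absorbing)
m = np.array([50.0, 20.0, 0.0])   # average monthly profit per segment
gamma = 0.95                      # discounting factor
x0 = np.array([1.0, 0.0, 0.0])    # customer starts in the first segment

# Lifetime value: m = x(0)^T (I - gamma P)^{-1} m
ltv = x0 @ np.linalg.inv(np.eye(3) - gamma * P) @ m
print(ltv)                        # expected discounted profit before leaving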
22 Hidden Markov Model + Likelihood

Explain the general principles and the structure of a hidden Markov model for speech recognition. Moreover, without going into the deep details, how can we solve the first important problem in HMMs, namely computing the likelihood of the observations (the probability of generating a sequence of observations) ? And finally, how do we recognize an uttered word among a finite dictionary ?

22.1 Hidden Markov Models (HMM)
A HMM is a graphical model, as seen on Figure 1, made of hidden, unobserved states (each dependent only on the previous state : Markov property) and of discrete, categorical observations at each time step (p possible values in total). Arrows indicate probabilistic dependencies between states.

Figure 1 – Example of HMM

The s(t) are the random variables for the hidden states, taking their values from 1 to n ; there are no observations directly associated to the states. The x(t) are the random variables for the possible observations, whose discrete values are denoted o_k. Π = {π_i} are the initial state probabilities π_i = P(s(1) = i). P = {p_{ij}} are the state transition probabilities p_{ij} = P(s(t+1) = j | s(t) = i). B = {b_i(o_k)} are the observation or emission probabilities within the states, b_i(o_k) = P(x(t) = o_k | s(t) = i). Finally, θ = {Π, P, B}.
22.2 Likelihood Computation
We want to compute P(x|θ), where x = (x_1, ..., x_T) is the observation sequence. To do so, we define the forward variable

\alpha_i(t) = P(x_1, ..., x_t, s_t = i | \theta)

which is computed recursively (forward procedure) :

\alpha_i(1) = \pi_i \, b_i(x_1), \quad \alpha_j(t+1) = \left[ \sum_{i=1}^n \alpha_i(t) \, p_{ij} \right] b_j(x_{t+1})

and from which

P(x|\theta) = \sum_{j=1}^n P(x_1, ..., x_T, s_T = j | \theta) = \sum_{j=1}^n \alpha_j(T)

And the backward procedure :

\beta_i(t) = P(x_{t+1}, ..., x_T | s_t = i, \theta) = \sum_{j=1}^n p_{ij} \, b_j(x_{t+1}) \, \beta_j(t+1)

\beta_i(T) = 1

P(x|\theta) = \sum_{i=1}^n \pi_i \, b_i(x_1) \, \beta_i(1)

Combined together, for any time t, we get

P(x|\theta) = \sum_{i=1}^n \alpha_i(t) \, \beta_i(t)

To recognize an uttered word among a finite dictionary, one HMM is trained for each word of the dictionary ; an uttered word is then recognized by computing the likelihood P(x|θ_word) of the observation sequence under each word model and choosing the word whose model yields the highest likelihood.
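A minimal sketch of the forward procedure for a discrete HMM ; to recognize a word, this function would be called once per word model and the argmax taken. In practice the α's are rescaled at each step to avoid numerical underflow on long sequences ; this is omitted here, and all names are illustrative.

import numpy as np

def forward_likelihood(pi, P, B, obs):
    """Forward procedure sketch: computes P(x | theta).

    pi: (n,) initial state probabilities
    P:  (n, n) transition probabilities p_ij
    B:  (n, p) emission probabilities b_i(o_k)
    obs: observation sequence, as indices of the p symbols
    """
    alpha = pi * B[:, obs[0]]              # alpha_i(1) = pi_i b_i(x_1)
    for x in obs[1:]:
        alpha = (alpha @ P) * B[:, x]      # alpha_j(t+1) = [sum_i alpha_i(t) p_ij] b_j(x_{t+1})
    return alpha.sum()                     # P(x | theta) = sum_j alpha_j(T)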
23 Hidden Markov Model + Most likely states computation

Explain the general principles and the structure of a hidden Markov model for speech recognition. Moreover, without going into the deep details, how do we solve the second important problem in HMMs, namely how do we compute the most likely state sequence given the observations (decoding) ? Finally, explain intuitively how we solve the third important problem in HMMs, namely estimating the parameters of the model

23.1 Hidden Markov Models (HMM)
A HMM is a graphical model, as seen on Figure 2, made of hidden, unobserved states (each dependent only on the previous state : Markov property) and of discrete, categorical observations at each time step (p possible values in total). Arrows indicate probabilistic dependencies between states.

Figure 2 – Example of HMM

The s(t) are the random variables for the hidden states, taking their values from 1 to n ; there are no observations directly associated to the states. The x(t) are the random variables for the possible observations, whose discrete values are denoted o_k. Π = {π_i} are the initial state probabilities π_i = P(s(1) = i). P = {p_{ij}} are the state transition probabilities p_{ij} = P(s(t+1) = j | s(t) = i). B = {b_i(o_k)} are the observation or emission probabilities within the states, b_i(o_k) = P(x(t) = o_k | s(t) = i). Finally, θ = {Π, P, B}.
23.2 Optimal state sequence
To find the most probable state at time t given the observations :

\gamma_i(t) = P(s_t = i | x, \theta)
= \frac{P(x, s_t = i | \theta)}{P(x|\theta)}
= \frac{\alpha_i(t) \, \beta_i(t)}{\sum_{j=1}^n \alpha_j(t) \, \beta_j(t)}
To find the state sequence that best explains the observations, we can use the Viterbi algorithm, which is a dynamic programming algorithm :

\arg\max_s P(s | x, \theta) = \arg\max_s P(s, x | \theta)

\delta_j(t) = \max_{s_1, ..., s_{t-1}} P(s_1, ..., s_{t-1}, x_1, ..., x_{t-1}, s_t = j, x_t | \theta)

It defines the state sequence which maximizes the probability of generating the observations up to time t − 1, landing in state j and emitting the observation at time t. Recursively, we can compute all values :

\delta_j(1) = \pi_j \, b_j(x_1)
\delta_j(t+1) = \max_i \{ \delta_i(t) \, p_{ij} \, b_j(x_{t+1}) \}
\psi_j(t+1) = \arg\max_i \{ \delta_i(t) \, p_{ij} \, b_j(x_{t+1}) \}

The most likely state sequence is then recovered by backtracking through the ψ values from the best final state.
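A minimal sketch of Viterbi decoding, working in log-space to avoid numerical underflow ; all names are illustrative.

import numpy as np

def viterbi(pi, P, B, obs):
    """Viterbi decoding sketch: most likely state sequence given the observations."""
    logP, logB = np.log(P), np.log(B)
    n, T = len(pi), len(obs)
    delta = np.log(pi) + logB[:, obs[0]]   # delta_j(1) = pi_j b_j(x_1), in log-space
    psi = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logP     # scores[i, j] = delta_i(t-1) + log p_ij
        psi[t] = scores.argmax(axis=0)     # best predecessor of each state j
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]           # start backtracking from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]                      # most likely state sequence, in time order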
23.3 Parameters estimation

Given an observation sequence, find the model parameters Π, P, B that most likely produce that sequence (maximum likelihood). Define

\gamma_{ij}(t) = P(s_t = i, s_{t+1} = j | x, \theta) = \frac{\alpha_i(t) \, p_{ij} \, b_j(x_{t+1}) \, \beta_j(t+1)}{\sum_{k=1}^n \alpha_k(t) \, \beta_k(t)}

the probability of traversing the arc from i to j at time t, and

\gamma_i(t) = P(s_t = i | x, \theta)

the probability of being in state i at time t.
Then π̂_i = γ_i(1), the probability of starting from i. We can now compute the new estimates of the model parameters :

\hat{p}_{ij} = \frac{\text{expected number of transitions from } i \text{ to } j}{\text{expected number of transitions out of state } i} = \frac{\sum_{t=1}^{T-1} \gamma_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}

\hat{b}_i(o_k) = \frac{\text{expected number of emissions of } o_k \text{ in state } i}{\text{total number of emissions in state } i} = \frac{\sum_{t : x_t = o_k} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}

The computation of the forward and backward variables α and β, followed by the re-estimation of the parameters Π, P, B, is then iterated until convergence (this is the Baum-Welch algorithm). At each iteration, the likelihood increases.
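A minimal sketch of one such re-estimation (Baum-Welch) iteration for a single observation sequence, without the rescaling normally used on long sequences ; all names are illustrative.

import numpy as np

def baum_welch_step(pi, P, B, obs):
    """One re-estimation step sketch for a discrete HMM."""
    obs = np.asarray(obs)
    n, T = len(pi), len(obs)
    alpha = np.zeros((T, n)); beta = np.ones((T, n))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ P) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = P @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                         # gamma_i(t), after normalization
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((T - 1, n, n))                 # gamma_ij(t): arc-traversal probabilities
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * P * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    new_pi = gamma[0]                            # pi_i = gamma_i(1)
    new_P = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                  # emission re-estimation
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_P, new_B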