The joint distribution p(x, t) is given by the table

x\t     0       1
0       0.1     0.2
1       0.2     0.1
2       0.1     0.3

and the corresponding conditional distribution p(t|x) is

x\t     0       1
0       0.33    0.67
1       0.67    0.33
2       0.25    0.75
The optimal hard predictor t̂*(·), which minimizes the population loss Lp(t̂(·)), is given under the detection-error loss by the MAP predictor t̂*(x) = argmax_t p(t|x). From the conditional distribution above, this yields t̂*(0) = 1, t̂*(1) = 0, t̂*(2) = 1, with population loss Lp(t̂*(·)) = 0.1 + 0.1 + 0.1 = 0.3.
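As a check, the following MATLAB sketch (the matrix p and the variables tMAP and LpMAP are introduced here only for illustration) computes the MAP predictor and its population detection-error loss directly from the table:
p=[0.1 0.2; 0.2 0.1; 0.1 0.3]; % true joint distribution: rows x=0,1,2, columns t=0,1
tMAP=zeros(3,1);
for i=1:3
[~,I]=max(p(i,:)); % MAP decision for x=i-1
tMAP(i)=I-1;
end
LpMAP=0;
for i=1:3
LpMAP=LpMAP+p(i,2-tMAP(i)); % probability of the wrong label for x=i-1
end
LpMAP % expected to equal 0.3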
MATLAB code to estimate the empirical joint distribution pD(x, t) from the data set:
% pest(i,j) estimates pD(x = i-1, t = j-1)
pest=zeros(3,2);
for n=1:N
pest(x(n)+1,t(n)+1)=pest(x(n)+1,t(n)+1)+1/N;
end
For the same joint distribution as in the previous problem, generate again a data set D of N = 10 data points (a sampling sketch is given below).
For this data set, assuming the detection-error loss, compute the training loss LD(t̂(·)) for the predictor t̂(0) = 0, t̂(1) = 1, and t̂(2) = 0.
Identify the predictor t̂D(·) that minimizes the training loss, i.e., the empirical risk minimization (ERM) predictor, assuming a model class H with t̂(0|θ) = θ1, t̂(1|θ) = θ2 and t̂(2|θ) = θ3 for θ1, θ2, θ3 ∈ {0, 1}, and compare its training loss with that of the predictor considered in the previous point.
Compute the population loss of the ERM predictor.
Using your previous results, calculate the optimality gap Lp(t̂D(·)) − Lp(t̂*(·)).
Repeat the points above with N = 10000 and comment on your results.
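One possible way to generate the data set is sketched below; the column vectors x and t and the constant N match the variables used in the code above, while pairs, probs and that are introduced only for illustration.
N=10;
p=[0.1 0.2; 0.2 0.1; 0.1 0.3]; % joint distribution: rows x=0,1,2, columns t=0,1
pairs=[0 0; 0 1; 1 0; 1 1; 2 0; 2 1]; % the six possible (x,t) pairs
probs=[p(1,1) p(1,2) p(2,1) p(2,2) p(3,1) p(3,2)];
x=zeros(N,1); t=zeros(N,1);
for n=1:N
idx=find(rand<=cumsum(probs),1); % sample a pair index from the joint distribution
x(n)=pairs(idx,1); t(n)=pairs(idx,2);
end
% training loss of the fixed predictor t_hat(0)=0, t_hat(1)=1, t_hat(2)=0 under the detection-error loss
that=[0;1;0];
LD=mean(t~=that(x+1))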
The ERM predictor is random since it depends on pD(x, t). In this problem, we have θ = [t̂(0), t̂(1), t̂(2)]^T and hence we impose no constraints on the predictor. Under the detection-error loss, with discrete variables, ERM is given by the MAP predictor applied to pD(x, t), i.e.,
t̂^ERM_D(x) = argmax_t pD(t|x) = argmax_t pD(x, t).
MATLAB code:
% ERM predictor: for each value of x, select the value of t with the largest empirical probability pest(x+1, t+1)
tERM=zeros(3,1);
for i=1:3
[M,I]=max(pest(i,:));
tERM(i)=I-1;
end
The training loss LD(t̂^ERM_D(·)) for the ERM predictor is also random and, for the detection-error loss, is given as
LD(t̂^ERM_D(·)) = pD(x = 0, t ≠ t̂^ERM_D(0)) + pD(x = 1, t ≠ t̂^ERM_D(1)) + pD(x = 2, t ≠ t̂^ERM_D(2)).
MATLAB code:
LDERM=pest(1,~tERM(1)+1)+pest(2,~tERM(2)+1)+pest(3,~tERM(3)+1)
% compare with LD!
Similarly, the population loss Lp(t̂^ERM_D(·)) for the ERM predictor is random and is generally given as
Lp(t̂^ERM_D(·)) = p(x = 0, t ≠ t̂^ERM_D(0)) + p(x = 1, t ≠ t̂^ERM_D(1)) + p(x = 2, t ≠ t̂^ERM_D(2)).
MATLAB code:
p(1,1)=0.1; p(1,2)=0.2; p(2,1)=0.2; p(2,2)=0.1; p(3,1)=0.1; p(3,2)=0.3;
LpERM=p(1,~tERM(1)+1)+p(2,~tERM(2)+1)+p(3,~tERM(3)+1) % compare with Lp(t̂*(·)) = 0.3!
The improved results in terms of the optimality gap Lp(t̂D(·)) − Lp(t̂*(·)) for N = 10000 are a consequence of the law of large numbers: the law of large numbers implies that the training loss
LD(θ) = (1/N) Σ_{n=1}^{N} ℓ(t_n, t̂(x_n|θ))
tends to the population loss Lp(θ) as N grows, so that the ERM predictor approaches the population-optimal predictor.
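As a quick numerical illustration of this convergence (a sketch reusing the sampling procedure above; the variable maxdev is introduced only for illustration), one can compare the empirical and true joint distributions for N = 10 and N = 10000:
p=[0.1 0.2; 0.2 0.1; 0.1 0.3];
probs=[p(1,1) p(1,2) p(2,1) p(2,2) p(3,1) p(3,2)];
for N=[10 10000]
pest=zeros(3,2);
for n=1:N
idx=find(rand<=cumsum(probs),1);
xi=ceil(idx/2)-1; ti=mod(idx-1,2); % recover (x,t) from the pair index
pest(xi+1,ti+1)=pest(xi+1,ti+1)+1/N;
end
maxdev=max(abs(pest(:)-p(:))) % maximum deviation shrinks as N grows
end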
Show that the training loss for Example 2 in the text can be written as
LD(θ) = (1/N) ||tD − XD θ||².
Show that the condition (1/N) X_D^T X_D = I imposes that, on average over the data set, all the features u_d(x), d = 1, ..., D, have the same expected energy and they are uncorrelated.
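A sketch of the first identity, assuming (as in Example 2 of the text) a model t̂(x|θ) = θ^T u(x) that is linear in the feature vector u(x) = [u_1(x), ..., u_D(x)]^T and the quadratic loss:
\[
L_D(\theta) = \frac{1}{N}\sum_{n=1}^{N}\bigl(t_n - \theta^{T} u(x_n)\bigr)^{2}
            = \frac{1}{N}\,\bigl\| t_D - X_D \theta \bigr\|^{2},
\quad \text{where } X_D = \begin{bmatrix} u(x_1)^{T} \\ \vdots \\ u(x_N)^{T} \end{bmatrix},\;
t_D = [t_1, \dots, t_N]^{T}.
\]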
The matrix (1/N) X_D^T X_D is of dimension D × D and its (d, d)th entry is given as
[(1/N) X_D^T X_D]_{d,d} = (1/N) u_d^T u_d = (1/N) ||u_d||²,
where u_d = [u_d(x_1), u_d(x_2), ..., u_d(x_N)]^T is the dth column of matrix XD. Therefore, the condition [(1/N) X_D^T X_D]_{d,d} = 1 is equivalent to
(1/N) Σ_{n=1}^{N} u_d(x_n)² = 1,
which indicates that the energy of feature d, averaged over a uniform selection of samples from the training set, is equal to one, i.e., the same for all features. Similarly, the off-diagonal condition [(1/N) X_D^T X_D]_{d,d'} = (1/N) u_d^T u_{d'} = 0 for d ≠ d' imposes that distinct features are uncorrelated on average over the training set.
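As a quick numerical check of this interpretation (a sketch with randomly generated features, not tied to any specific data set in the problems):
N=1000; D=4;
X=randn(N,D); % random feature matrix with rows u(x_n)'
G=(1/N)*(X'*X);
avg_energy=diag(G)' % (d,d) entries: average energy of each feature (close to 1 here)
avg_corr=G(1,2) % (d,d') entries: average correlation between features (close to 0 here)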
MATLAB code:
function thERM=LSsolver(X,t)
% returns the least-squares (ERM) solution thERM=(X'*X)^(-1)*X'*t
thERM=inv(X'*X)*X'*t;
MATLAB code to plot the values of t against the values of x in the training set:
plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')
MATLAB code that uses the function LSsolver and plots the hard predictor t̂(x|θ^ERM_D):
thERM=LSsolver(X,t);
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[1,xaxis(l),xaxis(l)^2,xaxis(l)^3,xaxis(l)^4,xaxis(l)^5]';
tl(l)=thERM'*ul;
end
hold on; plot(xaxis,tl,'b','LineWidth',2)
Continuing the previous problem, compute the training loss LD(θ^ERM_M) for values of the model capacity M between 1 and 6, and plot the training loss versus M in this range.
Explain the result in terms of the capacity of the model to fit the training data.
Estimate the population loss using validation by computing the empirical loss on the held-out data xval and tval, which you can find in your workspace. Plot the validation error on the same figure.
Which value of M would you choose? Explain by using the concepts of bias and estimation error.
MATLAB code for computing the training loss LD(θ^ERM_M) for different values of M:
N=20;
for M=1:6
X=zeros(N,M+1);
for n=1:N
X(n,:)=[1,x(n).^[1:M]];
end
thERM=LSsolver(X,t);
LDERM(M)=1/N*norm(t-X*thERM)^2;
end
plot([1:6],LDERM,'r--','LineWidth',2)
xlabel('$M$','Interpreter','latex')
ylabel('quadratic loss','Interpreter','latex')
MATLAB code for computing the estimate of the population loss Lp(θ^ERM_M) via validation for different values of M:
for M=1:6
X=zeros(N,M+1);
for n=1:N
X(n,:)=[1,x(n).^[1:M]];
end
thERM=LSsolver(X,t);
Xval=xval.^[0:M];
LvERM(M)=1/N*norm(tval-Xval*thERM)^2; % the factor 1/N assumes that the validation set also contains N points
end
hold on; plot([1:6],LvERM,'k','LineWidth',2)
When M is small, the bias dominates and the training loss is relatively large. In
contrast, when M is large, the estimation error dominates and the gap between the
population and the training losses is large.
MATLAB code for regularized ERM (ridge regression) with M = 6 and λ = exp(−10):
M=6;
lambda=exp(-10);
N=20;
for n=1:N
X(n,:)=[x(n).^[0:M]]';
end
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[xaxis(l).^[0:M]]';
tl(l)=thRERM'*ul;
end
plot(xaxis,tl,'b','LineWidth',2);
hold on; plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')
LDRERM=1/N*norm(t-X*thRERM)^2
MATLAB code for regularized ERM (ridge regression) with M = 6 and λ = exp(−20):
M=6;
lambda=exp(-20);
N=20;
for n=1:N
X(n,:)=[x(n).^[0:M]]';
end
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[xaxis(l).^[0:M]]';
tl(l)=thRERM'*ul;
end
plot(xaxis,tl,'b','LineWidth',2);
hold on; plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')
LDRERM=1/N*norm(t-X*thRERM)^2
The training loss is smaller for a smaller value of λ since, in this case,
the optimization criterion is dominated by the training loss and the
regularization term is less relevant.
Continuing the previous problem, for M = 6, consider ten possible values for λ, namely λ = exp(v) with v = log(λ) taking one of ten equally spaced values between −30 and 10.
For each value of λ, compute the training loss LD(θ^R-ERM_D); a sketch is given below. Plot it as a function of log(λ). Comment on your results.
Evaluate the estimate of the population loss on the held-out set variables xval and tval, which you can find in your workspace. Plot the validation-based estimate of the population loss in the same figure. Which value of λ would you choose?
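A sketch of the training-loss sweep follows; it assumes that the design matrix X for M = 6, the targets t, and N = 20 are available in the workspace as in the code above, and it defines the vector lambdavec used in the validation code further below.
M=6; N=20;
lambdavec=exp(linspace(-30,10,10)); % ten values of lambda
for l=1:10
lambda=lambdavec(l);
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
LDRERM(l)=1/N*norm(t-X*thRERM)^2; % training loss for this value of lambda
end
plot(linspace(-30,10,10),LDRERM,'r--','LineWidth',2)
xlabel('$\log(\lambda)$','Interpreter','latex')
ylabel('quadratic loss','Interpreter','latex')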
The training loss increases as λ increases, due to the increased contribution of the regularization term to the objective function.
MATLAB code for computing the estimate of the population loss Lp(θ^R-ERM_D) via validation for different values of λ:
for l=1:10
lambda=lambdavec(l);
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
Xval=xval.^[0:M];
LvRERM(l)=1/N*norm(tval-Xval*thRERM)^2;
end
hold on; plot(linspace(-30,10,10),LvRERM,'k','LineWidth',2)
When λ is large, the bias caused by the regularization term dominates and the training loss is large. In contrast, when λ is small, the estimation error dominates, as for unregularized ERM, and the gap between the population and the training losses is large.
We have a classification problem with true joint distribution p(x, t) given by the table below.

x\t     0       1
0       0.1     0.2
1       0.2     0.1
2       0.1     0.3

The model class is
H = {t̂(x|θ) : t̂(0|θ) = θ1, t̂(1|θ) = t̂(2|θ) = θ2, for θ1, θ2 ∈ {0, 1}}.
Assume that we have a training data set with empirical distribution pD(x, t) given as

x\t     0       1
0       0.15    0.2
1       0.2     0.05
2       0.25    0.15
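As a sketch of how the within-class ERM predictor can be computed from this empirical table (the variable names pD, th1, th2 and thetaERM are introduced only for illustration):
pD=[0.15 0.2; 0.2 0.05; 0.25 0.15]; % empirical joint distribution: rows x=0,1,2, columns t=0,1
[~,I]=max(pD(1,:)); th1=I-1; % theta1: decision for x=0
[~,I]=max(sum(pD(2:3,:),1)); th2=I-1; % theta2: common decision for x=1 and x=2
thetaERM=[th1,th2] % ERM predictor within H under the detection-error loss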
For ERM, we can similarly decompose the training loss by using the empirical distribution pD(x, t) in place of p(x, t). So, we have the identity "population loss = minimum population loss + bias + estimation error".
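In symbols, writing θ*_H = argmin_{θ∈Θ} Lp(θ) for the population-optimal within-class model, one way to write this identity is
\[
L_p(\theta^{\mathrm{ERM}}_{\mathcal D})
= \underbrace{L_p(\hat t^{*}(\cdot))}_{\text{minimum population loss}}
+ \underbrace{L_p(\theta^{*}_{\mathcal H}) - L_p(\hat t^{*}(\cdot))}_{\text{bias}}
+ \underbrace{L_p(\theta^{\mathrm{ERM}}_{\mathcal D}) - L_p(\theta^{*}_{\mathcal H})}_{\text{estimation error}} .
\]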
Recall that Lp(θ^ERM_D) − Lp(t̂*(·)) is the optimality gap. We will now evaluate these terms in turn to verify this equality.
We have a classification problem with true joint distribution p(x, t) given by the table below.

x\t     0       1
0       0.1     0.2
1       0.2     0.1
2       0.1     0.3
Population-optimal within-class soft predictor θ*_H = argmin_{θ∈Θ} Lp(θ): with the parameterization θ1 = p(t = 1|x = 0, θ) and θ2 = p(t = 1|x = 1, θ) = p(t = 1|x = 2, θ), the population log-loss is given as
Lp(θ) = 0.1·(−log(1−θ1)) + 0.2·(−log(θ1)) + 0.3·(−log(1−θ2)) + 0.4·(−log(θ2)),
so that the two entries can be optimized separately:
[θ*_H]_1 = argmin_{θ1∈[0,1]} {0.1·(−log(1−θ1)) + 0.2·(−log(θ1))}
[θ*_H]_2 = argmin_{θ2∈[0,1]} {0.3·(−log(1−θ2)) + 0.4·(−log(θ2))}.
Setting the derivative with respect to θ1 to zero,
d/dθ1 {0.1·(−log(1−θ1)) + 0.2·(−log(θ1))} = 0.1·1/(1−θ1) − 0.2·1/θ1 = 0,
which gives the equation 0.1·θ1 − 0.2·(1−θ1) = 0 and hence [θ*_H]_1 = 2/3.
MATLAB code:
th=[0:0.01:1];
plot(th,0.1*(-log(1-th))+0.2*(-log(th)),'LineWidth',2)
xlabel('$\theta_1$','Interpreter','latex')
Following the same steps for θ2, we obtain [θ*_H]_2 = 4/7.
The corresponding optimal hard predictor under the detection-error loss is the MAP predictor, which gives
t̂(0) = argmax_t p(t|x = 0, θ*_H) = 1,
and, similarly, t̂(1) = t̂(2) = 1, since [θ*_H]_2 = 4/7 > 1/2.
The population log-loss Lp(θ*_H) is hence given as
Lp(θ*_H) = E_{(x,t)∼p(x,t)}[−log p(t|x, θ*_H)]
= 0.1·(−log(1−2/3)) + 0.2·(−log(2/3)) + 0.3·(−log(1−4/7)) + 0.4·(−log(4/7))
= 0.66.
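A one-line MATLAB check of this value (a sketch; the variable name LpH is used only for illustration):
LpH=0.1*(-log(1-2/3))+0.2*(-log(2/3))+0.3*(-log(1-4/7))+0.4*(-log(4/7)) % compare with the value 0.66 above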
In Problem 4.10, we have considered the ideal situation in which the population
distribution p(x, t) is known. In this problem, we assume the same population distribution
and the same model class H of soft predictors, and we study the learning problem.
To this end, assume that we have a training data set of N = 100 data points with
empirical distribution pD (x, t) given as
x\t     0       1
0       0.15    0.2
1       0.2     0.05
2       0.25    0.15
Obtain the maximum likelihood (ML) model θ^ML_D = argmin_{θ∈Θ} LD(θ).
Calculate the population log-loss Lp(θ^ML_D) and compare it with the population loss Lp(θ*_H) of the population-optimal within-class predictor obtained in Problem 4.10.
ML soft predictor θ^ML_D = argmin_{θ∈Θ} LD(θ): the training log-loss is given as
LD(θ) = 0.15·(−log(1−θ1)) + 0.2·(−log(θ1)) + 0.45·(−log(1−θ2)) + 0.2·(−log(θ2)),
so that
[θ^ML_D]_1 = argmin_{θ1∈[0,1]} {0.15·(−log(1−θ1)) + 0.2·(−log(θ1))}
[θ^ML_D]_2 = argmin_{θ2∈[0,1]} {0.45·(−log(1−θ2)) + 0.2·(−log(θ2))}.
Proceeding as we have done above, we get θ^ML_D = [4/7, 4/13]^T. Note that this is different from the population-optimal solution θ*_H = [2/3, 4/7]^T.
The population log-loss Lp(θ^ML_D) is hence given as
Lp(θ^ML_D) = E_{(x,t)∼p(x,t)}[−log p(t|x, θ^ML_D)]
= 0.1·(−log(1−4/7)) + 0.2·(−log(4/7)) + 0.3·(−log(1−4/13)) + 0.4·(−log(4/13))
= 0.77.
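Again, a one-line MATLAB check (a sketch; the variable name LpML is used only for illustration):
LpML=0.1*(-log(1-4/7))+0.2*(-log(4/7))+0.3*(-log(1-4/13))+0.4*(-log(4/13)) % larger than Lp(θ*_H) = 0.66, as expected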
First, let us check that the marginals are valid pdfs. To this end, we need to check the normalization condition
∫_{−∞}^{+∞} p(θi) dθi = 1.
MAP soft predictor θ^MAP_D = argmin_{θ∈Θ} {LD(θ) − (1/N) log p(θ)}: We have
LD(θ) − (1/N) log p(θ) = 0.15·(−log(1−θ1)) + 0.2·(−log(θ1)) + 0.45·(−log(1−θ2)) + 0.2·(−log(θ2)) − (1/N)(log p(θ1) + log p(θ2)).
Therefore, the MAP soft predictor can be obtained as
[θ^MAP_D]_1 = argmin_{θ1∈[0,1]} {0.15·(−log(1−θ1)) + 0.2·(−log(θ1)) − (1/N) log p(θ1)}
[θ^MAP_D]_2 = argmin_{θ2∈[0,1]} {0.45·(−log(1−θ2)) + 0.2·(−log(θ2)) − (1/N) log p(θ2)}.
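These one-dimensional problems can be solved numerically via a grid search. The sketch below is for illustration only: it assumes a Beta(2,2) prior p(θi) = 6·θi·(1−θi) for each parameter (the actual prior is the one specified in the problem) and N = 100.
N=100;
th=0.001:0.001:0.999; % grid over (0,1)
logprior=log(6*th.*(1-th)); % log p(theta_i) for the assumed Beta(2,2) prior
f1=0.15*(-log(1-th))+0.2*(-log(th))-(1/N)*logprior;
f2=0.45*(-log(1-th))+0.2*(-log(th))-(1/N)*logprior;
[~,i1]=min(f1); [~,i2]=min(f2);
thMAP=[th(i1),th(i2)] % MAP soft predictor under the assumed prior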