
Machine Learning for Engineers:
Chapter 4. Supervised Learning: Getting Started – Problems

Osvaldo Simeone

King’s College London

May 12, 2021

Osvaldo Simeone ML4Engineers 1 / 63


Problem 4.1
We have a classification problem with true joint distribution p(x, t) given by the
table below.

x\t    0     1
0      0.1   0.2
1      0.2   0.1
2      0.1   0.3

What is the optimal soft predictor?


What is the optimal hard predictor t̂*(·) under the detection-error loss, and what is
the minimum population loss Lp(t̂*(·))?
Using MATLAB, generate a training set D of N = 10 i.i.d. examples from
this joint distribution.
Compute the empirical distribution pD (x, t).
Repeat with N = 10000 samples.

Osvaldo Simeone ML4Engineers 2 / 63


Problem 4.1: Solution
The optimal soft predictor is given by the conditional distribution p(t|x):

x\t    0      1
0      0.33   0.67
1      0.67   0.33
2      0.25   0.75

The optimal hard predictor t̂*(·) under the detection-error loss is given by the
MAP predictor

t̂*(x) = arg max_t p(t|x) = arg max_t p(x, t),

i.e.,

t̂*(0) = 1, t̂*(1) = 0, t̂*(2) = 1,

and the population loss for this predictor is

Lp(t̂*(·)) = p(x = 0, t = 0) + p(x = 1, t = 1) + p(x = 2, t = 0) = 0.3.
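For a quick numerical check, the following sketch (our own code, not from the book) recovers the MAP predictor and its population loss directly from the joint table:

p=[0.1,0.2; 0.2,0.1; 0.1,0.3]; %joint p(x,t): rows x=0,1,2, columns t=0,1
[~,I]=max(p,[],2);             %MAP prediction: column index of the row maximum
tMAP=I-1                       %optimal hard predictions for x=0,1,2; expected [1;0;1]
Lp=0;
for i=1:3
    Lp=Lp+p(i,3-I(i));         %add the probability of the non-predicted label
end
Lp                             %minimum population loss; expected 0.3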

Osvaldo Simeone ML4Engineers 3 / 63


Problem 4.1: Solution
To generate data, we can use ancestral sampling, i.e., we sequentially sample
x ∼ p(x) and t ∼ p(t|x). To this end, we compute the marginal
p(x = 0) = p(x = 1) = 0.3 and p(x = 2) = 0.4.
MATLAB code to generate samples (x, t):
clear; N=10;
q=[0.3,0.3,0.4];
xoh=mnrnd(1,q,N); %generate x as one-hot vectors
x=xoh*[0,1,2]'; %convert from one-hot vector to scalar representation
t=zeros(N,1);
for n=1:N %generate t given x
if (x(n)==0)
t(n)=binornd(1,0.67);
elseif (x(n)==1)
t(n)=binornd(1,0.33);
elseif (x(n)==2)
t(n)=binornd(1,0.75);
end
end

Osvaldo Simeone ML4Engineers 4 / 63


Problem 4.1: Solution
MATLAB code to compute the empirical distribution:

pest=zeros(3,2);
for n=1:N
if (x(n)==0)&&(t(n)==0)
pest(1,1)=pest(1,1)+1/N;
elseif (x(n)==0)&&(t(n)==1)
pest(1,2)=pest(1,2)+1/N;
elseif (x(n)==1)&&(t(n)==0)
pest(2,1)=pest(2,1)+1/N;
elseif (x(n)==1)&&(t(n)==1)
pest(2,2)=pest(2,2)+1/N;
elseif (x(n)==2)&&(t(n)==0)
pest(3,1)=pest(3,1)+1/N;
elseif (x(n)==2)&&(t(n)==1)
pest(3,2)=pest(3,2)+1/N;
end
end
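As an aside, the same empirical distribution can be obtained without the if-chain; a one-line vectorized alternative (our suggestion, assuming x and t are the N × 1 vectors generated above):

pest=accumarray([x+1,t+1],1/N,[3,2]) %entry (i,j) is the fraction of samples with x=i-1 and t=j-1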

Osvaldo Simeone ML4Engineers 5 / 63


Problem 4.1: Solution

Repeating with N=10000:


N=10000;
q=[0.3,0.3,0.4];
xoh=mnrnd(1,q,N); %generate x as one-hot vectors
x=xoh*[0,1,2]'; %convert from one-hot vector to scalar representation
t=zeros(N,1);
for n=1:N %generate t given x
if (x(n)==0)
t(n)=binornd(1,0.67);
elseif (x(n)==1)
t(n)=binornd(1,0.33);
elseif (x(n)==2)
t(n)=binornd(1,0.75);
end
end

Osvaldo Simeone ML4Engineers 6 / 63


Problem 4.1: Solution

pest=zeros(3,2);
for n=1:N
if (x(n)==0)&&(t(n)==0)
pest(1,1)=pest(1,1)+1/N;
elseif (x(n)==0)&&(t(n)==1)
pest(1,2)=pest(1,2)+1/N;
elseif (x(n)==1)&&(t(n)==0)
pest(2,1)=pest(2,1)+1/N;
elseif (x(n)==1)&&(t(n)==1)
pest(2,2)=pest(2,2)+1/N;
elseif (x(n)==2)&&(t(n)==0)
pest(3,1)=pest(3,1)+1/N;
elseif (x(n)==2)&&(t(n)==1)
pest(3,2)=pest(3,2)+1/N;
end
end

Osvaldo Simeone ML4Engineers 7 / 63


Problem 4.2

For the same joint distribution as in the previous problem, generate again a
data set D of N = 10 data points.
For this data set, assuming the detection-error loss, compute the training
loss LD(t̂(·)) for the predictor t̂(0) = 0, t̂(1) = 1, and t̂(2) = 0.
Identify the predictor t̂D (·) that minimizes the training loss, i.e., the
empirical risk minimization (ERM) predictor, assuming a model class H with
t̂(0|θ) = θ1 , t̂(1|θ) = θ2 and t̂(2|θ) = θ3 for θ1 , θ2 , θ3 ∈ {0, 1}, and compare
its training loss with the predictor considered at the previous point.
Compute the population loss of the ERM predictor.
Using your previous results, calculate the optimality gap
Lp (t̂D (·)) − Lp (t̂ ∗ (·)).
Repeat the points above with N = 10000 and comment on your result.

Osvaldo Simeone ML4Engineers 8 / 63


Problem 4.2: Solution
Generating N = 10 samples using MATLAB:
clear; N=10;
q=[0.3,0.3,0.4];
xoh=mnrnd(1,q,N); %generate x as one-hot vectors
x=xoh*[0,1,2]'; %convert from one-hot vector to scalar representation
t=zeros(N,1);
for n=1:N %generate t given x
if (x(n)==0)
t(n)=binornd(1,0.67);
elseif (x(n)==1)
t(n)=binornd(1,0.33);
elseif (x(n)==2)
t(n)=binornd(1,0.75);
end
end
The training loss LD(t̂(·)) for the predictor t̂(0) = 0, t̂(1) = 1, and t̂(2) = 0 depends on
the empirical distribution pD(x, t), and is given as

LD(t̂(·)) = pD(x = 0, t = 1) + pD(x = 1, t = 0) + pD(x = 2, t = 1).

Osvaldo Simeone ML4Engineers 9 / 63


Problem 4.2: Solution

MATLAB code to estimate pD (x, t):


pest=zeros(3,2);
for n=1:N
if (x(n)==0)&&(t(n)==0)
pest(1,1)=pest(1,1)+1/N;
elseif (x(n)==0)&&(t(n)==1)
pest(1,2)=pest(1,2)+1/N;
elseif (x(n)==1)&&(t(n)==0)
pest(2,1)=pest(2,1)+1/N;
elseif (x(n)==1)&&(t(n)==1)
pest(2,2)=pest(2,2)+1/N;
elseif (x(n)==2)&&(t(n)==0)
pest(3,1)=pest(3,1)+1/N;
elseif (x(n)==2)&&(t(n)==1)
pest(3,2)=pest(3,2)+1/N;
end
end
LD=pest(1,2)+pest(2,1)+pest(3,2)

Osvaldo Simeone ML4Engineers 10 / 63


Problem 4.2: Solution

The ERM predictor is random since it depends on pD(x, t). In this problem, we
have θ = [t̂(0), t̂(1), t̂(2)]^T, and hence we impose no constraints on the predictor.
Under the detection-error loss, with discrete variables, ERM is given by the MAP
predictor applied to pD(x, t), i.e.,

t̂_D^ERM(x) = arg max_t pD(t|x) = arg max_t pD(x, t).

MATLAB code:
tERM=zeros(3,1);
for i=1:3
[M,I]=max(pest(i,:));
tERM(i)=I-1;
end

Osvaldo Simeone ML4Engineers 11 / 63


Problem 4.2: Solution

The training loss LD(t̂_D^ERM(·)) for the ERM predictor is also random and, for the
detection-error loss, is given as

LD(t̂_D^ERM(·)) = pD(x = 0, t ≠ t̂_D^ERM(0)) + pD(x = 1, t ≠ t̂_D^ERM(1)) + pD(x = 2, t ≠ t̂_D^ERM(2)).

MATLAB code:
LDERM=pest(1,~tERM(1)+1)+pest(2,~tERM(2)+1)+pest(3,~tERM(3)+1) %compare with LD!

Osvaldo Simeone ML4Engineers 12 / 63


Problem 4.2: Solution

Similarly, the population loss Lp(t̂_D^ERM(·)) for the ERM predictor is random and is
generally given as

Lp(t̂_D^ERM(·)) = p(x = 0, t ≠ t̂_D^ERM(0)) + p(x = 1, t ≠ t̂_D^ERM(1)) + p(x = 2, t ≠ t̂_D^ERM(2)).

MATLAB code:
p(1,1)=0.1; p(1,2)=0.2; p(2,1)=0.2; p(2,2)=0.1; p(3,1)=0.1; p(3,2)=0.3;
LpERM=p(1,~tERM(1)+1)+p(2,~tERM(2)+1)+p(3,~tERM(3)+1) %compare with Lp(t̂*(·)) = 0.3!

Osvaldo Simeone ML4Engineers 13 / 63


Problem 4.2: Solution
MATLAB code to repeat the point above by generating a data set with
N = 10000 samples.
N=10000;
q=[0.3,0.3,0.4];
xoh=mnrnd(1,q,N); %generate x as one-hot vectors
x=xoh*[0,1,2]'; %convert from one-hot vector to scalar representation
t=zeros(N,1); %preallocate the label vector
pest=zeros(3,2); %initialization of the empirical estimate
for n=1:N
if (x(n)==0)
t(n)=binornd(1,0.67);
pest(1,t(n)+1)=pest(1,t(n)+1)+1/N;
elseif (x(n)==1)
t(n)=binornd(1,0.33);
pest(2,t(n)+1)=pest(2,t(n)+1)+1/N;
elseif (x(n)==2)
t(n)=binornd(1,0.75);
pest(3,t(n)+1)=pest(3,t(n)+1)+1/N;
end
end

Osvaldo Simeone ML4Engineers 14 / 63


Problem 4.2: Solution

for i=1:3
[M,I]=max(pest(i,:));
tERM(i)=I-1;
end %compare with population-optimal predictor
p(1,1)=0.1; p(1,2)=0.2; p(2,1)=0.2; p(2,2)=0.1; p(3,1)=0.1; p(3,2)=0.3;
LpERM=p(1,~tERM(1)+1)+p(2,~tERM(2)+1)+p(3,~tERM(3)+1) %compare with Lp(t̂*(·)) = 0.3!

Osvaldo Simeone ML4Engineers 15 / 63


Problem 4.2: Solution

The improved results in terms of the optimality gap Lp(t̂_D(·)) − Lp(t̂*(·)) for
N = 10000 are a consequence of the law of large numbers:
- The law of large numbers implies that the training loss

  LD(θ) = (1/N) Σ_{n=1}^N ℓ(t_n, t̂(x_n|θ))

  tends to the population loss Lp(θ) as N → ∞ with high probability for
  any fixed θ.
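To see this effect empirically, one could plot the training loss of a fixed predictor against N and compare it with its population loss; a minimal sketch along these lines (our own code, not from the book; it uses the fixed predictor t̂(0) = 1, t̂(1) = 0, t̂(2) = 1 and the joint table above):

p=[0.1,0.2; 0.2,0.1; 0.1,0.3];        %joint p(x,t): rows x=0,1,2, columns t=0,1
that=[1,0,1];                          %fixed hard predictor: that(i+1) is the prediction for x=i
Lpfixed=p(1,1)+p(2,2)+p(3,1);          %its population loss (=0.3)
Nvec=round(logspace(1,4,10));          %training-set sizes from 10 to 10000
LDvec=zeros(size(Nvec));
for k=1:length(Nvec)
    N=Nvec(k);
    xoh=mnrnd(1,[0.3,0.3,0.4],N);      %sample x from its marginal (one-hot)
    x=xoh*[0,1,2]';
    t=binornd(1,xoh*[0.67;0.33;0.75]); %sample t given x (ancestral sampling)
    pred=that(x+1);                    %predictions on the training inputs
    LDvec(k)=mean(t~=pred(:));         %empirical detection-error loss
end
semilogx(Nvec,LDvec,'b',Nvec,Lpfixed*ones(size(Nvec)),'r--','LineWidth',2)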

Osvaldo Simeone ML4Engineers 16 / 63


Problem 4.3

Show that the training loss for Example 2 in the text can be written as

LD(θ) = (1/N) ||t_D − X_D θ||².

Osvaldo Simeone ML4Engineers 17 / 63


Problem 4.3: Solution

In the polynomial regression problem we have the training loss

LD(θ) = (1/N) Σ_{n=1}^N (t_n − t̂(x_n|θ))²
      = (1/N) Σ_{n=1}^N (t_n − θ^T u(x_n))².

Recall that the squared ℓ2 norm of an N × 1 vector z is given as ||z||² = Σ_{n=1}^N z_n².
So, we have the equality

(1/N) ||t_D − X_D θ||² = (1/N) Σ_{n=1}^N ([t_D]_n − [X_D θ]_n)²
                       = (1/N) Σ_{n=1}^N (t_n − u(x_n)^T θ)²
                       = (1/N) Σ_{n=1}^N (t_n − θ^T u(x_n))² = LD(θ),

since the nth entry of t_D is t_n and the nth entry of X_D θ is u(x_n)^T θ.
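This identity is easy to check numerically; a minimal sketch (our own code, using arbitrary random data and a degree-3 polynomial feature map as an example):

N=20; M=3;                       %example sizes (arbitrary)
x=randn(N,1); t=randn(N,1);      %arbitrary inputs and targets
theta=randn(M+1,1);              %arbitrary model parameter vector
X=x.^(0:M);                      %data matrix with rows u(x_n)^T = [1, x_n, ..., x_n^M]
loss_sum=mean((t-X*theta).^2);   %(1/N) * sum_n (t_n - theta^T u(x_n))^2
loss_norm=1/N*norm(t-X*theta)^2; %(1/N) * ||t_D - X_D*theta||^2
disp(loss_sum-loss_norm)         %zero up to floating-point error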

Osvaldo Simeone ML4Engineers 18 / 63


Problem 4.4

Show that the condition (1/N) X_D^T X_D = I imposes that, on average over the data set,
all the features u_d(x), d = 1, ..., D, have the same expected energy and they are
uncorrelated.

Osvaldo Simeone ML4Engineers 19 / 63


Problem 4.4: Solution

The matrix (1/N) X_D^T X_D is of dimension D × D, and its (d, d)th entry is given as

[(1/N) X_D^T X_D]_{d,d} = (1/N) u_d^T u_d = (1/N) ||u_d||²,

where u_d = [u_d(x_1), u_d(x_2), ..., u_d(x_N)]^T is the dth column of matrix X_D.
Therefore, the condition [(1/N) X_D^T X_D]_{d,d} = 1 is equivalent to

(1/N) Σ_{n=1}^N u_d(x_n)² = 1,

which indicates that the energy u_d(x)² of feature d, averaged over a uniform
selection of samples from the training set, is equal to one (and hence is the same
for all features d).

Osvaldo Simeone ML4Engineers 20 / 63


Problem 4.4: Solution

Furthermore, the condition [(1/N) X_D^T X_D]_{d,d'} = 0 for d ≠ d' is equivalent to

(1/N) [X_D^T X_D]_{d,d'} = (1/N) u_d^T u_{d'} = 0,

which indicates that the correlation – or inner product – between features d and
d', evaluated over a uniform selection of samples from the training set, is equal to
zero.
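As an illustration (our own sketch, not from the book's code), one can orthonormalize an arbitrary feature matrix and check that (1/N) X^T X = I then holds, i.e., that every column has unit average energy and distinct columns are uncorrelated:

N=50; D=4;
U=randn(N,D);            %arbitrary raw features
[Q,~]=qr(U,0);           %orthonormal basis for the columns of U (Q'*Q = I)
X=sqrt(N)*Q;             %rescale so that (1/N)*X'*X = I
disp(1/N*(X'*X))         %approximately the identity matrix
disp(mean(X.^2,1))       %average energy of each feature, approximately 1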

Osvaldo Simeone ML4Engineers 21 / 63


Problem 4.5

Write a function thERM = LSsolver(X, t) that takes as input an N × D data
matrix X and an N × 1 desired output vector t, and outputs the least squares (LS)
solution thERM.
Load the training data D from the file sineregrdataset on the book's webpage.
The variables x and t contain N = 20 training inputs and outputs, respectively.
Plot the values of t against the values of x in the training set.
We wish to implement regression with feature vector u(x) = [1, x, x², ..., x^M]^T.
With M = 5, build the N × (M + 1) data matrix X_D and assign it to a matrix X.
Use the function LSsolver you have designed above to obtain the ERM solution θ_D^ERM.
Plot the hard predictor t̂(x|θ_D^ERM) in the same figure containing the training set.
Evaluate the training loss for the trained predictor.
Repeat the steps above with M = 3 and M = 10 and compare the resulting
training errors.

Osvaldo Simeone ML4Engineers 22 / 63


Problem 4.5: Solution

MATLAB code:
function thERM=LSsolver(X,t)
%LSsolver: returns the LS solution thERM = argmin_theta ||t - X*theta||^2
thERM=inv(X'*X)*X'*t; %equivalently (and more stably): thERM=X\t;
end

Osvaldo Simeone ML4Engineers 23 / 63


Problem 4.5: Solution

MATLAB code to plot the values of t against the values of x in the training set:
plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')

MATLAB code to build the 20 × 6 data matrix XD :


N=20;
for n=1:N
X(n,:)=[1,x(n),x(n)^2,x(n)^3,x(n)^4,x(n)^5];
end

Osvaldo Simeone ML4Engineers 24 / 63


Problem 4.5: Solution

MATLAB code that uses the function LSsolver and plots the hard predictor t̂(x|θ_D^ERM):
thERM=LSsolver(X,t);
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[1,xaxis(l),xaxis(l)^2,xaxis(l)^3,xaxis(l)^4,xaxis(l)^5]';
tl(l)=thERM'*ul;
end
hold on; plot(xaxis,tl,'b','LineWidth',2)

Osvaldo Simeone ML4Engineers 25 / 63


Problem 4.5: Solution

MATLAB code that evaluates the training loss


LDERM=1/N*norm(t-X*thERM)^2

Repeat for M = 3 (and similarly for M = 10):


X=zeros(N,4);
for n=1:N
X(n,:)=[1,x(n),x(n)^2,x(n)^3];
end
thERM=LSsolver(X,t);
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[1,xaxis(l),xaxis(l)^2,xaxis(l)^3]';
tl(l)=thERM'*ul;
end
hold on; plot(xaxis,tl,'g','LineWidth',2)

Osvaldo Simeone ML4Engineers 26 / 63


Problem 4.5: Solution

And for M = 10:


X=zeros(N,11);
for n=1:N
X(n,:)=[1,x(n).^[1:10]];
end
thERM=LSsolver(X,t);
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[1,xaxis(l).^[1:10]]';
tl(l)=thERM'*ul;
end
hold on; plot(xaxis,tl,'k','LineWidth',2)

Osvaldo Simeone ML4Engineers 27 / 63


Problem 4.6

Continuing the previous problem, compute the training loss LD(θ_M^ERM)
for values of the model capacity M between 1 and 6, and plot the
training loss versus M in this range.
Explain the result in terms of the capacity of the model to fit the
training data.
Estimate the population loss using validation by computing the
empirical loss on the held-out data xval and tval, which you can
find in your workspace. Plot the validation error on the same figure.
Which value of M would you choose? Explain by using the concepts
of bias and estimation error.

Osvaldo Simeone ML4Engineers 28 / 63


Problem 4.6: Solution

MATLAB code for computing the training loss LD(θ_M^ERM) for different values of M:
N=20;
for M=1:6
X=zeros(N,M+1);
for n=1:N
X(n,:)=[1,x(n).^[1:M]];
end
thERM=LSsolver(X,t);
LDERM(M)=1/N*norm(t-X*thERM)^2;
end
plot([1:6],LDERM,'r--','LineWidth',2)
xlabel('$M$','Interpreter','latex')
ylabel('quadratic loss','Interpreter','latex')

The training loss decreases with M, since an increased model capacity allows an
increasingly accurate fit of the training data.

Osvaldo Simeone ML4Engineers 29 / 63


Problem 4.6: Solution

MATLAB code for computing the estimate of the population loss Lp(θ_M^ERM) via
validation for different values of M:
for M=1:6
X=zeros(N,M+1);
for n=1:N
X(n,:)=[1,x(n).^[1:M]];
end
thERM=LSsolver(X,t);
Xval=xval.^[0:M];
LvERM(M)=1/N*norm(tval-Xval*thERM)^2;
end
hold on; plot([1:6],LvERM,'k','LineWidth',2)

When M is small, the bias dominates and the training loss is relatively large. In
contrast, when M is large, the estimation error dominates and the gap between the
population and the training losses is large.

Osvaldo Simeone ML4Engineers 30 / 63


Problem 4.7

In this problem, we continue Problem 4.6 by evaluating the impact of regularization.
For M = 6, set λ = exp(−10) and compute the regularized ERM solution θ_D^R-ERM.
Plot the hard predictor t̂(x|θ_D^R-ERM) in the same figure containing the training set.
Note that, when λ = 0, we obtain the ERM solution.
Repeat for λ = exp(−20).
Evaluate the training loss for the trained predictor in both cases and comment on
the comparison.

Osvaldo Simeone ML4Engineers 31 / 63


Problem 4.7: Solution

MATLAB code:
M=6;
lambda=exp(-10);
N=20;
for n=1:N
X(n,:)=[x(n).^[0:M]]';
end
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[xaxis(l).^[0:M]]';
tl(l)=thRERM'*ul;
end
plot(xaxis,tl,'b','LineWidth',2);
hold on; plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')
LDRERM=1/N*norm(t-X*thRERM)^2

Osvaldo Simeone ML4Engineers 32 / 63


Problem 4.7: Solution

MATLAB code:
M=6;
lambda=exp(-20);
N=20;
for n=1:N
X(n,:)=[x(n).^[0:M]]';
end
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[xaxis(l).^[0:M]]';
tl(l)=thRERM'*ul;
end
plot(xaxis,tl,'b','LineWidth',2);
hold on; plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')
LDRERM=1/N*norm(t-X*thRERM)^2

Osvaldo Simeone ML4Engineers 33 / 63


Problem 4.7: Solution

The training loss is smaller for a smaller value of λ since, in this case,
the optimization criterion is dominated by the training loss and the
regularization term is less relevant.

Osvaldo Simeone ML4Engineers 34 / 63


Problem 4.8

Continuing the previous problem, for M = 6, consider ten possible values for λ,
namely λ = exp(v ) with v = log(λ) taking one of ten equally spaced values
between -30 and 10.
For each value of λ, compute the training loss LD(θ_D^R-ERM).
Plot it as a function of log(λ). Comment on your results.
Evaluate the estimate of the population loss on the held-out set variables xval and
tval, which you can find in your workspace. Plot the validation-based estimate of
the population loss in the same figure. Which value of λ would you choose?

Osvaldo Simeone ML4Engineers 35 / 63


Problem 4.8: Solution
MATLAB code for computing the training loss LD(θ_D^R-ERM) for different values of λ:
M=6;
N=20;
lambdavec=exp(linspace(-30,10,10));
X=zeros(N,M+1);
for n=1:N
X(n,:)=[1,x(n).^[1:M]];
end
for l=1:10
lambda=lambdavec(l);
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
LDRERM(l)=1/N*norm(t-X*thRERM)^2;
end
plot(linspace(-30,10,10),LDRERM,'r--','LineWidth',2)
xlabel('$\log(\lambda)$','Interpreter','latex')
ylabel('quadratic loss','Interpreter','latex')

The training loss increases as λ increases due to the increased contribution to the
objective function of the regularization term.

Osvaldo Simeone ML4Engineers 36 / 63


Problem 4.8: Solution

MATLAB code for computing the estimate of the population loss Lp(θ_D^R-ERM) via
validation for different values of λ:
for l=1:10
lambda=lambdavec(l);
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
Xval=xval.^[0:M];
LvRERM(l)=1/N*norm(tval-Xval*thRERM)^2;
end
hold on; plot(linspace(-30,10,10),LvRERM,'k','LineWidth',2)

When λ is large, the bias caused by the regularization term dominates and the
training loss is large. In contrast, when λ is small, the estimation error dominates,
as it does for ERM, and the gap between the population and the training losses is large.

Osvaldo Simeone ML4Engineers 37 / 63


Problem 4.9

We have a classification problem with true joint distribution p(x, t) given by the
table below.

x\t    0     1
0      0.1   0.2
1      0.2   0.1
2      0.1   0.3

We consider the class of predictors

H = {t̂(x|θ) : t̂(0|θ) = θ1,
     t̂(1|θ) = t̂(2|θ) = θ2, for θ1, θ2 ∈ {0, 1}}.

Note that this class of predictors is constrained.

Osvaldo Simeone ML4Engineers 38 / 63


Problem 4.9

Assume that we have a training data set with empirical distribution pD (x, t) given as

x\t    0      1
0      0.15   0.2
1      0.2    0.05
2      0.25   0.15

Osvaldo Simeone ML4Engineers 39 / 63


Problem 4.9

Under the detection-error loss, evaluate
- the population-optimal unconstrained predictor t̂*(·) = argmin_{t̂(·)} Lp(t̂(·));
- the population-optimal within-class model parameter θ_H^* = argmin_{θ∈Θ} Lp(θ);
- and the ERM model θ_D^ERM ∈ Θ.
Using the results above, decompose the population loss of the ERM predictor in
terms of the minimum unconstrained population loss, bias, and estimation error.

Osvaldo Simeone ML4Engineers 40 / 63


Problem 4.9: Solution

Population-optimal unconstrained predictor:

t̂*(0) = 1, t̂*(1) = 0, t̂*(2) = 1,

and Lp(t̂*(·)) = 0.1 + 0.1 + 0.1 = 0.3.

Population-optimal within-class model parameter vector: The population loss can
be written as

Lp(θ) = 0.1 · 1(θ1 = 1) + 0.2 · 1(θ1 = 0)
      + (0.2 + 0.1) · 1(θ2 = 1) + (0.1 + 0.3) · 1(θ2 = 0).

Therefore, we have θ_H^* = [1, 1]^T.

Osvaldo Simeone ML4Engineers 41 / 63


Problem 4.9: Solution

Note that this leads to the predictor

t̂(0|θ_H^*) = [θ_H^*]_1 = 1, and t̂(1|θ_H^*) = t̂(2|θ_H^*) = [θ_H^*]_2 = 1,

which is different from t̂*(x).

Osvaldo Simeone ML4Engineers 42 / 63


Problem 4.9: Solution

For ERM, we can similarly decompose the training loss using the empirical
distribution as

LD(θ) = 0.15 · 1(θ1 = 1) + 0.2 · 1(θ1 = 0)
      + (0.2 + 0.25) · 1(θ2 = 1) + (0.05 + 0.15) · 1(θ2 = 0).

Therefore, we have θ_D^ERM = [1, 0]^T.
Note that this leads to the predictor

t̂(0|θ_D^ERM) = [θ_D^ERM]_1 = 1, and t̂(1|θ_D^ERM) = t̂(2|θ_D^ERM) = [θ_D^ERM]_2 = 0,

which is different from both t̂*(x) and t̂(x|θ_H^*).

Osvaldo Simeone ML4Engineers 43 / 63


Problem 4.9: Solution

The population loss of the ERM solution can be decomposed as

Lp(θ_D^ERM) = Lp(t̂*(·))                        [minimum unconstrained population loss]
            + (Lp(θ_H^*) − Lp(t̂*(·)))           [bias]
            + (Lp(θ_D^ERM) − Lp(θ_H^*))          [estimation error].

So, we have the identity “population loss = minimum population loss + bias
+ estimation error”.
Recall that Lp(θ_D^ERM) − Lp(t̂*(·)) is the optimality error.
We will now evaluate these terms in turn to verify this equality.

Osvaldo Simeone ML4Engineers 44 / 63


Problem 4.9: Solution

First, let us compute the population loss of ERM:

Lp(θ_D^ERM) = 0.1 · 1([θ_D^ERM]_1 = 1) + 0.2 · 1([θ_D^ERM]_1 = 0)
            + (0.2 + 0.1) · 1([θ_D^ERM]_2 = 1) + (0.1 + 0.3) · 1([θ_D^ERM]_2 = 0)
            = 0.1 + 0.4 = 0.5.

The minimum unconstrained population loss is instead

Lp(t̂*(·)) = 0.3,

so the optimality error is Lp(θ_D^ERM) − Lp(t̂*(·)) = 0.2.

Osvaldo Simeone ML4Engineers 45 / 63


Problem 4.9: Solution

The population loss of the population-optimal within-class predictor is

Lp(θ_H^*) = 0.1 + 0.3 = 0.4,

and hence the bias is given as Lp(θ_H^*) − Lp(t̂*(·)) = 0.1.
Finally, the estimation error is Lp(θ_D^ERM) − Lp(θ_H^*) = 0.5 − 0.4 = 0.1.
So we have: population loss (0.5) = minimum population loss (0.3) + bias (0.1)
+ estimation error (0.1).
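These numbers can also be verified by brute force; a short sketch (our own code) that evaluates the population and training losses of all four parameter vectors in {0, 1}²:

p=[0.1,0.2; 0.2,0.1; 0.1,0.3];      %population distribution p(x,t)
pD=[0.15,0.2; 0.2,0.05; 0.25,0.15]; %empirical distribution pD(x,t)
Lptab=zeros(2,2); LDtab=zeros(2,2);
for th1=0:1
    for th2=0:1
        pred=[th1,th2,th2];          %predictor: t-hat(0)=th1, t-hat(1)=t-hat(2)=th2
        for i=1:3
            Lptab(th1+1,th2+1)=Lptab(th1+1,th2+1)+p(i,2-pred(i));  %prob. of error for x=i-1
            LDtab(th1+1,th2+1)=LDtab(th1+1,th2+1)+pD(i,2-pred(i));
        end
    end
end
Lptab %population losses; the minimum over the table is Lp(thetaH*)=0.4, at theta=[1,1]
LDtab %training losses; ERM picks their minimizer theta=[1,0], whose population loss is Lptab(2,1)=0.5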

Osvaldo Simeone ML4Engineers 46 / 63


Problem 4.10

We have a classification problem with true joint distribution p(x, t) given by the table
below.

x\t    0     1
0      0.1   0.2
1      0.2   0.1
2      0.1   0.3

We consider the class of soft predictors

H = {p(t|x, θ) : p(t = 1|x = 0, θ) = θ1 ∈ [0, 1],
     p(t = 1|x = i, θ) = θ2 ∈ [0, 1] for i = 1, 2, with θ = (θ1, θ2)}.

Osvaldo Simeone ML4Engineers 47 / 63


Problem 4.10

Evaluate the population-optimal within-class soft predictor

θ_H^* = argmin_{θ∈Θ} Lp(θ)

under the log-loss, and calculate the resulting population log-loss Lp(θ_H^*).
Write the population log-loss Lp(θ_H^*) in terms of the cross entropy
between the population distribution and the soft predictor.
Given the within-class soft predictor p(t|x, θ_H^*), obtain the optimal
hard predictor under the detection-error loss.

Osvaldo Simeone ML4Engineers 48 / 63


Problem 4.10: Solution


Population-optimal within-class soft predictor θ_H^* = argmin_{θ∈Θ} Lp(θ):
The population log-loss is given as

Lp(θ) = E_{(x,t)∼p(x,t)}[− log p(t|x, θ)]
      = p(0, 0) · (− log p(0|0, θ)) + p(0, 1) · (− log p(1|0, θ))
      + p(1, 0) · (− log p(0|1, θ)) + p(1, 1) · (− log p(1|1, θ))
      + p(2, 0) · (− log p(0|2, θ)) + p(2, 1) · (− log p(1|2, θ))
      = 0.1 · (− log(1 − θ1)) + 0.2 · (− log(θ1))
      + 0.2 · (− log(1 − θ2)) + 0.1 · (− log(θ2))
      + 0.1 · (− log(1 − θ2)) + 0.3 · (− log(θ2))
      = 0.1 · (− log(1 − θ1)) + 0.2 · (− log(θ1))
      + 0.3 · (− log(1 − θ2)) + 0.4 · (− log(θ2)).

Osvaldo Simeone ML4Engineers 49 / 63


Problem 4.10: Solution

Therefore, the population-optimal soft predictor can be obtained as

[θ_H^*]_1 = arg min_{θ1∈[0,1]} {0.1 · (− log(1 − θ1)) + 0.2 · (− log(θ1))}
[θ_H^*]_2 = arg min_{θ2∈[0,1]} {0.3 · (− log(1 − θ2)) + 0.4 · (− log(θ2))}.

To minimize these functions, we can either use a numerical tool, e.g.,
plotting the function and finding the optimal value “by hand”, or use
first-order optimality conditions (see the next chapter):

d/dθ1 {0.1 · (− log(1 − θ1)) + 0.2 · (− log(θ1))} = 0.1 · (1/(1 − θ1)) − 0.2 · (1/θ1) = 0,

which yields the equation 0.1 · θ1 − 0.2 · (1 − θ1) = 0, and hence [θ_H^*]_1 = 2/3.

Osvaldo Simeone ML4Engineers 50 / 63


Problem 4.10: Solution

MATLAB code:
th=[0:0.01:1];
plot(th,0.1*(-log(1-th))+0.2*(-log(th)),'LineWidth',2)
xlabel('$\theta_1$','Interpreter','latex')

We can follow the same steps for θ2, obtaining [θ_H^*]_2 = 4/7.
The corresponding optimal hard predictor under the detection-error loss is
the MAP predictor

t̂(0) = argmax_t p(t|x = 0, θ_H^*) = 1,

and similarly t̂(1) = t̂(2) = 1.
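A quick grid search confirms both values; a sketch (our own code; the grid spacing of 0.001 is arbitrary, and the endpoints are excluded since the log-loss diverges there):

th=[0.001:0.001:0.999];
[~,i1]=min(0.1*(-log(1-th))+0.2*(-log(th)));  %objective for theta1
[~,i2]=min(0.3*(-log(1-th))+0.4*(-log(th)));  %objective for theta2
[th(i1), th(i2)]                              %approximately [2/3, 4/7] = [0.667, 0.571]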

Osvaldo Simeone ML4Engineers 51 / 63


Problem 4.10: Solution

The population log-loss Lp(θ_H^*) is hence given as

Lp(θ_H^*) = E_{(x,t)∼p(x,t)}[− log p(t|x, θ_H^*)]
          = 0.1 · (− log(1 − 2/3)) + 0.2 · (− log(2/3))
          + 0.3 · (− log(1 − 4/7)) + 0.4 · (− log(4/7))
          ≈ 0.67.

Osvaldo Simeone ML4Engineers 52 / 63


Problem 4.10: Solution
The population log-loss Lp(θ) for any model parameter vector θ can
be expressed as

Lp(θ) = E_{(x,t)∼p(x,t)}[− log p(t|x, θ)]
      = E_{x∼p(x)}[H(p(t|x) || p(t|x, θ))],

where p(t|x) is the true posterior distribution and p(t|x, θ) is the soft predictor.
Therefore, we can compute the cross entropies

H(p(t|x = 0) || p(t|x = 0, θ_H^*)) = −1/3 · log(1/3) − 2/3 · log(2/3) ≈ 0.64
H(p(t|x = 1) || p(t|x = 1, θ_H^*)) = −2/3 · log(3/7) − 1/3 · log(4/7) ≈ 0.75
H(p(t|x = 2) || p(t|x = 2, θ_H^*)) = −1/4 · log(3/7) − 3/4 · log(4/7) ≈ 0.63,

and the population log-loss is given as

Lp(θ_H^*) = 0.3 · 0.64 + 0.3 · 0.75 + 0.4 · 0.63 ≈ 0.67.
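The same numbers can be reproduced in a few lines of MATLAB; a sketch (our own code; natural logarithms throughout):

px=[0.3,0.3,0.4];                  %marginal p(x)
ptx=[1/3,2/3; 2/3,1/3; 1/4,3/4];   %true posterior p(t|x): rows x=0,1,2, columns t=0,1
qpred=[1/3,2/3; 3/7,4/7; 3/7,4/7]; %soft predictor p(t|x,thetaH*) with thetaH*=[2/3,4/7]
H=-sum(ptx.*log(qpred),2)          %cross entropies for x=0,1,2 (approx. 0.64, 0.75, 0.63)
Lp=px*H                            %population log-loss, approx. 0.67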

Osvaldo Simeone ML4Engineers 53 / 63


Problem 4.11

In Problem 4.10, we have considered the ideal situation in which the population
distribution p(x, t) is known. In this problem, we assume the same population distribution
and the same model class H of soft predictors, and we study the learning problem.
To this end, assume that we have a training data set of N = 100 data points with
empirical distribution pD (x, t) given as

x\t    0      1
0      0.15   0.2
1      0.2    0.05
2      0.25   0.15

Osvaldo Simeone ML4Engineers 54 / 63


Problem 4.11

Obtain the maximum likelihood (ML) model θ_D^ML = argmin_{θ∈Θ} LD(θ).
Calculate the population log-loss Lp(θ_D^ML) and compare it with the
population loss Lp(θ_H^*) of the population-optimal within-class predictor
obtained in Problem 4.10.

Osvaldo Simeone ML4Engineers 55 / 63


Problem 4.11: Solution

ML soft predictor θ_D^ML = argmin_{θ∈Θ} LD(θ): The training log-loss is given as

LD(θ) = E_{(x,t)∼pD(x,t)}[− log p(t|x, θ)]
      = pD(0, 0) · (− log p(0|0, θ)) + pD(0, 1) · (− log p(1|0, θ))
      + pD(1, 0) · (− log p(0|1, θ)) + pD(1, 1) · (− log p(1|1, θ))
      + pD(2, 0) · (− log p(0|2, θ)) + pD(2, 1) · (− log p(1|2, θ))
      = 0.15 · (− log(1 − θ1)) + 0.2 · (− log(θ1))
      + 0.2 · (− log(1 − θ2)) + 0.05 · (− log(θ2))
      + 0.25 · (− log(1 − θ2)) + 0.15 · (− log(θ2))
      = 0.15 · (− log(1 − θ1)) + 0.2 · (− log(θ1))
      + 0.45 · (− log(1 − θ2)) + 0.2 · (− log(θ2)).

Osvaldo Simeone ML4Engineers 56 / 63


Problem 4.11: Solution

Therefore, the ML soft predictor parameters can be obtained as

[θ_D^ML]_1 = arg min_{θ1∈[0,1]} {0.15 · (− log(1 − θ1)) + 0.2 · (− log(θ1))}
[θ_D^ML]_2 = arg min_{θ2∈[0,1]} {0.45 · (− log(1 − θ2)) + 0.2 · (− log(θ2))}.

Proceeding as we have done above, we get θ_D^ML = [4/7, 4/13]^T.
Note that this is different from the population-optimal solution
θ_H^* = [2/3, 4/7]^T.
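In fact, for this model class the minimizers coincide with the empirical conditional probabilities of t = 1, which gives a direct sanity check; a sketch (our own code):

pD=[0.15,0.2; 0.2,0.05; 0.25,0.15];     %empirical distribution pD(x,t)
th1ML=pD(1,2)/sum(pD(1,:))               %pD(t=1|x=0) = 0.2/0.35 = 4/7
th2ML=sum(pD(2:3,2))/sum(sum(pD(2:3,:))) %pD(t=1|x in {1,2}) = 0.2/0.65 = 4/13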

Osvaldo Simeone ML4Engineers 57 / 63


Problem 4.11: Solution

The population log-loss Lp(θ_D^ML) is hence given as

Lp(θ_D^ML) = E_{(x,t)∼p(x,t)}[− log p(t|x, θ_D^ML)]
           = 0.1 · (− log(1 − 4/7)) + 0.2 · (− log(4/7))
           + 0.3 · (− log(1 − 4/13)) + 0.4 · (− log(4/13))
           ≈ 0.78.

As expected, this is larger than the minimum within-class population
log-loss Lp(θ_H^*) ≈ 0.67.

Osvaldo Simeone ML4Engineers 58 / 63


Problem 4.12
Continuing the previous problem, let us now consider the effect of
regularization.
To this end, assume that, based on prior knowledge, we have good reasons
to choose the following prior

p(θ) = p(θ1 , θ2 ) = p(θ1 )p(θ2 )

with marginals p(θi) = q if 0 ≤ θi < 0.5 and p(θi) = 2 − q if 0.5 ≤ θi ≤ 1,
for i = 1, 2 and some fixed 0 ≤ q ≤ 2.
Verify that the marginals p(θi ) are valid pdfs.
Write the conditions defining the maximum a posteriori (MAP) model
parameters θ_D^MAP for any N and prior parameter q.
Plot the population log-loss Lp(θ_D^MAP) for the empirical table given above as
a function of q in the interval [0, 2] (recall that we have N = 100).
As a reference, plot the population log-loss obtained by ML and comment
on the comparison between the two losses.

Osvaldo Simeone ML4Engineers 59 / 63


Problem 4.12: Solution

First, let us check that the marginals are valid pdfs. Non-negativity holds since
0 ≤ q ≤ 2, so we only need to check the normalization condition

∫_{−∞}^{+∞} p(θi) dθi = 1.

This can be easily verified, since we have

∫_{−∞}^{+∞} p(θi) dθi = 0.5 · q + (1 − 0.5) · (2 − q) = 1.

Osvaldo Simeone ML4Engineers 60 / 63


Problem 4.12: Solution

MAP soft predictor θ_D^MAP = argmin_{θ∈Θ} {LD(θ) − (1/N) log p(θ)}: We have

LD(θ) − (1/N) log p(θ) = 0.15 · (− log(1 − θ1)) + 0.2 · (− log(θ1))
                       + 0.45 · (− log(1 − θ2)) + 0.2 · (− log(θ2))
                       − (1/N) (log p(θ1) + log p(θ2)).

Therefore, the MAP soft predictor can be obtained as

[θ_D^MAP]_1 = arg min_{θ1∈[0,1]} {0.15 · (− log(1 − θ1)) + 0.2 · (− log(θ1)) − (1/N) log p(θ1)}
[θ_D^MAP]_2 = arg min_{θ2∈[0,1]} {0.45 · (− log(1 − θ2)) + 0.2 · (− log(θ2)) − (1/N) log p(θ2)}.

Osvaldo Simeone ML4Engineers 61 / 63


Problem 4.12: Solution
MATLAB code for the plot:
L=100;
N=100;
q=linspace(0.01,1.99,L);
thvec=[0:0.01:1];
for l=1:L
[m,ind]=min(0.15*(-log(1-thvec))+0.2*(-log(thvec))- ...
    1/N*log(q(l)*(thvec<0.5)+(2-q(l))*(thvec>=0.5))); %minimizes over theta1
th1MAP=thvec(ind);
[m,ind]=min(0.45*(-log(1-thvec))+0.2*(-log(thvec))- ...
    1/N*log(q(l)*(thvec<0.5)+(2-q(l))*(thvec>=0.5))); %minimizes over theta2
th2MAP=thvec(ind);
LpMAP(l)=0.1*(-log(1-th1MAP))+0.2*(-log(th1MAP))+ ...
    0.3*(-log(1-th2MAP))+0.4*(-log(th2MAP)); %population loss
end
plot(q,LpMAP,'b','LineWidth',2)
th1ML=4/7; th2ML=4/13;
hold on;
plot(q,(0.1*(-log(1-th1ML))+0.2*(-log(th1ML))+0.3*(-log(1-th2ML))+ ...
    0.4*(-log(th2ML)))*ones(L,1),'r--','LineWidth',2); %ML population loss
xlabel('$q$','Interpreter','latex')
ylabel('population loss','Interpreter','latex')
legend('MAP','ML')

Osvaldo Simeone ML4Engineers 62 / 63


Problem 4.12: Solution

Since the population-optimal solution has both elements larger than
0.5, a “good” prior has q sufficiently small.
With a good prior, the population loss of the MAP solution is smaller
as compared to ML.
In contrast, with a “bad” prior, i.e., with a large q, the prior enforces
an incorrect inductive bias, and the population loss of the MAP
solution is larger as compared to ML.
Note that, as N increases, the MAP solution would be less sensitive
to the choice of a bad prior since the regularization term would be
weighted less as compared to the likelihood.
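This last point can be visualized by re-running the grid search above with a fixed “bad” prior while increasing N; a sketch (our own code; the value q = 1.9 is just an example of a bad prior):

qbad=1.9; thvec=[0:0.01:1];   %a "bad" prior parameter (example value)
Nvec=[10,100,1000,10000];
th1MAP=zeros(size(Nvec));
for k=1:length(Nvec)
    N=Nvec(k);
    [m,ind]=min(0.15*(-log(1-thvec))+0.2*(-log(thvec))- ...
        1/N*log(qbad*(thvec<0.5)+(2-qbad)*(thvec>=0.5)));
    th1MAP(k)=thvec(ind);     %MAP estimate of theta1 for this N
end
th1MAP %moves toward the ML solution 4/7 = 0.57 as N grows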

Osvaldo Simeone ML4Engineers 63 / 63
