The joint distribution p(x, t) is given by the table

x\t     0       1
0       0.1     0.2
1       0.2     0.1
2       0.1     0.3

and the corresponding conditional distribution p(t|x) is

x\t     0       1
0       0.33    0.67
1       0.67    0.33
2       0.25    0.75
The optimal hard predictor t̂*(·), which minimizes the population loss Lp(t̂(·)), is given under the detection-error loss by the MAP predictor t̂*(x) = argmax_t p(t|x). From the conditional distribution above, this yields t̂*(0) = 1, t̂*(1) = 0, t̂*(2) = 1, with population loss Lp(t̂*(·)) = 0.1 + 0.1 + 0.1 = 0.3.
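As a check, the following MATLAB sketch (the matrix p and the variables tMAP and LpMAP are introduced here only for illustration) computes the MAP predictor and its population detection-error loss directly from the table:
p=[0.1 0.2; 0.2 0.1; 0.1 0.3]; % true joint distribution: rows x=0,1,2, columns t=0,1
tMAP=zeros(3,1);
for i=1:3
[~,I]=max(p(i,:)); % MAP decision for x=i-1
tMAP(i)=I-1;
end
LpMAP=0;
for i=1:3
LpMAP=LpMAP+p(i,2-tMAP(i)); % probability of the wrong label for x=i-1
end
LpMAP % expected to equal 0.3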
MATLAB code to estimate the empirical joint distribution pD(x, t) from the data set:
% pest(i,j) estimates pD(x = i-1, t = j-1)
pest=zeros(3,2);
for n=1:N
pest(x(n)+1,t(n)+1)=pest(x(n)+1,t(n)+1)+1/N;
end
For the same joint distribution as in the previous problem, generate again a data set D of N = 10 data points (a sampling sketch is given below).
For this data set, assuming the detection-error loss, compute the training loss LD(t̂(·)) for the predictor t̂(0) = 0, t̂(1) = 1, and t̂(2) = 0.
Identify the predictor t̂D(·) that minimizes the training loss, i.e., the empirical risk minimization (ERM) predictor, assuming a model class H with t̂(0|θ) = θ1, t̂(1|θ) = θ2 and t̂(2|θ) = θ3 for θ1, θ2, θ3 ∈ {0, 1}, and compare its training loss with that of the predictor considered in the previous point.
Compute the population loss of the ERM predictor.
Using your previous results, calculate the optimality gap Lp(t̂D(·)) − Lp(t̂*(·)).
Repeat the points above with N = 10000 and comment on your results.
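One possible way to generate the data set is sketched below; the column vectors x and t and the constant N match the variables used in the code above, while pairs, probs and that are introduced only for illustration.
N=10;
p=[0.1 0.2; 0.2 0.1; 0.1 0.3]; % joint distribution: rows x=0,1,2, columns t=0,1
pairs=[0 0; 0 1; 1 0; 1 1; 2 0; 2 1]; % the six possible (x,t) pairs
probs=[p(1,1) p(1,2) p(2,1) p(2,2) p(3,1) p(3,2)];
x=zeros(N,1); t=zeros(N,1);
for n=1:N
idx=find(rand<=cumsum(probs),1); % sample a pair index from the joint distribution
x(n)=pairs(idx,1); t(n)=pairs(idx,2);
end
% training loss of the fixed predictor t_hat(0)=0, t_hat(1)=1, t_hat(2)=0 under the detection-error loss
that=[0;1;0];
LD=mean(t~=that(x+1))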
The ERM predictor is random since it depends on pD(x, t). In this problem, we have θ = [t̂(0), t̂(1), t̂(2)]^T and hence we impose no constraints on the predictor. Under the detection-error loss, with discrete variables, ERM is given by the MAP predictor applied to pD(x, t), i.e.,
t̂^ERM_D(x) = argmax_t pD(t|x) = argmax_t pD(x, t).
MATLAB code:
% ERM predictor: for each value of x, select the value of t with the largest empirical probability pest(x+1, t+1)
tERM=zeros(3,1);
for i=1:3
[M,I]=max(pest(i,:));
tERM(i)=I-1;
end
The training loss LD(t̂^ERM_D(·)) for the ERM predictor is also random and, for the detection-error loss, is given as
LD(t̂^ERM_D(·)) = pD(x = 0, t ≠ t̂^ERM_D(0)) + pD(x = 1, t ≠ t̂^ERM_D(1)) + pD(x = 2, t ≠ t̂^ERM_D(2)).
MATLAB code:
LDERM=pest(1,~tERM(1)+1)+pest(2,~tERM(2)+1)+pest(3,~tERM(3)+1)
% compare with LD!
Similarly, the population loss Lp(t̂^ERM_D(·)) for the ERM predictor is random and is generally given as
Lp(t̂^ERM_D(·)) = p(x = 0, t ≠ t̂^ERM_D(0)) + p(x = 1, t ≠ t̂^ERM_D(1)) + p(x = 2, t ≠ t̂^ERM_D(2)).
MATLAB code:
p(1,1)=0.1; p(1,2)=0.2; p(2,1)=0.2; p(2,2)=0.1; p(3,1)=0.1; p(3,2)=0.3;
LpERM=p(1,~tERM(1)+1)+p(2,~tERM(2)+1)+p(3,~tERM(3)+1) % compare with Lp(t̂*(·)) = 0.3!
The improved results in terms of the optimality gap Lp(t̂D(·)) − Lp(t̂*(·)) for N = 10000 are a consequence of the law of large numbers: the law of large numbers implies that the training loss
LD(θ) = (1/N) Σ_{n=1}^{N} ℓ(t_n, t̂(x_n|θ))
tends to the population loss Lp(θ) as N grows, so that the ERM predictor approaches the population-optimal predictor.
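As a quick numerical illustration of this convergence (a sketch reusing the sampling procedure above; the variable maxdev is introduced only for illustration), one can compare the empirical and true joint distributions for N = 10 and N = 10000:
p=[0.1 0.2; 0.2 0.1; 0.1 0.3];
probs=[p(1,1) p(1,2) p(2,1) p(2,2) p(3,1) p(3,2)];
for N=[10 10000]
pest=zeros(3,2);
for n=1:N
idx=find(rand<=cumsum(probs),1);
xi=ceil(idx/2)-1; ti=mod(idx-1,2); % recover (x,t) from the pair index
pest(xi+1,ti+1)=pest(xi+1,ti+1)+1/N;
end
maxdev=max(abs(pest(:)-p(:))) % maximum deviation shrinks as N grows
end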
Show that the training loss for Example 2 in the text can be written as
LD(θ) = (1/N) ||tD − XD θ||².
Show that the condition (1/N) X_D^T X_D = I imposes that, on average over the data set, all the features u_d(x), d = 1, ..., D, have the same expected energy and they are uncorrelated.
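A sketch of the first identity, assuming (as in Example 2 of the text) a model t̂(x|θ) = θ^T u(x) that is linear in the feature vector u(x) = [u_1(x), ..., u_D(x)]^T and the quadratic loss:
\[
L_D(\theta) = \frac{1}{N}\sum_{n=1}^{N}\bigl(t_n - \theta^{T} u(x_n)\bigr)^{2}
            = \frac{1}{N}\,\bigl\| t_D - X_D \theta \bigr\|^{2},
\quad \text{where } X_D = \begin{bmatrix} u(x_1)^{T} \\ \vdots \\ u(x_N)^{T} \end{bmatrix},\;
t_D = [t_1, \dots, t_N]^{T}.
\]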
The matrix (1/N) X_D^T X_D is of dimension D × D and its (d, d)th entry is given as
[(1/N) X_D^T X_D]_{d,d} = (1/N) u_d^T u_d = (1/N) ||u_d||²,
where u_d = [u_d(x_1), u_d(x_2), ..., u_d(x_N)]^T is the dth column of matrix XD. Therefore, the condition [(1/N) X_D^T X_D]_{d,d} = 1 is equivalent to
(1/N) Σ_{n=1}^{N} u_d(x_n)² = 1,
which indicates that the energy of feature d, averaged over a uniform selection of samples from the training set, is equal to one, i.e., the same for all features. Similarly, the off-diagonal condition [(1/N) X_D^T X_D]_{d,d'} = (1/N) u_d^T u_{d'} = 0 for d ≠ d' imposes that distinct features are uncorrelated on average over the training set.
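As a quick numerical check of this interpretation (a sketch with randomly generated features, not tied to any specific data set in the problems):
N=1000; D=4;
X=randn(N,D); % random feature matrix with rows u(x_n)'
G=(1/N)*(X'*X);
avg_energy=diag(G)' % (d,d) entries: average energy of each feature (close to 1 here)
avg_corr=G(1,2) % (d,d') entries: average correlation between features (close to 0 here)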
MATLAB code:
function thERM=LSsolver(X,t)
% returns the least-squares (ERM) solution thERM=(X'*X)^(-1)*X'*t
thERM=inv(X'*X)*X'*t;
MATLAB code to plot the values of t against the values of x in the training set:
plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')
MATLAB code that uses the function LSsolver and plots the hard predictor t̂(x|θ^ERM_D):
thERM=LSsolver(X,t);
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[1,xaxis(l),xaxis(l)^2,xaxis(l)^3,xaxis(l)^4,xaxis(l)^5]';
tl(l)=thERM'*ul;
end
hold on; plot(xaxis,tl,'b','LineWidth',2)
Continuing the previous problem, compute the training loss LD(θ^ERM_M) for values of the model capacity M between 1 and 6, and plot the training loss versus M in this range.
Explain the result in terms of the capacity of the model to fit the training data.
Estimate the population loss using validation by computing the empirical loss on the held-out data xval and tval, which you can find in your workspace. Plot the validation error on the same figure.
Which value of M would you choose? Explain by using the concepts of bias and estimation error.
MATLAB code for computing the training loss LD(θ^ERM_M) for different values of M:
N=20;
for M=1:6
X=zeros(N,M+1);
for n=1:N
X(n,:)=[1,x(n).^[1:M]];
end
thERM=LSsolver(X,t);
LDERM(M)=1/N*norm(t-X*thERM)^2;
end
plot([1:6],LDERM,'r--','LineWidth',2)
xlabel('$M$','Interpreter','latex')
ylabel('quadratic loss','Interpreter','latex')
MATLAB code for computing the estimate of the population loss Lp(θ^ERM_M) via validation for different values of M:
for M=1:6
X=zeros(N,M+1);
for n=1:N
X(n,:)=[1,x(n).^[1:M]];
end
thERM=LSsolver(X,t);
Xval=xval.^[0:M];
LvERM(M)=1/N*norm(tval-Xval*thERM)^2; % the factor 1/N assumes that the validation set also contains N points
end
hold on; plot([1:6],LvERM,'k','LineWidth',2)
When M is small, the bias dominates and the training loss is relatively large. In
contrast, when M is large, the estimation error dominates and the gap between the
population and the training losses is large.
MATLAB code for regularized ERM (ridge regression) with M = 6 and λ = exp(−10):
M=6;
lambda=exp(-10);
N=20;
for n=1:N
X(n,:)=[x(n).^[0:M]]';
end
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[xaxis(l).^[0:M]]';
tl(l)=thRERM'*ul;
end
plot(xaxis,tl,'b','LineWidth',2);
hold on; plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')
LDRERM=1/N*norm(t-X*thRERM)^2
MATLAB code for regularized ERM (ridge regression) with M = 6 and λ = exp(−20):
M=6;
lambda=exp(-20);
N=20;
for n=1:N
X(n,:)=[x(n).^[0:M]]';
end
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
xaxis=[min(x):0.01:max(x)];
L=length(xaxis);
for l=1:L
ul=[xaxis(l).^[0:M]]';
tl(l)=thRERM'*ul;
end
plot(xaxis,tl,'b','LineWidth',2);
hold on; plot(x,t,'ro','MarkerSize',10,'LineWidth',2)
xlabel('$x$','Interpreter','latex')
ylabel('$t$','Interpreter','latex')
LDRERM=1/N*norm(t-X*thRERM)^2
The training loss is smaller for a smaller value of λ since, in this case,
the optimization criterion is dominated by the training loss and the
regularization term is less relevant.
Continuing the previous problem, for M = 6, consider ten possible values for λ, namely λ = exp(v) with v = log(λ) taking one of ten equally spaced values between −30 and 10.
For each value of λ, compute the training loss LD(θ^R-ERM_D); a sketch is given below. Plot it as a function of log(λ). Comment on your results.
Evaluate the estimate of the population loss on the held-out set variables xval and tval, which you can find in your workspace. Plot the validation-based estimate of the population loss in the same figure. Which value of λ would you choose?
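A sketch of the training-loss sweep follows; it assumes that the design matrix X for M = 6, the targets t, and N = 20 are available in the workspace as in the code above, and it defines the vector lambdavec used in the validation code further below.
M=6; N=20;
lambdavec=exp(linspace(-30,10,10)); % ten values of lambda
for l=1:10
lambda=lambdavec(l);
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
LDRERM(l)=1/N*norm(t-X*thRERM)^2; % training loss for this value of lambda
end
plot(linspace(-30,10,10),LDRERM,'r--','LineWidth',2)
xlabel('$\log(\lambda)$','Interpreter','latex')
ylabel('quadratic loss','Interpreter','latex')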
The training loss increases as λ increases, due to the increased contribution of the regularization term to the objective function.
MATLAB code for computing the estimate of the population loss Lp(θ^R-ERM_D) via validation for different values of λ:
for l=1:10
lambda=lambdavec(l);
thRERM=inv(X'*X+lambda*eye(M+1))*X'*t;
Xval=xval.^[0:M];
LvRERM(l)=1/N*norm(tval-Xval*thRERM)^2;
end
hold on; plot(linspace(-30,10,10),LvRERM,'k','LineWidth',2)
When λ is large, the bias caused by the regularization term dominates and the training loss is large. In contrast, when λ is small, the estimation error dominates, as for unregularized ERM, and the gap between the population and the training losses is large.
We have a classification problem with true joint distribution p(x, t) given by the table below.

x\t     0       1
0       0.1     0.2
1       0.2     0.1
2       0.1     0.3

The model class is
H = {t̂(x|θ) : t̂(0|θ) = θ1, t̂(1|θ) = t̂(2|θ) = θ2, for θ1, θ2 ∈ {0, 1}}.
Assume that we have a training data set with empirical distribution pD(x, t) given as

x\t     0       1
0       0.15    0.2
1       0.2     0.05
2       0.25    0.15
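As a sketch of how the within-class ERM predictor can be computed from this empirical table (the variable names pD, th1, th2 and thetaERM are introduced only for illustration):
pD=[0.15 0.2; 0.2 0.05; 0.25 0.15]; % empirical joint distribution: rows x=0,1,2, columns t=0,1
[~,I]=max(pD(1,:)); th1=I-1; % theta1: decision for x=0
[~,I]=max(sum(pD(2:3,:),1)); th2=I-1; % theta2: common decision for x=1 and x=2
thetaERM=[th1,th2] % ERM predictor within H under the detection-error loss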
For ERM, we can similarly decompose the training loss by using the empirical distribution pD(x, t) in place of p(x, t). So, we have the identity "population loss = minimum population loss + bias + estimation error".
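In symbols, writing θ*_H = argmin_{θ∈Θ} Lp(θ) for the population-optimal within-class model, one way to write this identity is
\[
L_p(\theta^{\mathrm{ERM}}_{\mathcal D})
= \underbrace{L_p(\hat t^{*}(\cdot))}_{\text{minimum population loss}}
+ \underbrace{L_p(\theta^{*}_{\mathcal H}) - L_p(\hat t^{*}(\cdot))}_{\text{bias}}
+ \underbrace{L_p(\theta^{\mathrm{ERM}}_{\mathcal D}) - L_p(\theta^{*}_{\mathcal H})}_{\text{estimation error}} .
\]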
Recall that Lp(θ^ERM_D) − Lp(t̂*(·)) is the optimality gap. We will now evaluate these terms in turn to verify this equality.
We have a classification problem with true joint distribution p(x, t) given by the table below.

x\t     0       1
0       0.1     0.2
1       0.2     0.1
2       0.1     0.3
Population-optimal within-class soft predictor θ*_H = argmin_{θ∈Θ} Lp(θ): with the parameterization θ1 = p(t = 1|x = 0, θ) and θ2 = p(t = 1|x = 1, θ) = p(t = 1|x = 2, θ), the population log-loss is given as
Lp(θ) = 0.1·(−log(1−θ1)) + 0.2·(−log(θ1)) + 0.3·(−log(1−θ2)) + 0.4·(−log(θ2)),
so that the two entries can be optimized separately:
[θ*_H]_1 = argmin_{θ1∈[0,1]} {0.1·(−log(1−θ1)) + 0.2·(−log(θ1))}
[θ*_H]_2 = argmin_{θ2∈[0,1]} {0.3·(−log(1−θ2)) + 0.4·(−log(θ2))}.
Setting the derivative with respect to θ1 to zero,
d/dθ1 {0.1·(−log(1−θ1)) + 0.2·(−log(θ1))} = 0.1·1/(1−θ1) − 0.2·1/θ1 = 0,
which gives the equation 0.1·θ1 − 0.2·(1−θ1) = 0 and hence [θ*_H]_1 = 2/3.
MATLAB code:
th=[0:0.01:1];
plot(th,0.1*(-log(1-th))+0.2*(-log(th)),'LineWidth',2)
xlabel('$\theta_1$','Interpreter','latex')
Following the same steps for θ2, we obtain [θ*_H]_2 = 4/7.
The corresponding optimal hard predictor under the detection-error loss is the MAP predictor, which gives
t̂(0) = argmax_t p(t|x = 0, θ*_H) = 1,
and, similarly, t̂(1) = t̂(2) = 1, since [θ*_H]_2 = 4/7 > 1/2.
The population log-loss Lp(θ*_H) is hence given as
Lp(θ*_H) = E_{(x,t)∼p(x,t)}[−log p(t|x, θ*_H)]
= 0.1·(−log(1−2/3)) + 0.2·(−log(2/3)) + 0.3·(−log(1−4/7)) + 0.4·(−log(4/7))
= 0.66.
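A one-line MATLAB check of this value (a sketch; the variable name LpH is used only for illustration):
LpH=0.1*(-log(1-2/3))+0.2*(-log(2/3))+0.3*(-log(1-4/7))+0.4*(-log(4/7)) % compare with the value 0.66 above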
In Problem 4.10, we have considered the ideal situation in which the population
distribution p(x, t) is known. In this problem, we assume the same population distribution
and the same model class H of soft predictors, and we study the learning problem.
To this end, assume that we have a training data set of N = 100 data points with
empirical distribution pD (x, t) given as
x\t     0       1
0       0.15    0.2
1       0.2     0.05
2       0.25    0.15
Obtain the maximum likelihood (ML) model θ^ML_D = argmin_{θ∈Θ} LD(θ).
Calculate the population log-loss Lp(θ^ML_D) and compare it with the population loss Lp(θ*_H) of the population-optimal within-class predictor obtained in Problem 4.10.
ML soft predictor θ^ML_D = argmin_{θ∈Θ} LD(θ): the training log-loss is given as
LD(θ) = 0.15·(−log(1−θ1)) + 0.2·(−log(θ1)) + 0.45·(−log(1−θ2)) + 0.2·(−log(θ2)),
so that
[θ^ML_D]_1 = argmin_{θ1∈[0,1]} {0.15·(−log(1−θ1)) + 0.2·(−log(θ1))}
[θ^ML_D]_2 = argmin_{θ2∈[0,1]} {0.45·(−log(1−θ2)) + 0.2·(−log(θ2))}.
Proceeding as we have done above, we get θ^ML_D = [4/7, 4/13]^T. Note that this is different from the population-optimal solution θ*_H = [2/3, 4/7]^T.
The population log-loss Lp(θ^ML_D) is hence given as
Lp(θ^ML_D) = E_{(x,t)∼p(x,t)}[−log p(t|x, θ^ML_D)]
= 0.1·(−log(1−4/7)) + 0.2·(−log(4/7)) + 0.3·(−log(1−4/13)) + 0.4·(−log(4/13))
= 0.77.
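Again, a one-line MATLAB check (a sketch; the variable name LpML is used only for illustration):
LpML=0.1*(-log(1-4/7))+0.2*(-log(4/7))+0.3*(-log(1-4/13))+0.4*(-log(4/13)) % larger than Lp(θ*_H) = 0.66, as expected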
First, let us check that the marginals are valid pdfs. To this end, we need to check the normalization condition
∫_{−∞}^{+∞} p(θi) dθi = 1.
MAP soft predictor θ^MAP_D = argmin_{θ∈Θ} {LD(θ) − (1/N) log p(θ)}: We have
LD(θ) − (1/N) log p(θ) = 0.15·(−log(1−θ1)) + 0.2·(−log(θ1)) + 0.45·(−log(1−θ2)) + 0.2·(−log(θ2)) − (1/N)(log p(θ1) + log p(θ2)).
Therefore, the MAP soft predictor can be obtained as
[θ^MAP_D]_1 = argmin_{θ1∈[0,1]} {0.15·(−log(1−θ1)) + 0.2·(−log(θ1)) − (1/N) log p(θ1)}
[θ^MAP_D]_2 = argmin_{θ2∈[0,1]} {0.45·(−log(1−θ2)) + 0.2·(−log(θ2)) − (1/N) log p(θ2)}.
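These one-dimensional problems can be solved numerically via a grid search. The sketch below is for illustration only: it assumes a Beta(2,2) prior p(θi) = 6·θi·(1−θi) for each parameter (the actual prior is the one specified in the problem) and N = 100.
N=100;
th=0.001:0.001:0.999; % grid over (0,1)
logprior=log(6*th.*(1-th)); % log p(theta_i) for the assumed Beta(2,2) prior
f1=0.15*(-log(1-th))+0.2*(-log(th))-(1/N)*logprior;
f2=0.45*(-log(1-th))+0.2*(-log(th))-(1/N)*logprior;
[~,i1]=min(f1); [~,i2]=min(f2);
thMAP=[th(i1),th(i2)] % MAP soft predictor under the assumed prior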