Machine Learning, Assignment 1 (Basic Concepts). Due: 27 March 2015, 15:00
Assignment 1
Basic Concepts
Please hand in your solutions in class, and upload a PDF version together with the code in a
single .zip file to the Moodle system before the deadline. Submission delays are rounded
up to whole days, i.e. a one-minute delay counts as one day.
You must submit your code for the questions that are marked with [+CODE].
After each section, the interface your code should provide is specified. Your code
will be tested by calling the specified function.
For computations and plotting, you are free to use any software or language of
your choice. The recommended tools are MatLab and Octave, since they will be by far the
easiest to use in the next assignments. You had better get used to them sooner rather than later.
MatLab: http://www.mathworks.com/
Octave: http://octave.sourceforge.net/
Fast Octave installation in Ubuntu: apt-get install octave octave-signal
Fast Octave installation in MacOS: port install octave octave-signal
Inside Octave, load a package (e.g. signal) by: pkg load signal
1
(HINT: To calculate the convolution, you can split $Z = X + Y$ into $X = aZ + t$ and
$Y = (1 - a)Z - t$, instead of $X = x$ and $Y = Z - x$; i.e.:
$$f_Z(z) = \int f_X(x)\, f_Y(z - x)\, dx = \int f_X(az + t)\, f_Y((1 - a)z - t)\, dt. \tag{5}$$)
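The change of variables in the hint can be checked numerically. The sketch below is only an illustration: the two Gaussian densities, the evaluation point z, and the split factor a are assumptions chosen for the example and do not come from the assignment. It evaluates both forms of the integral with the trapezoidal rule and compares them with the known density of a sum of independent Gaussians.

```octave
% Assumed example densities (not from the assignment): X ~ N(0,1), Y ~ N(1,2^2)
gauss = @(u, mu, s) exp(-(u - mu).^2 ./ (2*s^2)) ./ (s*sqrt(2*pi));
fX = @(u) gauss(u, 0, 1);
fY = @(u) gauss(u, 1, 2);

z = 0.7;                        % point at which f_Z is evaluated (arbitrary)
a = 0.5;                        % split factor from the hint (arbitrary)
x = linspace(-30, 30, 60001);   % integration grid for the standard form
t = x;                          % same grid reused for the substituted form

I1 = trapz(x, fX(x) .* fY(z - x));                % integral over x
I2 = trapz(t, fX(a*z + t) .* fY((1 - a)*z - t));  % integral over t
exact = gauss(z, 1, sqrt(5));   % independent Gaussians: X + Y ~ N(0+1, 1+4)

fprintf('over x: %.8f  over t: %.8f  closed form: %.8f\n', I1, I2, exact);
```

Both integrals agree because x = az + t is only a shift of the integration variable, so dx = dt and the limits remain the whole real line.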
$p(DS \mid C_I)$, $p(DS \mid C_H)$
(HINT: You can use a histogram with a proper bin size. In MatLab, you can use the
[a,b]=hist(X) function and plot the histogram with plot(b,a,color). For the distribution,
be careful about the normalization factor.)
CODE: function runB(data)
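As a rough illustration of the normalization mentioned in the hint, the following sketch turns histogram counts into a density estimate. The vector x is synthetic and only stands in for one attribute of one class; the number of bins is likewise an assumption to be tuned.

```octave
% Synthetic stand-in for one attribute of one class (assumption, not the real data)
x = 7 + 1.5 * randn(500, 1);

nbins = 20;                            % assumed bin count; adjust to the data
[counts, centers] = hist(x, nbins);    % bin counts and bin centers
binw = centers(2) - centers(1);        % hist uses equally wide bins
dens = counts / (sum(counts) * binw);  % scale so that sum(dens) * binw equals 1

plot(centers, dens, 'b');
xlabel('attribute value');
ylabel('estimated density');
```

Dividing by the number of samples alone gives bin probabilities; dividing additionally by the bin width gives a curve that integrates to one and can be compared across different bin sizes.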
(c): Based on the visual look of the graphs, which attribute is the best for determining the
patient's infection status? Why? (NOTE: There might be more than one correct answer.
Your answer to the why question is what matters.)
(d): Find out how much each data attribute can tell us about the infection status by
calculating its correlation with the infection status. Which correlation has the highest
value? Is it consistent with your reasoning in the previous section?
CODE: function runD(data)
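A minimal sketch of the correlation computation follows. The layout of data is an assumption, since it is not specified in this part of the handout: one row per patient with columns [WBC, BT, DS, I], where I = -1 for healthy and I = +1 for infected. The numbers are made up for illustration, and the helper name pearson is hypothetical.

```octave
% ASSUMPTION: columns are [WBC, BT, DS, I]; the values below are made up.
data = [ 6.1 36.8 2 -1;
         7.0 37.0 1 -1;
         5.5 36.6 3 -1;
         6.4 37.1 2 -1;
        11.2 38.4 6 +1;
        12.5 38.9 5 +1;
        10.1 37.9 7 +1;
        13.0 39.2 4 +1];

% Pearson correlation coefficient written out explicitly
pearson = @(x, y) sum((x - mean(x)) .* (y - mean(y))) / ...
                  sqrt(sum((x - mean(x)).^2) * sum((y - mean(y)).^2));

names = {'WBC', 'BT', 'DS'};
I = data(:, 4);
r = zeros(1, 3);
for k = 1:3
  r(k) = pearson(data(:, k), I);   % correlation of attribute k with the label
  fprintf('%s: r = %.3f\n', names{k}, r(k));
end
```

The built-in corrcoef could be used instead; the explicit formula is shown here so that the computation is visible.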
(e): Plot Infection-vs.-WBC, Infection-vs.-BT, and Infection-vs.-DS in three different
graphs (for the infection, assume I=-1 if the person is healthy, and I=+1 if infected). How
are the correlations you calculated before reflected in these graphs? What can these graphs
tell us about our single-attribute decision algorithm?
CODE: function runE(data)
Sufficient Statistic: Given a set $X$ of i.i.d. data with probability $p(X \mid \theta)$ for an unknown
parameter $\theta$, a sufficient statistic is a function $T(X)$ which contains all the information that
$X$ provides to estimate $\theta$. In other words:
$$P(\theta \mid T(X), X) = P(\theta \mid T(X)). \tag{6}$$
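To make the definition concrete, here is a small sketch using Bernoulli data, which is deliberately not the Gaussian case treated in the questions. For i.i.d. Bernoulli trials, T(X) = sum(X) is a sufficient statistic: two different samples with the same sum give the same likelihood for every value of the parameter, so knowing X beyond T(X) adds nothing for estimating it. The samples and the parameter grid are assumptions for illustration.

```octave
% Two different samples of 6 Bernoulli trials with the same T(X) = sum(X) = 4
X1 = [1 1 0 1 0 1];
X2 = [0 1 1 1 1 0];

% Likelihood p(X | q) of an i.i.d. Bernoulli sample with success probability q
lik = @(X, q) prod(q .^ X .* (1 - q) .^ (1 - X));

q  = 0.05:0.05:0.95;                  % candidate parameter values
L1 = arrayfun(@(v) lik(X1, v), q);    % likelihood curve for the first sample
L2 = arrayfun(@(v) lik(X2, v), q);    % likelihood curve for the second sample

maxDiff = max(abs(L1 - L2));          % zero up to floating-point rounding
fprintf('largest difference between the two curves: %g\n', maxDiff);
```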
From this point, assume $p(WBC, BT, DS \mid C_I)$ and $p(WBC, BT, DS \mid C_H)$ are Gaussian
distributions. As a result, all their slices (conditionals) and projections (marginals) are also Gaussian.
(f): What are the sufficient statistics ($T_I$ and $T_H$) for the ML estimator of the parameters
of the $p(WBC \mid C_I)$ and $p(WBC \mid C_H)$ distributions? If $T_M = [T_I, T_H, K]$, what parameter $K$
makes $T_M$ a sufficient statistic for estimating the parameters of $p(WBC)$?
(g): What are $p(C_H)$ and $p(C_I)$ in the training dataset? Calculate $p(WBC)$ from the
training data.
CODE: function runG(data)
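One possible way to organize this computation is sketched below, under the Gaussian assumption stated above. The data layout is again an assumption (columns [WBC, BT, DS, I] with I = -1 healthy, I = +1 infected) and the values are made up. The class priors are taken as relative frequencies, and the marginal is built with the law of total probability, $p(WBC) = p(C_H)\,p(WBC \mid C_H) + p(C_I)\,p(WBC \mid C_I)$.

```octave
% ASSUMPTION: columns are [WBC, BT, DS, I]; the values below are made up.
data = [ 6.1 36.8 2 -1;
         7.0 37.0 1 -1;
         5.5 36.6 3 -1;
         6.4 37.1 2 -1;
        11.2 38.4 6 +1;
        12.5 38.9 5 +1;
        10.1 37.9 7 +1;
        13.0 39.2 4 +1];
wbc = data(:, 1);
I   = data(:, 4);

pH = mean(I == -1);                 % empirical prior of the healthy class
pI = mean(I == +1);                 % empirical prior of the infected class

% ML estimates of the Gaussian parameters (var(x, 1) normalizes by N)
muH = mean(wbc(I == -1));  sH = sqrt(var(wbc(I == -1), 1));
muI = mean(wbc(I == +1));  sI = sqrt(var(wbc(I == +1), 1));

gauss = @(u, mu, s) exp(-(u - mu).^2 ./ (2*s^2)) ./ (s*sqrt(2*pi));
pWBC  = @(u) pH * gauss(u, muH, sH) + pI * gauss(u, muI, sI);

u = linspace(0, 20, 2001);          % grid covering the made-up WBC range
area = trapz(u, pWBC(u));           % should be close to 1
fprintf('p(C_H) = %.2f, p(C_I) = %.2f, integral of p(WBC) = %.4f\n', pH, pI, area);
```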
(h): Imagine the data was collected from people who came to the hospital for a checkup,
while in the real world only 5% of people are infected. What is the estimated $p(WBC)$
for the real world? Why?
CODE: function runH(data)
(i): Design the decision algorithm, based only on WBC, using:
• the Maximum Likelihood approach;
• the Minimum Cost approach, assuming that misclassifying an infected person costs
10 times as much as misclassifying a healthy person;
• the Maximum A Posteriori approach, assuming that only 5% of people in the real world
are infected.
CODE: functions runI_ML(data), runI_COST(data), runI_MAP(data)
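The three approaches can be viewed as one likelihood comparison with different weights: decide "infected" when $w_I\,p(WBC \mid C_I) > w_H\,p(WBC \mid C_H)$. The sketch below illustrates this view only. The Gaussian parameters are hypothetical placeholders rather than values fitted to the assignment data, and the equal priors used inside the cost rule are an assumption, since the choice of priors for that rule is a modelling decision.

```octave
gauss = @(u, mu, s) exp(-(u - mu).^2 ./ (2*s^2)) ./ (s*sqrt(2*pi));

% Hypothetical class-conditional parameters (placeholders, not fitted values)
muH = 6.5;  sH = 1.0;    % healthy
muI = 11.0; sI = 2.0;    % infected

% Returns +1 (infected) where wI*p(x|C_I) > wH*p(x|C_H), otherwise -1 (healthy)
decide = @(x, wI, wH) 2 * (wI * gauss(x, muI, sI) > wH * gauss(x, muH, sH)) - 1;

x = [5 8 8.5 9 12];                    % example WBC values

dML   = decide(x, 1, 1);               % Maximum Likelihood: equal weights
dCOST = decide(x, 10 * 0.5, 1 * 0.5);  % cost ratio 10:1, equal priors ASSUMED
dMAP  = decide(x, 0.05, 0.95);         % priors 5% infected, 95% healthy

disp([x; dML; dCOST; dMAP]);           % rows: x, ML, cost, MAP decisions
```

Raising the weight on the infected class moves the decision boundary toward lower WBC values, so the cost rule never labels fewer points as infected than the ML rule, and the MAP rule with a 5% prior never labels more.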
(j): Consider the Maximum Likelihood approach. We would like to select only one
attribute for our decision making. Using the dataset, find out which attribute is the
best for this purpose with 10-fold cross-validation. Is your result consistent
with your analysis in the previous sections? Why?
CODE: function runJ(data)
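A minimal sketch of 10-fold cross-validation for single-attribute ML classifiers is given below. The data matrix is synthetic and its layout (columns [WBC, BT, DS, I] with I = -1 healthy, I = +1 infected) is an assumption. Each fold is held out once, Gaussian class-conditionals are fitted on the remaining nine folds, and the held-out accuracy is averaged per attribute.

```octave
% ASSUMPTION: columns are [WBC, BT, DS, I]; synthetic data replaces the real set.
N = 100;
I = [-ones(N/2, 1); ones(N/2, 1)];
data = [7 + 2*I + randn(N, 1), 37 + 0.5*I + randn(N, 1), 3 + 0.2*I + randn(N, 1), I];

gauss = @(u, mu, s) exp(-(u - mu).^2 ./ (2*s^2)) ./ (s*sqrt(2*pi));

K = 10;
fold = mod(randperm(N), K)' + 1;   % column of random fold labels 1..K, equal sizes
acc = zeros(K, 3);                 % accuracy per fold (rows) and attribute (columns)

for k = 1:K
  te = (fold == k);                % held-out fold
  tr = ~te;                        % training folds
  for a = 1:3
    xH = data(tr & (I == -1), a);  % training values, healthy class
    xI = data(tr & (I == +1), a);  % training values, infected class
    muH = mean(xH);  sH = sqrt(var(xH, 1));   % ML Gaussian fit, healthy
    muI = mean(xI);  sI = sqrt(var(xI, 1));   % ML Gaussian fit, infected
    xt = data(te, a);
    pred = 2 * (gauss(xt, muI, sI) > gauss(xt, muH, sH)) - 1;  % ML decision
    acc(k, a) = mean(pred == I(te));
  end
end

meanAcc = mean(acc, 1);            % average held-out accuracy: [WBC, BT, DS]
fprintf('WBC: %.2f  BT: %.2f  DS: %.2f\n', meanAcc);
```

The fitting step uses only the training folds, so the held-out accuracy is not biased by the samples it is measured on.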