
Additional exercises

1. The table below lists a dataset that was used to create a nearest neighbor model
that predicts whether it will be a good day to go surfing.

Assuming that the model uses Euclidean distance to find the nearest neighbor,
what prediction will the model return for each of the following query instances?
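A minimal sketch of how such a model computes its prediction is given below. The rows in the dataset variable are placeholders only, not the instances from the table above; substitute the actual feature vectors and target labels when working the exercise.

import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(dataset, query):
    """Return the target of the training instance closest to the query.

    dataset: list of (feature_vector, target) pairs.
    """
    features, target = min(dataset, key=lambda row: euclidean(row[0], query))
    return target

# Placeholder rows for illustration only -- replace with the table's data.
dataset = [
    ((6.0, 15.0, 5.0), "yes"),
    ((1.0, 6.0, 9.0), "no"),
]
print(nearest_neighbor(dataset, (4.0, 10.0, 4.0)))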

2. Email spam filtering models often use a bag-of-words representation
for emails. In a bag-of-words representation, each descriptive feature
describing a document (in our case, an email) records how many times a
particular word occurs in the document. One descriptive feature is included
for each word in a predefined dictionary, and the dictionary is typically
defined as the complete set of words that occur in the training dataset.
The table below lists the bag-of-words representation for the following five
emails, along with a target feature, SPAM, indicating whether each email is
spam or genuine (a minimal construction sketch follows the list):

- “money, money, money”
- “free money for free gambling fun”
- “gambling for fun”
- “machine learning for fun, fun, fun”
- “free machine learning”
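Since the table itself is not reproduced here, here is a minimal sketch of how the bag-of-words vectors can be rebuilt from these five emails (punctuation dropped for simplicity; the SPAM labels must still come from the original table):

from collections import Counter

emails = [
    "money money money",
    "free money for free gambling fun",
    "gambling for fun",
    "machine learning for fun fun fun",
    "free machine learning",
]

# The dictionary is the complete set of words in the training emails.
dictionary = sorted({word for email in emails for word in email.split()})

def bag_of_words(text, dictionary):
    # Count how many times each dictionary word occurs in the text.
    counts = Counter(text.split())
    return [counts[word] for word in dictionary]

vectors = [bag_of_words(email, dictionary) for email in emails]
print(dictionary)   # ['for', 'free', 'fun', 'gambling', 'learning', 'machine', 'money']
for vector in vectors:
    print(vector)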

- What target level would a nearest neighbor model using Euclidean distance
return for the following email: “machine learning for free”?
- What target level would a k-NN model with k = 3, using Euclidean distance,
return for the same query?
- What target level would a weighted k-NN model with k = 5, using as weights
the reciprocal of the squared Euclidean distance between each neighbor and
the query, return for the query?
- What target level would a k-NN model with k = 3, using Manhattan distance,
return for the same query?
- There are a lot of zero entries in the spam bag-of-words dataset. This is
indicative of sparse data and is typical for text analytics. Cosine similarity
is often a good choice when dealing with sparse, non-binary data. What target
level would a 3-NN model using cosine similarity return for the query?

A sketch of the distance and similarity computations these questions rely on follows.
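This is a minimal sketch of that machinery, assuming the bag-of-words vectors built above paired with their targets as (feature_vector, target) tuples. Note that cosine similarity is a similarity, not a distance, so the k "nearest" neighbors are the k most similar.

import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(dataset, query, k, distance=euclidean, weighted=False):
    # Sort training pairs by distance to the query and keep the k nearest.
    neighbors = sorted(dataset, key=lambda row: distance(row[0], query))[:k]
    votes = defaultdict(float)
    for features, target in neighbors:
        if weighted:
            d = distance(features, query)
            if d == 0:
                return target      # an exact match dominates the vote
            votes[target] += 1.0 / d ** 2
        else:
            votes[target] += 1.0
    return max(votes, key=votes.get)

def knn_cosine(dataset, query, k):
    # For a similarity measure, keep the k MOST similar neighbors.
    neighbors = sorted(dataset,
                       key=lambda row: cosine_similarity(row[0], query),
                       reverse=True)[:k]
    votes = defaultdict(float)
    for features, target in neighbors:
        votes[target] += 1.0
    return max(votes, key=votes.get)

With weighted=True and k=5, knn implements the reciprocal-squared-distance scheme from the third question; passing distance=manhattan covers the fourth, and knn_cosine the fifth.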

3. You have been hired by the European Space Agency to build a model that predicts
the amount of oxygen that an astronaut consumes when performing five minutes of
intense physical work. The descriptive features for the model will be the age of the
astronaut and their average heart rate throughout the work. The regression model is

OxyCon = w[0] + w[1] * Age + w[2] * HeartRate

The table below shows a historical dataset that has been collected for this task.

- Assuming that the current weights in a multivariate linear regression model
are w[0] = 59.50, w[1] = 0.15, and w[2] = 0.60, make a prediction for each
training instance using this model.
- Calculate the sum of squared errors for the set of predictions generated in
the previous question.
- Assuming a learning rate of 0.000002, calculate the weights at the next
iteration of the gradient descent algorithm.
- Calculate the sum of squared errors for the set of predictions generated
using the new weights calculated in the previous question.

A sketch of these computations is given below.
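This is a minimal sketch, assuming training rows of the form (Age, HeartRate, OxyCon); the historical dataset table is not reproduced here, and the update rule shown is the common delta-rule form in which the constant factor from differentiating the squared error is folded into the learning rate.

# Weights and learning rate taken from the questions above.
w = [59.50, 0.15, 0.60]
alpha = 0.000002

def predict(w, age, heart_rate):
    # OxyCon = w[0] + w[1] * Age + w[2] * HeartRate
    return w[0] + w[1] * age + w[2] * heart_rate

def sum_of_squared_errors(w, data):
    # data: list of (age, heart_rate, oxycon) rows from the table.
    return sum((oxy - predict(w, age, hr)) ** 2 for age, hr, oxy in data)

def gradient_descent_step(w, data, alpha):
    # Batch update: w[j] <- w[j] + alpha * sum(error * feature_j),
    # where the bias term w[0] uses a constant feature of 1. Errors are
    # computed with the OLD weights, as batch gradient descent requires.
    new_w = list(w)
    for age, hr, oxy in data:
        error = oxy - predict(w, age, hr)
        new_w[0] += alpha * error
        new_w[1] += alpha * error * age
        new_w[2] += alpha * error * hr
    return new_w

# Usage, once `data` holds the table's rows:
# sse_before = sum_of_squared_errors(w, data)
# w_next = gradient_descent_step(w, data, alpha)
# sse_after = sum_of_squared_errors(w_next, data)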
