
Machine Learning and Big Data Analytics Section 1

Emily Mower
September 2018

Note that the material in these notes draws on the excellent and more thorough treatment
of these topics in Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor
Hastie and Robert Tibshirani.

1 Important Machine Learning Concepts


1.1 Regression vs Classification
Prediction problems can be defined based on the characteristics of the outcome variable we want
to predict.

Regression problems are those where the outcome is quantitative (e.g., a home sale price).

Classification problems are those where the outcome is qualitative / categorical (e.g., a student's college major).

Sometimes the same methods can be used for regression and classification problems, but many
methods are useful for only one of the two problem types.

1.2 Bias-Variance Trade-off


The variance of a statistical learning method is the amount by which the prediction function would change if it were estimated on a different training set. A model that overfits has high variance, whereas a model that underfits has low variance.
To remember the difference between low-variance and high-variance models, I find it helpful to think of examples. Suppose your model were “use the mean of the training data as the predicted value for all new data points.” The mean shouldn’t change much across training sets, so this model has low variance. On the other hand, a model that picks up super complex patterns is likely to be picking up noise in addition to signal. The noise will vary by training set, so such a method would have high variance.

The bias of a statistical learning method is the error introduced by approximating a real-world problem with a (typically much simpler) statistical model. Very flexible models (which are prone to overfitting) can capture complex patterns and so tend to have low bias. Very simple models (which are prone to underfitting) are limited in their ability to pick up patterns and so may have high bias.
The book uses the example of representing a non-linear function by a linear one to show that
no matter how much data you have, a linear model will not do a great prediction job when the
process generating the data is non-linear. Bias also applies to methods that might not fit your
traditional concept of a statistical function. In the K-Nearest Neighbors section, we will discuss
bias in that setting.

Often, we will talk about the bias-variance trade-off . In an ideal world, we would find a
model that has low variance and low bias, because that would yield a good and consistent model.
In practice, you usually have to allow bias to increase in order to decrease variance and vice versa.
However, there are many models that will decrease one (bias or variance) significantly while only
increasing the other a little.
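One way to make this trade-off concrete (this is the standard decomposition, covered in more depth in ISL, and it assumes the outcome is generated as $y = f(x) + \epsilon$) is to write the expected test error at a point $x_0$ as

$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}(\hat{f}(x_0)) + \big[\mathrm{Bias}(\hat{f}(x_0))\big]^2 + \mathrm{Var}(\epsilon).$

The variance term tends to grow with model flexibility, the squared-bias term tends to shrink with it, and the last term (the irreducible error) is a floor that no model can get under.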

1.3 Supervised v. Unsupervised Learning
Supervised learning refers to problems where there is a known outcome. In these problems, you
can train a model to take features and predict the known outcome.
Unsupervised learning refers to problems where you are interested in uncovering patterns and
do not have a target outcome in mind.

An example of supervised learning would be using students’ high school grades, class enrollments, and demographic variables to predict whether or not they attend college.
An example of unsupervised learning would be using the same grade, enrollment, and demographic features to identify “types” of high school students, that is, groups of students who look similar according to these features. Perhaps you are interested in this because you want to make classes that contain a mix of different types of students. Often, unsupervised learning is useful for creating features for supervised learning problems, but sometimes uncovering patterns is the final objective.

1.4 Measuring Model Performance


There are different functions you can use to measure model performance, and which function you
choose depends on your data and your objective. These functions are called “loss functions,” which
is a somewhat intuitive name when you think about the fact that your machine learning algorithm
is trying to minimize this function and thus minimize your loss.
To understand how and why loss functions depend on your data and objectives, examples can
be helpful.
Consider first that you are trying to predict the future college majors of this year’s incoming
freshmen (a classification problem). In this case, your prediction will either be right (you predict
the major they end up choosing) or it will be wrong. Therefore, you might use accuracy (% correct)
to measure model performance.
What if, though, you cared more about being wrong for some majors than others? For example, imagine that all biology majors are going to need personalized lab equipment in their junior year, and that the lab equipment is really expensive if ordered at the last minute but a lot cheaper if ordered a year or more in advance. Then you might want to give more weight to people who end up being biology majors, so that your model does better at predicting biology majors than other majors.
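As a rough sketch of what this might look like in code (the majors, the predictions, and the 5x weight below are made up for illustration), plain accuracy treats every mistake equally, while a weighted error penalizes mistakes on biology majors more heavily:

import numpy as np

# Hypothetical predicted and true majors for five students (labels made up)
y_true = np.array(["bio", "econ", "bio", "cs", "econ"])
y_pred = np.array(["bio", "econ", "cs", "cs", "cs"])

# Plain accuracy: every mistake counts the same
accuracy = np.mean(y_pred == y_true)            # 3/5 = 0.6

# Weighted error: a mistake on a biology major counts five times as much
weights = np.where(y_true == "bio", 5.0, 1.0)
weighted_error = np.sum(weights * (y_pred != y_true)) / np.sum(weights)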
Now consider that you are trying to predict home prices (a regression problem). You might
measure your performance using mean-squared error (MSE), which is found by taking the difference
between the predicted sale price for each home and the true sale price (the error), squaring it for
each home, and then taking the mean of these squared errors. However, home prices are skewed
(e.g. some homes are extremely expensive compared to most homes on the market). This means
that a 5% error on a $3 million home is a lot bigger than a 5% error on a $100,000 home. When
you square the errors (as you do when calculating MSE), the difference becomes enormous.
But since both errors are 5%, maybe you want to penalize them the same. One option is to
use Mean Percentage Error (MPE), but this has the weird effect that if you over-predict one home
by 5% and under-predict the other by 5%, your MPE is zero. Therefore, a popular option is to
use the Mean Absolute Percentage Error (MAPE), which is the mean of the absolute values of the
percentage errors and thus would be 5% in this example.
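A small sketch, with made-up sale prices, of how these three loss functions compare on the example above:

import numpy as np

# Hypothetical sale prices: a $3 million home and a $100,000 home,
# each mispredicted by 5% (one over, one under)
y_true = np.array([3_000_000.0, 100_000.0])
y_pred = np.array([3_150_000.0, 95_000.0])

errors = y_pred - y_true
pct_errors = errors / y_true

mse = np.mean(errors ** 2)              # dominated by the $3 million home
mpe = np.mean(pct_errors)               # 0.0: the +5% and -5% cancel out
mape = np.mean(np.abs(pct_errors))      # 0.05, i.e. 5% for both homes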
For many prediction problems in the policy sphere, we may not only care about accuracy of
prediction but also about fairness or other objectives. The loss function is a place where we can
explicitly tell the model to optimize for these concerns in addition to predictive performance.

2 k-Nearest Neighbors
2.1 Concept
The idea underlying k-Nearest Neighbors (kNN) is that we expect observations with similar features
to have similar outcomes. kNN makes no other assumptions about functional form, so it is quite
flexible.

2.2 Method
kNN can be used for either regression or classification, though it works slightly differently depending
on what setting we are in. In the classification setting, the prediction is a majority vote of the
observation’s k-nearest neighbors. In the regression setting, the prediction is the average outcome
of the observation’s k-nearest neighbors.
For kNN, bias will be lower when the relationship between features and the outcome is smooth across the feature space. When the relationship is rough, bias will increase quickly as farther-away neighbors are included in the prediction.
The only choice we have to make when implementing kNN is the value of k (i.e., how many neighbors should we use in our prediction?). A good way to find k is through cross-validation, something we will cover a little later, but which broadly involves training the algorithm on one set of data and seeing how well it does on a different set.
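As a minimal sketch of both settings (using scikit-learn as one common implementation, with simulated data standing in for real features), cross-validation can be used to compare candidate values of k:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical data: X is an (n, p) feature matrix, y_class a categorical
# outcome, y_reg a quantitative outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_class = (X[:, 0] + rng.normal(size=200) > 0)
y_reg = X[:, 0] ** 2 + rng.normal(size=200)

for k in [1, 5, 15, 50]:
    # Classification: majority vote among the k nearest neighbors
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y_class, cv=5).mean()
    # Regression: average outcome of the k nearest neighbors
    mse = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y_reg, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(k, round(acc, 3), round(mse, 3))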

2.3 Implementation and Considerations


A concern with kNN is whether you have good coverage of your feature space. Imagine that all of
your training points were in one region of the feature space, but some of your test points are far
away from this region. You will still use the k nearest neighbors to predict the outcome for these
far-away test points, but it might not work as well as if the points were close together. Therefore,
when implementing kNN, it’s good to think about how similar the features in your test set will be
to the features in your training set. If they differ systematically, that is a concern (as it would be
for other ML methods as well).
Another important consideration is whether there is an imbalance in the frequency of one
outcome compared to another. For example, suppose we are trying to classify points as “true”
or “false” and most points are “true.” Even if the “false” outcomes are clustered together in the
feature space, if we use a large enough value of k, we will predict “true” for these observations
simply because there are many more “true” observations than “false” observations. Therefore, we
would do better to use a small value for k in this setting.
Another consideration is whether proximity in each variable is equally important or if proximity in one variable is more important than proximity in another variable. Because kNN relies on distances between observations, variables are typically normalized so that they are all on the same scale (e.g., the same mean and variance) before distances are computed, and distance in all normalized variables is then treated the same. If you want to up-weight proximity for some variables and down-weight it for others, you can change the way each variable is scaled to accomplish this, as in the sketch below. Alternatively, you can include only those variables you think are important. When you have this type of uncertainty, there are more principled ways of selecting variables that will be discussed later in the course.
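A minimal sketch of this preprocessing, again using scikit-learn with simulated data: features are standardized before distances are computed, and rescaling a standardized column changes how much proximity in that variable matters.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features measured on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X[:, 0] + rng.normal(size=200)

# Standardize every feature (mean 0, variance 1) before computing distances,
# so no variable dominates simply because of its units
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10)).fit(X, y)

# To up-weight proximity in the first variable, rescale that column after
# standardization: a larger scale makes distances in that variable count more
X_std = StandardScaler().fit_transform(X)
X_std[:, 0] *= 2.0
knn_weighted = KNeighborsRegressor(n_neighbors=10).fit(X_std, y)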

2.4 Extensions
You might think that neighbors that are really close should be weighted more than neighbors that
are a bit further away. Many people agree, so there are methods to allow you to weight different
observations differently. You might also think that you shouldn’t use just the k nearest neighbors,
but all the neighbors within a certain distance. Or maybe you think there’s information available
in all observations, but there’s more information in closer neighbors. All of these adjustments fall
under the umbrella of kernel regression. In fact, kNN is a special case of kernel regression.
Broadly defined, kernel regression is a class of methods that generate predictions by taking weighted averages of observed outcomes, with weights that depend on how close each observation is to the point being predicted. Because these methods (kNN included) do not specify a functional form, they are called “non-parametric regression” methods.
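A minimal sketch of this idea, a Nadaraya-Watson style kernel regression with a Gaussian kernel on simulated one-dimensional data:

import numpy as np

# Hypothetical one-dimensional training data
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=100)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=100)

def kernel_predict(x0, bandwidth=0.5):
    # Gaussian kernel weight for every training point: closer points get more weight
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    # The prediction is a weighted average of all observed outcomes
    return np.sum(w * y_train) / np.sum(w)

print(kernel_predict(3.0))

Replacing the Gaussian kernel with one that puts equal weight on the k closest training points and zero weight on everything else recovers kNN regression.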
