Notes-1
Emily Mower
September 2018
Note that the material in these notes draws on the excellent and more thorough treatment
of these topics in An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor
Hastie, and Robert Tibshirani.
Sometimes the same methods can be used for regression and classification problems, but many
methods are useful for only one of the two problem types.
The bias of a statistical learning method is the error introduced by approximating a real-world
problem with that method. Very flexible models (which are prone to overfitting) can
capture complex patterns and so tend to have low bias. Very simple models (which are prone to
underfitting) are limited in their ability to pick up patterns and so may have high bias.
The book uses the example of representing a non-linear function by a linear one to show that
no matter how much data you have, a linear model will not predict well when the
process generating the data is non-linear. Bias also applies to methods that might not fit your
traditional concept of a statistical model. In the k-Nearest Neighbors section, we will discuss
bias in that setting.
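To make this concrete, here is a small simulation (our own illustration, assuming NumPy is available; the sine data-generating process is made up for the example). It fits a straight line to data from a non-linear process: no matter how much data we simulate, the linear fit's error stays well above the noise level, while a more flexible polynomial fit gets close to it.

    import numpy as np

    # Simulate data from a non-linear process: y = sin(2x) + noise
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 3, size=5000)
    y = np.sin(2 * x) + rng.normal(scale=0.1, size=x.size)

    # Degree-1 (linear) fit versus a more flexible degree-7 polynomial fit
    linear_fit = np.polyfit(x, y, deg=1)
    flexible_fit = np.polyfit(x, y, deg=7)

    mse_linear = np.mean((np.polyval(linear_fit, x) - y) ** 2)
    mse_flexible = np.mean((np.polyval(flexible_fit, x) - y) ** 2)

    # The linear fit's error stays far above the irreducible noise variance
    # (0.01) no matter how much data we add; the flexible fit gets close to it.
    print(f"linear MSE:     {mse_linear:.3f}")
    print(f"degree-7 MSE:   {mse_flexible:.3f}")
    print(f"noise variance: {0.1 ** 2:.3f}")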
Often, we will talk about the bias-variance trade-off. In an ideal world, we would find a
model that has low variance and low bias, because that would yield a good and consistent model.
In practice, you usually have to allow bias to increase in order to decrease variance and vice versa.
However, there are many models that will decrease one (bias or variance) significantly while only
increasing the other a little.
1.3 Supervised v. Unsupervised Learning
Supervised learning refers to problems where there is a known outcome. In these problems, you
can train a model to take features and predict the known outcome.
Unsupervised learning refers to problems where you are interested in uncovering patterns and
do not have a target outcome in mind.
An example of supervised learning would be using students’ high school grades, class
enrollments, and demographic variables to predict whether or not they attend college.
An example of unsupervised learning would be using the same grade, enrollment, and
demographic features to identify “types” of high school students, that is, students who look similar
according to these features. Perhaps you are interested in this because you want to make classes
that contain a mix of different types of students. Often, unsupervised learning is useful for creating
features for supervised learning problems, but sometimes uncovering patterns is the final objective.
2 k-Nearest Neighbors
2.1 Concept
The idea underlying k-Nearest Neighbors (kNN) is that we expect observations with similar features
to have similar outcomes. kNN makes no other assumptions about functional form, so it is quite
flexible.
2.2 Method
kNN can be used for either regression or classification, though it works slightly differently depending
on what setting we are in. In the classification setting, the prediction is a majority vote of the
observation’s k-nearest neighbors. In the regression setting, the prediction is the average outcome
of the observation’s k-nearest neighbors.
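To make the two prediction rules concrete, here is a minimal NumPy sketch (the function and variable names are our own, not from any particular library):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k, classification=True):
        """Predict the outcome for a single new observation x_new."""
        # Euclidean distance from x_new to every training observation
        distances = np.linalg.norm(X_train - x_new, axis=1)
        # Indices of the k closest training observations
        nearest = np.argsort(distances)[:k]
        if classification:
            # Classification: majority vote over the neighbors' labels
            return Counter(y_train[nearest]).most_common(1)[0][0]
        # Regression: average of the neighbors' outcomes
        return y_train[nearest].mean()

In practice you would rarely hand-roll this; libraries such as scikit-learn provide tuned implementations, but the logic is exactly the vote or average described above.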
For kNN, bias will be lower when the relationship between features and the outcome is smooth
across the feature space. When the relationship is rough, bias will increase quickly as farther-away
neighbors are included in the prediction.
The only choice we have to make when implementing kNN is the value of k (i.e., how many
neighbors should we use in our prediction?). A good way to choose k is through cross-validation,
something we will cover a little later, but which broadly involves training the algorithm on one set
of data and seeing how well it does on a different set.
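For instance, if scikit-learn is available (the toy data below just stands in for real features and outcomes), one common recipe is to compute a cross-validated accuracy for each candidate k and keep the best one:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Toy classification data standing in for real features X and outcomes y
    X, y = make_classification(n_samples=300, n_features=5, random_state=0)

    # 5-fold cross-validated accuracy for each candidate value of k
    candidate_ks = range(1, 26)
    scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in candidate_ks]

    best_k = candidate_ks[int(np.argmax(scores))]
    print("best k:", best_k, "with cross-validated accuracy", round(max(scores), 3))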
2.4 Extensions
You might think that neighbors that are really close should be weighted more than neighbors that
are a bit further away. Many people agree, so there are methods to allow you to weight different
observations differently. You might also think that you shouldn’t use just the k nearest neighbors,
but all the neighbors within a certain distance. Or maybe you think there’s information available
in all observations, but there’s more information in closer neighbors. All of these adjustments fall
under the umbrella of kernel regression. In fact, kNN is a special case of kernel regression.
Broadly defined, kernel regression is a class of methods that generate predictions by
taking weighted averages of observations. Because these methods (kNN included) do not specify a
functional form, they are called “non-parametric regression” methods.
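As a sketch of the weighted-average idea (the Gaussian kernel and the names below are our own choices for illustration): a Nadaraya-Watson-style kernel regression prediction averages all training outcomes, giving closer observations more weight; putting equal weight on the k nearest neighbors and zero weight on everything else recovers kNN.

    import numpy as np

    def kernel_regression_predict(X_train, y_train, x_new, bandwidth=1.0):
        """Predict with a weighted average of all training outcomes, where
        weights shrink smoothly as distance from x_new grows (Gaussian kernel)."""
        distances = np.linalg.norm(X_train - x_new, axis=1)
        weights = np.exp(-0.5 * (distances / bandwidth) ** 2)
        return np.sum(weights * y_train) / np.sum(weights)

    # kNN regression is the special case whose "kernel" puts weight 1/k on the
    # k nearest neighbors and weight 0 on all other observations.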