Chapter 1
Similarly, the form of the output or response variable can in principle be anything, but most methods assume that yi is a categorical or nominal variable from some finite set, yi ∈ {1,...,C} (such as male or female), or that yi is a real-valued scalar (such as income level).
When yi is categorical, the problem is known as classification or pattern recognition, and
when yi is real-valued, the problem is known as regression. Another variant, known as
ordinal regression, occurs where the label space Y has some natural ordering, such as grades A–F.
The second main type of machine learning is the descriptive or unsupervised learning
approach. Here we are only given inputs, D = {xi}Ni=1, and the goal is to find “interesting
patterns” in the data. This is sometimes called knowledge discovery. This is a much less
well-defined problem, since we are not told what kinds of patterns to look for, and there is
no obvious error metric to use (unlike supervised learning, where we can compare our
prediction of y for a given x to the observed value).
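Clustering is a canonical example of such pattern discovery: the algorithm must invent the groups itself. As an illustrative sketch (the data, the choice of K = 2, and all names here are made up for the example), a minimal k-means loop in NumPy:

```python
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    """Minimal k-means sketch: alternate between assigning each point
    to its nearest centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    # initialize centroids as K distinct randomly chosen data points
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iters):
        # distance from every point to every centroid (N x K)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# two well-separated synthetic blobs, no labels given to the algorithm
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels, centroids = kmeans(X, K=2)
```

Note that the returned cluster indices are arbitrary: without labels there is no "right" numbering, which is exactly the sense in which the problem is less well defined.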
There is a third type of machine learning, known as reinforcement learning, which is
somewhat less commonly used. This is useful for learning how to act or behave when given
occasional reward or punishment signals. (For example, consider how a baby learns to
walk.)
A further variant, semi-supervised learning, lies between supervised and unsupervised learning: it uses a combination of labelled and unlabelled data during training, occupying the middle ground between supervised learning (with fully labelled training data) and unsupervised learning (with no labelled training data).
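One common semi-supervised strategy is self-training: fit a model on the labelled points, pseudo-label the unlabelled points, and refit. A toy sketch using a nearest-centroid classifier (the data and the convention that -1 marks an unlabelled point are invented for illustration):

```python
import numpy as np

def self_train(X, y, n_rounds=5):
    """Self-training sketch: y uses -1 for unlabelled points.
    Each round, fit class centroids on the currently labelled points,
    then pseudo-label unlabelled points with the nearest centroid's class."""
    y = y.copy()
    for _ in range(n_rounds):
        classes = np.unique(y[y >= 0])
        centroids = np.array([X[y == c].mean(axis=0) for c in classes])
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        y[y < 0] = classes[d.argmin(axis=1)][y < 0]
    return y

# two labelled points per class, the last two points unlabelled (synthetic)
X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0],
              [0.1, 0.1], [2.9, 3.1]])
y = np.array([0, 0, 1, 1, -1, -1])
y_full = self_train(X, y)
```

Here the two unlabelled points inherit the labels of the clusters they fall in, so the labelled data defines the classes while the unlabelled data helps place them.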
Figure 1.1: Left: Some labeled training examples of colored shapes, along with 3 unlabeled test cases. Right: Representing the training data as an N × D design matrix. Row i represents the feature vector xi. The last column is the label, yi ∈ {0, 1}.
Classification
In this section, we discuss classification. Here the goal is to learn a mapping from inputs x to outputs y, where y ∈ {1,...,C}, with C being the number of classes. If C = 2, this is called binary classification (in which case we often assume y ∈ {0, 1}); if C > 2, this is called
multiclass classification. If the class labels are not mutually exclusive (e.g., somebody may
be classified as tall and strong), we call it multi-label classification, but this is best viewed as
predicting multiple related binary class labels (a so-called multiple output model). When we
use the term “classification”, we will mean multiclass classification with a single output,
unless we state otherwise.
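The distinction between these variants matters for how labels are encoded. A multiclass label is a single integer, while a multi-label target is a binary vector with one bit per non-exclusive attribute. A small sketch (the attribute names "tall" and "strong" echo the example above; the data itself is invented):

```python
import numpy as np

# multiclass: exactly one of C = 3 mutually exclusive classes per example
y_multiclass = np.array([0, 2, 1, 0])

# multi-label: each example may carry several of the labels
# ["tall", "strong"] at once, stored as one binary column per label
y_multilabel = np.array([[1, 0],   # tall, not strong
                         [1, 1],   # tall and strong
                         [0, 0]])  # neither

# a multi-label problem is just several binary problems sharing the inputs
n_binary_problems = y_multilabel.shape[1]
```

Treating each column as its own binary classifier is exactly the "multiple output model" view mentioned above.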
One way to formalize the problem is as function approximation. We assume y = f(x) for
some unknown function f, and the goal of learning is to estimate the function f given a
labeled training set, and then to make predictions using ŷ = f̂(x). (We use the hat symbol to
denote an estimate.) Our main goal is to make predictions on novel inputs, meaning ones
that we have not seen before (this is called generalization), since predicting the response on
the training set is easy (we can just look up the answer).
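The point about generalization can be made concrete: a 1-nearest-neighbour classifier predicts the training labels perfectly by construction (each training point is its own nearest neighbour, i.e., it "looks up the answer"), so only held-out accuracy is informative. A sketch on synthetic data:

```python
import numpy as np

def predict_1nn(X_train, y_train, X_query):
    """1-nearest-neighbour: copy the label of the closest training point."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[d.argmin(axis=1)]

rng = np.random.default_rng(0)
# two well-separated Gaussian classes (synthetic data for illustration)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# hold out every 5th point as a test set of novel inputs
test = np.arange(100) % 5 == 0
X_tr, y_tr, X_te, y_te = X[~test], y[~test], X[test], y[test]

train_acc = (predict_1nn(X_tr, y_tr, X_tr) == y_tr).mean()  # trivially 1.0
test_acc = (predict_1nn(X_tr, y_tr, X_te) == y_te).mean()
```

Training accuracy is always 1.0 here regardless of how good the model is; test accuracy is the quantity that measures generalization.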
Example: As a simple toy example of classification, consider the problem illustrated in
Figure 1.1(a). We have two classes of object which correspond to labels 0 and 1. The inputs
are colored shapes. These have been described by a set of D features or attributes, which
are stored in an N × D design matrix X, shown in Figure 1.1(b). The input features x can be
discrete, continuous or a combination of the two. In addition to the inputs, we have a vector
of training labels y. In Figure 1.1, the test cases are a blue crescent, a yellow circle and a blue
arrow. None of these have been seen before. Thus we are required to generalize beyond
the training set. A reasonable guess is that the blue crescent should be y = 1, since all blue
shapes are labeled 1 in the training set. The yellow circle is harder to classify, since some
yellow things are labeled y = 1 and some are labeled y = 0, and some circles are labeled y = 1
and some y = 0. Consequently it is not clear what the right label should be in the case of the
yellow circle. Similarly, the correct label for the blue arrow is unclear.
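The reasoning above can be mirrored in code. The specific feature encoding below is invented for illustration (colour and shape mapped to small integers), but it shows how discrete attributes become rows of an N × D design matrix, and how a simple colour-majority rule reproduces the guess for the blue test case:

```python
import numpy as np

# invented encoding: column 0 is colour (0=blue, 1=yellow),
# column 1 is shape (0=circle, 1=square, 2=crescent)
X = np.array([[0, 0],   # blue circle
              [0, 1],   # blue square
              [1, 0],   # yellow circle
              [1, 1]])  # yellow square
y = np.array([1, 1, 0, 1])  # all blue shapes labelled 1; yellows mixed

# a test case never seen in training: a blue crescent
x_test = np.array([0, 2])

# predict the majority label among training points with the same colour
same_colour = X[:, 0] == x_test[0]
y_hat = int(np.round(y[same_colour].mean()))
```

Because every blue training shape is labelled 1, the colour-majority rule predicts y = 1 for the blue crescent, matching the informal argument in the text; for the yellow circle the same rule would be far less confident, since the yellow labels are mixed.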
Regression
Regression is just like classification, except the response variable is continuous. As a simple example, consider fitting two models to some 1d data: a straight line and a quadratic function.
Figure: (a) Linear regression on some 1d data. (b) Same data with polynomial regression (degree 2).
Here are some examples of real-world regression problems:
- Predict tomorrow’s stock market price given current market conditions and other possible side information.
- Predict the age of a viewer watching a given video on YouTube.
- Predict the location in 3d space of a robot arm end effector, given control signals (torques) sent to its various motors.
- Predict the temperature at any location inside a building using weather data, time, door sensors, etc.
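Fitting the two models mentioned above, a straight line and a quadratic, can be sketched with np.polyfit on synthetic 1d data (the data-generating function here is made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.5 * x**2 + 0.1 * rng.normal(size=x.shape)  # quadratic signal + noise

# least-squares fits: degree-1 (straight line) and degree-2 (quadratic)
line = np.polyfit(x, y, deg=1)
quad = np.polyfit(x, y, deg=2)

# residual sum of squares for each model on the training data
rss_line = np.sum((np.polyval(line, x) - y) ** 2)
rss_quad = np.sum((np.polyval(quad, x) - y) ** 2)
```

On data with genuine curvature the quadratic leaves a much smaller residual than the line, which is the contrast the two-panel figure is illustrating.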