Introduction
The main focus of machine learning (ML) is making decisions or predictions based on data.
There are a number of other fields with significant overlap in technique, but difference in focus: in economics and psychology, the goal is to discover underlying causal processes, and in statistics it is to find a model that fits a data set well. (This description is paraphrased from a post on 9/4/12 at andrewgelman.com.) In those fields, the end product is a model. In machine learning, we often fit models, but as a means to the end of making good predictions or decisions.
As ML methods have improved in their capability and scope, ML has become arguably the best way (measured in terms of speed, human engineering time, and robustness) to approach many applications. Great examples are face detection, speech recognition, and many kinds of language-processing tasks. Almost any application that involves understanding data or signals that come from the real world can be nicely addressed using machine learning.
One crucial aspect of machine learning approaches to solving problems is that human engineering plays an important, and often undervalued, role. A human still has to frame the problem: acquire and organize data, design a space of possible solutions, select a learning algorithm and its parameters, apply the algorithm to the data, validate the resulting solution to decide whether it's good enough to use, try to understand the impact on the people who will be affected by its deployment, etc. These steps are of great importance.
The conceptual basis of learning from data is the problem of induction: Why do we think that previously seen data will help us predict the future? This is a serious long-standing philosophical problem. We will operationalize it by making assumptions, such as that all training data are so-called i.i.d. (independent and identically distributed), and that queries will be drawn from the same distribution as the training data, or that the answer comes from a set of possible answers known in advance. (Here, i.i.d. means that the elements in the set are related in the sense that they all come from the same underlying probability distribution, but not in any other ways.)

In general, we need to solve these two problems:

• estimation: When we have data that are noisy reflections of some underlying quantity of interest, we have to aggregate the data and make estimates or predictions about the quantity. How do we deal with the fact that, for example, the same treatment may end up with different results on different trials? How can we predict how well an estimate may compare to future results?
• generalization: How can we predict results of a situation or experiment that we have
never encountered before in our data set?
We can describe problems and their solutions using six characteristics, three of which
characterize the problem and three of which characterize the solution:
1. Problem class: What is the nature of the training data and what kinds of queries will
be made at testing time?
2. Assumptions: What do we know about the source of the data or the form of the
solution?
3. Evaluation criteria: What is the goal of the prediction or estimation system? How will the answers to individual queries be evaluated? How will the overall performance of the system be measured?
4. Model type: Will an intermediate model of the world be made? What aspects of the
data will be modeled in different variables/parameters? How will the model be used
to make predictions?
5. Model class: What particular class of models will be used? What criterion will we
use to pick a particular model from the model class?
6. Algorithm: What computational process will be used to fit the model to the data
and/or to make predictions?
Without making some assumptions about the nature of the process generating the data, we cannot perform generalization. In the following sections, we elaborate on these ideas.

1.1 Problem class

There are many different problem classes in machine learning. They vary according to what kind of data is provided and what kind of conclusions are to be drawn from it. Five standard problem classes are described below, to establish some notation and terminology. (Don't feel you have to memorize all these kinds of learning, etc. We just want you to have a very high-level view of (part of) the breadth of the field.)
In this course, we will focus on classification and regression (two examples of supervised learning), and we will touch on reinforcement learning, sequence learning, and clustering.
1.1.1.1 Regression
For a regression problem, the training data Dn is in the form of a set of n pairs:

Dn = {(x(1), y(1)), (x(2), y(2)), . . . , (x(n), y(n))},

where x(i) represents an input, most typically a d-dimensional vector of real and/or discrete values, and y(i) is the output to be predicted, in this case a real number. The y values are sometimes called target values. (Many textbooks use xi and ti instead of x(i) and y(i). We find that notation somewhat difficult to manage when x(i) is itself a vector and we need to talk about its elements. The notation we are using is standard in some other parts of the ML literature.)

The goal in a regression problem is ultimately, given a new input value x(n+1), to predict the value of y(n+1). Regression problems are a kind of supervised learning, because the desired output y(i) is specified for each of the training examples x(i).
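To make the notation concrete, here is a minimal sketch of a tiny (made-up) training set Dn and one very simple prediction rule, which answers a new query with the target of the nearest training input:

```python
import numpy as np

# A tiny regression training set D_n of n = 4 pairs (x^(i), y^(i)), with d = 2 dimensional inputs.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 2.0]])   # inputs x^(1), ..., x^(4)
y = np.array([1.1, 2.9, 5.2, 6.8])                               # real-valued targets y^(1), ..., y^(4)

def predict(x_new):
    """Predict y for a new input by copying the target of the closest training input
    (a 1-nearest-neighbor predictor; many other prediction rules are possible)."""
    i = np.argmin(np.linalg.norm(X - x_new, axis=1))
    return y[i]

print(predict(np.array([2.9, 1.8])))   # 6.8, since x^(4) is the nearest training input
```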
1.1.1.2 Classification
A classification problem is like regression, except that the values that y(i) can take do not
have an order. The classification problem is binary or two-class if y(i) (also known as the
class) is drawn from a set of two possible values; otherwise, it is called multi-class.
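Here is a similarly minimal sketch of binary classification on made-up data; the labels come from the unordered set {0, 1}, and the prediction rule is a hand-picked threshold rather than anything learned from the data:

```python
# Toy binary classification data: pairs (x^(i), y^(i)) with 1-dimensional inputs and labels in {0, 1}.
data = [(0.5, 0), (1.2, 0), (2.7, 1), (3.1, 1)]

def predict(x, threshold=2.0):
    """Hand-picked threshold rule; a learning algorithm would instead choose the rule from the data."""
    return 1 if x > threshold else 0

training_errors = sum(predict(x) != y for x, y in data)
print(training_errors)   # 0 on this tiny dataset
```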
1.1.2.1 Clustering
Given samples x(1) , . . . , x(n) ∈ Rd , the goal is to find a partitioning (or “clustering”) of
the samples that groups together similar samples. There are many different objectives,
depending on the definition of the similarity between samples and exactly what criterion
is to be used (e.g., minimize the average distance between elements inside a cluster and
maximize the average distance between elements across clusters). Other methods perform
a “soft” clustering, in which samples may be assigned 0.9 membership in one cluster and
0.1 in another. Clustering is sometimes used as a step in the so-called density estimation
(described below), and sometimes to find useful structure or influential features in data.
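To make the "hard" clustering objective concrete, here is a minimal k-means-style sketch on made-up data: each sample is assigned to the nearest of k centers, and each center is then moved to the mean of its assigned samples.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means sketch: X has shape (n, d); returns cluster labels and centers.
    Empty clusters are not handled, which is fine for this toy example."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # initialize centers at random samples
    for _ in range(iters):
        # Assign each sample to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the samples assigned to it.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
labels, centers = kmeans(X, k=2)
print(labels)   # e.g., [0 0 1 1]; the cluster indices themselves are arbitrary
```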
In reinforcement learning, the learning problem is framed as an agent interacting with an environment over a sequence of time steps. (Note it's standard practice in reinforcement learning to use s and a instead of x and y to denote the machine learning model's input and output. The subscript t denotes the timestep, and captures the sequential nature of the problem.) At each step:

• The agent observes the current state st .
• It selects an action at .
• It receives a reward, rt , which typically depends on st and possibly at .
• The environment transitions probabilistically to a new state, st+1 , with a distribution that depends only on st and at .
• The agent observes the current state, st+1 .
• ...
The goal is to find a policy π, mapping s to a (that is, states to actions), such that some long-term sum or average of rewards r is maximized.
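The interaction loop might be sketched as follows, using a made-up two-state environment and a policy that simply picks actions at random:

```python
import random

def step(s, a):
    """Hypothetical toy environment: states {0, 1}, actions {0, 1}.
    Action 1 tends to move the agent toward state 1, where the reward is higher."""
    s_next = 1 if random.random() < (0.8 if a == 1 else 0.2) else 0
    r = 1.0 if s == 1 else 0.0          # reward r_t depends on the current state s_t
    return r, s_next

def policy(s):
    return random.choice([0, 1])        # a learned policy pi would do better than random choice

s, total_reward = 0, 0.0
for t in range(100):                    # run the agent-environment loop for 100 time steps
    a = policy(s)                       # the agent selects an action a_t
    r, s = step(s, a)                   # it receives a reward r_t and observes the next state s_{t+1}
    total_reward += r
print(total_reward)                     # a good policy is one that makes this long-term sum large
```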
This setting is very different from either supervised learning or unsupervised learning,
because the agent's action choices affect both its reward and its ability to observe the environment. It requires careful consideration of the long-term effects of actions, as well as all
of the other issues that pertain to supervised learning.
1.2 Assumptions
The kinds of assumptions that we can make about the data source or the solution include:
• The data are generated by a Markov chain (i.e., outputs depend only on the current state, with no additional memory; see the sketch after this list).
• The “true” model that is generating the data can be perfectly described by one of
some particular set of hypotheses.
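For instance, the Markov chain assumption can be made concrete with a tiny (made-up) generator in which the distribution of the next state depends only on the current one:

```python
import random

# Toy Markov chain over states {0, 1}: the next state depends only on the current state.
transition = {0: [0.9, 0.1],    # from state 0: stay with prob 0.9, move to 1 with prob 0.1
              1: [0.3, 0.7]}    # from state 1: move to 0 with prob 0.3, stay with prob 0.7

def generate(n, s=0):
    """Generate n states in sequence; no memory beyond the current state is used."""
    states = []
    for _ in range(n):
        states.append(s)
        s = random.choices([0, 1], weights=transition[s])[0]
    return states

print(generate(10))   # e.g., [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```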
The effect of an assumption is often to reduce the “size” or “expressiveness” of the space of
possible hypotheses and therefore reduce the amount of data required to reliably identify
an appropriate hypothesis.
1.3 Evaluation criteria

Once we have a problem class and some assumptions in hand, we need a way to say how good a particular answer is. A loss function L(g, a) measures how bad it is to make a guess g when the actual answer is a. Some typical loss functions are:

• 0-1 Loss applies to predictions drawn from finite domains. (If the actual values are drawn from a continuous distribution, the probability they would ever be equal to some predicted g is 0, except for some weird cases.)

L(g, a) = 0 if g = a, 1 otherwise
• Squared loss
L(g, a) = (g − a)2
• Absolute loss
L(g, a) = |g − a|
• Asymmetric loss Consider a situation in which you are trying to predict whether
someone is having a heart attack. It might be much worse to predict “no” when the
answer is really “yes”, than the other way around.
L(g, a) = 1 if g = 1 and a = 0; 10 if g = 0 and a = 1; 0 otherwise
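These loss functions translate directly into code; the following sketch just evaluates each definition on a guess g and an actual value a:

```python
def zero_one_loss(g, a):
    return 0 if g == a else 1

def squared_loss(g, a):
    return (g - a) ** 2

def absolute_loss(g, a):
    return abs(g - a)

def asymmetric_loss(g, a):
    """Heart-attack example: guessing "no" (g = 0) when the answer is "yes" (a = 1) costs 10."""
    if g == 1 and a == 0:
        return 1
    if g == 0 and a == 1:
        return 10
    return 0

print(squared_loss(2.5, 3.0), absolute_loss(2.5, 3.0))   # 0.25 0.5
print(asymmetric_loss(0, 1), asymmetric_loss(1, 0))      # 10 1
```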
Any given prediction rule will usually be evaluated based on multiple predictions and
the loss of each one. At this level, we might be interested in:
• Minimizing expected loss over all the predictions (also known as risk)
• Minimizing or bounding regret: how much worse this predictor performs than the
best one drawn from some class
• Characterizing asymptotic behavior: how well the predictor will perform in the limit
of infinite training data
• Finding algorithms that are probably approximately correct: they probably generate
a hypothesis that is right most of the time.
There is a theory of rational agency that argues that you should always select the action that minimizes the expected loss. This strategy will, for example, make you the most money in the long run, in a gambling setting. (Of course, there are other models for action selection, and it's clear that people do not always, or maybe even often, select actions that follow this rule.) As mentioned above, expected loss is also sometimes called risk in the ML literature, but that term means other things in economics or other parts of decision theory, so be careful...it's risky to use it. We will, most of the time, concentrate on this criterion.
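As a small illustration with the asymmetric heart-attack loss above: if we believe the probability that the patient is actually having a heart attack is 0.2, then guessing "no" has expected loss 0.2 × 10 = 2, while guessing "yes" has expected loss 0.8 × 1 = 0.8, so the expected-loss-minimizing action is to predict "yes" even though "no" is the more probable outcome.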
1.4 Model type
Recall that the goal of an ML system is typically to estimate or generalize, based on data provided. Below, we examine the role of model-making in machine learning.
1.6 Algorithm
Once we have described a class of models and a way of scoring a model given data, we
have an algorithmic problem: what sequence of computational instructions should we run
in order to find a good model from our class? For example, determining the parameter
vector which minimizes the training error might be done using a familiar least-squares
minimization algorithm, when the model h is a function being fit to some data x.
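As a minimal sketch, assuming h is a linear function of the input and the criterion is the sum of squared training errors, the whole computation can be a single call to a least-squares solver (the data here are made up for illustration):

```python
import numpy as np

# Hypothetical training data: rows of X are inputs x^(i), and y holds the targets y^(i).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.9, 3.1, 5.0, 7.2])

# Fit h(x) = theta * x + theta_0 by minimizing the sum of squared training errors.
X_aug = np.hstack([X, np.ones((len(X), 1))])           # append a constant feature for the offset
theta, residuals, rank, _ = np.linalg.lstsq(X_aug, y, rcond=None)

print(theta)            # fitted parameter vector [slope, offset], roughly [2.08, 0.93]
print(X_aug @ theta)    # the fitted model's predictions on the training inputs
```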
Sometimes we can use software that was designed, generically, to perform optimization. In many other cases, we use algorithms that are specialized for ML problems, or for particular hypothesis classes. Some algorithms are not easily seen as trying to optimize a particular criterion. In fact, a historically important method for finding linear classifiers, the perceptron algorithm, has this character.