
Deep Learning Summer School 2015: Introduction To Machine Learning

The document provides an introduction to machine learning. It discusses how machine learning grew out of artificial intelligence and neuroscience research. Machine learning uses algorithms to leverage large amounts of data to produce accurate predictive models. The key ingredient is data, which comes in many forms and must be organized into a structured list of labeled examples for training machine learning algorithms. The dimensions of the problem (the number of examples, the input dimensionality, and the target dimensionality) determine which learning algorithms can be practically applied. Preprocessing is often needed to extract relevant features and turn raw data into normalized input vectors.


Deep Learning Summer School 2015

Introduction to Machine Learning
by Pascal Vincent

Montreal Institute for Learning Algorithms
Département d'informatique et de recherche opérationnelle
August 4, 2015


What is machine learning?
Historical perspective

• Born from the ambitious goal of Artificial Intelligence

• Founding project: the Perceptron (Frank Rosenblatt, 1957),
  the first artificial neuron learning from examples

• Two historically opposed approaches to AI:

  Neuroscience-inspired:
  ➪ neural nets learning from examples for artificial perception

  Classical symbolic AI:
  ➪ primacy of logical reasoning capabilities
  ➪ no learning (humans coding rules)
  ➪ poor handling of uncertainty (eventually fixed: Bayes Nets...)

  Learning and probabilistic models largely won ➪ machine learning
Artificial Intelligence in the 60s

(Diagram: Artificial Intelligence, largely symbolic AI, shown as a subfield of Computer science; artificial neural networks form the bridge between it and Neurosciences.)


Current view of ML founding disciplines

(Diagram: Machine Learning shown at the crossroads of Computer science / Artificial Intelligence, Optimization + control, Information theory, Statistics, Physics, and Neurosciences; it connects to Neurosciences through artificial neural networks and computational neuroscience, and to Physics through statistical physics.)


What is machine learning?
A (hypnotized) user's perspective

A scientific (witchcraft) field that

• researches fundamental principles (potions)

• and develops magical algorithms (spells to invoke)

• capable of leveraging collected data to (automagically) produce accurate predictive functions applicable to similar data (in the future!)

(may also yield informative descriptive functions of data)
The key ingredient of machine learning is... Data!

• Collected from nature... or industrial processes.
• Comes stored in many forms (and formats...): structured, unstructured, occasionally clean, usually messy, ...
• In ML we like to view data as a list of examples (or we'll turn it into one)
  ➡ ideally many examples of the same nature,
  ➡ preferably with each example a vector of numbers (or we'll first turn it into one!)
Training data set (training set) Dn

(Diagram: raw examples, e.g. a "horse" picture with target +1 and a "cat" picture with target -1, are turned through preprocessing / feature extraction into a nice data matrix.)

• inputs X (what we observe): one input feature vector per example,
  X1 = (3.5, -2, ..., 127, 0, ...), X2 = (-9.2, 32, ..., 24, 1, ...), ..., Xn = (6.8, 54, ..., 17, -3, ...)
• targets Y (what we must predict): one label per example,
  Y1 = +1 ("horse"), Y2 = -1 ("cat"), ..., Yn = +1 ("horse")
• n: number of examples; d: input dimensionality

New test point: x = (5.7, -27, ..., 64, 0, ...), x ∈ R^d ➪ predict its target (here +1)
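As an illustration (not from the slides), a minimal Python/NumPy sketch, with made-up feature values, of a dataset organized this way: an n × d input matrix X and a length-n target vector Y:

    import numpy as np

    # Hypothetical toy dataset: n = 3 examples, d = 4 input features each.
    # Each row of X is one example's input feature vector; Y holds the
    # labels (+1 for "horse", -1 for "cat"), following the slide's convention.
    X = np.array([
        [3.5,  -2.0, 127.0,  0.0],   # "horse"
        [-9.2, 32.0,  24.0,  1.0],   # "cat"
        [6.8,  54.0,  17.0, -3.0],   # "horse"
    ])
    Y = np.array([+1, -1, +1])

    n, d = X.shape                   # number of examples, input dimensionality

    # A new test point is just another vector in R^d:
    x_test = np.array([5.7, -27.0, 64.0, 0.0])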
Importance of the problem dimensions

➩ Determines which learning algorithms will be practically applicable (based on their algorithmic complexity and memory requirements).

• Number of examples: n (sometimes several millions)

• Input dimensionality: d, the number of input features characterizing each example (often 100 to 1000, sometimes 10000 or much more)

• Target dimensionality, e.g. the number of classes m (often small, sometimes huge)

Data suitable for ML will often be organized as a matrix: n × (d+1) or n × (d+m)
Turning messy data into a nice list of examples

(Diagram: messy raw data, "horse", "cat", etc., is mapped through data-plumbing into inputs and targets.)

Key questions to decide what «examples» should be:

• input: what is all the (potentially relevant) information I will have at my disposal about a case when I will have to make a prediction about it (at test time)?
• target: what do I want to predict? Can I get my hands on many such examples that are actually labeled with prediction targets?
Turning an example into an input vector x ∈ R^d

Raw input representation, e.g. raw pixel values:
  x = (0, 0, ..., 54, 120, ..., 0, 0) or x = (125, 125, ..., 250, ...)

OR some preprocessed representation, e.g. a bag of words: one position per vocabulary word («the», «cat», «dog», «horse», «elephant», «jumped», «running», «we», ...), with a 1 where the word occurs.
  Bag of words for «The cat jumped»: x = (..., 0, ..., 0, 1, ..., 0, ..., 1, 0, 0, ..., 0, 0, 1, 0, ..., 0, ...)

OR a vector of hand-engineered features: x = (feature 1, ..., feature d)
  ex: Histograms of Oriented Gradients
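A minimal bag-of-words sketch in Python (my own illustration, with a made-up toy vocabulary) showing how «The cat jumped» becomes a binary vector x:

    # Toy vocabulary (hypothetical; real vocabularies have many thousands of words).
    vocab = ["cat", "dog", "elephant", "horse", "jumped", "running", "the", "we"]
    index = {word: i for i, word in enumerate(vocab)}

    def bag_of_words(sentence):
        """Return a binary vector with a 1 at each vocabulary word present."""
        x = [0] * len(vocab)
        for word in sentence.lower().split():
            if word in index:
                x[index[word]] = 1
        return x

    print(bag_of_words("The cat jumped"))   # [1, 0, 0, 0, 1, 0, 1, 0]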
Dataset imagined as a point cloud in a high-dimensional vector space

(Figure: inputs plotted as points along axes x1, x2, x3, ..., xd.)

Training set as a table (n examples; input x ∈ R^d, target (label) y in the last column):

  x1      x2      x3    x4    x5      t
  0.32    -0.27   +1    0     0.82    113
  -0.12   0.42    -1    1     0.22    34
  0.06    0.35    -1    1     -0.37   56
  0.91    -0.72   +1    0     -0.63   77
  ...     ...     ...   ...   ...     ...

• Each example (row) is a (d+1)-dimensional vector
• Each input is a point in a d-dimensional vector space
Ex: nearest-neighbor classifier

Algorithm, for a test point x:
• Find the nearest neighbor of x among the training set, according to some distance measure (e.g. Euclidean distance).
• Predict that x has the same class as this nearest neighbor.

(Figure: a 2D training set with two classes; the test point x? gets the class of its nearest neighbor: BLUE!)
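A minimal Python/NumPy sketch of this algorithm (my own illustration), using Euclidean distance on a made-up 2D training set:

    import numpy as np

    def nearest_neighbor_predict(X_train, y_train, x):
        """Predict for x the class of its nearest training point."""
        distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
        return y_train[np.argmin(distances)]

    # Tiny made-up training set with two classes.
    X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array(["BLUE", "BLUE", "RED", "RED"])

    print(nearest_neighbor_predict(X_train, y_train, np.array([0.1, 0.2])))  # BLUE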


Machine learning tasks (problem types)

Supervised learning (and semi-supervised learning): predict a target y from input x ➪ predictive models
• y represents a category or «class» ➠ classification
  binary: y ∈ {−1, +1} or y ∈ {0, 1}
  multiclass: y ∈ {1, ..., m} or y ∈ {0, ..., m − 1}
• y is a real-valued number ➠ regression: y ∈ R or y ∈ R^m

Unsupervised learning: no explicit prediction target y ➪ descriptive modeling
• model the probability distribution of x ➠ density estimation
• discover underlying structure in data
  ➠ clustering
  ➠ dimensionality reduction
  ➠ (unsupervised) representation learning

Reinforcement learning: taking good sequential decisions to maximize a reward in an environment influenced by your decisions.
Learning phases

• Training: we learn a predictive function fθ by optimizing it so that it predicts well on the training set.

• Use for prediction: we can then use fθ on new (test) inputs that were not part of the training set.

➩ The GOAL of learning is NOT to learn perfectly (memorize) the training set.
➩ What's important is the ability of the predictor to generalize well on new (future) cases.


Ex: 1D regression

(Figure: scattered (input, target) training points and a learned curve fθ through them. Original slide by Olivier Delalleau.)

1. Collect training data
2. Learn a function (predictor) input → target
3. Use the learned function on new inputs


Supervised task: learn a function fθ that will predict y from x, minimizing the prediction errors as measured by a cost (loss) function L.

• loss function: L(fθ(x), y) compares the output fθ(x) to the target (label) y
• θ: the parameters of fθ

Training set Dn (n examples; input x ∈ R^d, target (label) y in the last column):

  x1      x2      x3    x4    x5      t
  0.32    -0.27   +1    0     0.82    113
  -0.12   0.42    -1    1     0.22    34
  0.06    0.35    -1    1     -0.37   56
  0.91    -0.72   +1    0     -0.63   77
  ...     ...     ...   ...   ...     ...

e.g. one training example: input x = (-0.12, 0.42, -1, 1, 0.22), target y = 34
A machine learning algorithm usually corresponds to a combination of the following 3 elements (either explicitly specified or implicit):

✓ the choice of a specific function family F (often a parameterized family)

✓ a way to evaluate the quality of a function f ∈ F (typically using a cost (or loss) function L measuring how wrongly f predicts)

✓ a way to search for the «best» function f ∈ F (typically an optimization of the function parameters to minimize the overall loss over the training set).


Evaluating the quality of a function f∈F
and
Searching for the «best» function f∈F



Evaluating a predictor f(x)

The performance of a predictor is often evaluated using several different evaluation metrics:

• Evaluations of true quantities of interest ($ saved, #lives saved, ...) when using the predictor inside a more complicated system.

• «Standard» evaluation metrics in a specific field (e.g. BLEU (Bilingual Evaluation Understudy) scores in translation)

• Misclassification error rate for a classifier (or precision and recall, or F-score, ...).

• The loss actually being optimized by the ML algorithm (often different from all the above...)


Standard loss functions

• For a density estimation task: f : R^d → R^+, a proper probability mass or density function
  negative log likelihood loss: L(f(x)) = −log f(x)

• For a regression task: f : R^d → R
  squared error loss: L(f(x), y) = (f(x) − y)²

• For a classification task: f : R^d → {0, ..., m − 1}
  misclassification error loss: L(f(x), y) = I{f(x) ≠ y}
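These three losses transcribe directly into Python (a sketch, not from the slides):

    import numpy as np

    def nll_loss(fx):
        """Density estimation: negative log likelihood; fx = f(x) > 0."""
        return -np.log(fx)

    def squared_error_loss(fx, y):
        """Regression: squared error."""
        return (fx - y) ** 2

    def misclassification_loss(fx, y):
        """Classification: 0/1 loss, 1 if f(x) != y and 0 otherwise."""
        return float(fx != y)

    print(nll_loss(0.25))                  # ~1.386
    print(squared_error_loss(2.5, 3.0))    # 0.25
    print(misclassification_loss(1, 2))    # 1.0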


Surrogate loss functions

• For a classification task: f : R^d → {0, ..., m − 1}
  misclassification error loss: L(f(x), y) = I{f(x) ≠ y}

Problem: it is hard to optimize the misclassification loss directly (its gradient is 0 everywhere; it is NP-hard with a linear classifier). Must use a surrogate loss:

Probabilistic classifier
• Binary: outputs the probability of class 1, g(x) ≈ P(y=1 | x); the probability of class 0 is 1 − g(x).
  Binary cross-entropy loss: L(g(x), y) = −(y log g(x) + (1 − y) log(1 − g(x)))
  Decision function: f(x) = I{g(x) > 0.5}
• Multiclass: outputs a vector of probabilities g(x) ≈ (P(y=0|x), ..., P(y=m−1|x)).
  Negated conditional log likelihood loss: L(g(x), y) = −log g(x)_y
  Decision function: f(x) = argmax(g(x))

Non-probabilistic classifier
• Binary: outputs a «score» g(x) for class 1; the score for the other class is −g(x).
  Hinge loss: L(g(x), t) = max(0, 1 − t·g(x)) where t = 2y − 1
  Decision function: f(x) = I{g(x) > 0}
• Multiclass: outputs a vector g(x) of real-valued scores for the m classes.
  Multiclass margin loss: L(g(x), y) = max(0, 1 + max_{k≠y} g(x)_k − g(x)_y)
  Decision function: f(x) = argmax(g(x))
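The two binary surrogate losses from this table, sketched in Python (illustrative only):

    import numpy as np

    def binary_cross_entropy(gx, y):
        """Probabilistic surrogate: gx ~ P(y=1|x), label y in {0, 1}."""
        return -(y * np.log(gx) + (1 - y) * np.log(1 - gx))

    def hinge_loss(gx, y):
        """Non-probabilistic surrogate: gx is a real-valued score, y in {0, 1}."""
        t = 2 * y - 1                       # map the label to t in {-1, +1}
        return max(0.0, 1.0 - t * gx)

    print(binary_cross_entropy(0.9, 1))    # ~0.105: confident and correct, small loss
    print(hinge_loss(-0.5, 1))             # 1.5: wrong side of the margin, large loss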
Expected risk vs. empirical risk

Examples (x, y) are supposed drawn i.i.d. from an unknown true distribution p(x, y) (from nature or an industrial process).

• Generalization error = expected risk (or just «risk»):
  «how poorly we will do on average on the infinity of future examples from that unknown distribution»
  R(f) = E_{p(x,y)}[L(f(x), y)]

• Empirical risk = average loss on a finite dataset:
  «how poorly we're doing on average on this finite dataset»
  R̂(f, D) = (1 / |D|) Σ_{(x,y)∈D} L(f(x), y)
  where |D| is the number of examples in D
Empirical risk minimization

Examples (x, y) are supposed drawn i.i.d. from an unknown true distribution p(x, y) (nature or industrial process).

• We'd love to find a predictor that minimizes the generalization error (the expected risk)

• But we can't even compute it! (it is an expectation over an unknown distribution)

• Instead: the empirical risk minimization principle
  «Find the predictor that minimizes the average loss over a training set»
  f̂(Dtrain) = argmin_{f∈F} R̂(f, Dtrain)

This is the training phase in ML.
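A minimal sketch of this principle (mine, not from the slides): empirical risk minimization for a 1D linear predictor fθ(x) = w·x + b under squared error loss, by plain gradient descent on made-up data:

    import numpy as np

    # Made-up training set drawn from a noisy linear relation y ~ 2x + 0.5.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=50)
    y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=50)

    w, b = 0.0, 0.0                    # parameters theta = {w, b}
    lr = 0.1                           # learning rate
    for _ in range(500):               # minimize the empirical risk on the trainset
        pred = w * x + b
        grad_w = np.mean(2 * (pred - y) * x)   # gradient of the mean squared error
        grad_b = np.mean(2 * (pred - y))
        w -= lr * grad_w
        b -= lr * grad_b

    print(w, b)                        # approaches 2.0 and 0.5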


Evaluating the generalization error

‣ We can't compute the expected risk R(f)
‣ But R̂(f, D) is a good estimate of R(f) provided:
  • D was not used to find/choose f (otherwise the estimate is biased ➩ it can't be the training set!)
  • D is large enough (otherwise the estimate is too noisy) and drawn from p

➡ Must keep a separate test set Dtest ≠ Dtrain to properly estimate the generalization error of f̂(Dtrain):
  R(f̂(Dtrain)) ≈ R̂(f̂(Dtrain), Dtest)
  generalization error ≈ average error on the test set (never used for training)

This is the test phase in ML.
Simple train/test procedure

• Provided a large enough dataset D = {(x1, y1), (x2, y2), ..., (xN, yN)} drawn from p(x, y):
• Make sure the examples are in random order.
• Split the dataset in two: Dtrain and Dtest.
• Use Dtrain to choose/optimize/find the best predictor f = f̂(Dtrain).
• Use Dtest to evaluate the generalization performance of predictor f.
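The same procedure as a short Python/NumPy sketch (illustrative; works on any (X, y) arrays):

    import numpy as np

    def train_test_split(X, y, test_fraction=0.2, seed=0):
        """Shuffle the examples, then split them into Dtrain and Dtest."""
        rng = np.random.default_rng(seed)
        perm = rng.permutation(len(X))             # random order
        n_test = int(len(X) * test_fraction)
        test, train = perm[:n_test], perm[n_test:]
        return X[train], y[train], X[test], y[test]

    X = np.arange(20, dtype=float).reshape(10, 2)  # made-up dataset: n=10, d=2
    y = np.arange(10)
    X_train, y_train, X_test, y_test = train_test_split(X, y)
    print(len(X_train), len(X_test))               # 8 2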
Model selection
Choosing a specific
function family F



Ex. of parameterized function families

Fpolynomial p: polynomial predictor (of degree p), in 1 dimension

Flinear: linear (affine) predictor («linear regression»), in 1 dimension or in d dimensions

Q: what is the simplest predictor fθ(x)?
Fconst: constant predictor fθ(x) = b, where θ = {b}
(always predict the same value or class!)
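A sketch (mine) of these three nested families in 1 dimension, each a function of the input x with parameters θ:

    def f_const(x, b):
        """Constant predictor: theta = {b}; always predicts b."""
        return b

    def f_linear(x, w, b):
        """Linear (affine) predictor in 1 dimension: theta = {w, b}."""
        return w * x + b

    def f_poly(x, theta):
        """Polynomial predictor of degree p = len(theta) - 1:
        theta[0] + theta[1]*x + ... + theta[p]*x**p."""
        return sum(t * x ** k for k, t in enumerate(theta))

    x = 2.0
    print(f_const(x, 0.5))               # 0.5
    print(f_linear(x, 1.5, 0.5))         # 3.5
    print(f_poly(x, [0.5, 1.5, -1.0]))   # 0.5 + 3.0 - 4.0 = -0.5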


Capacity of a learning algorithm

• Choosing a specific machine learning algorithm means choosing a specific function family F.

• How «big, rich, flexible, expressive, complex» that family is defines what is informally called the «capacity» of the ML algorithm.
  Ex: capacity(Fpolynomial 3) > capacity(Flinear)

• One can come up with several formal measures of «capacity» for a function family / learning algorithm (e.g. the Vapnik–Chervonenkis (VC) dimension).

• One rule-of-thumb estimate is the number of adaptable parameters: i.e. how many scalar values are contained in θ.
  Notable exception: chaining many linear mappings is still a linear mapping!
Effective capacity, and capacity-control hyper-parameters

The «effective» capacity of an ML algo is controlled by:

• The choice of ML algo, which determines a big family F

• Hyper-parameters that further specify F
  e.g. the degree p of a polynomial predictor; the kernel choice in SVMs; the # of layers and neurons in a neural network

• Hyper-parameters of «regularization» schemes
  e.g. a constraint on the norm of the weights w (➩ ridge regression; L2 weight decay in neural nets); a Bayesian prior on parameters; noise injection (dropout); ...

• Hyper-parameters that control early stopping of the iterative search/optimization procedure
  (➩ it won't explore as far from the initial starting point)
Popular classifiers: their parameters and hyper-parameters

  Algo                           | Capacity-control hyper-parameters                              | Learned parameters
  logistic regression (L2 reg.)  | strength of the L2 regularizer                                 | w, b
  linear SVM                     | C                                                              | w, b
  kernel SVM                     | C; kernel choice & params (σ for RBF; degree for polynomial)   | support vector weights: α
  neural network                 | layer sizes; early stop; ...                                   | layer weight matrices
  decision tree                  | depth                                                          | the tree (with index and threshold of variables)
  k-nearest neighbors            | k; choice of metric                                            | memorizes the trainset


Model selection: tuning the capacity

• Capacity must be optimally tuned to ensure good generalization,
• by choosing the algorithm and its hyper-parameters,
• to avoid under-fitting and over-fitting.

Ex: 1D regression with a polynomial predictor
(Figure: three fits of the same data)
  capacity too low ➩ under-fitting
  capacity too high ➩ over-fitting
  optimal capacity ➩ good generalization

Performance on the training set is not a good estimate of generalization.
Ex: 2D classification

• Function family too poor (too inflexible): a linear classifier
• = capacity too low for this problem (relative to the number of examples)
• => under-fitting

(Figure: salmon vs. sea bass examples in the width × lightness plane, separated by a straight line that misclassifies several training points.)


• Function family too rich (too flexible)
• = capacity too high for this problem (relative to the number of examples)
• => over-fitting

(Figure: the same salmon vs. sea bass data, separated by a highly convoluted boundary. Number of training errors: 0, yet the prediction for a new point «?» is dubious.)
• Optimal capacity for this problem (relative to the amount of data)
• => best generalization (on future test points)

(Figure: the same salmon vs. sea bass data, separated by a moderately curved boundary.)


Decomposing the generalization error

(Diagram: the set of all possible functions in the universe, containing the considered function family F.)

• The best possible function (the Bayes decision, achieving the Bayes error):
  f* = argmin_f R(f)
• The best function within F:
  f_F = argmin_{f∈F} R(f)
• The output of our learning algorithm, the function learnt using the training set:
  f̂(Dn) = f̂_n, i.e. f̂(Dtrain)

R(f̂_n) − R(f*) = (R(f̂_n) − R(f_F)) + (R(f_F) − R(f*))
                 = estimation error (variance) + approximation error (bias)


What is responsible for the variance?

(Same diagram, now with three learnt functions f̂(Dtrain1), f̂(Dtrain2), f̂(Dtrain3): different training sets drawn from the same distribution yield different learnt functions scattered around f_F. This scatter across training sets is the estimation error (variance); the distance from f_F to f* is the approximation error (bias).)


Optimal capacity & the bias-variance dilemma

• Choosing a richer F: capacity ↑ ➪ bias ↓ but variance ↑.

• Choosing a smaller F: capacity ↓ ➪ variance ↓ but bias ↑.

• The optimal compromise... will depend on the number of examples n.

• Bigger n ➪ variance ↓, so we can afford to increase capacity (to lower the bias) ➪ we can use more expressive models.

• The best regularizer is more data!


Model selection: how to

Make sure the examples are in random order.
Split the data D = {(x1, y1), ..., (xN, yN)} in 3: Dtrain, Dvalid, Dtest.

Model selection meta-algorithm:
For each considered model (ML algo) A:
  For each considered hyper-parameter configuration λ:
  • train model A with hyper-parameters λ on Dtrain: f̂_Aλ = Aλ(Dtrain)
  • evaluate the resulting predictor on Dvalid (with your preferred evaluation metric): e_Aλ = R̂(f̂_Aλ, Dvalid)
Locate the A*, λ* that yielded the best e_Aλ.
Either return f* = f̂_A*λ*,
or retrain and return f* = A*λ*(Dtrain ∪ Dvalid).

Finally: compute an unbiased estimate of the generalization performance of f* using Dtest: R̂(f*, Dtest).
Dtest must never have been used during training or model selection to select, learn, or tune anything.
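A compact Python sketch of this meta-algorithm (illustrative; the train and evaluate callables and the three data splits are assumed to be supplied by the caller):

    def model_selection(algos, hyperparams, D_train, D_valid, train, evaluate):
        """Grid search over (algorithm, hyper-parameter) pairs, scored on D_valid.

        train(A, lam, D) fits algo A with hyper-parameter config lam on D;
        evaluate(f, D) returns the empirical risk of predictor f on D."""
        best = None
        for A in algos:
            for lam in hyperparams[A]:
                f = train(A, lam, D_train)      # fit on the training set
                e = evaluate(f, D_valid)        # score on the validation set
                if best is None or e < best[0]:
                    best = (e, A, lam, f)
        e_best, A_best, lam_best, f_best = best
        # Either return f_best as is, or retrain A_best with lam_best on
        # D_train ∪ D_valid before the final (one-time) test-set evaluation.
        return A_best, lam_best, f_best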
Ex. of model hyper-parameter selection

(Figure: training set error and validation set error plotted against the hyper-parameter value, from 1 to 15.)

The hyper-parameter value which yields the smallest error on the validation set is 5 (it was 1 for the training set).
Question

What if we selected the capacity-control hyper-parameters that yield the best performance on the training set?

What would we tend to select?
Is it a good idea? Why?


Model selection procedure

Methodology summary (figure by Nicolas Chapados).


Ensemble methods

• Principle: train and combine multiple predictors to good effect
• Bagging: average many high-variance predictors ➪ variance ↓
  (e.g. averaging deep trees ➪ random decision forests)
• Boosting: build a weighted combination of low-capacity classifiers ➪ bias ↓ and capacity ↑
  (e.g. boosting shallow trees, or linear classifiers)


Bagging for reducing variance on a regression problem
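A minimal bagging sketch in Python/NumPy (my own illustration): train several copies of a high-variance base learner on bootstrap resamples of the training set and average their predictions.

    import numpy as np

    def bagging_predict(X_train, y_train, x, fit, n_models=10, seed=0):
        """Average the predictions of n_models predictors, each trained on
        a bootstrap resample (drawn with replacement) of the training set.

        fit(X, y) is assumed to return a predictor function; any
        high-variance base learner (e.g. a deep tree) could be plugged in."""
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(n_models):
            idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap
            model = fit(X_train[idx], y_train[idx])
            preds.append(model(x))
        return np.mean(preds)          # averaging reduces the variance

    # Example base learner: a 1-nearest-neighbor regressor (high variance).
    def fit_1nn(X, y):
        return lambda x: y[np.argmin(np.linalg.norm(X - x, axis=1))]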


How to obtain a non-linear predictor with a linear predictor

Three ways to map x to a feature representation x̃ = φ(x):

• Use an explicit fixed mapping (ex: hand-crafted features)

• Use an implicit fixed mapping
  ➠ Kernel Methods (SVMs, Kernel Logistic Regression, ...)

• Learn a parameterized mapping (i.e. let the ML algo learn the new representation)
  ➠ multilayer feed-forward Neural Networks, such as Multilayer Perceptrons (MLP)
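A sketch (mine) of the first option: an explicit fixed mapping φ under which a predictor that is linear in x̃ = φ(x) is non-linear in the original input x:

    import numpy as np

    def phi(x):
        """Explicit fixed feature mapping: R^2 -> R^5 (a made-up choice)."""
        x1, x2 = x
        return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

    w = np.array([0.0, 0.0, 0.0, 1.0, 1.0])   # hypothetical learned weights
    f = lambda x: w @ phi(x)                  # linear in phi(x): here f(x) = x1^2 + x2^2
    print(f(np.array([1.0, 2.0])))            # 5.0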


Levels of
representation



Questions ?