Chapter 5 - Machine Learning Basics

The document provides an overview of machine learning, defining it as a branch of artificial intelligence focused on algorithms that learn from data. It discusses various types of machine learning, including supervised, unsupervised, and reinforcement learning, along with their applications and techniques. Additionally, it highlights the evolution of machine learning and the significance of deep learning in modern AI systems.

Machine Learning Basics

• Artificial Intelligence is a scientific field concerned with the development of algorithms that allow computers to learn without being explicitly programmed
• Machine Learning is a branch of Artificial Intelligence, which focuses on methods that learn from data and make predictions on unseen data

[Diagram: Training — labeled data → machine learning algorithm → learned model; Prediction — new data → learned model → prediction]

2
Comparison
• Traditional Programming: data and a program are given to the computer, which produces the output
  Data + Program → Computer → Output

• Machine Learning: data and the desired outputs are given to the computer, which produces a program (a learned model)
  Data + Output → Computer → Program

3
What is Machine Learning?
• “the acquisition of knowledge or skills through
experience, study, or by being taught.”

4
What is Machine Learning?
• [Arthur Samuel, 1959]
– Field of study that gives computers
– the ability to learn without being explicitly programmed

• [Kevin Murphy] algorithms that
– automatically detect patterns in data
– use the uncovered patterns to predict future data or other outcomes of interest

• [Tom Mitchell] algorithms that


– improve their performance (P)
– at some task (T)
– with experience (E)

5
Why Study Machine Learning?
Engineering Better Computing Systems
• Develop systems
– too difficult/expensive to construct manually
– because they require specific detailed skills/knowledge
– knowledge engineering bottleneck

• Develop systems
– that adapt and customize themselves to individual users.
– Personalized news or mail filter
– Personalized tutoring

• Discover new knowledge from large databases
– Medical text mining (e.g., linking migraines to calcium channel blockers to magnesium)
– data mining

6
Why Study Machine Learning?
Cognitive Science
• Computational studies of learning may help us
understand learning in humans
– and other biological organisms.

7
Where does ML fit in?

8
Why are things working today?
• More compute power

• More data

• Better algorithms/models

[Figure: accuracy vs. amount of training data — accuracy improves as the amount of training data grows]

9
ML in a Nutshell
• Tens of thousands of machine learning algorithms
– Hundreds new every year

• Decades of ML research oversimplified:


– All of Machine Learning:
– Learn a mapping from input to output f: X → Y
– X: emails, Y: {spam, not spam}

10
ML in a Nutshell
• Input: x (images, text, emails…)

• Output: y (spam or non-spam…)

• (Unknown) Target Function
– f: X → Y (the “true” mapping / reality)

• Data
– (x1,y1), (x2,y2), …, (xN,yN)

• Model / Hypothesis Class
– g: X → Y
– y = g(x) = sign(wᵀx), e.g., a linear classifier (see the sketch below)
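As a minimal illustration of such a hypothesis class, the sketch below scores an input with a weight vector; the weight and input values are made-up placeholders, not values from the slides:

```python
import numpy as np

def g(x, w):
    """Linear hypothesis: predict +1 or -1 from the sign of w^T x."""
    return np.sign(w @ x)

# Hypothetical example: a 3-dimensional input and made-up weights
w = np.array([0.5, -1.2, 0.3])   # learned parameters (placeholder values)
x = np.array([1.0, 0.2, -0.4])   # one input example
print(g(x, w))                   # -> 1.0
```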
11
Machine Learning Types
Machine Learning Basics

• Supervised: learning with labeled data


 Example: email classification, image classification
 Example: regression for predicting real-valued outputs
• Unsupervised: discover patterns in unlabeled data
 Example: cluster similar data points
• Reinforcement learning: learn to act based on feedback/reward
 Example: learn to play Go

[Figure: examples of classification (class A vs. class B), regression, and clustering]

12
Tasks
Supervised Learning
  x → Classification → y: discrete
  x → Regression → y: continuous

Unsupervised Learning
  x → Clustering → y: discrete cluster ID
  x → Dimensionality Reduction → y: continuous

13
Supervised Learning

Classification
  x → Classification → y: discrete

14
Supervised Learning

Regression
  x → Regression → y: continuous

15
Stock market

16
Weather prediction

[Figure: temperature over time]

17
Unsupervised Learning

Clustering
  x → Clustering → y: discrete cluster ID

Unsupervised learning: y is not provided

18
Clustering Data: Group similar things

19
Unsupervised Learning

Dimensionality Reduction / Embedding
  x → Dimensionality Reduction → y: continuous

Unsupervised learning: y is not provided

20
Reinforcement Learning

Reinforcement Learning
  x → Reinforcement Learning → y: actions

Learning from feedback

21
Reinforcement Learning
• Reinforcement Learning is a part of Machine learning where an

agent is put in an environment and it learns to behave in this

environment by performing certain actions and observing the

rewards and punishments of those actions.

• Reinforcement learning can be thought of as a hit and trial method

of learning.

• The machine gets a Reward or Penalty point for each action it

performs. If the option is correct, the machine gains the reward point

or gets a penalty point in case of a wrong response.
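To make the agent/environment/reward loop concrete, here is a minimal tabular Q-learning sketch on a made-up 5-state chain environment (the environment, state count, and hyperparameters are illustrative assumptions, not from the slides):

```python
import random

N_STATES = 5           # states 0..4; reaching state 4 ends an episode (assumed toy task)
ACTIONS = [-1, +1]     # move left or right along the chain
alpha, gamma, eps = 0.1, 0.9, 0.2   # learning rate, discount, exploration (placeholders)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Toy environment: +1 reward for reaching the goal state, -0.01 per move."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else -0.01
    return nxt, reward, nxt == N_STATES - 1

for episode in range(200):
    s, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit current Q estimates, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward reward + discounted best future value
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        target = r if done else r + gamma * best_next
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})  # learned policy
```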

22
Reinforcement Learning: Learning to act
• There is only one “supervised” signal, at the end of the game
• But you need to make a move at every step
• RL deals with this “credit assignment” problem

23
Supervised Learning
Machine Learning Basics

• Supervised learning categories and techniques


 Numerical classifier functions
o Linear classifier, perceptron, logistic regression, support vector
machines (SVM), neural networks
 Parametric (probabilistic) functions
o Naïve Bayes, Gaussian discriminant analysis (GDA), hidden
Markov models (HMM), probabilistic graphical models
 Non-parametric (instance-based) functions
o k-nearest neighbors, kernel regression, kernel density
estimation, local regression
 Symbolic functions
o Decision trees, classification and regression trees (CART)
 Aggregation (ensemble) learning
o Bagging, boosting (Adaboost), random forest

24
Unsupervised Learning
Machine Learning Basics

• Unsupervised learning categories and techniques


 Clustering
o k-means clustering
o Mean-shift clustering
o Spectral clustering
 Density estimation
o Gaussian mixture model (GMM)
o Graphical models
 Dimensionality reduction
o Principal component analysis (PCA)
o Factor analysis

25
Nearest Neighbor Classifier
Machine Learning Basics

• Nearest Neighbor – for each test data point, assign the class label of the nearest training data point
 Adopt a distance function to find the nearest neighbor
o Calculate the distance to each data point in the training set, and assign the class of the nearest data point (minimum distance); a sketch follows below
 It does not require learning a set of weights

[Figure: a test example among training examples from class 1 and class 2]
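A minimal from-scratch sketch of this rule (Euclidean distance is an assumption; the slide only says “a distance function”, and the data here are made up):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_test):
    """Assign the label of the single closest training point (1-NN)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every training point
    return y_train[np.argmin(dists)]                  # label of the minimum-distance point

# Hypothetical 2D data: class 1 near the origin, class 2 near (5, 5)
X_train = np.array([[0.0, 0.2], [0.5, 0.1], [5.0, 4.8], [4.7, 5.2]])
y_train = np.array([1, 1, 2, 2])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.3, 0.0])))  # -> 1
```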

26
k-Nearest Neighbors Classifier
Machine Learning Basics

• The k-Nearest Neighbors approach considers multiple neighboring data points to classify a test data point
 E.g., 3-nearest neighbors
o The test example in the figure is the + mark
o The class of the test example is obtained by voting among the 3 closest points (see the usage sketch below)

[Figure: 3-NN voting — test points (+) among training points from two classes (x and o), with axes x1 and x2]
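For comparison, a hedged usage example with scikit-learn’s k-NN implementation (assuming scikit-learn is available; the data are the same made-up points as above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0, 0.2], [0.5, 0.1], [5.0, 4.8], [4.7, 5.2]])
y_train = np.array([1, 1, 2, 2])

# k=3: the predicted class is the majority vote among the 3 nearest training points
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(clf.predict([[0.3, 0.0]]))  # -> [1]
```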

27
Linear Classifier
Machine Learning Basics

• Linear classifier
 Find a linear function f of the inputs xi that separates the classes
 Use pairs of inputs and labels to find the weights matrix W and the bias vector b
o The weights and biases are the parameters of the function f
 Several methods have been used to find the optimal set of parameters of a linear classifier
o A common method of choice is the Perceptron algorithm, where the parameters are updated until a minimal error is reached (single layer, does not use backpropagation); a sketch follows below
 Linear classifier is a simple approach, but it is a building block of advanced classification algorithms, such as SVM and neural networks
o Earlier multi-layer neural networks were referred to as multi-layer perceptrons (MLPs)
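A minimal sketch of the perceptron update rule described above (the learning rate, epoch count, and data are illustrative assumptions):

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Perceptron: nudge (w, b) after every misclassified example."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):          # yi is +1 or -1
            if yi * (w @ xi + b) <= 0:    # misclassified (or on the boundary)
                w += lr * yi * xi         # rotate the boundary toward the example
                b += lr * yi
    return w, b

# Made-up linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # -> [ 1.  1. -1. -1.]
```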
28
Linear Classifier
Machine Learning Basics

• The decision boundary is linear
 A straight line in 2D, a flat plane in 3D, a hyperplane in higher-dimensional spaces

29
Support Vector Machines
Machine Learning Basics

• Support vector machines (SVM)
 How to find the best decision boundary?
o All lines in the figure correctly separate the 2 classes
o The line that is farthest from all training examples will have better generalization capabilities
 SVM solves an optimization problem:
o First, identify a decision boundary that correctly classifies the examples
o Next, increase the geometric margin between the boundary and all examples
 The data points that define the maximum margin width are called support vectors
 Find W and b by solving the problem below:
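One standard way to write this optimization, assuming labels yᵢ ∈ {−1, +1} and linearly separable data (the usual hard-margin form; the slide’s exact equation is not reproduced here):

minimize over W, b:   ½ ‖W‖²
subject to:   yᵢ (Wᵀxᵢ + b) ≥ 1,   for i = 1, …, N

Maximizing the geometric margin 1/‖W‖ is equivalent to minimizing ½‖W‖², which is why the margin appears only implicitly in the objective.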

30
Linear vs Non-linear Techniques
Linear vs Non-linear Techniques

• Linear classification techniques


 Linear classifier
 Perceptron
 Logistic regression
 Linear SVM
 Naïve Bayes
• Non-linear classification techniques
 k-nearest neighbors
 Non-linear SVM
 Neural networks
 Decision trees
 Random forest

31
Linear vs Non-linear Techniques
Linear vs Non-linear Techniques

• For some tasks, input data can be linearly separable, and linear classifiers can be suitably applied

• For other tasks, linear classifiers may have difficulty producing adequate decision boundaries

32
Non-linear Techniques
Linear vs Non-linear Techniques

• Non-linear classification
 Features are obtained as non-linear functions of the inputs
 This results in non-linear decision boundaries
 Can deal with non-linearly separable data (see the sketch below)

Inputs:

Features:

Outputs:
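A minimal sketch of the idea, using a made-up quadratic feature map (the slide’s actual feature formulas are not shown here; this map and data are illustrative assumptions):

```python
import numpy as np

def phi(x):
    """Non-linear feature map: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.array([x[0], x[1], x[0]**2 + x[1]**2])

# Circular data: class +1 inside the unit circle, class -1 outside.
# Not linearly separable in (x1, x2), but linearly separable after phi:
# the plane "third feature = 1" splits the two classes.
points = [np.array([0.1, 0.2]), np.array([0.5, -0.3]),   # inside
          np.array([2.0, 0.0]), np.array([-1.5, 1.5])]   # outside
w, b = np.array([0.0, 0.0, -1.0]), 1.0                   # linear boundary in feature space
for p in points:
    print(p, "->", np.sign(w @ phi(p) + b))              # +1 inside, -1 outside
```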
33
Binary vs Multi-class Classification
Binary vs Multi-class Classification

• A classification problem with only 2 classes is referred to as binary classification
 The output labels are 0 or 1
 E.g., benign or malignant tumor, spam or no-spam email
• A problem with 3 or more classes is referred to as multi-class classification

34
Binary vs Multi-class Classification
Binary vs Multi-class Classification

• Both binary and multi-class classification problems can be linearly or non-linearly separated
 Figure: linearly and non-linearly separated data for a binary classification problem

35
ML vs. Deep Learning
Introduction to Deep Learning

• Conventional machine learning methods rely on human-designed feature representations
 ML becomes just optimizing weights to best make a final prediction

36
ML vs. Deep Learning
Introduction to Deep Learning

• Deep learning (DL) is a machine learning subfield that uses multiple layers for learning data representations
 DL is exceptionally effective at learning patterns

37
ML vs. Deep Learning
Introduction to Deep Learning

• DL applies a multi-layer process for learning rich hierarchical features (i.e., data representations)
 Input image pixels → Edges → Textures → Parts → Objects

Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier → Output

38
Why is DL Useful?
Introduction to Deep Learning

• DL provides a flexible, learnable framework for representing visual, textual, and linguistic information
 Can learn in a supervised and unsupervised manner
• DL represents an effective end-to-end learning system
• Requires large amounts of training data
• Since about 2010, DL has outperformed other ML techniques
 First in vision and speech, then NLP, and other applications

39
Representational Power
Introduction to Deep Learning

• NNs with at least one hidden layer are universal approximators
 Given any continuous function h(x) and some ε > 0, there exists a NN with one hidden layer (and with a reasonable choice of non-linearity) described by a function f(x), such that |h(x) − f(x)| < ε for all x
 I.e., a NN can approximate any arbitrarily complex continuous function

• NNs use nonlinear mapping of the inputs x to the outputs f(x) to compute complex decision boundaries

• But then, why use deeper NNs?
 The fact that deep NNs work better is an empirical observation
 Mathematically, deep NNs have the same representational power as a one-layer NN

40
Introduction to Neural Networks
Introduction to Neural Networks

• Handwritten digit recognition (MNIST dataset)
 The intensity of each pixel is considered an input element
 Output is the class of the digit

[Diagram: inputs x1 … x256 (16 × 16 = 256 pixels; ink → 1, no ink → 0); outputs y1 … y10, one per digit class, e.g., y1 = 0.1 (“is 1”), y2 = 0.7 (“is 2”), …, y10 = 0.2 (“is 0”) — each output dimension represents the confidence of a digit, so the image is classified as “2”]
41
Introduction to Neural Networks
Introduction to Neural Networks

• Handwritten digit recognition

[Diagram: x1, x2, …, x256 → Machine → y1, y2, …, y10, classifying the image as “2”]

The function f: ℝ²⁵⁶ → ℝ¹⁰ is represented by a neural network

42
Elements of Neural Networks
Introduction to Neural Networks

• NNs consist of hidden layers with neurons (i.e., computational units)
• A single neuron maps a set of inputs into an output number:

z = a1·w1 + a2·w2 + ⋯ + aK·wK + b
a = σ(z)

[Diagram: inputs a1 … aK, weights w1 … wK, bias b, activation function σ, output a]
43
Elements of Neural Networks
Introduction to Neural Networks

• A NN with one hidden layer and one output layer

hidden layer: h = σ(W1·x + b1)
output layer: y = σ(W2·h + b2)

where W1, W2 are the weights, b1, b2 are the biases, and σ is the activation function

For a 3-dimensional input x, a hidden layer with 4 neurons, and an output layer with 2 neurons:
 4 + 2 = 6 neurons (not counting inputs)
 [3 × 4] + [4 × 2] = 20 weights
 4 + 2 = 6 biases
 26 learnable parameters

A numpy sketch of this forward pass follows below.
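A minimal sketch of the two-layer forward pass in numpy, using the same 3-4-2 shapes (the random weights are placeholders; the slide does not specify values):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# 3 inputs -> 4 hidden neurons -> 2 outputs (shapes from the slide)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 12 weights + 4 biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 8 weights + 2 biases
print(W1.size + b1.size + W2.size + b2.size)    # -> 26 learnable parameters

x = np.array([1.0, -1.0, 0.5])                  # an arbitrary input
h = sigma(W1 @ x + b1)                          # hidden layer:  h = σ(W1·x + b1)
y = sigma(W2 @ h + b2)                          # output layer:  y = σ(W2·h + b2)
print(y)
```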
44
Elements of Neural Networks
Introduction to Neural Networks

• Deep NNs have many hidden layers


 Fully-connected (dense) layers (a.k.a. Multi-Layer Perceptron or MLP)
 Each neuron is connected to all neurons in the succeeding layer

[Diagram: input layer (x1 … xN) → hidden layers 1, 2, …, L → output layer (y1 … yM)]

45
Elements of Neural Networks
Introduction to Neural Networks

• A simple network, toy example
 For the input (1, −1), the first neuron computes z = (1 · 1) + (−1) · (−2) + 1 = 4 and outputs σ(4) = 0.98
 The second neuron computes z = (1 · (−1)) + (−1) · 1 + 0 = −2 and outputs σ(−2) = 0.12

Sigmoid function: σ(z) = 1 / (1 + e⁻ᶻ)
46
Elements of Neural Networks
Introduction to Neural Networks

• A simple network, toy example (cont’d)
 For the input vector (1, −1), the layer outputs are (0.98, 0.12) → (0.86, 0.11) → (0.62, 0.83)
 The network thus defines a function f: ℝ² → ℝ², with f(1, −1) = (0.62, 0.83)
47
Activation Functions
Introduction to Neural Networks

• Non-linear activations are needed to learn complex (non-linear) data representations
 Otherwise, NNs would be just a linear function (a composition of linear layers collapses to a single linear map)
 NNs with a large number of layers (and neurons) can approximate more complex functions
o Figure: more neurons improve representation (but may overfit)

48
Activation: Sigmoid
Introduction to Neural Networks

• Sigmoid function σ: takes a real-valued number and “squashes” it into the range between 0 and 1

σ(x) = 1 / (1 + e⁻ˣ),  ℝ → (0, 1)

 The output can be interpreted as the firing rate of a biological neuron
o Not firing = 0; fully firing = 1
 When the neuron’s activations are near 0 or 1, sigmoid neurons saturate
o Gradients at these regions are almost zero (almost no signal will flow)
 Sigmoid activations are less common in modern NNs
49
Activation: Tanh
Introduction to Neural Networks

• Tanh function: takes a real-valued number and “squashes” it into the range between −1 and 1

tanh(x):  ℝ → (−1, 1)

 Like sigmoid, tanh neurons saturate
 Unlike sigmoid, the output is zero-centered
o It is therefore preferred over sigmoid
 Tanh is a scaled sigmoid: tanh(x) = 2σ(2x) − 1

50
Activation: ReLU
Introduction to Neural Networks

• ReLU (Rectified Linear Unit): takes a real-valued number and thresholds it at zero

f(x) = max(0, x),  ℝ → ℝ≥0

 Most modern deep NNs use ReLU activations
 ReLU is fast to compute
o Compared to sigmoid, tanh
o Simply threshold a matrix at zero
 Accelerates the convergence of gradient descent
o Due to its linear, non-saturating form
 Prevents the vanishing gradient problem

A comparison sketch of the three activations follows below.
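A minimal numpy sketch of the three activations discussed above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                   # zero-centered, squashes into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)           # thresholds at zero; non-saturating for x > 0

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # [0.119 0.5   0.881]
print(tanh(z))      # [-0.964  0.     0.964]
print(relu(z))      # [0. 0. 2.]
```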

51
Training NNs
Training Neural Networks

• The network parameters include the weight matrices and bias vectors from all layers

θ = {W1, b1, W2, b2, ⋯, WL, bL}

 Often, the model parameters are referred to as weights

• Training a model to learn a set of parameters that are optimal (according to a criterion) is one of the greatest challenges in ML

[Diagram: 256 pixel inputs x1 … x256 passed through the network, with a softmax output layer producing confidences y1 = 0.1 (“is 1”), y2 = 0.7 (“is 2”), …, y10 = 0.2 (“is 0”)]
52
Training NNs
Training Neural Networks

• To train a NN, set the parameters such that, for a training subset of images, the corresponding element in the predicted output has the maximum value

Input image of “1” → y1 has the maximum value
Input image of “2” → y2 has the maximum value
⋮
Input image of “9” → y9 has the maximum value
Input image of “0” → y10 has the maximum value

53
Training NNs
Training Neural Networks

• Define a loss function/objective function/cost function that calculates the difference (error) between the model prediction and the true label
 E.g., it can be mean-squared error, cross-entropy, etc.

[Diagram: for an input image with true label “1”, the network outputs y1 … y10 are compared against the one-hot target (1, 0, …, 0); the cost ℒ(θ) measures the difference]
54
Training NNs
Training Neural Networks

• For a training set of N images, calculate the total loss over all images: ℒ(θ) = Σₙ ℒₙ(θ)
• Find the optimal parameters that minimize the total loss (a sketch follows below)

[Diagram: each training pair (xn, yn) is passed through the NN to produce a prediction ŷn and a per-example loss ℒn(θ); the per-example losses are summed into the total loss]
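A minimal sketch of a total cross-entropy loss over a batch (cross-entropy is one of the options the slides name; the predictions and targets here are made up):

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred):
    """Per-example cross-entropy loss: -sum_k t_k * log(p_k)."""
    return -np.sum(y_true_onehot * np.log(y_pred + 1e-12), axis=1)

# Made-up predictions for 3 examples over 4 classes (rows sum to 1)
y_pred = np.array([[0.70, 0.10, 0.10, 0.10],
                   [0.20, 0.60, 0.10, 0.10],
                   [0.25, 0.25, 0.25, 0.25]])
y_true = np.eye(4)[[0, 1, 3]]          # one-hot targets: classes 0, 1, 3

per_example = cross_entropy(y_true, y_pred)
print(per_example)                     # L_n(theta) for each example
print(per_example.sum())               # total loss L(theta) over the training set
```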
55
Backpropagation
Training Neural Networks

• Backpropagation is short for “backward propagation of errors”
• For training NNs, forward propagation (the forward pass) refers to passing the inputs through the hidden layers to obtain the model outputs (predictions)
 The loss function is then calculated
 Backpropagation traverses the network in reverse order, from the outputs backward toward the inputs, to calculate the gradients of the loss
 The chain rule is used for calculating the partial derivatives of the loss function with respect to the parameters in the different layers of the network
• Each update of the model parameters during training takes one forward and one backward pass (a sketch follows below)
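A minimal one-hidden-layer forward/backward pass sketch (the mean-squared-error loss, sigmoid activations, shapes, and data are assumptions made for concreteness):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Tiny network: 2 inputs -> 3 hidden -> 1 output (made-up shapes)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, t = np.array([1.0, -1.0]), np.array([1.0])   # one made-up training pair

for step in range(100):
    # Forward pass: compute activations layer by layer
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    loss = 0.5 * np.sum((y - t) ** 2)           # mean-squared error

    # Backward pass: chain rule, from the output back toward the inputs
    dz2 = (y - t) * y * (1 - y)                 # dL/dz2 (sigmoid derivative = y(1-y))
    dW2, db2 = np.outer(dz2, h), dz2
    dh = W2.T @ dz2                             # propagate the error to the hidden layer
    dz1 = dh * h * (1 - h)                      # dL/dz1
    dW1, db1 = np.outer(dz1, x), dz1

    # One gradient-descent update per forward/backward pass
    lr = 1.0
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)  # should be close to 0 after training
```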
56
Generalization
Generalization

• Underfitting
 The model is too “simple” to represent all the relevant class characteristics
 E.g., model with too few parameters
 Produces high error on the training set and high error on the validation set

• Overfitting
 The model is too “complex” and fits irrelevant characteristics (noise) in the data
 E.g., model with too many parameters
 Produces low error on the training set but high error on the validation set
57
Overfitting
Generalization

• Overfitting – a model with high capacity fits the noise in the data instead of the underlying relationship

• The model may fit the training data very well, but fails to generalize to new examples (test or validation data)

58
