
Deep Learning Basics

Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang
Motivation I: representation learning
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class 𝓗 and loss function 𝑙
• Optimization: minimize the empirical loss
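
A minimal sketch of these three steps, assuming a linear hypothesis class and the squared loss on synthetic data; the names (X, y, w, learning_rate) are illustrative, not from the slides:

```python
import numpy as np

# 1. Collect data and extract features: here a synthetic feature matrix X
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 examples, 3 features each
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# 2. Build model: hypothesis class = linear predictors, loss = squared error
w = np.zeros(3)

# 3. Optimization: minimize the empirical loss by gradient descent
learning_rate = 0.1
for _ in range(500):
    residual = X @ w - y                 # prediction errors on the training set
    grad = X.T @ residual / len(y)       # gradient of half the mean squared error
    w -= learning_rate * grad

print(w)                                 # should be close to true_w
```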
Features

[Figure: raw input 𝑥 → extract features (e.g., a color histogram over the red, green, and blue channels) → build hypothesis 𝑦 = 𝑤ᵀ𝜙(𝑥).]
Features: part of the model

[Figure: the feature map 𝜙(𝑥) is the nonlinear part of the model; the hypothesis 𝑦 = 𝑤ᵀ𝜙(𝑥) built on top of it is a linear model.]
Example: Polynomial kernel SVM

• 𝑦 = sign(𝑤ᵀ𝜙(𝑥) + 𝑏), with a fixed feature map 𝜙(𝑥)

[Figure: inputs 𝑥₁, 𝑥₂ passed through the fixed 𝜙(𝑥) into a linear classifier.]
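
A sketch of using a fixed feature map with a linear classifier, in the spirit of 𝑦 = sign(𝑤ᵀ𝜙(𝑥) + 𝑏); the degree-2 map and the weights below are made up for illustration and are not a trained SVM:

```python
import numpy as np

def poly_features(x):
    """Fixed (hand-designed, not learned) degree-2 feature map for a 2-d input."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Hypothetical weights and bias of the linear classifier on top of phi(x)
w = np.array([0.5, -0.3, 1.0, 0.2, 0.2])
b = -0.1

def predict(x):
    return np.sign(w @ poly_features(x) + b)   # y = sign(w^T phi(x) + b)

print(predict(np.array([1.0, 2.0])))
```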
Motivation: representation learning
• Why don’t we also learn 𝜙(𝑥)?

[Figure: learn 𝜙(𝑥) and learn 𝑤 jointly: 𝑥 → 𝜙(𝑥) → 𝑦 = 𝑤ᵀ𝜙(𝑥).]
Feedforward networks
• View each dimension of 𝜙(𝑥) as something to be learned

[Figure: 𝑥 → 𝜙(𝑥) → 𝑦 = 𝑤ᵀ𝜙(𝑥).]
Feedforward networks
• Linear functions 𝜙ᵢ(𝑥) = 𝜃ᵢᵀ𝑥 don’t work: 𝑦 = 𝑤ᵀ𝜙(𝑥) would still be linear in 𝑥, so we need some nonlinearity

[Figure: 𝑥 → 𝜙(𝑥) → 𝑦 = 𝑤ᵀ𝜙(𝑥).]
Feedforward networks
• Typically, set 𝜙ᵢ(𝑥) = 𝑟(𝜃ᵢᵀ𝑥), where 𝑟(⋅) is some nonlinear function

[Figure: 𝑥 → 𝜙(𝑥) → 𝑦 = 𝑤ᵀ𝜙(𝑥).]
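
A minimal sketch of this construction with 𝑟 = ReLU; the shapes and the random (untrained) parameters Theta and w are illustrative only:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, k = 4, 8                        # input dimension, number of learned features
Theta = rng.normal(size=(k, d))    # row i holds the pattern theta_i (to be learned)
w = rng.normal(size=k)             # output weights (to be learned)

def forward(x):
    phi = relu(Theta @ x)          # phi_i(x) = r(theta_i^T x), here r = ReLU
    return w @ phi                 # y = w^T phi(x)

print(forward(rng.normal(size=d)))
```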
Feedforward deep networks
• What if we go deeper?

[Figure: 𝑥 → ℎ₁ → ℎ₂ → … → ℎ_L → 𝑦. Figure from Deep Learning, by Goodfellow, Bengio, Courville. Dark boxes are things to be learned.]
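
A sketch of going deeper by stacking several such layers; the layer sizes and the random (untrained) parameters are illustrative only:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
layer_sizes = [4, 16, 16, 16, 1]          # x -> h1 -> h2 -> h3 -> y
params = [(0.1 * rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    h = x
    for i, (W, b) in enumerate(params):
        z = W @ h + b
        # nonlinearity on the hidden layers, linear output on the last layer
        h = relu(z) if i < len(params) - 1 else z
    return h

print(forward(rng.normal(size=4)))
```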
Motivation II: neurons
Motivation: neurons

[Figure: a biological neuron. Figure from Wikipedia.]
Motivation: abstract neuron model
• Neuron activated when the correlation between the input and a pattern 𝜃 exceeds some threshold 𝑏
• 𝑦 = threshold(𝜃ᵀ𝑥 − 𝑏) or 𝑦 = 𝑟(𝜃ᵀ𝑥 − 𝑏)
• 𝑟(⋅) is called the activation function

[Figure: inputs 𝑥₁, 𝑥₂, …, 𝑥_d feeding a single neuron that outputs 𝑦.]
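
A sketch of this abstract neuron; the pattern theta and the threshold b below are made up for illustration:

```python
import numpy as np

theta = np.array([0.8, -0.5, 0.3])          # pattern the neuron responds to
b = 0.2                                     # activation threshold

def neuron(x, r=lambda z: float(z >= 0)):   # r defaults to a hard threshold
    return r(theta @ x - b)                 # y = r(theta^T x - b)

print(neuron(np.array([1.0, 0.0, 1.0])))    # 1.0: the correlation exceeds the threshold
```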
Motivation: artificial neural networks
• Put into layers: feedforward deep networks

[Figure: 𝑥 → ℎ₁ → ℎ₂ → … → ℎ_L → 𝑦.]
Components in feedforward networks
Components
• Representations:
  • Input
  • Hidden variables
• Layers/weights:
  • Hidden layers
  • Output layer
Components

[Figure: input 𝑥 → hidden variables ℎ₁, ℎ₂, …, ℎ_L → output 𝑦; the first layer maps 𝑥 to ℎ₁, and the output layer maps ℎ_L to 𝑦.]

Input
• Represented as a vector
• Sometimes requires some preprocessing, e.g.:
  • Subtract the mean
  • Normalize to [-1, 1]
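
A sketch of this preprocessing, assuming the examples are stacked as rows of a matrix X (the helper name preprocess is made up):

```python
import numpy as np

def preprocess(X):
    """Subtract each feature's mean, then scale each feature into [-1, 1]."""
    X = X - X.mean(axis=0)            # subtract mean
    max_abs = np.abs(X).max(axis=0)
    max_abs[max_abs == 0] = 1.0       # leave constant features at 0 (avoid dividing by 0)
    return X / max_abs                # normalize to [-1, 1]

X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 220.0]])
print(preprocess(X))
```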
Output layers
• Regression: 𝑦 = 𝑤ᵀℎ + 𝑏
• Linear units: no nonlinearity

Output layers
• Multi-dimensional regression: 𝑦 = 𝑊ᵀℎ + 𝑏
• Linear units: no nonlinearity

Output layers
• Binary classification: 𝑦 = 𝜎(𝑤ᵀℎ + 𝑏)
• Corresponds to using logistic regression on ℎ

Output layers
• Multi-class classification:
  • 𝑦 = softmax(𝑧), where 𝑧 = 𝑊ᵀℎ + 𝑏
  • Corresponds to using multi-class logistic regression on ℎ

[Figure: ℎ → 𝑧 → 𝑦.]
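
A sketch of the three output heads described above (regression, binary classification, multi-class classification), applied to a random hidden vector ℎ with untrained weights; all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=8)               # last hidden representation

# Regression: y = w^T h + b (linear unit, no nonlinearity)
w, b = rng.normal(size=8), 0.0
y_reg = w @ h + b

# Binary classification: y = sigma(w^T h + b), i.e. logistic regression on h
y_bin = sigmoid(w @ h + b)

# Multi-class classification: y = softmax(z) with z = W^T h + b
W, b_vec = rng.normal(size=(8, 5)), np.zeros(5)
y_multi = softmax(W.T @ h + b_vec)

print(y_reg, y_bin, y_multi.sum())   # the multi-class probabilities sum to 1
```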


Hidden layers
• Each neuron takes a weighted linear combination of the previous layer
• So it can be thought of as outputting one value for the next layer

[Figure: a neuron connecting layer ℎᵢ to layer ℎᵢ₊₁.]
Hidden layers
• 𝑦 = 𝑟(𝑤ᵀ𝑥 + 𝑏)
• Typical activation functions 𝑟:
  • Threshold: 𝑡(𝑧) = 𝕀[𝑧 ≥ 0]
  • Sigmoid: 𝜎(𝑧) = 1/(1 + exp(−𝑧))
  • Tanh: tanh(𝑧) = 2𝜎(2𝑧) − 1

[Figure: a single unit with input 𝑥, activation 𝑟(⋅), and output 𝑦.]
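
A sketch of these activation functions in NumPy (the function names are illustrative):

```python
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)            # t(z) = 1[z >= 0]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # sigma(z) = 1 / (1 + exp(-z))

def tanh(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0      # tanh(z) = 2*sigma(2z) - 1

z = np.linspace(-3, 3, 7)
print(threshold(z))
print(sigmoid(z))
print(np.allclose(tanh(z), np.tanh(z)))      # agrees with NumPy's built-in tanh
```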
Hidden layers
• Problem: saturation
• In the saturated (flat) regions of such activations the gradient is too small

[Figure borrowed from Pattern Recognition and Machine Learning, Bishop.]
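
A small numerical illustration of saturation: the sigmoid's derivative shrinks toward zero as |𝑧| grows, so gradient-based updates become tiny:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)             # derivative of the sigmoid

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))        # 0.25, ~0.10, ~0.0066, ~0.000045
```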


Hidden layers
• Activation function ReLU (rectified linear unit)
• ReLU(𝑧) = max{𝑧, 0}

[Figure from Deep Learning, by Goodfellow, Bengio, Courville.]
Hidden layers
• Activation function ReLU (rectified linear unit)
• ReLU(𝑧) = max{𝑧, 0}: gradient 1 for 𝑧 > 0, gradient 0 for 𝑧 < 0
Hidden layers
• Generalizations of ReLU: gReLU(𝑧) = max{𝑧, 0} + 𝛼 min{𝑧, 0}
  • Leaky-ReLU(𝑧) = max{𝑧, 0} + 0.01 min{𝑧, 0}
  • Parametric-ReLU(𝑧): 𝛼 learnable

[Figure: graph of gReLU(𝑧).]
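
A sketch of ReLU and the generalizations defined above (function names are illustrative; for the parametric version, 𝛼 would be learned along with the other weights):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)                    # ReLU(z) = max{z, 0}

def g_relu(z, alpha):
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

def leaky_relu(z):
    return g_relu(z, alpha=0.01)                 # fixed small slope for z < 0

def parametric_relu(z, alpha):
    return g_relu(z, alpha)                      # alpha is a learnable parameter

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), parametric_relu(z, alpha=0.2))
```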
