Lecture 2.2: Unimodal Representations, Part 1

The document discusses unimodal representations in multimodal machine learning, focusing on image representations, convolutional neural networks (CNNs), and their applications in object detection and segmentation. It covers various aspects of CNNs, including their structure, advantages, and visualization techniques. Additionally, it highlights tools for automatic visual behavior analysis and existing software for tasks like face detection and expression analysis.


Multimodal Machine Learning

Lecture 2.2: Unimodal Representations


Đàm Quang Tuấn
Lecture Objectives

Dimensions of heterogeneity
Image representations
Image gradients, edges, kernels
Convolutional neural networks (CNNs)
Convolution and pooling layers
Visualizing CNNs
Region-based CNNs
Sequence modeling with convolutional networks

Team matching event

Dimensions of Heterogeneity

Heterogeneous Modalities

Information present in different modalities will often show diverse qualities, structures, and representations.

Homogeneous modalities (with similar qualities) vs. heterogeneous modalities (with diverse qualities), comparing a modality A with a modality B.

Examples along this spectrum: images from 2 cameras, text from 2 different languages, language and vision.
Dimensions of Heterogeneity (Modality A vs. Modality B)

1. Element representations: discrete, continuous, granularity
2. Element distributions: density, frequency
3. Structure: temporal, spatial, latent, explicit
4. Information: abstraction, entropy H(A) vs. H(B)
5. Noise: uncertainty, noise, missing data
6. Relevance: task, context dependence
Modality Profile (Modality A vs. Modality B)

1. Element representations: discrete, continuous, granularity
2. Element distributions: density, frequency
3. Structure: temporal, spatial, latent, explicit
4. Information: abstraction, entropy H(A) vs. H(B)
5. Noise: uncertainty, noise, missing data
6. Relevance: task, context dependence
Modality Profile: Visual Image Modality

1. Element representations: discrete, continuous, granularity?
2. Element distributions: density, frequency?
3. Structure: temporal, spatial, latent, explicit?
4. Information: abstraction, entropy?
5. Noise: uncertainty, noise, missing data?
6. Relevance: task, context dependence?
Image
Representations
How Would You Describe This Image?


Object-Based Visual Representation

For each detected object: a label (e.g., "person") and a feature vector (appearance descriptor) covering attributes such as:
❑ Age
❑ Expression
❑ Clothes …
Object Descriptors

How to represent and detect an object? Many approaches over the years: image gradients, edge detection, histograms of oriented gradients (HOG), optical flow.
Object Descriptors

How to represent and detect an object? Many approaches over the years, using templates tested on the image (i.e., convolution kernels): horizontal and vertical gradients, oriented gradients, Gabor filters (inspired by the visual cortex).
Convolution Kernels

Input image ∗ convolution kernels = response maps
Object Descriptors

How to represent and detect an object? Convolutional Neural Networks (CNNs). More details about CNNs are coming... and we will also talk about vision Transformers in the coming weeks.

And images are more than a list of objects!
One representation, lots of tasks

https://github.com/facebookresearch/detectron2
Facial expression analysis

[OpenFace: an open source facial behavior analysis toolkit, T. Baltrušaitis et al., 2016]

Articulated Body Tracking: OpenPose

https://github.com/CMU-Perceptual-Computing-Lab/openpose

See the appendix for a list of available tools for automatic visual behavior analysis.
Convolutional Neural Networks

Why use Convolutional Neural Networks?

Goal: build more abstract, hierarchical visual representations
(input pixels → edges/blobs → parts → objects).

Key advantages:
1) Inspired by the visual cortex
2) Encourages visual abstraction
3) Exploits translation invariance
4) Kernels/templates are learned
5) Fewer parameters than an MLP
Translation Invariance

2 data points: which one is up?

An MLP can easily learn this task (possibly with only 1 neuron!).

What happens if the face is slightly translated?
➢ The model should still be able to classify it.

Conventional MLP models are not translation invariant!
➢ But CNNs are kernel-based, which helps with translation invariance and reduces the number of parameters.
Predefined vs. Learned Kernels

Predefined kernels: e.g., Gabor filters.
Learned kernels: Convolutional Neural Networks (CNNs). With CNNs, the kernel values are learned as model parameters.
Learned Filters (aka Convolution Kernels)

https://distill.pub/2017/feature-visualization/
Convolution in 2D: Example

Input image ∗ convolution kernel = response map
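To make the operation concrete, here is a minimal sketch of 2D convolution (in the deep-learning sense, i.e., cross-correlation) in NumPy; the random image and the Sobel-like kernel are illustrative assumptions, not the lecture's actual example.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over an image and return the response map (valid mode)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    response = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Weighted sum of the image patch under the kernel
            response[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return response

# Example: a vertical-edge (Sobel-like) kernel on a random image
image = np.random.rand(8, 8)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
print(conv2d(image, sobel_x).shape)  # (6, 6) response map
```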
Convolution as a Fully-Connected Network

Input: all pixels (image). Output: kernel responses (response map).

Not efficient! A 200 × 200 image requires 40,000 × n parameters (where n is the size of the kernel).

And it may learn different kernels for different pixel positions: not translation invariant.
Convolutional Neural Layer

Input: all pixels (image). Output: kernel responses (response map).

Each output is a weighted sum of the input: y = Wx, where W holds the convolution kernel.
Example with a 1D kernel: [w1, w2, w3].
Convolutional Neural Layer

Modification 1: Sliding window. Only apply the kernel to a small region of the input at a time.

Input: all pixels (image). Output: kernel responses (response map). Weighted sum y = Wx, with a 1D kernel [w1, w2, w3] as the example.
Convolutional Neural Layer

Modification 2: The same kernel is applied to all sliding windows.

Input: all pixels (image). Output: kernel responses (response map). Weighted sum y = Wx, with the shared 1D kernel [w1, w2, w3].
Convolutional Neural Layer

Modification 2 (continued): With the same kernel applied to all sliding windows, W becomes a weight-shared matrix built from the 1D kernel [w1, w2, w3], and the output is y = Wx.

Can be implemented efficiently on GPUs.
W will be 3D: the 3rd dimension allows for multiple kernels.
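To connect the two views, here is a small NumPy sketch (input and kernel values are assumptions, not from the slides) showing that applying the shared 1D kernel [w1, w2, w3] at every sliding-window position is the same as multiplying by a weight-shared matrix W, i.e., y = Wx.

```python
import numpy as np

x = np.arange(8, dtype=float)          # input "pixels"
w = np.array([0.25, 0.5, 0.25])        # one kernel, shared at every position

# Equivalent dense weight matrix W: each row is the kernel shifted by one step.
n_out = len(x) - len(w) + 1
W = np.zeros((n_out, len(x)))
for i in range(n_out):
    W[i, i:i+len(w)] = w

y_dense = W @ x                                   # fully-connected view: y = Wx
y_conv = np.convolve(x, w[::-1], mode="valid")    # sliding-window view (kernel flipped,
                                                  # since np.convolve is true convolution)
assert np.allclose(y_dense, y_conv)
```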
Convolutional Neural Network

Multiple convolutional layers allow the network to learn combinations of sub-parts, increasing complexity:
input pixels → edges/blobs (combinations of pixels) → parts (combinations of edges) → objects (combinations of parts).

But how to encourage abstraction and summarization?
Answer: pooling layers.
Pooling Layer

Response map subsampling: allows summarization of the responses.
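As a concrete illustration, a minimal 2×2 max-pooling sketch in NumPy (window size and stride of 2 are assumptions; the lecture's figure may use different settings):

```python
import numpy as np

def max_pool2x2(response):
    """Keep the strongest response in each non-overlapping 2x2 window."""
    h, w = response.shape
    h, w = h - h % 2, w - w % 2                      # drop odd border for simplicity
    blocks = response[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

response = np.random.rand(6, 6)
print(max_pool2x2(response).shape)  # (3, 3): summarized responses
```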
Common architectures

Repeat several times:
Start with a convolutional layer
Followed by a non-linear activation and pooling
End with a fully connected (MLP) layer
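A minimal PyTorch sketch of this pattern, assuming 32×32 RGB inputs, 10 output classes, and the channel sizes shown (all arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Repeated blocks: convolution -> non-linear activation -> pooling
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # End with a fully connected layer (for 32x32 inputs: 32 channels of 8x8)
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = SmallCNN()(torch.randn(1, 3, 32, 32))  # -> shape (1, 10)
```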
Example: VGGNet Model

Used for the object classification task: 1000-way classification, 138 million parameters.
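If torchvision is available, the quoted figure can be checked against its VGG-16 implementation (a hedged sketch; the lecture's exact VGGNet variant may differ):

```python
import torchvision

vgg16 = torchvision.models.vgg16(weights=None)    # 1000-way ImageNet classifier head
n_params = sum(p.numel() for p in vgg16.parameters())
print(f"{n_params / 1e6:.0f}M parameters")         # roughly 138M
```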
Residual Networks (ResNet)

Adding residual connections: ResNet (He et al., 2015), with up to 152 layers!
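A hedged PyTorch sketch of the core idea: a basic residual block whose output adds the block's input back to the transformed features (simplified to a fixed channel count with no downsampling, so it is not the paper's exact block):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)   # residual (skip) connection

y = BasicResidualBlock(64)(torch.randn(1, 64, 16, 16))  # same shape as the input
```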
Visualizing CNNs

Visualizing the Last CNN Layer: t-SNE

AlexNet

Embed high-dimensional data points (i.e., feature codes) so that pairwise distances are conserved in local neighborhoods.
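A short sketch of this visualization using scikit-learn's TSNE, assuming the last-layer feature codes have already been extracted (random stand-ins are used here in place of real AlexNet codes):

```python
import numpy as np
from sklearn.manifold import TSNE

features = np.random.rand(500, 4096)      # stand-in for extracted feature codes
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)
print(embedding.shape)                    # (500, 2): points to scatter-plot
```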
Deconvolution

Visualizing & Understanding Conv. Nets

• What makes convnets "tick"?
• What happens in hidden units?
• Layer 1: easy to visualize
• Deeper layers: just a bunch of numbers? Or something more meaningful?
• Do convnets use context, or do they actually model target classes?
Introducing: Visualizing & Understanding Conv. Nets

• Zeiler & Fergus, 2013


• Goal: Try to visualize the “black box” hidden units, gain insights
• Hope: Use conclusions to improve performance
• Idea: “Deconvolutional” neural net
Deconvolutional Nets

• Originally suggested for unsupervised feature learning: construct a convolutional net whose cost function is the image reconstruction error
• Used here to find what stimuli cause the strongest responses in hidden units
• Run many images through the net → find the strongest unit activations in each layer → visualize by "reversing" the net operations
Reversing a convnet: "unpooling"

Layer 1 visualizations
Hidden Layer Visualizations: Layer 2
Hidden Layer Visualizations: Layer 3
Hidden Layer Visualizations: Layer 4
Hidden Layer Visualizations: Layer 5
CAM: Class Activation Mapping [CVPR 2016]

Grad-CAM [ICCV 2017]

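A hedged sketch of the Grad-CAM computation in PyTorch: average the gradients of a class score over each feature map of a chosen convolutional layer, use those averages as weights, and keep the positive part of the weighted sum. The model (ResNet-18), the choice of layer4, and the dummy input are assumptions for illustration, not the paper's setup.

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
acts = {}
model.layer4.register_forward_hook(lambda m, i, o: acts.update(a=o))  # save feature maps

x = torch.randn(1, 3, 224, 224)                    # dummy input image
score = model(x)[0].max()                          # score of the top predicted class
grads = torch.autograd.grad(score, acts["a"])[0]   # d(score) / d(feature maps)

weights = grads.mean(dim=(2, 3), keepdim=True)      # global-average-pool the gradients
cam = torch.relu((weights * acts["a"]).sum(dim=1))  # weighted sum of maps, then ReLU
print(cam.shape)                                    # (1, 7, 7) heatmap, upsampled onto the image for display
```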
Region-based CNNs
Object recognition

Object Detection (and Segmentation)

Input image → Detected objects

One option: a sliding window
Object Detection (and Segmentation)

Input image → Region proposals → Detected objects

A better option: start by identifying hundreds of region proposals and then apply our CNN object detector.

How to efficiently identify region proposals?
Selective Search [Uijlings et al., IJCV 2013]

Image segmentation (using superpixels), then merge similar regions to create box region proposals.
R-CNN [Girshick et al., CVPR 2014]

• Select ~2000 region proposals (time consuming!)
• Warp each region
• Apply the CNN to each region (time consuming!)

Fast R-CNN: applies the CNN only once, and then extracts regions.
Faster R-CNN: region selection on the Conv5 response map.
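As a usage sketch, torchvision ships a Faster R-CNN implementation that can stand in for the detectors described above (assuming a recent torchvision; the pretrained weights and the 0.8 score threshold are arbitrary choices, not from the lecture):

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)                 # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = detector([image])[0]          # dict of boxes, labels, scores per image

keep = predictions["scores"] > 0.8              # keep confident detections only
print(predictions["boxes"][keep], predictions["labels"][keep])
```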
Mask R-CNN: Detection and Segmentation
(He et al., 2018)

Sequential Modeling with Convolutional Networks

Modeling Temporal and Sequential Data

How to represent a video sequence?

One option: Recurrent Neural Networks (more about this next week).
3D CNN

Input as a 3D tensor (stacking video images).
First layer with 3D kernels.
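A minimal sketch of that first 3D convolutional layer in PyTorch; the 16-frame clip length, channel counts, and kernel size are assumptions for illustration:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)          # (batch, RGB, frames, H, W): stacked video images
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)
responses = conv3d(clip)                        # 3D kernels slide over time, height, and width
print(responses.shape)                          # (1, 64, 16, 112, 112)
```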
Time-Delay Neural Network

1D convolution

Alexander Waibel, "Phoneme Recognition Using Time-Delay Neural Networks," SP87-100, Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE), December 1987, Tokyo, Japan.
Temporal Convolutional Network (TCN) [Lea et al., CVPR 2017]

An encoder-decoder architecture built from temporal (1D) convolutions.
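A hedged sketch of the idea shared by TDNNs and TCNs: stacked (here dilated) 1D convolutions over a sequence of per-frame features. This is a generic illustration, not the encoder-decoder model of Lea et al.; the layer sizes and dilations are assumptions.

```python
import torch
import torch.nn as nn

class TinyTCN(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        # Dilated 1D convolutions widen the temporal receptive field layer by layer;
        # padding equals dilation so the sequence length is preserved.
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, out_dim, kernel_size=3, padding=4, dilation=4),
        )

    def forward(self, x):                      # x: (batch, features, time)
        return self.net(x)

frames = torch.randn(1, 128, 100)              # 100 time steps of 128-dim visual features
print(TinyTCN(128, 64, 10)(frames).shape)      # (1, 10, 100): per-frame predictions
```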
Appendix: Tools for Automatic Visual Behavior Analysis
Automatic analysis of visual behavior

Face detection
Face tracking
Facial landmark detection
Head pose
Eye gaze tracking
Facial expression analysis
Body pose tracking

Face Detection: Multi-Task CNN [SPL 2016]

Stage 1: candidate windows are produced through a fast Proposal Network.
Stage 2: these candidates are refined through a Refinement Network.
Stage 3: produces the final bounding box and facial landmark positions.
Existing software (face detection)

Multi-Task CNN face detector


https://kpzhang93.github.io/MTCNN_face_detection_alignment/index.html
OpenCV (Viola-Jones detector)
dlib (HOG + SVM)
http://dlib.net/
Tree-based model (accurate but very slow)
http://www.ics.uci.edu/~xzhu/face/
HeadHunter (accurate but slow)
http://markusmathias.bitbucket.org/2014_eccv_face_detection/
NPD
http://www.cbsr.ia.ac.cn/users/scliao/projects/npdface/

Facial Landmarks: Constrained Local Neural Field

Existing software (facial landmarks)

OpenFace: facial features


https://github.com/TadasBaltrusaitis/OpenFace
Chehra face tracking
https://sites.google.com/site/chehrahome/
Menpo project (good AAM, CLM learning tool)
http://www.menpo.org/
IntraFace: Facial attributes, facial expression analysis
http://www.humansensing.cs.cmu.edu/intraface/
OKAO Vision: Gaze estimation, facial expression
http://www.omron.com/ecb/products/mobile/okao03.html (Commercial software)
VisageSDK
http://www.visagetechnologies.com/products/visagesdk/ (Commercial software)

Facial expression analysis

[OpenFace: an open source facial behavior analysis toolkit, T. Baltrušaitis et al., 2016]

Existing Software (expression analysis)

OpenFace: Action Units


https://github.com/TadasBaltrusaitis/OpenFace
Shore: facial tracking, smile detection, age and gender detection
http://www.iis.fraunhofer.de/en/bf/bsy/fue/isyst/detektion/
FACET/CERT (Emotient API): Facial expression recognition
http://imotionsglobal.com/software/add-on-modules/attention-tool-facet-module-facial-action-coding-system-facs/ (Commercial software)
Affdex
http://www.affectiva.com/solutions/apis-sdks/
(commercial software)

Gaze Estimation: Eye, Head and Body

Image from Hachisu et al. (2018). FaceLooks: A Smart Headband for Signaling Face-to-Face Behavior. Sensors.

Existing Software (head gaze)

OpenFace
https://github.com/TadasBaltrusaitis/OpenFace
Chehra face tracking
https://sites.google.com/site/chehrahome/
Watson: head pose estimation
http://sourceforge.net/projects/watson/
Random forests
http://www.vision.ee.ethz.ch/~gfanelli/head_pose/head_forest.html
(requires a Kinect)
IntraFace
http://www.humansensing.cs.cmu.edu/intraface/

Existing Software (eye gaze)

OpenFace: gaze from a webcam


https://github.com/TadasBaltrusaitis/OpenFace
EyeAPI: eye pupil detection
http://staff.science.uva.nl/~rvalenti/
EyeTab
https://www.cl.cam.ac.uk/research/rainbow/projects/eyetab/
OKAO Vision: Gaze estimation, facial expression
http://www.omron.com/ecb/products/mobile/okao03.html (Commercial software)

Articulated Body Tracking: OpenPose

Existing Software (body tracking)

OpenPose
https://github.com/CMU-Perceptual-Computing-Lab/openpose
Microsoft Kinect
http://www.microsoft.com/en-us/kinectforwindows/
OpenNI
http://openni.org/
Convolutional Pose Machines
https://github.com/shihenw/convolutional-pose-machines-release

Visual Descriptors

Image gradient, edge detection, histograms of oriented gradients (HOG), SIFT descriptors, optical flow, Gabor jets.
Existing Software (visual descriptors)

OpenCV: optical flow, gradient, Haar filters…


SIFT descriptors
http://blogs.oregonstate.edu/hess/code/sift/
dlib – HoG
http://dlib.net/
OpenFace: Aligned HoG for faces
https://github.com/TadasBaltrusaitis/CLM-framework

You might also like