Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giró i Nieto, “Deep learning for vision: Objects”. Master in Multimedia, La Salle URL (May 2016)
@DocXavi
Deep Learning for Computer Vision
Object Analytics
5 May 2016
Xavier Giró-i-Nieto
Master en
Creació Multimedia

One lecture organized in three parts
Deep ConvNets for Recognition for...
Images (global)
Objects (local)
Video (2D+T)

One lecture organized in four parts
Local analysis for...
Proposals
Detection (person, bag)
Recognition (me, my bag)
Segmentation (person, bag)

Proposals: Hand-crafted
Slides credit:
Marc Bolaños
Hand-crafted proposal methods were based on bottom-up grouping:
Selective Search (SS) and Multiscale Combinatorial Grouping (MCG)
[SS] Uijlings, Jasper RR, Koen EA van de Sande, Theo Gevers, and Arnold WM Smeulders. "Selective search for object
recognition." International journal of computer vision 104, no. 2 (2013): 154-171.
[MCG] Arbeláez, Pablo, Jordi Pont-Tuset, Jonathan Barron, Ferran Marques, and Jitendra Malik. "Multiscale combinatorial
grouping." CVPR 2014.

Proposals: DeepBox
Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. "Deepbox: Learning objectness with convolutional
networks." ICCV 2015. [software]

Proposals: DeepBox
Slides credit:
Marc Bolaños
DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals.
2) Rerank them (and possibly discard them) by using DeepBox.

Proposals: DeepBox: Architecture
Slides credit:
Marc Bolaños
AlexNet architecture (heavier), PASCAL VOC:
AUC = 0.75, IoU = 0.5
AUC = 0.62, IoU = 0.7
DeepBox architecture (lighter), PASCAL VOC:
AUC = 0.74, IoU = 0.5
AUC = 0.60, IoU = 0.7
Small drop

Proposals: DeepBox: Training
Slides credit:
Marc Bolaños
1) Initialize layers with AlexNet weights.
2) Train on Sliding Windows
Negative Samples: extract windows by raster scanning.
Positive Samples: having GT bounding boxes, they generate samples per instance with a perturbation of:
3) Train on Hard Negatives
By using bottom-up proposals from Edge Boxes:
If GT overlap <= 0.3 → Negative Samples
If GT overlap >= 0.7 → Positive Samples
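The overlap thresholds used for the Edge Boxes proposals can be illustrated with a minimal sketch. It assumes boxes in (x1, y1, x2, y2) format and that proposals whose overlap falls between the two thresholds are ignored; the function names `iou` and `label_proposal` are hypothetical and do not come from the DeepBox software.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Return 1 (positive), 0 (negative) or None (ignored, an assumption)."""
    # Best overlap of this proposal with any ground-truth box.
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return 1
    if best <= neg_thr:
        return 0
    return None
```

For example, `label_proposal((0, 0, 10, 10), [(5, 0, 15, 10)])` returns `None`, because the overlap of one third lies between the two thresholds.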

Proposals: DeepBox: Results
[Figure: proposals from DeepBox compared with proposals from Edge Boxes]
Slides credit:
Marc Bolaños

With a rather simple approach ConvNets can obtain much better results than
previous techniques for Object Proposals.
Slides credit:
Marc Bolaños

Slides credit:
Marc Bolaños

This increases detection capabilities not only for known classes, but also for unknown ones
(suitable for Object Discovery).
Slides credit:
Marc Bolaños

Detection: Objects

Detection: Objects
DPM (HOG features)[1] R-CNN [2] SPPnet [3]
Hand-crafted features Deep features
+60 %
Slide credit:
Amaia Salvador

Detection: Objects
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are
Convolutional Neural Networks." CVPR 2015
Convnets (CNNs) actually learn similar detectors to the ones learned by
Deformable Parts-based Models (DPMs)

Detection: Objects: R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate
object detection and semantic segmentation. CVPR 2014

Slide credit:
Joost van de Weijer

Detection: Objects: Fast R-CNN
Girshick, Ross. "Fast R-CNN." ICCV 2015.

Slide credit:
Amaia Salvador

Slide credit:
Amaia Salvador
Same as SPP[3], but single scale

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional
networks for visual recognition." PAMI 2015.
Slide credit:
Joost van de Weijer

Slide credit:
Amaia Salvador
RoI pooling on CONV5: an RoI of size h x w is divided into H’ x W’ bins.
Size of pooling bins: h/H’ x w/W’
Max pooling is applied within each bin.
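A minimal NumPy sketch of RoI max pooling may help to read the bin sizes above. It assumes a single-channel feature map and an RoI given as end-exclusive (y0, x0, y1, x1) coordinates on that map; the function name `roi_max_pool` and the rounding of bin edges are illustrative choices, not the Fast R-CNN implementation.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool an RoI (y0, x0, y1, x1; end-exclusive) into out_h x out_w bins."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Bin edges: each bin covers roughly h/out_h rows and w/out_w columns.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Make sure every bin contains at least one cell.
            y_end = max(ys[i + 1], ys[i] + 1)
            x_end = max(xs[j + 1], xs[j] + 1)
            pooled[i, j] = region[ys[i]:y_end, xs[j]:x_end].max()
    return pooled
```

Whatever the size of the RoI, the output has a fixed size (out_h x out_w), which is what lets the fully connected layers that follow accept regions of any shape.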

Slide credit:
Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]

Slide credit:
Amaia Salvador
Multi-task loss

Detection: Objects: Faster R-CNN
Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster R-CNN: Towards real-time
object detection with region proposal networks. In Advances in Neural Information
Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit:
Amaia Salvador
Selective Search CPMC
MCG
Object Proposal computation is the bottleneck in
current state of the art object detection systems
Selective Search. Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object
recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC. Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision
and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG. Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Slide credit:
Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals
with a Region Proposal Network (RPN).

Slide credit:
Amaia Salvador
[Diagram: Conv layers → Conv Layer 5 → RPN → RPN Proposals; Conv Layer 5 + RPN Proposals → RoI pooling layer → FC layers → Class scores → Class probabilities]

Slide credit:
Amaia Salvador
Objectness scores
(object/no object)
Bounding Box Regression
In practice, k = 9 (3 different scales and 3 aspect ratios)
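The k = 9 anchors per location can be sketched as follows. The box areas (128², 256² and 512² pixels, obtained here as a base size of 16 times scales 8, 16 and 32) and the aspect ratios 1:2, 1:1 and 2:1 follow the defaults reported in the Faster R-CNN paper; the function name `make_anchors` and the exact parametrisation are assumptions made for illustration.

```python
import numpy as np


def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors as (x1, y1, x2, y2), centred at the origin."""
    anchors = []
    for scale in scales:
        area = float(base_size * scale) ** 2
        for ratio in ratios:  # ratio = height / width
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)
```

At every position of the conv feature map, the RPN predicts an objectness score and a bounding box regression for each of these k reference boxes, shifted to that position.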

Slide credit:
Amaia Salvador
Fast R-CNN

Slide credit:
Amaia Salvador
4-step training to share features for RPN and Fast R-CNN

Slide credit:
Amaia Salvador
Step 1: Train RPN initialized with an ImageNet pre-trained model.
[Diagram: Conv layers (ImageNet weights, fine-tuned) → Conv Layer 5 → RPN → RPN Proposals]

Slide credit:
Amaia Salvador
Step 2: Train Fast R-CNN with learned RPN proposals.
[Diagram: Conv layers (ImageNet weights, fine-tuned) → Conv Layer 5; with RPN Proposals (learned in 1) → Class probabilities]

Slide credit:
Amaia Salvador
Step 3: The model trained in 2 is used to initialize RPN and train again.
[Diagram: Conv layers (weights from Step 2, fixed) → Conv Layer 5 → RPN → RPN Proposals]

Slide credit:
Amaia Salvador
Step 4: Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3.
[Diagram: Conv layers (weights from Steps 2 & 3, fixed) → Conv Layer 5; with RPN Proposals (learned in 3) → Class probabilities]

Slide credit:
Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)

Detection: Objects: Reinforcement L.
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV
2015 [Slides by Miriam Bellver]

Detection: Objects: Reinforcement L.
Object is localized based on visual features from AlexNet FC6.

Detection: Objects: Reinforcement L.
Slide credit:
Míriam Bellver
Set of actions A
Transformation actions

Set of actions A
Terminates the sequence of the current search
Marks the region, inhibition-of-return (IoR)

Set of states S
(o,h)
o = feature vector from pre-trained CNN fc6: 4096 dim
h = history of taken actions: binary vector, 90 dim

Reward Function R
ground-truth bounding box

Reward Function R for trigger action
The reward function considers the number of steps as a cost.
Trigger reward magnitude: 3
Minimum IoU: 0.6
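The two reward functions can be sketched as below. The sketch assumes that 3 is the magnitude of the trigger reward and 0.6 the minimum IoU with the ground-truth bounding box, and that a transformation action is rewarded with the sign of the change in IoU (treating "no change" as negative is an additional simplification). The function names are hypothetical.

```python
def move_reward(iou_before, iou_after):
    """Reward for a transformation action: +1 if the IoU improved, -1 otherwise."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(final_iou, eta=3.0, min_iou=0.6):
    """Reward for the trigger action: +eta if the final box is good enough, else -eta."""
    return eta if final_iou >= min_iou else -eta
```

For instance, `trigger_reward(0.65)` gives 3.0, while triggering on a box with IoU 0.4 gives -3.0.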

Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning

The action-value function is estimated using a neural network:
● the network has as many output units as actions
● the algorithm incorporates a replay memory to collect experiences
● a category-specific Q-network is used
Policy of the agent: select the action A with the maximum estimated value of the
learnt action-value function.
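The ingredients listed above (one output per action, a replay memory, and a greedy policy) can be sketched in a few lines. This is only an illustration under stated assumptions: the class `ReplayMemory`, the functions `greedy_action` and `q_target`, the capacity and the discount factor `gamma` are hypothetical, and the Q-network itself is replaced by a plain list of action values.

```python
import random
from collections import deque

import numpy as np


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sample without replacement, at most as many items as are stored.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated action value."""
    return int(np.argmax(q_values))


def q_target(reward, next_q_values, gamma=0.9):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * float(np.max(next_q_values))
```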

Dataset for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended Regions (AAR)
2) Terminal regions (TR)

Best performance with
few region proposals


Detection: Faces

Detection: Faces: DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face
Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]

Detection: Faces: DDFD: Train
Dataset
● Source: Annotated Facial Landmarks in the Wild by TU Graz
● 25k annotated faces on images downloaded from Flickr.
● 380k manually annotated facial landmarks.

Detection: Faces: DDFD: Train
● Randomly sampled sub-windows (blocks):
○ Positive example if the Intersection-over-Union (IoU) with an annotated
face is larger than 50%.
○ Negative example otherwise.
● Total samples: 200K positive and 20M negative.

Detection: Faces: DDFD: Test
Test images are rescaled up/down 3 times per octave to detect faces of different sizes.
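Rescaling 3 times per octave means that consecutive scales differ by a factor of 2^(1/3). A minimal sketch of the resulting list of scale factors is given below; the function name `pyramid_scales` and the scale range are assumptions chosen for the example.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced 2**(1/steps_per_octave) apart within [min_scale, max_scale]."""
    factor = 2.0 ** (1.0 / steps_per_octave)
    scales = []
    scale = min_scale
    # Small tolerance so that max_scale itself is not lost to rounding.
    while scale <= max_scale * (1 + 1e-9):
        scales.append(scale)
        scale *= factor
    return scales
```

Between half and double resolution this yields seven scales (two octaves, three steps each, plus the starting scale).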

Sliding window of 227x227 over the test image.
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows
processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic
Segmentation." CVPR 2015

● This makes it possible to:
○ Efficiently run the convnet on images of any size.
○ Obtain a heat-map of the face detector.
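The equivalence between a fully connected layer and a convolution can be checked with a small NumPy sketch. It assumes a single-output FC layer whose weights are reshaped to (k, k, C), the size of the feature map it was trained on; sliding those weights over a larger map gives one score per position, that is, a heat map. The function name `fc_as_conv_heatmap` and the toy sizes are hypothetical.

```python
import numpy as np


def fc_as_conv_heatmap(feature_map, fc_weights, fc_bias=0.0):
    """Slide an FC layer (weights of shape k x k x C) over an H x W x C feature map."""
    k = fc_weights.shape[0]
    height, width, _ = feature_map.shape
    heat = np.empty((height - k + 1, width - k + 1))
    for y in range(height - k + 1):
        for x in range(width - k + 1):
            window = feature_map[y:y + k, x:x + k, :]
            # Same dot product the FC layer would compute on this window.
            heat[y, x] = np.sum(window * fc_weights) + fc_bias
    return heat
```

On an input of exactly k x k x C the result is a single value, identical to the FC output; on a larger input, the same weights produce the detector heat map without cropping each window separately.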

● Non-Maximum Suppression (NMS) to avoid overlapped detections.
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
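A minimal sketch of greedy Non-Maximum Suppression is given below. It assumes boxes in (x1, y1, x2, y2) format with positive area and an IoU threshold of 0.5; both the function name `non_max_suppression` and the threshold are illustrative and not taken from the DDFD software.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_thr=0.5):
    """Greedy NMS over (N, 4) boxes; returns the indices of the kept detections."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Overlap of the best box with all remaining boxes.
        ix1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        iy1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        ix2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        iy2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop the boxes that overlap too much with the one just kept.
        order = rest[iou <= iou_thr]
    return keep
```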

Detection: Faces: DDFD: Results

Detection: Faces: DDFD: Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models.
- OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during
training.

Faces: Recognition: FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face
Recognition and Clustering." CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar.)

Faces
Euclidean space
where distances
correspond to
face similarity
FaceNet

End-to-end learning of an embedding (distance metric learning)...
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor
classification." The Journal of Machine Learning Research 10 (2009): 207-244

...by means of well chosen triplets, using curriculum learning.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th annual international
conference on machine learning, pp. 41-48. ACM, 2009
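The triplet loss that drives the embedding can be sketched as follows. The formula (squared Euclidean distances between anchor, positive and negative, with a margin) is the one described in the FaceNet paper, where the margin is set to 0.2; the function name `triplet_loss` is an assumption, and the selection of "well chosen" triplets is not covered by this sketch.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0) for each triplet of embeddings."""
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so easy triplets contribute nothing to training.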

Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer
Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces: Recognition: FaceNet: Test
LFW: 99.63% (new record)
YouTubeFaces DB: 95.12%

Faces: Recognition: FaceNet: Software
Software implementation: OpenFace

Faces: Recognition: VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition."
BMVC 2015. [software]

Objects: Recognition: Retrieval
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués, “Bags of Local
Convolutional Features for Scalable Instance Search”, ICMR 2016

Image Database
Visual Query
“A dog”
Expected outcome:

Image Database
Visual Query
“This dog”
Expected outcome:

...
Instance Retrieval
(Instance: Object, Building, Person, Place…)

Local hand-crafted features (e.g. SIFT)
N-Dimensional feature space
v1 = (v11, …, v1n)
...
vk = (vk1, …, vkn)
Bag of Visual Words: High-dimensional, Highly sparse
INVERTED FILE
word    Image ID
1       1, 12,
2       1, 30, 102
3       10, 12
4       2, 3
6       10
...
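Because bag-of-visual-words vectors are highly sparse, an inverted file only stores, for each visual word, the images that contain it. The sketch below illustrates the idea with plain dictionaries; the function names and the ranking by number of shared words are simplifying assumptions (real systems usually apply tf-idf weighting and normalisation).

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual word ids."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        # Only the images listed under this word are visited.
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

A query only touches the lists of the words it contains, so images that share no visual word with it are never compared.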

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In
Advances in neural information processing systems (pp. 1097-1105).
Convolutional Neural Networks

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In
DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv
preprint arXiv:1512.04065.
sum/max pooled conv features as global representation
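Sum or max pooling of a convolutional layer collapses its spatial dimensions into one vector per image. A minimal sketch follows, assuming an (H, W, C) activation tensor and a final L2 normalisation (a common choice, stated here as an assumption); the function name `pooled_descriptor` is hypothetical.

```python
import numpy as np


def pooled_descriptor(conv_features, mode="sum"):
    """Turn (H, W, C) conv activations into an L2-normalised C-dim global descriptor."""
    flat = conv_features.reshape(-1, conv_features.shape[-1])
    vec = flat.sum(axis=0) if mode == "sum" else flat.max(axis=0)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```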

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
conv features encoded with VLAD as global representation

Image resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
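The encoding step can be sketched as a nearest-centroid assignment: every local conv feature becomes a visual word, and the image is described by the histogram of its words. The sketch uses brute-force distances and tiny sizes for readability; with 25K centroids an approximate nearest-neighbour search would normally be needed. The function name `bow_encode` is hypothetical.

```python
import numpy as np


def bow_encode(local_features, centroids):
    """Assign (N, D) local features to (K, D) centroids; return words and histogram."""
    diffs = local_features[:, None, :] - centroids[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)  # (N, K) squared distances
    words = sq_dist.argmin(axis=1)      # visual word of each local feature
    hist = np.bincount(words, minlength=len(centroids))
    return words, hist
```

Keeping the per-location word assignments, and not only the histogram, preserves the spatial layout of the visual words within the feature map.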

Query Representation
Global Search (GS)
Local Search (LS)

Objects: Segmentation
Slide credit:
Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image

Objects: Segmentation: Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features
for scene labeling." TPAMI 2013

Pyramid of three spatial scales.

The same parameters in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linear: tanh
Pooling: max

Upsampling and concatenation.

Pixel-wise soft-max classifier
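A pixel-wise soft-max simply normalises the class scores independently at every pixel. A minimal NumPy sketch, assuming an (H, W, num_classes) score tensor; the function name `pixelwise_softmax` is hypothetical.

```python
import numpy as np


def pixelwise_softmax(scores):
    """Convert (H, W, num_classes) scores into per-pixel class probabilities."""
    # Subtract the per-pixel maximum for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)
```

Taking the `argmax` over the last axis then gives one label per pixel. Each pixel is decided on its own, with no interaction between neighbouring predictions.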

Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels
Prediction with a 2-layer network

Solution 2: Superpixels + CRF
Prediction with a 2-layer network

Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level.
BPT (Binary Partition Tree) [Garrido, Salembier]

Problems with Solutions 1 & 2: Observation level.
Contribution: Automatically discover the best
observation level (optimal cover) for each pixel in the
image.
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component
C_i is the one along the path between
the leaf and the root with minimal cost S.
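The search for the optimal component of each pixel can be sketched as a walk from the leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree encoding (a `parent` dictionary), the node names and the cost values used in the example are all hypothetical.

```python
def optimal_components(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost S.

    parent: dict node -> parent node (the root maps to None)
    cost:   dict node -> cost S of that component
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]  # move one level up the tree
        best[leaf] = best_node
    return best
```

With a toy tree in which pixels `p1` and `p2` hang from component `C5`, and `C5` has the lowest cost on both paths, the two pixels are covered by `C5` and take its class.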

Objects: Segmentation: SDS
Slide credit:
Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik, Simultaneous Detection and Segmentation (ECCV 2014)

Slide credit:
Eduard Fontdevila
● Interest in obtaining segments, not just bounding boxes
● Multiscale combinatorial grouping (MCG) to generate object candidates
○ Cuts algorithm
○ Hierarchical segmenter
○ Grouping strategy to combine
multiscale regions

Slide credit:
Eduard Fontdevila
Bounding box → BBOX CNN → feature vector 1
Region (background masked out with the mean image) → BBOX CNN* → feature vector 2
Concatenation [1 2] → vector A
*Finetuned to classify bboxes (with background), so extracting features from the region foreground is
suboptimal
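Masking out the background with the mean image can be sketched with a single NumPy call. The sketch assumes a cropped (H, W, 3) box, a binary (H, W) region mask and a mean image that is either a full (H, W, 3) array or a per-channel vector; the function name `mask_background` and the values in the usage example are hypothetical.

```python
import numpy as np


def mask_background(crop, region_mask, mean_image):
    """Keep the pixels inside the region; replace the rest with the mean image."""
    mask = np.asarray(region_mask, dtype=bool)[..., None]  # (H, W, 1), broadcast over channels
    return np.where(mask, crop, mean_image)
```

After subtracting the mean image, as is usual before feeding a convnet, the masked background becomes zero, so only the region foreground contributes to the features.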

Slide credit:
Eduard Fontdevila
● Training: 2 networks trained in isolation
● Testing: results are combined
Bounding box → BBOX CNN → feature vector 1
Region → REGION CNN → feature vector 2
Concatenation [1 2] → vector B

Slide credit:
Eduard Fontdevila
● Training: as a whole (using segmentation overlap)
● Testing: results are combined (using the output of the penultimate layer)
vector C

Slide credit:
Eduard Fontdevila
Penultimate fully connected layer → SVM

Slide credit:
Eduard Fontdevila

Slide credit:
Eduard Fontdevila
● Results on pixel IU (Jaccard index) to evaluate semantic segmentation:
○ Convert the output of the final system (C+ref) into a pixel-level
category labeling (using pasting scheme, Carreira et al)
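The pixel IU (Jaccard index) compares, class by class, the set of pixels predicted as that class with the set annotated as that class. A minimal sketch, assuming integer label maps of the same shape; the function name `pixel_iu` and the use of NaN for classes absent from both maps are assumptions.

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Per-class pixel intersection over union between two label maps."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        # A class absent from both maps has no defined IU.
        ious.append(inter / union if union > 0 else float("nan"))
    return ious
```

Averaging the per-class values (ignoring NaN entries, for example with `np.nanmean`) gives a single mean IU score per image or dataset.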

Slide credit:
Eduard Fontdevila

Thank you!
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
