Deep Learning for Computer Vision: Image Retrieval (UPC 2016)

Image Retrieval
Day 3 Lecture 6
Eva Mohedano

Content Based Image Retrieval
Given a query image, rank all dataset images by their similarity to the query.

Classification
Query: this chair. Results: images from the dataset classified as “chair”.

Retrieval
Query: this chair. Results: similar images.

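Retrieval Pipeline
Query image → image representation $v = (v_1, \dots, v_n)$
Image dataset → image representations $v_1 = (v_{11}, \dots, v_{1n}), \dots, v_k = (v_{k1}, \dots, v_{kn})$
Image matching: a similarity metric (Euclidean distance or cosine similarity) scores each dataset image against the query.
Ranking list: dataset images sorted by similarity score (e.g. 0.98, 0.97, ..., 0.10, 0.01).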
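The matching and ranking steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming every image is already described by a fixed-length vector; the function names and toy vectors are hypothetical.

```python
import numpy as np


def rank_by_cosine(query, database):
    """Rank dataset images by cosine similarity to the query (most similar first)."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                  # one cosine similarity per dataset image
    order = np.argsort(-scores)      # indices sorted by descending similarity
    return order, scores[order]


def rank_by_euclidean(query, database):
    """Rank dataset images by Euclidean distance to the query (closest first)."""
    dists = np.linalg.norm(database - query, axis=1)
    order = np.argsort(dists)        # small distance = high similarity
    return order, dists[order]


if __name__ == "__main__":
    # Hypothetical 3-D representations: one query and four dataset images.
    query = np.array([1.0, 0.0, 1.0])
    database = np.array([
        [0.9, 0.1, 1.1],
        [0.0, 1.0, 0.0],
        [1.0, 0.0, 0.8],
        [-1.0, 0.5, -1.0],
    ])
    print(rank_by_cosine(query, database))
    print(rank_by_euclidean(query, database))
```

On these toy vectors the two metrics disagree on the top result: cosine similarity ranks image 2 first, Euclidean distance ranks image 0 first. Once all vectors are L2-normalized the two rankings coincide, because for unit vectors $\|a - b\|^2 = 2 - 2\cos(a, b)$.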

Retrieval Pipeline
k feature vectors per image: $v_1 = (v_{11}, \dots, v_{1n}), \dots, v_k = (v_{k1}, \dots, v_{kn})$
Bag of Visual Words: the N-dimensional feature space is partitioned into M visual words (M clusters).
INVERTED FILE
word   Image IDs
1      1, 12
2      1, 30, 102
3      10, 12
4      2, 3
6      10
...
- Large vocabularies (50k-1M visual words)
- Very fast!
- Typically used with SIFT features
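A minimal sketch of an inverted file, assuming each image has already been quantized into visual-word IDs. Scoring here simply counts shared visual words; practical systems typically add tf-idf weighting, which this sketch omits. The function names are hypothetical, and the toy assignments are chosen to reproduce the table above.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted IDs of the images that contain it.

    image_words: dict {image_id: iterable of visual-word IDs}
    """
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


def query_inverted_file(inverted, query_words):
    """Score images by how many distinct visual words they share with the query.

    Only the posting lists of the query's own words are visited, so images
    sharing no word with the query are never touched.
    """
    scores = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, []):
            scores[image_id] += 1
    # Highest score first; ties broken by image ID for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical word assignments that reproduce the inverted file above.
    images = {1: [1, 2], 12: [1, 3], 30: [2], 102: [2], 10: [3, 6], 2: [4], 3: [4]}
    inverted = build_inverted_file(images)
    print(inverted)
    print(query_inverted_file(inverted, [1, 3]))
```

Because a query only visits the posting lists of its own words, the cost of a lookup grows with the length of those lists rather than with the size of the dataset, which is what makes large vocabularies fast.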

CNN for retrieval
(Figure: CNNs applied to classification, object detection and segmentation.)

Off-the-shelf CNN representations
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In CVPRW
Activations of fully connected (FC) layers used as a global image representation

Sum/max pooling of convolutional features over spatial locations (one value per filter)
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. In ICCV
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
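The two pooling operations can be sketched as follows for a single conv feature map, assuming a (C, H, W) layout (filters first). This is plain sum and max pooling; the weighting schemes proposed in the cited papers (e.g. the cross-dimensional weighting of Kalantidis et al.) are not included, and the function names are hypothetical.

```python
import numpy as np


def sum_pool(conv_map):
    """Sum each filter's activations over all spatial positions: (C, H, W) -> (C,)."""
    return conv_map.sum(axis=(1, 2))


def max_pool(conv_map):
    """Keep each filter's maximum activation: (C, H, W) -> (C,)."""
    return conv_map.max(axis=(1, 2))


def l2_normalize(x, eps=1e-12):
    """Scale a descriptor to unit L2 norm (eps avoids division by zero)."""
    return x / (np.linalg.norm(x) + eps)


if __name__ == "__main__":
    # Hypothetical conv activations: 512 filters on a 30x40 spatial grid.
    conv_map = np.random.default_rng(0).random((512, 30, 40))
    print(l2_normalize(sum_pool(conv_map)).shape)   # (512,)
    print(l2_normalize(max_pool(conv_map)).shape)   # (512,)
```

Either way the descriptor length equals the number of filters, independent of the input image size.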

Descriptors from convolutional layers

R-MAC: Regional Maximum Activation of Convolutions
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879.
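A simplified R-MAC sketch: max-pool the feature map inside square regions at several scales, L2-normalize each regional vector, sum them and L2-normalize the result. Two assumptions to flag: the region sampling below is a simplified evenly spaced grid rather than the exact scheme of Tolias et al., and the PCA-whitening applied to regional vectors in the paper is omitted. Function names are hypothetical.

```python
import numpy as np


def l2n(x, eps=1e-12):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def square_regions(h, w, levels=3):
    """Yield (y, x, side) square regions at `levels` scales.

    Simplified sampling (an assumption, not the exact scheme of Tolias et al.):
    at scale `level` the side is 2*min(h, w)/(level + 1) and level + 1 evenly
    spaced start positions are used along each axis, so neighbouring regions overlap.
    """
    for level in range(1, levels + 1):
        side = max(1, int(2 * min(h, w) / (level + 1)))
        ys = np.unique(np.linspace(0, h - side, level + 1).round().astype(int))
        xs = np.unique(np.linspace(0, w - side, level + 1).round().astype(int))
        for y in ys:
            for x in xs:
                yield int(y), int(x), side


def rmac(conv_map, levels=3):
    """Simplified R-MAC for a (C, H, W) feature map -> (C,) global descriptor."""
    c, h, w = conv_map.shape
    total = np.zeros(c)
    for y, x, side in square_regions(h, w, levels):
        region = conv_map[:, y:y + side, x:x + side]
        total += l2n(region.max(axis=(1, 2)))   # regional MAC, L2-normalized
    return l2n(total)                           # sum of regions, L2-normalized


if __name__ == "__main__":
    # Hypothetical conv activations: 256 filters on a 24x32 grid.
    conv_map = np.random.default_rng(0).random((256, 24, 32))
    print(rmac(conv_map).shape)   # (256,)
```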

BoW, VLAD encoding of conv features
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In CVPRW
Mohedano, E., Salvador, A., McGuinness, K., Marques, F., O’Connor, N., & Giro-i-Nieto, X. (2016). Bags of Local Convolutional Features for Scalable Instance Search. In ICMR

(Figure: image at (336x256) resolution → conv5_1 from VGG16 (42x32) → quantized with 25K centroids → 25K-D vector.)
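The BoW encoding of conv features can be sketched by treating every spatial position of a (C, H, W) conv map (such as conv5_1 of VGG16) as a C-dimensional local descriptor, assigning each one to its nearest centroid, and counting the assignments. The centroids are assumed to be learned beforehand (for example with k-means); toy sizes and function names are hypothetical, and any term weighting used in the paper is omitted. The reshaped assignment map keeps the spatial layout, which can be used for object localization.

```python
import numpy as np


def conv_map_to_local_features(conv_map):
    """Turn a (C, H, W) conv map into H*W local descriptors of dimension C."""
    c, h, w = conv_map.shape
    return conv_map.reshape(c, h * w).T   # row y*W + x = descriptor at (y, x)


def assign_to_centroids(features, centroids):
    """Index of the nearest centroid (squared Euclidean) for each local feature."""
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2, computed for all pairs at once.
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)


def bow_vector(conv_map, centroids):
    """Return the visual-word histogram (one bin per centroid) and the (H, W) assignment map."""
    c, h, w = conv_map.shape
    words = assign_to_centroids(conv_map_to_local_features(conv_map), centroids)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist, words.reshape(h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conv_map = rng.random((512, 21, 16))     # hypothetical conv activations
    centroids = rng.random((100, 512))       # hypothetical 100-word vocabulary
    hist, assignment_map = bow_vector(conv_map, centroids)
    print(hist.shape, assignment_map.shape)  # (100,) (21, 16)
```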


(Figure: results on Paris Buildings 6k, Oxford Buildings 5k and TRECVID Instance Search 2013, subset of 23k frames.)
[7] Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Mohedano, E., Salvador, A., McGuinness, K., Marques, F., O’Connor, N., & Giro-i-Nieto, X. (2016). Bags of Local Convolutional Features for Scalable Instance Search. In ICMR

CNN representations
- L2 normalization + PCA whitening + L2 normalization
- Cosine similarity
- Convolutional features perform better than fully connected features
- Convolutional features keep spatial information → retrieval + object location
- Convolutional layers allow arbitrary input sizes
- If labeled data are available, fine-tuning the network on the target image domain improves the CNN representations
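The post-processing listed above can be sketched as follows: learn a PCA-whitening projection from a set of L2-normalized descriptors (typically a held-out collection), then apply L2 normalization, whitening and L2 normalization again, so that cosine similarity becomes a plain dot product. Function names, the output dimension and the small epsilon terms are assumptions made for this sketch.

```python
import numpy as np


def l2n(x, eps=1e-12):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fit_pca_whitening(train, dim):
    """Learn a mean and a PCA-whitening projection from (N, D) training descriptors.

    The descriptors are L2-normalized first, matching the query-time order.
    """
    x = l2n(train)
    mean = x.mean(axis=0)
    centered = x - mean
    cov = centered.T @ centered / (len(x) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)                 # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:dim]                  # `dim` largest components
    proj = eigvecs[:, top] / np.sqrt(eigvals[top] + 1e-9)  # unit variance per axis
    return mean, proj


def postprocess(desc, mean, proj):
    """L2 normalization -> PCA whitening -> L2 normalization."""
    return l2n((l2n(desc) - mean) @ proj)


def cosine_scores(query, database):
    """For unit-norm descriptors, cosine similarity is a dot product."""
    return database @ query


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 64))      # hypothetical held-out descriptors
    mean, proj = fit_pca_whitening(train, dim=32)
    db = postprocess(rng.normal(size=(5, 64)), mean, proj)
    print(cosine_scores(db[0], db))
```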

Learning representations for retrieval
Siamese network: a network that learns a function mapping input patterns into a target space such that the L2 distance in the target space approximates the semantic distance in the input space.
Applied in:
- Dimensionality reduction [1]
- Face verification [2]
- Learning local image representations [3]
[1] Song, H. O., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In CVPR
[2] Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In CVPR
[3] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., & Moreno-Noguer, F. (2014). Fracking deep convolutional image descriptors. CoRR, abs/1412.6537
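To make the Siamese training objective concrete, here is a common margin-based contrastive loss for a batch of embedding pairs, written in NumPy. This is an illustrative formulation under that assumption; the exact losses used in [1]-[3] may differ, and the function name and default margin are hypothetical.

```python
import numpy as np


def contrastive_loss(x1, x2, same, margin=1.0):
    """Margin-based contrastive loss for a batch of embedding pairs.

    x1, x2: (B, D) embeddings from the two weight-sharing branches.
    same:   (B,) array, 1 if the pair is semantically similar, 0 otherwise.
    Similar pairs are pulled together (d^2); dissimilar pairs are pushed
    apart until their distance exceeds `margin` (max(0, margin - d)^2).
    """
    d = np.linalg.norm(x1 - x2, axis=1)
    pos = same * d ** 2
    neg = (1 - same) * np.maximum(0.0, margin - d) ** 2
    return 0.5 * np.mean(pos + neg)
```

Dissimilar pairs that are already farther apart than the margin contribute nothing, so training effort concentrates on similar pairs that are too far apart and dissimilar pairs that are too close.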

(Figure: Siamese network. Image from: Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., & Moreno-Noguer, F. (2014). Fracking deep convolutional image descriptors. CoRR, abs/1412.6537)

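Siamese Network with Triplet Loss: the loss pulls the query (anchor) towards the positive and pushes it away from the negative, until the query-negative distance exceeds the query-positive distance by a margin.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering. In CVPR
(Figure: three CNN branches with shared weights w embed the anchor a, positive p and negative n into an L2 embedding space, where the triplet loss is computed.)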
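A minimal NumPy sketch of the triplet loss in the FaceNet form, using squared L2 distances and a margin `alpha` (0.2 here, an assumed default). It averages over the batch, whereas FaceNet sums over triplets; triplet selection (mining hard negatives) is not shown, and the function name is hypothetical.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """FaceNet-style triplet loss on (B, D) embeddings with squared L2 distances.

    A triplet contributes zero loss once the negative is at least `alpha`
    farther (in squared distance) from the anchor than the positive is.
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=1)   # anchor-positive distances
    d_an = np.sum((anchor - negative) ** 2, axis=1)   # anchor-negative distances
    return np.mean(np.maximum(0.0, d_ap - d_an + alpha))


if __name__ == "__main__":
    a = np.array([[1.0, 0.0]])
    p = np.array([[0.8, 0.6]])
    n = np.array([[0.0, 1.0]])
    print(triplet_loss(a, p, n))   # well-separated triplet: zero loss
    print(triplet_loss(a, n, p))   # roles swapped: positive loss
```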

Deep Image Retrieval: Learning global representations for image search, Gordo A. et al. Xerox Research Centre, 2016
- R-MAC representation
- Learning descriptors for retrieval with a three-stream siamese network trained with a ranking objective
- Learning where to pool within an image: predicting object locations
- Local features (from predicted ROIs) pooled into a more discriminative space (learned FC layer)
- Building and cleaning a dataset to generate triplets
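Assuming the standard margin-based form, the ranking objective for a triplet made of a query descriptor $q$, a relevant image descriptor $d^+$ and a non-relevant image descriptor $d^-$, with margin $m$, can be written as:

```latex
L(q, d^+, d^-) = \tfrac{1}{2}\,\max\!\left(0,\; m + \lVert q - d^+ \rVert^2 - \lVert q - d^- \rVert^2\right)
```

The loss is zero once the non-relevant image is farther from the query than the relevant one by at least $m$ in squared L2 distance, so only triplets that violate the desired ranking produce gradients.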

Dataset: Landmarks dataset
● 214K images of 672 famous landmark sites.
● Dataset processing based on a matching baseline: SIFT with the Hessian-Affine keypoint detector.
● Selecting the “useful” triplets is important.

Comparison between training for classification (C) and training for ranking (R)


Summary
Pre-trained CNNs are useful for generating image descriptors for retrieval
Convolutional layers allow us to encode local information
Ranking images by similarity is the primary task in retrieval
Designing CNN architectures that learn how to rank
