Hadsell et al. - Dimensionality Reduction by Learning an Invariant Mapping
Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06)
A contrastive loss function is employed to learn the parameters W of a parameterized function G_W, in such a way that neighbors are pulled together and non-neighbors are pushed apart. Prior knowledge can be used to identify the neighbors for each training data point.

The method uses an energy based model that uses the given neighborhood relationships to learn the mapping function. For a family of functions G, parameterized by W, the objective is to find a value of W that maps a set of high dimensional inputs to the manifold such that the euclidean distance between points on the manifold, D_W(X_1, X_2) = ||G_W(X_1) − G_W(X_2)||_2, approximates the "semantic similarity" of the inputs in input space, as provided by a set of neighborhood relationships. No assumption is made about G_W except that it is differentiable with respect to W.

1.1 Previous Work

The problem of mapping a set of high dimensional points onto a low dimensional manifold has a long history. The two classical methods for the problem are Principal Component Analysis (PCA) [7] and Multi-Dimensional Scaling (MDS) [6]. PCA involves the projection of inputs to a low dimensional subspace that maximizes the variance. In MDS, one computes the projection that best preserves the pairwise distances between input points. However both methods - PCA in general and MDS in the classical scaling case (when the distances are euclidean distances) - generate a linear embedding.

In recent years there has been a lot of activity in designing non-linear spectral methods for the problem. These methods involve solving the eigenvalue problem for a particular matrix. Recently proposed algorithms include ISOMAP (2000) by Tenenbaum et al. [1], Local Linear Embedding - LLE (2000) by Roweis and Saul [13], Laplacian Eigenmaps (2003) due to Belkin and Niyogi [2], and Hessian LLE (2003) by Donoho and Grimes [8]. All the above methods have three main steps. The first is to identify a list of neighbors of each point. Second, a gram matrix is computed using this information. Third, the eigenvalue problem is solved for this matrix. None of these methods attempt to compute a function that could map a new, unknown data point without recomputing the entire embedding and without knowing its relationships to the training points. Out-of-sample extensions to the above methods have been proposed by Bengio et al. in [3], but they too rely on a predetermined computable distance metric.

Along a somewhat different line, Schölkopf et al. in 1998 [11] proposed a non-linear extension of PCA, called Kernel PCA. The idea is to non-linearly map the inputs to a high dimensional feature space and then extract the principal components. The algorithm first expresses the PCA computation solely in terms of dot products and then exploits the kernel trick to implicitly compute the high dimensional mapping. In recent work, Weinberger et al. in [10] attempt to learn the kernel matrix when the high dimensional input lies on a low dimensional manifold by formulating the problem as a semidefinite program. There are also related algorithms for clustering due to Shi and Malik [12] and Ng et al. [15].

The proposed approach is different from these methods; it learns a function that is capable of consistently mapping new points unseen during training. In addition, this function is not constrained by simple distance measures in the input space. The learning architecture is somewhat similar to the one discussed in [4, 5].

Section 2 describes the general framework, the loss function, and draws an analogy with a mechanical spring system. The ideas in this section are made concrete in section 3, where various experimental results are given.

2 Learning the Low Dimensional Mapping

The problem is to find a function that maps high dimensional input patterns to lower dimensional outputs, given neighborhood relationships between samples in input space. The graph of neighborhood relationships may come from an information source that may not be available for test points, such as prior knowledge, manual labeling, etc. More precisely, given a set of input vectors I = {X_1, . . . , X_P}, where X_i ∈ R^D for all i = 1, . . . , P, find a parametric function G_W : R^D → R^d with d ≪ D, such that it has the following properties:

1. Simple distance measures in the output space (such as euclidean distance) should approximate the neighborhood relationships in the input space.

2. The mapping should not be constrained to implementing simple distance measures in the input space and should be able to learn invariances to complex transformations.

3. It should be faithful even for samples whose neighborhood relationships are unknown.

2.1 The Contrastive Loss Function

Consider the set I of high dimensional training vectors X_i. Assume that for each X_i ∈ I there is a set S_{X_i} of training vectors that are deemed similar to X_i. This set can be computed by some prior knowledge - invariance to distortions or temporal proximity, for instance - which does not depend on a simple distance. A meaningful mapping from high to low dimensional space maps similar input vectors to nearby points on the output manifold and dissimilar vectors to distant points. A new loss function whose minimization can produce such a function is now introduced.
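Before turning to the loss function, the setup of this section can be sketched in a few lines of NumPy. This is only an illustration of the definitions above: the linear form chosen for G_W, the dimensions, and the random weights are placeholders, not the architecture used in the experiments.

```python
import numpy as np

# Illustrative parametric mapping G_W : R^D -> R^d (here a plain linear map;
# the framework only requires G_W to be differentiable with respect to W).
D_in, d_out = 784, 2                       # placeholder dimensions (e.g. image -> 2D)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d_out, D_in))

def G(W, x):
    """Map one high dimensional input vector to the low dimensional output space."""
    return W @ x

def D_W(W, x1, x2):
    """Euclidean distance between the two mapped outputs, ||G_W(x1) - G_W(x2)||_2."""
    return np.linalg.norm(G(W, x1) - G(W, x2))

# Property 3 in action: a new, unseen sample is embedded with a single forward
# pass through G_W; the training set does not have to be re-embedded.
x_new = rng.normal(size=D_in)
print(G(W, x_new), D_W(W, x_new, rng.normal(size=D_in)))
```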
Unlike loss functions that sum over samples, this loss function runs over pairs of samples. Let X_1, X_2 ∈ I be a pair of input vectors shown to the system. Let Y be a binary label assigned to this pair. Y = 0 if X_1 and X_2 are deemed similar, and Y = 1 if they are deemed dissimilar. Define the parameterized distance function to be learned, D_W, between X_1, X_2 as the euclidean distance between the outputs of G_W. That is,

D_W(X_1, X_2) = ||G_W(X_1) − G_W(X_2)||_2    (1)

To shorten notation, D_W(X_1, X_2) is written D_W. Then the loss function in its most general form is

L(W) = Σ_{i=1}^{P} L(W, (Y, X_1, X_2)^i)    (2)

L(W, (Y, X_1, X_2)^i) = (1 − Y) L_S(D_W^i) + Y L_D(D_W^i)    (3)

where (Y, X_1, X_2)^i is the i-th labeled sample pair, L_S is the partial loss function for a pair of similar points, L_D the partial loss function for a pair of dissimilar points, and P the number of training pairs (which may be as large as the square of the number of samples).

L_S and L_D must be designed such that minimizing L with respect to W would result in low values of D_W for similar pairs and high values of D_W for dissimilar pairs. The exact loss function used here is

L(W, Y, X_1, X_2) = (1 − Y) (1/2) (D_W)^2 + Y (1/2) {max(0, m − D_W)}^2    (4)

where m > 0 is a margin. The margin defines a radius around G_W(X); dissimilar pairs contribute to the loss function only if their distance is within this radius (see figure 1).

[Figure 1: plot of the loss L against the distance D_W for similar and dissimilar pairs.]

The contrastive term involving dissimilar pairs, L_D, is crucial. Simply minimizing D_W(X_1, X_2) over the set of all similar pairs will usually lead to a collapsed solution, since D_W and the loss L could then be made zero by setting G_W to a constant. Most energy-based models require the use of an explicit contrastive term in the loss function.

2.2 Spring Model Analogy

An analogy to a particular mechanical spring system is given to provide an intuition of what is happening when the loss function is minimized. The outputs of G_W can be thought of as masses attracting and repelling each other with springs. Consider the equation of a spring

F = −K X    (5)

where F is the force, K is the spring constant and X is the displacement of the spring from its rest length. A spring is attract-only if its rest length is equal to zero. Thus any positive displacement X will result in an attractive force between its ends. A spring is said to be m-repulse-only if its rest length is equal to m. Thus two points that are connected with an m-repulse-only spring will be pushed apart if X is less than m. However this spring has a special property: if the spring is stretched by a length X > m, then no attractive force brings it back to rest length. Each point is connected to other points using these two kinds of springs. Seen in the light of the loss function, each point is connected by attract-only springs to similar points, and is connected by m-repulse-only springs to dissimilar points. See figure 2.

Consider the loss function L_S(W, X_1, X_2) associated with similar pairs.

L_S(W, X_1, X_2) = (1/2) (D_W)^2    (6)
By minimizing the global loss function L over all springs, one would ultimately drive the system to its equilibrium state.
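Equations (1)-(4) and the two kinds of springs can be summarized in a short sketch; the margin value below is an arbitrary placeholder and the function name is illustrative.

```python
import numpy as np

def contrastive_loss(y, g1, g2, m=1.0):
    """Per-pair contrastive loss of equations (3)-(4).

    y  : 0 if the pair is similar, 1 if it is dissimilar
    g1 : G_W(X1), the mapped output of the first sample
    g2 : G_W(X2), the mapped output of the second sample
    m  : margin; dissimilar pairs stop contributing once D_W >= m
    """
    d = np.linalg.norm(g1 - g2)                    # D_W, equation (1)
    l_similar = 0.5 * d ** 2                       # L_S: the attract-only spring
    l_dissimilar = 0.5 * max(0.0, m - d) ** 2      # L_D: the m-repulse-only spring
    return (1 - y) * l_similar + y * l_dissimilar

# The global loss L(W) of equation (2) is the sum of this quantity over all
# P labeled training pairs; minimizing it by gradient descent plays the role
# of letting the spring system relax toward its equilibrium.
```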
The experiments involving airplane images from the NORB dataset [9] use a 2-layer fully connected neural network as G_W. The number of hidden and output units used was 20 and 3 respectively. Experiments on the MNIST dataset used a convolutional network as G_W (figure 3). Convolutional networks are trainable, non-linear learning machines that operate at pixel level and learn low-level features and high-level representations in an integrated manner. They are trained end-to-end to map images to outputs. Because of a structure of shared weights and multiple layers, they can learn optimal shift-invariant local feature detectors while maintaining invariance to geometric distortions of the input image.
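A minimal sketch of such a 2-layer fully connected G_W is shown below. Only the unit counts (20 hidden, 3 output) come from the text; the input size, tanh non-linearity and weight initialization are assumptions made for illustration.

```python
import numpy as np

class TwoLayerGW:
    """2-layer fully connected mapping: input -> 20 hidden units -> 3 outputs."""

    def __init__(self, input_dim, hidden_dim=20, output_dim=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
        self.b2 = np.zeros(output_dim)

    def forward(self, x):
        # tanh hidden layer is an assumption; the paper does not specify it here.
        h = np.tanh(self.W1 @ x + self.b1)
        return self.W2 @ h + self.b2

# Both members of a training pair are passed through the *same* network
# (shared weights W), and the contrastive loss is applied to the two outputs.
net = TwoLayerGW(input_dim=96 * 96)        # placeholder input size
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=96 * 96), rng.normal(size=96 * 96)
d = np.linalg.norm(net.forward(x1) - net.forward(x2))
```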
3.2 Learned Mapping of MNIST samples

The first experiment is designed to establish the basic functionality of the DrLIM approach. The neighborhood graph is generated with euclidean distances and no prior knowledge.

The training set is built from 3000 images of the handwritten digit 4 and 3000 images of the handwritten digit 9 chosen randomly from the MNIST dataset. Approximately 1000 images of each digit comprised the test set. These images were shuffled, paired, and labeled according to a simple euclidean distance measure: each sample X_i was paired with its 5 nearest neighbors, producing the set S_{X_i}. All other possible pairs were labeled dissimilar.

The mapping of the test set to a 2D manifold is shown in figure 4. The lighter-colored blue dots are 9's and the darker-colored red dots are 4's. Several input test samples are shown next to their manifold positions. The 4's and 9's are in two somewhat overlapping regions, with an overall organization that is primarily determined by the slant angle of the samples. The samples are spread rather uniformly in the populated region.

3.3 Learning a Shift-Invariant Mapping of MNIST samples

In this experiment, the DrLIM approach is evaluated using 2 categories of MNIST, distorted by adding samples that have been horizontally translated. The objective is to learn a 2D mapping that is invariant to horizontal translations.

In the distorted set, 3000 images of 4's and 3000 images of 9's are horizontally translated by -6, -3, 3, and 6 pixels and combined with the originals, producing a total of 30,000 samples. The 2000 samples in the test set were distorted in the same way.

First the system was trained using pairs from a euclidean distance neighborhood graph (5 nearest neighbors per sample), as in experiment 1. The large distances between translated samples create a disjoint neighborhood relationship graph, and the resulting mapping is disjoint as well. The output points are clustered according to the translated position of the input sample (figure 5). Within each cluster, however, the samples are well organized and evenly distributed.

For comparison, the LLE algorithm was used to map the distorted MNIST using the same euclidean distance neighborhood graph.
In order to make the mapping function invariant to trans-
lation, the euclidean nearest neighbors were supplemented
with pairs created using prior knowledge. Each sample was
paired with (a) its 5 nearest neighbors, (b) its 4 translations,
and (c) the 4 translations of each of its 5 nearest neighbors.
Additionally, each of the sample’s 4 translations was paired
with (d) all the above nearest neighbors and translated sam-
ples. All other possible pairs are labeled as dissimilar.
The mapping of the test set samples is shown in figure 7. The lighter-colored blue dots are 4's and the darker-colored red dots are 9's. As desired, there is no organization on the basis of translation; in fact, translated versions of a given character are all tightly packed in small regions on the manifold.
Figure 5. This experiment shows the effect of a simple distance-
based mapping on MNIST data with horizontal translations added
(-6, -3, +3, and +6 pixels). Since translated samples are far apart,
the manifold has 5 distinct clusters of samples corresponding to
the 5 translations. Note that the clusters are individually well-
organized, however. Results are on test samples, unseen during
training.
3.4 Mapping Learned with Temporal Neighborhoods and Lighting Invariance

The final experiment uses images of a single airplane instance from the NORB dataset [9], which span a range of camera elevations (sampled every 5 degrees) and azimuths, and 6 lighting conditions (4 lights in various on-off combinations). The objective is to learn a globally coherent mapping to a 3D manifold that is invariant to lighting conditions. A pattern based on temporal continuity of the camera was used to construct a neighborhood graph; images are similar if they were taken from contiguous elevation or azimuth regardless of lighting. Images may be neighbors even if they are very distant in terms of euclidean distance in pixel space, due to different lighting.

The dataset was split into 660 training images and 312 test images. The result of training on all 10989 similar pairs and 206481 dissimilar pairs is a 3-dimensional manifold in the shape of a cylinder (see figure 8). The circumference of the cylinder corresponds to change in azimuth in input space, while the height of the cylinder corresponds to elevation in input space. The mapping is completely invariant to lighting. This outcome is quite remarkable. Using only local neighborhood relationships, the learned manifold corresponds globally to the positions of the camera as it produced the dataset.

Viewing the weights of the network helps explain how the mapping learned illumination invariance (see figure 9). The concentric rings match edges on the airplanes to a particular azimuth and elevation, and the rest of the weights are close to 0. The dark edges and shadow of the wings, for example, are relatively consistent regardless of lighting.

For comparison, the same neighborhood relationships were used to create an embedding using LLE. Although arbitrary neighborhoods can be used in the LLE algorithm, the algorithm computes linear reconstruction weights to embed the samples, which severely limits the desired effect of using distant neighbors. The embedding produced by LLE is shown in figure 10. Clearly, the 3D embedding is not invariant to lighting, and the organization of azimuth and elevation does not reflect the real topology of the neighborhood graph.
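The temporal-continuity labeling rule for this experiment can be sketched as follows; the attribute names and the grid step sizes are assumptions, since the text only states that images taken from contiguous elevation or azimuth are similar regardless of lighting.

```python
def norb_pair_label(cam_a, cam_b, az_step=20, el_step=5):
    """Return Y = 0 (similar) for images taken from contiguous camera positions,
    ignoring the lighting condition entirely; return Y = 1 (dissimilar) otherwise.

    cam_a, cam_b : dicts with 'azimuth' and 'elevation' in degrees (illustrative).
    az_step, el_step : assumed spacing of the camera sampling grid.
    """
    d_az = abs(cam_a["azimuth"] - cam_b["azimuth"])
    d_az = min(d_az, 360 - d_az)                       # azimuth wraps around
    d_el = abs(cam_a["elevation"] - cam_b["elevation"])
    contiguous_azimuth = d_el == 0 and d_az <= az_step
    contiguous_elevation = d_az == 0 and d_el <= el_step
    return 0 if (contiguous_azimuth or contiguous_elevation) else 1

# Example: same viewpoint under two different lighting conditions -> similar (Y = 0).
print(norb_pair_label({"azimuth": 40, "elevation": 35}, {"azimuth": 40, "elevation": 35}))
```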
Figure 8. Test set results: the DrLIM approach learned a mapping to 3D space for images of a single airplane (extracted from the NORB dataset). The output manifold is shown under five different viewing angles. The manifold is roughly cylindrical, with a systematic organization: the azimuth of the camera in the viewing half-sphere varies along the circumference, and the camera elevation in the viewing sphere varies along the height. The mapping is invariant to the lighting condition, thanks to the prior knowledge built into the neighborhood relationships.
The invariances that can be learned are only limited by the power of the parameterized function G_W. The function maps inputs that evenly cover a manifold, as can be seen by the experimental results. It also faithfully maps new, unseen samples to meaningful locations on the manifold.

The strength of DrLIM lies in the contrastive loss function. By using a separate loss function for similar and dissimilar pairs, the system avoids collapse to a constant function and maintains an equilibrium in output space, much as a mechanical system of interconnected springs does.

The experiments with LLE show that LLE is most useful where the input samples are locally very similar and well-registered. If this is not the case, then LLE may give degenerate results. Although it is possible to run LLE with arbitrary neighborhood relationships, the linear reconstruction of the samples negates the effect of very distant neighbors. Other dimensionality reduction methods have avoided this limitation, but none produces a function that can accept new samples without recomputation or prior knowledge.

Creating a dimensionality reduction mapping using prior knowledge has other uses. Given the success of the NORB experiment, in which the positions of the camera were learned from prior knowledge of the temporal connections between images, it may be feasible to learn a robot's position and heading from image sequences.

References

[1] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 15(6):1373–1396, 2003.
[3] Y. Bengio, J. F. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems, 16. MIT Press, Cambridge, MA, 2004.
[4] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah. Signature verification using a siamese time delay neural network. In J. Cowan and G. Tesauro, editors, Advances in Neural Information Processing Systems, 1993.
[5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with applications to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-05), 1:539–546, 2005.
[6] T. Cox and M. Cox. Multidimensional Scaling. Chapman and Hall, London, 1994.
[7] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[8] D. L. Donoho and C. E. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596, 2003.
[9] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), 2:97–104, 2004.
[10] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 839–846, 2004.
[11] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[12] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 888–905, 2000.
[13] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[14] P. Vincent and Y. Bengio. A neural support vector network architecture with adaptive kernels. In Proceedings of the International Joint Conference on Neural Networks, 5, July 2000.
[15] A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14:849–856, 2002.