Dimensionality Reduction by Learning an Invariant Mapping

Raia Hadsell Sumit Chopra Yann LeCun


Courant Institute of Mathematical Sciences
New York University
New York, NY, USA
{raia, sumit, yann}@cs.nyu.edu

Abstract

Dimensionality reduction involves mapping a set of high dimensional input points onto a low dimensional manifold so that "similar" points in input space are mapped to nearby points on the manifold. We present a method - called Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) - for learning a globally coherent non-linear function that maps the data evenly to the output manifold. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space. The method can learn mappings that are invariant to certain transformations of the inputs, as is demonstrated with a number of experiments. Comparisons are made to other techniques, in particular LLE.

1 Introduction

Modern applications have steadily expanded their use of complex, high dimensional data. The massive, high dimensional image datasets generated by biology, earth science, astronomy, robotics, modern manufacturing, and other domains of science and industry demand new techniques for analysis, feature extraction, dimensionality reduction, and visualization.

Dimensionality reduction aims to translate high dimensional data to a low dimensional representation such that similar input objects are mapped to nearby points on a manifold. Most existing dimensionality reduction techniques have two shortcomings. First, they do not produce a function (or a mapping) from input to manifold that can be applied to new points whose relationship to the training points is unknown. Second, many methods presuppose the existence of a meaningful (and computable) distance metric in the input space.

For example, Locally Linear Embedding (LLE) [13] linearly combines input vectors that are identified as neighbors. The applicability of LLE and similar methods to image data is limited because linearly combining images only makes sense for images that are perfectly registered and very similar. Laplacian Eigenmaps [2] and Hessian LLE [8] do not require a meaningful metric in input space (they merely require a list of neighbors for every sample), but as with LLE, new points whose relationships with training samples are unknown cannot be processed. Out-of-sample extensions to several dimensionality reduction techniques have been proposed that allow for consistent embedding of new data samples without recomputation of all samples [3]. These extensions, however, assume the existence of a computable kernel function that is used to generate the neighborhood matrix. This dependence is reducible to the dependence on a computable distance metric in input space.

Another limitation of current methods is that they tend to cluster points in output space, sometimes densely enough to be considered degenerate solutions. Rather, it is sometimes desirable to find manifolds that are uniformly covered by samples.

The method proposed in the present paper, called Dimensionality Reduction by Learning an Invariant Mapping (DrLIM), provides a solution to the above problems. DrLIM is a method for learning a globally coherent non-linear function that maps the data to a low dimensional manifold. The method presents four essential characteristics:

• It only needs neighborhood relationships between training samples. These relationships could come from prior knowledge or manual labeling, and be independent of any distance metric.

• It may learn functions that are invariant to complicated non-linear transformations of the inputs such as lighting changes and geometric distortions.

• The learned function can be used to map new samples not seen during training, with no prior knowledge.

• The mapping generated by the function is in some sense "smooth" and coherent in the output space.

A contrastive loss function is employed to learn the parameters W of a parameterized function G_W, in such a way that neighbors are pulled together and non-neighbors are pushed apart. Prior knowledge can be used to identify the neighbors for each training data point.

The method uses an energy based model that uses the given neighborhood relationships to learn the mapping function. For a family of functions G, parameterized by W, the objective is to find a value of W that maps a set of high dimensional inputs to the manifold such that the euclidean distance between points on the manifold, D_W(X_1, X_2) = ||G_W(X_1) - G_W(X_2)||_2, approximates the "semantic similarity" of the inputs in input space, as provided by a set of neighborhood relationships. No assumption is made about G_W except that it is differentiable with respect to W.

1.1 Previous Work

The problem of mapping a set of high dimensional points onto a low dimensional manifold has a long history. The two classical methods for the problem are Principal Component Analysis (PCA) [7] and Multi-Dimensional Scaling (MDS) [6]. PCA involves the projection of inputs to a low dimensional subspace that maximizes the variance. In MDS, one computes the projection that best preserves the pairwise distances between input points. However, both methods - PCA in general and MDS in the classical scaling case (when the distances are euclidean) - generate a linear embedding.

In recent years there has been a lot of activity in designing non-linear spectral methods for the problem. These methods involve solving the eigenvalue problem for a particular matrix. Recently proposed algorithms include ISOMAP (2000) by Tenenbaum et al. [1], Locally Linear Embedding - LLE (2000) by Roweis and Saul [13], Laplacian Eigenmaps (2003) due to Belkin and Niyogi [2], and Hessian LLE (2003) by Donoho and Grimes [8]. All the above methods have three main steps. The first is to identify a list of neighbors of each point. Second, a gram matrix is computed using this information. Third, the eigenvalue problem is solved for this matrix. None of these methods attempt to compute a function that could map a new, unknown data point without recomputing the entire embedding and without knowing its relationships to the training points. Out-of-sample extensions to the above methods have been proposed by Bengio et al. in [3], but they too rely on a predetermined computable distance metric.

Along a somewhat different line, Schölkopf et al. in 1998 [11] proposed a non-linear extension of PCA, called Kernel PCA. The idea is to non-linearly map the inputs to a high dimensional feature space and then extract the principal components. The algorithm first expresses the PCA computation solely in terms of dot products and then exploits the kernel trick to implicitly compute the high dimensional mapping. In recent work, Weinberger et al. in [10] attempt to learn the kernel matrix when the high dimensional input lies on a low dimensional manifold by formulating the problem as a semidefinite program. There are also related algorithms for clustering due to Shi and Malik [12] and Ng et al. [15].

The proposed approach is different from these methods; it learns a function that is capable of consistently mapping new points unseen during training. In addition, this function is not constrained by simple distance measures in the input space. The learning architecture is somewhat similar to the one discussed in [4, 5].

Section 2 describes the general framework, the loss function, and draws an analogy with a mechanical spring system. The ideas in this section are made concrete in section 3, where various experimental results are given.

2 Learning the Low Dimensional Mapping

The problem is to find a function that maps high dimensional input patterns to lower dimensional outputs, given neighborhood relationships between samples in input space. The graph of neighborhood relationships may come from an information source that may not be available for test points, such as prior knowledge, manual labeling, etc. More precisely, given a set of input vectors I = {X_1, ..., X_P}, where X_i ∈ R^D, ∀i = 1, ..., P, find a parametric function G_W : R^D → R^d with d ≪ D, such that it has the following properties:

1. Simple distance measures in the output space (such as euclidean distance) should approximate the neighborhood relationships in the input space.

2. The mapping should not be constrained to implementing simple distance measures in the input space and should be able to learn invariances to complex transformations.

3. It should be faithful even for samples whose neighborhood relationships are unknown.

2.1 The Contrastive Loss Function

Consider the set I of high dimensional training vectors X_i. Assume that for each X_i ∈ I there is a set S_{X_i} of training vectors that are deemed similar to X_i. This set can be computed by some prior knowledge - invariance to distortions or temporal proximity, for instance - which does not depend on a simple distance. A meaningful mapping from high to low dimensional space maps similar input vectors to nearby points on the output manifold and dissimilar vectors to distant points. A new loss function whose minimization can produce such a function is now introduced.
Unlike loss functions that sum over samples, this loss function runs over pairs of samples. Let X_1, X_2 ∈ I be a pair of input vectors shown to the system. Let Y be a binary label assigned to this pair: Y = 0 if X_1 and X_2 are deemed similar, and Y = 1 if they are deemed dissimilar. Define the parameterized distance function to be learned, D_W, between X_1 and X_2 as the euclidean distance between the outputs of G_W. That is,

    D_W(X_1, X_2) = ||G_W(X_1) - G_W(X_2)||_2                                  (1)

To shorten notation, D_W(X_1, X_2) is written D_W. Then the loss function in its most general form is

    L(W) = Σ_{i=1}^{P} L(W, (Y, X_1, X_2)^i)                                   (2)

    L(W, (Y, X_1, X_2)^i) = (1 - Y) L_S(D_W^i) + Y L_D(D_W^i)                  (3)

where (Y, X_1, X_2)^i is the i-th labeled sample pair, L_S is the partial loss function for a pair of similar points, L_D the partial loss function for a pair of dissimilar points, and P the number of training pairs (which may be as large as the square of the number of samples).

L_S and L_D must be designed such that minimizing L with respect to W would result in low values of D_W for similar pairs and high values of D_W for dissimilar pairs. The exact loss function is

    L(W, Y, X_1, X_2) = (1 - Y) (1/2) (D_W)^2 + Y (1/2) {max(0, m - D_W)}^2    (4)

where m > 0 is a margin. The margin defines a radius around G_W(X). Dissimilar pairs contribute to the loss function only if their distance is within this radius (see figure 1).

The contrastive term involving dissimilar pairs, L_D, is crucial. Simply minimizing D_W(X_1, X_2) over the set of all similar pairs will usually lead to a collapsed solution, since D_W and the loss L could then be made zero by setting G_W to a constant. Most energy-based models require the use of an explicit contrastive term in the loss function.

Figure 1. Graph of the loss function L against the energy D_W. The dashed (red) line is the loss function for the similar pairs and the solid (blue) line is for the dissimilar pairs; the dissimilar loss falls to zero at the margin m.
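As a concrete illustration of equation (4), the sketch below implements the contrastive loss for a batch of labeled pairs. PyTorch is an assumed framework here (the paper predates it), and the module name and default margin value are illustrative; minimizing the batch mean corresponds to summing equation (2) over the P training pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveLoss(nn.Module):
    """Equation (4): L = (1 - Y) * 1/2 * D_W^2 + Y * 1/2 * max(0, m - D_W)^2,
    with Y = 0 for similar pairs and Y = 1 for dissimilar pairs."""

    def __init__(self, margin: float = 1.0):
        super().__init__()
        self.margin = margin  # m > 0, the radius inside which dissimilar pairs contribute

    def forward(self, out1: torch.Tensor, out2: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # D_W of equation (1): euclidean distance between the mapped outputs G_W(X_1), G_W(X_2).
        d = F.pairwise_distance(out1, out2, p=2)
        loss_similar = 0.5 * d.pow(2)                                        # L_S pulls similar pairs together
        loss_dissimilar = 0.5 * torch.clamp(self.margin - d, min=0).pow(2)   # L_D pushes dissimilar pairs apart
        return ((1 - y) * loss_similar + y * loss_dissimilar).mean()
```

Note that dissimilar pairs whose distance already exceeds the margin contribute nothing to the loss, exactly as figure 1 shows.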
2.2 Spring Model Analogy

An analogy to a particular mechanical spring system is given to provide an intuition of what is happening when the loss function is minimized. The outputs of G_W can be thought of as masses attracting and repelling each other with springs. Consider the equation of a spring

    F = -KX                                                                    (5)

where F is the force, K is the spring constant and X is the displacement of the spring from its rest length. A spring is attract-only if its rest length is equal to zero. Thus any positive displacement X will result in an attractive force between its ends. A spring is said to be m-repulse-only if its rest length is equal to m. Thus two points that are connected with an m-repulse-only spring will be pushed apart if X is less than m. However, this spring has the special property that if it is stretched by a length X > m, no attractive force brings it back to rest length. Each point is connected to other points using these two kinds of springs. Seen in the light of the loss function, each point is connected by attract-only springs to similar points, and is connected by m-repulse-only springs to dissimilar points. See figure 2.

Consider the loss function L_S(W, X_1, X_2) associated with similar pairs,

    L_S(W, X_1, X_2) = (1/2) (D_W)^2                                           (6)

The loss function L is minimized using the stochastic gradient descent algorithm. The gradient of L_S is

    ∂L_S/∂W = D_W ∂D_W/∂W                                                      (7)

Comparing equations 5 and 7, it is clear that the gradient ∂L_S/∂W of L_S gives the attractive force between the two points. ∂D_W/∂W defines the spring constant K of the spring, and D_W, which is the distance between the two points, gives the perturbation X of the spring from its rest length. Clearly, even a small value of D_W will generate a gradient (force) to decrease D_W. Thus the similar loss function corresponds to the attract-only spring (figure 2).

Now consider the partial loss function L_D,

    L_D(W, X_1, X_2) = (1/2) (max{0, m - D_W})^2                               (8)
When D_W > m, ∂L_D/∂W = 0. Thus there is no gradient (force) on the two points that are dissimilar and are at a distance D_W > m. If D_W < m, then

    ∂L_D/∂W = -(m - D_W) ∂D_W/∂W                                               (9)

Again, comparing equations 5 and 9, it is clear that the dissimilar loss function L_D corresponds to the m-repulse-only spring: its gradient gives the force of the spring, ∂D_W/∂W gives the spring constant K, and (m - D_W) gives the perturbation X. The negative sign denotes the fact that the force is repulsive only. Clearly the force is maximum when D_W = 0 and absent when D_W = m. See figure 2.

Here, especially in the case of L_S, one might think that simply making D_W = 0 for all attract-only springs would put the system in equilibrium. Consider, however, figure 2e. Suppose b1 is connected to b2 and b3 with attract-only springs. Then decreasing D_W between b1 and b2 will increase D_W between b1 and b3. Thus by minimizing the global loss function L over all springs, one would ultimately drive the system to its equilibrium state.

Figure 2. The spring system. The solid circles represent points that are similar to the point in the center. The hollow circles represent dissimilar points. The springs are shown as red zigzag lines. The forces acting on the points are shown as blue arrows; the length of each arrow approximately gives the strength of the force. In the two plots on the right side, the x-axis is the distance D_W and the y-axis is the value of the loss function. (a) A point connected to similar points with attract-only springs. (b) The loss function and gradient for similar pairs. (c) A point connected only to dissimilar points inside the circle of radius m with m-repulse-only springs. (d) The loss function and gradient for dissimilar pairs. (e) A point pulled by other points in different directions, creating equilibrium.

2.3 The Algorithm

The algorithm first generates the training set, then trains the machine.

Step 1: For each input sample X_i, do the following:
  (a) Using prior knowledge, find the set of samples S_{X_i} = {X_j}_{j=1}^{p}, such that X_j is deemed similar to X_i.
  (b) Pair the sample X_i with all the other training samples and label the pairs so that Y_ij = 0 if X_j ∈ S_{X_i}, and Y_ij = 1 otherwise (see the sketch following this algorithm).
Combine all the pairs to form the labeled training set.

Step 2: Repeat until convergence:
  (a) For each pair (X_i, X_j) in the training set:
      i. If Y_ij = 0, then update W to decrease D_W = ||G_W(X_i) - G_W(X_j)||_2
      ii. If Y_ij = 1, then update W to increase D_W = ||G_W(X_i) - G_W(X_j)||_2

This increase and decrease of euclidean distances in the output space is done by minimizing the above loss function.
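To make Step 1 concrete, here is one way it could be coded when the prior knowledge is simply k-nearest-neighborhood in input space (as in the first experiment of section 3.2). The use of scikit-learn's NearestNeighbors and the dense P x P label matrix are illustrative assumptions; in practice the pairs would normally be sampled rather than fully enumerated.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def build_pair_labels(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Step 1: label every pair (i, j) with Y_ij.

    Y_ij = 0 if X_j belongs to S_{X_i} (here: the k nearest neighbors of X_i),
    Y_ij = 1 otherwise. Self-pairs (i, i) are not used during training."""
    n = X.shape[0]
    # k + 1 neighbors because each sample is returned as its own nearest neighbor.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = knn.kneighbors(X, return_distance=False)[:, 1:]  # drop the self-match

    Y = np.ones((n, n), dtype=np.int8)   # start with every pair labeled dissimilar
    rows = np.repeat(np.arange(n), k)
    Y[rows, idx.ravel()] = 0             # mark the k neighbors of each sample as similar
    return Y
```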
3 Experiments

The experiments presented in this section demonstrate the invariances afforded by our approach and also clarify the limitations of techniques such as LLE. First we give details of the parameterized machine G_W that learns the mapping function.

3.1 Training Architecture

The learning architecture is similar to the one used in [4] and [5]. Called a siamese architecture, it consists of two copies of the function G_W which share the same set of parameters W, and a cost module. A loss module whose input is the output of this architecture is placed on top of it. The input to the entire system is a pair of images (X_1, X_2) and a label Y. The images are passed through the two copies of the function, yielding the two outputs G_W(X_1) and G_W(X_2). The cost module then computes the distance D_W(G_W(X_1), G_W(X_2)). The loss function combines D_W with the label Y to produce the scalar loss L_S or L_D, depending on the label Y. The parameters W are updated using stochastic gradient descent. The gradients can be computed by back-propagation through the loss, the cost, and the two instances of G_W. The total gradient is the sum of the contributions from the two instances.
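A minimal training step for this siamese arrangement could look as follows. PyTorch is again an assumed framework, the tiny fully connected G_W and the learning rate are placeholders, and the loss line is simply equation (4) written inline; the point is that the two branches are literally the same module, so the shared weights and the summed gradient over both instances come for free from automatic differentiation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder G_W: any differentiable map R^D -> R^d would do here.
G_W = nn.Sequential(nn.Linear(784, 20), nn.Tanh(), nn.Linear(20, 2))
optimizer = torch.optim.SGD(G_W.parameters(), lr=0.01)
margin = 1.0


def train_step(x1: torch.Tensor, x2: torch.Tensor, y: torch.Tensor) -> float:
    """One stochastic gradient step on a batch of labeled pairs (X_1, X_2, Y)."""
    optimizer.zero_grad()
    out1, out2 = G_W(x1), G_W(x2)             # two instances of G_W sharing the same parameters W
    d = F.pairwise_distance(out1, out2, p=2)  # the cost module: D_W
    loss = ((1 - y) * 0.5 * d.pow(2)
            + y * 0.5 * torch.clamp(margin - d, min=0).pow(2)).mean()
    loss.backward()                           # back-propagation through loss, cost and both branches
    optimizer.step()
    return loss.item()


# Example call with random data: a batch of 32 pairs of 784-dimensional inputs.
loss_value = train_step(torch.randn(32, 784), torch.randn(32, 784),
                        torch.randint(0, 2, (32,)).float())
```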
The experiments involving airplane images from the NORB dataset [9] use a 2-layer fully connected neural network as G_W, with 20 hidden units and 3 output units. Experiments on the MNIST dataset used a convolutional network as G_W (figure 3). Convolutional networks are trainable, non-linear learning machines that operate at the pixel level and learn low-level features and high-level representations in an integrated manner. They are trained end-to-end to map images to outputs. Because of their structure of shared weights and multiple layers, they can learn optimal shift-invariant local feature detectors while maintaining invariance to geometric distortions of the input image.

Figure 3. Architecture of the function G_W (a convolutional network) which was learned to map the MNIST data to a low dimensional manifold with invariance to shifts.

The layers of the convolutional network comprise a convolutional layer C1 with 15 feature maps, a subsampling layer S2, a second convolutional layer C3 with 30 feature maps, and a fully connected layer F3 with 2 units. The kernel sizes for C1 and C3 were 6x6 and 9x9, respectively.
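Translated into code, and assuming 28x28 MNIST inputs, 2x2 max-pooling for the subsampling layer S2 and tanh activations (none of which are specified above), the quoted layer sizes give roughly the following module. It is a sketch of the architecture, not the authors' original implementation.

```python
import torch
import torch.nn as nn


class GWConvNet(nn.Module):
    """G_W for the MNIST experiments: C1 (15 maps, 6x6 kernels), S2 (subsampling),
    C3 (30 maps, 9x9 kernels), F3 (2 output units)."""

    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(1, 15, kernel_size=6)   # 1x28x28 -> 15x23x23
        self.s2 = nn.MaxPool2d(2)                   # 15x23x23 -> 15x11x11
        self.c3 = nn.Conv2d(15, 30, kernel_size=9)  # 15x11x11 -> 30x3x3
        self.f3 = nn.Linear(30 * 3 * 3, 2)          # flatten -> 2D manifold coordinates

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.tanh(self.c1(x))
        x = self.s2(x)
        x = torch.tanh(self.c3(x))
        return self.f3(x.flatten(1))


# A batch of 8 MNIST images maps to 8 points in the 2D output space.
coords = GWConvNet()(torch.randn(8, 1, 28, 28))   # shape (8, 2)
```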

3.2 Learned Mapping of MNIST samples

The first experiment is designed to establish the basic functionality of the DrLIM approach. The neighborhood graph is generated with euclidean distances and no prior knowledge.

The training set is built from 3000 images of the handwritten digit 4 and 3000 images of the handwritten digit 9 chosen randomly from the MNIST dataset. Approximately 1000 images of each digit comprised the test set. These images were shuffled, paired, and labeled according to a simple euclidean distance measure: each sample X_i was paired with its 5 nearest neighbors, producing the set S_{X_i}. All other possible pairs were labeled dissimilar.

The mapping of the test set to a 2D manifold is shown in figure 4. The lighter-colored blue dots are 9's and the darker-colored red dots are 4's. Several input test samples are shown next to their manifold positions. The 4's and 9's are in two somewhat overlapping regions, with an overall organization that is primarily determined by the slant angle of the samples. The samples are spread rather uniformly in the populated region.

Figure 4. Experiment demonstrating the effectiveness of DrLIM in a trivial situation with MNIST digits. A euclidean nearest neighbor metric is used to create the local neighborhood relationships among the training samples, and a mapping function is learned with a convolutional network. The figure shows the placement of the test samples in output space. Even though the neighborhood relationships among these samples are unknown, they are well organized and evenly distributed on the 2D manifold.

3.3 Learning a Shift-Invariant Mapping of MNIST samples

In this experiment, the DrLIM approach is evaluated using 2 categories of MNIST, distorted by adding samples that have been horizontally translated. The objective is to learn a 2D mapping that is invariant to horizontal translations.

In the distorted set, 3000 images of 4's and 3000 images of 9's are horizontally translated by -6, -3, 3, and 6 pixels and combined with the originals, producing a total of 30,000 samples. The 2000 samples in the test set were distorted in the same way.

First the system was trained using pairs from a euclidean distance neighborhood graph (5 nearest neighbors per sample), as in experiment 1. The large distances between translated samples create a disjoint neighborhood relationship graph, and the resulting mapping is disjoint as well. The output points are clustered according to the translated position of the input sample (figure 5). Within each cluster, however, the samples are well organized and evenly distributed.

Figure 5. This experiment shows the effect of a simple distance-based mapping on MNIST data with horizontal translations added (-6, -3, +3, and +6 pixels). Since translated samples are far apart, the manifold has 5 distinct clusters of samples corresponding to the 5 translations. Note that the clusters are individually well-organized, however. Results are on test samples, unseen during training.

For comparison, the LLE algorithm was used to map the distorted MNIST using the same euclidean distance neighborhood graph.
The result was a degenerate embedding in which differently registered samples were completely separated (figure 6). Although there is sporadic local organization, there is no global coherence in the embedding.

Figure 6. LLE's embedding of the distorted MNIST set with horizontal translations added. Most of the untranslated samples are tightly clustered at the top right corner, and the translated samples are grouped at the sides of the output.

In order to make the mapping function invariant to translation, the euclidean nearest neighbors were supplemented with pairs created using prior knowledge. Each sample was paired with (a) its 5 nearest neighbors, (b) its 4 translations, and (c) the 4 translations of each of its 5 nearest neighbors. Additionally, each of the sample's 4 translations was paired with (d) all the above nearest neighbors and translated samples. All other possible pairs are labeled as dissimilar; a sketch of this pairing scheme is given below.
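One way to picture the pairing scheme is the sketch below, which collects the "similar" partners of a single original sample i. The dictionaries neighbors (the 5 nearest neighbors of each original image) and translations (the indices of its 4 shifted copies) are hypothetical placeholders standing in for whatever index bookkeeping the dataset uses.

```python
from typing import Dict, List, Set, Tuple


def similar_pairs_for(i: int,
                      neighbors: Dict[int, List[int]],
                      translations: Dict[int, List[int]]) -> Set[Tuple[int, int]]:
    """Unordered index pairs labeled similar (Y = 0) around sample i:
    (a) i with its nearest neighbors, (b) i with its own translations,
    (c) i with the translations of its neighbors, and (d) each translation
    of i with all of the above. Every other pair is labeled dissimilar."""
    group = [i] + translations[i]                       # the sample and its shifted copies
    partners = set(neighbors[i]) | set(translations[i])
    for j in neighbors[i]:
        partners |= set(translations[j])                # translations of each neighbor
    pairs = set()
    for a in group:
        for b in partners | set(group):
            if a != b:
                pairs.add((min(a, b), max(a, b)))       # store each unordered pair once
    return pairs
```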

The mapping of the test set samples is shown in figure 7. The lighter-colored blue dots are 4's and the darker-colored red dots are 9's. As desired, there is no organization on the basis of translation; in fact, translated versions of a given character are all tightly packed in small regions on the manifold.

Figure 7. This experiment measured DrLIM's success at learning a mapping from high-dimensional, shifted digit images to a 2D manifold. The mapping is invariant to translations of the input images. The mapping is well-organized and globally coherent. Results shown are the test samples, whose neighborhood relations are unknown. Similar characters are mapped to nearby areas, regardless of their shift.

3.4 Mapping Learned with Temporal Neighborhoods and Lighting Invariance

The final experiment demonstrates dimensionality reduction on a set of images of a single object. The object is an airplane from the NORB [9] dataset with uniform backgrounds. There are a total of 972 images of the airplane under various poses around the viewing half-sphere, and under various illuminations. The views have 18 azimuths (every 20 degrees around the circle), 9 elevations (from 30 to 70 degrees every 5 degrees), and 6 lighting conditions (4 lights in various on-off combinations).
The objective is to learn a globally coherent mapping to a 3D manifold that is invariant to lighting conditions. A pattern based on temporal continuity of the camera was used to construct a neighborhood graph: images are similar if they were taken from contiguous elevations or azimuths, regardless of lighting. Images may therefore be neighbors even if they are very distant in terms of euclidean distance in pixel space, due to different lighting.
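A possible reading of this neighborhood rule is sketched below: each image is described by an (azimuth index, elevation index, lighting index) triple, and two images are paired as similar when their camera poses coincide or are adjacent in azimuth or elevation, whatever the lighting. The triple encoding and the adjacency test are assumptions, since the exact construction is not spelled out in the text.

```python
from typing import List, Tuple

Pose = Tuple[int, int, int]   # (azimuth index 0..17, elevation index 0..8, lighting index 0..5)


def are_neighbors(p: Pose, q: Pose) -> bool:
    """Similar (Y = 0) if the camera poses are identical or contiguous in azimuth
    or elevation, regardless of the lighting condition."""
    (az1, el1, _), (az2, el2, _) = p, q
    d_az = min(abs(az1 - az2), 18 - abs(az1 - az2))   # azimuth wraps around the circle (18 steps of 20 degrees)
    d_el = abs(el1 - el2)
    return (d_az <= 1 and d_el == 0) or (d_az == 0 and d_el <= 1)


def similar_pairs(poses: List[Pose]) -> List[Tuple[int, int]]:
    """Enumerate the index pairs labeled similar; every other pair gets Y = 1."""
    return [(i, j) for i in range(len(poses))
            for j in range(i + 1, len(poses))
            if are_neighbors(poses[i], poses[j])]
```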
The dataset was split into 660 training images and 312 test images. The result of training on all 10,989 similar pairs and 206,481 dissimilar pairs is a 3-dimensional manifold in the shape of a cylinder (see figure 8). The circumference of the cylinder corresponds to change in azimuth in input space, while the height of the cylinder corresponds to elevation in input space. The mapping is completely invariant to lighting. This outcome is quite remarkable: using only local neighborhood relationships, the learned manifold corresponds globally to the positions of the camera as it produced the dataset.

Figure 8. Test set results: the DrLIM approach learned a mapping to 3D space for images of a single airplane (extracted from the NORB dataset). The output manifold is shown under five different viewing angles. The manifold is roughly cylindrical, with a systematic organization: the azimuth of the camera in the viewing half-sphere varies along the circumference, and the camera elevation varies along the height. The mapping is invariant to the lighting condition, thanks to the prior knowledge built into the neighborhood relationships.

Viewing the weights of the network helps explain how the mapping learned illumination invariance (see figure 9). The concentric rings match edges on the airplanes to a particular azimuth and elevation, and the rest of the weights are close to 0. The dark edges and shadow of the wings, for example, are relatively consistent regardless of lighting.

Figure 9. The weights of the 20 hidden units of a fully-connected neural network trained with DrLIM on airplane images from the NORB dataset. Since the camera rotates 360 degrees around the airplane and the mapping must be invariant to lighting, the weights are zero except where they detect edges at each azimuth and elevation; hence the concentric patterns.

For comparison, the same neighborhood relationships defined by the prior knowledge in this experiment were used to create an embedding using LLE. Although arbitrary neighborhoods can be used in the LLE algorithm, the algorithm computes linear reconstruction weights to embed the samples, which severely limits the desired effect of using distant neighbors. The embedding produced by LLE is shown in figure 10. Clearly, the 3D embedding is not invariant to lighting, and the organization of azimuth and elevation does not reflect the real topology of the neighborhood graph.

Figure 10. 3D embedding of the NORB images by the LLE algorithm. The neighborhood graph was constructed to create invariance to lighting, but the linear reconstruction weights of LLE force it to organize the embedding by lighting. The shape of the embedding resembles a folded paper: the top image shows the 'v' shape of the fold and the lower image looks into the valley of the fold.

4 Discussion and Future Work

The experiments presented here demonstrate that, unless prior knowledge is used to create invariance, variabilities such as lighting and registration can dominate and distort the outcome of dimensionality reduction. The proposed approach, DrLIM, offers a solution: it is able to learn an invariant mapping to a low dimensional manifold using prior knowledge. The complexity of the invariances that can be learned is only limited by the power of the parameterized function G_W.
The function maps inputs that evenly cover a manifold, as can be seen in the experimental results. It also faithfully maps new, unseen samples to meaningful locations on the manifold.

The strength of DrLIM lies in the contrastive loss function. By using a separate loss function for similar and dissimilar pairs, the system avoids collapse to a constant function and maintains an equilibrium in output space, much as a mechanical system of interconnected springs does.

The experiments with LLE show that LLE is most useful where the input samples are locally very similar and well-registered. If this is not the case, then LLE may give degenerate results. Although it is possible to run LLE with arbitrary neighborhood relationships, the linear reconstruction of the samples negates the effect of very distant neighbors. Other dimensionality reduction methods have avoided this limitation, but none produces a function that can accept new samples without recomputation or prior knowledge.

Creating a dimensionality reduction mapping using prior knowledge has other uses. Given the success of the NORB experiment, in which the positions of the camera were learned from prior knowledge of the temporal connections between images, it may be feasible to learn a robot's position and heading from image sequences.

References

[1] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 15(6):1373-1396, 2003.
[3] Y. Bengio, J. F. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In S. Thrun, L. K. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems, 16. MIT Press, Cambridge, MA, 2004.
[4] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah. Signature verification using a siamese time delay neural network. In J. Cowan and G. Tesauro, editors, Advances in Neural Information Processing Systems, 1993.
[5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-05), 1:539-546, 2005.
[6] T. Cox and M. Cox. Multidimensional Scaling. Chapman and Hall, London, 1994.
[7] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[8] D. L. Donoho and C. E. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high dimensional data. Proceedings of the National Academy of Sciences, 100:5591-5596, 2003.
[9] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), 2:97-104, 2004.
[10] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 839-846, 2004.
[11] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[12] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 888-905, 2000.
[13] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[14] P. Vincent and Y. Bengio. A neural support vector network architecture with adaptive kernels. In Proceedings of the International Joint Conference on Neural Networks, 5, July 2000.
[15] A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14:849-856, 2002.
