Abstract—We present a deep convolutional neural network for estimating the relative homography between a pair of images. Our feed-forward network has 10 layers, takes two stacked grayscale images as input, and produces a homography which can be used to map the pixels from the first image to the second.

The traditional homography estimation pipeline is composed of two stages: corner estimation and robust homography estimation. Robustness is introduced into the corner detection stage by returning a large and over-complete set of points, while robustness in the homography estimation step shows up as heavy use of RANSAC or robustification of the squared loss function.

Fig. 1: Deep Image Homography Estimation. HomographyNet is a Deep Convolutional Neural Network which directly produces the Homography relating two images. Our method does not require separate corner detection and homography estimation steps, and all parameters are trained in an end-to-end fashion using a large dataset of labeled images.
II. THE 4-POINT HOMOGRAPHY PARAMETERIZATION

The simplest way to parameterize a homography is with a 3x3 matrix and a fixed scale (see Equation 1):

\[
\begin{pmatrix} u' \\ v' \\ 1 \end{pmatrix} \sim
\begin{pmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{pmatrix}
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \tag{1}
\]

However, if we unroll the 8 (or 9) parameters of the homography into a single vector, we'll quickly realize that we are mixing both rotational and translational terms. For example, the submatrix $[H_{11}\, H_{12};\, H_{21}\, H_{22}]$ represents the rotational terms in the homography, while the vector $[H_{13}\, H_{23}]$ is the translational offset. Balancing the rotational and translational terms as part of an optimization problem is difficult.
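To make Equation 1 concrete, here is a minimal sketch (ours, assuming NumPy; not code from the paper) that maps a pixel through a 3x3 homography and normalizes away the fixed scale:

```python
import numpy as np

def apply_homography(H, u, v):
    """Map pixel (u, v) through a 3x3 homography H (Equation 1).

    H is only defined up to scale, so the homogeneous result is
    normalized by its third coordinate before returning.
    """
    x = H @ np.array([u, v, 1.0])
    return x[0] / x[2], x[1] / x[2]
```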
We found that an alternate parameterization, one based on a single kind of location variable, namely the corner location, is more suitable for our deep homography estimation task. The 4-point parameterization has been used in traditional homography estimation methods [2], and we use it in our modern deep manifestation of the homography estimation problem (see Figure 2). Letting $\Delta u_1 = u'_1 - u_1$ be the u-offset for the first corner, the 4-point parameterization represents the homography as follows:

\[
H_{4\mathrm{point}} =
\begin{pmatrix} \Delta u_1 & \Delta v_1 \\ \Delta u_2 & \Delta v_2 \\ \Delta u_3 & \Delta v_3 \\ \Delta u_4 & \Delta v_4 \end{pmatrix} \tag{2}
\]

Fig. 2: 4-point parameterization. We use the 4-point parameterization of the homography. There exists a 1-to-1 mapping between the 8-dof "corner offset" matrix and the representation of the homography as a 3x3 matrix.
Equivalently to the matrix formulation of the homography, the 4-point parameterization uses eight numbers. In other words, once the displacement of the four corners is known, only a single closed-form transformation is needed to recover the 8-dof homography, and one can easily convert $H_{4\mathrm{point}}$ to $H_{\mathrm{matrix}}$. This can be accomplished in a number of ways; for example, one can use the normalized Direct Linear Transform (DLT) algorithm [9], or the function getPerspectiveTransform() in OpenCV.
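As an illustration of this conversion, the sketch below (our own; the corner ordering and variable names are assumptions) recovers the 3x3 matrix from the four corner offsets using OpenCV's getPerspectiveTransform():

```python
import numpy as np
import cv2

def four_point_to_matrix(corners, offsets):
    """Convert the 4-point parameterization to a 3x3 homography.

    corners: (4, 2) array of corner locations (u_i, v_i).
    offsets: (4, 2) array of displacements (delta_u_i, delta_v_i).
    """
    src = np.float32(corners)
    dst = np.float32(corners + offsets)  # perturbed corners (u_i', v_i')
    return cv2.getPerspectiveTransform(src, dst)
```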
III. DATA GENERATION FOR HOMOGRAPHY ESTIMATION

Training deep convolutional networks from scratch requires a large amount of data. To meet this requirement, we generate a nearly unlimited number of labeled training examples by applying random projective transformations to a large dataset of natural images¹. This procedure is detailed below.

To generate a single training example, we first randomly crop a square patch from the larger image $I$ at position $p$ (we avoid the borders to prevent bordering artifacts later in the data generation pipeline). This random crop is $I_p$. Then, the four corners of Patch A are randomly perturbed by values within the range $[-\rho, \rho]$. The four correspondences define a homography $H^{AB}$. Then, the inverse of this homography $H^{BA} = (H^{AB})^{-1}$ is applied to the large image to produce image $I'$. A second patch $I'_p$ is cropped from $I'$ at position $p$. The two grayscale patches, $I_p$ and $I'_p$, are then stacked channel-wise to create the 2-channel image which is fed directly into our ConvNet. The 4-point parameterization of $H^{AB}$ is then used as the associated ground-truth training label. The process is illustrated in Figure 3.

Managing the training image generation pipeline gives us full control over the kinds of visual effects we want to model. For example, to make our method more robust to motion blur, we can apply such blurs to the image in our training set. If we want the method to be robust to occlusions, we can insert random occluding shapes into our training images. We experimented with in-painting random occluding rectangles into our training images, as a simple mechanism to simulate real occlusions.

¹In our experiments, we used cropped MS-COCO [13] images, although any large-enough dataset could be used for training.
Fig. 3: Training Data Generation. The process for creating a single training example is detailed; see Section III for more information. Step 1: Randomly crop at position $p$; this is Patch A. Step 2: Randomly perturb the four corners of Patch A. Step 3: Compute $H^{AB}$ given these correspondences. Step 4: Apply $(H^{AB})^{-1} = H^{BA}$ to the image, and crop again at position $p$; this is Patch B. Step 5: Stack Patch A and Patch B channel-wise and feed into the Deep Image Homography Network. Set $H^{AB}$ as the target vector.
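A minimal sketch of the Figure 3 pipeline, assuming NumPy and OpenCV (the patch size, border margins, and names here are illustrative; this is not the paper's code):

```python
import numpy as np
import cv2

def make_training_example(image, patch_size=128, rho=32):
    """Generate one (stacked patches, 4-point label) pair as in Fig. 3."""
    h, w = image.shape[:2]
    # Step 1: random crop away from the borders; this is Patch A.
    x = np.random.randint(rho, w - patch_size - rho)
    y = np.random.randint(rho, h - patch_size - rho)
    corners = np.float32([[x, y], [x + patch_size, y],
                          [x + patch_size, y + patch_size],
                          [x, y + patch_size]])
    patch_a = image[y:y + patch_size, x:x + patch_size]
    # Step 2: perturb each corner by values in [-rho, rho].
    offsets = np.random.randint(-rho, rho + 1, size=(4, 2)).astype(np.float32)
    # Step 3: the four correspondences define H_AB.
    h_ab = cv2.getPerspectiveTransform(corners, corners + offsets)
    # Step 4: apply H_BA = (H_AB)^-1 to the image and crop again at p.
    warped = cv2.warpPerspective(image, np.linalg.inv(h_ab), (w, h))
    patch_b = warped[y:y + patch_size, x:x + patch_size]
    # Step 5: stack the grayscale patches channel-wise; the 4-point
    # parameterization (the eight offsets) is the ground-truth label.
    stacked = np.dstack([patch_a, patch_b])
    return stacked, offsets.flatten()
```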
IV. CONVNET MODELS

Our networks use 3x3 convolutional blocks with BatchNorm [10] and ReLUs, and are architecturally similar to Oxford's VGG Net [17] (see Figure 1). Both networks take as input a two-channel grayscale image sized 128x128x2. In other words, the two input images, which are related by a homography, are stacked channel-wise and fed into the network. We use 8 convolutional layers with a max pooling layer (2x2, stride 2) after every two convolutions. The 8 convolutional layers have the following number of filters per layer: 64, 64, 64, 64, 128, 128, 128, 128. The convolutional layers are followed by two fully connected layers. The first fully connected layer has 1024 units. Dropout with a probability of 0.5 is applied after the final convolutional layer and the first fully-connected layer. Our two networks share the same architecture up to the last layer, where the first network produces real-valued outputs and the second network produces discrete quantities (see Figure 4).
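A minimal PyTorch sketch of this architecture (the paper's experiments use Caffe [11]; this re-implementation is ours and only mirrors the description above):

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 convolution + BatchNorm + ReLU, padding preserves resolution.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class HomographyNet(nn.Module):
    """8 conv layers (64,64,64,64,128,128,128,128 filters), 2x2 max
    pooling after every two convolutions, then FC-1024 and an output
    layer: 8 units (regression) or 8x21 = 168 units (classification)."""
    def __init__(self, classification=False):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(2, 64), conv_block(64, 64), nn.MaxPool2d(2, 2),
            conv_block(64, 64), conv_block(64, 64), nn.MaxPool2d(2, 2),
            conv_block(64, 128), conv_block(128, 128), nn.MaxPool2d(2, 2),
            conv_block(128, 128), conv_block(128, 128))
        out_units = 8 * 21 if classification else 8
        self.head = nn.Sequential(
            nn.Dropout(0.5),  # dropout after the final conv layer
            nn.Flatten(),
            nn.Linear(16 * 16 * 128, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.5),  # dropout after the first FC layer
            nn.Linear(1024, out_units))
    def forward(self, x):  # x: (N, 2, 128, 128)
        return self.head(self.features(x))
```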
The regression network directly produces 8 real-valued numbers and uses the Euclidean (L2) loss as the final layer during training. The advantage of this formulation is its simplicity; however, without producing any kind of confidence value for the prediction, such a direct approach could be prohibitive in certain applications.

The classification network uses a quantization scheme, has a softmax at the last layer, and we use the cross entropy loss function during training. While quantization means that there is some inherent quantization error, the network is able to produce a confidence for each of the corners produced by the method. We chose to use 21 quantization bins for each of the 8 output dimensions, which results in a final layer with 168 output neurons. Figure 6 is a visualization of the corner confidences produced by our method; notice how the confidence is not equal for all corners.
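One plausible implementation of this binning (the exact bin edges and assignment rule are not spelled out in the paper; uniform bins over $[-\rho, \rho]$ are our assumption):

```python
import numpy as np

NUM_BINS, RHO = 21, 32

def quantize_offsets(offsets):
    """Map 8 real-valued corner offsets in [-RHO, RHO] to bin indices
    in [0, 20]; the network's 8x21 = 168 outputs score these bins."""
    bins = (offsets + RHO) / (2 * RHO) * (NUM_BINS - 1)
    return np.round(np.clip(bins, 0, NUM_BINS - 1)).astype(np.int64)
```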
Fig. 4: Classification HomographyNet vs Regression HomographyNet. Our VGG-like Network has 8 convolutional layers and two fully connected layers. The final layer is 8x21 for the classification network and 8x1 for the regression network. The 8x21 output can be interpreted as four 21x21 corner distributions. See Section IV for full ConvNet details. The classification head is trained with the cross-entropy loss $-\sum_x p(x)\log q(x)$, the regression head with the Euclidean (L2) loss $\frac{1}{2}\|p(x)-q(x)\|^2$.

V. EXPERIMENTS

We train both of our networks for about 8 hours on a single Titan X GPU, using stochastic gradient descent (SGD) with momentum of 0.9. We use a base learning rate of 0.005 and decrease the learning rate by a factor of 10 after every 30,000 iterations. The networks are trained for 90,000 total iterations using a batch size of 64. We use Caffe [11], a popular open-source deep learning package, for all experiments.
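These settings map directly onto, for example, a PyTorch optimizer and step schedule (the original training used Caffe; this translation is ours):

```python
from torch import nn, optim

model = nn.Linear(8, 8)  # stand-in; use the HomographyNet sketched in Section IV
optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30_000, gamma=0.1)
# Train for 90,000 iterations with batch size 64, calling scheduler.step()
# once per iteration so the learning rate drops by 10x every 30,000 steps.
```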
To create the training data, we use the MS-COCO Training Set. All images are resized to 320x240 and converted to grayscale. We then generate 500,000 pairs of image patches sized 128x128 related by a homography using the method described in Section III. We choose ρ = 32, which means that each corner of the 128x128 grayscale image can be perturbed by a maximum of one quarter of the total image edge size. We avoid larger random perturbations to avoid extreme transformations. We did not use any form of pre-training; the weights of the networks were initialized to random values and trained from scratch. We use the MS-COCO validation set to monitor overfitting, of which we found very little.

To our knowledge there are no large, publicly available homography estimation test sets, thus we evaluate our homography estimation approach on our own Warped MS-COCO 14 Test Set. To create this test set, we randomly chose 5000 images from the test set, resized each image to 640x480 grayscale, and generated a pair of image patches sized 256x256² and a corresponding ground truth homography, using the approach described in Figure 3 with ρ = 64.

We compare the Classification and Regression variants of the HomographyNet with two baselines. The first baseline is a classical ORB [15] descriptor + RANSAC + getPerspectiveTransform() OpenCV homography computation. We use the default OpenCV parameters in the traditional homography estimator. This estimates ORB features at multiple scales and uses the top 25 scoring matches as input to the RANSAC estimator. In scenarios where too few ORB features are computed, the ORB+RANSAC approach outputs an identity estimate. In scenarios where the ORB+RANSAC estimate is too extreme, the 4-point homography estimate is clipped at [-64, 64]. The second baseline uses a 3x3 identity matrix for every pair of images in the test set.

²We found that very few ORB features were detected when the patches were sized 128x128, while the HomographyNets had no issues working at the smaller scale.
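A sketch of such a traditional baseline in OpenCV (our own; we substitute cv2.findHomography with RANSAC over the top-scoring matches, which may differ from the exact baseline configuration):

```python
import numpy as np
import cv2

def orb_ransac_homography(img_a, img_b, top_k=25):
    """Classical baseline: ORB features + RANSAC homography."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return np.eye(3)  # too few features: fall back to identity
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b),
                     key=lambda m: m.distance)[:top_k]
    if len(matches) < 4:
        return np.eye(3)
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    return H if H is not None else np.eye(3)
```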
Since the HomographyNets expect a fixed-size 128x128x2 input, the image pairs from the Warped MS-COCO 14 Test Set are resized from 256x256x2 to 128x128x2 before being passed through the network. The 4-point parameterized homography output by the network is then multiplied by a factor of two to account for this. When evaluating the Classification HomographyNet, the corner displacement with the highest confidence is chosen.

The results are reported in Figure 5. We report the Mean Average Corner Error for each approach. To measure this metric, one first computes the L2 distance between the ground truth corner position and the estimated corner position. The error is averaged over the four corners of the image, and the mean is computed over the entire test set. While the regression network performs the best, the classification network can produce confidences and thus a meaningful way to visually debug the results. In certain applications, it may be critical to have this measure of certainty.
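The metric can be written compactly as follows (a sketch assuming NumPy, with corner arrays of shape (N, 4, 2)):

```python
import numpy as np

def mean_average_corner_error(gt_corners, est_corners):
    """Mean over the test set of the per-image average L2 distance
    between ground-truth and estimated corner positions.

    gt_corners, est_corners: arrays of shape (N, 4, 2).
    """
    per_corner = np.linalg.norm(gt_corners - est_corners, axis=2)  # (N, 4)
    return per_corner.mean(axis=1).mean()
```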
We visualize homography estimations in Figure 7. The blue squares in column 1 are mapped to a blue quadrilateral using the ground-truth homography.

Fig. 5: Mean Average Corner Error on the Warped MS-COCO 14 Test Set for the Regression HomographyNet, Classification HomographyNet, ORB+RANSAC, and Identity baselines (bar values, in pixels: 9.2, 11.7, 24.1, 49.1).

Fig. 6: Corner Confidences Measure. Our Classification HomographyNet produces a score for each potential 2D displacement of each corner. Each corner's 2D grid of scores can be interpreted as a distribution.
VI. APPLICATIONS

Secondly, by formulating homography estimation as a machine learning problem, one can build application-specific homography estimation engines. For example, a robot that navigates an indoor factory floor using planar SLAM via homography estimation could be trained solely with images captured from the robot's image sensor of the indoor factory.

While it is possible to optimize a feature detector such as ORB to work in specific environments, it is not straightforward. Environment- and sensor-specific noise, motion blur, and occlusions which might restrict the ability of a homography estimation algorithm can be tackled in a similar fashion using a ConvNet. Other classical computer vision tasks such as image mosaicing (as in [19]) and markerless camera tracking systems for augmented reality (as in [16]) could also benefit from HomographyNets trained on image pair examples created from the target system's sensors and environment.

VII. CONCLUSION

In this paper we asked if one of the most essential computer vision estimation tasks, namely homography estimation, could be cast as a learning problem. We presented two Convolutional Neural Network architectures that are able to perform well on this task. Our end-to-end training pipeline contains two additional insights: using a 4-point corner parameterization of homographies, which makes the parameterization's coordinates operate on the same scale, and using a large dataset of real images to synthetically create a seemingly unlimited-sized training set for homography estimation. We hope that more geometric problems in vision will be tackled using learning paradigms.

REFERENCES

[1] M. Bai, W. Luo, K. Kundu, and R. Urtasun. Deep Semantic Matching for Optical Flow. CoRR, abs/1604.01827, April 2016.
[2] Simon Baker, Ankur Datta, and Takeo Kanade. Parameterizing homographies. Technical Report CMU-RI-TR-06-11, Robotics Institute, Pittsburgh, PA, March 2006.
[3] Matthew Brown and David G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74(1):59–73, 2007.
[4] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame to frame ego-motion estimation. ICRA, 2016.
[5] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. CoRR, abs/1406.2283, 2014.
[6] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. 2014.
[7] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. ICCV, 2015.
[8] A. P. Gee, D. Chekhlov, A. Calway, and W. Mayol-Cuevas. Discovering higher level structure in visual SLAM. IEEE Transactions on Robotics, 2008.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[14] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
[15] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
[16] G. Simon, A. Fitzgibbon, and A. Zisserman. Markerless tracking using planar structures in the scene. In Proc. International Symposium on Augmented Reality, pages 120–128, October 2000.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[18] Paul Smith, Ian Reid, and Andrew Davison. Real-time monocular SLAM with straight lines. In Proc. British Machine Vision Conference, 2006.
[19] Richard Szeliski. Video mosaics for virtual environments. IEEE Computer Graphics and Applications, 1996.
[20] Zhengyou Zhang. A flexible new technique for camera calibration. PAMI, 22(11):1330–1334, 2000.
Fig. 7: Traditional Homography Estimation vs Deep Image Homography Estimation. In each of the 12 examples, blue depicts the ground truth region. The left column (Traditional Homography Estimation) shows the output of ORB-based Homography Estimation, the matched features in red, and the resulting mapping in green of the cropping. The right column (Deep Image Homography Estimation) shows the output of the HomographyNet (regression head) in green. Rows 1-2: The ORB features either concentrate on small regions or cannot detect enough features and perform poorly relative to the HomographyNet, which is unaffected by these phenomena. Row 3: Both methods give reasonably good homography estimates. Row 4: A small amount of Gaussian noise is added to the image pair in row 3, deteriorating the results produced by the traditional method, while our method is unaffected by the distortions. Rows 5-6: The traditional approach extracts well-distributed ORB features, and also outperforms the deep method.