Abstract—We present a deep convolutional neural network for estimating the relative homography between a pair of images. Our feed-forward network has 10 layers, takes two stacked grayscale images as input, and produces a homography which can be used to map the pixels from the first image to the second.

The traditional homography estimation pipeline is composed of two stages: corner estimation and robust homography estimation. Robustness is introduced into the corner detection stage by returning a large and over-complete set of points, while robustness in the homography estimation step shows up as heavy use of RANSAC or robustification of the squared loss function.

Fig. 1: Deep Image Homography Estimation. HomographyNet is a Deep Convolutional Neural Network which directly produces the Homography relating two images. Our method does not require separate corner detection and homography estimation steps, and all parameters are trained in an end-to-end fashion using a large dataset of labeled images.
II. THE 4-POINT HOMOGRAPHY PARAMETERIZATION

The simplest way to parameterize a homography is with a 3x3 matrix and a fixed scale (see Equation 1):

\[
\begin{pmatrix} u' \\ v' \\ 1 \end{pmatrix} \sim
\begin{pmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{pmatrix}
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \tag{1}
\]

However, if we unroll the 8 (or 9) parameters of the homography into a single vector, we'll quickly realize that we are mixing both rotational and translational terms. For example, the submatrix $[H_{11}\, H_{12};\, H_{21}\, H_{22}]$ represents the rotational terms in the homography, while the vector $[H_{13}\, H_{23}]$ is the translational offset. Balancing the rotational and translational terms as part of an optimization problem is difficult.
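To make Equation 1 concrete, here is a minimal sketch (ours, assuming NumPy; not code from the paper) that maps a pixel through a 3x3 homography and normalizes away the fixed scale:

```python
import numpy as np

def apply_homography(H, u, v):
    """Map pixel (u, v) through a 3x3 homography H (Equation 1).

    H is only defined up to scale, so the homogeneous result is
    normalized by its third coordinate before returning.
    """
    x = H @ np.array([u, v, 1.0])
    return x[0] / x[2], x[1] / x[2]
```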
We found that an alternate parameterization, one based on a single kind of location variable, namely the corner location, is more suitable for our deep homography estimation task. The 4-point parameterization has been used in traditional homography estimation methods [2], and we use it in our modern deep manifestation of the homography estimation problem (see Figure 2). Letting $\Delta u_1 = u'_1 - u_1$ be the u-offset for the first corner, the 4-point parameterization represents the homography as follows:

\[
H_{4\mathrm{point}} =
\begin{pmatrix} \Delta u_1 & \Delta v_1 \\ \Delta u_2 & \Delta v_2 \\ \Delta u_3 & \Delta v_3 \\ \Delta u_4 & \Delta v_4 \end{pmatrix} \tag{2}
\]

Fig. 2: 4-point parameterization. We use the 4-point parameterization of the homography. There exists a 1-to-1 mapping between the 8-dof "corner offset" matrix and the representation of the homography as a 3x3 matrix.
Equivalently to the matrix formulation of the homography, the 4-point parameterization uses eight numbers. In other words, once the displacement of the four corners is known, only a single closed-form transformation is needed to recover the 8-dof homography, and one can easily convert $H_{4\mathrm{point}}$ to $H_{\mathrm{matrix}}$. This can be accomplished in a number of ways; for example, one can use the normalized Direct Linear Transform (DLT) algorithm [9], or the function getPerspectiveTransform() in OpenCV.
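As an illustration of this conversion, the sketch below (our own; the corner ordering and variable names are assumptions) recovers the 3x3 matrix from the four corner offsets using OpenCV's getPerspectiveTransform():

```python
import numpy as np
import cv2

def four_point_to_matrix(corners, offsets):
    """Convert the 4-point parameterization to a 3x3 homography.

    corners: (4, 2) array of corner locations (u_i, v_i).
    offsets: (4, 2) array of displacements (delta_u_i, delta_v_i).
    """
    src = np.float32(corners)
    dst = np.float32(corners + offsets)  # perturbed corners (u_i', v_i')
    return cv2.getPerspectiveTransform(src, dst)
```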
III. DATA GENERATION FOR HOMOGRAPHY ESTIMATION

Training deep convolutional networks from scratch requires a large amount of data. To meet this requirement, we generate a nearly unlimited number of labeled training examples by applying random projective transformations to a large dataset of natural images¹. This procedure is detailed below.

To generate a single training example, we first randomly crop a square patch from the larger image $I$ at position $p$ (we avoid the borders to prevent bordering artifacts later in the data generation pipeline). This random crop is $I_p$. Then, the four corners of Patch A are randomly perturbed by values within the range $[-\rho, \rho]$. The four correspondences define a homography $H^{AB}$. Then, the inverse of this homography $H^{BA} = (H^{AB})^{-1}$ is applied to the large image to produce image $I'$. A second patch $I'_p$ is cropped from $I'$ at position $p$. The two grayscale patches, $I_p$ and $I'_p$, are then stacked channel-wise to create the 2-channel image which is fed directly into our ConvNet. The 4-point parameterization of $H^{AB}$ is then used as the associated ground-truth training label. The process is illustrated in Figure 3.

Managing the training image generation pipeline gives us full control over the kinds of visual effects we want to model. For example, to make our method more robust to motion blur, we can apply such blurs to the image in our training set. If we want the method to be robust to occlusions, we can insert random occluding shapes into our training images. We experimented with in-painting random occluding rectangles into our training images, as a simple mechanism to simulate real occlusions.

¹In our experiments, we used cropped MS-COCO [13] images, although any large-enough dataset could be used for training.
Fig. 3: Training Data Generation. The process for creating a single training example is detailed; see Section III for more information. Step 1: Randomly crop at position $p$; this is Patch A. Step 2: Randomly perturb the four corners of Patch A. Step 3: Compute $H^{AB}$ given these correspondences. Step 4: Apply $(H^{AB})^{-1} = H^{BA}$ to the image, and crop again at position $p$; this is Patch B. Step 5: Stack Patch A and Patch B channel-wise and feed into the Deep Image Homography Network. Set $H^{AB}$ as the target vector.
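A minimal sketch of the Figure 3 pipeline, assuming NumPy and OpenCV (the patch size, border margins, and names here are illustrative; this is not the paper's code):

```python
import numpy as np
import cv2

def make_training_example(image, patch_size=128, rho=32):
    """Generate one (stacked patches, 4-point label) pair as in Fig. 3."""
    h, w = image.shape[:2]
    # Step 1: random crop away from the borders; this is Patch A.
    x = np.random.randint(rho, w - patch_size - rho)
    y = np.random.randint(rho, h - patch_size - rho)
    corners = np.float32([[x, y], [x + patch_size, y],
                          [x + patch_size, y + patch_size],
                          [x, y + patch_size]])
    patch_a = image[y:y + patch_size, x:x + patch_size]
    # Step 2: perturb each corner by values in [-rho, rho].
    offsets = np.random.randint(-rho, rho + 1, size=(4, 2)).astype(np.float32)
    # Step 3: the four correspondences define H_AB.
    h_ab = cv2.getPerspectiveTransform(corners, corners + offsets)
    # Step 4: apply H_BA = (H_AB)^-1 to the image and crop again at p.
    warped = cv2.warpPerspective(image, np.linalg.inv(h_ab), (w, h))
    patch_b = warped[y:y + patch_size, x:x + patch_size]
    # Step 5: stack the grayscale patches channel-wise; the 4-point
    # parameterization (the eight offsets) is the ground-truth label.
    stacked = np.dstack([patch_a, patch_b])
    return stacked, offsets.flatten()
```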
IV. CONVNET MODELS

Our networks use 3x3 convolutional blocks with BatchNorm [10] and ReLUs, and are architecturally similar to Oxford's VGG Net [17] (see Figure 1). Both networks take as input a two-channel grayscale image sized 128x128x2. In other words, the two input images, which are related by a homography, are stacked channel-wise and fed into the network. We use 8 convolutional layers with a max pooling layer (2x2, stride 2) after every two convolutions. The 8 convolutional layers have the following number of filters per layer: 64, 64, 64, 64, 128, 128, 128, 128. The convolutional layers are followed by two fully connected layers. The first fully connected layer has 1024 units. Dropout with a probability of 0.5 is applied after the final convolutional layer and the first fully-connected layer. Our two networks share the same architecture up to the last layer, where the first network produces real-valued outputs and the second network produces discrete quantities (see Figure 4).
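A minimal PyTorch sketch of this architecture (the paper's experiments use Caffe [11]; this re-implementation is ours and only mirrors the description above):

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 convolution + BatchNorm + ReLU, padding preserves resolution.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class HomographyNet(nn.Module):
    """8 conv layers (64,64,64,64,128,128,128,128 filters), 2x2 max
    pooling after every two convolutions, then FC-1024 and an output
    layer: 8 units (regression) or 8x21 = 168 units (classification)."""
    def __init__(self, classification=False):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(2, 64), conv_block(64, 64), nn.MaxPool2d(2, 2),
            conv_block(64, 64), conv_block(64, 64), nn.MaxPool2d(2, 2),
            conv_block(64, 128), conv_block(128, 128), nn.MaxPool2d(2, 2),
            conv_block(128, 128), conv_block(128, 128))
        out_units = 8 * 21 if classification else 8
        self.head = nn.Sequential(
            nn.Dropout(0.5),  # dropout after the final conv layer
            nn.Flatten(),
            nn.Linear(16 * 16 * 128, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.5),  # dropout after the first FC layer
            nn.Linear(1024, out_units))
    def forward(self, x):  # x: (N, 2, 128, 128)
        return self.head(self.features(x))
```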
The regression network directly produces 8 real-valued numbers and uses the Euclidean (L2) loss as the final layer during training. The advantage of this formulation is its simplicity; however, without producing any kind of confidence value for the prediction, such a direct approach could be prohibitive in certain applications.

The classification network uses a quantization scheme, has a softmax at the last layer, and we use the cross entropy loss function during training. While quantization means that there is some inherent quantization error, the network is able to produce a confidence for each of the corners produced by the method. We chose to use 21 quantization bins for each of the 8 output dimensions, which results in a final layer with 168 output neurons. Figure 6 is a visualization of the corner confidences produced by our method; notice how the confidence is not equal for all corners.
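One plausible implementation of this binning (the exact bin edges and assignment rule are not spelled out in the paper; uniform bins over $[-\rho, \rho]$ are our assumption):

```python
import numpy as np

NUM_BINS, RHO = 21, 32

def quantize_offsets(offsets):
    """Map 8 real-valued corner offsets in [-RHO, RHO] to bin indices
    in [0, 20]; the network's 8x21 = 168 outputs score these bins."""
    bins = (offsets + RHO) / (2 * RHO) * (NUM_BINS - 1)
    return np.round(np.clip(bins, 0, NUM_BINS - 1)).astype(np.int64)
```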
Fig. 4: Classification HomographyNet vs Regression HomographyNet. Our VGG-like Network has 8 convolutional layers and two fully connected layers. The final layer is 8x21 for the classification network and 8x1 for the regression network. The 8x21 output can be interpreted as four 21x21 corner distributions. See Section IV for full ConvNet details. The classification head is trained with the cross-entropy loss $-\sum_x p(x)\log q(x)$, the regression head with the Euclidean (L2) loss $\frac{1}{2}\|p(x)-q(x)\|^2$.

V. EXPERIMENTS

We train both of our networks for about 8 hours on a single Titan X GPU, using stochastic gradient descent (SGD) with momentum of 0.9. We use a base learning rate of 0.005 and decrease the learning rate by a factor of 10 after every 30,000 iterations. The networks are trained for 90,000 total iterations using a batch size of 64. We use Caffe [11], a popular open-source deep learning package, for all experiments.
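These settings map directly onto, for example, a PyTorch optimizer and step schedule (the original training used Caffe; this translation is ours):

```python
from torch import nn, optim

model = nn.Linear(8, 8)  # stand-in; use the HomographyNet sketched in Section IV
optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30_000, gamma=0.1)
# Train for 90,000 iterations with batch size 64, calling scheduler.step()
# once per iteration so the learning rate drops by 10x every 30,000 steps.
```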
To create the training data, we use the MS-COCO Training Set. All images are resized to 320x240 and converted to grayscale. We then generate 500,000 pairs of image patches sized 128x128 related by a homography using the method described in Section III. We choose ρ = 32, which means that each corner of the 128x128 grayscale image can be perturbed by a maximum of one quarter of the total image edge size. We avoid larger random perturbations to avoid extreme transformations. We did not use any form of pre-training; the weights of the networks were initialized to random values and trained from scratch. We use the MS-COCO validation set to monitor overfitting, of which we found very little.

To our knowledge there are no large, publicly available homography estimation test sets, thus we evaluate our homography estimation approach on our own Warped MS-COCO 14 Test Set. To create this test set, we randomly chose 5000 images from the test set, resized each image to 640x480 grayscale, and generated a pair of image patches sized 256x256² and a corresponding ground truth homography, using the approach described in Figure 3 with ρ = 64.

We compare the Classification and Regression variants of the HomographyNet with two baselines. The first baseline is a classical ORB [15] descriptor + RANSAC + getPerspectiveTransform() OpenCV homography computation. We use the default OpenCV parameters in the traditional homography estimator. This estimates ORB features at multiple scales and uses the top 25 scoring matches as input to the RANSAC estimator. In scenarios where too few ORB features are computed, the ORB+RANSAC approach outputs an identity estimate. In scenarios where the ORB+RANSAC estimate is too extreme, the 4-point homography estimate is clipped at [-64, 64]. The second baseline uses a 3x3 identity matrix for every pair of images in the test set.

²We found that very few ORB features were detected when the patches were sized 128x128, while the HomographyNets had no issues working at the smaller scale.
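A sketch of such a traditional baseline in OpenCV (our own; we substitute cv2.findHomography with RANSAC over the top-scoring matches, which may differ from the exact baseline configuration):

```python
import numpy as np
import cv2

def orb_ransac_homography(img_a, img_b, top_k=25):
    """Classical baseline: ORB features + RANSAC homography."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return np.eye(3)  # too few features: fall back to identity
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b),
                     key=lambda m: m.distance)[:top_k]
    if len(matches) < 4:
        return np.eye(3)
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    return H if H is not None else np.eye(3)
```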
Since the HomographyNets expect a fixed-size 128x128x2 input, the image pairs from the Warped MS-COCO 14 Test Set are resized from 256x256x2 to 128x128x2 before being passed through the network. The 4-point parameterized homography output by the network is then multiplied by a factor of two to account for this. When evaluating the Classification HomographyNet, the corner displacement with the highest confidence is chosen.

The results are reported in Figure 5. We report the Mean Average Corner Error for each approach. To measure this metric, one first computes the L2 distance between the ground truth corner position and the estimated corner position. The error is averaged over the four corners of the image, and the mean is computed over the entire test set. While the regression network performs the best, the classification network can produce confidences and thus a meaningful way to visually debug the results. In certain applications, it may be critical to have this measure of certainty.
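The metric can be written compactly as follows (a sketch assuming NumPy, with corner arrays of shape (N, 4, 2)):

```python
import numpy as np

def mean_average_corner_error(gt_corners, est_corners):
    """Mean over the test set of the per-image average L2 distance
    between ground-truth and estimated corner positions.

    gt_corners, est_corners: arrays of shape (N, 4, 2).
    """
    per_corner = np.linalg.norm(gt_corners - est_corners, axis=2)  # (N, 4)
    return per_corner.mean(axis=1).mean()
```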
We visualize homography estimations in Figure 7. The blue squares in column 1 are mapped to a blue quadrilateral using the ground-truth homography.

Fig. 5: Mean Average Corner Error on the Warped MS-COCO 14 Test Set for the Regression HomographyNet, Classification HomographyNet, ORB+RANSAC, and Identity baselines (bar values, in pixels: 9.2, 11.7, 24.1, 49.1).

Fig. 6: Corner Confidences Measure. Our Classification HomographyNet produces a score for each potential 2D displacement of each corner. Each corner's 2D grid of scores can be interpreted as a distribution.
VI. APPLICATIONS

Secondly, by formulating homography estimation as a machine learning problem, one can build application-specific homography estimation engines. For example, a robot that navigates an indoor factory floor using planar SLAM via homography estimation could be trained solely with images captured from the robot's image sensor of the indoor factory.

While it is possible to optimize a feature detector such as ORB to work in specific environments, it is not straightforward. Environment- and sensor-specific noise, motion blur, and occlusions which might restrict the ability of a homography estimation algorithm can be tackled in a similar fashion using a ConvNet. Other classical computer vision tasks such as image mosaicing (as in [19]) and markerless camera tracking systems for augmented reality (as in [16]) could also benefit from HomographyNets trained on image pair examples created from the target system's sensors and environment.

VII. CONCLUSION

In this paper we asked if one of the most essential computer vision estimation tasks, namely homography estimation, could be cast as a learning problem. We presented two Convolutional Neural Network architectures that are able to perform well on this task. Our end-to-end training pipeline contains two additional insights: using a 4-point corner parameterization of homographies, which makes the parameterization's coordinates operate on the same scale, and using a large dataset of real images to synthetically create a seemingly unlimited-sized training set for homography estimation. We hope that more geometric problems in vision will be tackled using learning paradigms.

REFERENCES

[1] M. Bai, W. Luo, K. Kundu, and R. Urtasun. Deep Semantic Matching for Optical Flow. CoRR, abs/1604.01827, April 2016.
[2] Simon Baker, Ankur Datta, and Takeo Kanade. Parameterizing homographies. Technical Report CMU-RI-TR-06-11, Robotics Institute, Pittsburgh, PA, March 2006.
[3] Matthew Brown and David G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74(1):59–73, 2007.
[4] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame to frame ego-motion estimation. ICRA, 2016.
[5] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. CoRR, abs/1406.2283, 2014.
[6] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. 2014.
[7] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. ICCV, 2015.
[8] A. P. Gee, D. Chekhlov, A. Calway, and W. Mayol-Cuevas. Discovering higher level structure in visual SLAM. IEEE Transactions on Robotics, 2008.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[14] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
[15] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
[16] G. Simon, A. Fitzgibbon, and A. Zisserman. Markerless tracking using planar structures in the scene. In Proc. International Symposium on Augmented Reality, pages 120–128, October 2000.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[18] Paul Smith, Ian Reid, and Andrew Davison. Real-time monocular SLAM with straight lines. In Proc. British Machine Vision Conference, 2006.
[19] Richard Szeliski. Video mosaics for virtual environments. IEEE Computer Graphics and Applications, 1996.
[20] Zhengyou Zhang. A flexible new technique for camera calibration. PAMI, 22(11):1330–1334, 2000.
Fig. 7: Traditional Homography Estimation vs Deep Image Homography Estimation. In each of the 12 examples, blue depicts the ground truth region. The left column (Traditional Homography Estimation) shows the output of ORB-based Homography Estimation, the matched features in red, and the resulting mapping in green of the cropping. The right column (Deep Image Homography Estimation) shows the output of the HomographyNet (regression head) in green. Rows 1-2: The ORB features either concentrate on small regions or cannot detect enough features and perform poorly relative to the HomographyNet, which is unaffected by these phenomena. Row 3: Both methods give reasonably good homography estimates. Row 4: A small amount of Gaussian noise is added to the image pair in row 3, deteriorating the results produced by the traditional method, while our method is unaffected by the distortions. Rows 5-6: The traditional approach extracts well-distributed ORB features, and also outperforms the deep method.