05_MVS

Contents

1 Introduction
  1.1 Recovering 3D Geometry from Images

2 Stereo Reconstruction
  2.1 Classical Two View Stereo Reconstruction [Optional]
    2.1.1 Epipolar Geometry and Stereo Triangulation
  2.2 Learning based Two View Stereo Reconstruction
  2.3 PMVS
  2.4 MVSNet
    2.4.1 Differentiable homography
    2.4.2 MVSNet overall architecture

Bibliography
1 Introduction

1.1 Recovering 3D Geometry from Images


Traditional multi-view reconstruction approaches use hand-crafted similarity metrics (e.g., NCC) and regularization techniques (e.g., SGM [6]) to recover 3D points. Recent MVS benchmarks [1, 8] report that, although traditional algorithms [2, 3, 10] achieve very high accuracy, the completeness of their reconstructions still leaves large room for improvement. The main reason for this low completeness is that hand-crafted similarity measures and block matching work well mainly on Lambertian surfaces and fail in the following cases:

• Textureless surfaces: It is hard to infer geometry from a textureless surface (e.g., a white wall) since it looks similar from different viewpoints.

• Occlusions: Scene objects may be partly or wholly invisible in some views due to occlusions.

• Repetitions: Block matching can give a similar response on different surfaces due to geometric and photometric repetitiveness.

• Non-Lambertian surfaces: Non-Lambertian surfaces look different from different viewpoints.

• Other non-geometric variations: image noise, vignetting, exposure changes, and lighting variation.

Zbontar et al. [15] have shown that block matching in feature space gives more robust results and can be used for depth perception in a two-view stereo setting. The goal of multi-view stereo techniques is to estimate a dense representation of the scene from overlapping calibrated views. Recent learning-based MVS methods [11, 13, 14] obtain more complete scene representations by learning depth maps from feature space.


Figure 1.1: Failure cases of block matching. (Image credit: Andreas Geiger)

Figure 1.2: Top: traditional 3D reconstruction method [10]; bottom: our learning-based method. (Image credit: TU Delft students)

2 Stereo Reconstruction
Here we describe depth estimation in two-view and multi-view settings where the pose information is known, covering both traditional and learning-based methods.

2.1 Classical Two View Stereo Reconstruction [Optional]


2.1.1 Epipolar Geometry and Stereo Triangulation
Triangulation methods

Monocular vision suffers from a scale ambiguity that makes it impossible to triangulate the scene at the correct scale. Intuitively, if the distance of the scene from the camera and the geometry of the scene are scaled by some positive factor k, the image will contain the same projection of the scene regardless of the value of k:

\[
\begin{aligned}
(X, Y, Z)^T &\mapsto (fX/Z + o_x,\ fY/Z + o_y)^T \\
(kX, kY, kZ)^T &\mapsto (fkX/kZ + o_x,\ fkY/kZ + o_y)^T = (fX/Z + o_x,\ fY/Z + o_y)^T
\end{aligned}
\tag{2.1}
\]
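A quick numeric check of this invariance, as a minimal Python sketch (the focal length and principal point values are made up):

import numpy as np

f, o_x, o_y = 800.0, 320.0, 240.0   # hypothetical intrinsics

def project(P):
    """Pinhole projection of a 3D point P = (X, Y, Z) onto the image plane."""
    X, Y, Z = P
    return np.array([f * X / Z + o_x, f * Y / Z + o_y])

P = np.array([1.0, 2.0, 5.0])
for k in (1.0, 2.0, 10.0):
    print(k, project(k * P))   # prints the same pixel coordinates for every k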

Without any prior information, it is also impossible to perceive scene geometry from a single RGB image. The most popular way of constructing and perceiving scene geometry is to move the camera to obtain a different view, as shown in Figure 2.1. However, having multiple monocular views without knowing the extrinsic calibration still does not resolve the scale ambiguity: if the relative pose between views and the camera-to-scene distance are scaled by the same positive factor k, the image plane again contains the same projection of the scene regardless of k. Figure 2.2 shows that the point has the same projection independent of the scale factor k.
This section covers scene triangulation in two-view and multi-view settings with known relative pose transformations.


Figure 2.1: The left frame alone does not reveal whether the scene contains one sphere or two; seeing the right frame as well gives the viewer a better understanding and the perception of two spheres with different colors. (Image credit: Arne Nordmann)

Figure 2.2: Scale ambiguity of the two view system without knowing the relative pose.


Two-view triangulation

Before diving into the two-view triangulation methods, this subsection introduces the basics and conventions of multiple view geometry. Assume x_1 and x_2 are the projections of a 3D point X in homogeneous coordinates in two different frames, R and T are the rotation and translation from the first frame to the second frame, and λ_1 and λ_2 are the distances from the camera centers to the 3D point X.

\[
\begin{aligned}
\lambda_1 x_1 &= X \quad \text{and} \quad \lambda_2 x_2 = RX + T \\
\lambda_2 x_2 &= R(\lambda_1 x_1) + T \\
\hat{v} v &= v \times v = 0 && \text{hat operator} \\
\lambda_2 \hat{T} x_2 &= \hat{T} R (\lambda_1 x_1) + \hat{T} T = \hat{T} R (\lambda_1 x_1) \\
\lambda_2\, x_2^T \hat{T} x_2 &= \lambda_1\, x_2^T \hat{T} R x_1, \qquad x_2 \perp \hat{T} x_2 \\
x_2^T \hat{T} R x_1 &= 0 && \text{epipolar constraint} \\
E &= \hat{T} R && \text{essential matrix}
\end{aligned}
\tag{2.2}
\]

With x_i' the image coordinates of x_i and K the intrinsic matrix, the equation can be formulated more generally for uncalibrated views.

\[
\begin{aligned}
x_2'^T K^{-T} \hat{T} R K^{-1} x_1' &= 0 \\
F &= K^{-T} \hat{T} R K^{-1} && \text{fundamental matrix} \\
x_2'^T F x_1' &= 0 \\
E &= K^T F K && \text{relation between essential and fundamental matrix}
\end{aligned}
\tag{2.3}
\]
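Both matrices are simple to assemble in code once R and T are known; a minimal NumPy sketch:

import numpy as np

def hat(v):
    """Skew-symmetric matrix of v, so that hat(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def essential_matrix(R, T):
    return hat(T) @ R                                   # E = T^ R

def fundamental_matrix(K, R, T):
    K_inv = np.linalg.inv(K)
    return K_inv.T @ essential_matrix(R, T) @ K_inv     # F = K^-T E K^-1

For calibrated correspondences x1, x2 in homogeneous coordinates, x2 @ essential_matrix(R, T) @ x1 should then vanish up to noise.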

Because of sensor noise and the discretization step in image formation, the pixel coordinates of the 3D point projections are noisy. As a consequence, the back-projected rays through corresponding points usually do not intersect in 3D. This noise must be taken into account to triangulate corresponding points accurately. There are multiple ways of doing two-view triangulation; two of them are covered here.

Midpoint method

The midpoint method is a simple approach to two-view triangulation. As shown in Figure 2.3, the idea is to find the shortest segment between the bearing vectors, the rays from each camera center through the corresponding point on its image plane. Q_1 and Q_2 are the points on these rays where the rays are closest to each other.


Figure 2.3: Midpoint method for two-view triangulation.

For Q_1 Q_2 to be the shortest segment between the rays, the line through Q_1 and Q_2 must be perpendicular to both bearing vectors. The midpoint of Q_1 Q_2 is accepted as the 3D triangulation of the corresponding points. With λ_i the scalar distance from the camera center to the point Q_i, and R and T the relative pose from the second camera frame to the first, the approach can be formulated as follows:

\[
\begin{aligned}
Q_1 &= \lambda_1 d_1, \qquad Q_2 = \lambda_2 R d_2 + T && (C_1 \text{ chosen as the origin for simplicity}) \\
(Q_1 - Q_2)^T d_1 &= 0, \qquad (Q_1 - Q_2)^T R d_2 = 0 && \text{(dot products with the perpendicular segment)} \\
\lambda_1\, d_1^T d_1 - \lambda_2\, (R d_2)^T d_1 &= T^T d_1 \\
\lambda_1\, d_1^T R d_2 - \lambda_2\, (R d_2)^T R d_2 &= T^T R d_2 \\
\begin{pmatrix} d_1^T d_1 & -(R d_2)^T d_1 \\ d_1^T R d_2 & -(R d_2)^T (R d_2) \end{pmatrix}
\begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix}
&= \begin{pmatrix} T^T d_1 \\ T^T R d_2 \end{pmatrix} && (Ax = b \text{ form}) \\
\begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} &= A^{-1} b \\
P &= (Q_1 + Q_2)/2 = (\lambda_1 d_1 + \lambda_2 R d_2 + T)/2
\end{aligned}
\tag{2.4}
\]
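A minimal NumPy sketch of the midpoint method, following the 2x2 system in (2.4) (d1 and d2 are bearing vectors, and R, T map points from the second camera frame to the first):

import numpy as np

def triangulate_midpoint(d1, d2, R, T):
    """Midpoint triangulation; returns the 3D point in the first camera frame."""
    Rd2 = R @ d2
    A = np.array([[d1 @ d1,  -(Rd2 @ d1)],
                  [d1 @ Rd2, -(Rd2 @ Rd2)]])
    b = np.array([T @ d1, T @ Rd2])
    lam1, lam2 = np.linalg.solve(A, b)   # solve Ax = b for (λ1, λ2)
    Q1 = lam1 * d1                       # closest point on the first ray
    Q2 = lam2 * Rd2 + T                  # closest point on the second ray
    return 0.5 * (Q1 + Q2)               # midpoint of the shortest segment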


Linear triangulation

This method is based on the fact that, in the ideal case, the back-projected ray and the ray from the camera center through the corresponding point on the image plane are aligned, so the cross product of these two vectors is zero. Using this, the problem is converted into a set of linear equations that can be solved by SVD. Let x and y be corresponding points, P and Q the respective perspective projection matrices of the cameras, and λ_x and λ_y scalar values.
   
\[
\begin{aligned}
x &= \begin{pmatrix} u_x \\ v_x \\ 1 \end{pmatrix}, \qquad
y = \begin{pmatrix} u_y \\ v_y \\ 1 \end{pmatrix} \\
\lambda_x x &= P X, \qquad \lambda_y y = Q X \\
x \times P X &= 0, \qquad y \times Q X = 0 \\
\begin{pmatrix} u_x \\ v_x \\ 1 \end{pmatrix} \times
\begin{pmatrix} p_1^T \\ p_2^T \\ p_3^T \end{pmatrix} X &= 0, \qquad
\begin{pmatrix} u_y \\ v_y \\ 1 \end{pmatrix} \times
\begin{pmatrix} q_1^T \\ q_2^T \\ q_3^T \end{pmatrix} X = 0 \\
\begin{pmatrix} v_x p_3^T - p_2^T \\ p_1^T - u_x p_3^T \\ u_x p_2^T - v_x p_1^T \end{pmatrix} X &= 0, \qquad
\begin{pmatrix} v_y q_3^T - q_2^T \\ q_1^T - u_y q_3^T \\ u_y q_2^T - v_y q_1^T \end{pmatrix} X = 0 \\
\begin{pmatrix} v_x p_3^T - p_2^T \\ p_1^T - u_x p_3^T \\ v_y q_3^T - q_2^T \\ q_1^T - u_y q_3^T \end{pmatrix} X &= 0 \\
A X &= 0
\end{aligned}
\tag{2.5}
\]

The solution for X (up to scale) is the right singular vector of A associated with its smallest singular value, obtained from the Singular Value Decomposition.
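A minimal NumPy sketch of this linear (DLT) triangulation:

import numpy as np

def triangulate_linear(x, y, P, Q):
    """x, y: pixel coordinates (u, v) in the two views; P, Q: 3x4 projection
    matrices. Returns the 3D point in inhomogeneous coordinates."""
    u_x, v_x = x
    u_y, v_y = y
    A = np.vstack([v_x * P[2] - P[1],
                   P[0] - u_x * P[2],
                   v_y * Q[2] - Q[1],
                   Q[0] - u_y * Q[2]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]            # right singular vector of the smallest singular value
    return X[:3] / X[3]   # dehomogenize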

Multi-view triangulation

For multi-view triangulation, one solution is to find the point X in 3D space that minimizes the sum of squared distances to the bearing vectors. The analytical solution is obtained by taking the derivative of the loss function with respect to the 3D point and finding the point where the derivative is zero. With C_i the camera center of the i-th camera, P_i a point on the i-th bearing vector, λ_i the scalar distance between C_i and P_i, and X the optimal 3D point, the triangulation can be formulated as follows.


Figure 2.4: Multi-view triangulation method. [9]

\[
\begin{aligned}
P_i &= C_i + \lambda_i d_i \\
\lambda_i d_i &\approx X - C_i && \text{ideal case with no noise} \\
\lambda_i &= \lambda_i\, d_i^T d_i \approx d_i^T (X - C_i) \\
P_i &= C_i + \lambda_i d_i \approx C_i + d_i d_i^T (X - C_i) \\
r_i &= X - C_i - d_i d_i^T (X - C_i) = (I - d_i d_i^T)(X - C_i) \\
L &= \sum_{i=1}^{N} \|r_i\|^2 = \sum_{i=1}^{N} \|(I - d_i d_i^T)(X - C_i)\|^2 \\
\arg\min_X L &\Rightarrow \frac{\partial L}{\partial X} = 0 \\
\frac{\partial L}{\partial X} &= 2 \sum_{i=1}^{N} (I - d_i d_i^T)^2 (X - C_i) = 0 \\
A_i &= I - d_i d_i^T \Rightarrow \frac{\partial L}{\partial X} = \sum_{i=1}^{N} A_i^T A_i (X - C_i) = 0 \\
X &= \Big(\sum_{i=1}^{N} A_i^T A_i\Big)^{-1} \sum_{i=1}^{N} A_i^T A_i C_i
\end{aligned}
\tag{2.6}
\]
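The closed-form solution translates directly into code; a sketch assuming unit-length bearing vectors:

import numpy as np

def triangulate_multiview(centers, directions):
    """centers: (N, 3) camera centers C_i; directions: (N, 3) unit bearing
    vectors d_i. Returns the least-squares intersection point X."""
    S = np.zeros((3, 3))
    b = np.zeros(3)
    for C, d in zip(centers, directions):
        A = np.eye(3) - np.outer(d, d)   # A_i = I - d_i d_i^T
        S += A.T @ A
        b += A.T @ A @ C
    return np.linalg.solve(S, b)         # X = (Σ A_i^T A_i)^-1 Σ A_i^T A_i C_i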


Figure 2.5: Siamese networks for stereo matching. Left: concatenation-based architecture with fully-connected layers; right: dot product of normalized convolutional features. [15]

2.2 Learning based Two View Stereo Reconstruction


Zbontar et al. [15] initially showed that depth information can be extracted from rectified image pairs by learning a similarity measure on image patches. They train a CNN-based siamese network as a binary classifier on similar and irrelevant pairs of patches.
Kendall et al. [7] proposed a network that uses a 2D CNN with shared weights to extract features from a rectified image pair. These feature maps are then used to build a matching-score-based cost volume, which is finally regularized with a 3D CNN-based autoencoder.
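A minimal PyTorch sketch of the dot-product siamese variant shown on the right of Figure 2.5 (the layer sizes and patch size here are illustrative, not the exact configuration of [15]):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseBranch(nn.Module):
    """Feature extractor shared by the left and right patches."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3), nn.ReLU(),
            nn.Conv2d(32, 32, 3), nn.ReLU(),
            nn.Conv2d(32, 64, 3),            # last convolution without ReLU
        )

    def forward(self, patch):
        feat = self.net(patch).flatten(1)
        return F.normalize(feat, dim=1)      # unit-norm descriptors

branch = SiameseBranch()                     # one set of weights for both branches
left = torch.randn(8, 1, 9, 9)               # batch of 9x9 grayscale patches
right = torch.randn(8, 1, 9, 9)
score = (branch(left) * branch(right)).sum(dim=1)   # dot-product similarity

Training then pushes the score up for matching patch pairs and down for irrelevant ones.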

Figure 2.6: GC-Net deep stereo regression architecture: input stereo images, 2D convolutions with shared weights, cost volume, multi-scale 3D convolution, 3D deconvolution, soft argmax, disparities. [7]


2.3 PMVS
Patch-Based Multi-View Stereo [2] has proven quite effective in practice. After an initial feature matching step aimed at constructing a sparse set of photoconsistent patches, that is, patches whose projections in the images where they are visible have similar brightness or color patterns, it divides the input images into small square cells a few pixels across and attempts to reconstruct a patch in each one of them, using the cell connectivity to propose new patches and visibility constraints to filter out incorrect ones.
We assume throughout that n cameras with known intrinsic and extrinsic parameters observe a static scene, and respectively denote by O_i and I_i (i = 1, ..., n) the optical centers of these cameras and the images they have recorded of the scene. The main elements of the PMVS model of multi-view stereo fusion and scene reconstruction are small rectangular patches, intended to be tangent to the observed surfaces, together with a few key properties of these patches: their geometry, which images they are visible in and whether they are photoconsistent with them, and a notion of connectivity inherited from image topology.

1. Matching: Use feature matching to construct an initial set of patches, and optimize their parameters to make them maximally photoconsistent.

2. Repeat 3 times:

   a) Expansion: Iteratively construct new patches in empty spots near existing ones, using image connectivity and depth extrapolation to propose candidates, and optimizing their parameters as before to make them maximally photoconsistent.

   b) Filtering: Use the image connectivity again to remove patches identified as outliers because their depth is not consistent with a sufficient number of nearby patches.
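A structural sketch of this match-expand-filter loop (match_features, expand, and filter_outliers are hypothetical stand-ins for the steps described in the paper, stubbed out here):

def pmvs(images, cameras, n_outer=3):
    """Match-expand-filter skeleton of PMVS [2]."""
    patches = match_features(images, cameras)                 # 1. initial sparse patches
    for _ in range(n_outer):                                  # 2. repeat 3 times
        patches = expand(patches, images, cameras)            # 2a. fill empty cells nearby
        patches = filter_outliers(patches, images, cameras)   # 2b. visibility-based filtering
    return patches

# Hypothetical stubs; the real steps are specified in sections 2 and 3 of [2].
def match_features(images, cameras): ...
def expand(patches, images, cameras): ...
def filter_outliers(patches, images, cameras): ...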

You should read section 2 (Key Elements of the Proposed Approach) and section 3
(Algorithm) from the paper.


Figure 2.7: PMVS overall approach. From left to right: a sample input image; detected features; reconstructed patches after the initial matching; final patches after expansion and filtering; polygonal surface extracted from the reconstructed patches. [2]

2.4 MVSNet
State-of-the-art learning-based MVS approaches adapt photogrammetry-based MVS algorithms by implementing them as a set of differentiable operations defined in feature space. MVSNet [13] produces good quality 3D reconstructions by regularizing a cost volume computed via differentiable homography on the feature maps of the reference and source images.

2.4.1 Differentiable homography


Any plane in 3-dimensional space can be parametrized by a normal n and a distance d: a point P_i lies on the plane π in the figure if and only if n^T P_i + d = 0.


Figure 2.8: Two-view homography. [5, 4, 12]

Derivation with relative pose


\[
\begin{aligned}
p_a &= \frac{1}{z_a} K_a H_{ab}\, z_b K_b^{-1} p_b = \frac{z_b}{z_a} K_a H_{ab} K_b^{-1} p_b
&& (R, t \text{: relative pose of } a \text{ with respect to } b) \\
H_{ab} P_b &= R P_b + t \\
n^T P_b + d &= 0 && \text{plane constraint} \\
H_{ab} P_b &= R P_b + t\, \frac{-n^T P_b}{d} && \text{since } -n^T P_b / d = 1 \\
H_{ab} &= R - \frac{t\, n^T}{d}
\end{aligned}
\tag{2.7}
\]
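In code, the resulting pixel-to-pixel mapping is a single 3x3 product; a NumPy sketch:

import numpy as np

def plane_induced_homography(K_a, K_b, R, t, n, d):
    """Pixel-space homography mapping view b to view a, induced by the plane
    n^T P + d = 0 (n is the unit plane normal expressed in frame b)."""
    H = R - np.outer(t, n) / d               # H_ab = R - t n^T / d
    return K_a @ H @ np.linalg.inv(K_b)      # p_a ~ K_a H_ab K_b^-1 p_b

Applying the result to a homogeneous pixel p_b and dividing by the third coordinate gives p_a; sweeping d over a set of depth hypotheses produces the warps used to build a cost volume.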


Figure 2.9: MVSNet architecture: feature extraction, differentiable homography, variance-based cost volume regularization, and depth map refinement, producing an initial and a refined depth map supervised by Loss0 and Loss1. [13]

Derivation with absolute pose


\[
\begin{aligned}
H_{ab} &= R - \frac{t\, n^T}{d} && \text{recall the relative case} \\
P_b &= R_b P_w + t_b && \text{from world to camera } b \\
P_a &= R_a P_w + t_a && \text{from world to camera } a \\
P_w &= R_b^T P_b - R_b^T t_b \\
P_a &= R_a (R_b^T P_b - R_b^T t_b) + t_a = R_a R_b^T P_b + t_a - R_a R_b^T t_b \\
R_{a \leftarrow b} &= R_a R_b^T, \qquad t_{a \leftarrow b} = t_a - R_a R_b^T t_b \\
H_{ab} &= R_{a \leftarrow b} - \frac{t_{a \leftarrow b}\, n^T}{d} \\
H_{ab} &= R_a R_b^T - \frac{(t_a - R_a R_b^T t_b)\, n^T}{d}
\end{aligned}
\tag{2.8}
\]
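A small helper for the absolute-pose form, assuming world-to-camera conventions P_c = R_c P_w + t_c as above:

import numpy as np

def relative_pose(R_a, t_a, R_b, t_b):
    """Relative pose a <- b from two absolute (world-to-camera) poses."""
    R_ab = R_a @ R_b.T                # R_{a<-b} = R_a R_b^T
    t_ab = t_a - R_ab @ t_b           # t_{a<-b} = t_a - R_a R_b^T t_b
    return R_ab, t_ab

The returned pair can be fed straight into the relative-pose homography of the previous subsection.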

2.4.2 MVSNet overall architecture


MVSNet first extracts deep features from the N input images (N views) for dense matching, applying convolutional filters to build multi-scale feature towers. Each convolutional layer is followed by a batch-normalization (BN) layer and a rectified linear unit (ReLU), except for the last layer.
Using these features and the camera parameters, a cost volume is then built via differentiable homography.


The raw cost volume computed from the image features is then regularized with a multi-scale 3D CNN. This regularization step refines the cost volume into a probability volume for depth inference.
The depth map regressed from the probability volume is further refined with a 2D CNN.
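A rough PyTorch sketch of the variance cost metric and the soft-argmin depth regression described above, assuming the N feature volumes have already been warped into the reference frustum at D depth hypotheses:

import torch

def variance_cost_volume(warped):
    """warped: (N, C, D, H, W) feature volumes from N views.
    Returns the (C, D, H, W) variance-based cost volume."""
    mean = warped.mean(dim=0, keepdim=True)
    return ((warped - mean) ** 2).mean(dim=0)   # mean_i (V_i - V_mean)^2

def soft_argmin_depth(prob, depths):
    """prob: (D, H, W) probability volume (softmax over the depth axis);
    depths: (D,) depth hypotheses. Returns the (H, W) expected depth map."""
    return (prob * depths.view(-1, 1, 1)).sum(dim=0)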
You should read section 3 (MVSNet) from the paper.

Bibliography
[1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl. Large-scale data for
multiple-view stereopsis. International Journal of Computer Vision, pages 1–16, 2016.
[2] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. IEEE TPAMI,
32(8):1362–1376, 2010.
[3] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by
surface normal diffusion. June 2015.
[4] D. Gallup, J. Frahm, P. Mordohai, Q. Yang, and M. Pollefeys. Real-time plane-sweeping
stereo with multiple sweeping directions. In 2007 IEEE Conference on Computer Vision and
Pattern Recognition, pages 1–8, 2007. doi: 10.1109/CVPR.2007.383245.
[5] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge
University Press, ISBN: 0521540518, second edition, 2004.
[6] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2007.
[7] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry.
End-to-end learning of geometry and context for deep stereo regression. In Proceedings of
the IEEE International Conference on Computer Vision, pages 66–75, 2017.
[8] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking
large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
[9] P. Moulon, P. Monasse, R. Perrot, and R. Marlet. OpenMVG: Open multiple view geome-
try. In International Workshop on Reproducible Research in Pattern Recognition, pages 60–74.
Springer, 2016.
[10] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for
unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[11] F. Wang, S. Galliani, C. Vogel, P. Speciale, and M. Pollefeys. Patchmatchnet: Learned
multi-view patchmatch stereo, 2021.
[12] Wikipedia. Homography (computer vision). Wikipedia, 2013.
[13] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2018.
[14] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan. Recurrent MVSNet for high-resolution multi-view stereo depth inference. In Computer Vision and Pattern Recognition (CVPR), 2019.
[15] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1):2287–2318, 2016.

