05_MVS
1 Introduction
    1.1 Recovering 3D Geometry from Images
2 Stereo Reconstruction
    2.1 Classical Two View Stereo Reconstruction [Optional]
        2.1.1 Epipolar Geometry and Stereo Triangulation
    2.2 Learning based Two View Stereo Reconstruction
    2.3 PMVS
    2.4 MVSNet
        2.4.1 Differentiable homography
        2.4.2 MVSNet overall architecture
Bibliography
1 Introduction
• Textureless Surfaces It is hard to infer geometry from a textureless surface (e.g., a white wall), since it looks similar from different viewpoints.
• Occlusions Scene objects may be partly or wholly invisible in some views due to occlusions in the scene.
Zbontar et al. [15] have shown that block matching on feature spaces gives more robust results and can be used for depth perception in a two-view stereo setting. The goal of multi-view stereo (MVS) techniques is to estimate a dense representation of the scene from overlapping calibrated views. Recent learning-based MVS methods [11, 13, 14] achieve more complete scene representations by learning depth maps in feature space.
Figure 1.1: Failure cases of block matching. (Image credit: Andreas Geiger)
2 Stereo Reconstruction
Here, we describe depth estimation in two-view and multi-view settings where the pose information is known, covering both traditional and learning-based methods.
Monocular vision suffers from a scale ambiguity that makes it impossible to triangulate the scene at the correct scale. Intuitively, if the distance of the scene from the camera and the geometry of the scene were both scaled by some positive factor k, the image plane would contain exactly the same projection of the scene, independent of the value of k.
Without any prior information, it is also impossible to perceive scene geometry from a single RGB image. The most common way of constructing and perceiving scene geometry is to move the camera and obtain a different view, as shown in Figure 2.1.
Even having multiple monocular views does not resolve the scale ambiguity without knowing the extrinsic calibration: if the relative pose between the views and the camera-to-scene distance were scaled by the same positive factor k, the image planes would again contain the same projections of the scene, independent of the value of k. Figure 2.2 shows that the point has the same projection regardless of the scale factor k.
This section mainly covers scene triangulation in two-view and multi-view settings with known relative pose transformations.
Figure 2.1: The left frame alone does not tell whether there are one or two spheres in the scene; seeing the right frame as well gives the viewer a better understanding and the perception of two spheres with different colors. (Image credit: Arne Nordmann)
Figure 2.2: Scale ambiguity of the two-view system without knowing the relative pose.
Two-view triangulation
Before diving into the two-view triangulation methods, this subsection introduces the basics and conventions of multiple view geometry. Assume $x_1$ and $x_2$ are the projections of a 3D point $X$, in homogeneous coordinates, in two different frames. $R$ and $T$ are the rotation and translation from the first frame to the second frame, and $\lambda_1$ and $\lambda_2$ are the distances from the camera centers to the 3D point $X$:
$$
\lambda_1 x_1 = X, \qquad \lambda_2 x_2 = R X + T
\qquad\Longrightarrow\qquad
\lambda_2 x_2 = R(\lambda_1 x_1) + T
$$

The hat operator $\hat{v}$ denotes the skew-symmetric matrix of a vector $v$, so that $\hat{v} w = v \times w$ and in particular $\hat{v} v = v \times v = 0$. Multiplying both sides by $\hat{T}$ eliminates the translation, and multiplying by $x_2^T$ then eliminates the left-hand side, since $x_2 \perp \hat{T} x_2$:

$$
\begin{aligned}
\lambda_2 \hat{T} x_2 &= \hat{T} R (\lambda_1 x_1) + \hat{T} T = \hat{T} R (\lambda_1 x_1) \\
\lambda_2 x_2^T (\hat{T} x_2) &= \lambda_1 x_2^T \hat{T} R x_1 \\
x_2^T \hat{T} R x_1 &= 0 \qquad \text{(epipolar constraint)} \\
E &= \hat{T} R \qquad \text{(essential matrix)}
\end{aligned}
\tag{2.2}
$$
With $x_i'$ being the image coordinate of $x_i$ and $K$ being the intrinsic matrix, the equation can be formulated more generically for uncalibrated views.
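Substituting $x_i = K^{-1} x_i'$ into the epipolar constraint gives the standard uncalibrated form (see e.g. [5]):

$$
x_2'^T \, K^{-T} \hat{T} R \, K^{-1} \, x_1' = 0, \qquad
F = K^{-T} E \, K^{-1} \quad \text{(fundamental matrix)}
$$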
Because of sensor noise and the discretization step in image formation, the pixel coordinates of the 3D point projections are noisy. As a result, the rays back-projected through corresponding points in the image planes usually do not intersect in 3D. This noise must be taken into account to obtain an accurate triangulation of corresponding points. There are multiple ways of doing two-view triangulation; two of them are covered here.
Midpoint method
Let $Q_1$ and $Q_2$ be the points on the two bearing rays that are closest to each other. The line passing through $Q_1 Q_2$ must be perpendicular to both bearing vectors for $Q_1 Q_2$ to be the shortest segment between the rays, and the midpoint of $Q_1 Q_2$ is accepted as the 3D triangulation of the corresponding points. With $\lambda_i$ being the scalar distance from the camera center to the point $Q_i$, and $R$ and $T$ being the relative pose from the second camera frame to the first, the approach can be formulated as the small linear system sketched below.
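A minimal numerical sketch of the midpoint method with NumPy (an illustration of the formulation above, not code from any particular library; bearing vectors are assumed to be given in each camera's own frame):

```python
import numpy as np

def midpoint_triangulation(d1, d2, R, T):
    """Midpoint triangulation of two corresponding bearing vectors.

    d1, d2: bearing vectors in the first and second camera frames;
    R, T:   relative pose mapping points from the second camera frame
            to the first (p1 = R @ p2 + T)."""
    d1 = d1 / np.linalg.norm(d1)
    r2 = R @ (d2 / np.linalg.norm(d2))   # second ray direction in frame 1
    # Rays in frame 1: Q1 = lam1 * d1 and Q2 = T + lam2 * r2.
    # Requiring (Q2 - Q1) to be perpendicular to both ray directions
    # gives a 2x2 linear system in (lam1, lam2).
    A = np.array([[d1 @ d1, -(d1 @ r2)],
                  [r2 @ d1, -(r2 @ r2)]])
    b = np.array([d1 @ T, r2 @ T])
    lam1, lam2 = np.linalg.solve(A, b)
    Q1 = lam1 * d1                       # closest point on ray 1
    Q2 = T + lam2 * r2                   # closest point on ray 2
    return 0.5 * (Q1 + Q2)               # midpoint of the shortest segment
```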
Linear triangulation
This method is based on the fact that, in the ideal case, the back-projected ray and the ray from the camera center to the corresponding point on the image plane are aligned, so their cross product is zero. Using this fact, the problem is converted into a set of linear equations that can be solved with the SVD. Let $x$ and $y$ be corresponding points, $P$ and $Q$ the respective perspective projection matrices of the two cameras, and $\lambda_x$ and $\lambda_y$ scalar values:
$$
x = \begin{pmatrix} u_x \\ v_x \\ 1 \end{pmatrix}, \qquad
y = \begin{pmatrix} u_y \\ v_y \\ 1 \end{pmatrix}
$$

$$
\lambda_x x = P X, \quad \lambda_y y = Q X
\qquad\Longrightarrow\qquad
x \times (P X) = 0, \quad y \times (Q X) = 0
$$

Writing the rows of $P$ as $p_1^T, p_2^T, p_3^T$ (and those of $Q$ as $q_i^T$), the cross product for the first view expands to

$$
\begin{pmatrix} u_x \\ v_x \\ 1 \end{pmatrix} \times
\begin{pmatrix} p_1^T \\ p_2^T \\ p_3^T \end{pmatrix} X = 0
\qquad\Longrightarrow\qquad
\begin{pmatrix}
v_x\, p_3^T - p_2^T \\
p_1^T - u_x\, p_3^T \\
u_x\, p_2^T - v_x\, p_1^T
\end{pmatrix} X = 0,
$$

and analogously for $y$ and $Q$. Since the third row of each cross product is linearly dependent on the other two, stacking the two independent rows from each view gives

$$
A X = 0, \qquad
A = \begin{pmatrix}
v_x\, p_3^T - p_2^T \\
p_1^T - u_x\, p_3^T \\
v_y\, q_3^T - q_2^T \\
q_1^T - u_y\, q_3^T
\end{pmatrix}
\tag{2.5}
$$
The solution for $X$ is the right singular vector of $A$ associated with the smallest singular value, easily calculated with the Singular Value Decomposition.
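A short sketch of this DLT-style triangulation with NumPy, building the matrix $A$ of (2.5) directly from the projection-matrix rows (illustrative; the inhomogeneous 3D point is returned):

```python
import numpy as np

def linear_triangulation(x, y, P, Q):
    """DLT triangulation of one correspondence.

    x, y: homogeneous pixel coordinates (u, v, 1) in the two views;
    P, Q: 3x4 perspective projection matrices of the two cameras."""
    A = np.stack([
        x[1] * P[2] - P[1],   # v_x * p3^T - p2^T
        P[0] - x[0] * P[2],   # p1^T  - u_x * p3^T
        y[1] * Q[2] - Q[1],   # v_y * q3^T - q2^T
        Q[0] - y[0] * Q[2],   # q1^T  - u_y * q3^T
    ])
    # X is the right singular vector of A with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]       # de-homogenize
```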
Multi-view triangulation
For multi-view triangulation, one solution is to find the point $X$ in 3D space that minimizes the sum of squared distances to the bearing rays. The analytical solution for $X$ is found by taking the derivative of the loss function with respect to the 3D point and finding the point where this derivative equals zero. With $C_i$ being the camera center of the $i$-th camera, $d_i$ the unit bearing vector of the $i$-th ray, $P_i$ the point on that ray closest to $X$, $\lambda_i$ the scalar distance between $C_i$ and $P_i$, and $X$ the optimal 3D point, the triangulation result can be formulated as below.
$$
\begin{aligned}
P_i &= C_i + \lambda_i d_i \\
\lambda_i d_i &\approx X - C_i && \text{ideal case with no noise} \\
\lambda_i &= \lambda_i d_i^T d_i \approx d_i^T (X - C_i) && \text{since } \|d_i\| = 1 \\
P_i &= C_i + \lambda_i d_i \approx C_i + d_i d_i^T (X - C_i) \\
r_i &= X - C_i - d_i d_i^T (X - C_i) = (I - d_i d_i^T)(X - C_i) \\
L &= \sum_{i=1}^N \|r_i\|^2 = \sum_{i=1}^N \big\|(I - d_i d_i^T)(X - C_i)\big\|^2 \\
\arg\min_X L &\;\Rightarrow\; \frac{\partial L}{\partial X} = 0 \\
\frac{\partial L}{\partial X} &= 2 \sum_{i=1}^N (I - d_i d_i^T)^2 (X - C_i) = 0 \\
A_i = I - d_i d_i^T &\;\Rightarrow\; \sum_{i=1}^N A_i^T A_i (X - C_i) = 0 \\
X &= \Big(\sum_{i=1}^N A_i^T A_i\Big)^{-1} \sum_{i=1}^N A_i^T A_i C_i
\end{aligned}
\tag{2.6}
$$
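The closed-form solution of (2.6) translates directly into a few lines of NumPy (a sketch; camera centers and unit bearing vectors are assumed to be expressed in a common world frame):

```python
import numpy as np

def multiview_triangulation(centers, bearings):
    """Least-squares triangulation from N rays, following (2.6).

    centers:  (N, 3) camera centers C_i in the world frame;
    bearings: (N, 3) unit bearing vectors d_i in the world frame."""
    S = np.zeros((3, 3))
    b = np.zeros(3)
    for Ci, di in zip(centers, bearings):
        Ai = np.eye(3) - np.outer(di, di)  # projector orthogonal to ray i
        AtA = Ai.T @ Ai                    # equals Ai: symmetric and idempotent
        S += AtA
        b += AtA @ Ci
    # X = (sum A_i^T A_i)^{-1} sum A_i^T A_i C_i
    return np.linalg.solve(S, b)
```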
2.2 Learning based Two View Stereo Reconstruction

[Figure: Siamese patch-comparison architectures of [15]. Left and right input patches pass through shared-weight convolutional branches; the features are either concatenated and fed through fully-connected ReLU/Sigmoid layers, or normalized and combined with a dot product, to produce a similarity score.]

[Figure: Shared-weight convolutional features correlated over a range of disparities to build a cost volume over (height, width, disparity).]

[Figure: End-to-end deep stereo regression pipeline of [7]: input stereo images, 2D convolution, cost volume, multi-scale 3D convolution, 3D deconvolution, soft argmax, disparities.]
2.3 PMVS
Patch-Based Multi-View Stereo [2] has proven quite effective in practice. After an initial feature matching step aimed at constructing a sparse set of photoconsistent patches (that is, patches whose projections in the images where they are visible have similar brightness or color patterns), it divides the input images into small square cells a few pixels across and attempts to reconstruct a patch in each one of them, using the cell connectivity to propose new patches and visibility constraints to filter out incorrect ones.
We assume throughout that $n$ cameras with known intrinsic and extrinsic parameters observe a static scene, and respectively denote by $O_i$ and $I_i$ ($i = 1, \ldots, n$) the optical centers of these cameras and the images they have recorded of the scene. The main elements of the PMVS model of multi-view stereo fusion and scene reconstruction are small rectangular patches, intended to be tangent to the observed surfaces, together with a few key properties of these patches: their geometry, which images they are visible in and whether they are photoconsistent with those, and some notion of connectivity inherited from image topology.
1. Matching: Use feature matching to construct an initial set of patches, and optimize their parameters to make them maximally photoconsistent.
2. Repeat 3 times:
   a) Expansion: Iteratively construct new patches in empty spots near existing ones, using image connectivity and depth extrapolation to propose candidates, and optimizing their parameters as before to make them maximally photoconsistent.
   b) Filtering: Again use the image connectivity to remove patches identified as outliers because their depth is not consistent with a sufficient number of other nearby patches. (A high-level sketch of this loop follows the list.)
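A high-level sketch of the match-expand-filter loop; all helper names (match_features, optimize_patch, propose_candidates, depth_consistent) are hypothetical placeholders for the corresponding steps in [2], not the authors' actual API:

```python
def pmvs(images, cameras, n_iters=3):
    # 1. Matching: seed patches from feature matches, each optimized to
    #    be maximally photoconsistent (hypothetical helpers throughout).
    patches = []
    for seed in match_features(images, cameras):
        p = optimize_patch(seed, images, cameras)
        if p is not None:
            patches.append(p)
    for _ in range(n_iters):
        # 2a. Expansion: propose candidates in empty cells near existing
        #     patches via image connectivity and depth extrapolation.
        for p in list(patches):
            for cand in propose_candidates(p, images, cameras):
                q = optimize_patch(cand, images, cameras)
                if q is not None:
                    patches.append(q)
        # 2b. Filtering: drop patches whose depth disagrees with enough
        #     nearby patches (visibility-based outlier removal).
        patches = [p for p in patches
                   if depth_consistent(p, patches, cameras)]
    return patches
```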
You should read Section 2 (Key Elements of the Proposed Approach) and Section 3 (Algorithm) of the paper [2].
Figure 2.7: PMVS overall approach. From left to right: a sample input image; detected features; reconstructed patches after the initial matching; final patches after expansion and filtering; polygonal surface extracted from reconstructed patches. [2]
2.4 MVSNet
State-of-the-art learning-based MVS approaches adapt photogrammetry-based MVS algorithms by implementing them as a set of differentiable operations defined in feature space. MVSNet [13] achieves good-quality 3D reconstruction by regularizing a cost volume that is computed using a differentiable homography on the feature maps of the reference and source images.
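2.4.1 Differentiable homography

To build the cost volume, the source-view feature maps are warped onto fronto-parallel planes of the reference camera at a set of sampled depths $d$. The planar homography between the $i$-th view and the reference view ($i = 1$) at depth $d$ is, as stated in [13],

$$
H_i(d) = K_i \, R_i \left( I - \frac{(t_1 - t_i)\, n_1^T}{d} \right) R_1^T \, K_1^{-1},
$$

where $K_i$, $R_i$, $t_i$ are the intrinsics, rotation, and translation of camera $i$, and $n_1$ is the principal axis of the reference camera. The warping itself uses bilinear interpolation, so the whole operation stays differentiable and the network remains trainable end to end.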
2.4.2 MVSNet overall architecture

[Figure: MVSNet overall architecture. Shared-weight feature extraction over the input images, differentiable homography warping, a variance-based cost metric, soft argmin depth regression supervised against the ground-truth (GT) depth (Loss0), and a refined depth map. [13]]
The raw cost volume computed from the image features is then regularized: a multi-scale 3D CNN refines it into a probability volume for depth inference. The depth map regressed from the probability volume is further refined using a 2D CNN.
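As a minimal sketch of the depth regression step: softmax along the depth axis turns the regularized cost volume into the probability volume, and the per-pixel expected depth is the soft argmin of [13] (the negative-cost softmax is an assumption about the sign convention):

```python
import numpy as np

def soft_argmin_depth(cost_volume, depth_values):
    """cost_volume:  (D, H, W) regularized matching costs;
    depth_values: (D,) sampled depth hypotheses.
    Returns the (H, W) expected-depth map."""
    p = np.exp(-cost_volume)               # lower cost -> higher probability
    p /= p.sum(axis=0, keepdims=True)      # probability volume along depth
    # Per-pixel expectation of depth: sum_d d * P(d)
    return np.tensordot(depth_values, p, axes=(0, 0))
```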
You should read Section 3 (MVSNet) of the paper [13].
Bibliography
[1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, pages 1–16, 2016.
[2] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
[3] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In IEEE International Conference on Computer Vision (ICCV), June 2015.
[4] D. Gallup, J. Frahm, P. Mordohai, Q. Yang, and M. Pollefeys. Real-time plane-sweeping stereo with multiple sweeping directions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
[5] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[6] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2007.
[7] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In IEEE International Conference on Computer Vision (ICCV), pages 66–75, 2017.
[8] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
[9] P. Moulon, P. Monasse, R. Perrot, and R. Marlet. OpenMVG: Open multiple view geometry. In International Workshop on Reproducible Research in Pattern Recognition, pages 60–74. Springer, 2016.
[10] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[11] F. Wang, S. Galliani, C. Vogel, P. Speciale, and M. Pollefeys. PatchmatchNet: Learned multi-view PatchMatch stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[12] Wikipedia. Homography (computer vision). Wikipedia, 2013.
[13] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2018.
[14] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan. Recurrent MVSNet for high-resolution multi-view stereo depth inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[15] J. Žbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1):2287–2318, 2016.