
P-MVSNet: Learning Patch-wise Matching Confidence Aggregation for Multi-View Stereo

Keyang Luo¹, Tao Guan¹,³, Lili Ju²,³, Haipeng Huang³, Yawei Luo¹*

¹Huazhong University of Science and Technology, China
²University of South Carolina, USA
³Farsee2 Technology Ltd, China

*Corresponding author.
Abstract
In recent years, learning-based methods have demonstrated strong competitiveness in depth estimation for multi-view stereo reconstruction. Among them, the approaches that generate cost volumes based on the plane-sweep algorithm and then use them for feature matching have proven particularly prominent. The plane-sweep volumes are essentially anisotropic in the depth and spatial directions, but they are often approximated by isotropic cost volumes in those methods, which can be detrimental. In this paper, we propose P-MVSNet, a new end-to-end deep learning network for multi-view stereo based on isotropic and anisotropic 3D convolutions. Our P-MVSNet consists of two core modules: a patch-wise aggregation module that learns to aggregate the pixel-wise correspondence information of the extracted features into a matching confidence volume, from which a hybrid 3D U-Net then infers a depth probability distribution and predicts the depth maps. We perform extensive experiments on the DTU and Tanks & Temples benchmark datasets, and the results show that the proposed P-MVSNet achieves state-of-the-art performance compared with many existing methods for multi-view stereo.
1. Introduction
Multi-view Stereo (MVS) aims to estimate a geometric
representation of the underlying scene from a collection of
images with known camera parameters, and is a fundamen-
tal computer vision problem which has been extensively
studied for decades. Inspired by the great success of Con-
volutional Neural Networks (CNNs) in many computer vi-
sion fields like semantic segmentation [28, 26], scene un-
derstanding [27] and stereo matching [5], several learning-
based MVS methods [43, 33] have been introduced recently
and can be divided into two types: voxel based ones and
depth-map based ones. The recent MVS benchmarks [1, 22]
show that learning-based methods can produce high-quality
Figure 1: Multi-view 3D reconstruction of Scan114 of the DTU dataset [1]. (a) The reference image; (b) the depth map predicted by the proposed P-MVSNet; (c) the filtered depth map; (d) the reconstructed 3D point cloud.
3D models comparable to those of the conventional state-of-the-art methods, although there is still much room for improvement. Furthermore, it is also observed that the depth-map based algorithms outperform the voxel based ones.
An essential step of the depth-map based learning methods is to construct a pixel-wise matching confidence/cost volume. The basic idea is to first build a plane-sweep volume with the plane-sweep algorithm [6] for a reference image picked from the input images, and then to calculate, at each sampled depth hypothesis, the matching cost between each pixel in the reference image and its corresponding pixels in the other adjacent images. A popular matching metric used in most existing methods is the variance of the features of the paired pixels, in which the contributions of all involved pixel pairs to the matching cost are treated equally.
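To make this concrete, the following is a minimal sketch, assuming PyTorch, of how such a variance-based cost volume can be assembled; the helper homography_warp and all tensor shapes are hypothetical illustrations, not the authors' implementation:

    import torch

    def homography_warp(src_feat, src_proj, ref_proj, depth):
        """Warp source-view features (C, H, W) onto the reference view at a
        fronto-parallel plane of the given depth (one plane-sweep step).
        Placeholder: a real version derives the planar homography from the
        two projection matrices and bilinearly resamples src_feat."""
        raise NotImplementedError  # omitted for brevity

    def variance_cost_volume(ref_feat, src_feats, ref_proj, src_projs, depths):
        """Build a pixel-wise cost volume of shape (C, D, H, W): the cost at
        each of the D depth hypotheses is the feature variance across all
        views, so every involved pixel pair contributes equally."""
        per_depth = []
        for d in depths:
            feats = [ref_feat]  # the reference features need no warping
            for src_feat, src_proj in zip(src_feats, src_projs):
                feats.append(homography_warp(src_feat, src_proj, ref_proj, d))
            stack = torch.stack(feats, dim=0)  # (V, C, H, W) over V views
            per_depth.append(stack.var(dim=0, unbiased=False))
        return torch.stack(per_depth, dim=1)  # (C, D, H, W)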
Such a metric is often not actually conducive to pixel-wise dense matching. For instance, when the features of a pixel