
FADNet: A Fast and Accurate Network for Disparity Estimation

Qiang Wang1,∗, Shaohuai Shi1,∗, Shizhen Zheng1, Kaiyong Zhao1,†, Xiaowen Chu1,†

Abstract— Deep neural networks (DNNs) have achieved great success in the area of computer vision. The disparity estimation problem tends to be addressed by DNNs, which achieve much better prediction accuracy in stereo matching than traditional methods based on hand-crafted features. On one hand, however, the designed DNNs require significant memory and computation resources to accurately predict the disparity, especially those based on 3D convolutions, which makes them difficult to deploy in real-time applications. On the other hand, existing computation-efficient networks lack expression capability on large-scale datasets, so they cannot make accurate predictions in many scenarios. To this end, we propose an efficient and accurate deep network for disparity estimation named FADNet with three main features: 1) it exploits efficient 2D based correlation layers with stacked blocks to preserve fast computation; 2) it combines residual structures to make the deeper model easier to learn; 3) it contains multi-scale predictions so as to exploit a multi-scale weight scheduling training technique to improve the accuracy. We conduct experiments to demonstrate the effectiveness of FADNet on two popular datasets, Scene Flow and KITTI 2015. Experimental results show that FADNet achieves state-of-the-art prediction accuracy and runs an order of magnitude faster than existing 3D models. The code of FADNet is available at https://github.com/HKBU-HPML/FADNet.

Fig. 1: Performance illustrations. (a) A challenging input image. (b) Result of PSMNet [6], which consumes 13.99 GB of GPU memory and takes 399.3 ms for one stereo image pair on an Nvidia Tesla V100 GPU. (c) Result of our FADNet, which consumes only 1.62 GB of GPU memory and takes 18.7 ms for one stereo image pair on the same GPU. (d) Ground truth.
I. INTRODUCTION

It has been seen that deep learning has been widely deployed in many computer vision tasks. Disparity estimation (also referred to as stereo matching) is a classical and important problem in computer vision applications, such as 3D scene reconstruction, robotics and autonomous driving. While traditional methods based on hand-crafted feature extraction and matching cost aggregation, such as Semi-Global Matching (SGM) [1], tend to fail on textureless and repetitive regions in the images, recent advanced deep neural network (DNN) techniques surpass them with decent generalization and robustness to those challenging patches, and achieve state-of-the-art performance on many public datasets [2][3][4][5][6][7]. The DNN-based methods for disparity estimation are end-to-end frameworks which take the stereo images (left and right) as input to the neural network and predict the disparity directly. The architecture of the DNN is essential to accurate estimation, and existing architectures can be categorized into two classes: encoder-decoder networks with 2D convolution (ED-Conv2D) and cost volume matching with 3D convolution (CVM-Conv3D). Besides, recent studies [8][9] begin to reveal the potential of automated machine learning (AutoML) for neural architecture search (NAS) on stereo matching, while some others [5][10] focus on creating large-scale datasets with high-quality labels. In practice, to measure whether a DNN model is good enough, we not only need to evaluate its accuracy on unseen samples (whether it can estimate the disparity correctly), but also its time efficiency (whether it can generate the results in real time).

In ED-Conv2D methods, stereo matching neural networks [2][3][5] were first proposed for end-to-end disparity estimation by exploiting an encoder-decoder structure. The encoder part extracts the features from the input images, and the decoder part predicts the disparity from the generated features. The disparity prediction is optimized as a regression or classification problem using large-scale datasets (e.g., Scene Flow [5], IRS [10]) with disparity ground truth. The correlation layer [11][5] was then proposed to increase the learning capability of DNNs in disparity estimation, and it has been proved to be successful in learning strong features at multiple levels of scale [11][5][12][13][14]. To further improve the capability of the models, residual networks [15][16][17] are introduced into those ED-Conv2D networks, since the residual structure enables much deeper networks to be easier to train [18]. The ED-Conv2D methods have been proved to be computing efficient, but they cannot achieve very high estimation accuracy.

∗ Authors have contributed equally. † Corresponding authors. 1 Department of Computer Science, Hong Kong Baptist University, {qiangwang,csshshi,szzheng,kyzhao,chxw}@comp.hkbu.edu.hk
To address the accuracy problem of disparity estimation, researchers have proposed CVM-Conv3D networks to better capture the features of stereo images and thus improve the estimation accuracy [3][19][6][7][20]. The key idea of the CVM-Conv3D methods is to generate the cost volume by concatenating left feature maps with their corresponding right counterparts across each disparity level [19][6]. The features of the cost volume are then automatically extracted by 3D convolution layers. However, 3D operations in DNNs are computing-intensive and hence very slow even with current powerful AI accelerators (e.g., GPUs). Although the 3D convolution based DNNs can achieve state-of-the-art disparity estimation accuracy, they are difficult to deploy due to their resource requirements. On one hand, they require a large amount of memory to install the model, so only a limited set of accelerators (like the Nvidia Tesla V100 with 32GB memory) can run these models. On the other hand, it takes several seconds to generate a single result even on the very powerful Tesla V100 GPU using CVM-Conv3D models. The memory consumption and the inefficient computation make the CVM-Conv3D methods difficult to deploy in practice. Therefore, it is crucial to address both the accuracy and efficiency problems for real-world applications.

To this end, we propose FADNet, a Fast and Accurate Disparity estimation Network based on ED-Conv2D architectures. FADNet can achieve high accuracy while keeping a fast inference speed. As illustrated in Fig. 1, our FADNet easily obtains performance comparable to the state-of-the-art PSMNet [6], while it runs approximately 20× faster than PSMNet and consumes 10× less GPU memory. In FADNet, we first exploit multiple stacked 2D-based convolution layers with fast computation, then we combine state-of-the-art residual architectures to improve the learning capability, and finally we introduce multi-scale outputs for FADNet so that it can exploit multi-scale weight scheduling to improve the training speed. These features enable FADNet to efficiently predict the disparity with high accuracy as compared to existing work. Our contributions are summarized as follows:
• We propose an accurate yet efficient DNN architecture for disparity estimation named FADNet, which achieves prediction accuracy comparable to CVM-Conv3D models while running an order of magnitude faster than the 3D-based models.
• We develop a multiple-round training scheme with multi-scale weight scheduling for FADNet, which improves the training speed yet maintains the model accuracy.
• We achieve state-of-the-art accuracy on the Scene Flow dataset with up to 20× and 45× faster disparity prediction speed than PSMNet [6] and GANet [7] respectively.

The rest of the paper is organized as follows. We introduce related work on DNN based stereo matching in Section II. Section III introduces the methodology and implementation of our proposed network. We demonstrate our experimental results in Section IV. We finally conclude the paper in Section V.

II. RELATED WORK

There exist many studies using deep learning methods to estimate image depth from monocular, stereo and multi-view images. Although monocular vision is low cost and commonly available in practice, it does not explicitly introduce any geometrical constraint, which is important for disparity estimation [21]. On the contrary, stereo vision leverages the cross-reference between the left and the right view, and usually shows greater performance and robustness in geometrical tasks. In this paper, we mainly discuss the work related to stereo images for disparity estimation, which is classified into two categories: 2D based and 3D based CNNs.

In 2D based CNNs, end-to-end architectures with mainly convolution layers [5][22] are proposed for disparity estimation; they use two stereo images as input, generate the disparity directly, and optimize the disparity as a regression task. However, these models are pure 2D CNN architectures which have difficulty capturing the matching features, so the estimation results are not good. To address the problem, the correlation layer, which can express the relationship between the left and right images, is introduced in the end-to-end architecture (e.g., DispNetCorr1D [5], FlowNet [11], FlowNet2 [23], DenseMapNet [24]). The correlation layer significantly increases the estimation performance compared to the pure CNNs, but existing architectures are still not accurate enough for production.

3D based CNNs are further proposed to increase the estimation performance [3][19][6][7][20]; they employ 3D convolutions with a cost volume. The cost volume is mainly formed by concatenating left feature maps with their corresponding right counterparts across each disparity level [19][6], and the features of the generated cost volumes can be learned by 3D convolution layers. The 3D based CNNs can automatically learn to regularize the cost volume and have achieved state-of-the-art accuracy on various datasets. However, the key limitation of the 3D based CNNs is their high computation resource requirement. For example, training GANet [7] with the Scene Flow [5] dataset takes weeks even using very powerful Nvidia Tesla V100 GPUs. Even though such models achieve good accuracy, they are difficult to deploy due to their very low time efficiency. To this end, we propose a fast and accurate DNN model for disparity estimation.

III. MODEL DESIGN AND IMPLEMENTATION

Our proposed FADNet exploits the structure of DispNetC [5] as a backbone, but it is extensively reformed to take care of both accuracy and inference speed, which is lacking in existing studies. We first change the structure in terms of branch depth and layer type by introducing two new modules, the residual block and point-wise correlation. Then we exploit the multi-scale residual learning strategy for training the refinement network. Finally, a loss weight training schedule is used to train the network in a coarse-to-fine manner.

A. Residual Block and Point-wise Correlation
Fig. 2: The model structure of our proposed FADNet. (The original diagram shows the stacked RB-NetC and RB-NetS branches taking the left (L) and right (R) images as input, built from Dual-Conv and Dual-ResBlock modules, together with the point-wise correlation layer, 3×3 convolutions with stride 1 or 2, deconvolution layers, concatenation, element-wise addition, and warping.)

DispNetC and DispNetS, which are both from the study in [5], basically use an encoder-decoder structure equipped with five feature extraction and down-sampling layers and five feature deconvolution layers. While conducting feature extraction and down-sampling, DispNetC and DispNetS first adopt a convolution layer with a stride of 1 and then a convolution layer with a stride of 2, so that they consistently shrink the feature map size by half. We call this two-layer convolution with size reduction Dual-Conv, which is shown in the bottom-left corner of Fig. 2. DispNetC equipped with Dual-Conv modules and a correlation layer finally achieves an end-point error (EPE) of 1.68 on the Scene Flow dataset, as reported in [5].

The residual block, originally derived in [15] for image classification tasks, is widely used to learn robust features and train very deep networks; it can well address the gradient vanishing problem when training very deep networks. Thus, we replace the convolution layers in the Dual-Conv module with residual blocks to construct a new module called Dual-ResBlock, which is also shown in the bottom-left corner of Fig. 2. With Dual-ResBlock, we can make the network deeper without training difficulty, as the residual block allows us to train very deep models. Therefore, we further increase the number of feature extraction and down-sampling layers from five to seven. Finally, DispNetC and DispNetS evolve into two new networks with better learning ability, called RB-NetC and RB-NetS respectively, as shown in Fig. 2.
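To make the Dual-ResBlock structure concrete, the following is a minimal PyTorch sketch of such a module, assuming a standard two-convolution residual block with a projection shortcut. The class names (ResBlock, DualResBlock) and details such as batch normalization are illustrative assumptions and do not necessarily match the released FADNet implementation.

import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions with a (projection) shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Project the shortcut when the shape changes (stride > 1 or channel change).
        self.shortcut = None
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.shortcut is None else self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


class DualResBlock(nn.Module):
    """Dual-ResBlock sketch: a stride-1 residual block followed by a stride-2 one,
    mirroring the stride-1 + stride-2 convolution pair of Dual-Conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block1 = ResBlock(in_ch, out_ch, stride=1)
        self.block2 = ResBlock(out_ch, out_ch, stride=2)  # halves H and W

    def forward(self, x):
        return self.block2(self.block1(x))


if __name__ == "__main__":
    x = torch.randn(1, 3, 384, 768)      # e.g., a randomly cropped training image
    y = DualResBlock(3, 64)(x)
    print(y.shape)                       # -> torch.Size([1, 64, 192, 384])

Because gradients can flow through the shortcut connections, increasing the number of down-sampling stages from five to seven, as described above, does not noticeably harm optimization.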
One of the most important contributions of DispNetC is the correlation layer, which targets finding correspondences between the left and right images. Given two multi-channel feature maps f1 and f2 with w, h and c as their width, height and number of channels, the correlation layer calculates the cost volume between them using Eq. (1):

c(x1, x2) = Σ_{o ∈ [−k,k]×[−k,k]} ⟨f1(x1 + o), f2(x2 + o)⟩,        (1)

where k is the kernel size of cost matching, and x1 and x2 are the centers of two patches from f1 and f2 respectively. Computing all patch combinations involves c × K^2 × w^2 × h^2 multiplications and produces a cost matching map of size w × h. Given a maximum search range D, we fix x1 and shift x2 along the x-axis from −D to D with a stride of two. Thus, the final output cost volume size will be w × h × D.

However, the correlation operation assumes that each pixel in the patch contributes equally to the point-wise convolution results, which may lose the ability to learn more complicated matching patterns. Here we propose point-wise correlation composed of two modules. The first module is a classical convolution layer with a kernel size of 3 × 3 and a stride of 1. The second one is an element-wise multiplication, defined by Eq. (2):

c(x1, x2) = ⟨f1(x1), f2(x2)⟩,        (2)

where we remove the patch convolution of Eq. (1). Since the maximum valid disparity is 192 in the evaluated datasets, the maximum search range at the original image resolution is no more than 192. Recall that the correlation layer is placed after the third Dual-ResBlock, whose output feature resolution is 1/8, so a proper search range should not be less than 192/8=16. We set a marginally larger value of 20. We also test some other values, such as 10 and 40, which do not surpass the version using 20 in the network; applying a too small or too large search range may lead to under-fitting or over-fitting.
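Below is a minimal PyTorch sketch of the point-wise correlation of Eq. (2), assuming 1/8-resolution feature maps of shape (N, C, H, W) and a search range D = 20 scanned from −D to D with a stride of 2. The class name, the zero-padding of shifted features, and the output layout are illustrative assumptions rather than the exact released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PointwiseCorrelation(nn.Module):
    """Sketch of point-wise correlation: a 3x3, stride-1 convolution on each feature
    map, followed by the channel-wise inner product of Eq. (2) evaluated for
    horizontal shifts of the right features from -D to D with a stride of 2."""

    def __init__(self, channels, max_shift=20):
        super().__init__()
        self.max_shift = max_shift
        self.conv_l = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_r = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat_l, feat_r):
        fl, fr = self.conv_l(feat_l), self.conv_r(feat_r)
        w = fr.shape[3]
        maps = []
        for d in range(-self.max_shift, self.max_shift + 1, 2):
            if d >= 0:   # shift the right features by d pixels along x (zero padding)
                shifted = F.pad(fr, (d, 0, 0, 0))[..., :w]
            else:
                shifted = F.pad(fr, (0, -d, 0, 0))[..., -d:]
            maps.append((fl * shifted).sum(dim=1, keepdim=True))  # <f1(x1), f2(x2)>
        return torch.cat(maps, dim=1)   # one correlation map per tested shift


if __name__ == "__main__":
    corr = PointwiseCorrelation(channels=64, max_shift=20)
    left = torch.randn(1, 64, 48, 96)    # e.g., features at 1/8 resolution
    right = torch.randn(1, 64, 48, 96)
    print(corr(left, right).shape)       # -> torch.Size([1, 21, 48, 96])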
Table I lists the accuracy improvement brought by applying the proposed Dual-ResBlock and point-wise correlation. We train the models using the same dataset and the same training scheme. It is observed that RB-NetC outperforms DispNetC with a much lower EPE, which indicates the effectiveness of the residual structure. We also notice that setting a proper search range D for the correlation layer helps further improve the model accuracy.

TABLE I: Model accuracy improvement of Dual-ResBlock and point-wise correlation with different D.
Model      D   Training EPE   Test EPE
DispNetC   20  2.89           2.80
RB-NetC    10  2.28           2.06
RB-NetC    20  2.09           1.76
RB-NetC    40  2.12           1.83

B. Multi-Scale Residual Learning

Instead of directly stacking the DispNetC and DispNetS sub-networks to conduct the disparity refinement procedure [13], we apply the multi-scale residual learning first proposed by [25]. The basic idea is that the second (refinement) network learns the disparity residuals and accumulates them onto the initial results generated by the first network, instead of directly predicting the whole disparity map. In this way, the second network only needs to focus on learning the highly nonlinear residual, which is effective in avoiding gradient vanishing. Our final FADNet is formed by stacking RB-NetC and RB-NetS with multi-scale residual learning, as shown in Fig. 2.

As illustrated in Fig. 2, the upper RB-NetC takes the left and right images as input and produces disparity maps at a total of 7 scales, denoted by c_s, where s ranges from 0 to 6. The bottom RB-NetS takes as input the left image, the right image, and the warped left image, and predicts the residuals. The generated residuals (denoted by r_s) from RB-NetS are then accumulated onto the predictions of RB-NetC to generate the final disparity maps at multiple scales (s = 0, 1, ..., 6). Thus, the final disparity maps predicted by FADNet, denoted by d̂_s, can be calculated by

d̂_s = c_s + r_s,  0 ≤ s ≤ 6.        (3)
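The residual refinement of Eq. (3) can be sketched as follows, assuming rb_netc and rb_nets are callables returning lists of seven disparity (or residual) maps ordered from the finest scale (s = 0) to the coarsest (s = 6). This interface, and the bilinear warping used to produce the warped image, are assumptions for illustration, not the exact FADNet code.

import torch
import torch.nn.functional as F


def warp_right_to_left(img_right, disp_left):
    """Sketch of inverse warping: sample the right image at x - d(x) so that the
    result should align with the left image when the disparity is correct.
    img_right: (N, C, H, W), disp_left: (N, 1, H, W) disparity in pixels."""
    n, _, h, w = img_right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.unsqueeze(0).float().to(img_right.device) - disp_left.squeeze(1)
    ys = ys.unsqueeze(0).float().to(img_right.device).expand_as(xs)
    grid = torch.stack((2.0 * xs / (w - 1) - 1.0, 2.0 * ys / (h - 1) - 1.0), dim=3)
    return F.grid_sample(img_right, grid, align_corners=True)


def fadnet_forward(left, right, rb_netc, rb_nets):
    # First stage: disparity maps c_s for s = 0..6 (finest scale first).
    c = rb_netc(left, right)
    # Warp using the full-resolution first-stage disparity, then predict residuals r_s.
    warped = warp_right_to_left(right, c[0])
    r = rb_nets(left, right, warped)
    # Eq. (3): the final prediction at every scale is the sum of estimate and residual.
    return [cs + rs for cs, rs in zip(c, r)]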
C. Loss Function Design

Given a pair of stereo RGB images, our FADNet takes them as input and produces seven disparity maps at different scales. Assume that the input image size is H × W. The dimensions of the seven output disparity maps are H × W, (1/2)H × (1/2)W, (1/4)H × (1/4)W, (1/8)H × (1/8)W, (1/16)H × (1/16)W, (1/32)H × (1/32)W, and (1/64)H × (1/64)W respectively. To train FADNet in an end-to-end manner, we adopt the pixel-wise smooth L1 loss between the predicted disparity map and the ground truth:

L_s(d_s, d̂_s) = (1/N) Σ_{i=1}^{N} smooth_L1(d_s^i − d̂_s^i),        (4)

where N is the number of pixels of the disparity map, d_s^i is the i-th element of d_s ∈ R^N, and

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| − 0.5 otherwise.        (5)

Note that d_s is the ground truth disparity at scale 1/2^s and d̂_s is the predicted disparity at scale 1/2^s. The loss function is applied separately to the seven scales of outputs, which generates seven loss values. The loss values are then accumulated with loss weights.

The loss weight scheduling technique, initially proposed in [5], is useful for learning the disparity in a coarse-to-fine manner. Instead of just switching the losses of different scales on and off, we apply different non-zero weight groups to tackle the different scales of disparity. Let w_s denote the weight for the loss at scale s. The final loss function is

L = Σ_{s=0}^{6} w_s L_s(d_s, d̂_s).        (6)

The specific setting is listed in Table II. In total there are seven scales of predicted disparity maps. At the beginning, we assign relatively low weights to the large-scale (high-resolution) disparity maps so that the network first learns the coarse features. Then we increase the loss weights of the large scales to let the network gradually learn the finer features. Finally, we deactivate all the losses except the final prediction at the original input size. With the successive rounds of weight scheduling, the evaluation EPE gradually improves to the final accuracy shown in Table III on the Scene Flow dataset.

TABLE II: Multi-scale loss weight scheduling.
Round  w0    w1    w2    w3    w4    w5     w6
1      0.32  0.16  0.08  0.04  0.02  0.01   0.005
2      0.6   0.32  0.08  0.04  0.02  0.01   0.005
3      0.8   0.16  0.04  0.02  0.01  0.005  0.0025
4      1.0   0     0     0     0     0      0

TABLE III: Model accuracy with different rounds of weight scheduling.
Round  # Epochs  Training EPE  Test EPE  Improvement (%)
1      20        1.85          1.57      -
2      20        1.33          1.32      18.9
3      20        1.04          0.93      41.9
4      30        0.92          0.83      12.0
Note: "Improvement" indicates the improvement of the current round of weight scheduling over its previous round.

Table III lists the model accuracy improvements (around 12%-41%) brought by the multiple-round training with the four loss weight groups. It is observed that the training and testing EPEs decrease smoothly and stay close to each other, which indicates good generalization and the advantage of our training strategy.
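A sketch of the loss of Eqs. (4)-(6) together with the round-based weight groups of Table II is given below. Here preds and gts are assumed to be lists of the seven predicted and ground-truth disparity maps (finest scale first); only the weight values themselves are taken from Table II, and the >192 disparity mask follows the Scene Flow setup described in Section IV-B.

import torch.nn.functional as F

# Loss weight groups (w0..w6) for the four training rounds, copied from Table II.
LOSS_WEIGHTS = [
    [0.32, 0.16, 0.08, 0.04, 0.02, 0.01, 0.005],    # round 1
    [0.60, 0.32, 0.08, 0.04, 0.02, 0.01, 0.005],    # round 2
    [0.80, 0.16, 0.04, 0.02, 0.01, 0.005, 0.0025],  # round 3
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],     # round 4
]


def multiscale_loss(preds, gts, weights, max_disp=192.0):
    """Eq. (6): weighted sum of the per-scale smooth L1 losses of Eqs. (4)-(5)."""
    total = 0.0
    for pred, gt, w in zip(preds, gts, weights):
        if w == 0:
            continue
        valid = gt < max_disp               # drop pixels with disparity larger than 192
        if valid.any():
            # F.smooth_l1_loss uses the |x| < 1 case split of Eq. (5) and averages over
            # the selected pixels, matching the 1/N factor of Eq. (4).
            total = total + w * F.smooth_l1_loss(pred[valid], gt[valid])
    return total

During round r of training, the weight group LOSS_WEIGHTS[r] would simply be passed as the weights argument.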
IV. PERFORMANCE EVALUATION

In this section, we present the experimental results of our proposed FADNet compared to existing work (i.e., DispNetC [5], PSMNet [6], GANet [7] and DenseMapNet [24]) in terms of accuracy and time efficiency.
A. Experimental Setup

We implement our FADNet using PyTorch (https://pytorch.org), one of the popular deep learning frameworks, and we make the code and experimental setups publicly available at https://github.com/HKBU-HPML/FADNet.

In terms of accuracy, the model is trained with Adam (β1 = 0.9, β2 = 0.999). We perform color normalization with the mean ([0.485, 0.456, 0.406]) and standard deviation ([0.229, 0.224, 0.225]) of the ImageNet [26] dataset for data pre-processing. During training, images are randomly cropped to size H = 384 and W = 768. The batch size is set to 16 for training on four Nvidia Titan X (Pascal) GPUs (4 samples per GPU). We apply the four-round training scheme illustrated in Section III-C, where each round adopts a different loss weight group. At the beginning of each round, the learning rate is initialized to 10^-4 and is decayed by half every 10 epochs. We train 20 epochs in each of the first three rounds and 30 in the last round.

In terms of time efficiency, we evaluate the inference time of existing state-of-the-art DNNs, including both 2D and 3D based networks, using a pair of stereo images (H = 576, W = 960) from the Scene Flow dataset [5] on a desktop-level Nvidia Titan X (Pascal) GPU (with 12GB memory) and a server-level Nvidia Tesla V100 GPU (with 32GB memory).
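The training setup above maps to a straightforward PyTorch configuration; the sketch below summarizes the normalization, optimizer, and per-round learning-rate decay under the stated hyper-parameters. The helper names and the use of StepLR are illustrative choices, not taken from the released training scripts.

import torch
import torchvision.transforms as T

# ImageNet statistics used for color normalization of the input images.
normalize = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

CROP_SIZE = (384, 768)               # random crop (H, W) during training
BATCH_SIZE = 16                      # spread over four Titan X (Pascal) GPUs
EPOCHS_PER_ROUND = [20, 20, 20, 30]  # one loss-weight group per round (Table II)


def make_optimizer_and_scheduler(model):
    """Adam with beta1=0.9, beta2=0.999; the learning rate restarts at 1e-4 at the
    beginning of every round and is halved every 10 epochs within the round."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    return optimizer, scheduler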
B. Dataset

We use two popular publicly available datasets to train and evaluate the performance of our FADNet. The first one is Scene Flow, which is produced by synthetic rendering techniques. The second one is KITTI 2015, which is captured by real-world cameras and laser sensors.

1) Scene Flow [5]: a large synthetic dataset which provides 39,824 samples of stereo RGB images in total (35,454 for training and 4,370 for testing). The full resolution of the images is 960×540. The dataset covers a wide range of object shapes and textures and provides high-quality dense disparity ground truth. We use the end-point error (EPE) as the error measurement. We remove those pixels whose disparity values are larger than 192 from the loss computation, as is typically done in previous studies [6][7].

2) KITTI 2015 [27]: an open benchmark dataset which contains 200 stereo images that are grayscale and have a resolution of 1241×376. The ground truth disparity is generated by LIDAR equipment, so the disparity maps are very sparse. During training, we randomly crop images and disparity maps to a resolution of 1024×256. We use the full resolution during testing.

C. Experimental Results

The experimental results on the Scene Flow dataset are shown in Table IV. Regarding the model accuracy measured with EPE, our proposed FADNet achieves performance comparable to the state-of-the-art CVM-Conv3D models (PSMNet and GANet), while FADNet is 46× and 8× faster than GANet and PSMNet respectively on an Nvidia Tesla V100 GPU. Moreover, PSMNet and GANet are not even runnable on the Titan X (Pascal) GPU, which implies a high deployment cost in practice. Compared to DispNetC and DenseMapNet, FADNet is relatively slow, but it predicts the disparity more than 2× more accurately than DispNetC and DenseMapNet, which is a huge accuracy improvement. The visualized comparison of predicted disparity maps is shown in Fig. 3.

TABLE IV: Disparity EPE on the Scene Flow dataset.
Model          EPE   Memory (GB)  Runtime (ms), Titan X (Pascal)  Runtime (ms), Tesla V100
FADNet (ours)  0.83  3.87         65.5                            48.1
DispNetC       1.68  1.62         28.7                            18.7
DenseMapNet    5.36  -            <30                             -
PSMNet         1.09  13.99        OOM                             399.3
GANet          0.84  29.1         OOM                             2251.1
Note: "OOM" indicates that the model runs out of memory. Runtime is the inference time per pair of stereo images, averaged over 100 runs. The underlined numbers are taken from the original papers.

TABLE V: Results on the KITTI 2015 dataset.
               Noc (%)                 All (%)
Model          D1-bg  D1-fg  D1-all    D1-bg  D1-fg  D1-all
FADNet (ours)  2.49   3.07   2.59      2.68   3.50   2.82
DispNetC       4.11   3.72   4.05      4.32   4.41   4.34
GC-Net         2.02   5.58   2.61      2.21   6.16   2.87
PSMNet         1.71   4.31   2.14      1.86   4.62   2.32
GANet          1.34   3.11   1.63      1.48   3.46   1.81
Note: "Noc" and "All" indicate the percentage of outliers averaged over ground-truth pixels of non-occluded and all regions respectively. "D1-bg", "D1-fg" and "D1-all" indicate the percentage of outliers averaged over background, foreground and all ground-truth pixels respectively.

From the visualized disparity maps shown in Fig. 3, we can see that the details of textures are successfully estimated by our FADNet, while PSMNet is a little worse and DispNetC misses almost all of the details. The visualization results are dramatically different although the EPE gap between DispNetC and FADNet is only 0.85. In the qualitative evaluation, FADNet is more robust and accurate than DispNetC as a 2D based network and PSMNet as a 3D based network.

From Table IV, it is also noticed that the CVM-Conv3D architectures cannot be used on the desktop-level GPU equipped with 12 GB memory, while the proposed FADNet requires only 3.87 GB to perform the disparity estimation. The low memory requirement of FADNet makes it much easier to deploy in real-world applications. DispNetC is also an efficient architecture in terms of both memory consumption and computing efficiency, but its estimation performance is too poor for it to be used in real-world applications. In summary, FADNet not only achieves high disparity estimation accuracy, but is also very efficient and practical for deployment.

The experimental results on the KITTI 2015 dataset are shown in Table V. GANet achieves the best estimation results among the evaluated models, and our proposed FADNet achieves comparable error rates on the D1-fg metric. The qualitative evaluation on the KITTI 2015 dataset is shown in Fig. 4; the error maps of FADNet are close to those of PSMNet, while both are much better than those of DispNetC.
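The runtimes in Table IV are per-pair averages over 100 runs; a minimal sketch of such a measurement with GPU synchronization is shown below. The warm-up iterations and helper name are illustrative additions; only the 100-run averaging follows the table note.

import time
import torch


@torch.no_grad()
def measure_inference_time(model, height=576, width=960, runs=100, warmup=10):
    """Average per-pair inference time (ms) for a stereo input on the current GPU."""
    model = model.cuda().eval()
    left = torch.randn(1, 3, height, width, device="cuda")
    right = torch.randn(1, 3, height, width, device="cuda")
    for _ in range(warmup):                  # warm up kernels / cuDNN autotuning
        model(left, right)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(left, right)
    torch.cuda.synchronize()                 # wait for all queued kernels to finish
    # torch.cuda.max_memory_allocated() can additionally report peak GPU memory.
    return (time.time() - start) / runs * 1000.0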
Fig. 3: Results of disparity prediction for Scene Flow testing data. The leftmost column shows the left images of the stereo pairs. The remaining three columns show the disparity maps estimated by (a) DispNetC [5], (b) PSMNet [6], and (c) FADNet.

Fig. 4: Results of disparity prediction for KITTI 2015 testing data. The leftmost column shows the left images of the stereo pairs. The remaining three columns show the disparity maps estimated by (a) DispNetC [5], (b) PSMNet [6], and (c) FADNet, as well as their error maps.

V. CONCLUSION AND FUTURE WORK

In this paper, we proposed an efficient yet accurate neural network, FADNet, for end-to-end disparity estimation, which embraces both time efficiency and estimation accuracy on the stereo matching problem. The proposed FADNet exploits point-wise correlation layers, residual blocks, and a multi-scale residual learning strategy to make the model accurate in many scenarios while preserving a fast inference time. We compared FADNet with existing state-of-the-art 2D and 3D based methods on two popular datasets in terms of accuracy and speed. Experimental results showed that FADNet achieves comparable accuracy while running much faster than the 3D based models. Compared to the 2D based models, FADNet is more than twice as accurate.

We have two future directions following our discovery in this paper. First, we would like to develop fast disparity inference of FADNet on edge devices. Since the computational capability of edge devices is much lower than that of the server GPUs used in our experiments, it is necessary to explore model compression techniques, including pruning, quantization, and so on. Second, we would also like to apply AutoML [9] to search for a well-performing network structure for disparity estimation.

ACKNOWLEDGEMENTS

This research was supported by Hong Kong RGC GRF grant HKBU 12200418. We thank the anonymous reviewers for their constructive comments and suggestions. We would also like to thank the NVIDIA AI Technology Centre (NVAITC) for providing the GPU clusters for some experiments.
REFERENCES

[1] H. Hirschmuller, "Stereo processing by semiglobal matching and mutual information," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328-341, 2007.
[2] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4353-4361.
[3] J. Zbontar and Y. LeCun, "Stereo matching by training a convolutional neural network to compare image patches," Journal of Machine Learning Research, vol. 17, no. 1-32, p. 2, 2016.
[4] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[5] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040-4048.
[6] J.-R. Chang and Y.-S. Chen, "Pyramid stereo matching network," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[7] F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr, "GA-Net: Guided aggregation net for end-to-end stereo matching," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[8] T. Saikia, Y. Marrakchi, A. Zela, F. Hutter, and T. Brox, "AutoDispNet: Improving disparity estimation with AutoML," in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[9] X. He, K. Zhao, and X. Chu, "AutoML: A survey of the state-of-the-art," arXiv preprint arXiv:1908.00709, 2019.
[10] Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, and X. Chu, "IRS: A large synthetic indoor robotics stereo dataset for disparity and surface normal estimation," arXiv preprint arXiv:1912.09678, 2019.
[11] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2758-2766.
[12] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[13] E. Ilg, T. Saikia, M. Keuper, and T. Brox, "Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation," in The European Conference on Computer Vision (ECCV), September 2018.
[14] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang, "Learning for disparity estimation through feature constancy," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2811-2820.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[16] A. E. Orhan and X. Pitkow, "Skip connections eliminate singularities," arXiv preprint arXiv:1701.09175, 2017.
[17] W. Zhan, X. Ou, Y. Yang, and L. Chen, "DSNet: Joint learning for scene segmentation and disparity estimation," in 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 2946-2952.
[18] X. Du, M. El-Khamy, and J. Lee, "AMNet: Deep atrous multiscale stereo disparity estimation networks," arXiv preprint arXiv:1904.09099, 2019.
[19] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, "End-to-end learning of geometry and context for deep stereo regression," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 66-75.
[20] G.-Y. Nie, M.-M. Cheng, Y. Liu, Z. Liang, D.-P. Fan, Y. Liu, and Y. Wang, "Multi-level context ultra-aggregation for stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3283-3291.
[21] Y. Luo, J. Ren, M. Lin, J. Pang, W. Sun, H. Li, and L. Lin, "Single view stereo matching," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[22] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan, "Cascade residual learning: A two-stage convolutional neural network for stereo matching," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 887-895.
[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2462-2470.
[24] R. Atienza, "Fast disparity estimation using dense networks," in 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 3207-3212.
[25] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan, "Cascade residual learning: A two-stage convolutional neural network for stereo matching," in The IEEE International Conference on Computer Vision (ICCV) Workshops, October 2017.
[26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255.
[27] M. Menze, C. Heipke, and A. Geiger, "Joint 3D estimation of vehicles and scene flow," in ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
