0% found this document useful (0 votes)
21 views19 pages

双目-Robust Synthetic-To-Real Transfer for Stereo Matching

Uploaded by

2397273777
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views19 pages

双目-Robust Synthetic-To-Real Transfer for Stereo Matching

Uploaded by

2397273777
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

Robust Synthetic-to-Real Transfer for Stereo Matching

Jiawei Zhang1 , Jiahe Li1 , Lei Huang2 , Xiaohan Yu3 , Lin Gu4,5 , Jin Zheng1 , Xiao Bai1 *
1
School of Computer Science and Engineering, State Key Laboratory of
Complex & Critical Software Environment, Jiangxi Research Institute, Beihang University
2
SKLCCSE, Institute of Artificial Intelligence, Beihang University
3
School of Computing, Macquarie University, Australia 4 RIKEN AIP 5 The University of Tokyo
arXiv:2403.07705v1 [cs.CV] 12 Mar 2024

Abstract Pre-train GT Fine-tune DKT(ours)

With advancements in domain generalized stereo match-


ing networks, models pre-trained on synthetic data demon-
strate strong robustness to unseen domains. However, few
studies have investigated the robustness after fine-tuning (a.1) (a.2)
them in real-world scenarios, during which the domain gen- (a) KITTI Fine-tuning (b) Booster Fine-tuning

eralization ability can be seriously degraded. In this pa-


Figure 1. Target-domain and cross-domain performance of net-
per, we explore fine-tuning stereo matching networks with-
works pre-trained on synthetic data, fine-tuned with GT, and with
out compromising their robustness to unseen domains. Our our method (best visualized in colors). IGEV-Stereo [40] is used as
motivation stems from comparing Ground Truth (GT) ver- the baseline model. D1 error (the lower the better) is used for eval-
sus Pseudo Label (PL) for fine-tuning: GT degrades, but PL uation. (a) and (b) fine-tune networks on the KITTI and Booster
preserves the domain generalization ability. Empirically, datasets, respectively. Our method achieves great performance
we find the difference between GT and PL implies valu- in target and unseen domains simultaneously. We also evaluate
able information that can regularize networks during fine- the robustness of KITTI fine-tuned networks on DrivingStereo in
tuning. We also propose a framework to utilize this differ- (a.2), where our method is more robust to challenging weather.
ence for fine-tuning, consisting of a frozen Teacher, an expo-
nential moving average (EMA) Teacher, and a Student net-
work. The core idea is to utilize the EMA Teacher to mea- of real data, existing networks are commonly pre-trained
sure what the Student has learned and dynamically improve on synthetic data. With recent advancements in building
GT and PL for fine-tuning. We integrate our framework with domain generalized networks [7, 17, 21, 52], they have
state-of-the-art networks and evaluate its effectiveness on demonstrated strong robustness to unseen domains. How-
several real-world datasets. Extensive experiments show ever, as shown in Figure 1, fine-tuning pre-trained networks
that our method effectively preserves the domain general- in real-world scenarios degrades the domain generalization
ization ability during fine-tuning. Code is available at: ability. We provide visualization results in Figure 2. The
https://github.com/jiaw-z/DKT-Stereo. degradation of robustness to unseen scenarios can render
networks unreliable for real-world applications. To solve it,
we explore what degrades the domain generalization ability
1. Introduction of stereo networks during fine-tuning and provide a solution
by combining Ground Truth (GT) with Pseudo Label (PL)
Estimating 3D geometry from 2D images is a fundamen- predicted by pre-trained stereo matching networks.
tal problem in computer vision. Stereo matching is a so-
lution that identifies matching correspondences and recov- Our exploration stems from the observation that using
ers depth information through triangulation. With the de- PL for fine-tuning stereo matching networks can preserve
velopment of deep learning, stereo matching networks have the domain generalization ability (cf . Section 3.3.1). Since
shown impressive performance on various benchmarks. there is always a gap between predicted PL and GT, we
A major obstacle to their applications is the high cost of consider whether the difference between them implies cer-
collecting real-world annotations. To make up for the lack tain knowledge. We are also motivated by previous stud-
ies [9, 14, 54] of knowledge distillation utilizing wrong re-
* Corresponding author: Xiao Bai ([email protected]). sponse to non-target classes, which they refer to as “dark

1
Target
Unseen

Left Image Pre-trained GT-ft DKT-ft Left Image Pre-trained GT-ft DKT-ft
(a) KITTI Fine-tuning
Target
Unseen

Left Image Pre-trained GT-ft DKT-ft Left Image Pre-trained GT-ft DKT-ft
(b) Booster Fine-tuning
Figure 2. Visualization results on both target and unseen domains. IGEV-Stereo [40] is used as the baseline model. The network pre-
trained on synthetic data shows robustness to unseen domains but can still fail to handle unseen challenges such as transparent or mirror
(ToM) surfaces. Fine-tuning the network with GT improves the target-domain performance, however, it seriously degrades the domain
generalization ability. Our DKT fine-tuning framework performs well on target and unseen domains simultaneously.

knowledge” of a transfer set by introducing PL with GT. EMA Teacher’s prediction in the consistent region, adding
Starting from the observation, we aim to identify how fine-grained permutations to prevent networks from overfit-
GT and the predicted PL behave differently and further an- ting GT details. For PL, F&E filters out the inconsistent
swer why fine-tuning with GT degrades the domain general- region between PL and the EMA Teacher’s prediction and
ization ability of stereo matching networks. Specifically, we ensembles them in the consistent region, enhancing the ac-
divide pixels into consistent and inconsistent regions based curacy of PL predicted by the frozen Teacher. We train
on the difference between GT and PL (cf . Section 3.1). GT the Student with improved GT and PL jointly. After fine-
and PL of the consistent region are similar, otherwise, it tuning, we keep the Student for inference.
is defined as the inconsistent region. We empirically find Our main contributions are as follows:
two primary causes that degrade the domain generalization • To the best of our knowledge, we make the first attempt
ability: (i) In the inconsistent region, networks encounter to address the domain generalization ability degradation
new knowledge not learned during pre-training on synthetic for fine-tuning stereo matching networks. We divide pix-
data. Learning new knowledge without sufficient consis- els into consistent and inconsistent regions based on the
tent region for regularization seriously degrades the domain difference between ground truth and pseudo label and
generalization ability. (ii) In the consistent region, stereo demonstrate their varied roles during fine-tuning. Our fur-
matching networks can still overfit the details of GT. ther analysis of their roles identifies two primary causes
Based on the exploration, we propose the framework uti- of the degradation: learning new knowledge without suf-
lizing Dark Knowledge to Transfer (DKT) stereo match- ficient regularization and overfitting ground truth details.
ing networks. DKT consists of a frozen Teacher, an ex- • We propose the F&E module to address these two causes,
ponential moving average (EMA) Teacher, and a Student filtering out the inconsistent region to avoid insufficient
network, all initialized with the same pre-trained weights. regularization and ensembling disparities in the consistent
We use the frozen Teacher to predict PL. The EMA Teacher region to prevent overfitting ground truth details.
is updated by the Student’s weights, serving as a dynamic • We introduce a dynamic adjustment for different re-
measure of consistent and inconsistent regions. With the gions by incorporating the exponential moving average
EMA Teacher’s prediction, we propose the Filter and En- Teacher, achieving the balance of preserving domain gen-
semble (F&E) module to improve disparity maps. For GT, eralization ability and learning target domain knowledge.
F&E filters out the inconsistent region with a probability, re- • We develop the DKT fine-tuning framework, which can
sulting in the remaining GT that has a reduced inconsistent be easily applied to existing networks, significantly im-
region, avoiding insufficient consistent region for regular- proving their robustness to unseen domains and achieving
ization. It also performs an ensemble between GT and the competitive target-domain performance simultaneously.

2
We believe this exploration will stimulate further consider-
ation of stereo matching networks’ robustness and domain
generalization ability when fine-tuning them in real-world
scenarios, crucial for their practical applications. Image GT PL Invalid

2. Related Work
Stereo Matching Networks. MC-CNN [48] first intro-
duced a convolutional neural network to compute match- Consistent GT Consistent PL Inconsistent GT Inconsistent PL
ing cost and predict disparity maps using SGM [15]. Since Figure 3. Visualization of each region resulting from our division.
then, numerous deep-learning-based methods have been de- We divide GT and PL based on their consistency.
veloped for stereo matching [3, 6, 16, 21, 38]. According to
their strategy to perform cost construction and aggregation, lates KD into target class and non-target class distillation
they can be divided into two categories. One type builds and decouples the two parts to transfer dark knowledge. We
3D cost volume and aggregates the cost with 2D convolu- adopt this concept to decouple different regions of disparity
tions. DispNetC [22] introduces end-to-end regression and maps, analyzing what degrades the domain generalization
builds a correlation-based 3D cost volume. Many works ability of stereo matching networks during fine-tuning.
[19, 24, 35, 42, 46] adopt the correlation-based strategy
and achieve impressive performance. Another type con- 3. Fine-tuning Stereo Matching Networks with
catenates features to construct 4D cost volume and perform
Ground Truth and Pseudo Label
aggregation with 3D convolutions [3, 5, 16, 39, 50, 53].
Methods in this category can leverage more comprehensive 3.1. Preliminary and Definition
information from the original images and achieve leading
We explore whether stereo matching PL contains valuable
performance on various benchmarks. Most recently, stereo
knowledge that can preserve the domain generalization abil-
matching networks [21, 37, 40, 49, 55] based on iterative
ity during fine-tuning and investigate the factors that make
optimization [34] are proposed and show state-of-the-art ac-
GT and PL behave differently. Stereo matching networks
curacy and strong robustness.
based on 2D or 3D cost aggregation are formed to obtain
Robust Stereo Matching. Recently, building stereo
predictions by conducting Soft-Argmin on probability vol-
matching networks that are robust to scenario changes has
ume [16], while the recently developed iterative optimiza-
received increased interest. Existing studies can be catego-
tion methods directly predict continuous values [21, 40, 55]
rized into joint generalization and cross-domain generaliza-
and achieve superior robustness. The difference between
tion types. Joint generalization involves training networks
GT and PL lies at the level of continuous values, making our
jointly on multiple datasets to perform well with a shared
study applicable to any form of stereo matching network.
set of parameters [20, 30, 32]. Cross-domain generaliza-
Given a rectified stereo pair (I l , I r ) with GT map D∗ , a
tion aims to improve generalization performance on unseen
pre-trained stereo matching network θ predicts a dense dis-
scenarios, with a current focus on synthetic-to-real general-
parity map D̂ = θ(I l , I r ) as PL. We adopt the pixel dis-
ization [2, 4, 7, 26, 36, 51, 52]. These methods enhance per-
tance D̂ − D∗ to measure responses to non-target dispari-
formance in unseen domains by avoiding networks overfit-
ties. To identify what makes GT and PL behave differently,
ting synthetic data during pre-training. However, they have
we divide pixels into different regions based on the consis-
not investigated the domain generalization ability after fine-
tency between PL and GT. The visualization of each region
tuning. This paper is unique as it attempts to address the
is shown in Figure 3. We define different regions as follows:
domain generalization degradation during fine-tuning.
Dark Knowledge. Knowledge Distillation (KD) is ex- Definition 1 Consistent region Xc (τ ). Xc (τ ) contains pix-
plored for transferring knowledge from one teacher model els where the differences between GT and PL are less than
to another student model [11, 14, 45]. In the original form the threshold τ . Xc (τ ) = {x||D̂(xi ) − D∗ (xi )| < τ }. This
for classification, it constructs a transfer set by using the soft region represents areas where GT and PL align closely.
target distribution produced by teachers. Dark knowledge
Definition 2 Inconsistent region Xinc (τ ). Xinc (τ ) con-
refers to the extra information beyond one-hot GT contained
tains pixels where the differences between GT and PL are
in the modified training set [27], associated with the distri-
not less than the threshold τ . Xinc (τ ) = {x||D̂(xi ) −
bution of responses to non-target classes. Building on this
D∗ (xi )| ≥ τ }. Stereo matching networks can encounter
concept, a series of studies [9, 47, 54] explore the effects of
unseen challenges in Xinc (τ ).
incorrect predictions in KD. BAN [9] decomposes the KD
into a dark knowledge term and ground truth component Definition 3 Invalid region Xinvalid . It contains pixels
and develops a born-again procedure. DKD [54] reformu- without available annotations due to the sparsity of GT.

3
3.2. Experiment Setup
Dataset. Our experiments are conducted on KITTI [10, 23],
DrivingStereo [44], Booster [25], Middlebury [28], and
ETH3D [29]. KITTI and DrivingStereo datasets collect
outdoor driving scenes, Booster and Middlebury provide
indoor scenes, and ETH3D includes indoor and outdoor
scenes in grayscale. We use the half resolution of Mid-
dlebury and DrivingStereo, and the quarter resolution of
Booster. Except for online submissions, we split datasets
for local evaluation due to the submission policy. More de- (a) Ground Truth
tails are shown in supplementary materials. In our tables,
we use blue for target domains and red for unseen domains.
Metric. We calculate the percentage of pixels with an
absolute error larger than a certain distance for evaluation.
3 pixel is used for KITTI and DrivingStereo datasets. 2
pixel is used for Middlebury and Booster datasets. 1 pixel
is used for ETH3D dataset.
Implementation. We adopt IGEV-Stereo [40] as the ba-
sic network architecture here, and results of other networks
are shown in supplementary materials. We initialize net- (b) Pseudo Label
works with SceneFlow [22] pre-trained weights. For KITTI Figure 4. Evaluation curves of the target and cross domain perfor-
datasets, we fine-tune networks for 50k steps with a batch mance during fine-tuning with GT or PL.
size of 8. For the Booster dataset, we fine-tune networks for
25k steps with a batch size of 2. We set the learning rate we observe an effective alleviation of generalization degra-
to 2e-4 for KITTI and 1e-5 for Booster. We use the data dation, particularly notable for the Booster dataset. This ob-
augmentation strategy in [21] during fine-tuning. servation highlights the significant impact of GT in the in-
consistent region on degrading generalization ability. Next,
3.3. Ground Truth versus Pseudo Label we isolate only the inconsistent region Xinc (3) for fine-
3.3.1 PL Preserves Domain Generalization Ability tuning, where we observe a catastrophic degradation in gen-
eralization ability. Networks trained with only GT(Xinc (3))
We compare the fine-tuning of pre-trained networks using even suffer difficulties generalizing to the target-domain
GT and PL. Contrast observations lead us to believe that validation data. This comparison highlights the crucial reg-
the difference between GT and PL contains valuable in- ularization role played by GT in the consistent region for
formation. Analyzing the validation curves presented in networks learning to address unseen challenges in the in-
Figure 4, we observe a significant degradation of domain consistent region. We conclude that while GT supervision
generalization ability during GT fine-tuning, occurring at in the inconsistent region is valuable for networks adapting
an early stage. In contrast, there is no substantial degrada- to unseen challenges, learning new knowledge in this re-
tion in domain generalization ability when using PL. This gion without sufficient regularization can seriously degrade
finding demonstrates that PL, as naive knowledge produced domain generalization ability.
by pre-trained stereo matching networks, can preserve the Fine-grained details in GT can cause overfitting, de-
original domain generalization ability. grading domain generalization ability. Unlike the incon-
sistent region, where handling is more challenging, net-
3.3.2 Investigating the Difference between GT and PL
works find handling the consistent region less difficult since
To understand the role of each region in GT and PL and most necessary knowledge has been involved in synthetic
to delve into the reasons behind the degradation of do- data. Subsequently, we investigate whether learning in
main generalization ability, we conduct fine-tuning exper- Xc (τ ) can also result in domain generalization degradation.
iments by isolating different regions. The comparative re- In our observations, a clear degradation in domain general-
sults across various settings are presented in Table 1. ization ability occurs when using GT compared to PL in
Learning new knowledge without sufficient regular- the same consistent region Xc (3). Further tightening the
ization can degrade domain generalization ability. We threshold for the consistent region to Xc (1) provides some
start by comparing the baseline, which uses all valid regions relief, however, we find that even inaccurate components
of GT, with a setting that uses only the consistent region within a one-pixel range can lead to varying domain gen-
Xc (3). By removing the inconsistent region Xinc (3) of GT, eralization abilities. We attribute this to stereo matching

4
Supervision 2012 2015 Midd ETH3D Booster Supervision 2012 2015 Midd ETH3D Booster
Training set KITTI 2012 & 2015 Training set KITTI 2012 & 2015
zero-shot 7.00 5.37 7.06 3.61 17.62 GT 1.94 1.36 12.23 23.88 18.43
GT(valid) 1.94 1.36 12.23 23.88 18.43 GT+PL(all) 3.24 2.75 7.40 3.64 16.54
GT(Xc (3)) 2.17 1.69 10.97 19.44 17.39 GT+PL(Xc (3)) 2.14 1.44 8.82 8.14 17.09
GT(Xinc (3)) 16.78 22.01 28.06 69.88 36.50 GT+PL(Xc (1)) 1.97 1.38 8.47 11.29 17.75
GT(Xc (1)) 2.58 2.07 9.89 11.76 17.95 Training set Booster
PL(all) 5.18 4.37 7.01 3.34 16.15 GT 52.30 55.44 19.78 93.31 12.88
PL(valid) 5.82 4.90 8.30 3.65 16.46 GT+PL(all) 7.14 10.69 8.02 10.30 14.29
PL(Xc (3)) 3.72 3.11 8.46 3.53 16.17 GT+PL(Xc (3)) 41.95 45.27 11.74 65.63 12.18
PL(Xinc (3)) 9.27 8.64 11.80 11.67 28.25 GT+PL(Xc (1)) 45.77 48.16 11.41 72.37 12.21
PL(Xc (1)) 3.29 2.92 8.10 3.69 16.26
Training set Booster Table 2. Results of jointly training networks with GT and PL.
zero-shot 5.13 6.04 7.06 3.61 19.49
GT(valid) 52.30 55.44 19.78 93.31 12.88
GT(Xc (3)) 4.77 7.51 7.26 18.39 14.07
GT(Xinc (3)) 93.35 96.38 50.09 99.24 30.25 strating utilizing PL in the valid region can be helpful. For
GT(Xc (1)) 4.58 7.66 6.93 17.15 13.78 the Booster dataset, we note that GT annotations are very
PL(all) 4.64 5.65 6.14 3.56 16.74 dense and have fewer invalid regions, thus the effect is not
PL(valid) 4.61 5.79 6.61 3.47 16.87
that obvious.
PL(Xc (3)) 3.23 4.36 6.15 3.28 14.79
PL(Xinc (3)) 5.70 6.53 7.35 4.55 21.75
PL(Xc (1)) 3.63 4.89 6.00 3.01 14.26 3.3.3 PL as Regularization with GT
Table 1. Results of fine-tuning stereo matching networks using This section explores the role of PL as a regularization term
different regions of GT or PL. when combined with GT for fine-tuning. We use PL of dif-
ferent regions with GT and make comparisons between dif-
ferent settings. The results are presented in Table 2.
networks overfitting fine-grained details in GT, which neg- Using all regions of PL as a naive strategy. We first
atively impacts domain generalization ability. fine-tune networks jointly with all available regions of GT
PL in the consistent region contains valuable knowl- and PL. While this approach mitigates the degradation in
edge to preserve the domain generalization ability. We domain generalization, it exhibits inferior performance on
compare using all PL with only using the consistent region target domains compared to using only GT, and its general-
Xc (3) of PL. Fine-tuning with PL(Xc (3)) preserves the do- ization is poorer than using only PL in Table 1. It indicates
main generalization ability. It indicates that details in PL do that this straightforward combination fails to leverage the
not contribute to domain generalization degradation, which advantages of both GT and PL.
we consider a positive distinction from the degradation of Utlizing more accurate PL is crucial for satisfactory
using GT supervision. Additionally, we observe enhanced target-domain performance. The diminishing difference
performance in the target domain, attributed to the increased between GT and PL correlates with an enhancement in
accuracy of PL(Xc (3)) compared to PL(all). target-domain performance. In particular, when employing
PL in the inconsistent region negatively impacts GT in conjunction with PL(Xc (1)), the network achieves
target-domain and unseen-domain performances. In the a target-domain performance comparable to one trained
inconsistent region, PL is considered incorrect relative to solely with GT. It’s worth noting that using the consistent
GT. We explore whether incorrect predictions follow any region of PL(Xc (τ )) for regularization may not be optimal,
useful patterns. Compared to using the correct PL(Xc (3)), as it can compromise domain generalization ability com-
training networks with PL(Xinc (3)) degrades both target pared to using all regions of PL.
and unseen domain performances. Consequently, we con- More challenging scenarios degrade domain gener-
clude that PL in the inconsistent region has negative im- alization ability more seriously. The evaluation of the
pacts, and keeping the correct PL is still vital. Booster dataset, characterized by numerous ToM surfaces,
PL in the invalid region benefits domain generaliza- reveals a significant impact on degrading generalization
tion ability. The invalid region Xinvalid contains the po- ability. Through a comparison between fine-tuning net-
tential consistent region and inconsistent region for PL that works on the Booster and KITTI datasets, we see the degra-
we cannot directly distinguish due to the unavailable of dation on Bosster is more serious. Across various experi-
GT. Based on the conclusion that the inconsistent region mental settings, excluding the use of all PL regions, fine-
contributes negatively, we question whether invalid regions tuning on Booster consistently results in an obvious decline
are still necessary and conduct comparisons between using in network robustness, emphasizing the importance of ad-
PL(all) and PL(valid). For KITTI datasets, we observe ad- dressing the degradation posed by fine-tuning networks in
ditional benefits when introducing invalid regions, demon- such challenging scenarios for real-world applications.

5
Stereo Images
Student

Student Prediction

Frozen
F&E-PL
Teacher
Pseudo Label Improved PL and Valid Mask
EMA
Teacher F&E-GT
GT and Valid Mask
EMA Teacher Prediction Improved GT and Valid Mask

Figure 5. Overview of the DKT framework. It uses the prediction from the EMA Teacher to improve GT and PL during fine-tuning.

3.3.4 Summary Label Remove with Label


Ensemble Probability Ensemble Remove
Our investigation reveals that both GT and PL play dual
roles during fine-tuning. GT proves crucial for improving truncate
target-domain performance, particularly in handling unseen EMA PL GT GT EMA PL PL PL
challenges. However, insufficient regularization and over- 𝝉 𝝉
(a) F&E-GT (b) F&E-PL
fitting GT details can degrade domain generalization ability.
PL effectively preserves the generalization ability, however, Figure 6. The two variants of Filter and Ensemble (F&E) applied
learning incorrect predictions impact negatively. Moreover, to GT and PL. We use EMA PL to adjust different regions, with a
a naive combination of GT and PL proves to be a conserva- threshold τ to control the operations on GT and PL.
tive strategy that fails to leverage the full potential of both.
F&E-GT. For GT disparity map D∗ , we remove the in-
4. DKT Framework consistent region Xinc (τ ) to avoid them dominating fine-
tuning. However, since we hope networks learn to cope
4.1. Architecture with new challenges, annotations of these challenging re-
The core idea of DKT is to dynamically adjust the learn- gions are especially valuable. Thus F&E-GT removes the
ing of different regions of GT and PL based on what net- challenging inconsistent region with a probability based on
works have learned during fine-tuning. Figure 5 shows an the proportion in the whole valid region:
overview of our framework. (
∗ D∗ , rand(0, 1) > |X inc (τ )|
|Xvalid |
DKT employs three same-initialized networks: Dinc = (2)
• Student θS is trained with GT and PL. After fine-tuning, invalid, otherwise
we keep the Student for inference. It uses GT with a smaller probability for the more chal-
• Teacher θT predicts PL and is frozen during fine-tuning. lenging region, thus avoiding new challenges disproportion-
It contains the original knowledge from pre-training. ately affecting learning. For the consistent region Xc (τ ),
• EMA Teacher θT ′ is an exponential moving average of we consider it an acceptable deviation and take continuous
Student. Here, it is used to measure what the Student has ensembling between GT D∗ and the EMA Teacher’s predic-

learned during fine-tuning. tion D̂T . We sum them with uniform weights and truncate
The EMA Teacher is updated as a momentum-based at 1 pixel apart from GT D∗ :
moving average [1, 13, 33] of the Student:
α = rand(0, 1) (3)
θT ′ = mθT ′ + (1 − m)θS , (1)
∗ ′
where m ∈ [0, 1] is the momentum decay value for update Dc = α ∗ D∗ + (1 − α) ∗ D̂T . (4)
∗ ∗
speed control. Dc = min(max(Dc , D∗ ∗
− 1), D + 1). (5)
We propose the Filter and Ensemble (F&E) module,
It adds fine-grained permutations to GT which prevents
which utilizes the EMA Teacher to remove harmful regions
the Student from overfitting GT details. The final improved
and keep beneficial regions during fine-tuning. The division ∗
GT D is obtained by combining the improved inconsistent
of different regions is conditioned on τ , which is set to 3
and consistent regions:
in our framework. Based on the characteristics of GT and  ∗
PL, it has two variants termed F&E-GT and F&E-PL. We ∗ Dinc (x), x ∈ Xinc (τ )
D (x) = ∗ (6)
present the two variants in Figure 6. Dc (x), x ∈ Xc (τ )

6
Left Image PCWNet RAFT-Stereo DLNR IGEV-Stereo DKT-IGEV
Figure 7. Qualitative results on the KITTI benchmark.
KITTI 2012 KITTI 2015 DrivingStereo Middlebury ETH3D Booster
Method
noc all bg fg all sunny cloudy foggy rainy avg >2px(%) >1px(%) >2px(%)
PSMNet [3] 1.49 1.89 1.86 4.62 2.32 6.36 4.98 10.63 21.39 10.84 21.85 88.87 32.88
GWCNet [12] 1.32 1.70 1.74 3.93 2.11 3.52 2.84 4.27 9.04 4.92 17.83 31.87 29.08
GANet [50] 1.19 1.60 1.48 3.46 1.81 3.77 3.44 4.26 10.46 5.47 18.79 14.40 30.65
PCWNet [31] 1.04 1.37 1.37 3.16 1.67 3.00 2.41 1.72 3.41 2.64 15.55 18.27 21.34
CGF-ACV [41] 1.03 1.34 1.32 3.08 1.61 3.22 3.19 6.69 17.50 7.65 23.76 36.92 22.42
RAFT-Stereo [21] 1.30 1.66 1.58 3.05 1.82 2.18 1.91 2.74 8.35 3.80 11.78 40.43 23.87
DLNR [55] - - 1.60 2.59 1.76 1.94 1.87 2.25 5.31 2.84 8.21 26.18 19.58
IGEV-Stereo [40] 1.12 1.44 1.38 2.67 1.59 2.28 2.21 2.51 3.80 2.70 11.87 24.28 18.33
DKT-RAFT(ours) 1.43 1.85 1.65 2.98 1.88 1.85 1.46 1.32 5.44 2.52 7.51 2.28 15.35
DKT-IGEV(ours) 1.22 1.56 1.46 3.05 1.72 1.93 1.71 1.96 3.26 2.22 7.53 4.23 15.30

Table 3. Results on the KITTI benchmark. We evaluate the domain generalization ability with checkpoints provided by authors.

F&E-PL. The frozen Teacher predicts PL D̂T , serv-


ing as important regularization during fine-tuning. F&E-PL
Sunny

aims to progressively enhance the accuracy of PL. We use



the EMA Teacher’s prediction D̂T to remove the inconsis-
tent region of D̂T :

M̂ = |D̂T − D̂T | < τ.
Cloudy

(7)
′ T
And we combine D̂T and D̂T to get improved PL D .
β = rand(0, 1) (8)
T T T′
D = β ∗ D̂ + (1 − β) ∗ D̂ .
Foggy

(9)
Training. Our final training loss is a weighted sum of
∗ T
disparity loss with the improved GT D and PL D within
valid masks M ∗ for GT and M̂ for PL:
Rainy

∗ T
L = Ldisp (D̂, D , M ∗ ) + λLdisp (D̂, D , M̂ ), (10)
where D̂ is the Student prediction and λ is the balancing
weight. After 5k steps of fine-tuning, we re-initialize the Image / GT Pre-trained GT-ft DKT-ft
EMA Teacher with the Student weights. The re-initialized Figure 8. Qualitative results on the DrivingStereo dataset. We
EMA Teacher predicts more accurate disparities, resulting fine-tune the networks with KITTI datasets.
in more accurate augmented GT and invalid-region PL.
4.2. Overall Performance
the base networks we use. We show the qualitative results
4.2.1 KITTI Benchmark
in Figure 7, compared to other methods, DKT demonstrates
We compare the performance of DKT and published SOTA reasonable predictions for weak texture regions. We eval-
methods [31, 40, 41, 55] on the KITTI benchmark by con- uate how these models generalize to DrivingStereo, which
sidering target-domain and cross-domain performance. We contains unseen driving scenarios with challenging weath-
utilize checkpoints provided by the authors to assess the ers. Among these published models, DKT achieves the
performance of domain generalization. For methods to sub- strongest robustness by considering four challenging weath-
mit separate checkpoints on 2012 and 2015 benchmarks, we ers. We visualize results in Figure 8. We evaluate if these
calculate the average of the two checkpoints. Table 3 shows models can generalize well to Middlebury, ETH3D, and
the results. For target KITTI scenarios, we find our meth- Booster datasets. We see that DKT is the only method that
ods can achieve good but slightly worse performance than can generalize well to these scenarios after fine-tuning.

7
LGT LP L F&E-GT F&E-PL 2012 2015 Midd ETH3D Booster
Training set KITTI 2012 & 2015
✓ 1.94 1.36 12.23 23.88 18.43
✓ ✓ 1.93 1.38 9.62 12.31 17.46
✓ ✓ 3.24 2.75 7.40 3.64 16.54 Left Image GT
✓ ✓ ✓ 3.26 2.79 7.03 3.24 15.19
✓ ✓ ✓ 1.97 1.43 8.08 4.23 15.11
✓ ✓ ✓ ✓ 1.98 1.39 7.11 3.64 15.51
Training set Booster
✓ 52.30 55.44 19.78 93.31 12.88 0k Iterations 5k Iterations
✓ ✓ 9.42 11.04 7.56 6.11 12.76
✓ ✓ 7.14 10.69 8.02 10.30 14.29
✓ ✓ ✓ 3.69 4.88 7.01 4.32 14.21
✓ ✓ ✓ 23.17 24.58 10.56 47.91 12.55
✓ ✓ ✓ ✓ 3.49 4.71 6.61 2.60 12.63 10k Iterations 25k Iterations

Table 4. Ablation study of each component of DKT framework on the KITTI 2012 and 2015, Figure 9. Predictions of the EMA
Middlebury, ETH3D, and Booster training sets. Teacher during fine-tuning.

Method Booster 2012 2015 Midd ETH3D prediction of the EMA Teacher. It removes the inconsis-
CFNet [30] * 38.32 4.97 6.31 15.77 5.47 tent region of GT with certain probabilities and adds fine-
RAFT-Stereo [21] * 17.44 4.34 5.68 8.41 2.29
CFNet(ft) [30] † 29.65 51.52 68.24 18.42 79.98
grained permutations to GT. It also works when we only
RAFT-Stereo(ft) [21] † 10.73 17.02 23.33 15.98 89.05 use GT supervision LGT and do not introduce PL supervi-
PCVNet(ft) [49] ‡ 9.03 35.54 38.60 22.88 75.21 sion LP L : only using F&E-GT to improve GT alleviates
DKT-RAFT(ours) 10.32 3.69 4.95 5.94 2.57 the domain generalization ability degradation and little im-
DKT-IGEV(ours) 14.11 3.76 4.78 6.94 3.13
pacts the target-domain performances. For using both PL
Table 5. Results on Booster benchmark. For target-domain perfor- supervision LP L and F&E-GT, we show this combination
mance, we report their online results. For generalization evalua- can effectively preserve the domain generalization ability.
tion: * uses authors’ weights, †reproduces models in the same set- However, using LP L negatively impacts the target-domain
ting as online submissions, ‡the original implementation for sub- performances because LP L can introduce incorrect super-
mission uses CREStereo dataset [18] to augment Booster and we vision, which we solve with F&E-PL.
retrain it with only Booster training set to avoid unfair comparison. F&E-PL. It plays an important role in improving target-
domain performances by removing the inconsistent region
4.2.2 Booster Benchmark of PL. It progressively identifies incorrect disparities of PL
according to EMA Teacher’s prediction, instead of directly
We follow [8, 25, 49] to fine-tune networks on the Booster comparing PL with GT which can degrade the domain gen-
dataset, which contains very challenging real-world scenar- eralization ability in Section 3.3.3. We show that only F&E-
ios. The results are presented in Table 5. We see fine-tuning PL can alleviate the domain generalization degradation in
networks with GT improves the predictions in the target do- KITTI, however, it cannot work well alone in the challeng-
main compared to the synthetic data pre-trained networks. ing Booster scenarios where using F&E-GT is necessary.
However, it also introduces a notable degradation in gen-
eralization ability when fine-tuning networks in such chal- 5. Conclusion
lenging scenarios. Using our DKT framework to fine-tune
networks achieves competitive target-domain performances We aim to fine-tune stereo networks without compromis-
compared to the baseline fine-tuning. Moreover, DKT can ing robustness to unseen domains. We identify that learning
help networks generalize knowledge learned from the chal- new knowledge without sufficient regularization and over-
lenging Booster data to unseen real-world scenes, improv- fitting GT details can degrade the robustness. We propose
ing the performance on KITTI and Middlebury than the syn- the DKT framework, which improves fine-tuning by dy-
thetic data pre-trained models. namically measuring what has been learned. Experiments
show that DKT can preserve robustness during fine-tuning,
4.3. Ablation Study improve robustness to challenging weather, and generalize
We evaluate the effectiveness of DKT in utilizing the EMA knowledge learned from target domains to unseen domains.
Teacher for improving PL and GT in Table 4. Additionally, Acknowledgements. This work was supported by the Na-
we provide visual insights into the EMA Teacher progres- tional Science and Technology Major Project under Grant
sively predicts more accurate disparities in Figure 9. 2022ZD0116310, the National Natural Science Foundation
F&E-GT. It improves GT by combining GT and the of China 62276016, 62372029.

8
Robust Synthetic-to-Real Transfer for Stereo Matching
Supplementary Material
Overview outdoor scenes. We use the training set for domain gen-
eralization evaluation.
We organize the material as follows. Section A shows more • DrivingStereo [44] is a large-scale real-world driving
details of the experiment settings. In Section B, we pro- dataset. A subset of it contains 2,000 stereo pairs col-
vide comparison results between using GT and PL for fine- lected under different weather (sunny, cloudy, foggy, and
tuning with the additional stereo matching network archi- rainy). We use the half resolution of these challenging
tecture. Section C and Section D conducts additional ex- scenes to evaluate the robustness after fine-tuning on the
periments about DKT. Section E presents more qualitative KITTI datasets.
results of the domain generalization performance.
Local Dataset Split. Except for online submissions, we
conduct the experiments based on local train and validation
A. Details of Experimental Setting splits. For the KITTI 2012 and 2015 datasets, we follow
Dataset. We conduct our experiments by initializing GWCNet [12] to split 14 stereo pairs of 2012 and 20 pairs
stereo matching networks with synthetic dataset pre-trained of 2015 for validation, the remaining 360 pairs are used for
weights and fine-tuning them in real-world scenarios. We training. For the booster dataset, we use the ‘Washer’ and
mainly focus on the robustness and domain generaliza- ‘OilCan’ scenes (15 stereo pairs) for validation, and the re-
tion ability after fine-tuning networks, and we also show maining 213 pairs for training. In this material, we also con-
their target-domain performances to ensure networks actu- duct fine-tuning experiments on Middlebury and ETH3D
ally learn from target domains. When not specifically men- datasets, following the data split in [20]. We use the ‘ArtL’
tioned, all networks in our experiments are pre-trained on and ‘Playroom’ scenes (2 stereo pairs) for Middlebury val-
the SceneFlow [22] dataset. The introduction of the syn- idation and the ‘facade’ and ‘forest’ scenes (3 stereo pairs)
thetic and real-world datasets are as follows: for ETH3D validation.
• SceneFlow [22] is a large synthetic dataset that consists
of 35,454 pairs of stereo images for training and 4,370 B. Network Architecture for GT vs. PL
pairs for evaluation. Both sets have dense ground-truth
disparities. The resolution of the images is 960 × 540. In the main paper, we investigate the distinct behaviors of
Besides the original clean pass, the dataset also contains Ground Truth (GT) and Pseudo Label (PL) during fine-
a final pass. The final pass has motion blur and defocus tuning. We achieve this by dividing pixels into different re-
blur, making it more similar to real-world images. Scene- gions (Xc (τ ), Xinc (τ ), Xinvalid ) and conducting compre-
Flow is currently the most commonly used dataset for pre- hensive comparisons between them. In addition to the iter-
training stereo matching networks. ative optimization based IGEV-Stereo [40], we employ the
• KITTI 2012 [10] collects outdoor driving scenes with 3D convolution-based CFNet [30] to affirm that our find-
sparse ground-truth disparities. It contains 194 training ings are applicable across diverse stereo matching network
samples and 195 testing samples with a resolution of architectures. As presented in Table I, learning new knowl-
1226 × 370. edge without sufficient regularization and overfitting GT de-
• KITTI 2015 [23] collects driving scenes with sparse dis- tails are two primary contributors to the degradation of do-
parity maps. It contains 200 training samples and 200 main generalization ability during the fine-tuning.
testing samples with a resolution of 1242 × 375.
• Booster [25] contains 228 samples for training and 191 C. Additional Ablations about DKT
samples for online testing in 64 different scenes with
C.1. Fine-grained Permutations
dense ground-truth disparities. Most of the collected
scenes have challenging non-Lambertian surfaces. We In F&E-GT, we leverage the exponential moving average
use the quarter resolution in our experiments. (EMA) Teacher’s prediction to serve as fine-grained per-
• Middlebury [28] consists of 15 training and 15 testing mutations for GT. In this section, we present ablations with
stereo pairs captured indoors. The dataset offers images alternative permutations. Specifically, we apply F&E-GT
at full, half, and quarter resolutions. We use the half- using the EMA Teacher for filtering out inconsistent re-
resolution training set for domain generalization evalua- gions but with variations in permutations. We use random
tion. noise from (-1, 1), PL from the frozen Teacehr, and the
• ETH3D [29] consists of 27 grayscale image pairs for EMA Teacher’s prediction for ablation and visualize the
training and 20 for testing. It includes both indoor and three kinds of fine-grained permutations in Figure I. The

9
Supervision 2012 2015 Midd ETH3D Booster C.2. Effects of the Frozen Teacher
Training set KITTI 2012 & 2015
zero-shot 5.71 4.84 15.77 5.48 38.84 An alternative way to using the frozen Teacher’s prediction
GT(valid) 2.15 1.39 19.83 29.94 30.95 with F&E-PL is directly using the EMA Teacher’s predic-
GT(Xc (3)) 2.26 1.67 17.92 24.52 30.88
tion, which progressively predicts more accurate disparities.
GT(Xinc (3)) 21.33 18.49 31.78 58.35 43.06
GT(Xc (1)) 2.67 1.95 16.27 14.67 31.16 An overview of DKT without the frozen Teacher is shown
PL(all) 5.05 4.26 13.71 4.86 29.92 in Figure II. We show the comparison in Table III. Using
PL(valid) 5.58 4.64 14.78 6.05 30.79 the frozen Teacher’s prediction gets improvements in tar-
PL(Xc (3)) 3.32 3.01 14.09 5.50 30.97
get domains than using the frozen Teacher’s prediction with
PL(Xinc (3)) 8.91 8.08 18.30 12.45 40.17
PL(Xc (1)) 2.94 2.57 15.38 5.80 31.34 F&E-PL, however, it leads to a slight drop in domain gener-
Training set Booster alization ability than using the Frozen Teacher’s prediction.
zero-shot 4.97 6.31 15.77 5.48 35.03
GT(valid) 56.20 71.41 18.45 80.53 25.86
Stereo Images
GT(Xc (3)) 4.39 6.04 13.53 20.90 26.85

GT(Xinc (3)) 97.27 97.86 77.44 99.89 45.69 Student
Student Prediction
GT(Xc (1)) 4.31 6.19 13.77 21.80 26.13
PL(all) 4.24 5.19 11.38 5.42 28.34 EMA
Teacher
EMA Teacher Prediction
PL(valid) 4.33 5.11 11.48 5.51 28.05 EMA Teacher
Prediction F&E-GT
PL(Xc (3)) 3.98 4.65 11.25 5.39 27.40 GT and Valid Mask
Improved GT and Valid Mask
PL(Xinc (3)) 5.56 7.68 16.81 6.08 36.44
PL(Xc (1)) 3.79 4.98 11.03 5.59 27.54 Figure II. DKT framework without the frozen Teacher.
Table I. Results of using different regions of GT or PL to fine-tune
CFNet [30]. Different regions play varied roles during fine-tuning.
Method 2012 2015 Midd ETH3D Booster
Training set KITTI 2012 & 2015
DKT(w/o F.T.) 1.97 1.34 8.02 4.43 15.65
DKT(full) 1.98 1.39 7.11 3.64 15.51
Training set Booster
DKT(w/o F.T.) 3.72 4.86 7.00 3.89 12.23
Image GT PL EMA PL DKT(full) 3.49 4.71 6.61 2.60 12.63

Table III. Effects of the using frozen Teacher to produce PL. F.T.:
the frozen Teacher. Using the frozen Teacher to produce PL pre-
serves the domain generalization ability better.
Consistent Mask Random PL Permutation EMA Permutation
Figure I. Visualization with the absolute value of three kinds of
fine-grained permutations. D. Additional Experiments about DKT
D.1. DKT vs. DG Methods
results are presented in Table II. Our findings demonstrate
that employing the frozen Teacher or EMA Teacher to add Training robust stereo matching networks on synthetic
permutations better preserves domain generalization ability datasets has been well-researched recently [7, 51, 52]. We
than random noise. Moreover, utilizing the EMA Teacher compare our methods with domain generalization methods
yields better target-domain performance compared to the and verify if these methods work well for real-world fine-
frozen Teacher. We attribute this improvement to the EMA tuning. We use ITSA [7] and Asymmetric Augmentation
Teacher progressively predicting more accurate disparities. [43] for comparison. As shown in Table IV, domain gen-
eralization methods designed for synthetic data pre-training
Method 2012 2015 Midd ETH3D Booster fail in this case. We think differences between synthetic and
Training set KITTI 2012 & 2015 real data may render previous methods unsuitable: exist-
random noise 1.94 1.39 11.08 19.79 17.93 ing methods reduce shortcuts learning caused by synthetic
F.T. 1.94 1.40 9.60 12.34 17.57 artifacts [7], while real-world data is actually real and the
EMA.T. 1.93 1.38 9.62 12.31 17.46 factors degrade generalization ability can be different.
Training set Booster
random noise 11.55 12.28 9.54 7.73 12.85 D.2. Stereo Matching Network Architectures
F.T. 9.48 10.96 7.81 5.93 12.89
EMA.T. 9.42 11.04 7.56 6.11 12.76 In the main paper, we conduct our experiments with ro-
Table II. Ablation of fine-grained permutations. F.T.: the frozen bust iterative optimization based stereo matching networks.
Teacher. PL serves as better permutations than random noise. Here we conduct experiments using other network archi-
tectures. We fine-tune CFNet [30] and CGI-Stereo [41],

10
Method 2012 2015 Midd ETH3D Booster D.3. Fine-tuning on More Datasets
Training set KITTI 2012 & 2015
baseline 1.94 1.36 12.23 23.88 18.43 We perform fine-tuning on the Middlebury and ETH3D
ITSA [7] 2.01 1.43 12.59 25.72 17.98 datasets, following the data split in MCV-MFC [20]. The
Asy.Aug 1.98 1.34 12.67 23.37 18.02 experimental results are presented in Table VI. Notably, we
DKT 1.98 1.39 7.11 3.64 15.51
observe that fine-tuning on these two datasets can lead to
Training set Booster
baseline 52.30 55.44 19.78 93.31 12.88 a degradation in generalization ability to some unseen do-
ITSA [7] 51.95 56.97 18.77 98.78 13.01 mains. However, it’s noteworthy that fine-tuning on Mid-
Asy.Aug 55.79 58.36 18.51 95.43 12.79 dlebury and ETH3D can enhance performance on specific
DKT 3.49 4.71 6.61 2.60 12.63 unseen datasets, and overall, the degradation in domain gen-
Table IV. Comparison with domain generalization methods. The eralization is less pronounced compared to fine-tuning on
previous methods designed for building domain generalized stereo KITTI and Booster. We think this difference is attributed
networks during synthetic data pre-training fail to preserve the do- to the fact that Middlebury and ETH3D datasets contain lit-
main generalization ability during fine-tuning. tle transparent or mirrored (ToM) surfaces, which have a
substantial impact on degrading domain generalization abil-
ity. The modest performance gaps between pre-trained net-
which have great domain generalization ability after pre- works and those subjected to fine-tuning suggest that the
training on synthetic data. We show the results in Table V. acquisition of new knowledge during the fine-tuning pro-
Compared to the baseline fine-tuning strategy with GT, net- cess may be relatively limited. Moreover, for both datasets,
works fine-tuned with DKT show better generalization abil- employing DKT during fine-tuning demonstrates better do-
ity. Furthermore, We explore fine-tuning the recent Croco- main generalization ability than using only GT.
Stereo [38] that builds transformers and train networks with
the self-supervised task on a large scale of data. After Method 2012 2015 Midd ETH3D Booster
self-supervised pre-training, Croco-Stereo trains networks Training set Middlebury 2014
to conduct stereo matching jointly on various datasets in- IGEV-Stereo [40] * 5.13 6.04 5.03 3.61 17.62
IGEV-Stereo(ft) 4.02 5.01 3.81 4.97 15.26
cluding SceneFlow, Middlebury, ETH3D, and Booster. We DKT-IGEV 3.47 4.62 3.83 2.97 14.23
fine-tune Croco-Stereo in the KITTI datasets and evaluate Training set ETH3D
the target and cross domain performance. We note that the IGEV-Stereo [40] * 5.13 6.04 7.06 3.09 17.62
cross-domain evaluation in this setting is not to represent IGEV-Stereo(ft) 5.19 5.62 12.31 2.26 22.57
the domain generalization ability of the model, but can rep- DKT-IGEV 4.81 5.59 7.32 2.23 17.33
resent how the model forgets previously seen scenarios. We Table VI. Results of fine-tuning networks on more datasets. * uses
do not fine-tune Croco-Stereo in the Booster datasets be- pre-trained weights provided by the authors. Networks fine-tuned
cause it has seen the validation set during pre-training. by the DKT framework show better robustness to unseen domains.

Method 2012 2015 Midd ETH3D Booster


Training set KITTI 2012 & 2015
D.4. Joint Generalization
CFNet [30] * 5.71 4.84 15.77 5.48 38.84 In addition to fine-tuning stereo matching networks on indi-
CFNet(ft) 2.15 1.39 19.83 29.94 30.95
DKT-CFNet 2.23 1.47 12.98 6.16 30.27 vidual real-world scenarios, we employ DKT for joint fine-
CGI-Stereo [41] * 6.55 5.49 13.91 6.30 33.38 tuning across multiple domains. Besides assessing perfor-
CGI-Stereo(ft) 2.41 1.58 18.62 29.84 30.51 mance in target domains, we also evaluate the domain gen-
DKT-CGI 2.26 1.63 14.31 7.12 29.09 eralization ability on previously unseen DricingStereo sce-
Croco-Stereo [38] * 12.21 18.16 2.62 0.13 8.30
Croco-Stereo(ft) 1.81 1.22 7.83 2.19 23.12 narios. The results, presented in Table VII, demonstrate that
DKT-Croco 1.78 1.26 3.08 1.27 9.81 using DKT for joint fine-tuning yields comparable results
Training set Booster across multiple seen domains, while exhibiting superior ro-
CFNet [30] * 4.97 6.31 15.77 5.48 35.03 bustness on unseen scenarios.
CFNet(ft) 56.20 71.41 18.45 80.53 25.86
DKT-CFNet 3.57 4.26 11.17 5.38 26.11 E. Additional Qualitative Results
CGI-Stereo [41] * 5.90 6.02 13.91 6.30 30.23
CGI-Stereo(ft) 30.93 46.84 20.34 46.79 23.87 In this section, we provide additional qualitative results of
DKT-CGI 5.38 5.11 13.83 6.37 23.60
the domain generalization performance of stereo match-
Table V. Results of fine-tuning with more network architectures. ing networks fine-tuned with only GT and DKT. Com-
* uses pre-trained weights provided by the authors. Our proposed pared to using only GT for fine-tuning, DKT effectively
DKT framework can be applied to various network architectures preserves the networks’ robustness to unseen domains after
and preserves their domain generalization ability. fine-tuning. Figures III to V use the same networks fine-
tuned on the KITTI datasets and show the performance on

11
2012 2015 Middlebury ETH3D Booster DrivingStereo
Method
>3px(%) >3px(%) >2px(%) >1px(%) >2px(%) sunny cloudy foggy rainy avg
CFNet [30] 2.47 1.78 6.96 1.99 30.43 2.75 2.49 2.03 6.39 3.42
DKT-CFNet 2.51 1.80 5.92 1.81 18.55 2.20 2.34 1.89 3.55 2.50
IGEV-Stereo [40] 2.00 1.56 3.80 1.98 12.83 2.29 1.89 1.49 8.19 3.47
DKT-IGEV 2.02 1.54 3.79 2.01 11.19 2.23 1.81 1.42 3.31 2.19

Table VII. Results of joint generalization. Networks are fine-tuned on a combination of KITTI 2012, KITTI 2015, Middlebury, ETH3D,
and Booster datasets. Networks fine-tuned by the DKT framework show competitive joint generalization performance, as well as better
robustness to unseen challenging weather.

unseen Middlebury, Booster, and ETH3D domains. Fig-


ures VI to IX use the same networks fine-tuned on the
Booster dataset and show the performance on unseen KITTI
2012, KITTI 2015, Middlebury, and ETH3D domains.

12
Left Image RAFT-Stereo DKT-RAFT(ours) IGEV-Stereo DKT-IGEV(ours)

Figure III. Qualitative results of KITTI fine-tuned networks on the Middlebury training set. The left panel shows the left input image and
the ground truth disparity. For each example, the first row shows the error map and the second row shows the colorized disparity prediction.

Left Image RAFT-Stereo DKT-RAFT(ours) IGEV-Stereo DKT-IGEV(ours)

Figure IV. Qualitative results of KITTI fine-tuned networks on the Booster training set. The left panel shows the left input image and the
ground truth disparity. For each example, the first row shows the error map and the second row shows the colorized disparity prediction.
13
Left Image RAFT-Stereo DKT-RAFT(ours) IGEV-Stereo DKT-IGEV(ours)

Figure V. Qualitative results of KITTI fine-tuned networks on the ETH3D training set. The left panel shows the left input image and the
ground truth disparity. For each example, the first row shows the error map and the second row shows the colorized disparity prediction.

14
Left Image RAFT-Stereo DKT-RAFT(ours) IGEV-Stereo DKT-IGEV(ours)

Figure VI. Qualitative results of Booster fine-tuned networks on the KITTI 2012 training set. The left panel shows the left input image and
the ground truth disparity. For each example, the first row shows the error map and the second row shows the colorized disparity prediction.

Left Image RAFT-Stereo DKT-RAFT(ours) IGEV-Stereo DKT-IGEV(ours)

Figure VII. Qualitative results of Booster fine-tuned networks on the KITTI 2015 training set. The left panel shows the left input image and
the ground truth disparity. For each example, the first row shows the error map and the second row shows the colorized disparity prediction.

15
Left Image RAFT-Stereo DKT-RAFT(ours) IGEV-Stereo DKT-IGEV(ours)

Figure VIII. Qualitative results of Booster fine-tuned networks on the Middlebury training set. The left panel shows the left input image
and the ground truth disparity. For each example, the first row shows the error map and the second row shows the colorized disparity
prediction.

Left Image RAFT-Stereo DKT-RAFT(ours) IGEV-Stereo DKT-IGEV(ours)

Figure IX. Qualitative results of Booster fine-tuned networks on the ETH3D training set. The left panel shows the left input image and the
ground truth disparity. For each example, the first row shows the error map and the second row shows the colorized disparity prediction.

16
References resentation learning. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
[1] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and An- 9729–9738, 2020. 6
drew Gordon Wilson. There are many consistent explana-
[14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill-
tions of unlabeled data: Why you should average. arXiv
ing the knowledge in a neural network. arXiv preprint
preprint arXiv:1806.05594, 2018. 6
arXiv:1503.02531, 2015. 1, 3
[2] Changjiang Cai, Matteo Poggi, Stefano Mattoccia, and
[15] Heiko Hirschmuller. Accurate and efficient stereo process-
Philippos Mordohai. Matching-space stereo networks for
ing by semi-global matching and mutual information. In
cross-domain generalization. In 2020 International Confer-
2005 IEEE Computer Society Conference on Computer Vi-
ence on 3D Vision (3DV), pages 364–373. IEEE, 2020. 3
sion and Pattern Recognition (CVPR’05), pages 807–814.
[3] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo
IEEE, 2005. 3
matching network. In Proceedings of the IEEE Conference
[16] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter
on Computer Vision and Pattern Recognition, pages 5410–
Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry.
5418, 2018. 3, 7
End-to-end learning of geometry and context for deep stereo
[4] Tianyu Chang, Xun Yang, Tianzhu Zhang, and Meng Wang.
regression. In Proceedings of the IEEE International Con-
Domain generalized stereo matching via hierarchical visual
ference on Computer Vision, pages 66–75, 2017. 3
transformation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 9559– [17] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot.
9568, 2023. 3 Domain generalization with adversarial feature learning. In
Proceedings of the IEEE Conference on Computer Vision
[5] Xinjing Cheng, Peng Wang, and Ruigang Yang. Learning
and Pattern Recognition, pages 5400–5409, 2018. 1
depth with convolutional spatial propagation network. IEEE
transactions on pattern analysis and machine intelligence, [18] Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei
42(10):2361–2379, 2019. 3 Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng
[6] Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Liu. Practical stereo matching via cascaded recurrent net-
Dai, Xiaojun Chang, Hongdong Li, Tom Drummond, and work with adaptive correlation. In Proceedings of the
Zongyuan Ge. Hierarchical neural architecture search for IEEE/CVF Conference on Computer Vision and Pattern
deep stereo matching. Advances in Neural Information Pro- Recognition, pages 16263–16272, 2022. 8
cessing Systems, 33:22158–22169, 2020. 3 [19] Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei
[7] WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad, Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learning
Alireza Bab-Hadiashar, and David Suter. Itsa: An for disparity estimation through feature constancy. In Pro-
information-theoretic approach to automatic shortcut avoid- ceedings of the IEEE Conference on Computer Vision and
ance and domain generalization in stereo matching networks. Pattern Recognition, pages 2811–2820, 2018. 3
In Proceedings of the IEEE/CVF Conference on Computer [20] Zhengfa Liang, Yulan Guo, Yiliu Feng, Wei Chen, Linbo
Vision and Pattern Recognition, pages 13022–13032, 2022. Qiao, Li Zhou, Jianfeng Zhang, and Hengzhu Liu. Stereo
1, 3, 10, 11 matching using multi-level cost volume and multi-scale fea-
[8] Alex Costanzino, Pierluigi Zama Ramirez, Matteo Poggi, ture constancy. IEEE transactions on pattern analysis and
Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Learn- machine intelligence, 43(1):300–315, 2019. 3, 9, 11
ing depth estimation for transparent and mirror surfaces. In [21] Lahav Lipson, Zachary Teed, and Jia Deng. Raft-stereo:
Proceedings of the IEEE/CVF International Conference on Multilevel recurrent field transforms for stereo matching. In
Computer Vision, pages 9244–9255, 2023. 8 2021 International Conference on 3D Vision (3DV), pages
[9] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, 218–227. IEEE, 2021. 1, 3, 4, 7, 8
Laurent Itti, and Anima Anandkumar. Born again neural net- [22] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer,
works. In International Conference on Machine Learning, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A
pages 1607–1616. PMLR, 2018. 1, 3 large dataset to train convolutional networks for disparity,
[10] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we optical flow, and scene flow estimation. In Proceedings of
ready for autonomous driving? the kitti vision benchmark the IEEE conference on computer vision and pattern recog-
suite. In 2012 IEEE conference on computer vision and pat- nition, pages 4040–4048, 2016. 3, 4, 9
tern recognition, pages 3354–3361. IEEE, 2012. 4, 9 [23] Moritz Menze and Andreas Geiger. Object scene flow for au-
[11] Jianping Gou, Baosheng Yu, Stephen J Maybank, and tonomous vehicles. In Proceedings of the IEEE conference
Dacheng Tao. Knowledge distillation: A survey. Interna- on computer vision and pattern recognition, pages 3061–
tional Journal of Computer Vision, 129:1789–1819, 2021. 3 3070, 2015. 4, 9
[12] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and [24] Matteo Poggi, Alessio Tonioni, Fabio Tosi, Stefano Mattoc-
Hongsheng Li. Group-wise correlation stereo network. In cia, and Luigi Di Stefano. Continual adaptation for deep
Proceedings of the IEEE/CVF Conference on Computer Vi- stereo. IEEE Transactions on Pattern Analysis and Machine
sion and Pattern Recognition, pages 3273–3282, 2019. 7, Intelligence, 2021. 3
9 [25] Pierluigi Zama Ramirez, Fabio Tosi, Matteo Poggi, Samuele
[13] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Salti, Stefano Mattoccia, and Luigi Di Stefano. Open chal-
Girshick. Momentum contrast for unsupervised visual rep- lenges in deep stereo: the booster dataset. In Proceedings of

17
the IEEE/CVF Conference on Computer Vision and Pattern [38] Philippe Weinzaepfel, Vaibhav Arora, Yohann Cabon,
Recognition, pages 21168–21178, 2022. 4, 8, 9 Thomas Lucas, Romain Brégier, Vincent Leroy, Gabriela
[26] Zhibo Rao, Bangshu Xiong, Mingyi He, Yuchao Dai, Renjie Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jérôme
He, Zhelun Shen, and Xing Li. Masked representation learn- Revaud. Improved cross-view completion pre-training for
ing for domain generalized stereo matching. In Proceedings stereo matching. arXiv preprint arXiv:2211.10408, 2022. 3,
of the IEEE/CVF Conference on Computer Vision and Pat- 11
tern Recognition, pages 5435–5444, 2023. 3 [39] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Atten-
[27] Peter Sadowski, Julian Collado, Daniel Whiteson, and Pierre tion concatenation volume for accurate and efficient stereo
Baldi. Deep learning, dark knowledge, and dark matter. In matching. In Proceedings of the IEEE/CVF Conference on
NIPS 2014 Workshop on High-energy Physics and Machine Computer Vision and Pattern Recognition, pages 12981–
Learning, pages 81–87. PMLR, 2015. 3 12990, 2022. 3
[28] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, [40] Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang.
Greg Krathwohl, Nera Nešić, Xi Wang, and Porter West- Iterative geometry encoding volume for stereo matching. In
ling. High-resolution stereo datasets with subpixel-accurate Proceedings of the IEEE/CVF Conference on Computer Vi-
ground truth. In German conference on pattern recognition, sion and Pattern Recognition, pages 21919–21928, 2023. 1,
pages 31–42. Springer, 2014. 4, 9 2, 3, 4, 7, 9, 11, 12
[29] Thomas Schops, Johannes L Schonberger, Silvano Galliani, [41] Gangwei Xu, Huan Zhou, and Xin Yang. Cgi-stereo: Accu-
Torsten Sattler, Konrad Schindler, Marc Pollefeys, and An- rate and real-time stereo matching via context and geometry
dreas Geiger. A multi-view stereo benchmark with high- interaction. arXiv preprint arXiv:2301.02789, 2023. 7, 10,
resolution images and multi-camera videos. In Proceedings 11
of the IEEE Conference on Computer Vision and Pattern [42] Haofei Xu and Juyong Zhang. Aanet: Adaptive aggrega-
Recognition, pages 3260–3269, 2017. 4, 9 tion network for efficient stereo matching. In Proceedings of
[30] Zhelun Shen, Yuchao Dai, and Zhibo Rao. Cfnet: Cascade the IEEE/CVF Conference on Computer Vision and Pattern
and fused cost volume for robust stereo matching. In Pro- Recognition, pages 1959–1968, 2020. 3
ceedings of the IEEE/CVF Conference on Computer Vision
[43] Gengshan Yang, Joshua Manela, Michael Happold, and
and Pattern Recognition, pages 13906–13915, 2021. 3, 8, 9,
Deva Ramanan. Hierarchical deep stereo matching on high-
10, 11, 12
resolution images. In Proceedings of the IEEE/CVF Con-
[31] Zhelun Shen, Yuchao Dai, Xibin Song, Zhibo Rao, Dingfu
ference on Computer Vision and Pattern Recognition, pages
Zhou, and Liangjun Zhang. Pcw-net: Pyramid combination
5515–5524, 2019. 10
and warping cost volume for stereo matching. In European
[44] Guorun Yang, Xiao Song, Chaoqin Huang, Zhidong Deng,
Conference on Computer Vision, pages 280–297. Springer,
Jianping Shi, and Bolei Zhou. Drivingstereo: A large-scale
2022. 7
dataset for stereo matching in autonomous driving scenarios.
[32] Zhelun Shen, Xibin Song, Yuchao Dai, Dingfu Zhou, Zhibo
In Proceedings of the IEEE/CVF Conference on Computer
Rao, and Liangjun Zhang. Digging into uncertainty-based
Vision and Pattern Recognition, pages 899–908, 2019. 4, 9
pseudo-label for robust stereo matching. arXiv preprint
arXiv:2307.16509, 2023. 3 [45] Penghui Yang, Ming-Kun Xie, Chen-Chen Zong, Lei Feng,
[33] Antti Tarvainen and Harri Valpola. Weight-averaged Gang Niu, Masashi Sugiyama, and Sheng-Jun Huang.
consistency targets improve semi-supervised deep learn- Multi-label knowledge distillation. In Proceedings of the
ing results. corr. abs/1703.01780 (2017). arXiv preprint IEEE/CVF International Conference on Computer Vision,
arXiv:1703.01780, 2017. 6 pages 17271–17280, 2023. 3
[34] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field [46] Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical dis-
transforms for optical flow. In Computer Vision–ECCV crete distribution decomposition for match density estima-
2020: 16th European Conference, Glasgow, UK, August 23– tion. In Proceedings of the IEEE/CVF Conference on Com-
28, 2020, Proceedings, Part II 16, pages 402–419. Springer, puter Vision and Pattern Recognition, pages 6044–6053,
2020. 3 2019. 3
[35] Alessio Tonioni, Fabio Tosi, Matteo Poggi, Stefano Mat- [47] Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi
toccia, and Luigi Di Stefano. Real-time self-adaptive deep Feng. Revisiting knowledge distillation via label smoothing
stereo. In Proceedings of the IEEE/CVF Conference on Com- regularization. In Proceedings of the IEEE/CVF Conference
puter Vision and Pattern Recognition, pages 195–204, 2019. on Computer Vision and Pattern Recognition, pages 3903–
3 3911, 2020. 3
[36] Fabio Tosi, Alessio Tonioni, Daniele De Gregorio, and Mat- [48] Jure Zbontar and Yann LeCun. Computing the stereo match-
teo Poggi. Nerf-supervised deep stereo. In Proceedings of ing cost with a convolutional neural network. In Proceed-
the IEEE/CVF Conference on Computer Vision and Pattern ings of the IEEE conference on computer vision and pattern
Recognition, pages 855–866, 2023. 3 recognition, pages 1592–1599, 2015. 3
[37] Hengli Wang, Rui Fan, Peide Cai, and Ming Liu. Pvstereo: [49] Jiaxi Zeng, Chengtang Yao, Lidong Yu, Yuwei Wu, and
Pyramid voting module for end-to-end self-supervised stereo Yunde Jia. Parameterized cost volume for stereo matching.
matching. IEEE Robotics and Automation Letters, 6(3): In Proceedings of the IEEE/CVF International Conference
4353–4360, 2021. 3 on Computer Vision, pages 18347–18357, 2023. 3, 8

18
[50] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and
Philip HS Torr. Ga-net: Guided aggregation net for end-to-
end stereo matching. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
185–194, 2019. 3, 7
[51] Feihu Zhang, Xiaojuan Qi, Ruigang Yang, Victor Prisacariu,
Benjamin Wah, and Philip Torr. Domain-invariant stereo
matching networks. In European Conference on Computer
Vision, pages 420–439. Springer, 2020. 3, 10
[52] Jiawei Zhang, Xiang Wang, Xiao Bai, Chen Wang, Lei
Huang, Yimin Chen, Lin Gu, Jun Zhou, Tatsuya Harada, and
Edwin R Hancock. Revisiting domain generalized stereo
matching networks from a feature consistency perspective.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 13001–13011, 2022.
1, 3, 10
[53] Youmin Zhang, Yimin Chen, Xiao Bai, Suihanjin Yu, Kun
Yu, Zhiwei Li, and Kuiyuan Yang. Adaptive unimodal cost
volume filtering for deep stereo matching. In Proceedings of
the AAAI Conference on Artificial Intelligence, pages 12926–
12934, 2020. 3
[54] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun
Liang. Decoupled knowledge distillation. In Proceedings of
the IEEE/CVF Conference on computer vision and pattern
recognition, pages 11953–11962, 2022. 1, 3
[55] Haoliang Zhao, Huizhou Zhou, Yongjun Zhang, Jie Chen,
Yitong Yang, and Yong Zhao. High-frequency stereo match-
ing network. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 1327–
1336, 2023. 3, 7

19

You might also like