GMMSeg: Gaussian Mixture based Generative Semantic Segmentation Models
Abstract
Prevalent semantic segmentation solutions are, in essence, a dense discriminative
classifier of p(class |pixel feature). Though straightforward, this de facto paradigm
neglects the underlying data distribution p(pixel feature |class), and struggles to
identify out-of-distribution data. Going beyond this, we propose GMMSeg, a new
family of segmentation models that rely on a dense generative classifier for the
joint distribution p(pixel feature, class). For each class, GMMSeg builds Gaussian
Mixture Models (GMMs) via Expectation-Maximization (EM), so as to capture
class-conditional densities. Meanwhile, the deep dense representation is end-to-end
trained in a discriminative manner, i.e., maximizing p(class |pixel feature). This
endows GMMSeg with the strengths of both generative and discriminative models.
With a variety of segmentation architectures and backbones, GMMSeg outperforms
the discriminative counterparts on three closed-set datasets. More impressively,
without any modification, GMMSeg even performs well on open-world datasets.
We believe this work brings fundamental insights into the related fields.
1 Introduction
Semantic segmentation aims to explain visual semantics at the pixel level. It is typically considered as
a problem of pixel-wise classification, i.e., assigning a class label c ∈ {1,· · ·, C} to each pixel data x.
Under this regime, deep-neural solutions are naturally built as a combination of two parts (Fig. 1(a)):
an encoder-decoder, dense feature extractor that maps x to a high-dimensional feature representation
x, and a dense classifier that conducts C-way classification given input pixel feature x. Starting from
the first end-to-end segmentation solution – fully convolutional networks (FCN) [1], researchers leave
the classifier as parametric softmax, and devote themselves to improving the dense feature extractor for
learning better representation. As a result, a huge number of FCN-based solutions [2–5] emerged and
their state-of-the-art was further pushed forward by recent Transformer [6]-style algorithms [7–10].
From a probabilistic perspective, the softmax classifier, supervised by the cross-entropy loss together
with the feature extractor, directly models the class probability given an input, i.e., posterior p(c|x).
This is known as a discriminative classifier, as the conditional probability distribution discriminates
directly between the different values of c [11]. As discriminative classifiers directly find the classifica-
tion rule with the smallest error rate, they often give excellent performance in downstream tasks, and
hence become the de facto paradigm in segmentation. Yet, due to the discriminative nature, softmax-
based segmentation models suffer from several limitations: First, they only learn the decision boundary
between classes, without modeling the underlying data distribution [11]. Second, as only one weight
vector is learned per class, they assume unimodality for each class [12, 13], bearing no within-class
variation. Third, they learn a prediction space where the model accuracy deteriorates rapidly away
∗ Equal contributions. † Work partly done during an internship at Baidu Research.
Figure 1: (a) Existing softmax based discriminative regime only learns decision boundaries on the
pixel embedding space. (b) Our GMMSeg models pixel feature densities via generative GMMs.
from the decision boundaries [14] and thus yield poorly calibrated predictions [15], struggling to
recognize out-of-distribution data [16]. The first two limitations may hinder the expressive power of
segmentation models, and the last one challenges the adoption of segmentation models in decision-
critical tasks (e.g., autonomous driving) and motivates the development of anomaly segmentation
methods [17–19] (which, however, rely on pre-trained discriminative segmentation models).
As an alternative to discriminative classifiers, generative classifiers first find the joint probability
p(x, c), and use p(x, c) to evaluate the class-conditional densities p(x|c). Then classification is con-
ducted using Bayes rule. Numerous theoretical and empirical comparisons [20, 21] between these two
approaches have been initiated even before the deep learning revolution. They reach the agreement that
generative classifiers have potential to overcome shortcomings of their discriminative counterparts, as
they are able to model the input data itself. This stimulates the recent investigation of generative (and
discriminative-generative hybrid [22, 23]) classifiers in trustworthy AI [24–27] and semi-supervised
learning [22, 23], while the discriminative classifiers are still dominant in most downstream tasks.
In light of this background, we propose a GMM based segmentation framework – GMMSeg – that
addresses the limitations of current discriminative solutions from a generative perspective (Fig. 1(b)).
Our work not only represents a novel effort to advocate generative classifiers for end-to-end segmen-
tation, but also evidences the merits of generative approaches in a challenging, dense classification
task setting. In particular, we adopt a separate mixture of Gaussians for modeling the data distribution
of each class in the feature space, i.e., class-conditional feature densities p(x|c). During training,
the GMM classifier is optimized online by a momentum version of (Sinkhorn) EM [28] at large scale,
so as to ensure its generative nature and synchronization with the evolving feature space. Mean-
while, the feature extractor is end-to-end trained with the discriminative (cross-entropy) loss, i.e.,
maximizing the conditional likelihood p(c|x) derived with the generative GMM, so as to enable
expressive representation learning. In this way, GMMSeg smartly learns generative classification with
end-to-end discriminative representation in a compact and collaborative manner, exploiting the benefit
of both generative and discriminative approaches. This also greatly distinguishes GMMSeg from
most existing GMM based neural classifiers, which are either discriminatively trained [12, 29–31] or
trivially estimate a GMM in the feature space of a pre-trained discriminative classifier [19, 32, 33].
GMMSeg has several appealing facets: First, with the hybrid training strategy – online EM based
classifier optimization and end-to-end discriminative representation learning, GMMSeg can precisely
approximate the data distribution over a robust feature space. Second, the mixture components make
GMMSeg a structured model that well adapts to multimodal data densities. Third, the distribution-
preserving property allows GMMSeg to naturally reject abnormal inputs, without architectural
change (like [34–37]), re-training (like [38–40]), or post-calibration (like [17, 18, 41–46]). Fourth,
GMMSeg is a principled framework, fully compatible with modern segmentation network architectures.
For thorough examination, in §4.1, we approach GMMSeg on several representative segmentation
architectures (i.e., DeepLabV3+[47], OCRNet [48], UperNet [49], SegFormer [7]), with diverse back-
bones (i.e., ResNet [50], HRNet [51], Swin [52], MiT [7]). Experimental results demonstrate GMMSeg
even outperforms the softmax-based discriminative counterparts, e.g., 0.6% – 1.5%, 0.5% – 0.8%,
and 0.7% – 1.7% mIoU gains over ADE20K [53], Cityscapes [54], and COCO-Stuff [55], respectively.
Furthermore, in §4.2, we validate our approach on anomaly segmentation. Without any modification,
our Cityscapes-trained GMMSeg model is directly tested on Fishyscapes Lost&Found [56] and Road
Anomaly [36] datasets, and outperforms all hand-tailored discriminative competitors.
To our best knowledge, GMMSeg is the first semantic segmentation method that reports promising
results on both closed-set and open-world scenarios by using a single model instance. More notably,
our impressive results manifest the advantages of generative classifiers in a large-scale real-world
setting. We feel this work opens a new avenue for research in this field.
2 Related Work
Semantic Segmentation. Since the seminal work of FCN [1], deep-net segmentation solutions are
typically built in a dense classification fashion, i.e., learning dense representation and categorization
end-to-end. By directly adopting discriminative softmax for categorization, FCN-style solutions put
focus on learning expressive dense representation; they modify the FCN architecture from various
aspects, such as enlarging the receptive field [2,3,47,57–60], modeling multi-scale context [48,59,61–
77], investigating non-local operations [4, 5, 78–84], and exploring hierarchical information [85–87].
With a similar goal of sharpening representation, later Transformer-style solutions [7–9, 88] empower
attentive networks with, for instance, local contiguity [7] and multi-level feature aggregation [9,
10]. Two very recent attentive models [89, 90] formulate the task in an alternative form of mask
classification, however, still relying on discriminative softmax.
From the discussion above, we can find that current prevalent segmentation solutions are in essence
a pixel-wise, discriminative classifier, which only learns decision boundaries between classes in
the pixel feature space [14, 91], without modeling the underlying data distribution. In contrast,
our GMMSeg tackles the task from a generative viewpoint. GMMSeg deeply embeds generative
optimization of GMMs into end-to-end dense representation learning, so as to comprehensively
describe the class-aware knowledge [92] in a discriminative feature space. GMMSeg is partly inspired
by [13, 93], which also probe data structures via intra-class clustering. However, the dense classification
in these two works is achieved via non-parametric nearest-centroid retrieval – still a discriminative
model. In [94], though data density is estimated (as a mixture of vMF distributions [95]), it is only
used as a supervisory signal for dense embedding learning, and the final prediction is still made by a
discriminative classifier – k-NN. Our work represents the first step towards formulating (closed-set)
semantic segmentation within a generative neural classification framework.
Discriminative vs Generative Classifiers. Generative classifiers and discriminative classifiers rep-
resent two contrasting ways of solving classification tasks [24]. Basically, the generative classifiers
(such as Linear Discriminant Analysis and naive Bayes) learn the class densities p(x|c), while the
discriminative classifiers (such as softmax) learn the class boundaries p(c|x) without regard to the
underlying class densities. In practical classification tasks, the softmax discriminative classifier is used
exclusively [24], due to its simplicity and excellent discriminative performance. Nonetheless, genera-
tive classifiers are widely agreed to have several advantages over their discriminative counterparts [21,
96], e.g., accurately modeling the input distribution, and explicitly identifying unlikely inputs in a
natural way. Driven by this common belief, a surge of deep learning literature [14, 97–99] investi-
gated the potential (and the limitation) of generative classifiers in adversarial defense [25,26,100–102],
explainable AI [24], out-of-distribution detection [27, 103], and semi-supervised learning [22, 23, 97].
As GMMs can express (almost) arbitrary continuous distributions, they have been adopted in many neural
classifiers [23,104]. However, most of these GMM classifiers are discriminative models [12,14,29,30]
that are trained ‘discriminatively’(i.e., maximizing posteriors p(c|x)). In GMMSeg, the GMM is purely
optimized via EM (i.e., estimating class densities p(x|c)) while the deep representation is trained via
gradient backpropagation of the discriminative loss. Thus the whole GMMSeg is a hybrid of genera-
tive GMM and discriminative representation, getting the best of two worlds. Although bearing the
general idea of trading-off between generative and discriminative classifiers [21–23, 96, 105, 106],
none of the previous hybrid algorithms demonstrate their utility in challenging segmentation tasks.
Anomaly Segmentation. Anomaly segmentation strives to identify unknown object regions, typically
in road-driving scenarios [56]. Existing solutions can be generally categorized into three classes: i)
Uncertainty estimation based algorithms [17, 18, 41–46] usually approximate the uncertainty from
simple statistics of the classification probability or logits of pre-trained segmentation models [17, 18,
41–43], or adopt Bayesian neural networks with Monte-Carlo dropout to capture pixel uncertainty [44–
46]. ii) Outlier exposure based algorithms make use of auxiliary datasets as training samples of
unexpected objects [38–40]. Therefore, this type of algorithm requires re-training the segmentation
network, resulting in performance degradation. iii) Image resynthesis based algorithms reconstruct
the input image and discriminate the anomaly instances according to the reconstruction error [34–37].
With a generative classifier, our GMMSeg handles anomaly segmentation naturally, without either
external datasets of outliers or additional image resynthesis models. It also greatly differs from most
uncertainty estimation-based methods, which are post-processing techniques adjusting the prediction
scores of softmax-based segmentation networks [17, 18, 41–43]. The most relevant ones are perhaps a few
density estimation-based models [56,107,108], which directly measure the likelihood of samples w.r.t.
the data distribution. However, they are either limited to pre-trained representation [56] or specialized
for anomaly detection with simple data [107, 108]. To the best of our knowledge, this is the first time
promising results have been reported on both closed-set and open-world large-scale settings, through a single
model instance without any change of the network architecture or the training and inference protocols.
3 Methodology
In this section, we first formalize modern semantic segmentation models within a dense discriminative
classification framework and discuss defects of such discriminative regime from a probabilistic view-
point (§3.1). Then we describe our new segmentation framework – GMMSeg – that brings a paradigm
shift from the discriminative to generative (§3.2). Finally, in §3.3, we provide implementation details.
3.1 Existing Segmentation Solutions: Dense Discriminative Classifier
In the standard semantic segmentation setting, we are given a training dataset D = {(xn, cn)}_{n=1}^{N} of
N pairs of pixel samples xn ∈ R3 and corresponding semantic labels cn ∈ {1,· · ·, C}. The goal is to
use D to learn a classification rule which can predict the label c′ ∈ {1,· · ·, C} of an unseen pixel x′ .
Recent mainstream solutions employ a deep neural network for pixel representation learning and
softmax for semantic label prediction. Hence they are usually built as a composition of f ◦g:
• A dense feature extractor fθ : R3→ RD, which is typically an encoder-decoder network that maps
the input pixel x to a D-dimensional feature representation x, i.e., x = fθ (x) ∈ RD ;1 and
• A dense classifier gω : RD → RC , which is achieved by parametric softmax that maps each pixel
representation x ∈ RD to C real-valued numbers {yc ∈ R}_{c=1}^{C} termed as logits, i.e., {yc}_{c=1}^{C} = gω(x),
and uses the logits to compute the posterior probability:
\small \begin {aligned}\label {eq:softmax} p(c|\bm {x}; \bm {\omega }, \bm {\theta }) \!=\! \frac {\textrm {exp}(y_{c})}{\sum _{c'}\!\textrm {exp}(y_{c'})} \!=\! \frac {\textrm {exp}(\bm {w}_{c}^{\!\top }\bm {x}+b_{c})}{\sum _{c'}\!\textrm {exp}(\bm {w}_{c'}^{\!\top }\bm {x}+b_{c'})}\!=\! \frac {\textrm {exp}(\bm {w}_{c}^{\!\top }f_{\bm {\theta }}(x)+b_{c})}{\sum _{c'}\!\textrm {exp}(\bm {w}_{c'}^{\!\top }f_{\bm {\theta }}(x)+b_{c'})}, \end {aligned} \vspace {1pt} (1)
where wc ∈ RD and bc ∈ R are the weight and bias for class c, respectively; and ω = {w1:C , b1:C }.
The final prediction is the class with the highest predicted probability: arg maxc p(c|x; ω, θ).
The feature extractor f and softmax-based classifier g are jointly trained end-to-end. Their corre-
sponding parameters {θ, ω} are optimized by minimizing the so-called cross-entropy loss on D:
\small \begin {aligned} \bm {\theta }^\ast , \bm {\omega }^\ast =\argmin _{\bm {\theta }, \bm {\omega }}-\!\!\sum \nolimits _{(x,c)\in \mathcal {D}}\log p(c|\bm {x}; \bm {\omega }, \bm {\theta }), \end {aligned} \vspace {1pt} (2)
which is equivalent to maximizing conditional likelihood, i.e., Π(x,c)∈D p(c|x). In some literature [11,
109], such learning strategy is called discriminative training. As softmax directly models the
conditional probability distribution p(c|x) with no concern for modeling the input distribution
p(x, c), existing softmax-based segmentation models are in essence a dense discriminative classifier.
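To make this concrete, here is a minimal PyTorch sketch of such a dense discriminative classifier (Eqs. 1-2); the feature tensor, shapes, and the 19-class setting are placeholder assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, C = 64, 19                                # assumed feature dim and class count

# Dense classifier g_w: a 1x1 conv computes w_c^T x + b_c for every pixel feature x.
classifier = nn.Conv2d(D, C, kernel_size=1)

feats = torch.randn(2, D, 96, 96)            # stand-in for x = f_theta(input)
labels = torch.randint(0, C, (2, 96, 96))    # dense ground-truth class map

logits = classifier(feats)                   # y_c per pixel (Eq. 1, before softmax)
posterior = logits.softmax(dim=1)            # p(c|x): only decision boundaries are modeled
loss = F.cross_entropy(logits, labels)       # Eq. 2: maximize conditional likelihood p(c|x)
pred = posterior.argmax(dim=1)               # final prediction: arg max_c p(c|x)
```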
Discriminative softmax typically gives good predictive performance, as the pixel classification rule
depends only on the conditional distribution p(c|x) in the sense of minimum error rate and softmax
optimizes the quantity of interest in a concise manner, i.e., learning a direct map from inputs x to the
class labels c. In spite of its prevalence and effectiveness, this dense discriminative regime has some
drawbacks that are still poorly understood: First, it attends only to learning the decision boundaries
between the C classes on the pixel embedding space, i.e., splitting the D-dimensional feature space
using C different (D−1)-dimensional hyperplanes. This simplification avoids extra parameters for
modeling the data (representation) distribution [110]. However, from another perspective, it fails to
capture the intrinsic class characteristics and struggles to generalize well to unseen data. Second,
in softmax, each class c corresponds to only a single weight (wc, bc).
That means existing segmentation models rely on an implicit assumption of unimodality of data of
each class in the feature space [12, 111, 112]. However, this unimodality assumption is rarely the case
in real-world scenarios and makes the model less tolerant of intra-class variances [13], especially
when the multimodality remains in the feature space [12]. Third, softmax is not capable of inferring
the data distribution – it is notorious for inflating the probability of the predicted class as a result of the
exponent applied to the network outputs [113]. Thus the prediction score of a class carries no meaning beyond
its comparative value against the other classes. This is the root cause of why existing segmentation models
1 Strictly speaking, the dense feature extractor fθ typically maps pixel samples with image context, i.e.,
fθ : Rh×w×3 → Rh′×w′×D, where h and w (h′ and w′) denote the spatial resolution of the image (feature map).
Here we simplify the notations, i.e., fθ : R3 → RD, to keep a straightforward formulation.
struggle to identify pixel samples x′ of an unseen class (out-of-distribution data), i.e., c′ ∉ {1,· · ·, C}.
Accordingly, we argue that the time might be right to rethink the current de facto, discriminative
segmentation regime, where the softmax classifier may actually cause more harm than good.
3.2 GMMSeg: Dense Generative Classifier
In contrast, a generative classifier models the class-conditional densities p(x|c) together with the class
priors p(c), and makes predictions via Bayes rule:
\small \begin {aligned}\label {eq:gc} p(c|\bm {x}) \!=\! \frac {p(c)p(\bm {x}|c)}{\sum _{c'}\!p(c')p(\bm {x}|c')}. \end {aligned} (3)
Since the class probabilities p(c) are typically set as a uniform prior (also in our case), estimating
the class-conditional distributions (i.e., data densities) p(x|c) is the core and most difficult part of
building a generative classifier. It is also worth noting that generative classifiers are optimized by
approximating the data distribution Π(x,c)∈D p(x|c), which is called generative training [11].
Although discriminative classifiers demonstrate impressive performance in many application tasks,
there are several crucial reasons for using generative rather than discriminative classifiers, which
can be succinctly articulated by Feynman’s mantra “What I cannot create, I do not understand.”
Surprisingly, generative classifiers have been rarely investigated in modern segmentation models.
Driven by the belief that generative classifiers are the right way to remove the shortcomings of discri-
minative approaches, we revisit GMM – one of the most classic generative probabilistic classifiers.
We couple the generative EM optimization of GMMs with the discriminative learning of the dense
feature extractor f – the most successful part of modern segmentation models, leading to a powerful,
principled, and dense generative classification based segmentation framework – GMMSeg (Fig. 2).
Specifically, GMMSeg adopts a weighted mixture of M multivariate Gaussians for modeling the
pixel data distribution of each class c in the D-dimensional embedding space:
\small \begin {aligned}\label {eq:GMM} p(\bm {x}|c;\bm {\phi }_c) = \sum \nolimits _{m=1}^{M}p(m|c;\bm {\pi }_c)p(\bm {x}|c,m;\bm {\mu }_c,\bm {\Sigma }_c) = \sum \nolimits _{m=1\!}^{M}{\pi }_{cm~\!}\mathcal {N}(\bm {x};\bm {\mu }_{cm},\bm {\Sigma }_{cm}). \end {aligned} (4)
Here m|c ∼ Multinomial(πc) is the prior probability, i.e., Σ_m πcm = 1; µcm ∈ RD and Σcm ∈ RD×D
are the mean vector and covariance matrix for component m in class c; and ϕc = {πc, µc, Σc}. The
mixture nature allows GMMSeg to accurately approximate the data densities and to be superior over
softmax assuming unimodality for each class. Each Gaussian component has an independent co-
variance structure, enabling a flexible local measure of importance along different feature dimensions.
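As an illustration of Eq. 4 with the diagonal covariances used later in §3.3, the following PyTorch sketch evaluates a class-conditional log-density; the function name and tensor layout are assumptions, not the released implementation:

```python
import math
import torch

def gmm_log_density(x, pi, mu, var):
    """log p(x|c) under a diagonal-covariance Gaussian mixture (Eq. 4).

    x:   (N, D) pixel features
    pi:  (M,)   mixture weights pi_cm (summing to 1)
    mu:  (M, D) component means mu_cm
    var: (M, D) diagonal entries of Sigma_cm
    """
    diff = x.unsqueeze(1) - mu                                                    # (N, M, D)
    log_gauss = -0.5 * (torch.log(2 * math.pi * var) + diff ** 2 / var).sum(-1)   # (N, M)
    return torch.logsumexp(log_gauss + pi.log(), dim=-1)                          # (N,)
```

Classification then follows Eq. 3 with a uniform prior, i.e., taking arg max over the per-class log-densities.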
To find the optimal parameters of the GMM classifier, i.e., {ϕc*}_{c=1}^{C}, a standard approach is EM [114],
i.e., maximizing the log likelihood over the feature-label pairs {(xn, cn)}_{n=1}^{N} in the training dataset D:
\small \begin {aligned}\label {eq:EM1} \bm {\phi }^\ast _c\!=\!\argmax _{\bm {\phi }_c}\sum \nolimits _{\bm {x}_{n}:c_n=c\!}\log p(\bm {x}_n|c;\bm {\phi }_c)=\argmax _{\bm {\phi }_c}\sum \nolimits _{\bm {x}_{n}:c_n=c\!}\log \sum \nolimits _{m=1}^{M}p(\bm {x}_n,m|c;\bm {\phi }_c), \end {aligned} (5)
EM starts with some initial guess at the maximum likelihood parameters ϕc^(0), and then proceeds to
iteratively create successive estimates ϕc^(t) for t = 1, 2, · · ·, by repeatedly optimizing an F function [115]:
\small \begin {aligned}\label {eq:EM} \textbf {E-Step:}~~~q^{(t)}_{c}= \argmax \nolimits _{q_{c}\!}F(q_{c},\bm {\phi }_{c}^{(t-1)}),~~~~~~~~~~\textbf {M-Step:}~~~\bm {\phi }^{(t)}_{c}= \argmax \nolimits _{\bm {\phi }_{c}\!}F(q_{c}^{(t)},\bm {\phi }_{c}). \\ \end {aligned} (6)
qc [m] = p(m|x, c; ϕc ) gives the probability that data x is assigned to component m. F is defined as:
\small \begin {aligned}\label {eq:EM2} F(q_{c},\bm {\phi }_{c}) = \mathbb {E}_{q_{c}}[\log p(\bm {x}, m|c;\bm {\phi }_{c})] + H(q_c), \end {aligned} (7)
where Eqc [·] gives the expectation w.r.t. the distribution over the M components given by qc , and
H(qc) = −Eqc[log qc[m]] defines the entropy of qc. Based on Eqs. 4–7, for all xn with cn = c, we have:
\small \begin {aligned}\label {eq:EM3} \!\!\!\!\!\!\!\!\textbf {E-Step:}&~~~q^{(t)}_{cn}[m] = \frac {\pi ^{(t-1)}_{cm}\mathcal {N}(\bm {x}_{n}|\bm {\mu }^{(t-1)}_{cm},\bm {\Sigma }^{(t-1)}_{cm})}{\sum _{m'=1}^{M}\pi ^{(t-1)}_{cm'}\mathcal {N}(\bm {x}_{n}|\bm {\mu }^{(t-1)}_{cm'},\bm {\Sigma }^{(t-1)}_{cm'})},\\ \!\!\!\!\!\!\!\!\textbf {M-Step:}&~~~\pi ^{(t)}_{cm}\!=\!\frac {N^{(t)}_{cm}}{N_{c}},~~ \bm {\mu }^{(t)}_{cm}\!=\!\frac {1}{N^{(t)}_{cm}}\!\sum _{\bm {x}_{n\!}:c_{n\!}=_{\!}c}\!\!\!q^{(t)}_{cn}[m]\bm {x}_{n},~~ \bm {\Sigma }^{(t)}_{cm}\!=\!\frac {1}{N^{(t)}_{cm}}\!\sum _{\bm {x}_{n\!}:c_{n\!}=_{\!}c}\!\!\!q^{(t)}_{cn}[m](\bm {x}_{n\!}\!-\!\bm {\mu }^{(t)}_{cm})(\bm {x}_{n\!}\!-\!\bm {\mu }^{(t)}_{cm})\T ,\!\!\!\!\!\!\!\! \end {aligned}
(8)
Figure 2: Through generative-discriminative hybrid training, GMMSeg gains the best of the two worlds.
where Nc is the number of training samples labeled as c and Ncm = Σ_{n:cn=c} qcn[m]. In the E-step, we
recompute the posterior qc^(t) over the M components given the old parameters ϕc^(t−1). In the M-step, with the
soft cluster assignment qc^(t), the parameters are updated as ϕc^(t) such that the F function is maximized.
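For reference, one vanilla E-step/M-step round of Eq. 8 for a single class, with diagonal covariances, can be sketched as below (illustrative names and shapes, not the released code):

```python
import math
import torch

def em_step(x, pi, mu, var, eps=1e-6):
    """One EM iteration (Eq. 8) on the N_c features x: (N_c, D) of one class."""
    # E-step: responsibilities q[n, m] = p(m | x_n, c) under the old parameters.
    diff = x.unsqueeze(1) - mu                                        # (N_c, M, D)
    log_gauss = -0.5 * (torch.log(2 * math.pi * var) + diff ** 2 / var).sum(-1)
    log_q = log_gauss + pi.log()
    q = (log_q - log_q.logsumexp(-1, keepdim=True)).exp()             # (N_c, M)

    # M-step: re-estimate weights, means, and diagonal covariances.
    n_m = q.sum(0) + eps                                              # soft counts N_cm
    pi_new = n_m / x.shape[0]
    mu_new = (q.unsqueeze(-1) * x.unsqueeze(1)).sum(0) / n_m.unsqueeze(-1)
    var_new = (q.unsqueeze(-1) * (x.unsqueeze(1) - mu_new) ** 2).sum(0) / n_m.unsqueeze(-1) + eps
    return pi_new, mu_new, var_new
```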
In practice, we find standard EM suffers from slow convergence and delivers unsatisfactory re-
sults (cf. §4.3). A potential reason is the parameter sensitivity of EM – convergent parameters
may change vastly even with slightly different initialization [116]. Drawing inspiration from recent
optimal transport (OT) based clustering algorithms [117, 118], we introduce a uniform prior on the
mixture weights πc, i.e., ∀c, m: πcm = 1/M. Recalling qc[m] = p(m|x, c), we can derive a constraint set
Qc = {qc : (1/Nc) Σ_{xn:cn=c} p(m|xn, c) = 1/M}. Then the E-step in Eq. 6 is performed by restricting the
optimization of qc over the set Qc:
\small \begin {aligned}\label {eq:E} ~~~~~~~~~\textbf {E-Step:}~~~q_{c}^{(t)} = \argmax \nolimits _{q_{c}\in \mathcal {Q}_c}F(q_{c},\bm {\phi }_c^{(t-1)}). \end {aligned} (9)
This can be intuitively viewed as an equipartition constraint guided clustering process: inside each
class c, we expect the Nc pixel samples to be evenly assigned to M components. As indicated by [28],
Eq. 9 is analogous to entropy-regularized OT:
\small \begin {aligned}\label {eq:SEM} \!\!\!\!\!\!\!\!\min _{\bm {Q}_c\in \mathcal {Q}'_c\!}\sum \nolimits _{n,m\!}\!\bm {Q}_c(n,m)\bm {O}_c(n,m)\!+\!\epsilon H(\bm {Q}_c), ~~ \mathcal {Q}'_c\!=\!\{\bm {Q}_c\!\in \!\mathbb {R}^{N_c\!\times \!M\!}_{+}\!:\!\bm {Q}_c\mathbf {1}^{M\!}\!=\!\mathbf {1}^{N_c}, {(\bm {Q}_c)}^{\!\!\top }\!\mathbf {1}^{N_c\!}\!=\!\frac {N_c}{M}\mathbf {1}^{M}\},\!\!\!\! \end {aligned} (10)
where the transport matrix Qc (i.e., target solution) can be viewed as the posterior distribution qc of
Nc samples over the M components (i.e., Qc (n, m) = qcn [m]), the cost matrix Oc∈ RNc ×M is given
as the negative log-likelihood, i.e., Oc (n, m) = − log p(xn |c, m), and the entropy H(·) is penalized
by ϵ. The set Q′c encapsulates all the desired constraints over Qc, where 1M is an M-dimensional
all-ones vector. Intuitively, the more plausible a pixel sample xn is with respect to component m, the
less it costs to transport the underlying mass. Eq. 10 can be efficiently solved via Sinkhorn-Knopp
iteration [118], where ϵ is set to the default value (i.e., 0.05). This optimization scheme, called Sinkhorn
EM, is proven to have the same global optimum as the EM in Eq. 9, yet is less prone to getting stuck
in local optima [28], which is in line with our empirical results (cf. §4.3).
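The constrained E-step (Eqs. 9-10) can be approximated with a few Sinkhorn-Knopp normalization rounds; the sketch below follows this entropic-OT reading, with the iteration count being an assumption rather than the paper's setting:

```python
import torch

def sinkhorn_e_step(log_prob, eps=0.05, iters=3):
    """Equipartition-constrained E-step (Eqs. 9-10), sketched as entropic OT.

    log_prob: (N_c, M) log p(x_n | c, m); the OT cost O_c is its negative.
    Returns Q with each row summing to 1 (a posterior over components) and
    each column scaled toward N_c / M (the equipartition constraint).
    """
    n, m = log_prob.shape
    Q = (log_prob / eps).softmax(dim=1)              # exp(-O_c / eps), row-normalized start
    for _ in range(iters):
        Q = Q / Q.sum(0, keepdim=True) * (n / m)     # scale columns to N_c / M
        Q = Q / Q.sum(1, keepdim=True)               # scale rows back to 1
    return Q                                          # soft assignments q_cn[m]
```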
Our GMMSeg adopts a hybrid training strategy that is partly generative and partly discriminative:
\small \begin {aligned}\label {eq:loss} \!\!\!\!\!\!\!\!&\text {\textbf {Generative Optimization} (Sinkhorn EM) of \textbf {GMM Classifier}:}~~~\{\bm {\phi }^\ast _c\}^C_{c=1}\!=\!\\ \!\!\!\!\!\!\!\!&~~~~~~~~\{\argmax _{\bm {\phi }_c}\!\!\!\!\sum _{\bm {x}_{n\!}:c_n=c\!}\!\!\!\!\log p(\bm {x}_n|c;\bm {\phi }_c)\}^C_{c=1} = \{\argmax _{\bm {\phi }_c}\!\!\!\!\sum _{\bm {x}_{n\!}:c_n=c\!}\!\!\!\!\log \sum _{m=1\!}^{M}{\pi }_{cm~\!}\mathcal {N}(\bm {x}_n;\bm {\mu }_{cm},\bm {\Sigma }_{cm})\}^C_{c=1}, \\ \!\!\!\!\!\!\!\!&\text {\textbf {Discriminative Learning} (Cross-Entropy Loss) of \textbf {Dense Representation}:}~~~\bm {\theta }^\ast \!=\!\\ \!\!\!\!\!\!\!\!&\argmin _{\bm {\theta }}-\!\!\!\!\!\sum _{(x,c)\in \mathcal {D}}\!\!\!\!\log p(c|\bm {x}; \{\bm {\phi }^\ast _c\}^C_{c=1}, \bm {\theta }) =\!\argmin _{\bm {\theta }}-\!\!\!\!\!\!\sum _{(x,c)\in \mathcal {D}}\!\!\!\!\log \big (\frac {\sum _{m=1\!}^{M}{\pi }_{cm~\!}\mathcal {N}(f_{\bm {\theta }}(x);\bm {\mu }_{cm},\bm {\Sigma }_{cm})}{\sum _{c'=1\!}^{C}\sum _{m=1\!}^{M}{\pi }_{c'm~\!}\mathcal {N}(f_{\bm {\theta }}(x);\bm {\mu }_{c'm},\bm {\Sigma }_{c'm})}\big ).\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \end {aligned}
(11)
In practice, for efficiency, we conduct the EM in an online fashion: in each training iteration, the GMM
parameters are initialized with the ones maintained from the last iteration, i.e., {ϕc^(0)}_c, and we adopt
momentum update in the M-Step, i.e., {ϕc ← (1−τ)ϕc + τ ϕ̂c}_c, where
the momentum coefficient is set as τ = 0.999. This makes our training more stable and accelerates
the convergence of EM – we empirically find even one EM loop per training iteration is good enough.
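A rough sketch of the momentum M-step update; the assumption here (suggested by τ = 0.999 and the stability argument) is that the maintained parameters receive the weight τ, and the dictionary layout is purely illustrative:

```python
import torch

TAU = 0.999  # momentum coefficient from the text

@torch.no_grad()
def momentum_update(running, fresh, tau=TAU):
    """Blend freshly EM-estimated GMM parameters into the maintained ones.

    running, fresh: dicts with 'pi' (M,), 'mu' (M, D), 'var' (M, D) for one class.
    Assumption: the running (old) parameters keep weight tau, so one EM loop per
    training iteration only nudges them, which keeps training stable.
    """
    return {k: tau * running[k] + (1.0 - tau) * fresh[k] for k in running}
```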
This hybrid training scheme brings several advantages: First, GMMSeg achieves the merits of both
generative and discriminative learning. The online EM based generative optimization enables the
GMM to best fit the data distribution even on the evolving feature space. On the other hand, the
feature space is discriminatively end-to-end trained under the guidance of the GMM classifier, so as
to maximize the pixel-wise predictive performance. Second, as the generative EM optimization and
discriminative stochastic training work in an independent yet closely collaborative manner, GMMSeg
is fully compatible with modern segmentation network architectures and existing discriminative
training objectives. It can be further advanced with the development of network architectures of the
discriminative counterparts. Third, as GMMSeg explicitly models class-conditional data distribution
p(x|c), it can naturally handle off-manifold examples, i.e., directly giving meaningful likelihood of
the example fitting each class GMM distribution (see §4.2 for experiments on anomaly segmentation).
3.3 Implementation Details
Network Architecture. GMMSeg is a general framework that can be built upon any modern segmen-
tation network by replacing softmax with the GMM classifier. In our experiments (cf. §4.1), we
approach GMMSeg on a variety of segmentation models [7, 47–49] and backbones [50, 51]. In the
GMM classifier, a 1×1 conv is used to compress each pixel feature to a 64-dimensional vector, i.e.,
D = 64, and the covariance matrices Σ ∈ RD×D are constrained to be diagonal, for computational
efficiency. In our implementation, each class c is represented by a mixture of M = 5 Gaussians
(there are a total of 5C Gaussian components for a segmentation task with C semantic classes).
Furthermore, we adopt the winner-take-all assumption [119, 120], i.e., the class-wise responsibility
(Eq. 4) is dominated by the largest term, for better performance.
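The winner-take-all scoring mentioned above amounts to replacing the sum over components in Eq. 4 by its largest term; a hedged sketch with the 64-dimensional features, five components, and diagonal covariances from this section (illustrative names only):

```python
import math
import torch

def class_score_wta(x, pi, mu, var):
    """Winner-take-all class score: max_m log(pi_m N(x; mu_m, Sigma_m)).

    x: (N, 64) compressed pixel features; pi: (5,); mu, var: (5, 64).
    """
    diff = x.unsqueeze(1) - mu                                                    # (N, M, D)
    log_gauss = -0.5 * (torch.log(2 * math.pi * var) + diff ** 2 / var).sum(-1)   # (N, M)
    return (log_gauss + pi.log()).max(dim=-1).values                              # (N,)
```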
Training. In each training iteration, we conduct one loop of momentum (Sinkhorn) EM (i.e., t=1) on
current training batch as well as the external memory for the generative optimization of GMM, and
backpropagate the gradient of the cross-entropy loss on current batch for the discriminative training
of the feature extractor. The external memory maintains a queue for each component in each class;
each queue gathers 32K pixel features from previous training batches in a first in, first out manner. To
improve the diversity of the stored pixel features, we sample a sparse set of 100 pixels per class from
each image, instead of directly storing the whole images into the memory. Note that the memory is
discarded after training, and does not introduce extra overheads in inference.
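The external memory can be pictured as FIFO queues of pixel features; the sketch below keeps one queue per class for brevity (the text maintains one per component in each class), reusing the 32K capacity and 100-pixels-per-class sampling from above, while the data structure itself is an assumption:

```python
import torch
from collections import deque

class ClassFeatureMemory:
    """FIFO memory of pixel features, used only to enlarge the EM sample pool."""

    def __init__(self, num_classes, capacity=32_000, per_image=100):
        self.queues = [deque(maxlen=capacity) for _ in range(num_classes)]
        self.per_image = per_image

    @torch.no_grad()
    def push(self, feats, labels):
        """feats: (P, D) pixel features of one image; labels: (P,) class ids."""
        for c in labels.unique().tolist():
            idx = (labels == c).nonzero(as_tuple=True)[0]
            keep = idx[torch.randperm(idx.numel())[: self.per_image]]  # sparse subset per class
            self.queues[c].extend(feats[keep].cpu().unbind(0))

    def features_of(self, c):
        q = self.queues[c]
        return torch.stack(list(q)) if q else torch.empty(0)
```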
Inference. GMMSeg only brings negligible delay in the inference speed compared to the discrimina-
tive counterparts (see experiments in §4.3). For standard (closed-set) semantic segmentation, pixel
prediction is made using Bayes rule (cf. Eq. 3): arg maxc p(c|x), where p(c|x) ∝ p(x|c) with the
uniform class distribution prior: p(c) = 1/C. For anomaly segmentation, the pixel-wise uncertain-
ty/anomaly score can be naturally raised as: −maxc p(x|c), i.e., the outlier input should reside in
low-probability regions [121].
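Both inference modes therefore reduce to reading off the per-class densities; a brief sketch (in log space, which is monotone-equivalent, with an assumed (N, C) layout):

```python
import torch

def predict_and_score(log_px_c):
    """log_px_c: (N, C) per-pixel class log-densities log p(x|c).

    With a uniform prior p(c) = 1/C, Bayes rule gives p(c|x) proportional to p(x|c),
    so the closed-set label is the best-fitting class GMM; the anomaly score is the
    negative best (log-)density, i.e., low density means anomalous.
    """
    label = log_px_c.argmax(dim=1)
    anomaly = -log_px_c.max(dim=1).values
    return label, anomaly
```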
4 Experiments
We respectively examine the efficacy and robustness of GMMSeg on semantic segmentation (§4.1)
and anomaly segmentation (§4.2). In §4.3, we provide diagnostic analysis on our core model design.
4.1 Experiments on Semantic Segmentation
Training Details. Our models are built with a variety of segmentation architectures and back-
bones and trained with commonly used data augmentations including resizing, flipping, color jittering
and cropping. For ADE20K /COCO-Stuff/Cityscapes, images are cropped to 512×512/512×512/768×
768 and models are trained for 160K/80K/80K iterations with 16/16/8 batch size, using 8/16 NVIDIA
Tesla A100 GPUs. Other training hyper-parameters (i.e., optimizers, learning rates, weight decays,
schedulers) are set as the default in MMSegmentation and can be found in the supplementary.
Inference Details. For ADE20K and COCO-Stuff, we keep the aspect ratio of test images and rescale
the short side to 512. For Cityscapes, sliding window inference is used with 768×768 window size. Note
that for fairness, all our results are reported without any test-time data augmentation.
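For reference, sliding-window inference with 768×768 windows can be sketched as below; the stride, the averaging of overlapping scores, and the model's output format are assumptions and not the exact MMSegmentation settings:

```python
import torch

@torch.no_grad()
def slide_inference(model, image, num_classes, win=768, stride=512):
    """Average per-class scores over overlapping windows of one (3, H, W) image."""
    _, H, W = image.shape
    scores = torch.zeros(num_classes, H, W)
    counts = torch.zeros(1, H, W)
    # Window origins, always including the bottom/right-most position for full coverage.
    ys = sorted({*range(0, max(H - win, 0) + 1, stride), max(H - win, 0)})
    xs = sorted({*range(0, max(W - win, 0) + 1, stride), max(W - win, 0)})
    for y in ys:
        for x in xs:
            y2, x2 = min(y + win, H), min(x + win, W)
            out = model(image[:, y:y2, x:x2].unsqueeze(0))[0]  # assumed (C, h, w) class scores
            scores[:, y:y2, x:x2] += out
            counts[:, y:y2, x:x2] += 1
    return scores / counts
```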
Quantitative Results. Table 1 demonstrates our quantitative results. Although mainly focusing on
the comparison with the four base segmentation models [7, 47–49], we further include five widely
recognized methods [1, 3, 8, 9, 89] for completeness. As can be seen, our GMMSeg outperforms all its
discriminative counterparts across various datasets, backbones, and network architectures (FCN-style
and Transformer-like):

Table 1: Quantitative results (§4.1) on ADE20K [53] val, Cityscapes [54] val, and COCO-Stuff [55] test with mean IoU.
Method | Backbone | ADE20K | Citys. | COCO.
FCN [CVPR15] [1] | ResNet101 | 39.9 | 75.5 | 32.6
PSPNet [CVPR17] [3] | ResNet101 | 44.4 | 79.8 | 37.8
SETR† [CVPR21] [9] | ViTLarge | 48.2 | 79.2 | -
Segmenter† [ICCV21] [8] | ViTLarge‡ | 51.8 | 79.1 | -
MaskFormer† [NeurIPS21] [89] | SwinBase‡ | 52.7 | - | -
DeepLabV3+ [ECCV18] [47] | ResNet101 | 45.5 | 80.6 | 33.8
GMMSeg | ResNet101 | 46.7 ↑1.2 | 81.1 ↑0.5 | 35.5 ↑1.7
OCRNet [ECCV20] [48] | HRNetV2W48 | 43.3 | 80.4 | 37.6
GMMSeg | HRNetV2W48 | 44.8 ↑1.5 | 81.2 ↑0.8 | 39.2 ↑1.6
UPerNet [ECCV18] [49] | SwinBase | 48.0 | 81.1 | 43.4
GMMSeg | SwinBase | 49.0 ↑1.0 | 81.8 ↑0.7 | 44.3 ↑0.9
SegFormer [NeurIPS21] [7] | MiTB5 | 50.0 | 82.0 | 44.0
GMMSeg | MiTB5 | 50.6 ↑0.6 | 82.6 ↑0.6 | 44.7 ↑0.7
†: pretrained on ImageNet22K; ‡: using larger crop-size, i.e., 640×640

• ADE20K [53] val. With FCN-style segmentation architectures, i.e., DeepLabV3+ and OCRNet, GMMSeg
provides 1.2%/1.5% mIoU gains over the corresponding discriminative models. Similar performance
improvements, i.e., 1.0% and 0.6%, are also obtained with attentive architectures, i.e., Swin-UperNet and
SegFormer, manifesting the universality and efficacy of GMMSeg.
• Cityscapes [54] val. Again our GMMSeg surpasses all its discriminative counterparts, e.g., 0.5% over
DeepLabV3+, 0.8% over OCRNet, 0.7% over Swin-UperNet, and 0.6% over SegFormer, suggesting its
wide utility in this field.
• COCO-Stuff [55] test. Our GMMSeg also demonstrates promising results. This is particularly
impressive considering these results are achieved by a dense generative classifier, while the semantic
segmentation task is commonly considered as a battlefield for discriminative approaches.
Qualitative Results. In Fig. 3, we illustrate the qualitative comparisons of our GMMSeg against
SegFormer [7]. It is evident that, among the representative samples in the three datasets, our method
yields more accurate predictions in challenging scenarios, e.g., inconspicuous objects.
4.2 Experiments on Anomaly Segmentation
Datasets. To fully reveal the merits of our generative method, we next test its robustness to abnormal
data, i.e., identifying test samples of unseen classes, using two popular anomaly segmentation datasets:
• Fishyscapes Lost&Found [56], built upon [124], has 100/275 val/test images. It is collected under
the same setup as Cityscapes [54] but with real obstacles on the road. Pixels are labeled as either back-
ground (i.e., pre-defined Cityscapes classes) or anomaly (i.e., other unexpected classes like crate).
• Road Anomaly [36] has 60 images containing anomalous objects in unusual road conditions.
Evaluation Metrics. The area under the receiver operating characteristic curve (AUROC), average precision
(AP), and false positive rate at a 95% true positive rate (FPR95) are adopted, following [18, 35, 56].
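These metrics can be computed from flattened per-pixel anomaly scores and binary anomaly labels with scikit-learn; a small sketch (the FPR95 lookup via the ROC curve is one common convention, assumed here):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def anomaly_metrics(scores, labels):
    """scores: higher = more anomalous; labels: 1 for anomaly pixels, 0 otherwise."""
    auroc = roc_auc_score(labels, scores)
    ap = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.searchsorted(tpr, 0.95)                 # first threshold reaching 95% TPR
    fpr95 = fpr[min(idx, len(fpr) - 1)]
    return auroc, ap, fpr95
```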
Table 2: Quantitative results (§4.2) on Fishyscapes Lost&Found [56] val and Road Anomaly [36].
Method | Resyn. | Extra OOD Data | mIoU | Fishyscapes Lost&Found: AUROC↑ / AP↑ / FPR95↓ | Road Anomaly: AUROC↑ / AP↑ / FPR95↓
SynthCP [ECCV20] [35] | ✓ | ✓ | 80.3 | 88.34 / 6.54 / 45.95 | 76.08 / 24.86 / 64.69
SynBoost [CVPR21] [34] | ✓ | ✓ | - | 96.21 / 60.58 / 31.02 | 81.91 / 38.21 / 64.75
MSP [ICLR17] [17] | ✗ | ✗ | 80.3 | 86.99 / 6.02 / 45.63 | 73.76 / 20.59 / 68.44
Entropy [ICLR17] [17] | ✗ | ✗ | 80.3 | 88.32 / 13.91 / 44.85 | 75.12 / 22.38 / 68.15
SML [ICCV21] [18] | ✗ | ✗ | 80.3 | 96.88 / 36.55 / 14.53 | 81.96 / 25.82 / 49.74
Mahalanobis∗ [NeurIPS18] [19] | ✗ | ✗ | 80.3 | 92.51 / 27.83 / 30.17 | 76.73 / 22.85 / 59.20
GMMSeg-DeepLabV3+∗ | ✗ | ✗ | 81.1 | 97.34 / 43.47 / 13.11 | 84.71 / 34.42 / 47.90
GMMSeg-FCN∗ | ✗ | ✗ | 76.7 | 96.28 / 32.94 / 16.07 | 78.99 / 24.51 / 56.95
GMMSeg-SegFormer∗ | ✗ | ✗ | 82.6 | 97.83 / 50.03 / 12.55 | 89.37 / 57.65 / 44.34
∗: confidence derived with generative formulation
Figure 4: Qualitative results (§4.2) of anomaly heatmaps on Fishyscapes Lost&Found [56] val.
Experiment Protocol. As in [17, 18, 125], we adopt the ResNet101-DeepLabV3+ architecture. For
completeness, we also report the results of our GMMSeg based on ResNet101-FCN and MiTB5-SegFormer.
All our models are the same ones in Table 1, i.e., trained on Cityscapes train only. As GMMSeg esti-
mates class densities p(x|c), it can naturally reject unlikely inputs (cf. §3.3), i.e., directly thresholding
−maxc p(x|c) for computing the anomaly segmentation metrics, without any post-processing.
Quantitative Results. As shown in Table 2, based on DeepLabV3+ architecture, GMMSeg outper-
forms all the competitors under the same setting, i.e., neither using external out-of-distribution data nor
extra resynthesis module. Note that, [17–19] rely on pre-trained discriminative segmentation models
and thus have to resort to post-calibration. However, GMMSeg directly derives meaningful confidence
scores from the likelihood p(x|c). Mahalanobis [19] also models data density, yet merely on a pre-trained
feature space with a single Gaussian per class. In contrast, GMMSeg performs much better, proving
the superiority of mixture modeling and hybrid training. Even with a weaker architecture, i.e., FCN,
GMMSeg still performs robustly. When adopting SegFormer, better performance is achieved.
Qualitative Results. In Fig. 4, we visualize the anomaly score heatmaps generated by MSP [17]-
DeepLabV3+ [47] and GMMSeg-DeepLabV3+ . The softmax based counterpart ignores the anomalies
with overconfident predictions; in contrast, GMMSeg naturally rejects them (red colored regions).
4.3 Diagnostic Experiment
Standard EM vs. Sinkhorn EM. In our GMMSeg, we leverage the entropic OT based Sinkhorn EM [28]
(cf. Eq. 10) instead of the classic one (cf. Eq. 8) for the generative optimization of the GMM. In Table 5a,
we investigate the impacts of these two different EM algorithms and show that Sinkhorn EM is more
favorable. More specifically, during the E-step, rather than the vanilla EM assigning data samples to
Gaussian components independently, Sinkhorn EM restricts the assignment with an equipartition
constraint. As pointed out in [28], incorporating such prior information about the mixing weights of GMM
components leads to higher curvature around the global optimum. Our empirical results confirm this
theoretical finding.

Table 5: Ablative studies (§4.3) on ADE20K [53] val. The adopted settings are marked in red.
(a) EM optimization:
EM algorithm | # Loop | mIoU (%)
vanilla EM | 1 | 42.7
vanilla EM | 10 | 44.8
Sinkhorn EM | 1 | 46.0
Sinkhorn EM | 5 | 46.0
Sinkhorn EM | 10 | 46.0
(b) # Component per class:
# Component | mIoU (%)
M = 1 | 44.2
M = 3 | 45.3
M = 5 | 46.0
M = 10 | 46.0
M = 15 | 45.7
Number of EM Loops per Training Iteration. The EM algorithm alternates between the E-step and M-step
for maximum-likelihood inference (cf. Eq. 6). In GMMSeg, in order to blend EM with stochastic
gradient descent, we adopt an online version of (Sinkhorn) EM based on momentum update. In
Table 5a, we also study the influence of looping EM different times per training iteration. We can find
that one loop per iteration is enough to catch the drift of the gradually updated feature space.
Number of Gaussian Components per Class. In GMMSeg, data distribution of each class is modeled
by a mixture of M Gaussian components (cf. Eq. 4). Table 5b shows the results with different values of
M . When M = 1, each class corresponds to a single Gaussian, which is directly estimated via
Gaussian Discriminant Analysis, without EM. This baseline achieves 44.2% mIoU. After adopting
the mixture model, i.e., M : 1 → 3 → 5, the performance is greatly improved, i.e., mIoU: 44.2%→
45.3% → 46.0%. This verifies our hypothesis of class multimodality. Yet, further increasing the component
number (i.e., M: 5 → 15) only brings marginal or even negative gains, due to overparameterization.
Confidence Calibration. We further study the model calibration of GMMSeg and the discriminative
counterpart, i.e., DeepLabV3+ [47] with the softmax classifier.
5 Conclusion
We presented GMMSeg, the first generative neural framework for semantic segmentation. By explicitly
modeling data distribution as GMMs, GMMSeg shows promise to solve the intrinsic limitations of
current softmax based discriminative regime. It successfully optimizes generative GMM with end-
to-end discriminative representation learning in a compact and collaborative manner. This makes
GMMSeg principled and well applicable in both closed-set and open-world settings. We believe this
work provides fundamental insights and can benefit a broad range of application tasks. As a part of
our future work, we will explore our algorithm in image classification and trustworthy AI related tasks.
References
[1] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmen-
tation. In CVPR, 2015. 1, 3, 8, 16
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.
IEEE TPAMI, 2017. 1, 3
[3] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing
network. In CVPR, 2017. 1, 3, 8
[4] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018. 1, 3
[5] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention
network for scene segmentation. In CVPR, 2019. 1, 3
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 1
[7] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer:
Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021. 1, 2, 3, 7, 8,
16, 17, 18, 19, 20
[8] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic
segmentation. In ICCV, 2021. 1, 3, 8
[9] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng
Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence
perspective with transformers. In CVPR, 2021. 1, 3, 8
[10] Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer:
High-resolution transformer for dense prediction. In NeurIPS, 2021. 1, 3
[11] JM Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, and M West. Generative or
discriminative? getting the best of both worlds. Bayesian statistics, 8(3):3–24, 2007. 1, 4, 5
[12] Hideaki Hayashi and Seiichi Uchida. A discriminative gaussian mixture model with sparsity. In ICLR,
2021. 1, 2, 3, 4
[13] Tianfei Zhou, Wenguan Wang, Ender Konukoglu, and Luc Van Gool. Rethinking semantic segmentation:
A prototype view. In CVPR, 2022. 1, 3, 4
[14] Lynton Ardizzone, Radek Mackowiak, Carsten Rother, and Ullrich Köthe. Training normalizing flows
with the information bottleneck for competitive generative classification. In NeurIPS, 2020. 2, 3
[15] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks.
In ICML, 2017. 2, 10
[16] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon,
Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating
predictive uncertainty under dataset shift. NeurIPS, 2019. 2
[17] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples
in neural networks. In ICLR, 2017. 2, 3, 9, 17, 21
[18] Sanghun Jung, Jungsoo Lee, Daehoon Gwak, Sungha Choi, and Jaegul Choo. Standardized max logits: A
simple yet effective approach for identifying unexpected road obstacles in urban-scene segmentation. In
ICCV, 2021. 2, 3, 8, 9, 17
[19] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting
out-of-distribution samples and adversarial attacks. NeurIPS, 2018. 2, 9
[20] Bradley Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of
the American Statistical Association, 70(352):892–898, 1975. 2
[21] Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic
regression and naive bayes. In NeurIPS, 2001. 2, 3
[22] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Hybrid
models with deep and invertible features. In ICML, 2019. 2, 3
[23] Pavel Izmailov, Polina Kirichenko, Marc Finzi, and Andrew Gordon Wilson. Semi-supervised learning
with normalizing flows. In ICML, 2020. 2, 3
[24] Radek Mackowiak, Lynton Ardizzone, Ullrich Kothe, and Carsten Rother. Generative classifiers as a
basis for trustworthy image classification. In CVPR, 2021. 2, 3
[25] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust
neural network model on mnist. In NeurIPS, 2018. 2, 3
[26] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Lever-
aging generative models to understand and defend against adversarial examples. In ICLR, 2018. 2,
3
[27] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F Núñez, and Jordi Luque. Input
complexity and out-of-distribution detection with likelihood-based generative models. In ICLR, 2019. 2,
3
[28] Gonzalo Mena, Amin Nejatbakhsh, Erdem Varol, and Jonathan Niles-Weed. Sinkhorn em: an expectation-
maximization algorithm based on entropic optimal transport. arXiv preprint arXiv:2006.16548, 2020. 2,
6, 10
[29] Ehsan Variani, Erik McDermott, and Georg Heigold. A gaussian mixture model layer jointly optimized
with discriminative features within a deep neural network architecture. In ICASSP, 2015. 2, 3
[30] Zoltán Tüske, Muhammad Ali Tahir, Ralf Schlüter, and Hermann Ney. Integrating gaussian mixtures into
deep neural networks: Softmax layer with hidden variables. In ICASSP, 2015. 2, 3
[31] Aldebaro Klautau, Nikola Jevtic, and Alon Orlitsky. Discriminative gaussian mixture models: A
comparison with kernel classifiers. In ICML, 2003. 2
[32] Zhihao Zheng and Pengyu Hong. Robust detection of adversarial attacks by modeling the intrinsic
properties of deep neural networks. In NeurIPS, 2018. 2
[33] Kimin Lee, Sukmin Yun, Kibok Lee, Honglak Lee, Bo Li, and Jinwoo Shin. Robust inference via
generative classifiers for handling noisy labels. In ICML, 2019. 2
[34] Giancarlo Di Biase, Hermann Blum, Roland Siegwart, and Cesar Cadena. Pixel-wise anomaly detection
in complex driving scenes. In CVPR, 2021. 2, 3, 9, 17
[35] Yingda Xia, Yi Zhang, Fengze Liu, Wei Shen, and Alan L Yuille. Synthesize then compare: Detecting
failures and anomalies for semantic segmentation. In ECCV, 2020. 2, 3, 8, 9
[36] Krzysztof Lis, Krishna Nakka, Pascal Fua, and Mathieu Salzmann. Detecting the unexpected via image
resynthesis. In ICCV, 2019. 2, 3, 8, 9, 17
[37] Tomas Vojir, Tomáš Šipka, Rahaf Aljundi, Nikolay Chumerin, Daniel Olmeda Reino, and Jiri Matas.
Road anomaly detection by partial image reconstruction with segmentation coupling. In ICCV, 2021. 2, 3
[38] Petra Bevandić, Ivan Krešo, Marin Oršić, and Siniša Šegvić. Simultaneous semantic segmentation and
outlier detection in presence of domain shift. In GCPR, 2019. 2, 3, 17
[39] Robin Chan, Matthias Rottmann, and Hanno Gottschalk. Entropy maximization and meta classification
for out-of-distribution detection in semantic segmentation. In ICCV, 2021. 2, 3
[40] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure.
arXiv preprint arXiv:1812.04606, 2018. 2, 3
[41] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image
detection in neural networks. arXiv preprint arXiv:1706.02690, 2017. 2, 3
[42] KIMIN LEE, Kibok Lee, Honglak Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for
detecting out-of-distribution samples. In ICLR, 2018. 2, 3
[43] Matthias Rottmann, Pascal Colling, Thomas Paul Hack, Robin Chan, Fabian Hüger, Peter Schlicht,
and Hanno Gottschalk. Prediction error meta classification in semantic segmentation: Detection via
aggregated dispersion measures of softmax probabilities. In IJCNN, 2020. 2, 3
[44] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer
vision? In Advances in neural information processing systems, 2017. 2, 3
[45] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive
uncertainty estimation using deep ensembles. In NeurIPS, 2017. 2, 3
[46] Jishnu Mukhoti and Yarin Gal. Evaluating bayesian deep learning methods for semantic segmentation.
arXiv preprint arXiv:1811.12709, 2018. 2, 3, 17
[47] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-
decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 2, 3, 7, 8, 9,
16, 17, 21
[48] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmenta-
tion. In ECCV, 2020. 2, 3, 7, 8, 16
[49] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene
understanding. In ECCV, 2018. 2, 7, 8, 10
[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In CVPR, 2016. 2, 7, 9, 16
[51] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu,
Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition.
IEEE TPAMI, 2020. 2, 7
[52] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 2, 7, 16
[53] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing
through ade20k dataset. In CVPR, 2017. 2, 7, 8, 9, 10, 16, 17, 18
[54] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Be-
nenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene
understanding. In CVPR, 2016. 2, 7, 8, 16, 17, 19
[55] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In
CVPR, 2018. 2, 7, 8, 16, 17, 20
[56] Hermann Blum, Paul-Edouard Sarlin, Juan Nieto, Roland Siegwart, and Cesar Cadena. The fishyscapes
benchmark: Measuring blind spots in semantic segmentation. IJCV, 2021. 2, 3, 4, 8, 9, 16, 17, 21
[57] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable
convolutional networks. In CVPR, 2017. 3
[58] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation
in street scenes. In CVPR, 2018. 3
[59] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. 3
[60] Guolei Sun, Wenguan Wang, Jifeng Dai, and Luc Van Gool. Mining cross-image semantics for weakly
supervised semantic segmentation. In ECCV, 2020. 3
[61] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In MICCAI, 2015. 3
[62] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du,
Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In ICCV,
2015. 3
[63] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder
architecture for image segmentation. IEEE TPAMI, 39(12):2481–2495, 2017. 3
[64] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning markov random
field for semantic segmentation. IEEE TPAMI, 40(8):1814–1828, 2017. 3
[65] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks
for high-resolution semantic segmentation. In CVPR, 2017. 3
[66] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. Espnet:
Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 2018. 3
[67] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit
Agrawal. Context encoding for semantic segmentation. In CVPR, 2018. 3
[68] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for
semantic segmentation. In CVPR, 2019. 3
[69] Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, and Junjie Yan. Class-wise dynamic graph
convolution for semantic segmentation. In ECCV, 2020. 3
[70] Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, and Nong Sang. Context prior for
scene segmentation. In CVPR, 2020. 3
[71] Mingyuan Liu, Dan Schonfeld, and Wei Tang. Exploit visual dependency relations for semantic segmen-
tation. In CVPR, 2021. 3
[72] Chi-Wei Hsiao, Cheng Sun, Hwann-Tzong Chen, and Min Sun. Specialize and fuse: Pyramidal output
representation for semantic segmentation. In ICCV, 2021. 3
[73] Zhenchao Jin, Bin Liu, Qi Chu, and Nenghai Yu. Isnet: Integrate image-level and semantic-level context
for semantic segmentation. In ICCV, 2021. 3
[74] Zhenchao Jin, Tao Gong, Dongdong Yu, Qi Chu, Jian Wang, Changhu Wang, and Jie Shao. Mining
contextual information beyond image for semantic segmentation. In ICCV, 2021. 3
[75] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring
cross-image pixel contrast for semantic segmentation. In ICCV, 2021. 3
[76] Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. Vspw: A large-scale dataset
for video scene parsing in the wild. In CVPR, 2021. 3
[77] Zongxin Yang, Yunchao Wei, and Yi Yang. Collaborative video object segmentation by multi-scale
foreground-background integration. IEEE TPAMI, 2021. 3
[78] Adam W Harley, Konstantinos G Derpanis, and Iasonas Kokkinos. Segmentation-aware convolutional
networks using local attention masks. In ICCV, 2017. 3
[79] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR,
2018. 3
[80] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic
segmentation. arXiv preprint arXiv:1805.10180, 2018. 3
[81] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet:
Point-wise spatial attention network for scene parsing. In ECCV, 2018. 3
[82] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In
ICCV, 2019. 3
[83] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-
maximization attention networks for semantic segmentation. In ICCV, 2019. 3
[84] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet:
Criss-cross attention for semantic segmentation. In ICCV, 2019. 3
[85] Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, and Yi Yang. Deep hierarchical semantic segmenta-
tion. In CVPR, pages 1246–1257, 2022. 3
[86] Wenguan Wang, Hailong Zhu, Jifeng Dai, Yanwei Pang, Jianbing Shen, and Ling Shao. Hierarchical
human parsing with typed part-relation reasoning. In CVPR, 2020. 3
[87] Wenguan Wang, Zhijie Zhang, Siyuan Qi, Jianbing Shen, Yanwei Pang, and Ling Shao. Learning
compositional neural information fusion for human parsing. In ICCV, 2019. 3
[88] Zongxin Yang, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object
segmentation. NeurIPS, 2021. 3
[89] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need
for semantic segmentation. In NeurIPS, 2021. 3, 8
[90] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-
attention mask transformer for universal image segmentation. CVPR, 2022. 3, 16
[91] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional
neural networks. In ICML, 2016. 3
[92] Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data artificial
intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic
Engineering, 22(12):1551–1558, 2021. 3
[93] Wenguan Wang, Cheng Han, Tianfei Zhou, and Dongfang Liu. Visual recognition with deep nearest
centroids. arXiv preprint arXiv:2209.07383, 2022. 3
[94] Jyh-Jing Hwang, Stella X Yu, Jianbo Shi, Maxwell D Collins, Tien-Ju Yang, Xiao Zhang, and Liang-Chieh
Chen. Segsort: Segmentation by discriminative sorting of segments. In ICCV, 2019. 3
[95] Arindam Banerjee, Inderjit S Dhillon, Joydeep Ghosh, Suvrit Sra, and Greg Ridgeway. Clustering on the
unit hypersphere using von mises-fisher distributions. Journal of Machine Learning Research, 6(9), 2005.
3
[96] Rajat Raina, Yirong Shen, Andrew Mccallum, and Andrew Ng. Classification with hybrid generative/dis-
criminative models. NeurIPS, 2003. 3
[97] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved
techniques for training gans. In NeurIPS, 2016. 3
[98] Ricky TQ Chen, Jens Behrmann, David K Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for
invertible generative modeling. In NeurIPS, 2019. 3
[99] Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and
Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In
ICLR, 2019. 3
[100] Yingzhen Li, John Bradshaw, and Yash Sharma. Are generative classifiers more robust to adversarial
attacks? In ICML, 2019. 3
[101] Xinshuai Dong, Hong Liu, Rongrong Ji, Liujuan Cao, Qixiang Ye, Jianzhuang Liu, and Qi Tian. Api-net:
Robust generative classifier via a single discriminator. In ECCV, 2020. 3
[102] Ethan Fetaya, Jörn-Henrik Jacobsen, Will Grathwohl, and Richard Zemel. Understanding the limitations
of conditional generative models. In ICLR, 2020. 3
[103] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep
generative models know what they don’t know? In ICLR, 2019. 3
[104] Florian Wenzel, Théo Galy-Fajou, Christian Donner, Marius Kloft, and Manfred Opper. Efficient Gaussian
process classification using Pólya-gamma data augmentation. In AAAI, 2019. 3
[105] Tommi Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In
NeurIPS, 1998. 3
[106] Julia A Lasserre, Christopher M Bishop, and Thomas P Minka. Principled hybrids of generative and
discriminative models. In CVPR, 2006. 3
[107] Hyunsun Choi, Eric Jang, and Alexander A Alemi. Waic, but why? generative ensembles for robust
anomaly detection. arXiv preprint arXiv:1810.01392, 2018. 3, 4
[108] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji
Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In NeurIPS, 2019. 3, 4
[109] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning
with deep generative models. In NeurIPS, 2014. 4
[110] Guillaume Bouchard and Bill Triggs. The tradeoff between generative and discriminative classifiers. In
IASC International Symposium on Computational Statistics, 2004. 4
[111] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image
classification: Generalizing to new classes at near-zero cost. IEEE TPAMI, 35(11):2624–2637, 2013. 4
[112] Kateryna Chumachenko, Alexandros Iosifidis, and Moncef Gabbouj. Feedforward neural networks
initialization based on discriminant learning. Neural Networks, 146:220–229, 2022. 4
[113] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification
uncertainty. In NeurIPS, 2020. 4
[114] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via
the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
5
[115] Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse,
and other variants. In Learning in graphical models, pages 355–368. 1998. 5
[116] Naonori Ueda and Ryohei Nakano. Deterministic annealing variant of the em algorithm. NeurIPS, 7,
1994. 6
[117] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering
and representation learning. In ICLR, 2020. 6
[118] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013. 6
[119] Steven Nowlan. Maximum likelihood competitive learning. NeurIPS, 1989. 7
[120] Nanda Kambhatla and Todd Leen. Classifying with gaussian mixtures and clusters. NeurIPS, 1994. 7
[121] Xuefeng Du, Xin Wang, Gabriel Gozum, and Yixuan Li. Unknown-aware object detection: Learning
what you don’t know from videos in the wild. In CVPR, 2022. 7
[122] MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and
benchmark. https://github.com/open-mmlab/mmsegmentation, 2020. 7
[123] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet
large scale visual recognition challenge. IJCV, 115(3):211–252, 2015. 7, 16
[124] Peter Pinggera, Sebastian Ramos, Stefan Gehrig, Uwe Franke, Carsten Rother, and Rudolf Mester. Lost
and found: detecting small road hazards for self-driving vehicles. In IROS, 2016. 8
[125] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection.
NeurIPS, 2020. 9
[126] Mark Everingham, SM Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew
Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. 16
[127] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural
networks. arXiv preprint arXiv:1802.04865, 2018. 17
[128] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. NeurIPS, 2018. 17
[129] Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn
Song. Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132,
2019. 17
SUMMARY OF THE APPENDIX
In this appendix, we provide the following items that shed deeper insight into our contributions:
• §A: Detailed training parameters.
• §B: More experimental results.
• §C: More qualitative visualization.
We evaluate our GMMSeg on six base segmentation architectures. Four of them, i.e., DeepLabv3+ [47],
OCRNet [48], Swin-UperNet [52], and SegFormer [7], are presented in our main paper; the two
additional base architectures, i.e., FCN [1] and Mask2Former [90], are evaluated in this supplemental
material (cf. §B). We follow the default training settings of the official Mask2Former codebase for
Mask2Former, and of MMSegmentation [122] for the other base architectures. In particular, we train
FCN, DeepLabv3+ and OCRNet using SGD optimizer with initial learning rate 0.1, weight decay
4e-4 with polynomial learning rate annealing; we train Swin-UperNet and SegFormer using AdamW
optimizer with initial learning rate 6e-5, weight decay 1e-2 with polynomial learning rate annealing;
we train Mask2Former using AdamW optimizer with initial learning rate 1e-4, weight decay 5e-2
and the learning rate is decayed by a factor of 10 at 0.9 and 0.95 fractions of the total training steps.
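For concreteness, a minimal PyTorch sketch of these optimizer/schedule settings is given below. The function signature, the SGD momentum of 0.9, and the polynomial-decay power of 0.9 are illustrative assumptions (common MMSegmentation defaults); the actual experiments run through the MMSegmentation and Mask2Former training pipelines rather than this code.

```python
import torch

def build_optimizer(model, arch, total_iters):
    """Return (optimizer, per-iteration LR scheduler) mirroring the settings above."""
    def poly(it):
        # polynomial learning-rate annealing; the power 0.9 is an assumed default
        return (1 - it / total_iters) ** 0.9

    if arch in {"fcn", "deeplabv3plus", "ocrnet"}:
        opt = torch.optim.SGD(model.parameters(), lr=0.1,
                              momentum=0.9, weight_decay=4e-4)  # momentum is assumed
        sched = torch.optim.lr_scheduler.LambdaLR(opt, poly)
    elif arch in {"swin_upernet", "segformer"}:
        opt = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=1e-2)
        sched = torch.optim.lr_scheduler.LambdaLR(opt, poly)
    else:  # mask2former: learning rate decayed by 10x at 90% and 95% of training
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-2)
        sched = torch.optim.lr_scheduler.MultiStepLR(
            opt, milestones=[int(0.9 * total_iters), int(0.95 * total_iters)], gamma=0.1)
    return opt, sched
```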
More Base Segmentation Architectures. We first demonstrate the efficacy of our GMMSeg on two
additional base segmentation architectures, i.e., FCN [1] and Mask2Former [90], with quantitative re-
sults summarized in Table 6. We train FCN based models with the corresponding training hyperparameter settings mentioned in §A and strictly follow the same training and inference setups in our main manuscript (cf. §4.1). Furthermore, for a fair comparison with Mask2Former, the backbone, i.e., SwinLarge [52], is pre-trained with ImageNet22K [123].

Table 6: Additional quantitative results (§B) on ADE20K [53] val, Cityscapes [54] val, and COCO-Stuff [55] test in mean IoU.
Method | Backbone | ADE20K | Citys. | COCO.
FCN [CVPR15] [1] | ResNet101 | 39.9 | 75.5 | 32.6
GMMSeg | ResNet101 | 41.8 (↑1.9) | 76.7 (↑1.2) | 34.1 (↑1.5)
Mask2Former [CVPR22] [90] | SwinLarge | 56.1 | 83.3 | 51.0
GMMSeg | SwinLarge | 56.7 (↑0.6) | 83.8 (↑0.5) | 52.0 (↑1.0)
For ADE20K/COCO-Stuff/Cityscapes, we train Mask2Former based models using images cropped to 640×640/640×640/1024×1024, for 160K/80K/90K iterations with a batch size of 16. We adopt sliding-window inference on Cityscapes with a window size of 1024×1024; on ADE20K and COCO-Stuff, we keep the aspect ratio of test images and rescale the short side to 640.
Here FCN is a classic fully convolutional model in line with the per-pixel dense classification models discussed in the main paper (cf. §3.1). Of particular interest is Mask2Former, a recent attention-based model that formulates the task as mask classification, learning mask-level rather than pixel-level representations. It nevertheless still relies on a discriminative softmax-based classifier for mask classification. We adapt Mask2Former by replacing its softmax classification module with our generative GMM classifier.
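To illustrate the kind of drop-in replacement involved, the sketch below shows a GMM-based classification head that scores features by per-class mixture log-densities instead of a linear-softmax projection. The diagonal covariances, the buffer names (`means`, `log_vars`, `log_pi`), and the random initialization are illustrative assumptions; in GMMSeg the mixture parameters are estimated with EM rather than learned by gradient descent, and the released implementation may differ in detail.

```python
import math
import torch
import torch.nn as nn

class GMMClassifier(nn.Module):
    """Scores pixel/mask features with class-conditional Gaussian mixtures."""
    def __init__(self, num_classes: int, num_components: int, dim: int):
        super().__init__()
        # Mixture parameters are kept as buffers: updated by EM, not by SGD.
        self.register_buffer("means", torch.randn(num_classes, num_components, dim))
        self.register_buffer("log_vars", torch.zeros(num_classes, num_components, dim))
        self.register_buffer("log_pi",
                             torch.full((num_classes, num_components),
                                        -math.log(num_components)))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) embeddings; returns (N, C) class-conditional log-densities.
        x = feats[:, None, None, :]                          # (N, 1, 1, dim)
        diff = x - self.means                                # (N, C, K, dim)
        log_norm = -0.5 * (diff.pow(2) / self.log_vars.exp()
                           + self.log_vars
                           + math.log(2 * math.pi)).sum(-1)  # (N, C, K)
        # log p(x | c) = logsumexp_k [ log pi_{c,k} + log N(x; mu_{c,k}, sigma_{c,k}) ]
        return torch.logsumexp(self.log_pi + log_norm, dim=-1)
```

A discriminative cross-entropy loss can then be applied on top of these scores to train the feature extractor, while the mixture parameters themselves are refreshed by EM.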
As Table 6 shows, GMMSeg consistently boosts performance across both segmentation formulations, i.e., pixel classification and mask classification, confirming the benefit of shifting from a discriminative softmax classifier to a generative GMM classifier. Notably, with Mask2Former-SwinLarge as the base architecture, GMMSeg achieves mIoU scores of 56.7%/83.8%/52.0%, setting new state-of-the-art results on ADE20K/Cityscapes/COCO-Stuff.
Anomaly Segmentation Results on Fishyscapes Lost&Found test and Static test. We additionally report the anomaly segmentation performance of our Cityscapes [54]-trained GMMSeg, built upon DeepLabV3+ [47]-ResNet101 [50], on Fishyscapes [56] Lost&Found test and Static test. Fishyscapes Static is a blending-based dataset that composites anomalous objects from Pascal VOC [126] onto Cityscapes backgrounds; it contains 30/1,000 images in its val/test splits. The test splits of both Fishyscapes Lost&Found and Static are privately held by the Fishyscapes organizers, so the anomalies are entirely unknown to the evaluated methods. The results are summarized in Table 7; they are also publicly available, in anonymized form, on the official leaderboard3.
Table 7: Quantitative results (§B) on Fishyscapes (FS) Lost&Found test and Static test.
Method | Re-training | Extra Network | OoD Data | FS Lost&Found AP↑ | FS Lost&Found FPR95↓ | FS Static AP↑ | FS Static FPR95↓
Density - Single-layer NLL [56] | ✗ | ✓ | ✗ | 3.01 | 32.9 | 40.86 | 21.29
Density - Minimum NLL [56] | ✗ | ✓ | ✗ | 4.25 | 47.15 | 62.14 | 17.43
Density - Logistic Regression [56] | ✗ | ✓ | ✓ | 4.65 | 24.36 | 57.16 | 13.39
Image Resynthesis [36] | ✗ | ✓ | ✗ | 5.70 | 48.05 | 29.6 | 27.13
Bayesian Deeplab [46] | ✓ | ✗ | ✗ | 9.81 | 38.46 | 48.70 | 15.05
OoD Training - Void Class [127] | ✓ | ✗ | ✓ | 10.29 | 22.11 | 45.00 | 19.40
Discriminative Outlier Detection Head [38] | ✓ | ✓ | ✓ | 31.31 | 19.02 | 96.76 | 0.29
Dirichlet Deeplab [128] | ✓ | ✗ | ✓ | 34.28 | 47.43 | 31.30 | 84.60
SynBoost [34] | ✗ | ✓ | ✓ | 43.22 | 15.79 | 72.59 | 18.75
MSP [129] | ✗ | ✗ | ✗ | 1.77 | 44.85 | 12.88 | 39.83
Entropy [17] | ✗ | ✗ | ✗ | 2.93 | 44.83 | 15.41 | 39.75
kNN Embedding - density [56] | ✗ | ✗ | ✗ | 3.55 | 30.02 | 44.03 | 20.25
SML [18] | ✗ | ✗ | ✗ | 31.05 | 21.52 | 53.11 | 19.64
GMMSeg-DeepLabV3+ | ✗ | ✗ | ✗ | 55.63 | 6.61 | 76.02 | 15.96
Following [18, 56], we categorize the methods by whether they require retraining, use extra segmentation networks, or utilize OoD data.
As seen, without any add-on post-calibration technique, GMMSeg surpasses the state-of-the-art methods by even larger margins on the challenging test sets than on the val sets, i.e., by 24.58%/22.91% in AP and 14.91%/3.68% in FPR95 on Fishyscapes Lost&Found/Static test. Notably, on Fishyscapes Lost&Found test, GMMSeg even outperforms all benchmarked methods that employ additional networks or OoD training data, verifying its strong robustness to unexpected on-road anomalies, which we attribute to its accurate modeling of the pixel data density.
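For reference, a simple way to turn class-conditional densities into anomaly scores is to flag pixels that are unlikely under every known class. The rule below (negative maximum log-density) is a plausible sketch consistent with the generative formulation, not necessarily the exact scoring used in the released code.

```python
import torch

def anomaly_score(log_px_c: torch.Tensor) -> torch.Tensor:
    # log_px_c: (H*W, C) per-pixel class-conditional log-densities log p(x | c).
    # A pixel is anomalous if no known class explains it well.
    return -log_px_c.max(dim=-1).values
```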
Impact of Memory Capacity. In Table 8, we further explore the influence of the memory capacity, i.e., the amount of pixel representations stored for class-wise EM estimation, with DeepLabV3+-ResNet101 trained on ADE20K for 80K iterations and evaluated on val. For the first row, where the memory size is set to 0, EM is performed only within mini-batches. Not surprisingly, the data distribution estimated at such a local scale is far from accurate, leading to inferior results. The performance improves as the memory capacity grows and saturates once the stored pixel samples are sufficient to represent the true data distribution of the whole training set.

Table 8: Impact of memory size, evaluated on ADE20K [53] val.
# Sample | mIoU (%)
0 | 40.3
8K | 45.1
16K | 45.4
32K | 46.0
48K | 46.0
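The per-class memory ablated in Table 8 can be implemented as a simple FIFO queue of pixel embeddings. The sketch below (class-balanced capacity, most-recent-samples retention) reflects our reading of the ablation rather than the exact released implementation.

```python
import torch

class ClasswiseMemory:
    """Per-class FIFO bank of pixel embeddings used for EM-based GMM estimation."""
    def __init__(self, num_classes: int, capacity: int, dim: int):
        self.capacity = capacity
        self.banks = [torch.empty(0, dim) for _ in range(num_classes)]

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # feats: (N, dim) pixel embeddings; labels: (N,) ground-truth class indices.
        for c in labels.unique().tolist():
            bank = torch.cat([self.banks[c], feats[labels == c]], dim=0)
            # keep only the most recent `capacity` samples; capacity 0 stores nothing
            self.banks[c] = bank[-self.capacity:] if self.capacity > 0 else bank[:0]

    def samples(self, c: int) -> torch.Tensor:
        return self.banks[c]  # fed to the EM step for class c
```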
3 https://fishyscapes.com/results
Figure 6: Qualitative results (§C) of SegFormer [7] and our GMMSeg on ADE20K [53].
Figure 7: Qualitative results (§C) of SegFormer [7] and our GMMSeg on Cityscapes [54].
Figure 8: Qualitative results (§C) of SegFormer [7] and our GMMSeg on COCO-Stuff [55].
Figure 9: Qualitative results (§C) of anomaly heatmaps on Fishyscapes Lost&Found [56] val (columns: input image, MSP [17] with DeepLabV3+ [47], and GMMSeg with DeepLabV3+).