GMMSeg: Gaussian Mixture based Generative Semantic Segmentation Models
Abstract
Prevalent semantic segmentation solutions are, in essence, a dense discriminative
classifier of p(class |pixel feature). Though straightforward, this de facto paradigm
neglects the underlying data distribution p(pixel feature |class), and struggles to
identify out-of-distribution data. Going beyond this, we propose GMMSeg, a new
family of segmentation models that rely on a dense generative classifier for the
joint distribution p(pixel feature, class). For each class, GMMSeg builds Gaussian
Mixture Models (GMMs) via Expectation-Maximization (EM), so as to capture
class-conditional densities. Meanwhile, the deep dense representation is end-to-end
trained in a discriminative manner, i.e., maximizing p(class |pixel feature). This
endows GMMSeg with the strengths of both generative and discriminative models.
With a variety of segmentation architectures and backbones, GMMSeg outperforms
the discriminative counterparts on three closed-set datasets. More impressively,
without any modification, GMMSeg even performs well on open-world datasets.
We believe this work brings fundamental insights into the related fields.
1 Introduction
Semantic segmentation aims to explain visual semantics at the pixel level. It is typically considered as
a problem of pixel-wise classification, i.e., assigning a class label c ∈ {1,· · ·, C} to each pixel data x.
Under this regime, deep-neural solutions are naturally built as a combination of two parts (Fig. 1(a)):
an encoder-decoder, dense feature extractor that maps x to a high-dimensional feature representation
x, and a dense classifier that conducts C-way classification given input pixel feature x. Starting from
the first end-to-end segmentation solution – fully convolutional networks (FCN) [1], researchers leave
the classifier as parametric softmax, and devote themselves to improving the dense feature extractor for
learning better representation. As a result, a huge number of FCN-based solutions [2–5] emerged and
their state-of-the-art was further pushed forward by recent Transformer [6]-style algorithms [7–10].
From a probabilistic perspective, the softmax classifier, supervised by the cross-entropy loss together
with the feature extractor, directly models the class probability given an input, i.e., posterior p(c|x).
This is known as a discriminative classifier, as the conditional probability distribution discriminates
directly between the different values of c [11]. As discriminative classifiers directly find the classifica-
tion rule with the smallest error rate, they often give excellent performance in downstream tasks, and
hence become the de facto paradigm in segmentation. Yet, due to the discriminative nature, softmax-
based segmentation models suffer from several limitations: First, they only learn the decision boundary
between classes, without modeling the underlying data distribution [11]. Second, as only one weight
vector is learned per class, they assume unimodality for each class [12, 13], bearing no within-class
variation. Third, they learn a prediction space where the model accuracy deteriorates rapidly away
∗ Equal contributions. † Work partly done during an internship at Baidu Research.
Figure 1: (a) Existing softmax based discriminative regime only learns decision boundaries on the
pixel embedding space. (b) Our GMMSeg models pixel feature densities via generative GMMs.
from the decision boundaries [14] and thus yield poorly calibrated predictions [15], struggling to
recognize out-of-distribution data [16]. The first two limitations may hinder the expressive power of
segmentation models, and the last one challenges the adoption of segmentation models in decision-
critical tasks (e.g., autonomous driving) and motivates the development of anomaly segmentation
methods [17–19] (which, however, rely on pre-trained discriminative segmentation models).
As an alternative to discriminative classifiers, generative classifiers first find the joint probability
p(x, c), and use p(x, c) to evaluate the class-conditional densities p(x|c). Then classification is con-
ducted using Bayes rule. Numerous theoretical and empirical comparisons [20, 21] between these two
approaches have been initiated even before the deep learning revolution. They reach the agreement that
generative classifiers have potential to overcome shortcomings of their discriminative counterparts, as
they are able to model the input data itself. This stimulates the recent investigation of generative (and
discriminative-generative hybrid [22, 23]) classifiers in trustworthy AI [24–27] and semi-supervised
learning [22, 23], while the discriminative classifiers are still dominant in most downstream tasks.
In light of this background, we propose a GMM based segmentation framework – GMMSeg – that
addresses the limitations of current discriminative solutions from a generative perspective (Fig. 1(b)).
Our work not only represents a novel effort to advocate generative classifiers for end-to-end segmen-
tation, but also evidences the merits of generative approaches in a challenging, dense classification
task setting. In particular, we adopt a separate mixture of Gaussians for modeling the data distribution
of each class in the feature space, i.e., class-conditional feature densities p(x|c). During training,
the GMM classifier is optimized online by a momentum version of (Sinkhorn) EM [28] at large scale,
so as to ensure its generative nature and synchronization with the evolving feature space. Mean-
while, the feature extractor is end-to-end trained with the discriminative (cross-entropy) loss, i.e.,
maximizing the conditional likelihood p(c|x) derived with the generative GMM, so as to enable
expressive representation learning. In this way, GMMSeg smartly learns generative classification with
end-to-end discriminative representation in a compact and collaborative manner, exploiting the benefit
of both generative and discriminative approaches. This also greatly distinguishes GMMSeg from
most existing GMM based neural classifiers, which are either discriminatively trained [12, 29–31] or
trivially estimate a GMM in the feature space of a pre-trained discriminative classifier [19, 32, 33].
GMMSeg has several appealing facets: First, with the hybrid training strategy – online EM based
classifier optimization and end-to-end discriminative representation learning, GMMSeg can precisely
approximate the data distribution over a robust feature space. Second, the mixture components make
GMMSeg a structured model that well adapts to multimodal data densities. Third, the distribution-
preserving property allows GMMSeg to naturally reject abnormal inputs, without architectural
change (like [34–37]), re-training (like [38–40]), or post-calibration (like [17, 18, 41–46]). Fourth,
GMMSeg is a principled framework, fully compatible with modern segmentation network architectures.
For thorough examination, in §4.1, we approach GMMSeg on several representative segmentation
architectures (i.e., DeepLabV3+[47], OCRNet [48], UperNet [49], SegFormer [7]), with diverse back-
bones (i.e., ResNet [50], HRNet [51], Swin [52], MiT [7]). Experimental results demonstrate GMMSeg
even outperforms the softmax-based discriminative counterparts, e.g., 0.6% – 1.5%, 0.5% – 0.8%,
and 0.7% – 1.7% mIoU gains over ADE20K [53], Cityscapes [54], and COCO-Stuff [55], respectively.
Furthermore, in §4.2, we validate our approach on anomaly segmentation. Without any modification,
our Cityscapes-trained GMMSeg model is directly tested on Fishyscapes Lost&Found [56] and Road
Anomaly [36] datasets, and outperforms all hand-tailored discriminative competitors.
To our best knowledge, GMMSeg is the first semantic segmentation method that reports promising
results on both closed-set and open-world scenarios by using a single model instance. More notably,
our impressive results manifest the advantages of generative classifiers in a large-scale real-world
setting. We feel this work opens a new avenue for research in this field.
2 Related Work
Semantic Segmentation. Since the seminal work of FCN [1], deep-net segmentation solutions are
typically built in a dense classification fashion, i.e., learning dense representation and categorization
end-to-end. By directly adopting discriminative softmax for categorization, FCN-style solutions put
focus on learning expressive dense representation; they modify the FCN architecture from various
aspects, such as enlarging the receptive field [2,3,47,57–60], modeling multi-scale context [48,59,61–
77], investigating non-local operations [4, 5, 78–84], and exploring hierarchical information [85–87].
With a similar goal of sharpening representation, later Transformer-style solutions [7–9, 88] empower
attentive networks with, for instance, local contiguity [7] and multi-level feature aggregation [9,
10]. Two very recent attentive models [89, 90] formulate the task in an alternative form of mask
classification, however, still relying on discriminative softmax.
From the discussion above, we can find that current prevalent segmentation solutions are in essence
a pixel-wise, discriminative classifier, which only learns decision boundaries between classes in
the pixel feature space [14, 91], without modeling the underlying data distribution. In contrast,
our GMMSeg tackles the task from a generative viewpoint. GMMSeg deeply embeds generative
optimization of GMMs into end-to-end dense representation learning, so as to comprehensively
describe the class-aware knowledge [92] in a discriminative feature space. GMMSeg is partly inspired
by [13, 93], which also probe data structures via intra-class clustering. However, the dense classification
in these two works is achieved via non-parametric nearest-centroid retrieval – still a discriminative
model. In [94], though data density is estimated (as a mixture of vMF distributions [95]), it is only
used as a supervisory signal for dense embedding learning, and the final prediction is still made by a
discriminative classifier – k-NN. Our work represents the first step towards formulating (closed-set)
semantic segmentation within a generative neural classification framework.
Discriminative vs Generative Classifiers. Generative classifiers and discriminative classifiers rep-
resent two contrasting ways of solving classification tasks [24]. Basically, the generative classifiers
(such as Linear Discriminant Analysis and naive Bayes) learn the class densities p(x|c), while the
discriminative classifiers (such as softmax) learn the class boundaries p(c|x) without regard to the
underlying class densities. In practical classification tasks, the softmax discriminative classifier is used
exclusively [24], due to its simplicity and excellent discriminative performance. Nonetheless, genera-
tive classifiers are widely agreed to have several advantages over their discriminative counterparts [21,
96], e.g., accurately modeling the input distribution, and explicitly identifying unlikely inputs in a
natural way. Driven by this common belief, a surge of deep learning literature [14, 97–99] investi-
gated the potential (and the limitation) of generative classifiers in adversarial defense [25,26,100–102],
explainable AI [24], out-of-distribution detection [27, 103], and semi-supervised learning [22, 23, 97].
As GMMs can express (almost) arbitrary continuous distributions, they have been adopted in many neural
classifiers [23,104]. However, most of these GMM classifiers are discriminative models [12,14,29,30]
that are trained ‘discriminatively’(i.e., maximizing posteriors p(c|x)). In GMMSeg, the GMM is purely
optimized via EM (i.e., estimating class densities p(x|c)) while the deep representation is trained via
gradient backpropagation of the discriminative loss. Thus the whole GMMSeg is a hybrid of genera-
tive GMM and discriminative representation, getting the best of two worlds. Although bearing the
general idea of trading-off between generative and discriminative classifiers [21–23, 96, 105, 106],
none of the previous hybrid algorithms demonstrate their utility in challenging segmentation tasks.
Anomaly Segmentation. Anomaly segmentation strives to identify unknown object regions, typically
in road-driving scenarios [56]. Existing solutions can be generally categorized into three classes: i)
Uncertainty estimation based algorithms [17, 18, 41–46] usually approximate the uncertainty from
simple statistics of the classification probability or logits of pre-trained segmentation models [17, 18,
41–43], or adopt Bayesian neural networks with Monte-Carlo dropout to capture pixel uncertainty [44–
46]. ii) Outlier exposure based algorithms make use of auxiliary datasets as training samples of
unexpected objects [38–40]. Therefore, this type of algorithm requires re-training the segmentation
network, resulting in performance degradation. iii) Image resynthesis based algorithms reconstruct
the input image and discriminate the anomaly instances according to the reconstruction error [34–37].
With a generative classifier, our GMMSeg handles anomaly segmentation naturally, without either
external datasets of outliers or additional image resynthesis models. It also greatly differs from most
uncertainty estimation-based methods, which are post-processing techniques adjusting the prediction
scores of softmax-based segmentation networks [17, 18, 41–43]. The most relevant ones are perhaps a few
density estimation-based models [56,107,108], which directly measure the likelihood of samples w.r.t.
the data distribution. However, they are either limited to pre-trained representation [56] or specialized
for anomaly detection with simple data [107, 108]. To the best of our knowledge, this is the first time
promising results have been reported on both closed-set and open-world large-scale settings, through a single
model instance without any change of the network architecture or the training and inference protocols.
3 Methodology
In this section, we first formalize modern semantic segmentation models within a dense discriminative
classification framework and discuss defects of such discriminative regime from a probabilistic view-
point (§3.1). Then we describe our new segmentation framework – GMMSeg – that brings a paradigm
shift from the discriminative to generative (§3.2). Finally, in §3.3, we provide implementation details.
3.1 Existing Segmentation Solutions: Dense Discriminative Classifier
In the standard semantic segmentation setting, we are given a training dataset D = {(xn, cn)}_{n=1}^{N} of
N pairs of pixel samples xn ∈ R3 and corresponding semantic labels cn ∈ {1,· · ·, C}. The goal is to
use D to learn a classification rule which can predict the label c′ ∈ {1,· · ·, C} of an unseen pixel x′ .
Recent mainstream solutions employ a deep neural network for pixel representation learning and
softmax for semantic label prediction. Hence they are usually built as a composition of f ◦g:
• A dense feature extractor fθ : R3→ RD, which is typically an encoder-decoder network that maps
the input pixel x to a D-dimensional feature representation x, i.e., x = fθ (x) ∈ RD ;1 and
• A dense classifier gω : RD → RC , which is achieved by parametric softmax that maps each pixel
representation x ∈ RD to C real-valued numbers {yc ∈ R}_{c=1}^{C} termed as logits, i.e., {yc}_{c=1}^{C} = gω(x),
and uses the logits to compute the posterior probability:
\small \begin {aligned}\label {eq:softmax} p(c|\bm {x}; \bm {\omega }, \bm {\theta }) \!=\! \frac {\textrm {exp}(y_{c})}{\sum _{c'}\!\textrm {exp}(y_{c'})} \!=\! \frac {\textrm {exp}(\bm {w}_{c}^{\!\top }\bm {x}+b_{c})}{\sum _{c'}\!\textrm {exp}(\bm {w}_{c'}^{\!\top }\bm {x}+b_{c'})}\!=\! \frac {\textrm {exp}(\bm {w}_{c}^{\!\top }f_{\bm {\theta }}(x)+b_{c})}{\sum _{c'}\!\textrm {exp}(\bm {w}_{c'}^{\!\top }f_{\bm {\theta }}(x)+b_{c'})}, \end {aligned} \vspace {1pt} (1)
where wc ∈ RD and bc ∈ R are the weight and bias for class c, respectively; and ω = {w1:C , b1:C }.
The final prediction is the class with the highest predicted probability: arg maxc p(c|x; ω, θ).
The feature extractor f and softmax-based classifier g are jointly trained end-to-end. Their corre-
sponding parameters {θ, ω} are optimized by minimizing the so-called cross-entropy loss on D:
\small \begin {aligned} \bm {\theta }^\ast , \bm {\omega }^\ast =\argmin _{\bm {\theta }, \bm {\omega }}-\!\!\sum \nolimits _{(x,c)\in \mathcal {D}}\log p(c|\bm {x}; \bm {\omega }, \bm {\theta }), \end {aligned} \vspace {1pt} (2)
which is equivalent to maximizing conditional likelihood, i.e., Π(x,c)∈D p(c|x). In some literature [11,
109], such learning strategy is called discriminative training. As softmax directly models the
conditional probability distribution p(c|x) with no concern for modeling the input distribution
p(x, c), existing softmax-based segmentation models are in essence a dense discriminative classifier.
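To make this concrete, here is a minimal PyTorch sketch of such a dense discriminative classifier (Eqs. 1-2); the feature tensor, shapes, and the 19-class setting are placeholder assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, C = 64, 19                                # assumed feature dim and class count

# Dense classifier g_w: a 1x1 conv computes w_c^T x + b_c for every pixel feature x.
classifier = nn.Conv2d(D, C, kernel_size=1)

feats = torch.randn(2, D, 96, 96)            # stand-in for x = f_theta(input)
labels = torch.randint(0, C, (2, 96, 96))    # dense ground-truth class map

logits = classifier(feats)                   # y_c per pixel (Eq. 1, before softmax)
posterior = logits.softmax(dim=1)            # p(c|x): only decision boundaries are modeled
loss = F.cross_entropy(logits, labels)       # Eq. 2: maximize conditional likelihood p(c|x)
pred = posterior.argmax(dim=1)               # final prediction: arg max_c p(c|x)
```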
Discriminative softmax typically gives good predictive performance, as the pixel classification rule
depends only on the conditional distribution p(c|x) in the sense of minimum error rate and softmax
optimizes the quantity of interest in a concise manner, i.e., learning a direct map from inputs x to the
class labels c. In spite of its prevalence and effectiveness, this dense discriminative regime has some
drawbacks that are still poorly understood: First, it attends only to learning the decision boundaries
between the C classes on the pixel embedding space, i.e., splitting the D-dimensional feature space
using C different (D−1)-dimensional hyperplanes. This simplification avoids extra parameters for
modeling the data (representation) distribution [110]. However, from another perspective, it fails to
capture the intrinsic class characteristics and struggles to generalize well to unseen data. Second,
in softmax, each class c corresponds to only a single weight (wc, bc).
That means existing segmentation models rely on an implicit assumption of unimodality of data of
each class in the feature space [12, 111, 112]. However, this unimodality assumption is rarely the case
in real-world scenarios and makes the model less tolerant of intra-class variances [13], especially
when the multimodality remains in the feature space [12]. Third, softmax is not capable of inferring
the data distribution – it is notorious for inflating the probability of the predicted class as a result of the
exponent applied to the network outputs [113]. Thus the prediction score of a class carries no meaning beyond
its comparative value against the other classes. This is the root cause of why existing segmentation models
1 Strictly speaking, the dense feature extractor fθ typically maps pixel samples with image context, i.e.,
fθ : Rh×w×3 → Rh′×w′×D, where h and w (h′ and w′) denote the spatial resolution of the image (feature map).
Here we simplify the notations, i.e., fθ : R3 → RD, to keep a straightforward formulation.
struggle to identify pixel samples x′ of an unseen class (out-of-distribution data), i.e., c′ ∉ {1,· · ·, C}.
Accordingly, we argue that the time might be right to rethink the current de facto, discriminative
segmentation regime, where the softmax classifier may actually cause more harm than good.
3.2 GMMSeg: Dense Generative Classifier
In contrast, a generative classifier models the class-conditional densities p(x|c) together with the class
priors p(c), and makes predictions via Bayes rule:
\small \begin {aligned}\label {eq:gc} p(c|\bm {x}) \!=\! \frac {p(c)p(\bm {x}|c)}{\sum _{c'}\!p(c')p(\bm {x}|c')}. \end {aligned} (3)
Since the class probabilities p(c) are typically set as a uniform prior (also in our case), estimating
the class-conditional distributions (i.e., data densities) p(x|c) is the core and most difficult part of
building a generative classifier. It is also worth noting that generative classifiers are optimized by
approximating the data distribution Π(x,c)∈D p(x|c), which is called generative training [11].
Although discriminative classifiers demonstrate impressive performance in many application tasks,
there are several crucial reasons for using generative rather than discriminative classifiers, which
can be succinctly articulated by Feynman’s mantra “What I cannot create, I do not understand.”
Surprisingly, generative classifiers have been rarely investigated in modern segmentation models.
Driven by the belief that generative classifiers are the right way to remove the shortcomings of discri-
minative approaches, we revisit GMM – one of the most classic generative probabilistic classifiers.
We couple the generative EM optimization of GMMs with the discriminative learning of the dense
feature extractor f – the most successful part of modern segmentation models, leading to a powerful,
principled, and dense generative classification based segmentation framework – GMMSeg (Fig. 2).
Specifically, GMMSeg adopts a weighted mixture of M multivariate Gaussians for modeling the
pixel data distribution of each class c in the D-dimensional embedding space:
\small \begin {aligned}\label {eq:GMM} p(\bm {x}|c;\bm {\phi }_c) = \sum \nolimits _{m=1}^{M}p(m|c;\bm {\pi }_c)p(\bm {x}|c,m;\bm {\mu }_c,\bm {\Sigma }_c) = \sum \nolimits _{m=1\!}^{M}{\pi }_{cm~\!}\mathcal {N}(\bm {x};\bm {\mu }_{cm},\bm {\Sigma }_{cm}). \end {aligned} (4)
Here m|c ∼ Multinomial(πc) is the prior probability, i.e., Σ_m πcm = 1; µcm ∈ RD and Σcm ∈ RD×D
are the mean vector and covariance matrix for component m in class c; and ϕc = {πc, µc, Σc}. The
mixture nature allows GMMSeg to accurately approximate the data densities and to be superior over
softmax assuming unimodality for each class. Each Gaussian component has an independent co-
variance structure, enabling a flexible local measure of importance along different feature dimensions.
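As an illustration of Eq. 4 with the diagonal covariances used later in §3.3, the following PyTorch sketch evaluates a class-conditional log-density; the function name and tensor layout are assumptions, not the released implementation:

```python
import math
import torch

def gmm_log_density(x, pi, mu, var):
    """log p(x|c) under a diagonal-covariance Gaussian mixture (Eq. 4).

    x:   (N, D) pixel features
    pi:  (M,)   mixture weights pi_cm (summing to 1)
    mu:  (M, D) component means mu_cm
    var: (M, D) diagonal entries of Sigma_cm
    """
    diff = x.unsqueeze(1) - mu                                                    # (N, M, D)
    log_gauss = -0.5 * (torch.log(2 * math.pi * var) + diff ** 2 / var).sum(-1)   # (N, M)
    return torch.logsumexp(log_gauss + pi.log(), dim=-1)                          # (N,)
```

Classification then follows Eq. 3 with a uniform prior, i.e., taking arg max over the per-class log-densities.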
To find the optimal parameters of the GMM classifier, i.e., {ϕc*}_{c=1}^{C}, a standard approach is EM [114],
i.e., maximizing the log likelihood over the feature-label pairs {(xn, cn)}_{n=1}^{N} in the training dataset D:
\small \begin {aligned}\label {eq:EM1} \bm {\phi }^\ast _c\!=\!\argmax _{\bm {\phi }_c}\sum \nolimits _{\bm {x}_{n}:c_n=c\!}\log p(\bm {x}_n|c;\bm {\phi }_c)=\argmax _{\bm {\phi }_c}\sum \nolimits _{\bm {x}_{n}:c_n=c\!}\log \sum \nolimits _{m=1}^{M}p(\bm {x}_n,m|c;\bm {\phi }_c), \end {aligned} (5)
EM starts with some initial guess at the maximum likelihood parameters ϕc^(0), and then proceeds to
iteratively create successive estimates ϕc^(t) for t = 1, 2, · · ·, by repeatedly optimizing an F function [115]:
\small \begin {aligned}\label {eq:EM} \textbf {E-Step:}~~~q^{(t)}_{c}= \argmax \nolimits _{q_{c}\!}F(q_{c},\bm {\phi }_{c}^{(t-1)}),~~~~~~~~~~\textbf {M-Step:}~~~\bm {\phi }^{(t)}_{c}= \argmax \nolimits _{\bm {\phi }_{c}\!}F(q_{c}^{(t)},\bm {\phi }_{c}). \\ \end {aligned} (6)
qc [m] = p(m|x, c; ϕc ) gives the probability that data x is assigned to component m. F is defined as:
\small \begin {aligned}\label {eq:EM2} F(q_{c},\bm {\phi }_{c}) = \mathbb {E}_{q_{c}}[\log p(\bm {x}, m|c;\bm {\phi }_{c})] + H(q_c), \end {aligned} (7)
where Eqc [·] gives the expectation w.r.t. the distribution over the M components given by qc , and
H(qc) = −Eqc[log qc[m]] defines the entropy of qc. Based on Eqs. 4–7, for all xn with cn = c, we have:
\small \begin {aligned}\label {eq:EM3} \!\!\!\!\!\!\!\!\textbf {E-Step:}&~~~q^{(t)}_{cn}[m] = \frac {\pi ^{(t-1)}_{cm}\mathcal {N}(\bm {x}_{n}|\bm {\mu }^{(t-1)}_{cm},\bm {\Sigma }^{(t-1)}_{cm})}{\sum _{m'=1}^{M}\pi ^{(t-1)}_{cm'}\mathcal {N}(\bm {x}_{n}|\bm {\mu }^{(t-1)}_{cm'},\bm {\Sigma }^{(t-1)}_{cm'})},\\ \!\!\!\!\!\!\!\!\textbf {M-Step:}&~~~\pi ^{(t)}_{cm}\!=\!\frac {N^{(t)}_{cm}}{N_{c}},~~ \bm {\mu }^{(t)}_{cm}\!=\!\frac {1}{N^{(t)}_{cm}}\!\sum _{\bm {x}_{n\!}:c_{n\!}=_{\!}c}\!\!\!q^{(t)}_{cn}[m]\bm {x}_{n},~~ \bm {\Sigma }^{(t)}_{cm}\!=\!\frac {1}{N^{(t)}_{cm}}\!\sum _{\bm {x}_{n\!}:c_{n\!}=_{\!}c}\!\!\!q^{(t)}_{cn}[m](\bm {x}_{n\!}\!-\!\bm {\mu }^{(t)}_{cm})(\bm {x}_{n\!}\!-\!\bm {\mu }^{(t)}_{cm})\T ,\!\!\!\!\!\!\!\! \end {aligned}
(8)
Figure 2: Through generative-discriminative hybrid training, GMMSeg gains the best of the two worlds.
where Nc is the number of training samples labeled as c and Ncm = Σ_{n:cn=c} qcn[m]. In the E-step, we
recompute the posterior qc^(t) over the M components given the old parameters ϕc^(t−1). In the M-step, with the
soft cluster assignment qc^(t), the parameters are updated as ϕc^(t) such that the F function is maximized.
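For reference, one vanilla E-step/M-step round of Eq. 8 for a single class, with diagonal covariances, can be sketched as below (illustrative names and shapes, not the released code):

```python
import math
import torch

def em_step(x, pi, mu, var, eps=1e-6):
    """One EM iteration (Eq. 8) on the N_c features x: (N_c, D) of one class."""
    # E-step: responsibilities q[n, m] = p(m | x_n, c) under the old parameters.
    diff = x.unsqueeze(1) - mu                                        # (N_c, M, D)
    log_gauss = -0.5 * (torch.log(2 * math.pi * var) + diff ** 2 / var).sum(-1)
    log_q = log_gauss + pi.log()
    q = (log_q - log_q.logsumexp(-1, keepdim=True)).exp()             # (N_c, M)

    # M-step: re-estimate weights, means, and diagonal covariances.
    n_m = q.sum(0) + eps                                              # soft counts N_cm
    pi_new = n_m / x.shape[0]
    mu_new = (q.unsqueeze(-1) * x.unsqueeze(1)).sum(0) / n_m.unsqueeze(-1)
    var_new = (q.unsqueeze(-1) * (x.unsqueeze(1) - mu_new) ** 2).sum(0) / n_m.unsqueeze(-1) + eps
    return pi_new, mu_new, var_new
```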
In practice, we find standard EM suffers from slow convergence and delivers unsatisfactory re-
sults (cf. §4.3). A potential reason is the parameter sensitivity of EM – convergent parameters
may change vastly even with slightly different initialization [116]. Drawing inspiration from recent
optimal transport (OT) based clustering algorithms [117, 118], we introduce a uniform prior on the
mixture weights πc, i.e., ∀c, m: πcm = 1/M. Recalling qc[m] = p(m|x, c), we can derive a constraint set
Qc = {qc : (1/Nc) Σ_{xn:cn=c} p(m|xn, c) = 1/M}. Then the E-step in Eq. 6 is performed by restricting the
optimization of qc over the set Qc:
\small \begin {aligned}\label {eq:E} ~~~~~~~~~\textbf {E-Step:}~~~q_{c}^{(t)} = \argmax \nolimits _{q_{c}\in \mathcal {Q}_c}F(q_{c},\bm {\phi }_c^{(t-1)}). \end {aligned} (9)
This can be intuitively viewed as an equipartition constraint guided clustering process: inside each
class c, we expect the Nc pixel samples to be evenly assigned to M components. As indicated by [28],
Eq. 9 is analogous to entropy-regularized OT:
\small \begin {aligned}\label {eq:SEM} \!\!\!\!\!\!\!\!\min _{\bm {Q}_c\in \mathcal {Q}'_c\!}\sum \nolimits _{n,m\!}\!\bm {Q}_c(n,m)\bm {O}_c(n,m)\!+\!\epsilon H(\bm {Q}_c), ~~ \mathcal {Q}'_c\!=\!\{\bm {Q}_c\!\in \!\mathbb {R}^{N_c\!\times \!M\!}_{+}\!:\!\bm {Q}_c\mathbf {1}^{M\!}\!=\!\mathbf {1}^{N_c}, {(\bm {Q}_c)}^{\!\!\top }\!\mathbf {1}^{N_c\!}\!=\!\frac {N_c}{M}\mathbf {1}^{M}\},\!\!\!\! \end {aligned} (10)
where the transport matrix Qc (i.e., target solution) can be viewed as the posterior distribution qc of
Nc samples over the M components (i.e., Qc (n, m) = qcn [m]), the cost matrix Oc∈ RNc ×M is given
as the negative log-likelihood, i.e., Oc (n, m) = − log p(xn |c, m), and the entropy H(·) is penalized
by ϵ. The set Q′c encapsulates all the desired constraints over Qc, where 1M is an M-dimensional
all-ones vector. Intuitively, the more plausible a pixel sample xn is with respect to component m, the
less it costs to transport the underlying mass. Eq. 10 can be efficiently solved via Sinkhorn-Knopp
iteration [118], where ϵ is set to the default value (i.e., 0.05). This optimization scheme, called Sinkhorn
EM, is proven to have the same global optimum as the EM in Eq. 9, yet is less prone to getting stuck
in local optima [28], which is in line with our empirical results (cf. §4.3).
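The constrained E-step (Eqs. 9-10) can be approximated with a few Sinkhorn-Knopp normalization rounds; the sketch below follows this entropic-OT reading, with the iteration count being an assumption rather than the paper's setting:

```python
import torch

def sinkhorn_e_step(log_prob, eps=0.05, iters=3):
    """Equipartition-constrained E-step (Eqs. 9-10), sketched as entropic OT.

    log_prob: (N_c, M) log p(x_n | c, m); the OT cost O_c is its negative.
    Returns Q with each row summing to 1 (a posterior over components) and
    each column scaled toward N_c / M (the equipartition constraint).
    """
    n, m = log_prob.shape
    Q = (log_prob / eps).softmax(dim=1)              # exp(-O_c / eps), row-normalized start
    for _ in range(iters):
        Q = Q / Q.sum(0, keepdim=True) * (n / m)     # scale columns to N_c / M
        Q = Q / Q.sum(1, keepdim=True)               # scale rows back to 1
    return Q                                          # soft assignments q_cn[m]
```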
Our GMMSeg adopts a hybrid training strategy that is partly generative and partly discriminative:
\small \begin {aligned}\label {eq:loss} \!\!\!\!\!\!\!\!&\text {\textbf {Generative Optimization} (Sinkhorn EM) of \textbf {GMM Classifier}:}~~~\{\bm {\phi }^\ast _c\}^C_{c=1}\!=\!\\ \!\!\!\!\!\!\!\!&~~~~~~~~\{\argmax _{\bm {\phi }_c}\!\!\!\!\sum _{\bm {x}_{n\!}:c_n=c\!}\!\!\!\!\log p(\bm {x}_n|c;\bm {\phi }_c)\}^C_{c=1} = \{\argmax _{\bm {\phi }_c}\!\!\!\!\sum _{\bm {x}_{n\!}:c_n=c\!}\!\!\!\!\log \sum _{m=1\!}^{M}{\pi }_{cm~\!}\mathcal {N}(\bm {x}_n;\bm {\mu }_{cm},\bm {\Sigma }_{cm})\}^C_{c=1}, \\ \!\!\!\!\!\!\!\!&\text {\textbf {Discriminative Learning} (Cross-Entropy Loss) of \textbf {Dense Representation}:}~~~\bm {\theta }^\ast \!=\!\\ \!\!\!\!\!\!\!\!&\argmin _{\bm {\theta }}-\!\!\!\!\!\sum _{(x,c)\in \mathcal {D}}\!\!\!\!\log p(c|\bm {x}; \{\bm {\phi }^\ast _c\}^C_{c=1}, \bm {\theta }) =\!\argmin _{\bm {\theta }}-\!\!\!\!\!\!\sum _{(x,c)\in \mathcal {D}}\!\!\!\!\log \big (\frac {\sum _{m=1\!}^{M}{\pi }_{cm~\!}\mathcal {N}(f_{\bm {\theta }}(x);\bm {\mu }_{cm},\bm {\Sigma }_{cm})}{\sum _{c'=1\!}^{C}\sum _{m=1\!}^{M}{\pi }_{c'm~\!}\mathcal {N}(f_{\bm {\theta }}(x);\bm {\mu }_{c'm},\bm {\Sigma }_{c'm})}\big ).\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \end {aligned}
(11)
In practice, for efficiency, we conduct the EM in an online fashion: in each training iteration, the GMM
parameters are initialized with the ones maintained from the last iteration, i.e., {ϕc^(0)}_c, and we adopt
momentum update in the M-Step, i.e., {ϕc ← (1−τ)ϕc + τ ϕ̂c}_c, where
the momentum coefficient is set as τ = 0.999. This makes our training more stable and accelerates
the convergence of EM – we empirically find even one EM loop per training iteration is good enough.
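A rough sketch of the momentum M-step update; the assumption here (suggested by τ = 0.999 and the stability argument) is that the maintained parameters receive the weight τ, and the dictionary layout is purely illustrative:

```python
import torch

TAU = 0.999  # momentum coefficient from the text

@torch.no_grad()
def momentum_update(running, fresh, tau=TAU):
    """Blend freshly EM-estimated GMM parameters into the maintained ones.

    running, fresh: dicts with 'pi' (M,), 'mu' (M, D), 'var' (M, D) for one class.
    Assumption: the running (old) parameters keep weight tau, so one EM loop per
    training iteration only nudges them, which keeps training stable.
    """
    return {k: tau * running[k] + (1.0 - tau) * fresh[k] for k in running}
```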
This hybrid training scheme brings several advantages: First, GMMSeg achieves the merits of both
generative and discriminative learning. The online EM based generative optimization enables the
GMM to best fit the data distribution even on the evolving feature space. On the other hand, the
feature space is discriminatively end-to-end trained under the guidance of the GMM classifier, so as
to maximize the pixel-wise predictive performance. Second, as the generative EM optimization and
discriminative stochastic training work in an independent yet closely collaborative manner, GMMSeg
is fully compatible with modern segmentation network architectures and existing discriminative
training objectives. It can be further advanced with the development of network architectures of the
discriminative counterparts. Third, as GMMSeg explicitly models class-conditional data distribution
p(x|c), it can naturally handle off-manifold examples, i.e., directly giving meaningful likelihood of
the example fitting each class GMM distribution (see §4.2 for experiments on anomaly segmentation).
3.3 Implementation Details
Network Architecture. GMMSeg is a general framework that can be built upon any modern segmen-
tation network by replacing softmax with the GMM classifier. In our experiments (cf. §4.1), we
approach GMMSeg on a variety of segmentation models [7, 47–49] and backbones [50, 51]. In the
GMM classifier, a 1×1 conv is used to compress each pixel feature to a 64-dimensional vector, i.e.,
D = 64, and the covariance matrices Σ ∈ RD×D are constrained to be diagonal, for computational
efficiency. In our implementation, each class c is represented by a mixture of M = 5 Gaussians
(there are a total of 5C Gaussian components for a segmentation task with C semantic classes).
Furthermore, we adopt the winner-take-all assumption [119, 120], i.e., the class-wise responsibility
(Eq. 4) is dominated by the largest term, for better performance.
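The winner-take-all scoring mentioned above amounts to replacing the sum over components in Eq. 4 by its largest term; a hedged sketch with the 64-dimensional features, five components, and diagonal covariances from this section (illustrative names only):

```python
import math
import torch

def class_score_wta(x, pi, mu, var):
    """Winner-take-all class score: max_m log(pi_m N(x; mu_m, Sigma_m)).

    x: (N, 64) compressed pixel features; pi: (5,); mu, var: (5, 64).
    """
    diff = x.unsqueeze(1) - mu                                                    # (N, M, D)
    log_gauss = -0.5 * (torch.log(2 * math.pi * var) + diff ** 2 / var).sum(-1)   # (N, M)
    return (log_gauss + pi.log()).max(dim=-1).values                              # (N,)
```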
Training. In each training iteration, we conduct one loop of momentum (Sinkhorn) EM (i.e., t=1) on
current training batch as well as the external memory for the generative optimization of GMM, and
backpropagate the gradient of the cross-entropy loss on current batch for the discriminative training
of the feature extractor. The external memory maintains a queue for each component in each class;
each queue gathers 32K pixel features from previous training batches in a first in, first out manner. To
improve the diversity of the stored pixel features, we sample a sparse set of 100 pixels per class from
each image, instead of directly storing the whole images into the memory. Note that the memory is
discarded after training, and does not introduce extra overheads in inference.
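The external memory can be pictured as FIFO queues of pixel features; the sketch below keeps one queue per class for brevity (the text maintains one per component in each class), reusing the 32K capacity and 100-pixels-per-class sampling from above, while the data structure itself is an assumption:

```python
import torch
from collections import deque

class ClassFeatureMemory:
    """FIFO memory of pixel features, used only to enlarge the EM sample pool."""

    def __init__(self, num_classes, capacity=32_000, per_image=100):
        self.queues = [deque(maxlen=capacity) for _ in range(num_classes)]
        self.per_image = per_image

    @torch.no_grad()
    def push(self, feats, labels):
        """feats: (P, D) pixel features of one image; labels: (P,) class ids."""
        for c in labels.unique().tolist():
            idx = (labels == c).nonzero(as_tuple=True)[0]
            keep = idx[torch.randperm(idx.numel())[: self.per_image]]  # sparse subset per class
            self.queues[c].extend(feats[keep].cpu().unbind(0))

    def features_of(self, c):
        q = self.queues[c]
        return torch.stack(list(q)) if q else torch.empty(0)
```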
Inference. GMMSeg only brings negligible delay in the inference speed compared to the discrimina-
tive counterparts (see experiments in §4.3). For standard (closed-set) semantic segmentation, pixel
prediction is made using Bayes rule (cf. Eq. 3): arg maxc p(c|x), where p(c|x) ∝ p(x|c) with the
uniform class distribution prior: p(c) = 1/C. For anomaly segmentation, the pixel-wise uncertain-
ty/anomaly score can be naturally raised as: −maxc p(x|c), i.e., the outlier input should reside in
low-probability regions [121].
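Both inference modes therefore reduce to reading off the per-class densities; a brief sketch (in log space, which is monotone-equivalent, with an assumed (N, C) layout):

```python
import torch

def predict_and_score(log_px_c):
    """log_px_c: (N, C) per-pixel class log-densities log p(x|c).

    With a uniform prior p(c) = 1/C, Bayes rule gives p(c|x) proportional to p(x|c),
    so the closed-set label is the best-fitting class GMM; the anomaly score is the
    negative best (log-)density, i.e., low density means anomalous.
    """
    label = log_px_c.argmax(dim=1)
    anomaly = -log_px_c.max(dim=1).values
    return label, anomaly
```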
4 Experiments
We respectively examine the efficacy and robustness of GMMSeg on semantic segmentation (§4.1)
and anomaly segmentation (§4.2). In §4.3, we provide diagnostic analysis on our core model design.
4.1 Experiments on Semantic Segmentation
Training Details. Our models are built with a variety of segmentation architectures and back-
bones and trained with commonly used data augmentations including resizing, flipping, color jittering
and cropping. For ADE20K /COCO-Stuff/Cityscapes, images are cropped to 512×512/512×512/768×
768 and models are trained for 160K/80K/80K iterations with 16/16/8 batch size, using 8/16 NVIDIA
Tesla A100 GPUs. Other training hyper-parameters (i.e., optimizers, learning rates, weight decays,
schedulers) are set as the default in MMSegmentation and can be found in the supplementary.
Inference Details. For ADE20K and COCO-Stuff, we keep the aspect ratio of test images and rescale
the short side to 512. For Cityscapes, sliding window inference is used with 768×768 window size. Note
that for fairness, all our results are reported without any test-time data augmentation.
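For reference, sliding-window inference with 768×768 windows can be sketched as below; the stride, the averaging of overlapping scores, and the model's output format are assumptions and not the exact MMSegmentation settings:

```python
import torch

@torch.no_grad()
def slide_inference(model, image, num_classes, win=768, stride=512):
    """Average per-class scores over overlapping windows of one (3, H, W) image."""
    _, H, W = image.shape
    scores = torch.zeros(num_classes, H, W)
    counts = torch.zeros(1, H, W)
    # Window origins, always including the bottom/right-most position for full coverage.
    ys = sorted({*range(0, max(H - win, 0) + 1, stride), max(H - win, 0)})
    xs = sorted({*range(0, max(W - win, 0) + 1, stride), max(W - win, 0)})
    for y in ys:
        for x in xs:
            y2, x2 = min(y + win, H), min(x + win, W)
            out = model(image[:, y:y2, x:x2].unsqueeze(0))[0]  # assumed (C, h, w) class scores
            scores[:, y:y2, x:x2] += out
            counts[:, y:y2, x:x2] += 1
    return scores / counts
```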
Quantitative Results. Table 1 demonstrates our quantitative results. Although mainly focusing on
the comparison with the four base segmentation models [7, 47–49], we further include five widely
recognized methods [1, 3, 8, 9, 89] for completeness. As can be seen, our GMMSeg outperforms all its
discriminative counterparts across various datasets, backbones, and network architectures (FCN-style
and Transformer-like):

Table 1: Quantitative results (§4.1) on ADE20K [53] val, Cityscapes [54] val, and COCO-Stuff [55] test with mean IoU.
Method | Backbone | ADE20K | Citys. | COCO.
FCN [CVPR15] [1] | ResNet101 | 39.9 | 75.5 | 32.6
PSPNet [CVPR17] [3] | ResNet101 | 44.4 | 79.8 | 37.8
SETR† [CVPR21] [9] | ViTLarge | 48.2 | 79.2 | -
Segmenter† [ICCV21] [8] | ViTLarge‡ | 51.8 | 79.1 | -
MaskFormer† [NeurIPS21] [89] | SwinBase‡ | 52.7 | - | -
DeepLabV3+ [ECCV18] [47] | ResNet101 | 45.5 | 80.6 | 33.8
GMMSeg | ResNet101 | 46.7 ↑1.2 | 81.1 ↑0.5 | 35.5 ↑1.7
OCRNet [ECCV20] [48] | HRNetV2W48 | 43.3 | 80.4 | 37.6
GMMSeg | HRNetV2W48 | 44.8 ↑1.5 | 81.2 ↑0.8 | 39.2 ↑1.6
UPerNet [ECCV18] [49] | SwinBase | 48.0 | 81.1 | 43.4
GMMSeg | SwinBase | 49.0 ↑1.0 | 81.8 ↑0.7 | 44.3 ↑0.9
SegFormer [NeurIPS21] [7] | MiTB5 | 50.0 | 82.0 | 44.0
GMMSeg | MiTB5 | 50.6 ↑0.6 | 82.6 ↑0.6 | 44.7 ↑0.7
†: pretrained on ImageNet22K; ‡: using larger crop-size, i.e., 640×640

• ADE20K [53] val. With FCN-style segmentation architectures, i.e., DeepLabV3+ and OCRNet, GMMSeg
provides 1.2%/1.5% mIoU gains over the corresponding discriminative models. Similar performance
improvements, i.e., 1.0% and 0.6%, are also obtained with attentive architectures, i.e., Swin-UperNet and
SegFormer, manifesting the universality and efficacy of GMMSeg.
• Cityscapes [54] val. Again our GMMSeg surpasses all its discriminative counterparts, e.g., 0.5% over
DeepLabV3+, 0.8% over OCRNet, 0.7% over Swin-UperNet, and 0.6% over SegFormer, suggesting its
wide utility in this field.
• COCO-Stuff [55] test. Our GMMSeg also demonstrates promising results. This is particularly
impressive considering these results are achieved by a dense generative classifier, while the semantic
segmentation task is commonly considered as a battlefield for discriminative approaches.
Qualitative Results. In Fig. 3, we illustrate the qualitative comparisons of our GMMSeg against
SegFormer [7]. It is evident that, among the representative samples in the three datasets, our method
yields more accurate predictions in challenging scenarios, e.g., inconspicuous objects.
4.2 Experiments on Anomaly Segmentation
Datasets. To fully reveal the merits of our generative method, we next test its robustness to abnormal
data, i.e., identifying test samples of unseen classes, using two popular anomaly segmentation datasets:
• Fishyscapes Lost&Found [56], built upon [124], has 100/275 val/test images. It is collected under
the same setup as Cityscapes [54] but with real obstacles on the road. Pixels are labeled as either back-
ground (i.e., pre-defined Cityscapes classes) or anomaly (i.e., other unexpected classes like crate).
• Road Anomaly [36] has 60 images containing anomalous objects in unusual road conditions.
Evaluation Metrics. The area under the receiver operating characteristic curve (AUROC), average precision
(AP), and false positive rate at a 95% true positive rate (FPR95) are adopted, following [18, 35, 56].
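These metrics can be computed from flattened per-pixel anomaly scores and binary anomaly labels with scikit-learn; a small sketch (the FPR95 lookup via the ROC curve is one common convention, assumed here):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def anomaly_metrics(scores, labels):
    """scores: higher = more anomalous; labels: 1 for anomaly pixels, 0 otherwise."""
    auroc = roc_auc_score(labels, scores)
    ap = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.searchsorted(tpr, 0.95)                 # first threshold reaching 95% TPR
    fpr95 = fpr[min(idx, len(fpr) - 1)]
    return auroc, ap, fpr95
```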
Table 2: Quantitative results (§4.2) on Fishyscapes Lost&Found [56] val and Road Anomaly [36].
Method | Resyn. | Extra OOD Data | mIoU | Fishyscapes Lost&Found: AUROC↑ / AP↑ / FPR95↓ | Road Anomaly: AUROC↑ / AP↑ / FPR95↓
SynthCP [ECCV20] [35] | ✓ | ✓ | 80.3 | 88.34 / 6.54 / 45.95 | 76.08 / 24.86 / 64.69
SynBoost [CVPR21] [34] | ✓ | ✓ | - | 96.21 / 60.58 / 31.02 | 81.91 / 38.21 / 64.75
MSP [ICLR17] [17] | ✗ | ✗ | 80.3 | 86.99 / 6.02 / 45.63 | 73.76 / 20.59 / 68.44
Entropy [ICLR17] [17] | ✗ | ✗ | 80.3 | 88.32 / 13.91 / 44.85 | 75.12 / 22.38 / 68.15
SML [ICCV21] [18] | ✗ | ✗ | 80.3 | 96.88 / 36.55 / 14.53 | 81.96 / 25.82 / 49.74
Mahalanobis∗ [NeurIPS18] [19] | ✗ | ✗ | 80.3 | 92.51 / 27.83 / 30.17 | 76.73 / 22.85 / 59.20
GMMSeg-DeepLabV3+∗ | ✗ | ✗ | 81.1 | 97.34 / 43.47 / 13.11 | 84.71 / 34.42 / 47.90
GMMSeg-FCN∗ | ✗ | ✗ | 76.7 | 96.28 / 32.94 / 16.07 | 78.99 / 24.51 / 56.95
GMMSeg-SegFormer∗ | ✗ | ✗ | 82.6 | 97.83 / 50.03 / 12.55 | 89.37 / 57.65 / 44.34
∗: confidence derived with generative formulation
Figure 4: Qualitative results (§4.2) of anomaly heatmaps on Fishyscapes Lost&Found [56] val.
Experiment Protocol. As in [17, 18, 125], we adopt the ResNet101-DeepLabV3+ architecture. For
completeness, we also report the results of our GMMSeg based on ResNet101-FCN and MiTB5-SegFormer.
All our models are the same ones in Table 1, i.e., trained on Cityscapes train only. As GMMSeg esti-
mates class densities p(x|c), it can naturally reject unlikely inputs (cf. §3.3), i.e., directly thresholding
−maxc p(x|c) for computing the anomaly segmentation metrics, without any post-processing.
Quantitative Results. As shown in Table 2, based on DeepLabV3+ architecture, GMMSeg outper-
forms all the competitors under the same setting, i.e., neither using external out-of-distribution data nor
extra resynthesis module. Note that, [17–19] rely on pre-trained discriminative segmentation models
and thus have to resort to post-calibration. However, GMMSeg directly derives meaningful confidence
scores from the likelihood p(x|c). Mahalanobis [19] also models data density, yet merely on a pre-trained
feature space with a single Gaussian per class. In contrast, GMMSeg performs much better, proving
the superiority of mixture modeling and hybrid training. Even with a weaker architecture, i.e., FCN,
GMMSeg still performs robustly. When adopting SegFormer, better performance is achieved.
Qualitative Results. In Fig. 4, we visualize the anomaly score heatmaps generated by MSP [17]-
DeepLabV3+ [47] and GMMSeg-DeepLabV3+ . The softmax based counterpart ignores the anomalies
with overconfident predictions; in contrast, GMMSeg naturally rejects them (red colored regions).
4.3 Diagnostic Experiment
Standard EM vs. Sinkhorn EM. In our GMMSeg, we leverage the entropic OT based Sinkhorn EM [28]
(cf. Eq. 10) instead of the classic one (cf. Eq. 8) for the generative optimization of the GMM. In Table 5a,
we investigate the impacts of these two different EM algorithms and show that Sinkhorn EM is more
favorable. More specifically, during the E-step, rather than the vanilla EM assigning data samples to
Gaussian components independently, Sinkhorn EM restricts the assignment with an equipartition
constraint. As pointed out in [28], incorporating such prior information about the mixing weights of GMM
components leads to higher curvature around the global optimum. Our empirical results confirm this
theoretical finding.

Table 5: Ablative studies (§4.3) on ADE20K [53] val. The adopted settings are marked in red.
(a) EM optimization:
EM algorithm | # Loop | mIoU (%)
vanilla EM | 1 | 42.7
vanilla EM | 10 | 44.8
Sinkhorn EM | 1 | 46.0
Sinkhorn EM | 5 | 46.0
Sinkhorn EM | 10 | 46.0
(b) # Component per class:
# Component | mIoU (%)
M = 1 | 44.2
M = 3 | 45.3
M = 5 | 46.0
M = 10 | 46.0
M = 15 | 45.7
Number of EM Loops per Training Iteration. The EM algorithm alternates between the E-step and M-step
for maximum-likelihood inference (cf. Eq. 6). In GMMSeg, in order to blend EM with stochastic
gradient descent, we adopt an online version of (Sinkhorn) EM based on momentum update. In
Table 5a, we also study the influence of looping EM different times per training iteration. We can find
that one loop per iteration is enough to catch the drift of the gradually updated feature space.
Number of Gaussian Components per Class. In GMMSeg, data distribution of each class is modeled
by a mixture of M Gaussian components (cf. Eq. 4). Table 5b shows the results with different values of
M . When M = 1, each class corresponds to a single Gaussian, which is directly estimated via
Gaussian Discriminant Analysis, without EM. This baseline achieves 44.2% mIoU. After adopting
the mixture model, i.e., M : 1 → 3 → 5, the performance is greatly improved, i.e., mIoU: 44.2%→
45.3% → 46.0%. This verifies our hypothesis of class multimodality. Yet, further increasing the component
number (i.e., M: 5 → 15) only brings marginal or even negative gains, due to overparameterization.
Confidence Calibration. We further study the model calibration of GMMSeg and the discriminative
counterpart, i.e., DeepLabV3+ [47] with the softmax classifier.
5 Conclusion
We presented GMMSeg, the first generative neural framework for semantic segmentation. By explicitly
modeling data distribution as GMMs, GMMSeg shows promise to solve the intrinsic limitations of
current softmax based discriminative regime. It successfully optimizes generative GMM with end-
to-end discriminative representation learning in a compact and collaborative manner. This makes
GMMSeg principled and well applicable in both closed-set and open-world settings. We believe this
work provides fundamental insights and can benefit a broad range of application tasks. As a part of
our future work, we will explore our algorithm in image classification and trustworthy AI related tasks.
References
[1] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmen-
tation. In CVPR, 2015. 1, 3, 8, 16
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.
IEEE TPAMI, 2017. 1, 3
[3] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing
network. In CVPR, 2017. 1, 3, 8
[4] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018. 1, 3
[5] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention
network for scene segmentation. In CVPR, 2019. 1, 3
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 1
[7] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer:
Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021. 1, 2, 3, 7, 8,
16, 17, 18, 19, 20
[8] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic
segmentation. In ICCV, 2021. 1, 3, 8
[9] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng
Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence
perspective with transformers. In CVPR, 2021. 1, 3, 8
[10] Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer:
High-resolution transformer for dense prediction. In NeurIPS, 2021. 1, 3
[11] JM Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, and M West. Generative or
discriminative? getting the best of both worlds. Bayesian statistics, 8(3):3–24, 2007. 1, 4, 5
[12] Hideaki Hayashi and Seiichi Uchida. A discriminative gaussian mixture model with sparsity. In ICLR,
2021. 1, 2, 3, 4
[13] Tianfei Zhou, Wenguan Wang, Ender Konukoglu, and Luc Van Gool. Rethinking semantic segmentation:
A prototype view. In CVPR, 2022. 1, 3, 4
[14] Lynton Ardizzone, Radek Mackowiak, Carsten Rother, and Ullrich Köthe. Training normalizing flows
with the information bottleneck for competitive generative classification. In NeurIPS, 2020. 2, 3
[15] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks.
In ICML, 2017. 2, 10
[16] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon,
Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating
predictive uncertainty under dataset shift. NeurIPS, 2019. 2
[17] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples
in neural networks. In ICLR, 2017. 2, 3, 9, 17, 21
[18] Sanghun Jung, Jungsoo Lee, Daehoon Gwak, Sungha Choi, and Jaegul Choo. Standardized max logits: A
simple yet effective approach for identifying unexpected road obstacles in urban-scene segmentation. In
ICCV, 2021. 2, 3, 8, 9, 17
[19] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting
out-of-distribution samples and adversarial attacks. NeurIPS, 2018. 2, 9
[20] Bradley Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of
the American Statistical Association, 70(352):892–898, 1975. 2
[21] Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic
regression and naive bayes. In NeurIPS, 2001. 2, 3
[22] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Hybrid
models with deep and invertible features. In ICML, 2019. 2, 3
[23] Pavel Izmailov, Polina Kirichenko, Marc Finzi, and Andrew Gordon Wilson. Semi-supervised learning
with normalizing flows. In ICML, 2020. 2, 3
[24] Radek Mackowiak, Lynton Ardizzone, Ullrich Kothe, and Carsten Rother. Generative classifiers as a
basis for trustworthy image classification. In CVPR, 2021. 2, 3
[25] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust
neural network model on mnist. In NeurIPS, 2018. 2, 3
[26] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Lever-
aging generative models to understand and defend against adversarial examples. In ICLR, 2018. 2,
3
[27] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F Núñez, and Jordi Luque. Input
complexity and out-of-distribution detection with likelihood-based generative models. In ICLR, 2019. 2,
3
[28] Gonzalo Mena, Amin Nejatbakhsh, Erdem Varol, and Jonathan Niles-Weed. Sinkhorn em: an expectation-
maximization algorithm based on entropic optimal transport. arXiv preprint arXiv:2006.16548, 2020. 2,
6, 10
[29] Ehsan Variani, Erik McDermott, and Georg Heigold. A gaussian mixture model layer jointly optimized
with discriminative features within a deep neural network architecture. In ICASSP, 2015. 2, 3
[30] Zoltán Tüske, Muhammad Ali Tahir, Ralf Schlüter, and Hermann Ney. Integrating gaussian mixtures into
deep neural networks: Softmax layer with hidden variables. In ICASSP, 2015. 2, 3
[31] Aldebaro Klautau, Nikola Jevtic, and Alon Orlitsky. Discriminative gaussian mixture models: A
comparison with kernel classifiers. In ICML, 2003. 2
[32] Zhihao Zheng and Pengyu Hong. Robust detection of adversarial attacks by modeling the intrinsic
properties of deep neural networks. In NeurIPS, 2018. 2
[33] Kimin Lee, Sukmin Yun, Kibok Lee, Honglak Lee, Bo Li, and Jinwoo Shin. Robust inference via
generative classifiers for handling noisy labels. In ICML, 2019. 2
[34] Giancarlo Di Biase, Hermann Blum, Roland Siegwart, and Cesar Cadena. Pixel-wise anomaly detection
in complex driving scenes. In CVPR, 2021. 2, 3, 9, 17
[35] Yingda Xia, Yi Zhang, Fengze Liu, Wei Shen, and Alan L Yuille. Synthesize then compare: Detecting
failures and anomalies for semantic segmentation. In ECCV, 2020. 2, 3, 8, 9
[36] Krzysztof Lis, Krishna Nakka, Pascal Fua, and Mathieu Salzmann. Detecting the unexpected via image
resynthesis. In ICCV, 2019. 2, 3, 8, 9, 17
[37] Tomas Vojir, Tomáš Šipka, Rahaf Aljundi, Nikolay Chumerin, Daniel Olmeda Reino, and Jiri Matas.
Road anomaly detection by partial image reconstruction with segmentation coupling. In ICCV, 2021. 2, 3
[38] Petra Bevandić, Ivan Krešo, Marin Oršić, and Siniša Šegvić. Simultaneous semantic segmentation and
outlier detection in presence of domain shift. In GCPR, 2019. 2, 3, 17
[39] Robin Chan, Matthias Rottmann, and Hanno Gottschalk. Entropy maximization and meta classification
for out-of-distribution detection in semantic segmentation. In ICCV, 2021. 2, 3
[40] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure.
arXiv preprint arXiv:1812.04606, 2018. 2, 3
[41] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image
detection in neural networks. arXiv preprint arXiv:1706.02690, 2017. 2, 3
[42] KIMIN LEE, Kibok Lee, Honglak Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for
detecting out-of-distribution samples. In ICLR, 2018. 2, 3
[43] Matthias Rottmann, Pascal Colling, Thomas Paul Hack, Robin Chan, Fabian Hüger, Peter Schlicht,
and Hanno Gottschalk. Prediction error meta classification in semantic segmentation: Detection via
aggregated dispersion measures of softmax probabilities. In IJCNN, 2020. 2, 3
[44] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer
vision? In Advances in neural information processing systems, 2017. 2, 3
[45] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive
uncertainty estimation using deep ensembles. In NeurIPS, 2017. 2, 3
[46] Jishnu Mukhoti and Yarin Gal. Evaluating bayesian deep learning methods for semantic segmentation.
arXiv preprint arXiv:1811.12709, 2018. 2, 3, 17
[47] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-
decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 2, 3, 7, 8, 9,
16, 17, 21
[48] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmenta-
tion. In ECCV, 2020. 2, 3, 7, 8, 16
[49] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene
understanding. In ECCV, 2018. 2, 7, 8, 10
[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In CVPR, 2016. 2, 7, 9, 16
[51] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu,
Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition.
IEEE TPAMI, 2020. 2, 7
[52] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 2, 7, 16
[53] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing
through ade20k dataset. In CVPR, 2017. 2, 7, 8, 9, 10, 16, 17, 18
[54] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Be-
nenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene
understanding. In CVPR, 2016. 2, 7, 8, 16, 17, 19
[55] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In
CVPR, 2018. 2, 7, 8, 16, 17, 20
[56] Hermann Blum, Paul-Edouard Sarlin, Juan Nieto, Roland Siegwart, and Cesar Cadena. The fishyscapes
benchmark: Measuring blind spots in semantic segmentation. IJCV, 2021. 2, 3, 4, 8, 9, 16, 17, 21
[57] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable
convolutional networks. In CVPR, 2017. 3
[58] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation
in street scenes. In CVPR, 2018. 3
[59] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. 3
[60] Guolei Sun, Wenguan Wang, Jifeng Dai, and Luc Van Gool. Mining cross-image semantics for weakly
supervised semantic segmentation. In ECCV, 2020. 3
[61] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In MICCAI, 2015. 3
[62] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du,
Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In ICCV,
2015. 3
[63] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder
architecture for image segmentation. IEEE TPAMI, 39(12):2481–2495, 2017. 3
[64] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning markov random
field for semantic segmentation. IEEE TPAMI, 40(8):1814–1828, 2017. 3
[65] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks
for high-resolution semantic segmentation. In CVPR, 2017. 3
[66] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. Espnet:
Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 2018. 3
[67] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit
Agrawal. Context encoding for semantic segmentation. In CVPR, 2018. 3
[68] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for
semantic segmentation. In CVPR, 2019. 3
[69] Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, and Junjie Yan. Class-wise dynamic graph
convolution for semantic segmentation. In ECCV, 2020. 3
[70] Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, and Nong Sang. Context prior for
scene segmentation. In CVPR, 2020. 3
[71] Mingyuan Liu, Dan Schonfeld, and Wei Tang. Exploit visual dependency relations for semantic segmen-
tation. In CVPR, 2021. 3
[72] Chi-Wei Hsiao, Cheng Sun, Hwann-Tzong Chen, and Min Sun. Specialize and fuse: Pyramidal output
representation for semantic segmentation. In ICCV, 2021. 3
[73] Zhenchao Jin, Bin Liu, Qi Chu, and Nenghai Yu. Isnet: Integrate image-level and semantic-level context
for semantic segmentation. In ICCV, 2021. 3
[74] Zhenchao Jin, Tao Gong, Dongdong Yu, Qi Chu, Jian Wang, Changhu Wang, and Jie Shao. Mining
contextual information beyond image for semantic segmentation. In ICCV, 2021. 3
[75] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring
cross-image pixel contrast for semantic segmentation. In ICCV, 2021. 3
[76] Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. Vspw: A large-scale dataset
for video scene parsing in the wild. In CVPR, 2021. 3
[77] Zongxin Yang, Yunchao Wei, and Yi Yang. Collaborative video object segmentation by multi-scale
foreground-background integration. IEEE TPAMI, 2021. 3
[78] Adam W Harley, Konstantinos G Derpanis, and Iasonas Kokkinos. Segmentation-aware convolutional
networks using local attention masks. In ICCV, 2017. 3
[79] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR,
2018. 3
[80] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic
segmentation. arXiv preprint arXiv:1805.10180, 2018. 3
[81] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet:
Point-wise spatial attention network for scene parsing. In ECCV, 2018. 3
[82] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In
ICCV, 2019. 3
[83] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-
maximization attention networks for semantic segmentation. In ICCV, 2019. 3
[84] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet:
Criss-cross attention for semantic segmentation. In ICCV, 2019. 3
[85] Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, and Yi Yang. Deep hierarchical semantic segmenta-
tion. In CVPR, pages 1246–1257, 2022. 3
[86] Wenguan Wang, Hailong Zhu, Jifeng Dai, Yanwei Pang, Jianbing Shen, and Ling Shao. Hierarchical
human parsing with typed part-relation reasoning. In CVPR, 2020. 3
[87] Wenguan Wang, Zhijie Zhang, Siyuan Qi, Jianbing Shen, Yanwei Pang, and Ling Shao. Learning
compositional neural information fusion for human parsing. In ICCV, 2019. 3
[88] Zongxin Yang, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object
segmentation. NeurIPS, 2021. 3
[89] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need
for semantic segmentation. In NeurIPS, 2021. 3, 8
[90] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-
attention mask transformer for universal image segmentation. CVPR, 2022. 3, 16
[91] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional
neural networks. In ICML, 2016. 3
[92] Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data artificial
intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic
Engineering, 22(12):1551–1558, 2021. 3
[93] Wenguan Wang, Cheng Han, Tianfei Zhou, and Dongfang Liu. Visual recognition with deep nearest
centroids. arXiv preprint arXiv:2209.07383, 2022. 3
[94] Jyh-Jing Hwang, Stella X Yu, Jianbo Shi, Maxwell D Collins, Tien-Ju Yang, Xiao Zhang, and Liang-Chieh
Chen. Segsort: Segmentation by discriminative sorting of segments. In ICCV, 2019. 3
[95] Arindam Banerjee, Inderjit S Dhillon, Joydeep Ghosh, Suvrit Sra, and Greg Ridgeway. Clustering on the
unit hypersphere using von mises-fisher distributions. Journal of Machine Learning Research, 6(9), 2005.
3
[96] Rajat Raina, Yirong Shen, Andrew Mccallum, and Andrew Ng. Classification with hybrid generative/dis-
criminative models. NeurIPS, 2003. 3
[97] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved
techniques for training gans. In NeurIPS, 2016. 3
[98] Ricky TQ Chen, Jens Behrmann, David K Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for
invertible generative modeling. In NeurIPS, 2019. 3
[99] Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and
Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In
ICLR, 2019. 3
[100] Yingzhen Li, John Bradshaw, and Yash Sharma. Are generative classifiers more robust to adversarial
attacks? In ICML, 2019. 3
[101] Xinshuai Dong, Hong Liu, Rongrong Ji, Liujuan Cao, Qixiang Ye, Jianzhuang Liu, and Qi Tian. Api-net:
Robust generative classifier via a single discriminator. In ECCV, 2020. 3
[102] Ethan Fetaya, Jörn-Henrik Jacobsen, Will Grathwohl, and Richard Zemel. Understanding the limitations
of conditional generative models. In ICLR, 2020. 3
[103] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep
generative models know what they don’t know? In ICLR, 2019. 3
[104] Florian Wenzel, Théo Galy-Fajou, Christian Donner, Marius Kloft, and Manfred Opper. Efficient Gaussian
process classification using Pólya-gamma data augmentation. In AAAI, 2019. 3
[105] Tommi Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In
NeurIPS, 1998. 3
[106] Julia A Lasserre, Christopher M Bishop, and Thomas P Minka. Principled hybrids of generative and
discriminative models. In CVPR, 2006. 3
[107] Hyunsun Choi, Eric Jang, and Alexander A Alemi. Waic, but why? generative ensembles for robust
anomaly detection. arXiv preprint arXiv:1810.01392, 2018. 3, 4
[108] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji
Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In NeurIPS, 2019. 3, 4
[109] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning
with deep generative models. In NeurIPS, 2014. 4
[110] Guillaume Bouchard and Bill Triggs. The tradeoff between generative and discriminative classifiers. In
IASC International Symposium on Computational Statistics, 2004. 4
[111] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image
classification: Generalizing to new classes at near-zero cost. IEEE TPAMI, 35(11):2624–2637, 2013. 4
[112] Kateryna Chumachenko, Alexandros Iosifidis, and Moncef Gabbouj. Feedforward neural networks
initialization based on discriminant learning. Neural Networks, 146:220–229, 2022. 4
[113] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification
uncertainty. In NeurIPS, 2020. 4
[114] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via
the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
5
[115] Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse,
and other variants. In Learning in graphical models, pages 355–368. 1998. 5
[116] Naonori Ueda and Ryohei Nakano. Deterministic annealing variant of the em algorithm. NeurIPS, 7,
1994. 6
[117] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering
and representation learning. In ICLR, 2020. 6
[118] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013. 6
[119] Steven Nowlan. Maximum likelihood competitive learning. NeurIPS, 1989. 7
[120] Nanda Kambhatla and Todd Leen. Classifying with gaussian mixtures and clusters. NeurIPS, 1994. 7
[121] Xuefeng Du, Xin Wang, Gabriel Gozum, and Yixuan Li. Unknown-aware object detection: Learning
what you don’t know from videos in the wild. In CVPR, 2022. 7
[122] MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and
benchmark. https://github.com/open-mmlab/mmsegmentation, 2020. 7
[123] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet
large scale visual recognition challenge. IJCV, 115(3):211–252, 2015. 7, 16
[124] Peter Pinggera, Sebastian Ramos, Stefan Gehrig, Uwe Franke, Carsten Rother, and Rudolf Mester. Lost
and found: detecting small road hazards for self-driving vehicles. In IROS, 2016. 8
[125] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection.
NeurIPS, 2020. 9
[126] Mark Everingham, SM Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew
Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. 16
[127] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural
networks. arXiv preprint arXiv:1802.04865, 2018. 17
[128] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. NeurIPS, 2018. 17
[129] Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn
Song. Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132,
2019. 17
SUMMARY OF THE APPENDIX
In this appendix, we provide the following items that shed deeper insight into our contributions:
• §A: Detailed training parameters.
• §B: More experimental results.
• §C: More qualitative visualization.
We evaluate our GMMSeg on six base segmentation architectures. Four of them, i.e., DeepLabv3+ [47],
OCRNet [48], Swin-UperNet [52], and SegFormer [7], are presented in our main paper; the two
additional base architectures, i.e., FCN [1] and Mask2Former [90], are evaluated in this supplemental
material (cf. §B). We follow the default training settings of the official Mask2Former codebase for
Mask2Former, and of MMSegmentation [122] for the other base architectures. In particular, we train
FCN, DeepLabv3+ and OCRNet using SGD optimizer with initial learning rate 0.1, weight decay
4e-4 with polynomial learning rate annealing; we train Swin-UperNet and SegFormer using AdamW
optimizer with initial learning rate 6e-5, weight decay 1e-2 with polynomial learning rate annealing;
we train Mask2Former using AdamW optimizer with initial learning rate 1e-4, weight decay 5e-2
and the learning rate is decayed by a factor of 10 at 0.9 and 0.95 fractions of the total training steps.
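For concreteness, a minimal PyTorch sketch of these optimizer/schedule settings is given below. The function signature, the SGD momentum of 0.9, and the polynomial-decay power of 0.9 are illustrative assumptions (common MMSegmentation defaults); the actual experiments run through the MMSegmentation and Mask2Former training pipelines rather than this code.

```python
import torch

def build_optimizer(model, arch, total_iters):
    """Return (optimizer, per-iteration LR scheduler) mirroring the settings above."""
    def poly(it):
        # polynomial learning-rate annealing; the power 0.9 is an assumed default
        return (1 - it / total_iters) ** 0.9

    if arch in {"fcn", "deeplabv3plus", "ocrnet"}:
        opt = torch.optim.SGD(model.parameters(), lr=0.1,
                              momentum=0.9, weight_decay=4e-4)  # momentum is assumed
        sched = torch.optim.lr_scheduler.LambdaLR(opt, poly)
    elif arch in {"swin_upernet", "segformer"}:
        opt = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=1e-2)
        sched = torch.optim.lr_scheduler.LambdaLR(opt, poly)
    else:  # mask2former: learning rate decayed by 10x at 90% and 95% of training
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-2)
        sched = torch.optim.lr_scheduler.MultiStepLR(
            opt, milestones=[int(0.9 * total_iters), int(0.95 * total_iters)], gamma=0.1)
    return opt, sched
```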
More Base Segmentation Architectures. We first demonstrate the efficacy of our GMMSeg on two
additional base segmentation architectures, i.e., FCN [1] and Mask2Former [90], with quantitative re-
sults summarized in Table 6. We train FCN based models with the corresponding training hyperparameter settings mentioned in §A and strictly follow the same training and inference setups in our main manuscript (cf. §4.1). Furthermore, for a fair comparison with Mask2Former, the backbone, i.e., SwinLarge [52], is pre-trained with ImageNet22K [123].

Table 6: Additional quantitative results (§B) on ADE20K [53] val, Cityscapes [54] val, and COCO-Stuff [55] test in mean IoU.
Method | Backbone | ADE20K | Citys. | COCO.
FCN [CVPR15] [1] | ResNet101 | 39.9 | 75.5 | 32.6
GMMSeg | ResNet101 | 41.8 (↑1.9) | 76.7 (↑1.2) | 34.1 (↑1.5)
Mask2Former [CVPR22] [90] | SwinLarge | 56.1 | 83.3 | 51.0
GMMSeg | SwinLarge | 56.7 (↑0.6) | 83.8 (↑0.5) | 52.0 (↑1.0)
For ADE20K/COCO-Stuff/Cityscapes, we train Mask2Former based models using images cropped to 640×640/640×640/1024×1024, for 160K/80K/90K iterations with a batch size of 16. We adopt sliding-window inference on Cityscapes with a window size of 1024×1024; on ADE20K and COCO-Stuff, we keep the aspect ratio of test images and rescale the short side to 640.
Here FCN is a classic fully convolutional model in line with the per-pixel dense classification models discussed in the main paper (cf. §3.1). Of particular interest is Mask2Former, a recent attention-based model that formulates the task as mask classification, learning mask-level rather than pixel-level representations. It nevertheless still relies on a discriminative softmax-based classifier for mask classification. We adapt Mask2Former by replacing its softmax classification module with our generative GMM classifier.
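To illustrate the kind of drop-in replacement involved, the sketch below shows a GMM-based classification head that scores features by per-class mixture log-densities instead of a linear-softmax projection. The diagonal covariances, the buffer names (`means`, `log_vars`, `log_pi`), and the random initialization are illustrative assumptions; in GMMSeg the mixture parameters are estimated with EM rather than learned by gradient descent, and the released implementation may differ in detail.

```python
import math
import torch
import torch.nn as nn

class GMMClassifier(nn.Module):
    """Scores pixel/mask features with class-conditional Gaussian mixtures."""
    def __init__(self, num_classes: int, num_components: int, dim: int):
        super().__init__()
        # Mixture parameters are kept as buffers: updated by EM, not by SGD.
        self.register_buffer("means", torch.randn(num_classes, num_components, dim))
        self.register_buffer("log_vars", torch.zeros(num_classes, num_components, dim))
        self.register_buffer("log_pi",
                             torch.full((num_classes, num_components),
                                        -math.log(num_components)))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) embeddings; returns (N, C) class-conditional log-densities.
        x = feats[:, None, None, :]                          # (N, 1, 1, dim)
        diff = x - self.means                                # (N, C, K, dim)
        log_norm = -0.5 * (diff.pow(2) / self.log_vars.exp()
                           + self.log_vars
                           + math.log(2 * math.pi)).sum(-1)  # (N, C, K)
        # log p(x | c) = logsumexp_k [ log pi_{c,k} + log N(x; mu_{c,k}, sigma_{c,k}) ]
        return torch.logsumexp(self.log_pi + log_norm, dim=-1)
```

A discriminative cross-entropy loss can then be applied on top of these scores to train the feature extractor, while the mixture parameters themselves are refreshed by EM.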
As Table 6 shows, GMMSeg consistently boosts performance across both segmentation formulations, i.e., pixel classification and mask classification, confirming the benefit of shifting from a discriminative softmax classifier to a generative GMM classifier. Notably, with Mask2Former-SwinLarge as the base architecture, GMMSeg achieves mIoU scores of 56.7%/83.8%/52.0%, setting new state-of-the-art results on ADE20K/Cityscapes/COCO-Stuff.
Anomaly Segmentation Results on Fishyscapes Lost&Found test and Static test. We additionally report the anomaly segmentation performance of our Cityscapes [54]-trained GMMSeg, built upon DeepLabV3+ [47]-ResNet101 [50], on Fishyscapes [56] Lost&Found test and Static test. Fishyscapes Static is a blending-based dataset that composites anomalous objects from Pascal VOC [126] onto Cityscapes backgrounds; it contains 30/1,000 images in its val/test splits. The test splits of both Fishyscapes Lost&Found and Static are privately held by the Fishyscapes organizers, so the anomalies are entirely unknown to the evaluated methods. The results are summarized in Table 7; they are also publicly available, in anonymized form, on the official leaderboard3.
Table 7: Quantitative results (§B) on Fishyscapes (FS) Lost&Found test and Static test.
Method | Re-training | Extra Network | OoD Data | FS Lost&Found AP↑ | FS Lost&Found FPR95↓ | FS Static AP↑ | FS Static FPR95↓
Density - Single-layer NLL [56] | ✗ | ✓ | ✗ | 3.01 | 32.9 | 40.86 | 21.29
Density - Minimum NLL [56] | ✗ | ✓ | ✗ | 4.25 | 47.15 | 62.14 | 17.43
Density - Logistic Regression [56] | ✗ | ✓ | ✓ | 4.65 | 24.36 | 57.16 | 13.39
Image Resynthesis [36] | ✗ | ✓ | ✗ | 5.70 | 48.05 | 29.6 | 27.13
Bayesian Deeplab [46] | ✓ | ✗ | ✗ | 9.81 | 38.46 | 48.70 | 15.05
OoD Training - Void Class [127] | ✓ | ✗ | ✓ | 10.29 | 22.11 | 45.00 | 19.40
Discriminative Outlier Detection Head [38] | ✓ | ✓ | ✓ | 31.31 | 19.02 | 96.76 | 0.29
Dirichlet Deeplab [128] | ✓ | ✗ | ✓ | 34.28 | 47.43 | 31.30 | 84.60
SynBoost [34] | ✗ | ✓ | ✓ | 43.22 | 15.79 | 72.59 | 18.75
MSP [129] | ✗ | ✗ | ✗ | 1.77 | 44.85 | 12.88 | 39.83
Entropy [17] | ✗ | ✗ | ✗ | 2.93 | 44.83 | 15.41 | 39.75
kNN Embedding - density [56] | ✗ | ✗ | ✗ | 3.55 | 30.02 | 44.03 | 20.25
SML [18] | ✗ | ✗ | ✗ | 31.05 | 21.52 | 53.11 | 19.64
GMMSeg-DeepLabV3+ | ✗ | ✗ | ✗ | 55.63 | 6.61 | 76.02 | 15.96
Following [18, 56], we categorize the methods by whether they require retraining, use extra segmentation networks, or utilize OoD data.
As seen, without any add-on post-calibration technique, GMMSeg surpasses the state-of-the-art methods by even larger margins on the challenging test sets than on the val sets, i.e., by 24.58%/22.91% in AP and 14.91%/3.68% in FPR95 on Fishyscapes Lost&Found/Static test. Notably, on Fishyscapes Lost&Found test, GMMSeg even outperforms all benchmarked methods that employ additional networks or OoD training data, verifying its strong robustness to unexpected on-road anomalies, which we attribute to its accurate modeling of the pixel data density.
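For reference, a simple way to turn class-conditional densities into anomaly scores is to flag pixels that are unlikely under every known class. The rule below (negative maximum log-density) is a plausible sketch consistent with the generative formulation, not necessarily the exact scoring used in the released code.

```python
import torch

def anomaly_score(log_px_c: torch.Tensor) -> torch.Tensor:
    # log_px_c: (H*W, C) per-pixel class-conditional log-densities log p(x | c).
    # A pixel is anomalous if no known class explains it well.
    return -log_px_c.max(dim=-1).values
```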
Impact of Memory Capacity. In Table 8, we further explore the influence of the memory capacity, i.e., the amount of pixel representations stored for class-wise EM estimation, with DeepLabV3+-ResNet101 trained on ADE20K for 80K iterations and evaluated on val. For the first row, where the memory size is set to 0, EM is performed only within mini-batches. Not surprisingly, the data distribution estimated at such a local scale is far from accurate, leading to inferior results. The performance improves as the memory capacity grows and saturates once the stored pixel samples are sufficient to represent the true data distribution of the whole training set.

Table 8: Impact of memory size, evaluated on ADE20K [53] val.
# Sample | mIoU (%)
0 | 40.3
8K | 45.1
16K | 45.4
32K | 46.0
48K | 46.0
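The per-class memory ablated in Table 8 can be implemented as a simple FIFO queue of pixel embeddings. The sketch below (class-balanced capacity, most-recent-samples retention) reflects our reading of the ablation rather than the exact released implementation.

```python
import torch

class ClasswiseMemory:
    """Per-class FIFO bank of pixel embeddings used for EM-based GMM estimation."""
    def __init__(self, num_classes: int, capacity: int, dim: int):
        self.capacity = capacity
        self.banks = [torch.empty(0, dim) for _ in range(num_classes)]

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # feats: (N, dim) pixel embeddings; labels: (N,) ground-truth class indices.
        for c in labels.unique().tolist():
            bank = torch.cat([self.banks[c], feats[labels == c]], dim=0)
            # keep only the most recent `capacity` samples; capacity 0 stores nothing
            self.banks[c] = bank[-self.capacity:] if self.capacity > 0 else bank[:0]

    def samples(self, c: int) -> torch.Tensor:
        return self.banks[c]  # fed to the EM step for class c
```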
3 https://fishyscapes.com/results
Figure 6: Qualitative results (§C) of SegFormer [7] and our GMMSeg on ADE20K [53].
Figure 7: Qualitative results (§C) of SegFormer [7] and our GMMSeg on Cityscapes [54].
Figure 8: Qualitative results (§C) of SegFormer [7] and our GMMSeg on COCO-Stuff [55].
Figure 9: Qualitative results (§C) of anomaly heatmaps on Fishyscapes Lost&Found [56] val (columns: input image, MSP [17] with DeepLabV3+ [47], and GMMSeg with DeepLabV3+).