ClusterFormer: Clustering As A Universal Visual Learner (NeurIPS 2023)
Abstract
This paper presents ClusterFormer, a universal vision model that is based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: ① recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and ② feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design streamlines an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that ClusterFormer outperforms various well-known specialized architectures, achieving 83.41% top-1 accuracy on ImageNet-1K for image classification, 54.2% and 47.0% mAP on MS COCO for object detection and instance segmentation, 52.4% mIoU on ADE20K for semantic segmentation, and 55.8% PQ on COCO Panoptic for panoptic segmentation. Given its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
1 Introduction
Computer vision has seen the emergence of specialized solutions for different vision tasks (e.g., ResNet [34] for image classification, Faster RCNN [70] for object detection, and Mask RCNN [33] for instance segmentation), aiming for superior performance. Nonetheless, neuroscience research [73, 65, 82, 5] has shown that the human perceptual system exhibits exceptional interpretive capabilities for complex visual stimuli, without task-specific constraints. This trait of human perceptual cognition diverges from current computer vision techniques [95, 44, 46], which often employ diverse architectural designs.

Figure 1: ClusterFormer is a clustering-based universal model, offering superior performance over various specialized architectures.

Human vision possesses a unique attention mechanism that selectively focuses on relevant parts of the visual field while disregarding irrelevant information [81, 40]. This can be likened to a clustering approach [2, 3, 89], in which individual pixel points are decomposed and reorganized into relevant concepts to address various tasks. This is essentially a hierarchical process that involves combining
∗ Corresponding author.
2 Related Work
Universal Vision Model. Transformers [81] have been instrumental in driving the ambition toward universal models, fostering architectures capable of tackling tasks of different specificity with the same design and embodying the potential of recent developments [23, 17, 16, 95, 96, 4, 80, 30, 57, 86] in the field. In the vision regime, mainstream research endeavors have concentrated on the
development of either encoders [53, 88] or decoders [44, 94]. The encoder is centered around
the effort of developing foundation models [4, 53, 24, 22], trained on extensive data that can be
adapted and fine-tuned to diverse downstream tasks. For instance, Swin Transformer [53] capably
serves as a general-purpose backbone for computer vision by employing a hierarchical structure
consisting of shifted windows; ViT-22B [22] scales the architecture to 22 billion parameters and achieves superior performance on a variety of vision tasks by learning from large-scale data. Conversely,
research on decoders [23, 17, 16, 95, 94, 44, 96, 87, 50, 20, 52, 19, 21, 51, 93, 76, 37, 99, 25, 48] is
designed to tackle homogeneous target tasks, by using queries to depict visual patterns. For instance,
Mask2Former [17] incorporates mask information into the Transformer architecture and unifies
various segmentation tasks (e.g., semantic, instance, and panoptic segmentation); Mask-DINO [44]
extends the decoding process from detection to segmentation by directly utilizing query embeddings
for target task predictions. Conceptually different, we streamline an elegant systemic workflow based
on clustering and handle heterogeneous visual tasks (e.g., image classification, object detection, and
image segmentation) at different clustering granularities.
Clustering in Vision. Traditional clustering algorithms in vision [39, 28, 29, 55, 91, 1, 10, 61, 6, 58]
can be categorized into the hierarchical and partitional modes. The hierarchical methods [62, 38]
involve the modeling of pixel hierarchy and the iterative partitioning and merging of pixel pairs
into clusters until reaching a state of saturation. This approach obviates the necessity of a priori
determination of cluster quantity and circumvents the predicaments arising from local optima [98, 12].
However, it exclusively considers the adjacent pixels at each stage and lacks the capacity to assimilate
prior information regarding the global configuration or dimensions of the clusters [69, 64]. In contrast,
partitional clustering algorithms [78, 36] directly generate a flat structure with a predetermined number
of clusters and exclusively assign pixels to a single cluster. This design exhibits a dynamic nature,
allowing pixels to transition between clusters [11, 63]. By employing suitable measures, this approach
can effectively integrate complex knowledge within cluster centers. As a powerful system, human
vision incorporates the advantages of both clustering modes [89, 83, 67]. We possess the capability
of grouping analogous entities at different scales. Meanwhile, we can also effectively categorize
objects purely based on their shape, color, or texture, without having the hierarchical information.
Drawing on the above insights, we reformulate the attention mechanism (§3.2) in Transformer architectures [81] from a clustering perspective to decipher the hierarchy of visual complexity.
3 Methodology
3.1 Preliminary
Clustering. The objective of clustering is to partition a set of data points, denoted by $X \in \mathbb{R}^{n \times d}$, into $C$ distinct clusters based on their intrinsic similarities while ensuring that each data point belongs
to only one cluster. Achieving this requires optimizing the stratification of the data points, taking into
account both their feature and positional information, to form coherent and meaningful groupings.
Clustering methodologies typically employ advanced similarity metrics, such as cosine similarity,
to measure the proximity between data points and cluster centroids. Additionally, they consider the
spatial locality of the points to make more precise group assignments.
Cross-Attention for Generic Clustering. Drawing inspiration from the Transformer decoder architecture [81], contemporary end-to-end architectures [17, 9] utilize a query-based approach in which a set of $K$ queries, $C = [c_1; \cdots; c_K] \in \mathbb{R}^{K \times D}$, is learned and updated by a series of cross-attention blocks. In this context, we rethink the term "$C$" to associate queries with cluster centers at each layer. Specifically, cross-attention is employed at each layer to adaptively aggregate image features and subsequently update the queries:
$$\hat{C} = C + \mathrm{softmax}_{HW}\big(Q^{C}(K^{I})^{\top}\big)V^{I}, \qquad (1)$$
where $Q^{C} \in \mathbb{R}^{K \times D}$, $K^{I} \in \mathbb{R}^{HW \times D}$, and $V^{I} \in \mathbb{R}^{HW \times D}$ represent linearly projected features for query, key, and value, respectively. The superscripts "$C$" and "$I$" denote features projected from the centers and the image features, respectively. Motivated by [95], we follow a reinterpretation of the cross-attention mechanism as a clustering solver by considering queries as cluster centers and applying the softmax function along the query dimension ($K$) instead of the image resolution ($HW$):
$$C = \mathrm{softmax}_{K}\big(Q^{C}(K^{I})^{\top}\big)V^{I}. \qquad (2)$$
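To make the reinterpretation concrete, the following minimal PyTorch sketch contrasts the two softmax directions on random tensors; the tensor shapes, projection layers, and the absence of attention scaling are simplifying assumptions of ours, not the released implementation.

```python
import torch
import torch.nn.functional as F

K, HW, D = 8, 196, 64                       # number of centers, flattened spatial size, channels
centers = torch.randn(K, D)                 # C: queries, interpreted as cluster centers
feats = torch.randn(HW, D)                  # I: image features

to_q, to_k, to_v = (torch.nn.Linear(D, D) for _ in range(3))
Q, Kf, V = to_q(centers), to_k(feats), to_v(feats)
logits = Q @ Kf.t()                         # (K, HW) similarity between centers and pixels

# Vanilla cross-attention, Eq. (1): softmax over the spatial axis HW (each row sums to 1).
updated = centers + F.softmax(logits, dim=1) @ V

# Clustering view, Eq. (2): softmax over the center axis K (each column sums to 1),
# so every pixel distributes a unit of assignment mass over the K cluster centers.
assignment = F.softmax(logits, dim=0)       # "soft" cluster assignment, K x HW
new_centers = assignment @ V                # aggregated features become the new centers
```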
3.2 ClusterFormer

In this subsection, we present ClusterFormer (see Fig. 2(a)). The model has a series of hierarchical stages that enable multi-scale representation learning for universal adaptation. At each stage, image patches are tokenized into feature embeddings [81, 53, 24], which are grouped into distinct clusters via a unified pipeline: first recurrent cross-attention clustering and then feature dispatching.
Figure 2: (a) Overall pipeline of ClusterFormer. (b) Each Recurrent Cross-Attention Clustering layer carries out T iterations of cross-attention clustering (E-step) and center updating (M-step) (see Eq. 3). (c) The feature dispatching redistributes the feature embeddings on top of the updated cluster centers (see Eq. 6).

Recurrent Cross-Attention Clustering. Considering the feature embeddings $I \in \mathbb{R}^{HW \times D}$ and initial centers $C^{(0)}$, we encapsulate the iterative Expectation-Maximization (EM) clustering process, consisting of $T$ iterations, within a Recurrent EM Cross-Attention layer (see Fig. 2(b)):
$$\text{E-step:}\quad \hat{M}^{(t)} = \mathrm{softmax}_{K}\big(Q^{C^{(t)}}(K^{I})^{\top}\big),$$
$$\text{M-step:}\quad C^{(t+1)} = \hat{M}^{(t)}\,V^{I} \in \mathbb{R}^{K \times D}, \qquad (3)$$
where $t \in \{1, \cdots, T\}$ and $\hat{M} \in [0, 1]^{K \times HW}$ represents the "soft" cluster assignment matrix (i.e., probability maps of $K$ clusters). As defined in Section 3.1, $Q^{C} \in \mathbb{R}^{K \times D}$ denotes the query vector projected from the center $C$, and $V^{I}, K^{I} \in \mathbb{R}^{HW \times D}$ correspond to the value and key vectors, respectively, projected from the image features $I$. The Recurrent Cross-Attention approach iteratively updates cluster membership $\hat{M}$ (i.e., E-step) and centers $C$ (i.e., M-step). This dynamic updating strategy embodies the essence of partitional clustering. It enjoys a few appealing characteristics:
• Efficiency: While the vanilla self-attention mechanism has a time complexity of $\mathcal{O}(H^{2}W^{2}D)$, the Recurrent Cross-Attention approach exhibits a lower complexity of $\mathcal{O}(TKHWD)$. This is primarily due to the fact that $TK \ll HW$ (i.e., 4165 in Swin [53] vs. 1200 in ours). Specifically, considering the nature of the pyramid architecture [88, 53] during the encoding process, $TK$ can indeed be much smaller than $HW$, especially in the earlier stages. It is important to note that during each iteration, merely the $Q$ matrix requires an update, while the $K$ and $V$ matrices necessitate a single computation. Consequently, the whole model enjoys systemic efficiency (see Table 6c).
• Transparency: The transparency hinges on the unique role that cluster centers play in our Recurrent
Cross-Attention mechanism. The cluster centers, derived through our clustering process, act as
‘prototypes’ for the features they cluster. These ‘prototypes’ serve as representative samples for each cluster, reflecting the most salient or characteristic features of the data points within
that cluster. Moreover, the Recurrent Cross-Attention method adheres to the widely-established EM
clustering algorithm, offering a lucid and transparent framework. This cluster center assignment
behaves in a human-understandable manner (see Fig. 3) during representation learning and fosters
ad-hoc explainability, allowing for a more intuitive understanding of the underlying relationships.
• Non-parametric fashion: The Recurrent Cross-Attention mechanism achieves a recursive nature by
sharing the projection weights for query, key, and value across iterations. This approach effectively
ensures recursiveness without the introduction of additional learnable parameters (see Table 6b).
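To illustrate the recursion of Eq. (3) and the parameter sharing noted above, here is a minimal PyTorch sketch: the key/value projections are computed once, only the center-side query is refreshed per iteration, and no extra parameters are introduced across iterations. Module and argument names are our own; this is not the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentCrossAttentionClustering(nn.Module):
    """T iterations of E-step (soft assignment) and M-step (center update), as in Eq. (3)."""
    def __init__(self, dim: int, iters: int = 3):
        super().__init__()
        self.iters = iters
        # A single set of projections shared across iterations (non-parametric recursion).
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor, centers: torch.Tensor):
        # feats: (HW, D) image tokens; centers: (K, D) initial cluster centers C^(0).
        k, v = self.to_k(feats), self.to_v(feats)           # computed once
        for _ in range(self.iters):
            q = self.to_q(centers)                          # refreshed every iteration
            assign = F.softmax(q @ k.t(), dim=0)            # E-step: (K, HW) soft assignment M̂
            centers = assign @ v                            # M-step: (K, D) updated centers
        return centers, assign

# Example: 100 centers over a 56x56 feature map with 96 channels.
rca = RecurrentCrossAttentionClustering(dim=96, iters=3)
feats = torch.randn(56 * 56, 96)
centers0 = torch.randn(100, 96)
centers, assign = rca(feats, centers0)
```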
Since the overall architecture is hierarchical, Recurrent Cross-Attention is able to thoroughly explore the representational granularity, which mirrors the process of hierarchical clustering:
$$C^{l} = \mathrm{RCA}^{l}(I^{l}, C^{l}_{0}), \qquad (4)$$
where RCA stands for the recurrent cross-attention layer, $I^{l}$ is the image feature map at the $l$-th layer, obtained by a standard pooling operation at $H/2^{l} \times W/2^{l}$ resolution, $C^{l}$ is the cluster center matrix for the $l$-th layer, and $C^{l}_{0}$ contains the initial centers at the $l$-th layer. The parameters of the Recurrent Cross-Attention layers, i.e., $\{\mathrm{RCA}^{l}\}_{l=1}^{L}$, are not shared across layers. In addition, we initialize the centers from image grids:
$$[c^{(0)}_{1}; \cdots; c^{(0)}_{K}] = \mathrm{FFN}\big(\mathrm{Adaptive\_Pooling}_{K}(I)\big), \qquad (5)$$
where FFN stands for a position-wise feed-forward network, an integral part of the Transformer architecture, comprising two fully connected layers with an activation function in the hidden layer. $\mathrm{Adaptive\_Pooling}_{K}(I)$ refers to selecting $K$ feature centers from $I$ using adaptive sampling, which calculates an appropriate window size to achieve the desired output size adaptively, offering more flexibility and precision compared to traditional pooling methods.
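One possible reading of Eq. (5) is sketched below with PyTorch's adaptive average pooling: K centers are sampled on a regular grid from the stage's feature map and refined by a position-wise FFN. The grid layout (a √K × √K pooling target) and the FFN width are assumptions on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_centers(feat_map: torch.Tensor, num_centers: int, ffn: nn.Module) -> torch.Tensor:
    """feat_map: (D, H, W) stage features; returns (K, D) initial centers C^(0), cf. Eq. (5)."""
    d = feat_map.shape[0]
    grid = int(num_centers ** 0.5)                                  # assume K is a perfect square, e.g. 100 -> 10x10
    pooled = F.adaptive_avg_pool2d(feat_map, output_size=(grid, grid))  # sample K grid cells adaptively
    centers = pooled.reshape(d, grid * grid).t()                    # (K, D)
    return ffn(centers)

D, K = 96, 100
# Position-wise FFN: two fully connected layers with an activation in between.
ffn = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
centers0 = init_centers(torch.randn(D, 56, 56), K, ffn)             # (100, 96)
```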
Feature Dispatching. After the cluster assignment, the proposed method employs an adaptive process that dispatches each patch within a cluster based on similarity (see Fig. 2(c)), leading to a more coherent and representative understanding of the overall structure and context within the cluster. For every patch embedding $p_{i} \in I$, the updated patch embedding $p'_{i}$ is computed as:
$$p'_{i} = p_{i} + \mathrm{MLP}\Big(\frac{1}{K}\sum_{k=1}^{K}\mathrm{sim}(C_{k}, p_{i}) \cdot C_{k}\Big). \qquad (6)$$
This equation represents the adaptive dispatching of feature embeddings by considering the similarity between each feature embedding and the cluster centers $C$, weighted by their respective similarities. By incorporating the intrinsic information from the cluster centers, the method refines the feature embeddings, enhancing the overall understanding of the image's underlying structure and context.
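As a sketch of Eq. (6): each patch embedding receives the similarity-weighted average of the cluster centers, passed through an MLP and added back residually. The choice of cosine similarity and the MLP shape are assumptions here, not necessarily the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dispatch_features(patches: torch.Tensor, centers: torch.Tensor, mlp: nn.Module) -> torch.Tensor:
    """patches: (HW, D); centers: (K, D); returns updated patch embeddings p'_i, cf. Eq. (6)."""
    # Pairwise similarity sim(C_k, p_i) between every patch and every center.
    sim = F.cosine_similarity(patches.unsqueeze(1), centers.unsqueeze(0), dim=-1)   # (HW, K)
    # Similarity-weighted mean of centers: (1/K) * sum_k sim(C_k, p_i) * C_k.
    weighted = sim @ centers / centers.shape[0]                                      # (HW, D)
    return patches + mlp(weighted)                                                   # residual update

D = 96
mlp = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
patches = torch.randn(56 * 56, D)
centers = torch.randn(100, D)
updated = dispatch_features(patches, centers, mlp)                                   # (3136, 96)
```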
All feature representations are utilized for handling the target tasks in the decoding process. In §3.3, we provide more details on the implementation of the end tasks. The overall implementation and framework of ClusterFormer are shown in Fig. 2(a). We follow the architecture and configuration of the Swin Transformer [53]. The code will be made available.
4 Experiment
We evaluate our method on five vision tasks, viz., image classification, object detection, semantic segmentation, instance segmentation, and panoptic segmentation, over four benchmarks.
ImageNet-1K for Image Classification. ImageNet-1K [72] includes high-resolution images spanning 1,000 distinct categories (e.g., animals, plants, and vehicles). Following conventional procedures, the dataset is split into 1.2M/50K/100K images for the train/validation/test splits.
MS COCO for Object Detection and Instance Segmentation. COCO [49] dataset features dense
annotations for 80 common objects in daily contexts. Following standard practices [49], the dataset
is split into 115K/5K/20K images for train2017/val2017/test-dev splits.
ADE20K for Semantic Segmentation. ADE20K [101] dataset offers an extensive collection of
images with pixel-level annotations, containing 150 diverse object categories in both indoor and
outdoor scenes. The dataset comprises 20K/2K/3K images for train/val/test splits.
COCO Panoptic for Panoptic Segmentation. The COCO Panoptic dataset [42] includes 80 “thing”
categories and a carefully annotated set of 53 “stuff” categories. In line with standard practices [42],
the COCO Panoptic dataset is split into 115K/5K/20K images for the train/val/test splits as well.
The ensuing section commences by presenting the main results for each task (§4.1), succeeded by a series of ablative studies (§4.2) that aim to confirm the efficacy of each design component.
4.1 Main Results
4.1.1 Image Classification

Training. We use mmclassification² as the codebase and follow its default training settings. The default configuration for our model sets the number of centers to 100. To optimize the model's performance, we employ cross-entropy as the default loss function, which is widely used in classification tasks and helps minimize the difference between predicted probabilities and the ground truth. For the training details, we run the model for 300 epochs, allowing sufficient time for the model to learn and converge. The learning rate is initialized at 0.001 by default and scheduled with a cosine annealing policy, which gradually decreases it over time. Due to limitations in our GPU capacity, we set the total batch size to 1024. Models are trained from scratch on sixteen A100 GPUs.
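For orientation, the recipe above roughly corresponds to the generic PyTorch setup below; the optimizer family and weight decay are our assumptions, and the stand-in module merely takes the place of ClusterFormer-Tiny (the actual mmclassification config may differ).

```python
import torch
import torch.nn as nn

# Tiny stand-in module; in practice this would be ClusterFormer-Tiny with a classification head.
model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1000))

epochs, total_batch_size, base_lr = 300, 1024, 1e-3                  # values reported in the text
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)  # optimizer/WD assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)   # cosine annealing policy
criterion = nn.CrossEntropyLoss()

# One illustrative optimization step on random data (stand-in for an ImageNet-1K batch).
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```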
Results on ImageNet. Table 1 compares our results with several well-known methods. ClusterFormer exceeds the Swin Transformer [53] by 0.13% and 0.39% top-1 accuracy on the Tiny-based and Small-based models with fewer parameters (i.e., 27.85M vs. 28.29M and 48.71M vs. 49.61M), respectively. On top-5 accuracy, our approach also outperforms Swin-Tiny and Swin-Small with gains of 0.71% and 0.84%, respectively. In addition, our margins over the ResNet family [34] are 3.44% ∼ 4.76% top-1 accuracy with on-par parameters (i.e., 27.85M vs. 25.56M and 48.71M vs. 44.55M).

Table 1: Classification top-1 and top-5 accuracy on ImageNet [72] val (see §4.1.1 for details).
Method | #Params | top-1 | top-5
Context Cluster-Tiny [ICLR23][58] | 5.3M | 71.68% | 90.49%
DeiT-Tiny [ICML21][80] | 5.72M | 74.50% | 92.25%
PViG-Tiny [NeurIPS22][31] | 9.46M | 78.38% | 94.38%
ResNet-50 [CVPR2016][34] | 25.56M | 76.55% | 93.06%
Swin-Tiny [ICCV2021][53] | 28.29M | 81.18% | 95.61%
ClusterFormer-Tiny | 27.85M | 81.31% | 96.32%
Context Cluster-Small [ICLR23][58] | 14.0M | 77.42% | 93.69%
DeiT-Small [ICML21][80] | 22.05M | 80.69% | 95.06%
PViG-Small [NeurIPS22][31] | 29.02M | 82.00% | 95.97%
ResNet-101 [CVPR2016][34] | 44.55M | 77.97% | 94.06%
Swin-Small [ICCV2021][53] | 49.61M | 83.02% | 96.29%
ClusterFormer-Small | 48.71M | 83.41% | 97.13%
4.1.2 Object Detection

Training. We use mmdetection³ as the codebase and follow its default training settings. For a fair comparison, we follow the training protocol in [17]: 1) the number of instance centers is set to 100; 2) a linear combination of the L1 loss and the GIoU loss is used as the optimization objective for bounding box regression, with coefficients set to 5 and 2, respectively. In addition, the final object centers are fed into a small FFN for object classification, trained with a binary cross-entropy loss. Moreover, we set the initial learning rate to 1 × 10⁻⁵, the number of training epochs to 50, and the batch size to 16. We use random scale jittering with a factor in [0.1, 2.0] and a crop size of 1024×1024.
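The bounding-box objective described above (L1 plus GIoU with coefficients 5 and 2) can be sketched as a small helper using torchvision's pairwise GIoU op; the (x1, y1, x2, y2) box format and the per-pair reduction are assumptions on our part, not the exact mmdetection implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def box_regression_loss(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """Combined L1 + GIoU objective with the coefficients quoted above (5 and 2).

    Boxes are (N, 4) in (x1, y1, x2, y2) format, matched one-to-one.
    """
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes)).mean()   # keep matched pairs only
    return 5.0 * l1 + 2.0 * (1.0 - giou)                                  # GIoU loss = 1 - GIoU

pred = torch.tensor([[10., 10., 50., 60.], [20., 30., 80., 90.]], requires_grad=True)
gt = torch.tensor([[12., 8., 48., 62.], [25., 28., 78., 88.]])
loss = box_regression_loss(pred, gt)
loss.backward()
```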
Test. We use one input image scale with the shorter side set to 800.
Metric. We adopt AP, AP50, AP75, APS, APM, and APL.
Performance Comparison. In Table 2, we present the numerical results of ClusterFormer for object detection. We observe that it surpasses all counterparts [70, 7, 56, 77, 9, 75, 60, 102, 71, 96] with remarkable gains in mAP. In particular, ClusterFormer-Tiny exceeds the vanilla Deformable DETR [102], Sparse-DETR [71], and DINO [96] over Swin-T [53] by 6.5%, 3.4%, and 0.8% mAP, respectively. In addition, our approach also outperforms these methods over Swin-S [53] (i.e., 54.2% vs. 48.3%, 49.9%, and 53.3% mAP, respectively). Notably, ClusterFormer achieves this impressive performance without relying on additional augmentation.
4.1.3 Semantic Segmentation

Training. We use mmsegmentation⁴ as the codebase and follow its default training settings. For semantic segmentation, the number of cluster centers is set to match the number of semantic categories, which is 150 for ADE20K [101]. Following the approach employed in recent works [97, 17, 74], we adopt a combination of the standard cross-entropy loss and an auxiliary dice loss as the loss function. By default, the coefficients for the cross-entropy and dice losses are set to 5 and 1, respectively. In addition, we configure the initial learning rate to 1 × 10⁻⁵, the number of training epochs to 50, and the batch size to 16.
² https://github.com/open-mmlab/mmclassification
³ https://github.com/open-mmlab/mmdetection
⁴ https://github.com/open-mmlab/mmsegmentation
Table 2: Quantitative results on COCO [49] test-dev for object detection (see §4.1.2 for details).
Algorithm | Backbone | Epoch | mAP↑ | AP50↑ | AP75↑ | APS↑ | APM↑ | APL↑
Faster R-CNN [NeurIPS15][70] | ResNet-101 | 36 | 41.7 | 62.3 | 45.7 | 24.7 | 46.0 | 53.2
Cascade R-CNN [CVPR18][7] | ResNet-101 | 36 | 42.8 | 61.1 | 46.7 | 24.9 | 46.5 | 56.4
Grid R-CNN [CVPR19][56] | ResNet-50 | 24 | 40.4 | 58.5 | 43.6 | 22.7 | 43.9 | 53.0
EfficientDet [CVPR20][77] | Efficient-B3 | 300 | 45.4 | 63.9 | 49.3 | 27.1 | 49.5 | 61.3
DETR [ECCV20][9] | ResNet-50 | 150 | 39.9 | 60.4 | 41.7 | 17.6 | 43.4 | 59.4
Sparse R-CNN [CVPR21][75] | ResNet-101 | 36 | 46.2 | 65.1 | 50.4 | 29.5 | 49.2 | 61.7
Conditional DETR [ICCV21][60] | ResNet-50 | 50 | 41.1 | 61.9 | 43.5 | 20.4 | 44.5 | 59.9
Deformable DETR [ICLR21][102] | Swin-T | 50 | 45.5±0.26 | 65.2±0.20 | 49.8±0.21 | 27.0±0.26 | 49.1±0.24 | 60.7±0.29
Deformable DETR [ICLR21][102] | Swin-S | 50 | 48.3±0.21 | 68.7±0.27 | 52.1±0.27 | 30.5±0.28 | 51.6±0.22 | 64.4±0.19
Sparse-DETR [ICLR22][71] | Swin-T | 50 | 48.6±0.24 | 69.6±0.20 | 53.5±0.23 | 30.1±0.27 | 51.8±0.21 | 64.9±0.29
Sparse-DETR [ICLR22][71] | Swin-S | 50 | 49.9±0.21 | 70.3±0.27 | 54.0±0.26 | 32.5±0.22 | 53.6±0.28 | 66.2±0.25
DINO [ICLR23][96] | Swin-T | 50 | 51.2±0.26 | 68.4±0.25 | 55.3±0.26 | 31.3±0.24 | 55.1±0.38 | 65.8±0.26
DINO [ICLR23][96] | Swin-S | 50 | 53.3±0.27 | 70.9±0.38 | 57.6±0.23 | 33.8±0.23 | 56.4±0.32 | 66.9±0.26
ClusterFormer | Ours-Tiny | 50 | 52.0±0.32 | 70.4±0.25 | 57.5±0.32 | 34.2±0.28 | 54.8±0.29 | 64.8±0.22
ClusterFormer | Ours-Small | 50 | 54.2±0.33 | 71.8±0.16 | 59.1±0.17 | 35.6±0.28 | 57.2±0.20 | 67.4±0.18
Furthermore, we employ random scale jittering, applying a factor within the range of [0.5, 2.0], and utilize a crop size with a fixed resolution of 640 × 640 pixels.
Test. During the testing phase, we re-scale the input image with the shorter side to 640 pixels without applying any additional data augmentation at test time.
Metric. Mean intersection-over-union (mIoU) is used for assessing semantic segmentation performance.
Performance Comparison. Table 3 shows the results on semantic segmentation. Empirically, our method compares favorably to recent transformer-based approaches [54, 15, 32, 100, 74, 90, 95, 17]. For instance, ClusterFormer-Tiny surpasses both recent advancements, i.e., kMaX-Deeplab [95] and Mask2Former [17] with Swin-T [53] (i.e., 49.1% vs. 48.3% vs. 48.5% mIoU). Moreover, ClusterFormer-Small achieves 52.4% mIoU and outperforms all other methods in terms of mIoU, making it competitive with state-of-the-art methods as well.

Table 3: Quantitative results on ADE20K [101] val for semantic segmentation (see §4.1.3 for details).
Algorithm | Backbone | Epoch | mIoU↑
FCN [CVPR2015][54] | ResNet-50 | 50 | 36.0
DeeplabV3+ [ECCV2018][15] | ResNet-50 | 50 | 42.7
APCNet [CVPR2019][32] | ResNet-50 | 100 | 43.4
SETR [CVPR2021][100] | ViT-L | 100 | 49.3
Segmenter [ICCV2021][74] | ViT-B | 100 | 52.1
Segformer [NeurIPS2021][90] | MIT-B5 | 100 | 51.4
kMaX-Deeplab [ECCV2022][95] | ConvNeXt-T | 100 | 48.3±0.15
kMaX-Deeplab [ECCV2022][95] | ConvNeXt-S | 100 | 51.6±0.23
Mask2Former [CVPR2022][17] | Swin-T | 100 | 48.5±0.24
Mask2Former [CVPR2022][17] | Swin-S | 100 | 51.1±0.21
ClusterFormer | Ours-Tiny | 100 | 49.1±0.19
ClusterFormer | Ours-Small | 100 | 52.4±0.23
Table 4: Quantitative results on COCO [49] test-dev for instance segmentation (see §4.1.4 for details).
Algorithm | Backbone | Epoch | mAP↑ | AP50↑ | AP75↑ | APS↑ | APM↑ | APL↑
Mask R-CNN [ICCV2017][33] | ResNet-101 | 12 | 36.1 | 57.5 | 38.6 | 18.8 | 39.7 | 49.5
Cascade MR-CNN [PAMI2019][8] | ResNet-101 | 12 | 37.3 | 58.2 | 40.1 | 19.7 | 40.6 | 51.5
HTC [CVPR2019][14] | ResNet-101 | 20 | 39.6 | 61.0 | 42.8 | 21.3 | 42.9 | 55.0
PointRend [CVPR2020][43] | ResNet-50 | 12 | 36.3 | 56.9 | 38.7 | 19.8 | 39.4 | 48.5
BlendMask [CVPR2020][13] | ResNet-101 | 36 | 38.4 | 60.7 | 41.3 | 18.2 | 41.5 | 53.3
QueryInst [ICCV2021][26] | ResNet-101 | 36 | 41.0 | 63.3 | 44.5 | 21.7 | 44.4 | 60.7
SOLQ [NeurIPS2021][23] | Swin-L† | 50 | 46.7 | 72.7 | 50.6 | 29.2 | 50.1 | 60.9
SparseInst [CVPR2022][18] | ResNet-50 | 36 | 37.9 | 59.2 | 40.2 | 15.7 | 39.4 | 56.9
Mask2Former [CVPR2022][17] | Swin-T | 50 | 44.5±0.16 | 67.3±0.15 | 47.7±0.24 | 23.9±0.20 | 48.1±0.16 | 66.4±0.15
Mask2Former [CVPR2022][17] | Swin-S | 50 | 46.0±0.21 | 68.4±0.22 | 49.8±0.24 | 25.4±0.19 | 49.7±0.22 | 67.4±0.24
Mask-Dino [CVPR2023][44] | Swin-T | 50 | 45.8±0.28 | 69.6±0.29 | 50.2±0.26 | 26.0±0.28 | 48.7±0.37 | 66.4±0.29
Mask-Dino [CVPR2023][44] | Swin-S | 50 | 46.5±0.39 | 70.1±0.34 | 52.2±0.28 | 27.6±0.34 | 49.9±0.25 | 69.5±0.29
ClusterFormer | Ours-Tiny | 50 | 45.9±0.26 | 69.1±0.21 | 49.5±0.18 | 25.2±0.22 | 50.1±0.24 | 68.8±0.24
ClusterFormer | Ours-Small | 50 | 47.0±0.19 | 71.5±0.26 | 51.8±0.24 | 27.3±0.16 | 50.5±0.20 | 72.6±0.22
Table 5: Quantitative results on COCO Panoptic [42] val for panoptic segmentation (see §4.1.5 for details).
Algorithm | Backbone | Epoch | PQ↑ | PQ^Th↑ | PQ^St↑ | mAP^Th_pan↑ | mIoU_pan↑
Panoptic-FPN [CVPR2019][41] | ResNet-101 | 20 | 44.0 | 52.0 | 31.9 | 34.0 | 51.5
UPSNet [CVPR2019][92] | ResNet-101 | 12 | 46.2 | 52.8 | 36.5 | 36.3 | 56.9
Panoptic-Deeplab [CVPR2020][16] | Xception-71 | 12 | 41.2 | 44.9 | 35.7 | 31.5 | 55.4
Panoptic-FCN [CVPR2021][45] | ResNet-50 | 12 | 44.3 | 50.0 | 35.6 | 35.5 | 55.0
Max-Deeplab [CVPR2021][85] | Max-L | 55 | 51.1 | 57.0 | 42.2 | – | –
CMT-Deeplab [CVPR2022][94] | Axial-R104† | 55 | 54.1 | 58.8 | 47.1 | – | –
Panoptic Segformer [CVPR2022][46] | ResNet-50 | 24 | 49.6±0.25 | 54.4±0.26 | 42.4±0.25 | 39.5±0.20 | 60.8±0.21
Panoptic Segformer [CVPR2022][46] | ResNet-101 | 24 | 50.6±0.21 | 55.5±0.24 | 43.2±0.20 | 40.4±0.21 | 62.0±0.22
Mask2Former [CVPR2022][17] | Swin-T | 50 | 53.2±0.25 | 59.1±0.22 | 43.3±0.23 | 42.3±0.27 | 62.9±0.19
Mask2Former [CVPR2022][17] | Swin-S | 50 | 54.1±0.29 | 60.2±0.28 | 45.6±0.18 | 43.1±0.23 | 63.6±0.31
Mask Dino [CVPR2023][44] | Swin-T | 50 | 53.6±0.29 | 59.5±0.26 | 44.0±0.24 | 44.3±0.29 | 63.2±0.27
Mask Dino [CVPR2023][44] | Swin-S | 50 | 54.9±0.33 | 61.1±0.23 | 46.2±0.26 | 45.0±0.22 | 64.3±0.30
ClusterFormer | Ours-Tiny | 50 | 54.7±0.22 | 60.8±0.31 | 46.1±0.20 | 43.4±0.25 | 64.0±0.20
ClusterFormer | Ours-Small | 50 | 55.8±0.38 | 61.9±0.39 | 47.2±0.23 | 44.2±0.22 | 65.5±0.21
loss, and a semantic segmentation loss. More information about $L_{aux}$ can be found in [85, 95]. The coefficients $\lambda_{th}$, $\lambda_{st}$, and $\lambda_{aux}$ are assigned the values of 5, 3, and 1, respectively. Furthermore, the final centers are input into a small feed-forward neural network (FFN) for semantic classification, which is trained using a binary cross-entropy loss. Moreover, we set the initial learning rate to 1 × 10⁻⁵, the number of training epochs to 50, and the batch size to 16. We also employ random scale jittering with a factor range of [0.1, 2.0] and a crop size of 1024×1024.
Test. We use one input image scale with the shorter side set to 800.
Metric. We employ the PQ metric [42] and report PQ^Th and PQ^St for the "thing" and "stuff" classes, respectively. To ensure comprehensiveness, we also include mAP^Th_pan, which evaluates mean average precision on "thing" classes using instance segmentation annotations, and mIoU_pan, which calculates mIoU for semantic segmentation by merging instance masks belonging to the same category, using the same model trained for the panoptic segmentation task.
Performance Comparison. We perform a comprehensive comparison against two divergent groups of state-of-the-art methods: universal approaches [46, 17, 44] and specialized panoptic methods [41, 92, 16, 45, 85, 97, 94]. As shown in Table 5, ClusterFormer outperforms both types of rivals. For instance, ClusterFormer-Tiny is clearly ahead of Mask2Former [17] (i.e., 54.7% PQ vs. 53.2% PQ) and Mask-Dino [44] (i.e., 54.7% PQ vs. 53.6% PQ) on top of Swin-T [53], and ClusterFormer-Small achieves promising gains of 1.7% and 0.9% PQ against Mask2Former and Mask-Dino over Swin-S, respectively. Moreover, in terms of mAP^Th_pan and mIoU_pan, ClusterFormer also achieves outstanding performance beyond counterpart approaches.
4.2 Ablative Studies

This section ablates ClusterFormer's key components on the ImageNet [72] and MS COCO [49] validation splits. All experiments use the tiny model.

Table 6: A set of ablative studies on the ImageNet [72] validation and MS COCO [49] test-dev splits (see §4.2). The adopted designs are marked in red.

(a) Key Component Analysis
Algorithm Component | #Params | top-1 | top-5
BASELINE | 21.73M | 74.59 | 91.73
+ Recurrent Cross-Attention Clustering | 26.27M | 80.57 | 95.22
+ Feature Dispatching | 23.46M | 78.58 | 94.68
ClusterFormer (both) | 27.85M | 81.31 | 96.32

(b) Number of Recursions
Numbers (T) | #Params | top-1 | top-5
1 | 27.85M | 81.06 | 96.23
2 | 27.85M | 81.22 | 96.29
3 | 27.85M | 81.31 | 96.32
4 | 27.85M | 81.33 | 96.33

(c) Recurrent Cross-Attention Clustering
Variant Cluster Center Updating Strategy | #Params | top-1 | top-5
Cosine Similarity | 23.88M | 78.79 | 94.36
Vanilla Cross-Attention [81] | 35.48M | 79.67 | 94.95
Criss Cross-Attention [35] | 34.16M | 79.91 | 95.24
K-Means [95] | 27.71M | 80.96 | 95.57
Recurrent Cross-Attention | 27.85M | 81.31 | 96.32

(d) Head Dimension
Head Dimension | #Params | top-1 | top-5
16 | 17.25M | 71.69 | 90.16
24 | 22.88M | 75.37 | 92.45
32 | 27.85M | 81.31 | 96.32
40 | 32.81M | 82.21 | 97.09
48 | 38.14M | 82.40 | 97.22

(e) Feature Dispatching
Feature Dispatching | #Params | top-1 | top-5
None | 26.27M | 80.57 | 95.22
Vanilla FC Layer | 27.14M | 80.83 | 95.47
Confidence-Based [68] | 26.81M | 80.69 | 95.30
FC w/ Similarity [58] | 27.46M | 80.96 | 95.84
Ours (Eq. 6) | 27.85M | 81.31 | 96.32

(f) Decoder Query Initialization for instance segmentation
Decoder Query Initialization | mAP↑ | AP50↑ | AP75↑
Free Parameters | 44.2 | 66.3 | 46.4
Direct Feature Embedding [17] | 44.5 | 67.3 | 47.2
Mixed Query Selection [44] | 44.9 | 67.9 | 47.8
Scene-Adoptive Embedding [47] | 45.1 | 67.8 | 48.0
Centers from Encoder (Ours) | 45.9 | 69.1 | 49.5

Key Component Analysis. We first investigate the two major elements of ClusterFormer, specifically, Recurrent Cross-Attention Clustering for center updating and Feature Dispatching for feature updating. We construct a BASELINE model without any center updating or feature dispatching technique. As shown in Table 6a, BASELINE achieves 74.59% top-1 and 91.73% top-5 accuracy. Upon applying Recurrent Cross-Attention Clustering to the BASELINE, we observe consistent and substantial improvements in both top-1 accuracy (74.59% → 80.57%) and top-5 accuracy (91.73% → 95.22%). This highlights the importance of the center updating strategy and validates the effectiveness of our approach, even without explicitly performing clustering. Furthermore, after incorporating Feature Dispatching into the BASELINE, we achieve significant gains of 3.99% in top-1 accuracy and 2.95% in top-5 accuracy. Finally, by integrating both core techniques, ClusterFormer delivers the best performance across both metrics. This indicates that the proposed Recurrent Cross-Attention Clustering and Feature Dispatching work synergistically and validates the effectiveness of our comprehensive algorithmic design.
Recurrent Cross-Attention Clustering. We next study the impact of our Recurrent Cross-Attention Clustering (Eq. 4) by contrasting it with cosine-similarity updating, basic cross-attention [81], Criss-Cross attention [35], and K-Means cross-attention [95]. As illustrated in Table 6c, our Recurrent Cross-Attention proves to be effective: it outperforms the cosine similarity, vanilla, Criss-Cross, and K-Means variants by 2.52%, 1.64%, 1.40%, and 0.15% top-1 accuracy, respectively. It is also efficient: its #Params are significantly lower than those of the vanilla and Criss-Cross attention variants and on par with K-Means, in line with our analysis in §3.2. To gain further insights into recursive clustering, we examine the effect of the
recursion number T in Table 6b. We discover that performance progressively improves from 81.06%
to 81.31% in top-1 accuracy when increasing T from 1 to 3, but remains constant after running
additional iterations. We also observe that the computation cost increases as T increases. Consequently, we set T = 3 as the default to strike an optimal balance between accuracy and computation cost.
Multi-head Dimension. We then ablate the head embedding dimension for the attention head in
Table 6d. We find that performance significantly improves from 71.69% to 82.40% in top-1 accuracy
when increasing the dimension from 16 to 48, but #Params steadily increase as the dimension grows.
For a fair comparison with Swin [53], we set the head dimension to 32 as our default.
Feature Dispatching. We further analyze the influence of our Feature Dispatching. As outlined
in Table 6e, in a standard manner without any dispatching method, the model attains 80.57% top-1
accuracy and 95.22% top-5 accuracy. By applying a vanilla fully connected layer to update the features, we witness a marginal increase of 0.26% in top-1 accuracy. Moreover, using the confidence-based updating method [68] and a fully connected layer with similarity, the model demonstrates noticeable enhancements of 0.12% and 0.39% in top-1 accuracy, respectively. Last, our method yields significant performance advancements across both metrics, i.e., 81.31% top-1 and 96.32% top-5 accuracy.
Decoder Query Initialization. Last, we examine the impact of query initialization in the decoder on a downstream task (i.e., instance segmentation) in Table 6f. With free-parameter initialization, the base model achieves 44.2% mAP. Applying direct feature embedding brings a slight improvement of 0.3% mAP. In addition, the model exhibits further improvements in mAP, achieving 44.9% and 45.1%, respectively, by employing mixed query selection [44] and scene-adoptive embedding [47]. Outstandingly, ClusterFormer achieves the highest performance on all three metrics, i.e., 45.9% mAP, 69.1% AP50, and 49.5% AP75. The empirical evidence supports our design of using the cluster centers from the encoder to derive the initial queries for the decoder, which facilitates transferability in representation learning.

Figure 3: Visualization of center-feature assignment at the last stage of recurrent cross-attention clustering, with a resolution of 7×7. The map displays distinct clusters, each containing features with similar representations.

Ad-hoc Explainability. We visualize the cluster assignment map for image classification in Fig. 3. This figure provides an insightful illustration of how ClusterFormer groups similar features together. Each color represents a cluster of features that share common characteristics.
5 Conclusion
This study adopts an epistemological perspective centered on the clustering-based paradigm and advocates a universal vision framework named ClusterFormer. This framework aims to address diverse visual tasks with varying degrees of clustering granularity. By leveraging insights from clustering, we customize the cross-attention mechanism for recursive clustering and introduce a novel method for feature dispatching. Empirical findings provide substantial evidence for the effectiveness of this systematic approach. Based on its efficacy, we argue that the proposed universal solution can have a substantial impact on a wider range of visual tasks when viewed through the lens of clustering. This question remains open for our future endeavors.
Acknowledgement. This research was supported by the National Science Foundation under Grant
No. 2242243.
References
[1] Sameer Agarwal, Jongwoo Lim, Lihi Zelnik-Manor, Pietro Perona, David Kriegman, and
Serge Belongie. Beyond pairwise clustering. In CVPR, 2005.
[2] Merav Ahissar and Shaul Hochstein. The reverse hierarchy theory of visual perceptual learning.
Trends in cognitive sciences, 8(10):457–464, 2004.
[3] Valerie Ahl and Timothy FH Allen. Hierarchy theory: a vision, vocabulary, and epistemology.
Columbia University Press, 1996.
[4] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von
Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the
opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[5] Gijs Joost Brouwer and David J Heeger. Categorical clustering of the neural representation of
color. Journal of Neuroscience, 33(39):15454–15465, 2013.
[6] Xiao Cai, Feiping Nie, Heng Huang, and Farhad Kamangar. Heterogeneous image feature
integration via multi-modal spectral clustering. In CVPR, 2011.
[7] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection.
In CVPR, 2018.
[8] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: high quality object detection and instance
segmentation. IEEE TPAMI, 43(5):1483–1498, 2019.
[9] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and
Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[10] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for
unsupervised learning of visual features. In ECCV, 2018.
[11] M Emre Celebi. Partitional clustering algorithms. Springer, 2014.
[12] Antoni B Chan and Nuno Vasconcelos. Modeling, clustering, and segmenting video with
mixtures of dynamic textures. IEEE TPAMI, 30(5):909–926, 2008.
[13] Hao Chen, Kunyang Sun, Zhi Tian, Chunhua Shen, Yongming Huang, and Youliang Yan.
Blendmask: Top-down meets bottom-up for instance segmentation. In CVPR, 2020.
[14] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng,
Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation.
In CVPR, 2019.
[15] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam.
Encoder-decoder with atrous separable convolution for semantic image segmentation. In
ECCV, 2018.
[16] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam,
and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up
panoptic segmentation. In CVPR, 2020.
[17] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar.
Masked-attention mask transformer for universal image segmentation. In CVPR, 2022.
[18] Tianheng Cheng, Xinggang Wang, Shaoyu Chen, Wenqiang Zhang, Qian Zhang, Chang
Huang, Zhaoxiang Zhang, and Wenyu Liu. Sparse instance activation for real-time instance
segmentation. In CVPR, 2022.
[19] Yiming Cui. Feature aggregated queries for transformer-based video object detectors. In
CVPR, 2023.
[20] Yiming Cui, Liqi Yan, Zhiwen Cao, and Dongfang Liu. Tf-blender: Temporal feature blender
for video object detection. In ICCV, 2021.
[21] Yiming Cui, Linjie Yang, and Haichao Yu. Learning dynamic query combinations for
transformer-based object detection and segmentation. ICML, 2023.
[22] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin
Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al.
Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023.
[23] Bin Dong, Fangao Zeng, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Solq: Segmenting
objects by learning queries. In NeurIPS, 2021.
[24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR,
2021.
[25] Jiemin Fang, Lingxi Xie, Xinggang Wang, Xiaopeng Zhang, Wenyu Liu, and Qi Tian. Msg-
transformer: Exchanging local spatial information by manipulating messenger tokens. In
CVPR, 2022.
[26] Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, and
Wenyu Liu. Instances as queries. In ICCV, 2021.
[27] David J Field, Anthony Hayes, and Robert F Hess. Contour integration by the human visual
system: evidence for a local “association field”. Vision research, 33(2):173–193, 1993.
[28] Hichem Frigui and Raghu Krishnapuram. A robust competitive clustering algorithm with
applications in computer vision. IEEE TPAMI, 21(5):450–465, 1999.
[29] Yoram Gdalyahu, Daphna Weinshall, and Michael Werman. Self-organization in vision:
stochastic clustering for image segmentation, perceptual grouping, and image database organi-
zation. IEEE TPAMI, 23(10):1053–1074, 2001.
[30] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han,
Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented
transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
[31] Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. Vision gnn: An image is
worth graph of nodes. In NeurIPS, 2022.
[32] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context
network for semantic segmentation. In CVPR, 2019.
[33] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In CVPR, 2016.
[35] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu.
Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[36] Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.
[37] Varun Jampani, Deqing Sun, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Superpixel
sampling networks. In ECCV, 2018.
[38] Stephen C Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
[39] Jean-Michel Jolion, Peter Meer, and Samira Bataouche. Robust clustering with applications in
computer vision. IEEE TPAMI, 13(8):791–802, 1991.
[40] Bela Julesz. A brief outline of the texton theory of human vision. Trends in Neurosciences,
7(2):41–45, 1984.
[41] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid
networks. In CVPR, 2019.
[42] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic
segmentation. In CVPR, 2019.
[43] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation
as rendering. In CVPR, 2020.
[44] Feng Li, Hao Zhang, Shilong Liu, Lei Zhang, Lionel M Ni, Heung-Yeung Shum, et al. Mask
dino: Towards a unified transformer-based framework for object detection and segmentation.
CVPR, 2023.
[45] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia.
Fully convolutional networks for panoptic segmentation. In CVPR, 2021.
[46] Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Ping
Luo, and Tong Lu. Panoptic segformer: Delving deeper into panoptic segmentation with
transformers. In CVPR, 2022.
[47] James Liang, Tianfei Zhou, and Dongfang Liu. Clustseg: Clustering for universal segmentation.
In ICML, 2023.
[48] Weicong Liang, Yuhui Yuan, Henghui Ding, Xiao Luo, Weihong Lin, Ding Jia, Zheng Zhang,
Chao Zhang, and Han Hu. Expediting large-scale vision transformer for dense prediction
without fine-tuning. NeurIPS, 2022.
[49] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV,
2014.
[50] Dongfang Liu, Yiming Cui, Wenbo Tan, and Yingjie Chen. Sg-net: Spatial granularity network
for one-stage video instance segmentation. In CVPR, 2021.
[51] Dongfang Liu, Yiming Cui, Liqi Yan, Christos Mousas, Baijian Yang, and Yingjie Chen.
Densernet: Weakly supervised visual localization using multi-scale feature aggregation. In
AAAI, 2021.
[52] Dongfang Liu, James Liang, Tony Geng, Alexander Loui, and Tianfei Zhou. Tripartite feature
enhanced pyramid network for dense prediction. IEEE TIP, 2023.
[53] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining
Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV,
2021.
[54] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. In CVPR, 2015.
[55] Le Lu and René Vidal. Combined central and subspace clustering for computer vision
applications. In ICML, 2006.
[56] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and Junjie Yan. Grid r-cnn. In CVPR, 2019.
[57] Yawen Lu, Qifan Wang, Siqi Ma, Tong Geng, Yingjie Victor Chen, Huaijin Chen, and
Dongfang Liu. Transflow: Transformer as flow learner. CVPR, 2023.
[58] Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, and Yun Fu. Image as set of
points. In ICLR, 2023.
[59] Celeste McCollough. Color adaptation of edge-detectors in the human visual system. Science,
149(3688):1115–1116, 1965.
[60] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and
Jingdong Wang. Conditional detr for fast training convergence. In ICCV, 2021.
[61] Marius Muja and David G Lowe. Scalable nearest neighbor algorithms for high dimensional
data. IEEE TPAMI, 36(11):2227–2240, 2014.
[62] Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: an overview.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86–97, 2012.
[63] Satyasai Jagannath Nanda and Ganapati Panda. A survey on nature inspired metaheuristic
algorithms for partitional clustering. Swarm and Evolutionary computation, 16:1–18, 2014.
[64] Frank Nielsen and Frank Nielsen. Hierarchical clustering. Introduction to HPC with MPI for
Data Science, pages 195–211, 2016.
[65] Haluk Öğmen, Thomas U Otto, and Michael H Herzog. Perceptual grouping induces non-
retinotopic feature attribution in human vision. Vision Research, 46(19):3234–3242, 2006.
[66] C Alejandro Parraga, Tom Troscianko, and David J Tolhurst. The human visual system is
optimised for processing the spatial information in natural visual images. Current Biology,
10(1):35–38, 2000.
[67] Yury Petrov, Matteo Carandini, and Suzanne McKee. Two distinct mechanisms of suppression
in human vision. Journal of Neuroscience, 25(38):8704–8707, 2005.
[68] Yulei Qin, Juan Wen, Hao Zheng, Xiaolin Huang, Jie Yang, Ning Song, Yue-Min Zhu, Lingqian
Wu, and Guang-Zhong Yang. Varifocal-net: A chromosome classification approach using deep
convolutional networks. IEEE transactions on medical imaging, 38(11):2569–2581, 2019.
[69] Chandan K Reddy and Bhanukiran Vinzamuri. A survey of partitional and hierarchical
clustering algorithms. In Data clustering, pages 87–110. Chapman and Hall/CRC, 2018.
[70] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time
object detection with region proposal networks. In NeurIPS, 2015.
[71] Byungseok Roh, JaeWoong Shin, Wuhyun Shin, and Saehoon Kim. Sparse detr: Efficient
end-to-end object detection with learnable sparsity. ICLR, 2022.
[72] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei
Li. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[73] Dov Sagi. Perceptual learning in vision research. Vision research, 51(13):1552–1566, 2011.
[74] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer
for semantic segmentation. In ICCV, 2021.
[75] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka,
Lei Li, Zehuan Yuan, Changhu Wang, et al. Sparse r-cnn: End-to-end object detection with
learnable proposals. In CVPR, 2021.
[76] Teppei Suzuki. Clustering as attention: Unified image segmentation with hierarchical cluster-
ing. arXiv preprint arXiv:2205.09949, 2022.
[77] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object
detection. In CVPR, 2020.
[78] Yuliya Tarabalka, Jón Atli Benediktsson, and Jocelyn Chanussot. Spectral–spatial classification
of hyperspectral imagery based on partitional clustering techniques. IEEE transactions on
geoscience and remote sensing, 47(8):2973–2987, 2009.
[79] Simon Thorpe, Denis Fize, and Catherine Marlot. Speed of processing in the human visual
system. Nature, 381(6582):520–522, 1996.
[80] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and
Herve Jegou. Training data-efficient image transformers and distillation through attention. In
ICML, 2021.
[81] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[82] Mai-Anh T Vu, Tülay Adalı, Demba Ba, György Buzsáki, David Carlson, Katherine Heller,
Conor Liston, Cynthia Rudin, Vikaas S Sohal, Alik S Widge, et al. A shared vision for machine
learning in neuroscience. Journal of Neuroscience, 38(7):1601–1607, 2018.
[83] George Wald. Human vision and the spectrum. Science, 101(2635):653–658, 1945.
[84] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab:
End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
[85] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Max-deeplab:
End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
[86] Wenguan Wang, Cheng Han, Tianfei Zhou, and Dongfang Liu. Visual recognition with deep
nearest centroids. ICLR, 2023.
[87] Wenguan Wang, James Liang, and Dongfang Liu. Learning equivariant segmentation with
instance-unique querying. NeurIPS, 2022.
[88] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping
Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction
without convolutions. In ICCV, 2021.
[89] Hugh R Wilson. Computational evidence for a rivalry hierarchy in vision. Proceedings of the
National Academy of Sciences, 100(24):14499–14503, 2003.
[90] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo.
Segformer: Simple and efficient design for semantic segmentation with transformers. NeurIPS,
2021.
[91] Xuanli Lisa Xie and Gerardo Beni. A validity measure for fuzzy clustering. IEEE TPAMI,
13(08):841–847, 1991.
[92] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel
Urtasun. Upsnet: A unified panoptic segmentation network. In CVPR, 2019.
[93] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and
Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In CVPR,
2022.
[94] Qihang Yu, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig
Adam, Alan Yuille, and Liang-Chieh Chen. Cmt-deeplab: Clustering mask transformers for
panoptic segmentation. In CVPR, 2022.
[95] Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan
Yuille, and Liang-Chieh Chen. k-means mask transformer. ECCV, 2022.
[96] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-
Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object
detection. arXiv preprint arXiv:2203.03605, 2022.
[97] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-net: Towards unified
image segmentation. NeurIPS, 2021.
[98] Ying Zhao, George Karypis, and Usama Fayyad. Hierarchical clustering algorithms for
document datasets. Data mining and knowledge discovery, 10:141–168, 2005.
[99] Minghang Zheng, Peng Gao, Renrui Zhang, Kunchang Li, Xiaogang Wang, Hongsheng Li,
and Hao Dong. End-to-end object detection with adaptive clustering transformer. BMVC,
2021.
[100] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei
Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a
sequence-to-sequence perspective with transformers. In CVPR, 2021.
[101] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba.
Scene parsing through ade20k dataset. In CVPR, 2017.
[102] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr:
Deformable transformers for end-to-end object detection. In ICLR, 2021.